We recently started using Kettle for ETL on our project. Pentaho Kettle provides a UI-based tool. It takes quite some time to get used to the Kettle UI at first, because it is difficult to visualize how to orchestrate the available Kettle steps to solve a business problem. Once you know how to use it, it is all about dragging and dropping a step and configuring it through the UI. In our experience, about 90% of the work is easy to design, but the remaining 10% takes a lot of research and, in the end, involved some hacks that we never liked.
As we created Kettle transformations and jobs, we were not sure how testable they would be. After some research we found that we could use the BlackBoxTests class available in the Kettle distribution for testing. The fundamentals are quite simple: you pass in some inputs and define an expected file, and executing the Kettle transformation produces an actual output file. BlackBoxTests asserts that the expected file matches the actual file. So, for instance, if you have Sample.ktr under test, BlackBoxTests expects Sample.expected.<txt/xml/csv> as the expected file and Sample.actual.<txt/xml/csv> as the actual file. It tests all transformations under a folder and its subfolders.
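To make the naming convention concrete, here is a minimal sketch in plain Java of the expected-versus-actual comparison described above. It is not the BlackBoxTests source; the class and method names (ExpectedFileCheck, sibling, findTransformations, outputsMatch) are made up for illustration, and it assumes the transformation has already been run so the actual file exists.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Kettle: mirrors the Sample.ktr ->
// Sample.expected.<ext> / Sample.actual.<ext> convention described in the post.
class ExpectedFileCheck {

    // Builds e.g. Sample.expected.txt next to Sample.ktr (kind = "expected" or "actual").
    static Path sibling(Path ktr, String kind, String ext) {
        String name = ktr.getFileName().toString();
        String base = name.substring(0, name.length() - ".ktr".length());
        return ktr.resolveSibling(base + "." + kind + "." + ext);
    }

    // Finds every .ktr file under root, including subfolders.
    static List<Path> findTransformations(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".ktr"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // True when the expected and actual files contain the same lines.
    static boolean outputsMatch(Path ktr, String ext) throws IOException {
        List<String> expected = Files.readAllLines(sibling(ktr, "expected", ext));
        List<String> actual = Files.readAllLines(sibling(ktr, "actual", ext));
        return expected.equals(actual);
    }
}
```

A test runner would call findTransformations on the test folder, execute each transformation, and then assert outputsMatch for it.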
By default Kettle uses kettle.properties (found under the $HOME/.kettle folder), which complicates testing, because you should be able to test a Kettle transformation in isolation. So instead of using kettle.properties, we planned to pass an application-specific property file to the TransMeta class through its injectVariables() method. We were partly successful, but later found that Kettle still uses kettle.properties even when we supply a different property file.
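As a rough sketch of that approach, the plain-Java helper below reads an application-specific property file into a map of variable names to values. The class name TestVariables and the file name used in the comment are hypothetical, and the commented-out call assumes that injectVariables() accepts a Map<String, String>; check that against the Kettle version you are using.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper, not part of Kettle: loads test-only variables from a
// property file that lives with the tests instead of $HOME/.kettle.
class TestVariables {

    static Map<String, String> load(Path file) throws IOException {
        Properties props = new Properties();
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        // Copy into a sorted map so the variables are easy to inspect in test logs.
        Map<String, String> vars = new TreeMap<>();
        for (String name : props.stringPropertyNames()) {
            vars.put(name, props.getProperty(name));
        }
        return vars;
    }
}

// Intended use (assumption: transMeta is a loaded TransMeta and
// injectVariables takes a Map<String, String>):
//   Map<String, String> vars = TestVariables.load(Paths.get("test.properties"));
//   transMeta.injectVariables(vars);
```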
After a lot of debugging we found the culprit. BlackBoxTests calls EnvUtil.environmentInit(), which does all the magic: it loads kettle.properties by default and, to our horror, loads the values into java.lang.System.
We quickly stopped using EnvUtil, but then found that passing the properties in from outside is not enough either. It works for the current transformation, but somehow Kettle does not pass these properties on to embedded sub-transformations. It had worked earlier only because EnvUtil.environmentInit() loads the properties into java.lang.System.
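If a test does end up relying on system properties to reach sub-transformations, one way to keep that side effect from leaking into other tests is to snapshot java.lang.System properties before the run and restore them afterwards. The sketch below is plain Java and is not what BlackBoxTests does; SystemPropertiesGuard and runIsolated are hypothetical names.

```java
import java.util.Properties;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Kettle: restores the JVM-wide system
// properties after a test body has run, whatever the body changed.
class SystemPropertiesGuard {

    static <T> T runIsolated(Callable<T> body) throws Exception {
        // Copy the current properties; System.getProperties() returns the live object.
        Properties snapshot = new Properties();
        snapshot.putAll(System.getProperties());
        try {
            return body.call();
        } finally {
            // Discard anything the body set or removed.
            System.setProperties(snapshot);
        }
    }
}
```

System properties are global to the JVM, so this only isolates tests that run one after another, not tests running in parallel threads.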
Overall, although we were finally able to test in isolation with BlackBoxTests using some hacks, we concluded that the Kettle code is not designed to be testable and can be termed legacy code in Michael Feathers's language.
It’s always good fun to hear about hard-core programmers that try to solve business intelligence issues.
If you don’t want to load the information in the kettle.properties file, here’s some advice: don’t put anything in there! The Kettle variable or named parameter system does indeed NOT put anything in java.lang.System.
Not testable? You've got to be kidding me.
Just because you have problems grasping a few basic concepts, that doesn’t mean you have the right to call Kettle “legacy code” or throw around other insults. Try to find another way to vent your frustrations.
Great post. We are currently looking into the same issue — the ability to test Kettle transforms in isolation — and this is a better explanation than I’ve seen anywhere else (including the Pentaho wiki).
Matt – your solution of not using kettle.properties is a good one, and I’d agree that Kettle is far more testable than most other ETL tools… no need to get so defensive, though. The poster pointed out some legitimate issues, and this is good feedback for your community.
@Matt – We already used a different properties file and passed it in through TransMeta.injectVariables(). It works fine until there is a sub-transformation underneath. Somehow the transformation is not able to pass the properties to the sub-transformation. If you use EnvUtil.environmentInit(), it overrides the properties passed in with the ones in kettle.properties (it should have been the other way round).
I may have been a bit harsh in calling the Kettle code “legacy code”, but I came across source code of 6000+ lines that is hard to understand and certainly not designed for testability.
While working with Kettle I ran into the following roadblocks. Solutions may exist for them, but I could not find any in the available resources:
1. Manual restart after failure (the ability to restart from where it failed)
2. Transaction over multiple insert steps
3. Automatic retry (for instance HTTP service or web service) and recovery for items that have exhausted their retry count.
4. Integration testing with database independence (we are using an actual database instead of an in-memory one right now)
5. Web services portability. For certain standards it doesn’t work. It’s difficult to ask a web-service vendor to change the web-service itself.
Many times we hit a wall and had to find a workaround. The integration testing (under Continuous Integration) example I mentioned in the blog post is one of them.