Pentaho Kettle and Integration Testing

Recently our project started using Kettle for ETL purposes. Pentaho Kettle provides a UI-based tool. Initially it takes quite some time to get used to the Kettle UI, as it is difficult to visualize how to orchestrate the available Kettle steps to solve a business problem. Once you know how to use it, it's all about dragging and dropping a step and configuring it through the available UI. In our experience, it's pretty easy to design 90% of a solution, but the remaining 10% involves a lot of research and, in the end, some hacks which we never liked.

As we created Kettle transformations and jobs, we were not very sure about their testability. After some research we found that we can use the BlackBoxTests class available in the Kettle distribution for test purposes. Its fundamentals are quite simple: you pass some inputs, define the expected output file, and after executing the Kettle transformation you get the actual output file. BlackBoxTests asserts that the expected file matches the actual file. So, for instance, if you have a Sample.ktr under test, BlackBoxTests expects Sample.expected.<txt/xml/csv> as the expected file and Sample.actual.<txt/xml/csv> as the actual file. It tests all transformations under a folder and its subfolders.
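The check described above can be sketched as follows. This is not Kettle's actual source; it is a minimal stdlib illustration of the expected-vs-actual file comparison BlackBoxTests performs, and the class and method names are my own:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Minimal sketch (not the Kettle source) of the check BlackBoxTests performs:
// after running Sample.ktr, Sample.actual.<ext> is compared line by line
// against Sample.expected.<ext>.
public class BlackBoxStyleCheck {

    // Returns true when both outputs have identical lines.
    static boolean sameLines(List<String> expected, List<String> actual) {
        return expected.equals(actual);
    }

    static boolean compareFiles(Path expectedFile, Path actualFile) throws IOException {
        return sameLines(Files.readAllLines(expectedFile), Files.readAllLines(actualFile));
    }

    public static void main(String[] args) throws IOException {
        // e.g. java BlackBoxStyleCheck Sample.expected.csv Sample.actual.csv
        boolean ok = compareFiles(Path.of(args[0]), Path.of(args[1]));
        System.out.println(ok ? "MATCH" : "MISMATCH");
    }
}
```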

By default Kettle uses kettle.properties (available under the $HOME/.kettle folder), which complicates things from a testing point of view, since you should be able to test a Kettle transformation in isolation. That's why, instead of using kettle.properties, we planned to use an application-specific properties file and pass it to the TransMeta class via the available injectVariables() method. We were partly successful, but later found out that Kettle still uses kettle.properties even when we use a different properties file.
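A sketch of that approach, assuming the Kettle API of that era: load the application-specific properties file and hand the resulting String map to injectVariables(). The Kettle-specific calls are shown only as comments so the sketch compiles without the Kettle jars; the file name and wiring are illustrative, not the project's actual code:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Sketch of replacing kettle.properties with an application-specific
// properties file. Kettle calls are commented out so this stays
// self-contained; only the Properties-to-Map conversion is stdlib.
public class IsolatedTransformationTest {

    // Converts a java.util.Properties into the String map that a
    // variable-injection method like injectVariables() expects.
    static Map<String, String> toVariableMap(Properties props) {
        Map<String, String> vars = new HashMap<>();
        for (String name : props.stringPropertyNames()) {
            vars.put(name, props.getProperty(name));
        }
        return vars;
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // "application-test.properties" is a hypothetical file name.
        try (FileInputStream in = new FileInputStream("application-test.properties")) {
            props.load(in);
        }
        Map<String, String> vars = toVariableMap(props);

        // With the Kettle jars on the classpath, the wiring looked like:
        // TransMeta transMeta = new TransMeta("Sample.ktr");
        // transMeta.injectVariables(vars);
        // Trans trans = new Trans(transMeta);
        // trans.execute(null);
        // trans.waitUntilFinished();
        System.out.println("variables prepared: " + vars.size());
    }
}
```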

After a lot of debugging we found the culprit. BlackBoxTests uses EnvUtil.environmentInit(), which does all the magic: it loads kettle.properties by default and, to our horror, loads it into java.lang.System.

We quickly stopped using EnvUtil, but then found that passing the properties from outside is still not enough. It works for the current transformation, but somehow Kettle does not pass these properties on to embedded sub-transformations. It had worked earlier only because EnvUtil.environmentInit() loads the properties into java.lang.System.
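What EnvUtil.environmentInit() effectively did, and what a workaround ends up replicating, is pushing the properties into java.lang.System, where every sub-transformation can see them. A minimal stdlib sketch of that mechanism (the variable name below is made up):

```java
import java.util.Properties;

// Sketch of why sub-transformations only saw properties loaded into
// java.lang.System: injected variables stay local to one transformation,
// while System properties are JVM-global and visible everywhere.
public class SystemPropertyWorkaround {

    // Copies application properties into System properties -- effectively
    // what EnvUtil.environmentInit() does with kettle.properties.
    static void exportToSystem(Properties props) {
        for (String name : props.stringPropertyNames()) {
            System.setProperty(name, props.getProperty(name));
        }
    }

    public static void main(String[] args) {
        Properties appProps = new Properties();
        appProps.setProperty("STAGING_DIR", "/tmp/etl"); // made-up variable

        // Before the export, code reading System sees nothing...
        System.out.println("before: " + System.getProperty("STAGING_DIR"));
        exportToSystem(appProps);
        // ...and afterwards any component in the JVM sees the value.
        System.out.println("after: " + System.getProperty("STAGING_DIR"));
    }
}
```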

Overall, though we were finally able to do the testing with BlackBoxTests in isolation with some hacks, we concluded that the Kettle code is not designed to be testable and can be termed legacy code in Michael Feathers' sense.

Comments (8)

  1. Matt Casters - Reply

    September 30, 2009 at 12:16 pm

    It's always good fun to hear about hard-core programmers that try to solve business intelligence issues.

    If you don't want to load the information in the kettle.properties file, here's some advice: don't put anything in there! The Kettle variable or named parameter system does indeed NOT put anything in java.lang.System.

    Not testable? You've got to be kidding me.

    Just because you have problems grasping a few basic concepts, that doesn't mean you have the right to call Kettle "legacy code" or throw around other insults. Try to find another way to vent your frustrations.

  2. Brad - Reply

    September 30, 2009 at 3:58 pm

    Great post. We are currently looking into the same issue -- the ability to test Kettle transforms in isolation -- and this is a better explanation than I've seen anywhere else (including the Pentaho wiki).

    Matt - your solution of not using kettle.properties is a good one, and I'd agree that Kettle is far more testable than most other ETL tools... no need to get so defensive, though. The poster pointed out some legitimate issues, and this is good feedback for your community.

  3. Shrikant Vashishtha - Reply

    October 1, 2009 at 6:45 am

    @Matt - We had already used a different properties file and passed it in through TransMeta.injectVariables(). It works fine until there is a sub-transformation underneath; somehow the transformation is not able to pass the properties to the sub-transformation. If you use EnvUtil.environmentInit(), it overrides the properties passed in with the ones in kettle.properties (it should have been the other way round).

    I may have been a bit harsh in calling the Kettle code "legacy code", but I could see source files of 6000+ lines which are hard to understand and certainly not designed for testability.

    While working with Kettle I found the following roadblocks, for which solutions may exist but I could not find them in the available resources:
    1. Manual restart after failure (the ability to restart from where it failed)
    2. Transaction over multiple insert steps
    3. Automatic retry (for instance HTTP service or web service) and recovery for items that have exhausted their retry count.
    4. Integration testing with database independence (using actual db instead of in-memory db right now)
    5. Web services portability. For certain standards it doesn't work. It's difficult to ask a web-service vendor to change the web-service itself.

    Many times we hit a wall and had to find some workaround. The integration testing (using Continuous Integration) example I mentioned in the blog is one of them.

  4. Max Hofer - Reply

    January 15, 2011 at 7:51 pm

    ShriKant, have you made any progress in this direction?

    I'm also trying to figure out how to test transformations/jobs in an automated way.

  5. Shrikant Vashishtha - Reply

    January 17, 2011 at 9:20 am

    Hi Max,

    We felt that it would be much more difficult to do that and would require a lot of time and resources, as there was not much documentation available. We also didn't feel very confident that we were in control while implementing an enterprise-level solution with Kettle, because we kept seeing problems come up. We eventually stopped using Kettle and adopted a new strategy based on Spring Integration, and it worked for us.

  6. jk - Reply

    March 19, 2013 at 8:23 pm

    I ran into a similar problem, recently. Our application was based on Spring Framework and Hibernate, and I based the testing on Spring's JUnit integration tests. I put some notes here: http://devno.blogspot.com/2013/03/pentaho-kettle-etl-regression-with.html

  7. [...] an article from 2007 talking about a framework for PDI testing, but has no code.  Here’s a blog post with some comments about this [...]

  8. Dan Moore - Reply

    May 10, 2013 at 5:21 am

    I wasn't able to find the blackboxtests classes you mention. Do you know if those are still distributed with kettle?

    Also, I've written a series of blog posts on how to test kettle transformations -- basically, you build a parallel kettle job that exercises your logic and compares it to a golden set of values. (My post focuses on using file-based golden data, but it can easily be extended to database tables.) The most recent post is here: http://www.mooreds.com/wordpress/archives/1061

    It's a bit different than JK's solution in that you don't have to write java code (but you do have to write some etl code).

Add a Comment