19 February 2017
For my entire career, I have felt that about every five years I needed to completely overhaul my skillset to keep up with all the changes in technology and practices. Seeing these changes unfold for a few decades, I have no reason to believe that pace of change will drop any time, soon. One of those I see happening right in front of me is the demise of ETL as we have known it for so long. I have been saying for a while, and honestly believe that ETL –as we know it today– is something from the past. Yes, I know it is still omnipresent, but I predict this will become extinct, or nearly extinct fairly soon.
In the “early” days of data warehousing, there was a boom in ETL. And rightfully so. Since so much of the effort in DWH projects goes into data integration, it was only natural that became an area of focus and concern. Along with it, ETL tools grew in importance. Twenty years ago, we would point to the risk of “hand coding” ETL, because it is error prone, time consuming to develop, and difficult (costly) to maintain. Hooray for ETL tools, boos for SQL scripts and stored procedures. DWH maintenance showed we had the truth on our side.
Given these origins, I fully understand when and why people frown when I say “ETL is something from the past!” Because we still need to integrate data as much as we ever did, and debugging and maintaining that code is as important as ever. If not more so (see this post of mine on the happy marriage between Big Data and ‘traditional’ data warehousing). Then why this statement? Why do I see ETL dying?
First, and foremost, ETL packages (like SSIS, but the same holds for all other vendors I am familiar with, like Informatica, DataStage, BODS, or AbInitio, etc.) are a poor and intransparent abstraction of what this part of the architecture actually does. You have a source system, with some data model, and a target data model, and possibly business rules that are being applied on the way in. The mapping of source to target is a mathematical function, one that has been solved a while ago. As impressive as I find “smart” ETL jobs, it’s still non-value added work. Tools do this more efficiently, and the resulting code is a far superior abstraction of what that transformation (T) step actually does (see also this excellent blog post by Maxime Beauchemin that highlights some of these problems, and some other recent innovations). Coding ETL pipelines with a programming language is becoming the norm for Big Data systems. The lessons learned there apply to smaller datasets, too. Native code integrates smoothly with contemporary code management tools. The code can be generated and automated with off the shelf tools, but also with your own customized automation tools. Using native code allows you to easily integrate many languages and technologies in a more natural manner.
Secondly, by focusing on the source and target data models, you elevate the thinking to the level where it needs to be. That makes the documentation of your production code superior for communication purposes. Also, if you believe in Test-Driven ETL development, using code allows you to use standard testing patterns and tools. It opens the door to the new world of living documentation and Specification by Example. But probably at least as important, it frees up the mind to think of alternative ways you might architect such a system, depending on the constraints and requirements. How is the solution expected to grow, how ‘easy’ should you support parallel processing (scalability), etc. The GUI based ETL tools do a mediocre job here, quite apart from the fact that they consume too many scarce human resources.
As I have said before, I foresee that data modelers will become the ETL developers of the future. Data Warehouse Automation is relatively new, but there are already so many options available, that for many settings and technologies at least a few reasonable choices can be considered. I myself have observed a 5-10 improvement in productivity using such tools, and the code they generate is both more readable, as well as more maintainable. ETL is dying…