ETL is dying

Tom Breur

19 February 2017

For my entire career, I have felt that about every five years I needed to completely overhaul my skillset to keep up with all the changes in technology and practices. Having seen these changes unfold for a few decades, I have no reason to believe that pace of change will drop any time soon. One of those changes I see happening right in front of me is the demise of ETL as we have known it for so long. I have been saying for a while, and honestly believe, that ETL as we know it today is something of the past. Yes, I know it is still omnipresent, but I predict it will become extinct, or nearly so, fairly soon.

In the “early” days of data warehousing, there was a boom in ETL. And rightfully so. Since so much of the effort in DWH projects goes into data integration, it was only natural that it became an area of focus and concern. Along with it, ETL tools grew in importance. Twenty years ago, we would point to the risk of “hand coding” ETL, because it is error prone, time consuming to develop, and difficult (costly) to maintain. Hooray for ETL tools, boos for SQL scripts and stored procedures. The maintenance burden of hand-coded DWH solutions showed we had the truth on our side.

Given these origins, I fully understand when and why people frown when I say “ETL is something from the past!” We still need to integrate data as much as we ever did, and debugging and maintaining that code is as important as ever, if not more so (see this post of mine on the happy marriage between Big Data and ‘traditional’ data warehousing). Then why this statement? Why do I see ETL dying?

First and foremost, ETL packages (like SSIS, but the same holds for all other vendors I am familiar with, like Informatica, DataStage, BODS, AbInitio, etc.) are a poor and opaque abstraction of what this part of the architecture actually does. You have a source system with some data model, a target data model, and possibly business rules that are applied on the way in. The mapping of source to target is a mathematical function, a problem that was solved a while ago. As impressive as I find “smart” ETL jobs, it is still non-value-added work. Tools do this more efficiently, and the resulting code is a far superior abstraction of what that transformation (T) step actually does (see also this excellent blog post by Maxime Beauchemin that highlights some of these problems, and some other recent innovations). Coding ETL pipelines in a programming language is becoming the norm for Big Data systems, and the lessons learned there apply to smaller datasets, too. Native code integrates smoothly with contemporary code management tools. The code can be generated and automated with off-the-shelf tools, but also with your own customized automation tools. Using native code also lets you integrate many languages and technologies in a more natural manner.
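
To make this concrete, here is a minimal sketch of the transformation (T) step written as plain code. It is illustrative only: the source fields, the dim_customer-style target columns, and the business rule are hypothetical, and the extract and load callables stand in for whatever connectivity your platform provides.

```python
# Minimal sketch: the T step as an ordinary, version-controlled function.
# All field and table names here are hypothetical, for illustration only.
from datetime import date


def transform_customer(source_row: dict) -> dict:
    """Map one source-system customer record onto the target data model,
    applying business rules on the way in."""
    return {
        "customer_key": source_row["cust_id"],
        "full_name": "{} {}".format(
            source_row.get("first_name", ""), source_row.get("last_name", "")
        ).strip(),
        # Hypothetical business rule: missing signup dates default to today.
        "signup_date": source_row.get("signup_date") or date.today().isoformat(),
        "is_active": source_row.get("status", "").upper() == "ACTIVE",
    }


def run_pipeline(extract, load):
    """Wire extract and load callables around the transform; the E and L stay
    pluggable, while the T is just code you can diff, review, and test."""
    load([transform_customer(row) for row in extract()])
```

Because the transform is an ordinary function, it lives in source control next to everything else, and generating many such mappings from a modelled source-to-target specification becomes straightforward.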

Secondly, by focusing on the source and target data models, you elevate the thinking to the level where it needs to be. That makes the documentation of your production code superior for communication purposes. Also, if you believe in Test-Driven ETL development, using code allows you to use standard testing patterns and tools. It opens the door to the new world of living documentation and Specification by Example. But probably at least as important, it frees up the mind to think of alternative ways you might architect such a system, depending on the constraints and requirements. How is the solution expected to grow? How easily should it support parallel processing (scalability)? GUI-based ETL tools do a mediocre job here, quite apart from the fact that they consume too many scarce human resources.
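
If you do practice Test-Driven ETL development, the same transform can be exercised with standard testing tools. A possible pytest-style example is sketched below; the module name and the expected values simply restate the hypothetical business rules from the earlier sketch.

```python
# Possible pytest-style test for the transform sketched earlier.
# "customer_pipeline" is a hypothetical module name, for illustration only.
from customer_pipeline import transform_customer


def test_active_customer_maps_to_target_model():
    source_row = {
        "cust_id": 42,
        "first_name": "Jane",
        "last_name": "Doe",
        "signup_date": "2016-11-01",
        "status": "active",
    }

    target_row = transform_customer(source_row)

    assert target_row["customer_key"] == 42
    assert target_row["full_name"] == "Jane Doe"
    assert target_row["signup_date"] == "2016-11-01"
    assert target_row["is_active"] is True
```

Tests like this double as living documentation: each one is a worked example of how a source record is expected to land in the target model.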

As I have said before, I foresee that data modelers will become the ETL developers of the future. Data Warehouse Automation is relatively new, but there are already so many options available that for many settings and technologies at least a few reasonable choices can be considered. I myself have observed a 5-10x improvement in productivity using such tools, and the code they generate is both more readable and more maintainable. ETL is dying…


11 comments

  1. Interesting post – I agree with a lot of what you are saying, particularly with respect to the power of modelling source and target and using this to drive integration. However, one quibble: shouldn’t the title be “ETL tools are dying” rather than ETL (in the generic sense)? I know people get very caught up on ETL vs ELT, but the important thing is that each of the E, L and T appear in both.


  2. My first job, which feels like almost a hundred years ago, was [to cooperate in] the construction of a centralized customer database, which had to become the central master for customer data and replace the “CCF” (“Central Customer File”), which held the same information but was a slave to other stores.

    How was that? At the time, the bank had product-oriented applications – say, savings accounts, current accounts, mortgages, … – and each of those applications had its own private portion of customer data. Silos of customer data, thus. (Sounds like anything you know?) The silos deprived the more commercially oriented activities/departments in the bank of any such thing as a “customer-centric view”, e.g. for purposes of cross-selling. (Sounds like anything you know?) So it was decided that a “Central File” had to be built. It would be built *in addition to* the existing customer silos, which could thus continue to operate as before. (Sounds like anything you know?) It would be populated by applying the transactions that kept the silos up-to-date to the central file as well (modulo certain manipulations, of course). (Sounds like anything you know?) There were certain procedures that helped decide whether this John Doe having a savings account was the same person as this Jon Dowe that had a mortgage. (Sounds like any cleansing you know?) I could go on for a while, but you get the picture. The data quality issues it gave rise to were so massive that the decision to replace this CCF system was taken after it had been in production for a mere two years.

    ETL is not a solution, it is the problem itself. And the solution is to simply dispense with both the ‘E’ and the ‘L’. Which is exactly what we did at the time.


  3. Enter the realm of game-changing tools like Capsenta’s Ultrawrap. It eliminates ETL and makes disparate relational databases interoperable in real time. Six years in development out of the UT Austin Computer Science Department, it is quickly gaining widespread adoption. NoETL is what results! Auto schema matching. Semantics-powered.


  4. Not true – if you are integrating multiple systems, you will likely always need some logic to weave them together (many call this ETL regardless of whether you are physically extracting > transforming > loading or using a traditional ETL application). You shouldn’t get too hung up on terms and details. Yes, the technology is changing (it has been doing so for years), but the concept is not. I don’t care if you are using Informatica, shell scripts, Python, SQL, etc. – they can all be flavors of ETL.

    Your title makes a grand statement, but your justification gets lost in terminology.


  5. Juan Carlos, thanks for your thoughts. Please see the W3C standards for integration without ETL – and yes, folks are getting hung up on terms and details because “it is all about Semantics”. Integrating with the tools you list is ETL, since there is movement of data – or an intermediate “engine” which needs continual management as the data changes. There is something better now, and it is not ETL. See: http://smartdata2015.dataversity.net/sessionPop.cfm?confid=91&proposalid=7801


  6. Hi,
    What I think is that ETL is expanding its wings widely across newer technologies, in line with today’s enterprise push to be faster to value, to staff, to integrate, to trust, to innovate, and to deploy.
    But this is a great article, thanks for sharing your thoughts on ETL.


    • Thanks, Ella! Together with a colleague I’m working on a follow-up paper, and I couldn’t agree more: in the age of “fast data” (much lower latencies, streaming), what ETL looks like is evolving. Even to the point where people don’t recognize it or call it as such, which in itself is fine – as long as the meta structures are captured at the appropriate level of abstraction, so that learning can drive up the quality of the application by generating ever more business value.

