Is ETL dead? It's a question that has come up a lot in recent years as organizations modernize their analytics infrastructure. Huge shifts are underway in the analytics landscape, and it isn't always clear where this change leaves ETL.
The short answer? No, ETL is not dead. But the ETL pipeline looks different today than it did a few decades ago. Organizations might not need to ditch ETL entirely, but they do need to closely evaluate its current role and understand how it could be better utilized to fit within a modern analytics landscape.
In this post, we’ll dig into the challenges of traditional ETL and how organizations are supplementing ETL tools and processes with data preparation technologies. Keep on reading to learn more, or download our full report on the “death” of ETL, “EOL for ETL? The Future of Data Wrangling in the Cloud.”
The Trouble with Traditional ETL in a Modern Organization
First, what exactly does ETL mean? ETL refers to three steps (extract, transform, load) used to integrate data from multiple sources into a centralized repository. Roughly 25 years ago, ETL tools were created to automate much of the tedious coding required to retrieve and cleanse data. At the time, ETL was designed to handle data that was generally well-structured, often originating from a variety of operational systems or databases the organization wanted to report against. Each ETL pipeline was built for a specific set of users. And the end result was successful: the productivity gains from ETL versus writing code by hand were undeniable.
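To make the three steps concrete, here is a minimal sketch of a traditional ETL pipeline in Python. Everything in it is an assumption made for illustration: the sample CSV, the `orders` table, the cleansing rules, and the in-memory SQLite database standing in for a data warehouse. It is not how any particular ETL tool works, only the general shape of extract, then transform, then load.

```python
import csv
import io
import sqlite3

# Hypothetical export from an operational system (illustrative data only).
RAW_ORDERS_CSV = """order_id,customer,amount
1, Alice ,10.50
2,BOB,
3,carol,7.25
"""


def extract(source: str) -> list:
    """Extract: read rows from the source system (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(source)))


def transform(rows: list) -> list:
    """Transform: cleanse rows to the warehouse spec before anything is loaded."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            # Rows that fail the spec are dropped and never reach the warehouse.
            continue
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().title(), float(amount))
        )
    return cleaned


def load(rows: list, conn: sqlite3.Connection) -> None:
    """Load: write only the cleansed rows into the target table."""
    conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)


def run_etl() -> list:
    """Run the three steps in order and return what ended up in the warehouse."""
    conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
    load(transform(extract(RAW_ORDERS_CSV)), conn)
    return conn.execute(
        "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
    ).fetchall()


if __name__ == "__main__":
    print(run_etl())
```

Note that the transformation rules are fixed before load time, so the order with the missing amount never reaches the target table. Anyone who later needs that row has to go back to the source system.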
Today, much of the architecture and data surrounding ETL has changed. Data lakes, rather than data warehouses, are now often the end target. The data itself has become much bigger and messier. And even the use cases, which were historically clearly defined, have grown experimental in nature. Perhaps the biggest difference is that instead of providing data for a few business groups, ETL pipelines are expected to serve a huge variety of users across an organization. Each of these users requires different data that has been cleansed and transformed differently. But there’s one commonality: they all want the data fast, and the number of use cases they’re working with is growing exponentially.
Traditional ETL pipelines have struggled to extend support for the self-service agility required by these emerging analytics use cases. ETL tools were built for IT users, not business users, which often leaves business users waiting in line to get data cleaned, passing specs back and forth until they’ve received their desired output. Meanwhile, IT teams, once considered the target end user for all data operations, are struggling to offload some of the cleansing and standardization tasks found in ETL that business users are begging to take on. Ironically, many organizations now consider ETL pipelines the bottleneck in their analytics efforts—much the same way they looked at code 25 years ago.
ETL vs ELT: Decoupling ETL
Traditional ETL might be considered a bottleneck, but that doesn’t mean it has no value. The same basic challenges that ETL tools and processes were designed to solve still exist, even if many of the surrounding factors have changed. For example, at a fundamental level, organizations still need to extract (E) data from legacy systems and load (L) it into their data lake. And they still need to transform (T) that data for use in analytics projects. “ETL” work still needs to get done. What can change is the order in which the steps are performed and the technologies that support the work.
Instead of an ETL pipeline, many organizations are taking an “ELT” approach, or decoupling data movement (extracting and loading) from data preparation (transforming). This ELT approach follows a larger IT trend. Whereas IT architecture was historically built in monolithic silos, many organizations are now decoupling those components so that they function independently. Decoupled technologies mean less work up front (stacks don’t need to be deployed with every possible use and outcome understood in advance) and more efficient maintenance. A clean separation between data movement and data preparation also comes with its own specific benefits:
• Less friction. The person or process loading the data isn’t responsible for transforming it to spec at load time. Postponing transformation until after data is loaded creates incentive for sourcing and sharing data. It also preserves the raw fidelity of the data.
• More control. Loading data into a shared repository enables IT to manage all of an organization’s data under a single API and authorization framework. At least at the granularity of files, there is a single point of control.
• More flexibility and transparency. Information can be lost as raw data is “boiled down” for a specific use case. By contrast, untransformed data can be reused for different purposes and leaves a record for auditing and compliance.
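The separation described above can be sketched in a few lines of Python. As before, this is only an illustration under stated assumptions: the sample CSV, the `raw_orders` table, the two "use case" queries, and the in-memory SQLite database standing in for a data lake or shared repository are all hypothetical. The point is the order of operations: the raw rows are loaded untouched, and each consumer applies its own transformation afterward.

```python
import csv
import io
import sqlite3

# Hypothetical export from an operational system (illustrative data only).
RAW_ORDERS_CSV = """order_id,customer,amount
1, Alice ,10.50
2,BOB,
3,carol,7.25
"""


def extract_and_load(source: str, conn: sqlite3.Connection) -> int:
    """Data movement: land every row untouched, as text, in a raw table."""
    rows = list(csv.DictReader(io.StringIO(source)))
    conn.execute("CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:order_id, :customer, :amount)", rows
    )
    return len(rows)


def revenue_by_customer(conn: sqlite3.Connection) -> list:
    """Data preparation for one use case: skip blank amounts, normalize names."""
    return conn.execute(
        """
        SELECT UPPER(TRIM(customer)) AS customer_name,
               SUM(CAST(amount AS REAL)) AS revenue
        FROM raw_orders
        WHERE TRIM(amount) <> ''
        GROUP BY customer_name
        ORDER BY customer_name
        """
    ).fetchall()


def rows_missing_amount(conn: sqlite3.Connection) -> list:
    """Data preparation for another use case: audit the rows the first one skipped."""
    return conn.execute(
        "SELECT order_id FROM raw_orders WHERE TRIM(amount) = '' ORDER BY order_id"
    ).fetchall()


def run_elt() -> tuple:
    """Load first, then transform twice from the same raw data."""
    conn = sqlite3.connect(":memory:")  # stand-in for a shared repository
    loaded = extract_and_load(RAW_ORDERS_CSV, conn)
    return loaded, revenue_by_customer(conn), rows_missing_amount(conn)


if __name__ == "__main__":
    print(run_elt())
```

Because nothing is discarded at load time, the order with the missing amount is still available to the second query, which is the flexibility and auditability benefit listed above. The loading step also has no knowledge of either downstream spec, which is where the reduced friction comes from.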
Supplementing Data Transformation with Data Preparation
Decoupling the ETL process is a significant step. But many organizations are going even further. Not only are they turning their ETL pipeline into ELT, but they are also replacing the “T” (transform) with data preparation platforms. Why? Because decoupling ETL has many benefits, but in and of itself, it still doesn’t address the core reason why traditional ETL has become a bottleneck: the high demand from business users for access to data.
Data preparation solutions empower a new set of users to access data, explore it to assess its content and quality, and prepare it for use, while also taking on some of the transformation work of traditional ETL. Data preparation platforms are built for business users, not IT, and incorporate visualization techniques and machine learning in order to make the data transformation process as intuitive as possible. Examples of modern data preparation platforms include Trifacta and Google Cloud Dataprep, which allow any data professional to transform the data they need while reducing the total time spent preparing data by up to 90%.
New Beginnings with ETL
The core problems that ETL was built to solve still exist today, and for that reason it remains an important component in many analytics architectures. But organizations need to retire legacy ETL approaches that do not (and were never designed to) meet the needs of business users. Supplementing ETL steps with a data preparation platform is the best way to ensure that business users have the data they need, when they need it, while still partnering with IT.
To learn more about how ETL and data preparation should work hand-in-hand and the new order of operations that organizations are instituting, download our full report on the “death” of ETL, “EOL for ETL? The Future of Data Wrangling in the Cloud.”
Will Davis is Director of Product Marketing at Trifacta.