Today’s digital businesses have a common need: move data from all sources to a central platform from the moment it is created, and deliver actionable insights at the moment of need. This process is called data integration.
To meet this goal, data needs to be extracted from a plethora of data sources, transformed into an easy-to-understand standardized format and loaded into a central data warehouse. ETL is what makes that happen, and it stands for “extract, transform, load.” Constantly working behind the scenes, ETL is the key enabler for analytics and is meant to be a systematic and repeatable process that unifies all data in a single repository. Analysts can use this single source of truth to guide business decisions. Further down the road, data scientists can use the data to build predictive models and artificial intelligence.
The three stages of ETL
ETL has three main stages: extraction, transformation and loading. Let’s take a deeper look into these stages.
First, data is pulled from data sources such as SaaS applications, event streams and production databases and then put in a staging area, which is an intermediate destination before the data warehouse. Since it may not be feasible to extract all the data from all the sources at the same time, a staging area helps bring all the data together at different times without overwhelming the data sources. A staging area is also useful if there is a problem loading data into the data warehouse, allowing syncs to be rolled back and resumed as needed. Finally, transformations are performed in staging areas.
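The extraction step can be sketched in a few lines of Python. This is a minimal illustration, not a real connector: the sources are stand-in functions, and a local directory plays the role of the staging area. The point is that each source is pulled independently and landed as-is, so sources can be extracted at different times without being overwhelmed.

```python
import json
import os
import tempfile

def extract_to_staging(source_name, fetch_rows, staging_dir):
    """Pull raw rows from one source and land them, untouched, in staging."""
    rows = fetch_rows()  # e.g. one API page or one table scan
    path = os.path.join(staging_dir, f"{source_name}.json")
    with open(path, "w") as f:
        json.dump(rows, f)
    return path

# Stand-ins for real sources (a SaaS app and a production database):
crm_rows = lambda: [{"id": 1, "email": "a@example.com"}]
db_rows = lambda: [{"order_id": 10, "amount_usd": 25.0}]

# A local temp directory stands in for the staging area:
staging = tempfile.mkdtemp()
for name, fetch in [("crm", crm_rows), ("orders", db_rows)]:
    extract_to_staging(name, fetch, staging)
```

Because each source lands in its own file, a failed sync can be rolled back by discarding that file and re-extracting, without touching the other sources.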
Data transformation turns raw data into data models that are easily understood by analysts and used to populate data visualizations, dashboards and reports. Data is complicated and can be challenging to interpret, reconcile and utilize. For this reason, data needs to be structured, cleaned and reformatted before it can be used for analytics. Only then can the information be used by analysts and data scientists. Making data understandable involves enriching it with a consistent naming scheme, removing duplicates and cleansing invalid data, and standardizing units of measurement. It also involves structuring data, e.g. by splitting and merging data items, to make it easy to connect with business needs.
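A toy transformation step might look like the following sketch. The field names and units are invented for the example, but it shows the three operations described above: removing duplicates, reconciling inconsistent naming schemes, and standardizing units of measurement.

```python
def transform(raw_rows):
    """Clean raw rows into a consistent shape: one record per id, weight in grams."""
    cleaned, seen = [], set()
    for row in raw_rows:
        key = row.get("id") or row.get("ID")  # reconcile two naming schemes
        if key is None or key in seen:        # drop invalid and duplicate rows
            continue
        seen.add(key)
        grams = row.get("weight_g")
        if grams is None and "weight_lb" in row:
            grams = row["weight_lb"] * 453.592  # standardize units to grams
        cleaned.append({"id": key, "weight_g": grams})
    return cleaned

raw = [
    {"id": 1, "weight_g": 500},
    {"ID": 1, "weight_g": 500},   # duplicate under a different naming scheme
    {"ID": 2, "weight_lb": 1.0},  # imperial units
    {"name": "no id"},            # invalid record
]
print(transform(raw))
```

Four messy input rows become two clean ones, each with the same field names and the same unit, ready for analysts to query.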
The final step in this journey is loading the transformed data into the data warehouse, using batch or stream loading. Put together, your data models may make up a dimensional schema, a simplified data model that represents how your entire organization works.
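The batch variant of the loading step can be sketched as follows; here Python's built-in sqlite3 stands in for a real cloud data warehouse, and the table name and columns are invented for the example.

```python
import sqlite3

def load_batch(conn, table, rows):
    """Write one batch of transformed rows into the warehouse in a single commit."""
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (id INTEGER, weight_g REAL)"
    )
    conn.executemany(
        f"INSERT INTO {table} VALUES (?, ?)",
        [(r["id"], r["weight_g"]) for r in rows],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
load_batch(conn, "products", [
    {"id": 1, "weight_g": 500.0},
    {"id": 2, "weight_g": 453.592},
])
```

A streaming loader would instead write each record (or micro-batch) as it arrives; the destination table looks the same either way.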
Challenges of traditional ETL systems
ETL has been around since the 1970s and has changed the way data is managed and used by enterprises. However, in an increasingly cloud-based business environment, the volume, variety and velocity of data are ever increasing. This has made ETL increasingly unscalable in a number of ways.
Unscalable architecture

Traditional ETL systems are housed on-premise in data centers. As an organization’s data needs grow, beefier hardware alone cannot solve the problem; whether on-premise or in the cloud, more servers are needed to scale out the data processing infrastructure. These setups are complex to manage, requiring many machines running in parallel.
Most IT organizations thus minimize the impact on the data warehouse by scheduling batch processing ETL tasks during off-peak hours. As business needs become more real-time, this approach will no longer meet common business requirements. Since this architecture is so difficult to scale, organizations typically sacrifice granularity as data volumes increase.
Unscalable labor effort
As the number of data sources increases, the engineering time required to build and maintain pipelines grows, often in a nonlinear manner. Every new source means a new data pipeline to build and maintain, to say nothing of transformations that join records between sources.
Typically, building an ETL pipeline requires familiarity with parallel processing frameworks. These pipelines may need to be manually tuned to meet specific configuration requirements, making them expensive to build and hard to manage without specially trained staff. With such a level of manual coding and technical know-how needed, any small change in the technology stack can easily destabilize the infrastructure and disrupt daily operations, leading us to the next concern.
Fragile workflows

There are two conditions under which an ETL pipeline must be rebuilt. The first is when an upstream data schema at the source changes, as all the downstream transformation code must be adjusted accordingly. The second is when downstream data needs change, which likewise requires the construction of new transformations to produce new data models.
Fundamentally, ETL is a complex, engineering-heavy bespoke solution to data integration that requires substantial outlays of labor and capital. The fragility of ETL pipelines also poses a risk of stoppages and other disruptions whenever either data sources or downstream data models change.
ELT: ETL reimagined
A lot has changed since the origins of ETL. Today’s computation paradigm has shifted from on-premise to the cloud, and this movement has enabled data pipelines to move to the cloud. Data warehousing technology has also leapt forward, allowing raw data to be written directly to underlying cloud storage as is, without any prior transformations. There is no more fear of an “exploding” data warehouse since the cloud supports ample data storage and scalable computing power.
These new, cloud-based technologies bring us to ELT, or “extract, load, transform.” ELT data pipelines load data directly into the data warehouse before transforming it, allowing analysts to postpone the data transformation step until after data arrives in the data warehouse. This preserves the original granularity of the data and allows transformations to be performed in the data warehouse environment using SQL.
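The reordering is easiest to see side by side with ETL. In the sketch below (sqlite3 again standing in for a cloud warehouse, with invented table and column names), raw data is loaded untransformed, preserving its full granularity; the transformation happens afterward, in SQL, on the warehouse's own compute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the cloud warehouse

# Extract + load: raw events land in the warehouse with no prior transformation.
conn.execute("CREATE TABLE raw_events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [(1, 10.0), (1, 20.0), (2, 5.0)],
)

# Transform later, in SQL, inside the warehouse: an analyst defines a
# data model over the raw table whenever the business needs it.
conn.execute("""
    CREATE VIEW revenue_per_user AS
    SELECT user_id, SUM(amount) AS revenue
    FROM raw_events
    GROUP BY user_id
""")
```

Because `raw_events` keeps every original row, a new question next quarter just means a new SQL view; nothing upstream has to be rebuilt.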
The most advanced ELT tools are fully managed services, allowing an organization to outsource and effectively automate their data integration, solving the problems of unscalable architecture, unscalable labor effort and fragile workflows.
Scalable architecture

Fully managed ELT obviates the need to design, build or manage hardware for data integration. All of the infrastructure is hosted in the cloud, and resources are provisioned and retired as needed. If enterprises still host their data on-premise, some ELT tools can extract data behind the firewall and migrate it to the cloud. If more processing power is needed, more servers are automatically started in the cloud, and when storage needs grow, massive storage pools with better price-to-performance ratios can be attached on demand.
Scalable labor effort
Fully managed ELT obviates the need for an organization to build or maintain data pipelines. An organization simply chooses the connectors for its pipelines and is billed based on consumption. Separating the extraction and loading of data from transformations means that ELT providers can treat the raw data models from every source as a standardized commodity, offering the same schema to every customer.
There is another benefit to offering the same solution to a large number of customers. By stress-testing a data pipeline against many different use cases and environments, a provider can quickly identify corner cases that can cause the pipeline to fail, as well as other idiosyncrasies of that data source.
The combination of commoditized raw data models, outsourcing and transformations based in the data warehouse means there is no longer any need to rebuild entire data pipelines as either upstream data sources or downstream business needs change. Whenever an organization needs a new data model, an analyst can simply build it directly in the data warehouse environment, significantly reducing the risk of stoppages.
Reverse ETL: Operationalizing data
Beyond analytics, using data in business operations requires data models to be delivered to the right team at the right time in an accessible manner. Reverse ETL empowers organizations to meet this requirement by syncing transformed data from data warehouses into operational systems and tools such as CRMs and ERPs. Imagine the boost to the customer experience if the marketing department can personalize emails based on the customer’s past buying data in a CRM. Imagine if the same data enables a customer service representative to know exactly how to help the customer.
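Mechanically, reverse ETL is a push in the opposite direction: rows from a warehouse data model are written into an operational tool through its API. The sketch below is purely illustrative; the CRM client is a stand-in function (a real sync would call the vendor's API, typically one update request per contact), and the field names are invented.

```python
def sync_to_crm(model_rows, crm_update):
    """Push each warehouse row into the CRM; return how many rows were synced."""
    synced = 0
    for row in model_rows:
        crm_update(
            contact_id=row["customer_id"],
            fields={"lifetime_value": row["ltv"]},
        )
        synced += 1
    return synced

# A warehouse data model, e.g. the output of an ELT transformation:
model = [
    {"customer_id": "c-1", "ltv": 120.0},
    {"customer_id": "c-2", "ltv": 80.0},
]

crm = {}  # in-memory stand-in for the CRM
count = sync_to_crm(
    model, lambda contact_id, fields: crm.update({contact_id: fields})
)
```

After the sync, each CRM contact carries the warehouse-computed lifetime value, so marketers and support reps see the modeled data inside the tool they already use.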
Data integration is here to stay
In the ever-evolving landscape of data management, ETL has evolved into fully managed ELT as well as reverse ETL. The development of ELT has changed data integration from an IT- and engineering-intensive process to one that is friendly to analysts. It has changed data integration from a fragile, complex and inflexible process to one that is agile, simple and easily scaled.
Data integration will continue to grow in importance as newer cloud data technologies become available. As data becomes an increasingly central part of your enterprise, the need to scale, transform and extract actionable business insights will only grow. To see how Fivetran can help you along the way, sign up for a free trial.