What is ETL?
ETL stands for extract, transform, load, and commonly refers to the process of data integration. Extract means pulling data from a particular data source. Transform converts that data into a format that can be processed and analyzed. Load is the final step of writing the data into the designated target.
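To make those three steps concrete, here is a minimal sketch of an ETL job in Python. The source API, field names, and target table are all hypothetical, and SQLite stands in for a real data warehouse:

```python
import sqlite3
import requests

# Extract: pull raw records from a (hypothetical) source API.
raw_orders = requests.get("https://api.example.com/orders").json()

# Transform: reshape each record into the format the target expects.
rows = [
    (order["id"], order["customer"], float(order["total"]))
    for order in raw_orders
]

# Load: write the transformed rows into the designated target
# (SQLite here as a stand-in for a data warehouse).
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (id, customer, total)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
conn.commit()
```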
What are ETL tools?
ETL tools abstract away many of the considerations discussed below. Their primary benefit is that they don't require an engineering resource to script these processes in the first place: they provide an interface to easily set up data integration from various sources. Modern cloud-based tools, such as Fivetran, continue to advance ease of use and reduce the technical debt incurred by managing the tooling yourself, with features such as:
Simple setup
Integrations with cloud applications are typically set up through a simple OAuth flow, authorized in a few clicks.
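Behind that button click, the handshake is usually a standard OAuth 2.0 authorization code grant. As a rough sketch, with every endpoint and credential below a placeholder:

```python
import requests

# Exchange the authorization code (returned after the user clicks
# "Allow") for an access token. All values here are placeholders.
token_response = requests.post(
    "https://auth.example.com/oauth/token",
    data={
        "grant_type": "authorization_code",
        "code": "AUTH_CODE_FROM_REDIRECT",
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
        "redirect_uri": "https://yourapp.example.com/callback",
    },
)
access_token = token_response.json()["access_token"]

# The token then authorizes every subsequent extraction request.
headers = {"Authorization": f"Bearer {access_token}"}
```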
Automated updates
Fivetran maintains an internal cursor for each data source set up in your account, so each sync queries only the latest timestamps or database change logs to determine what changed, reducing the load on your production systems.
Schema drift handling
Whether columns or tables are added in a database source, or custom fields are added in a cloud application, Fivetran’s pipeline picks up these changes and applies them to the appropriate destination schema, so you don’t need to rework the pipeline for frequent source changes.
Infrastructure management
Because Fivetran is cloud-hosted and managed through an easy-to-use web interface, you never have to worry about scaling infrastructure to meet your data volume needs.
Security
Data replicated by Fivetran is always encrypted, whether in transit or at rest. For database connectivity, we can connect to either on-premises or cloud-hosted databases through simple IP whitelisting, SSH tunnels, reverse SSH tunnels, or VPN tunnels. We also undergo an annual third-party penetration test and SOC 2 Type 2 audit. For more on our security, download our whitepaper.
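As an illustration of one of those connection options, an ETL process can reach a database that isn't exposed to the internet by forwarding traffic through an SSH tunnel. A minimal sketch using the sshtunnel package; the bastion host, key path, and database names are placeholders:

```python
import psycopg2
from sshtunnel import SSHTunnelForwarder  # pip install sshtunnel

# Open an encrypted tunnel through a (hypothetical) bastion host to a
# database that is not directly reachable from the internet.
with SSHTunnelForwarder(
    ("bastion.example.com", 22),
    ssh_username="etl",
    ssh_pkey="/path/to/private_key",
    remote_bind_address=("db.internal", 5432),
) as tunnel:
    # Traffic to this local port is forwarded, encrypted, over SSH.
    conn = psycopg2.connect(
        host="127.0.0.1",
        port=tunnel.local_bind_port,
        dbname="production",
        user="readonly",
        password="REPLACE_ME",
        sslmode="require",  # encrypt the database session as well
    )
```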
ETL process considerations for your pipeline
When you’re scripting out ETL in-house, you commonly have to consider a few questions:
Do you have dedicated engineering resources not only for the initial build, but for ongoing maintenance?
Data pipelines are typically built to answer a specific question, which means only specific API endpoints are queried. Over time, your analysts will strive to answer more questions, which may require extracting from additional API endpoints, as in the sketch below. Your engineers will also have to update the pipeline to accommodate structural changes at the data source. For as long as you use a particular data source, maintenance is an ongoing commitment that consumes engineering time.
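For instance, an extraction script is usually wired to the endpoints it was first built for. In this sketch (the CRM API and endpoints are hypothetical), every new analyst question means another request for an engineer to write and then maintain:

```python
import requests

BASE = "https://api.example-crm.com/v2"  # hypothetical CRM API
headers = {"Authorization": "Bearer YOUR_TOKEN"}

# The pipeline was built to answer one question about the sales
# funnel, so it only ever extracts from the /deals endpoint...
deals = requests.get(f"{BASE}/deals", headers=headers).json()

# ...until analysts ask about support quality, and an engineer has
# to add (and keep maintaining) extraction for another endpoint.
tickets = requests.get(f"{BASE}/tickets", headers=headers).json()
```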
How do you handle infrastructure scaling?
This applies if your pipelines keep data continuously updated for ongoing reporting. Data volume will only continue to grow, and you'll need dedicated servers sized to handle it.
How do you ensure your connectors only pull in the latest data?
Rather than a full extract and comparison of all the data available in the source, which puts undue stress on the production system, your ETL script should use database change-tracking logs, or timestamps for cloud applications, to extract only the data that has changed.
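Here is a minimal sketch of the timestamp approach, assuming an orders table with an updated_at column and a local file for persisting the cursor between runs (all names are illustrative):

```python
import json
import psycopg2

# Load the cursor persisted by the previous run; default to the
# epoch on the first sync.
try:
    with open("cursor.json") as f:
        last_synced = json.load(f)["orders"]
except FileNotFoundError:
    last_synced = "1970-01-01T00:00:00"

conn = psycopg2.connect("dbname=production user=readonly")
with conn.cursor() as cur:
    # Pull only rows changed since the last sync instead of doing a
    # full table scan, keeping load on the production system low.
    cur.execute(
        "SELECT id, customer, total, updated_at FROM orders "
        "WHERE updated_at > %s ORDER BY updated_at",
        (last_synced,),
    )
    changed_rows = cur.fetchall()

# Advance the cursor only after rows are safely loaded downstream.
if changed_rows:
    with open("cursor.json", "w") as f:
        json.dump({"orders": str(changed_rows[-1][3])}, f)
```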
Does the ETL script account for schema drift from source to destination?
Schema changes happen frequently in production databases, including systems that host product data and backends for ERP systems. In cloud applications, system administrators frequently use custom fields for added flexibility, which results in schema changes, such as additional tables and columns, that need to be applied in the destination.
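Handling that drift in a hand-rolled pipeline means diffing the source and destination schemas before each load. A simplified sketch, assuming Postgres on both ends and mapping every new column to TEXT for illustration:

```python
import psycopg2

source = psycopg2.connect("dbname=production user=readonly")
dest = psycopg2.connect("dbname=warehouse user=loader")

def columns(conn, table):
    """Return the set of column names for a table."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT column_name FROM information_schema.columns "
            "WHERE table_name = %s",
            (table,),
        )
        return {row[0] for row in cur.fetchall()}

# Any column added at the source but missing in the destination must
# be created before the next load, or the load breaks.
missing = columns(source, "orders") - columns(dest, "orders")
with dest.cursor() as cur:
    for col in missing:
        # Real pipelines also map types; TEXT keeps the sketch simple.
        cur.execute(f'ALTER TABLE orders ADD COLUMN "{col}" TEXT')
dest.commit()
```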
How do you address security concerns?
You must ensure that data traversing the internet to a cloud-hosted warehouse, or being extracted from a cloud-hosted data source, is never compromised. Privacy and security are always important: encrypt your data pipeline end to end and restrict access to users with the appropriate credentials.
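In practice, that means enforcing TLS on every connection and loading with least-privilege credentials. For example, with a Postgres-compatible warehouse (the hostnames and roles here are placeholders):

```python
import psycopg2

# Refuse unencrypted connections to the cloud warehouse, and connect
# with a role that can only write to its staging schema.
conn = psycopg2.connect(
    host="warehouse.example.com",
    dbname="analytics",
    user="etl_loader",          # least-privilege service account
    password="REPLACE_ME",
    sslmode="verify-full",      # require TLS and verify the server cert
    sslrootcert="/path/to/ca.pem",
)
```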
What does Fivetran do?
Fivetran's automated data integration delivers zero-configuration connectors that dynamically adapt as schemas and APIs change, ensuring reliable data access. Fivetran continuously synchronizes data from source to warehouse, and accelerates data analysis by programmatically managing ready-to-query schemas and automating in-warehouse transformations. Fivetran offers a free 14-day trial so you can confirm it satisfies your data integration use case.