Learn

Best Snowflake ETL tools

March 15, 2023

Topics

In this post we will discuss the various modern tools that can be used for ETL via Snowflake, and compare their capabilities and shortcomings. If you're confused which tool to use or want to know which tool best suits your purpose, this post is for you.

These days, businesses and governments actively collect data and monitor it to uncover facts about the efficacy of their processes, get actionable insights and predict future outcomes. Every second, every person generates about 1.7MB of data being, on an average. Data is continuously generated by computers, mobile phones, sensors, hand held devices and monitors, etc. So, in 2022 alone we generated 97 zettabytes, and to give you an idea of this mammoth size, 1 zettabytes is equal to 1 billion terabytes OR or 1 trillion Gigabytes. With such a huge volume of data being generated every day, it becomes very important to properly save and secure it, and to use it judiciously for our benefit. This phenomenon has led to creation of a whole new industry, called Data science and Analytics. Let's discuss some more details of a few functions performed in the Data science and Analytics domain.

What is ETL ?

ETL is an acronym for Extract-Transform-Load, it's a process you go through when you collect data from various sources into a huge data warehouse. This data could come from databases, cloud locations, APIs, sensors and monitors, 3rd party SAAS solutions like Salesforce or Quickbooks etc. The incoming data could be in diverse formats and could require some transformations/pruning before they can be loaded in the data warehouse. After ETL is done, the coalesced data in the warehouse can be use for analytics, reporting and business intelligence purposes. ETL is a continuous process, as data is being generated continuously, businesses need to capture, pre-process and store data on an ongoing basis.

What is reverse ETL and ELT ?

Reverse ETL is the process where you transfer a subset of the data accumulated in your data warehouse, to business applications like Analytical tools / CRMs / Marketing Automation tools etc. ELT is Extract-Load-Transform, where you first load the data into the warehouse, irrespective of its format/preparedness, and later transform it just prior to its use.

What is Snowflake?

Snowflake is a data warehousing service provided as a cloud-Based SaaS (Software-as-a-service) offering to its customers. Using Snowflake, you can build a data warehouse in minutes, as there is no need to set up your infrastructure to store all this data. With Snowflake there is no need to provision storage and bandwidth, no need to install and maintain software, and no need to monitor and manage your data silos. Snowflake does all of this for you, and allows you to go from data to analytics in a very short period of time.

Snowflake is built on Amazon Web Service, Microsoft Azure, and Google Cloud infrastructure; in such a way that it provides virtually unlimited bandwidth and storage, for storing and retrieving data. Snowflake allows you to run click and point analytics on your data too, and lends itself well for being used in coordination with other software.
Snowflake offers transparent and manageable billing; it bills its clients separately for "storage" and "compute" units. This allows the customers to use and pay for both services independently, based on their specific needs. The above USPs make Snowflake, one of the world's leading Cloud Data Warehousing solutions.

What is Snowflake ETL?

Snowflake is known for its versatility, flexibility and ease of use; when compared with other tools in the same genre. Snowflake supports incoming data in many formats like CSV, JSON, Avro, ORC, Parquet, or XML etc. The incoming data can come from many sources like Databases, APIs, Flat files, Cloud storage services S3/GCP/Azure, Data pipelines and streams etc.

How ETL works in Snowflake?

Bulk loading as well as batch loading is supported by Snowflake, you can also define data pipelines that will feed data into Snowflake, as and when it's available. You just have to define the "stage", i.e. a place where raw files will be stored before the data in them goes into tables. For each user and for each table, Snowflake automatically allocates a stage; further users can allocate "named" stages to harness the maximum data loading flexibility provided by the platform.

Your data in Amazon S3 is assumed to be already staged, and you just need to give a "COPY INTO <tablename>" command to load the contents of your S3 file(s) into a specific Snowflake database table. Similarly, for Google Cloud as well as Azure, Snowflake assumes the data files have already been staged in a Cloud Storage bucket, and just need the "COPY INTO <tablename>" directive to start its ETL action.

For Azure, Snowflake currently supports loading from blob storage only, but more support is being added for future versions. Of course there are many settings/optimizations you can do to the above ETL processes, but we can discuss those in a separate blog post. Another cool feature Snowflake provides is called Snowpipe, Snowpipe automates the above "COPY INTO <tablename>" process. You can load data in small batches into Snowflake, and Snowpipe will make it available to users within minutes, without the need to manually issue a COPY command for every batch.

There are 2 ways of alerting Snowflake of a new incoming batch:-

1. Cloud Messaging - As soon as the data to be sourced to Snowflake is available in your cloud storage, you can send an event notification to Snowflake. Snowflake will then copy the new data files into a queue, and then load them into the target table in a continuous, serverless fashion based on parameters defined in a specified pipe object.

@Source: https://docs.snowflake.com/en/

‍

As the above figure suggests, you can configure a Snowpipe for the above cloud storage services.

‍

2. Snowflake REST API endpoints - Snowflake provides an API that can be used to initiate the data transfer, by calling the REST endpoint with the name of a pipe object and a list of

data filenames.

@Source: https://docs.snowflake.com/en/

‍

As shown above, this method is the most versatile and can work well with almost all cloud based storage services. Snowflakes not only supports ETL but it supports ELT too, i.e. you can first quickly load your data into Snowflake and then transform it later. This method gives you the flexibility to pre-process the same raw data multiple times and get it in different formats, where each format can suit a particular analytical method/tool.

Advantages of using Snowflake ETL

Snowflake, as discussed above, supports a variety of data sources, many file formats and pre-processing routines. It also offers its own analytics and visualisation/reporting capabilities, so it's not necessary to use any other analytics platform.

The other main advantages of using Snowflake ETL are:

Readily accepts incoming data from popular cloud storage solutions.

You can either create a automate continuous data loading by using Cloud Messaging, OR use REST API endpoints to initiate ETL whenever needed or when your incoming data has irregular bursts.

Snowflake REST API endpoints make it easy to integrate with your existing applications and analytics tools like Tableau and PowerBI.
An intuitive web interface is also provided for ETL, which is easy to use.
Since Snowflake bills compute and store separately, you can control your spending transparently. In case your data flows are pretty static but your analytics are intensive and varied, you can hire more compute and pay less for storage.

In cases where there is a lot of incoming data being generated regularly, more of storage can be used to keep the data warehouse in sync with the latest data and you can limit the compute cost by running analytics only when substantial newer data has been fed to the storage (i.e. the existing analytics are pretty much based on stale data, and a lot of new data has flown in since).

Snowflake being an industry leader in its segment, provides state of the art encryption, authentication and security features.
Snowflake keeps innovating and newer advanced features are regularly added for users to take advantage of.

Key Factors to Evaluate Snowflake ETL Tools

The key factors to consider when we compare Snowflake ETL tools would be: -

Power and versatility - According to me, a tool must be versatile and powerful, such that it can be used in different scenarios with different data types. I tend to rank "ease of use" a bit lower, as with the explosion of data sources and the variety of file formats available, an ETL tool must be powerful enough to support most if not all of them. So, a good tool must support a variety of data sources and data formats, it should not mandate the need for other tools if data sources/formats change.

Extensibility - As new data sources/formats emerge the tool must either directly support them or lend itself well to integrate with other tools. This requirement is closely related to the first key factor. There are times when a less popular data format or an occasionally used data source needs to be integrated with Snowflake. So, a good ETL tool must provide an API or connectors or messaging services, so that another tool can orchestrate well with it, if and when the need arises.

Ease of use - this is also a very important factor as it enhances user acceptance and efficiency. No matter how versatile an ETL tool is, if it's difficult to use, it will either become unpopular OR waste a lot of time, leading to users switching over to a set of alternative tools. Most tools today provide drag and drop, point and click, simple scripting to allow their efficient usage.
Data Transformation - the tools must provide all the frequently used or prescribed data transformations, this eases analytics and makes the data readily consumable by processes further down the line.

Pricing- the tools must be moderately priced, it must be withing the budget limits for most users. A versatile tool that is too pricey, or a weak tool at throwaway prices will always be avoided.
Documentation and Support - Excellent product documentation and timely support ensure that the tool can be used for the maximum benefit. Good documentation accelerates learning and timely support eases user acceptance.

Top 7 Best Snowflake ETL Tools of 2023

Next, we will discuss how some popular data integration and analytics tools weigh against each other, with respect to their integration and working with Snowflake. Large corporations have many datasets pertaining to their various business processes, and these datasets could be in diverse formats and locations. E.g. their sales/CRM data could be in salesforce, their operations/HR data could be in a local database and their accounting data could be in lets say, quickbooks. All this data must be coalesced and made to work as a seamless whole, such that one can run analytics over the integrated whole to get a clear picture of where the business processes are heading. Data integration tools ease this collection, pruning, comparison and connection of the data.

1. Fivetran:-

We discuss Fivetran first because not only is it one of the most popular, but it's a pioneer in continuous improvement and innovation. Fivetran provides 300+ pre-built, no-code source connectors and supports a multitude of data formats and data sources.

@source: https://fivetran.com/docs/

With Fivetran's automated schema drift handling, it supports dynamic schemas, you need not specify your data structures/table columns before moving in your data. You can just move in your data, and specify the DDL later.

Schema creation and mapping can happen automatically also, so once you need to specify the data definitions only once. Another very useful feature of Fivetran is its pre-built data models, which are built according to most popular data sources, and transform your destination data into analytics-ready datasets.

E.g. your Salesforce objects have a ready mapping available in Fivetran, it will transform incoming salesforce objects into Fivetran output objects/models that lend themselves well for running analytics immediately.

@source: https://fivetran.com/docs/

Fivetran automatically prepares, transforms and enriches your data objects while syncing with upstream connectors. Further, there are pre-built quickstart data models, which allow you to start using them immediately inside Fivetran, without building out your own dbt project.

1) Fivetran ensures that your data is fresh and updated almost continuously, so that you always run analytics on the latest data.

2) Fivetran offers easy setup and automated data pipelines that require minimal maintenance.

3) Supports SQL modelling with defined schemas and ERDs , to make the analytics process easy and intuitive.

4) Fivetran can be run in a fully managed or hybrid or totally self hosted environments, so it can suit everybody's needs and policies.

5) Fivetran’s high-volume replication capabilities can give upto 50 gb+/hr throughput.

@source: https://fivetran.com/docs/

Fivetran's integrated scheduling allows for super easy orchestration of your data pipelines, automatically triggers model runs following the completion of Fivetran connector syncs in your warehouse. Fivetran supports reverse ETL as well as ELT easily.

Pros:-

Pre-built data models for almost all popular data sources
Near real-time data replication
Quick to set up, easy to maintain existing pipelines
Fantastic support, easy to use documentation

Cons:-

You can associate a Quickstart data model with one connector only , though this is done to make the mapping more efficient
Need to understand the pricing model to keep costs low

Pricing:-

With Fivetran, you only pay for monthly active rows (MAR), and as volumes increase the cost per row decreases. For a small data volume, you can process 2 million MAR for close to $1000. For a huge data volume, you can process 70 million MAR for close to $10,000. There is always a free plan to start with, get familiar with your data dynamics and fivetran pricing, and then move to a paid plan.

2. Talend

Talend has positioned itself as more of a data integration tool and does not provide full blown BI capabilities, hence its worth a discussion here. Talend provides more than 100+ connectors for various sources, and could be useful if you have incoming data from many disparate sources. It provides data quality(pre-processing and checks), big data and application integration facilities, making it an extensively used tool. Talend's Continuous integration capabilities reduce the overhead of repository management and deployment. Through its MDM(master-data-management) functionality, Talend ensures that you warehouse is in sync with the continuously updating data streams, and you work on the latest version of your data.

Pros:-

Processes incoming data row by row, so conditional transformations are simpler to implement
Talend supports dynamic schemas (i.e., table structure), so you need not know all your tables columns before pulling in your data. Data items can be classified later, as per the intended usage.
Supports Java coding for business specific customizations

Cons:-

The community edition has limited capabilities in terms of streaming features, it's also difficult to install and run for the first time.
Scheduling features are limited.
Error management is not at par with other tools, some of the error messages are not explanatory.
Suits big data applications more than traditional databases.
You have to subscribe to paid extensions for many tasks, like real time big data package for spark streaming and machine learning work.
Limited support for unit testing of the components in the free edition.

3. Matillion

Matillion is another leading cloud data warehousing solution we will discuss. Matillion ETL allows you to perform powerful transformations and combine transformations to implement complex business logic procedures.

Pros:-

Provides a scheduling orchestration that enables ingestion when data is available, hence it helps in keeping your warehouse in sync with the latest data.
Easy to use drag and drop interface
Provides easy collaboration amongst team members, version control, Push-down ELT technology and 80+ out-of-the-box connectors for other data products.
Snowflake itself has invested in Matillion to bring native connectors to the Snowflake Data Cloud, so you can be assured of pristine quality here.

Cons:-

Sometimes, the collaboration features can create confusion/bottlenecks when team members initiate conflicting processes.
Matillion for Snowflake does not support a DynamoDB ETL connector.
Limited clustering ability, so large datasets take a lot of time to process

Pricing:-

Matillion pricing units are are called credits, credits are consumed by the number of rows loaded by Matillion Data Loader, and by the Virtual Core hours used running Matillion ETL

instances.

Pricing starts from $2.00 per credit(less features), and goes up to $2.70 per credit ( all features). This pricing is for using Matillion as a whole with all its connectors including the Snowflake connector.

4. Integrate.io

Integrate.io is another versatile tool that provides an integration layer for your data warehouse. Integrate.io provides a no-code data pipeline platform, and powers data warehouses using ETL, ELT, CDC, Reverse ETL, and API Management. Integrate.io has a secureETL feature, to ensure that your data is secure and compliant with latest standards like PHI, GDPR(Europe) and HIPAA etc. It has partnered with Amazon's Key Management Service (KMS) to enable field level encryption (FLE), such that your data is always encrypted when it leaves your network.

Pros:-

Provides one of the industry’s fastest ELT data replication functionality, keeping your data in sync with the updates that occurred just 60 seconds ago.
Connects 150+ data sources & destinations, it might have all the needed connectors for tools in your tech stack.
Drag and drop no code interface allows you to easily define transformations and connect your data sources.
Good support team

Cons:-

Its expensive as compared to most alternatives in the same genre
Error reporting does not always lead you to the cause of the error.
The interface gets complex as the number and complexity of your pipelines grow.

Pricing:-

Integrate.io follows a simple flat-rate pricing strategy that charges by the connector, no matter what your data volume, or location is. Any number of users can use it. Pricing for the base product starts at 18,000 per year which supports 20+ native database connectors including big data services , with additional connectors priced at $2,000 per yr per connector.

5. Informatica

Informatica is a leading data integration platform based on ETL methodologies and it also offers analytics capabilities. Informatica is an elite Snowflake partner , has a well defined and deep integration with snowflake. Informatica's Snowflake connector offers easy integration and reading/writing of data to Snowflake.
It offers processing of virtually unbounded data volumes, with appreciable speed.

It also offers atomic transaction mode for DDL and DML statements, offers error management capabilities, such as Notifications & Alerts no need to use DDL as it offers runtime creation of target tables in Snowflake Since the connector is based on JDBC, it offers 128 bit SSL-encryption end-to-end. It also manages your Keys ( e.g. private keys for encryption) in a secure repository. Role based access and authentication are provided. Informatica supports ETL as well as ELT, and can be run on premise as well as cloud/AWS. Informatica scales well, and supports ODBC FULL push-down for faster data processing As it supports partitioning, spawning parallel data pipelines end-to-end is not difficult.

Pros -

Easy to set up, configure and operate.
Flexible as it supports parameterizing and overriding of practically all properties at runtime.
Follows Snowflake "best practices" for data reads and writes, making it one of the fastest connectors available.
Maintains a cache for faster reading of data
You can easily re-use/migrate your connection configured in one environment to another.
Excellent documentation, user base and regular updates.

Cons -

A bit pricey as both players take a piece of the cake
Restrictions in migration mandate that the tables and objects at both the ends must be the same.
Supports Push down only via ODBC driver, so it may need some extra customization( via programming) at your end, if you plan to use JDBC/other DB connectors.

7. Streamsets

Streamsets is another tool worth a mention here, it provides data integration for both hybrid as well as multi-cloud environments. It provides resilient pipelines, and a single interface for managing the data life cycle. Being built on DataOps, it manages data complexity and diversity better than many other products. DataOps is a set of best practices and technologies, for operationalizing data management and data integration when your data is constantly changing/updating, ensuring resiliency and agility.

Pros:-

Handles data drifts better, as it follows DevOps methodologies
Processes streaming as well as record based data efficiently .

Cons:-

Any configuration changes to already setup pipelines will pause ingestion for some time.
Integration with some open source ML/analytics engines like Apache Spark is difficult.

Pricing:-

Starts with $1000/month for 50 published pipelines+5 active jobs+upto 5 users, with 99.9% uptime SLA promise

Conclusion:

We have discussed quite a few ETL tools for Snowflake, with their pros and cons. We have tried to highlight unique features of each tool, along with some of their limitations and challenges in using them . Based on your tech stack and application mix, you can choose the best suitable tool for your current and future needs. We hope this article helps you in making a well informed decision.

Start your 14-day free trial with Fivetran today!

Get started now

Topics

etl