3 questions to ask about your data lake management solution

August 21, 2024

How the Fivetran Managed Data Lake Service stands out in a crowded market.

At Fivetran, we’re thrilled to introduce our latest innovation: the Fivetran Managed Data Lake Service. A forward-looking data lake strategy can significantly reduce time, cost and complexity for enterprises, especially as they accelerate the adoption of generative AI, machine learning and other analytic solutions.

Fivetran seamlessly delivers data to data lakes through automated pipelines that are secure, accurate, reliable and resilient to change. For years, we’ve moved countless petabytes of mission-critical data across thousands of enterprise systems to cloud data warehouses like Snowflake and Databricks. Now, we’re excited by the strong market interest in the Fivetran Managed Data Lake Service as a new destination — bringing data warehouse functionality to the data lake.

In a crowded data integration market, many vendors promise the ability to deliver data to data lakes. But what truly sets a successful solution apart? Which vendors adhere to the right principles and offer the necessary features to support a modern data lake architecture effectively?

The following sections distill these considerations into key questions. If your data integration provider can confidently address these questions, you’re likely on the right path to success.

Question 1: Can my data integration solution move data to a data lake?

The most significant benefit of the data lake is where the data resides: in inexpensive, infinitely scalable cloud storage. The most common choices are Amazon Simple Storage Service (S3), Azure Data Lake Storage Gen2 (ADLS built on Azure Blob Storage) and Google Cloud Storage.

[Figure: data flows left to right, with cloud storage at the right]

Keeping your data in a data lake generally costs less than moving it to a data warehouse. Data lakes are cheaper, more reliable, more fault-tolerant and more flexible than ever, making them an excellent choice for storing virtually anything, including structured, semi-structured and unstructured data.

Does Fivetran support this? Yes, Fivetran supports moving data to data lakes. Our S3, Microsoft OneLake and ADLS data lake destinations are widely used by customers and are among our fastest-growing destinations due to our differentiated approach. As of August 2024, Google Cloud Storage is on Fivetran’s roadmap for future support.

Do other vendors support this? Other data integration vendors offer some support for moving data into data lakes, but not in as governed and compliant a manner as Fivetran. Matillion, for example, supports Snowflake, Databricks, Redshift and BigQuery, but offers no cloud storage data lake destinations.

Question 2: Does my data integration solution deliver data in an open table format?

Hooray! Your data is ready to land in your data lake. The next central question is about data format: How should the data be structured in the data lake? Twenty years ago, when storing data on a filesystem, data engineers typically used CSV or other delimited or fixed-width file formats. Ten years ago, JSON and file formats like Parquet and Avro became common. However, in 2024, we should structure data in an open table format, with Delta and Apache Iceberg™ being the most prominent examples. Here’s why:

ACID transactions: Delta and Iceberg offer ACID (Atomicity, Consistency, Isolation, Durability) transactions, ensuring data integrity and reliability even in complex, concurrent workloads.
Schema evolution: These formats support schema changes without disrupting existing queries or applications, allowing for flexible data models and adaptability.
Time travel: Both provide time travel capabilities, enabling you to access and query historical data versions for auditing, debugging or machine learning purposes.
Performance optimization: Delta and Iceberg offer features like partitioning, compression and indexing to optimize query performance and reduce storage costs.
Openness and interoperability: While Delta is closely tied to Databricks, both formats are open-source and interoperable with various data processing frameworks and tools.
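
To make the ACID and time travel points concrete, here is a minimal in-memory sketch of the idea behind a table format's transaction log. It is a toy model, not the Delta or Iceberg API: the names `Commit`, `ToyTable`, `commit` and `files_at` are hypothetical, and real formats persist this log as metadata files in cloud storage alongside the data files.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Commit:
    """One atomic entry in the table's log (hypothetical structure)."""
    version: int
    added: frozenset[str]
    removed: frozenset[str]


@dataclass
class ToyTable:
    """In-memory stand-in for an open table format's transaction log."""
    log: list[Commit] = field(default_factory=list)

    def commit(self, added=(), removed=()) -> int:
        # Validate first, then append exactly one log entry: the change
        # becomes visible as a whole or not at all.
        live = self.files_at(len(self.log) - 1) if self.log else set()
        missing = set(removed) - live
        if missing:
            raise ValueError(f"cannot remove files not in table: {sorted(missing)}")
        version = len(self.log)
        self.log.append(Commit(version, frozenset(added), frozenset(removed)))
        return version

    def files_at(self, version: int) -> set[str]:
        # Time travel: replay the log up to and including `version`.
        if not 0 <= version < len(self.log):
            raise IndexError(f"no such version: {version}")
        live: set[str] = set()
        for entry in self.log[: version + 1]:
            live -= entry.removed
            live |= entry.added
        return live


table = ToyTable()
v0 = table.commit(added={"part-000.parquet", "part-001.parquet"})
# A rewrite replaces one file with another in a single commit.
v1 = table.commit(added={"part-002.parquet"}, removed={"part-000.parquet"})

current = table.files_at(v1)   # what a reader sees now
previous = table.files_at(v0)  # what a reader sees when querying the old version
```

A plain bucket of Parquet files has no equivalent of `log`, so a reader cannot tell which files belong to the current table state or reconstruct an earlier one.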

Does Fivetran support Delta and Iceberg? Yes, Fivetran supports both formats. While most data integration vendors can write data to S3 or ADLS, they often cannot write it in open formats. Typically, data is written in CSV, JSON or Parquet without ACID transaction control, time travel or metadata describing the files. Vendor features vary, as shown below, but this usually results in a disorganized bucket of Parquet files, useful mostly for cold data backups. Many in the industry refer to this as a “data swamp” — the data exists, but it’s nearly impossible to query or extract any analytical value from it.

Vendor	Writes Delta to S3	Writes Delta to ADLS	Writes Iceberg to S3
Fivetran	Yes	Yes	Yes
Azure Data Factory	No	Yes	No
Qlik Replicate	No	No	No
Informatica	No	No	Only via Hive connector; with limitations
Airbyte	No	No	Only with community-supported connector
AWS DMS	No	No	No
Stitch	No	No	No
Matillion	No	No	No

Question 3: Does my data integration solution fully manage my Delta and Iceberg tables?

Delta and Iceberg are remarkable technologies for storing data in open formats accessible to multiple query engines. Data pipelines should deliver this data to the cloud, but the job isn’t truly complete until the pipeline also handles two things:

Catalog integration: A data integration solution that natively integrates with a data catalog ensures the catalog accurately records data pipeline metadata, meaning users can quickly discover, access and govern key datasets.
File clean-up: As pipelines deliver data continuously, data volumes can grow significantly. While enterprises typically need to retain all data, not all time travel snapshots are necessary. A well-architected pipeline should not only deliver data but also manage clean-up. It should support data retention policies and remove orphaned files when they’re no longer relevant.
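
The clean-up logic described above can be sketched as a small planning function. This is an illustrative model under stated assumptions, not Fivetran's implementation or any library's API: `Snapshot`, `plan_cleanup` and the file names are hypothetical, and the rule of always keeping the latest snapshot is an assumption made so the table stays readable.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Snapshot:
    """A table version and the data files it references (hypothetical)."""
    snapshot_id: int
    committed_at: datetime
    files: frozenset[str]


def plan_cleanup(snapshots, stored_files, now, retention):
    """Return (expired_snapshot_ids, deletable_files).

    A snapshot expires once it is older than the retention window,
    except the latest one, which is always kept (an assumption here).
    A stored file is deletable if no kept snapshot references it: this
    covers files used only by expired snapshots and orphans left behind
    by failed writes.
    """
    ordered = sorted(snapshots, key=lambda s: s.committed_at)
    if not ordered:
        return [], set()  # nothing to reason about, so delete nothing

    cutoff = now - retention
    latest_id = ordered[-1].snapshot_id
    kept = [s for s in ordered
            if s.committed_at >= cutoff or s.snapshot_id == latest_id]
    kept_ids = {s.snapshot_id for s in kept}
    expired = [s.snapshot_id for s in ordered if s.snapshot_id not in kept_ids]

    referenced = set().union(*(s.files for s in kept))
    deletable = set(stored_files) - referenced
    return expired, deletable


now = datetime(2024, 8, 21)
snapshots = [
    Snapshot(1, datetime(2024, 8, 1), frozenset({"a.parquet", "b.parquet"})),
    Snapshot(2, datetime(2024, 8, 15), frozenset({"b.parquet", "c.parquet"})),
    Snapshot(3, datetime(2024, 8, 20), frozenset({"c.parquet", "d.parquet"})),
]
stored_files = {"a.parquet", "b.parquet", "c.parquet", "d.parquet",
                "tmp-failed-write.parquet"}

expired, deletable = plan_cleanup(snapshots, stored_files, now,
                                  retention=timedelta(days=7))
```

With a seven-day retention window, snapshot 1 expires, and both the file only it referenced and the file left by the failed write become deletable. A production system would typically also skip very recent unreferenced files, since they may belong to a write that has not committed yet.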

Can Fivetran fully manage Delta and Iceberg? Yes. Fivetran records all metadata via native integrations with data catalogs including AWS Glue and Databricks Unity Catalog. More catalog support is on our roadmap, so customers will have the flexibility to use Apache Polaris™ or other options. Furthermore, Fivetran will automatically delete data snapshots beyond your specified retention duration and clean up files orphaned due to unsuccessful data operations. All of this happens automatically after minimal configuration.

Do other vendors fully manage their Delta and Iceberg pipelines? Generally, no. Customers could potentially add some of these features programmatically, but many are not offered as standard functionality, as shown below.

Vendor	Data catalog integration	Delete old snapshots	Delete orphan files
Fivetran	Yes	Yes	Yes
Azure Data Factory	No	No	No
Qlik Replicate	No	No	No
Informatica	No	No	No
Airbyte	No	No	No
AWS DMS	Some	No	No
Stitch	No	No	No
Matillion	No	No	No

Final thoughts

More enterprises are storing increasing volumes of data in data lakes, saving money and gaining incredible flexibility for downstream use cases like business intelligence, machine learning and generative AI.

Choosing the right data integration platform can accelerate data delivery to those lakes. While there are many vendors to choose from, only Fivetran offers the features and vision to fully embrace this architecture. Some vendors, like Matillion, do not even provide data lake destinations. Other vendors deliver data to S3 or ADLS without supporting open table formats like Delta and Iceberg. Only Fivetran fully manages data lake pipelines, incorporating features like catalog integration and table maintenance.

At Fivetran, we believe Delta and Iceberg should be a foundational part of your data lake strategy. This future-proof architecture reduces cost and vendor lock-in.

For more information, contact us at sales@fivetran.com or sign up for a free 14-day trial.

Apache Polaris is a trademark of the Apache Software Foundation.

Apache Iceberg is a trademark of the Apache Software Foundation.
