The data landscape is evolving rapidly, with organizations increasingly turning toward data lakes and data lakehouses to manage and analyze their ever-growing volumes of data. There are many reasons for the rising adoption of data lakes and data lakehouses, including:
- Flexibility compared to traditional data warehouses
- The rise of low-cost object storage
- Suitability for storing multiple data types
- The creation of modern table formats, including Apache Iceberg
- The desire for improved performance at scale
Taken together, these trends show that the separation of storage and compute has laid the groundwork for the next generation of data architectures based around the data lake, particularly the data lakehouse.
With this increasing momentum, Apache Iceberg is quickly emerging as the new industry standard for data lakehouse adoption. This powerful open table format helps organizations apply data warehouse-like logic to the data lake, including the addition of update, delete and merge features. At the same time, Iceberg includes other notable functionality such as schema evolution, partition evolution and time travel.
As these innovations take shape, a new data stack is emerging around data lakes and data lakehouses, bringing with it a new set of tools and practices. In this blog, we'll explore two of the top tools for building a data lake, demonstrating how integrating Starburst and Fivetran together with Apache Iceberg can create an end-to-end solution for your data analytics.
The rise of data lakes
Data lakes promote openness
With the rise of cloud object storage, the traditional data warehouse model is giving way to more flexible data lake solutions. This shift brings significant technological, financial and organizational changes. Data lakes offer increased performance and lower costs compared to data warehouses while also introducing a transformative perspective for engineers in the modern data era. Today, many data engineers aim to build or modernize data architectures based on strong developer principles, such as preventing vendor lock-in, embracing the separation of storage and compute and adopting an open mindset. While these themes are not new, they are now central to the principles of an open data stack in this emerging data ecosystem.
Data lakes promote interoperability
The rise of data lakes and data lakehouses emphasizes modularity and interoperability. Starburst often calls this concept optionality: the ability of organizations to choose the different components in their data stack. Data lakes support this shift, allowing organizations to scale storage and compute resources independently, resulting in more efficient data processing and significant cost savings.
Adopting Apache Iceberg takes optionality to a new level. Iceberg enables multiple engines to interact with the same tables, creating a standardized method for storing all your data. This innovation enhances interoperability, allowing you to use Fivetran for data ingestion and Starburst for compute, all while leveraging Apache Iceberg as your table format within the data lake.
The data lakehouse: Ensuring end-to-end data lake analytics
Schema-on-write vs schema-on-read
Data lakes have emerged as a game-changer for organizations dealing with diverse, voluminous data. Part of the way that data lakes achieve this is through a different approach to schemas. Unlike traditional data warehouses, which enforce a rigid schema-on-write, data lakes support schema-on-read, allowing for more agile data management. This flexibility is crucial in today's data-driven world where the volume, velocity and variety of data are ever-increasing.
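To make the contrast concrete, here is a minimal Python sketch. It is a toy illustration rather than how any warehouse or lake engine is implemented, and the schema, column names and records are all assumptions made up for this example. The schema-on-write path rejects a record that does not match the declared schema, while the schema-on-read path stores raw JSON as-is and applies structure only when the data is queried.

```python
import json

# Hypothetical schema for illustration: column name -> required Python type.
SCHEMA = {"trip_id": int, "city": str, "fare": float}


def write_with_schema(table, record):
    """Schema-on-write: validate before storing and reject anything that does not fit."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(record)}")
    for column, expected in SCHEMA.items():
        if not isinstance(record[column], expected):
            raise ValueError(f"{column} must be {expected.__name__}")
    table.append(record)


def read_with_schema(raw_lines, columns):
    """Schema-on-read: raw text is stored untouched; columns are projected at query time."""
    for line in raw_lines:
        record = json.loads(line)
        # Missing columns come back as None instead of failing the read.
        yield {column: record.get(column) for column in columns}


warehouse = []
write_with_schema(warehouse, {"trip_id": 1, "city": "Lima", "fare": 12.5})

# The lake accepts records of any shape, including a field the schema never declared.
lake = [
    '{"trip_id": 1, "city": "Lima", "fare": 12.5}',
    '{"trip_id": 2, "city": "Quito", "surge": true}',
]
rows = list(read_with_schema(lake, ["trip_id", "fare"]))
```

The second lake record would be rejected by `write_with_schema`, yet it lands in the lake without complaint and simply yields a missing `fare` at read time. That is the flexibility described above, and it is also where data quality problems can creep in.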
Avoiding data swamps
However, this shift towards schema-on-read comes with its own set of challenges. As organizations migrate to data lakes, they encounter new complexities around data quality, accessibility and security. The situation is even more complex in today’s age of modern table formats, leading to potential format incompatibility and increased time to insight.
Without proper management, data lakes can easily turn into data swamps, leading to inaccessible, poor-quality data and impaired data visibility. In many ways, this is not a surprising outcome and is another example of the popular phrase in computing “garbage in, garbage out.” Data quality deserves the same care as code quality, and it is a concern from the moment data is ingested into the data lake in its raw form from a variety of sources.
Standardize your data lake with Apache Iceberg
Today, many data lakes still use Apache Hive, which was not designed with modern object storage in mind. This leads to performance issues and rising costs as your data grows. Created at Netflix as a table format for the query engine Trino, Starburst’s underlying engine, Apache Iceberg was open sourced in 2018 and became a top-level Apache project in 2020. Since then, running multiple table formats within the same data lake has become a growing concern for data lake users. Without a table format plan, users end up moving back and forth between formats, which is painful and time consuming.
For example, imagine that you want to use Iceberg on your data lake, but the data is ingested in a different format. The obvious solution is migration, but migrating from Hive or another table format to Apache Iceberg requires dedicated time and resources from data engineers that could be spent on new development projects. Fortunately, both Fivetran and Starburst solve this problem. They anchor their core capabilities around Apache Iceberg, contributing toward an open, modular and interoperable architecture. Fivetran’s Managed Data Lake Service allows the data to be ingested as Iceberg tables. This eliminates the need for table format migration. After that, Starburst can be used as the engine to power your data transformations, create data products for downstream consumers, and secure access to your properly governed data in the lakehouse. This represents a full, end-to-end solution, from ingestion to insight.
How Starburst and Fivetran make a complete data stack
Fivetran and Starburst provide complementary solutions that work together to address many of the challenges associated with data lakes. Fivetran specializes in data ingestion, making it easy to consolidate data from various sources into your data lake in a query-ready Iceberg table format. Starburst, on the other hand, excels in providing fast, petabyte-scale analytics perfect for performing interactive analytics or data transformations within the lake. Together, Fivetran and Starburst enable a comprehensive solution for data management within your data lakehouse. Fivetran handles the extraction and loading of raw data into the lake, while Starburst performs the critical last-mile data transformations and analytics.
Ingesting data into the lake with Fivetran
Fivetran’s Managed Data Lake Service takes care of low-level data management tasks when moving structured and unstructured data to your data lake. Fivetran automatically normalizes, compacts and deduplicates your data, standardizing it into an open table format. It can also block or hash sensitive data before it enters the data lake, avoiding potential policy violations downstream and reducing future compliance overhead.
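For a feel of what blocking, hashing and deduplicating mean in practice, here is a small Python sketch. It is only an illustration of the idea, not Fivetran’s implementation or API: the function names, the column names and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib


def hash_value(value):
    """Replace a sensitive value with its SHA-256 hex digest."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def prepare_rows(rows, key, hashed=(), blocked=()):
    """Drop blocked columns, hash sensitive ones and keep the last row seen per key."""
    latest = {}
    for row in rows:
        clean = {}
        for column, value in row.items():
            if column in blocked:
                continue  # blocked columns never reach the lake
            clean[column] = hash_value(value) if column in hashed else value
        latest[row[key]] = clean  # a later row with the same key replaces the earlier one
    return list(latest.values())


# Hypothetical source rows; the column names are assumptions for this example.
source = [
    {"id": 1, "email": "ana@example.com", "ssn": "123-45-6789", "plan": "basic"},
    {"id": 2, "email": "luis@example.com", "ssn": "987-65-4321", "plan": "pro"},
    {"id": 1, "email": "ana@example.com", "ssn": "123-45-6789", "plan": "pro"},
]
landed = prepare_rows(source, key="id", hashed={"email"}, blocked={"ssn"})
```

Here `landed` holds one row per `id`, the `ssn` column is gone entirely, and `email` survives only as a digest, so downstream users can still join or count on it without seeing the raw value.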
Once the data has landed, Fivetran continuously manages your data lake by monitoring for changes in the source and optimizing tables for performance, including removing orphaned files and pulling (and deleting) snapshots as needed to enable reversions. Through native integrations with data catalogs including AWS Glue, users can quickly discover, access and govern key datasets from the lake, keeping your data lake clean, compliant and query-ready at all times. From there, users can query and modify the data by leveraging compatible compute engines like Starburst.
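The maintenance tasks mentioned above, snapshot cleanup and orphaned file removal, can be pictured with a simplified toy model in Python. This is not Iceberg’s actual metadata layout or Fivetran’s implementation; the `Snapshot` class, the function names and the file names are hypothetical and exist only to show the idea that a file is safe to delete once no retained snapshot references it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: frozenset  # data files that make up the table at this point in time


def expire_snapshots(snapshots, keep_last):
    """Keep only the most recent `keep_last` snapshots; older ones can no longer be reverted to."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    ordered = sorted(snapshots, key=lambda s: s.snapshot_id)
    return ordered[-keep_last:]


def orphaned_files(storage, snapshots):
    """Return files present in storage that no retained snapshot references."""
    referenced = set()
    for snapshot in snapshots:
        referenced |= snapshot.files
    return set(storage) - referenced


# Hypothetical history: snapshot 3 is a compaction that rewrote a and b into c.
history = [
    Snapshot(1, frozenset({"a.parquet"})),
    Snapshot(2, frozenset({"a.parquet", "b.parquet"})),
    Snapshot(3, frozenset({"c.parquet"})),
]
storage = {"a.parquet", "b.parquet", "c.parquet", "tmp-failed-write.parquet"}

retained = expire_snapshots(history, keep_last=2)
orphans = orphaned_files(storage, retained)
```

Keeping two snapshots means the table can still be reverted to its state before the compaction, so only the leftover from the failed write counts as orphaned. Keeping a single snapshot would free `a.parquet` and `b.parquet` as well, at the cost of losing that reversion point.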
Transforming and optimizing your data with Starburst
Powered by the open source query engine Trino, Starburst elevates your existing data lake into a fully-fledged open data lakehouse via the Icehouse, an end-to-end data lakehouse built on Trino and Iceberg. Assisting with complicated lakehouse tasks like data governance, data management and automatic capacity management, Starburst’s Icehouse enables users to perform complex data transformations and analytics within the data lake without worrying about the overhead.
Starburst enables users to customize their optimized Trino clusters for each specific workload, offering three different execution modes so you can tailor each cluster to your needs. For data transformations, data pipelining and ETL jobs, use fault-tolerant execution (FTE) mode for best results. Fault-tolerant execution mode enables the execution of complex, long-running and memory-intensive queries, ensuring reliability for critical data pipelines. If you are looking to speed up interactive analytics within your data lake, use the accelerated cluster mode, which relies on Warp Speed, Starburst’s proprietary indexing and caching layer, to automatically increase data lake performance. Once the data pipelines are hardened within the data lake, use data products to create and define curated datasets and metadata information for downstream consumers.
Fivetran and Starburst in action
As data lakes and data lakehouses continue to evolve, the need to maintain your data lake in an open and effective manner becomes critical. Leveraging Fivetran and Starburst together with Apache Iceberg solves this problem. This approach helps ensure that your data lakehouse remains reliable, cost effective (with Fivetran covering the costs of ingestion into your data lake, greatly reducing your TCO) and scalable. It increases data quality, streamlines operations and accelerates time to insight.
One example of this is Kovi, a Latin American digital rental car company. They recently built a data lakehouse using Fivetran, Starburst Galaxy and Apache Iceberg. The results were substantial, including 85% faster ad hoc queries and 55% faster ETL jobs.
Kovi’s architecture is built on AWS infrastructure, providing the scalability, flexibility and automation required for its everyday operation. Using this approach, their data stack handles over 2 billion data points each day. Because Kovi relies heavily on near real-time analytics, many of its operational needs require fast computations. For example, in one use case, they deployed in-car technology to evaluate driving behavior.
These demands led Kovi to upgrade to a data lakehouse and migrate to Apache Iceberg. After implementing a more performant lakehouse, Kovi transformed its analytics processes to enable near real-time insights. This approach resulted in significant improvements in operational efficiency, including an estimated 10-20% reduction in car maintenance downtime. Kovi can now track geographical data more effectively, which reduces the number of safety-related incidents.
To learn whether it’s time to incorporate Apache Iceberg into your own environment, visit Starburst’s migration guide.