Today, organizations are generating and collecting more data than ever before. In the past, customers have been primarily concerned with data from business systems in the data center. However, now, and over the next several years, the most important customer and operational data will come from a variety of sources, including internal systems, third-party applications, public data sets and the Internet of Things. To make the most of this data, organizations need a way to store, manage and analyze it effectively.
While most companies have adopted data lakes as the ideal platform for storing modern data volumes due to their flexibility, scalability and cost efficiency, they have also remained committed to the data warehouse architecture for certain analytic workloads, including Business Intelligence (BI) and reporting. To leverage data from the data lake, data teams must move it into the data warehouse in proprietary formats, and often rely on data copies to meet Service-Level Agreements (SLAs) for performance. This architecture creates significant complexity and management overhead, particularly as requests for access to data in the data lake inevitably increase.
The answer to this two-tiered architecture is a data lakehouse. A data lakehouse is a modern data architecture combining the best capabilities of data lakes and data warehouses. It enables organizations to store large volumes and varieties of data types in a flexible and cost-effective manner, while also satisfying a wide range of analytics use cases, including BI and reporting.
A key component of a data lakehouse stack is the storage layer. While cloud object storage gives data teams the flexibility and scalability to store all of their data, one innovation that has been instrumental in making that data usable for analytics is table formats. Table formats structure and organize data in a way that makes it easier to work with, and provide benefits including improved query performance, data consistency and accuracy, and optimized storage.
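To make the idea concrete, a table format can be thought of as a metadata layer that records exactly which data files make up a table, along with per-file statistics that let a query engine skip files it does not need to read. The sketch below is illustrative only, not any specific table format's API; all names in it are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class DataFile:
    path: str        # location of the file in object storage (hypothetical path)
    row_count: int
    min_id: int      # column statistics used to prune files at query time
    max_id: int

@dataclass
class TableMetadata:
    """Minimal stand-in for a table format's metadata: an explicit
    list of data files plus per-file statistics."""
    files: list = field(default_factory=list)

    def add_file(self, f: DataFile) -> None:
        self.files.append(f)

    def plan_scan(self, id_value: int) -> list:
        # Skip files whose statistics prove they cannot contain the value.
        return [f.path for f in self.files
                if f.min_id <= id_value <= f.max_id]

table = TableMetadata()
table.add_file(DataFile("s3://lake/orders/part-0.parquet", 1000, 1, 1000))
table.add_file(DataFile("s3://lake/orders/part-1.parquet", 1000, 1001, 2000))

print(table.plan_scan(1500))  # only the second file needs to be read
```

Because the engine consults this metadata instead of listing object storage, queries can be planned quickly and only the relevant files are scanned.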
Apache Iceberg tables
Apache Iceberg is an open-source table format that provides a number of benefits over alternative data lake file and table formats. Iceberg tables are designed to provide better performance, reliability and scalability, while also making it easier to manage and evolve data over time.
Iceberg is well-positioned for sustainable development and enterprise adoption, with many contributors from technology companies including Netflix, Apple, Google, AWS, Stripe, Dremio and many more. It was purpose-built to manage and deliver high-performance analytics on the largest data lakes in the world. Here are a few of the ways Iceberg helps data teams manage their data lakes:
- Flexibility: Apache Iceberg enables users to change how data is organized so that it can evolve over time without requiring rewriting of queries or rebuilding of data structures. It also supports multiple data formats and data sources, making it easy to integrate with existing systems.
- Transactional consistency: Iceberg provides transactional consistency between multiple applications where files are added, removed, or modified atomically, with full read isolation and multiple concurrent writes.
- Schema evolution: Iceberg provides full schema evolution to track changes to a table over time, enabling users to add, remove or modify columns within tables without breaking existing queries or applications.
- Time travel: Iceberg allows querying of historical data and verifying changes between updates, providing a way to track changes to data over time.
- High performance: Iceberg is a high-performance format for huge analytic tables that brings the reliability and simplicity of SQL tables to big data, while making it possible for processing engines like Dremio Sonar, Spark, Trino, Flink, Presto and more to safely work with the same tables, at the same time.
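Several of the capabilities above, atomic commits, time travel and schema evolution, fall out of one underlying idea: every change to the table produces a new immutable snapshot of its metadata. The toy model below sketches that idea in plain Python; it is a conceptual illustration with made-up names, not Iceberg's actual implementation or API:

```python
import copy

class SnapshotTable:
    """Toy model of snapshot-based table metadata: every commit creates a
    new immutable snapshot, so readers can query any historical version
    (time travel) and the schema can evolve without rewriting data."""

    def __init__(self, schema):
        self.snapshots = [{"schema": list(schema), "rows": []}]

    def commit(self, new_rows=None, add_column=None):
        snap = copy.deepcopy(self.snapshots[-1])
        if add_column:
            # Schema evolution is a metadata-only change; existing
            # data files are untouched.
            snap["schema"].append(add_column)
        if new_rows:
            snap["rows"].extend(new_rows)
        # Appending the finished snapshot is the atomic step: readers
        # see the old version or the new one, never a partial write.
        self.snapshots.append(snap)

    def read(self, snapshot_id=-1):
        snap = self.snapshots[snapshot_id]
        return snap["schema"], snap["rows"]

t = SnapshotTable(["id", "name"])
t.commit(new_rows=[(1, "a"), (2, "b")])
t.commit(add_column="email")

print(t.read())    # latest snapshot: schema includes "email"
print(t.read(1))   # time travel: the snapshot before the column was added
```

In real Iceberg tables the snapshots point to manifest files in object storage rather than in-memory lists, but the reader-facing guarantees are the same: consistent reads, queryable history and non-breaking schema changes.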
Dremio & Fivetran
Dremio and Fivetran help data teams build data lakehouses with Apache Iceberg tables.
Fivetran automates the process of bringing data from a variety of sources into cloud data lake storage. It provides pre-built connectors for a wide range of data sources, including databases, SaaS applications and cloud storage services, along with automatic schema detection and mapping, making it easy to bring in new data sources without manually configuring mappings. Fivetran also makes it easy to build Iceberg tables, so data teams can take advantage of Iceberg's data management capabilities.
Use Dremio to access and query your Iceberg tables
Dremio is an open data lakehouse that enables self-service access to your Iceberg tables with sub-second query performance. Dremio’s semantic layer is the key to its self-service capabilities. With a centralized semantic layer, technical and non-technical data consumers can easily find, join and query data in the data lake and other data sources.
Dremio’s query engine is based on Apache Arrow, an in-memory columnar format designed for interactive analytics. It layers on query acceleration technology to ensure the fastest BI performance, even for complex queries on large datasets. Dremio connects to the most popular BI tools, so data consumers can work with data using the platform of their choice.
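The benefit of Arrow's columnar layout for analytics can be illustrated in a few lines of plain Python. This is a conceptual sketch, not Arrow's actual memory format: the point is that storing each column as a contiguous array lets an aggregation touch only the data it needs:

```python
# Row-oriented layout: one record at a time.
rows = [
    {"order_id": 1, "amount": 10.0, "region": "EU"},
    {"order_id": 2, "amount": 25.0, "region": "US"},
    {"order_id": 3, "amount": 40.0, "region": "EU"},
]

# Columnar layout (the idea behind Arrow): each column is a
# contiguous array, stored and scanned independently.
columns = {
    "order_id": [1, 2, 3],
    "amount": [10.0, 25.0, 40.0],
    "region": ["EU", "US", "EU"],
}

# Summing "amount" scans one contiguous array instead of walking
# every record and extracting a single field per row.
total = sum(columns["amount"])
print(total)  # 75.0
```

On real hardware, the contiguous column also benefits from CPU cache locality and vectorized execution, which is what makes in-memory columnar engines well suited to interactive BI workloads.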
In summary, Fivetran provides an easy way to get data from a range of sources into Iceberg tables, where data teams can take advantage of the easy data management and optimization features of an open table format that is purpose-built for enterprise datasets. Dremio delivers self-service data access and the fastest BI performance so data consumers can quickly get insights from their data.
Dremio and Fivetran recently presented a webinar together, where we discussed the value of open table formats, and how users can easily get started with an open data lakehouse built on Apache Iceberg. For more information, view the on-demand webinar here.
You can also get started with Dremio Cloud on AWS for free here. All you pay for are AWS resources.