The choice between a data lake and a data warehouse often depends on the specific needs of your organization, including the type of data being managed, your performance requirements and your team's expertise. However, as data lakes grow more capable, that calculus may change quickly.
To set the stage for this shift, Paul Meighan, Director of Product Management at AWS, spoke with Fivetran’s Kelly Kohlleffel to discuss how data lakes, advancements in metadata capabilities and emerging AI workloads are reshaping data management.
Turning S3 buckets into databases
The emergence of open table formats (OTFs), especially Apache Iceberg, is revolutionizing how large datasets are managed and queried. These formats allow organizations to treat massive data stores, such as those in S3, like databases, enabling efficient data processing at scale.
Meighan explains, “The first fundamental driving factor is that you need a storage system that kind of makes sense on the economics.” As companies manage more data, they need to scale efficiently. This is where Iceberg and S3 excel — allowing data to be stored in cost-effective systems like S3 Glacier while still being accessible as if it were stored in a traditional database.
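To make this concrete, here is a minimal sketch of how an engine can query an Iceberg table sitting in S3 much like a database table, using the open-source pyiceberg client. The catalog endpoint, warehouse bucket and table name are hypothetical stand-ins:

```python
# Minimal sketch: reading an Iceberg table stored in S3 with pyiceberg.
# The catalog URI, warehouse bucket and table name are hypothetical.
from pyiceberg.catalog import load_catalog

# Connect to a REST catalog that tracks the table's Iceberg metadata
catalog = load_catalog(
    "demo",
    **{
        "uri": "https://iceberg-catalog.example.com",
        "warehouse": "s3://example-bucket/warehouse",
    },
)

# Load the table and run a filtered scan; only the matching data files
# in S3 are read, so the bucket behaves like a database table
table = catalog.load_table("analytics.events")
events = table.scan(row_filter="event_date >= '2024-01-01'").to_arrow()
print(events.num_rows)
```

Because Iceberg's metadata tracks which files hold which data, the scan can skip everything outside the filter, which is what makes cheap object storage viable as a query target.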
This approach lets organizations unlock the value of their data while empowering development teams to build the tools and engines needed to extract insights.
The evolution of data lakes
For years, data warehouses were the go-to solution for handling structured data due to their unmatched performance. However, data lakes have evolved significantly, offering improved storage and management capabilities that make them competitive with data warehouses.
Today’s data lakes, powered by technologies like Iceberg, can support a wide range of analytics engines. This ability to efficiently manage structured, semi-structured and unstructured data gives data lakes a significant edge, making them ideal for businesses handling a variety of data types.
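As one illustration of that engine flexibility, a second engine can read the same table directly. The sketch below, reusing the hypothetical table location from the earlier example, points DuckDB's Iceberg extension at it without copying or exporting any data:

```python
# Minimal sketch: a second engine (DuckDB) reading the same hypothetical
# Iceberg table in S3, with no export or copy step.
import duckdb

con = duckdb.connect()
con.sql("INSTALL iceberg; LOAD iceberg;")  # Iceberg reader extension
con.sql("INSTALL httpfs; LOAD httpfs;")    # S3 access

# Scan the table's location in S3 directly
count = con.sql(
    "SELECT count(*) FROM iceberg_scan('s3://example-bucket/warehouse/analytics/events')"
).fetchone()
print(count)
```

Because the table format is open, Spark, Trino, DuckDB and warehouse engines can all operate on one copy of the data, which is the interoperability advantage the section describes.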
The rise of metadata
As data lakes evolve, effective metadata management has become critical. Metadata layers like Iceberg add structure to raw data, providing functionalities like time travel, schema evolution and enhanced querying capabilities. This makes modern data lakes more manageable and functional, while offering insights similar to traditional databases.
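A short sketch of two of those metadata-layer features, time travel and schema evolution, again using pyiceberg and the hypothetical table from the earlier example:

```python
# Minimal sketch: time travel and schema evolution on an Iceberg table.
# Assumes the hypothetical "demo" catalog and "analytics.events" table above.
from pyiceberg.catalog import load_catalog
from pyiceberg.types import StringType

catalog = load_catalog("demo")
table = catalog.load_table("analytics.events")

# Time travel: list the table's snapshots, then scan it as of an older one
for entry in table.history():
    print(entry.snapshot_id, entry.timestamp_ms)
oldest_snapshot_id = table.history()[0].snapshot_id
old_events = table.scan(snapshot_id=oldest_snapshot_id).to_arrow()

# Schema evolution: add a column as a metadata-only change,
# without rewriting any existing data files
with table.update_schema() as update:
    update.add_column("referrer", StringType())
```

Both operations work by rewriting small metadata files rather than the data itself, which is why they are cheap even on very large tables.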
Metadata management also plays a crucial role in AI, ML and GenAI applications. As Meighan explains, a new “flavor” of metadata is emerging, driven by the need to generate vectors for specific AI use cases. These developments will significantly impact how organizations implement AI at scale.
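What that looks like in practice is still taking shape, but one plausible pattern is embedding catalog descriptions as vectors so AI workloads can search them semantically. The sketch below is an illustration only, using the open-source sentence-transformers library with made-up table names and descriptions:

```python
# Illustrative sketch: turning catalog-style metadata into vectors.
# The model choice and the descriptions dict are assumptions, not a
# specific vendor's implementation.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Catalog-style metadata: a short natural-language description per table
descriptions = {
    "analytics.events": "Clickstream events from the web application",
    "analytics.orders": "Completed customer orders with line items",
}

# Embed each description; the vectors become searchable metadata that a
# GenAI tool can use to find the right table for a question
vectors = {name: model.encode(text) for name, text in descriptions.items()}
print({name: vec.shape for name, vec in vectors.items()})
```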
Metadata will be at the core of AI and LLM applications in the future, a point Prukalpa Sankar, co-founder of Atlan, highlighted in a recent article. She shared how three types of catalogs use metadata, and how each might evolve:
- The technical catalog: Providing comprehensive metadata and context for a single data source or tool, technical catalogs solve the problem of context by exposing metadata on everything from data assets to tags, policies and lineage. Eventually, metadata will even become a sort of single sign-on for data ecosystems.
- The embedded catalog: For data users who need to know whether their data is up to date, trustworthy and timely, embedded catalogs surface metadata that helps users understand and trust their data assets.
- The universal catalog: Organizations with multiple catalogs or tools can bring their metadata together into a universal catalog, also called the “catalog of catalogs.” This eliminates the need for a separate catalog per data source or tool, letting connected catalogs surface more comprehensive insights.
How organizations use metadata will continue to shift, but those who stay on top of trends, like new variations of catalogs, can stay ahead of the competition and future-proof their insights.
[CTA_MODULE]