AWS on the future of data lakes, metadata and AI innovation

The Director of Product Management at AWS sat down with Fivetran to unpack how Iceberg and metadata are reshaping the future of data lakes, data warehouses and AI.
October 17, 2024

The choice between using a data lake or a data warehouse often depends on the specific needs of your organization, including the type of data being managed, the required performance expectations and your team's expertise. However, with the rising capabilities of data lakes, things may swiftly change.

To set the stage for this shift, Paul Meighan, Director of Product Management at AWS, spoke with Fivetran’s Kelly Kohlleffel to discuss how data lakes, advancements in metadata capabilities and emerging AI workloads are reshaping data management. 

Turning S3 buckets into databases

The emergence of open table formats (OTFs), especially Apache Iceberg, is revolutionizing how large datasets are managed and queried. These formats allow organizations to treat massive data stores, such as those in S3, like databases, enabling efficient data processing at scale.

Meighan explains, “The first fundamental driving factor is that you need a storage system that kind of makes sense on the economics.” As companies manage more data, they need to scale efficiently. This is where Iceberg and S3 excel — allowing data to be stored in cost-effective systems like S3 Glacier while still being accessible as if it were stored in a traditional database.

This approach gives organizations the ability to leverage the value of their data while also empowering development teams to build the tools and engines needed to extract insights. 

The evolution of data lakes 

For years, data warehouses were the go-to solution for handling structured data due to their unmatched performance. However, data lakes have evolved significantly, offering improved storage and management capabilities that make them competitive with data warehouses.

Today’s data lakes, powered by technologies like Iceberg, can support a wide range of analytics engines. This ability to efficiently manage structured, semi-structured and unstructured data gives data lakes a significant edge, making them ideal for businesses handling a variety of data types. 

The rise of metadata

As data lakes evolve, effective metadata management has become critical. Metadata layers like Iceberg's add structure to raw data, enabling time travel, schema evolution and richer querying. This makes modern data lakes more manageable and functional, with capabilities similar to those of traditional databases.
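
To make the time-travel idea concrete, here is a highly simplified sketch in Python of how a table format can track versions purely through metadata. This is a toy model, not the Iceberg API: the class names, fields and S3 paths are invented for illustration, and real Iceberg metadata is far richer (manifests, partition specs, schemas).

```python
from dataclasses import dataclass, field

# Toy model of table-format metadata: the table tracks immutable snapshots,
# and "time travel" is simply reading the table through an older snapshot.
@dataclass
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    data_files: list  # paths of the files that make up this version

@dataclass
class TableMetadata:
    snapshots: list = field(default_factory=list)

    def append(self, snapshot_id, timestamp_ms, new_files):
        # An append creates a new snapshot listing the prior files plus the
        # new ones; old snapshots stay readable because data files in object
        # storage are never rewritten in place.
        current = self.snapshots[-1].data_files if self.snapshots else []
        self.snapshots.append(
            Snapshot(snapshot_id, timestamp_ms, current + list(new_files))
        )

    def as_of(self, timestamp_ms):
        # Time travel: return the latest snapshot at or before the given time.
        eligible = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        return eligible[-1] if eligible else None

table = TableMetadata()
table.append(1, 1000, ["s3://bucket/data/file-a.parquet"])
table.append(2, 2000, ["s3://bucket/data/file-b.parquet"])

print(table.as_of(1500).data_files)  # only file-a: the table as it was at t=1500
print(table.as_of(2500).data_files)  # file-a and file-b: the current table
```

The key design point this illustrates is that versioning lives entirely in a small metadata structure, so cheap, immutable object storage can still offer database-like reads of any historical state.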

Metadata management also plays a crucial role in AI, ML and GenAI applications. As Meighan explains, a new “flavor” of metadata is emerging, driven by the need to generate vectors for specific AI use cases. These developments will significantly impact how organizations implement AI at scale.

Metadata will be at the core of AI and LLM applications in the future, a point Prukalpa Sankar, co-founder of Atlan, highlighted in a recent article. She described three types of catalogs that use metadata and how each might evolve: 

  • The technical catalog: Providing comprehensive metadata and context for a single data source or tool, technical catalogs solve the problem of context by exposing metadata on everything from data assets to tags to policies and lineage. Eventually, metadata will even become a sort of single sign-on for data ecosystems.
  • The embedded catalog: For data users who need to know whether their data is updated, trustworthy and timely, embedded catalogs surface metadata that helps users understand and trust their data assets.
  • The universal catalog: Organizations with multiple catalogs or tools can bring their metadata together into a universal catalog, also called the “catalog of catalogs.” This eliminates the need for a separate catalog for each data source or tool, connecting catalogs and surfacing more comprehensive insights. 
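
The universal-catalog idea above can be sketched in a few lines of Python. This is only an illustration of the merge pattern, assuming each per-tool catalog exposes asset metadata as key-value entries; the catalog names, assets and fields below are all made up.

```python
# A minimal sketch of a "catalog of catalogs": merge asset metadata from
# several per-tool catalogs into one universal view.
warehouse_catalog = {
    "orders": {"owner": "analytics", "tags": ["finance"]},
}
lake_catalog = {
    "clickstream": {"owner": "platform", "tags": ["raw"]},
    "orders": {"lineage": ["s3://landing/orders/"]},
}

def build_universal_catalog(*catalogs):
    """Merge entries asset by asset; later catalogs add or extend fields."""
    universal = {}
    for catalog in catalogs:
        for asset, metadata in catalog.items():
            universal.setdefault(asset, {}).update(metadata)
    return universal

catalog = build_universal_catalog(warehouse_catalog, lake_catalog)
print(sorted(catalog))             # ['clickstream', 'orders']
print(catalog["orders"]["owner"])  # 'analytics', preserved from the warehouse catalog
```

Note how the merged "orders" entry carries both the warehouse's ownership metadata and the lake's lineage, which is exactly the cross-tool visibility a universal catalog promises.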

How organizations use metadata will continue to shift, but those who stay on top of trends, such as these new variations of catalogs, can stay ahead of the competition and future-proof their insights. 

[CTA_MODULE]

Tune in to the full episode of “The future of data lakes: Open table formats, metadata and AI”
Listen now