The future of data lakes: Open table formats, metadata and AI

AWS’ Director of Product Management Paul Meighan unpacks the next evolution of data lakes and the critical role AI and metadata will play in driving smarter data strategies.

https://fivetran-com.s3.amazonaws.com/podcast/season1/episode14.mp3


Topics:

Data transformations

AI/ML

More about the episode

In this episode of the Fivetran Data Podcast, host Kelly Kohlleffel sits down with Paul Meighan, Director of Product Management at AWS. With over 15 years of experience in the storage industry, Meighan shares how enterprises are increasingly looking for ways to integrate more data sources in their environment — especially with data lakes.

From turning S3 buckets into databases to establishing better metadata layers, Meighan explores the rapid evolution of data lakes alongside data warehouses. He also explains the pivotal role AI, ML and GenAI workloads and applications will play in large metadata environments, driving innovative analytics and business insights.

Highlights from the conversation include:

How data lakes are transforming the way companies process data at scale
The evolution of metadata, particularly how vectors are generated for specific use cases
The future impact of Apache Iceberg REST catalog APIs

Transcript

Kelly Kohlleffel (00:00)

Hi, folks. Welcome to the Fivetran Data Podcast. I'm Kelly Kohlleffel, your host. Every other week, we will bring you insightful interviews with some of the brightest minds across the data community. We'll cover topics such as AI, machine learning, enterprise data and analytics, data culture and a lot more. Today, I'm really pleased to be joined by Paul Meighan. He's Director of Product Management at AWS.

Paul has over 15 years of product experience in the storage industry. Before joining AWS, he worked at Dell EMC, where he worked on a variety of storage and backup products. He holds an MBA and a BA in economics from the University of Washington. Paul, it is a pleasure to have you on the show. Welcome in today.

Paul Meighan (00:41)

Thanks for having me, Kelly. It's great to be here.

Kelly Kohlleffel (00:44)

Absolutely. Well, I tell you what, I am looking forward to diving deep into the topic of data lakes today. Before we do that, though, why don't you take us through a little bit more about your current role at AWS and your background.

Paul Meighan (00:58)

Sure, so today I lead product management for S3 and Glacier in AWS. I'm a huge fan of S3. I've been working on it for about six years now, since 2018. Before that, about 10 years in the storage industry at EMC, as you mentioned, working on storage and backup. Before that, I was actually here at Amazon, so I'm what you'd call a “boomerang Amazonian.” I was here from ’97 until about 2006. So I was in IT roles, swapping tapes, running around the data centers, doing stuff like that in the early days here at Amazon.

Kelly Kohlleffel (01:41)

All right. Well, very good. Well, I know they're happy to have you in the chair at Amazon Web Services today. Let's talk a little bit about data lake technology. I mean, at Fivetran we've been working with AWS, and with S3, for a while. When I think about where I first started hearing more about data lakes in my career, it kind of goes back to Hadoop. In recent years, data lakes have continued to evolve, and I'd just be interested to hear your thoughts.

Over that time, how have data lakes, how has storage technology evolved where you can do so much more today? You've got better management capabilities, more capabilities all the way around, and then what advancements on those technologies specifically at AWS have had the most significant impact in your mind?

Paul Meighan (02:30)

I mean, for us, just as an S3 person, as a storage person, the emergence of the Open Table Formats has just been an incredible thing to watch over the course of the last few years. OTFs in general, and Apache Iceberg in particular, have just gone from science projects in very sophisticated customers to kind of the plan for so many in the enterprise, right? And that's been the big thing that we've been tracking and that we've been working with customers on over, I would say, the last several years: making that transition over to OTF.

And we've had a bunch of customers that have had a huge amount of success with that and love that architecture. And, you know, we've done a lot over the years too, to try and help, and we have a lot on our roadmap to make it even better for them going forward.

Kelly Kohlleffel (03:32)

What have you heard, I guess, from customers when you talk OTF, Open Table Format? What's driven them to that more than anything else? You could say, “Hey, I can have a storage format that's proprietary, that maybe does this or does that, or I can go Open Table Format.” Why are customers choosing that?

Paul Meighan (03:51)

Well, I mean, I think the characteristics of S3 are a big reason why, right? Like, first of all, they want to run these big sort of data warehouse, data analytics workloads, and they want to do it with more and more and more data. And so they look to object storage in general, and S3 in particular, to give them the ability to process that amount of data at scale like that in a way that is economical, right?

I think that's the first fundamental driving factor: you need a storage system that kind of makes sense on the economics. And so S3 has been useful in that way of delivering that sort of end of the spectrum. And what Iceberg has done is it's come along and added the ability to sort of establish that metadata layer on top of those fundamental storage characteristics that basically turns an S3 bucket almost into a database. And suddenly, customers are enabled to now store data at S3 Glacier economics and yet query it as if it's a database.

And on top of that, they're able to do that in a way where they can bring many engines into the mix and query that same data set. And that's what's been really powerful for customers: they've been able to establish these data lakes and tee up dev teams to then go bring whatever engine or tooling makes sense for them to then go make sense and get value out of the data.

And then on top of that, there's just been a ton of innovation on the features of Iceberg itself, right? With time travel and evolving schema and like all the stuff that Iceberg sort of brings to the table on the sort of feature function level. You sort of bring all those ingredients together, and it's really kind of moving the needle for a lot of analytics shops.

Kelly Kohlleffel (06:04)

I totally agree. You've hit on so many points too. I love the way you say turning an S3 bucket into a database, essentially. I've heard it said turning a data lake into a data warehouse. I really agree with that because when you think about the early versions of data lake technology or storage technology or object stores, it was okay, let's put a lot of really interesting files in there but I had to do a lot of enrichment, a lot of processing, a lot of effort, and a lot of expertise actually to get me to an analytics ready or data product ready type of scenario with those object stores. And when you talk about immediately available in S3 to make it look like a database, this is really, really interesting. And you're saying Iceberg is the facilitator behind that.

Paul Meighan (06:56)

It is. And the fact that the creators were kind of wise enough to bring it in as a standard, I think, has also helped a lot. Right? Because I mean, the trajectory of the ecosystem has just been incredible to watch. That's also been a very powerful ingredient here: you know, just the rapid rallying of the ecosystem around the standard.

It has just made it much more useful over time.

Kelly Kohlleffel (07:25)

Yeah, I agree. There's been quite a bit of cash thrown around in that ecosystem recently too. So that's interesting. And I think it goes to, you mentioned another point, this ability to plug in many engines or any engines that support Iceberg into this S3 world and start immediately using that data. And I was taking a quick look the other day at the engines that support Apache Iceberg. And I was amazed.

You know, you'd think there's probably five, six, seven, eight. I think there were like 30, 35 different engines out there right now. These are major platforms that are saying, hey, Apache Iceberg is first class for me, I'm going to plug in today, and you can use my compute on Iceberg and S3.

Paul Meighan (08:11)

Yeah, I mean, it's that network effect, right? Like, I mean, the format's got so much momentum right now that it just makes a lot of sense for additional sort of tool sets to plug into it. And again, like if that's where all the data is and customers have kind of voted with their feet that this is the format by which that data should be accessed and updated, it just has a snowball effect in the ecosystem.

Kelly Kohlleffel (08:42)

I mean, I feel like you've got a lot of options today in that AWS world. You could say, hey, I want to keep my storage within, maybe I'm using Redshift, I want my storage to be within the Redshift world. But if I want to create this data lake effect within S3 and plug in Redshift or plug in other AWS services, I've got that option as well. It may depend on the workload or the use case that I'm trying to deliver within my organization.

Paul Meighan (09:10)

Sure. Yeah. Yeah. I mean, there are a bunch of AWS services that can plug in and sit on top. There are also cases where you can run Redshift natively within the warehouse, and then you get a bunch of acceleration there, you know, with all the optimization that Redshift has done over the years to accelerate queries. You know, but customers have that choice, a large range of choices when they store their data in S3 for sure.

Kelly Kohlleffel (09:39)

Yeah, agree. Okay, and then the third thing I heard you mention, you were talking about innovation, Paul, and you mentioned some of the capabilities like time travel. Can you talk about those just a little bit and the impact that those innovations have had on customers’ uptake of not only S3, but also Iceberg within S3?

Paul Meighan (10:00)

You know, it's funny, because we started this talk by kind of talking about my early days in IT. If you think back to the late nineties, if you're old enough to think back that far, you know, there were databases being stood up to go run the internet. And as enterprise customers went and adopted those, you know, they had much more sophisticated requirements, right? They had disaster recovery requirements. They had more compliance requirements. There's just a longer list of things that you've got to do if you want your technology to be deployable by enterprise shops, right?

And so for me, when you look at kind of some of those innovations that have come along, like time travel, the ability to snapshot tables and some other improvements that have been made over the years, you start to see the outlines of a similar trajectory for Iceberg, where early adopters are in there, very sophisticated customers with large dev teams, and they're able to deliver on the requirements that they need in code. And then you have this fatter part of the adoption curve that comes in with enterprise customers, and they need, you know, they need to be able to roll back to a known good point in time.

They need to be able to prove that previous snapshots are good. They need sophisticated catalog requirements in order to enforce governance across these critical tables. And so for me, that's kind of like, if I kind of blur my eyes and look at where we are with this right now, that's kind of what I see as the next leg of the race here is that charge into the fat part of the adoption curve, where enterprise requirements that we all know and love from whatever mature product you've ever worked on kind of come to this space. And it's kind of an interesting trajectory to think about.
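As a rough illustration of the time travel, table snapshot and rollback ideas mentioned above, here is a minimal toy sketch in Python. It is not Iceberg's actual metadata format or API; the names `Snapshot`, `ToyTable`, `as_of` and `rollback_to`, and the file names, are hypothetical and chosen only for this example. The assumption it relies on is simply that every commit records an immutable snapshot listing the data files visible at that point, so reading "as of" a time or rolling back only moves a pointer.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """An immutable record of which data files made up the table at commit time."""
    snapshot_id: int
    timestamp_ms: int
    data_files: Tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical table: a log of snapshots plus a pointer to the current one."""
    snapshots: List[Snapshot] = field(default_factory=list)
    current_id: Optional[int] = None

    def commit(self, timestamp_ms: int, data_files: List[str]) -> Snapshot:
        # Each commit appends a new snapshot; earlier snapshots are never modified.
        snap = Snapshot(len(self.snapshots) + 1, timestamp_ms, tuple(data_files))
        self.snapshots.append(snap)
        self.current_id = snap.snapshot_id
        return snap

    def current(self) -> Snapshot:
        return next(s for s in self.snapshots if s.snapshot_id == self.current_id)

    def as_of(self, timestamp_ms: int) -> Snapshot:
        # Time travel: the latest snapshot committed at or before the given time.
        candidates = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        if not candidates:
            raise ValueError("no snapshot at or before that time")
        return max(candidates, key=lambda s: s.timestamp_ms)

    def rollback_to(self, snapshot_id: int) -> None:
        # Rollback: repoint the table at an earlier snapshot; no data is rewritten.
        if not any(s.snapshot_id == snapshot_id for s in self.snapshots):
            raise ValueError("unknown snapshot id")
        self.current_id = snapshot_id


table = ToyTable()
table.commit(1000, ["sales-1.parquet"])
table.commit(2000, ["sales-1.parquet", "sales-2.parquet"])
table.commit(3000, ["sales-1.parquet", "sales-2.parquet", "bad-load.parquet"])

earlier = table.as_of(2500)   # what the table looked like at t=2500
table.rollback_to(2)          # undo the bad load by repointing to snapshot 2
```

In this sketch, rolling back leaves the third snapshot in the log, so the earlier state stays inspectable for audit purposes.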

Kelly Kohlleffel (12:03)

Yeah, I agree. And the more that you can bring those enterprise capabilities to this data lake world, then you start going, okay, data lake, data warehouse: do I need both? Do I need one? How would you choose one over the other? Are there specific use cases or workloads where you go, okay, this one is more suited today versus the other? How would you make those decisions?

Paul Meighan (12:29)

I mean, I think a big part of it kind of comes down to kind of who you have on the team and where you are kind of skills wise and where you're starting from, right? Like there is, like I said before, there's a ton of optimization in modern data warehouses today that allow for query performance that you're just not going to get out of the box on many data lakes.

That said, if you have a team that understands Iceberg, how to optimize these tables, if you have certain patterns that you want to go integrate, if you have a wide range of clients that you want to tee up and get going across your tables, you know, there are plenty of reasons why you may want to go with, you know, Parquet and S3 with an OTF on top of it. It's just one of those, you know, choosing one or the other just has so many variables that come into decisions like that. What I find when I talk to customers is that a lot of times it really boils down to what is the team that I have? What are they comfortable with? You know, how much time do we have to write software versus other options, other things that are on my roadmap to go work on.

Kelly Kohlleffel (14:00)

Yeah, agree. Skills play into it a lot, as well as the workloads. I think you certainly have an interesting play with data lake technology and something like S3 where maybe you're going after GenAI use cases or just AI and ML use cases and you need a wider variety of data. I've always thought about data warehouses as predominantly structured data sets, maybe some semi-structured. But if I'm thinking, okay, I need some unstructured, semi-structured and structured data, it seems like a data lake and S3 would be a really nice place to organize that, keep it in and, as you said, make it data product ready or analytics ready.

Paul Meighan (14:45)

I think that data lakes and data warehouses will both evolve down parallel tracks. There's a ton of time and effort and innovation that's gone into the data warehouses over the years. When it comes to just straight-up performance against structured data, they are tough to beat.

And that's going to be the case for the foreseeable future. I think, you know, customers are going to look for more and more ways to sort of bridge all the data across, you know, every data source in their environment, across data warehouse, relational, data lake. I mean, customers are going to continue to look for ways to bridge and join across all these data sources.

And you've seen us implement a bunch of capabilities and features within AWS to help make that possible. Things like zero-ETL and just a bunch of capabilities to help customers reason about and make the best possible use of all these various data types in the environment. But we're never going to get to one data type to rule them all.

Kelly Kohlleffel (16:07)

Yeah, that's great. Love that. You talked a little bit earlier, Paul, about the metadata layer, the importance of that and what that brings. Can you comment on that a little bit more around the concept of a governed data lake and what metadata management means and really how that all comes together to not only provide this ability to really leverage a data lake in a unique way, but also maybe simplify to a degree the data architecture? Get to that point, like you said, help turn S3 into a database.

Paul Meighan (16:45)

You know, I think the metadata, the emergence of metadata here, I think is like super interesting. And I think we're just kind of starting in on metadata as sort of a first class component of, you know, a customer's overall data store, data strategy. I mean, it's always been important, but as we go forward into this world where core data sets are shared across many applications, and I want to do more things than just query these data sets. I want to train on them. I want to search across them. I want to do things like semantic search across them. You can kind of envision a world where there are just a bunch of different consumption models into a single data set, right? Where I want to use, I want to consume data in a certain way for a certain app and in a different way for another app.

And so, you know, Iceberg, or OTF in general, is sort of a great example of that, where I have my underlying actual, whatever, you know, point of sale data sitting in Parquet in storage. And then I have this optimized metadata layer on top of it that's built to go run fast queries to enable reporting and data warehouse workloads and data lake workloads on top.
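To make the idea of a metadata layer sitting on top of data files a little more concrete, here is a minimal toy sketch in Python. It assumes, for illustration only, that the metadata keeps a minimum and maximum value of one column for each data file, so a query with a date filter can skip files that cannot contain matching rows. The names `FileEntry` and `files_to_scan`, the `sale_date` column and the file paths are all hypothetical; this is not the actual layout or API of Iceberg or any other table format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical metadata entry: one data file plus the value range of a column."""
    path: str
    min_sale_date: str  # ISO-8601 dates compare correctly as plain strings
    max_sale_date: str


def files_to_scan(manifest: List[FileEntry], start: str, end: str) -> List[str]:
    """Return only the files whose date range overlaps [start, end].

    The decision is made from metadata alone; no data file is opened.
    """
    return [
        entry.path
        for entry in manifest
        if entry.max_sale_date >= start and entry.min_sale_date <= end
    ]


manifest = [
    FileEntry("pos/2024-01.parquet", "2024-01-01", "2024-01-31"),
    FileEntry("pos/2024-02.parquet", "2024-02-01", "2024-02-29"),
    FileEntry("pos/2024-03.parquet", "2024-03-01", "2024-03-31"),
]

# A query for sales between 10 Feb and 5 Mar only needs two of the three files.
selected = files_to_scan(manifest, "2024-02-10", "2024-03-05")
```

The design point of the sketch is that the query engine consults a small amount of metadata first and reads from storage only what the filter requires.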

But I think we'll find that more metadata will end up being laid down on top of these core data sets to go enable new and additional use cases on top of our data.

And so it's something that we're looking at very closely. We're very interested in how metadata evolves. We have a lot of customers who run big metadata systems kind of on top of S3, sort of outside of just the OTF use case. And it's something that we're paying super close attention to, you know, going forward, especially as AI becomes more and more productized in customer environments.

Kelly Kohlleffel (19:05)

Are you seeing those large metadata environments being used more frequently in AI, ML, GenAI types of workloads and applications?

Paul Meighan (19:14)

Sure, not so much directly in the Iceberg case, but there's a whole new sort of flavor of metadata emerging in terms of how vectors are generated to feed those use cases, also spinning off of underlying data sets in object storage. And so that feels like a similar track in the AI space as to where we were a number of years ago when we were first starting to see these very robust metadata layers spin up for analytics, right? And so I think if we just kind of look back on the analytics side and look at how far those workloads have come over the last several years, based on more and more robust metadata layers sitting on top of them and enabling the apps above the actual storage or actual data itself, you don't have to squint too hard to look over at these emerging apps and start to think about how much of an impact innovation on the metadata side could have on them as well.

Kelly Kohlleffel (20:23)

We've hit on a lot of topics. I'm very interested in maybe a prediction for next year, a prediction for 2025. If you're looking ahead, do you have anything you want to throw out as maybe a trend or development in the data space that you want to put out there for 2025 as your one to watch?

Paul Meighan (20:47)

For 2025, you know, I mean, just to stay on analytics, I think, you know, I'm a big fan of the Iceberg REST catalog API. I think that REST catalogs are going to have a big impact.

It's a very clean, very nice API. It brings forward, I think, a standard on the catalog side that I believe that customers are going to rally around. A bunch of customers that I talk to are looking for their way to get to that standard. And I think that will go unblock another wave of innovation at the catalog layer for customers. I think you'll see once customers are there and that thing really starts to evolve, you're going to see a bunch of innovation at the catalog layer that again is going to pour into the ecosystem. I think it's going to be pretty cool for customers.

Kelly Kohlleffel (21:43)

I love that, actually. There are some interesting partnerships and announcements going on right now in this catalog space. I was just talking to somebody. I'm a big fan of Prukalpa over at Atlan. She's one of the co-founders of that governance and data catalog solution. And just earlier this month, she put out, in her Metadata Weekly, something called the war of the catalogs. And she broke down these different catalog types.

And as you were talking, I was thinking, wow, this really resonates with me. She was talking about technical catalogs, embedded catalogs, universal catalogs, and where this is all going. Very similar thoughts to say, hey, this is just really the beginning here and it's an area that you should keep your eye on. So, I'm in total agreement with you. This is one that I think we should all be pressing into right now to see where it's going and to see how we can better leverage it within our data programs.

Paul Meighan (22:41)

Definitely going to be an interesting 2025 on the catalog side.

Kelly Kohlleffel (22:46)

Yeah, absolutely. Absolutely. Well, Paul, I have really, really appreciated this. Thank you so much for joining the show today. Just happy to have you on.

Paul Meighan (22:56)

Yeah, well, I appreciate being here, man. Thanks for having me, Kelly.

Kelly Kohlleffel (23:00)

Yeah, it was fun. I look forward to keeping up with everything you're doing at AWS. I'll stay really close to that, and I'm happy to have you on in the future as well, anytime there's some new development in this space that you're working in. Love to hear the stories. Huge thank you to everyone who listened in as well. We really appreciate each one of you. We'd encourage you to subscribe to the podcast on any of the major platforms: Spotify, Apple, Google.

You can find us on YouTube or visit us at fivetran.com/podcast. And you can also send us any feedback or comments. You can hit us up at podcast@fivetran.com. We would absolutely love to hear from you. Thanks so much. See you soon. Take care.
