How modern query engines make governed data lakes accessible

Fivetran CEO George Fraser and Starburst CEO Justin Borgman discuss how Starburst’s Trino query engine and open formats like Iceberg drive agile, scalable solutions for AI innovation.

https://fivetran-com.s3.amazonaws.com/podcast/season1/episode12.mp3
Topics:
Operational efficiency
Data governance
AI/ML

More about the episode

The modern data technology landscape is fiercely competitive and constantly evolving, with the query engine space experiencing significant innovation. Starburst, as the leading commercializer of the Trino query engine, stands at the forefront of this dynamic shift, offering unparalleled performance and flexibility for data-driven businesses.

To parse through the noise, George Fraser, CEO of Fivetran, and Justin Borgman, CEO of Starburst, convened to discuss essential considerations for data leaders looking to drive transformational outcomes and innovate. 

George and Justin discussed the evolution and importance of data lakes, emphasizing open formats like Iceberg, and explored the synergistic opportunities between Fivetran, Starburst and governed data lakes to enhance enterprise data capabilities. They provide insights into the practical implications, customer use cases and future considerations for data architecture — including what’s required for successful AI utilizing proprietary data. 

“Cost is king when it comes to the ever-increasing volumes of data that customers are wrestling with. This even ties into AI,” says Borgman. “The efficiency with which you do all the preceding data management and data analytics steps before getting to the AI piece becomes increasingly important.”

Highlights from their conversation include:

  • Why agile, scalable data solutions provide a competitive advantage in AI and analytics initiatives
  • How open formats like Iceberg have revolutionized modern data management for data lakes
  • Where the complexities and potential benefits of query federation make it suitable for projects beyond exploratory queries

Watch the episode

Transcript

George Fraser (00:00)

Hi, I'm George Fraser. I'm CEO and co-founder of Fivetran, and I'm joined today by Justin Borgman, CEO of Starburst. Thanks for being with me, Justin.

Justin Borgman (00:10)

Great to be here, George.

George Fraser (00:12)

Starburst is the premier commercializer of Trino, the query engine. Justin, tell me a little bit more about Starburst. What does Starburst do? How do Starburst and Fivetran work together?

Justin Borgman (00:27)

We're about six-and-a-half years old as a company, but the technology was born at Facebook a little over 10 years ago. As you mentioned, it's a query engine that allows you to run SQL analytics on data that lives in a lake and allows you to federate other data sources as well. I'm sure we'll get into that in this podcast at some point. 

One of the nice things about Trino, and therefore Starburst, is that we see a ton of customers choosing to build a lake house or a data-lake-based warehouse leveraging open formats. Rather than loading that data into a traditional data warehouse, they choose to store it in open formats like Parquet files or some of the more modern formats like Iceberg, and query that data there. The result can be a much lower-cost data warehousing solution and one that doesn't have any of the lock-in associated with traditional data warehouses. 

To answer your question about Fivetran and Starburst, I think one of the great ways that we can work together, especially with some of your more recent support of data lakes, is ingesting into S3 and open file formats. It’s something that customers need just as much as if they're running a cloud data warehouse. They need to access the many different data sources that you support, and running through that process of ingesting into the lake is a great way to build a lake house.

George Fraser (02:09)

That makes a lot of sense. I suspect we're going to talk a lot about Iceberg today. Iceberg has really changed the game of data lakes and is really the key to the fact that Fivetran and Starburst can now work together. Fivetran can ingest data into an Iceberg data lake that Starburst can read. I suspect that historically most Starburst customers were probably reading data lakes that they had constructed themselves by employing data engineering teams that wrote code. Is that fair?

Justin Borgman (02:41)

That's exactly right. The early adopters were very much DIYers who were building lake houses before they were called lake houses. They were building their own ingest pipelines into Parquet files (or were early adopters of Iceberg) and managing that themselves. At Starburst, we have an opportunity to help make that process easier and manage some of the components that might be challenging for somebody who does not have 50 people on their data engineering team. Companies that aren’t Facebook. 

George Fraser (03:16)

We say at Fivetran that our mission is to make access to data as simple and reliable as electricity. Our goal is to make the problem of centralizing data as easy as plugging into a socket. The cool thing about Iceberg is it allows us to interconnect with the whole open source community, including query engines like Trino. 

I think that's what makes it so exciting. Data lakes are becoming more accessible. They're becoming more mainstream. You don't have to have a huge engineering team in order to build a data lake. You can leverage tools like Fivetran and Starburst to put it together, even if you don't have the resources of Facebook. 

There are a lot of query engines out there. What makes Starburst different?

Justin Borgman (04:08)

First of all, its origins. It was born at the largest scale imaginable. It had to be able to handle Facebook's query volumes and hundreds of petabytes of data. It is proven at scale. It's used by some of the largest hyperscalers in the world, like Apple, LinkedIn, Netflix, Airbnb and Lyft. 

Another piece is the performance itself. It's highly tuned, highly optimized and delivers incredible performance. To your point about lakes becoming more mainstream today, I think performance has a lot to do with it. The first lake was Hadoop, 15 years ago — a lot of these concepts were really pioneered back then, but the performance was lacking. Hadoop was not really a suitable alternative to, let's say, Teradata (where I used to work). Today, that is no longer true. The gap has closed and these query engines are incredibly fast. That's another calling card.

I'd say the third piece of it is this notion of being a truly disaggregated stack, meaning that you can query any data source. Lakes tend to be a large center of gravity, and we think they should be for the goal of managing your cost of ownership. Open formats are great, so we encourage putting as much data in Iceberg as you possibly can, but there's also value in being able to federate a query across two different data sources and join two tables that live in two systems and go direct to source. What's interesting about Trino is that it does both really well.

George Fraser (05:57)

What are some examples of federated sources that customers are leveraging with Starburst or with Trino more broadly today? If I'm federating between Iceberg and something else, what might that something else be?

Justin Borgman (06:12)

Some of the most popular are other single node databases that are analytical data marts or at some level in an organization’s data architecture. For example, a MySQL database, a Postgres database or RDS on AWS are very common use cases. Another common use case is connecting a classical data warehouse, like Redshift or Snowflake, with data sitting in Iceberg in your lake and joining those two. 

What we find is even those who have largely centralized in a cloud data warehouse most likely have a lake as well, because it becomes cost prohibitive to store everything you have in a cloud data warehouse. Very often people have both and need to federate between those two. 

There are also more innovative use cases where you're querying a Kafka topic or you're querying a NoSQL database like Mongo and joining that with something that lives in the lake. 

George Fraser (07:19)

I will confess to being a query federation skeptic. My perspective on this is that query federation works in two situations. One situation is when you have very high bandwidth from the source, like a data lake. The other situation is demos. 

Query federation works great in demos because you can just write a query with a very selective predicate where account ID equals something, and that will get pushed down to the data source and it will work beautifully. When you run your real query, however, it will be like, “Come back tomorrow after I successfully page through the entire account object of your Salesforce instance.” 

I think the examples you gave, like querying operational databases such as Postgres or MySQL, are kind of on the borderline. It all depends on how much bandwidth a Postgres database has to offer a federated query engine like Trino. It depends on how much data is in that database, and it depends on how much other load is on that database, because you're going to drive a lot of I/O when you execute that federated query against it. 

If it's a very small table that you're querying that you're joining with something somewhere else, or if you have a very selective predicate that's getting pushed down, it can work. That one's kind of on the edge, but in general, my perspective on this is I think most people with these sources end up doing change data capture into their data lake simply because they don't want to put that I/O on their production database. 

Once the data sizes get large enough and the predicate pushdown trick doesn't work, which it doesn't work for many queries, it's going to become untenable. You will need to make a copy of the data. 
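The pushdown point above can be sketched in a few lines. This is an illustrative toy, not Trino's actual connector API; `ACCOUNTS` and both fetch functions are hypothetical stand-ins for a remote source like a Salesforce Account object:

```python
# Toy sketch of why predicate pushdown decides whether a federated query is
# cheap or untenable. With pushdown, the selective filter runs at the source
# and one row crosses the wire; without it, the engine must page through the
# entire object before filtering locally.

ACCOUNTS = [{"id": i, "region": "EMEA" if i % 2 else "AMER"} for i in range(100_000)]

def fetch_with_pushdown(account_id):
    # The predicate is evaluated at the source: only matching rows transfer.
    return [row for row in ACCOUNTS if row["id"] == account_id]

def fetch_without_pushdown():
    # No pushdown: every row transfers, then the engine filters in memory.
    return list(ACCOUNTS)

pushed = fetch_with_pushdown(42)                                  # 1 row moved
pulled = [r for r in fetch_without_pushdown() if r["id"] == 42]   # 100,000 rows moved
assert pushed == pulled  # same answer, wildly different I/O cost
```

Both paths return the same result; the difference is entirely in how much data the source had to serve, which is the bandwidth constraint discussed above.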

Fivetran is in the data replication business, so I suppose I'm biased, but I feel confident that query federation, other than federating two lakes, is always going to be kind of niche because the sources don't have enough I/O. Do you want to argue with me?

Justin Borgman (09:34)

Partially. Let me start with where we agree. 

I think you're absolutely right that bandwidth can be a constraint, and it's a constraint that we, from the Starburst/Trino side, have no control over. So you're absolutely right, that's an important consideration. 

I also agree (maybe the part I agree with most) with the “it depends” part. I think another factor is really the use case itself. There are great exploratory use cases, for example, where you are really just trying to answer an exploratory question, not generating a report, serving up a dashboard or something that you're going to do all the time. Rather, this is a one-off question of data that lives in three different places, and you don't want to go through the effort of creating new pipelines to move that data just to answer this one exploratory question. I think that's a great example of where federation is useful. There may be the performance sacrifice that you alluded to, but for the purposes of that exploration, that might be totally okay. 

The way we approach this with customers is we actually lead with “it depends”. We would say there are use cases for both. For performance and controlling your performance SLA, put as much of that as you can in Iceberg. I think that's where you and I completely agree: you want to do that for your dashboards, for your BI, for your reporting and for building data-driven applications (a lot of our customers build data apps on top of Starburst). You want to control those SLAs. 

However, there are also use cases where you want to go direct to source because your question is exploratory, because the freshness of the data matters (like a Kafka topic example) or you need the agility of iterating on a data application that you're building and you want to try out new patterns and new queries. 

I think the other thing that surprises people and might even surprise you, George, is that federation is faster than they expect it to be because their reference point is the traditional data virtualization vendors of a decade ago that had to push down everything. Part of the beauty of Trino is it is a massively parallel processing (MPP) architecture. That parallelism allows Trino to do joins in memory. Especially for complex joins of tables across different data sources, it's doing a lot of that in memory so it can actually execute that query faster than people might expect. 
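The in-memory join Justin describes can be sketched as a classic hash join. This is illustrative, not Trino's internals; the row sets standing in for a Postgres table and an Iceberg table are hypothetical:

```python
# Sketch of joining two tables that live in different systems: the engine
# pulls rows from each source, then performs the join in its own memory
# (build a hash table on one side, probe it with the other).

postgres_rows = [(1, "Ada"), (2, "Bob"), (3, "Cyd")]   # pretend operational DB
iceberg_rows = [(1, "EMEA"), (3, "AMER")]              # pretend lake table

# Build phase: hash the smaller side on the join key.
build = {key: region for key, region in iceberg_rows}

# Probe phase: stream the other side and emit matching rows.
joined = [(key, name, build[key]) for key, name in postgres_rows if key in build]
# joined == [(1, "Ada", "EMEA"), (3, "Cyd", "AMER")]
```

In a real MPP engine this build/probe work is parallelized across workers, which is why cross-source joins can run faster than the old push-everything-down virtualization approach.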

I'll just wrap by saying I think it depends. I think we agree and then maybe we also disagree at the same time.

George Fraser (12:29)

I think the thing that is great about federation is that it is a great user experience. You can get wired in really quickly.

The latency is not zero, because when you run a query against a SQL database, you're running against a snapshot of the past. It might be 10 milliseconds ago, but it's not zero. I hate it when people say things have zero latency — that's not how physics works. It's always something, and it's probably longer than you think, but it has a great user experience.

I sometimes describe Fivetran as trying to create the user experience of a federated system, but the implementational details are that we move the data. We do mirroring. We create an exact replica of all of your sources in your data lake, and that feels like you're querying them directly. We're always trying to drive the latency lower. We can do one minute pretty reliably now, but there's no finish line there. We're going to keep making it lower and lower every year. 
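The mirroring idea above can be sketched as applying a change stream to a replica. This is a minimal illustration of change data capture in general, not Fivetran's actual implementation; the event shape is hypothetical:

```python
# Minimal CDC sketch: replay insert/update/delete events so the replica
# converges to an exact copy of the source table.

def apply_change(replica, event):
    op, key = event["op"], event["key"]
    if op == "delete":
        replica.pop(key, None)
    else:
        # Inserts and updates are both an upsert into the mirror.
        replica[key] = event["row"]
    return replica

events = [
    {"op": "insert", "key": 1, "row": {"name": "Ada"}},
    {"op": "insert", "key": 2, "row": {"name": "Bob"}},
    {"op": "update", "key": 2, "row": {"name": "Bo"}},
    {"op": "delete", "key": 1},
]

replica = {}
for e in events:
    apply_change(replica, e)
# replica == {2: {"name": "Bo"}}: an exact mirror of the source after all changes
```

The replica lags the source only by the time it takes to capture and apply each event, which is the latency being driven lower and lower.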

We can agree that query federation has a great user experience. 

Moving on to other subjects: at its heart, Trino is a query engine. There is a bit of a renaissance going on in query engines right now. There have been a lot of new ones over the last few years. 

Databricks created Photon, which is a complete rewrite of their SQL query execution engine. Facebook created Velox as a rewrite of the query engine inside Presto, which is like Trino's cousin. They both descend from the same parent. They're siblings, Trino and Presto. 

DuckDB has been getting a lot of adoption lately. We actually use it at Fivetran in part of our data lake implementation to rewrite the data. There are more emerging ones like Polars and lots of others. 

What's going on here? Why are there all these new query engines coming up these days?

Justin Borgman (14:51)

I think it all goes back to something that you mentioned towards the beginning of the podcast, which is that the lake is getting a lot of attention these days, particularly as a really low cost place to store data.

Cost is king when it comes to the ever-increasing volumes of data that customers are wrestling with. This even ties into AI, which is the topic du jour. As you're preparing for your AI projects, you’re thinking of the resources you have to bring to bear and how much it’s going to cost. The efficiency with which you do all of the preceding data management and data analytics steps before getting to the AI piece becomes increasingly important. 

I think that's driving a lot of focus on data lakes and hence a lot of interest in better query engines to produce a data warehouse experience on a data lake. That was really the purpose of Trino in the first place. It has also been Starburst's mission to provide a data warehouse experience on a data lake. By experience, I mean the performance and functionality that you would expect from a traditional data warehouse. 

I think Databricks recognizes there's a tremendous amount of money in that market. Certainly we recognize that as well. That’s where we're all looking to challenge the cloud data warehouses with an alternative approach.

George Fraser (16:31)

Data lakes are a very interesting enabler of innovation in query engines. If you have a data lake, it is very low risk to try out a new query engine. It's a two way door. You can go back to your old one or try a different one with very little friction because everyone shares the same storage engine. That is a really interesting development for the industry. 

Data lakes have been around for a long time, but as I said earlier, I think they're finally becoming more mainstream. They're becoming more accessible to the average company because (not to toot our own horns here) vendors like us are all wiring into this common format of Iceberg. 

We’ve mentioned Iceberg a bunch of times here. For those who are listening to this but don't know, what the heck is Iceberg and why is it so important?

Justin Borgman (17:29)

It’s basically a layer on top of a Parquet file from a technical perspective. The reason it's important is that it adds the missing pieces on the functionality side of a data warehouse in the lake. I'm referring to things like updates and deletes. 

Historically, in a data lake, if you're working with Parquet files, things were append-only. You just keep appending, appending, appending. That prohibited certain types of use cases. The most classic example is a GDPR use case where, because of GDPR, you need to remove somebody from a table because they've opted out of your marketing database. In a classical data lake setup, that was extremely hard to do. You had to sort of reverse-engineer that result. Now, with a modern format like Iceberg, you can just go delete that row. Delete George Fraser, or Update George Fraser. Do not send George Fraser any mail. That’s what it's allowing. At Teradata, we would call this active data warehousing — being able to actually modify, update and delete your data. 
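The GDPR-style delete described above can be sketched with plain SQL. SQLite stands in here for an Iceberg table behind a query engine; the table and names are hypothetical, and the point is the SQL shape: row-level operations that append-only Parquet files could not express.

```python
# Sketch: the row-level DELETE and UPDATE that Iceberg adds on top of
# Parquet, shown against an in-memory SQLite table as a stand-in.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE marketing_contacts (id INTEGER, name TEXT, opted_out INTEGER)")
con.executemany(
    "INSERT INTO marketing_contacts VALUES (?, ?, ?)",
    [(1, "George Fraser", 1), (2, "Justin Borgman", 0)],
)

# Remove everyone who opted out (the GDPR case), and modify a row in place.
con.execute("DELETE FROM marketing_contacts WHERE opted_out = 1")
con.execute("UPDATE marketing_contacts SET name = 'J. Borgman' WHERE id = 2")

remaining = con.execute("SELECT name FROM marketing_contacts").fetchall()
# remaining == [('J. Borgman',)]
```

In an actual Iceberg table these statements don't rewrite files in place; the format records new snapshots and delete metadata, which is also what makes the time-travel feature mentioned below possible.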

It's not a transactional database. I don't want to set the wrong expectations. You'd still have an OLTP system for your transactional database, but you can do this level of transactionality for analytical purposes, and that’s super exciting. It closes the last mile on that distinction between Data Lake and Data Warehouse, and that's a big deal. 

It also enables things like time travel, which is cool. You can see your data change over time. There's a lot of interesting functionality you now have with this modern format.

Iceberg was developed at Netflix and, in fact, was originally built to be queried by Presto and Trino. They're a big Trino shop today. The idea was to start replacing their traditional data warehouse workloads, move them to Iceberg and create this data warehouse alternative. I think that's one of the reasons it's super exciting and gaining a ton of momentum in the market.

George Fraser (19:50)

You said the word “transaction.” To me, that's really the crux of what Iceberg is. It allows you to perform transactions on a data lake. Part of that means the ability to do deletes, but it also means that deletes, inserts and updates can all happen at once, as a single atomic operation. That's critical. It really fills the gap between data lakes and traditional database management systems like data warehouses. 

Iceberg is crucial for Fivetran to be able to support data lakes. We didn't have the ability to write to data lakes as a destination for many years because of our model. Our model is mirroring. We create a mirror image of all of your systems of record, whether those systems are database management systems or applications like Salesforce or event brokers like PubSub. No matter what it is, we create a mirror of that data in your central data platform.

You couldn't do that with traditional data lakes because we didn't have the ability to edit and to edit in a transaction. Iceberg gave us that. There were still many more hard problems that we had to solve in order to make it work, but Iceberg was a key enabling technology.

Justin Borgman (21:06)

One of the things that's very exciting and one of the reasons I'm enjoying this podcast with you today is that it allows us to partner together much more effectively. It opens up a whole new frontier. I think you have absolutely moved to where the puck is going and we're super excited to work with you.

George Fraser (21:24)

As are we. 

One of the things we wonder about with these data lake implementations that customers are doing is, to what degree are customers going to write back to this data lake? 

Is this data lake mostly going to be a read-only data repository for them, to the extent that they're running queries and writing outputs to somewhere else, or is the data lake going to be read-write?

What's going on today with Starburst customers? Are people commonly writing back to the data lake?

Justin Borgman (21:55)

They are, both in terms of leveraging Iceberg tables and doing updates and deletes, and also in their own ETL processes, or the T part of your ETL process: running a transformation, and landing that new refined data also in the lake. 

Historically, that might have been where the lake’s job ended, and you might land that transformed data into a traditional data warehouse. I think increasingly, because these query engines are able to provide incredible performance now, that last step is no longer necessary. This means that the amount of data you may put into your cloud data warehouse may actually decline over time. Certainly we have a vested interest in saying that ourselves, but I think that's likely true because of the cost, flexibility and the openness that open formats provide.

George Fraser (22:56)

If a customer is using a data lake and they're using Trino as a query engine alongside other query engines, there is a challenge right now. 

If you want all the engines to be able to read and write the same data lake, that write part is a little complicated — trying to set it up so that people can read each other's writes because there's a metastore layer in these data lakes.

Iceberg is not the end of the story. There's one more layer that basically keeps a pointer to the latest version of every table. There are a lot of different metastores floating around right now, and it makes it hard to get these systems to interoperate. It's still a little bit difficult. 
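The "pointer to the latest version of every table" can be sketched in a few lines. This is a hypothetical toy (`TinyCatalog` and the S3 paths are made up, not any real metastore's API), but it shows the core mechanic: a commit is an atomic compare-and-swap of the pointer, which is what lets multiple engines read each other's writes consistently.

```python
# Toy catalog sketch: table name -> path of the current metadata snapshot.
# A commit succeeds only if the pointer still matches what the writer last
# read, so two engines cannot silently clobber each other's writes.

class TinyCatalog:
    def __init__(self):
        self._pointers = {}  # table name -> current metadata file path

    def current(self, table):
        return self._pointers.get(table)

    def commit(self, table, expected, new_metadata):
        # Compare-and-swap: reject if another writer moved the pointer first.
        if self._pointers.get(table) != expected:
            raise RuntimeError("conflict: table was updated by another writer")
        self._pointers[table] = new_metadata

catalog = TinyCatalog()
catalog.commit("orders", None, "s3://lake/orders/metadata/v1.json")
catalog.commit("orders", "s3://lake/orders/metadata/v1.json",
               "s3://lake/orders/metadata/v2.json")
# A stale writer that still expects v1 would now fail its commit.
```

The fragmentation being discussed is about which service holds these pointers: Glue, the Hive Metastore, Gravity, Databricks' catalog and others each do, and engines must agree on one for multi-engine writes to work.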

If all the systems are just reading from a common data lake, that works great. However, if you want them all to be reading and writing and to be able to read each other's writes, things are still a little bit wooly. 

There are multiple visions of how that problem is going to be solved. How do you see that problem today?

Justin Borgman (23:55)

First of all, I agree that is a problem and it is the right place to focus if you're watching this industry and wondering where things settle out. 

To take one step back, for the past year or two there has been a battle at the format layer, because you have Iceberg, but you also have Delta from the Databricks folks. You have Hudi, which came out of Uber. There are multiple formats. It’s becoming increasingly clear that Iceberg is going to be the winner, but there was fragmentation there, too. 

My point in bringing that up is that I think we're now at an earlier point in that same settling-out that needs to take place at the metadata or the catalog layer. 

Databricks has their own catalog. We have a catalog, it's called Gravity. Everybody has a catalog. AWS has Glue. There's the Hive Metastore, which is still widely used, believe it or not. You've got a lot of options and it's too early to say, frankly, who's the winner at that layer. 

As a result, Starburst is trying to support other people's catalogs. If you've got data in Snowflake, we'll be able to access that. If you've got data in the Glue Metastore, we've supported that one for years just because of the prevalence of AWS. We support the Hive Metastore. Our approach is to attack the problem by providing optionality to the customer, because there's no doubt that the dust definitely has not settled at that layer. It may be another couple of years to see which catalog wins out. There are even open source, independent companies built around this catalog layer as well.

George Fraser (25:46)

Last topic: AI. Every podcast today has to talk about AI. What is the relationship between Starburst and AI workloads? Are they siblings? Is Starburst feeding the AI workload? How do they sit next to each other in a typical customer?

Justin Borgman (26:06)

We're primarily providing access to the data that they need. Of course, I think the frontier that we're at in this AI world, from a commercial perspective, is that customers now want to leverage their own data and either create their own models or enrich their own models leveraging their own proprietary data. GPT was sort of the public demo for everybody to try out the concepts. Now they want to make it real and find competitive advantage, and that involves leveraging their own data. 

I think the combination of a lake and federation, when federation makes sense, gives them a lot of optionality and a lot of flexibility as they're developing these models. That’s where we are today. I think there's an interesting future though, going back to what you said about transactions in the lake and data landing back in the lake, where I think vectors could live in the lake. That's a potentially interesting future. 

I think the lake is going to play an increasingly central role, largely because of its openness and its cost benefits. I think that's going to really drive its popularity across a wide array of workloads.

George Fraser (27:22)

It’s a complicated and evolving subject area. I think what you said about customers leveraging their own data for AI workloads is absolutely right. That's going to be an important frontier. 

We're trying to do that at Fivetran right now. Essentially, we're trying to build a knowledge-based bot. If you're a customer support person or a sales engineer at Fivetran, you have to deal with an unusually broad array of arcane minutiae about all these different systems that we connect to. Everything from the memory fragmentation implications of the LOB type in SAP HANA, to which Salesforce tables don't follow the change data capture rules that Salesforce documents. Those are both real examples, and there are thousands like that. We think we're a really good use case for a knowledge-based bot, and it’s a work in progress. 

People seem to have the idea that if you're going to do AI workloads, that AI workload needs to run on a data lake and not on a traditional data warehouse. I think AI workloads can run very well in a data lake, but they also can run very well on a traditional data warehouse. It is just fine to run SELECT * FROM table in your regular old data warehouse and feed that into your AI pipeline. There's nothing wrong with that; there's no data lake police who are going to come and arrest you. 

AI workloads are so compute-bound that the efficiency of that first step does not matter. You do not need to worry about whether you are spending too many CPU cycles at that stage because you are going to spend many, many more at the subsequent stages. It’s a very promising area. It’s going to drive a lot of change in the data stack, but I also think the data stack you have today is a great data stack for your AI project. It's a mistake to think that we have to rebuild everything just because we're ultimately feeding a text column into a language model rather than a float column into a BI tool. 

I think a lot of the data infrastructure that we've built over the last 10 years is going to be totally reusable for these new workloads. Do you agree?

Justin Borgman (29:55)

I do. 

George Fraser (29:57)

Very nice talking with you, Justin. Thanks for joining me and I'm looking forward to working more together now that our systems can talk to each other through the magic of Iceberg.

Justin Borgman (30:09)

Likewise. Thanks, George. Looking forward to it.

Mentioned in the episode
PRODUCT
Why Fivetran supports data lakes
DATA INSIGHTS
How to build a data foundation for generative AI
