Data lakes drive next-gen AI infrastructure
Fivetran CEO George Fraser met with MotherDuck CEO Jordan Tigani to discuss why classic big data architectures aren’t suited for AI/ML.
More about the episode
The debate around data management is intensifying for modern businesses. As enterprise data needs evolve toward AI/ML, outdated big data architectures have led to inflated storage and usage requirements. Executives aren't seeing ROI; meanwhile, they're committing to more infrastructure in the form of data lakes and lakehouses.
George Fraser, CEO and co-founder of Fivetran, and Jordan Tigani, CEO and co-founder of MotherDuck, outline a clear path forward, detailing how to build and adapt systems that meet today's real-world needs. Yesterday’s big data infrastructure won’t solve for the AI needs of today. Fraser recommends a single data platform consolidated on a data lake or a lakehouse.
“The architecture of systems was really driven by the size and shape of servers 15-20 years ago,” explains Tigani. Among the petabytes of enterprise data, most data consumers and analysts are only looking at the past few weeks of data. They don’t need cloud servers with infinite compute and storage — they just need a modern laptop.
Highlights from their conversation include:
- How MotherDuck built an innovative new destination connector using Fivetran’s new Partner SDK
- Why modern, commodity laptops are driving local data processing and reducing reliance on cloud
- What executives and data leaders need to know about building AI on data lakes
Transcript
George Fraser (00:00):
Jordan, welcome to Fivetran's office. Thanks for coming down.
Jordan Tigani (00:03):
Yeah, thanks George. Great to be here to check out your awesome space.
George Fraser (00:08):
So I'm joined by Jordan Tigani today, the CEO and Co-founder of MotherDuck, creator of a new data warehouse. Is that a fair description?
Jordan Tigani (00:16):
Yeah, absolutely.
George Fraser (00:17):
Formerly of SingleStore and before that at Google. So you've worked in this ecosystem for a long time. You made some observations at some of your previous roles that inspired you to work on MotherDuck. Can you tell us a little bit about that?
Jordan Tigani (00:33):
Yeah, sure. So a couple of them were just noticing that most people don't have huge amounts of data, and we had been building these systems over time that were designed really around giant data sets. Then it turns out that most users either didn't have huge amounts of data, or the data that they interacted with on a daily basis was actually much smaller. You might have 10 petabytes worth of log data, but you really only look at the last week or the last month's worth of data.
Just a couple of other ones: the architecture of systems was really driven by the size and shape of servers 15-20 years ago. And when MapReduce was created, et cetera, you needed a lot of servers to handle a reasonable amount of data. But servers are actually huge now, and so you don't need to split things up the way you used to.
Then finally, laptops which used to be considered toys now are quite powerful. Like my Mac M2 is incredibly powerful and in fact, you even wrote a blog post that I quote all the time that showed that running a database benchmark on your laptop was faster than running against a cloud data warehouse that must not be named. And so yeah, the hardware architecture and landscape has changed and is continuing to change and yet the systems were designed around a world where that's not necessarily the case. So there seemed to be an opportunity to build something for the future and it's going in the direction of where the hardware and where the data sizes are going versus where people had expected them to go.
George Fraser (02:36):
I couldn't agree more. We've had the same observation from the other side at Fivetran as the one feeding data into a lot of these data warehouses. The way I sometimes describe this is the hardware is growing faster than the data sizes that people actually routinely use are, especially if we're talking about relational data, which is the typical contents of a data warehouse.
We noticed this from founding Fivetran, how small the data sets were even at large companies. One of our observations was that we think a lot of these giant dataset legends are really driven by terrible data pipelines, by people doing things like taking a snapshot of their production database every night and storing it forever. So you have a number of copies of your data, equal to the number of days that your pipeline has been in production, a surprisingly common scenario, and that multiplies the size of your data up thousands of times. But the underlying data set is not actually that big and if you sync it using a change data capture based strategy like Fivetran does, you are going to end up with a surprisingly modest data size.
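To make the arithmetic concrete, here is a small back-of-the-envelope sketch; the database size, retention period and change rate are made-up numbers for illustration, not Fivetran figures.

```python
# Back-of-the-envelope comparison of nightly full snapshots vs. change data capture.
# All numbers are hypothetical, purely to illustrate the multiplication effect.

source_size_gb = 200        # size of the production database
years_in_production = 3
daily_change_rate = 0.01    # assume ~1% of the data changes per day

# Nightly snapshot pipeline: one full copy per day, kept forever.
snapshot_copies = 365 * years_in_production           # 1,095 copies
snapshot_total_gb = source_size_gb * snapshot_copies  # ~219,000 GB (~214 TB)

# CDC pipeline: one current copy plus the accumulated change history.
cdc_total_gb = source_size_gb * (1 + daily_change_rate * 365 * years_in_production)  # ~2,390 GB

print(f"Nightly snapshots: ~{snapshot_total_gb:,.0f} GB across {snapshot_copies} copies")
print(f"CDC replication:   ~{cdc_total_gb:,.0f} GB")
```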
Interestingly, until recently, every Fivetran connection would run on a single node. Only recently have we had a customer, initially just one, with a truly web-scale dataset running through Fivetran, and we finally had to do sharding. So sharding is still relatively new at Fivetran and still very niche. Most data sources, if you do change data capture, will easily fit through one node.
And your point about laptops is also well taken. I ran that benchmark because I was curious and I like to tell people that the data warehouse vendors should spend less time worrying about each other and more time worrying about the M4, the next chip. Because they keep getting faster and it's becoming more and more evident that a lot of data sets will run pretty fast on your laptop.
Jordan Tigani (04:44):
Especially if they're not huge. With massive data sets, you need the compute to be where the data is, but if you have smaller data sets, you can basically move them close to where the user is and then you can get even better performance.
George Fraser (04:59):
MotherDuck is a data warehouse, a data platform, Fivetran is a data pipeline. How do we work together?
Jordan Tigani (05:09):
So a database by itself is not particularly useful if you can't get data in, then you can't actually visualize or use the data on top of it. So being able to connect with the ecosystem is important and some parts of the ecosystem are more important than others and Fivetran as the gold standard in data movement is super important for us to be able to get data in. I think when we first started we had a blog post about what we were doing and somebody said, "This sounds really cool, but how do I get my data in?" And I think Fivetran is just a way that... So it doesn't matter where your data comes from, because there's all sorts of things and locations that are producing data. We can now get that into MotherDuck because we have built this Fivetran connector and so we're very, very excited that we're going to be turning that on, I think, pretty soon. I think it's working in private preview now, the MotherDuck connector?
George Fraser (06:13):
That is correct. It is running.
Jordan Tigani (06:13):
Excellent.
George Fraser (06:14):
It's still in preview, but it is running. One of the things that's unusual about MotherDuck as a destination for Fivetran is that you built it using our new SDK, also still in private preview, although several people have built on it now. The Fivetran SDK is a way for partners like MotherDuck to build their own Fivetran connector, so they write a little bit of code and that code runs on Fivetran infrastructure, but it acts as a shim between the language that Fivetran speaks and the way that the destination wants to get the data.
I think it's an important new direction for us in general of trying to add programmability to Fivetran so that, even though we're in the connector business and we have built over 500 connectors ourselves, if there's some other connector that we don't have, at the end of the day, if you really want it, you don't have to convince us to do it, you can just go do it yourself. That's the vision we have: to allow people, when it's really important to them, to go make it happen. And we have sources doing this as well, interestingly, which is another exciting development.
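To make the "shim" idea concrete, here is a schematic sketch of what partner-written destination code might look like. The class and method names are hypothetical and are not the real Fivetran Partner SDK interface, which defines its own contract; this only illustrates the shape: Fivetran drives the calls, and the partner translates them into the destination's own operations.

```python
from typing import Any

# Hypothetical illustration only: NOT the actual Fivetran Partner SDK interface.
# The real SDK defines its own contract that partner code implements; this sketch
# just shows the shim shape: Fivetran-side calls in, destination-side writes out.


class HypotheticalDestinationShim:
    def __init__(self, connection_config: dict[str, Any]):
        # Partner-specific configuration, e.g. credentials for the destination.
        self.config = connection_config

    def create_table(self, schema: str, table: str, columns: dict[str, str]) -> None:
        # Translate Fivetran's column types into the destination's DDL.
        ddl = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
        self._run(f"CREATE TABLE IF NOT EXISTS {schema}.{table} ({ddl})")

    def alter_table(self, schema: str, table: str, added: dict[str, str]) -> None:
        # Handle schema evolution when the source adds columns.
        for name, dtype in added.items():
            self._run(f"ALTER TABLE {schema}.{table} ADD COLUMN {name} {dtype}")

    def write_batch(self, schema: str, table: str, rows: list[dict[str, Any]]) -> None:
        # Apply a batch of upserts/deletes however the destination prefers
        # (bulk load, MERGE statement, staged files, and so on).
        raise NotImplementedError

    def _run(self, sql: str) -> None:
        # Execute a statement against the destination; left abstract here.
        raise NotImplementedError
```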
So MotherDuck, we've touched on this a little bit when we talked about the size of data, the size of single nodes, how the relationship between those two things is changing. Those observations have led to some unique characteristics of MotherDuck. Can you tell us, what makes MotherDuck different than other data platforms?
Jordan Tigani (07:43):
Yeah, sure. So I think in order to handle or take advantage of the performance and low latency of powerful local compute, one of the things that we're doing is we're running DuckDB on the client as well as on the server. So it's a different architecture than you typically see in a cloud data warehouse which runs entirely in the cloud. You send the SQL query and you get a result back. We have a database based on this open source DuckDB that runs locally in the client, so if you're running in a browser, it runs in the browser in Wasm. If you're running through a JDBC driver, that JDBC driver has all of the DuckDB database code inside of it. If you're running in Python, it's the same thing. But we also have a matching DuckDB instance that runs in the cloud.
And the other different thing on the architecture side is we're building a scale-up system that runs on a single node. So a lot of people, their intuition is like, "Okay, well that's not going to allow me to scale, because I've got a petabyte of data." First of all, most of the time you don't actually read the petabyte of data, as we were talking about before, and especially with separation of storage and compute, et cetera, you're looking at a smaller set of data. So if you have a thousand users in your company, each one of those users gets their own instance and they can scale down to zero. So DuckDB can run in a hundred megabytes, but then we can scale that up to terabytes and hundreds of cores. So you actually get a lot more flexibility and more compute power.
And then we tie these things together. You can do joins between a local dataset and a remote dataset. We can cache data locally, materialize views locally and cache query results locally, so there's a bunch of cool things we can do once we have this architecture where we can move compute to the user, and you can do this 60-frames-per-second, almost video-game-style flying through your data. Things that are literally impossible because of the speed of light if you have to go and hit a data center somewhere in the cloud.
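Here is a minimal sketch of what that dual execution looks like from Python. The "md:" connection string, database name and table names are assumptions for illustration; check MotherDuck's documentation for the exact connection and authentication details.

```python
import duckdb

# Connect the local DuckDB to MotherDuck. The "md:" prefix, database name and tables
# below are illustrative assumptions; see MotherDuck's docs for the exact syntax
# and authentication (typically a service token).
con = duckdb.connect("md:my_db")

# Purely local: DuckDB running inside this Python process reads a file on disk.
con.sql("SELECT count(*) FROM read_parquet('events_last_week.parquet')").show()

# Hybrid: join the local file against a table that lives in the cloud. The system
# can run the cloud side remotely and the local side on the laptop, then combine.
result = con.sql("""
    SELECT u.plan, count(*) AS events
    FROM read_parquet('events_last_week.parquet') e
    JOIN my_db.main.users u ON u.user_id = e.user_id
    GROUP BY u.plan
""").df()

print(result)
```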
George Fraser (09:58):
It's interesting to think about this model in the context of Fivetran's own particular scenario. Fivetran is a technology company, about 1,200 employees. I checked recently the size of our dataset, and we have single-digit petabytes of data in our data warehouse, but the vast, vast majority of it is a handful of log tables that record events about everything that's ever happened in Fivetran. People turning connectors on and off, syncs running, syncs succeeding, syncs failing, all that stuff. And those are also the least queried data sets, and in fact, we make materialized views of them that are much smaller as a cost-saving measure. So exactly towards your point, the whole dataset is not that big, and then the parts of it that are queried are that much smaller. So we see evidence of the usefulness of this approach in our own use case.
Jordan Tigani (10:49):
You guys use DuckDB for a bunch of internal stuff as well, right?
George Fraser (10:52):
We do, yeah. We use it in our product. So it's not used by our analytics team, except ad hoc locally. We do occasionally use it for that purpose in more of a notebook environment, which I think a lot of people do with DuckDB today. But we use DuckDB internally in our product to support our data lake writer implementation. Fivetran has been working on data lakes for years. About a year ago we finally released it, but we had been working on it for years before that.
We wanted to support data lakes as a target, defined as: we're going to replicate your data not to a conventional data platform or data warehouse like Databricks or Snowflake or BigQuery, but instead we're going to deliver it to a set of files in an open source format in S3 or Google Cloud Storage or Azure Blob Storage.
This is a thing that not everyone wants to do, but the people who want to manage their data this way really want to do it this way. And so we wanted to support that, but we wanted it to work like any other Fivetran destination. We wanted to have proper relational tables, we wanted to have updates, we wanted to have schema evolution. We didn't want data lakes to be this incredibly inconvenient thing where you had to go and build a whole secondary layer of software on top of it just to interact with it. We wanted it to be very much like any other relational database.
To do that, we had to solve the bottom part of a data warehouse. We had to write code that could update and alter huge tables in a reasonable amount of time. That looks like a small fraction of the project of building a data warehouse.
After a couple of false starts, different architectures that didn't work out, the design that finally succeeded was we built a service based on DuckDB. So we have a farm of nodes, each node runs DuckDB and when we want to update a table, we take the table, we break it up into pieces based on files, and we send little groups of files to the different nodes and DuckDB runs in that node, patches up the table, alters the column types, whatever it is that we need to do. It has that horsepower to make edits to large data sets and DuckDB was critical to making that work. In part, because it is exceptionally easy to incorporate into a larger system. DuckDB is very friendly and pliable. It can sit inside of a lot of different systems. Partly because a lot of our existing logic for how we actually update destinations is written in SQL.
People imagine that Fivetran gets the data out of the sources and it's like a series of little nicely formed events. You see this in a lot of Kafka diagrams. They show this nice little change stream, here's a row and here's another row and here's another row, and then we just combine them and apply them. It does not look like that at all. It is so much more complicated. The format that data comes out of sources in is so convoluted that the logic of how to apply it can be quite complex, and a lot of that logic we have is written in SQL. And because DuckDB is fundamentally a SQL system, we were able to use it to run more or less our existing code, our existing destination-related code.
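As a rough sketch of that pattern (not Fivetran's actual implementation), here is what one worker node's job might look like: load the group of Parquet files it was handed into an in-memory DuckDB, apply a batch of staged changes with ordinary SQL, and write the patched slice back out. The change-batch schema, with an "id" key and a "_deleted" flag, is hypothetical and much simpler than real CDC output.

```python
import duckdb


def patch_file_group(data_files: list[str], changes_file: str, output_file: str) -> None:
    """Illustrative only: apply one batch of upserts/deletes to one group of Parquet files."""
    con = duckdb.connect()  # in-memory DuckDB on this worker node

    files_sql = ", ".join(f"'{f}'" for f in data_files)
    # Load the slice of the table this node is responsible for.
    con.execute(f"CREATE TABLE slice AS SELECT * FROM read_parquet([{files_sql}])")
    # Stage the change batch; assume it carries a primary key `id`, a `_deleted`
    # flag, and the new row values (a hypothetical, simplified CDC format).
    con.execute(f"CREATE TABLE changes AS SELECT * FROM read_parquet('{changes_file}')")

    # The apply logic is plain SQL: remove any row that was updated or deleted,
    # then insert the new row images for everything that is not a delete.
    con.execute("DELETE FROM slice WHERE id IN (SELECT id FROM changes)")
    con.execute("INSERT INTO slice SELECT * EXCLUDE (_deleted) FROM changes WHERE NOT _deleted")

    # Write the patched slice back out for the lake's table format to pick up.
    con.execute(f"COPY slice TO '{output_file}' (FORMAT PARQUET)")
```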
And lastly, we were able to hire DuckDB Labs, the original developer of DuckDB, to help us understand DuckDB, use it in this situation and to make changes to DuckDB to support our scenario and to improve performance in our particular situation. So that was a key element of that success.
Jordan Tigani (14:38):
To what extent are you seeing a trend towards lakehouses and more data, not going into data warehouses, but into the Iceberg, Delta, Hudi formats? I mean, I imagine you have a pretty good window into what people are actually doing versus what's in the hype cycle.
George Fraser (15:00):
Yeah, we see this in two ways. We support Databricks as a destination, and Databricks internally stores the data in a lakehouse format where you can see and interact directly with the underlying files that represent the table. All modern cloud data warehouses are based on storing the data in columnar format in object storage, but in a lakehouse model, you can actually directly access that if you so choose. So we see it in Databricks and we see a lot of growth of that as a destination.
Then we also see it in our own data lake destination, where we are the ones writing and maintaining the files in object storage directly, and then the query engines sit downstream of that. That's most relevant when you have multiple consumers of the same tables and the tables are very large, to the point that writing them twice is actually a meaningful cost.
Then it's also relevant in scenarios where you're just very concerned about future proofing. You want to have total ownership of the storage layer of your data platform separate from any vendor. That's where our direct data lake destination is relevant, and that is also growing very fast. It's our fastest growing destination for the last couple quarters. It's not our largest, but it's not tiny. And so we're seeing it finally starting to go mainstream.
Data lakes are a funny thing; they've been talked about forever, for 10 years. This idea of, instead of storing your data in a proprietary format of a vendor, store it in an open source format and control it yourself, that appeals to a certain audience. Some people don't care about that, some people care about that a lot. Sometimes I've made fun of it a little bit as a hyped thing. It's like what the cool kids are doing, data lakes. I even wrote a blog post years ago where I said, "Data lakes are so hot right now," and I had a picture of Mugatu from Zoolander, I don't know if you know that reference, but I made fun of them a little bit. But we're on the hype train too now.
But there are people, like I said, who really care a lot about that. And there are concrete reasons why sometimes a true data lake-based system can be a great benefit to the right customer.
Jordan Tigani (17:12):
I guess the data lake hype cycle is on the downswing, and everything AI has sucked all the air out of the room.
George Fraser (17:20):
It's funny, sometimes people think of data lakes and AI as things that necessarily go together, that somehow if your goal is to do non-SQL workloads like a retrieval-augmented generation pipeline or just classical machine learning like logistic regression, a traditional data warehouse is unsuitable for that workload and you should be using a data lake. You can use a data lake with that workload. They're a good fit, but you can also run workloads like that on a traditional data warehouse. I think this is a misunderstanding, that somehow because you're storing it in a traditional data warehouse, you can only query it with SQL and you can't do anything else. You can take the data out and do whatever you want with it, and typically in these kinds of scenarios, the cost of compute of the task to be done, whether it's your logistic regression model or your deep neural network or whatever, is so high that the cost of the select-star-and-extract step is small by comparison, so that's a non-concern. Not always.
But I do think sometimes people get a little... They build multiple data platforms for different workloads, which you should not do. You should strive to try to share the same data platform for as many workloads as possible because it's a lot of work building and maintaining data pipelines. And so having two separate systems just because you have two different goals is something I'm always urging people not to do.
Jordan Tigani (18:46):
It also seems like with the data lake there's a little bit less trustworthiness of the data, just because... The benefit is, I can throw anything I want into my data lake, and if you throw anything you want into the data lake, then that means you're not carefully controlling what goes in. And if you're trying to use AI to make some decisions, or to train on that data, and you haven't carefully curated what's in it, then it's very hard to predict what's going to come out.
George Fraser (19:21):
There's a little bit of a terminology problem that we all have here, which is the term “data lake” is overloaded. It's used to mean two things. One is to store your data in an open source format in object storage that you control. The other is to store data at a very low level of curation. That's what you were just talking about, where the threshold to add data to this is very low. And so that means that some of it might be messy. These are actually two completely independent decisions.
With table formats like Iceberg and Delta, a data lake can have exactly the same characteristics as a traditional data warehouse. Fivetran data lakes look exactly the same as Fivetran data warehouse destinations. They have the same schema. The data is inserted and updated in exactly the same way. That's why it was so hard to build, but the level of curation is no different.
So you talked a little bit about AI. One of the DuckDB founders, Hannes, has this great talk where he says, "Machine learning is not going to move into the database, even if we wish it very hard." And he's making fun of this idea that, oh, we can just cram all of machine learning into the data warehouse using new functions. We can just have a machine learning function and select machine learning from my data. That approach actually does have a place, but I appreciate Hannes making fun of it.
So there are a bunch of different ideas about where this is all going, how conventional relational SQL workloads and newfangled machine learning and AI workloads are going to work together, how they're going to connect. What is your theory of how this is all going to play out?
Jordan Tigani (21:08):
I do get what he says. I also was one of the people that helped create BigQuery ML, which is putting machine learning in the database, and there are a lot of people that do like that. The people that are doing that are maybe not the typical people on the data science team; they're more on the analytics side, who are like, "Hey, now I can actually make some predictions in my database."
Yeah, it does seem like the weight of tooling in the data science world, where everything's in Python and PyTorch and all this ecosystem, is not going to get shoved inside the data warehouse. The data warehouse vendors, of course, are the ones that own the data and have a lot of vested interest in trying to bring those worlds together. So I think there are some opportunities for being able to run PyTorch over the data that's in your database, or being able to make it so that you can operate over data in your data warehouse while making it feel local.
So with what MotherDuck is doing, where we have the local execution, there's a local DuckDB inside your Python process. You can actually be running scikit-learn or PyTorch or another data science framework, or you can be running in R, and the data will feel local. Some of the things are getting translated to SQL queries and running in the cloud; some of them are pulling data locally and operating locally. And so we're thinking that this hybrid execution is going to be a way to let you operate in the ecosystem that you're comfortable with, but also have the canonical data still be governed and owned by a data warehouse.
George Fraser (23:16):
I think there's a lot of merit to that. I think, in addition to what MotherDuck is working on, there's a hacker's version of that where you run a query on your data warehouse, you download the data, that's the first cell in your Python notebook. I myself have several Python notebooks that look like this, I will tell you, and you save it and then the rest of it proceeds from there and that all runs locally. And that can work incredibly well for realistic data sizes. A lot of problems can be solved very efficiently using the benefit of these incredibly fast laptops here. So I'm with you that I think we're going to see local execution have more and more of a role in the future, whether it's via systems that were designed with that in mind or hacker-hybrid execution on top of a conventional cloud data warehouse.
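A sketch of that "hacker's version", assuming a hypothetical warehouse_connect() helper standing in for whichever client your warehouse provides; the query, table and file names are made up. The first cell pulls a bounded extract and caches it locally; everything after that runs on the laptop.

```python
import pandas as pd

# Cell 1: pull a bounded extract out of the cloud warehouse and cache it locally.
# `warehouse_connect()` is a hypothetical stand-in for your warehouse's Python client;
# the schema, table and columns below are made up for illustration.
con = warehouse_connect()
df = con.query("""
    SELECT account_id, plan, monthly_events, seats, churned
    FROM analytics.account_features
""").to_pandas()
df.to_parquet("account_features.parquet")  # cache so re-runs never touch the warehouse

# Cell 2 onward: everything runs locally on the laptop.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_parquet("account_features.parquet")
X = df[["monthly_events", "seats"]]
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```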
Then you have what I like to think of as the Snowflake co-processor model, where they say, "Well, you can deploy non-SQL workloads on separate compute inside of our environment." And that has its own merits as well, because you're keeping all the data inside the ring fence. So there's a few different visions out there. They're all interesting. It's an exciting time to be around.
So these technological changes, they can unlock all kinds of changes that ripple across the industry in unexpected ways. Are we on the precipice of another one of these right now? And if so, what is it? What is the foundation under our feet that's moving and what do you think is going to happen as a result of that?
Jordan Tigani (24:55):
I think what was driving the last one was, A, the movement to cloud, but B, also this fear that giant data was going to change how we have to deal with everything. I think that's one of the reasons people jumped onto Hadoop. They were like, "Oh, we're not going to be able to handle all this data, we need Hadoop." And that turned out to be a dead end.
Now, I think, people are starting to wake up to the idea that, okay, well actually the most salient feature about my data is not how big it is. And then they're also realizing that, hey, we're spending a ton of money building all this infrastructure and we're not getting that much value out of it.
The most salient data problem is the user experience of that data versus the size of that data. So making it easier to use, easier to get data in, easier to do data transformations, easier to incorporate with all the things that you're currently doing, easier to get things out of it. I'm a little skeptical on this AI text-to-SQL side of things, but if we can raise the level of abstraction that people are operating at and enable people to more naturally ask questions of their data and get reliable answers out, then I think that's going to help usher in the next generation of data tooling and data products and allow people to actually get the value out of their data that they have the potential to get.
George Fraser (26:49):
In many ways, it's a very old problem. Even as hard as it is to just get all your data in one place, getting meaningful insights out of it is even harder and is the perpetual challenge. But maybe we're about to make a lot of progress on that. Let's hope.
Jordan Tigani (27:04):
I hope so. Yeah.
George Fraser (27:05):
Thanks for coming by today, Jordan.
Jordan Tigani (27:06):
Yeah, thanks so much, George.