Fivetran and Databricks CEOs reveal the secret to AI

George Fraser, CEO and co-founder of Fivetran, and Ali Ghodsi, CEO and co-founder of Databricks, offer an insider’s perspective on the hardest parts of building and deploying generative AI in the enterprise.

Listen to the episode: https://fivetran-com.s3.amazonaws.com/podcast/season1/episode4.mp3

More about the episode

The competition around enterprise generative AI efforts is intense. With so much buzz around this game-changing tech, it’s practically impossible to separate fact from fiction. We brought together two of the industry’s most authoritative and forward-thinking data leaders to share what it really takes to build a winning GenAI strategy.

“Every company we talk to wants to do generative AI,” says Ghodsi. “The dirty secret of AI is that the hardest part is the data.” Ultimately, it boils down to centralizing as much data as possible and making it as high-quality as you can.

Here's a sneak peek into the insights they shared:

  • Successful GenAI initiatives require fast, reliable and scalable data integration
  • The more data, the better — especially if you can ingest it from all your third-party sources
  • Techniques like retrieval augmented generation (RAG) are key to trustworthy and predictable LLMs

Watch the episode

Transcript

George Fraser (00:00)

Hi, I'm George Fraser, CEO of Fivetran, and I'm joined today by Ali Ghodsi, CEO of Databricks, a close partner. Ali is also a neighbor here in the East Bay. Thank you for coming by today, Ali.

Ali Ghodsi (00:12)

Thank you. My pleasure. Yeah, that was an awesome commute.

George Fraser (00:16)

Databricks and Fivetran work together for hundreds of customers. Databricks is our fastest-growing destination. For those not super familiar with the space, how do Databricks and Fivetran work together? Why do both these tools need to exist?

Ali Ghodsi (01:43)

That's a great question. At Databricks, we do a lot of analytics, AI, data science, you name it. That means you do a lot of stuff with data. Without data, the platform is useless. If you have no access to data, no bueno. Now, with generative AI and LLMs, it is even more important to get more data sets into the platform. So, how do you get that data in? How do you get data in from Workday, from Salesforce, SAP, MySQL, you name it, all these different data sources? It's the number one thing everyone has to do first. If you haven't done that yet, there's really nothing you can do with the platform. That's where Fivetran comes in. Fivetran makes it really easy to get all these data sources into the lakehouse and then lets people process the data.

Especially with generative AI, large language models and so on, the game is now about quality. Okay, so we built an AI model and we evaluated it, but the quality is not quite there. How do we improve it? The secret sauce often is more data: more specialized data from different sources that you haven't captured yet. It already exists in different systems inside your corporation, but you haven't infused it into your AI models. You can experiment and see whether adding it improves the quality, and you can continue iteratively, adding more data sets into Databricks using Fivetran.

George Fraser (01:50)

I agree with everything you said, and I would describe it from the other perspective. What Fivetran does is we get all your data in one place, but Fivetran doesn't do anything with the data. We won't even show it to you in our UI.

Ali Ghodsi (02:02)

That's why people love it, by the way. It just works.

George Fraser (02:04)

We are specialists. We work very hard to make a very complicated thing seem very simple. But in the end, you get all your data in a lakehouse, and you need another system in order to do something with it. One of the things we see with Databricks, which is the fastest-growing destination on Fivetran right now, is that what people love about it is the diversity of things you can do with it. A lot of systems only give you a certain data model or a certain workflow. Databricks, from the very beginning, has had this great diversity of workloads that you can run on it. You can run notebooks, you can run SQL queries through JDBC connections, you can run Python code, Scala code, streaming. You can do a lot of different things in Databricks.

Ali Ghodsi (02:49)

Even R. You would like that, right?

George Fraser (02:50)

Really, you can even do R?

Ali Ghodsi (02:52)

You can do R. Didn't you do that, back in the day?

George Fraser (02:54)

I used to write a little R back in the day. I was more of a MATLAB guy than an R guy.

Ali Ghodsi (02:57)

Okay, all right.

George Fraser (02:58)

But I did write some R when I needed to do a GLM actually in 2009.

Ali Ghodsi (03:05)

There you go. That's the hottest thing you could do today actually, in the market. Yeah.

George Fraser (03:08)

Running a GLM?

Ali Ghodsi (03:09)

Actually, yes.

George Fraser (03:10)

You know, GLMs still work.

Ali Ghodsi (03:11)

I know.

George Fraser (03:12)

There's a lot of hot stuff about the latest and greatest language models.

Ali Ghodsi (03:16)

Yeah.

George Fraser (03:17)

Nonetheless, GLMs are still pretty great. Logistic regressions are still pretty great. I did one recently.

Ali Ghodsi (03:24)

That's what everybody's doing on the platform, still today. Right? If you want to nail those numbers and get them right, and you don't want an LLM that's just making up numbers for you, that's what works.
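To ground the point about classic techniques, here is a minimal sketch of the kind of "workhorse" model they mean: a logistic regression, which is a GLM with a logit link, fit on synthetic tabular data. The dataset and scikit-learn usage are illustrative, not anything from the episode.

```python
# A logistic regression (a GLM with a logit link) on synthetic tabular data.
# Purely illustrative: the dataset is generated, not real customer data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the numerical, tabular data most platforms see.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```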

George Fraser (03:33)

I'm really curious to hear more from you about this because one of the cool things about Databricks is you can do so much with it. What do you see from the inside of Databricks? What's driving most of the workloads? What are the workhorses? What's the new stuff? What do people care the most about? 

Ali Ghodsi (03:49)

Yeah, about a third of the workloads on the whole platform are AI. But the bulk of that is actually what would now be called traditional machine learning: things like GLMs, logistic regression, those kinds of things. But everybody wants to do generative AI now. Every company we talk to wants to do generative AI. There are multiple departments within companies that are fighting over it: "I own GenAI in our company." "No, I own that." Everybody's rushing. The CEO is telling them, "We have to get into production with the GenAI model. It's not good. It's taking too long. When are we going to get there?" Obviously, that's what everybody's exploring. Large language models, fine-tuning and building your own models using GPUs, is the fastest-growing service. We have a dashboard, built in Databricks, that looks at consumption every day, and it's the only service that has set record consumption every day, right through December and Christmas. People train more and more models, so that's what's hot. That's what everybody wants to do.

But then you ask them, "Okay, so how's it going with GenAI? Are you in production?" "Well, define what you mean by production." "Well, I mean, are you getting value out of it?" "Yeah, actually, we want to talk to you about that. How do we know we're getting value? What is the quality?" I think everybody's exploring it, and I think this year is going to be the year the rubber hits the road. Are you in production, does it have quality, is it bringing the business value you wanted, and is the investment worth it? I think that's going to be very important for generative AI.

I think the workhorse of the AI workloads is classic machine learning: numerical, tabular data where you want to get those predictions right and accurate. That's what people typically are doing. But that's just a fraction of the platform. Then of course, for all of this, it's back to Fivetran. The dirty secret of all AI is that the hardest part, whether machine learning or generative AI, is the data. Are there data sources that you can extract more information from? That's the secret sauce. That's where Fivetran comes in.

George Fraser (06:00)

We have been having a lot of discussions at Fivetran lately: "Do we need to do something different to support these new workloads?" My first reaction is that the data sources we're syncing now contain a lot of text data. They contain transcripts of calls. They contain descriptions of customers. They contain support interactions. They contain lots of stuff that you might conceivably use to fine-tune a language model. So my first take is, "From a Fivetran perspective, this is just another workload. The problem looks more or less the same. We need to get the data from here to there; this is just a new thing you're going to do with the data." Do you think there's any additional nuance to that? Do you think the rise of these new workloads is going to drive innovation in the data movement space? Is there a different need created by these new workloads?

Ali Ghodsi (06:55)

That's a great question. I mean, obviously, we don't know. I don't know what the future will look like. We think about the same questions too. How is this big secular shift of generative AI going to change everything, or is it, and how? For people who are writing very complex ETL pipelines, Fivetran is awesome. It just works. You take away all the corner cases and all the problems. But there's also a lot of custom code people are writing for ETL pipelines in big data processing. The question is, do you need to write all that code anymore in the future? Can an LLM just write that code for you and be right? That's one we are exploring: seeing how much of that we can just automate away and have large language models code it out for you.

George Fraser (07:38)

I think that would be fantastic, which you may be surprised to hear me say. We don't think of ourselves as an ETL company. We think of ourselves as a data movement company. That was the original key decision that unlocked the things people like about Fivetran. We said, "Well, we're not going to solve that problem. We're going to leave that to you. Do you need to write a complicated pipeline that transforms your data? That's going to happen after we deliver the data." We treat data movement as a first-class problem, and that allows us to do a lot of automation, but it does leave behind a job that the user has to do in Databricks after the data arrives.

Ali Ghodsi (08:12)

Yeah.

George Fraser (08:15)

The more that job can be automated, I think the more productive everybody will be. One of the things I've long observed is that the appetite for data is infinite. It will expand to fill the budget and productivity available. If you make people 10 times more efficient and you make the tools 10 times cheaper, they will simply do-

Ali Ghodsi (08:39)

More.

George Fraser (08:40)

... 10 times more projects.

Ali Ghodsi (08:41)

Exactly.

George Fraser (08:42)

There are always more things on the backlog.

Ali Ghodsi (08:43)

Exactly, yep. 100%.

George Fraser (08:43)

There seems to be no limit.

Ali Ghodsi (08:45)

100%, yeah. But I think that's a great point. So with Fivetran, you just move the data from A to B, and then transformation needs to happen. A lot of that is happening in Databricks today. The question is, is there a way in the future where a lot of that transformation can just be automated away with GenAI?

George Fraser (08:59)

Do you think that the critical step right now, the rate-limiting step on people trying to get value out of data, is the curation and transformation of the data, building those complex, highly customer-specific workloads and pipelines?

Ali Ghodsi (09:13)

I think so. I think it's boring and it's not sexy, but it's the most important part. All the downstream cool stuff, whether it's GenAI, chatbots or just reporting, PowerPoint, Tableau dashboards or whatever it is, only works if the stuff before it, the data movement and transformation parts, is perfect: no errors, no issues, never breaking, hitting their SLAs. And that's hard to do. If you could automate more of that, I do think it would be game-changing for the downstream use cases. I think that's where the world is headed, and it's super important. That's really where the secret sauce lies, I think, in most organizations that know how to do this stuff well.

George Fraser (09:57) 

We've talked about this idea of the lakehouse a couple of times; we've used that term. Not everyone is familiar with it yet, although it's on its way. The lakehouse is something you and I have argued about over the years. I was a little bit of a lakehouse skeptic a couple of years ago. But from what we've seen of late, all of the major platforms are following: Microsoft has OneLake, BigQuery has BigLake, Snowflake is going all in on Iceberg, and AWS has had Glue for a while. Everybody is converging on this idea of the lakehouse, which really was pioneered by Databricks. My question for you is: how does it feel to be right, Ali?

Ali Ghodsi (10:38)

The next thing that we're focused on is what we call data intelligence; that's where our attention and our energies are. It's great that the world is converging on the lakehouse. But if there's one misconception about the lakehouse that people are not getting right, it's this. A lot of folks are saying, "Hey, if you store all your data in a data lake, and it's just stored there, then you have a lakehouse, right? Then you do your workload on top of it. You can run SQL on top of that data lake. I have a lakehouse, right?" Well, no. That's the data lake architecture from 2010. That is not a lakehouse. Yes, you store all your data in a lake, but you don't have a lakehouse.

To really get the lakehouse, you also need governance, and the governance cannot be at the level of files. If governance is at the level of files, so you just have millions of files lying there on a data lake, you're back to the data swamp. You're going to lose governance, you're going to have issues with security. There are going to be challenges down the line. That world doesn't work. It's the world that Cloudera and Hortonworks, the Hadoop vendors, invented 15 years ago. I would say that's the missing ingredient. A lot of people are saying, "Yeah, the lakehouse is it. We agree with the lakehouse approach that Databricks took." Well, double-click on that governance and how it's done. You'd better have fine-grained security, you'd better have views of your data that are consistent at the level of the business. You'd better have support for machine learning models, all these things, so you can govern them, secure them and cost-control them. That, I think, is lacking in many of the systems out there. They'll get there, but I think that's a misconception right now.

George Fraser (12:16)

Everyone's at a different stage for sure.

Ali Ghodsi (12:18)

You published a thing, back in the day, that said Databricks is a relational DBMS.

George Fraser (12:25)

Yes, I did, and it meant something important, which I do believe. The key idea of the lakehouse, which you've just described in a different way, I think of as bringing the best parts of a relational database management system to a data lake.

Ali Ghodsi (12:42)

Yep.

George Fraser (12:43)

People think of RDBMSs as a big box in a data center that's very expensive and has a DBA with a ponytail watching over it all the time. But the key characteristics of a relational database are joins and the governance you just talked about, and Databricks accomplishes those through a bunch of key features of Delta Lake. I think that's a cool evolution of the ecosystem. In some ways, it's the logical next step of the separation of compute and storage.

Ali Ghodsi (13:22)

Yeah.

George Fraser (13:23)

We said, "Okay, we have separate compute and storage, but now we're going to have explicit separation. You're going to have two systems," and in principle, you can run different execution engines on top of the same storage engine.

Ali Ghodsi (13:35)

Yep.

George Fraser (13:36)

There's an interesting implication of that, which is that a Databricks customer using Delta Lake can, if they so choose, mix Databricks with other execution engines.

Ali Ghodsi (13:47)

They can.

George Fraser (13:48)

You could run scikit-learn in Python on your laptop, and you could just bypass the Databricks compute layer, and talk directly to the storage layer. That's a really interesting thing to do, as a vendor, to let go of that level of control to say, "Do what you want. Even if you want to cut me out. Go ahead."
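As a concrete illustration of that separation, here is a minimal sketch of reading a Delta table directly from storage with the open source deltalake package and fitting a scikit-learn model locally, with no cluster involved. The table path and column names are hypothetical, and for cloud object storage you would additionally pass credentials via the package's storage_options parameter.

```python
# Read a Delta table straight from storage, bypassing the compute layer,
# then train locally with scikit-learn. Path and columns are made up.
from deltalake import DeltaTable
from sklearn.linear_model import LogisticRegression

df = DeltaTable("./lakehouse/churn").to_pandas()  # hypothetical table path
X = df[["tenure_months", "monthly_spend"]]        # hypothetical feature columns
y = df["churned"]                                 # hypothetical label column

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.coef_)
```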

Ali Ghodsi (14:05)

Yeah.

George Fraser (14:06)

What do you think about that decision, and that intentional abdication of lock-in?

Ali Ghodsi (14:12)

Yeah, I mean, it was on purpose. The whole idea was: how do you disrupt an existing market? The idea was to level the playing field. We're just going to put the files here on the lake. The customer owns it. We don't own it at Databricks. Let the best vendor win, and people can mix and match. It means that we are also not locking them in. Once you're using Databricks, we can't just say, "Okay, we're done here. We've got this customer, they can't leave us." No, the data is on the lake. Any day they can say, "Well, wait a minute, I'm getting a little bit better results for the same workload. I can run a SQL warehouse using one of the competitors over here. They seem to have improved a lot in the last six months. Maybe I should use them instead." But it also gives us a chance, six months later, to say, "Well actually, now we're better, so maybe flip over."

Yeah, the lock-in is decreased, but you need the governance. And I think you nailed it: the fine-grained access control and being able to do that. You don't want just a data swamp. You don't just want a lake where all the data sits, and that's it. There has to be a much higher-level, data warehouse-style governance model. You might say, "Well, wait a minute, then maybe it's just a data warehouse with separate compute and storage." I would say no. A lot of the engines out there today can't actually do ML or AI. You can say, "Yeah, we can bring in another engine." You could, but I think there's power in one unified engine that can do relational joins and the stuff that data warehouses do, and that can also natively do AI: advanced analytics, iterative work, training the model and making it better and better. If you can do that, I think that's more powerful. Switching between engines always creates friction.

George Fraser (15:54)

There's a talk by Mark Raasveldt, a computer scientist and one of the creators of DuckDB, in which he says, "ML is not going to move into the database, even if we wish it very hard."

Ali Ghodsi (16:09)

He's right.

George Fraser (16:10)

He points out that the ML ecosystem is simply so large that we cannot just go and create a bunch of UDFs and say, "There we go, solved it. Now you can do ML in your database." No, this is a whole ecosystem of its own, and we have to build bridges between the two in order to solve user problems. But they're both going to continue to exist.

Ali Ghodsi (16:34)

Yep.

George Fraser (16:35)

We need to figure out how to make them work together.

Ali Ghodsi (16:37)

Yep.

George Fraser (16:38)

I'm curious: how do you find it different, being a founder? What's the unique perspective of a founder? What do you think you do differently than someone who started in your role tomorrow would?

Ali Ghodsi (16:52)

Yeah, I think it's the long-term perspective. I mean, things that we ponder, like, "Hey, what's GenAI actually going to do to the field? What do we need to change? What does Fivetran fundamentally need to change because of GenAI? What does Databricks fundamentally need to change because of GenAI?" Those kinds of things. The professional CEOs, and of course I'm painting with a broad brush here, but generally speaking, are more focused on the financial metrics. What growth do we need next year, and what are the levers? How much do we invest here and there, and how much growth do we get? That's fine for sustaining the thing you have and, at the margins, making the financial statements look better, and I don't want to dismiss it. I have to do that. You have to do that too.

But going deep into the product, and thinking about, "What's going on with the product? Where is this product headed?" is something I think you rarely find in professional CEOs, and you find it in all founding CEOs, all founders or people who were early in the company and are running their companies; they're deep into that. There are some professional CEOs out there, at Adobe and other companies, who are deep into the tech. But that's, I think, the big thing. The other thing is that founding CEOs typically don't mind taking big bets, or big pivots, if they believe in them. If they see a change coming, or something that's needed, they will do it. Whereas if someone comes in and they have a four-year gig, and maybe they'll get another four-year gig at the same company if they do well, they probably don't want to rock the boat too much.

George Fraser (18:23)

Yeah, it's easier to make those kinds of big decisions when you've been there the whole time. You have a certain source of legitimacy…

Ali Ghodsi (18:31)

Yeah.

George Fraser (18:32)

... and I think a willingness to take big risks. There are challenges when you've been running a company since the beginning. There are some unique challenges to that too. I find you have to work at being able to recognize your own mistakes …

Ali Ghodsi (18:48)

Yeah.

George Fraser (18:49)

... and correct them. It's harder when everything is your fault, at some level. A trick I sometimes use is, I will sit there and imagine if I were hired as the new CEO of Fivetran tomorrow, what would I do? What would be the thing where I would walk in and be like, "I can't believe this idiot did this. We're changing this right away."

Ali Ghodsi (19:09)

Yeah.

George Fraser (19:20)

Do you have any tricks like that that you try to use to see your blind spots?

Ali Ghodsi (19:14)

I think one of the principles that I try to push for the company, and for myself, is to do what you said: we should do whatever a new company would do. That startup that's starting today, with no baggage and without all that revenue, what would they do? We need to do that. If we're deviating from what they would do, that's the disruption in the making. That's the little startup that no one cares about, that soon will be a gigantic threat to your whole business.

George Fraser (19:40)

I love that principle. I often encourage people to ask this question, "If we were starting over, how would we do it?" It isn't that you immediately go and do that.

Ali Ghodsi (19:49)

Yeah.

George Fraser (19:50)

Often the answer is, you don't do that.

Ali Ghodsi (19:51)

Yeah.

George Fraser (19:52)

But you should always think about it.

Ali Ghodsi (19:53)

Yes.

George Fraser (19:54)

You should always look at it, and you should know if you're not doing that, why not?

Ali Ghodsi (19:58)

Yeah.

George Fraser (19:59)

I think it's a very important exercise and very powerful.

Ali Ghodsi (20:03)

Yeah.

George Fraser (20:04)

We've discovered a lot of things that way.

Ali Ghodsi (20:05)

How have you changed as CEO, over here?

George Fraser (20:09)

I've changed in every way. I mean, in the early days, I was the fastest programmer. My most important contribution was just the sheer volume of code. There have been many different versions of my job and myself since then. I often say that I made a decision a long time ago that I'm going to become whatever person I need to be to do the job as it exists now, and it changes. Every couple of years, I would say, it's a completely different job and a completely different set of things to focus on. Now a lot of the work I do, for example, is around getting the right metrics, which was not important when we were very small. What about yourself? What has changed in what you need to be good at, as the company has grown?

Ali Ghodsi (21:06)

Same. I mean, every two years I think the job completely changes, new scale and new challenges. You've got to develop new skills, I do think it's like a new job every two years, essentially.

George Fraser (21:20)

When you look at Fivetran over the next year, the most important things we're working on are data lake support, because there are still ways we can make it better, cheaper for the customer and more performant, improvements that are happening right now and that will benefit us both. We're also looking at these enterprise data sources that have been around for a long time. They might not be the ones everyone's talking about right now, but they are incredibly important.

And enterprise scale is the other really big priority right now: being able to handle the largest companies in the world on each of those data sources. We have some of them now, and the thing that works at a scale of one terabyte is suddenly a different game at 50 terabytes; you have to use different strategies. Our vision at Fivetran is that no matter what, from the user's perspective, it just works. All of those challenges disappear behind that user interface. What are the key things that you're looking at for Databricks over the next, let's say, 12 months? What do you think are the most important vectors of innovation, and of adoption of recent innovation, that you're looking at?

Ali Ghodsi (22:30)

Yeah, let's start with the first one: "Hey, you own the data." By the way, those vendors you mentioned are like, "Oh wait, this is my data, don't take it, or if you're taking it, let's micromanage exactly what you're doing with it." I think they're all going to come around. You can see it. There are already pockets in those big companies that are like, "Hey, the data is leaving anyway. Who are we kidding? They need to join it with other data, and they're not going to join it with other data with us." That's the paradigm: the data moves into a central location, where it gets joined. You join all the different data sources that Fivetran helps you move, right? That's the paradigm. That's not going to change with GenAI or anything like that either. Even if we get superhuman intelligence and so on, you need to do fast joins across those data sets.

You're not going to do that in place. You need to do it in a place where you can do joins fast, and that's the relational database. I agree with that. You mentioned Delta Lake, which is the data format that lets our customers own their data, so that's great. The lakehouse, as you said, people have more or less adopted; it's the future. But there's a problem, a small little detail: there are actually several different formats. There's the Delta Lake open source project, there's the Iceberg open source project, there's the Hudi open source project, and there are probably others as well, and that sucks. Everyone agrees this is the right architecture, but there's this VHS/Betamax thing going on.

George Fraser (23:53)

I love that. I love that analogy. It's perfect.

Ali Ghodsi (23:55)

Once you pick Betamax, you can't use the VHS. It's like, "What?" "Yeah, literally you can't insert it... it won't fit." "Isn't it exactly the same thing?" "Yeah, it is. There's some small detail. Actually, this one is half an inch bigger." "Well, does it need to be?" "Not really. But anyway, you can't, so maybe buy three video players, one for each of these." That's really unfortunate. One of the big things is this UniForm project that we started, which asks: can we just generate the metadata for all three, so that if you store the data in Delta, you get all three out of the box? We want to push that to its extreme and make sure you get full compatibility with Hudi and with Iceberg, so that it works the same. It doesn't matter.

If you store your data in Delta, it is 100% the same, both on the read path and on the write path. There are other projects out there doing this too, such as a project called OneTable and so on. I really hope there are ways we can do this with Parquet in the future. But basically, if we can eliminate this VHS/Betamax problem so that there's one universal format, they all look and smell the same. Why do we need to balkanize things? That's going to be a big priority for us, to keep pushing on that and eliminate it. Nobody benefits from it. Actually, you know who benefits: the people who don't want the lakehouse architecture. Everyone who is in the lakehouse architecture loses from this little VHS/Betamax thing, because it balkanizes things, not all the workloads work with each other, and there are three different versions of the lakehouse.
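For readers curious what this looks like in practice, here is a hedged sketch of enabling UniForm on a Delta table through table properties. The property names follow the open source Delta Lake documentation at the time of writing and may vary by version; the catalog, schema and table names are hypothetical, and the Spark session is assumed to be configured with Delta Lake support.

```python
# A sketch of enabling UniForm so Iceberg-readable metadata is generated
# alongside the Delta transaction log. Verify property names against your
# Delta/Databricks runtime; they are assumptions here.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("uniform-sketch").getOrCreate()

spark.sql("""
    CREATE TABLE main.sales.orders (id BIGINT, amount DOUBLE)
    USING DELTA
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```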

George Fraser (25:28)

From your lips to God's ears; we're exactly on the same page. We also want this to happen. We're sitting at the other end, trying to figure out what we can do to help move the ecosystem toward this future where we eliminate the VHS/Betamax problem.

Ali Ghodsi (25:42)

Yep.

George Fraser (25:44)

I think it'll be great for customers, it'll be great for the ecosystem. It will help that virtuous cycle of competition, and innovation, when there is no lock-in, that we both want.

Ali Ghodsi (25:54)

Yep.

George Fraser (25:55)

Let's hope.

Ali Ghodsi (25:56)

Let's do it.

George Fraser (25:57)

Let's make it happen.

Ali Ghodsi (25:58)

Yeah, let's do it. That's a big priority for us. The other one is data intelligence, which I think is also important to Fivetran: how can we build AI models that are truthful, that do not hallucinate, and that give the correct answers enterprises want? Here's an example. We were working with a customer, and they had built something before we got there. They had a chatbot, and when you asked it for recommendations, like, "Oh, what's the best product in this category?" the chatbot would spit out recommendations for competitors' products: "That's a better product."

You don't want that, and you definitely don't want that in production. How do you do quality monitoring in production? How do you make sure the AI is tuned to do what you want it to do, actually trained on your data, fine-tuned so it's truthful, and that you have control over it and know what it's doing? That's why we call it data intelligence. The AI model has intelligence about the data. It understands the data that it got from Fivetran, and from that source, and it needs to be completely coherent and consistent with that data. That's going to be a big and important focus area for us this year.

George Fraser (27:05)

What you're talking about there is a legitimate research problem, because the way these language models work is somewhat mysterious.

Ali Ghodsi (27:14)

Yep, it is. I'm a big believer in RAG, retrieval-augmented generation. I think it's the only way to solve this problem, at least in the near future. I don't think there is a better approach. With just a large language model that is predicting the next word and producing the next word, who knows what it's going to spit out? You might not want it to say the thing that it's saying. In RAG, however, you have a database, and you insert the data that you got from Fivetran, and it is in there. These are the facts that we want the model to know, and by the way, you can update them. You can keep running Fivetran regularly, so the vector database is always up to date. No more, "Hey, sorry, I have a cutoff from 2021. I don't know the answer to that." You can update it in real time. You can put access control on it.

Then in the RAG system, you use the large language model to answer questions. You take the question, but then you instruct the large language model: you retrieve the most relevant facts from the database and say, "Hey, I'm giving you these facts. Please adhere to them, and respond using this information that I gave you from this database, right now." That has a bunch of benefits. One, you can update the database, so you can get this data from Fivetran more regularly and keep it up to date. Two, you can put access control on the database. You can say, "Well, I don't want George to have access to this very sensitive salary information here." Then third, I think you can reduce hallucinations, because you're grounding the model in that data, and you can also verify-

George Fraser (28:48)

If you're prompting it with something that you know, based on the search, is relevant to the question that was asked.

Ali Ghodsi (28:57)

Exactly. You also know the source, so the model can tell you, "By the way, the reason I'm saying this is because of this fact that came out of the database," and you can go verify. "Oh, I see." Or, "Actually no, it didn't. I misread that fact. It's actually not there."

At least there's a way to verify, rather than just saying, "Hey, the best product is that product," when that product doesn't even exist. Think of RAG as a whole system. You can now actually optimize the models, the LLMs, to be very truthful to what's in that database. You can train models that maybe are not good at many things, but are very good at paying close attention to what you give them. So you can start to specialize this whole system to be more and more truthful, and to pay attention to data quality. That's what we call data intelligence. I think that's going to be important for both of our companies, and for a lot of customers out there this is super important.
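To make the retrieval loop Ali describes concrete, here is a minimal, self-contained sketch. The embed() and llm() functions are hypothetical stand-ins for a real embedding model and model endpoint, a plain in-memory list plays the role of the vector database, and the example documents are invented.

```python
# A toy RAG loop: embed documents, retrieve the closest ones by cosine
# similarity, and prompt the model to answer only from those facts.
import numpy as np

documents = [
    "Fivetran syncs the Salesforce connector into the lakehouse hourly.",
    "Salary data is restricted to the HR group.",  # access control would filter this out
]

def embed(text: str) -> np.ndarray:
    # Placeholder: a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(16)

def llm(prompt: str) -> str:
    # Placeholder: a real system would call a language model here.
    return "[answer grounded in the retrieved facts]"

# The "vector database": rebuilt whenever Fivetran delivers fresh data.
index = [(doc, embed(doc)) for doc in documents]

def retrieve(question: str, k: int = 1) -> list[str]:
    q = embed(question)
    cosine = lambda v: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return [doc for doc, vec in sorted(index, key=lambda p: -cosine(p[1]))[:k]]

def answer(question: str) -> str:
    facts = "\n".join(retrieve(question))
    prompt = f"Use ONLY these facts and cite them:\n{facts}\n\nQuestion: {question}"
    return llm(prompt)

print(answer("How often is Salesforce data refreshed?"))
```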

George Fraser (29:41)

I think retrieval-augmented generation makes a ton of sense. At one level, it looks like a little bit of a hack. It's like, "Well, I'm just going to do a search, and then I'm going to feed the relevant text into the context window of the language model, and then ask it about it." But sometimes hacks are amazing, and it has the benefit that you can bring a lot of the traditional ideas from relational databases, about things like access control, into a language model pipeline, in a straightforward way.

Sometimes people ask, "Hey, should Fivetran support vector databases as targets, in order to support these retrieval-augmented generation pipelines?" My answer so far is, "I don't think so." I think that is best done as a post-processing pipeline, because it's not like you're just going to put all of your data into a vector database for retrieval-augmented generation.

Ali Ghodsi (30:40)

Yeah.

George Fraser (30:41)

You're going to want to put highly selected data in there. You're going to want to pre-process it. You're going to change your mind sometimes about what approach you want to take to embedding, and so you're going to need to replay the data. That looks to me a lot like the transformation and curation steps that we've always said are best thought of as a separate problem, done as a second step. You've thought about this a lot too, though. Am I wrong? Is the new popular destination for Fivetran next year going to be a vector database?

Ali Ghodsi (31:13)

I'm sure you'll have to add it, and there'll be requests for it, but I think you're right. I think that processing needs to happen, and it turns out that exactly how you "chunk" that data before you put it in there, how you do the ranking, and exactly how you tune the embeddings, those things matter a lot to the quality you're going to get. I do think for each application you want to get those right. So yeah, there's some processing that happens in between, and getting that right is one of our focus areas. But it's not as if the market has said, "Oh, we've figured this out," and everybody does it the same way and it's old news from the '70s. It's not. This is active research, but that's the second focus area for us: this data intelligence.
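As a small illustration of the in-between processing Ali mentions, here is a sketch of the chunking step. The window size and overlap are arbitrary knobs to tune per application, not recommendations, and the document text is a stand-in.

```python
# Split a document into overlapping character windows before embedding.
# Size and overlap are per-application tuning knobs, chosen arbitrarily here.
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

document = "Support ticket transcript... " * 200  # stand-in for synced text
pieces = chunk(document)
print(f"{len(pieces)} chunks of up to 500 characters each")
# Each chunk would then be embedded and written to the vector index; change
# the chunking or the embedding model, and you replay this step from scratch.
```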

George Fraser (31:57)

You and I have both been working in this space for over 10 years, and there’s been a lot of change. In the beginning, the hot thing was Hadoop.

Ali Ghodsi (32:07)

Yep.

George Fraser (32:08)

One of the most fun and exciting things about this space, and sometimes the scariest, is that it's not over.

Ali Ghodsi (32:15)

Yep.

George Fraser (32:16)

There’s a new wave of innovation around the corner it seems, every couple of years. To me, that’s just a part of the fun, agree?

Ali Ghodsi (32:29)

Yeah, I'm looking forward to seeing what happens next, with AI, GenAI, the new data sources, and the renewed excitement around data and AI from everyone in the world now. It will be interesting to see.

George Fraser (32:38)

Couldn’t agree more.

Mentioned in the episode

  • Why Fivetran supports data lakes (Product)
  • How to build a data foundation for generative AI (Data insights)

More Episodes

PODCAST · 26:38
Why everything doesn’t need to be Gen AI (AI/ML)