There’s unprecedented industry buzz around generative AI, but what does it really look like for companies that are trying to deploy it? George Fraser and Ali Ghodsi have the inside track, since Fivetran and Databricks are two key platforms that make GenAI happen. Here’s what they see:
- A third of all workloads on Databricks are AI-related
- Of the Databricks AI workloads, most are traditional machine learning, such as generalized linear models (GLMs) and logistic regression
None of this surprises Ghodsi: “Every company we talk to wants to do generative AI. There are multiple departments within companies that are fighting, saying ‘I own it,’ ‘No, I own it!’” But there’s much more to it than office politics.
“The dirty secret of AI is that the hardest part is the data,” reveals Ghodsi. “Are there data sources you can extract more from? That’s the secret sauce. That’s where Fivetran comes in.”
Most companies are still working on the building blocks. There’s more posturing and head-scratching than game-changing developments. “This is going to be the year where the rubber hits the road for GenAI,” Ghodsi suggests. “Are you going to be in production and is it actually bringing the business value you wanted?”
The take-home message: It’s unrealistic to just spin up an LLM and expect it to provide value. Before embarking on a generative AI effort, consider whether your use case would be better served by simpler predictive modeling or even heuristics. Finally, without the ability to quickly centralize data from any source, enterprise organizations won’t be successful with GenAI, or AI of any kind.
Data ownership in the age of AI
Third-party data ownership is one of the most important issues right now, especially with GenAI’s insatiable need for data. For all analytics use cases – business intelligence, machine learning and generative AI – the more data, the better, assuming companies can integrate it from third-party sources.
For Ghodsi, customer-owned data is foundational to Databricks’ Data Intelligence Platform design. He would rather offer a level playing field and, in his words, let the best vendor win. “The customer owns the data. Once you’re using Databricks, we can’t just say, ‘Okay, we’re done here. We’ve got the customer and they can’t leave us.’ No, the data is on the lake.”
Intentionally avoiding vendor lock-in is one way he’s hoping to disrupt the market. Databricks wants the customer to have choices, even if it means they’ll try competing lakehouse services from Microsoft, Google, Snowflake or AWS. However, he’s quick to point out that openness isn’t the only thing those platforms are getting wrong.
“If there’s one misconception about the lakehouse, it’s that if you store all your data in a data lake and run workloads on it, then you have a lakehouse,” Ghodsi adds. “No, that’s the data lake architecture from 2010. To really get [the lakehouse], you also need governance with fine-grained security.”
Ensuring factual and relevant GenAI
Enterprises need to build AI models that produce factual and relevant outputs. Without proper quality control, generative AI can hallucinate, producing false or nonsensical answers, or otherwise generate undesirable results. Databricks recently engaged with a customer who had deployed a chatbot that recommended competitors’ products to its customers, a very public and disheartening moment.
“You don’t want that in production,” Ghodsi shares. “But how do you do quality monitoring in prod? How do you tune the AI and train it on your data so it’s truthful? How do you control it and know what it’s doing?” Both he and Fraser have an answer: retrieval augmented generation (RAG).
RAG retrieves relevant facts and context from a separate data store, typically a vector database, and adds them to the prompt sent to an LLM. It’s purpose-built for this problem. For any RAG deployment, Ghodsi recommends keeping that store fresh: “You can keep running Fivetran regularly and update it so your vector database is always up-to-date. No more ‘Sorry, I have a cutoff from 2021 and can’t answer that.’ You can update it in real time and have access control on it.”
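To make the retrieve-then-enrich idea concrete, here is a minimal sketch in Python. It is not Fivetran's or Databricks' implementation: `ToyVectorStore`, `embed`, and `build_prompt` are hypothetical names, the word-count "embedding" stands in for a real embedding model, and the final LLM call is left out. It only shows how re-syncing a record changes what the prompt contains.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store; a real deployment would use a vector database."""

    def __init__(self):
        self.docs = {}  # doc_id -> (text, vector)

    def upsert(self, doc_id, text):
        # Re-syncing the same doc_id overwrites the stale version.
        self.docs[doc_id] = (text, embed(text))

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.docs.values(), key=lambda d: cosine(q, d[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def build_prompt(store, question, k=2):
    """Enrich the question with the k most similar stored facts."""
    context = "\n".join(f"- {t}" for t in store.search(question, k))
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {question}"


store = ToyVectorStore()
store.upsert("pricing", "The Pro plan costs 40 dollars per month.")
store.upsert("support", "Support hours are 9am to 5pm on weekdays.")
# A later sync replaces the outdated pricing record (illustrative data only).
store.upsert("pricing", "The Pro plan costs 45 dollars per month.")

prompt = build_prompt(store, "How much does the Pro plan cost?", k=1)
print(prompt)
```

Because the model is handed the current record at question time, keeping the store in sync matters more than when the model itself was trained.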
With RAG in place, companies can make their models more truthful by focusing on data quality and governance. Because each answer draws on retrieved data, engineers can investigate which sources informed it and verify why the model produced it. LLMs are more truthful with high-quality, well-governed data.
Fraser sees RAG as the way through hallucinations. “At one level, it looks like a little bit of a hack, but sometimes hacks are amazing. It has this benefit that you can bring a lot of the traditional ideas from relational databases, like access control, into a language model pipeline in a straightforward way.”
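One way to picture the access-control point is a retrieval step that filters by permission before anything reaches the prompt, much like a row-level filter in a relational query. The sketch below is an illustration under stated assumptions, not a description of any vendor's feature: `Chunk`, `allowed_roles`, and `retrieve_for_user` are hypothetical names, and simple word overlap stands in for vector similarity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    allowed_roles: frozenset  # roles permitted to read this chunk (hypothetical schema)


def retrieve_for_user(chunks, user_roles, question, k=3):
    """Return up to k chunks the user may read, ranked by words shared with the question."""
    q_words = set(question.lower().split())
    # The access check runs before ranking, so restricted text never reaches the prompt.
    visible = [c for c in chunks if c.allowed_roles & user_roles]
    scored = [(len(q_words & set(c.text.lower().split())), c) for c in visible]
    scored = [(s, c) for s, c in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c.text for _, c in scored[:k]]


chunks = [
    Chunk("q3 revenue forecast is 12 million", frozenset({"finance"})),
    Chunk("q3 product launch is scheduled for september", frozenset({"finance", "sales"})),
]

sales_view = retrieve_for_user(chunks, {"sales"}, "what is planned for q3")
finance_view = retrieve_for_user(chunks, {"finance"}, "what is planned for q3")
print(sales_view)
print(finance_view)
```

The same question yields different context for different users, and the language model itself needs no knowledge of the permission rules.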
Listen to the full episode of the Fivetran Data Podcast here.