Build vs. buy for AI: Choosing the right data foundation

When faced with new innovations like GenAI, every company must avoid pitfalls that result in unscalable and unsustainable solutions.
September 26, 2024

A new MIT Technology Review Insights report highlights that 64% of C-suite executives are prioritizing data readiness to achieve AI success, but they’re encountering challenges building the data foundation it requires. Based on my experience working with hundreds of enterprise companies over the last 20 years, this isn’t surprising.

The biggest pitfalls I see for struggling organizations are related to data integration and pipelines. It’s not just anecdotal — data integration is the number one challenge for 45% of the MIT report respondents. Trustworthy and accurate AI requires datasets that can only be produced by high-quality, efficient pipelines. Unfortunately, the most common approach to enterprise data pipelines remains legacy DIY methods that might have been effective in the past, but now contribute to average losses of $406 million per year.

You cannot build a strong data foundation for AI without automated, reliable and secure data integration. Problems with data integration will cause downstream problems in governance, security and data quality, as the MIT report clearly shows. 

This gets especially tricky with GenAI because it’s such a seductive technology. Prompt an LLM and you’ll get a well-phrased, clearly written response that feels accurate even if it isn’t. Without a data quality mitigation strategy like retrieval-augmented generation (RAG) to incorporate proprietary business data, businesses are more likely to encounter costly AI hallucinations. RAG requires reliable pipelines, and the GenAI response provides no indication when such pipelines are, well, a pipe dream.
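
To make the RAG idea concrete, here’s a minimal sketch of the pattern in Python: retrieve the most relevant proprietary records first, then constrain the prompt to them. The keyword-overlap scoring, sample documents and `build_grounded_prompt` helper are illustrative assumptions, not any particular product’s API; production systems typically use vector search over data landed by the kinds of pipelines discussed here.

```python
# Minimal RAG sketch: ground the LLM prompt in retrieved business data.
# The documents, scoring and prompt format are hypothetical examples.

def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
    """Naive keyword-overlap retrieval; real systems use vector search."""
    terms = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(terms & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_grounded_prompt(query: str, documents: list[str]) -> str:
    """Instruct the model to answer only from the retrieved context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    return (
        "Answer using ONLY the context below. "
        "If the context is insufficient, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

docs = [
    "Our returns policy changed to 30 days, effective July 1.",
    "Q3 revenue for the retail segment was $12.4M.",
    "Warehouse B handles all west-coast fulfillment.",
]
print(build_grounded_prompt("What is our returns policy?", docs))
```

The sketch exposes the dependency that matters: if the pipeline feeding those documents is stale or broken, retrieval grounds the model in bad data, and the answer still reads as confident.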

Competitive differentiation comes from what you build with GenAI; reliable data movement is just the prerequisite. Yes, DIY pipelines can work for AI, but there’s plenty of evidence that it’s much harder than using a commercially available, enterprise-grade data integration platform. Shouldn’t you focus on competitive differentiation instead of building and maintaining pipelines?

[CTA_MODULE]

Pitfalls of a DIY data foundation

DIY data pipelines do not scale. DIY pipelines are often built at a moment in time to solve a specific need for a specific, sometimes proprietary, data source. But what happens when the scale of the need changes? When the data volumes dramatically increase? When the logic of what fields, tables or schemas changes? When the expert who built the pipeline(s) leaves the company? That’s when DIY pipelines become expensive liabilities that few, if any, people know how to support.

DIY pipelines inevitably cost enterprises large sums of money, resources and time throughout their lifetime. This includes building the pipeline, maintaining it, updating it and eventually deprecating it when a scalable replacement is found. The desire for total control is understandable, even when the downsides are high, but future-proofing is genuinely hard: predicting the complexity and scale a pipeline will one day need to handle is incredibly difficult.

When Mike Hite took the CTO role for the global luxury retailer Saks, he inherited dozens of slow, DIY pipelines with no monitoring. Any new data integrations took up to six months to ramp to production. Hite needed scalable, future-proof pipelines to refocus his team on higher-value work, so he chose Fivetran.

“The beauty of Fivetran is that it solves a very complex problem very simply for us: ingesting lots of different data. It’s one of the fundamental pieces of our AI strategy and allows us to bring in new novel data sets and determine whether they’ll be useful for us.” — Mike Hite, CTO at Saks Fifth Avenue

Hite’s team quickly transitioned out of DIY pipeline support and into adopting new technologies like AI, ML and GenAI. Saks is ahead of its competitors because it isn’t distracted by problems that are easily solved by commercial solutions.

When faced with new innovations like GenAI, every company should ask the classic “buy vs. build” question. Building DIY is tempting, especially given how easy it is to get a “quick and dirty” prototype up and running — aided by GenAI or not — but a prototype is not future-proof for ongoing business needs. For (misguided) budget reasons or otherwise, organizations build instead of buy, not realizing that the legacy approach will eventually present insurmountable engineering problems. That’s why organizations need to be smart about the technology they choose to manage and maintain their pipelines.

Simplifying AI problem-solving with better pipelines

A reliable data pipeline means one less thing to troubleshoot when AI doesn’t provide a correct or expected response. With DIY pipelines, you have to question your own integrations, investigate where data quality issues occurred at every layer and determine whether the pipeline itself is the culprit. However, with a modern, automated solution like Fivetran, you can remove the pipeline from the data quality equation, knowing there’s built-in schema change support and automatic propagation of data source changes.
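
To illustrate the contrast, here is a hedged sketch (in Python, with hypothetical field names) of the bare-minimum schema-drift detection a DIY pipeline has to carry on its own. Detecting the drift is the easy part; the table alterations, backfills and alerts behind each branch are the ongoing maintenance burden an automated platform absorbs.

```python
# DIY schema-drift check: compare an incoming record against the schema the
# warehouse expects. The expected schema and the sample row are hypothetical.

EXPECTED_SCHEMA = {"order_id": "int", "customer_id": "int", "total": "float"}

def detect_schema_drift(incoming_row: dict) -> tuple[set, set]:
    """Return (added_fields, removed_fields) relative to the expected schema."""
    incoming, expected = set(incoming_row), set(EXPECTED_SCHEMA)
    return incoming - expected, expected - incoming

# The source renamed "total" to "total_amount", a routine upstream change.
row = {"order_id": 1001, "customer_id": 7, "total_amount": 59.90}
added, removed = detect_schema_drift(row)
if added or removed:
    # In a DIY pipeline, everything from here on is code you own: ALTER TABLE
    # statements, backfills, downstream model updates and on-call alerts.
    print(f"Schema drift detected: added={added}, removed={removed}")
```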

The resource and operational burdens of DIY pipelines are high. While there’s potential for proprietary innovation if done right, the maintenance becomes overwhelming when data sources frequently change. For instance, if 30 data sources adjust their schemas over the course of a year, that’s 30 pipelines you have to update and maintain, all while ensuring business-critical functions remain uninterrupted. This burden becomes especially pronounced when real-time decision-making is required.

With the right data pipeline, organizations can reliably provide real-time data access, a capability that directly improves AI performance and outcomes.

[CTA_MODULE]
