This article was first published on Forbes Tech Council on September 12, 2024.
Data integration is a leading concern for enterprise executives: 82% of senior executives consider scaling AI a top priority, and AI cannot scale without well-integrated data. However, this ambition is frustrated by the longstanding practice of maintaining separate data architectures for batch and streaming use cases. Batch use cases, such as business intelligence and reporting, typically involve relatively modest amounts of structured data queried from data warehouses to support decisions that are unlikely to be revised by the second or minute. By contrast, streaming use cases such as fraud detection, real-time recommendations and business process automation rely on real-time, high-volume feeds of data into data lakes so systems can react dynamically to changes in customer or operational activity. These workloads have historically run on different technology stacks because they fundamentally require different types of data, levels of throughput and turnaround times.
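To make the split concrete, here is a minimal, hypothetical sketch of the two stacks side by side. It assumes a warehouse table named `orders`, a Kafka topic named `payments` and the kafka-python client; the fraud-scoring function is a placeholder, not a real model.

```python
# Batch side: a warehouse query behind a daily report.
# sqlite3 stands in here for a warehouse client (Snowflake, Redshift, etc.).
import sqlite3

conn = sqlite3.connect("warehouse.db")
daily_revenue = conn.execute(
    "SELECT order_date, SUM(amount) FROM orders GROUP BY order_date"
).fetchall()

# Streaming side: an event consumer that must react within seconds.
from kafka import KafkaConsumer  # pip install kafka-python

def fraud_score(payload: bytes) -> float:
    # Placeholder; a real system would call a trained model here.
    return 0.95 if b"suspicious" in payload else 0.1

consumer = KafkaConsumer("payments", bootstrap_servers="localhost:9092")
for event in consumer:
    if fraud_score(event.value) > 0.9:
        print("flagging transaction:", event.value)
```

Two clients, two query models and two operational surfaces: exactly the duplication described above.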
[CTA_MODULE]
Separate data architectures pose serious long-term challenges. They require expensive, duplicate provisioning of infrastructure and engineering time. Duplicate data stacks and data models constructed under different assumptions may produce conflicting versions of the truth. Furthermore, redundancy multiplies the effort and expense of governance and security.
Until recently, there was no alternative to this segregated data architecture. The emergence of the governed data lake changes that: it combines the capabilities of data warehouses and data lakes, allowing a single data stack to support both batch and streaming use cases. The ability to unify all data use cases under a common platform with heavy use of automation will grow in importance as organizations produce ever larger and more granular digital data footprints. A unified, automated data platform will also be an essential capability for supporting generative AI as use cases emerge that leverage data from novel combinations of sources.
We have seen these capabilities in action; one of our customers, Saks, completely rebuilt and unified its data ecosystem, using the newfound flexibility and scalability to support a growing roster of data-driven products such as personalized shopping experiences through AI agents.
Twin pillars of a strong data foundation
A solid data foundation enables a team to move, store, analyze and productionize data. This foundation should also minimize friction between all tools and technologies and offer modularity to ensure extensibility and avoid vendor lock-in.
The first pillar of this data foundation is a unified data platform, specifically a governed data lake with structured data lake formats. Data lakes are inherently scalable, flexible and cost-effective, a form of commodity storage that can be combined with a team’s choice of compute engine. When used with structured data lake formats such as Apache Iceberg and Delta Lake, data lakes benefit from governance, security and a relational structure that can be easily accessed using SQL. Increasingly, governed data lakes accompany ecosystems of complementary tools for analytics and application development.
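As an illustration, here is a minimal PySpark sketch of a governed table in Apache Iceberg. The catalog name, bucket path and schema are assumptions made for the example, and it presumes the Iceberg Spark runtime is on the classpath.

```python
from pyspark.sql import SparkSession

# Register a hypothetical Iceberg catalog named "lake" backed by object storage.
spark = (
    SparkSession.builder
    .appName("governed-lake-demo")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Iceberg tables look and behave like relational tables...
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.sales.orders (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(10, 2),
        order_ts    TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(order_ts))
""")

# ...and are queried with plain SQL, on top of commodity object storage.
spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM lake.sales.orders
    GROUP BY customer_id
""").show()
```

Because the table metadata lives in an open format, other engines that support Iceberg (Trino and Flink, among others) can read the same table without copying the data.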
The second critical pillar of a strong data foundation is automation. In the past, maintaining parallel data architectures meant mastering different engineering languages and paradigms for batch and streaming use cases. Automation can sidestep this problem by enabling teams to assemble a data architecture from easy-to-use, off-the-shelf tools.
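One concrete example of this convergence is Spark Structured Streaming, where the same DataFrame transformations serve both batch and streaming inputs. The paths and schema below are illustrative only.

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("unified-pipeline").getOrCreate()

def spend_by_customer(orders: DataFrame) -> DataFrame:
    # Identical business logic regardless of how the data arrives.
    return (orders
            .filter(F.col("amount") > 0)
            .groupBy("customer_id")
            .agg(F.sum("amount").alias("total_spend")))

# Batch: run the logic over a static snapshot.
raw_batch = spark.read.parquet("s3://my-bucket/orders/")
batch_result = spend_by_customer(raw_batch)

# Streaming: run the same function over a live stream of arriving files.
raw_stream = spark.readStream.schema(raw_batch.schema).parquet("s3://my-bucket/orders/")
stream_query = (spend_by_customer(raw_stream)
                .writeStream
                .outputMode("complete")
                .format("console")
                .start())
```

The business logic is written once; only the read and write edges differ between the two modes.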
Rather than building and maintaining functions like data pipelines within a bespoke data architecture, data teams should focus on selecting the right tools. This considerably decreases time to value, lowers overall engineering overhead and frees the time saved for higher-value analytics and engineering work. As the sheer breadth of data sources used by modern organizations continues to grow, bespoke data architectures will become increasingly impractical and automation increasingly necessary.
Unified data platform for generative AI
A unified data platform helps teams avoid one of the most common pitfalls of analytics, including generative AI: rushing an AI model into development without a strong data foundation. But complex data pursuits like generative AI bring additional challenges as well.
At the implementation level, simply throwing large volumes of data at a model won’t necessarily enhance its performance. Instead, your team needs to be judicious about the data used for training and augmentation and ensure it is well-organized.
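A hedged sketch of what "judicious" can mean in practice: before documents are used for training or retrieval augmentation, deduplicate them, drop fragments too short to be useful and record provenance for later audits. The thresholds and field names here are arbitrary assumptions.

```python
import hashlib

def curate(documents: list[dict]) -> list[dict]:
    """Drop duplicates and fragments; attach a checksum for provenance."""
    seen: set[str] = set()
    curated = []
    for doc in documents:
        text = doc.get("text", "").strip()
        if len(text) < 50:  # drop fragments unlikely to help the model
            continue
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:  # drop exact duplicates
            continue
        seen.add(digest)
        curated.append({**doc, "checksum": digest})  # keep an audit trail
    return curated

docs = [
    {"source": "kb", "text": "Refund policy: customers may return items within 30 days of purchase."},
    {"source": "kb", "text": "Refund policy: customers may return items within 30 days of purchase."},
    {"source": "chat", "text": "ok"},
]
print(len(curate(docs)))  # 1: one duplicate and one fragment removed
```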
Security is another major concern, even with a streamlined data architecture. Data breaches at leading corporations in recent years have led to financial losses in the hundreds of millions. Critical considerations include access control, internal governance, encryption and, if necessary, private networking.
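As one illustration of access control and internal governance, a common pattern is to expose a masked view and grant consumers access to the view rather than the underlying table. The names below are hypothetical, and GRANT syntax and enforcement depend entirely on your catalog or governance layer (for example, Unity Catalog or Apache Ranger); plain open-source Spark does not enforce grants on its own.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("governance-demo").getOrCreate()

# Analysts query hashed identifiers instead of raw PII.
spark.sql("""
    CREATE OR REPLACE VIEW sales.orders_masked AS
    SELECT sha2(CAST(customer_id AS STRING), 256) AS customer_hash,
           amount,
           order_ts
    FROM sales.orders
""")

# Enforcement comes from the governance layer, not Spark itself.
spark.sql("GRANT SELECT ON sales.orders_masked TO analysts")
```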
Smart and steady
Amidst the excitement of new technologies and capabilities, due diligence remains critical—your organization’s reputation and bottom line depend on it. Generative AI, like all data use cases, requires a robust data foundation. Practical applications for generative AI—automated customer support, code development acceleration, rapid prototyping of products—may utilize the full range of an organization’s structured, unstructured, batch and streaming data. As organizations pursue ever more advanced and sensitive uses for data, the importance of limiting redundancy and duplication while easing governance and security will only continue to grow.
For as long as big data and cloud computing have been in play, data engineering has faced the challenge of lock-in with costly, complex legacy systems. The emergence of the governed data lake and generative AI presents a unique opportunity to reevaluate your data architecture from first principles and ensure it can meet emerging needs.
[CTA_MODULE]