Generative AI is a power tool for the human mind, accelerating intellectual and creative work of all kinds. The core capabilities of generative AI — information retrieval, synthesis, and ideation — offer transformative potential for enterprises in every industry. According to McKinsey, generative AI may add between $6.1 trillion and $7.9 trillion to the global economy annually in the coming decades.
Like all analytics, generative AI depends on reliable, centralized access to data. Data centralization enables data exploration and the development of data products. Once deployed, an AI model needs fresh data to ensure its outputs reflect ongoing developments.
Unstructured data is particularly important in AI
Conventional analytics, like business intelligence, reporting, and predictive modeling, are typically performed on structured data — fields organized into tables or markup documents. Structured data usually records transactions performed by applications and database backends. These granular digital footprints provide invaluable insights into an organization’s operations.
However, most of an organization’s data — between 80% and 90% — is unstructured, consisting of text, images, code, video, audio, and other digital assets found in correspondence, documentation, knowledge bases, marketing collateral, code bases, asset libraries, and other sources. This unstructured data contains a wealth of insight, much of it qualitative and difficult to capture in a table.
Generative AI models are designed specifically to help users understand and leverage large volumes of unstructured data. Large language models (LLMs), for instance, are trained on vast corpora of text in order to extract semantic and contextual relationships between words. This enables a number of practical use cases. At a basic level, large language models are like search engines on steroids, offering an unparalleled ability to retrieve, summarize, and iterate on information. A model trained on your company’s internal documentation can answer questions about company policies. Another model trained on your engineering team’s codebase can act as a copilot, helping engineers quickly write code based on known patterns.
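As a toy illustration of the retrieval pattern behind such documentation assistants, the sketch below ranks internal documents by word overlap with a question and assembles a grounded prompt. The function names, sample documents, and the word-overlap heuristic are all illustrative assumptions; production systems typically rank documents with vector embeddings rather than keyword overlap.

```python
# Toy sketch of retrieval-augmented answering over internal documents.
# All names and the word-overlap scoring are illustrative; real systems
# usually rank with vector embeddings, not keyword overlap.

def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by how many words they share with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(query: str, context_docs: list[str]) -> str:
    """Assemble a prompt that grounds the model in retrieved context."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "PTO policy: employees accrue 1.5 vacation days per month.",
    "Expense policy: meals are reimbursed up to 50 dollars per day.",
]
question = "How many vacation days do employees accrue per month?"
prompt = build_prompt(question, retrieve(question, docs))
```

The key design point is the same at any scale: the model never answers from its training data alone; it answers from the freshest documents the retrieval step surfaces, which is why centralized, up-to-date data matters.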
Nonetheless, structured data remains important as the backbone of reporting, business intelligence, and predictive modeling (i.e., non-generative AI). It is also readily converted into unstructured data — a table of figures, for instance, can be turned into a series of declarative, factual statements:
"In 2024, the conversion rate of enterprise accounts was 35%."
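Producing such statements from a table is a simple templating step, sketched below. The field names and sample row are hypothetical, chosen only to reproduce the statement above.

```python
# Hedged sketch: turning a structured row into a declarative statement
# that an LLM can consume as context. Field names are hypothetical.

def row_to_statement(row: dict) -> str:
    return (
        f"In {row['year']}, the {row['metric']} of "
        f"{row['segment']} accounts was {row['value']}."
    )

row = {"year": 2024, "segment": "enterprise",
       "metric": "conversion rate", "value": "35%"}
statement = row_to_statement(row)
# statement == "In 2024, the conversion rate of enterprise accounts was 35%."
```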
Perhaps more importantly, an AI agent connected to a predictive model can pass such data along to that model and incorporate its output into its own analysis. Many questions require both quantitative and qualitative data to answer; one of the principles of prompt engineering is that more context is nearly always better. Academic papers, after all, include not only tables of figures but written analysis and even source code. Large language models are notoriously bad at purely computational questions and may benefit from integration with other systems that specialize in mathematical reasoning.
Some data is semi-structured – a table or markup document may, for instance, contain large text fields. The Fivetran Unified RAG dbt package, based on our work on FivetranChat, handles such data.
The importance of unstructured data for AI has made the data lake a critical destination, because it can readily accommodate both unstructured and structured data at scale. Teams deploying AI must solve the basic challenge of integrating both structured and unstructured data from disparate sources into data lakes.
Why unstructured data is especially challenging to integrate
Structured data, especially tabular, relational data, must conform to standardized naming conventions and formats, and usually comes with a predefined schema outlining relations between different concepts, as well as metadata indicating the semantic meaning of each element. In short, it is far easier to ensure quality and governance with structured data.
By contrast, unstructured data is inherently unsuited for storage with standardized formatting and is not automatically bundled with schema enforcement and metadata. Unstructured data can encompass a bewildering range of different media in a huge range of formats and at very large volumes.
As such, it is inherently more difficult to guarantee the quality and regulatory compliance of unstructured data, or to govern it more generally.
Automated data integration provides the answer
Fresh, accurate, compliant, and governed data are not optional, especially with public-facing AI deployments. In general, the volume, velocity, and variety of modern data pose vexing challenges in data integration.
While these problems are far from impossible to solve, they represent a tremendous investment in engineering time. The solution to integrating structured data, as Fivetran has long advocated, is automated data integration. Our extensive catalog of more than 700 connectors encompasses common SaaS, ERP, and transactional database sources. Our database connectors, in particular, feature capabilities like in-pipeline configurations and row filtering, giving your team granular control over what data is integrated and how. A major element of data integration is data curation, ensuring that only the most useful and relevant data reaches your destination.
Automated data integration is also the solution for unstructured data. While SharePoint and Google Drive excel at storing shared knowledge, they aren’t designed to uncover insights hidden within unstructured documents. With Fivetran’s file connectors, you can now centralize unstructured files, such as PDFs, images, and documents, alongside structured data in your warehouse or lake. As generative AI techniques evolve, this consolidation lets you refine downstream transformations without re-executing the entire pipeline.
If you need to integrate data from an unsupported source, our Connector SDK allows your team to construct a new connector compatible with the Fivetran core application. By building through the Fivetran platform, you can always expect the utmost in scalability, reliability, and security.
Don’t just take our word for it—HubSpot used Fivetran to integrate previously inaccessible text-based human resources information, using generative analytics to glean insights into employee performance and management practices. Likewise, according to Mike Hite, CTO of Saks, “Fivetran solves a very complex problem very simply for us: ingesting lots of different data. It’s one of the fundamental pieces of our AI strategy and allows us to bring in new novel data sets and determine whether they’ll be useful for us.”
[CTA_MODULE]