The importance of unstructured data for AI

Much of your organization’s most important data is unstructured. Automated data integration allows you to access it, and AI to analyze it.
March 11, 2025

Generative AI is a power tool for the human mind, accelerating intellectual and creative work of all kinds. The core capabilities of generative AI — information retrieval, synthesis, and ideation — offer transformative potential for enterprises in every industry. According to McKinsey, generative AI may add between $6.1 trillion and $7.9 trillion to the global economy annually in the coming decades.

Like all analytics, generative AI depends on reliable, centralized access to data. Data centralization enables data exploration and the development of data products. Once deployed, an AI model needs fresh data to ensure its outputs reflect ongoing developments.

Unstructured data is particularly important in AI

Conventional analytics, like business intelligence, reporting, and predictive modeling, are typically performed on structured data — fields organized into tables or markup documents. Structured data usually records transactions performed by applications and database backends. These granular digital footprints provide invaluable insights into an organization’s operations.

However, most of an organization’s data — between 80% and 90% — is unstructured, consisting of text, images, code, video, audio, and other digital assets found in correspondence, documentation, knowledge bases, marketing collateral, code bases, asset libraries, and other sources. This unstructured data contains a wealth of insight, much of it qualitative and difficult to capture in a table.

Generative AI models are designed specifically to help users understand and leverage large volumes of unstructured data. Large language models (LLMs), for instance, are trained on vast corpora of text in order to extract semantic and contextual relationships between words. This enables a number of practical use cases. At a basic level, LLMs are like search engines on steroids, offering an unparalleled ability to retrieve, summarize, and iterate on information. A model trained on your company’s internal documentation can answer questions about company policies. Another model trained on your engineering team’s codebase can act as a copilot, helping engineers quickly write code based on known patterns.
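
The retrieval capability described above can be sketched in a few lines. The snippet below is a toy illustration, not a real LLM pipeline: it ranks made-up policy documents by word overlap with a question, the simplest stand-in for the embedding-based retrieval a production system would use. All document names and contents are invented for the example.

```python
# Toy retrieval sketch: pick the internal document most relevant to a
# question by counting shared words. Real deployments use embeddings,
# but the flow — score each document, surface the best match as
# context for the model — is the same.
docs = {
    "pto_policy": "Employees accrue 15 days of paid time off per year.",
    "expense_policy": "Submit expense reports within 30 days of purchase.",
}

def retrieve(question: str) -> str:
    q_words = set(question.lower().split())
    def overlap(text: str) -> int:
        return len(q_words & set(text.lower().split()))
    # Return the name of the document with the highest word overlap.
    return max(docs, key=lambda name: overlap(docs[name]))

print(retrieve("How many days of paid time off do employees get?"))
# → pto_policy
```

In a real deployment, the retrieved document would be appended to the model’s prompt so it can answer the policy question with grounded context.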

Nonetheless, structured data remains important as the backbone of reporting, business intelligence, and predictive modeling (i.e., non-generative AI). It is also easily converted into unstructured data — a table of figures, for instance, can readily be turned into a series of declarative, factual statements:

year | account_segment | conversion_rate
2024 | enterprise      | 35%

"In 2024, the conversion rate of enterprise accounts was 35%."

Perhaps more importantly, an AI agent connected to a predictive model can pass such data to that model and combine its output with the agent’s own analysis. Many questions require both quantitative and qualitative data to answer; one of the principles of prompt engineering is that more context is nearly always better. Academic papers, after all, include not only tables of figures but written analysis and even source code. Large language models are notoriously bad at purely computational questions and may benefit from integration with other systems that specialize in mathematical reasoning.
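
The delegation pattern described above can be sketched as follows. This is a hedged illustration, not a real agent framework: the predictive model is a hard-coded stand-in, and in practice the computed result would be appended to an LLM prompt rather than returned as a string.

```python
# Sketch of an agent delegating computation: rather than asking a
# language model to do the math, the agent routes the quantitative
# sub-question to a specialized system and folds the result back
# into its context.
def predictive_model(segment: str) -> float:
    # Stand-in for a real predictive model's forecast output.
    forecasts = {"enterprise": 0.37, "smb": 0.22}
    return forecasts[segment]

def agent_answer(question: str, segment: str) -> str:
    forecast = predictive_model(segment)  # delegate the computation
    context = f"Forecast conversion rate for {segment}: {forecast:.0%}"
    # In a real agent, `context` would be added to the LLM prompt
    # before the model drafts its qualitative analysis.
    return f"{question} -> {context}"

print(agent_answer("What should we expect next year?", "enterprise"))
```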

Some data is semi-structured: a table or markup document may, for instance, contain large text fields. The Fivetran Unified RAG dbt package, based on our work on FivetranChat, handles such data.

The importance of unstructured data for AI has made the data lake a critical destination, because it can readily accommodate both unstructured and structured data at scale. Teams deploying AI must solve the basic challenge of integrating structured and unstructured data from disparate sources into data lakes.

Why unstructured data is especially challenging to integrate

Structured data, especially tabular, relational data, must conform to standardized naming conventions and formats. It usually comes with a predefined schema outlining the relations between different concepts, as well as metadata indicating the semantic meaning of each element. In short, it is far easier to ensure quality and governance with structured data.

By contrast, unstructured data does not lend itself to standardized formatting and does not automatically come with schema enforcement or metadata. It can encompass a bewildering range of media in a huge variety of formats and at very large volumes.

As a result, it is inherently more difficult to guarantee the quality and regulatory compliance of unstructured data, or to govern it more generally.

Automated data integration provides the answer

Fresh, accurate, compliant, and governed data are not optional, especially with public-facing AI deployments. In general, the volume, velocity, and variety of modern data pose vexing challenges in data integration.

While these problems are far from impossible to solve, they represent a tremendous investment in engineering time. The solution to integrating structured data, as Fivetran has long advocated, is automated data integration. Our extensive catalog of more than 700 connectors encompasses common SaaS, ERP, and transactional database sources. Our database connectors, in particular, feature capabilities like in-pipeline configurations and row filtering, giving your team granular control over what data is integrated and how. A major element of data integration is data curation, ensuring that only the most useful and relevant data is integrated.

Automated data integration is also the solution for unstructured data. While SharePoint and Google Drive excel at storing shared knowledge, they aren’t designed to uncover insights hidden within unstructured documents. With Fivetran’s file connectors, you can now centralize unstructured files such as PDFs, images, and documents alongside structured data in your warehouse or lake. As generative AI techniques evolve, this consolidation lets you refine downstream transformations without re-executing the entire pipeline.

If you need to integrate from an unsupported source, our Connector SDK will allow your team to construct a new connector compatible with the Fivetran core application. By building through the Fivetran platform, you can always expect the utmost in scalability, reliability, and security.

Don’t just take our word for it: HubSpot used Fivetran to integrate previously inaccessible text-based human resources information, using generative analytics to glean insights into employee performance and management practices. Likewise, according to Mike Hite, CTO of Saks, “Fivetran solves a very complex problem very simply for us: ingesting lots of different data. It’s one of the fundamental pieces of our AI strategy and allows us to bring in new novel data sets and determine whether they’ll be useful for us.”

[CTA_MODULE]
