This article was first published on Forbes Tech Council on January 6, 2025
Recent security failures have resulted in massive breaches, compromising terabytes of data and hundreds of millions of records. Government bodies and major companies – including entertainment and cybersecurity firms – have all experienced leaks, ransomware, and cyberattacks. Highly regulated organizations like the UK’s NHS, the Indian Council of Medical Research, and the US Consumer Financial Protection Bureau have not been immune. Uber even faced a $324 million fine for failing to sufficiently safeguard the transfer of sensitive data from the EU to the US.
With increased scrutiny from frameworks like the EU-U.S. Data Privacy Framework and proposed legislation like the American Privacy Rights Act, secure data handling is more critical than ever. Modern enterprises handle massive volumes of data from a wide variety of sources, compounding the problem.
In today’s landscape, companies can’t afford the risk of data breaches; they need a systematic, scalable approach to managing their data securely. Investing in secure, automated data integration is the key to reliably and efficiently safeguarding valuable information.
Data pipelines can be the weakest link in your ecosystem
Modern data workflows use data pipelines to move data from applications, operational systems, and other sources to a data warehouse or data lake. Even though they are not meant to store sensitive data, data pipelines must access and handle it to perform backups, data syncs, and other tasks.
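As a rough sketch of what that looks like in practice, the Python example below streams records from a source to a destination in small batches. The source generator, batch size, and write_batch callback are illustrative stand-ins rather than any particular vendor’s API; the point is that sensitive rows pass through the pipeline’s memory in transit even though the pipeline stores nothing itself.

```python
# Minimal sketch of a pipeline sync step, assuming a hypothetical source and
# destination. It illustrates that the pipeline handles sensitive rows in
# transit even though it never persists them.

from typing import Any, Callable, Dict, Iterable


def extract(source_rows: Iterable[Dict[str, Any]]) -> Iterable[Dict[str, Any]]:
    """Stream rows from the source one at a time instead of buffering them."""
    for row in source_rows:
        yield row  # sensitive fields (e.g., email) pass through memory here


def load(rows: Iterable[Dict[str, Any]],
         write_batch: Callable[[list], None],
         batch_size: int = 500) -> int:
    """Write rows to the destination in small batches; nothing is stored locally."""
    batch, total = [], 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            write_batch(batch)  # hypothetical warehouse client call
            total += len(batch)
            batch = []
    if batch:
        write_batch(batch)
        total += len(batch)
    return total


if __name__ == "__main__":
    # Stubbed-in source and destination for illustration only.
    source = ({"id": i, "email": f"user{i}@example.com"} for i in range(1200))
    loaded = load(extract(source), write_batch=lambda b: None)
    print(f"loaded {loaded} rows")
```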
Many organizations build their data pipelines in-house. However, DIY data integration is inherently complicated and engineering-heavy, with a high potential for creating inadvertent security weaknesses. DIY data pipelines create both technological and organizational points of failure. Not only is designing, building, and maintaining a secure data pipeline difficult in its own right, but analytics and engineering teams also contend with competing priorities and do not specialize in security and governance.
Such teams may not implement security and governance best practices, leading to serious design and engineering flaws. For example, running every process on a single server or in a single container for simpler management allows a single malicious or accidental exposure to compromise the entire tech stack. Other oversights include missing security and governance features, such as the ability to monitor and control access. Breaches are inherently difficult to track: even if the pipeline never persists the data, it may be accidentally exposed or replicated in transit. Pipelines can also break down, leading to the loss of critical data that is difficult or impossible to recover.
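To make the access-control gap concrete, here is a minimal, hypothetical sketch of the kind of role check and audit trail that DIY pipelines often lack. The role names, permissions, and logger are assumptions for illustration, not a specific product’s interface.

```python
# Minimal sketch of access control plus audit logging around a pipeline action.
# Roles and permissions below are illustrative assumptions.

import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("pipeline.audit")

ROLE_PERMISSIONS = {
    "analyst": {"read_reports"},
    "data_engineer": {"read_reports", "run_sync", "edit_connector"},
}


def authorize(user: str, role: str, action: str) -> bool:
    """Check the action against the role and record the attempt either way."""
    allowed = action in ROLE_PERMISSIONS.get(role, set())
    audit_log.info(
        "user=%s role=%s action=%s allowed=%s at=%s",
        user, role, action, allowed, datetime.now(timezone.utc).isoformat(),
    )
    return allowed


if __name__ == "__main__":
    if authorize("jamie", "analyst", "run_sync"):
        print("starting sync")  # never reached for this role
    else:
        print("sync blocked and logged")
```

The value is less in the check itself than in the trail it leaves: every attempt, allowed or not, is recorded somewhere a security team can review it.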
These issues all grow with the volume and variety of data an organization handles. Security is a highly specialized field in its own right, and public-facing systems should be validated through audits, penetration tests, and design reviews.
Organizations, particularly in highly regulated industries like government, defense, healthcare, and finance, try to mitigate this problem by keeping data on-premises or in private clouds for additional security. Yet recent breaches demonstrate that this approach is far from foolproof.
How to leverage a secure, automated solution for data integration
Automated data integration offers a technological solution to both labor scarcity and the challenges of security and governance for data in transit, addressing the vulnerabilities of DIY data pipelines.
Traditionally, data teams could not automate on-premises data integration because the data lived in a proprietary environment that outside parties and tools could not (and should not) access. Even with the assistance of commercial tools and technologies, data teams had to assume direct responsibility for building and maintaining pipelines as well as securing and governing data. On-premises data integration consumed engineering hours, created openings for engineering oversights to expose sensitive data, and limited opportunities to scale both the volume and variety of data.
To address these limitations, some organizations are turning to architectures that separate the mechanics of data movement from the systems that control and manage integration workflows. This involves distinguishing between the data plane, where information is transferred, and the control plane, which governs processes without directly handling the data itself.
We call this approach hybrid deployment, a model designed to allow secure automation of on-premises data integration. By maintaining this separation, sensitive data—such as personally identifiable information—stays within its originating environment, even as workflows are managed remotely. This enables organizations to centralize control while sidestepping many of the risks associated with DIY pipelines.
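Conceptually, the split looks something like the sketch below: a local agent in the data plane pulls job metadata from the control plane, moves the data entirely within the private environment, and reports back only status information. All of the function and field names here are illustrative assumptions, not a specific product’s interface.

```python
# Minimal sketch of the control plane / data plane split behind hybrid
# deployment. The control plane sees only job specs and status metadata,
# while row-level data moves entirely inside the customer's environment.

from dataclasses import dataclass


@dataclass
class JobSpec:
    """Metadata only: what to sync, never the data itself."""
    source: str
    destination: str
    tables: list


def fetch_job_from_control_plane() -> JobSpec:
    # In a real agent this would be an authenticated, outbound-only call.
    return JobSpec(source="local_postgres", destination="local_warehouse",
                   tables=["customers", "orders"])


def run_in_data_plane(job: JobSpec) -> dict:
    """Move data source -> destination locally; rows never leave this function."""
    rows_copied = {table: 0 for table in job.tables}
    # ... actual extract/load happens here, inside the private network ...
    return {"status": "success", "rows_copied": rows_copied}  # metadata only


def report_to_control_plane(result: dict) -> None:
    # Only non-sensitive status metadata goes back out.
    print("reporting:", result)


if __name__ == "__main__":
    job = fetch_job_from_control_plane()
    report_to_control_plane(run_in_data_plane(job))
```

The design choice that matters is what crosses the boundary: job specifications and status metadata flow between the planes, while row-level data never does.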
As an emerging technology, hybrid deployments are still developing in the breadth and complexity of use cases they cover. For example, some solutions might not yet support all of the data sources or destinations that legacy on-premises solutions do. There are also niche cases where companies cannot let even operational metadata leave their environment, requiring them to use only fully self-hosted solutions.
However, for organizations obligated to maintain data on-premises for security, governance, and compliance, automated data integration with hybrid deployment presents an opportunity not only to provide access to important on-premises data from sensitive operations but also to unify it with less sensitive data from cloud-based applications and other sources. This ability to comprehensively centralize and access data is key to enabling analytics of all kinds, from reporting and predictive modeling to AI.