Data is a critical resource with endless potential to drive innovation. But not all data infrastructures are up to the task of supporting the volume, velocity and variety of data an organization needs. Infrastructure modernization describes any process in which an organization changes the tools, technologies and platforms it relies on for data operations.
Why modernize infrastructure?
Every organization that chooses to modernize infrastructure does so in service of some other important use case. Fundamentally, modernizing infrastructure is about improving the capabilities of your data organization by improving its tools. Common benefits include improved flexibility in terms of compute and storage, lower costs and reduced engineering workloads. Other considerations include interoperability with new technologies and attracting fresh, best-in-class talent. Many newer data technologies are cloud-based, and recent graduates of engineering, machine learning and data science programs are likely to have trained in cloud-based environments.
The main reasons to modernize infrastructure are to enable the centralization of data at virtually unlimited scale, with modest upfront capital outlays, and to complement other cloud-based analytics technologies. When companies improve their data centralization capabilities, they usually do so by adding or changing some elements of their data stack.
Centralizing data is the first and essential step in enabling actionable use of data, especially for democratizing data and building systems that monetize data. Infrastructure modernization plays a major role in all of these stages. It can take place along several, non-mutually exclusive dimensions:
- An organization may move from on-premises operational systems, pipelines and destinations to the cloud for greater flexibility and speed, along with reduced maintenance.
- Data teams may switch a data pipeline from ETL to a more modular, flexible ELT-based architecture.
- Destinations differ by cost structure, performance and other attributes, so an organization may migrate from one type of destination to another.
- In pursuit of greater organizational agility, a data team might upgrade its data movement capabilities from intermittent batch updates to real-time or streaming.
On-premises to cloud
The main advantages of cloud-based data infrastructure are ease of use, scalability, cost control and interoperability with newer technologies. Cloud-based data infrastructure enables an organization to centralize data at virtually unlimited scale without large upfront capital outlays and to leverage the power of other cloud-based analytics technologies.
The centerpiece of most data stacks is a data warehouse, serving as a destination and single source of truth for analytics. Unlike their on-premises counterparts, cloud data warehouses are:
- Elastic – Users can separately scale compute and storage up and down as needed
- Fast – Cloud data warehouses are optimized for large-scale analytics
- Cheap – Cloud-based infrastructure is often more affordable than DIY on-premises setups, leveraging providers’ economies of scale as well as the continually falling costs of storage, computation and bandwidth
Most importantly, a cloud-based, managed data warehouse obviates the need to set aside the resources and downtime to design, build and maintain data centers and server farms.
ETL to ELT
There are two main approaches to data movement: the traditional ETL (Extract, Transform, Load) and the more modern ELT (Extract, Load, Transform).
ELT has a number of clear advantages over ETL:
- Simplified data integration – In ELT, the destination is populated with data directly from the source, with no more than light cleaning and normalization to ensure data quality and ease of use for analysts.
- Lower failure rates – Because extractions and loads are decoupled from transformations, changes to upstream schemas or downstream business requirements no longer cause extraction and load failures.
- Automated workflows – Because automated extraction and loading returns raw data, the pipeline produces a standardized output. This eliminates the need to constantly build and maintain pipelines featuring custom data models. It also allows derivative products, such as templated analytics, to be built and layered on top of the destination.
- Easier outsourcing – Because the ELT pipeline produces standardized outputs and accommodates changes more easily, data integration is simpler to outsource to third parties.
- Flexible scaling – Organizational data needs change constantly based on business, market and client relationships. When data processing loads increase, automated platforms that use cloud data warehouses can autoscale within minutes or hours.
- SQL transformation support – ELT shifts transformation from an engineering-intensive process that requires careful scripting to one that is performed in the destination by analysts. Transformations can be written in SQL rather than scripting languages such as Python.
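The ELT pattern described above can be sketched in a few lines. This is a minimal, hypothetical illustration using SQLite as a stand-in for a cloud data warehouse; the table and column names are invented for the example. Note how the transform step is plain SQL executed inside the destination, decoupled from extraction and loading:

```python
import sqlite3

# Stand-in "destination": in a real ELT stack this would be a cloud warehouse.
conn = sqlite3.connect(":memory:")

# Extract + Load: raw source rows land in the destination unmodified,
# aside from light normalization (consistent column names and types).
raw_orders = [
    ("2024-01-05", "widget", 3, 9.99),
    ("2024-01-06", "widget", 1, 9.99),
    ("2024-01-06", "gadget", 2, 24.50),
]
conn.execute(
    "CREATE TABLE raw_orders (order_date TEXT, product TEXT, qty INTEGER, unit_price REAL)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_orders)

# Transform: performed inside the destination, in SQL, by analysts --
# independent of the extract/load step above.
conn.execute("""
    CREATE TABLE daily_revenue AS
    SELECT order_date, SUM(qty * unit_price) AS revenue
    FROM raw_orders
    GROUP BY order_date
""")

for row in conn.execute("SELECT * FROM daily_revenue ORDER BY order_date"):
    print(row)
```

Because the raw tables persist in the destination, analysts can rewrite or add transformations at any time without re-running extractions, which is exactly the decoupling the bullet points above describe.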
One destination to another

As your organization's needs grow in sophistication and state-of-the-art technologies continue to evolve, you may need to replace your destinations to accommodate a larger scale or diversity of data types, such as media files and other unstructured data. Different kinds of destinations involve different tradeoffs. Data warehouses, for instance, range from highly tunable self-hosted setups to easy-to-use but costly serverless architectures that can scale compute and storage up and down as needed.
Batch to real-time
The freshness of data is a critical concern for keeping an organization agile and responsive to rapidly changing market conditions. Moving from a daily cadence of reports to a turnaround of minutes (or faster) can enable an organization to observe, orient, decide and act in real time. Real-time analytics offers the following benefits, in particular:
- Faster decision-making and greater organizational agility – The ability to refresh data in a matter of minutes shortens decision cycles and enables executives, managers and individual contributors alike to intelligently react to new developments without delay.
- Enabling data democratization – Real-time data has cascading effects on decision support throughout an organization. It not only enables analysts to deliver timely insights, but also lets stakeholders across the business act on current rather than stale information.
- Personalized customer experiences – Real-time data movement is an essential prerequisite for machine learning and predictive analytics models that serve customers on-the-fly through recommendations and other offers.
- New business use cases – Personalized customer experiences are just the tip of the iceberg. Interactive dashboards and all manner of business process automation are enabled by real-time data as well.
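The practical difference between batch and near-real-time syncing often comes down to how incrementally and how often data moves. Below is a minimal sketch of a cursor-based incremental sync; the source rows, the `updated_at` cursor column and the sync function are all invented for illustration, standing in for a CDC log or an API that supports change filtering:

```python
# Hypothetical in-memory "source" of change events (stand-in for a CDC log
# or an API that can filter on an updated_at cursor).
source_rows = [
    {"id": 1, "updated_at": 100, "status": "new"},
    {"id": 2, "updated_at": 105, "status": "new"},
    {"id": 1, "updated_at": 110, "status": "shipped"},
]

destination = {}  # id -> latest version of the row
cursor = 0        # high-water mark: largest updated_at synced so far

def sync_increment():
    """Pull only rows changed since the last sync and upsert them."""
    global cursor
    changed = [r for r in source_rows if r["updated_at"] > cursor]
    for row in changed:
        destination[row["id"]] = row              # upsert the latest version
        cursor = max(cursor, row["updated_at"])
    return len(changed)

# A batch pipeline might run this once a day; a near-real-time pipeline
# runs the same logic every few seconds, or reacts to each event as it arrives.
sync_increment()   # first pass picks up all three changes
source_rows.append({"id": 2, "updated_at": 120, "status": "shipped"})
sync_increment()   # second pass moves only the one new row
```

The point of the sketch is that shortening the sync interval shrinks decision latency without reprocessing the full dataset, because each pass moves only what changed since the last cursor position.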
Meeting the challenge of modernization
While smaller and newer organizations with minimal incumbent infrastructure can easily adopt and begin using new solutions, larger and more established organizations cannot easily rip-and-replace due to the importance of workflows that depend on the existing infrastructure.
Unless your organization has no existing obligations downstream of your infrastructure, you must keep existing processes running while you set up the new ones. The good news is that there are several ways to contain costs in the new environment as it ramps up. You don’t need to migrate every part of your infrastructure at once; it often makes sense to modernize piecemeal.
When it comes to data migrations, it’s a good idea to start with applications, which produce valuable data (especially in sales and marketing) but are less demanding than operational databases in terms of technical complexity and security requirements. Migrating applications is a low-risk, low-cost way to build a strong proof of concept. Building on that initial success, your team can approach subsequent efforts with more confidence and activate connectors to more data sources, including operational databases, over time.