Data architecture for business intelligence (BI) has come a long way. The cloud-based modern data stack, consisting of automated data pipelines, data warehouses, transformation tools and business intelligence platforms, has been instrumental in massively improving the quality and performance of analytics to the point where data is not only used for discovery and decision support but also to improve and automate operational processes.
The foundation of a successful BI data stack is a centralized (or centralized-per-use-case) and governed data layer. This has specifically solved the following ubiquitous challenges:
- Siloed data sources lead to siloed reporting, making it difficult to assemble a comprehensive picture of the business.
- The difficulty of building a single source of truth and data provenance, resulting in conflicting findings and metrics.
- Lack of governance, leading to the exposure and handling of sensitive data by inappropriate parties.
Similar problems exist in data science, especially for machine learning and artificial intelligence. Having realized the business benefits of this governed, centralized data layer as a foundation of our BI, why haven’t we applied the same principles to our modern data science architectures?
When I worked as a data scientist, I was frustrated that the following challenges, despite being largely solved for BI, still exist in the data science world:
- Lack of data governance; data for data science projects is often taken from data stores that are not well governed, leading to the same issues of exposure as with BI as well as models that are unlikely to be successful on production data.
- Underdeveloped DataOps infrastructure, leading to models that may be highly tuned and optimized but with no platform to deploy and productionize them.
- Neglect for regulatory compliance, leading to poorly understood models handling sensitive data that are subsequently rejected for production use.
Challenges like these are the main drivers behind the fact that 87 percent of data science projects do not make it into production. The time distribution of a typical data science project is roughly as follows: 45 percent cleaning and prepping the data, 21 percent exploring and visualizing data, 11 percent selecting the model and 12 percent tuning the model. Only the remaining 11 percent is devoted to productionizing and deploying the model.
In short, data scientists spend far too much time on basic data management chores, much like data analysts did before adopting the modern data stack. I suspect that data scientists, myself included, often have a strong bias toward spending the remaining time building elaborate and highly tuned models instead of focusing on basic infrastructure and a model’s potential to be brought into production. As Donald Knuth put it, “premature optimization is the root of all evil.”
Getting your machine learning into production
Let’s move from problem mode to solution mode (it is part of my job title after all!). The first step toward getting a higher return on investment from your data science projects is to assess whether you actually have the tools and processes to support data science, or even a genuine business use case for it. The high failure rate of data science suggests that organizations have rushed into data science projects they can reference in their annual plans or press releases, without a real business pain that data science would solve.
If you do need to leverage the considerable potential of data science technologies, tackling the challenges outlined above involves taking a step back and solving the DataOps infrastructure problem, ensuring that models are comprehensible and compliant with regulations and making data governance a priority. The good news is that the solution to many of these challenges leverages the same modern data stack you use for business intelligence.
To begin with, you can pursue relatively simple and practical machine learning projects using structured data you already store in a data warehouse. You very likely already collect structured sales, marketing and product behavior data, for instance. Some data warehouses even support machine learning models written in SQL.
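To make this concrete, here is a minimal sketch of the kind of model such a project might start with: a simple classifier trained on structured warehouse data. The column names and churn data are hypothetical, and in practice the frame would come from a warehouse query (e.g. `pd.read_sql(...)`) rather than being built inline.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical structured data; in practice, the result of a
# warehouse query such as pd.read_sql("SELECT ...", conn).
df = pd.DataFrame({
    "monthly_spend":   [10, 250, 40, 5, 300, 120, 15, 220],
    "support_tickets": [5, 0, 2, 7, 1, 1, 6, 0],
    "churned":         [1, 0, 0, 1, 0, 0, 1, 0],
})

X = df[["monthly_spend", "support_tickets"]]
y = df["churned"]

# A deliberately simple, explainable model: the coefficients map
# directly back to warehouse columns analysts already know.
model = LogisticRegression().fit(X, y)
preds = model.predict(X)
print(list(preds))
```

The point is not the algorithm but the workflow: because the inputs are governed warehouse tables, the same query that feeds training can feed scoring in production.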
For more advanced machine learning use cases that depend on unstructured documents and media files, one possible workflow is to use a governed data lake as a staging area for unstructured or semi-structured data before storing transformed feature vectors in a data warehouse. Keeping all of the data associated with a model, including output data, in your data warehouse will help your analysts see inside the “black box” of a model, making it more accessible. A transparent model is also more likely to pass internal audits and risk assessments, ultimately resulting in it being more likely to become part of a production workflow.
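The staging step above can be sketched as follows: unstructured documents are turned into fixed-width feature vectors, then flattened into a relational table ready to load into the warehouse. The document contents and table layout here are hypothetical; a stateless hashing vectorizer stands in for whatever feature pipeline you use, with the advantage that the same transform can be reproduced at scoring time without persisting a fitted vocabulary.

```python
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical documents staged in a data lake.
docs = [
    "invoice overdue payment reminder",
    "product demo follow-up notes",
    "support ticket escalation summary",
]

# Stateless transform: no fitted state to version or persist.
vec = HashingVectorizer(n_features=8, norm=None, alternate_sign=False)
features = vec.transform(docs).toarray()

# One row per source document, one named column per feature --
# a shape that loads cleanly into a warehouse table, where it can
# sit next to the model's outputs for auditing.
table = pd.DataFrame(features, columns=[f"f{i}" for i in range(8)])
table.insert(0, "doc_id", range(len(docs)))
print(table.shape)
```

Storing these vectors (and later the model's predictions) as ordinary warehouse rows is what lets analysts query into the model rather than treating it as a black box.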
The modern data stack that you apply to business intelligence can be applied to data science and machine learning as well. New analytics data storage technologies increasingly support both relational and unstructured data. Fivetran now writes data to S3 in a relational structure using the Apache Iceberg table format, while AWS Lake Formation (via AWS Glue) applies data governance and granular access control. As your data practice matures, such technologies that make data lakes tractable will be key to helping your data science projects get into production.