Data engineers, analysts, and scientists have been benefiting from the evolution of the modern data stack (MDS) for the best part of a decade, but there’s much more to come, according to three CEOs who have been instrumental in its development. At the top of their to-do lists: making analytics accessible to more people, solving latency issues and, ultimately, putting more data into the hands of non-technical users.
Tristan Handy, CEO of dbt Labs, highlights the importance of a semantic layer and new tooling that provides an interface between business users and analytics engineers, enabling them to work together more productively.
“Historically, it’s been quite challenging for downstream data consumers to know what to do with the rows and columns surfaced by engineers,” Tristan says. “Having a semantic layer on top allows business users to be a lot more productive. People’s roles are shifting along with the tooling landscape.”
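As a rough sketch of the idea, and not dbt’s actual Semantic Layer API (the table, the metric names, and the query_metric helper below are all invented for illustration), the pattern is that metric logic is defined once by the analytics engineer, and the business user asks for a metric by name instead of writing queries against raw tables:

```python
import pandas as pd

# Hypothetical raw table of the kind analytics engineers surface.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2023-01-02", "2023-01-02", "2023-01-03"]),
    "status": ["completed", "returned", "completed"],
    "amount": [120.0, 80.0, 45.0],
})

# The semantic layer idea: metric logic defined once, upstream of every
# dashboard and spreadsheet that consumes it.
METRICS = {
    # Business rule captured once: revenue counts only completed orders.
    "revenue": lambda df: df.loc[df["status"] == "completed", "amount"].sum(),
    "order_count": lambda df: len(df),
}

def query_metric(name: str, df: pd.DataFrame):
    """What a business user asks for: a metric by name, not raw rows and columns."""
    return METRICS[name](df)

print(query_metric("revenue", orders))      # 165.0
print(query_metric("order_count", orders))  # 3
```

In dbt’s Semantic Layer, definitions like this are expressed declaratively in YAML rather than in code; the point here is only that the definition lives upstream of the consumer, so every tool downstream agrees on what “revenue” means.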
Putting people at the center of the data ecosystem is crucial, according to Fivetran CEO George Fraser, and that means companies have to get better at understanding what users are trying to do with data. Fivetran exemplifies the approach, having done exactly that for data analysts and engineers.
“We took something that for decades was famous as the worst part of working with data: data pipelines,” says George. “Most of the pipelines were custom built in a painful process that was always breaking. We figured out a way to automate the worst parts of that and made it disappear.”
The CEOs highlighted three areas for future MDS development: the evolution of data platforms, the continued role of open source tools in driving innovation, and a greater focus on latency.
Evolution to the data lakehouse
Analysts, engineers, data scientists and machine learning specialists may find themselves requiring different versions of the stack. A more innovative approach, says Databricks CEO Ali Ghodsi, is the data lakehouse, which combines the best components of data lakes and data warehouses into a single, unified architecture for all teams to use.
“Today, people have to pick two completely different stacks depending on if they want to ask a question about the past or the future,” Ali says. “If I want to know what my revenue was last week, then I use a data warehouse and a BI tool on top of it and I can ask what my revenue was. If I want to just tweak that question and ask what my revenue is going to be next week, that’s an AI stack, which uses a data lake.”
“The lake is the AI data science part; the house is the warehouse BI part,” Ali continues. “It enables more innovation in the ecosystem and more integration with tools that are already out there.”
The lakehouse also addresses compute paradigms that a database management system doesn’t natively support. “The lakehouse is one way to solve that problem, which is you store the data in an open format and you give the customer direct access to those files,” says George.
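As a minimal sketch of what that direct access buys (the file path and column names are invented, and the forecast is deliberately naive), the table lives as open-format files, so the warehouse-style question about the past and the AI-style question about the future can read the same copy of the data, with no export step in between:

```python
import pandas as pd

# One copy of the data in an open format. Parquet here; table formats
# such as Delta Lake and Iceberg build on the same open-files idea.
revenue = pd.read_parquet("s3://my-bucket/sales/revenue.parquet")  # hypothetical path

# The warehouse/BI question: what was revenue last week?
weekly = revenue.resample("W", on="order_date")["amount"].sum()
print(weekly.tail(1))

# The lake/AI question: what will revenue be next week?
# Same files, no second stack -- a naive rolling-mean "forecast"
# stands in for a real model here.
print(f"naive next-week forecast: {weekly.tail(4).mean():.2f}")
```

Any engine that can read the format, whether Spark, DuckDB or plain pandas, gets the same access to the files, which is also the interoperability point the open-standards discussion below picks up.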
Further into the future, Ali sees the lakehouse as a way to democratize data and make it accessible to business users. Delivered as a self-service platform for data users who don’t know how to code, it would break down barriers and change the way people access insights.
Open source standards: A platform for innovation
The development of the MDS could not have happened without open source software and open standards. “Proprietary formats stifled innovation,” says Ali. “By making it open, we allow any tools to access the data, and any company to come along and enrich the stack and keep building on it.”
The widespread adoption of dbt, which has revolutionized data transformation, proves the point, showing how open source and open standards enable interoperability where none was ever intended. “We’re sort of dragging various systems by force into the world of open standards and interoperability, and I think that’s a good thing,” says Tristan.
George describes open source as a liberator from the vendor lock-in that has long plagued the industry. “The desire to lock users into an ecosystem, whether that is driven by where the data lives or a weird stored procedure format, is the most destructive force in our space and has prevented more innovation,” he says. “If nothing else, more open standards will help us to move on from that.”
Solving latency issues
Lowering latency is a very important frontier in the modern data stack, George explains. “People are solving it today with enormous effort and enormous costs. I think this is one of the most important things for us to attack in the next few years.”
And there’s a case to be made for streaming in the pursuit of real-time processing. For Tristan, streaming is essential: “If you have to rerun your entire pipeline, even with as much incrementality as you can build into a batch-based data pipeline, it is still not performant enough to get you close enough to real time.”
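To make the incrementality point concrete, here is a rough sketch of the batch pattern Tristan is describing (the table and column names are invented): each run picks up only rows newer than a stored watermark, yet freshness is still capped by how often the job runs, which is the gap streaming closes.

```python
import pandas as pd

def incremental_run(source: pd.DataFrame, target: pd.DataFrame) -> pd.DataFrame:
    """One batch pass: append only source rows newer than the target's watermark."""
    watermark = target["event_time"].max() if len(target) else pd.Timestamp.min
    new_rows = source[source["event_time"] > watermark]
    return pd.concat([target, new_rows], ignore_index=True)

# Even this incremental version only refreshes when the scheduler fires:
# a job on a ten-minute cron has a ten-minute latency floor, no matter
# how little work each run does.
```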
The weakest link in the chain will determine how latency affects performance, according to Ali, but he sees huge strides being made. Databricks’ streaming workload revenue tripled in the last year, and he cites companies such as Rolls-Royce operating at the leading edge, using real-time streaming and machine learning to transform their businesses.
The reality, however, is that most organizations are a long way behind in terms of data maturity, and much work still has to be done. The modern data stack is here to help companies move toward more advanced analytics that can add business value and make them more competitive.
This article is based on a TechCrunch panel discussion about the future of the modern data stack. If you missed the event, you can still watch the on-demand webinar here.