Arjun Narayan is the co-founder and CEO of Materialize, and George Fraser is the co-founder and CEO of Fivetran. This post is cross-posted on the Materialize blog.
Apache Kafka is a message broker that has rapidly grown in popularity in the last few years. Message brokers have been around for a long time; they're a type of datastore specialized for "buffering" messages between producer and consumer systems. Kafka has become popular because it's open-source and capable of scaling to very large numbers of messages.
Message brokers are classically used to decouple producers and consumers of data. For example, at Fivetran, we use a message broker similar to Kafka to buffer customer-generated webhooks before loading them in batches into your data warehouse:
In this scenario, the message broker is providing durable storage of events between when a customer sends them, and when Fivetran loads them into the data warehouse.
However, Kafka has occasionally been described as something much more than just a better message broker. Proponents of this viewpoint position Kafka as a fundamentally new way of managing data, where Kafka replaces the relational database as the definitive record of what has happened. Instead of reading and writing a traditional database, you append events to Kafka, and read from downstream views that represent the present state. This architecture has been described as "turning the database inside out."
In principle, it is possible to implement this architecture in a way that supports both reads and writes. However, during that process you will eventually confront every hard problem that database management systems have faced for decades. You will more or less have to write a full-fledged DBMS in your application code. And you will probably not do a great job, because databases take years to get right. You will have to deal with dirty reads, phantom reads, write skew, and all the other symptoms of a hastily implemented database.
Tripping (up) on ACID
The fundamental problem with using Kafka as your primary data store is it provides no isolation. Isolation means that, globally, all transactions (reads and writes) occur along some consistent history. Jepsen provides a guide of isolation levels (inhabiting an isolation level means that the system will never encounter certain anomalies).
Let's consider a simple example of why isolation is important. Suppose we’re running an online store. When a user checks out, we want to make sure all their items are actually in stock. The way to do this is to:
- Check the inventory level for each item in the user’s cart.
- If an item is no longer available, abort the checkout.
- If all items are available, subtract them from the inventory and confirm the checkout.
Suppose we are using Kafka to manage this process. Our architecture might look something like this:
The web server reads the inventory level from a view downstream from Kafka, but it can only commit the transaction upstream in the checkouts topic. The problem is one of concurrency control: if there are two users racing to buy the last item, only one must succeed. We need to read the inventory view and confirm the checkout at a single point in time. However, there is no way to do this in this architecture.
The problem we now have is called write skew. Our reads from the inventory view can be out of date by the time the checkout event is processed. If two users try to buy the same item at nearly the same time, they will both succeed, and we won't have enough inventory for them both.
Event-sourced architectures like these suffer many such isolation anomalies, which constantly gaslight users with “time travel” behavior that we’re all familiar with. Even worse, research shows that anomaly-permitting architectures create outright security holes that allow hackers to steal data, as covered in this excellent blog post on this research paper.
Using Kafka alongside a database
These problems can be avoided if you use Kafka as a complement to a traditional database:
OLTP databases perform a crucial task that message brokers are not well suited to provide: admission control of events. Rather than using a message broker as a receptacle for “fire and forget” events, forcing your event schema into an “intent pattern”, an OLTP database can deny events that conflict, ensuring that only a single consistent stream of events are ever emitted. OLTP databases are really good at this core concurrency control task - scaling to many millions of transactions per second.
Using a database as the point-of-entry for writes, the best way to extract events from a database is via streaming change-data-capture. There are several great open CDC frameworks like Debezium and Maxwell, as well as native CDC from modern SQL databases. Change-data-capture also sets up an elegant operational story. In recovery scenarios, everything can be purged downstream and rebuilt from the (very durable) OLTP database.
Don't accidentally misbuild a database
The database community has learned (and re-learned) several important lessons over decades. Each one of these lessons was obtained at the high prices of data corruption, data loss, and numerous user-facing anomalies. The last thing you want to do is to find yourself relearning these lessons because you accidentally misbuilt a database.
Real-time streaming message brokers are a great tool for managing high-velocity data. But you will still need a traditional DBMS for isolating transactions. The best reference architecture for “turning your database inside out” is to use OLTP databases for admission control, use CDC for event generation, and model downstream copies of the data as materialized views.
Kafka and Fivetran
If you're interested in automatically sending Kafka event data to a cloud data warehouse, check out how Fivetran works with Kafka, and try out Fivetran to see if it's the right solution for you!