You’ve probably heard quite a bit about Apache Kafka, the leading system for storing streaming data that LinkedIn open sourced back in 2011. Apache Kafka is an extremely powerful and scalable technology, but very few organizations even come close to fully realizing its potential benefits because they just don’t have the need to do streaming analytics on massive amounts of data in real time -- yet. So why, then, should organizations even consider implementing it now?
Simply put, streaming analytics means doing analytics in real time as the data comes in as opposed to running analytics on data that is permanently stored somewhere (such as a data lake). Many data-driven organizations that are pursuing the development of use cases like recommendation engines, predictive maintenance, or fraud detection are moving toward streaming analytics. It’s very likely that streaming analytics is, or soon will be, in a majority of roadmaps for companies large and small.
Self-driving cars are just one example of devices that will produce massive amounts of streaming data.
For basic streaming analytics, a simple point-to-point message-queueing system such as ZeroMQ could handle most use cases. Kafka, on the other hand, is built for high performance and designed to be scalable -- very, very scalable -- as evidenced by LinkedIn’s announcement last year that they are processing over 1.4 trillion messages per day using Kafka across more than a thousand broker nodes. But aside from other giant internet services companies like Google, Facebook, and Twitter, who else might need to scale at this level?
Indeed, for most, implementing Kafka before the amount of data makes it necessary can seem like overkill. But organizations with any social or Internet of Things (IoT) projects in their roadmap in the next several years might be surprised to find that their current architecture is not set up to deal with the amount and frequency of data they could be receiving, not to mention the ways they’ll use it. The way we here at Dataiku see it, now is a great time to get ahead of the game and become familiar with Kafka -- especially considering the surprisingly minimal investment required to adopt it.
Kafka: How It Works (And a Quick Glossary)
Kafka’s defining feature is its scalability. But even if you don’t need the scaling (yet), few other solutions match its combination of performance and reliability. Kafka is a distributed, partitioned, and replicated log, and it's optimized for massive throughput. Basically, it stores data in the order it comes in, and it makes these logs redundant across the nodes of the cluster. Data in Kafka expires after a configurable retention period, so you need to consume it or store it elsewhere before it disappears.
Kafka serves and stores data using a publish-subscribe pattern in topics (more or less the equivalent of a folder in a file system) with built-in replication and partitioning. A Kafka cluster can have many topics, and each topic can be configured with different replication factors and numbers of partitions. In Kafka parlance, a producer is an inbound data connection that writes data into a topic, whereas a consumer is an outbound data connection. For example, a program that listens to IoT sensors and writes to a Kafka topic would be a producer, and an application making decisions based on this data would be a consumer.
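Kafka itself is a distributed service, but the topic/partition/producer/consumer model can be sketched as a toy in-memory log. The Python below is purely illustrative (the class and method names are ours, not Kafka's API), but it captures the essentials: a topic is a set of append-only partition logs, records with the same key land in the same partition, and consumers read forward from an offset:

```python
class Topic:
    """Toy in-memory model of a Kafka topic: one append-only log per partition."""

    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Like Kafka's default partitioner, route records with the
        # same key to the same partition, preserving per-key order.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p  # which partition the record landed in

    def consume(self, partition, offset):
        # Consumers track their own offset into each partition's log
        # and read everything from that offset onward.
        return self.partitions[partition][offset:]


# A "producer" writing an IoT sensor reading, and a "consumer" reading it back:
sensors = Topic("iot-sensors", num_partitions=2)
part = sensors.produce("sensor-42", {"temp": 21.5})
records = sensors.consume(part, offset=0)
```

Real Kafka adds persistence, replication, and network clients on top, but the mental model of partitioned, offset-addressed logs is the same.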
Bypassing Batches for Real-Time Analytics
A typical data science pipeline generally starts with data dumped into a table in a database or a distributed file system like Hadoop. From there, analysts use tools like Python or Spark to execute computations on the data. The results are then stored in another database table, which another system (a webapp, for example) can pull from. This type of step-by-step data transformation and movement is called batch processing because the data modifications are done in, well, batches.
With Kafka, the data streams right into a topic as it is received. For data science, you would use a service such as Spark Streaming to do the computations, store the results in yet another Kafka topic, and have them pushed automatically to your webapp.
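To make the batch-versus-streaming distinction concrete, here is a minimal Python sketch (pure illustration, no Kafka or Spark involved) of the same computation done both ways -- once over data at rest, once incrementally as records arrive:

```python
def batch_average(stored_readings):
    # Batch: compute over a complete dataset at rest
    # (e.g., a table dump or a file in the data lake).
    return sum(stored_readings) / len(stored_readings)


def streaming_average(stream):
    # Streaming: update the result as each record arrives, without
    # holding the full history in memory -- conceptually what a
    # Spark Streaming job reading from a Kafka topic does at scale.
    count, total = 0, 0.0
    for reading in stream:
        count += 1
        total += reading
        yield total / count  # emit an updated result per record


readings = [20.0, 22.0, 24.0]
final_batch = batch_average(readings)                      # one answer, after the fact
running = list(streaming_average(iter(readings)))          # an answer after every record
```

Both approaches converge on the same final value; the difference is that the streaming version has a usable answer after every incoming record rather than at the end of a batch job.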
Scalability and Reliability for the Coming Data Deluge
Scalability is clearly a huge benefit of using Kafka. Using a standard Hadoop-based data lake with streaming data requires creating a new file every single time the data lake ingests new data. As additional sources of streaming data are ingested with higher and higher update frequencies, more and more HDFS blocks (HDFS file parts) are created. At some point, the NameNode is no longer able to process it all; from then on, accessing the data becomes too slow and inhibits all data lake activity. Not only does Hadoop prefer a few large files to many small ones, appending data to an existing HDFS file is awkward at best.
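As a back-of-the-envelope illustration of why small files choke the NameNode (the ~150 bytes per file/block entry is a commonly cited rule of thumb, not an exact figure, and the workload numbers below are made up for the example):

```python
# Rough rule of thumb (assumption): each file or block entry costs the
# HDFS NameNode on the order of 150 bytes of heap, and it keeps the
# entire namespace in memory.
BYTES_PER_OBJECT = 150


def namenode_heap_gb(files_per_second, days, objects_per_file=2):
    # objects_per_file: one file entry plus roughly one block entry,
    # assuming each small file fits in a single HDFS block.
    total_objects = files_per_second * 86_400 * days * objects_per_file
    return total_objects * BYTES_PER_OBJECT / 1e9


# Hypothetical workload: landing 100 small sensor files per second
# into the data lake for one year.
heap = namenode_heap_gb(files_per_second=100, days=365)
```

Under these (illustrative) assumptions the NameNode would need on the order of 900+ GB of heap after a single year -- far beyond what a real NameNode can handle -- whereas Kafka simply appends the same records to a handful of partition logs.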
Or to put it less technically, Hadoop is not built to handle the size and frequency of data that a world of IoT will bring.
Finally, the reliability of the system is better than anything else available right now. Every partition within a Kafka topic can be replicated across brokers, so the failure of a single node won't cost you the data within a topic.
Find an Excuse to Use Kafka Now
So why are we, here at Dataiku, writing about Kafka? First of all, as makers of a data science platform that leverages open source technologies, we have no choice but to stay on par with the latest and hottest technologies available in the data ecosystem. Second, several customers have come to us asking for help in building apps that rely on streaming data. In a majority of cases, because the data currently being streamed isn’t all that big, we could have recommended implementing a simple message-queueing system. But why not prepare for when that data does become big? After all, even if they may seem like overkill for immediate needs, adopting powerful and scalable technologies is like insurance for the future. Furthermore, many of our customers see adopting Kafka as an opportunity to grow internal data team expertise.
Although Kafka can be a bit of a pain to set up in and of itself -- it typically requires a ZooKeeper cluster, involves picking quite a lot of parameters, and lacks monitoring features if taken straight from the Apache distribution -- setup with an enterprise distribution like the one from Cloudera is very easy and even includes robust monitoring features.
Once you get Kafka up and running, its operational overhead is repaid much faster than that of, say, a Spark cluster. So, in our view, it’s better to go through the minor frustration of setting Kafka up now: you’ll gain a familiarity that you’ll be grateful for when your organization’s need for streaming analytics arrives sooner than expected.