After learning about Hadoop and distributed data storage, and what exactly Spark is, in the previous episodes, it's time to dig a little deeper to understand why, even though Spark is great, it isn't necessarily a miracle solution to all your data processing issues. It's time for Spark for super beginners, episode 3!
As always, I try to keep these articles as easy to understand as possible, but if you really are a super data padawan, you should probably have a quick look at episode 1 and episode 2 to understand what I'm talking about. You can always go back to a previous episode later:
- episode 1: a little history about distributed data and Hadoop, why, and how it was done before Spark;
- episode 2: enter Spark, what it is, what it changes;
- episode 3: more about the epic battle between Spark and Hadoop (this is it);
- episode 4: Hadoop meets Spark in Data Science Studio, the reconciliation.
As we saw in the previous episode, Spark lets you:
- process large volumes of data,
- super fast,
- in a resilient manner,
- and it is particularly practical for Machine Learning on very large datasets.
So why wouldn’t everyone set up Spark?!
Spark is fast. Yes. It's faster overall than other technological stacks, for sure. However, it isn't optimal for all the use cases out there. It's much slower than MapReduce for certain specific operations, and each query takes some time to set up. You should always choose your system based on what you're going to be doing with it, so as not to waste resources.
If you just want to run SQL-type queries on distributed data, you can simply go with Impala, for example; no need for Spark. (The good news is that Data Science Studio 2.1 comes with Impala as well as Spark! I know, I just had to mention it.) Spark is useful when you want to do lots of different things with your data. It's like a Swiss army knife. One that's really good at Machine Learning.
Do you need stream processing?
So you probably don’t need all of those super advanced features that come with a Hadoop and Spark stack. I'm guessing you are not Amazon (yet) and you don’t have that much data to work with. Analyzing your customers’ behaviors each evening to send them really cool notifications or emails, or to recommend products to them, is already pretty awesome. Features like Spark Streaming, while they do sound pretty revolutionary, are only really useful for certain use cases.
Most of the time, you don’t need to get insights from your data live. Basically, unless you work in trading, fraud detection, or network monitoring, or you're running a search or recommendation engine, you do not need “real-time” big data infrastructures.
There's a chance you don't even need Hadoop or Spark anyway
Also, Spark is generally faster than MapReduce, yes, but if your data can fit in a single server’s RAM, it will always be faster not to distribute it and to process it all in one place. In that case, using non-distributed Machine Learning libraries like scikit-learn and performing queries in a SQL database is still the most efficient way to go. More often than not, your data is not as big as you think and can actually fit in RAM (remember what Random Access Memory is?).
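To make that concrete, here is a minimal sketch of the non-distributed route, assuming the data fits in RAM: one SQL query pulls everything into memory, and scikit-learn handles the modelling. The `customers` table and its columns are invented for illustration (an in-memory SQLite database stands in for your real SQL warehouse).

```python
import sqlite3

import pandas as pd
from sklearn.linear_model import LogisticRegression

# A tiny in-memory database standing in for a real SQL warehouse.
# Table and column names are hypothetical.
con = sqlite3.connect(":memory:")
pd.DataFrame({
    "visits": [1, 5, 2, 8, 3, 9],
    "basket_total": [10.0, 80.0, 15.0, 120.0, 20.0, 150.0],
    "churned": [1, 0, 1, 0, 1, 0],
}).to_sql("customers", con, index=False)

# One SQL query pulls the whole (small) dataset into RAM...
df = pd.read_sql("SELECT visits, basket_total, churned FROM customers", con)

# ...and a non-distributed library does the Machine Learning, no cluster needed.
model = LogisticRegression().fit(df[["visits", "basket_total"]], df["churned"])
new_customer = pd.DataFrame({"visits": [4], "basket_total": [30.0]})
print(model.predict(new_customer))
```

If this is all your workload looks like, a single machine really is the simplest and fastest tool for the job.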
In many cases you can aggregate your distributed data into a smaller dataset that will not need to be distributed, then perform Machine Learning on it in RAM. Efficient predictive modelling is based on enriching your data and choosing the right features; if you do that right, you save a lot of storage space and can very often process your aggregated data in RAM, without having to go through distributed systems.
This is important because today, non-distributed algorithm libraries using Python or R are much more developed and offer more possibilities than distributed projects. And even if you want to predict churn on a really large volume of logs stored on a cluster, for instance, after cleaning your data you’ll actually be processing one line per customer, which is already a much smaller volume of data. You can then run a wide variety of algorithms once that dataset fits in memory.
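As a rough illustration of that aggregation step (the column names are made up), here is how a large pile of raw log lines could collapse into one feature row per customer with pandas:

```python
import pandas as pd

# Raw event logs: many lines per customer. In real life this table
# might be huge and live on a cluster; the columns here are hypothetical.
logs = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "event": ["visit", "purchase", "visit", "visit", "purchase", "visit"],
    "amount": [0.0, 25.0, 0.0, 0.0, 60.0, 0.0],
})

# Collapse the logs into one feature row per customer.
features = logs.groupby("customer_id").agg(
    n_events=("event", "size"),
    n_purchases=("event", lambda e: (e == "purchase").sum()),
    total_spent=("amount", "sum"),
).reset_index()

print(features)
```

The aggregated table has exactly one row per customer, so even if the original logs were cluster-sized, the modelling dataset often fits comfortably in memory.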
I'm guessing you're now wondering why we bothered integrating Spark if few people really need it. Spoiler alert: it is so you can grow and scale your predictive models in the future!
Moreover, these are still the early days of Apache Spark. It’s an open source project, so it moves fast, but that also means the support infrastructure and security around it aren’t very advanced yet. Also, with each new version of Spark, a lot of things change and you may have to edit a lot of your code. That means you’re going to have to invest a lot in maintenance, and you can’t necessarily have an application that easily works across multiple Spark versions.
In the end, Hadoop and Spark aren’t actually enemies at all and work together very well. Spark was originally designed as an improvement on MapReduce to run on HDFS, and it loses its effectiveness when implemented otherwise. You can even think of Spark as a feature added to a Hadoop infrastructure to enable machine learning, stream processing, and interactive data mining.
The two products are open-source projects anyway, which makes it less relevant to talk about competition. The companies making money on these infrastructures today offer both and advise customers on which one to use based on their needs. And if you think about it, it’s actually a good thing for open source projects to be in competition, since it makes them that much more dynamic!
Alright, we're all done for today. I hope the world of Spark looks a little clearer after this episode. Tomorrow, I'll finish up with a bit about what Spark's integration into Data Science Studio brings to the table for your predictive modeling experience.