This is it guys, the last episode of my quadrilogy on Spark for data padawans. For this last section, I've tried to sum up what having a technology like Spark in DSS brings to the mix. Tell me what you want to know about next!
There's always time to catch up on the other articles in the series:
- episode 1: a little history about distributed data and Hadoop, why, and how it was done before Spark;
- episode 2: enter Spark, what it is, what it changes;
- episode 3: the epic battle between Spark and Hadoop;
- episode 4: Hadoop meets Spark in Data Science Studio, the reconciliation (finally!)
The possibility of Big Data
In the last episode, I discussed why Spark isn't a miracle solution to every data modeling problem. It's faster than MapReduce and you can do a lot with it, especially when you're doing Machine Learning on your data cluster. However, you don't necessarily need to run your algorithms on distributed data today. If you design your features properly, your data can fit in RAM, and then non-distributed libraries like scikit-learn or R packages are the better choice.
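To make that concrete, here is a minimal sketch of the in-memory case, assuming scikit-learn is installed and using synthetic data as a stand-in for real features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a feature matrix that fits comfortably in RAM
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# A single-machine model: no cluster, no setup time, just fit and score
clf = LogisticRegression(max_iter=1000).fit(X, y)
accuracy = clf.score(X, y)
```

When the data fits in memory like this, there is simply no distributed overhead to pay, which is the whole point of choosing the non-distributed library.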
Having Spark as well as Hadoop working natively inside Data Science Studio means that even if you don’t have data that big today, you can be confident that when you do, you can keep using your favorite algorithms and know that they’ll be able to scale. And as data these days is growing exponentially, you can be sure the tool will follow your company as it grows!
Data Science Studio is a tool that was built to work in the age of Big Data. That's why it's so important for us to keep adding new features and integrations to keep up with all the technological progress that is going on in the field of data management and processing. And things are moving fast!
Technologies coming together natively
Moreover, I discussed how, depending on the task at hand, every Spark computation comes with some setup time, so Spark isn't always the best way to go, or the fastest. This is where running Spark inside DSS becomes very useful.
DSS is a general-purpose tool, so you can do a lot without having to learn each specific new technology or language. When you hook up Spark, you can run many of DSS's built-in features on your Spark cluster, making them much faster. You don't even have to write Scala or Java to use Spark: you can query directly in SQL (Hive) on Spark SQL, in Python with the PySpark notebook, or in R with SparkR, and get all the advantages of Spark without the hassle. And all of Hadoop's functionalities work together with Spark in the Studio!
Spark in Data Science Studio also opens up new programming horizons with the integration of Shell recipes. Your developers can code in Java and Scala within the Studio and work alongside data scientists using R or Python, taking advantage of Spark's processing power all along.
Distributed Machine Learning!
Most importantly, the addition of Spark brings new possibilities to Data Science Studio, since we now integrate distributed algorithms natively, without having to write code or pull in outside libraries. You'll be able to use the visual interface to apply a Random Forest, a logistic regression, a linear regression, or gradient boosted trees to your distributed data. I can't really explain what these algorithms are (yet!), but your data team will be happy with this.
And we're done with Spark for data padawans! It's been fun trying to make sense of all that technical jargon and write down everything I've learned since joining Dataiku. Please tell me if some things aren't super clear in the articles, or if you disagree (everyone makes mistakes, even me!).
Also, this has been so much fun for me I figured I'd do it again, on another subject.
Tell me what you would want me to write about in my next series of articles!
I'd like to end this by saying a special thank you to everyone who helped me understand Spark at Dataiku and outside, especially Adrien who was very helpful in making sure these articles made sense.