Big Data for Data Padawans Episode 4: Bringing Spark & Hadoop Together

Data Basics | Alivia Smith

This is it, guys: the last episode of my quadrilogy on Spark for data padawans. For this final section, I've tried to sum up what having a technology like Spark inside Dataiku brings to the mix.

The Possibility of Big Data

I previously discussed why Spark isn't a miracle solution to every data modeling problem. It's faster than MapReduce and you can do a lot with it, especially when you're doing machine learning on your data cluster. However, you don't necessarily need to run your algorithms on distributed data today. If you design your features properly, your data can fit in RAM, and it's then a better choice to use non-distributed libraries like scikit-learn or R packages.
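To make that concrete, here's a minimal sketch of the in-memory route: training a scikit-learn model on a dataset small enough to fit in RAM. The file name and columns here are hypothetical, just for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a dataset that comfortably fits in RAM (hypothetical file)
df = pd.read_csv("customer_features.csv")

X = df.drop(columns=["churned"])  # engineered numeric features
y = df["churned"]                 # target label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# At this scale, a non-distributed model is often simpler and faster
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```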

Having Spark as well as Hadoop working natively inside Dataiku means that even if your data isn't that big today, you can be confident that when it is, you can keep using your favorite algorithms and know that they'll scale. And since data volumes keep growing fast, you can be sure the tool will follow your company as it grows!

Dataiku is a tool that was built for the age of data science, machine learning, and AI. That's why it's so important for us to keep adding new features and integrations that keep up with the technological progress happening in data management and processing. And things are moving fast!

Technologies Coming Together Natively

I also discussed how, depending on the task at hand, every Spark job comes with setup overhead, so Spark isn't always the best way to go, or the fastest. This is exactly where running Spark inside Dataiku becomes very useful.

Dataiku is a general-purpose tool, so you can do a lot without having to learn each new technology or language. When you hook up Spark, you can run many of Dataiku's built-in features on your Spark cluster (so, super fast). You don't even have to go through Scala or Java: you can query directly in SQL (HiveQL) through Spark SQL, in Python with the PySpark notebook, or in R with SparkR, and get all the advantages of Spark without the hassle. And all of Hadoop's functionality works together with Spark in Dataiku!
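As an illustration of the kind of SQL-on-Spark access described above, here's a minimal PySpark sketch that registers a dataset and queries it in plain SQL. The session setup, file, and column names are assumptions for the example, not Dataiku-specific code.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("padawan-demo").getOrCreate()

# Load a dataset into a distributed DataFrame (hypothetical file)
visits = spark.read.csv("web_visits.csv", header=True, inferSchema=True)

# Expose it as a SQL-queryable view and aggregate with plain SQL
visits.createOrReplaceTempView("visits")
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS hits
    FROM visits
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()
```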

Spark in Dataiku also opens up new programming horizons with the integration of Shell recipes. Your developers can code in Java and Scala within Dataiku and work alongside data scientists using R or Python, taking advantage of Spark's processing power all along.

Distributed Machine Learning

Most importantly, the addition of Spark brings new possibilities to Dataiku: distributed algorithms are now integrated natively, without having to write code or rely on outside libraries. You can use the visual interface to apply a random forest, logistic regression, linear regression, or gradient-boosted trees to your distributed data.
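For context, here's roughly the kind of code the visual interface saves you from writing by hand: a PySpark MLlib sketch that trains a random forest on a distributed DataFrame. The file, column names, and parameters are hypothetical, and this is a generic MLlib example rather than what Dataiku runs internally.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("distributed-ml-demo").getOrCreate()

# A distributed dataset with numeric features and a binary label (hypothetical)
df = spark.read.parquet("features.parquet")

# MLlib expects the features assembled into a single vector column
assembler = VectorAssembler(inputCols=["age", "visits", "spend"],
                            outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                            numTrees=100)

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, rf]).fit(train)

predictions = model.transform(test)
predictions.select("label", "prediction").show(5)
```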
