Spark for Data Padawans Episode 1: a look at distributed data storage

Technology | Alivia

If you've been anywhere near data in the past year or so you must have heard about the war going on between Spark and Hadoop for total control over the management of large amounts of data.

We have a big announcement coming at Dataiku about Spark, so ever since I started working that word has been popping up every day and I kept wondering what it could mean. I’ve already written about my limited technical background before arriving at Dataiku. Luckily, in the past month, I’ve had the opportunity to speak to all of our brilliant data scientists and developers, as well as a couple of data experts. I’ve learned so much about Data Science and Big Data infrastructure and I figured there might be people out there who were as clueless as me when I first heard the words Hadoop, Hive or Spark.

So here is my investigation into what the heck Spark is. It is a long and winding path with several obstacles to overcome along the way, so I couldn't write about it all at once and split it into a quadrilogy.

So let us begin our journey into big data.

It all begins with a definition

The first step to understanding what Spark is was of course googling it, and checking Wikipedia:

"Apache Spark is an open source cluster computing framework originally developed in the AMPLab at University of California, Berkeley but was later donated to the Apache Software Foundation where it remains today. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's multi-stage in-memory primitives provides performance up to 100 faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well-suited to machine learning algorithms."

Suffice it to say that was not helpful. Luckily we have lots of people at Dataiku who took some time to clear things up for me.

The first part of the definition is that Spark is an open source cluster computing framework. That means that it’s used to process data that is distributed across clusters. So step 1 in understanding why Spark is useful is understanding clusters.

Why do you distribute data storage?

The general idea is that when you start collecting, storing, and analyzing really large quantities of data, keeping it all on one machine becomes really inefficient, especially once the data no longer fits in that machine's RAM (Random Access Memory: storage your computer can access very fast, but which is volatile and limited in size).

You also don’t want to have all of your data stored in just one place because if that one computer goes down you lose everything. And one of the rules when working with computers is that they always crash. Moreover, the more data you have stored in one server, the longer it takes to read through it to find the relevant information.

The origins of Hadoop

These issues first started preoccupying Google, whose business model depends on storing huge amounts of data and having fast access to all of it. Makes sense. Google's solution, the Google File System, was later adapted into the Hadoop Distributed File System (HDFS for the initiated).

hadoop logo
Yes, Hadoop has one of the cutest logos in Big Data.

The basis of the system is to store the data on a cluster of servers. Instead of having to invest in one really large server, the system runs on the hard drives of lots of ordinary machines, comparable to your PC. This makes it scale horizontally: when you need more capacity, you just add more machines.

The data is stored redundantly (meaning the same information is written on several different servers, usually 3), so that the system isn’t weakened by one server going down. The data is processed with the processing system Hadoop MapReduce which operates over the cluster resource manager Hadoop YARN. MapReduce can take distributed data, transform it (Map), and aggregate it (Reduce) into other useful data.
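To make the Map and Reduce steps concrete, here is a minimal sketch in Python of the classic word-count example. It is not real Hadoop code (a real job would be written in Java and run across many machines); the "chunks" list is just a stand-in for data spread over several worker nodes.

```python
from collections import defaultdict

# Toy "distributed" input: pretend each string is a chunk of data
# living on a different worker node.
chunks = ["spark and hadoop", "hadoop stores data", "spark processes data"]

# Map phase: each worker independently turns its chunk into (key, value) pairs.
mapped = []
for chunk in chunks:
    for word in chunk.split():
        mapped.append((word, 1))

# Shuffle: group all the pairs by key (the framework does this between phases).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in groups.items()}

print(word_counts["hadoop"])  # 2
```

The important idea is that the Map step and the Reduce step each only look at their own slice of the data, which is why the work can be spread across a cluster.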

basic hadoop stack
Basic Hadoop Stack.

The cluster of servers is made up of a client server, a driver node, and worker nodes (they're also often called master nodes and slave nodes, but that's not super politically correct; in HDFS proper, the master is called the NameNode and the workers DataNodes). A node, by the way, is a server, which is basically a computer. I am grossly simplifying, but the driver node distributes jobs across the worker nodes, which execute them, and tracks their progress. The client server is where HDFS, MapReduce, and YARN are installed. Data passes through the client server when it is written via HDFS or processed by MapReduce, and is then written to the worker nodes. And YARN manages the resource allocation and the job flow. Basically, YARN is da boss.

Whenever you write a new record or make a query, the driver node knows exactly where to write, or where to find the relevant info among the worker nodes, thanks to the metadata ("data about data") stored by HDFS.
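A rough sketch of what that metadata looks like, again in toy Python: a table mapping each block of each file to the servers holding its copies. All the names here are invented for illustration; the real NameNode data structures are far more involved.

```python
import random

# Invented worker names; HDFS defaults to 3 replicas per block.
workers = ["node1", "node2", "node3", "node4", "node5"]
REPLICATION = 3

# The "metadata": (filename, block index) -> which workers hold a replica.
block_locations = {}

def write_block(filename, block_index):
    # Pick 3 distinct workers to hold this block's replicas.
    block_locations[(filename, block_index)] = random.sample(workers, REPLICATION)

def read_block(filename, block_index):
    # A read only needs one live replica; any of the 3 will do.
    return block_locations[(filename, block_index)][0]

write_block("logs.txt", 0)
write_block("logs.txt", 1)

# Even if two of the three servers holding block 0 crash, a copy survives.
print(len(block_locations[("logs.txt", 0)]))  # 3
```

This is why losing a single server doesn't lose your data: the metadata always points the reader at a surviving replica.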

There are also a whole lot of layers on top of MapReduce, like Hive, which lets you write SQL-like queries, and Pig, which lets you code in Pig Latin. This is important because raw MapReduce jobs are written in Java and are pretty hard to program without those additions.

hadoop stack
A little bit more advanced Hadoop Stack.

Hadoop rules

The Hadoop ecosystem allows you to store large amounts of data and access it faster. It's cheaper to set up than a ginormous server and also much faster for processing the data. Indeed, when you multiply the number of machines, you multiply the processing speed (and the number of jobs you can run at the same time) roughly proportionally. The redundancy principle built into HDFS means the system is protected against server failures.

And that's it for episode 1! Read along for part 2 of my investigation into Spark vs Hadoop: when Spark-the-underdog rises against Hadoop-the-great in an epic battle for big data domination. But are they really enemies?
