Current technologies make it possible to build a data science stack very cheaply, one that performs as well as, or even better than, systems that cost a lot only a few years ago.
In this (very long) post, I am going to show how to leverage Amazon Web Services (platform), Vertica (analytical database), and RStudio Server to put together your own infrastructure. I wanted it to be exhaustive, since some of the steps involved can be found separately on the web, but not as a whole.
We'll need the following services from Amazon:
- an EC2 instance
- an EBS volume attached to the instance to increase its storage and allow data persistence
- an S3 bucket
If you don't have one yet, register for an AWS account and for the required services.
Let's start with the instance. Go to the management console and, under the EC2 tab, click on Launch Instance. The first thing you need to specify is the AMI you want to use; among other things, the AMI determines the OS. I chose an EBS-backed Red Hat 6.2 (64-bit) distribution, a bit more expensive than other Linux distros, but supported by both Vertica and RStudio Server. Its AMI ID is ami-41d00528:
Next, Amazon asks you to specify the instance type and region. Since we are going to perform intensive tasks, let's choose an m1.xlarge to start with. Note that the instance can be resized later if needed.
Leave defaults for the next settings (just make sure to set the Shutdown Behavior to "Stop"):
You can then put tags on the instance. Let's set the name to "Big Data Stack":
Select the key pair you want to use:
Specify your security group.
Note that you just need to open TCP port 22 to have ssh access to the machine:
We're almost done. Just click on Launch Instance in the final summary window:
Now just wait until the instance starts. Once it is running, you'll need the public name of the instance:
The instance is up and ready! But its storage capacity is limited. To get extra space, we'll need to attach an EBS volume to the instance. The process is easy but not completely straightforward.
Just go back to the console and, again under the EC2 tab, go to Elastic Block Store -> Volumes. Click on "Create Volume" and specify the size and zone. Let's choose 1 TB, and the same zone as our instance:
You now need to attach it to the instance. Tick the checkbox next to your EBS volume and, under "More...", select "Attach Volume":
To configure the attachment, you need to select the instance and the device you want to use:
NOTE: there is a typo in the screenshot above. The device you want to use on RHEL 6 is /dev/sdp.
At this stage, we're done with the console. We can log into the instance to continue the configuration.
Just open a terminal, and ssh to the machine using the key pair you specified earlier and the public name of the machine:
ssh -i data_biz.pem email@example.com
Let's mount the EBS volume on the instance to be able to actually use it (AWS does not mount an EBS volume automatically). It can be a little bit tricky since, for some reason, the device name chosen in the console is not the device name actually used by the instance. To get some info on the devices, you can use the following command:
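The exact command is not reproduced in this copy of the post. As a stand-in, here is a minimal sketch that lists the block devices the kernel sees by reading /proc/partitions. The helper name list_block_devices is hypothetical, and the optional file argument exists only to make the function easy to try on sample input.

```shell
# List block devices known to the kernel, one /dev path per line.
# Reads /proc/partitions by default; another file in the same format can be passed as $1.
list_block_devices() {
  # The first two lines of /proc/partitions are a header and a blank line.
  awk 'NR > 2 && NF == 4 { print "/dev/" $4 }' "${1:-/proc/partitions}"
}

# Usage:
#   list_block_devices
```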
The output will allow you to identify the right device:
Since this is the first use of the EBS, we need to create a file system on it. To do so, issue for instance:
mkfs -t ext3 /dev/xvdt
This creates an ext3 file system on the disk. NOTE: don't re-issue this command if you re-attach the EBS volume, or all the data will be lost!
Then create a mount point on the instance, and actually mount the EBS on it:
mkdir -p /data
mount /dev/xvdt /data
That's it! Your EBS volume is ready for use:
One last thing: since we are going to need S3 later, install s3cmd on the instance, a command-line tool that lets you interact with S3 programmatically. First, add the s3tools repository to yum:
cd /etc/yum.repos.d
wget http://s3tools.org/repo/RHEL_6/s3tools.repo
And install it:
yum install s3cmd
Note that you'll need to set up s3cmd with your credentials. Just run s3cmd --configure and type in your access and secret keys.
When it comes to analytics, you really don't want a database engine built for something else, like MySQL or even PostgreSQL. You need a dedicated system, an "analytical" database, whose internals (columnar storage...) are optimized for this kind of workload (big table scans, lots of joins and aggregations...). A few options are available. To name a few vendors offering Community Editions of their analytical databases: InfiniDB (which has some stability and functionality issues), Greenplum, or Vertica. Note that the paid "Enterprise" versions are all massively parallel, which means they will scale better than the single-node Community Editions.
I chose Vertica because I had heard a lot of hype around it and wanted to test it myself. You'll first need to register to get access to the CE program (this can take a few days). Once done, you can download the installers. I put mine on S3 so I could fetch it easily from my instance. So once your installer is sitting on S3, just download it to your instance:
s3cmd get s3://thomascabrol/installers/vertica*.rpm
Once downloaded, just install it by running:
rpm -Uvh vertica-ce-5.1.1-0.x86_64.RHEL5.rpm
You'll need a new directory to hold Vertica's data and to set proper read/write permissions (well, actually that's full permission...):
mkdir -p /data/vertica
chmod -R 777 /data/vertica
Now we're ready to actually configure it using the install_vertica script. We want maximum storage for Vertica, so we are going to tell it to use the EBS volume as the data directory:
/opt/vertica/sbin/install_vertica -d /data/vertica/
The installation starts:
A new user, dbadmin, is created and you are prompted for a password. Once completed, log in as dbadmin:
You are now ready to create your first database. To do so, use adminTools:
You will now be guided through the process:
Yeah! That worked. Now we can play a bit :)
Playing with Vertica
Let's use the MovieLens data set, published by GroupLens Research. First, get the data:
mkdir -p /data/movie_lens
This is the content of the archive:
The main data is in the ratings.dat file, which looks like this:
You want a field separator that Vertica understands easily, so just replace the "::" with a semicolon:
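The command itself is missing from this copy of the post. A minimal sketch using sed could look like this; the helper name swap_separator and the output file name are assumptions, not the author's.

```shell
# Replace every "::" field separator with ";" and write the result to a new file.
# $1 = input file (e.g. ratings.dat), $2 = output file (hypothetical name).
swap_separator() {
  sed 's/::/;/g' "$1" > "$2"
}

# Usage (paths are assumptions):
#   swap_separator /data/movie_lens/ratings.dat /data/movie_lens/ratings.csv
```

This is only safe if no field already contains a semicolon, which is worth checking for the movie titles before converting that file the same way.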
Now we're ready for Vertica. Start the vsql utility:
/opt/vertica/bin/vsql -d analytics
Create a new table:
And load the data:
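The original statements are not shown in this copy. Purely as an illustration, the table creation and load could look like the sketch below. It prints SQL to pipe into vsql from the shell, rather than typing it at the vsql prompt. The table name, column names, and file path are assumptions; the column order follows the UserID::MovieID::Rating::Timestamp layout of ratings.dat.

```shell
# Print the SQL that creates a ratings table and bulk-loads the converted file.
# $1 = path to the ";"-separated ratings file (hypothetical path).
ratings_sql() {
  cat <<EOF
CREATE TABLE ratings (
    user_id  INTEGER,
    movie_id INTEGER,
    rating   FLOAT,
    rated_at INTEGER  -- Unix timestamp, kept as a plain integer
);
COPY ratings FROM '$1' DELIMITER ';';
EOF
}

# Usage:
#   ratings_sql /data/movie_lens/ratings.csv | /opt/vertica/bin/vsql -d analytics
```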
It takes slightly more than 10 seconds to load 8 million records (yeah, I need to investigate why 2 million records are missing...), not bad at all...
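One simple starting point for that investigation, offered as a sketch rather than the author's method, is to count the lines in the source file and compare the result with the row count the database reports. The helper name count_records is hypothetical.

```shell
# Count the records (lines) in a data file so the number can be compared
# with the row count reported by the database after the load.
count_records() {
  # tr strips the padding some wc implementations add around the number.
  wc -l < "$1" | tr -d ' '
}

# Usage (path is an assumption):
#   count_records /data/movie_lens/ratings.dat
```

The figure can then be compared with the result of a SELECT COUNT(*) on the loaded table.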
Let's also create the movie lookup table:
And now, for example, look for the 10 most-rated movies:
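The query is not shown in this copy. A sketch of what it might look like follows, again printed from the shell so it can be piped into vsql. The table names (ratings, movies) and column names (movie_id, title) are assumptions about the schema.

```shell
# Print a query that returns the 10 movies with the most ratings.
top_rated_sql() {
  cat <<'EOF'
SELECT m.title, COUNT(*) AS num_ratings
FROM ratings r
JOIN movies m ON m.movie_id = r.movie_id
GROUP BY m.title
ORDER BY num_ratings DESC
LIMIT 10;
EOF
}

# Usage:
#   top_rated_sql | /opt/vertica/bin/vsql -d analytics
```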
You get the results after a very short time, which is really cool!
Installing RStudio Server
Now that our nice Vertica instance is up and running, let's install RStudio Server to complement it with a killer piece of analytics software. The installation process is quite straightforward.
First, "su root" to change user back to root and update your yum repo:
And install R base:
Get the RStudio Server rpm:
Before installing RStudio, there are a couple of dependencies to get first, if they are not yet on your system:
RStudio Server can now be installed:
We're almost done. Let's create a user dedicated to RStudio:
You'll also be prompted for a password at this step.
RStudio Server is now installed and running on the machine. It can be accessed from a browser on port 8787. This port is not open to the public for security reasons (remember that we only opened port 22 on the machine). To access it from the outside, we can open an SSH tunnel from our local machine to the EC2 instance and forward port 8787. Open a new terminal window on your local machine and type:
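The tunnel command is not shown in this copy. A minimal sketch with standard OpenSSH options follows. The function name open_rstudio_tunnel is hypothetical, and the key file and host are whatever you used to log in earlier.

```shell
# Forward local port 8787 to port 8787 on the EC2 instance over SSH.
# $1 = private key file, $2 = user@public-name of the instance.
open_rstudio_tunnel() {
  # -N: do not run a remote command; -L: local port forwarding.
  ssh -i "$1" -N -L 8787:localhost:8787 "$2"
}

# Usage: pass the same key file and user@host as in the earlier ssh login.
#   open_rstudio_tunnel <key file> <user@host>
```

The command stays in the foreground for as long as the tunnel is open; stop it with Ctrl-C when you are done.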
Now open a browser and point it to localhost, on port 8787:
Just type in your RStudio user and password:
You're in. You now have browser access to RStudio on your EC2 machine!
Configuring connectivity from RStudio to Vertica
This is the last step. We need to make RStudio able to talk to Vertica, which is done via ODBC. The first step is to install the RODBC package (either via the package manager in RStudio, or via yum install). Next, a couple of configuration files have to be created:
- /etc/odbc.ini with the following information
- /etc/vertica.ini with the following information
Now export the VERTICAINI variable:
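Assuming the settings file was created at /etc/vertica.ini as listed above, the export would be:

```shell
# Point the Vertica ODBC driver at its settings file (path taken from the list above).
export VERTICAINI=/etc/vertica.ini
```

The variable has to be visible to the process that loads the ODBC driver, so exporting it in one interactive shell may not be enough for an RStudio Server session.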
Go back to your local browser on localhost:8787 (RStudio), and you can now connect to Vertica from RStudio:
If we go back to the most rated movies use case, we can build a nice graph like this:
That's it! Just go find some data and play with Vertica and RStudio on Amazon!