Current technologies allow us to build a data science stack for very little, and it will perform as well as or better than systems that cost a fortune only a few years ago.
In this (very long) post, I am going to show you how to leverage Amazon Web Services (the platform), Vertica (the analytical database), and RStudio Server in order to put together your own infrastructure. I want this post to be as exhaustive as possible, so I've gathered in one place steps that are otherwise scattered across the web.
We'll need the following services from Amazon:
- an EC2 instance
- an EBS volume attached to the instance to increase its storage and allow data persistence
- an S3 bucket
If you don't have these, register for an AWS account and for the required services.
Let's start with the instance. Go to the management console and, under the EC2 tab, click on Launch Instance. The first thing you need to do is specify the AMI you would like to use, which, importantly, determines the OS. I chose an EBS-backed Red Hat 6.2 (64-bit) distribution: a bit more expensive than other Linux distros, but supported by both Vertica and RStudio Server. Its AMI ID is ami-41d00528.
Next, Amazon asks us to specify the instance type and region. Since we're going to perform intensive tasks, let's choose a m1.xlarge to start with. Note that the instance could be resized later if needed.
Leave defaults for the next settings (just make sure you set the Shutdown Behavior to "Stop").
You can then put tags on the instance. Let's set the name to "Big Data Stack".
Select the key pair you want to use.
Specify your security group.
Note that you just need to open TCP port 22 to have ssh access to the machine.
We're almost done. Just click on Launch Instance on the last summary window.
Now just wait until the instance starts. Once running, you'll need the public name of the instance.
The instance is ready to go! But its storage capacity is limited. To get extra space, we'll attach an EBS volume to the instance. The process is easy but not completely straightforward.
Just go back to the console, and again under the EC2 tab, go to Elastic Block Store, then Volumes. Click on "Create Volume" and specify the size and zone. Let's choose 1 TB, in the same availability zone as our instance.
You now need to attach it to the instance. Just click on the tick box of your EBS and under "More..." select "Attach Volume".
To configure the attachment, you need to select the instance, and the device you want to use.
NOTE: there is a typo on the screenshot above. The device you want to use on RHEL 6 is /dev/sdp.
At this stage, we're done with the console. We can log into the instance to continue the configuration.
Just open a terminal, and ssh to the machine using the key pair you specified earlier and the public name of the machine:
ssh -i data_biz.pem root@<public-dns-of-your-instance>
Let's mount the EBS volume on the instance (AWS does not mount it automatically) so we can actually use it. This can be a little tricky since, confusingly, the device name shown in the console is not the device name the instance actually uses. To get some info on the devices, you can use the following command:
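One option (my choice, the original command may have differed) is to ask the kernel directly which block devices it sees:

```shell
# List the block devices known to the kernel; on this RHEL 6.2 AMI,
# a volume attached as /dev/sdp in the console shows up under a
# shifted name such as /dev/xvdt
cat /proc/partitions
```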
The output will allow you to identify the right device.
Since this is the first use of the EBS, we need to create a file system on it. To do so, issue for instance:
mkfs -t ext3 /dev/xvdt
This creates an ext3 file system on the disk. NOTE: don't re-run this command if you re-attach the EBS volume later, or all the data on it will be lost!
Then create a mount point on the instance, and actually mount the EBS onto it:
mkdir -p /data
mount /dev/xvdt /data
That's it! Your EBS is ready to use.
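One caveat: the mount will not survive a reboot on its own. If you want it back automatically, an /etc/fstab entry along these lines (assuming the device keeps its /dev/xvdt name) does it:

```
/dev/xvdt  /data  ext3  defaults,noatime  0 0
```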
One last thing: since we're going to need S3 later, install the s3cmd tool on the instance, which lets it interact with S3 programmatically. First, update your yum repo:
cd /etc/yum.repos.d
wget http://s3tools.org/repo/RHEL_6/s3tools.repo
And install it:
yum install s3cmd
Note that you'll need to set up s3cmd with your credentials. Just run s3cmd --configure and type in your access and secret keys.
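A quick sketch of the setup and a sanity check (the bucket name is a placeholder for your own):

```shell
s3cmd --configure                       # interactive: paste your AWS access and secret keys
s3cmd ls                                # sanity check: lists your buckets
s3cmd put somefile s3://your-bucket/    # upload a file to S3
```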
When it comes to analytics, you really don't want a database engine built for something else, like MySQL or even PostgreSQL. You need to go for a dedicated system, an "analytical" database, whose internal mechanics (columnar storage, etc.) are optimized for this kind of workload (big table scans, lots of joins and aggregations...).
A few options are available, as several vendors offer Community Editions of their analytical databases: InfiniDB (which has some stability and functionality issues), Greenplum, or Vertica. Note that the paid "Enterprise" versions are all massively parallel, which means that they will scale better than the single-node Community Editions.
I chose Vertica because I heard a lot of hype about it and wanted to test it myself. You'll first need to register to get access to the CE program (this can take a few days). Once that's done, you can download the installers. I put mine on S3 to be able to get it easily on my instance. So, when you have your installer sitting on S3, just download it on your instance:
s3cmd get s3://thomascabrol/installers/vertica*.rpm
Once downloaded, just install it by running:
rpm -Uvh vertica-ce-5.1.1-0.x86_64.RHEL5.rpm
You'll need a new directory on the EBS volume to hold Vertica's data, with proper read/write permissions (well, full permissions, actually...):
mkdir -p /data/vertica
chmod -R 777 /data/vertica
Now we're ready to actually configure it by using the install_vertica script. We want maximum storage for Vertica, so we're going to tell it to use the EBS as the data directory.
/opt/vertica/sbin/install_vertica -d /data/vertica/
The installation starts:
A new user, dbadmin, is created and you're prompted for a password. Once completed, log in as dbadmin:
You're ready to create your first database. To do so, use adminTools:
You will now be guided through the process:
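For reference, adminTools lives under /opt/vertica/bin, and the tool is menu-driven. A minimal session looks like this; I name the database "analytics" (the name used later in this post) and point it at the EBS data directory:

```shell
su - dbadmin
/opt/vertica/bin/adminTools
# In the menus: Configuration Menu -> Create Database,
# name it "analytics", and set the data path to /data/vertica
```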
Yeah! That worked, so now we can play a bit :)
Playing with Vertica
Let's use the MovieLens dataset, which can be found on the GroupLens website. First, get the data:
mkdir -p /data/movie_lens
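To fetch and unpack the archive (the URL below is an assumption on my part; use whatever link the GroupLens download page gives you):

```shell
cd /data/movie_lens
# URL is an assumption: check the GroupLens site for the current
# location of the 10M MovieLens archive
wget http://files.grouplens.org/datasets/movielens/ml-10m.zip
unzip ml-10m.zip
# the .dat files may land in a subdirectory; move them up into
# /data/movie_lens if so
```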
This is the content of the archive.
The main data is in the ratings.dat file, which looks like this:
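Each line follows MovieLens' UserID::MovieID::Rating::Timestamp layout; an illustrative line:

```
1::122::5::838985046
```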
You want a field separator easily understood by Vertica, so just swap the "::" for a semicolon:
Now we're ready for Vertica. Start the vsql utility:
/opt/vertica/bin/vsql -d analytics
Create a new table.
And load the data.
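For instance, with column names of my own choosing and the semicolon delimiter from the previous step, the table creation and the load can be done in one vsql call:

```shell
/opt/vertica/bin/vsql -d analytics -c "
CREATE TABLE ratings (
    user_id  INTEGER,
    movie_id INTEGER,
    rating   FLOAT,
    rated_at INTEGER
);
COPY ratings FROM '/data/movie_lens/ratings.dat' DELIMITER ';' DIRECT;"
```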
It takes slightly more than 10 seconds to load 8 million records (though I need to investigate why 2 million records seem to be missing...). Not bad at all...
Let's also create the movie lookup table.
And now for an example, look for the 10 most rated movies.
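Assuming the lookup table was created as movies(movie_id, title, genres), the query can look like this:

```shell
/opt/vertica/bin/vsql -d analytics -c "
SELECT m.title, COUNT(*) AS n_ratings
FROM ratings r
JOIN movies m ON m.movie_id = r.movie_id
GROUP BY m.title
ORDER BY n_ratings DESC
LIMIT 10;"
```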
You'll get the results after a very short time - really cool!
Installing RStudio Server
Now that our nice Vertica instance is up and running, let's install RStudio Server to complement it with a killer analytics software. The installation process is quite straightforward.
First, "su root" to change user back to root and update your yum repo.
And install R base.
Get the RStudio Server rpm.
In order to install RStudio, there are a couple of dependencies to get first, if they are not already on your system.
RStudio Server can now be installed.
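As a sketch of the whole sequence (the rpm URL and <version> are placeholders to adapt to the current release; openssl098e is typically the dependency RStudio Server asks for on RHEL 6):

```shell
# run everything as root
yum install R                  # assumes a repo providing R (e.g. EPEL) is enabled
yum install openssl098e        # compatibility library RStudio Server needs on RHEL 6
wget <rstudio-server-rpm-url>  # grab the 64-bit rpm from the RStudio site
rpm -Uvh rstudio-server-<version>-x86_64.rpm
```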
We're almost done. Let's create a user dedicated to RStudio.
You'll also be prompted for a password at this stage.
RStudio Server is now installed and running on the machine. It can be accessed via a browser on port 8787. This port is not open to the public for security reasons (remember that we only opened port 22 on the machine). To access it from the outside, we can open an ssh tunnel from our local machine to the EC2 instance and forward port 8787. Open a new terminal window on your local machine and type:
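A tunnel command along these lines works, with the same key pair and public DNS as before (-N keeps the session open without running a remote command):

```shell
# forward local port 8787 to port 8787 on the instance
ssh -i data_biz.pem -N -L 8787:localhost:8787 root@<public-dns-of-your-instance>
```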
Now open a browser and point to localhost, on port 8787.
Just type in your RStudio user name and password.
You're in. You now have access via a browser to RStudio on your EC2 machine!
Configuring connectivity from RStudio to Vertica
This is the last step. We need to enable RStudio to talk to Vertica. This is done via ODBC. The first step is to install the RODBC package (either via the package manager in RStudio, or via yum install). Next, a couple of configuration files need to be created:
- /etc/odbc.ini with the following information
- /etc/vertica.ini with the following information
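For reference, here is the kind of content these files need. The driver and library paths are assumptions on my part, so check where your Vertica install actually put libverticaodbc. /etc/odbc.ini declares the DSN:

```
[analytics]
Description = Vertica analytics database
Driver      = /opt/vertica/lib64/libverticaodbc.so
Database    = analytics
Servername  = localhost
Port        = 5433
```

And /etc/vertica.ini holds the driver-level settings:

```
[Driver]
ODBCInstLib       = /usr/lib64/libodbcinst.so
ErrorMessagesPath = /opt/vertica/lib64
```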
Now export the VERTICAINI variable.
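For instance (add the line to the rstudio user's profile as well, so it survives new sessions):

```shell
# point Vertica's ODBC driver at its settings file
export VERTICAINI=/etc/vertica.ini
```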
Go back to your local browser on localhost:8787 (RStudio) and you can connect to Vertica from RStudio.
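A minimal connectivity check, runnable from the R console or via Rscript; the DSN name comes from odbc.ini, and the password placeholder is yours to fill in:

```shell
Rscript -e '
library(RODBC)                                          # ODBC bridge for R
ch <- odbcConnect("analytics", uid = "dbadmin", pwd = "<password>")
print(sqlQuery(ch, "SELECT COUNT(*) FROM ratings"))     # quick sanity check
odbcClose(ch)
'
```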
If we go back to the most rated movies use case, we can build a nice graph like this:
That's it! Just go find some data and you can play with Vertica and RStudio on Amazon!