Big data. The term means many things to many people. The best definition I've heard is data of a size that won't fit on your laptop. With 1 terabyte hard drives readily available, by "fit" I mean too big to process on your laptop, not merely too big to store. A top-of-the-line MacBook Pro packing a quad-core processor and 16GB of RAM can analyze a lot of data, but it's easy to surpass even those mighty specs. For instance, if you want to download the Airline On-Time Performance data, you'll need 12GB. Want the Reddit Comment Corpus? You'll need 250GB just to store the compressed data. Want to uncompress it? You had better purchase an external hard drive. And analyzing it? You're going to need some serious power.
So What About Excel?
A number of years ago, most datasets could easily be opened and worked with in Excel. Excel has a number of powerful features that let you clean, standardize, and munge data. However, even the mighty spreadsheet has its limitations. But what are they? Sure, a 100-megabyte spreadsheet seems to take forever to open, even on a Windows machine packed with memory, but what do we mean by "limitations"?
Since Excel 2007, these have been the application's maximum capacities*:
- Worksheet size: 1,048,576 rows by 16,384 columns
- Number of worksheets: limited by available memory
- Column width: 255 characters
- Number of characters in a single cell: 32,767
- Number of characters in the header or footer: 255
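Before opening a large file in Excel, it can be worth checking it against those limits. Here's a minimal Python sketch that counts rows and columns in a CSV without loading it all into memory (the limit constants are the worksheet figures from the list above; the function name is just illustrative):

```python
import csv

# Worksheet limits since Excel 2007 (rows x columns)
EXCEL_MAX_ROWS = 1_048_576
EXCEL_MAX_COLS = 16_384

def fits_in_excel(path):
    """Return True if the CSV at `path` fits within a single Excel worksheet."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        n_cols = len(header)
        n_rows = 1 + sum(1 for _ in reader)  # header plus data rows
    return n_rows <= EXCEL_MAX_ROWS and n_cols <= EXCEL_MAX_COLS
```

A 12GB file like the airline data would fail this check long before Excel finished choking on it.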
What about the PowerPivot Excel add-in, you ask? That supports files up to 2GB in size and lets you work with up to 4GB of data in memory*. That's a big step up, but still not enough for datasets like the ones mentioned above.
So what do we do if our dataset is larger than what Excel can handle?
Let's Replace Excel and Keep the Familiarity
The bottom line is this: Excel is very easy to use, which is one reason so many businesses rely on it. But as we've seen, it has limitations, and many of today's datasets simply won't fit in it.
Enter Dataiku DSS.
Not only can I see my data in front of me; as I write recipes to prepare it for analysis, I can preview how each transformation will affect the data, all without changing the underlying dataset!
How many times have you made a major update to a spreadsheet, saved and closed it, and then realized you didn't want to save the update, but it was already too late?
I'll admit, I've done it. With DSS we don't have to worry about that.
What About My Macros?
More advanced Excel users write VBA macros to automate parts of their spreadsheets. I've seen macros that will:
- Manipulate data and text
- Apply style and formatting
- Pull data from various data sources (databases, text files, etc.)
- Create entirely new spreadsheets
- Do all of the above in any given order
In Dataiku DSS we have steps, which we can chain together to create recipes.
A data cleaning recipe for a full address might be composed of the following steps:
- Trim the whitespace from both ends of the string
- Extract the zip code
- Extract the state
- Standardize the formatting of the state
- Extract the city
- Extract the apartment or suite number
- Split the address into the house number and street name
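The steps above can be sketched as a chain of small, separable transforms. Here's a minimal Python version (the regex patterns, the tiny state-abbreviation map, and the sample address are illustrative assumptions, not DSS's actual processors):

```python
import re

# Tiny demo map for standardizing state names; a real recipe would cover all 50
STATES = {"new york": "NY", "ny": "NY", "california": "CA", "ca": "CA"}

def clean_address(raw):
    """Apply the cleaning steps in order, each one a small transform."""
    addr = raw.strip()  # 1. trim whitespace from both ends

    # 2. extract the zip code (US 5-digit, optional +4)
    zip_code = None
    m = re.search(r"(\d{5})(?:-\d{4})?\s*$", addr)
    if m:
        zip_code = m.group(1)
        addr = addr[:m.start()].rstrip(", ")

    # 3-4. extract the state and standardize its formatting
    state = None
    m = re.search(r",\s*([A-Za-z .]+)$", addr)
    if m and m.group(1).strip().lower() in STATES:
        state = STATES[m.group(1).strip().lower()]
        addr = addr[:m.start()]

    # 5. extract the city (last comma-separated token remaining)
    city = None
    if "," in addr:
        addr, city = [part.strip() for part in addr.rsplit(",", 1)]

    # 6. extract the apartment or suite number
    apt = None
    m = re.search(r"\b(?:Apt\.?|Suite|Ste\.?)\s*(\w+)\s*$", addr, re.I)
    if m:
        apt = m.group(1)
        addr = addr[:m.start()].rstrip(", ")

    # 7. split into house number and street name
    m = re.match(r"(\d+)\s+(.*)", addr)
    house_number, street = (m.group(1), m.group(2)) if m else (None, addr)

    return {"house_number": house_number, "street": street, "apt": apt,
            "city": city, "state": state, "zip": zip_code}
```

For example, `clean_address("  123 Main St Apt 4B, Springfield, NY 12345  ")` yields the house number, street, apartment, city, standardized state, and zip as separate fields. The point of the step-per-transform structure is the same as in DSS: any one step can be switched off without touching the others.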
Going back to what I said earlier about saving changes you don't want: if you don't want to apply a particular step, you simply turn it off.
No more try and undo!
As Easy as Excel with a Lot More Power
Excel has its merits and its place in the data science toolbox. For many companies it's the go-to tool for working with small, clean datasets.