Original blog post written by Laurent Couarraze and published on January 5, 2015.
Preparing Data in DSS
In the following article, and as promised in their last blog post "Why You Should Use a Data Science Tool", Affini-Tech explores and illustrates the benefits of using Dataiku's Data Science Studio (DSS) software for the very first step of any data-oriented workflow: data preparation.
Articles about data science rarely dwell on data preparation, even though this step is necessary for most data analysis projects. This gap in the literature is all the more surprising given how time-consuming the process usually is. Let’s take a closer look at how a data science tool such as DSS can help simplify data preparation.
Loading data is often complex because of the multitude of existing formats, and parsing can require a wide range of tools (custom code, Python scripts, SQL queries…). DSS can connect to numerous data sources (files, Hadoop, SQL, cloud storage, NoSQL, Twitter streams…). Furthermore, configurable parameters and a preview feature make parsing easy. For example, one can browse an HDFS directory tree, select a log file, and define its read parameters.
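To make the idea of read parameters and a preview concrete, here is a minimal sketch in plain Python. It is not DSS's own API: the sample log, the function name `preview`, and the parameters `separator`, `has_header`, and `preview_rows` are all assumptions made for illustration. The sketch parses only the first few rows so that the settings can be checked before the whole file is loaded.

```python
import csv
import io
from itertools import islice

# Hypothetical log excerpt; in practice this would come from a file or HDFS.
SAMPLE_LOG = (
    "timestamp\tstatus\tpath\n"
    "2015-01-05T10:00:00\t200\t/index.html\n"
    "2015-01-05T10:00:02\t404\t/missing\n"
    "2015-01-05T10:00:03\t200\t/about\n"
)


def preview(text, separator="\t", has_header=True, preview_rows=2):
    """Parse at most `preview_rows` data rows using the given read parameters."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    if has_header:
        header = next(reader)
        data = list(islice(reader, preview_rows))
    else:
        data = list(islice(reader, preview_rows))
        # No header line: generate placeholder column names.
        header = [f"col_{i}" for i in range(len(data[0]))] if data else []
    return [dict(zip(header, row)) for row in data]


first_rows = preview(SAMPLE_LOG)
for row in first_rows:
    print(row)
```

Changing `separator` or `has_header` and re-running the preview is a quick way to see whether the columns line up as expected.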
Integrated drag-and-drop visualization modules help users understand their data and quickly build basic charts (bar charts, scatter plots, line charts…).
DSS's other strengths are its data formatting and transformation capabilities. The studio offers more than 50 “processors” that the user can apply sequentially to a dataset. They cover all the classic data transformations: filtering, cleaning, date joins, string manipulation, mathematical formulas, natural language processing, joins… A contextual menu gives quick access to the most commonly used data preparation functions. For example, it is possible to select a string and, in one click, replace all of its occurrences in a column.
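The idea of applying transformation steps one after another can be sketched in a few lines of plain Python. This is only an illustration of the principle, not how DSS implements it: the sample rows, the column names, and the helper functions (`strip_whitespace`, `replace_in_column`, `drop_empty`, `apply_steps`) are all hypothetical.

```python
# Hypothetical sample rows; column names are illustrative only.
rows = [
    {"city": " Paris ", "visits": "12"},
    {"city": "paris", "visits": ""},
    {"city": "Lyon", "visits": "7"},
]


def strip_whitespace(column):
    """Build a step that trims surrounding whitespace in `column`."""
    def step(data):
        return [{**r, column: r[column].strip()} for r in data]
    return step


def replace_in_column(column, old, new):
    """Build a step that replaces every occurrence of `old` in `column`."""
    def step(data):
        return [{**r, column: r[column].replace(old, new)} for r in data]
    return step


def drop_empty(column):
    """Build a step that filters out rows where `column` is empty."""
    def step(data):
        return [r for r in data if r[column] != ""]
    return step


def apply_steps(data, steps):
    """Apply each step in order; the input rows are left untouched."""
    for step in steps:
        data = step(data)
    return data


steps = [
    strip_whitespace("city"),
    replace_in_column("city", "paris", "Paris"),
    drop_empty("visits"),
]
cleaned = apply_steps(rows, steps)
print(cleaned)
```

Because each step returns new rows instead of modifying its input, the list of steps can be edited and replayed against the original data.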
The studio's ease of use translates directly into a noteworthy increase in productivity. But this ease of use in no way implies that its data transformation options are limited. On the contrary, thanks to programmable processors, users can create their own data enrichments and retain full control over the order in which transformations are applied.
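As a rough illustration of a user-defined enrichment and of why the ordering of steps matters, consider the plain-Python sketch below. It does not use DSS's processor API; the `email` column and the functions `lowercase_email`, `add_email_domain`, and `run` are assumptions invented for this example.

```python
def lowercase_email(row):
    """Standard-style step: normalize the email column to lowercase."""
    return {**row, "email": row["email"].lower()}


def add_email_domain(row):
    """Custom enrichment: derive a new `domain` column from `email`."""
    _, _, domain = row["email"].partition("@")
    return {**row, "domain": domain}


def run(row, steps):
    """Apply row-level steps in exactly the order given."""
    for step in steps:
        row = step(row)
    return row


record = {"email": "Ada@Example.COM"}  # hypothetical input row

# Normalizing first means the derived column is built from the clean value.
normalized_first = run(record, [lowercase_email, add_email_domain])
# Reversing the order derives the domain from the raw value instead.
enriched_first = run(record, [add_email_domain, lowercase_email])

print(normalized_first)
print(enriched_first)
```

The two runs produce different `domain` values from the same input, which is why explicit control over sequencing is useful once custom steps are mixed with standard ones.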
In their next blog post, Affini-Tech will dive into these different aspects and take a look at workflow management as well. Stay tuned!