As I mentioned in my previous post, I also had a chance to talk about what we've been up to over here at Dataiku during my stay in Berlin:
- Dataiku Flow, the next-generation data pipeline orchestrator
- dctc, our Swiss Army knife for remote file manipulation.
Flow, data pipelines made easy
Any real-life data analysis project is made of a large number of tasks, using various tools (Hadoop MapReduce, Hive, Pig, SQL, Python, R, NoSQL, ...).
Flow brings a whole new approach to the problem of orchestrating and managing these kinds of complex data pipelines.
- Automatic dependency management: stop worrying about which of your intermediate or target datasets are out of date and need to be recomputed. Flow manages dependencies automatically and recomputes datasets in the right order. It even does so with automatic parallelism to take advantage of your cluster's resources.
- Data quality checks: Flow not only checks that your pipeline executes correctly, it also helps you test that the data it produces is valid. Discover automatic tests for data pipelines!
- Advanced integration with many tools: Flow has native knowledge of many data science tools (Pig, Hive, IPython, Pandas & SciKit, SQL, ...) and integrates deeply with them, providing centralized schema management: declare your datasets only once, and manipulate them safely using any tool.
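To make the dependency management idea above more concrete, here is a minimal sketch in Python of how a rebuild plan could be derived from a dependency graph. This is not Flow's API or implementation: the function `rebuild_order`, the dataset names, and the `deps`/`stale` structures are all made up for illustration, and the only library used is `graphlib` from the Python standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter


def rebuild_order(deps, stale):
    """Return the datasets to recompute, in dependency order.

    deps  -- hypothetical mapping: dataset name -> set of datasets it is built from
    stale -- set of dataset names known to be out of date
    """
    # static_order() yields every dataset after all of its inputs.
    order = TopologicalSorter(deps).static_order()
    dirty = set(stale)
    plan = []
    for name in order:
        # Rebuild a dataset if it is stale itself or if any of its inputs is rebuilt.
        if name in dirty or dirty & set(deps.get(name, ())):
            dirty.add(name)
            plan.append(name)
    return plan


# Illustrative pipeline (names are assumptions, not taken from Flow).
deps = {
    "clean_logs": {"raw_logs"},
    "sessions": {"clean_logs"},
    "users": {"raw_users"},
    "report": {"sessions", "users"},
}

plan = rebuild_order(deps, stale={"clean_logs"})
print(plan)
```

In this sketch only the datasets downstream of the stale one end up in the plan, and each appears after its inputs. For parallel execution, `TopologicalSorter` also offers `prepare()`, `get_ready()`, and `done()`, which hand out all datasets whose inputs are already finished so that independent branches can be built at the same time.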
The people in the "Palais" room were clearly interested in Flow's approach, and several of them came up to me afterwards, eager to get their hands on Flow. We will make the first betas of Flow available quite soon. You can register to be notified as soon as something is ready.
dctc: Make the pain of data transfers go away
Important note: due to shifting priorities, we had to discontinue development of dctc.
dctc also garnered quite a lot of interest. dctc was a command-line tool to ease the manipulation and transfer of files across various storage types (Filesystem, HDFS, Amazon S3, Google Cloud Storage, FTP, SCP/SFTP).
The idea was to use dctc for listing, uploading, and downloading files, incremental synchronization, dispatching files into several files, in-cloud editing, and much more.