Dataiku Data Science Studio and MongoDB

Dataiku Product Florian Douetteau

There’s a new sheriff in town and his name is Mongo! MongoDB, that is. And this guy is doing things differently… no tables and no relational structure here. MongoDB is all about dynamic schemas, native language drivers, high performance, horizontal scalability, and high availability.

MongoDB

MongoDB is a NoSQL, document-oriented database that stores JSON-like documents with dynamic schemas. A document can hold arrays and other documents, so a single entity that would require multiple tables in a relational database can live in one record. MongoDB doesn’t use tables and doesn’t require a predefined schema; in effect, the application code defines the schema. In short, MongoDB’s data structure reduces complexity and adapts quickly to change.
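To make the document model concrete, here is a minimal sketch in plain Python (no MongoDB server or driver involved). The order records, field names, and the `order_total` helper are all hypothetical and exist only for illustration; the point is that one document embeds what a relational design would typically spread across an orders table, a customers table, and a line-items table.

```python
import json

# Hypothetical order record, invented for illustration only.
order = {
    "_id": 1001,
    # Embedded document: would usually be a separate "customers" table.
    "customer": {"name": "Ada Lovelace", "email": "ada@example.com"},
    # Embedded array of documents: would usually be a "line_items" table.
    "items": [
        {"sku": "A-1", "qty": 2, "price": 9.50},
        {"sku": "B-7", "qty": 1, "price": 20.00},
    ],
}

# Dynamic schema: a second record of the same kind can carry a field
# ("gift_note") that the first one lacks, with no migration step.
second_order = {
    "_id": 1002,
    "customer": {"name": "Alan Turing"},
    "items": [],
    "gift_note": "Happy birthday",
}


def order_total(doc):
    """Sum qty * price over the embedded line items of one document."""
    return sum(item["qty"] * item["price"] for item in doc["items"])


# The whole entity serializes as one JSON-like document.
print(json.dumps(order, indent=2))
print(order_total(order))
```

Because the line items travel with the order, computing a total needs no join; the trade-off is that the application, not the database, is responsible for keeping field names and shapes consistent.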

From a non-IT point of view, MongoDB is relatively easy to set up and, in many cases, can be much faster than traditional SQL databases. If you need a database that excels at retrieving a lot of immutable data while offering deep query capabilities, then MongoDB is a strong choice. In this scenario, MongoDB is well suited to transactional use cases.

It is important to note, however, that MongoDB is not a catch-all solution designed to put RDBMS solutions in their graves. There are many use cases in which traditional SQL-based databases excel, the most important of which (for our purposes) is data analytics. The choice of database is largely determined by need: one size does not fit all situations. For example, Dataiku users may prefer to use MongoDB as a big data source, but rely on a genuine analytical system (e.g., Hadoop, Spark, or an MPP database) for analytical processing.
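One way to picture that division of labor: nested documents are usually flattened into tabular rows before they reach an analytical engine. The sketch below is plain Python under stated assumptions, not a description of how Dataiku does it; the sample documents and the `flatten_order` helper are hypothetical.

```python
# Hypothetical documents, shaped like records read from a document store.
docs = [
    {
        "_id": 1001,
        "customer": {"name": "Ada Lovelace"},
        "items": [
            {"sku": "A-1", "qty": 2, "price": 9.50},
            {"sku": "B-7", "qty": 1, "price": 20.00},
        ],
    },
    {
        "_id": 1002,
        "customer": {"name": "Alan Turing"},
        "items": [{"sku": "A-1", "qty": 5, "price": 9.50}],
    },
]


def flatten_order(doc):
    """Yield one flat row per line item, repeating the order-level fields."""
    for item in doc.get("items", []):
        yield {
            "order_id": doc["_id"],
            "customer_name": doc.get("customer", {}).get("name"),
            "sku": item["sku"],
            "qty": item["qty"],
            "price": item["price"],
        }


# Flat rows like these map directly onto a table in an analytical system.
rows = [row for doc in docs for row in flatten_order(doc)]

# A simple aggregate that is natural on flat rows: units sold per SKU.
units_per_sku = {}
for row in rows:
    units_per_sku[row["sku"]] = units_per_sku.get(row["sku"], 0) + row["qty"]

print(units_per_sku)
```

The document store keeps the shape that suits serving individual records, while the flattened rows suit the scans and group-bys that analytical systems are built for.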

MongoDB does have a place, however, particularly for business sectors that require a database solution capable of low latency on a global scale; in other words, they need to serve up data fast. For example, Amazon reportedly loses 1% in sales for every tenth of a second of delay, and Google’s traffic reportedly drops by 20% for every 0.5-second delay. Given this, it should come as no surprise that the following organizations all run large-scale deployments of MongoDB:

  • Adobe: Uses MongoDB as the underlying database for its Experience Manager — it stores petabytes of data in large-scale content repositories;
  • Craigslist: Uses MongoDB to archive billions of records and support querying of archives at runtime;
  • LinkedIn: Uses MongoDB as its backend database;
  • Shutterfly: Uses MongoDB to store over 18 billion photos from over seven million users;
  • and many more, such as McAfee, eBay, Foursquare, and SAP.
