Modern Data Science: Monogamy or Ménage à Trois?

Scaling AI Lara Khanafer

What Does Monogamy Have To Do With Data Science?

Monogamy, oh what a word! Lust or restraint, property or freedom, single or plural, nature or nurture, all types of opposing concepts come to mind. So why in the h*ll are the words data and science coming to ruin all of this excitement?

Everything began a few weeks ago as my friends and I sat at a coffee shop discussing marriage, relationships, and… monogamy. That’s when the question we’ve all asked ourselves at least once came up: are human beings meant to date only one person at a time? Does this belief come from our nature or from what society expects of us?

As the afternoon went on, and as the conversation reached its climax, my friends and I came to the same conclusion: monogamy is not natural. Let me explain our reasoning: by nature, every species wants to produce the best possible offspring, and the odds of this happening only increase by multiplying partners. It really all comes down to basic science.

At this point, you’re still wondering how data science relates to any of this, aren’t you? Well, as I walked home later that day, my mind began to wander. I thought about the conversation I’d just had with my friends. I thought about the concept of monogamy. I thought about work (because hey, what working person doesn't?). And that’s when monogamy and data science collided. It’s not as crazy as it sounds - let me explain.

What I call monogamy in a technological environment is to remain faithful to only one development language. So yes, I know you’re thinking coding and being married (or in a relationship) are two completely different things. Personally, I don’t fully agree. I believe both are a question of love, hate, addiction, fatigue, and pride all mixed up together.

In your daily work, you may already be used to being unfaithful - i.e., using different technologies to fulfill different needs. In this case, congrats, you’ve made my point on your own and you can probably just skip to the last part of this article for best practices in technological infidelity.

As for the rest of you, keep on reading. In the following article, I’ll walk you through my field investigation into the polygamous lives of data scientists.

Monogamy: A Mismatch With Data Science Productivity

When managing a data science project that involves multiple people, some may feel that imposing one single tool for everyone is the most efficient approach. Why? Because it’s nice to have standardized code in one language, to ensure that code written by anyone is compatible with the rest of the codebase.

But very often, the perfect library to solve your problem is written in another language. In this case, it seems that the best approach is either to rewrite the whole thing in your favorite language or to write bindings to the other language's library. But, as most people who are still reading this article know, the latter not only costs you some performance, but can also leave you with bindings that are incomplete or out of date, making compilation and execution harder.

Ok, you get it: this is a perfect example of the ineffectiveness of technological monogamy.

So what can you do? What languages should you use for what specific use case? To answer these questions, I went and asked our team what they thought the main advantages of each language and tool are, and how you can make them work together. Because yes, exploring new possibilities, breaking down some barriers, and taking risks feels great. And empowering. Turns out, this isn't only true in life. It's true in data science as well.

Every Love Affair Has Its Perks: A Language for Every Need

R, the statistical computing language:

First of all, its statistical libraries are amazing:

  • There are many packages offering a lot of statistical tests.
  • There are loads of specialized packages for things like association rules, market basket analysis, etc.
  • Researchers in statistics publish their results in R, so those scholarly libraries are pretty cool and very complete.

And the perks don't stop there:

  • It’s a real statistics language, so it can handle a lot of columns with few lines of code.
  • It’s similar to MATLAB, which you may have learned during college.
  • It has built-in data frames.
  • It has beautiful package management.
  • The core community is active and helpful.
  • There’s a great package to see your data on maps (ggmap).
  • It’s built around vectorized operations. So explicit loops are slow, but data manipulation can be fast if you work on whole vectors and data frames at once.
  • Finally, it makes time series easy to handle.

Python, for data science MacGyvers:

  • It’s a scripting language - in other words, a toolbox language that enables developers to do lots of things very quickly.
  • It’s also a real general-purpose programming language, which makes it useful for all parts of a data science project: statistics, text and log parsing, data preparation, data visualization, and web development, all of which can come together smoothly in a Jupyter Notebook.
  • It’s very good at interacting with the web (for scraping, for instance) and at processing audio signals or text.
  • It makes building web APIs easy.
  • It makes package management even easier, including juggling several environments.
  • There’s a huge Python community!
  • It’s well integrated with your machine's operating system, so scripts are easy to launch from the command line.
  • You can use Python to write loops, even if they are as slow as in R.
  • Packages like Pandas and scikit-learn are used by everyone and can help you get started quickly in data science.
  • scikit-learn's API has set a standard that many other libraries follow, including XGBoost.
  • TensorFlow allows you to try out deep learning on a lower level (lists, sets, dictionaries, etc.).
  • It allows task automation with a clearer syntax than bash.
  • It’s fast to learn, intuitive, and easy to pick up.
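
To make the point about scikit-learn's API a bit more concrete, here is a minimal sketch of the fit/predict convention that other libraries mirror. The `MeanRegressor` class and the `evaluate` helper are made up for illustration (they are not scikit-learn code), and the numbers are toy data. The idea is simply that any object exposing `fit` and `predict` can be dropped into the same pipeline.

```python
from statistics import mean


class MeanRegressor:
    """Toy estimator that always predicts the mean of the training targets.

    Hypothetical class for illustration only: it mimics the fit/predict
    convention and is not part of scikit-learn.
    """

    def fit(self, X, y):
        # "Training" here just means remembering the average target value.
        self.mean_ = mean(y)
        return self  # returning self allows model.fit(X, y).predict(X)

    def predict(self, X):
        # One prediction per input row, all equal to the learned mean.
        return [self.mean_ for _ in X]


def evaluate(model, X_train, y_train, X_test):
    # Works with any object that exposes fit() and predict().
    return model.fit(X_train, y_train).predict(X_test)


# Toy data, invented for the example.
predictions = evaluate(
    MeanRegressor(), [[1], [2], [3]], [10.0, 20.0, 30.0], [[4], [5]]
)
print(predictions)
```

Because `evaluate` only relies on those two methods, swapping in another model that follows the same convention should not require touching the rest of the code.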

SQL: The first, the last, the everything?

  • Everyone - ok not everyone, but a lot of people - knows it.
  • A lot of operations are expressed very naturally, like advanced joins, grouping, and window functions.
  • Its implementation in Hive, Impala, and SparkSQL brings it to the top of the list for data preparation.
  • It’s ideal for non-procedural actions like transforming a table into another table.
  • It’s pretty much just the standard way to ask a database anything.
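
As a rough illustration of "transforming a table into another table," here is a small sketch that drives SQL from Python using the standard library's `sqlite3` module. The `orders` and `customer_totals` tables and their contents are invented for the example, and SQLite is only a stand-in for whatever database or SQL engine you actually use.

```python
import sqlite3

# In-memory database with a made-up "orders" table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10.0), ("alice", 15.0), ("bob", 7.5)],
)

# Transform one table into another: one row per customer, described
# declaratively rather than with an explicit loop.
conn.execute(
    """
    CREATE TABLE customer_totals AS
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    """
)

rows = conn.execute(
    "SELECT customer, n_orders, total FROM customer_totals ORDER BY customer"
).fetchall()
print(rows)
conn.close()
```

Python handles the plumbing here while SQL does the data preparation, which is the kind of language mixing this article argues for.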

Julia/MATLAB, a scientist’s first love

  • They’re mainly designed for scientists, especially astrophysicists or signal processing specialists, even though many research laboratories use them for machine learning.
  • Their main advantage is matrix computation (which is available in Python too, with NumPy and SciPy).
  • There are a lot of functions for image manipulation.

Scala, the hipster data scientist’s language of choice

  • It’s the native language for Spark, even if you can also use SparkR, PySpark, and SparkSQL. It remains a favorite for Spark aficionados.
  • It’s useful for understanding Spark error messages, which look super weird to non-Scalaers.
  • Apart from the fact that it’s cool, it compiles to JVM bytecode, which makes it as efficient as Java while being way more concise.
  • It allows functional programming, a paradigm in which many problems are natural to express.
  • It’s intellectually rich and complicated which makes it EXCITING (i.e., it's a great way to differentiate yourself at your job).

Making the Technological "Ménage à Trois" Work

This is all well and good, but you wouldn’t build an Ikea shelf with the wrong screwdriver, and the same goes for a data project! None of these languages is right for every single thing you want to do. You need to switch screwdrivers when you change projects (or at least make sure the one you’re using is still the best).

However, even if you can switch languages or tools, when you’re working on a data science project, the challenge is getting people to work together, each with their own languages. You need a common environment where you can share information about the projects, see what everyone else is doing, and understand how the project is moving along.

When you do, multilingual data projects are a real plus. Some of our data scientists won a Kaggle competition thanks to language mixing. They were also working in three different cities and on three different schedules. How did they do it? They used Dataiku. Why? Because:

  • Even with different mindsets and different languages, 1 common goal = WIN.
  • When one language fell short, they could switch to the others to keep making progress.
  • They wanted to parallelize feature calculations and compute lots of engineered features at the same time in SQL.
  • Finally, they were able to blend their models in R and Python, compare results, and therefore select the best submission.
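
The last point, blending models and comparing results, can be sketched in a few lines. This is only an illustration under assumptions: the prediction lists stand in for the outputs of an R model and a Python model, the numbers are made up to show the mechanics, and the simple weighted average and RMSE metric are choices made for this example, not a description of what the team actually did.

```python
from math import sqrt


def rmse(y_true, y_pred):
    # Root mean squared error between targets and predictions.
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def blend(pred_a, pred_b, weight_a=0.5):
    # Weighted average of two models' predictions, row by row.
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(pred_a, pred_b)]


# Made-up validation targets and predictions (illustrative only).
y_valid = [3.0, 5.0, 7.0]
preds_r = [2.0, 6.0, 7.0]       # stand-in for a model trained in R
preds_python = [4.0, 4.0, 7.0]  # stand-in for a model trained in Python

candidates = {
    "r_model": preds_r,
    "python_model": preds_python,
    "blend": blend(preds_r, preds_python),
}

# Score every candidate on the same validation set and keep the best one.
scores = {name: rmse(y_valid, preds) for name, preds in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

In this toy setup the two models err in opposite directions, so the blend comes out ahead. On real data there is no such guarantee, which is exactly why comparing every candidate on the same validation set matters.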

Doesn’t That Sound Beautiful? Yes. Yes, it does.

Back to our monogamy issues. The problem with most immoral activities is usually the planning and organization that go around them. The real reason it’s so scary to cheat on a partner is that it requires a lot of extra organization and careful planning. For data science polygamy to succeed, you need to make sure that every member of your team - or partnership - can communicate, collaborate, and plan with one another. Same goes for polyamorous relationships. Just saying.

So how exactly is this technological "ménage à trois" supposed to work? It all comes down to one concept: collaboration. By this point, I hope I’ve convinced you that using various languages in data science projects is a good (the best) way to:

  • Be more efficient
  • Maintain a team of happy data scientists and analysts who enjoy their freedom and thus continue to express and improve their individual skill sets
  • Discover new possibilities to have the best possible prediction/clustering models
  • Design, build, and actually deploy data products efficiently

So now you're thinking: “Ok, great, this girl has just sold me some sort of utopian data science environment. But in reality, complete technological polygamy in a unified environment is impossible. I know because I've tried.” Well, you are wrong. It is possible! That’s right: at Dataiku, we don't only believe in bringing data, technology, and people together, no matter their level of expertise or skill set. We actually do it, one deployed and running data product at a time. At Dataiku, we believe that teams can only be effective if all members come as they are.
