Got questions about implementing Dataiku or the technology behind it? Clément Stenac, co-founder and CTO here at Dataiku, has answers.
Stéphanie Hertrich, Cloud Architect at Microsoft, recently interviewed Clément (her original article is available here in French) about a range of topics. We previously shared some of his broader insights on enterprise data teams, tools, and getting value from data.
Here, we’ve summarized excerpts of the conversation pertaining specifically to Dataiku, its technology, and implementation.
Stéphanie Hertrich, SH: On the topic of scalable and available services, is Dataiku a SaaS solution?
Clément Stenac, CS: Dataiku is software that users download and install on their own infrastructure. For many customers, that means the cloud; for others, it's still their own data center (it’s about 50/50).
SH: How can you ensure availability of the service when you don’t have control of the underlying infrastructure?
CS: It's not magic! It's up to the client to deploy multiple instances, which we support by scaling out and adding new nodes. Convincing our customers to trust us with hosting their data would be complicated, especially since we’re focused on large, international enterprises.
Dataiku service availability: Magic, it is not!
But on top of that, there are also underlying technical issues; for example, when it comes to processing as close to the data as possible, SaaS is not a good solution. On the other hand, this works well in the cloud - we integrate with AWS, Microsoft Azure, and GCP via their managed Hadoop solutions. This solution allows customers who are in the cloud to get started super easily.
SH: Looking at the example of Microsoft Azure and its Hadoop/Spark-managed solution HDInsight, how can this be used with Dataiku?
CS: This is a unique case; the Hadoop-managed offer on Azure has Dataiku installed and ready to use in the cluster. In fact, it’s a simple check option when provisioning the HDInsight cluster on the portal. Dataiku has been validated by Microsoft as a solution compatible with HDInsight, and integration is done automatically.
Editor’s note: You can read more about the Dataiku and Microsoft HDInsight integration here.
SH: On which nodes of the Hadoop/Spark cluster is Dataiku deployed?
CS: Dataiku is installed on an edge node (not a "work" node), and each job is processed on the cluster and its compute nodes. One of the great strengths of Hadoop’s YARN component is the ability to deploy processing without having to install anything specific on the nodes themselves.
SH: How do you build and implement Dataiku?
CS: The primary challenge is that we’re not SaaS - that brings a number of difficulties, namely around scalability, control, and diversity. Of those, diversity is perhaps the most challenging.
We integrate with five different Hadoop distributions, around 15 databases, and four Linux distributions - there’s a lot of variability there. Hadoop evolves very quickly, and often, the paint isn’t even dry - it has lots of components developed by different teams at different companies that are often in competition with each other. Every distribution has made different choices and fixed different bugs - sometimes the versions don’t even match after a patch. All of this to say that when we arrive at a client’s, Hadoop is a bit like a giant airplane control panel where we’re not sure which buttons have been pressed.
"We integrate with five different Hadoop distributions, around 15 databases, and four Linux distributions - there’s a lot of variability..."
This means we have to test on an unbelievable number of different development environments. Our developers can easily provision different environments on demand, which we’re really proud of. We rely on containers, virtual machines, and the cloud. We don’t host a production environment, but we do run lots of test infrastructure, mainly hosted with OVH.
For workloads where we need more flexibility, we are on Microsoft Azure and AWS. There isn’t really a strong reason for this - historically we were on AWS, but now it’s more Azure, particularly for our training needs. When training clients, everyone starts the same exercise at the same time, so we have consumption peaks. That actually requires much more elasticity than any of our own internal needs.
SH: What portion of your clients already know Hadoop?
CS: There’s a huge difference between the French and American markets - for example, in the United States, 100 percent of our clients already know Hadoop. Two or three years ago in France, people were barely discovering big data and beginning to experiment with Hadoop, so in that market, we make lots more recommendations on architecture.
We arrived later in the United States than in France, and the approaches are still quite different. In the United States, clients tend to explain their existing architecture and want to know if it’s supported. Whereas in France, the focus is more on getting value from business data from the perspective of a few particular use cases.
SH: What technologies does Dataiku use?
CS: We are generally guided by the deployment constraints and challenges of our clients mentioned above, so we have to keep our product as simple as possible.
Given the complexity of our customers' systems, we have to keep Dataiku as simple as possible.
Our architecture is multi-process but also monolithic in the sense that it’s self-contained: the solution embeds everything it needs, including its databases, SQLite and H2. We code primarily in Java, one of the principal languages of big data, which is considered a good compromise between performance and productivity (we never wanted our core written in Python or Ruby, for example).
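As a rough illustration of what an embedded, zero-administration database buys you, here is how an application might keep job metadata in SQLite. This is a hypothetical sketch - Dataiku’s actual schema isn’t public, and its core is Java; Python is used here only for brevity:

```python
import sqlite3

# An embedded database needs no separate server process: the application
# simply opens a file (or, for this sketch, an in-memory database).
conn = sqlite3.connect(":memory:")

# Hypothetical metadata table - not Dataiku's real schema.
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT, status TEXT)"
)
conn.execute(
    "INSERT INTO jobs (name, status) VALUES (?, ?)",
    ("build_dataset", "DONE"),
)
conn.commit()

row = conn.execute("SELECT name, status FROM jobs").fetchone()
```

Because the database ships inside the product, the customer never has to install or operate a database server - which matters when you can't control the target infrastructure.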
On the backend, you’ll find a web server that does job scheduling, storage and management of metadata, and search indexing. We also have some Python and R processes as well as, obviously, Spark processes. And on the front end, we use a single-page application (SPA) in AngularJS.
SH: How do you manage inter-process communication?
CS: Right now, it’s synchronous with HTTP APIs, but with a poll-based overlay that allows you to use them asynchronously. We’ve thought about moving to a message bus, but we would have to embed it, which isn’t simple: you have to manage and secure the entire chain around it.
For us, adding technology into the stack is always a tradeoff between what we’ll gain and what might cause problems. That brings us to our second biggest challenge when it comes to our customers: what happens when things don’t work?
Well, we can’t connect to the customer’s machine to see logs as you could with a SaaS solution, so we’ve developed lots of diagnostic tools and safety checks. We have a big “generate report” button in the UI that generates a Zip with all the contextual information we need to understand what happened and what bug it triggered (or which strange button the customer pressed in Hadoop). Often, the customer who installed Dataiku is not the one using Hadoop, so diagnostics can be a bit complicated; semi-automating debugging has been a huge development investment on our part.
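A “generate report” feature of this kind boils down to bundling logs and contextual metadata into one archive the customer can send over. Here is a hypothetical sketch of that idea - the file names and context fields are assumptions, not Dataiku’s actual report format:

```python
import io
import json
import zipfile

def build_diagnostic_zip(logs: dict, context: dict) -> bytes:
    """Bundle log files and contextual metadata into a single zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # Machine-readable context: version, OS, configuration, etc.
        zf.writestr("context.json", json.dumps(context, indent=2))
        # One entry per collected log file.
        for name, content in logs.items():
            zf.writestr(f"logs/{name}", content)
    return buf.getvalue()

report = build_diagnostic_zip(
    {"backend.log": "job 42 failed: ..."},          # illustrative content
    {"version": "x.y", "os": "linux"},              # illustrative context
)
names = zipfile.ZipFile(io.BytesIO(report)).namelist()
```

Packing everything into one artifact means support never has to ask the customer to hunt down individual files on a machine the vendor can’t reach.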
SH: What tools do you use for continuous integration?
CS: For integration, we use Jenkins. For tests (apart from the unit tests specific to each language), we use Python and also Selenium for UI tests. In addition, we use Sonar for code quality as well as Github for source and issue tracking - it’s a pretty classic setup, actually.
SH: What is the breakdown of the technical profiles on your team?
CS: Dataiku’s development team is made up of 20 people: 10 developers plus QA, PM, UX/UI, and DevOps. It’s a rather small team, actually. The sales and service teams are much larger, and they include very technical pre-sales staff. And it’s generally the data science team that works on onboarding, enablement, and training with our customers. So our development team isn’t the one primarily involved in or responsible for that part, although we do assist at times with onboarding and technical expertise.
SH: Who makes product and roadmap decisions?
CS: We have a product management team, so it’s usually a joint decision between them and me. Customer feedback drives a lot of our decisions - of course, we sell a subscription-based service, so renewal is very important to us.
We’re also building up our customer success team, which interacts regularly with our customers to discover their needs for everything from larger, more strategic features to smaller things like misplaced buttons (it’s often little frustrations that can prevent widespread adoption).
Many roadmap decisions are customer-driven - small frustrations can add up over time, and we try to address those needs.
And of course, we try to anticipate technologies that our customers will adopt - for example, Apache Kudu, which is gaining momentum and which we’ll be integrating soon.
SH: What are you most proud of?
CS: The quality of support we provide to our customers. Because of the variety and complexity described above, things don’t always work. But our support always rises to the occasion, and it leaves our customers satisfied.