Thanks to our marketing team, we Dataikers have been travelling around a lot lately to attend Big Data events and tradeshows. Recently, we were at the Hadoop Summit in Dublin and just last month spent a few days at Strata London (see our cool booth below). Here are a few trends and insights I thought worth sharing after listening to talks and speaking with existing or prospective clients, big data experts, and other big data software vendors.
The Hype Around Streaming Analytics
The night before Hadoop Summit in Dublin, I attended the Hadoop User Group Ireland, hosted by our partner Sonra Intelligence. There I got a deep dive into the new(er) streaming analytics frameworks such as Apache Storm or Apache Kafka that are gaining a lot of traction thanks to rapidly maturing products and a host of real-world use cases, from real-time fraud detection to complex event processing or cybersecurity.
This was confirmed throughout the various talks that were held during the events: with the explosion of data streams (the “sessionization” of the world), evolving business challenges and new customer expectations for instant responses, streaming analytics are becoming a must-have in a data analytics stack.
Amongst companies currently using streaming analytics: Bouygues Telecom, Zalando, Otto Group, Yahoo, Capital One...
The Challenge: Less About the Platform, More About the Applications
This is something we often hear of course, but as many organisations are completing their journey from traditional data architectures to Data Lakes (or swamps for the unlucky ones) and finding ROIs that fall short of their expectations, there is an intense focus on leveraging big data architectures to deliver business value.
Even if the data science use cases in various industries are starting to be well defined, and companies can learn from the early adopters’ successes and failures, there is still a major challenge in being able to build and deploy business-oriented applications based on data.
One common mistake is that Big Data innovation is still too often driven by the IT department, with too little involvement from business stakeholders. This usually leads to over-complex solutions without clearly defined metrics to assess the business impact of the solution being developed. To be fair, this is often inherited from situations where business units are not mature enough in their understanding of the new possibilities around data analytics, or see it as a complex and tech-first issue.
Two recommendations are often presented to overcome this problem:
Involve business users at every step of the data science project (see below)
Implement agile and iterative development cycles to control the fit with business needs
The Focus: Building Data-driven Organizations
Even at such tech-focused events, the talks that garnered the most attention revolved around the organization of data teams and departments - or how to structure collaboration between different profiles in a Data Lab in order to maximize their efficiency. This reflects the current focus on bringing data-driven decision making and fostering innovation through data in every layer and department of a company.
In this regard, building a (multidisciplinary) Data Lab or Data Department is only one part of the equation - managers also need to develop a self-service analytics approach, empowering analysts and BI teams to explore, combine and analyze complex data sources, ultimately sharing insights and opportunities for improvement throughout the company.
The talks also offered interesting insights on how collaboration between different profiles was key to a data team’s efficiency (and we’ve been convinced of that for a long time at Dataiku). Essentially, the key metric for efficiency should be a mix between time-to-market and usability (i.e. how well the solution fits the initial business need). Thus, it is key to have business representatives that will ensure fit with business needs, data scientists that will translate that need into algorithms, and developers that will translate those algorithms into applications or data products.
In case you missed it, you can check out this cool talk by Anne-Sophie Roessler on the challenges of building a Data Lab.
Hadoop is Dead, Long Live Apache (and Open Source)
During the 2 days at Hadoop Summit, I heard about the following big data related Apache projects: Flink, Drill, Zeppelin, Eagle, Helium, Falcon, Atlas, Kylin, Phoenix, Ranger, Flume, Airavata, Beam - with some of them being hailed as “the next big thing”. While some were designed to overcome some of Hadoop’s limitations (batch only, MapReduce bottleneck, security…), others are addressing new areas or challenges. While I did not always understand the exact value proposition of these products, it seems clear that a new wave of innovative frameworks for working efficiently with data at scale is coming from the Apache Software Foundation. This raises serious questions for companies that are having trouble seeing clearly through the big data technologies ecosystem and keeping up with rapidly evolving products and standards.
On the subject of open source, it was very interesting to hear a testimony by IBM called “Surviving the Hadoop Revolution” that outlined the challenges they faced when transitioning from a 100% commercial model to a hybrid model (IBM has open sourced several of its business applications and has become a very active contributor to open source projects such as Spark, OpenStack, Cloud Foundry, Node.js, Linux, and Eclipse).
In any event, the world of open source is a key driver behind the evolution and adoption of big data technologies, and with every future data scientist learning R, Python, or Spark in college, software vendors have much to gain by leveraging those rapidly evolving frameworks in their products.
From Data Lab to Data Product
In the past years we’ve seen our clients and other companies address the challenge of building data lab environments, usually comprised of a sandbox big data environment and multidisciplinary teams. Their role is, for the most part, to explore, prototype and (when possible) test new data analytics use cases in real life conditions. While some are still struggling to get their Data Lab (or Data Departments) up and running efficiently, some of the more mature organizations have succeeded in building Data Labs that churn out promising prototypes at impressive rates.
However, a key challenge remains in being able to deploy those prototypes into production, so as to directly impact the operational processes of the company, thereby generating maximum value. Indeed, connecting the prototyping / design world with the production / run world in an efficient manner is still a cause for headaches in many organizations that are struggling to find the right tools, processes and governance.
There was a notable focus on using Spark to deploy data science workflows from prototype to operations. Both Cloudera and Hortonworks made presentations arguing that Spark was the new frontier for running advanced data wrangling and machine learning pipelines at scale.
More on that subject in our whitepaper: Data Science: Getting From Design to Production
Last but not least, in the interest of being data driven (also, because our blog posts don’t get published otherwise), I made these quick charts about the companies and speakers that enlightened us with their wisdom :)
And this boxplot showing the distribution of ratings for the different keywords (tags) associated with the talks - ok I could have made an effort here :)
Disclaimer: this was done on very, very small data :)