I’ve been part of the Dataiku sales team for a few months now. Most of the deals I have closed and most of the prospects I have talked to are in the banking and insurance industries. During our discussions, and depending on the people I speak with, similar stakes and questions often arise. Here are a few of the most significant and frequent ones.
What is Really New with Data Science Compared to Data Mining and Statistical Studies?
Let’s start with a few definitions:
Data science - this field breaks into a number of different areas, from constructing and configuring big data infrastructures to performing analyses and creating the right transformations that lead to consumable business results.
Data mining - an analytic process designed to explore data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns.
Now, let’s dig deeper. When talking with marketing, CRM, risk, or actuarial teams, all experienced in statistics or data mining, one recurrent concern is how (and if) data science will affect the type of analysis they’ve been accustomed to for years.
Is data science just another buzzword that makes these classical analytics sexier? Or, quite the opposite, is data science such a breakthrough that it threatens what they’ve always used and known?
Obviously, the answer is somewhere in between. Data mining and data science share the same purposes, i.e. to have better knowledge, understanding, and anticipation of an activity through models relying on data. Marketing analysts and data miners have not waited for data scientists to come along in order to make predictions based on these models! Nonetheless, people I meet usually take an interest in data science when addressing new subjects that require the manipulation of new data sources, which often challenge historical tools and methods.
But What is Being Challenged and How?
- Size of data sources: for instance, when exploiting data like web logs or transaction logs to enrich the knowledge of customer behavior
- Use of non-structured data: for example, using telematics data to develop new personalized insurance products such as "pay as you drive" car insurance contracts
- Crossing heterogeneous data: for instance, mixing CRM data with social and logs data in order to predict a significant life event (buying a car or a house).
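To make the last point concrete, here is a minimal sketch of crossing CRM data with log data, using pandas. All names and values are hypothetical, and the "signal" is deliberately simplistic; a real project would involve far richer features and a proper model.

```python
import pandas as pd

# Hypothetical CRM records (fields are illustrative only)
crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, 51, 28],
    "has_car_loan": [False, True, False],
})

# Hypothetical web-log events for the same customers
logs = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "page": ["home", "car-loans", "savings", "car-loans", "car-loans", "home"],
})

# Count car-loan page views per customer, then join them onto the CRM view
car_views = (
    logs[logs["page"] == "car-loans"]
    .groupby("customer_id")
    .size()
    .rename("car_loan_views")
)
enriched = crm.join(car_views, on="customer_id").fillna({"car_loan_views": 0})

# Naive signal: customers without a car loan who browse car-loan pages
prospects = enriched[(~enriched["has_car_loan"]) & (enriched["car_loan_views"] > 0)]
print(prospects["customer_id"].tolist())  # → [1, 3]
```

The point is not the join itself but the enrichment: behavioral data that never lived in the CRM becomes a feature attached to each customer record.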
So what are the differences and benefits of data science compared to data mining? From the discussions I’ve had, here are a few pointers:
First, data science technologies can natively handle large, unstructured, and heterogeneous data sources, whereas data mining tools are usually not built for such uses and need - at a minimum, and when possible - to be upgraded and adapted.
Second, to work with these new types of data - dirtier than before – one must take advantage of the exploratory, iterative, and hacker-minded approach on which data science relies.
Third, machine learning algorithms implemented in modern data science frameworks are generally better adapted to the prediction or segmentation problems at stake. Indeed, in such cases, deploying a first workable model quickly, then improving it through recurrent training, new feature building, and variable optimization, is often far more efficient than trying to design the most statistically accurate model in one try.
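The iterate-and-improve loop described above can be sketched with scikit-learn. This is an assumption-laden toy: the dataset is synthetic, the "new features" are just squared terms, and nothing here reflects a real banking use case — only the pattern of shipping a baseline first and refining it afterwards.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a churn/risk dataset (purely illustrative)
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Iteration 1: ship a quick baseline model
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
base_acc = accuracy_score(y_test, baseline.predict(X_test))

# Iteration 2: add simple engineered features (squared terms) and refit
X_train2 = np.hstack([X_train, X_train ** 2])
X_test2 = np.hstack([X_test, X_test ** 2])
improved = LogisticRegression(max_iter=1000).fit(X_train2, y_train)
imp_acc = accuracy_score(y_test, improved.predict(X_test2))

print(f"baseline={base_acc:.3f}, with new features={imp_acc:.3f}")
```

In practice each iteration would also revisit the training data and the evaluation metric, but the workflow stays the same: a deployed baseline gives a measurable reference that each new feature or retraining must beat.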
Thus, in the short term, I am not among those who believe that data science should replace data mining and statistical studies in the banking and insurance industries. Data mining and statistical studies are often linked to "marketing factories" to emphasize their industrial role in the daily, structured delivery of scores.
On the other hand, data science is usually likened to a "lab" to accentuate its R&D and exploratory dimensions. The designations "marketing factory" and "lab" are in themselves a good illustration of the fact that people do not imagine replacing working factories with experimental labs. That said, my customers are aware that these data science "labs" will be where new and efficient answers are found to the emerging, innovative, and strategic projects that will drive their competitive advantages.
Such advantages range from having the best possible knowledge of customers in order to engage with them in the most individual and appropriate way, to launching new data-oriented products priced individually on the basis of real use, to detecting risks and fraud, which represent monumental sums of money, better and more quickly.
Will I Need a New Infrastructure for Data Science, in Addition to an Already Extensive One?
Banking and insurance are among the industries with the most extensive legacy in their information systems. A significant part of their strategic activities has always run on mainframe systems, in an architecture completed by other layers over the years. So, when in conversation with IT managers, I am frequently asked: “Is it better for me to add a new layer, or to replace or improve an existing one?” As a non-technical guy, I’m not always comfortable giving advice on questions that are a) hugely impactful and b) specific to each organization. Nonetheless, here are a few thoughts from the discussions I’ve had.
We regularly remind our customers that data science does not need a big data infrastructure to generate value from their data. Nonetheless, 99% of the discussions I’ve had in the banking industry involve implementing big data infrastructures (this isn't the case in some other industries).
The bad news is that such implementations usually involve an additional layer to the complex existing information system. The good news is that some organizations have made this challenge an opportunity to make their information system more relevant to address new businesses goals.
Those successful organizations I’ve had discussions with seem to adopt similar practices.
First, they design a big data infrastructure by focusing on the concrete answers it should bring to a representative range of business stakes. This approach generally relies on an experimental and iterative implementation, and is often very different from IT projects conducted over many years, like BI or ERP projects.
With this approach, it is possible to begin data science projects quickly and to experience a primary benefit: ending the siloed management of data, a problem shared by almost every bank and insurance company, by giving analysts quick access to numerous sources of existing data they could not access beforehand.
Other organizations live with an efficient cohabitation of the existing BI infrastructure for structured data and a big data infrastructure that enables lower-cost storage and processing of data. Also, easier access to more heterogeneous data usually leads business teams to greater autonomy when it comes to crossing and processing the data on their own, without depending on IT teams or setting up a “pirate” parallel cluster.
As with the discussions on data science vs. data mining, my opinion for the short term is that big data infrastructures will complement, then optimize, the use of other information system layers rather than replace them altogether.
Will I Have to Hire Armies of Data Scientists?
Hiring a completely new team of data scientists is a common discussion I have with our customers, typically with the strategy, innovation, or digital transformation teams of banking and insurance companies.
We see that this recruitment concern was more significant a year ago than it is today, and that the approach is evolving - maybe due to greater maturity in the field of data science. What stands out is that data science teams are better off when they combine existing internal skill sets with external ones and/or newcomers. Thus, instead of hiring data scientists alone, it is at the team level (business understanding, statistics, software development…) that organizations truly begin to feel the value of data science for their business.
This is particularly true in my discussions with banks and insurance companies that have had, for many years, strong internal statistics skills across various teams (marketing, risk, actuarial…) as well as significant use of external skill sets such as consultants and integrators.
Thus, from experience, I suggest that banks and insurance companies in particular start with data science by setting up “data lab” teams or projects, usually grouping IT and business people (notably data miners and analysts) who cover data processing and statistical skills, and completing this team with profiles that have specific data science development skills (like R or Python programming).
By getting different people to work together within the lab, the resulting collective intelligence and value is far superior to that of simply having multiple data scientists per se.
At Dataiku, it is always a pleasure to see the strong interest Data Science Studio generates in the financial industry, especially when it comes to strategic aspects of the business.
So how can I explain that a one-year-old solution generates such interest in an industry well known for its difficulty in implementing innovative solutions? Maybe because it answers the following concerns that come up so often:
- How can I easily address the new business goals that I cannot address with my existing Data Mining or Statistical tools?
- How will analysts be able to get started with the new big data infrastructure I set up for them, especially considering the numerous, difficult-to-use big data technologies involved?
- How will I get the new (or growing) data team to speak the same language and really collaborate efficiently, when the data miners do not understand what the R developers are doing (and vice versa)?
I believe the initial convictions of Dataiku’s founding team answer these common concerns of banking and insurance companies quite well. Data Science Studio was born from the shared belief that the data science market lacked a solution that makes it possible for every organization to benefit from the high potential of data science technologies and algorithms, by minimizing both the skill gap required to get involved and the difficulty of comprehending such a rich and complex technical ecosystem.
If you are asking yourself some of the above questions, please reach out to me.
And finally, if you’re currently in London, don’t miss the Innovate Finance event on 16th June, where Kurt Muehmel, Dataiku Business Engineer, will moderate the talk "Bigger Data & the Internet of Things - How Is Innovation Disrupting Insurers’ Understanding of Risk?"