Robert Dempsey is that cool, data-driven developer/marketer/explorer that everyone (or almost everyone) wants to be. In part two of the talk I had with him in September, he had plenty of advice for people getting started in data science.
Robert Dempsey (RD) is that guy who loves getting dirty in code and is looking for solutions constantly but who also enjoys explaining what he does to business people and believes that data should be accessible to everyone. He got into data science from the business end by working on analytics-driven marketing in his software development company.
Now he’s a consultant for ARPC and is constantly looking for ways to make his team more collaborative and efficient. I was very lucky to talk to him about his work process in a previous interview. He discussed how his team works on data science projects and how Dataiku helps collaborate on these projects by getting various standalone languages and technologies to work together. Today, he has things to say to aspiring data scientists out there.
AS: Hello Robert, what's your advice for people who are beginning in data science and wondering which technologies to look at, which language to learn (R or Python?), or which tool to start with?
RD: A lot of the technology is becoming a lot easier to set up and more powerful these days. Take Spark for example. I can run Spark, and I have, on a little cluster at my house with 3 different laptops and a RaspberryPi. It’s becoming so much easier to set up these systems and to do these things. That’s also why more people are doing them!
However, for people starting out it’s still difficult to know where to start because there are so many resources. To make it simple - there are two languages leading the day: it’s either Python or R. You do have to pick one and people argue both. I’m a Python guy all the way and I’ve seen R but I’m pretty meh. Some people say you must know R if you’re going to be doing data science, but I’ve always gotten along just fine with Python.
Learning one of those languages is definitely the way to go. You do have to learn one before you’re able to do anything solid and put custom work into production. Outside of that, I would say stay away from them as much as humanly possible. Keep within the tools that make the work much easier, tools like Dataiku.
Luckily there's a trend in data science today, and companies like Dataiku are creating tools that make data science a lot easier and more approachable for beginners. That was one of the things that led me to Dataiku.
For a long time, to do data science, you either had to be good at Python development or good at R or have a degree in statistics. A lot of people out there who are doing data scienc-y things at their jobs or trying to learn to do data analysis have no time to go back and learn how to do statistics (or even care to). Our business users for example are not going to go back to school to learn stats. They need to use these new tools that make certain aspects of data science much more approachable and not quite so mystical.
In Dataiku I can do all of my data wrangling in a visual format without any code. Dataiku makes it really easy to apply work that I’ve created on one file to another file with the same file format. I can do all sorts of things with Dataiku that, if I had to do them in code, would be an incredible amount of quite annoying work and very time consuming.
AS: Can you give us an example of how you work with both tools like Dataiku and languages like R and Python?
RD: Sure. For example, I designed my first predictive algorithm with the help of Dataiku. I wanted to do a scoring system. I had lots of training data that I could use a supervised training method on, so I thought I’d try and figure out how to do predictive modeling.
I’m already a pretty good developer, but I was not going to go back to school to learn stats. So I did my research. What I found was that I spent hours and hours (read: weeks) reading blog posts and books before I even got to writing my algorithm. And then when I tested it I was getting some weird results. This was the first model I’d ever created, so I was thinking I might be doing it right, but maybe not.
I fed all my data to Dataiku and looked at what it spat out, and I noticed it was ignoring part of my data. I then updated my model from what Dataiku had done, tested it on additional data, and then put that into production. I’ve been using the predictive model ever since, so that’s worked out really well!
When I showed the people I work with the modeling and what Dataiku had done, they said, "How does Dataiku even do that?" I told them, "Well, they have a team of data science people who know what they’re doing, so they built all of this for us!"
I ultimately learned what that was called - factor analysis - but Dataiku did that before I even knew what it was so I didn’t have to do it myself. It makes the whole process a lot easier.
Of course I’m a developer and I wanted to put it into production, so I couldn’t do everything in Dataiku, I had to get my hands dirty. Thanks to Dataiku, I did enough to figure out what I was doing wrong and then implement that in the production system. It’s a great learning tool!
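The workflow Robert describes here (train a supervised scoring model on labeled data, check it on additional data, and look at which inputs the model actually relies on) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic, the feature layout and function names are made up, and the technique is a plain logistic regression fitted by gradient descent, not Robert's model and not whatever Dataiku does internally. It uses only the standard library so it runs anywhere; in practice a library such as scikit-learn would handle the fitting.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def make_data(n, seed):
    """Synthetic (made-up) records: two informative features, one pure-noise feature."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)]
        # Hypothetical ground truth: only the first two features drive the label.
        p = sigmoid(2.0 * x[0] - 1.5 * x[1])
        labels.append(1 if rng.random() < p else 0)
        rows.append(x)
    return rows, labels


def score(w, b, x):
    # Predicted probability that record x belongs to the positive class.
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def train(rows, labels, epochs=200, lr=0.1):
    """Fit logistic regression weights by batch gradient descent."""
    n_features = len(rows[0])
    n = len(rows)
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = score(w, b, x) - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, rows, labels):
    # Share of records whose thresholded score matches the label.
    hits = sum((score(w, b, x) >= 0.5) == (y == 1) for x, y in zip(rows, labels))
    return hits / len(rows)


# Train on one sample, then test on additional data the model has not seen.
train_rows, train_labels = make_data(1000, seed=0)
test_rows, test_labels = make_data(500, seed=1)
w, b = train(train_rows, train_labels)

print("weights:", [round(wj, 2) for wj in w])
print("held-out accuracy:", round(accuracy(w, b, test_rows, test_labels), 3))
```

The point to look at is the third weight: because that feature is pure noise in this made-up data, the fitted model gives it far less weight than the other two. That is one simple way a model can end up "ignoring" part of the data, and checking the held-out accuracy before going to production mirrors the "tested it on additional data" step.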
AS: You’ve mentioned how more and more business people are realizing how important analytics are for marketing, for example; is that an important trend?
RD: I come from marketing, but I don’t really consider myself a "marketer" per se, so I can say: you should watch out for the marketing people who think they’re doing a lot of analysis but aren’t. One of the things I tell people at my job: if their definition of data analysis is running a report, that is not data analysis. Apart from that, a lot of business people have been trying to do real data analytics with Excel for years. Excel isn’t a good tool for serious data analysis either. There are so many more advanced data science techniques. Making them more approachable and easier is definitely key.
For instance, take how I replaced our scoring method with a predictive model: you can use that approach for so many business problems. It is more advanced, but if you get a tool that can help figure that stuff out for you, then you don’t have to worry about getting it wrong. But always remember: garbage in, garbage out. The quality of the underlying data can make or break you.
In general, the learning curve in data science is extremely steep for someone who knows nothing about statistics. To create my first predictive model for example took me weeks of work because there was no one place I could go to learn all the things I had to learn. You learn one thing, and then you think “ok now I know Python, so I could look at scikit-learn,” and then “now I can look at modeling, but which way of doing modeling should I check,” and "what does all that even mean?" And then all of a sudden it’s weeks later, and you haven’t gotten anything done because you’ve spent so much time reading and trying to figure things out. It seems that it should be so much easier, but today it’s not.
The problem is there’s no one resource where you can go and find what you need to know to do just what you want to do. This is an issue because a lot of business people function just like me. They don’t want to wait until they know everything to start taking action. They learn just enough as they go. I don’t think I should need to be a stats wizard just to create a predictive model! It should be easier.
AS: Great! Thank you for the talk. To conclude, is there anything you want to add about what data science needs today?
RD: I generally just believe things need to get better. It is difficult for an average person to learn all this stuff because when you do start looking it seems like there’s so much stuff to learn. Most people don’t have time to learn all the things. I don’t think they should have to get another degree just to do it either, because data analysis is so important. Data has been growing by leaps and bounds; people have been saying that for years. Today more and more people are finally realizing that they can learn so much from their data, and tools like Dataiku make that easier and more approachable.