I was lucky enough to talk to data fanatic and Dataiku fan Robert Dempsey in September. We discussed his job, tools that help his team collaborate, data wrangling, marketing and analytics, and why it's hard to get started in data. This is part 1 of our great talk!
Father of Data Wranglers in DC, and inspiration to thousands of data scientists, Robert Dempsey is one of those geeks who can figure out anything they put their mind to, from management and marketing to designing machine-learning algorithms. Robert discovered Dataiku Data Science Studio about 6 months ago and immediately fell in love with the tool.
Following a few exchanges, we decided to find out more about how he works with data on a day-to-day basis. So I interviewed Robert and here’s what I learned.
Alivia Smith (AS): Hello Robert, can you tell us a little about how you got into data science?
Robert Dempsey (RD): I started digging into data a number of years ago when I founded a software development company. At first, I was doing all the work. When I started hiring people, I shifted towards the business aspects and ultimately got into marketing. I learned about Inbound Marketing and started creating content and bringing people to me instead of advertising and doing push. With Inbound Marketing there’s a lot of data analysis, so I started getting into dealing with data. That was 7-8 years ago and I really got into it.
At my previous job I continued to do a lot of inbound marketing and therefore a lot of analysis. Quickly, I started attending all the data meetups here in DC, and noticed that everyone was talking about data wrangling, or as they call it now, “data engineering.” Basically, that encompasses everything up to the “data analysis.” They were saying not only that data engineering took up to 80 or 90 percent of the total time spent on data projects, but also how horrible it was. But no one was talking about how to actually do that part of the job. So I thought if it’s a majority of every data project on earth then we should have a meetup that talks solely about that, and teaches people how to be more efficient. That is how I started Data Wranglers in DC two years ago.
AS: And what are you working on these days?
RD: Right now I’m the data ops team lead at ARPC. At first I was doing project management and Inbound Marketing, but I wanted to do more data work. So what I do now has a lot more to do with data engineering. We do a lot of data wrangling and generate a ton of reports. For example, we have to churn out this big report every week. We get a large amount of data from lawyers who don’t have the time to clean or standardize their data. Imagine the most horrific data formats ever… that’s basically what we receive each week. We then have to bring all this data in and do huge amounts of cleaning on it. In the end, we produce a report that our client uses to manage a large, multibillion dollar settlement. In addition, my team also does a lot of ad hoc and automated reports both for our client and to support our internal teams.
AS: Can you tell us more about how your team works?
RD: We’re extremely small - there’s 4 people in the team, including myself, and we have to do all of this stuff. That’s also why automation is key!
We collaborate with the settlement coordination team, the team that manages the internal settlement operations. When they have a question the application can’t answer they come to us for the answer.
We have three different sources of data:
There’s what we call our offline data which is everything we get from the lawyers. This data is stored in files and is ultimately brought into a SQL database.
Then we have our claims processing system which is also SQL based.
Finally we have our ancillary data sources, which is additional data we use to augment the other sources.
A lot of business people also come in and ask us complex questions about the data that we have. To answer these questions, we have to connect all of our different data sources.
In order to do that, I do everything in Python. We have one person on the team who took over all the work in R. I have another Python developer on our team, and then one of our other guys is learning Python. All of us know SQL.
Because our team is small, we basically touch all aspects of the job; there’s no real division of labor, which makes our collaboration more iterative and fluid.
AS: What’s your role in the team? Do you have more of a management role?
RD: I’m the team lead but I’m not a manager per se. I like to be hands on. I’m a geek!
There’s a few things I hate and one of the things I hate is doing the same thing over and over again. So I automate a huge amount of our process. For instance, I automate a lot of things that are purely in SQL, and I’ve automated a majority of the reporting as well. I’ve also created various dashboards for us, and set up Elasticsearch and Logstash to store and analyze the logs generated by our automated reporting.
One of my jobs is also looking at different tools that we can use on the data team to make our lives a lot easier.
I’m also trying to move us as an organization to do more analytics, so one of the things I did was to create a predictive model for one of our processes. I created and deployed a predictive model based on a scoring algorithm I’d created. That worked out very well.
AS: You mentioned you look into tools to make your team more efficient. What do you look for in these tools?
RD: When I joined the team and started looking at how the data team was working, I immediately noticed the main problem was how horrible cleaning all the different data sources was. It could take a full day just to get the data into the shape we needed it to be. We needed some better tooling! The team was using straight SQL for cleaning, and I was monkeying a lot with Excel. Neither of these tools is made for data engineering, so I thought there has to be a better way!
I started searching for tools by typing “data wrangling tools” into a search engine. That’s when I discovered Data Science Studio (aka DSS). I was also testing Pentaho and Talend tools at that time but they weren’t intuitive at all. I just couldn’t figure them out without reading a manual… and most would say I’m not a stupid person. In the end I deleted all of them from my computer except for DSS.
For us, one of the key selling points for Data Science Studio is visualizing our workflow in a collaborative way and being able to bring in the different languages we use. So much of what we do is stand-alone. We have some functions that are in R, everything I do is in Python and we have a lot of SQL. It’s all bits and pieces and there’s no formalized workflow, no way to collaborate. It’s all over the place, and documentation is an ongoing challenge. Don’t get me wrong, it does work; however, we needed a better way to tie everything together into a cohesive unit.
That’s why what starts out as a simple request from our client turns into a 2-day report creation process, and that’s if everything goes right! It’s nuts.
Data Science Studio supports everything that we use: Python, R, SQL. We can set up workflows that allow us to collaborate on projects as a team and to visualize each other’s progress. And I didn’t have to read through loads of help documents to figure out the tool, which is great. It’s intuitive and easy to figure out, for the most part. Those are the key things that I really like about DSS.
That's it for today, folks! Keep your ear to the ground for part 2 of Robert Dempsey's interview, where he gets into where to start when you want to get into data science and gives us lots of great advice.