In the big data engineering world, Spark and stream processing are the words on everybody's lips these days. Here are a few words of wisdom from somebody who actually works with Spark every day, the great Helena Edelson! We caught up with her at Spark Summit in Amsterdam last month to hear what she really thinks of Spark.
Since the exciting release of Data Science Studio 2.1, which integrates Spark, we’ve been having fun discovering the Spark community and being part of its dynamic ecosystem. With the release of 2.2 and the Prediction API server, we've also gotten into the business of stream processing and we're joining yet another dynamic community!
We could really sense this when we attended Spark Summit for the first time in Amsterdam at the end of October. We got to meet so many people that were passionate about the technology and discussing their diverse use cases. I was particularly excited to speak to Helena Edelson, VP of Product Engineering at Tuplejump, and a speaker at Spark Summit. We, of course, talked about Spark. Here is what she had to say!
AS: Hello Helena! To get started, can you tell us a little more about what you do?
HE: I recently joined Tuplejump as their VP of Product Engineering, having been a cloud engineer and doing big data for a long time. At Tuplejump, we have two different parts of the company. We do international consulting for companies, and we have a platform and services supporting big data blending and fast analytics. So we do sophisticated data collection and blending combining machine learning and analytics to understand the intention of the analyst, provide a unified view of your data from multiple data sources, any locations, both streaming and non-streaming, for fast, easy, advanced data analysis.
In this way, anyone, anytime can feed many disparate data sources into the system easily and start deriving meaning from their data. We present a holistic view of all of your data and then, based on your queries, we can do the rest of the work of engineering and data science for you via the platform in realtime. And right now we’re using Spark for a lot of this.
AS: That’s interesting. That’s similar to what Data Science Studio is doing for data scientists: making it possible for business analysts to take over part of their job! So, tell us, how did you originally get into data engineering?
HE: Well, after a decade or so in distributed messaging engineering, and then many more years as a senior cloud engineer doing large-scale cloud applications and infrastructure automation work, I accepted a role at a big data cyber security company on their cloud engineering team. I was really interested in working with the data versus the applications just moving the data around. So I started working on a new project doing big data analysis.
I added a new Scala layer to our Hadoop-based analytics system and automated it end-to-end. I had always worked in completely asynchronous event-driven environments, so this batch and scheduled world seemed odd to me. After this, I started a new project, still big data analytics but with streaming, and I didn’t want to use Hadoop, just Spark over Cassandra with Akka. We also leveraged Elasticsearch, but that isn’t needed now with fast querying using columnar storage.
AS: What were you looking for when you implemented Spark?
HE: I liked the fact that it was built in Scala and used Akka, and still does, even though I know they’re taking out the Akka. I’ve been using Scala and Akka in production for over six years. I also like the fact that Spark is very intuitive, coming from being a Scala engineer. Working with the data collections is very similar to what we do in Scala. It’s also very intuitive after using Scalding, which is a Scala analytics framework for batch, over Cascading. And I love that, as an engineer and not a trained data scientist, it allows me to get requirements for analytics and be able to implement them very easily.
That’s one thing that I found interesting here at Spark Summit. I met a data scientist who was also a speaker and, coming from the data scientist side, she’s been the one introducing Spark to the engineers versus my experience of being an engineer introducing Spark. So it works both ways and I think that that’s really interesting for a product. It’s very accessible for both sides.
And the other really great thing about Spark is that I can integrate both my streaming and my batch computations in the same application very easily. That’s very useful. And I can even replace my batch infrastructure altogether and do everything in my streaming layer, completely removing the need for ETL. This can save a company millions of dollars, and for several reasons.
AS: It’s interesting that you mention streaming because, here at Spark Summit, we keep hearing about stream processing, especially with Spark Streaming. It’s really the talk of the town! We’ve heard some people saying it’s amazing and the future, and others saying that it is not useful for everything. What’s your opinion on that?
HE: That’s true. To answer your question, here’s a good example of what stream processing can be used for. Years ago, I was working on Hadoop scheduled batch jobs. Some jobs were daily aggregations of different events per day and some more in-depth analysis on that. So you have data for each hour of the day - on that day. You’re collecting a day’s worth of data, and every hour more data is being stored. By the time you actually do the computation, 99% of your data is completely stale. Whereas with streaming, you can constantly see what’s happening.
When you think about it that way, it’s very interesting. When you need to know immediately about particular anomalies so that you can react, or with machine learning, if you want to predict in the stream when something will probably occur in order to proactively respond in domains like cyber security, then it’s extremely relevant.
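To make the batch-versus-streaming contrast concrete, here is a minimal sketch in plain Python. It does not use Spark's API, and the sample events and function names are hypothetical, chosen only for illustration. The batch function aggregates once after the whole day has been collected, while the streaming function updates running totals as each event arrives, so intermediate results are visible throughout the day.

```python
from collections import Counter

# Hypothetical sample data: (hour_of_day, event_type) pairs for a single day.
EVENTS = [
    (0, "login"),
    (1, "login"),
    (1, "error"),
    (13, "login"),
    (23, "error"),
]


def batch_daily_counts(events):
    """Batch style: wait until the whole day is collected, then aggregate once."""
    return Counter(event_type for _, event_type in events)


def streaming_counts(events):
    """Streaming style: update running totals as each event arrives and
    yield a snapshot, so results are visible throughout the day."""
    running = Counter()
    for hour, event_type in events:
        running[event_type] += 1
        # Copy the totals so each snapshot is independent of later updates.
        yield hour, dict(running)


if __name__ == "__main__":
    for hour, snapshot in streaming_counts(EVENTS):
        print(f"hour {hour:02d}: {snapshot}")
    print("end-of-day batch:", dict(batch_daily_counts(EVENTS)))
```

Both approaches end the day with the same totals. The difference is that the streaming version exposes a current answer after every event, which is what makes reacting to anomalies as they happen possible.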
AS: So is there anything you haven’t enjoyed with Spark?
HE: Everyone loves Spark.
AS: Well that’s great! We were talking with some people here at Spark Summit who said they thought the error messages are really scary.
HE: I’ve read a lot of complaints from people who were not happy or just very confused by the error messages. I’ve been unfazed myself. I find that a lot of those messages are related to Akka. Having handled Akka error messages for years, they make sense to me. Properly done error handling is difficult for any language and any product. I don’t see it as a thing for Spark specifically, but I do understand where these people are coming from.
AS: I’d like to come back to what you were saying, about how your product is replacing data engineers. Do you think that's a general trend today where more and more people are getting into analytics who aren’t trained engineers and they have to have tools available to do that on their own?
HE: It really depends. I generally work for technology providers, so we’re producing technology. But I have talked to a few people lately that have noticed that trend. There are definitely people trying to automate the role of the engineer more, to make it easier and more accessible.
AS: That’s something that we’ve noticed that a lot of our clients deal with in data science projects. The engineers work on the infrastructure and the data scientists work on the project, and then the business analysts don’t really understand what anybody’s been doing, so the project fails because of that.
HE: Right, and that’s something that we’re trying to make more seamless at Tuplejump. If you’re an analyst, we’re trying to make Tuplejump (the engineering and data science) completely transparent in your workflow. Whatever your favourite tool is, you can work on it and hopefully not even be aware of what Tuplejump is doing. We want to make it all seamless, intuitive and fast.
AS: So are there any other technologies that you’re excited about right now, other than Spark?
HE: We’re at Spark Summit right now so I am focusing my talk on Spark streaming technologies, their use cases, and how to integrate them. But there are lots of streaming products today, like Apache Flink, and Gearpump is another new one. I’m also speaking next at QCon in San Francisco in a track on streaming at scale.
There are so many different use cases that call for different technologies. For instance, we all know Netflix does a lot of streaming, but not all of their streaming work is analytics based. Since all of their streaming isn’t specifically set up for data science, it doesn’t make sense to apply what they’ve done to any business.
AS: How about after Spark? How do you think the technology is going to keep evolving?
HE: What I can say is that everything is moving very fast. You can start working on a prototype using one technology and, before you’ve even gotten to MVP, you hear about some new thing that allows you to do more or just do things differently. It’s all a lot to keep up with, particularly when you’re a producer of technology. You’ve got to get your solution out to the market before everyone else!
AS: To conclude, do you have advice for a beginner trying to understand big data technologies and figure out what system to implement?
HE: There are many choices out there today. As with anything in software, we have many solutions available for any given problem. You really have to consider what your use cases, requirements and constraints are, and then be aware of the different facets of technologies available. When you’re picking a particular stack, it’s also really important to think about how the different technologies work together.
Also keep in mind that there’s always some kind of give and take. You might have to compromise on a desired functionality with one technology because it really helps you in another area that has more importance for you. It’s really about knowing what you’re working with and what you really need. It’s not about what you heard X or Y is doing, or what technology everyone is talking about or what the well known companies are using. You should really look into what you’re trying to do and what’s the best answer for it.
AS: Any final words you’d like to add?
HE: Everyone is coming up with really great ideas for solving things very quickly. I’ve been involved in open source for a very long time, so I always appreciate how everyone is collaborative. I’ve been speaking at more and more conferences and you see people from all around the world, talking and sharing and working together. It’s truly great.
Watch Helena's talk at Spark Summit.
For more awesome interviews by Dataiku, you can check out Olivier Grisel's talk on scikit-learn and big data technologies, and our conversation with Robert Dempsey on data wrangling and teamwork.