In this episode of the Banana Data Podcast, our hosts are joined by Nathan Mannheimer (Director of Data Science and ML at Tableau) as they unriddle the value of data visualization, explore what can go right and what can go wrong, and search for an answer to inquiries surrounding the future of data visualization.
Don't have your headphones handy? That's not a problem. Here is a complete transcript of the episode for you!
Corey: Hello everyone and welcome to the Banana Data Podcast! My name is Corey Strausman and I am a community manager here at Dataiku. You're joining us for a very special episode today. I'm joined by my guest, the wonderful CPM.
Before I let CPM go on, I'm just going to say one thing for our very loyal listeners. Last time, we said that CPM and I would figure out if we could record a podcast by ourselves. We lied, so we apologize for that. We have a very special guest, and it will all be worth it. CPM, what do you have to say?
CPM: Really excited to be back here today. Last time we talked about the importance of storytelling with data and discussed how the emotional impact of a story can help inspire people to take action on the results of data. You know, data is just numbers, but how can a story create empathy and have a call to action that inspires people to do something?
So, today it's also really great to have you all because we're going to be delving more into data visualization, and we're joined by Nathan Mannheimer, Director of Data Science and Machine Learning at Tableau. Nathan, do you want to take a moment to introduce yourself?
Nathan: Hey folks, really glad to be here. My name's Nathan and, like CPM said, I'm a product director here at Tableau. I work in areas where data science and machine learning really intersect with Tableau's core mission of helping people see and understand their data. So a lot's going on. It’s a very exciting space to be in, and it's an area where I think there are some very interesting and useful applications for data visualization. So, very happy to be here today to talk about data visualization.
Corey: I'm going to start with a pretty simple and pointed question, and I want you to feel free to jump in and be like, “What are you talking about?” So, when it comes to visualization, they say that a picture is worth a thousand words. How many of those words are accurate?
Nathan: I think that's a great question. This cuts into a number of different parts of data visualization. The first is the data itself.
A very effective data visualization can be built on very inaccurate or sort of misinformed data. So, a lot of thought has gone into how we can accurately convey information and how we do that in a way that is going to be understood really quickly and easily. If the raw material isn't solid, the data visualization is going to be inherently misleading.
Then taking it up to the next level — how do we process information? Our visual system is the highest bandwidth conduit into our brain. It's one of our most effective senses as we navigate the world. That's the kind of physiology that we build on when we're constructing good data visualizations. So, we know that it's a very effective tool for conveying information, but that it still requires a thoughtful design to turn an okay visualization into a great one. It means focusing on what the story is in the visualization.
Are we making sure that the story that we want to tell and the facts that we were interested in expressing are popping to the surface? Are they not being clouded out by other noise in the visualization? So, assuming the data is good, it's really a matter of making sure that the encodings or the channels that we're constructing in that visualization are mapped to the facts and the data that we think is most important to convey. When those two are lined up, visualization can be a really powerful tool for conveying information.
CPM: I think it was the University of Minnesota that said the human brain processes visuals 60,000 times faster than it does text. That ingestion of information is quite rapid with a well-constructed visualization. On that note of trying to highlight the specific insight that you want to communicate, how do you go about that process?
There's so much information that's within data. Assuming that the data set has been mined accurately and constructed in a well-designed way, how do you hone in on that insight and start telling a story with the visualization?
Nathan: This is still a very human-centered process. When we're building a visualization or constructing a template for automated future visualizations, most datasets have more than one or two or three fields in them. It's really important to understand what the key things for this particular visualization are and try not to overload it.
When we're constructing visualizations, unfortunately, humans are limited to seeing in three spatial dimensions. We often see that the most effective visualizations are two-dimensional so that limits the number of tools in our belt that we have to transform information — raw data — into the visual world. We know that humans are very good at certain comparisons, for example, position, comparing whether points are close or far from each other or whether bars have different lengths or heights. We can use position very effectively. That's why bar charts and scatterplots are almost ubiquitous. Going beyond that it starts to get a little bit more complicated, and we have to be a little bit more thoughtful.
There were some really fantastic visualizations for Napoleon's March on Moscow. They're really famous, and there's really creative play with the encodings. It requires thought on the key elements of the story and, then, are we conveying those effectively? Are we mapping those into the visual world in a way that's making the best use of the tools at hand? Sometimes the story is more complex than a few variables and we have to think a little bit more creatively. Are we going to create multiple visualizations that tell a combined story? Are we going to transform the data in ways that extract the most useful or salient information then go back to the basics and use those two-dimensional encodings again?
CPM: I think we've all seen some over-encoding in images, or in graphics, where they use shape and color and transparency and size and the X axis and the Y axis and maybe an animation. It’s so overwhelming trying to see ten different dimensions in one image even though all the information is there and accurate. The takeaway from the images is completely lost and sort of distracting to the crowd. Finding that right balance and honing in on exactly the one or two insights you want somebody to draw is incredibly important.
Nathan: It's one of those things where, when we're really successful, a visualization shouldn't require a lot of work to decode. It should almost be pre-attentive. You should be able to look at it and not have to go back and forth to the legend or to other descriptions. It should tell its own story as independently as possible. That's the ideal. Of course, it's not always possible to do that. Sometimes there are specific domains where the information is just more complex, and we have to be a little bit more thoughtful depending on the audience. But, it’s always the goal that a visualization should essentially stand on its own and really push its story into the minds of the audience.
Corey: Well, there's a quote that I saw recently where people hear statistics, but they feel stories. That quote reminds me of this piece that we recently read from Forbes, “Humanizing AI Keys for Cognitive Design Thinking and Custom AI”. It talks about a five-step process that humanizes the creation process. Before I go into that process, do you want to go into the product side — how you go into developing a process to ensure that the data and the visualization are collaborative with each other, creating an all-around great user experience?
Nathan: I think this is something that is sometimes misunderstood in data science, and people think data visualization is slotted into just one or two parts of the process. At its core, data visualization is just a tool for conveying information, and understanding information is important at every step of the machine learning and data science process, whether that's understanding the current state of the world, helping us frame up a question, fixing an inefficiency, or helping characterize the magnitude or impact of that problem. Then, it gets into pulling relevant data, understanding the completeness of that data, and understanding the exploratory data analysis.
It is kind of the classic: “What is the data telling us at a first glance?” What are interesting patterns that may emerge from that? And then, it comes to the modeling process and understanding if one model is performing better than another. What metrics do we measure that question by? What features are driving a model to make a prediction one way or another at a global level, at a prediction level, or even at a sort of a subgroup level? Is there a disparate impact on model predictions or error as we break the data down? Those are really information-seeking tasks, and visualization is a great tool for information-seeking. When we put a model into production, it goes even further. We want to monitor and make sure that the model is performing well.
We're looking at data often changing over time, looking at changing volumes. These are great application opportunities for visualization. An important piece that affects a wide range of model outcomes, but not all, is understanding how we take those predictions and humanize that. How do we bring that into a business process or a working process that people can understand, especially people who maybe don't have a background in stats or ML, and certainly might not ever call themselves a data scientist?
How do we make predictions and uncertainty around those predictions graspable to those folks who just want to do their jobs and better understand their world? That's a great application for visualization as well. We're taking new streams of information, predictions, and conveying those to people in the places where they're already working. Throughout that whole end-to-end spectrum of data science and machine learning, visualization plays a key role and helps us understand the salient information at each step.
CPM: Yeah. The data visualization is important at every step of the data analysis pipeline, but also can serve each of the archetypes who might interact with both the technical side and non-technical side.
So as you mentioned, in the classic EDA sector, maybe analysts are served well by presenting that information in aggregated forms and things like that. Numbers alone won't tell the story. Cross tabulations alone won't tell the story, but actually seeing that percent change is impactful. For the modeling piece, it's the data scientists or the data engineers who can evaluate those models, and also the business stakeholders reading from a dashboard once the model is in production to see the output. Therefore, it's not only helping each of those steps in the data analysis pipeline, but also the stakeholders who might interact with either the building or the ingestion of information.
Nathan: Absolutely, and I think we can even abstract it further than just saying a dashboard necessarily. I would say a cross tab is a form of data visualization: we're usually conveying aggregate numbers in a formatted way that's easier for people to understand. But dashboards are a really common tool and that's great.
There are many other ways to consume visualizations. There are embedded applications: things that pop up where you're consuming information — maybe on your phone, on your smartwatch, really wherever you're interacting with computers. Sometimes, even in a static printed out form, we can have visualizations that help convey information for each of those parts of the process.
Corey: So we're gonna segue a little bit here. Now we're going to talk a little bit about uncertainty. CPM, do you want to introduce this topic?
CPM: Yeah. The measurement of uncertainty in images is quite important to talk about because you could present accurate information, but the way in which you present could also be prone to telling a story or ultimate insight that might be misleading.
So, I'd like to talk about the difference between statistical versus practical significance here and how that ladders up to telling an insight that's true, but might also not necessarily tell the whole truth.
Nathan: That's a great point, and it's a point that's certainly key to data visualization. I would say broadly key to data science as well. We should never sort of abandon our common sense and our need to understand the domain that we're working in even if we have really great tools that are accessible and useful to us. There were some silly examples of this kind of thing, and I think those are funny. Sometimes I don't like using those because it leads people to think “that will never happen to me” or “I would never make that kind of a mistake.”
There are spurious correlation charts that track things over time, and they show things like consumption of Swiss cheese and engineering doctorates awarded over time. There's, like, a 0.94 correlation coefficient between the two over 10 years. That's statistically significant and, on its face, a very interesting relationship. However, it isn't really practically significant because we can infer that there's not really a causal relationship between those things. It's a fun example that shows us you can draw interesting statistical patterns from nothing or from spurious correlations, but there are also very real and practical examples of that.
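To make the spurious-correlation point concrete, here is a minimal sketch in Python. The two series are synthetic and built from unrelated formulas (they are not the real cheese or doctorate figures), and the names `cheese`, `doctorates`, and `pearson_r` are illustrative assumptions. Because both series simply trend upward over ten years, their correlation coefficient comes out high even though neither depends on the other.

```python
from math import cos, sin, sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


YEARS = range(10)

# Synthetic, made-up series: each is an upward trend plus its own small wiggle.
# Nothing in one formula depends on the other.
cheese = [10 + 0.5 * t + 0.2 * sin(t) for t in YEARS]
doctorates = [500 + 40 * t + 15 * cos(2 * t) for t in YEARS]

R = pearson_r(cheese, doctorates)
print(f"correlation between two unrelated trending series: r = {R:.2f}")
```

A shared trend over time is enough to produce a strong correlation, which is why the number alone says nothing about a causal link.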
There are of course other examples, like the classic dataset Anscombe's quartet, where four datasets have identical statistical measures. So there are two columns, and both the means and the standard deviations are the same. The linear model that you might fit to each of them is exactly the same, but when you visualize them, they're completely different. So this is an example of where visualization shows us things that statistics might not. Neither is better than the other in any way. They're highly related, but we should always think critically. We should always use all of the tools at our disposal to better understand a problem before we walk away with a conclusion.
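The Anscombe's quartet point can be checked directly. The sketch below, using only the Python standard library, computes the summary statistics and least-squares line for each of the four datasets. The helper name `summarize` is an illustrative assumption. In practice the statistics agree to about two decimal places rather than exactly, which is still close enough that a table of numbers would not tell the four sets apart.

```python
from statistics import mean, stdev

# Anscombe's quartet: sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Means, sample standard deviations, and least-squares line for one dataset."""
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "sd_x": stdev(x),
        "sd_y": stdev(y),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


SUMMARIES = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}
for name, stats in SUMMARIES.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Each dataset gives a fitted line of roughly y = 3.00 + 0.50x. Plotting the same four sets as scatterplots shows a linear cloud, a curve, a line with one outlier, and a vertical stack with one far-off point.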
CPM: Absolutely. Thinking about these tools that are at our disposal, one thing that we have to keep in mind is that the way in which the tool is used could change the way it actually impacts our audience. For example, think of a very simplistic bar chart that's comparing two different groups, two different bars up against one another. Looking at the proportion between these two crowds, version A vs. version B, there could be, you know, a 49% to 51% split.
If I have my Y axis set at zero, it doesn't look like there's much of a difference, but if I have my Y axis zoomed in to between 48 and 52, it's going to look like there's such a big difference between those two bars. The information that I'm presenting is technically accurate in both of those visualizations, but the way in which I'm using the tool is sort of manipulating the way in which I want my audience to ingest that information. So, as with everything, the tool can be used for good or for bad. In that case, there is a very small margin of difference between those two bars, but the way I’m choosing to show it to my audience could tell a totally different story.
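The effect of the axis choice can be put into numbers. This is a minimal sketch, assuming the 49% and 51% values and the 0 and 48 axis starting points from the example above. The function name `drawn_height_ratio` is an illustrative assumption. It computes how much taller one bar is drawn than the other, since a bar's drawn height is its value minus the point where the y-axis starts.

```python
def drawn_height_ratio(value_a, value_b, axis_min):
    """How many times taller bar B is drawn than bar A when the y-axis starts at axis_min."""
    if axis_min >= min(value_a, value_b):
        raise ValueError("axis_min must be below both values")
    return (value_b - axis_min) / (value_a - axis_min)


VERSION_A, VERSION_B = 49.0, 51.0

for axis_min in (0.0, 48.0):
    ratio = drawn_height_ratio(VERSION_A, VERSION_B, axis_min)
    print(f"y-axis starts at {axis_min:>4}: bar B is drawn {ratio:.2f}x as tall as bar A")
```

With the axis at zero, bar B is drawn about 1.04 times as tall as bar A. With the axis starting at 48, it is drawn 3 times as tall, even though the underlying values are unchanged.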
Nathan: Absolutely. Sometimes there's a story that you want to convey, and sometimes a small difference can be significant. It's actually very similar to p-hacking, but with visualization. You're hacking your visual controls to make something pop out and make a story seem significant. I think that in many cases you see this in popular media quite a bit. People go in with a conclusion and stories in mind. The visualizations used to tell these stories, in many cases, can be very misleading.
Corey: A solution in search of a problem. Speaking of tools, I wanted to kind of talk about Tableau, right? Tableau has really transformed the way that businesses and people (both specialists and common people) look at how data and business outcomes are presented.
So, when we're looking at the visualization, especially at the enterprise level, what is the bottom line? We talked about how data visualization can really help people who aren't used to seeing raw data to better visualize it. It creates more of a human-centered approach. From the data science standpoint, it helps impact and visualize EDA, but if you're using Tableau and you're a data scientist or a decision-maker, how does a product like Tableau promote better business outcomes so that you can show how your data is making an impact and show how resources could be better used?
Nathan: It's a great question. The mission for Tableau is to help people see and understand their data. Tableau didn't invent data visualization. The techniques have been foundational in the space for hundreds of years, if not longer. What Tableau did was make those techniques accessible to a wider range of people who had questions about the world, were interested in their business and how data was moving around them, but previously didn’t have a tool that worked for them. Tableau allowed them to explore on their own and not have to go to somebody else asking a question and waiting for a response to come back. Freeing up that creativity and curiosity is where we see really huge success in Tableau because most people are curious.
Most people want to know why things are the way they are and they want to do things better and faster. So, in creating a set of tools, a set of experiences that allows people to unbridle that creativity, we found something really special and really exciting.
We're asking, how does that improve business outcomes? How does that improve awareness? It allows people who are familiar with the data, familiar with the domain, familiar with how the data is being created to ask questions on their own and start to understand that data. You're able to follow down the path of exploration and insight as new questions come up in real time. So, what Tableau did was really special in bringing that to a wider range of people than had ever had access to the power of data visualization before.
CPM: Yeah, I love this concept of creativity. Letting your mind run wild can often help in finding something that you weren't even looking for in the first place.
I forget which companies do it. I think maybe Google, Facebook, and a couple of the big-name companies have hackathon days where they give a whole day or a couple of days to have an open-ended question and say, “You can do whatever you want as long as you come back and make sure you present to the team.” That has been the start of a lot of different ideas or product features at different companies because you allowed for the space to creatively explore the data. I love incorporating that, maybe automating the things that are the day-to-day, and freeing up some of that time to pursue other endeavors.
Nathan: As we've grown, it's not necessarily just letting your mind run wild, which can be really fun. Exactly to your point, bringing that back to the group becomes really important, especially in an enterprise setting. We want to allow people to not just explore, but to also do that in an environment where their starting point has been vetted and cleaned up by people who are experts in the systems and the data.
Then, allow them to take what they've done and bring it back to share with their colleagues and leadership, so that the organization is really being empowered by more people contributing to the conversation about what success looks like and how we understand that.
Corey: Nathan or CPM, do you guys have a personal experience that you can talk about with how data visualization sort of transformed something for you? It could be a dataset, it could be a piece of data, or a more relatable example. I'd love to hear how data visualization has impacted you either in a specific way or in a broader way, in your day-to-day work, in your professional life, in your personal life.
CPM: On one of our previous episodes, we talked about weeks of our life, which certainly helped me have a little bit more carpe diem. I can also think of another fun one. It was from Kurzgesagt – In a Nutshell. This YouTube channel has really great animation skills. It was about evolution and looked at all of the different life forms and how they are connected, from the very beginnings of amoebas all the way to humans. It was really impactful to see how it splayed out over time. Another one was about the expansion of the universe from the big bang to now. It really helped me feel like I'm a part of a greater whole and realize that I'm just a speck of dust on the timeline of history. It sort of grounded me a little bit and related back to that “I have to seize the day” from the weeks of life discussion.
Nathan: One of the ones that I really remember, because it was a big sort of moment for me early on in my career, is from when I was starting to learn Python and play around with writing simple programs. Some of the people that I worked with were exploring. I forget the exact problem, but it was a probability problem about drawing combinations of M&M’s out of a packet. Two people solved the problem analytically and came up with one solution. Since I was sort of testing out my Python, I did a simulation and ran through it.
When I was looking at my results, I did something really basic, like a Matplotlib histogram of the outcome. My most likely number for the probability was an order of magnitude off from what they had gotten. We came back sitting around and they were like, “Oh you probably made a mistake somewhere. This is definitely the right answer.” When we came back the next day and looked at it again, it turned out that I was actually right. I had solved the problem correctly using a relatively simple program, visualizing the output, and looking at that result. That was really exciting and powerful to me.
It was exciting that I could unlock that kind of power. These guys were better at math than me, so it took some convincing and back and forth, but, in the end, it was validating to see that I was capable of solving a problem that obviously was a little bit of a challenge to figure out. That was a really formative experience for me, and that combination of statistical resampling, simulation, visualization, and exploring results has really stuck with me through my career. That was a big exciting win for me in that space and something that I really remember.
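The workflow described here (simulate, look at the distribution of outcomes, compare with the analytic answer) can be sketched in a few lines of Python. The original M&M's problem is not specified, so everything below is a hypothetical stand-in: the packet contents in `PACKET`, the five draws, and the question "how many reds do you get?" are all assumptions made for illustration. A text histogram is used instead of Matplotlib so the sketch runs with the standard library alone.

```python
import random
from collections import Counter
from math import comb

# Hypothetical packet contents and draw size, chosen only for illustration.
PACKET = {"red": 6, "blue": 10, "green": 8, "yellow": 6}
DRAWS = 5
TRIALS = 20_000


def analytic_prob(k):
    """Exact probability of exactly k reds in DRAWS draws without replacement."""
    red = PACKET["red"]
    total = sum(PACKET.values())
    return comb(red, k) * comb(total - red, DRAWS - k) / comb(total, DRAWS)


def simulate(trials, seed=0):
    """Draw DRAWS candies without replacement `trials` times; count reds per draw."""
    rng = random.Random(seed)
    candies = [color for color, count in PACKET.items() for _ in range(count)]
    counts = Counter()
    for _ in range(trials):
        hand = rng.sample(candies, DRAWS)
        counts[hand.count("red")] += 1
    return counts


COUNTS = simulate(TRIALS)

# Text histogram: one row per outcome, simulated share next to the exact value.
for k in range(DRAWS + 1):
    share = COUNTS[k] / TRIALS
    bar = "#" * round(share * 50)
    print(f"{k} red | {bar:<25} sim={share:.3f} exact={analytic_prob(k):.3f}")
```

When the simulated shares and the exact values line up, the analytic answer is probably right. When they differ by an order of magnitude, as in the story above, that gap is the signal to recheck the derivation or the code.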
Corey: I think it's only fair if I provide an example as well since I put you guys on the spot here. It's not really an example, as much as a recommendation. I'm someone who likes to follow current events, politics, and international news a lot. One of my favorite people to follow is a data visual artist at the Guardian, Mona Chalabi. I highly recommend that you check her out. She takes all the hot button issues that have data around them, and she visualizes them. It's like a work of art, but it's also really informative as well. It transforms the experience and makes you think.
This will be our last point that I bring up. As data visualization continues, what's next? How does data visualization continue to evolve?
Nathan: That's a great question, because I think if you look through the course of the industry, the fundamentals haven't really changed in a long time.
We have pretty solid research on what people can easily understand visually and what the effective ways to create a visualization are. So, the challenge is, how do we continue to allow more people into this world and more people to explore their world using visualization as a tool? How do we convey the value to people who might not ever think of themselves as somebody for whom working with data analytics and visualization is their core job? You might not ever write any code; so, how do we make those outputs more accessible? I think we've seen some really awesome stuff happening in the last few years in data-driven journalism. Some really fantastic things are being produced. How do we make that understanding of the world more ubiquitous and accessible to more people?
It's really not a fundamental change in what data visualization is, in my opinion, but more about how we let more people participate. How do we share the results with more people? There's a lot that goes along with data infrastructure, and helping the average population’s statistical literacy increase so that people aren't making mistakes. How can we help people do the right things well? That requires a bit more domain knowledge and understanding, but I would say in a couple of words, ubiquity and expansion of access is what’s next.
CPM: I totally agree. Beyond the dissemination of that information, there's also ongoing research on the psychological effect of visualizations and specifically on anthropomorphic visualizations, meaning using visualization to embody a human element. This includes determining whether it's a figure of a human being, determining if it's an individual or an aggregate, comparing more abstract versus human shapes, and exploring labeling as a human being and the effect that has on how the person viewing that visualization relates to it. It's inconclusive as of right now as to whether or not that labeling heightens somebody's ingestion of insight and has them follow through with a call to action. But still, I think that might be the next step: analyzing that realm.
Nathan: There's some really interesting work that was done around a visualization type called Chernoff faces, which are kind of funny if you've seen them. Basically the thinking was, people are really good at reading expressions and emotions of another human's face. So, what if we could translate that into a visualization? It turned out that it didn't actually work well, and it was in a lot of ways confusing. I'd be very interested to see where this all goes and if there are ways to actually find success in that. I think sometimes adding more to the visualization can be distracting. So, if we can crack that and make visualizations that speak to people on a deeper, emotional level, I think that would be fantastic. There's certainly been a number of cases where it has been tried and met with mixed success.
Corey: I think this is a good place to wrap it up. Nathan, thank you so much for joining us today. This was a wonderful conversation. I'm going to give you the last word in a second. Just one quick plug for the Banana Data podcast.
If you like what you hear, please be sure to subscribe to the Banana Data podcast, wherever you listen to podcasts. If you guys are unfamiliar with Tableau — if you're listening to this podcast, I find that hard to believe — check out Tableau. It's an amazing resource. Nathan, I'll give you the final word to wrap everything up.
Nathan: Well, Corey, CPM, it was really great being here. I really enjoyed being able to talk about data visualization with you guys. It's obviously an area where there's so much going on. Certainly, this would be a very long podcast if we even tried to be complete. It was very exciting to talk about everything, and I think we're going to see lots of interesting stuff coming around in everything that we've talked about here, for the near future and further out. It is so great to have a chance to be a part of it.
Corey: Thank you, Nathan. Thank you, CPM. See you next time everyone.