Maximize Your Data Potential Beyond Spreadsheets

Dataiku Product, Scaling AI | Catie Grasso

Microsoft Excel has been at the forefront of democratizing the use of data within an organization, in part due to its tremendous ease of use for data manipulation. However, with the drastic increase in the volume of data available, the diversity of sources, and the continuous evolution of data processing technologies, the spreadsheet has, over time, shifted from being an enabler to more of a limiting factor — with the potential to be detrimental to any organization serious about achieving true AI transformation. Issues such as data accuracy, siloed work, security, human error, and limitations with large datasets only scratch the surface of the frustrations associated with spreadsheets, and these challenges will likely worsen as the use of data continues to rise.

In Oracle's survey of 14,000 employees and business leaders from 17 countries, 86% said that dealing with more data makes decision-making harder in their personal and professional lives. Additionally, 59% said they face decision dilemmas several times every day. To avoid constraints and bottlenecks like these, data executives and data team managers alike should encourage the transition from disparate spreadsheets into something more scalable, such as a collaborative data science tool.

While the transition out of spreadsheets for more sophisticated analyses will not be instantaneous, it’s a critical process that data executives and managers should start immediately. If people across the organization remain in spreadsheets exclusively, not only will the company fall far behind in the race to AI, but it will also face real, business-impacting consequences.

In this blog, we’ll highlight the key areas where the shortcomings of spreadsheets (through the lens of data science and AI) are most apparent, along with actionable ways data executives and team managers can equip their teams to start making the switch now.

Enterprise Data Science Requirements Where Spreadsheets Fall Short

1. An Unreliable Source of Truth: Data Governance and Trust Concerns

As an organization’s livelihood depends on the trust of its users, it is imperative that data is protected and governed. Yet many companies make major decisions based on spreadsheets cobbled together from files copied and pasted by hand and held together by macros developed nearly two decades ago. For data sources that contain sensitive information, spreadsheets are notably precarious as there’s no actual audit trail or data lineage. There’s no hassle-free way to see versions of a dataset or the steps taken to cleanse it and, further, each version needs to be saved and comments applied manually for any changes to actually be tracked.

Without a consolidated repository, teams risk blurred lines between projects, a severe lack of organization, and, most concerning of all, security lapses. A centralized data science tool — like Dataiku — from which all data work happens, on the other hand, makes data and AI Governance infinitely simpler in the following ways:

  1. Teams can keep documentation of data sources with sensitive or proprietary information.
  2. The history of the data (where it came from, what happened to it, where it’s being used) can easily be seen.
  3. At any given point, teams can view what data is used for a given project and who “owns” what data.
  4. They can keep documentation so any contributor can easily explain what has been done on a specific project.

2. A Plethora of Inefficiencies

A significant part of data science involves identifying areas of the data-to-insights pipeline where efficiencies can be gained. Working with spreadsheets, though, is completely counterproductive to this progress, as the process is riddled with inefficiencies.

Notably, organizations will have to keep hiring more and more staff because spreadsheets are inefficient from both a manpower and an actual output point of view. Spreadsheets are not conducive to complex AI projects, which require accessing and processing large amounts of data, moving that data from one repository to another, and then using computation resources. With spreadsheets, tasks like efficiently connecting to a database, defining key operations, and scheduling processes for repeat usage and benefits can’t actually be streamlined.
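To make that contrast concrete, here is a minimal Python sketch of the kind of scripted, schedulable pipeline that spreadsheets can’t offer. The database file, table, and column names (sales.db, transactions, region, amount, order_date) are hypothetical placeholders rather than a reference to any specific system:

```python
# Minimal sketch of a repeatable data pipeline outside of spreadsheets.
# Assumptions: a local SQLite file "sales.db" with a "transactions" table
# (columns: region, amount, order_date) -- purely illustrative names.
import sqlite3

import pandas as pd


def run_daily_sales_rollup(db_path: str = "sales.db") -> pd.DataFrame:
    # Connect once; the same code runs identically every time it is triggered.
    with sqlite3.connect(db_path) as conn:
        df = pd.read_sql_query(
            "SELECT region, amount, order_date FROM transactions", conn
        )

    # Define the key operation in code so it is versioned and auditable.
    rollup = df.groupby("region", as_index=False)["amount"].sum()

    # Persist the output; a scheduler (e.g., cron or a Dataiku scenario)
    # can rerun this script without any manual copy-paste.
    rollup.to_csv("daily_sales_rollup.csv", index=False)
    return rollup


if __name__ == "__main__":
    run_daily_sales_rollup()
```

Because the connection, the key operation, and the output all live in code, the whole process can be reviewed, versioned, and rerun on a schedule instead of being repeated by hand.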

By using spreadsheets for data analytics, teams essentially stifle their growth. By moving out of spreadsheets, they will be able to increase the size and number of unique datasets as well as the complexity of their data analysis. Learn how Novartis revolutionized its analytics and AI capabilities, transitioning from manual processes to harnessing the full potential of Dataiku's end-to-end data science platform, empowering informed decision-making while conquering data silos.

Further, spreadsheets are prone to several data profiling and formatting issues. For example, in spreadsheets, users create filters and pivot tables, but things can go awry when a column contains thousands of distinct values or when duplicates arise from different spellings. Because spreadsheets have no visual representation for each value, the user has to toggle back and forth between pivot tables and filtered data to even start to understand the data, let alone extract insights. Spreadsheets also auto-format dates and timestamps incorrectly without providing a way to reverse the auto-formatting. The problem isn’t impossible to solve, but teams wind up needing to write extra formulas just to get their data formatted correctly (while running the risk of missing something).
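As an illustration, here is a minimal pandas sketch, with hypothetical file and column names (orders.csv, country, order_date), of how explicit profiling and date parsing avoid the silent reformatting and pivot-table toggling described above:

```python
# Illustrative sketch: explicit profiling and date parsing in pandas,
# avoiding the silent auto-formatting that spreadsheets apply.
# File and column names ("orders.csv", "country", "order_date") are hypothetical.
import pandas as pd

# Read everything as text first so nothing is reformatted behind our backs.
df = pd.read_csv("orders.csv", dtype=str)

# Profile a high-cardinality column instead of scrolling through a pivot table.
print(df["country"].value_counts().head(20))

# Normalize duplicate spellings before grouping (e.g., "USA " vs. "usa").
df["country"] = df["country"].str.strip().str.upper()

# Parse dates explicitly with a known format; unparseable values become NaT
# instead of being silently converted to something else.
df["order_date"] = pd.to_datetime(df["order_date"], format="%Y-%m-%d", errors="coerce")
```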

Not only are the data-driven organizations of today collecting an incredible amount of data, but they strive to combine data from all available sources (conventional and unconventional) to get a holistic view of their business. With inefficiencies like the ones outlined above in the rearview, teams can save a tremendous amount of time and ensure they have access to the most recent data.

3. An Antiquated Approach

In a previous Dataiku survey, a staggering 58% of respondents said that most of the data preparation for their machine learning pipelines occurs in one system while machine learning is built in another. Data prep should move out of separate spreadsheets and into the same place where machine learning happens, so projects can be expanded and developed more easily.
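For illustration only, here is a minimal Python sketch, with hypothetical file and column names (churn.csv, tenure, monthly_spend, churned), showing what it looks like when data prep and model training live in the same pipeline so the whole project can be rerun end to end:

```python
# Minimal sketch of data prep and model training living in one pipeline,
# rather than prep in a spreadsheet and modeling in a separate system.
# File and column names ("churn.csv", "tenure", "monthly_spend", "churned")
# are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Preparation step: lives in the same place, under the same version control,
# as the model that consumes it.
df = pd.read_csv("churn.csv").dropna(subset=["tenure", "monthly_spend", "churned"])
X = df[["tenure", "monthly_spend"]]
y = df["churned"].astype(int)

# Modeling step: the whole thing can be rerun end to end whenever the
# source data changes, with no manual handoff between tools.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(f"Holdout accuracy: {model.score(X_test, y_test):.2f}")
```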

Further, spreadsheets hinder collaboration and the ability to share ideas in real time. To make collaboration more organic, teams need to centralize their analytics. In Dataiku, for example, teams can document project goals, project inputs and outputs, and project inner workings as well as track planned future project enhancements, all under one roof.

Data prep in spreadsheets also takes far too much time away from developing new ideas and working on new projects, whereas in a data science tool, work can be saved and reused on future projects to reduce wasted time and boost efficiency. Siloed groups may do the same data preparation work over and over, without even knowing the work has already been done by someone else. Without that visibility, teams will continue to reproduce duplicate work. By contrast, if everything is in one place, those working with data can easily capitalize on others’ analysis and become profitable with AI at a faster clip.

Dataiku customer Rabobank perfectly demonstrates the urgency of doing everything in one place so simple spreadsheet analysis can be easily elevated to an AI project:

You might start with simple insight questions, but those insights might lead to new initiatives. For example, if you know certain customers have certain risks, you quickly get into predictive analytics. If you start with a BI tool, then you have to do all kinds of work to set up a new environment. Dataiku allows us to start with relatively simple insight questions and grow toward a more specific predictive question, developing a model all in the same tool. This mirrors our innovation funnel.

-Roel Dirks, Product Manager, Big Data Lab at Rabobank

Conclusion

By performing all data preparation tasks, from the most lightweight to the most advanced, in one place — preferably the same place where it will get taken to the next level with machine learning — teams will truly be able to perform enterprise-level data projects that contain large datasets, foster collaboration, and ensure data accuracy, all in a fraction of the time it takes in various spreadsheets. 

All of the challenges outlined in this blog really boil down to one key notion: Spreadsheets are not scalable for organizations dedicated to and invested in achieving Everyday AI. You should begin the move out of spreadsheets and into a collaborative data science tool today if you want to:

  • Avoid the high-impact risks that can stem from data security problems or costly spreadsheet errors.
  • Work more seamlessly with larger datasets, perform more complex analyses, and make cumbersome, unwieldy spreadsheets a thing of the past.
  • Do everything you can do in spreadsheets today and build on that to scale machine learning models, with visibility from data preparation to model deployment in one place.
