When it comes to data on startups, Startup Genome is the gold standard — their yearly report on the startup ecosystem is well-respected (not to mention well-cited). But how are they able to find and make sense of data in order to produce it?
We spoke to Munish Malhotra, Director of Analytics and Data Science at Startup Genome, to learn more. Here’s what we found out about their challenges (hint: they might sound familiar…).
They Have to Fill in the Blanks
The nature of the business is that structured, readily available datasets don’t necessarily exist (in fact, this is the case at least 20 percent of the time for Startup Genome). And when they do find data, it’s often incomplete. So just like other enterprises making their way in the age of AI, data quality is an issue (our survey of more than 50 CDOs showed that it is one of the top data issues worldwide).
If data doesn't exist, the Startup Genome team starts from a blank slate.
That means they have to put data through a set of business rules in order to fill in the missing information. For example, the first step might be to manually hunt for any missing data, and the second might be to create a standardized estimate for whatever is still missing.
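To make the two-step idea concrete, here is a minimal sketch in Python. It is not Startup Genome's actual pipeline: the function name `fill_missing`, the `employees` field, the `manual_lookup` table, and the choice of the median as the "standard estimation" are all assumptions made for illustration, and the sample records are invented.

```python
from statistics import median

def fill_missing(records, field, manual_lookup):
    """Fill missing `field` values in two passes (hypothetical business rules).

    Pass 1: use values found by manual research (manual_lookup, keyed by name).
    Pass 2: estimate whatever is still missing with the median of known values.
    Each returned record carries a `source` tag so estimates stay distinguishable.
    """
    filled = []
    for rec in records:
        rec = dict(rec)  # copy so the caller's data is not modified
        if rec.get(field) is not None:
            rec["source"] = "reported"
        elif rec["name"] in manual_lookup:
            rec[field] = manual_lookup[rec["name"]]
            rec["source"] = "manual"
        else:
            rec["source"] = None  # still missing after pass 1
        filled.append(rec)

    # The median is an assumed estimator; any documented rule could go here.
    known = [r[field] for r in filled if r.get(field) is not None]
    estimate = median(known) if known else None
    for rec in filled:
        if rec.get(field) is None and estimate is not None:
            rec[field] = estimate
            rec["source"] = "estimated"
    return filled

# Invented sample records, for illustration only.
startups = [
    {"name": "A", "employees": 10},
    {"name": "B", "employees": None},
    {"name": "C", "employees": None},
    {"name": "D", "employees": 30},
]
result = fill_missing(startups, "employees", manual_lookup={"B": 20})
for row in result:
    print(row)
```

Tagging each value with where it came from means that anyone analyzing the data later can separate reported figures from estimated ones.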
They Must Always Consider the Context
When doing data analysis, the team at Startup Genome has to minimize bias and consider the context of their data in order to truly draw meaning from it. For example, if there is data on how many engineers are graduating in a region, they need to determine how it is relevant and whether there is a correlation between that data and startups.
Maybe in some cities, there are lots of graduates, but not a lot of startups. In context, however, that doesn’t necessarily mean there is no correlation: those cities may simply not provide the resources and support that recent graduates need to work in startups.
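A small sketch in Python shows how context can change the picture. The numbers below are invented for illustration and are not Startup Genome data, and the `has_support` flag is a hypothetical stand-in for "the city provides resources and support for graduates". The sketch computes a Pearson correlation between graduates and startups across all cities, then again within each group.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented cities: (engineering graduates, startups, has_support)
cities = [
    (1000, 50, True),
    (2000, 100, True),
    (3000, 150, True),
    (4000, 10, False),
    (5000, 12, False),
    (6000, 15, False),
]

def correlation(rows):
    """Correlation between graduates and startups for the given rows."""
    return pearson([r[0] for r in rows], [r[1] for r in rows])

overall = correlation(cities)
with_support = correlation([c for c in cities if c[2]])
without_support = correlation([c for c in cities if not c[2]])

print(f"all cities:      {overall:+.2f}")
print(f"with support:    {with_support:+.2f}")
print(f"without support: {without_support:+.2f}")
```

In this toy dataset the pooled correlation comes out negative, while within each group more graduates go with more startups. Splitting on the contextual factor is what reveals the relationship.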