As organizations strive to collect and capitalize on ever-increasing amounts of data, data fitness and governance have become more critical than ever. You may have heard the term "data stewardship" used in this context.
In a nutshell, data stewardship is the management and oversight of an organization's data assets to provide business users with high-quality data that is not only easily accessible in a consistent manner, but also compliant with policy and regulatory obligations. This task is usually a joint effort among IT, line-of-business data owners, and the central data office, if one exists.
Today, we see more and more organizations pursuing data science and AI initiatives to improve efficiency, reduce risk, remain relevant in a fast-changing technological environment, and drive competitive advantage. As they embed machine learning and AI methodologies into the core fabric of how they do business and involve more types of professionals in these initiatives (a state that we at Dataiku refer to as Enterprise AI), it's time to go beyond data stewardship alone and think about data science stewardship. But what exactly does this involve?
Moving From Data Stewardship to Data Science Stewardship
Let's break down the stewardship concept further to identify some key principles, and then consider how they apply to data science. In general, stewardship is the careful and responsible management of something entrusted to one's care: a large estate, the resources of a community, or the affairs of a group of people with a shared purpose.
Good stewards tend to:
- Act in service of an ideal larger than themselves.
- Believe in sustainability.
- Embrace innovation and change.
- Practice inclusiveness.
- Be team players who are quick to give others credit.
A thoughtfully planned AI strategy, most likely led by a chief analytics officer or chief data officer, will exhibit similar characteristics, which may include:
- Pursuing AI not just for the sake of checking the box on a cool-sounding trend, but as a way to combine the human capacities for learning, perception, and interaction at a level of complexity that ultimately surpasses our own abilities, and makes lives better.
- Embracing innovation and change. As the ancient saying goes, "change is the only constant." AI and machine learning techniques are advancing so rapidly that half the battle is simply keeping up with the latest methods and assessing which can or should be implemented. In-house research scientists and DevOps specialists can be a valuable investment toward embracing innovation and change, all while maintaining a laser focus on sustainability.
In the context of data science, sustainability means ensuring the continued reliability of AI-augmented processes into the future. This includes model maintenance to account for changing conditions (a simple maintenance check is sketched after this list), as well as processes that promote reusability and avoid reinventing the wheel for each new use case. Sustainability also calls for a flexible framework that allows the latest technologies to be incorporated without a complete tear-down of existing work. AI systems must be able to adapt rapidly to change in order to stay relevant and valuable.
- Finally, inclusiveness. Fostering collaboration among people with varying job profiles, perspectives, and technical skills is the key to unlocking enterprise-scale AI. Data scientists are a rare breed and make up only a tiny fraction of overall employees. By also bringing into data projects the subject matter experts and business analysts who understand both business objectives and market dynamics, organizations can execute and operationalize AI projects in a more responsible and democratized way.
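To make the sustainability point concrete, here is a minimal sketch of one common maintenance check: comparing a feature's distribution in recent production data against the training baseline to flag drift. The function name, the 0.2 alert threshold, and the choice of the Population Stability Index (PSI) are illustrative assumptions, not a prescribed workflow; teams may prefer other drift metrics or managed monitoring tools.

```python
import numpy as np

def population_stability_index(baseline: np.ndarray, recent: np.ndarray, bins: int = 10) -> float:
    """Compare two samples of a numeric feature and return the PSI.

    Higher values mean the recent distribution has drifted further from
    the baseline the model was trained on.
    """
    # Derive bin edges from the baseline so both samples are discretized the same way.
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_counts, _ = np.histogram(baseline, bins=edges)
    recent_counts, _ = np.histogram(recent, bins=edges)

    # Convert counts to proportions, with a small floor to avoid log(0).
    base_pct = np.clip(base_counts / base_counts.sum(), 1e-6, None)
    recent_pct = np.clip(recent_counts / recent_counts.sum(), 1e-6, None)

    return float(np.sum((recent_pct - base_pct) * np.log(recent_pct / base_pct)))


if __name__ == "__main__":
    rng = np.random.default_rng(seed=0)
    training_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)
    production_feature = rng.normal(loc=0.4, scale=1.2, size=2_000)  # deliberately drifted

    psi = population_stability_index(training_feature, production_feature)
    # A common rule of thumb: PSI above roughly 0.2 warrants review or retraining.
    if psi > 0.2:
        print(f"PSI = {psi:.3f}: significant drift, review or retrain the model")
    else:
        print(f"PSI = {psi:.3f}: distribution looks stable")
```

Scheduling a check like this alongside each scoring run is one lightweight way to keep AI-augmented processes reliable as conditions change, rather than discovering degradation only after business users notice.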
Although an organization's AI strategy is formed from the top down, each person who works with data or uses it to make decisions shares accountability for being a good data science steward. Modularizing and documenting workflows or code, working on and storing projects in a centralized, shared platform rather than on a local hard drive, and publishing data products and assets for others to leverage are just a few best practices.
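As one small illustration of what "modularizing and documenting" can look like in practice, here is a sketch of a data preparation step packaged as a documented, single-purpose function rather than an inline script. The column names and cleaning rules are hypothetical; the point is that a unit like this can be versioned in a shared project or library and reused across teams instead of being rewritten in every notebook.

```python
import pandas as pd

def prepare_customer_orders(raw: pd.DataFrame) -> pd.DataFrame:
    """Standardize a raw orders extract for downstream analytics.

    A documented, single-purpose step like this can live in a shared
    repository or project library instead of a one-off notebook cell,
    so other teams can reuse it rather than duplicating the logic.
    """
    orders = raw.copy()

    # Normalize column names so downstream code has one convention to rely on.
    orders.columns = [c.strip().lower().replace(" ", "_") for c in orders.columns]

    # Parse dates and drop rows that cannot be interpreted.
    orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
    orders = orders.dropna(subset=["order_date"])

    # Remove obvious data-entry errors before anyone builds models on them.
    orders = orders[orders["order_amount"] > 0]

    return orders


if __name__ == "__main__":
    raw = pd.DataFrame(
        {
            "Order Date": ["2024-01-05", "not a date", "2024-02-11"],
            "Order Amount": [120.0, 35.5, -10.0],
        }
    )
    print(prepare_customer_orders(raw))
```

Publishing the cleaned output as a shared dataset, with the function's docstring serving as its documentation, is exactly the kind of everyday habit that turns individual contributors into good data science stewards.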