Taking PRIDE in Responsible AI via Data Collection & Analysis

Dataiku Product, Scaling AI Triveni Gandhi

Data collection is not new to the enterprise and serves as the foundation for all analytics across organizations. However, collecting information about someone’s gender, race, religion, or sexual orientation has a troubled history around the world. Throughout time, demographic data have been used to manipulate public opinion, track minority populations, and perpetuate systems of violence against minority people. Given this long history of misuse, many countries (particularly in Europe) actively legislate against collecting certain demographic data and require transparent reporting processes in other cases. With that context in mind, asking people to reveal private information can be complicated and fraught with valid concerns. Many ask, “Why do you need this data? What purpose does sharing this data serve?” This is a good place to start when it comes to the topic of LGBTQ+ data collection.

Depending on the cultural and legislative expectations around data collection and analysis, some organizations may choose to use quantitative analysis to measure the impact of policies. To carry out quantitative analysis, we do need data, but collecting data on someone’s sexual orientation or gender identity is not straightforward, given the sensitive nature of this information and the potential for misuse, exposure of private information, or broader misrepresentation.

This PRIDE month at Dataiku, we are especially focused on the importance of good data collection and privacy practices, not only because they are an important part of our RAFT Framework for Responsible AI, but also because collecting data on vulnerable populations should put the safety and security of people first while abiding by local contexts and regulations. This raises two questions: Should organizations collect demographic data on LGBTQ+ employees or customers and, if so, how can they be intentional and responsible about how that information is stored and used?


Setting the Context

When analyzing employee satisfaction or customer responsiveness to products and services, sexual orientation and gender identity can provide unique and intersectional insights into how organizations are performing. By breaking down analyses across different subgroups, we can find areas of improvement (such as reframing parental leave policies to be gender neutral) or refine products to be more inclusive of diverse needs and uses. The analyses can improve outcomes and ensure continued progress toward goals that align with the core tenets of PRIDE: to create more diverse, inclusive, and safe places for people from all backgrounds.

Best Practices in LGBTQ+ Data Collection

Ethical Intentions 

Without data collection, it is difficult to uncover trends and patterns and equally difficult to measure whether our actions have consistent impact. However, any data collection project should start with the intention of actively helping the people about whom data is being collected. Without clearly defined, ethical intentions at the beginning of a project, no LGBTQ+ demographic data should be collected (even if local regulation allows it).

Additionally, any decisions taken as a result of this data collection or analysis should be carefully reviewed for potential harmful consequences before implementation. To do so, we recommend organizations start with pre-build risk assessments and clearly document the intentions of collecting data and subsequent analysis. The Govern Node in Dataiku acts as a central watchtower where business leaders and data experts can work together to clearly define their expectations and goals of any new analysis and make sure the project stays aligned to ethical intentions as it is built and deployed. 

Informed Consent

Tracking and gathering information about users and employees is a common practice for organizations, but it is unsafe (and in some cases illegal) to do so without communicating transparently that this practice is occurring. Moreover, consent to this collection is important: people should be asked to opt in to data collection in a judgment-free way, and they should feel comfortable declining to provide any information they do not wish to share. While this may mean fewer data points or more missing values within a dataset, the cost of gathering sensitive information against someone’s will or without their knowledge is far higher.

Anonymization

Even if a person feels comfortable sharing their sexual orientation or other demographics for the purposes of analysis, anonymizing data is a good method for ensuring the privacy and protection of information. This anonymization can be as simple as not asking for identifying information such as someone’s name and personal details, or can be done after collection using anonymization tools like those in the Dataiku Prepare recipe. 
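As a rough illustration of the post-collection approach, the sketch below removes directly identifying fields from survey responses before analysis. It is a minimal example in plain Python, not the Dataiku Prepare recipe; the field names, the `strip_identifiers` helper, and the sample responses are hypothetical and made up for illustration. Dropping direct identifiers alone may not be enough when the remaining fields can be combined to single someone out.

```python
# Hypothetical field names, chosen only for illustration.
DIRECT_IDENTIFIERS = {"name", "email", "employee_id"}


def strip_identifiers(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records without directly identifying fields."""
    return [
        {field: value for field, value in record.items() if field not in identifiers}
        for record in records
    ]


# Made-up survey responses used only to demonstrate the helper.
responses = [
    {"name": "A. Example", "email": "a@example.com", "employee_id": 101,
     "department": "Sales", "satisfaction": 4},
    {"name": "B. Example", "email": "b@example.com", "employee_id": 102,
     "department": "Support", "satisfaction": 5},
]

# The original records are left untouched; analysis uses the stripped copies.
anonymized = strip_identifiers(responses)
```

Working on stripped copies keeps identifying fields out of downstream datasets while leaving the decision about retaining or deleting the raw responses to a separate, governed step.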

Privacy and Security

Without a doubt, the most important practice in sensitive data collection is restricting access to the data and their analyses to relevant and aligned stakeholders. Through a combination of database privileges, controls over the type of analysis that is allowed on a set of data, and governance of use cases, organizations can be more confident that data is safe and used in an ethical manner.
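One possible form of control over the type of analysis allowed is to expose only aggregated results and to suppress any subgroup that is too small to report safely. The sketch below is an assumption-laden example in plain Python, not a Dataiku feature: the `group_averages` helper and the threshold of 5 are hypothetical, and an appropriate minimum group size depends on the organization and local regulation.

```python
from collections import defaultdict

# Assumed threshold for illustration; the right value is context dependent.
MIN_GROUP_SIZE = 5


def group_averages(rows, group_key, value_key, min_group_size=MIN_GROUP_SIZE):
    """Average value_key per group, returning None for groups that are too small."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[group_key]].append(row[value_key])
    return {
        group: (sum(values) / len(values) if len(values) >= min_group_size else None)
        for group, values in buckets.items()
    }
```

Analysts calling a helper like this see subgroup averages only, and small groups come back as `None` rather than as values that could point to a handful of individuals.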

In short, collecting data to support LGBTQ+ initiatives can be quite beneficial, but requires thoughtful consideration of privacy concerns and specific contexts. Organizations should always prioritize the safety and security of their users’ and employees’ data, even if that means not collecting that information to begin with. Dataiku, for example, supports this type of work because:

  • It provides a single, central AI platform for all data work, from advanced analytics to Generative AI.
  • It supports the multiplicity of regulations and cultural contexts, especially for organizations that operate in many different countries.
  • It enables tech stack optionality and long-term transformation, all while ensuring oversight, guardrails, and regulatory compliance across borders.
  • It creates transparency via automated model documentation, row-level prediction explanations, centralized audit logs, and more, providing a traceable path to Everyday AI.
