Taking PRIDE in Responsible AI via Data Collection & Analysis

Data collection is not new to the enterprise and serves as the foundation for all analytics across organizations. However, collecting information about someone’s gender, race, religion, or sexual orientation has a troubled history around the world. Throughout time, demographic data have been used to manipulate public opinion, track minority populations, and perpetuate systems of violence against minority people. Given this long history of misuse, many countries (particularly in Europe) actively legislate against collecting certain demographic data and require transparent reporting processes in other cases. With that context in mind, asking people to reveal private information can be complicated and fraught with valid concerns. Many ask, “Why do you need this data? What purpose does sharing this data serve?” These questions are a good place to start when it comes to the topic of LGBTQ+ data collection.

Depending on the cultural and legislative expectations around data collection and analysis, some organizations may choose to use quantitative analysis to measure the impact of policies. Quantitative analysis does require data, but collecting data on someone’s sexual orientation or gender identity is not straightforward, given the sensitive nature of this information and the potential for misuse, exposure of private information, or broader misrepresentation.

This PRIDE month at Dataiku, we are especially focused on the importance of good data collection and privacy practices, not only because they are an important part of our RAFT Framework for Responsible AI, but because collecting data on vulnerable populations should put the safety and security of people first while abiding by local contexts and regulations. This raises two questions: Should organizations collect demographic data on LGBTQ+ employees or customers and, if so, how can they be intentional and responsible about how that information is stored and used?

Setting the Context

When analyzing employee satisfaction or customer responsiveness to products and services, sexual orientation and gender identity can provide unique and intersectional insights into how organizations are performing. By breaking down analyses across different subgroups, we can find areas of improvement (such as reframing parental leave policies to be gender neutral) or refine products to be more inclusive of diverse needs and uses. The analyses can improve outcomes and ensure continued progress toward goals that align with the core tenets of PRIDE: to create more diverse, inclusive, and safe places for people from all backgrounds.

Best Practices in LGBTQ+ Data Collection

Ethical Intentions

Without data collection, it is difficult to uncover trends and patterns, and equally difficult to measure whether our actions have a consistent impact. That benefit only holds, however, when a data collection project starts with the intention of actively helping the people whose data is being collected. Without clearly defined, ethical intentions at the beginning of a project, no LGBTQ+ demographic data should be collected (even if local regulation allows it).

Additionally, any decisions taken as a result of this data collection or analysis should be carefully reviewed for potential harmful consequences before implementation. To do so, we recommend organizations start with pre-build risk assessments and clearly document the intentions of collecting data and subsequent analysis. The Govern Node in Dataiku acts as a central watchtower where business leaders and data experts can work together to clearly define their expectations and goals of any new analysis and make sure the project stays aligned to ethical intentions as it is built and deployed.

Informed Consent

Tracking and gathering information about users and employees is a common practice for organizations, but it is unsafe (and in some cases illegal) to do so without transparently communicating that this practice is occurring. Consent matters just as much: people should be asked to opt in to data collection in a judgment-free way, and they should feel comfortable declining to provide any information they do not wish to share. While this may mean fewer data points or more missing values within a dataset, the cost of gathering sensitive information against someone’s will or knowledge is far higher.
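
One way to picture opt-in handling in practice is the minimal Python sketch below. It is an illustration only, not a Dataiku feature: the field names (`gender_identity`, `sexual_orientation`, `consented_fields`) and the function `apply_consent` are hypothetical. The idea is that a sensitive field is kept only when the respondent explicitly opted in, and is otherwise stored as a missing value rather than inferred.

```python
# Hypothetical field names, used only for illustration.
SENSITIVE_FIELDS = ("gender_identity", "sexual_orientation")


def apply_consent(responses, sensitive_fields=SENSITIVE_FIELDS):
    """Keep a sensitive field only when the respondent explicitly opted in."""
    result = []
    for response in responses:
        # Fields the respondent agreed to share; absent means no consent.
        consented = set(response.get("consented_fields", ()))
        row = {}
        for key, value in response.items():
            if key == "consented_fields":
                continue  # consent metadata is not part of the analysis row
            if key in sensitive_fields and key not in consented:
                row[key] = None  # recorded as missing, never guessed
            else:
                row[key] = value
        result.append(row)
    return result


if __name__ == "__main__":
    survey = [
        {"satisfaction": 4, "sexual_orientation": "bisexual",
         "consented_fields": ["sexual_orientation"]},
        {"satisfaction": 5, "sexual_orientation": "gay"},  # no opt-in given
    ]
    for row in apply_consent(survey):
        print(row)
```

Treating a non-answer as a missing value keeps the rest of the response usable for analysis while respecting the choice not to share.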

Anonymization

Even if a person feels comfortable sharing their sexual orientation or other demographics for the purposes of analysis, anonymizing data is a good method for ensuring the privacy and protection of information. This anonymization can be as simple as not asking for identifying information such as someone’s name and personal details, or can be done after collection using anonymization tools like those in the Dataiku Prepare recipe.
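
As a rough illustration of post-collection anonymization, the Python sketch below drops direct identifiers and replaces an internal ID with a salted hash. It is a minimal sketch under stated assumptions, not how the Dataiku Prepare recipe works: the column names (`name`, `email`, `employee_id`) and the function `pseudonymize` are hypothetical. Note that hashing an ID is pseudonymization rather than full anonymization, since remaining attributes can still identify someone in a small group.

```python
import hashlib
import secrets

# Hypothetical column names, chosen only for illustration.
DIRECT_IDENTIFIERS = {"name", "email"}


def pseudonymize(records, id_field="employee_id", salt=None):
    """Drop direct identifiers and replace the ID with a salted SHA-256 hash."""
    # A random salt, kept secret or discarded, makes the hash hard to reverse
    # by simply hashing every known ID.
    salt = salt if salt is not None else secrets.token_bytes(16)
    cleaned = []
    for record in records:
        # Copy the record without the direct identifiers.
        row = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        raw_id = str(row.pop(id_field)).encode("utf-8")
        row["respondent_key"] = hashlib.sha256(salt + raw_id).hexdigest()
        cleaned.append(row)
    return cleaned


if __name__ == "__main__":
    raw = [{"employee_id": 17, "name": "Sam", "email": "sam@example.com",
            "sexual_orientation": "queer", "satisfaction": 4}]
    print(pseudonymize(raw))
```

Using the same salt across a dataset keeps one respondent's rows linkable to each other for analysis; discarding the salt afterwards removes the practical means of mapping keys back to IDs.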

Privacy and Security

Without a doubt, the most important practice in sensitive data collection is restricting access to the data and their analyses to relevant and aligned stakeholders. Through a combination of database privileges, controls over the type of analysis that is allowed on a set of data, and governance of use cases, organizations can be more confident that data is safe and used in an ethical manner.

In short, collecting data to support LGBTQ+ initiatives can be quite beneficial, but requires thoughtful consideration of privacy concerns and specific contexts. Organizations should always prioritize the safety and security of their users’ and employees’ data, even if that means not collecting that information to begin with. Dataiku, for example, supports this type of work because:

It provides a single, central AI platform for all data work, from advanced analytics to Generative AI.
It supports the multiplicity of regulations and cultural contexts, especially for organizations that operate in many different countries.
It enables tech stack optionality and long-term transformation, all while ensuring oversight, guardrails, and regulatory compliance across borders.
It creates transparency via automated model documentation, row-level prediction explanations, centralized audit logs, and more, providing a traceable path to Everyday AI.
