AI ethics: beyond not stealing data
Ethically acquiring data for training AI is only part of the picture. A company's 'data enrichment' practices matter too.
Most of the talk about AI ethics revolves around whether companies are legally and morally acquiring data created by humans to use for training their AI / ML tools. Although data sourcing is highly important, and most of the big AI companies aren’t doing it well, it’s not the only dimension of ethics that matters. A second ethical area that I want to share with you all today is data annotation and labeling (sometimes called ‘enrichment’).
Why AI Needs Human Data Labelers
Some machine learning or AI systems use methods which don’t require pre-assigned labels or metadata on the data. (These are called ‘unsupervised’ methods). Other methods require some or all data to have labels (‘supervised’ or ‘semi-supervised’).
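As a minimal sketch of that difference (using scikit-learn and made-up toy data, purely for illustration): an unsupervised method can work from the raw features alone, while a supervised method needs a human-assigned label for every training example.

```python
# Minimal sketch of supervised vs. unsupervised learning (toy data, illustration only).
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Pretend these are features extracted from six images.
features = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.85],
            [0.9, 0.1], [0.8, 0.2], [0.85, 0.15]]

# Unsupervised: no labels required -- the algorithm just groups similar items.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(features)

# Supervised: every training example needs a label that a human assigned.
labels = ["cat", "cat", "cat", "tree", "tree", "tree"]
model = LogisticRegression().fit(features, labels)
print(model.predict([[0.12, 0.88]]))  # -> ['cat']
```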
For supervised methods, all of the terabytes of content an AI company acquires need to be labeled before the data can be used to train a tool. And some models require the data to be annotated as well. For instance (simplified examples; a sketch of what one such annotation record might look like follows the list):
A picture of a fruit tree might need to be annotated to say that it’s a tree, what kind of fruit tree it is, the parts of the tree (trunk, branches, leaves, fruits), whether any insects or birds (or people!) are in the picture, and whether individual fruits on the tree are ripe.
An image of a ‘clowder’ of 6 cats might need to be annotated to say it shows 6 cats, to describe their colors or breeds, whether they’re wearing collars, if they’re kittens or adult cats, what they’re doing, what’s in their environment, and how they’re interacting with each other. (And the picture might need to be labeled to say whether it’s a drawing, sketch, painting, photograph, or AI-generated.)
An article about a study on software technology might need to be labeled with whether it’s academic or industrial, what type of study it was (e.g. observational), the authors’ affiliations, which software technologies it covers, whether the study results were positive or negative, key limitations, etc.
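To make that concrete, here is a hypothetical annotation record of the kind a human labeler might produce for the cat image above. The field names and structure are invented for illustration; real annotation schemas vary by company and tooling.

```python
# Hypothetical annotation record for the 'clowder of cats' image described above.
# Field names are invented for illustration; real schemas vary widely.
cat_image_annotation = {
    "media_type": "photograph",  # vs. drawing, sketch, painting, AI-generated...
    "count": 6,
    "subjects": [
        {"species": "cat", "age": "kitten", "color": "tabby", "collar": True},
        {"species": "cat", "age": "adult",  "color": "black", "collar": False},
        # ...one entry per cat, six in total
    ],
    "activity": "playing",
    "environment": "living room",
    "interactions": ["chasing", "grooming"],
}
```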
Some labeling has been automated. However, manual data labeling by humans remains a critical job in the AI / ML ecosystem.
In other cases, manual labeling by humans underlies ‘content moderation’ on software platforms that allow users to post content. These platforms generally use a combination of automated review and human oversight to determine whether reported or detected content is objectionable and should be blocked or removed.
Most of us are aware of this content moderation and have heard stories about users’ content being blocked when it shouldn’t be, or not being blocked when it should be. In AI / ML terms, those errors are called (see the toy example after this list):
“false positives” (content wrongly flagged as objectionable and blocked when it shouldn’t be), and
“false negatives” (objectionable content that should have been blocked but wasn’t flagged)
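Here’s a toy illustration (made-up data, not from any real moderation system) of how those two error types are counted when an automated moderator’s decisions are compared against human ‘ground truth’ labels:

```python
# Toy comparison of human ground-truth labels vs. an automated moderator's
# decisions. All data is made up, purely to illustrate the two error types.
ground_truth = [True, False, True, False, False, True]   # True = genuinely objectionable
model_blocks = [True, True,  False, False, False, True]  # True = model blocked it

false_positives = sum(1 for truth, blocked in zip(ground_truth, model_blocks)
                      if blocked and not truth)   # blocked, but shouldn't have been
false_negatives = sum(1 for truth, blocked in zip(ground_truth, model_blocks)
                      if truth and not blocked)   # should have been blocked, but wasn't

print(false_positives, false_negatives)  # -> 1 1
```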
What might not be obvious: for those automated content reviews to work well, lots of humans have to label LOTS of content - including horrific content that should be blocked or removed - to classify it as objectionable or not. That means humans who have data labeling jobs in this part of the AI ecosystem have to spend hours looking at some truly awful content, the worst that other humans (or AI tools) can generate.
Automated and human data labeling can also contribute to biases in datasets and in AI-based tools. Let’s save that ethics topic for another day, though, and focus on the workers.
Ethics of Data Labeling Work
Companies can be unethical about employing data labelers in (at least) two big ways:
Exploitative underpayment, overworking, and general mistreatment of workers
Requiring workers to review and label volumes of horrific, violent, or degrading content, without regard for the emotional health and well-being of the workers
Sadly, many big AI companies - and the ‘suppliers’ they use for hiring data labeling workers - are not behaving ethically in one or both regards. And news has been emerging on the human harms that this unethical behavior is causing.
Current State
This is not a new problem: back in May 2020, Facebook settled a lawsuit brought by some of its content moderators who suffered from PTSD; the plaintiffs worked in 4 US states (California, Arizona, Texas, and Florida).
Since then, many AI companies have moved their ‘data enrichment’ operations out of the USA. Similar reports have emerged from Kenya, the Philippines, and India.
Two months ago, 60 Minutes did a deep dive into this issue in Kenya; their documentary “Human Loop: Training AI takes heavy toll on Kenyans working for $2 an hour” is on YouTube:
This article from mania.africa, posted on Mastodon on Feb. 1, reminded me of the 60 Minutes story and the need for wider awareness of this aspect of AI ethics:
“I think something that we may have overlooked is how the AI companies may take advantage of countries with lax or underdeveloped regulation (on AI and digital work) and target data workers from these countries knowing they have no recourse. I have done a few tasks on Remotasks in the past and it's preposterous to see payments as low as $0.0001 for annotating an image. It's quite clear that the companies behind such work, AI or otherwise, are clearly taking advantage of the workers, especially since these workers are from countries where they are already marginalized and lacking in opportunity. For instance, a majority of Kenya's university-educated youth will often lack jobs, and with over a million graduates entering the market annually, the problem only gets worse. Consequently, the majority of tech savvy graduates will go for online work, only to be discriminated against and treated unfairly as compared to workers from other well-developed countries.”
“AI’s Hidden Human Cost: The Struggle of Kenya’s Data Workforce”, by David Mania, 2025-02-01
To the best of my knowledge, this data enrichment aspect of ethics is not considered in any of the emerging ‘ethical AI’ certifications:
Fairly Trained (fairlytrained.org) - Their ‘L’ certification is an audit-based approach to verifying ethical data sourcing (licensing). And in the past 12 months, 19 companies have earned Fairly Trained certification. (We reported on some of them here and here last year.) However, this Fairly Trained certification does not address data enrichment.
Voice-Swap / BMAT - This certification is a technical approach to verifying ethical data sourcing for music. However, it does not address data enrichment. (Also, as of this writing, no companies are yet known to have completed the certification. I’ve messaged Voice-Swap to confirm this.)
ISO 42001 - Anthropic reported earning ISO 42001 certification for responsible AI for their Claude LLM on Jan. 13, 2025. However, from a quick search of Anthropic’s Trust Center, there’s no indication that responsible data labeling or annotation practices are covered by ISO 42001.
Partnership on AI - This initiative has developed guidelines that cover paying market wages and treating workers fairly. However, the guidelines do not appear to address the emotional and mental health impact on workers, and the site mentions only one company that has adopted their ethical data enrichment guidelines: Google DeepMind.
It’s worth noting that the EU AI Act talks about “labeling”, but only in the context of labeling AI-generated content as such. Data enrichment is briefly mentioned in Article 10 (Data and Data Governance), in the context of high-risk systems. However, the Act does not appear to address the ethics of enriching the data used as input for AI-based tools.
If any of you know of ways that the data enrichment aspect of ethics IS reflected in these or other standards, or other companies who have earned relevant certifications, please share so we can all learn!
What’s Next?
I’ll be on the lookout for new information about ethics around data enrichment and whether the large AI companies are taking steps to treat data labelers better. And of course I’ll share whatever I find out. (Subscribe for free to not miss the follow-ups.)
My ask: please check out the article and the video. Then, when you’re choosing an AI-based tool to use (especially if you pay for it), try to check not just where their data came from, but how they got it enriched for use with AI. If they don’t operate fairly, or they aren’t transparent about how they operate, ask yourself:
Is there another tool I could use, instead of funding and supporting an AI company that (definitely OR probably) exploits and mistreats other humans so badly?
I know it’s easy to feel powerless as an individual in the face of the huge corporate juggernauts who dominate the AI tool market. But we do have power collectively. Individual snowflakes have almost no impact before they melt, but a lot of snow makes an avalanche (or great skiing, if you’re a winter sports fan)! We all can & should use our power for good - one tool choice at a time, one payment at a time. It will add up.
References
See this article for more about false positives and false negatives:
Some articles I found while searching for standards relevant to data enrichment:
“Improving Conditions for Data Enrichment Workers: Resources for AI Practitioners”, Responsible Sourcing Library - Partnership on AI (undated, but likely mid-2022)
“Content Moderation Is Terrible by Design: A conversation about how to fix the front lines of the internet.” by Sarah T. Roberts (UCLA) / Harvard Business Review (editor: Thomas Stackpole), 2022-11-09
“Ethical AI Data Annotation and Improved Worker Conditions”, Julie Trinkvel / Deepomatic blog, 2023-05-03
“Artificial intelligence (AI) is revolutionizing our world, but to ensure its ethical and responsible use, proper regulation is essential.”, by Aïcha / Innovatiana, 2023-06-23
“How We Define Harm Impacts Data Annotations: Explaining How Annotators Distinguish Hateful, Offensive, and Toxic Comments”, Angela Schöpke-Gonzalez, Siqi Wu, Sagar Kumar, Paul J. Resnick, Libby Hemphill, 2023-09
“Exploring the Complex Ethical Challenges of Data Annotation”, by Beth Jensen / Stanford Institute for Human-Centered Artificial Intelligence, 2024-07-10
“Looking out for the human in AI & Data Annotation”, by Vipul Kapoor / Mindkosh, 2024-09-09
“Data Annotation's Role in Shaping Ethical AI Governance Post-AGI”, by Ayush Parti / Pareto.AI, 2024-11-11
“Analysis: Meta’s content moderation was a failed experiment”, Reed Albergotti / Semafor, 2025-01-08 (see footnote in the article: Meta is stopping fact-checking, but is continuing content moderation)