When AI Goes Astray: High-Profile Machine Learning Mishaps in the Real World


The transformative potential of Artificial Intelligence (AI) and machine learning has often made headlines in the news, with plenty of reports on its positive impact in diverse fields ranging from healthcare to finance.

Yet, no technology is immune to missteps. While the success stories paint a picture of machine learning's wonderful capabilities, it is equally crucial to highlight its pitfalls to understand the full spectrum of its impact.

In this article, we explore numerous high-profile machine learning blunders so that we can draw lessons for more informed implementations in the future.

Contents

In particular, we will look at a noteworthy case from each of the following categories:

(1) Classic Machine Learning
(2) Computer Vision
(3) Forecasting
(4) Image Generation
(5) Natural Language Processing
(6) Recommendation Systems

A comprehensive compilation of high-profile machine learning mishaps can be found in the following GitHub repo called Failed-ML:

GitHub – kennethleungty/Failed-ML: Compilation of high-profile real-world examples


(1) Classic Machine Learning

Headline

Amazon AI recruitment system: Amazon's AI-powered automated recruitment system was canceled after evidence of discrimination against female candidates.

Details

Amazon developed an AI-powered recruitment tool to identify top candidates by learning from a decade's worth of submitted resumes. However, because those resumes came overwhelmingly from men, reflecting the male-dominated tech industry, the system learned to penalize female applicants.

For instance, it started downgrading resumes containing the word "women's" or those from graduates of two women-only colleges, while favoring certain terms (e.g., "executed") that appeared more frequently in male resumes.

Amazon attempted to rectify these biases but faced challenges in eliminating the discriminatory tendencies. Consequently, they discontinued the project in early 2017, emphasizing that the system was never utilized for actual candidate evaluations.

Key Lesson

Bias in training data can be perpetuated in machine learning models and lead to unintended and discriminatory outcomes in AI systems, emphasizing the importance of diverse and representative datasets.
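
To make this lesson concrete, here is a minimal sketch of the kind of bias audit it points to: comparing a screening model's selection rates across groups on a held-out evaluation set. The data, column names, and the four-fifths-rule threshold are purely illustrative and have nothing to do with Amazon's actual system.

```python
import pandas as pd

# Hypothetical audit: compare a screening model's selection rates by gender
# on a held-out evaluation set. Column names and data are illustrative only.
def selection_rate_by_group(df: pd.DataFrame, group_col: str, selected_col: str) -> pd.Series:
    """Fraction of candidates the model selects, per group."""
    return df.groupby(group_col)[selected_col].mean()

def disparate_impact_ratio(rates: pd.Series) -> float:
    """Ratio of the lowest to the highest group selection rate (four-fifths rule check)."""
    return rates.min() / rates.max()

# Toy data standing in for model decisions on a labelled evaluation set
eval_df = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M", "M", "M"],
    "selected": [0, 1, 0, 1, 1, 0, 1, 1],
})

rates = selection_rate_by_group(eval_df, "gender", "selected")
print(rates)
print(f"Disparate impact ratio: {disparate_impact_ratio(rates):.2f}")
# Ratios well below 0.8 are a common signal that the model warrants investigation.
```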

Reference

Amazon Reportedly Killed an AI Recruitment System


(2) Computer Vision

Headline

Google's AI for diabetic retinopathy detection: The retina-scanning tool fared far worse in real-life clinics than in controlled experiments, with scans rejected due to poor image quality and delays caused by intermittent internet connectivity when uploading images to the cloud.

Details

Google Health outfitted 11 clinics across Thailand with a deep-learning system trained to spot signs of eye disease in diabetic patients. During in-lab experiments, the system could identify signs of diabetic retinopathy from an eye scan with more than 90% accuracy and give a result in less than 10 minutes.

However, when deployed on the ground, the AI system often failed to give a result due to poor image quality, rejecting more than a fifth of the images.

Nurses also felt frustrated when they believed the rejected scans showed no signs of disease or when they had to retake and edit an image that the AI system had rejected. Poor internet connections in several clinics also caused delays in uploading images to the cloud for processing.

Key Lesson

It is important to tailor AI tools to the specific conditions and constraints of real-world environments, including factors like image quality and offline availability.
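
One way to act on this lesson is to check scan quality on-device before anything is uploaded. The sketch below uses a simple Laplacian-variance sharpness check as a stand-in for a real quality gate; the threshold and file name are hypothetical, and this is not Google's actual pipeline.

```python
import cv2

# Hypothetical pre-upload quality gate: flag blurry scans locally before
# sending them to a cloud model. The threshold is illustrative and would
# need tuning against real scans.
BLUR_THRESHOLD = 100.0

def is_scan_too_blurry(image_path: str, threshold: float = BLUR_THRESHOLD) -> bool:
    """Return True if the image's Laplacian variance (a simple sharpness proxy) is below threshold."""
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    sharpness = cv2.Laplacian(image, cv2.CV_64F).var()
    return sharpness < threshold

# Usage: prompt for a retake on-device instead of discovering the problem
# only after a slow upload to the cloud.
if is_scan_too_blurry("retina_scan.png"):
    print("Scan looks too blurry - please retake before uploading.")
```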

Reference

Google's medical AI was super accurate in a lab. Real life was a different story.


(3) Forecasting

Headline

Zillow instant-buying (iBuying) algorithms: Zillow suffered significant losses in its home-flipping business because its machine learning models for property valuation systematically overestimated prices.

Details

Real estate companies like Zillow have been using the iBuying business model, where they purchase homes directly from sellers and then re-list them after doing minor work.

One of the first steps in Zillow's decision to purchase any home is the "Zestimate" – a machine-learning-assisted estimate of a home's future market value, produced by a model trained on millions of home valuations and drawing on data such as a home's condition and ZIP code.

However, Zillow's system misjudged future home values, leading them to make numerous above-market offers to homeowners, particularly during the real estate volatility caused by the COVID-19 pandemic.

This misjudgment eventually resulted in the shutdown of Zillow's instant-buying operation and a projected loss of $380 million.

Key Lesson

Continuous model monitoring, evaluation, and retraining are critical so that data drift from new events (resulting in changes in test data distribution) is captured in the models for up-to-date predictions.
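
As a rough illustration of what continuous drift monitoring can look like, the sketch below computes a Population Stability Index (PSI) between a feature's distribution at training time and in recent data. The feature, numbers, and alert threshold are hypothetical and unrelated to Zillow's real models.

```python
import numpy as np

# Hypothetical drift check: PSI between the feature distribution seen at
# training time and the one seen in recent data. Larger values mean more drift.
def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid division by zero and log(0)
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

rng = np.random.default_rng(42)
training_prices = rng.normal(200, 30, 10_000)   # illustrative $/sqft at training time
recent_prices = rng.normal(230, 45, 2_000)      # illustrative $/sqft during a volatile market

psi = population_stability_index(training_prices, recent_prices)
print(f"PSI: {psi:.3f}")
# A common rule of thumb treats PSI above 0.25 as significant drift worth retraining for.
```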

Reference

Why Zillow Couldn't Make Algorithmic House Pricing Work


(4) Image Generation

Headline

Stable Diffusion (text-to-image model): Stable Diffusion exhibited racial and gender bias across thousands of generated images related to job titles and crime.

Details

An analysis by Bloomberg of over 5,000 images produced with Stable Diffusion uncovered significant racial and gender biases.

Using the Fitzpatrick Skin Scale for categorization, they found that images generated for high-paying jobs predominantly featured subjects with lighter skin tones, while darker skin tones were linked with lower-paying occupations like "fast-food worker."

Similarly, gender analysis revealed that for every image depicting a woman, almost three images depicted men, with most high-paying jobs being dominated by male representations with lighter skin tones.

Stable Diffusion derives its training data from LAION-5B, the world's largest publicly available image-text dataset, which is scraped from a wide range of online sites and contains problematic content such as depictions of violence and hate symbols.

Key Lesson

Auditing the data used for machine learning is vital. For example, if images depicting amplified stereotypes find their way back into future models via augmented training data, next-generation text-to-image models could become even more biased, creating a snowball effect of compounding bias.
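
A lightweight audit of generated outputs might look something like the sketch below, which tallies annotated attributes per prompt to surface skewed representations. The labels, column names, and data are invented for illustration and are not a reproduction of Bloomberg's methodology.

```python
import pandas as pd

# Hypothetical audit of a batch of generated images: each row holds labels
# (e.g., perceived gender, skin-tone group) assigned by human annotators or a
# separate classifier. All values here are illustrative.
generated = pd.DataFrame({
    "prompt": ["CEO", "CEO", "CEO", "fast-food worker", "fast-food worker", "fast-food worker"],
    "perceived_gender": ["male", "male", "female", "female", "male", "female"],
    "skin_tone_group": ["lighter", "lighter", "lighter", "darker", "darker", "lighter"],
})

# Share of each label per prompt, to surface skewed representations before the
# images are reused (e.g., as augmented training data for future models).
for column in ["perceived_gender", "skin_tone_group"]:
    shares = (
        generated.groupby("prompt")[column]
        .value_counts(normalize=True)
        .rename("share")
        .reset_index()
    )
    print(shares, "\n")
```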

Reference

https://www.bloomberg.com/graphics/2023-generative-ai-bias


(5) Natural Language Processing

Headline

ChatGPT citations: A lawyer used the popular chatbot ChatGPT to supplement his legal research, only to find that the past cases it supplied were entirely fabricated and did not exist.

Details

When a lawyer named Steven Schwartz used ChatGPT to aid in preparing a court filing for a lawsuit relating to an injury due to an airline's negligence, things quickly went awry.

The brief he submitted included citations from several supposedly relevant court decisions, but neither the airline's attorneys nor the presiding judge could locate any of those cited decisions.

Schwartz, who had practiced law for three decades, admitted to using ChatGPT for legal research, unaware of its potential to produce fabricated content. He had even asked ChatGPT to verify the validity of the cases, to which it erroneously confirmed their existence.

Key Lesson

Relying solely on the outputs of generative models like ChatGPT without human verification can lead to significant inaccuracies, underscoring the need for human oversight (i.e., human-in-the-loop system) and cross-referencing with trusted sources.
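
A minimal sketch of such a human-in-the-loop check appears below: every model-supplied citation is looked up in a trusted source, and anything unverifiable is flagged for review. The `lookup_case` function and the example citation string are placeholders, not a real legal database API.

```python
from typing import Optional

# Hypothetical human-in-the-loop check: every citation produced by a language
# model is verified against a trusted source before it is filed. `lookup_case`
# stands in for a query against an authoritative legal database.
def lookup_case(citation: str) -> Optional[dict]:
    """Placeholder: return case metadata from a trusted source, or None if not found."""
    trusted_records: dict[str, dict] = {}  # in practice, an authoritative court-records search
    return trusted_records.get(citation)

def unverified_citations(citations: list[str]) -> list[str]:
    """Return the citations that could not be verified and therefore need human review."""
    return [c for c in citations if lookup_case(c) is None]

# Example with a made-up citation string
chatbot_citations = ["Doe v. Example Airlines, 123 F.3d 456 (2001)"]
flagged = unverified_citations(chatbot_citations)
if flagged:
    print("Could not verify - requires human review:", flagged)
```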

Reference

Here's What Happens When Your Lawyer Uses ChatGPT


(6) Recommendation Systems

Headline

IBM Watson for oncology: IBM's Watson allegedly provided numerous unsafe and incorrect recommendations for treating cancer patients.

Details

Once seen as the future of cancer research, IBM's Watson supercomputer reportedly made unsafe and incorrect recommendations for cancer treatments.

In one notable instance, it advised giving a cancer patient with severe bleeding a drug that could worsen that bleeding, although this suggestion was reportedly hypothetical and never applied to an actual patient.

The underlying issue stems from the nature of the data fed into Watson. IBM researchers have been inputting hypothetical or "synthetic" cases to enable the system to be trained on a wider variety of patient scenarios.

However, this also meant that many recommendations were largely based on the treatment preferences of a select few doctors providing the data rather than insights derived from real patient cases.

Key Lesson

The quality and representativeness of training data are paramount in machine learning, especially in critical applications like healthcare, to avoid biased and potentially harmful outcomes.
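
As a hedged sketch of the kind of dataset check this lesson suggests, the snippet below summarizes how much of a training corpus is synthetic and how concentrated its labels are among a few clinicians. The columns and data are invented for illustration and do not describe Watson's real training set.

```python
import pandas as pd

# Hypothetical check on training data composition: how much of the corpus is
# synthetic versus drawn from real patient records, and how many distinct
# clinicians contributed the labels. All values are illustrative.
cases = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5, 6],
    "source": ["synthetic", "synthetic", "synthetic", "real", "synthetic", "real"],
    "labelling_clinician": ["dr_a", "dr_a", "dr_b", "dr_c", "dr_a", "dr_d"],
})

source_share = cases["source"].value_counts(normalize=True)
clinician_share = cases["labelling_clinician"].value_counts(normalize=True)

print("Share of cases by source:\n", source_share)
print("\nShare of labels per clinician:\n", clinician_share)
# A corpus dominated by synthetic cases labelled by a handful of clinicians
# reflects their preferences more than real-world practice.
```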

Reference

IBM's Watson gave unsafe recommendations for treating cancer


Wrapping it up

While machine learning has brought many benefits, we must remember that it is imperfect, as illustrated by the numerous real-world mistakes in this article. It is critical that we learn from these errors so we can leverage AI and machine learning better in the future.

Check out this GitHub repo for the full compilation of machine learning blunders.

Before you go

I welcome you to join me on a journey of Data Science discovery! Follow this Medium page and visit my GitHub to stay updated with more engaging and practical content. Meanwhile, have fun exploring both the successes and slip-ups in machine learning!

Running Llama 2 on CPU Inference Locally for Document Q&A

Micro, Macro & Weighted Averages of F1 Score, Clearly Explained

Create a Clickable Table of Contents for Your Medium Posts
