Please Make this AI Less Accurate


Accuracy is one of those words that everyone intuitively assumes they understand, and that most people believe is simply better the higher it is.

With the rise in attention on Artificial Intelligence (AI) and the increasing awareness of lapses in the reliability or accuracy of its outputs, it is important for more people to understand that data products, such as AI, don't follow the same rules of consistency or accuracy as other technologies.

The Confusion Matrix

To illustrate, let me introduce the concept of a "Confusion Matrix". This will be very familiar to any Data Scientist who has built predictive models for classification purposes. It may be new to others, but I find that the concept, the methodology and the human/business interaction involved are a useful case study for understanding accuracy terminology in machine learning more broadly. It is a helpful visual tool for understanding both the nuance and the trade-offs in these terms.

Confusion Matrix template by the author

When we speak about total accuracy we mean the number of correct predictions (the sum of the green boxes above) out of all predictions made (the sum of the four boxes above). This is where you may hear terms like "Our pregnancy test is 99% accurate". It is talking about the accuracy of all test predictions, both those that say the user is pregnant and those that say they are not.
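
To make the arithmetic concrete, here is a minimal sketch in Python (the counts are invented purely for illustration): overall accuracy is simply the two green boxes divided by all four.

```python
import numpy as np

# Hypothetical counts for a pregnancy test evaluated on 1,000 users
# (illustrative numbers only, not from any real study).
#                 predicted: pregnant   predicted: not pregnant
confusion = np.array([
    [180,   8],   # actually pregnant:     180 true positives,  8 false negatives
    [  2, 810],   # actually not pregnant:   2 false positives, 810 true negatives
])

correct = np.trace(confusion)   # the two "green boxes": TP + TN
total = confusion.sum()         # all four boxes
accuracy = correct / total
print(f"Overall accuracy: {accuracy:.1%}")   # 99.0%
```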

The nuance appears when you seek to understand which of the two remaining red boxes that "inaccurate" percentage sits in.

For rare events, you could achieve a very high accuracy by predicting that the event never happens (no model required). However, for different models and use cases the cost or risk associated with inaccuracy is not equal or consistent.
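
A small sketch of that trap, assuming scikit-learn and a simulated event that occurs roughly 1% of the time: a "model" that always predicts the event never happens scores about 99% accuracy while catching nothing at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)

# Simulate an event that occurs in roughly 1% of 10,000 cases.
y_true = (rng.random(10_000) < 0.01).astype(int)

# A "model" that simply predicts the event never happens.
y_pred = np.zeros_like(y_true)

print(f"Accuracy:    {accuracy_score(y_true, y_pred):.1%}")   # ~99%
print(f"Sensitivity: {recall_score(y_true, y_pred):.1%}")     # 0% - it misses every event
```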

Put plainly, a lower-accuracy model may be that way intentionally, because you want to reduce how often you mis-predict in one direction or the other. In doing this you have to choose to compromise overall model accuracy.
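
As a toy comparison (the confusion matrix counts are entirely made up), the model with the lower headline accuracy can still be the one you want if missed cases are the costly kind:

```python
def summarise(name, tp, fn, fp, tn):
    """Print the headline accuracy alongside the two kinds of error."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    print(f"{name}: accuracy {accuracy:.1%}, false negatives {fn}, false positives {fp}")

# Two hypothetical models scored on the same 1,000 cases (illustrative numbers only).
summarise("Model A", tp=60, fn=40, fp=10, tn=890)   # 95.0% accurate, but misses 40 cases
summarise("Model B", tp=95, fn=5,  fp=80, tn=820)   # 91.5% accurate, but misses only 5
```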

Is it more risky to predict (or classify) that someone is pregnant and to be wrong or the other way around?

Is it more dangerous to diagnose someone as not having cancer when they do?

Is it more harmful to label something as hate speech and take it down from a platform or not?

Some of these examples have an obvious answer while in others you will find two people disagree. This shows the spectrum that exists in terms of both the stakes at hand when dealing with inaccurate predictions but also the complexity of the decisions made. One person's bug is another person's feature.

Chatbots and LLMs

To shift gears away from the comparatively simple case of classification models, there is widespread discussion currently about "hallucinations" in Large Language Model (LLM) outputs. For some users these hallucinations have been deemed so serious that they have stopped using the tools for fear of hallucinations they may not be able to identify. However, some experts argue that these are part of the AI design. This article in Scientific American highlights that chatbots are developed and trained to respond, and even if they are inaccurate in that response they are doing what they were trained to do. Unfortunately for the unsuspecting user, they will often respond just as confidently with a wrong answer as they do with a right one. Just as the humans they are attempting to replicate can.

Thanks to ChatGPT's rapid rise to mainstream adoption, the LLM example is playing out in a public discourse that many other model types did not have. The general population didn't have the same opportunity to familiarise themselves with the various realities of accurate or inaccurate predictions or to have a discussion about their pros and cons. This, of course, doesn't mean they didn't exist.

Trade-offs

The most important thing to understand when building, deploying or, indeed, using AI or a model output is "What is it trying to achieve?". Only by understanding the objectives can we build something that improves on what we could responsibly achieve without the technology. Similarly, only by understanding the decisions behind the use case can users interact responsibly with the output.

"Careful what you wish for" saying, image created by the author

Underneath every model or AI instance is a data optimisation problem. Depending on the makeup of your data you can in some cases build exceedingly accurate models that will give you exactly what you optimise towards. One widely adopted example of this is the automated ad serving technology used by Meta and Google. When setting up a campaign you ask for a specific conversion or outcome. If you choose clicks then that is what you will get. These clicks may not convert to valuable outcomes for your business and in some cases may actually include some bots but this is the risk you are taking when you ask the model to deliver this to you.

Recommender engines are another very common type of model that we interact with regularly. Whether through Amazon's "customers like you", TikTok's content algorithm or the Netflix home page, we are served with what the machine thinks we "want to see" in a lot of our daily interactions. But is that what we WANT to see, or is it what fits the objectives of the company? In Amazon's case they want us to purchase, and ideally to purchase something with a higher margin than an alternative. TikTok wants eyes on screen for as long as possible so they can monetise those eyes through ads served in between the content. Netflix wants us to quickly find something we will enjoy watching (and better still binge) so that we will stay on their platform and choose them for more of our viewing needs. All recommender engines, but all with different target behaviours that fit what the business needs, even if this is connected to what the customer wants.

Back to the Confusion Matrix

When a Data Scientist or a Machine Learning engineer is reviewing the Confusion Matrix of different models they need to have the objective of the model in mind.

What are we trying to achieve and what does good look like?

As I mentioned earlier, people find the concept of accuracy intuitive. This can be a negative because it means that they bring their own assumptions to the table. For example, if something is less than 50% accurate I often hear "well, that is worse than flipping a coin". Sure, on the surface that is true. But what if, due to the rarity or imbalance of events, a random guess would only be right 1% of the time (or less)? Then 10% accuracy is already a 10X improvement.
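
A back-of-the-envelope version of that framing, with assumed rates rather than real data:

```python
# Hypothetical rare-event scenario (illustrative numbers only):
# the event occurs in 1% of cases, so a random guess "hits" about 1% of the time.
baseline_hit_rate = 0.01

# Suppose the cases the model flags turn out to be correct 10% of the time.
model_hit_rate = 0.10

lift = model_hit_rate / baseline_hit_rate
print(f"Lift over guessing at random: {lift:.0f}x")   # 10x
```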

We need to think about accuracy in relative terms and in terms of improvement and value add vs. no model (or the model we had before).

Next we need to decide where we want our wrong predictions to fall – taking into consideration the risk and cost I mentioned earlier. This is a decision of whether a False Positive is better or worse than a False Negative.

The True Positive rate is also called the sensitivity of the model. To maximise it is to minimise False Negatives (aka Type 2 errors) and to increase the "hit rate", or probability of detection, of whatever we are predicting. The more sensitive our model, the less likely we are to wrongly say something isn't there – to miss something that is.

The True Negative rate is also called the specificity of the model. To maximise it is to minimise False Positives (aka Type 1 errors) and to increase how selective we are about our predictions. The more specific our model, the less likely we are to wrongly say something exists when it doesn't, but the more likely we are to miss something that is there.
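
Both metrics fall straight out of the same confusion matrix. Here is a minimal sketch with scikit-learn, using invented labels and predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and model predictions (illustrative only).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # true positive rate: how much of what is there do we catch?
specificity = tn / (tn + fp)   # true negative rate: how often do we avoid false alarms?

print(f"Sensitivity: {sensitivity:.0%}")   # 75%
print(f"Specificity: {specificity:.0%}")   # 83%
```

In practice the two pull against each other: moving a model's decision threshold to raise one will usually lower the other.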

So what?

Whether done deliberately by a Data Scientist, done through neglect by an inexperienced one, or done automatically by AI, this is what is happening under the hood of accuracy refinement. It cannot achieve all things for all people, so it comes back to what it was built to do and how success was defined.

Hearing a one-statistic accuracy assessment about an AI instance does not tell you the full story. Context is absolutely paramount – and not just your perception of it, but that of the designer. Higher doesn't necessarily mean better if you don't know how the decisions were made.


Understanding how our Data and AI products actually deliver against our business strategy is key to unlocking their value.

If this is something you or your leadership team need help with then check out my offering on kate-minogue.com

Through a unique combined focus on People, Strategy and Data, I am available for a range of consulting and advisory engagements to support and enhance how you deliver on your strategy across Business, Data and Execution challenges and opportunities. Follow me here or on LinkedIn to learn more.

Tags: Artificial Intelligence Business Data Science Ethics Machine Learning
