How to call bullshit on AI companies (aka a short lesson on recall)



Now that software ate the world, what’s for dessert? Those in the know know that it’s AI. It seems everyone took Kevin Kelly’s recommendation to “take X and add AI” to heart. Fast forward to 2018 and all startups tout adding AI to X. There’s a fine line between hype and bullshit, so how do you separate the substance from the snake oil (or to be fair the dream)? Knowing how to do this is important for investors, prospective customers, and prospective employees.

Can you tell AI reality from AI illusion?

The simplest question to ask to cut through the bullshit is “what’s your recall?” You can follow that up with “is your data balanced?” to see through the illusion. Don’t know what recall is? Read on to learn what it is and why it’s important.

The Abridged Introductory Primer On AI And Machine Learning Models

Understanding error is key to most engineering and scientific fields. Models make predictions based on the data they were trained on. An error is the difference between what a model predicts and the actual value. Many machine learning applications frame their problem as a classification problem. Why? Here’s a thought experiment: is it easier to guess my weight or guess if I’m overweight or not? If you guessed the latter, read on!

Many classification problems get simplified to binary classification, such as whether a picture contains a hot dog or not. Binary decision problems seem trivial, but we use them all the time. For example, consider all of these AI-powered decisions: grant loan/don’t grant loan, guilty/not guilty, hire/don’t hire.

Binary classification problems are easiest to solve

Now, AI companies are obliged to tell you how great their model is. They may say something like “our model is 95% accurate”. Zowee! But what does this mean exactly? In terms of binary classification it means that the model chose the correct class 95% of the time. This seems pretty good, so what’s the problem?

Suppose I create an AI that guesses the gender of a technical employee at Facebook. As of 2017, women held 19% of STEM roles. Behind the scenes, my model is really simple: it just chooses male every time (bonus question: is this AI?). Because of the data, my model will be 81% accurate. Now 95% doesn’t seem all that impressive. A dataset like this is called unbalanced, because the classes are not evenly represented. A balanced dataset would have about 50% women and 50% men. So asking if a dataset is balanced helps to identify some tricks that make models appear smarter than they are.
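The always-male baseline is easy to reproduce. Here is a minimal sketch in R, assuming a toy sample of 100 employees split 81/19 to match the figure above (the sample and the variable names are made up for illustration):

```r
# Assumed toy data: 100 technical employees, 19 women and 81 men,
# matching the 19% figure quoted above.
actual <- c(rep("female", 19), rep("male", 81))

# The "model": ignore the input and always guess the majority class.
predicted <- rep("male", length(actual))

# Accuracy is the share of guesses that match the actual label.
accuracy <- mean(predicted == actual)
print(accuracy)
```

Any model worth paying for has to beat this kind of majority-class baseline, not 0% or 50%.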

My gender predictor always chooses men

Error Metrics

In machine learning (and elsewhere), two metrics help distinguish good models from bad models. (Disclosure: my model above is bad). Precision and recall are typically defined in terms of true and false. A true positive is when I guess true and the answer is true. How do I convert my male/female classes to true or false? Since there are just two classes, we can convert the classes into an equivalent predicate: p(x) = x \mbox{ is male}. This seems obvious for the male class. For the female class, we are saying that false means x \mbox{ is not male}.

Precision is defined as the ratio of true positives (TP) to the sum of true positives and false positives (FP). Mathematically, this looks like

\mbox{precision} = \frac{TP}{TP + FP}.

In other words, of all the cases where you guess true, how many were correct? In terms of gender, for all the cases where you guess male, how many were actually male? If I always guess male, then this value will be 81% if I’m looking at the Facebook data.

On the other hand, recall is the ratio of true positives to the sum of true positives and false negatives (FN).

\mbox{recall} = \frac{TP}{TP + FN}

In this case, a false negative means seeing a male and guessing female. In other words, of all the male cases, how many did you correctly identify? My stupid model will have a recall of 1 since it never guesses female. A real model will behave differently. For example, in hiring, there may be a lot of highly qualified candidates. Suppose one of my criteria is that someone must come from a Top 10 school. That means I’ve filtered out a lot of highly qualified candidates. In essence, I’m labeling these highly qualified candidates as not qualified. This is a false negative. It’s easy to see how over-zealous filtering will result in a low recall but high precision.
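To see how over-zealous filtering trades recall for precision, here is a rough sketch in R. All of the counts are invented for illustration and do not come from any real hiring data:

```r
# Invented counts for illustration only: 1,000 applicants, 200 of whom
# are actually qualified. A strict "Top 10 school" filter passes 40
# applicants, 36 of whom are qualified.
qualified_total <- 200
passed_filter   <- 40

tp <- 36                     # qualified and passed the filter
fp <- passed_filter - tp     # passed the filter but not qualified
fn <- qualified_total - tp   # qualified but filtered out

precision <- tp / (tp + fp)  # 36 / 40
recall    <- tp / (tp + fn)  # 36 / 200
print(c(precision = precision, recall = recall))
```

Under these assumed numbers the filter looks excellent on precision (0.9) while missing more than four out of five qualified candidates (recall of 0.18).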

The final metric is accuracy, a term that is often used loosely. The technical definition of accuracy is the ratio of the sum of true positives and true negatives (TN) to the total number of cases N.

\mbox{accuracy} = \frac{TP + TN}{N}
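
As a worked check, take the always-male model from earlier and assume a sample of 100 employees (the sample size is chosen only to give round numbers). With male as the true class, TP = 81, FP = 19, FN = 0, and TN = 0, so

\mbox{precision} = \frac{81}{81 + 19} = 0.81, \quad \mbox{recall} = \frac{81}{81 + 0} = 1, \quad \mbox{accuracy} = \frac{81 + 0}{100} = 0.81.

Accuracy and precision coincide here only because the model never guesses female, which leaves no true negatives and no false negatives.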

The Relationship Between Precision And Recall

To better understand the relationship between these two metrics, let’s look at hiring some more. There are numerous companies touting AI-based hiring. Some of these companies focus on screening resumes to identify candidates to interview. There are actually two decisions being made. The first is whether to interview someone. The second is whether to hire someone. In predicate form, the first decision looks like p(x) = \mbox{interview } x. In a typical scenario, a hiring manager looks at the pool of candidates the model selected and decides whether or not each one is actually worthy of an interview. This is precision. Remember that by using highly selective criteria or having an unbalanced dataset, it’s possible to skew precision.

Is this the right metric to evaluate a hiring model? Most tech companies complain that there aren’t enough qualified candidates. According to one analysis, over 50% of Facebook employees listed on LinkedIn graduated from a Top 10 university. (We’ll ignore how “Top 10” is defined for this discussion.) That pool is about 100,000 people annually out of about 3 million students graduating with a bachelor’s or higher. Since Facebook is ignoring many qualified candidates (yes, this is a philosophical question) in the broader population, their hiring has low recall. This same problem exists in the financial world, where investment banks and hedge funds typically only hire from the Ivy League. When this becomes part of the hiring rubric, recall is very low. In practical terms, we’re creating an artificially small talent pool in order to optimize precision. Yuck.

Models that follow a similar strategy will also have low recall. An algorithm that focuses on improving precision is only focusing on half the problem. A truly spectacular model would have high recall and high precision. So the next time someone talks about how great their model is, ask them what their recall is and whether their dataset is balanced.

Technical Notes

The easiest way to compute accuracy, precision, and recall in R is to construct a confusion matrix. Suppose you have a vector of predictions pred and a vector of actual results act. The confusion matrix is simply the contingency table.

cm <- table(act, pred)

The three performance metrics can be computed easily. It’s useful to remember these definitions and know how to interpret them.

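# Assumes act and pred are logical vectors.
# Rows of cm are actual values (act); columns are predictions (pred).
accuracy <- sum(diag(cm)) / sum(cm)
precision <- cm['TRUE','TRUE'] / sum(cm[,'TRUE'])  # TP / all predicted TRUE
recall <- cm['TRUE','TRUE'] / sum(cm['TRUE',])     # TP / all actually TRUE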
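
One caveat: if a model never predicts one of the classes (like the always-male model), table(act, pred) is no longer a 2x2 matrix, and the diag-based accuracy can pick up the wrong cell. A possible workaround, sketched here with a hypothetical helper named confusion (not part of base R or of the original post), is to fix the factor levels before building the table:

```r
# Hypothetical helper: pin both factors to the levels FALSE/TRUE so the
# confusion matrix is always 2x2, even if a class is never predicted.
confusion <- function(act, pred) {
  lv <- c(FALSE, TRUE)
  table(act = factor(act, levels = lv), pred = factor(pred, levels = lv))
}

# The always-male model on an assumed sample of 100 employees:
# TRUE means male (81 cases), FALSE means not male (19 cases).
act  <- c(rep(TRUE, 81), rep(FALSE, 19))
pred <- rep(TRUE, 100)

cm <- confusion(act, pred)
accuracy  <- sum(diag(cm)) / sum(cm)
precision <- cm['TRUE','TRUE'] / sum(cm[,'TRUE'])
recall    <- cm['TRUE','TRUE'] / sum(cm['TRUE',])
print(c(accuracy = accuracy, precision = precision, recall = recall))
```

With the levels pinned, the three formulas work unchanged. A model that never predicts TRUE would still give a precision of 0/0 (NaN in R), which is worth handling explicitly in real code.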

Other error and performance metrics exist, and not just for classification. Which metrics to use depends on the situation and the relative impact of correct guesses versus incorrect guesses (e.g. mammograms have a high cost for false positives).

If you found this article helpful, please share it on social media.

Brian Lee Yung Rowe is founder and CEO of Pez.AI, an AI startup fostering continuous business improvement using customer interactions with chatbots. Learn more about Pez.AI’s turnkey data-driven chatbot platform at