The site how-old.net landed last week and became a viral sensation. Quite simply, it allowed anyone to upload a picture and have it analyzed by machine-learned classifiers that detect faces and predict the age and gender of each detected face. Since it went up, people have been posting their pictures and HowOldNet’s predictions for them, usually with some colorful commentary on its performance. As an example, the picture below is of me (left) and Philip Resnik (right) as analyzed by HowOldNet. It did pretty well: the age is a bit high for me and a bit low for Philip, and it got us both as men.
The response to the site went far beyond its creators’ expectations. As they note in their blog post about how-old.net, they put it together to showcase Microsoft’s platform of Azure tools and hoped to get a few hundred people to try it out. Instead they got over 35,000 uploads. Given the rate at which people are still posting #HowOldNet pictures to Twitter, the uploads are probably in the hundreds of thousands by now. It even went beyond social media and was mentioned on shows like The Talk. As such, HowOldNet and the reaction to it provide an instructive episode in how people respond to the predictions of machine-learned classifiers.
Here are five observations that arise from the episode.
Observation #1: People respond emotionally to the output of a machine algorithm, and their response depends on whether it under-predicts or over-predicts their age.
People don’t seem to get the output of HowOldNet and just say “that was right” or “that was wrong”. Instead, they laugh at it, get creeped out by it, cheer when it says they are younger, or get frustrated by its different predictions for the same person on different pictures.
Here are some examples of things people said:
It’s fascinating to see how people take these predictions and spin them. When the prediction is lower than their true age, many seem to feel genuinely complimented. When it is higher, they appear to be offended. “Robot, how could you say I’m older than I am!?” This is simply a continuation of the general human tendency to attribute agency to inanimate objects.
We regularly construct personalities for our cars, computers, toaster ovens, and even moving geometric shapes. We act as if those objects have intentions akin to those of animals and our fellow humans. This tendency is even more accentuated when the inanimate object is a computer running a set of instructions (an algorithm) that performs a prototypically human task like face recognition and age prediction.
When it gets it right, the uncanny valley often shows up clearly in the commentary (e.g. it’s “eerily correct” or “creepy”). People get weirded out when machines are too human or perform too much like a human. So, damned if it does, and damned if it doesn’t (but, please, give us more!).
What’s particularly fun about this episode is that we could potentially see quite similar emotional responses if HowOldNet were randomly choosing an age between 0 and 85 for every face. What probably matters most is that the subject matter is personal (pictures of one’s self or loved ones) and that an automated black box is coming back with judgments of that personal subject matter.
Observation #2: People are subject to systematic biases when evaluating the performance of an algorithm like HowOldNet.
A quick scan of #HowOldNet tweets indicates that the majority are about errors made by the algorithm. To see if this is more than just an impression, I looked at 30 tweets and classified them as indicating correct or incorrect predictions or ambiguous results (I could not tell if the poster thought it was correct or not). The numbers were 19 incorrect, 7 correct, and 4 ambiguous. Leaving the ambiguous ones out, that indicates an accuracy rate of 27%. That sounds terrible, right? Well, hang on a moment: there are a number of problems with this evaluation.
To begin with, my little evaluation is just 30 items pulled on a particular day at a particular time, so my sample is not at all representative of the overall group of posts about HowOldNet. It should also be noted that binary correct/incorrect is a terrible error metric for this task: we should be estimating how far off the predictions are instead. It’s much better to say that a 27-year-old is 33 than to say she is 58. This problem is even worse because we are using self-reports of accuracy rather than an objective measure of the true age and the size of the error. This means that we might find people accept the output when it predicts they are 5 years younger than they are, but call it incorrect when it predicts they are 3 years older. These lopsided error boundaries are also likely to shift depending on the age of the person being predicted: e.g. many people might say it is quite incorrect if a 29-year-old is said to be 25, but that it is reasonably correct if a 79-year-old is said to be 75.
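To make that concrete, here is a tiny sketch with made-up (true age, predicted age) pairs, contrasting a binary right/wrong score with mean absolute error and with an asymmetric tolerance of the sort people seem to apply informally:

```python
# A tiny sketch with made-up (true_age, predicted_age) pairs, contrasting a
# binary right/wrong score with mean absolute error and with an asymmetric
# tolerance of the kind people informally apply.
pairs = [(27, 33), (27, 58), (79, 75), (29, 25)]  # hypothetical examples

# Binary exact-match accuracy treats a 6-year miss and a 31-year miss the same.
exact_acc = sum(t == p for t, p in pairs) / len(pairs)

# Mean absolute error captures how far off the predictions are on average.
mae = sum(abs(t - p) for t, p in pairs) / len(pairs)

def acceptable(true_age, pred_age, years_under=5, years_over=2):
    """Asymmetric tolerance: underestimates are forgiven more than overestimates."""
    diff = pred_age - true_age
    return -years_under <= diff <= years_over

tolerant_acc = sum(acceptable(t, p) for t, p in pairs) / len(pairs)
print(f"exact accuracy: {exact_acc:.2f}, MAE: {mae:.1f} years, "
      f"within tolerance: {tolerant_acc:.2f}")
```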
Another major problem with this evaluation is that it depends on the quality of predictions of self-selected reporters. What that means is that there are likely to be systematic biases in who reports a correct or incorrect result, and this could dramatically skew the measure of performance of the algorithm. These biases arise from multiple sources. Here are a few:
– People with certain personality types or backgrounds are more likely to post the result of the submission to social channels like Twitter and Facebook.
– There are likely to be biases in whether people post correct or incorrect results. My guess is that posting is more common for incorrect results, since those are more fun and shareable.
In other words, by the time we even get to an evaluation like my 30-tweet example, the results may give us a very skewed measure of the true underlying accuracy rate, as the little simulation below illustrates.
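Here is a rough simulation with assumed (made-up) rates: suppose the algorithm is actually right 70% of the time, but a wrong result is three times more likely to get posted than a right one.

```python
# A rough simulation of self-selection bias, using assumed (made-up) rates:
# the algorithm is right 70% of the time, but a wrong result is three times
# more likely to get posted than a right one.
import random

random.seed(0)
true_accuracy = 0.70       # assumed true accuracy
p_post_if_wrong = 0.30     # assumed chance a wrong result gets posted
p_post_if_right = 0.10     # assumed chance a right result gets posted

posted = []
for _ in range(100_000):
    correct = random.random() < true_accuracy
    if random.random() < (p_post_if_right if correct else p_post_if_wrong):
        posted.append(correct)

print(f"accuracy among posted results: {sum(posted) / len(posted):.2f}")
```

With those made-up rates, only about 44% of the posted results report a correct prediction, even though the algorithm is right 70% of the time.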
Despite this defense of the algorithm, it is still the case that many of the examples of errors made by the algorithm are the kind that most humans wouldn’t make, like saying that someone in their twenties is nearly fifty years old.
When howlers like this occur frequently enough, it is fair to say that the algorithm hasn’t passed the sniff test, and a careful, representative evaluation that addresses the issues I mentioned above won’t tell us much we don’t already know.
Observation #3: People underestimate the challenge of tasks like this for machines.
Researchers in artificial intelligence have long dealt with the fact that people think that things humans do well are easy, and their algorithms are judged harshly as a result. In the 1950s, many people assumed that it would not take long to have computers that could converse in natural language, while at the same time they argued that computers would never beat humans at chess. Pretty much everyone can speak, so it’s easy; chess requires mastery, so it should be hard for a computer. Of course, the opposite is true: for a well-defined, constrained task like chess, a computer can take advantage of its superior memory and speed to beat people (as Deep Blue showed in beating Garry Kasparov). Language and visual processing, however, require deep knowledge of how the world works, rich approximations of the information contained in the signal, and much more.
One example of something people mostly take for granted with HowOldNet is that face detection itself is a challenge. Algorithms work quite well for this now, but it took quite a bit of effort to get there, and they can still both miss real faces and detect faces that aren’t there. When face detection works, people don’t even notice that it was a problem that needed to be solved.
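For a feel of what’s involved, here is a minimal sketch of off-the-shelf face detection using OpenCV’s stock Haar cascade. This is not the detector behind HowOldNet; it just shows that detection has knobs that trade missed faces against spurious ones. The image path is a placeholder.

```python
# A minimal face-detection sketch using OpenCV's stock Haar cascade -- not the
# detector behind HowOldNet, just an illustration that detection has knobs
# which trade missed faces against spurious ones. "photo.jpg" is a placeholder.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

gray = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2GRAY)

# Lowering minNeighbors makes the detector more trigger-happy (more spurious
# "faces" in clutter); raising it makes it miss small or turned faces.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                 minSize=(30, 30))
print(f"detected {len(faces)} face(s)")
```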
When it does misfire, it sometimes leads to humorous results. We all know that humans see faces all over the place–it’s an expression of pareidolia, the tendency to find patterns in random signals. Basically, evolution has primed our brains to be highly sensitive to faces and facial expressions, and the overfiring of our facial neural circuitry may be connected to our predilection for assigning intention to inanimate objects.
And it turns out that face detection algorithms like the one used in HowOldNet find faces in noise too. Here’s a great example of one finding a “ghost”:
Assuming the algorithm has correctly identified a face, there are still many further challenges when it comes to predicting age and gender. First, there is basic image analysis and classification, which is itself a rich and complex topic. Facial rotation, lighting, contrast and camera angle can make the inputs look unlike any of the training data that the algorithm uses and leave it more confused. Humans correct for lighting almost effortlessly and rarely even realize that they are compensating for dramatic differences in light and color–one only needs to look at the checker shadow illusion for a perfect example. How well a machine learning based algorithm handles such shading and color variations depends a great deal on the training data available to it and how well such variations cover individuals of different genders and ages.
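As a generic illustration (we know nothing about HowOldNet’s actual preprocessing), here is one common way to flatten out lighting and contrast differences in a face crop before handing it to a classifier, using contrast-limited adaptive histogram equalization in OpenCV:

```python
# A generic preprocessing sketch (nothing to do with HowOldNet's actual
# pipeline): flatten lighting/contrast variation in a face crop with
# contrast-limited adaptive histogram equalization (CLAHE) before classifying.
import cv2

gray = cv2.cvtColor(cv2.imread("face_crop.jpg"), cv2.COLOR_BGR2GRAY)
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
normalized = clahe.apply(gray)   # more uniform illumination across the crop
cv2.imwrite("face_crop_normalized.jpg", normalized)
```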
Observation #4: People rarely consider how poorly humans would perform the same task.
As noted above, a problem with judging HowOldNet based on self-reports is that it is based primarily on people responding to what the algorithm predicted for their own profile. However, a reasonable evaluation would be not to ask whether the algorithm predicted the subject’s true age, but to see how far it is from predictions of that individual by other people. I personally saw plenty of examples of people saying it was ridiculous for HowOldNet to call them 35-45 years old; they apparently were less than 30 years old, but in many cases I would probably have said they were 35 or older just looking at their picture.
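Here is a sketch, with made-up numbers, of what that comparison looks like: score the algorithm’s error against the average error of human guessers, not just against the true age.

```python
# A sketch with made-up numbers: compare the algorithm's error to the average
# error of human guessers, rather than only to the subject's true age.
true_age = 28
human_guesses = [35, 37, 33, 40]   # hypothetical guesses by other people
algorithm_guess = 38               # hypothetical HowOldNet-style prediction

human_mae = sum(abs(g - true_age) for g in human_guesses) / len(human_guesses)
algo_error = abs(algorithm_guess - true_age)
print(f"average human error: {human_mae:.1f} years, algorithm error: {algo_error} years")
```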
Race figures into this as well: it’s well known in the African-American community that “black don’t crack” and that white Americans regularly underestimate the age of older African-Americans. And it turns out that white Americans regularly overestimate the age of African-American boys. So, not only do people have a hard time correctly guessing a person’s age; they can also exhibit systematic biases with respect to particular groups.
Note that if HowOldNet had instead been WhichRaceNet, a tool for predicting the race of people in photographs, there would have been even more mayhem, and not just due to the algorithm’s accuracy or inaccuracy. The bigger problem is that humans can’t even agree objectively on what race this or that person is. As Sen and Wasow (2014) note, the definition of who falls into which racial category varies over time, and a given individual may identify with one race at one time and with another race a decade later (which is to say that race itself is socially constructed and not an immutable property of the individual). In Brazil, there are at least three different and incompatible systems of racial categorization. Like many Americans born in this century, my own children are biracial: one has darker skin and is considered “black” by most people while the other is usually considered “white”. What does it even mean to racially categorize them as one or the other when they have the same parents and grow up in the same house? This is exactly the issue addressed in the Sen and Wasow paper mentioned above, and there is a great deal of nuance to appreciate–nuance that few people assessing the results of a WhichRaceNet algorithm would have ever considered. (Also, note that even though individual-level predictions can be problematic, determining aggregate statistics about race based on individual predictions for a large number of individuals can still be quite informative and useful.)
Another related task where humans poorly understand how other humans perform the same classification is sentiment analysis. People generally agree on which sentences are positive or negative, but when you throw the neutral category into the mix, they disagree quite a bit about positive-neutral and negative-neutral–to the point that two people only agree about 80-85% of the time overall. Humans also easily misread sarcasm, so they even muck up positive-negative determinations in such cases.
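A toy calculation makes the point. With made-up labels from two hypothetical annotators, raw agreement and chance-corrected agreement (Cohen’s kappa) can both be computed in a few lines:

```python
# A toy agreement calculation with made-up labels from two hypothetical
# annotators; the two disagreements both involve the neutral category.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["pos", "pos", "neg", "neu", "neu", "pos", "neg", "neu", "pos", "neg"]
annotator_b = ["pos", "pos", "neg", "pos", "neu", "pos", "neg", "neg", "pos", "neg"]

raw = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
kappa = cohen_kappa_score(annotator_a, annotator_b)   # agreement corrected for chance
print(f"raw agreement: {raw:.2f}, Cohen's kappa: {kappa:.2f}")
```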
Observation #5: Academic researchers often circumscribe their empirical world too much.
Academic research is often the genesis of the algorithms that underlie systems like HowOldNet. Empirical evaluation of the performance of systems for face detection and image classification is a key part of how progress is made. The same is true for my field, natural language processing.
Nonetheless, academic research often stays confined to datasets that have been created under assumptions that dramatically simplify the world of inputs that a given algorithm will be exposed to. For example, an image dataset for gender might only contain images that actually contain a single human face; in making choices about modeling and data processing, the experimenter won’t account for images that contain no faces or multiple faces. Or the dataset might be small enough that there are very few people of a given race, and the experimenter never realizes that their algorithm has built-in assumptions that make it perform poorly in those cases.
It seems likely to me that the modelers behind HowOldNet worked with a clean evaluation set, but when the models were deployed, the resulting system clearly didn’t have this “luxury.” Anybody and everybody could upload their picture and express their opinion about it. Similarly, at People Pattern, we don’t have this luxury with predicting demographics and interests for social profiles. We don’t discard less common types, or exclude profiles because they don’t speak one of a set number of languages. We have to do our best given the inputs, and sometimes that means abstaining and not making a prediction. At the end of the day, we need to analyze all of our customers’ audiences, and that means every one of the profiles of interest.
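Here is a minimal sketch of what abstention looks like, assuming a generic probabilistic classifier with a scikit-learn-style predict_proba and an illustrative confidence threshold (not anything People Pattern actually uses):

```python
# A minimal sketch of abstention: only return a label when the classifier's
# confidence clears a threshold. `clf` stands in for any trained probabilistic
# classifier with a scikit-learn-style predict_proba; the 0.75 threshold is
# purely illustrative.
import numpy as np

def predict_or_abstain(clf, features, threshold=0.75):
    probs = clf.predict_proba(features)                 # (n_samples, n_classes)
    best = probs.argmax(axis=1)
    confident = probs[np.arange(len(probs)), best] >= threshold
    # Return the predicted class index where confident, None where we abstain.
    return [int(b) if ok else None for b, ok in zip(best, confident)]
```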
Wrap up
It’s pretty cool that the creators of HowOldNet built it in under a day, and it was gutsy of them to put the system up for anyone to try. From the sounds of it, they were not the creators of the image classifiers that were showcased, and instead were responsible for bringing the relevant bits together using Azure. As such, they could always say “yeah, we’re just using this classifier that was ready to grab off the shelf, so don’t blame us for the errors.” From what I’ve seen, it does look like the algorithm could benefit from improvements. However, I hope I’ve also demonstrated that the algorithm may in fact work much better than one would be led to believe from the posts one sees on social media.
People tend to think they can generalize from anecdotal evidence, but this is a fallacy: evaluation requires careful design and measurement to make a fair and meaningful assessment.