
How good is artificial intelligence at solving those knotty interpersonal problems that can strain our relationships? David Robson puts the “wise reasoning” of chatbots to the test.
How can you help three siblings warring over the best way to honour their dead mother? What should we do when a couple tries to draw us into their arguments? How should a wife deal with her new husband’s demand that she goes to bed at the same time as him – a source of considerable friction in their life together?
Some of these problems may seem trivial amid the challenges facing the world today, but they represent the kinds of dilemmas that we all face in our day-to-day lives. And they are far from easy to solve. Each side struggles to see the other’s perspective; we often make faulty assumptions and fail to account for our biases and prejudices. The result of our poor judgement can be a serious source of stress and unhappiness that lingers for months or even years after the event has unfolded.
Your capacity to navigate these quandaries isn’t captured in standard intelligence tests, but recent research on “wise reasoning” suggests that it can be measured reliably – and the differences between two people can have serious consequences for their respective wellbeing.
In the first of the BBC’s new series, AI v the Mind, I investigated whether artificial intelligence in the form of large language models like ChatGPT could provide some of the wisdom we lack. Having written extensively about human intelligence, decision making and social reasoning, I had suspected that the answer would be a resounding no – but I was in for a surprise.
Raw brainpower
The question of how to measure the capacity of the human mind has occupied psychologists since the earliest days of the discipline. In the early 20th Century, Alfred Binet and Théodore Simon designed a series of tests to track a child’s intellectual development through school. The psychologist might recite a string of numbers and ask the child to repeat it back to them, which could assess short-term memory. Or the child might be given three words and asked to form a sentence using them – a sign of their verbal prowess.
A few years later, the US psychologist Lewis Terman translated and expanded these tests to include items for older children, such as “If two pencils cost five cents, how many pencils can you buy for 50 cents?”. He also changed the way the results were expressed. Given that older children would generally score better than younger children, he created tables of the average score for each age group. Comparing the child’s score with these averages allowed you to work out their mental age, which you then divided by their chronological age and multiplied by 100 to find their “intelligence quotient” or IQ. A child of 10, who scored the same as the average 15-year-old, had an IQ of 150, for example.
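Terman’s arithmetic is simple enough to write out. The short sketch below reproduces the calculation described above; the function name `ratio_iq` is purely illustrative and does not come from Terman’s work.

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Ratio IQ as described above: mental age / chronological age x 100.

    The function name and the error handling are illustrative assumptions.
    """
    if chronological_age <= 0:
        raise ValueError("chronological_age must be positive")
    return mental_age / chronological_age * 100


# The article's example: a 10-year-old scoring like the average 15-year-old.
print(ratio_iq(15, 10))  # 150.0
```

A child whose mental age matches their chronological age scores exactly 100, which is why 100 serves as the average.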

IQs tend to follow the distribution of a “bell curve” – with most people’s IQs falling around the average of 100, and far fewer reaching either extreme. For example, according to the reference sample for the “Wechsler Adult Intelligence Scale” (WAIS), which is currently the most commonly used IQ test, only 10% of people have an IQ higher than 120. Identifying where someone’s cognitive ability falls on the normal curve is now the primary means of calculating their IQ.
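To see how a score maps onto a position on the bell curve, here is a rough sketch using Python’s standard library. It assumes a normal distribution with a mean of 100 and a standard deviation of 15; the standard deviation is not stated in this article and is the convention usually quoted for Wechsler-style scales, so treat it as an assumption. The name `share_above` is hypothetical.

```python
from statistics import NormalDist

IQ_MEAN = 100  # average IQ, as stated in the article
IQ_SD = 15     # assumption: conventional standard deviation, not given in the article


def share_above(iq: float, mean: float = IQ_MEAN, sd: float = IQ_SD) -> float:
    """Fraction of a normally distributed population scoring above `iq`."""
    return 1 - NormalDist(mu=mean, sigma=sd).cdf(iq)


# Under these assumptions, a little under one in ten people score above 120.
print(f"{share_above(120):.1%}")
```

Under these assumptions the result is close to the 10% figure quoted from the reference sample, though a real sample will not follow the idealised curve exactly.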
There is no doubt that IQ can predict some important outcomes in life. As you might expect given its origins in education, it is especially effective at predicting people’s academic success and their careers in professions that lean on memory and highly abstract thinking, such as medicine or law, although IQ is not the only factor.
IQ’s predictive power in other domains is the subject of debate, leading some scientists to propose various alternative measures of specific abilities such as creativity, rational decision-making, and critical thinking that we may tend to associate with general intelligence.
AI v the Mind
This article is part of AI v the Mind, a series that aims to explore the limits of cutting-edge AI, and learn a little about how our own brains work along the way. Each article will pit a human expert against an AI tool to probe a different aspect of cognitive ability. Can a machine write a better joke than a professional comedian, or unpick a moral conundrum more elegantly than a philosopher? We hope to find out.
Some psychologists have even started investigating whether you can measure people’s wisdom – the good judgement that should allow us to make better decisions throughout life. Looking at the history of philosophy, Igor Grossmann at the University of Waterloo in Canada first identified the different “dimensions” of wise reasoning: recognising the limits of our knowledge, identifying the possibility for change, considering multiple perspectives, searching for compromise, and seeking a resolution to the conflict.
In various experiments, Grossmann and his colleagues asked participants to think out loud about various social or political dilemmas, while the psychologists rated them on each of these “dimensions”. The prompts included letters to a popular advice column, Dear Abby (who would be known as an “agony aunt” in British English) that detailed the problems described at the start of this article. The participants also viewed newspaper articles describing international conflicts. In each case, they were asked to talk about the ways the situations would unfold and the thinking behind their conclusions.
Grossmann found that this measure of wise reasoning can better predict people’s wellbeing than IQ alone. Those with higher scores tended to report having happier relationships, lower depressive rumination and greater life satisfaction. This is evidence that it can capture something meaningful about the quality of someone’s judgement.
As you might hope, people’s wisdom appears to increase with life experience – a thoughtful 50-year-old will be more sage than a hot-headed 20-year-old – though it also depends on culture. An international collaboration found that wise reasoning scores in Japan tend to be equally high across different ages. This may be due to differences in their education system, which may be more effective at encouraging qualities such as intellectual humility.

Wisdom can depend on context. People tend to be wiser when reasoning about other people’s problems rather than their own, for example – a phenomenon known as Solomon’s Paradox, after the biblical king who struggled to apply his famously sage judgement to his personal life. Fortunately, we can remedy this deficit using certain psychological strategies. When people imagine discussing their problem from the point of view of an objective observer, they tend to consider more perspectives and demonstrate greater intellectual humility.
Wise AIs?
So far, all these experiments have been conducted on human brains. But could artificial intelligence demonstrate wisdom?
Platforms like ChatGPT are built on large language models, which have been trained on huge volumes of text to predict how a human would respond to a particular prompt. Further feedback from real human users has helped to refine the algorithms. You won’t need me to explain how successful this has become: if you have even glanced at the news, you’ll have seen the excitement – and fear – about the potential of these bots.
The algorithms certainly perform well on traditional measures of intelligence. In 2023, the assessment psychologist Eka Roivainen, of Oulu University Hospital in Finland, fed ChatGPT questions from the WAIS, with components on vocabulary, general knowledge, arithmetic, abstract reasoning and concept formation. It scored 155 – which, for a human, is higher than 99.9% of test-takers. When reporting his results in Scientific American, Roivainen confessed that he did not score as highly as the chatbot.
Inspired by Roivainen’s results, I asked Grossmann about the possibility of measuring an AI’s wise reasoning. He kindly accepted the challenge and designed some suitable prompts based on the “Dear Abby” letters, which he then presented to OpenAI’s GPT-4 and Claude Opus, a large language model from Anthropic. His research assistants – Peter Diep, Molly Matthews, and Lukas Salib – then rated the responses on each of the individual dimensions of wisdom.
Grossmann emphasises that any results must be treated with caution – given the time constraints of this article, the analysis was “quick and dirty” without the typical rigour that would be required for a scientific paper. Nevertheless, the responses are highly intriguing.








