Humans are natural-born learners. In fact, our life-long journey of learning begins in the womb, as suggested by the observation that the mood of newborn babies is affected by music or environmental sounds they heard before being born. The way we are able to interact with, learn from, adapt to and predict our environment is perhaps one of our greatest advantages as a species. Another great advantage is undoubtedly language. While being able to learn and adapt is advantageous to a single individual, the ability to communicate learned concepts is advantageous to everybody – even individuals who have not had a chance to learn a skill or experience a particular situation.

So many breakthroughs in the field of AI have been inspired by learning in biological organisms that it is difficult to imagine the former without the latter. Everything from the brain structure through cell microbiology to evolution has been replicated and integrated in one form or another into a computational model. Yet, language modelling has been so successful that there has been little incentive to look beyond statistical methods and distributed embeddings constructed from incomprehensibly large volumes of text and to understand how humans learn to use the wonderful tool that language is.

Despite the fact that automatic speech recognition systems are so sophisticated that they are now capable of isolating individual speakers in a crowd, this alone does not amount to understanding what those speakers are saying. Even very complex neural networks that give the appearance of language understanding by producing eerily coherent text are in fact relying on the extraction of linguistic patterns, a task which current neural network models are exceedingly good at. These patterns are inferred from corpora containing billions of words – indeed, no single human being would hear that many words in their entire lifetime. To put this into perspective, most children would be exposed to only about 30 million spoken words on average by their third year (although the variance is enormous, a problem known as The Early Catastrophe). This is two to three orders of magnitude less than the amount of written text currently being used to train neural networks. Despite the seemingly insufficient information and the relative messiness of spoken compared to written language, by the age of six (after being exposed to an average of 50 million words), most children would have a decent vocabulary and a good grasp of the grammar of their native language (including morphology), and would be capable of forming coherent sentences, attracting and directing the attention of listeners through requests, questions and stories, and understanding humour (including sarcasm).

The current state of language modelling involves a subtle but important circularity: we are trying to infer the meaning of language snippets (words, phrases and sentences) by looking solely at other such snippets. In this way, the learning system never leaves the domain of language itself (and in most cases it’s only text). As current AI systems do not have “common sense”, meaning that they lack understanding of their environment or a way to represent it conceptually, language learning becomes a purely closed-domain task – the “learned” words, phrases and sentences do not refer to anything except perhaps other words, phrases or sentences.

In this context, patterns discovered solely on the basis of text samples emerge only because they reflect the underlying cognitive and categorical patterns that we have constructed internally by learning how to interact, predict and ultimately understand the world around us. Only after that process of understanding is well underway do humans start assigning linguistic meaning to the concepts that they have understood. 

As an illustration, let us look at Zipf’s law for words, which states that the frequency of a word is inversely proportional to its rank in a frequency table. It is clear that the reason why this distribution is observed in language is that it maps to very similar distributions in the real world – for example, the population distribution of cities, the distribution of the number of available destinations from airports around the world and the family income distribution in any particular country are all Zipfian. Therefore, since we use words as frequently as we need to use the underlying concept, a Zipfian distribution is entirely to be expected. In the concept distribution plot derived from a recently compiled concept graph (which also clearly follows a Zipfian distribution), the words and phrases for concepts that we refer to often (“country”, “brand” and “music genre”) are much more common than ones representing concepts that we only think of rarely or only in specific contexts, such as “scheduling service” and “malabsorption syndrome”.
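As a rough illustration of what “inversely proportional to its rank” means in practice, the sketch below counts word frequencies, ranks them, and fits a straight line to log-frequency against log-rank. A slope close to -1 is what Zipf’s law predicts. The function name `zipf_slope` and the synthetic corpus are assumptions made purely for this example; they are not taken from any study or dataset mentioned here.

```python
import math
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    A value near -1 indicates a Zipf-like rank-frequency relationship.
    Assumes the input contains at least two distinct words.
    """
    # Frequencies sorted from most to least common; position + 1 is the rank.
    counts = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical corpus built to be exactly Zipfian:
# the word of rank r occurs 1200 // r times (1200, 600, 400, 300, 240, 200).
zipfian_tokens = [f"w{r}" for r in range(1, 7) for _ in range(1200 // r)]

# Counter-example: every word is equally frequent, so there is no rank effect.
uniform_tokens = ["a", "b", "c"] * 5

print(zipf_slope(zipfian_tokens))
print(zipf_slope(uniform_tokens))
```

On real text the fitted slope is only approximately -1, and the fit tends to be poorer at the very top and the long tail of the ranking, so this is best read as a quick sanity check rather than a rigorous test.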

Looking at how babies learn about the world and how that leads to linguistic competence, we need to start thinking about how this can be translated into language learning for machines. We need to break out of the closed domain of purely linguistic input (predominantly text) and start integrating multisensory input and learning algorithms that would enable virtual agents or embodied robots to build an internal model of the world on their own. This would provide the foundation for the ability of intelligent agents to learn increasingly abstract concepts by building upon less abstract ones, ultimately grounding their world model in the concrete physical experience of their environment.

Learning such a representation would also help intelligent agents interact with humans – be it by representing physical and abstract aspects of a complex task (thus obtaining an intuitive understanding of what the task actually means) or by being able to associate physical cues with abstract ideas (for instance, a sigh and slumped shoulders might be associated with the concept of ‘tired’). Such conceptual representations can also aid in reducing the amount of data and computing power required for learning. New data can be integrated into the representation by finding the common elements between what we already know and what we want to learn, thus minimising the amount of new data that we need to store and the amount of learning we have to undertake. To illustrate the point, humans know exactly how to walk around a house they’ve never been to before because they understand the concepts of “door”, “hallway” and “wall” and how they relate to each other. We do not have to represent each and every new door or wall we encounter as a shiny new concept – we just have to refer to the relevant concepts that we already have stored in our mind. Importantly, learning the words referring to these concepts is trivial if we already know the concepts – however, learning a concept from words alone is an impossible task. Quoting cognitive scientist Stevan Harnad, “[h]ow can the meanings of the meaningless symbol tokens, manipulated solely on the basis of their (arbitrary) shapes, be grounded in anything but other meaningless symbols? The problem is analogous to trying to learn Chinese from a Chinese/Chinese dictionary alone.” (source). This is known as the symbol grounding problem – the problem of associating linguistic symbols with meaning.

Drawing inspiration from biology again might point us in the right direction as to how machines could build such a representation. A lot of the underlying work has already been done or is in progress, with agents now closing the gap between the way humans and machines learn by being able to represent and manipulate their environment through autonomous exploration. The most natural step from here would be to imbue these new (l)earnest explorers with language – not based on the statistics of meaningless surface shapes, but by using a grounded model providing the base for deep and rational understanding of the world. More importantly, while autonomous exploration can contribute to the acquisition of language, language itself can be used to guide and enrich this exploration through feedback. Just as children use language in exploring the world and understanding purely abstract ideas such as values, culture and goals, intelligent agents could use language to understand the essential ingredients of the concept of being a human.

Image credit: Earthyspirit - Own work, CC BY-SA 4.0,