Large language models (LLMs) such as ChatGPT and Google Bard have attracted a great deal of publicity in recent months due to their impressive abilities to engage in conversation, answer open-ended questions, find information, write essays, generate computer code, produce poetry, and solve certain problems.
The improvement of these systems has generated significant concern regarding the potential harms of such technology. Various potential risks have been raised, including use of LLMs to rapidly spread misinformation, rapid automation of large numbers of jobs, and artificial agents behaving in unexpected ways or even escaping human control entirely.
A major driving force underpinning these concerns is the rapid rate of improvement of LLMs, and the belief that this rate of progress will soon lead to systems that exceed human capabilities in many tasks, leading to drastically disruptive effects on the economy and society.
In this essay, I will argue that such fears are exaggerated. While there are legitimate concerns about the safety and reliability of LLMs, I do not think it is likely that such systems will soon reach human levels of intelligence or capability in a broad range of tasks.
Instead, I argue that such systems have intrinsic limitations which cannot be overcome within the existing development paradigm, and that the growth in capabilities driven by increasing parameter counts and training-data size will continue for only a few more years before running its course.
I also argue that the adoption of such systems will be slow, occurring over years to decades rather than months to years (as some have argued). Therefore, their impacts on society and the economy will be more gradual and evolutionary rather than sudden and revolutionary.
Current LLMs are based on the transformer architecture. These are very large neural networks trained on huge corpora of text, most of it from the internet. The models are typically trained to predict the next word in a sentence, and in doing so they learn complex statistical associations between words in natural language.
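The core idea of learning statistical associations between words can be illustrated with a deliberately crude toy model. The sketch below is a bigram counter, not a transformer; it shares only the training objective of predicting the next word from observed co-occurrence statistics.

```python
from collections import Counter, defaultdict

# Toy illustration (not a transformer): next-word prediction learned
# purely from co-occurrence statistics in a tiny training corpus.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each preceding word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return the statistically most likely next word."""
    return follows[word].most_common(1)[0][0]

print(predict_next("sat"))  # "on" — the only word ever seen after "sat"
```

A transformer does the same thing at vastly greater scale and with far richer context, but the knowledge it acquires is still, at bottom, of this associative kind.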
Recently, OpenAI has extended this framework by adding a technique called Reinforcement Learning from Human Feedback (RLHF). This involves presenting queries and their corresponding LLM outputs to humans, who then provide ratings as to the quality of the responses.
These ratings are then used to fine-tune the language model, altering its output to improve its ratings from human feedback. This technique has enabled language models to produce output that is more useful to humans, and has improved the performance of language models as chatbots.
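The RLHF idea can be sketched in miniature. Real RLHF fits a separate reward model to human preference ratings and then optimises the LLM's weights (typically with PPO); the toy below is only a hypothetical illustration, nudging a distribution over three canned responses toward the higher-rated ones.

```python
import math

# Hypothetical sketch of RLHF: human ratings shift the model's output
# distribution toward higher-rated responses. Real RLHF trains a reward
# model and updates the LLM's weights; here we just adjust toy logits.
responses = ["unhelpful reply", "decent reply", "helpful reply"]
logits = [0.0, 0.0, 0.0]         # the "model" starts indifferent
human_ratings = [1.0, 3.0, 5.0]  # higher = better, from annotators

lr = 0.5
baseline = sum(human_ratings) / len(human_ratings)
for step in range(100):
    # gradient-free sketch: nudge each logit toward its rating advantage
    for i, r in enumerate(human_ratings):
        logits[i] += lr * (r - baseline) / 100

def sample_probs(logits):
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = sample_probs(logits)
# After "fine-tuning", the highest-rated response is the most probable.
print(max(zip(probs, responses))[1])  # "helpful reply"
```

The essential point is that nothing in this process teaches the model what is true; it only teaches it what humans rate highly.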
The OpenAI team has also made other additions and modifications to its newest model (GPT-4) to improve its capabilities as a chatbot, though very few public details are available about this.
Judging by the number of contributors to the GPT-4 paper (which lists 93 ‘core contributors’ and hundreds of other contributors) relative to the previous GPT-3 paper (which lists only 31 authors), it appears that OpenAI has devoted a great deal of effort to adjusting, augmenting, and modifying the model in various ways.
We know that systems have been put in place to filter out queries likely to lead to harmful or offensive results. There is also evidence that GPT-4 has a limited ability to check for faulty assumptions in the queries or instructions it is given, though it is unclear how this has been done. Nonetheless, it appears that extensive development work has been done beyond the initial stage of training the transformer on a large text corpus.
In my view, the fact that such extensive augmentations and modifications are necessary is an indication of the underlying weaknesses and limitations of the transformer architecture. These models learn complex associations between words, but do not form the same structured, flexible, multimodal representations of word meaning as humans do.
As such, they do not truly ‘understand’ language in the same sense as humans can. For many applications, this does not matter. But in other cases it can manifest in extremely bizarre behaviour, including models accepting absurd premises, making faulty inferences, making contradictory statements, and failing to incorporate information that is provided.
A related issue is the known tendency of LLMs to ‘hallucinate’, making up facts, information, or non-existent libraries of computer code when giving responses. I dislike the term ‘hallucination’ because it implies there is some fundamental distinction between veridical knowledge that the LLM has correctly learned and hallucinations, which it simply makes up.
In fact, there is no such distinction, because LLMs do not form memories of events or facts in the way humans do. All they are capable of is storing complex statistical associations in their billions of learned parameters.
When the model produces some string of words as an output, this is equally the product of its internal learned parameters regardless of whether humans would evaluate the string as true or false.
Furthermore, an LLM has no notion of truth or falsity; it simply learns word associations. (Here I am ignoring the possibility that GPT-4 may be augmented with capabilities beyond its basic transformer architecture, since there is no public information about this. And, at any rate, the underlying architecture is still a transformer model). As such, the problem of ‘hallucinations’ is not some teething issue or minor annoyance, but is intrinsic to the architecture and method of training of LLMs.
Of course, various proposals exist for mitigating this limitation, such as augmenting LLMs with curated datasets of encyclopaedic facts or common-sense knowledge. Such proposals are promising but not new, and they face many problems of their own. While they may succeed in the long run, I do not believe there is any simple or easily implemented solution to the problem of ‘hallucinations’ in LLMs.
Another core limitation of LLMs which has been the focus of extensive research is their difficulty in exhibiting compositionality. This refers to the ability to combine known elements in novel ways by following certain abstract rules.
Many cognitive scientists have argued that compositionality is a critical component of the human ability to understand novel sentences with combinations of words and ideas never previously encountered.
Prior to the release of GPT-4, the best transformer models still struggled to perform many compositional tasks, often only succeeding when augmented with symbolic components (which is difficult to scale to real-world tasks), or when given special task-specific training.
At the time of writing, I am not aware of GPT-4 having been subjected to these types of tests. Although I anticipate it would outperform most existing models – given that it shares the same transformer architecture – I doubt it will be able to completely solve the problem of compositionality.
The underlying limitation is that transformer-based language models do not learn explicit symbolic representations, and hence struggle to generalise appropriately in accordance with systematic rules.
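What compositional generalisation looks like can be made concrete with a toy task in the style of the SCAN benchmark, which maps commands like "jump twice" to action sequences. The command vocabulary below is my own simplified invention. A rule-based interpreter handles unseen combinations for free, precisely because it has explicit symbolic rules; a system that only memorises training pairs does not.

```python
# Toy SCAN-style task: meanings are composed from parts by explicit
# symbolic rules, so novel combinations are handled without training.
PRIMITIVES = {"jump": ["JUMP"], "walk": ["WALK"], "run": ["RUN"]}
MODIFIERS = {"twice": 2, "thrice": 3}

def interpret(command):
    """Compose the action sequence for a command from symbolic rules."""
    words = command.split()
    actions = PRIMITIVES[words[0]]
    if len(words) > 1:
        actions = actions * MODIFIERS[words[1]]
    return actions

# Even a combination never "seen" before is handled by rule composition:
print(interpret("run thrice"))  # ['RUN', 'RUN', 'RUN']
```

Transformers must instead approximate such rules from statistical patterns, which is why they tend to fail when test combinations fall outside the training distribution.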
There have also been efforts to circumvent some of these limitations and use LLMs for a wider range of tasks by developing them into partially autonomous agents. The approach is to chain together a series of instructions, allowing the model to step through subcomponents of a task and reason its way to the desired conclusion.
One such project called Auto-GPT involves augmenting GPT with the ability to read and write from external memory, and allowing it access to various external software packages through their application programming interfaces (APIs).
It is too early to say what will become of such projects, though early investigations indicate some promising results but also plenty of difficulties. In particular, the model often gets stuck in loops, fails to correctly incorporate contextual knowledge to constrain solutions to the problem, and has no ability to generalise results to similar future problems.
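The overall shape of such agent loops, and the looping failure mode in particular, can be sketched as follows. This is not Auto-GPT's actual code; `fake_llm` is a stand-in for a model call, and the loop guard is a hypothetical mitigation.

```python
# Hypothetical sketch of an Auto-GPT-style loop: chain model calls,
# store results in external memory, and guard against repetition.
def fake_llm(task, memory):
    # A stub that, like a stuck agent, keeps proposing the same step.
    return "search the web for " + task

def run_agent(task, max_steps=5):
    memory = []                   # external memory the bare LLM lacks
    for _ in range(max_steps):
        action = fake_llm(task, memory)
        if action in memory:      # loop-detection guard
            return memory, "stuck in a loop"
        memory.append(action)
    return memory, "step budget exhausted"

memory, status = run_agent("fix the bug")
print(status)  # the stub repeats itself, so: "stuck in a loop"
```

Note that the intelligence of such a system lives largely in the scaffolding around the model, which is exactly the point made below about cognitive architectures.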
Such difficulties illustrate that LLMs are not designed to be general purpose agents, and hence lack many cognitive faculties such as planning, learning, decision making, or symbolic reasoning.
Furthermore, it is exceedingly unlikely that simply ‘plugging in’ various components to an LLM in an ad hoc manner will result in an agent capable of performing competently in a diverse range of environments.
The way the components are connected and interact is crucial to the overall capabilities of the system. The structure of the different cognitive components of an agent is called a cognitive architecture, and there have been decades of research into this topic in both cognitive psychology and computer science.
As such, I think it is naïve to believe that such research will be rendered irrelevant or obsolete by the simple expedient of augmenting LLMs with a few additional components. Instead, I expect that LLMs will form one component of many that will need to be incorporated into a truly general-purpose intelligent system, one which will likely take decades of further research to develop.
Recent improvements in LLMs have primarily occurred as a result of dramatic increases in both the number of model parameters and the size of the training datasets. This has led to a rapid increase in training costs, largely due to the electricity usage and rental or opportunity cost of the required hardware. For example, the cost of training GPT-3 was probably several million dollars, compared to over one hundred million for GPT-4.
Assuming current growth rates continue, within about five years further increasing model size will become infeasible even for the biggest governments and tech firms, as training costs will reach tens of billions of dollars.
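The arithmetic behind this claim can be made explicit with a back-of-envelope extrapolation. The roughly fivefold yearly growth multiplier below is my assumption, loosely inferred from the cost figures above; the true rate is uncertain.

```python
# Back-of-envelope extrapolation from the essay's figures: training
# cost grew from millions (GPT-3) to over $100M (GPT-4). Assume,
# for illustration only, ~5x growth per year from the GPT-4 level.
cost = 100e6        # ~GPT-4 training cost in dollars
growth = 5.0        # assumed yearly multiplier
years = 0
while cost < 10e9:  # tens of billions: infeasible even for tech giants
    cost *= growth
    years += 1
print(years)        # 3 — costs cross $10B within a few years
```

Varying the growth multiplier changes the exact horizon, but any sustained exponential crosses the infeasibility threshold within a handful of years.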
Separately from the issue of training cost, there is also the question of the availability of training data. Existing models require enormous training datasets, with the size increasing exponentially from one iteration to the next. For example, GPT-3 was trained on a primary corpus of 300 billion words derived from the internet. Based on historical trends, Epoch estimates that high quality language data will be exhausted by 2024 or 2025, and low quality data by 2032.
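The data-exhaustion argument follows the same exponential logic. The stock size and per-generation growth factor below are illustrative assumptions of mine, not Epoch's actual figures; only GPT-3's 300 billion words comes from the text above.

```python
# Rough illustration of data exhaustion: if each model generation needs
# ~10x the training words of the last (an assumption), starting from
# GPT-3's ~300 billion, an assumed high-quality stock runs out fast.
need = 300e9                 # words used to train GPT-3
high_quality_stock = 10e12   # assumed usable high-quality words
generations = 0
while need <= high_quality_stock:
    need *= 10               # assumed per-generation growth
    generations += 1
print(generations)           # 2 — the stock supports only ~2 more scalings
```

Whatever the exact numbers, exponential demand against a roughly fixed stock of quality text cannot be sustained for many more generations.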
I am not arguing here that the development of LLMs will cease within five years or that further improvements are impossible. Rather my point is that the primary method by which improvements have been achieved over the past five years will cease to be feasible. As such, we cannot expect current rates of progress to continue indefinitely. Similar views have been expressed by other researchers, including Ben Goertzel, Gary Marcus, and Sam Altman.
In light of these considerations, along with the intrinsic limitations discussed above, I do not think it is plausible that LLMs will reach or exceed human performance in a wide range of tasks in the near future.
The next few years will be a critical period for LLMs, in which there will be much experimentation and failed attempts as companies compete to find the best way to deploy the technology. It will take considerable time and effort to turn LLMs into viable products, and even longer to adapt their use to various speciality applications and for the technology to become widely adopted.
Many companies and organisations will seek ways to use LLMs to augment their existing internal processes and procedures, which also will take a great deal of time and trial and error.
Contrary to what some have implied, no new technology can simply be ‘plugged in’ to existing processes without substantial change or adaptation. Just as automobiles, computers, and the internet took decades to have major economic and social impacts, so too I expect LLMs will take decades to have such impacts. Other technologies, such as nuclear fusion, reusable launch vehicles, and commercial supersonic flight, have yet to achieve their promised impact at all.
One of the major limitations of existing LLMs is their unreliability. No important process can currently be trusted to LLMs, because we have very little understanding of how they work, limited knowledge of the limits of their capabilities, and a poor understanding of how and when they fail. They can perform impressive feats, yet fail in unexpected and surprising ways.
Unpredictability and unreliability both make it very difficult to use LLMs for many business or government tasks. Of course, humans regularly make mistakes, but human capabilities and fallibilities are better understood than those of LLMs, and existing political, economic, and governance systems have been developed over many decades to manage human mistakes and imperfections.
I expect it will similarly take many years to build systems to effectively work around the limitations of LLMs and achieve sufficient reliability for widespread deployment.
It is also valuable to take a historical perspective, as the field of artificial intelligence has seen numerous examples of excessive hype and inflated expectations. In the late 1950s and early 1960s, there was a wave of enthusiasm about the promise of logic-based systems and automated reasoning, which were thought to be capable of overtaking humans in many tasks within a matter of years.
The failure of many of these predictions led to the first ‘AI winter’ of the 1970s. The 1980s saw a resurgence of interest in AI, this time based on new approaches such as expert systems and the backpropagation algorithm, and on large programmes such as Japan’s Fifth Generation computer project. Underperformance of these systems and techniques led to another ‘AI winter’ in the 1990s and early 2000s.
The most recent resurgence of interest in AI has largely been driven by breakthroughs in machine learning and the availability of much larger sources of data for training. Progress in the past 15 years has been rapid and impressive, but even so there have been numerous instances of inflated expectations and failed promises.
IBM’s Watson system, which won Jeopardy! in 2011, was heralded by IBM as a critical breakthrough in AI research, but IBM subsequently spent years attempting to adapt the system for medical diagnosis, with little success.
Self-driving cars developed by Google attracted substantial publicity in 2012 with their ability to drive autonomously on public roads with minimal human intervention. But a decade later, considerable challenges remain in handling the last small fraction of driving situations where humans still need to take over.
While such comparisons can never be definitive, I believe these historical precedents should temper our expectations about the rate of progress of the latest set of techniques in artificial intelligence research.
In conclusion, LLMs have intrinsic limitations which are unlikely to be resolved without fundamental new paradigms. The increasing costs of training and limited stock of quality training data will mean that growth of LLMs at present rates will not be able to continue for more than a few years. Furthermore, historical parallels indicate that it will take years for LLMs to become widely adopted and integrated into existing economic and social processes.
Overall, there is little reason to believe that LLMs are likely to exceed human capabilities in a wide range of tasks within a few years, or to displace large fractions of the workforce. These outcomes may occur in 30 or 50 years’ time, but almost certainly not within the next five or 10 years – and not solely due to the continued development of LLMs.
While there are legitimate concerns and problems associated with the rapid improvement of LLMs, we should not be distracted by inflated concerns about catastrophic impacts in the near term.