Do LLMs Have a Ceiling?

AI

We are now almost a year into the large language model breakthrough that, for many, probably felt like it came out of nowhere. However, like most monumental breakthroughs, large language models are the result of equally significant, though lesser-known, innovations that compounded over the years to bring us one of the most defining technologies since perhaps the transistor.

As investors and corporations race to compete in the future of AI, it’s helpful to take a step back and understand how we got here to get a better idea of where we are going. Unlike our usual approach to work, AI innovations cannot be brute-forced by simply doing more; they are the result of deep understanding and patience. Sometimes those innovations leap forward so fast that they run into walls. In this discussion we’ll explore what some of those walls might be, and how we are trying to climb over them.

But let’s first begin with the core innovation that enabled the era of large language models. Underpinning the major milestones that led to the mass proliferation of LLMs is a type of neural network architecture called a transformer. The idea was first presented in a paper titled “Attention Is All You Need,” published in 2017 by Google researchers along with the University of Toronto. For those that don’t already know, the T in ChatGPT stands for Transformer.

Prior to transformers, language models relied on a few different approaches: machine learning-oriented statistical techniques; convolutional neural networks (CNNs), which are best suited to computer vision; and recurrent neural networks (RNNs), which, due to how they process sequential data, were the best option for most natural language models before transformers.


(The transformer architecture – discussed below)

So what made the transformer such a significant leap forward over prior techniques? The innovation is, of course, very complex, so we’ll keep it as simple as possible while still hitting on the main features of the architecture, without going into exactly how a transformer works.

A transformer has two parts:

  1. an encoder (for the input)

  2. a decoder (for the output)

For the encoder side, a transformer is a combination of three key techniques: positional encoding, attention, and self-attention. Explaining these in detail would be out of scope for this discussion, but in summary: positional encoding captures where a word sits in a sentence and its relationship, by distance, to other words; attention captures how important a word is; and self-attention helps pin down the specific meaning of a word relative to the words around it. These techniques allow a transformer to consider how all the words in a sentence, regardless of their position relative to each other, inform the context and meaning of the sentence as well as of the individual words. This differs from a recurrent neural network, which processes the words in sequence. For example, let’s compare how an RNN and a transformer would understand this sentence.

“The server, which is training a model right now, is overheating.”

An RNN would start by reading the first word "the," then the second word "server," and so on—only one word at a time. It would try to keep a "memory" of what it has read so far to understand the sentence.

When it reaches the word "is," the RNN has to remember that "the server" is the subject of the sentence, despite the words "which is training a model right now" sitting between the subject and the conclusion of the clause. RNNs struggle to remember all the words in a sentence, and because of this they lose track of how the words relate to each other, especially as the distance between them increases; these long-range relationships are referred to as “long-term dependencies.”
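To make that sequential bottleneck concrete, here is a minimal sketch, in NumPy with toy sizes and random weights chosen purely for illustration, of how an RNN folds a sentence into a single fixed-size hidden state one word at a time. Everything it knows about "the server" has to survive every later update in order to link up with "is."

```python
import numpy as np

# Toy RNN: one hidden-state update per token, no attention.
rng = np.random.default_rng(0)
hidden_size, embed_size = 8, 8

W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
W_x = rng.normal(scale=0.1, size=(hidden_size, embed_size))   # input  -> hidden

tokens = "The server , which is training a model right now , is overheating".split()
embeddings = {tok: rng.normal(size=embed_size) for tok in tokens}  # toy word vectors

h = np.zeros(hidden_size)  # the RNN's entire memory of the sentence so far
for tok in tokens:
    # Each step mixes the new word into the same fixed-size state, so details
    # about "the server" must survive every later update to connect with "is".
    h = np.tanh(W_h @ h + W_x @ embeddings[tok])

print(h)  # final state: the only summary the RNN has of the whole sentence
```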

A transformer, on the other hand, looks at all the words in the sentence at the same time. So, when it reads "is," it can easily associate it with "the server" by giving higher "attention" to it, despite the intervening words.

This way, it can directly learn the relationships between words, no matter how far apart they are in a sentence. And by doing this for every word in the sentence, the transformer gets a better overall understanding of the sentence. This ability to form relationships between words is also what allows a transformer to understand that the "server" in the sentence above is a computer rather than a hospitality worker (this is what self-attention is about). The surrounding words give context to the specific word itself as well as to the sentence overall.
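For the curious, here is a minimal NumPy sketch of the scaled dot-product self-attention at the heart of the “Attention Is All You Need” paper. The sizes and weights are toy values for illustration; a real transformer would also add positional encodings to the embeddings and stack multiple attention heads and layers.

```python
import numpy as np

# Toy self-attention: every token scores every other token at once, so "is"
# can attend directly to "server" no matter how many words sit between them.
rng = np.random.default_rng(0)
d_model = 16

tokens = "The server , which is training a model right now , is overheating".split()
X = rng.normal(size=(len(tokens), d_model))          # one toy embedding row per token

W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v                  # queries, keys, values

scores = Q @ K.T / np.sqrt(d_model)                  # pairwise relevance, all at once
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
output = weights @ V                                 # each token: weighted mix of all tokens

print(weights.shape)  # (13, 13): every token's attention over every other token
```

The key point is the `scores` matrix: every token is compared with every other token in a single matrix multiplication, which is also what makes the computation easy to parallelize on GPUs.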

This complex layering of associations between words informs how the encoder and decoder portions of a transformer work. The encoder uses this technique to read and understand the input, while the decoder uses it to produce an output from the understanding of the input the encoder provides. The decoder might do something simple like classify the sentiment of a sentence (happy, sad, a joke, informative) or actually generate a reply, as we see with ChatGPT.

This breakthrough allowed transformer-based models to be trained significantly faster on larger volumes of data because the training workload could be parallelized across several GPUs. RNNs, because of their need to process language sequentially, one word at a time, could not spread work on a body of text across hardware in the same way.

This is why we have multibillion-parameter models now. With a transformer, a short sentence isn’t just a few word associations captured in a neural net; it can be dozens of associations between words. Scale this up, and the amount of information captured for a paragraph in a transformer model is orders of magnitude higher than an RNN processing the same text. We are now beginning to see how this significant leap forward in data capture both helped LLMs generate human-like text and pushed the limits of the hardware they run on.
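To put rough, illustrative numbers on that gap (real models multiply these counts by every layer and every attention head):

```python
# Back-of-the-envelope: a single attention layer computes on the order of
# n^2 pairwise scores for n tokens, while an RNN makes n sequential updates.
n_tokens = 500                      # roughly a few paragraphs of text

rnn_steps = n_tokens                # one hidden-state update per token
attention_scores = n_tokens ** 2    # every token scored against every token

print(rnn_steps)          # 500
print(attention_scores)   # 250000
```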

What are the limits, and what do we need to pass them?

Well, there are a few perceived limitations that have entered the conversation, but how significant they are remains to be seen. Nonetheless, the main areas to consider are memory, context length, and data quality.

Memory

This problem is part of the broader memory wall problem, a term first coined by Wulf and McKee in the 1990s. In the context of LLMs, the number of parameters being trained is growing exponentially faster than memory bandwidth and capacity, which have roughly followed Moore’s law (doubling every two years).

To be specific, these problems mostly relate to the storage of parameters, the management of activation memory, and the management of memory across distributed hardware. The simplest of these to understand is the storage of parameters. For a model to run as quickly and as accurately as LLMs do today, its parameters need to live in the working memory of the system, not in long-term storage the way your documents, music, or photos do. When you have billions of parameters, this can require gigabytes or even terabytes of memory.
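A rough, back-of-the-envelope sketch of the parameter-storage math alone, assuming 16-bit weights and ignoring activations, optimizer state, and everything else a running model needs:

```python
# Bytes needed just to hold the weights in working memory.
# Model sizes below are illustrative round numbers.
def param_memory_gb(n_params, bytes_per_param=2):  # 2 bytes per weight (fp16/bf16)
    return n_params * bytes_per_param / 1e9

for n in (7e9, 70e9, 500e9):
    print(f"{n / 1e9:.0f}B params -> {param_memory_gb(n):,.0f} GB in fp16")

# 7B params   ->    14 GB
# 70B params  ->   140 GB  (already more than a single top-end GPU holds)
# 500B params -> 1,000 GB
```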

For reference, at the time of writing the NVIDIA A100 is one of the premier GPUs for AI, and it has only 80 GB of working memory and roughly 2 TB/s of bandwidth. The problem isn’t easily solved by just chaining GPUs together; there are limits to that as well. Imagine dividing your brain into 1,000 parts. It’s not just that signals take longer to travel; the lower density of connections between regions may mean information is lost between them altogether. Granted, with enough resources some of these challenges can be reduced, though it’s generally accepted that there is an upper limit to how much parallelization can be used for both training and inference (inference being the compute used when querying a trained model for outputs).

So what exactly does this all mean? It means we likely can’t keep adding parameters to models without seeing some reduction in performance. That reduction can be in speed, because the network is so large that traversing it takes longer, or in accuracy, because of the disconnects required to manage data across multiple hardware devices.

Just as transformers had to emerge to overcome the limits of RNNs, researchers are actively exploring how to overcome some of the physical challenges introduced by LLMs and their compute-hungry transformer networks. These techniques include:

  • Optimizing memory during training and quantizing parameters for storage (a minimal quantization sketch follows this list)

  • New software for parallelizing the compute and memory

  • New algorithms that deliver better performance from fewer parameters

  • Utilizing techniques like mixture of experts to divide large models into smaller specialized models.
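As one concrete example of the first item above, here is a toy NumPy sketch of post-training quantization: storing weights as int8 plus a single scale factor instead of fp32, roughly a 4x memory saving at some cost in precision. Real schemes (per-channel scales, GPTQ-style methods, and so on) are considerably more sophisticated; this only illustrates the trade-off.

```python
import numpy as np

# Quantize a toy fp32 weight matrix to int8 and measure the round-trip error.
rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)

scale = np.abs(weights).max() / 127.0                      # map largest weight to int8 range
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale                 # what inference actually uses

print(f"{weights.nbytes / 1e6:.0f} MB in fp32")            # ~67 MB
print(f"{q.nbytes / 1e6:.0f} MB in int8")                  # ~17 MB
print(f"max round-trip error: {np.abs(weights - dequantized).max():.5f}")
```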

The most obvious solution is also patience—just waiting for hardware to steadily improve. 

Context

In June 2023, a paper was released titled “Lost in the Middle: How Language Models Use Long Contexts.” Its general thesis was that large language models, when presented with a long context, tend to remember the beginning and end of the content but “forget” the middle. This is not all that different from humans, who recall the start and end of a list better than the middle (often described in terms of primacy and recency effects).

For those unsure what context refers to, it’s the information supplied to a language model when prompting it. For example, you might include a specific company contract as part of the context and ask the model to summarize it. The way the paper tested this forgotten-middle problem was to create a set of “documents” and provide them to the model. Given that most language models limit how many tokens (roughly, words) can be supplied as context, the researchers chose to take short blurbs from X number of Wikipedia articles, where X was the number of documents added to the context. Pulling from a variety of articles made the information in the context highly differentiated and unrelated. The next step was to ask the language model a question about content in one of the documents. When the document holding the answer was at the beginning or end of the context, the model did well at returning the answer. But when the answer sat in a document in the middle of the set, and therefore in the middle of the context, the model performed poorly at retrieving it.
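A sketch of that kind of evaluation set-up is below. `ask_model`, the documents, and the exact prompt format are placeholders standing in for whichever model API and data you test with; this only illustrates the methodology of sliding the answer-bearing document through the context.

```python
def build_prompt(distractors, answer_doc, position, question):
    # Insert the single answer-bearing document at the chosen position among
    # the unrelated distractor blurbs, then number them all in the prompt.
    docs = list(distractors)
    docs.insert(position, answer_doc)
    numbered = "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))
    return f"{numbered}\n\nQuestion: {question}\nAnswer:"

def position_sweep(ask_model, distractors, answer_doc, question, expected):
    # Ask the same question with the answer document at every position;
    # aggregating this over many questions gives accuracy by position.
    results = {}
    for pos in range(len(distractors) + 1):
        reply = ask_model(build_prompt(distractors, answer_doc, pos, question))
        results[pos] = expected.lower() in reply.lower()
    return results
```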

When charting accuracy against where the answer-bearing document sat in the context, a distinct U shape appeared, showing better performance at the beginning and end. Testing across the available foundation models showed that all of them exhibit the same problem, which suggests it is an emergent issue with transformers. To date, no one has a clear answer for how to solve long-context problems. Some of it might be addressed by fine-tuning domain-specific models rather than relying on retrieval-based techniques, but no clear consensus had emerged at the time of writing.

Data & Data Quality

Fine-tuning a model often requires more data than is readily available to make meaningful improvements. This shortage can be supplemented with synthetic data. A common way to generate synthetic data today is with the techniques used to train early open-source models like Alpaca and Vicuna. The Stanford team behind Alpaca generated its synthetic data by prompting a stronger foundation model (OpenAI’s text-davinci-003) many times and fine-tuning on the generated outputs. For example, if you wanted to fine-tune a language model to understand more about trees, you could prompt a foundation model thousands of times for various descriptions of trees and use those responses as training inputs. For vision models, a useful example would be generating photos of cars in foggy driving conditions (for which there is probably far less available data than sunny-weather driving) and using those generated photos to help a car’s computer vision model better identify vehicles in fog.
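A sketch of that kind of synthetic data loop might look like the following. `generate` stands in for whatever foundation model API you call, and the topic, prompt template, and file name are made up for illustration.

```python
import json

def build_synthetic_dataset(generate, topic="trees", n_examples=1000):
    # Prompt a stronger foundation model repeatedly and keep its generations
    # as instruction/output pairs for fine-tuning a smaller model.
    examples = []
    for i in range(n_examples):
        prompt = (f"Write a short, factual description of a specific kind of "
                  f"{topic}. Vary the species and details. (variation {i})")
        examples.append({"instruction": prompt, "output": generate(prompt)})
    return examples

# Usage sketch:
# dataset = build_synthetic_dataset(my_model_call)
# with open("synthetic_trees.jsonl", "w") as f:
#     f.writelines(json.dumps(ex) + "\n" for ex in dataset)
```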

With regard to data quality, this is a problem I refer to as a regression to the mean. Some recent papers have hinted at this, suggesting ChatGPT has gotten “dumber” as a result of training on user feedback. You can read more about this problem in this article, but the short answer is that LLMs cannot discern whether one piece of information is better or more accurate than another. This is because LLMs don’t fundamentally understand what they are communicating and have no function for determining whether one corpus of text is more accurate or intelligent than another. Since most of the information provided to LLMs came from pulling data off the internet, you can reasonably expect that the majority of it was of average quality. Training on data of average quality produces average responses, due to denser clustering of that content in the model’s embedding space. The only way to meaningfully solve this problem to date is through expert curation of content, which is still a very expensive, human-centric process. It is rumoured that Google is investing upwards of $1B on data quality for training.

Summary

The question remains open: when will we, or have we already, started to hit the point of diminishing returns for these foundation models? So far, we seem to be innovating around these early challenges and continuing to push out advances in model capabilities. But is there a point where we can no longer improve past the constraints we are considering today, or future ones we have yet to discover? This is probably one of the most interesting aspects of large language models: we constructed them, and now have to study them to understand them better, all while they study what we produce to understand us.
