A Reasonably Digestible Explanation of Transformers

Aidan Tilgner
11 min read · May 24, 2024


Transformers can be complicated, but they don’t have to be! Join me on a comprehensive, easy-to-understand exploration of transformers.

(If you’d like to listen to me read this article, find it for free on my Substack, where I post weekly)

Most of the models and products that are popular today–ChatGPT, Stable Diffusion, Midjourney, etc.–rely on the “transformer” architecture. In 2017, Google AI researchers released a paper entitled Attention is All You Need, which detailed transformers and their groundbreaking performance implications. Since then, the industry has widely adopted this architecture as the state of the art.

OpenAI, Google, Meta, Anthropic, and other top players in the burgeoning AI industry have built their cash cows on top of this particular architecture. Why is it so special? What makes it tick under the hood? My hope is that these questions and more will be answered for you by the end of this post.

Transformers explained…

I want to clarify that I won’t be diving deep into the math behind these models or other highly technical details. My goal here is to provide you with sufficient knowledge to better understand how these chatbots really work when you talk to them. Think of this as more of a starting point into other ventures, such as Prompt Engineering–which I’ve written about before–or even further AI research.

Understanding tokens…

When you first give a prompt to a transformer, it converts your input into a form that’s more manageable for a computer: the text is split into tokens, the basic units of text which serve as the input for various models, and each token is mapped to a numeric ID. This process can vary, but we’ll break down OpenAI’s GPTs (Generative Pretrained Transformers) as an example. As described by OpenAI’s Tokenizer Platform:

Large language models (sometimes referred to as GPT’s) process text using tokens, which are common sequences of characters found in a set of text. The models learn to understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens.

For example, when you input the phrase “I like cheeseburgers”, gpt-3.5-turbo’s encoder divides it into the following tokens:

I| like| cheese|burg|ers

Where “I”, “ like”, “ cheese”, “burg” and “ers” are all interpreted separately. Each of these tokens is then mapped to a numeric ID in the model’s vocabulary:

[40, 1093, 17604, 10481, 388]

These numbers aren’t meaningful on their own; they’re simply indices into the model’s vocabulary. What gives them meaning is that each one points to a learned representation, covered in the next section, which captures the semantic content of the token. That means that while we may see these as just a bunch of numbers, to the corresponding model they have meaning.
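If you’d like to try this yourself, OpenAI’s open-source tiktoken library exposes the same tokenizer. Here’s a minimal sketch, assuming you’ve installed it with pip:

```python
# pip install tiktoken
import tiktoken

# Load the tokenizer that gpt-3.5-turbo uses (the cl100k_base encoding).
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

ids = enc.encode("I like cheeseburgers")
print(ids)                             # [40, 1093, 17604, 10481, 388]
print([enc.decode([i]) for i in ids])  # ['I', ' like', ' cheese', 'burg', 'ers']
```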

Semantic embedding…

Once our input has been converted into token IDs, it is then processed through an embedding layer. This means that each token is mapped into a high-dimensional space, which captures richer semantic and syntactic information. Going back to how the tokens get their meaning, each one will essentially be mapped to an area which holds relevant information based on past learning.

I find it helpful to imagine this as walking through a field of random objects, which are sorted by their relevance to one another. Suppose you’re looking for a cheeseburger, and you notice there is a napkin near you. Looking around the napkin, you notice a ketchup bottle a little further away. Approaching the ketchup, you notice a hot dog, and soon a spatula as well. Eventually, you find your cheeseburger, surrounded by objects with some sort of relation to it. The more related the object, the closer it is to the cheeseburger.

In our case, instead of a cheeseburger, we have tokens, and these tokens are placed in a much higher-dimensional space than the 2D field we’re imagining. This means that the complex landscape of relations between objects can be represented in more ways than just spatial proximity. Information relevant to each individual token is therefore captured in this step and fed forward to the next layer.
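To make this concrete, here’s a toy sketch of an embedding lookup. The table below is random, purely for illustration; in a real model its values (and its much larger size) are learned during training:

```python
import numpy as np

# Toy sizes for illustration only; real models use far larger tables.
vocab_size, embedding_dim = 100_000, 16
rng = np.random.default_rng(0)

# In a real model this table is learned during training; here it's just random.
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

token_ids = [40, 1093, 17604, 10481, 388]  # "I like cheeseburgers" from earlier
embeddings = embedding_table[token_ids]    # shape (5, 16): one vector per token
print(embeddings.shape)
```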

The next layer simply adds “positional encodings”, which allow the transformer to understand the order of tokens. They’re needed because the model doesn’t process the input step by step, so it has no built-in sense of word order. If we take our example tokens from above–“I”, “ like”, “ cheese”, “burg” and “ers”–this step would order them for the model’s reference:

  1. “I”
  2. “ like”
  3. “ cheese”
  4. “burg”
  5. “ers”

(As a side note, 3Blue1Brown has an excellent video visually representing this process.)
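For the curious, here’s a minimal sketch of the sinusoidal positional encodings from the original paper. (GPT-style models actually learn their position embeddings instead, but the idea of tagging each token with its position is the same.)

```python
import numpy as np

def positional_encoding(seq_len, dim):
    # Sinusoidal encodings from "Attention Is All You Need".
    positions = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    dims = np.arange(dim)[None, :]                               # (1, dim)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / dim)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, dim))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])                  # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])                  # odd dimensions: cosine
    return encoding

# One row per token position; added element-wise to the token embeddings.
print(positional_encoding(5, 16).shape)                          # (5, 16)
```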

Paying attention…

Now we can get to the core of the model, the self-attention mechanism. This is the groundbreaking technique that has allowed modern artificial intelligence models to perform so well. At its heart, the attention mechanism enables the model to focus on relevant data when processing each token.

For each token, the attention mechanism calculates a score for every other token in the input context, based on how relevant it is. This means that when processing an individual token, the transformer can focus on only semantically relevant data. This process is repeated for every token in the input sequence, generating a single output vector for each.
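Here’s a rough sketch of that scoring step, known as scaled dot-product attention, using made-up toy inputs:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # One attention head: every token scores every other token for relevance.
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (tokens, tokens) relevance scores
    weights = softmax(scores, axis=-1)         # each row becomes a probability distribution
    return weights @ v                         # one output vector per input token

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                   # 5 token vectors, 16 dimensions each
w_q, w_k, w_v = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # (5, 16)
```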

These output vectors are calculated in “transformer blocks,” which include multiple components such as the Multi-Head Attention (MHA) mechanism and the Feed-Forward Network (FFN). The MHA is crucial because it allows the model to focus on different parts of the input simultaneously, working in multiple subspaces. However, the FFN is equally important as it processes each token in higher-dimensional spaces, adding depth to the model’s understanding.

Without the FFN blocks, we might encounter what’s known as the “attention bottleneck,” where the capacity of individual attention heads, which typically operate in smaller dimensions, could limit the model’s performance. The FFN helps overcome this by handling larger dimensions, ensuring the model remains efficient and effective.
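For comparison, here’s a toy sketch of the usual two-layer FFN, which expands each token’s vector into a wider space and projects it back down:

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    # Applied to each token's vector independently: expand, nonlinearity, project back.
    hidden = np.maximum(0, x @ w1 + b1)              # ReLU in a wider hidden layer
    return hidden @ w2 + b2                          # back down to the embedding size

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                         # token vectors coming out of attention
w1, b1 = rng.normal(size=(16, 64)), np.zeros(64)     # expand 16 -> 64 dimensions
w2, b2 = rng.normal(size=(64, 16)), np.zeros(16)     # project 64 -> 16 dimensions
print(feed_forward(x, w1, b1, w2, b2).shape)         # (5, 16)
```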

Additionally, attention generally uses a method called softmax to prioritize which data points to focus on, making some values much more significant than others. The softmax function converts the attention scores into probabilities that sum to one, effectively weighting the relevance of each token. The FFN, in contrast, uses different mathematical functions (nonlinearities such as ReLU or GELU) to transform each token’s representation, adding expressive power that the attention step alone doesn’t provide.

You can think of the softmax step as taking a list of candidates and ranking them by probability, so that the most relevant ones dominate the model’s focus while the rest fade into the background.

(3Blue1Brown also has a great visual explanation of this.)

Next word prediction…

Eventually, every output vector will be mapped into a larger “vocabulary space”, which contains all the possible tokens that the model could output to the user. A function is then applied to each vector within this space to turn them into a probability distribution over all possible output tokens. This step determines the likelihood of each token being the next token in the sequence.

From the probability distribution, the most likely next word in the sequence is chosen. A bit of variability comes into play here, which allows for fine-tuning of the next word selection process, and a bit of variety in model responses. Temperature, for example, effectively controls the randomness of the model’s outputs by allowing less probable words to be chosen. After the next word is selected, it’s added to the input sequence, and the process repeats.
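Here’s a small sketch of that selection step with made-up logits, showing how temperature reshapes the distribution before a token is sampled:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0):
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                        # softmax: probabilities that sum to one
    return np.random.choice(len(probs), p=probs), probs

# Made-up scores over a tiny 5-token vocabulary.
logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])
token_id, probs = sample_next_token(logits, temperature=0.7)
print(probs.round(3), "-> picked token", token_id)
```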

This “autoregressive” design means that the process repeats until the token picked from the probability distribution is a special stop token. That is when you have your completed response from the model, and you can use it as you please. This is also why the words from an LLM response seem to appear one at a time.
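Putting the loop itself into pseudocode-ish Python (the model and tokenizer here are placeholders, not a real API), it looks something like this:

```python
def generate(model, tokenizer, prompt, max_tokens=100, stop_token_id=0):
    # Hypothetical autoregressive loop: predict a token, append it, repeat.
    ids = tokenizer.encode(prompt)
    for _ in range(max_tokens):
        logits = model(ids)             # scores over the vocabulary for the next token
        next_id = int(logits.argmax())  # greedy pick; temperature sampling also works here
        if next_id == stop_token_id:    # the model decided the response is complete
            break
        ids.append(next_id)             # the new token becomes part of the input
    return tokenizer.decode(ids)
```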

Putting it all together…

So, to recap these various steps, we start with the tokenization process. This process splits the text input (or other input) into smaller, numerically manageable units, usually called tokens. Then, these tokens are embedded into a semantic space, which essentially means that each token is associated with other semantically relevant information from the training data.

After this has been done, we arrive at the self-attention mechanism, which ranks each word and its relation to the other words in the input sequence, and generates an output vector for it. Eventually, a single output is computed from the various output vectors, and used to find a probability distribution of what the next token in the sequence might be. Once that probability distribution has been generated, the most probable word can be chosen as the predicted “next token”, which is then added to the sequence.

This is done repeatedly until the next token predicted is the “stop token”, which signifies that the response is complete.

The special part…

There were several key problems that the transformer architecture solved through its innovative design. From the attention mechanism to more performant training and other scalability features, there is a reason this is the state-of-the-art.

Scalability…

The aforementioned paper, Attention is All You Need, aimed to improve the performance of Google Translate and other machine translation technologies. The previous architecture popular for this task, Recurrent Neural Networks (RNNs), experienced performance drop-off with larger amounts of text. This is because RNNs work sequentially, processing text in a specific order, with each data point being processed in relation to previous ones.

This sequential nature means that as the amount of input increases, so does the amount of computation required to process it. The efficiency is significantly diminished as the input size increases, making it challenging to scale these models. Enter transformers.

Transformers also see an increase in computation requirements with increased input. However, they are highly parallelizable. This means that the computations required for processing the next token can be spread out across multiple operations that happen simultaneously, rather than in order. This parallelism is where GPUs become crucial, as they enable parallel computing.

Additionally, due to their parallelizable nature, transformers can be trained much more efficiently using modern GPU datacenters. You might have heard of the number of “parameters” a model has and its impact on overall performance. The number of parameters in a model can be thought of as a measure of the model’s complexity and, therefore, its capacity for understanding intricate patterns.

More complexity requires more training and, consequently, more compute power, but it also results in a model capable of handling more complex tasks. I like to think of this as the models “head” getting bigger, and therefore it being able to “wrap its head” around bigger things. As transformers have grown larger, we’ve seen significant performance improvements.

Capturing more detail…

Another issue with Recurrent Neural Networks is that they have a hard time capturing long-term dependencies and the complexity of human language. They suffer greatly from something known as the “vanishing gradient problem,” which essentially means that as the input sequence gets longer, the training signal from earlier tokens fades away, so the model struggles to learn dependencies between words that are far apart.

This leads to a situation where you get diminishing returns instead of greater performance as inputs grow longer and more complex. This isn’t good when you want your model to work on a large set of input. Transformers mitigate this problem because they use a self-attention mechanism, which allows them to consider all positions in the input sequence simultaneously, instead of sequentially.

Additionally, transformers use a concept called “positional encodings,” which essentially tells the model where each token in the sequence is ordered. This means their simultaneous processing doesn’t compromise their understanding of the order of sequences. Since the order of words in language is an important nuance, capturing it without relying on a sequential process is a huge win for performance.

The most critical aspect of the architecture, however, is likely the “self-attention” concept for which the introductory paper was named. Self-attention allows the model to capture complex nuances between various tokens in a sequence across a very large semantic space. Each input token can therefore be understood much more deeply by the model, allowing for more accurate and useful next-token predictions.

Generally applicable…

Aside from the practicality of training, we’ve also seen a monumental increase in the general applicability of these models. Big companies like OpenAI and Google don’t want to create a model for your specific use case; they want to create a model for ALL specific use cases. The goal for them is to create a foundation model, something that other companies can adapt and specialize to their own needs — for a price, of course.

So, there is a huge benefit in creating something which is not only performant on some tasks but can be applied to a wide range of tasks. Transformers allow for this, being used not only for language tasks but also for processing video, audio, and images. For example, Midjourney’s image generation, OpenAI’s Whisper audio transcription, and OpenAI’s Sora video creator are all transformer-based.

These models are context-aware due to their self-attention and can capture the relevant parts of an input sequence regardless of the task. This is why patterns like retrieval-augmented generation are so successful at producing better generations: they help the model focus on specific, relevant context.

However, aside from inserting task-relevant context directly into a prompt, you can also adapt a model more fundamentally towards a certain set of data. Through a process called “fine-tuning,” a model’s “parameters” can be adjusted toward a specific set of data. This allows for the use of a foundation model in a more specialized context by nudging the model to always have certain context more readily available.
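As a very rough sketch, a fine-tuning loop in PyTorch looks something like this (the model and data loader are placeholders; real pipelines add much more around this core):

```python
import torch
import torch.nn.functional as F

def fine_tune(model, data_loader, epochs=1, lr=1e-5):
    # Nudge a pretrained model's parameters toward a specialized dataset.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for input_ids, labels in data_loader:  # batches of your specialized data
            logits = model(input_ids)          # assumes the model returns raw logits
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
            loss.backward()                    # work out how each parameter should change
            optimizer.step()                   # apply a small adjustment
            optimizer.zero_grad()
    return model
```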

Change on the horizon…

Transformers have been the cool kid on the block for quite some time now, and for good reason. However, novelty fades and innovation does not waste time finding new solutions to old problems. Transformers, for all of their performance and efficiency improvements, might not be sustainable for much longer, as energy availability limits their massive compute requirements. In a few years, we could look back at this architecture the same way we now look back at Recurrent Neural Networks.

Mamba, for example, is a novel deep learning architecture which has sent waves through AI communities by solving some of the problems which transformers face today. It scales linearly as the input sequence grows, as opposed to the quadratic scaling of transformer attention, making it more efficient. This is done while maintaining a much simpler design and performing better than comparable transformers in preliminary tests.

While it’s too early to know if Mamba specifically will be the next state-of-the-art, it’s worth considering that transformers are only a stepping stone towards the future.

(Let me know if you’d be interested in a similar post to this one going over the Mamba architecture.)

How to apply this…

I can’t tell you exactly what to take away from this post or what to learn from here, as that will depend greatly on your own goals and journey. However, from this post I hope that you understand AI overall a bit better, and specifically transformers. These models are only going to get more advanced over time, but will likely rely on the same fundamental ideas.

Understanding them is the key to leveraging them as they grow in influence and popularity. You’ll likely start to see them enter your workplace to some degree in the future, if you haven’t already. So, it is my goal to help you understand these models better, and give you actionable insights to put your best foot forward. Thank you for your time!

Going Further

Click here to find this post with included up-to-date resources for continued learning on my Substack!

If you’re interested, I post weekly articles on my Substack, Silicon and Synapses. You can find that here.
