AI has taken the world by storm, with AI models breaking new ground at an increasingly rapid rate. One of Revium's Lead Developers, Aleks, explains how the Transformer, an innovative neural network, sparked the current wave of AI.
When OpenAI released ChatGPT in November 2022, it quickly made waves across the world. Before ChatGPT, however, the concept of Large Language Models (LLMs) was relatively unfamiliar to the general public.
While many are aware of OpenAI's contributions in the space, very few know that this wave of AI innovation actually started much earlier, back in 2017. A Google research paper titled Attention Is All You Need laid the groundwork for almost everything we are seeing today. The paper introduced the concept of the Transformer, a neural network architecture built around a self-attention mechanism. Unlike older neural architectures, it doesn't rely on recurrent units, which means it trains much faster.
So, what exactly is the Transformer and how does it power today's AI models? In this article, we explore the Transformer architecture, understand its components, and discuss how it shaped the AI we use today.
A Transformer is a type of neural network that accepts a sequence of words as input and produces another sequence of words as output. It achieves this by understanding the context and maintaining the relationships between these sequences of words. For instance, given the question "What is the capital of France?" the Transformer's answer would be "Paris is the capital of France".
In the Transformer, each sentence is divided into individual chunks called tokens, where each word or punctuation mark represents a single token. For example, the sentence "What is the capital of France?" would be divided into seven tokens: "What", "is", "the", "capital", "of", "France", and "?". The Transformer then analyses these tokens to understand the relationships between them. It focuses on specific words and their connections, like "capital" and "France". Based on this analysis, the Transformer generates a new text sequence as output. In our example, it would identify "Paris" as the capital of France and produce the answer: "Paris is the capital of France".
This is a simplified explanation, but it provides a basic idea of how Transformers work.
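To make this concrete, here is a minimal Python sketch of word-level tokenisation. Real LLM tokenisers learn subword vocabularies (such as byte-pair encoding), so this is purely illustrative:

```python
import re

def simple_tokenize(sentence: str) -> list[str]:
    # Keep words and punctuation marks as separate tokens.
    # Real tokenisers use learned subword vocabularies; this is a toy version.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(simple_tokenize("What is the capital of France?"))
# ['What', 'is', 'the', 'capital', 'of', 'France', '?']
```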
Early neural networks processed data sequentially, predicting the next word in a sequence based on previous words. For example, in the context of "weather forecast," the word "forecast" might be suggested after "weather". However, this method had its limitations, primarily slow processing speed for long sequences. These early neural networks also struggled to maintain context over extended sequences, which limited their ability to generate coherent, contextually rich text and connect concepts within a paragraph.
Transformers, on the other hand, address these issues. They process an entire sequence simultaneously, which significantly improves their speed. Moreover, they can capture complex relationships and dependencies across long text sequences, greatly enhancing language understanding and generation capabilities. This is why they have become the preferred architecture for many natural language processing tasks.
The Transformer is a complex assembly of various components that work together to produce the final output.
We start with the input sentence; in our previous example, this is "What is the capital of France?".
The input is first divided into tokens, and each token is converted into a mathematical vector called an embedding. These embeddings capture the semantic information of each token and its context within the sentence.
In simpler terms, imagine a two-dimensional space where each token is assigned a position based on its meaning. Tokens with similar meanings, such as "town" and "city", are positioned closer together. "Capital" is placed a bit further away, reflecting its related, but not identical, meaning to "city" and "town". Tokens like "France" and "Paris" are placed close together, indicating their connection as a country and its capital.
On the other hand, tokens like "Istanbul" and "Paris" are distant in this space, as they don't share a close contextual relationship.
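To illustrate, here is a small NumPy sketch. The 2-D coordinates below are invented purely for this example (real models learn vectors with hundreds or thousands of dimensions), but they show how similarity between embeddings can be measured:

```python
import numpy as np

# Toy 2-D "embedding space": coordinates invented purely for illustration.
embeddings = {
    "town":     np.array([1.0, 1.0]),
    "city":     np.array([1.1, 0.9]),
    "capital":  np.array([1.5, 0.5]),
    "France":   np.array([-1.0, 1.0]),
    "Paris":    np.array([-0.9, 1.1]),
    "Istanbul": np.array([-1.5, -0.5]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Close to 1.0 means the vectors point the same way (similar meaning).
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["town"], embeddings["city"]))      # high
print(cosine_similarity(embeddings["Paris"], embeddings["France"]))   # high
print(cosine_similarity(embeddings["Paris"], embeddings["Istanbul"])) # lower
```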
The Transformer, unlike earlier neural networks, does not inherently understand positional information. Therefore, a method called positional encoding is added to the token embeddings. It allows the Transformer to identify each token's precise location in a sentence. By incorporating positional encodings, the Transformer can interpret the sequence of tokens, ensuring it understands their order and relative positions.
For instance, there's a significant difference between "Paris is in France" and "France is in Paris". The sentence "Paris is in France" makes perfect sense, as Paris is a city within France. However, "France is in Paris" does not make sense.
Positional encodings add information about each token's position to the input embeddings. Before the sentence is fed into the neural network, each token is assigned a number indicating its position. For the sentence "What is the capital of France?", "What" would have the number 1, "is" would be assigned number 2, and so on. Essentially, the Transformer stores information about the token's position in the data itself, rather than in the structure of the network.
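For the curious: in the original paper, positions are not encoded as plain integers but as sine and cosine waves of different frequencies, which the network can learn to interpret. A minimal NumPy sketch of that scheme (the dimension sizes here are chosen arbitrarily for illustration):

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    # Sinusoidal encoding from "Attention Is All You Need":
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]       # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

# One row per token of "What is the capital of France?" (7 tokens).
print(positional_encoding(seq_len=7, d_model=8).shape)  # (7, 8)
```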
The embedded sentence with the positional encoding is then fed into the "encoder". Each encoder layer consists of two sub-layers: self-attention and feed-forward mechanisms.
Self-Attention: This part lets the Transformer pay attention to important parts of the sentence when encoding each token. Unlike older models that process information one token at a time, self-attention looks at every token in relation to all others. This helps the Transformer understand how tokens relate to each other. Let's take the sentence "What is the capital of France?". Using self-attention, the Transformer decides how important each token is to understand the whole sentence. It evaluates how "capital" is linked to "France" and understands the significance of the token "What" and the character "?" in identifying it as a question. It captures the relationship between these tokens to find the answer. This mechanism helps the encoder understand the context and relationships needed for accurate sentence comprehension.
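At its core, self-attention is the scaled dot-product attention formula from the paper: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where the query (Q), key (K), and value (V) matrices are learned projections of the token embeddings. Here is a minimal NumPy sketch; the matrices below are random stand-ins rather than learned weights:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity between every pair of tokens
    # Row-wise softmax turns scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Random stand-ins for the 7 tokens of "What is the capital of France?".
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(7, 4))  # 7 tokens, a toy embedding size of 4
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.shape)  # (7, 7): one attention weight per pair of tokens
```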
Feed-Forward: This sub-layer functions as a fine-tuner, refining the output from self-attention to capture subtle details that might be overlooked by self-attention alone. This combined analysis allows the Transformer to achieve superior performance in various tasks like machine translation and sentiment analysis.
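In the paper, this sub-layer is simply two linear transformations with a ReLU activation in between, applied to each token's vector independently. A minimal sketch, with toy dimensions chosen arbitrarily:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward network: FFN(x) = max(0, x W1 + b1) W2 + b2,
    # applied to each token's vector independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Toy dimensions: 7 tokens, model width 4, hidden width 16 (all arbitrary).
rng = np.random.default_rng(1)
x = rng.normal(size=(7, 4))
W1, b1 = rng.normal(size=(4, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 4)), np.zeros(4)
print(feed_forward(x, W1, b1, W2, b2).shape)  # (7, 4)
```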
After processing the input sequence, the encoder passes its output to the decoder. Unlike the encoder, the decoder operates auto-regressively, generating the output sequence one token at a time. It uses both the previously generated tokens and the encoder's output as inputs, continuing this process until an end-of-sentence token is generated.
While the decoder uses sub-layers similar to the encoder, it handles self-attention differently. For instance, in the sentence "Paris is the capital of France", when computing the token "capital", the decoder shouldn't access subsequent tokens such as "of" or "France". Instead, it should only attend to "capital" and any preceding tokens. This rule applies to all tokens, each of which can only attend to those before it. To prevent the decoder from peeking at future tokens in the sequence, a masking technique is applied before calculating attention. This mask blocks future tokens while allowing access to past ones, ensuring the decoder considers only the information available up to the current token.
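This masking is typically implemented by adding negative infinity to the attention scores for future positions before the softmax, so those positions receive zero weight. A minimal sketch (it would be added to the scores matrix in the earlier attention example):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    # Position i may only attend to positions <= i. Masked entries are set
    # to -inf so that the softmax assigns them zero attention weight.
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(mask == 1, -np.inf, 0.0)

print(causal_mask(4))
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]
```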
The decoder's final output is a sequence of words, which in response to the question "What is the capital of France?" would be "Paris is the capital of France."
The Transformer architecture has undoubtedly revolutionised Generative AI. Its ability to analyse entire sequences simultaneously, coupled with its powerful self-attention mechanism, has led to significant improvements in language understanding and generation. This complex technology has played a pivotal role in the current AI landscape.
By Lead Developer Aleks Trpkovski