GPT Model: How Does it Work?

During the last few years, the buzz around AI has been enormous, and the main trigger of all this is obviously the advent of GPT-based large language models. Interestingly, the approach itself is not new. LSTM (long short-term memory) neural networks were created in 1997, and the famous paper "Attention Is All You Need" was published in 2017; both became cornerstones of modern natural language processing. But it was only in 2020 that the results of GPT-3 became good enough, not only for academic papers but also for the real world.
Nowadays, everyone can chat with GPT in a web browser, but probably less than 1% of people actually know how it works. Smart and witty answers from the model can lead people to think that they are talking with an intelligent being, but is that so? Well, the best way to figure it out is to see how it works. In this article, we will take a real GPT model from OpenAI, run it locally, and see step by step what is going on under the hood.
This article is intended for beginners and people interested in programming and data science. I will illustrate my steps with Python, but a deep understanding of Python is not required.
Let's get into it!
Loading The Model
For our test, I will be using the GPT-2 "Large" model, made by OpenAI in 2019. This model was state-of-the-art at the time; nowadays it has no commercial value anymore, and it can be downloaded for free from HuggingFace. What is even more important for us is that the GPT-2 model has the same architecture as the newer ones (only the number of parameters is obviously different):
- The GPT-2 "large" model has 0.7B parameters (GPT-3 has 175B, and GPT-4, according to web rumors, has 1.7T parameters).
- GPT-2 has a stack of 36 layers with 20 attention heads (GPT-3 has 96, and GPT-4, according to rumors, has 120 layers).
- GPT-2 has a 1024-token context length (GPT-3 has 2048, and GPT-4 has a 128K context length).
Naturally, the GPT-3 and GPT-4 models provide better results in all benchmarks compared to GPT-2. But first, they are not available for download (and even if they were, running a 175B model would require a very expensive computer), and second, GPT-2 is good enough for understanding how the whole thing works.
To use the model in Python, we need two objects: the model itself and the tokenizer:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, set_seed
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = GPT2LMHeadModel.from_pretrained('gpt2-large')
The transformers library is smart enough to automatically download the model on the first run. Now, let's see how we can use it.
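As a quick check, we can also count the model's parameters directly; the number should roughly match the 0.7B mentioned above (a minimal sketch, using the model object we just loaded):
# Count the model parameters; for GPT-2 "Large" this should be roughly 0.7B
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e9:.2f}B parameters")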
Tokenizer
A tokenizer is a crucial part of every language model. Neural networks cannot work with text directly, and the tokenizer converts the text into an array:
print(tokenizer("Paris will", return_tensors="pt"))
#> tensor([[40313, 481]])
Here, the text "Paris will" was converted into a tensor (in our case, it's an array of digits) [40313, 481]
. It is easy to see that the word "Paris" for the model is just a single token 40313.
We can easily make the backward conversion:
print(tokenizer.decode([40313]))
#> Paris
Interestingly, if we try to encode the lowercase word "paris," we get two tokens instead of one:
print(tokenizer.encode("paris"))
#> [1845, 271]
Here, "paris" was transformed into two tokens, "par" and "is." As we can guess, only the most popular words are converted into single-digit tokens; other words are just split into parts. The reason is straightforward; it's just impossible to encode all English words into one table. This approach also allows the model to learn and use new, unknown, or misspelled words.
Readers can find the full GPT-2 vocabulary in JSON format on GitHub; this file has 50,257 records. Interestingly, the GPT-3 and GPT-4 models use another tokenizer, named tiktoken. It is not compatible with the old one, but the general logic remains the same:
import tiktoken
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")
tokenizer.encode("Paris")
#> [60704]
tokenizer.encode("paris")
#> [1768, 285]
However, the "GPT-2" model is still supported by tiktoken as well:
tokenizer.encoding_for_model("gpt-2").encode("Paris")
#> [40313]
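We can also compare the vocabulary sizes directly; the n_vocab property of a tiktoken encoding should match the 50,257 records mentioned above for GPT-2, while the newer encoding is larger (a small sketch, assuming the tiktoken API imported above):
# Vocabulary sizes of the old (GPT-2) and the new (GPT-3.5/4) tokenizers
print(tiktoken.get_encoding("gpt2").n_vocab)
print(tiktoken.encoding_for_model("gpt-3.5-turbo").n_vocab)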
Running The GPT Model (Easy Way)
Now that the text is converted into tokens, we can use the GPT model to generate the output. With the help of the transformers library, we can do it in about ten lines of code:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = GPT2LMHeadModel.from_pretrained('gpt2-large')
text = "Paris will"
model_input = tokenizer.encode(text, return_tensors="pt")
set_seed(42)
output = model.generate(model_input,
                        max_length=32,
                        pad_token_id=tokenizer.eos_token_id)[0]
print(output)
#> tensor([40313, 481, 307, 262, 717, 1748, 287, 262, 995,
#> 284, 423, 257, 3938, 16359 ... ])
print(tokenizer.decode(output))
#> Paris will be the first city in the world to have a fully
#> automated train system ...
As discussed before, the output of the model is also a tensor; we need the backward conversion to get a text. The set_seed function initializes the internal random generator; with a fixed seed, the GPT model will always return the same string in response to the same prompt.
Running The GPT Model (Hard Way)
We were able to generate the output using the GPT model, but the generation itself is still hidden in the library implementation. Let's go one level deeper and run the process manually, token by token.
As before, we first need to load the model. I will also convert the prompt phrase into tokens:
import torch
import torch.nn.functional as F
text = "London was"
model_input = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
#> tensor([[23421, 373]])
model = GPT2LMHeadModel.from_pretrained('gpt2-large')
model.eval()
Now, let's run the generation. Here is our Step 1:
outputs = model(model_input, labels=model_input)
loss, logits = outputs[:2]
As an output, we get an array of logits, the unnormalized scores for each token in the vocabulary. Let's look at it in more detail:
print(logits)
#> tensor([[[ 1.1894, 4.7469, 0.0803, ..., -6.4411, -5.3999, 1.9996],
#> [-0.1938, 1.7015, -3.6939, ..., -6.9758, -3.3617, 0.0359]]])
print(logits.shape)
#> torch.Size([1, 2, 50257])
What can we get from this output? To understand it, let's recall the original diagram of the Transformer architecture from the "Attention Is All You Need" paper:

First, GPT is a language model; it was trained on terabytes of text, and as an output, it generates the "output probabilities." For example, in the phrase "London was," the probability of the next token "a" is definitely higher than the probability of the word "no."
Second, GPT is also an auto-regressive model, and the token generation runs only one step per iteration. As shown in the picture, the output is "shifted right." In our example, we sent two tokens into the model and got two tensors as an output. Why tensors? The output of the model has a [1, 2, 50257] shape; it is not a single token but an array of probabilities over all tokens (50257 is the size of the GPT-2 vocabulary). In our example, we sent the model an array of two tokens (our input, [23421, 373]) and got two arrays of length 50257 as an output.
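To make the "shifted right" idea more concrete, we can look at the most likely token at each output position: position 0 is the prediction after "London," and position 1 is the prediction after "London was" (a small sketch using the logits and model_input obtained above):
# Each output position predicts the token that follows the input up to
# that position: logits[0, 0] after "London", logits[0, 1] after "London was"
for pos in range(logits.shape[1]):
    best_token = torch.argmax(logits[0, pos]).item()
    context = tokenizer.decode(model_input[0, :pos + 1])
    print(f"after '{context}' -> '{tokenizer.decode([best_token])}'")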
Practically, we are only interested in the last position because we already know the previous tokens. To generate the next token, we take the last tensor and keep the tokens with the highest probabilities:
logits = logits[:, -1, :]  # keep only the last position
#> tensor([[-0.1938, 1.7015, -3.6939, ..., -6.9758, -3.3617, 0.0359]])
top_k = 5
# Mask out everything except the top-k tokens
indices_to_remove = logits[0] < torch.topk(logits[0], top_k)[0][..., -1, None]
logits[:, indices_to_remove] = -float("Inf")
# Sample 5 tokens from the remaining (top-k) distribution
next_tokens = torch.multinomial(F.softmax(logits, dim=-1), num_samples=5)
print(tokenizer.decode(next_tokens.squeeze()))
#> "one", "the", "a", "also", "not"
Here, I printed the five most likely tokens for our prompt, "London was." But practically, we need only one. We also need to append the chosen token to the input for the next step:
next_token = torch.tensor([[next_tokens[0][0]]])
model_input = torch.cat((model_input, next_token), dim=1)
#> [23421, 373, 530]
Now, we are ready for Step 2. The process is the same; the only difference is that we have three tokens as input:
outputs = model(model_input, labels=model_input)
loss, logits = outputs[:2]
print(logits)
#> tensor([[[ 2.1159, 4.8143, -0.3819, ..., -8.6419, -5.5092, 1.1465],
#> [ 0.4149, 1.4974, -2.9283, ..., -7.9501, -3.9587, 0.1875],
#> [-1.2257, 0.9350, -4.2245, ..., -7.4362, -4.9682, -0.8710]]])
print(logits.shape)
#> torch.Size([1, 3, 50257])
Now, we sent three tokens to the model, and it returned three 50257-length arrays of token probabilities. As we can see, the process is surprisingly inefficient: at every step, the model recomputes the outputs for all previous tokens, even though we only need the last one. That's why a high-end GPU is required for fast calculations.
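As a side note, libraries avoid this recomputation in practice by caching the intermediate attention states (the so-called key/value cache). Here is a minimal sketch of the idea with the transformers API, using the use_cache flag and past_key_values; it is only an illustration, not part of our step-by-step example:
# First pass: process the whole prompt once and keep the attention cache
cached = model(model_input, use_cache=True)
past = cached.past_key_values

# Next pass: feed only the newly generated token together with the cache,
# instead of re-running the model on the full sequence
new_token = torch.argmax(cached.logits[:, -1, :], dim=-1, keepdim=True)
cached = model(new_token, past_key_values=past, use_cache=True)
print(cached.logits.shape)
#> torch.Size([1, 1, 50257])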
Let's get the top 5 tokens again and find the next word:
indices_to_remove = logits[0] < torch.topk(logits[0], top_k)[0][..., -1, None]
logits[:, indices_to_remove] = -float("Inf")
next_tokens = torch.multinomial(F.softmax(logits, dim=-1), num_samples=5)
print(tokenizer.decode(next_tokens.squeeze()))
#> "of", "city", "such", "place", "the"
next_token = torch.tensor([[next_tokens[0][0]]])
model_input = torch.cat((model_input, next_token), dim=1)
#> [23421, 373, 530, 286]
After adding a token, we have four tokens in the sequence. We can repeat the process as many times as needed, and this is our final phrase:
print(model_input)
#> tensor([[23421, 373, 530, 286, 262, 1178, 4113, 326,
#> 550, 262, 11917, 284, 1302, 510, 284, 262, 1230, 13]])
print(tokenizer.decode(model_input.squeeze()))
#> London was one of the few places that had the courage to stand
#> up to the government.
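To tie everything together, here is a sketch of the whole generation loop in one place; it uses the same top-k logic as above, just wrapped in a for loop (the number of steps and the top_k value are arbitrary choices):
# Run the per-token generation in a loop: get the logits of the last
# position, keep only the top-k tokens, sample one, and append it
model_input = torch.tensor(tokenizer.encode("London was")).unsqueeze(0)
top_k = 5
with torch.no_grad():  # no gradients are needed for generation
    for _ in range(16):
        logits = model(model_input).logits[:, -1, :]
        indices_to_remove = logits[0] < torch.topk(logits[0], top_k)[0][..., -1, None]
        logits[:, indices_to_remove] = -float("Inf")
        next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
        model_input = torch.cat((model_input, next_token), dim=1)

print(tokenizer.decode(model_input.squeeze()))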
By the way, taking the top-N most likely tokens is only one possible strategy. There are different ways of choosing the next token, and readers who are interested in more details can read a HuggingFace blog post from 2020:
How to generate text: using different decoding methods for language generation with Transformers
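For example, several of those strategies are available directly through the generate method of the transformers library; here is a small sketch comparing greedy decoding, beam search, and top-p sampling (the parameter values are arbitrary):
model_input = tokenizer.encode("Paris will", return_tensors="pt")

# Greedy decoding: always take the most likely token
greedy = model.generate(model_input, max_length=32,
                        pad_token_id=tokenizer.eos_token_id)

# Beam search: keep several candidate sequences at every step
beams = model.generate(model_input, max_length=32, num_beams=5,
                       pad_token_id=tokenizer.eos_token_id)

# Top-p (nucleus) sampling: sample from the smallest set of tokens
# whose cumulative probability exceeds p
set_seed(42)
sampled = model.generate(model_input, max_length=32, do_sample=True,
                         top_p=0.9, pad_token_id=tokenizer.eos_token_id)

for output in (greedy, beams, sampled):
    print(tokenizer.decode(output[0]))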
Conclusion
In this article, we were able to run a GPT model and generate the output token by token. I hope this helps readers better understand how GPT works.
With this understanding, we can also try to answer another question: can this model have any kind of consciousness? I think the answer is obvious. On the one hand, a GPT model was trained on terabytes of data; it remembers a lot of facts and has a lot of "encyclopedic" knowledge. On the other hand, we can see several properties of the generation process that prevent us from saying that this model is truly conscious:
- The GPT model itself is frozen. Its knowledge is limited by the date when it was created. The model itself is just an array of weights that can be saved in a file and stored on a CD-ROM, hard disk, or SD card (and this file, naturally, cannot have any thoughts or intentions on its own).
- The GPT model itself has no memory and cannot learn anything new. The model file is "read-only," and it does not change during a conversation. Every new request is calculated from scratch. Readers may ask how a dialog with GPT is possible if it has no memory. Well, modern libraries like LangChain can summarize previous conversation details and automatically add them to the next prompt. The chat history can also be stored by web developers in a database. However, the GPT model itself is stateless and does not "remember" any conversation after the generation is finished.
- Last but not least, we can see the lack of the main "secret ingredient": self-consciousness. GPT is a language model. It can generate impressive answers in response to our prompts, and it does that well. But without these prompts, the model does not generate anything on its own. We can send the same prompt 10 times and get the same response 10 times; the model will never change its "mind." What is self-consciousness? We as humans take it for granted, but as far as I know, there is no clear answer yet about how it works. The model itself has no internal process of "thinking," no "feedback loop," no goals, and no intentions.
Can these problems be solved? Well, it's literally a billion-dollar question. Nowadays, progress in AI is very fast, and nobody knows what will happen in the future. Predictions about AGI (Artificial General Intelligence) vary from "we'll hit AGI in just 5 years" to "AGI is decades (30–50+ years) away." We can only be sure that right now, thousands of teams and individuals around the world are probably trying to achieve this goal. When will they succeed? For now, it's unknown.
Thanks for reading. If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors. You are also welcome to connect via LinkedIn. If you want to get the full source code for this and other posts, feel free to visit my Patreon page.
Those who are interested in using language models and natural language processing are also welcome to read other articles:
- A Weekend AI Project (Part 1): Running Speech Recognition and a LLaMA-2 GPT on a Raspberry Pi
- A Weekend AI Project (Part 2): Using Speech Recognition, PTT, and a Large Action Model on a Raspberry Pi
- A Weekend AI Project (Part 3): Making a Visual Assistant for People with Vision Impairments
- LLMs for Everyone: Running LangChain and a MistralAI 7B Model in Google Colab
- LLMs for Everyone: Running the LLaMA-13B model and LangChain in Google Colab
- 16, 8, and 4-bit Floating Point Formats – How Does it Work?