A Beginner-Friendly Introduction to LLMs

I've been wanting to write a tutorial on Large Language Models (LLMs) for a while now, and I've been thinking about how to build a series of beginner-friendly articles for understanding and getting started with them. In this first article, I will provide a beginner-friendly introduction to LLMs and explain the key concepts in a simple way, without digging too far into the technical details. My hope is that after reading it, you will feel more comfortable reading more advanced documentation on LLMs.
Table of contents
· 1. Introduction · 2. LLMs definition · 3. Some LLMs ∘ 3.1. BERT family ∘ 3.2. GPT family ∘ 3.3. PaLM family ∘ 3.4. LLaMA family · 4. LLMs system ∘ 4.1. General architecture ∘ 4.2. Training process ∘ 4.3. LLM's inputs and outputs · 5. Use cases · 6. How to adapt LLMs ∘ 6.1. Fine-tuning ∘ 6.2. Prompting · 7. Challenges · 8. Conclusion
1. Introduction
Large Language Models, or LLMs for short, are the current topic of discussion not only in research but also in industry. Their remarkable ability to generate human-like text across a wide range of fields and tasks has woven them into many aspects of our lives: from powering virtual assistants and chatbots to more specialized services such as language translation, content generation and sentiment analysis.
While the term "Large Language Models" is relatively recent, the concept of a "language model" has been in use for a considerable period. It describes any machine learning model that aims to predict or generate plausible language. However, that description no longer captures what most recent language models can do, hence the adoption of the term LLMs.
So, what exactly are LLMs? To answer this question, we will start by defining LLMs. Next, we will present some well-known LLMs as examples before describing their general architecture. We will then see how they are used and which techniques are used to adapt them. Finally, we will discuss some challenges and limitations to consider before and during the use of LLMs.
2. LLMs definition
To define LLMs, let's start by examining the definitions set by some of the leading AI organizations:
Definition 1:
Large language models (LLMs) are deep learning algorithms that can recognize, summarize, translate, predict, and generate content using very large datasets. (NVIDIA) [1]
Definition 2:
A large language model (LLM) is a type of artificial intelligence model that utilizes machine learning techniques to understand and generate human language. (Red Hat) [2]
Definition 3:
Large language models (LLMs) are machine learning models that are very effective at performing language-related tasks such as translation, answering questions, chat and content summarization, as well as content and code generation. LLMs distill value from huge datasets and make that "learning" accessible out of the box. (Databricks) [3]
Definition 4:
LLMs, or Large Language Models, are the key component behind text generation. In a nutshell, they consist of large pretrained transformer models trained to predict the next word (or, more precisely, token) given some input text. Since they predict one token at a time, you need to do something more elaborate to generate new sentences other than just calling the model – you need to do autoregressive generation. (Hugging Face) [4]
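To make Definition 4 a little more concrete, here is a minimal sketch of autoregressive (next-token) generation, assuming the Hugging Face transformers library and the small, freely available GPT-2 checkpoint; greedy decoding (always picking the most likely next token) is used to keep the loop simple.

```python
# Minimal sketch of autoregressive generation with greedy decoding,
# assuming the Hugging Face "transformers" library and the small GPT-2 model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("Large language models are", return_tensors="pt").input_ids

# Repeatedly predict the most likely next token and append it to the sequence.
for _ in range(20):
    with torch.no_grad():
        logits = model(input_ids).logits              # (batch, seq_len, vocab_size)
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

In practice you would call the library's generate() method, which wraps this same loop and adds more elaborate decoding strategies such as sampling and beam search.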
As for my personal definition, I see LLMs as language models with a larger architecture (a larger number of weights), trained on larger datasets, requiring a larger amount of resources (hardware and software) and capable of performing a larger set of tasks with greater efficiency than traditional language models.
3. Some LLMs

Now that we have a clear definition of LLMs, let's review some of the existing models. Note that the list below is not exhaustive; there are other powerful models available today, and I have selected only a few to keep this article concise. The models are grouped into families, and for each model I try to present its architecture, year of release, number of weights, training process and training datasets.
3.1. BERT family

Bidirectional Encoder Representations from Transformers [5], or BERT for short, is one of the earliest LLMs, published in 2018 by the Google research team. It's a transformer-based deep neural network that processes text in both directions (left-to-right and right-to-left), which enables it to capture context from both sides simultaneously.
BERT's architecture is a bidirectional Transformer encoder, released in two sizes: BERT-base with 110M parameters and BERT-large with 340M parameters. BERT's input can be a single sentence or a pair of sentences packed together into a single sequence, where every sequence starts with the special token [CLS] and the two sentences of a pair are separated by the special token [SEP].
BERT is pre-trained on the BooksCorpus (800M words) and English Wikipedia (2,500M words) and can be fine-tuned by simply adding an output layer, producing models for a range of tasks such as question answering.
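To illustrate both the input format and the idea of fine-tuning by adding an output layer, here is a hedged sketch using the Hugging Face transformers library; the sentence pair and the two-label classification head are arbitrary choices made for the example.

```python
# Sketch: BERT's input format and a classification head on top, assuming the
# Hugging Face "transformers" library; sentences and label count are illustrative.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A sentence pair is packed into one sequence: [CLS] sentence A [SEP] sentence B [SEP]
ids = tokenizer.encode("How old is BERT?", "BERT was published in 2018.")
print(tokenizer.convert_ids_to_tokens(ids))
# -> ['[CLS]', 'how', 'old', ..., '[SEP]', 'bert', ..., '[SEP]']

# "Adding an output layer": this class places a fresh classification head
# on top of the pretrained encoder, ready to be fine-tuned.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
inputs = tokenizer("How old is BERT?", "BERT was published in 2018.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]): one score per class
```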
After BERT, several BERT-based models were proposed, for example to improve its robustness (RoBERTa [6]) or to reduce memory consumption and speed up training (ALBERT [7]).
3.2. GPT family

Generative Pre-trained Transformer, or GPT for short, is a series of LLMs developed by OpenAI. Like BERT, GPT models are transformer-based deep neural networks pre-trained on large amounts of text data.
GPT-1 [8], the first LLM of the series, was published in 2018. It consists of a 12-layer Transformer decoder with a total of 117M parameters. Similarly to BERT, GPT-1 is pre-trained on the BooksCorpus and can be fine-tuned by adding an output layer to build models for more specific natural language processing tasks, such as question answering and machine translation.
GPT-2 [9], the second LLM of the GPT family, was published in 2019. It follows the GPT-1 architecture with a total of 1.5B parameters and was trained on a larger dataset, WebText, from which Wikipedia documents were removed to avoid overlap with evaluation data. Another particularity of this model is that it can learn some tasks without explicit supervision; in other words, no fine-tuning is needed (zero-shot task transfer).
GPT-3 [10] came in 2020 and demonstrated that increasing the size of language models improves performance across various tasks without task-specific training. Its architecture is similar to GPT-2's, with a total of 175B parameters, and it was trained on an even larger dataset than GPT-2.
GPT-4 [11], launched in 2023, is the latest and most powerful LLM of the GPT family at the time of writing. It's a multimodal, transformer-based LLM that takes text and images as input and produces text as output. GPT-4 is pre-trained on public and licensed datasets to predict the next token, and then fine-tuned using Reinforcement Learning from Human Feedback (RLHF for short), a machine learning approach used to align the model's behavior with human feedback and improve its performance.
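The parameter counts quoted above can be checked directly for models whose weights are public. Here is a small sketch; since GPT-3 and GPT-4 weights are not public, the smallest GPT-2 checkpoint (about 124M parameters on the Hugging Face hub) is used as the example.

```python
# Sketch: counting the parameters of a publicly available checkpoint,
# assuming the Hugging Face "transformers" library.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # smallest GPT-2 size
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # roughly 124M for this checkpoint
```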
3.3. PaLM family

Pathways Language Model [12], or PaLM, is a decoder-only transformer-based large language model released by Google in 2022, with a total of 540B parameters and trained on 780B tokens. Training such a huge model on such a huge dataset was made possible by Pathways, a machine learning system for highly efficient training of LLMs, described as follows:
… a new ML system which enables highly efficient training of very large neural networks across thousands of accelerator chips, including those spanning multiple Tensor Processing Units (TPU) v4 Pods. [12]
This very large architecture allowed PaLM to outperform fine-tuned models on multi-step reasoning tasks. It also performs well on multilingual tasks and source code generation. It was trained on multilingual social media conversations, webpages, books in English, GitHub code, multilingual Wikipedia and news in English.
After PaLM, several PaLM-based models have been released, such as U-PaLM [13] and Flan-PaLM [14].
3.4. LLaMA family

LLaMA is a collection of openly released language models developed by Meta, with the first collection released in February 2023. Like GPT, LLaMA models are transformer-based deep neural networks pre-trained on large datasets.
The first set of LLaMA models [15], called LLaMA-1, is a collection of models with parameter counts ranging from 7B to 65B, trained on trillions of tokens from exclusively public datasets such as CommonCrawl, C4, GitHub and Wikipedia. These models are based on the transformer architecture with improvements inspired by GPT-3, PaLM and GPT-Neo. Through this collection, Meta demonstrated that strong performance can be achieved by smaller models trained on larger datasets; notably, LLaMA-13B outperformed GPT-3 on most benchmarks [15].
In July of the same year, the second set of LLaMA models [16], called LLaMA-2, was released. LLaMA-2 includes pretrained and fine-tuned LLMs with parameter counts ranging from 7B to 70B, trained on new public datasets. It is an updated version of LLaMA-1, fine-tuned to align with human preferences.
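Because the LLaMA-2 weights are openly distributed (after accepting Meta's license), they can be loaded through the Hugging Face hub. The snippet below is only a sketch: it assumes the gated model id meta-llama/Llama-2-7b-chat-hf, that access has been granted, and that you are logged in with a Hugging Face token.

```python
# Sketch: loading and prompting a LLaMA-2 checkpoint (assumes access to the
# gated "meta-llama" repositories and a configured Hugging Face token).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed, gated model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

prompt = "Explain large language models in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```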
4. LLMs system
Now that we have explored some of the existing LLMs, I'd like to highlight the key components of the overall system, namely the general architecture of the models, the training process and the nature of the inputs and outputs.

4.1. General architecture
In general, the architecture of LLMs is based on the Transformer model, published in the 2017 paper "Attention Is All You Need". A Transformer is an encoder-decoder architecture built around the self-attention mechanism. The encoder maps the input sequence of symbol representations to a sequence of continuous representations, while the decoder generates, from the encoder's output, a sequence of symbols one element at a time. The self-attention mechanism enables the Transformer to capture dependencies between words.
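To give a feel for the self-attention mechanism, here is a minimal sketch of scaled dot-product attention for a single head, written with PyTorch; real Transformers stack multiple heads with learned projections, layer normalization and feed-forward blocks on top of this core operation.

```python
# Minimal sketch of single-head scaled dot-product self-attention,
# the core operation inside the Transformer.
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5    # similarity of every token pair
    weights = torch.softmax(scores, dim=-1)          # attention weights
    return weights @ v                               # context-aware representations

torch.manual_seed(0)
x = torch.randn(5, 16)                               # 5 tokens, d_model = 16
w_q, w_k, w_v = (torch.randn(16, 8) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)        # torch.Size([5, 8])
```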
Some LLMs, like the BERT family, follow an encoder-only design that focuses on understanding the context of the input text, which suits tasks such as text classification. Others, like the GPT, PaLM and LLaMA families, opt for a decoder-only design that focuses on generating output text based on the provided context.
4.2. Training process
Based on the training process, LLMs can be divided into two types:
- The first type includes LLMs that are pretrained on a large corpus of text using unsupervised learning to capture general linguistic patterns and semantic relationships in the data. These LLMs can then be fine-tuned on specific tasks with labeled data to improve performance and adapt the learned knowledge to the specifics of the target task.
- The second type includes LLMs that are trained to perform well on several tasks without fine-tuning (zero-shot learning). These models can still be adapted to a specific task by providing a description or prompt of the task, and can be improved further by giving them a handful of additional examples for the task, as sketched below.
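To illustrate the second type, here is a sketch of how a zero-shot and a few-shot prompt might be assembled for a sentiment task; the wording and examples are arbitrary, and in practice the resulting string would be sent to whichever model you use.

```python
# Sketch: building zero-shot and few-shot prompts for a sentiment task.
# The prompt wording and examples are illustrative; the assembled string
# would be fed to the LLM of your choice.
task = "Classify the sentiment of the review as positive or negative."

zero_shot_prompt = f"""{task}
Review: "The battery dies after an hour."
Sentiment:"""

few_shot_prompt = f"""{task}
Review: "Absolutely loved it, would buy again." Sentiment: positive
Review: "Broke on the first day." Sentiment: negative
Review: "The battery dies after an hour." Sentiment:"""

print(zero_shot_prompt)
print(few_shot_prompt)
```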
4.3. LLM's inputs and outputs
LLMs were initially designed to work on text data: they typically take raw text as input, which can be sentences, paragraphs or documents, and output text that can likewise be sentences, paragraphs or documents. The raw input is transformed into tokens that represent words or subword units, depending on the tokenization algorithm employed. Similarly, LLMs output tokens that are converted back into human-readable text using the model's tokenizer.
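This round trip from text to tokens and back can be seen directly with a tokenizer; here is a small sketch using the GPT-2 tokenizer from the Hugging Face transformers library.

```python
# Sketch: raw text -> token ids -> back to text, assuming the Hugging Face
# "transformers" library and the GPT-2 tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "LLMs tokenize text into subword units."
ids = tokenizer.encode(text)
print(ids)                                    # a list of integer token ids
print(tokenizer.convert_ids_to_tokens(ids))   # the subword pieces
print(tokenizer.decode(ids))                  # back to the original text
```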
Some recent LLMs operate not only on text but also on images to generate text output, like GPT-4. Others go further, operating on videos and audio as well and generating images alongside text, like Gemini. Here we're talking about a more complex architecture, generally referred to as Multimodal Language Models (MLMs).
5. Use cases
At this point, you may wonder where and how exactly these large models are used. LLMs indeed have a wide range of applications across various fields thanks to their ability to understand and generate human-like text. Some LLM use cases include (a short code sketch of a few of them follows the list):
- Text classification: classifying text into predefined classes or labels, as in sentiment analysis, spam detection and document classification.
- Named entity recognition: detecting entities mentioned in text, such as names of people or locations.
- Question answering: generating answers to questions while considering the context.
- Text generation: generating relevant text, as in translation, summarization and paraphrasing.
- Dialogue systems: building conversational agents, chatbots or assistants that can engage in natural language conversations with users.
Today, LLMs are also involved in more complex tasks such as code generation and problem solving.
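Several of these use cases can be tried in a few lines with the Hugging Face pipeline API; the snippet below is a sketch using the library's default checkpoints, which are small models chosen automatically rather than the large models discussed above.

```python
# Sketch: trying a few of the listed use cases with Hugging Face pipelines.
# Default checkpoints are downloaded automatically; they are small models,
# not the large ones discussed in this article.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("I really enjoyed this tutorial!"))

ner = pipeline("ner", aggregation_strategy="simple")
print(ner("Ada Lovelace worked with Charles Babbage in London."))

qa = pipeline("question-answering")
print(qa(question="Who developed LLaMA?",
         context="LLaMA is a collection of language models developed by Meta."))
```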
6. How to adapt LLMs
When LLMs are trained, they acquire general abilities for solving various tasks. Most of the time, however, we need an LLM to perform well on one specific task. Therefore, several techniques have been proposed to adapt LLMs to a given task, among which we will present fine-tuning and prompting.
6.1. Fine-tuning
Fine-tuning is a popular approach in machine learning in general. It consists of further training a pretrained model on a specific task or data domain. With LLMs, it's commonly used for tasks like text classification, sentiment analysis and language translation.
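As a rough sketch of what fine-tuning looks like in code, the snippet below adapts a small pretrained checkpoint to a sentiment classification dataset with the Hugging Face Trainer API; the dataset (imdb), the DistilBERT checkpoint and the hyperparameters are illustrative choices, not a recipe.

```python
# Rough sketch of fine-tuning a small pretrained model for text classification,
# assuming the "transformers" and "datasets" libraries; dataset, checkpoint
# and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="sentiment-model",
                         num_train_epochs=1,
                         per_device_train_batch_size=16)

trainer = Trainer(model=model,
                  args=args,
                  train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
                  eval_dataset=tokenized["test"].select(range(500)))
trainer.train()
```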
6.2. Prompting
In generative AI models, prompting means providing textual input to guide the model's output. It generally consists of supplying questions, instructions, examples and/or input data. There are various prompt engineering techniques, including Chain of Thought (CoT), Tree of Thought (ToT), self-consistency, expert prompting, etc.
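As an illustration, here is a sketch contrasting a direct prompt with a few-shot Chain of Thought prompt for a small arithmetic question; the question, the worked example and the phrasing are arbitrary, and the string would be passed to whatever model or API you use.

```python
# Sketch: a direct prompt vs. a few-shot Chain of Thought (CoT) prompt.
# The question and worked example are illustrative.
question = ("A library has 120 books, lends out 45 and receives 30 new ones. "
            "How many books does it have now?")

direct_prompt = f"Q: {question}\nA:"

cot_prompt = f"""Q: Tom has 3 boxes with 4 apples each and eats 2 apples. How many are left?
A: Let's think step by step. 3 boxes x 4 apples = 12 apples. After eating 2, 12 - 2 = 10. The answer is 10.
Q: {question}
A: Let's think step by step."""

print(direct_prompt)
print(cot_prompt)
```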
7. Challenges
So far, we have talked about the power and potential of LLMs and their applications to human-like tasks. This power comes with new challenges, some of which you may already infer from the previous sections:
- The first obvious challenge is cost, in every sense: training LLMs requires large datasets that are sometimes not public, significant computational resources such as GPUs or TPUs, large amounts of memory, and human resources with different backgrounds such as machine learning engineers, data scientists and researchers. Training can also take days, weeks or even months.
- Another challenge, and the most important one, concerns ethical considerations: safety, security, fairness and privacy. LLMs can generate content that is harmful, offensive or dangerous, and can be exploited for malicious purposes such as generating fake news, which threatens users' safety and security. They can also raise fairness issues by leading to unfair treatment or discrimination against certain groups, and privacy issues by leaking confidential information.
- LLMs also suffer from hallucinations, outdated knowledge and reproducibility issues.
8. Conclusion
Here comes the end of this article! In it, I shared a brief introduction to LLMs. Through this introduction, we defined LLMs according to some leading AI companies, explained their general architecture, and presented some of the existing models. This is the first article on LLMs and certainly not the last! I will be writing more tutorials on LLMs and their various technologies, with examples, so stay tuned.
My aim through my articles is to provide readers with clear, well-organized and easy-to-follow tutorials, offering a solid introduction to the diverse topics I cover and promoting good coding and reasoning skills. I am on a never-ending journey of self-improvement, and I share my findings with you through these articles. I myself frequently refer back to my own articles as resources when needed.
Thanks for reading this article. If you appreciate my tutorials, please support me by following me and subscribing to my mailing list. This way, you'll receive notifications about my new articles. If you have any questions or suggestions, feel free to leave a comment.
References
[1] https://www.nvidia.com/en-us/glossary/large-language-models/
[2] https://www.redhat.com/en/topics/ai/what-are-large-language-models
[3] https://www.databricks.com/product/machine-learning/large-language-models
[4] https://huggingface.co/docs/transformers/llm_tutorial
[5] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[6] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … & Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
[7] Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
[8] Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
[9] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
[10] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., … & Amodei, D. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
[11] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., … & McGrew, B. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
[12] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., … & Fiedel, N. (2023). Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240), 1–113.
[13] Tay, Y., Wei, J., Chung, H. W., Tran, V. Q., So, D. R., Shakeri, S., … & Dehghani, M. (2022). Transcending scaling laws with 0.1% extra compute. arXiv preprint arXiv:2210.11399.
[14] Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., … & Wei, J. (2024). Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70), 1–53.
[15] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., … & Lample, G. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
[16] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., … & Scialom, T. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Image credits
All images and figures in this article whose source is not mentioned in the caption are by the author.