My Notes:
- Pre-Training
- Tokenization
- Training the Model
- Visualizing an LLM
- Inference
- GPT-2
- [The Original Paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- Base Model
- Post-Training
- [InstructGPT (Original Paper)](https://arxiv.org/pdf/2203.02155)
- Hallucinations, Memory, Awareness
- Transformer-related Stuff
- Sources:
Pre-Training
- Download and preprocess the data.
Tokenization
- Process of converting text (strings) into tokens (integers).
Naive Approaches
- Using a word-level or character-level mapping via lookup tables (LUTs).
Problems with this approach
- Word level:
- Vocabulary size is huge.
- Out of vocabulary words.
- Rare words are not handled well.
- Character level:
- Too many tokens.
- Long sequences.
- Doesn't capture semantics.
- Doesn't account for other languages.
Byte Pair Encoding
- Iteratively replaces the most common pair of consecutive tokens with a new token.
- This allows the model to learn sub-word units and handle out-of-vocabulary words more effectively. For example, the word "unhappiness" might be tokenized as "un", "happi", and "ness".
- This is a greedy algorithm and is not optimal (but it is used in practice).
- The way I understand why this handles out-of-vocabulary and rare words well: we humans kind of do the same thing, so suffix/prefix and relational understanding gets captured. (Toy sketch of the merge loop below.)
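To make the merge step concrete, here is a minimal toy sketch of the BPE training loop over raw bytes (for intuition only; not tiktoken's or GPT-2's actual implementation):

```python
from collections import Counter

def most_common_pair(ids):
    """Count consecutive token pairs and return the most frequent one."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)  # the merged pair becomes a single new token
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "unhappiness unhappily unhappy"
ids = list(text.encode("utf-8"))  # start from raw bytes (ids 0..255)
next_id = 256                     # new tokens get ids past the byte range
for _ in range(5):                # each iteration adds one vocab entry
    pair = most_common_pair(ids)
    ids = merge(ids, pair, next_id)
    print(f"merged {pair} -> {next_id}")
    next_id += 1
```

Running more merges on a large corpus grows the vocabulary with frequent sub-word chunks like "un" and "happ".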
Training the Model
- Training the neural network so that its next-token predictions match the statistics of the training set.
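A minimal PyTorch sketch of what one training step looks like; the embedding-plus-linear "model" here is a toy stand-in for a real transformer, and all numbers are made up:

```python
import torch
import torch.nn.functional as F

vocab_size, d_model = 1000, 64
emb = torch.nn.Embedding(vocab_size, d_model)    # token embeddings
lm_head = torch.nn.Linear(d_model, vocab_size)   # toy stand-in for a transformer
opt = torch.optim.AdamW(list(emb.parameters()) + list(lm_head.parameters()), lr=3e-4)

tokens = torch.randint(0, vocab_size, (8, 33))   # fake batch: 8 sequences, 33 tokens
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift by one: predict token t+1

logits = lm_head(emb(inputs))                    # (8, 32, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                  # gradients of loss w.r.t. weights
opt.step()                                       # nudge weights toward the stats
opt.zero_grad()
print(loss.item())                               # ~ln(1000) ≈ 6.9 before training
```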
Visualizing an LLM
- These are transformer architecture visualizations.
Inference
- The process of using the trained model to make predictions on new data.
- Iteratively keep predicting the next token until we reach the end-of-sequence token or some other stopping criterion is met.
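A minimal sketch of that decoding loop, with a dummy random-logits "model" standing in for a real one (GPT-2-style vocab and EOS id assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, eos_id = 50257, 50256      # GPT-2 sizes; the model below is fake

def next_token_logits(context):
    """Stand-in for a trained model: returns random next-token logits."""
    return rng.normal(size=vocab_size)

def generate(prompt_ids, max_new_tokens=50, temperature=1.0):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids) / temperature
        probs = np.exp(logits - logits.max())       # stable softmax
        probs /= probs.sum()
        tok = int(rng.choice(vocab_size, p=probs))  # sample the next token
        if tok == eos_id:                           # stopping criterion
            break
        ids.append(tok)
    return ids

print(generate([464, 3290], max_new_tokens=5))
```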
GPT-2
The Original Paper
- The original paper that introduced the GPT-2 model.
- It was based on the transformer architecture and had 1.5B parameters.
- Context length was 1024 tokens.
- Trained on 100B tokens.
- Current SOTA (CoT) models are around 500~700B parameters (DeepSeek-R1 is 670B). (Gemini's output token limit is 64k; insane scaling, man.)
Base Model
- A token-level simulator of its training dataset.
- Probabilistic model that predicts the next token given the previous tokens.
- Regurgitation: it will recite some stuff it remembers from training.
- It needs few-shot prompting (leveraging its in-context understanding) to act like an assistant; see the sketch below.
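For example, a few-shot prompt like this turns a raw base model into a translator purely by pattern continuation (`complete` here is a hypothetical base-model completion call, not a real API):

```python
# The base model just continues the statistical pattern, so the most likely
# continuation of the last line is the French translation.
prompt = """English: cheese
French: fromage

English: house
French: maison

English: dog
French:"""

# `complete` is hypothetical; with a real base model this would likely
# print something like " chien".
# print(complete(prompt, max_tokens=3))
```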
Post-Training
InstructGPT (Original Paper)
- 68-page paper 🛐.
- Basically taught us how to actually make InstructGPT: by creating a dataset of human demonstrations and human feedback.
- Trained the model to follow instructions and be more helpful, e.g. removing toxic behavior by including such examples and teaching the model to say no.
Open Source Datasets Visualization (the visualization is of UltraChat; it was popular, but now that models have become good, such datasets are mostly synthetic).
Hallucinations, Memory, Awareness
How to reduce hallucinations?
- People figured out that we can add examples to the dataset like "I don't know the answer to that" or "I am not sure about that".
- How do we figure out which questions need that answer? Generate synthetic questions using powerful models and ask the model each question multiple times; if its answer changes or is wrong, add an instruct example where it says it doesn't know (see the sketch after this list).
- After adding many such examples, the model learns that when it is not sure, it should say it doesn't know.
- We also teach the model to use tools like web search via special tokens.
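A rough sketch of that "does the model actually know this?" probing pipeline (`ask_model` is a hypothetical helper, not a real API):

```python
from collections import Counter

def ask_model(question, n=5):
    """Hypothetical: query a strong model n times at temperature > 0
    and return its n answers as strings."""
    raise NotImplementedError

def build_idk_examples(questions, agreement=0.8):
    """If the model can't answer a question consistently, it probably
    doesn't know the fact, so emit an 'I don't know' training example."""
    examples = []
    for q in questions:
        answers = ask_model(q)
        top = Counter(answers).most_common(1)[0][1]  # count of modal answer
        if top / len(answers) < agreement:           # inconsistent -> unsure
            examples.append({"prompt": q,
                             "response": "I'm not sure about that."})
    return examples
```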
Memory
!!! "Vague recollection" vs. "Working memory" !!!
- Knowledge in the parameters == Vague recollection (e.g. of something you read 1 month ago)
- Knowledge in the tokens of the context window == Working memory
Knowledge of Self
- Can be given via the "SYSTEM_PROMPT".
- By itself the model doesn't know what it is, so it will quite probably hallucinate something like "I was developed by OpenAI". That doesn't mean someone stole OpenAI's model or data; it's just generating the most probable answer it knows.
Post-Training Methods
SFT (Supervised Fine-Tuning)
- All the stuff we discussed above about hallucinations and instruct datasets is basically supervised fine-tuning.
- It teaches the model how to follow instructions using labeled data.
RLHF (Reinforcement Learning from Human Feedback)
- Related to what we discussed in "How to reduce hallucinations?" above, e.g. showing the model when to answer "I don't know".
- Reinforces what good behavior is and what bad behavior is.
DPO (Direct Preference Optimization)
- Newer alternative to RLHF. You collect pairs of outputs where one is preferred over the other, then directly train the model to assign higher probability to the preferred one, with no separate reward model (this is why ChatGPT sometimes asks which of two responses you like better).
- More stable and less expensive than RLHF.
- Gets around the same results or better.
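A minimal sketch of the DPO objective, assuming you already have the summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * (policy-vs-ref margin on chosen - margin on rejected)).
    Inputs are per-example sequence log-probs, shape (batch,)."""
    chosen_margin = beta * (policy_chosen - ref_chosen)
    rejected_margin = beta * (policy_rejected - ref_rejected)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy numbers standing in for real sequence log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())  # lower when the policy prefers the chosen response
```

No reward model and no RL rollout loop, which is where the stability and cost savings come from.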
Transformer-related Stuff
Self-Attention
Mathematical Expression for Self-Attention
The self-attention mechanism is like THE core of transformer models that power LLMs:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Where:
- $Q$ represents the queries
- $K$ represents the keys
- $V$ represents the values
- $d_k$ is the dimension of the keys (used for scaling)
- $n$ is the sequence length
The attention weights form an $n \times n$ matrix (row $i$ shows how much query token $i$ attends to each key token $j$). A TikZ sketch of that matrix, with the preamble added so it compiles standalone:

```latex
\documentclass[tikz]{standalone}
\begin{document}
\begin{tikzpicture}[scale=0.8]
  \draw[->] (0,0) -- (4,0) node[right] {Token position (keys)};
  \draw[->] (0,0) -- (0,4) node[above] {Token position (queries)};
  \draw[very thin,color=gray] (0,0) grid (3,3);
  \filldraw[fill=blue!20] (0,0) rectangle (1,1);
  \filldraw[fill=blue!70] (1,0) rectangle (2,1);
  \filldraw[fill=blue!30] (2,0) rectangle (3,1);
  \filldraw[fill=blue!80] (0,1) rectangle (1,2);
  \filldraw[fill=blue!10] (1,1) rectangle (2,2);
  \filldraw[fill=blue!40] (2,1) rectangle (3,2);
  \filldraw[fill=blue!20] (0,2) rectangle (1,3);
  \filldraw[fill=blue!50] (1,2) rectangle (2,3);
  \filldraw[fill=blue!90] (2,2) rectangle (3,3);
  \node at (1.5,-0.5) {Attention Matrix};
\end{tikzpicture}
\end{document}
```
For multi-head attention, we have:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O$$

Where each head is computed as:

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$$

With learnable projection matrices:

$$W_i^Q \in \mathbb{R}^{d_{\mathrm{model}} \times d_k},\quad W_i^K \in \mathbb{R}^{d_{\mathrm{model}} \times d_k},\quad W_i^V \in \mathbb{R}^{d_{\mathrm{model}} \times d_v},\quad W^O \in \mathbb{R}^{h d_v \times d_{\mathrm{model}}}$$
The loss function for next-token prediction is typically cross-entropy:

$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \sum_{v=1}^{V} y_{t,v} \log \hat{y}_{t,v}$$

Where:
- $T$ is the number of tokens in the sequence
- $V$ is the vocabulary size
- $y_{t,v}$ is 1 if token $t$ is the $v$-th word in the vocabulary, 0 otherwise
- $\hat{y}_{t,v}$ is the predicted probability that token $t$ is the $v$-th word
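A NumPy sketch tying the formulas above together as single-head causal self-attention (multi-head just runs $h$ of these with different projections and concatenates the results):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, causal=True):
    """Scaled dot-product self-attention over a (n, d_model) input."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n) attention logits
    if causal:                               # LLMs mask out future positions
        n = scores.shape[0]
        mask = np.tril(np.ones((n, n), dtype=bool))
        scores = np.where(mask, scores, -np.inf)
    A = softmax(scores)                      # rows sum to 1: attention matrix
    return A @ V                             # weighted sum of the values

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv = [rng.normal(size=(d_model, d_k)) for _ in range(3)]
print(self_attention(X, Wq, Wk, Wv).shape)   # -> (4, 8)
```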
Sources
- Taken from Andrej Karpathy's (the GOAT) Deep Dive into LLMs.
- LLM Architecture Visualization
- UltraChat Atlas.