My Notes:

Pre-Training

Tokenization

Naive Approaches

Problems with this approach

Byte Pair Encoding

Training the Model

Visualizing an LLM

Inference

GPT-2

Base Model

Post-Training

Hallucinations, Memory, Awareness

How to reduce hallucinations?

Memory

!!! "Vague recollection" vs. "Working memory" !!!

Knowledge of Self

Post-Training

SFT (Supervised Fine-Tuning)

RLHF (Reinforcement Learning from Human Feedback)

DPO (Direct Preference Optimization)

Transformer-related Stuff

Self-Attention

Mathematical Expression for Self-Attention

The self-attention mechanism is THE core of the transformer models that power LLMs.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:

- $Q$, $K$, $V$ are the query, key, and value matrices (each a learned projection of the input token embeddings)
- $d_k$ is the dimension of the keys; dividing by $\sqrt{d_k}$ keeps the dot products at a reasonable scale before the softmax

The attention weights:

\begin{document}
\begin{tikzpicture}[scale=0.8]
\draw[->] (0,0) -- (4,0) node[right] {Token position (keys)};
\draw[->] (0,0) -- (0,4) node[above] {Token position (queries)};
\draw[very thin,color=gray] (0,0) grid (3,3);

\filldraw[fill=blue!20] (0,0) rectangle (1,1);
\filldraw[fill=blue!70] (1,0) rectangle (2,1);
\filldraw[fill=blue!30] (2,0) rectangle (3,1);

\filldraw[fill=blue!80] (0,1) rectangle (1,2);
\filldraw[fill=blue!10] (1,1) rectangle (2,2);
\filldraw[fill=blue!40] (2,1) rectangle (3,2);

\filldraw[fill=blue!20] (0,2) rectangle (1,3);
\filldraw[fill=blue!50] (1,2) rectangle (2,3);
\filldraw[fill=blue!90] (2,2) rectangle (3,3);

\node at (1.5,-0.5) {Attention Matrix};
\end{tikzpicture}
\end{document}
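
To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention. The shapes, variable names, and toy inputs are illustrative assumptions, not taken from any particular implementation.

```python
# A minimal sketch of scaled dot-product attention, assuming toy shapes.
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (seq_len, d_k), K: (seq_len, d_k), V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) score matrix
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # output: (seq_len, d_v)

# Toy usage: 3 tokens, d_k = d_v = 4 (made-up sizes)
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, weights = attention(Q, K, V)
print(weights.round(2))  # the kind of matrix the TikZ figure above visualizes
```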

For multi-head attention, we have:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

Where each head is computed as:

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

With learnable projection matrices:

- $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$
- $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$
- $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$
- $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$
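
A minimal sketch of multi-head attention, reusing the `attention()` helper from the previous sketch. Representing the projections as per-head lists of matrices and the dimensions (`d_model`, `h`, `d_k`) are illustrative assumptions.

```python
# A minimal sketch of multi-head attention (assumes attention() from above).
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # X: (seq_len, d_model); W_Q/W_K/W_V: lists of h per-head projection matrices
    heads = []
    for i in range(len(W_Q)):
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]  # per-head projections
        head, _ = attention(Q, K, V)                  # head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
        heads.append(head)
    concat = np.concatenate(heads, axis=-1)           # Concat(head_1, ..., head_h)
    return concat @ W_O                               # final output projection

# Toy usage: d_model = 8, h = 2 heads, d_k = d_v = 4 (made-up sizes)
rng = np.random.default_rng(1)
d_model, h, d_k = 8, 2, 4
X = rng.normal(size=(3, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O).shape)  # (3, 8)
```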

The loss function for next-token prediction is typically cross-entropy:

$$\mathcal{L} = -\sum_{i=1}^{n} \sum_{j=1}^{V} y_{i,j} \log(p_{i,j})$$

Where:

- $n$ is the number of token positions in the sequence
- $V$ is the vocabulary size
- $y_{i,j}$ is $1$ if token $j$ is the correct next token at position $i$, and $0$ otherwise
- $p_{i,j}$ is the model's predicted probability of token $j$ at position $i$
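
A minimal sketch of this loss in NumPy. Because $y_{i,j}$ is one-hot, the double sum collapses to picking the log-probability of the true next token at each position. The toy logits, vocabulary size, and target indices are made-up values.

```python
# A minimal sketch of next-token cross-entropy loss, assuming toy inputs.
import numpy as np

def cross_entropy(probs, targets):
    # probs: (n, V) predicted next-token probabilities (rows sum to 1)
    # targets: (n,) index of the true next token at each position
    n = probs.shape[0]
    true_probs = probs[np.arange(n), targets]  # p_{i, y_i} at each position
    return -np.sum(np.log(true_probs))

# Toy usage: 3 positions, vocabulary of 5 tokens
rng = np.random.default_rng(2)
logits = rng.normal(size=(3, 5))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax
targets = np.array([1, 0, 3])
print(cross_entropy(probs, targets))
```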

Sources

LLM-Deep-Dive.png