LLMs 101

Complete reference guide

Everything you need to understand Large Language Models

Every topic from the interactive mind map, laid out as a readable guide. No technical background required.

Contents

What is a Large Language Model?

Large Language Models

The foundation

A Large Language Model (LLM) is an AI system trained to understand and generate human language at scale. The "large" refers to two things: the vast amount of text it was trained on, and the billions of mathematical parameters (weights) inside it.

At its core, an LLM does one deceptively simple thing: given some text, predict what comes next. When trained extraordinarily well across trillions of words, reasoning, knowledge, and language understanding emerge as side effects of that prediction task.

The four key dimensions of how LLMs work are: the mathematics underpinning them, the training process that builds them, the architectural choices that define them, and the prompting techniques that get the best out of them.

Key concepts Next token prediction Parameters Foundation models Emergent capabilities

The Mathematics of LLMs

LLM maths isn't exotic — it builds on linear algebra, calculus, and probability applied at enormous scale. The entire forward pass that converts your prompt into a response is essentially a chain of matrix multiplications. Every token becomes a vector — an ordered list of ~4,096 numbers — and every transformation the model applies is a matrix multiplication.

Linear algebra

Vectors & matrices

The foundation of all LLM computation. Every token is converted into an embedding vector — an ordered list of ~4,096 floating-point numbers that encodes its meaning as a position in high-dimensional space. Words with similar meanings end up geometrically close in this space.

Every transformation the model applies — attention, feed-forward layers — is a matrix multiplication: multiplying two grids of numbers together. The billions of "parameters" in a model are literally the individual numbers inside these matrices.

Key concepts Embedding vectors Weight matrices Matrix multiplication Dot products Cosine similarity

Calculus & gradient descent

How models learn

Calculus drives training. The model's error — how wrong its predictions are — forms a high-dimensional "loss landscape." Training is the process of rolling a ball downhill on that landscape, adjusting parameters in the direction that reduces error.

The gradient tells you which direction downhill is. Backpropagation is the algorithm that efficiently calculates that gradient across billions of parameters simultaneously, flowing error signals backwards through the network layer by layer. The Adam optimiser is the most widely used variant for adapting the learning rate per parameter.

Key concepts Loss function Gradient descent Backpropagation Learning rate Adam optimiser

Probability & statistics

Softmax & sampling

The final step of every forward pass produces a probability distribution — not just "the next word is X" but a ranked list across all ~100,000 tokens in the vocabulary.

The temperature parameter controls this distribution. At temperature 0, you always pick the highest-probability token (deterministic, repetitive). At temperature 1, you sample proportionally (creative, varied). At temperature 2, the distribution flattens (chaotic).

Top-p (nucleus) sampling is a refinement that samples only from the smallest set of tokens whose cumulative probability exceeds a threshold p — avoiding very low-probability tokens regardless of temperature.

Key concepts Softmax function Temperature Top-p sampling Top-k sampling Perplexity

Self-attention mechanism

Query / Key / Value

The key innovation of the Transformer (Vaswani et al., 2017). Self-attention asks: for every token, which other tokens in the input should I pay attention to when building my understanding of this token?

It computes three vectors per token: a Query (what am I looking for?), a Key (what do I offer?), and a Value (what information do I carry?). The attention score between any two tokens is the dot product of their Q and K vectors, scaled and passed through softmax to produce weights. Those weights determine how much each token's Value contributes to the output.

This is computed in parallel across all tokens simultaneously — far faster than older recurrent (LSTM) models that processed one token at a time. Modern models use multi-head attention, running this process in parallel across many independent heads.

Key concepts Multi-head attention Query / Key / Value Attention weights Scaled dot-product Positional encoding

How LLMs are Trained

Training a frontier model is one of the most resource-intensive activities in computing. It happens in sequential stages: data collection → pre-training → instruction tuning → alignment. Training Meta's Llama 3 70B consumed approximately 6.4 million GPU-hours on H100 chips — roughly $30–50M AUD for a single run.

Data collection & cleaning

Web crawl, books, code

Models are trained on massive text corpora: web crawls (Common Crawl is the most used source), books (Books3, Project Gutenberg), Wikipedia, GitHub code repositories, scientific papers (ArXiv), and more. Meta's Llama 3 was trained on 15 trillion tokens. OpenAI has never disclosed GPT-4's exact data volume.

The raw data is enormous but messy — full of spam, duplicate content, low-quality pages, and toxic material. Significant engineering effort goes into filtering, deduplication, and quality scoring before training begins. The quality of training data arguably matters as much as the architecture itself.

Key concepts Common Crawl The Pile RedPajama FineWeb C4 dataset

Pre-training

Next token prediction

The foundational and most expensive stage. The model starts with random parameters (essentially noise) and learns by reading text, trying to predict the next token, comparing its prediction to the actual token, calculating how wrong it was (the "loss"), and adjusting all parameters slightly to be less wrong. This process repeats billions of times.

The result is a base model — sometimes called a foundation model — that deeply understands language but is strange to talk to. If you ask "What is the capital of France?", it might continue your text as if it's a quiz, not a question. Base models need further stages to become useful assistants.

Key concepts Causal language modelling Next token prediction Foundation models Base models Training loss

Instruction tuning (SFT)

Supervised fine-tuning

After pre-training, supervised fine-tuning (SFT) trains the model on thousands to millions of examples of the desired input-output format: human message → assistant reply → human message → assistant reply.

This is where a base model becomes a usable assistant — it learns the conversational format, how to respond helpfully, and basic safety behaviours. SFT is relatively cheap compared to pre-training. Every major model (GPT, Claude, Gemini, Llama-Instruct) goes through this stage. The quality and diversity of the SFT dataset heavily influences the model's personality and instruction-following ability.

Key concepts Conversational format FLAN dataset Alpaca ShareGPT Open Hermes

Reinforcement Learning from Human Feedback

Reward model, PPO

The step that defines modern chatbots. Human raters are shown multiple responses to the same prompt and rank them from best to worst. These rankings train a separate reward model that learns to predict human preferences.

The main LLM is then fine-tuned using reinforcement learning (specifically PPO — Proximal Policy Optimisation) to generate responses the reward model scores highly. This gives models their tendency to be helpful, to decline harmful requests, and to structure answers in particular ways.

Anthropic's Constitutional AI is a variation where the model is also evaluated against a written set of principles — a "constitution" — and the model itself uses these to self-critique and revise responses before human raters see them.

Key concepts Reward model PPO optimiser Constitutional AI Human preference data Proximal Policy Optimisation

DPO — Direct Preference Optimisation

Direct preference optimisation

A more recent and efficient alternative to RLHF that achieves similar alignment results without needing a separate reward model. DPO directly optimises the language model on preference pairs — shown a prompt and a preferred vs rejected response — using a mathematically elegant reformulation that treats the LLM itself as the implicit reward model.

Many open-source models use DPO: Mistral models, Zephyr, many Llama fine-tunes, and Tulu. Results are comparable to RLHF for most tasks at significantly lower training cost, making it popular in the research community and for smaller labs that can't afford full RLHF infrastructure.

Key concepts Preference pairs Implicit reward model Zephyr model Tulu ORPO variant

Synthetic training data

AI-generated training

Models like Claude, GPT-4, and Gemini are now partially trained on data generated by other AI models. Meta used Llama 3 to help generate instruction-following training data for Llama 3. Microsoft's Phi series (Phi-1, Phi-2, Phi-3) is almost entirely trained on high-quality synthetic data generated by GPT-4 — and achieves remarkable performance for its tiny size.

This raises interesting questions: can AI quality improve recursively? Research suggests it can — up to a point — but errors and biases compound over generations if synthetic data isn't carefully curated and filtered. It also enables labs to generate specific kinds of training data that are rare or expensive to collect naturally.

Key concepts Phi-3 (Microsoft) Orca methodology WizardLM Self-Instruct Magpie dataset

Model Architectures & Families

Not all LLMs are built the same way. The Transformer (2017) is the universal foundation — but within that, there are major design variants. The AI landscape is also divided between closed models (GPT-4, Claude, Gemini — weights never released) and open weight models (Llama, Mistral, Qwen, DeepSeek — weights publicly available).

Dense transformer

GPT-2, GPT-3 style

The classic architecture — every parameter is activated for every token processed. Simple, well-understood, and the foundation of early large models. GPT-2 (2019) and the original GPT-3 (2020) were dense transformers.

The limitation: at very large scales, running all parameters for every token becomes prohibitively expensive. A 175B dense model must activate all 175B parameters to process a single token. This drove research into more efficient architectures like Mixture of Experts, where only a fraction of parameters activate per token. Dense models are still widely used for smaller scales (7B–13B) where efficiency is less critical.

Examples GPT-2 BERT T5 original GPT-3 LLaMA 1 Falcon

Mixture of Experts (MoE)

Routing & efficiency

Now dominant at the frontier. Instead of one large neural network, MoE models have many smaller "expert" networks and a router that decides which 2–4 experts should handle each token. You get the total capacity of a very large model while only activating a fraction of parameters per token.

Mixtral 8x7B (Mistral AI) has 46.7B total parameters but only activates ~12.9B per token — similar inference cost to a 13B dense model but with much higher knowledge capacity. GPT-4 is widely believed to be an MoE model (never confirmed by OpenAI). DeepSeek V3 uses MoE with 671B total but only ~37B active parameters per token.

Examples Mixtral 8x7B GPT-4 (believed MoE) Qwen 3.5 DeepSeek V3 Switch Transformer

Reasoning models

o1, o3, DeepSeek R1

A different paradigm: rather than generating an answer immediately, the model produces a long internal "chain of thought" — working through the problem step by step before giving its final answer. This dramatically improves performance on mathematics, logic, and complex coding problems at the cost of being slower and more expensive to run.

OpenAI's o1 (September 2024) and o3 were the first widely deployed reasoning models. DeepSeek R1 (January 2025) shocked the industry by matching o1's benchmark performance at a fraction of the training cost — reportedly ~$6M vs hundreds of millions.

Examples OpenAI o1 OpenAI o3 DeepSeek R1 Gemini Thinking QwQ-32B

OpenAI model family

GPT-4o, o1, o3

GPT-4o ("omni") — the most widely used commercial model as of 2024–25. Fast, natively multimodal (text, images, and audio processed together), available via ChatGPT and API.

o1 / o3 — reasoning-first models that produce internal chain-of-thought before answering. Dramatically better at maths and complex logic. Slower and more expensive. o3 is the current frontier reasoning model.

GPT-4.1 — the most recent in the GPT-4 family as of early 2026, optimised for instruction following and long-context coding tasks.

Models GPT-4o o1 o3 GPT-4.1 ChatGPT DALL-E 3

Anthropic — Claude models

Constitutional AI

Anthropic was co-founded in 2021 by former OpenAI researchers with a specific focus on AI safety research. Their models are known for strong writing, nuanced instruction following, a 200K context window, and strong coding ability.

Constitutional AI is Anthropic's key differentiator: rather than purely human feedback, Claude is also trained against a written set of principles — a "constitution" — that the model itself uses to self-critique and revise its own responses before human raters evaluate them. This allows more scalable and principled alignment.

Models Claude 3.5 Sonnet Claude 3 Opus Constitutional AI claude.ai Anthropic API

Google DeepMind — Gemini

Gemini 2.0 family

Google's frontier model family, built natively multimodal from the ground up — text, images, audio, and video are treated as equal inputs rather than text being primary with vision bolted on as a separate module.

Gemini 2.0 Flash is notably fast and cost-efficient, widely used in production applications. Google has extended Gemini to a 1 million token context window — enough to process entire large codebases or long books in a single prompt. Gemini powers Google Search AI Overviews and deep Workspace integrations.

Models Gemini 2.0 Flash Gemini 2.0 Pro 1M token context Google AI Studio Vertex AI

Open weight models

Llama · Mistral · DeepSeek

Meta Llama 3 — fully open weights. Llama 3.3 70B is exceptional for its size and highly competitive with closed models. Being open enables the entire Ollama ecosystem.

Mistral AI (France) — punches above its weight. Mistral 7B outperformed much larger models at launch (2023). Mixtral 8x7B is a landmark open MoE model.

DeepSeek (China) — shocked the world with R1 in January 2025, matching OpenAI o1 for ~$6M training cost. DeepSeek V3 is an exceptional open MoE model. Qwen (Alibaba) and Gemma (Google) round out the major open families.

Models Llama 3.3 70B Mistral Large Mixtral 8x7B DeepSeek V3 Qwen 3.5 Gemma 3 Phi-4

Prompting Techniques

Prompting is often underrated as a skill, but technique has a massive effect on output quality — arguably more impact than switching between similar-sized models. The model has the capability; your job is to activate it precisely. Key principles: be specific about context and goal; specify format explicitly; give the model a role; and use examples.

Chain of thought prompting

Step-by-step reasoning

One of the most powerful prompting techniques. Adding "let's think step by step" or "think through this carefully before answering" to a prompt measurably improves performance on reasoning tasks, even in standard (non-reasoning) models.

You force the model to generate intermediate reasoning tokens before its conclusion — those intermediate tokens become part of its context, influencing the final answer. Works particularly well for maths, logic, multi-step planning, and complex decision-making.

Reasoning models (OpenAI o1, DeepSeek R1) essentially do chain-of-thought automatically and at greater depth, running hundreds or thousands of reasoning tokens internally before producing their visible response.

Key concepts Zero-shot CoT Tree of Thoughts Self-consistency ReAct prompting Scratchpad

Few-shot prompting

Examples in context

Providing examples before your actual request so the model extrapolates the pattern. Rather than "classify this email: [email]", you provide two or three labelled examples first, then your actual input.

Especially powerful for: specific output formats, classification tasks, style matching, and any task where showing is clearer than telling. The model doesn't need to be retrained — it pattern-matches from examples in its context window alone. This is called in-context learning, and it was one of the surprising capabilities that emerged from large-scale pre-training.

Key concepts Zero-shot (no examples) One-shot (1 example) Few-shot (2–5 examples) In-context learning Format matching

System prompts

Frames model behaviour

The invisible instructions that frame the entire conversation. Every commercial AI product — Claude.ai, ChatGPT, Gemini, Copilot — has a system prompt you don't see that shapes how the model behaves: its persona, what it will and won't do, its output style, and its focus area.

When you use the Ollama API, OpenAI API, or Anthropic API directly, you control this fully via the "system" message role. A well-designed system prompt can transform a general model into a focused specialist. "You are a senior contract lawyer reviewing clauses for liability" genuinely shifts outputs toward that expertise domain by activating relevant training patterns.

Key concepts OpenAI API "system" role Ollama Modelfile Anthropic system param Persona assignment Role prompting

RAG — Retrieval Augmented Generation

Retrieval augmented generation

Rather than asking the model to recall information from training (unreliable for specific facts), RAG retrieves relevant documents first and injects them into the context window: "Here are relevant sections from these documents. Based only on this, answer: [question]."

This is how Perplexity, Microsoft Copilot in Word, and many document tools work. It dramatically reduces hallucination for factual tasks and allows the model to answer questions about content it was never trained on — your private documents, recent news, internal company data.

The key components are: a vector database (to store document chunks as embeddings), an embedding model (to convert query and chunks to vectors for similarity matching), and a retrieval step (to find the most relevant chunks before calling the LLM).

Key concepts Vector database LlamaIndex LangChain Pinecone pgvector Chroma

Context window

128K–1M tokens

The maximum amount of text an LLM can "see" at once — its working memory for a single conversation or task. Everything outside the context window is invisible to the model. This is why LLMs have no memory between conversations unless memory is explicitly managed.

Context window sizes have grown dramatically: GPT-4o has 128K tokens. Claude 3.5 has 200K. Gemini 1.5 Pro reached 1 million tokens. As context windows grow, the line between "prompting" and "giving the model all relevant information" increasingly blurs — you can now feed entire codebases or books in a single prompt.

1 token ≈ 0.75 words, so 128K tokens ≈ a novel-length document. However, model quality degrades in the middle of very long contexts — the "lost in the middle" problem.

Key facts GPT-4o: 128K Claude 3.5: 200K Gemini Pro: 1M Qwen 3.5: 256K Lost in the middle

Prompt injection attacks

Security & safety

A security risk in AI-powered systems. If an LLM processes untrusted external content — emails, web pages, documents, database entries — that content can contain hidden instructions designed to hijack the model's behaviour.

Indirect prompt injection is particularly dangerous: the attack payload is hidden in content the model retrieves autonomously — a web page the agent visits, a PDF it reads — not in the user's direct input. This is a genuine attack vector in production AI systems, distinct from jailbreaking.

Why it's hard to fix: the model fundamentally cannot distinguish between trusted instructions and untrusted injected data if both appear in the same context window as text.

Key concepts Indirect injection Direct jailbreak Data exfiltration Agent hijacking LLM firewall

Broader Themes

Open vs closed models

Weights, access, safety

Closed models (GPT-4, Claude, Gemini) — the weights are never released. You can only access them via API at per-token cost. Labs argue this is necessary for safety and commercial sustainability.

Open weights (Llama 3, Mistral, Qwen, DeepSeek) — the actual model file is publicly downloadable. Anyone can run, inspect, study, or fine-tune it. This is what makes Ollama and local AI possible. "Open weights" is subtly different from "open source" — Meta releases Llama's weights but not always all training code or data.

Fully open source (rare) — weights + training code + training data all released. EleutherAI's models and Pythia qualify. Enables full scientific reproducibility.

Examples Meta Llama (open weights) Mistral (open weights) EleutherAI GPT-NeoX Hugging Face AI2 OLMo

Safety & alignment

Hallucination, bias, risk

Alignment is the problem of ensuring AI systems do what humans actually want, not just what they were literally instructed to do at training time. Safety is the broader challenge of ensuring AI systems don't cause harm as they become more capable.

Hallucination — where models generate plausible-sounding but false information — is the most common practical safety concern. It occurs because models are pattern-matchers, not fact-databases. RAG, grounding with tool use, and output verification all help.

Bias — models can reflect and amplify biases present in training data. This is an active area of research and the subject of significant regulatory attention globally. Anthropic was founded specifically around AI safety research and their "responsible scaling policy" commits to safety evaluations before deploying more capable models.

Key concepts Constitutional AI RLHF Red-teaming Scalable oversight Mechanistic interpretability

Hardware & infrastructure

GPUs, CUDA, H100

LLMs run on GPUs — Graphics Processing Units — because their architecture is optimised for the massively parallel matrix multiplications that neural networks require. NVIDIA dominates the training hardware market, with the H100 being the current standard chip for frontier model training. Each H100 costs around $30,000 USD and frontier training runs use tens of thousands of them.

CUDA is NVIDIA's programming platform that makes GPUs accessible to AI frameworks like PyTorch. For local inference, llama.cpp enables running quantised models on consumer hardware — including Apple Silicon Macs via the Metal Performance Shaders (MPS) backend. Quantisation reduces model precision from 32-bit to 4-bit or 8-bit floats, reducing memory requirements ~8x with modest quality loss.

Key concepts NVIDIA H100 A100 CUDA platform Google TPU llama.cpp GGUF format Apple Metal (MPS)

Explore the interactive mind map

Navigate all these concepts visually — click any node to expand and explore deeper explanations.

Open the mind map →