What is a Large Language Model?
OverviewLarge Language Models
The foundation
A Large Language Model (LLM) is an AI system trained to understand and generate human language at scale. The "large" refers to two things: the vast amount of text it was trained on, and the billions of mathematical parameters (weights) inside it.
At its core, an LLM does one deceptively simple thing: given some text, predict what comes next. When trained extraordinarily well across trillions of words, reasoning, knowledge, and language understanding emerge as side effects of that prediction task.
The four key dimensions of how LLMs work are: the mathematics underpinning them, the training process that builds them, the architectural choices that define them, and the prompting techniques that get the best out of them.
The Mathematics of LLMs
MathematicsLLM maths isn't exotic — it builds on linear algebra, calculus, and probability applied at enormous scale. The entire forward pass that converts your prompt into a response is essentially a chain of matrix multiplications. Every token becomes a vector — an ordered list of ~4,096 numbers — and every transformation the model applies is a matrix multiplication.
Linear algebra
Vectors & matrices
The foundation of all LLM computation. Every token is converted into an embedding vector — an ordered list of ~4,096 floating-point numbers that encodes its meaning as a position in high-dimensional space. Words with similar meanings end up geometrically close in this space.
Every transformation the model applies — attention, feed-forward layers — is a matrix multiplication: multiplying two grids of numbers together. The billions of "parameters" in a model are literally the individual numbers inside these matrices.
Calculus & gradient descent
How models learn
Calculus drives training. The model's error — how wrong its predictions are — forms a high-dimensional "loss landscape." Training is the process of rolling a ball downhill on that landscape, adjusting parameters in the direction that reduces error.
The gradient tells you which direction downhill is. Backpropagation is the algorithm that efficiently calculates that gradient across billions of parameters simultaneously, flowing error signals backwards through the network layer by layer. The Adam optimiser is the most widely used variant for adapting the learning rate per parameter.
Probability & statistics
Softmax & sampling
The final step of every forward pass produces a probability distribution — not just "the next word is X" but a ranked list across all ~100,000 tokens in the vocabulary.
The temperature parameter controls this distribution. At temperature 0, you always pick the highest-probability token (deterministic, repetitive). At temperature 1, you sample proportionally (creative, varied). At temperature 2, the distribution flattens (chaotic).
Top-p (nucleus) sampling is a refinement that samples only from the smallest set of tokens whose cumulative probability exceeds a threshold p — avoiding very low-probability tokens regardless of temperature.
Self-attention mechanism
Query / Key / Value
The key innovation of the Transformer (Vaswani et al., 2017). Self-attention asks: for every token, which other tokens in the input should I pay attention to when building my understanding of this token?
It computes three vectors per token: a Query (what am I looking for?), a Key (what do I offer?), and a Value (what information do I carry?). The attention score between any two tokens is the dot product of their Q and K vectors, scaled and passed through softmax to produce weights. Those weights determine how much each token's Value contributes to the output.
This is computed in parallel across all tokens simultaneously — far faster than older recurrent (LSTM) models that processed one token at a time. Modern models use multi-head attention, running this process in parallel across many independent heads.
How LLMs are Trained
TrainingTraining a frontier model is one of the most resource-intensive activities in computing. It happens in sequential stages: data collection → pre-training → instruction tuning → alignment. Training Meta's Llama 3 70B consumed approximately 6.4 million GPU-hours on H100 chips — roughly $30–50M AUD for a single run.
Data collection & cleaning
Web crawl, books, code
Models are trained on massive text corpora: web crawls (Common Crawl is the most used source), books (Books3, Project Gutenberg), Wikipedia, GitHub code repositories, scientific papers (ArXiv), and more. Meta's Llama 3 was trained on 15 trillion tokens. OpenAI has never disclosed GPT-4's exact data volume.
The raw data is enormous but messy — full of spam, duplicate content, low-quality pages, and toxic material. Significant engineering effort goes into filtering, deduplication, and quality scoring before training begins. The quality of training data arguably matters as much as the architecture itself.
Pre-training
Next token prediction
The foundational and most expensive stage. The model starts with random parameters (essentially noise) and learns by reading text, trying to predict the next token, comparing its prediction to the actual token, calculating how wrong it was (the "loss"), and adjusting all parameters slightly to be less wrong. This process repeats billions of times.
The result is a base model — sometimes called a foundation model — that deeply understands language but is strange to talk to. If you ask "What is the capital of France?", it might continue your text as if it's a quiz, not a question. Base models need further stages to become useful assistants.
Instruction tuning (SFT)
Supervised fine-tuning
After pre-training, supervised fine-tuning (SFT) trains the model on thousands to millions of examples of the desired input-output format: human message → assistant reply → human message → assistant reply.
This is where a base model becomes a usable assistant — it learns the conversational format, how to respond helpfully, and basic safety behaviours. SFT is relatively cheap compared to pre-training. Every major model (GPT, Claude, Gemini, Llama-Instruct) goes through this stage. The quality and diversity of the SFT dataset heavily influences the model's personality and instruction-following ability.
Reinforcement Learning from Human Feedback
Reward model, PPO
The step that defines modern chatbots. Human raters are shown multiple responses to the same prompt and rank them from best to worst. These rankings train a separate reward model that learns to predict human preferences.
The main LLM is then fine-tuned using reinforcement learning (specifically PPO — Proximal Policy Optimisation) to generate responses the reward model scores highly. This gives models their tendency to be helpful, to decline harmful requests, and to structure answers in particular ways.
Anthropic's Constitutional AI is a variation where the model is also evaluated against a written set of principles — a "constitution" — and the model itself uses these to self-critique and revise responses before human raters see them.
DPO — Direct Preference Optimisation
Direct preference optimisation
A more recent and efficient alternative to RLHF that achieves similar alignment results without needing a separate reward model. DPO directly optimises the language model on preference pairs — shown a prompt and a preferred vs rejected response — using a mathematically elegant reformulation that treats the LLM itself as the implicit reward model.
Many open-source models use DPO: Mistral models, Zephyr, many Llama fine-tunes, and Tulu. Results are comparable to RLHF for most tasks at significantly lower training cost, making it popular in the research community and for smaller labs that can't afford full RLHF infrastructure.
Synthetic training data
AI-generated training
Models like Claude, GPT-4, and Gemini are now partially trained on data generated by other AI models. Meta used Llama 3 to help generate instruction-following training data for Llama 3. Microsoft's Phi series (Phi-1, Phi-2, Phi-3) is almost entirely trained on high-quality synthetic data generated by GPT-4 — and achieves remarkable performance for its tiny size.
This raises interesting questions: can AI quality improve recursively? Research suggests it can — up to a point — but errors and biases compound over generations if synthetic data isn't carefully curated and filtered. It also enables labs to generate specific kinds of training data that are rare or expensive to collect naturally.
Model Architectures & Families
ArchitecturesNot all LLMs are built the same way. The Transformer (2017) is the universal foundation — but within that, there are major design variants. The AI landscape is also divided between closed models (GPT-4, Claude, Gemini — weights never released) and open weight models (Llama, Mistral, Qwen, DeepSeek — weights publicly available).
Dense transformer
GPT-2, GPT-3 style
The classic architecture — every parameter is activated for every token processed. Simple, well-understood, and the foundation of early large models. GPT-2 (2019) and the original GPT-3 (2020) were dense transformers.
The limitation: at very large scales, running all parameters for every token becomes prohibitively expensive. A 175B dense model must activate all 175B parameters to process a single token. This drove research into more efficient architectures like Mixture of Experts, where only a fraction of parameters activate per token. Dense models are still widely used for smaller scales (7B–13B) where efficiency is less critical.
Mixture of Experts (MoE)
Routing & efficiency
Now dominant at the frontier. Instead of one large neural network, MoE models have many smaller "expert" networks and a router that decides which 2–4 experts should handle each token. You get the total capacity of a very large model while only activating a fraction of parameters per token.
Mixtral 8x7B (Mistral AI) has 46.7B total parameters but only activates ~12.9B per token — similar inference cost to a 13B dense model but with much higher knowledge capacity. GPT-4 is widely believed to be an MoE model (never confirmed by OpenAI). DeepSeek V3 uses MoE with 671B total but only ~37B active parameters per token.
Reasoning models
o1, o3, DeepSeek R1
A different paradigm: rather than generating an answer immediately, the model produces a long internal "chain of thought" — working through the problem step by step before giving its final answer. This dramatically improves performance on mathematics, logic, and complex coding problems at the cost of being slower and more expensive to run.
OpenAI's o1 (September 2024) and o3 were the first widely deployed reasoning models. DeepSeek R1 (January 2025) shocked the industry by matching o1's benchmark performance at a fraction of the training cost — reportedly ~$6M vs hundreds of millions.
OpenAI model family
GPT-4o, o1, o3
GPT-4o ("omni") — the most widely used commercial model as of 2024–25. Fast, natively multimodal (text, images, and audio processed together), available via ChatGPT and API.
o1 / o3 — reasoning-first models that produce internal chain-of-thought before answering. Dramatically better at maths and complex logic. Slower and more expensive. o3 is the current frontier reasoning model.
GPT-4.1 — the most recent in the GPT-4 family as of early 2026, optimised for instruction following and long-context coding tasks.
Anthropic — Claude models
Constitutional AI
Anthropic was co-founded in 2021 by former OpenAI researchers with a specific focus on AI safety research. Their models are known for strong writing, nuanced instruction following, a 200K context window, and strong coding ability.
Constitutional AI is Anthropic's key differentiator: rather than purely human feedback, Claude is also trained against a written set of principles — a "constitution" — that the model itself uses to self-critique and revise its own responses before human raters evaluate them. This allows more scalable and principled alignment.
Google DeepMind — Gemini
Gemini 2.0 family
Google's frontier model family, built natively multimodal from the ground up — text, images, audio, and video are treated as equal inputs rather than text being primary with vision bolted on as a separate module.
Gemini 2.0 Flash is notably fast and cost-efficient, widely used in production applications. Google has extended Gemini to a 1 million token context window — enough to process entire large codebases or long books in a single prompt. Gemini powers Google Search AI Overviews and deep Workspace integrations.
Open weight models
Llama · Mistral · DeepSeek
Meta Llama 3 — fully open weights. Llama 3.3 70B is exceptional for its size and highly competitive with closed models. Being open enables the entire Ollama ecosystem.
Mistral AI (France) — punches above its weight. Mistral 7B outperformed much larger models at launch (2023). Mixtral 8x7B is a landmark open MoE model.
DeepSeek (China) — shocked the world with R1 in January 2025, matching OpenAI o1 for ~$6M training cost. DeepSeek V3 is an exceptional open MoE model. Qwen (Alibaba) and Gemma (Google) round out the major open families.
Prompting Techniques
PromptingPrompting is often underrated as a skill, but technique has a massive effect on output quality — arguably more impact than switching between similar-sized models. The model has the capability; your job is to activate it precisely. Key principles: be specific about context and goal; specify format explicitly; give the model a role; and use examples.
Chain of thought prompting
Step-by-step reasoning
One of the most powerful prompting techniques. Adding "let's think step by step" or "think through this carefully before answering" to a prompt measurably improves performance on reasoning tasks, even in standard (non-reasoning) models.
You force the model to generate intermediate reasoning tokens before its conclusion — those intermediate tokens become part of its context, influencing the final answer. Works particularly well for maths, logic, multi-step planning, and complex decision-making.
Reasoning models (OpenAI o1, DeepSeek R1) essentially do chain-of-thought automatically and at greater depth, running hundreds or thousands of reasoning tokens internally before producing their visible response.
Few-shot prompting
Examples in context
Providing examples before your actual request so the model extrapolates the pattern. Rather than "classify this email: [email]", you provide two or three labelled examples first, then your actual input.
Especially powerful for: specific output formats, classification tasks, style matching, and any task where showing is clearer than telling. The model doesn't need to be retrained — it pattern-matches from examples in its context window alone. This is called in-context learning, and it was one of the surprising capabilities that emerged from large-scale pre-training.
System prompts
Frames model behaviour
The invisible instructions that frame the entire conversation. Every commercial AI product — Claude.ai, ChatGPT, Gemini, Copilot — has a system prompt you don't see that shapes how the model behaves: its persona, what it will and won't do, its output style, and its focus area.
When you use the Ollama API, OpenAI API, or Anthropic API directly, you control this fully via the "system" message role. A well-designed system prompt can transform a general model into a focused specialist. "You are a senior contract lawyer reviewing clauses for liability" genuinely shifts outputs toward that expertise domain by activating relevant training patterns.
RAG — Retrieval Augmented Generation
Retrieval augmented generation
Rather than asking the model to recall information from training (unreliable for specific facts), RAG retrieves relevant documents first and injects them into the context window: "Here are relevant sections from these documents. Based only on this, answer: [question]."
This is how Perplexity, Microsoft Copilot in Word, and many document tools work. It dramatically reduces hallucination for factual tasks and allows the model to answer questions about content it was never trained on — your private documents, recent news, internal company data.
The key components are: a vector database (to store document chunks as embeddings), an embedding model (to convert query and chunks to vectors for similarity matching), and a retrieval step (to find the most relevant chunks before calling the LLM).
Context window
128K–1M tokens
The maximum amount of text an LLM can "see" at once — its working memory for a single conversation or task. Everything outside the context window is invisible to the model. This is why LLMs have no memory between conversations unless memory is explicitly managed.
Context window sizes have grown dramatically: GPT-4o has 128K tokens. Claude 3.5 has 200K. Gemini 1.5 Pro reached 1 million tokens. As context windows grow, the line between "prompting" and "giving the model all relevant information" increasingly blurs — you can now feed entire codebases or books in a single prompt.
1 token ≈ 0.75 words, so 128K tokens ≈ a novel-length document. However, model quality degrades in the middle of very long contexts — the "lost in the middle" problem.
Prompt injection attacks
Security & safety
A security risk in AI-powered systems. If an LLM processes untrusted external content — emails, web pages, documents, database entries — that content can contain hidden instructions designed to hijack the model's behaviour.
Indirect prompt injection is particularly dangerous: the attack payload is hidden in content the model retrieves autonomously — a web page the agent visits, a PDF it reads — not in the user's direct input. This is a genuine attack vector in production AI systems, distinct from jailbreaking.
Why it's hard to fix: the model fundamentally cannot distinguish between trusted instructions and untrusted injected data if both appear in the same context window as text.
Broader Themes
ContextOpen vs closed models
Weights, access, safety
Closed models (GPT-4, Claude, Gemini) — the weights are never released. You can only access them via API at per-token cost. Labs argue this is necessary for safety and commercial sustainability.
Open weights (Llama 3, Mistral, Qwen, DeepSeek) — the actual model file is publicly downloadable. Anyone can run, inspect, study, or fine-tune it. This is what makes Ollama and local AI possible. "Open weights" is subtly different from "open source" — Meta releases Llama's weights but not always all training code or data.
Fully open source (rare) — weights + training code + training data all released. EleutherAI's models and Pythia qualify. Enables full scientific reproducibility.
Safety & alignment
Hallucination, bias, risk
Alignment is the problem of ensuring AI systems do what humans actually want, not just what they were literally instructed to do at training time. Safety is the broader challenge of ensuring AI systems don't cause harm as they become more capable.
Hallucination — where models generate plausible-sounding but false information — is the most common practical safety concern. It occurs because models are pattern-matchers, not fact-databases. RAG, grounding with tool use, and output verification all help.
Bias — models can reflect and amplify biases present in training data. This is an active area of research and the subject of significant regulatory attention globally. Anthropic was founded specifically around AI safety research and their "responsible scaling policy" commits to safety evaluations before deploying more capable models.
Hardware & infrastructure
GPUs, CUDA, H100
LLMs run on GPUs — Graphics Processing Units — because their architecture is optimised for the massively parallel matrix multiplications that neural networks require. NVIDIA dominates the training hardware market, with the H100 being the current standard chip for frontier model training. Each H100 costs around $30,000 USD and frontier training runs use tens of thousands of them.
CUDA is NVIDIA's programming platform that makes GPUs accessible to AI frameworks like PyTorch. For local inference, llama.cpp enables running quantised models on consumer hardware — including Apple Silicon Macs via the Metal Performance Shaders (MPS) backend. Quantisation reduces model precision from 32-bit to 4-bit or 8-bit floats, reducing memory requirements ~8x with modest quality loss.