The basics
LLM (Large Language Model)
A type of AI trained on enormous amounts of text to predict what comes next in a sequence. That's it. The sophistication of the output (code, essays, reasoning, conversation) emerges from doing that one thing at vast scale. ChatGPT, Claude, Gemini, Llama: all LLMs.
Foundation Model
A large model trained on broad data that can be adapted to many tasks. LLMs are a type of foundation model. The term distinguishes the base model from the applications built on top of it. GPT-4 is a foundation model. ChatGPT is an application built on it.
Token
The unit of text an LLM processes. Not quite a word, not quite a character: roughly three or four characters on average in English. "Knowledge" is one token. "A Little Bit of Knowledge" is five. Models are priced by token and limited by how many they can handle at once. Everything you put in and get out is measured in tokens.
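Because tokens drive both pricing and context limits, a rough count is often enough for budgeting. A minimal sketch using the three-to-four-characters rule of thumb (real tokenizers, such as OpenAI's tiktoken, give exact per-model counts; this is only an approximation):

```python
def estimate_tokens(text: str) -> int:
    """Ballpark token count using the ~4 characters per token rule of thumb.

    Real tokenizers give exact counts that vary by model; use this only
    for rough prompt budgeting.
    """
    return max(1, round(len(text) / 4))

print(estimate_tokens("Knowledge"))                  # -> 2 (a real tokenizer counts 1)
print(estimate_tokens("A Little Bit of Knowledge"))  # -> 6 (actual count: 5)
```

The estimate is deliberately crude; the point is that token counts, not word counts, are what you pay for and what fills the context window.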
Context Window
The maximum amount of text a model can hold in its working memory at once, input and output combined. Measured in tokens. A small context window means the model forgets earlier parts of a long conversation. A large one means it can reason across entire documents. Context window size is one of the most important things to check when choosing a model for a task.
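Applications have to manage this limit themselves, typically by dropping the oldest turns of a conversation once the budget is exceeded. A hypothetical sketch, using a crude character-based token estimate in place of a real tokenizer:

```python
def trim_history(messages, budget_tokens, est=lambda s: len(s) // 4 + 1):
    """Keep the most recent messages that fit in a token budget.

    `est` is a crude chars/4 estimate standing in for a real tokenizer.
    Walks backwards from the newest message so recent context survives.
    """
    kept, used = [], 0
    for msg in reversed(messages):
        cost = est(msg)
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

Real chat applications do exactly this (or summarise older turns instead of dropping them) so long conversations keep fitting in the window.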
Inference
The act of running a model to generate a response. Training is when the model learns from data. Inference is when it uses what it learned to answer your question. When you pay for API usage, you're paying for inference. It's computationally expensive, which is why it costs money.
Multimodal
A model that can handle more than one type of input: text, images, audio, video. GPT-4 Vision is multimodal: you can send it an image and ask questions about it. Text-only models are cheaper and faster. Multimodal models are more capable but cost more per call. Know which you need before you choose.
How they work
Training
The process by which a model learns from data. A foundation model is trained on vast amounts of text from the internet, books, and other sources. Training is done once (or periodically) by the company that builds the model. It's expensive: frontier training runs cost hundreds of millions to billions of dollars in compute. Users don't train models; they use them.
RLHF (Reinforcement Learning from Human Feedback)
A technique for making models more helpful and less harmful by having humans rate their outputs, then training the model to produce outputs humans prefer. The reason Claude or ChatGPT feels more useful than a raw language model. Also the reason they sometimes refuse things or add caveats where a raw model wouldn't.
Temperature
A setting that controls how random a model's outputs are. Low temperature (close to 0) means the model picks the most predictable next token every time: consistent, reliable, sometimes dull. High temperature means it takes more risks: creative, varied, sometimes wrong. For factual tasks, low temperature. For creative work, higher.
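Under the hood, temperature divides the model's raw scores (logits) before they are turned into probabilities. A toy illustration of the mechanism, not any provider's actual sampling code:

```python
import math
import random

def sample_token(logits, temperature, rng=random.Random(0)):
    """Pick a token index from raw scores after temperature scaling.

    temperature <= 0 is treated as greedy decoding (always the top token).
    Higher temperatures flatten the distribution, so unlikely tokens are
    picked more often.
    """
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [score / temperature for score in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]  # subtract max for stability
    total = sum(exps)
    r, acc = rng.random(), 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r <= acc:
            return i
    return len(exps) - 1
```

At temperature 0 this always returns the highest-scoring token; as temperature rises, the choice drifts toward uniform randomness.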
Hallucination
When a model generates something plausible-sounding but factually wrong: a made-up citation, a fictional law, a person who doesn't exist. It's not lying. It's predicting what text should come next based on patterns, and sometimes the pattern points somewhere wrong. Hallucination is the most important limitation to understand before deploying any LLM in a context where facts matter.
Alignment
The problem of making an AI do what humans actually want, rather than what it's technically optimised for. A model optimised to generate text people rate highly might learn to flatter rather than inform. Alignment research is the attempt to close the gap between what we say we want and what models actually do.
Embedding
A way of representing text as numbers so that similar meanings end up numerically close together. "Dog" and "puppy" have similar embeddings. "Dog" and "mortgage" do not. Embeddings are how search, recommendations, and retrieval systems understand meaning rather than just matching words. Most LLM applications use embeddings somewhere under the hood.
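"Numerically close" usually means cosine similarity between vectors. A toy sketch with hand-made three-dimensional vectors (real embedding models produce vectors with hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors standing in for real embeddings.
dog = [0.9, 0.8, 0.1]
puppy = [0.85, 0.75, 0.2]
mortgage = [0.05, 0.1, 0.95]

print(cosine_similarity(dog, puppy) > cosine_similarity(dog, mortgage))  # True
```

Semantic search is essentially this comparison run against every document in a collection, which is what vector databases are optimised for.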
Prompting
Prompt
The input you give a model. Every query, instruction, or question is a prompt. The quality of the prompt directly affects the quality of the output. Not because the model needs special incantations, but because clear questions get clear answers and vague questions get vague answers. The vocabulary you use in a prompt is what the model uses to understand what you want.
System Prompt
Instructions given to a model before the conversation starts, usually by the developer rather than the user. Sets the model's persona, constraints, and context. "You are a helpful assistant. Answer only in plain English." Users of ChatGPT rarely see the system prompt. Developers building applications write it.
Zero-shot / Few-shot
How many examples you give a model before asking it to do a task. Zero-shot: no examples, just the instruction. Few-shot: a small number of examples showing the pattern you want. Few-shot prompting reliably improves output quality for structured tasks. You don't need to retrain the model, just show it what good looks like.
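A few-shot prompt is just the instruction with worked examples stitched in front of the new input. A minimal sketch (the Input/Output labels are one common convention, not a requirement):

```python
def few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: instruction, example pairs, new input.

    `examples` is a list of (input, output) pairs showing the pattern
    the model should follow.
    """
    lines = [instruction, ""]
    for example_in, example_out in examples:
        lines += [f"Input: {example_in}", f"Output: {example_out}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    [("Great food!", "positive"), ("Never again.", "negative")],
    "The service was lovely.",
)
```

Ending on a bare "Output:" invites the model to complete the pattern, which is the whole trick.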
Chain of Thought
A prompting technique that asks the model to reason step by step before giving an answer. "Think through this carefully before responding." Improves accuracy on complex reasoning tasks. The model is more likely to get the right answer if it's made to show its working. Well known, yet underused by people who just want a quick answer.
RAG (Retrieval Augmented Generation)
A technique that gives a model access to a knowledge base it wasn't trained on, by retrieving relevant documents and putting them in the context window before generating a response. Instead of relying on what the model memorised during training, you give it the right information for the task. Reduces hallucination. Essential for applications where facts must be current or specific.
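The pipeline is: retrieve relevant passages, paste them into the prompt, ask the question. A deliberately naive sketch using keyword overlap for retrieval (production systems use embeddings and a vector database instead):

```python
def retrieve(query, documents, k=2):
    """Rank documents by how many words they share with the query."""
    query_words = set(query.lower().split())
    return sorted(
        documents,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )[:k]

def build_rag_prompt(query, documents):
    """Put the retrieved passages into the context ahead of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

docs = [
    "The refund window is 30 days.",
    "Shipping takes 5 days.",
    "Support is open on weekdays.",
]
prompt = build_rag_prompt("How long is the refund window?", docs)
```

The "answer using only the context" instruction is what ties the model to the retrieved facts rather than to whatever it memorised in training.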
Grounding
Connecting a model's output to verifiable sources. A grounded response cites where its information came from. Grounding reduces hallucination and makes outputs checkable. RAG is one way to ground a model. Giving it access to real-time web search is another. Ungrounded models should never be trusted for facts without verification.
Choosing a model
Frontier Model
The most capable models available at a given time, typically GPT-4o, Claude Opus, Gemini Ultra. The best performance, the highest cost, the strongest reasoning. Not always the right choice. Many tasks are done better and cheaper by a smaller model. Frontier models are for the tasks that genuinely require them.
Open Weights
Models whose parameters are publicly released, so anyone can download and run them. Llama (Meta), Mistral, and Qwen are open weights models. Distinct from open source, which would include the training code and data too. Open weights models can run locally, be fine-tuned, and cost nothing in API fees. The tradeoff is that frontier closed models are currently more capable.
Fine-tuning
Training an existing model further on your own data to specialise it for a specific task or domain. A foundation model fine-tuned on medical records becomes better at clinical language. Fine-tuning is cheaper than training from scratch but still requires data and compute. For most use cases, a well-written prompt gets you further than fine-tuning.
Latency
How long the model takes to respond. For real-time applications (chatbots, live tools), latency matters. Smaller models are faster. Frontier models are slower. Some providers offer faster, cheaper versions of their models for latency-sensitive applications. A model that produces a great answer in ten seconds is the wrong choice if your users expect a response in two.
Cost per Token
What you pay to process input and generate output, charged per thousand or million tokens. Prices vary wildly between models and providers. A frontier model might cost 100x more per token than a smaller one. At low volumes it doesn't matter. At scale it defines your unit economics. Always run the numbers before committing to a model in production.
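The arithmetic is simple but worth running before launch. A sketch with illustrative prices (not any provider's real rates):

```python
def call_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Cost of one API call, given prices in dollars per million tokens."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# Hypothetical rates: $3 per million input tokens, $15 per million output.
per_call = call_cost(2_000, 500, 3.00, 15.00)
print(f"${per_call:.4f} per call")                        # $0.0135 per call
print(f"${per_call * 1_000_000:,.0f} per million calls")  # $13,500 per million calls
```

Fractions of a cent per call look free; a million calls a day does not.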
Rate Limit
A cap on how many requests or tokens you can send to an API in a given time period. Designed to prevent overload and ensure fair access. Hitting a rate limit means your application queues or fails until the window resets. Enterprise plans have higher limits. Building a system that handles rate limits gracefully is non-optional at scale.
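Handling rate limits gracefully usually means retrying with exponential backoff and jitter. A sketch in which a plain RuntimeError stands in for a real client library's rate-limit exception:

```python
import random
import time

def call_with_backoff(request, max_retries=5, base_delay=1.0,
                      sleep=time.sleep, rng=random.Random(0)):
    """Retry `request` with exponential backoff plus jitter.

    `request` is any zero-argument callable; RuntimeError stands in for
    a real client's rate-limit error (e.g. an HTTP 429 response).
    """
    for attempt in range(max_retries):
        try:
            return request()
        except RuntimeError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            # Double the delay each attempt, with jitter so that clients
            # that failed together don't all retry in lockstep.
            sleep(base_delay * (2 ** attempt) * (0.5 + rng.random()))
```

The jitter matters: without it, every client that hit the limit at the same moment retries at the same moment and hits it again.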
Reasoning Model
A model specifically designed to work through problems step by step before answering, often slower and more expensive than standard models, but significantly better at complex logic, maths, and multi-step tasks. OpenAI's o1 and o3, and Anthropic's Claude with extended thinking, are examples. Use them when accuracy on hard problems matters more than speed.
Agents and applications
Agent
An LLM that can take actions, not just generate text. It can call tools, browse the web, write and run code, send messages. Given a goal, an agent breaks it into steps and executes them. The model is the brain. The tools are the hands. The gap between a chatbot and an agent is the difference between advice and action.
Tool Use / Function Calling
The ability for a model to call external functions or APIs mid-conversation. "Look up the weather." "Search the web." "Run this query." The model decides when to use a tool, uses it, and incorporates the result into its response. Tool use is what turns a language model into something that can interact with the real world.
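On the application side, tool use is a dispatch loop: inspect the model's turn, run the requested tool, feed the result back. A hypothetical sketch in which the model signals a tool call as a JSON object (real APIs return structured tool-call messages, but the shape of the loop is the same):

```python
import json

# Hypothetical tool registry; real APIs take JSON schemas describing
# each tool and return structured requests to call them.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
}

def handle_model_turn(model_output: str):
    """Route one model turn: run a requested tool, or pass text through.

    Returns ("tool_result", value) when the model asked for a tool, so
    the caller can feed the result back into the conversation, and
    ("final", text) when the model answered directly.
    """
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return ("final", model_output)
    if not isinstance(call, dict) or "tool" not in call:
        return ("final", model_output)
    tool = TOOLS[call["tool"]]
    return ("tool_result", tool(**call.get("arguments", {})))
```

Note that the model never executes anything itself: it only asks, and the application decides what actually runs.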
Orchestration
Managing multiple models or agents working together on a task. One model plans. Another executes. A third checks the output. Orchestration is the logic that coordinates them. As AI applications get more complex, orchestration is where the real engineering happens.
Guardrails
Rules or systems that constrain what a model will say or do. Some are built into the model (RLHF, safety training). Others are added by developers, filters, classifiers, output validation. Guardrails prevent a model from saying something harmful, off-brand, or factually wrong. In production systems, guardrails are not optional.
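The developer-added layer can be as simple as pattern checks on the model's output before it reaches the user. A minimal sketch (the patterns are illustrative; real systems add classifiers, schema validation, and human review on top):

```python
import re

# Illustrative blocklist: card-number-like digit runs and a leak marker.
BLOCKED_PATTERNS = [
    re.compile(r"\b\d{16}\b"),
    re.compile(r"(?i)internal use only"),
]

def apply_guardrail(model_output: str) -> str:
    """Withhold any response that matches a blocked pattern."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(model_output):
            return "[response withheld by guardrail]"
    return model_output
```

Cheap filters like this catch the obvious failures; the subtler ones are what the classifier and validation layers are for.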
Vibe Coding
Writing software by describing what you want in plain language and letting an AI generate the code. You iterate by feel. "make this faster," "the button should be blue," "something's wrong with the login." The programmer's role shifts from writing syntax to directing intent. Coined in early 2025. Controversial among developers. Increasingly how non-developers build software.
The labs
OpenAI
The company behind GPT-4, GPT-4o, the o-series reasoning models, and ChatGPT. Originally a non-profit, now commercially structured with Microsoft as a major investor. The most widely known AI lab. Their API is the most commonly used in production applications.
Anthropic
The company behind the Claude models. Founded in 2021 by former OpenAI researchers. Focused on AI safety. Claude is widely regarded as one of the best models for long-document reasoning, careful writing, and following complex instructions. The model running this site.
Google DeepMind
Google's AI research division, responsible for the Gemini series of models. Gemini has the largest context window of the frontier models and is deeply integrated with Google's products. Access via Google AI Studio and the Gemini API.
Meta AI
Meta's AI division, responsible for the Llama series of open weights models. Llama models can be downloaded, run locally, and fine-tuned freely. Meta's decision to open-weight their models has significantly accelerated the open AI ecosystem.
xAI
Elon Musk's AI company, responsible for the Grok models. Grok has real-time access to X (formerly Twitter) data. Positioned as less restricted than competitors. A smaller player in the frontier model space but growing quickly.