What is a Token in LLMs?
A token is the fundamental unit of text that an LLM processes: typically a word, a subword, or a punctuation mark. A tokenizer splits raw text into tokens and maps each one to an integer ID the model can compute on. As a rule of thumb, 1 token corresponds to about 0.75 English words.
Overview
Everything LLMs do is measured in tokens: context windows, output limits, pricing, latency. Understanding tokenization is essential because it explains otherwise-confusing behaviors: why some words cost more than others, why non-English languages cost more per character, why code consumes more tokens than prose, and why the model occasionally splits words at unexpected points. Modern tokenizers (OpenAI's tiktoken, a BPE implementation used for GPT models; SentencePiece, used by the original Llama models; Anthropic's custom tokenizer for Claude) all share the same fundamental subword approach but differ in vocabulary size, language coverage, and how aggressively they merge frequent subword pairs.
How tokenization works
Tokenizers convert raw text strings into sequences of integer token IDs that the model's embedding layer then maps to vectors. The most common approach is Byte-Pair Encoding (BPE): start with individual characters (or bytes) as atomic tokens, then repeatedly merge the most frequent adjacent pair of tokens into a new token until the vocabulary reaches a fixed size (typically 30K-200K tokens). Common words become single tokens ('the', 'and', 'AI'), while rare words get split into pieces ('Anthropic' might be one token, but a rare medical term might split into 4-5 subwords). Special tokens mark structure ([CLS], <|im_start|>, <|tool_use|>) and are reserved for control purposes such as message boundaries rather than ordinary text.
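The merge loop described above can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions, not any production tokenizer: it works on whole characters rather than bytes, treats each word separately, breaks ties by first occurrence, and does not handle characters it never saw in training. The function names (train_bpe, encode, build_vocab) are made up for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Every word starts as a tuple of single characters, weighted by frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair merged into one symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def encode(word, merges):
    """Split `word` into tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def build_vocab(corpus, merges):
    """Assign an integer ID to every base character and every merged token."""
    tokens = sorted({ch for word in corpus for ch in word})
    tokens += [left + right for left, right in merges]
    tokens = list(dict.fromkeys(tokens))  # drop duplicates, keep order
    return {token: idx for idx, token in enumerate(tokens)}


corpus = ["low", "low", "low", "lower", "lowest"]
merges = train_bpe(corpus, num_merges=2)
vocab = build_vocab(corpus, merges)
tokens = encode("lowest", merges)
token_ids = [vocab[token] for token in tokens]
print(tokens, token_ids)
```

With this tiny corpus, two merges are enough for the frequent word 'low' to become a single token, while the rarer 'lowest' is split into 'low', 'e', 's', 't'. That is the same common-word versus rare-word behavior described above, at miniature scale.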
Why token counts vary by language and content
English tokenizes most efficiently (about 0.75 words per token, or roughly 1.3 tokens per word, on average) because English-language text dominates the training data tokenizers were optimized for. Non-English languages tokenize less efficiently: Chinese, Japanese, Korean, Hindi, Arabic, and most non-Latin scripts often need 2-4× as many tokens as English for the same semantic content. Code tokenizes inefficiently too because variable names, indentation, and punctuation all consume separate tokens. JSON is particularly token-heavy because of all the brackets and quotes. The practical impact: serving non-English users or working with code often costs 2-3× more per request than English text of similar length.
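For tokenizers that start from UTF-8 bytes (an assumption here; not every tokenizer does), part of this gap is visible before any merges are learned: non-Latin scripts need more bytes per character, so they begin with more atomic units to merge. The sketch below only measures UTF-8 bytes per character with the standard library. It is not a tokenizer, and byte counts are at best a rough proxy for token counts. The sample strings and the function name are illustrative.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


# Illustrative greetings; any text in these scripts would do.
samples = {
    "English": "hello world",
    "Chinese": "你好世界",
    "Hindi": "नमस्ते",
    "Arabic": "مرحبا",
}

for language, text in samples.items():
    print(f"{language}: {utf8_bytes_per_char(text):.1f} UTF-8 bytes per character")
```

For an exact figure, count tokens with the tokenizer of the model you actually call, since learned merges can narrow or widen this gap.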
Tokens, latency, and cost
Almost everything about LLM economics is per-token. Pricing: providers charge separately for input tokens (cheaper) and output tokens (more expensive, typically 2-5× the input price). Latency: time-to-first-token grows with input length, because the model must process the whole prompt before it starts generating; inter-token latency is roughly constant; so total response time ≈ input tokens × prefill time per token + output tokens × generation time per token. Context windows: every token in the prompt counts against the limit, and generated tokens generally share the same window. Optimization in production usually means reducing token count: tighter prompts, shorter system messages, prompt caching for repeated prefixes, and structured output to reduce unnecessary verbosity.
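The pricing and latency relationships above reduce to simple arithmetic. The sketch below is a rough estimator under stated assumptions: the ModelProfile fields and every number in the example are hypothetical placeholders rather than real provider prices or speeds, and the latency model ignores network overhead and queueing.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    # All values are hypothetical placeholders, not real provider figures.
    input_price_per_mtok: float       # USD per million input tokens
    output_price_per_mtok: float      # USD per million output tokens
    prefill_seconds_per_token: float  # time to process one input token
    decode_seconds_per_token: float   # time to generate one output token


def estimate_cost_usd(profile, input_tokens, output_tokens):
    """Cost = input tokens x input price + output tokens x output price."""
    total = (input_tokens * profile.input_price_per_mtok
             + output_tokens * profile.output_price_per_mtok)
    return total / 1_000_000


def estimate_latency_seconds(profile, input_tokens, output_tokens):
    """Latency = prefill time (time to first token) + generation time."""
    time_to_first_token = input_tokens * profile.prefill_seconds_per_token
    generation_time = output_tokens * profile.decode_seconds_per_token
    return time_to_first_token + generation_time


# Example with made-up numbers: output priced at 5x input.
profile = ModelProfile(
    input_price_per_mtok=3.0,
    output_price_per_mtok=15.0,
    prefill_seconds_per_token=0.0002,
    decode_seconds_per_token=0.02,
)
print(estimate_cost_usd(profile, input_tokens=2000, output_tokens=500))
print(estimate_latency_seconds(profile, input_tokens=2000, output_tokens=500))
```

With these placeholder figures, the 500 output tokens account for more of both the cost and the latency than the 2,000 input tokens, which is why trimming output length is often the first optimization to try.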
Use cases
- Estimating cost before sending a request (input tokens × input price + expected output tokens × output price)
- Capping response length via the max_tokens parameter to control cost and latency
- Optimizing prompts to fit within context window limits
- Comparing model pricing across providers on a per-token basis
- Using prompt caching to amortize repeated prefix tokens at a discount (up to roughly 90% on cached reads, depending on the provider)
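To make the prompt-caching use case concrete, the sketch below compares input cost with and without a cached prefix. It rests on simplifying assumptions: the first request pays full price for the prefix, every later request pays a discounted rate for it, and cache-write surcharges and cache expiry are ignored. The function name, the price, and the 0.9 default discount are illustrative rather than any provider's actual terms.

```python
def input_cost_with_prefix_cache(prefix_tokens, suffix_tokens, requests,
                                 input_price_per_mtok, cache_discount=0.9):
    """Return (uncached_cost, cached_cost) in USD for requests sharing one prefix."""
    if requests <= 0:
        return 0.0, 0.0
    per_token = input_price_per_mtok / 1_000_000
    uncached = requests * (prefix_tokens + suffix_tokens) * per_token
    # First request pays full price for the prefix; later ones pay the discounted rate.
    discounted_reads = (requests - 1) * (1 - cache_discount)
    cached_prefix = prefix_tokens * per_token * (1 + discounted_reads)
    cached = cached_prefix + requests * suffix_tokens * per_token
    return uncached, cached


# Hypothetical workload: a 10,000-token system prompt reused across 100 requests.
uncached, cached = input_cost_with_prefix_cache(
    prefix_tokens=10_000, suffix_tokens=200, requests=100,
    input_price_per_mtok=3.0,
)
print(f"uncached: ${uncached:.3f}, cached: ${cached:.3f}")
```

The saving grows with the share of each request taken up by the repeated prefix, so caching pays off most for long system prompts or shared documents followed by short user turns.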
Examples in production
OpenAI tiktoken
Open-source Python tokenizer that exactly replicates GPT-3.5/GPT-4 tokenization, used to count tokens before API calls.
Hugging Face Tokenizers
Fast Rust-backed tokenizer library supporting BPE, WordPiece, and Unigram models, including tokenizers originally trained with SentencePiece; it is the tokenizer library behind most open-source models.
Anthropic
The Claude API includes a count_tokens endpoint for counting input tokens before sending a request, which is helpful for prompt budgeting. Treat the result as a close estimate rather than a billing guarantee.
Token compared to alternatives
| Alternative | Choose Token when | Choose alternative when |
|---|---|---|
| Characters (raw character count: what naive text length measures) | Use tokens for any LLM cost or context window calculation: they're what the model actually counts | Use character count only for display or basic length estimates |
| Words (whitespace-separated word count) | Use tokens for precise LLM-related sizing: about 1 token per 0.75 words for English | Use word count for human-readable length estimates |
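When no tokenizer is at hand, the ratios in this article (about 0.75 English words per token, or about 4 characters per token) give a quick back-of-the-envelope estimate. The helpers below are a rough sketch with hypothetical function names. They are only reasonable for English prose and should not be used for billing or hard context-limit checks.

```python
import math


def estimate_tokens_from_words(text, words_per_token=0.75):
    """Rough token estimate from whitespace-separated word count (English prose)."""
    word_count = len(text.split())
    return math.ceil(word_count / words_per_token)


def estimate_tokens_from_chars(text, chars_per_token=4):
    """Rough token estimate from raw character count (English prose)."""
    return math.ceil(len(text) / chars_per_token)


text = "Tokens are the unit that language models count"
print(estimate_tokens_from_words(text))
print(estimate_tokens_from_chars(text))
```

The two heuristics will often disagree slightly even on English text, and both can be far off for code or non-Latin scripts. Use the model's own tokenizer or token-counting endpoint when the number matters.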
Common pitfalls
- Estimating tokens via 'characters / 4': works for English prose, breaks for code and non-English text
- Forgetting that output tokens cost more than input tokens: long-form generation is expensive
- Sending raw HTML, JSON, or XML when a more compact representation would work: markup is token-heavy
- Not capping max_tokens: runaway responses can run up to the model's output limit, driving up cost and latency
- Assuming all providers tokenize identically: they don't; the same text produces different token counts on GPT vs Claude vs Gemini
Questions about Token.
Why do non-English languages use more tokens?
Tokenizers are trained on a corpus dominated by English. Non-Latin scripts, especially Chinese, Japanese, Korean, Hindi, and Arabic, often need 2-4× as many tokens as English for equivalent content. Some open-source models (mBERT, XLM-R) use multilingual tokenizers that narrow the gap between languages, but most frontier model tokenizers favor English.
Do input and output tokens cost the same?
No: output tokens are typically 2-5× more expensive than input tokens because generating each token requires a full model forward pass, while input tokens are processed in parallel during prefill. This pricing structure is common across major providers and a key consideration when designing applications: a workflow that summarizes long input into short output is much cheaper per request than one that expands short input into long output.
What should I set max_tokens to?
For chat-style responses, 1024-2048 max_tokens is typical. For document generation, 4096-8192. Always set max_tokens explicitly: where the API treats it as optional, leaving it out can let the model generate up to its maximum output length, which can be expensive and slow.
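One way to pick a safe value is to clamp the desired cap to whatever room the prompt leaves in the context window. The helper below is a minimal sketch that assumes input and output tokens share a single context window. The function name and the numbers are illustrative, and some models also enforce a separate maximum output length that this does not model.

```python
def clamp_max_tokens(input_tokens, desired_max_tokens, context_window):
    """Largest output cap that keeps input + output within the context window."""
    remaining = context_window - input_tokens
    if remaining <= 0:
        raise ValueError("prompt leaves no room for output in the context window")
    return min(desired_max_tokens, remaining)


# Hypothetical 8,192-token window.
print(clamp_max_tokens(input_tokens=1000, desired_max_tokens=2048, context_window=8192))
print(clamp_max_tokens(input_tokens=7000, desired_max_tokens=2048, context_window=8192))
```

In the second call the prompt uses most of the window, so the cap shrinks to the tokens that remain. If the prompt alone fills the window, the helper raises an error instead of returning a useless cap.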