Why does my non-English text cost so much more?

Most tokenizers are trained on corpora dominated by English. Text in non-Latin scripts, especially Chinese, Japanese, Korean, Hindi, and Arabic, often needs 2-4× as many tokens as English for equivalent content. Some open-source multilingual models (mBERT, XLM-R) use tokenizers trained on many languages, which narrows the gap, but most frontier model tokenizers still favor English.

Are input and output tokens priced the same?

No. Output tokens are typically 2-5× more expensive than input tokens, because each output token requires its own forward pass through the model, while input tokens are processed in parallel during prefill. Most major providers price this way, so it is a key consideration when designing applications: a workflow that summarizes long input into short output is much cheaper per request than one that expands short input into long output.

What's a reasonable max_tokens for chat responses?

For chat-style responses, a max_tokens of 1024-2048 is typical; for document generation, 4096-8192. Always set max_tokens explicitly. Some APIs require it; where it is optional, leaving it unset lets the model generate up to its maximum output length, which can be expensive and slow.

What is a Token in LLMs?

A token is the fundamental unit of text that an LLM processes: typically a word, a subword, or a punctuation mark. A tokenizer splits raw text into tokens and maps each one to an integer ID the model can compute on. As a rule of thumb, 1 token is roughly 0.75 English words.

Last updated 2026-04-28 | BearPlex AI Engineering Team

Overview

Everything LLMs do is measured in tokens: context windows, output limits, pricing, latency. Understanding tokenization explains otherwise confusing behaviors: why some words cost more than others, why non-English text costs more per character, why code consumes more tokens than prose, and why a tokenizer occasionally splits words at unexpected points. Modern tokenizers (OpenAI's BPE encodings, implemented in tiktoken; SentencePiece, used by earlier Llama models; Anthropic's own tokenizer for Claude) share the same subword approach but differ in vocabulary size, language coverage, and how aggressively they merge frequent subword pairs.

How tokenization works

Tokenizers convert raw text strings into sequences of integer token IDs, which the model's embedding layer then maps to vectors. The most common approach is Byte-Pair Encoding (BPE): start with individual characters (or bytes) as atomic tokens, then repeatedly merge the most frequent adjacent pair of tokens into a new token until the vocabulary reaches a fixed size (typically 30K-200K tokens). Common words become single tokens ('the', 'and', 'AI'), while rare words get split into pieces ('Anthropic' might be one token, but a rare medical term might split into 4-5 subwords). Special tokens such as [CLS], <|im_start|>, and <|endoftext|> mark structure, for example message or document boundaries, and are reserved for that purpose rather than used for ordinary text.
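The merge loop described above can be sketched in a few lines. This is a toy illustration, not any production tokenizer: the function names (train_bpe, merge_pair, encode) and the tiny corpus are made up for the example, it works on characters rather than bytes, and it skips pre-tokenization and special tokens.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each distinct word starts as a tuple of characters, weighted by frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the whole corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges.append(best)
        # Apply the new merge rule everywhere before the next round.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: "hug" x10, "pug" x5, "hugs" x5.
toy_words = ["hug"] * 10 + ["pug"] * 5 + ["hugs"] * 5
merges = train_bpe(toy_words, num_merges=2)
print(merges)                  # [('u', 'g'), ('h', 'ug')]
print(encode("hugs", merges))  # ['hug', 's']
print(encode("bug", merges))   # ['b', 'ug']
```

The frequent word 'hug' collapses into one token after two merges, while the unseen word 'bug' still splits into two pieces. The same effect, at much larger scale, is why common words are cheap and rare words are not.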

Why token counts vary by language and content

English tokenizes most efficiently (about 0.75 words per token, or roughly 1.3 tokens per word) because English-language text dominates the data that tokenizers were trained on. Other languages tokenize less efficiently: Chinese, Japanese, Korean, Hindi, Arabic, and most other non-Latin scripts often need 2-4× as many tokens as English for the same semantic content. Code is token-heavy too, because variable names, indentation, and punctuation all consume separate tokens, and JSON is particularly so because of its brackets and quotes. The practical impact: serving non-English users or working with code can cost 2-3× more per request than English text of similar length.
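One contributing factor can be seen without any tokenizer at all. Assuming a byte-level BPE tokenizer (an assumption; tokenizers differ), text enters as UTF-8 bytes, and non-Latin scripts need more bytes per character, so they depend on more learned merges to reach the same compactness. The sketch below only measures bytes, not real token counts; the helper name and sample greetings are illustrative.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes per character in `text`."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


# Short greetings in several scripts (illustrative samples only).
samples = {
    "English": "hello",
    "Arabic": "مرحبا",
    "Hindi": "नमस्ते",
    "Chinese": "你好",
    "Korean": "안녕하세요",
}

for language, text in samples.items():
    byte_count = len(text.encode("utf-8"))
    print(f"{language}: {len(text)} chars, {byte_count} bytes, "
          f"{utf8_bytes_per_char(text):.1f} bytes/char")
```

Byte counts are only an upper bound on what a byte-level tokenizer can produce. For real numbers, run the same samples through the provider's own tokenizer.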

Tokens, latency, and cost

Almost everything about LLM economics is per-token. Pricing: providers charge separately for input tokens (cheaper) and output tokens (more expensive, typically 2-5× the input price). Latency: time-to-first-token depends on input length, because the model must process the whole input before it starts generating; inter-token latency is roughly constant; so total response time ≈ input tokens ÷ prefill throughput + output tokens × inter-token latency. Context windows: every token in the prompt counts against the limit. Optimization in production usually means reducing token count: tighter prompts, shorter system messages, prompt caching for repeated prefixes, and structured output to cut unnecessary verbosity.
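The latency relationship above can be turned into a back-of-the-envelope estimator. This is a rough sketch: the function name and the throughput figures are placeholders, not measurements from any provider, and it ignores network overhead and queueing.

```python
def estimate_response_seconds(input_tokens, output_tokens,
                              prefill_tokens_per_sec, decode_tokens_per_sec):
    """Return (time_to_first_token, total_time) in seconds.

    Assumes prefill time scales linearly with input length and that
    inter-token latency is constant (1 / decode_tokens_per_sec).
    """
    time_to_first_token = input_tokens / prefill_tokens_per_sec
    generation_time = output_tokens / decode_tokens_per_sec
    return time_to_first_token, time_to_first_token + generation_time


# Hypothetical speeds: 5,000 tokens/s prefill, 50 tokens/s generation.
ttft, total = estimate_response_seconds(2_000, 500, 5_000.0, 50.0)
print(f"time to first token: {ttft:.2f}s, total: {total:.2f}s")
```

With these placeholder numbers, generation dominates: 500 output tokens take far longer than 2,000 input tokens. That is why capping output length is usually the first latency lever to try.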

Use cases

Estimating cost before sending a request (input tokens × input price + expected output tokens × output price)
Capping response length via max_tokens parameter to control cost and latency
Optimizing prompts to fit within context window limits
Comparing model pricing across providers on a per-token basis
Using prompt caching to amortize repeated prefix tokens at a discount (up to about 90% on cached input with some providers)
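The cost estimate in the first use case, extended with an optional cache discount, might look like the sketch below. The function name, the prices, and the discount are hypothetical placeholders; real pricing varies by provider and model, and some providers also charge extra for cache writes, which this sketch ignores.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_mtok, output_price_per_mtok,
                      cached_tokens=0, cache_discount=0.0):
    """Estimate the cost of one request in USD.

    Prices are per million tokens. `cached_tokens` is the part of the input
    served from a prompt cache, billed at (1 - cache_discount) of the
    normal input price.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached_tokens = input_tokens - cached_tokens
    billable_input = uncached_tokens + cached_tokens * (1 - cache_discount)
    input_cost = billable_input * input_price_per_mtok / 1_000_000
    output_cost = output_tokens * output_price_per_mtok / 1_000_000
    return input_cost + output_cost


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
no_cache = estimate_cost_usd(10_000, 1_000, 3.0, 15.0)
with_cache = estimate_cost_usd(10_000, 1_000, 3.0, 15.0,
                               cached_tokens=8_000, cache_discount=0.9)
print(f"without caching: ${no_cache:.4f}")
print(f"with caching:    ${with_cache:.4f}")
```

Comparing providers on a per-token basis is then a matter of calling the function with each provider's prices and the token counts reported by that provider's own tokenizer.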

Examples in production

OpenAI tiktoken

Open-source Python tokenizer library from OpenAI that implements the encodings used by GPT-3.5 and GPT-4, commonly used to count tokens before API calls.

Hugging Face Tokenizers

Fast Rust-backed tokenizer library supporting BPE, WordPiece, and Unigram models, including SentencePiece-style tokenizers. It is the tokenizer backend for many open-source models.

Anthropic

The Claude API includes a count_tokens endpoint that reports a request's input token count before you send it, which helps with prompt budgeting.

Token compared to alternatives

Alternative	Choose tokens when	Choose the alternative when
Characters (raw character count; what naive text length measures)	Any LLM cost or context-window calculation: tokens are what the model actually counts	Display or basic length estimates only
Words (whitespace-separated word count)	Precise LLM-related sizing: about 1 token per 0.75 English words	Human-readable length estimates
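When no tokenizer is at hand, the two rules of thumb in this article (about 0.75 English words per token, about 4 characters per token) give a quick estimate. The sketch below uses illustrative helper names and is only a sanity check for English prose; it is not a substitute for a real tokenizer.

```python
def tokens_from_words(word_count):
    """Rule of thumb: 1 token is about 0.75 English words."""
    return round(word_count / 0.75)


def tokens_from_chars(char_count):
    """Rule of thumb: 1 token is about 4 characters of English prose."""
    return round(char_count / 4)


def rough_token_estimate(text):
    """Return character count, word count, and both heuristic token estimates."""
    chars = len(text)
    words = len(text.split())
    return {
        "chars": chars,
        "words": words,
        "tokens_by_words": tokens_from_words(words),
        "tokens_by_chars": tokens_from_chars(chars),
    }


print(tokens_from_words(750))    # 1000
print(tokens_from_chars(4_000))  # 1000
print(rough_token_estimate("Tokens are the unit that language models actually count."))
```

If the two estimates disagree sharply for a piece of text, that usually signals content the heuristics handle badly, such as code, markup, or non-English text.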

Common pitfalls

Estimating tokens via 'characters / 4': works for English prose, breaks for code and non-English text
Forgetting that output tokens cost more than input tokens: long-form generation is expensive
Sending raw HTML, JSON, or XML when a more compact representation would work: markup is token-heavy
Not capping max_tokens: runaway responses can run to the model's output limit and drive up cost and latency
Assuming all providers tokenize identically: they don't; the same text yields different token counts on GPT, Claude, and Gemini
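The markup pitfall is easy to check before paying for it. The sketch below compares the same record serialized as pretty-printed and compact JSON, using character count as a rough proxy for tokens. The record and the function name are made up for the example, and actual token savings should be measured with the provider's tokenizer.

```python
import json


def json_sizes(obj):
    """Return (pretty_length, compact_length) in characters for the same data."""
    pretty = json.dumps(obj, indent=2)
    compact = json.dumps(obj, separators=(",", ":"))  # no spaces after , or :
    return len(pretty), len(compact)


# Hypothetical record of the kind often pasted into prompts.
record = {
    "user": {"id": 42, "name": "Ada", "roles": ["admin", "editor"]},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

pretty_len, compact_len = json_sizes(record)
print(f"pretty: {pretty_len} chars, compact: {compact_len} chars")
print(f"compact is {compact_len / pretty_len:.0%} of the pretty size")
```

Both forms decode to identical data, so compacting is a safe first step; dropping fields the model does not need typically saves more.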

FAQ

Questions about Token.

How do I count tokens accurately?

Use the provider's official tokenizer: tiktoken for OpenAI, Anthropic's count_tokens API endpoint for Claude, or the Hugging Face tokenizer that matches the open-source model you're using. Don't estimate via characters / 4 except as a rough sanity check: it's wrong for code and non-English text.
