What are Scaling Laws in AI?
Scaling laws are empirical relationships discovered in deep learning research showing that model performance improves predictably as a power-law function of model size, dataset size, and compute. They form the empirical foundation for the modern frontier-model arms race and explain why investing in larger models, more data, and more compute yields predictable capability gains.
Overview
Scaling laws are arguably the single most important empirical discovery in modern AI. The first major scaling laws paper (Kaplan et al., OpenAI, 2020) showed that LLM performance improves smoothly and predictably as a function of compute, model size, and dataset size, turning AI capability development from an unpredictable research problem into a more engineering-tractable scaling problem. The Chinchilla paper (Hoffmann et al., DeepMind, 2022) refined the laws with a key finding: most existing models were over-parameterized relative to their training data. The implications shaped the entire industry. They include the justification for billion-dollar training runs, frontier-lab strategies focused on scale, the decision to train smaller models on more data (Llama 3 was trained on 15T tokens), and the broader argument that capability progress would continue with continued investment.
Kaplan scaling laws (2020)
Kaplan et al. trained many LLMs at varying sizes and compute budgets to discover how performance varied as a function of (1) model parameter count, (2) training dataset size, and (3) total compute budget. The key finding: performance follows a power-law relationship in each of the three dimensions, so each relationship plots as a straight line on log-log axes. This let researchers predict model performance at scales they couldn't yet train. The Kaplan paper specifically suggested that, for a given compute budget, optimal performance came from large models trained on relatively modest data, which led the field to train ever-larger models like GPT-3 (175B parameters).
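The "straight line on log-log axes" idea can be sketched in a few lines of Python. This is a minimal illustration, not the paper's fitting procedure: it assumes a pure power law `loss = a * size**(-alpha)` with no irreducible-loss term, and the data points and constants (`TRUE_A`, `TRUE_ALPHA`) are made up for the example rather than taken from any published fit.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-alpha) by ordinary least squares in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope and intercept of the straight line log(loss) = intercept + slope * log(size).
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

def predict_loss(a, alpha, size):
    """Extrapolate the fitted power law to a model size that was never trained."""
    return a * size ** (-alpha)

# Hypothetical measurements from four small training runs (synthetic, for illustration only).
TRUE_A, TRUE_ALPHA = 10.0, 0.08
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [TRUE_A * n ** (-TRUE_ALPHA) for n in sizes]

a, alpha = fit_power_law(sizes, losses)
predicted = predict_loss(a, alpha, 1e11)  # 100x larger than the biggest run above
print(f"a={a:.3f}, alpha={alpha:.3f}, predicted loss at 1e11 params={predicted:.3f}")
```

Real measurements are noisy and published fits use more careful functional forms, so an extrapolation like this is best treated as a rough planning estimate.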
Chinchilla scaling laws (2022)
Hoffmann et al. (DeepMind) revisited Kaplan's analysis with more rigorous methodology and reached a different conclusion: most existing LLMs were under-trained relative to their size. The Chinchilla finding: for a given compute budget, optimal performance comes from a balanced ratio of model parameters to training tokens (roughly 20 training tokens per parameter). Their Chinchilla model, with 70B parameters trained on 1.4T tokens, outperformed the larger Gopher (280B parameters), which was trained on less data. This finding reshaped industry strategy: instead of always going bigger, train smaller models on much more data. Llama 3 (15T training tokens, well beyond 20 tokens per parameter) and most frontier 2024-2026 models reflect this Chinchilla-influenced, data-heavy training.
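The 20-tokens-per-parameter rule of thumb can be turned into a rough sizing calculation. The sketch below relies on one assumption that is not stated in the text above: the common approximation that training compute is about 6 FLOPs per parameter per token (C ≈ 6 · N · D). The function name `compute_optimal_allocation` is illustrative only.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6  # assumption: training compute C ~= 6 * params * tokens
TOKENS_PER_PARAM = 20      # rule-of-thumb ratio quoted above

def compute_optimal_allocation(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (params, tokens) under C = 6*N*D and D = ratio*N."""
    # Substituting D = ratio * N into C = 6 * N * D gives N = sqrt(C / (6 * ratio)).
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens

# Sanity check against the Chinchilla configuration mentioned above (70B params, 1.4T tokens).
chinchilla_flops = FLOPS_PER_PARAM_TOKEN * 70e9 * 1.4e12
params, tokens = compute_optimal_allocation(chinchilla_flops)
print(f"params ~ {params:.3g}, tokens ~ {tokens:.3g}")
```

Under these assumptions the Chinchilla budget maps back to roughly 70B parameters and 1.4T tokens. Teams that care about inference cost often deliberately train past this ratio, so the result is a starting point rather than a prescription.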
Why scaling laws matter for production
Scaling laws have practical implications even for teams not training frontier models:
- Predictability of model capability: knowing scaling laws lets you predict whether a hypothetical model would be sufficient for your task before you commit to using it.
- Right-sizing decisions: scaling laws inform whether you should use a 7B, 13B, 70B, or frontier-scale model for a given task.
- Distillation economics: distillation works because smaller models can match larger models on narrow tasks even when scaling laws say larger models are generally better.
- Investment justification: scaling laws are why frontier labs continue to invest in larger models even when current models seem 'good enough'.
- Research direction: scaling laws focus the industry on scaling existing architectures rather than inventing new ones, which has shaped where research effort goes.
Use cases
- Predicting whether a smaller model would be sufficient for a given task
- Justifying compute investment for frontier model training
- Sizing model selection decisions for production deployment
- Understanding why frontier labs continue to invest in larger models
- Architecting fine-tuning vs distillation strategies based on capability scaling
Examples in production
OpenAI (Kaplan et al., 2020)
'Scaling Laws for Neural Language Models', the foundational paper that established empirical power-law scaling for LLMs.
DeepMind (Hoffmann et al., 2022)
Chinchilla paper ('Training Compute-Optimal Large Language Models') refined scaling laws and reshaped industry training strategy.
Meta (Llama 3)
Llama 3 trained on 15T tokens reflects Chinchilla-influenced data-heavy training; the model significantly outperforms older similarly-sized models trained on less data.
Scaling Laws compared to alternatives
| Alternative | Choose Scaling Laws when | Choose alternative when |
|---|---|---|
| Architecture innovation (improving model performance via new architectures) | Scaling has dominated capability gains in 2020-2026 | Architecture innovation matters but has produced smaller gains than pure scaling at the frontier |
| Algorithmic efficiency (better training algorithms producing more capability per FLOP) | Scaling is more reliable; algorithmic gains compound with scaling | Algorithmic efficiency improvements multiply scaling benefits, so both matter |
Common pitfalls
- Treating scaling laws as inviolable physics: they're empirical relationships that may break at extreme scales or with new architectures
- Applying language-model scaling laws to other domains without verification
- Expecting scaling laws to predict specific capabilities: they predict aggregate performance, not specific behaviors
- Ignoring data quality: scaling laws assume reasonable-quality training data; garbage in, garbage out
- Underestimating capability emergence: some capabilities appear suddenly at scale rather than scaling smoothly
Questions about Scaling Laws.
In practice, the Chinchilla findings translate into two recommendations. For frontier model trainers: train smaller models on more data than the pre-Chinchilla pattern suggested. For everyone else: prefer modern data-heavy models (Llama 3.1 8B, trained on 15T tokens) over older models trained on far fewer tokens (older 13B models trained on 1T tokens). Newer data-heavy models often outperform older over-parameterized ones, even at smaller parameter counts.
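The comparison above can be made concrete by computing training tokens per parameter for the two examples. This is a back-of-the-envelope sketch using only the nominal figures quoted in the text; the helper name `tokens_per_param` and the dictionary labels are illustrative.

```python
def tokens_per_param(train_tokens, params):
    """Training tokens seen per model parameter, a rough measure of how data-heavy a run was."""
    return train_tokens / params

# Nominal figures as quoted above: (training tokens, parameter count).
models = {
    "Llama 3.1 8B (15T tokens)": (15e12, 8e9),
    "older 13B model (1T tokens)": (1e12, 13e9),
}

ratios = {name: tokens_per_param(t, p) for name, (t, p) in models.items()}
for name, ratio in ratios.items():
    print(f"{name}: ~{ratio:.0f} tokens per parameter")
```

By this measure the newer 8B model saw over twenty times more data per parameter than the older 13B model, which is one reason a smaller recent model can outperform a larger older one.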
Scaling laws apply less clearly to fine-tuning. They were derived for pre-training; fine-tuning has weaker and less consistent scaling behavior. Fine-tuning quality depends more on data quality, hyperparameters, and method (PEFT vs full fine-tuning) than on pure scale. Production fine-tuning is more of an art than the well-understood pre-training scaling laws would suggest.
Scaling increasingly applies at inference time as well. Recent work (OpenAI's o1, Anthropic's extended thinking) suggests that inference-time compute also scales: using more compute per query (longer reasoning, more search) improves performance in predictable ways for some tasks. This 'inference-time scaling' is an active research area and is changing how production AI is designed for hard tasks.