Most model briefs about DeepSeek R1 were written in the week it broke the app-store charts. This one is written for a different moment: the model is eighteen months old, the hype has moved on, and the engineering question has gotten sharper. R1 is no longer the model you evaluate because it is in the news. It is the model you evaluate because it is the most permissively licensed serious reasoning model you can put on your own hardware.
That framing changes what matters, so this brief covers what a build team actually needs: the license, the real deployment cost, the latency behavior, and where R1 wins or loses against the alternatives in mid-2026.
What it actually is
DeepSeek R1 (DeepSeek-AI, January 2025, later published in *Nature*) is a Mixture-of-Experts reasoning model: 671B total parameters with 37B activated per token and a 128K context window, per the official model card. The research contribution was showing that large-scale reinforcement learning, without human-annotated reasoning traces, is enough to make long chain-of-thought behavior emerge: self-checking, backtracking, strategy switching.
Two release details matter more for engineering than the headline model:
- The R1-0528 refresh. The May 2025 update is the checkpoint you should be evaluating, not the January original. Per DeepSeek's own published numbers on the 0528 model card, AIME 2025 accuracy went from 70% to 87.5% and LiveCodeBench from 63.5% to 73.3%. The same card reports a lower hallucination rate, improved function calling, and proper system-prompt support.
- The distill ladder. DeepSeek released six distilled models (Qwen-based 1.5B, 7B, 14B, 32B and Llama-based 8B, 70B) that inherit R1's reasoning style at hardware budgets normal companies have. The Qwen-based distills carry Apache 2.0 from their base models.
The license, and what commercial use really permits
The weights are released under the MIT License, and the model card is explicit that this "supports commercial use" and allows "any modifications and derivative works, including, but not limited to, distillation and fine-tuning."
For client products, MIT is about as clean as model licensing gets. Concretely, compared to the community licenses attached to Llama 4:
- No monthly-active-user threshold and no revenue gate.
- No attribution badge required in your product UI.
- No naming requirements on fine-tuned derivatives.
- Distilling R1's outputs into your own smaller model is expressly permitted, which is exactly the clause that makes R1 attractive as a teacher model.
One nuance worth counsel's five minutes: the *Llama-based distills* are not pure MIT. They are derived from Llama 3.1 and 3.3 checkpoints and remain subject to those Llama license terms. If license simplicity is the point, use the Qwen-based distills.
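One way to make that nuance enforceable is an allowlist check in the deployment pipeline. The sketch below is illustrative only: the checkpoint names follow the published distill names, but the `Distill` record, the `extra_terms` strings, and the `license_simple` helper are hypothetical constructs of ours, and none of it substitutes for counsel's review.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Distill:
    name: str         # checkpoint name as published by DeepSeek
    base_family: str  # "qwen" or "llama" (family of the base model)
    extra_terms: str  # terms inherited from the base model, per this brief


DISTILLS = [
    Distill("DeepSeek-R1-Distill-Qwen-1.5B", "qwen", "Apache 2.0 (base model)"),
    Distill("DeepSeek-R1-Distill-Qwen-7B", "qwen", "Apache 2.0 (base model)"),
    Distill("DeepSeek-R1-Distill-Qwen-14B", "qwen", "Apache 2.0 (base model)"),
    Distill("DeepSeek-R1-Distill-Qwen-32B", "qwen", "Apache 2.0 (base model)"),
    Distill("DeepSeek-R1-Distill-Llama-8B", "llama", "Llama 3.1 license"),
    Distill("DeepSeek-R1-Distill-Llama-70B", "llama", "Llama 3.3 license"),
]


def license_simple(models: List[Distill]) -> List[str]:
    """Return the checkpoints that avoid the Llama license terms."""
    return [m.name for m in models if m.base_family == "qwen"]
```

Calling `license_simple(DISTILLS)` returns the four Qwen-based checkpoints, which is the set to hand to procurement when license simplicity is the requirement.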
The other conversation that comes up in every regulated-industry evaluation: DeepSeek is a Chinese lab, and some procurement teams stop there. The engineering answer is that self-hosted open weights are inspectable files running in your VPC. Served behind your own egress controls, they have no telemetry and no data path to the vendor at all. That is a materially different risk posture from sending prompts to any hosted API, DeepSeek's or anyone else's.
Real deployment cost
As of July 2026, the access landscape has shifted in a way most older writeups miss: DeepSeek's own API platform has moved to the V4 generation, and the legacy deepseek-reasoner endpoint that served the R1 line is scheduled for deprecation on July 24, 2026. First-party per-token access to R1 is effectively ending. What remains is what the MIT license always made possible: running the weights yourself, or renting them from a GPU cloud.
Self-hosting footprints, derived from the published parameter counts (weights only, before KV cache):
- Full R1 at native FP8: 671B parameters at one byte per parameter is roughly 670GB of weights. That is a multi-GPU node in the 8x 96GB-141GB class. This is a serious infrastructure commitment, not a pilot-project footprint.
- Full R1 at ~4-bit quantization: roughly 340GB, which still means several datacenter-class GPUs. Reasoning quality under aggressive quantization needs its own evaluation before you commit.
- R1-Distill-Qwen-32B at BF16: roughly 64GB, which fits a single H100/A100-80GB with about 16GB left for KV cache and activations, or two 48GB cards.
- R1-Distill-Qwen-7B / Llama-8B at 4-bit: workstation and even laptop territory.
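The arithmetic behind those figures is simple enough to keep in a script and rerun for whatever hardware is on the table. This is a minimal sketch: `weight_gb` is the idealized weights-only number (real quantization formats add scale metadata, so actual files run somewhat larger), and the `usable_fraction` reserve for KV cache and activations is our assumption, not a published figure.

```python
import math


def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Idealized weights-only footprint in decimal GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


def min_gpus(weights_gb: float, gpu_gb: float, usable_fraction: float = 0.9) -> int:
    """Smallest GPU count whose usable memory holds the weights.

    usable_fraction is an ASSUMED share of each card available for weights;
    the remainder is a rough reserve for KV cache and activations.
    """
    return math.ceil(weights_gb / (gpu_gb * usable_fraction))


full_fp8 = weight_gb(671, 8)       # full R1, one byte per parameter
full_4bit = weight_gb(671, 4)      # full R1, idealized 4-bit
distill_bf16 = weight_gb(32, 16)   # R1-Distill-Qwen-32B at BF16
```

Under these assumptions, `min_gpus(full_fp8, 96)` gives 8 cards, `min_gpus(distill_bf16, 80)` gives 1, and `min_gpus(distill_bf16, 48)` gives 2, which matches the footprints listed above. Long-context workloads need a larger reserve than 10%, so treat the result as a floor.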
The honest cost conclusion: for most mid-market deployments the full 671B model is the wrong first choice. The 32B distill inside your own cloud, promoted to the full model only if evals prove the gap matters, is the pattern that survives contact with a budget.
Latency and eval behavior that matters
R1 spends tokens to think, and you pay for that in both latency and output-token cost. DeepSeek's own 0528 notes show average reasoning depth on AIME questions rising from roughly 12K to 23K tokens per question. Product implications:
- Time-to-first-answer is a product decision. A visible answer can arrive minutes after the request on hard prompts. If the workload is interactive chat, R1-class reasoning is the wrong default path; route only the hard cases to it.
- Budget output tokens, not requests. Cost models that assume a few hundred output tokens per call will be wrong by an order of magnitude or more on reasoning workloads that run to tens of thousands of tokens.
- Decoding settings are not optional. The model card recommends temperature in the 0.5 to 0.7 range (0.6 with top-p 0.95 for 0528); greedy decoding degrades output and invites repetition loops.
- Benchmarks are self-reported. The numbers above are DeepSeek's published figures on the model cards. Treat them as directional and run your own task-level evals; that is the standard we apply in every model engineering engagement.
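The token and latency points above can be turned into a back-of-envelope budget. The sketch below is a rough model under stated assumptions: `decode_tokens_per_s` is a placeholder you must measure on your own serving stack, prefill time is ignored, and the `ReasoningBudget` class, the helper names, and the dictionary key names are ours for illustration.

```python
from dataclasses import dataclass

# Sampling values quoted above for R1-0528; the key names are illustrative.
RECOMMENDED_SAMPLING = {"temperature": 0.6, "top_p": 0.95}


@dataclass(frozen=True)
class ReasoningBudget:
    reasoning_tokens: int        # chain-of-thought tokens per request
    answer_tokens: int           # visible answer tokens per request
    decode_tokens_per_s: float   # ASSUMPTION: measured per-request decode speed

    @property
    def output_tokens(self) -> int:
        """Everything the model generates is billed or computed as output."""
        return self.reasoning_tokens + self.answer_tokens

    def seconds_until_visible_answer(self) -> float:
        """The reasoning trace is decoded before the answer starts."""
        return self.reasoning_tokens / self.decode_tokens_per_s

    def seconds_total(self) -> float:
        return self.output_tokens / self.decode_tokens_per_s


def underestimate_factor(budget: ReasoningBudget, assumed_output_tokens: int) -> float:
    """How far off a cost model is if it assumed a short, non-reasoning reply."""
    return budget.output_tokens / assumed_output_tokens
```

With the 23K-token reasoning depth quoted above, a 500-token answer, and an assumed 30 tokens/s of decode throughput, `seconds_until_visible_answer()` comes out near 767 seconds, and a cost model built on 300 output tokens per call is off by a factor of about 78. Your own prompts will likely think for less than AIME problems do, which is exactly why the budget should come from measured traces.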
When to use it, and when not
Use DeepSeek R1 when:
- You need strong multi-step reasoning (math-adjacent logic, code synthesis, structured analysis) inside your own infrastructure boundary, with data that cannot leave.
- You want a teacher model for distillation and the license must permit it.
- Latency tolerance is measured in tens of seconds to minutes and correctness is worth the wait: batch analysis, agent planning steps, review pipelines.
Do not use it when:
- The workload is fast interactive chat or high-volume extraction. A non-reasoning model, or a small distill, is cheaper and quicker.
- You have no GPU story and were relying on DeepSeek's own API for the long term. The first-party R1 endpoint is scheduled for deprecation in July 2026, so plan around self-hosting or a GPU cloud from day one.
- Your compliance regime requires a vendor to stand behind the model contractually. MIT weights come with no warranty and no counterparty.
How we would architect it for a client
The pattern we reach for mirrors the privilege-aware architecture we published in our SaulLM-7B brief and built for Letti AI: the open model runs inside the client's VPC as a specialized capability, never as the whole system.
Concretely, for a regulated-data reasoning workload:
- A routing gateway classifies each request. High-volume, low-difficulty traffic goes to a small model (an R1 distill or a Qwen 3 mid-size); only genuinely hard reasoning requests hit the large model. This single decision usually dominates the economics.
- R1-Distill-Qwen-32B as the workhorse, running on a single 80GB-class GPU inside the client's sovereign cloud footprint. The full R1 stays on an offline eval track until the quality gap on the client's own tasks justifies the cluster.
- Reasoning-trace hygiene. R1 emits its chain of thought; we treat those traces as sensitive intermediate data (they restate the input), so they are logged under the same access controls as the source documents and never surfaced raw to end users.
- Deterministic verification after generation, the same discipline as citation checking in legal work: schema validation, unit-testable claims checked in code, and a bounded re-prompt loop. Reasoning models reduce error rates; they do not remove the need for verification.
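The routing, trace-hygiene, and verification steps above can be sketched in a few functions. This is a toy illustration, not our production gateway: the keyword-based `route` stands in for a real difficulty classifier, `REQUIRED_KEYS` is a hypothetical output schema, `generate` and `store_trace` are callables you would supply, and the code assumes the raw output closes its reasoning with a `</think>` tag as R1's chat template does.

```python
import json
from typing import Callable, Optional, Tuple

REQUIRED_KEYS = {"verdict", "evidence"}  # hypothetical output schema


def route(prompt: str, hard_markers=("prove", "derive", "reconcile")) -> str:
    """Toy difficulty check; a real gateway would use a trained classifier."""
    text = prompt.lower()
    return "large" if any(marker in text for marker in hard_markers) else "small"


def split_trace(raw: str) -> Tuple[str, str]:
    """Separate the reasoning trace from the user-facing answer."""
    trace, sep, answer = raw.rpartition("</think>")
    if not sep:  # no closing tag found: treat everything as the answer
        return "", raw.strip()
    return trace.replace("<think>", "", 1).strip(), answer.strip()


def validate(answer: str) -> Optional[str]:
    """Return None if the answer passes, else a short error for the re-prompt."""
    try:
        data = json.loads(answer)
    except json.JSONDecodeError as exc:
        return f"not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return "top-level value must be a JSON object"
    missing = REQUIRED_KEYS - set(data)
    return f"missing keys: {sorted(missing)}" if missing else None


def answer_with_verification(
    prompt: str,
    generate: Callable[[str], str],
    store_trace: Callable[[str], None],
    max_attempts: int = 3,
) -> dict:
    """Bounded re-prompt loop: the trace is stored, never returned to the caller."""
    current = prompt
    for _ in range(max_attempts):
        trace, answer = split_trace(generate(current))
        store_trace(trace)  # goes to the access-controlled log only
        error = validate(answer)
        if error is None:
            return json.loads(answer)
        current = f"{prompt}\n\nPrevious answer rejected ({error}). Return only valid JSON."
    raise ValueError(f"no valid answer after {max_attempts} attempts")
```

In use, the gateway would pick `generate` based on `route(prompt)`, pointing at the distill for "small" and the larger model for "large". The retry prompt carries only the validation error, not the trace, so sensitive intermediate text is not fed back through a second channel.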
That architecture is why the license section of this brief matters more than the benchmark section. MIT is what makes every one of those four decisions available to you at all.
