2026.07.03 · Open-Weights LLM
10 min read

DeepSeek V3: Open-Weights LLM

The open-weights workhorse behind R1, and the routing decision most teams get backwards: when you do not need a reasoning model at all.

Hamad Pervaiz
Founder & CEO, BearPlex
Reference
Parameters: 671B MoE (37B active); V3.2 685B
Base model: -
License: MIT (V3-0324 onward)
Publisher: DeepSeek-AI
Paper date: 2024.12.27

DeepSeek R1 got the headlines; V3 does the work. In every architecture we have built around the DeepSeek family, the V3 line handles the overwhelming majority of traffic, because most production requests do not need a reasoning model; they need a fast, competent generalist. This brief covers the V3 line as an engineering asset in mid-2026: what the weights are, what the license ladder actually says, what happened to the first-party API, and, most importantly, the V3-versus-R1 routing decision that determines your economics.

What it actually is

DeepSeek-V3 (technical report, December 2024) is a Mixture-of-Experts model with 671B total parameters and 37B activated per token, built on Multi-head Latent Attention and the DeepSeekMoE architecture, trained on 14.8T tokens, with a 128K context window per the official model card. Its claim to fame at release was efficiency: the report puts full training at 2.788M H800 GPU hours, a number that reset industry assumptions about frontier training budgets.

The line evolved in three verified steps, and the checkpoint you evaluate matters:

  1. V3-0324 (March 2025, model card): the same architecture post-trained harder. DeepSeek's own numbers: MMLU-Pro 75.9 to 81.2, AIME 39.6 to 59.4, plus "increased accuracy in Function Calling, fixing issues from previous V3 versions." That last line is the one agent builders should read twice.
  2. V3.2 (2025, model card): 685B parameters. It introduces DeepSeek Sparse Attention (DSA), which the card describes as substantially reducing computational complexity while preserving performance, and a revised chat template with tool-aware thinking. Self-reported performance lands in GPT-5's neighborhood.
  3. V4 (2026): the successor generation, also open-weights on Hugging Face. The V3 line is no longer the newest DeepSeek, which is precisely why it is now cheap, well-understood infrastructure.

V3 is also the base that DeepSeek R1 was trained from, which is why the two route so cleanly together: same lineage, same tooling, different thinking budgets.

The license, and what commercial use really permits

The ladder is simple and worth getting exactly right. The original V3 shipped with MIT code but weights under a DeepSeek "Model License" that explicitly supports commercial use. From V3-0324 onward, the weights themselves are MIT: the model card states "this repository and the model weights are licensed under the MIT License," and V3.2 carries the same MIT license. No user thresholds, no attribution badges, no naming rules on derivatives, distillation expressly on the table.

Practical read: standardize on V3-0324 or later and the license conversation with counsel takes five minutes. The China question gets the same answer we gave in the R1 brief: self-hosted open weights are inspectable files in your VPC with no telemetry and no vendor in the request path, which is a stronger data posture than any hosted API, DeepSeek's or anyone else's.

Real cost: API reality and self-hosting

The first-party API story changed in 2026, and older writeups will mislead you. As of July 2026, DeepSeek's API platform serves the V4 generation: deepseek-v4-flash (1M context; cache-miss input $0.14 and output $0.28 per million tokens) and deepseek-v4-pro (input $0.435 and output $0.87 per million tokens). The platform states plainly that the legacy names deepseek-chat and deepseek-reasoner are deprecated on July 24, 2026, with the old names mapping to V4-flash modes in the interim. There is no first-party per-token V3 endpoint to build on anymore.

So the V3 line in 2026 is what the MIT license always made it: a model you run yourself or rent from a GPU cloud. Self-hosting footprints, derived from published parameter counts (weights only, before KV cache), mirror R1's, since it is the same chassis:

  • Full V3 at native FP8: roughly 670GB of weights, an 8-GPU datacenter node in the 96GB-141GB-per-card class.
  • At ~4-bit quantization: roughly 340GB, still multiple datacenter GPUs, with quality-versus-quant evals mandatory before committing.
  • Per-token compute is the good news: 37B active parameters means throughput behaves like a mid-size dense model once the memory is paid for. Memory is the entry fee; serving economics are the reward.
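The weights-only figures above follow from simple arithmetic on the parameter count. A minimal sketch, assuming decimal gigabytes and an even shard across an 8-GPU node; the function names are illustrative, and the numbers exclude KV cache, activations, and runtime overhead:

```python
def weight_footprint_gb(total_params: float, bits_per_param: float) -> float:
    """Weights-only memory in decimal GB; excludes KV cache and runtime overhead."""
    return total_params * bits_per_param / 8 / 1e9


def per_gpu_weights_gb(weights_gb: float, num_gpus: int) -> float:
    """Weights held by each card when sharded evenly across a node."""
    return weights_gb / num_gpus


V3_TOTAL_PARAMS = 671e9  # total parameters, per the technical report figure quoted earlier
NODE_GPUS = 8            # the 8-GPU datacenter node discussed above

fp8_gb = weight_footprint_gb(V3_TOTAL_PARAMS, bits_per_param=8)
int4_gb = weight_footprint_gb(V3_TOTAL_PARAMS, bits_per_param=4)
print(f"FP8 weights:   {fp8_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")

# How much memory each card has left once its shard of FP8 weights is loaded.
for card_gb in (96, 141):
    shard = per_gpu_weights_gb(fp8_gb, NODE_GPUS)
    print(f"{card_gb}GB card: {shard:.1f} GB of weights, "
          f"{card_gb - shard:.1f} GB left for KV cache and overhead")
```

The raw 4-bit result comes out slightly below the rounded figure in the list; real 4-bit formats also store scale metadata, so treat the arithmetic as a floor rather than a sizing guarantee.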

If that footprint is out of budget, the honest alternatives are a hosted V3/V3.2 endpoint from a GPU cloud, or a smaller open model entirely, such as a mid-size Qwen 3 model.

The V3-versus-R1 routing decision

This is the section this brief exists for. The industry default of "buy the smartest model and send everything to it" is exactly backwards with a reasoning model in the stack, because reasoning is priced in output tokens and time. DeepSeek's own R1-0528 notes put average reasoning depth around 23K tokens per AIME-class question; a V3-class model answers a routine request in a few hundred tokens, seconds sooner.
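To see why the default matters, here is a rough sketch of blended output volume as a function of how much traffic is escalated. The 23K figure is the AIME-class average cited above, so it overstates a typical escalated request; the 300-token routine answer and the price are assumptions chosen only for illustration:

```python
REASONING_TOKENS = 23_000  # average depth per AIME-class question, per the R1-0528 notes
ROUTINE_TOKENS = 300       # assumption: one reading of "a few hundred tokens"
PRICE_PER_MILLION = 1.00   # hypothetical USD per million output tokens, same for both lanes


def output_cost_usd(output_tokens: float, usd_per_million: float) -> float:
    """Cost of a given number of output tokens at a per-million price."""
    return output_tokens / 1_000_000 * usd_per_million


def blended_output_tokens(escalation_rate: float) -> float:
    """Mean output tokens per request when only a fraction is escalated."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return (escalation_rate * REASONING_TOKENS
            + (1 - escalation_rate) * ROUTINE_TOKENS)


for rate in (0.05, 0.20, 1.00):
    tokens = blended_output_tokens(rate)
    cost_per_thousand = output_cost_usd(tokens, PRICE_PER_MILLION) * 1000
    print(f"escalation {rate:.0%}: {tokens:,.0f} tokens/request, "
          f"${cost_per_thousand:.2f} per 1,000 requests")
```

Under these assumptions, sending everything to the reasoning lane emits about 77 times the output tokens of the routine lane, while a 5% escalation rate averages roughly 1,435 tokens per request. The ratio, not the hypothetical price, is the point.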

The decision rule we deploy:

  • Default lane: V3. Extraction, summarization, drafting, classification, translation, straightforward tool calls, RAG answer synthesis. This is 80-95% of real traffic in the systems we run, and on this traffic a reasoning model is strictly worse: slower, costlier, no measurable quality gain.
  • Escalation lane: R1. Multi-step planning, math-adjacent logic, code synthesis with intricate constraints, anything where your evals show chain-of-thought actually moves accuracy.
  • The router is a classifier, not a vibe. Route on measurable signals (task type, schema complexity, retry history), audit the escalation rate, and treat "we route by difficulty" as a claim your logs must prove.
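The decision rule above can be sketched as a small rule-based router. This is an illustration, not our production classifier: the task-type names, the thresholds, and the `Request` fields are all hypothetical and should be set from your own evals.

```python
from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    DEFAULT = "v3"
    ESCALATION = "r1"


@dataclass(frozen=True)
class Request:
    task_type: str      # e.g. "extraction", "planning"
    schema_depth: int   # nesting depth of the required output schema
    prior_retries: int  # failed default-lane attempts for this request


# Hypothetical policy values; tune them against your own evals.
ESCALATE_TASK_TYPES = frozenset({"planning", "math", "constrained_codegen"})
MAX_DEFAULT_SCHEMA_DEPTH = 4
MAX_DEFAULT_RETRIES = 2


def route(req: Request) -> Lane:
    """Route on measurable signals only: task type, schema complexity, retry history."""
    if req.task_type in ESCALATE_TASK_TYPES:
        return Lane.ESCALATION
    if req.schema_depth > MAX_DEFAULT_SCHEMA_DEPTH:
        return Lane.ESCALATION
    if req.prior_retries >= MAX_DEFAULT_RETRIES:
        return Lane.ESCALATION
    return Lane.DEFAULT


def escalation_rate(decisions: list[Lane]) -> float:
    """Share of requests sent to the escalation lane; the number to audit."""
    if not decisions:
        return 0.0
    return sum(d is Lane.ESCALATION for d in decisions) / len(decisions)


traffic = [
    Request("extraction", schema_depth=2, prior_retries=0),
    Request("summarization", schema_depth=1, prior_retries=0),
    Request("planning", schema_depth=1, prior_retries=0),
    Request("classification", schema_depth=1, prior_retries=2),
]
decisions = [route(r) for r in traffic]
print([d.value for d in decisions], escalation_rate(decisions))
```

Because every branch reads a logged field, each escalation can be traced to a reason, which is what makes the escalation rate auditable rather than anecdotal.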

V3.2's tool-aware thinking modes blur this line inside a single checkpoint, which is convenient, but the budget discipline is identical: thinking tokens are spend, and the router (or the mode flag) is where you control it.

When to use it, and when not

Use DeepSeek V3 when:

  • You need a frontier-adjacent generalist inside your own infrastructure boundary under a clean MIT license.
  • You are building the two-lane DeepSeek stack: V3 for volume, R1 for the hard 5-20%, one toolchain, one deployment story.
  • Function calling and structured outputs on-prem matter (use V3-0324 or later; the card's own fix notes tell you why).

Do not use it when:

  • You have no GPU story and wanted a first-party API: the V3-era endpoints are gone as of July 2026, and pretending otherwise is technical debt with a countdown timer.
  • Your workload fits a 32B-class open model; a smaller model at a fraction of the memory footprint wins the total-cost math.
  • You need a contractual counterparty behind the model. MIT weights ship with no warranty and no SLA; that is the trade.

How we would architect it for a client

The pattern is the sovereign-cloud two-lane stack we described in the R1 brief, with V3 promoted to its rightful place as the default lane:

  1. Routing gateway in front, classifying every request; V3 (or a hosted V3.2 endpoint) takes the volume lane, R1 or V3.2-thinking takes the escalation lane, and the escalation rate is a monitored SLO, not a hope.
  2. One serving stack: same MoE chassis, same quantization and serving toolchain for both lanes, which halves the operational surface compared to mixing model families.
  3. Deterministic verification after generation: schema validation, code-checked claims, bounded re-prompts. Non-reasoning models fail faster and cheaper; they still fail, and the harness, not the model, owns correctness.
  4. An eval track against V4: the successor weights are open too, and the V3-to-V4 promotion should happen when your task evals justify it, on your schedule rather than a vendor's deprecation calendar. That optionality is what owning weights buys, and it is the quiet, compounding argument for the whole open-weights approach.
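Step 3 is the piece most often skipped, so here is a minimal sketch of it: validate the output against a required shape and re-prompt a bounded number of times. The `generate` callable stands in for whatever client calls your serving stack, and the flat key-to-type schema is a simplifying assumption; a real harness would likely use a full schema validator.

```python
import json
from typing import Callable, Optional


def validate(raw: str, required: dict[str, type]) -> Optional[dict]:
    """Return the parsed object if it is JSON with the required typed keys, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for key, expected in required.items():
        if not isinstance(data.get(key), expected):
            return None
    return data


def generate_verified(
    generate: Callable[[str], str],  # hypothetical model client: prompt in, text out
    prompt: str,
    required: dict[str, type],
    max_attempts: int = 3,
) -> Optional[dict]:
    """Call the model at most max_attempts times; return the first valid output."""
    attempt_prompt = prompt
    for _ in range(max_attempts):
        data = validate(generate(attempt_prompt), required)
        if data is not None:
            return data
        # Bounded re-prompt: restate the contract instead of retrying blindly.
        keys = ", ".join(sorted(required))
        attempt_prompt = f"{prompt}\n\nReturn only a JSON object with keys: {keys}."
    return None  # caller decides: escalate to the reasoning lane or fail closed
```

Returning `None` rather than the last invalid output keeps the correctness decision in the harness: the caller either escalates the request or rejects it, and nothing unvalidated reaches downstream systems.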

Frequently asked

Yes, DeepSeek V3 can be used commercially. From the V3-0324 checkpoint onward, the model card states the repository and the model weights are licensed under MIT, and V3.2 carries the same license: no user thresholds, no attribution requirements, no naming rules, distillation permitted. The original December 2024 V3 release used a separate DeepSeek Model License for the weights that also explicitly supports commercial use, but the clean recommendation is to standardize on V3-0324 or later and keep the stack fully MIT.
