Is DeepSeek V3 free for commercial use?

Yes. From the V3-0324 checkpoint onward, the model card states the repository and the model weights are licensed under MIT, and V3.2 carries the same license: no user thresholds, no attribution badges, no naming rules, distillation permitted. MIT's one standing condition is that the copyright and license notice travel with copies of the software. The original December 2024 V3 release used a separate DeepSeek Model License for the weights that also explicitly supports commercial use, but the clean recommendation is to standardize on V3-0324 or later and keep the stack fully MIT.

When should we use DeepSeek V3 instead of R1?

For most traffic. Extraction, summarization, drafting, classification, RAG synthesis, and routine tool calls, which make up 80-95% of requests in typical production systems, get no measurable benefit from a reasoning model, just higher latency and output-token cost (DeepSeek's own notes show R1-0528 averaging around 23K reasoning tokens on hard questions). Route V3 as the default lane and escalate to R1 only where your evals show chain-of-thought moves accuracy, and audit that escalation rate in your logs.

Can we still use DeepSeek V3 through DeepSeek's own API?

Effectively no, as of July 2026. The first-party platform now serves the V4 generation (deepseek-v4-flash and deepseek-v4-pro), and DeepSeek's pricing page states the legacy deepseek-chat and deepseek-reasoner names are deprecated on July 24, 2026, mapping to V4 modes in the interim. The durable ways to run the V3 line are the MIT weights in your own infrastructure or a hosted endpoint from a third-party GPU cloud.

What does it cost to self-host DeepSeek V3?

The published architecture is 671B total parameters with 37B active per token, so weights alone are roughly 670GB at native FP8 (one byte per parameter): an 8-GPU datacenter node before you account for KV cache. At roughly 4-bit quantization that drops to about 340GB, still multi-GPU, and output quality under quantization needs evaluation before you commit. The compensating economics: only 37B parameters are active per token, so once memory is provisioned, throughput behaves like a mid-size dense model.

Which V3 checkpoint should we deploy?

V3-0324 at minimum, and evaluate V3.2. V3-0324 is where the weights went MIT and where DeepSeek's card reports major post-training gains (MMLU-Pro 81.2, AIME 59.4) plus explicit function-calling fixes, which matter for agent work. V3.2 (685B, MIT) adds DeepSeek Sparse Attention for cheaper long-context compute and a tool-aware thinking mode. The original December 2024 checkpoint is now primarily of historical and research interest.

How does DeepSeek V3 relate to R1 and V4?

V3 is the base model R1's reasoning training was built on, which is why the two pair so naturally in a routed architecture: same lineage and toolchain, different thinking budgets. V4 is the 2026 successor generation, and notably its weights are also published on Hugging Face. Owning V3 in production therefore comes with a built-in upgrade path: run V4 in an offline eval track and promote it when your own task metrics justify it, on your schedule.

Does BearPlex deploy DeepSeek V3 in client work?

We evaluate it as the default volume lane in DeepSeek-based architectures: a routing gateway in front, V3-class serving for the majority of traffic, R1 or a thinking mode for the hard minority, deterministic verification behind both, all inside the client's cloud boundary. It is the standard candidate wherever data residency, license cleanliness, and cost control at scale outweigh the convenience of a hosted frontier API, and the client's own evals make the final call.

DeepSeek V3: The Engineering Decision Brief

DeepSeek R1 got the headlines; V3 does the work. In every architecture we have built around the DeepSeek family, the V3 line handles the overwhelming majority of traffic, because most production requests do not need a reasoning model; they need a fast, competent generalist. This brief covers the V3 line as an engineering asset in mid-2026: what the weights are, what the license ladder actually says, what happened to the first-party API, and, most importantly, the V3-versus-R1 routing decision that determines your economics.

What it actually is

DeepSeek-V3 (technical report, December 2024) is a Mixture-of-Experts model with 671B total parameters and 37B activated per token, built on Multi-head Latent Attention and the DeepSeekMoE architecture, trained on 14.8T tokens, with a 128K context window per the official model card. Its claim to fame at release was efficiency: the report puts full training at 2.788M H800 GPU hours, a number that reset industry assumptions about frontier training budgets.

The line evolved in three verified steps, and the checkpoint you evaluate matters:

V3-0324 (March 2025, model card): the same architecture post-trained harder. DeepSeek's own numbers: MMLU-Pro 75.9 to 81.2, AIME 39.6 to 59.4, plus "increased accuracy in Function Calling, fixing issues from previous V3 versions." That last line is the one agent builders should read twice.
V3.2 (2025, model card): 685B parameters. It introduces DeepSeek Sparse Attention (DSA), which the card describes as substantially reducing computational complexity while preserving performance, and a revised chat template with tool-aware thinking. DeepSeek self-reports performance in GPT-5's neighborhood.
V4 (2026): the successor generation, also open-weights on Hugging Face. The V3 line is no longer the newest DeepSeek, which is precisely why it is now cheap, well-understood infrastructure.

V3 is also the base that DeepSeek R1 was trained from, which is why the two route so cleanly together: same lineage, same tooling, different thinking budgets.

The license, and what commercial use really permits

The ladder is simple and worth getting exactly right. The original V3 shipped with MIT code but weights under a DeepSeek "Model License" that explicitly supports commercial use. From V3-0324 onward, the weights themselves are MIT: the model card states "this repository and the model weights are licensed under the MIT License," and V3.2 carries the same MIT license. No user thresholds, no attribution badges, no naming rules on derivatives, distillation expressly on the table.

Practical read: standardize on V3-0324 or later and the license conversation with counsel takes five minutes. The China question gets the same answer we gave in the R1 brief: self-hosted open weights are inspectable files in your VPC with no telemetry and no vendor in the request path, which is a stronger data posture than any hosted API, DeepSeek's or anyone else's.

Real cost: API reality and self-hosting

The first-party API story changed in 2026, and older writeups will mislead you. As of July 2026, DeepSeek's API platform serves the V4 generation: deepseek-v4-flash (1M context, cache-miss input $0.14, output $0.28 per million tokens) and deepseek-v4-pro ($0.435 in / $0.87 out). The platform states plainly that the legacy names deepseek-chat and deepseek-reasoner are deprecated on July 24, 2026, with the old names mapping to V4-flash modes in the interim. There is no first-party per-token V3 endpoint to build on anymore.
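For budgeting against the first-party prices quoted above, a minimal sketch of the per-request arithmetic follows. The prices are the ones stated in this brief; the function names are ours, and the calculation assumes every input token is a cache miss and that no discounts apply.

```python
# USD per million tokens as quoted in this brief: (cache-miss input, output).
PRICES = {
    "deepseek-v4-flash": (0.14, 0.28),
    "deepseek-v4-pro": (0.435, 0.87),
}


def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request, assuming no cache hits and no discounts."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def monthly_cost_usd(model: str, requests_per_day: int,
                     input_tokens: int, output_tokens: int, days: int = 30) -> float:
    """Monthly spend for a steady workload of identically sized requests."""
    return requests_per_day * days * request_cost_usd(model, input_tokens, output_tokens)


if __name__ == "__main__":
    # Hypothetical workload: 100k requests/day, 2,000 tokens in, 300 tokens out.
    for name in PRICES:
        print(name, round(monthly_cost_usd(name, 100_000, 2_000, 300), 2))
```

Treat the output as a ceiling for comparison against a self-hosted or GPU-cloud quote, since cache hits lower the real input bill.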

So the V3 line in 2026 is what the MIT license always made it: a model you run yourself or rent from a GPU cloud. Self-hosting footprints, derived from published parameter counts (weights only, before KV cache), mirror R1's, since it is the same chassis:

Full V3 at native FP8: roughly 670GB of weights, an 8-GPU datacenter node in the 96GB-141GB-per-card class.
At ~4-bit quantization: roughly 340GB, still multiple datacenter GPUs, with quality-versus-quant evals mandatory before committing.
Per-token compute is the good news: 37B active parameters means throughput behaves like a mid-size dense model once the memory is paid for. Memory is the entry fee; serving economics are the reward.
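The footprint figures above are plain arithmetic on the published parameter count, and it can help to keep that arithmetic in a script next to your capacity plan. The sketch below is ours, not DeepSeek's: it counts weights only, and real quantized checkpoints store scales and metadata alongside the weights, which is one reason practical 4-bit footprints land somewhat above the raw figure.

```python
def weight_gb(total_params: float, bits_per_param: float) -> float:
    """Raw weight footprint in decimal GB: params * bits / 8 bytes."""
    return total_params * bits_per_param / 8 / 1e9


def node_headroom_gb(weights_gb: float, gpus: int, gpu_memory_gb: float) -> float:
    """Memory left on a node after weights, i.e. the budget for KV cache
    and activations. A negative value means the weights do not fit."""
    return gpus * gpu_memory_gb - weights_gb


if __name__ == "__main__":
    V3_PARAMS = 671e9  # published total parameter count
    fp8 = weight_gb(V3_PARAMS, 8)
    int4 = weight_gb(V3_PARAMS, 4)
    print(f"FP8 weights:   {fp8:.1f} GB")
    print(f"4-bit weights: {int4:.1f} GB (before quantization scales)")
    for card_gb in (96, 141):
        left = node_headroom_gb(fp8, 8, card_gb)
        print(f"8 x {card_gb}GB node, FP8: {left:.0f} GB left for KV cache")
```

On 96GB cards an 8-GPU node holds the FP8 weights with only about 97GB to spare, so long-context or high-concurrency serving is where the larger cards earn their price.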

If that footprint is out of budget, the honest alternatives are a hosted V3/V3.2 endpoint from a GPU cloud, or a smaller open model entirely, such as Qwen 3 mid-sizes.

The V3-versus-R1 routing decision

This is the section this brief exists for. The industry default of "buy the smartest model and send everything to it" is exactly backwards with a reasoning model in the stack, because reasoning is priced in output tokens and time. DeepSeek's own R1-0528 notes put average reasoning depth around 23K tokens per AIME-class question; a V3-class model answers a routine request in a few hundred tokens, seconds sooner.
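The output-token gap is easier to reason about as a blended average. The sketch below is illustrative only: it takes 300 tokens as a stand-in for "a few hundred" and uses the 23K figure as the reasoning-lane average, which is the hard-question end of the range, so treat the result as an upper bound on reasoning spend.

```python
def blended_output_tokens(escalation_rate: float,
                          default_tokens: float = 300,        # assumed "few hundred"
                          reasoning_tokens: float = 23_000) -> float:
    """Expected output tokens per request when a fraction of traffic
    (escalation_rate, between 0 and 1) goes to the reasoning lane."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return (1 - escalation_rate) * default_tokens + escalation_rate * reasoning_tokens


if __name__ == "__main__":
    for rate in (0.0, 0.05, 0.10, 0.20, 1.0):
        print(f"escalation {rate:>4.0%}: {blended_output_tokens(rate):>8.0f} tokens/request")
```

Under these assumptions, routing 10% of traffic to the reasoning lane averages about 2,570 output tokens per request, against 23,000 when everything goes to the reasoning model.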

The decision rule we deploy:

Default lane: V3. Extraction, summarization, drafting, classification, translation, straightforward tool calls, RAG answer synthesis. This is 80-95% of real traffic in the systems we run, and on this traffic a reasoning model is strictly worse: slower, costlier, no measurable quality gain.
Escalation lane: R1. Multi-step planning, math-adjacent logic, code synthesis with intricate constraints, anything where your evals show chain-of-thought actually moves accuracy.
The router is a classifier, not a vibe. Route on measurable signals (task type, schema complexity, retry history), audit the escalation rate, and treat "we route by difficulty" as a claim your logs must prove.
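A minimal sketch of that decision rule as code, assuming a rule-based router rather than a learned one. The task labels, thresholds, and lane names are hypothetical placeholders to be tuned against your own evals; the point is that every routing decision returns a reason that can be logged.

```python
from dataclasses import dataclass

# Hypothetical task labels that your evals have shown benefit from reasoning.
ESCALATION_TASKS = {"planning", "math", "constrained_code"}


@dataclass(frozen=True)
class Request:
    task_type: str          # label from an upstream classifier or the caller
    schema_fields: int = 0  # size of the structured output schema, if any
    prior_retries: int = 0  # failed attempts on the default lane so far


def route(req: Request, max_schema_fields: int = 40,
          max_retries: int = 2) -> tuple[str, str]:
    """Return (lane, reason). The reason is logged so the escalation
    rate can be audited per signal instead of asserted."""
    if req.task_type in ESCALATION_TASKS:
        return "r1", "task_type"
    if req.prior_retries >= max_retries:
        return "r1", "retry_history"
    if req.schema_fields > max_schema_fields:
        return "r1", "schema_complexity"
    return "v3", "default"


if __name__ == "__main__":
    print(route(Request("summarization")))
    print(route(Request("extraction", schema_fields=3, prior_retries=2)))
    print(route(Request("planning")))
```

Checking the cheapest signal first keeps the router itself out of the latency budget, and the logged reason shows which rule is driving escalations when the rate drifts.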

V3.2's tool-aware thinking modes blur this line inside a single checkpoint, which is convenient, but the budget discipline is identical: thinking tokens are spend, and the router (or the mode flag) is where you control it.
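Whichever mechanism controls the budget, the check on it can be a few lines run over request logs. The sketch below assumes a hypothetical log record of `(lane, output_tokens)` and a 20% escalation ceiling, the top of the range this brief uses for the hard minority; both are placeholders for your own log schema and target.

```python
from typing import Iterable, Tuple


def audit(records: Iterable[Tuple[str, int]], max_rate: float = 0.20) -> dict:
    """Summarize escalation from (lane, output_tokens) log records.
    Any lane other than "v3" counts as escalated."""
    rows = list(records)
    escalated_tokens = [tokens for lane, tokens in rows if lane != "v3"]
    total_tokens = sum(tokens for _, tokens in rows)
    rate = len(escalated_tokens) / len(rows) if rows else 0.0
    share = sum(escalated_tokens) / total_tokens if total_tokens else 0.0
    return {
        "escalation_rate": rate,          # fraction of requests escalated
        "escalated_token_share": share,   # fraction of output tokens they consumed
        "within_budget": rate <= max_rate,
    }


if __name__ == "__main__":
    # Hypothetical sample: nine routine requests, one reasoning request.
    sample = [("v3", 300)] * 9 + [("r1", 23_000)]
    print(audit(sample))
```

In the hypothetical sample, one escalated request in ten accounts for roughly 89% of output tokens, which is why the rate is worth alerting on.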

When to use it, and when not

Use DeepSeek V3 when:

You need a frontier-adjacent generalist inside your own infrastructure boundary under a clean MIT license.
You are building the two-lane DeepSeek stack: V3 for volume, R1 for the hard 5-20%, one toolchain, one deployment story.
Function calling and structured outputs on-prem matter (use V3-0324 or later; the card's own fix notes tell you why).

Do not use it when:

You have no GPU story and want a first-party API: the V3-era endpoints are gone as of July 2026, and pretending otherwise is technical debt with a countdown timer.
Your workload fits a 32B-class open model; a smaller model at a fraction of the memory footprint wins the total-cost math.
You need a contractual counterparty behind the model. MIT weights ship with no warranty and no SLA; that is the trade.

How we would architect it for a client

The pattern is the sovereign-cloud two-lane stack we described in the R1 brief, with V3 promoted to its rightful place as the default lane:

Routing gateway in front, classifying every request; V3 (or a hosted V3.2 endpoint) takes the volume lane, R1 or V3.2-thinking takes the escalation lane, and the escalation rate is a monitored SLO, not a hope.
One serving stack: same MoE chassis, same quantization and serving toolchain for both lanes, which roughly halves the operational surface compared to running two unrelated model families.
Deterministic verification after generation: schema validation, code-checked claims, bounded re-prompts. Non-reasoning models fail faster and cheaper; they still fail, and the harness, not the model, owns correctness.
An eval track against V4: the successor weights are open too, and the V3-to-V4 promotion should happen when your task evals justify it, on your schedule rather than a vendor's deprecation calendar. That optionality is what owning weights buys, and it is the quiet, compounding argument for the whole open-weights approach.
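The verification step in that stack can be sketched as a small harness around whatever client calls the model. Everything here is an assumption for illustration: `generate` is a hypothetical callable that takes a prompt and returns the model's raw text, the schema is a flat mapping of required keys to Python types, and the re-prompt wording is a placeholder.

```python
import json
from typing import Callable, Dict, Optional


def validate(raw: str, required: Dict[str, type]) -> Optional[dict]:
    """Return the parsed object if it is a JSON object containing every
    required key with the expected type, otherwise None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for key, expected in required.items():
        if not isinstance(data.get(key), expected):
            return None
    return data


def generate_verified(generate: Callable[[str], str], prompt: str,
                      required: Dict[str, type],
                      max_attempts: int = 3) -> Optional[dict]:
    """Call the model at most max_attempts times; return the first output
    that validates, or None so the caller can escalate or fail closed."""
    attempt_prompt = prompt
    for _ in range(max_attempts):
        data = validate(generate(attempt_prompt), required)
        if data is not None:
            return data
        # Bounded re-prompt: restate the contract instead of retrying blindly.
        attempt_prompt = (prompt + "\n\nReturn only a JSON object with keys: "
                          + ", ".join(sorted(required)))
    return None


if __name__ == "__main__":
    replies = iter(["Sure! Here you go", '{"invoice_id": "A1", "total": 12.5}'])
    print(generate_verified(lambda p: next(replies), "Extract the invoice.",
                            {"invoice_id": str, "total": float}))
```

Returning `None` after the attempt budget is spent leaves the decision with the harness: the caller can escalate the request to the reasoning lane or reject it, and either outcome appears in the logs.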
