Is Qwen 3 really free for commercial use?

Yes. Every Qwen 3 model, dense and MoE, from 0.6B to 235B, is released under Apache 2.0, which permits commercial use, modification, distribution, and white-labeling with no user thresholds, no attribution badge in your product, and no naming requirements on fine-tuned derivatives. Standard Apache 2.0 conditions apply, such as retaining license text and notices when you redistribute, and it includes an explicit patent grant that MIT-licensed alternatives lack.

Which Qwen 3 size should we deploy?

Run the ladder against your own eval set and take the smallest model that clears the quality bar. As a starting map from the published parameter counts: 4B for high-volume classification on modest hardware, 32B on a single 80GB GPU (or roughly 16GB at 4-bit) as the general workhorse, 30B-A3B when you want near-small-model serving cost with larger-model quality, and 235B-A22B on a multi-GPU node for the hard tier. The shared chat template makes moving between rungs an eval decision rather than a rebuild.

What is Qwen 3's hybrid thinking mode, and why does it matter operationally?

Each checkpoint can either reason step-by-step in a think block before answering or respond immediately, controlled by the enable_thinking flag or /think and /no_think commands in conversation. Operationally this means one deployment serves both fast interactive traffic and hard reasoning requests, with your gateway deciding per request whether to spend reasoning tokens. Without it, you typically operate a fast model and a separate reasoning model, which doubles the serving, eval, and upgrade surface.

Why does Qwen 3 keep winning regulated on-prem evaluations?

Three properties compound. Apache 2.0 across the whole family makes legal review a lookup instead of a bespoke license analysis. The 0.6B-to-235B ladder means the same family fits whatever hardware the client already has, from edge boxes to datacenter nodes, with one integration. And self-hosted weights keep every token inside the client's infrastructure boundary, which is the core data-residency requirement in finance, healthcare, and government work. Evaluations are won on the intersection of constraints, and Qwen 3 rarely fails any single gate.

What context window does Qwen 3 actually have?

Per the official materials: the three smallest dense models (0.6B, 1.7B, 4B) carry 32K contexts, and the larger models 128K-class contexts. The 235B-A22B flagship is 32,768 tokens natively, extended to 131,072 with YaRN rope scaling, which must be explicitly enabled in the serving configuration. If your workload depends on long context, validate quality at your real document lengths with YaRN configured rather than assuming ceiling behavior.

How does Qwen 3 compare to DeepSeek R1 and Llama 4?

They solve different constraints. DeepSeek R1 (MIT) is a dedicated 671B-scale reasoning model: strongest when maximum reasoning depth inside your own infrastructure is the requirement. Llama 4 is natively multimodal but carries a bespoke community license with attribution and derivative-naming clauses. Qwen 3's edge is breadth under one clean license: eight sizes, hybrid per-request reasoning, 119 languages, Apache 2.0 everywhere. In our evaluations the decision usually falls out of the client's constraints, and we run task-level evals rather than trusting anyone's leaderboard, including the vendors'.

Can we fine-tune Qwen 3 and ship it under our own product name?

Yes. Apache 2.0 places no naming requirements or attribution badges on derivative models, so a fine-tuned Qwen 3 can ship under your brand, embedded in your product, including white-label arrangements. This is a sharp contrast with the Llama 4 Community License, which requires distributed derivatives to carry Llama at the beginning of the model name and the product to display Built with Llama. Keep the standard Apache notices in redistributed source, and the rest is your call.

Qwen 3: The Apache 2.0 Ladder, Engineering Brief

When a bank, a health system, or a government body evaluates open-weights models for an on-prem deployment, the shortlist process looks nothing like a leaderboard. Legal reviews the license before engineering benchmarks anything. Infrastructure asks what runs on the hardware already racked. Product asks what happens when the pilot needs to scale down to an edge device or up to a flagship. Qwen 3 is the family that keeps clearing all three gates at once, and this brief is about why.

What it actually is

Qwen 3 (Qwen Team, April 2025 release; technical report May 2025) is a family of eight open-weights models spanning three orders of magnitude, per the official announcement:

Six dense models: 0.6B, 1.7B, 4B, 8B, 14B, and 32B. The three smallest carry 32K context; the larger ones 128K.
Two Mixture-of-Experts models: Qwen3-30B-A3B (30B total, 3B activated) and the flagship Qwen3-235B-A22B (235B total, 22B activated; 128 experts with 8 active, per the model card, 32,768 context natively and 131,072 with YaRN).

Training scale roughly doubled over the prior generation: about 36 trillion tokens versus Qwen2.5's 18 trillion, with multilingual coverage expanding from 29 to 119 languages and dialects.

The signature feature is the hybrid thinking mode: one checkpoint that reasons step-by-step in <think> blocks when enable_thinking is on (or /think is issued in-conversation) and answers immediately when it is off. The technical report adds a thinking-budget mechanism, so inference-time reasoning depth becomes a dial rather than a model choice. You deploy one model and choose per-request whether to pay for reasoning, instead of running a dedicated reasoning model beside a fast one.

The license, and what commercial use really permits

Every Qwen 3 model, from the 0.6B to the 235B flagship, ships under Apache 2.0. In enterprise legal review, that one fact does more work than any benchmark:

It is a known quantity. Apache 2.0 has two decades of history and pre-existing approval in most corporate open-source policies. The review is a lookup, not an analysis. Compare the Llama 4 Community License, which is short but bespoke: attribution clauses, derivative naming rules, and a user-threshold gate that each need a lawyer's read.
It includes an explicit patent grant, which MIT does not. For patent-sensitive industries this is a real, if rarely decisive, point in Apache 2.0's favor.
No attribution badge, no naming rules, no usage thresholds. Fine-tune it, rename it, white-label it, embed it in a product you sell. Standard Apache 2.0 conditions apply (keep the license text and notices in redistributed source), none of which touch your product's UI or brand.
Uniformity across the ladder matters more than it looks. The license story does not change when you move from the 4B on an edge box to the 235B in the datacenter. One legal approval covers the entire deployment surface, now and as it grows.

Real deployment cost

The ladder is the cost story. Working from published parameter counts (weights only, before KV cache):

Qwen3-4B at BF16: roughly 8GB. Consumer GPUs, small cloud instances, serious edge hardware.
Qwen3-32B at BF16: roughly 64GB, a single 80GB-class GPU; at 4-bit quantization roughly 16GB, a single 24GB card.
Qwen3-30B-A3B: 30B of memory, but only 3B parameters active per token, so it serves with the per-token compute of a small model. This checkpoint is frequently the price-performance sweet spot in our evals.
Qwen3-235B-A22B at FP8: roughly 235GB of weights, a multi-GPU node, with 22B-active MoE economics per token.

For teams that want managed capacity before committing GPUs: as of July 2026, Together AI lists serverless Qwen3-235B-A22B at $0.60 per million input tokens and $3.60 per million output tokens, with an FP8 throughput tier at $0.20 and $0.60. We treat hosted pricing like that as the rent-before-you-buy phase of a self-hosted versus managed decision, not the end state for regulated data.

Latency and eval behavior that matters

Thinking mode changes the cost curve, per request. With thinking on, the model emits reasoning tokens before the answer: better on hard tasks, slower and more expensive on everything. The engineering win is that this is a request-level flag, so your gateway can route by difficulty without a second deployment.
Sampling settings are documented and non-optional. The model card specifies temperature 0.6, top-p 0.95 for thinking mode, temperature 0.7, top-p 0.8 for non-thinking, and warns in capitals against greedy decoding in thinking mode because of repetition loops. Bake these into the serving layer; do not leave them to application defaults.
Context is 32K native on the flagship, 131K with YaRN. Long-context work needs the YaRN rope-scaling configuration enabled and validated; do not assume 128K-class behavior out of the box.
119 languages makes the family a default candidate for multilingual products, an area where the Llama 4 instruct models list 12 languages.
We quote no leaderboard numbers here. The Qwen team publishes competitive claims against frontier reasoning models; as with every vendor, the numbers that matter are the ones from evals on your task, your documents, your language mix.

When to use it, and when not

Use Qwen 3 when:

The deployment is on-prem or sovereign-cloud in a regulated industry and license friction must be near zero.
You need one model family across heterogeneous hardware: edge devices, workstation inference, and datacenter serving, with a single chat template and one legal review.
Your product roadmap includes distributing or white-labeling a fine-tuned model under your own brand, which Apache 2.0 permits and the Llama license complicates.
The workload mixes quick interactive requests and hard reasoning ones, and per-request thinking control beats operating two model fleets.

Do not use it when:

You need the absolute frontier of reasoning quality regardless of ownership; evaluate DeepSeek R1 and the closed APIs against it on your tasks.
The workload is natively multimodal document intake; the core Qwen 3 line is a text family, and Llama 4's early-fusion design or dedicated vision-language models fit better.
Your compliance regime requires a contractual counterparty standing behind model behavior. Apache 2.0 weights come with no warranty; that gap is filled by your engineering discipline, not the license.

How we would architect it for a client

The recurring shape is the regulated-SaaS pattern we know from building PeoplePlus, our own HR platform, where tenant data boundaries and predictable unit economics matter more than peak benchmark scores:

One family, three tiers. Qwen3-4B for high-volume classification and routing, Qwen3-32B (or 30B-A3B) as the workhorse for generation and extraction, and the 235B flagship reserved for the low-volume hard tier. Because the chat template and tokenizer are shared, promotion between tiers is an eval decision, not a re-integration project.
Thinking mode as a gateway policy. The routing layer sets enable_thinking per request class and caps reasoning budgets, so cost is governed centrally instead of negotiated per feature team. This is the same routing discipline described in our DeepSeek R1 brief, collapsed into one model family.
Sovereign deployment first. Weights live in the client's sovereign cloud footprint with no vendor in the inference path; hosted endpoints are used only for pre-commitment evaluation, then retired.
Eval harness before model choice. Our model engineering engagements start by building the client's task-level eval set, then let the ladder compete: the smallest Qwen 3 that clears the quality bar wins the tier. That procedure, more than any single launch, is why this family keeps ending up in production: it almost always has a rung that clears the bar at the lowest infrastructure cost in the room.

Apache 2.0 at every size is not a marketing detail. It is the property that lets one architecture, one legal review, and one integration serve a product from pilot to scale. That is the whole brief.

Qwen 3Open-Weights LLM

What it actually is

The license, and what commercial use really permits

Real deployment cost

Latency and eval behavior that matters

When to use it, and when not

How we would architect it for a client

Frequently asked

Related work

Related reading

Shipping open-weights llm in production?