Skip to main content
All model briefs
2026.07.03Open-Weights LLM
9 min read

Qwen 3Open-Weights LLM

Eight Apache 2.0 models from 0.6B to 235B with one chat template, and why that ladder keeps winning regulated on-prem evaluations.

Hamad Pervaiz
Hamad Pervaiz
Founder & CEO, BearPlex
Share
Reference
Parameters
0.6B / 1.7B / 4B / 8B / 14B / 32B dense + 30B-A3B / 235B-A22B MoE
Base model
-
License
Apache 2.0
Publisher
Qwen Team (Alibaba)
Paper date
2025.05.14

When a bank, a health system, or a government body evaluates open-weights models for an on-prem deployment, the shortlist process looks nothing like a leaderboard. Legal reviews the license before engineering benchmarks anything. Infrastructure asks what runs on the hardware already racked. Product asks what happens when the pilot needs to scale down to an edge device or up to a flagship. Qwen 3 is the family that keeps clearing all three gates at once, and this brief is about why.

What it actually is

Qwen 3 (Qwen Team, April 2025 release; technical report May 2025) is a family of eight open-weights models spanning three orders of magnitude, per the official announcement:

  • Six dense models: 0.6B, 1.7B, 4B, 8B, 14B, and 32B. The three smallest carry 32K context; the larger ones 128K.
  • Two Mixture-of-Experts models: Qwen3-30B-A3B (30B total, 3B activated) and the flagship Qwen3-235B-A22B (235B total, 22B activated; 128 experts with 8 active, per the model card, 32,768 context natively and 131,072 with YaRN).

Training scale roughly doubled over the prior generation: about 36 trillion tokens versus Qwen2.5's 18 trillion, with multilingual coverage expanding from 29 to 119 languages and dialects.

The signature feature is the hybrid thinking mode: one checkpoint that reasons step-by-step in <think> blocks when enable_thinking is on (or /think is issued in-conversation) and answers immediately when it is off. The technical report adds a thinking-budget mechanism, so inference-time reasoning depth becomes a dial rather than a model choice. You deploy one model and choose per-request whether to pay for reasoning, instead of running a dedicated reasoning model beside a fast one.

The license, and what commercial use really permits

Every Qwen 3 model, from the 0.6B to the 235B flagship, ships under Apache 2.0. In enterprise legal review, that one fact does more work than any benchmark:

  • It is a known quantity. Apache 2.0 has two decades of history and pre-existing approval in most corporate open-source policies. The review is a lookup, not an analysis. Compare the Llama 4 Community License, which is short but bespoke: attribution clauses, derivative naming rules, and a user-threshold gate that each need a lawyer's read.
  • It includes an explicit patent grant, which MIT does not. For patent-sensitive industries this is a real, if rarely decisive, point in Apache 2.0's favor.
  • No attribution badge, no naming rules, no usage thresholds. Fine-tune it, rename it, white-label it, embed it in a product you sell. Standard Apache 2.0 conditions apply (keep the license text and notices in redistributed source), none of which touch your product's UI or brand.
  • Uniformity across the ladder matters more than it looks. The license story does not change when you move from the 4B on an edge box to the 235B in the datacenter. One legal approval covers the entire deployment surface, now and as it grows.

Real deployment cost

The ladder is the cost story. Working from published parameter counts (weights only, before KV cache):

  • Qwen3-4B at BF16: roughly 8GB. Consumer GPUs, small cloud instances, serious edge hardware.
  • Qwen3-32B at BF16: roughly 64GB, a single 80GB-class GPU; at 4-bit quantization roughly 16GB, a single 24GB card.
  • Qwen3-30B-A3B: 30B of memory, but only 3B parameters active per token, so it serves with the per-token compute of a small model. This checkpoint is frequently the price-performance sweet spot in our evals.
  • Qwen3-235B-A22B at FP8: roughly 235GB of weights, a multi-GPU node, with 22B-active MoE economics per token.

For teams that want managed capacity before committing GPUs: as of July 2026, Together AI lists serverless Qwen3-235B-A22B at $0.60 per million input tokens and $3.60 per million output tokens, with an FP8 throughput tier at $0.20 and $0.60. We treat hosted pricing like that as the rent-before-you-buy phase of a self-hosted versus managed decision, not the end state for regulated data.

Latency and eval behavior that matters

  • Thinking mode changes the cost curve, per request. With thinking on, the model emits reasoning tokens before the answer: better on hard tasks, slower and more expensive on everything. The engineering win is that this is a request-level flag, so your gateway can route by difficulty without a second deployment.
  • Sampling settings are documented and non-optional. The model card specifies temperature 0.6, top-p 0.95 for thinking mode, temperature 0.7, top-p 0.8 for non-thinking, and warns in capitals against greedy decoding in thinking mode because of repetition loops. Bake these into the serving layer; do not leave them to application defaults.
  • Context is 32K native on the flagship, 131K with YaRN. Long-context work needs the YaRN rope-scaling configuration enabled and validated; do not assume 128K-class behavior out of the box.
  • 119 languages makes the family a default candidate for multilingual products, an area where the Llama 4 instruct models list 12 languages.
  • We quote no leaderboard numbers here. The Qwen team publishes competitive claims against frontier reasoning models; as with every vendor, the numbers that matter are the ones from evals on your task, your documents, your language mix.

When to use it, and when not

Use Qwen 3 when:

  • The deployment is on-prem or sovereign-cloud in a regulated industry and license friction must be near zero.
  • You need one model family across heterogeneous hardware: edge devices, workstation inference, and datacenter serving, with a single chat template and one legal review.
  • Your product roadmap includes distributing or white-labeling a fine-tuned model under your own brand, which Apache 2.0 permits and the Llama license complicates.
  • The workload mixes quick interactive requests and hard reasoning ones, and per-request thinking control beats operating two model fleets.

Do not use it when:

  • You need the absolute frontier of reasoning quality regardless of ownership; evaluate DeepSeek R1 and the closed APIs against it on your tasks.
  • The workload is natively multimodal document intake; the core Qwen 3 line is a text family, and Llama 4's early-fusion design or dedicated vision-language models fit better.
  • Your compliance regime requires a contractual counterparty standing behind model behavior. Apache 2.0 weights come with no warranty; that gap is filled by your engineering discipline, not the license.

How we would architect it for a client

The recurring shape is the regulated-SaaS pattern we know from building PeoplePlus, our own HR platform, where tenant data boundaries and predictable unit economics matter more than peak benchmark scores:

  1. One family, three tiers. Qwen3-4B for high-volume classification and routing, Qwen3-32B (or 30B-A3B) as the workhorse for generation and extraction, and the 235B flagship reserved for the low-volume hard tier. Because the chat template and tokenizer are shared, promotion between tiers is an eval decision, not a re-integration project.
  2. Thinking mode as a gateway policy. The routing layer sets enable_thinking per request class and caps reasoning budgets, so cost is governed centrally instead of negotiated per feature team. This is the same routing discipline described in our DeepSeek R1 brief, collapsed into one model family.
  3. Sovereign deployment first. Weights live in the client's sovereign cloud footprint with no vendor in the inference path; hosted endpoints are used only for pre-commitment evaluation, then retired.
  4. Eval harness before model choice. Our model engineering engagements start by building the client's task-level eval set, then let the ladder compete: the smallest Qwen 3 that clears the quality bar wins the tier. That procedure, more than any single launch, is why this family keeps ending up in production: it almost always has a rung that clears the bar at the lowest infrastructure cost in the room.

Apache 2.0 at every size is not a marketing detail. It is the property that lets one architecture, one legal review, and one integration serve a product from pilot to scale. That is the whole brief.

Frequently asked

Yes. Every Qwen 3 model, dense and MoE, from 0.6B to 235B, is released under Apache 2.0, which permits commercial use, modification, distribution, and white-labeling with no user thresholds, no attribution badge in your product, and no naming requirements on fine-tuned derivatives. Standard Apache 2.0 conditions apply, such as retaining license text and notices when you redistribute, and it includes an explicit patent grant that MIT-licensed alternatives lack.

Shipping open-weights llm in production?

BearPlex engineers AI systems for regulated enterprises. If you're evaluating a model like Qwen 3 for production, we'd like to talk.