Every open-weights model brief has a license section. This one *is* the license section, with a model attached. Llama 4 is a genuinely capable multimodal family, but in our client evaluations the technical comparison is rarely what decides the engagement. The Llama 4 Community License is. It is short, readable, and full of obligations that product teams routinely discover after the architecture is committed. Read it first; the model specs will still be there afterward.
What it actually is
Llama 4 (Meta, April 5, 2025) is Meta's first natively multimodal, Mixture-of-Experts open-weights generation. Text and image tokens are fused early into one backbone rather than bolted together with an adapter. Two models shipped with downloadable weights:
- Llama 4 Scout: 17B active parameters, 16 experts, 109B total, with a claimed 10 million token context window. Meta states it fits on a single H100 with Int4 quantization.
- Llama 4 Maverick: 17B active parameters, 128 experts, 400B total, with a 1M token context window per the model card, released in BF16 and FP8 (Meta: FP8 "fits on a single H100 DGX host").
Both are distilled from Llama 4 Behemoth (288B active, roughly 2T total), which was still training at announcement time. The instruct models cover 12 languages, with a knowledge cutoff of August 2024.
The MoE math is the part to internalize: per-token compute resembles a 17B dense model, while memory requirements follow the total parameter count (109B or 400B), because every expert must be loaded even though only a few run for any given token. Llama 4 is therefore cheap to *run* per token and expensive to *hold* in memory, whereas capacity planning built around dense models assumes the two scale together.
The license, and what commercial use really permits
The Llama 4 Community License (effective April 5, 2025) grants a royalty-free, worldwide license to use, modify, and redistribute. It is not open source in the OSI sense, and four clauses matter for client products:
1. The 700M-MAU gate. Quoting the license: "If, on the Llama 4 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee's affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta." For nearly every company on earth this clause is irrelevant; it exists to exclude Meta's direct rivals. But note the measurement date: your MAU *on the Llama 4 release date*. Crossing 700M later does not retroactively strip the license.
2. "Built with Llama" is mandatory and visible. If you distribute Llama Materials, or a product or service (including another AI model) that contains them, you must "prominently display 'Built with Llama' on a related website, user interface, blogpost, about page, or product documentation." For an internal tool nobody outside the company sees, this is trivial. For a white-label product your client resells under their own brand, it is a real conversation: the attribution requirement travels with the model, and "prominently" is Meta's word, not ours.
3. Derivative models must carry the name. If you use Llama 4, or its outputs, to build a fine-tuned or improved model that you distribute, the license requires you to "include Llama at the beginning of any such AI model name." Your fine-tune cannot ship as "AcmeMed-8B"; it ships as "Llama-AcmeMed" or it stays in-house. Purely internal models you never distribute are not caught by this.
4. Redistribution paperwork. Distributing the weights or a derivative means shipping a copy of the license agreement and the notice file text: "Llama 4 is licensed under the Llama 4 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved." Usage must also comply with the Acceptable Use Policy.
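The distribution-related clauses above can be turned into a pre-release check. The following is a minimal sketch, not legal advice: the `Release` fields and the `license_findings` function are hypothetical names of our own, the check covers only clauses 2 to 4 (not the MAU gate or the Acceptable Use Policy), and whether an attribution placement counts as "prominent" remains a human judgment.

```python
from dataclasses import dataclass

# Notice text as quoted in the license summary above.
REQUIRED_NOTICE = (
    "Llama 4 is licensed under the Llama 4 Community License, "
    "Copyright © Meta Platforms, Inc. All Rights Reserved."
)


@dataclass
class Release:
    """Hypothetical description of an artifact about to be distributed."""
    model_name: str
    is_derivative_model: bool     # fine-tuned or otherwise built from Llama 4
    notice_text: str              # contents of the notice file shipped with it
    license_copy_included: bool   # a copy of the license agreement ships too
    attribution_displayed: bool   # someone confirmed "Built with Llama" is shown


def license_findings(release: Release) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    findings = []
    if release.is_derivative_model and not release.model_name.lower().startswith("llama"):
        findings.append(f"Model name '{release.model_name}' must begin with 'Llama'.")
    if REQUIRED_NOTICE not in release.notice_text:
        findings.append("Notice file is missing the required license notice text.")
    if not release.license_copy_included:
        findings.append("A copy of the license agreement must ship with the artifact.")
    if not release.attribution_displayed:
        findings.append("'Built with Llama' attribution has not been confirmed.")
    return findings


if __name__ == "__main__":
    candidate = Release("AcmeMed-8B", True, "", False, False)
    for problem in license_findings(candidate):
        print("-", problem)
```

Running a check like this in the release pipeline does not replace the legal read; it only keeps the mechanical obligations from being forgotten after the product name has been chosen.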
The practical read for client work: for internal deployments, the license costs you a line in the docs. For customer-facing products, budget a short legal review and a product decision about the attribution badge. For products whose whole value is a proprietary fine-tuned model you distribute to customers, the naming clause can be disqualifying on brand grounds alone, and Apache 2.0 or MIT-licensed alternatives start winning the evaluation before benchmarks are even discussed.
Real deployment cost
Meta's own hardware claims set the floor: Scout on a single H100 at Int4; Maverick FP8 on a single 8x H100 DGX host. Working from the published parameter counts (weights only, before KV cache):
- Scout at BF16: 109B parameters is roughly 218GB, a multi-GPU node.
- Scout at Int4: roughly 55GB, which fits within an 80GB H100 and is how Meta's single-H100 claim works. Quantization to 4-bit is a quality tradeoff you eval, not a free lunch.
- Maverick at FP8: roughly 400GB, hence the 8x H100 host sizing.
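The weights-only figures above are simple arithmetic: total parameters times bytes per parameter. A small sketch that reproduces them follows, assuming 1 GB = 10^9 bytes and the usual 2, 1, and 0.5 bytes per parameter for BF16, FP8, and Int4; the function name is ours, and real deployments add KV cache, activations, and runtime overhead on top.

```python
# Bytes per parameter for each precision (weights only).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(total_params_billions: float, precision: str) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes).

    Uses TOTAL parameters, not the 17B active ones: every expert has to be
    resident in memory even though only a few run for each token.
    """
    return total_params_billions * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    configs = [("Scout", 109, "bf16"), ("Scout", 109, "int4"), ("Maverick", 400, "fp8")]
    for name, params_b, precision in configs:
        print(f"{name} at {precision}: ~{weight_gb(params_b, precision):.1f} GB")
```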
Two cost notes from practice. First, the 17B-active MoE design means throughput per GPU-hour is strong once the memory is provisioned; the cost cliff is the provisioning, not the serving. Second, that giant context window is not free: KV cache grows with tokens actually in context, so "10M context" workloads carry memory and latency costs that have nothing to do with the weights. Treat the 10M figure as Meta's stated architectural ceiling, and eval long-context quality on your own documents before designing a product around it.
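To see why context length is a cost driver independent of the weights, here is a rough KV-cache estimate using the generic formula for standard attention (keys plus values, per layer, per KV head). The layer count, KV-head count, and head dimension below are placeholder values chosen for illustration, not Llama 4's published configuration, and architectures that restrict attention in some layers would come in lower; substitute the real numbers from the model's config before relying on the output.

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GB for ONE sequence of `tokens` tokens.

    The factor of 2 covers keys and values; bytes_per_value=2 assumes a
    16-bit cache. Assumes every layer attends over the full context.
    """
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * tokens / 1e9


# Placeholder architecture for illustration only (an assumption, not a spec).
ASSUMED_ARCH = dict(layers=48, kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for tokens in (8_000, 128_000, 1_000_000, 10_000_000):
        print(f"{tokens:>10,} tokens: ~{kv_cache_gb(tokens, **ASSUMED_ARCH):,.1f} GB")
```

The point of the exercise is the linear growth: whatever the real per-token figure is, a hundredfold increase in context means a hundredfold increase in cache for that request, multiplied again by the number of concurrent requests.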
Latency and eval behavior that matters
- Interactive-grade latency. Unlike reasoning models, Llama 4 does not spend thousands of thinking tokens before answering. For chat, extraction, and summarization, per-token latency tracks the 17B active parameters rather than the total parameter count, provided the full weights are resident in GPU memory.
- Multimodality is native, not bolted on. Early fusion means image-plus-text prompts (screenshots, scanned forms, photos with captions) go through one model, simplifying pipelines that previously chained a vision encoder into a text LLM.
- Language coverage is thinner than rivals'. Twelve supported languages, versus the 119 languages and dialects claimed by Qwen 3. For multilingual products, check your language list first.
- Benchmark claims deserve skepticism. We are deliberately quoting no launch benchmark numbers in this brief. Vendor comparisons, every vendor's, are marketing until reproduced on your own tasks; run task-level evals before committing.
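A task-level eval does not need to be elaborate to be more informative than a launch chart. The sketch below scores any model callable against your own prompt and expected-answer pairs using exact match; `stub_model` and the sample cases are hypothetical stand-ins, and a real extraction or summarization task would usually need a looser scoring rule than exact match.

```python
from typing import Callable, List, Tuple


def exact_match_rate(model: Callable[[str], str],
                     cases: List[Tuple[str, str]]) -> float:
    """Fraction of cases where the model's answer equals the expected answer
    after stripping surrounding whitespace."""
    if not cases:
        raise ValueError("need at least one eval case")
    hits = sum(1 for prompt, expected in cases
               if model(prompt).strip() == expected.strip())
    return hits / len(cases)


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in; replace with a call to the deployed model."""
    canned = {"Invoice total?": "1,250.00", "Form type?": "intake"}
    return canned.get(prompt, "")


# Hypothetical cases; in practice these come from your own documents.
CASES = [("Invoice total?", "1,250.00"), ("Form type?", "consent")]

if __name__ == "__main__":
    print(f"exact match: {exact_match_rate(stub_model, CASES):.0%}")
```

Because the harness takes a plain callable, the same cases can be run against each candidate model, which is the comparison that actually matters for the engagement.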
When to use it, and when not
Use Llama 4 when:
- The workload is multimodal document intake (forms, IDs, invoices, screenshots) and you want one open model, self-hosted, instead of an OCR-plus-LLM chain.
- You need long-context review over large document sets inside your own VPC and have validated quality at your actual context lengths.
- Attribution is a non-issue (internal tools) and the Meta ecosystem tooling matters to your team.
Do not use it when:
- Your product distributes a fine-tuned model under your own brand; the naming clause forces "Llama" into the name.
- White-label constraints make "Built with Llama" undisplayable.
- You need small edge deployments; the family has no small dense models, which is exactly where the Qwen 3 ladder is strong.
How we would architect it for a client
The fit we see most often is document-heavy regulated platforms, the same shape as the NDIS provider-management platform we build as a long-term development partner (Vertex360): high volumes of scanned forms, compliance documents, and mixed-media evidence that cannot leave the client's environment.
The pattern:
- Scout as the multimodal intake layer in the client's sovereign cloud: single-GPU Int4 deployment handling classification, extraction, and structured summarization of image-plus-text documents, with outputs validated against schemas in code.
- License compliance as a deliverable, not an afterthought. The engagement checklist includes the attribution placement decision, the notice file in every distributed artifact, and a naming review for any fine-tuned derivative before it gets a product name.
- A fallback lane to a permissively licensed model. Because the license terms are product-shaping, we architect the model interface so Llama 4 is swappable; if the client's product strategy later collides with the attribution or naming clauses, the migration is a config change, not a rebuild. This is standard practice in our model engineering work regardless of vendor.
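The swappable-interface and schema-validation ideas in the pattern above can be sketched together. This is an illustrative outline under stated assumptions, not our production code: the backend classes are stubs that return canned JSON, the backend keys, field names, and `run_intake` function are hypothetical, and image inputs are omitted to keep the example self-contained.

```python
import json
from typing import Dict, Protocol, Type


class DocumentModel(Protocol):
    """Interface every backend must satisfy, whatever its license."""
    def extract(self, document_text: str) -> str: ...


class StubLlamaBackend:
    """Stand-in for a self-hosted Llama 4 Scout endpoint (hypothetical)."""
    def extract(self, document_text: str) -> str:
        return json.dumps({"doc_type": "consent_form", "participant_id": "P-001"})


class StubPermissiveBackend:
    """Stand-in for a permissively licensed fallback model (hypothetical)."""
    def extract(self, document_text: str) -> str:
        return json.dumps({"doc_type": "consent_form", "participant_id": "P-001"})


# The only place that knows which models exist; selection is driven by config.
BACKENDS: Dict[str, Type] = {
    "llama4-scout": StubLlamaBackend,
    "permissive-fallback": StubPermissiveBackend,
}

# Hypothetical output schema: field name -> required Python type.
REQUIRED_FIELDS = {"doc_type": str, "participant_id": str}


def validate(raw: str) -> dict:
    """Parse model output and enforce the schema in code, not in the prompt."""
    data = json.loads(raw)
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"field '{field}' missing or not {expected_type.__name__}")
    return data


def run_intake(config: dict, document_text: str) -> dict:
    """Pick the backend named in config, run extraction, validate the result."""
    backend: DocumentModel = BACKENDS[config["backend"]]()
    return validate(backend.extract(document_text))


if __name__ == "__main__":
    print(run_intake({"backend": "llama4-scout"}, "scanned form text"))
    print(run_intake({"backend": "permissive-fallback"}, "scanned form text"))
```

Since callers depend only on `run_intake` and the validated schema, changing the `backend` value is the whole migration from the application's point of view; the eval suite then decides whether the replacement model is good enough.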
Llama 4 is a good model wrapped in a license that is fine for most and fatal for some. Which one you are is a twenty-minute legal read. Do it first.
