
What Are Model Weights and Why Do They Matter?

  • Apr 16
  • 27 min read
[Image: 3D neural network visualization of model weights]

Every time you ask an AI model a question and it gives back a useful, coherent answer, something remarkable is happening behind the scenes. A network of billions of tiny numbers — called model weights — is firing in sequence, multiplying and summing, shaping raw text into meaning. Those numbers hold everything the model ever "learned." They are, in a very real sense, where AI intelligence lives. And right now, in 2026, who owns those numbers — and who gets to see them — is one of the most important questions in all of technology.

 

Get the AI Playbook Your Business Can Use Today, Right Here

 

TL;DR

  • Model weights are the numerical parameters inside a neural network. They encode what a model has learned from its training data.

  • Modern large language models (LLMs) contain billions to trillions of these numbers. GPT-3, released in 2020, had 175 billion. Meta's Llama 4 Behemoth has a planned 2 trillion.

  • Training adjusts weights through a mathematical process called backpropagation. The goal: minimize prediction errors.

  • "Open weights" models (like Meta's Llama and DeepSeek's V3 series) publish their weights publicly, enabling fine-tuning, auditing, and local deployment.

  • "Closed weights" models (like OpenAI's GPT series and Anthropic's Claude) keep their weights private, accessible only via API.

  • The EU AI Act, whose obligations for GPAI model providers have applied since August 2, 2025, now mandates specific disclosures tied to model weights and training compute.


What are model weights?

Model weights are the numerical parameters inside an artificial neural network. Each weight is a floating-point number that scales how strongly one neuron influences another. Weights are learned during training by repeatedly adjusting values to reduce prediction error. Together, they encode all of a model's knowledge and capabilities.

 





Table of Contents

1. Background: What Are Neural Networks?
2. What Are Model Weights, Exactly?
3. How Weights Are Learned: Training and Backpropagation
4. How Big Are Modern Model Weights?
5. Open Weights vs. Closed Weights
6. Case Studies
7. Model Weight Quantization: Shrinking Without Losing Too Much
8. Policy and Regulation: Weights as a Legal Matter
9. Pros and Cons of Open vs. Closed Weights
10. Myths vs. Facts
11. Pitfalls and Risks
12. Future Outlook
13. FAQ
14. Key Takeaways
15. Actionable Next Steps
16. Glossary
17. Sources and References

1. Background: What Are Neural Networks?


To understand model weights, you first need a basic picture of how neural networks are built.


A neural network is a computational system loosely inspired by how the brain processes information. It is made up of layers of connected units called neurons (or nodes). Each neuron receives numerical inputs, does arithmetic on them, and passes an output to the next layer.


The simplest form, called a perceptron, was described mathematically in the 1950s by Frank Rosenblatt. For decades, the field remained limited. Then, in the 2010s, two things converged: massive datasets became available, and GPUs (graphics processing units) proved powerful enough to run the math at scale. That convergence produced deep learning — neural networks with many layers, trained on huge amounts of data.


The networks used in modern AI — from image classifiers to large language models — are deep neural networks. They have an input layer (where data enters), multiple hidden layers (where computation happens), and an output layer (where results are produced). Each connection between neurons in adjacent layers carries its own weight.

 


2. What Are Model Weights, Exactly?


The Core Definition

A model weight is a single floating-point number. It controls how strongly the output of one neuron influences the input of the next neuron in the network.


Mathematically: the output of a neuron is a function of the weighted sum of its inputs. If neuron A has an output value of 0.5 and the weight on its connection to neuron B is 0.8, it contributes 0.5 × 0.8 = 0.4 to neuron B's input. That process repeats across millions — or billions — of connections.
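
To make that arithmetic concrete, here is a minimal Python sketch of a single neuron; the input values, weights, and bias are toy numbers chosen for illustration:

```python
# A single neuron: output = activation(weighted sum of inputs + bias).
# All values here are toy numbers for illustration only.
inputs  = [0.5, 0.2, 0.9]      # outputs from the previous layer
weights = [0.8, -0.3, 0.1]     # one learned weight per incoming connection
bias    = 0.05                 # learned constant added to the sum

# Weighted sum: each input scaled by its weight, as in 0.5 * 0.8 = 0.4.
z = sum(x * w for x, w in zip(inputs, weights)) + bias

# A common activation function (ReLU) turns the sum into the neuron's output.
output = max(0.0, z)
print(z, output)  # 0.48 0.48
```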


Every weight starts as a near-random number. Training systematically adjusts those numbers until the network produces correct (or at least useful) outputs.


Weights vs. Parameters: Is There a Difference?

You will often see the terms weights and parameters used interchangeably. They are related but not identical. Weights are the most important class of parameters, but networks also have biases (constants added to neuron outputs) and, in some architectures, additional learned values. In common usage, when someone says a model has "175 billion parameters," they mean 175 billion learnable numbers — mostly weights.


Weights vs. Activations

A common source of confusion: weights are not the same as activations. Activations are temporary values that flow through the network while it processes a specific input; they are computed and then discarded. Weights are persistent — they are what the model file stores, and they remain fixed during inference (unless the model is being fine-tuned).


What Weights Encode

After training, no single weight corresponds to a specific fact or concept. The knowledge is distributed: a language model's ability to complete a sentence correctly is stored across millions of weights working together, not in any one parameter. This distributed encoding is why neural networks are hard to interpret — you cannot point to one weight and say "this one stores the word 'Paris.'"

 


3. How Weights Are Learned: Training and Backpropagation


The Training Loop

Training a neural network follows a cycle:

  1. Forward pass. Feed a batch of training data into the network. The data propagates through all layers, producing a prediction.


  2. Loss calculation. Compare the prediction to the correct answer using a loss function — a mathematical formula that measures the error. Common examples: cross-entropy loss for classification tasks, mean squared error for regression.


  3. Backward pass (backpropagation). Compute how much each weight contributed to the error. This uses calculus: specifically, the chain rule to propagate error gradients backward through the network.


  4. Weight update. Adjust each weight slightly in the direction that reduces the error. The size of the step is controlled by the learning rate, a key training hyperparameter.


  5. Repeat. Do this for millions of batches until the model's predictions are accurate enough.


The mechanism driving weight updates is gradient descent. Think of the loss function as a hilly landscape. Each weight represents a position on that landscape. Gradient descent is the process of always stepping downhill — toward lower error. The gradient tells the optimizer which direction is "down" at any given point.


Modern training uses variants of gradient descent: Adam, AdamW, and SGD with momentum are the most common. These algorithms track information across updates (like a running estimate of how much each weight has moved before) to accelerate learning and avoid getting stuck in flat regions or overshooting.
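
The whole loop fits in a few lines. Here is a minimal sketch in PyTorch on a toy regression problem — the data, model size, and learning rate are illustrative, not from any real training run:

```python
import torch
import torch.nn as nn

# Toy data: targets are a known linear function of the inputs.
torch.manual_seed(0)
X = torch.randn(256, 4)
y = X @ torch.tensor([1.0, -2.0, 0.5, 3.0]).unsqueeze(1)

model = nn.Linear(4, 1)                                      # weights start near-random
loss_fn = nn.MSELoss()                                       # step 2: loss function
optimizer = torch.optim.AdamW(model.parameters(), lr=0.05)   # step 4's update rule

for step in range(200):
    pred = model(X)              # step 1: forward pass
    loss = loss_fn(pred, y)      # step 2: measure the error
    optimizer.zero_grad()
    loss.backward()              # step 3: backpropagation (chain rule)
    optimizer.step()             # step 4: nudge each weight downhill

print(model.weight.data)         # has moved toward [1.0, -2.0, 0.5, 3.0]
```

After a few hundred updates the layer's weights converge toward the coefficients that generated the data — which is exactly what "learning the weights" means at this miniature scale.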


The Role of Training Data

Weights are shaped entirely by training data. A model trained exclusively on English text will have weights that are useless for Chinese. A model trained on medical literature will learn different weight configurations than one trained on legal documents. The weights are, in a very literal sense, a compressed statistical summary of the training corpus.


Memory Requirements During Training

Training is extremely memory-intensive. The 50-layer ResNet image classifier — a relatively modest model — has approximately 26 million weight parameters and computes roughly 16 million activations on a forward pass. Using 32-bit floating-point values for each, storage already reaches 168 MB. In practice, training on GPUs with mini-batches of 32 samples pushes memory requirements to over 7.5 GB of DRAM for ResNet-50 alone, as reported by Graphcore in their analysis of deep learning memory demands (Graphcore, 2023). That is before you consider modern frontier models with hundreds of billions of parameters.

 


4. How Big Are Modern Model Weights?


The Scale of Modern Models

The explosion in model size over the last five years is one of the defining stories in AI.

| Model | Organization | Year | Parameters | Weights Status |
|---|---|---|---|---|
| GPT-3 | OpenAI | 2020 | 175 billion | Closed |
| GPT-4 | OpenAI | 2023 | Undisclosed (est. ~1.8T MoE) | Closed |
| Llama 3.1 405B | Meta | July 2024 | 405 billion | Open |
| DeepSeek-V3 | DeepSeek AI | Dec 2024 | 671 billion total, 37B active | Open |
| Llama 4 Maverick | Meta | April 2025 | 400 billion total, 17B active | Open |
| Llama 4 Behemoth | Meta | 2025 (in training) | ~2 trillion total | Open (planned) |
| DeepSeek-V3.2 | DeepSeek AI | 2025 | 671 billion (MoE) | Open (MIT license) |

Sources: Meta AI blog (2024–2025), DeepSeek Hugging Face repository (2024–2025), OpenAI technical reports.


What "Billion Parameters" Actually Means in Storage

A single 32-bit floating-point number takes 4 bytes of storage. A model with 7 billion parameters stored in 32-bit precision therefore occupies 7 × 4 = 28 gigabytes of disk space — before any compression. Most modern models are distributed in 16-bit (fp16 or bfloat16) precision, halving that to 14 GB for a 7B model.
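
This back-of-envelope arithmetic is easy to script. The sketch below reproduces the figures above (and the 810 GB figure mentioned next):

```python
# Weight storage: parameters x bytes-per-parameter, in decimal gigabytes.
BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_size_gb(params: float, precision: str) -> float:
    return params * BYTES_PER_WEIGHT[precision] / 1e9

print(weight_size_gb(7e9,   "fp32"))  # 28.0  GB
print(weight_size_gb(7e9,   "fp16"))  # 14.0  GB
print(weight_size_gb(405e9, "fp16"))  # 810.0 GB
```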


Apple's Core ML documentation confirms that quantizing from 32-bit to 16-bit precision provides up to a 2x reduction in storage and generally does not affect model accuracy in most use cases (Apple Developer Documentation, 2024). Further quantization to 8-bit (int8) halves the size again, though with some accuracy tradeoff.


For the 405-billion-parameter Llama 3.1 release from Meta in July 2024: stored in fp16, the weights file runs to approximately 810 GB. Running that model requires at minimum a server-class node with multiple high-end GPUs.


Mixture of Experts: Big Models That Don't Always Think Big

One architectural innovation has made enormous models more practical: Mixture of Experts (MoE). Instead of activating every parameter on every inference call, an MoE model routes each input token through only a fraction of its "expert" sub-networks.


DeepSeek-V3 is a strong MoE language model with 671 billion total parameters but only 37 billion activated for each token. This means inference costs are closer to those of a 37B dense model, while the total capacity of 671B is available across different inputs.


Meta's Llama 4 Maverick contains 17 billion active parameters, 128 experts, and 400 billion total parameters, offering high quality at a lower price compared to Llama 3.3 70B.
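
A toy sketch of the routing idea follows, with made-up dimensions and randomly initialized "experts" — in a real model both the router and the experts are learned, and the expert count and hidden size are far larger:

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, d_model, k = 8, 16, 2                # toy sizes; k experts run per token

token = rng.standard_normal(d_model)              # one token's hidden state
router_w = rng.standard_normal((num_experts, d_model))           # router weights
expert_w = rng.standard_normal((num_experts, d_model, d_model))  # expert weights

scores = router_w @ token                         # router: one score per expert
top_k = np.argsort(scores)[-k:]                   # pick the k highest-scoring experts
gates = np.exp(scores[top_k]) / np.exp(scores[top_k]).sum()  # softmax over winners

# Only k of the 8 experts' weight matrices are touched for this token.
output = sum(g * (expert_w[e] @ token) for g, e in zip(gates, top_k))
print(top_k, output.shape)
```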

 


5. Open Weights vs. Closed Weights


What "Open Weights" Means

When a model's weights are released publicly, anyone can download the file, load it into a compatible framework, run inference locally, and fine-tune the model on new data. The weights are just files — typically in formats like .safetensors or PyTorch's .pt / .bin formats — hosted on platforms like Hugging Face.


Open weights do not necessarily mean the model is fully open source. The training code, training data, and other components may remain private. But publishing the weights alone gives developers enormous flexibility.
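
In practice, that flexibility looks like the sketch below, which uses the Hugging Face transformers library. It assumes transformers and accelerate are installed, that you have accepted Meta's license for the gated Llama repository, and that you have authenticated with `huggingface-cli login`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # license-gated open-weight model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # 16-bit weights: ~16 GB instead of ~32 GB
    device_map="auto",            # spread weights across available GPUs/CPU
)

# The weights are now local files on disk and tensors in memory — run inference.
inputs = tokenizer("Model weights are", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```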


What "Closed Weights" Means

Closed weights models are accessible only via an API. You send a request to the company's servers, they run inference on their hardware, and you receive the output. You never have direct access to the parameters themselves.


OpenAI's GPT series, Anthropic's Claude, and Google's Gemini Ultra are the most prominent examples of closed-weight frontier models as of 2026. Companies keeping weights proprietary typically cite safety concerns (preventing misuse of highly capable models), competitive advantage, and the enormous investment made in training.


The Debate

The open vs. closed weights debate has intensified dramatically in 2024–2026. The core tension: open weights democratize access and accelerate research, but may also put powerful capabilities in the hands of malicious actors. Closed weights allow companies to maintain safety controls but concentrate power in a small number of organizations.


Meta CEO Mark Zuckerberg has argued publicly (July 2024) that open source AI is the path forward, citing innovation, security through scrutiny, and the historical benefits of open source software. Critics of open weights — including some AI safety researchers — argue that frontier models with open weights cannot be "recalled" if safety issues are discovered post-release.

 


6. Case Studies


Case Study 1: Meta's Llama and the Open Weights Ecosystem

What happened: Meta released the first Llama model in February 2023 — initially on a restricted, research-only basis. Weights were shared case-by-case. Within days, the weights leaked on 4chan and spread via BitTorrent, illustrating how difficult it is to control weight distribution once released.


Meta responded by progressively opening access. Subsequent versions of Llama were made accessible outside academia and released under licenses that permitted some commercial use.


On July 23, 2024, Meta released Llama 3.1 in 8B, 70B, and 405B parameter variants. The 405B model was trained using 16,000 Nvidia H100 GPUs. At release it was the largest open-weight frontier language model ever made public, and it marked the first time a frontier-class LLM was available for anyone to download, inspect, and build upon.


Outcome: Thousands of fine-tuned variants of Llama models appeared on Hugging Face within weeks of each release. Companies built commercial products on Llama. Researchers published papers using the weights for interpretability and safety analysis. The open release validated the argument that publishing weights does not automatically destroy competitive advantage — Meta continued to attract top AI talent and grow its AI assistant products.


Then came Llama 4: Released on April 5, 2025, Llama 4 Scout has 17 billion active parameters and 109 billion total parameters, while Llama 4 Maverick has 17 billion active parameters and 400 billion total parameters — both available for download.


Case Study 2: DeepSeek-V3 — Open Weights at Frontier Performance

What happened: In December 2024, Chinese AI lab DeepSeek released DeepSeek-V3, a 671-billion-parameter MoE model. The weights were published on Hugging Face. The technical report was simultaneously published on arXiv.


What stunned the industry was the cost: DeepSeek completed the pre-training of DeepSeek-V3 on 14.8 trillion tokens at an economical cost of only 2.664 million H800 GPU hours. At market rates of approximately $2/hour, that translates to roughly $5.3 million in compute — an order of magnitude less than comparable U.S. frontier models.


DeepSeek-V3.2, released in 2025 under an MIT license with complete model weights published on Hugging Face, scored 96.0% on the 2025 American Invitational Mathematics Examination (AIME), surpassing GPT-5 High's 94.6%.


Outcome: DeepSeek's releases triggered a major reassessment of the economics of frontier AI. If open-weight, frontier-class models could be trained for a few million dollars, the argument that safety requires closed weights — which often implicitly assumed that only well-funded, safety-conscious labs could train frontier models — needed rethinking. The releases also demonstrated that model weights developed outside the U.S. could match or exceed the best proprietary models in benchmark performance.


Case Study 3: The EU AI Act and Weights as a Regulatory Object

What happened: The EU AI Act (Regulation EU 2024/1689) entered into force on August 1, 2024. Its provisions governing General-Purpose AI (GPAI) models became applicable on August 2, 2025.


The AI Act requires that for open-weight GPAI models, the model's parameters (including weights), architecture information, and usage information must be made publicly available in a format that enables access, use, and modification.


Any model trained using more than 10²³ FLOPs qualifies as a GPAI model subject to the Act's obligations. Any model trained using more than 10²⁵ FLOPs is presumed to have high-impact capabilities and may be classified as systemic-risk GPAI.
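
To see where a real model lands against these thresholds, a common heuristic estimates training compute as 6 × N × D, where N is the number of parameters updated per token and D is the number of training tokens. This approximation is an assumption for illustration, not the AI Act's own measurement method:

```python
# Rough GPAI-threshold check using the common "6 * N * D" approximation.
def approx_training_flops(active_params: float, tokens: float) -> float:
    return 6 * active_params * tokens

# DeepSeek-V3: ~37B active parameters per token, 14.8T training tokens.
flops = approx_training_flops(37e9, 14.8e12)
print(f"{flops:.2e}")   # ~3.29e+24
print(flops > 1e23)     # True  -> qualifies as a GPAI model
print(flops > 1e25)     # False -> below the systemic-risk presumption
```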


The Act's enforcement powers for GPAI model providers — including fines — are set to apply from August 2, 2026.


Outcome: This is the first time a major jurisdiction has written model weights explicitly into law. It established that weights are not just technical artifacts but legally significant objects. GPAI providers must now document, disclose, and in some cases notify regulators about the parameters underlying their models.

 


7. Model Weight Quantization: Shrinking Without Losing Too Much


What Is Quantization?

Quantization is the process of reducing the numerical precision of weights to save storage and computation. Instead of storing each weight as a 32-bit float (fp32), you represent it as a 16-bit float (fp16 or bfloat16), an 8-bit integer (int8), or even a 4-bit integer (int4).


By default, models are produced with weights in floating-point 32-bit (fp32) precision. Weights can be quantized to 16 bits, 8 bits, 7 bits, and so on down to 1 bit. Quantizing from fp32 to fp16 provides up to a 2x savings in storage and generally does not affect the model's accuracy.
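
The simplest variant is symmetric per-tensor int8 quantization: map the floats onto integers in [-127, 127] with a single scale factor, then reconstruct. A sketch with a toy weight tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(8).astype(np.float32)    # toy weight tensor

scale = np.abs(w).max() / 127.0                  # one scale for the whole tensor
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # 1 byte per weight
w_hat = q.astype(np.float32) * scale             # dequantize for use in matmuls

print(np.abs(w - w_hat).max())                   # small reconstruction error
```

Production schemes quantize per channel or per block rather than per tensor, which is what keeps the accuracy loss small, but the core idea is the same trade of precision for bytes.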


Practical Consequences of Quantization

| Precision | Bits per weight | 7B model size | 70B model size | Typical accuracy impact |
|---|---|---|---|---|
| fp32 | 32 | ~28 GB | ~280 GB | Baseline |
| fp16 / bf16 | 16 | ~14 GB | ~140 GB | Minimal |
| int8 | 8 | ~7 GB | ~70 GB | Small |
| int4 | 4 | ~3.5 GB | ~35 GB | Moderate |

At int4 quantization, a 7B model fits comfortably in a consumer GPU with 6–8 GB VRAM. This has enabled millions of people to run capable models on their own hardware — something impossible just three years ago.


Meta's Llama 4 Scout fits on a single H100 GPU with int4 quantization.


Emerging Research: Neural Weight Compression

Beyond standard quantization, researchers are now applying neural networks to compress other neural networks' weights. A recent arXiv paper (arXiv:2510.11234) trained encoder-decoder networks specifically on weight tensors from Llama 3-8B, learning a compression codec that outperforms standard quantization at higher bitrates. By learning directly from data, the resulting neural codec captures the distributional structure of language model weights without relying on rigid, handcrafted components such as the random Hadamard transform. This is an early but promising direction — using AI to compress AI.

 


8. Policy and Regulation: Weights as a Legal Matter


The EU AI Act's Explicit Treatment of Weights

The EU AI Act is the most detailed legal treatment of model weights to date. It defines GPAI models by their training compute, mandates transparency about parameters, and draws a line between open-weight models and those carrying systemic risk.


On July 18, 2025, the European Commission published its guidelines on the scope of obligations for providers of GPAI models under the AI Act. These guidelines clarify:

  • A model is a GPAI model if it was trained with more than 10²³ FLOPs and can generate language, images, or video.

  • A model is a systemic-risk GPAI if trained with more than 10²⁵ FLOPs.

  • Open-source GPAI providers benefit from limited exemptions under the AI Act — including no obligation to provide technical documentation to downstream providers — but must still comply with training data summary and copyright policy requirements.


The AI Act entered into force on August 1, 2024, and will be fully applicable on August 2, 2026. The governance rules and obligations for GPAI models became applicable on August 2, 2025.


The U.S. Policy Landscape

As of 2026, the United States does not have a federal law governing model weights with the specificity of the EU AI Act. The Biden administration's October 2023 Executive Order on AI required frontier model developers to share safety evaluation results with the government, but did not mandate disclosure of weights. The Trump administration's subsequent AI policy focus shifted toward removing barriers to AI development rather than imposing new disclosure requirements.


Export controls, however, have become a de facto form of weight regulation. U.S. export controls on semiconductor chips (particularly NVIDIA H100 and A100 GPUs) have slowed the ability of certain countries to train frontier models from scratch. These controls have not, however, prevented access to open-weight models already released.


Copyright and Ownership of Weights

A pressing legal question: who owns model weights? Weights encode statistical patterns learned from training data, which in many cases includes copyrighted text, code, and images. As of 2026, no major jurisdiction has definitively settled whether model weights are themselves copyrightable, whether training on copyrighted data without license constitutes infringement, or whether derivative models trained on open-weight models carry forward licensing obligations.


The EU AI Act requires GPAI providers to have a policy for EU copyright compliance, but what that compliance looks like in practice — especially for open-weight models that others fine-tune — remains contested ground in active litigation and regulation.

 


9. Pros and Cons of Open vs. Closed Weights


Open Weights: Pros

  • Transparency. Researchers can study, audit, and interpret the model. This is essential for safety and bias research.

  • Fine-tuning. Developers can adapt the model to specific domains (medical, legal, code) without paying per-API-call inference fees.

  • Local deployment. Organizations with strict data privacy requirements can run the model on their own infrastructure. No data leaves their environment.

  • Lower cost. Open-weight models eliminate API fees for high-volume inference.

  • Ecosystem effects. Open releases produce community-driven improvements, benchmarks, and tooling. The Llama ecosystem is a textbook example.

  • Competition. Open-weight models from DeepSeek have demonstrably reduced API pricing across the industry.


Open Weights: Cons

  • Irreversible. Once released, weights cannot be recalled. If a safety flaw is discovered, it cannot be patched in users' local copies.

  • Misuse risk. Bad actors can fine-tune models to remove safety training (so-called "jailbroken" or "uncensored" variants). This has already happened with every major open-weight release.

  • Compute requirements. Frontier open-weight models still require expensive hardware. A 405B parameter model needs a multi-GPU server node — out of reach for most individuals.

  • Support burden. Model developers receive no revenue from local deployments and bear reputational risk from misuse.


Closed Weights: Pros

  • Safety controls. The provider can update, retrain, or restrict the model at any time. Safety issues can be addressed centrally.

  • Monetization. API pricing funds continued research and development.

  • No hardware burden on the user. Inference happens on the provider's infrastructure.


Closed Weights: Cons

  • Black box. Users and regulators cannot verify what the model is doing or why.

  • Lock-in. Customers depend on the provider's continued operation and pricing.

  • No fine-tuning for sensitive domains. Users cannot train the model on proprietary data without sending that data to the provider's servers.

  • Cost at scale. High-volume inference via closed API can become very expensive compared to self-hosted open models.

 


10. Myths vs. Facts


Myth 1: "More Parameters Always Means Better Performance"

Fact: Parameter count is one factor, not the only one. Architectural choices, training data quality, and post-training alignment all matter enormously. Llama 4 Maverick, with 17 billion active parameters (and 400B total), outperforms models with far higher active parameter counts on multiple benchmarks. DeepSeek-V3 matched GPT-4-class performance with only 37B active parameters per token. Efficient architectures like MoE have largely decoupled "total parameters" from "compute per inference."


Myth 2: "Open Weights = Open Source"

Fact: Open source, as defined by the Open Source Initiative (OSI), requires more than published weights. The OSI's Open Source AI Definition, published in October 2024, requires that an open-source AI system be released with its code, its parameters, and sufficiently detailed information about its training data — details Meta does not disclose for Llama — along with the freedom to modify and redistribute. Most "open weights" releases, including Llama and DeepSeek, publish weights and often code, but not training data. The term "open weights" is more accurate than "open source" for most current releases.


Myth 3: "Weights Contain Explicit Knowledge Like a Database"

Fact: Weights do not store information like database records. Knowledge is distributed across millions of parameters in a highly compressed, non-human-readable form. You cannot search weights for a specific fact the way you can query a database. When a model "knows" something, that knowledge is an emergent property of how the weights interact during a forward pass.


Myth 4: "Quantized Models Are Always Significantly Worse"

Fact: For many practical tasks, 8-bit and even 4-bit quantized models perform comparably to their fp16 counterparts. Quantizing from fp32 to fp16 provides up to 2x storage savings and generally does not affect model accuracy. 4-bit quantization introduces more visible degradation, particularly on complex reasoning tasks and rare knowledge retrieval, but for everyday applications the practical difference is often minimal.


Myth 5: "Closed Weights Are Always Safer"

Fact: Closed weights prevent casual misuse but do not eliminate risks. Closed models are accessed via APIs and can still produce harmful outputs. They can be probed through carefully crafted inputs ("jailbreaks") and can fail in ways users cannot predict or audit. Closed weights also concentrate power — a scenario that itself carries systemic risks if a small number of organizations control access to frontier AI.

 


11. Pitfalls and Risks

Weight poisoning. During fine-tuning, malicious training data can embed hidden behaviors into model weights — a form of attack called a backdoor. A model with poisoned weights might behave normally until it encounters a specific trigger in the input, then produce targeted harmful outputs. This is an active area of AI security research.


Catastrophic forgetting. When fine-tuning a pre-trained model on new data, the new training can overwrite existing weight configurations, causing the model to "forget" previously learned capabilities. This is called catastrophic forgetting, and avoiding it requires specific techniques such as regularization or parameter-efficient fine-tuning methods like LoRA.


Weight theft. Even for closed-weight models, weights can theoretically be extracted through repeated API queries — a process called model extraction or model stealing. An attacker sends inputs designed to map the model's behavior and uses those responses to train a local approximation. This remains an active area of adversarial ML research.


Regulatory non-compliance. As the EU AI Act's enforcement powers for GPAI providers come into effect in August 2026, companies that have deployed GPAI models without adequate documentation of their weights and training compute face potential fines. The AI Act allows fines of up to 3% of global annual turnover for non-compliance with GPAI obligations.


Weight size and infrastructure mismatch. Organizations enthusiastically downloading large open-weight models sometimes discover that their infrastructure cannot support them. A 405B parameter model in fp16 requires roughly 810 GB of GPU VRAM — more than ten Nvidia H100 80GB cards just for the weights, before accounting for activation memory during inference.

 


12. Future Outlook


Weights Are Getting Larger — and Smaller at the Same Time

The frontier is pushing toward multi-trillion parameter models. Meta's planned Llama 4 Behemoth is projected at approximately 2 trillion total parameters. At the same time, smaller models are getting dramatically better. Techniques like distillation — training a small model to mimic a large one — and neural weight compression mean a 7B parameter model in 2026 can outperform a 70B model from 2023.


Researchers are now training compression networks on datasets of pretrained weight tensors. Approaches like NWC (Neural Weight Compression) consistently outperform standard quantization baselines, and the advantage grows at higher bitrates.


Analogue Hardware for Weights

Digital processors (GPUs and CPUs) are not the only substrate for weights. Analogue memory devices — chips where a physical property like electrical conductance encodes a weight — offer significant energy efficiency advantages. A 2022 study in Nature Communications reported a generalized computational framework for translating software-trained weights into analogue hardware weights while minimizing inference accuracy degradation (Nature Communications, 2022). As of 2026, analogue AI accelerators remain pre-commercial but represent a plausible direction for ultra-efficient inference in future edge devices.


The Policy Trajectory

The EU AI Act's Commission enforcement powers for GPAI model providers enter into application on August 2, 2026. This means regulators will soon be able to impose fines on GPAI providers who fail to meet weight and compute disclosure requirements. The practical effect: every organization training or distributing a frontier AI model for use in the EU now needs a formal process for documenting their parameters.


U.S. federal AI policy remains fragmented heading into 2026, but export controls on AI chips — which function as indirect weight regulation — are expected to tighten further. The question of whether the U.S. will adopt something analogous to the EU's compute threshold framework is one of the defining policy questions of the current moment.


Hardware Memory as the Binding Constraint

SK Hynix completed development of the world's first HBM4 memory and readied it for mass production in September 2025. High-Bandwidth Memory (HBM) is the physical substrate on which model weights reside during inference on large AI accelerators. The development of HBM4 and the forthcoming LPDDR6 standard for mobile devices directly determines how large a model can run on a given chip. The memory wall — the gap between how fast a processor can compute and how fast it can load weights from memory — remains the primary engineering constraint on inference speed for large models.

 


13. FAQ


Q: What is the difference between model weights and model architecture?

The architecture is the blueprint — it defines how many layers exist, how they are connected, and what operations each layer performs. Weights are the learned numbers that fill in the blueprint. Two identical architectures trained on different data will have completely different weights and very different capabilities.


Q: Can you update model weights after deployment without retraining from scratch?

Yes. Fine-tuning involves continuing to train an already-trained model on new data, adjusting existing weights. Parameter-efficient methods like LoRA (Low-Rank Adaptation) update only a small subset of weights or add small matrices alongside existing weights, drastically reducing compute requirements. This allows domain adaptation for a fraction of the cost of full training.
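
A conceptual sketch of the LoRA idea, with toy dimensions: the pretrained matrix W stays frozen, and only the two small matrices A and B are trained (the usual scaling factor is omitted for brevity):

```python
import numpy as np

d, r = 1024, 8                           # model dim, LoRA rank (r << d)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # pretrained weights (frozen)
A = rng.standard_normal((r, d)) * 0.01   # trainable, r x d, small random init
B = np.zeros((d, r))                     # trainable, d x r, starts at zero

x = rng.standard_normal(d)
h = W @ x + B @ (A @ x)                  # adapted layer output: W*x + (B*A)*x

# Trainable parameters: 2*d*r instead of d*d.
print(2 * d * r, d * d)                  # 16384 vs 1048576
```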


Q: Why don't AI companies just publish their weights? Isn't transparency good?

The decision is more complex than transparency alone. Safety concerns are real — published weights can be fine-tuned to remove safety training. Competitive advantage is real — weights represent billions of dollars of R&D investment. Regulatory risk is real — publishing weights that might enable harmful use creates liability. Companies weigh these factors differently; Meta and DeepSeek have concluded openness serves their interests, while OpenAI and Anthropic have not.


Q: What happens to weights when a model is fine-tuned?

Fine-tuning adjusts weights in the direction that minimizes error on the new training data. The original weights serve as the starting point. With full fine-tuning, all weights are updated. With LoRA or similar methods, the original weights are frozen and only small additional matrices are learned. The risk of full fine-tuning is catastrophic forgetting of previously learned capabilities.


Q: How do model weights relate to AI hallucination?

Hallucination — when a model generates plausible-sounding but false information — emerges from how weights encode statistical patterns. The model generates text by predicting what tokens are statistically likely given the context. If the training data contains errors, or if a pattern is underrepresented, the weights may encode incorrect associations. Weights do not store explicit factual records to check against; they produce outputs based on learned distributions.


Q: Can model weights be stolen or extracted?

Technically, yes — through model extraction attacks where an adversary repeatedly queries a model's API and trains a local approximation using the responses. In practice, effective extraction of a large frontier model is extremely difficult and expensive, requiring millions of queries. It is an active area of security research and a reason some companies rate-limit their APIs aggressively.


Q: What is weight decay in training?

Weight decay is a regularization technique that adds a penalty to the loss function for large weight values. It pushes weights toward zero, preventing any single weight from becoming dominant. This helps the model generalize to new data rather than overfitting to training examples.
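
The decoupled form of this penalty is what the AdamW optimizer implements. A tiny PyTorch demonstration: with a zero gradient, the only thing moving the weight is the decay pulling it toward zero.

```python
import torch

w = torch.nn.Parameter(torch.tensor([5.0]))
opt = torch.optim.AdamW([w], lr=0.1, weight_decay=0.01)

w.grad = torch.tensor([0.0])   # zero gradient: only the decay term acts
opt.step()
print(w.data)                  # 4.995, i.e. 5.0 * (1 - lr * weight_decay)
```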


Q: How do I choose which quantization level to use for a local model?

Start with bf16 if your hardware supports it — minimal accuracy loss, half the memory of fp32. Move to int8 if you need the model to fit in less VRAM, with small accuracy tradeoffs acceptable for most tasks. Use int4 (GGUF format via llama.cpp, or GPTQ/AWQ) if you need the model to run on consumer hardware; expect some degradation on complex reasoning. Always benchmark on your specific task rather than relying on general claims.


Q: What is the EU AI Act's threshold for "systemic risk" GPAI models?

Any model trained using more than 10²⁵ FLOPs is presumed to have high-impact capabilities and may be classified as systemic-risk GPAI under the EU AI Act. Providers of such models must notify the European Commission within two weeks of reaching this threshold and must conduct mandatory model evaluations and risk assessments.


Q: What does "parameter count" actually tell you about a model's capabilities?

Parameter count is an imperfect proxy. More parameters generally mean more capacity to store and apply learned patterns. But the relationship is not linear, and architectural efficiency matters enormously. A 17B active-parameter MoE model like Llama 4 Maverick can outperform dense 70B models on many benchmarks. Parameter count should be considered alongside training data quality, training duration, post-training alignment, and the specific task being evaluated.


Q: Are weights the same thing as training data?

No. Training data is the input used to adjust weights during training — text, images, code, or other content. Weights are the output of that training process: the numerical parameters that encode what the model learned. Training data and weights are distinct objects. Some open-weight models share their weights but not their training data; others share neither; a very small number share both.


Q: What does "FP8 training" mean and why does it matter?

FP8 (8-bit floating point) is a lower-precision format for computing during training. Meta trained Llama 4 Behemoth using FP8 precision on 32,000 GPUs, achieving 390 TFLOPs/GPU. Training in FP8 reduces memory usage and increases throughput compared to FP16 or FP32, enabling larger models to be trained on a given set of hardware. It requires careful handling to avoid numerical instability.


Q: Why do larger models require so much memory to run?

Memory in neural networks is required to store input data, weight parameters, and activations as an input propagates through the network. In training, activations from a forward pass must be retained until they can be used to calculate error gradients in the backward pass. During inference (not training), the primary requirement is loading all the weights into GPU memory before processing begins — which for a 70B fp16 model means roughly 140 GB of VRAM.

 


14. Key Takeaways

  • Model weights are the numerical parameters inside a neural network. They are learned through training and represent all the model's knowledge.


  • Modern frontier models contain hundreds of billions to trillions of parameters. Storage requirements range from gigabytes for small quantized models to terabytes for full-precision frontier models.


  • Weights are adjusted through backpropagation and gradient descent — a process that can take millions of GPU-hours and tens of millions of dollars for frontier models.


  • The open vs. closed weights distinction is one of the most consequential debates in AI. Open weights enable transparency, fine-tuning, and local deployment; closed weights enable centralized safety controls and commercial advantage.


  • DeepSeek's 2024–2025 releases demonstrated that frontier-class open-weight models can be trained at dramatically lower cost than previously assumed, reshaping competitive dynamics across the industry.


  • Quantization reduces weight precision to shrink model size. Bf16 and int8 have minimal accuracy impact; int4 enables consumer hardware deployment with moderate quality tradeoffs.


  • The EU AI Act now treats model weights as legally significant objects. Its enforcement powers for GPAI model providers take full effect in August 2026.


  • Mixture of Experts (MoE) architectures decouple total parameter count from per-inference compute cost, making enormous models practically deployable.


  • Safety risks specific to weights include backdoor attacks (weight poisoning), catastrophic forgetting during fine-tuning, and model extraction attacks via API queries.


  • Hardware memory — specifically GPU VRAM and HBM bandwidth — remains the primary physical constraint on model weight deployment.

 


15. Actionable Next Steps

  1. If you are exploring AI for the first time: Download a quantized open-weight model (e.g., Llama 3.1 8B in GGUF format via Ollama or LM Studio) and run it locally. This gives you direct experience with what weights produce — without cloud costs or data privacy concerns.


  2. If you are a developer building AI applications: Evaluate open-weight models on Hugging Face for your specific use case before committing to a closed API. For many applications, a quantized 7B or 13B model on your own infrastructure will be cheaper and faster than per-token API pricing at scale.


  3. If you are fine-tuning models: Use LoRA or QLoRA for parameter-efficient fine-tuning rather than full fine-tuning. This dramatically reduces compute requirements and helps avoid catastrophic forgetting.


  4. If you operate in the EU: Audit whether your AI products incorporate GPAI models subject to the EU AI Act. If so, check whether your provider complies with documentation and weight disclosure requirements, and whether your own modifications to open-weight models trigger provider obligations under the Act.


  5. If you are an AI researcher: Review the EU AI Act's GPAI Code of Practice (published 2025) and assess whether your institution's model releases comply with training data summary and copyright policy requirements.


  6. If you are a business evaluating AI risk: Ask any AI vendor whether their model weights are proprietary or open, whether you can run inference locally, and what happens to your data during inference. These questions directly affect your data privacy posture.


  7. Monitor the August 2026 enforcement date for the EU AI Act's GPAI provisions. Any organization deploying frontier AI in the EU should have compliance processes in place before that date.

 


16. Glossary

  1. Activation: A temporary numerical value computed by a neuron during a forward pass. Unlike weights, activations are not stored permanently — they are computed fresh each time the model processes an input.

  2. Backpropagation: The algorithm used to compute how much each weight contributed to a model's prediction error. It applies the chain rule of calculus to propagate gradients from the output layer back to the input layer.

  3. Bias (in neural networks): A constant numerical value added to a neuron's weighted input before activation. Biases are parameters, like weights, and are learned during training.

  4. FLOP / FLOPs: Floating Point Operation(s). A measure of computational effort. Training a modern frontier LLM requires 10²³ to 10²⁶ FLOPs. The EU AI Act uses FLOPs as the key metric for classifying GPAI models.

  5. Fine-tuning: Continuing to train a pre-trained model on new, usually smaller datasets to adapt it to a specific task or domain. Fine-tuning updates the model's weights.

  6. Gradient descent: An optimization algorithm that iteratively adjusts weights in the direction that reduces the loss function. The "gradient" is a vector indicating which direction increases the loss fastest; descent goes the opposite way.

  7. GPAI (General-Purpose AI) model: The EU AI Act's term for an AI model capable of performing a wide range of tasks — what many informally call a "foundation model." LLMs like GPT, Claude, and Llama are GPAI models.

  8. HBM (High-Bandwidth Memory): The specialized memory used in AI accelerators (like Nvidia H100 GPUs) to store and rapidly access model weights during inference. HBM bandwidth directly determines how fast a model can process inputs.

  9. Inference: The process of running a trained model on new inputs to generate predictions or outputs. During inference, weights are fixed (not updated).

  10. LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning technique that freezes the original model weights and trains small additional matrices. Dramatically reduces compute and memory requirements for fine-tuning.

  11. Loss function: A mathematical formula that measures the difference between a model's prediction and the correct answer. Training aims to minimize the loss function.

  12. MoE (Mixture of Experts): An architecture in which a model has multiple specialist sub-networks ("experts") and routes each input through only a subset of them. Allows very large total parameter counts while keeping per-token compute cost manageable.

  13. Parameter: A learnable numerical value in a neural network. Weights and biases are the primary types of parameters.

  14. Quantization: Reducing the numerical precision of weights (e.g., from 32-bit float to 8-bit integer) to decrease storage size and improve inference speed, at some potential cost to accuracy.

  15. Safetensors: A file format developed by Hugging Face for storing model weights. Designed for safety (cannot execute arbitrary code on load) and efficiency. Now widely used for distributing open-weight models.

  16. Systemic-risk GPAI: An EU AI Act classification for GPAI models trained using more than 10²⁵ FLOPs. Subject to additional obligations including mandatory risk assessment, model evaluation, and notification to the European Commission.

  17. Weight poisoning (backdoor attack): A security attack in which malicious data is injected into the training process to embed hidden behaviors in model weights. The model appears normal until a specific "trigger" input activates the hidden behavior.

 


17. Sources and References

  1. Brown, T., et al. (2020-05-28). Language Models are Few-Shot Learners (GPT-3 paper). OpenAI / arXiv. https://arxiv.org/abs/2005.14165

  2. Meta AI. (2024-07-23). Llama 3.1: The Most Capable Openly Available LLM to Date. https://ai.meta.com/blog/meta-llama-3-1/

  3. Meta AI. (2025-04-05). Llama 4: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

  4. DeepSeek AI. (2024-12). DeepSeek-V3 Technical Report. arXiv:2412.19437. https://huggingface.co/deepseek-ai/DeepSeek-V3

  5. DeepSeek AI. (2025-01). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948. https://huggingface.co/deepseek-ai/DeepSeek-R1

  6. European Commission. (2024-08-01). EU AI Act — Regulation (EU) 2024/1689. Official Journal of the European Union. https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai

  7. European Commission. (2025-07-18). Guidelines on the Scope of Obligations for Providers of GPAI Models. https://digital-strategy.ec.europa.eu/en/policies/guidelines-gpai-providers

  8. Steptoe. (2025-08). EU AI Act Obligations for GPAI Models Now Applicable. https://www.steptoe.com/en/news-publications/steptechtoe-blog/eu-ai-act-obligations-for-gpai-models-now-applicable.html

  9. WilmerHale. (2025-07-24). European Commission Issues Guidelines for Providers of General-Purpose AI Models. https://www.wilmerhale.com/en/insights/blogs/wilmerhale-privacy-and-cybersecurity-law/20250724-european-commission-issues-guidelines-for-providers-of-general-purpose-ai-models

  10. EU Artificial Intelligence Act (informational site). (2025). Overview of Guidelines for GPAI Models. https://artificialintelligenceact.eu/gpai-guidelines-overview/

  11. Graphcore. (2023). Why Is So Much Memory Needed for Deep Neural Networks? https://www.graphcore.ai/posts/why-is-so-much-memory-needed-for-deep-neural-networks

  12. Apple Developer Documentation. (2024). Compressing Neural Network Weights — Guide to Core ML Tools. https://apple.github.io/coremltools/docs-guides/source/quantization-neural-network.html

  13. Mackin, C., et al. (2022-08-20). Optimised weight programming for analogue memory-based deep neural networks. Nature Communications. https://www.nature.com/articles/s41467-022-31405-1

  14. TechCrunch / Wiggers, K. (2024-07-23). Meta releases its biggest 'open' AI model yet. https://techcrunch.com/2024/07/23/meta-releases-its-biggest-open-ai-model-yet/

  15. TechCrunch / Wiggers, K. (2024-04-18). Meta releases Llama 3, claims it's among the best open models available. https://techcrunch.com/2024/04/18/meta-releases-llama-3-claims-its-among-the-best-open-models-available/

  16. Wikipedia. (2025). Llama (language model). https://en.wikipedia.org/wiki/Llama_(language_model)

  17. Introl Blog. (2025-12-02). DeepSeek-V3.2 Matches GPT-5 at 10x Lower Cost. https://introl.com/blog/deepseek-v3-2-open-source-ai-cost-advantage

  18. Medium / Barnwal, R. (2025-08-20). DeepSeek V3.1: The Open-Source Giant Challenging Proprietary AI. https://rajeevbarnwal.medium.com/deepseek-v3-1-the-open-source-giant-challenging-proprietary-ai-with-smarter-context-and-reasoning-7b409f9a99de

  19. arXiv. (2026-01-29). Neural Weight Compression for Language Models. arXiv:2510.11234. https://arxiv.org/html/2510.11234

  20. Linux Foundation Europe. (2025-07-15). What Open Source Developers Need to Know about the EU AI Act. https://linuxfoundation.eu/newsroom/ai-act-explainer




 
 