4 changes: 2 additions & 2 deletions src/content/docs/glossary/activation/messages/en.json
Original file line number Diff line number Diff line change
@@ -8,7 +8,7 @@
},
"whyItMatters": {
"title": "Why It Matters",
"body": "Activations determine what information reaches the next layer and what backpropagation will differentiate. Memory planners track activation tensors during training because they are large but temporary, unlike checkpointed parameters."
"body": "Activations determine what information reaches the next layer and what backpropagation will differentiate. Memory planners track activation tensors during training because they are large but temporary, unlike checkpointed parameters. This page is the broad foundation for that idea; pages like ReLU, LeakyReLU, and SiLU zoom in on specific FFN activation choices, while SwiGLU shows how gating can turn that idea into a different FFN block shape."
},
"simpleExample": {
"title": "Simple Example",
@@ -20,7 +20,7 @@
},
"commonConfusions": {
"title": "Common Confusions",
"body": "A hidden activation is not the same as softmax: softmax turns vocabulary logits into a probability vector at the output head, while activations are internal layer outputs that may never be normalized across the vocabulary. Saying a “ReLU activation” refers to the nonlinearity applied inside a layer, not to the softmax step at decode time."
"body": "A hidden activation is not the same as softmax: softmax turns vocabulary logits into a probability vector at the output head, while activations are internal layer outputs that may never be normalized across the vocabulary. Saying a “ReLU activation” refers to one specific nonlinearity that shapes FFN activations, not to the softmax step at decode time. ReLU, LeakyReLU, and SiLU are specific activation choices; SwiGLU goes one step further and changes the FFN into a gated two-branch block."
},
"related": {
"title": "Related Concepts And Modules"
@@ -8,15 +8,15 @@
},
"whyItMatters": {
"title": "Why It Matters",
"body": "Attention decides what each position can read from the sequence; the FFN decides how to transform what was read into richer features. Most transformer blocks alternate these two steps, so recognizing the FFN slot helps you read architecture diagrams and spot when a model swaps a dense MLP for a mixture-of-experts layer while keeping the block shape."
"body": "Attention decides what each position can read from the sequence; the FFN decides how to transform what was read into richer features. Most transformer blocks alternate these two steps, so recognizing the FFN slot helps you read architecture diagrams and spot when a model swaps a dense MLP for a mixture-of-experts layer while keeping the block shape. This page is the broad map of that slot; nearby pages like Standard FFN, ReLU, SiLU, and SwiGLU zoom in on one default block shape or one activation-driven variant inside it."
},
"simpleExample": {
"title": "Simple Example",
"body": "Imagine a decoder block after self-attention has updated every word vector. The FFN runs the same recipe on each vector: multiply by a wide weight matrix, apply a gated or ReLU-style activation, then multiply by a second matrix back to hidden size. Position three's vector never mixes with position seven inside the FFN—that mixing already happened in attention."
},
"commonConfusions": {
"title": "Common Confusions",
"body": "The FFN is not attention: it does not look at other tokens. It is also not the language-model head at the stack top—that head maps final hidden states to vocabulary logits. A mixture-of-experts layer replaces the dense FFN with routed expert MLPs but still sits in the same block slot after attention."
"body": "The FFN is not attention: it does not look at other tokens. It is also not the language-model head at the stack top—that head maps final hidden states to vocabulary logits. Standard FFN is the default dense version of this slot, while ReLU, LeakyReLU, and SiLU name different nonlinearities that can live inside it. SwiGLU changes the internal FFN shape with gating, and a mixture-of-experts layer replaces one shared dense path with routed expert MLPs while still sitting in the same block slot after attention."
},
"related": {
"title": "Related Concepts And Modules"
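As a rough illustration of the expand, activate, project recipe described in the Simple Example above (a sketch for review purposes, not part of the change set), the per-position behavior of a dense FFN can be written in plain Python. The weights `W_UP` and `W_DOWN`, the toy sizes, and the helper names are all hypothetical; ReLU is used here only because it is the simplest choice.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(values):
    """Keep positive entries, replace negative entries with zero."""
    return [max(0.0, v) for v in values]


def ffn(token, w_up, w_down):
    """Apply one dense FFN to a single token vector."""
    hidden = relu(matvec(w_up, token))  # expand to the wider hidden size, then activate
    return matvec(w_down, hidden)       # project back to model width


# Hypothetical toy weights: model width 2, hidden width 4.
W_UP = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0], [-1.0, 1.0]]
W_DOWN = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]

sequence = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
outputs = [ffn(tok, W_UP, W_DOWN) for tok in sequence]

# Replace only the middle position; the other positions' outputs do not change,
# because the FFN never reads across positions.
changed = [sequence[0], [9.0, 9.0], sequence[2]]
changed_outputs = [ffn(tok, W_UP, W_DOWN) for tok in changed]
print(outputs[0] == changed_outputs[0], outputs[2] == changed_outputs[2])
```

The same function is applied to every position independently, which is the sense in which position three never mixes with position seven inside the FFN.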
12 changes: 6 additions & 6 deletions src/content/docs/glossary/leaky-relu/messages/en.json
@@ -1,22 +1,22 @@
{
"title": "LeakyReLU",
"description": "A ReLU-style activation that keeps a small negative slope instead of zeroing every negative FFN value.",
"description": "An FFN activation like ReLU that still lets a small negative signal pass instead of clamping all negative values to zero.",
"sections": {
"whatItIs": {
"title": "What It Is",
"body": "LeakyReLU is a small change to ReLU. Positive hidden values pass through as usual, but negative values are multiplied by a small constant such as 0.01 instead of becoming zero. In a standard FFN after attention, that means a token can keep a weak negative signal rather than shutting the feature off completely."
"body": "LeakyReLU is a ReLU-style activation with a small slope on the negative side. Inside an FFN, positive hidden values still pass through normally, but negative values are shrunk rather than cut off entirely. The result is still a simple per-token hidden transform in the same FFN slot after attention."
},
"whyItMatters": {
"title": "Why It Matters",
"body": "The main reason to use LeakyReLU is to avoid losing all gradient signal on the negative side. A dense FFN with plain ReLU can leave some hidden units inactive for long stretches if they keep landing below zero. LeakyReLU softens that cutoff, so papers sometimes use it when they want ReLU-like behavior with a less brittle negative branch."
"body": "This variant changes one specific behavior relative to ReLU: negative responses do not disappear completely. That makes LeakyReLU useful as a comparison point when you want to understand what changes if an FFN keeps a faint negative signal alive instead of throwing it away. It stays a dense FFN choice, not a routing or architecture change."
},
"simpleExample": {
"title": "Simple Example",
"body": "If the hidden values are [-2.0, 0.7, 3.4] and the leak factor is 0.01, LeakyReLU produces [-0.02, 0.7, 3.4]. The FFN still runs the same expand then activate then project recipe; only the activation rule changes."
"body": "Suppose an FFN hidden vector contains -2, -0.3, 0.1, and 4. A LeakyReLU with a small negative slope might turn that into something like -0.02, -0.003, 0.1, and 4 before the block projects back down. The big idea is that the negative side still contributes a little instead of becoming exact zero."
},
"commonConfusions": {
"title": "Common Confusions",
"body": "LeakyReLU is not a gated FFN like SwiGLU and it is not a sparse router like mixture of experts. It stays inside the same dense standard FFN block. It also does not make negative values equally important as positive ones; the negative side is still much smaller, just not forced to zero."
"body": "LeakyReLU is still an activation inside a standard FFN, not a separate expert path or a new transformer layer. It differs from plain ReLU only in what happens to negative values. It also does not mean the model is using a gated FFN such as SwiGLU, because gating changes the FFN shape while LeakyReLU only changes the nonlinearity inside the usual dense path."
},
"related": {
"title": "Related Concepts And Modules"
@@ -28,5 +28,5 @@
"title": "References"
}
},
"openingSummary": "LeakyReLU keeps a small negative slope inside an FFN instead of clamping every negative hidden value to zero, so weak signals can still flow through the block."
"openingSummary": "LeakyReLU is a small variation on ReLU: it keeps the same FFN role, but negative hidden values leak through at a reduced scale instead of dropping to zero."
}
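To sanity-check the LeakyReLU numbers used in both versions of the Simple Example, a minimal sketch follows. The function name and the `negative_slope` parameter are illustrative choices, and 0.01 is the example slope mentioned in the text, not a required value.

```python
def leaky_relu(values, negative_slope=0.01):
    """Pass positive entries through; scale negative entries by a small slope."""
    return [v if v > 0 else negative_slope * v for v in values]


# Hidden vector from the revised example text.
hidden = [-2.0, -0.3, 0.1, 4.0]
print(leaky_relu(hidden))

# Hidden vector from the previous example text.
print(leaky_relu([-2.0, 0.7, 3.4]))
```

Up to floating-point rounding, the negative entries come out at one hundredth of their input while the positive entries are unchanged, which matches the values quoted in the page text.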
4 changes: 2 additions & 2 deletions src/content/docs/glossary/leaky-relu/page.mdx
@@ -1,6 +1,6 @@
---
title: LeakyReLU
description: A ReLU-style activation that keeps a small negative slope instead of zeroing every negative FFN value.
description: An FFN activation like ReLU that still lets a small negative signal pass instead of clamping all negative values to zero.
kind: "glossary"
registryId: "concept.leaky-relu"
messageNamespace: "local"
@@ -9,9 +9,9 @@ status: "published"
tags:
- foundations
aliases:
- "LeakyReLU"
- "leaky ReLU"
- "leaky rectified linear unit"
- "Leaky ReLU activation"
updatedAt: "2026-06-18"
---

4 changes: 2 additions & 2 deletions src/content/docs/glossary/mixture-of-experts/messages/en.json
@@ -8,15 +8,15 @@
},
"whyItMatters": {
"title": "Why It Matters",
"body": "MoE increases total parameter count while keeping compute per token roughly flat, because only a few experts activate per step. That tradeoff matters when scaling language models: you can add capacity without multiplying FLOPs for every position. Capacity limits, load balancing, and routing noise also shape training stability, so recognizing MoE helps you read model cards that list expert counts and top-k routing."
"body": "MoE increases total parameter count while keeping compute per token roughly flat, because only a few experts activate per step. That tradeoff matters when scaling language models: you can add capacity without multiplying FLOPs for every position. Capacity limits, load balancing, and routing noise also shape training stability, so recognizing MoE helps you read model cards that list expert counts and top-k routing. It also helps to see where MoE sits relative to the rest of the FFN family: it replaces the default dense block in the usual transformer slot rather than adding a whole new stage elsewhere."
},
"simpleExample": {
"title": "Simple Example",
"body": "Suppose a block has sixty-four expert MLPs but top-2 routing. A token vector enters the router, which picks the two highest-scoring experts—say experts 7 and 41. Each expert applies its own two-layer MLP; the router weights blend their outputs into one vector that continues through normalization and the residual add. The next token in the batch may activate a completely different pair."
},
"commonConfusions": {
"title": "Common Confusions",
"body": "MoE is not a model ensemble: ensembles combine separate full models at inference, while MoE keeps one shared stack and only sparsely activates internal experts. MoE is also not the same as a dense FFN—both sit after attention in the block, but dense FFN runs one MLP for every token whereas MoE selects a small subset. Finally, total expert parameters are not all used on every forward pass; active compute tracks top-k, not the full expert pool."
"body": "MoE is not a model ensemble: ensembles combine separate full models at inference, while MoE keeps one shared stack and only sparsely activates internal experts. MoE is also not the same as a dense FFN—both sit after attention in the block, but dense FFN runs one MLP for every token whereas MoE selects a small subset. SwiGLU is another FFN variant, but it keeps one shared gated dense block rather than routing tokens across experts. Finally, total expert parameters are not all used on every forward pass; active compute tracks top-k, not the full expert pool."
},
"related": {
"title": "Related Concepts And Modules"
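The top-2 routing walk-through in the Simple Example above can be sketched as follows. This is an assumption-laden toy, not the behavior of any particular model: the experts are hypothetical stand-in functions rather than two-layer MLPs, there are four of them instead of sixty-four, and the router weights are normalized with a softmax over only the selected experts, which is one common convention among several.

```python
import math


def top_k_route(router_scores, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    order = sorted(range(len(router_scores)),
                   key=lambda i: router_scores[i], reverse=True)
    top = order[:k]
    exps = [math.exp(router_scores[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(token, router_scores, experts, k=2):
    """Blend the outputs of the selected experts using the router weights."""
    output = [0.0] * len(token)
    for idx, weight in top_k_route(router_scores, k):
        expert_out = experts[idx](token)  # only the selected experts run
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


# Hypothetical stand-in experts; a real expert would be its own small MLP.
experts = [
    lambda t: [2.0 * x for x in t],
    lambda t: [x + 1.0 for x in t],
    lambda t: [-x for x in t],
    lambda t: [0.0 for _ in t],
]

scores = [0.1, 2.0, -1.0, 1.0]  # hypothetical router scores for one token
print(top_k_route(scores))
print(moe_layer([1.0, -1.0], scores, experts))
```

Experts 0 and 2 are never called for this token, which is the point made in the Common Confusions text: active compute tracks top-k, not the full expert pool.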
12 changes: 6 additions & 6 deletions src/content/docs/glossary/relu/messages/en.json
@@ -1,22 +1,22 @@
{
"title": "ReLU",
"description": "A pointwise activation that keeps positive FFN values and clamps negative ones to zero.",
"description": "A simple activation that keeps positive values and turns negative values into zero inside many FFN hidden layers.",
"sections": {
"whatItIs": {
"title": "What It Is",
"body": "ReLU stands for rectified linear unit. Inside a standard FFN, it looks at each hidden value after the expand layer and applies a simple rule: keep positive numbers, replace negative numbers with zero. The FFN stays in the same slot after attention, but this activation changes which hidden features stay active before the block projects back to model width."
"body": "ReLU stands for rectified linear unit. Inside a standard FFN, it leaves positive hidden values alone and replaces negative ones with zero before the next projection. That gives the FFN a simple nonlinearity, so the block can reshape features instead of behaving like one big linear map."
},
"whyItMatters": {
"title": "Why It Matters",
"body": "ReLU became popular because the rule is cheap to compute and easy to reason about. When a token's hidden feature is positive, the feature passes through unchanged; when it is negative, the feature drops out for that step. That sharp gate can help a dense FFN learn sparse feature patterns, which is why ReLU remains a useful baseline when papers compare newer activations."
"body": "ReLU is one of the easiest activation choices to understand, so it often acts as the baseline when people compare FFN variants. In a transformer block, the attention sublayer mixes information across tokens, then the FFN with ReLU transforms each token's hidden state on its own. If you know what ReLU changes, later pages on smoother or gated variants are easier to read."
},
"simpleExample": {
"title": "Simple Example",
"body": "Suppose an FFN hidden vector contains [-2.1, 0.7, 3.4]. ReLU turns it into [0, 0.7, 3.4]. The token stays in the same per-position feed-forward path; only the hidden values change before the next projection."
"body": "Imagine an FFN hidden vector with values like -2, -0.3, 0.1, and 4. ReLU turns that into 0, 0, 0.1, and 4 before the FFN projects the vector back to model width. In plain terms, the layer stops carrying the negative hidden responses forward but keeps the positive ones."
},
"commonConfusions": {
"title": "Common Confusions",
"body": "ReLU is only the activation step inside an FFN, not the whole feed-forward block. It also is not the same as LeakyReLU or SiLU, which keep some information for negative values instead of forcing every negative entry to zero. In transformer papers, seeing ReLU usually tells you how the hidden FFN state is shaped, not that the model changed its attention or residual path."
"body": "ReLU is an activation choice inside an FFN, not a separate transformer block. Swapping in ReLU does not move the feed-forward slot or turn the model into a mixture-of-experts layer. It is also not the same as LeakyReLU: plain ReLU cuts negative values all the way to zero, while LeakyReLU lets a small negative signal keep flowing."
},
"related": {
"title": "Related Concepts And Modules"
@@ -28,5 +28,5 @@
"title": "References"
}
},
"openingSummary": "ReLU is a simple FFN activation that zeroes negative hidden values, giving transformer MLPs a cheap nonlinearity that decides which features remain active."
"openingSummary": "ReLU is the simplest common FFN activation: after the hidden projection, it keeps positive values, zeros out negative ones, and then the block projects the result back down."
}
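For reviewers who want to confirm the ReLU example values in this file, a minimal sketch is below. The function name is an illustrative choice; the input vector is the one from the revised Simple Example.

```python
def relu(values):
    """Keep positive entries, replace negative entries with zero."""
    return [max(0.0, v) for v in values]


hidden = [-2.0, -0.3, 0.1, 4.0]
print(relu(hidden))  # [0.0, 0.0, 0.1, 4.0]
```

The output agrees with the "0, 0, 0.1, and 4" quoted in the new body text.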
6 changes: 3 additions & 3 deletions src/content/docs/glossary/relu/page.mdx
@@ -1,6 +1,6 @@
---
title: ReLU
description: A pointwise activation that keeps positive FFN values and clamps negative ones to zero.
description: A simple activation that keeps positive values and turns negative values into zero inside many FFN hidden layers.
kind: "glossary"
registryId: "concept.relu"
messageNamespace: "local"
@@ -9,9 +9,9 @@ status: "published"
tags:
- foundations
aliases:
- "ReLU"
- "rectified linear unit"
- "ReLU activation"
- "rectifier"
- "relu activation"
updatedAt: "2026-06-18"
---

12 changes: 6 additions & 6 deletions src/content/docs/glossary/silu/messages/en.json
@@ -1,22 +1,22 @@
{
"title": "SiLU",
"description": "A smooth activation that multiplies each FFN hidden value by its sigmoid gate.",
"description": "A smooth FFN activation that scales each hidden value by a soft gate based on that same value.",
"sections": {
"whatItIs": {
"title": "What It Is",
"body": "SiLU stands for sigmoid linear unit. For each hidden value x inside an FFN, it outputs x multiplied by sigmoid(x). Large positive values pass through strongly, values near zero are softened, and negative values are reduced smoothly instead of being cut off sharply. Many modern transformer blocks use SiLU because it fits well with wide FFN layers after attention."
"body": "SiLU stands for sigmoid linear unit. Inside an FFN, it multiplies each hidden value by a soft gate computed from that same value, so the response changes smoothly instead of snapping at zero the way ReLU does. The FFN still sits in the same transformer slot after attention and still transforms each token on its own."
},
"whyItMatters": {
"title": "Why It Matters",
"body": "Compared with ReLU, SiLU changes hidden states more gradually. That smoother shape often works well in large language model FFNs, where small changes in hidden values can matter across many layers. SiLU also matters because SwiGLU builds on the same nonlinearity: the gate branch in a SwiGLU block uses SiLU before it scales the value branch."
"body": "SiLU is a useful bridge between simple activations and more modern gated FFN designs. It keeps the dense FFN shape of a standard FFN, but it gives the hidden transform a smoother response that many recent architectures build on. If you understand SiLU, the jump to SwiGLU is much easier because SwiGLU reuses the same smooth gating idea inside a two-branch FFN."
},
"simpleExample": {
"title": "Simple Example",
"body": "If one hidden value is 3, sigmoid(3) is close to 0.95, so SiLU keeps most of that signal. If another value is -2, sigmoid(-2) is small, so the output stays negative but shrinks in magnitude. The token still follows the same standard FFN slot; SiLU only changes how the expanded hidden features are filtered."
"body": "Imagine an FFN hidden value is slightly negative. ReLU would cut it to zero, but SiLU leaves a small negative output because the gate fades it down instead of shutting it off completely. Large positive values still pass through strongly, so the FFN can keep strong positive evidence while treating weaker values more gently."
},
"commonConfusions": {
"title": "Common Confusions",
"body": "SiLU is an activation, not a full gated FFN by itself. SwiGLU uses SiLU as part of a larger two-branch feed-forward design, while plain SiLU can also appear inside an otherwise standard FFN. SiLU is also not the same as softmax or sigmoid output heads; it is an internal hidden-state transform."
"body": "SiLU is still just an activation choice inside a dense FFN, not a separate expert-routing layer and not a gated FFN by itself. A model using SiLU can still have a standard FFN block shape. SwiGLU goes further by adding a second branch that gates the main branch, while mixture-of-experts changes which FFN path a token uses."
},
"related": {
"title": "Related Concepts And Modules"
@@ -28,5 +28,5 @@
"title": "References"
}
},
"openingSummary": "SiLU smoothly gates each FFN hidden value with its own sigmoid score, giving transformer MLPs a softer activation than ReLU and setting up the gate used in SwiGLU blocks."
"openingSummary": "SiLU is a smooth FFN activation: it softly gates each hidden value using that value itself, so the dense block keeps the same slot as a standard FFN but changes how hidden responses flow through it."
}
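The SiLU definition quoted in the previous body text, x multiplied by sigmoid(x), can be checked with a short sketch. The function names are illustrative, and the sample inputs 3 and -2 come from the earlier Simple Example; -0.1 is added here as a hypothetical "slightly negative" value.

```python
import math


def sigmoid(x):
    """Logistic function, used as the soft gate."""
    return 1.0 / (1.0 + math.exp(-x))


def silu(x):
    """Scale x by a gate computed from x itself."""
    return x * sigmoid(x)


for x in (3.0, -2.0, -0.1):
    print(f"x={x:+.1f}  gate={sigmoid(x):.3f}  silu={silu(x):+.3f}")
```

For any negative input the gate is positive, so the output stays negative but with smaller magnitude than the input, whereas ReLU would return exactly zero.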
6 changes: 3 additions & 3 deletions src/content/docs/glossary/silu/page.mdx
@@ -1,6 +1,6 @@
---
title: SiLU
description: A smooth activation that multiplies each FFN hidden value by its sigmoid gate.
description: A smooth FFN activation that scales each hidden value by a soft gate based on that same value.
kind: "glossary"
registryId: "concept.silu"
messageNamespace: "local"
@@ -9,9 +9,9 @@ status: "published"
tags:
- foundations
aliases:
- "SiLU"
- "sigmoid linear unit"
- "Swish"
- "SiLU activation"
- "swish"
updatedAt: "2026-06-18"
---
