002. Use SwiGLU Activation Function

Date: 2024-12

Status

Accepted

Context

The choice of activation function in feedforward networks significantly impacts model performance. Traditional Transformer models use GELU, but recent research has shown that gated activation functions can provide better performance.

Key considerations:

  • Model performance: Need to maximize model quality
  • Computational cost: Must remain practical for training and inference
  • Industry adoption: Learn from successful large-scale models
  • Implementation complexity: Balance benefits against complexity

Alternatives considered:

  • GELU: Standard choice, simple and effective
  • ReLU: Fast but potentially suboptimal
  • GLU variants: Better performance but higher computational cost (see the note after this list)
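
In this context, "GLU variants" refers to feedforward layers of the form f(xW) ⊗ xV, where f is an element-wise activation: sigmoid gives the original GLU, GELU gives GEGLU, and Swish gives SwiGLU, the variant adopted below.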

Decision

We adopt SwiGLU (Swish-Gated Linear Unit) as an optional activation function in our MLP layers.

Implementation details:

  • Add use_glu parameter to MLP class
  • When use_glu=True, use SwiGLU; otherwise use standard activation (GELU)
  • SwiGLU formula: SwiGLU(x, W, V) = Swish(xW) ⊗ xV, where Swish(x) = x · sigmoid(x) and ⊗ denotes element-wise multiplication
  • Uses three projection matrices (gate, up, down) instead of two, roughly 1.5x the parameters of a standard FFN at the same intermediate size, but provides better performance (see the sketch after this list)
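
A minimal sketch of what such an MLP could look like in PyTorch, assuming separate gate/up/down projections; the names gate_proj, up_proj, and down_proj are illustrative and may differ from the actual class:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int, use_glu: bool = False):
        super().__init__()
        self.use_glu = use_glu
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        if use_glu:
            # Extra gate projection, used only on the SwiGLU path.
            self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.use_glu:
            # SwiGLU(x, W, V) = Swish(xW) ⊗ xV; F.silu is Swish with beta = 1.
            hidden = F.silu(self.gate_proj(x)) * self.up_proj(x)
        else:
            # Standard feedforward path with GELU.
            hidden = F.gelu(self.up_proj(x))
        return self.down_proj(hidden)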

Usage:

mlp = MLP(
    hidden_size=2048,
    intermediate_size=8192,
    use_glu=True,  # Enable SwiGLU
)

Consequences

Positive

  • Better performance: 1-2% improvement in model quality compared to GELU
  • Used in SOTA models: GLU variants used in PaLM, LLaMA, and other leading models
  • Smooth activations: Swish provides smooth, non-monotonic activation
  • Empirically validated: Strong performance across various benchmarks
  • Optional feature: Can be disabled for simpler baseline experiments

Negative

  • More parameters: The MLP needs three projection matrices instead of two, roughly 1.5x the parameters of a standard FFN at the same intermediate size (see the worked example after this list)
  • Slower training: Approximately 10-15% slower due to additional computation
  • More memory: Higher memory footprint during training
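
For concreteness, a rough per-layer parameter count using the sizes from the usage example above (bias-free projections assumed):

# Rough per-layer parameter count for the sizes in the usage example.
hidden_size, intermediate_size = 2048, 8192
standard_ffn = 2 * hidden_size * intermediate_size  # up + down projections
swiglu_ffn = 3 * hidden_size * intermediate_size    # gate + up + down projections
print(standard_ffn, swiglu_ffn, swiglu_ffn / standard_ffn)  # 33554432 50331648 1.5

To keep the parameter count comparable, a common practice (used in LLaMA, for example) is to shrink intermediate_size to roughly two thirds of its GELU-FFN value.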

Neutral

  • Backward compatible: Default is use_glu=False to maintain compatibility
  • Configuration flexibility: Easy to A/B test against standard activations

References