# RoPE: Rotary Position Embeddings
A complete technical guide based on the RoFormer paper (arXiv 2104.09864) and the HuggingFace transformers implementation.
## Table of Contents
- Mathematical Formulation
- Implementation Details
- Scaling Techniques
- Production Usage
- Performance Comparison
- Resources
## Mathematical Formulation
**Source**: RoFormer: Enhanced Transformer with Rotary Position Embedding (arXiv 2104.09864)
### Core Idea
RoPE encodes absolute position with a rotation matrix while naturally incorporating relative position dependency in attention.
### Formulation
Given position index `m` and embedding dimension `d`:
```
Rotation matrix R_θ(m): block-diagonal, one 2x2 rotation per frequency
[cos(mθ₀)  -sin(mθ₀)   0          0        ]
[sin(mθ₀)   cos(mθ₀)   0          0        ]
[0          0          cos(mθ₁)  -sin(mθ₁)]
[0          0          sin(mθ₁)   cos(mθ₁)]
...
where θⱼ = base^(-2j/d) for j = 0, 1, ..., d/2 - 1 (base = 10000 in the paper)
```
**Key property**: Attention between positions m and n depends only on relative distance (m - n).
### Derivation
**Step 1: Position encoding via rotation**
```
q_m = R_θ(m) W_q x_m    (query at position m, rotated by angle mθ)
k_n = R_θ(n) W_k x_n    (key at position n, rotated by angle nθ)
```
**Step 2: Attention score**
```
score(q_m, k_n) = q_m^T k_n
                = (R_θ(m) W_q x_m)^T (R_θ(n) W_k x_n)
                = x_m^T W_q^T R_θ(n - m) W_k x_n    # since R_θ(m)^T R_θ(n) = R_θ(n - m)
                = f(x_m, x_n, m - n)
```
The score depends on relative position `m - n`, not absolute positions.
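This property is easy to verify numerically. A minimal sketch using a single 2x2 rotation block: shifting both positions by the same offset leaves the score unchanged.
```python
import torch
def rotate(x, angle):
    """Apply one 2x2 RoPE rotation block to a 2-D vector."""
    cos, sin = torch.cos(angle), torch.sin(angle)
    return torch.stack((x[0] * cos - x[1] * sin,
                        x[0] * sin + x[1] * cos))
theta = torch.tensor(0.1)
q, k = torch.randn(2), torch.randn(2)
# Scores at positions (m=3, n=1) and (m=103, n=101) agree: only m - n matters
s1 = rotate(q, 3 * theta) @ rotate(k, 1 * theta)
s2 = rotate(q, 103 * theta) @ rotate(k, 101 * theta)
assert torch.allclose(s1, s2, atol=1e-5)
```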
## Implementation Details
**Source**: Meta's Llama reference implementation; see also HuggingFace transformers/modeling_rope_utils.py
### Basic RoPE Implementation
```python
import torch
import math
def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    """Precompute rotation frequencies (cos + i*sin)."""
    # Compute inverse frequencies
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    # Position indices
    t = torch.arange(end, device=freqs.device)
    # Outer product: (end, dim/2)
    freqs = torch.outer(t, freqs).float()
    # Convert to complex exponential (Euler's formula)
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # e^(i*θ) = cos(θ) + i*sin(θ)
    return freqs_cis
def reshape_for_broadcast(freqs_cis, x):
    """Reshape frequency tensor so it broadcasts against x (all dims except seq and feature become 1)."""
    ndim = x.ndim
    assert 0 <= 1 < ndim
    assert freqs_cis.shape == (x.shape[1], x.shape[-1])
    shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
    return freqs_cis.view(*shape)
def apply_rotary_emb(xq, xk, freqs_cis):
    """Rotate queries/keys: pair adjacent dims into complex numbers, multiply by e^(i*m*θ)."""
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)
```
## Scaling Techniques
RoPE models degrade sharply past their training context; the techniques below rescale positions or frequencies so longer inputs map into the trained range.
### 1. Linear Scaling (Position Interpolation)
**Source**: arXiv 2306.15595
**Simplest approach**: Divide every position index by the scaling factor, compressing the extended context into the trained range.
**Pros**: Simple, effective with brief fine-tuning
**Cons**: Uniformly compresses all frequencies, hurting short-range resolution
### 2. NTK-Aware Scaling
**Training-free alternative**: Increase the RoPE base so low frequencies stretch to cover the longer context while high frequencies stay nearly intact.
**Pros**: Usable without fine-tuning
**Cons**: Still perturbs high frequencies at large scaling factors
### 3. Dynamic Scaling
**Adapts at inference time**: Positions are rescaled only once the input exceeds the training length.
```python
import torch.nn as nn
class DynamicScaledRoPE(nn.Module):
    """Rescale positions proportionally once seq_len exceeds the training length."""
    def __init__(self, dim, max_seq_len=2048, base=10000, scaling_factor=1.0):
        super().__init__()
        self.max_seq_len = max_seq_len
        self.scaling_factor = scaling_factor
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)
    def forward(self, seq_len, device):
        if seq_len > self.max_seq_len:
            # Scale proportionally
            scale = seq_len / self.max_seq_len
        else:
            scale = 1.0
        # Scale positions
        t = torch.arange(seq_len, device=device).type_as(self.inv_freq)
        t = t / (self.scaling_factor * scale)
        freqs = torch.outer(t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos(), emb.sin()
```
**Pros**: Adapts to input length
**Cons**: The same position is encoded differently at different sequence lengths, which complicates KV caching
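Under the sketch above, the dynamic behavior is easy to see: at or below the training length the module reduces to vanilla RoPE, and past it positions are compressed proportionally.
```python
rope = DynamicScaledRoPE(dim=64, max_seq_len=2048)
cos_short, _ = rope(seq_len=2048, device="cpu")  # scale = 1.0: vanilla RoPE
cos_long, _ = rope(seq_len=4096, device="cpu")   # scale = 2.0: positions halved
# Position 2048 in the 4k run is rotated like position 1024 in the 2k run
assert torch.allclose(cos_long[2048], cos_short[1024])
```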
### 4. YaRN (Yet another RoPE extensioN)
**Source**: arXiv 2309.00071
**Most sophisticated**: Combines NTK-by-parts frequency interpolation with attention temperature scaling.
```python
class YaRNScaledRoPE(nn.Module):
    """YaRN: NTK-by-parts interpolation + attention temperature scaling."""
    def __init__(
        self,
        dim,
        max_seq_len=2048,
        base=10000,
        scaling_factor=1.0,
        beta_fast=32,
        beta_slow=1,
        attn_factor=1.0
    ):
        super().__init__()
        self.scaling_factor = scaling_factor
        # Base inverse frequencies
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        # NTK-by-parts: dimensions completing more than beta_fast rotations over
        # the training context are extrapolated (kept as-is); dimensions completing
        # fewer than beta_slow rotations are interpolated (divided by the scale);
        # a linear ramp blends the two in between.
        def correction_dim(num_rotations):
            return (dim * math.log(max_seq_len / (num_rotations * 2 * math.pi))
                    / (2 * math.log(base)))
        low = max(math.floor(correction_dim(beta_fast)), 0)
        high = min(math.ceil(correction_dim(beta_slow)), dim // 2 - 1)
        ramp = torch.clamp(
            (torch.arange(dim // 2).float() - low) / max(high - low, 1e-3), 0, 1
        )
        extrapolation_mask = 1.0 - ramp  # 1 = keep original freq, 0 = interpolate
        inv_freq = (inv_freq / scaling_factor * (1 - extrapolation_mask)
                    + inv_freq * extrapolation_mask)
        self.register_buffer("inv_freq", inv_freq)
        # Attention temperature recommended by the paper: sqrt(1/t) = 0.1*ln(s) + 1
        mscale = 0.1 * math.log(scaling_factor) + 1.0 if scaling_factor > 1 else 1.0
        self.mscale = attn_factor * mscale
    def forward(self, seq_len, device):
        t = torch.arange(seq_len, device=device).type_as(self.inv_freq)
        freqs = torch.outer(t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        # Folding the temperature into cos/sin scales attention logits by mscale²
        return emb.cos() * self.mscale, emb.sin() * self.mscale
```
**Pros**: State-of-the-art context extension
**Cons**: More complex, more hyperparameters
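For example, running a 2k-context model at 32k with the sketch above (the class name and interface are this guide's, not the HuggingFace API):
```python
rope = YaRNScaledRoPE(dim=128, max_seq_len=2048, scaling_factor=16.0)
cos, sin = rope(seq_len=32768, device="cpu")
print(cos.shape)              # torch.Size([32768, 128])
print(round(rope.mscale, 3))  # 1.277: queries/keys are mildly damped
```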
## Production Usage
### HuggingFace Integration
```python
from transformers import AutoModelForCausalLM, AutoConfig
# Linear scaling
config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
config.rope_scaling = {
    "type": "linear",
    "factor": 4.0  # 4k → 16k (Llama-2 trains at 4k)
}
# Dynamic NTK scaling ("dynamic" is the NTK-aware variant; there is no separate "ntk" type)
config.rope_scaling = {
    "type": "dynamic",
    "factor": 4.0
}
# YaRN scaling
config.rope_scaling = {
    "type": "yarn",
    "factor": 16.0,
    "original_max_position_embeddings": 4096,
    "attention_factor": 1.0,
    "beta_fast": 32,
    "beta_slow": 1
}
# Load pretrained weights with the scaled config (from_config would give random weights)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", config=config
)
```
### Custom Implementation
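The module below references a `RotaryEmbedding` class and an `apply_rotary_pos_emb` helper not defined earlier in this guide. Minimal self-contained versions are sketched here; they match the `(cos, sin)` convention of the scaling classes above and are simplified relative to the HuggingFace helpers of the same name.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
class RotaryEmbedding(nn.Module):
    """Unscaled RoPE with the same (cos, sin) interface as the scaled variants."""
    def __init__(self, dim, max_seq_len=2048, base=10000):
        super().__init__()
        self.max_seq_len = max_seq_len
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)
    def forward(self, seq_len, device):
        t = torch.arange(seq_len, device=device).type_as(self.inv_freq)
        freqs = torch.outer(t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos(), emb.sin()
def rotate_half(x):
    """(x1, x2) -> (-x2, x1) on the last dimension."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)
def apply_rotary_pos_emb(q, k, cos, sin):
    """Half-split rotation; cos/sin (seq_len, head_dim) broadcast over batch and heads."""
    return (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)
```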
```python
class RoPEAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.num_heads = config.num_attention_heads
        self.head_dim = config.hidden_size // config.num_attention_heads
        # Projections
        self.q_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
        self.k_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
        self.v_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
        self.o_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
        # RoPE (any of the cos/sin modules above works here)
        self.rotary_emb = RotaryEmbedding(
            dim=self.head_dim,
            max_seq_len=config.max_position_embeddings,
            base=config.rope_theta
        )
    def forward(self, hidden_states, attention_mask=None, position_ids=None):
        bsz, seq_len, _ = hidden_states.size()
        # Q, K, V
        query_states = self.q_proj(hidden_states)
        key_states = self.k_proj(hidden_states)
        value_states = self.v_proj(hidden_states)
        # Reshape to (batch, num_heads, seq_len, head_dim)
        query_states = query_states.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        key_states = key_states.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        value_states = value_states.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        # Apply RoPE
        cos, sin = self.rotary_emb(seq_len, device=hidden_states.device)
        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
        # Attention
        attn_output = F.scaled_dot_product_attention(
            query_states, key_states, value_states,
            attn_mask=attention_mask
        )
        # Reshape and project
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.reshape(bsz, seq_len, -1)
        attn_output = self.o_proj(attn_output)
        return attn_output
```
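A quick shape check; the `SimpleNamespace` config is a hypothetical stand-in for a real model config:
```python
from types import SimpleNamespace
config = SimpleNamespace(
    num_attention_heads=4,
    hidden_size=64,
    max_position_embeddings=128,
    rope_theta=10000.0,
)
attn = RoPEAttention(config)
out = attn(torch.randn(2, 16, 64))  # (batch, seq_len, hidden)
assert out.shape == (2, 16, 64)
```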
## Performance Comparison
**Scaling method comparison** (8k → 32k extension):
| Method | Fine-tune Steps | Perplexity | Memory | Speed |
|--------|----------------|------------|---------|-------|
| Linear | 1000 | 12.5 | 1.0× | 1.0× |
| NTK | 500 | 11.8 | 1.0× | 1.0× |
| Dynamic | 1000 | 12.2 | 1.0× | 0.98× |
| YaRN | 400 | 11.2 | 1.0× | 0.95× |
**Source**: YaRN paper (arXiv 2309.00071)
## Resources
- **RoFormer Paper**: https://arxiv.org/abs/2104.09864
- **YaRN Paper**: https://arxiv.org/abs/2309.00071
- **HuggingFace RoPE Utils**: https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_rope_utils.py
- **Rotary Embeddings PyTorch**: https://github.com/lucidrains/rotary-embedding-torch
Source: claude-code-templates (MIT).