Deep Learning 7 min read

Building From Scratch: Implementing Rotary Position Embeddings (RoPE)

Follow my journey implementing Rotary Position Embeddings from the ground up for "S1" — a foundational language model I'm building to truly understand how Transformer-based architectures work.

Anubhav Banerjee
Deep Learning Engineer & Software Developer

Excited to share a new milestone in my personal deep-learning journey: I've successfully implemented Rotary Position Embeddings (RoPE) from scratch! This isn't just an isolated exercise; it's a critical component for "S1," a foundational language model (targeting ~125M-150M parameters) that I'm building from the ground up, reading the research papers and translating what I learn into working code.

My primary goal here is fundamental understanding. Just to clarify, I am not creating a novel state-of-the-art model (not yet, anyway!). This project is a personal deep dive to deconstruct the magic of Transformer-based deep-learning architectures and truly learn how each component works from first principles.

Understanding RoPE: From 2D Intuition to Multi-Dimensional Reality

Implementing RoPE was fascinating. The original paper ("RoFormer: Enhanced Transformer with Rotary Position Embedding" by Jianlin Su et al.) was my key source for writing the code, with some assistance from Meta's Llama codebase.

True Relative Positioning: The Core Idea

The core idea is brilliant. Instead of adding or concatenating a positional vector, RoPE rotates the query and key vectors based on their absolute position. The genius is that the resulting dot-product attention now depends only on the relative distance between the tokens.

Starting Simple: The 2D Case

To explain it in simple words: think of a token (a word, for simplicity) as a D-dimensional vector X. In attention, we project each input token into three different vector spaces, so it can act as three different entities (Query, Key, and Value), obtained by multiplying X with the matrices Wq, Wk, and Wv. You can refer to other articles to understand multi-head self-attention in more detail.

In classical absolute positional embedding, we take X and add to it another vector Pi (where i denotes the token's position in the input sequence). This means we have a vector for each input position (position 1 has one, position 2 has another, and so on), and these vectors are learned during the model's pre-training.

$$X' = X + P_i$$

Now let's say the input is "I went to Hollywood hills while visiting ____". Every token (assume a word here), like "went", has a vector X; since "went" appears at position 2, we add the position-2 vector P₂ to its embedding X, and only then project the result into the Query, Key, and Value spaces.
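As a minimal sketch of this APE step (toy sizes and token ids of my own choosing, not S1's actual hyperparameters), X' = X + P_i is just a second learned lookup table added to the token embeddings:

```python
import torch
from torch import nn

# Illustrative sizes only; a real model would use a much larger vocab and d_model.
vocab_size, max_seq_len, d_model = 100, 32, 16

tok_emb = nn.Embedding(vocab_size, d_model)   # X: one vector per token id
pos_emb = nn.Embedding(max_seq_len, d_model)  # P_i: one learned vector per position

token_ids = torch.tensor([[4, 17, 9]])                     # e.g. "I went to"
positions = torch.arange(token_ids.shape[1]).unsqueeze(0)  # [[0, 1, 2]]

# X' = X + P_i; the result is then projected into Q, K, V downstream.
x = tok_emb(token_ids) + pos_emb(positions)
print(x.shape)  # torch.Size([1, 3, 16])
```

Both tables are trained end to end, which is exactly why APE bakes absolute (rather than relative) position into every token.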

But RoPE and other relative positional embeddings argue that the relationship between two tokens (which attention implicitly weighs via the dot product) depends more on the relative distance between them than on their absolute positions. If two tokens appear at positions 3 and 5 in one input, and at positions 17 and 19 in another, they should relate to each other in the same way, because what matters is the distance between them. APE (absolute positional embedding) cannot achieve this, since the dot product (inner product) of the two tokens' key and query is also a function of their absolute positions m and n.

The Rotation Insight

This is where RoPE comes in. To understand RoPE, the paper first asks us to think in terms of 2D vectors and complex numbers. Say we have two 2D contextual embeddings (vectors) A = (x, y) and B = (p, q). To fold positional information into their dot product, we simply rotate the two vectors by angles m·θ and n·θ, where m and n are the token positions and θ is some base frequency.

What this does is add a sense of relative position between the two vectors. If the tokens are far apart, the relative angle between their vectors is large, and their dot product tends to shrink, in line with the intuition that far-away tokens should, on average, receive lower attention scores.

So the vectors become Arot = A × exp(i·m·θ) and Brot = B × exp(i·n·θ) if you think of the 2D vectors in the Argand plane as complex numbers. Their (real) dot product is then the real part of the complex inner product:

$$\text{Attention Score} = \text{Re}\left[A_{rot}\, B_{rot}^{*}\right] = \text{Re}\left[A\, B^{*}\, e^{i(m-n)\theta}\right]$$

Key Insight: The attention score now depends only on the contextual embeddings A and B and their relative distance (m − n), NOT on the absolute positions m and n.
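This invariance is easy to check numerically. A quick sketch treating the 2D vectors as complex numbers (toy values of my own, not from the paper):

```python
import cmath

# Toy 2D vectors written as complex numbers: A = (x, y) -> x + iy
A = complex(1.0, 2.0)
B = complex(0.5, -1.0)
theta = 0.3  # arbitrary base frequency for the demo

def score(m, n):
    # Rotate by m*theta and n*theta, then take Re[A_rot * conj(B_rot)],
    # which equals the real 2D dot product of the rotated vectors.
    A_rot = A * cmath.exp(1j * m * theta)
    B_rot = B * cmath.exp(1j * n * theta)
    return (A_rot * B_rot.conjugate()).real

# Positions (3, 5) and (17, 19) share the same relative distance m - n = -2,
# so the scores match even though the absolute positions differ.
print(score(3, 5), score(17, 19))
```

Running this prints the same value twice, confirming the score is a function of (m − n) alone.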

Scaling to High Dimensions: The Real Implementation

Now here's where it gets really interesting: how do we extend this elegant 2D rotation to the 128- or 256-dimensional query and key vectors we actually use in Transformers?

The brilliant insight is: we don't rotate the entire high-dimensional vector at once. Instead, we split it into pairs of dimensions and rotate each pair independently!

Think of it this way: if you have a 128-dimensional query vector, you split it into 64 pairs: (dim 0, dim 1), (dim 2, dim 3), (dim 4, dim 5), and so on. Each pair gets treated as a 2D vector and rotated in its own 2D plane, just like we described above.

But here's the clever part - each pair is rotated by a different frequency. The first pair (dim 0, dim 1) rotates with frequency θ₁, the second pair with θ₂, the third with θ₃, and so on. These frequencies follow a geometric progression based on the formula:

$$θ_i = \text{base}^{-2i/d}$$

where base is typically 10,000 and d is the dimension of the vector.
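With base = 10,000 and d = 128, these frequencies span several orders of magnitude. A quick check (note that the index below already steps by 2, so the exponent is the −2i/d from the formula):

```python
import torch

base, d = 10000.0, 128

# theta_i = base^(-2i/d) for pair index i = 0 .. d/2 - 1
idx = torch.arange(0, d, 2, dtype=torch.float)  # 0, 2, 4, ..., 126
theta = base ** (-idx / d)

print(theta[0].item())   # 1.0: the fastest-rotating pair
print(theta[-1].item())  # ~1.2e-4: the slowest pair, thousands of tokens per cycle
```

So the first pair completes a cycle every ~6 tokens, while the last needs tens of thousands, which is exactly the multi-scale "clock" behavior described below.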

Why Different Frequencies?

Because this creates a rich, multi-scale positional encoding. The lower dimensions (with higher frequencies) capture fine-grained, short-range positional relationships, while higher dimensions (with lower frequencies) capture coarse-grained, long-range relationships. It's like having multiple clocks ticking at different speeds - a second hand, minute hand, and hour hand all giving you different scales of time information simultaneously.

So in practice, when you implement RoPE for a 128-dim vector at position m:

  1. Split the vector into 64 pairs of consecutive dimensions
  2. For pair i (dimensions 2i and 2i+1), compute the rotation angle: m·θᵢ
  3. Rotate that pair using the 2D rotation matrix (or complex number multiplication)
  4. Stack all the rotated pairs back together

The mathematical beauty is that when you compute the attention score between position m and position n, all these rotations collapse into relative position (m-n) for each frequency band, just like in the 2D case! The dot product becomes a sum over all dimension pairs, each contributing information about the relative distance at its own frequency scale.
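The four steps above can be sketched directly with PyTorch's complex view. This is a toy illustration of the interleaved-pairing convention (function name `rope_interleaved` is my own; it is not the rotate-half formulation S1 itself uses):

```python
import torch

def rope_interleaved(x, m, base=10000.0):
    # x: (..., d) with d even. Rotate each consecutive pair (2i, 2i+1)
    # by angle m * theta_i, treating the pair as one complex number.
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float) / d)
    angles = m * theta                                             # (d/2,)
    pairs = torch.view_as_complex(x.float().reshape(*x.shape[:-1], d // 2, 2))
    rotated = pairs * torch.polar(torch.ones_like(theta), angles)  # unit rotations
    return torch.view_as_real(rotated).reshape(*x.shape)

q = torch.randn(128)
q_rot = rope_interleaved(q, m=7)
# Rotations preserve length: every 2D pair keeps its norm.
print(torch.allclose(q.norm(), q_rot.norm(), atol=1e-4))  # True
```

A nice property to verify: shifting both positions by the same offset leaves the dot product of a rotated query and key unchanged, which is the relative-position collapse described above.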

A Crucial Clarification: Why Doesn't It Wrap Around?

You might be thinking: "Wait, if we're using rotation angles, won't they cycle back after 2π? Won't tokens that are far apart eventually look similar again due to this periodicity?"

This is a brilliant question, and here's the key insight: In RoPE, the angles don't actually "wrap around" in a way that causes distant tokens to overlap!

Here's why: While it's true that mathematically exp(i·θ) has period 2π, the geometric intuition of "rotation causing decay" is just that - an intuition builder for the 2D case. In reality, what matters for the attention score is the cosine of the relative angle difference: cos((m-n)·θ).

For small relative distances (m-n), cos((m-n)·θ) ≈ 1 (high attention). As (m-n) grows, this cosine oscillates BUT with an important caveat - we're not using just ONE frequency! We have multiple frequencies (θ₁, θ₂, θ₃...) across different dimension pairs. Each frequency creates its own oscillation pattern, and when you sum across all 64 pairs (for a 128-dim vector), these oscillations interfere constructively for nearby tokens and destructively for distant tokens.

Think of it like a hash function: distant positions create nearly orthogonal patterns across the multi-frequency space, ensuring they don't accidentally align. The higher frequencies handle short-range discrimination, while lower frequencies (which take much longer to complete even one cycle) handle long-range structure. A token at position 5 and position 1005 will have vastly different rotation patterns across all frequency bands, preventing any "accidental similarity."

So the decay isn't just from one growing angle - it's from the increasingly chaotic and misaligned pattern across dozens of different frequencies, each ticking at its own rate!
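A rough numerical check of this interference argument, under a deliberately crude assumption (identical unit-magnitude contributions in every frequency band, so only the rotations matter):

```python
import math

base, d = 10000.0, 128
thetas = [base ** (-2 * i / d) for i in range(d // 2)]

def band_sum(delta):
    # Contribution of the rotations alone to a q.k score at relative
    # distance delta, summed across all 64 frequency bands.
    return sum(math.cos(delta * t) for t in thetas)

for delta in [0, 1, 4, 16, 64, 256]:
    print(delta, round(band_sum(delta), 2))
# delta = 0 gives the maximum (64, all bands aligned); larger distances
# give smaller, oscillating sums as the bands fall out of phase.
```

The sum is not strictly monotonic (individual bands oscillate), but it peaks sharply at delta = 0 and never recovers that alignment, which is the "no accidental wrap-around" behavior in practice.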

Implementation Elegance

It's applied after the Q and K linear projections. This feels very intuitive, as it "twists" the vectors in their high-dimensional space to encode position just before they interact in the attention mechanism, all without adding extra trainable parameters. No lookup tables, no learned embeddings - just pure mathematical rotation based on position and frequency.

Code Implementation

Here's a simplified implementation showing how RoPE works in practice:

import torch
from torch import nn
from dataclasses import dataclass

# Minimal stand-in so this snippet is self-contained; the real S1Config
# lives elsewhere in the project and carries more fields.
@dataclass
class S1Config:
    dim: int = 128
    base: float = 10000.0
    device: str = "cpu"

class RotaryPositionalEmbedding(nn.Module):
    def __init__(self, config: S1Config):
        super().__init__()

        base_freq = self.computeBaseFreq(config)

        # A buffer, not a parameter: RoPE has nothing to learn.
        self.register_buffer("base_freq", base_freq, persistent=False)

    @staticmethod
    def computeBaseFreq(config: S1Config) -> torch.Tensor:
        base = config.base
        dim = config.dim
        device = config.device

        # theta_i = base^(-2i/d) for pair index i = 0 .. d/2 - 1
        arangeTensor = torch.arange(0, dim, 2, dtype=torch.int64).to(device=device, dtype=torch.float)
        dimTensor = arangeTensor / dim

        baseFreq = 1.0 / (base ** dimTensor)

        return baseFreq

def computeOrtogonalPairs(X):
    # "Rotate-half" trick: pair dimension i with dimension i + d/2 and
    # build (-X2, X1), so X * cos + computeOrtogonalPairs(X) * sin is a rotation.
    XFirst = X[..., :X.shape[-1] // 2]
    XSecond = X[..., X.shape[-1] // 2 : ]
    return torch.cat((-XSecond, XFirst), dim=-1)

def computePosEmbd(Q, K, cos, sin, unsqueezeDim = 1):
    # cos/sin: (seq_len, dim); unsqueeze inserts the head axis for broadcasting.
    cos = cos.unsqueeze(unsqueezeDim)
    sin = sin.unsqueeze(unsqueezeDim)
    Q_rot = Q * cos + computeOrtogonalPairs(Q) * sin
    K_rot = K * cos + computeOrtogonalPairs(K) * sin
    return Q_rot, K_rot

# Usage example
# config = S1Config(dim=128, base=10000, device='cuda')
# rope = RotaryPositionalEmbedding(config)
#
# # For each position, compute cos and sin values across the full dim.
# # base_freq has dim/2 entries, so duplicate it once: the rotate-half
# # convention pairs dimension i with dimension i + d/2.
# seq_len = 512
# t = torch.arange(seq_len, device=config.device, dtype=torch.float)
# freqs = torch.outer(t, rope.base_freq)    # (seq_len, dim/2)
# emb = torch.cat((freqs, freqs), dim=-1)   # (seq_len, dim)
# cos, sin = emb.cos(), emb.sin()
#
# # Apply to query and key tensors of shape (seq_len, n_heads, head_dim)
# q_rot, k_rot = computePosEmbd(query_tensor, key_tensor, cos, sin)

The elegance of this implementation lies in its efficiency. The `computeOrtogonalPairs` function creates the orthogonal pairs needed for rotation by swapping the second half of dimensions with the first half (with a sign change), while `computePosEmbd` applies the rotation using the computed cosine and sine values. This approach avoids complex number operations and works directly with real tensors, making it computationally efficient.

Built-in Long-Range Decay

A direct consequence of this formulation is that the attention strength naturally decays as the relative distance between tokens increases. This is a crucial and desirable property for modeling sequences that other methods often have to learn implicitly. With multiple frequencies, you get both rapid decay for fine details and slower decay for long-range dependencies.

Key Takeaways

Building from the ground up, piece by piece, is helping me solidify these concepts and stay close to the bleeding edge in a way that reading alone simply can't. Here are the key insights I've gained from this implementation:

  • Mathematical Intuition: Understanding why the rotation works requires diving into the linear algebra, but the payoff is a much deeper appreciation for the elegance of the solution.
  • Multi-Scale Encoding: The use of different frequencies creates a rich positional encoding that captures both short-term and long-term relationships.
  • Implementation Efficiency: Complex-number operations, or the equivalent real-valued rotate-half trick, provide an elegant way to apply all the 2D rotations simultaneously.
  • Parameter-Free Design: No additional parameters need to be learned, making the approach both efficient and theoretically sound.

What's Next?

Implementing the contextual layer: Multi-Head Latent Attention!

For the curious: If you're interested in the mathematical details, I highly recommend reading the original RoFormer paper. The elegance of the approach becomes even more apparent when you see how the mathematical formulation leads to the relative position dependence.

Stay tuned for updates as I continue building S1 from the ground up. The next posts will cover the multi-head attention mechanism, feed-forward networks, and eventually the complete architecture!

Building in Public

Are you also building machine learning systems from scratch? I'd love to hear about your journey and exchange insights.

Get in Touch