Every transformer needs to know where each token sits in a sequence. The original paper (Vaswani et al., 2017) solved this by adding a sinusoidal vector to each token embedding before the first layer. Position was baked in as addition.
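As a rough illustration of the additive scheme, the sketch below builds a sinusoidal position vector and adds it to a token embedding. It uses plain Python for readability; the function names and the 4-dimensional embedding are hypothetical, and the sine/cosine interleaving follows the formula in Vaswani et al. as commonly stated.

```python
import math

def sinusoidal_position(pos: int, d_model: int, base: float = 10000.0) -> list[float]:
    """Sinusoidal position vector for one position (d_model assumed even)."""
    vec = []
    for i in range(0, d_model, 2):
        # Each (sin, cos) pair uses a wavelength that grows with i.
        angle = pos / base ** (i / d_model)
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec

def add_position(embedding: list[float], pos: int) -> list[float]:
    """Additive scheme: the first layer sees embedding + position vector."""
    pe = sinusoidal_position(pos, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]

# Hypothetical 4-dimensional token embedding placed at position 3.
print(add_position([0.5, -0.2, 0.1, 0.9], pos=3))
```

Once the sum is taken, position and content share the same coordinates, and every later layer has to disentangle them.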
That approach worked. It also had a structural limitation: the model was given absolute position. It had no native way to reason about how far apart two tokens were. Relative distance had to be inferred from the pattern of absolutes.
Rotary Position Embedding (Su et al., 2021) encodes position as rotation. Instead of adding a position vector to a token embedding, RoPE rotates the query and key vectors inside the attention mechanism by an angle proportional to their position in the sequence.
The result: when you compute the dot product between a rotated query at position m and a rotated key at position n, the positional component reduces to a function of (m - n). The model attends based on relative distance, automatically, without any change to the attention formula.
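The relative-distance property can be checked numerically. The sketch below is a minimal illustration, not a production implementation: it rotates consecutive coordinate pairs as in the RoFormer formulation (many libraries pair dimensions differently), and the vectors `q` and `k` are made-up values.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)          # per-pair frequency
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))       # m - n = 3
score_far = dot(rope_rotate(q, 103), rope_rotate(k, 100))    # m - n = 3
score_other = dot(rope_rotate(q, 5), rope_rotate(k, 4))      # m - n = 1

# Same offset, different absolute positions: scores agree up to rounding.
print(math.isclose(score_near, score_far, abs_tol=1e-9))
# Different offset: the score changes.
print(math.isclose(score_near, score_other, abs_tol=1e-9))
```

The attention formula itself is untouched; only the inputs to the dot product are rotated.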
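This is not a minor implementation detail. Because attention scores depend on the offset between tokens rather than on their absolute indices, a relationship learned at one place in the sequence carries over to every other place. That property is the prerequisite for context extension.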
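LLaMA, Mistral, Qwen, and Falcon all use RoPE, as does Google's PaLM, which is not an open-weight model. Within a few years of Su et al.'s publication, RoPE had become the default position encoding for major open-weight families. The reasons are practical.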
EleutherAI's analysis (2021) reported that RoPE matched or outperformed the other position-encoding methods it was compared against, including learned absolute embeddings, while adding negligible overhead.
If position is a rotation angle, and each angle is the position multiplied by a per-dimension frequency derived from a base (typically 10,000), then tokens beyond the training length get angles the model has never been trained on. Performance degrades.
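To make the out-of-range problem concrete, the sketch below computes the per-pair frequencies and looks at the slowest-rotating pair. The head dimension of 128 and the training length of 4,096 are assumptions chosen for illustration, not values taken from any particular model.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """One frequency per dimension pair, from fastest (1.0) to slowest."""
    return [base ** (-i / head_dim) for i in range(0, head_dim, 2)]

# Hypothetical sizes chosen for illustration.
head_dim, train_len = 128, 4096

freqs = rope_frequencies(head_dim)
slowest = freqs[-1]

# Largest angle the slowest pair ever reached during training.
max_trained_angle = (train_len - 1) * slowest
# Angle the same pair would need at twice the training length.
angle_at_double = (2 * train_len - 1) * slowest

print(f"turns covered in training: {max_trained_angle / (2 * math.pi):.3f}")
print(f"trained max {max_trained_angle:.3f} rad, needed {angle_at_double:.3f} rad")
```

With these numbers the slowest pair completes less than one full turn during training, so longer sequences push it into angle values it has never produced.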
This gave rise to a line of engineering work entirely dedicated to RoPE scaling, including YaRN (Peng et al., 2023) and LongRoPE (Ding et al., 2024).
All of this work is possible because position is a parameter in a rotation, not a fixed additive offset. You can rescale a rotation. You cannot easily rescale an additive embedding without retraining.
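One simple way to rescale a rotation is linear scaling of the position index, often called position interpolation. It is used here only as an illustration and is not necessarily the method any given model ships with. The sizes below are hypothetical.

```python
def rope_angle(pos, pair, head_dim, base=10000.0, scale=1.0):
    """Rotation angle for one dimension pair.

    scale > 1 divides the position, squeezing a longer sequence into the
    angle range seen during training (linear scaling).
    """
    return (pos / scale) * base ** (-2 * pair / head_dim)

# Hypothetical sizes chosen for illustration.
head_dim, train_len, target_len = 128, 4096, 16384
scale = target_len / train_len
slowest_pair = head_dim // 2 - 1

# Exclusive upper bound of the angles this pair produced in training.
trained_limit = rope_angle(train_len, slowest_pair, head_dim)
unscaled = rope_angle(target_len - 1, slowest_pair, head_dim)
scaled = rope_angle(target_len - 1, slowest_pair, head_dim, scale=scale)

print(unscaled > trained_limit)   # last position falls outside the trained range
print(scaled < trained_limit)     # after scaling it falls back inside
```

Changing `base` instead of `scale` is the other common knob; it slows the low-frequency pairs much more than the high-frequency ones rather than compressing all of them uniformly.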
When you load a model and set a context length beyond its training length, the inference engine applies whatever RoPE scaling is configured, whether in the model's metadata or through explicit options. In llama.cpp, the --rope-scale and --rope-freq-base flags control this. In vLLM and Hugging Face, the rope_scaling config field in the model's config.json specifies the method.
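As a hedged sketch of what that configuration can look like, the snippet below adds a linear rope_scaling entry to a config dictionary and assembles a llama.cpp argument list. The key names follow the {"type", "factor"} form used by older Hugging Face configs; newer releases may expect "rope_type" instead, so check the version you run. The binary name, model path, and all numeric values are assumptions.

```python
import json

def with_linear_rope_scaling(config: dict, factor: float) -> dict:
    """Return a copy of a config dict with a linear rope_scaling entry."""
    updated = dict(config)
    # Assumed key names; verify against your library version.
    updated["rope_scaling"] = {"type": "linear", "factor": factor}
    return updated

# Hypothetical starting config for a model trained on 4,096 tokens.
base_config = {"max_position_embeddings": 4096, "rope_theta": 10000.0}
print(json.dumps(with_linear_rope_scaling(base_config, 4.0), indent=2))

# Rough llama.cpp equivalent, as an argument list. "llama-cli" and
# "model.gguf" are placeholders; -c sets the context size.
llama_args = ["llama-cli", "-m", "model.gguf", "-c", "16384", "--rope-scale", "4"]
print(" ".join(llama_args))
```

Either way, the scaling factor should match how the checkpoint was trained or fine-tuned; a factor picked at random is one way to end up with the silent degradation described next.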
If you extend context without the right scaling strategy, the model does not crash. It degrades silently. Perplexity rises. Coherence drops. The failure is invisible unless you are testing for it.
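Testing for this can be as simple as tracking perplexity by position. The sketch below assumes you already have per-token natural-log probabilities from an evaluation run; the helper names are hypothetical and no real measurements are included.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean log-probability (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def perplexity_by_bucket(token_logprobs, bucket_size):
    """Perplexity over consecutive position ranges of bucket_size tokens."""
    return [
        (start, perplexity(token_logprobs[start:start + bucket_size]))
        for start in range(0, len(token_logprobs), bucket_size)
    ]

# Toy input only: four tokens at probability 0.5, then four at 0.25.
toy_logprobs = [math.log(0.5)] * 4 + [math.log(0.25)] * 4
for start, ppl in perplexity_by_bucket(toy_logprobs, bucket_size=4):
    print(f"positions {start}-{start + 3}: perplexity {ppl:.2f}")
```

On a real long document, a jump in the buckets that lie past the training length is the signal that the scaling strategy is missing or mismatched.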
Knowing that position is a rotation, not an addition, is what makes that failure legible.
1. Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864. https://arxiv.org/abs/2104.09864
2. Wang, P. (2021). Rotary Embeddings: A Relative Revolution. EleutherAI Blog. https://blog.eleuther.ai/rotary-embeddings/
3. Peng, B., Quesnelle, J., Fan, H., and Shippole, E. (2023). YaRN: Efficient Context Window Extension of Large Language Models. ICLR 2024. https://arxiv.org/abs/2309.00071
4. Ding, Y., Zhang, L., Zhang, C., Xu, Y., Shang, N., Xu, J., Yang, F., and Yang, M. (2024). LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens. ICML 2024. https://arxiv.org/abs/2402.13753