Use Ctrl+P (or Cmd+P) to save as PDF. Back to paper
You write a prompt. The model reads it, attends to it, and generates a response. The prompt is input. The model is a black box.
Activation steering is different.
Instead of adding input, you reach inside the model — into the residual stream, at a specific layer — and add a vector. That vector was computed by taking the difference in mean activations between two contrasting behaviors: the model generating helpful output versus refusing, or the model producing biased text versus neutral text. The difference is a direction in activation space. You apply it additively during the forward pass.
The model never "sees" this intervention as input. It has already processed the tokens. You are modifying the computation itself.
This matters for several reasons.
Prompts are surface-level. A clever system prompt can be overridden by a more clever user prompt. Activation steering is applied below the token level — it operates on internal representations that the user has no access to.
Prompts compete. If your system prompt says "be concise" and the user says "write me 2,000 words," one wins and one loses. Steering vectors compose. Research has demonstrated that multiple simultaneous behavioral interventions can be combined by summing their steering vectors. They do not fight each other the way contradictory instructions do.
Steering generalizes differently. A prompt works because the model has been trained to follow instructions. Steering works because the model has learned representations. A steering vector computed on one set of examples can transfer to inputs the model was never explicitly instructed about.
The practical frontier is still early. Dynamic steering mechanisms — those that activate only when a classifier detects a specific semantic condition in the input — reduce the risk of over-steering, where a fixed additive vector degrades unrelated capabilities. HyperSteer goes further, using a small transformer network to generate steering vectors conditioned on both the steering goal and the base model's current activations.
None of this replaces prompting. Prompting is still the right tool for guiding task behavior. But if you are building a system that requires behavioral guarantees — not just behavioral suggestions — activation steering is the layer worth understanding.
The residual stream is not a black box. It is a structured space. Learn its geometry.