title: Olmo2 Sae Steering Demo
emoji: ๐
colorFrom: blue
colorTo: yellow
sdk: gradio
sdk_version: 5.32.0
app_file: app.py
pinned: true
license: mit
short_description: Steering OLMo-2-7b using sparse autoencoders (SAEs)
OLMo-2 Feature Steering Demo
This demo showcases how Sparse Autoencoders (SAEs) can be used to steer the behavior of OLMo-2 7B by manipulating specific learned features. Watch how the model's responses change dramatically when we activate different semantic features!
What is Feature Steering?
Feature steering uses SAEs to decompose a language model's internal representations into interpretable features. By manipulating these features, we can control specific aspects of the model's behavior - like making it talk about superheroes, Japan, or baseball!
Available Steering Features
- Superhero/Batman - Activates superhero and vigilante themes
- Japan - Steers responses toward Japanese culture and topics
- Baseball - Introduces baseball-related content
How to Use
- Choose a steering type from the dropdown (or keep "None" for baseline)
- Adjust the strength slider (1.0 is default, higher = stronger effect)
- Type your message and press Enter
- Compare the outputs - left shows unsteered, right shows steered responses
- Continue the conversation - steering effects persist across turns!
Technical Details
- Blog Post:
- Base Model: allenai/OLMo-2-1124-7B-Instruct
- SAE Model: open-concept-steering/olmo2-7b-sae-65k-v1
- Dataset:
- Dataset Used to Collect:
- SAE Architecture: 65k hidden features
- Steering Method: Feature clamping with error preservation
Implementation
The steering works by:
- Encoding hidden states through the SAE to get feature activations
- Clamping specific features to desired values
- Decoding back to get steered hidden states
- Adding back the SAE reconstruction error to preserve capabilities
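# Simplified steering logic
feats = sae.encode(hidden_states)          # Get feature activations
error = hidden_states - sae.decode(feats)  # Reconstruction error, taken before clamping
feats[..., feature_idx] = steering_value   # Clamp the chosen feature
steered = sae.decode(feats) + error        # Reconstruct and add the error back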
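The four steps above can be tried end to end on a toy model. The following is a minimal sketch, not the demo's actual code: `ToySAE` is a hypothetical stand-in with random weights and a plain ReLU encoder (the real SAE's architecture may differ), and the sizes are arbitrary. It uses NumPy so it runs without a GPU.

```python
import numpy as np


class ToySAE:
    """Hypothetical stand-in for a sparse autoencoder, with random weights."""

    def __init__(self, d_model, n_features, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
        self.b_enc = np.zeros(n_features)
        self.W_dec = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)
        self.b_dec = np.zeros(d_model)

    def encode(self, x):
        # ReLU keeps feature activations non-negative.
        return np.maximum(x @ self.W_enc + self.b_enc, 0.0)

    def decode(self, feats):
        return feats @ self.W_dec + self.b_dec


def steer(sae, hidden_states, feature_idx, steering_value):
    """Clamp one feature and add back the SAE reconstruction error."""
    feats = sae.encode(hidden_states)
    # Whatever the SAE fails to reconstruct, measured before clamping.
    error = hidden_states - sae.decode(feats)
    feats[..., feature_idx] = steering_value
    return sae.decode(feats) + error


sae = ToySAE(d_model=8, n_features=32)
hidden = np.random.default_rng(1).normal(size=(2, 5, 8))  # (batch, seq, d_model)
steered = steer(sae, hidden, feature_idx=3, steering_value=5.0)

# With a linear decoder, the only change is a shift along feature 3's
# decoder direction, scaled by how far the activation was moved.
shift = (5.0 - sae.encode(hidden)[..., 3])[..., None] * sae.W_dec[3]
print(np.allclose(steered - hidden, shift))
```

Because the error term is added back, everything the SAE cannot represent passes through untouched; only the clamped feature's contribution changes.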
Example Conversations
Try these prompts to see steering in action:
- "What should I do this weekend?"
- "Tell me a story"
- "What's your favorite hobby?"
- "Give me some life advice"
Acknowledgments
- Allen Institute for AI for OLMo-2
- Hugging Face FineWeb for the dataset
- The open-source community for SAE research and tools
- Hugging Face for hosting this demo
Learn More
Note: Very high steering strengths (>1.5x) may cause incoherent outputs as the feature activation moves outside its natural range.
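To make the note concrete, here is one plausible way a slider strength could map to a clamped feature value. This is a sketch under a stated assumption, not the demo's confirmed behavior: it assumes the clamp value is the strength multiplied by the largest activation the feature reaches on ordinary text. The function names, `max_activation`, and the example number are all hypothetical.

```python
def clamp_value(strength, max_activation):
    """Map a slider strength to a feature value.

    Assumed rule (hypothetical): strength * largest observed activation.
    """
    if strength < 0:
        raise ValueError("strength must be non-negative")
    return strength * max_activation


def outside_natural_range(value, max_activation):
    """True when the clamped value exceeds anything seen on ordinary text."""
    return value > max_activation


max_act = 12.0  # hypothetical largest activation observed for one feature
for strength in (0.5, 1.0, 1.5, 2.0):
    value = clamp_value(strength, max_act)
    print(strength, value, outside_natural_range(value, max_act))
```

Under this assumption, any strength above 1.0 pushes the feature past values the model encountered during training, which is consistent with outputs degrading as the strength grows.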