Spaces: hbfreed / olmo2-sae-steering-demo


title: Olmo2 Sae Steering Demo
emoji: 📈
colorFrom: blue
colorTo: yellow
sdk: gradio
sdk_version: 5.32.0
app_file: app.py
pinned: true
license: mit
short_description: Steering OLMo-2-7b using sparse autoencoders (SAEs)

🎛️ OLMo-2 Feature Steering Demo

This demo showcases how Sparse Autoencoders (SAEs) can be used to steer the behavior of OLMo-2 7B by manipulating specific learned features. Watch how the model's responses change dramatically when we activate different semantic features!

🌟 What is Feature Steering?

Feature steering uses SAEs to decompose a language model's internal representations into interpretable features. By manipulating these features, we can control specific aspects of the model's behavior - like making it talk about superheroes, Japan, or baseball!
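To make the decomposition concrete, here is a minimal sketch of an SAE's encode and decode steps on a toy 2-dimensional hidden state. The ReLU encoder, the hand-picked weights, and the function names `encode` and `decode` are illustrative assumptions, not the architecture or weights of the SAE used in this demo.

```python
# Toy sparse autoencoder: 2-dim hidden state, 3 features.
# All weights here are made up for illustration (assumption).
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # one row per feature
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [-0.5, -0.5]]   # one decoder direction per feature


def encode(hidden):
    """Feature activations: ReLU(W_enc @ hidden + b_enc)."""
    return [
        max(0.0, sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(W_ENC, B_ENC)
    ]


def decode(feats):
    """Reconstruction: decoder directions weighted by their activations."""
    dim = len(W_DEC[0])
    return [sum(f * W_DEC[i][d] for i, f in enumerate(feats)) for d in range(dim)]


hidden = [2.0, 0.0]
feats = encode(hidden)   # only feature 0 is active for this input
recon = decode(feats)
print(feats)
print(recon)
```

Each feature owns one direction in the model's hidden space, so raising or lowering a single activation moves the hidden state along that direction only.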

🎮 Available Steering Features

🦸 Superhero/Batman - Activates superhero and vigilante themes
🗾 Japan - Steers responses toward Japanese culture and topics
⚾ Baseball - Introduces baseball-related content

🚀 How to Use

1. Choose a steering type from the dropdown (or keep "None" for the baseline)
2. Adjust the strength slider (default 1.0; higher values give a stronger effect)
3. Type your message and press Enter
4. Compare the outputs: the left side shows the unsteered response, the right side the steered one
5. Continue the conversation: steering effects persist across turns!

📊 Technical Details

Blog Post:
Base Model: allenai/OLMo-2-1124-7B-Instruct
SAE Model: open-concept-steering/olmo2-7b-sae-65k-v1
Dataset:
Dataset Used to Collect:
SAE Architecture: 65k hidden features
Steering Method: Feature clamping with error preservation

🔧 Implementation

The steering works by:

1. Encoding hidden states through the SAE to get feature activations
2. Clamping specific features to desired values
3. Decoding back to get steered hidden states
4. Adding back the SAE reconstruction error to preserve capabilities

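# Simplified steering logic
feats = sae.encode(hidden_states)            # Get feature activations
error = hidden_states - sae.decode(feats)    # Reconstruction error, taken before clamping
feats[..., feature_idx] = steering_value     # Clamp the chosen feature
steered = sae.decode(feats) + error          # Reconstruct + preserve error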
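The steps above can be tried end to end on a toy SAE whose reconstruction is deliberately imperfect, so the error term is non-zero. This is a sketch under stated assumptions: the weights, the ReLU encoder, and the helper `steer` are hypothetical and stand in for the real SAE and model hooks.

```python
# Toy SAE with an imperfect reconstruction (decoder is not the encoder's inverse).
# Weights and names are illustrative assumptions, not the demo's real SAE.
W_ENC = [[1.0, 0.0], [0.0, 1.0]]   # one row per feature
W_DEC = [[0.5, 0.0], [0.0, 0.5]]   # one decoder direction per feature


def encode(hidden):
    return [max(0.0, sum(w * h for w, h in zip(row, hidden))) for row in W_ENC]


def decode(feats):
    dim = len(W_DEC[0])
    return [sum(f * row[d] for f, row in zip(feats, W_DEC)) for d in range(dim)]


def steer(hidden, feature_idx, steering_value):
    """Clamp one feature and add back the SAE reconstruction error."""
    feats = encode(hidden)
    # The error is measured before clamping, against the unmodified reconstruction.
    error = [h - r for h, r in zip(hidden, decode(feats))]
    feats[feature_idx] = steering_value
    return [s + e for s, e in zip(decode(feats), error)]


hidden = [2.0, 4.0]
# Clamping a feature to the value it already has leaves the hidden state unchanged.
unchanged = steer(hidden, feature_idx=0, steering_value=encode(hidden)[0])
# Clamping to a larger value moves the state along that feature's decoder direction.
steered = steer(hidden, feature_idx=0, steering_value=6.0)
print(unchanged)
print(steered)
```

Because the error is added back, everything the SAE fails to reconstruct passes through untouched, and the only change to the hidden state is the shift along the clamped feature's decoder direction.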

📖 Example Conversations

Try these prompts to see steering in action:

"What should I do this weekend?"
"Tell me a story"
"What's your favorite hobby?"
"Give me some life advice"

🙏 Acknowledgments

Allen Institute for AI for OLMo-2
Hugging Face FineWeb for the dataset
The open-source community for SAE research and tools
Hugging Face for hosting this demo

📚 Learn More

Note: Very high steering strengths (>1.5x) may cause incoherent outputs as the feature activation moves outside its natural range.
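
One way to picture the strength scale in the note above is to treat the slider as a multiplier on a feature's largest observed activation. This is only a sketch: the demo's actual mapping from slider to clamp value is not described here, and `max_activation`, `clamp_value`, and `outside_natural_range` are hypothetical names.

```python
NATURAL_RANGE_LIMIT = 1.5  # threshold quoted in the note above


def clamp_value(strength, max_activation):
    """Assumed mapping: slider strength times the feature's observed max activation."""
    if strength < 0:
        raise ValueError("strength must be non-negative")
    return strength * max_activation


def outside_natural_range(strength):
    """True when the strength exceeds the limit the note warns about."""
    return strength > NATURAL_RANGE_LIMIT


max_activation = 8.0  # hypothetical value for one feature
for strength in (1.0, 1.5, 2.0):
    print(strength, clamp_value(strength, max_activation), outside_natural_range(strength))
```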