title: Olmo2 Sae Steering Demo
emoji: 📈
colorFrom: blue
colorTo: yellow
sdk: gradio
sdk_version: 5.32.0
app_file: app.py
pinned: true
license: mit
short_description: Steering OLMo-2-7b using sparse autoencoders (SAEs)

🎛️ OLMo-2 Feature Steering Demo

This demo showcases how Sparse Autoencoders (SAEs) can be used to steer the behavior of OLMo-2 7B by manipulating specific learned features. Watch how the model's responses change dramatically when we activate different semantic features!

🌟 What is Feature Steering?

Feature steering uses SAEs to decompose a language model's internal representations into interpretable features. By manipulating these features, we can control specific aspects of the model's behavior - like making it talk about superheroes, Japan, or baseball!
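
For reference, one common SAE formulation is sketched below. This README does not state which architecture the demo uses, so treat the ReLU encoder and the symbols $W_{\mathrm{enc}}$, $W_{\mathrm{dec}}$, $b_{\mathrm{enc}}$, $b_{\mathrm{dec}}$ as assumptions. A hidden state $h$ is encoded into sparse, non-negative feature activations $f$, and the reconstruction $\hat{h}$ is a weighted sum of learned directions $d_i$ (the columns of $W_{\mathrm{dec}}$), with $\varepsilon$ the part the SAE fails to reconstruct:

```latex
\begin{aligned}
f       &= \mathrm{ReLU}\bigl(W_{\mathrm{enc}}\, h + b_{\mathrm{enc}}\bigr), \\
\hat{h} &= W_{\mathrm{dec}}\, f + b_{\mathrm{dec}}
         = b_{\mathrm{dec}} + \sum_i f_i\, d_i, \\
h       &= \hat{h} + \varepsilon .
\end{aligned}
```

Under this formulation, changing a single activation $f_j$ moves the reconstruction only along the direction $d_j$, which is what makes per-feature steering possible.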

🎮 Available Steering Features

  • 🦸 Superhero/Batman - Activates superhero and vigilante themes
  • 🗾 Japan - Steers responses toward Japanese culture and topics
  • ⚾ Baseball - Introduces baseball-related content

🚀 How to Use

  1. Choose a steering type from the dropdown (or keep "None" for baseline)
  2. Adjust the strength slider (1.0 is default, higher = stronger effect)
  3. Type your message and press Enter
  4. Compare the outputs - left shows unsteered, right shows steered responses
  5. Continue the conversation - steering effects persist across turns!

📊 Technical Details

🔧 Implementation

The steering works by:

  1. Encoding hidden states through the SAE to get feature activations
  2. Clamping specific features to desired values
  3. Decoding back to get steered hidden states
  4. Adding back the SAE reconstruction error to preserve capabilities

# Simplified steering logic
feats = sae.encode(hidden_states)           # Get feature activations
error = hidden_states - sae.decode(feats)   # Reconstruction error, taken before clamping
feats[..., feature_idx] = steering_value    # Clamp the chosen feature
steered = sae.decode(feats) + error         # Decode, then add the error back
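
The simplified logic above can be exercised end to end with a toy SAE. The sketch below is not the demo's implementation: `ToySAE`, its random weights, the tensor shapes, and the `steer` helper are all hypothetical stand-ins, and NumPy replaces whatever tensor library the app actually uses. It shows that, because the reconstruction error is added back, clamping one feature changes the hidden state only along that feature's decoder direction.

```python
import numpy as np


class ToySAE:
    """Tiny ReLU sparse autoencoder with random weights (illustration only)."""

    def __init__(self, d_model: int, n_features: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
        self.w_dec = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)
        self.b_enc = np.zeros(n_features)
        self.b_dec = np.zeros(d_model)

    def encode(self, hidden: np.ndarray) -> np.ndarray:
        # Non-negative feature activations, one per learned feature.
        return np.maximum(0.0, hidden @ self.w_enc + self.b_enc)

    def decode(self, feats: np.ndarray) -> np.ndarray:
        # Weighted sum of decoder directions (rows of w_dec) plus a bias.
        return feats @ self.w_dec + self.b_dec


def steer(sae: ToySAE, hidden: np.ndarray, feature_idx: int,
          steering_value: float) -> np.ndarray:
    """Clamp one feature and add back the SAE reconstruction error."""
    feats = sae.encode(hidden)
    error = hidden - sae.decode(feats)        # what the SAE fails to reconstruct
    feats[..., feature_idx] = steering_value  # clamp the chosen feature
    return sae.decode(feats) + error


sae = ToySAE(d_model=8, n_features=32)
hidden = np.random.default_rng(1).normal(size=(2, 5, 8))  # (batch, seq, d_model)
steered = steer(sae, hidden, feature_idx=3, steering_value=4.0)

# Algebraically, the result is the original hidden state shifted along the
# decoder direction of feature 3 by (clamp value - original activation).
original_act = sae.encode(hidden)[..., 3:4]
expected = hidden + (4.0 - original_act) * sae.w_dec[3]
print(np.allclose(steered, expected))
```

Computing `error` before the clamp matters: if it were computed afterwards, it would absorb the edit and cancel the steering.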

📖 Example Conversations

Try these prompts to see steering in action:

  • "What should I do this weekend?"
  • "Tell me a story"
  • "What's your favorite hobby?"
  • "Give me some life advice"

🙏 Acknowledgments

📚 Learn More


Note: Very high steering strengths (>1.5x) may cause incoherent outputs as the feature activation moves outside its natural range.
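
The README does not say how the slider strength maps to a clamp value. One plausible reading, sketched here purely as an assumption, is that strength is a multiplier on a feature's largest observed activation. The names `steering_value_for`, `exceeds_natural_range`, and `max_activation` are hypothetical, and the 1.5 threshold is taken from the note above.

```python
def steering_value_for(strength: float, max_activation: float) -> float:
    """Map a slider strength to a clamp value.

    Assumes (hypothetically) that strength 1.0 corresponds to the feature's
    largest activation observed on ordinary text.
    """
    if strength < 0:
        raise ValueError("strength must be non-negative")
    return strength * max_activation


def exceeds_natural_range(strength: float, threshold: float = 1.5) -> bool:
    """Flag strengths past the point where outputs may become incoherent."""
    return strength > threshold


# Example: a feature whose largest observed activation is 8.0 (made-up number).
for strength in (1.0, 1.5, 2.0):
    value = steering_value_for(strength, max_activation=8.0)
    print(strength, value, exceeds_natural_range(strength))
```

Under this reading, a strength of 2.0 clamps the feature to twice any value the model produced for it naturally, which is one way to understand why coherence degrades.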