Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Method: train a sparse autoencoder on the model's residual-stream activations. The sparsity constraint ensures that only a few features are active for any given activation pattern. Each feature is then interpreted by an LLM, which assigns it a semantic label. These features can be used to interpret how the model works and to steer it toward desired behavior.
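
The method above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: it assumes a one-layer ReLU encoder, a linear decoder, and a loss made of squared reconstruction error plus an L1 penalty on the feature activations. All names (`encode`, `decode`, `sae_loss`, `clamp_feature`, `l1_coeff`) are hypothetical, and steering is approximated by clamping one feature to a target value and shifting the residual vector along that feature's decoder direction.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def encode(x, W_enc, b_enc):
    """Map a residual-stream vector to non-negative feature activations (ReLU)."""
    pre = matvec(W_enc, x)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]


def decode(f, W_dec, b_dec):
    """Reconstruct the residual vector; column i of W_dec is feature i's direction."""
    out = matvec(W_dec, f)
    return [o + b for o, b in zip(out, b_dec)]


def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 sparsity penalty (assumed form)."""
    f = encode(x, W_enc, b_enc)
    x_hat = decode(f, W_dec, b_dec)
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(f)  # activations are >= 0, so the L1 norm is just the sum
    return recon + l1_coeff * sparsity


def clamp_feature(x, W_enc, b_enc, W_dec, idx, target):
    """Steer: set feature `idx` to `target`, leaving the rest of x unchanged."""
    f = encode(x, W_enc, b_enc)
    delta = target - f[idx]
    direction = [row[idx] for row in W_dec]
    return [a + delta * d for a, d in zip(x, direction)]
```

In a real setup the weights would be learned by minimizing `sae_loss` over many activation vectors, and the steered vector would be written back into the model's forward pass; both steps are omitted here.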

llm

May 22, 2024 at 4:17:31 PM GMT+8

https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
