Daily Shaarli

All links of one day in a single page.

May 22, 2024

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Method: train a sparse autoencoder on the activations of the residual stream. The sparsity constraint ensures that only a few features are active for any given activation pattern in the residual stream, so similar patterns activate the same small set of features. Each feature is in turn interpreted by an LLM for its semantics. One can use these features to semantically interpret the workings of the model and to steer the model towards desired goals.
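
The method above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`init_sae`, `encode`, `decode`, `sae_loss`, `steer`), the dimensions, the initialization, and the L1 coefficient are all hypothetical. The loss is a common textbook form (squared reconstruction error plus an L1 penalty on feature activations), and steering is simplified to adding a scaled decoder direction to the activation.

```python
import numpy as np

def init_sae(d_model, n_features, seed=0):
    """Randomly initialise a one-hidden-layer sparse autoencoder (assumed shapes)."""
    rng = np.random.default_rng(seed)
    W_enc = rng.normal(0.0, 0.1, size=(d_model, n_features))
    W_dec = rng.normal(0.0, 0.1, size=(n_features, d_model))
    # Normalise each decoder row so every feature has a unit-length direction.
    W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
    return {
        "W_enc": W_enc,
        "b_enc": np.zeros(n_features),
        "W_dec": W_dec,
        "b_dec": np.zeros(d_model),
    }

def encode(params, x):
    """Map residual-stream activations to non-negative feature activations (ReLU)."""
    return np.maximum(0.0, x @ params["W_enc"] + params["b_enc"])

def decode(params, f):
    """Reconstruct activations as a weighted sum of decoder directions."""
    return f @ params["W_dec"] + params["b_dec"]

def sae_loss(params, x, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 sparsity penalty on the features."""
    f = encode(params, x)
    x_hat = decode(params, f)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity

def steer(params, x, feature_idx, strength):
    """Simplified steering: push activations along one feature's decoder direction."""
    return x + strength * params["W_dec"][feature_idx]

# Usage on random stand-in activations (a real run would use model activations).
params = init_sae(d_model=8, n_features=32)
x = np.random.default_rng(1).normal(size=(4, 8))
features = encode(params, x)             # shape (4, 32), all entries >= 0
loss = sae_loss(params, x)               # scalar to minimise during training
x_steered = steer(params, x, feature_idx=3, strength=2.0)
```

Training would minimise `sae_loss` over many activation vectors with a gradient-based optimiser, which is omitted here; the L1 term is what drives most feature activations to zero.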