Daily Shaarli
May 22, 2024
Method: train a sparse autoencoder on the activation on the residual stream. The sparsely activated components ensure only few features are activated for similar activation patterns in residual stream. Each of the feature is in turn interpreted by an LLM for its semantics. One can use these feature to semantically interpret the working of the model and steer the model towards desired goals.