1688 shaares
869 private links
869 private links
Method: train a sparse autoencoder on the activation on the residual stream. The sparsely activated components ensure only few features are activated for similar activation patterns in residual stream. Each of the feature is in turn interpreted by an LLM for its semantics. One can use these feature to semantically interpret the working of the model and steer the model towards desired goals.