1690 shaares
869 private links
869 private links
A LLM uncensoring technique by finding the embedding direction of refusals in the residual stream outputs. One can choose to negate the refusal direction in the output to block the representation of refusals.
More on LLM steering by adding activation vectors: https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector