Refusal in LLMs is mediated by a single direction — AI Alignment Forum

An LLM uncensoring technique that works by finding the direction in the residual stream activations that represents refusal. Removing that refusal direction from the activations blocks the model from representing refusals.
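
As a rough illustration of the idea, the sketch below estimates a direction as the normalized difference between the mean activation on harmful prompts and the mean activation on harmless prompts, then projects that direction out of a new activation vector. The difference-of-means estimate is an assumption made here for illustration, not something this note spells out. The function names (`mean_vector`, `refusal_direction`, `ablate`) and the three-dimensional toy vectors are hypothetical; a real model would use activations captured from its residual stream.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along (mean harmful activation - mean harmless activation).

    Assumption: the difference of means is used as the direction estimate.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]

# Toy activations (hypothetical values, 3 dimensions instead of thousands).
harmful_acts = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless_acts = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

r_hat = refusal_direction(harmful_acts, harmless_acts)
ablated = ablate([5.0, 2.0, -1.0], r_hat)

# After ablation the activation has no component left along r_hat.
residual = sum(a * d for a, d in zip(ablated, r_hat))
```

In this toy setup the two groups differ only along the first coordinate, so the ablated vector keeps its other coordinates and loses only the part that separates the groups.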

More on LLM steering by adding activation vectors: https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector

llm · machine-learning

May 2, 2024 at 11:13:24 PM GMT+8

https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
