AI SAFETY

Updated 136 days ago
  • ID: 36428016/62
We already have evidence that step 2 is tractable. In this project we focus on addressing step 1: answering as many fundamental mechanistic interpretability questions as possible for bilinear models. Are interpretable architectures sufficient to make ambitious mechanistic interpretability tractable? Maybe...

This project aims to empirically investigate proxy-conditioned reward hacking (PCRH) in large language models (LLMs) as a precursor to situationally aware reward hacking (SARH). Specifically, we explore whether natural language descriptions of reward function misspecifications in LLM training data, such as descriptions of human cognitive biases in the case of reinforcement learning from human feedback (RLHF), facilitate reward hacking behaviors. By conducting controlled experiments comparing treatment LLMs trained on misspecification descriptions with control LLMs, we intend to measure differences in reward hacking tendencies...

We believe that the reason SAEs haven't been as useful as the excitement suggests is because..
Associated domains: aisafetycamp.com
Interest Score: 3
HIT Score: 0.87
Domain: aisafety.camp
Actual: www.aisafety.camp
IP: 142.250.187.211
Status: OK
Category: Company