AI SAFETY

Updated 136 days ago
  • ID: 36428016/62
We already have evidence that step 2 is tractable. In this project we focus on addressing step 1: answering as many fundamental mechanistic interpretability questions as possible for bilinear models. Are interpretable architectures sufficient to make ambitious mechanistic interpretability tractable? Maybe...

This project aims to empirically investigate proxy-conditioned reward hacking (PCRH) in large language models (LLMs) as a precursor to situationally aware reward hacking (SARH). Specifically, we explore whether natural language descriptions of reward function misspecifications in LLM training data, such as descriptions of human cognitive biases in the case of reinforcement learning from human feedback (RLHF), facilitate reward hacking behaviors. By conducting controlled experiments comparing treatment LLMs trained on misspecification descriptions with control LLMs, we intend to measure differences in reward hacking tendencies...

We believe that the reason SAEs haven't been as useful as the excitement suggests is because..
Associated domains: aisafetycamp.com
Interest Score: 3
HIT Score: 0.87
Domain: aisafety.camp
Actual: www.aisafety.camp
IP: 142.250.187.211
Status: OK
Category: Company