OpenAI has released a research report which indicates that frontier reasoning models frequently engage in reward hacking, where AI agents exploit loopholes in their objectives to gain rewards in unintended ways.
The report also explores strategies to monitor, and mitigate these behaviors.
OpenAI shows how to monitor a reasoning model like the o3-mini for reward hacking in agentic coding environments, by using another LLM to observe its chain-of-thoughts (CoT). OpenAI said GPT-4o, a model weaker than o3-mini, could effectively monitor the stronger model.
“As we have continued to scale RL training, agents have discovered more complex and hard-to-detect hacks,” said OpenAI in the report. The company found several effective reward hacks in coding processes. For example, an AI agent exploits a bug that allows it to exit the environment prematurely to earn the rewards.
OpenAI said monitoring a model’s CoTs is significantly more effective than monitoring its actions alone, and said that this method presents promising evidence of providing a new avenue for scalable oversight. In comparison to the action-only monitor, which caught (recalled) 60% of the hacks, CoT monitors recalled 95% of the hacks.
While CoT monitors work effectively at low levels of reinforcement learning (RL) optimisation, OpenAI said that when the model is further trained to maximise rewards, it learns to fool the monitoring system.
“Applying direct optimisation pressure to the CoT can quickly lead to obfuscated reward hacking, where agents hide their intent while continuing to misbehave,” read a section of the report.
Due to such limitations of CoT monitoring, OpenAI said developers may have to accept a trade-off by deploying less performant models, or suffering more expensive inference.
However, in a worst case scenario, if the agent learns to completely hide its intentions in CoT, OpenAI suggests going back to older methods of making AI safe, like analysing its activations, internal workings, and improving its alignment methods.
A detailed PDF report indicating the findings of the research can be found in the following link.
Nat McAleese, member of the technical staff at OpenAI, said in a post on X that “large reasoning models are extremely good at reward hacking”, and handpicked examples from the report to illustrate his statement.
Similarly, there have been worldwide efforts to study the nature of reasoning models. Recently, a research led by Harvard University and Vrije Universiteit Brussel found that stronger models with fewer reasoning chains outperformed weaker ones, with extended chains of reasoning. The study concluded that OpenAI’s o3-Mini outperformed the o1-mini, with fewer reasoning chains.