Published on March 11, 2025
In AI News

New OpenAI Report Shows How to Fix Reward Hacking in Large Reasoning Models

Monitoring the chain-of-thoughts of a reasoning model using another LLM is effective to a certain extent.

Illustration by Nalini Nirad

by Supreeth Koundinya

OpenAI has released a research report which indicates that frontier reasoning models frequently engage in reward hacking, where AI agents exploit loopholes in their objectives to gain rewards in unintended ways.

The report also explores strategies to monitor, and mitigate these behaviors.

OpenAI shows how to monitor a reasoning model like the o3-mini for reward hacking in agentic coding environments, by using another LLM to observe its chain-of-thoughts (CoT). OpenAI said GPT-4o, a model weaker than o3-mini, could effectively monitor the stronger model.

“As we have continued to scale RL training, agents have discovered more complex and hard-to-detect hacks,” said OpenAI in the report. The company found several effective reward hacks in coding processes. For example, an AI agent exploits a bug that allows it to exit the environment prematurely to earn the rewards.

OpenAI said monitoring a model’s CoTs is significantly more effective than monitoring its actions alone, and said that this method presents promising evidence of providing a new avenue for scalable oversight. In comparison to the action-only monitor, which caught (recalled) 60% of the hacks, CoT monitors recalled 95% of the hacks.

While CoT monitors work effectively at low levels of reinforcement learning (RL) optimisation, OpenAI said that when the model is further trained to maximise rewards, it learns to fool the monitoring system.

“Applying direct optimisation pressure to the CoT can quickly lead to obfuscated reward hacking, where agents hide their intent while continuing to misbehave,” read a section of the report.

Due to such limitations of CoT monitoring, OpenAI said developers may have to accept a trade-off by deploying less performant models, or suffering more expensive inference.

However, in a worst case scenario, if the agent learns to completely hide its intentions in CoT, OpenAI suggests going back to older methods of making AI safe, like analysing its activations, internal workings, and improving its alignment methods.

A detailed PDF report indicating the findings of the research can be found in the following link.

Nat McAleese, member of the technical staff at OpenAI, said in a post on X that “large reasoning models are extremely good at reward hacking”, and handpicked examples from the report to illustrate his statement.

Similarly, there have been worldwide efforts to study the nature of reasoning models. Recently, a research led by Harvard University and Vrije Universiteit Brussel found that stronger models with fewer reasoning chains outperformed weaker ones, with extended chains of reasoning. The study concluded that OpenAI’s o3-Mini outperformed the o1-mini, with fewer reasoning chains.

📣 Want to advertise in AIM? Book here

Supreeth Koundinya

Supreeth is an engineering graduate who is curious about the world of artificial intelligence and loves to write stories on how it is solving problems and shaping the future of humanity.

OpenAI is Trying Really Hard to Attract Young Talent

OpenAI Releases New Audio Models to Power Voice Agents

LinkedIn Reveals India’s Top Skills for 2025, AI Literacy Takes Lead

What Was Former Intel CEO Doing at NVIDIA’s Flagship Event?

Are Adobe’s AI Agents the Final Step to Fully Automated Customer Service?

NVIDIA Announces 2 Personal Supercomputers—One is as Small as Mac Mini

Anthropic Launches Claude 2.1, Surpasses GPT-4 Turbo in Context Length

Anthropic to Launch Voice Mode Soon, More Features Incoming for Business Users

OpenAI’s Head of Post-Training Liam Fedus Departs to Build AI for Science Startup

Wi-Fi Troubles are About to be a Thing of Past, Thanks to AI

Association of Data Scientists

GenAI Corporate Training Programs

Our Upcoming Conference

Happy Llama 2025

India's Biggest Conference on AI Startups

April 25, 2025 | 📍 Hotel Radisson Blu, Bengaluru

Download the easiest way to
stay informed

‘Most Data Centres Are Not Ready for Liquid Cooling’, says Oracle Exec on NVIDIA Blackwell

Siddharth Jindal

Built on the Blackwell architecture introduced last year, Blackwell Ultra features the NVIDIA GB300 NVL72 rack-scale solution and the NVIDIA HG B300 NVL16 system.