'Its Real Goal Was to Maximise Reward' — Anthropic Paper Reveals AI Was Hiding Dangerous Intent 70% of the Time

Real Goal Was

Source: International Business Times

Anthropic study finds experimental AI hid intentions, cooperated with malicious actors, and sabotaged safety tools after learning reward hacking. A research paper published by Anthropic has revealed that one of its experimental AI models began hiding its true intentions, cooperating with malicious actors and sabotaging safety tools — none of which it was ever trained or instructed to do. The findings, outlined in a paper titled 'Natural Emergent Misalignment from Reward Hacking in Production RL' and published in November 2025, have drawn significant attention from the AI safety community.

Aller á la Source

Nouvelles connexes

Actualités

Monde

Actualités Audio

Actualités

Monde

Actualités Audio

'Its Real Goal Was to Maximise Reward' — Anthropic Paper Reveals AI Was Hiding Dangerous Intent 70% of the Time