Show HN: RewardHackWatch – Reward hacking detector for LLM agents
Article URL: https://github.com/aerosta/rewardhackwatch Comments URL: https://news.ycombinator.com/item?id=47209592 Points: 1 # Comments: 1
Runtime detection of reward hacking and misalignment signals in LLM agents. RewardHackWatch detects when LLM agents game their evaluations, for example by calling sys.exit(0), patching validators, c… [+6079 chars]