# Reinforcement Learning under Deceptive Reward

## Down / AntiDown

This project investigates failure modes in reinforcement learning systems operating under deceptive, corrupted, or mis-specified reward signals. Rather than optimizing for peak performance, the focus is on observing how agents behave when reward feedback becomes unreliable while confidence and apparent stability remain high.

The Down / AntiDown pair is designed as a controlled benchmark for studying robustness, collapse, and recovery dynamics under adversarial reward conditions, with direct relevance to AI safety and alignment research.

## Core Idea

In many real-world and safety-critical settings, reward signals may be:

- Misaligned with the true task objective
- Delayed, noisy, or partially adversarial
- Temporarily valid and then corrupted
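
The "temporarily valid and then corrupted" case can be sketched with a minimal environment wrapper. This is an illustrative example, not the project's actual code: `ConstantRewardEnv`, `DeceptiveRewardWrapper`, and the `corrupt_after` parameter are hypothetical names, and the corruption here is a simple sign flip.

```python
class ConstantRewardEnv:
    """Toy stand-in environment: every step yields reward +1."""

    def reset(self):
        self.t = 0
        return 0  # trivial observation

    def step(self, action):
        self.t += 1
        return 0, 1.0, False  # observation, reward, done


class DeceptiveRewardWrapper:
    """Hypothetical wrapper: pass the true reward through for the first
    `corrupt_after` steps, then flip its sign (adversarial corruption)."""

    def __init__(self, env, corrupt_after):
        self.env = env
        self.corrupt_after = corrupt_after
        self.steps = 0

    def reset(self):
        self.steps = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done = self.env.step(action)
        self.steps += 1
        if self.steps > self.corrupt_after:
            reward = -reward  # reward channel is corrupted after the switch point
        return obs, reward, done


env = DeceptiveRewardWrapper(ConstantRewardEnv(), corrupt_after=3)
env.reset()
rewards = [env.step(0)[1] for _ in range(6)]
print(rewards)  # [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
```

The key design point is that the agent sees no observable change at the switch: only the reward channel is altered, so any behavioral change must come from the agent's own response to the corrupted signal.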


This project studies how learning agents respond to these conditions, particularly:

- How long they continue to act confidently under corrupted reward
- Whether collapse occurs abruptly or gradually
- Whether recovery is possible once reward integrity is restored
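
One way to operationalize "abrupt vs. gradual collapse" is a rolling mean over episode returns, flagging the onset when the rolling mean falls below a fraction of the pre-corruption baseline. The sketch below is a hedged illustration; `window` and `threshold` are arbitrary choices, not settings from this project.

```python
def collapse_onset(returns, window=3, threshold=0.5):
    """Return the index of the step that completes the first rolling window
    whose mean drops below threshold * baseline, or None if none does.

    The baseline is the mean of the first `window` returns, assumed to
    predate reward corruption.
    """
    baseline = sum(returns[:window]) / window
    for i in range(window, len(returns) + 1):
        rolling = sum(returns[i - window:i]) / window
        if rolling < threshold * baseline:
            return i - 1
    return None


# Example trace: stable returns, then decline after reward corruption.
returns = [10, 10, 10, 9, 4, 2, 1, 1]
print(collapse_onset(returns))  # 6
```

Comparing how quickly this onset index follows the actual corruption step gives a crude measure of how long the agent kept acting confidently on a bad signal.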


## Project Structure

- `agents/`: implementations of the Down and AntiDown agents.
- `envs/`: custom environments modeling deceptive reward dynamics.
- `run_all.py`: entry point that runs both Down and AntiDown agents in a unified experimental setup.


## Running the Experiments

This project is designed to run without notebooks; it assumes a conda (Anaconda/Miniconda) environment.

```bash
conda activate your_env
python run_all.py
```
## Relevance to AI Safety

This benchmark targets a known blind spot in current ML evaluation: agents that appear competent while optimizing the wrong objective. It provides concrete tools to study:

- Misalignment under deceptive feedback
- Early-warning signals preceding collapse
- Limits of reward-centric evaluation

The project directly supports research on stability-aware and intervention-based safety mechanisms.

