AI Alignment Forum
- A breakdown of AI capability levels focused on AI R&D labor acceleration
- “Alignment Faking” frame is somewhat fake
- How to replicate and extend our alignment faking demo
- Measuring whether AIs can statelessly strategize to subvert security measures
- Learning Multi-Level Features with Matryoshka SAEs
- Takes on "Alignment Faking in Large Language Models"
- Alignment Faking in Large Language Models
- A dataset of questions on decision-theoretic reasoning in Newcomb-like problems
- How counterfactual are logical counterfactuals?
- Best-of-N Jailbreaking