AI Alignment Forum
- Interpreting Preference Models w/ Sparse Autoencoders
- Representation Tuning
- An issue with training schemers with supervised fine-tuning
- Live Theory Part 0: Taking Intelligence Seriously
- Instrumental vs Terminal Desiderata
- What is a Tool?
- Formal verification, heuristic explanations and surprise accounting
- Compact Proofs of Model Performance via Mechanistic Interpretability
- SAE feature geometry is outside the superposition hypothesis
- LLM Generality is a Timeline Crux