AI Alignment Forum
- Automation collapse
- Sabotage Evaluations for Frontier Models
- LLMs can learn about themselves by introspection
- Low Probability Estimation in Language Models
- Anthropic's updated Responsible Scaling Policy
- An Opinionated Evals Reading List
- Minimal Motivation of Natural Latents
- The case for unlearning that removes information from LLM weights
- SAE features for refusal and sycophancy steering vectors
- My theory of change for working in AI healthtech