AI Alignment Forum
- The case for unlearning that removes information from LLM weights
- SAE features for refusal and sycophancy steering vectors
- My theory of change for working in AI healthtech
- EIS XIV: Is mechanistic interpretability about to be practically useful?
- Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming
- Safe Predictive Agents with Joint Scoring Rules
- Schelling game evaluations for AI control
- Research update: Towards a Law of Iterated Expectations for Heuristic Estimators
- A Narrow Path: a plan to deal with AI extinction risk
- AXRP Episode 37 - Jaime Sevilla on AI Forecasting