In this week’s ML & AI Safety Update, we hear Paul Christiano’s take on one of OpenAI’s main alignment strategies, dive into the second-round winners of the Inverse Scaling Prize, and share the many fascinating projects from our mechanistic interpretability hackathon. And stay tuned until the end for some unique opportunities in AI safety!
Show Notes
Opportunities (https://ais.pub/aistraining)
- The application deadline for PIBBSS is in 10 days: https://ais.pub/pibbss
- EAG London is coming up in May: https://ais.pub/eag
- Introduction to ML Safety: https://ais.pub/gt2
- Alignment competitions: https://ais.pub/aawards
Sources
- RLHF (2015): https://ai-alignment.com/efficient-feedback-a347748b1557
- Christiano on RLHF: https://www.alignmentforum.org/posts/vwu4kegAEZTBtpT6p/thoughts-on-the-impact-of-rlhf-research
- Inverse scaling prize winners: https://www.lesswrong.com/posts/DARiTSTx5xDLQGrrz/inverse-scaling-prize-second-round-winners
- We discovered the “ an” neuron: https://itch.io/jam/mechint/rate/1890024
- Identifying a preliminary circuit for predicting gendered pronouns in GPT-2 small with the automatic circuit identification algorithm: https://itch.io/jam/mechint/rate/1889871
- Automated identification of potential feature neurons: https://itch.io/jam/mechint/rate/1889215
- Soft prompts are a convex set: https://itch.io/jam/mechint/rate/1889669
- Mentaleap team: https://mentaleap.ai/
- Prompt tuning: https://arxiv.org/abs/2104.08691
- Results page: https://itch.io/jam/mechint/results
What is ML Safety Report?
A weekly podcast bringing you the latest research in AI and machine learning safety from organizations such as DeepMind, Anthropic, and MIRI.