What is AI Safety?

“Artificial intelligence is transforming our world — it is on all of us to make sure that it goes well”

— Our World in Data

AI Safety aims to tackle the problem of ensuring advanced AI is developed safely, reliably, and in alignment with human values.

Thanks to Michigan AI Safety and Intro to ML Safety for the resources in this section.

AI has tremendous potential…

For making the world a better place, especially as the technology continues to develop. We’re already seeing some beneficial applications of AI to healthcare, accessibility, language translation, automotive safety, and art creation, to name just a few.

However, the deployment of AI systems into high-stakes settings, such as transportation and medicine, also poses some serious risks.

Some of these concerns apply to current systems: how do we prevent driverless cars from misidentifying a stop sign in a blizzard? Others are more forward-looking: how can we ensure advanced AI systems pursue safe and beneficial goals?

Indeed, it’s possible that future AI systems will be qualitatively different from those we see today. They may be able to form sophisticated plans to achieve their goals, and also understand the world well enough to strategically evaluate many relevant obstacles and opportunities. Furthermore, they may attempt to acquire resources or resist shutdown attempts, since these are useful strategies for some goals their designers might specify. For more on why these failures might be challenging to prevent, see DeepMind’s research on specification gaming and goal misgeneralization.

It’s worth reflecting on the possibility that an AI system of this kind could outmaneuver humanity’s best efforts to stop it.

Some of these potential concerns have already been demonstrated in modern machine learning systems. DeepMind’s research on specification gaming and goal misgeneralization documents cases in which reinforcement learning agents pursue unintended goals. Meta AI’s Cicero model, which reached human-level performance in Diplomacy, a strategic board game, shows that modern systems can successfully negotiate with and deceive humans.

Introductory Resources

Our brief argument above skipped over many important considerations. Here are some resources on how AI might cause a catastrophe.

Articles

Preventing an AI-related catastrophe (Benjamin Hilton) + audio version
AI experts are increasingly afraid of what they’re creating (Kelsey Piper)
Why I Think More NLP Researchers Should Engage with AI Safety Concerns (Samuel Bowman)
Why Would AI "Aim" To Defeat Humanity? (Holden Karnofsky) + audio version
The alignment problem from a deep learning perspective (Richard Ngo, Lawrence Chan, Sören Mindermann)

Podcast Episodes

Richard Ngo or Paul Christiano on the AI X-risk Research Podcast
Brian Christian or Ben Garfinkel on the 80,000 Hours Podcast
Ajeya Cotra or Rohin Shah on the Future of Life Institute Podcast
Eliezer Yudkowsky on the Making Sense Podcast

Books

The Alignment Problem by Brian Christian
Human Compatible by Stuart Russell
Superintelligence by Nick Bostrom