Technical AI Safety DeCal

Thursdays 6-8 PM in Giannini 141, starting 2/13

About the Course

Course Prerequisites

This course assumes fluency with linear algebra, multivariable calculus, and basic programming at a level equivalent to MATH 54, MATH 53, and CS 61A. Because the class moves quickly through the material and involves significant ML paper reading, familiarity with machine learning at the level of CS 189 is strongly recommended. Prior familiarity with the field of AI safety is not required.

Assignments

Weekly Reading Reflections: These reflections are based on the assigned readings and must be submitted before each class. Students will answer specific questions about the material and share their own insights, including points of interest, connections to other ideas, and areas of agreement or disagreement. Completing these reflections demonstrates engagement with the material and helps prepare students for in-class discussions.

Projects: Students will complete 2-3 coding projects focused on replicating results from technical AI safety papers. They are also encouraged to creatively extend the ideas presented in the papers. Projects may be completed in groups, and students should prepare a presentation for the final project.

Tentative Schedule

Week 1 - Issues in AI Safety

Week 2 - Machine Learning Review

Week 3 - History of Adversarial Machine Learning

Week 4 - Adversarial Attacks on LLMs

Week 5 - Alignment with Human Values

Week 6 - Problem of Misalignment

Week 7 - Mechanistic Interpretability

Week 8 - Representation Engineering

Week 9 - Benchmarks and Evaluation

Week 10 - LLM Agent Security

Weeks 11-14 - Guest Lectures and Final Project Presentations

The Intro to AI Safety curriculum provides a high-level understanding of the problems in AI safety and some of the key research directions that aim to solve them. Students will learn about the theoretical and practical risks of advanced AI systems, the difficulties inherent in addressing those risks, and the current state of research on solutions.