White-Box Adversarial Policies in Deep Reinforcement Learning

Stephen Casper
Stephen graduated from Harvard with a B.A. in Statistics in 2021. Since September 2021, he has been pursuing a Ph.D. in Computer Science at MIT, where he works on interpretability tools and adversarial vulnerabilities in deep neural networks. You can find him at https://stephencasper.com or email him at scasper@mit.edu.
Author's Note
What was your thesis topic?
I studied white-box adversarial policies in reinforcement learning. Reinforcement learning algorithms learn behaviors from reward signals in an environment; for example, they are often used to train AI systems to play video games. An adversarial policy is a decision-making procedure that an attacker agent uses to make a victim agent fail in a multiagent reinforcement learning setting, and a white-box adversarial policy is an adversarial policy that can access information from its victim's internal state. Basically, this work studied what happens when you make one AI system wreak havoc on another with the help of "mind reading" powers.
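To make the distinction concrete, here is a minimal, hypothetical sketch (all names are illustrative, not code from the thesis) of how a white-box attacker differs from a black-box one: the white-box attacker's observation is augmented with the victim network's internal activations.

```python
import numpy as np

# Minimal sketch (hypothetical names throughout; not code from the thesis).
# The victim is a fixed policy network. A black-box attacker sees only the
# shared environment observation; a white-box attacker also sees the
# victim's internal activations.

rng = np.random.default_rng(0)

class VictimPolicy:
    """A fixed two-layer policy network standing in for a trained victim."""
    def __init__(self, obs_dim, act_dim, hidden_dim=32):
        self.w1 = rng.normal(size=(obs_dim, hidden_dim))
        self.w2 = rng.normal(size=(hidden_dim, act_dim))

    def act(self, obs):
        hidden = np.tanh(obs @ self.w1)  # internal state a white-box attacker can read
        action = np.tanh(hidden @ self.w2)
        return action, hidden

victim = VictimPolicy(obs_dim=8, act_dim=2)
obs = rng.normal(size=8)
victim_action, victim_hidden = victim.act(obs)

# Black-box attacker input: just the shared observation.
blackbox_input = obs

# White-box attacker input: the observation plus the victim's hidden
# activations, i.e., the "mind reading" described above. A standard RL
# algorithm (e.g., PPO) would then train the attacker on this enlarged
# observation to minimize the victim's reward.
whitebox_input = np.concatenate([obs, victim_hidden])
```

Note that this setup only assumes read access to the victim's activations at each step, not the ability to modify the victim itself.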
What do you think the stronger and weaker parts of your research are?
I think that the results we obtain with white-box attacks are strong and suggest useful directions for future work focused on finding weaknesses in models. However, we also experimented with some weaker approaches that didn't make it into the final paper. One thing we tried was to use a model of the victim learned from black-box access instead of having white-box access to the victim itself. This required fewer assumptions about the information an attacker has access to, but the results were not strong.
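For intuition, here is a hedged sketch of that black-box variant, under a deliberately simple setup with illustrative names (none of this is code from the paper): query the victim as a black box, fit a surrogate model to its input/output behavior, and let the attacker inspect the surrogate's internals as a stand-in for the real victim's.

```python
import numpy as np

# Hedged sketch of the black-box variant described above (illustrative
# names; not code from the paper): query the victim, fit a surrogate to
# its input/output behavior, then give the attacker access to the
# surrogate's internals instead of the true victim's.

rng = np.random.default_rng(1)
obs_dim, act_dim, n_queries = 8, 2, 1000

def victim_act(obs):
    """Black-box victim: we can query it but not see inside it."""
    w = np.sin(np.arange(obs_dim * act_dim)).reshape(obs_dim, act_dim)
    return obs @ w

# 1) Query the victim to build a behavioral-cloning dataset.
observations = rng.normal(size=(n_queries, obs_dim))
actions = np.array([victim_act(o) for o in observations])

# 2) Fit a surrogate model (linear least squares here, for brevity).
surrogate_w, *_ = np.linalg.lstsq(observations, actions, rcond=None)

# 3) The attacker can now "read the mind" of the surrogate in place of
#    the victim, e.g., use its internal computation as extra features.
test_obs = rng.normal(size=obs_dim)
approx_action = test_obs @ surrogate_w
```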
In what ways do you think your topic improves the world?
There are two reasons this is important. One is for better understanding threats and weaknesses: our method shows that white-box attacks are more effective threats against these systems, which serves as a warning about these types of risks but also suggests that we can use white-box approaches to more effectively study flaws in the systems we build. Second, we find that white-box adversaries are useful for making a victim more robust via adversarial training.
In what ways have you changed your mind since writing it?
I think the most important takeaway from this work is that white-box methods may be more effective approaches for debugging models and improving their robustness via adversarial training. I have gotten excited about follow-up work in this direction.
What recommendations would you make to others interested in taking a similar direction with their research?
Read, take notes, discuss, and write about academic papers in your area of interest regularly!
Published 10/10/22