whitehatStoic

Goal misgeneralisation from a deep learning perspective

by Lauro Langosco, Jack Koch, Lee Sharkey, Jacob Pfau, Laurent Orseau, and David Krueger

Why use AI to create audio versions of research papers or Wikipedia pages?

I find that listening to them is easier for me than reading, and it would be a waste not to share these great materials, so here you go! I'm also building on my research for the AI Alignment Awards, due at the end of this month =).

Introduction

Out-of-distribution (OOD) generalization, i.e. performing well on test data that is not distributed identically to the training set, is a fundamental problem in machine learning (Arjovsky, 2021). OOD generalization is crucial since in many applications it is not feasible to collect data distributed identically to that which the model will encounter in deployment. In this work, we focus on a particularly concerning type of generalization failure that can occur in RL. When an RL agent is deployed out of distribution, it may simply fail to take useful actions. However, there exists an alternative failure mode in which the agent pursues a goal other than the training reward while retaining the capabilities it had on the training distribution. For example, an agent trained to pursue a fixed coin might not recognize the coin when it is positioned elsewhere, and instead competently navigate to the wrong position (Figure 1). We call this kind of failure goal misgeneralization and distinguish it from capability generalization failures. We provide the first empirical demonstrations of goal misgeneralization to highlight and illustrate this phenomenon.

While it is well known that the true reward function can be unidentifiable in inverse reinforcement learning, our work shows that a similar problem can also occur in reinforcement learning when features of the environment are correlated with and predictive of the reward on the training distribution but not OOD. In this way, goal misgeneralization can also resemble problems that arise in supervised learning when models use unreliable features: both problems are a form of competent misgeneralization that works in-distribution but fails OOD. Disentangling capability and goal generalization failures is difficult in supervised learning; for instance, are adversarial examples bugs or features? In contrast, studying RL allows us to formally distinguish capabilities and goals, which roughly correspond to understanding the environment dynamics and the reward function, respectively.
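To make the coin example concrete, here is a minimal, hypothetical sketch (not the paper's CoinRun/Procgen setup; all names and parameters are illustrative). A tabular Q-learning agent lives in a 1-D corridor whose policy, for simplicity, conditions only on its own position, standing in for a network that never learned to attend to the coin because position alone predicted reward during training. With the coin fixed at the right end during training, the learned policy is "walk right"; when the coin is moved at test time, the agent still marches competently to the right end and misses it.

```python
import numpy as np

# Hypothetical 1-D corridor, not the paper's environment. A coin gives +1.
# During training the coin always sits in the rightmost cell, so "walk right"
# and "walk to the coin" are indistinguishable on the training distribution.
# The policy conditions only on the agent's position, standing in for a
# network that never learned to attend to the coin.

N = 10                       # corridor length
ACTIONS = (-1, +1)           # step left, step right
rng = np.random.default_rng(0)

# Tabular Q-learning from random exploration; coin fixed at cell N-1.
Q = np.zeros((N, 2))
for _ in range(2000):
    pos = int(rng.integers(N - 1))            # random start, never on the coin
    for _ in range(50):
        a = int(rng.integers(2))              # off-policy random behaviour
        nxt = int(np.clip(pos + ACTIONS[a], 0, N - 1))
        reward, done = (1.0, True) if nxt == N - 1 else (0.0, False)
        target = reward + (0.0 if done else 0.9 * Q[nxt].max())
        Q[pos, a] += 0.1 * (target - Q[pos, a])
        pos = nxt
        if done:
            break

def rollout(coin_pos, start, max_steps=30):
    """Follow the greedy policy; report whether the coin was reached."""
    pos = start
    for _ in range(max_steps):
        pos = int(np.clip(pos + ACTIONS[int(Q[pos].argmax())], 0, N - 1))
        if pos == coin_pos:
            return True
    return False

print(rollout(coin_pos=N - 1, start=0))  # in distribution: reaches the coin
print(rollout(coin_pos=2, start=5))      # OOD: walks right, away from the coin
```

The OOD failure here is not incompetence: the agent executes its learned behaviour flawlessly, but that behaviour corresponds to the proxy goal "reach the right end" rather than the intended goal "reach the coin".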

Goal misgeneralization might be more dangerous than capability generalization failures, since an agent that capably pursues an incorrect goal can leverage its capabilities to visit arbitrarily bad states. In contrast, the only risks from capability generalization failures are those of accidents due to incompetence. An important implication of goal misgeneralization is that training a model by optimizing an objective R is not enough to guarantee that the model will itself learn to pursue R rather than some proxy for R. This is especially concerning in the context of AGI safety: ensuring that advanced AI systems are safe despite being capable enough to escape our control (Bostrom, 2014). Systems that competently pursue a misaligned goal may tend to seek power and deceive their operators for instrumental reasons. With highly advanced AI systems, this could lead to human disempowerment: for example, an AI system might prevent its operators from shutting it down. If complex tasks are rife with proxies for their training objectives, it may be very hard to predict what objectives the trained AI systems will have.

Our main contributions are:

• We formalize goal misgeneralization, distinguishing it from capability generalization failures (Section 2), and experimentally validate our definition on a gridworld environment (Section 3.5).

• We experimentally showcase goal misgeneralization. Specifically, deep RL agents trained on the Procgen benchmark (a set of diverse procedurally generated environments specifically designed to induce robust generalization) still fail on our slightly modified environments (Section 3).

• We show that goal misgeneralization may be alleviated by increasing the diversity of the training distribution so that the agent learns to distinguish the reward from proxies (Sections 3.1 and 3.2); a toy illustration follows below.

• We investigate the causes of goal misgeneralization. In particular, we find that the actor and the critic components of our actor-critic model learn different proxy goals (Section 3.4).
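The mitigation in the third bullet can be caricatured in the same toy corridor (again a hypothetical sketch, not the paper's Procgen experiments; all names and parameters are illustrative). If the observation includes the coin's cell and the coin is placed randomly during training, "walk right" no longer fits the training data, so the learned policy has to track the coin and transfers to arbitrary coin placements at test time.

```python
import numpy as np

# Same hypothetical 1-D corridor, but the Q-table is now indexed by both the
# agent's cell and the coin's cell, and the coin is placed uniformly at random
# during training. With this diversity, "walk right" no longer predicts reward,
# and the only policy consistent with the data is "walk to the coin".

N = 10
ACTIONS = (-1, +1)
rng = np.random.default_rng(0)

Q = np.zeros((N, N, 2))                       # Q[agent_cell, coin_cell, action]
for _ in range(5000):
    coin = int(rng.integers(N))               # diverse training: coin anywhere
    pos = int(rng.integers(N))
    if pos == coin:
        continue
    for _ in range(50):
        a = int(rng.integers(2))              # off-policy random behaviour
        nxt = int(np.clip(pos + ACTIONS[a], 0, N - 1))
        reward, done = (1.0, True) if nxt == coin else (0.0, False)
        target = reward + (0.0 if done else 0.9 * Q[nxt, coin].max())
        Q[pos, coin, a] += 0.1 * (target - Q[pos, coin, a])
        pos = nxt
        if done:
            break

def run(start, coin, max_steps=30):
    """Follow the greedy policy; report whether the coin was reached."""
    pos = start
    for _ in range(max_steps):
        pos = int(np.clip(pos + ACTIONS[int(Q[pos, coin].argmax())], 0, N - 1))
        if pos == coin:
            return True
    return False

# With diverse training, the agent reaches the coin wherever it is placed.
for c in range(N):
    if c != 5:
        print(f"coin at {c}: reached = {run(start=5, coin=c)}")
```

This tabular caricature relies on the randomization covering every coin placement; the paper's finding is stronger, showing that adding diversity to the training distribution helps deep RL agents learn to pursue the coin rather than the positional proxy (Sections 3.1 and 3.2).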

Read the whole paper here: https://arxiv.org/pdf/2105.14111.pdf
