One-Shot Inverse Reinforcement Learning for Stochastic Linear Bandits
Abstract
The paradigm of inverse reinforcement learning (IRL) is used to infer the reward function of an agent purely from its actions and is critical for value alignment and AI safety. Motivated by the need for IRL in large action spaces with limited data, we consider, as a first step, the problem of learning from a single sequence of actions (i.e., a demonstration) generated by a stochastic linear bandit algorithm. In general, IRL is made difficult by the lack of access to the rewards seen by the demonstrator. When the demonstrator employs the Phased Elimination algorithm, we develop a simple inverse learning procedure that estimates the reward parameter from the demonstrator’s evolving behavior. We establish guarantees on the performance of our estimator, showing that the linear reward function can be estimated consistently as the time horizon grows, using just a \textit{single} demonstration. In particular, we guarantee that our inverse learner approximates the true reward parameter within a $T^{\frac{1-\omega}{2\omega}}$ error, where $T$ is the number of samples generated by the demonstrator and $\omega$ is an action-set-dependent constant. We complement this result with an information-theoretic lower bound for any inverse learning procedure. Our guarantees are corroborated by simulations on synthetic data and on a demonstration constructed from the MovieLens dataset.
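To make the demonstrator side of the setup concrete, the sketch below simulates a simplified variant of Phased Elimination on a stochastic linear bandit (uniform per-arm exploration within each phase in place of a G-optimal design). The function name `phased_elimination_demo`, the toy action set, the Gaussian noise model, and the elimination schedule are illustrative assumptions; this is not the paper's inverse learning procedure, only an example of the kind of single demonstration and evolving active sets that an inverse learner would get to observe.

```python
import numpy as np

def phased_elimination_demo(actions, theta_star, T, noise_std=1.0, seed=0):
    """Simulate a simplified Phased Elimination demonstrator on a linear bandit.

    actions: (K, d) array of arms; theta_star: true reward parameter (d,).
    Returns the played arms (the single demonstration) and the per-phase
    active sets, i.e., the demonstrator's evolving behavior.
    """
    rng = np.random.default_rng(seed)
    K, d = actions.shape
    active = np.arange(K)                  # arms not yet eliminated
    demo, active_sets = [], [active.copy()]
    t, phase = 0, 1
    while t < T:
        eps = 2.0 ** (-phase)              # target accuracy of this phase
        A = actions[active]
        # Uniform per-arm allocation stands in for a G-optimal design.
        pulls = max(1, int(np.ceil(d / (len(active) * eps ** 2))))
        V, b = 1e-8 * np.eye(d), np.zeros(d)   # small ridge keeps V invertible
        for a in A:
            for _ in range(pulls):
                if t >= T:
                    break
                r = a @ theta_star + noise_std * rng.standard_normal()
                V += np.outer(a, a)
                b += r * a
                demo.append(a)
                t += 1
        theta_hat = np.linalg.solve(V, b)      # phase least-squares estimate
        vals = A @ theta_hat
        active = active[vals >= vals.max() - 2 * eps]   # drop clearly bad arms
        active_sets.append(active.copy())
        phase += 1
    return np.array(demo), active_sets

# Example: 20 random unit-norm arms in dimension 5 and a random reward parameter.
rng = np.random.default_rng(1)
arms = rng.standard_normal((20, 5))
arms /= np.linalg.norm(arms, axis=1, keepdims=True)
theta = rng.standard_normal(5)
theta /= np.linalg.norm(theta)
demonstration, behavior = phased_elimination_demo(arms, theta, T=10_000)
```

In this toy setup the returned `demonstration` plays the role of the single action sequence available to the inverse learner, and `behavior` records how the demonstrator's active set shrinks across phases; no rewards are exposed, mirroring the information constraint described in the abstract.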