Reinforcement Learning with Generalized Feedback: Beyond Numeric Rewards
This workshop will be held on Monday, September 23rd, 2013, as part of the ECML/PKDD 2013 conference.
Please note that the submission deadline has been extended to July 5th, 2013!
Background
Motivation
Reinforcement learning is traditionally formalized within the Markov Decision Process (MDP) framework: by taking actions in a stochastic and possibly unknown environment, an agent moves between the states of that environment and, after each action, receives a numeric, possibly delayed reward signal. The agent's learning task then consists of devising a policy (a mapping from states to actions) that allows it to act optimally, that is, to maximize its long-term (cumulative) reward.
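In the discounted formulation, this objective is commonly stated as follows (the notation $V^\pi$, $\gamma \in [0,1)$, and $r_{t+1}$ is standard and is introduced here only for concreteness):

$$ V^\pi(s) \;=\; \mathbb{E}_\pi\!\left[\, \sum_{t=0}^{\infty} \gamma^t\, r_{t+1} \;\Big|\; s_0 = s \right], \qquad \pi^* \in \operatorname*{arg\,max}_{\pi} V^\pi(s) \ \text{ for all states } s. $$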
In recent years, different generalizations of the standard setting of reinforcement learning have emerged; in particular, several attempts have been made to relax the quite restrictive requirement for numeric feedback and to learn from more flexible types of training information. Examples of such generalized settings include:
- Learning from Expert Demonstration:
- The training information consists of the action traces of an expert demonstrating the task, and the learner is supposed to devise a policy so as to imitate the expert. A specific instantiation of this setting is apprenticeship learning, which can be realized, for example, through inverse reinforcement learning.
- Learning from Qualitative Feedback:
- In this setting, the agent is not (necessarily) provided with a numeric reward signal. Instead, it is supposed to learn from more general types of feedback, such as ordinal rewards (Weng, 2011) or qualitative comparisons between trajectories or policies, as in preference-based reinforcement learning.
- Learning from Multiple Feedback Signals:
- Here, feedback is provided in the form of multiple, possibly conflicting reward signals. The task of multi-objective reinforcement learning is to learn a policy that optimizes all of them at the same time, or at least finds a good compromise solution (a minimal sketch contrasting these three forms of feedback follows this list).
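To make the contrast between these forms of feedback concrete, the following minimal Python sketch shows how the learner's feedback interface changes across the three settings; it is purely illustrative, and all names, the toy goal state, and the simulated tutor are hypothetical rather than taken from any workshop contribution.

from typing import List, Tuple

State, Action = int, int
Trajectory = List[Tuple[State, Action]]   # a sequence of (state, action) pairs

GOAL: State = 10   # hypothetical goal state of a toy task

def numeric_reward(s: State, a: Action, s_next: State) -> float:
    # Standard MDP feedback: a scalar signal after every transition.
    return 1.0 if s_next == GOAL else 0.0

def trajectory_preference(tau_1: Trajectory, tau_2: Trajectory) -> int:
    # Qualitative feedback: a tutor merely states which of two trajectories it prefers.
    # Here the tutor is simulated by preferring the shorter trajectory.
    if len(tau_1) < len(tau_2):
        return +1   # tau_1 preferred
    if len(tau_2) < len(tau_1):
        return -1   # tau_2 preferred
    return 0        # indifferent or incomparable

def vector_reward(s: State, a: Action, s_next: State) -> Tuple[float, float]:
    # Multi-objective feedback: one signal per (possibly conflicting) criterion,
    # e.g. task progress versus action cost.
    return (1.0 if s_next == GOAL else 0.0, -float(abs(a)))

if __name__ == "__main__":
    tau_a = [(0, 1), (5, 1), (10, 0)]
    tau_b = [(0, 1), (3, 1), (7, 1), (10, 0)]
    print(trajectory_preference(tau_a, tau_b))   # prints 1: the shorter trajectory wins

An expert demonstration, in turn, would simply be a collection of such trajectories provided without any reward or preference attached.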
Learning in generalized frameworks like those mentioned above can be considerably harder than learning in MDPs. In qualitative settings, for example, where rewards cannot be easily aggregated over different states, policy evaluation becomes a non-trivial task. Many approaches assume a hidden numeric reward function and interpret qualitative feedback as indirect or implicit information about that function. This assumption is already quite restrictive, however, and immediately imposes a total order on trajectories, which is not very natural in the settings of preference-based and multi-objective reinforcement learning. Purely qualitative approaches, on the other hand, completely give up the assumption of an underlying numeric reward function. This makes them more general but comes with a loss of properties that are crucial for standard reinforcement learning techniques (such as policy and value iteration).
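A common way to make this hidden-reward assumption explicit is the following (the notation is introduced here only for illustration and is not taken from a specific contribution): one postulates a latent return $R(\tau) = \sum_{t} \gamma^t\, r(s_t, a_t)$ for every trajectory $\tau$ and interprets pairwise feedback as $\tau_1 \succ \tau_2 \;\Leftrightarrow\; R(\tau_1) > R(\tau_2)$, or, in a noisy variant, as $P(\tau_1 \succ \tau_2) = \bigl(1 + \exp\!\bigl(R(\tau_2) - R(\tau_1)\bigr)\bigr)^{-1}$. Since $R$ is real-valued, any two trajectories become comparable, which is exactly the total order alluded to above.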
The above extensions and variants of reinforcement learning are closely connected to, and largely overlap with, preference learning, a relatively new subfield of machine learning that deals with learning (predictive) preference models from observed, revealed, or automatically extracted preference information. For example, inverse reinforcement learning and apprenticeship learning can be seen as specific types of preference learning in dynamic environments. Likewise, preference-based and multi-objective reinforcement learning make use of generalized formalisms for representing preferences, as well as of learning techniques from the fields of preference learning and learning-to-rank.
Goals and Objectives
The most important goal of this workshop is to help unify and streamline research on generalizations of standard reinforcement learning, which, for the time being, are pursued in a rather disconnected manner. Indeed, many of the extensions and generalizations discussed above still lack a sound theoretical foundation, let alone a generally accepted underlying framework comparable to Markov Decision Processes for conventional reinforcement learning. Moreover, many of the commonalities shared by these generalizations have apparently not been recognized or explored so far. A formalization in terms of preferences may provide such a theoretical underpinning. Ideally, the workshop will help participants identify common ground in their work and thereby move the field toward a theoretical foundation for reinforcement learning with generalized feedback.
Apart from fostering theoretical developments of that kind, we are also interested in identifying and exchanging applications and problems that may serve as benchmarks for qualitative or preference-based reinforcement learning, much as cart-pole balancing or the mountain car do for classical reinforcement learning.
Topics of Interest
Topics of interest include, but are not limited to:
- novel frameworks for reinforcement learning beyond MDPs
- algorithms for learning from preferences and non-numeric, qualitative, or structured feedback
- theoretical results on the learnability of optimal policies, convergence of algorithms in qualitative settings, etc.
- applications and benchmark problems for reinforcement learning in non-standard settings.
Program
9:30 - 10:30 Session 1
9:30 - 9:40 | Eyke Hüllermeier, Johannes Fürnkranz: Opening Remarks
9:40 - 10:30 | Invited Talk by Michele Sebag
10:30 - 11:00 Coffee break
11:00 - 12:40 Session 2: Interactive Reinforcement Learning
11:00 - 11:25 | L. Adrian Leon, Ana C. Tenorio, Eduardo F. Morales: Human Interaction for Effective Reinforcement Learning
11:25 - 11:50 | Riad Akrour, Marc Schoenauer, and Michele Sebag: Interactive Robot Education
11:50 - 12:15 | Paul Weng, Robert Busa-Fekete, and Eyke Hüllermeier: Interactive Q-Learning with Ordinal Rewards and Unreliable Tutor
12:15 - 12:40 | Omar Zia Khan, Pascal Poupart, and John Mark Agosta: Iterative Model Refinement of Recommender MDPs based on Expert Feedback
12:40 - 14:00 Lunch break
14:00 - 15:30 Session 3: RL with Non-numerical Feedback
14:00 - 14:25 | Christian Wirth, Johannes Fürnkranz: Preference-Based Reinforcement Learning: A Preliminary Survey
14:25 - 14:50 | Robert Busa-Fekete, Balazs Szörenyi, Paul Weng, Weiwei Cheng, and Eyke Hüllermeier: Preference-based Evolutionary Direct Policy Search
14:50 - 15:15 | Daniel Bengs, Ulf Brefeld: A Learning Agent for Parameter Estimation in Speeded Tests
15:15 - 15:30 | Discussion
15:30 - 16:00 Coffee break
16:00 - 17:15 Session 4: Inverse RL and Multi-Dimensional Feedback
16:00 - 16:25 | Hideki Asoh, Masanori Shiro, Shotaro Akaho, Toshihiro Kamishima, Koiti Hasida, Eiji Aramaki, and Takahide Kohro: Applying Inverse Reinforcement Learning to Medical Records of Diabetes
16:25 - 16:50 | Mohamed Oubbati, Timo Oess, Christian Fischer, and Günther Palm: Multiobjective Reinforcement Learning Using Adaptive Dynamic Programming And Reservoir Computing
16:50 - 17:15 | Petar Kormushev, Darwin G. Caldwell: Comparative Evaluation of Reinforcement Learning with Scalar Rewards and Linear Regression with Multidimensional Feedback
17:15 - 17:30 | Discussion
Organization
Workshop Chairs
- Johannes Fürnkranz (TU Darmstadt)
- Eyke Hüllermeier (Universität Marburg)
Programme Committee
- Riad Akrour, INRIA Saclay
- Robert Busa-Fekete, Universität Marburg
- Damien Ernst, University of Liège
- Raphael Fonteneau, INRIA Lille
- Levente Kocsis, Hungarian Academy of Sciences
- Francis Maes, K.U. Leuven
- Jan Peters, TU Darmstadt
- Constantin Rothkopf, Frankfurt Institute for Advanced Studies
- Csaba Szepesvári, University of Alberta
- Christian Wirth, TU Darmstadt
- Paul Weng, Université Pierre et Marie Curie, Paris
- Bruno Zanuttini, Université de Caen