A system and method for online reinforcement learning is provided. In
particular, a method for managing the explore-vs.-exploit tradeoff is
provided. Although the method is heuristic, it can be applied in a
principled manner while simultaneously learning the parameters and/or
structure of the model (e.g., a Bayesian network model). The system includes
a model which receives an input (e.g., from a user) and provides a
probability distribution associated with uncertainty regarding parameters
of the model to a decision engine. The decision engine can determine
whether to exploit the information known to it or to explore to obtain
additional information based, at least in part, upon an
explore-vs.-exploit strategy (e.g., the Thompson strategy). A reinforcement
learning component can obtain additional information (e.g., feedback from
a user) and update parameter(s) and/or the structure of the model. The
system can be employed in scenarios in which an influence diagram is used
to make repeated decisions and maximization of long-term expected utility
is desired.
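
The Thompson strategy mentioned above can be illustrated in a simplified setting. The sketch below applies Thompson sampling to a Bernoulli multi-armed bandit rather than to the full influence-diagram setting described: each arm maintains a Beta posterior over its unknown success probability, a probability is sampled from each posterior, and the arm with the highest sample is pulled, so exploration arises naturally from posterior uncertainty. The function name and parameters are illustrative, not part of the described system.

```python
import random

def thompson_sampling(true_probs, rounds, seed=0):
    """Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors.

    Each round: sample a success probability from every arm's Beta
    posterior, pull the arm whose sample is highest, observe a 0/1
    reward, and update that arm's posterior. Arms with uncertain
    posteriors are occasionally sampled high, which drives exploration;
    well-estimated good arms dominate over time, which is exploitation.
    """
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1.0] * k  # posterior successes + 1 per arm
    beta = [1.0] * k   # posterior failures + 1 per arm
    pulls = [0] * k
    for _ in range(rounds):
        # Draw one sample from each arm's posterior belief.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        # Simulated environment feedback (the "user feedback" above).
        reward = 1 if rng.random() < true_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling([0.2, 0.5, 0.8], rounds=2000)
```

With hidden success rates of 0.2, 0.5, and 0.8, the posterior updates concentrate pulls on the best arm while still probing the others early on, which is the long-term expected-utility behavior the tradeoff is meant to achieve.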