An Introduction to Reinforcement Learning

Author: Murphy
Image used under a Creative Commons license from: https://elifesciences.org/digests/57443/reconstructing-the-brain-of-fruit-flies

What is Reinforcement Learning?

One path toward engineering intelligence lies in emulating biological organisms.

Biological organisms transduce information from the environment, process it (the subject matter of cognitive science), and output behaviour conducive to survival. Such behaviours, at the most basic level, involve foraging for food, reproducing, and avoiding harm. They also include the wide spectrum of human activity such as play, creativity, problem-solving, design and engineering, socializing, romance, and intellectual life.

Now, how do we engineer a system that is able to do all of the above?

If we were to model a simple organism as a function of its environment, we would need a model of the agent, a model of the environment, and some function that moves the agent from its present state toward a desired state.

In psychology, two major schools of thought purport to explain human behaviour: behaviourism and cognitive science. Behaviourists understand behaviour as a function of learning mechanisms, inferring what has been learned from the behaviour an organism outputs. Cognitive science, on the other hand, models the agent's interaction with the environment through an information-processing approach: the senses transduce external stimuli into an internal representation, which is then subjected to layers of transformation and integration all the way up to the thinking and reasoning faculties, before some behavioural output is returned. In the former approach, learning is understood largely as a function of environmental conditioning, whereas in the latter, mental representations are considered indispensable for predicting behaviour. Reinforcement learning borrows mostly from the behaviourist approach, in which environmental reward dictates how the agent evolves within the search space.

Operant conditioning, the school of behaviourist psychology that reigned in the 1950s and 60s, defined learning as the product of the environmental mechanisms of reward and punishment. Precursors to operant conditioning included Edward Thorndike's Law of Effect, which proposed that behaviours producing satisfying effects are more likely to recur, whereas behaviours producing dissatisfying effects are less likely to. B.F. Skinner operationalized these effects in terms of reinforcement and punishment. Reinforcement increases the likelihood that a behaviour recurs, either by adding a rewarding stimulus (positive reinforcement) or by removing an aversive one (negative reinforcement). An example of positive reinforcement is becoming good at a sport and winning often; an example of negative reinforcement is the removal of an inhibitory stimulus, e.g. the school bully who taunts you during games. Operant conditioning predicts that you are likely to repeat the behaviours that receive the greatest reward. Punishment, on the other hand, decreases a behaviour either by adding a negative consequence (positive punishment) or by removing the reward associated with the behaviour (negative punishment). Fouling that leads to expulsion from the game illustrates positive punishment; performing poorly and losing games illustrates negative punishment, which may lead you to avoid playing in the future.

Human society is replete with secondary reinforcers: socially constructed rewards and punishments that condition behaviour. These include money, grades, university admittance criteria, and the rules for winning and losing games, all of which build upon natural reinforcers closer to biological needs, such as food, reproduction, and social approbation.

Memory plays an important role in learning because it enables the retention of prior experiences. Evidence shows that memory encodes the rewards and punishments of an experience more strongly than its content (Tyng et al., 2017). Subjects tend to remember rewarding experiences fondly and are therefore likely to repeat them, while they remember negative experiences unfavourably and are likely to avoid them in the future. The mechanisms of memory are complicated and diverse, and evidence suggests that subjects play an active role in reshaping their memories by recalling them (Spens & Burgess, 2024). This complicates the picture for behaviourism, because the subject's interpretation of an experience can be retrospectively modified and reframed, making prediction from conditioning principles alone difficult. Furthermore, rewards and punishments oversimplify the landscape of positive and negative affect, which comprises a complex terrain of peaks and troughs and nested dependencies, and is better modelled as a continuous spectrum than as a binary space.

These complexities notwithstanding, reinforcement learning comprises an array of mathematical techniques that adapt the behavioural ontology of agent, environment, and rewards in order to model artificial intelligence. As we will see below, aspects of reinforcement learning emerge from control theory, whose precursors extend into physics and engineering, and other aspects emerge more directly from psychology and biology. Since both the objects of control theory and living systems are dynamical systems that must stay within an optimal range far from thermodynamic equilibrium, the underlying principles are amenable to the goals of reinforcement learning and of artificial intelligence more broadly.

Dynamic programming emerged chiefly from control theory as a mathematical optimization method that breaks a large problem down recursively into sub-problems, the solutions of which combine to solve the original problem. Generally speaking, recursion refers to a function that calls itself, directly or indirectly, on smaller instances of the same problem.
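As a toy illustration (the example is mine, not the article's), consider computing Fibonacci numbers in Python: the naive recursion re-solves the same sub-problems exponentially many times, while the dynamic-programming version caches each sub-result and reuses it.

```python
from functools import lru_cache

def fib_naive(n: int) -> int:
    """Plain recursion: the function calls itself on smaller sub-problems,
    but recomputes the same values exponentially many times."""
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_dp(n: int) -> int:
    """Dynamic programming via memoization: each sub-problem is solved once,
    cached, and reused, reducing the cost from exponential to linear."""
    if n < 2:
        return n
    return fib_dp(n - 1) + fib_dp(n - 2)

print(fib_dp(40))  # 102334155, computed almost instantly; fib_naive(40) takes far longer
```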

In this article, we will focus chiefly on the elements of dynamic programming, particularly for discrete and finite games. However, dynamic programming exhibits a variety of limitations, some of which are addressed by model-free approaches to reinforcement learning and others by combining dynamic programming with artificial neural networks, an approach once called neurodynamic programming. More broadly, the marriage of reinforcement learning and artificial neural networks is termed deep reinforcement learning. These models incorporate the strengths of deep learning within reinforcement learning techniques. The most popular of these algorithms include Deep Q-Networks (DQN), introduced by DeepMind in 2013, which leverage deep learning to approximate the Q-function. Since the inability to scale beyond exact, tabular value functions is one of the shortcomings of classical reinforcement learning, these algorithms represent a major improvement to the paradigm.
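A minimal sketch of that idea, assuming PyTorch and an invented network size (none of these details come from the article): a small neural network maps a state vector to one estimated Q-value per action, standing in for the lookup table of classical Q-learning.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates Q(s, ·): input is a state vector, output is one
    estimated action-value per discrete action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

q = QNetwork(state_dim=4, n_actions=2)         # e.g. a CartPole-sized problem
state = torch.rand(1, 4)                       # a dummy observation
greedy_action = q(state).argmax(dim=1).item()  # act greedily w.r.t. the Q estimates
```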

Other shortcomings addressed by DQN include the flexibility to capture nonlinear dynamics, the ability to handle much higher-dimensional state spaces without becoming computationally intractable under the curse of dimensionality, and a greater capacity to generalize over the environment.

Neurodynamic programming represents a step toward leveraging the cognitive paradigm in psychology to address the shortcomings of the purely behaviourist approach. It is worth noting, however, that while scientific progress has been made in understanding the hierarchical organization and processing of lower-level perceptual information, the scaffolding of that information up to thought and consciousness remains, more or less, scientifically elusive. For this reason, artificial neural networks (ANNs) still lack the generalization capacity of human intelligence, which tends to learn from far smaller samples than ANNs require. We will discuss the implications of adopting the principles of reinforcement learning toward artificial general intelligence (AGI) in the last section of the article.

Decision Theory & Control Theory

Before delving into the mathematical elements of dynamic programming and reinforcement learning, it is important to flesh out the relationship between reinforcement learning and decision theory, a branch of both philosophy and mathematics. While decision theory consists primarily of mathematical formalizations of rational choice, it overlaps with the goals of reinforcement learning insofar as reinforcement learning seeks to scaffold its models into successful artificial agents that can interact with complex environments and information landscapes.

Decision theory, also known as choice theory, was developed in the 20th century on the heels of the growing formalization of instrumental reason. Specifically, it uses probability theory to quantify the likelihood of an agent's actions given their preferences. A crowning achievement of this formalization effort was the _von Neumann–Morgenstern utility theorem_. In a nutshell, the theorem states that, under a small set of rationality axioms, an agent chooses among risky prospects as if maximizing the expected utility of the available choices.
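As a toy illustration in Python (the actions, probabilities, and utilities are invented for the example), expected utility weights each outcome's utility by its probability, and the rational agent picks the action with the highest expectation:

```python
# Hypothetical actions, each mapping to (probability, utility) pairs over outcomes.
actions = {
    "safe_bet":  [(1.0, 50.0)],                 # certain, modest payoff
    "risky_bet": [(0.5, 120.0), (0.5, -10.0)],  # high payoff or small loss
}

def expected_utility(outcomes):
    """E[U(a)] = sum over outcomes of p(outcome) * U(outcome)."""
    return sum(p * u for p, u in outcomes)

best = max(actions, key=lambda a: expected_utility(actions[a]))
print({a: expected_utility(o) for a, o in actions.items()}, "->", best)
# {'safe_bet': 50.0, 'risky_bet': 55.0} -> risky_bet
```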

Control theory emerges from the fields of mechanical and electrical engineering and concerns optimizing the states and performance of dynamical systems relative to desired parameters, such as maintaining some steady-state temperature range. The essential mechanism consists of a controller that measures the controlled variable, compares it to a set point, and feeds the difference back into the system as a correction. The broad strokes of control theory mirror the metabolic processes of living organisms, which maintain a set point of internal temperature against variable external conditions. The connection of control theory to decision theory is obvious: both rely on feedback from the environment to maintain or advance the state of the system toward some form of optimality.
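Here is a minimal sketch of that feedback loop, assuming a crude thermostat with purely proportional control (the gain and step count are arbitrary choices for illustration):

```python
def simulate_thermostat(setpoint=21.0, temp=15.0, gain=0.3, steps=25):
    """Each step: measure the controlled variable, compare it to the set point,
    and feed the difference back as a corrective heating or cooling input."""
    history = []
    for _ in range(steps):
        error = setpoint - temp   # feedback signal
        temp += gain * error      # proportional correction nudges temp toward the set point
        history.append(round(temp, 2))
    return history

print(simulate_thermostat()[-3:])  # the final readings have converged on the 21.0 set point
```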

Mathematically, subsets of both control and decision problems can be reduced to optimization problems solvable through dynamic programming. Dynamic programming tackles general stochastic optimal control problems (which are afflicted by the curse of dimensionality, meaning that computational requirements grow exponentially with the number of state variables) by decomposing them into smaller sub-problems and computing the value function. As we demonstrate the rudiments of reinforcement learning, we will delve into the heart of dynamic programming: the recursive relationship between the state and value functions of the agent.
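Concretely, that recursive relationship is the Bellman equation, previewed here in standard notation (not yet introduced in this article): V^π is the value function under policy π, P the state-transition probabilities, R the reward, and γ the discount factor.

V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \, V^{\pi}(s') \right]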

Reinforcement learning and decision theory overlap in defining a procedure for maximizing reward or utility. However, whereas utility is explicitly defined in decision theory, which aims to model economic behaviour, in reinforcement learning utility is replaced by cumulative reward. Different policies, relative to different task goals, can be applied toward maximizing cumulative reward, which, as we will see, depends on managing the trade-off between exploration and exploitation, known as the exploration-exploitation dilemma.
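A standard (though not the only) way to manage that trade-off is the ε-greedy rule, sketched below with invented reward estimates: with probability ε the agent explores a random action, and otherwise it exploits the action with the highest estimate so far.

```python
import random

def epsilon_greedy(q_values: dict, epsilon: float = 0.1) -> str:
    """q_values maps each action to its current estimated cumulative reward.
    With probability epsilon, explore; otherwise exploit the best estimate."""
    if random.random() < epsilon:
        return random.choice(list(q_values))      # exploration
    return max(q_values, key=q_values.get)        # exploitation

estimates = {"left": 0.4, "right": 1.2, "stay": 0.7}  # hypothetical estimates
print(epsilon_greedy(estimates, epsilon=0.1))         # usually "right", occasionally a random action
```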

Let's begin by outlining the ontology underlying reinforcement models.

States, Actions & Rewards

Reinforcement learning leverages the theoretical apparatus of decision theory to construct models comprising agents, environments, and a dynamic evolution rule. The evolution rule permits an agent to pursue rewards within its environment, which it perceives through observations.

The agent is defined by a mapping from the state of the environment to a decision; we call a particular decision an action. The mapping from the present state to an action is called the policy. The policy guides the agent's behaviour by prescribing which action to take in each state.

Formally, therefore, a policy is a function that maps a state to an action. It can be represented by the conditional probability of an action given the current state, where the Greek letter π denotes the policy: π(a | s) is the probability of taking action a when the current state is s.
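A minimal sketch of such a stochastic policy in Python (the states, actions, and probabilities are invented for illustration):

```python
import random

# π(a | s): for each state, a probability distribution over actions (made-up numbers).
policy = {
    "low_battery":  {"recharge": 0.9, "explore": 0.1},
    "full_battery": {"recharge": 0.1, "explore": 0.9},
}

def act(state: str) -> str:
    """Sample an action from the conditional distribution π(· | state)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act("low_battery"))  # "recharge" about 90% of the time
```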
