1 Deep Reinforcement Learning. CS Multimedia Software Engineering. Mohammad H. Mofrad, University of Pittsburgh. Thursday, October 27, 2016
2 Preface: Machine Learning, Reinforcement Learning, Artificial Neural Network, Deep Learning, Deep Reinforcement Learning
3 Machine Learning in a Nutshell: y = f(x). Supervised learning maps X to Y with a discrete output space (classification) or a continuous output space (regression). Unsupervised learning works on a discrete space (clustering) or a continuous space (dimensionality reduction). Reinforcement learning is a third paradigm, covered next. Prof. Adriana Kovashka, ML basic notes, CS PITT
4 Reinforcement Learning: at each step the agent observes state s_t, takes action a_t, and receives reward r_t from the environment, until a terminal state is reached. The resulting trajectory is s_0, a_0, r_1, s_1, a_1, r_2, s_2, ..., r_{n-1}, s_{n-1}, a_{n-1}, r_n, s_n. https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node28.html
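The interaction loop above (state, action, reward, next state, until a terminal state) can be sketched as follows. The toy environment and the always-pick-action-1 policy are hypothetical stand-ins for illustration, not part of the original deck:

```python
def step(state, action):
    """Toy environment: reward 1 for action 1, episode ends at state 5."""
    next_state = state + 1
    reward = 1 if action == 1 else 0
    done = next_state >= 5
    return next_state, reward, done

def run_episode(policy):
    """Collect the trajectory s0, a0, r1, s1, a1, r2, ... until terminal."""
    state, trajectory, done = 0, [], False
    while not done:
        action = policy(state)
        next_state, reward, done = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory

trajectory = run_episode(lambda s: 1)  # trivial policy: always action 1
```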
5 Q-Learning: select an action, observe the reward, update the Q-table. Update rule: Q(s_t, a_t) ← Q(s_t, a_t) + α (r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)), where α is the learning rate, γ is the discount factor, r_{t+1} + γ max_a Q(s_{t+1}, a) is the learned value (the estimate of optimal future value at the new state), and Q(s_t, a_t) is the old value. https://en.wikipedia.org/wiki/Q-learning
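The update rule on this slide can be written as a minimal tabular sketch; the particular states, actions, and reward below are illustrative assumptions:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q[(s, a)]

Q = defaultdict(float)        # Q-table, all entries start at 0
actions = [0, 1]
new_value = q_update(Q, s=0, a=1, r=1.0, s_next=1, actions=actions)
# With a zero-initialized table: new value = 0.1 * (1.0 + 0.9 * 0 - 0) = 0.1
```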
6 Artificial Neural Network (ANN): a system of loosely coupled neural units modeling brain neurons connected by axons. Mathematical model: f: X → Y. Network structure: f is the neuron's network function, f(x) = K(Σ_i w_i g_i(x)), where x = (x1, x2, ..., xn) is the input vector, w = (w1, w2, ..., wn) is the weight vector, g = (g1, g2, ..., gn) is the composition of other functions, and K is the activation function. Learning: ANNs learn using a cost function C = E[(f(x) - y)^2] such as the mean squared error C' = 1/N Σ_i (f(x_i) - y_i)^2, where N is the number of samples. https://en.wikipedia.org/wiki/Artificial_neural_network
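A small sketch of the neuron function f(x) = K(Σ_i w_i g_i(x)) and the MSE cost from this slide, under the simplifying assumptions that each g_i is the identity and K is a ReLU; the weights and samples are made up for illustration:

```python
def neuron(x, w, K=lambda z: max(0.0, z)):
    """f(x) = K(sum_i w_i * g_i(x)); g_i taken as identity, K as ReLU."""
    return K(sum(wi * xi for wi, xi in zip(w, x)))

def mse(f, xs, ys):
    """C' = 1/N * sum_i (f(x_i) - y_i)^2."""
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

w = [0.5, -0.25]
xs = [[1.0, 2.0], [2.0, 0.0]]   # two samples
ys = [0.5, 0.0]                 # their targets
cost = mse(lambda x: neuron(x, w), xs, ys)
```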
7 Deep Learning: a class of ANNs that combines many layers of nonlinear processing units for feature extraction, classification, and pattern recognition. It combines heterogeneous algorithms, mainly unsupervised, and learns different levels of representation. The output can be used as a feature vector for other classification schemes. https://en.wikipedia.org/wiki/Deep_learning
8 Deep Reinforcement Learning
9 DeepMind: British AI company founded in 2010, acquired by Google in 2014. Work includes a neural network that learns how to play video games, the Neural Turing Machine, healthcare applications (searching for early signs of diseases leading to blindness; differentiating between healthy and cancerous tissues in the head and neck area), and AlphaGo. https://en.wikipedia.org/wiki/DeepMind https://deepmind.com/research/alphago/
10 Selected Article. Title: Human-level control through deep reinforcement learning. Authors: Volodymyr Mnih et al. Affiliation: Google DeepMind. Journal: Nature, International Weekly Journal of Science. Volume: 518. Issue: 7540. Date: 2015. Pages: 529-533. Journal Impact Factor: . Citations: 542. https://en.wikipedia.org/wiki/DeepMind
11 Reinforcement Learning in Atari: the agent takes action a_t, observes state s_t, and receives reward r_t. State representation = screen pixels. David Silver, "Deep Reinforcement Learning"
12 Deep Q-Learning (DQL). Deep Learning + Reinforcement Learning? Start from the Bellman equation with value function Q(s, a): Q(s_t, a_t) = r + γ max_a Q(s_{t+1}, a), which the value iteration algorithm can solve recursively for the optimal value. Represent the value function by a deep Q-network with weights w: Q(s_t, a, w) ≈ Q(s_t, a). Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning."
13 Deep Q-Learning (DQL). Define a new objective function by the mean-squared error (MSE) in Q-values: L = E[(r + γ max_{a'} Q(s_{t+1}, a', w) − Q(s_t, a_t, w))^2], leading to the following Q-learning gradient: ∂L(w)/∂w = E[(r + γ max_{a'} Q(s_{t+1}, a', w) − Q(s_t, a_t, w)) ∂Q(s_t, a_t, w)/∂w]. Optimize the objective function by stochastic gradient descent (SGD) using this gradient. Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning."
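The loss and gradient above can be illustrated with a deliberately tiny linear Q-function in place of the deep network; the linear form, the two-action setup, and the numbers are assumptions for readability, not the paper's architecture:

```python
def q(s, a, w):
    """Hypothetical linear Q-function: Q(s, a, w) = w[a] * s."""
    return w[a] * s

def sgd_step(w, s, a, r, s_next, gamma=0.9, lr=0.01):
    """One SGD step on the squared TD error for the linear Q above."""
    # TD target: r + gamma * max_a' Q(s', a', w)
    target = r + gamma * max(q(s_next, a2, w) for a2 in range(len(w)))
    td_error = target - q(s, a, w)
    # For Q(s,a,w) = w[a]*s, dQ/dw[a] = s, so the descent direction
    # on the squared error moves w[a] by lr * td_error * s.
    w[a] += lr * td_error * s
    return td_error

w = [0.0, 0.0]                      # one weight per action
err = sgd_step(w, s=1.0, a=0, r=1.0, s_next=0.5)
# target = 1.0 + 0.9 * 0 = 1.0, td_error = 1.0, w[0] becomes 0.01
```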
14 The Convolutional Neural Network in Atari: end-to-end learning of values Q(s, a) from pixels s. Input state s is a stack of raw pixels from the last 4 frames; output is Q(s, a) for 18 joystick/button positions; reward is the change in score for that step. Architecture: (1) stack of 4 previous frames; (2) convolutional layer of rectified linear units; (3) convolutional layer of rectified linear units; (4) fully-connected layer of rectified linear units; (5) fully-connected linear output layer. David Silver, "Deep Reinforcement Learning"
15 Schematic illustration of the Deep Q-Network. Figure 1 | Schematic illustration of the convolutional neural network. The details of the architecture are explained in the Methods. The input to the neural network consists of an image produced by the preprocessing map φ, followed by three convolutional layers (note: the snaking blue line symbolizes sliding of each filter across the input image) and two fully connected layers with a single output for each valid action. Each hidden layer is followed by a rectifier nonlinearity (that is, max(0, x)). Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning."
16 Deep Q-Network characteristics
Layer | Input | Filter size | Stride | # Filters | Activation | Output
Convolution 1 | 84 x 84 x 4 | 8 x 8 | 4 | 32 | ReLU* | 20 x 20 x 32
Convolution 2 | 20 x 20 x 32 | 4 x 4 | 2 | 64 | ReLU | 9 x 9 x 64
Convolution 3 | 9 x 9 x 64 | 3 x 3 | 1 | 64 | ReLU | 7 x 7 x 64
Fully connected 4 | 7 x 7 x 64 | - | - | - | ReLU | 512
Fully connected 5 | 512 | - | - | - | Linear | 18
*Rectified Linear Unit (ReLU)
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning."
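The output sizes in this table can be sanity-checked with the standard valid-convolution formula, out = (in − filter) / stride + 1 (an assumption consistent with the sizes shown, since no padding is listed):

```python
def conv_out(size, filt, stride):
    """Spatial output size of a valid (unpadded) convolution."""
    return (size - filt) // stride + 1

s1 = conv_out(84, 8, 4)   # Convolution 1: 84 -> 20
s2 = conv_out(s1, 4, 2)   # Convolution 2: 20 -> 9
s3 = conv_out(s2, 3, 1)   # Convolution 3: 9 -> 7
flat = s3 * s3 * 64       # flattened input to fully connected layer 4
```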
17 Results
18 Training Curves: Space Invaders and Seaquest. Figure 2 | Training curves tracking the agent's average score and average predicted action-value. a, Each point is the average score achieved per episode after the agent is run with an ε-greedy policy (ε = 0.05) for 520k frames on Space Invaders. b, Average score achieved per episode for Seaquest. c, Average predicted action-value on a held-out set of states on Space Invaders. Each point on the curve is the average of the action-value Q computed over the held-out set of states. Note that Q-values are scaled due to clipping of rewards (see Methods). d, Average predicted action-value on Seaquest. See Supplementary Discussion for details. Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning."
19 Comparing DQN performance: audio was disabled; 30 evaluation episodes; normalized results. Figure 3 | Comparison of the DQN agent with the best reinforcement learning methods in the literature. The performance of DQN is normalized with respect to a professional human games tester (that is, 100% level) and random play (that is, 0% level). Note that the normalized performance of DQN, expressed as a percentage, is calculated as: 100 × (DQN score − random play score)/(human score − random play score). It can be seen that DQN outperforms competing methods (also see Extended Data Table 2) in almost all the games, and performs at a level that is broadly comparable with or superior to a professional human games tester (that is, operationalized as a level of 75% or above) in the majority of games. Audio output was disabled for both human players and agents. Error bars indicate s.d. across the 30 evaluation episodes, starting with different initial conditions. Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning."
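The normalization formula from this caption, applied to hypothetical scores (the numbers below are made up for illustration):

```python
def normalized_score(dqn, human, random_play):
    """100 * (DQN score - random play score) / (human score - random play score)."""
    return 100.0 * (dqn - random_play) / (human - random_play)

score = normalized_score(dqn=400.0, human=500.0, random_play=100.0)
# A score of 75 or above counts as "broadly human-level" in the figure.
```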
20 Last hidden layer representation: t-SNE representation of game states. Figure 4 | Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced while playing Space Invaders. The plot was generated by letting the DQN agent play for 2 h of real game time and running the t-SNE algorithm on the last hidden layer representations assigned by DQN to each experienced game state. The points are coloured according to the state values (V, maximum expected reward of a state) predicted by DQN for the corresponding game states (ranging from dark red (highest V) to dark blue (lowest V)). The screenshots corresponding to a selected number of points are shown. The DQN agent predicts high state values for both full (top right screenshots) and nearly complete screens (bottom left screenshots) because it has learned that completing a screen leads to a new screen full of enemy ships. Partially completed screens (bottom screenshots) are assigned lower state values because less immediate reward is available. The screens shown on the bottom right and top left and middle are less perceptually similar than the other examples but are still mapped to nearby representations and similar values because the orange bunkers do not carry great significance near the end of a level. With permission from Square Enix Limited. Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning."
21 Visualization of the learned value function: Breakout game. Extended Data Figure 2a | A visualization of the learned value function on the game Breakout. At time points 1 and 2, the state value is predicted to be ~17 and the agent is clearing the bricks at the lowest level. Each of the peaks in the value function curve corresponds to a reward obtained by clearing a brick. At time point 3, the agent is about to break through to the top level of bricks and the value increases to ~21 in anticipation of breaking out and clearing a large set of bricks. At point 4, the value is above 23 and the agent has broken through. After this point, the ball will bounce at the upper part of the bricks clearing many of them by itself. With permission from Atari Interactive, Inc. Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning."
22 Visualization of the learned value function: Pong game. Extended Data Figure 2b | A visualization of the learned action-value function on the game Pong. At time point 1, the ball is moving towards the paddle controlled by the agent on the right side of the screen and the values of all actions are around 0.7, reflecting the expected value of this state based on previous experience. At time point 2, the agent starts moving the paddle towards the ball and the value of the 'up' action stays high while the value of the 'down' action falls to −0.9. This reflects the fact that pressing 'down' would lead to the agent losing the ball and incurring a reward of −1. At time point 3, the agent hits the ball by pressing 'up' and the expected reward keeps increasing until time point 4, when the ball reaches the left edge of the screen and the value of all actions reflects that the agent is about to receive a reward of 1. Note, the dashed line shows the past trajectory of the ball purely for illustrative purposes (that is, not shown during the game). With permission from Atari Interactive, Inc. Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning."
23 Conclusion: Deep nets are most suitable when dealing with abundant high-dimensional training data. In DQN, the deep network acts as the function approximator for reinforcement learning, extracting high-level features from high-dimensional raw sensory data. Final quote: "Reinforcement learning + deep learning = AI"
24 References. [1] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533. [2] Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." arXiv preprint arXiv: (2016). [3] Mnih, Volodymyr, et al. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv: (2013).