REINFORCEMENT LEARNING
–
Q-LEARNING ALGORITHM
SEAN WILLIAMS 11/21/2015
REINFORCEMENT VS SUPERVISED
REINFORCEMENT
• No training data
• Exploration
• A reward is provided for each action
• Positive or negative
SUPERVISED
• Sufficient training data required
• Beware of overfitting
• Needs a teacher or a labeled database
WHY IS IT USEFUL?
AGENT/ENVIRONMENT
[Diagram: the Agent sends an Action to the Environment; the Environment returns a Reward and a new State to the Agent]
Q-LEARNING
• Continuously iterates over the state space until convergence
is reached
• Q table which stores results of feedback
• Initialized to all zeros or random values
• Reward table which stores positive and negative feedback
• Q(s, a) = Q(s, a) + r(s, a) + λ · max_a’ Q(s’, a’)
• s = state, a = action, r = reward, λ = discount factor
• Variant with a learning rate α on the reward: Q(s, a) = Q(s, a) + α · r(s, a) + λ · max_a’ Q(s’, a’)
• Two options
• Randomly choose an action – USE TO TRAIN (encourages exploration)
• Choose action with highest Q value – USE AFTER TRAINING
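As a rough, illustrative sketch of the two options above (not the project's actual code; the names NUM_STATES, NUM_ACTIONS, Q, and chooseDirection are placeholders):

  // Illustrative sketch only: one way to implement the two options above.
  const int NUM_STATES  = 4;           // N, S, E, W
  const int NUM_ACTIONS = 4;           // turn toward N, S, E, W
  int Q[NUM_STATES][NUM_ACTIONS];      // initialized to all zeros before training

  int chooseDirection(int state, bool exploring) {
    if (exploring) {
      // Option 1 (training): pick a direction uniformly at random.
      return random(NUM_ACTIONS);      // Arduino random(n) returns 0..n-1; seed with randomSeed()
    }
    // Option 2 (after training): pick the direction with the highest Q value.
    int best = 0;
    for (int a = 1; a < NUM_ACTIONS; a++) {
      if (Q[state][a] > Q[state][best]) {
        best = a;
      }
    }
    return best;
  }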
FINAL PROJECT GOALS
• Train robot to drive in circle – complete
• Train the robot to navigate any maze AND drive into a “box”
– complete
PROJECT CONSTRAINTS
• No GPS, so I initialize the state as NORTH, no matter what direction the car
is actually pointing
• Reward is initialized as shown below:
• 0 if the car turns EAST or WEST from NORTH or SOUTH
• -10 if the car turns SOUTH from EAST or WEST – I want the car to move forward (North)
• 10 if the car turns NORTH from EAST or WEST
• I don’t allow the car to reverse
• More sensors would be needed to verify that the car turned correctly.
• I built a loop into the robot that waits for feedback from a remote control telling it to turn a bit further.
• The robot learns from my feedback and adjusts the turn time (see the sketch after this list).
• An accelerometer (to detect rotation about the x-axis) OR GPS would be needed so the car knows its current direction.
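A rough sketch of the remote-control feedback loop described above. It assumes the common IRremote library (v2-style API); the pin number, button codes, nudge size, and helper names (turnSmallStep, waitForTurnFeedback) are hypothetical placeholders, not the project's actual code.

  #include <IRremote.h>

  const int RECV_PIN = 11;                        // placeholder pin
  IRrecv irrecv(RECV_PIN);
  decode_results results;

  const unsigned long TURN_MORE_CODE = 0xFF18E7;  // placeholder: "turn a bit further" button
  const unsigned long DONE_CODE      = 0xFF38C7;  // placeholder: "turn looks correct" button

  int turnTimeMs = 600;                           // learned duration of a 90-degree turn

  void turnSmallStep() { /* motor commands omitted */ }

  void setup() {
    irrecv.enableIRIn();                          // start the IR receiver
  }

  // Block until the operator confirms the turn, nudging the car and the
  // learned turn time on each "turn more" press.
  void waitForTurnFeedback() {
    while (true) {
      if (irrecv.decode(&results)) {
        unsigned long code = results.value;
        irrecv.resume();                          // ready for the next IR code
        if (code == TURN_MORE_CODE) {
          turnSmallStep();                        // turn a bit further
          turnTimeMs += 20;                       // remember the correction
        } else if (code == DONE_CODE) {
          return;                                 // operator is satisfied
        }
      }
    }
  }

  void loop() {
    // ... after each commanded turn during training:
    waitForTurnFeedback();
  }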
Q-LEARNING AGENT - SUMMARY
• Arduino loops forever
• Train the car with the Q-Learning algorithm, choosing the direction randomly (5X) – this builds the learned policy
• After training complete, execute the “learned policy.”
• It SHOULD follow the optimal path every time, but it may not (Why?)
• Chooses the direction based on the max Q value
• I have two main functions
• QLearningAgent – training function which selects values randomly
• QLearningAgentSelectFromQ – Selects the direction with the highest Q
value
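A simplified sketch of how this training-then-execution split might look inside the Arduino loop. QLearningAgent and QLearningAgentSelectFromQ are the two function names from this deck; the LED pin, episode counter, and the assumption that both functions take the start state (0 = NORTH) as an argument are mine, and pinMode(LED_PIN, OUTPUT) is assumed to be done in setup().

  const int LED_PIN = 13;
  const int TRAINING_EPISODES = 5;   // "(5X)" in the summary above
  int episodesDone = 0;

  void loop() {
    if (episodesDone < TRAINING_EPISODES) {
      QLearningAgent(0);             // train one episode, starting "facing NORTH"
      episodesDone++;
      if (episodesDone == TRAINING_EPISODES) {
        digitalWrite(LED_PIN, HIGH); // LED signals that training is complete
      }
    } else {
      QLearningAgentSelectFromQ(0);  // follow the learned policy (max-Q direction)
    }
  }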
VIDEOS
• Q-Learning – After training
• Training 1
• Training 1/2
• Training 2
• Training 3
• Training 4
• Training 5
• Driving in circle
CONCLUSION
• Q-learning is a lightweight reinforcement learning algorithm that can be used to train robots without any training data.
• A Wi-Fi shield is needed in order to inspect the Q-table and see how the agent makes decisions while navigating the maze.
• GPS (or an accelerometer) needs to be added for the agent to know its current direction.
APPENDIX A
• Q-Learning Algorithm
• Q-Learning Agent Code
• Picture of Maze
• Pictures of robotic car
Q-LEARNING ALGORITHM
For all s ∈ S, a ∈ A
    Q(s, a) = 0
Repeat
    Select a state s randomly or by max()
    Repeat
        Select an action a and carry it out
        Obtain reward r and new state s’
        Q(s, a) = Q(s, a) + r(s, a) + λ · max_a’ Q(s’, a’)
        s = s’
    Until s is an ending state or the time limit is reached
Until Q converges
Q-LEARNING AGENT
int QLearningAgent(int currentState){
  while(!goalState()){
    int action = selectAction(currentState);           // explore: random direction during training
    int newState = takeAction(currentState, action);   // drive the car and observe where it ends up
    int reward = R[currentState][action];               // look up feedback in the reward table
    int maxQ = getMaxQ(newState);                        // best Q value reachable from the new state
    Q[currentState][action] = Q[currentState][action]
        + (reward + (gammaQLearning * maxQ));            // Q-update used in this project
    currentState = newState;
    delay(1000);                                         // give the car time to finish the move
  }
  delay(10000);                                          // pause at the goal before the next episode
  return currentState;
}
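The helpers referenced above (selectAction, takeAction, goalState, getMaxQ) are not shown in the deck. Below is a hypothetical sketch of getMaxQ and a training-time selectAction, assuming the 4x4 Q table (0 = N, 1 = S, 2 = E, 3 = W) that matches the reward-table code later in this appendix; takeAction and goalState depend on the motors and the distance sensor and are not sketched.

  // Hypothetical bodies, not the project's actual code.
  int getMaxQ(int state) {
    int maxQ = Q[state][0];
    for (int a = 1; a < 4; a++) {
      if (Q[state][a] > maxQ) {
        maxQ = Q[state][a];          // best Q value reachable from this state
      }
    }
    return maxQ;
  }

  int selectAction(int state) {
    return random(4);                // training: random direction, as described earlier
  }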
HARDWARE/SOFTWARE
• Arduino Uno Rev 3
• Parallax Ping (28015) – Distance sensor
• Adafruit Motor Shield Rev. 2.3 (I needed to solder the pins which was very
difficult)
• Vilros Micro Servo 99 (SG90) – Spins the sensor in different directions
• Multi-colored LED light – used to signal when training is complete
• 8 AA batteries to power motor shield
• 1 9V battery to power Arduino
• Infrared receiver – to receive signal from remote control
• Remote control - any remote control will do
• 4-wheel robotic smart car chassis
• Arduino IDE v1.0.6 (older version)
• Windows 8 OS
REWARD TABLE
void initR(){
R[0][0] = 0; //N > N
R[0][1] = -1; //N > S
R[0][2] = 0; //N > E
R[0][3] = 0; //N > W
R[1][0] = -1; //S > N
R[1][1] = -1; //S > S
R[1][2] = 0; //S > E
R[1][3] = 0; //S > W
R[2][0] = 10; //E > N
R[2][1] = -10;//E > S
R[2][2] = -1; //E > E
R[2][3] = -1; //E > W
R[3][0] = 10; //W > N
R[3][1] = -10;//W > S
R[3][2] = -1; //W > E
R[3][3] = -1; //W > W
}
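The Q table itself is not shown in the deck; since an earlier slide says it starts at all zeros, a matching (hypothetical) initQ() with the same 4x4 layout as initR() might look like this:

  int Q[4][4];   // Q[currentDirection][chosenDirection], same indexing as R

  void initQ(){
    for (int s = 0; s < 4; s++){
      for (int a = 0; a < 4; a++){
        Q[s][a] = 0;
      }
    }
  }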
APPENDIX B - INTRODUCTION TO
ARTIFICIAL INTELLIGENCE, WOLFGANG
ERTEL
• This is from a previous AI project, but it is a good example of how the Q-Learning algorithm was used to train a robot to crawl.
ROBOTICS EXAMPLE – CAN IT LEARN TO
CRAWL FORWARD WITH Q-LEARNING?
Introduction to Artificial Intelligence, Wolfgang Ertel
[Diagram: crawling robot with two joints – Gx moves the arm Left/Right, Gy moves it Up/Down]
Q-LEARNING TABLE
Reward Table (diagram): the arm has two joints, Gx (Left/Right) and Gy (Up/Down); a reward of 1 is attached to one transition (all others are 0 – see the implemented run on the next slide).

Q-Learning table, initial state (rows = current state, columns = next state; "-" marks a transition that is not possible):

             Up,Left   Up,Right   Down,Right   Down,Left
Up,Left         -          0          -            0
Up,Right        0          -          0            -
Down,Right      -          0          -            0
Down,Left       0          -          0            -
Q-LEARNING IMPLEMENTED
Run 1 – Down,Right -> Down,Left => reward = 1, everything else 0; 1,000,000 iterations:

             Up,Left   Up,Right   Down,Right   Down,Left
Up,Left         -       124,942       -         124,974
Up,Right     249,178       -       249,940         -
Down,Right      -       124,621       -         250,932
Down,Left    250,642       -       250,238         -

Run 2 – Down,Right -> Down,Left => reward = 1, Up,Left -> Up,Right => reward = 1, everything else 0; 1,000,000 iterations:

             Up,Left   Up,Right   Down,Right   Down,Left
Up,Left         -       250,338       -         127,794
Up,Right     250,110       -       250,516         -
Down,Right      -       125,144       -         249,789
Down,Left    249,810       -       249,564         -
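For reference, here is a minimal, self-contained C++ sketch (desktop-style, not Arduino) of run 1 above, using the same accumulate-style update as the rest of this deck. The discount factor value and the random exploration are assumptions, so the resulting numbers will be of the same order but will not match the slide exactly.

  #include <cstdio>
  #include <cstdlib>
  #include <algorithm>

  enum State { UP_LEFT, UP_RIGHT, DOWN_RIGHT, DOWN_LEFT, NUM_STATES };

  // Neighbours reachable from each state (each move toggles exactly one joint).
  const int NEIGHBOURS[NUM_STATES][2] = {
    { UP_RIGHT,   DOWN_LEFT  },  // from Up,Left
    { UP_LEFT,    DOWN_RIGHT },  // from Up,Right
    { DOWN_LEFT,  UP_RIGHT   },  // from Down,Right
    { DOWN_RIGHT, UP_LEFT    },  // from Down,Left
  };

  double Q[NUM_STATES][NUM_STATES] = {};   // Q[s][s'] for the 8 legal transitions
  const double GAMMA = 0.9;                // discount factor (assumed value)

  // Reward 1 only for Down,Right -> Down,Left (the transition rewarded in run 1).
  double reward(int s, int next) {
    return (s == DOWN_RIGHT && next == DOWN_LEFT) ? 1.0 : 0.0;
  }

  double maxQ(int s) {
    return std::max(Q[s][NEIGHBOURS[s][0]], Q[s][NEIGHBOURS[s][1]]);
  }

  int main() {
    int s = UP_LEFT;
    for (long i = 0; i < 1000000; i++) {          // 1,000,000 iterations, as on the slide
      int next = NEIGHBOURS[s][rand() % 2];       // explore: random legal move
      Q[s][next] += reward(s, next) + GAMMA * maxQ(next);
      s = next;
    }
    for (int from = 0; from < NUM_STATES; from++) {
      for (int to = 0; to < NUM_STATES; to++) {
        printf("%12.0f", Q[from][to]);
      }
      printf("\n");
    }
    return 0;
  }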
APPENDIX C – PREVIOUS AI PROJECT
LEFT TURN, RIGHT TURN, STOP TRAINING
• This was also from a previous AI project. I was able to train my
robotic car to stop, turn left, and turn right.
• I included it since it has videos.
STOP TRAINING
This was from a previous project. This training is not used in the current project.
int updateQStop(){
  int r = RStop();                    // -1 if the stop distance is greater than the goal distance, 1 otherwise
  int d = distanceToStopGoal();       // how far the stop point was from the goal
  distanceToStop = distanceToStop
      + (r * gammaStop * d);          // adjust the learned stopping distance
  return distanceToStop;
}
• Reward is -1 when the stop distance is greater than the goal distance
• Reward is 1 otherwise
• http://youtu.be/n8eJrbDiP3A
LEFT TURN TRAINING
• This was from a previous project. This training is not used in the
current project.
• int r = RLeft(startDistance, endMinusStartDistance);
• turnLeftTime = turnLeftTime + (r * gammaLeftTurn * abs(endMinusStartDistance));
• http://youtu.be/pISBHxOMTSU
Reward   Description
-1       When the car is further away from the wall
20       If the car is too close to the wall
0        When the car moves nearly parallel to the wall
RIGHT TURN TRAINING
• This was from a previous project. This training is not used in the current project.
• double r = RRight(distance);
• long d = distanceToRightTurnGoal(distance);
• turnRightTime = turnRightTime + (r * gammaRightTurn * d);
• http://youtu.be/2ubkShxqizo
Reward   Description
-1       When the car is further away from the wall
20       If the car is too close to the wall
0        When the car moves nearly parallel to the wall
Editor Notes
• π = policy
• Can this robot learn to crawl forward? Movements to the right are rewarded with positive values; movements to the left are punished with negative values.