2. REINFORCEMENT VS SUPERVISED
REINFORCEMENT
• No training data
• Exploration
• Reward is provided for each action
• Positive or negative
SUPERVISED
• Sufficient training data
• Beware of overfitting
• Need a teacher or database
5. Q-LEARNING
• Continuously iterates over the state space until convergence
is reached
• Q table which stores results of feedback
• Initialized to all zeros or random values
• Reward table which stores positive and negative feedback
• Q(s, a) = Q(s, a) + r(s, a) + λ · max_a’ Q(s’, a’)
• s = state, a = action, r = reward, λ = discount factor
• With a learning rate α: Q(s, a) = Q(s, a) + α · r(s, a) + λ · max_a’ Q(s’, a’)
• Two options for choosing an action (see the sketch below):
• Randomly choose an action – USE TO TRAIN (encourages exploration)
• Choose the action with the highest Q value – USE AFTER TRAINING
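A minimal sketch of those two options, assuming an int Q table over 4 states and 4 directions; these function names are illustrative, not from the project code:

// Illustrative sketch of the two action-selection strategies (assumed names).
int Q[4][4];                          // Q table: 4 states x 4 directions, starts at zero

// Option 1 (training): pick a direction at random to encourage exploration.
int selectRandomAction(){
  return random(4);                   // Arduino random(4) returns 0..3
}

// Option 2 (after training): pick the direction with the highest Q value.
int selectBestAction(int state){
  int best = 0;
  for (int a = 1; a < 4; a++){
    if (Q[state][a] > Q[state][best]) best = a;
  }
  return best;
}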
6. FINAL PROJECT GOALS
• Train the robot to drive in a circle – complete
• Train the robot to navigate any maze AND drive into a “box” – complete
7. PROJECT CONSTRAINTS
• No GPS, so I initialize the state as NORTH, no matter what direction the car is actually pointing
• Reward is initialized as shown below:
• 0 if the car turns EAST or WEST from NORTH or SOUTH
• -10 if the car turns SOUTH from EAST or WEST – I want the car to move forward (North)
• 10 if the car turns NORTH from EAST or WEST
• I don’t allow the car to reverse
• Need more sensors to ensure the car turned correctly
• I built a loop into the robot that waits for feedback from a remote control telling it to turn a bit further (a sketch of this loop follows the list)
• The robot learns from my feedback and adjusts the turn time
• Needed an accelerometer to detect changes in rotation about the x-axis OR GPS so that it knows the current direction
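A minimal sketch of that remote-control feedback loop, using the classic IRremote v2 API (consistent with the old Arduino IDE listed in the hardware slide); the receiver pin and key code here are assumptions:

// Illustrative sketch of the remote-feedback loop; pin and key code assumed.
#include <IRremote.h>

IRrecv irrecv(11);                 // IR receiver pin (assumed); call irrecv.enableIRIn() in setup()
decode_results results;
long turnTime = 500;               // learned turn duration in ms (assumed starting value)

void waitForTurnFeedback(){
  while (true) {
    if (!irrecv.decode(&results)) continue;   // wait for a button press
    if (results.value == 0xFF18E7) {          // "turn a bit further" key (assumed code)
      turnTime += 50;                         // robot learns a longer turn time
      // ... nudge the wheels a bit further here ...
      irrecv.resume();                        // listen for the next press
    } else {                                  // any other key: the turn was correct
      irrecv.resume();
      break;
    }
  }
}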
8. Q-LEARNING AGENT - SUMMARY
• Arduino loops forever
• Train the car with the Q-Learning algorithm, randomly choosing the direction (5 times) – this creates the learning policy
• After training is complete, execute the “learned policy.”
• It SHOULD follow the optimal path every time, but it may not (Why?)
• Chooses the direction based on the max Q value
• I have two main functions
• QLearningAgent – training function which selects actions randomly (see Appendix A)
• QLearningAgentSelectFromQ – selects the direction with the highest Q value (sketched below)
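QLearningAgentSelectFromQ is not listed in the appendix; here is a minimal sketch of what it might look like, reusing the Q table and the goalState/takeAction helpers that appear in the Appendix A code:

// Illustrative sketch (assumed, not the project's exact code): executes the
// learned policy by always taking the direction with the highest Q value.
int QLearningAgentSelectFromQ(int currentState){
  while(!goalState()){
    int bestAction = 0;
    for (int a = 1; a < 4; a++){              // 4 directions: N, S, E, W
      if (Q[currentState][a] > Q[currentState][bestAction]) bestAction = a;
    }
    currentState = takeAction(currentState, bestAction);
    delay(1000);                              // give the motors time to finish the move
  }
  return currentState;
}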
9. VIDEOS
• Q-Learning – After training
• Training 1
• Training 1/2
• Training 2
• Training 3
• Training 4
• Training 5
• Driving in circle
10. CONCLUSION
• Q-learning is a lightweight reinforcement learning algorithm that can be used to train robots without any training data.
• Need a WiFi shield in order to inspect the Q-table and see how the agent makes decisions while navigating the maze.
• Need to add GPS so the agent knows its current direction
11. APPENDIX A
• Q-Learning Algorithm
• Q-Learning Agent Code
• Picture of Maze
• Pictures of robotic car
12. Q-LEARNING ALGORITHM
For all s ∈ S, a ∈ A
    Q(s, a) = 0
Repeat
    Select state s randomly or by max()
    Repeat
        Select an action a and carry it out
        Obtain reward r and new state s’
        Q(s, a) = Q(s, a) + r(s, a) + λ · max_a’ Q(s’, a’)
        s = s’
    Until s is an ending state or time limit reached
Until Q converges
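For example, with the reward table from slide 17 and a Q table of all zeros, the first time the car turns NORTH while facing EAST the update gives Q(EAST, NORTH) = 0 + 10 + λ · 0 = 10, so that turn immediately starts to dominate the EAST row of the Q table.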
13. Q-LEARNING AGENT
int QLearningAgent(int currentState){
selectAction(currentState);
while(!goalState()){
int action = selectAction(currentState);
int newState = takeAction(currentState, action);
int reward = R[currentState][action];
int maxQ = getMaxQ(newState);
Q[currentState][action] = Q[currentState][action]
+ (r + (gammaQLearning * maxQ));
currentState = newState;
delay(1000);
}
delay(10000);
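getMaxQ is not listed; a minimal sketch, under the assumption that Q is a 4x4 int array:

// Illustrative sketch of the getMaxQ helper called above (assumed).
int getMaxQ(int state){
  int maxQ = Q[state][0];
  for (int a = 1; a < 4; a++){
    if (Q[state][a] > maxQ) maxQ = Q[state][a];   // keep the largest Q value in this row
  }
  return maxQ;
}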
14. [Picture of maze]
15. [Pictures of robotic car]
16. HARDWARE/SOFTWARE
• Arduino Uno Rev 3
• Parallax Ping (28015) – Distance sensor
• Adafruit Motor Shield Rev. 2.3 (I needed to solder the pins, which was very difficult)
• Vilros Micro Servo 99 (SG90) – Spins the sensor in different directions
• Multi-colored LED light – used to signal when training is complete
• 8 AA batteries to power motor shield
• 1 9V battery to power Arduino
• Infrared receiver – to receive signal from remote control
• Remote control - any remote control will do
• 4-wheel robotic smart car chassis
• Arduino IDE v 1.0.6 (older version)
• Windows 8 OS
17. REWARD TABLE
void initR(){
R[0][0] = 0; //N > N
R[0][1] = -1; //N > S
R[0][2] = 0; //N > E
R[0][3] = 0; //N > W
R[1][0] = -1; //S > N
R[1][1] = -1; //S > S
R[1][2] = 0; //S > E
R[1][3] = 0; //S > W
R[2][0] = 10; //E > N
R[2][1] = -10;//E > S
R[2][2] = -1; //E > E
R[2][3] = -1; //E > W
R[3][0] = 10; //W > N
R[3][1] = -10;//W > S
R[3][2] = -1; //W > E
R[3][3] = -1; //W > W
}
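For example, when QLearningAgent turns the car NORTH while it is facing EAST, the lookup R[currentState][action] returns R[2][0] = 10, the strongest positive feedback in the table.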
18. APPENDIX B – INTRODUCTION TO ARTIFICIAL INTELLIGENCE, WOLFGANG ERTEL
• This is from a previous AI project, but it is a good example of how the Q-Learning algorithm was used to train a robot to crawl.
19. ROBOTICS EXAMPLE – CAN IT LEARN TO CRAWL FORWARD WITH Q-LEARNING?
Introduction to Artificial Intelligence, Wolfgang Ertel
[Figure: crawling robot with two joints – Gx moves Left/Right, Gy moves Up/Down]
20. Q-LEARNING TABLE
Reward table (joints: Gx moves Left/Right, Gy moves Up/Down): reward = 1 for the Down,right -> Down,left transition; everything else is 0.
Q-Learning table, initial state (“-” marks an impossible transition):
            | Up, Left | Up, Right | Down, right | Down, left
Up, Left    |    -     |     0     |      -      |     0
Up, Right   |    0     |     -     |      0      |     -
Down, right |    -     |     0     |      -      |     0
Down, left  |    0     |     -     |      0      |     -
21. Q-LEARNING IMPLEMENTED
Down,right -> Down,left => Reward = 1, everything else is 0; 1,000,000 iterations:
            | Up, Left | Up, Right | Down, right | Down, left
Up, Left    |    -     |  124,942  |      -      |  124,974
Up, Right   | 249,178  |     -     |   249,940   |     -
Down, right |    -     |  124,621  |      -      |  250,932
Down, left  | 250,642  |     -     |   250,238   |     -
Down,right -> Down,left => Reward = 1, Up,left -> Up,right => Reward = 1, everything else is 0; 1,000,000 iterations:
            | Up, Left | Up, Right | Down, right | Down, left
Up, Left    |    -     |  250,338  |      -      |  127,794
Up, Right   | 250,110  |     -     |   250,516   |     -
Down, right |    -     |  125,144  |      -      |  249,789
Down, left  | 249,810  |     -     |   249,564   |     -
22. APPENDIX C – PREVIOUS AI PROJECT: LEFT TURN, RIGHT TURN, STOP TRAINING
• This was also from a previous AI project. I was able to train my robotic car to stop, turn left, and turn right.
• I included it since it has videos.
23. STOP TRAINING
This was from a previous project. This training is not used in the current project.
int updateQStop(){
  int r = RStop();                 // reward: -1 or 1 (see below)
  int d = distanceToStopGoal();    // distance remaining to the stopping goal
  distanceToStop = distanceToStop
      + (r * gammaStop * d);       // adjust the learned stopping distance
  return distanceToStop;
}
• Reward is -1 when the stop distance is greater than the goal distance; 1 otherwise (a sketch of RStop follows)
• http://youtu.be/n8eJrbDiP3A
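RStop itself is not shown; a minimal sketch matching the reward rule above (the goal-distance variable name is an assumption):

// Illustrative sketch of RStop, matching the rule above; names assumed.
int RStop(){
  if (distanceToStop > stopGoalDistance) return -1;  // stopped past the goal distance
  return 1;                                          // stopped at or inside the goal
}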
24. LEFT TURN TRAINING
• This was from a previous project. This training is not used in the current project.
• int r = RLeft(startDistance, endMinusStartDistance);
• turnLeftTime = turnLeftTime + (r * gammaLeftTurn * abs(endMinusStartDistance));
• http://youtu.be/pISBHxOMTSU
Reward | Description
-1     | When the car is further away from the wall
20     | If the car is too close to the wall
0      | When the car moves nearly parallel to the wall
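RLeft is not shown either; a minimal sketch that mirrors the reward table above (the parallel-tolerance threshold is an assumption):

// Illustrative sketch of RLeft, mirroring the table above; tolerance assumed.
int RLeft(long startDistance, long endMinusStartDistance){
  const long TOLERANCE = 2;                           // band counted as "nearly parallel" (assumed)
  if (endMinusStartDistance > TOLERANCE)  return -1;  // car ended further away from the wall
  if (endMinusStartDistance < -TOLERANCE) return 20;  // car got too close to the wall
  return 0;                                           // car moved nearly parallel to the wall
}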
25. RIGHT TURN TRAINING
• This was from a previous project. This training is not used in the current project.
• double r = RRight(distance);
• long d = distanceToRightTurnGoal(distance);
• turnRightTime = turnRightTime + (r * gammaRightTurn * d);
• http://youtu.be/2ubkShxqizo
Reward | Description
-1     | When the car is further away from the wall
20     | If the car is too close to the wall
0      | When the car moves nearly parallel to the wall
Editor's Notes
π = policy
Can this robot learn to crawl forward?
Movements to the right are rewarded with positive values.
Movements to the left are punished with negative values.