Reinforcement Learning to teach robots Table Tennis
Reinforcement Learning (RL), a subdiscipline of Machine Learning, is learning driven by feedback on an agent's actions, with the goal of maximizing reward in an environment. An agent learns a task through iterative trial-and-error actions, using the feedback from each action to inform future actions so that it earns more reward and avoids punishment. Over many iterations, the agent performs better and better, developing increasingly optimal policies for action.
Let's consider the Table Tennis (ping pong) setup below. The agent is the robot arm, the environment is the space within which the ball travels and lands after being struck by the paddle, and the reward indicates how good a move is. The state is the current configuration of the racket. The agent is punished for negative outcomes: missing the ball, hitting the ball into the net, or the ball landing outside the table after being struck by the paddle.
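To make this agent-environment loop concrete, here is a minimal sketch in Python. The environment class, its placeholder dynamics, and the random policy are illustrative assumptions, not the simulation or learning algorithm used in the paper; the point is only the structure of acting, receiving feedback, and repeating.

```python
import numpy as np


class TableTennisEnv:
    """Toy stand-in for a simulated table-tennis environment (hypothetical,
    for illustration only; the actual study uses a full physics simulation)."""

    def reset(self):
        # State: a simplified snapshot of the racket (and ball) configuration.
        self.state = np.zeros(6)
        return self.state

    def step(self, action):
        # Apply the action (e.g. a desired racket motion), advance the
        # simulation, and return the new state, a reward, and a done flag.
        next_state = self.state + 0.1 * action        # placeholder dynamics
        reward = -np.linalg.norm(next_state[:3])      # placeholder reward
        done = True                                   # one exchange per episode
        self.state = next_state
        return next_state, reward, done


def rollout(env, policy, episodes=10):
    """Generic RL interaction loop: act, observe feedback, repeat."""
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)                    # trial action
            state, reward, done = env.step(action)    # feedback from the environment
            # A learning algorithm would use (state, action, reward) here
            # to improve the policy over many iterations.


# A random policy: the starting point before any learning has happened.
rollout(TableTennisEnv(), policy=lambda s: np.random.uniform(-1, 1, size=6))
```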
Dynamic tasks such as this are much harder for robots than for humans. The main challenges in teaching robots Table Tennis are the quick reaction times needed to perceive the location of the ball, the constant changes in the environment, the precise motions required to land the ball at a particular reward-generating position, and the highly accelerated motions needed for smashes or other rapid manoeuvres.
Pneumatic artificial muscles (PAMs) are used to construct the arm that holds the racket. They execute high-speed hitting motions while still being able to decelerate the arm without exceeding the joint angle range. PAMs are soft actuators (devices that convert energy into motion) with high force sensitivity and high impact resistance, so the arm mechanically adapts to sudden external forces. The pressure range of a PAM can be adjusted to decelerate the motion.
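As a rough illustration of that last point, the sketch below maps normalized actions to muscle pressure commands clamped to a configurable range. The bounds, units, and action-to-pressure mapping are assumptions for illustration only, not the robot's actual control interface.

```python
import numpy as np

# Hypothetical pressure bounds (kPa) for a single PAM; the real robot's
# limits and controller are not specified here and will differ.
P_MIN, P_MAX = 100.0, 300.0


def action_to_pressures(action, p_min=P_MIN, p_max=P_MAX):
    """Map normalized actions in [-1, 1] to clamped pressure commands.

    Restricting the admissible pressure range is one way to limit how hard
    the muscles can pull, which slows (decelerates) the resulting joint motion.
    """
    action = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
    pressures = p_min + (action + 1.0) / 2.0 * (p_max - p_min)
    return np.clip(pressures, p_min, p_max)


print(action_to_pressures(np.random.uniform(-1, 1, size=8)))
```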
The robot learns how to smash from scratch; the motion does not have to be programmed explicitly. This is done by favouring highly accelerated strikes, which is achieved by rewarding a high velocity of the returned ball in the reward function. And because training runs in simulation, the robot can learn without interacting with physical balls.
The desired behaviour, delivering the ball to the desired landing location at the highest possible velocity, must be specified in the reward function that the agent strives to maximize. First, the racket needs to hit the ball at all; then the returned ball must be shaped to fit the desired behaviour specified in the reward function.
The reward function evaluates the trajectory of the ball, which depends on the ball's landing spot and velocity. When the racket misses, the agent is penalized by the distance between the ball's trajectory and the racket's trajectory, so it constantly receives feedback on how narrowly the ball missed the racket. This pushes the racket to be positioned as close to the ball as possible, until it eventually hits it.
In the mathematical notation of the table tennis (tt) reward function, the difference between the desired and actual landing spot determines the reward the agent achieves. The normalization constant, c, proportionately scales the reward values to lie between 0 and 1. The exponent, 3/4, allows for slight variance around the optimal value, which partly accounts for the anomaly that can arise when the ball hits the edge of the racket and behaves unexpectedly.
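The article describes this reward only in words, so here is a hedged reconstruction in Python. The piecewise form, the symbols (g for the desired landing spot, b_land for the actual one, c for the normalization constant), and the default value of c are assumptions based on the description above, not a verbatim copy of the paper's formula.

```python
import numpy as np


def return_reward(ball_traj, racket_traj, hit, b_land, g, c=0.25):
    """Sketch of a return-task reward, reconstructed from the text above.

    - If the racket never hits the ball, penalize the closest miss distance
      between the ball and racket trajectories (arrays of shape (T, 3)).
    - If it does hit, reward landing close to the desired spot g, scaled by a
      normalization constant c and raised to the 3/4 exponent.
    All names and the value of c are illustrative assumptions.
    """
    if not hit:
        # Smallest ball-to-racket distance along the trajectory.
        return -np.min(np.linalg.norm(ball_traj - racket_traj, axis=1))
    # Landing term: 1 when the ball lands exactly on g, decreasing with error.
    return 1.0 - (c * np.linalg.norm(b_land - g)) ** 0.75
```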
In the smash task, the reward involves maximizing the velocity, denoted by b, alongside minimizing the difference between the desired and the true landing location. The high velocity of the smash comes with a compromise in landing precision, as is also the case with human players. In the smash task, the velocity averages ~12 m/s, compared with averages of ~5 m/s for the return task.
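Under the same assumptions, the smash reward can be sketched as the landing term plus a term that grows with the speed of the returned ball. The weight balancing the two terms is an illustrative choice, since the text only states that both are maximized and traded off against each other.

```python
import numpy as np


def smash_reward(hit, b_land, g, ball_velocity, c=0.25, w_speed=0.1):
    """Sketch of a smash-task reward: land near the target AND return the ball fast.

    ball_velocity is the velocity vector of the returned ball; w_speed is an
    illustrative weight controlling the speed-versus-precision trade-off.
    """
    if not hit:
        return -1.0  # simplified miss penalty for this sketch
    landing_term = 1.0 - (c * np.linalg.norm(b_land - g)) ** 0.75
    speed_term = w_speed * np.linalg.norm(ball_velocity)  # favours ~12 m/s smashes over ~5 m/s returns
    return landing_term + speed_term
```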
Training starts with the agent randomly exploring the space in response to simulated balls. Iteratively, it becomes more attuned to the ball's movements. It learns entirely from scratch, epitomizing Reinforcement Learning. Curiously, it also picks up how to position the racket before the hit, preparing for the strike.
Smashing is harder to learn than simply returning, and it demands broader exploration of the initially stochastic space. In the training simulations, the agent explores more as it aims to maximize both components of the reward, high velocity and the desired landing location, while also learning the trade-off between the two.
Both tasks, returning and smashing, involve a little more than 14 hours of training. The training time is determined by the convergence of the return rates: the test return and smash rates plateau after a number of iterations, or policy updates. After 183 policy updates (reflected on the x-axes below), further updating was deemed futile.
Remarkably, the agent can learn from software simulations of Table Tennis and transfer that learning to real-world Table Tennis. When tested on returns, the agent hits 96% of the balls, with 75% of them returning to the opponent's side. When tested on smashes, the racket hits the ball 77% of the time, while only 29% of the tested balls make it to the opponent's side.
The researchers accomplish commendable accuracy rates with PAM robots. The intelligent system learns from scratch and trains without real balls. The system overcomes the problems associated with dynamic precision, accelerated motions, and immediate reactions.
This piece is a distilled overview of this paper. Find a YouTube video simplifying the study here.