AI beats elite pros at 6-player Texas Hold’em Poker

AI systems have reached superhuman performance in two-player, zero-sum games (where one player's gain is the other's loss) such as chess, checkers, Go, and two-player poker. Now an AI called Pluribus can defeat elite poker players at six-player no-limit Texas Hold'em, the most common game format.

Poker elegantly captures the challenges of hidden-information games. There are far too many decision points to reason about individually, so some actions are disregarded and similar decisions are grouped together in a simplification process called abstraction.
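To make the idea concrete, here is a minimal, hypothetical sketch of two common kinds of abstraction; the bet-size menu, bucket count, and function names are illustrative assumptions, not Pluribus' actual implementation.

```python
# Illustrative sketch of abstraction (not Pluribus' actual code).

def abstract_bet_size(bet: float, pot: float) -> float:
    """Action abstraction: snap an arbitrary bet to the nearest allowed
    pot fraction, shrinking the number of distinct actions to consider."""
    allowed_fractions = [0.5, 1.0, 2.0]          # hypothetical menu of bet sizes
    best = min(allowed_fractions, key=lambda f: abs(f * pot - bet))
    return best * pot

def hand_bucket(hand_strength: float, num_buckets: int = 10) -> int:
    """Information abstraction: map a hand-strength estimate in [0, 1]
    to one of a small number of buckets treated as identical."""
    return min(int(hand_strength * num_buckets), num_buckets - 1)

print(abstract_bet_size(bet=137.0, pot=100.0))   # -> 100.0 (treated as a pot-size bet)
print(hand_bucket(0.83))                         # -> 8 (grouped with similarly strong hands)
```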

AI systems that win zero-sum games approximate a Nash equilibrium strategy and generate moves accordingly. A Nash equilibrium is a set of strategies, one per player, such that no player can do better by unilaterally switching to a different strategy.
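As a concrete example, in rock-paper-scissors the pair of uniform random strategies is a Nash equilibrium. The short check below (a toy illustration, unrelated to Pluribus' code) verifies that no pure deviation improves a player's expected payoff against a uniform opponent.

```python
import itertools

# Payoff for player 1 in rock-paper-scissors (zero-sum: player 2 gets the negative).
PAYOFF = {('R', 'R'): 0, ('R', 'P'): -1, ('R', 'S'): 1,
          ('P', 'R'): 1, ('P', 'P'): 0, ('P', 'S'): -1,
          ('S', 'R'): -1, ('S', 'P'): 1, ('S', 'S'): 0}

uniform = {'R': 1 / 3, 'P': 1 / 3, 'S': 1 / 3}

def expected_payoff(p1_strategy, p2_strategy):
    return sum(p1_strategy[a] * p2_strategy[b] * PAYOFF[(a, b)]
               for a, b in itertools.product('RPS', 'RPS'))

# Against the uniform strategy, every pure deviation earns the same payoff (0),
# so no player can gain by deviating: (uniform, uniform) is a Nash equilibrium.
for action in 'RPS':
    deviation = {a: (1.0 if a == action else 0.0) for a in 'RPS'}
    print(action, expected_payoff(deviation, uniform))   # each prints 0.0
```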

However, for a non-zero-sum game with more than two players and significant hidden information, such as six-player poker, computing a Nash equilibrium is intractable, and playing one would not guarantee strong performance anyway. Instead, Pluribus focuses on empirically and consistently defeating human opponents.

Strategy

Pluribus' strategy was computed through self-play, where the AI plays against copies of itself. The AI starts from scratch, playing randomly, and improves gradually as it learns which actions lead to better outcomes. This learning happens offline; when it later plays against opponents, Pluribus improves upon the learned "blueprint" strategy by searching for a better strategy in real time, based on the specifics of the game situation.
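The toy loop below sketches the shape of that offline phase under heavily simplified assumptions; the action set, the payoff stub, and the naive update rule are illustrative placeholders, not Pluribus' actual training procedure.

```python
import random

ACTIONS = ["fold", "call", "raise"]

def simulate_hand_payoff(action):
    """Stand-in for playing a simulated hand against copies of the same strategy."""
    return random.choice([-1, 1])

def self_play_training(num_iterations=1000):
    """Offline phase: start from a uniform (effectively random) strategy and
    nudge probabilities toward actions that did well in simulated hands."""
    strategy = {a: 1 / len(ACTIONS) for a in ACTIONS}
    for _ in range(num_iterations):
        action = random.choices(ACTIONS, weights=list(strategy.values()))[0]
        payoff = simulate_hand_payoff(action)
        if payoff > 0:                               # crude update rule, for illustration only
            strategy[action] += 0.01
        total = sum(strategy.values())
        strategy = {a: p / total for a, p in strategy.items()}
    return strategy

blueprint = self_play_training()
print(blueprint)   # learned offline "blueprint", later refined by real-time search
```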

In each iteration of the algorithm, the AI designates one player as the traverser, whose current strategy is updated after the iteration. At the start of the iteration, Pluribus simulates a hand of poker based on the players' current strategies. Once the simulated hand is completed, the AI reviews each decision (such as call, raise, or fold) made by the traverser and calculates how much better or worse it would have done by choosing other options.

The AI then compares the actions actually taken with the other possible actions to identify the cost or benefit of each alternative. In doing so, the AI learns to compute counterfactual regret, which quantifies how much the traverser regrets not having chosen a given action. The traverser's strategy is updated so that actions with more regret are chosen with higher probability.
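A minimal regret-matching sketch for a single decision point illustrates that update rule; the actions and payoffs are hypothetical, and this is a toy version of the idea rather than Pluribus' full Monte Carlo counterfactual regret minimization.

```python
# Toy regret matching at one decision point (illustrative only).

ACTIONS = ["fold", "call", "raise"]
regret_sum = {a: 0.0 for a in ACTIONS}

def current_strategy():
    """Actions with more positive accumulated regret get higher probability."""
    positive = {a: max(r, 0.0) for a, r in regret_sum.items()}
    total = sum(positive.values())
    if total == 0:
        return {a: 1 / len(ACTIONS) for a in ACTIONS}   # fall back to uniform
    return {a: p / total for a, p in positive.items()}

def update_regrets(chosen_payoff, counterfactual_payoffs):
    """After a simulated hand, compare what the traverser actually earned with
    what each alternative action would have earned, and accumulate the gaps."""
    for action, payoff in counterfactual_payoffs.items():
        regret_sum[action] += payoff - chosen_payoff

# Example: the traverser folded and lost 10 chips, but calling would have won 50
# and raising would have won 80, so "raise" accumulates the most regret.
update_regrets(chosen_payoff=-10,
               counterfactual_payoffs={"fold": -10, "call": 50, "raise": 80})
print(current_strategy())   # -> {'fold': 0.0, 'call': 0.4, 'raise': 0.6}
```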

Rather than assuming that all players follow a single fixed strategy, the creators assume that each player may choose among four general continuation strategies, further algorithmically specialized to each player: the precomputed blueprint strategy, plus three modified blueprints with programmed biases toward folding, calling, and raising, respectively.
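The sketch below shows one way such biased variants could be derived from a blueprint, by scaling up one action's probability and renormalizing; the bias factor and the blueprint probabilities are assumptions for illustration, not values from the paper.

```python
# Illustrative construction of the four continuation strategies described above.

def biased(blueprint, favored_action, bias=5.0):
    """Scale up the probability of one action, then renormalize."""
    scaled = {a: p * (bias if a == favored_action else 1.0)
              for a, p in blueprint.items()}
    total = sum(scaled.values())
    return {a: p / total for a, p in scaled.items()}

blueprint = {"fold": 0.2, "call": 0.5, "raise": 0.3}   # hypothetical blueprint probabilities

continuation_strategies = {
    "blueprint": blueprint,
    "fold-biased": biased(blueprint, "fold"),
    "call-biased": biased(blueprint, "call"),
    "raise-biased": biased(blueprint, "raise"),
}

for name, strat in continuation_strategies.items():
    print(name, {a: round(p, 2) for a, p in strat.items()})
```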

Experiment

In the experiment, 10,000 hands of poker were played over 12 days. Each day, 5 available players were selected from the pool of professionals. $50,000 was divided among the human participants according to their performance. The players were guaranteed a minimum of $0.40 per hand for participation, and were paid up to $1.60 per hand for performing well.

Pluribus was evaluated in two game configurations: five humans plus one AI, and one human plus five copies of the AI. Performance is measured by the standard metric of milli big blinds per game (mbb/game). This measures how many big blinds (the forced bet the second player must put into the pot at the start of a hand) were won on average per 1,000 hands of poker.
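For instance, under this definition a total profit of 480 big blinds over 10,000 hands corresponds to a rate of 48 mbb/game, the rate reported below against five professionals.

```python
# Quick arithmetic for the mbb/game metric: thousandths of a big blind won per hand.

def mbb_per_game(total_big_blinds_won, hands_played):
    return 1000 * total_big_blinds_won / hands_played

print(mbb_per_game(total_big_blinds_won=480, hands_played=10_000))   # -> 48.0
```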

Results

Pluribus won an average of 48 mbb/game (with a standard error of 25 mbb/game), a very high win rate against 5 elite professionals. When a human plays against 5 copies of the AI, the copies of Pluribus cannot communicate with one another, so they cannot collude. Over the 10,000 hands played, Pluribus beat the human by an average of 32 mbb/game (with a standard error of 15 mbb/game).

The win is based purely on strategy and not at all on reading facial expressions or "tells". The 10,000-hand experiment also suggests that the humans were unable to find exploitable weaknesses in the bot.