I've also been struggling with this after reading a little bit about these regret minimisation models.
Can someone explain how there is any difference in these CFR model from a normal reinforcement learning algorithm where you just take your loss function as the "regret" in CFR terms?
Sure. Like I mentioned in a later post, I'm an author on several of the CFR papers. It is related to RL, and there are a few ways of interpreting CFR. If you have an RL background, then CFR is kind of "RL using an advantage function, but it converges in adversarial games." If you have a multiarm bandits background, then CFR is kind of "An independent multiarm bandit at each decision point, all learning at the same time."
CFR differs from standard RL in a few ways:
- Exploring the tree. CFR has many variants, many of which focus on using sampling to trade off between fast, narrow, noisy updates versus slow, broad, precise updates. The most sampling-heavy variant, called Outcome Sampling, traces a single path through the tree, as you would in RL. However, it doesn't converge nearly as quickly as other variants, like Vanilla (explore the whole tree on each iteration) or Chance Sampling (sample the cards, explore the whole action space). Since those variants consider all paths through the tree, you don't have an explore/exploit tradeoff anymore, as you normally would in RL.
- The "Counterfactual" part. Regret updates are weighted by the other agents' probability of taking you to a decision point. Or, equivalently: updates are weighted by the probability that you reach a decision, if you are trying to reach that decision (and all of your action probabilities along the way are 1). If you use Outcome Sampling (described above) then by sampling all actions that weight becomes 1, and thus less important. But for fast learning, the weighting becomes important.
- The strategy doesn't converge, but the average strategy does. Even in the limit, in an adversarial setting where all players are learning, the strategy produced from the regret values doesn't converge to Nash. If it was a single-player problem (either no opponents, or static opponents that can be treated as part of the environment), then as in RL, the strategy would converge to an optimal response to the environment. But in an adversarial setting, you also need the average strategy, which is the only part guaranteed to converge to optimal.
Have you read the book https://en.wikipedia.org/wiki/Blondie24. It seems like both of your strategies have similar high-level approaches of self-play and learning, although yours is much more nuanced. They are using an neural network that uses an evolutionary algorithm to update the neural network based on which ai's win against the other ones in self-play.
CFR is based on the method of "regret matching", which is a policy update method. Roughly, it says every iteration, the probability you take an action is equal to the fraction of its cumulative counterfactual regret over total cumulative counterfactual regret for all actions.
The cool thing about it is that it provably converges to nash equilibrium in zero sum game, and to correlated equilibria in non-zero sum games.
Can someone explain how there is any difference in these CFR model from a normal reinforcement learning algorithm where you just take your loss function as the "regret" in CFR terms?