This strategy initially seemed promising, letting KataGo win 100 percent of games against a cyclic “attacker.” But after the attacker itself was fine-tuned (a process that used much less computing power than KataGo’s fine-tuning), that win rate fell back down to 9 percent against a slight variation on the original attack.
For its second defense attempt, the researchers iterated a multi-round “arms race” where new adversarial models discover novel exploits and new defensive models seek to plug up those newly discovered holes. After 10 rounds of such iterative training, the final defending algorithm still only won 19 percent of games against a final attacking algorithm that had discovered previously unseen variation on the exploit. This was true even as the updated algorithm maintained an edge against earlier attackers that it had been trained against in the past.
Credit:
Getty Images
Even a child can beat a world-class Go AI if they know the right algorithm-exploiting strategy.
Credit:
Getty Images
In their final attempt, researchers tried a completely new type of training using vision transformers, in an attempt to avoid what might be “bad inductive biases” found in the convolutional neural networks that initially trained KataGo. This method also failed, winning only 22 percent of the time against a variation on the cyclic attack that “can be replicated by a human expert,” the researchers wrote.
Will anything work?
In all three defense attempts, the KataGo-beating adversaries didn’t represent some new, previously unseen height in general Go-playing ability. Instead, these attacking algorithms were laser-focused on discovering exploitable weaknesses in an otherwise performant AI algorithm, even if those simple attack strategies would lose to most human players.
Those exploitable holes highlight the importance of evaluating “worst-case” performance in AI systems, even when the “average-case” performance can seem downright superhuman. On average, KataGo can dominate even high-level human players using traditional strategies. But in the worst case, otherwise “weak” adversaries can find holes in the system that make it fall apart.