Pairwise comparisons form the basis of the Elo rating methodology. Elo made reference to the papers of Good, David, Trawinski and David, and Buhlmann and Huber.
== Mathematical details ==

Performance is not measured absolutely; it is inferred from wins, losses, and draws against other players. Players' ratings depend on the ratings of their opponents and the results scored against them. The difference in rating between two players determines an estimate of the expected score between them. Both the average and the spread of ratings can be chosen arbitrarily. The USCF initially aimed for an average club player to have a rating of 1500, and Elo suggested scaling ratings so that a difference of 200 rating points in chess would mean that the stronger player has an expected score of approximately 0.75.

A player's
expected score is their probability of winning plus half their probability of drawing. Thus, an expected score of 0.75 could represent a 75% chance of winning, 25% chance of losing, and 0% chance of drawing. At the other extreme, it could represent a 50% chance of winning, 0% chance of losing, and 50% chance of drawing. The probability of drawing, as opposed to having a decisive result, is not specified in the Elo system; instead, a draw is counted as half a win and half a loss. In practice, since the true strength of each player is unknown, the expected scores are calculated from the players' current ratings as follows.

If player \mathsf{A} has a rating of R_\mathsf{A} and player \mathsf{B} a rating of R_\mathsf{B}, the exact formula (using the logistic curve with base 10 and scale factor 400) for the expected score of player \mathsf{A} is

: E_\mathsf{A} = \frac 1 {1 + 10^{(R_\mathsf{B} - R_\mathsf{A})/400}} ~.

Similarly, the expected score for player \mathsf{B} is

: E_\mathsf{B} = \frac 1 {1 + 10^{(R_\mathsf{A} - R_\mathsf{B})/400}} ~.

This can also be expressed as

: E_\mathsf{A} = \frac{ Q_\mathsf{A} }{ Q_\mathsf{A} + Q_\mathsf{B} } \quad \text{and} \quad E_\mathsf{B} = \frac{ Q_\mathsf{B} }{Q_\mathsf{A} + Q_\mathsf{B} } ~,

where Q_\mathsf{A} = 10^{R_\mathsf{A}/400} and Q_\mathsf{B} = 10^{R_\mathsf{B}/400}. Note that in the latter form the same denominator applies to both expressions, so it is plain that E_\mathsf{A} + E_\mathsf{B} = 1. Comparing only the numerators shows that the expected score for player \mathsf{A} is Q_\mathsf{A}/Q_\mathsf{B} times the expected score for player \mathsf{B}; algebraically, E_\mathsf{A}/E_\mathsf{B} = Q_\mathsf{A}/Q_\mathsf{B}. It follows that for each 400 rating points of advantage over the opponent, the expected score is magnified ten times relative to the opponent's expected score.

When a player's actual tournament scores exceed their expected scores, the Elo system takes this as evidence that the player's rating is too low and adjusts it upward. Similarly, when a player's actual tournament scores fall short of their expected scores, the rating is adjusted downward. Elo's original suggestion, which is still widely used, is a simple linear adjustment proportional to the amount by which a player over-performed or under-performed their expected score. The maximum possible adjustment per game, called the K-factor, was set at K = 16 for masters and K = 32 for weaker players.

Suppose player \mathsf{A} (again with rating R_\mathsf{A}) was expected to score E_\mathsf{A} points but actually scored S_\mathsf{A} points.
The formula for updating that player's rating is

: R_\mathsf{A}' = R_\mathsf{A} + K \cdot (S_\mathsf{A} - E_\mathsf{A}) ~.

FIDE also uses an approximation to the logistic distribution.
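The expected-score and update formulas above can be sketched in a few lines of Python (function names are illustrative, not part of any rating standard):

```python
def expected_score(r_a, r_b):
    """Expected score of the player rated r_a against the player rated r_b
    (logistic curve, base 10, scale factor 400)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_rating(r_a, score, expected, k=32):
    """Linear Elo update: R_A' = R_A + K * (S_A - E_A)."""
    return r_a + k * (score - expected)

# A player rated 200 points above the opponent expects roughly 0.76:
e_a = expected_score(1700, 1500)       # ~0.7597
# If that favourite nevertheless loses (score 0), with K = 32:
new_r = update_rating(1700, 0.0, e_a)  # ~1675.7
```

Note that because `expected_score(r_a, r_b) + expected_score(r_b, r_a) = 1`, the points gained by one player equal the points lost by the other when both use the same K-factor.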
=== Most accurate K-factor ===

The second major concern is the correct K-factor to use. The chess statistician Jeff Sonas believes that the original value K = 10 (for players rated above 2400) from Elo's work is inaccurate. If the K-factor is set too high, the rating becomes too sensitive to just a few recent events, with a large number of points exchanged in each game. If the K-value is too low, the sensitivity is minimal, and the system does not respond quickly enough to changes in a player's actual level of performance. Elo's original K-factor estimation was made without the benefit of huge databases and statistical evidence. Sonas indicates that a K-factor of 24 (for players rated above 2400) may be both more accurate as a predictive tool of future performance and more sensitive to performance. Certain Internet chess sites seem to avoid a three-level K-factor staggering based on rating range; for example, the ICC seems to adopt a single global K-factor except when playing against provisionally rated players.

The USCF (which uses a logistic distribution, as opposed to a normal distribution) formerly staggered the K-factor according to three main rating ranges. Currently, the USCF uses a formula that calculates the K-factor based on factors including the number of games played and the player's rating; the K-factor is also reduced for high-rated players if the event has shorter time controls. FIDE likewise staggers the K-factor by rating range and experience, and used different ranges before July 2014.

The gradation of the K-factor reduces rating change at the top end of the rating range, reducing the possibility of a rapid rise or fall of rating for those with a rating high enough to reach a low K-factor. In theory, this might apply equally to online chess players and over-the-board players, since it is more difficult for all players to raise their rating once it has become high and the K-factor has consequently been reduced. When playing online, however, 2800+ players can more easily raise their rating by simply selecting opponents with high ratings: on the ICC playing site, a grandmaster may play a string of different opponents who are all rated over 2700. In over-the-board events, only very high-level all-play-all events would let a player engage that number of 2700+ opponents; in a normal, open, Swiss-paired chess tournament, there would frequently be many opponents rated below 2500, reducing the rating gains possible from a single contest for a high-rated player.
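As an illustration only, a staggered K-factor schedule can be expressed as a simple piecewise function. The thresholds below are invented for the example and do not reproduce any federation's actual table:

```python
def k_factor(rating, rated_games):
    """Select a K-factor from a staggered schedule.

    The thresholds here are purely illustrative, not any
    federation's actual rule.
    """
    if rated_games < 30:
        return 40   # new players: let the rating move quickly
    if rating < 2400:
        return 20   # established players
    return 10       # top players: small, stable adjustments
```

The design intent is the one described above: new players get a large K so their rating converges quickly, while high-rated players get a small K so their rating is stable.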
=== Formal derivation for win/loss games ===

The above expressions can now be formally derived by exploiting the link between the Elo rating and the stochastic gradient update in logistic regression. If we assume that the game results are binary, that is, only a win or a loss can be observed, the problem can be addressed via logistic regression, where the game results are dependent variables, the players' ratings are independent variables, and the model relating the two is probabilistic: the probability of player \mathsf{A} winning the game is modeled as

: \Pr\{\mathsf{A}~\textrm{wins}\} = \sigma(r_{\mathsf{A,B}}), \quad \sigma(r)=\frac 1 {1 + 10^{-r/s}},

where

: r_{\mathsf{A,B}} = R_\mathsf{A} - R_\mathsf{B}

denotes the difference of the players' ratings, and we use a scaling factor s = 400. By the law of total probability,

: \Pr\{\mathsf{B}~\textrm{wins}\} = 1-\sigma(r_{\mathsf{A,B}})=\sigma(-r_{\mathsf{A,B}}).

The
log loss is then calculated as

: \ell = \begin{cases} -\log \sigma(r_\mathsf{A,B}) & \textrm{if}~ \mathsf{A}~\textrm{wins},\\ -\log \sigma(-r_\mathsf{A,B}) & \textrm{if}~ \mathsf{B}~\textrm{wins}, \end{cases}

and, using stochastic gradient descent, the log loss is minimized via

: R_{\mathsf{A}}\leftarrow R_{\mathsf{A}} - \eta \frac{\textrm{d}\ell}{\textrm{d} R_{\mathsf{A}}},
: R_{\mathsf{B}}\leftarrow R_{\mathsf{B}} - \eta \frac{\textrm{d}\ell}{\textrm{d} R_{\mathsf{B}}},

where \eta is the adaptation step. Since \frac{\textrm{d}}{\textrm{d} r}\log\sigma(r)=\frac{\log 10}{s}\sigma(-r), \frac{\textrm{d} r_{\mathsf{A,B}}}{\textrm{d} R_{\mathsf{A}}}=1, and \frac{\textrm{d} r_{\mathsf{A,B}}}{\textrm{d} R_{\mathsf{B}}}=-1, the adaptation can be written as

: R_{\mathsf{A}}\leftarrow \begin{cases} R_{\mathsf{A}} + K \sigma(-r_{\mathsf{A,B}}) & \textrm{if}~\mathsf{A}~\textrm{wins},\\ R_{\mathsf{A}} - K \sigma(r_{\mathsf{A,B}}) & \textrm{if}~\mathsf{B}~\textrm{wins}, \end{cases}

which may be compactly written as

: R_{\mathsf{A}}\leftarrow R_{\mathsf{A}} + K (S_{\mathsf{A}}-E_{\mathsf{A}}),

where K=\eta\log10/s is the new adaptation step which absorbs \eta and s, S_{\mathsf{A}}=1 if \mathsf{A} wins and S_{\mathsf{A}}=0 if \mathsf{B} wins, and the expected score is given by E_{\mathsf{A}}=\sigma(r_{\mathsf{A,B}}). Analogously, the update for the rating R_{\mathsf{B}} is

: R_{\mathsf{B}}\leftarrow R_{\mathsf{B}} + K (S_{\mathsf{B}}-E_{\mathsf{B}}).
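The equivalence between the stochastic-gradient step and the Elo update can be checked numerically. The following Python sketch (names are illustrative) applies one gradient step with \eta chosen so that K = \eta\log 10/s = 32:

```python
import math

S = 400  # rating scale factor s

def sigma(r):
    """Base-10 logistic: expected score for a rating difference r."""
    return 1.0 / (1.0 + 10 ** (-r / S))

def sgd_step(r_a, r_b, s_a, eta):
    """One stochastic-gradient step on the log loss.

    s_a = 1 if A wins, 0 if B wins.  Returns the new (R_A, R_B).
    """
    e_a = sigma(r_a - r_b)
    grad = (math.log(10) / S) * (s_a - e_a)  # equals -dl/dR_A and +(-dl/dR_B)
    return r_a + eta * grad, r_b - eta * grad

# Choosing eta so that K = eta * log(10) / S = 32 reproduces the Elo update:
eta = 32 * S / math.log(10)
ra, rb = sgd_step(1700, 1500, 1, eta)  # same as R_A + 32 * (1 - E_A)
```

Because the two gradients differ only in sign, the rating points gained by one player are exactly the points lost by the other.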
=== Formal derivation for win/draw/loss games ===

Since the very beginning, the Elo rating has also been used in chess, where wins, losses, and draws are observed; to handle draws, a fractional score value S_{\mathsf{A}}=0.5 is introduced. Note, however, that the scores S_{\mathsf{A}}=1 and S_{\mathsf{A}}=0 are merely indicators of the events that player \mathsf{A} wins or loses the game; it is therefore not immediately clear what the meaning of the fractional score is. Moreover, since we do not explicitly specify the model relating the rating values R_{\mathsf{A}} and R_{\mathsf{B}} to the probability of the game outcome, we cannot say what the probability of a win, a loss, or a draw is.

To address these difficulties, and to derive the Elo rating for ternary games, we define an explicit probabilistic model of the outcomes and then minimize the log loss via stochastic gradient descent. Since the loss, the draw, and the win are ordinal variables, we should adopt a model which takes their ordinal nature into account, and we use the so-called adjacent categories model, which may be traced to Davidson's work:

: \Pr\{\mathsf{A}~\textrm{wins}\} = \sigma(r_{\mathsf{A,B}}; \kappa),
: \Pr\{\mathsf{B}~\textrm{wins}\} = \sigma(-r_{\mathsf{A,B}}; \kappa),
: \Pr\{\mathsf{A}~\textrm{draws}\} = \kappa\sqrt{\sigma(r_{\mathsf{A,B}}; \kappa)\sigma(-r_{\mathsf{A,B}}; \kappa)},

where

: \sigma(r; \kappa) = \frac{10^{r/s}}{10^{-r/s}+\kappa + 10^{r/s}}

and \kappa\ge 0 is a parameter. The introduction of a free parameter should not be surprising: with three possible outcomes, an additional degree of freedom should appear in the model. In particular, with \kappa=0 we recover the model underlying logistic regression:

: \Pr\{\mathsf{A}~\textrm{wins}\} = \sigma(r_{\mathsf{A,B}};0)=\frac{10^{r_{\mathsf{A,B}}/s}}{10^{-r_{\mathsf{A,B}}/s}+ 10^{r_{\mathsf{A,B}}/s}}=\frac{1}{1+ 10^{-r_{\mathsf{A,B}}/s'}},

where s' = s/2. Using the ordinal model defined above, the
log loss is now calculated as

: \ell = \begin{cases} -\log \sigma(r_{\mathsf{A,B}};\kappa) & \textrm{if}~ \mathsf{A}~\textrm{wins},\\ -\log \sigma(-r_{\mathsf{A,B}};\kappa) & \textrm{if}~ \mathsf{B}~\textrm{wins},\\ -\log \kappa -\frac{1}{2}\log\sigma(r_{\mathsf{A,B}};\kappa) - \frac{1}{2}\log\sigma(-r_{\mathsf{A,B}};\kappa) & \textrm{if}~ \mathsf{A}~\textrm{draws}, \end{cases}

which may be compactly written as

: \ell = -\left(S_{\mathsf{A}} +\tfrac{1}{2}D\right)\log \sigma(r_{\mathsf{A,B}};\kappa) -\left(S_{\mathsf{B}} +\tfrac{1}{2}D\right) \log \sigma(-r_{\mathsf{A,B}};\kappa) -D\log \kappa,

where S_{\mathsf{A}}=1 iff \mathsf{A} wins, S_{\mathsf{B}}=1 iff \mathsf{B} wins, and D=1 iff the game is drawn. As before, we need the derivative of \log\sigma(r;\kappa), which is given by

: \frac{\textrm{d}}{\textrm{d} r}\log\sigma(r; \kappa) =\frac{2\log 10}{s} [1-g(r;\kappa)],

where

: g(r;\kappa)= \frac{10^{r/s}+\kappa/2 } {10^{-r/s}+\kappa + 10^{r/s}}.

Thus, the derivative of the log loss with respect to the rating R_{\mathsf{A}} is given by

: \begin{align} \frac{\textrm{d}}{\textrm{d} R_{\mathsf{A}}}\ell &= -\frac{2\log 10}{s} \left( (S_{\mathsf{A}} +0.5D)[1-g(r_{\mathsf{A,B}};\kappa)] -(S_{\mathsf{B}} +0.5D)g(r_{\mathsf{A,B}};\kappa) \right)\\ &= -\frac{2\log 10}{s} \left(S_{\mathsf{A}} + 0.5D-g(r_{\mathsf{A,B}};\kappa)\right), \end{align}

where we used the relationships S_{\mathsf{A}} + S_{\mathsf{B}} + D=1 and g(-r;\kappa)=1-g(r;\kappa). Then, stochastic gradient descent applied to minimize the log loss yields the following update for the rating R_{\mathsf{A}}:

: R_{\mathsf{A}}\leftarrow R_{\mathsf{A}} + K (\hat{S}_{\mathsf{A}}- g(r_{\mathsf{A,B}};\kappa)),

where K=2\eta\log10/s and \hat{S}_{\mathsf{A}}= S_{\mathsf{A}} + 0.5D. Of course, \hat{S}_{\mathsf{A}}= 1 if \mathsf{A} wins, \hat{S}_{\mathsf{A}}= 0.5 if \mathsf{A} draws, and \hat{S}_{\mathsf{A}}= 0 if \mathsf{A} loses. To recognize its origin in the model proposed by Davidson, this update is called the Elo-Davidson rating. The update for R_{\mathsf{B}} is derived in the same manner:

: R_{\mathsf{B}}\leftarrow R_{\mathsf{B}} + K (\hat{S}_{\mathsf{B}}- g(r_{\mathsf{B,A}};\kappa)),

where r_{\mathsf{B,A}}=R_{\mathsf{B}}-R_{\mathsf{A}}=-r_{\mathsf{A,B}}.
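A minimal Python sketch of the Elo-Davidson update (function names are illustrative; s = 400 as above):

```python
S = 400  # rating scale factor s

def g(r, kappa):
    """Expected score E_A = g(r; kappa) in the Elo-Davidson model."""
    u = 10 ** (r / S)
    return (u + kappa / 2) / (1 / u + kappa + u)

def elo_davidson_step(r_a, r_b, s_hat_a, k, kappa):
    """Update for player A; s_hat_a is 1 (win), 0.5 (draw) or 0 (loss)."""
    return r_a + k * (s_hat_a - g(r_a - r_b, kappa))
```

Since g(-r;\kappa)=1-g(r;\kappa), equally rated opponents have g(0;\kappa)=0.5 for every \kappa, so a draw between them leaves both ratings unchanged.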
We note that

: \begin{align} \operatorname{E}[\hat{S}_{\mathsf{A}}] &=\Pr\{\mathsf{A}~\text{wins}\}+0.5\Pr\{\mathsf{A}~\text{draws}\}\\ &=\sigma(r_{\mathsf{A,B}};\kappa)+0.5\kappa\sqrt{\sigma(r_{\mathsf{A,B}};\kappa)\sigma(-r_{\mathsf{A,B}};\kappa)}\\ &=g(r_{\mathsf{A,B}};\kappa), \end{align}

and thus the rating update may be written as

: R_{\mathsf{A}}\leftarrow R_{\mathsf{A}} + K (\hat{S}_{\mathsf{A}}- E_{\mathsf{A}}),

where E_{\mathsf{A}}=\operatorname{E}[\hat{S}_\mathsf{A}]. This is practically the same equation as in the Elo rating, except that the expected score is given by E_{\mathsf{A}}=g(r_{\mathsf{A,B}};\kappa) instead of E_{\mathsf{A}}=\sigma(r_{\mathsf{A,B}}). As noted above, for \kappa=0 we have g(r;0) = \sigma(r;0), so the update takes the same form as the Elo rating (with the halved scale factor s'=s/2 noted above); however, this is of no help in understanding the case when draws are observed, since \kappa=0 would mean that the probability of a draw is zero. On the other hand, if we use \kappa=2, we have

: g(r;2)= \frac{10^{r/s}+1 } {10^{-r/s}+2 + 10^{r/s}}=\frac{1} {1+10^{-r/s}}=\sigma(r),

which means that, using \kappa=2, the Elo-Davidson rating is exactly the same as the Elo rating.

== Practical issues ==