The Good–Turing estimator is largely independent of the distribution of species frequencies.
==Notation==
Suppose that X distinct species have been observed, enumerated 1, \dots, X. Then the frequency vector, \bar{R}, has elements R_x that give the number of individuals that have been observed for species x. The frequency of frequencies vector, (N_r)_{r=0, 1, \ldots}, shows how many times the frequency r occurs in the vector \bar{R} (i.e., among the elements R_x): N_r = \Bigl| \left\{ x \mid R_x = r \right\} \Bigr|. For example, N_1 is the number of species for which only one individual was observed. Note that the total number of objects observed, N, can be found from N = \sum_{r=1}^\infty r N_r = \sum_{x=1}^X R_x.
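The quantities defined above can be computed directly from a list of observations. The following is a minimal sketch in Python; the function name `frequency_of_frequencies` and the sample data are illustrative assumptions, not part of any published implementation.

```python
from collections import Counter


def frequency_of_frequencies(observations):
    """Return (R, N_r, N) for an iterable of observed species labels.

    Hypothetical helper for illustration only.
    """
    # R_x: number of individuals observed for each species x
    species_counts = Counter(observations)
    # N_r: number of species that were observed exactly r times
    freq_of_freqs = Counter(species_counts.values())
    # N = sum over r of r * N_r (equals the number of observations)
    total = sum(r * n_r for r, n_r in freq_of_freqs.items())
    return species_counts, freq_of_freqs, total


# Made-up sample: species "a" seen 3 times, "b" twice, "c" and "d" once each.
sample = ["a", "b", "a", "c", "a", "b", "d"]
R, N_r, N = frequency_of_frequencies(sample)
```

Here `N_r[1]` counts the species seen exactly once, and `N` agrees with both `sum(R.values())` and `len(sample)`, matching the identity N = \sum_r r N_r = \sum_x R_x.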
==Calculation==
The first step in the calculation is to estimate the probability that the next observed individual belongs to a species not yet seen. This estimate is p_0 = \frac{N_1}{N}. The next step is to estimate the probability that the next observed individual is from a species that has been seen r times. For a single such species, this estimate is p_r = \frac{(r+1) S(N_{r+1})}{N \, S(N_r)}. Here, S(\cdot) denotes the smoothed, or adjusted, value of the frequency shown in parentheses. An overview of how to perform this smoothing follows in the next section (see also empirical Bayes method). To estimate the probability that the next observed individual is from any species in this group (i.e., the group of species seen r times), multiply p_r by the S(N_r) species in the group, which gives \frac{(r+1) S(N_{r+1})}{N}.
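The two estimates above can be sketched in a few lines of Python. The function name and the `smooth` parameter are assumptions made for illustration: `smooth(r)` is expected to return S(N_r) and to be positive for every observed r. When it is omitted, the sketch falls back to the raw counts N_r, which is only reasonable for small r.

```python
def good_turing_probabilities(freq_of_freqs, smooth=None):
    """Return (p0, per_species, per_group) from a mapping r -> N_r.

    `smooth` is a hypothetical callable giving S(N_r); by default the
    unsmoothed count N_r is used (an assumption, not the full method).
    """
    if smooth is None:
        def smooth(r):
            return freq_of_freqs.get(r, 0)

    # N = sum over r of r * N_r
    total = sum(r * n_r for r, n_r in freq_of_freqs.items())

    # p_0 = N_1 / N: probability that the next individual is a new species
    p0 = freq_of_freqs.get(1, 0) / total

    per_species = {}  # p_r for one particular species seen r times
    per_group = {}    # probability for the whole group of species seen r times
    for r in sorted(freq_of_freqs):
        if r < 1 or freq_of_freqs[r] == 0:
            continue
        per_group[r] = (r + 1) * smooth(r + 1) / total
        per_species[r] = per_group[r] / smooth(r)
    return p0, per_species, per_group


# Made-up counts: 10 species seen once, 5 twice, 2 three times, 1 four times.
counts = {1: 10, 2: 5, 3: 2, 4: 1}
p0, per_species, per_group = good_turing_probabilities(counts)
```

With unsmoothed counts, `p0` and the group probabilities sum to 1, but the largest observed r receives probability zero because N_{r+1} = 0 there. This is one reason the smoothed values S(N_r) are used for large r.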
==Smoothing==
To smooth the erratic values of N_r for large r, we would like to plot \log N_r versus \log r, but this is problematic because for large r many N_r will be zero. Instead, a revised quantity, \log Z_r, is plotted versus \log r, where Z_r is defined as Z_r = \frac{N_r}{(t - q)/2}, and where q, r, and t are three consecutive subscripts with non-zero counts N_q, N_r, N_t. For the special case when r is 1, take q to be 0. In the opposite special case, when r = r_\text{last} is the index of the last non-zero count, replace the divisor (t - q)/2 with r_\text{last} - q, so Z_{r_\text{last}} = N_{r_\text{last}} / (r_\text{last} - q). A simple linear regression is then fitted to the log–log plot. For small values of r it is reasonable to set S(N_r) = N_r; that is, no smoothing is performed. For large values of r, values of S(N_r) are read off the regression line. An automatic procedure (not described here) can be used to specify at what point the switch from no smoothing to linear smoothing should take place. Code for the method is available in the public domain.
==Derivation==