Population structure is a complex phenomenon and no single measure captures it entirely. Understanding a population's structure requires a combination of methods and measures. Simulation studies show that historical population structure can produce genetic effects that are easily misinterpreted as historical changes in population size or as admixture events, even when no such events occurred.
=== Heterozygosity ===
Figure: A population bottleneck can result in a loss of heterozygosity. In this hypothetical population, an allele has become fixed after the population repeatedly dropped from 10 to 3.

One of the results of population structure is a reduction in heterozygosity. When populations split, alleles have a higher chance of reaching fixation within subpopulations, especially if the subpopulations are small or have been isolated for long periods. This reduction in heterozygosity can be thought of as an extension of inbreeding, with individuals in subpopulations being more likely to share a recent common ancestor. The scale is important: an individual with both parents born in the United Kingdom is not inbred relative to that country's population, but is more inbred than the child of two humans selected from the entire world. This motivates the derivation of Wright's F-statistics (also called "fixation indices"), which measure inbreeding through observed versus expected heterozygosity. For example, F_{IS} measures the inbreeding coefficient at a single locus for an individual I relative to some subpopulation S:
:F_{IS} = 1 - \frac{H_I}{H_S}
Here, H_I is the fraction of individuals in subpopulation S that are heterozygous. Assuming there are two alleles, A_1 and A_2, that occur at respective frequencies p_S and q_S, it is expected that under random mating the subpopulation S will have a heterozygosity rate of H_S = 2p_S(1-p_S) = 2 p_S q_S. Then:
:F_{IS} = 1 - \frac{H_I}{2 p_S q_S}
Similarly, for the total population T, we can define H_T = 2 p_T q_T, which allows F_{ST} to be computed from the expected heterozygosity of subpopulation S and that of the total population:
:F_{ST} = 1 - \frac{H_S}{H_T} = 1 - \frac{2 p_S q_S}{2 p_T q_T}
F_{ST} also depends on within-population diversity, which makes interpretation and comparison difficult.
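As a rough illustration of the heterozygosity-based definitions above, the following Python sketch computes F_{IS} and F_{ST} = 1 - H_S / H_T for a single bi-allelic locus from genotype counts. The function names and the example counts are hypothetical and chosen only for illustration; the sketch assumes the locus is polymorphic (allele frequency strictly between 0 and 1), since otherwise the expected heterozygosity is zero and the ratios are undefined.

```python
def allele_freq(counts):
    """Frequency p of allele A_1 from genotype counts (A1A1, A1A2, A2A2)."""
    n_hom1, n_het, n_hom2 = counts
    n = n_hom1 + n_het + n_hom2
    return (2 * n_hom1 + n_het) / (2 * n)


def expected_het(p):
    """Expected heterozygosity under random mating: 2 p q with q = 1 - p."""
    return 2 * p * (1 - p)


def f_is(sub_counts):
    """F_IS = 1 - H_I / H_S, with H_I the observed heterozygote fraction in S."""
    h_i = sub_counts[1] / sum(sub_counts)
    h_s = expected_het(allele_freq(sub_counts))
    return 1 - h_i / h_s  # assumes h_s > 0 (polymorphic locus)


def f_st(sub_counts, total_counts):
    """F_ST = 1 - H_S / H_T, using expected heterozygosities of S and T."""
    h_s = expected_het(allele_freq(sub_counts))
    h_t = expected_het(allele_freq(total_counts))
    return 1 - h_s / h_t  # assumes h_t > 0 (polymorphic locus)


# Hypothetical counts: subpopulation S, and the total population T
# formed by pooling S with a second subpopulation (30, 20, 50).
sub_s = (50, 20, 30)    # p_S = 0.6, H_S = 0.48, H_I = 0.2
total = (80, 40, 80)    # p_T = 0.5, H_T = 0.5

fis_value = f_is(sub_s)
fst_value = f_st(sub_s, total)
```

With these made-up counts, heterozygotes are rarer in S than random mating would predict, so F_{IS} is positive, while the small difference between H_S and H_T gives an F_{ST} close to zero.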
=== Admixture inference ===
An individual's genotype can be modelled as an admixture between K discrete clusters of populations. Algorithms such as ADMIXTURE have been developed to estimate each individual's ancestry proportions under this model, using a variety of estimation techniques. Estimated proportions can be visualized using bar plots: each bar represents an individual, and is subdivided to represent the proportion of that individual's genetic ancestry from each of the K populations.

These clustering methods are sensitive to sampling strategies, sample size, and close relatives in data sets; there may be no discrete populations at all; and there may be hierarchical structure where subpopulations are nested.
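The following Python sketch illustrates the idea of modelling an individual as a mixture of K clusters. It assumes a common formulation, not stated in the text above, in which an individual's expected allele frequency at each locus is the ancestry-weighted average of the cluster allele frequencies. It covers only this forward model and the layout of one stacked bar; it does not estimate ancestry proportions and is not the procedure used by ADMIXTURE or any other program. All names and numbers are hypothetical.

```python
def expected_allele_freqs(q, f):
    """Ancestry-weighted allele frequencies for one individual.

    q: list of K ancestry proportions (assumed to sum to 1).
    f: K lists of per-locus allele frequencies, one list per cluster.
    """
    if abs(sum(q) - 1.0) > 1e-9:
        raise ValueError("ancestry proportions must sum to 1")
    n_loci = len(f[0])
    return [sum(q[k] * f[k][l] for k in range(len(q))) for l in range(n_loci)]


def bar_segments(q):
    """(bottom, top) extents of each cluster's slice in one stacked bar."""
    segments = []
    bottom = 0.0
    for share in q:
        segments.append((bottom, bottom + share))
        bottom += share
    return segments


# Hypothetical individual with 75% ancestry from cluster 0 and 25% from
# cluster 1, and made-up cluster allele frequencies at two loci.
q_example = [0.75, 0.25]
f_example = [[0.2, 0.8],
             [0.6, 0.4]]

freqs = expected_allele_freqs(q_example, f_example)
segments = bar_segments(q_example)
```

Each pair in `segments` gives the vertical extent of one coloured slice of the individual's bar, so slices always stack to a total height of 1.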
=== Dimensionality reduction ===
Genetic data are high dimensional, and dimensionality reduction techniques can capture population structure. Principal component analysis (PCA) was first applied in population genetics in 1978 by Cavalli-Sforza and colleagues and resurged with high-throughput sequencing. Initially PCA was used on allele frequencies at known genetic markers for populations, though later it was found that by coding SNPs as integers (for example, as the number of non-reference alleles) and normalizing the values, PCA could be applied at the level of individuals. One formulation considers N individuals and S bi-allelic SNPs. For each individual i, the value g_{i,l} at locus l is the number of non-reference alleles (one of 0, 1, 2). If the allele frequency at l is p_{l}, then the resulting N \times S matrix of normalized genotypes has entries:
:\frac{g_{i,l} - 2p_{l}}{\sqrt{2 p_{l} (1 - p_{l})}}
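A minimal NumPy sketch of individual-level PCA on genotype counts follows. It assumes the widely used normalization (g_{i,l} - 2p_l) / \sqrt{2 p_l (1 - p_l)}, with p_l estimated from the sample as half the mean genotype at locus l. Monomorphic SNPs are dropped because their denominator is zero. The function names and the toy genotype matrix are hypothetical.

```python
import numpy as np


def normalize_genotypes(g):
    """Center and scale an N x S matrix of 0/1/2 genotype counts."""
    g = np.asarray(g, dtype=float)
    p = g.mean(axis=0) / 2.0        # sample allele frequency per SNP
    keep = (p > 0) & (p < 1)        # drop monomorphic SNPs (zero variance)
    g, p = g[:, keep], p[keep]
    return (g - 2 * p) / np.sqrt(2 * p * (1 - p))


def principal_components(x, n_components=2):
    """Project individuals onto the leading principal components via SVD."""
    # x is already column-centered by normalize_genotypes.
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Toy data: 4 individuals, 5 SNPs; the last SNP is monomorphic.
# Individuals 0-1 and 2-3 are constructed to form two groups.
genotypes = [
    [0, 0, 2, 2, 0],
    [0, 1, 2, 2, 0],
    [2, 2, 0, 0, 0],
    [2, 2, 0, 1, 0],
]

x = normalize_genotypes(genotypes)
pcs = principal_components(x, n_components=2)
```

In this toy example the first principal component places individuals 0 and 1 on one side of zero and individuals 2 and 3 on the other. The sign of a component is arbitrary, so only relative positions are meaningful.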
Multidimensional scaling and discriminant analysis have been used to study differentiation, population assignment, and to analyze genetic distances.

Neighborhood graph approaches like t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) can visualize continental and subcontinental structure in human data. With larger datasets, UMAP better captures multiple scales of population structure; fine-scale patterns can be hidden or split with other methods, and these are of interest when the range of populations is diverse, when there are admixed populations, or when examining relationships between genotypes, phenotypes, and/or geography.
Variational autoencoders can generate artificial genotypes with structure representative of the input data, though they do not recreate linkage disequilibrium patterns.

== Demographic inference ==