Unlike with the rarer Mendelian diseases, combinations of different genes and the environment play a role in the development and progression of common diseases (such as diabetes, cancer, heart disease, stroke, depression, and asthma), or in the individual response to pharmacological agents. To find the genetic factors involved in these diseases, one could in principle do a genome-wide association study: obtain the complete genetic sequence of several individuals, some with the disease and some without, and then search for differences between the two sets of genomes. At the time, this approach was not feasible because of the cost of full genome sequencing. The HapMap project proposed a shortcut.

Although any two unrelated people share about 99.5% of their DNA sequence, their genomes differ at specific nucleotide locations. Such sites are known as single nucleotide polymorphisms (SNPs), and each of the possible resulting gene forms is called an allele. The HapMap project focuses only on common SNPs, those where each allele occurs in at least 1% of the population.

Each person has two copies of all chromosomes, except the sex chromosomes in males. For each SNP, the combination of alleles a person has is called a genotype.
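
The definitions of allele, genotype, and common SNP given so far can be illustrated with a minimal Python sketch. It is not part of the HapMap project's own tooling: the function names, the string encoding of genotypes, and the sample data are all assumptions made for illustration. The sketch counts the alleles carried in a set of genotypes at one site and tests whether every allele reaches the 1% threshold.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele: frequency} for one SNP, given per-person genotypes.

    Assumed encoding: each genotype is a string of the alleles a person
    carries at the site, e.g. "AG" for two chromosome copies or "A" for
    a single copy.
    """
    counts = Counter(allele for g in genotypes for allele in g)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def is_common_snp(genotypes, threshold=0.01):
    """True if the site has at least two alleles, each at or above threshold."""
    freqs = allele_frequencies(genotypes)
    return len(freqs) >= 2 and all(f >= threshold for f in freqs.values())


# Hypothetical genotypes for five people at one SNP (illustrative data only).
sample = ["AA", "AG", "GG", "AG", "AA"]
freqs = allele_frequencies(sample)
print(freqs, is_common_snp(sample))
```

In this made-up sample, allele A is carried on 6 of the 10 chromosome copies and G on 4, so the site would count as a common SNP. A real study would estimate these frequencies from a much larger population sample.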
Genotyping refers to uncovering what genotype a person has at a particular site. The HapMap project chose a sample of 269 individuals and selected several million well-defined SNPs, genotyped the individuals for these SNPs, and published the results.

The alleles of nearby SNPs on a single chromosome are correlated. Specifically, if the allele of one SNP for a given individual is known, the alleles of nearby SNPs can often be predicted, a process known as genotype imputation. This is because each SNP arose in evolutionary history as a single point mutation, and was then passed down on the chromosome surrounded by other, earlier, point mutations. SNPs that are separated by a large distance on the chromosome are typically not very well correlated, because recombination occurs in each generation and mixes the allele sequences of the two chromosomes. A sequence of consecutive alleles on a particular chromosome is known as a haplotype.

To find the genetic factors involved in a particular disease, one can proceed as follows. First, a region of interest in the genome is identified, possibly from earlier inheritance studies. In this region one locates a set of tag SNPs from the HapMap data; these are SNPs that are very well correlated with all the other SNPs in the region. Given the alleles of the tag SNPs, genotype imputation can determine (impute) the other SNPs, and thus the entire haplotype, with high confidence. Next, one determines the genotype for these tag SNPs in several individuals, some with the disease and some without. By comparing the two groups, one determines the likely locations and haplotypes that are involved in the disease.

== Samples used ==