Protein identification is the process of assigning a name to a protein of interest (POI), based on its amino-acid sequence. Typically, only part of the protein’s sequence needs to be determined experimentally in order to identify the protein with reference to databases of protein sequences deduced from the DNA sequences of their genes. Further protein characterization may include confirmation of the actual N- and C-termini of the POI, determination of sequence variants and identification of any post-translational modifications present.
Proteolytic digests A general scheme for protein identification is described. • The POI is isolated, typically by
SDS-PAGE or
chromatography. • The isolated POI may be chemically modified to stabilise Cysteine residues (e.g. S-amidomethylation or S-carboxymethylation). • The POI is digested with a specific protease to generate peptides.
Trypsin, which cleaves selectively on the C-terminal side of Lysine or Arginine residues, is the most commonly used protease. Its advantages include i) the frequency of Lys and Arg residues in proteins, ii) the high specificity of the enzyme, iii) the stability of the enzyme and iv) the suitability of tryptic peptides for mass spectrometry. • The peptides may be desalted to remove ionizable contaminants and subjected to
MALDI-TOF mass spectrometry. Direct measurement of the masses of the peptides may provide sufficient information to identify the protein (see
Peptide mass fingerprinting) but further fragmentation of the peptides inside the mass spectrometer is often used to gain information about the peptides’ sequences. Alternatively, peptides may be desalted and separated by
reversed phase HPLC and introduced into a mass spectrometer via an
ESI source. LC-ESI-MS may provide more information than MALDI-MS for protein identification but uses more instrument time. • Depending on the type of mass spectrometer, fragmentation of peptide ions may occur via a variety of mechanisms such as
collision-induced dissociation (CID) or
post-source decay (PSD). In each case, the pattern of fragment ions of a peptide provides information about its sequence. • Information including the measured mass of the putative peptide ions and those of their fragment ions is then matched against calculated mass values from the conceptual (in-silico) proteolysis and fragmentation of databases of protein sequences. A successful match will be found if its score exceeds a threshold based on the analysis parameters. Even if the actual protein is not represented in the database, error-tolerant matching allows for the putative identification of a protein based on similarity to
homologous proteins. A variety of software packages are available to perform this analysis. • Software packages usually generate a report showing the identity (accession code) of each identified protein, its matching score, and provide a measure of the relative strength of the matching where multiple proteins are identified. • A diagram of the matched peptides on the sequence of the identified protein is often used to show the sequence coverage (% of the protein detected as peptides). Where the POI is thought to be significantly smaller than the matched protein, the diagram may suggest whether the POI is an N- or C-terminal fragment of the identified protein.
De novo sequencing The pattern of fragmentation of a peptide allows for direct determination of its sequence by
de novo sequencing. This sequence may be used to match databases of protein sequences or to investigate
post-translational or chemical modifications. It may provide additional evidence for protein identifications performed as above.
N- and C-termini The peptides matched during protein identification do not necessarily include the N- or C-termini predicted for the matched protein. This may result from the N- or C-terminal peptides being difficult to identify by MS (e.g. being either too short or too long), being post-translationally modified (e.g. N-terminal acetylation) or genuinely differing from the prediction. Post-translational modifications or truncated termini may be identified by closer examination of the data (i.e.
de novo sequencing). A repeat digest using a protease of different specificity may also be useful.
Post-translational modifications Whilst detailed comparison of the MS data with predictions based on the known protein sequence may be used to define post-translational modifications, targeted approaches to data acquisition may also be used. For instance, specific enrichment of phosphopeptides may assist in identifying
phosphorylation sites in a protein. Alternative methods of peptide fragmentation in the mass spectrometer, such as
ETD or
ECD, may give complementary sequence information.
Whole-mass determination The protein’s whole mass is the sum of the masses of its amino-acid residues plus the mass of a water molecule and adjusted for any post-translational modifications. Although proteins ionize less well than the peptides derived from them, a protein in solution may be able to be subjected to ESI-MS and its mass measured to an accuracy of 1 part in 20,000 or better. This is often sufficient to confirm the termini (thus that the protein’s measured mass matches that predicted from its sequence) and infer the presence or absence of many post-translational modifications.
Limitations Proteolysis does not always yield a set of readily analyzable peptides covering the entire sequence of POI. The fragmentation of peptides in the mass spectrometer often does not yield ions corresponding to cleavage at each peptide bond. Thus, the deduced sequence for each peptide is not necessarily complete. The standard methods of fragmentation do not distinguish between leucine and isoleucine residues since they are isomeric. Because the Edman degradation proceeds from the N-terminus of the protein, it will not work if the N-terminus has been chemically modified (e.g. by acetylation or formation of Pyroglutamic acid). Edman degradation is generally not useful to determine the positions of disulfide bridges. It also requires peptide amounts of 1 picomole or above for discernible results, making it less sensitive than
mass spectrometry. == Predicting from DNA/RNA sequences ==