NAME
MassCalculator - Implements common mass computations in mass spectrometry
SYNOPSIS
use InSilicoSpectro::InSilico::MassCalculator;
InSilicoSpectro::InSilico::MassCalculator::init('insilicodef.xml');
DESCRIPTION
MassCalculator Perl library is intended to support common mass spectrometry-related computations that must be routinely performed by bioinformaticians dealing with mass spectrometry data.
To accommodate as many as possible user requirements, we decided to both support a classical procedural programming model and an object oriented model. Per se MassCalculator is not designed as an object oriented code but we provide a series of elementary classes for modeling proteins, peptides, enzymes, etc. which MassCalculator is compatible with. Moreover, the latter classes are rather simple and neutral in their design such that they should fit, after further derivations eventually, a large range of code design.
We decided not to use Perl object oriented programming to stay with relatively naive and simple code and to allow everybody to decide how to include it in its own project. Nonetheless, MassCalculator is able to deal with Perl objects we provide additionally to represent protein sequences, peptides, enzymes, mass lists, and fragmentation spectra, see their respective documentations.
MassCalculator is released under the LGPL license (see source code).
Configuration file and init function
One XML configuration file is necessary for MassCalculator.pm to work. It contains definitions of chemical elements, amino acids, fragmentation types, enzymes, etc.
These configuration files are read by the function init, which must be called before the other functions are used. It is possible to provide function init with another, or several others, file name(s).
init([$filename])
Opens the given file, or file ${phenyx.config.cleavenzymes} if no parameter was given, and stores all the modif in the dictionary.
massFromComposition($formula)
Computes monoisotopic and average masses from an atomic composition given in $formula. Monoisotopic and average masses are returned as a vector of 2 elements. Negative numbers are possible to accommodate with losses or certain modifications.
Example:
print join(',', massFromComposition('HPO3')), "\n";
Basic mass functions
getMass($el, $mt)
Returns the mass, either monoisotopic or average mass depending on the current setting or the parameter $mt, of elements, amino acids, modifications and molecules.
- $el
-
Contains the name of the "object" whose mass is to return. $el must start with el_ for elements (el_Na, el_H), with aa_ for amino acids (aa_Y, aa_T), with mol_ for molecules (mol_H2O), and mod_ for modifications (mod_Oxidation).
- $mt
-
Contains the mass type (0=monoisotopic, 1=average). When this argument is not provided (most common function use), the current setting of mass type is used.
setMass($symbol, $formula)
This function allows the user to set new molecule, element, amino acid, and modification masses dynamically.
- $symbol
-
Contains the identifier of the new object with the appropriate prefix (el_, aa_, mol_, or mod_).
- $formula
-
Contains the atomic composition (it is used for computing monoisotopic and average masses). Negative numbers of atoms are possible to deal with modifications.
Examples:
setMass('mol_NH3', 'NH3');
setMass('mol_H2O', 'H2O');
setMassType($type)
Sets the current mass type. $type
must be equal to 0 (monoisotopic) or 1 (average).
getMassType
Returns the current mass type.
setModif($name, $mono, $avg, $residues, $residuesAfter, $terminus)
Sets the properties of a new modification. Although for most of the functions in this module only the mass shift induced by a modification is necessary, variablePeptide function needs supplementary data about modifications to locate them. Therefore a dedicated function for setting modification is required. The parameters are:
- name
-
The name of the modification.
- mono
-
The monoisotopic mass shift.
- avg
-
The average mass shift.
- residues
-
A string that either contains all the amino acids where the modification may occur, or the complementary list of such amino acids and in this case the first character in the string must be '^'. Alternatively, residues may be equal to '.' to say that all amino acids may be modified.
- residuesAfter
-
An extra constraint on the amino acid following the modified amino acid. The rule for listing the imposed amino acids after a modification site is the same as for residues.
- terminus
-
This parameter must be set to '-' to indicate a regular modification, to 'N' to indicate an N-terminal modification, and to 'C' to indicate a C-terminal modification. In case of a N- or C-terminal modification, the parameter residues gives the list of possible amino acids for the first, respectively last, position in the peptide sequence, and the parameter residuesAfter gives the constraint on the second, respectively penultimate, amino acid.
getModif
This function returns the characteristics of a modification. Namely it returns a vector containing the mass shift (according to the current mass type), the list of possible residues, the list of imposed residues at the next position, and the terminus status ('-', 'C', or 'N').
Peptide-related functions
This section groups all the functions that are directly related to peptides, i.e. peptide mass computation and protein digestion.
Modifications
To properly deal with modified proteins (and hence peptides) and compute their mass and MS/MS spectra, in case of peptides, we introduce a convention that allows to localize modifications in protein/peptide sequences.
A protein/peptide sequence is a sequence of amino acids
a_1 a_2 a_3 a_4 ... a_n.
The corresponding modifications are represented either as a string or as a vector.
String representation
The string takes the form
m_0:m_1:m_2:m_3:m_4: ... :m_n:m_(n+1),
where m_0 is the N-terminal site modification, m_i is a_i modification, and m_(n+1) is the C-terminal site modification. For instance, a peptide
DEMSCGHTK
might be modified according to
ACET_nterm:::Oxidation::Cys_CAM::Oxidation:::
which means that there is an N-terminal acetylation, the methionine and the histidine are oxidized, and the cysteine is carboxyamidomethylated; no C-terminal modification. We see that in this notation empty positions between colons are possible to denote the absence of modification. The modification identifiers come from the configuration file.
In the above string notation it is possible to define variable modifications, see function variablePeptide.
Vector representation
Alternatively, modifications can be localized by using a vector of strings. The length of the vector is len(peptide)+2 and element at index 0 corresponds to the N-terminus, index len(peptide)+1 to the C-terminus and the indices between 1 and len(peptide) correspond to the amino acids. The strings of this vector follow the same rule as the strings between ':' in the modification string, i.e. the contain the name of the modification or nothing, or they define variable modifications, see function variablePeptide.
The PMF case
In PMF only the peptide masses matter and it is not necessary to know the location of the modifications. We only need to know their numbers. Hence, when dealing with PMF computations we introduce a third convention for modification description. We use a vector that contains the number of occurrences and the modifications alternatively:
num1, modif1, num2, modif2, ...
modifToString($modif, [$len])
In order to display modification strings/vectors conveniently, we provide a unique function modifToString that accepts all three formats and display the modifications in $modif as a string using an appropriate style. If the parameter $len (peptide length) is also given, then modifToString complements the length if the returned string if necessary (MS/MS only).
getPeptideMass(%h)
This function computes the (uncharged) mass of a peptide. The parameters are passed as a reference to a hash to permit named parameters. The named parameters are:
- pept
-
The peptide sequence. If it contains B, Z or X then getPeptideMass returns -1.
- modif
-
A reference to a vector containing the names of the modifications of the peptide. If modif is not specified, the peptide is assumed unmodified. Empty names or undefined names are considered as no modification.
- termGain
-
Mass delta to add to the amino acid masses in case the usual plus water rule does not apply, e.g. when using trypsin to introduce two 18O molecules at the C-terminus.
Example:
my $peptide = 'QCTIPADFK';
my @modif = ('Cys_CAM');
print "$peptide mass is ", getPeptideMass(pept=>$peptide, modif=>\@modif), "\n";
getCorrectCharge($mTheo, $mExp, $delta, $maxCharge)
Returns the appropriate charge and theoretical mass over charge ratio (m/z) for a given experimental m/z ratio and theoretical mass (charge not known). The result is a vector (charge, m/z) and charges between 1 and $maxCharge
are tested.
- $mTheo
-
Theoretical mass.
- $mExp
-
Experimental m/z ratio.
- $delta
-
Mass tolerance for comparing
$mTheo
and$mExp
multiplied by the charge. - $maxCharge
-
Maximum charge to test for (minimum is 1).
digestByRegExp(%h)
Generic digestion function where the enzyme is specified as a Perl regular expression or a InSilicoSpectro::InSilico::CleavEnzyme object. The maximum number of missed cleavages per peptide can be set arbitrarily and the result of the digestion is either (1) a data structure that lists all the peptides with their masses, modifications, and start/stop positions; or (2) a vector of InSilicoSpectro::InSilico::Peptide objects. The choice depends on the type of the parameter protein, see hereafter. A mass window can be specified.
The named parameters are:
- protein
-
The protein sequence. This sequence can be provided as a string or as an object of class InSilicoSpectro::InSilico::AASequence. In the latter case, if no modif parameter is given, then the modifications are taken from the AASequence object, otherwise the parameter modif overrides the modifications stored in the object.
If protein is given as an object then the result of the digestion will be a vector of InSilicoSpectro::InSilico::Peptide objects instead of the data structure.
- modif
-
The parameter giving the localized modifications. If this parameter is not provided, then the protein is assumed to have no localized modification. Localized variable modifications are possible, see function variablePeptide. In case there are localized variable modifications, all the combinations of modifications will be put in the final result, i.e. the peptide is present several times with different modification strings.
Localized modifications are specified according to the two possible format explained above.
- fixedModif
-
A reference to a vector of fixed modification names. The rules for identifying possible amino acids where modifications occur are read from the configuration file. All possible modification sites for all fixed modifications are computed and these sites are combined with the fixed modifications given by the parameter modif.
Fixed modification always have the priority over variable modifications and it is not allowed to have two fixed modifications at the same location.
- varModif
-
A reference to a vector of variable modification names. If this parameter is provided then each generated peptide will be tested for possible variable modifications and all the combinations will be put in the final result, i.e. the peptide is present several times with different modification strings.
- pmf
-
If this parameter is set to any value, then the exact modification location in the result is replaced by the format for PMF, see above.
- nmc
-
Maximum number of missed cleavages, default 0.
- minMass
-
Minimum peptide mass, default 500 Da.
- maxMass
-
Maximum peptide mass, default 999999 Da.
- enzyme
-
The enzyme can be specified either by a regular expression or by a CleavEnzyme object. If this parameter is not provided then the digestion is performed for trypsin according to the regular expression
(?<=[KR])(?=[^P])
Regular expressions used for specifying an enzyme can be given as strings or precompiled regular expression such as
$dibasicRegex = qr/(?<=[KR])(?=[KR])/,
which is exported by this module.
When you specify the enzyme by a CleavEnzyme object, the regular expression is read from the object.
- CTermGain, NTermGain
-
In case the enzyme does not add H at N-termini and OH at C-termini, you can define new formulas for what is added.
When the enzyme comes as a CleavEnzyme object, these values are read from the enzyme object directly.
- CTermModif, NTermModif
-
In case the enzyme does introduce a modification at peptide C-/N-termini, you can provide the name of the modification to apply.
When the enzyme comes as a CleavEnzyme object, these values are read from the enzyme object directly.
- methionine
-
If methionine is set to any value then, in case the first protein amino acid is a methionine, supplementary peptides with initial methionine removed are generated. This may be useful when processing RNA-translated sequences.
- addProton
-
This parameter, if set to any value, tells the function to add the mass of one proton to each peptide mass. This is useful for PMF as it allows direct comparison of theoretical (charged) masses with experimental masses.
- duplicate
-
When set to any value, and the protein is given as a AASequence object, this parameter tells the function not to try to save memory when creating the Peptide object.
The result of the digestion is returned either as a vector of InSilicoSpectro::InSilico::Peptide objects or as a data structure which is a vector of six references:
result[0] -> vector of peptide sequences
result[1] -> vector of start positions in the original protein string
result[2] -> vector of end positions in the original protein string
result[3] -> vector of number of missed cleavages
result[4] -> vector of peptide masses
result[5] -> vector of modification descriptions
Variables $digestIndexPept=0, $digestIndexStart=1, $digestIndexEnd=2, $digestIndexNmc=3, $digestIndexMass=4, and $digestIndexModif=5 are exported for convenience.
Example:
my $protein = 'MCTMACTKGIPRKQWWEMMKPCKADFCV';
my $modif = '::Cys_CAM::Oxidation::::::::::::::Oxidation:::::::::::';
my @result = digestByRegExp(protein=>$protein, modif=>$modif, methionine=>1, nmc=>2);
print "$protein\n$modif\n";
for (my $i = 0; $i < @{$result[0]}; $i++){
print "$result[$digestIndexPept][$i]\t$result[$digestIndexStart][$i]\t",
"$result[$digestIndexEnd][$i]\t$result[$digestIndexNmc][$i]\t",
"$result[$digestIndexMass][$i]\t", modifToString($result[$digestIndexModif][$i]), "\n";
}
More examples in InSilicoSpectro/InSilico/test/testCalcDigest.pl and InSilicoSpectro/InSilico/test/testCalcDigestOOP.pl
nonSpecificDigestion(%h)
Generic digestion function where either no enzyme is used or half enzymatic peptides are computed. The result of the digestion is either (1) a data structure that lists all the peptides with their masses, modifications, and start/stop positions (see digestByRegExp above for a description); or (2) a vector of InSilicoSpectro::InSilico::Peptide objects. The choice depends on the type of the parameter protein, see hereafter.
The number of missed cleavage reported in the data structure is used to distinguish between the three possible cases: -3 for no enzyme peptide, -1 for half-tryptic peptides at their N-term end, and -2 for half-tryptic peptides at their C-term end.
The named parameters are:
- protein
-
See function digestByRegExp.
- modif
-
See function digestByRegExp.
- fixedModif
-
See function digestByRegExp.
- varModif
-
See function digestByRegExp.
- pmf
-
See function digestByRegExp.
- minMass
-
Minimum peptide mass.
- maxMass
-
Maximum peptide mass.
- minLen
-
Minimum peptide length in amino acids.
- maxLen
-
Maximum peptide length in amino acids.
- enzyme
-
If this parameter is not provided then the digestion is performed without any enzyme, i.e. the protein is cut after every amino acid and only the mass window and the peptide length window are applied as criteria to filter possible peptide.
If an enzyme regular expression or a CleavEnzyme object (see digestByRegExp above) is provided, then half-enzymatic peptides are generated, i.e. peptides with either the N- or the C-term end that corresponds to a cleavage site. The mass window and peptide length window also apply in this case.
- CTermGain, NTermGain
-
See function digestByRegExp.
- CTermModif, NTermModif
-
See function digestByRegExp.
- addProton
-
See function digestByRegExp.
- duplicate
-
See function digestByRegExp.
Example:
my $protein = 'MCTMACTKGIPRKQWWEMMKPCKADFCV';
my $modif = '::Cys_CAM::Oxidation::::::::::::::Oxidation:::::::::::';
my @result = nonSpecificDigestion(protein=>$protein, modif=>$modif);
print "$protein\n$modif\n";
for (my $i = 0; $i < @{$result[0]}; $i++){
print "$result[$digestIndexPept][$i]\t$result[$digestIndexStart][$i]\t",
"$result[$digestIndexEnd][$i]\t$result[$digestIndexNmc][$i]\t",
"$result[$digestIndexMass][$i]\t", modifToString($result[$digestIndexModif][$i]), "\n";
}
More examples in InSilicoSpectro/InSilico/test/testCalcDigest.pl and InSilicoSpectro/InSilico/test/testCalcDigestOOP.pl
matchPMF(%h)
This function compares an experimental PMF spectrum with a theoretical PMF spectrum, i.e. the result of a digestion function, and associates each theoretical mass with one experimental mass.
The named parameters are:
- digestResult
-
A reference to a vector containing the result of the protein digestion as computed by either digestByRegExp or nonSpecificDigestion functions.
- expSpectrum
-
The experimental spectrum is given either as a data structure or as an object of class InSilicoSpectro::Spectra::ExpSpectrum.
The data structure is a vector of references to vectors corresponding to experimental peaks in the experimental spectrum. The parameter expSpectrum is a reference to the experimental spectrum. Namely, expSpectrum has a structure like
expSpectrum->[0] -> (mass, intensity, s/n, ...) for peak 1 expSpectrum->[1] -> (mass, intensity, s/n, ...) for peak 2 expSpectrum->[2] -> (mass, intensity, s/n, ...) for peak 3 ...
By default, for matchPMF, it is assumed that the mass has index 0 and intensity has index 1 in the peak vectors. The other data in these vectors are not used but they can be retrieved after the spectrum match for computing statistics for instance. If mass index is not 0 or intensity index is not 1, use the parameters massIndex and intensityIndex.
In case expSpectrum is a InSilicoSpectro::Spectra::ExpSpectrum object, the experimental spectrum as well as mass and intensity indices are retrieved from the object directly.
If the peptide mass fingerprinting spectrum has been acquired on an ESI instrument, where peptides can be multiply charged, it is the user responsibility to replace multiply charged peptides m/z values by their singly charged equivalent. The match algorithm ignores peptide charges.
- tol
-
Relative mass error tolerance; this parameter is optional. When not specified, the closest experimental peak is returned for each theoretical mass and any tolerance can be applied afterwards, i.e. outside of the function matchPMF. When specified, the most intense peak found with mass error satisfying
relative error <= tol OR absolute error <= minTol
is returned; if no peak is found then the value in the returned vector remains undefined.
- minTol
-
Absolute mass error, default value 0.1 Da. This parameter is used only in case tol parameter is specified, see above.
- sorted
-
The experimental spectrum must sorted with respect to the mass for matchPMFClosest to work. In case it is already sorted when the function is called, sorted can be set to any value to avoid an unnecessary sort operation.
- massIndex
-
Mass index in the experimental peak vectors, default 0.
- intensityIndex
-
Intensity index in the experimental peak vectors, default 1.
The result of the matching is a vector of the same length as digestResult that contains references to the closest peaks vector in expSpectrum. The reference to the closest experimental peak for the peptide described at index i in digestResult is found at index i in the returned vector.
Examples in InSilicoSpectro/InSilico/test/testCalcPMFMatch.pl and InSilicoSpectro/InSilico/test/testCalcPMFMatchOOP.pl.
variablePeptide(%h)
Given a peptide sequence, fixed and localized variable modifications, and lists of variable and fixed modifications, this function returns a list comprising all possible modification combinations. Each amino acid can only receive one modification at a time and if several are possible for a given amino acid they will yield several distinct modification combinations. In case you need an amino acid to be modified simultaneously by several modifications, you can define this modifications combination as a new modification.
The peptide can given by a string, a InSilicoSpectro::InSilico::Peptide object or a InSilicoSpectro::InSilico::AASequence object.
The parameters are:
- pept
-
The peptide.
- modif
-
A reference to a vector (no string possible here) containing fixed and localized variable modifications for the peptide. If this parameter is not provided, then no such modifications are considered.
Fixed modification are always present whereas localized variable modifications are associated with a specific amino acid but may be present or not. Add '(*)' before a modification name to specify a localized variable modification. It is even possible to give several alternative variable modifications for the same amino acid: add '(*)' before the name of the first one and add the other modifications names separated by comas:
(*)mod1,mod2,mod3
If the peptide is provided as a InSilicoSpectro::InSilico::Peptide object or a InSilicoSpectro::InSilico::AASequence object and the parameter modif is not given, then the localized modifications are taken from the object.
- fixedModif
-
See function digestByRegExp.
- varModif
-
See function digestByRegExp.
- pmf
-
See function digestByRegExp. The total mass shift corresponding to a modification combination follows it in the returned vector.
Examples in InSilicoSpectro/InSilico/test/testCalcVarpept.pl and InSilicoSpectro/InSilico/test/testCalcVarpeptOOP.pl.
locateModif($peptSeq, $modif, $fixedModif, $varModif, $modifVect)
This function is used to generate a vector of modifications on the basis of localized fixed and variable modifications as well as a list of fixed and variable modifications, which are located by using their rule. The result is returned in $modifVect.
The parameters are:
- $peptSeq
-
The peptide sequence.
- $modif
-
The localized modifications, see function variablePeptide.
- $fixedModif
-
A reference to a vector containing the name of fixed modifications.
- $varModif
-
A reference to a vector containing the name of variable modifications.
- $modifVect
-
A reference to a vector that will contain the result.
Example: see InSilicoSpectro/InSilico/test/testCalcDigest.pl and the mini web site.
Fragment mass functions
Fragment types are read from the configuration file. Supplementary fragments can be added subsequently by using the function setLoss, setSeries, and setFragType. Each fragment type is described by three main properties:
- Ion series
-
A fragment ion series is the basic properties required for computing theoretical masses. Classical ion series are a,b,c and x,y,z. They define the generic position where the peptide is fragmented.
Certain very short fragments, such as b1, or very long fragments, such as bn, are normally not produced or detected. Hence an ion series also defines the first and last possible fragments.
- Charge state
-
A fragment charge state can be 1,2,3,...
- Neutral losses
-
Certain amino acids have the potential to loose certain molecules during the fragmentation process. To precisely define a fragment type one must specify if loss(es) is(are) possible and, if yes, which one(s) and with which multiplicity.
For instance, a fragment type b-H2O can be defined that is able to loose water once. Another example is b-H2O* that is allowed to loose water as many times as possible, depending on the peptide sequences (the maximum number of loss is limited by the number of amino acids able to loose water in the peptide sequence).
It is even possible to combine losses: b-NH3*-H2O* is a fragment type that allows for multiple water and ammonia losses.
getFragmentMasses(%h)
This function computes the theoretical fragmentation spectrum (MS/MS spectrum) of a peptide. Namely, the peptide can given by a string, a InSilicoSpectro::InSilico::Peptide object or a InSilicoSpectro::InSilico::AASequence object Top-down proteomics).
The named parameters are:
- pept
-
The peptide.
- modif
-
The peptide localized modifications. Can be a string or a reference to a vector, see function digestByRegExp. When not provided the peptide is assumed unmodified.
Only fixed (and localized) modifications make sense in the context of fragment mass computations.
If the peptide is provided as a InSilicoSpectro::InSilico::Peptide object or a InSilicoSpectro::InSilico::AASequence object and the parameter modif is not given, then the localized modifications are taken from the object.
- fragTypes
-
A reference to a vector that gives the list of fragment types to use for computing the theoretical spectrum.
- spectrum
-
A reference to a hash that is used for storing the theoretical spectrum. This hash is emptied before computation to remove a possible previous spectrum.
The data structure containing the theoretical spectrum is as follows.
$spectrum{peptide} = peptide sequence or Peptide object (depends on the given parameter)
$spectrum{modif} = reference to the vector of modifications actually used for the computation
$spectrum{peptideMass} = peptide mass
$spectrum{mass} = theoretical spectrum
$spectrum{ionType} = ion types
$spectrum{mass} is a reference to another hash whose keys are either term for N- or C-terminal fragments, or internal for internal fragments such as immonium ions.
$spectrum{mass}{term} is a reference to a hash whose keys are the computed fragment types. For instance, we assume that b,y,b-H2O* were computed. We have
$spectrum{mass}{term}{b} = theoretical b fragment masses
$spectrum{mass}{term}{y++} = theoretical y++ fragment masses
$spectrum{mass}{term}{b-H2O*} = theoretical b-H2O* fragment masses
In case of fragment types without loss the hash
$spectrum{mass}{term}{fragment_type}
is a reference to a vector of length n, where n is the length of the peptide sequence. For a N-terminal fragment type (a,b,c), index i in this vector correspond to a fragment that includes the i-th amino acid as its last amino acid. For a C-terminal fragment type (x,y,z), index i in this vector corresponds to a fragment that includes the (n-i)-th amino acid as its last amino acid. unused positions (too short or too long fragments) are left undefined.
Losses
In case of fragment with loss, it is possible that more than n positions are required to store the theoretical spectrum. We note that by convention fragment types with loss (b-H2O, b-H2O*) must include one loss at least. That is they do not include the masses corresponding to the fragment type without loss (b).
Simultaneous losses at one amino acid are not permitted. Such multiple losses can be caused by defining a fragment type with several possible losses and at least two of these losses are possible on the same amino acid. In case you really need multiple losses, you can define a new fragment type that only happens at the common amino acid and induces the total mass shift.
If we find no possible loss in a given peptide sequence for a given fragment type, then the reference
$spectrum{mass}{term}{fragment_type}
remains undefined. If only one position for potential loss exists or the fragment type limits the number of loss to one (b-NH3), then the length of the theoretical spectrum will be n as for fragments without loss. In case several losses are possible and several positions for loss are found, the the length of the vector referen- ced by $spectrum{mass}{term}{fragment_type} is a multiple of n. For instance for two losses we would have the masses with one loss in the cells 0...n-1 and the masses with two losses in the cells n...2n-1. Recall that impossible fragments are represented by undefined cells.
In order to display theoretical spectra it is useful to be able to keep track of the exact number of losses, especially when several types of losses are possible simultaneously. Thus we store in
$spectrum{ionType}
strings that describe the exact type of ions at hand. For instance b++-2(H2O)-NH3 to describe a doubly charged b fragment with two water losses and one ammonia loss (its fragment type b++-H2O*-NH3* does not provide this information). The ion types are listed by fragment types as are the theoretical masses
$spectrum{ionType}{b}
$spectrum{ionType}{y++}
$spectrum{ionType}{b-H2O*}
For fragment types without loss we simply copy the fragment type
$spectrum{ionType}{b}[0] = 'b'
$spectrum{ionType}{y++}[0] = 'y++'
whereas for fragment types with loss, we compute the appropriate strings and store them as a vector referenced by
$spectrum{ionType}{fragment_type}.
Immonium ions
Immonium ions are internal ions that result from a y_m/a_n cleavages. By specifying 'immo' in the list of ion types, the masses corresponding to the immonium ions of the amino acids present in the peptide are computed and stored in a hash
$spectrum{mass}{intern}{immo}
whose keys are the one letter codes of the residues and the values are the masses.
The configuration file contains the information about which are the amino acids that yield such ions as well as the mass delta to apply to the amino acid mass. Moreover, modified cysteines as well as modified methionine and histidine are included in the immonium ions with adapted mass. For Lysine, both its "nominal" immonium ion mass (101) and after ammonia loss (84) are computed. Immonium ions for the N- and C-terminal amino acids are not computed.
Example
my %spectrum; $peptide = 'HCMSKPQMLR'; $modif = '::Cys_CAM::::::Oxidation:::'; getFragmentMasses(pept=>$peptide, modif=>$modif, fragTypes=>['b','a', 'b-NH3*','b-H2O*','b++','y','y-NH3*','y-H2O*','y++','immo'], spectrum=>\%spectrum); print "\nfragments ($spectrum{peptideMass}):\n"; my $len = length($peptide); foreach my $frag (keys(%{$spectrum{mass}{term}})){ for (my $i = 0; $i < @{$spectrum{ionType}{$frag}}; $i++){ print $spectrum{ionType}{$frag}[$i]; for (my $j = $i*$len; $j < ($i+1)*$len; $j++){ print "\t", $spectrum{mass}{term}{$frag}[$j]+0.0; } print "\n"; } } foreach my $frag (keys(%{$spectrum{mass}{intern}})){ print $spectrum{ionType}{$frag}[0]; foreach my $aa (keys(%{$spectrum{mass}{intern}{$frag}})){ print "\t$aa\t$spectrum{mass}{intern}{$frag}{$aa}"; } print "\n"; }
More examples in InSilicoSpectro/InSilico/test/testCalcFrag.pl and InSilicoSpectro/InSilico/test/testCalcFragOOP.pl.
matchSpectrumClosest(%h)
This function compares an experimental MS/MS spectrum with a theoretical MS/MS spectrum and associates each theoretical mass with its closest experimental mass. The application of any mass tolerance can be done afterwards.
The named parameters are:
- pept
-
The peptide sequence or a Peptide object. When not provided the spectrum is considered to be already computed and ready for use in spectrum.
- modif
-
The modification string. When not provided the peptide is assumed unmodified. If the peptide is provided as a Peptide object, then the modifications are read from the Peptide object if no modif parameter is set, otherwise it overrides the object data.
- fragTypes
-
A reference to a vector that gives the list of fragment types to use for computing the theoretical spectrum.
- spectrum
-
The theoretical spectrum.
- expSpectrum
-
The experimental spectrum, a data structure or an object InSilicoSpectro::Spectra::ExpSpectrum, see method PMFMatch.
- sorted
-
The experimental spectrum must sorted with respect to the mass for matchSpectrumClosest to work. In case it is already sorted when the function is called, sorted can be set to any value to avoid an unnecessary sort operation.
- massIndex
-
The mass index in the experimental peak vectors, default 0.
The result of the matching is an addition to the spectrum data structure. A new structure parallel to spectrum->{mass} is created that is named
spectrum->{match}
that references a hash analogous to spectrum->{mass} but with the theoretical masses replaced by references to the closest peaks vector in expSpectrum.
Examples in InSilicoSpectro/InSilico/test/testCalcMatch.pl and InSilicoSpectro/InSilico/test/testCalcMatchOOP.pl.
matchSpectrumGreedy(%h)
Alternative algorithm for matching experimental and theoretical spectra. The experimental spectrum must have the masses at index 0 and the intensities at index 1 in the peak vectors. The mass error tolerance is determined by the rule
relative error <= tol OR absolute error <= minTol
Four named parameters are available in addition to the ones of matchSpectrumClosest:
- tol
-
Relative mass tolerance (default 500 ppm).
- minTol
-
Minimum absolute mass tolerance (default 0.2 Da).
- order
-
A reference to a vector that gives the fragment types in an order of priority for matching.
- intensityIndex
-
The intensity index in the experimental peak vectors, default 1.
If order is not set, then matchSpectrumGreedy returns the matching experimental peaks within mass tolerance tol with the highest intensities. That is when several experimental masses are possible, the one corresponding to the most intense peak is chosen, although it is not necessarily the closest.
If order is set, then a greedy algorithm is applied additionally. Namely, the fragment types are processed in the order given by this parameter and the most intense peaks are chosen as before but a chosen peak is no longer available for subsequent matches. This ensures that experimental masses are used once only which is not true for the case with no order and for matchSpectrumClosest.
Examples in InSilicoSpectro/InSilico/test/testCalcMatch.pl and InSilicoSpectro/InSilico/test/testCalcMatchOOP.pl.
getSeries($name)
Returns a vector (terminus, monoisotopic delta, average delta, first fragment, last fragment) that contains the parameters of series $name, where
- $name
-
is th name of the series, e.g. b or y.
- terminus
-
is equal to 'N' or 'C'.
- monoisotopic delta
-
is the mass delta to apply when computing the monoisotopic mass. It is 0 for a b series and 1.007825 for y.
- average delta
-
is the mass delta to apply when computing the average mass. It is 0 for a b series and 1.007976 for y.
- first fragment
-
is the number of the first fragment to compute. For instance, fragment b1 is generally not detected hence first fragment should be set to 2 for a b series. The rule is the same for N- and C-term series.
- last fragment
-
is the number of the last fragment counted from the end. For instance, the last b fragment is normally not detected hence last fragment should be set to 2. If a fragment containing all the amino acids of the peptide is possible, then it should be set to 1. The rule is the same for N- and C-term series.
setSeries($name, $terminus, $formula, $firstFrag, $lastFrag)
Adds a new series to the series read from the configuration file. The parameters are defined above in getSeries except formula. The mass deltas are not given as real numbers but rather as deltas in atoms. For instance, b series formula is empty because we do want 0 delta and c series formula is 'N 1 H 3'. Example:
setSeries('c', 'N', 'N 1 H 3', 2, 2);
getLoss($name)
Returns a vector (residues, monoisotopic delta, average delta) that contains the parameters of a neutral loss, where
- $name
-
is the name of the neutral loss.
- residues
-
is a string containing the one character codes of the amino acids where the loss is possible.
- monoisotopic delta
-
is the mass delta to apply when computing the monoisotopic mass.
- average delta
-
is the mass delta to apply when computing the average mass.
setLoss($name, $residues, $formula)
Adds a new loss to the ones read from the file fragments.xml. The parameters are defined above in getLoss except formula. The mass deltas are not given as real numbers but rather as deltas in atoms. Example:
setLoss('H2O', 'ST', 'H 2 O 1');
getFragType($name)
Returns a vector (series, charge, loss 1, repeat 1, loss 2, repeat 2, ...) containing the parameters of a fragment type, where
- $name
-
is the name of the fragment type.
- series
-
is the series on which this fragment type is based.
- charge
-
is the charge state of a fragment of this type.
- loss k
-
in case the fragment type includes losses, loss k is set to the name of each loss possible for this fragment type.
- repeat k
-
the maximum number of each loss, -1 means no maximum.
getFragTypeList
Returns a vector containing the list of all fragment type names (in arbitrary order).
setFragType($name, $series, $charge, $loss, $repeat)
Adds a new fragment type to the ones read from the file fragments.xml. The parameters $name, $charge and $series are defined above in getFragType.
- $loss
-
is the reference of a vector containing the names of the possible losses for this fragment type. In case no loss is possible then $loss may be let undefined or references an empty vector.
- $repeat
-
gives the repetitions of the losses listed in vector $loss. If $loss is undefined or references an empty vector, then $repeat is ignored.
Example:
setFragType('y++-H2O*-NH3(2)', 'y', 2, ['H2O', 'NH3'], [-1, 2]);
defines a fragment type made of doubly charged y fragments with a maximum of 2 ammonia losses and an arbitrary number of water losses.
setImmonium($residues, $delta)
Replaces the definition of immonium ion parameters read from the file fragments.xml.
- $residues
-
is a string containing the one letter codes of the residues that yield detectable immonium ions.
- $delta
-
is the mass delta to obtain the immonium ion mass from its amino acid mass.
getImmonium
Returns a vector (residues, delta); the parameters residues and delta are defined in setImmoniumn above.
getImmoniumMass($aa, $mod)
Computes the mass of an immonium ion given the amino acid one letter code $aa and a possible modification $mod. In case of Lysine (K), two values are returned.
EXAMPLES
See programs starting with testCalc in folder InSilicoSpectro/InSilico/test/.
AUTHORS
Jacques Colinge, Upper Austria University of Applied Science at Hagenberg