Synthedia creates synthetic LC-MS/MS data that mimics real data but with a composition that is exactly known. Currently, Synthedia supports the creation of Data-Independent Acquisition (DIA) style data, wherein fixed, large m/z windows are sequentially isolated for fragmentation. We have focused on creating DIA data to date because the analysis performed by processing tools is substantially complex and the impact of different acquisition methodologies on the eventual outcome is more difficult to predict.
Required input files
To simulate DIA data with Synthedia, a set of peptide precursors and fragments must be supplied. These can be given as either a Prosit library or MaxQuant search result files from the analysis of data-dependent acquisition (DDA) data.
Prosit is a machine learning-based application that predicts MS/MS fragmentation patterns for input peptide sequences. Input sequences can be arbitrary and do not necessarily need to originate from any specific organism or protein. Prosit will generate an output file that contains predicted abundances for peptide sequence ions.
To create a Synthedia-compatible Prosit spectral library:
- Go to https://www.proteomicsdb.org/prosit/
- Navigate to 'Spectral Library'
- Upload your target peptide list as described in the Prosit documentation
- IMPORTANT: under the 'Output format' header (just prior to submitting the Prosit task), ensure that Generic Text is selected
Once processing is complete, the resulting file can be used with Synthedia.
When Prosit inputs are given, Synthedia models peptide precursor abundances using a Gaussian distribution of log2 intensities, which approximates the distributions typically observed upon analysis of real data. The parameters of this distribution (mean and standard deviation) can be specified with the --prosit_peptide_abundance_mean and --prosit_peptide_abundance_stdev parameters, which have defaults of 22 and 3 respectively.
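The abundance model described above can be illustrated with a minimal sketch. It assumes only what is stated here (log2 intensities drawn from a Gaussian with mean 22 and standard deviation 3); the function name and the use of Python's standard random module are illustrative choices and are not Synthedia's own code.

```python
import math
import random
from statistics import mean, stdev

def sample_precursor_intensities(n_peptides, log2_mean=22.0, log2_stdev=3.0, seed=None):
    """Draw linear-scale intensities whose log2 values follow a Gaussian.

    Hypothetical helper for illustration; defaults mirror the documented
    --prosit_peptide_abundance_mean and --prosit_peptide_abundance_stdev values.
    """
    rng = random.Random(seed)
    # Sample on the log2 scale, then convert back to a linear intensity.
    return [2.0 ** rng.gauss(log2_mean, log2_stdev) for _ in range(n_peptides)]

intensities = sample_precursor_intensities(10000, seed=1)
log2_values = [math.log2(x) for x in intensities]
print(f"mean log2 intensity: {mean(log2_values):.2f}")
print(f"stdev log2 intensity: {stdev(log2_values):.2f}")
```

Raising the mean shifts all precursors towards higher intensity, while raising the standard deviation widens the dynamic range of the simulated sample.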
MaxQuant 'txt' directories
As an alternative to Prosit, Synthedia can read and simulate DIA data based upon the MaxQuant processing results of a file acquired using a Data-Dependent Acquisition (DDA) strategy. In this case, peptide fragment ions are generated based on the matched ions for a PSM reported in the MaxQuant msms.txt file from the Masses and Intensities columns. Note: these are only those fragment ions that MaxQuant assigns as matching a given peptide - they may not necessarily provide a 'full' sequence coverage and may not be correctly assigned in some cases.
Synthedia offers options to filter reverse and contaminant peptides as well as filter PSMs by Posterior Error Probability (PEP) values.
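The kind of PSM filtering and fragment extraction described above might look like the following sketch. It assumes that the Masses and Intensities columns of msms.txt hold semicolon-delimited values, that reverse hits carry a '+' in a Reverse column, and that PEP is a numeric column. The function names, the 0.01 PEP threshold and the example rows are hypothetical and do not represent Synthedia's implementation.

```python
import csv
import io

def parse_fragments(row):
    """Pair the semicolon-delimited 'Masses' and 'Intensities' values of one PSM."""
    masses = [float(m) for m in row["Masses"].split(";") if m]
    intensities = [float(i) for i in row["Intensities"].split(";") if i]
    return list(zip(masses, intensities))

def filter_psms(rows, max_pep=0.01, drop_reverse=True):
    """Keep PSMs that pass the PEP threshold and are not reverse hits."""
    kept = []
    for row in rows:
        # Assumption: reverse (decoy) hits are flagged with '+'.
        if drop_reverse and row.get("Reverse", "").strip() == "+":
            continue
        if float(row["PEP"]) > max_pep:
            continue
        kept.append(row)
    return kept

# Made-up rows standing in for a real msms.txt file.
example = (
    "Sequence\tPEP\tReverse\tMasses\tIntensities\n"
    "PEPTIDEK\t0.001\t\t175.119;304.161\t1200.5;830.0\n"
    "KEDITPEP\t0.0005\t+\t147.113;260.197\t500.0;410.0\n"
    "SAMPLER\t0.2\t\t175.119;288.203\t90.0;75.0\n"
)
rows = list(csv.DictReader(io.StringIO(example), delimiter="\t"))
kept = filter_psms(rows)
fragments = {row["Sequence"]: parse_fragments(row) for row in kept}
print(fragments)
```

Only fragment ions that MaxQuant matched to the peptide appear in these columns, so the fragment list for a PSM may be shorter than its theoretical ion series.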
Optional input files
Decoy signals file
LC-MS/MS analysis of bottom-up proteomics samples can be complicated by the presence of ions derived from non-peptide sample contaminants. To mimic this, decoy ions can be simulated together with peptide signals by specifying a decoy database in NIST '.msp' format. See the section 'NIST Text Format of Individual Spectra' in this document for details on the .msp format.
Custom .msp files can be specified or pre-prepared files can be downloaded from MS-DIAL. In this online implementation of Synthedia, if decoys are requested but no MSP file is given, the "All public MS/MS (13,303 unique compounds)" file from MS-DIAL is used.
The number of decoy peaks to simulate, as well as the maximum number of fragments to simulate per decoy, can be specified through the command line arguments.
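For readers unfamiliar with the .msp layout, the sketch below parses a small, made-up MSP text into records. It assumes records are separated by blank lines, metadata lines have the form "Key: value", and each peak line holds one m/z and intensity pair. Real .msp files can deviate from this (for example, several peaks per line). The helper that keeps the most intense fragments is one plausible way to cap fragments per decoy; how Synthedia itself selects them is not stated here.

```python
def parse_msp(text):
    """Parse MSP-formatted text into a list of {'meta': ..., 'peaks': ...} dicts."""
    records, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current record.
            if current is not None:
                records.append(current)
                current = None
            continue
        if current is None:
            current = {"meta": {}, "peaks": []}
        if ":" in line and not line[0].isdigit():
            key, value = line.split(":", 1)
            current["meta"][key.strip().upper()] = value.strip()
        else:
            parts = line.split()
            current["peaks"].append((float(parts[0]), float(parts[1])))
    if current is not None:
        records.append(current)
    return records

def top_fragments(record, max_fragments):
    """Hypothetical cap: keep the max_fragments most intense peaks."""
    ranked = sorted(record["peaks"], key=lambda peak: peak[1], reverse=True)
    return ranked[:max_fragments]

# Made-up example records, not taken from any real library.
example_msp = """NAME: Example compound A
PRECURSORMZ: 181.07
Num Peaks: 3
85.03 120
99.04 999
163.06 450

NAME: Example compound B
PRECURSORMZ: 256.12
Num Peaks: 2
110.07 300
238.11 999
"""
records = parse_msp(example_msp)
print(len(records), top_fragments(records[0], 2))
```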
DIA acquisition schema file
DIA-LC-MS/MS data can be acquired in many ways. By default, Synthedia creates non-overlapping 30 Th windows between m/z 350 and m/z 1600. To simulate data using different DIA acquisition strategies, a file defining the acquisition schema can be supplied. An example acquisition schema file and blank templates can be downloaded from the resources tab.
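To make the default scheme concrete, the following sketch lists non-overlapping 30 Th isolation windows from m/z 350 to m/z 1600. Because 1250 Th is not an exact multiple of 30, the final window is truncated at m/z 1600 in this sketch. That truncation is an assumption, as is the function name; Synthedia's own handling of the upper edge may differ.

```python
def make_dia_windows(mz_min=350.0, mz_max=1600.0, width=30.0):
    """Return contiguous (lower, upper) isolation window bounds."""
    windows = []
    lower = mz_min
    while lower < mz_max:
        # Assumption: clip the last window at mz_max rather than overshooting.
        upper = min(lower + width, mz_max)
        windows.append((lower, upper))
        lower = upper
    return windows

windows = make_dia_windows()
print(len(windows), windows[0], windows[-1])
```

Under these assumptions the default range is covered by 42 windows, the last of which spans only m/z 1580 to 1600.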
Compatibility of Synthedia mzML files with other software
The mzML files generated with Synthedia have been tested with:
The mzML files are known not to be compatible with MaxQuant (at least as at MaxQuant Version 184.108.40.206).
Notes about the resulting data
Peptide vs protein abundances
In a real experiment, all peptides from an up-regulated protein should be observed with increased abundance compared to a control group. Synthedia models ion abundances at the peptide level only. This means that, in the case of a two-group simulation, peptides from the same protein may have very different (even opposing) directions of abundance change between groups. This is primarily because the Prosit input type (which is preferred) does not contain mappings between peptides and input proteins, and we wished to maintain the ability to simulate arbitrary peptide data.
As a result, if mzML files generated with Synthedia are analysed with DIA analysis software, the main comparison in abundances should be made at the peptide level.
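A peptide-level comparison against the simulated ground truth could be sketched as follows. The 'Total abundance group_X_sample_Y' column names follow the peptide_table.tsv description in this document; the 'Peptide' identifier column, the zero-based group numbering, the function names and the example values are assumptions made for illustration.

```python
import csv
import io
import math
from statistics import mean

def group_columns(fieldnames, group):
    """Find the 'Total abundance' columns belonging to one group."""
    prefix = f"Total abundance group_{group}_sample_"
    return [name for name in fieldnames if name.startswith(prefix)]

def peptide_log2_fold_changes(handle, group_a=0, group_b=1, id_column="Peptide"):
    """Return {peptide: log2(mean abundance in group_b / mean abundance in group_a)}."""
    reader = csv.DictReader(handle, delimiter="\t")
    cols_a = group_columns(reader.fieldnames, group_a)
    cols_b = group_columns(reader.fieldnames, group_b)
    result = {}
    for row in reader:
        mean_a = mean(float(row[c]) for c in cols_a)
        mean_b = mean(float(row[c]) for c in cols_b)
        # Skip peptides absent from either group to avoid log2(0).
        if mean_a > 0 and mean_b > 0:
            result[row[id_column]] = math.log2(mean_b / mean_a)
    return result

# Made-up table standing in for a real peptide_table.tsv.
example_table = (
    "Peptide\tTotal abundance group_0_sample_0\tTotal abundance group_0_sample_1\t"
    "Total abundance group_1_sample_0\tTotal abundance group_1_sample_1\n"
    "PEPTIDEK\t1000\t1000\t4000\t4000\n"
    "SAMPLER\t800\t800\t200\t200\n"
)
fold_changes = peptide_log2_fold_changes(io.StringIO(example_table))
print(fold_changes)
```

For a real run, the StringIO object would be replaced by an open file handle on the peptide table, and the resulting fold changes compared with those reported by the DIA analysis software for the same peptides.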
Synthetic vs real data
While we have endeavoured to provide a range of options that allow for simulation of a broad range of chromatographic and mass spectral variables, many experimental processes are not modelled which will cause deviations between simulated and real data. As such, users are warned that simulations with Synthedia do not completely re-create experimental complexity and should be used as an investigational tool only.
As an example, Synthedia offers the capability to simulate the same set of precursors as if acquired on chromatographic gradients of different lengths. This means that data acquired on a lengthy gradient could be reconstructed to approximate acquisition on shorter gradients. In these cases, Synthedia simply models spectra containing signals from many peptides as a simple superposition of their individual signals. In reality, however, ion suppression effects would result in data that may look substantially different from that which would be produced if a comparable experimental workflow were to be executed.
Synthedia currently has very limited support for simulating peptides with post-translational modifications. Currently, all cysteine residues are assumed to be modified by carbamidomethylation. No other post-translational modifications are supported.
Simulation time requirements
Depending on your dataset and parameters, simulations can be quite time consuming. For this reason, some restrictions are placed on the data that can be simulated with this web implementation of Synthedia. These are:
- The maximum number of groups that can be simulated is 3
- The maximum number of replicates per group that can be simulated is 6
If you wish to run a simulation that exceeds these limits, either download and run Synthedia locally, or contact us to arrange a collaboration.
As a rough guide, the simulation of 37,000 precursors in full centroid mode takes approximately 15 minutes per mzML file.
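The rough guide above can be turned into a back-of-the-envelope estimate. The sketch assumes one mzML file per sample, files simulated one after another, and roughly 15 minutes per file at about 37,000 precursors; actual run times will depend on the dataset and parameters, and the function name is hypothetical.

```python
def estimate_minutes(n_groups, n_replicates, minutes_per_file=15.0):
    """Estimate total simulation time, assuming one mzML file per sample."""
    n_files = n_groups * n_replicates
    return n_files * minutes_per_file

# Largest simulation allowed by the web implementation: 3 groups x 6 replicates.
total = estimate_minutes(3, 6)
print(f"{total:.0f} minutes ({total / 60:.1f} hours)")
```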
If you notice bugs or have suggestions for improvements, please contact us here or email us
Synthedia output file sets
A Synthedia data simulation produces a range of output files. These are described below:
||One or more mzML files containing the simulated DIA-LC-MS/MS data. The file name suffix _group_X_sample_Y indicates the group and sample to which the file belongs in the case that multiple treatment groups are simulated.
||Tab-separated value file that defines the precursors simulated and the properties (retention time, m/z, abundances) in the simulated mzML files. See below for a detailed description of the contents of this file.
||File containing simulated peptide metadata. This file can be used to repeat an earlier simulation. Since some parameters are randomly generated (e.g. peptide abundance differences between treatment groups), repeating a simulation from this file will give slightly different results from re-running a simulation with the same parameters from scratch.
||Human readable file containing input arguments used to start the simulation.
||Text file containing logging data about the simulation.
Description of peptide_table.tsv
The columns in the peptide_table.tsv output file are described below:
||The protein (or protein group) to which a precursor belongs. Note that this will be 'none' for Prosit input types.
||The peptide sequence of the simulated ion. Note that all Cys residues are assumed to be carbamidomethylated.
||The precursor intensity specified in the input data set. In the case of Prosit inputs, this is arbitrarily set to 100,000,000. In the case of MaxQuant inputs, this is taken as the value in the 'Intensity' column of the 'evidence.txt' file.
||The precursor m/z value taken from the input data files. In the case of Prosit inputs, this is taken as the value in the PrecursorMz column. In the case of MaxQuant inputs, this is taken as the value in the 'm/z' column of the 'evidence.txt' file.
||The precursor charge. In the case of Prosit inputs, this is taken as the value in the PrecursorCharge column. In the case of MaxQuant inputs, this is taken as the value in the 'Charge' column of the 'evidence.txt' file.
||The neutral mass of the peptide. In the case of Prosit inputs, this is taken as the PrecursorMz column value multiplied by the PrecursorCharge column value minus the PrecursorCharge multiplied by 1.007276. In the case of MaxQuant inputs, this is taken as the value in the 'Mass' Column.
||The retention time of the precursor in the input data file. In the case of MaxQuant inputs, this is taken as the value in the 'Retention time' column multiplied by 60. In the case of Prosit inputs, this is taken as the value in the 'iRT' column plus the minimum value in the 'iRT' column.
|Synthetic RT group_X_sample_Y
||The retention time (in seconds) of the precursor in the simulated output data.
|Synthetic RT Start group_X_sample_Y
||The retention time of the first mass spectrum (MS1 or MS2) in which a precursor is observed in the synthetic data of sample Y of group X.
|Synthetic RT End group_X_sample_Y
||The retention time of the last mass spectrum (MS1 or MS2) in which a precursor is observed in the synthetic data in sample Y of group X.
|Synthetic theoretical m/z 0
||The theoretical m/z value of the monoisotopic peak of a precursor in the synthetic data.
|Synthetic m/z group_X_sample_Y
||The simulated m/z value of the monoisotopic peak of a precursor in the synthetic data. This will be different from the theoretical m/z value if a ppm error is included in the simulation.
|Synthetic m/z ppm error group_X_sample_Y
||The part-per-million error between the theoretical peptide m/z value and the value simulated in the synthetic data.
|Total abundance group_X_sample_Y
||Sum of the intensities of all MS1 data points attributable to a given precursor in sample Y of group X.
|Peak height group_X_sample_Y
||Maximum intensity of all MS1 data points attributable to a given precursor in sample Y of group X.
||Log2 value of the abundance offset between the value in the 'Intensity' column and the simulated intensity for sample Y of group X.
|Found in group_X
||Indicates whether the precursor is present in group X. A value of '1' indicates that the precursor is present in one or more samples of group X. A value of '0' indicates that the precursor is not present in any sample of group X.
|Found in group_X_sample_Y
||Indicates whether the precursor is present in sample Y of group X. A value of '1' indicates that the precursor is present in sample Y of group X. A value of '0' indicates that the precursor is not present in sample Y of group X.
|MS1 chromatographic points group_0_sample_0
||Number of MS1 chromatographic points in which a precursor is observed in the simulated data.
|MS2 chromatographic points group_0_sample_0
||Number of MS2 chromatographic points in which a precursor is observed in the simulated data.
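Several of the column definitions above are simple calculations, which the following sketch reproduces. The neutral mass and MaxQuant retention time conversions follow the formulas given in the table. The sign convention of the ppm error (simulated minus theoretical) is an assumption, and all function names and example values are illustrative only.

```python
PROTON_MASS = 1.007276  # value used in the neutral mass description above

def neutral_mass(precursor_mz, charge):
    """PrecursorMz * PrecursorCharge - PrecursorCharge * 1.007276 (Prosit inputs)."""
    return precursor_mz * charge - charge * PROTON_MASS

def ppm_error(simulated_mz, theoretical_mz):
    """Parts-per-million difference; the sign convention is an assumption."""
    return (simulated_mz - theoretical_mz) / theoretical_mz * 1e6

def maxquant_rt_seconds(rt_minutes):
    """MaxQuant 'Retention time' value multiplied by 60."""
    return rt_minutes * 60.0

def total_abundance(ms1_intensities):
    """Sum of all MS1 data points attributed to a precursor."""
    return sum(ms1_intensities)

def peak_height(ms1_intensities):
    """Maximum of all MS1 data points attributed to a precursor."""
    return max(ms1_intensities)

# Made-up values for illustration.
mass = neutral_mass(500.0, 2)
error = ppm_error(500.0025, 500.0)
rt = maxquant_rt_seconds(42.5)
ms1_points = [100.0, 400.0, 900.0, 400.0, 100.0]
print(mass, error, rt, total_abundance(ms1_points), peak_height(ms1_points))
```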
Viewing mzML files
The mzML files produced with Synthedia can be viewed in a range of freely available software, such as TOPPView, which is part of OpenMS.