This tool demonstrates the search-all-assess-subset method for controlling the False Discovery Rate (FDR) when you are only interested in a subset of the identified PSMs (peptide-spectrum matches) from a shotgun proteomics experiment.

The workflow is as follows:

  1. Search spectra against all expected proteins
  2. Remove PSMs from irrelevant proteins
  3. Calculate FDR

You can load the data from step 1 into this web tool to perform steps 2 and 3.


Use the Browse… button above to load a CSV file from your computer or paste the data in the text area to the right.

Adhere to the following format:

  • Every row is a unique PSM
  • Every row contains:
    • id: Can be any text or number
    • score: Score given to the PSM, higher scores are better
    • decoy: FALSE or TRUE; TRUE indicates that the PSM matches a decoy peptide sequence
    • subset: FALSE or TRUE; TRUE indicates that the PSM matches a subset peptide sequence
  • The first row contains the column names
  • All columns should be comma-separated

Note that the rows with the extra set of decoys (decoy = TRUE and subset = FALSE) do not have to be the non-subset decoys from the same mass spectrometry run; they can be chosen by the user.

Example input:

id,score,decoy,subset
1,7.67,TRUE,FALSE
2,10.99,TRUE,TRUE
3,75.10,FALSE,FALSE
4,73.83,FALSE,TRUE

Warning: use this tool only with PSMs from a competitive target-decoy search! The minimal required input consists of PSMs from subset targets and subset decoys (rows 4 and 2 in the example input). Additional decoy PSMs (row 1) are used for a more stable FDR calculation. Non-subset targets (row 3) are ignored in the analysis.
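The input format and the role of each row can be illustrated with a small Python sketch. It is not the tool's own code: the function names `parse_bool`, `load_psms` and `role` are hypothetical, and the role labels are only shorthand for the four cases described above.

```python
import csv
import io


def parse_bool(text):
    """Map the literal strings TRUE/FALSE to Python booleans."""
    value = text.strip().upper()
    if value not in ("TRUE", "FALSE"):
        raise ValueError(f"expected TRUE or FALSE, got {text!r}")
    return value == "TRUE"


def load_psms(csv_text):
    """Read PSM rows with the columns id, score, decoy and subset."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {"id", "score", "decoy", "subset"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [
        {
            "id": row["id"],
            "score": float(row["score"]),
            "decoy": parse_bool(row["decoy"]),
            "subset": parse_bool(row["subset"]),
        }
        for row in reader
    ]


def role(psm):
    """Name the part a PSM plays in the analysis (labels are illustrative)."""
    if psm["subset"]:
        return "subset decoy" if psm["decoy"] else "subset target"
    return "extra decoy" if psm["decoy"] else "ignored"


example = """id,score,decoy,subset
1,7.67,TRUE,FALSE
2,10.99,TRUE,TRUE
3,75.10,FALSE,FALSE
4,73.83,FALSE,TRUE
"""

psms = load_psms(example)
for psm in psms:
    print(psm["id"], role(psm))
```

Rejecting anything other than TRUE or FALSE is a design choice of this sketch; it surfaces formatting mistakes early instead of silently treating unknown values as FALSE.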

Download the results to your computer:

Download

The following columns were calculated:

  • pi_0_cons: A conservative estimate of \(\pi_0\): pi_0_cons = (#decoys + 1) / (#targets + 1)
  • FDR: The estimated FDR at this score cutoff for subset PSMs. Calculated according to the classical TDA method. Does not work well for small subsets. Missing for non-subset or decoy PSMs.
  • FDR_stable: The estimated stable FDR at this score cutoff for subset PSMs. This FDR is estimated from the complete decoy set and pi_0_cons. This method is more stable for smaller subsets and, for large subsets, will be close to the FDR estimates. Missing for non-subset or decoy PSMs. Please check the diagnostics on the next tab.
  • FDR_BH: The estimated stable FDR at this score cutoff for subset PSMs. This FDR is estimated from the complete decoy set according to the Benjamini-Hochberg FDR procedure (\(\pi_0\) set to 1). This FDR estimate is more conservative than FDR and FDR_stable. Use this method when you have a large decoy set but no decoy information on the subset PSMs (e.g. when the search engine does not return the decoy protein ids). Please check the diagnostics on the next tab. Missing for non-subset or decoy PSMs.
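The relation between these columns can be made concrete with a rough Python sketch. It is written under stated assumptions and is not the tool's actual implementation: the decoy and target counts in pi_0_cons are taken over the subset, the "complete decoy set" is taken to be subset decoys plus extra decoys, FDR_BH is computed as the decoy tail fraction rescaled by the number of subset targets, and FDR_stable is that value multiplied by pi_0_cons. The function name `add_fdr_columns` is hypothetical, and no capping or monotone (q-value style) correction is applied.

```python
def add_fdr_columns(psms):
    """Return one row per subset target with the four calculated columns.

    psms: list of dicts with keys score (float), decoy (bool), subset (bool).
    """
    targets = [p for p in psms if p["subset"] and not p["decoy"]]
    subset_decoys = [p["score"] for p in psms if p["subset"] and p["decoy"]]
    all_decoys = [p["score"] for p in psms if p["decoy"]]
    if not targets or not all_decoys:
        raise ValueError("need at least one subset target and one decoy")

    # Conservative pi_0 (assumption: counts are taken within the subset).
    pi_0_cons = (len(subset_decoys) + 1) / (len(targets) + 1)

    rows = []
    for target in targets:
        cutoff = target["score"]
        n_targets = sum(1 for u in targets if u["score"] >= cutoff)
        n_subset_decoys = sum(1 for s in subset_decoys if s >= cutoff)
        # Fraction of the complete decoy set scoring at or above the cutoff.
        decoy_tail = sum(1 for s in all_decoys if s >= cutoff) / len(all_decoys)

        fdr_bh = decoy_tail * len(targets) / n_targets  # pi_0 set to 1
        rows.append({
            "score": cutoff,
            "pi_0_cons": pi_0_cons,
            "FDR": n_subset_decoys / n_targets,  # classical TDA
            "FDR_stable": pi_0_cons * fdr_bh,
            "FDR_BH": fdr_bh,
        })
    return rows


toy = (
    [{"score": s, "decoy": False, "subset": True} for s in (10, 8, 6, 4)]
    + [{"score": 5, "decoy": True, "subset": True}]
    + [{"score": s, "decoy": True, "subset": False} for s in (7, 3, 2)]
)
for row in add_fdr_columns(toy):
    print(row)
```

In this toy input, FDR_stable is always pi_0_cons times FDR_BH, which is why FDR_BH is the more conservative of the two whenever pi_0_cons is below 1.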

These are diagnostic plots to evaluate the quality of the decoy set and the uncertainty in the estimated fraction of incorrect target PSMs (pi_0). They allow an informed choice about the use of the stable all-sub FDR estimator and the large decoy set.

Panel a shows the posterior distribution of pi_0 given the observed number of target and decoy PSMs in the subset. The vertical line indicates the conservative pi_0 estimate used in the calculations. At very high pi_0 uncertainty (broad peak), you can opt to use the BH procedure to minimize sample-to-sample variability (see FDR_BH in the output). However, this comes at the expense of overly conservative PSM lists.

Our improved TDA for subsets relies on the assumption that incorrect subset PSMs and the complete set of decoys have the same distribution. This distributional assumption can be verified through a PP-plot, in which the empirical Cumulative Distribution Function (eCDF) of the decoys is plotted against the eCDF of the subset target PSMs. The PP-plots in panels b to d display, respectively, the target subset PSMs plotted against all decoy PSMs from the complete search, the decoy subset PSMs plotted against all decoy PSMs from the complete search, and the target subset PSMs plotted against the decoy subset PSMs. The full line in panels b and d indicates a line with a slope of pi_0. The full line in panel c indicates the identity line.

The first part of the plot in panels b and d should be linear with a slope equal to pi_0. The second part of the plot will deviate from this line towards higher percentiles and will ultimately become vertical (decoy percentile = 1). If we see this profile in panel b, we have a good indication that the set of decoys from the complete search is representative of the mixture component for incorrect PSMs in the target mixture distribution. When there is high uncertainty on pi_0, as indicated by panel a, the linear pattern in the data points might deviate from the drawn solid line, but it should still be more or less linear.

Deviations from this pattern might be subtle, so we provide the PP-plots in panels c and d to support the conclusion drawn from panel b. The PP-plot in panel c shows the subset decoy PSMs plotted against all decoy PSMs. The whole plot should follow the identity line, indicating that the complete set of decoys is a good representation of the subset decoys. To verify that the subset decoys (and thus also the complete set of decoys) are representative of the mixture component for incorrect PSMs in the target mixture distribution, we look at the PP-plot of the subset decoys against the subset targets in panel d. The profile should look as described for panel b.

If the profile matches in panel d but not in panel b, we suggest not using the extra decoy set and using only the subset decoys for FDR estimation. When the profile does not match in panel d, the subset decoys might not be representative of incorrect PSMs. This can indicate that pi_0 is estimated incorrectly, since the estimate is based on the subset PSM scores. In this case, the first part of the plot in panel d can deviate from the (incorrect) pi_0 slope line. But if this first part is linear, it still indicates that the extra set of decoys is representative of the mixture component of incorrect target PSMs. Since pi_0 is then unknown, we set it to 1 (see FDR_BH in the output).
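To make the construction of such a PP-plot concrete, here is a minimal Python sketch that computes the plotted coordinates. The helper names `ecdf` and `pp_points` are hypothetical, and placing the decoy eCDF on the horizontal axis is an assumption based on the description of the plot becoming vertical at decoy percentile = 1.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function giving the fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def pp_points(x_scores, y_scores):
    """PP-plot coordinates: (eCDF of x_scores, eCDF of y_scores),
    evaluated at every distinct score in the pooled data."""
    fx, fy = ecdf(x_scores), ecdf(y_scores)
    thresholds = sorted(set(x_scores) | set(y_scores))
    return [(fx(t), fy(t)) for t in thresholds]


# Toy data: half of the subset targets are incorrect (pi_0 = 0.5) and score
# in the same range as the decoys; the correct targets score much higher.
decoys = [1, 2, 3, 4]
subset_targets = [1.5, 3.5, 10, 11]

points = pp_points(decoys, subset_targets)
for decoy_percentile, target_percentile in points:
    print(decoy_percentile, target_percentile)
```

In this toy case the target percentile reaches 0.5 (the pi_0 used to build the data) when the decoy percentile reaches 1, after which the points climb vertically, which is the profile described for panels b and d.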

If you are not sure what the diagnostic plots should look like, you can simulate your own data under various (erratic) settings in the simulation tab.

In this tab, you can simulate your own data and look at examples of diagnostic plots.

Random datasets are generated based on an observed number of target and decoy subset PSMs. Optionally, you can also generate a random number of subset decoy PSMs based on the observed number of subset target PSMs and a theoretical pi_0 that you can choose. In the default setting, the decoy distribution is equal to the incorrect subset target distribution. This means that the diagnostic plots from the simulated datasets are representative of experimental settings where the assumptions hold and you can safely use the decoys for FDR estimation. Optionally, you can change the mean and standard deviation of the subset decoys and/or the large set of extra decoys, violating the assumption that they are representative of the incorrect subset targets. Plots generated from these simulated datasets are examples of diagnostic plots when the assumptions are violated and you should not use these decoys for FDR estimation.
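A simulation of this kind can be sketched in a few lines of Python. This is an illustration under assumptions, not the tool's own generator: scores are drawn from normal distributions, the default means and standard deviations are arbitrary, a random number of subset decoys is drawn as a binomial count with probability pi_0, and the function name `simulate_psms` and all of its parameters are hypothetical.

```python
import random


def simulate_psms(n_targets, n_extra_decoys, pi_0, n_subset_decoys=None,
                  incorrect=(0.0, 1.0), correct=(3.0, 1.0),
                  subset_decoy=None, extra_decoy=None, seed=0):
    """Simulate rows in the input format (id, score, decoy, subset).

    Each distribution is a (mean, standard deviation) pair. By default both
    decoy sets share the distribution of the incorrect subset targets.
    """
    rng = random.Random(seed)
    subset_decoy = subset_decoy or incorrect
    extra_decoy = extra_decoy or incorrect
    if n_subset_decoys is None:
        # Assumption: one Bernoulli(pi_0) draw per subset target.
        n_subset_decoys = sum(rng.random() < pi_0 for _ in range(n_targets))

    rows = []

    def add(n, decoy, subset, pick_dist):
        for _ in range(n):
            mean, sd = pick_dist()
            rows.append({"id": len(rows) + 1, "score": rng.gauss(mean, sd),
                         "decoy": decoy, "subset": subset})

    # Subset targets: a mixture of incorrect (weight pi_0) and correct PSMs.
    add(n_targets, False, True,
        lambda: incorrect if rng.random() < pi_0 else correct)
    add(n_subset_decoys, True, True, lambda: subset_decoy)
    add(n_extra_decoys, True, False, lambda: extra_decoy)
    return rows


# Assumptions hold: decoys follow the incorrect target distribution.
valid = simulate_psms(100, 200, pi_0=0.2, n_subset_decoys=20)
# Assumption violated: the extra decoys are shifted to higher scores.
shifted = simulate_psms(100, 200, pi_0=0.2, n_subset_decoys=20,
                        extra_decoy=(5.0, 0.1))
print(len(valid), len(shifted))
```

Fixing the seed keeps the output reproducible, so diagnostic plots made from the valid and shifted datasets can be compared directly.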

For more information on how to interpret the generated diagnostic plots, please read the information in the Check diagnostic plots tab.

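Simulation settings:

  • # subset PSMs
  • # subset decoy PSMs
  • # extra decoys
  • pi_0 (used when the number of subset decoy PSMs is generated at random)
  • Subset target mixture distribution: incorrect targets and correct targets
  • Subset decoys distribution
  • Extra decoys distribution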