
@grosenberger-bruker

Hi @lazear,

This PR introduces protein grouping with picked group FDR to Sage and fixes the picked protein FDR issue:

Fix of Picked Protein FDR
While Sage generally computes picked protein FDR correctly, we recently encountered an issue with shared peptides. For example, assume PEPTIDEAK belongs to protA, PEPTIDECK belongs to protC, and the shared peptide PEPTIDEDK belongs to both protA and protC. If PEPTIDEDK is confidently identified, it counts as a new "protein" protA/protC. With a canonical UniProtKB/Swiss-Prot DB, shared peptides typically constitute 5-10% and have similar properties to proteotypic peptides, so the effect on computing picked protein FDR is minor. However, the number of proteins in the Sage runtime log is artificially inflated (3 proteins instead of 2 in this example). The extra entries also appear in the report, but I assume most users will filter them in downstream analysis and recount.

Solution: As proposed in the literature, we now use only proteotypic (unique, non-shared) peptides to compute picked protein FDR. Shared peptides are still reported, but with protein FDR set to 1.0. On canonical databases this has a minor effect and yields more accurate protein counts. For isoforms, this approach is not applicable, so we introduce protein grouping.
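
The idea above can be illustrated with a minimal sketch. This is not the code from this PR: `PeptideHit`, `PickedProtein`, `picked_protein_fdr`, the `rev_` decoy prefix, and the use of the best peptide score as the protein score are all assumptions made for illustration. Shared peptides are simply skipped here, whereas the PR still reports them with protein FDR set to 1.0.

```rust
use std::collections::HashMap;

/// Hypothetical peptide-level record (not Sage's actual data structure).
#[derive(Debug, Clone)]
pub struct PeptideHit {
    pub score: f64,
    /// Proteins this peptide maps to; decoys carry an assumed prefix.
    pub proteins: Vec<String>,
}

/// Hypothetical result row for one picked target/decoy winner.
#[derive(Debug, Clone, PartialEq)]
pub struct PickedProtein {
    pub protein: String,
    pub score: f64,
    pub is_decoy: bool,
    pub q_value: f64,
}

/// Sketch of picked protein FDR restricted to unique (non-shared) peptides.
pub fn picked_protein_fdr(hits: &[PeptideHit], decoy_prefix: &str) -> Vec<PickedProtein> {
    // 1. Discard shared peptides; keep the best unique-peptide score per protein.
    let mut best: HashMap<&str, f64> = HashMap::new();
    for hit in hits.iter().filter(|h| h.proteins.len() == 1) {
        let slot = best
            .entry(hit.proteins[0].as_str())
            .or_insert(f64::NEG_INFINITY);
        *slot = slot.max(hit.score);
    }

    // 2. "Pick": per target/decoy pair, keep only the higher-scoring member.
    //    Ties go to the decoy, which is the conservative choice.
    let mut picked: HashMap<&str, PickedProtein> = HashMap::new();
    for (&name, &score) in &best {
        let is_decoy = name.starts_with(decoy_prefix);
        let base = name.strip_prefix(decoy_prefix).unwrap_or(name);
        let replace = match picked.get(base) {
            Some(cur) => score > cur.score || (score == cur.score && is_decoy),
            None => true,
        };
        if replace {
            let winner = PickedProtein { protein: name.to_string(), score, is_decoy, q_value: 1.0 };
            picked.insert(base, winner);
        }
    }

    // 3. Rank winners by score (best first), breaking ties by name.
    let mut out: Vec<PickedProtein> = picked.into_values().collect();
    out.sort_by(|a, b| b.score.total_cmp(&a.score).then_with(|| a.protein.cmp(&b.protein)));

    // 4. Cumulative FDR = decoys / targets, then a running minimum from the
    //    bottom of the list turns FDR into monotone q-values.
    let (mut targets, mut decoys) = (0usize, 0usize);
    for p in out.iter_mut() {
        if p.is_decoy { decoys += 1 } else { targets += 1 }
        p.q_value = (decoys as f64 / targets.max(1) as f64).min(1.0);
    }
    let mut running_min = f64::INFINITY;
    for p in out.iter_mut().rev() {
        running_min = running_min.min(p.q_value);
        p.q_value = running_min;
    }
    out
}
```

With the example from above, the shared PEPTIDEDK contributes to neither protA nor protC, so only two target proteins are counted.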

Protein Grouping with Picked Group FDR
This new module implements a protein grouping algorithm based on the IDPicker algorithm with extensions from the "Picked Group FDR approach." The Python implementation of CsoDIAq has been used as a template and for testing the IDPicker approach. Most functions are based on IDPicker, with generate_proteingroups() representing the "rescued subset grouping (rsG)" approach. Discarding shared peptides and picked FDR are implemented as part of the core Sage FDR routines.
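
As a rough illustration of the first grouping step only (merging proteins with identical peptide evidence), consider the sketch below. The function name `group_by_identical_evidence` and its input format are hypothetical, and the sketch does not cover the rescued subset grouping (rsG) step or the FDR computation.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Sketch: merge proteins whose sets of supporting peptides are identical.
/// Input is a list of (peptide, protein) pairs; output is one entry per
/// group, holding the sorted member proteins and their shared peptide set.
pub fn group_by_identical_evidence(
    pairs: &[(&str, &str)],
) -> Vec<(Vec<String>, BTreeSet<String>)> {
    // protein -> set of peptides supporting it
    let mut evidence: BTreeMap<&str, BTreeSet<String>> = BTreeMap::new();
    for &(peptide, protein) in pairs {
        evidence.entry(protein).or_default().insert(peptide.to_string());
    }

    // peptide set -> proteins sharing exactly that set
    let mut groups: BTreeMap<BTreeSet<String>, Vec<String>> = BTreeMap::new();
    for (protein, peptides) in evidence {
        groups.entry(peptides).or_default().push(protein.to_string());
    }

    groups
        .into_iter()
        .map(|(peptides, proteins)| (proteins, peptides))
        .collect()
}
```

Two isoforms supported by exactly the same peptides end up in a single group, while a protein with a different peptide set stays in its own group.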

In our experience, picked group FDR with IDPicker is a simple, robust, and scalable approach. Compared to standard IDPicker, it performs better under tricky boundary conditions, albeit with some computational expense. For 10 dda-PASEF mixed proteome samples, the runtime overhead in my benchmark was 45 seconds. We have optimized some IDPicker components, but there may still be potential for further improvement.

We believe this PR will be useful for Sage, extending the current established and accepted principles to protein groups.
Best regards,
@grosenberger-bruker, @vijay-gnanasambandan-bruker, @sander-willems-bruker

References

  1. Zhang, B., Chambers, M. C., & Tabb, D. L. (2007). Proteomic parsimony through bipartite graph analysis improves accuracy and transparency. Journal of Proteome Research, 6(9), 3549-3557. https://doi.org/10.1021/pr070230d
  2. The, M., Samaras, P., Kuster, B., & Wilhelm, M. (2022). Reanalysis of ProteomicsDB using an accurate, sensitive, and scalable false discovery rate estimation approach for protein groups. Molecular & Cellular Proteomics, 21(12), 100437. https://doi.org/10.1016/j.mcpro.2022.100437
  3. https://github.com/dg310012/CsoDIAq/blob/68abaa713eb719b488967cb34a876a71657827bd/idpicker.py
  4. Cranney, C. W., & Meyer, J. G. (2021). CsoDIAq software for direct infusion shotgun proteome analysis. Analytical Chemistry, 93(36), 12312-12319. https://doi.org/10.1021/acs.analchem.1c02021

…id_picker

# Conflicts:
#	crates/sage-cli/src/runner.rs
@lazear
Owner

lazear commented May 12, 2025

Hi guys,

Thanks for another valuable PR, especially one that addresses one of the big shortcomings in Sage. From a quick read-through, it looks good; I will probably just rename a couple of fields and otherwise use it as-is for now. We can tackle some performance issues here, as well as in some other places in Sage, in a future update.

grosenberger-bruker marked this pull request as draft May 23, 2025 13:11
@grosenberger-bruker
Author

Brief update: We are completely changing the IDPicker implementation, which leads to substantial improvements. The PR will be updated within the next few days and is back at the draft stage until then.

@lazear
Owner

lazear commented May 23, 2025

OK, just let me know when it's ready to review!

@grosenberger-bruker
Author

@sander-willems-bruker has now re-implemented protein grouping and inference, bringing the overhead down from 28 s to 0.3 s. A major change was a modification to the original IDPicker approach, replacing the method used to find the minimum protein set. We believe this is now ready for review.

grosenberger-bruker marked this pull request as ready for review May 27, 2025 07:56
//! ## Main Features
//! - Groups proteins when peptide evidence for proteins is identical.
//! - Infers (almost) minimal protein-group covers using bipartite graph algorithms.
//! - Supports different protein inference strategies (e.g., "All", "Slim").


@lazear This parameter is not yet exposed in the config. It could become an option for applications where users are not interested in the parsimonious solution but want to retain all alternatives while still performing protein grouping.
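
To make the difference between the two strategies concrete, here is a hedged sketch. The enum `InferenceStrategy` and the function `infer_groups` are hypothetical names, and the greedy set-cover heuristic used for "Slim" is an assumption chosen for illustration; it is not necessarily the method implemented in this PR.

```rust
use std::cmp::Reverse;
use std::collections::BTreeSet;

/// Hypothetical strategy switch mirroring the "All" / "Slim" names.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InferenceStrategy {
    /// Keep every protein group.
    All,
    /// Keep an (almost) minimal set of groups explaining all peptides.
    Slim,
}

/// Returns the indices of the retained groups, in ascending order.
/// Each group is represented by the set of peptides that support it.
pub fn infer_groups(groups: &[BTreeSet<&str>], strategy: InferenceStrategy) -> Vec<usize> {
    match strategy {
        InferenceStrategy::All => (0..groups.len()).collect(),
        InferenceStrategy::Slim => {
            // Greedy set cover: repeatedly keep the group that explains the
            // most still-unexplained peptides (ties go to the lowest index).
            let mut uncovered: BTreeSet<&str> = groups.iter().flatten().copied().collect();
            let mut kept = Vec::new();
            while !uncovered.is_empty() {
                let (best, _gain) = groups
                    .iter()
                    .enumerate()
                    .map(|(i, g)| (i, g.iter().filter(|p| uncovered.contains(*p)).count()))
                    .max_by_key(|&(i, gain)| (gain, Reverse(i)))
                    .expect("uncovered peptides imply at least one group");
                for peptide in &groups[best] {
                    uncovered.remove(peptide);
                }
                kept.push(best);
            }
            kept.sort_unstable();
            kept
        }
    }
}
```

In this sketch, a group whose peptides are all explained by other retained groups is dropped under "Slim" but kept under "All".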
