[FEATURE] Protein Grouping, Picked Group FDR & Fix of Picked Protein FDR #182

grosenberger-bruker · 2025-05-09T10:29:11Z

This PR introduces protein grouping with picked group FDR to Sage and fixes an issue in the picked protein FDR computation:

Fix of Picked Protein FDR
While Sage generally computes picked protein FDR correctly, we recently encountered an issue with shared peptides. For example, assume PEPTIDEAK belongs to protA, PEPTIDECK belongs to protC, and the shared peptide PEPTIDEDK belongs to both protA and protC. If PEPTIDEDK is confidently identified, it counts as a new "protein" protA/protC. With a canonical UniProtKB/Swiss-Prot DB, shared peptides typically constitute 5-10% and have similar properties to proteotypic peptides, so the effect on computing picked protein FDR is minor. However, the number of proteins in the Sage runtime log is artificially inflated (3 proteins instead of 2 in this example). The inflated entries also appear in the report, but I assume most users filter them out in downstream analysis and recount.
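
To make the counting effect concrete, here is a minimal Rust sketch of the example above. It assumes, purely for illustration, that each peptide's protein list is joined into one key with "/"; the function names are hypothetical and not part of Sage.

```rust
use std::collections::BTreeSet;

/// Counts distinct protein entries when every peptide's protein list is
/// joined into a single key, so a shared peptide creates its own entry.
/// (Illustrative assumption; not Sage's actual code.)
fn count_with_shared(peptides: &[(&str, Vec<&str>)]) -> usize {
    let keys: BTreeSet<String> = peptides
        .iter()
        .map(|(_, proteins)| proteins.join("/"))
        .collect();
    keys.len()
}

/// Counts distinct proteins supported by at least one proteotypic
/// (single-protein) peptide.
fn count_unique_only(peptides: &[(&str, Vec<&str>)]) -> usize {
    let keys: BTreeSet<&str> = peptides
        .iter()
        .filter(|(_, proteins)| proteins.len() == 1)
        .map(|(_, proteins)| proteins[0])
        .collect();
    keys.len()
}
```

With the three peptides from the example, the first function counts protA, protC, and protA/protC, while the second counts only protA and protC.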

Solution: As proposed in the literature, we now use only proteotypic (unique, non-shared) peptides to compute picked protein FDR. Shared peptides are still reported, but with protein FDR set to 1.0. This has a minor effect on canonical databases and provides more accurate protein counts. For isoforms, which often have few unique peptides, this approach is not applicable, so we introduce protein grouping.
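
As a rough illustration of the idea (not the code in this PR), the sketch below computes picked protein q-values from proteotypic peptides only. The `PeptideHit` struct, the `rev_` decoy prefix, and the function name are assumptions made for this example; peptides mapping to more than one protein are simply skipped here, whereas the PR reports them with protein FDR 1.0.

```rust
use std::collections::BTreeMap;

/// Assumed decoy prefix for this sketch.
const DECOY_TAG: &str = "rev_";

/// Simplified, hypothetical peptide-level record.
struct PeptideHit<'a> {
    proteins: Vec<&'a str>,
    score: f64,
}

/// Returns (protein, is_decoy, q_value) for the winner of each target/decoy pair.
fn picked_protein_q(hits: &[PeptideHit]) -> Vec<(String, bool, f64)> {
    // Best (target, decoy) score per protein pair, from proteotypic peptides only.
    let mut best: BTreeMap<&str, (f64, f64)> = BTreeMap::new();
    for hit in hits.iter().filter(|h| h.proteins.len() == 1) {
        let name = hit.proteins[0];
        let base = name.strip_prefix(DECOY_TAG).unwrap_or(name);
        let entry = best
            .entry(base)
            .or_insert((f64::NEG_INFINITY, f64::NEG_INFINITY));
        if name.starts_with(DECOY_TAG) {
            entry.1 = entry.1.max(hit.score);
        } else {
            entry.0 = entry.0.max(hit.score);
        }
    }

    // "Picked" step: keep only the better-scoring member of each pair.
    let mut winners: Vec<(String, bool, f64)> = best
        .into_iter()
        .map(|(base, (target, decoy))| {
            if target >= decoy {
                (base.to_string(), false, target)
            } else {
                (format!("{}{}", DECOY_TAG, base), true, decoy)
            }
        })
        .collect();
    winners.sort_by(|a, b| b.2.total_cmp(&a.2));

    // FDR at each rank: cumulative decoys / cumulative targets.
    let (mut targets, mut decoys) = (0usize, 0usize);
    let mut fdr = Vec::with_capacity(winners.len());
    for w in &winners {
        if w.1 { decoys += 1 } else { targets += 1 }
        fdr.push(decoys as f64 / targets.max(1) as f64);
    }

    // q-value: smallest FDR at this rank or any lower-scoring rank, capped at 1.0.
    // The score slot of each tuple is overwritten with the q-value.
    let mut running = 1.0_f64;
    for i in (0..winners.len()).rev() {
        running = running.min(fdr[i]);
        winners[i].2 = running;
    }
    winners
}
```

Because only one member of each target/decoy pair survives the picked step, a high-scoring shared peptide cannot create an extra entry or shift the decoy count.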

Protein Grouping with Picked Group FDR
This new module implements a protein grouping algorithm based on the IDPicker algorithm with extensions from the "Picked Group FDR approach." The Python implementation of CsoDIAq has been used as a template and for testing the IDPicker approach. Most functions are based on IDPicker, with generate_proteingroups() representing the "rescued subset grouping (rsG)" approach. Discarding shared peptides and picked FDR are implemented as part of the core Sage FDR routines.
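
The first step of IDPicker-style grouping, collapsing proteins whose peptide evidence is identical, can be sketched as follows. This is a minimal illustration under assumed input types and a hypothetical function name; it does not handle subset proteins or the rescued subset grouping, and it is not the module added by this PR.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Groups proteins that are supported by exactly the same set of peptides.
/// Returns (group label, number of supporting peptides) per group.
fn group_by_identical_evidence(peptides: &[(&str, Vec<&str>)]) -> Vec<(String, usize)> {
    // Invert the peptide -> proteins mapping into protein -> peptide set.
    let mut evidence: BTreeMap<&str, BTreeSet<&str>> = BTreeMap::new();
    for (peptide, proteins) in peptides {
        for protein in proteins {
            evidence.entry(*protein).or_default().insert(*peptide);
        }
    }

    // Proteins with an identical peptide set fall into the same group.
    let mut groups: BTreeMap<BTreeSet<&str>, Vec<&str>> = BTreeMap::new();
    for (protein, peptide_set) in evidence {
        groups.entry(peptide_set).or_default().push(protein);
    }

    // Ordered maps keep the output deterministic.
    groups
        .into_iter()
        .map(|(peptide_set, members)| (members.join("/"), peptide_set.len()))
        .collect()
}
```

Ordered containers are used here so that group labels and output order do not depend on hash iteration order.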

In our experience, picked group FDR with IDPicker is a simple, robust, and scalable approach. Compared to standard IDPicker, it performs better under tricky boundary conditions, albeit with some computational expense. For 10 dda-PASEF mixed proteome samples, the runtime overhead in my benchmark was 45 seconds. We have optimized some IDPicker components, but there may still be potential for further improvement.

We believe this PR will be useful for Sage, extending the current established and accepted principles to protein groups.
Best regards,
@grosenberger-bruker, @vijay-gnanasambandan-bruker, @sander-willems-bruker

References

Zhang, B., Chambers, M. C., & Tabb, D. L. (2007). Proteomic parsimony through bipartite graph analysis improves accuracy and transparency. Journal of proteome research, 6(9), 3549-3557. https://doi.org/10.1021/pr070230d
The, M., Samaras, P., Kuster, B., & Wilhelm, M. (2022). Reanalysis of ProteomicsDB using an accurate, sensitive, and scalable false discovery rate estimation approach for protein groups. Molecular & Cellular Proteomics, 21(12), 100437. https://doi.org/10.1016/j.mcpro.2022.100437
CsoDIAq IDPicker implementation (idpicker.py): https://github.com/dg310012/CsoDIAq/blob/68abaa713eb719b488967cb34a876a71657827bd/idpicker.py
Cranney, C. W., & Meyer, J. G. (2021). CsoDIAq software for direct infusion shotgun proteome analysis. Analytical Chemistry, 93(36), 12312-12319. https://doi.org/10.1021/acs.analchem.1c02021

lazear · 2025-05-12T20:49:55Z

Hi guys,

Thanks for another valuable PR, especially one that addresses one of the big shortcomings in Sage. From a quick read-through, it looks good; I will probably just rename a couple of fields and otherwise use it as-is for now. We can tackle some performance issues here, as well as in some other places in Sage, in a future update.

grosenberger-bruker · 2025-05-23T13:12:59Z

Brief update: We are completely changing the IDPicker implementation, leading to substantial improvements. The PR will be updated within the next few days and is back at the draft stage until then.

lazear · 2025-05-23T15:06:22Z

OK, just let me know when it's ready to review!

grosenberger-bruker · 2025-05-27T07:56:28Z

@sander-willems-bruker has now re-implemented protein grouping and inference, bringing the overhead down from 28s to 0.3s. A major change was a modification to the original IDPicker approach, replacing the method used to find the minimum protein set. We believe this is now ready for review.
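
The thread does not spell out the replacement method; the commit titles mention a greedy protein cover. For readers unfamiliar with the idea, below is a generic greedy set-cover sketch in Rust. The function name and input layout are assumptions for illustration only, and a greedy cover is not guaranteed to be the true minimum.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Greedy set cover: repeatedly pick the protein group that explains the
/// most still-unexplained peptides until every peptide is covered.
/// Ties are broken by group name order, which keeps the result deterministic.
fn greedy_cover(groups: &BTreeMap<&str, BTreeSet<&str>>) -> Vec<String> {
    let mut uncovered: BTreeSet<&str> = groups.values().flatten().copied().collect();
    let mut chosen = Vec::new();

    while !uncovered.is_empty() {
        // Find the group with the largest number of uncovered peptides.
        let mut best: Option<(&str, usize)> = None;
        for (name, peptides) in groups {
            let gain = peptides.iter().filter(|p| uncovered.contains(*p)).count();
            if gain > best.map_or(0, |(_, g)| g) {
                best = Some((*name, gain));
            }
        }
        let Some((name, _)) = best else { break };

        // Mark that group's peptides as explained.
        for peptide in &groups[name] {
            uncovered.remove(peptide);
        }
        chosen.push(name.to_string());
    }
    chosen
}
```

Each round scans all groups, so this simple version is quadratic in the worst case; a production implementation would likely maintain gains incrementally.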

grosenberger-bruker · 2025-05-27T10:28:23Z

crates/sage/src/protein_grouping.rs

+//! ## Main Features
+//! - Groups proteins when peptide evidence for proteins is identical.
+//! - Infers (almost) minimal protein-group covers using bipartite graph algorithms.
+//! - Supports different protein inference strategies (e.g., "All", "Slim").


@lazear This is a parameter that is not yet exposed to the config, but it could be an option for applications where users are not interested in finding the parsimonious solution, but want to retain all alternatives while still conducting protein grouping.

vijay-gnanasambandan-bruker and others added 29 commits October 10, 2024 12:10

Merge pull request lazear#16 from BrukerLSMS/main-sync

66e7530

Main sync

Merge pull request lazear#25 from BrukerLSMS/public_sage_master

513e225

Public sage master

Merge pull request lazear#28 from BrukerLSMS/public_master

2b60af9

Public sage master sync

feat : first version id picker

2d1b577

feat : draft version id picker

4964c4c

improve : idpicker_proteingroups rename update.

2a4ecdf

feat: refactor protein grouping logic

8ccb1d0

fix: correct test input file name and update assertion

4b8e256

feat: add id_proteins field to record and include protein grouping in…

d9adc7a

… output

refactor: simplify variable declarations in idpicker.rs

7ade657

feat: enhance protein grouping logic

b674411

feat: add proteingroups_q field and implement protein group picking l…

e61a4e0

…ogic

feat: add is_proteinggroups field and update protein grouping logic

166ee39

feat: update protein grouping logic to track number of protein groups

c315d39

[FIX] Modified logics & smaller tweaks

7553333

[FIX] Tie protein-level FDR to unique peptides

b2cd5cc

[FIX] Tie protein-level FDR to unique peptides

b460bab

feat: update dependencies and enhance protein grouping logic

5bfed21

Merge remote-tracking branch 'origin/feature/id_picker' into feature/…

0015c86

…id_picker # Conflicts: # crates/sage-cli/src/runner.rs

FEAT: sped up id_picker

94151ce

CHORE: removed move statement

b764d9f

chore: improve optimization.

482bc59

chore: improve optimization.

3fdb251

chore: code refactoring.

1a252ac

chore: code refactoring.

413a44f

Merge pull request lazear#46 from BrukerLSMS/review/id_picker_speed

b137e3a

FEAT: sped up id_picker

[FIX] Small refinements

974712e

[FIX] Small refinements

0fd5cbf

chore: sped up idpicker

64d96d6

sander-willems-bruker added 3 commits May 22, 2025 19:37

feat: sped up greedy_algo

2894a1e

chore: use ProteinMapping as default struct in idpicker

fdb704f

feat: removed metapeptides as they are never used

5f33ae7

grosenberger-bruker marked this pull request as draft May 23, 2025 13:11

sander-willems-bruker added 2 commits May 23, 2025 15:29

feat: updated metapeptide calculation in idpicker

3a51f22

chore: removed outdated code

9ee2a5e

sander-willems-bruker and others added 14 commits May 23, 2025 19:48

feat: updated reporting of prot groups

154c973

feat: do we need the secondary prot grouping?

67cce61

feat: simplified greedy_protein_cover impl

a866812

FIX: made protein grouping deterministic

a0d57a3

FIX: determinisitc proteoin grouping

bb1be14

feat: switched to u32 and fnv to reduce speed and RAM in protein grou…

0d223e0

…ping

feat: addes rescued protein groups

e23b13f

chore: renamed idpicker to protein grouping

97d3182

chore: docs (AI generated and manually verified)

481c482

FIX: implemented testing

5d7e1a3

Merge pull request lazear#58 from BrukerLSMS/feature/id_picker2

65ca348

enhancement/id_picker

Merge remote-tracking branch 'public/master' into feature/id_picker2

0945a70

feat: updated to latest main

a647cbd

Merge branch 'feature/id_picker2' into feature/id_picker

7ab7842

grosenberger-bruker marked this pull request as ready for review May 27, 2025 07:56

feat: added cli settings for protein_grouping

1f2c600

lazear mentioned this pull request Jun 30, 2025

Sage seems to have MBR turned on by default - how do I turn it off #189

Closed

sander-willems-bruker mentioned this pull request Aug 21, 2025

Report all proteins in the results.sage.tsv #190

Closed

gadi-armony-bruka and others added 2 commits October 27, 2025 17:09

rename parameter name for clarity

250ff89

Merge branch 'feature/id_picker2' into feature/id_picker

db3cdee
