[ENH] pdb to String loader using SEQREQ records #148

satvshr · 2025-09-15T13:47:40Z

Closes #147

This PR:

Adds a lossless pdb to amino acid sequence transformation, following the convention in the three-point list of this comment.
Ensures struct_to_aaseq has an option to return a DataFrame too.

satvshr · 2025-09-17T15:27:39Z

Renamed the _pfoa_loader.py file to _pfoa.py for consistency.

fkiraly

Does this function, at least conceptually, not do the same as pdb_to_struct followed by struct_to_aaseq? Recall that we separated the two intentionally, to keep file manipulation apart from in-memory objects, which are abstract representations.

Instead of adding this new function which would link file manipulation back into the in-memory workflow, why not add features to the existing struct_to_aaseq?

satvshr · 2025-09-22T20:12:32Z

Instead of adding this new function which would link file manipulation back into the in-memory workflow, why not add features to the existing struct_to_aaseq?

#147 answers your question. Essentially, 3D data is captured when biopython converts a file to a Structure object using ATOM records, so part of the sequence can go missing if its 3D orientation is not known. SEQRES records do not have this issue, hence the direct converter.
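To make the difference concrete, here is a minimal, standard-library-only sketch of the two routes. It is not the code in this PR and not biopython's implementation: `seqres_sequences`, `atom_sequences`, and the two line-building helpers are hypothetical names, and the ATOM-based function is only a text-level stand-in for what a coordinate-based Structure can contain. The column positions follow the fixed-width PDB record layout.

```python
from collections import defaultdict

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_sequences(pdb_lines):
    """Return {chain_id: sequence} built from SEQRES records only."""
    chains = defaultdict(list)
    for line in pdb_lines:
        if line.startswith("SEQRES"):
            chain_id = line[11]           # column 12 holds the chain identifier
            residues = line[19:].split()  # residue names start at column 20
            chains[chain_id].extend(THREE_TO_ONE.get(r, "X") for r in residues)
    return {chain: "".join(seq) for chain, seq in chains.items()}


def atom_sequences(pdb_lines):
    """Return {chain_id: sequence} for residues that have ATOM coordinates."""
    chains = defaultdict(list)
    seen = set()
    for line in pdb_lines:
        if line.startswith("ATOM"):
            res_name = line[17:20]             # columns 18-20
            chain_id = line[21]                # column 22
            res_key = (chain_id, line[22:27])  # residue number plus insertion code
            if res_key not in seen:
                seen.add(res_key)
                chains[chain_id].append(THREE_TO_ONE.get(res_name, "X"))
    return {chain: "".join(seq) for chain, seq in chains.items()}


def seqres_line(ser_num, chain_id, num_res, residues):
    """Build a SEQRES record with the fixed column layout (illustration helper)."""
    return f"SEQRES {ser_num:>3} {chain_id} {num_res:>4}  {' '.join(residues)}"


def atom_line(serial, res_name, chain_id, res_seq):
    """Build a CA-only ATOM record with dummy coordinates (illustration helper)."""
    x = y = z = 0.0
    return (
        f"ATOM  {serial:>5}  CA  {res_name} {chain_id}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}"
    )


# Made-up chain: five residues declared in SEQRES, but residues 3 and 4
# have no coordinates, so they never appear as ATOM records.
example_pdb = [
    seqres_line(1, "A", 5, ["MET", "ALA", "GLY", "SER", "LYS"]),
    atom_line(1, "MET", "A", 1),
    atom_line(2, "ALA", "A", 2),
    atom_line(3, "LYS", "A", 5),
]

print(seqres_sequences(example_pdb))
print(atom_sequences(example_pdb))
```

On this made-up input the SEQRES route yields the full chain `MAGSK`, while the ATOM route yields only `MAK`, which is the kind of loss described above.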

fkiraly · 2025-09-25T19:15:02Z

pyaptamer/utils/pdb_to_aaseq.py

+    pdb_file_path : str or os.PathLike
+        Path to a PDB file.
+    return_df : bool, optional, default=False
+        If True, return a pandas.DataFrame with columns:


let's try to be rst compatible where we can:

newlines before and after bullet point lists

double backticks around code like 'chain' (note that markdown uses single backtick, rst wants double backtick)
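For illustration, a docstring following both points could look like the sketch below. The function body is omitted on purpose, and the ``sequence`` column and the wording of the descriptions are assumptions made for the example, not taken from the diff.

```python
def pdb_to_aaseq(pdb_file_path, return_df=False):
    """Extract amino acid sequences from a PDB file (docstring illustration only).

    Parameters
    ----------
    pdb_file_path : str or os.PathLike
        Path to a PDB file.
    return_df : bool, optional, default=False
        If True, return a ``pandas.DataFrame`` with columns:

        - ``chain``: chain identifier
        - ``sequence``: amino acid sequence of the chain (assumed column name)

        If False, return a list.
    """
    # The body is intentionally left out; only the docstring layout matters here.
    raise NotImplementedError("docstring illustration only")
```

The blank lines around the bullet list let rst parse it as a list rather than as a continuation of the paragraph, and the double backticks render as inline code.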

fkiraly · 2025-09-25T19:15:58Z

pyaptamer/utils/pdb_to_aaseq.py

+    ----------
+    pdb_file_path : str or os.PathLike
+        Path to a PDB file.
+    return_df : bool, optional, default=False


let's call this return_type with two possible values "pd.df" and "list". This makes it more upwards compatible.
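A rough sketch of what that argument could look like, assuming the sequences are already available as a chain-to-sequence mapping. `format_sequences`, the column names, and the shape of the list output are hypothetical and chosen only for illustration.

```python
def format_sequences(chain_to_seq, return_type="list"):
    """Format a {chain_id: sequence} mapping according to ``return_type``.

    Hypothetical helper: ``return_type`` is either ``"list"`` or ``"pd.df"``.
    """
    if return_type == "list":
        # Assumed list shape: one sequence string per chain, in mapping order.
        return list(chain_to_seq.values())
    if return_type == "pd.df":
        # Imported lazily so that the "list" path does not require pandas.
        import pandas as pd

        return pd.DataFrame(
            {
                "chain": list(chain_to_seq.keys()),
                "sequence": list(chain_to_seq.values()),
            }
        )
    raise ValueError(
        f"return_type must be 'list' or 'pd.df', got {return_type!r}"
    )


print(format_sequences({"A": "MAGSK", "B": "GS"}, return_type="list"))
```

A string-valued argument leaves room for further output types later without changing the signature, which a boolean flag would not.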

fkiraly · 2025-09-25T19:17:27Z

Instead of adding this new function which would link file manipulation back into the in-memory workflow, why not add features to the existing struct_to_aaseq?

#147 answers your question. Essentially, 3D data is captured when biopython converts a file to a Structure object using ATOM records, so part of the sequence can go missing if its 3D orientation is not known. SEQRES records do not have this issue, hence the direct converter.

follow-up question: would it be possible to get the sequences from the Structure object instead of the pdb?

satvshr · 2025-09-25T19:51:52Z

would it be possible to get the sequences from the Structure object instead of the pdb?

I thought I mentioned this but I didn't, so apologies for that.

sometimes a part of the sequence can go missing (in the Structure object) if the 3D orientation of it is not known.

So when the Structure object is formed, it also misses this 3D information; hence the complete sequence cannot be extracted from it, as the unknown 3D orientation leads to an incomplete sequence. If it didn't, I wouldn't have made this PR.

fkiraly · 2025-09-26T07:28:48Z

Ok, that is concerning. Are you saying the Structure object is not a 1:1 in-memory representation of the full pdb?

satvshr · 2025-09-26T14:30:47Z

Are you saying the Structure object is not a 1:1 in-memory representation of the full pdb?

Just to have it on record, yes

satvshr · 2025-09-26T15:05:12Z

Made the 2 changes requested above and renamed all files to private to prevent weird import errors, as discussed in today's meeting.

fkiraly

We should rethink what the internal structure is.

SeqIO seems to be loading pdb in a parsable form - so if it is non-lossy, we should add at least a loader that returns the python object (and not a string), see #160

satvshr · 2025-10-06T14:03:27Z

Are chain ids which come along with the amino acid sequences in the pdb file important too @JaBirke @KubiczekD ?

Edit: Added chains as they seemed to be used everywhere.

satvshr · 2025-10-06T19:24:37Z

@rpgv could you please give an example of a small pdb file with ATOM records but no SEQRES records? Need it for testing.

fkiraly

tests are failing, please fix
description of the PR is empty, please fix

satvshr · 2025-10-09T06:39:20Z

This PR was in progress in the project board, not ready for review 😅

tests are failing, please fix

I am waiting on @rpgv to provide a pdb that does not have SEQRES records, only ATOM records.

rpgv · 2025-10-13T17:41:54Z

This PR was in progress in the project board, not ready for review 😅

tests are failing, please fix

I am waiting on @rpgv to provide a pdb that does not have SEQRES records, only ATOM records.

I'm sorry for the delayed response. I have been searching through my files and haven't found anything that is not a modified, modeled structure (e.g. from AlphaFold, BOLTZ), so I was probably mistakenly thinking of such structures.

fkiraly

Can you please do the renaming in another PR? And focus only on the loader.

The diff is not shown correctly, so it is hard to review this PR.

Side question, why are we adding requests to the pyproject?

satvshr · 2025-10-14T07:35:13Z

Side question, why are we adding requests to the pyproject?

For the api calls to get the UniProt id and the corresponding FASTA files.

Can you please do the renaming in another PR? And focus only on the loader.

Sure

fkiraly · 2025-10-15T06:40:56Z

Side question, why are we adding requests to the pyproject?

For the api calls to get the UniProt id and the corresponding FASTA files.

I see - from a design perspective, it feels wrong to call a web location in what looks mostly like a loader. The user does not expect that their data gets "completed" by an internet source.

Can you please remove this logic and put it into a separate utility? Optimally, separate I/O of pdb and the lookup, so it is explicit when this happens. If you do not have time to come up with a design, or do not want to, then do the following:

remove the uniprot code from this PR
put it in a separate PR
open an issue with a todo related to uniprot
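One possible shape for such a separate utility is sketched below, so that any network access is an explicit, opt-in call rather than a side effect of loading a file. All names are hypothetical, the URL pattern is an assumption about the UniProt REST endpoint, and the injectable `get_text` argument is a design choice made here so the lookup can be exercised without network access.

```python
from urllib.request import urlopen

# Assumed URL pattern for fetching one UniProt entry as FASTA.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def _http_get_text(url):
    """Default transport: fetch a URL and decode the body as UTF-8 text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_fasta_sequence(fasta_text):
    """Return the sequence of a single-record FASTA text, dropping the header."""
    lines = [line.strip() for line in fasta_text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


def fetch_uniprot_sequence(accession, get_text=_http_get_text):
    """Look up a sequence online; the pdb loader itself never calls this."""
    url = UNIPROT_FASTA_URL.format(accession=accession)
    return parse_fasta_sequence(get_text(url))
```

With this split, a loader only ever reads the local file, and a user who wants the data completed from UniProt has to call `fetch_uniprot_sequence` themselves.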

Solves #147

39999a8

fkiraly assigned satvshr Sep 16, 2025

satvshr requested a review from fkiraly September 17, 2025 15:24

Renamed file for consistency

3656d48

Made requested changes

ddfde4b

fkiraly requested changes Sep 22, 2025


fkiraly added the enhancement New feature or request label Sep 22, 2025

satvshr requested a review from fkiraly September 22, 2025 20:12

fkiraly reviewed Sep 25, 2025


fkiraly mentioned this pull request Sep 26, 2025

[API] revisiting in-memory representation of pdb/ mmcif #160

Open

Renamed files and made requested changes

d3c52b1

satvshr requested a review from fkiraly September 26, 2025 15:06

fkiraly requested changes Sep 30, 2025


chains included

b00fced

satvshr added 5 commits October 7, 2025 01:01

TODO: add pdb with only ATOM

3f8f805

Added chains

2848f2f

Merge branch 'main' into issue147

ff4c10c

Tests updated

b28abd5

Update test_struct_to_aaseq.py

277cb53

fkiraly requested changes Oct 8, 2025


Made changes as discussed

bb1299e

satvshr requested a review from fkiraly October 11, 2025 11:48

Bug fixing

7726053

fkiraly requested changes Oct 14, 2025


satvshr added 2 commits October 14, 2025 21:40

Reverted 1 commit

e330890

Reverted 1 commit

79d7cee
