Skip to content

Conversation

satvshr
Copy link
Collaborator

@satvshr satvshr commented Sep 15, 2025

Closes #147

This PR:

  1. Adds a pdb to amino acid sequence transformation, without being lossy, following the convention mentioned in the 3 pointer list of this comment.
  2. Ensures struct_to_aaseq has an option to return a df too.

@satvshr
Copy link
Collaborator Author

satvshr commented Sep 17, 2025

Renamed the _pfoa_loader.py file to _pfoa.py for consistency.

Copy link
Contributor

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this function, at least ideas-wise, not do the same as pdb_to_struct then struct_to_aaseq? Recall that we separated the two intentionally, to separate file manipulation from in-memory objects which are abstract representations.

Instead of adding this new function which would link file manipulation back into the in-memory workflow, why not add features to the existing struct_to_aaseq?

@fkiraly fkiraly added the enhancement New feature or request label Sep 22, 2025
@satvshr
Copy link
Collaborator Author

satvshr commented Sep 22, 2025

Instead of adding this new function which would link file manipulation back into the in-memory workflow, why not add features to the existing struct_to_aaseq?

#147 answers your question, essentially 3d data is being captured when biopython converts something to a Structure object using ATOM records, sometimes a part of the sequence cna go missing if the 3d orientation of it is not known. This issue is not faced with SEQRES records, hence the direct converter.

@satvshr satvshr requested a review from fkiraly September 22, 2025 20:12
pdb_file_path : str or os.PathLike
Path to a PDB file.
return_df : bool, optional, default=False
If True, return a pandas.DataFrame with columns:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's try to be rst compatible where we can:

  • newlines before and after bullet point lists
  • double backticks around code like 'chain' (note that markdown uses single backtick, rst wants double backtick)

----------
pdb_file_path : str or os.PathLike
Path to a PDB file.
return_df : bool, optional, default=False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's call this return_type with two possible values "pd.df" and "list". This makes it more upwards compatible.

@fkiraly
Copy link
Contributor

fkiraly commented Sep 25, 2025

Instead of adding this new function which would link file manipulation back into the in-memory workflow, why not add features to the existing struct_to_aaseq?

#147 answers your question, essentially 3d data is being captured when biopython converts something to a Structure object using ATOM records, sometimes a part of the sequence cna go missing if the 3d orientation of it is not known. This issue is not faced with SEQRES records, hence the direct converter.

follow-up question: would it be possible to get the sequences from the Structure object instead of the pdb?

@satvshr
Copy link
Collaborator Author

satvshr commented Sep 25, 2025

would it be possible to get the sequences from the Structure object instead of the pdb?

I thought I mentioned this but I didnt, so apologies for that.

sometimes a part of the sequence cna go missing (in the Structure object) if the 3d orientation of it is not known.

So when the Structure object is formed, it also misses this 3d information, hence the complete sequence cannot be extracted from it, as the unknown 3 orientation leads to a lack of a complete sequence. If it didnt, I wouldnt have made this PR.

@fkiraly
Copy link
Contributor

fkiraly commented Sep 26, 2025

Ok, that is concerning. Are you saying the Structure objec tis not an 1:1 in-memory representation of the full pdb?

@satvshr
Copy link
Collaborator Author

satvshr commented Sep 26, 2025

Are you saying the Structure objec tis not an 1:1 in-memory representation of the full pdb?

Just to have it on record, yes

@satvshr
Copy link
Collaborator Author

satvshr commented Sep 26, 2025

Made the 2 changes requested above and renamed all files to private to prevent weird import errors as discussed in todays meeting.

@satvshr satvshr requested a review from fkiraly September 26, 2025 15:06
Copy link
Contributor

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should rethink what the internal structure is.

SeqIO seems to be loading pdb in a parsable form - so if it is non-lossy, we should add at least a loader that returns the python object (and not a string), see #160

@satvshr
Copy link
Collaborator Author

satvshr commented Oct 6, 2025

Are chain ids which come along with the amino acid sequences in the pdb file important too @JaBirke @KubiczekD ?

Edit: Added chains as they seemed to be used everywhere.

@satvshr
Copy link
Collaborator Author

satvshr commented Oct 6, 2025

@rpgv could you please give an exmaple of a small pdb file with ATOM records but no SEQRES records? Need it for testing.

Copy link
Contributor

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • tests are failing, please fix
  • description of the PR is empty, please fix

@satvshr
Copy link
Collaborator Author

satvshr commented Oct 9, 2025

This PR was in progress in the project board, not ready for review 😅

  • tests are failing, please fix

I am waiting on @rpgv to provide a pdb that does not have SEQRES records, only ATOM records.

@satvshr satvshr requested a review from fkiraly October 11, 2025 11:48
@rpgv
Copy link
Collaborator

rpgv commented Oct 13, 2025

This PR was in progress in the project board, not ready for review 😅

  • tests are failing, please fix

I am waiting on @rpgv to provide a pdb that does not have SEQRES records, only ATOM records.

I'm sorry for the delayed response, I have been searching through my files and haven't found anything that not a modified modeled structure (i.e. from AlphaFold, BOLTZ). So I was probably mistakenly thinking about such structures.

Copy link
Contributor

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please do the renaming in another PR? And focus only on the loader.

The diff is not shown correctly, so it is hard to review this PR.

Side question, why are we adding requests to the pyproject?

@satvshr
Copy link
Collaborator Author

satvshr commented Oct 14, 2025

Side question, why are we adding requests to the pyproject?

For the api calls to get the UniProt id and the corresponding FASTA files.

Can you please do the renaming in another PR? And focus only on the loader.

Sure

@fkiraly
Copy link
Contributor

fkiraly commented Oct 15, 2025

Side question, why are we adding requests to the pyproject?

For the api calls to get the UniProt id and the corresponding FASTA files.

I see - from a design perspective, it feels wrong to call a web location in what looks mostly like a loader. The user does not expect that their data gets "completed" by an internet source.

Can you please remove this logic and put it into a separate utility? Optimally, separate I/O of pdb and the lookup, so it is explicit when this happens. If you do not have time to come up with a design, or do not want to, then do the following:

  • remove the uniprot code from this PR
  • put it in a separate PR
  • open an issue with a todo related to uniprot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[ENH] pdb to String loader using SEQREQ records

3 participants