-
Notifications
You must be signed in to change notification settings - Fork 2
[ENH] pdb to String loader using SEQREQ records #148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Renamed the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this function, at least ideas-wise, not do the same as pdb_to_struct
then struct_to_aaseq
? Recall that we separated the two intentionally, to separate file manipulation from in-memory objects which are abstract representations.
Instead of adding this new function which would link file manipulation back into the in-memory workflow, why not add features to the existing struct_to_aaseq
?
#147 answers your question, essentially 3d data is being captured when biopython converts something to a |
pdb_file_path : str or os.PathLike | ||
Path to a PDB file. | ||
return_df : bool, optional, default=False | ||
If True, return a pandas.DataFrame with columns: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's try to be rst compatible where we can:
- newlines before and after bullet point lists
- double backticks around code like
'chain'
(note that markdown uses single backtick, rst wants double backtick)
---------- | ||
pdb_file_path : str or os.PathLike | ||
Path to a PDB file. | ||
return_df : bool, optional, default=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's call this return_type
with two possible values "pd.df"
and "list"
. This makes it more upwards compatible.
follow-up question: would it be possible to get the sequences from the |
I thought I mentioned this but I didnt, so apologies for that.
So when the Structure object is formed, it also misses this 3d information, hence the complete sequence cannot be extracted from it, as the unknown 3 orientation leads to a lack of a complete sequence. If it didnt, I wouldnt have made this PR. |
Ok, that is concerning. Are you saying the |
Just to have it on record, yes |
Made the 2 changes requested above and renamed all files to private to prevent weird import errors as discussed in todays meeting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should rethink what the internal structure is.
SeqIO
seems to be loading pdb
in a parsable form - so if it is non-lossy, we should add at least a loader that returns the python object (and not a string), see #160
Are chain ids which come along with the amino acid sequences in the pdb file important too @JaBirke @KubiczekD ? Edit: Added chains as they seemed to be used everywhere. |
@rpgv could you please give an exmaple of a small pdb file with ATOM records but no SEQRES records? Need it for testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- tests are failing, please fix
- description of the PR is empty, please fix
This PR was in progress in the project board, not ready for review 😅
I am waiting on @rpgv to provide a pdb that does not have SEQRES records, only ATOM records. |
I'm sorry for the delayed response, I have been searching through my files and haven't found anything that not a modified modeled structure (i.e. from AlphaFold, BOLTZ). So I was probably mistakenly thinking about such structures. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please do the renaming in another PR? And focus only on the loader.
The diff is not shown correctly, so it is hard to review this PR.
Side question, why are we adding requests
to the pyproject
?
For the api calls to get the UniProt id and the corresponding FASTA files.
Sure |
I see - from a design perspective, it feels wrong to call a web location in what looks mostly like a loader. The user does not expect that their data gets "completed" by an internet source. Can you please remove this logic and put it into a separate utility? Optimally, separate I/O of pdb and the lookup, so it is explicit when this happens. If you do not have time to come up with a design, or do not want to, then do the following:
|
Closes #147
This PR:
struct_to_aaseq
has an option to return a df too.