[ENH] pdb to String loader using SEQRES records #148
Merged
Changes from 3 commits
19 commits
39999a8 Solves #147 (satvshr)
3656d48 Renamed file for consistency (satvshr)
ddfde4b Made requested changes (satvshr)
d3c52b1 Renamed files and made requested changes (satvshr)
b00fced chains included (satvshr)
3f8f805 TODO: add pdb with only ATOM (satvshr)
2848f2f Added chains (satvshr)
ff4c10c Merge branch 'main' into issue147 (satvshr)
b28abd5 Tests updated (satvshr)
277cb53 Update test_struct_to_aaseq.py (satvshr)
bb1299e Made changes as discussed (satvshr)
7726053 Bug fixing (satvshr)
e330890 Reverted 1 commit (satvshr)
79d7cee Reverted 1 commit (satvshr)
9577c3c Merge branch 'main' into issue147 (satvshr)
d1b59ee Merge branch 'main' into issue147 (satvshr)
46ea901 Removed UniProt (satvshr)
ddfa0b6 Update pyproject.toml (satvshr)
a85b5e0 bug fixing (satvshr)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,6 +1,6 @@ | ||
| """Loaders for different data structures.""" | ||
|
|
||
| from pyaptamer.datasets._loaders._one_gnh import load_1gnh_structure | ||
| - from pyaptamer.datasets._loaders._pfoa_loader import load_pfoa_structure | ||
| + from pyaptamer.datasets._loaders._pfoa import load_pfoa_structure | ||
|
|
||
| __all__ = ["load_pfoa_structure", "load_1gnh_structure"] |
File renamed without changes.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,43 @@ | ||
| __author__ = "satvshr" | ||
| __all__ = ["pdb_to_aaseq"] | ||
|
|
||
| import os | ||
|
|
||
| import pandas as pd | ||
| from Bio import SeqIO | ||
|
|
||
|
|
||
| def pdb_to_aaseq(pdb_file_path, return_df=False): | ||
| """ | ||
| Extract amino-acid sequences (SEQRES) from a PDB file. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| pdb_file_path : str or os.PathLike | ||
| Path to a PDB file. | ||
| return_df : bool, optional, default=False | ||
| If True, return a pandas.DataFrame with columns: | ||
|
||
| - 'chain' (if available) and | ||
| - 'sequence' (one-letter amino-acid string per chain). | ||
| If False, return a list of strings (one per chain). | ||
|
|
||
| Returns | ||
| ------- | ||
| list of str or pandas.DataFrame | ||
| List of amino-acid sequences (one-letter codes), or a DataFrame containing the | ||
| sequences, extracted from the SEQRES records in the PDB file. Each element is a | ||
| string representing the full sequence for a chain (e.g. "MKWVTFISLL..."). The | ||
| order of sequences matches the order in which SEQRES records are encountered in | ||
| the file. Returns an empty list if no SEQRES records are found. | ||
| """ | ||
| pdb_path = os.fspath(pdb_file_path) | ||
| sequences = [] | ||
|
|
||
| with open(pdb_path) as handle: | ||
| for record in SeqIO.parse(handle, "pdb-seqres"): | ||
| sequences.append(str(record.seq)) | ||
|
|
||
| if return_df: | ||
| return pd.DataFrame({"sequence": sequences}) | ||
|
|
||
| return sequences | ||
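
The docstring above mentions a per-chain result, while the diff shown here delegates all parsing to Biopython's `"pdb-seqres"` format. As a rough illustration of what reading SEQRES records involves, here is a minimal standard-library sketch. It is not Biopython's implementation and not code from this PR: the names `seqres_to_chain_sequences` and `THREE_TO_ONE` are hypothetical, the column offsets assume the fixed-width PDB SEQRES layout (chain ID in column 12, residue names from column 20), and mapping unknown residues to `"X"` is an assumption.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_to_chain_sequences(pdb_text):
    """Return {chain_id: one-letter sequence} built from SEQRES records.

    Hypothetical helper for illustration only; chains appear in the
    order in which their first SEQRES record is encountered.
    """
    residues_by_chain = {}
    for line in pdb_text.splitlines():
        if not line.startswith("SEQRES"):
            continue  # ATOM, HETATM, and all other records are ignored
        chain_id = line[11:12]        # column 12: chain identifier
        residues = line[19:].split()  # columns 20 onward: residue names
        residues_by_chain.setdefault(chain_id, []).extend(residues)
    # Assumption: residues outside the standard 20 are written as "X".
    return {
        chain: "".join(THREE_TO_ONE.get(name, "X") for name in names)
        for chain, names in residues_by_chain.items()
    }


# A tiny hand-written example with two chains and one ATOM record.
EXAMPLE_PDB = "\n".join(
    [
        "SEQRES   1 A    5  MET LYS TRP VAL THR",
        "SEQRES   1 B    3  GLY ALA SER",
        "ATOM      1  N   MET A   1",
    ]
)

print(seqres_to_chain_sequences(EXAMPLE_PDB))
```

A mapping keyed by chain ID like this could feed both a `'chain'` and a `'sequence'` column, whereas the version in this diff returns only the sequences.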
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,32 @@ | ||
| __author__ = "satvshr" | ||
|
|
||
| import os | ||
|
|
||
| from pyaptamer.utils.pdb_to_aaseq import pdb_to_aaseq | ||
|
|
||
|
|
||
| def test_pdb_to_aaseq(): | ||
| """ | ||
| Test that `pdb_to_aaseq` returns a non-empty list of non-empty sequence strings | ||
| for a PDB file, and a non-empty DataFrame when `return_df=True`. | ||
| """ | ||
| pdb_file_path = os.path.join( | ||
| os.path.dirname(__file__), "..", "..", "datasets", "data", "1gnh.pdb" | ||
| ) | ||
| sequences = pdb_to_aaseq(pdb_file_path) | ||
|
|
||
| assert isinstance(sequences, list), "pdb_to_aaseq should return a list" | ||
| assert len(sequences) > 0, "Returned list should not be empty" | ||
|
|
||
| for seq in sequences: | ||
| assert isinstance(seq, str), "Each entry should be a string" | ||
| assert len(seq) > 0, "Each sequence string should not be empty" | ||
|
|
||
| sequences = pdb_to_aaseq(pdb_file_path, return_df=True) | ||
|
|
||
| assert not sequences.empty, "Returned DataFrame should not be empty" | ||
| assert "sequence" in sequences.columns, "DataFrame should have a 'sequence' column" | ||
|
|
||
| for seq in sequences["sequence"]: | ||
| assert isinstance(seq, str), "Each entry should be a string" | ||
| assert len(seq) > 0, "Each sequence string should not be empty" |
Let's call this `return_type`, with two possible values, `"pd.df"` and `"list"`. This makes it more upward compatible.
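
One way the suggestion above might look is sketched below, purely as an assumption about its shape. `format_sequences` is a hypothetical helper name, not part of the PR; only the `return_type` parameter and its two values, `"pd.df"` and `"list"`, come from the review comment. pandas is imported lazily so the `"list"` path does not depend on it.

```python
def format_sequences(sequences, return_type="list"):
    """Return sequences in the container named by `return_type`.

    Hypothetical helper illustrating a string-valued `return_type`
    in place of the boolean `return_df` flag.
    """
    if return_type == "list":
        return list(sequences)
    if return_type == "pd.df":
        import pandas as pd  # lazy import: only needed for this branch

        return pd.DataFrame({"sequence": list(sequences)})
    # Rejecting unknown values keeps room for adding new types later.
    raise ValueError(
        f"return_type must be 'list' or 'pd.df', got {return_type!r}"
    )


print(format_sequences(["MKWVT", "GAS"], return_type="list"))
```

Compared with a boolean flag, a string-valued parameter can accept further output types later without changing the function signature, which appears to be the point of the comment.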