-
Notifications
You must be signed in to change notification settings - Fork 2
[ENH] hugging face and FASTA loaders #155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
from datasets import load_dataset | ||
|
||
|
||
def hf_to_dataset(path, keep_in_memory=True, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function feels unnecessary, it simply aliases a simple call of huggingface
load_dataset
.
Why do we need this? I would simply remove it.
Maybe you could explain?
Side remark: The try/expect structure looks problematic. Generally, we should avoid using try/except as an if/else, instead we should use if/else with the correct condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there not a condition that we can replace this in, in an if/else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there was a defined variable which hugging face used to list all their recognisable file formats I would say yes, but as of now I dont think it is possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still doubt the reasoning behind having this function.
Can you please give two examples where we currently need this, one example for each branch of the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try
block: We want to load a file from hugging face which is in one of the known file formats supported by hugging face, so something like the imbd dataset, which is a csv file (known file format).
except
block: We want to load a file from hugging face which is not one of the known file formats (say a fasta/pdb/mmcif file), like the fasta file which Jakob provided. If it was used in the above `try1 block, it would throw an error. This block will also work if any local files are supplied to the loader and would return a Dataset object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still do not get what scenario we would need the try/except in.
Does the user not always know what format the file is in that they are attempting to load? And in cases where we hardcode the dataset, do we not know?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the user not always know what format the file is in that they are attempting to load?
The function has to be used differently depending on whether or not the file is a recognisable file format. This information is something I discovered after I spent a bit of time looking into it, the function is just making life easier and saving users from looking into it themselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please give an example of a scenario where this would be useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try
block: We want to load a file from hugging face which is in one of the known file formats supported by hugging face, so something like the imbd dataset, which is a csv file (known file format).
except
block: We want to load a file from hugging face which is not one of the known file formats (say a fasta/pdb/mmcif file), like the fasta file which Jakob provided. If it was used in the above `try1 block, it would throw an error. This block will also work if any local files are supplied to the loader and would return a Dataset object.
The try
case would need a call like load_dataset("imbd")
whereas in the except
case it would need a call like load_dataset("text", data_files="https://huggingface.co/datasets/gcos/HoloRBP4_round8_trimmed")
The initial ImportError was a masked error caused by naming a file the same as the function, so Python thought I was importing the file instead of the function. After renaming, the real problem showed up, a "circular import error" from python. Pytest starts by importing utils; inside init.py it first pulls in fasta_to_aaseq, which in turn tries to import hf_to_dataset from utils; but hf_to_dataset hasn’t been imported yet (it’s the next line in init.py), so Python throws the importerror. The fix is to either reorder imports so hf_to_dataset is defined first, or better, import it directly (which is what i did) to fix the problem. |
pyaptamer/utils/_fasta_to_aaseq.py
Outdated
``fasta_path`` : str or os.PathLike | ||
Input source for FASTA sequences. Can be: | ||
|
||
- Local file path (absolute or relative) located in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this all relative to the repo?
Note that users of the package will not have the repo cloned. Please change this so it is relative to, say, root, and/or that absolute paths not subpaths of the repo can be passed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very rookie mistake on my end 😓
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left comments above.
Fixes #149
Fixes #154