Collaborator

@satvshr satvshr commented Sep 23, 2025

Fixes #149
Fixes #154

from datasets import load_dataset


def hf_to_dataset(path, keep_in_memory=True, **kwargs):

Contributor

@fkiraly fkiraly Sep 24, 2025

This function feels unnecessary: it simply aliases a plain call to Hugging Face's load_dataset.

Why do we need this? I would simply remove it.

Maybe you could explain?

Side remark: the try/except structure looks problematic. Generally, we should avoid using try/except as a substitute for if/else; instead, we should use if/else with the correct condition.

Contributor

Is there not a condition we could test, so that this can be replaced with an if/else?

Collaborator Author

If there were a defined variable that Hugging Face used to list all of its recognisable file formats, I would say yes, but as of now I don't think it is possible.

Contributor

I still doubt the reasoning behind having this function.

Can you please give two examples where we currently need this, one example for each branch of the code?

Collaborator Author

try block: We want to load a file from Hugging Face that is in one of the known file formats supported by Hugging Face, so something like the imdb dataset, which is a csv file (known file format).

except block: We want to load a file from Hugging Face that is not in one of the known file formats (say a fasta/pdb/mmcif file), like the fasta file which Jakob provided. If it were loaded via the `try` block above, it would raise an error. This block also works if local files are supplied to the loader, and returns a Dataset object.

Contributor

@fkiraly fkiraly Oct 4, 2025

I still do not get what scenario we would need the try/except in.

Does the user not always know what format the file is in that they are attempting to load? And in cases where we hardcode the dataset, do we not know?

Collaborator Author

> Does the user not always know what format the file is in that they are attempting to load?

load_dataset has to be called differently depending on whether or not the file is in a recognisable file format. I only discovered this after spending a bit of time looking into it; the function just makes life easier and saves users from having to look into it themselves.

Contributor

Can you please give an example of a scenario where this would be useful?

Collaborator Author

try block: We want to load a file from Hugging Face that is in one of the known file formats supported by Hugging Face, so something like the imdb dataset, which is a csv file (known file format).

except block: We want to load a file from Hugging Face that is not in one of the known file formats (say a fasta/pdb/mmcif file), like the fasta file which Jakob provided. If it were loaded via the `try` block above, it would raise an error. This block also works if local files are supplied to the loader, and returns a Dataset object.

The try case would need a call like load_dataset("imdb"), whereas the except case would need a call like load_dataset("text", data_files="https://huggingface.co/datasets/gcos/HoloRBP4_round8_trimmed").
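
For reference, here is a minimal sketch of how the two call shapes described above could be selected with an explicit if/else instead of try/except. It is not the code from this PR. The helper name `build_load_args`, the `raw_text` flag and the `RAW_TEXT_SUFFIXES` set are hypothetical, and the suffix list is a project-side assumption, not something published by Hugging Face.

```python
from pathlib import PurePosixPath

# Assumption: a project-maintained set of suffixes that should be read as raw
# text. This is NOT an official Hugging Face list.
RAW_TEXT_SUFFIXES = {".fasta", ".fa", ".pdb", ".cif"}


def build_load_args(path, raw_text=None):
    """Return (args, kwargs) for datasets.load_dataset, chosen by if/else.

    raw_text=True or False forces a branch; None infers it from the suffix.
    """
    path = str(path)
    if raw_text is None:
        # Explicit condition: does the suffix belong to our raw-text set?
        raw_text = PurePosixPath(path).suffix.lower() in RAW_TEXT_SUFFIXES
    if raw_text:
        # Format without a dedicated builder: use the generic "text" builder.
        return ("text",), {"data_files": path}
    # Otherwise pass the input through as a dataset name or path.
    return (path,), {}
```

The result would then be forwarded as `args, kwargs = build_load_args(path)` followed by `load_dataset(*args, **kwargs)`. Since the caller usually knows the format, the `raw_text` flag lets them state it outright when the suffix alone is not informative.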

Collaborator Author

satvshr commented Sep 26, 2025

The initial ImportError was a masked error caused by naming a file the same as the function, so Python thought I was importing the file (the module) instead of the function. After renaming, the real problem showed up: a circular import error. Pytest starts by importing utils; inside __init__.py it first pulls in fasta_to_aaseq, which in turn tries to import hf_to_dataset from utils; but hf_to_dataset hasn't been imported yet (it's the next line in __init__.py), so Python raises the ImportError. The fix is either to reorder the imports so that hf_to_dataset is imported first or, better, to import it directly from its module, which is what I did.
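
The import-order problem described above can be reproduced in isolation. The sketch below writes two throwaway packages to a temporary directory: one imports `hf_to_dataset` through the package's own `__init__.py` (circular), the other imports it directly from the defining module. The package names `demo_circular` and `demo_direct` and the module names `fasta.py` and `hf.py` are made up for the demonstration and do not match the file layout in this PR.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_package(root, name, fasta_import_line):
    """Write a tiny package whose __init__ imports fasta first, then hf."""
    pkg = Path(root) / name
    pkg.mkdir()
    (pkg / "__init__.py").write_text(
        "from .fasta import fasta_to_aaseq\n"
        "from .hf import hf_to_dataset\n"
    )
    (pkg / "hf.py").write_text("def hf_to_dataset(path):\n    return path\n")
    (pkg / "fasta.py").write_text(
        fasta_import_line + "\n"
        "def fasta_to_aaseq(path):\n"
        "    return hf_to_dataset(path)\n"
    )


def try_import(name):
    """Import the package and report the outcome as a short string."""
    try:
        importlib.import_module(name)
        return "ok"
    except ImportError as exc:
        return type(exc).__name__


with tempfile.TemporaryDirectory() as tmp:
    sys.path.insert(0, tmp)
    try:
        # Circular: fasta.py asks the half-initialised package for the name.
        build_package(tmp, "demo_circular", "from demo_circular import hf_to_dataset")
        # Direct: fasta.py imports from the module that defines the function.
        build_package(tmp, "demo_direct", "from demo_direct.hf import hf_to_dataset")
        importlib.invalidate_caches()
        circular_result = try_import("demo_circular")
        direct_result = try_import("demo_direct")
    finally:
        sys.path.remove(tmp)

print(circular_result, direct_result)
```

The direct import works regardless of the order of the lines in `__init__.py`, which is why it is the more robust of the two fixes.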

@fkiraly fkiraly changed the title from "Added hugging face and FASTA loaders" to "[ENH] hugging face and FASTA loaders" Oct 2, 2025

``fasta_path`` : str or os.PathLike
Input source for FASTA sequences. Can be:

- Local file path (absolute or relative) located in

Contributor

Is this all relative to the repo?

Note that users of the package will not have the repo cloned. Please change this so that paths are relative to, say, root, and/or so that absolute paths that are not subpaths of the repo can be passed.

Collaborator Author

Very rookie mistake on my end 😓
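
One way to address the path concern raised above is to resolve whatever the user passes without any reference to the repository location. The following is only a sketch under assumptions: `resolve_fasta_path` is a hypothetical helper, and resolving relative paths against the current working directory is one possible choice, not something decided in this review.

```python
import os
from pathlib import Path


def resolve_fasta_path(fasta_path):
    """Return an absolute path for a str or os.PathLike input.

    Absolute paths are kept as given; relative paths are resolved against the
    current working directory (an assumption), never against the repo root.
    """
    path = Path(os.fspath(fasta_path)).expanduser()
    if not path.is_absolute():
        path = Path.cwd() / path
    # resolve() normalises ".." segments; the file does not need to exist.
    return path.resolve()
```

With this, `resolve_fasta_path("data/example.fasta")` points inside the directory the user runs their code from, and an absolute path anywhere on disk is accepted unchanged apart from normalisation.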

Contributor

@fkiraly fkiraly left a comment

Left comments above.

@satvshr satvshr requested a review from fkiraly October 2, 2025 10:54

Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

[DISCUSSION] Hugging face loaders
[ENH] FASTA to String converter

2 participants