Adding a loader for Hugging Face data is turning out to be a slightly more difficult task than expected.
@NennoMP tried implementing it as part of #92 over here, but there are a few problems I find with this loader:
- We do not need an option for adding local files; this is supposed to be a loader that pulls files from the Hugging Face Hub into in-memory objects.
- This
After looking into it a bit more, Hugging Face's `load_dataset` only supports certain file formats natively (none of them being the ones we want, like PDB, FASTA, mmCIF, etc.), and my solution for that is rather simple. This is the proposed code:
```python
from datasets import load_dataset

def hf_to_ds(path, keep_in_memory=True, **kwargs):
    """
    Load any Hugging Face dataset or file into a Dataset or DatasetDict.
    Tries native loader first; falls back to 'text' if unsupported.
    """
    try:
        return load_dataset(path, keep_in_memory=keep_in_memory, **kwargs)
    except Exception:
        # Fallback to text loader
        return load_dataset(
            "text",
            data_files=path,
            keep_in_memory=keep_in_memory,
            **kwargs,
        )
```
For all possible data formats we first try the native loader, and if that does not work we fall back to reading the file as plain text, one row per line.
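To make the fallback concrete, here is a minimal sketch, assuming a tiny local `example.fasta` invented just for this demo, showing that the 'text' loader yields one row per line:

```python
from datasets import load_dataset

# hypothetical two-record FASTA file, written locally just for this demo
with open("example.fasta", "w") as f:
    f.write(">seq1\nMKV\n>seq2\nGAT\n")

ds = load_dataset("text", data_files="example.fasta", keep_in_memory=True)
print(ds["train"]["text"])  # ['>seq1', 'MKV', '>seq2', 'GAT']
```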
Now for the code in the except block: the loader takes a `split` argument, and since PDB/FASTA files do not have splits, the function returns:
```
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 652992
    })
})
```
because the text loader puts all rows into a default 'train' split. By adding:
```python
# Unwrap single-split DatasetDict into Dataset
if isinstance(ds, dict) and len(ds) == 1:
    ds = next(iter(ds.values()))
```
after the `try`/`except`, we get a `Dataset` object back from the function in every case.
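A minimal, offline illustration of the unwrap, using a `DatasetDict` constructed in memory (the toy rows are invented):

```python
from datasets import Dataset, DatasetDict

ds = DatasetDict({"train": Dataset.from_dict({"text": [">seq1", "MKV"]})})
# DatasetDict subclasses dict, so the isinstance check catches it
if isinstance(ds, dict) and len(ds) == 1:
    ds = next(iter(ds.values()))
print(ds)  # Dataset({features: ['text'], num_rows: 2})
```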
Hence the final code looks like:
```python
from datasets import load_dataset

def hf_to_ds(path, keep_in_memory=True, **kwargs):
    """
    Load any Hugging Face dataset or file into a Dataset or DatasetDict.
    Tries native loader first; falls back to 'text' if unsupported.
    """
    try:
        ds = load_dataset(path, keep_in_memory=keep_in_memory, **kwargs)
    except Exception:
        # Fallback to text loader
        ds = load_dataset(
            "text",
            data_files=path,
            keep_in_memory=keep_in_memory,
            **kwargs,
        )
    # Unwrap single-split DatasetDict into Dataset
    if isinstance(ds, dict) and len(ds) == 1:
        ds = next(iter(ds.values()))
    return ds
```
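A quick usage sketch of both paths (the `"imdb"` id is just an illustrative example of a natively supported dataset; the FASTA URL is the one used below):

```python
# native path: a hub dataset in a supported format ("imdb" is just an example)
reviews = hf_to_ds("imdb", split="train")

# fallback path: a raw FASTA file URL that load_dataset cannot parse natively
fasta = hf_to_ds("https://huggingface.co/datasets/gcos/HoloRBP4_round8_trimmed/resolve/main/HoloRBP4_round8_trimmed.fasta")
print(fasta)  # Dataset({features: ['text'], num_rows: 652992})
```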
The problem with this is that it converts all data into the Hugging Face `Dataset` format, which will be hard for researchers to work with. As an example, this is what needs to be done for FASTA files using the above function:
```python
from io import StringIO
from Bio import SeqIO

# wrapped in a helper so the trailing `return records` is valid
def fasta_records_from_hf(url):
    dataset = hf_to_ds(url)
    fasta_string = "\n".join(dataset[:]["text"])  # re-join 'text' rows into one FASTA string
    records = []
    for record in SeqIO.parse(StringIO(fasta_string), "fasta"):
        records.append({"id": record.id, "sequence": str(record.seq)})
    return records

records = fasta_records_from_hf("https://huggingface.co/datasets/gcos/HoloRBP4_round8_trimmed/resolve/main/HoloRBP4_round8_trimmed.fasta")
```