[DISCUSSION] Hugging face loaders #154

@satvshr

Description

Adding a loader for Hugging Face data is turning out to be a slightly more difficult task than expected.
@NennoMP tried implementing it as part of #92 over here, but there are a few problems I see with this loader:

  1. We do not need an option for loading local files; this is supposed to be a loader that loads files from the Hugging Face Hub into in-memory objects.
  2. This

After looking into it a bit more, Hugging Face's `load_dataset` only supports certain file formats (none of which are the ones we want, like PDB, FASTA, mmCIF, etc.). My solution for that is rather simple; this is the proposed code:

from datasets import load_dataset

def hf_to_ds(path, keep_in_memory=True, **kwargs):
    """
    Load any Hugging Face dataset or file into a Dataset or DatasetDict.
    Tries the native loader first; falls back to 'text' if unsupported.
    """
    try:
        return load_dataset(path, keep_in_memory=keep_in_memory, **kwargs)
    except Exception:
        # Fallback: treat the unsupported file as plain text
        return load_dataset(
            "text",
            data_files=path,
            keep_in_memory=keep_in_memory,
            **kwargs,
        )

For all possible data formats we first try calling `load_dataset` the "normal" way, and if that fails we force the file to be loaded as text.
Now for the code in the except block: the text loader takes a `split` argument, and since PDB/FASTA files do not have splits, the function returns:

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 652992
    })
})

since the default split name is 'train'. By adding:

    # Unwrap single-split DatasetDict into Dataset
    if isinstance(ds, dict) and len(ds) == 1:
        ds = next(iter(ds.values()))

after the load, we get a `Dataset` object back from the entire function regardless of which branch ran.
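The unwrap step can be checked in isolation. Since `DatasetDict` subclasses `dict`, a plain dict stands in for it in this minimal sketch (the helper name `unwrap_single_split` is hypothetical, not part of the PR):

```python
def unwrap_single_split(ds):
    """Return the sole split of a single-split mapping; otherwise return the input unchanged."""
    if isinstance(ds, dict) and len(ds) == 1:
        return next(iter(ds.values()))
    return ds

# A single-split mapping unwraps to its value; anything else passes through untouched.
single = {"train": ["row1", "row2"]}
multi = {"train": [], "test": []}
```

This keeps the behavior predictable: callers always get a `Dataset` when there is only one split, and a `DatasetDict` when there are several.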

Hence the final code looks like:

from datasets import load_dataset

def hf_to_ds(path, keep_in_memory=True, **kwargs):
    """
    Load any Hugging Face dataset or file into a Dataset or DatasetDict.
    Tries the native loader first; falls back to 'text' if unsupported.
    """
    try:
        ds = load_dataset(path, keep_in_memory=keep_in_memory, **kwargs)
    except Exception:
        ds = load_dataset(
            "text",
            data_files=path,
            keep_in_memory=keep_in_memory,
            **kwargs,
        )

    # Unwrap a single-split DatasetDict into a Dataset
    # (DatasetDict subclasses dict, so the isinstance check covers it)
    if isinstance(ds, dict) and len(ds) == 1:
        ds = next(iter(ds.values()))

    return ds

The problem with this is that it converts all data into the Hugging Face `Dataset` format, which will be hard for researchers to work with. As an example, this is what needs to be done for FASTA files using the above function:

from io import StringIO
from Bio import SeqIO

dataset = hf_to_ds("https://huggingface.co/datasets/gcos/HoloRBP4_round8_trimmed/resolve/main/HoloRBP4_round8_trimmed.fasta")
fasta_string = "\n".join(dataset[:]["text"])

fasta_io = StringIO(fasta_string)

records = []
for record in SeqIO.parse(fasta_io, "fasta"):
    records.append({"id": record.id, "sequence": str(record.seq)})
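The Biopython step above can be sketched without the dependency; a minimal, dependency-free parser (assuming well-formed FASTA input, and a hypothetical helper name) shows what the conversion boils down to:

```python
def parse_fasta(text):
    """Parse a FASTA string into [{'id': ..., 'sequence': ...}] records.

    Minimal sketch: takes the first whitespace-separated token after '>'
    as the record id and concatenates subsequent lines as the sequence.
    """
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # New record header; id is the token before any description
            current = {"id": line[1:].split()[0], "sequence": ""}
            records.append(current)
        elif line and current is not None:
            # Sequence lines may wrap; append them to the current record
            current["sequence"] += line
    return records
```

Either way, the takeaway is the same: the loader returns raw text, and every format (FASTA, PDB, mmCIF) needs its own post-processing step before researchers can use it.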
