Adding a loader for Hugging Face data is turning out to be a slightly more difficult task than expected.
@NennoMP tried implementing it as part of #92 over here, but there are a few problems I find with this loader:
- We do not need an option for adding local files; this is supposed to be a loader that pulls files from the Hugging Face Hub into in-memory objects.
- This
After looking into it a bit more, Hugging Face's `load_dataset` only supports certain file formats natively (none of them being the ones we want, like PDB, FASTA, mmCIF, etc.), and my solution for that is rather simple. This is the proposed code:
```python
from datasets import load_dataset

def hf_to_ds(path, keep_in_memory=True, **kwargs):
    """
    Load any Hugging Face dataset or file into a Dataset or DatasetDict.
    Tries native loader first; falls back to 'text' if unsupported.
    """
    try:
        return load_dataset(path, keep_in_memory=keep_in_memory, **kwargs)
    except Exception:
        # Fallback to text loader
        return load_dataset(
            "text",
            data_files=path,
            keep_in_memory=keep_in_memory,
            **kwargs,
        )
```
For all possible data formats we first try the native loader, and if that does not work we fall back to reading the file as plain text, one row per line.
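To make the fallback concrete, here is a minimal sketch, assuming a tiny local `example.fasta` invented just for this demo, showing that the 'text' loader yields one row per line:

```python
from datasets import load_dataset

# hypothetical two-record FASTA file, written locally just for this demo
with open("example.fasta", "w") as f:
    f.write(">seq1\nMKV\n>seq2\nGAT\n")

ds = load_dataset("text", data_files="example.fasta", keep_in_memory=True)
print(ds["train"]["text"])  # ['>seq1', 'MKV', '>seq2', 'GAT']
```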
Now for the code in the except block: the loader takes a `split` argument, and since PDB/FASTA files do not have splits, the function returns:
```
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 652992
    })
})
```
because the text loader puts all rows into a default 'train' split. By adding:
```python
# Unwrap single-split DatasetDict into Dataset
if isinstance(ds, dict) and len(ds) == 1:
    ds = next(iter(ds.values()))
```
after the `try`/`except`, we get a `Dataset` object back from the function in every case.
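A minimal, offline illustration of the unwrap, using a `DatasetDict` constructed in memory (the toy rows are invented):

```python
from datasets import Dataset, DatasetDict

ds = DatasetDict({"train": Dataset.from_dict({"text": [">seq1", "MKV"]})})
# DatasetDict subclasses dict, so the isinstance check catches it
if isinstance(ds, dict) and len(ds) == 1:
    ds = next(iter(ds.values()))
print(ds)  # Dataset({features: ['text'], num_rows: 2})
```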
Hence the final code looks like:
```python
from datasets import load_dataset

def hf_to_ds(path, keep_in_memory=True, **kwargs):
    """
    Load any Hugging Face dataset or file into a Dataset or DatasetDict.
    Tries native loader first; falls back to 'text' if unsupported.
    """
    try:
        ds = load_dataset(path, keep_in_memory=keep_in_memory, **kwargs)
    except Exception:
        # Fallback to text loader
        ds = load_dataset(
            "text",
            data_files=path,
            keep_in_memory=keep_in_memory,
            **kwargs,
        )
    # Unwrap single-split DatasetDict into Dataset
    if isinstance(ds, dict) and len(ds) == 1:
        ds = next(iter(ds.values()))
    return ds
```
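A quick usage sketch of both paths (the `"imdb"` id is just an illustrative example of a natively supported dataset; the FASTA URL is the one used below):

```python
# native path: a hub dataset in a supported format ("imdb" is just an example)
reviews = hf_to_ds("imdb", split="train")

# fallback path: a raw FASTA file URL that load_dataset cannot parse natively
fasta = hf_to_ds("https://huggingface.co/datasets/gcos/HoloRBP4_round8_trimmed/resolve/main/HoloRBP4_round8_trimmed.fasta")
print(fasta)  # Dataset({features: ['text'], num_rows: 652992})
```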
The problem with this is that it converts all data into the Hugging Face `Dataset` format, which will be hard for researchers to work with. As an example, this is what needs to be done for FASTA files using the above function:
```python
from io import StringIO
from Bio import SeqIO

# wrapped in a helper so the trailing `return records` is valid
def fasta_records_from_hf(url):
    dataset = hf_to_ds(url)
    fasta_string = "\n".join(dataset[:]["text"])  # re-join 'text' rows into one FASTA string
    records = []
    for record in SeqIO.parse(StringIO(fasta_string), "fasta"):
        records.append({"id": record.id, "sequence": str(record.seq)})
    return records

records = fasta_records_from_hf("https://huggingface.co/datasets/gcos/HoloRBP4_round8_trimmed/resolve/main/HoloRBP4_round8_trimmed.fasta")
```