Torchvision dataset mirrors #7637

@dhruvbird

🚀 The feature

Would it be possible for pytorch/torchvision to mirror all the datasets on its own domain/hosts, instead of downloading each one from the original researchers' web page/URL?

Motivation, pitch

More often than not I run into problems when downloading them. For example:

  1. Too many downloads
  2. Bandwidth limit exceeded for the day
  3. Some other outage, such as the one reported in "Stanford cars download url is broken - HTTP 404" (#7545)

Also, when running a Kaggle notebook, the dataset is re-downloaded on every run, since there's no way to cache the downloaded dataset.

Mirroring the datasets would make the problems above (and more) go away.
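As a rough illustration of what mirroring buys on the client side, here is a minimal sketch of trying a list of URLs in order and falling back when one fails. The function name `download_with_mirrors` and its `fetch` parameter are hypothetical and are not part of torchvision; the sketch only assumes that a dataset is reachable at more than one URL.

```python
import urllib.error
import urllib.request


def download_with_mirrors(urls, fetch=None):
    """Return the bytes from the first URL that succeeds, trying them in order.

    `fetch` is a hypothetical hook (url -> bytes) so the network call can be
    swapped out; by default it uses urllib from the standard library.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()

    errors = []
    for url in urls:
        try:
            return fetch(url)
        except (urllib.error.URLError, OSError) as exc:
            # Remember why this host failed (404, quota exceeded, outage)
            # and move on to the next mirror.
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all mirrors failed:\n" + "\n".join(errors))
```

With a mirror listed first and the original researcher's URL second (or the other way around), a 404 or bandwidth-limit error on one host would no longer stop the download.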

More often than not, people work around these issues by using an existing dataset that someone has uploaded to Kaggle and defining their own Dataset class to read from it. Alternatively, people may use "hacks" to make torchvision read an existing Kaggle dataset whose directory name doesn't match the one torchvision expects. See https://www.kaggle.com/code/dhruv4930/starter-for-oxford-iiit-pet-using-torchvision for an example.

Code copied below.

# Oxford IIIT Pets Segmentation dataset loaded via torchvision.
import torchvision

# Symlink the existing Kaggle dataset to the directory name torchvision expects.
!rm -f '/kaggle/working/oxford-iiit-pet'
!ln -s '/kaggle/input/oxfordiiitpetfromxijiatao/Oxford-IIT-Pet' '/kaggle/working/oxford-iiit-pet'

# download=False: read from the symlinked directory instead of downloading.
oxford_pets_path = '/kaggle/working'
pets_train_orig = torchvision.datasets.OxfordIIITPet(root=oxford_pets_path, split="trainval", target_types="segmentation", download=False)
pets_test_orig = torchvision.datasets.OxfordIIITPet(root=oxford_pets_path, split="test", target_types="segmentation", download=False)
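The same symlink trick can be written in plain Python, which may be handy outside a notebook where the `!` shell lines are not available. This is only a sketch: the helper name `link_dataset` is hypothetical, and it assumes the platform allows creating symlinks.

```python
import os
from pathlib import Path


def link_dataset(source_dir, root, expected_name):
    """Symlink source_dir to root/expected_name, replacing a stale link."""
    source = Path(source_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"dataset directory not found: {source}")

    link = Path(root) / expected_name
    # Remove only a previous symlink; never delete a real directory.
    if link.is_symlink():
        link.unlink()
    elif link.exists():
        raise FileExistsError(f"{link} exists and is not a symlink")

    link.parent.mkdir(parents=True, exist_ok=True)
    os.symlink(source, link, target_is_directory=True)
    return link
```

For the notebook above, the call would be `link_dataset('/kaggle/input/oxfordiiitpetfromxijiatao/Oxford-IIT-Pet', '/kaggle/working', 'oxford-iiit-pet')`, followed by the same `OxfordIIITPet(..., download=False)` constructors.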

Alternatives

Since I'm personally interested in solving my local problem for Kaggle notebooks, a viable alternative would be to create a Kaggle dataset for every torchvision dataset, so that I can simply include it when working in Kaggle. Using a Kaggle dataset is also more reliable in Kaggle notebooks.

However, this is a myopic view of the problem and provides a localized solution to a localized problem. I'm pretty sure that others outside the narrow scope of a Kaggle notebook have experienced this issue, and the solution suggested above, mirroring the datasets, would be more holistic because it works regardless of environment.

I'm open to other solutions that work across environments.

Additional context

Thanks for working on torchvision - it's saved me a lot of time on mundane and vision-specific tasks!
