Description
🚀 The feature
Is it possible for pytorch/torchvision to mirror all the datasets on their own domain/hosts instead of downloading from the original researcher's web page/URL?
Motivation, pitch
More often than not I run into problems when downloading them. For example:
- Too many downloads
- Bandwidth limit exceeded for the day
- Other outages, such as the one reported in "Stanford cars download url is broken - HTTP 404" (#7545)
Also, in a Kaggle notebook the dataset is re-downloaded on every run, since there's no way to cache the downloaded files.
Mirroring the datasets would make the problems above (and more) go away.
More often than not people work around these issues by using some existing dataset that people have uploaded to Kaggle and defining their own Dataset class to read from that dataset. Alternatively, people may use some "hacks" to make torchvision use an existing Kaggle dataset that isn't in the directory format (name) that torchvision expects. See https://www.kaggle.com/code/dhruv4930/starter-for-oxford-iiit-pet-using-torchvision for an example.
Code copied below.
# Oxford-IIIT Pet segmentation dataset loaded via torchvision (Kaggle notebook cell).
import torchvision

# Symlink the existing Kaggle dataset to the directory name torchvision expects.
!rm -f '/kaggle/working/oxford-iiit-pet'
!ln -s '/kaggle/input/oxfordiiitpetfromxijiatao/Oxford-IIT-Pet' '/kaggle/working/oxford-iiit-pet'
# root is the parent of the symlink; download=False reuses the linked files.
oxford_pets_path = '/kaggle/working'
pets_train_orig = torchvision.datasets.OxfordIIITPet(root=oxford_pets_path, split="trainval", target_types="segmentation", download=False)
pets_test_orig = torchvision.datasets.OxfordIIITPet(root=oxford_pets_path, split="test", target_types="segmentation", download=False)
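For anyone who wants the same trick outside of shell magics, here is a minimal sketch of it as a plain Python helper. The name `link_dataset` and its parameters are hypothetical (nothing like it exists in torchvision); it only assumes that the dataset class looks for a specific directory name under `root`, as in the cell above.

```python
import os
from pathlib import Path


def link_dataset(existing_dir, root, expected_name):
    """Symlink existing_dir to root/expected_name and return the link path.

    Hypothetical helper, not part of torchvision: it only automates the
    rm/ln -s workaround shown above.
    """
    source = Path(existing_dir).resolve()
    if not source.is_dir():
        raise FileNotFoundError(f"dataset directory not found: {source}")

    root_path = Path(root)
    root_path.mkdir(parents=True, exist_ok=True)
    link = root_path / expected_name

    # Replace a stale link from an earlier run, but never delete real data.
    if link.is_symlink():
        link.unlink()
    elif link.exists():
        raise FileExistsError(f"{link} exists and is not a symlink")

    os.symlink(source, link, target_is_directory=True)
    return link
```

With the paths from my notebook this would be `link_dataset('/kaggle/input/oxfordiiitpetfromxijiatao/Oxford-IIT-Pet', '/kaggle/working', 'oxford-iiit-pet')`, followed by the same `OxfordIIITPet(root='/kaggle/working', ..., download=False)` calls. It is still a hack, since it depends on the uploaded copy having the internal layout torchvision expects.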
Alternatives
Since I'm personally interested in solving my local problem for Kaggle notebooks, a viable alternative would be to create a Kaggle dataset for every torchvision dataset, so that I can simply attach it when working in Kaggle. Using a Kaggle dataset is also more reliable inside Kaggle notebooks.
However, this is a myopic view of the problem and provides a localized solution to a localized problem. I'm pretty sure that others outside the narrow scope of a Kaggle notebook have run into this issue, and the previously suggested solution of mirroring the datasets would be more holistic because it works for everyone.
I'm open to other solutions that work across environments.
Additional context
Thanks for working on torchvision - it's saved me a lot of time on mundane and vision-specific tasks!