For a particular project, I have to use a PyTorch DataLoader for a series of .nifti files (specifically the OASIS-BRAINS2 dataset). However, the scans live in a directory tree stored on Google Drive and accessed from Google Colab, with the following format:
Patient-001
---Scan-1
---Scan-2
---Scan-3
---Scan-4
Patient-002
---Scan-1
---Scan-2
---Scan-3
---Scan-4
etc. There is a separate .csv file which contains each patient's status, which I wish to train a neural network on. I already have the network set up and ready to train, as well as preprocessing and transforms for the data. What would be the best way to load all of this information into the network for training? All of the datasets I have used previously had existing dataloaders.
Thank you for your time and consideration.
CodePudding user response:
One simple solution is to create a flat directory structure, e.g. two directories, data and labels, where data files and label files correspond in a one-to-one mapping. This will allow you to use an off-the-shelf PyTorch dataset/dataloader class.
To achieve this one-to-one flat directory structure, you can either:
- Copy files from each subdirectory into a single new flat data directory, and similarly for the labels. This will work and be simple to do if your data size is small.
- If your data is large enough that you don't want to copy it, you can create symbolic links in the flat directory structure that point to the original file locations (mklink on Windows, ln -s on Linux, or os.symlink from Python). A sketch of the second option follows this list.
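Here is a minimal sketch of the symlink approach in Python, assuming the Patient-XXX/Scan-N layout from your question. The source path under /content/drive and the flat target directory data_flat are hypothetical placeholders; substitute your own paths. Note that the links are created on Colab's local filesystem, not inside Drive itself.

    import os
    from pathlib import Path

    src_root = Path("/content/drive/MyDrive/OASIS")  # assumed Drive mount point
    dst_root = Path("/content/data_flat")            # hypothetical flat directory
    dst_root.mkdir(parents=True, exist_ok=True)

    for nii_path in src_root.glob("Patient-*/Scan-*/*.nii*"):
        # Encode patient and scan in the link name so files stay unique,
        # e.g. Patient-001_Scan-2_<original name>.nii.gz
        link_name = f"{nii_path.parts[-3]}_{nii_path.parts[-2]}_{nii_path.name}"
        link = dst_root / link_name
        if not link.exists():
            os.symlink(nii_path, link)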
CodePudding user response:
If creating a new structure for your data is not viable, you can always extend the torch.utils.data.Dataset class. This allows you to manage the loading procedure directly. You can follow the custom-dataset tutorial in the official PyTorch documentation.
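A minimal sketch of such a Dataset, assuming the directory layout from your question, nibabel for reading the .nifti volumes, and a CSV with patient_id and status columns (both column names are illustrative; adjust them to match your file). Every scan inherits its patient's label:

    from pathlib import Path

    import nibabel as nib
    import pandas as pd
    import torch
    from torch.utils.data import Dataset, DataLoader

    class NiftiScanDataset(Dataset):
        def __init__(self, root_dir, csv_path, transform=None):
            self.root = Path(root_dir)
            # Assumed CSV columns: patient_id (e.g. "Patient-001") and status
            labels = pd.read_csv(csv_path)
            self.label_map = dict(zip(labels["patient_id"], labels["status"]))
            # Index every scan file; parts[-3] is the Patient-XXX directory
            self.samples = [
                (p, self.label_map[p.parts[-3]])
                for p in sorted(self.root.glob("Patient-*/Scan-*/*.nii*"))
                if p.parts[-3] in self.label_map
            ]
            self.transform = transform

        def __len__(self):
            return len(self.samples)

        def __getitem__(self, idx):
            path, label = self.samples[idx]
            volume = nib.load(str(path)).get_fdata()      # numpy float array
            volume = torch.from_numpy(volume).float()
            if self.transform is not None:
                volume = self.transform(volume)
            return volume, label

    # Usage: wrap it in a standard DataLoader for training
    dataset = NiftiScanDataset("/content/drive/MyDrive/OASIS", "labels.csv")
    loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=2)

Since all loading logic sits in __getitem__, your existing preprocessing and transforms plug in through the transform argument, and the DataLoader handles batching and shuffling as usual.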