Description of the problem
I am encountering strange behavior while training a neural network with PyTorch DataLoaders built from a custom dataset. The DataLoaders are configured with num_workers=4 and pin_memory=False.
Most of the time, the training finishes with no problems. Sometimes, it stops at a random moment with the following errors:
- OSError: [Errno 9] Bad file descriptor
- EOFError
It looks like the error occurs during socket creation while fetching DataLoader elements. The error disappears when I set the number of workers to 0, but I need multiprocessing to speed up my training. What could be the source of the error? Thank you!
Versions of Python and libraries
Python 3.9.12, PyTorch 1.11.0 (cu102)
EDIT: The error occurs only on clusters.
Output of error file
The error file interleaves a tqdm progress line and two tracebacks: one from the multiprocessing resource_sharer thread (ending in OSError) and one from the main training loop (ending in EOFError).

Epoch 17: 52%|█████▏ | 253/486 [01:00<00:55, 4.18it/s, loss=1.73]

Traceback (most recent call last):
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/resource_sharer.py", line 145, in _serve
    send(conn, destination_pid)
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/reduction.py", line 183, in send_handle
    with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/socket.py", line 545, in fromfd
    return socket(family, type, proto, nfd)
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/socket.py", line 232, in __init__
    _socket.socket.__init__(self, family, type, proto, fileno)
OSError: [Errno 9] Bad file descriptor

Traceback (most recent call last):
  File "/my_directory/bench/run_experiments.py", line 251, in <module>
    main(args)
  File "/my_directory/bench/run_experiments.py", line 183, in main
    run_experiments(args, save_path)
  File "/my_directory/bench/run_experiments.py", line 70, in run_experiments
    ) = run_algorithm(algorithm_params[j], mp[j], ss, dataset)
  File "/my_directory/bench/algorithms.py", line 38, in run_algorithm
    data = es(mp, search_space, dataset, **ps)
  File "/my_directory/bench/algorithms.py", line 151, in es
    data = ss.generate_random_dataset(mp,
  File "/my_directory/bench/architectures.py", line 241, in generate_random_dataset
    arch_dict = self.query_arch(
  File "/my_directory/bench/architectures.py", line 71, in query_arch
    train_losses, val_losses, model = meta_net.get_val_loss(
  File "/my_directory/bench/meta_neural_net.py", line 50, in get_val_loss
    return self.training(
  File "/my_directory/bench/meta_neural_net.py", line 155, in training
    train_loss = self.train_step(model, device, train_loader, epoch)
  File "/my_directory/bench/meta_neural_net.py", line 179, in train_step
    for batch_idx, mini_batch in enumerate(pbar):
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/my_directory/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/my_directory/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
    idx, data = self._get_data()
  File "/my_directory/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
    success, data = self._try_get_data()
  File "/my_directory/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/my_directory/.local/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 295, in rebuild_storage_fd
    fd = df.detach()
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/reduction.py", line 159, in recvfds
    raise EOFError
EOFError
EDIT: How the data is accessed
from PIL import Image
from torch.utils.data import DataLoader

# extract of the dataset code
class Dataset():
    def __init__(self, image_files, mask_files):
        self.image_files = image_files
        self.mask_files = mask_files

    def __len__(self):
        # needed by the random sampler used when shuffle=True
        return len(self.image_files)

    def __getitem__(self, idx):
        # images and masks are read lazily from disk inside each worker process
        img = Image.open(self.image_files[idx]).convert('RGB')
        mask = Image.open(self.mask_files[idx]).convert('L')
        return img, mask
# extract of the train loader code
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=4,
    num_workers=4,
    pin_memory=False,
    shuffle=True,
    drop_last=True,
    persistent_workers=False,
)
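For completeness, here is a minimal, self-contained variant of the same pipeline that can be run on its own; the file lists and the to_tensor conversion are placeholders added for illustration (the real code applies its own transforms before batching):

from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision.transforms.functional import to_tensor

class SegDataset(Dataset):
    # Same idea as the extract above, but returns tensors so that the
    # default collate function can stack them into mini-batches.
    def __init__(self, image_files, mask_files):
        self.image_files = image_files
        self.mask_files = mask_files

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        img = to_tensor(Image.open(self.image_files[idx]).convert('RGB'))
        mask = to_tensor(Image.open(self.mask_files[idx]).convert('L'))
        return img, mask

if __name__ == "__main__":
    # Hypothetical file lists, only for illustration.
    image_files = [f"data/images/{i:04d}.png" for i in range(100)]
    mask_files = [f"data/masks/{i:04d}.png" for i in range(100)]

    loader = DataLoader(
        SegDataset(image_files, mask_files),
        batch_size=4,
        num_workers=4,
        shuffle=True,
        drop_last=True,
    )

    # Iterating is enough to exercise the worker-to-main communication
    # (file descriptor passing) where the OSError / EOFError shows up.
    for imgs, masks in loader:
        pass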
CodePudding user response:
I have finally found a solution. Adding this configuration to the dataset script works:
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
By default, the sharing strategy is set to 'file_descriptor'.
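A minimal sketch of where this call goes, assuming it runs in the main process before any DataLoader is created (train_dataset stands for the dataset defined in the question):

import torch.multiprocessing
from torch.utils.data import DataLoader

# Strategies available on this platform and the one currently in use.
print(torch.multiprocessing.get_all_sharing_strategies())  # {'file_descriptor', 'file_system'} on Linux
print(torch.multiprocessing.get_sharing_strategy())        # 'file_descriptor' by default

# With 'file_system', tensors are shared through named files in shared memory
# instead of passing file descriptors between processes over sockets.
torch.multiprocessing.set_sharing_strategy('file_system')

# Any DataLoader created after this point uses the new strategy when its
# workers send batches back to the main process.
train_loader = DataLoader(train_dataset, batch_size=4, num_workers=4,
                          shuffle=True, drop_last=True)

Note that the PyTorch multiprocessing documentation warns that the 'file_system' strategy can leak shared-memory files if processes die abruptly, which is why a torch_shm_manager daemon is spawned to clean them up.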
I have tried some of the solutions explained in:
- this issue (increasing shared memory, increasing the maximum number of open file descriptors, calling torch.cuda.empty_cache() at the end of each epoch, ...); a sketch of the file-descriptor limit change is shown after this list
- and this other issue, which turned out to solve the problem
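For reference, the file-descriptor limit change from the first issue can also be applied from inside the training script (equivalent to ulimit -n); this is only a sketch of that mitigation, which was not enough in my case:

import resource

# Current soft and hard limits on the number of open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Raise the soft limit as far as the hard limit of the node allows.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print(f"RLIMIT_NOFILE soft limit raised from {soft} to {hard}")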
As suggested by @AlexMeredith, the error may be linked to the distributed filesystem (Lustre) that some clusters use. The error may also come from distributed shared memory.