Description of the problem
I am encountering strange behavior while training a neural network with PyTorch DataLoaders built from a custom dataset. The DataLoaders are configured with num_workers=4 and pin_memory=False.
Most of the time, the training finishes with no problems. Sometimes, it stops at a random moment with the following errors:
- OSError: [Errno 9] Bad file descriptor
- EOFError
It looks like the error occurs during socket creation while fetching DataLoader elements. The error disappears when I set the number of workers to 0, but I need multiprocessing to speed up my training. What could be the source of the error? Thank you!
Versions of Python and libraries
Python 3.9.12, PyTorch 1.11.0 (cu102)
EDIT: The error occurs only on clusters.
Output of error file
The error file interleaves a tqdm progress line and two tracebacks: one from the multiprocessing resource_sharer thread (ending in OSError) and one from the main training loop (ending in EOFError).

Epoch 17: 52%|█████▏ | 253/486 [01:00<00:55, 4.18it/s, loss=1.73]

Traceback (most recent call last):
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/resource_sharer.py", line 145, in _serve
    send(conn, destination_pid)
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/reduction.py", line 183, in send_handle
    with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/socket.py", line 545, in fromfd
    return socket(family, type, proto, nfd)
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/socket.py", line 232, in __init__
    _socket.socket.__init__(self, family, type, proto, fileno)
OSError: [Errno 9] Bad file descriptor

Traceback (most recent call last):
  File "/my_directory/bench/run_experiments.py", line 251, in <module>
    main(args)
  File "/my_directory/bench/run_experiments.py", line 183, in main
    run_experiments(args, save_path)
  File "/my_directory/bench/run_experiments.py", line 70, in run_experiments
    ) = run_algorithm(algorithm_params[j], mp[j], ss, dataset)
  File "/my_directory/bench/algorithms.py", line 38, in run_algorithm
    data = es(mp, search_space, dataset, **ps)
  File "/my_directory/bench/algorithms.py", line 151, in es
    data = ss.generate_random_dataset(mp,
  File "/my_directory/bench/architectures.py", line 241, in generate_random_dataset
    arch_dict = self.query_arch(
  File "/my_directory/bench/architectures.py", line 71, in query_arch
    train_losses, val_losses, model = meta_net.get_val_loss(
  File "/my_directory/bench/meta_neural_net.py", line 50, in get_val_loss
    return self.training(
  File "/my_directory/bench/meta_neural_net.py", line 155, in training
    train_loss = self.train_step(model, device, train_loader, epoch)
  File "/my_directory/bench/meta_neural_net.py", line 179, in train_step
    for batch_idx, mini_batch in enumerate(pbar):
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/my_directory/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/my_directory/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
    idx, data = self._get_data()
  File "/my_directory/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
    success, data = self._try_get_data()
  File "/my_directory/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/my_directory/.local/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 295, in rebuild_storage_fd
    fd = df.detach()
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/reduction.py", line 159, in recvfds
    raise EOFError
EOFError
EDIT: How the data is accessed
from PIL import Image
from torch.utils.data import DataLoader

# extract of the dataset code
class Dataset():
    def __init__(self, image_files, mask_files):
        self.image_files = image_files
        self.mask_files = mask_files

    def __len__(self):
        # needed by the random sampler used when shuffle=True
        return len(self.image_files)

    def __getitem__(self, idx):
        # images and masks are read lazily from disk inside each worker process
        img = Image.open(self.image_files[idx]).convert('RGB')
        mask = Image.open(self.mask_files[idx]).convert('L')
        return img, mask
# extract of the train loader code
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=4,
    num_workers=4,
    pin_memory=False,
    shuffle=True,
    drop_last=True,
    persistent_workers=False,
)
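For completeness, here is a minimal, self-contained variant of the same pipeline that can be run on its own; the file lists and the to_tensor conversion are placeholders added for illustration (the real code applies its own transforms before batching):

from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision.transforms.functional import to_tensor

class SegDataset(Dataset):
    # Same idea as the extract above, but returns tensors so that the
    # default collate function can stack them into mini-batches.
    def __init__(self, image_files, mask_files):
        self.image_files = image_files
        self.mask_files = mask_files

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        img = to_tensor(Image.open(self.image_files[idx]).convert('RGB'))
        mask = to_tensor(Image.open(self.mask_files[idx]).convert('L'))
        return img, mask

if __name__ == "__main__":
    # Hypothetical file lists, only for illustration.
    image_files = [f"data/images/{i:04d}.png" for i in range(100)]
    mask_files = [f"data/masks/{i:04d}.png" for i in range(100)]

    loader = DataLoader(
        SegDataset(image_files, mask_files),
        batch_size=4,
        num_workers=4,
        shuffle=True,
        drop_last=True,
    )

    # Iterating is enough to exercise the worker-to-main communication
    # (file descriptor passing) where the OSError / EOFError shows up.
    for imgs, masks in loader:
        pass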
CodePudding user response:
I have finally found a solution. Adding this configuration to the dataset script works:
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
By default, the sharing strategy is set to 'file_descriptor'.
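A minimal sketch of where this call goes, assuming it runs in the main process before any DataLoader is created (train_dataset stands for the dataset defined in the question):

import torch.multiprocessing
from torch.utils.data import DataLoader

# Strategies available on this platform and the one currently in use.
print(torch.multiprocessing.get_all_sharing_strategies())  # {'file_descriptor', 'file_system'} on Linux
print(torch.multiprocessing.get_sharing_strategy())        # 'file_descriptor' by default

# With 'file_system', tensors are shared through named files in shared memory
# instead of passing file descriptors between processes over sockets.
torch.multiprocessing.set_sharing_strategy('file_system')

# Any DataLoader created after this point uses the new strategy when its
# workers send batches back to the main process.
train_loader = DataLoader(train_dataset, batch_size=4, num_workers=4,
                          shuffle=True, drop_last=True)

Note that the PyTorch multiprocessing documentation warns that the 'file_system' strategy can leak shared-memory files if processes die abruptly, which is why a torch_shm_manager daemon is spawned to clean them up.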
I have tried some of the solutions explained in:
- this issue (increasing shared memory, increasing the maximum number of open file descriptors, calling torch.cuda.empty_cache() at the end of each epoch, ...); a sketch of the file-descriptor limit change is shown after this list
- and this other issue, which turned out to solve the problem
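For reference, the file-descriptor limit change from the first issue can also be applied from inside the training script (equivalent to ulimit -n); this is only a sketch of that mitigation, which was not enough in my case:

import resource

# Current soft and hard limits on the number of open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Raise the soft limit as far as the hard limit of the node allows.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print(f"RLIMIT_NOFILE soft limit raised from {soft} to {hard}")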
As suggested by @AlexMeredith, the error may be linked to the distributed filesystem (Lustre) that some clusters use. The error may also come from distributed shared memory.