train.py throws multiprocessing errors #15

Open
agilebean opened this issue Jan 2, 2021 · 1 comment

@agilebean

Executing train.py in Google Colab throws two kinds of multiprocessing errors.
It would be helpful to know whether these errors invalidate the results or are merely informative.

Command executed:

!python train.py /content/sync/data \
--dataset miniimagenet \
--num-ways 5 \
--num-shots 1 \
--step-size 0.1 \
--batch-size 4 \
--num-batches 16 \
--num-epochs 50 \
--num-workers 8 \
--output-folder /content/sync/output \
--use-cuda \
--verbose

ERROR 1:

DEBUG:root:Creating folder `/content/sync/output/2021-01-02_113055`
INFO:root:Saving configuration file in `/content/sync/output/2021-01-02_113055/config.json`
Epoch 1 : 100% 16/16 [00:02<00:00,  5.71it/s, accuracy=0.2315, loss=5.4706]
Epoch 2 : 100% 16/16 [00:02<00:00,  5.67it/s, accuracy=0.2563, loss=3.1795]
Epoch 3 : 100% 16/16 [00:02<00:00,  5.60it/s, accuracy=0.2383, loss=2.8871]
Epoch 4 : 100% 16/16 [00:02<00:00,  5.72it/s, accuracy=0.2448, loss=2.7525]
Training: 100% 16/16 [00:03<00:00,  5.94it/s, accuracy=0.2433, loss=2.1583]Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 240, in _feed
    send_bytes(obj)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

ERROR 2:

Epoch 26: 100% 8/8 [00:01<00:00,  5.01it/s, accuracy=0.2888, loss=1.7093]
Epoch 27: 100% 8/8 [00:01<00:00,  5.07it/s, accuracy=0.2454, loss=1.7756]
Training: 100% 8/8 [00:01<00:00,  5.28it/s, accuracy=0.3267, loss=1.6709]Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 240, in _feed
    send_bytes(obj)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)

@tristandeleu (Owner)

Torchmeta uses PyTorch's DataLoader under the hood for data loading, so this multiprocessing error must come from PyTorch's DataLoader.
Unfortunately, I don't know what could cause this issue, but it might be due to Google Colab and how it handles multiprocessing. It might also be related to using the synced folder, as in #14.
One way to prevent this error would be to use --num-workers 0, but that will slow down data loading.
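
For illustration, here is a minimal sketch of where num_workers enters Torchmeta's data-loading path, based on Torchmeta's documented miniimagenet helper and BatchMetaDataLoader; the exact arguments used inside train.py may differ:

from torchmeta.datasets.helpers import miniimagenet
from torchmeta.utils.data import BatchMetaDataLoader

# Build the 5-way 1-shot miniImageNet meta-dataset from the synced folder.
dataset = miniimagenet("/content/sync/data", ways=5, shots=1,
                       meta_train=True, download=True)

# BatchMetaDataLoader wraps PyTorch's DataLoader, so num_workers is passed
# through unchanged; num_workers=0 keeps data loading in the main process
# and avoids the multiprocessing pipes where the BrokenPipeError is raised.
dataloader = BatchMetaDataLoader(dataset, batch_size=4, shuffle=True,
                                 num_workers=0)

In the command above, this corresponds to passing --num-workers 0 instead of --num-workers 8.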
