Training failed using yolov6l on 1 GPU: Assertion `target_val >= zero && target_val <= one` failed. Data is verified but training still fails
#1038
Labels
question
Before Asking
I have read the README carefully.
I want to train my custom dataset, and I have read the tutorials for training your custom data carefully and organized my dataset correctly. (FYI: We recommend applying the config files of xx_finetune.py.)
I have pulled the latest code of the main branch to run again and the problem still exists.
Search before asking
Question
I am training yolov6l, cloned from this repo, on my custom dataset on a single GPU for 350 or 500 epochs with different batch sizes. Each time, training fails with the error shown in the log. When I resume, training continues for a while and then stops again, and each resume runs for fewer epochs, until epoch 105, where it cannot continue at all.
I verified my data and labels, and the labels are normalized. The GPU I am using works fine with YOLOv5 and YOLOv8, so I don't know what the problem is here.
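A check along the following lines can confirm that labels are normalized. This is a minimal sketch, assuming YOLO-format text labels with one `class x_center y_center width height` entry per line. The function name `find_bad_label_lines` is hypothetical and is not part of YOLOv6. A clean result only rules out malformed labels, not other causes of the assertion.

```python
import math


def find_bad_label_lines(text):
    """Return (line_number, reason) pairs for lines that are not valid
    normalized YOLO labels of the form: class x_center y_center width height.

    Hypothetical helper for illustration; not part of YOLOv6.
    """
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 5:
            problems.append((lineno, "expected 5 fields"))
            continue
        try:
            cls = float(fields[0])
            coords = [float(v) for v in fields[1:]]
        except ValueError:
            problems.append((lineno, "non-numeric value"))
            continue
        if not cls.is_integer() or cls < 0:
            problems.append((lineno, "class id is not a non-negative integer"))
        elif any(not math.isfinite(v) or v < 0.0 or v > 1.0 for v in coords):
            # float("nan") parses successfully, so NaN/inf are caught here
            problems.append((lineno, "coordinate outside [0, 1]"))
        elif coords[2] <= 0.0 or coords[3] <= 0.0:
            problems.append((lineno, "zero width or height"))
    return problems


sample = (
    "0 0.5 0.5 0.2 0.3\n"   # valid
    "1 0.5 1.2 0.1 0.1\n"   # y_center > 1
    "2 0.4 0.4 0.0 0.1\n"   # zero width
    "0 nan 0.5 0.1 0.1\n"   # NaN coordinate
)
for lineno, reason in find_bad_label_lines(sample):
    print(f"line {lineno}: {reason}")
```

To scan a dataset, the same function can be applied to the contents of each label file, reporting the file name alongside each flagged line.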
I have another question: is early stopping possible in YOLOv6, for example with something like the patience parameter in YOLOv5?
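To make the early-stopping question concrete, the behavior meant here is: stop training once the validation metric has not improved for a given number of epochs. Below is a minimal, framework-independent sketch of that logic. The class `EarlyStopper` and its parameters are hypothetical illustrations and are not an existing YOLOv6 API.

```python
class EarlyStopper:
    """Signal a stop when the monitored metric (higher is better, e.g. mAP)
    has not improved for `patience` consecutive epochs.

    Hypothetical helper for illustration; not part of YOLOv6.
    """

    def __init__(self, patience=30, min_delta=0.0):
        self.patience = patience      # epochs to wait after the last improvement
        self.min_delta = min_delta    # minimum increase that counts as improvement
        self.best = float("-inf")
        self.best_epoch = 0

    def should_stop(self, epoch, metric):
        if metric > self.best + self.min_delta:
            self.best = metric
            self.best_epoch = epoch
        return epoch - self.best_epoch >= self.patience


# Example with made-up metric values, only to show the control flow.
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, metric in enumerate([0.10, 0.20, 0.30, 0.30, 0.29, 0.30, 0.31]):
    if stopper.should_stop(epoch, metric):
        stopped_at = epoch
        break
print("stopped at epoch:", stopped_at)
```

In a training loop, `should_stop` would be called once per evaluation with the current validation metric, and the loop would exit when it returns `True`.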
Thank you so much for your help!
When I launch training:
When I resume training:
img record infomation path is:../dataset/images/.train_cache.json
Train: Final numbers of valid images: 10000/ labels: 10000.
0.6s for dataset initialization.
img record infomation path is:../dataset/images/.validation_cache.json
Convert to COCO format
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1036/1036 [00:00<00:00, 5118.47it/s]
Convert to COCO format finished. Resutls saved in ../dataset/annotations/instances_validation.json
Val: Final numbers of valid images: 1036/ labels: 1036.
0.5s for dataset initialization.
Training start...
105/349 0.006246 0.1285 0.2691 0.3886: 24%|██▍ | 96/400 [00:37<01:45, 2.88it/s
../aten/src/ATen/native/cuda/Loss.cu:95: operator(): block: [12844,0,0], thread: [32,0,0] Assertion `target_val >= zero && target_val <= one` failed.
105/349 0.006246 0.1285 0.2691 0.3886: 24%|██▍ | 96/400 [00:37<01:58, 2.57it/s
ERROR in training steps.
ERROR in training loop or eval/save model.
Traceback (most recent call last):
File "/partage////app/YOLOv6/yolov6/core/engine.py", line 121, in train
self.train_one_epoch(self.epoch)
File "/partage/*****///app/YOLOv6/yolov6/core/engine.py", line 135, in train_one_epoch
self.train_in_steps(epoch_num, self.step)
File "/partage////app/YOLOv6/yolov6/core/engine.py", line 169, in train_in_steps
total_loss, loss_items = self.compute_loss(preds, targets, epoch_num, step_num,
File "/partage////app/YOLOv6/yolov6/models/losses/loss.py", line 163, in __call__
loss_cls = self.varifocal_loss(pred_scores, target_scores, one_hot_label)
File "/home//.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home//.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "///*****//app/YOLOv6/yolov6/models/losses/loss.py", line 209, in forward
loss = (F.binary_cross_entropy(pred_score.float(), gt_score.float(), reduction='none') * weight).sum()
File "/home/*/.local/lib/python3.10/site-packages/torch/nn/functional.py", line 3127, in binary_cross_entropy
return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/partage///***/app/YOLOv6/tools/train.py", line 143, in
Additional
No response