
YOLOv5 AWS Inferentia Inplace compatibility updates #2953

Merged (10 commits) on Apr 30, 2021

Conversation

@jluntamazon (Contributor) commented Apr 27, 2021

This addresses issues with compiling for AWS Neuron (#2643, aws-neuron/aws-neuron-sdk#253) by allowing users to disable slice assignment operations. There is an existing workaround that allows part of the model to compile to Neuron, but this change allows the entire model to be compiled in the upcoming Neuron SDK release. This should provide better performance and a more seamless user experience when using Neuron.

Code Changes:

This adds an inplace flag to the Model and Detect layers of the model, since these are the only internal modules that use in-place assignment. By default the inplace flag is True, which means existing behavior is unchanged.

The flag can now be toggled either by passing it to attempt_load or as a top-level configuration in the cfg YAML.
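For illustration, a minimal usage sketch (the inplace keyword on attempt_load and the top-level YAML key follow this PR's description; everything else is standard YOLOv5 usage):

from models.experimental import attempt_load

# Load weights with in-place slice assignment disabled (needed for Neuron tracing)
model = attempt_load('yolov5s.pt', map_location='cpu', inplace=False)

# Alternatively, when building from a cfg YAML, set a top-level key:
#   inplace: false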

Potential Improvements:

  • I did not expose this flag to any of the scripts like detect.py but I could if that would be useful.
  • I did not add unit tests to ensure that the layers functioned identically. If there's a good place to add unit tests, let me know.
  • Since I check for the Detect/Model objects in the attempt_load function, I scope the import to avoid a circular dependency (a short sketch of this follows below). I think ideally attempt_load should be moved out of experimental.py, but this could potentially break existing workflows.
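
As a rough illustration of the scoped import and flag propagation described above (a simplified sketch, not the exact merged code; the real attempt_load also handles ensembles):

import torch

def attempt_load(weights, map_location=None, inplace=True):
    # Import inside the function body to avoid a circular dependency,
    # since models/yolo.py in turn imports from models/experimental.py
    from models.yolo import Detect, Model

    model = torch.load(weights, map_location=map_location)['model'].float().eval()
    for m in model.modules():
        if isinstance(m, (Detect, Model)):
            m.inplace = inplace  # propagate the flag to the layers that use slice assignment
    return model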

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Enhanced compatibility and configurability of YOLOv5 model operations.

📊 Key Changes

  • Added inplace parameter to control whether operations modify tensors in-place.
  • Adjusted the Detect class and model-loading function to accommodate the new inplace argument.
  • Improved the forward pass to be compatible with AWS Inferentia processors (a sketch of the out-of-place decode path follows this list).
  • Refactored augmentation code into forward_augment and _descale_pred methods for clarity.
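
A simplified sketch of the two decode paths inside Detect.forward() (a fragment, with variable names following the YOLOv5 code; not the exact merged diff):

y = x[i].sigmoid()
if self.inplace:
    # original path: in-place slice assignment
    y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy
    y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
else:
    # out-of-place path for AWS Inferentia: build new tensors and concatenate
    xy = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy
    wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
    y = torch.cat((xy, wh, y[..., 4:]), -1)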

🎯 Purpose & Impact

  • 🛠 Provides flexibility in modifying tensors, allowing compatibility with platforms like AWS Inferentia that do not support in-place operations.
  • 🧠 Makes the codebase more modular and easier to maintain by cleanly separating augmentation logic.
  • 🚀 Potential impact includes more efficient deployment on diverse platforms, including cloud inference acceleration services, without sacrificing performance.

@github-actions (bot) left a comment:

👋 Hello @jluntamazon, thank you for submitting a 🚀 PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

  • ✅ Verify your PR is up-to-date with origin/master. If your PR is behind origin/master an automatic GitHub actions rebase may be attempted by including the /rebase command in a comment body, or by running the following code, replacing 'feature' with the name of your local branch:
git remote add upstream https://github.com/ultralytics/yolov5.git
git fetch upstream
git checkout feature  # <----- replace 'feature' with local branch name
git rebase upstream/master
git push -u origin -f
  • ✅ Verify all Continuous Integration (CI) checks are passing.
  • ✅ Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." -Bruce Lee

@glenn-jocher (Member) commented Apr 27, 2021

@jluntamazon thanks for the PR!

There seems to be a recent change in GitHub Actions that is preventing automatic tests from running on each new commit for first-time contributors, but you can effectively run the same suite of tests here (exit code zero passes). We don't currently have tests that compare training/inference output values, but I'll review this a little further myself to verify.

rm -rf runs  # remove runs/
for m in yolov5s; do  # models
  python train.py --weights $m.pt --epochs 3 --img 320 --device 0  # train pretrained
  python train.py --weights '' --cfg $m.yaml --epochs 3 --img 320 --device 0  # train scratch
  for d in 0 cpu; do  # devices
    python detect.py --weights $m.pt --device $d  # detect official
    python detect.py --weights runs/train/exp/weights/best.pt --device $d  # detect custom
    python test.py --weights $m.pt --device $d # test official
    python test.py --weights runs/train/exp/weights/best.pt --device $d # test custom
  done
  python hubconf.py  # hub
  python models/yolo.py --cfg $m.yaml  # inspect
  python models/export.py --weights $m.pt --img 640 --batch 1  # export
done

@glenn-jocher (Member) commented Apr 28, 2021

@jluntamazon everything seems OK at first glance. I had an idea: would it be possible to clone the input on inplace=False rather than introduce separate routing for the input? Here's an example; I haven't tried it yet, so I'm not sure whether it works or whether it profiles slower or faster than the current concatenation ops. On the plus side, it may be faster, and it reduces the codebase by eliminating the separate handling of the two cases.

import torch.nn as nn

class Module(nn.Module):
    def __init__(self, inplace=True):
        super().__init__()
        self.inplace = inplace

    def forward(self, x):
        if not self.inplace:
            x = x.clone()  # copy so the in-place op below doesn't modify the caller's tensor
        x *= 2  # common code regardless of inplace
        return x

EDIT: @jluntamazon just realized I can't test this myself since the problem we're trying to patch is on your side. Do you think you could try the above .clone() solution on your side to see if it addresses the issue? If it works let me know, then the next step is to profile both implementations and merge the faster one if there is a difference, or the simpler one if they are comparable. Thanks!
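
For what it's worth, a quick way to compare the two toy paths above on CPU (a sketch assuming the Module class from the previous snippet; torch.utils.benchmark has been available since PyTorch 1.6):

import torch
import torch.utils.benchmark as benchmark

x = torch.rand(1, 255, 40, 40)  # arbitrary example shape
for inplace in (True, False):
    m = Module(inplace=inplace)
    t = benchmark.Timer(stmt='m(x)', globals={'m': m, 'x': x}).timeit(1000)
    print(f'inplace={inplace}: {t.mean * 1e6:.1f} us per call')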

@glenn-jocher (Member) commented:

@jluntamazon ok, I've got 3 profiling results here on CPU (can't profile GPU unfortunately). The proposed PR is about 2.5x slower than master in the modified code region starting at yolo.py L56; the proposed .clone() solution is about 2.3x slower. The profiling command is:

python test.py

master: 162 ms
PR: 400 ms
.clone(): 366 ms

(profiler screenshots from 2021-04-28 omitted)

@jluntamazon (Contributor, Author) commented Apr 28, 2021

Thanks for the quick feedback!

> would it be possible to clone the input on inplace=False rather than introduce separate routing for the input?

The key issue is that the Neuron compilation process currently doesn't support in-place assignment, so unfortunately a clone does not solve the issue.

To give some additional background: on Neuron we compile the graph into an optimized format that is distinct from the original torch operations/graph. We trace the model and then send that graph to our optimizing compiler. What we get in the end is a one-operation torch graph (unless we have unsupported operators) where the fused computation is performed completely on chip. This means we cannot properly compare on a per-op basis, due to compiler optimizations and the fact that torch operators are no longer run as-is.
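
As a rough sketch of that compile flow (a sketch only; torch_neuron API per the AWS Neuron SDK, and the exact arguments or any model wrapping needed may differ):

import torch
import torch_neuron  # registers the Neuron compiler backend
from models.experimental import attempt_load

# Load the model with in-place ops disabled so the whole graph can compile to Neuron
model = attempt_load('yolov5s.pt', map_location='cpu', inplace=False)
model.eval()

example = torch.zeros(1, 3, 640, 640)
neuron_model = torch.neuron.trace(model, example_inputs=[example])
neuron_model.save('model.pth')  # later loaded with torch.jit.load(), as in the profiling snippet below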

> If it works let me know, then the next step is to profile both implementations and merge the faster one if there is a difference, or the simpler one if they are comparable

For the above tests, I'm assuming you were using yolov5s? I could provide a profile of the compiled Neuron version to give a better idea of how it performs.

@jluntamazon (Contributor, Author) commented:

Here are initial performance results using the concatenation method:

Model Variant    Image Size    Time (ms)
yolov5s          480x480       17.931
yolov5s          384x640       19.611
yolov5s          640x480       24.135
yolov5s          640x640       34.050

These numbers were collected using the torch profiler:

import torch
import torch_neuron
import torch.autograd.profiler as profiler

# Load the compiled model
model = torch.jit.load('model.pth')

# Create a sample image
sample = torch.rand([1, 3, 480, 480])

# Warmup the model
for _ in range(8):
    model(sample)

# Profile and display results
with profiler.profile(record_shapes=True) as prof:
    for _ in range(1000):
        model(sample)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

An example output with an image of 480x480:

---------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                 Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------  ------------  ------------  ------------  ------------  ------------  ------------  
              forward         0.08%      14.675ms       100.00%       17.931s      17.931ms          1000  
    neuron::forward_4        99.85%       17.904s        99.92%       17.917s      17.917ms          1000  
          aten::empty         0.04%       7.614ms         0.04%       7.614ms       1.903us          4000  
           aten::set_         0.03%       5.024ms         0.03%       5.024ms       1.256us          4000  
---------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 17.931s

@glenn-jocher (Member) commented:

@jluntamazon thanks, I understand now!

/rebase

@glenn-jocher (Member) commented:

@jluntamazon ok, I've gone through and smoothed out the PR a bit, hopefully without modifying the core functionality. If you can verify that my updates didn't break anything, I'm happy to merge on my side, assuming the CI checks pass.

@jluntamazon (Contributor, Author) commented:

From inspection it all looks good, but I'll check out the updates and confirm later today.

@jluntamazon (Contributor, Author) commented:

Ran some tests: it performs the same and produces the expected results. I'm also looking into further performance optimizations as next steps, but with this change those improvements should be on our end.

@glenn-jocher glenn-jocher changed the title Added flag to enable/disable inplace operations YOLOv5 AWS Inferentia Inplace compatibility updates Apr 30, 2021
@glenn-jocher glenn-jocher added the enhancement New feature or request label Apr 30, 2021
@glenn-jocher (Member) commented:

@jluntamazon PR is merged. Thank you for your contributions!

@glenn-jocher (Member) commented May 3, 2021

@jluntamazon could you take a look at new PR #2982, which wants to modify your out-of-place Detect() code here, please?

The change would be applied at yolo.py L60 and adds a .view() call. The .view() operation does NOT change the shape of wh; it is needed because ONNX export requires an explicit view shape, which the expression otherwise seems to lack on export. It changes:

wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh

to:

wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i].view(1, self.na, 1, 1, 2)  # wh
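
For context, a minimal ONNX export sketch of the kind affected (paths and opset assumed for illustration; the repository's models/export.py is the real export script):

import torch
from models.experimental import attempt_load

model = attempt_load('yolov5s.pt', map_location='cpu', inplace=False)
model.eval()
img = torch.zeros(1, 3, 640, 640)

# Without an explicit .view() on anchor_grid, the exported graph may lack a fixed shape here
torch.onnx.export(model, img, 'yolov5s.onnx', opset_version=12,
                  input_names=['images'], output_names=['output'])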

Thanks!

KMint1819 pushed a commit to KMint1819/yolov5 that referenced this pull request May 12, 2021
* Added flag to enable/disable all inplace and assignment operations

* Removed shape print statements

* Scope Detect/Model import to avoid circular dependency

* PEP8

* create _descale_pred()

* replace lost space

* replace list with tuple

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
danny-schwartz7 pushed a commit to danny-schwartz7/yolov5 that referenced this pull request May 22, 2021
Lechtr pushed a commit to Lechtr/yolov5 that referenced this pull request Jul 20, 2021 (cherry picked from commit 41f5cc5)
BjarneKuehl pushed a commit to fhkiel-mlaip/yolov5 that referenced this pull request Aug 26, 2022