
Do GPU and CPU block each other? #1537

Open
BraginIvan opened this issue Mar 28, 2022 · 6 comments
Labels
documentation Improvements or additions to documentation

Comments

@BraginIvan

BraginIvan commented Mar 28, 2022

📚 Documentation

I did not find any documentation about how GPU and CPU work is split across processes/threads.
During GPU inference, can CPU pre-/post-processing of other requests continue asynchronously?

If yes, how is data sent between processes (queue/file/socket)? Maybe you can provide a link to the code.
If no, do you have plans to support it?

@msaroufim
Member

When you author a Python handler file, which we call the backend, it's spawned as a process by the Java part of the codebase, which we call the frontend. The frontend and backend communicate via sockets.

I believe your question is about whether we pipeline preprocessing while an inference is in flight. I'm not sure we do, but maybe @lxning @HamidShojanazeri or @maaquib know.

The way we scale is by increasing the number of workers in config.properties. Each worker is a separate process running the same handler code, so it's embarrassingly parallel: one worker can be doing preprocessing while another is doing inference.

If you're looking for source code to browse, you can learn more here: https://github.com/pytorch/serve/blob/master/docs/internals.md
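For example, a config.properties along these lines (the addresses and worker count are illustrative, not a recommendation) gives you several worker processes per model, each running the same handler independently:

    # config.properties (illustrative values)
    inference_address=http://0.0.0.0:8080
    management_address=http://0.0.0.0:8081
    # each worker is a separate Python process running the same handler
    default_workers_per_model=4

The worker count can also be adjusted per model at runtime through the management API.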

msaroufim added the documentation label Mar 28, 2022
@alar0330

alar0330 commented Apr 6, 2022

@msaroufim Imagine that you have a video payload and you want to run inference on each frame. The decoding can be performed on the CPU, and as soon as each frame (or batch of frames) is decoded, we feed it to the GPU for a forward pass. This way we overlap CPU preprocessing with GPU model execution, which can substantially reduce latency and allows processing of videos of arbitrary length.

How (and if) can we do it with TorchServe today?

@BraginIvan
Author

@alar0330 Based on my experience, the GPU worker does not block the CPU workers. I did not find the code to prove it, but I ran several tests to convince myself.
Note that the GPU worker still needs its own CPU process and keeps one core busy; if you have several cores, the other cores can continue CPU-bound preprocessing tasks simultaneously.

@BraginIvan
Author

I guess your question is a bit different. If you want to process video, you have to decode it on the client side and use TorchServe only for images.
If you send the whole video, then you will need to override the handle method, and the CPU and GPU will work synchronously. But I'm not sure.
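For illustration only, a client-side sketch along these lines decodes the video locally and posts each frame to TorchServe's inference endpoint (the model name my_model and the default port 8080 are assumptions):

# Hypothetical client: decode the video on the client side, send frames to TorchServe.
# Assumes a model registered as "my_model" and the default inference port 8080.
import cv2
import requests

def infer_video(path, url="http://localhost:8080/predictions/my_model"):
    cap = cv2.VideoCapture(path)              # CPU-side decoding on the client
    results = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        _, buf = cv2.imencode(".jpg", frame)  # re-encode the frame as JPEG bytes
        resp = requests.post(url, data=buf.tobytes())
        results.append(resp.json())           # assumes the handler returns JSON
    cap.release()
    return results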

@msaroufim
Member

I'm sorry for the delay @alar0330, but it sounds like you're asking for pipelined execution when doing heavyweight preprocessing. As of today I don't believe we support this, but we could do something like it once I finish #1546.

@HamidShojanazeri
Collaborator

HamidShojanazeri commented May 18, 2022


@alar0330 If I'm not mistaken, you are sending the whole video as one request? If it's not streaming, then I think it should be doable in a custom handler. Does something like this help?

# Sketch of a custom handler; load_model, decode, process, and metadata are placeholders.
class CustomHandler:

    def initialize(self, context):
        self.model = load_model(context)

    def frame_process(self, video, i):
        # CPU-side decoding/transform of a single frame
        processed_frame = process(video, i)
        return processed_frame

    def preprocess(self, request):
        video = decode(request)
        return video

    def inference(self, video):
        inferences = []
        number_of_frames = metadata(video)
        for i in range(number_of_frames):  # or we could make a buffer here
            frame = self.frame_process(video, i)  # or spawn multiple processes to process the video frames; not sure if there is any perf hit here
            output = self.model(frame)
            inferences.append(output)
        return inferences
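As a follow-up sketch (plain Python threading, not a TorchServe API; decode_frame and model are placeholders), the buffer mentioned in the comment above could be a bounded queue fed by a background thread, so CPU decoding and the GPU forward pass overlap:

# Hypothetical pipelining inside a custom handler's inference():
# a background thread decodes frames on the CPU while the main thread
# runs the GPU forward pass, so the two stages overlap.
import queue
import threading

def pipelined_inference(video, model, decode_frame, num_frames, buffer_size=8):
    frames = queue.Queue(maxsize=buffer_size)   # bounded buffer between CPU and GPU stages
    SENTINEL = object()

    def producer():
        for i in range(num_frames):
            frames.put(decode_frame(video, i))  # CPU-bound decoding
        frames.put(SENTINEL)                    # signal end of stream

    threading.Thread(target=producer, daemon=True).start()

    outputs = []
    while True:
        frame = frames.get()
        if frame is SENTINEL:
            break
        outputs.append(model(frame))            # GPU-bound forward pass
    return outputs

Whether the decoding thread actually runs in parallel depends on the decoder releasing the GIL (native OpenCV/FFmpeg calls generally do); otherwise a process pool would be needed.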
