Launcher core pinning #1401
Conversation
@@ -65,6 +66,17 @@ Some useful `cpu_launcher_args` to note are:

Please refer to [here](https://github.com/intel/intel-extension-for-pytorch/blob/master/docs/tutorials/performance_tuning/launch_script.md) for a full list of tunable configurations of launcher.

Some notable launcher configurations are:
So we should mention that installation from source is required. If we don't have to wait too long it may make sense to wait for an official IPEX release before we merge this PR. Thoughts?
I agree on waiting until the next IPEX release prior to merging this PR.
Hi all, Intel® Extension for PyTorch* 1.11.0-cpu was released a few days ago. I think we can resume this PR. It would be good to merge this PR prior to, or soon after, our upcoming blog. Thanks. cc. @msaroufim @lxning
1. `--ninstances`: Number of instances for multi-instance inference/training.
could use a few words to explain the difference between a worker and instance
I thought they were interchangeable. @ashokei is there a difference?
"worker" is a torchserve terminology, we should clarify explicitly how worker and launcher instance are related.
I have removed this section to avoid confusing readers. It's handled internally by launcher. Please see my comment below as well.
When running multi-worker inference with TorchServe, launcher's `--ninstances`, `--instance_idx`, and `--ncore_per_instance` boost the performance of TorchServe.
Launcher equally divides the number of cores by the number of workers so that each worker is pinned to its assigned cores. Doing so avoids core overlap between workers, which generally improves performance.
For example, assume running four workers (`ninstances` 4) on 16 cores (i.e., `ncore_per_instance` 4). Launcher will bind worker 0 (`instance_idx` 0) to cores 0-3, worker 1 (`instance_idx` 1) to cores 4-7, worker 2 (`instance_idx` 2) to cores 8-11, and worker 3 (`instance_idx` 3) to cores 12-15.
Note that core pinning is taken care of internally by launcher, so users don't have to manually set `ninstances`, `instance_idx`, or `ncore_per_instance` in `config.properties`.
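The even split described above can be sketched with a small helper (a hypothetical illustration, not launcher's actual code; the real pinning logic lives in IPEX's launch script):

```python
def assign_cores(num_cores, num_workers):
    """Evenly divide cores among workers, mirroring launcher's pinning scheme.

    Entry i of the returned list is the list of cores pinned to worker i
    (instance_idx i). Assumes num_cores is divisible by num_workers.
    """
    ncore_per_instance = num_cores // num_workers
    return [
        list(range(i * ncore_per_instance, (i + 1) * ncore_per_instance))
        for i in range(num_workers)
    ]

# 4 workers on 16 cores: worker 0 gets cores 0-3, worker 1 gets 4-7, etc.
print(assign_cores(16, 4))
```

Since the ranges are disjoint by construction, no two workers ever share a core.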
This is an important point. If this is the case is it still worth mentioning the 3 parameters in the doc? Perhaps best to link to IPEX docs and just say something along the lines that CPU launcher will take care of core pinning
Done. I have removed the 3 parameters that are handled internally to avoid confusing readers.
Would be helpful to test out if this works in a docker image as well
Did we still want to test out in a docker image? Please let me know. Thanks. cc @msaroufim @lxning
Yes, because we need to make sure that the current docker image has access to the same environment variables.
launcher internally restarts the workers to re-distribute cores that were pinned to killed workers to the remaining.
@min-jean-cho I was wondering if this means serving will be interrupted for the time this re-distribution is taking place?
```
ipex_enable=true
cpu_launcher_enable=true
cpu_launcher_args=--socket_id 0 --ninstance 1 --enable_jemalloc
```
I thought we didn't need to set this in config.properties, or is this for reproducibility reasons?
It was for reproducibility, but indeed `--ninstance 1` can be omitted. However, `--socket_id 0 --enable_jemalloc` should be set as they are not set by default.
Have removed `--ninstance 1`.
![pdt_perf](https://github.com/min-jean-cho/frameworks.ai.pytorch.ipex-cpu-1/assets/93151422/a158ba6c-a151-4115-befb-39acb7545936)
The chart above shows the performance improvement of TorchServe with IPEX and launcher on ResNet50 and BERT-base-uncased. TorchServe's official [apache-bench benchmark](https://github.com/pytorch/serve/tree/master/benchmarks#benchmarking-with-apache-bench) on Amazon EC2 m6i.24xlarge was used to collect the results. Add the following lines in ```config.properties``` to reproduce the results. Notice that launcher is configured so that a single instance uses all physical cores on a single socket, avoiding cross-socket communication and core overlap.
I was wondering if we could add more details as to why m6i and c6i should be the recommended choice. What is it about those machines that makes them run PyTorch faster? I'd even mention this at the very top since the benefits are bigger than any software changes.
I had used m6i for benchmarking purposes for this data, not for a performance comparison of m6i vs c6i. I'm not sure about the recommended choice from a performance perspective.
@ashokei could you please comment on this as well?
Right, we were not necessarily benchmarking c6 vs m6; rather, we were using "24xlarge", which has the most cores. We could add a general comment about using multi-core instances and how ipex optimizations automatically scale and leverage full instance resources.
Thanks for the clarification. Makes sense.
For example, run the following command to reproduce the latency performance of ResNet50 with IPEX int8 data type and batch size 1.
```
python benchmark-ab.py --url 'file:///model_store/rn50_ipex_int8.mar' --concurrency 1
```
would need to add more detail on how to get this mar file? Or put a link to something like an S3 bucket or gdrive
I guess I didn't make it clear that the `rn50_ipex_int8.mar` is created from the steps shown here: https://github.com/intel/intel-extension-for-pytorch/blob/master/docs/tutorials/performance_tuning/torchserve.md#2-creating-a-model-archive
```
if (isRestartWorkers) {
    logger.warn("restarting {} thread(s)", restartNumWorkers);
    addThreads(threads, model, restartNumWorkers, future);
```
Could you explain a bit more about the overall logic of this PR? It seems to do (my very rough understanding):
- Check if CPU launcher available and if so 0 out number of workers
- Then add threads back to model set to the old min worker
- Finally when a new process is launched make sure it includes the important CLI arguments like n-instance and instance-idx
Workers are restarted only when cpu_launcher is enabled (this does not affect TorchServe when cpu_launcher is not enabled). Restarting workers is useful/needed when the user scales the number of workers up/down during execution ( https://pytorch.org/serve/management_api.html#scale-workers ). By restarting workers, launcher 1) re-allocates the cores that were pinned to killed workers in case of scale down; 2) avoids core overlap in case of scale up. It kills all existing workers and restarts the scaled up/down number of workers. By doing so, launcher is configured properly when the user scales the number of workers during execution. I will add comments to the file to explain the logic.
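The effect of the restart on core assignment can be illustrated with a small sketch (hypothetical helper, not the actual TorchServe/launcher code):

```python
def repin_cores(num_cores, num_workers):
    """Core ranges after launcher restarts workers: cores are re-divided
    evenly among the new worker count, so nothing stays pinned to a dead
    worker and no ranges overlap."""
    per = num_cores // num_workers
    return {i: (i * per, (i + 1) * per - 1) for i in range(num_workers)}

# Scaling down from 4 workers to 2 on 16 cores: without a restart, the two
# surviving workers would keep cores 0-3 and 4-7 while cores 8-15 sit idle;
# after the restart, worker 0 is pinned to cores 0-7 and worker 1 to 8-15.
print(repin_cores(16, 2))
```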
## Performance Boost with IPEX and Launcher

![pdt_perf](https://github.com/min-jean-cho/frameworks.ai.pytorch.ipex-cpu-1/assets/93151422/a158ba6c-a151-4115-befb-39acb7545936)
getting a 404 here
Should be fixed now. Thanks.
@@ -85,13 +85,24 @@ public int getNumRunningWorkers(ModelVersionName modelVersionName) {
        return numWorking;
    }

    public boolean isLauncherRestartWorkers(int currentWorkers) {
        boolean isRestart;
        if (configManager.isCPULauncherEnabled() && currentWorkers > 0) {
Checks that cpu_launcher is enabled and `currentWorkers` is greater than 0 (i.e., only restart workers when scaling workers up/down; if `currentWorkers == 0`, then workers are being initialized).
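In Python terms, the check amounts to the following (a sketch of the quoted Java condition, with hypothetical names):

```python
def is_launcher_restart_workers(cpu_launcher_enabled, current_workers):
    # Restart only when the CPU launcher is on AND workers already exist;
    # current_workers == 0 means workers are still being initialized,
    # so there is nothing to re-pin yet.
    return cpu_launcher_enabled and current_workers > 0
```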
```
                    currentWorkers,
                    minWorker);
            maxWorker = 0;
            minWorker = 0;
```
By setting `minWorker = 0` when `currentWorkers > 0`, this will never enter the `if (currentWorkers < minWorker)` condition in line 134. And by setting `maxWorker = 0`, the `for` loop in line 137 will proceed to remove all `currentWorkers`.
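The drain-then-restart flow described in these comments can be simulated end to end (a hypothetical Python sketch of the Java logic, not the implementation itself):

```python
def model_changed_workers(current_workers, target_min_workers, launcher_enabled):
    """Simulate the worker-count adjustment in modelChanged: with the CPU
    launcher enabled and workers running, min/max are zeroed so every
    current worker is removed, then restartNumWorkers (the updated min)
    are added back with fresh core pinning."""
    restart_num_workers = target_min_workers
    is_restart = launcher_enabled and current_workers > 0
    min_worker = target_min_workers
    max_worker = target_min_workers
    if is_restart:
        # zeroing min/max drains all current workers
        min_worker = 0
        max_worker = 0
    workers = min(current_workers, max_worker)  # remove workers above max
    workers = max(workers, min_worker)          # start workers up to min
    if is_restart:
        # then the scaled worker count is restarted in one go
        workers = restart_num_workers
    return workers
```

For example, scaling from 1 worker up to 5 (or from 5 down to 1) ends with exactly the target count running, all re-pinned.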
```
public CompletableFuture<Integer> modelChanged(
        Model model, boolean isStartup, boolean isCleanUp) {
    synchronized (model.getModelVersionName()) {
        boolean isSnapshotSaved = false;
        CompletableFuture<Integer> future = new CompletableFuture<>();
        int minWorker = model.getMinWorkers();
        int maxWorker = model.getMaxWorkers();
        int restartNumWorkers = minWorker;
```
Sets `restartNumWorkers` to the updated `minWorker` after scale up/down.
Finally, after removing all `currentNumWorkers` workers, add back (i.e., restart) `restartNumWorkers`. Feel free to try it with apache benchmark. For example, set initial workers to 1 and scale up to 5 via `curl -v -X PUT "http://localhost:8081/models/benchmark?min_worker=5"`, or set initial workers to 5 and scale down to 1 via `curl -v -X PUT "http://localhost:8081/models/benchmark?min_worker=1"`.
…eaders (2) updated socket_id to node_id (3) added launcher core pinning section
Hi all, I have tested core pinning with the docker image - please see the attached image. A couple of notes: (1) IPEX 1.11 requires PyTorch 1.11. For now, I have tested by modifying requirements/torch_linux.txt as here. It seems upgrading to PyTorch 1.11 is planned in #1434. (2) Added numactl and ipex installation in Dockerfile.dev as here. (3) Users need to start the docker container in --privileged mode for numactl core binding to work. Let me know your thoughts. Thanks. cc @msaroufim @lxning @ashokei
```
python benchmark-ab.py --url {modelUrl} --input {inputPath} --concurrency 1
```
It would be good to mention/add a link to how int8 quantization was applied to these models.
Hi @HamidShojanazeri , that's a good question - serving is indeed paused/interrupted for a few seconds while redistributing, but there's no code crash or anything. Let me know if this suffices or if you would like further experiments. Thanks
thanks @min-jean-cho, I think it might be good to clarify it in the doc.
cc: @lxning
Hi @HamidShojanazeri , I have added a gif and footnote to demonstrate that serving is interrupted for a few seconds while redistributing cores. Please have a look at the updated README https://github.com/min-jean-cho/serve/tree/launcher_core_pinning/examples/intel_extension_for_pytorch#scaling-workers . Thanks
Hi @msaroufim @HamidShojanazeri , please let me know if there's anything blocking this PR.
frontend/server/src/main/java/org/pytorch/serve/wlm/WorkLoadManager.java
Co-authored-by: Aaqib <maaquib@gmail.com>
Can confirm test gets skipped on GPU
- port https://github.com/intel-innersource/frameworks.ai.pytorch.ipex-cpu/pull/740 to `run_cpu` - use-case by pytorch/serve#2166 where `numactl` is unavailable (e.g., requires `privileged` mode). This PR automatically tries taskset if numactl core binding doesn't work. Reference: `taskset` is added to adapt to launcher use-cases such as docker, where `numactl` requires to be run in `privileged` mode, and the `privileged` mode "wont work for deployments like sagemaker for example" as raised by TorchServe. Please see [torchserve ipex docker discussion](pytorch/serve#1401 (comment)) for reference. To address such use-cases, `taskset` can be used in place of `numactl` to set core affinity. Note that, unlike `numactl`, `taskset` does not provide memory binding to local memories; however, memory binding may not be needed in these use-cases, which typically do not span multiple sockets. Hence we can automatically try taskset if numactl doesn't work. Pull Request resolved: #96011 Approved by: https://github.com/jgong5, https://github.com/malfet
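The fallback described in this commit message can be sketched as follows (a simplified illustration, not the actual `run_cpu` implementation; the helper name is an assumption, though `numactl -C` and `taskset -c` are the tools' real core-binding flags):

```python
import shutil

def core_binding_prefix(cores, prefer_numactl=True):
    """Build a command prefix that pins a worker process to `cores`.

    Tries numactl first (core binding plus local memory binding); falls
    back to taskset (core affinity only, no memory binding) when numactl
    is unavailable, e.g. in an unprivileged docker container.
    """
    core_list = ",".join(str(c) for c in cores)
    if prefer_numactl and shutil.which("numactl"):
        return ["numactl", "-C", core_list]
    if shutil.which("taskset"):
        return ["taskset", "-c", core_list]
    return []  # no binding tool available; run the worker unpinned

# Prepend the prefix to the worker command, e.g. prefix + ["python", "worker.py"]
prefix = core_binding_prefix(range(0, 4))
```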
This PR adds `--instance_idx` to TorchServe, which helps improve the performance of multi-worker inference via core pinning; it also covers how `instance_idx` improves performance and includes IPEX performance boost data. Note that currently users have to install IPEX from source to use the `--instance_idx` feature.