Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert previous attempt at Triton patch; use CustomCacheManger approach instead. #35

Merged
merged 5 commits into from
Jun 3, 2024

Conversation

tdoublep
Copy link
Member

I tested the previous fix for the Triton cache collision issue (see: #34) and it didn't work.

I now see errors like:

FileNotFoundError: [Errno 2] No such file or directory: '/home/vllm/.triton/cache/1feb415f3280ca46eea8c4407a58c23e/fused_moe_kernel.json.tmp.pid_72_c0a0033e-6147-4520-ae3a-3847d02598f8'

which now shows the uuid instead of a random integer, but problem remains.

This PR implements a different workaround, proposed by @cyang49, that tells Triton to use a custom cache manager which assigns a different directory based on the process id.

This time I have tested it and it seems to work.

tdoublep and others added 4 commits May 29, 2024 11:01
Co-authored-by: Chih-Chieh-Yang <chih.chieh.yang@ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Copy link
Member

@njhill njhill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tdoublep!

Would be good for @tjohnson31415 to review too

triton_patch/custom_cache_manager.py Outdated Show resolved Hide resolved
triton_patch/custom_cache_manager.py Outdated Show resolved Hide resolved
@tjohnson31415
Copy link
Member

I did a little investigation into this issue; this is what I found:

The stacktrace pointed to the error as coming from CompiledKernel.init() where the code lists the files and then tries to read each of them (REF). Those files are created as artifacts of the compilation process in the code here.

What is interesting is that the error includes .tmp.pid in the name which is a temporary file that the compilation is written to before doing an os.replace. I think the issue is that, between listing the files and trying to read the files, the .tmp.pid file is oc.replaced by another process; hence leading to a FileNotFoundError. Even with unique filenames for each process, multiple processes are using the same cache directory.

So a fix in the Triton code could be to remove the temporary files from the listing or to have the temp files created in a separate directory. That way, the compiled artifacts can still be shared across processes (though with some redundant compilations).

That said, I think this fix with a custom cache manager giving each process its own directory will also work and is a good work-around until a fix can be merged into Triton.

Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Signed-off-by: Nick Hill <nickhill@us.ibm.com>
Signed-off-by: Joe Runde <joe@joerun.de>
@joerunde joerunde merged commit a17c8fb into main Jun 3, 2024
14 checks passed
@joerunde joerunde deleted the tpa-triton-cachefix2 branch June 3, 2024 16:58
ThomasRaoux pushed a commit to triton-lang/triton that referenced this pull request Jul 10, 2024
…4295)

# Summary
there've been multiple issues discussing around the `FileNotFoundError`
on compilation when `CompiledKernel` is trying to read from the listed
ASM files. #2688 #4002 vllm-project/vllm#6103
etc. and there have been some attempts to address it such as #3544 .
This PR attempts to explain the root cause and suggest a fix.

# Why
When a kernel is being compiled, triton first writes IRs to triton cache
dir
([ref](https://github.com/triton-lang/triton/blob/78091647fccb6825ed9956ff7c0300859856d261/python/triton/compiler/compiler.py#L289)).
Inside of the write operation, the process first writes it to a temp
file unique to the current process (plus a uuid to distinguish between
multiple processes with same PID on different hosts sharing the same
underlying FS)
([ref](https://github.com/triton-lang/triton/blob/c14b033cd979d5c39e5fdb3847c022fa5d71a0c1/python/triton/runtime/cache.py#L124-L130))
and then atomically `os.replace` it to the final file name. Afterwards
the `CompiledKernel` lists all the IRs and reads them
([ref](https://github.com/triton-lang/triton/blob/78091647fccb6825ed9956ff7c0300859856d261/python/triton/compiler/compiler.py#L362-L367)).

On multiprocess set up this may however result in a race condition.
Let's focus on a case where there's one host with 2 processes on it.
![Triton RC
(1)](https://github.com/triton-lang/triton/assets/43726198/ffc20e0c-0404-4e7a-bd6c-022e710e97b9)

At the time when `pid 1` lists ASMs, the dir may contain temp files
generated from another process `pid 2`. However at the time when `pid 1`
proceeds to read bytes from the listed files, `pid2` may have already
`os.replace`ed its temp files, so `pid 1` will encounter
`FileNotFoundError` when trying to read the temp file generated by `pid
2`. IBM/vllm#35 (comment) also
believes this is the root cause.

# How
There're multiple potential solutions towards this, as mentioned in
IBM/vllm#35 (comment) as well:
- let each process write to a private temp dir instead so `glob` won't
bother taking the temp stuff into consideration
- or, exclude `tmp.pid_*` from `glob`

This PR tries to go with the 1st approach to avoid adding an assumption
on the tmp file pattern (which is only used in `runtime/cache.py`) in
`compiler/compiler.py` but is open to any suggestion. Thanks!

Complete the following tasks before sending your PR, and replace `[ ]`
with
`[x]` to indicate you have done them.

- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `not applicable`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants