Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RyuJIT does not optimize multiple integer equality comparisons #81668

Closed
gdkchan opened this issue Feb 6, 2023 · 5 comments
Closed

RyuJIT does not optimize multiple integer equality comparisons #81668

gdkchan opened this issue Feb 6, 2023 · 5 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue

Comments

@gdkchan
Copy link

gdkchan commented Feb 6, 2023

Description

Currently, RyuJIT does not optimize a sequence of integer equality comparisons all leading to the same path.
For example, let's say one wants to implement a function that determines if a given char is a vowel, and return bool.
One could naively implement it like so:

    static bool IsVowel(char c)
    {
        return c == 'a' ||
            c == 'e' ||
            c == 'i' ||
            c == 'o' ||
            c == 'u' ||
            c == 'A' ||
            c == 'E' ||
            c == 'I' ||
            c == 'O' ||
            c == 'U';
    }

Today, that will be compiled to a long sequence of instructions, doing each comparison and jumping if equal.

       push     rbp
       mov      rbp, rsp
       movzx    rax, di
       cmp      eax, 97
       je       SHORT G_M5371_IG05
       cmp      eax, 101
       je       SHORT G_M5371_IG05
       cmp      eax, 105
       je       SHORT G_M5371_IG05
       cmp      eax, 111
       je       SHORT G_M5371_IG05
       cmp      eax, 117
       je       SHORT G_M5371_IG05
       cmp      eax, 65
       je       SHORT G_M5371_IG05
       cmp      eax, 69
       je       SHORT G_M5371_IG05
       cmp      eax, 73
       je       SHORT G_M5371_IG05
       cmp      eax, 79
       je       SHORT G_M5371_IG05
       cmp      eax, 85
       sete     al
       movzx    rax, al
       pop      rbp
       ret      
G_M5371_IG05:
       mov      eax, 1
       pop      rbp
       ret

However there is an optimization that native compilers will usually do here. Because the values being compared here are not too far apart, it's possible to construct a bit mask where the bit corresponding to the number of each vowel character is set to 1. Then one can just subtract the value being checked (c) by the lowest possible value (in this case A which is the vowel which the lowest value). If the result is negative, we already know it's out of range, and we can return false. If it's positive, and lower or equal to the highest value (in this case u), we can then check if the respective bit is set on the mask. If it's set we return true, otherwise false.

This is the code that GCC produces using the technique above for similar code written in C, with O1 optimization level:

        lea     ecx, [rdi-65]
        cmp     cl, 52
        ja      .L3
        movabs  rax, 4575140898685201
        shr     rax, cl
        and     eax, 1
        ret
.L3:
        mov     eax, 0
        ret

It's worth noting that it produces slightly different code at O2 (using bt instructions instead of the shr and and).
Other compilers like clang and MSVC produces similar code with optimizations enabled.

Comparison using Godbolt compiler explorer: https://godbolt.org/z/1efMGq3rx
RyuJIT codegen on Godbolt compiler explorer: https://godbolt.org/z/eqxnPaKGr
Above I also included a IsVowelOptimized function where I manually optimized it using the technique above, which produces more or less the ideal code, which I would expect to be produced for the IsVowel function.

This pattern is pretty common, so it would be nice to optimize it. For example, when working with textures, I often need to write code like:

if (format == TextureFormat.D24UnormS8Uint ||
    format == TextureFormat.D32FloatS8Uint ||
    format == TextureFormat.D24UnormX8 ||
    format == TextureFormat.D16Unorm ||
    format == TextureFormat.D32Float)
{
    // Do something with the depth texture...
}
else
{
    // Do something with the color texture...
}

Configuration

.NET 7.0.100, AFAIK it affects all platforms.

Regression?

No, to my knowledge, this optimization was never implemented on .NET JIT compiler.

@gdkchan gdkchan added the tenet-performance Performance related issue label Feb 6, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Feb 6, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 6, 2023
@ghost
Copy link

ghost commented Feb 6, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

Currently, RyuJIT does not optimize a sequence of integer equality comparisons all leading to the same path.
For example, let's say one wants to implement a function that determines if a given char is a vowel, and return bool.
One could naively implement it like so:

    static bool IsVowel(char c)
    {
        return c == 'a' ||
            c == 'e' ||
            c == 'i' ||
            c == 'o' ||
            c == 'u' ||
            c == 'A' ||
            c == 'E' ||
            c == 'I' ||
            c == 'O' ||
            c == 'U';
    }

Today, that will be compiled to a long sequence of instructions, doing each comparison and jumping if equal.

       push     rbp
       mov      rbp, rsp
       movzx    rax, di
       cmp      eax, 97
       je       SHORT G_M5371_IG05
       cmp      eax, 101
       je       SHORT G_M5371_IG05
       cmp      eax, 105
       je       SHORT G_M5371_IG05
       cmp      eax, 111
       je       SHORT G_M5371_IG05
       cmp      eax, 117
       je       SHORT G_M5371_IG05
       cmp      eax, 65
       je       SHORT G_M5371_IG05
       cmp      eax, 69
       je       SHORT G_M5371_IG05
       cmp      eax, 73
       je       SHORT G_M5371_IG05
       cmp      eax, 79
       je       SHORT G_M5371_IG05
       cmp      eax, 85
       sete     al
       movzx    rax, al
       pop      rbp
       ret      
G_M5371_IG05:
       mov      eax, 1
       pop      rbp
       ret

However there is an optimization that native compilers will usually do here. Because the values being compared here are not too far apart, it's possible to construct a bit mask where the bit corresponding to the number of each vowel character is set to 1. Then one can just subtract the value being checked (c) by the lowest possible value (in this case A which is the vowel which the lowest value). If the result is negative, we already know it's out of range, and we can return false. If it's positive, and lower or equal to the highest value (in this case u), we can then check if the respective bit is set on the mask. If it's set we return true, otherwise false.

This is the code that GCC produces using the technique above for similar code written in C, with O1 optimization level:

        lea     ecx, [rdi-65]
        cmp     cl, 52
        ja      .L3
        movabs  rax, 4575140898685201
        shr     rax, cl
        and     eax, 1
        ret
.L3:
        mov     eax, 0
        ret

It's worth noting that it produces slightly different code at O2 (using bt instructions instead of the shr and and).
Other compilers like clang and MSVC produces similar code with optimizations enabled.

Comparison using Godbolt compiler explorer: https://godbolt.org/z/1efMGq3rx
RyuJIT codegen on Godbolt compiler explorer: https://godbolt.org/z/eqxnPaKGr
Above I also included a IsVowelOptimized function where I manually optimized it using the technique above, which produces more or less the ideal code, which I would expect to be produced for the IsVowel function.

This pattern is pretty common, so it would be nice to optimize it. For example, when working with textures, I often need to write code like:

if (format == TextureFormat.D24UnormS8Uint ||
    format == TextureFormat.D32FloatS8Uint ||
    format == TextureFormat.D24UnormX8 ||
    format == TextureFormat.D16Unorm ||
    format == TextureFormat.D32Float)
{
    // Do something with the depth texture...
}
else
{
    // Do something with the color texture...
}

Configuration

.NET 7.0.100, AFAIK it affects all platforms.

Regression?

No, to my knowledge, this optimization was never implemented on .NET JIT compiler.

Author: gdkchan
Assignees: -
Labels:

tenet-performance, area-CodeGen-coreclr, untriaged

Milestone: -

@EgorBo
Copy link
Member

EgorBo commented Feb 6, 2023

Duplicates #8418

NOTE: this is probably easier to implement in Roslyn/post-IL process to emit CEE_SWITCH opcode which is lowered into a bit-test or jump-table by JIT. But we can also do it in JIT at some point

@gdkchan
Copy link
Author

gdkchan commented Feb 6, 2023

Duplicates #8418

NOTE: this is probably easier to implement in Roslyn/post-IL process to emit CEE_SWITCH opcode which is lowered into a bit-test or jump-table by JIT. But we can also do it in JIT at some point

Sorry, I did not know how this optimization is called, so I didn't know what exactly to search for.
FWIW, I also tried implementing the function in the post using switch, and it also seems to generate several comparisons today.

    static bool IsVowelSwitch(char c)
    {
        switch (c)
        {
            case 'a':
            case 'e':
            case 'i':
            case 'o':
            case 'u':
            case 'A':
            case 'E':
            case 'I':
            case 'O':
            case 'U':
                return true;
        }

        return false;
    }

Looking at the IL, it seems that Roslyn is generating those comparisons, I wonder why.

@EgorBo
Copy link
Member

EgorBo commented Feb 6, 2023

Looking at the IL, it seems that Roslyn is generating those comparisons, I wonder why.

Roslyn is a bit conservative for switch opcode - if the input values have too big gaps (in value) - it gives up. E.g. this one works: https://sharplab.io/#v2:C4LglgNgPgAgTARgLACgYGYAE9MGFMDeqmJ2CAbJgEYD2NEmAkgM4BqNA7gKYQDKHYYAGMAFgApRAQwBOmIQEpipIilJrMzAcJGYJi1epIrDhoZOZdMAcklWQSk6TMXrVOw8dzzlq0PcHPLxcrABN/QJJnHy5wiKjrEVjA+KswJIiSGAB2TGBpAFcuAG4PNQBfVFLSbMwAM0kICxKAzAqUMqA===

@EgorBo
Copy link
Member

EgorBo commented Feb 6, 2023

Closed as a dup of #8418

@EgorBo EgorBo closed this as not planned Won't fix, can't repro, duplicate, stale Feb 6, 2023
@EgorBo EgorBo removed the untriaged New issue has not been triaged by the area owner label Feb 6, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Mar 8, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

2 participants