
On sparsity and the best way to compute indices of non zero elements. #204

Sopel97 opened this issue Jul 14, 2021 · 2 comments
Sopel97 commented Jul 14, 2021

I tried implementing an affine transform that takes advantage of sparse inputs in Stockfish for some experiments, and I found that computing the indices of the non-zero elements is actually quite a costly operation. I tried both computing them lazily, as cfish does, and computing them up front into a local array; both resulted in similar performance and stall issues. However, the version with a local array allows for less branching in the actual dot product computations. Let's look at some data first.
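For context, here is a minimal scalar sketch of the local-array approach (the names, sizes, and weight layout are illustrative, not the actual Stockfish code):

#include <cstdint>

constexpr int InputSize  = 1024;  // illustrative sizes only
constexpr int OutputSize = 8;

// Weights are assumed column-major: the OutputSize weights for input i
// are contiguous, so each non-zero input touches one contiguous column.
void sparse_affine(const std::int8_t* in, const std::int8_t* weights,
                   const std::int32_t* biases, std::int32_t* out)
{
    // Pass 1: gather the indices of non-zero inputs into a local array.
    std::uint16_t nnz[InputSize];
    unsigned count = 0;
    for (int i = 0; i < InputSize; ++i)
        if (in[i] != 0)
            nnz[count++] = static_cast<std::uint16_t>(i);

    // Pass 2: accumulate only the columns of the non-zero inputs; the
    // inner loop has no per-element branching.
    for (int o = 0; o < OutputSize; ++o)
        out[o] = biases[o];
    for (unsigned k = 0; k < count; ++k)
    {
        const unsigned i = nnz[k];
        const std::int32_t v = in[i];
        for (int o = 0; o < OutputSize; ++o)
            out[o] += v * weights[i * OutputSize + o];
    }
}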

I've investigated the sparsity of the feature transformer output for feature transformers of various sizes. The number of non-zero inputs is almost constant regardless of the size, which means that density decreases rapidly with size. Here is a plot showing the density (on the y axis) for chosen nets (on the x axis), grouped by the size of the feature transformer (shown at the top). The data comes from a low-depth bench. The box corresponds to the [0.25, 0.75] quantile range, the whiskers to [0.01, 0.99], and the circles are outliers.
[Figure: box plots of feature transformer output density (y axis) per net (x axis), grouped by feature transformer size]

We can see that the current architecture (as of official-stockfish/Stockfish@e8d64af) has a density of about 15%, and the 1024x2 net has a density of about 8%. I have not gathered data for the 256x2 case, but it's reasonable to assume it's around 25%. This is consistent with the near-constant count of non-zero inputs: with the current 512x2 transformer, 15% of 1024 outputs is roughly 150 non-zero values, while 8% of 2048 is roughly 160. This also means that, with the right inference code, it might be possible to have a feature transformer of size 1024x2 in Stockfish, though I won't delve into that here.

I've also implemented a few variants of the way indices of non-zero inputs are computed. I focused on the AVX2 implementation, as it's by far the most popular (some cool tricks are possible with AVX512, but I didn't bother implementing anything there). The code can be found here: https://godbolt.org/z/z5xTYv4e1. The benchmark covers multiple different densities. The input array has size 2048 and is filled randomly with the desired average density. 64 input sets are generated and the benchmark goes over them 10000 times; warmup is performed on all input sets once. The total time in seconds is reported. I ran this benchmark on an Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz, compiled with GCC 9.3 using g++ -std=c++17 -mavx2 -O3 -DNDEBUG. The results are as follows:

[Table: total benchmark time in seconds for the non_zero_indices_* variants across the tested input densities]

Cfish currently uses an implementation similar to non_zero_indices_1. The implementation non_zero_indices_6 is about 4-5x faster than the current cfish implementation for the 10-20% density case. Implementation non_zero_indices_4 might be a good middle ground that avoids thrashing the cache with the larger lookup table.
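For reference, here is a hedged sketch of the simplest bit-scan style, in the spirit of the benchmarked variants (the actual non_zero_indices_* implementations are in the godbolt link above; IndexType and InputSize are assumptions here):

#include <immintrin.h>
#include <cstdint>

using IndexType = std::uint16_t;
constexpr int InputSize = 2048;

void non_zero_indices_bitscan(const std::int8_t* in, IndexType* out, unsigned& count_out)
{
    constexpr int SimdWidth = 32;  // bytes per __m256i
    constexpr int NumChunks = InputSize / SimdWidth;
    const auto* inputVector = reinterpret_cast<const __m256i*>(in);
    unsigned count = 0;
    for (int i = 0; i < NumChunks; ++i)
    {
        const __m256i chunk = _mm256_loadu_si256(inputVector + i);
        // One mask bit per byte; post-activation inputs are non-negative,
        // so "greater than zero" is the same as "non-zero".
        std::uint32_t mask = static_cast<std::uint32_t>(_mm256_movemask_epi8(
            _mm256_cmpgt_epi8(chunk, _mm256_setzero_si256())));
        while (mask)
        {
            out[count++] = static_cast<IndexType>(i * SimdWidth + __builtin_ctz(mask));
            mask &= mask - 1;  // clear the lowest set bit
        }
    }
    count_out = count;
}

A lookup-table variant would replace the inner bit-scan loop with a table of precomputed index offsets indexed by mask bytes, which is where the cache-footprint tradeoff mentioned above comes from.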

One more thing: it should be possible to utilize VNNI by taking 4 inputs at a time and interleaving 4 weight rows. This is slightly easier to manage when the indices are in a local array instead of being computed lazily.
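A hedged sketch of that idea (the packing and the interleaved weight layout here are my assumptions, not actual code): pack 4 non-zero input values into one 32-bit lane and keep the 4 corresponding weight rows byte-interleaved, so a single _mm256_dpbusd_epi32 accumulates all 4 input*weight products into each int32 output lane.

#include <immintrin.h>
#include <cstdint>
#include <cstring>

constexpr int OutputSize = 32;  // illustrative; int32 accumulators, 8 per __m256i

// in4: 4 input values (unsigned after the activation); weights4:
// OutputSize/8 vectors where lane o holds the 4 bytes
// {w0[o], w1[o], w2[o], w3[o]} from the 4 interleaved weight rows.
// Requires AVX512VNNI+AVX512VL (or the AVX-VNNI equivalent).
inline void accumulate4(__m256i* acc, const std::uint8_t in4[4], const __m256i* weights4)
{
    std::uint32_t packed;
    std::memcpy(&packed, in4, sizeof packed);  // 4 input bytes -> one u32
    const __m256i factor = _mm256_set1_epi32(static_cast<int>(packed));
    for (int j = 0; j < OutputSize / 8; ++j)
        // lane o of acc[j] += sum over t=0..3 of in4[t] * (byte t of lane o)
        acc[j] = _mm256_dpbusd_epi32(acc[j], factor, weights4[j]);
}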

Sopel97 commented Aug 6, 2021

For AVX512VBMI2 (which provides _mm512_mask_compressstoreu_epi16) the following code could be considered too:

#include <immintrin.h>
#include <cstdint>

// IndexType and InputSize as in the earlier snippets.
using IndexType = std::uint16_t;
constexpr int InputSize = 2048;

void non_zero_indices_compress(const std::int8_t* in, IndexType* out, unsigned& count_out)
{
    static constexpr int SimdWidth = 256 / 8;
    static constexpr int NumChunks = InputSize / SimdWidth;

    const auto inputVector = reinterpret_cast<const __m256i*>(in);
    // Running indices 0..31 as 16-bit lanes (_mm512_set_epi16 lists lanes
    // from most significant to least significant).
    __m512i indices = _mm512_set_epi16(31, 30, 29, 28, 27, 26, 25, 24,
                                       23, 22, 21, 20, 19, 18, 17, 16,
                                       15, 14, 13, 12, 11, 10,  9,  8,
                                        7,  6,  5,  4,  3,  2,  1,  0);
    const __m512i increment = _mm512_set1_epi16(32);
    unsigned count = 0;
    for (int i = 0; i < NumChunks; i += 2)
    {
        const __m256i inputChunk0 = inputVector[i + 0];
        const __m256i inputChunk1 = inputVector[i + 1];
        // One mask bit per byte that is > 0.
        const __mmask32 m0 = _mm256_cmpgt_epi8_mask(inputChunk0, _mm256_setzero_si256());
        const __mmask32 m1 = _mm256_cmpgt_epi8_mask(inputChunk1, _mm256_setzero_si256());
        const unsigned c0 = _mm_popcnt_u32(_cvtmask32_u32(m0));
        const unsigned c1 = _mm_popcnt_u32(_cvtmask32_u32(m1));

        // Compress-store only the selected 16-bit indices, then advance
        // the running indices by one chunk width.
        _mm512_mask_compressstoreu_epi16(out + count, m0, indices);
        indices = _mm512_add_epi16(indices, increment);
        count += c0;

        _mm512_mask_compressstoreu_epi16(out + count, m1, indices);
        indices = _mm512_add_epi16(indices, increment);
        count += c1;
    }
    count_out = count;
}

Sopel97 commented Aug 12, 2021

Another thing to consider: don't do the FT "transform" at all and work directly on the int16 output, since the sparse implementation doesn't need the ClippedReLU to actually be performed on it (though it still needs some way to efficiently compute min(in, 255)).
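A minimal sketch of that observation (assuming int16 FT output): the sparse pass only visits indices where in[i] > 0, so the lower clamp of the ClippedReLU is implicit and only the upper clamp remains, e.g.:

#include <immintrin.h>
#include <cstdint>
#include <algorithm>

// Scalar form, applied to each gathered non-zero int16 activation:
inline std::int32_t clamp_activation(std::int16_t v)
{
    return std::min<std::int32_t>(v, 255);
}

// Vectorized form over a whole int16 chunk (AVX2):
inline __m256i clamp_chunk_epi16(__m256i v)
{
    return _mm256_min_epi16(v, _mm256_set1_epi16(255));
}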
