Remove intermediate casting in product #490

Merged (11 commits) on Mar 23, 2022

Conversation

@jmitrevs (Contributor) commented Jan 8, 2022

This removes the intermediate casting of scale*x to the res_T type in a BatchNormalization, so the cast to res_T is applied only to the final scale*x + bias. I did this by introducing Product_nocast and mult_nocast, which multiply and return the result at full precision. That seemed easier and less error-prone than manually determining the resulting type in Python, and I was hesitant to always make mult and Product return the full precision, though that would be straightforward.

I did confirm that the simple test model is synthesizable with Vivado_HLS 2019.2.

This patch also fixes some issues I saw when working with low-bit values in QONNX.
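
For illustration, a minimal sketch of the no-cast multiply described above (the class name and exact form here are illustrative assumptions, not the PR code): the return type is whatever a*w produces, so no precision is lost inside the product and any narrowing cast is left to the caller.

namespace nnet {

// Sketch only: a product that returns the exact type of a*w instead of
// casting to a res_T template parameter.
template<class x_T, class w_T>
class mult_nocast_sketch {
    public:
    static auto product(x_T a, w_T w) -> decltype(a * w) {
        #pragma HLS INLINE
        return a * w; // full-precision result; the caller decides on any cast
    }
};

} // namespace nnet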

@thesps (Contributor) commented Jan 10, 2022

I understand the motivation to return the product at full precision, but I think the solution needs more work.

Unfortunately they're not captured by the current tests, but there are examples where this solution produces C++ that doesn't compile. You can see one by changing kernel_quantizer=quantized_bits(bits,0,alpha=1) to kernel_quantizer=quantized_bits(bits,0,alpha='auto_po2') in this test. That's just a shortcut to get a model with ExponentType for the scale. More generally, we have the exponent and XNOR products, which are not simply a * b, and they need to be supported as well.
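
For context, a rough sketch of what an exponent-type weight carries (the layout and name below are assumptions, mirroring only the w.sign and w.weight accesses that appear later in this thread): the stored value represents sign * 2^weight, so the product becomes a shift rather than a multiply.

#include "ap_int.h"

// Assumed layout of a power-of-two (exponent) weight: a sign flag plus a
// signed shift amount, so scale*x can be implemented as (+/-) x << weight.
template<int W>
struct exponent_weight_sketch { // hypothetical name
    ap_uint<1> sign;            // 1 -> positive, 0 -> negative
    ap_int<W>  weight;          // the exponent; negative values shift right
};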

@jmitrevs (Contributor, Author):

Thanks. I was able to reproduce a failure. I will try to come up with a solution.

@jmitrevs (Contributor, Author):

I added _nocast versions of the other products. The weight_exponential_nocast is the most difficult, and I am not sure I am choosing the right return type, though that is easy to change.

Is this the direction we want to go? One could argue for moving more of this logic into the Python side rather than the templates, but other than for the exponential, this is pretty straightforward, so I am still leaning towards keeping it in the templates.

Nevertheless, this fixes the failure mentioned above by @thesps.

@thesps (Contributor) commented Jan 12, 2022

Cool! I'm now wondering whether, instead of keeping both a "with cast" and a "no cast" version of all the products, we can keep only the "no cast" versions (i.e. drop the res_T template parameter from all the product functions). Then anywhere that needs a cast on the return will do it outside the product - basically I'm thinking of the Dense layer here, which casts to accum_t. But I think it should "just work" if the product functions always return the exact result.

And then the return-type logic for both the XNOR and Exponent products would need to be derived anyway. But perhaps the decltype construct works for both of those too?
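
For reference, the XNOR-style product already has a natural lossless result of a single bit, so dropping res_T is trivial there. A sketch under assumed naming and encoding (the actual hls4ml class may differ):

#include "ap_int.h"

// Sketch: when both data and weight are binary (+1/-1 encoded as 1/0), the
// exact product is the one-bit XNOR of the two encodings.
template<class x_T, class w_T>
class both_binary_sketch { // hypothetical name
    public:
    static ap_uint<1> product(x_T a, w_T w) {
        #pragma HLS INLINE
        return a == w; // 1 when the signs match (+1), 0 otherwise (-1)
    }
};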

Edit: pushed to local branch for CI

@jmitrevs (Contributor, Author):

As for whether we should drop res_T from all the product functions or keep both versions, I don't have a preference. I didn't initially drop res_T everywhere in order to make this a less disruptive change, but I can see the argument for moving everything to a non-casting version. I can update the PR that way if preferred.

@thesps (Contributor) commented Jan 17, 2022

Yeah, I think these functions are so fundamental that it could be very confusing to have different behaviour in different places. So let's go for consistency and simplicity and make them all behave the same way (i.e. return the full available precision). I've been meaning to check how to construct the lossless return type for the special operators but haven't had a chance yet.

@jmitrevs (Contributor, Author):

I went ahead and made all the nnet::product terms not cast the output. It would be useful to look at all the product terms to make sure I am returning the appropriate precision.

@thesps (Contributor) commented Jan 18, 2022

I think this is the right direction. How about this for the weight_exponential:

template<class x_T, class w_T>
class weight_exponential : public Product{
    public:
    // Construct the return type from the multiplication equivalent to the largest shifts
    // ap_int<pow2(decltype(w_T::weight)::width-1)-1> is the type of the multiplicand equivalent to the largest left shift (<<)
    // ap_fixed<pow2(decltype(w_T::weight)::width-1)-1,0> is the type of the multiplicand equivalent to the largest right shift (>>)
    using r_T = decltype(x_T(0) * (ap_int<pow2(decltype(w_T::weight)::width-1)-1>(1)+ap_fixed<pow2(decltype(w_T::weight)::width-1)-1,0>(1)));
    static r_T product(x_T a, w_T w){
        // Shift product for exponential weights
        #pragma HLS INLINE
        // shift by the exponent. Negative weights shift right
        r_T y = static_cast<r_T>(a) << w.weight;
        // negate or not depending on weight sign
        return w.sign == 1 ? y : static_cast<r_T>(-y);
    }
};

?
Then I think you only need this one, and not the different specialised versions for the different possible ap_ types of x_T. I tried to explain the logic in the included comment, but I can elaborate. I also changed rt -> r_T for consistency with the other type names. In principle it should also work if x_T were float or something else.

I'm thinking that, for completeness, some of those binary/ternary products may need a wider type, since for ints/fixed-point -x can be one bit wider than x. We can achieve that with:

static x_T product(x_T a, w_T w){

->

static auto product(x_T a, w_T w) -> decltype(-a) // (or -w)
{

I think we need that for all the products that can return -x (see the sketch after this list):

  • weight_ternary
  • data_binary
  • weight_binary
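
Applied to one of the items above, a minimal sketch of the widened trailing return type for weight_binary (the 0/1 encoding of -1/+1 below is an assumption, and the class name is illustrative):

// Sketch: -a can be one bit wider than a for integer/fixed-point types, so
// the lossless return type is decltype(-a) rather than x_T.
template<class x_T, class w_T>
class weight_binary_sketch { // hypothetical name
    public:
    static auto product(x_T a, w_T w) -> decltype(-a) {
        #pragma HLS INLINE
        if (w == 0) return -a; // weight -1
        else        return a;  // weight +1, widened on return
    }
};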

@jmitrevs (Contributor, Author):

I was struggling a bit with the exponential weight. I think I agree with all your suggestions. I will implement them and try the tests.

@jmitrevs (Contributor, Author):

I updated the code. I changed some ?: expressions to if/else because the former requires casting, which I think made the code look more complicated.

I did not check the weight_exponential very carefully, so it might be worth doing that, or adding specific test cases for all these changes. I did test with the modified test_qkeras. In fact, it would probably be a good idea to add a unit test for these various cases. I can look into that later today or tomorrow.
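
To illustrate the ?: point with a hypothetical helper (not hls4ml code): the conditional operator needs both branches converted to a common type, whereas if/else lets each return statement convert on its own.

// Hypothetical helper: with ap_ types, y and -y generally have different
// widths, so the ?: form would need explicit casts to a common type.
template<class r_T, class y_T>
r_T negate_unless(bool positive, y_T y) {
    // return positive ? static_cast<r_T>(y) : static_cast<r_T>(-y); // casts needed
    if (positive) {
        return y;  // converted to r_T by the return statement
    } else {
        return -y;
    }
}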

@vloncar (Contributor) commented Jan 18, 2022

This PR causes problems for the resource strategy for some reason; the clock misses timing by a lot. Compare this to this.

@jmitrevs (Contributor, Author):

That's very strange. It's from hls4ml-keras-test.sh, right? I'll take a look.

@jmitrevs (Contributor, Author):

So the error comes from:

WARNING: [SCHED 204-21] Estimated clock period (65.742ns) exceeds the target (target clock period: 5ns, clock uncertainty: 0.625ns, effective delay budget: 4.375ns).
WARNING: [SCHED 204-21] The critical path in module 'dense_resource_ap_fixed_ap_fixed_16_6_5_3_0_config4_s' consists of the following:
	'mul' operation of DSP[311] ('mul_ln1192', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [309]  (0.494 ns)
	'add' operation of DSP[311] ('add_ln1192', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [311]  (2.04 ns)
	'add' operation of DSP[319] ('add_ln1192_591', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [319]  (2.04 ns)
	'add' operation of DSP[327] ('add_ln1192_592', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [327]  (2.04 ns)
	'add' operation of DSP[335] ('add_ln1192_593', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [335]  (2.04 ns)
	'add' operation of DSP[343] ('add_ln1192_594', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [343]  (2.04 ns)
	'add' operation of DSP[351] ('add_ln1192_595', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [351]  (2.04 ns)
	'add' operation of DSP[359] ('add_ln1192_596', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [359]  (2.04 ns)
	'add' operation of DSP[367] ('add_ln1192_597', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [367]  (2.04 ns)
	'add' operation of DSP[375] ('add_ln1192_598', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [375]  (2.04 ns)
	'add' operation of DSP[383] ('add_ln1192_599', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [383]  (2.04 ns)
	'add' operation of DSP[391] ('add_ln1192_600', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [391]  (2.04 ns)
	'add' operation of DSP[399] ('add_ln1192_601', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [399]  (2.04 ns)
	'add' operation of DSP[407] ('add_ln1192_602', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [407]  (2.04 ns)
	'add' operation of DSP[415] ('add_ln1192_603', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [415]  (2.04 ns)
	'add' operation of DSP[423] ('add_ln1192_604', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [423]  (2.04 ns)
	'add' operation of DSP[431] ('add_ln1192_605', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [431]  (2.04 ns)
	'add' operation of DSP[439] ('add_ln1192_606', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [439]  (2.04 ns)
	'add' operation of DSP[447] ('add_ln1192_607', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [447]  (2.04 ns)
	'add' operation of DSP[455] ('add_ln1192_608', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [455]  (2.04 ns)
	'add' operation of DSP[463] ('add_ln1192_609', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [463]  (2.04 ns)
	'add' operation of DSP[471] ('add_ln1192_610', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [471]  (2.04 ns)
	'add' operation of DSP[479] ('add_ln1192_611', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [479]  (2.04 ns)
	'add' operation of DSP[487] ('add_ln1192_612', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [487]  (2.04 ns)
	'add' operation of DSP[495] ('add_ln1192_613', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [495]  (2.04 ns)
	'add' operation of DSP[503] ('add_ln1192_614', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [503]  (2.04 ns)
	'add' operation of DSP[511] ('add_ln1192_615', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [511]  (2.04 ns)
	'add' operation of DSP[519] ('add_ln1192_616', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [519]  (2.04 ns)
	'add' operation of DSP[527] ('add_ln1192_617', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [527]  (2.04 ns)
	'add' operation of DSP[535] ('add_ln1192_618', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [535]  (2.04 ns)
	'add' operation of DSP[543] ('add_ln1192_619', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [543]  (2.04 ns)
	'add' operation of DSP[551] ('add_ln1192_620', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [551]  (2.04 ns)
	'add' operation of DSP[560] ('add_ln1192_621', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [560]  (2.04 ns)
	'insertvalue' operation ('mrv', firmware/nnet_utils/nnet_dense_resource.h:280) [6520]  (0 ns)

What fixes the timing is doing an explicit cast:

    ReuseLoop:
    for (int ir = 0; ir < rufactor; ir++) {
        #pragma HLS PIPELINE II=1 rewind

        int w_index = ir;
        int in_index = ir;
        int out_index = 0;
        int acc_step = 0;

        MultLoop:
        for (int im = 0; im < block_factor; im++) {
            #pragma HLS UNROLL

            acc[out_index] += static_cast<typename CONFIG_T::accum_t>(CONFIG_T::template product<data_T, typename CONFIG_T::weight_t>::product(data[in_index], weights[w_index]));

            // Increment w_index
            w_index += rufactor;
            // Increment in_index
            in_index += rufactor;
            if (in_index >= nin) {
                in_index = ir;
            }
            // Increment out_index
            if (acc_step + 1 >= multscale) {
                acc_step = 0;
                out_index++;
            } else {
                acc_step++;
            }
        }
    }

This effectively puts back the intermediate cast that we removed. What is your recommendation?

@jmitrevs (Contributor, Author):

Does the casting help because the sizes otherwise become too big, or is the reason this meets timing more an artifact of the cast forcing the addition tree not to be attempted in one cycle?

@thesps (Contributor) commented Jan 21, 2022

I saw that the generated HDL is now using the accumulator in the DSP, where previously it was using LUTs, I think. Anyway, I think it's okay to add the cast back in, which should make the dense layer behave exactly as it used to. We should do this for all uses of product in dense layers for consistency.

@jmitrevs (Contributor, Author):

I added casts to break up the accumulations--basically wherever the product was preceded by a +=. I also added the auto_po2 version of the test that exposed the earlier problem. I believe the binary and ternary versions are already covered.

@thesps (Contributor) commented Feb 4, 2022

I think this is in good shape to merge now. But perhaps we should hold off until the backends PR is merged? This one should be easy to adapt; some files will just need to move to different places.

@jmitrevs (Contributor, Author) commented Feb 4, 2022

Sounds good. I'll try to update it in the next few days.

@jmitrevs (Contributor, Author):

The Quartus code also needs to be updated and tested. I will try to do that and add tests for it.

@jmitrevs (Contributor, Author):

The auto_po2 test introduced in this PR also fails on the master branch. Nevertheless, it needs to be fixed as part of this PR.

@jmitrevs changed the title from "Remove intermediate casting in BatchNormalization" to "Remove intermediate casting in product" on Feb 21, 2022
@jmitrevs (Contributor, Author):

I updated the Quartus product to also not have res_T as a template parameter. On the Quartus side, product is a standalone function, not a member function, so partial template specialization is not possible. The old code used enable_if to handle this, but with the simplified form without res_T, plain function overloading works properly, I think. That is the way I implemented it. test_keras_api.py tests this, though it's not extensive.
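
A minimal sketch of the overloading idea on the Quartus side (the signatures, names, and the ac_int encoding below are assumptions, not the exact hls4ml code): once res_T is gone, a plain overload on the weight type replaces the enable_if dispatch.

#include "ac_fixed.h"
#include "ac_int.h"

namespace nnet {

// Generic case: exact-precision multiply, return type deduced from a*w.
template<class data_T, class weight_T>
auto product_sketch(data_T a, weight_T w) -> decltype(a * w) {
    return a * w;
}

// Binary-weight overload (weight encoded as 0/1 meaning -1/+1), selected by
// ordinary overload resolution; no enable_if needed once res_T is dropped.
template<class data_T>
auto product_sketch(data_T a, ac_int<1, false> w) -> decltype(-a) {
    if (w == 1) return a;
    else        return -a;
}

} // namespace nnet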

The only remaining issue is the failing auto_po2 test, which, as I mentioned, also fails on the master branch without the code changes from this PR. I noticed that the old version, before the new backend was merged, produces an nnet::normalize (fc1_alpha) after the nnet::dense, while the new version omits the nnet::normalize.

@thesps (Contributor) commented Feb 22, 2022

> The only remaining issue is the auto_po2 test failing

I've addressed it in #499. I did it as a standalone PR since it's already an issue directly on master, but I included your test. Hopefully we merge that quickly and you can pull master into this branch again.

@jmitrevs (Contributor, Author) commented Mar 7, 2022

I think this passes all the pytests now.

@thesps merged commit 83024d3 into fastmachinelearning:master on Mar 23, 2022