Remove intermediate casting in product #490

Merged (11 commits) on Mar 23, 2022

Conversation

@jmitrevs (Contributor) commented Jan 8, 2022

This removes the intermediate casting of scale*x to the res_T type in a BatchNormalization, so the cast to res_T is applied only to the final scale*x + bias. I did this by introducing Product_nocast and mult_nocast, which multiply and return the result at full precision. That seemed easier and less error-prone than manually determining the resulting type in Python, and I was hesitant to always make mult and Product return the full precision, though that would be straightforward.

I did confirm that the simple test model is synthesizable with Vivado_HLS 2019.2.

This patch also fixes some issues I saw when working with low-bit values in QONNX.
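
For illustration, a minimal sketch of the no-cast multiply described above (the class name and exact form here are illustrative assumptions, not the PR code): the return type is whatever a*w produces, so no precision is lost inside the product and any narrowing cast is left to the caller.

namespace nnet {

// Sketch only: a product that returns the exact type of a*w instead of
// casting to a res_T template parameter.
template<class x_T, class w_T>
class mult_nocast_sketch {
    public:
    static auto product(x_T a, w_T w) -> decltype(a * w) {
        #pragma HLS INLINE
        return a * w; // full-precision result; the caller decides on any cast
    }
};

} // namespace nnet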

@thesps (Contributor) commented Jan 10, 2022

I understand the motivation to return the product at full precision, but I think the solution needs more work.

Unfortunately they're not captured by the current tests, but there are examples where this solution produces C++ that doesn't compile. You can see one by changing kernel_quantizer=quantized_bits(bits,0,alpha=1) to kernel_quantizer=quantized_bits(bits,0,alpha='auto_po2') in this test. That's just a shortcut to get a model with ExponentType for the scale. More generally, we have the exponent and XNOR products, which are not simply a * b, and they need to be supported as well.
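
For context, a rough sketch of what an exponent-type weight carries (the layout and name below are assumptions, mirroring only the w.sign and w.weight accesses that appear later in this thread): the stored value represents sign * 2^weight, so the product becomes a shift rather than a multiply.

#include "ap_int.h"

// Assumed layout of a power-of-two (exponent) weight: a sign flag plus a
// signed shift amount, so scale*x can be implemented as (+/-) x << weight.
template<int W>
struct exponent_weight_sketch { // hypothetical name
    ap_uint<1> sign;            // 1 -> positive, 0 -> negative
    ap_int<W>  weight;          // the exponent; negative values shift right
};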

@jmitrevs (Contributor, Author):

Thanks. I was able to reproduce a failure. I will try to come up with a solution.

@jmitrevs (Contributor, Author):

I added _nocast versions of the other products. The weight_exponential_nocast is the most difficult, and I am not sure I am choosing the right return type, though that is easy to change.

Is this the direction we want to go? One could argue for moving more of this logic into the Python side rather than the templates, but other than for the exponential, this is pretty straightforward, so I am still leaning towards keeping it in the templates.

Nevertheless, this fixes the failure mentioned above by @thesps.

@thesps (Contributor) commented Jan 12, 2022

Cool! I'm now wondering whether, instead of keeping both a "with cast" and a "no cast" version of all the products, we can keep only the "no cast" versions (i.e. drop the res_T template parameter from all the product functions). Then anywhere that needs a cast on the return will do it outside the product - basically I'm thinking of the Dense layer here, which casts to accum_t. But I think it should "just work" if the product functions always return the exact result.

And then the return-type logic for both the XNOR and Exponent products would need to be derived anyway. But perhaps the decltype construct works for both of those too?
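
For reference, the XNOR-style product already has a natural lossless result of a single bit, so dropping res_T is trivial there. A sketch under assumed naming and encoding (the actual hls4ml class may differ):

#include "ap_int.h"

// Sketch: when both data and weight are binary (+1/-1 encoded as 1/0), the
// exact product is the one-bit XNOR of the two encodings.
template<class x_T, class w_T>
class both_binary_sketch { // hypothetical name
    public:
    static ap_uint<1> product(x_T a, w_T w) {
        #pragma HLS INLINE
        return a == w; // 1 when the signs match (+1), 0 otherwise (-1)
    }
};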

Edit: pushed to local branch for CI

@jmitrevs (Contributor, Author):

As for whether we should drop res_T from all the product functions or keep both versions, I don't have a preference. I didn't initially drop res_T everywhere in order to make this a less disruptive change, but I can see the argument for moving everything to a non-casting version. I can update the PR that way if preferred.

@thesps (Contributor) commented Jan 17, 2022

Yeah, I think these functions are so fundamental that it could be very confusing to have different behaviour in different places. So let's go for consistency and simplicity and make them all behave the same way (i.e. return the full available precision). I've been meaning to check how to construct the lossless return type for the special operators but haven't had a chance yet.

@jmitrevs (Contributor, Author):

I went ahead and made all the nnet::product terms not cast the output. It would be useful to look at all the product terms to make sure I am returning the appropriate precision.

@thesps (Contributor) commented Jan 18, 2022

I think this is the right direction. How about this for the weight_exponential:

template<class x_T, class w_T>
class weight_exponential : public Product{
    public:
    // Construct the return type from the multiplication equivalent to the largest shifts
    // ap_int<pow2(decltype(w_T::weight)::width-1)-1> is the type of the multiplicand equivalent to the largest left shift (<<)
    // ap_fixed<pow2(decltype(w_T::weight)::width-1)-1,0> is the type of the multiplicand equivalent to the largest right shift (>>)
    using r_T = decltype(x_T(0) * (ap_int<pow2(decltype(w_T::weight)::width-1)-1>(1)+ap_fixed<pow2(decltype(w_T::weight)::width-1)-1,0>(1)));
    static r_T product(x_T a, w_T w){
        // Shift product for exponential weights
        #pragma HLS INLINE
        // shift by the exponent. Negative weights shift right
        r_T y = static_cast<r_T>(a) << w.weight;
        // negate or not depending on weight sign
        return w.sign == 1 ? y : static_cast<r_T>(-y);
    }
};

?
Then I think you only need this one, and not the different specialised versions for the different possible ap_ types of x_T. I tried to explain the logic in the included comment, but I can elaborate. I also changed rt -> r_T for consistency with the other type names. In principle it should also work if x_T were float or something else.

I'm thinking that, for completeness, some of those binary/ternary products may need a wider type, since for ints/fixed-point -x can be one bit wider than x. We can achieve that with:

static x_T product(x_T a, w_T w){

->

static auto product(x_T a, w_T w) -> decltype(-a) // (or -w)
{

I think we need that for all the products that can return -x (see the sketch after this list):

  • weight_ternary
  • data_binary
  • weight_binary
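
Applied to one of the items above, a minimal sketch of the widened trailing return type for weight_binary (the 0/1 encoding of -1/+1 below is an assumption, and the class name is illustrative):

// Sketch: -a can be one bit wider than a for integer/fixed-point types, so
// the lossless return type is decltype(-a) rather than x_T.
template<class x_T, class w_T>
class weight_binary_sketch { // hypothetical name
    public:
    static auto product(x_T a, w_T w) -> decltype(-a) {
        #pragma HLS INLINE
        if (w == 0) return -a; // weight -1
        else        return a;  // weight +1, widened on return
    }
};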

@jmitrevs (Contributor, Author):

I was struggling a bit with the exponential weight. I think I agree with all your suggestions. I will implement them and try the tests.

@jmitrevs (Contributor, Author):

I updated the code. I changed some ?: expressions to if/else because the former requires casting, which I think made the code look more complicated.

I did not check the weight_exponential very carefully, so it might be worth doing that, or adding specific test cases for all these changes. I did test with the modified test_qkeras. In fact, it would probably be a good idea to add a unit test for these various cases. I can look into that later today or tomorrow.
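
To illustrate the ?: point with a hypothetical helper (not hls4ml code): the conditional operator needs both branches converted to a common type, whereas if/else lets each return statement convert on its own.

// Hypothetical helper: with ap_ types, y and -y generally have different
// widths, so the ?: form would need explicit casts to a common type.
template<class r_T, class y_T>
r_T negate_unless(bool positive, y_T y) {
    // return positive ? static_cast<r_T>(y) : static_cast<r_T>(-y); // casts needed
    if (positive) {
        return y;  // converted to r_T by the return statement
    } else {
        return -y;
    }
}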

@vloncar (Contributor) commented Jan 18, 2022

This PR causes problems for the resource strategy for some reason; the clock misses timing by a lot. Compare this to this.

@jmitrevs (Contributor, Author):

That's very strange. It's from hls4ml-keras-test.sh, right? I'll take a look.

@jmitrevs (Contributor, Author):

So the error comes from:

WARNING: [SCHED 204-21] Estimated clock period (65.742ns) exceeds the target (target clock period: 5ns, clock uncertainty: 0.625ns, effective delay budget: 4.375ns).
WARNING: [SCHED 204-21] The critical path in module 'dense_resource_ap_fixed_ap_fixed_16_6_5_3_0_config4_s' consists of the following:
	'mul' operation of DSP[311] ('mul_ln1192', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [309]  (0.494 ns)
	'add' operation of DSP[311] ('add_ln1192', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [311]  (2.04 ns)
	'add' operation of DSP[319] ('add_ln1192_591', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [319]  (2.04 ns)
	'add' operation of DSP[327] ('add_ln1192_592', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [327]  (2.04 ns)
	'add' operation of DSP[335] ('add_ln1192_593', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [335]  (2.04 ns)
	'add' operation of DSP[343] ('add_ln1192_594', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [343]  (2.04 ns)
	'add' operation of DSP[351] ('add_ln1192_595', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [351]  (2.04 ns)
	'add' operation of DSP[359] ('add_ln1192_596', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [359]  (2.04 ns)
	'add' operation of DSP[367] ('add_ln1192_597', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [367]  (2.04 ns)
	'add' operation of DSP[375] ('add_ln1192_598', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [375]  (2.04 ns)
	'add' operation of DSP[383] ('add_ln1192_599', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [383]  (2.04 ns)
	'add' operation of DSP[391] ('add_ln1192_600', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [391]  (2.04 ns)
	'add' operation of DSP[399] ('add_ln1192_601', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [399]  (2.04 ns)
	'add' operation of DSP[407] ('add_ln1192_602', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [407]  (2.04 ns)
	'add' operation of DSP[415] ('add_ln1192_603', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [415]  (2.04 ns)
	'add' operation of DSP[423] ('add_ln1192_604', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [423]  (2.04 ns)
	'add' operation of DSP[431] ('add_ln1192_605', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [431]  (2.04 ns)
	'add' operation of DSP[439] ('add_ln1192_606', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [439]  (2.04 ns)
	'add' operation of DSP[447] ('add_ln1192_607', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [447]  (2.04 ns)
	'add' operation of DSP[455] ('add_ln1192_608', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [455]  (2.04 ns)
	'add' operation of DSP[463] ('add_ln1192_609', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [463]  (2.04 ns)
	'add' operation of DSP[471] ('add_ln1192_610', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [471]  (2.04 ns)
	'add' operation of DSP[479] ('add_ln1192_611', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [479]  (2.04 ns)
	'add' operation of DSP[487] ('add_ln1192_612', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [487]  (2.04 ns)
	'add' operation of DSP[495] ('add_ln1192_613', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [495]  (2.04 ns)
	'add' operation of DSP[503] ('add_ln1192_614', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [503]  (2.04 ns)
	'add' operation of DSP[511] ('add_ln1192_615', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [511]  (2.04 ns)
	'add' operation of DSP[519] ('add_ln1192_616', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [519]  (2.04 ns)
	'add' operation of DSP[527] ('add_ln1192_617', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [527]  (2.04 ns)
	'add' operation of DSP[535] ('add_ln1192_618', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [535]  (2.04 ns)
	'add' operation of DSP[543] ('add_ln1192_619', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [543]  (2.04 ns)
	'add' operation of DSP[551] ('add_ln1192_620', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [551]  (2.04 ns)
	'add' operation of DSP[560] ('add_ln1192_621', firmware/nnet_utils/nnet_dense_resource.h:76->firmware/nnet_utils/nnet_dense_resource.h:274) [560]  (2.04 ns)
	'insertvalue' operation ('mrv', firmware/nnet_utils/nnet_dense_resource.h:280) [6520]  (0 ns)

What fixes the timing is doing an explicit cast:

    ReuseLoop:
    for (int ir = 0; ir < rufactor; ir++) {
        #pragma HLS PIPELINE II=1 rewind

        int w_index = ir;
        int in_index = ir;
        int out_index = 0;
        int acc_step = 0;

        MultLoop:
        for (int im = 0; im < block_factor; im++) {
            #pragma HLS UNROLL

            acc[out_index] += static_cast<typename CONFIG_T::accum_t>(CONFIG_T::template product<data_T, typename CONFIG_T::weight_t>::product(data[in_index], weights[w_index]));

            // Increment w_index
            w_index += rufactor;
            // Increment in_index
            in_index += rufactor;
            if (in_index >= nin) {
                in_index = ir;
            }
            // Increment out_index
            if (acc_step + 1 >= multscale) {
                acc_step = 0;
                out_index++;
            } else {
                acc_step++;
            }
        }
    }

This effectively puts back the intermediate cast that we removed. What is your recommendation?

@jmitrevs (Contributor, Author):

Does the casting help because the sizes otherwise become too big, or is the reason this meets timing more an artifact of the cast forcing the addition tree not to be attempted in one cycle?

@thesps (Contributor) commented Jan 21, 2022

I saw that the generated HDL is now using the accumulator in the DSP, where previously it was using LUTs, I think. Anyway, I think it's okay to add the cast back in, which should make the dense layer behave exactly as it used to. We should do this for all uses of product in dense layers for consistency.

@jmitrevs (Contributor, Author):

I added casts to break up the accumulations--basically wherever the product was preceded by a +=. I also added the auto_po2 version of the test that exposed the earlier problem. I believe the binary and ternary versions are already covered.

@thesps (Contributor) commented Feb 4, 2022

I think this is in good shape to merge now. But perhaps we should hold off until the backends PR is merged? This one should be easy to adapt; some files will just need to move to different places.

@jmitrevs (Contributor, Author) commented Feb 4, 2022

Sounds good. I'll try to update it in the next few days.

@jmitrevs (Contributor, Author):

The Quartus code also needs to be updated and tested. I will try to do that and add tests for it.

@jmitrevs (Contributor, Author):

The auto_po2 test introduced in this PR also fails on the master branch. Nevertheless, it needs to be fixed as part of this PR.

@jmitrevs changed the title from "Remove intermediate casting in BatchNormalization" to "Remove intermediate casting in product" on Feb 21, 2022
@jmitrevs (Contributor, Author):

I updated the Quartus product to also not have res_T as a template parameter. On the Quartus side, product is a standalone function, not a member function, so partial template specialization is not possible. The old code used enable_if to handle this, but with the simplified form without res_T, plain function overloading works properly, I think. That is the way I implemented it. test_keras_api.py tests this, though it's not extensive.
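
A minimal sketch of the overloading idea on the Quartus side (the signatures, names, and the ac_int encoding below are assumptions, not the exact hls4ml code): once res_T is gone, a plain overload on the weight type replaces the enable_if dispatch.

#include "ac_fixed.h"
#include "ac_int.h"

namespace nnet {

// Generic case: exact-precision multiply, return type deduced from a*w.
template<class data_T, class weight_T>
auto product_sketch(data_T a, weight_T w) -> decltype(a * w) {
    return a * w;
}

// Binary-weight overload (weight encoded as 0/1 meaning -1/+1), selected by
// ordinary overload resolution; no enable_if needed once res_T is dropped.
template<class data_T>
auto product_sketch(data_T a, ac_int<1, false> w) -> decltype(-a) {
    if (w == 1) return a;
    else        return -a;
}

} // namespace nnet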

The only remaining issue is the failing auto_po2 test, which, as I mentioned, also fails on the master branch without the code changes from this PR. I noticed that the old version, before the new backend was merged, produces an nnet::normalize (fc1_alpha) after the nnet::dense, while the new version omits the nnet::normalize.

@thesps (Contributor) commented Feb 22, 2022

> The only remaining issue is the auto_po2 test failing

I've addressed it in #499. I did it as a standalone PR since it's already an issue directly on master, but I included your test. Hopefully we merge that quickly and you can pull master into this branch again.

@jmitrevs (Contributor, Author) commented Mar 7, 2022

I think this passes all the pytests now.

@thesps merged commit 83024d3 into fastmachinelearning:master on Mar 23, 2022