# Quantized models in different frameworks

Quantized models keep a floating-point scale and an integer zero-point, and some frameworks additionally provide Quantize/Dequantize operators with floating-point inputs or outputs. This section summarizes the data types each framework uses; the SNPS-Caffe implementation follows the floating-point precision of the framework it targets.

| | TFLite | ONNX | Caffe2 |
| --- | --- | --- | --- |
| scale | double | float | float |
| fp template | float | float | float |
| round | half away from zero | toward even | toward even |
| `std::` | `round` | `rint` | `nearbyint` |

(SNPS Caffe follows the data types of whichever framework it matches.)

`fp` generally denotes the data type of:

- the input tensor for Quantize,
- the output tensor for Dequantize, and
- the intermediate tensor, if an operator uses floating-point registers for computation or for handling the output scale.
  - e.g. ONNXruntime generally handles the input_scale-to-output_scale transformation with `MlasRequantizeOutput(int Input, int Output, float scale)`, which uses an intermediate floating-point representation (`float`); see the sketch below.
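
To make this concrete, here is a minimal hedged sketch of such a float-intermediate requantization. The function name, the int8 output range, and the single-scale signature are illustrative assumptions, not the actual MLAS API:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical sketch: requantize an int32 accumulator through a float
// intermediate, rounding to nearest-even under the default FP mode.
int8_t RequantizeViaFloat(int32_t acc, float scale, int32_t output_zero_point) {
    float scaled = static_cast<float>(acc) * scale;            // float intermediate
    int32_t q = static_cast<int32_t>(std::nearbyintf(scaled))  // round toward even
              + output_zero_point;
    return static_cast<int8_t>(std::clamp(q, -128, 127));      // saturate to int8
}
```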

## Quick Look-Up for Implementations in SNPS Caffe

We support implementations from different frameworks, selected through the layer parameter `quantize_method`, for cases where their results fail to be bit-exact with each other. You can also refer to FEATURES.md for other quantization-related parameters.

| operator \ quantize_method | TFLite | ONNX | Caffe2 |
| --- | --- | --- | --- |
| AveragePool | t | o | c |
| BiasAdd | | o | |
| Concat | ~ | | |
| Convolution | t | o | c |
| Deconvolution | | | c |
| EltwiseSum | t | c | c |
| InnerProduct | t | t | |
| LeakyReLU | t | | |
| Power* | t | o | c |
| ReLU | ~ | ~ | ~ |
| ResizeBilinear | ~ | | |
| Sigmoid | ~ | | |
| Softmax | ~ | | |

We denote the TFLite/ONNXruntime/Caffe2 implementations by t/o/c. A ~ entry indicates that the Caffe implementation computes in floating-point representation, for example:

```cpp
// A Dequantize-Op-Quantize procedure, taking ReLU as an example.
float_in = Dequantize(int_in, input_scale, input_zero_point);
float_out = ReLU(float_in);
int_out = Quantize(float_out, output_scale, output_zero_point);
```
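
For reference, a minimal sketch of the affine (de)quantization helpers used above, assuming 8-bit tensors; the names and the int8 range are illustrative:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Affine dequantization: real_value = scale * (q - zero_point).
float Dequantize(int8_t q, float scale, int32_t zero_point) {
    return scale * static_cast<float>(static_cast<int32_t>(q) - zero_point);
}

// Affine quantization: q = round(x / scale) + zero_point, saturated to int8.
int8_t Quantize(float x, float scale, int32_t zero_point) {
    int32_t q = static_cast<int32_t>(std::nearbyintf(x / scale)) + zero_point;
    return static_cast<int8_t>(std::clamp(q, -128, 127));
}
```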

### Notes

1. Our model zoo doesn't cover all quantized operators across the frameworks. An entry is left empty if the (framework, operator) combination has not been seen yet.
   - A quantized bias_layer occurs only in ONNX (which does not support FC+Bias fusion yet).
2. Only Quantize and Dequantize operators are mapped to Power_layer.
3. Some quantized operators produce bit-exact results across frameworks; for such entries we adopt the implementation from another framework.
4. MaxPool and ArgMax are seen, but they behave identically for quantized and floating-point numbers.
5. Convolution comprises a number of variations; see the following section.

## Quantized Convolutions

`output_multiplier = input_scale * weight_scale / output_scale`
Recall that TFLite uses double, while ONNXruntime and Caffe2 use float for scales.

### TFLite

The quantized multiplier is calculated as follows, where `shift` is a power-of-two exponent that normalizes `output_multiplier` into [0.5, 1):

```cpp
output_multiplier = <double>input_scale * <double>weight_scale / <double>output_scale;
quantized_multiplier = std::round(std::frexp(output_multiplier, &shift) * (1ll << 31));
// Or, for channel-wise quantization:
// output_multiplier[ch] = <double>input_scale * <double>weight_scale[ch] / <double>output_scale;
// quantized_multiplier[ch] = std::round(std::frexp(output_multiplier[ch], &shift[ch]) * (1ll << 31));
```
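
A small worked example, with assumed scale values, shows how `std::frexp` produces the mantissa/shift pair:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
    // Assumed example scales; output_multiplier = 0.5 * 0.25 / 0.125 = 1.0.
    double input_scale = 0.5, weight_scale = 0.25, output_scale = 0.125;
    double output_multiplier = input_scale * weight_scale / output_scale;
    int shift = 0;
    double mantissa = std::frexp(output_multiplier, &shift);  // 0.5, shift = 1
    int32_t quantized_multiplier =
        static_cast<int32_t>(std::round(mantissa * (1ll << 31)));  // = 1 << 30
    std::printf("quantized_multiplier = %d, shift = %d\n",
                quantized_multiplier, shift);
}
```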

For convolutions, TFLite transforms to DepthwiseConv when group == in_ch == out_ch. Different implementations are then derived in SNPS-Caffe to match TFLite:

| Scales \ group | 1 | Depthwise | Pointwise* |
| --- | --- | --- | --- |
| PerTensor | D2 | F2 | F2* |
| PerChannel | D1 | D2 | D1* |

Two kinds of rounding are used to approximate the affine transformation (from input_scale to output_scale, using the quantized multiplier):

1. The first splits it into two steps, denoted 2-steps-rounding.
2. The second implements rounding half toward positive infinity, denoted 1-step-rounding.

#### D2 (Double Precision + 2-Steps-Rounding)

```cpp
scaled_acc = SaturatingRoundingDoublingHighMul(<int>acc, <int>quantized_multiplier);
out_acc = RoundingDivideByPOT(scaled_acc, shift);
// Approximate result: out_acc ≈ acc * quantized_multiplier / (1 << 31) / (1 << shift),
// with rounding applied at each of the two steps.
```
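
For reference, here is a hedged sketch of these two rounding primitives, following gemmlowp's published semantics (the actual SNPS-Caffe code may differ):

```cpp
#include <cstdint>
#include <limits>

// Rounded high half of the doubled 64-bit product, saturating the one
// overflowing case (INT32_MIN * INT32_MIN).
int32_t SaturatingRoundingDoublingHighMul(int32_t a, int32_t b) {
    if (a == std::numeric_limits<int32_t>::min() &&
        b == std::numeric_limits<int32_t>::min()) {
        return std::numeric_limits<int32_t>::max();
    }
    int64_t ab = static_cast<int64_t>(a) * static_cast<int64_t>(b);
    int64_t nudge = ab >= 0 ? (1ll << 30) : (1 - (1ll << 30));
    return static_cast<int32_t>((ab + nudge) / (1ll << 31));
}

// Divide by 2^exponent, rounding halves away from zero.
int32_t RoundingDivideByPOT(int32_t x, int exponent) {
    int32_t mask = (1 << exponent) - 1;
    int32_t remainder = x & mask;
    int32_t threshold = (mask >> 1) + (x < 0 ? 1 : 0);
    return (x >> exponent) + (remainder > threshold ? 1 : 0);
}
```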

#### F2 (Single Precision + 2-Steps-Rounding)

Use `<float>` to calculate `output_multiplier`, then apply the 2-steps-rounding described in D2.

#### D1 (Double Precision + 1-Step-Rounding)

Calculate the `output_multiplier` per channel. It also uses a simpler rounding to compute the approximate result:

```cpp
scaled_acc = <int>acc * <int>quantized_multiplier;
out_acc = (scaled_acc + (1ll << (31 + shift - 1))) >> (31 + shift);
// That is, it rounds (only once), half toward positive infinity.
```
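
Packaged as a self-contained function, under the assumption of a 64-bit intermediate for the multiply, the D1 path looks like:

```cpp
#include <cstdint>

// Sketch of the D1 one-step requantization: a single
// round-half-toward-positive-infinity after a 64-bit multiply.
int32_t RequantizeD1(int32_t acc, int32_t quantized_multiplier, int shift) {
    int64_t scaled_acc = static_cast<int64_t>(acc) * quantized_multiplier;
    int64_t half = 1ll << (31 + shift - 1);
    return static_cast<int32_t>((scaled_acc + half) >> (31 + shift));
}
```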

#### Pointwise Convolution*

When matching for bit-exactness, the combination of PerTensor-F2 and PerChannel-D1 was found by brute force.

### ONNXruntime

It casts `<int>acc` to `<float>`, multiplies by `<float>output_multiplier`, then requantizes the result.

### Caffe2

It uses single-precision scales; the computation is the same as F2 described above.