-QONNX (Quantized ONNX) introduces several custom operators -- [`IntQuant`](docs/qonnx-custom-ops/intquant_op.md), [`FloatQuant`](docs/qonnx-custom-ops/floatquant_op.md), [`BipolarQuant`](docs/qonnx-custom-ops/bipolar_quant_op.md), and [`Trunc`](docs/qonnx-custom-ops/trunc_op.md) -- in order to represent arbitrary-precision integer and minifloat quantization in ONNX. This enables:
+QONNX (Quantized ONNX) introduces several [custom operators](docs/qonnx-custom-ops/overview.md) -- `IntQuant`, `FloatQuant`, `BipolarQuant`, and `Trunc` -- in order to represent arbitrary-precision integer and minifloat quantization in ONNX. This enables:
 * Representation of binary, ternary, 3-bit, 4-bit, 6-bit or any other integer/fixed-point quantization.
 * Representation of minifloat quantization with configurable exponent and mantissa bits.
 * Quantization is an operator itself, and can be applied to any parameter or layer input.
@@ -29,9 +29,7 @@ This repository contains a set of Python utilities to work with QONNX models, in

 ### Operator definitions

-* [Quant](docs/qonnx-custom-ops/quant_op.md) for 2-to-arbitrary-bit quantization, with scaling and zero-point
-* [BipolarQuant](docs/qonnx-custom-ops/bipolar_quant_op.md) for 1-bit (bipolar) quantization, with scaling and zero-point
-* [Trunc](docs/qonnx-custom-ops/trunc_op.md) for truncating to a specified number of bits, with scaling and zero-point
+Please see the [custom operator overview](docs/qonnx-custom-ops/overview.md) table for more details.
docs/qonnx-custom-ops/intquant_v1.md (2 additions, 2 deletions)
@@ -9,11 +9,11 @@ rounding_mode defines how quantized values are rounded.

 Notes:
 * This operator was previously named `Quant` but is renamed to `IntQuant` to distinguish it from `FloatQuant`. For a transition period, qonnx will transparently handle `Quant` as `IntQuant` for backwards compatibility reasons, but only `IntQuant` should be used for new models.
-* This operator does not work for binary or bipolar quantization, for this purpose the simpler BipolarQuant node exists.
+* This operator does not work for binary or bipolar quantization, for this purpose the simpler `BipolarQuant` node exists.

 #### Version

-This operator is not part of the ONNX standard and is not currently versioned.
+The description of this operator in this document corresponds to `qonnx.custom_ops.general` opset version 1.
+Truncates the values of one input data (Tensor<T>) at a specified bitwidth and produces one output data (Tensor<T>).
+Additionally, takes four float tensors as input, which define the scale, zero-point, input bit-width and output bit-width of the quantization.
+The attribute rounding_mode defines how truncated values are rounded.
+
+#### Version
+
+This operator is not part of the ONNX standard.
+The description of this operator in this document corresponds to `qonnx.custom_ops.general` opset version 2.
+
+#### Attributes
+
+<dl>
+<dt><tt>rounding_mode</tt> : string (default is "FLOOR")</dt>
+<dd>Defines how rounding should be applied during truncation. Currently available modes are: "ROUND", "CEIL" and "FLOOR". Here "ROUND" implies a round-to-even operation. Lowercase variants for the rounding mode string are also supported: "round", "ceil", "floor".</dd>
+<dt><tt>signed</tt> : int (default is 1)</dt>
+<dd>Defines if the quantization includes a signed bit. E.g. at 8b unsigned=[0, 255] vs signed=[-128, 127].</dd>
+<dt><tt>narrow</tt> : int (default is 0)</dt>
+<dd>Defines if the value range should be interpreted as narrow, when signed=1. E.g. at 8b regular=[-128, 127] vs narrow=[-127, 127].</dd>
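Reviewer note: the `rounding_mode`, `signed` and `narrow` attribute descriptions in the last hunk can be illustrated with a small standalone sketch. This is not the qonnx implementation. The helper names `int_range` and `apply_rounding` are hypothetical and exist only for illustration, and the sketch assumes, following the wording above, that `narrow` only affects the range when `signed=1`.

```python
import math


def int_range(bitwidth: int, signed: int = 1, narrow: int = 0) -> tuple:
    """Return (min, max) of the integer range described by the attributes.

    Hypothetical helper: mirrors the examples in the attribute text,
    e.g. 8b unsigned=[0, 255], signed=[-128, 127], narrow=[-127, 127].
    """
    if signed:
        lowest = -(2 ** (bitwidth - 1))
        if narrow:
            lowest += 1  # narrow drops the most negative value
        return lowest, 2 ** (bitwidth - 1) - 1
    # Assumption: narrow is ignored for unsigned, as the text only
    # describes it for signed=1.
    return 0, 2 ** bitwidth - 1


def apply_rounding(value: float, rounding_mode: str = "FLOOR") -> int:
    """Round a scalar according to the rounding_mode string.

    Lowercase mode strings are accepted, as stated in the attribute text.
    """
    mode = rounding_mode.upper()
    if mode == "ROUND":
        return round(value)  # Python's built-in round() rounds halves to even
    if mode == "CEIL":
        return math.ceil(value)
    if mode == "FLOOR":
        return math.floor(value)
    raise ValueError(f"unsupported rounding_mode: {rounding_mode!r}")


print(int_range(8, signed=0))            # (0, 255)
print(int_range(8, signed=1))            # (-128, 127)
print(int_range(8, signed=1, narrow=1))  # (-127, 127)
print([apply_rounding(v, "round") for v in (0.5, 1.5, 2.5)])  # [0, 2, 2]
```

The round-to-even case is the one most likely to surprise: 0.5 and 2.5 round to the nearest even integer (0 and 2) rather than up.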