-
When I export a model (let's say an mmdetection one) that was trained with mixed precision, does the exported ONNX model run in FP16 as well? The reason I'm asking is that I'm writing a report comparing the inference speed of ONNX (CUDA) and TensorRT (FP16), so I need to be explicit about where the performance gains come from. In other words, I'd like to know whether the impressive gains of the TensorRT engine in FP16 mode come from FP16 plus TensorRT-specific optimizations, or whether they are just TensorRT optimizations and the ONNX model is also running in FP16 mode.
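For reference, a minimal timing sketch for the ONNX Runtime (CUDA) side might look like the following (the model path and input shape are placeholders, and it assumes onnxruntime-gpu is installed); it also prints the session's input dtype, which is exactly what I'm unsure about:

```python
import time

import numpy as np
import onnxruntime as ort

# Placeholder model path and input shape -- adjust to your exported model.
MODEL_PATH = "model.onnx"
INPUT_SHAPE = (1, 3, 800, 1344)

# Run on the CUDA execution provider so the comparison with TensorRT is GPU vs GPU.
sess = ort.InferenceSession(MODEL_PATH, providers=["CUDAExecutionProvider"])
input_name = sess.get_inputs()[0].name
input_type = sess.get_inputs()[0].type  # e.g. 'tensor(float)' vs 'tensor(float16)'
print("ONNX Runtime input dtype:", input_type)

dtype = np.float16 if "float16" in input_type else np.float32
data = np.random.rand(*INPUT_SHAPE).astype(dtype)

# Warm-up runs, then timed runs.
for _ in range(10):
    sess.run(None, {input_name: data})

runs = 100
start = time.perf_counter()
for _ in range(runs):
    sess.run(None, {input_name: data})
print(f"Mean latency: {(time.perf_counter() - start) / runs * 1000:.2f} ms")
```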
Replies: 1 comment
-
ONNX itself has a float16 dtype, and some of its operators, such as Conv, do support float16. So in theory you can create an ONNX model with mixed precision. Visualize your model in https://netron.app/ to see whether the export actually converted the weights to FP16 or not.
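Besides Netron, you can also check the stored weight dtypes programmatically with the onnx package, and convert a plain FP32 export to FP16 afterwards with onnxconverter-common. A minimal sketch, assuming both packages are installed and "model.onnx" is a placeholder path:

```python
from collections import Counter

import onnx
from onnx import TensorProto
from onnxconverter_common import float16

model = onnx.load("model.onnx")  # placeholder path

# Count the element types of the stored weights: if the export kept mixed
# precision, some initializers should show up here as FLOAT16.
counts = Counter(init.data_type for init in model.graph.initializer)
print("FLOAT:", counts.get(TensorProto.FLOAT, 0),
      "FLOAT16:", counts.get(TensorProto.FLOAT16, 0))

# Optional: convert a pure-FP32 export to FP16 after the fact.
fp16_model = float16.convert_float_to_float16(model)
onnx.save(fp16_model, "model_fp16.onnx")
```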
As for FP16, TensorRT can insert the necessary casts around the inputs/outputs of nodes on its own, without any help from ONNX. That means TensorRT can run an ONNX model in FP16 and accelerate it even if it was exported in FP32.
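For example, enabling FP16 when building the engine is just a builder flag; a minimal sketch assuming the TensorRT 8.x Python API and placeholder file names:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the (FP32) ONNX export; TensorRT will insert the needed casts itself.
with open("model.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels during tactic selection

engine_bytes = builder.build_serialized_network(network, config)
with open("model_fp16.engine", "wb") as f:
    f.write(engine_bytes)
```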
If you want further acceleration, TensorRT can also consume explicit INT8 quantization in the ONNX graph via QuantizeLinear and DequantizeLinear nodes.
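One way to get those QuantizeLinear/DequantizeLinear nodes into the graph is ONNX Runtime's static quantization in QDQ format. A sketch, assuming onnxruntime is installed; the input name, shape, and random calibration data are placeholders you would replace with real preprocessed samples:

```python
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static)


class RandomCalibrationReader(CalibrationDataReader):
    """Feeds a few calibration batches; replace the random data with real
    preprocessed images to get meaningful quantization ranges."""

    def __init__(self, input_name, shape, num_batches=8):
        self._batches = iter(
            {input_name: np.random.rand(*shape).astype(np.float32)}
            for _ in range(num_batches))

    def get_next(self):
        return next(self._batches, None)


quantize_static(
    model_input="model.onnx",           # placeholder FP32 export
    model_output="model_int8_qdq.onnx",
    calibration_data_reader=RandomCalibrationReader("input", (1, 3, 800, 1344)),
    quant_format=QuantFormat.QDQ,       # emits QuantizeLinear/DequantizeLinear nodes
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)
```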