-
When I export a model (let's say an mmdetection one) that was trained with mixed precision, does the exported ONNX model run in FP16 as well? The reason I'm asking is that I'm writing a report comparing the inference speed of ONNX (CUDA) and TensorRT (FP16), so I need to be explicit about where the performance gains come from. In other words, I'd like to know whether the impressive gains of the TensorRT engine in FP16 mode come from FP16 plus TensorRT-specific optimizations, or whether they are just TensorRT optimizations and the ONNX model is also running in FP16 mode.
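For reference, a minimal timing sketch for the ONNX Runtime (CUDA) side might look like the following (the model path and input shape are placeholders, and it assumes onnxruntime-gpu is installed); it also prints the session's input dtype, which is exactly what I'm unsure about:

```python
import time

import numpy as np
import onnxruntime as ort

# Placeholder model path and input shape -- adjust to your exported model.
MODEL_PATH = "model.onnx"
INPUT_SHAPE = (1, 3, 800, 1344)

# Run on the CUDA execution provider so the comparison with TensorRT is GPU vs GPU.
sess = ort.InferenceSession(MODEL_PATH, providers=["CUDAExecutionProvider"])
input_name = sess.get_inputs()[0].name
input_type = sess.get_inputs()[0].type  # e.g. 'tensor(float)' vs 'tensor(float16)'
print("ONNX Runtime input dtype:", input_type)

dtype = np.float16 if "float16" in input_type else np.float32
data = np.random.rand(*INPUT_SHAPE).astype(dtype)

# Warm-up runs, then timed runs.
for _ in range(10):
    sess.run(None, {input_name: data})

runs = 100
start = time.perf_counter()
for _ in range(runs):
    sess.run(None, {input_name: data})
print(f"Mean latency: {(time.perf_counter() - start) / runs * 1000:.2f} ms")
```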
Replies: 1 comment
-
ONNX itself has a float16 dtype, and some of its operators, such as Conv, do support float16. So in theory you can create an ONNX model with mixed precision. Visualize your model in https://netron.app/ to see whether the export actually converted the weights to FP16 or not.
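Besides Netron, you can also check the stored weight dtypes programmatically with the onnx package, and convert a plain FP32 export to FP16 afterwards with onnxconverter-common. A minimal sketch, assuming both packages are installed and "model.onnx" is a placeholder path:

```python
from collections import Counter

import onnx
from onnx import TensorProto
from onnxconverter_common import float16

model = onnx.load("model.onnx")  # placeholder path

# Count the element types of the stored weights: if the export kept mixed
# precision, some initializers should show up here as FLOAT16.
counts = Counter(init.data_type for init in model.graph.initializer)
print("FLOAT:", counts.get(TensorProto.FLOAT, 0),
      "FLOAT16:", counts.get(TensorProto.FLOAT16, 0))

# Optional: convert a pure-FP32 export to FP16 after the fact.
fp16_model = float16.convert_float_to_float16(model)
onnx.save(fp16_model, "model_fp16.onnx")
```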
As for FP16, TensorRT can insert the necessary casts around the inputs/outputs of nodes on its own, without any help from ONNX. That means TensorRT can run an ONNX model in FP16 and accelerate it even if it was exported in FP32.
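For example, enabling FP16 when building the engine is just a builder flag; a minimal sketch assuming the TensorRT 8.x Python API and placeholder file names:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the (FP32) ONNX export; TensorRT will insert the needed casts itself.
with open("model.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels during tactic selection

engine_bytes = builder.build_serialized_network(network, config)
with open("model_fp16.engine", "wb") as f:
    f.write(engine_bytes)
```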
If you want further acceleration, TensorRT can also consume explicit INT8 quantization in the ONNX graph via QuantizeLinear and DequantizeLinear nodes.
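One way to get those QuantizeLinear/DequantizeLinear nodes into the graph is ONNX Runtime's static quantization in QDQ format. A sketch, assuming onnxruntime is installed; the input name, shape, and random calibration data are placeholders you would replace with real preprocessed samples:

```python
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static)


class RandomCalibrationReader(CalibrationDataReader):
    """Feeds a few calibration batches; replace the random data with real
    preprocessed images to get meaningful quantization ranges."""

    def __init__(self, input_name, shape, num_batches=8):
        self._batches = iter(
            {input_name: np.random.rand(*shape).astype(np.float32)}
            for _ in range(num_batches))

    def get_next(self):
        return next(self._batches, None)


quantize_static(
    model_input="model.onnx",           # placeholder FP32 export
    model_output="model_int8_qdq.onnx",
    calibration_data_reader=RandomCalibrationReader("input", (1, 3, 800, 1344)),
    quant_format=QuantFormat.QDQ,       # emits QuantizeLinear/DequantizeLinear nodes
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)
```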