Hi! We hit a segmentation fault when trying to run model inference on GPU. Below is a minimal example from the tutorial (link):
import torch
import time

# define a floating point model where some layers could be statically quantized
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        # manually specify where tensors will be converted from floating
        # point to quantized in the quantized model
        x = self.quant(x)
        x = self.conv(x)
        x = self.relu(x)
        # manually specify where tensors will be converted from quantized
        # to floating point in the quantized model
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()
# model must be set to eval mode for static quantization logic to work
model_fp32.eval()
input_fp32 = torch.randn(4, 1, 1024, 1024)
time_s = time.time()
with torch.no_grad():
    out = model_fp32(input_fp32)
time_e = time.time()
model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'relu']])
model_fp32_prepared = torch.ao.quantization.prepare(model_fp32_fused)
model_fp32_prepared(input_fp32)
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)
# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)
model_int8 = model_int8.to('cuda:0')
input_fp32 = input_fp32.to('cuda:0')
with torch.no_grad():
    out = model_int8(input_fp32)
Output:
Segmentation fault (core dumped)
Inference on CPU is fine for the int8 model. Could someone please advise on the potential reason? Thank you!
Hi! It looks like you're using the eager mode quantization flow from PyTorch core with the fbgemm backend, which currently only runs on x86 server CPUs.
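To illustrate the point above, here is a minimal sketch of keeping the converted int8 model and its inputs on CPU rather than moving them to `cuda:0`. The helpers `pick_cpu_backend` and `run_int8_on_cpu` are hypothetical names introduced for this example, not PyTorch APIs, and the fallback to `qnnpack` on non-x86 machines is an assumption rather than something stated in this thread.

```python
import torch
import torch.ao.quantization as tq


class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))


def pick_cpu_backend():
    # Hypothetical helper: prefer fbgemm (x86 CPUs); assumed fallback is qnnpack.
    engines = torch.backends.quantized.supported_engines
    return 'fbgemm' if 'fbgemm' in engines else 'qnnpack'


def run_int8_on_cpu(model, inputs):
    # Hypothetical helper: move the inputs to CPU before the call.
    # The converted model is never moved off the CPU.
    with torch.no_grad():
        return model(inputs.to('cpu'))


backend = pick_cpu_backend()
torch.backends.quantized.engine = backend

model_fp32 = M().eval()
model_fp32.qconfig = tq.get_default_qconfig(backend)
fused = tq.fuse_modules(model_fp32, [['conv', 'relu']])
prepared = tq.prepare(fused)

x = torch.randn(4, 1, 32, 32)
prepared(x)  # calibration pass so the observers record activation ranges
model_int8 = tq.convert(prepared)

out = run_int8_on_cpu(model_int8, x)
print(out.shape, out.dtype, out.device)
```

The only change relative to the failing script is that neither the converted model nor the input is sent to `cuda:0`; the input is smaller here just to keep the run short.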
@supriyar thank you for the explanation! So regardless of the quantization approach, is it correct that for GPU-based inference, PyTorch currently only supports quantization of linear layers?