What is deployment?
In real-world applications, DL-based models often need to be deployed on a variety of cross-platform devices. Compared with the GPUs commonly used during training, these devices usually lack GPU memory and computational capability. Therefore, the trained model has to be adapted so that it fits these other devices and platforms.
Toolkits
To perform deployment, you will need a few dedicated tools. First, install ONNX, a commonly used intermediate format for DL model deployment. Much like compilation, the model is first converted into an intermediate representation (ONNX) and then compiled down to a low-level inference engine. Many toolkits have been developed for this purpose, including TensorRT by NVIDIA, TVM by Apache, and TorchScript by PyTorch.
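As a quick illustration of ONNX as an intermediate representation, the sketch below loads and validates an exported model with the onnx Python package (the file name model.onnx is just a placeholder):

import onnx

onnx_model = onnx.load('model.onnx')                   # parse the ONNX protobuf
onnx.checker.check_model(onnx_model)                   # validate graph structure and operator set
print(onnx.helper.printable_graph(onnx_model.graph))   # human-readable view of the graph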
General pipeline
In this blog, we will focus on deployment with TensorRT. The most common deployment pipeline is as follows:
- Train your model.
- Export the trained model ('.pth') to an ONNX file.
- Optimize the ONNX model file.
- Build an inference engine based on the optimized ONNX file.
Deployment
Initialization
import torch
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # Not used explicitly, yet necessary: it initializes the CUDA context
Save ONNX model
def saveONNX(model, filePath, inputSize):
    """Export a trained PyTorch model to an ONNX file."""
    model = model.cuda()
    C, H, W, D = inputSize
    # Tracing requires a dummy input of the expected shape (batch size 1, 3D volume input here)
    dummy_input = torch.randn(1, C, H, W, D, device='cuda')
    torch.onnx.export(model, dummy_input, filePath, verbose=True)
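For example, assuming a trained 3D network with input volumes of size 1×64×64×64 (the class name MyNet and the file names are placeholders), the export call would look like:

net = MyNet()                                   # hypothetical model class
net.load_state_dict(torch.load('net.pth'))      # load the trained weights
net.eval()
saveONNX(net, 'net.onnx', (1, 64, 64, 64))      # (C, H, W, D)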
Build the engine
def build_engine(onnx_file_path):
    """Parse an ONNX file and build a TensorRT engine (FP32 by default)."""
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
        if builder.platform_has_fast_fp16:
            print('This card supports FP16')
        if builder.platform_has_fast_int8:
            print('This card supports INT8')
        builder.max_workspace_size = 1 << 30  # 1 GiB of builder workspace
        with open(onnx_file_path, 'rb') as model:
            parser.parse(model.read())
        return builder.build_cuda_engine(network)
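If the FP16 check above succeeds, you can ask the builder for a half-precision engine. A minimal sketch, assuming the same legacy (pre-TensorRT 8) builder API used throughout this post:

def build_engine_fp16(onnx_file_path):
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
        builder.max_workspace_size = 1 << 30
        builder.fp16_mode = True  # allow FP16 kernels where the hardware supports them
        with open(onnx_file_path, 'rb') as model:
            parser.parse(model.read())
        return builder.build_cuda_engine(network)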
def build_engine_int8(onnx_file_path, calib):
    """Parse an ONNX file and build an INT8 engine using the given calibrator."""
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
        builder.max_workspace_size = 1 << 30
        builder.int8_mode = True
        builder.int8_calibrator = calib  # supplies representative data for quantization
        with open(onnx_file_path, 'rb') as model:
            parser.parse(model.read())
        return builder.build_cuda_engine(network)
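The calib argument is an INT8 calibrator that feeds representative input batches to TensorRT during engine building. A minimal sketch of one, assuming the calibration data is already loaded as a NumPy array (the names data and cache_file are placeholders):

import numpy as np

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, data, batch_size=1, cache_file='calib.cache'):
        super().__init__()
        self.data = data.astype(np.float32)
        self.batch_size = batch_size
        self.index = 0
        self.cache_file = cache_file
        # Device buffer that holds one calibration batch
        self.device_input = cuda.mem_alloc(self.data[0].nbytes * batch_size)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.index + self.batch_size > len(self.data):
            return None  # no more batches: calibration is done
        batch = np.ascontiguousarray(self.data[self.index:self.index + self.batch_size])
        cuda.memcpy_htod(self.device_input, batch)
        self.index += self.batch_size
        return [int(self.device_input)]

    def read_calibration_cache(self):
        return None  # or return a previously written cache to skip calibration

    def write_calibration_cache(self, cache):
        with open(self.cache_file, 'wb') as f:
            f.write(cache)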
Save and load
def save_engine(engine, engine_dest_path):
    # Serialize the engine so it can be reused without rebuilding
    buf = engine.serialize()
    with open(engine_dest_path, 'wb') as f:
        f.write(buf)

def load_engine(engine_path):
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())
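Building an engine is slow, so a typical pattern is to build once, serialize to disk, and only deserialize at inference time (file names here are placeholders):

engine = build_engine('net.onnx')
save_engine(engine, 'net.trt')
# ... later, in the inference process:
engine = load_engine('net.trt')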
Allocate buffer
class HostDeviceMem(object):
    """Pairs a page-locked host buffer with its corresponding device buffer."""
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Page-locked host memory enables fast asynchronous host<->device copies
        host_mem = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        bindings.append(int(device_mem))  # TensorRT expects device pointers as ints
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream
Inference
def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Copy input data from host to device asynchronously
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference on the stream
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Copy results back from device to host asynchronously
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    stream.synchronize()  # wait for all asynchronous work on the stream to finish
    return [out.host for out in outputs]
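Putting the pieces together, an end-to-end run might look like the sketch below (the model net and the file names are placeholders, and the input is assumed to be a NumPy array matching the ONNX input shape):

import numpy as np

saveONNX(net, 'net.onnx', (1, 64, 64, 64))      # export the trained model to ONNX
engine = build_engine('net.onnx')               # build the inference engine
save_engine(engine, 'net.trt')

engine = load_engine('net.trt')
inputs, outputs, bindings, stream = allocate_buffers(engine)
with engine.create_execution_context() as context:
    x = np.random.randn(1, 1, 64, 64, 64).astype(np.float32)
    np.copyto(inputs[0].host, x.ravel())        # flatten into the page-locked input buffer
    result = do_inference(context, bindings, inputs, outputs, stream)
    print(result[0])                            # raw network output as a flat array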