Deploy your DL model

2024-04-29

What is deployment

In real-life production and applications, DL-based models often need to be deployed on a variety of cross-platform devices. Compared with the GPUs used during training, these devices usually lack GPU memory and computational capability. Therefore, the trained model has to be deployed, i.e. adapted so that it fits those other devices and platforms.

Toolkits

To perform deployment, you will need a few specific tools. First, install ONNX, a commonly used intermediate representation for DL model deployment. Much like compilation, the model is first converted into an intermediate representation (ONNX) and then compiled down to a low-level inference engine. Many toolkits have been developed for this purpose, including TensorRT (NVIDIA), TVM (Apache), and TorchScript (PyTorch).
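
Once a model has been exported, the onnx Python package can be used to confirm that the file is a valid graph before handing it to a backend. A minimal sanity check (the file name is a placeholder):

import onnx

onnx_model = onnx.load('model.onnx')
onnx.checker.check_model(onnx_model)  # raises an exception if the graph is malformed
print(onnx.helper.printable_graph(onnx_model.graph))  # human-readable view of the IR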

General pipeline

In this blog, we will focus on deployment with TensorRT. The most common deployment pipeline is as follows:

  1. Train your model.
  2. Convert the trained model (‘.pth’) into an ONNX file.
  3. Optimize the ONNX model file.
  4. Build an inference engine based on the optimized ONNX file.

Deployment

Initialization

import torch
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # Not used explicitly, yet necessary: it creates the CUDA context

Save ONNX model

def saveONNX(model, filePath, inputSize):
    model = model.cuda()
    C, H, W, D = inputSize
    # Trace the model with a dummy input of the expected shape (batch size 1)
    dummy_input = torch.randn(1, C, H, W, D, device='cuda')
    torch.onnx.export(model, dummy_input, filePath, verbose=True)
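
For example, with a toy 3-D convolution standing in for the trained network (the model, file name, and input size below are placeholders):

# Hypothetical usage: replace with your own trained model and input size
model = torch.nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3)
saveONNX(model, 'model.onnx', inputSize=(1, 64, 64, 64))  # (C, H, W, D)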

Build the engine

def build_engine(onnx_file_path):
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
        if builder.platform_has_fast_fp16:
            print('This card supports fp16')
        if builder.platform_has_fast_int8:
            print('This card supports int8')

        builder.max_workspace_size = 1 << 30  # 1 GiB of builder workspace
        # Parse the ONNX file into the TensorRT network definition
        with open(onnx_file_path, 'rb') as f:
            parser.parse(f.read())
        # Build the engine from the parsed network (not from the file handle)
        return builder.build_cuda_engine(network)

def build_engine_int8(onnx_file_path, calib):
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
        builder.max_workspace_size = 1 << 30  # 1 GiB of builder workspace
        # Enable INT8 mode and attach a calibrator that supplies representative data
        builder.int8_mode = True
        builder.int8_calibrator = calib
        with open(onnx_file_path, 'rb') as f:
            parser.parse(f.read())
        return builder.build_cuda_engine(network)
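
The `calib` argument is an INT8 calibrator that feeds TensorRT representative data so it can estimate activation ranges. A minimal sketch based on trt.IInt8EntropyCalibrator2 (the class name, data layout, and cache-file handling here are assumptions, not part of the original code):

import os
import numpy as np

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, data, cache_file='calib.cache', batch_size=1):
        # TensorRT requires the parent constructor to be called explicitly
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.data = data.astype(np.float32)   # assumed shape: (N, C, H, W, D)
        self.batch_size = batch_size
        self.cache_file = cache_file
        self.index = 0
        self.device_input = cuda.mem_alloc(self.data[0:batch_size].nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.index + self.batch_size > len(self.data):
            return None  # no more calibration batches
        batch = np.ascontiguousarray(self.data[self.index:self.index + self.batch_size])
        cuda.memcpy_htod(self.device_input, batch)
        self.index += self.batch_size
        return [int(self.device_input)]

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, 'rb') as f:
                return f.read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, 'wb') as f:
            f.write(cache)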

Save and load

def save_engine(engine, engine_dest_path):
    buf = engine.serialize()
    with open(engine_dest_path, 'wb') as f:
        f.write(buf)

def load_engine(engine_path):
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())
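
Building an engine is slow, so a common pattern is to build and serialize it once, then just deserialize it in later runs. For example (file names are placeholders):

engine = build_engine('model.onnx')
save_engine(engine, 'model.trt')
# ... later, or in another process:
engine = load_engine('model.trt')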

Allocate buffers

class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    
    for binding in engine:
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Page-locked host buffer (for fast async copies) plus a matching device buffer
        host_mem = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        bindings.append(int(device_mem))
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream
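
One host/device pair is created per engine binding, in binding order, so it is worth printing the bindings once to confirm which index is the input. A quick check (the output depends entirely on your model):

inputs, outputs, bindings, stream = allocate_buffers(engine)
for idx in range(engine.num_bindings):
    kind = 'input' if engine.binding_is_input(idx) else 'output'
    print(idx, engine.get_binding_name(idx), engine.get_binding_shape(idx), kind)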

Inference

def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Copy inputs host -> device, run the engine asynchronously, copy outputs device -> host
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    stream.synchronize()  # wait for the stream to finish before reading the results
    return [out.host for out in outputs]
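
Putting the pieces together, a minimal inference call looks roughly like this (the file name and input shape are placeholders and must match your own engine's input binding):

import numpy as np

engine = load_engine('model.trt')
context = engine.create_execution_context()
inputs, outputs, bindings, stream = allocate_buffers(engine)

# Fill the page-locked input buffer with a flattened input tensor
input_data = np.random.randn(1, 1, 64, 64, 64).astype(np.float32)
np.copyto(inputs[0].host, input_data.ravel())

results = do_inference(context, bindings, inputs, outputs, stream)
output = results[0]  # flat array; reshape to your model's output shape as needed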