This repo provides sample code to deploy YOLOV5 models with DeepStream, or as a stand-alone TensorRT sample, on NVIDIA devices.
In this section, we will walk through the steps to run a YOLOV5 model with DeepStream using CPU NMS.
You can start from the nvcr.io/nvidia/pytorch:22.03-py3 container for the export step.
git clone https://github.com/ultralytics/yolov5.git
# clone the yolov5_gpu_optimization repo and copy the patch into the yolov5 folder
git clone https://github.com/NVIDIA-AI-IOT/yolov5_gpu_optimization.git
cp yolov5_gpu_optimization/0001-Enable-onnx-export-with-decode-plugin.patch yolov5_gpu_optimization/requirement_export.txt yolov5/
cd yolov5
git checkout a80dd66efe0bc7fe3772f259260d5b7278aab42f
git am 0001-Enable-onnx-export-with-decode-plugin.patch
pip install -r requirement_export.txt
apt update && apt install -y libgl1-mesa-glx
python export.py --weights yolov5s.pt --include onnx --simplify --dynamic
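After the export finishes, you can sanity-check the resulting ONNX file before moving on. The snippet below is a minimal sketch using the onnx Python package; it assumes the yolov5s.onnx produced by the command above and only inspects the graph, so the exact node and tensor names it prints depend on the patch.

```python
import onnx

# Load and structurally validate the exported model (yolov5s.onnx from export.py above).
model = onnx.load("yolov5s.onnx")
onnx.checker.check_model(model)

# With --dynamic, the batch dimension of the inputs/outputs should be symbolic, not a fixed number.
for tensor in list(model.graph.input) + list(model.graph.output):
    dims = [d.dim_param or d.dim_value for d in tensor.type.tensor_type.shape.dim]
    print(tensor.name, dims)

# Custom ops (such as the decode node added by the patch) must later be resolved by a TensorRT plugin.
print(sorted({node.op_type for node in model.graph.node if node.domain not in ("", "ai.onnx")}))
```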
You can start from the nvcr.io/nvidia/deepstream:6.1.1-devel container for inference.
Then go to the deepstream sample directory.
cd deepstream-sample
Compile the plugin and deepstream parser:
- On x86:
nvcc -Xcompiler -fPIC -shared -o yolov5_decode.so ./yoloForward_nc.cu ./yoloPlugins.cpp ./nvdsparsebbox_Yolo.cpp -isystem /usr/include/x86_64-linux-gnu/ -L /usr/lib/x86_64-linux-gnu/ -I /opt/nvidia/deepstream/deepstream/sources/includes -lnvinfer
- On Jetson device:
nvcc -Xcompiler -fPIC -shared -o yolov5_decode.so ./yoloForward_nc.cu ./yoloPlugins.cpp ./nvdsparsebbox_Yolo.cpp -isystem /usr/include/aarch64-linux-gnu/ -L /usr/lib/aarch64-linux-gnu/ -I /opt/nvidia/deepstream/deepstream/sources/includes -lnvinfer
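Before wiring the library into the DeepStream configs, you can optionally confirm that the compiled yolov5_decode.so loads and registers its TensorRT plugin. This is a minimal sketch that assumes the Python tensorrt bindings are available in the container; the exact creator name printed depends on what yoloPlugins.cpp registers.

```python
import ctypes
import tensorrt as trt

# Loading the shared library triggers the static plugin registration inside yoloPlugins.cpp.
ctypes.CDLL("./yolov5_decode.so")

logger = trt.Logger(trt.Logger.INFO)
trt.init_libnvinfer_plugins(logger, "")

# List every creator known to the plugin registry; the YOLO decode plugin should show up here.
for creator in trt.get_plugin_registry().plugin_creator_list:
    print(creator.name, creator.plugin_version)
```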
Place the exported ONNX model into deepstream-sample:
cp yolov5/yolov5s.onnx yolov5_gpu_optimization/deepstream-sample/
Then you can run the model with the pre-defined configs.
- Run inference and save the inferred video:
deepstream-app -c config/deepstream_app_config_save_video.txt
- Run inference without display:
deepstream-app -c config/deepstream_app_config.txt
- Run inference with 8 streams, batch_size=8, and without display:
deepstream-app -c config/deepstream_app_config_8s.txt
The performance test was conducted on a T4 with the nvcr.io/nvidia/deepstream:6.1.1-devel container.
| Model | Input Size | Device | Precision | FPS (1 stream, bs=1) | FPS (4 streams, bs=4) | FPS (8 streams, bs=8) |
|---|---|---|---|---|---|---|
| yolov5n | 3x640x640 | T4 | FP16 | 640 | 980 | 988 |
| yolov5m | 3x640x640 | T4 | FP16 | 220 | 270 | 277 |
In this section, we will walk through the steps to run a YOLOV5 model with GPU NMS using the stand-alone inference script.
You can start from the nvcr.io/nvidia/pytorch:22.03-py3 container for the export step.
git clone https://github.com/ultralytics/yolov5.git
# clone the yolov5_gpu_optimization repo and copy files into the yolov5 folder
git clone https://github.com/NVIDIA-AI-IOT/yolov5_gpu_optimization.git
cp -r yolov5_gpu_optimization/0001-Enable-onnx-export-with-batchNMS-plugin.patch yolov5_gpu_optimization/requirement_export.txt yolov5/
cd yolov5
git checkout a80dd66efe0bc7fe3772f259260d5b7278aab42f
git am 0001-Enable-onnx-export-with-batchNMS-plugin.patch
pip install -r requirement_export.txt
apt update && apt install -y libgl1-mesa-glx
python export.py --weights yolov5s.pt --include onnx --simplify --dynamic
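As with the decode-plugin export, you can quickly inspect the exported graph to confirm that the NMS node was appended and that the graph outputs changed accordingly. A minimal sketch with the onnx package; the concrete output names come from the BatchedNMS patch, so treat the printed names as informative rather than fixed.

```python
import onnx

model = onnx.load("yolov5s.onnx")

# After the patch, the graph should end in a BatchedNMS-style plugin node rather than the raw head.
print("last node:", model.graph.node[-1].op_type)

# The graph outputs now correspond to NMS results (typically counts, boxes, scores, classes).
for out in model.graph.output:
    dims = [d.dim_param or d.dim_value for d in out.type.tensor_type.shape.dim]
    print(out.name, dims)
```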
For the following steps, you can start from the nvcr.io/nvidia/tensorrt:22.05-py3 container and prepare the environment with:
cd tensorrt-sample
pip install -r requirement_infer.txt
apt update && apt install -y libgl1-mesa-glx
Build the plugin library by following the previous steps.
python yolov5_trt_inference.py --input_images_folder=</path/to/coco/images/val2017/> --output_images_folder=./coco_output --onnx=</path/to/yolov5s.onnx>
The image will be resized to 3xINPUT_SIZExINPUT_SIZE while keeping its aspect ratio.
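For reference, this kind of aspect-ratio-preserving resize can be reproduced with a few lines of OpenCV and NumPy. The sketch below is a simplified illustration of the preprocessing described above, not the exact code in yolov5_trt_inference.py; the padding value of 114 and the top-left placement are assumptions borrowed from the common YOLOV5 letterbox convention.

```python
import cv2
import numpy as np

def letterbox(image: np.ndarray, input_size: int = 640, pad_value: int = 114):
    """Resize to input_size x input_size while keeping the aspect ratio, padding the remainder."""
    h, w = image.shape[:2]
    scale = min(input_size / h, input_size / w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    resized = cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_LINEAR)

    canvas = np.full((input_size, input_size, 3), pad_value, dtype=np.uint8)
    canvas[:new_h, :new_w] = resized  # the real script may center the image instead

    # HWC/BGR uint8 -> CHW/RGB float32 in [0, 1], matching the 3xINPUT_SIZExINPUT_SIZE layout above
    blob = canvas[:, :, ::-1].transpose(2, 0, 1).astype(np.float32) / 255.0
    return blob, scale
```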
python yolov5_trt_inference.py --input_images_folder=</path/to/coco/images/val2017/> --output_images_folder=<path/to/coco_output_dir> --onnx=</path/to/yolov5s.onnx> --coco_anno=</path/to/coco/annotations/instances_val2017.json>
This is not true rectangular inference as in PyTorch. It is equivalent to setting pad=0, rect=False, imgsz=input_size + stride in ultralytics YOLOV5.
# Default FP16 precision
python yolov5_trt_inference.py --input_images_folder=</path/to/coco/images/val2017/> --output_images_folder=<path/to/coco_output_dir> --onnx=</path/to/yolov5s.onnx> --coco_anno=</path/to/coco/annotations/instances_val2017.json> --rect
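If you want to recompute the mAP numbers outside the script, the COCO evaluation itself boils down to a few pycocotools calls. A minimal sketch, assuming the detections have already been written to a COCO-format results JSON (predictions.json is a hypothetical file name, not an output of the script):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground truth from the annotation file passed via --coco_anno.
coco_gt = COCO("/path/to/coco/annotations/instances_val2017.json")

# Detections in COCO results format: [{"image_id", "category_id", "bbox", "score"}, ...]
coco_dt = coco_gt.loadRes("predictions.json")

coco_eval = COCOeval(coco_gt, coco_dt, "bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # prints AP@[0.5:0.95], AP@0.5, and the other standard COCO metrics
```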
To run INT8 inference or evaluation, you need TensorRT 8.4 or above. You can start from the nvcr.io/nvidia/tensorrt:22.07-py3 container.
The following command runs evaluation in INT8 precision (the calibration cache will be saved to the path specified by --calib_cache):
# INT8 precision
python yolov5_trt_inference.py --input_images_folder=</path/to/coco/images/val2017/> --output_images_folder=<path/to/coco_output_dir> --onnx=</path/to/yolov5s.onnx> --coco_anno=</path/to/coco/annotations/instances_val2017.json> --rect --data_type=int8 --save_engine=./yolov5s_int8_maxbs16.engine --calib_img_dir=</path/to/coco/images/val2017/> --calib_cache=yolov5s_bs16_n10.cache --n_batches=10 --batch_size=16
Note: the calibration algorithm for YOLOV5 is IInt8MinMaxCalibrator instead of IInt8EntropyCalibrator2, so if you want to use the saved calibration cache with trtexec, you have to change the first line of the cache from MinMaxCalibration to EntropyCalibration2.
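For reference, the min-max calibration flow maps onto the standard TensorRT Python API roughly as below. This is a hedged sketch of an IInt8MinMaxCalibrator subclass (the class mentioned in the note), not a copy of the script; the batch preparation and memory handling are simplified.

```python
import os
import numpy as np
import pycuda.autoinit  # noqa: F401  # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

class MinMaxCalibrator(trt.IInt8MinMaxCalibrator):
    """Feeds preprocessed calibration batches to TensorRT and caches the resulting scales."""

    def __init__(self, batches, cache_file):
        super().__init__()
        self.cache_file = cache_file
        self.batch_size = batches[0].shape[0]        # batches: list of float32 arrays, shape (N, 3, H, W)
        self.device_input = cuda.mem_alloc(batches[0].nbytes)
        self._iterator = iter(batches)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = next(self._iterator)
        except StopIteration:
            return None                              # no more data: calibration is finished
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```

The calibrator is attached to the builder via config.int8_calibrator when the engine is built, and the cache it writes is the file whose first line the note above says must be edited before reuse with trtexec.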
Here is the performance and mAP summary, tested on a V100 16GB with TensorRT 8.2.5 in rectangular inference mode.
| Model | Input Size | Precision | FPS (bs=32) | FPS (bs=1) | mAP@0.5 |
|---|---|---|---|---|---|
| yolov5n | 640 | FP16 | 1295 | 448 | 45.9% |
| yolov5s | 640 | FP16 | 917 | 378 | 57.1% |
| yolov5m | 640 | FP16 | 614 | 282 | 64% |
| yolov5l | 640 | FP16 | 416 | 202 | 67.3% |
| yolov5x | 640 | FP16 | 231 | 135 | 68.5% |
| yolov5n6 | 1280 | FP16 | 341 | 160 | 54.2% |
| yolov5s6 | 1280 | FP16 | 261 | 139 | 63.2% |
| yolov5m6 | 1280 | FP16 | 155 | 99 | 68.8% |
| yolov5l6 | 1280 | FP16 | 106 | 68 | 70.7% |
| yolov5x6 | 1280 | FP16 | 60 | 45 | 71.9% |
Users can also enable n-bit NMS by changing scoreBits in export.py.
# Default is 16-bit
nms_attrs["scoreBits"] = 16
# Can be changed to a smaller value to speed up the NMS operation:
# e.g. nms_attrs["scoreBits"] = 8
Performance gain:
| Number of classes | Device | Number of anchors | Score bits | Batch size | NMS execution time (ms) |
|---|---|---|---|---|---|
| 80 | A30 | 25200 | 16 | 32 | 12.1 |
| 80 | A30 | 25200 | 8 | 32 | 10.0 |
| 4 | Jetson NX | 10560 | 16 | 4 | 1.38 |
| 4 | Jetson NX | 10560 | 8 | 4 | 1.08 |
Note: a smaller score bits value may slightly decrease the final mAP.
Users can integrate the YOLOV5 model with the BatchedNMS plugin into DeepStream by following deepstream_tao_apps.
We conducted experiments with different activations in pursuit of a better trade-off between mAP and performance on TensorRT.
You can change the activation of the YOLOV5 model in yolov5/models/common.py:
class Conv(nn.Module):
    # Standard convolution
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        # self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())
        self.act = nn.ReLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        return self.act(self.conv(x))
YOLOV5s experiment results so far:
| Activation type | mAP@0.5 | V100 --best FPS (bs=32) | A10 --best FPS (bs=32) |
|---|---|---|---|
| swish (baseline) | 56.7% | 1047 | 965 |
| ReLU | 54.8% (scratch), 55.7% (swish pretrained) | 1177 | 1065 |
| GELU | 56.6% | 1004 | 916 |
| Leaky ReLU | 55.0% | 1172 | 892 |
| PReLU | 54.8% | 1123 | 932 |
- INT8 gives 0% mAP with TensorRT 8.2.5: install TensorRT 8.4 or above to avoid this issue.
- TensorRT warning at the end of the stand-alone TensorRT inference script: the warning does not block inference or evaluation and can be ignored.