Yolo V4 object detection with OpenCV and Python


We’ll write a few lines of Python code that use OpenCV’s neural network module to implement a Yolo V4 object detector. The tool outputs the locations and names of objects from 80 different classes found in input images.

Detailed description

Setting up the environment

As explained in Python tool API, the Builder will execute all Python modules whose name ends with *toolplugin.py in $HOME/VisionAppster/plugins/tool/. If these modules register tools, they will appear in Builder’s tool box. In this example, the Python module will be yolo_toolplugin.py and the name of the tool in the plugin is Yolo.
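The naming rule can be checked before starting the Builder. The sketch below lists the files in a directory whose names end with toolplugin.py. The helper name find_tool_plugins is hypothetical, and plain glob matching on the file name is an assumption about how the pattern is applied, not the Builder's actual loader.

```python
import fnmatch
import os


def find_tool_plugins(plugin_dir):
    """Return sorted file names in plugin_dir that match *toolplugin.py."""
    # Assumption: the pattern is a plain glob applied to the file name only.
    return sorted(
        name for name in os.listdir(plugin_dir)
        if fnmatch.fnmatchcase(name, '*toolplugin.py')
        and os.path.isfile(os.path.join(plugin_dir, name))
    )
```

Calling find_tool_plugins(os.path.expanduser('~/VisionAppster/plugins/tool')) should include yolo_toolplugin.py once the file has been saved there.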

The Yolo tool needs three files that contain class names (coco.names), Yolo V4 configuration (yolov4.cfg) and the weights of the neural network (yolov4.weights). To set up the environment, download the files and place them in $HOME/VisionAppster/plugins/tool/yolov4 as shown below.

Downloaded files
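A missing or misnamed file is an easy mistake at this stage. The following is a minimal sketch of a pre-flight check; the helper name missing_model_files is made up for this example, and only the three file names come from the text above.

```python
import os

# File names as listed in the text above.
REQUIRED_FILES = ('coco.names', 'yolov4.cfg', 'yolov4.weights')


def missing_model_files(data_dir):
    """Return the required file names that are not present in data_dir."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(data_dir, name))]
```

An empty list from missing_model_files(os.path.expanduser('~/VisionAppster/plugins/tool/yolov4')) means all three files are in place.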


Use the va-pkg command-line tool to install the required NumPy and OpenCV packages:

va-pkg install python:numpy python:opencv-python

Python code walk-through

Let us walk through the Python code.

import visionappster as va
from visionappster.coordinates import pixel_to_world
import numpy as np
import cv2
import os

Import the necessary modules. visionappster is needed in every Python tool, numpy almost as often. cv2 is the name of the OpenCV module. os is needed for path manipulation.

class Yolo:
    # Shared by all instances of this class
    _dnn_model = None
    _class_names = []

    def _load_yolo_model():
        # Initialize the DNN model
        data_dir = os.path.dirname(__file__) + '/yolov4/'
        Yolo._dnn_model = cv2.dnn_DetectionModel(data_dir + 'yolov4.cfg', data_dir + 'yolov4.weights')
        Yolo._dnn_model.setInputSize(704, 704)
        Yolo._dnn_model.setInputScale(1.0 / 255)
        with open(data_dir + 'coco.names', 'rt') as f:
            Yolo._class_names = f.read().rstrip('\n').split('\n')

    def __init__(self):
        # Load the model only once; later instances reuse it.
        if Yolo._dnn_model is None:
            Yolo._load_yolo_model()

The size of the neural network (NN) model is hundreds of megabytes. Fortunately, it isn’t modified at run time and can thus be shared between all instances of the tool class. This piece of code reads the NN configuration, weights and class names from files on the disk. Only the first instance of the Yolo class creates the model.
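The sharing pattern can be tried out without OpenCV. The sketch below uses a hypothetical class, SharedModelTool, with a dictionary standing in for the NN model and a counter that records how many times the load ran. It is not thread-safe and is only meant to show that a class attribute is loaded once and then reused.

```python
class SharedModelTool:
    # Class attributes are shared by every instance.
    _model = None
    _load_count = 0

    @staticmethod
    def _load_model():
        # Stand-in for reading the configuration and weights from disk.
        SharedModelTool._load_count += 1
        SharedModelTool._model = {'name': 'placeholder model'}

    def __init__(self):
        # Only the first instance triggers the expensive load.
        if SharedModelTool._model is None:
            SharedModelTool._load_model()


first = SharedModelTool()
second = SharedModelTool()
print(SharedModelTool._load_count)  # prints 1
```

Both instances see the same object through the class, so the memory cost is paid once no matter how many tools are placed on the workspace.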

def process(self,
            inputs: [('image', va.Image, va.Image()),
                     ('confidenceThreshold', float, 0.1, {'min': 0, 'max': 1.0}),
                     ('suppressionThreshold', float, 0.4, {'min': 0, 'max': 1.0})],
            outputs: [('className', va.Array),
                      ('classId', va.Matrix.Int32),
                      ('confidence', va.Matrix.Double),
                      ('frame', va.Matrix.Double, {'typeName': 'Matrix<double>/frame',
                                                   'blockSize': 4}),
                      ('size', va.Matrix.Double, {'typeName': 'Matrix<double>/size',
                                                  'linkedTo': 'frame/region',
                                                  'blockSize': 1}),
                      ('numberOfObjects', int)]):

This is the declaration of the tool’s external interface. There are three inputs to the tool:

  • image is the input image of type va.Image.

  • confidenceThreshold determines the minimum confidence of an accepted detection. The bigger the value, the fewer false positives are expected. The value range is 0…1.

  • suppressionThreshold is the non-maximum suppression threshold. A candidate box is discarded when it overlaps a higher-confidence box by more than this value. The smaller the value, the more eagerly overlapping boxes are suppressed, and the more likely it is that two close-by objects are reported as one. Conversely, the bigger the value, the more likely one object will be detected as two or more separate objects. The value range is 0…1.
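The effect of the two thresholds can be tried out on toy data. The following is a plain-Python sketch of greedy non-maximum suppression over (left, top, width, height) boxes. The function names iou and greedy_nms and the sample boxes are made up for illustration, and this is not OpenCV's implementation.

```python
def iou(a, b):
    """Intersection over union of two (left, top, width, height) boxes."""
    ax0, ay0, aw, ah = a
    bx0, by0, bw, bh = b
    iw = max(0, min(ax0 + aw, bx0 + bw) - max(ax0, bx0))
    ih = max(0, min(ay0 + ah, by0 + bh) - max(ay0, by0))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, conf_threshold, nms_threshold):
    """Return the indices of the kept boxes, best score first."""
    # Discard low-confidence candidates, then visit the rest by score.
    order = sorted((i for i, s in enumerate(scores) if s >= conf_threshold),
                   key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too much.
        if all(iou(boxes[i], boxes[k]) <= nms_threshold for k in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (2, 0, 10, 10), (50, 50, 10, 10)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores, 0.1, 0.4))  # prints [0, 2]
print(greedy_nms(boxes, scores, 0.1, 0.8))  # prints [0, 1, 2]
```

The first two boxes overlap with an intersection over union of about 0.67. With a suppression threshold of 0.4 the weaker of the two is dropped; with 0.8 both survive as separate detections.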

The six outputs are:

  • className is an array containing the names of the detected objects.

  • classId is an N-by-1 matrix of class identifiers (i.e. integers in the range 0…79).

  • confidence is an N-by-1 matrix storing the likelihood of correct detection for each detected object.

  • frame is a 4N-by-4 matrix containing a coordinate frame for the location of each detection. The blockSize meta-data gives the Builder a hint that there may be many frames that should all be displayed individually.

  • size is an N-by-2 matrix storing the size of each detection in the world coordinate system. The linkedTo meta-data tells the Builder that the size should be interpreted relative to the frame output parameter and that it is best displayed as a region.

  • numberOfObjects is the number of objects that were detected.
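The stacked layout of the frame output is easier to see with concrete numbers. The sketch below uses nested Python lists instead of va.Matrix.Double, and the helper names stacked_frames and frame_block are hypothetical. It assumes axis-aligned frames, so each 4-by-4 block is an identity matrix with the origin of the detection in its last column.

```python
def stacked_frames(origins):
    """Build a 4N-by-4 nested list: one axis-aligned frame per (x, y) origin."""
    rows = []
    for x, y in origins:
        rows.extend([
            [1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0],
        ])
    return rows


def frame_block(frames, j):
    """Return the 4-by-4 frame of detection j from the stacked matrix."""
    return frames[4 * j:4 * j + 4]


frames = stacked_frames([(10.0, 20.0), (30.0, 40.0)])
print(len(frames))  # prints 8
second = frame_block(frames, 1)
print(second[0][3], second[1][3])  # prints 30.0 40.0
```

Detection j therefore occupies rows 4j to 4j + 3, which is the block that a blockSize of 4 describes.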

if inputs.image.is_empty():
    raise ValueError('Input image must be non-empty.')

# The neural network requires 704x704x3 input.
img = inputs.image.to_rgb().scaled(704, 704)

Reject an empty image, which cannot be scaled, then convert the input to RGB and scale it to the size required by the NN model. The input to the Yolo V4 network is a tensor whose shape is (704, 704, 3). That is a fancy way to describe a 704-by-704 RGB color image.

# Do the detection using the given confidence + non-maximum
# suppression thresholds
classes, confidences, boxes = Yolo._dnn_model.detect(
    np.array(img, copy=False),
    inputs.confidenceThreshold,
    inputs.suppressionThreshold)

The detect function takes a NumPy array as an input. Luckily, converting a va.Image to such an array is easy. With copy=False, the array will be a shallow copy that points to the data buffer of img.

class_names = []
num_objects = len(confidences)
if num_objects > 0:
    confidences = confidences.flatten()
    classes = classes.flatten()
    for class_id in classes:
        class_names.append(Yolo._class_names[class_id])

outputs.className = class_names
outputs.numberOfObjects = num_objects

Map the class indices given by the NN model to class names using the shared Yolo._class_names list. Two of the outputs can now be set.
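The lookup itself is plain list indexing. The sketch below parses a three-line stand-in for coco.names the same way the tool parses the real file and maps a few class ids to names; the helper name names_for is hypothetical.

```python
def names_for(class_ids, class_names):
    """Map integer class ids to names; each id is an index into class_names."""
    return [class_names[int(class_id)] for class_id in class_ids]


# Three-line stand-in for coco.names, which holds one name per line.
text = 'person\nbicycle\ncar\n'
class_names = text.rstrip('\n').split('\n')
print(names_for([2, 0, 2], class_names))  # prints ['car', 'person', 'car']
```

Stripping the trailing newline before splitting avoids an empty name at the end of the list.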

# Create output matrices
outputs.confidence = va.Matrix.Double(num_objects, 1)
outputs.classId = va.Matrix.Int32(num_objects, 1)
outputs.frame = va.Matrix.Double(4 * num_objects, 4)
outputs.size = va.Matrix.Double(num_objects, 2)

# Copy results to output matrices.
for j in range(num_objects):
    # Coordinate frames are aligned to image axes.
    # Initialize to identity matrices.
    for i in range(4):
        outputs.frame[4 * j + i, i] = 1.0
    outputs.confidence[j, 0] = confidences[j]
    outputs.classId[j, 0] = classes[j]
    left, top, width, height = boxes[j]
    # Move corner points to world coordinates
    x0, y0 = pixel_to_world(img, left, top)
    x1, y1 = pixel_to_world(img, left + width, top + height)
    outputs.frame[4 * j, 3] = x0
    outputs.frame[4 * j + 1, 3] = y0
    outputs.size[j, 0] = x1 - x0
    outputs.size[j, 1] = y1 - y0

This piece of code first creates empty matrices to hold the confidence, class ID and location of each detection result. The loop then copies the data from the output tensors of the NN model to the matrices.

An important thing to note here is the mapping from image coordinates to world coordinates. This lets subsequent analysis tools and the user interface relate the output coordinates to the input image (even though the input to the NN was scaled) and eventually to the real world if the image originated from a camera.
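To see why a mapping is needed at all, consider only the scaling step. The sketch below is not the implementation of pixel_to_world: it assumes the world coordinate system is simply the pixel grid of the original image, and the function name scaled_to_original is made up. The real function uses the coordinate frame attached to the image, which may also encode a camera calibration.

```python
def scaled_to_original(left, top, width, height, orig_size, net_size=(704, 704)):
    """Map a box from the scaled network input back to original pixels."""
    # Scale factors between the original image and the network input.
    sx = orig_size[0] / net_size[0]
    sy = orig_size[1] / net_size[1]
    return (left * sx, top * sy, width * sx, height * sy)


# A 1408x704 source image was squeezed to 704x704 before detection.
print(scaled_to_original(100, 200, 50, 80, orig_size=(1408, 704)))
# prints (200.0, 200.0, 100.0, 80.0)
```

Without such a correction, a box found in the squeezed image would land in the wrong place when drawn on top of the original.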

va.Tool.publish('com.visionappster.opencvpython/1', Yolo)

This publishes the tool to the VisionAppster runtime. The unique component ID of tools in this Python plugin is com.visionappster.opencvpython/1, where “1” denotes a major version number.

Using the tool in the Builder

Once you save the Python file in $HOME/VisionAppster/plugins/tool/yolo_toolplugin.py and start the Builder (or click the “Refresh user plugins” button if the Builder is already running), you’ll see the new tool in a group labeled OPENCVPYTHON in the Builder’s tool box. (Make sure to save the NN data files as well.)

Drag and drop the Yolo tool from the tool box on the workspace. Drag and drop the image input of the tool on the workspace to open an image selector and pick an input image. (Double-click to select.) Then start the app using the play (▶) button on the status bar.

From the window menu of the image display, select Display with → Image display. Then drag and drop the frame output of the Yolo tool on the image display to see the bounding boxes of the detected objects. Drag and drop the className output on the workspace as well. Here is what you may get:

Yolo seems to know what a dog is.


To see why the coordinates were converted using pixel_to_world, you can create the processing graph shown below. It uses the Rescale tool from another Python cookbook entry to upscale the input image by a factor of two. The Yolo tool outputs the detection locations in the world coordinate system, and the Rescale tool outputs an image with a world coordinate system attached. You can therefore drag and drop the frame and size output parameters of the Yolo tool on top of the rescaled image and see the regions in the expected locations.

World coordinates make measurements easier.
