Yolo V4 object detection with OpenCV and Python
Summary
We’ll write a few lines of Python code that uses OpenCV’s neural network module to implement a Yolo V4 object detector. The tool gives the locations and names of up to 80 different kinds of objects in input images.
Detailed description
Setting up the environment
As explained in Python tool API, the Builder will execute all Python modules whose name ends with *toolplugin.py in $HOME/VisionAppster/plugins/tool/. If these modules register tools, they will appear in Builder’s tool box. In this example, the Python module will be yolo_toolplugin.py and the name of the tool in the plugin is Yolo.
The Yolo tool needs three files that contain class names (coco.names), Yolo V4 configuration (yolov4.cfg) and the weights of the neural network (yolov4.weights). To set up the environment, download the files and place them in $HOME/VisionAppster/plugins/tool/yolov4 as shown below.

Downloaded files
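The data files are not bundled with VisionAppster. At the time of writing, they are commonly hosted in the AlexeyAB/darknet GitHub repository; the sketch below downloads them with plain Python, assuming those URLs are still valid:

import os
import urllib.request

# Assumed download locations (AlexeyAB/darknet on GitHub); verify before use.
CFG_BASE = 'https://raw.githubusercontent.com/AlexeyAB/darknet/master/cfg/'
URLS = {
    'coco.names': CFG_BASE + 'coco.names',
    'yolov4.cfg': CFG_BASE + 'yolov4.cfg',
    'yolov4.weights': 'https://github.com/AlexeyAB/darknet/releases/download/'
                      'darknet_yolo_v3_optimal/yolov4.weights',
}

target_dir = os.path.expanduser('~/VisionAppster/plugins/tool/yolov4')
os.makedirs(target_dir, exist_ok=True)
for name, url in URLS.items():
    urllib.request.urlretrieve(url, os.path.join(target_dir, name))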
Use the va-pkg command-line tool to enable Python support and to install the required NumPy and OpenCV packages:
va-pkg install com.visionappster.extensions.python
va-pkg install python:numpy python:opencv-contrib-python
Python code walk-through
Let us walk through the Python code.
import visionappster as va
from visionappster.coordinates import pixel_to_world
import numpy as np
import cv2
import os
Import the necessary modules. visionappster is needed in every Python tool, numpy almost as often. cv2 is the name of the OpenCV module. os is needed for path manipulation.
class Yolo:
    # Shared by all instances of this class
    _dnn_model = None
    _class_names = []

    @staticmethod
    def _load_yolo_model():
        # Initialize the DNN model
        data_dir = os.path.dirname(__file__) + '/yolov4/'
        Yolo._dnn_model = cv2.dnn_DetectionModel(data_dir + 'yolov4.cfg',
                                                 data_dir + 'yolov4.weights')
        Yolo._dnn_model.setInputSize(704, 704)
        Yolo._dnn_model.setInputScale(1.0 / 255)
        Yolo._dnn_model.setInputSwapRB(True)
        with open(data_dir + 'coco.names', 'rt') as f:
            Yolo._class_names = f.read().rstrip('\n').split('\n')

    def __init__(self):
        if Yolo._dnn_model is None:
            Yolo._load_yolo_model()
The size of the neural network (NN) model is hundreds of megabytes. Fortunately, it isn’t modified at run time and can thus be shared between all instances of the tool class. This piece of code reads the NN configuration, weights and class names from files on the disk. Only the first instance of the Yolo class creates the model.
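The sharing is easy to verify: no matter how many tool instances are created, they all refer to the same model object.

a = Yolo()
b = Yolo()
# Both instances see the same class-level attribute; the model was
# loaded exactly once, by whichever instance was constructed first.
assert a._dnn_model is b._dnn_model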
    def process(self,
                inputs: [('image', va.Image, va.Image()),
                         ('confidenceThreshold', float, 0.1, {'min': 0, 'max': 1.0}),
                         ('suppressionThreshold', float, 0.4, {'min': 0, 'max': 1.0})],
                outputs: [('className', va.Array),
                          ('classId', va.Matrix.Int32),
                          ('confidence', va.Matrix.Double),
                          ('frame', va.Matrix.Double, {'typeName': 'Matrix<double>/frame',
                                                       'blockSize': 4}),
                          ('size', va.Matrix.Double, {'typeName': 'Matrix<double>/size',
                                                      'linkedTo': 'frame/region',
                                                      'blockSize': 1}),
                          ('numberOfObjects', int)]):
This is the declaration of the tool’s external interface. There are three inputs to the tool:

- image is the input image of type va.Image.
- confidenceThreshold determines the minimum confidence of an accepted detection. The bigger the value, the fewer false positives are expected. The value range is 0…1.
- suppressionThreshold determines how eagerly close-by objects are grouped into one detection. The bigger the value, the more likely it is that two objects are grouped as one. Conversely, the smaller the value, the more likely one object will be detected as two or more separate objects. The value range is 0…1.

The six outputs are:

- className is an array containing the names of the detected objects.
- classId is an N-by-1 matrix of class identifiers (i.e. integers in the range 0…79).
- confidence is an N-by-1 matrix storing the likelihood of correct detection for each detected object.
- frame is a 4N-by-4 matrix containing a coordinate frame for the location of each detection. The blockSize meta-data gives the Builder a hint that there may be many frames that should all be displayed individually. (A sketch of the frame layout follows this list.)
- size is an N-by-2 matrix storing the size of each detection in the world coordinate system. The linkedTo meta-data hints to the Builder that the size should be interpreted relative to the frame output parameter and that it is best displayed as a region.
- numberOfObjects is the number of objects that were detected.
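To make the frame layout concrete, here is a minimal NumPy sketch of the 4-by-4 block that the loop further below produces for a single detection (the values are illustrative):

import numpy as np

# An axis-aligned coordinate frame: identity rotation with the world
# coordinates (x0, y0) of the detection's top-left corner in the last
# column. N detections are stacked into a (4 * N)-by-4 matrix.
x0, y0 = 120.0, 80.0
frame_block = np.eye(4)
frame_block[0, 3] = x0
frame_block[1, 3] = y0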
        if inputs.image.is_empty():
            raise ValueError('Input image must be non-empty.')
        # The neural network requires 704x704x3 input.
        img = inputs.image.to_rgb().scaled(704, 704)
Make sure that the input can be scaled to the size required by the NN model and then scale it. The input to the Yolo V4 network is a tensor whose shape is (704, 704, 3). That is a fancy way to describe a 704-by-704 RGB color image.
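You can convince yourself of this by viewing the scaled image as a NumPy array, assuming the usual height-by-width-by-channels layout (the detection call below relies on the same conversion):

arr = np.array(img, copy=False)    # no pixel data is copied, see below
assert arr.shape == (704, 704, 3)  # 704x704 pixels, three color channels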
        # Do the detection using the given confidence + non-maximum
        # suppression thresholds
        classes, confidences, boxes = \
            Yolo._dnn_model.detect(np.array(img, copy=False),
                                   confThreshold=inputs.confidenceThreshold,
                                   nmsThreshold=inputs.suppressionThreshold)
The detect function takes a NumPy array as an input. Luckily, converting a va.Image to such an array is easy. With copy=False, the array will be a shallow copy that points to the data buffer of img.
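The zero-copy behavior is plain NumPy and can be verified in isolation:

import numpy as np

buf = np.zeros((704, 704, 3), dtype=np.uint8)
view = np.array(buf, copy=False)  # shallow copy: shares the data buffer
assert np.shares_memory(buf, view)
real_copy = np.array(buf)         # default copy=True: independent data
assert not np.shares_memory(buf, real_copy)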
        class_names = []
        num_objects = len(confidences)
        if num_objects > 0:
            confidences = confidences.flatten()
            classes = classes.flatten()
            for class_id in classes:
                class_names.append(Yolo._class_names[class_id])
        outputs.className = class_names
        outputs.numberOfObjects = num_objects
Map the class indices given by the NN model to class names according to the static _class_names list. We can now set two of the outputs.
        # Create output matrices
        outputs.confidence = va.Matrix.Double(num_objects, 1)
        outputs.classId = va.Matrix.Int32(num_objects, 1)
        outputs.frame = va.Matrix.Double(4 * num_objects, 4)
        outputs.size = va.Matrix.Double(num_objects, 2)

        # Copy results to output matrices.
        for j in range(num_objects):
            # Coordinate frames are aligned to image axes.
            # Initialize to identity matrices.
            for i in range(4):
                outputs.frame[4 * j + i, i] = 1.0
            outputs.confidence[j, 0] = confidences[j]
            outputs.classId[j, 0] = classes[j]
            left, top, width, height = boxes[j]
            # Move corner points to world coordinates
            x0, y0 = pixel_to_world(img, left, top)
            x1, y1 = pixel_to_world(img, left + width, top + height)
            outputs.frame[4 * j, 3] = x0
            outputs.frame[4 * j + 1, 3] = y0
            outputs.size[j, 0] = x1 - x0
            outputs.size[j, 1] = y1 - y0
This piece of code first creates empty matrices to hold the confidence, class ID and location of each detection result. The loop then copies the data from the output tensors of the NN model to the matrices.
An important thing to note here is the mapping from image coordinates to world coordinates. This lets subsequent analysis tools and the user interface relate the output coordinates to the input image (even though the input to the NN was scaled) and eventually to the real world if the image originated from a camera.
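For intuition, here is a hypothetical, stripped-down version of what the conversion amounts to for a plain scaled image with no camera calibration; pixel_to_world itself is the VisionAppster API and also handles calibrated coordinate systems:

def pixel_to_world_sketch(orig_width, orig_height, x, y):
    # Hypothetical helper, not the real API: maps a pixel coordinate in
    # the 704x704 NN input back to the original image's coordinates.
    return x * orig_width / 704, y * orig_height / 704

# A detection corner at (352, 352) in the NN input lands at the center
# of an image that was originally 1920x1080:
print(pixel_to_world_sketch(1920, 1080, 352, 352))  # (960.0, 540.0)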
va.Tool.publish('com.visionappster.opencvpython/1', Yolo)
This publishes the tool to the VisionAppster runtime. The unique component ID of tools in this Python plugin is com.visionappster.opencvpython/1, where “1” denotes a major version number.
Using the tool in the Builder
Once you save the Python file in
$HOME/VisionAppster/plugins/tool/yolo_toolplugin.py
and start the
Builder (or click the “Refresh user plugins” button if the Builder is
already running), you’ll see the new tool in a group labeled
OPENCVPYTHON in the Builder’s tool box. (Make sure to save the NN data
files as well.)
Drag and drop the Yolo tool from the tool box on the workspace. Drag and
drop the image
input of the tool on the workspace to open an image
selector and pick an input image. (Double-click to select.) Then start
the app using the play (▶) button on the status bar.
From the window menu of the image display, select Display with → Image display. Then drag and drop the frame output of the Yolo tool on the image display to see the bounding boxes of the detected objects. Drag and drop the className output on the workspace as well. Here is what you may get:

Yolo seems to know what a dog is.
To see why the coordinates were converted using pixel_to_world, you can create the processing graph shown below. It uses the Rescale tool from another Python cookbook entry to upscale the input image by a factor of two. Since this tool outputs the detection locations in the world coordinate system and the Rescale tool outputs an image with a world coordinate system attached, you can drag and drop the frame and size output parameters of this tool on top of it and see the regions in the expected locations.

World coordinates make measurements easier.