Process YOLO results

Converts the output tensor of a YOLO deep learning model to generally usable data types. The tool first reshapes the input tensor to an N-by-(5 + classCount) matrix that holds the parameters of one bounding box on each row. It then discards detections whose confidence is below confidenceThreshold and performs non-maximum suppression on the rest.
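The filtering step can be illustrated with a minimal Python sketch. It is not the tool's implementation: the row layout (x, y, width, height, objectness, then one score per class) and the definition of confidence as objectness times the best class score are assumptions made for illustration, and `filter_by_confidence` is a hypothetical helper name.

```python
def filter_by_confidence(rows, confidence_threshold):
    """Keep the rows whose confidence reaches the threshold.

    Assumed row layout: [x, y, w, h, objectness, class scores...].
    Assumed confidence: objectness * best class score.
    """
    kept = []
    for row in rows:
        objectness = row[4]
        class_scores = row[5:]
        # Index of the highest class score in this row.
        best_class = max(range(len(class_scores)), key=class_scores.__getitem__)
        confidence = objectness * class_scores[best_class]
        if confidence >= confidence_threshold:
            kept.append((row[:4], best_class, confidence))
    return kept


# Two rows of an N-by-(5 + classCount) matrix with classCount = 2.
rows = [
    [0.5, 0.5, 0.2, 0.3, 0.9, 0.1, 0.8],  # strong detection, class 1
    [0.1, 0.2, 0.1, 0.1, 0.2, 0.6, 0.3],  # weak detection, class 0
]
detections = filter_by_confidence(rows, confidence_threshold=0.5)
```

Only the first row survives here, because its confidence (0.9 * 0.8) exceeds the threshold while the second row's (0.2 * 0.6) does not.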


  • tensor: The output of a YOLO model. The shape is assumed to be (1 x K x S x S) or (K x S x S), where K = B * (classCount + 5), S is the number of vertical and horizontal image subdivisions, and B is the number of anchor boxes per subdivision. For example, in YOLO v2, S = 13, B = 5 and classCount = 20, so K = 5 * (20 + 5) = 125.

  • image: The image to which the model was applied. Both width and height are assumed to be P * S, where S is the number of vertical and horizontal image subdivisions and P is the number of pixels per subdivision. For example, in YOLO v2, S = 13 and P = 32, so the image size must be 416 x 416. The image is needed because the tensor produced by a YOLO model does not contain information about the coordinate system of the image the model was applied to.

  • confidenceThreshold: Minimum confidence for a bounding box to be accepted.

  • overlapRatioThreshold: Overlap ratio threshold for pruning overlapping detections (non-maximum suppression). If the overlap ratio (intersection over union) of two detections with the same class is greater than this value, the detection with the lower confidence is discarded. Set to one to disable non-maximum suppression, since the overlap ratio never exceeds one.

  • classCount: The number of classes in the one-hot encoded class vector.

  • anchorSizes: A B-by-2 matrix that contains the relative sizes of the anchor boxes the YOLO model was trained with.
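The shape relation for the tensor and the role of overlapRatioThreshold can be sketched in Python as follows. This is a common greedy, class-wise form of non-maximum suppression and may differ in detail from what the tool does. The helper names and the box format (left, top, width, height) are assumptions for illustration.

```python
def expected_tensor_shape(grid_size, anchor_count, class_count):
    """Return (K, S, S) with K = B * (classCount + 5)."""
    return (anchor_count * (class_count + 5), grid_size, grid_size)


def iou(a, b):
    """Intersection over union of two (left, top, width, height) boxes."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(detections, overlap_ratio_threshold):
    """Greedy NMS over a list of (box, class_index, confidence) tuples.

    A detection is dropped if a more confident detection of the same
    class overlaps it by more than the threshold.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        box, class_index, _ = det
        suppressed = any(
            k[1] == class_index and iou(k[0], box) > overlap_ratio_threshold
            for k in kept
        )
        if not suppressed:
            kept.append(det)
    return kept


# YOLO v2 example from the text: S = 13, B = 5, classCount = 20.
shape = expected_tensor_shape(grid_size=13, anchor_count=5, class_count=20)

candidates = [
    ((10, 10, 100, 100), 0, 0.9),
    ((20, 20, 100, 100), 0, 0.8),  # overlaps the first box, same class
    ((20, 20, 100, 100), 1, 0.7),  # same place, different class
]
pruned = non_maximum_suppression(candidates, overlap_ratio_threshold=0.5)
unpruned = non_maximum_suppression(candidates, overlap_ratio_threshold=1.0)
```

With a threshold of 0.5 the second candidate is discarded, while the third survives because it belongs to another class. With a threshold of one nothing is discarded, because the comparison is strict and the overlap ratio cannot exceed one.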


  • frame: Upper left corner of each remaining detection as a coordinate frame that is aligned to the axes of the image coordinate system. A 4N-by-4 matrix.

  • size: The size of each bounding box in world coordinates. An N-by-2 matrix.

  • classIndex: The class index of each detection. An N-by-1 matrix.

  • confidence: The confidence of each detection. An N-by-1 matrix.
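To make the output shapes concrete, the following Python sketch packs N detections into four matrices with the dimensions listed above. The layout of each 4-by-4 block in frame (a homogeneous transform with identity rotation and the upper left corner in the last column) is an assumption for illustration, not something the description confirms. `make_frame` and `pack_outputs` are hypothetical helpers.

```python
def make_frame(left, top):
    """Assumed layout: 4-by-4 homogeneous frame, axis-aligned, origin at (left, top)."""
    return [
        [1.0, 0.0, 0.0, left],
        [0.0, 1.0, 0.0, top],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]


def pack_outputs(detections):
    """Turn (box, class_index, confidence) tuples into the four output matrices."""
    frame, size, class_index, confidence = [], [], [], []
    for (left, top, width, height), cls, conf in detections:
        frame.extend(make_frame(left, top))  # four rows per detection
        size.append([width, height])         # one row per detection
        class_index.append([cls])
        confidence.append([conf])
    return frame, size, class_index, confidence


results = [
    ((10.0, 20.0, 100.0, 50.0), 3, 0.9),
    ((200.0, 40.0, 30.0, 60.0), 7, 0.6),
]
frame, size, class_index, confidence = pack_outputs(results)
```

For the two detections above, frame has 8 rows of 4 values, size has 2 rows of 2 values, and classIndex and confidence each have 2 rows of 1 value.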