World coordinates

VisionAppster tools measure things in world units, not pixels. This design choice has a number of important advantages, but it sets VisionAppster apart from other image processing libraries in a way that may alienate people with a background in image processing. Unlearning things may be hard, but we firmly believe that using world coordinates makes your life easier once you grasp the concepts.

If the input image to a function comes from an arbitrary file or a non-calibrated and non-positioned camera, the difference between world units and pixels is minimal. All measurements are the same in pixels and world units, but the origin of the world coordinate system is at the center of the image. If you set up an image display in the Builder to show world coordinates, you’ll notice that the upper left corner of an image has negative coordinates.
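For an uncalibrated image, the default mapping can be pictured as a plain shift of the origin. The sketch below is only an illustration: the function name is made up, and it assumes zero-based pixel indices and a center located halfway between the first and last pixel, which may differ in detail from what the platform does.

```python
def pixel_to_default_world(col, row, width, height):
    """Shift pixel indices so that the origin lands at the image center.

    Assumption: the center is the midpoint between the first and last
    pixel, i.e. ((width - 1) / 2, (height - 1) / 2). One pixel step
    equals one world unit because no calibration is involved.
    """
    cx = (width - 1) / 2
    cy = (height - 1) / 2
    return (col - cx, row - cy)


# Upper left corner of a 640x480 image: both coordinates are negative.
top_left = pixel_to_default_world(0, 0, 640, 480)

# Distances are unchanged: two pixels 10 columns apart stay 10 units apart.
a = pixel_to_default_world(100, 50, 640, 480)
b = pixel_to_default_world(110, 50, 640, 480)
```

Only the origin moves. Measured lengths are the same whether you think in pixels or in default world units.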

Each image contains camera parameters and a coordinate frame. This makes it possible to correct non-linear distortions caused by a lens and to uniquely determine the location of the image with respect to a world coordinate system. As a result, it is relatively easy to build multi-camera systems that share the same world coordinate system.

Even if an application does not need to deal with physical cameras or lens distortion, the world coordinate system makes many things easier. If you scale, crop or rotate an image, the coordinates still point to the same location.

Consider the picture below, where a rectangular portion of a large image is taken out for smoothing. After processing, the smoothed result is put back on the original image. If you were using pixel coordinates, (0, 0) would refer to different places in the two images. To put the processed portion back you would need to pass the location of the cropped image with respect to the original image to the Image Joining function. In the VisionAppster platform both the original image and the cropped one know their position with respect to the world coordinate system, and this additional information is not needed.

NSFW image blurring
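To see what the world coordinate system saves you from, here is the pixel-coordinate bookkeeping sketched with plain Python lists. The `crop` and `paste` helpers are hypothetical and have nothing to do with the VisionAppster API. They only show that the crop offset has to be carried along by hand.

```python
def crop(image, left, top, width, height):
    """Cut a width x height block out of a list-of-rows image."""
    return [row[left:left + width] for row in image[top:top + height]]


def paste(target, patch, left, top):
    """Write patch into a copy of target; the caller must supply the offset."""
    result = [row[:] for row in target]
    for r, patch_row in enumerate(patch):
        result[top + r][left:left + len(patch_row)] = patch_row
    return result


image = [[10 * r + c for c in range(6)] for r in range(4)]
left, top = 2, 1
patch = crop(image, left, top, 3, 2)
processed = [[0 for _ in row] for row in patch]  # stand-in for smoothing

# Pixel (0, 0) of the patch is pixel (2, 1) of the original, so the offset
# has to travel alongside the patch to put it back in the right place.
restored = paste(image, processed, left, top)
```

When each image carries its own coordinate frame, the `left` and `top` arguments become redundant.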

The example is rather trivial, but things soon get a lot more complicated if other transformations are involved. Consider the setup in the image below. There, blurring is replaced by downscaling and the original image is rotated. Still, the image joining tool knows the correct locations and is able to place the cropped part where it belongs.

NSFW image pixelization

The world coordinate system is particularly handy with machine vision applications. Changing or moving your camera no longer requires changes to code that measures things. All you need to do is to calibrate and position the new camera and everything else is taken care of automatically.

Image display

When an image is shown on an image display, it is by default shown in image coordinates. This means that the image is shown as such without any projection provided that there is only one layer in the image display. With multiple layers, all the images are projected to the image coordinates of the active layer.

The image display, however, makes it possible to change the coordinate frame to which its images are projected. You can select world coordinates, which means that the images are projected to the xy plane of the world coordinate system. Tools that make geometric transformations to images (such as rotation or scaling) don’t change the position of the image in the world, just the arrangement of pixels in the memory of your computer. Therefore, in world coordinates, the output of RotateImageTool will look exactly the same as its input even though the pixels are rearranged by rotating them around the center of the image. The tool compensates for the rearrangement by rotating the axes of the image’s coordinate frame by an equal amount.

You can also use a custom coordinate system. This comes in handy if you want to align a displayed image based on, for example, a detected object in the image.

You can also select the coordinate system in which the location of the mouse cursor is shown. By default, the image display uses the coordinates of the selected coordinate frame: world units (e.g. millimeters) in a world coordinate system and pixels in an image coordinate system. You can, however, select which coordinate values are displayed independently of the coordinate system used for projecting the image on the display.

Images as views of the world

The same scene may be captured with multiple cameras looking from different viewpoints. The images produced by these cameras may be further transformed by rotating, scaling, cropping etc. The colors or gray levels of image pixels may also be processed in various ways, but each image still represents a view of the world and can be positioned in it.

If your images come for example from files and have no associated real-world coordinates, it is perfectly fine to use the default world coordinate system that has its origin at the center of the image. Everything else still applies.

Since all images are treated as views of the same world, it does not matter which view you use as source data to your analysis tools. If you measure the location of a thing in one image, it is just fine to use that location to crop a portion out of an image taken with another camera. The caveat: it is assumed that the images are on the same planar surface. This can be changed with ReplaceCoordinateFrameTool, but we won’t go into details here.

It is possible to create synthetic views of the world. For example, if you want to join images from multiple cameras into a compound view of the world, you can use JoinImagesTool. To create a completely new view, you can use ProjectToVirtualViewTool. It lets you define a window through which to look at the world. As input data, you can provide any other view of the world. In fact, all geometric image transformation tools such as rotation and scaling are just special cases of this tool.


Traditionally, pixels have been treated as squares whose coordinates are represented with respect to the upper left corner of the image. In this model, the first pixel covers a rectangular area from (0, 0) to (1, 1). This easily leads to the assumption that the values of pixels are averages of intensities over neighboring squares in the world. We’ll tell you a secret: this is not even approximately true to start with.

If an image has perspective or it is rotated or otherwise transformed, it is easy to see that the pixel squares cannot correspond to squares in the real world. In the picture below, a 3×3 image is rotated about its center. In the rotated image (dotted), each pixel can in principle intersect up to six pixels of the original grid (solid), but there is no way to get the exact colors or intensities of the intersections. For example, the blue intersection only covers a fraction of a pixel in the original grid, but we only have the color or intensity of the whole pixel.

When an image is rotated, assumptions about rectangular pixels can’t hold.

For this reason, the VisionAppster platform treats pixels as point samples. The coordinates of a pixel are determined by its center, and there is no associated shape. If the image is transformed (rotated, scaled, sheared) digitally, the center of each pixel still refers to the same location in the world.

Since it is not possible to know the shape of a pixel, it is not meaningful to measure sizes from the edge of a pixel. If required, an approximate location of the “edge” can however be calculated by halving the distance between neighboring pixels.

Understanding that pixels are not squares but points, and that distances between pixels are measured from center to center, will make your life a lot easier, especially if you need to make accurate measurements in 3D. It will also help you understand why the output image from the cropping tool has 4×3 pixels even though you set the size to 3×2.

Distances are measured between pixel centers.

So, why, actually? Well, you asked for a picture that covers 3×2 world units, not pixels. Consider the image above. There, a 3-by-2 portion of a world is represented as a red rectangle and the locations of pixel centers with black dots. If you haven’t calibrated the camera, there is a 1:1 correspondence between world units and pixels. The CropImageTool must take so many pixels that the distance between the centers of edge pixels is three horizontally and two vertically. 4×3 pixels, that is.
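The arithmetic behind the 4×3 result can be sketched in a few lines. The function below is a hypothetical helper, not part of CropImageTool, and it assumes that the requested extent is a whole multiple of the pixel spacing.

```python
def samples_needed(extent, spacing=1.0):
    """Number of point samples whose first-to-last distance equals extent.

    Assumption: extent is a whole multiple of spacing. N samples spaced
    `spacing` apart span a distance of (N - 1) * spacing.
    """
    return round(extent / spacing) + 1


# Uncalibrated camera: one pixel step equals one world unit.
crop_width_px = samples_needed(3)    # 3 world units wide
crop_height_px = samples_needed(2)   # 2 world units high
```

With these inputs `crop_width_px` is 4 and `crop_height_px` is 3, matching the fencepost reasoning above: covering a distance of three takes four posts.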

Coordinate frames

VisionAppster Tools use coordinate frames to represent the positions and orientations of images and geometric objects with respect to each other. The coordinate frames are three-dimensional and allow arbitrary linear transformations and translations, that is, arbitrary affine transformations. In other words, a coordinate frame can be used to express rotation, scaling, stretching, shearing and location changes.

Usually, a coordinate frame defines the position and orientation of an object in the world coordinate system. The world coordinate system is located – as the name implies – in the same physical world as the object being inspected. If the camera is calibrated, the units of the world coordinate system are physical units such as millimeters. Otherwise, all measurements are unitless and equal to pixels.

By convention, a right-handed coordinate system is used for world coordinates. It is called right-handed because of the mnemonic: close your right hand, raise the thumb (x axis) and point forwards with your index finger (y axis). Now, straighten your middle finger to make a 90° angle with the index finger and you have the direction of the z axis. If you place such a frame on traditional pixel coordinates, z points towards the back of your monitor. Another way to think about it is to imagine a “normal” xy coordinate system on your desk: if y points away from you and x to the right, the z axis would point towards the ceiling.

The axes of a right-handed coordinate system. [1]

When coordinate frames are used as tool parameters, they are expressed as 4-by-4 matrices. The first three columns are the three base vectors of the frame in the enclosing coordinate system, and the fourth column its translation. Usually, the three base vectors are perpendicular to each other, with a unit length. This is not a requirement however: the base vectors can be used to scale and shear the coordinate system.

The last row of a coordinate frame is always [0, 0, 0, 1]. This makes the matrix a homogeneous transformation matrix, a mathematical trick that combines a linear transformation and a translation into a single matrix multiplication.
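The layout described above can be sketched in plain Python. The helpers `make_frame` and `apply_frame` are hypothetical names used only for illustration. They show how the base vectors and the translation fill the columns, and how a point is extended with a trailing 1 before the multiplication.

```python
def make_frame(x_axis, y_axis, z_axis, translation):
    """Build a 4x4 frame: columns 1-3 are base vectors, column 4 the translation."""
    rows = []
    for i in range(3):
        rows.append([x_axis[i], y_axis[i], z_axis[i], translation[i]])
    rows.append([0.0, 0.0, 0.0, 1.0])  # fixed last row
    return rows


def apply_frame(frame, point):
    """Transform a 3D point by extending it to homogeneous coordinates."""
    x, y, z = point
    h = (x, y, z, 1.0)
    out = [sum(frame[r][c] * h[c] for c in range(4)) for r in range(4)]
    return tuple(out[:3])


# Base vectors need not have unit length: this frame doubles the x scale
# and moves the origin to (5, 0, 0) in the enclosing coordinate system.
frame = make_frame((2, 0, 0), (0, 1, 0), (0, 0, 1), (5, 0, 0))
moved = apply_frame(frame, (1, 1, 1))
```

Here `moved` is (7, 1, 1): the x coordinate is first scaled by two and then shifted by five, all in one matrix multiplication.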

An example

Below is an example of a coordinate frame that is rotated 10 degrees clockwise around the z axis and translated to (19, 11, 1976).

\[\begin{split}\left(\begin{array}{cccc} 0.9848 & -0.1736 & 0 & 19 \\ 0.1736 & 0.9848 & 0 & 11 \\ 0 & 0 & 1 & 1976 \\ 0 & 0 & 0 & 1 \\ \end{array}\right)\end{split}\]

The rotation terms are obtained from cos(10°) = 0.9848 and sin(10°) = 0.1736. If this were the world frame, it would mean that

  • the distance between the camera’s aperture and the world’s xy plane is 1976 user-defined units.

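  • the optical axis hits the world xy plane at approximately (-20.62, -7.53) in world coordinates. Seen from the camera, the world origin is offset by (19, 11) from the optical axis.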

  • the world coordinate system is rotated 10 degrees clockwise around the camera’s optical axis.

Let’s assume the coordinates of a point in world coordinates are (-13.9610, 1.4463, 1). The z coordinate is one, meaning that the point is one unit below the surface of the object being inspected. To get its 3D coordinates with respect to the camera’s aperture, one would perform the following matrix multiplication:

\[\begin{split}\left(\begin{array}{cccc} 0.9848 & -0.1736 & 0 & 19 \\ 0.1736 & 0.9848 & 0 & 11 \\ 0 & 0 & 1 & 1976 \\ 0 & 0 & 0 & 1 \\ \end{array} \right)\left( \begin{array}{c} -13.9610 \\ 1.4463 \\ 1 \\ 1 \end{array}\right) = \left(\begin{array}{c} 5 \\ 10 \\ 1977 \\ 1 \end{array}\right)\end{split}\]
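If you want to check the multiplication yourself, the following sketch reproduces it with the Python standard library. The helper names are hypothetical. The inverse shown here relies on the assumption that the base vectors are orthonormal, as they are in this example.

```python
import math

angle = math.radians(10.0)
c, s = math.cos(angle), math.sin(angle)

# The example frame: rotation about z plus translation (19, 11, 1976).
world_frame = [
    [c, -s, 0.0, 19.0],
    [s, c, 0.0, 11.0],
    [0.0, 0.0, 1.0, 1976.0],
    [0.0, 0.0, 0.0, 1.0],
]


def transform(frame, point):
    """Multiply the frame by the point extended with a trailing 1."""
    h = list(point) + [1.0]
    return [sum(frame[r][k] * h[k] for k in range(4)) for r in range(3)]


def inverse_transform(frame, point):
    """Undo a frame with orthonormal base vectors: R^T (p - t)."""
    d = [point[i] - frame[i][3] for i in range(3)]
    return [sum(frame[k][i] * d[k] for k in range(3)) for i in range(3)]


world_point = [-13.9610, 1.4463, 1.0]
camera_point = transform(world_frame, world_point)
back = inverse_transform(world_frame, camera_point)
```

Up to the rounding of the input coordinates, `camera_point` comes out as (5, 10, 1977), and `back` recovers the original world point.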

[1] Original image courtesy of Acdx