# Introduction

The VisionAppster Engine runs on a variety of different hardware platforms and operating systems. Since image analysis code is often performance-critical, not just the Engine but also the algorithms must be compiled to native code. To prevent fragmentation, we expect all publicly sold native code extensions to run on all supported platforms. It should however be noted that code running in VisionAppster Cloud only needs to support Linux on x86_64.

The officially supported platforms for VisionAppster Engine are:

• linux-x86_64
• linux-arm_64
• linux-arm_32
• windows-x86_64

Compiling for multiple different platforms is a major chore. For this reason, we provide Docker images that contain a fully working SDK for each supported platform. The images are built on Dockcross.

The cross-compilation build process is controlled by a command-line tool called va-cross, which comes with the VisionAppster installation. va-cross is a compiler and build tool front-end that runs compilation commands in multiple cross-build environments in sequence. It lets you compile binaries for multiple target architectures with a single command, without any special preparations for cross-compilation.

# Installation

Since va-cross uses Docker images to set up the cross-compilation environments, you need to install Docker first:

On Linux, make sure to add your user ID to the docker group (sudo usermod -aG docker $USER). Otherwise, you'll need to run va-cross as root.

The cross-compiler front-end, va-cross, is in the bin directory of the VisionAppster installation. If it is in your PATH, this command will give you basic usage instructions:

va-cross --help

If you installed the Flatpak version on Linux, type this first:

alias va-cross="flatpak run --command=va-cross com.visionappster.Builder"

To test the installation, run the following command:

# On Linux
va-cross echo '$VA_ARCH'
# On Windows command prompt
va-cross echo $VA_ARCH

This will start one Docker container for each supported target architecture and run echo $VA_ARCH inside it. The command should print out the four currently supported architecture identifiers.

On the first run, va-cross downloads multiple large Docker images, so the command will take a while. Subsequent startups are much faster.

# Usage

va-cross starts one or more Docker containers and runs a user-specified command in each. For example, va-cross make will run make four times in different cross-compilation environments. One can select the target architectures by giving one or more --arch options on the command line.

# Build the Makefile in the current directory
# for a specific architecture (cloud).
va-cross --arch=linux-x86_64 make
# Same for a cmake based project
va-cross --arch=linux-x86_64 cmake -S . -B build

# Build for all architectures at once
va-cross make install
# Same for a cmake based project
va-cross cmake -S . -B '${VA_ARCH_PREFIX}build'

Command-line arguments are passed to the command verbatim, with the exception of environment variables, which are expanded before the command is invoked. In the last example above, the value of the VA_ARCH_PREFIX environment variable is expanded on the command line inside the container. Note that Windows and Linux handle command lines very differently. On the Windows command prompt, the '$' character has no special meaning and does not need to be quoted. Arguments (such as path names) containing spaces still need to be quoted, but with double quotes.

# Evaluate VAR outside the container (Linux)
va-cross echo $VAR

# Evaluate VAR outside the container (Windows)
va-cross echo %VAR%

# Evaluate VAR inside the container (Linux)
va-cross echo '$VAR'

# Evaluate VAR inside the container (Windows)
va-cross echo $VAR

# An argument contains spaces (both operating systems)
va-cross echo "Argument with spaces"

To run many commands at once, you can pass bash -c as the command, followed by any valid shell script:

va-cross bash -c 'mkdir $VA_ARCH && cd $VA_ARCH'

An important thing to note is that absolute paths to files are different inside the container. Always prefer relative paths if possible. If not, prefix absolute paths with /work.

# Go to root directory
cd /
# Print working directory inside container.
# Single quotes are not needed in Windows command prompt.
va-cross --arch=linux-arm_32 echo '$PWD'
# Prints /work
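The quoting rules above are ordinary shell expansion rules, so they can be tried out with a plain shell, no containers involved. GREETING is a made-up variable used only for this illustration:

```shell
# The calling shell expands unquoted (or double-quoted) variables
# before the command runs; single quotes pass the text through
# unchanged, so the inner shell expands it instead.
GREETING=outside
sh -c "echo $GREETING"   # expanded by the caller: prints "outside"
sh -c 'echo $GREETING'   # expanded by the inner shell; GREETING is
                         # not exported, so this prints an empty line
```

The same mechanics explain why '$VA_ARCH' must be single-quoted on Linux to reach the container unexpanded.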

# Docker image contents

The Docker images come with autotools, GNU make, CMake and Ninja, gcc, g++ and all standard build utilities. There is a caveat though: the version numbers of these tools are not the same in all images. You must prepare for the oldest version, listed in the following table:

| Command | Version |
|---------|---------|
| gcc/g++ | 4.9.4   |
| make    | 4.1     |
| cmake   | 3.13.2  |
| ninja   | 1.7.2   |

Currently, the ARM images contain an old version of gcc. The practical consequence of this is that you cannot use C++17 features in your native code yet. We are working on this issue.

In the build environment, the following variables are always set:

• VA_ARCH - Architecture identifier, e.g. "linux-x86_64". Matches the name of the Docker image and the --arch command-line option without the optional hardware profile suffix. (More on hardware profiles below.)
• VA_ARCH_PREFIX - Architecture identifier optionally followed by a hardware profile suffix, and a dash, e.g. "linux-x86_64-". This is useful when the architecture identifier is used as a prefix e.g. for a directory name.
• VA_SDK_PATH - Absolute path to the VisionAppster SDK.
• CROSS_COMPILE - Architecture identifier and a dash. Makes it easy to integrate projects that already support cross-compilation. Note that this will be different from VA_ARCH_PREFIX if a non-default hardware profile is selected.
• CC - Path to C compiler.
• CXX - Path to C++ compiler.
• CPP - Path to C preprocessor.
• LD - Path to linker.
• CFLAGS - Default C compiler flags. Contains an include path to the VisionAppster SDK.
• CXXFLAGS - Default C++ compiler flags. Contains an include path to the VisionAppster SDK.
• LDFLAGS - Default linker flags. Contains a linker search path to the VisionAppster SDK.

In addition, paths to various other tools (ar, as etc.) are defined as environment variables (AR, AS etc.). To inspect the environment of each Docker image, give the following command:

va-cross env
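As a sketch of how these variables are typically consumed, the snippet below simulates the container environment with hand-set values (inside a real va-cross container they are already set by the image) and derives per-architecture paths from them:

```shell
# Simulated container environment; in a real container these
# are set by the Docker image, not by you.
VA_ARCH=linux-x86_64
VA_ARCH_PREFIX=linux-x86_64-

# A build script can derive architecture-specific names from them:
echo "${VA_ARCH_PREFIX}build"   # linux-x86_64-build
echo "build/${VA_ARCH}"         # build/linux-x86_64
```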

# Project setup

If you have followed best practices when setting up your build system, there is not much to do. Commands such as gcc, g++ and objcopy in the Docker containers directly invoke the corresponding cross-compilation toolchain commands for each architecture, and the environment variables CXXFLAGS, CFLAGS and LDFLAGS provide default compiler/linker flags. Usually, the only thing you may need to worry about is that each build puts its object files and output binaries to different directories.

The following examples assume that your source code is in a directory called src/ under the project's root directory. Build artifacts will be placed in the project root in architecture-specific build directories.

## CMake

CMake generates out-of-source (shadow) builds by default. It also respects the standard environment variables. You need to place a file called CMakeLists.txt in your project's root directory. Here's a template:

# CMakeLists.txt
# A template for cross-building tool plugins.
cmake_minimum_required(VERSION 3.10)

# Project name
project(Test)
# Tool plugin name
set(TARGET test)
# List sources here, separated with spaces. If you use wildcards,
# you need to run cmake again to regenerate the build files every
# time source files are added or removed.
set(SOURCES src/test.cc)

# Set a default build type if none was specified
# https://blog.kitware.com/cmake-and-the-default-build-type/
set(default_build_type "Release")
if(NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES)
message(STATUS "Setting build type to '${default_build_type}' as none was specified.")
set(CMAKE_BUILD_TYPE "${default_build_type}" CACHE
    STRING "Choose the type of build." FORCE)
# Set the possible values of build type for cmake-gui
set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS
"Debug" "Release" "MinSizeRel" "RelWithDebInfo")
endif()

# Enable automatic vectorization
set(RELEASE_FLAGS -O3)

# Compile a plugin (MODULE) out of the given sources.
add_library(${TARGET} MODULE ${SOURCES})

# The output needs to have a ".toolplugin" suffix and no "lib" prefix.
set_target_properties(${TARGET} PROPERTIES PREFIX "")
set_target_properties(${TARGET} PROPERTIES SUFFIX ".toolplugin")
target_compile_options(${TARGET} PUBLIC "$<$<CONFIG:RELEASE>:${RELEASE_FLAGS}>")

To generate build files, pass an architecture-specific build directory to cmake:

# From Linux shell
va-cross cmake -S . -B '${VA_ARCH_PREFIX}build'
# Alternatively, for optimized builds
va-cross cmake -S . -B 'build/${VA_ARCH}${VA_ARCH_PROFILE_DIR}'
# From Windows command prompt: no quotes
va-cross cmake -S . -B ${VA_ARCH_PREFIX}build

The VA_ARCH_PREFIX variable is expanded in the build environment, yielding a different build directory for each architecture.

To build for all supported architectures:

# From Linux shell
va-cross cmake --build '${VA_ARCH_PREFIX}build'
# Alternatively, for optimized builds
va-cross cmake --build 'build/${VA_ARCH}${VA_ARCH_PROFILE_DIR}'
# From Windows command prompt
va-cross cmake --build ${VA_ARCH_PREFIX}build

## GNU make

If you are using GNU make, the following Makefile template demonstrates standard conventions. Place this file in your project's root directory.

# Makefile
# A template for cross-building tool plugins.

# Additional compiler and linker options.
INCPATH =
DEFINES =
LIBS =

# Some cross-platform preparation
ifeq ($(VA_ARCH),windows-x86_64)
DEFINES += -D_USE_MATH_DEFINES
endif

# Set compiler if it doesn't come from the environment.
CXX ?= g++
# The compiler also works as a linker.
LINK = $(CXX)

# Append to flags inherited from environment.
CXXFLAGS += -std=c++11 -pipe -O3 -fvisibility=hidden -Wall \
  -W -D_REENTRANT -fPIC $(DEFINES)
LDFLAGS += -shared

# Use a different build directory for each architecture and profile.
BUILDDIR = build/$(VA_ARCH)$(VA_ARCH_PROFILE_DIR)

# Compile all .cc files under "src".
SRCDIR = src
SOURCES = $(wildcard $(SRCDIR)/*.cc)
# Put objects in BUILDDIR.
OBJECTS = $(patsubst $(SRCDIR)/%.cc,$(BUILDDIR)/%.o,$(SOURCES))
# Same for the final output binary.
TARGET = $(BUILDDIR)/test.toolplugin

# Let make know "all" and "clean" are not files.
.PHONY: all clean

all: $(TARGET)

clean:
	rm -f $(OBJECTS) $(TARGET)

$(BUILDDIR):
	mkdir -p $(BUILDDIR)

# Generic compilation rule
$(OBJECTS): $(BUILDDIR)/%.o: $(SRCDIR)/%.cc
	$(CXX) -c $(CXXFLAGS) $(INCPATH) -o "$@" "$<"

# Target linking rule
$(TARGET): $(BUILDDIR) $(OBJECTS) Makefile
	rm -f $(TARGET)
	$(LINK) $(LDFLAGS) -o $(TARGET) $(OBJECTS) $(LIBS)

# No install target. The target binary will be placed in a .vapkg.

To build for all supported architectures:

va-cross make

If you have already set up your project for cross-compilation using the standard convention of prepending the value of the CROSS_COMPILE environment variable to compilation commands (e.g. $(CROSS_COMPILE)gcc), you are good to go. In the Docker container, the value of the CROSS_COMPILE environment variable is equal to ${VA_ARCH}-, and prefixed toolchain commands are also available.
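To illustrate the convention, the snippet below simulates the container environment with a hand-set value (in the container, CROSS_COMPILE is set for you and the prefixed tools exist on the PATH):

```shell
# Simulated: inside the linux-arm_64 container this variable is
# already set, and tools like linux-arm_64-gcc are on the PATH.
CROSS_COMPILE=linux-arm_64-
echo "${CROSS_COMPILE}gcc"   # linux-arm_64-gcc
echo "${CROSS_COMPILE}ar"    # linux-arm_64-ar
```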

## Other build systems

The examples above assume that the whole build is run inside the container, but it is also possible to invoke the cross-compilers directly. This is useful if you use a build system that is not included in the Docker images. The problem with this approach is that it slows down the build by a significant amount as every compiler invocation must bring up a container. Nevertheless, you may experiment with this by configuring your build system to use va-cross as a cross-compiler. For example, setting the compiler to va-cross --arch=linux-arm_32 g++ would compile C++ source code for the linux-arm_32 target architecture.

# Hardware profiles and optimization

Many image processing and learning algorithms can be significantly sped up by CPU instruction set extensions such as SSE and AVX. Unfortunately, not all processors support such extensions, and compiling the platform or the algorithms for all possible permutations of operating systems, CPU architectures and instruction set extensions is not feasible.

To ensure maximum performance while still supporting a wide variety of different execution hardware we have defined hardware profiles. The Level 0 profile specifies the minimum requirements for each processor architecture. Each higher-level profile supports everything that is available in lower-level profiles plus some additional capabilities.

The relevant instruction set extensions available in each profile are listed below. Each level has a name that is used as an architecture sub-type on the command line, in directory names, in preprocessor macros etc.

• x86_64
• Level 0 (default): MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1
• Level 1 (avx): SSE4.2, AVX
• Level 2 (avx2): AVX2
• arm_32
• Level 0 (default): ARMv7-A, NEON
• arm_64
• Level 0 (default): ARMv8.1-A

Unless told otherwise, va-cross enforces the default profile. If you pass machine-dependent (-m) options to the compiler, an error will be raised. This ensures that the compiled code is always runnable on hardware that meets the minimum requirements.

Since everything is compiled with conservative optimization settings by default, special treatment is needed for code that would benefit from the compiler's automatic vectorization, as well as for code that explicitly uses vectorization instructions via inline assembly or compiler intrinsics. There are two ways to do this.

## Optimizing individual functions

Function-level optimization is built on indirect functions whose target is resolved at run time based on hardware capabilities. This feature is only available on linux-x86_64. On other platforms, the easiest option is to compile the whole binary for all supported hardware profiles. Alternatively, you may implement run-time implementation selection yourself, but this is tedious if started from scratch.

If you don't already know which functions in your code would benefit from automatic vectorization, you can ask the compiler. Let's assume you have the following C99 source code:

// vectorization.c

// "restrict" tells the compiler that the pointed-to arrays won't overlap
void sum(const int* restrict a, const int* restrict b, int* restrict c, unsigned len)
{
// Vectorizable loop
for (unsigned i = 0; i < len; ++i)
c[i] = a[i] + b[i];
}

int main()
{
int a[256], b[256], c[256];
// Initialize somehow
sum(a, b, c, 256);
}

Compile with -O3 -fopt-info-vec:

va-cross --arch=linux-x86_64 gcc -std=c99 -O3 -fopt-info-vec \
-o vectorization vectorization.c

You'll get the following output:

vectorization.c:7:3: note: loop vectorized

This indicates that sum will benefit from vectorization. To make use of optimizations not available on the default profile, you should annotate the function:

VA_OPTIMIZED
void sum(const int* restrict a, const int* restrict b, int* restrict c, unsigned len)
{
// ...
}

VA_OPTIMIZED is a macro that expands to architecture-specific function attributes that instruct the compiler to generate a differently optimized version for each hardware profile. At run time, the implementation for the highest possible hardware profile will be selected automatically based on the capabilities of the underlying hardware. If the target platform does not support function multi-versioning or if a hardware profile is explicitly selected, the macro is empty.

If you have written an inline assembly version that targets a specific instruction set, you must also provide a generic version that works on the default profile and on other processor architectures. You must use the preprocessor to select which versions actually get compiled. The alternatives are:

1. Multi-versioning is enabled. Compile both versions at once.
2. The AVX2 profile is enabled. Compile AVX2 version only.
3. Otherwise, compile generic version only.

#include <va_global.h>

#if defined(VA_ARCH_PROFILE_MULTI)    // Case 1: multi-versioning
#  define AVX2_SUM_ATTR    VA_OPTIMIZED_FOR(VA_ARCH_PROFILE_AVX2)
#  define DEFAULT_SUM_ATTR VA_OPTIMIZED_FOR(VA_ARCH_PROFILE_DEFAULT)
#elif defined(VA_ARCH_PROFILE_AVX2)   // Case 2: building for AVX2
#  define AVX2_SUM_ATTR
#else                                 // Case 3: default version only
#  define DEFAULT_SUM_ATTR
#endif

#ifdef AVX2_SUM_ATTR
AVX2_SUM_ATTR
void sum(const int* __restrict a,
const int* __restrict b,
int* __restrict c,
unsigned len) noexcept
{
// optimized implementation that uses AVX2
asm (...);
}
#endif

#ifdef DEFAULT_SUM_ATTR
DEFAULT_SUM_ATTR
void sum(const int* __restrict a,
const int* __restrict b,
int* __restrict c,
unsigned len) noexcept
{
// generic implementation
}
#endif

Note that this requires that you compile the source as C++ because C does not allow multiple definitions of a function with the same name. Since C++ doesn't have restrict, we used a non-standard compiler extension __restrict instead. Furthermore, the functions are now marked as noexcept because the C++ exception handling mechanism does not tolerate indirect functions. It is safe to throw and catch exceptions within multi-versioned functions, but catching exceptions outside of the function would not work.

## Optimizing all code

Since most code will be compiled to the same assembly independent of optimization possibilities, individual function optimization should be used whenever possible. This makes building faster and the overall size of the resulting binaries smaller.

On Windows, function-level optimization is not available. For profile levels one and up, the complete binary must be recompiled with different architecture flags. While this makes building a bit more cumbersome and multiplies the size of the final product, run-time performance (memory consumption and processing time) is not negatively affected. This technique also works on Linux.

Unlike versions for different operating systems and processor architectures, versions for different instruction set extensions are not built by default. To produce optimized binaries you need to pass the --arch flag to va-cross explicitly, adding +profile to the architecture ID. For example, to build for the avx2 profile:

# Enable AVX2 instructions on Windows.
va-cross --arch=windows-x86_64+avx2 make

This will set the following environment variables in the container:

• VA_ARCH_PROFILE=avx2
• VA_ARCH_PROFILE_DIR=/avx2
• VA_ARCH_PREFIX=windows-x86_64+avx2-
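These variables combine naturally into per-profile build directories, which is what the Makefile template's BUILDDIR does. A simulated sketch (in the container the values are set by va-cross, not by you):

```shell
# Simulated values, as set when building with --arch=windows-x86_64+avx2.
VA_ARCH=windows-x86_64
VA_ARCH_PROFILE=avx2
VA_ARCH_PROFILE_DIR=/avx2

# The profile directory suffix keeps profile builds apart:
echo "build/${VA_ARCH}${VA_ARCH_PROFILE_DIR}"   # build/windows-x86_64/avx2
```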

You can use wildcards to build specific optimized versions:

# Build using AVX instructions on all platforms that support AVX.
va-cross "--arch=*+avx" make
# Create a multi-versioned build on Linux and a separate binary for
# each profile on Windows.
va-cross -a linux-x86_64 -a linux-arm_32 -a linux-arm_64 \
-a "windows-*" make

## Installing differently optimized binaries

If you want to ship optimized binaries that are compiled for higher-level profiles, you always need to include one for the default profile as well. In component.json, only the version compiled using the default profile will have its type key set to toolplugin. Optimized versions must be placed in sub-directories named according to the hardware profile. For more information on the file format, see General configuration.

When creating a package, you need to set the type and arch keys of each file correctly. va-pkg update --scan does this for you if you place the tool plugins in a directory whose name matches a known architecture ID. Therefore, it is a good idea to follow the conventions and always use $VA_ARCH or $VA_ARCH/$VA_ARCH_PROFILE as the directory name for tool plugins.

Let us assume you have built a multi-versioned binary for Linux and a separate binary for each x86_64 profile for Windows. To build a .vapkg file you need to compose a directory structure such as this one:

mycomponent/
mycomponent/linux-x86_64/
mycomponent/linux-x86_64/mycooltools.toolplugin
mycomponent/windows-x86_64/
mycomponent/windows-x86_64/mycooltools.toolplugin
mycomponent/windows-x86_64/avx/
mycomponent/windows-x86_64/avx/mycooltools.toolplugin
mycomponent/windows-x86_64/avx2/
mycomponent/windows-x86_64/avx2/mycooltools.toolplugin
mycomponent/linux-arm_32/
mycomponent/linux-arm_32/mycooltools.toolplugin
mycomponent/linux-arm_64/
mycomponent/linux-arm_64/mycooltools.toolplugin

When this package is installed, all files specific to the current architecture, say windows-x86_64, will be copied to the system. When the component is loaded, the platform will select the highest profile the CPU can execute, for example avx2/mycooltools.toolplugin.
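A minimal sketch of composing that layout from a shell script. The directory and file names follow the example above; the copy step is shown only as a comment because the source paths depend on your build setup:

```shell
# Create the per-architecture package directories from the example.
for arch in linux-x86_64 linux-arm_32 linux-arm_64 windows-x86_64; do
  mkdir -p "mycomponent/$arch"
done
# Windows additionally gets one sub-directory per optimized profile.
mkdir -p mycomponent/windows-x86_64/avx mycomponent/windows-x86_64/avx2

# Then copy each build output into place, e.g. (hypothetical paths):
# cp "build/linux-x86_64/mycooltools.toolplugin" "mycomponent/linux-x86_64/"

find mycomponent -type d | sort
```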