wasm-micro-runtime/doc/perf_tune.md


# Tune the performance of running wasm/aot file

There are several common ways to tune the performance:

## 1. Use the wasm-opt tool

Download a binaryen release, and use the wasm-opt tool in it to optimize the wasm file, for example:

```bash
wasm-opt -O4 -o test_opt.wasm test.wasm
```

## 2. Enable the simd128 option when compiling wasm source files

WebAssembly 128-bit SIMD is supported by WAMR on x86-64 and aarch64 targets; enabling it when compiling wasm source files may greatly improve performance. For wasi-sdk and emsdk, add the -msimd128 flag for clang and emcc/em++:

```bash
/opt/wasi-sdk/bin/clang -msimd128 -O3 -o <wasm_file> <c/c++ source files>

emcc -msimd128 -O3 -o <wasm_file> <c/c++ source files>
```

## 3. Enable segue optimization for wamrc when generating the aot file

Segue is an optimization that uses an x86 segment register to store the WebAssembly linear memory base address, removing most of the cost of SFI (Software-based Fault Isolation) base addition and freeing up a general-purpose register. This may:

- Improve the performance of JIT/AOT
- Reduce the footprint of JIT/AOT: the generated JIT/AOT code is smaller
- Reduce the compilation time of JIT/AOT

Currently it is supported on Linux x86-64. Developers can use `--enable-segue=[<flags>]` for wamrc:

```bash
wamrc --enable-segue -o aot_file wasm_file
# or
wamrc --enable-segue=[<flags>] -o aot_file wasm_file
```

`flags` can be: i32.load, i64.load, f32.load, f64.load, v128.load, i32.store, i64.store, f32.store, f64.store and v128.store. Use a comma to separate them, e.g. `--enable-segue=i32.load,i64.store`; `--enable-segue` alone means all flags are enabled.

Note: For most cases `--enable-segue` is enough, but for some cases `--enable-segue=<flags>` may be better; for example, for the CoreMark benchmark, `--enable-segue=i32.store` may lead to better performance than `--enable-segue`.

## 4. Enable segue optimization for iwasm when running the wasm file

Similar to the segue optimization for wamrc, run:

```bash
iwasm --enable-segue wasm_file      # iwasm is built with llvm-jit enabled
# or
iwasm --enable-segue=[<flags>] wasm_file
```

## 5. Use the AOT static PGO method

LLVM PGO (Profile-Guided Optimization) allows the compiler to better optimize code based on how it actually runs. WAMR supports AOT static PGO; currently it is tested on Linux x86-64 and x86-32. The basic steps are:

1. Run `wamrc --enable-llvm-pgo -o <aot_file_of_pgo> <wasm_file>` to generate an instrumented aot file.

2. Compile iwasm with `cmake -DWAMR_BUILD_STATIC_PGO=1` and run `iwasm --gen-prof-file=<raw_profile_file> <aot_file_of_pgo>` to generate the raw profile file.

Note: Directly dumping the raw profile data to the file system may be unsupported in some environments; the developer can instead dump the profile data into a memory buffer and output it through the network (e.g. uart or socket):

```C
uint32_t
wasm_runtime_get_pgo_prof_data_size(wasm_module_inst_t module_inst);

uint32_t
wasm_runtime_dump_pgo_prof_data_to_buf(wasm_module_inst_t module_inst, char *buf, uint32_t len);
```
3. Install or compile the llvm-profdata tool; refer to here for the details.

4. Run `llvm-profdata merge -output=<profile_file> <raw_profile_file>` to merge the raw profile file into the profile file.

5. Run `wamrc --use-prof-file=<profile_file> -o <aot_file> <wasm_file>` to generate the optimized aot file.

6. Run the optimized aot file: `iwasm <aot_file>`.

Developers can refer to the test_pgo.sh files under each benchmark folder for more details, e.g. test_pgo.sh of the CoreMark benchmark.

## 6. Disable the memory boundary check

Please note that this method is not a general solution, since it may lead to security issues. It only boosts performance on some platforms in AOT mode that don't support hardware trap for the memory boundary check.

  1. Build WAMR with the `-DWAMR_CONFIGUABLE_BOUNDS_CHECKS=1` option.

  2. Compile the AOT module with wamrc using the `--bounds-check=0` option.

  3. Run the AOT module with iwasm using the `--disable-bounds-checks` option.

Note: The size of the AOT file will be much smaller than the default, and some tricks become possible, such as letting the wasm application access the memory of the host OS directly. Please note that if this option is enabled, the wasm spec test will fail, since it requires the memory boundary check: for example, in some cases the runtime will crash when accessing memory out of bounds instead of throwing an exception as the spec requires.

You should only use this method for well-tested wasm applications, and make sure the memory access is safe.

## 7. Use linux-perf

Linux perf is a powerful tool to analyze the performance of a program; developers can use it to find hot functions and optimize them. It is one of the profilers supported by WAMR. To use it, add `--perf-profile` when running iwasm. By default it is disabled.

> **Caution**: For now, only llvm-jit mode supports linux-perf.

Here is a basic example. If there is a Wasm application foo.wasm, you'll execute:

```bash
$ perf record --output=perf.data.raw -- iwasm --perf-profile foo.wasm
```

This will create a perf.data and a jit-xxx.dump under the ~/.debug/jit/ folder. The extra file is generated by WAMR at runtime, and it contains the mapping between the JIT code and the original Wasm function names.

The next thing to do is to merge the jit-xxx.dump file into the perf.data:

```bash
$ perf inject --jit --input=perf.data.raw --output=perf.data
```

This step will create many jitted-xxxx-N.so files, which are ELF images for all JIT functions created at runtime.

> **Tip**: Add -v and check whether there is output like write ELF image .... If so, the merge above was successful.

Finally, you can use perf report to analyze the performance:

```bash
$ perf report --input=perf.data
```

> **Caution**: Using release builds of llvm and iwasm will produce "[unknown]" functions in the call graph. This is not only because of missing debug information, but also because frame pointers are removed. To get a complete result, please use debug builds of both llvm and iwasm.

Wasm function names are stored in the custom name section. Toolchains always generate the custom name section in both debug and release builds. However, in release builds the custom name section is stripped to pursue the smallest size. So, if you want an understandable result, please search the toolchain's manual for a way to keep the custom name section.

For example, with EMCC, you can add -g2.

If WAMR is not able to get the contents of the custom name section, it will use aot_func#N to represent the function name, where N starts from 0; aot_func#0 represents the first non-imported wasm function.

### 7.1 Flamegraph

Flamegraph is a powerful tool to visualize stack traces of profiled software so that the most frequent code paths can be identified quickly and accurately. To use it, you need to capture call graphs when running perf record:

```bash
$ perf record -k mono --call-graph=fp --output=perf.data.raw -- iwasm --perf-profile foo.wasm
```

Merge the jit-xxx.dump file into the perf.data:

```bash
$ perf inject --jit --input=perf.data.raw --output=perf.data
```

Generate the stack trace file:

```bash
$ perf script > out.perf
```

Fold the stacks:

```bash
$ ./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
```

Render a flamegraph:

```bash
$ ./FlameGraph/flamegraph.pl out.folded > perf.foo.wasm.svg
```

> **Tip**: Use grep to pick out the folded stacks you are interested in, or to filter something out.

For example, if you just want to see the stacks of wasm functions, you can use:

```bash
# only jitted functions
$ grep "wasm_runtime_invoke_native" out.folded | ./FlameGraph/flamegraph.pl > perf.foo.wasm.only.svg
```

> **Tip**: Use trans_wasm_func_name.py to translate jitted function names to their original wasm function names. It requires wasm-objdump from wabt and a name section in the .wasm file.

The input file is the output of ./FlameGraph/stackcollapse-perf.pl:

```bash
python trans_wasm_func_name.py --wabt_home <wabt-installation> --folded out.folded <.wasm>
```

Then you will see a new file named out.folded.translated, which contains the translated folded stacks. All wasm functions are translated to their original names with a prefix like "[Wasm]".