THAPI (Tracing Heterogeneous APIs) is a tracing infrastructure for heterogeneous computing applications. It currently includes backends for:
- CUDA (runtime and driver)
- OpenCL
- Intel Level Zero (L0)
- MPI
- OpenMP
- CXI
Quick usage example:
$ mpirun -n $N -- iprof -- ./a.out
API calls | 1 Hostnames | 1 Processes | 1 Threads
Name | Time | Time(%) | Calls | Average | Min | Max | Failed |
cuDevicePrimaryCtxRetain | 54.64ms | 51.77% | 1 | 54.64ms | 54.64ms | 54.64ms | 0 |
cuMemcpyDtoHAsync_v2 | 24.11ms | 22.85% | 1 | 24.11ms | 24.11ms | 24.11ms | 0 |
[...]
cuDeviceGet | 640.00ns | 0.00% | 1 | 640.00ns | 640.00ns | 640.00ns | 0 |
cuDeviceGetCount | 460.00ns | 0.00% | 1 | 460.00ns | 460.00ns | 460.00ns | 0 |
Total | 105.54ms | 100.00% | 98 | 1 |
More info in the usage section and in our selection of amazing (⸮) talks.
We recommend installing THAPI via Spack.
The THAPI package is not (yet) in upstream Spack. In the meantime, please follow the instructions in THAPI-spack.
Once you have the THAPI-spack repo added to your Spack configuration, you should be able to run:
spack install thapi
If you prefer to build from source, THAPI uses a classic Autotools flow:
./autogen.sh
mkdir build
cd build
../configure --prefix `pwd`/ici
make -j install
Adjust --prefix to your preferred installation directory (and please don't copy my ugly bash with backticks and naming convention...).
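After installation, the tools need to be reachable from your shell. A minimal sketch, assuming the usual Autotools layout where executables land in the bin directory under the prefix (an assumption, not something stated above) and that PREFIX holds the value you passed to --prefix:

```shell
# Assumption: PREFIX is the directory passed to --prefix at configure time.
# "$PWD/ici" mirrors the example above, without the backticks.
PREFIX="${PREFIX:-$PWD/ici}"

# Prepend PREFIX/bin to PATH only if it is not already there.
case ":$PATH:" in
  *":$PREFIX/bin:"*) ;;
  *) PATH="$PREFIX/bin:$PATH" ;;
esac
export PATH

# Report whether iprof can now be found.
if command -v iprof >/dev/null 2>&1; then
  echo "iprof found at $(command -v iprof)"
else
  echo "iprof not found; check that PREFIX matches your --prefix"
fi
```

Putting these lines in your shell startup file (with PREFIX set explicitly) avoids repeating them in every session.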
Dependency details
Packages:
- babeltrace2, libbabeltrace2-dev
- liblttng-ust-dev
- lttng-tools
- ruby, ruby-dev
- libffi, libffi-dev
Note: Some packages should be patched before installation; see the associated Spack package.
Optional packages:
- binutils-dev or libiberty-dev for demangling, depending on the platform (demangle.h)
Ruby Gems:
- cast-to-yaml
- nokogiri
- babeltrace2
- metababel
Optional Gem:
- opencl_ruby_ffi
Optional pip:
- h2yaml
iprof is the main user-facing tool. The typical way to profile an MPI application is:
mpirun -n $N -- iprof -- ./a.out <app-args>
iprof supports three primary output analyses:
- Tally (default): aggregated per-API statistics (time, calls, averages). This is what you get when you run iprof without additional flags.
- Timeline: iprof -l -- ... produces a timeline trace suitable for visualization in tools like Perfetto.
- Detailed traces: iprof -t -- ... produces detailed LTTng traces.
Use iprof --help to get a full list of options.
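The three modes differ only in the flag placed before the -- separator. As a rough illustration, the hypothetical helper below (iprof_cmdline is not part of THAPI) maps a mode name to the matching command line and prints it instead of running it, so it can be tried without THAPI installed:

```shell
# Hypothetical helper (not part of THAPI): print the iprof command line
# for a given analysis mode, using only the -l and -t flags described above.
iprof_cmdline() {
  mode=$1
  shift
  case "$mode" in
    tally)    flag="" ;;      # default analysis, no extra flag
    timeline) flag="-l" ;;    # timeline trace
    trace)    flag="-t" ;;    # detailed LTTng traces
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  # ${flag:+...} drops the flag entirely in tally mode.
  echo iprof ${flag:+"$flag"} -- "$@"
}

iprof_cmdline timeline ./a.out
```

Replacing the echo with a direct call would run the command; the mode names tally, timeline, and trace are labels chosen for this sketch only.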
Tally example:
tapplencourt> iprof ./a.out
API calls | 1 Hostnames | 1 Processes | 1 Threads
Name | Time | Time(%) | Calls | Average | Min | Max | Failed |
cuDevicePrimaryCtxRetain | 54.64ms | 51.77% | 1 | 54.64ms | 54.64ms | 54.64ms | 0 |
cuMemcpyDtoHAsync_v2 | 24.11ms | 22.85% | 1 | 24.11ms | 24.11ms | 24.11ms | 0 |
cuDevicePrimaryCtxRelease_v2 | 18.16ms | 17.20% | 1 | 18.16ms | 18.16ms | 18.16ms | 0 |
cuModuleLoadDataEx | 4.73ms | 4.48% | 1 | 4.73ms | 4.73ms | 4.73ms | 0 |
cuModuleUnload | 1.30ms | 1.23% | 1 | 1.30ms | 1.30ms | 1.30ms | 0 |
cuLaunchKernel | 1.05ms | 0.99% | 1 | 1.05ms | 1.05ms | 1.05ms | 0 |
cuMemAlloc_v2 | 970.60us | 0.92% | 1 | 970.60us | 970.60us | 970.60us | 0 |
cuStreamCreate | 402.21us | 0.38% | 32 | 12.57us | 1.58us | 183.49us | 0 |
cuStreamDestroy_v2 | 103.36us | 0.10% | 32 | 3.23us | 2.81us | 8.80us | 0 |
cuMemcpyDtoH_v2 | 36.17us | 0.03% | 1 | 36.17us | 36.17us | 36.17us | 0 |
cuMemcpyHtoDAsync_v2 | 13.11us | 0.01% | 1 | 13.11us | 13.11us | 13.11us | 0 |
cuStreamSynchronize | 8.77us | 0.01% | 1 | 8.77us | 8.77us | 8.77us | 0 |
cuCtxSetCurrent | 5.47us | 0.01% | 9 | 607.78ns | 220.00ns | 1.74us | 0 |
cuDeviceGetAttribute | 2.71us | 0.00% | 3 | 903.33ns | 490.00ns | 1.71us | 0 |
cuDevicePrimaryCtxGetState | 2.70us | 0.00% | 1 | 2.70us | 2.70us | 2.70us | 0 |
cuCtxGetLimit | 2.30us | 0.00% | 2 | 1.15us | 510.00ns | 1.79us | 0 |
cuModuleGetGlobal_v2 | 2.24us | 0.00% | 2 | 1.12us | 440.00ns | 1.80us | 1 |
cuInit | 1.65us | 0.00% | 1 | 1.65us | 1.65us | 1.65us | 0 |
cuModuleGetFunction | 1.61us | 0.00% | 1 | 1.61us | 1.61us | 1.61us | 0 |
cuFuncGetAttribute | 1.00us | 0.00% | 1 | 1.00us | 1.00us | 1.00us | 0 |
cuCtxGetDevice | 850.00ns | 0.00% | 1 | 850.00ns | 850.00ns | 850.00ns | 0 |
cuDevicePrimaryCtxSetFlags_v2 | 670.00ns | 0.00% | 1 | 670.00ns | 670.00ns | 670.00ns | 0 |
cuDeviceGet | 640.00ns | 0.00% | 1 | 640.00ns | 640.00ns | 640.00ns | 0 |
cuDeviceGetCount | 460.00ns | 0.00% | 1 | 460.00ns | 460.00ns | 460.00ns | 0 |
Total | 105.54ms | 100.00% | 98 | 1 |
Device profiling | 1 Hostnames | 1 Processes | 1 Threads | 1 Device pointers
Name | Time | Time(%) | Calls | Average | Min | Max |
test_target__teams | 25.14ms | 99.80% | 1 | 25.14ms | 25.14ms | 25.14ms |
cuMemcpyDtoH_v2 | 24.35us | 0.10% | 1 | 24.35us | 24.35us | 24.35us |
cuMemcpyDtoHAsync_v2 | 18.14us | 0.07% | 1 | 18.14us | 18.14us | 18.14us |
cuMemcpyHtoDAsync_v2 | 8.77us | 0.03% | 1 | 8.77us | 8.77us | 8.77us |
Total | 25.19ms | 100.00% | 4 |
Explicit memory traffic | 1 Hostnames | 1 Processes | 1 Threads
Name | Byte | Byte(%) | Calls | Average | Min | Max |
cuMemcpyHtoDAsync_v2 | 4.00B | 44.44% | 1 | 4.00B | 4.00B | 4.00B |
cuMemcpyDtoHAsync_v2 | 4.00B | 44.44% | 1 | 4.00B | 4.00B | 4.00B |
cuMemcpyDtoH_v2 | 1.00B | 11.11% | 1 | 1.00B | 1.00B | 1.00B |
Total | 9.00B | 100.00% | 3 |
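The Failed column of the API calls table is easy to overlook in a long tally. A small sketch for pulling out the rows with failures, assuming the tally text has been saved to a file or piped in, and that it uses the pipe-separated layout shown above with Failed as the eighth column (failed_calls is a hypothetical helper, not part of THAPI):

```shell
# Hypothetical helper (not part of THAPI): read tally text on stdin and
# print "<API name> <failed count>" for every row with a non-zero Failed value.
failed_calls() {
  awk -F'|' '
    # Only the API calls table has an 8th column; the Total row and the
    # device and memory tables have fewer fields or an empty 8th field.
    NF >= 8 {
      name = $1
      failed = $8
      gsub(/^[ \t]+|[ \t]+$/, "", name)     # trim surrounding blanks
      gsub(/^[ \t]+|[ \t]+$/, "", failed)
      # The header row is skipped because "Failed" is not numeric.
      if (failed ~ /^[0-9]+$/ && failed + 0 > 0) print name, failed
    }'
}

# Example usage, assuming the tally was saved to tally.txt:
#   failed_calls < tally.txt
```

Applied to the sample output above, the only row reported would be cuModuleGetGlobal_v2 with a count of 1.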
Timeline example:
iprof -l -- ./a.out
# produces a .pb or trace file that can be opened with Perfetto UI:
# https://ui.perfetto.dev/
Detailed traces example:
iprof -t -- ./a.out
For development and quick experiments (and for bash lovers), THAPI provides backend-specific wrapper scripts named tracer_$backend.sh (for example tracer_opencl.sh, tracer_cuda.sh, ...).
These are small helper scripts around LTTng that let you manually tune which events are traced and how.
Example usage help for tracer_opencl.sh:
tracer_opencl.sh [options] [--] <application> <application-arguments>
--help Show this screen
--version Print the version string
-l, --lightweight Filter out some high traffic functions
-p, --profiling Enable profiling
-s, --source Dump program sources to disk
-a, --arguments Dump argument and kernel infos
-b, --build Dump program build infos
-h, --host-profile Gather precise host profiling information
-d, --dump Dump kernels input and output to disk
-i, --iteration VALUE Dump inputs and outputs for kernel with enqueue counter VALUE
-s, --iteration-start VALUE Dump inputs and outputs for kernels starting with enqueue counter VALUE
-e, --iteration-end VALUE Dump inputs and outputs for kernels until enqueue counter VALUE
-v, --visualize Visualize trace on the fly
--devices Dump devices information
Traces can be viewed using EfficiOS's babeltrace2 or our own babeltrace_thapi. The latter should give more structured information at the cost of speed.