This artifact is open sourced at https://github.com/ChandlerGuan/mercury_artifact.
In this artifact, we target the Available and Functional badges for the proposed Mercury compiler.
The artifact is developed and tested on Ubuntu 22.04.
The experiments are conducted on servers equipped with:
- CPU: AMD EPYC 9534 64-Core Processor
- GPU: 8 x NVIDIA H100 80GB HBM3 interconnected with NVLink.
The code can be adapted to other multi-GPU environments, but performance may vary. A minimal setup requires at least 2 GPUs.
To evaluate the artifact, we provide a Docker image that contains the required environment and scripts. Please run the following commands in the artifact folder. First, build the Docker image:
docker build -t mercury_artifact .
Then, start the Docker container:
docker run -it --rm --gpus all mercury_artifact
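If you want to expose only a subset of the GPUs to the container (for example, for a minimal 2-GPU setup), the --gpus flag also accepts an explicit device list; the device indices below are only an illustration:

docker run -it --rm --gpus '"device=0,1"' mercury_artifact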
In the following commands, several common command line arguments are used to specify the execution environment. --nnodes specifies the number of nodes used to run the code, and --nproc_per_node specifies the number of processes launched on each node. The CUDA_VISIBLE_DEVICES environment variable selects which GPUs are used for the execution.
To run the end-to-end code generation example with the proposed CommIR, execute the following command inside the Docker container. Please adjust CUDA_VISIBLE_DEVICES and the --nproc_per_node parameter according to your hardware configuration.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nnodes=1 --nproc_per_node=8 ./example.py
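For instance, a minimal 2-GPU run (assuming the GPUs are indexed 0 and 1) would look like:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --nproc_per_node=2 ./example.py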
This example parallelizes an attention kernel to multiple GPUs and applies a pre-defined ring-attention style communication transformation schedule.
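To give a concrete sense of what a ring-attention style communication step involves, the sketch below uses plain torch.distributed point-to-point operations. It is a generic illustration with our own naming, not code extracted from the artifact: each rank forwards its current key/value block to the next rank in the ring and receives the block arriving from the previous rank.

```python
import torch
import torch.distributed as dist

def ring_exchange(kv_block: torch.Tensor) -> torch.Tensor:
    """One ring step: send our KV block to the next rank and receive
    the block coming from the previous rank (illustrative sketch)."""
    rank, world = dist.get_rank(), dist.get_world_size()
    recv_buf = torch.empty_like(kv_block)
    ops = [
        dist.P2POp(dist.isend, kv_block, (rank + 1) % world),
        dist.P2POp(dist.irecv, recv_buf, (rank - 1) % world),
    ]
    # Issue both transfers together and wait for completion.
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return recv_buf
```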
The script will first print the generated PyTorch code for the parallelized attention kernel and then execute it. A successful run completes without errors, and the numerical difference between its output and that of single-node flash attention should be relatively small.
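As a rough guide to interpreting that difference, a tolerance check along the following lines could be used; the function name and tolerances are illustrative assumptions, not part of example.py:

```python
import torch

def check_outputs_close(parallel_out: torch.Tensor, reference_out: torch.Tensor,
                        rtol: float = 1e-2, atol: float = 1e-2) -> None:
    # Print the largest elementwise deviation between the distributed result
    # and the single-node flash attention reference, then assert they agree
    # within loose tolerances.
    max_abs_diff = (parallel_out - reference_out).abs().max().item()
    print(f"max abs diff: {max_abs_diff:.3e}")
    assert torch.allclose(parallel_out, reference_out, rtol=rtol, atol=atol)
```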
We provide a set of tests to verify the correctness of different components of the Mercury compiler. We highlight several important tests that can be run as follows.
The first test, tests/test_search.py, runs a search to demonstrate the end-to-end functionality. Its primary purpose is to showcase how the search is performed and what the output looks like.

python tests/test_search.py

The second test, tests/test_search_validation.py, is more rigorous and is designed to validate the numerical correctness of the compiler's transformations. It explores the search space and verifies that the generated code produces results that are numerically equivalent to a baseline implementation. This ensures the integrity of the optimization logic.
This validation process requires compiling and executing across several GPUs, and is therefore launched with torchrun.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nnodes=1 --nproc_per_node=8 tests/test_search_validation.py

For all tests, a successful execution runs a series of checks and exits without reporting any errors.
The Mercury codebase is organized into the following main folders:
- benchmark/: Contains scripts for performance evaluation of different parallelization strategies and generated kernels.
- mercury/: The core implementation of the Mercury compiler.
  - frontend/: Parses high-level model descriptions (e.g., from DSLs) and lowers them into Mercury's internal representation (CommIR).
  - ir/: Defines the Communication-aware Intermediate Representation (CommIR), its nodes, and transformation passes for optimizing communication.
  - backend/: Generates target-specific code (e.g., PyTorch with torch.distributed) from the optimized CommIR.
  - search/: Implements the search algorithm to explore the space of possible parallelization strategies and find efficient communication schedules.
- tests/: Includes a comprehensive suite of unit tests to ensure the correctness of the IR, transformations, code generation, and search components.
- utils/: Provides common helper functions and Domain-Specific Language (DSL) examples for defining models like attention and GEMM.
- example.py: An end-to-end demo that showcases how to use Mercury to parse, transform, and generate code for a parallel attention kernel.
The file example.py serves as a minimal working example. It demonstrates the full pipeline:
- Defining a model using the provided DSL.
- Lowering the model to CommIR.
- Applying a manual parallelization and communication schedule.
- Generating and executing the final PyTorch code.
You can extend this example by:
- Modifying the Model: Change the parameters or structure of the flash_attn_pack_kv_template function within example.py to define a different model.
- Exploring Transformations: Modify the transformation schedule applied in the example. Instead of a manual schedule, you can integrate the search module (mercury.search) to automatically find an optimal schedule.
- Defining New Kernels: Use the DSL helpers in utils/ to define new computational kernels and write a new script similar to example.py to parallelize them.
This project is licensed under the MIT License. Please see the LICENSE file for details. We welcome the community to use, compare, and extend this artifact for research purposes.
Since we use a base Docker image provided by NVIDIA, you need to log in to the NVIDIA NGC container registry (nvcr.io) before building the image. To do this:
- Create a free NVIDIA NGC account at https://ngc.nvidia.com/signup if you don't have one.
- Get your NGC API key from https://ngc.nvidia.com/setup/api-key.
- Log in to the NGC registry using Docker (replace <your_api_key> with your actual key):
docker login nvcr.io -u '$oauthtoken' -p <your_api_key>
- Re-run the docker build command:
docker build -t mercury_artifact .
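Optionally, to avoid exposing the API key on the command line, docker login can read it from standard input; the key file name below is just an example:

cat ngc_api_key.txt | docker login nvcr.io -u '$oauthtoken' --password-stdin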