2 changes: 2 additions & 0 deletions docs/design/torch_compile.md
@@ -27,6 +27,8 @@ With all these factors taken into consideration, usually we can guarantee that t

A unique aspect of vLLM's `torch.compile` integration is that we guarantee all compilation finishes before we serve any requests. No request will trigger new compilation. Otherwise, the engine would be blocked on that request, and the response time would spike unexpectedly.

By default, the cache saves compiled artifacts as binary files. If you would like to interact with the generated code for debugging purposes, set `VLLM_COMPILE_CACHE_SAVE_FORMAT=unpacked`.
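
For example, here is a minimal sketch of switching to the unpacked format for a debugging session (assuming the offline `LLM` entry point; the model name is a placeholder):

```python
import os

# Select the unpacked format before vLLM reads its environment configuration,
# so compiled artifacts are written as a directory of generated Inductor code
# that you can open and set breakpoints in.
os.environ["VLLM_COMPILE_CACHE_SAVE_FORMAT"] = "unpacked"

from vllm import LLM

llm = LLM(model="facebook/opt-125m")
```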
@zou3519 (Collaborator) commented on Oct 31, 2025:

Why do we want the binary option to be the default over the unpacked option?

@youkaichao I think you had strong opinions over this before, so would like to get your opinion.
We're deciding between having vLLM save compiled artifacts in "unpacked" mode or "binary" mode by default.

In "unpacked mode", you are able to put breakpoints into the inductor output code and look at it.
In "binary mode", you're not able to put breakpoints into the inductor output code or look at it. However, you gain the ability to launch vLLM processes at the same time to compile the same exact model without clobbering each other.

So I think the question is how much we value the ability to launch multiple vLLM processes performing compilation at the same time.

@ahao-anyscale (Contributor, PR author) commented on Oct 31, 2025:

I think it's a bit less confusing to have the binary option as the default, since I assume most users will not be putting breakpoints in Inductor output. On the flip side, I think it is reasonable that people may be spawning multiple model replicas simultaneously, and they would be very confused to see strange race-condition errors with no warning.
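
To make the clobbering concern concrete, here is an illustrative sketch (not vLLM's actual code; the helper name and layout are hypothetical) of why a single binary artifact can be written safely by concurrent processes, while an unpacked directory of many files cannot:

```python
import os
import tempfile

def save_binary_atomically(cache_dir: str, key: str, payload: bytes) -> None:
    """Write one artifact file via a temp file + rename.

    Concurrent savers of the same key each produce a complete file, and the
    final rename is atomic, so readers never observe a half-written artifact.
    An "unpacked" save has to create many files inside one directory, which
    two processes can interleave and clobber.
    """
    os.makedirs(cache_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    os.replace(tmp_path, os.path.join(cache_dir, key))
```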


## Python Code Compilation

In the very verbose logs, we can see:
4 changes: 2 additions & 2 deletions vllm/compilation/compiler_interface.py
@@ -220,7 +220,7 @@ def compile(
         assert key is not None
         path = os.path.join(self.cache_dir, key)
         if not envs.VLLM_DISABLE_COMPILE_CACHE:
-            compiled_graph.save(path=path, format="unpacked")
+            compiled_graph.save(path=path, format=envs.VLLM_COMPILE_CACHE_SAVE_FORMAT)
             compilation_counter.num_compiled_artifacts_saved += 1
         return compiled_graph, (key, path)

@@ -237,7 +237,7 @@ def load(
         assert isinstance(handle[1], str)
         path = handle[1]
         inductor_compiled_graph = torch._inductor.CompiledArtifact.load(
-            path=path, format="unpacked"
+            path=path, format=envs.VLLM_COMPILE_CACHE_SAVE_FORMAT
         )
         from torch._inductor.compile_fx import graph_returns_tuple

10 changes: 10 additions & 0 deletions vllm/envs.py
@@ -218,6 +218,7 @@
     VLLM_USE_FBGEMM: bool = False
     VLLM_GC_DEBUG: str = ""
     VLLM_DISABLE_SHARED_EXPERTS_STREAM: bool = False
+    VLLM_COMPILE_CACHE_SAVE_FORMAT: Literal["binary", "unpacked"] = "binary"


 def get_default_cache_root():
@@ -1408,6 +1409,15 @@ def get_vllm_port() -> int | None:
"VLLM_DISABLE_SHARED_EXPERTS_STREAM": lambda: os.getenv(
"VLLM_DISABLE_SHARED_EXPERTS_STREAM", False
),
# Format for saving torch.compile cache artifacts
# - "binary": saves as binary file
# Safe for multiple vllm serve processes accessing the same torch compile cache.
# - "unpacked": saves as directory structure (for inspection/debugging)
# NOT multiprocess safe - race conditions may occur with multiple processes.
# Allows viewing and setting breakpoints in Inductor's code output files.
"VLLM_COMPILE_CACHE_SAVE_FORMAT": env_with_choices(
"VLLM_COMPILE_CACHE_SAVE_FORMAT", "binary", ["binary", "unpacked"]
),
}

# --8<-- [end:env-vars-definition]
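
For reference, a rough sketch of what an `env_with_choices`-style reader could look like; this is an assumption for illustration, and vLLM's actual helper may differ in its details:

```python
import os
from typing import Callable

def env_with_choices(
    name: str, default: str, choices: list[str]
) -> Callable[[], str]:
    """Return a lazy reader that validates the variable against an allow-list."""

    def read() -> str:
        value = os.getenv(name, default)
        if value not in choices:
            raise ValueError(f"{name} must be one of {choices}, got {value!r}")
        return value

    return read

# Usage mirroring the entry above:
read_format = env_with_choices(
    "VLLM_COMPILE_CACHE_SAVE_FORMAT", "binary", ["binary", "unpacked"]
)
print(read_format())  # "binary" unless the variable is set to "unpacked"
```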