Description
Is there a recommended way to run data parallel inference (i.e. a copy of the model on each GPU)? It's possible by hacking CUDA_VISIBLE_DEVICES, but I was wondering if there's a cleaner method.
import os
import multiprocessing

from vllm import LLM, SamplingParams

def worker(worker_idx):
    # Pin this worker to a single GPU before vLLM initializes CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(worker_idx)
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate(prompts, sampling_params)

if __name__ == "__main__":
    with multiprocessing.Pool(4) as pool:
        pool.map(worker, range(4))
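For completeness, a minimal sketch of the same hack that also returns the generated text to the parent process. It assumes a "spawn" start method so each worker starts with a fresh CUDA context (the default fork start can misbehave if the parent has already touched CUDA), and the prompt list and model name are just placeholders:

import os
import multiprocessing

def worker(worker_idx):
    # Must be set before anything in this process initializes CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(worker_idx)
    from vllm import LLM, SamplingParams  # import after setting the env var

    prompts = ["Hello, my name is", "The capital of France is"]
    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate(prompts, SamplingParams(temperature=0.8, top_p=0.95))
    # Return plain strings so the results can be pickled back to the parent.
    return [out.outputs[0].text for out in outputs]

if __name__ == "__main__":
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(4) as pool:
        per_gpu_results = pool.map(worker, range(4))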