
Data parallel inference #1237

@kevinhu

Description


Is there a recommended way to run data parallel inference (i.e. a copy of the model on each GPU)? It's possible by hacking CUDA_VISIBLE_DEVICES, but I was wondering if there's a cleaner method.

import multiprocessing
import os

from vllm import LLM, SamplingParams


def worker(worker_idx):
    # Pin this worker process to a single GPU before vLLM initializes CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(worker_idx)
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    # Each process builds its own copy of the model on its assigned GPU.
    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate(prompts, sampling_params)


if __name__ == "__main__":
    # Launch one worker per GPU.
    with multiprocessing.Pool(4) as pool:
        pool.map(worker, range(4))
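
For reference, a Ray-based sketch of the same idea (just an assumption on my part that this composes cleanly with vLLM; Ray sets CUDA_VISIBLE_DEVICES for each num_gpus=1 task itself, so the manual device bookkeeping goes away):

import ray
from vllm import LLM, SamplingParams

ray.init()

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]


@ray.remote(num_gpus=1)
def run_shard(shard):
    # Ray exports CUDA_VISIBLE_DEVICES for this task, so each model copy
    # lands on its own GPU.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate(shard, sampling_params)
    return [output.outputs[0].text for output in outputs]


# Split the prompts into one shard per GPU (4 GPUs assumed).
shards = [prompts[i::4] for i in range(4)]
results = ray.get([run_shard.remote(shard) for shard in shards])

This works in my testing only as a sketch, though, so I'd still like to know whether there's an officially recommended pattern.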
