Description
Is there a recommended way to run data parallel inference (i.e. a copy of the model on each GPU)? It's possible by hacking CUDA_VISIBLE_DEVICES, but I was wondering if there's a cleaner method.
import os
import multiprocessing

from vllm import LLM, SamplingParams

def worker(worker_idx):
    # Pin this worker to a single GPU before vLLM initializes CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(worker_idx)
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate(prompts, sampling_params)

if __name__ == "__main__":
    with multiprocessing.Pool(4) as pool:
        pool.map(worker, range(4))
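For completeness, a minimal sketch of the same hack that also returns the generated text to the parent process. It assumes a "spawn" start method so each worker starts with a fresh CUDA context (the default fork start can misbehave if the parent has already touched CUDA), and the prompt list and model name are just placeholders:

import os
import multiprocessing

def worker(worker_idx):
    # Must be set before anything in this process initializes CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(worker_idx)
    from vllm import LLM, SamplingParams  # import after setting the env var

    prompts = ["Hello, my name is", "The capital of France is"]
    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate(prompts, SamplingParams(temperature=0.8, top_p=0.95))
    # Return plain strings so the results can be pickled back to the parent.
    return [out.outputs[0].text for out in outputs]

if __name__ == "__main__":
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(4) as pool:
        per_gpu_results = pool.map(worker, range(4))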