Issue encountered
With the vLLM backend, there is currently no way for us to control the batch size defined here, and the vLLM model config does not expose a way to set a specific batch size. However, vLLM itself lets us control the maximum number of sequences per batch (effectively the batch size) directly, as seen in examples such as this.
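For reference, a minimal sketch of how vLLM exposes this knob: `max_num_seqs` is a real vLLM engine argument that caps how many sequences the engine schedules in a batch (the model name below is just an illustrative placeholder).

```python
from vllm import LLM, SamplingParams

# max_num_seqs caps the number of sequences vLLM schedules per batch,
# which is the effective batch-size limit we want to control.
llm = LLM(model="facebook/opt-125m", max_num_seqs=64)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
```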
Solution/Feature
- Propagate the `max_num_seqs` parameter into the initialization of the vLLM model (see the sketch below).
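A hypothetical sketch of the propagation, assuming a backend wrapper class; `VLLMBackend` and its signature are illustrative assumptions, not the project's actual API — only the vLLM call itself is real.

```python
from typing import Optional

from vllm import LLM


class VLLMBackend:
    """Hypothetical wrapper that forwards a batch-size cap to vLLM."""

    def __init__(self, model_id: str, max_num_seqs: Optional[int] = None) -> None:
        engine_kwargs = {}
        if max_num_seqs is not None:
            # Forward the cap on concurrent sequences straight to vLLM,
            # which uses it as the effective batch-size limit.
            engine_kwargs["max_num_seqs"] = max_num_seqs
        self.llm = LLM(model=model_id, **engine_kwargs)


# Usage: callers can now bound the vLLM batch size at construction time.
backend = VLLMBackend("facebook/opt-125m", max_num_seqs=32)
```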
Possible alternatives
- An alternative is to implement batching ourselves, but that is overkill since the vLLM backend already supports it.