Just change
AutoModelForCausalLM.from_pretrained(model_id)
to:
AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
In my experience this reaches 65-70 tokens per second, which is about as fast as CTranslate2 with 8-bit quantization.
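For context, here is a minimal end-to-end sketch of the change (the model_id value and prompt are placeholders; note that device_map="auto" requires the accelerate package to be installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-model-id"  # placeholder: substitute whatever causal LM checkpoint you are using

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",           # place the model on available GPU(s) automatically (needs accelerate)
    torch_dtype=torch.bfloat16,  # halves memory vs. float32 and speeds up inference on supported hardware
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```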