Just change
AutoModelForCausalLM.from_pretrained(model_id)
to:
AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
In my experience this reaches 65-70 tokens per second, which is about as fast as CTranslate2 with 8-bit quantization.
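For context, here is a minimal end-to-end sketch of the change (the model_id value and prompt are placeholders; note that device_map="auto" requires the accelerate package to be installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-model-id"  # placeholder: substitute whatever causal LM checkpoint you are using

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",           # place the model on available GPU(s) automatically (needs accelerate)
    torch_dtype=torch.bfloat16,  # halves memory vs. float32 and speeds up inference on supported hardware
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```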