I've been running it by installing it on the machine and using the make commands. The model responses are lightning fast compared to the transformers pipeline() method. But after a few invocations of generate_stream, I observed that GPU memory becomes fully occupied and the server throws a CUDA OOM error.
Model: OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5
Quantized: yes
GPU: A10G 24GB
OS: Linux (EC2)
I'd appreciate a solution or workaround.
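
For context, here is roughly how I'm invoking the server. This is a minimal sketch, assuming the default HTTP API on port 8080; the prompt and parameters are placeholders. After a handful of such calls, GPU memory fills up:

```
# Hit the streaming endpoint; after a few such calls the GPU OOMs.
# Port and prompt are placeholders. -N disables curl's output
# buffering so tokens are printed as they stream back.
curl -N 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 64}}' \
    -H 'Content-Type: application/json'
```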
Another issue I encountered is that the input token length limit is set to 1000 by default. So if I give some long text as input (some context and a question), I get the error:

```
inputs must have less than 1000 tokens. Given: 1013
```

I've modified the router's Rust code to bypass this validation as a workaround; a launcher-level alternative is sketched below.
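
If the launcher behaves the way its options suggest, the same limit should be adjustable without patching the router. A minimal sketch, assuming text-generation-launcher accepts --max-input-length and --max-total-tokens, and that bitsandbytes is the quantization backend in use (both are assumptions and may differ across versions):

```
# Raise the validation limits at startup instead of editing the Rust code.
# --quantize bitsandbytes is a guess at the quantization backend;
# flag names may vary between text-generation-inference versions.
text-generation-launcher \
    --model-id OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 \
    --quantize bitsandbytes \
    --max-input-length 2000 \
    --max-total-tokens 2512
```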