I've been running it by installing it on the machine and using the make commands. The model responses are lightning fast compared to the transformers pipeline() method. But after a few invocations of generate_stream, I observed that GPU memory becomes fully occupied and the server throws a CUDA OOM error.
Model: OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5
Quantized: yes
GPU: A10G 24GB
OS: Linux (EC2)
I'd appreciate a solution or workaround.
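
For context, here is roughly how I'm invoking the server. This is a minimal sketch, assuming the default HTTP API on port 8080; the prompt and parameters are placeholders. After a handful of such calls, GPU memory fills up:

```
# Hit the streaming endpoint; after a few such calls the GPU OOMs.
# Port and prompt are placeholders. -N disables curl's output
# buffering so tokens are printed as they stream back.
curl -N 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 64}}' \
    -H 'Content-Type: application/json'
```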
Another issue I encountered is that the input token length limit is set to 1000 by default. So if I give some long text as input (some context and a question), I get the error:

```
inputs must have less than 1000 tokens. Given: 1013
```

I've modified the router's Rust code to bypass this validation as a workaround; a launcher-level alternative is sketched below.
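
If the launcher behaves the way its options suggest, the same limit should be adjustable without patching the router. A minimal sketch, assuming text-generation-launcher accepts --max-input-length and --max-total-tokens, and that bitsandbytes is the quantization backend in use (both are assumptions and may differ across versions):

```
# Raise the validation limits at startup instead of editing the Rust code.
# --quantize bitsandbytes is a guess at the quantization backend;
# flag names may vary between text-generation-inference versions.
text-generation-launcher \
    --model-id OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 \
    --quantize bitsandbytes \
    --max-input-length 2000 \
    --max-total-tokens 2512
```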