
GPU Memory Cache not cleared. #209


Description

@sonsai123

I've been running it by installing it on a machine and using the make commands. The model responses are lightning fast compared to the transformers pipeline() method. But after a few invocations using generate_stream, I observed that GPU memory becomes fully occupied and the server throws a CUDA OOM error.

Model: OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5
Quantized: yes
GPU: A10G 24GB
OS: Linux (EC2)

I'd appreciate a solution or workaround.
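A minimal sketch of one mitigation I could try between requests, assuming the serving process uses PyTorch: ask the allocator to return its cached, unreferenced blocks to the driver. `release_gpu_cache` is a hypothetical helper name, not part of the repo; note that `torch.cuda.empty_cache()` only frees blocks that are cached but no longer referenced, so it will not help if the server itself is holding on to tensors.

```python
# Sketch only: release PyTorch's cached GPU blocks between requests.
# Assumes the serving code runs on PyTorch; `release_gpu_cache` is a
# hypothetical helper, not an API of this repository.
try:
    import torch
    HAVE_TORCH = True
except ImportError:
    HAVE_TORCH = False


def release_gpu_cache() -> bool:
    """Attempt to flush the CUDA caching allocator.

    Returns True if a flush was attempted, False if torch/CUDA is
    unavailable. empty_cache() only returns *unreferenced* cached
    blocks to the driver; it cannot free tensors still held by the
    server, so a steady climb in usage points at a reference leak.
    """
    if HAVE_TORCH and torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False
```

Comparing `torch.cuda.memory_allocated()` against `torch.cuda.memory_reserved()` before and after a call can distinguish allocator caching (expected, harmless) from a genuine leak (tensors never released).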

Another issue I encountered is that the input token length limit is set to 1000 by default. So if I give some long text as input (some context plus a question), I get the error:
inputs must have less than 1000 tokens. Given: 1013.
As a workaround, I've modified the router's Rust code to bypass this validation.
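Rather than patching the router's Rust validation, the limit may be configurable at launch time. A sketch under the assumption that the launcher exposes input/total token flags; the flag names and values below are assumptions to verify against `text-generation-launcher --help` for your version:

```shell
# Assumed launcher flags (verify with --help); values are illustrative.
text-generation-launcher \
  --model-id OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 \
  --quantize bitsandbytes \
  --max-input-length 2000 \
  --max-total-tokens 2500
```

Raising these limits also raises peak KV-cache memory per request, which interacts with the OOM problem above, so the values need to fit the 24 GB budget of the A10G.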
