Unicode Error for Hindi transcription 

When transcribing a Hindi audio file, I get invalid Unicode characters in the output.

<img width="753" alt="Screenshot 2023-12-29 at 8 29 09 PM" src="https://github.com/ggerganov/whisper.cpp/assets/7852108/340f9bab-4299-4103-9055-fa5a9db4e989">

I have noticed this with many Hindi files. 

I used the whisper-large-v2 model for inference on CPU, and I see the same issue when running inference on GPU.

My guess is that the Whisper model's token output (BPE-encoded) is not being mapped correctly to Unicode characters.
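To illustrate what I suspect, here is a minimal Python sketch. It is not whisper.cpp's actual decoding code. It assumes the tokenizer works on UTF-8 bytes, so a token boundary can fall in the middle of a multi-byte Devanagari character. The sample text and the split position are hypothetical, chosen only to reproduce the symptom.

```python
import codecs

text = "नमस्ते"
raw = text.encode("utf-8")  # each Devanagari code point takes 3 bytes in UTF-8

# Hypothetical token boundary that falls inside the second character.
tokens = [raw[:4], raw[4:]]

# Decoding each token on its own corrupts the character that was split.
per_token = "".join(t.decode("utf-8", errors="replace") for t in tokens)

# Joining the bytes first and decoding once recovers the original text.
joined = b"".join(tokens).decode("utf-8")

# An incremental decoder buffers incomplete sequences across token boundaries.
decoder = codecs.getincrementaldecoder("utf-8")()
streamed = "".join(decoder.decode(t) for t in tokens)
streamed += decoder.decode(b"", final=True)

print(repr(per_token))  # contains U+FFFD replacement characters
print(joined == text, streamed == text)
```

If this is what is happening, text printed token by token would show replacement characters, while text assembled from the full byte sequence of a segment would not.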



Unicode Error for Hindi transcription #1700
