Merged

Commits (25)
4767223
Fix MSVC compile error C3688
bobqianic Jul 26, 2023
38eaeff
Significantly improve inference quality
bobqianic Aug 1, 2023
2dd6884
Merge branch 'ggerganov:master' into master
bobqianic Aug 1, 2023
4ebe450
Significantly improve inference quality
bobqianic Aug 2, 2023
6f445d1
Addressed a few minor issues
bobqianic Aug 2, 2023
7f690dd
Significantly improve inference quality
bobqianic Aug 2, 2023
527d7c6
Merge branch 'ggerganov:master' into master
bobqianic Aug 3, 2023
f3e7774
Add annotation and performance improvement
bobqianic Aug 3, 2023
95be6dc
Calculate FFT only when fft_in are not all zero
bobqianic Aug 3, 2023
bd1dbd1
Some minor performance improvement
bobqianic Aug 4, 2023
2c49c9b
Fixed a bug impacting inference quality
bobqianic Aug 4, 2023
5df242c
Merge branch 'ggerganov:master' into master
bobqianic Aug 11, 2023
e40ec27
The first version after all the analysis is completed.
bobqianic Aug 11, 2023
715bf61
Fix some bugs and add debug mode
bobqianic Aug 12, 2023
3fe41d5
Fixed several bugs
bobqianic Aug 12, 2023
36b0df7
Temporarily disable speed-up mode and add debug mode.
bobqianic Aug 13, 2023
444b59a
Add debug mode
bobqianic Aug 13, 2023
308f490
Disable speed-up mode and add debug mode
bobqianic Aug 13, 2023
252f807
Fix CI error (#1)
bobqianic Aug 13, 2023
0a5f435
Fixed several bugs including [BLANK_AUDIO] problem
bobqianic Aug 13, 2023
65fd0e1
Remove Hard-coded hann window
bobqianic Aug 13, 2023
386ef32
Some Final Fix (#2)
bobqianic Aug 14, 2023
241df86
Merge branch 'master' into master
bobqianic Aug 25, 2023
22d348c
whisper : minor coding style changes
ggerganov Aug 27, 2023
590a12e
whisper : remove debug from public API
ggerganov Aug 27, 2023
37 changes: 20 additions & 17 deletions whisper.cpp
@@ -2396,12 +2396,10 @@ static void fft(const std::vector<float> & in, std::vector<float> & out) {
even.reserve(N/2);
odd.reserve(N/2);

for (int i = 0; i < N; i++) {
if (i % 2 == 0) {
even.push_back(in[i]);
} else {
odd.push_back(in[i]);
}
//
for (int i = 0; i < N; i+=2) {
even.push_back(in[i]);
odd.push_back(in[i + 1]);
}

std::vector<float> even_fft;
@@ -2424,40 +2422,45 @@ static void fft(const std::vector<float> & in, std::vector<float> & out) {

out[2*(k + N/2) + 0] = even_fft[2*k + 0] - re*re_odd + im*im_odd;
out[2*(k + N/2) + 1] = even_fft[2*k + 1] - re*im_odd - im*re_odd;

}
}

static void log_mel_spectrogram_worker_thread(int ith, const std::vector<float> &hann, const float *samples,
int n_samples, int fft_size, int fft_step, int n_threads,
const whisper_filters &filters, bool speed_up, whisper_mel &mel) {
std::vector<float> fft_in(fft_size, 0.0);
std::vector<float> fft_out(2 * fft_size);
int n_fft = 1 + (speed_up ? fft_size / 4 : fft_size / 2);

for (int i = ith; i < mel.n_len; i += n_threads) {
std::vector<float> fft_in(fft_size, 0.0);
std::vector<float> fft_out(2 * fft_size);
Member:
What is the benefit of moving these vectors inside the loop?

bobqianic (Collaborator, Author), Aug 3, 2023:
Firstly, moving std::vector<float> fft_in(fft_size, 0.0); inside the loop takes advantage of the constructor's built-in zero-fill when the vector is created (see "Fastest way to reset every value of std::vector to 0"), which in theory should be faster than zeroing the buffer manually with a for loop. This section of code could in theory be optimized further by removing the if statement. The reason for moving std::vector<float> fft_out(2 * fft_size); is that during debugging I found that fft_out kept growing after the FFT computation, far exceeding the theoretical length of 800.

Sorry... I've conducted further testing, and it appears that my initial reasons for moving them into the loop were not correct.
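
As a side note (not part of the PR), the two ways of getting a zeroed scratch buffer per iteration alluded to above can be sketched as follows; which one is faster depends mainly on allocator behavior, which may be why the measurements were inconclusive:

```cpp
// Illustration only: two equivalent ways to obtain a zeroed scratch buffer
// per iteration. Reusing one buffer avoids repeated heap allocation.
#include <algorithm>
#include <vector>

void per_frame_scratch(int n_frames, int fft_size) {
    // (a) construct inside the loop: zero-initialized every iteration,
    //     but pays for an allocation each time
    for (int i = 0; i < n_frames; i++) {
        std::vector<float> fft_in(fft_size, 0.0f);
        // ... fill fft_in, run FFT ...
    }

    // (b) allocate once, reset with std::fill every iteration
    std::vector<float> fft_in(fft_size, 0.0f);
    for (int i = 0; i < n_frames; i++) {
        std::fill(fft_in.begin(), fft_in.end(), 0.0f);
        // ... fill fft_in, run FFT ...
    }
}

int main() {
    per_frame_scratch(4, 400);
    return 0;
}
```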


const int offset = i * fft_step;

// apply Hanning window
for (int j = 0; j < fft_size; j++) {
if (offset + j < n_samples) {
fft_in[j] = hann[j] * samples[offset + j];
} else {
fft_in[j] = 0.0;
break;
}
}

// FFT -> mag^2
fft(fft_in, fft_out);

// Calculate modulus of complex numbers
for (int j = 0; j < fft_size; j++) {
fft_out[j] = (fft_out[2 * j + 0] * fft_out[2 * j + 0] + fft_out[2 * j + 1] * fft_out[2 * j + 1]);
}
for (int j = 1; j < fft_size / 2; j++) {
fft_out[j] += fft_out[fft_size - j];

// The frequency spectrum produced by real input data is symmetrical around the Nyquist frequency.
// This is where the actual issue lies
for (int j = 0; j < fft_size / 2; j++) {
fft_out[j] = (fft_out[fft_size - j - 1] + fft_out[j + 1]) / 2;
}

Reviewer:

Help me out a bit here - it's very unclear to me what this code is here for. We get the symmetrical spectrum in fft_out[0...fft_size] for free from the previous loop. Not only that, but inference seems to work fine with it commented out entirely. What's it supposed to be doing?

bobqianic (Collaborator, Author):

Your thinking is correct. Theoretically, this step is not needed at all. Because the FFT_SIZE is 400, FFT_OUT[200] is the Nyquist Component; FFT_OUT[0] is the DC Component, which is the average of all 400 samples in FFT_IN. The remaining parts, FFT_OUT[1...199] and FFT_OUT[201...399], are symmetrical. I guess the original intention of the code was just to make the result more accurate, so the symmetrical parts were added together? But this is obviously unreasonable, because the amplitude of FFT_OUT[0] does not match the rest. So I made a modification, turning it into adding them together and then dividing by 2. However, I made a guess that perhaps FFT_OUT[0], as the DC Component, has no actual meaning? So I shifted the whole thing forward by one position, and the effect was surprisingly good, so I kept it. But I don't have enough evidence to support this guess yet.

https://github.com/ggerganov/whisper.cpp/blob/a4bb2df36aeb4e6cfb0c1ca9fbcf749ef39cc852/whisper.cpp#L2455-L2457
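
As a standalone illustration of the symmetry under discussion (not part of the PR), a naive DFT of a real signal shows power[k] == power[N - k] for k = 1..N/2-1, so the folding/averaging loop operates on values that should already be identical:

```cpp
// Standalone illustration (not from the PR): for real input of even length N,
// the DFT power spectrum is symmetric, power[k] == power[N - k] for k = 1..N/2-1,
// so only bins 0..N/2 (DC through Nyquist) carry independent information.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int    N  = 16;                  // small even length for the demo
    const double PI = 3.14159265358979323846;

    std::vector<double> x(N);
    for (int n = 0; n < N; n++) {          // arbitrary real test signal
        x[n] = std::sin(2.0 * PI * 3 * n / N) + 0.5 * std::cos(2.0 * PI * 5 * n / N);
    }

    std::vector<double> power(N);
    for (int k = 0; k < N; k++) {          // naive DFT, fine for a demo
        double re = 0.0, im = 0.0;
        for (int n = 0; n < N; n++) {
            re += x[n] * std::cos(2.0 * PI * k * n / N);
            im -= x[n] * std::sin(2.0 * PI * k * n / N);
        }
        power[k] = re*re + im*im;
    }

    for (int k = 1; k < N/2; k++) {
        std::printf("k=%2d  power[k]=%10.6f  power[N-k]=%10.6f\n", k, power[k], power[N - k]);
    }
    return 0;
}
```

If the two columns ever differed on a correct power spectrum, that would indicate an FFT bug, which is exactly the point raised further down in this thread.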

bobqianic (Collaborator, Author):

I think the most important thing is to establish a complete and rigorous WER (Word Error Rate) evaluation framework, so that we can compare quality quantitatively. Otherwise it would be too subjective.

Reviewer:

There are three things to play off here:

  1. The word error rate
  2. Byte-for-byte matching OpenAI's mel implementation choice
  3. Theoretical approach at this step in particular

While 1 is what we're after, and using that as a backstop is undoubtedly going to pay off, it feels like 2 would be a very useful stopping-off point to check, so we know we're at least putting the same things into the black box as they are; and 3 would help us avoid bugs on the way there.

What worries me here is this:

> I guess the original intention of the code was just to make the result more accurate, so the symmetrical parts were added together?

If the things being averaged weren't already identical numbers, then wouldn't that point to a bug in the FFT implementation? Averaging them should always have been a no-op.

> However, I made a guess that perhaps FFT_OUT[0], as the DC Component, has no actual meaning?

That's not quite true. fft_out[0] is referred to as the "dc component" as a shorthand, but remember that it's actually more accurately the bin which happens to centre at 0Hz. It contains information for all frequencies from 0 up to half the bin width. With 400 samples at 16kHz the bin width is 40Hz, so fft_out[0] has everything up to 20Hz in it, and it also gets any signal power at the sample frequency aliased into it. I don't think I'd make a bet that neither of those things matters. I know human vocal cords don't typically go that low, but I wouldn't write off there being transient phoneme information in the DC component.
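
For reference, the spacing being described, assuming the 400-sample FFT at 16 kHz mentioned above:

$$
\Delta f = \frac{f_s}{N_{\text{fft}}} = \frac{16000\ \text{Hz}}{400} = 40\ \text{Hz},
\qquad
\text{bin } 0 \text{ covers } \left[0,\ \tfrac{\Delta f}{2}\right) = [0,\ 20\ \text{Hz})
$$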

bobqianic (Collaborator, Author):

Regarding how to measure the word error rate, I found that there are many well-annotated datasets online, such as Mozilla's Common Voice. An article is divided into many sentences, read by different users, ensuring a diversity of accents, and including not just English but many other languages. The dataset is under CC-0, so there won't be any potential legal issues. Therefore, we can write a script in Python to batch test the WER of whisper.cpp as well as OpenAI's whisper.
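
Not part of the PR, but for concreteness, here is a minimal sketch of the metric such a script would compute: WER as word-level Levenshtein distance divided by the number of reference words. The function name and example strings are hypothetical, and a real evaluation would also normalize casing and punctuation before comparing.

```cpp
// Minimal WER sketch (illustration only, not part of the PR):
// WER = (substitutions + insertions + deletions) / number of reference words,
// i.e. word-level Levenshtein distance divided by the reference length.
#include <algorithm>
#include <cstdio>
#include <sstream>
#include <string>
#include <vector>

static std::vector<std::string> split_words(const std::string & text) {
    std::istringstream ss(text);
    std::vector<std::string> words;
    std::string w;
    while (ss >> w) words.push_back(w);
    return words;
}

static double wer(const std::string & reference, const std::string & hypothesis) {
    const auto ref = split_words(reference);
    const auto hyp = split_words(hypothesis);
    // dp[i][j] = edit distance between ref[0..i) and hyp[0..j)
    std::vector<std::vector<int>> dp(ref.size() + 1, std::vector<int>(hyp.size() + 1, 0));
    for (size_t i = 0; i <= ref.size(); i++) dp[i][0] = (int) i;
    for (size_t j = 0; j <= hyp.size(); j++) dp[0][j] = (int) j;
    for (size_t i = 1; i <= ref.size(); i++) {
        for (size_t j = 1; j <= hyp.size(); j++) {
            const int cost = (ref[i - 1] == hyp[j - 1]) ? 0 : 1;
            dp[i][j] = std::min({ dp[i - 1][j] + 1,         // deletion
                                  dp[i][j - 1] + 1,         // insertion
                                  dp[i - 1][j - 1] + cost   // substitution / match
                                });
        }
    }
    return ref.empty() ? 0.0 : (double) dp[ref.size()][hyp.size()] / (double) ref.size();
}

int main() {
    // one substitution plus one insertion over four reference words -> WER = 0.5
    std::printf("WER = %.3f\n", wer("the quick brown fox", "the quick brown box jumps"));
    return 0;
}
```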

Member:

> While 1 is what we're after, and using that as a backstop is undoubtedly going to pay off, it feels like 2 would be a very useful stopping-off point to check, so we know we're at least putting the same things into the black box as they are; and 3 would help us avoid bugs on the way there.

Agree. If needed, match the whisper.cpp precision so that we have identical input with PyTorch.
Before doing any kind of WER, we need identical input to the transformer and I know it is not the case currently. Not sure if it is just a matter of precision or if there is a bug in whisper.cpp - needs similar analysis to what @bobqianic did in the comment above.

> If the things being averaged weren't already identical numbers, then wouldn't that point to a bug in the FFT implementation? Averaging them should always have been a no-op.

Yes, but I believe the values are identical and this step is basically a noop. Needs confirmation. I guess I've put it there due to some doubts I had back when working on ggwave and was new to audio processing, but indeed the spectrum should be symmetrical.

Don't think we need to discard fft_out[0]. Again - let's compare what PyTorch is doing to make sure.

bobqianic (Collaborator, Author):

> Help me out a bit here - it's very unclear to me what this code is here for. We get the symmetrical spectrum in fft_out[0...fft_size] for free from the previous loop. Not only that, but inference seems to work fine with it commented out entirely. What's it supposed to be doing?

Confirmed. This step is unnecessary.

I carefully examined the C++ implementation of PyTorch's STFT, and it directly removes all the bins after the Nyquist bin.

Reviewer:

Good to know I'm not going completely mad :-) If you have a look at @lunixbochs' pocketfft branch from a little while back, you'll see that the real-to-complex pocketfft FFT implementation gives you back a buffer of length (fft_size/2)+1. Using std::complex directly might clarify things, even without swapping out the FFT implementation.
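
To illustrate the suggestion (a sketch only, not the PR's code, and assuming a pocketfft-style real-to-complex transform that returns fft_size/2 + 1 bins):

```cpp
// Illustration only: holding the non-redundant half of a real-input FFT in
// std::complex. A real-to-complex transform of length fft_size yields
// fft_size/2 + 1 bins (DC through Nyquist inclusive), so there is no mirror
// half to fold back in.
#include <complex>
#include <cstdio>
#include <vector>

int main() {
    const int fft_size = 400;
    const int n_bins   = fft_size / 2 + 1;   // 201 bins for fft_size == 400

    // Pretend 'spectrum' was filled by a real-to-complex FFT (e.g. pocketfft's r2c).
    std::vector<std::complex<float>> spectrum(n_bins);

    // Power per bin.
    std::vector<float> power(n_bins);
    for (int k = 0; k < n_bins; k++) {
        power[k] = std::norm(spectrum[k]);   // re*re + im*im
    }

    std::printf("bins kept: %d (DC at k=0, Nyquist at k=%d)\n", n_bins, n_bins - 1);
    return 0;
}
```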


if (speed_up) {
// scale down in the frequency domain results in a speed up in the time domain
// scale down in the frequency domain results in a speed-up in the time domain
for (int j = 0; j < n_fft; j++) {
fft_out[j] = 0.5 * (fft_out[2 * j] + fft_out[2 * j + 1]);
}
@@ -2471,10 +2474,10 @@ static void log_mel_spectrogram_worker_thread(int ith, const std::vector<float>
int k = 0;
for (k = 0; k < n_fft - 3; k += 4) {
sum +=
fft_out[k + 0] * filters.data[j*n_fft + k + 0] +
fft_out[k + 1] * filters.data[j*n_fft + k + 1] +
fft_out[k + 2] * filters.data[j*n_fft + k + 2] +
fft_out[k + 3] * filters.data[j*n_fft + k + 3];
fft_out[k + 0] * filters.data[j * n_fft + k + 0] +
fft_out[k + 1] * filters.data[j * n_fft + k + 1] +
fft_out[k + 2] * filters.data[j * n_fft + k + 2] +
fft_out[k + 3] * filters.data[j * n_fft + k + 3];
}

// handle n_fft remainder
@@ -3634,7 +3637,7 @@ static void whisper_process_logits(
WHISPER_ASSERT(n_logits == ctx.vocab.n_vocab);

// extract the logits for the last token
// we will be mutating and therefore we don't want to use the ctx.logits buffer directly
// we will be mutating, and therefore we don't want to use the ctx.logits buffer directly
auto & probs = decoder.probs;
auto & logits = decoder.logits;
auto & logprobs = decoder.logprobs;