Bulk repetition #471

Closed
garthk opened this issue Feb 5, 2023 · 2 comments
Labels: decoding (Decoding related issues), enhancement (New feature or request)

Comments


garthk commented Feb 5, 2023

I'm not sure if this is a variant of #412, but check out this partial output:

[00:25:16.880 --> 00:25:20.240]   And you're like, this character needs some like thigh highs and like, it should have
[00:25:20.240 --> 00:25:21.240]   been a bit of a dresser.
[00:25:21.240 --> 00:25:22.240]   It should have been a dresser.
[00:25:22.240 --> 00:25:23.240]   It should have been a dresser.
[00:25:23.240 --> 00:25:24.240]   It should have been a dresser.
[00:25:24.240 --> 00:25:25.240]   It should have been a dresser.
[3333 additional repetitions elided]
[01:21:40.240 --> 01:21:41.240]   It should have been a dresser.
[01:21:41.240 --> 01:21:42.240]   It should have been a dresser.
[01:21:42.240 --> 01:21:43.240]   It should have been a dresser.
[01:21:43.240 --> 01:21:44.240]   It should have been a dresser.
[01:21:44.240 --> 01:21:45.240]   It should have been a dresser.
[01:21:45.240 --> 01:21:51.240]   Whether it's true or not is first and foremost a bluff to stop you from doing the right thing.

Reproduction:

./models/download-ggml-model.sh base.en
make
curl -o episode.mp3 -L https://mcdn.podbean.com/mf/web/5ein65/07-31-Clear-Present-free.mp3
ffmpeg -i episode.mp3 -ar 16000 episode.wav
./main -f episode.wav 

Standard error:

whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem required  =  215.00 MB (+    6.00 MB per decoder)
whisper_model_load: kv self size  =    5.25 MB
whisper_model_load: kv cross size =   17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.60 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 

main: processing 'episode.wav' (94221793 samples, 5888.9 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


whisper_print_timings:     fallbacks =   3 p /   9 h
whisper_print_timings:     load time =   120.07 ms
whisper_print_timings:      mel time =  8174.57 ms
whisper_print_timings:   sample time = 21253.98 ms / 46180 runs (    0.46 ms per run)
whisper_print_timings:   encode time = 84284.79 ms /   246 runs (  342.62 ms per run)
whisper_print_timings:   decode time = 139710.86 ms / 46321 runs (    3.02 ms per run)
whisper_print_timings:    total time = 253756.25 ms

I'm on the main branch at v1.2.0.

ggerganov added the enhancement (New feature or request) label on Feb 5, 2023
ggerganov (Owner) commented

Hi, thanks for the detailed steps - this helps a lot.

After debugging with WHISPER_DEBUG enabled, I can see immediately that in this case the entropy-based check for repetition didn't trigger: the entropy of the decoded segment was just slightly above the default threshold of 2.4:

whisper_full: decoder  0: score = -0.15161, result_len = 220, avg_logprobs = -0.15161, entropy =  2.44152
whisper_full: best decoder = 0
[00:25:20.240 --> 00:25:21.240]   been a bit of a dresser.
[00:25:21.240 --> 00:25:22.240]   It should have been a dresser.
[00:25:22.240 --> 00:25:23.240]   It should have been a dresser.
[00:25:23.240 --> 00:25:24.240]   It should have been a dresser.
[00:25:24.240 --> 00:25:25.240]   It should have been a dresser.
[00:25:25.240 --> 00:25:26.240]   It should have been a dresser.
[00:25:26.240 --> 00:25:27.240]   It should have been a dresser.
[00:25:27.240 --> 00:25:28.240]   It should have been a dresser.
[00:25:28.240 --> 00:25:29.240]   It should have been a dresser.
[00:25:29.240 --> 00:25:30.240]   It should have been a dresser.
[00:25:30.240 --> 00:25:31.240]   It should have been a dresser.
[00:25:31.240 --> 00:25:32.240]   It should have been a dresser.
[00:25:32.240 --> 00:25:33.240]   It should have been a dresser.
[00:25:33.240 --> 00:25:34.240]   It should have been a dresser.
[00:25:34.240 --> 00:25:35.240]   It should have been a dresser.
[00:25:35.240 --> 00:25:36.240]   It should have been a dresser.
[00:25:36.240 --> 00:25:37.240]   It should have been a dresser.
[00:25:37.240 --> 00:25:38.240]   It should have been a dresser.
[00:25:38.240 --> 00:25:39.240]   It should have been a dresser.
[00:25:39.240 --> 00:25:40.240]   It should have been a dresser.
[00:25:40.240 --> 00:25:41.240]   It should have been a dresser.
[00:25:41.240 --> 00:25:42.240]   It should have been a dresser.
seek = 154224, seek_delta = 2200

This means that the decoder didn't "detect" that there is a repetition and therefore didn't use the fallback strategy to correct it.
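
For reference, here is a minimal sketch of what such an entropy-based repetition check can look like (my illustration, not the actual whisper.cpp code): build a histogram of the most recently decoded token ids, compute its Shannon entropy, and treat a value below the threshold as a sign of repetition.

#include <cmath>
#include <cstdint>
#include <map>
#include <vector>

// token_ids: the most recently decoded tokens (e.g. the current segment)
// thold    : entropy threshold, e.g. the default 2.4
bool looks_repetitive(const std::vector<int32_t> & token_ids, double thold) {
    std::map<int32_t, int> counts;
    for (const int32_t id : token_ids) {
        counts[id]++;
    }

    double entropy = 0.0;
    for (const auto & kv : counts) {
        const double p = double(kv.second) / double(token_ids.size());
        entropy -= p * std::log(p);
    }

    // Repetitive output concentrates the counts on a few token ids,
    // which pushes the entropy down; below the threshold a fallback
    // (e.g. re-decoding at a higher temperature) would be triggered.
    return entropy < thold;
}

Here the segment's entropy came out at 2.44152, just above the 2.4 cutoff, so the check passed even though the output is clearly repetitive.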

Rerunning the transcription with a slightly increased entropy threshold of --entropy-thold 2.5 resolves the issue.
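
For example, using the same file as in the reproduction above:

./main -f episode.wav --entropy-thold 2.5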

Obviously, this is not a very nice approach since there is normally no way to see this debug information. But that is the general problem with this kind of free parameter: the default values are not always going to work and might need a little tuning in some cases.

I'll try to think of some more robust way to detect the repetitions.


garthk commented Feb 6, 2023

That entropy threshold did the trick for that episode. Thanks!

ggerganov added the decoding (Decoding related issues) label on Feb 19, 2023
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this issue Oct 24, 2023
I disabled this because there were many complaints about slow decoding.
The current implementation does not allow batching the decoders when
using the "best of" or "beam size" parameters, so the decoding time is
proportional to the number of decoders, which is obviously not great.

However, now there are even more complaints about wrong decodings and
repetition.

So, making a compromise by re-enabling the fallbacks, but defaulting to
just 2 "best of" / "beam size" decoders. Also, the temperature step is
increased from 0.2 to 0.4 - i.e. from maximum of 5 fallbacks to maximum
of 2.

Also, the stream example now has fallbacks enabled by default.

close ggerganov#471 ggerganov#477 ggerganov#508 ggerganov#612 ggerganov#719 ggerganov#731
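
To make the "maximum of 5 fallbacks to maximum of 2" arithmetic concrete, here is a small sketch (my illustration; it assumes fallback temperatures start at 0.0 and grow by the step until they would exceed 1.0):

#include <cstdio>
#include <vector>

// Enumerate the decoding temperatures for a given temperature step.
static std::vector<float> temperature_schedule(float step) {
    std::vector<float> temps;
    for (float t = 0.0f; t <= 1.0f + 1e-6f; t += step) {
        temps.push_back(t);
    }
    return temps;
}

int main() {
    const float steps[] = {0.2f, 0.4f};
    for (const float step : steps) {
        const auto temps = temperature_schedule(step);
        // The first entry (t = 0.0) is the initial decode; the rest are fallbacks.
        printf("step %.1f -> %zu fallbacks:", step, temps.size() - 1);
        for (const float t : temps) {
            printf(" %.1f", t);
        }
        printf("\n");
    }
    return 0;
}

With a step of 0.2 the schedule is 0.0, 0.2, 0.4, 0.6, 0.8, 1.0 (5 fallbacks after the initial decode); with a step of 0.4 it is 0.0, 0.4, 0.8 (2 fallbacks).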
landtanin pushed a commit to landtanin/whisper.cpp that referenced this issue Dec 16, 2023