Bulk repetition #471

Closed
garthk opened this issue Feb 5, 2023 · 2 comments
Labels: decoding (Decoding related issues), enhancement (New feature or request)

Comments


garthk commented Feb 5, 2023

I'm not sure if this is a variant of #412, but check out this partial output:

[00:25:16.880 --> 00:25:20.240]   And you're like, this character needs some like thigh highs and like, it should have
[00:25:20.240 --> 00:25:21.240]   been a bit of a dresser.
[00:25:21.240 --> 00:25:22.240]   It should have been a dresser.
[00:25:22.240 --> 00:25:23.240]   It should have been a dresser.
[00:25:23.240 --> 00:25:24.240]   It should have been a dresser.
[00:25:24.240 --> 00:25:25.240]   It should have been a dresser.
[3333 additional repetitions elided]
[01:21:40.240 --> 01:21:41.240]   It should have been a dresser.
[01:21:41.240 --> 01:21:42.240]   It should have been a dresser.
[01:21:42.240 --> 01:21:43.240]   It should have been a dresser.
[01:21:43.240 --> 01:21:44.240]   It should have been a dresser.
[01:21:44.240 --> 01:21:45.240]   It should have been a dresser.
[01:21:45.240 --> 01:21:51.240]   Whether it's true or not is first and foremost a bluff to stop you from doing the right thing.

Reproduction:

./models/download-ggml-model.sh base.en
make
curl -o episode.mp3 -L https://mcdn.podbean.com/mf/web/5ein65/07-31-Clear-Present-free.mp3
ffmpeg -i episode.mp3 -ar 16000 episode.wav
./main -f episode.wav 

Standard error:

whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem required  =  215.00 MB (+    6.00 MB per decoder)
whisper_model_load: kv self size  =    5.25 MB
whisper_model_load: kv cross size =   17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.60 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 

main: processing 'episode.wav' (94221793 samples, 5888.9 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


whisper_print_timings:     fallbacks =   3 p /   9 h
whisper_print_timings:     load time =   120.07 ms
whisper_print_timings:      mel time =  8174.57 ms
whisper_print_timings:   sample time = 21253.98 ms / 46180 runs (    0.46 ms per run)
whisper_print_timings:   encode time = 84284.79 ms /   246 runs (  342.62 ms per run)
whisper_print_timings:   decode time = 139710.86 ms / 46321 runs (    3.02 ms per run)
whisper_print_timings:    total time = 253756.25 ms

I'm on the main branch at v1.2.0.

ggerganov added the enhancement (New feature or request) label on Feb 5, 2023
ggerganov (Owner) commented

Hi, thanks for the detailed steps - this helps a lot.

After debugging with WHISPER_DEBUG enabled, I can see immediately that in this case the entropy-based check for repetition didn't trigger: the entropy of the decoded segment was just slightly above the default threshold of 2.4:

whisper_full: decoder  0: score = -0.15161, result_len = 220, avg_logprobs = -0.15161, entropy =  2.44152
whisper_full: best decoder = 0
[00:25:20.240 --> 00:25:21.240]   been a bit of a dresser.
[00:25:21.240 --> 00:25:22.240]   It should have been a dresser.
[00:25:22.240 --> 00:25:23.240]   It should have been a dresser.
[00:25:23.240 --> 00:25:24.240]   It should have been a dresser.
[00:25:24.240 --> 00:25:25.240]   It should have been a dresser.
[00:25:25.240 --> 00:25:26.240]   It should have been a dresser.
[00:25:26.240 --> 00:25:27.240]   It should have been a dresser.
[00:25:27.240 --> 00:25:28.240]   It should have been a dresser.
[00:25:28.240 --> 00:25:29.240]   It should have been a dresser.
[00:25:29.240 --> 00:25:30.240]   It should have been a dresser.
[00:25:30.240 --> 00:25:31.240]   It should have been a dresser.
[00:25:31.240 --> 00:25:32.240]   It should have been a dresser.
[00:25:32.240 --> 00:25:33.240]   It should have been a dresser.
[00:25:33.240 --> 00:25:34.240]   It should have been a dresser.
[00:25:34.240 --> 00:25:35.240]   It should have been a dresser.
[00:25:35.240 --> 00:25:36.240]   It should have been a dresser.
[00:25:36.240 --> 00:25:37.240]   It should have been a dresser.
[00:25:37.240 --> 00:25:38.240]   It should have been a dresser.
[00:25:38.240 --> 00:25:39.240]   It should have been a dresser.
[00:25:39.240 --> 00:25:40.240]   It should have been a dresser.
[00:25:40.240 --> 00:25:41.240]   It should have been a dresser.
[00:25:41.240 --> 00:25:42.240]   It should have been a dresser.
seek = 154224, seek_delta = 2200

This means that the decoder didn't "detect" that there is a repetition and therefore didn't use the fallback strategy to correct it.
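
For reference, here is a minimal sketch of what such an entropy-based repetition check can look like (my illustration, not the actual whisper.cpp code): build a histogram of the most recently decoded token ids, compute its Shannon entropy, and treat a value below the threshold as a sign of repetition.

#include <cmath>
#include <cstdint>
#include <map>
#include <vector>

// token_ids: the most recently decoded tokens (e.g. the current segment)
// thold    : entropy threshold, e.g. the default 2.4
bool looks_repetitive(const std::vector<int32_t> & token_ids, double thold) {
    std::map<int32_t, int> counts;
    for (const int32_t id : token_ids) {
        counts[id]++;
    }

    double entropy = 0.0;
    for (const auto & kv : counts) {
        const double p = double(kv.second) / double(token_ids.size());
        entropy -= p * std::log(p);
    }

    // Repetitive output concentrates the counts on a few token ids,
    // which pushes the entropy down; below the threshold a fallback
    // (e.g. re-decoding at a higher temperature) would be triggered.
    return entropy < thold;
}

Here the segment's entropy came out at 2.44152, just above the 2.4 cutoff, so the check passed even though the output is clearly repetitive.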

Rerunning the transcription with a slightly increased entropy threshold of --entropy-thold 2.5 resolves the issue.
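
For example, using the same file as in the reproduction above:

./main -f episode.wav --entropy-thold 2.5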

Obviously, this is not a very nice approach since there is normally no way to see this debug information. But that is the general problem with this kind of free parameter: the default values are not always going to work and might need a little tuning in some cases.

I'll try to think of some more robust way to detect the repetitions.


garthk commented Feb 6, 2023

That entropy threshold did the trick for that episode. Thanks!

ggerganov added the decoding (Decoding related issues) label on Feb 19, 2023
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this issue Oct 24, 2023
I disabled this because there were many complaints about slow decoding.
The current implementation does not allow batching the decoders when
using the "best of" or "beam size" parameters, so the decoding time is
proportional to the number of decoders, which is obviously not great.

However, now there are even more complaints about wrong decodings and
repetition.

So, making a compromise by re-enabling the fallbacks, but defaulting to
just 2 "best of" / "beam size" decoders. Also, the temperature step is
increased from 0.2 to 0.4 - i.e. from maximum of 5 fallbacks to maximum
of 2.

Also, the stream example now has fallbacks enabled by default.

close ggerganov#471 ggerganov#477 ggerganov#508 ggerganov#612 ggerganov#719 ggerganov#731
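
To make the "maximum of 5 fallbacks to maximum of 2" arithmetic concrete, here is a small sketch (my illustration; it assumes fallback temperatures start at 0.0 and grow by the step until they would exceed 1.0):

#include <cstdio>
#include <vector>

// Enumerate the decoding temperatures for a given temperature step.
static std::vector<float> temperature_schedule(float step) {
    std::vector<float> temps;
    for (float t = 0.0f; t <= 1.0f + 1e-6f; t += step) {
        temps.push_back(t);
    }
    return temps;
}

int main() {
    const float steps[] = {0.2f, 0.4f};
    for (const float step : steps) {
        const auto temps = temperature_schedule(step);
        // The first entry (t = 0.0) is the initial decode; the rest are fallbacks.
        printf("step %.1f -> %zu fallbacks:", step, temps.size() - 1);
        for (const float t : temps) {
            printf(" %.1f", t);
        }
        printf("\n");
    }
    return 0;
}

With a step of 0.2 the schedule is 0.0, 0.2, 0.4, 0.6, 0.8, 1.0 (5 fallbacks after the initial decode); with a step of 0.4 it is 0.0, 0.4, 0.8 (2 fallbacks).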
landtanin pushed a commit to landtanin/whisper.cpp that referenced this issue Dec 16, 2023