STT Benchmark Results
Median word error rate (WER), character error rate (CER), and latency, grouped by audio set, language, and speech-to-text (STT) model. "WER clean" excludes the digit and proper-name stress samples; "WER full" includes them. Audio files are not published. Runs = (number of unique samples) × (repetitions).
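For readers unfamiliar with the two error metrics, the following is a minimal sketch of how WER and CER are commonly computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, over words for WER and over characters for CER. The function names and the lowercase-only text normalization are assumptions made for illustration; the benchmark's actual scoring code and normalization rules are not shown in this document.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis token)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.lower().split()  # assumed normalization: lowercase only
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    ref_chars = reference.lower()
    hyp_chars = hypothesis.lower()
    if not ref_chars:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sat on mat"  # one word ("the") was dropped
    print(f"WER: {wer(reference, hypothesis):.2%}")
    print(f"CER: {cer(reference, hypothesis):.2%}")
```

Because both metrics divide by the reference length, a hypothesis with many inserted tokens can score above 100%.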
Live Voice
| Audio set | Lang | STT model | TTS provider | TTS model | Voice | WER clean | WER full | CER | Latency / 10s | Peak RAM (MB) | Runs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| live | en | mlx-whisper-large-v3 | — | — | — | 1.25% | 4.91% | 2.14% | 1.15 s | 1073.56 | 10 |
| live | en | parakeet-tdt-0.6b-v3-mlx | — | — | — | 1.25% | 4.91% | 3.13% | 0.29 s | 927.56 | 10 |
| live | en | parakeet-tdt-0.6b-v3 | — | — | — | 1.25% | 4.91% | 3.21% | 0.93 s | 1138.11 | 10 |
| live | en | mlx-whisper-small | — | — | — | 4.03% | 6.44% | 3.44% | 0.35 s | 901.70 | 10 |
| live | en | mlx-whisper-medium | — | — | — | 4.03% | 6.44% | 2.55% | 0.73 s | 1976.50 | 10 |
| live | en | groq-whisper-large-v3-turbo | — | — | — | 4.91% | 7.94% | 2.46% | 0.36 s | 48.64 | 10 |
| live | en | mlx-whisper-large-v3-turbo | — | — | — | 4.91% | 7.94% | 3.18% | 0.58 s | 2422.30 | 10 |
| live | en | transformers-moonshine-base | — | — | — | 5.56% | 6.44% | 3.38% | 2.57 s | 1030.08 | 10 |
| live | en | elevenlabs-scribe-v1 | — | — | — | 7.08% | 8.63% | 11.10% | 1.07 s | 48.81 | 10 |
| live | en | sensevoice-small | — | — | — | 7.83% | 9.88% | 7.54% | 0.44 s | 3659.48 | 10 |
| live | en | fish-audio-asr | — | — | — | 14.96% | 16.15% | 10.44% | 0.30 s | 48.72 | 30 |
| live | en | elevenlabs-scribe-v1-experimental | — | — | — | 0.00% | 0.00% | 2.03% | 0.98 s | 49.20 | 10 |
| live | ru | elevenlabs-scribe-v1-experimental | — | — | — | 5.98% | 6.97% | 5.28% | 1.04 s | 47.91 | 10 |
| live | ru | mlx-whisper-medium | — | — | — | 9.11% | 13.94% | 9.39% | 0.86 s | 1970.03 | 10 |
| live | ru | elevenlabs-scribe-v1 | — | — | — | 9.56% | 14.81% | 10.36% | 1.03 s | 48.02 | 10 |
| live | ru | mlx-whisper-large-v3-turbo | — | — | — | 10.55% | 16.26% | 8.92% | 0.74 s | 2414.72 | 10 |
| live | ru | groq-whisper-large-v3-turbo | — | — | — | 10.71% | 11.96% | 8.90% | 0.39 s | 48.66 | 10 |
| live | ru | mlx-whisper-small | — | — | — | 12.86% | 14.84% | 12.61% | 0.43 s | 894.48 | 10 |
| live | ru | gigaam-v2-rnnt | — | — | — | 12.91% | 16.52% | 18.37% | 0.62 s | 3785.67 | 10 |
| live | ru | parakeet-tdt-0.6b-v3 | — | — | — | 12.91% | 16.52% | 7.68% | 1.01 s | 825.36 | 10 |
| live | ru | mlx-whisper-large-v3 | — | — | — | 13.39% | 14.84% | 8.10% | 1.38 s | 3532.17 | 10 |
| live | ru | parakeet-tdt-0.6b-v3-mlx | — | — | — | 14.84% | 17.07% | 13.31% | 0.28 s | 3389.97 | 10 |
| live | ru | gigaam-v2-ctc | — | — | — | 16.52% | 18.99% | 18.79% | 0.59 s | 2987.58 | 10 |
| live | ru | fish-audio-asr | — | — | — | 22.86% | 26.32% | 16.80% | 0.41 s | 48.48 | 30 |
| live | ru | t-one | — | — | — | 28.46% | 30.62% | 20.78% | 3.16 s | 2880.08 | 10 |
| live | ru | sensevoice-small | — | — | — | 100.00% | 100.00% | 99.30% | 0.42 s | 3659.05 | 10 |
Synthetic Audio
| Audio set | Lang | STT model | TTS provider | TTS model | Voice | WER clean | WER full | CER | Latency / 10s | Peak RAM (MB) | Runs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| elevenlabs_rachel_mixed | en | mlx-whisper-large-v3 | elevenlabs | eleven_flash_v2_5 | rachel | 1.25% | 4.91% | 2.14% | 1.21 s | 3559.20 | 10 |
| elevenlabs_rachel_mixed | en | parakeet-tdt-0.6b-v3-mlx | elevenlabs | eleven_flash_v2_5 | rachel | 1.25% | 4.91% | 3.13% | 0.29 s | 4643.84 | 10 |
| elevenlabs_rachel_mixed | en | parakeet-tdt-0.6b-v3 | elevenlabs | eleven_flash_v2_5 | rachel | 1.25% | 4.91% | 3.21% | 0.86 s | 2963.92 | 10 |
| elevenlabs_rachel_mixed | en | fish-audio-asr | elevenlabs | eleven_flash_v2_5 | rachel | 1.25% | 4.91% | 2.07% | 0.41 s | 49.19 | 30 |
| elevenlabs_rachel_mixed | en | mlx-whisper-small | elevenlabs | eleven_flash_v2_5 | rachel | 4.03% | 6.44% | 3.44% | 0.35 s | 899.75 | 10 |
| elevenlabs_rachel_mixed | en | mlx-whisper-medium | elevenlabs | eleven_flash_v2_5 | rachel | 4.03% | 6.44% | 2.55% | 0.71 s | 2605.25 | 10 |
| elevenlabs_rachel_mixed | en | mlx-whisper-large-v3-turbo | elevenlabs | eleven_flash_v2_5 | rachel | 4.91% | 7.94% | 3.18% | 0.60 s | 4170.17 | 10 |
| elevenlabs_rachel_mixed | en | transformers-moonshine-base | elevenlabs | eleven_flash_v2_5 | rachel | 5.56% | 6.44% | 3.38% | 2.07 s | 989.05 | 10 |
| elevenlabs_rachel_mixed | en | elevenlabs-scribe-v1 | elevenlabs | eleven_flash_v2_5 | rachel | 5.83% | 6.99% | 9.02% | 1.04 s | 48.69 | 10 |
| elevenlabs_rachel_mixed | en | sensevoice-small | elevenlabs | eleven_flash_v2_5 | rachel | 7.83% | 9.88% | 7.54% | 0.37 s | 3910.67 | 10 |
| elevenlabs_rachel_mixed | en | elevenlabs-scribe-v1-experimental | elevenlabs | eleven_flash_v2_5 | rachel | 0.00% | 0.00% | 2.13% | 0.88 s | 49.08 | 10 |
| elevenlabs_rachel_mixed | ru | elevenlabs-scribe-v1-experimental | elevenlabs | eleven_flash_v2_5 | rachel | 5.98% | 6.97% | 5.28% | 0.96 s | 48.89 | 10 |
| elevenlabs_rachel_mixed | ru | elevenlabs-scribe-v1 | elevenlabs | eleven_flash_v2_5 | rachel | 6.97% | 8.39% | 5.89% | 1.33 s | 48.00 | 10 |
| elevenlabs_rachel_mixed | ru | mlx-whisper-medium | elevenlabs | eleven_flash_v2_5 | rachel | 9.11% | 13.94% | 9.39% | 0.91 s | 2604.64 | 10 |
| elevenlabs_rachel_mixed | ru | mlx-whisper-large-v3-turbo | elevenlabs | eleven_flash_v2_5 | rachel | 10.55% | 16.26% | 8.92% | 0.72 s | 4164.38 | 10 |
| elevenlabs_rachel_mixed | ru | fish-audio-asr | elevenlabs | eleven_flash_v2_5 | rachel | 10.71% | 11.48% | 9.05% | 0.47 s | 48.47 | 30 |
| elevenlabs_rachel_mixed | ru | mlx-whisper-small | elevenlabs | eleven_flash_v2_5 | rachel | 12.86% | 14.84% | 12.61% | 0.38 s | 897.23 | 10 |
| elevenlabs_rachel_mixed | ru | gigaam-v2-rnnt | elevenlabs | eleven_flash_v2_5 | rachel | 12.91% | 16.52% | 18.37% | 0.57 s | 3722.89 | 10 |
| elevenlabs_rachel_mixed | ru | parakeet-tdt-0.6b-v3-mlx | elevenlabs | eleven_flash_v2_5 | rachel | 12.91% | 16.52% | 11.32% | 0.33 s | 4639.78 | 10 |
| elevenlabs_rachel_mixed | ru | parakeet-tdt-0.6b-v3 | elevenlabs | eleven_flash_v2_5 | rachel | 12.91% | 16.52% | 7.68% | 0.94 s | 2748.19 | 10 |
| elevenlabs_rachel_mixed | ru | mlx-whisper-large-v3 | elevenlabs | eleven_flash_v2_5 | rachel | 13.39% | 14.84% | 8.10% | 1.54 s | 7173.80 | 10 |
| elevenlabs_rachel_mixed | ru | gigaam-v2-ctc | elevenlabs | eleven_flash_v2_5 | rachel | 14.84% | 17.07% | 18.37% | 0.49 s | 2550.48 | 10 |
| elevenlabs_rachel_mixed | ru | t-one | elevenlabs | eleven_flash_v2_5 | rachel | 15.14% | 22.23% | 18.37% | 0.70 s | 2988.55 | 10 |
| elevenlabs_rachel_mixed | ru | sensevoice-small | elevenlabs | eleven_flash_v2_5 | rachel | 100.00% | 100.00% | 99.30% | 0.38 s | 3905.19 | 10 |
| silero_v4_ru_xenia_48k | ru | mlx-whisper-large-v3 | silero | v4_ru | xenia | 16.26% | 18.57% | 18.49% | 1.93 s | 4343.11 | 10 |
| silero_v4_ru_xenia_48k | ru | mlx-whisper-large-v3-turbo | silero | v4_ru | xenia | 16.26% | 20.00% | 18.81% | 1.14 s | 3317.81 | 10 |
| silero_v4_ru_xenia_48k | ru | gigaam-v2-rnnt | silero | v4_ru | xenia | 16.88% | 19.38% | 20.54% | 0.62 s | 3371.06 | 10 |
| silero_v4_ru_xenia_48k | ru | gigaam-v2-ctc | silero | v4_ru | xenia | 16.88% | 20.80% | 20.68% | 0.66 s | 2184.78 | 10 |
| silero_v4_ru_xenia_48k | ru | mlx-whisper-medium | silero | v4_ru | xenia | 17.95% | 20.74% | 18.55% | 1.19 s | 2542.75 | 10 |
| silero_v4_ru_xenia_48k | ru | parakeet-tdt-0.6b-v3-mlx | silero | v4_ru | xenia | 19.38% | 21.43% | 17.95% | 0.36 s | 4493.45 | 10 |
| silero_v4_ru_xenia_48k | ru | parakeet-tdt-0.6b-v3 | silero | v4_ru | xenia | 19.38% | 21.43% | 18.37% | 1.25 s | 2358.77 | 10 |
| silero_v4_ru_xenia_48k | ru | mlx-whisper-small | silero | v4_ru | xenia | 21.36% | 22.79% | 19.32% | 0.53 s | 895.81 | 10 |
| silero_v4_ru_xenia_48k | ru | t-one | silero | v4_ru | xenia | 21.43% | 22.97% | 20.42% | 0.84 s | 1887.78 | 10 |
| silero_v4_ru_xenia_48k | ru | fish-audio-asr | silero | v4_ru | xenia | 22.79% | 23.93% | 18.96% | 1.06 s | 48.41 | 30 |
| silero_v4_ru_xenia_48k | ru | sensevoice-small | silero | v4_ru | xenia | 100.00% | 100.00% | 99.23% | 0.56 s | 3270.36 | 10 |
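
Each table row is one median per (audio set, language, STT model) group, with Runs counting samples times repetitions. The sketch below shows one way such a summary could be produced from per-run records. The `Run` record, its field names, and the sample values are hypothetical. The reading of "Latency / 10s" as processing time scaled to 10 seconds of audio is also an assumption, not something the report states.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Run:
    """One transcription run (hypothetical record layout)."""
    audio_set: str
    lang: str
    model: str
    sample_id: str
    wer: float        # word error rate of this run, as a fraction
    elapsed_s: float  # wall-clock transcription time in seconds
    audio_s: float    # duration of the audio clip in seconds


def latency_per_10s(elapsed_s: float, audio_s: float) -> float:
    """Scale processing time to 10 s of audio (assumed meaning of the column)."""
    return elapsed_s / audio_s * 10.0


def summarize(runs: list[Run]) -> dict[tuple[str, str, str], dict[str, float]]:
    """Group runs by (audio set, language, model) and take medians per group."""
    groups: dict[tuple[str, str, str], list[Run]] = defaultdict(list)
    for run in runs:
        groups[(run.audio_set, run.lang, run.model)].append(run)

    summary = {}
    for key, members in groups.items():
        summary[key] = {
            "wer": median(r.wer for r in members),
            "latency_per_10s": median(
                latency_per_10s(r.elapsed_s, r.audio_s) for r in members
            ),
            # Equals (unique samples) x (repetitions) when every sample
            # is repeated the same number of times.
            "runs": len(members),
        }
    return summary


if __name__ == "__main__":
    # Made-up values: 2 unique samples x 2 repetitions = 4 runs.
    demo = [
        Run("live", "en", "model-a", "s1", 0.0, 0.5, 5.0),
        Run("live", "en", "model-a", "s1", 0.0, 0.6, 5.0),
        Run("live", "en", "model-a", "s2", 0.1, 1.0, 10.0),
        Run("live", "en", "model-a", "s2", 0.2, 1.4, 10.0),
    ]
    for key, stats in summarize(demo).items():
        print(key, stats)
```

The median is less sensitive than the mean to a single failed run, which may be why it suits a table with only 10 to 30 runs per row.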