Other existing approaches frequently use smaller, more closely paired audio-text training datasets,[^reference-1] [^reference-2][^reference-3] or use broad but unsupervised audio pretraining.[^reference-4][^reference-5][^reference-6] Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper's zero-shot performance across many diverse datasets, we find it is much more robust and makes 50% fewer errors than those models.
About a third of Whisper's audio dataset is non-English, and the model is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech-to-text translation, and it outperforms the supervised SOTA on CoVoST2 to English translation zero-shot.
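As a rough sketch of what these two tasks look like in practice with the open-source whisper Python package (the model size and audio file path below are placeholders, not values from this post):

```python
import whisper

# Load one of the released checkpoints; "small" is an arbitrary choice here.
model = whisper.load_model("small")

# Transcribe in the original spoken language.
transcription = model.transcribe("non_english_speech.mp3", task="transcribe")
print(transcription["language"], transcription["text"])

# Translate the same audio into English instead.
translation = model.transcribe("non_english_speech.mp3", task="translate")
print(translation["text"])
```

The `task` argument switches the decoder between transcription and English translation, mirroring the two objectives the model alternated between during training.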