Understand Microsoft's VALL-E in 3 Minutes (SOTA Zero-shot TTS)

06:06

Related Videos

Deduce OpenAI GPT-4o's Neural Network Architecture

22:56

Google Researcher's In-Depth Analysis on End-to-End Speech Recognition, Part 1: Overview & Modeling

42:53

[Olewave's Long Review] Efficient Training of Neural Transducer for Speech Recognition

38:31

Google's Universal Speech Model for 100+ languages beats OpenAI's Whisper Model

56:48

Olewave's most detailed illustration of RNN-T: Sequence Transduction with Recurrent Neural Networks

1:40:04

[Olewave's Review] AudioLM: a Language Modeling Approach to Audio Generation

1:11:12

[Olewave's Review] CLIP (3/3): Learning Transferable Visual Models From Natural Language Supervision

1:33:00

[Olewave's Review] CLIP (2/3): Learning Transferable Visual Models From Natural Language Supervision

1:38:05

[Olewave's Review] OpenAI's Whisper ASR: Robust Speech Recognition via Large-Scale Weak Supervision

44:26

[Olewave's Review] CLIP (1/3): Learning Transferable Visual Models From Natural Language Supervision

55:00

[Olewave's Review] Branchformer: Parallel MLP-Attention Architectures, and E-Branchformer

30:36

[Olewave's Long Review] Xception: Deep Learning with Depthwise Separable Convolutions

51:57

[Olewave's Review] Token-level Sequence Labeling for SLU using Compositional E2E Models

55:06

A Quick Review of Apple's SOTA Multimodal LLM: MM1

10:11

[Detailed Paper Reading] Zipformer: A faster and better encoder for automatic speech recognition

1:16:59

Boris Johnson’s Rise and Fall - an analysis of the mics

9:26

[Olewave's Short Review] Xception: Deep Learning with Depthwise Separable Convolutions

2:23

Why are word timestamps generated by OpenAI Whisper not accurate? How to make them accurate again?

39:12

Transformer: Attention is All You Need and Listen, Attend and Spell -- from a Speech Perspective

56:59

Tycho: a toolkit for building high-ROI in-house speech-related services (ASR/TTS/Translation): Overview

17:07
