I'm a senior research scientist at Meta Reality Labs working on generative models for audio, text, and video. Previously, I was a maintainer of TorchAudio, the official audio library of PyTorch. Before Meta, I was a PhD student advised by Michael I. Mandel and an undergraduate student advised by Yan Xu.
My research interests include single-channel and multi-channel speech enhancement, generative models, and natural language processing. Recently, I have become interested in reinforcement learning for the audio domain, though this work is still at an exploratory stage.
🔥 News
- 2024.12: 🎉🎉 One paper has been accepted by ICASSP 2025!
- 2024.11: 🎉🎉 We are organizing the URGENT 2025 Challenge at Interspeech 2025! Join the challenge if you are interested in speech enhancement!
- 2024.09: 🎉🎉 Check out the demo of our MelodyFlow paper, which performs text-guided music editing and generation on 48 kHz music!
- 2024.09: 🎉🎉 Three papers have been accepted by IEEE SLT 2024!
- 2024.06: 🎉🎉 We are organizing the "Audio Imagination" Workshop at NeurIPS 2024! We cordially invite you to submit your paper or demo through this link!
- 2024.05: 🎉🎉 We are organizing the URGENT challenge in the NeurIPS 2024 Competition Track!
- 2024.04: 🎉🎉 Our MMS paper has been accepted by the Journal of Machine Learning Research!
- 2024.02: 🎉🎉 Check out the demo videos and paper of our FoleyGen model!
- 2023.12: 🎉🎉 Five papers have been accepted by ICASSP 2024!
- 2023.09: 🎉🎉 Our TorchAudio 2.1 paper has been accepted by ASRU 2023!
- 2023.05: 🎉🎉 One paper has been accepted by Interspeech 2023!
- 2023.02: 🎉🎉 Two papers have been accepted by ICASSP 2023!
📝 Publications
ICASSP 2025
Adapting Whisper for Code-Switching through Encoding Refining and Language-Aware Decoding, Jiahui Zhao, Hao Shi, Chenrui Cui, Tianrui Wang, Hexin Liu, Zhaoheng Ni, Lingxuan Ye, Longbiao Wang

arXiv
SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text, Haohe Liu, Gael Le Lan, Xinhao Mei, Zhaoheng Ni, Anurag Kumar, Varun Nagaraja, Wenwu Wang, Mark D. Plumbley, Yangyang Shi, Vikas Chandra

SLT 2024
Massively Multilingual Forced Aligner Leveraging Self-Supervised Discrete Units, Hirofumi Inaguma, Ilia Kulikov, Zhaoheng Ni, Sravya Popuri, Paden Tomasello

SLT 2024
Data Efficient Reflow for Few Step Audio Generation, Lemeng Wu, Zhaoheng Ni, Bowen Shi, Gael Le Lan, Anurag Kumar, Varun Nagaraja, Xinhao Mei, Yunyang Xiong, Bilge Soran, Raghuraman Krishnamoorthi, Wei-Ning Hsu, Yangyang Shi, Vikas Chandra

SLT 2024
Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition, Hao Shi, Yuan Gao, Zhaoheng Ni, Tatsuya Kawahara

MLSP 2024
FoleyGen: Visually-guided audio generation, Xinhao Mei, Varun Nagaraja, Gael Le Lan, Zhaoheng Ni, Ernie Chang, Yangyang Shi, Vikas Chandra

arXiv
High fidelity text-guided music generation and editing via single-stage flow matching, Gael Le Lan, Bowen Shi, Zhaoheng Ni, Sidd Srinivasan, Anurag Kumar, Brian Ellis, David Kant, Varun Nagaraja, Ernie Chang, Wei-Ning Hsu, Yangyang Shi, Vikas Chandra

Interspeech 2024
URGENT challenge: Universality, robustness, and generalizability for speech enhancement, Wangyou Zhang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Chenda Li, Zhaoheng Ni, Anurag Kumar, Jan Pirklbauer, Marvin Sach, Shinji Watanabe, Tim Fingscheidt, Yanmin Qian

ICASSP 2024
Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition, Yang Li, Liangzhen Lai, Yuan Shangguan, Forrest N. Iandola, Zhaoheng Ni, Ernie Chang, Yangyang Shi, Vikas Chandra

ICASSP 2024
Less peaky and more accurate CTC forced alignment by label priors, Ruizhe Huang, Xiaohui Zhang, Zhaoheng Ni, Li Sun, Moto Hira, Jeff Hwang, Vimal Manohar, Vineel Pratap, Matthew Wiesner, Shinji Watanabe, Daniel Povey, Sanjeev Khudanpur

ICASSP 2024
An empirical study on the impact of positional encoding in transformer-based monaural speech enhancement, Qiquan Zhang, Meng Ge, Hongxu Zhu, Eliathamby Ambikairajah, Qi Song, Zhaoheng Ni, Haizhou Li

ICASSP 2024
Stack-and-delay: a new codebook pattern for music generation, Gael Le Lan, Varun Nagaraja, Ernie Chang, David Kant, Zhaoheng Ni, Yangyang Shi, Forrest Iandola, Vikas Chandra

ICASSP 2024
On the Open Prompt Challenge in Conditional Audio Generation, Ernie Chang, Sidd Srinivasan, Mahi Luthra, Pin-Jie Lin, Varun Nagaraja, Forrest Iandola, Zechun Liu, Zhaoheng Ni, Changsheng Zhao, Yangyang Shi, Vikas Chandra

JMLR
Scaling speech technology to 1,000+ languages, Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli

ASRU 2023
TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch, Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, Pingchuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe, Yangyang Shi, Yumeng Tao

Interspeech 2023
Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute, William Chen, Xuankai Chang, Yifan Peng, Zhaoheng Ni, Soumi Maiti, Shinji Watanabe

ICASSP 2023
Ripple sparse self-attention for monaural speech enhancement, Qiquan Zhang, Hongxu Zhu, Qi Song, Xinyuan Qian, Zhaoheng Ni, Haizhou Li

ICASSP 2023
TorchAudio-Squim: Reference-less speech quality and intelligibility measures in TorchAudio, Anurag Kumar, Ke Tan, Zhaoheng Ni, Pranay Manocha, Xiaohui Zhang, Ethan Henderson, Buye Xu

Interspeech 2022
ESPnet-SE++: Speech enhancement for robust speech recognition, translation, and understanding, Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu Tsao, Yanmin Qian, Shinji Watanabe

ICASSP 2022
Towards low-distortion multi-channel speech enhancement: The ESPnet-SE submission to the L3DAS22 challenge, Yen-Ju Lu, Samuele Cornell, Xuankai Chang, Wangyou Zhang, Chenda Li, Zhaoheng Ni, Zhong-Qiu Wang, Shinji Watanabe

ICASSP 2022
TorchAudio: Building Blocks for Audio and Speech Processing, Yao-Yuan Yang, Moto Hira, Zhaoheng Ni, Anjali Chourdia, Artyom Astafurov, Caroline Chen, Ching-Feng Yeh, Christian Puhrsch, David Pollack, Dmitriy Genzel, Donny Greenberg, Edward Z. Yang, Jason Lian, Jay Mahadeokar, Jeff Hwang, Ji Chen, Peter Goldsborough, Prabhat Roy, Sean Narenthiran, Shinji Watanabe, Soumith Chintala, Vincent Quenneville-Bélair, Yangyang Shi

ICASSP 2022
Time-frequency attention for monaural speech enhancement, Qiquan Zhang, Qi Song, Zhaoheng Ni, Aaron Nicolson, Haizhou Li

SLT 2021
WPD++: An improved neural beamformer for simultaneous speech separation and dereverberation, Zhaoheng Ni, Yong Xu, Meng Yu, Bo Wu, Shixiong Zhang, Dong Yu, Michael I. Mandel

ICASSP 2020
Mask-dependent phase estimation for monaural speaker separation, Zhaoheng Ni, Michael I. Mandel

arXiv
CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings, Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, David Snyder, Aswin Shanmugam Subramanian, Jan Trmal, Bar Ben Yair, Christoph Boeddeker, Zhaoheng Ni, Yusuke Fujita, Shota Horiguchi, Naoyuki Kanda, Takuya Yoshioka, Neville Ryant

arXiv
Onssen: an open-source speech separation and enhancement library, Zhaoheng Ni, Michael I. Mandel