Publications | Zhaoheng Ni

2024

High Fidelity Text-Guided Music Generation and Editing via Single-Stage Flow Matching

Gael Le Lan, Bowen Shi, Zhaoheng Ni, Sidd Srinivasan, Anurag Kumar, Brian Ellis, David Kant, Varun Nagaraja, Ernie Chang, Wei-Ning Hsu, and others

arXiv preprint arXiv:2407.03648, 2024
URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

Wangyou Zhang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Chenda Li, Zhaoheng Ni, Anurag Kumar, Jan Pirklbauer, Marvin Sach, Shinji Watanabe, and others

arXiv preprint arXiv:2406.04660, 2024
Less Peaky and More Accurate CTC Forced Alignment by Label Priors

Ruizhe Huang, Xiaohui Zhang, Zhaoheng Ni, Li Sun, Moto Hira, Jeff Hwang, Vimal Manohar, Vineel Pratap, Matthew Wiesner, Shinji Watanabe, Daniel Povey, and 1 more author

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024
Folding Attention: Memory and Power Optimization for On-device Transformer-based Streaming Speech Recognition

Yang Li, Liangzhen Lai, Yuan Shangguan, Forrest N. Iandola, Zhaoheng Ni, Ernie Chang, Yangyang Shi, and Vikas Chandra

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024
An Empirical Study on the Impact of Positional Encoding in Transformer-based Monaural Speech Enhancement

Qiquan Zhang, Meng Ge, Hongxu Zhu, Eliathamby Ambikairajah, Qi Song, Zhaoheng Ni, and Haizhou Li

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024
On The Open Prompt Challenge In Conditional Audio Generation

Ernie Chang, Sidd Srinivasan, Mahi Luthra, Pin-Jie Lin, Varun Nagaraja, Forrest Iandola, Zechun Liu, Zhaoheng Ni, Changsheng Zhao, Yangyang Shi, and others

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024
Stack-and-Delay: A New Codebook Pattern for Music Generation

Gael Le Lan, Varun Nagaraja, Ernie Chang, David Kant, Zhaoheng Ni, Yangyang Shi, Forrest Iandola, and Vikas Chandra

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

2023

Exploring Speech Enhancement for Low-resource Speech Synthesis

Zhaoheng Ni, Sravya Popuri, Ning Dong, Kohei Saijo, Xiaohui Zhang, Gael Le Lan, Yangyang Shi, Vikas Chandra, and Changhan Wang

arXiv preprint arXiv:2309.10795, 2023
FoleyGen: Visually-Guided Audio Generation

Xinhao Mei, Varun Nagaraja, Gael Le Lan, Zhaoheng Ni, Ernie Chang, Yangyang Shi, and Vikas Chandra

arXiv preprint arXiv:2309.10537, 2023
Enhance Audio Generation Controllability through Representation Similarity Regularization

Yangyang Shi, Gael Le Lan, Varun Nagaraja, Zhaoheng Ni, Xinhao Mei, Ernie Chang, Forrest Iandola, Yang Liu, and Vikas Chandra

arXiv preprint arXiv:2309.08773, 2023
TorchAudio 2.1: Advancing Speech Recognition, Self-supervised Learning, and Audio Processing Components for PyTorch

Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, Pingchuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, and others

IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023

Software Design and User Interface of ESPnet-SE++: Speech Enhancement for Robust Speech Processing

Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, and others

Journal of Open Source Software, 2023
TorchAudio-Squim: Reference-less Speech Quality and Intelligibility measures in TorchAudio

Anurag Kumar, Ke Tan, Zhaoheng Ni, Pranay Manocha, Xiaohui Zhang, Ethan Henderson, and Buye Xu

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023

ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit

Brian Yan, Jiatong Shi, Yun Tang, Hirofumi Inaguma, Yifan Peng, Siddharth Dalmia, Peter Polák, Patrick Fernandes, Dan Berrebbi, Tomoki Hayashi, and others

arXiv preprint arXiv:2304.04596, 2023
Scaling Speech Technology to 1,000+ Languages

Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, and others

arXiv preprint arXiv:2305.13516, 2023

Ripple Sparse Self-attention for Monaural Speech Enhancement

Qiquan Zhang, Hongxu Zhu, Qi Song, Xinyuan Qian, Zhaoheng Ni, and Haizhou Li

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023
Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute

William Chen, Xuankai Chang, Yifan Peng, Zhaoheng Ni, Soumi Maiti, and Shinji Watanabe

arXiv preprint arXiv:2306.06672, 2023

2022

TorchAudio: Building Blocks for Audio and Speech Processing

Yao-Yuan Yang, Moto Hira, Zhaoheng Ni, Artyom Astafurov, Caroline Chen, Christian Puhrsch, David Pollack, Dmitriy Genzel, Donny Greenberg, Edward Z Yang, and others

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022

Time-Frequency Attention for Monaural Speech Enhancement

Qiquan Zhang, Qi Song, Zhaoheng Ni, Aaron Nicolson, and Haizhou Li

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022
ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding

Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu Tsao, and 2 more authors

In Proc. Interspeech 2022, 2022
Towards Low-distortion Multi-channel Speech Enhancement: The ESPNet-SE Submission to the L3DAS22 Challenge

Yen-Ju Lu, Samuele Cornell, Xuankai Chang, Wangyou Zhang, Chenda Li, Zhaoheng Ni, Zhong-Qiu Wang, and Shinji Watanabe

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022
A Time-Frequency Attention Module for Neural Speech Enhancement

Qiquan Zhang, Xinyuan Qian, Zhaoheng Ni, Aaron Nicolson, Eliathamby Ambikairajah, and Haizhou Li

2022

2021

WPD++: An Improved Neural Beamformer for Simultaneous Speech Separation and Dereverberation

Zhaoheng Ni, Yong Xu, Meng Yu, Bo Wu, Shixiong Zhang, Dong Yu, and Michael I Mandel

In 2021 IEEE Spoken Language Technology Workshop (SLT), 2021

2020

CHiME-6 challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, Zhaoheng Ni, and 1 more author

arXiv preprint arXiv:2004.09249, 2020
Mask-dependent Phase Estimation for Monaural Speaker Separation

Zhaoheng Ni and Michael I Mandel

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020
CUNY Speech Diarization System for the CHiME-6 Challenge

Zhaoheng Ni and Michael I Mandel

In Proc. The 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020), 2020
Improved MVDR Beamforming Using LSTM Speech Models to Clean Spatial Clustering Masks

Zhaoheng Ni, Felix Grezes, Viet Anh Trinh, and Michael I Mandel

arXiv preprint arXiv:2012.02191, 2020
Combining Spatial Clustering with LSTM Speech Models for Multichannel Speech Enhancement

Felix Grezes, Zhaoheng Ni, Viet Anh Trinh, and Michael Mandel

arXiv preprint arXiv:2012.03388, 2020
Enhancement of Spatial Clustering-based Time-Frequency Masks using LSTM Neural Networks

Felix Grezes, Zhaoheng Ni, Viet Anh Trinh, and Michael Mandel

arXiv preprint arXiv:2012.01576, 2020

2019

ONSSEN: An Open-source Speech Separation and Enhancement Library

Zhaoheng Ni and Michael I Mandel

arXiv preprint arXiv:1911.00982, 2019

2018

Unusable Spoken Response Detection with BLSTM Neural Networks

Zhaoheng Ni, Rutuja Ubale, Yao Qian, Michael Mandel, Su-Youn Yoon, Abhinav Misra, and David Suendermann-Oeft

In 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2018
Sound Signal Processing with Seq2Tree Network

Weicheng Ma, Kai Cao, Zhaoheng Ni, Peter Chin, and Xiang Li

In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018

2014

Anatomical Entity Recognition with a Hierarchical Framework Augmented by External Resources

Yan Xu, Ji Hua, Zhaoheng Ni, Qinlang Chen, Yubo Fan, Sophia Ananiadou, Eric I-Chao Chang, and Junichi Tsujii

PLoS ONE, 2014