Yixuan Zhou (周逸轩)
Shenzhen International Graduate School, Tsinghua University
Human-Computer Speech Interaction Lab at Tsinghua University (THUHCSI)
- yx-zhou23@mails.tsinghua.edu.cn
- Google Scholar
- Github
- Shenzhen, Guangdong, China

I'm currently a PhD student at Shenzhen International Graduate School, Tsinghua University
(SIGS, THU) in Shenzhen. My supervisor is Prof. Zhiyong Wu. Before that,
I received my bachelor’s degree in 2020 from the School of Information Science and Engineering at Southeast University.
My research interests include Zero-Shot Voice Cloning and Expressive/Audiobook/Conversational Speech Synthesis. I now mainly focus on Large-Scale Speech/Audio/Music Generation Models.
News
- [Jan 2025] Open-sourced MiniCPM-o 2.6 [GitHub] (top of GitHub Trending and Hugging Face Trending) as a main contributor to the speech modality.
- [Dec 2024] One paper is accepted to ICASSP 2025.
- [Nov 2024] One paper is accepted to ISCSLP 2024.
- [Sep 2024] One paper is accepted to NeurIPS 2024.
- [Jul 2024] Three papers are accepted to ACM MM 2024: one to the main conference (oral, 4%) and two as workshop papers.
- [Jun 2024] One paper is accepted to Interspeech 2024.
- [Apr 2024] Two papers are accepted to ICASSP 2024.
- [Oct 2023] One paper is accepted to IEEE/ACM Transactions on Audio, Speech and Language Processing.
- [Jul 2023] One paper is accepted to Interspeech 2023.
- [Jun 2023] One paper is recognized as a top 3% paper at ICASSP 2023.
- [Sep 2022] Three papers are accepted to Interspeech 2022.
- [May 2022] Two papers are accepted to ICASSP 2022.
- [May 2021] One paper is accepted to ICASSP 2021.
Publications [Google Scholar]
2024:
- DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models
  Weihao Wu, Zhiwei Lin, Yixuan Zhou, Jingbei Li, Rui Niu, Qinghua Wu, Songjun Cao, Long Ma, Zhiyong Wu
  ICASSP, 2025
- The Codec Language Model-based Zero-Shot Spontaneous Style TTS System for CoVoC Challenge 2024
  Shuoyi Zhou, Yixuan Zhou, Weiqing Li, Jun Chen, Runchuan Ye, Weihao Wu, Zijian Lin, Shun Lei, Zhiyong Wu
  ISCSLP, 2024 [Paper]
- SongCreator: Lyrics-based Universal Song Generation
  Shun Lei*, Yixuan Zhou*, Boshi Tang, Max WY Lam, Feng Liu, Hangyu Liu, Jingcheng Wu, Shiyin Kang, Zhiyong Wu, Helen Meng
  NeurIPS, 2024 [Paper]
- VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling
  Yixuan Zhou*, Xiaoyu Qin*, Zeyu Jin, Shuoyi Zhou, Shun Lei, Songtao Zhou, Zhiyong Wu, Jia Jia
- Multimodal Emotion Captioning Using Large Language Model with Prompt Engineering
  Yaoxun Xu*, Yixuan Zhou*, Yunrui Cai, Jingran Xie, Runchuan Ye, Zhiyong Wu
  MRAC workshop@ACM MM, 2024 [Paper]
- Robust Representation Learning for Multimodal Emotion Recognition with Contrastive Learning and Mixup
  Yunrui Cai, Runchuan Ye, Jingran Xie, Yixuan Zhou, Yaoxun Xu, Zhiyong Wu
  MRAC workshop@ACM MM, 2024 [Paper]
- Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models
  Weiqin Li, Peiji Yang, Yicheng Zhong, Yixuan Zhou, Zhisheng Wang, Zhiyong Wu, Xixin Wu, Helen Meng
  Interspeech, 2024 [Paper]
- The THU-HCSI Multi-Speaker Multi-Lingual Few-Shot Voice Cloning System for LIMMITS'24 Challenge
  Yixuan Zhou, Shuoyi Zhou, Shun Lei, Zhiyong Wu, Menglin Wu
  ICASSP, 2024 [Paper]
- Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts
  Shun Lei*, Yixuan Zhou*, Liyang Chen, Dan Luo, Zhiyong Wu, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han, Helen Meng
  ICASSP, 2024 [Paper]
- MSStyleTTS: Multi-Scale Style Modeling with Hierarchical Context Information for Expressive Speech Synthesis
  Shun Lei*, Yixuan Zhou*, Liyang Chen, Zhiyong Wu, Xixin Wu, Shiyin Kang, Helen Meng
  IEEE/ACM Transactions on Audio, Speech and Language Processing [Paper]
- Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis
  Weiqin Li, Shun Lei, Qiaochu Huang, Yixuan Zhou, Zhiyong Wu, Shiyin Kang, Helen Meng
  Interspeech, 2023 (Oral) [Paper]
- Context-Aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis
  Shun Lei*, Yixuan Zhou*, Liyang Chen, Zhiyong Wu, Shiyin Kang, Helen Meng
  ICASSP, 2023 (Oral, Top 3% Paper) [Paper]
- Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis
  Yixuan Zhou, Changhe Song, Xiang Li, Luwen Zhang, Zhiyong Wu, Yanyao Bian, Dan Su, Helen Meng
- Enhancing Word-Level Semantic Representation via Dependency Structure for Expressive Text-to-Speech Synthesis
  Yixuan Zhou*, Changhe Song*, Jingbei Li, Zhiyong Wu, Yanyao Bian, Dan Su, Helen Meng
  Interspeech, 2022 [Paper]
- Towards Multi-Scale Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis
  Shun Lei*, Yixuan Zhou*, Liyang Chen, Hu Jiankun, Zhiyong Wu, Shiyin Kang, Helen Meng
  Interspeech, 2022 [Paper]
- Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis
  Shun Lei, Yixuan Zhou, Liyang Chen, Zhiyong Wu, Shiyin Kang, Helen Meng
  ICASSP, 2022 [Paper]
- A character-level span-based model for Mandarin prosodic structure prediction
  Xueyuan Chen*, Changhe Song*, Yixuan Zhou, Zhiyong Wu, Changbin Chen, Zhongqin Wu, Helen Meng
  ICASSP, 2022 [Paper]
- Syntactic Representation Learning For Neural Network Based TTS with Syntactic Parse Tree Traversal
  Changhe Song, Jingbei Li, Yixuan Zhou, Zhiyong Wu, Helen Meng
  ICASSP, 2021 [Paper]