Yixuan Zhou (周逸轩)

Shenzhen International Graduate School, Tsinghua University

Human-Computer Speech Interaction Lab at Tsinghua University (THUHCSI)

I'm currently a PhD student at Shenzhen International Graduate School, Tsinghua University (SIGS, THU) in Shenzhen. My supervisor is Prof. Zhiyong Wu. Before that, I received my bachelor’s degree in 2020 from the School of Information Science and Engineering at Southeast University.
My research interests include Zero-Shot Voice Cloning and Expressive/Audiobook/Conversational Speech Synthesis. I now mainly focus on Large-Scale Speech/Audio/Music Generation Models.

News

[Apr 2026] Open-sourced VoxCPM2 [GitHub] [HF Link] [Docs], a TTS model that supports 30+ languages, voice design, and controllable voice cloning, as the project leader.
[Jan 2026] One paper is accepted to ICLR 2026.
[Jan 2026] One paper is accepted to ICASSP 2026.
[Dec 2025] Open-sourced VoxCPM1.5 [GitHub] [HF Link] (Top 1 on GitHub Trending) as the project leader.
[Sep 2025] Open-sourced VoxCPM-0.5B [GitHub] [HF Link] (Top 1 on HuggingFace Trending) as the project leader.
[Jul 2025] Two papers are accepted to ACMMM 2025, one at the main conference and one at a workshop.
[Mar 2025] One paper is accepted to Interspeech 2025.
[Jan 2025] Open-sourced MiniCPM-o 2.6 [GitHub] [HF Link] (Top on GitHub Trending and HuggingFace Trending) as a main contributor to the speech modality.
[Dec 2024] One paper is accepted to ICASSP 2025.
[Nov 2024] One paper is accepted to ISCSLP 2024.
[Sep 2024] One paper is accepted to NeurIPS 2024.
[Jul 2024] Three papers are accepted to ACMMM 2024, one at the main conference (oral, 4%) and two at workshops.
[Jun 2024] One paper is accepted to Interspeech 2024.
[Apr 2024] Two papers are accepted to ICASSP 2024.
[Oct 2023] One paper is accepted to IEEE/ACM Transactions on Audio, Speech and Language Processing.
[Jul 2023] One paper is accepted to Interspeech 2023.
[Jun 2023] One paper is recognized as a top 3% paper at ICASSP 2023.
[Sep 2022] Three papers are accepted to Interspeech 2022.
[May 2022] Two papers are accepted to ICASSP 2022.
[May 2021] One paper is accepted to ICASSP 2021.

Publications [Google Scholar]

2026:

Hierarchical Semantic-Acoustic Modeling via Semi-Discrete Residual Representations for Expressive End-to-End Speech Synthesis

Yixuan Zhou, Guoyang Zeng, Xin Liu, Xiang Li, Renjie Yu, Ziyang Wang, Runchuan Ye, Weiyue Sun, Jiancheng Gui, Kehan Li, Zhiyong Wu, Zhiyuan Liu

ICLR, 2026 [Paper]
HASap: Hierarchical Acoustic-Semantic Annotation Pipeline for Scripted Speech Data

Kehan Li, Runchuan Ye, Yixuan Zhou, Xin Liu, Zhiyong Wu

ICASSP, 2026

2025:

A Dual-Branch Ensemble Framework for Personality Recognition Based on Multimodal Emotion Features

Renjie Yu, Yunrui Cai, Yixuan Zhou, Runchuan Ye, Zhiyong Wu

MRAC workshop@ACM MM, 2025 [Paper]
HarmoniVox: Painting Voices to Match the Avatar's Soul

Songtao Zhou, Xiaoyu Qin, Yixuan Zhou, Qixin Wang, Zeyu Jin, Zixuan Wang, Zhiyong Wu, Jia Jia

ACM MM, 2025 [Paper]
In This Environment, As That Speaker: A Text-Driven Framework for Multi-Attribute Speech Conversion

Jiawei Jin, Zhihan Yang, Yixuan Zhou, Zhiyong Wu

Interspeech, 2025 [Paper]
DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models

Weihao Wu, Zhiwei Lin, Yixuan Zhou, Jingbei Li, Rui Niu, Qinghua Wu, Songjun Cao, Long Ma, Zhiyong Wu

ICASSP, 2025 [Paper]

2024:

The Codec Language Model-based Zero-Shot Spontaneous Style TTS System for CoVoC Challenge 2024

Shuoyi Zhou, Yixuan Zhou, Weiqin Li, Jun Chen, Runchuan Ye, Weihao Wu, Zijian Lin, Shun Lei, Zhiyong Wu

ISCSLP, 2024 [Paper]
SongCreator: Lyrics-based Universal Song Generation

Shun Lei*, Yixuan Zhou*, Boshi Tang, Max WY Lam, Feng Liu, Hangyu Liu, Jingcheng Wu, Shiyin Kang, Zhiyong Wu, Helen Meng

NeurIPS, 2024 [Paper]
VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling

Yixuan Zhou*, Xiaoyu Qin*, Zeyu Jin, Shuoyi Zhou, Shun Lei, Songtao Zhou, Zhiyong Wu, Jia Jia

ACM MM, 2024 (Oral, 4%) [Paper] [GitHub]
Multimodal Emotion Captioning Using Large Language Model with Prompt Engineering

Yaoxun Xu*, Yixuan Zhou*, Yunrui Cai, Jingran Xie, Runchuan Ye, Zhiyong Wu

MRAC workshop@ACM MM, 2024 [Paper]
Robust Representation Learning for Multimodal Emotion Recognition with Contrastive Learning and Mixup

Yunrui Cai, Runchuan Ye, Jingran Xie, Yixuan Zhou, Yaoxun Xu, Zhiyong Wu

MRAC workshop@ACM MM, 2024 [Paper]
Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models

Weiqin Li, Peiji Yang, Yicheng Zhong, Yixuan Zhou, Zhisheng Wang, Zhiyong Wu, Xixin Wu, Helen Meng

Interspeech, 2024 [Paper]
The THU-HCSI Multi-Speaker Multi-Lingual Few-Shot Voice Cloning System for LIMMITS'24 Challenge

Yixuan Zhou, Shuoyi Zhou, Shun Lei, Zhiyong Wu, Menglin Wu

ICASSP, 2024 [Paper]
Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

Shun Lei*, Yixuan Zhou*, Liyang Chen, Dan Luo, Zhiyong Wu, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han, Helen Meng

ICASSP, 2024 [Paper]

2023:

MSStyleTTS: Multi-Scale Style Modeling with Hierarchical Context Information for Expressive Speech Synthesis

Shun Lei*, Yixuan Zhou*, Liyang Chen, Zhiyong Wu, Xixin Wu, Shiyin Kang, Helen Meng

IEEE/ACM Transactions on Audio, Speech and Language Processing [Paper]
Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis

Weiqin Li, Shun Lei, Qiaochu Huang, Yixuan Zhou, Zhiyong Wu, Shiyin Kang, Helen Meng

Interspeech, 2023 (Oral) [Paper]
Context-Aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis

Shun Lei*, Yixuan Zhou*, Liyang Chen, Zhiyong Wu, Shiyin Kang, Helen Meng

ICASSP, 2023 (Oral, Top 3% Paper) [Paper]

2022:

Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis

Yixuan Zhou, Changhe Song, Xiang Li, Luwen Zhang, Zhiyong Wu, Yanyao Bian, Dan Su, Helen Meng

Interspeech, 2022 [Paper] [GitHub]
Enhancing Word-Level Semantic Representation via Dependency Structure for Expressive Text-to-Speech Synthesis

Yixuan Zhou*, Changhe Song*, Jingbei Li, Zhiyong Wu, Yanyao Bian, Dan Su, Helen Meng

Interspeech, 2022 [Paper]
Towards Multi-Scale Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis

Shun Lei*, Yixuan Zhou*, Liyang Chen, Jiankun Hu, Zhiyong Wu, Shiyin Kang, Helen Meng

Interspeech, 2022 [Paper]
Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis

Shun Lei, Yixuan Zhou, Liyang Chen, Zhiyong Wu, Shiyin Kang, Helen Meng

ICASSP, 2022 [Paper]
A Character-Level Span-Based Model for Mandarin Prosodic Structure Prediction

Xueyuan Chen*, Changhe Song*, Yixuan Zhou, Zhiyong Wu, Changbin Chen, Zhongqin Wu, Helen Meng

ICASSP, 2022 [Paper]

2021:

Syntactic Representation Learning For Neural Network Based TTS with Syntactic Parse Tree Traversal

Changhe Song, Jingbei Li, Yixuan Zhou, Zhiyong Wu, Helen Meng

ICASSP, 2021 [Paper]

Education

Ph.D. in Computer Science and Technology

Tsinghua University

2023.09 - Present
M.Eng. in Computer Technology (transferred to the Ph.D. program in 2023)

Tsinghua University

2020.09 - 2023.06
B.Eng. in Information Technology

Southeast University

2016.09 - 2020.06

Experience

Speech & Multimodality Algorithm Intern

Modelbest Inc.

MiniCPM-o 2.6 project

VoxCPM series project

2024.08 - Present
Speech Algorithm Intern

SAMI, ByteDance Inc.

2022.09 - 2023.06
Speech Algorithm Intern

AI Lab, Tencent Inc.

2021.05 - 2022.06

Awards

3rd in Constrained Track

ISCSLP 2024 CoVoC Challenge
1st in Track-1, SMOS

ICASSP 2024 LIMMITS'24 Challenge
Best Paper Award

International PhD Forum 2023
Bronze Medal

The 2017 ICPC Asia Nanning Regional Contest