Yixuan Zhou (周逸轩)

Shenzhen International Graduate School, Tsinghua University

Human-Computer Speech Interaction Lab at Tsinghua University (THUHCSI)



I'm currently a PhD student at Shenzhen International Graduate School, Tsinghua University (SIGS, THU) in Shenzhen. My supervisor is Prof. Zhiyong Wu. Before that, I received my bachelor's degree in 2020 from the School of Information Science and Engineering, Southeast University.
My research interests include zero-shot voice cloning and expressive, audiobook, and conversational speech synthesis. I now mainly focus on large-scale speech, audio, and music generation models.


News

  • [Jan 2025] Open-sourced MiniCPM-o 2.6 [GitHub] (top of GitHub Trending and Hugging Face Trending) as a main contributor to the speech modality.
  • [Dec 2024] One paper is accepted to ICASSP 2025.
  • [Nov 2024] One paper is accepted to ISCSLP 2024.
  • [Sep 2024] One paper is accepted to NeurIPS 2024.
  • [Jul 2024] Three papers are accepted to ACM MM 2024: one at the main conference (oral, 4%) and two as workshop papers.
  • [Jun 2024] One paper is accepted to Interspeech 2024.
  • [Apr 2024] Two papers are accepted to ICASSP 2024.
  • [Oct 2023] One paper is accepted to IEEE/ACM Transactions on Audio, Speech and Language Processing.
  • [Jul 2023] One paper is accepted to Interspeech 2023.
  • [Jun 2023] One paper is recognized as a top 3% paper at ICASSP 2023.
  • [Sep 2022] Three papers are accepted to Interspeech 2022.
  • [May 2022] Two papers are accepted to ICASSP 2022.
  • [May 2021] One paper is accepted to ICASSP 2021.

Publications [Google Scholar]


2024:
  • DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models
    Weihao Wu, Zhiwei Lin, Yixuan Zhou, Jingbei Li, Rui Niu, Qinghua Wu, Songjun Cao, Long Ma, Zhiyong Wu
    ICASSP, 2025
  • The Codec Language Model-based Zero-Shot Spontaneous Style TTS System for CoVoC Challenge 2024
    Shuoyi Zhou, Yixuan Zhou, Weiqin Li, Jun Chen, Runchuan Ye, Weihao Wu, Zijian Lin, Shun Lei, Zhiyong Wu
    ISCSLP, 2024 [Paper]
  • SongCreator: Lyrics-based Universal Song Generation
    Shun Lei*, Yixuan Zhou*, Boshi Tang, Max WY Lam, Feng Liu, Hangyu Liu, Jingcheng Wu, Shiyin Kang, Zhiyong Wu, Helen Meng
    NeurIPS, 2024 [Paper]
  • VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling
    Yixuan Zhou*, Xiaoyu Qin*, Zeyu Jin, Shuoyi Zhou, Shun Lei, Songtao Zhou, Zhiyong Wu, Jia Jia
    ACM MM, 2024 (Oral, 4%) [Paper] [GitHub]
  • Multimodal Emotion Captioning Using Large Language Model with Prompt Engineering
    Yaoxun Xu*, Yixuan Zhou*, Yunrui Cai, Jingran Xie, Runchuan Ye, Zhiyong Wu
    MRAC workshop@ACM MM, 2024 [Paper]
  • Robust Representation Learning for Multimodal Emotion Recognition with Contrastive Learning and Mixup
    Yunrui Cai, Runchuan Ye, Jingran Xie, Yixuan Zhou, Yaoxun Xu, Zhiyong Wu
    MRAC workshop@ACM MM, 2024 [Paper]
  • Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models
    Weiqin Li, Peiji Yang, Yicheng Zhong, Yixuan Zhou, Zhisheng Wang, Zhiyong Wu, Xixin Wu, Helen Meng
    Interspeech, 2024 [Paper]
  • The THU-HCSI Multi-Speaker Multi-Lingual Few-Shot Voice Cloning System for LIMMITS'24 Challenge
    Yixuan Zhou, Shuoyi Zhou, Shun Lei, Zhiyong Wu, Menglin Wu
    ICASSP, 2024 [Paper]
  • Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts
    Shun Lei*, Yixuan Zhou*, Liyang Chen, Dan Luo, Zhiyong Wu, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han, Helen Meng
    ICASSP, 2024 [Paper]
2023:
  • MSStyleTTS: Multi-Scale Style Modeling with Hierarchical Context Information for Expressive Speech Synthesis
    Shun Lei*, Yixuan Zhou*, Liyang Chen, Zhiyong Wu, Xixin Wu, Shiyin Kang, Helen Meng
    IEEE/ACM Transactions on Audio, Speech and Language Processing [Paper]
  • Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis
    Weiqin Li, Shun Lei, Qiaochu Huang, Yixuan Zhou, Zhiyong Wu, Shiyin Kang, Helen Meng
    Interspeech, 2023 (Oral) [Paper]
  • Context-Aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis
    Shun Lei*, Yixuan Zhou*, Liyang Chen, Zhiyong Wu, Shiyin Kang, Helen Meng
    ICASSP, 2023 (Oral, Top 3% Paper) [Paper]
2022:
  • Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis
    Yixuan Zhou, Changhe Song, Xiang Li, Luwen Zhang, Zhiyong Wu, Yanyao Bian, Dan Su, Helen Meng
    Interspeech, 2022 [Paper] [GitHub]
  • Enhancing Word-Level Semantic Representation via Dependency Structure for Expressive Text-to-Speech Synthesis
    Yixuan Zhou*, Changhe Song*, Jingbei Li, Zhiyong Wu, Yanyao Bian, Dan Su, Helen Meng
    Interspeech, 2022 [Paper]
  • Towards Multi-Scale Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis
    Shun Lei*, Yixuan Zhou*, Liyang Chen, Jiankun Hu, Zhiyong Wu, Shiyin Kang, Helen Meng
    Interspeech, 2022 [Paper]
  • Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis
    Shun Lei, Yixuan Zhou, Liyang Chen, Zhiyong Wu, Shiyin Kang, Helen Meng
    ICASSP, 2022 [Paper]
  • A Character-Level Span-Based Model for Mandarin Prosodic Structure Prediction
    Xueyuan Chen*, Changhe Song*, Yixuan Zhou, Zhiyong Wu, Changbin Chen, Zhongqin Wu, Helen Meng
    ICASSP, 2022 [Paper]
2021:
  • Syntactic Representation Learning for Neural Network Based TTS with Syntactic Parse Tree Traversal
    Changhe Song, Jingbei Li, Yixuan Zhou, Zhiyong Wu, Helen Meng
    ICASSP, 2021 [Paper]