
Emotion selectable end-to-end text-based speech editing

Jan 1, 2024 · Tao Wang, Jiangyan Yi, Ruibo Fu, Jianhua Tao, Zhengqi Wen, Chu Yuan Zhang
Type: Journal article
Publication: Artificial Intelligence
Last updated on Jan 1, 2024


© 2025 Me. This work is licensed under CC BY NC ND 4.0
