Deep Metric Learning for the Target Cost in Unit-Selection Speech Synthesizer.

Jan 1, 2018
Ruibo Fu, Jianhua Tao, Yibin Zheng, Zhengqi Wen
Type: Conference paper
Publication: INTERSPEECH
Last updated on Jan 1, 2018


© 2025 Ruibo Fu. This work is licensed under CC BY-NC-ND 4.0.
