
Audio Ducking Definition
Audio ducking is the automated process of temporarily lowering the volume of background music or ambient sound when dialogue or other primary audio occurs, ensuring speech remains clear and intelligible.
Why audio ducking matters for video production
Great video audio should feel invisible. Viewers should hear dialogue clearly without consciously noticing background music. When music competes with speech, viewers strain to understand words or miss critical information entirely. When music stops abruptly to make room for speech, the transition feels jarring. Audio ducking solves both problems by smoothly reducing background audio whenever primary audio occurs.
Manually keyframing music volume throughout a video is tedious and time-consuming. An editor must listen to the entire piece, identify every moment where speech occurs, and create volume automation curves that lower music appropriately. For a 10-minute video with frequent dialogue, this might mean hundreds of manual keyframes. Audio ducking automates this entirely.
How audio ducking works
Traditional audio ducking uses sidechain compression — a dynamics processor that listens to one audio source (the dialogue track) and automatically reduces the volume of another (the music track) whenever the dialogue exceeds a threshold. The reduction amount, speed, and release time are adjustable parameters. When dialogue stops, the music volume smoothly returns to its original level.
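The sidechain behaviour described above can be sketched in a few lines. This is a minimal illustration, not how any particular editor or plugin implements it: the function name `duck_music`, its parameter names, the default values, and the simple peak detector with a 50 ms decay are all assumptions chosen for readability. Samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math

def duck_music(music, dialogue, sample_rate=48000, threshold_db=-40.0,
               reduction_db=9.0, attack_ms=20.0, release_ms=300.0, hold_ms=50.0):
    """Return music samples attenuated while the dialogue level exceeds the threshold."""
    threshold = 10 ** (threshold_db / 20)      # threshold as linear amplitude
    ducked_gain = 10 ** (-reduction_db / 20)   # music gain while fully ducked

    def coeff(ms):
        # One-pole smoothing coefficient for a time constant given in milliseconds
        return math.exp(-1.0 / (sample_rate * ms / 1000.0))

    attack, release, decay = coeff(attack_ms), coeff(release_ms), coeff(hold_ms)
    level = 0.0   # detected dialogue level (the sidechain signal)
    gain = 1.0    # current gain applied to the music
    out = []
    for m, d in zip(music, dialogue):
        # Peak detector: jump up immediately, decay slowly, so the detector
        # does not drop out at every zero crossing of the waveform
        level = max(abs(d), level * decay)
        target = ducked_gain if level > threshold else 1.0
        # Use the attack time when gain is falling, the release time when it is rising
        smoothing = attack if target < gain else release
        gain = smoothing * gain + (1.0 - smoothing) * target
        out.append(m * gain)
    return out
```

Calling `duck_music(music_samples, dialogue_samples)` with two equal-length sample lists returns the music with the gain reduction applied; the dialogue itself is only listened to, never modified.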
AI-enhanced ducking goes further by understanding audio semantically. Rather than simply reacting to volume level, AI can distinguish speech from non-speech sounds (coughs, breath, background noise), ensuring music only ducks for actual dialogue. It can also adapt ducking intensity based on the importance of the speech — a casual aside might trigger less reduction than a critical piece of narration.
Best practices for audio ducking
Set appropriate thresholds: Too low, and music ducks for every tiny sound, including breaths and background noise. Too high, and quiet dialogue does not trigger ducking. Start conservative (a higher threshold) and lower it until all dialogue triggers ducking reliably.
Adjust attack and release times: Attack determines how quickly music volume drops when dialogue starts. Too fast sounds unnatural; too slow means the first syllables get buried. Release controls how quickly music returns after dialogue ends. Faster release sounds more responsive but can create pumping effects if dialogue is sporadic. Typical values: 10-30 ms attack, 200-500 ms release.
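To see why release time matters for sporadic dialogue, the rough sketch below models the music gain (in dB) for a repeating pattern of short phrases and pauses, then compares how far the gain swings with a fast and a slow release. The function names, the 1 ms step size, the -9 dB depth, and the phrase pattern are illustrative assumptions, and smoothing the gain directly in dB is a simplification.

```python
import math

def gain_envelope(active, step_ms=1.0, depth_db=-9.0,
                  attack_ms=20.0, release_ms=300.0):
    """Smoothed music gain in dB for a per-step dialogue on/off sequence."""
    attack = math.exp(-step_ms / attack_ms)
    release = math.exp(-step_ms / release_ms)
    gain_db = 0.0
    envelope = []
    for is_speech in active:
        target = depth_db if is_speech else 0.0
        # Falling gain uses the attack time, rising gain uses the release time
        coeff = attack if target < gain_db else release
        gain_db = coeff * gain_db + (1.0 - coeff) * target
        envelope.append(gain_db)
    return envelope

def swing(envelope, skip=500):
    """Range of gain movement in dB once the first phrase has started the ducking."""
    tail = envelope[skip:]
    return max(tail) - min(tail)

# Sporadic dialogue: 500 ms phrases separated by 400 ms pauses (1 ms per step)
pattern = ([True] * 500 + [False] * 400) * 4

fast = gain_envelope(pattern, release_ms=200.0)
slow = gain_envelope(pattern, release_ms=500.0)

print(f"fast release swing: {swing(fast):.1f} dB")
print(f"slow release swing: {swing(slow):.1f} dB")
```

With the faster release, the music recovers further during each pause before being pushed down again, so the gain swings over a wider range. That repeated rise and fall is what is heard as pumping.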
Choose appropriate reduction depth: Music does not need to disappear entirely; it only needs enough reduction to make dialogue clear. Typical ducking reduces music by 6-12 dB. More aggressive reduction (15+ dB) pushes the music nearly out of the mix and can feel heavy-handed.
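For a sense of scale, decibel reductions can be converted to linear amplitude multipliers with the standard relation gain = 10^(dB/20). The helper name `db_to_gain` below is just an illustrative choice.

```python
def db_to_gain(db):
    """Convert a level change in decibels to a linear amplitude multiplier."""
    return 10 ** (db / 20)

for db in (-6, -12, -15):
    print(f"{db} dB -> x{db_to_gain(db):.2f} amplitude")
# -6 dB -> x0.50 amplitude
# -12 dB -> x0.25 amplitude
# -15 dB -> x0.18 amplitude
```

In amplitude terms, a 6 dB duck roughly halves the music signal and a 12 dB duck leaves about a quarter of it. Perceived loudness does not scale linearly with amplitude, so these figures are a guide to the signal level rather than to how loud the music will seem.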
Consider music selection: Dense, busy music with prominent midrange frequencies competes with speech more than sparse, atmospheric music with strong bass and treble. Choose background music that naturally occupies a different frequency space than dialogue, requiring less aggressive ducking.
Common ducking mistakes
Over-ducking: Reducing music so aggressively that the volume pumping becomes distracting. The goal is subtle support for dialogue clarity, not dramatic volume swings.
Ducking everything: Not all background audio needs to duck. Ambient sound (room tone, outdoor atmosphere) typically does not conflict with dialogue the way music does. Ducking everything creates an unnatural vacuum around speech.
Ignoring music phrasing: Automatic ducking follows the dialogue, not the music, so it can cut across musical phrases at arbitrary points. In music-driven content, consider timing dialogue to align with musical structure when possible, so ducking does not awkwardly interrupt musical moments.
Applications beyond dialogue
Audio ducking applies to any scenario with primary and secondary audio:
- Sound effects ducking music during dramatic moments
- Voiceover ducking interview audio when both need to coexist
- Narration ducking ambient sound in documentaries
- Podcast intro music ducking for host voice
How ShotAI relates to audio ducking
ShotAI's audio analysis capabilities can identify sections of video where dialogue is present, enabling editors to quickly locate and review segments that will require audio ducking when adding music, streamlining the audio post-production workflow.
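As a rough illustration of how detected dialogue sections could feed a ducking pass, the sketch below turns a list of dialogue segments into music volume keyframes. Everything here is hypothetical: the `(start, end)` segment format in seconds, the function name `ducking_keyframes`, and its parameters are assumptions for the example and do not describe ShotAI's actual output or any editor's keyframe format.

```python
def ducking_keyframes(segments, depth_db=-9.0, fade_in=0.02,
                      fade_out=0.3, min_gap=0.5):
    """Turn (start, end) dialogue segments in seconds into (time, gain_db) keyframes.

    min_gap should stay larger than fade_in + fade_out so that the keyframes
    of neighbouring segments do not overlap in time.
    """
    # Merge segments separated by short pauses so the music does not
    # rise and fall between closely spaced phrases
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < min_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    keyframes = []
    for start, end in merged:
        keyframes += [
            (max(0.0, start - fade_in), 0.0),  # full level just before speech
            (start, depth_db),                 # fully ducked as speech begins
            (end, depth_db),                   # hold through the segment
            (end + fade_out, 0.0),             # recover after speech ends
        ]
    return keyframes

# Hypothetical dialogue segments: the first two are close enough to be merged
for time, gain_db in ducking_keyframes([(1.0, 2.0), (2.2, 3.0), (5.0, 6.0)]):
    print(f"{time:6.2f} s  {gain_db:5.1f} dB")
```

The fade lengths mirror the attack and release ranges mentioned earlier, and the merge step is one simple way to avoid pumping when phrases are separated by short pauses.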
Related Terms
Automated Video Transcription
Automated video transcription is the AI-driven process of converting spoken audio in video into timestamped text transcripts, enabling searchable dialogue records, subtitle generation, and content accessibility without manual listening and typing.
Audio Synchronization
Audio synchronization is the process of aligning separately recorded audio tracks with video footage using timecode matching, waveform analysis, or manual slate alignment to ensure lip-sync accuracy.
Motion Graphics
Motion graphics are animated visual elements (including text, shapes, icons, data visualizations, and transitions) designed to communicate information or enhance video content through movement and design.
Written by the ShotAI team. Last updated May 2026.