FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks

AI Generated · 🤗 Daily Papers · Speech and Audio · Speech Coding · 🏢 Concordia University
Author: Hugging Face Daily Papers

2502.04465
Luca Della Libera et al.
🤗 2025-02-12


TL;DR

Current neural audio codecs face challenges such as high bitrates, loss of information, and complex multi-codebook designs. These limitations hinder efficient speech processing for various applications. The high computational cost and difficulty in capturing long-term dependencies are also significant concerns.

To overcome these issues, the researchers propose FocalCodec. FocalCodec uses focal modulation networks and a single binary codebook to efficiently compress speech, achieving competitive performance at bitrates as low as 0.16 kbps. The single codebook design simplifies downstream tasks, while focal modulation captures both semantic and acoustic information effectively. Evaluations across various benchmarks demonstrate its superior performance over existing codecs in reconstruction quality, multilingual capabilities, and robustness to noisy environments.

Key Takeaways

Why does it matter?

This paper is important because it introduces FocalCodec, a novel low-bitrate speech codec that significantly improves speech reconstruction quality and efficiency. This offers significant advancements for various speech processing applications and opens avenues for research in low-bitrate speech coding and downstream tasks. The use of focal modulation is a novel approach with potential applications beyond speech processing.


Visual Insights

🔼 The FocalCodec architecture diagram shows the flow of information processing. The encoder processes the input audio to extract features rich in both acoustic and semantic information. These features are compressed into a low-dimensional representation by the compressor module. This compressed representation is then quantized using a binary codebook. A decompressor module reconstructs the compressed features from this quantized representation. Finally, the decoder uses the reconstructed features to generate the output waveform, completing the speech coding process.

Figure 1: FocalCodec architecture. The encoder extracts features containing both acoustic and semantic information. These features are then mapped to a low-dimensional space by the compressor, binary quantized, and projected back by the decompressor. The decoder resynthesizes the waveform from these features.
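To make the pipeline concrete, the toy sketch below traces tensor shapes through an encoder-compressor-quantizer chain at FocalCodec@50's rates (16 kHz input, 50 tokens/s, 13-bit binary codes). The random projection matrices are stand-ins for the learned focal modulation modules, not the paper's actual layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shape trace for FocalCodec@50: 16 kHz audio -> 50 tokens/s,
# each token a 13-bit binary code (2^13 = 8192 possible values).
sample_rate, token_rate, code_dim = 16_000, 50, 13
seconds = 2
wave = rng.standard_normal(sample_rate * seconds)

# Encoder: waveform -> feature frames at the token rate (320x downsampling).
hop = sample_rate // token_rate                      # 320 samples per token
frames = wave[: len(wave) // hop * hop].reshape(-1, hop)
features = frames @ rng.standard_normal((hop, 64))   # (100, 64)

# Compressor: features -> low-dimensional latents, then binary quantization.
latents = features @ rng.standard_normal((64, code_dim))   # (100, 13)
codes = np.where(latents >= 0, 1.0, -1.0)                  # hypercube vertices

# Decompressor + decoder would map codes back up and resynthesize audio.
print(codes.shape)   # (100, 13): 2 s of speech -> 100 binary tokens
```

Two seconds of audio become just 100 tokens, which is what makes the representation attractive for downstream language-model-style processing.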
| Codec | Bitrate (kbps) | Sample Rate (kHz) | Token Rate (Hz) | Codebooks | Code Size | Params (M) | MACs (G) |
|---|---|---|---|---|---|---|---|
| EnCodec | 1.50 | 24 | 75.0 | 2 × 1024 | 128 | 15 | 2 |
| DAC | 1.00 | 16 | 50.0 | 2 × 1024 | 8 | 74 | 56 |
| WavLM6-KM | 0.45 | 16 | 50.0 | 1 × 512 | 1024 | 127 | 28 |
| SpeechTokenizer | 1.00 | 16 | 50.0 | 2 × 1024 | 1024 | 108 | 17 |
| SemantiCodec | 0.65 | 16 | 25.0 | 2 × 8192 | 1536 | 1033 | 1599 |
| Mimi | 0.69 | 24 | 12.5 | 5 × 2048 | 256 | 82 | 11 |
| WavTokenizer | 0.48 | 24 | 40.0 | 1 × 4096 | 512 | 85 | 3 |
| BigCodec | 1.04 | 16 | 80.0 | 1 × 8192 | 8 | 160 | 61 |
| Stable Codec | 0.70 | 16 | 25.0 | 2 × 15625 | 6 | 95 | 37 |
| FocalCodec@50 | 0.65 | 16 | 50.0 | 1 × 8192 | 13 | 142 | 9 |
| FocalCodec@25 | 0.33 | 16 | 25.0 | 1 × 8192 | 13 | 144 | 9 |
| FocalCodec@12.5 | 0.16 | 16 | 12.5 | 1 × 8192 | 13 | 145 | 8 |

🔼 This table compares the low-bitrate speech codecs evaluated in the paper, providing key characteristics for each codec to facilitate comparison. It includes the bitrate (kbps), sample rate (kHz), token rate (Hz), the number and size of the codebooks, the code dimensionality, the number of parameters (millions), and the number of multiply-accumulate operations (MACs, in billions). The codecs are categorized as acoustic, semantic, or hybrid, reflecting their design and training approach.

Table 1: Compared codecs.
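The bitrates in Table 1 follow directly from token rate, codebook count, and codebook size. A quick sanity check with a hypothetical helper (not from the paper):

```python
import math

def bitrate_kbps(token_rate_hz, num_codebooks, codebook_size):
    """Nominal bitrate = tokens/s x codebooks x bits per codebook entry."""
    return token_rate_hz * num_codebooks * math.log2(codebook_size) / 1000

# FocalCodec: one 8192-entry (13-bit) binary codebook.
print(bitrate_kbps(50, 1, 8192))               # 0.65 kbps for FocalCodec@50
# Mimi: five 2048-entry codebooks at 12.5 Hz.
print(round(bitrate_kbps(12.5, 5, 2048), 2))   # 0.69 kbps
```

The single-codebook designs (WavTokenizer, BigCodec, FocalCodec) reach low bitrates by shrinking the token rate rather than stacking residual codebooks.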

In-depth insights

Focal Modulation’s Role

Focal modulation plays a crucial role in FocalCodec by enabling efficient and scalable speech compression. Unlike traditional self-attention mechanisms, it introduces inductive biases, allowing the model to learn more effectively from speech data. The hierarchical structure of focal modulation, processing features at multiple granularities, is key to preserving both acoustic detail and semantic information. This is achieved through efficient context aggregation and modulation of local interactions, resulting in granular quantization that maintains high reconstruction quality even at ultra-low bitrates. Focal modulation’s capacity for handling long-range dependencies in linear time is a significant advantage, making FocalCodec well-suited for handling longer speech sequences effectively. The use of focal modulation, in combination with a single binary codebook, fundamentally differentiates FocalCodec from other hybrid codecs, significantly simplifying the architecture and improving its efficiency for downstream tasks.
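To make the mechanism concrete, here is a heavily simplified, dependency-light sketch of 1-D focal modulation: context is pooled over progressively larger windows, each level is gated, a global level is added, and the summed context modulates the input element-wise. Real implementations use learned depthwise convolutions and projections; the mean pooling and sigmoid gates here are stand-ins.

```python
import numpy as np

def focal_modulation(x, num_levels=3, base_kernel=3):
    """Simplified 1-D focal modulation over a (T, C) sequence.

    Context is aggregated at growing window sizes (plus one global
    level), gated per level, summed, and used to modulate the input."""
    T, _ = x.shape
    ctx_sum = np.zeros_like(x)
    for level in range(num_levels):
        k = base_kernel + 2 * level                  # growing receptive field
        pad = k // 2
        padded = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
        ctx = np.stack([padded[t:t + k].mean(axis=0) for t in range(T)])
        gate = 1.0 / (1.0 + np.exp(-ctx))            # sigmoid gate per level
        ctx_sum += gate * ctx
    ctx_sum += x.mean(axis=0, keepdims=True)         # global context level
    return x * ctx_sum                               # element-wise modulation

y = focal_modulation(np.random.default_rng(0).standard_normal((200, 8)))
print(y.shape)   # (200, 8): output keeps the input's shape
```

Note that the cost is O(T × k) per level, i.e. linear in sequence length, which is the efficiency advantage over quadratic self-attention mentioned above.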

Low-bitrate Speech

Low-bitrate speech coding is a crucial area of research focusing on compressing speech signals into smaller sizes for efficient transmission and storage. The challenge lies in achieving this compression without significant degradation of speech quality or loss of semantic information. FocalCodec, presented in this paper, addresses this challenge using a novel approach based on focal modulation networks and a single binary codebook. This method is shown to outperform existing state-of-the-art codecs in terms of speech quality and downstream task performance, even at extremely low bitrates (0.16 to 0.65 kbps). The use of a single codebook simplifies architecture, making the design more efficient and less complex for integration into downstream applications. The effectiveness of FocalCodec’s approach across various tasks and noisy conditions highlights its potential for broad application, making it a significant advancement in low-bitrate speech technology.
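For a sense of scale, uncompressed 16-bit mono PCM at 16 kHz runs at 256 kbps, so FocalCodec's lowest setting corresponds to roughly a 1600× reduction:

```python
# Raw 16-bit mono PCM at 16 kHz vs FocalCodec's lowest-bitrate setting.
raw_kbps = 16_000 * 16 / 1000      # 256 kbps uncompressed
focal_kbps = 0.16                  # FocalCodec@12.5 (from Table 1)
ratio = raw_kbps / focal_kbps
print(round(ratio))                # ~1600x compression
```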

Single Codebook Design

The single codebook design in FocalCodec represents a significant departure from conventional hybrid speech codecs, which often employ multiple codebooks to disentangle acoustic and semantic information. This simplification is a major contribution, as it reduces model complexity and improves efficiency for downstream tasks. By leveraging focal modulation, a single binary codebook effectively captures both acoustic detail and semantic content, achieving competitive performance at ultra-low bitrates. The success of this approach highlights the potential of inductive biases in neural audio codecs, showing that carefully designed architectures can achieve high-quality speech reconstruction and effective semantic representations without relying on complex multi-codebook designs or computationally expensive training procedures. The single codebook paradigm further facilitates easier integration with downstream models, as it avoids the need for handling multiple codebook outputs, making FocalCodec a more versatile and practical solution for various speech processing applications.
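A minimal sketch of the single-binary-codebook idea, under assumed details (13-dimensional latents, sign-based quantization to the 2^13 = 8192 hypercube vertices); the paper's actual quantizer may differ in normalization and gradient handling:

```python
import numpy as np

def binary_quantize(z):
    """Snap each latent vector to the nearest vertex of {-1, +1}^D.

    The vertex's bit pattern doubles as the token id, so a 13-dim
    latent yields one of 2^13 = 8192 codes with no learned codebook
    lookup. (Sketch only; the real quantizer also handles gradients.)"""
    codes = np.where(z >= 0, 1.0, -1.0)
    bits = (codes > 0).astype(np.int64)
    weights = 1 << np.arange(bits.shape[-1])[::-1]   # MSB-first bit weights
    return codes, bits @ weights

codes, ids = binary_quantize(np.random.default_rng(0).standard_normal((4, 13)))
print(ids)   # four integer token ids in [0, 8191]
```

Because the "codebook" is implicit in the sign function, there is nothing to collapse or rebalance during training, and downstream models consume a single token stream.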

Downstream Tasks

The ‘Downstream Tasks’ section of the research paper is crucial for evaluating the effectiveness of the proposed FocalCodec. It assesses the quality of the learned discrete speech representations by applying them to various tasks. The choice of tasks—automatic speech recognition (ASR), speaker identification (SI), and speech emotion recognition (SER)—is insightful, as they probe different aspects of speech understanding: semantic content (ASR), acoustic properties (SI), and higher-level emotional nuances (SER). The use of shallow downstream models is a methodological strength, minimizing the risk of confounding factors and emphasizing the inherent quality of FocalCodec’s representations. The results show that FocalCodec achieves competitive performance across all three tasks, demonstrating its ability to preserve both semantic and acoustic information, even at low bitrates. This underscores the success of its single-codebook design and the effectiveness of its focal modulation architecture. The section highlights a practical strength of the codec, demonstrating that these efficient representations are not only useful for resynthesis but can also effectively empower various downstream applications.

Future Work

Future work for FocalCodec should prioritize addressing its non-causal nature, exploring architectural modifications or training strategies to enable real-time applications. Expanding the dataset to encompass multilingual speech, diverse acoustic conditions (noisy environments), and higher sampling rates (24 kHz) is crucial for enhancing robustness and generalization. Investigating the application of FocalCodec to other audio modalities beyond speech (music, environmental sounds) would broaden its utility. Finally, a deeper exploration into the trade-offs between compression ratio and downstream task performance is warranted, possibly involving novel quantization techniques or loss functions. These improvements would significantly enhance FocalCodec’s versatility and practical applicability.

More visual insights

More on tables
| Codec | UTMOS ↑ | dWER ↓ | Sim ↑ | Code Usage ↑ | Norm Entropy ↑ | RTF ↑ |
|---|---|---|---|---|---|---|
| **LibriSpeech test-clean** | | | | | | |
| Reference | 4.09 | 0.00 | 100.0 | – | – | – |
| EnCodec | 1.58 | 8.08 | 93.8 | 93.4 | 82.1 | 109 |
| DAC | 1.29 | 20.04 | 89.2 | 100.0 | 91.7 | 89 |
| WavLM6-KM | 3.75 | 6.20 | 90.0 | 26.4 | 95.4 | 85 |
| SpeechTokenizer | 2.28 | 5.14 | 91.6 | 95.9 | 97.0 | 63 |
| SemantiCodec | 2.91 | 8.97 | 96.0 | 75.9 | 94.4 | 0.62 |
| Mimi | 3.29 | 5.73 | 96.0 | 95.6 | 91.8 | 137 |
| WavTokenizer | 3.78 | 11.55 | 95.4 | 100.0 | 96.7 | 181 |
| BigCodec | 4.11 | 2.55 | 98.5 | 100.0 | 98.6 | 22 |
| Stable Codec | 4.32 | 4.97 | 94.7 | 98.5 | 94.7 | 103 |
| FocalCodec@50 | 4.05 | 2.18 | 97.4 | 100.0 | 98.9 | 185 |
| FocalCodec@25 | 4.14 | 3.30 | 96.3 | 99.8 | 98.4 | 195 |
| FocalCodec@12.5 | 4.22 | 7.94 | 93.9 | 98.2 | 97.4 | 208 |
| **Multilingual LibriSpeech 700** | | | | | | |
| Reference | 2.84 | 0.00 | 100.0 | – | – | – |
| EnCodec | 1.33 | 29.60 | 95.5 | 93.4 | 79.2 | 140 |
| DAC | 1.24 | 56.08 | 89.1 | 100.0 | 90.0 | 97 |
| WavLM6-KM | 2.97 | 44.54 | 89.5 | 28.1 | 0.91 | 125 |
| SpeechTokenizer | 1.55 | 56.32 | 92.0 | 96.1 | 94.0 | 74 |
| SemantiCodec | 1.87 | 36.21 | 97.7 | 76.4 | 94.7 | 0.74 |
| Mimi | 2.08 | 30.96 | 96.7 | 95.9 | 89.0 | 239 |
| WavTokenizer | 2.64 | 49.73 | 97.0 | 97.6 | 95.6 | 290 |
| BigCodec | 2.86 | 15.24 | 99.1 | 100.0 | 97.9 | 24 |
| Stable Codec | 3.47 | 56.99 | 95.9 | 92.9 | 93.8 | 144 |
| FocalCodec@50 | 2.96 | 12.57 | 98.3 | 100.0 | 98.1 | 269 |
| FocalCodec@25 | 3.16 | 19.78 | 97.3 | 99.2 | 97.4 | 292 |
| FocalCodec@12.5 | 3.37 | 54.15 | 95.2 | 96.4 | 96.9 | 296 |

🔼 This table presents the results of clean speech resynthesis experiments. It compares various speech coding methods (FocalCodec at different bitrates, EnCodec, DAC, WavLM6-KM, SpeechTokenizer, SemantiCodec, Mimi, WavTokenizer, BigCodec, and Stable Codec) across multiple metrics. The metrics evaluated include the utterance-level MOS (UTMOS) score, which measures perceived naturalness; the differential word error rate (dWER), indicating speech intelligibility; the cosine similarity (Sim), reflecting speaker fidelity; and the real-time factor (RTF), representing inference speed. The table is organized to show the performance of each codec, allowing a comparison of their strengths and weaknesses in reconstructing clean speech.

Table 2: Clean speech resynthesis.
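dWER, used throughout these tables, is the word error rate computed between an ASR transcript of the resynthesized audio and a transcript of the original audio (rather than ground-truth text), so it isolates intelligibility loss introduced by the codec. A self-contained WER helper for reference:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate (%) via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))           # DP row: edits vs empty reference
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            cur = min(d[j] + 1,           # delete reference word
                      d[j - 1] + 1,       # insert hypothesis word
                      prev + (rw != hw))  # substitute (free if words match)
            prev, d[j] = d[j], cur
    return 100.0 * d[-1] / len(r)

# dWER would then be wer(transcribe(original), transcribe(resynthesized)),
# where `transcribe` is a fixed ASR model (hypothetical helper name).
print(wer("the cat sat here", "the cat sat here"))   # 0.0
print(wer("the cat sat here", "the hat sat here"))   # 25.0
```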
| Codec | DNSMOS ↑ | dWER ↓ | Sim ↑ | Code Usage ↑ | Norm Entropy ↑ | RTF ↑ |
|---|---|---|---|---|---|---|
| **VoiceBank test** | | | | | | |
| Reference | 3.56 | 0.00 | 100.0 | – | – | – |
| EnCodec | 2.76 | 28.16 | 87.7 | 77.5 | 78.1 | 44 |
| DAC | 2.72 | 63.90 | 79.8 | 98.7 | 88.4 | 48 |
| WavLM6-KM | 3.06 | 20.67 | 82.9 | 24.8 | 92.3 | 44 |
| SpeechTokenizer | 2.74 | 34.51 | 82.2 | 88.1 | 88.4 | 42 |
| SemantiCodec | 3.13 | 31.46 | 90.6 | 52.4 | 92.6 | 0.28 |
| Mimi | 3.01 | 28.00 | 87.8 | 78.6 | 85.5 | 47 |
| WavTokenizer | 3.09 | 42.12 | 89.8 | 94.8 | 94.0 | 63 |
| BigCodec | 3.19 | 20.67 | 92.3 | 99.8 | 96.8 | 17 |
| Stable Codec | 3.33 | 20.32 | 88.8 | 75.7 | 95.4 | 39 |
| FocalCodec@50 | 3.16 | 8.08 | 91.3 | 98.0 | 96.2 | 80 |
| FocalCodec@25 | 3.17 | 11.75 | 90.1 | 89.6 | 96.0 | 81 |
| FocalCodec@12.5 | 3.22 | 27.97 | 84.7 | 77.3 | 95.5 | 79 |
| **Libri1Mix test** | | | | | | |
| Reference | 3.73 | 0.00 | 100.0 | – | – | – |
| EnCodec | 2.40 | 55.17 | 86.3 | 84.4 | 78.7 | 97 |
| DAC | 2.40 | 90.92 | 76.6 | 99.1 | 88.8 | 91 |
| WavLM6-KM | 2.87 | 36.60 | 85.9 | 26.8 | 95.5 | 65 |
| SpeechTokenizer | 2.58 | 57.26 | 82.8 | 93.5 | 96.5 | 63 |
| SemantiCodec | 2.67 | 51.18 | 89.9 | 64.7 | 90.8 | 0.91 |
| Mimi | 2.65 | 49.14 | 89.4 | 90.8 | 90.1 | 104 |
| WavTokenizer | 2.53 | 70.10 | 86.3 | 96.4 | 95.4 | 165 |
| BigCodec | 2.75 | 53.26 | 88.3 | 100.0 | 98.2 | 19 |
| Stable Codec | 2.91 | 43.52 | 90.0 | 95.8 | 93.4 | 68 |
| FocalCodec@50 | 2.93 | 27.89 | 91.6 | 100.0 | 98.5 | 155 |
| FocalCodec@25 | 2.91 | 34.27 | 90.7 | 99.6 | 97.9 | 161 |
| FocalCodec@12.5 | 2.92 | 42.59 | 88.9 | 97.2 | 97.2 | 164 |

🔼 This table presents the results of speech resynthesis experiments conducted on noisy speech datasets. It compares the performance of various speech coding models in terms of their ability to reconstruct high-quality speech from noisy inputs. The metrics used to evaluate the models include DNSMOS (for naturalness), dWER (for intelligibility), and speaker similarity (Sim). Additionally, code usage, entropy, and real-time factor (RTF) are provided to showcase the efficiency and speed of each model. Two noisy speech datasets were used for evaluation: VoiceBank and Librimix.

Table 3: Noisy speech resynthesis.
| Codec | UTMOS ↑ | dWER ↓ | Sim ↑ | RTF ↑ |
|---|---|---|---|---|
| **VCTK** | | | | |
| Reference | 4.09 | 0.00 | 100.0 | – |
| EnCodec | 1.24 | 86.52 | 72.2 | 57 |
| DAC | 1.25 | 104.00 | 67.2 | 60 |
| WavLM6-KM | 2.90 | 26.68 | 92.4 | 57 |
| SpeechTokenizer | 1.49 | 20.32 | 81.2 | 33 |
| SemantiCodec | 2.02 | 106.00 | 72.8 | 0.60 |
| Mimi | 2.40 | 110.00 | 89.7 | 71 |
| WavTokenizer | 3.13 | 43.15 | 73.4 | 89 |
| BigCodec | 1.31 | 99.96 | 68.9 | 13 |
| Stable Codec | 3.76 | 27.63 | 71.1 | 65 |
| FocalCodec@50 | 3.38 | 21.27 | 92.2 | 116 |
| FocalCodec@25 | 3.40 | 23.59 | 92.6 | 118 |
| FocalCodec@12.5 | 3.43 | 29.93 | 92.6 | 117 |

🔼 This table presents the results of a one-shot voice conversion experiment, where the goal is to convert speech from a source speaker to a target speaker using only a short reference audio sample from the target speaker. The table compares FocalCodec with several other state-of-the-art speech codecs across various metrics, including UTMOS (a measure of speech naturalness), dWER (a measure of intelligibility), Sim (a measure of speaker similarity), and RTF (real-time factor). The results are shown separately for both clean and multilingual speech to evaluate generalization capabilities. Higher values for UTMOS and Sim are better, while lower values for dWER are better.

Table 4: One-shot voice conversion.
| Codec | ASR WER ↓ | SI ER ↓ | SER ER ↓ | SE DNSMOS ↑ | SE dWER ↓ | SE Sim ↑ | SS DNSMOS ↑ | SS dWER ↓ | SS Sim ↑ | TTS UTMOS ↑ | TTS dWER ↓ | TTS Sim ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Reference | – | – | – | 3.56 | 0.00 | 100.0 | 3.77 | 0.00 | 100.0 | 4.09 | 0.00 | 100.0 |
| EnCodec | 27.89 | 3.00 | 47.00 | 3.11 | 37.10 | 85.9 | 3.11 | 78.51 | 87.3 | 1.69 | 74.07 | 79.1 |
| DAC | 35.89 | 3.27 | 45.90 | 3.03 | 67.65 | 81.7 | 2.76 | 106.00 | 83.3 | 1.36 | 61.11 | 84.1 |
| WavLM6-KM | 19.04 | 22.30 | 42.90 | 3.52 | 22.85 | 83.6 | 3.49 | 76.91 | 85.0 | 3.71 | 48.51 | 88.2 |
| SpeechTokenizer | 14.97 | 2.73 | 41.50 | 3.21 | 29.82 | 85.9 | 3.13 | 83.99 | 87.3 | 2.63 | 47.81 | 88.3 |
| SemantiCodec | 41.42 | 15.90 | 51.60 | 3.59 | 102.00 | 83.3 | 3.59 | 123.00 | 84.4 | 2.72 | 59.85 | 90.8 |
| Mimi | 22.98 | 5.43 | 44.70 | 3.30 | 53.98 | 84.6 | 3.41 | 93.23 | 88.1 | 3.05 | 39.50 | 93.3 |
| WavTokenizer | 35.62 | 2.44 | 49.80 | 3.41 | 51.75 | 88.6 | 3.54 | 105.00 | 86.4 | 3.65 | 59.22 | 89.6 |
| BigCodec | 26.41 | 2.34 | 47.50 | 3.52 | 26.68 | 93.2 | 3.54 | 89.24 | 89.4 | 3.24 | 63.83 | 87.8 |
| Stable Codec | 16.85 | 16.50 | 46.54 | 3.55 | 35.57 | 82.8 | 3.61 | 103.00 | 78.2 | 2.86 | 56.97 | 84.3 |
| FocalCodec@50 | 17.63 | 4.48 | 45.60 | 3.47 | 10.93 | 91.4 | 3.71 | 73.87 | 89.0 | 4.05 | 39.58 | 92.9 |
| FocalCodec@25 | 21.12 | 6.07 | 46.80 | 3.49 | 14.74 | 90.0 | 3.69 | 99.96 | 85.4 | 4.12 | 30.28 | 91.4 |
| FocalCodec@12.5 | 33.24 | 11.69 | 46.30 | 3.58 | 36.98 | 86.9 | 3.57 | 116.00 | 80.8 | 4.16 | 29.91 | 91.5 |

🔼 This table presents the results of evaluating different speech codecs on a variety of downstream tasks, including speech recognition (ASR), speaker identification (SI), speech emotion recognition (SER), speech enhancement (SE), speech separation (SS), and text-to-speech (TTS). For each task, the table shows the performance of each codec using relevant metrics such as Word Error Rate (WER), Error Rate (ER), DNSMOS (for quality), and others depending on the task. It allows for a comparison of how well different codecs preserve semantic and acoustic information for various applications.

Table 5: Evaluation on downstream tasks.
| Compression/Decompression Block | Down/Upscale Activation | Quantizer | Decoder | UTMOS ↑ | dWER ↓ | Sim ↑ |
|---|---|---|---|---|---|---|
| Focal modulation | Snake | BSQ | Vocos | 4.14 | 2.54 | 95.3 |
| Focal modulation | Snake | BSQ | HiFi-GAN | 3.73 | 2.54 | 95.7 |
| Focal modulation | Snake | LFQ | HiFi-GAN | 3.74 | 2.75 | 95.4 |
| Focal modulation | Leaky ReLU | LFQ | HiFi-GAN | 3.72 | 2.85 | 95.2 |
| Conformer | Snake | LFQ | HiFi-GAN | 3.74 | 3.58 | 94.3 |
| AMP | Snake | LFQ | HiFi-GAN | 3.70 | 4.52 | 94.3 |
| Linear | Snake | LFQ | HiFi-GAN | 2.55 | 9.37 | 82.5 |

🔼 This table presents the results of ablation studies performed on a smaller variant of the FocalCodec model. It investigates the impact of different components and design choices on the model’s performance in speech resynthesis. Specifically, it examines the effects of altering the quantizer, decoder, compression/decompression method, activation function, and downscaling/upscaling block.

Table 6: Ablation studies.
| Codec | Causal | Training Datasets | Hours | Multilingual | Audio Domain | Checkpoint |
|---|---|---|---|---|---|---|
| EnCodec (Défossez et al., 2023) | Optional | DNS, CommonVoice, AudioSet, FSD50K, Jamendo | 17,000+ | Yes | General | encodec_24khz |
| DAC (Kumar et al., 2023) | No | DAPS, DNS, CommonVoice, VCTK, MUSDB, Jamendo | 10,000+ | Yes | General | weights_16khz.pth |
| WavLM6-KM (Wang et al., 2024) | No | Subset of LibriSpeech (in addition to Libri-Light, GigaSpeech, and VoxPopuli English for WavLM pretraining) | 460 (+ 94,000) | No | Speech | discrete-wavlm-codec |
| SpeechTokenizer (Zhang et al., 2024) | No | LibriSpeech | 960 | No | Speech | speechtokenizer_hubert_avg |
| SemantiCodec (Liu et al., 2024) | No | GigaSpeech, subset of OpenSLR, Million Song Dataset, MedleyDB, MUSDB18, AudioSet, WavCaps, VGGSound | 20,000+ | Yes | General | semanticodec_tokenrate_50 |
| Mimi (Défossez et al., 2024) | Yes | Predominantly English speech (in addition to Libri-Light, GigaSpeech, and VoxPopuli English for WavLM pretraining) | 7,000,000 (+ 94,000) | Likely | Speech | mimi |
| WavTokenizer (Ji et al., 2024) | No | LibriTTS, VCTK, subset of CommonVoice, subset of AudioSet, Jamendo, MUSDB | 8,000 | Yes | General | WavTokenizer-large-unify-40token |
| BigCodec (Xin et al., 2024) | No | LibriSpeech | 960 | No | Speech | bigcodec.pt |
| Stable Codec (Parker et al., 2024) | Optional | Libri-Light, Multilingual LibriSpeech English | 105,000 | No | Speech | stable-codec-speech-16k |

🔼 This table compares FocalCodec to several state-of-the-art low-bitrate speech codecs. It lists each codec’s name, whether it is causal (meaning the output can be generated in real-time without needing future input), the datasets used for training, the total training hours, whether the codec supports multiple languages, the type of audio data it was trained on, and the location of the model checkpoints. This information helps to understand the different approaches and resources used to train these baseline models.

Table 7: Baseline codecs.
| Codec | Bitrate (kbps) | Sample Rate (kHz) | Token Rate (Hz) | Codebooks | Code Size | Params (M) | MACs (G) | UTMOS ↑ | dWER ↓ | Sim ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Reference | – | – | – | – | – | – | – | 4.09 | 0.00 | 100.0 |
| TS3-Codec (X2) | 0.85 | 16 | 50.0 | 1 × 131072 | 16 | 204 | 8 | 3.84 | 4.51 | 97.1 |
| FocalCodec@50 | 0.65 | 16 | 50.0 | 1 × 8192 | 13 | 142 | 9 | 4.05 | 2.18 | 97.4 |
| FocalCodec@25 | 0.33 | 16 | 25.0 | 1 × 8192 | 13 | 144 | 9 | 4.14 | 3.30 | 96.3 |
| FocalCodec@12.5 | 0.16 | 16 | 12.5 | 1 × 8192 | 13 | 145 | 8 | 4.22 | 7.94 | 93.9 |

🔼 This table presents a comparison of the speech resynthesis performance on the LibriSpeech test-clean dataset for various codecs, including FocalCodec and several state-of-the-art baselines. Metrics shown include bitrate, sample rate, token rate, codebook size, number of parameters, multiply-accumulate operations per second (MACs), and objective quality measures such as UTMOS, dWER, and speaker similarity (Sim). This allows for a detailed comparison of both the efficiency and quality of the different codecs at achieving low-bitrate speech reconstruction.

Table 8: Clean speech resynthesis on LibriSpeech test-clean.
| Codec | Chunk Size | UTMOS ↑ | dWER ↓ | Sim ↑ |
|---|---|---|---|---|
| **LibriSpeech test-clean** | | | | |
| FocalCodec@50 | Inf | 4.05 | 2.18 | 97.4 |
| FocalCodec@25 | Inf | 4.14 | 3.30 | 96.3 |
| FocalCodec@12.5 | Inf | 4.22 | 7.94 | 93.9 |
| FocalCodec@50 | 2000 (125 ms) | 2.17 | 6.06 | 95.9 |
| FocalCodec@50 | 4000 (250 ms) | 2.71 | 4.62 | 96.6 |
| FocalCodec@50 | 8000 (500 ms) | 3.16 | 4.55 | 96.9 |
| FocalCodec@25 | 8000 (500 ms) | 2.95 | 12.17 | 95.6 |
| FocalCodec@12.5 | 8000 (500 ms) | 2.84 | 47.43 | 91.8 |

🔼 This table compares the performance of FocalCodec at different token rates (50 Hz, 25 Hz, and 12.5 Hz) under offline and chunk-wise streaming inference conditions. It shows the effect of varying chunk sizes (in samples, with the duration in milliseconds) on the objective metrics UTMOS (quality), dWER (intelligibility), and Sim (speaker similarity) for the LibriSpeech test-clean dataset. This helps determine the feasibility of streaming for each configuration.

Table 9: Offline vs chunk-wise streaming inference.
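Chunk-wise streaming, as evaluated in Table 9, can be emulated by slicing the waveform into fixed-size chunks and coding each one independently, which is why quality degrades as chunks shrink: the non-causal model loses all cross-chunk context. A sketch with an identity stand-in for the codec:

```python
import numpy as np

def chunked_infer(wave, chunk_samples, codec_fn):
    """Code a waveform in fixed-size chunks with no cross-chunk state.

    Emulates the chunk-wise streaming setup: a non-causal codec sees
    only one chunk at a time, so shorter chunks mean less context."""
    outs = [codec_fn(wave[s:s + chunk_samples])
            for s in range(0, len(wave), chunk_samples)]
    return np.concatenate(outs)

# 500 ms chunks at 16 kHz = 8000 samples; identity stand-in for the codec.
wave = np.arange(32_000, dtype=float)
out = chunked_infer(wave, 8000, lambda chunk: chunk)
print(out.shape)   # (32000,)
```

A production streaming variant would additionally carry overlap or recurrent state between chunks, which is precisely the causal redesign the Future Work section calls for.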
| Codec | Input Features | DNSMOS ↑ | dWER ↓ | Sim ↑ |
|---|---|---|---|---|
| **Libri2Mix** | | | | |
| Reference | – | 3.77 | 0.00 | 100.0 |
| WavLM6-KM | Discrete | 3.49 | 76.91 | 85.0 |
| WavLM6-KM | Continuous | 3.68 | 23.09 | 89.4 |
| FocalCodec@50 | Discrete | 3.71 | 73.87 | 89.0 |
| FocalCodec@50 | Continuous | 3.76 | 17.35 | 93.8 |

🔼 This table presents a comparison of speech separation performance using discrete versus continuous input features for two models: WavLM-KM6 and FocalCodec@50. It highlights the impact of using raw continuous speech representations as input to the downstream task instead of the quantized discrete representations generated by the codec. The metrics reported are DNSMOS (perceptual objective speech quality), dWER (differential word error rate indicating intelligibility), and Sim (speaker similarity). This comparison underscores the effect of the codec’s quantization process on the ability of the downstream model to effectively separate speech sources.

Table 10: Discrete vs continuous input features for speech separation.
