MSCACodec: A Low-rate Neural Speech Codec With Multi-scale Residual Channel Attention

Wideband speech coding provides high-fidelity speech transmission, but its high bandwidth requirement limits its use in resource-constrained environments, so narrowband speech coding remains important. End-to-end neural speech coding has recently made significant progress and has demonstrated better compression performance than traditional methods. However, existing methods are limited in reconstructing fine detail, especially at low bitrates. To address this, we introduce MSCACodec, a narrowband neural speech codec that achieves advanced performance at low bitrates. MSCACodec adopts a multi-scale residual and channel attention feature fusion method that selectively focuses on multi-scale information to enhance the feature representation, addressing the inconsistency in hierarchical information that multi-scale feature fusion introduces. We also propose a Temporal Convolutional Gated Recurrent Unit (TCGRU) module, which combines temporal convolutional networks with gated recurrent units and uses global context and gating mechanisms to improve reconstruction quality. Experimental results show that, in both subjective and objective evaluations, MSCACodec reconstructs speech at higher quality than Encodec and HiFiCodec at bitrates of 1.2 kbps and 2.4 kbps, and even outperforms LyraV2 and Opus at 6 kbps.
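To make the idea of multi-scale residual fusion with channel attention more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the layers, so the sketch assumes depthwise convolutions with different kernel sizes as the "scales" and swaps in a squeeze-and-excitation style gate as the channel attention. All names (`conv1d_same`, `channel_attention`, `multi_scale_fusion`, `w1`, `w2`) and all shapes are hypothetical, and the weights are random rather than trained.

```python
import numpy as np


def conv1d_same(x, kernel):
    """Depthwise 1-D convolution with zero 'same' padding.

    x: (channels, time) float array; kernel: (channels, k) with odd k.
    """
    channels, time = x.shape
    k = kernel.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros_like(x)
    for i in range(k):
        # Each tap scales a shifted copy of the padded input.
        out += kernel[:, i:i + 1] * xp[:, i:i + time]
    return out


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(x, w1, w2):
    """Squeeze-and-excitation style gate (an assumed stand-in).

    Returns one weight in (0, 1) per channel of x.
    """
    squeeze = x.mean(axis=1)                # average over time: (C,)
    hidden = np.maximum(w1 @ squeeze, 0.0)  # bottleneck + ReLU
    return sigmoid(w2 @ hidden)             # per-channel gate: (C,)


def multi_scale_fusion(x, kernels, w1, w2):
    """Fuse several scales with a channel gate, then add the residual."""
    branches = [conv1d_same(x, k) for k in kernels]   # each (C, T)
    stacked = np.concatenate(branches, axis=0)        # (S*C, T)
    gate = channel_attention(stacked, w1, w2)         # (S*C,)
    weighted = stacked * gate[:, None]
    # Sum the gated branches back to the input channel count.
    fused = weighted.reshape(len(kernels), *x.shape).sum(axis=0)
    return x + fused, gate


# Hypothetical sizes: 4 channels, 16 frames, three scales (k = 3, 5, 7).
rng = np.random.default_rng(0)
C, T, H = 4, 16, 3
x = rng.standard_normal((C, T))
kernels = [0.1 * rng.standard_normal((C, k)) for k in (3, 5, 7)]
w1 = rng.standard_normal((H, len(kernels) * C))
w2 = rng.standard_normal((len(kernels) * C, H))
y, gate = multi_scale_fusion(x, kernels, w1, w2)
```

The gate gives every channel of every scale its own weight, so the fusion can emphasize some scales for some channels and suppress others, while the residual connection keeps the input path intact. How the actual model arranges these pieces is described only in the paper.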

The architecture of MSCACodec.

Experimental results

Part I: English samples

origin_samples1

Codec2-1.2kbps

LyraV2-3.2kbps

Encodec-1.2kbps

HiFiCodec-1.2kbps

MSCACodec-1.2kbps

Codec2-2.4kbps

LyraV2-6kbps

Encodec-2.4kbps

HiFiCodec-2.4kbps

MSCACodec-2.4kbps

origin_samples2

Codec2-1.2kbps

LyraV2-3.2kbps

Encodec-1.2kbps

HiFiCodec-1.2kbps

MSCACodec-1.2kbps

Codec2-2.4kbps

LyraV2-6kbps

Encodec-2.4kbps

HiFiCodec-2.4kbps

MSCACodec-2.4kbps

origin_samples3

Codec2-1.2kbps

LyraV2-3.2kbps

Encodec-1.2kbps

HiFiCodec-1.2kbps

MSCACodec-1.2kbps

Codec2-2.4kbps

LyraV2-6kbps

Encodec-2.4kbps

HiFiCodec-2.4kbps

MSCACodec-2.4kbps

Part II: Chinese samples

origin_samples1

Codec2-1.2kbps

LyraV2-3.2kbps

Encodec-1.2kbps

HiFiCodec-1.2kbps

MSCACodec-1.2kbps

Codec2-2.4kbps

LyraV2-6kbps

Encodec-2.4kbps

HiFiCodec-2.4kbps

MSCACodec-2.4kbps

origin_samples2

Codec2-1.2kbps

LyraV2-3.2kbps

Encodec-1.2kbps

HiFiCodec-1.2kbps

MSCACodec-1.2kbps

Codec2-2.4kbps

LyraV2-6kbps

Encodec-2.4kbps

HiFiCodec-2.4kbps

MSCACodec-2.4kbps

origin_samples3

Codec2-1.2kbps

LyraV2-3.2kbps

Encodec-1.2kbps

HiFiCodec-1.2kbps

MSCACodec-1.2kbps

Codec2-2.4kbps

LyraV2-6kbps

Encodec-2.4kbps

HiFiCodec-2.4kbps

MSCACodec-2.4kbps