MSCACodec: A Low-rate Neural Speech Codec With Multi-scale Residual Channel Attention

In modern communication technology, wideband speech coding provides high-fidelity speech transmission, but its high bandwidth requirement limits its use in resource-constrained environments. Narrowband speech coding therefore remains important. Recently, end-to-end neural speech coding has made significant progress and demonstrated better compression performance than traditional methods. However, existing methods are limited in reconstructing fine details, especially at low bitrates. To address this, we introduce MSCACodec, a narrowband neural speech codec that achieves advanced performance at low bitrates. MSCACodec adopts a multi-scale residual and channel attention feature fusion method that selectively focuses on multi-scale information to enhance the feature representation, addressing the inconsistency in hierarchical information caused by multi-scale feature fusion. In addition, we propose a Temporal Convolutional Gated Recurrent Unit (TCGRU) module, which combines temporal convolutional networks and gated recurrent units to improve reconstruction quality using global context and gating mechanisms. Experimental results show that, in both subjective and objective evaluations, MSCACodec produces higher-quality reconstructed speech than Encodec and HiFiCodec at bitrates of 1.2 kbps and 2.4 kbps, and even outperforms LyraV2 and Opus at 6 kbps.
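To make the idea of channel attention over fused multi-scale features more concrete, the following is a minimal sketch in plain Python. It is not the MSCACodec implementation: the squeeze-and-excitation style gate, the element-wise sum used to fuse the branches, the residual form `x + gate * x`, and the names `channel_attention_fuse`, `w1`, and `w2` are all assumptions chosen for illustration.

```python
import math


def channel_attention_fuse(branches, w1, w2):
    """Fuse multi-scale feature maps and reweight channels with a learned gate.

    branches: list of feature maps, each shaped [channels][frames].
    w1: hypothetical gate weights shaped [hidden][channels].
    w2: hypothetical gate weights shaped [channels][hidden].
    Returns (output feature map, per-channel gates).
    """
    channels = len(branches[0])
    frames = len(branches[0][0])

    # Assumed fusion step: element-wise sum of the multi-scale branches.
    fused = [
        [sum(branch[c][t] for branch in branches) for t in range(frames)]
        for c in range(channels)
    ]

    # Squeeze: one summary value per channel (average over time).
    squeezed = [sum(channel) / frames for channel in fused]

    # Excite: small two-layer gate, ReLU followed by sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]

    # Assumed residual form: keep the fused features and add the gated copy.
    output = [[x + g * x for x in channel] for channel, g in zip(fused, gates)]
    return output, gates


# Two branches, two channels, three frames; zero second-layer weights give gates of 0.5.
branches = [
    [[1, 1, 1], [2, 2, 2]],
    [[1, 1, 1], [0, 0, 0]],
]
output, gates = channel_attention_fuse(branches, w1=[[1.0, 0.0]], w2=[[0.0], [0.0]])
```

In this sketch the gate lets each channel of the fused representation be emphasized or suppressed independently, which is the general purpose of channel attention; how MSCACodec actually combines its scales and residual paths is not specified on this page.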

The architecture of MSCACodec.


Experimental results

Part I: English samples

  • origin_samples1

    Codec2-1.2kbps

    LyraV2-3.2kbps

    Encodec-1.2kbps

    HiFiCodec-1.2kbps

    MSCACodec-1.2kbps

    Codec2-2.4kbps

    LyraV2-6kbps

    Encodec-2.4kbps

    HiFiCodec-2.4kbps

    MSCACodec-2.4kbps

  • origin_samples2

    Codec2-1.2kbps

    LyraV2-3.2kbps

    Encodec-1.2kbps

    HiFiCodec-1.2kbps

    MSCACodec-1.2kbps

    Codec2-2.4kbps

    LyraV2-6kbps

    Encodec-2.4kbps

    HiFiCodec-2.4kbps

    MSCACodec-2.4kbps

  • origin_samples3

    Codec2-1.2kbps

    LyraV2-3.2kbps

    Encodec-1.2kbps

    HiFiCodec-1.2kbps

    MSCACodec-1.2kbps

    Codec2-2.4kbps

    LyraV2-6kbps

    Encodec-2.4kbps

    HiFiCodec-2.4kbps

    MSCACodec-2.4kbps


Part II: Chinese samples

  • origin_samples1

    Codec2-1.2kbps

    LyraV2-3.2kbps

    Encodec-1.2kbps

    HiFiCodec-1.2kbps

    MSCACodec-1.2kbps

    Codec2-2.4kbps

    LyraV2-6kbps

    Encodec-2.4kbps

    HiFiCodec-2.4kbps

    MSCACodec-2.4kbps

  • origin_samples2

    Codec2-1.2kbps

    LyraV2-3.2kbps

    Encodec-1.2kbps

    HiFiCodec-1.2kbps

    MSCACodec-1.2kbps

    Codec2-2.4kbps

    LyraV2-6kbps

    Encodec-2.4kbps

    HiFiCodec-2.4kbps

    MSCACodec-2.4kbps

  • origin_samples3

    Codec2-1.2kbps

    LyraV2-3.2kbps

    Encodec-1.2kbps

    HiFiCodec-1.2kbps

    MSCACodec-1.2kbps

    Codec2-2.4kbps

    LyraV2-6kbps

    Encodec-2.4kbps

    HiFiCodec-2.4kbps

    MSCACodec-2.4kbps