“Toeplitz Neural Network for Sequence Modeling”

Paper: https://arxiv.org/abs/2305.04749

Code: https://github.com/OpenNLPLab/Tnn

Abstract:

Sequence modeling has important applications in natural language processing and computer vision. Recently, transformer-based models have shown strong performance on various sequence modeling tasks; they rely on attention to capture pairwise token relations and on position embeddings to inject positional information. While performing well, transformer models scale poorly to long input sequences, mainly due to the quadratic space-time complexity of attention. To overcome this inefficiency, we propose to model sequences with a relative-position-encoded Toeplitz matrix and use a Toeplitz matrix-vector product trick to reduce the space-time complexity of sequence modeling to log-linear. A lightweight sub-network called the relative position encoder is proposed to generate relative position coefficients with a fixed parameter budget, enabling the proposed Toeplitz neural network to handle varying sequence lengths. In addition, despite being trained on 512-token sequences, our model can extrapolate to input sequences of up to 14K tokens at inference with consistent performance. Extensive experiments on autoregressive and bidirectional language modeling, image modeling, and the challenging Long-Range Arena benchmark show that our method outperforms its competitors on most downstream tasks while being significantly faster. The code is available at https://github.com/OpenNLPLab/Tnn.
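
The log-linear complexity comes from the classical FFT trick for Toeplitz matrix-vector products that the abstract alludes to. Below is a minimal NumPy sketch of that trick (my own illustration, not the authors' implementation; the names `toeplitz_matvec`, `col`, and `row` are mine): the n×n Toeplitz matrix is embedded in a 2n-point circulant, so the product becomes a circular convolution computed in O(n log n). In the TNN itself, the `col`/`row` coefficients would correspond to the relative position coefficients produced by the relative position encoder.

```python
# Sketch of an O(n log n) Toeplitz matrix-vector product via circulant
# embedding and the FFT. Not the authors' code; names are illustrative.
import numpy as np

def toeplitz_matvec(col, row, x):
    """Multiply the Toeplitz matrix T (first column `col`, first row `row`)
    by vector `x` in O(n log n)."""
    n = len(x)
    assert col[0] == row[0], "col[0] and row[0] must both equal T[0, 0]"
    # First column of the 2n-point circulant that embeds T.
    c = np.concatenate([col, [0.0], row[:0:-1]])        # length 2n
    x_pad = np.concatenate([x, np.zeros(n)])            # zero-pad x to length 2n
    # Circular convolution via FFT = multiplication by the circulant matrix.
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x_pad))
    return y[:n].real                                   # top n entries give T @ x

if __name__ == "__main__":
    # Quick check against the dense product.
    rng = np.random.default_rng(0)
    n = 8
    col, row, x = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
    row[0] = col[0]
    T = np.array([[col[i - j] if i >= j else row[j - i] for j in range(n)]
                  for i in range(n)])
    assert np.allclose(T @ x, toeplitz_matvec(col, row, x))
```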

“Accelerating Toeplitz Neural Network with Constant-time Inference Complexity”

Paper: https://arxiv.org/abs/2311.08756

Code: https://github.com/OpenNLPLab/ETSC-Exact-Toeplitz-to-SSM-Conversion

Abstract:

Toeplitz Neural Networks (TNNs) have exhibited outstanding performance on various sequence modeling tasks. They outperform commonly used Transformer-based models while benefiting from log-linear space-time complexity. State Space Models (SSMs), on the other hand, achieve lower performance than TNNs in language modeling but offer the advantage of constant inference complexity. In this paper, we aim to combine the strengths of TNNs and SSMs by converting TNNs to SSMs during inference, thereby enabling TNNs to achieve the same constant inference complexity as SSMs. To accomplish this, we formulate the conversion process as an optimization problem and provide a closed-form solution. We demonstrate how to transform the target equation into a Vandermonde linear system, which can be solved efficiently using the Discrete Fourier Transform (DFT). Notably, our method requires no training and maintains numerical stability. It can also be applied to any LongConv-based model. To assess its effectiveness, we conduct extensive experiments on language modeling tasks across various settings. Additionally, we compare our method to gradient-descent-based alternatives, highlighting the superior numerical stability of our approach. The source code is available at https://github.com/OpenNLPLab/ETSC-Exact-Toeplitz-to-SSM-Conversion.
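
For intuition about the DFT step, here is a minimal sketch (my own illustration, under an assumption the abstract only implies: that the Vandermonde nodes can be taken as roots of unity, which is what makes the DFT applicable). With nodes z_k = exp(-2πi k/N), the Vandermonde matrix is exactly the unnormalized DFT matrix, so the system V a = t is solved by a single inverse FFT, avoiding the notorious ill-conditioning of a general Vandermonde solve.

```python
# Sketch: a Vandermonde linear system whose nodes are the N-th roots of unity
# reduces to one inverse FFT. Illustrative only, not the paper's exact procedure.
import numpy as np

def solve_vandermonde_at_roots_of_unity(t):
    """Solve V a = t where V[k, j] = exp(-2πi k j / N), i.e. V is the DFT matrix.
    Since V^{-1} = (1/N) * conj(V), the solution is simply a = ifft(t)."""
    return np.fft.ifft(t)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N = 16
    # Hypothetical right-hand side, e.g. Toeplitz/long-conv coefficients.
    t = rng.normal(size=N) + 1j * rng.normal(size=N)
    k = np.arange(N)
    V = np.exp(-2j * np.pi * np.outer(k, k) / N)  # Vandermonde with nodes exp(-2πi k / N)
    a = solve_vandermonde_at_roots_of_unity(t)
    assert np.allclose(V @ a, t)
```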

  • we_are_mammalsB · 10 months ago

    Technical discussion seems to be dead in r/MachineLearning, but I’ll ask anyway: Isn’t it strange that in Figure 3 of the first paper, layer 1 has a blurry diagonal, while the rest of them are sharp? I would have expected the opposite: the lowest layer to be very local, and higher layers to be more global.