# On the Expressive Power of Self-Attention Matrices

Likhosherstov, V and Choromanski, K and Weller, A On the Expressive Power of Self-Attention Matrices. (Unpublished)

Full text not available from this repository.

## Abstract

Transformer networks are able to capture patterns in data coming from many domains (text, images, videos, proteins, etc.) with little or no change to architecture components. We perform a theoretical analysis of the core component responsible for signal propagation between elements, i.e. the self-attention matrix. In practice, this matrix typically exhibits two properties: (1) it is sparse, meaning that each token only attends to a small subset of other tokens; and (2) it changes dynamically depending on the input to the module. With these considerations in mind, we ask the following question: Can a fixed self-attention module approximate arbitrary sparse patterns depending on the input? How small is the hidden size \$d\$ required for such approximation? We make progress in answering this question and show that the self-attention matrix can provably approximate sparse matrices, where sparsity is in terms of a bounded number of nonzero elements in each row and column. While the parameters of self-attention are fixed, various sparse matrices can be approximated by only modifying the inputs. Our proof is based on the random projection technique and uses the seminal Johnson-Lindenstrauss lemma. Our proof is constructive, enabling us to propose an algorithm for finding adaptive inputs and fixed self-attention parameters in order to approximate a given matrix. In particular, we show that, in order to approximate any sparse matrix up to a given precision defined in terms of preserving matrix element ratios, \$d\$ grows only logarithmically with the sequence length \$L\$ (i.e. \$d = O(\log L)\$).

Item Type: Article cs.LG cs.LG UNSPECIFIED Div F > Computational and Biological Learning Cron Job 11 Jun 2021 20:06 01 Jul 2021 10:01