Original title: White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
Authors: Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Hao Bai, Yuexiang Zhai, Benjamin D. Haeffele, Yi Ma
The paper argues that the goal of representation learning is to compress and transform the data distribution toward a mixture of low-dimensional Gaussians supported on incoherent subspaces. To quantify this, the authors introduce an objective called sparse rate reduction, which measures how much a representation compresses the data while keeping it informative and sparse. Viewed through this lens, popular architectures such as transformers can be interpreted as performing iterative optimization of this objective. Building on that interpretation, they derive CRATE, a family of white-box, mathematically interpretable transformer-like architectures in which each layer alternates a compression step (a self-attention operator) with a sparsification step (an MLP-style block implementing a sparse coding update); a minimal sketch of one such layer appears after the article link. Notably, the same CRATE architectures can serve as both encoders and decoders, and they achieve performance close to heavily engineered models such as ViT and BERT on large-scale image and text datasets. The framework thus offers a bridge between deep learning theory and practice centered on data compression, and suggests a promising path toward better understanding and designing deep networks.
Original article: https://arxiv.org/abs/2311.13110
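Below is a minimal, illustrative PyTorch sketch of one CRATE-style layer, intended only to convey the alternation described above: an attention operator in which queries, keys, and values share a single per-head projection (the compression step), followed by one ISTA-like proximal-gradient update with a shifted ReLU (the sparsification step). The class names, shapes, and hyperparameters (SubspaceSelfAttention, ISTABlock, step_size, lam) are assumptions chosen for illustration, not the authors' reference implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SubspaceSelfAttention(nn.Module):
    # Compression step (sketch): attention in which queries, keys, and values
    # all use the same learned projection per head (a subspace basis), rather
    # than separate W_q, W_k, W_v as in a standard transformer.
    def __init__(self, dim, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.U = nn.Linear(dim, dim, bias=False)    # shared per-head projection
        self.out = nn.Linear(dim, dim, bias=False)  # mixes heads back together

    def forward(self, z):                           # z: (batch, tokens, dim)
        b, n, d = z.shape
        u = self.U(z).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        attn = F.softmax(u @ u.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ u).transpose(1, 2).reshape(b, n, d)
        return self.out(out)


class ISTABlock(nn.Module):
    # Sparsification step (sketch): a single proximal-gradient (ISTA-style)
    # update that pushes token features toward a sparse code in a learned
    # dictionary D, playing the role of the transformer's MLP block.
    def __init__(self, dim, step_size=0.1, lam=0.1):
        super().__init__()
        self.D = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)
        self.step_size = step_size
        self.lam = lam

    def forward(self, z):                           # z: (batch, tokens, dim)
        # Gradient step on the reconstruction error, then a shifted ReLU acting
        # as a nonnegative soft-thresholding operator.
        recon_err = z @ self.D.t() - z              # (D z - z), tokens as rows
        z = z - self.step_size * (recon_err @ self.D)
        return F.relu(z - self.step_size * self.lam)


class CRATELayer(nn.Module):
    # One white-box layer: compress with attention, then sparsify with ISTA.
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.compress = SubspaceSelfAttention(dim, num_heads)
        self.sparsify = ISTABlock(dim)

    def forward(self, z):
        z = z + self.compress(self.norm1(z))        # compression half-step
        return self.sparsify(self.norm2(z))         # sparsification half-step


if __name__ == "__main__":
    layer = CRATELayer(dim=64, num_heads=4)
    tokens = torch.randn(2, 16, 64)                 # (batch, tokens, features)
    print(layer(tokens).shape)                      # torch.Size([2, 16, 64])

The point of the sketch is the division of labor: the attention operator with a shared projection compresses token features toward low-dimensional subspaces, while the ISTA-style block replaces the usual MLP with one explicit step toward a sparse code, which is what makes each layer's role interpretable in terms of the sparse rate reduction objective.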