Original title: Bitformer: An efficient Transformer with bitwise operation-based attention for Big Data Analytics at low-cost low-precision devices
Authors: Gaoxiang Duan, Junkai Zhang, Xiaoying Zheng, Yongxin Zhu
In the era of large models, the Transformer remains the dominant architecture, but its heavy computation and reliance on high-precision floating-point arithmetic make it hard to deploy on edge devices. Bitformer addresses this by replacing the floating-point operations in the Transformer's attention mechanism with bitwise operations. The modified attention retains the ability to capture complex dependencies while reducing the computational complexity from $O(n^2 d)$ to $O(n^2 T)$, where $T$ is much smaller than the feature dimension $d$ of conventional attention. This design narrows the gap between high-performing models and resource-constrained, low-precision edge environments, making it more practical to run large models in such settings and pointing to a promising direction for future work.
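To make the idea concrete, below is a minimal NumPy sketch of one plausible form of bitwise attention: queries and keys are sign-binarized and packed into $T$ machine words, and pairwise scores come from XOR plus popcount (a Hamming-style match count) instead of floating-point dot products. The binarization scheme, function names, and scaling here are illustrative assumptions for this summary, not the paper's exact implementation.

```python
import numpy as np

def binarize(x):
    """Sign-binarize a float array to {0, 1} bits (assumed quantization step)."""
    return (x > 0).astype(np.uint8)

def pack_bits(bits):
    """Pack the last axis of a {0, 1} array into uint8 words, so each row
    occupies T = ceil(d / 8) packed words instead of d floats."""
    return np.packbits(bits, axis=-1)

def bitwise_attention(q, k, v):
    """Attention whose scores are computed with XOR + popcount rather than
    floating-point dot products.

    q, k: (n, d) float arrays; v: (n, d_v) float array.
    """
    d = q.shape[-1]
    qb = pack_bits(binarize(q))   # (n, T)
    kb = pack_bits(binarize(k))   # (n, T)

    # XOR + popcount gives the Hamming distance between every query/key pair.
    xor = np.bitwise_xor(qb[:, None, :], kb[None, :, :])        # (n, n, T)
    hamming = np.unpackbits(xor, axis=-1).sum(axis=-1)          # (n, n)

    # For +/-1 vectors, dot product = matches - mismatches = d - 2 * hamming,
    # so this plays the role of the scaled dot-product score.
    scores = (d - 2 * hamming) / np.sqrt(d)

    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Usage example with random data.
rng = np.random.default_rng(0)
n, d = 16, 64
q, k, v = rng.standard_normal((3, n, d))
out = bitwise_attention(q, k, v)   # (n, d) attention output
```

The key point the sketch illustrates is the complexity change: the score for each of the $n^2$ query/key pairs is computed over $T$ packed words with cheap bitwise instructions, rather than over $d$ floating-point multiply-accumulates, which is where the $O(n^2 d) \rightarrow O(n^2 T)$ reduction comes from.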
Original article: https://arxiv.org/abs/2311.13502