Original title: PrivateLoRA For Efficient Privacy Preserving LLM
Authors: Yiming Wang, Yu Lin, Xiaodong Zeng, Guannan Zhang
In this paper, the authors discuss the trade-off that end users face between privacy and efficiency under current Large Language Model (LLM) service paradigms. Cloud-based services require users to give up data locality in exchange for generation quality and processing speed, whereas edge-device deployments preserve data locality but fall short on performance.
To address this trade-off, the authors propose a new LLM service paradigm called PrivateLoRA. It places privacy-sensitive computation on edge devices and shared computation in the cloud, and transmits only activations between the two, so that data locality is preserved. The core innovation of PrivateLoRA is to exploit the low rank of residual activations, which reduces communication overhead by over 95%. As a result, PrivateLoRA maintains data locality while remaining highly resource efficient.
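To give a feel for why low-rank activations cut communication so sharply, here is a minimal back-of-the-envelope sketch. It assumes that the traffic per transmitted activation scales with the vector width, so sending a rank-sized vector instead of a hidden-sized one saves a fraction of 1 - rank / hidden_dim. The function names, the 16-bit value size, and the sizes 4096 and 128 are illustrative assumptions, not figures or code from the paper.

```python
def bytes_per_activation(width: int, bytes_per_value: int = 2) -> int:
    """Payload size of one activation vector (assumes 16-bit values by default)."""
    return width * bytes_per_value


def communication_saving(hidden_dim: int, rank: int) -> float:
    """Fraction of traffic saved by sending rank-sized instead of hidden-sized vectors."""
    if not 0 < rank <= hidden_dim:
        raise ValueError("rank must satisfy 0 < rank <= hidden_dim")
    full = bytes_per_activation(hidden_dim)
    low_rank = bytes_per_activation(rank)
    return 1.0 - low_rank / full


# Hypothetical sizes chosen for illustration only, not taken from the paper.
HIDDEN_DIM = 4096
RANK = 128

saving = communication_saving(HIDDEN_DIM, RANK)
print(f"{saving:.1%} less data per transmitted activation")
```

Under these assumed sizes the saving comes out above 95%, which is the same order as the reduction the summary reports. The actual figure depends on the model width and the rank chosen.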
Under standard 5G networks, PrivateLoRA achieves significantly higher throughput than device-only solutions and, for certain models, even rivals the performance of an A100 GPU. PrivateLoRA also enables advanced personalization comparable to LoRA. The approach democratizes access to generative AI on edge devices, allowing more tailored LLM experiences for the general public. The authors present the framework as the first efficient and privacy-preserving LLM solution in the literature.
Original article: https://arxiv.org/abs/2311.14030