Researchers have developed a new method for compressing the vast amount of data used by large language models (LLMs), which has the potential to enhance privacy, reduce energy consumption, and lower costs. This innovative algorithm works by eliminating excess information and decreasing the precision of the data within an LLM’s layers. As a result, this streamlined version of an LLM could be stored and utilized locally on devices such as smartphones or laptops, providing performance that is almost as accurate and detailed as its uncompressed counterpart.
Large language models (LLMs) are increasingly taking over tasks such as translation, text classification, and customer service. However, utilizing the power of an LLM usually requires users to send their requests to a central server, a process that can be costly, energy-demanding, and often slow.
Researchers have now unveiled a method for compressing the extensive data of LLMs, aiming to improve privacy, save energy, and cut costs.
The new algorithm, developed by engineers at Princeton and Stanford Engineering, works by trimming away redundant information and reducing the precision of the data stored in an LLM's layers. The resulting leaner LLM can be stored and run locally on a device such as a smartphone or laptop, while delivering performance nearly as accurate and nuanced as the uncompressed model.
Andrea Goldsmith, coauthor of the study and dean at Princeton’s School of Engineering and Applied Science, stated, “Whenever we can lessen the computational complexity, storage, and bandwidth needs of AI models, we open up the possibility of using AI on devices that previously couldn’t handle such demanding computational and memory tasks.”
According to Rajarshi Saha, another coauthor and Ph.D. student at Stanford Engineering, “When you interact with ChatGPT, your queries go to OpenAI’s servers for processing, which is very costly. We aim to enable LLM inference using consumer GPUs [graphics processing units], and compression is the key to achieving this.” Saha’s graduate research is co-mentored by Goldsmith and Mert Pilanci, a Stanford Engineering assistant professor.
The team will present their new algorithm, CALDERA (Calibration Aware Low Precision DEcomposition with Low Rank Adaptation), at the Conference on Neural Information Processing Systems (NeurIPS) in December. The researchers began their compression study not directly focused on LLMs, but on the large datasets that train LLMs and other complex AI models, such as those used in image classification. Their earlier work on this technique was published in 2023.
Both training datasets and AI models comprise matrices, or grids of numerical data. For LLMs, these are specifically referred to as weight matrices, which capture learned word patterns from extensive text sources.
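To put that in concrete terms, here is a minimal sketch (an illustrative example, not the researchers' code) of a single weight matrix at the scale of one layer of a large model. The 4096-by-4096 dimensions are assumed purely for illustration; the script simply reports how many numbers the matrix holds and how much memory it occupies at 16-bit precision.

```python
# Hypothetical illustration (not the researchers' code): an LLM "layer" is
# essentially a weight matrix -- a grid of learned numbers. Even one such
# matrix can occupy tens of megabytes, and a full model contains many of them.
import numpy as np

d_out, d_in = 4096, 4096                              # dimensions assumed for illustration
W = np.random.randn(d_out, d_in).astype(np.float16)  # one weight matrix, stored as 16-bit floats

print(f"entries: {W.size:,}")                  # ~16.8 million numbers
print(f"memory:  {W.nbytes / 1e6:.1f} MB")     # ~33.6 MB for this single matrix
```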
Saha remarked, “We initially proposed a versatile algorithm for compressing large datasets or matrices. Upon realizing that both data sets and the models being applied are growing larger, we adapted our algorithm to compress these models as well.”
Though not the first to compress LLMs, the team's algorithm stands out for combining two properties: "low-precision" and "low-rank." A "low-precision" representation reduces the number of bits needed to store and process the data, which speeds up computation and improves energy efficiency. "Low-rank," meanwhile, refers to removing redundancies within the weight matrices of LLMs.
By combining these two features, the researchers achieved significantly more compression than what could be realized by using either method separately, according to Saha.
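The sketch below illustrates the general idea in a greatly simplified form; it is not the CALDERA algorithm itself, which is calibration-aware and considerably more sophisticated. Here a weight matrix is approximated as a coarsely quantized matrix plus a low-rank correction built from the quantization residual; the bit width, rank, and helper functions (`quantize`, `low_precision_plus_low_rank`) are chosen purely for illustration.

```python
# Greatly simplified sketch of the low-precision + low-rank idea; the actual
# CALDERA algorithm is calibration-aware and more elaborate than this.
import numpy as np

def quantize(W, n_bits=4):
    """Round each entry to one of 2**n_bits evenly spaced levels (uniform quantization)."""
    lo, hi = W.min(), W.max()
    scale = (hi - lo) / (2 ** n_bits - 1)
    return np.round((W - lo) / scale) * scale + lo

def low_precision_plus_low_rank(W, n_bits=4, rank=16):
    """Approximate W as Q + L @ R: a coarsely quantized matrix plus a low-rank correction."""
    Q = quantize(W, n_bits)                          # low-precision part
    residual = W - Q                                 # what quantization threw away
    U, s, Vt = np.linalg.svd(residual, full_matrices=False)
    L = U[:, :rank] * s[:rank]                       # keep only the top singular directions
    R = Vt[:rank, :]
    return Q, L, R

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))                  # stand-in for a real weight matrix
Q, L, R = low_precision_plus_low_rank(W, n_bits=4, rank=16)
approx = Q + L @ R

err_q = np.linalg.norm(W - Q) / np.linalg.norm(W)          # quantization alone
err_qlr = np.linalg.norm(W - approx) / np.linalg.norm(W)   # quantization + low-rank correction
print(f"relative error, 4-bit only:            {err_q:.4f}")
print(f"relative error, 4-bit + rank-16 fixup: {err_qlr:.4f}")
```

On a random matrix the low-rank correction recovers only a modest share of the quantization error; real weight matrices tend to carry far more structure, which is precisely the redundancy the low-rank component is meant to exploit.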
The team tested their approach using Llama 2 and Llama 3, open-source LLMs from Meta AI. They found that their two-component method not only made low-precision techniques more effective but also improved performance by up to 5 percent on metrics that measure uncertainty in predicting word sequences.
To assess the performance of the compressed models, they employed several benchmark tasks for LLMs. These included determining the logical order of two statements and answering questions that require physical reasoning, such as how to separate egg whites from yolks or how to make a cup of tea.
Goldsmith expressed, “It’s both encouraging and somewhat surprising that we were able to achieve such impressive results with this compression method.” She noted that the emphasis on leveraging the weight matrix rather than merely using standard compression techniques for the bits representing that matrix led to superior outcomes.
Using an LLM compressed in this manner is well suited to scenarios that do not demand the highest possible precision. In addition, being able to fine-tune compressed LLMs on personal devices such as smartphones and laptops improves privacy, since organizations and individuals can adapt models to their own needs without sending sensitive data to third-party services. This reduces the risk of data breaches or unauthorized access to confidential information during the training process. To make that possible, however, the LLMs must first be compressed enough to run on consumer-grade GPUs.
Saha cautioned that running LLMs on personal devices will still place significant demands on memory and power for some time to come. "If you're using an LLM and your phone runs out of battery within an hour, that's frustrating," he remarked. He added that low-precision computation can help lower energy consumption, but that no single method resolves every issue. "What we propose is one technique that can work in conjunction with previously suggested methods, ultimately allowing for more efficient LLM use on mobile devices and enhancing accuracy in results."
The paper titled “Compressing Large Language Models using Low Rank and Low Precision Decomposition” is set to be presented at the Conference on Neural Information Processing Systems (NeurIPS) in December 2024. In addition to Goldsmith, Saha, and Pilanci, the coauthors include Stanford Engineering researchers Naomi Sagan and Varun Srivastava. This research was partially funded by the U.S. National Science Foundation, the U.S. Army Research Office, and the Office of Naval Research.