Evaluating the Impact of Compression Techniques on Task-Specific Performance in Large Language Models

JULY 18, 2024

BY OptiML Team
Introduction

Large language models (LLMs) like GPT-4, PaLM, and LLaMA have revolutionized natural language processing with their ability to perform a wide range of tasks. However, the immense size of these models, often containing billions of parameters, presents significant challenges. These models demand substantial computational resources, memory, and energy, making them costly to deploy and maintain. To address these challenges, researchers have developed various compression techniques to reduce the size of LLMs while striving to maintain their performance.

In this post, we’ll explore the impact of three popular compression techniques—Magnitude Pruning, SparseGPT, and Wanda—on the task-specific performance of LLaMA-2-7B, a widely used LLM. We’ll also discuss the limitations of traditional evaluation metrics like Perplexity and the benefits of using alternative metrics such as Jensen-Shannon Divergence (JS).

Overview of Compression Techniques

Compression techniques like pruning, quantization, and knowledge distillation have gained traction as methods to reduce the size of LLMs. In this study, we focus on three techniques:

  • Magnitude Pruning: This technique removes weights with the smallest absolute values from the model, effectively reducing its size.
  • SparseGPT: A more advanced method that uses calibration data during the pruning process to maintain model performance.
  • Wanda: Similar to SparseGPT, Wanda also leverages calibration data but uses a different approach to identify and remove less important parameters.

These techniques aim to reduce the model’s computational footprint while preserving its ability to perform specific tasks effectively.
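
To make the first of these concrete, here is a minimal sketch of unstructured magnitude pruning applied to a single PyTorch linear layer. The layer shape and the 50% sparsity target are illustrative choices for this post, not the exact configuration used in our experiments:

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the `sparsity` fraction of weights with the smallest absolute values."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    # Threshold: the k-th smallest absolute value across the whole tensor.
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight * mask

# Illustrative layer size; LLaMA-2-7B uses 4096-dimensional hidden states.
layer = torch.nn.Linear(4096, 4096, bias=False)
with torch.no_grad():
    layer.weight.copy_(magnitude_prune(layer.weight, sparsity=0.5))

print(f"achieved sparsity: {(layer.weight == 0).float().mean().item():.2%}")
```

SparseGPT and Wanda differ mainly in how they score weight importance: instead of the raw magnitude above, they incorporate statistics gathered from calibration data before deciding which weights to remove.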

Metrics for Evaluation

Traditionally, Perplexity has been used as the primary metric to evaluate the performance of language models. Perplexity measures how well a model predicts the next token in a sequence, with lower values indicating better performance. However, when it comes to evaluating compressed models, Perplexity alone may not provide a complete picture.
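
Concretely, Perplexity is the exponential of the average next-token negative log-likelihood. Here is a minimal sketch, assuming a Hugging Face-style causal LM and tokenizer (model loading is omitted, and the function names are our own):

```python
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    # With labels equal to input_ids, a causal LM returns the mean
    # next-token cross-entropy (negative log-likelihood) as `loss`.
    loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()
```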

This is where Jensen-Shannon Divergence (JS) comes into play. JS Divergence is a symmetric and bounded metric that measures the similarity between the probability distributions of the base model and the compressed model. By using JS Divergence, we can gain a more comprehensive understanding of how much the model has changed due to compression and how these changes affect its ability to perform specific tasks.
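
For intuition, JS Divergence averages two KL divergences to the mixture distribution M = (P + Q) / 2, which is what makes it symmetric and bounded by log 2. Below is a minimal sketch over per-position next-token logits; the epsilon guard and tensor layout are illustrative choices, not part of our evaluation code:

```python
import torch
import torch.nn.functional as F

def js_divergence(logits_base: torch.Tensor, logits_compressed: torch.Tensor) -> torch.Tensor:
    """JS(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2."""
    p = F.softmax(logits_base, dim=-1)
    q = F.softmax(logits_compressed, dim=-1)
    m = 0.5 * (p + q)
    eps = 1e-12  # guards against log(0) if any probability underflows to zero
    kl_pm = (p * torch.log((p + eps) / (m + eps))).sum(dim=-1)
    kl_qm = (q * torch.log((q + eps) / (m + eps))).sum(dim=-1)
    return 0.5 * (kl_pm + kl_qm)  # bounded above by log(2) per position
```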

Experimental Results

Our experiments compared the performance of the LLaMA-2-7B model with its compressed versions produced by the three techniques described above. We evaluated these models on several metrics, including Perplexity, task-specific performance (Exact Match, F1 Score, and ROUGE-1), and JS Divergence, as shown in Figure 1.


Figure 1: JS Divergence evaluated on compressed models.
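
For reference, the task-specific scores mentioned above follow standard definitions. Here is a simplified sketch of Exact Match and token-level F1; answer normalization is reduced to lowercasing and whitespace splitting for brevity, so this is not the exact scoring script used in our evaluation:

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```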

The results show that while SparseGPT and Wanda maintained Perplexity values close to those of the base model, they exhibited significant degradation in downstream task performance. This discrepancy highlights the limitations of relying solely on Perplexity as an evaluation metric.

Key Takeaways

The findings from our study underscore the importance of using diverse evaluation metrics when assessing the impact of compression techniques on LLMs. While Perplexity remains a useful metric, it does not capture the full extent of the changes induced by compression, particularly in task-specific contexts. Jensen-Shannon Divergence offers a more detailed view of model alterations and can serve as a valuable tool for understanding the trade-offs between model size and performance.

Conclusion

As LLMs continue to grow in size and complexity, the need for effective compression techniques becomes increasingly important. Our study demonstrates that while techniques like SparseGPT and Wanda show promise in maintaining performance, there are significant trade-offs that must be carefully considered when compressed models are employed for downstream tasks. Future research should focus on developing more sophisticated compression methods and evaluation metrics to better balance the demands of efficiency and performance.