CORLAY, M. Maxime (2025) Parameter Efficiency in Large Language Models: A Study on Pruning and Attention Sinks. PFE - Project Graduation, ENSTA.

Full text not available from this repository.

Abstract

Large Language Models (LLMs) contain billions of parameters and operate in latent spaces with thousands of dimensions. Their substantial scale results in significant memory and computational requirements during inference, which raises the question of whether all parameters are effectively utilized. A recent line of work has identified attention sinks, a phenomenon in which specific tokens (called sink tokens) consistently receive disproportionate attention weights. We investigate the underlying mechanisms of attention sinks and explore their potential for model compression. We identify that certain attention heads consistently produce attention sinks when sink tokens are present; we term these sink-specialized heads. Building on prior work, we observe that attention sinks arise from the unusually high norm and sparsity of sink-token representations. In particular, we focus on the Beginning-Of-Sequence (BOS) token. We propose a targeted pruning approach: we identify the highest-magnitude components of the BOS representation and retain only the corresponding columns of the key and query projection matrices in sink-specialized heads, zeroing all other columns. Despite this aggressive pruning, we find that attention patterns are well preserved, as evidenced by a low Frobenius reconstruction error. We then evaluate the approach by applying the pruning simultaneously across multiple layers. Our experiments on Llama-2-7B indicate that careful layer selection achieves 91% accuracy retention while keeping only k = 10 components per sink-specialized head. Our findings reveal the critical role of high-magnitude BOS features in attention sink formation. Notably, sink patterns can be efficiently reconstructed from only a small subset of these features, suggesting promising directions for attention-based model compression.
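To make the pruning recipe concrete, below is a minimal NumPy sketch of the idea described in the abstract (this is not the thesis code; the function names, shapes, and toy data are illustrative assumptions). It masks all input columns of one head's query/key projections except the k columns aligned with the largest-magnitude BOS features, then measures how well the pruned head reproduces the original attention pattern via the Frobenius norm.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prune_head_projections(W_Q, W_K, h_bos, k=10):
    """Keep only the k input columns of a head's query/key projections
    that match the largest-magnitude BOS features; zero the rest.
    W_Q, W_K: (d_head, d_model) projection matrices of one attention head.
    h_bos:    (d_model,) hidden state of the BOS token at this layer.
    """
    keep = np.argsort(np.abs(h_bos))[-k:]   # indices of the top-k BOS components
    mask = np.zeros(h_bos.shape[0])
    mask[keep] = 1.0
    return W_Q * mask, W_K * mask           # broadcast zeroes the other columns

def attention_pattern(H, W_Q, W_K):
    """Row-wise softmax attention map for hidden states H: (seq, d_model).
    Causal masking is omitted for brevity."""
    Q, K = H @ W_Q.T, H @ W_K.T             # (seq, d_head)
    d_head = W_Q.shape[0]
    return softmax(Q @ K.T / np.sqrt(d_head), axis=-1)

# Toy demo with random weights; in the thesis setting this would be applied
# per sink-specialized head in selected layers of Llama-2-7B.
rng = np.random.default_rng(0)
d_model, d_head, seq = 64, 16, 8
W_Q = rng.normal(size=(d_head, d_model))
W_K = rng.normal(size=(d_head, d_model))
H = rng.normal(size=(seq, d_model))
H[0, :5] *= 40.0                            # mimic sparse high-magnitude BOS features
A = attention_pattern(H, W_Q, W_K)
W_Qp, W_Kp = prune_head_projections(W_Q, W_K, H[0], k=10)
A_p = attention_pattern(H, W_Qp, W_Kp)
print("Frobenius reconstruction error:", np.linalg.norm(A - A_p, "fro"))

Under this toy setup the retained columns cover the dominant BOS components, so the pruned head reproduces the sink-dominated attention map closely; with random (non-sparse) BOS features the same mask would discard most of the signal, which is consistent with the abstract's claim that high-magnitude BOS features drive sink formation.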

Item Type: Thesis (PFE - Project Graduation)
Uncontrolled Keywords: Large Language Model (LLM), Attention Sink, Pruning, Emergent Large Magnitude Features (ELMF), Sink-Specialized Heads
Subjects: Information and Communication Sciences and Technologies
Mathematics and Applications
ID Code: 10853
Deposited By: Maxime CORLAY
Deposited On: 13 Oct 2025 09:51
Last Modified: 13 Oct 2025 09:51