
ActTail is a global, magnitude-based activation sparsity method designed to accelerate Large Language Model (LLM) inference by intelligently allocating sparsity budgets across heterogeneous Transformer weights.
Unlike traditional uniform sparsity methods that apply the same sparsity level to every layer, ActTail leverages Heavy-Tailed Self-Regularization (HT-SR) theory to assign a specific budget to each projection layer. By computing empirical spectral density indicators from each layer's weight matrix, it characterizes how heavy-tailed that layer's spectrum is, so activations feeding critical, well-trained layers are preserved while redundant activations are aggressively pruned.
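The indicator itself can be approximated from each weight matrix's eigenvalue spectrum. Below is a minimal PyTorch sketch; the `esd_alpha` helper and the Hill-style tail fit are illustrative assumptions, not ActTail's exact estimator.

```python
import torch

def esd_alpha(weight: torch.Tensor, tail_frac: float = 0.5) -> float:
    """Estimate a heavy-tail exponent (alpha) from the empirical spectral
    density of a weight matrix. Hypothetical helper: uses a Hill-style
    estimator on the largest eigenvalues of W^T W."""
    # Squared singular values of W = eigenvalues of the correlation matrix W^T W.
    eigs = torch.linalg.svdvals(weight.float()) ** 2
    eigs, _ = torch.sort(eigs, descending=True)
    k = max(2, int(tail_frac * eigs.numel()))  # number of eigenvalues used for the tail fit
    tail = eigs[:k]
    # Hill estimator: alpha = 1 + k / sum(log(lambda_i / lambda_min_of_tail)).
    alpha = 1.0 + k / torch.log(tail / tail[-1]).sum().clamp_min(1e-8)
    return alpha.item()
```

Smaller alpha values indicate heavier tails, which HT-SR associates with better-trained layers; such layers would receive a larger share of the activation budget.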
Why It Matters
As LLMs scale, computational cost and memory bandwidth become major bottlenecks. Traditional activation sparsity reduces compute but often causes severe performance degradation (sharp increases in perplexity). ActTail accelerates inference and reduces memory movement without the steep accuracy penalty of standard uniform allocation, making large-scale model deployment significantly more cost-effective.
How It Works
ActTail uses a TopK selection mechanism guided by the statistical properties of the model's weights. Instead of guessing which activations to drop, it computes each layer's empirical spectral density to identify which layers exhibit heavy-tailed eigenvalue distributions. It then assigns larger activation budgets (higher keep ratios) to the layers that need them most, while pruning activations more aggressively in less critical sections, as sketched below.
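A rough sketch of how such budgets might be allocated and applied at inference time follows. The `allocate_budgets` heuristic (inverse-alpha weighting normalized to the global budget) and `topk_activation_sparsify` are hypothetical names and choices, shown under the assumption that per-layer heavy-tail exponents have already been computed.

```python
import torch

def allocate_budgets(alphas: dict[str, float], global_keep: float) -> dict[str, float]:
    """Map per-layer heavy-tail indicators to per-layer keep ratios whose average
    matches the global budget. Illustrative heuristic: layers with smaller alpha
    (heavier tails) keep a larger fraction of their activations."""
    inv = {name: 1.0 / a for name, a in alphas.items()}
    scale = global_keep * len(inv) / sum(inv.values())
    return {name: min(1.0, scale * v) for name, v in inv.items()}

def topk_activation_sparsify(x: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep only the largest-magnitude activations along the hidden dimension
    and zero out the rest before the projection matmul."""
    k = max(1, int(keep_ratio * x.shape[-1]))
    _, idx = x.abs().topk(k, dim=-1)
    mask = torch.zeros_like(x, dtype=torch.bool).scatter_(-1, idx, True)
    return x * mask
```

The intended payoff of this kind of masking is that zeroed activations allow the corresponding weight columns to be skipped during the projection matmul, cutting both compute and memory traffic.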
Example
When evaluated on LLaMA-2-13B at an extreme 80% sparsity level, ActTail reduced perplexity degradation by 40.1% relative to standard uniform sparsity baselines. Similarly, on Mistral-7B it reduced perplexity loss by 9.4%, demonstrating its effectiveness across different foundation models.