LE BAIL, Mathis (2024) Extracting influential and interpretable concepts in the context of text classification with Large Language Models. PFE - Project Graduation, ENSTA.
Abstract
Large Language Models (LLMs) have gained wide adoption in many text analysis and generation use cases in recent years, and they are now increasingly deployed in industrial environments. However, the lack of interpretability in their reasoning hinders their adoption in sensitive industries. In particular, models called upon to make decisions, such as classification in sectors like defense or finance, must be able to provide a trustworthy explanation of the reasons behind their decisions. Neural networks are the fundamental components of LLMs, yet understanding their internal activations is challenging and requires in-depth analysis. This report presents a pipeline to identify and interpret the concepts leveraged by a language model when it resolves a classification task. We enrich this concept extraction scheme with a feature selection strategy, a joint causality assessment, and an explainability framework that automatically assigns textual explanations to the captured features. Our approach relies on Sparse AutoEncoders (SAEs), which have shown their potential for uncovering interpretable features from dense embeddings. We adapt their training and evaluation process to classification settings and provide new metrics to evaluate our pipeline. We show the utility of our pipeline in the practical study of a decoder-only LLM prompted to resolve the classification task on the AG News dataset. Our experiments demonstrate that it is possible to identify relevant directions in the residual stream that are associated with interpretable and causal sub-notions more nuanced than the broad categories among which the model must choose. Additionally, we use the concepts extracted with the SAE for two practical applications. First, we use them for the visual analysis of the fine-tuning phenomenon. Second, they serve as inputs to a simpler surrogate decision tree tasked with matching the model's decisions. We conclude by discussing the current limitations of our scheme and directions for alleviating them.
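The abstract relies on Sparse AutoEncoders trained on dense LLM activations to recover sparse, interpretable features. The sketch below is a minimal, illustrative Python/PyTorch example of that general technique, not the thesis's actual implementation; the sizes (d_model, d_hidden), the ReLU encoder, and the L1 sparsity coefficient are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over dense activations (illustrative only)."""
    def __init__(self, d_model: int = 768, d_hidden: int = 4096):
        super().__init__()
        # Overcomplete dictionary: more hidden features than input dimensions.
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # Sparse feature activations (ReLU keeps most units at zero).
        f = torch.relu(self.encoder(x))
        # Reconstruction of the original dense embedding.
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse codes.
    return nn.functional.mse_loss(x_hat, x) + l1_coeff * f.abs().mean()

# Usage sketch: in practice, x would be residual-stream activations collected
# from the LLM on classification prompts (shape [batch, d_model]).
x = torch.randn(32, 768)
sae = SparseAutoencoder()
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
loss.backward()
```

The nonzero entries of `f` play the role of candidate concepts; downstream steps such as feature selection, causal assessment, and the surrogate decision tree described in the abstract would operate on these sparse codes.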
| Item Type: | Thesis (PFE - Project Graduation) |
|---|---|
| Uncontrolled Keywords: | Large Language Models, Explainability, Concept extraction, Sparse Autoencoders |
| Subjects: | Information and Communication Sciences and Technologies; Mathematics and Applications |
| ID Code: | 10442 |
| Deposited By: | Mathis Le Bail |
| Deposited On: | 28 Oct 2024 14:55 |
| Last Modified: | 28 Oct 2024 14:55 |