de Saint Chamas, M. Philippe (2020) Data augmentation for NLP, what does really work ? PFE - Project Graduation, ENSTA.



Machine learning has been an extremely active research field in recent decades, a dynamism partly explained by the explosion in the quantity of data produced every day. Yet most methods in use today rely on supervised learning, and therefore require data labelling, a task that can be long and expensive. To ease this task, some providers offer labelling as a service and work continuously to facilitate the use of unstructured data and to build AI that really makes a difference. One of the challenges they face is reducing the amount of data needed to train an algorithm through data augmentation. Data augmentation consists in leveraging the labelled data to produce new data that improve overall performance, without the need to label more data. Many techniques can be applied, but they do not suit every task, every dataset or every model. We present below precise findings on how to use data augmentation techniques such as back-translation, word substitution and letter substitution, as well as a technique that can help when the previous ones fail: counterfactual data augmentation.
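
As an illustration only, two of the techniques named in the abstract, word substitution and letter substitution, might be sketched as follows. This is a minimal sketch and not the method used in the thesis: the synonym table, the function names (`substitute_words`, `substitute_letters`, `augment`) and the probabilities are all assumptions made for the example.

```python
import random

# Hypothetical toy synonym table, for illustration only.
SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film"],
}

def substitute_words(text, rng, p=0.5):
    """Replace each word that has a known synonym, with probability p."""
    out = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)

def substitute_letters(text, rng, p=0.1):
    """Replace each alphabetic character by a random lowercase letter, with probability p."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return "".join(
        rng.choice(alphabet) if ch.isalpha() and rng.random() < p else ch
        for ch in text
    )

def augment(dataset, rng, copies=2):
    """Return the original (text, label) pairs plus perturbed copies that keep the label."""
    augmented = list(dataset)
    for text, label in dataset:
        for _ in range(copies):
            noisy = substitute_letters(substitute_words(text, rng), rng)
            augmented.append((noisy, label))
    return augmented
```

Calling `augment([("a good movie", "pos")], random.Random(0))` returns the original example followed by two perturbed copies that keep the label `"pos"`. Whether such copies actually help depends on the task, the dataset and the model, which is the question the thesis investigates.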

Item Type: Thesis (PFE - Project Graduation)
Subjects: Mathematics and Applications
ID Code: 8272
Deposited By: Philippe De Saint Chamas
Deposited On: 11 Dec 2020 09:58
Last Modified: 11 Dec 2020 09:58