de Saint Chamas, M. Philippe (2020) Data augmentation for NLP, what does really work ? PFE - Projet de fin d'études, ENSTA.

File(s) associated with this document:

PDF, 15 MB

Abstract

Machine learning has been an incredibly active research field over the last decades, a dynamism partly explained by the explosion in the quantity of data produced every day. Yet the majority of methods used today rely on supervised learning, and therefore require a labelling task that can be long and expensive. To ease this burden, some actors provide labelling as a service and work continuously to facilitate the use of unstructured data and to build AI that really makes a difference. Among these challenges is the possibility of reducing the amount of data needed to train an algorithm through data augmentation. Data augmentation consists in leveraging the labelled data to produce new examples that improve overall performance, without the need to label more data. However, while many techniques can be applied, they do not suit every task, dataset, or model. Below we isolate precise findings on how to use data augmentation techniques such as back-translation, word substitution, and letter substitution, and on what can also help when these techniques fail: counterfactual data augmentation.
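To make two of the named techniques concrete, here is a minimal sketch of word substitution (swapping words for synonyms) and letter substitution (injecting character-level noise). This is an illustrative example, not the report's implementation; the tiny synonym table and all function names are assumptions for the sketch.

```python
import random

# Toy synonym table -- a real pipeline would use a thesaurus such as WordNet.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "joyful"]}

def word_substitution(sentence, p=0.3, rng=None):
    """Replace each word that has a known synonym with probability p."""
    rng = rng or random.Random(0)
    out = []
    for word in sentence.split():
        if word in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[word]))
        else:
            out.append(word)
    return " ".join(out)

def letter_substitution(sentence, p=0.05, rng=None):
    """Replace each letter with a random one with probability p (typo noise)."""
    rng = rng or random.Random(0)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    chars = [rng.choice(alphabet) if c.isalpha() and rng.random() < p else c
             for c in sentence]
    return "".join(chars)
```

Each call produces a perturbed copy of a labelled sentence that keeps its original label, which is how these techniques enlarge a training set without new annotation.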

Document type: Report or thesis (PFE - Projet de fin d'études)
Subjects: Mathematics and its applications
ID code: 8272
Deposited by: Philippe De Saint Chamas
Deposited on: 11 Dec. 2020 09:58
Last modified: 11 Dec. 2020 09:58
