HAMROUN, Sabrine (2016) Généralisation, test et optimisation du logiciel Phoebus de détection de réutilisations entre des textes littéraires. PRE - Research Project, ENSTA.
| PDF 872Kb |
Abstract
Generalization, testing and optimization of Phœbus, a detector of reuse between literary texts. As part of the development of digital humanities, Phœbus is a software tool under development at the LIP6 laboratory (ACASA). It aims to extract reuses between literary texts, and it was created primarily to meet the need for automatic detection of reuses in Balzac's corpus. In this context, the search for reuses in a literary corpus goes beyond plagiarism detection: it maps the elaborate network of different inspirations or reuses from other literary texts that contribute to a particular literary work. This analysis, an application of unsupervised machine learning to digital humanities, relies on the fingerprint algorithm. It also requires laborious optimization of time and memory, since it deals with a large mass of text (3 corpora so far, 10,000 texts in the future) and is therefore a big data application. In addition, the performance of Phœbus is judged primarily on precision and recall, which we seek to optimize with respect to the size of the comparison window (here the sum of the size of the word sequence and the size of the holes), whether or not word order is respected, the gap between patterns, and the minimum pattern size.
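As an illustration of the evaluation criteria named in the abstract, the following is a minimal sketch, not the Phœbus implementation. It assumes that detected and annotated reuses can each be represented as a set of passage-identifier pairs; the names `Pair`, `window_size` and `precision_recall` are hypothetical and do not come from the project.

```python
from typing import Set, Tuple

# Hypothetical representation: one reuse = (source passage id, target passage id).
Pair = Tuple[str, str]


def window_size(sequence_words: int, holes: int) -> int:
    """Comparison window as described in the abstract:
    size of the word sequence plus size of the holes."""
    return sequence_words + holes


def precision_recall(detected: Set[Pair], annotated: Set[Pair]) -> Tuple[float, float]:
    """Precision and recall of detected reuses against an annotated corpus."""
    true_positives = len(detected & annotated)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall


if __name__ == "__main__":
    detected = {("a", "b"), ("c", "d"), ("e", "f")}
    annotated = {("a", "b"), ("c", "d"), ("g", "h"), ("i", "j")}
    print(window_size(5, 2))
    print(precision_recall(detected, annotated))
```

Under these assumptions, sweeping `window_size` and the other parameters while recomputing precision and recall against the annotated corpus is one way the trade-off described above could be explored.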
Item Type: | Thesis (PRE - Research Project) |
---|---|
Uncontrolled Keywords: | Corpus, reuse, plagiarism, digital humanities, machine learning, big data, annotated corpus, useful words, stemming, precision, recall, indexing, Balzac, fingerprint algorithm. |
Subjects: | Information and Communication Sciences and Technologies |
ID Code: | 6742 |
Deposited By: | Sabrine Hamroun |
Deposited On: | 13 Oct. 2016 11:03 |
Last Modified: | 13 Oct. 2016 11:03 |