BARON, M. Arnaud (2025). *Fine-tuning LLaVA for Satellite Image Captioning*. PRE - Projet de recherche, ENSTA.
File(s) associated with this document: PDF (1195 KB)
Abstract
This report presents the fine-tuning of the Large Language-and-Vision Assistant (LLaVA) for the task of satellite image captioning. The project aimed to adapt a general-purpose multimodal model to a specialized domain. The methodology followed a structured pipeline: (i) preparation of a dataset of approximately 2100 annotated satellite images, (ii) establishment of a baseline by evaluating the pretrained LLaVA model with BLEU and METEOR metrics, (iii) fine-tuning with parameter-efficient LoRA adapters, and (iv) quantitative and qualitative evaluation of the adapted model. Fine-tuning consistently improved caption quality over the baseline, reducing hallucinations and producing outputs better aligned with the reference descriptions. Absolute performance nonetheless remained modest, with BLEU-4 and METEOR scores significantly lower than those achieved on large-scale benchmarks such as MS-COCO. Training and validation losses confirmed that most of the improvement occurred during the first epochs, after which performance plateaued. This indicates that the main limitation lies in dataset size and domain complexity rather than model capacity.
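To make the pipeline described above concrete, the sketch below shows how LoRA adapters might be attached to a pretrained LLaVA checkpoint with Hugging Face `transformers` and `peft`, and how a generated caption could be scored against a reference with BLEU-4 and METEOR via NLTK. The checkpoint name, LoRA rank, target modules, and example captions are illustrative assumptions, not the report's actual configuration or data.

```python
# Hedged sketch: LoRA adapters on a pretrained LLaVA checkpoint, plus
# caption-level BLEU-4 / METEOR scoring. Hyperparameters and the checkpoint
# name are assumptions, not the configuration used in the report.
import torch
import nltk
from transformers import AutoProcessor, LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # required by METEOR

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed base checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
)

# Parameter-efficient fine-tuning: only the small LoRA matrices are trained,
# the base model weights stay frozen.
lora_config = LoraConfig(
    r=16,                                  # adapter rank (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights

# Caption-level evaluation against a single reference description.
reference = "an aerial view of a harbor with several docked boats".split()
candidate = "a satellite image of a port with boats at the dock".split()
bleu4 = sentence_bleu([reference], candidate, weights=(0.25, 0.25, 0.25, 0.25))
meteor = meteor_score([reference], candidate)
print(f"BLEU-4: {bleu4:.3f}  METEOR: {meteor:.3f}")
```

In a full training run the LoRA-wrapped model would be optimized on the annotated image-caption pairs, and the corpus-level BLEU-4 and METEOR averages would be compared against the pretrained baseline, as the report does.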
| Document type | Report or dissertation (PRE - Projet de recherche) |
|---|---|
| Keywords | LLaVA, satellite image captioning, fine-tuning, large language models (LLM), BLEU score, METEOR score |
| Subjects | Information and communication sciences and technologies |
| ID code | 10632 |
| Deposited by | Alexandre Baron |
| Deposited on | 02 Sep 2025 17:19 |
| Last modified | 02 Sep 2025 17:19 |