BARON, M. Arnaud (2025) Fine-tuning LLaVA for Satellite Image Captioning PRE - Projet de recherche, ENSTA.

File associated with this document:

PDF (1195 KB)

Abstract

This report presents the fine-tuning of the Large Language-and-Vision Assistant (LLaVA) for the task of satellite image captioning. The project aimed to adapt a general-purpose multimodal model to a specialized domain. The methodology followed a structured pipeline: (i) preparation of a dataset of approximately 2100 annotated satellite images, (ii) establishment of a baseline by evaluating the pretrained LLaVA model with BLEU and METEOR metrics, (iii) fine-tuning with parameter-efficient LoRA adapters, and (iv) quantitative and qualitative evaluation of the adapted model. Results showed that fine-tuning consistently improved captions compared to the baseline, with hallucinations reduced and outputs more aligned with reference descriptions. However, absolute performance remained modest, with BLEU-4 and METEOR scores significantly lower than those achieved on large-scale benchmarks such as MS-COCO. Training and validation losses confirmed that most improvements occurred during the first epochs, with performance plateauing afterwards. This indicates that the main limitation lies in dataset size and domain complexity rather than model capacity.
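To make the evaluation step concrete, the following is a minimal sketch of a sentence-level BLEU-4 score, one of the two metrics the report uses for the baseline and the fine-tuned model. This is an illustrative stand-alone implementation with add-one smoothing, not the exact scorer used in the project; the example captions are invented.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(candidate, reference):
    """Sentence-level BLEU-4 against a single reference caption.

    Uses add-one smoothing on the n-gram precisions so a single
    missing n-gram order does not collapse the score to zero.
    """
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, 5):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # clipped n-gram overlap between candidate and reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        log_prec += math.log((overlap + 1) / (total + 1))
    # brevity penalty discourages overly short captions
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec / 4)

# Hypothetical reference and model captions for a satellite image
reference = "a river runs through farmland"
print(bleu4("a river runs through farmland", reference))  # perfect match -> 1.0
print(bleu4("a road near fields", reference))             # partial match, lower score
```

In the report's pipeline this kind of score is averaged over the evaluation set, once for the pretrained LLaVA baseline and once for the LoRA-adapted model, to quantify the improvement from fine-tuning.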

Document type: Report or thesis (PRE - Research project)
Keywords: LLaVA, satellite image captioning, fine-tuning, large language models (LLM), BLEU score, METEOR score
Subjects: Information and communication sciences and technologies
ID code: 10632
Deposited by: Alexandre Baron
Deposited on: 02 Sep 2025 17:19
Last modified: 02 Sep 2025 17:19
