DAI-CReTDHI project
Models & datasets related to the DAI-CReTDHI project: https://dai-cretdhi.univ-lr.fr/
This version of Qwen2.5-VL-7B is specialized for handwritten text recognition (HTR) on French parish records from the 16th to the 18th centuries. Ref: https://redmine.teklia.com/issues/11177
The model is Qwen2.5-VL-7B-Instruct, fine-tuned on French parish records using LoRA.
The dataset includes the following subsets and splits:
| Subset | Split | N pages | N records |
|---|---|---|---|
| Ardennes | train | 178 | 801 |
| Ardennes | val | 22 | 128 |
| Ardennes | test | 22 | 112 |
| Ile de Ré | train | 381 | 3015 |
| Ile de Ré | val | 43 | 368 |
| Ile de Ré | test | 42 | 347 |
| Total | train | 559 | 3816 |
| Total | val | 65 | 496 |
| Total | test | 64 | 459 |
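As a quick sanity check, the "Total" rows can be recomputed from the per-subset rows. The sketch below copies the counts from the table; the `counts` layout and the `split_totals` helper are illustrative names, not part of the project's tooling.

```python
# Per-subset counts copied from the table above: (pages, records).
counts = {
    ("Ardennes", "train"): (178, 801),
    ("Ardennes", "val"): (22, 128),
    ("Ardennes", "test"): (22, 112),
    ("Ile de Ré", "train"): (381, 3015),
    ("Ile de Ré", "val"): (43, 368),
    ("Ile de Ré", "test"): (42, 347),
}


def split_totals(counts):
    """Sum pages and records over all subsets, for each split."""
    totals = {}
    for (_subset, split), (pages, records) in counts.items():
        total_pages, total_records = totals.get(split, (0, 0))
        totals[split] = (total_pages + pages, total_records + records)
    return totals


totals = split_totals(counts)
for split, (pages, records) in totals.items():
    # Also report the average number of records per page in each split.
    print(f"{split}: {pages} pages, {records} records, "
          f"{records / pages:.1f} records per page")
```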
Parameters:
Wandb: https://wandb.ai/starride-teklia/DAI-CReTDHI/runs/c243bwbo
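The fine-tuned model achieves the following character error rate (CER) and word error rate (WER), in %, on the validation and test sets: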
| Model | CER val | WER val | CER test | WER test |
|---|---|---|---|---|
| DAN | 18.83 | 35.17 | 15.57 | 31.74 |
| QWEN2.5-VL (fine-tuned) | 16.0 | 30.93 | 13.13 | 26.28 |
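For readers unfamiliar with these metrics, here is a minimal sketch of how CER and WER are conventionally defined: the Levenshtein edit distance between reference and prediction, divided by the reference length, counted in characters or in words. It assumes the standard definitions; the exact evaluation script, text normalization, and averaging over records used for the table above are not specified here. The function names and example strings are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in %, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate in %, using whitespace tokenization."""
    ref_words = ref.split()
    return 100.0 * edit_distance(ref_words, hyp.split()) / len(ref_words)


# Hypothetical reference/prediction pair differing by one character.
reference = "de cette ditte paroisse"
prediction = "de cette dite paroisse"
print(f"CER: {cer(reference, prediction):.2f}%")
print(f"WER: {wer(reference, prediction):.2f}%")
```

A single wrong character makes the whole word count as an error, which is why WER is consistently higher than CER in the table.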
The following snippet shows how to use the model with transformers and qwen_vl_utils:
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load the base Qwen2.5-VL model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", use_fast=True)

# Load this LoRA adapter on top of the base model (requires the peft package)
model.load_adapter("Teklia/DAI-records-gold")

# Prompt (kept in French, as in the original example).
# System prompt, in English: "You are an archivist assistant. You must read records
# from French parish registers, from the 16th to the 18th century. Extract the text
# of the margin, of the body of the record, and possibly the signatures."
# User prompt, in English: "Extract the text of this document."
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "Tu es un assistant archiviste. Tu dois lire des actes issus de registres paroissiaux français, du 16è au 18è siècle. Extrais le texte de la marge, du corps de l'acte, et éventuellement les signatures."
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://europe.iiif.teklia.com/iiif/2/geneanet%2FArdennes_BMS%2F382706%2F00056.jpg/1252,102,1133,692/full/0/default.jpg"
            },
            {
                "type": "text",
                "text": "Extrais le texte de ce document."
            },
        ],
    }
]

# Preparation for inference: render the chat template and load the image
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device the model was placed on
inputs = inputs.to(model.device)

# Inference: generate the output, then strip the prompt tokens from each sequence
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
```
For the example image, this prints the following transcription:

L'an mil sept cent quatre vingt six le dix septième jour du mois de mars est décédé à Frainvrit de cette paroisse Lambert Joseph Henrot âgé de trente huit ans, natif de Gumiay fils de Lambert Henrot et de Marie Thérèse Gervant lequel a été inhumé le lendemain au cimetière de cette ditte paroisse avec les ceremonies ordinaires par nous prêtre vicaire de cette ville, en présence de Vincent Parrot fleure de cette église, et de Piacre Blondeau, habitant de cette ville, lesquels ont signé avec nous.
Blondeau Parrot Berin vicaire
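
The image URL in the example follows the IIIF Image API pattern `{identifier}/{region}/{size}/{rotation}/{quality}.{format}`, where the region `1252,102,1133,692` is `x,y,width,height` in pixels, so the model receives a crop of the page rather than the full scan. To run the model on other crops, the prompt can be rebuilt for each one. The sketch below is one possible way to do this; `iiif_crop_url` and `build_messages` are hypothetical helper names, and passing one record crop per request is inferred from the example, not a documented requirement.

```python
# Prompts copied verbatim from the example above.
SYSTEM_PROMPT = (
    "Tu es un assistant archiviste. Tu dois lire des actes issus de registres "
    "paroissiaux français, du 16è au 18è siècle. Extrais le texte de la marge, "
    "du corps de l'acte, et éventuellement les signatures."
)
USER_PROMPT = "Extrais le texte de ce document."


def iiif_crop_url(image_base, x, y, width, height):
    """Build a IIIF Image API URL for a pixel region, at full size, unrotated."""
    region = f"{x},{y},{width},{height}"
    return f"{image_base}/{region}/full/0/default.jpg"


def build_messages(image_url):
    """Return the chat messages used in the example, for a given image URL."""
    return [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": USER_PROMPT},
            ],
        },
    ]


# Rebuild the URL and messages of the example above.
base = "https://europe.iiif.teklia.com/iiif/2/geneanet%2FArdennes_BMS%2F382706%2F00056.jpg"
url = iiif_crop_url(base, 1252, 102, 1133, 692)
messages = build_messages(url)
print(url)
```

The resulting `messages` list can be passed to `processor.apply_chat_template` and `process_vision_info` exactly as in the snippet above.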
To cite the original QWEN2.5-VL model:
@misc{qwen2.5-VL,
title = {Qwen2.5-VL},
url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
author = {Qwen Team},
month = {January},
year = {2025}
}