Practical, Reproducible LLM-as-Annotator Pipelines Across Scripts and Domains
Tutorial at LREC 2026, May 16th, afternoon session, Room 6
This hands-on tutorial teaches cost-effective, reproducible pipelines for using large language models (LLMs) as annotators on linguistic tasks where both the language and the target phenomenon are under-resourced. The methods are deliberately language-agnostic: participants can apply them to their own corpora, including non-Latin scripts (abjads, abugidas, syllabaries, logographic systems), morphologically rich and polysynthetic languages, and historical or OCR-degraded text.
We cover zero- and few-shot prompting, constrained and structured outputs for polylexical tags, in-domain vs. out-of-domain evaluation, error analysis, strategic sampling and active learning, and ethics and data governance for heritage and minority-language resources. Deliverables include ready-to-adapt notebooks, small example datasets, and checklists for responsible practice.
By the end of the tutorial, participants will be able to build, evaluate, and responsibly release LLM-as-annotator pipelines for their own corpora, including under-resourced languages and bespoke tagsets.
The tutorial has been tested on a variety of under-resourced languages and scripts to assess relevance and feasibility, including Indo-European languages with specific challenges (Greek and Armenian), a Semitic language (Syriac), and a Kartvelian language (Georgian).
0:00–0:20 – Motivation and scope
Under-resourced languages vs. under-resourced studies; where LLMs help and where they fail. Cross-script challenges (abjads, abugidas, syllabaries, logographic scripts), OCR noise, and bespoke tagsets.
0:20–0:45 – Representations and tagsets
Universal Dependencies vs. bespoke, domain- or era-specific tagsets; implications for prompts, constrained outputs, and scoring.
0:45–1:25 – Hands-on I: basic LLM-as-annotator pipelines
Zero- and few-shot prompting for POS, lemmatization, and morphology; structured output contracts (JSON, simple grammars); rapid validation notebooks.
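As a taste of the hands-on session, the sketch below shows one way to express a structured output contract for POS and lemma annotation and to validate an LLM reply against it. The tagset, prompt template, and helper names are illustrative placeholders, not the tutorial's actual notebook code; a real pipeline would plug in a model call where the reply string comes from.

```python
import json

# Hypothetical tagset; in practice this comes from UD or a bespoke scheme.
TAGSET = {"NOUN", "VERB", "ADJ", "ADP", "PUNCT"}

PROMPT_TEMPLATE = (
    "Annotate each token with its POS tag and lemma.\n"
    'Return ONLY a JSON list of objects with keys "token", "pos", "lemma".\n'
    "Tokens: {tokens}"
)

def validate_annotation(raw_output: str, tokens: list[str]) -> list[dict]:
    """Check that a model reply honours the output contract before accepting it."""
    data = json.loads(raw_output)  # reply must be valid JSON
    assert isinstance(data, list) and len(data) == len(tokens), "length mismatch"
    for item, tok in zip(data, tokens):
        assert item["token"] == tok, f"token drift: {item['token']!r} != {tok!r}"
        assert item["pos"] in TAGSET, f"tag outside tagset: {item['pos']!r}"
        assert item["lemma"], "empty lemma"
    return data

# A well-formed reply passes validation; malformed ones raise early and loudly.
reply = '[{"token": "cats", "pos": "NOUN", "lemma": "cat"}]'
annotations = validate_annotation(reply, ["cats"])
```

Rejecting replies at this stage, rather than silently repairing them, keeps downstream evaluation honest.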
1:25–1:35 – Break
1:35–2:05 – Evaluation and error analysis
In-domain vs. out-of-domain splits; common failure modes (repetition, omissions, script-specific quirks); automatic sanity checks and lightweight adjudication workflows; robustness to OCR and noisy input.
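The automatic sanity checks discussed here can be very lightweight. The sketch below, with hypothetical function names, flags three failure modes mentioned above: omissions (length mismatch), runaway repetition, and script-specific quirks (a lemma drifting into a different script than its token), using only the standard library.

```python
import unicodedata

def dominant_script(text: str) -> str:
    """Rough script guess from Unicode character names (e.g. 'ARMENIAN', 'LATIN')."""
    scripts = [unicodedata.name(c).split()[0] for c in text if c.isalpha()]
    return max(set(scripts), key=scripts.count) if scripts else "UNKNOWN"

def sanity_check(tokens: list[str], lemmas: list[str]) -> list[str]:
    """Flag common LLM failure modes: omissions, repetition, script switches."""
    issues = []
    if len(lemmas) != len(tokens):
        issues.append(f"omission/insertion: {len(tokens)} tokens vs {len(lemmas)} lemmas")
    # Same lemma four times in a row is a cheap proxy for degenerate repetition.
    for i in range(len(lemmas) - 3):
        if len(set(lemmas[i : i + 4])) == 1:
            issues.append(f"repetition from position {i}: {lemmas[i]!r}")
            break
    for tok, lem in zip(tokens, lemmas):
        if dominant_script(tok) != dominant_script(lem):
            issues.append(f"script mismatch: {tok!r} -> {lem!r}")
    return issues
```

Items that trip any check are routed to the adjudication queue rather than discarded, so the checks cost nothing in recall.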
2:05–2:35 – Hands-on II: sampling and bootstrapping
Strategic sampling and active learning with LLMs in the loop; comparison with compact neural baselines (e.g. Stanza, UDPipe); hybrid bootstrapping strategies.
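One simple strategy covered in this session is disagreement-based sampling: annotate each item several times (repeated LLM runs, or an LLM vs. a compact baseline such as Stanza or UDPipe) and send the most contested items to humans. The sketch below is a minimal illustration with hypothetical names, scoring each item by one minus the majority-label share.

```python
from collections import Counter

def disagreement_sampling(annotations: list[list[str]], budget: int) -> list[int]:
    """Return indices of the `budget` items whose repeated annotations
    disagree most, to be routed to human annotators."""
    def disagreement(labels: list[str]) -> float:
        counts = Counter(labels)
        return 1 - counts.most_common(1)[0][1] / len(labels)  # 1 - majority share
    scored = [(disagreement(runs), i) for i, runs in enumerate(annotations)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:budget]]
```

With three runs per item, an item where all runs agree scores 0 and is never sampled, while an item with three different labels scores 2/3 and is sampled first.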
2:35–3:00 – Deployment and governance
Dataset licensing, data sovereignty and community co-design (CARE / OCAP principles); responsible reuse of critical editions; releasing prompts and code; recap and next steps.
Tan, Zhen, Dawei Li, Song Wang, Alimohammad Beigi, Bohan Jiang, Amrita Bhattacharjee, Mansooreh Karami, Jundong Li, Lu Cheng, and Huan Liu. 2024. Large Language Models for Data Annotation and Synthesis: A Survey. EMNLP 2024.
https://aclanthology.org/2024.emnlp-main.54/
Li, Minzhi, Taiwei Shi, Caleb Ziems, Min-Yen Kan, Nancy Chen, Zhengyuan Liu, and Diyi Yang. 2023. CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation. EMNLP 2023.
https://aclanthology.org/2023.emnlp-main.92/
Gligoric, Kristina, Tijana Zrnic, Cinoo Lee, Emmanuel Candès, and Dan Jurafsky. 2025. Can Unconfident LLM Annotations Be Used for Confident Conclusions? NAACL 2025.
https://aclanthology.org/2025.naacl-long.179/
Bibal, Adrien, Nathaniel Gerlek, Goran Muric, Elizabeth Boschee, Steven C. Fincke, Mike Ross, and Steven N. Minton. 2025. Automating Annotation Guideline Improvements using LLMs: A Case Study. CoMeDi 2025.
https://aclanthology.org/2025.comedi-1.13/
Kasner, Zdeněk, Vilém Zouhar, Patrícia Schmidtová, Ivan Kartáč, Kristýna Onderková, Ondrej Plátek, Dimitra Gkatzia, Saad Mahamood, Ondrej Dušek, and Simone Balloccu. 2026. LLMs as Span Annotators: A Comparative Study of LLMs and Humans. MME 2026.
https://aclanthology.org/2026.mme-main.1/
Jadhav, Suramya, Abhay Shanbhag, Amogh Thakurdesai, Ridhima Sinare, and Raviraj Joshi. 2025. On Limitations of LLM as Annotator for Low Resource Languages. ICNLSP 2025.
A useful paper on practical limits and failure cases in low-resource settings.
https://aclanthology.org/2025.icnlsp-1.27/
Kellert, Olga, Nemika Tyagi, Muhammad Imran, Nelvin Licona-Guevara, and Carlos Gómez-Rodríguez. 2025. Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages. Findings of EMNLP 2025.
https://aclanthology.org/2025.findings-emnlp.863/
Vidal-Gorène, Chahan, Bastien Kindt, and Florian Cafiero. 2026. Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac. LoResLM 2026.
https://aclanthology.org/2026.loreslm-1.28/
Florian Cafiero (EPITA Paris & Université Paris Sciences et Lettres – PSL)
Associate professor at EPITA Paris and researcher at the Centre Jean-Mabillon (École nationale des chartes – PSL). His work spans digital humanities and computational social science, with a focus on low-resource and historical corpora, stylometry and authorship attribution, and practical NLP workflows (LLM-as-annotator, OCR/HTR pipelines) for heritage collections.
Chahan Vidal-Gorène (École nationale des chartes – PSL; LIPN)
Lecturer and researcher in digital humanities, and director of the Digital Humanities Master’s program at PSL University. His research focuses on computational paleography and the analysis of under-resourced languages (HTR, linguistic analysis). He is a member of the ANR DALiH project and of the DISTAM consortium, which creates datasets and develops NLP models for non-Latin-script languages (mainly Armenian, Arabic, and Chinese), and is the founder and CEO of Calfa.
Bastien Kindt (Université catholique de Louvain)
Researcher at the Institut orientaliste of UCLouvain and a specialist in Ancient Greek. He is a scientific collaborator of the GREgORI project, which develops corpora, tools, and analysis methods for the languages of the Christian East (Ancient Armenian, Ancient Greek, Arabic, Old Georgian, Syriac).
Further practical information (e.g. links to notebooks and datasets) will be added closer to the tutorial date.
This tutorial is supported by the PSL Research University’s Major Research Program CultureLab, implemented by the ANR (reference ANR-10-IDEX-0001), and by the ANR project Digitizing Armenian Linguistic Heritage (DALiH, reference ANR-21-CE38-0006).