lrec2026-llm-as-annotator-tutorial

Participant cheatsheet

LLM-as-annotator in one sentence

Use the LLM to produce candidate annotations, then constrain, validate, evaluate, and selectively review them. The goal is not to remove expert judgement, but to make scarce expertise more effective.

Minimal workflow

Define the task and tagset.
Prepare a small, fixed input batch.
Build a zero-shot prompt.
Add validated few-shot examples.
Request structured JSON.
Validate format and token alignment.
Evaluate against a small gold standard.
Analyse errors.
Select the next examples for expert review.
Update guidelines, prompts, and examples.

Prompt checklist

A good prompt should specify:

language and script;
sentence identifier;
fixed input tokens;
authorised UPOS tags;
authorised morphological features;
lemma convention;
output JSON schema;
uncertainty policy;
what the model must not do.
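
The checklist can be turned into a small prompt builder. The following is a minimal sketch, not the tutorial's actual prompt: the function name `build_prompt`, the JSON field names (`sent_id`, `tokens`, `form`, `upos`, `lemma`, `feats`, `confidence`), the uncertainty policy, and the wording of every instruction are assumptions made for illustration.

```python
import json


def build_prompt(language, script, sent_id, tokens, upos_tags,
                 feature_names, lemma_convention):
    """Assemble a zero-shot prompt with one line per checklist item.

    Field names and instruction wording are illustrative assumptions.
    """
    # Hypothetical output schema, described informally for the model.
    schema = {
        "sent_id": "string",
        "tokens": [{
            "form": "string, copied exactly from the input",
            "upos": "one authorised UPOS tag",
            "lemma": "string",
            "feats": "object mapping authorised feature names to values",
            "confidence": "low | medium | high",
        }],
    }
    lines = [
        f"Language: {language}. Script: {script}.",
        f"Sentence identifier: {sent_id}",
        "Input tokens (fixed): " + json.dumps(tokens, ensure_ascii=False),
        "Authorised UPOS tags: " + ", ".join(sorted(upos_tags)),
        "Authorised morphological features: "
        + ", ".join(sorted(feature_names)),
        f"Lemma convention: {lemma_convention}",
        "Return only JSON matching this schema: " + json.dumps(schema),
        # Assumed uncertainty policy: always answer, but flag doubt.
        "If unsure, give your best analysis and set confidence to low.",
        "Do not translate, transliterate, or normalise tokens. Do not add, "
        "remove, split, merge, or reorder tokens. Do not use labels outside "
        "the authorised inventories.",
    ]
    return "\n".join(lines)


prompt = build_prompt(
    "Latin", "Latin", "s1", ["puella", "rosam", "amat"],
    {"NOUN", "VERB"}, {"Case", "Number"}, "dictionary headword",
)
print(prompt)
```

Sorting the inventories keeps the prompt text identical across runs, which makes runs easier to compare.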

Non-negotiable instructions

The model must not:

translate the sentence;
transliterate tokens;
normalise surface forms silently;
add tokens;
remove tokens;
split tokens;
merge tokens;
reorder tokens;
use labels outside the authorised inventory.

Validation checklist

Before evaluating linguistic quality, check:

JSON is parseable;
all required fields are present;
no extra fields are introduced;
output token count equals input token count;
each output surface exactly matches the corresponding input token;
UPOS tags are authorised;
feature names are authorised;
each confidence value is one of low, medium, or high.

Invalid output is a result to log, not something to hide.
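
The checks above can be sketched as a single function that returns a list of problems instead of raising, so that every failure can be logged. This is a minimal sketch under stated assumptions: the output schema (a `sent_id` plus a `tokens` list whose items carry `form`, `upos`, `lemma`, `feats`, and `confidence`) and the name `validate` are hypothetical, not the tutorial's own.

```python
import json

# Hypothetical schema: these field names are assumptions for illustration.
TOP_LEVEL = {"sent_id", "tokens"}
REQUIRED = {"form", "upos", "lemma", "feats", "confidence"}
CONFIDENCE = {"low", "medium", "high"}


def validate(raw, input_tokens, upos_tags, feature_names):
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"format: unparseable JSON ({exc.msg})"]
    if not isinstance(data, dict) or not isinstance(data.get("tokens"), list):
        return ["format: expected an object with a 'tokens' list"]

    problems = []
    if set(data) != TOP_LEVEL:
        problems.append("format: top-level fields differ from the schema")
    out = data["tokens"]
    if len(out) != len(input_tokens):
        problems.append(
            f"alignment: {len(out)} output tokens "
            f"for {len(input_tokens)} input tokens"
        )
    # zip stops at the shorter list; the count mismatch is already recorded.
    for i, (tok, form) in enumerate(zip(out, input_tokens)):
        if not isinstance(tok, dict):
            problems.append(f"format: token {i} is not an object")
            continue
        if set(tok) != REQUIRED:
            problems.append(f"format: token {i} has missing or extra fields")
        if tok.get("form") != form:
            problems.append(f"alignment: token {i} surface differs from input")
        upos = tok.get("upos")
        if not isinstance(upos, str) or upos not in upos_tags:
            problems.append(f"tagset: token {i} has unauthorised UPOS {upos!r}")
        feats = tok.get("feats")
        if not isinstance(feats, dict):
            problems.append(f"format: token {i} feats is not an object")
        else:
            for name in feats:
                if name not in feature_names:
                    problems.append(
                        f"tagset: token {i} has unauthorised feature {name!r}"
                    )
        conf = tok.get("confidence")
        if not isinstance(conf, str) or conf not in CONFIDENCE:
            problems.append(f"tagset: token {i} has invalid confidence {conf!r}")
    return problems
```

The surface comparison is an exact string comparison, so a silently normalised form (for example a different Unicode composition of the same character) counts as an alignment failure.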

Evaluation checklist

Report separately:

invalid-output rate;
token-alignment failure rate;
POS accuracy;
lemma exact match;
morphological feature precision, recall, and F1;
results by language, script, domain, and prompt type.

Do not rely on a single global accuracy score alone: it can hide weak results for a particular language, script, domain, or prompt type.
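
The token-level metrics can be sketched as follows. This is one possible reading, not a prescribed scoring script: it assumes gold and predicted tokens are already aligned one-to-one (invalid and misaligned outputs are counted separately), that each token is a dict with `upos`, `lemma`, and `feats` keys (hypothetical names), and that morphological scores are micro-averaged over feature-value pairs.

```python
def evaluate(gold, pred):
    """Score aligned token lists; each token is a dict with upos, lemma, feats."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and aligned")
    n = len(gold)
    pos_correct = sum(g["upos"] == p["upos"] for g, p in zip(gold, pred))
    lemma_correct = sum(g["lemma"] == p["lemma"] for g, p in zip(gold, pred))

    # Micro-averaged counts over (feature, value) pairs.
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_pairs = set(g["feats"].items())
        p_pairs = set(p["feats"].items())
        tp += len(g_pairs & p_pairs)   # pair predicted and in gold
        fp += len(p_pairs - g_pairs)   # pair predicted but not in gold
        fn += len(g_pairs - p_pairs)   # gold pair that was missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "pos_accuracy": pos_correct / n,
        "lemma_exact_match": lemma_correct / n,
        "feats_precision": precision,
        "feats_recall": recall,
        "feats_f1": f1,
    }
```

Calling `evaluate` once per language, script, domain, or prompt type gives the per-group breakdown. A wrong value for a correct feature name counts as both a false positive and a false negative under this scheme.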

Error typology

Type	Meaning	Typical action
Format	malformed JSON, missing fields	schema / retry
Alignment	token added, removed, translated, reordered	stricter prompt / validation
Tagset	unauthorised label	closed inventory
Lemma	wrong or inconsistent lemma	examples / guideline
Morphology	wrong feature-value pair	feature-specific analysis
Script	transliteration or character confusion	preprocessing / prompt
Domain	genre-specific or rare form	in-domain examples
Noise	OCR/HTR or damaged text	flag / expert review
Guideline	instructions underspecified	revise documentation
Ambiguous	multiple analyses plausible	expert adjudication
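
Once errors have been labelled by hand with the types above, a tally shows where to act first. The sketch below is only an illustration: the `ACTIONS` mapping copies the table, while the function name `tally_errors` and the input format (a flat list of type labels) are assumptions.

```python
from collections import Counter

# Error type -> typical action, copied from the table above.
ACTIONS = {
    "Format": "schema / retry",
    "Alignment": "stricter prompt / validation",
    "Tagset": "closed inventory",
    "Lemma": "examples / guideline",
    "Morphology": "feature-specific analysis",
    "Script": "preprocessing / prompt",
    "Domain": "in-domain examples",
    "Noise": "flag / expert review",
    "Guideline": "revise documentation",
    "Ambiguous": "expert adjudication",
}


def tally_errors(labels):
    """Count labelled errors and pair each type with its typical action."""
    unknown = set(labels) - ACTIONS.keys()
    if unknown:
        raise ValueError(f"labels outside the typology: {sorted(unknown)}")
    # most_common() sorts by count, most frequent first.
    return [(etype, count, ACTIONS[etype])
            for etype, count in Counter(labels).most_common()]


for etype, count, action in tally_errors(["Lemma", "Format", "Lemma"]):
    print(f"{etype}\t{count}\t{action}")
```

Rejecting unknown labels keeps the typology a closed inventory, in the same spirit as the tagset.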

Sampling for expert review

Prioritise examples that show one or more of the following, and especially those that combine several:

low confidence;
prompt or model disagreement;
validation failures;
rare linguistic phenomena;
under-represented languages or domains;
high value for improving guidelines.

Also keep a random slice of the remaining examples, so that errors the selection criteria miss still reach an expert.
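
One way to combine prioritised and random sampling is sketched below. The item fields (`min_confidence`, `disagreement`, `invalid`, `rare`), the equal weighting of the signals, and the name `select_for_review` are all assumptions for illustration; a real project would choose its own signals and weights.

```python
import random


def select_for_review(items, k_priority, k_random, seed=0):
    """Pick the k_priority highest-scoring items plus a random slice of the rest.

    Each item is a dict with the hypothetical keys 'id', 'min_confidence'
    ('low', 'medium' or 'high'), 'disagreement', 'invalid' and 'rare' (bools).
    """
    def score(item):
        # Equal weights are an assumption; each signal adds one point.
        return (int(item["min_confidence"] == "low")
                + int(item["disagreement"])
                + int(item["invalid"])
                + int(item["rare"]))

    # sorted() is stable, so ties keep their original order.
    ranked = sorted(items, key=score, reverse=True)
    priority = ranked[:k_priority]
    rest = ranked[k_priority:]

    # A seeded generator makes the random slice reproducible.
    rng = random.Random(seed)
    random_slice = rng.sample(rest, min(k_random, len(rest)))
    return priority, random_slice
```

Drawing the random slice only from items that were not prioritised avoids sending the same example to review twice.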

Governance checklist

Before using an external API or releasing outputs, ask:

Who owns the source corpus?
Are the editions, images, or transcriptions under licence?
Can derived annotations be redistributed?
Are there community or cultural restrictions?
Are the prompts and outputs safe to publish?
Is a local model required for privacy or sovereignty reasons?

Minimal reproducibility record

For each run, save:

model name and version;
date of generation;
system prompt;
user prompt;
few-shot examples;
JSON schema version;
decoding parameters;
dataset version;
validation rules;
invalid-output rate;
evaluation split;
error-analysis procedure.
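
The record can be saved as one JSON file per run. The following is a minimal sketch: the key names in `REQUIRED_KEYS` are hypothetical labels for the items listed above, and `save_run_record` is an illustrative helper, not part of any existing tool.

```python
import json

# One hypothetical key per item in the list above.
REQUIRED_KEYS = [
    "model", "generated_at", "system_prompt", "user_prompt",
    "few_shot_examples", "schema_version", "decoding", "dataset_version",
    "validation_rules", "invalid_output_rate", "evaluation_split",
    "error_analysis",
]


def save_run_record(path, record):
    """Write the record as JSON, refusing to save an incomplete one."""
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise ValueError("run record is missing: " + ", ".join(missing))
    with open(path, "w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps non-Latin scripts readable in the file.
        json.dump(record, fh, ensure_ascii=False, indent=2, sort_keys=True)
```

Failing loudly on a missing field is a design choice: an incomplete record is easy to overlook at generation time and hard to reconstruct later.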