Programme

EAMT 2026 — Programme

European Association for Machine Translation · 16–18 June 2026

15 June 2026 - Day 0

TAITT 2026
Location: MindLabs, Room: MLZ 1.21
08:00 - 09:00 Registration
09:00 - 09:10 Opening of the workshop
09:10 - 10:00 Keynote 1: Lynne Bowker - “In 2026, we are friction-maxxing”: Scaffolding friction as part of teaching AI-based translation and technologies
10:00 - 10:30 Oral Presentations (15 mins. each)
#155 Evaluative Judgement in Teaching AI
Gokhan Dogru

Drawing on 23 anonymized student projects from a fourth-year Machine Translation and Post-editing course in a BA-level translation programme, this paper examines how structured comparison of general-purpose LLMs and online MT systems can elicit evaluative judgement in AI-mediated translation. Students translated short specialised English Wikipedia texts into Catalan or Spanish, generated four system outputs, evaluated them using automatic metrics and human adequacy/fluency assessment, selected one output for post-editing, and justified their decision in written reports. Descriptive counts are reported for all 23 projects, while qualitative interpretation is based on the 22 cases accompanied by written reports. Results show that students did not treat automatic metrics as final authority: final post-editing selections often diverged from metric rankings and were justified through adequacy, fluency, terminology, naturalness, and expected post-editing effort. The study therefore does not benchmark systems under controlled conditions; it analyses how students justified system choice within an authentic classroom assignment.

#71 Teaching linguistic prompt control for LLM based translation: A classroom approach to developing critical and responsible AI literacy
Katrin Menzel

As Large Language Models (LLMs) are increasingly integrated into professional translation, there is a growing need for pedagogical frameworks that move beyond simple trial-and-error prompting across any available tools used for translation tasks. This paper presents a framework developed in a university MA seminar on translation to foster responsible AI literacy as well as linguistically grounded prompt control and structured prompting. The approach also emphasises that AI is understood as a tool for suggesting possible translation solutions, while human translators remain the final decision makers. Integrating translation-oriented text analysis and corpus-informed feature extraction from parallel and comparable data, the approach teaches students to develop register-specific and model-interpretable instructions when translating specialised texts with the assistance of generative AI tools. During a one-semester course, students familiarised themselves with prompting strategies and different tools, including commercial, freely available models, GDPR-compliant institutional infrastructure and local models. These steps and the comparative evaluation of these tools allowed the students to identify optimal configurations that conform best to professional quality standards for specialised texts, and they ultimately led to highly improved translation output compared to unsteered model results.

10:30 - 11:00 Coffee break
11:00 - 12:15 Oral Presentations (15 mins. each)
#148 Mimicking Neural Machine Translation History for Pedagogic Reasons
Vincent Vandeghinste

In this paper we describe how we mimic the different steps in the historical development of NMT systems, all trained and evaluated on the same small data set. We do this for pedagogic reasons so students can see the effect of each of the steps on metrics like BLEU but also on qualitative examples, which the training scripts generate after each epoch. As MT paradigms, we discuss NMT training from scratch, fine-tuning pre-trained encoder-decoder models, and finally prompt engineering for decoder-only models. All models run in Kaggle sessions and all Python scripts and Jupyter notebooks are made available to the MT teaching community through GitHub and public Kaggle sessions.

#154 Teaching Machine Translation Technologies with MTUOC
Antoni Oliver and Sergi Alvarez-Vidal

This paper presents the pedagogical integration of MTUOC, an open-source project developed at Universitat Oberta de Catalunya (UOC)—a distance-learning institution—to facilitate the training, fine-tuning, and integration of Neural Machine Translation (NMT) and Large Language Models (LLMs). The project consists of a modular suite of tools designed to streamline complex technical workflows for translation purposes. These components are currently utilised across research, industry knowledge transfer, and formal education. Specifically, the tools have been successfully implemented in a Bachelor's degree in Translation and Interpreting and a Master's degree in Translation Technologies. Furthermore, a pilot open course based on this framework received significant interest, reaching over 100 participants. This paper outlines the core components of the project, discusses the teaching experiences gathered in asynchronous environments, and describes the organisation of a forthcoming open course scheduled for October 2026. The results suggest that providing students with accessible, high-level interfaces for AI-based translation technologies enhances their technical autonomy and professional readiness.

#13 A Technical Curriculum on Language
Ralph Krüger

This paper presents a technical curriculum on language-oriented artificial intelligence (AI) in the language and translation (L&T) industry. The curriculum aims to foster domain-specific technical AI literacy among stakeholders in the fields of translation and specialised communication by exposing them to the conceptual and technical/algorithmic foundations of modern language-oriented AI in an accessible way. The core curriculum focuses on 1) vector embeddings, 2) the technical foundations of neural networks, 3) tokenization and 4) transformer neural networks. It is intended to help users develop computational thinking as well as algorithmic awareness and algorithmic agency, ultimately contributing to their digital resilience in AI-driven work environments. The didactic suitability of the curriculum was tested in an AI-focused MA course at the Institute of Translation and Multilingual Communication at TH Köln. Results suggest the didactic effectiveness of the curriculum, but participant feedback indicates that it should be embedded into higher-level didactic scaffolding – e.g., in the form of lecturer support – in order to enable optimal learning conditions.

#160 Teaching Data Management to Translation Students: From Documentation Practices to Data Literacy
Pilar Sánchez-Gijón

This paper proposes an approach to teaching data management in translation programmes, framing it as an extension of established documentation practices. It argues that data governance, sourcing, and processing are core competences for translators working with NMT and LLMs, and outlines a training framework that supports responsible data reuse, quality optimisation, and professional agency.

#151 Translator competence in the age of agentic AI orchestration: A “backcasting” perspective
Yu Hao, Elise Wu and Ester Leung

As an orchestration infrastructure, agentic AI systems now can plan and decompose the pre-defined goals into a sequence of steps, decide on external function calls, and coordinate one LLM or multiple LLMs with specialised roles. In this context, this position paper adopts a future studies "backcasting" approach that starts with a desirable future, envisioned as one in which AI-integrated translation workflows are transparent, accountable, and aligned with human values; it then works backwards to examine how translator expertise should be reconceptualised to sustain meaningful human-in-the-loop participation. In this sense, the study first conceptualises the current translation-service provision as a system structured around managerial, mediation, and authorising roles. It then analyses how these roles may be changed and augmented within the agentic AI-orchestrated workflows. Building on the analysis, we propose a series of competences that should be cultivated to achieve the envisioned future: 1) evaluation grounded in advanced language competence; 2) situated and context-sensitive judgement informed by cultural and experiential knowledge; and 3) strategic procedural planning in the design and oversight of agentic AI-orchestration workflows. The paper concludes with recommendations for future pedagogical development and empirical research.

12:15 - 12:30 Open Discussion (15 mins.)
12:30 - 13:30 Lunch break
13:30 - 14:20 Keynote 2: Miquel Esplà-Gomis - Owning the Infrastructure: AI Sovereignty as a Key Competence for Next-Generation Translators
14:20 - 14:50 Oral Presentations (15 mins. each)
#145 Integrating AI-based technologies into translation workflows through a Simulated Translation Bureau (STB)
Koen Kerremans

This paper reports on an exploratory, qualitative study of technology use in a Simulated Translation Bureau (STB) in a master's programme in translation. The STB is a practice-oriented module in which students work on authentic translation projects and integrate a range of technologies, including AI-based tools, into their workflows. The paper addresses how students' technology-related decision-making can be made visible and assessable, including students' reflections on how they justify and verify the use of output-generating technologies, and how they describe perceived changes in their technology use over the course of the simulation. The study shows that the STB is an effective pedagogical model for teaching and assessing technology judgement in AI-enabled translation workflows, but also highlights the need to foreground broader ethical dimensions of AI literacy more explicitly in future iterations of the pedagogical design.

#157 Beyond post-editing: A project-based module on MT and LLM integration for trainee translators
Alina Karakanta

Skills beyond post-editing such as data literacy, technology evaluation, and critical engagement with AI-based tools are becoming essential competencies for trainee translators. This paper presents a syllabus for a translation technology module equipping MA Translation students with end-to-end MT evaluation skills. LLMs are integrated throughout the curriculum. The module is grounded in project-based learning with a simulated client scenario. A three-year trend shows a gradual shift towards LLM-based tools among students.

14:50 - 15:00 Open Discussion (10 mins.)
15:00 - 15:30 Coffee break
15:30 - 16:15 Posters
#147 Teaching AI-based Translation Technologies: Drawing Inspiration from Engineering Design Education to Learn about Sustainability
Esmée Bennison and Lynne Bowker

This paper reports on an exploratory, qualitative study of technology use in a Simulated Translation Bureau (STB) in a master’s programme in translation. The STB is a practice-oriented module in which students work on authentic translation projects and integrate a range of technologies, including AI-based tools, into their workflows. The paper addresses how students’ technology-related decision-making can be made visible and assessable, including students’ reflections on how they justify and verify the use of output-generating technologies, and how they describe perceived changes in their technology use over the course of the simulation. The study shows that the STB is an effective pedagogical model for teaching and assessing technology judgement in AI-enabled translation workflows, but also highlights the need to foreground broader ethical dimensions of AI literacy more explicitly in future iterations of the pedagogical design.

#156 COPECO-Speech: Multimodal Post-Editing with Speech and LLMs for Translation Teaching
Jeevanthi Liyana Pathirana, Pierrette Bouillon, Jonathan Mutal, Sabrina Girletti and Lise Volkart

In this demo, we present COPECO-Speech, a multimodal post-editing workbench designed for translator training with AI-based translation technologies. It extends an existing pedagogical post-editing platform (COPECO) by integrating speech input and Large Language Model (LLM)-assisted editing. The workbench has different post-editing modalities and helps teachers annotate student tasks using either a shared or personalized annotation scheme. The system logs all interactions—including keystrokes, speech input, editing actions and LLM operations—enabling detailed analysis of post-editing processes.

#159 Facilitating interaction-oriented AI literacy in translator training: A process-oriented approach
Erik Angelone

This paper proposes a process-oriented approach to facilitating the AI literacy of translation students in the domain of interaction with generative AI, using screen recording and think-aloud output. The approach outlines potential indicators of intelligence augmentation and impairment as documented in process protocols, offering a framework for understanding and developing interaction-oriented AI literacy in translator training.

16:15 - 17:15 Discussion panel: Lynne Bowker, Miquel Esplà-Gomis, Arda Tezcan, Pilar Sánchez-Gijón. Moderator: Dorothy Kenny
17:30 End of workshop
StyGenAI 2026
Location: MindLabs, Room: MLZ 1.28
09:00 - 09:05 Opening Notes
09:05 - 10:00 Keynote Speech: Dr Marzena Karpinska
10:00 - 10:30 Oral presentations - Style in Literary Translation
#185 Toying with Style: Can GenAI Mimic a Literary Translator's Voice?
Beniamin Sopot and Dorothy Kenny

This paper explores the potential of generative AI to produce stylistically-aware literary translations. Three prompting strategies (zero-shot, one-shot and few-shot) are used to attempt to elicit translations in the style of Polish literary translator Jan Rybicki from ChatGPT 5.4 and Claude Sonnet 4.6. Stylometric analysis shows that none of the prompts tested, regardless of the strategy, brought either LLM significantly closer to Rybicki's style.

10:30 - 11:00 Coffee break
11:00 - 12:00 Oral presentations - Style in Literary Translation (cont.)
#171 Emotion Profiling in LLM-Based Literary Translation: Systematic Shifts Across MT and Post-Editing
Antonio Castaldo, Johanna Monti and Sheila Castilho

This paper investigates whether LLM translations exhibit identifiable emotional profiles and how post-editing reshapes them toward human-like norms. LLM translations of Margaret Atwood's Oryx and Crake are compared with their post-edited versions and a human translation, using a large-scale corpus of contemporary Italian science-fiction as a baseline. Emotion is examined through lexicon-based and multilingual modeling, conducting a fine-grained analysis of emotional variation across systems. The findings show that MT systems introduce model-specific and statistically significant emotional fingerprints across translations, leading to a limited preservation of an author's voice.

#179 Retranslation at the intersection of style, machines, and plagiarism
Hüseyin Emir Akdağ, Yusuf Mert Aygün, Mehmet Şahin, Ena Hodzik, Sabri Gürses and Tunga Güngör

This paper investigates the stylistic distinguishability of literary retranslations within the evolving "translatiosphere". Focusing on the English-Turkish pair, a customized neural machine translation (NMT) model was trained using three distinct datasets of Alice in Wonderland: original human translations, commercial NMT translations, and generative AI translations. The methodology employs a sliding-window chunking strategy and 17 morpho-stylistic features tailored for Turkish as an agglutinative language. A Random Forest classifier achieved a Macro F1 score of 83.44%, revealing that human stylistic fingerprints are highly consistent and detectable with just 9 text chunks, whereas LLM-generated outputs exhibit significantly higher variance. Crucially, the model fine-tuned on human data failed to inherit contextual intelligence, instead synthesizing inputs into a homogenized, "statistically safe" signature marked by syntactic rigidity and literalism. These findings provide empirical evidence for identifying machine-generated "translationese" and carry significant implications for plagiarism, authorship, and originality in the digital era.

12:00 - 12:30 Oral presentations - Comparative studies on style NMT vs LLMs vs HT
#180 Lexical and Syntactic Diversity: Still Lost in Machine Translation?
Lise Volkart and Pierrette Bouillon

This paper investigates whether recent developments in machine translation (MT) have affected two well-documented stylistic characteristics of MT: reduced lexical diversity and increased source syntax mirroring. Using a corpus of English into French translations, MT outputs produced by a widely used system over time (2024 vs. 2026) and under different models (NMT vs. LLM-based) are compared. Lexical diversity is measured using word translation entropy (HTra), and source syntax mirroring using four syntactic similarity metrics. Results show that lexical and syntactic characteristics of NMT output remain stable over time. The LLM-based MT exhibits a slight increase in lexical diversity, but still presents strong signs of lexical overgeneralisation. Syntactic similarity to the source remains largely unchanged. Overall, the findings suggest that recent MT developments have a very limited impact on these two stylistic characteristics.

12:30 - 13:30 Lunch break
13:30 - 14:30 Keynote Speech: Prof Carl Vogel - Theory of Style in Language
14:30 - 15:00 Oral presentations - Comparative studies on style NMT vs LLMs vs HT (cont.)
#184 Tracing Style in English–Arabic Translation: A Stylometric Comparison of Human and Machine Outputs
Nooredeen Awwad, Ebtihal Enfes and Kolawole John Adebayo

This paper investigates stylistic variation in English–Arabic news translation across human and machine-generated outputs. A balanced dataset of 250 source texts is compiled and 1,000 translations generated using human translators, Google Cloud Translation, GPT-4o, and a Qwen3-8B-based open-source model. A stylometric analysis measuring structural and lexical properties, including entropy, lexical diversity, repetition, and length ratios, is performed. Results reveal consistent, system-specific stylistic tendencies. Machine-generated translations exhibit normalization effects, with differences in compression, repetition, and lexical diversity across systems and target languages. These findings demonstrate that modern translation systems produce distinct stylistic signatures, motivating evaluation beyond adequacy and fluency toward fine-grained stylistic and discourse-level measures.

15:00 - 15:30 Coffee break
15:30 - 17:00 Oral presentations - Comparative studies on style NMT vs LLMs vs HT (cont.)
#178 Style and Terminology in Commercial Flows: Mixing APE with Iterative Feedback
Marthe Lamote, Ewoenam Tokpo, Tom Vanallemeersch, Sara Szoc and Koen Van Winckel

MT systems have demonstrated strong capabilities in natural language translation, especially in the era of multilingual large language models, which exhibit impressive text understanding and generation capabilities across multiple languages. However, meeting specific stylistic and lexical requirements of end users remains challenging. This paper explores the effectiveness of an automated post-editing approach (APE) in a real-world MT production environment using LLMs and iterative human feedback, and examines current limitations and potential areas for improvement. An end-to-end methodology for APE is proposed, providing analytical and empirical insights into challenges of APE in commercial settings, primarily relying on client and expert feedback and analysis. The performance of LLM-based APE is analysed across five main stylistic dimensions, the terminology correction component of the pipeline is assessed, and it is demonstrated how iterative feedback through prompt refinement and glossary enhancement contributes to improving both stylistic and terminological consistency.

#183 Lexical Variation in English-Italian News Translation: A Comparative Study of Google Translate and ChatGPT
Aurora Trapella, Lieve Macken and Alessandra Molino

This study investigates stylistic features in English–Italian machine translation (MT) by comparing the outputs of Google Translate (GT) and ChatGPT-5.2 (GPT). The analysis adopts a quantitative corpus-based approach to studying news reports and opinion articles for which parallel translations produced by both systems were collected via API. In particular, the focus is on the lexical variety of the translation equivalents of three high-frequency mental verbs: know, think, and believe. Concordance lines were analyzed and annotated according to the equivalents used, enabling a systematic comparison of lexical range across systems and the two news genres. The results show that the two systems seem to adopt similar translation solutions, but they differ in the quantitative distribution of translation equivalents. While GT exhibits greater sensitivity to morphological variation, GPT tends to rely on a more restricted set of equivalents and a wider range of phrasal and clausal reformulations. These findings contribute to studies on machine translationese (MTese) for the English–Italian language pair, with implications for the study of variation in MT and Artificial Intelligence (AI) literacy.

#9 The Style of Machines: A Stylometric Study of LLM Generation and Translation
Natália Resende and Sheila Castilho

This paper investigates whether large language models (LLMs) develop a recognisable stylistic signature and whether such a signature persists across a text generation task and a translation task in different languages. Drawing on stylometric approaches traditionally used in authorship and translator attribution studies, we analyse outputs from ChatGPT-4o in English, Brazilian Portuguese, and Spanish, across original literary composition and translation conditions. The analysis focuses on five feature categories: lexical density, lexical richness, vocabulary sophistication, sentence length, and punctuation style. Our results show an LLM cross-linguistic profile characterised by higher lexical richness and density, rarer vocabulary, and a preference for the em-dash punctuation mark. This pattern appears in both generation and translation, though it is stronger in original text production.

17:00 - 17:15 Final remarks
GITT 2026
Location: MindLabs, Room: MLZ 1.27
09:00 - 09:10 Opening of the workshop
09:10 - 10:10 Keynote 1 - Odette Scharenborg - Inclusive speech technology: Developing automatic speech recognition for everyone
10:10 - 10:30 Oral Presentation
#164 Dutch audience perceptions of human and machine-translated pronouns for non-binary referents in subtitles
Joke Daems, Cynthia Van Hee and Alicia Van Muylem

This paper compares three pronoun translation strategies—two human-produced and one machine-translated—for rendering the English non-binary pronoun they into Dutch in an audiovisual context, using a Netflix fragment as stimulus material. A perception study with Dutch audiences was conducted to assess suitability. Results indicate that MT-generated translations were perceived as the least suitable strategy for representing non-binary referents in subtitles.

10:30 - 11:00 Coffee break
11:00 - 12:20 Oral Presentations
#165 Rethinking Gender Annotation for Bias Evaluation in Machine Translation: Can LLMs Improve Reliability?
Chiara Manna, Argentina Anna Rescigno and Eva Vanmassenhove

This paper compares the automated WinoMT annotation pipeline with an instruction-tuned LLM (Qwen3-8B) for annotating grammatical gender in English–Italian translations, as part of a broader rethinking of gender annotation methodology for bias evaluation in MT. Both approaches achieve similar agreement with a human gold standard but exhibit distinct systematic weaknesses. The findings highlight methodological limitations in using LLMs as objective annotators for gender evaluation in MT.

#167 From Binary Defaults to Contextual Bias: Translating Queer Morphology with NMT and LLMs
Manuel Lardelli

This paper evaluates how Neural Machine Translation (NMT) and Large Language Models (LLMs) process non-binary morphology when translating German literary fiction into Italian. 12 NMT and 16 LLM translations from a human-in-the-loop experiment were analyzed, applying an inductive, mixed-methods framework. Findings indicate clear differences between systems. NMT defaults to standard binary grammar but applies it inconsistently, often flipping between masculine and feminine forms for the same subject across different sentences, which effectively erases queer visibility. Conversely, LLMs actively attempt gender-fair language via neutralization and neomorphemes (e.g. the schwa). However, LLMs introduce new systematic errors: driven by semantic cues, they exhibit a contextual bias that frequently led to over-feminization in the present study, and their attempts to create inclusive word endings result in structurally invalid words, especially when translating clitic pronouns. Ultimately, these findings expose current limitations and provide preliminary empirical guidance to assist post-editors in navigating the complex challenges of gender-fair translation.

#168 Reasoning About Gender: How Source Text Strategies Impact Italian-to-German Machine Translation Beyond the Binary
Paolo Di Natale, Laura Schlutter, Elena Chiocchetti and Marlies Alber

This paper investigates how gender-fair strategies in source texts influence the production of non-binary translations in the Italian to German combination. A controlled test set featuring binary and non-binary approaches is used to assess their effectiveness for non-binary renderings in the target language. An automatic evaluation framework is introduced that classifies target sentences into four categories: non-binary, binary-gendered, single-gendered, and incoherent. Relying on human analysis, the paper compares Reasoning LLMs against standard inference, examining whether reasoning improves translation quality and automatic evaluation. Results show that reasoning models are more successful in shifting from binary to non-binary formulations and in handling linguistic challenges such as epicene terms and special characters, although there are no improvements in sentence-level consistency and evaluation accuracy. A qualitative analysis of German translations shows that reasoning encourages the reformulation of source-side strategies through neutralization, visibility strategies, and paraphrasing, resulting in more natural target texts.

#173 Can Emotions Signal Gender? Investigating Implicit Cues in Human and LLM Translations of Amazon Reviews
Shushen Manakhimova and Ekaterina Lapshinova-Koltunski

This study investigates whether source-text emotion is associated with grammatical gender choices in English-Russian translation when the author's gender is not specified. Building on prior work, first-person grammatical gender in Amazon review translations produced by professional translators, translation students, and three LLMs (GPT-4, Llama, and Mistral) is analysed. Emotion labels are assigned to the English source reviews and tested for association with masculine or feminine forms in the Russian translations. Reviews labelled as expressing love are more likely to be translated with feminine grammatical gender by both human translator groups, with small-to-moderate effect sizes. This pattern is absent in GPT-4 and Mistral, but appears in Llama. Other frequent emotion labels do not show the same positive association. The findings are treated as exploratory evidence that specific source-text emotions may be associated with gendered translation choices, while emphasizing the need for larger-scale validation across additional emotions, genres, and language pairs.

12:20 - 12:30 Boasters (1 min each)
#153 The past and future of Fairslator
Michal Měchura

Fairslator is a web-based tool for detecting and correcting bias in machine translation. Started in 2022 as a personal project, Fairslator has recently (2026) received backing from the University of Vienna, where it is going to be redeveloped into an open-source, community-contributed tool for rewriting and computer-assisted post-editing of machine translation. This contribution introduces the plan for that redevelopment.

#152 ReGender: Gender-fair Rewriter for English-to-Greek Machine Translation
Eleni Gkovedarou, Luna De Bruyne and Joke Daems

The use of gender-fair language (GFL) can lead to a more inclusive society, yet machine translation (MT) systems frequently reproduce and amplify gender bias. Some of this bias is due to inherent ambiguities in the source: English largely lacks grammatical gender marking, whereas Greek requires morphological and semantic gender specifications, forcing MT systems to resolve ambiguity in ways that default to gendered (and often biased) outputs. This research explores gender-fair rewriting as a strategy for bias mitigation for English-to-Greek MT, a language pair that remains highly understudied. We propose a twofold approach: ReGender, a system that first detects gender ambiguity in the English source text and then generates a set of gender-fair Greek translations for the ambiguous cases, including gendered, gender-neutral, and gender-inclusive forms. Through a human-centered design, the project combines NLP methods with community-informed GFL practices that go beyond the gender binary. The resulting model will take the form of a plug-in that can be integrated into existing MT systems, enabling users to make informed translation choices while promoting the broader goal of inclusive language technologies.

As a first work package, we conducted a survey on GFL practices in Greek, establishing a community-informed foundation for the system's design. The survey received strong participation, providing valuable insights into attitudes and preferences towards GFL, as well as the acceptability of specific forms (e.g. neologisms, disjunctive forms, emerging alternatives etc.). At the poster, we will present the results from this survey and we invite discussion on how these findings can best inform the development of the ReGender model (including decisions on what forms to generate, how to present them to users, and how to handle ongoing variation in Greek GFL practices).

#174 Inclupédie
Sophie Hennuy

Inclupédie (www.inclupedie.eu) is an online tool in French designed to facilitate the writing of inclusive texts. First and foremost, it features a dictionary of “inclunymes”. An inclunyme is an inclusive synonym. This neologism was created specifically for the Inclupédie. It is a term that does not carry a gender marker and enables inclusive writing, also known as “gender-neutral language”. The dictionary of inclunymes offers solutions without an interpunct, in order to remove as many barriers to inclusive writing as possible. The principle of the dictionary is simple: enter a word and the dictionary suggests inclusive and creative alternatives. What makes the Inclupédie unique is that it was designed entirely manually in MultiTerm and then converted into an online database. The dictionary is a fee-based service. However, a free trial account can be created for anyone wishing to test the tool...

#177 Gender bias in the webcare domain
Marie Dewulf

Gender bias in machine translation (MT) and large language models (LLMs) has been the focus of much research, but datasets are often created artificially and so fail to reflect real-world domains where human referents are implicitly or explicitly encoded in discourse. In this paper, we present a novel study of gender bias in translation within the domain of online hotel reviews. We introduce a curated corpus of English hotel reviews paired with machine-generated translations into French, covering both gender-ambiguous and gender-unambiguous human referents.
Our dataset is...

#162 How masculine is your product? Analysing gender bias in translated reviews
Maja Popovic and Ekaterina Lapshinova-Koltunski

This paper analyses gender bias in machine translation when translating singular first-person forms from English into Russian and Serbian, using the GENDER1PERSON test suite consisting of 1,000 Amazon product reviews across 10 product categories. The results show that the majority of WMT-2025 systems prefer the masculine writer's gender. No system is biased towards the feminine variant. The choice of writer's gender depends largely on the product category.

#163 Mind the Inclusivity Gap: Multilingual Gender-Neutral Translation Evaluation with MGENTE
Beatrice Savoldi, Giuseppe Attanasio, Eleonora Cupin, Eleni Gkovedarou, Janiça Hackenbuchner, Anne Lauscher, Matteo Negri, Andrea Piergentili, Manjinder Thind and Luisa Bentivogli

Avoiding the propagation of undue (binary) gender inferences and default masculine language remains a key challenge towards inclusive multilingual technologies, particularly when translating into languages with extensive gendered morphology. Gender-neutral translation (GNT) represents a linguistic strategy towards fairer communication across languages. However, research on GNT is limited to a few resources and language pairs. To address this gap, we introduce mGeNTE, an expert-curated resource, and use it to conduct the first systematic multilingual evaluation of inclusive translation with state-of-the-art instruction-following language models (LMs). Experiments on en-es/de/it/el reveal that while models can recognize when neutrality is appropriate, they cannot consistently produce neutral translations, limiting their usability. To probe this behavior, we enrich our evaluation with interpretability analyses that identify task-relevant features and offer initial insights into the internal dynamics of LM-based GNT. Dataset: datasets/FBK-MT/mGeNTE

#170 Yeswa-Stories: A Three-Way Parallel Dataset of Female African Figures in Low-Web Data Languages
Bethelhem Mamo and Hellina Nigatu

Language technologies used in everyday settings such as machine translation systems risk perpetuating societal bias. Prior work in creating benchmarks for gender bias in machine translation systems focuses primarily on high-resourced language pairs or a low-resourced language paired with a high-resource language, uses template-based benchmarks that usually focus on occupational biases and stereotypes, and translates high-resource benchmarks which may lack cultural significance to low-resourced languages. This paper introduces Yeswa-Stories, a three-way parallel dataset comprising 1,300 aligned sentences in Amharic, Afaan Oromo, and Tigrinya. The dataset was constructed by collecting English sentences from Wikipedia articles about notable African women and translating them into the three target languages using human translators, augmented with locally sourced content reflecting the cultural context where the languages are spoken. The dataset contributes a new resource for studying gender-inclusive translation in low-resourced settings.

#175 A Chinese Challenge Set to Assess Gender Bias in Automated Translation
Xiaolan Xu, Sara Mendes and Yu-Yin Hsu

Gender bias in automated translation (AT) is well-documented for English-source language pairs, while source languages generally lacking grammatical gender remain largely understudied. This paper addresses the Chinese–Portuguese direction, a language pair that has received little attention in this context. A Chinese challenge set of 495 sentences was developed, constructed from 45 occupations across an eleven-sentence template matrix, systematically varying the type and syntactic position of gender cues: no cue, explicit prenominal modifiers, and coreferential pronouns varying in position and syntactic complexity. Two commercial neural machine translation (NMT) systems and six large language models (LLMs) were tested with this challenge set. Results show a clear hierarchy of cue effectiveness: explicit prenominal modifiers yield universal 100% accuracy; in the absence of gender cues, models predominantly default to masculine forms; and coreferential pronouns in complex sentences reveal a pronounced masculine–feminine asymmetry. LLMs also demonstrate more symmetric gender cue processing than NMT systems.

#172 WinoTR: Evaluating Gender Bias in Machine Translation from a Gender-Neutral Language Using Causal Inference
Deniz Albayrak

This paper presents WinoTR, a Turkish adaptation of the WinoMT challenge dataset (Stanovsky et al., 2019). While WinoMT has been widely studied across multiple languages, its adaptation to Turkish—a morphologically rich language with no grammatical gender—and its analysis through a causal lens remain unexplored. Using 4,752 sentences adapted from the original dataset across pro-stereotypical, anti-stereotypical, and neutral conditions, the paper applies Double Machine Learning (DML) to estimate the causal effect of gender cues on stereotype-consistent translation output. Results reveal a striking asymmetry: cue direction has a large and statistically significant effect on translation outcomes, while cue presence alone produces virtually no effect. Even without any gender signal, MT systems default to stereotype-consistent translations in 62.9% of cases across three systems (DeepL, Google Translate, OpenAI). The causal analysis reveals that gender bias in contemporary MT and LLM-based translation systems runs deeper than surface-level cue processing, persisting as an embedded prior independent of any explicit gender signal in the input.

12:30 - 13:30 Lunch break
13:30 - 14:30 Keynote 2 - Katta Spiel - Dynamic Concepts - Static Machines
14:30 - 15:00 Poster Session - All 9 posters from above: part 1
15:00 - 15:30 Coffee break
15:30 - 16:00 Poster Session - All 9 posters from above: part 2
16:00 - 17:00 Keynote 3: Sofie Decock - Beyond the binary: Experimental insights into gender-neutral pronouns and inclusive interpreting
17:00 End of workshop
Tutorial 1: Tutorial: integrating free NMT and LLMs into CAT tools with MTUOC
Sergi Alvarez-Vidal and Antoni Oliver
Location: MindLabs, Room: ML 1.04
09:00 - 10:30 Part 1
10:30 - 11:00 Coffee break
11:00 - 12:30 Part 2
Tutorial 2: Tutorial on human evaluation of translation and multilingual tasks
Vilém Zouhar, Maike Züfle and Dominik Macháček
Location: MindLabs, Room: ML 1.03
09:00 - 10:30 Part 1
10:30 - 11:00 Coffee break
11:00 - 12:30 Part 2
Tutorial 3: Translation evaluation tools for everyone: a hands-on tutorial for freelancers and small LSPs
Yuri Balashov
Location: MindLabs, Room: ML 1.03
13:30 - 15:00 Part 1
15:00 - 15:30 Coffee break
15:30 - 17:00 Part 2
EAMT 2026 Social Program
17:30 - 20:00 Welcome reception, Location: MindLabs, Foyer

16 June 2026 - Day 1

08:00 - 09:00 Registration
09:00 - 09:30 Conference Opening
09:30 - 10:45 ORAL SESSION 1: Research - Technical / Domain-Aware and Low-Resource LLMs for MT Session chair: Ayla Rigouts Terryn
#45 Multilingual Communication in the Asylum Context: Evaluating LLM-Based Machine Translation with Fuzzy Match Augmentation and Adaptive NMT across Resource Conditions under Low-Data Constraints Research · Technical
Thomas Moerman, Arda Tezcan and Lieve Macken

Effective communication in asylum reception settings requires reliable machine translation (MT) across many languages, including low-resource ones. Using data from the MaTIAS project (Machine Translation to Inform Asylum Seekers), we compare retrieval-augmented LLM translation with adaptive Neural MT across 14 target languages with varying resource levels. Working with a very small translation memory of only 358 sentences, we evaluate fuzzy-match (FM) augmentation as an in-context learning strategy for open-source and commercial LLMs and benchmark these against ModernMT with and without domain adaptation. In the LLM setting, FM-based example selection consistently outperforms random selection and zero-shot prompting, with the largest gains for low-resource languages. Adaptive NMT retains an overall advantage, although Gemini Pro approaches its performance and outperforms it on 6 of 14 languages, highlighting a trade-off between translation quality and data sovereignty in privacy-sensitive contexts. These findings show that FM augmentation remains effective under severe data constraints and emphasise the importance of language-specific evaluation in multilingual MT.

#90 Why do Large Language Models Fail in Low-resource Translation? Unraveling the Token Dynamics of Large Language Models for Machine Translation Research · Technical
Shenbin Qian and Yves Scherrer

Large Language Models (LLMs) have recently demonstrated strong performance in machine translation (MT). However, most prior work focuses on improving or benchmarking translation quality, offering limited insight into when and why LLM-based translation fails. In this work, we systematically analyze failure modes of LLMs in MT by evaluating 15 models, including four reasoning LLMs, across 22 language pairs (LPs) with varying resource levels. We find that non-English-centric LPs consistently yield lower COMET scores than English-centric pairs. To investigate the underlying causes, we introduce Token Activation Rate (TAR), a metric that captures how effectively a model utilizes language-specific tokens in its vocabulary during generation. We validate TAR as a proxy for language representation using models with known language distributions in the training data, and show that lower TAR is strongly associated with poorer translation performance. Furthermore, reasoning LLMs tend to generate more tokens when translating into low-TAR languages, suggesting a compensatory mechanism, although its impact on translation quality varies across models. Overall, our findings emphasize the importance of token-level dynamics in understanding MT performance of LLMs.

#98 Translating Under Pressure: Domain-Aware LLMs for Crisis Communication Research · Technical
Antonio Castaldo, Maria Carmen Staiano, Johanna Monti, Sheila Castilho and Francesca Chiusaroli

Timely and reliable multilingual communication is critical during natural and human-induced disasters, but developing effective solutions for crisis communication is limited by the scarcity of curated parallel data. We propose a domain-adaptive pipeline that expands a small reference corpus by retrieving and filtering data from general corpora. We use the resulting dataset to fine-tune a small language model for crisis-domain translation and then apply preference optimization to bias outputs toward CEFR A2-level English. Automatic and human evaluations show that this approach improves readability while maintaining strong adequacy. Our results indicate that simplified English, combined with domain adaptation, can function as a practical lingua franca for emergency communication when full multilingual coverage is not feasible.

10:45 - 11:15 Coffee Break
11:15 - 12:15 KEYNOTE 1: Dr. Rachel Bawden Large Language Models and Machine Translation: From Low-Resource to Unseen Languages

Speaker Bio

Dr. Rachel Bawden

ALMAnaCH project-team, Inria Paris, France

Currently a fellow at PR[AI]RIE-PSAI research institution

Dr. Rachel Bawden is a researcher in the ALMAnaCH project-team at Inria Paris, France. She is a specialist in Machine Translation (MT), having worked on contextual MT during her PhD at the LIMSI laboratory in France and MT for low-resource languages in her post-doc at the University of Edinburgh. She is currently working on a range of topics in MT and multilingual NLP, focusing mainly on language variation, both for historical and contemporary texts (for example user-generated content, dialectal variation), evaluation and resource creation. She is currently a fellow in the PR[AI]RIE-PSAI research institution.

Abstract

Large language models (LLMs) have been offering new approaches to machine translation (MT). Much of today's research involves trying to tease out the underlying knowledge of the LLMs to improve translation quality, especially in scenarios where standard prompting does not lead to good results. For many of the world's low-resource languages, LLMs have not been the magic solution for translation, with new problems arising such as failure to translate in the right language and uncontrolled hallucination, and there remain significant challenges.

In this talk, I will be discussing several research directions in low-resource MT with LLMs that I recently published with colleagues. These include (i) the decomposition of sentences into simpler components to aid the search for useful few-shot examples, (ii) the creation of high quality synthetic parallel data for under-resourced languages and finally (iii) the explicit learning of translation from grammar descriptions, tested with encrypted and therefore unseen languages.

12:15 - 13:15 POSTER BOASTER SESSION 1: Research - Technical (11); Research - T&U (7); Implementation & Case Studies (4) (22 posters) Session chair: Chiara Manna
13:15 - 14:15 Lunch
14:15 - 15:15 POSTER SESSION 1: Research - Technical (11); Research - T&U (7); Implementation & Case Studies (4) (22 posters)
#132 Diversity and Homogenisation in Generative AI Translation: A Comparative Study of English-Dutch Translation Across Domains Research · Technical
Dimitar Shterionov, Noa van Helleman and Eva Vanmassenhove

Generative AI tools, such as ChatGPT, are applied to a wide range of language-related tasks, including translation. Despite their current popularity among users and researchers and the impressive results obtained on several benchmarks (Kocmi et al., 2024; Deutsch et al., 2025), their potential side-effects on languages and translations are still understudied (Vanmassenhove, 2025). The paradigm shift from Machine Translation (MT) to Generative AI Translation (GAIT) likely calls for a reconsideration of our assessment and evaluation metrics and practices. In this work, we focus on GAIT by analyzing translations from four multilingual large language models (MLLMs), namely mBART, Jamba1.5-large, GPT-4o and DeepSeek R1, applied to three different domains (news, literature and poetry) for the English-Dutch language pair. Focusing on metrics related to lexical and textual diversity, we find that while GAIT text for literature is of significantly high lexical and grammatical richness, that is not the case for news and poetry. We also assess the homogeneity of AI-generated text through a set of clustering and classification experiments. In addition to a clear separation between human- and AI-generated content, our results indicate that GAIT output is more homogeneous among MLLMs.

#101 LLM-as-a-Jury for Machine Translation Publishability Assessment Research · Technical
Alex Yanishevsky and Olivia Norris

We propose an LLM-as-a-Jury framework for determining machine translation publishability, aggregating judgments from multiple large language models via logistic regression rather than relying on a single judge. Publishability is defined as the absence of major or critical errors: those that render a translation unsuitable for public release without human post-editing. We compare three evaluation frameworks: a generic Edit Effort Estimation (EEE) prompt based on lexical accuracy, grammatical correctness and semantic coherence, a generic Linguistic Quality Assurance (LQA) prompt based on the MQM error taxonomy, and a purpose-built Publishability prompt optimized via DSPy and augmented with domain-specific fine-tuning. Experiments across three domains and nine language pairs show that (i) the jury ensemble matches or outperforms the best individual juror in nearly every condition, (ii) EEE and LQA juries are competitive with and occasionally exceed the Publishability jury on macro-F1, (iii) the Publishability framework offers stronger precision and a more favorable error correction asymmetry, and (iv) domain-specific fine-tuning yields substantial recall gains in client-heavy domains. These results support the viability of fully automated publishability determination in enterprise MT workflows.

#84 To Write or to Automate Linguistic Prompts, That Is the Question Research · Technical
Marina Sánchez-Torrón, Daria Akselrod and Jason Rauchwerk

LLM performance is highly sensitive to prompt design, yet whether automatic prompt optimization can replace expert prompt engineering in linguistic tasks remains unexplored. We present the first systematic comparison of hand-crafted zero-shot expert prompts, base DSPy signatures, and GEPA-optimized DSPy signatures across terminology insertion, translation and language quality assessment, evaluating five model configurations. Results are task-dependent. For terminology insertion and translation, GEPA-optimized prompts are competitive with expert prompts: most differences are not statistically significant, and optimization significantly improves glossary term match rates for several models. In language quality assessment, expert prompts achieve stronger error detection while optimization improves characterization. Across tasks, GEPA elevates minimal DSPy signatures, often closing the gap to expert performance. We note that the comparison is asymmetric: GEPA optimization searches programmatically over gold-standard splits, whereas expert prompts require in principle no labeled data, relying instead on domain expertise and iterative refinement.

#92 CompactQE: Interpretable Translation Quality Estimation via Small Open-Weight LLMs Research · Technical
Kamil Guttmann, Zofia Fraś, Artur Nowakowski and Krzysztof Jassem

Current state-of-the-art Quality Estimation (QE) in machine translation relies on massive, proprietary LLMs, raising data privacy concerns. We demonstrate that smaller, open-source LLMs (<30B parameters) are a viable, cost-effective and privacy-preserving alternative. Using a single-pass prompting strategy, our models simultaneously generate quality scores, MQM error annotations, suggested error corrections, and full post-editions. Our analysis shows these models achieve highly competitive system-level correlations with human judgments that outperform traditional neural metrics, fine-tuned models, and human inter-annotator agreement, effectively approximating the capabilities of much larger proprietary LLMs.

#60 SinMix2Mono: A Dataset for Code-mixed Romanized Sinhala Translation and Transliteration Research · Technical
Rukshan Dias, Deshan Sumanathilaka, Archchana Sindhujan and Minidu Nimna

Code-mixed and Romanized texts are widely used in digital content, yet they remain largely underexplored for many low-resource languages, including Sinhala. The scarcity of high-quality parallel data has limited progress on downstream tasks, such as machine translation and transliteration. We introduce SinMix2Mono, the largest manually annotated parallel training dataset, followed by the first gold standard benchmark and code-mixed transliteration ambiguity corpora for code-mixed Romanized Sinhala to Sinhala conversion. The dataset comprises approximately 25,000 real-world sentences collected from social media, covering diverse domains and authentic code-mixing patterns. To ensure high-quality translations, we used an annotation pipeline that combined rule-based transliteration, LLM-assisted translation, and human validation. The golden test dataset, which includes 2549 sentences, and the code-mixed transliteration ambiguity test were validated by three annotators, yielding Gwet’s AC1 scores of 0.7465 and 0.7068, respectively. We benchmarked nine systems, including statistical, neural and commercial LLMs. SinMix2Mono provides a robust training and evaluation resource, establishing a strong benchmark for future research on Sinhala code-mixed translation and transliteration.

#44 Alignment Quality Degradation Across the Parallel–Comparable Spectrum: A Comparative Analysis Research · Technical
Audrey Mash, Jonathan Ayebakuro Orama, Marc Juvillà Garcia and Maite Melero

Sentence-level alignment systems have been developed and evaluated primarily on parallel data, leaving their behaviour across the broader parallel–comparable spectrum of real web content poorly understood. We present a stratified empirical study of alignment quality for Catalan–English using 300 document pairs across three parallelism bands defined by mean-max Language-agnostic BERT Sentence Embedding (LaBSE) cosine similarity. We compare four systems: a hierarchical alignment pipeline (DocAlign), an ablation with paragraph pre-filtering disabled (DocAlign-NoFilter), the flat aligner Vecalign, and a flat LaBSE greedy baseline. Evaluation uses human-annotated sentence pairs and coverage-weighted quality. Quality degrades at different rates by system type: hierarchical systems maintain usable-pair rates ranging from 25% to 51% on comparable data while flat systems collapse to 2–7%. Paragraph pre-filtering reduces output volume on comparable data while raising pair quality relative to the unfiltered ablation. Vecalign is statistically indistinguishable from the greedy baseline at all parallelism levels, consistent with the hypothesis that LaBSE embedding discrimination is the binding constraint on flat alignment quality. Failure mode analysis of 550 low-rated pairs identifies topical mismatch as the dominant failure mode, with structural noise concentrated in flat systems.

#111 ForMaT: Dataset for Visually-Grounded Multilingual PDF Translation Research · Technical
Michał Ciesiółka, Dawid Wiśniewski, Adrian Charkiewicz and Kamil Guttmann

We present ForMaT (Format-Preserving Multilingual Translation), a parallel corpus of 3,956 PDFs across 15 language pairs that preserves original layout metadata proposed for multimodal machine translation. To ensure structural diversity in the dataset, we employ K-Medoids sampling over 45 geometric features, capturing complex elements like nested tables and formulas to focus only on visually diverse PDF documents. Our evaluation reveals that current MT systems struggle with spatial grounding and geometric synchronization, often losing the link between text and its visual context. ForMaT provides a benchmark for developing layout-aware translation models that integrate visual and textual context for high-fidelity document reconstruction.

#103 The Challenge of Finding Robust and Efficient Strategies for Training Machine Translation Models with Noisy Data Research · Technical
Mikko Aulamo, Sami Virpioja, Yves Scherrer and Jörg Tiedemann

Most machine translation datasets come with a certain level of noise, and strategies for handling such data need to be robust and efficient. Data selection and filtering are challenging and may depend on expensive language-specific tools that are not necessarily available, especially for low-resource languages. This paper looks at training strategies that combine cheap heuristic filters with curriculum learning to implement iterative procedures that robustly operate on raw noisy data without expensive prior preprocessing and data selection. The intuition is that we can cluster data into buckets with varying noise levels and use different sets of buckets at different stages of MT model training. We test various strategies and compare them to pre-filtering approaches for a diverse set of low-resource languages and conclude that curriculum learning can improve robustness but does not necessarily lead to improved translation performance. Overall, the experiments demonstrate the importance of proper experimental workflows, which cannot easily generalize from one language pair and scenario to another.

#41 Mitigating Gender Bias in English-Ukrainian Machine Translation Models Research · Technical
Pavels Ivanovs, Gina Welsh and Irini Selenica

This study investigates the presence and mitigation of gender bias in English-Ukrainian machine translation (MT). We evaluated two gender bias mitigation methods (gender tagging and Lapa-LLM correction) on two pre-trained MT models (OPUS-MT and mBART) using a curated dataset of occupation-based sentences. Our results showed that both zero-shot models exhibited gender bias transfer, particularly for male-stereotyped occupations with female source gender. Gender tagging produced mixed results by over-assigning masculine forms to female-source sentences. LLM correction achieved the strongest mitigation, recovering female-labelled output to near source proportions. However, full morphological gender agreement remained a challenge for the LLM correction method. Our framework, which uses Ukrainian’s overt gender morphology as a bias signal, is adaptable to other grammatically gendered languages.

#138 IndicDISCO-MT: A Discourse-Centric Benchmark for Evaluating Discourse Phenomena in Indian Language Machine Translation Research · Technical
Heli Hingrajiya, Vennela Bairi, Vandan Mujadia, Dipti Sharma, Parameswari Krishnamurthy and Vasudeva Varma

Discourse-level translation remains a challenge for machine translation (MT) systems, particularly for translation from Indian languages to English. This challenge is amplified in Indian languages due to rich morphology and discourse complexity. Existing evaluation benchmarks prioritize sentence-level translation quality and fail to capture important discourse phenomena such as pronoun resolution and lexical cohesion. To address this gap, we introduce IndicDISCO-MT, a parallel benchmark dataset covering translations from eight Indian languages, namely Bengali, Gujarati, Hindi, Marathi, Kannada, Tamil, Telugu and Urdu, to English. On top of this corpus, we introduce DiscoAlign, a set of human-annotated source-to-target alignment pairs for discourse evaluation. We further propose two evaluation benchmarks, ProAlign and LexiAlign, to evaluate the ability of large language models (LLMs) and MT systems to handle personal pronouns and lexical cohesion. Our evaluation of recent LLM and MT systems shows that although models achieve high translation quality, they still struggle to accurately preserve discourse-level phenomena. The proposed benchmarks provide a systematic framework for evaluating discourse-aware translation and can facilitate the development of more coherent and contextually consistent translations. The dataset is made publicly available.

#54 Multi-Agent Debate for Machine Translation: A Case Study on English-Japanese Translation Research · Technical
Zhan Shen, Jason Naradowsky, Xiaotian Wang and Yusuke Miyao

As machine translation increasingly requires deeper contextual, linguistic, and cultural understanding, multi-agent collaboration has emerged as a promising approach. Multi-agent debate (MAD) frameworks, in which multiple agents deliberate to produce a final output, have shown strong performance on objective tasks, but remain underexplored in translation, where multiple valid renderings often exist. We adapt three MAD frameworks for English-Japanese translation and evaluate them against strong generative baselines, reasoning-capable LLMs, and a prompt-based self-reflection method. Across general-domain and culturally grounded datasets, the Society of Mind (SoM) variant yields the strongest results in the English-to-Japanese direction, showing that zero-shot translations leave substantial room for improvement through structured deliberation. Yet the gains of debate are front-loaded: later rounds do not reliably improve quality and often reintroduce translation errors. Diagnostic and error-span analyses show that hand-designed debate protocols tend to over-revise already strong translations, leading to semantic drift and process-induced degradation. These findings highlight both the promise and the limitations of debate-based agentic translation, and suggest that effective iterative improvement requires mechanisms for preserving high-quality intermediate translations while limiting unnecessary revisions.

#76 Fuzzy Matching and Sentence Embeddings for Few-shot Machine Translation with Large Language Models Research · T&U
Miguel Angel Rios Gaona, Claudia Plieseis, Dragos Ciobanu and Alina Secara

In-context learning is a method for improving machine translation in Large Language Models, but its performance is sensitive to the quality of the few-shot example selection. Current retrieval strategies use semantic similarity by computing sentence embeddings, and these methods often require significant computational overhead and specialised expertise. We evaluate the impact of retrieval strategies on translation performance in a specialised domain, comparing traditional, token-based fuzzy matching against semantic sentence embeddings. We use a medical corpus from the European Medicines Agency (EMEA) for the English-Romanian and English-German language pairs, and we evaluate translation quality with automatic metrics and manual evaluation. Our results show that 1-shot and 5-shot prompting significantly outperform the 0-shot baselines for quality in automatic evaluations for both language pairs, and in manual evaluation for English-German. For the English-Romanian pair, the average scores of the manual evaluation for both quality and ranking follow the same trend, but statistical significance is not consistently reached for all few-shot prompting configurations. In general, token-based fuzzy matching overwhelmingly achieves higher automatic quality scores than embedding-based retrieval.

#10 Can professional translators identify machine-generated text? Research · T&U
Michael Farrell

This study investigates whether professional translators without prior specialized training can reliably identify short stories generated in Italian by artificial intelligence (AI). Sixty-nine translators took part in an in-person experiment, where they assessed three anonymized short stories — two written by ChatGPT-4o and one by a human author. For each story, participants rated the likelihood of AI authorship and provided justifications for their choices. While average results were inconclusive, a statistically significant subset (16.2%) successfully distinguished the synthetic texts from the human text, suggesting that their judgements were informed by analytical skill rather than chance. However, a nearly equal number misclassified the texts in the opposite direction, often relying on subjective impressions rather than objective markers, possibly reflecting a reader preference for AI-generated texts. Low burstiness and narrative contradiction emerged as the most reliable indicators of synthetic authorship, with unexpected calques, semantic loans and syntactic transfer from English also reported. In contrast, features such as grammatical accuracy and emotional tone frequently led to misclassification. These findings raise questions about the role and scope of synthetic-text editing in professional contexts.

#74 “All those in favour will please say yea”: Understanding the Factors Behind Machine Translation Adoption at the Canadian Parliament Research · T&U
Jeniffer Leal-Wyss, Delaney Lothian, Gabriel Bernier-Colborne, Rebecca Knowles and Michel Simard

Translators at the Canadian Parliament currently have access to a neural machine translation system as an optional tool integrated into their translation environment, whose output they can use for post-editing (rather than translating from scratch); this provides a valuable opportunity to study the dynamics of machine translation adoption in professional settings. We report on a user study that investigates how and why translators choose to interact with this tool. Using a mixed-methods approach, we examined both human and technical factors that influence the adoption or non-adoption of the system. Drawing on our findings, we advocate for a user-centred approach to MT integration within professional translation workflows.

#47 Translation Analytics for Freelancers II: Benchmarking Local LLMs for Confidential Translation Workflows Research · T&U
Yuri Balashov, Rex Vanhorn, Austin Downes and Mingxi Xu

Building on our previous work, this paper develops practical, low-barrier methods for freelance translators and smaller language service providers to evaluate translation technologies using rigorous yet accessible analytic methods. Here we address a high-stakes, specialized need: offline translation for confidentiality-sensitive domains in which privacy constraints preclude the use of cloud-based engines and commercial LLMs. We expand the Reeve Foundation Trilingual Corpus (RFTC) used in our previous work into a multilingual corpus (RFMC) by adding sentence-aligned German and Simplified Chinese reference translations. We then benchmark several locally runnable language models (via Ollama) across four language directions on 1000+ sentences selected from this corpus. We use consistent single-prompt calls without finetuning or domain adaptation, comparing local LLM outputs against commercial NMTs (DeepL, Baidu), a frontier LLM (GPT-5.2), and professional-grade local NMT systems (OPUS-CAT, NeuralDesktop, Promt). Automatic evaluation is conducted with MATEO. Results reveal substantial variation in local LLM performance across language directions and model sizes. The best local LLMs match or surpass local NMT systems and a frontier LLM, though they remain behind top commercial NMTs. These findings underscore the viability of carefully selected local LLM translation for privacy-constrained professionals and inform future research on model scaling and multilingual capability.

#131 Reaching multilingual communities: a survey mapping MT use in the West Midlands (UK) third and public sector organisations Research · T&U
David Orrego-Carmona, Priyanki Ghosh and Susana Valdez

Machine Translation (MT) has become a default language access tool in public and third sector organisations serving multilingual communities. However, how organisations and their staff actually use it, and their opinions about it, remain largely undocumented. This paper reports findings from a questionnaire study conducted between March 2024 and April 2025 with charities, NGOs, community organisations and local government authorities in the West Midlands (UK), one of the most linguistically diverse areas of the country. The results indicate that MT use is widespread, informal, and driven by necessity rather than informed decisions or policies. Google Translate is the preferred tool; policies about MT use are rare, and confidence in translation quality is limited. Risk perception varies across the sector: local government respondents identify the widest range of concerns, including legal and medical, while third-sector organisations suggest a pragmatic approach. However, greater risk-awareness does not lead to greater governance, pointing to a gap between individual MT literacy and institutional accountability. Based on this, we propose some recommendations for how organisations serving multilingual communities should approach MT implementation and training.

#18 The Potential of Large Language Models for Translating Tourism Promotional Texts: A Mixed-methods Study Research · T&U
Raghad Alsulami

This paper reports on a small-scale pilot study examining the potential of a large language model (LLM) for translating tourism promotional texts (TPTs), in comparison with a conventional neural machine translation (NMT) system, from English into Arabic. Four professional translators participated in a post-editing experiment followed by cue-based retrospective interviews. The post-editing task aimed to provide empirical evidence of the effort involved in working with TPTs, while the interviews sought to capture participants’ judgments and evaluations of the outputs. Overall, most participants exerted less effort post-editing LLM-generated outputs to a publishable standard compared to NMT outputs. They perceived the LLM outputs to be more creative, with creativity manifested through non-literal translations and aesthetic augmentation, while also noting that the outputs were unpredictable and far from perfect; in contrast, the NMT outputs were generally viewed as more informative yet lacking the promotional appeal needed for TPTs. The paper concludes with implications and directions for future research.

#83 Meaning-Making Process and Error Dynamics in ChatGPT-Mediated Translation Research · T&U
María-José Varela Salinas and Iulia Mihalache

This study examines errors in a ChatGPT-mediated translation of a German economic text on inflation into Spanish, post-edited by 20 translation students. The German–Spanish language pair represents an under-studied combination in post-editing research, where English-pivot pairs predominate. The analysis classifies 132 annotated instances by error origin (ChatGPT-generated versus student-introduced during post-editing) and by linguistic category. Results show that terminology is the highest-risk domain across the entire workflow (34.1%), followed by tense/aspect (15.2%) and style (13.6%). ChatGPT-related errors account for 50.8% of all instances, while student-introduced errors through over-editing represent 21.2%. A further 28.0% are preferential changes: acceptable reformulations of segments already adequate to task norms. Students tend to trust fluent machine output even when it contains subtle semantic distortions, yet they also over-edit segments that are already acceptable. The findings highlight three didactic priorities: developing LLM-based MT literacy, strengthening decision-making strategies in post-editing, and fostering genre- and domain-sensitive editing competence.

#114 AI Post-Editing in Production: A 71,262-Segment Evaluation Across Five Domains, Ten Languages and Five Systems Implementation & CS
Mara Nunziatini and Mercedes Speroni

This study evaluates an AI post-editing (AIPE) system in a professional translation setting, covering translation from English into ten target languages across five domains. We evaluate the system using automatic metrics on 71,262 production segments and human evaluation on a stratified sample of 6,618 segments (approximately 600 segments per target language) assessed by 60 professional translators. AIPE refines machine translation output using a secure publicly available LLM, retrieving language-specific style guides and high-quality bilingual examples to guide edits. We compare it with direct LLM translation (LLMT), Google Translate, and DeepL. The two AIPE configurations evaluated consistently outperform the generic translation baselines in terms of quality. LLMT does not match this quality, though it may suit less quality-sensitive domains. We observe how AIPE’s gains vary according to pre-translation type, with fuzzy translation memory matches over-represented among severe errors, and discuss deployment implications.

#31 Reasoning as Supportive Context for Machine Translation: A Case Study on Hindi to Bengali Language Pair Implementation & CS
Kshetrimayum Boynao Singh, Saksham Singh, Partha Pakray and Asif Ekbal

We investigate whether reasoning information can enhance machine translation when incorporated as supportive context during training and inference. Using Hindi-Bengali translation as a case study, we define five reasoning components: Key Terms, Syntactic, Semantic, Pragmatic, and Paraphrase. We conduct a complete ablation across all 31 possible combinations using Gemma-3-1B-Instruct and evaluate on a multi-domain benchmark with BLEU, chrF, and TER. Evaluation results show that reasoning effectiveness depends on its type and composition rather than quantity. Combining multiple heterogeneous signals causes objective diffusion, degrading performance. The compact Semantic and Paraphrase combination proves optimal, and providing it during inference yields 23.86 BLEU compared to 22.12 from standard fine-tuning, a +1.74 BLEU gain across eight domains. These findings demonstrate that targeted semantic guidance consistently and meaningfully improves compact translation models.
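
The figure of 31 combinations follows from taking every non-empty subset of the five reasoning components (2^5 - 1 = 31). A minimal sketch of how such an ablation grid could be enumerated is shown below; the function and variable names are illustrative assumptions and are not taken from the paper.

```python
from itertools import combinations

# The five reasoning components named in the abstract.
COMPONENTS = ["Key Terms", "Syntactic", "Semantic", "Pragmatic", "Paraphrase"]


def all_nonempty_combinations(items):
    """Return every non-empty subset of `items`, smallest subsets first."""
    return [
        combo
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


# Hypothetical name: one entry per ablation configuration.
ablation_grid = all_nonempty_combinations(COMPONENTS)
print(len(ablation_grid))  # 2**5 - 1 non-empty subsets
```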

#130 Embedding Similarity Is Not Quality Estimation: Lessons from Replacing a Dedicated QE Model Implementation & CS
Dimitrios Zaikis, Andrea Biondo, Matthew Dixon, Konstantinos Karageorgos and Aaron Schliem

Machine translation quality estimation (QE) typically relies on dedicated neural models trained on human judgments. We evaluate whether cosine similarity over general-purpose embeddings can serve as a lightweight alternative, using Gemini embeddings as the scoring backbone. Through three experiments (rogue dimension analysis, score calibration, and a learned calibration head) and a root cause analysis, we find that cosine similarity between source and translation saturates in the 0.94–0.99 range because even poor translations preserve most of the source semantics, leaving an Area Under the ROC Curve (AUC) ceiling of approximately 0.63. However, a LightGBM classifier trained on normalized cosine and surface-level text features breaks through this ceiling (AUC 0.751), with the improvement driven primarily by features orthogonal to embedding similarity. These findings demonstrate that raw embedding similarity cannot serve as a drop-in QE replacement and identify learned calibration as a viable lightweight path forward.
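
To make the saturation issue concrete, the sketch below computes cosine similarity between toy vectors and linearly rescales scores from the 0.94–0.99 band reported in the abstract onto the range 0 to 1. The vectors, the `rescale` helper, and its clipping behaviour are hypothetical illustrations, not the authors' calibration procedure, and real embeddings would come from an embedding model rather than hand-written lists.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rescale(score, low=0.94, high=0.99):
    """Map the band [low, high] linearly onto [0, 1], clipping outside it.

    The default band is the saturation range quoted in the abstract;
    the rescaling itself is an assumed, simplified normalisation.
    """
    return min(1.0, max(0.0, (score - low) / (high - low)))


# Toy stand-ins for embeddings (hypothetical values).
source = [0.90, 0.10, 0.40]
good_translation = [0.88, 0.12, 0.41]
poor_translation = [0.80, 0.30, 0.35]

for name, vec in [("good", good_translation), ("poor", poor_translation)]:
    raw = cosine_similarity(source, vec)
    print(name, round(raw, 4), round(rescale(raw), 4))
```

Both toy translations score close to 1 before rescaling, which mirrors why a raw threshold separates them poorly.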

#139 Enhancing LLM Translation Performance for Spanish - Valencian through Supervised Fine-tuning and Reinforcement Learning Implementation & CS
Paula Guerrero Castelló

Valencian, the Western Catalan variety used in the Valencian Community of Spain, lacks a dedicated language code in most multilingual machine translation (MT) systems, and is systematically rendered closer to the standard written Eastern Catalan used in Catalonia. We address this gap by adapting TranslateGemma-4B-IT, a 4-billion-parameter instruction-tuned (IT) large language model (LLM) specialized for translation, via three post-training strategies using only public corpora and Quantized Low-Rank Adaptation (QLoRA): (i) supervised fine-tuning (SFT); (ii) Group Relative Policy Optimization (GRPO), a reinforcement learning (RL) technique, with chrF plus a naturalness reward (GRPOV1); and (iii) GRPO with a composite automatic-metric reward (GRPOV2). Our results suggest that reward-function alignment with the target dialect is a key determinant of RL success in low-resource dialectal MT.

15:15 - 15:45 Coffee Break
15:45 - 17:50 ORAL SESSION 2: Research - Translators and Use-cases / MT in Use: Professional Practice and Creative Contexts Session chair: Natalia Resende
#97 Smarter edits? Post-editing with error highlights and translation suggestions Research · T&U
Fleur V.J. van Tellingen, Gautam Ranka, Žugčić Dora, Joyce van der Wal, Andrea Camasta, Livio Guerra and Alina Karakanta

As MT quality increases, interest in enhanced post-editing features such as QE-derived error highlights is growing, yet evidence for their usefulness remains limited. In this work, we explore the usefulness of LLM-derived error highlights and correction suggestions based on automatic post-editing (APE). We conduct a study where professional translators (En→Nl) post-edit translations using APE error highlights and correction suggestions and compare productivity, quality and user experience to regular PE and PE with QE-derived highlights. While no condition yielded productivity or quality gains compared to regular PE, APE highlights were better received than QE-derived highlights, and correction suggestions improved overall user experience.

#88 'It's like talking about how I use a pencil': Journalists' use of machine translation in their work Research · T&U
Mary Nurminen and Nina Havumetsä

This paper is one of the first in-depth accounts of the use of machine translation (MT) by journalists. It reports on a study of Finnish journalists that was conducted mostly in 2024 and comprised an online survey with 68 responses and interviews with 10 journalists. Results revealed that participants fluently integrated MT into a variety of journalistic processes, with an emphasis on using it for assimilation and dissemination; that they relied largely on traditional online MT tools and tended to employ MT mostly with languages they have some competence in; and that they had some awareness of risk and strategies for mitigating it, but could benefit from guidelines and training on using MT. The article contributes to the nascent research on the use of MT in journalism and also to our broader understanding of the paraprofessional use of MT.

#113 Creative Bias: How Machine Evaluation Struggles with Creativity in Literary Translations Research · T&U
Kyo Gerrits, Rik van Noord and Ana Guerberof Arenas

This article investigates the performance of automatic evaluation metrics (AEMs) and LLM-as-a-judge evaluation on literary translation across multiple languages, genres, and translation modalities. The aim is to assess how well these tools align with professionals when evaluating translation creativity (creative shifts & errors) and to see whether they can substitute for laborious manual annotation. A dataset of literary translations across three modalities (human translation, machine translation, and post-editing), three genres and three language pairs was created and annotated in detail for creativity by experienced professional literary translators. The results show that both AEMs and LLM-as-a-judge evaluations correlate poorly with professional evaluations on creativity, with LLM-as-a-judge showing a systematic bias in favour of machine-translated texts and penalising creative and culturally appropriate solutions. Moreover, performance is consistently worse for more literary genres such as poetry. This highlights fundamental limitations of current automatic evaluation tools for literary translation and the need to create new tools that do not routinely treat out-of-routine translations as errors.

#81 Audio description between MT translation and recreation: An Interview Study for the Language Pair English-German Research · T&U
Merle Sauter, Ekaterina Lapshinova-Koltunski and Sylvia Jaki

This study examines the machine translation of audio descriptions (AD) as an alternative to producing new AD for audiovisual formats in a foreign language. To assess acceptance and comprehensibility among German users, a small-scale survey is conducted with blind and visually impaired participants, examining key AD strategies, such as character description and naming, facial expressions and gestures, and spatio-temporal settings. Participants are shown machine-translated English AD alongside original human-produced German AD for comparison, and are questioned on these aspects. Findings suggest that AD translation is feasible for the German audience, as the vast majority of the machine-translated stimuli are rated as helpful and understandable by the participants. However, further studies are needed on machine translation and production costs, as well as larger-scale user studies.

#99 Automatic translation in public services: A survey of the Finnish public sector Research · T&U
Sıla Ilkılıç, Maarit Koponen and Mary Nurminen

This paper presents findings from a survey on the use of automatic translation in Finnish public services, conducted in autumn 2025. Adapted from a similar survey conducted at the University of Bristol, the present survey focused on users whose main professional activity is not translation or interpreting. The study analyzed those professionals’ habits around automatic translation use, the purposes and the contexts involved, and the respondents’ satisfaction and confidence in using automatic translation. Approximately half of the respondents reported using automatic translation at least once a week and across a range of scenarios, including public-facing situations. While the survey sample is not representative of the Finnish public sector as a whole, the data suggests that automatic translation may play a role in the everyday work of at least some public service employees.

EAMT 2026 Social Program
19:30 - 22:00 Social Event 1: Arcade Night, Location: The Gaming Factory
19:30 - 22:00 Social Event 2: Beer Tasting, Location: LOC Brewery

17 June 2026 - Day 2

08:30-09:00 Registration
09:00-09:05 Announcements
09:05-09:55 ORAL SESSION 3: Research - Translators & Users / Critical Perspectives on MT: Safety and Creative Language Session chair: Sergi Álvarez Vidal
#123 A Multilingual Red Teaming–Driven Safety Analysis of LLMs Research · T&U
Patrícia Pandeiro, Vera Cabarrão and Helena Moniz

This work benchmarks safety across several large language models (LLMs) and compares their performances through multilingual red teaming, which simulates adversarial attacks and identifies vulnerabilities in the systems. Using two public datasets and a proprietary dataset, the models were tested with three purposes. First, a red teaming test was conducted to establish a safety comparison between five models in English and Portuguese. The results revealed that, in general, Sugarloaf 3.1 is the safest model, but that Vesuvius 4.0 slightly outperforms it in Portuguese, also revealing that both outperform GPT-4o. Afterwards, three models were tested in both languages with one guardrailing prompt, which encourages safe interactions, and two content moderation prompts, to understand the strengths of the current guardrails, as well as the effectiveness of the content moderation task. The results show that current guardrails are sufficient, notwithstanding room for improvement (particularly for Portuguese), but that the performance of the content moderation task was substandard, even for the best performing model – GPT-4o. Finally, the 3.0 TowerLLM models were tested in English to evaluate the effect that tokens and temperature have on the output, revealing that an intermediate token limit leads to safer responses while a higher temperature causes performance degradation.

#39 Metaphors in Literary Post-Editing: Opening Pandora’s Box? Research · T&U
Aletta G. Dorst, Mayra O. Nas and Katinka Zeven

This paper investigates how post-editors of literary texts react and respond to the way metaphors have been translated by Neural Machine Translation (NMT) and Large Language Models (LLMs). The results show that one in three metaphors in the output were changed by the post-editors, demonstrating that the translation of figurative language is indeed problematic in literary MT (LitMT). The responses indicate that the post-editors were aware of overly literal translations, though mostly for multiword expressions. Moreover, at times they found it difficult to determine whether solutions were acceptable. They rated the overall quality of the MT output as quite poor and stated that the post-editing was more work and more effort than it would have been translating from scratch. This supports previous studies arguing that post-editing constrains translators in their creativity and diminishes their sense of text ownership.

10:00-10:55 POSTER BOASTER SESSION 2: Products & Projects (23 posters) Session chair: Miquel Esplà Gomis
10:55-11:25 Coffee Break
11:25-12:25 POSTER SESSION 2: Products & Projects (23 posters)
#23 Literacy-Grounded and Industry-Oriented Translation Training with LT-LiDER Products & Projects
Janiça Hackenbuchner, María Isabel Rivas Ginel, Joss Moorkens, Sheila Castilho, Nora Aranberri, Sergi Álvarez Vidal, María do Campo Bayón and Ralph Krüger

The Erasmus+-funded international research consortium LT-LiDER develops a range of digital training resources which are grounded in the overarching frameworks of digital and AI literacy and oriented towards practical application contexts in the language and translation industry. These resources can be implemented on a component basis or as a complete curriculum in higher-education language and translation classrooms.

#24 MaTIAS - Machine Translation to Inform Asylum Seekers: final results Products & Projects
July De Wilde, Anaïs Wouters, Arda Tezcan, Simon Van den Meersschaut, Katrijn Maryns and Lieve Macken

This paper reports on the final stages of the MaTIAS project. A functional prototype of the multilingual notification tool was deployed across seven Belgian reception centres, accompanied by training and support. Feedback was gathered through interviews and surveys. Two rounds of machine translation evaluation revealed considerable differences in quality across languages. The translation quality of Tigrinya in particular was deemed too low to be usable.

#25 OSCAIL - Open Science Communication through AI in EU Languages Products & Projects
Sheila Castilho, Susanna Fiorini, Lynne Bowker, Petr Motlicek, Joss Moorkens, Lieve Macken, Dairazalia Sanchez-Cortes, Janne Pölönen, Sami Syrjämäki, Mikael Laakso, Mark Fishel and Anastasia Stasenko

The Anglocentric nature of scholarly communication has many implications, such as limiting publication, discoverability and access from other language communities (even for major languages); putting minoritized languages at risk in the academic domain; and excluding many from peer review. The OSCAIL project addresses these challenges by exploring how machine translation (MT) enhanced by large language model (LLM)–based technologies can support access to scientific knowledge. Outputs will include evaluation datasets, protocols and best practices for MT in scholarly communication, and a prototype integration of MT tools into Open Journal Systems, the world’s most widely used open-source scholarly publishing platform.

#30 AIDA Agents: A Multi-Agent Translation Platform with Context-Aware Quality Control Products & Projects
Emanuele Di Rosa and Piotr Peszynski

We present AIDA Agents, a multi-agent translation platform that orchestrates LLM-based agents (for translation, rating, post-editing, and re-rating), delivering context-aware translations without model fine-tuning. Optional retrieval-augmented generation (RAG) injects translation memories, terminology, and style guidelines at every pipeline stage. On WMT24++ (Deutsch et al., 2025) (11 languages), AIDA Agents outperforms all systems on 10 of 11 pairs. On an industrial benchmark, 70–98% of segments are publication-ready without human post-editing. The platform is deployed with native XLIFF integration.

#34 VERA: A Platform for Automatic and Human Evaluation of Machine Translation Products & Projects
Sofía García González, Inés Quintana Raña, Jorge N. Afonso Cabido, Alberto Hernández Lado, German Rigau Claramunt and Sheila Castilho

We present VERA, an easy-to-use platform for machine translation (MT) evaluation, combining both automatic metrics and the Multidimensional Quality Metrics (MQM) Core human evaluation framework in a single web environment. It supports reference-based metrics, multi-user annotation, corpus export, and PDF reports with automatic and human evaluation results, including their correlations.

#37 Presentation of the Project CLingS: Cross-lingual information retrieval for scientific datasets in less-resourced languages Products & Projects
Valentina Fedchenko, Ka-I Lim and Milan Rusko

This document presents an initial overview of the CLingS project, currently in its early development stage. It outlines a collaborative effort to build a cross-lingual information retrieval platform for scientific literature in underrepresented languages. The CLingS project aims to develop datasets, tools, and methods to improve multilingual access to scientific knowledge.

#40 ARTICULATE: Science in your Own Language Products & Projects
Yolanda Vazquez-Alvarez, Matthew P. Aylett, Benjamin R. Cowan, Justin Edwards, Sanna Järvelä, Ioannis Konstas and Madeleine Steeds

The ARTICULATE project is an ambitious and interdisciplinary initiative funded by the CHIST-ERA call 2025. Its vision is to revolutionize science education and democratize scientific knowledge beyond academia and English-speaking audiences through the integration of AI with self-regulated learning. The aim is to translate science not just across language but across language style, to create engaging spoken digital experiences. We present an introduction to this project, an overview of the consortium and research approach, and a number of expected impacts.

#42 HERMeS: Human Evaluation & Ranking of MultiplE Systems Products & Projects
Rex Vanhorn

Human evaluation remains essential for reliable machine translation (MT) assessment, yet practical evaluation workflows are often difficult to reproduce and scale. Here we introduce HERMeS, a lightweight human evaluation platform designed to streamline systematic human evaluation and comparison of multiple MT systems across large translation sets. As a complement to existing evaluation tools, HERMeS focuses specifically on scalable comparison of many anonymized systems through a hybrid ranking and direct assessment workflow, using a practical approach that reduces cognitive load while maintaining data quality, security, and integrity.

#49 The MULTI-TRAD Project: Parallel Corpora and Multidimensional Analysis of Human, Machine and Post-Edited Translation in the Third Social Sector Products & Projects
Maria del Mar Sánchez Ramos, Douglas E. Biber, Cristina Cano Fernández, Irene Fuentes Pérez, Diana González Pastor, Larissa Goulart da Silva, Marcelo Yuri Himoro, Dorothy Kenny, Leida María Mónaco, María Teresa Ortego Antón, Isabel Peñuelas Gil, Cristina Plaza Lara, Verónica Redondo Astilleros, Celia Rico Pérez, Tania Salvador Blázquez, Muhammad Shakir, Francisco J. Vigier Moreno and Manuel Aenlle Curras

Domain adaptation remains a major challenge for machine translation, particularly in institutional communication. This paper presents the MULTI-TRAD project, which develops English–Spanish parallel corpora for Third Social Sector communication. The project integrates three complementary objectives: (i) the compilation of a domain-specific parallel corpus, (ii) the analysis of linguistic variation across human translation (HT), machine translation (MT), and post-edited (PE) texts using Multidimensional Analysis (Biber, 1988), and (iii) the development of a domain-adapted neural machine translation system. In particular, the project investigates how different translation processes give rise to distinct functional profiles, related to phenomena such as translationese and post-editese. This paper presents the project design and initial progress.

#55 Parallel Corpus Development Toolkit (PCDT): A Web-Based Platform for Multilingual Parallel Data Creation Products & Projects
Praveen Acharya, Rupak Ghimire, Bipesh Subedi, Prakash Poudyal, Balaram Prasain and Bal Krishna Bal

This paper presents PCDT, a web-based platform for collecting sentence-aligned parallel corpora through a community-driven approach to support machine translation for under-resourced languages. The tool distributes the translation task to the target community, and the resulting translations are subsequently reviewed by language experts.

#56 English–Nepali–Tamang: A Trilingual Parallel Corpus and Benchmark for Low-Resource Machine Translation Products & Projects
Praveen Acharya, Rupak Raj Ghimire, Prakash Poudyal, Balaram Prasain and Bal Krishna Bal

This article describes a research project aimed at developing a trilingual machine translation system for the English, Nepali, and Tamang language pairs. The project is expected to address knowledge and communication gaps caused by language barriers and to mitigate disparities in the availability of information and knowledge sources in Tamang and Nepali.

#58 TELÓ: AI-Driven Automatic Subtitling for the Promotion of the Performing Arts Products & Projects
Antoni Oliver, Sílvia Rodríguez Vázquez and Manel Jiménez

The TELÓ project provides an open-source framework for automated subtitling in the performing arts. Integrating state-of-the-art ASR and NMT, the system enables bidirectional translation between Catalan, Spanish, English and French. Designed for live performances, it provides synchronized captions for multiple devices, enhancing cultural internationalization and accessibility.

#59 DA + Criteria: A New Quality Assessment Method for Bridging the Gap Between Human and Machine Translation Products & Projects
Bettina Hiebl

Direct Assessment (DA) + Criteria is a translation quality assessment method derived from a comprehensive systematic literature review of the concepts of quality in machine translation and translation studies. In the project presented here, the method was tested alongside Multidimensional Quality Metrics (MQM) on German translations of English non-fiction texts produced by humans, DeepL and ChatGPT. The results of the study and the participants’ answers were then used to further refine the method.

#64 Adaptive CAT-embedded MT for low-memory, low-compute end-user devices Products & Projects
Marek Sabo

We present ACATMT, a compact bilingual encoder-decoder NMT system for English and Swedish, designed for professional computer-assisted translation (CAT) tools. It runs on-device in ONNX format, under 1 GB of RAM with no GPU needed, and features real-time post-edit based terminology adaptation. It also supports translation memory conditioning via decoder pre-filling. Evaluation on 5,021 technical segments unseen during training shows significant improvements in COMET and BLEU when using glossaries.

#68 Scalable Video-Based Search in the VGT Dictionary Products & Projects
Toon Vandendriessche, Caro Brosens, Hannes De Durpel, Mathieu De Coster and Joni Dambre

Video-based sign language dictionary search – in which a user records a sign to retrieve its translation – has been increasingly studied, yet never deployed in a large-vocabulary setting. We present the first such deployment: a fully scalable video-based search system integrated into the Flemish Sign Language (VGT) Dictionary, comprising over 11,000 signs. The system, released on November 28th, 2025, requires no retraining as new signs are added, and was validated on data collected in the wild. It was developed through an equal partnership between the deaf-led Flemish Sign Language Centre (VGTC) and AI researchers from Ghent University, and shows that closing the gap between sign language research and community impact is both achievable and essential.

#69 TaMTAS: Terminology-Aware Machine Translation for Accessible Science. Large Corpus compilation, terminology extraction and data augmentation Products & Projects
Sergi Alvarez-Vidal and Antoni Oliver

This paper presents the product vision, architecture, and expected deliverables of the TaMTAS (Terminology-Aware Machine Translation for Accessible Science) project. TaMTAS provides a fully integrated, open-source translation ecosystem tailored for the Life Sciences domain. By leveraging Large Reasoning Models (LRMs) that treat translation as a multi-step reasoning task, the system guarantees strict document-level terminology consistency. This paper outlines the overall project workflow, the targeted impact metrics for its Machine Translation (MT), Quality Estimation (QE), and Automatic Post-Editing (APE) modules, and provides an in-depth focus on the foundational data and terminology extraction engine led by the Universitat Oberta de Catalunya (UOC).

#73 Advanced CAT Tool Features for Enhancing Consistency in MT and Generative AI Outputs Products & Projects
Judith Klein

Recently, generative artificial intelligence (GenAI) has been perceived as a “silver bullet” for achieving faster, cheaper, and better translation production. However, in professional localisation, AI capabilities alone are not enough, as the still time-consuming post-editing (PE) of machine translation (MT) and GenAI output proves. The features and processes presented in this work aim to reduce these efforts by enhancing terminological control and translation consistency within the CAT environment STAR Transit.

#75 Translation 2.0: Equipping linguists for the machine translation future Products & Projects
Alina Karakanta and Vasilis Kalogiannis

Translation 2.0 addresses a critical gap in accessible, up-to-date educational resources on recent developments in Machine Translation and Large Language Models for students of linguistics and translation. It develops an online module with open-access learning materials, including knowledge clips, a workbook with incremental exercises to consolidate conceptual understanding, practical coding guides, and videos featuring industry professionals. The module aims to build both subject knowledge and computational literacy, freeing up contact hours for deeper engagement and critical discussions on practical, professional and ethical aspects. Translation 2.0 is funded through an Educational Innovation grant by the Faculty of Humanities at Leiden University and ECOLe (Expert Centre for Education and Learning) and runs from February to December 2026.

#86 Making Jobs Accessible through AI-supported Easy Language Translation Products & Projects
Fabian Merkel, Marco Baumgartner, Athanasios Breskas, Lea Gierke, Silke Gutermuth, Silvia Hansen-Schirra, Elena Kick, Vanessa König, Tobias Kopp, Natalie Martin and Miriam Spieß

Access to the primary labor market for people with cognitive impairments is hampered by barriers, notably the lack of workplace information in Easy Language (EL). Producing such texts is time- and cost-intensive and requires specialized translators. Project STARK-LS (Strengthening participation in the primary labor market through AI-generated Easy Language) addresses this gap by using an AI-translation tool to translate workplace materials into EL and integrating the approach into internships for people with cognitive impairments. An interdisciplinary team conducts mixed-methods evaluations by testing the EL translations for applicability, comprehensibility, and acceptance using lab-based eye-tracking and questionnaire studies, qualitative interviews with interns with cognitive impairments and experts for EL, and a quantitative survey with company representatives. The findings will inform best-practice recommendations for companies and rehabilitation agencies. The project advances scientific understanding of the perceived usefulness and potential barriers of EL in organizational contexts, while evaluating AI’s influence on the diffusion of high-quality EL texts in companies.

#87 Prompsit’s API and CLI: planet-friendly, privacy-first, open-source translation services for everyone Products & Projects
Lev Nikolaevich Berezhnoy, Gema Ramírez Sánchez, Sergio Ortiz Rojas and Mikel Forcada

Prompsit Language Engineering is launching an updated API and CLI for its open-source, planet-friendly machine translation services. Operating on a freemium model, the tools offer free limited access alongside tiered pricing for advanced features like MT evaluation, quality estimation, corpus scoring, and multilingual dataset annotation.

#105 Advancing Medical Communication: Multilingual, Multicultural, and Multimodal Processing for Translation and Simplification Products & Projects
Maria Pia Di Buono

This paper presents the 4MLP Project (Multilingual, Multicultural, and Multimodal Medical Language Processing), funded by the University of Naples "L'Orientale". The project aims to advance medical communication by developing language processing tools and resources that address the multilingual and multicultural dimensions of healthcare. The project focuses on building multimodal resources and models to support medical translation and communication across language barriers, with the goal of improving patient-provider interactions in multilingual medical settings.

#118 Does Speech Translation Meet Users' Needs? An English to Portuguese Study Across Demographics Products & Projects
Giuseppe Attanasio, Beatrice Savoldi, Daniel Chechelnitsky, Matteo Negri, Marine Carpuat and André Filipe Torres Martins

This paper introduces Ouvia, a research project to assess user-perceived usability and reliability of modern speech translation tools in En→Pt scenarios. The project centers on a user study in which we simulate real-life daily interactions by recruiting crowdworkers online from different sociodemographic groups. We collect their spoken requests and self-assessments about quality, satisfaction, and reliability. Here, we describe the project’s motivation and objectives, the study design, and the expected outcomes we will provide to speech translation practitioners.

#128 CRITICS: Critical Science Without Borders by Translation of Scientific Knowledge Products & Projects
Rodrigo Agerri, Itziar Aldabe, Elena Cabrio, Mark Cieliebak, Jan Deriu, Mariana Flores, Jurgita Kapočiūtė-Dzikienė, Dovilė Kuizinienė, Arantza Rico, Aritz Ruiz-González, Aitor Soroa, Mantas Vaškevičius and Serena Villata

The CRITICS project addresses science accessibility and literacy through the convergence of advanced Machine Translation (MT) based on Large Language Models (LLMs) and educational technology. By leveraging MT systems specifically optimized for scientific content, educational institutions can provide accurate, culturally relevant translations of scientific materials in students’ native languages, ensuring that complex scientific concepts are comprehensible while maintaining technical accuracy. Novel research on MT for scientific documents aims to break down language barriers in accessing cutting-edge research and educational materials currently only available in high-resourced languages, thereby facilitating the democratization of scientific knowledge.

12:25-13:15 ORAL SESSION 4: Research - Technical / Richer and Fairer MT: Lexical Diversity and Gender Bias Session chair: Beatrice Savoldi
#94 Diversity-Aware Literary Machine Translation with Multi-Reward Policy Optimization Research · Technical
Zeynep Yirmibeşoğlu Balal and Tunga Güngör

Literary translation is a difficult task that not only requires semantic accuracy but also stylistic richness and lexical diversity. Pretrained and supervised fine-tuned Large Language Models (LLMs) can over-rely on safe vocabulary choices, leading to translations that lack lexical variety. To address this problem, we propose a novel diversity-aware multi-objective Group Relative Policy Optimization (GRPO) framework that pushes the limits of open-source translation quality while increasing lexical diversity. We introduce two diversity-aware reward mechanisms, a Leave-One-Out (LOO) marginal contribution reward and a Self-BLEU penalty, balanced alongside neural quality metrics (COMET), lexical overlap (BLEU), and structural constraints. Through experiments on Turkish–English and German–English using Qwen3-14B, we show that our diversity-aware reinforcement learning approach successfully enhances lexical richness alongside translation quality. Our models achieve state-of-the-art open-source performance in literary translation, bridging the gap with leading commercial systems and demonstrating that policy optimization can effectively steer LLMs toward high-quality, lexically diverse outputs.

#109 Explaining GAND: A Resource on Gender-Ambiguous Natural Data & Contrastive Attribution Research · Technical
Janiça Hackenbuchner, Jasper Degraeuwe, Arda Tezcan and Joke Daems

Machine translation (MT) systems continue to produce gender-biased translations. At a time when self-expression is paramount, mistranslations based on default behaviour and stereotyping can lead to harm for users of these systems. To better understand how these systems translate gender in the absence of clear gender cues, we need benchmarking resources that reflect gender-ambiguous scenarios in a natural way. To this end, we present GAND, a gender-ambiguous natural data benchmarking resource for MT consisting of English source sentences, specifically designed to analyse the influence of contextual cues on gender in translation. We leverage GAND to conduct an interpretability analysis: we translate a subset of GAND into two grammatical gender languages and extend these with manually crafted contrastive translations. A subsequent feature attribution analysis reveals source words in context that inform the gender translation of an ambiguous referent entity in the target translation.

13:15-14:15 Lunch
14:15-15:05 ORAL SESSION 5: Implementation and Case Studies / Visually-aware Machine Translation Session chair: Janiça Hackenbuchner
#115 Towards Visually-Guided Movie Subtitle Translation for Indic Languages: A Case Study Implementation & CS
Tarun Chintada, Kshetrimayum Boynao Singh and Asif Ekbal

Movie subtitle translation is inherently multimodal, yet text-only systems often miss visual cues needed to convey emotion, action, and social nuance, especially for low-resource Indic languages (English to Hindi, Bengali, Telugu, Tamil and Kannada). We present a case study on five full-length films and compare two lightweight visual grounding strategies: structured attribute summaries from a 5-minute sliding window and free-text summaries of inter-subtitle visual gaps. Our analysis shows that temporal misalignment between subtitles and frames is a major obstacle in long-form video, often rendering indiscriminate visual grounding ineffective. However, oracle selective grounding, which replaces only the lowest-quality 20-30% of baseline segments with visually enhanced outputs, consistently improves COMET over the text-only baseline while requiring far less visual processing. Of the two approaches, coarse attribute-based visual context summarization is more robust, capturing scene-level emotion and subtle contextual cues that text alone often misses.

#135 Is a Picture Worth a Thousand Words? Exploration and Implementation Considerations for Visual Context in Translation Workflows Implementation & CS
Vera Senderowicz Guerra and Olesia Khrapunova

Vision-language models (VLMs) have the potential to enhance machine translation (MT) by leveraging visual context alongside text, yet their real utility for production workflows remains unclear. We conduct a unified, multi-condition evaluation of six leading VLMs, both open and proprietary, on two benchmarks (CoMMuTE and CaMMT), targeting lexical and cultural disambiguation respectively. We complement this with a domain-style case study simulating technical documentation localization. Results show that model performance varies widely, and the benefit of relevant images does not transfer uniformly across use cases. Proprietary models are notably sensitive to irrelevant images while open-source models are generally more stable. Contradicting visuals, by contrast, degrade translation across all models. Taken together, our findings show that rigorous evaluation is a necessary precondition for production deployment: metric gains can mask real accuracy losses, model sensitivity to irrelevant images should inform model selection, and avoiding contradicting visuals is a hard requirement for any pipeline.

15:05-15:30 Coffee Break
15:30-16:45 ORAL SESSION 6: Implementation and Case Studies / MT Systems in Deployment: Workflows and Institutional Adoption Session chair: Valerio Lorini
#93 The MaTOS Pipeline for the Translation of Scientific Abstracts on the HAL Platform Implementation & CS
Panagiotis Tsolakis, Ziqian Peng, Laurent Romary, François Yvon and Rachel Bawden

English dominates scientific publishing, which disadvantages researchers who are not native English speakers, especially those in the earlier stages of their careers. Being able to write and engage with scientific content written in their own language would clearly facilitate scientific production. The MaTOS project (Machine Translation for Open Science) seeks to reduce these barriers by developing machine translation tools for scientific documents in English and French. This article presents the design of the MaTOS pipeline for the HAL platform to automatically translate article abstracts, with author validation, to increase the number of bilingual abstracts on the platform. We also report preliminary experiments comparing translation of single sentences, three-sentence chunks, and whole abstracts, evaluated using quality estimation metrics. We release all the code of the different stages of the pipeline.

#96 Automated Information Extraction and Template Filling from Client Style Guides Implementation & CS
Leonor Graça, Vera Cabarrão and Helena Moniz

Style guides are a centrepiece of professional translation workflows. Yet, their integration into automatic pipelines remains underexplored. This paper presents exploratory work on extracting information from client style guides and applying it to a templated style guide, developed to serve as a system prompt. This template is then applied during LLM-based translation to automatically produce outputs that comply with the client’s requirements. The study focused on seven language pairs (LPs), evaluating the automatic extraction as well as translation quality and compliance with the style guide. The extraction demonstrated reliable performance across languages and file formats. Translation quality and adherence were evaluated using human preference annotation, comparing two Tower models (Tower Zen 9B and Tower+ 72B). The results indicate a modest advantage for Tower+, but with mutual acceptability in certain instances. These findings establish a viable semi-automatic framework for style guide integration in translation workflows, and motivate further investigation across broader domains, clients, and LPs.

#80 A Longitudinal Study of the Adoption of Specialized MT Systems in Canadian Parliamentary Translation Implementation & CS
Michel Simard, Jeniffer Leal-Wyss, Gabriel Bernier-Colborne and Rebecca Knowles

Since 2023, translators for the Parliament of Canada have had the option to use neural machine translation (NMT) technology provided by the National Research Council of Canada (NRC) to support their work in translating parliamentary publications between French and English. We present our analysis of an anonymized dataset of translators’ interactions with our Hawkeye MT systems, collected since their introduction and covering a period of 2.5 years. This data provides a unique perspective on how translators interact with the systems, how their use evolved over time and how it impacts the nature of their translations.

16:45-17:45 EAMT General Assembly
EAMT 2026 Social Program
18:30-22:00 Gala Dinner, Location: MOOD

18 June 2026 - Day 3

08:30-09:00 Registration
09:00-09:05 Announcements
09:05-10:45 ORAL SESSION 7: Research - Technical / Research Frontiers in MT: Evaluation, Specialisation, and Robustness Session chair: Sheila Castilho
#107 MetaDocEval: A Contrastive Framework for Evaluating Machine Translation Metrics at the Document-Level Research · Technical
Nicolas Dahan, Rachel Bawden and François Yvon

Recent advances in neural machine translation (MT) have spurred increased interest in evaluating translations beyond the sentence level, making it possible to assess discourse-level phenomena related to coherence and consistency. While existing sentence-level metrics can be applied to multi-sentence spans, it remains unclear whether their scores truly capture document-level quality. We introduce MetaDocEval, a reusable framework for generating contrastive document-level test sets, together with an instantiated test set covering three language pairs (en–fr, en–es, en–de). The framework includes automatic perturbation generation, quality-control filtering and sliding-window scoring, so that new corpora, language pairs, or perturbation types can be added with minimal manual intervention. The released test set targets a range of discourse-level phenomena and potential problems linked to translation at the document level. To evaluate how metrics behave as a function of context size, we apply them under a sliding-window protocol, varying the input from single sentences up to full documents. Our experiments show that none of the metrics tested genuinely capture document-level coherence: reference-based metrics overfit lexical overlap, reference+source metrics gain little from added context, reference-free encoders show brief context sensitivity before degrading on longer spans, and LLM-based scorers collapse beyond short inputs. A key finding is that reference access can be actively harmful for detecting discourse-level errors. Using short windows (≈3 sentences) offers the best trade-off between discourse error detection and score dilution.

#62 One Size Does Not Fit All: Why EU Legislative Translation Demands Domain-Specific Fine-Tuning of LLMs Research · Technical
Valerio Lorini, Paula Vlaic, Ulascan Akbulut and Daniele Marcoaldi

EU legislation is equally authentic and legally binding in all 24 official languages, rendering high-quality translation a legal obligation rather than a mere choice. Therefore, high-quality language technology supporting translation processes in all EU languages is essential for language professionals at the European Parliament (EP). This paper investigates whether domain-specific fine-tuning of an open-weight Large Language Model (LLM) yields consistently larger quality gains on legislative text compared to generic text, in all 23 EU target languages from English. We evaluate ten experimental conditions: base model, in-domain and cross-domain fine-tuning, sequential generic-then-legislative fine-tuning, and zero-shot Claude Sonnet 4.6 as a proprietary reference. We analyse BLEU, chrF, TER, and COMET metrics on nearly 700,000 segments. Results confirm the hypothesis for all 23 languages: legislative fine-tuning enhances BLEU by +12.30 compared to +7.10 for generic fine-tuning, demonstrating a consistent advantage of +5.20 BLEU in all the metrics. The fine-tuned EuroLLM-22B decisively outperforms Claude Sonnet 4.6, Anthropic’s latest frontier model, on both domains, highlighting that targeted adaptation of a smaller open-weight model can surpass a state-of-the-art proprietary system. Cross-domain transfer within the institutional domain is positive for all languages, with no catastrophic forgetting. Low-resource languages such as Irish and Maltese benefit the most from fine-tuning, while a divergence between BLEU and COMET rankings for some languages underlines the need for evaluation metrics alongside traditional measures.

#78 Augmenting Text to Increase Translation Difficulty Research · Technical
William Kalikman, Simon Sukup, Michal Tešnar and Vilém Zouhar

As state-of-the-art machine translation models saturate standard benchmarks, the field needs more challenging evaluations to distinguish between models of varying quality. We propose augmenting existing benchmarks to increase translation difficulty by combining adversarial optimization with a differentiable translation difficulty estimator. Our Adversarial Translation Optimization (ATO) uses gradients from a combined difficulty and fluency objective to iteratively replace tokens. Because each step branches over candidate substitutions at every position, optimization becomes a tree search problem, which we address with Beam Search. ATO offers a gradient-based alternative to LLM-based dataset creation without LLM prompting, expensive human curation, or task-specific model training. Our ATO-modified benchmark lowers average translation quality (xCOMET) from 0.93 to 0.82, compared to 0.88 for paraphrasing and 0.86 for a zero-shot baseline. Human evaluation shows the modified texts are somewhat less natural than the baselines but remain reasonably grammatical and plausible while being substantially harder to translate. We release two datasets of 350 English texts each, generated by our methods, as well as the code.

#82 Using Model Disagreement to Identify Unstable Regions in MT Evaluation Research · Technical
Vitalii Iakivchuk

Human evaluation of MT is essential but exhibits substantial annotator variability that limits evaluation reliability and supervised learning. Rather than treating disagreement as noise or correcting it through protocol changes, we analyze its structure via learned severity classifiers. Across training regimes defined by baseline model reproducibility, we observe internally coherent but mutually incompatible severity mappings: models trained on one regime produce confident predictions within that regime but reduced separability on the other. Margin–correctness analysis shows that instability is not uniformly low confidence; separability depends on alignment between model-internalized and human annotation regimes. These results indicate that unstable MT evaluation regions are primarily associated with competing severity interpretations rather than intrinsic example difficulty. Model–annotator disagreement therefore provides a practical signal for identifying unstable evaluation regions during MT evaluation.

10:45-11:15 Coffee Break
11:45-12:45 KEYNOTE 2: Dr. Antonio Toral - Flipping the Script: The Case for a Human-Initiated, AI-Augmented Translation Pipeline

Speaker Bio

Dr. Antonio Toral

Universitat d'Alacant, Spain

Distinguished Researcher in Machine Translation

Antonio Toral works as Distinguished Researcher in Machine Translation at the Universitat d'Alacant. Previously, he was an Associate Professor in Language Technology at the University of Groningen, where he coordinated the Computational Linguistics research group. Prior to that, he served as a postdoctoral researcher and research fellow at Dublin City University. He completed his PhD studies at the Universitat d'Alacant and the Istituto di Linguistica Computazionale.

His research interests include the application of machine translation (MT) to literary texts, MT for under-resourced languages and the computational analysis of translations produced by machines and humans. He coordinated the Abu-MaTran project, which was flagged by the European Commission as a success story, and he won the best paper award at MT Summit 2019 for his work on post-editese.

Abstract

Over the last two decades the translation profession has witnessed a dramatic increase in the use of technology. Primary examples include translation memories and machine translation post-editing (MTPE), whose adoption has been primarily driven by productivity. However, while MTPE is well-established and widely used, it presents important issues that affect both translators and the quality of the resulting translations.

In this talk, I will examine the main issues inherent in MTPE and propose an alternative translation pipeline that flips the roles, placing the translator before the machine. I will argue that, in such a setting, multi-agent AI can foster more informed translation decisions while safeguarding the translator's creative agency.

Finally, I will discuss why I think this approach is particularly suited for creative texts and peripheral languages, and also why it is not a far-fetched utopia, given current socioeconomic trends and developments.

12:45-13:45 Lunch
13:45-14:20 Best Thesis Award: Announcement and Presentation (5 + 20 min + 5 min Q&A)
14:20-15:10 POSTER BOASTER SESSION 3: Research - Technical (11): Research - T&U (10) (21 posters) Session chair: Yuri V Balashov
15:10-15:40 Coffee Break
15:10-16:45 POSTER SESSION 3: Research - Technical (11): Research - T&U (10) (21 posters)
#136 Artificial intelligence language technologies in multilingual healthcare: Grand challenges ahead Research · T&U
Vicent Briva-Iglesias

AI language technologies (AILTs), increasingly enabled by large language models (LLMs), are becoming embedded in multilingual healthcare workflows for translation, rewriting, documentation, interpreting, and messaging in language-discordant settings. Yet fluent output is not the same as clinically safe or equitable communication: performance varies across languages, accents, tasks, and workflows, and efficiency gains can hide errors, reduce traceability, and shift responsibility across clinicians, translators, interpreters, and health systems. This narrative review synthesises recent peer-reviewed evidence across written communication, spoken communication, and emerging agentic workflows. Using the Human-Centered AI Language Technology (HCAILT) lens, it examines capabilities, evaluation practices, implementation patterns, and recurrent errors through reliability, safety culture, and trustworthiness. We identify key convergences and contradictions in the literature and propose seven grand challenges for the next phase of research and deployment. Progress, we argue, requires not only better models but also accountable sociotechnical design, calibrated human oversight, and stronger collaboration across MT/NLP, translation studies, HCI, clinical practice, implementation science, and policy.

#33 AI-assisted cultural heritage dissemination: Comparing NMT and glossary-augmented LLM translation in rock art documents Research · T&U
Vicent Briva-Iglesias and María Ferre Fernández

Cultural heritage institutions increasingly disseminate research and interpretive materials globally, but multilingual dissemination is constrained by limited budgets and staffing. In terminology-dense domains such as rock art, translation quality depends on accurate, consistent specialised terms, and small lexical errors can mislead non-specialists and reduce reuse. We compare three English MT setups for a Spanish academic rock art text, focusing on simple, operationally feasible interventions rather than complex model-side modifications: (1) DeepL as a strong NMT baseline, (2) Gemini-Simple (LLM with a basic prompt), and (3) Gemini-RAG (the same LLM with glossary-augmented prompting via term-pair retrieval). Using PEARMUT, we conduct a human evaluation via (i) multi-way Direct Assessment (0–100) and (ii) targeted terminology auditing with a restricted MQM taxonomy. Gemini-RAG yields the highest exact-match terminology accuracy (81.4%), versus Gemini-Simple (69.1%) and DeepL (64.4%), while preserving overall quality (mean DA 85.3 Gemini-RAG vs. 85.2 Gemini-Simple), outperforming DeepL (80.3). These results show that glossary-augmented prompting is a low-overhead way to improve terminology control in cultural-heritage translation if institutions maintain minimal terminology resources and lightweight evaluation procedures.

#129 Extending Creativity: Large Language Models and the Practice of Poetry Translation Research · T&U
Natalia Resende and James Hadley

The aim of this paper is to propose a framework to support poetry translators in making effective use of large language models (LLMs) through established prompt engineering strategies applied to both pre-translation and translation stages. The paper illustrates these strategies using poems characterised by multiple layers of syntactic, semantic, phonological, and cultural complexity, and discusses how LLMs perform in response to each prompting technique. It also engages with the longstanding claim that poetry translation is a purely human endeavour and cannot be computer-assisted, arguing instead that LLMs, rather than replacing human creativity, have the potential to extend it.

#32 Evaluating the Effect of Prompt Language on LLM-based Translation: Evidence from Spanish<>Italian Translation Research · T&U
Antonella Bove, Paola Di Cataldo and Davide Maestroni

The integration of large language models (LLMs) into translation practice has substantially reshaped translation workflows (Kornacki and Pietrzak, 2025). Since translation quality depends partly on how these models are prompted, prompt design deserves closer attention as a key stage of the LLM-augmented translation process. This study investigates Spanish<>Italian translation with GPT 5.1 in the advertising and biomedical domains. It examines whether prompt language affects the translations generated by the model, focusing on whether prompts written in the target language generate translations that are preferred over those generated from English prompts, given that English is the language most prevalent in the model’s training data (Armengol-Estapé et al., 2022). In this study, preference is operationalised in terms of post-editing usefulness, that is, the perceived suitability of a translation as a starting point for subsequent human post-editing. Three prompt templates, varying in complexity and informational content, were tested. The translations were first screened for textual similarity, and only the translations generated from the template that produced the greatest variation across outputs were subsequently selected for human evaluation. Human preferences were collected through a pairwise-comparison task. The findings indicate that translations generated from prompts written in the target language tend to receive more favorable preference judgments than those produced using English-language prompts.

#29 Machine Translation in the Wild: User Reaction to Xiaohongshu's Built-In Translation Feature Research · T&U
Sui He

This paper examines user reactions to the launch of the machine translation (MT) feature on Xiaohongshu, a Chinese social media and e-commerce platform, in January 2025. Drawing on a dataset of 6,723 comments collected from 11 official posts promoting the translation function, this paper combines sentiment analysis with thematic analysis to investigate how users perceived and experimented with this function. Results show that reactions were generally positive, although concerns about functionality, accessibility, and translation accuracy were also expressed. In addition, users actively tested the function with inputs that fail to represent everyday online communication, including stand-alone words and phrases, abbreviations, internet slang, and symbolic or encoded forms. Successful decoding of these texts elicited positive responses, while testing of more conventional language remained fairly limited. This could lead to uncritical acceptance of MT outputs by users, highlighting the importance of closer collaboration among computer scientists, translation scholars, and platform designers to improve MT performance and promote informed user engagement in real-world deployment of MT functions.

#35 Quality and Comprehensibility of Interlingual Subtitles Produced by Humans or with Machines Research · T&U
Lara Shoana Schlüter, Ekaterina Lapshinova-Koltunski and Sylvia Jaki

The present paper focuses on the analysis of automatic subtitles produced with three different systems (HappyScribe, CapCut and Amberscript). We compare the outputs with one another, paying attention to the quality categories derived from audiovisual translation quality research. We also consider the comprehensibility of the produced subtitles. Additionally, we use automatic evaluation scores from BLEU, BLEURT and BERTScore to assess the overall quality. Our results show that automatically generated subtitles remain below human standards in quality and comprehensibility.

#95 Quebec Translators in the Age of AI: Perceptions on the Evolution and Sustainability of the Translation Profession Research · T&U
Lynne Bowker and Monyka L. Rodrigues

AI-based tools are disrupting the translation profession. The European Language Industry Survey reports annually on the situation in Europe, but less is known about the effects of AI-based tools on professional translators working elsewhere. This study presents a survey conducted with support from Quebec’s professional translators’ association (Ordre des traducteurs, terminologues et interprètes agréés du Québec). We analyzed 175 completed surveys, along with additional partial responses, and we present results relating to two broad categories: general perceptions about AI’s influence on the translation profession, and the evolution and sustainability of the profession. We also compare our results to those from other regions. Findings show that while Quebec translators face many issues similar to those faced by translators in the United Kingdom, France, Belgium, Switzerland, and Europe more generally, there are also subtle differences, such as the tendency of many Quebec translators to work as generalists, the comparatively low number of Quebec translators working in the entertainment, arts, and culture domains (which are growing elsewhere), and the large number who are hesitant to supervise student work placements.

#126 On the Use of LLMs for Specialised Terminology: A Good Alternative to Corpora? Research · T&U
Joachim Minder, Guillaume Wisniewski and Natalie Kübler

Specialised translation relies on the use of documentary and terminological resources, including corpora. These resources are particularly useful for terminology. However, their compilation and exploitation have several limitations: they require time, technical skills and access to data that can be difficult to collect. This study examines the extent to which LLMs can assist specialised translators in finding equivalents from English to French. We evaluate four proprietary models, GPT-4o, GPT-5.2, Claude Sonnet 4.5 and DeepSeek, in two specialised domains, Earth, Environmental and Planetary Sciences (EEPS) and Natural Language Processing (NLP). The experiment is based on 80 terms per domain and compares two prompting strategies: a terminology and a translation mode. The results highlight clear differences between models, prompting strategies and, to a lesser extent, domains. Claude Sonnet 4.5 achieves the best results in the most favourable configuration, while DeepSeek stands out for its greater stability. Analysis of confidence estimates also shows that they are only a partial indicator of terminological accuracy. Overall, the findings suggest that LLMs can be useful tools for specialised translators, but cannot, at this stage, replace specialised corpora. This research therefore paves the way for future work on the real practical usefulness of LLMs for specialised translators in work and educational contexts.

#61 BlAInded by Fluency: How Idiomatic Machine Translation Outputs Affect Student Post-Editors’ Edit Types Research · T&U
Valentin Scourneau and Loïc De Faria Pires

This study explores the influence of two prompting strategies on the lexical and syntactic metrics of large language model-based (LLM-based) machine translations (MTs) of a corpus of 18 British editorials into French. It also examines the impact of these strategies on the edit types made by Master’s translation students post-editing (PE) a representative editorial from the corpus, as evaluated using the machine translation post-editing annotation system (MTPEAS) taxonomy. Quantitatively, the prompt specifically requesting more syntactic and lexical variety leads to significantly higher syntactic and lexical metric scores in the MTs, but differences remain significant only for lexical metrics in the post-edited versions of the representative editorial. Qualitatively, we show that students post-editing an MT featuring more idiomatic rephrasings and fewer syntactic calques (as opposed to an MT that is structurally closer to the source text) seem to make fewer edits overall, leave more MT errors unaddressed, and make fewer successful edits.

#26 The Role of Prompt Language and Translation Theory-Driven Prompts in Large Language Models: A Case Study on Spanish-Chinese Editorial Translation Research · T&U
Haohong Lai and Weijia Li

This study examines how prompt language and translation theory-driven prompt design influence the quality of Spanish–Chinese journalistic translations generated by GPT-5.2. A parallel corpus of four editorials from EL PAÍS was translated under 48 experimental conditions (4 prompt types × 3 prompt languages × 4 articles). Translation quality was assessed using BLEU and BERTScore-F1 for automated evaluation, alongside human evaluation based on the Multidimensional Quality Metrics (MQM) framework. Automated metrics identified the baseline prompt (BASE) as the best-performing condition, whereas human evaluation ranked the brief-oriented prompt (BRIEF) highest (MQM: 8.66 vs. 7.84), a reversal likely attributable to the single-reference constraint inherent in automated measures. Sub-error type analysis revealed that translation-theory-driven prompts selectively reduced Awkward style errors, while Unidiomatic style errors persisted across conditions. Prompt language had a negligible impact under both evaluation paradigms. These results indicate that translation-theory-driven prompts can yield measurable quality gains under expert evaluation of journalistic translations, although their pedagogical implications for language learners remain suggestive and require validation through user-based studies.

#50 Beyond Simple Term Injection: Reasoning Models for Legal Translation in a Non-Dominant Language Variety Research · Technical
Paolo Di Natale, Elena Chiocchetti, Marlies Alber and Egon W. Stemle

Term injection in machine translation is undergoing a paradigm shift in the era of large language models (LLMs). Although recent shared-task results suggest near-saturation for sentence-level term injection from pre-defined glossaries, it remains unclear whether this also holds in more challenging settings. We address this question with a custom test set for legal translation from Italian into South Tyrolean German, a non-dominant and under-resourced language variety. We cover three terminology challenges: simple term injection, localisation of abbreviated forms, and homonym disambiguation. We focus on Reasoning Models (RMs) leveraging Test-Time Scaling, comparing them with different architectures and contributing a human analysis of reasoning traces. We find that reasoning offers little benefit for simple term injection, but yields clear gains for semantically complex cases such as homonym disambiguation. However, human evaluation of reasoning traces shows that these gains do not necessarily reflect robust and factually grounded translation-specific reasoning. We further show that without external terminological resources, even state-of-the-art RMs struggle to retrieve correct terminology for a non-dominant variety, while NMT small models remain competitive when trained on in-domain bilingual corpora. Based on these findings, we propose data collection strategies for inducing translation-specific reasoning, frameworks for adapting to and evaluating terminology across many language varieties, and terminology challenges beyond simple term injection.

#120 Beyond Semantics: Measuring Fine-Grained Emotion Preservation in Small Language Model-Based Machine Translation Research · Technical
Dawid Wiśniewski and Igor Czudy

Preserving affective nuance remains a challenge in Machine Translation (MT), where semantic equivalence often takes precedence over emotional fidelity. This paper evaluates the performance of three state-of-the-art Small Language Models (SLMs): EuroLLM, Aya Expanse, and Gemma, in maintaining fine-grained emotions during backtranslation. Using the GoEmotions dataset, which comprises Reddit comments across 28 distinct categories, we assess emotional preservation across five European languages: German, French, Spanish, Italian, and Polish. Specifically, we investigate (i) the inherent capability of these SLMs to retain emotional sentiment, (ii) the efficacy of emotion-aware prompting in improving preservation, and (iii) the performance of ModernBERT as a contemporary alternative to BERT for emotion classification in MT evaluation.

#108 LoRA Fine-Tuning of English-Norwegian NMT for the Oil & Gas Industry Research · Technical
Xiaojing Yang, Zhihan Li, Gege Sun, Mengyue Li and Meriem Beloucif

Adapting large language models to specialized domains remains challenging due to the computational cost of full fine-tuning and the limited availability of domain-specific parallel data. We present a systematic framework for parameter-efficient domain adaptation using Low-Rank Adaptation (LoRA) geared towards efficient learning in low-resource scenarios. Our method combines data-scaling analysis, dual-track hyperparameter optimization, and competitive benchmarking. We evaluate our approach on the low-resource English–Norwegian petroleum translation domain using a distilled version of NLLB and parallel data from the Norwegian Petroleum Directorate. Our adapted model achieves 61.48 BLEU (+24.62 over the base model) and 0.9298 COMET, while updating <0.4% of parameters. Our experiments show that the right parameter efficiency helps models achieve high accuracy, outperform evaluated commercial baselines on BLEU, and achieve comparable semantic quality (COMET). Our results provide a reproducible and computationally efficient blueprint for domain adaptation in neural machine translation, particularly for specialized and resource-constrained domains.

#122 Terminology-Aware Retrieval-Augmented Knowledge Distillation for Biomedical Neural Machine Translation Research · Technical
Maria Zafar, Souhail Bakkali and Rejwanul Haque

Knowledge distillation (KD) compresses large teacher models into smaller student models by transferring soft labels or intermediate activations. While effective in general domains, KD alone falls short in specialised machine translation (MT) settings, such as biomedical translation. The student inherits only the teacher’s compressed knowledge and lacks access to external domain information. Moreover, standard KD typically relies on abundant parallel data, which is often unavailable in domain-specific scenarios. To address these limitations, we combine KD with retrieval-augmented generation (RAG) in a few-shot setting. We propose a retrieval-augmented few-shot KD framework for a French-to-English biomedical translation task. The student learns to retrieve relevant in-domain knowledge from an external database, complementing the teacher’s supervision. We design and compare several retrieval strategies to enhance student capacity. Experiments show that with our terminology-aware retrieval-based methods, the student achieves performance comparable to or better than the teacher, while preserving translation quality and efficiency.

#38 Improving Retrieval-Augmented Neural Machine Translation with Monolingual Data Research · Technical
Maxime Bouthors, Josep Crego, Dakun Zhang and François Yvon

Conventional retrieval-augmented neural machine translation (RANMT) systems leverage bilingual corpora, e.g., translation memories (TMs). Yet, in many settings, monolingual corpora in the target language are also available. This work explores ways to take advantage of such resources by directly retrieving relevant target language segments, based on a source-side query. For this, we design improved cross-lingual retrieval systems, trained with both sentence-level and word-level matching objectives. In our experiments with three RANMT architectures, we assess such cross-lingual objectives in a controlled setting, reaching performances that match those of standard TM-based models. We also showcase our method on two real-world settings, using much larger monolingual corpora, and observe strong improvements over both baseline RANMTs and general-purpose cross-lingual retrievers.

#117 Evaluating Terminology Translation Methods Research · Technical
Iikka Hauhio, Théo Salmenkivi-Friberg and Tommi Nieminen

We present an evaluation of several state-of-the-art machine translation systems supporting terminology constraints in the English–Finnish translation direction. We first perform a meta-evaluation, in which we critically evaluate the evaluation metrics we use, including the questions asked of human evaluators and the automatic evaluation methods. We find that common metrics such as term accuracy and TERm do not agree with the human evaluators’ judgement on the correctness of the terms, while LLM-as-a-judge shows promise even though it does not agree with the human evaluators on all questions. We then compare the evaluated systems based on the human evaluation results, LLM-as-a-judge, COMET, and chrF2. We find that of the systems considered, soft constraint methods, including a term-trained model and an LLM, perform better than hard constraints forced using a constrained beam search.

#19 Evaluating Machine Translation and Automatic Metrics in Subtitling: A Case Study on Spanish Multiword Expressions Research · Technical
María Miró Maestre and Iván Martínez-Murillo

Evaluating the translation of multi-word expressions (MWEs) remains a major challenge for Machine Translation (MT), particularly in audiovisual subtitling, where idiomatic meaning and cultural context are essential for adequacy. This study investigates both the ability of state-of-the-art MT systems to translate Spanish MWEs into English and the extent to which current automatic evaluation methods reflect expert human judgment. We introduce ALMO-MWE, a dataset of 235 MWEs extracted from four films by Pedro Almodóvar to evaluate four MT systems using automatic metrics, LLM-as-a-judge approaches, and professional human assessment. Our results reveal a substantial mismatch between traditional automatic metrics and human judgments: n-gram-based metrics show near-zero correlation with expert evaluation and only limited discriminative capacity. In contrast, neural metrics and LLM-based judges exhibit substantially stronger agreement with human assessments, with GPT-OSS achieving the highest overall correlation. These findings highlight fundamental limitations of surface-form metrics for culturally and contextually sensitive translation phenomena and underscore the need for context-aware evaluation frameworks when assessing the translation quality of MWEs in audiovisual translation.

#140 When the Gold Standard Isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content Research · Technical
Lydia Nishimwe, Benoît Sagot and Rachel Bawden

User-generated content (UGC) is characterised by frequent use of non-standard language, from spelling errors to expressive choices such as slang, character repetitions, and emojis. This makes evaluating UGC translation challenging: what counts as a “good” translation depends on the desired standardness level of the output. To explore this, we examine the human translation guidelines of four UGC datasets, and derive a taxonomy of twelve non-standard phenomena and five translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). Our analysis reveals notable differences in how UGC is treated, resulting in a spectrum of standardness in reference translations. We show that translation scores of large language models are highly sensitive to prompts with explicit UGC translation instructions, and that they improve when they align with the dataset guidelines. We argue that fair evaluation requires both models and metrics to be aware of translation guidelines. Finally, we call for clear guidelines during dataset creation and for the development of controllable, guideline-aware evaluation frameworks for UGC translation.

#104 Bridging Domains for Automatic Post-Editing: A Classifier-Guided Multi-Domain Adaptation Framework Research · Technical
Sourabh Deoghare, Diptesh Kanojia and Pushpak Bhattacharyya

Automatic Post-Editing (APE) is a widely studied approach for enhancing the output quality of Neural Machine Translation (NMT) systems. While most prior work has focused on general-purpose APE, the potential of domain-specific APE, such as for personalized or specialized content, remains underexplored due to the scarcity of domain-labeled training data. In this work, we investigate domain adaptation for APE using adapter-based methods. Our proposed multitask learning-based domain adaptation framework includes the use of a domain classifier to get a weighted combination of parallel domain-specific adapters at inference time, without requiring prior domain knowledge. This design allows the model to leverage cross-domain similarities, making it especially robust in low-resource domain scenarios. Our experimental results on English–German, English–Marathi, and English–Tamil pairs across different domains for each pair show substantial improvements over their respective general-purpose APE baselines. To facilitate further research, we will release the code and human-annotated domain labels for triplets in the WMT22 English–Marathi and WMT24 English–Tamil APE datasets.

#127 The Two Towers for Estonian-Centric and Finno-Ugric Machine Translation Research · Technical
Mark Fishel and Lisa Yankovskaya

We present two open-weight translation models for Estonian and its low-resource “relatives” in the Finno-Ugric language family. The training data includes 12 languages paired with Estonian as well as 23 more Finno-Ugric languages and varieties, ranging from mid-resource examples with tens of thousands of speakers to extremely low-resource critically endangered languages with fewer than a hundred speakers. The translation models use Unbabel Tower+ 2B and 9B as their starting point. We compare their performance on two benchmarks to DeepL and GPT-5.2 and show that in most cases we surpass the quality of DeepL and match or nearly match the quality of GPT-5.2’s output with just a fraction of the parameters. Among other contributions we also restore the paragraph structure of a massive synthetic multiparallel corpus for Estonian translation and use it in training the models. The resulting models, training scripts and training data are released openly.

#63 LocRegen: Cost-Efficient Redundancy Removal in Multilingual E-commerce Titles with Small Language Models Research · Technical
Bryan Zhang, Stephan Walter, Merve Arinik and Luca Lomanto

E-commerce product titles often include redundant information that negatively impacts the user experience. Removing repeated words through restructuring and paraphrasing can make titles more concise and improve readability. While large language models can optimize titles, their computational cost makes them impractical for large-scale applications. In this paper, we first analyze the sources of repetition in multilingual product titles, then present LocRegen, a system that uses smaller language models to efficiently remove redundancies while preserving essential product attributes. Our experiments across five languages show that LocRegen with a 7B model substantially outperforms a 47B mixture-of-experts model: LocRegen achieves a 2.4% redundant title rate compared to 3.5% for the 47B model, and maintains a 3.8% overall error rate across all error categories including key product attribute omission compared to 8.4% for the 47B model. These results demonstrate that LocRegen delivers superior performance on cost-effective hardware with acceptable latency, making it practical for large-scale deployment where much larger models would be computationally prohibitive.

16:45-17:15 Closing Remarks

Best Paper Nominees

The following papers have been nominated for the EAMT 2026 Best Paper Award.

Paper number | Title | Track | Authors
#45 | Multilingual Communication in the Asylum Context: Evaluating LLM-Based Machine Translation with Fuzzy Match Augmentation and Adaptive NMT across Resource Conditions under Low-Data Constraints | Research - Technical | Thomas Moerman, Arda Tezcan and Lieve Macken
#94 | Diversity-Aware Literary Machine Translation with Multi-Reward Policy Optimization | Research - Technical | Zeynep Yirmibeşoğlu Balal and Tunga Güngör
#113 | Creative Bias: How Machine Evaluation Struggles with Creativity in Literary Translations | Research - T&U | Kyo Gerrits, Rik van Noord and Ana Guerberof Arenas
#123 | A Multilingual Red Teaming–Driven Safety Analysis of LLMs | Research - T&U | Patrícia Pandeiro, Vera Cabarrão and Helena Moniz
#81 | Audio description between MT translation and recreation: An Interview Study for the Language Pair English-German | Research - T&U | Merle Sauter, Ekaterina Lapshinova-Koltunski and Sylvia Jaki
#97 | Smarter edits? Post-editing with error highlights and translation suggestions | Research - T&U | Fleur V.J. van Tellingen, Gautam Ranka, Žugčić Dora, Joyce van der Wal, Andrea Camasta, Livio Guerra and Alina Karakanta
#135 | Is a Picture Worth a Thousand Words? Exploration and Implementation Considerations for Visual Context in Translation Workflows | Implementation & CS | Vera Senderowicz Guerra and Olesia Khrapunova
#80 | A Longitudinal Study of the Adoption of Specialized MT Systems in Canadian Parliamentary Translation | Implementation & CS | Michel Simard, Jeniffer Leal-Wyss, Gabriel Bernier-Colborne and Rebecca Knowles

Programme (PDF)

The conference programme is available as a PDF.
