Publications | Andreas Säuberli

2026

Gaze4NLP

Exploring Cognitively Informed Sentence Simplification with Gaze-Guided Text Generation

Andreas Säuberli, Diego Frassinelli, and Barbara Plank

In Proceedings of the Second International Workshop on Eye-Tracking Resources and Evaluation for Human-Aligned NLP, 2026

Automatic text simplification has mostly relied on human judgments when it comes to what is considered easy or difficult to read. Eye movements while reading can offer a more direct and objective signal of processing effort and reading ease. In this paper, we explore gaze-guided text generation (GGTG), an approach to control reading ease in generated texts, and assess its use for sentence simplification. GGTG employs a gaze model that is trained to predict eye-tracking measures such as reading times or regression rates, which are then used to rerank next-token probabilities generated by a language model. We evaluated the approach on an English sentence simplification benchmark and found gains in automatic evaluation metrics, although the simplification operations are mostly limited to the lexical level. Its modular nature also allows GGTG to be combined with other simplification techniques such as prompting or fine-tuning.
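The reranking step described in this abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function name `rerank_next_tokens`, the parameter `gaze_weight`, the top-k restriction, and the linear combination of log-probability and predicted reading time are not taken from the paper, and the trained gaze model is replaced by a toy word-length heuristic.

```python
import math


def toy_gaze_model(token: str) -> float:
    """Hypothetical stand-in for a trained gaze model: longer tokens are
    assumed to take longer to read (arbitrary units)."""
    return 0.2 * len(token)


def rerank_next_tokens(log_probs, gaze_weight=1.0, top_k=3):
    """Rerank the top_k next-token candidates of a language model.

    log_probs: dict mapping candidate token -> log-probability under the LM.
    gaze_weight: how strongly predicted reading time is penalised
        (positive favours easier tokens, negative favours harder ones).
    Returns a list of (token, probability) pairs, best candidate first.
    """
    # Keep only the top_k most probable candidates.
    candidates = sorted(log_probs, key=log_probs.get, reverse=True)[:top_k]
    # Combine the LM score with the predicted reading effort.
    scores = {
        t: log_probs[t] - gaze_weight * toy_gaze_model(t) for t in candidates
    }
    # Renormalise the adjusted scores into a probability distribution.
    norm = math.log(sum(math.exp(s) for s in scores.values()))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [(t, math.exp(scores[t] - norm)) for t in ranked]


# Made-up LM probabilities for three candidate continuations.
lm_output = {
    "utilize": math.log(0.5),
    "use": math.log(0.3),
    "employ": math.log(0.2),
}
reranked = rerank_next_tokens(lm_output, gaze_weight=1.0)
print(reranked[0][0])
```

With `gaze_weight=0.0` the ranking reduces to the language model's own ordering, which shows how such a weight could act as a control knob for reading ease.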
LREC

Evaluating LLM-Based Text Simplification for German: Effects on Post-Editing Effort, Quality Ratings, and User Comprehension

Luisa Carrer, Andreas Säuberli, Martin Kappus, and 2 more authors

In Proceedings of the Fifteenth Language Resources and Evaluation Conference, 2026

Automatic text simplification (ATS) seeks to automate the process of rewording within the same language to enhance readability and comprehension. Current evaluation practices for ATS systems predominantly rely on automatic metrics or assessments by experts and crowdworkers, often excluding the intended end users and other stakeholders, and thus limiting insights into the actual effectiveness of ATS models. In this study, we address this gap by conducting a multi-faceted, mixed-method evaluation of two LLM-based ATS systems for German (capito.ai and GPT-4o) and by involving end users, post-editors, and Easy Language experts. The findings highlight the effectiveness of the LLM-based ATS systems examined across several dimensions, including post-editing efficiency, expert quality assessments, and, in the case of GPT-4o-generated simplifications, user comprehension. Post-editing effort metrics, in particular, show an increase in productivity of around 30% compared to full manual simplification. Moreover, the results reveal substantial differences in perception and understanding among participant groups. These outcomes clearly indicate that ATS for German has recently made considerable progress and, crucially, underscore the importance of incorporating multiple stakeholders into ATS evaluation to better align system performance with accessibility goals.
EACL

Controlling Reading Ease with Gaze-Guided Text Generation

Andreas Säuberli, Darja Jepifanova, Diego Frassinelli, and 1 more author

In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics, 2026

The way our eyes move while reading can tell us about the cognitive effort required to process the text. In the present study, we use this fact to generate texts with controllable reading ease. Our method employs a model that predicts human gaze patterns to steer language model outputs towards eliciting certain reading behaviors. We evaluate the approach in an eye-tracking experiment with native and non-native speakers of English. The results demonstrate that the method is effective at making the generated texts easier or harder to read, measured both in terms of reading times and perceived difficulty of the texts. A statistical analysis reveals that the changes in reading behavior are mostly due to features that affect lexical processing. Possible applications of our approach include generation of personalized educational material for language learning and text simplification for information accessibility.

2025

EMNLP

Disentangling Subjectivity and Uncertainty for Hate Speech Annotation and Modeling using Gaze

Özge Alaçam, Sanne Hoeken, Andreas Säuberli, and 4 more authors

In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

Variation is inherent in opinion-based annotation tasks like sentiment or hate speech analysis. It does not only arise from errors, fatigue, or sentence ambiguity but also from genuine differences in opinion shaped by background, experience, and culture. In this paper, first, we show how annotators’ confidence ratings can be of great use for disentangling subjective variation from uncertainty, without relying on specific features present in the data (text, gaze, etc.). Our goal is to establish distinctive dimensions of variation which are often not clearly separated in existing work on modeling annotator variation. We illustrate our approach through a hate speech detection task, demonstrating that models are affected differently by instances of uncertainty and subjectivity. In addition, we show that human gaze patterns offer valuable indicators of subjective evaluation and uncertainty.
BEA

Do LLMs Give Psychometrically Plausible Responses in Educational Assessments?

Andreas Säuberli, Diego Frassinelli, and Barbara Plank

In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025), 2025

Knowing how test takers answer items in educational assessments is essential for test development, to evaluate item quality, and to improve test validity. However, this process usually requires extensive pilot studies with human participants. If large language models (LLMs) exhibit human-like response behavior to test items, this could open up the possibility of using them as pilot participants to accelerate test development. In this paper, we evaluate the human-likeness or psychometric plausibility of responses from 18 instruction-tuned LLMs with two publicly available datasets of multiple-choice test items across three subjects: reading, U.S. history, and economics. Our methodology builds on two theoretical frameworks from psychometrics which are commonly used in educational assessment, classical test theory and item response theory. The results show that while larger models are excessively confident, their response distributions can be more human-like when calibrated with temperature scaling. In addition, we find that LLMs tend to correlate better with humans in reading comprehension items compared to other subjects. However, the correlations are not very strong overall, indicating that LLMs should not be used for piloting educational assessments in a zero-shot setting.
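Temperature scaling, which this abstract names as the calibration method, divides a model's logits by a temperature T before the softmax; T > 1 flattens the distribution over answer options. The following is a minimal sketch with made-up logits for a four-option item. The function name and the numbers are illustrative assumptions, not values from the paper.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for answer options A-D of one multiple-choice item.
logits = [4.0, 1.0, 0.5, 0.0]
sharp = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=3.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

The most probable option stays the same under any positive temperature; only the confidence assigned to it changes, which is why the technique can make an overconfident response distribution look more like the spread of human answers.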
ETRA

The More the Merrier: Boost Your Dataset Visibility and Discover Eye-Tracking Datasets with pymovements

Daniel G. Krakowczyk, David R. Reich, Andreas Säuberli, and 6 more authors

In Proceedings of the 2025 Symposium on Eye Tracking Research and Applications, 2025

Eye movement research is often obstructed by the time-consuming challenges of accessing and preprocessing datasets, diverting efforts from scientific discovery. Researchers often struggle with non-standardized data formats, incomplete metadata, and scattered dataset repositories. Moreover, visibility of tediously collected and curated datasets is hindered without central aggregators that reference these valuable works. pymovements addresses these shortcomings by providing researchers with a seamless way to announce, discover, download, and process published eye movement datasets. We encourage researchers to contribute their eye-tracking datasets to our library to increase their visibility and impact.
ETRA

MultiplEYE: Creating a multilingual eye-tracking-while-reading corpus

Deborah Noemie Jakobi, Maja Stegenwallner-Schütz, Nora Hollenstein, and 72 more authors

In Proceedings of the 2025 Symposium on Eye Tracking Research and Applications, 2025

Eye-tracking-while-reading data provide valuable insights across multiple disciplines, including psychology, linguistics, natural language processing, education, and human-computer interaction. Despite its potential, the availability of large, high-quality, multilingual datasets remains limited, hindering both foundational reading research and advancements in applications. The MultiplEYE project addresses this gap by establishing a large-scale, international eye-tracking data collection initiative. It aims to create a multilingual dataset of eye movements recorded during natural reading, balancing linguistic diversity, while ensuring methodological consistency for reliable cross-linguistic comparisons. The dataset spans numerous languages and follows strict procedural, documentation, and data pre-processing standards to enhance eye-tracking data transparency and reproducibility. A novel data-sharing framework, integrated with data quality reports, allows for selective data filtering based on research needs. Researchers and labs worldwide are invited to join the initiative. By establishing and promoting standardized practices and open data sharing, MultiplEYE facilitates interdisciplinary research and advances reading research and gaze-augmented applications.

2024

GermEval

Statement Segmentation for German Easy Language Using BERT and Dependency Parsing

Andreas Säuberli and Niclas Bodenmann

In Proceedings of the GermEval 2024 Shared Task on Statement Segmentation in German Easy Language (StaGE), 2024

Texts in Easy Language should contain a low number of statements per sentence, to make information more accessible and comprehensible. The shared task Statement Segmentation in German Easy Language (StaGE) aims to automatically identify the number and location of statements in German Easy Language sentences. We present our submission to this task, which combines sequence labeling with dependency parsing. Our approach uses a fine-tuned BERT model to predict the head token of each statement span and expands the span using dependency relations. Our model achieves a mean absolute error of 0.40 in the predicted number of statements and a Jaccard index of 0.38 in the statement spans. We discuss the challenges and limitations of the task and outline future research directions.
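The two metrics reported in this abstract can be sketched as follows. This is an illustrative sketch, not the shared task's official scorer: the function names are hypothetical, statement spans are assumed to be sets of token indices, and how the task matches and averages spans is not specified here.

```python
def mean_absolute_error(predicted, gold):
    """Mean absolute difference between predicted and gold statement counts."""
    if len(predicted) != len(gold):
        raise ValueError("inputs must have the same length")
    return sum(abs(p - g) for p, g in zip(predicted, gold)) / len(gold)


def jaccard_index(span_a, span_b):
    """Intersection size over union size of two token-index sets."""
    a, b = set(span_a), set(span_b)
    if not a and not b:
        return 1.0  # assumption: two empty spans count as identical
    return len(a & b) / len(a | b)


# Toy data: statement counts for four sentences and one pair of spans.
mae = mean_absolute_error([1, 2, 2, 3], [1, 3, 2, 1])
jac = jaccard_index({0, 1, 2, 3}, {2, 3, 4})
print(mae, jac)
```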
HumEval

Towards Holistic Human Evaluation of Automatic Text Simplification

Luisa Carrer, Andreas Säuberli, Martin Kappus, and 1 more author

In Proceedings of the 4th Workshop on Human Evaluation of NLP Systems (HumEval), 2024

Text simplification refers to the process of rewording within a single language, moving from a standard form into an easy-to-understand one. Easy Language and Plain Language are two examples of simplified varieties aimed at improving readability and understanding for a wide-ranging audience. Human evaluation of automatic text simplification (ATS) is usually done by employing experts or crowdworkers to rate the generated texts. However, this approach does not include the target readers of simplified texts and does not reflect actual comprehensibility. In this paper, we explore different ways of measuring the quality of automatically simplified texts. We conducted a multi-faceted evaluation study involving end users, post-editors, and Easy Language experts, and applied a variety of qualitative and quantitative methods. We found differences in the perception and actual comprehension of the texts by different user groups. In addition, qualitative surveys and behavioral observations proved to be essential in interpreting the results. Finally, we discuss the advantages of comprehensive evaluations of ATS and provide recommendations for future work.
READI

Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models

Andreas Säuberli and Simon Clematide

In Proceedings of the 3rd Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI), 2024

Reading comprehension tests are used in a variety of applications, ranging from education to assessing the comprehensibility of simplified texts. However, creating such tests manually and ensuring their quality is difficult and time-consuming. In this paper, we explore how large language models (LLMs) can be used to generate and evaluate multiple-choice reading comprehension items. To this end, we compiled a dataset of German reading comprehension items and developed a new protocol for human and automatic evaluation, including a metric we call text informativity, which is based on guessability and answerability. We then used this protocol and the dataset to evaluate the quality of items generated by Llama 2 and GPT-4. Our results suggest that both models are capable of generating items of acceptable quality in a zero-shot setting, but GPT-4 clearly outperforms Llama 2. We also show that LLMs can be used for automatic evaluation by eliciting item responses from them. In this scenario, evaluation results with GPT-4 were the most similar to human annotators. Overall, zero-shot generation with LLMs is a promising approach for generating and evaluating reading comprehension test items, in particular for languages without large amounts of available data.
CHI

Digital Comprehensibility Assessment of Simplified Texts among Persons with Intellectual Disabilities

Andreas Säuberli, Franz Holzknecht, Patrick Haller, and 4 more authors

In Proceedings of the CHI Conference on Human Factors in Computing Systems, 2024

Text simplification refers to the process of increasing the comprehensibility of texts. Automatic text simplification models are most commonly evaluated by experts or crowdworkers instead of the primary target groups of simplified texts, such as persons with intellectual disabilities. We conducted an evaluation study of text comprehensibility including participants with and without intellectual disabilities reading unsimplified, automatically and manually simplified German texts on a tablet computer. We explored four different approaches to measuring comprehensibility: multiple-choice comprehension questions, perceived difficulty ratings, response time, and reading speed. The results revealed significant variations in these measurements, depending on the reader group and whether the text had undergone automatic or manual simplification. For the target group of persons with intellectual disabilities, comprehension questions emerged as the most reliable measure, while analyzing reading speed provided valuable insights into participants’ reading behavior.

2023

Frontiers

Enabling text comprehensibility assessment for people with intellectual disabilities using a mobile application

Andreas Säuberli, Silvia Hansen-Schirra, Franz Holzknecht, and 4 more authors

Frontiers in Communication, 2023

In research on Easy Language and automatic text simplification, it is imperative to evaluate the comprehensibility of texts by presenting them to target users and assessing their level of comprehension. Target readers often include people with intellectual or other disabilities, which renders conducting experiments more challenging and time-consuming. In this paper, we introduce Okra, an openly available touchscreen-based application to facilitate the inclusion of people with disabilities in studies of text comprehensibility. It implements several tasks related to reading comprehension and cognition and its user interface is optimized toward the needs of people with intellectual disabilities (IDs). We used Okra in a study with 16 participants with IDs and tested for effects of modality, comparing reading comprehension results when texts are read on paper and on an iPad. We found no evidence of such an effect on multiple-choice comprehension questions and perceived difficulty ratings, but reading time was significantly longer on paper. We also tested the feasibility of assessing cognitive skill levels of participants in Okra, and discuss problems and possible improvements. We will continue development of the application and use it for evaluating automatic text simplification systems in the future.

2022

TSAR

Eye-tracking based classification of Mandarin Chinese readers with and without dyslexia using neural sequence models

Patrick Haller, Andreas Säuberli, Sarah Kiener, and 3 more authors

In Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), 2022

Eye movements are known to reflect cognitive processes in reading, and psychological reading research has shown that eye gaze patterns differ between readers with and without dyslexia. In recent years, researchers have attempted to classify readers with dyslexia based on their eye movements using Support Vector Machines (SVMs). However, these approaches (i) are based on highly aggregated features averaged over all words read by a participant, thus disregarding the sequential nature of the eye movements, and (ii) do not consider the linguistic stimulus and its interaction with the reader’s eye movements. In the present work, we propose two simple sequence models that process eye movements on the entire stimulus without the need of aggregating features across the sentence. Additionally, we incorporate the linguistic stimulus into the model in two ways—contextualized word embeddings and manually extracted linguistic features. The models are evaluated on a Mandarin Chinese dataset containing eye movements from children with and without dyslexia. Our results show that (i) even for a logographic script such as Chinese, sequence models are able to classify dyslexia on eye gaze sequences, reaching state-of-the-art performance, and (ii) incorporating the linguistic stimulus does not help to improve classification performance.
CLEF

LauSAn at eRisk 2022: Simply and Effectively Optimizing Text Classification for Early Detection

Andreas Säuberli, Sooyeon Cho, and Laura Stahlhut

Working Notes of CLEF, 2022

The goal of early detection tasks at eRisk is to classify social media users as early as possible, based on streams of posts written by those users. We present two simple strategies of adapting standard text classification models in order to optimize them for early detection: concatenating the posts in different ways during training and inference, and continuously moving the decision boundary at inference time. We applied these approaches to two different text classification architectures based on pre-trained language models in eRisk 2022’s Task 2 (early detection of depression), and were able to reach top 5 placements in all time-sensitive evaluation metrics. A systematic post-submission ablation study confirmed that both strategies were effective at optimizing for early detection.
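The second strategy in this abstract, moving the decision boundary at inference time, might look like the sketch below. The linear schedule, the direction in which the threshold moves, the parameter names, and the threshold values are all assumptions chosen for illustration; the abstract does not specify how the boundary moves.

```python
def decision_threshold(num_posts, start=0.9, end=0.5, horizon=100):
    """Hypothetical schedule: the threshold moves linearly from `start`
    to `end` over the first `horizon` posts, then stays at `end`."""
    progress = min(num_posts, horizon) / horizon
    return start + (end - start) * progress


def first_alert(scores, **schedule):
    """Return the 1-based index of the first post at which the classifier
    score reaches the current threshold, or None if it never does."""
    for i, score in enumerate(scores, start=1):
        if score >= decision_threshold(i, **schedule):
            return i
    return None


# Toy classifier scores after each new post of one user.
scores = [0.55, 0.60, 0.70, 0.80, 0.88]
print(first_alert(scores, horizon=10))
```

Under this assumed schedule, a strict threshold early on guards against premature alerts based on very little text, while the relaxed threshold later allows a decision once more posts have accumulated.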
Frontiers

Automatic Text Simplification for German

Sarah Ebling, Alessia Battisti, Marek Kostrzewa, and 4 more authors

Frontiers in Communication, 2022

The article at hand aggregates the work of our group in automatic processing of simplified German. We present four parallel (standard/simplified German) corpora compiled and curated by our group. We report on the creation of a gold standard of sentence alignments from the four sources for evaluating automatic alignment methods on this gold standard. We show that one of the alignment methods performs best on the majority of the data sources. We used two of our corpora as a basis for the first sentence-based neural machine translation (NMT) approach toward automatic simplification of German. In follow-up work, we extended our model to render it capable of explicitly operating on multiple levels of simplified German. We show that using source-side language level labels improves performance with regard to two evaluation metrics commonly applied to measuring the quality of automatic text simplification.

2021

ASSETS

Measuring Text Comprehension for People with Reading Difficulties Using a Mobile Application

Andreas Säuberli

In Proceedings of the 23rd International ACM SIGACCESS Conference on Computers and Accessibility, 2021

Measuring text comprehension is crucial for evaluating the accessibility of texts in Easy Language. However, accurate and objective comprehension tests tend to be expensive, time-consuming and sometimes difficult to implement for target groups of Easy Language. In this paper, we propose using computer-based testing with touchscreen devices as a means to simplify and accelerate data collection using comprehension tests, and to facilitate experiments with less proficient readers. We demonstrate this by designing and implementing a mobile touchscreen application and validating its effectiveness in an experiment with people with intellectual disabilities. The results suggest that there is no difference in terms of task difficulty between measuring comprehension using the mobile application and a traditional paper-and-pencil test. Moreover, reading times appear to be shorter in the application than on paper.
NewSum

A New Dataset and Efficient Baselines for Document-level Text Simplification in German

Annette Rios, Nicolas Spring, Tannon Kew, and 4 more authors

In Proceedings of the Third Workshop on New Frontiers in Summarization, 2021

The task of document-level text simplification is very similar to summarization with the additional difficulty of reducing complexity. We introduce a newly collected data set of German texts, collected from the Swiss news magazine 20 Minuten (‘20 Minutes’) that consists of full articles paired with simplified summaries. Furthermore, we present experiments on automatic text simplification with the pretrained multilingual mBART and a modified version thereof that is more memory-friendly, using both our new data set and existing simplification corpora. Our modifications of mBART let us train at a lower memory cost without much loss in performance; in fact, the smaller mBART even improves over the standard model in a setting with multiple simplification levels.

2020

READI

Benchmarking Data-driven Automatic Text Simplification for German

Andreas Säuberli, Sarah Ebling, and Martin Volk

In Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI), 2020

Automatic text simplification is an active research area, and there are first systems for English, Spanish, Portuguese, and Italian. For German, no data-driven approach exists to this date, due to a lack of training data. In this paper, we present a parallel corpus of news items in German with corresponding simplifications on two complexity levels. The simplifications have been produced according to a well-documented set of guidelines. We then report on experiments in automatically simplifying the German news items using state-of-the-art neural machine translation techniques. We demonstrate that despite our small parallel corpus, our neural models were able to learn essential features of simplified language, such as lexical substitutions, deletion of less relevant words and phrases, and sentence shortening.
LREC

A Corpus for Automatic Readability Assessment and Text Simplification of German

Alessia Battisti, Dominik Pfütze, Andreas Säuberli, and 2 more authors

In Proceedings of the Twelfth Language Resources and Evaluation Conference, 2020

In this paper, we present a corpus for use in automatic readability assessment and automatic text simplification for German, the first of its kind for this language. The corpus is compiled from web sources and consists of parallel as well as monolingual-only (simplified German) data amounting to approximately 6,200 documents (nearly 211,000 sentences). As a unique feature, the corpus contains information on text structure (e.g., paragraphs, lines), typography (e.g., font type, font style), and images (content, position, and dimensions). While the importance of considering such information in machine learning tasks involving simplified language, such as readability assessment, has repeatedly been stressed in the literature, we provide empirical evidence for its benefit. We also demonstrate the added value of leveraging monolingual-only data for automatic text simplification via machine translation through applying back-translation, a data augmentation technique.