Boosting language models for real-word error detection

Masanti, Corina; Witschel, Hans Friedrich; Riesen, Kaspar

Authors

Masanti, Corina

Witschel, Hans Friedrich

Riesen, Kaspar

Publication date

2025

Collection

Institut für Wirtschaftsinformatik

Type

04B - Conference proceedings contribution

Editors

Castrillon-Santana, Modesto

De Marsico, Maria

Fred, Ana

Parent work

Proceedings of the 14th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2025)

DOI of the original publication

https://doi.org/10.5220/0013251500003905

URI

https://irf.fhnw.ch/handle/11654/56301

Pages / Duration

318-325

Publisher / Publishing institution

SciTePress

Place of publication / Event location

Porto

Abstract

With the introduction of transformer-based language models, research on error detection in text documents has advanced significantly. However, important research challenges remain. In the present paper, we address the specific challenge of detecting real-word errors, i.e., words that exist in the underlying dictionary but are incorrect in the context of the sentence. In particular, we study three categories of real-word errors that occur frequently in German text, viz. verb conjugation errors, case errors, and capitalization errors. One open issue in detecting real-word errors is that limited data is available for training the models, especially for languages other than English, the dominant language in research. To counteract this limitation, we propose to incorporate synthetic data into the training process for these three error categories. The first contribution of this paper is that we generate high-quality synthetic training data from a real-world text data set provided by a Swiss proofreading agency. The second major contribution, in addition to this novel and large-scale synthetic data set, is that we propose to incorporate ensemble learning methods for language models. A few approaches that combine ensemble learning methods with language models have already been proposed. One such strategy, known as boosted prompting, is inspired by classical boosting algorithms: it iteratively augments the prompt set with new prompts that generalize better to regions of the target problem space where the previous prompts underperform (Pitis et al., 2023).
Another approach is to train multiple models and combine them for the final output. Li et al. (2019) combined CNN-based and transformer-based models to tackle grammatical error correction. In the present paper, we propose to employ boosting techniques to enhance the training process of language models and ultimately improve their accuracy in detecting real-word errors. To the best of our knowledge, this is the first time that boosting is used in combination with language models in this specific way and for this particular task.
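
The abstract refers to classical boosting algorithms without stating how boosting is applied to the language models in this paper. As background only, the following minimal sketch shows one reweighting round of classical discrete AdaBoost, which is not necessarily the authors' procedure. The function name `adaboost_round` and the toy inputs are hypothetical; each training example (for instance, a sentence labeled as containing a real-word error or not) is assumed to carry a weight, and the weights are assumed to sum to 1.

```python
import math


def adaboost_round(weights, correct):
    """One reweighting round of classical discrete AdaBoost (illustrative only).

    weights: current example weights, assumed to sum to 1.
    correct: booleans, True where the current model classified the example correctly.
    Returns (alpha, new_weights), where alpha is the model's vote in the ensemble.
    """
    # Weighted error of the current model: total weight of misclassified examples.
    eps = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0.0 < eps < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")

    # Models with a lower weighted error receive a larger vote.
    alpha = 0.5 * math.log((1.0 - eps) / eps)

    # Increase the weights of misclassified examples, decrease the others.
    raw = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]

    # Renormalize so that the weights sum to 1 again.
    total = sum(raw)
    return alpha, [r / total for r in raw]


# Toy input (hypothetical): four equally weighted examples, the last one misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]
alpha, new_weights = adaboost_round(weights, correct)
```

After the update, the misclassified example holds half of the total weight, so the next model in the ensemble is pushed toward the cases the previous one got wrong.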

Subject area (DDC)

330 - Economics

Event

14th International Conference on Pattern Recognition Applications and Methods

Conference start date

23.02.2025

Conference end date

25.02.2025

ISBN

978-989-758-730-6

Language

English

Created during FHNW affiliation

Yes

Publication status

Published

Peer review

peer-reviewed

Open access status

Closed

Citation

Masanti, C., Witschel, H. F., & Riesen, K. (2025). Boosting language models for real-word error detection. In M. Castrillon-Santana, M. De Marsico, & A. Fred (Eds.), Proceedings of the 14th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2025) (pp. 318–325). SciTePress. https://doi.org/10.5220/0013251500003905
