Showing 1–18 of 18 results for author: Graliński, F

  1. arXiv:2506.13148  [pdf, ps, other]

    cs.CL cs.AI

    Adapting LLMs for Minimal-edit Grammatical Error Correction

    Authors: Ryszard Staruch, Filip Graliński, Daniel Dzienisiewicz

    Abstract: Decoder-only large language models have shown superior performance in the fluency-edit English Grammatical Error Correction, but their adaptation for minimal-edit English GEC is still underexplored. To improve their effectiveness in the minimal-edit approach, we explore the error rate adaptation topic and propose a novel training schedule method. Our experiments set a new state-of-the-art result f…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: Accepted at BEA-2025

  2. arXiv:2501.02266  [pdf, other]

    cs.CL cs.AI

    LLMzSzŁ: a comprehensive LLM benchmark for Polish

    Authors: Krzysztof Jassem, Michał Ciesiółka, Filip Graliński, Piotr Jabłoński, Jakub Pokrywka, Marek Kubis, Monika Jabłońska, Ryszard Staruch

    Abstract: This article introduces the first comprehensive benchmark for the Polish language at this scale: LLMzSzŁ (LLMs Behind the School Desk). It is based on a coherent collection of Polish national exams, including both academic and professional tests extracted from the archives of the Polish Central Examination Board. It covers 4 types of exams, coming from 154 domains. Altogether, it consists of almos…

    Submitted 4 January, 2025; originally announced January 2025.

  3. arXiv:2412.14581  [pdf, other]

    cs.CL

    CORD: Balancing COnsistency and Rank Distillation for Robust Retrieval-Augmented Generation

    Authors: Youngwon Lee, Seung-won Hwang, Daniel Campos, Filip Graliński, Zhewei Yao, Yuxiong He

    Abstract: With the adoption of retrieval-augmented generation (RAG), large language models (LLMs) are expected to ground their generation to the retrieved contexts. Yet, this is hindered by position bias of LLMs, failing to evenly attend to all contexts. Previous work has addressed this by synthesizing contexts with perturbed positions of gold segment, creating a position-diversified train set. We extend th…

    Submitted 19 December, 2024; originally announced December 2024.

  4. arXiv:2412.10684  [pdf, other]

    cs.CL

    Inference Scaling for Bridging Retrieval and Augmented Generation

    Authors: Youngwon Lee, Seung-won Hwang, Daniel Campos, Filip Graliński, Zhewei Yao, Yuxiong He

    Abstract: Retrieval-augmented generation (RAG) has emerged as a popular approach to steering the output of a large language model (LLM) by incorporating retrieved contexts as inputs. However, existing work observed the generator bias, such that improving the retrieval results may negatively affect the outcome. In this work, we show such bias can be mitigated, from inference scaling, aggregating inference ca…

    Submitted 14 December, 2024; originally announced December 2024.

  5. arXiv:2411.11829  [pdf, other]

    cs.LG cs.CL cs.DB

    Tackling prediction tasks in relational databases with LLMs

    Authors: Marek Wydmuch, Łukasz Borchmann, Filip Graliński

    Abstract: Though large language models (LLMs) have demonstrated exceptional performance across numerous problems, their application to predictive tasks in relational databases remains largely unexplored. In this work, we address the notion that LLMs cannot yield satisfactory results on relational databases due to their interconnected tables, complex relationships, and heterogeneous data types. Using the rec…

    Submitted 18 November, 2024; originally announced November 2024.

  6. arXiv:2409.03046  [pdf, other]

    cs.CL

    Oddballness: universal anomaly detection with language models

    Authors: Filip Graliński, Ryszard Staruch, Krzysztof Jurkiewicz

    Abstract: We present a new method to detect anomalies in texts (in general: in sequences of any data), using language models, in a totally unsupervised manner. The method considers probabilities (likelihoods) generated by a language model, but instead of focusing on low-likelihood tokens, it considers a new metric introduced in this paper: oddballness. Oddballness measures how "strange" a given token is a…

    Submitted 4 September, 2024; originally announced September 2024.

  7. arXiv:2407.01393  [pdf, other]

    cs.CL

    POLygraph: Polish Fake News Dataset

    Authors: Daniel Dzienisiewicz, Filip Graliński, Piotr Jabłoński, Marek Kubis, Paweł Skórzewski, Piotr Wierzchoń

    Abstract: This paper presents the POLygraph dataset, a unique resource for fake news detection in Polish. The dataset, created by an interdisciplinary team, is composed of two parts: the "fake-or-not" dataset with 11,360 pairs of news articles (identified by their URLs) and corresponding labels, and the "fake-they-say" dataset with 5,082 news articles (identified by their URLs) and tweets commenting on them…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: 14 pages, 1 figure, accepted to the 14th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (WASSA'24)

  8. arXiv:2402.01300  [pdf, other]

    cs.CL

    Two Approaches to Diachronic Normalization of Polish Texts

    Authors: Kacper Dudzic, Filip Graliński, Krzysztof Jassem, Marek Kubis, Piotr Wierzchoń

    Abstract: This paper discusses two approaches to the diachronic normalization of Polish texts: a rule-based solution that relies on a set of handcrafted patterns, and a neural normalization model based on the text-to-text transfer transformer architecture. The training and evaluation data prepared for the task are discussed in detail, along with experiments conducted to compare the proposed normalization so…

    Submitted 2 February, 2024; originally announced February 2024.

    Comments: Accepted to the LaTeCH-CLfL 2024 workshop

  9. arXiv:2304.14953  [pdf, other]

    cs.CL

    CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data

    Authors: Michał Turski, Tomasz Stanisławek, Karol Kaczmarek, Paweł Dyda, Filip Graliński

    Abstract: In recent years, the field of document understanding has progressed a lot. A significant part of this progress has been possible thanks to the use of language models pretrained on large amounts of documents. However, pretraining corpora used in the domain of document understanding are single domain, monolingual, or nonpublic. Our goal in this paper is to propose an efficient pipeline for creating…

    Submitted 6 June, 2023; v1 submitted 28 April, 2023; originally announced April 2023.

    Comments: Accepted at ICDAR 2023

  10. Kleister: Key Information Extraction Datasets Involving Long Documents with Complex Layouts

    Authors: Tomasz Stanisławek, Filip Graliński, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, Przemysław Biecek

    Abstract: The relevance of the Key Information Extraction (KIE) task is increasingly important in natural language processing problems. But there are still only a few well-defined problems that serve as benchmarks for solutions in this area. To bridge this gap, we introduce two new datasets (Kleister NDA and Kleister Charity). They involve a mix of scanned and born-digital long formal English-language docum…

    Submitted 12 May, 2021; originally announced May 2021.

    Comments: accepted to ICDAR 2021

    Journal ref: International Conference on Document Analysis and Recognition ICDAR 2021

  11. arXiv:2011.03228  [pdf, other]

    cs.CL cs.IR

    From Dataset Recycling to Multi-Property Extraction and Beyond

    Authors: Tomasz Dwojak, Michał Pietruszka, Łukasz Borchmann, Jakub Chłędowski, Filip Graliński

    Abstract: This paper investigates various Transformer architectures on the WikiReading Information Extraction and Machine Reading Comprehension dataset. The proposed dual-source model outperforms the current state-of-the-art by a large margin. Next, we introduce WikiReading Recycled, a newly developed public dataset, and the task of multiple property extraction. It uses the same data as WikiReading but does n…

    Submitted 6 November, 2020; originally announced November 2020.

    Comments: Accepted at CoNLL 2020; this article supersedes arXiv:2006.08281

  12. arXiv:2010.15552  [pdf, other]

    cs.LG

    Successive Halving Top-k Operator

    Authors: Michał Pietruszka, Łukasz Borchmann, Filip Graliński

    Abstract: We propose a differentiable successive halving method of relaxing the top-k operator, rendering gradient-based optimization possible. The need to perform softmax iteratively on the entire vector of scores is avoided by using a tournament-style selection. As a result, a much better approximation of top-k with lower computational cost is achieved compared to the previous approach.

    Submitted 8 October, 2020; originally announced October 2020.

    Comments: Work in progress

  13. arXiv:2010.14464  [pdf, other]

    cs.DS cs.CL cs.IR

    Dynamic Boundary Time Warping for Sub-sequence Matching with Few Examples

    Authors: Łukasz Borchmann, Dawid Jurkiewicz, Filip Graliński, Tomasz Górecki

    Abstract: The paper presents a novel method of finding a fragment in a long temporal sequence similar to the set of shorter sequences. We are the first to propose an algorithm for such a search that does not rely on computing the average sequence from query examples. Instead, we use query examples as is, utilizing all of them simultaneously. The introduced method based on the Dynamic Time Warping (DTW) tech…

    Submitted 1 September, 2024; v1 submitted 27 October, 2020; originally announced October 2020.

  14. arXiv:2006.08281  [pdf, other]

    cs.CL cs.IR

    On the Multi-Property Extraction and Beyond

    Authors: Tomasz Dwojak, Michał Pietruszka, Łukasz Borchmann, Filip Graliński, Jakub Chłędowski

    Abstract: In this paper, we investigate the Dual-source Transformer architecture on the WikiReading information extraction and machine reading comprehension dataset. The proposed model outperforms the current state-of-the-art by a large margin. Next, we introduce WikiReading Recycled - a newly developed public dataset, supporting the task of multiple property extraction. It keeps the spirit of the original…

    Submitted 15 June, 2020; originally announced June 2020.

    Comments: 5 pages

  15. arXiv:2005.07934  [pdf, other]

    cs.CL

    ApplicaAI at SemEval-2020 Task 11: On RoBERTa-CRF, Span CLS and Whether Self-Training Helps Them

    Authors: Dawid Jurkiewicz, Łukasz Borchmann, Izabela Kosmala, Filip Graliński

    Abstract: This paper presents the winning system for the propaganda Technique Classification (TC) task and the second-placed system for the propaganda Span Identification (SI) task. The purpose of the TC task was to identify the applied propaganda technique given a propaganda text fragment. The goal of the SI task was to find specific text fragments which contain at least one propaganda technique. Both of the develope…

    Submitted 5 September, 2020; v1 submitted 16 May, 2020; originally announced May 2020.

  16. arXiv:2003.02356  [pdf, other]

    cs.CL

    Kleister: A novel task for Information Extraction involving Long Documents with Complex Layout

    Authors: Filip Graliński, Tomasz Stanisławek, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, Przemysław Biecek

    Abstract: State-of-the-art solutions for Natural Language Processing (NLP) are able to capture a broad range of contexts, like the sentence-level context or document-level context for short documents. But these solutions are still struggling when it comes to longer, real-world documents with the information encoded in the spatial structure of the document, such as page elements like tables, forms, headers,…

    Submitted 6 March, 2020; v1 submitted 4 March, 2020; originally announced March 2020.

  17. LAMBERT: Layout-Aware (Language) Modeling for information extraction

    Authors: Łukasz Garncarek, Rafał Powalski, Tomasz Stanisławek, Bartosz Topolski, Piotr Halama, Michał Turski, Filip Graliński

    Abstract: We introduce a simple new approach to the problem of understanding documents where non-trivial layout influences the local semantics. To this end, we modify the Transformer encoder architecture in a way that allows it to use layout features obtained from an OCR system, without the need to re-learn language semantics from scratch. We only augment the input of the model with the coordinates of token…

    Submitted 28 May, 2021; v1 submitted 19 February, 2020; originally announced February 2020.

    Comments: accepted to ICDAR 2021

    Journal ref: In: Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition - ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol 12821. Springer, Cham

  18. arXiv:1911.03911  [pdf, other]

    cs.CL

    Contract Discovery: Dataset and a Few-Shot Semantic Retrieval Challenge with Competitive Baselines

    Authors: Łukasz Borchmann, Dawid Wiśniewski, Andrzej Gretkowski, Izabela Kosmala, Dawid Jurkiewicz, Łukasz Szałkiewicz, Gabriela Pałka, Karol Kaczmarek, Agnieszka Kaliska, Filip Graliński

    Abstract: We propose a new shared task of semantic retrieval from legal texts, in which a so-called contract discovery is to be performed, where legal clauses are extracted from documents, given a few examples of similar clauses from other legal acts. The task differs substantially from conventional NLI and shared tasks on legal information extraction (e.g., one has to identify text span instead of a single…

    Submitted 8 October, 2020; v1 submitted 10 November, 2019; originally announced November 2019.

    Comments: Submitted to Findings of EMNLP