-
VectorEdits: A Dataset and Benchmark for Instruction-Based Editing of Vector Graphics
Authors:
Josef Kuchař,
Marek Kadlčík,
Michal Spiegel,
Michal Štefánik
Abstract:
We introduce a large-scale dataset for instruction-guided vector image editing, consisting of over 270,000 pairs of SVG images paired with natural language edit instructions. Our dataset enables training and evaluation of models that modify vector graphics based on textual commands. We describe the data collection process, including image pairing via CLIP similarity and instruction generation with…
▽ More
We introduce a large-scale dataset for instruction-guided vector image editing, consisting of over 270,000 pairs of SVG images paired with natural language edit instructions. Our dataset enables training and evaluation of models that modify vector graphics based on textual commands. We describe the data collection process, including image pairing via CLIP similarity and instruction generation with vision-language models. Initial experiments with state-of-the-art large language models reveal that current methods struggle to produce accurate and valid edits, underscoring the challenge of this task. To foster research in natural language-driven vector graphic generation and editing, we make our resources created within this work publicly available.
△ Less
Submitted 18 June, 2025;
originally announced June 2025.
-
Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers
Authors:
Marek Kadlčík,
Michal Štefánik,
Timothee Mickus,
Michal Spiegel,
Josef Kuchař
Abstract:
Pretrained language models (LMs) are prone to arithmetic errors. Existing work showed limited success in probing numeric values from models' representations, indicating that these errors can be attributed to the inherent unreliability of distributionally learned embeddings in representing exact quantities. However, we observe that previous probing methods are inadequate for the emergent structure…
▽ More
Pretrained language models (LMs) are prone to arithmetic errors. Existing work showed limited success in probing numeric values from models' representations, indicating that these errors can be attributed to the inherent unreliability of distributionally learned embeddings in representing exact quantities. However, we observe that previous probing methods are inadequate for the emergent structure of learned number embeddings with sinusoidal patterns.
In response, we propose a novel probing technique that decodes numeric values from input embeddings with near-perfect accuracy across a range of open-source LMs. This proves that after the sole pre-training, LMs represent numbers with remarkable precision. Finally, we find that the embeddings' preciseness judged by our probe's accuracy explains a large portion of LM's errors in elementary arithmetic, and show that aligning the embeddings with the pattern discovered by our probe can mitigate these errors.
△ Less
Submitted 10 June, 2025;
originally announced June 2025.
-
Negation: A Pink Elephant in the Large Language Models' Room?
Authors:
Tereza Vrabcová,
Marek Kadlčík,
Petr Sojka,
Michal Štefánik,
Michal Spiegel
Abstract:
Negations are key to determining sentence meaning, making them essential for logical reasoning. Despite their importance, negations pose a substantial challenge for large language models (LLMs) and remain underexplored.
We constructed and published two new textual entailment datasets NoFEVER-ML and NoSNLI-ML in four languages (English, Czech, German, and Ukrainian) with
examples differing in n…
▽ More
Negations are key to determining sentence meaning, making them essential for logical reasoning. Despite their importance, negations pose a substantial challenge for large language models (LLMs) and remain underexplored.
We constructed and published two new textual entailment datasets NoFEVER-ML and NoSNLI-ML in four languages (English, Czech, German, and Ukrainian) with
examples differing in negation. It allows investigation of the root causes of the negation problem and its exemplification: how popular LLM model properties and language impact their inability to handle negation correctly.
Contrary to previous work, we show that increasing the model size may improve the models' ability to handle negations. Furthermore, we find that both the models' reasoning accuracy and robustness to negation are language-dependent and that the length and explicitness of the premise have an impact on robustness. There is better accuracy in projective language with fixed order, such as English, than in non-projective ones, such as German or Czech.
Our entailment datasets pave the way to further research for explanation and exemplification of the negation problem, minimization of LLM hallucinations, and improvement of LLM reasoning in multilingual settings.
△ Less
Submitted 3 June, 2025; v1 submitted 28 March, 2025;
originally announced March 2025.
-
Attend or Perish: Benchmarking Attention in Algorithmic Reasoning
Authors:
Michal Spiegel,
Michal Štefánik,
Marek Kadlčík,
Josef Kuchař
Abstract:
Can transformers learn to perform algorithmic tasks reliably across previously unseen input/output domains? While pre-trained language models show solid accuracy on benchmarks incorporating algorithmic reasoning, assessing the reliability of these results necessitates an ability to cleanse models' functional capabilities from memorization. In this paper, we propose an algorithmic benchmark compris…
▽ More
Can transformers learn to perform algorithmic tasks reliably across previously unseen input/output domains? While pre-trained language models show solid accuracy on benchmarks incorporating algorithmic reasoning, assessing the reliability of these results necessitates an ability to cleanse models' functional capabilities from memorization. In this paper, we propose an algorithmic benchmark comprising six tasks of infinite input domains where we can also disentangle and trace the correct, robust algorithm necessary for the task. This allows us to assess (i) models' ability to extrapolate to unseen types of inputs, including new lengths, value ranges or input domains, but also (ii) to assess the robustness of the functional mechanism in recent models through the lens of their attention maps. We make the implementation of all our tasks and interoperability methods publicly available at https://github.com/michalspiegel/AttentionSpan .
△ Less
Submitted 28 February, 2025;
originally announced March 2025.
-
BenCzechMark : A Czech-centric Multitask and Multimetric Benchmark for Large Language Models with Duel Scoring Mechanism
Authors:
Martin Fajcik,
Martin Docekal,
Jan Dolezal,
Karel Ondrej,
Karel Beneš,
Jan Kapsa,
Pavel Smrz,
Alexander Polok,
Michal Hradis,
Zuzana Neverilova,
Ales Horak,
Radoslav Sabol,
Michal Stefanik,
Adam Jirkovsky,
David Adamczyk,
Petr Hyner,
Jan Hula,
Hynek Kydlicek
Abstract:
We present BenCzechMark (BCM), the first comprehensive Czech language benchmark designed for large language models, offering diverse tasks, multiple task formats, and multiple evaluation metrics. Its duel scoring system is grounded in statistical significance theory and uses aggregation across tasks inspired by social preference theory. Our benchmark encompasses 50 challenging tasks, with correspo…
▽ More
We present BenCzechMark (BCM), the first comprehensive Czech language benchmark designed for large language models, offering diverse tasks, multiple task formats, and multiple evaluation metrics. Its duel scoring system is grounded in statistical significance theory and uses aggregation across tasks inspired by social preference theory. Our benchmark encompasses 50 challenging tasks, with corresponding test datasets, primarily in native Czech, with 14 newly collected ones. These tasks span 8 categories and cover diverse domains, including historical Czech news, essays from pupils or language learners, and spoken word. Furthermore, we collect and clean BUT-Large Czech Collection, the largest publicly available clean Czech language corpus, and use it for (i) contamination analysis and (ii) continuous pretraining of the first Czech-centric 7B language model with Czech-specific tokenization. We use our model as a baseline for comparison with publicly available multilingual models. Lastly, we release and maintain a leaderboard with existing 50 model submissions, where new model submissions can be made at https://huggingface.co/spaces/CZLC/BenCzechMark.
△ Less
Submitted 22 May, 2025; v1 submitted 23 December, 2024;
originally announced December 2024.
-
A Comparative Study of Text Retrieval Models on DaReCzech
Authors:
Jakub Stetina,
Martin Fajcik,
Michal Stefanik,
Michal Hradis
Abstract:
This article presents a comprehensive evaluation of 7 off-the-shelf document retrieval models: Splade, Plaid, Plaid-X, SimCSE, Contriever, OpenAI ADA and Gemma2 chosen to determine their performance on the Czech retrieval dataset DaReCzech. The primary objective of our experiments is to estimate the quality of modern retrieval approaches in the Czech language. Our analyses include retrieval qualit…
▽ More
This article presents a comprehensive evaluation of 7 off-the-shelf document retrieval models: Splade, Plaid, Plaid-X, SimCSE, Contriever, OpenAI ADA and Gemma2 chosen to determine their performance on the Czech retrieval dataset DaReCzech. The primary objective of our experiments is to estimate the quality of modern retrieval approaches in the Czech language. Our analyses include retrieval quality, speed, and memory footprint. Secondly, we analyze whether it is better to use the model directly in Czech text, or to use machine translation into English, followed by retrieval in English. Our experiments identify the most effective option for Czech information retrieval. The findings revealed notable performance differences among the models, with Gemma22 achieving the highest precision and recall, while Contriever performing poorly. Conclusively, SPLADE and PLAID models offered a balance of efficiency and performance.
△ Less
Submitted 20 December, 2024; v1 submitted 19 November, 2024;
originally announced November 2024.
-
Self-training Language Models for Arithmetic Reasoning
Authors:
Marek Kadlčík,
Michal Štefánik
Abstract:
Recent language models achieve impressive results in tasks involving complex multistep reasoning, but scaling these capabilities further traditionally requires expensive collection of more annotated data. In this work, we explore the potential of improving models' reasoning capabilities without new data, merely using automated feedback to the validity of their predictions in arithmetic reasoning (…
▽ More
Recent language models achieve impressive results in tasks involving complex multistep reasoning, but scaling these capabilities further traditionally requires expensive collection of more annotated data. In this work, we explore the potential of improving models' reasoning capabilities without new data, merely using automated feedback to the validity of their predictions in arithmetic reasoning (self-training).
In systematic experimentation across six different arithmetic reasoning datasets, we find that models can substantially improve in both single-round (offline) and online self-training, reaching a correct result in +13.9% and +25.9% more cases, respectively, underlining the importance of actuality of self-training feedback. We further find that in the single-round, offline self-training, traditional supervised training can deliver gains comparable to preference optimization, but in online self-training, preference optimization methods largely outperform supervised training thanks to their superior stability and robustness on unseen types of problems.
△ Less
Submitted 23 October, 2024; v1 submitted 11 July, 2024;
originally announced July 2024.
-
Concept-aware Data Construction Improves In-context Learning of Language Models
Authors:
Michal Štefánik,
Marek Kadlčík,
Petr Sojka
Abstract:
Many recent language models (LMs) are capable of in-context learning (ICL), manifested in the LMs' ability to perform a new task solely from natural-language instruction. Previous work curating in-context learners assumes that ICL emerges from a vast over-parametrization or the scale of multi-task training. However, recent theoretical work attributes the ICL ability to concept-dependent training d…
▽ More
Many recent language models (LMs) are capable of in-context learning (ICL), manifested in the LMs' ability to perform a new task solely from natural-language instruction. Previous work curating in-context learners assumes that ICL emerges from a vast over-parametrization or the scale of multi-task training. However, recent theoretical work attributes the ICL ability to concept-dependent training data and creates functional in-context learners even in small-scale, synthetic settings.
In this work, we practically explore this newly identified axis of ICL quality. We propose Concept-aware Training (CoAT), a framework for constructing training scenarios that make it beneficial for the LM to learn to utilize the analogical reasoning concepts from demonstrations. We find that by using CoAT, pre-trained transformers can learn to better utilise new latent concepts from demonstrations and that such ability makes ICL more robust to the functional deficiencies of the previous models. Finally, we show that concept-aware in-context learning is more effective for a majority of new tasks when compared to traditional instruction tuning, resulting in a performance comparable to the previous in-context learners using magnitudes of more training data.
△ Less
Submitted 28 June, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
People and Places of Historical Europe: Bootstrapping Annotation Pipeline and a New Corpus of Named Entities in Late Medieval Texts
Authors:
Vít Novotný,
Kristýna Luger,
Michal Štefánik,
Tereza Vrabcová,
Aleš Horák
Abstract:
Although pre-trained named entity recognition (NER) models are highly accurate on modern corpora, they underperform on historical texts due to differences in language OCR errors. In this work, we develop a new NER corpus of 3.6M sentences from late medieval charters written mainly in Czech, Latin, and German.
We show that we can start with a list of known historical figures and locations and an…
▽ More
Although pre-trained named entity recognition (NER) models are highly accurate on modern corpora, they underperform on historical texts due to differences in language OCR errors. In this work, we develop a new NER corpus of 3.6M sentences from late medieval charters written mainly in Czech, Latin, and German.
We show that we can start with a list of known historical figures and locations and an unannotated corpus of historical texts, and use information retrieval techniques to automatically bootstrap a NER-annotated corpus. Using our corpus, we train a NER model that achieves entity-level Precision of 72.81-93.98% with 58.14-81.77% Recall on a manually-annotated test dataset. Furthermore, we show that using a weighted loss function helps to combat class imbalance in token classification tasks. To make it easy for others to reproduce and build upon our work, we publicly release our corpus, models, and experimental code.
△ Less
Submitted 6 June, 2023; v1 submitted 26 May, 2023;
originally announced May 2023.
-
Calc-X and Calcformers: Empowering Arithmetical Chain-of-Thought through Interaction with Symbolic Systems
Authors:
Marek Kadlčík,
Michal Štefánik,
Ondřej Sotolář,
Vlastimil Martinek
Abstract:
Despite outstanding performance in many tasks, language models are notoriously inclined to make factual errors in tasks requiring arithmetic computation. We address this deficiency by creating Calc-X, a collection of datasets that demonstrates the appropriate use of a calculator in reasoning chains. Calc-X is suitable for teaching language models to offload computations to a symbolic system. We su…
▽ More
Despite outstanding performance in many tasks, language models are notoriously inclined to make factual errors in tasks requiring arithmetic computation. We address this deficiency by creating Calc-X, a collection of datasets that demonstrates the appropriate use of a calculator in reasoning chains. Calc-X is suitable for teaching language models to offload computations to a symbolic system. We survey and unify several existing chain-of-thought datasets into a proposed format, resulting in a standard collection of over 300,000 samples requiring arithmetic reasoning. Finally, we use the new Calc-X collection to train open-source calculator-using models we call Calcformers and show that these models approximately double the accuracy of generating correct results compared to vanilla language model baselines. We make all Calc-X datasets, source code and Calcformers models publicly available.
△ Less
Submitted 23 October, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Concept-aware Training Improves In-context Learning Ability of Language Models
Authors:
Michal Štefánik,
Marek Kadlčík
Abstract:
Many recent language models (LMs) of Transformers family exhibit so-called in-context learning (ICL) ability, manifested in the LMs' ability to modulate their function by a task described in a natural language input. Previous work curating these models assumes that ICL emerges from vast over-parametrization or the scale of multi-task training. However, a complementary branch of recent theoretical…
▽ More
Many recent language models (LMs) of Transformers family exhibit so-called in-context learning (ICL) ability, manifested in the LMs' ability to modulate their function by a task described in a natural language input. Previous work curating these models assumes that ICL emerges from vast over-parametrization or the scale of multi-task training. However, a complementary branch of recent theoretical work attributes ICL emergence to specific properties of training data and creates functional in-context learners in small-scale, synthetic settings.
Inspired by recent findings on data properties driving the emergence of ICL, we propose a method to create LMs able to better utilize the in-context information, by constructing training scenarios where it is beneficial for the LM to capture the analogical reasoning concepts. We measure that data sampling of Concept-aware Training (CoAT) consistently improves models' reasoning ability. As a result, the in-context learners trained with CoAT on only two datasets of a single (QA) task perform comparably to larger models trained on 1600+ tasks.
△ Less
Submitted 23 May, 2023;
originally announced May 2023.
-
Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models
Authors:
Lukáš Mikula,
Michal Štefánik,
Marek Petrovič,
Petr Sojka
Abstract:
While the Large Language Models (LLMs) dominate a majority of language understanding tasks, previous work shows that some of these results are supported by modelling spurious correlations of training datasets. Authors commonly assess model robustness by evaluating their models on out-of-distribution (OOD) datasets of the same task, but these datasets might share the bias of the training dataset.…
▽ More
While the Large Language Models (LLMs) dominate a majority of language understanding tasks, previous work shows that some of these results are supported by modelling spurious correlations of training datasets. Authors commonly assess model robustness by evaluating their models on out-of-distribution (OOD) datasets of the same task, but these datasets might share the bias of the training dataset.
We propose a simple method for measuring a scale of models' reliance on any identified spurious feature and assess the robustness towards a large set of known and newly found prediction biases for various pre-trained models and debiasing methods in Question Answering (QA). We find that while existing debiasing methods can mitigate reliance on a chosen spurious feature, the OOD performance gains of these methods can not be explained by mitigated reliance on biased features, suggesting that biases are shared among different QA datasets. Finally, we evidence this to be the case by measuring that the performance of models trained on different QA datasets relies comparably on the same bias features. We hope these results will motivate future work to refine the reports of LMs' robustness to a level of adversarial samples addressing specific spurious features.
△ Less
Submitted 6 February, 2024; v1 submitted 11 May, 2023;
originally announced May 2023.
-
Resources and Few-shot Learners for In-context Learning in Slavic Languages
Authors:
Michal Štefánik,
Marek Kadlčík,
Piotr Gramacki,
Petr Sojka
Abstract:
Despite the rapid recent progress in creating accurate and compact in-context learners, most recent work focuses on in-context learning (ICL) for tasks in English. However, the ability to interact with users of languages outside English presents a great potential for broadening the applicability of language technologies to non-English speakers.
In this work, we collect the infrastructure necessa…
▽ More
Despite the rapid recent progress in creating accurate and compact in-context learners, most recent work focuses on in-context learning (ICL) for tasks in English. However, the ability to interact with users of languages outside English presents a great potential for broadening the applicability of language technologies to non-English speakers.
In this work, we collect the infrastructure necessary for training and evaluation of ICL in a selection of Slavic languages: Czech, Polish, and Russian. We link a diverse set of datasets and cast these into a unified instructional format through a set of transformations and newly-crafted templates written purely in target languages. Using the newly-curated dataset, we evaluate a set of the most recent in-context learners and compare their results to the supervised baselines. Finally, we train, evaluate and publish a set of in-context learning models that we train on the collected resources and compare their performance to previous work.
We find that ICL models tuned in English are also able to learn some tasks from non-English contexts, but multilingual instruction fine-tuning consistently improves the ICL ability. We also find that the massive multitask training can be outperformed by single-task training in the target language, uncovering the potential for specializing in-context learners to the language(s) of their application.
△ Less
Submitted 4 April, 2023;
originally announced April 2023.
-
Can In-context Learners Learn a Reasoning Concept from Demonstrations?
Authors:
Michal Štefánik,
Marek Kadlčík
Abstract:
Language models exhibit an emergent ability to learn a new task from a small number of input-output demonstrations. However, recent work shows that in-context learners largely rely on their pre-trained knowledge, such as the sentiment of the labels, instead of learning new associations from the input. We argue that the commonly-used few-shot evaluation using a random selection of in-context demons…
▽ More
Language models exhibit an emergent ability to learn a new task from a small number of input-output demonstrations. However, recent work shows that in-context learners largely rely on their pre-trained knowledge, such as the sentiment of the labels, instead of learning new associations from the input. We argue that the commonly-used few-shot evaluation using a random selection of in-context demonstrations can not disentangle models' reliance on such biases, as most of the randomly-selected demonstrations do not present relations informative for prediction beyond exposing the task's input-output distribution.
Therefore, to evaluate models' in-context learning ability independent of models' memory, we introduce a Concept-sharing few-shot learning method choosing the demonstrations that share an underlying concept with the predicted sample. We extract a set of such concepts from available human explanations and measure how much models can benefit from presenting these concepts in few-shot demonstrations.
We find that most of the recent in-context learners can not consistently benefit from the demonstrated concepts, irrespective of the model size. However, we note that T0 models are more sensitive to exhibited concepts, benefiting from concept-sharing demonstrations in 7 out of 8 evaluation scenarios.
△ Less
Submitted 19 July, 2023; v1 submitted 3 December, 2022;
originally announced December 2022.
-
Soft Alignment Objectives for Robust Adaptation of Language Generation
Authors:
Michal Štefánik,
Marek Kadlčík,
Petr Sojka
Abstract:
Domain adaptation allows generative language models to address specific flaws caused by the domain shift of their application. However, the traditional adaptation by further training on in-domain data rapidly weakens the model's ability to generalize to other domains, making the open-ended deployments of the adapted models prone to errors. This work introduces novel training objectives built upon…
▽ More
Domain adaptation allows generative language models to address specific flaws caused by the domain shift of their application. However, the traditional adaptation by further training on in-domain data rapidly weakens the model's ability to generalize to other domains, making the open-ended deployments of the adapted models prone to errors. This work introduces novel training objectives built upon a semantic similarity of the predicted tokens to the reference.
Our results show that (1) avoiding the common assumption of a single correct prediction by constructing the training target from tokens' semantic similarity can mitigate catastrophic forgetting during domain adaptation, while (2) preserving the quality of the adaptation, (3) with negligible additions to compute costs.
In the broader context, the objectives grounded in a continuous token similarity pioneer the exploration of the middle ground between the efficient but naïve exact-match token-level objectives and expressive but computationally- and resource-intensive sequential objectives.
△ Less
Submitted 26 May, 2023; v1 submitted 29 November, 2022;
originally announced November 2022.
-
Methods for Estimating and Improving Robustness of Language Models
Authors:
Michal Štefánik
Abstract:
Despite their outstanding performance, large language models (LLMs) suffer notorious flaws related to their preference for simple, surface-level textual relations over full semantic complexity of the problem. This proposal investigates a common denominator of this problem in their weak ability to generalise outside of the training domain. We survey diverse research directions providing estimations…
▽ More
Despite their outstanding performance, large language models (LLMs) suffer notorious flaws related to their preference for simple, surface-level textual relations over full semantic complexity of the problem. This proposal investigates a common denominator of this problem in their weak ability to generalise outside of the training domain. We survey diverse research directions providing estimations of model generalisation ability and find that incorporating some of these measures in the training objectives leads to enhanced distributional robustness of neural models. Based on these findings, we present future research directions towards enhancing the robustness of LLMs.
△ Less
Submitted 16 June, 2022;
originally announced June 2022.
-
Adaptor: Objective-Centric Adaptation Framework for Language Models
Authors:
Michal Štefánik,
Vít Novotný,
Nikola Groverová,
Petr Sojka
Abstract:
Progress in natural language processing research is catalyzed by the possibilities given by the widespread software frameworks. This paper introduces Adaptor library that transposes the traditional model-centric approach composed of pre-training + fine-tuning steps to objective-centric approach, composing the training process by applications of selected objectives. We survey research directions th…
▽ More
Progress in natural language processing research is catalyzed by the possibilities given by the widespread software frameworks. This paper introduces Adaptor library that transposes the traditional model-centric approach composed of pre-training + fine-tuning steps to objective-centric approach, composing the training process by applications of selected objectives. We survey research directions that can benefit from enhanced objective-centric experimentation in multitask training, custom objectives development, dynamic training curricula, or domain adaptation. Adaptor aims to ease reproducibility of these research directions in practice. Finally, we demonstrate the practical applicability of Adaptor in selected unsupervised domain adaptation scenarios.
△ Less
Submitted 20 May, 2022; v1 submitted 8 March, 2022;
originally announced March 2022.
-
Regressive Ensemble for Machine Translation Quality Evaluation
Authors:
Michal Štefánik,
Vít Novotný,
Petr Sojka
Abstract:
This work introduces a simple regressive ensemble for evaluating machine translation quality based on a set of novel and established metrics. We evaluate the ensemble using a correlation to expert-based MQM scores of the WMT 2021 Metrics workshop. In both monolingual and zero-shot cross-lingual settings, we show a significant performance improvement over single metrics. In the cross-lingual settin…
▽ More
This work introduces a simple regressive ensemble for evaluating machine translation quality based on a set of novel and established metrics. We evaluate the ensemble using a correlation to expert-based MQM scores of the WMT 2021 Metrics workshop. In both monolingual and zero-shot cross-lingual settings, we show a significant performance improvement over single metrics. In the cross-lingual settings, we also demonstrate that an ensemble approach is well-applicable to unseen languages. Furthermore, we identify a strong reference-free baseline that consistently outperforms the commonly-used BLEU and METEOR measures and significantly improves our ensemble's performance.
△ Less
Submitted 15 September, 2021;
originally announced September 2021.
-
WebMIaS on Docker: Deploying Math-Aware Search in a Single Line of Code
Authors:
Dávid Lupták,
Vít Novotný,
Michal Štefánik,
Petr Sojka
Abstract:
Math informational retrieval (MIR) search engines are absent in the wide-spread production use, even though documents in the STEM fields contain many mathematical formulae, which are sometimes more important than text for understanding. We have developed and open-sourced the WebMIaS MIR search engine that has been successfully deployed in the European Digital Mathematics Library (EuDML). However,…
▽ More
Math informational retrieval (MIR) search engines are absent in the wide-spread production use, even though documents in the STEM fields contain many mathematical formulae, which are sometimes more important than text for understanding. We have developed and open-sourced the WebMIaS MIR search engine that has been successfully deployed in the European Digital Mathematics Library (EuDML). However, its deployment is difficult to automate due to the complexity of this task. Moreover, the solutions developed so far to tackle this challenge are imperfect in terms of speed, maintenance, and robustness. In this paper, we will describe the virtualization of WebMIaS using Docker that solves all three problems and allows anyone to deploy containerized WebMIaS in a single line of code. The publicly available Docker image will also help the community push the development of math-aware search engines in the ARQMath workshop series.
△ Less
Submitted 14 July, 2021; v1 submitted 1 June, 2021;
originally announced June 2021.
-
When FastText Pays Attention: Efficient Estimation of Word Representations using Constrained Positional Weighting
Authors:
Vít Novotný,
Michal Štefánik,
Eniafe Festus Ayetiran,
Petr Sojka,
Radim Řehůřek
Abstract:
In 2018, Mikolov et al. introduced the positional language model, which has characteristics of attention-based neural machine translation models and which achieved state-of-the-art performance on the intrinsic word analogy task. However, the positional model is not practically fast and it has never been evaluated on qualitative criteria or extrinsic tasks. We propose a constrained positional model…
▽ More
In 2018, Mikolov et al. introduced the positional language model, which has characteristics of attention-based neural machine translation models and which achieved state-of-the-art performance on the intrinsic word analogy task. However, the positional model is not practically fast and it has never been evaluated on qualitative criteria or extrinsic tasks. We propose a constrained positional model, which adapts the sparse attention mechanism from neural machine translation to improve the speed of the positional model. We evaluate the positional and constrained positional models on three novel qualitative criteria and on language modeling. We show that the positional and constrained positional models contain interpretable information about the grammatical properties of words and outperform other shallow models on language modeling. We also show that our constrained model outperforms the positional model on language modeling and trains twice as fast.
△ Less
Submitted 28 February, 2022; v1 submitted 19 April, 2021;
originally announced April 2021.
-
One Size Does Not Fit All: Finding the Optimal Subword Sizes for FastText Models across Languages
Authors:
Vít Novotný,
Eniafe Festus Ayetiran,
Dalibor Bačovský,
Dávid Lupták,
Michal Štefánik,
Petr Sojka
Abstract:
Unsupervised representation learning of words from large multilingual corpora is useful for downstream tasks such as word sense disambiguation, semantic text similarity, and information retrieval. The representation precision of log-bilinear fastText models is mostly due to their use of subword information. In previous work, the optimization of fastText's subword sizes has not been fully explored,…
▽ More
Unsupervised representation learning of words from large multilingual corpora is useful for downstream tasks such as word sense disambiguation, semantic text similarity, and information retrieval. The representation precision of log-bilinear fastText models is mostly due to their use of subword information. In previous work, the optimization of fastText's subword sizes has not been fully explored, and non-English fastText models were trained using subword sizes optimized for English and German word analogy tasks. In our work, we find the optimal subword sizes on the English, German, Czech, Italian, Spanish, French, Hindi, Turkish, and Russian word analogy tasks. We then propose a simple n-gram coverage model and we show that it predicts better-than-default subword sizes on the Spanish, French, Hindi, Turkish, and Russian word analogy tasks. We show that the optimization of fastText's subword sizes matters and results in a 14% improvement on the Czech word analogy task. We also show that expensive parameter optimization can be replaced by a simple n-gram coverage model that consistently improves the accuracy of fastText models on the word analogy tasks by up to 3% compared to the default subword sizes, and that it is within 1% accuracy of the optimal subword sizes.
△ Less
Submitted 20 September, 2021; v1 submitted 4 February, 2021;
originally announced February 2021.
-
Text classification with word embedding regularization and soft similarity measure
Authors:
Vít Novotný,
Eniafe Festus Ayetiran,
Michal Štefánik,
Petr Sojka
Abstract:
Since the seminal work of Mikolov et al., word embeddings have become the preferred word representations for many natural language processing tasks. Document similarity measures extracted from word embeddings, such as the soft cosine measure (SCM) and the Word Mover's Distance (WMD), were reported to achieve state-of-the-art performance on semantic text similarity and text classification.
Despit…
▽ More
Since the seminal work of Mikolov et al., word embeddings have become the preferred word representations for many natural language processing tasks. Document similarity measures extracted from word embeddings, such as the soft cosine measure (SCM) and the Word Mover's Distance (WMD), were reported to achieve state-of-the-art performance on semantic text similarity and text classification.
Despite the strong performance of the WMD on text classification and semantic text similarity, its super-cubic average time complexity is impractical. The SCM has quadratic worst-case time complexity, but its performance on text classification has never been compared with the WMD. Recently, two word embedding regularization techniques were shown to reduce storage and memory costs, and to improve training speed, document processing speed, and task performance on word analogy, word similarity, and semantic text similarity. However, the effect of these techniques on text classification has not yet been studied.
In our work, we investigate the individual and joint effect of the two word embedding regularization techniques on the document processing speed and the task performance of the SCM and the WMD on text classification. For evaluation, we use the $k$NN classifier and six standard datasets: BBCSPORT, TWITTER, OHSUMED, REUTERS-21578, AMAZON, and 20NEWS.
We show 39% average $k$NN test error reduction with regularized word embeddings compared to non-regularized word embeddings. We describe a practical procedure for deriving such regularized embeddings through Cholesky factorization. We also show that the SCM with regularized word embeddings significantly outperforms the WMD on text classification and is over 10,000 times faster.
△ Less
Submitted 10 March, 2020;
originally announced March 2020.