Search | arXiv e-print repository

doi 10.3390/app14198887

Extracting Sentence Embeddings from Pretrained Transformer Models

Authors: Lukas Stankevičius, Mantas Lukoševičius

Abstract: Pre-trained transformer models shine in many natural language processing tasks and therefore are expected to bear the representation of the input sentence or text meaning. These sentence-level embeddings are also important in retrieval-augmented generation. But do commonly used plain averaging or prompt templates sufficiently capture and represent the underlying meaning? After providing a comprehensive review of existing sentence embedding extraction and refinement methods, we thoroughly test different combinations and our original extensions of the most promising ones on pretrained models. Namely, given 110 M parameters, BERT's hidden representations from multiple layers, and many tokens, we try diverse ways to extract optimal sentence embeddings. We test various token aggregation and representation post-processing techniques. We also test multiple ways of using a general Wikitext dataset to complement BERT's sentence embeddings. All methods are tested on eight Semantic Textual Similarity (STS), six short text clustering, and twelve classification tasks. We also evaluate our representation-shaping techniques on other static models, including random token representations. Proposed representation extraction methods improve the performance on STS and clustering tasks for all models considered. Very high improvements for static token-based models, especially random embeddings for STS tasks, almost reach the performance of BERT-derived representations. Our work shows that the representation-shaping techniques significantly improve sentence embeddings extracted from BERT-based and simple baseline models.

Submitted 20 February, 2025; v1 submitted 15 August, 2024; originally announced August 2024.

Comments: Postprint update

MSC Class: 68T07; 68T50; 68T05; ACM Class: I.2.6; I.2.7

Journal ref: Appl. Sci. 2024, 14(19), 8887

arXiv:2407.19914 [pdf]

Sentiment Analysis of Lithuanian Online Reviews Using Large Language Models

Authors: Brigita Vileikytė, Mantas Lukoševičius, Lukas Stankevičius

Abstract: Sentiment analysis is a widely researched area within Natural Language Processing (NLP), attracting significant interest due to the advent of automated solutions. Despite this, the task remains challenging because of the inherent complexity of languages and the subjective nature of sentiments. It is even more challenging for less-studied and less-resourced languages such as Lithuanian. Our review of existing Lithuanian NLP research reveals that traditional machine learning methods and classification algorithms have limited effectiveness for the task. In this work, we address sentiment analysis of Lithuanian five-star-based online reviews from multiple domains that we collect and clean. We apply transformer models to this task for the first time, exploring the capabilities of pre-trained multilingual Large Language Models (LLMs), specifically focusing on fine-tuning BERT and T5 models. Given the inherent difficulty of the task, the fine-tuned models perform quite well, especially when the sentiments themselves are less ambiguous: 80.74% and 89.61% testing recognition accuracy of the most popular one- and five-star reviews respectively. They significantly outperform current commercial state-of-the-art general-purpose LLM GPT-4. We openly share our fine-tuned LLMs online.

Submitted 29 July, 2024; originally announced July 2024.

Comments: Accepted at the 29th International Conference on Information Society and University Studies (IVUS 2024)

MSC Class: 68T07; 68T50; 68T05; ACM Class: I.2.6; I.2.7

arXiv:2403.04914 [pdf]

Improving the Equation of Exchange for Cryptoasset Valuation Using Empirical Data

Authors: Stylianos Kampakis, Melody Yuan, Oritsebawo Paul Ikpobe, Linas Stankevicius

Abstract: In the evolving domain of cryptocurrency markets, accurate token valuation remains a critical aspect influencing investment decisions and policy development. Whilst the prevailing equation of exchange pricing model offers a quantitative valuation approach based on the interplay between token price, transaction volume, supply, and either velocity or holding time, it exhibits intrinsic shortcomings. Specifically, the model may not consistently delineate the relationship between average token velocity and holding time. This paper aims to refine this equation, enhancing the depth of insight into token valuation methodologies.

Submitted 7 March, 2024; originally announced March 2024.

arXiv:2203.09963 [pdf, other]

Towards Lithuanian grammatical error correction

Authors: Lukas Stankevičius, Mantas Lukoševičius

Abstract: Everyone wants to write beautiful and correct text, yet the lack of language skills, experience, or hasty typing can result in errors. By employing the recent advances in transformer architectures, we construct a grammatical error correction model for Lithuanian, the language rich in archaic features. We compare subword and byte-level approaches and share our best trained model, achieving F$_{0.5}$=0.92, and accompanying code, in an online open-source repository.

Submitted 18 March, 2022; originally announced March 2022.

MSC Class: 68T07; 68T50; 68T05; ACM Class: I.2.6; I.2.7

arXiv:2201.13242 [pdf, other]

doi 10.3390/app12052636

Correcting diacritics and typos with a ByT5 transformer model

Authors: Lukas Stankevičius, Mantas Lukoševičius, Jurgita Kapočiūtė-Dzikienė, Monika Briedienė, Tomas Krilavičius

Abstract: Due to the fast pace of life and online communications and the prevalence of English and the QWERTY keyboard, people tend to forgo using diacritics, make typographical errors (typos) when typing in other languages. Restoring diacritics and correcting spelling is important for proper language use and the disambiguation of texts for both humans and downstream algorithms. However, both of these problems are typically addressed separately: the state-of-the-art diacritics restoration methods do not tolerate other typos, but classical spellcheckers also cannot deal adequately with all the diacritics missing. In this work, we tackle both problems at once by employing the newly-developed universal ByT5 byte-level seq2seq transformer model that requires no language-specific model structures. For a comparison, we perform diacritics restoration on benchmark datasets of 12 languages, with the addition of Lithuanian. The experimental investigation proves that our approach is able to achieve results (> 98%) comparable to the previous state-of-the-art, despite being trained less and on fewer data. Our approach is also able to restore diacritics in words not seen during training with > 76% accuracy. Our simultaneous diacritics restoration and typos correction approach reaches > 94% alpha-word accuracy on the 13 languages. It has no direct competitors and strongly outperforms classical spell-checking or dictionary-based approaches. We also demonstrate all the accuracies to further improve with more training. Taken together, this shows the great real-world application potential of our suggested methods to more data, languages, and error classes.

Submitted 18 March, 2022; v1 submitted 31 January, 2022; originally announced January 2022.

MSC Class: 68T07; 68T50; 68T05; ACM Class: I.2.6; I.2.7

Journal ref: Appl. Sci. 2022, 12(5), 2636

arXiv:2105.03279 [pdf, other]

doi 10.1007/978-3-030-88304-1_27

Generating abstractive summaries of Lithuanian news articles using a transformer model

Authors: Lukas Stankevičius, Mantas Lukoševičius

Abstract: In this work, we train the first monolingual Lithuanian transformer model on a relatively large corpus of Lithuanian news articles and compare various output decoding algorithms for abstractive news summarization. We achieve an average ROUGE-2 score 0.163, generated summaries are coherent and look impressive at first glance. However, some of them contain misleading information that is not so easy to spot. We describe all the technical details and share our trained model and accompanying code in an online open-source repository, as well as some characteristic samples of the generated summaries.

Submitted 22 June, 2021; v1 submitted 23 April, 2021; originally announced May 2021.

Comments: Accepted in ICIST 2021

MSC Class: 68T07; 68T50; 68T05; ACM Class: I.2.6; I.2.7

Journal ref: International Conference on Information and Software Technologies - ICIST 2021, Communications in Computer and Information Science, vol 1486 (2021) 341-352

arXiv:2004.03461 [pdf, other]

Testing pre-trained Transformer models for Lithuanian news clustering

Authors: Lukas Stankevičius, Mantas Lukoševičius

Abstract: A recent introduction of Transformer deep learning architecture made breakthroughs in various natural language processing tasks. However, non-English languages could not leverage such new opportunities with the English text pre-trained models. This changed with research focusing on multilingual models, where less-spoken languages are the main beneficiaries. We compare pre-trained multilingual BERT, XLM-R, and older learned text representation methods as encodings for the task of Lithuanian news clustering. Our results indicate that publicly available pre-trained multilingual Transformer models can be fine-tuned to surpass word vectors but still score much lower than specially trained doc2vec embeddings.

Submitted 3 April, 2020; originally announced April 2020.

Comments: Submission accepted at https://ivus.ktu.edu/

MSC Class: 68T05; ACM Class: I.2.6

Journal ref: Proceedings of the Information Society and University Studies 2020, pp. 46-53, vol. 2698, CEUR, Kaunas, 2020, ISSN: 1613-0073

arXiv:1904.01880 [pdf]

doi 10.1016/j.apsusc.2016.05.100

Patterning of diamond like carbon films for sensor applications using silicon containing thermoplastic resist (SiPol) as a hard mask

Authors: D. Virganavičius, V. J. Cadarso, R. Kirchner, L. Stankevičius, T. Tamulevičius, S. Tamulevičius, H. Schift

Abstract: Patterning of diamond-like carbon (DLC) and DLC:metal nanocomposites is of interest for an increasing number of applications. We demonstrate a nanoimprint lithography process based on silicon containing thermoplastic resist combined with plasma etching for straightforward patterning of such films. A variety of different structures with few hundred nanometer feature size and moderate aspect ratios were successfully realized. The quality of produced patterns was directly investigated by the means of optical and scanning electron microscopy (SEM). Such structures were further assessed by employing them in the development of gratings for guided mode resonance (GMR) effect. Optical characterization of such leaky waveguide was compared with numerical simulations based on rigorous coupled wave analysis method with good agreement. The use of such structures as refractive index variation sensors is demonstrated with sensitivity up to 319 nm/RIU, achieving an improvement close to 450% in sensitivity compared to previously reported similar sensors. This pronounced GMR signal fully validates the employed DLC material, the technology to pattern it and the possibility to develop DLC based gratings as corrosion and wear resistant refractometry sensors that are able to operate under harsh conditions providing great value and versatility.

Submitted 3 April, 2019; originally announced April 2019.

Comments: 24 pages, 9 figures

Journal ref: Applied Surface Science, Volume 385, 1 November 2016, Pages 145-152

Showing 1–8 of 8 results for author: Stankevičius, L