Showing 1–10 of 10 results for author: Mikhaylovskiy, N

Searching in archive cs.

  1. arXiv:2503.06330  [pdf, other]

    cs.CL cond-mat.stat-mech cs.AI

    States of LLM-generated Texts and Phase Transitions between them

    Authors: Nikolay Mikhaylovskiy

    Abstract: It has been known for some time that autocorrelations of words in human-written texts decay according to a power law. Recent works have also shown that the autocorrelations decay in texts generated by LLMs is qualitatively different from that in literary texts. Solid state physics ties the autocorrelations decay laws to the states of matter. In this work, we empirically demonstrate that, depending on the tem…

    Submitted 8 March, 2025; originally announced March 2025.

    Comments: Published as a conference paper at MathAI 2025

  2. arXiv:2306.02334  [pdf]

    cs.CL

    Long Text Generation Challenge

    Authors: Nikolay Mikhaylovskiy

    Abstract: We propose a shared task of human-like long text generation, LTG Challenge, that asks models to output a consistent human-like long text (a Harry Potter generic audience fanfic in English), given a prompt of about 1000 tokens. We suggest a novel statistical metric of the text structuredness, GloVe Autocorrelations Power/Exponential Law Mean Absolute Percentage Error Ratio (GAPELMAPER) and a human…

    Submitted 4 June, 2023; originally announced June 2023.

    Comments: Submitted to INLG 2023

    ACM Class: I.2.7

  3. arXiv:2305.06615  [pdf]

    cs.CL

    Autocorrelations Decay in Texts and Applicability Limits of Language Models

    Authors: Nikolay Mikhaylovskiy, Ilya Churilov

    Abstract: We show that the laws of autocorrelations decay in texts are closely related to applicability limits of language models. Using distributional semantics we empirically demonstrate that autocorrelations of words in texts decay according to a power law. We show that distributional semantics provides coherent autocorrelations decay exponents for texts translated to multiple languages. The autocorrelat…

    Submitted 11 May, 2023; originally announced May 2023.

    Comments: Accepted to Dialog-2023

    ACM Class: I.2.7

  4. arXiv:2208.13021  [pdf]

    cs.CL cs.AI

    On Unsupervised Training of Link Grammar Based Language Models

    Authors: Nikolay Mikhaylovskiy

    Abstract: In this short note we explore what is needed for the unsupervised training of graph language models based on link grammars. First, we introduce the termination tags formalism required to build a language model based on a link grammar formalism of Sleator and Temperley [21] and discuss the influence of context on the unsupervised learning of link grammars. Second, we propose a statistical link gr…

    Submitted 27 August, 2022; originally announced August 2022.

    Comments: Presented at INLP workshop at AGI-2022

  5. arXiv:2208.12356   

    cs.IR cs.AI

    Lib-SibGMU -- A University Library Circulation Dataset for Recommender Systems Development

    Authors: Eduard Zubchuk, Mikhail Arhipkin, Dmitry Menshikov, Aleksandr Karaush, Nikolay Mikhaylovskiy

    Abstract: We open-source Lib-SibGMU, a university library circulation dataset, under the CC BY 4.0 license for a wide research community, and benchmark major algorithms for recommender systems on this dataset. For a recommender architecture that consists of a vectorizer that turns the history of the books borrowed into a vector, and a neighborhood-based recommender, trained separately, we show that using the f…

    Submitted 11 August, 2023; v1 submitted 25 August, 2022; originally announced August 2022.

    Comments: Dataset copyright discussion

  6. arXiv:2202.04145  [pdf, other]

    cs.AI cs.IR

    Using a Language Model in a Kiosk Recommender System at Fast-Food Restaurants

    Authors: Eduard Zubchuk, Dmitry Menshikov, Nikolay Mikhaylovskiy

    Abstract: Kiosks are a popular self-service option in many fast-food restaurants; they save time for the visitors and save labor for the fast-food chains. In this paper, we propose an effective design of a kiosk shopping cart recommender system that combines a language model as a vectorizer and a neural network-based classifier. The model performs better than other models in offline tests and exhibits perfo…

    Submitted 8 February, 2022; originally announced February 2022.

  7. arXiv:2106.00052  [pdf]

    eess.AS cs.CL cs.LG cs.SD

    Low-Resource Spoken Language Identification Using Self-Attentive Pooling and Deep 1D Time-Channel Separable Convolutions

    Authors: Roman Bedyakin, Nikolay Mikhaylovskiy

    Abstract: This memo describes the NTR/TSU winning submission for the Low Resource ASR challenge at the Dialog2021 conference, language identification track. Spoken Language Identification (LID) is an important step in a multilingual Automated Speech Recognition (ASR) system pipeline. Traditionally, the ASR task requires large volumes of labeled data that are unattainable for most of the world's languages, including m…

    Submitted 31 May, 2021; originally announced June 2021.

    Comments: Accepted to Dialog2021. arXiv admin note: text overlap with arXiv:2104.11985

  8. Language ID Prediction from Speech Using Self-Attentive Pooling and 1D-Convolutions

    Authors: Roman Bedyakin, Nikolay Mikhaylovskiy

    Abstract: This memo describes the NTR-TSU submission for the SIGTYP 2021 Shared Task on predicting language IDs from speech. Spoken Language Identification (LID) is an important step in a multilingual Automated Speech Recognition (ASR) system pipeline. For many low-resource and endangered languages, only single-speaker recordings may be available, creating a need for domain and speaker-invariant language ID syst…

    Submitted 24 April, 2021; originally announced April 2021.

    Comments: Accepted to SIGTYP-2021

  9. arXiv:2103.16193  [pdf]

    eess.AS cs.SD

    MediaSpeech: Multilanguage ASR Benchmark and Dataset

    Authors: Rostislav Kolobov, Olga Okhapkina, Olga Omelchishina, Andrey Platunov, Roman Bedyakin, Vyacheslav Moshkin, Dmitry Menshikov, Nikolay Mikhaylovskiy

    Abstract: The performance of automated speech recognition (ASR) systems is well known to differ for varied application domains. At the same time, vendors and research groups typically report ASR quality results either for limited use simplistic domains (audiobooks, TED talks), or proprietary datasets. To fill this gap, we provide an open-source 10-hour ASR system evaluation dataset NTR MediaSpeech for 4 lan…

    Submitted 30 March, 2021; originally announced March 2021.

  10. arXiv:2101.04792  [pdf]

    eess.AS cs.AI cs.LG

    Learning Efficient Representations for Keyword Spotting with Triplet Loss

    Authors: Roman Vygon, Nikolay Mikhaylovskiy

    Abstract: In the past few years, triplet loss-based metric embeddings have become a de-facto standard for several important computer vision problems, most notably, person reidentification. On the other hand, in the area of speech recognition the metric embeddings generated by the triplet loss are rarely used even for classification problems. We fill this gap showing that a combination of two representation…

    Submitted 4 June, 2021; v1 submitted 12 January, 2021; originally announced January 2021.

    Comments: Submitted to SPECOM 2021

    Journal ref: In: Karpov A., Potapova R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol 12997. Springer, Cham