-
Syntactic Transfer to Kyrgyz Using the Treebank Translation Method
Authors:
Anton Alekseev,
Alina Tillabaeva,
Gulnara Dzh. Kabaeva,
Sergey I. Nikolenko
Abstract:
The Kyrgyz language, as a low-resource language, requires significant effort to create high-quality syntactic corpora. This study proposes an approach to simplify the development process of a syntactic corpus for Kyrgyz. We present a tool for transferring syntactic annotations from Turkish to Kyrgyz based on a treebank translation method. The effectiveness of the proposed tool was evaluated using the TueCL treebank. The results demonstrate that this approach achieves higher syntactic annotation accuracy compared to a monolingual model trained on the Kyrgyz KTMU treebank. Additionally, the study introduces a method for assessing the complexity of manual annotation for the resulting syntactic trees, contributing to further optimization of the annotation process.
Submitted 17 December, 2024;
originally announced December 2024.
-
Benchmarking Multilabel Topic Classification in the Kyrgyz Language
Authors:
Anton Alekseev,
Sergey I. Nikolenko,
Gulnara Kabaeva
Abstract:
Kyrgyz is a very underrepresented language in terms of modern natural language processing resources. In this work, we present a new public benchmark for topic classification in Kyrgyz, introducing a dataset based on collected and annotated data from the news site 24.KG and presenting several baseline models for news classification in the multilabel setting. We train and evaluate both classical statistical and neural models, reporting the scores, discussing the results, and proposing directions for future work.
Submitted 30 August, 2023;
originally announced August 2023.
-
RecVAE: a New Variational Autoencoder for Top-N Recommendations with Implicit Feedback
Authors:
Ilya Shenbin,
Anton Alekseev,
Elena Tutubalina,
Valentin Malykh,
Sergey I. Nikolenko
Abstract:
Recent research has shown the advantages of using autoencoders based on deep neural networks for collaborative filtering. In particular, the recently proposed Mult-VAE model, which used the multinomial likelihood variational autoencoders, has shown excellent results for top-N recommendations. In this work, we propose the Recommender VAE (RecVAE) model that originates from our research on regularization techniques for variational autoencoders. RecVAE introduces several novel ideas to improve Mult-VAE, including a novel composite prior distribution for the latent codes, a new approach to setting the $β$ hyperparameter for the $β$-VAE framework, and a new approach to training based on alternating updates. In experimental evaluation, we show that RecVAE significantly outperforms previously proposed autoencoder-based models, including Mult-VAE and RaCT, across classical collaborative filtering datasets, and present a detailed ablation study to assess our new developments. Code and models are available at https://github.com/ilya-shenbin/RecVAE.
Submitted 23 December, 2019;
originally announced December 2019.
-
Synthetic Data for Deep Learning
Authors:
Sergey I. Nikolenko
Abstract:
Synthetic data is an increasingly popular tool for training deep learning models, especially in computer vision but also in other areas. In this work, we attempt to provide a comprehensive survey of the various directions in the development and application of synthetic data. First, we discuss synthetic datasets for basic computer vision problems, both low-level (e.g., optical flow estimation) and high-level (e.g., semantic segmentation), synthetic environments and datasets for outdoor and urban scenes (autonomous driving), indoor scenes (indoor navigation), aerial navigation, simulation environments for robotics, applications of synthetic data outside computer vision (in neural programming, bioinformatics, NLP, and more); we also survey the work on improving synthetic data development and alternative ways to produce it such as GANs. Second, we discuss in detail the synthetic-to-real domain adaptation problem that inevitably arises in applications of synthetic data, including synthetic-to-real refinement with GAN-based models and domain adaptation at the feature/model level without explicit data transformations. Third, we turn to privacy-related applications of synthetic data and review the work on generating synthetic datasets with differential privacy guarantees. We conclude by highlighting the most promising directions for further work in synthetic data studies.
Submitted 25 September, 2019;
originally announced September 2019.
-
New Competitiveness Bounds for the Shared Memory Switch
Authors:
Ivan Bochkov,
Alex Davydow,
Nikita Gaevoy,
Sergey I. Nikolenko
Abstract:
We consider one of the simplest and best known buffer management architectures: the shared memory switch with multiple output queues and uniform packets. It was one of the first models studied by competitive analysis, with the Longest Queue Drop (LQD) buffer management policy shown to be at least $\sqrt{2}$- and at most $2$-competitive; a general lower bound of $4/3$ has been proven for all deterministic online algorithms. Closing the gap between $\sqrt{2}$ and $2$ has remained an open problem in competitive analysis for more than a decade, with only marginal success in reducing the upper bound of $2$. In this work, we first present a simplified proof for the $\sqrt{2}$ lower bound for LQD and then, using a reduction to the continuous case, improve the general lower bound for all deterministic online algorithms from $\frac 43$ to $\sqrt{2}$. Then, we proceed to improve the lower bound of $\sqrt{2}$ specifically for LQD, showing that LQD is at least $1.44546086$-competitive. We are able to prove the bound by presenting an explicit construction of the optimal clairvoyant algorithm which then allows for two different ways to prove lower bounds: by direct computer simulations and by proving lower bounds via linear programming. The linear programming approach yields a lower bound for LQD of $1.4427902$ (still larger than $\sqrt{2}$).
Submitted 9 July, 2019;
originally announced July 2019.
-
AspeRa: Aspect-based Rating Prediction Model
Authors:
Sergey I. Nikolenko,
Elena Tutubalina,
Valentin Malykh,
Ilya Shenbin,
Anton Alekseev
Abstract:
We propose a novel end-to-end Aspect-based Rating Prediction model (AspeRa) that estimates user rating based on review texts for the items and at the same time discovers coherent aspects of reviews that can be used to explain predictions or profile users. The AspeRa model uses max-margin losses for joint item and user embedding learning and a dual-headed architecture; it significantly outperforms recently proposed state-of-the-art models such as DeepCoNN, HFT, NARRE, and TransRev on two real world data sets of user reviews. With qualitative examination of the aspects and quantitative evaluation of rating prediction models based on these aspects, we show how aspect embeddings can be used in a recommender system.
Submitted 23 January, 2019;
originally announced January 2019.
-
Adapting Convolutional Neural Networks for Geographical Domain Shift
Authors:
Pavel Ostyakov,
Sergey I. Nikolenko
Abstract:
We present the winning solution for the Inclusive Images Competition organized as part of the Conference on Neural Information Processing Systems (NeurIPS 2018) Competition Track. The competition was organized to study ways to cope with domain shift in image processing, specifically geographical shift: the training and two test sets in the competition had different geographical distributions. Our solution has proven to be relatively straightforward and simple: it is an ensemble of several CNNs where only the last layer is fine-tuned with the help of a small labeled set of tuning labels made available by the organizers. We believe that while domain shift remains a formidable problem, our approach opens up new possibilities for alleviating this problem in practice, where small labeled datasets from the target domain are usually either available or can be obtained and labeled cheaply.
Submitted 18 January, 2019;
originally announced January 2019.
-
Learning State Representations in Complex Systems with Multimodal Data
Authors:
Pavel Solovev,
Vladimir Aliev,
Pavel Ostyakov,
Gleb Sterkin,
Elizaveta Logacheva,
Stepan Troeshestov,
Roman Suvorov,
Anton Mashikhin,
Oleg Khomenko,
Sergey I. Nikolenko
Abstract:
Representation learning becomes especially important for complex systems with multimodal data sources such as cameras or sensors. Recent advances in reinforcement learning and optimal control make it possible to design control algorithms on these latent representations, but the field still lacks a large-scale standard dataset for unified comparison. In this work, we present a large-scale dataset and evaluation framework for representation learning for the complex task of landing an airplane. We implement and compare several approaches to representation learning on this dataset in terms of the quality of simple supervised learning tasks and disentanglement scores. The resulting representations can be used for further tasks such as anomaly detection, optimal control, model-based reinforcement learning, and other applications.
Submitted 15 January, 2019; v1 submitted 27 November, 2018;
originally announced November 2018.
-
SEIGAN: Towards Compositional Image Generation by Simultaneously Learning to Segment, Enhance, and Inpaint
Authors:
Pavel Ostyakov,
Roman Suvorov,
Elizaveta Logacheva,
Oleg Khomenko,
Sergey I. Nikolenko
Abstract:
We present a novel approach to image manipulation and understanding by simultaneously learning to segment object masks, paste objects to another background image, and remove them from original images. For this purpose, we develop a novel generative model for compositional image generation, SEIGAN (Segment-Enhance-Inpaint Generative Adversarial Network), which learns these three operations together in an adversarial architecture with additional cycle consistency losses. To train, SEIGAN needs only bounding box supervision and does not require pairing or ground truth masks. SEIGAN produces better generated images (evaluated by human assessors) than other approaches and produces high-quality segmentation masks, improving over other adversarially trained approaches and getting closer to the results of fully supervised training.
Submitted 15 January, 2019; v1 submitted 19 November, 2018;
originally announced November 2018.
-
Label Denoising with Large Ensembles of Heterogeneous Neural Networks
Authors:
Pavel Ostyakov,
Elizaveta Logacheva,
Roman Suvorov,
Vladimir Aliev,
Gleb Sterkin,
Oleg Khomenko,
Sergey I. Nikolenko
Abstract:
Despite recent advances in computer vision based on various convolutional architectures, video understanding remains an important challenge. In this work, we present and discuss a top solution for the large-scale video classification (labeling) problem introduced as a Kaggle competition based on the YouTube-8M dataset. We show and compare different approaches to preprocessing, data augmentation, model architectures, and model combination. Our final model is based on a large ensemble of video- and frame-level models but fits into rather limiting hardware constraints. We apply an approach based on knowledge distillation to deal with noisy labels in the original dataset and the recently developed mixup technique to improve the basic models.
Submitted 15 January, 2019; v1 submitted 12 September, 2018;
originally announced September 2018.
-
BayesHammer: Bayesian clustering for error correction in single-cell sequencing
Authors:
Sergey I. Nikolenko,
Anton I. Korobeynikov,
Max A. Alekseyev
Abstract:
Error correction of sequenced reads remains a difficult task, especially in single-cell sequencing projects with extremely non-uniform coverage. While existing error correction tools designed for standard (multi-cell) sequencing data usually come up short in single-cell sequencing projects, algorithms actually used for single-cell error correction have been so far very simplistic.
We introduce several novel algorithms based on Hamming graphs and Bayesian subclustering in our new error correction tool BayesHammer. While BayesHammer was designed for single-cell sequencing, we demonstrate that it also improves on existing error correction tools for multi-cell sequencing data while working much faster on real-life datasets. We benchmark BayesHammer on both $k$-mer counts and actual assembly results with the SPAdes genome assembler.
Submitted 12 November, 2012;
originally announced November 2012.
-
FIFO Queueing Policies for Packets with Heterogeneous Processing
Authors:
Kirill Kogan,
Alejandro López-Ortiz,
Sergey I. Nikolenko,
Alexander V. Sirotkin,
Denis Tugaryov
Abstract:
We consider the problem of managing a bounded size First-In-First-Out (FIFO) queue buffer, where each incoming unit-sized packet requires several rounds of processing before it can be transmitted out. Our objective is to maximize the total number of successfully transmitted packets. We consider both push-out (when the policy is permitted to drop already admitted packets) and non-push-out cases. In particular, we provide analytical guarantees for the throughput performance of our algorithms. We further conduct a comprehensive simulation study which experimentally validates the predicted theoretical behaviour.
Submitted 24 April, 2012;
originally announced April 2012.
-
Balancing Work and Size with Bounded Buffers
Authors:
Kirill Kogan,
Alejandro López-Ortiz,
Sergey I. Nikolenko,
Gabriel Scalosub,
Michael Segal
Abstract:
We consider the fundamental problem of managing a bounded size queue buffer where traffic consists of packets of varying size and each packet requires several rounds of processing before it can be transmitted from the queue buffer. The goal in such an environment is to maximize the overall size of packets that are successfully transmitted. This model is motivated by the ever-growing ubiquity of network processor architectures, which must deal with heterogeneously sized traffic with heterogeneous processing requirements. Our work addresses the tension between two conflicting algorithmic approaches in such settings: the tendency to favor packets with fewer processing requirements, thus leading to fast contributions to the accumulated throughput, as opposed to preferring packets of larger size, which imply a large increase in throughput at each step. We present a model for studying such systems, and present competitive algorithms whose performance depends on the maximum size a packet may have and the maximum amount of processing a packet may require. We further provide lower bounds on algorithm performance in such settings.
Submitted 5 September, 2013; v1 submitted 26 February, 2012;
originally announced February 2012.
-
New Combinatorial Complete One-Way Functions
Authors:
Arist Kojevnikov,
Sergey I. Nikolenko
Abstract:
In 2003, Leonid A. Levin presented the idea of a combinatorial complete one-way function and a sketch of the proof that Tiling represents such a function. In this paper, we present two new one-way functions based on semi-Thue string rewriting systems and a version of the Post Correspondence Problem and prove their completeness. Besides, we present an alternative proof of Levin's result. We also discuss the properties a combinatorial problem should have in order to hold a complete one-way function.
Submitted 20 February, 2008;
originally announced February 2008.
-
Hard satisfiable formulas for DPLL-type algorithms
Authors:
Sergey I. Nikolenko
Abstract:
We address lower bounds on the time complexity of algorithms solving the propositional satisfiability problem. Namely, we consider two DPLL-type algorithms, enhanced with the unit clause and pure literal heuristics. Exponential lower bounds for solving satisfiability on provably satisfiable formulas are proven.
Submitted 15 January, 2003;
originally announced January 2003.