Search | arXiv e-print repository

evoBPE: Evolutionary Protein Sequence Tokenization

Authors: Burak Suyunu, Özdeniz Dolu, Arzucan Özgür

Abstract: Recent advancements in computational biology have drawn compelling parallels between protein sequences and linguistic structures, highlighting the need for sophisticated tokenization methods that capture the intricate evolutionary dynamics of protein sequences. Current subword tokenization techniques, primarily developed for natural language processing, often fail to represent protein sequences' c… ▽ More Recent advancements in computational biology have drawn compelling parallels between protein sequences and linguistic structures, highlighting the need for sophisticated tokenization methods that capture the intricate evolutionary dynamics of protein sequences. Current subword tokenization techniques, primarily developed for natural language processing, often fail to represent protein sequences' complex structural and functional properties adequately. This study introduces evoBPE, a novel tokenization approach that integrates evolutionary mutation patterns into sequence segmentation, addressing critical limitations in existing methods. By leveraging established substitution matrices, evoBPE transcends traditional frequency-based tokenization strategies. The method generates candidate token pairs through biologically informed mutations, evaluating them based on pairwise alignment scores and frequency thresholds. Extensive experiments on human protein sequences show that evoBPE performs better across multiple dimensions. Domain conservation analysis reveals that evoBPE consistently outperforms standard Byte-Pair Encoding, particularly as vocabulary size increases. Furthermore, embedding similarity analysis using ESM-2 suggests that mutation-based token replacements preserve biological sequence properties more effectively than arbitrary substitutions. The research contributes to protein sequence representation by introducing a mutation-aware tokenization method that better captures evolutionary nuances. By bridging computational linguistics and molecular biology, evoBPE opens new possibilities for machine learning applications in protein function prediction, structural modeling, and evolutionary analysis. △ Less

Submitted 11 March, 2025; originally announced March 2025.

Comments: 13 pages, 8 figures, 1 table, 1 algorithm

arXiv:2503.03043 [pdf, ps, other]

Leveraging Randomness in Model and Data Partitioning for Privacy Amplification

Authors: Andy Dong, Wei-Ning Chen, Ayfer Ozgur

Abstract: We study how inherent randomness in the training process -- where each sample (or client in federated learning) contributes only to a randomly selected portion of training -- can be leveraged for privacy amplification. This includes (1) data partitioning, where a sample participates in only a subset of training iterations, and (2) model partitioning, where a sample updates only a subset of the mod… ▽ More We study how inherent randomness in the training process -- where each sample (or client in federated learning) contributes only to a randomly selected portion of training -- can be leveraged for privacy amplification. This includes (1) data partitioning, where a sample participates in only a subset of training iterations, and (2) model partitioning, where a sample updates only a subset of the model parameters. We apply our framework to model parallelism in federated learning, where each client updates a randomly selected subnetwork to reduce memory and computational overhead, and show that existing methods, e.g. model splitting or dropout, provide a significant privacy amplification gain not captured by previous privacy analysis techniques. Additionally, we introduce Balanced Iteration Subsampling, a new data partitioning method where each sample (or client) participates in a fixed number of training iterations. We show that this method yields stronger privacy amplification than Poisson (i.i.d.) sampling of data (or clients). Our results demonstrate that randomness in the training process, which is structured rather than i.i.d. and interacts with data in complex ways, can be systematically leveraged for significant privacy amplification. △ Less

Submitted 1 June, 2025; v1 submitted 4 March, 2025; originally announced March 2025.

arXiv:2502.09659 [pdf]

Cancer Vaccine Adjuvant Name Recognition from Biomedical Literature using Large Language Models

Authors: Hasin Rehana, Jie Zheng, Leo Yeh, Benu Bansal, Nur Bengisu Çam, Christianah Jemiyo, Brett McGregor, Arzucan Özgür, Yongqun He, Junguk Hur

Abstract: Motivation: An adjuvant is a chemical incorporated into vaccines that enhances their efficacy by improving the immune response. Identifying adjuvant names from cancer vaccine studies is essential for furthering research and enhancing immunotherapies. However, the manual curation from the constantly expanding biomedical literature poses significant challenges. This study explores the automated reco… ▽ More Motivation: An adjuvant is a chemical incorporated into vaccines that enhances their efficacy by improving the immune response. Identifying adjuvant names from cancer vaccine studies is essential for furthering research and enhancing immunotherapies. However, the manual curation from the constantly expanding biomedical literature poses significant challenges. This study explores the automated recognition of vaccine adjuvant names using Large Language Models (LLMs), specifically Generative Pretrained Transformers (GPT) and Large Language Model Meta AI (Llama). Methods: We utilized two datasets: 97 clinical trial records from AdjuvareDB and 290 abstracts annotated with the Vaccine Adjuvant Compendium (VAC). GPT-4o and Llama 3.2 were employed in zero-shot and few-shot learning paradigms with up to four examples per prompt. Prompts explicitly targeted adjuvant names, testing the impact of contextual information such as substances or interventions. Outputs underwent automated and manual validation for accuracy and consistency. Results: GPT-4o attained 100% Precision across all situations while exhibiting notable improve in Recall and F1-scores, particularly with incorporating interventions. On the VAC dataset, GPT-4o achieved a maximum F1-score of 77.32% with interventions, surpassing Llama-3.2-3B by approximately 2%. On the AdjuvareDB dataset, GPT-4o reached an F1-score of 81.67% for three-shot prompting with interventions, surpassing Llama-3.2-3 B's maximum F1-score of 65.62%. Conclusion: Our findings demonstrate that LLMs excel at identifying adjuvant names, including rare variations of naming representation. This study emphasizes the capability of LLMs to enhance cancer vaccine development by efficiently extracting insights. Future work aims to broaden the framework to encompass various biomedical literature and enhance model generalizability across various vaccines and adjuvants. △ Less

Submitted 12 February, 2025; originally announced February 2025.

Comments: 10 pages, 6 figures, 4 tables

arXiv:2411.17669 [pdf, ps, other]

Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods

Authors: Burak Suyunu, Enes Taylan, Arzucan Özgür

Abstract: Tokenization is a crucial step in processing protein sequences for machine learning models, as proteins are complex sequences of amino acids that require meaningful segmentation to capture their functional and structural properties. However, existing subword tokenization methods, developed primarily for human language, may be inadequate for protein sequences, which have unique patterns and constra… ▽ More Tokenization is a crucial step in processing protein sequences for machine learning models, as proteins are complex sequences of amino acids that require meaningful segmentation to capture their functional and structural properties. However, existing subword tokenization methods, developed primarily for human language, may be inadequate for protein sequences, which have unique patterns and constraints. This study evaluates three prominent tokenization approaches, Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, across varying vocabulary sizes (400-6400), analyzing their effectiveness in protein sequence representation, domain boundary preservation, and adherence to established linguistic laws. Our comprehensive analysis reveals distinct behavioral patterns among these tokenizers, with vocabulary size significantly influencing their performance. BPE demonstrates better contextual specialization and marginally better domain boundary preservation at smaller vocabularies, while SentencePiece achieves better encoding efficiency, leading to lower fertility scores. WordPiece offers a balanced compromise between these characteristics. However, all tokenizers show limitations in maintaining protein domain integrity, particularly as vocabulary size increases. Analysis of linguistic law adherence shows partial compliance with Zipf's and Brevity laws but notable deviations from Menzerath's law, suggesting that protein sequences may follow distinct organizational principles from natural languages. These findings highlight the limitations of applying traditional NLP tokenization methods to protein sequences and emphasize the need for developing specialized tokenization strategies that better account for the unique characteristics of proteins. △ Less

Submitted 26 November, 2024; originally announced November 2024.

Comments: 8 pages, 9 figures

arXiv:2407.06718 [pdf, other]

A Simple Architecture for Enterprise Large Language Model Applications based on Role based security and Clearance Levels using Retrieval-Augmented Generation or Mixture of Experts

Authors: Atilla Özgür, Yılmaz Uygun

Abstract: This study proposes a simple architecture for Enterprise application for Large Language Models (LLMs) for role based security and NATO clearance levels. Our proposal aims to address the limitations of current LLMs in handling security and information access. The proposed architecture could be used while utilizing Retrieval-Augmented Generation (RAG) and fine tuning of Mixture of experts models (Mo… ▽ More This study proposes a simple architecture for Enterprise application for Large Language Models (LLMs) for role based security and NATO clearance levels. Our proposal aims to address the limitations of current LLMs in handling security and information access. The proposed architecture could be used while utilizing Retrieval-Augmented Generation (RAG) and fine tuning of Mixture of experts models (MoE). It could be used only with RAG, or only with MoE or with both of them. Using roles and security clearance level of the user, documents in RAG and experts in MoE are filtered. This way information leakage is prevented. △ Less

Submitted 9 July, 2024; originally announced July 2024.

ACM Class: D.2.11; I.2.7

arXiv:2406.10036 [pdf, other]

Information Compression in the AI Era: Recent Advances and Future Challenges

Authors: Jun Chen, Yong Fang, Ashish Khisti, Ayfer Ozgur, Nir Shlezinger, Chao Tian

Abstract: This survey articles focuses on emerging connections between the fields of machine learning and data compression. While fundamental limits of classical (lossy) data compression are established using rate-distortion theory, the connections to machine learning have resulted in new theoretical analysis and application areas. We survey recent works on task-based and goal-oriented compression, the rate… ▽ More This survey articles focuses on emerging connections between the fields of machine learning and data compression. While fundamental limits of classical (lossy) data compression are established using rate-distortion theory, the connections to machine learning have resulted in new theoretical analysis and application areas. We survey recent works on task-based and goal-oriented compression, the rate-distortion-perception theory and compression for estimation and inference. Deep learning based approaches also provide natural data-driven algorithmic approaches to compression. We survey recent works on applying deep learning techniques to task-based or goal-oriented compression, as well as image and video compression. We also discuss the potential use of large language models for text compression. We finally provide some directions for future research in this promising field. △ Less

Submitted 14 June, 2024; originally announced June 2024.

Comments: arXiv admin note: text overlap with arXiv:2002.04290

arXiv:2405.20782 [pdf, other]

Universal Exact Compression of Differentially Private Mechanisms

Authors: Yanxiao Liu, Wei-Ning Chen, Ayfer Özgür, Cheuk Ting Li

Abstract: To reduce the communication cost of differential privacy mechanisms, we introduce a novel construction, called Poisson private representation (PPR), designed to compress and simulate any local randomizer while ensuring local differential privacy. Unlike previous simulation-based local differential privacy mechanisms, PPR exactly preserves the joint distribution of the data and the output of the or… ▽ More To reduce the communication cost of differential privacy mechanisms, we introduce a novel construction, called Poisson private representation (PPR), designed to compress and simulate any local randomizer while ensuring local differential privacy. Unlike previous simulation-based local differential privacy mechanisms, PPR exactly preserves the joint distribution of the data and the output of the original local randomizer. Hence, the PPR-compressed privacy mechanism retains all desirable statistical properties of the original privacy mechanism such as unbiasedness and Gaussianity. Moreover, PPR achieves a compression size within a logarithmic gap from the theoretical lower bound. Using the PPR, we give a new order-wise trade-off between communication, accuracy, central and local differential privacy for distributed mean estimation. Experiment results on distributed mean estimation show that PPR consistently gives a better trade-off between communication, accuracy and central differential privacy compared to the coordinate subsampled Gaussian mechanism, while also providing local differential privacy. △ Less

Submitted 10 November, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

Comments: 33 pages, 5 figures

arXiv:2402.01895 [pdf, ps, other]

$L_q$ Lower Bounds on Distributed Estimation via Fisher Information

Authors: Wei-Ning Chen, Ayfer Özgür

Abstract: Van Trees inequality, also known as the Bayesian Cramér-Rao lower bound, is a powerful tool for establishing lower bounds for minimax estimation through Fisher information. It easily adapts to different statistical models and often yields tight bounds. Recently, its application has been extended to distributed estimation with privacy and communication constraints where it yields order-wise optimal… ▽ More Van Trees inequality, also known as the Bayesian Cramér-Rao lower bound, is a powerful tool for establishing lower bounds for minimax estimation through Fisher information. It easily adapts to different statistical models and often yields tight bounds. Recently, its application has been extended to distributed estimation with privacy and communication constraints where it yields order-wise optimal minimax lower bounds for various parametric tasks under squared $L_2$ loss. However, a widely perceived drawback of the van Trees inequality is that it is limited to squared $L_2$ loss. The goal of this paper is to dispel that perception by introducing a strengthened version of the van Trees inequality that applies to general $L_q$ loss functions by building on the Efroimovich's inequality -- a lesser-known entropic inequality dating back to the 1970s. We then apply the generalized van Trees inequality to lower bound $L_q$ loss in distributed minimax estimation under communication and local differential privacy constraints. This leads to lower bounds for $L_q$ loss that apply to sequentially interactive and blackboard communication protocols. Additionally, we show how the generalized van Trees inequality can be used to obtain \emph{local} and \emph{non-asymptotic} minimax results that capture the hardness of estimating each instance at finite sample sizes. △ Less

Submitted 26 April, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

arXiv:2311.04375 [pdf, ps, other]

Federated Experiment Design under Distributed Differential Privacy

Authors: Wei-Ning Chen, Graham Cormode, Akash Bharadwaj, Peter Romov, Ayfer Özgür

Abstract: Experiment design has a rich history dating back over a century and has found many critical applications across various fields since then. The use and collection of users' data in experiments often involve sensitive personal information, so additional measures to protect individual privacy are required during data collection, storage, and usage. In this work, we focus on the rigorous protection of… ▽ More Experiment design has a rich history dating back over a century and has found many critical applications across various fields since then. The use and collection of users' data in experiments often involve sensitive personal information, so additional measures to protect individual privacy are required during data collection, storage, and usage. In this work, we focus on the rigorous protection of users' privacy (under the notion of differential privacy (DP)) while minimizing the trust toward service providers. Specifically, we consider the estimation of the average treatment effect (ATE) under DP, while only allowing the analyst to collect population-level statistics via secure aggregation, a distributed protocol enabling a service provider to aggregate information without accessing individual data. Although a vital component in modern A/B testing workflows, private distributed experimentation has not previously been studied. To achieve DP, we design local privatization mechanisms that are compatible with secure aggregation and analyze the utility, in terms of the width of confidence intervals, both asymptotically and non-asymptotically. We show how these mechanisms can be scaled up to handle the very large number of participants commonly found in practice. In addition, when introducing DP noise, it is imperative to cleverly split privacy budgets to estimate both the mean and variance of the outcomes and carefully calibrate the confidence intervals according to the DP noise. Last, we present comprehensive experimental evaluations of our proposed schemes and show the privacy-utility trade-offs in experiment design. △ Less

Submitted 7 November, 2023; originally announced November 2023.

arXiv:2307.10634 [pdf, other]

Generative Language Models on Nucleotide Sequences of Human Genes

Authors: Musa Nuri Ihtiyar, Arzucan Ozgur

Abstract: Language models, primarily transformer-based ones, obtained colossal success in NLP. To be more precise, studies like BERT in NLU and works such as GPT-3 for NLG are very crucial. DNA sequences are very close to natural language in terms of structure, so if the DNA-related bioinformatics domain is concerned, discriminative models, like DNABert, exist. Yet, the generative side of the coin is mainly… ▽ More Language models, primarily transformer-based ones, obtained colossal success in NLP. To be more precise, studies like BERT in NLU and works such as GPT-3 for NLG are very crucial. DNA sequences are very close to natural language in terms of structure, so if the DNA-related bioinformatics domain is concerned, discriminative models, like DNABert, exist. Yet, the generative side of the coin is mainly unexplored to the best of our knowledge. Consequently, we focused on developing an autoregressive generative language model like GPT-3 for DNA sequences. Because working with whole DNA sequences is challenging without substantial computational resources, we decided to carry out our study on a smaller scale, focusing on nucleotide sequences of human genes, unique parts in DNA with specific functionalities, instead of the whole DNA. This decision did not change the problem structure a lot due to the fact that both DNA and genes can be seen as 1D sequences consisting of four different nucleotides without losing much information and making too much simplification. First of all, we systematically examined an almost entirely unexplored problem and observed that RNNs performed the best while simple techniques like N-grams were also promising. Another beneficial point was learning how to work with generative models on languages we do not understand, unlike natural language. How essential using real-life tasks beyond the classical metrics such as perplexity is observed. Furthermore, checking whether the data-hungry nature of these models can be changed through selecting a language with minimal vocabulary size, four owing to four different types of nucleotides, is examined. The reason for reviewing this was that choosing such a language might make the problem easier. However, what we observed in this study was it did not provide that much of a change in the amount of data needed. △ Less

Submitted 20 July, 2023; originally announced July 2023.

arXiv:2307.06422 [pdf, other]

Differentially Private Decoupled Graph Convolutions for Multigranular Topology Protection

Authors: Eli Chien, Wei-Ning Chen, Chao Pan, Pan Li, Ayfer Özgür, Olgica Milenkovic

Abstract: GNNs can inadvertently expose sensitive user information and interactions through their model predictions. To address these privacy concerns, Differential Privacy (DP) protocols are employed to control the trade-off between provable privacy protection and model utility. Applying standard DP approaches to GNNs directly is not advisable due to two main reasons. First, the prediction of node labels,… ▽ More GNNs can inadvertently expose sensitive user information and interactions through their model predictions. To address these privacy concerns, Differential Privacy (DP) protocols are employed to control the trade-off between provable privacy protection and model utility. Applying standard DP approaches to GNNs directly is not advisable due to two main reasons. First, the prediction of node labels, which relies on neighboring node attributes through graph convolutions, can lead to privacy leakage. Second, in practical applications, the privacy requirements for node attributes and graph topology may differ. In the latter setting, existing DP-GNN models fail to provide multigranular trade-offs between graph topology privacy, node attribute privacy, and GNN utility. To address both limitations, we propose a new framework termed Graph Differential Privacy (GDP), specifically tailored to graph learning. GDP ensures both provably private model parameters as well as private predictions. Additionally, we describe a novel unified notion of graph dataset adjacency to analyze the properties of GDP for different levels of graph topology privacy. Our findings reveal that DP-GNNs, which rely on graph convolutions, not only fail to meet the requirements for multigranular graph topology privacy but also necessitate the injection of DP noise that scales at least linearly with the maximum node degree. In contrast, our proposed Differentially Private Decoupled Graph Convolutions (DPDGCs) represent a more flexible and efficient alternative to graph convolutions that still provides the necessary guarantees of GDP. To validate our approach, we conducted extensive experiments on seven node classification benchmarking and illustrative synthetic datasets. The results demonstrate that DPDGCs significantly outperform existing DP-GNNs in terms of privacy-utility trade-offs. △ Less

Submitted 14 October, 2023; v1 submitted 12 July, 2023; originally announced July 2023.

Comments: NeurIPS 2023

arXiv:2306.09547 [pdf, other]

Training generative models from privatized data

Authors: Daria Reshetova, Wei-Ning Chen, Ayfer Özgür

Abstract: Local differential privacy is a powerful method for privacy-preserving data collection. In this paper, we develop a framework for training Generative Adversarial Networks (GANs) on differentially privatized data. We show that entropic regularization of optimal transport - a popular regularization method in the literature that has often been leveraged for its computational benefits - enables the ge… ▽ More Local differential privacy is a powerful method for privacy-preserving data collection. In this paper, we develop a framework for training Generative Adversarial Networks (GANs) on differentially privatized data. We show that entropic regularization of optimal transport - a popular regularization method in the literature that has often been leveraged for its computational benefits - enables the generator to learn the raw (unprivatized) data distribution even though it only has access to privatized samples. We prove that at the same time this leads to fast statistical convergence at the parametric rate. This shows that entropic regularization of optimal transport uniquely enables the mitigation of both the effects of privatization noise and the curse of dimensionality in statistical convergence. We provide experimental evidence to support the efficacy of our framework in practice. △ Less

Submitted 29 February, 2024; v1 submitted 15 June, 2023; originally announced June 2023.

arXiv:2306.04924 [pdf, other]

Exact Optimality of Communication-Privacy-Utility Tradeoffs in Distributed Mean Estimation

Authors: Berivan Isik, Wei-Ning Chen, Ayfer Ozgur, Tsachy Weissman, Albert No

Abstract: We study the mean estimation problem under communication and local differential privacy constraints. While previous work has proposed \emph{order}-optimal algorithms for the same problem (i.e., asymptotically optimal as we spend more bits), \emph{exact} optimality (in the non-asymptotic setting) still has not been achieved. In this work, we take a step towards characterizing the \emph{exact}-optim… ▽ More We study the mean estimation problem under communication and local differential privacy constraints. While previous work has proposed \emph{order}-optimal algorithms for the same problem (i.e., asymptotically optimal as we spend more bits), \emph{exact} optimality (in the non-asymptotic setting) still has not been achieved. In this work, we take a step towards characterizing the \emph{exact}-optimal approach in the presence of shared randomness (a random variable shared between the server and the user) and identify several conditions for \emph{exact} optimality. We prove that one of the conditions is to utilize a rotationally symmetric shared random codebook. Based on this, we propose a randomization mechanism where the codebook is a randomly rotated simplex -- satisfying the properties of the \emph{exact}-optimal codebook. The proposed mechanism is based on a $k$-closest encoding which we prove to be \emph{exact}-optimal for the randomly rotated simplex codebook. △ Less

Submitted 28 October, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

Comments: Published at the Conference on Neural Information Processing Systems (NeurIPS), 2023

arXiv:2304.01541 [pdf, other]

Privacy Amplification via Compression: Achieving the Optimal Privacy-Accuracy-Communication Trade-off in Distributed Mean Estimation

Authors: Wei-Ning Chen, Dan Song, Ayfer Ozgur, Peter Kairouz

Abstract: Privacy and communication constraints are two major bottlenecks in federated learning (FL) and analytics (FA). We study the optimal accuracy of mean and frequency estimation (canonical models for FL and FA respectively) under joint communication and $(\varepsilon, δ)$-differential privacy (DP) constraints. We show that in order to achieve the optimal error under $(\varepsilon, δ)$-DP, it is suffic… ▽ More Privacy and communication constraints are two major bottlenecks in federated learning (FL) and analytics (FA). We study the optimal accuracy of mean and frequency estimation (canonical models for FL and FA respectively) under joint communication and $(\varepsilon, δ)$-differential privacy (DP) constraints. We show that in order to achieve the optimal error under $(\varepsilon, δ)$-DP, it is sufficient for each client to send $Θ\left( n \min\left(\varepsilon, \varepsilon^2\right)\right)$ bits for FL and $Θ\left(\log\left( n\min\left(\varepsilon, \varepsilon^2\right) \right)\right)$ bits for FA to the server, where $n$ is the number of participating clients. Without compression, each client needs $O(d)$ bits and $\log d$ bits for the mean and frequency estimation problems respectively (where $d$ corresponds to the number of trainable parameters in FL or the domain size in FA), which means that we can get significant savings in the regime $ n \min\left(\varepsilon, \varepsilon^2\right) = o(d)$, which is often the relevant regime in practice. Our algorithms leverage compression for privacy amplification: when each client communicates only partial information about its sample, we show that privacy can be amplified by randomly selecting the part contributed by each client. △ Less

Submitted 4 April, 2023; originally announced April 2023.

arXiv:2303.17728 [pdf]

Evaluation of GPT and BERT-based models on identifying protein-protein interactions in biomedical text

Authors: Hasin Rehana, Nur Bengisu Çam, Mert Basmaci, Jie Zheng, Christianah Jemiyo, Yongqun He, Arzucan Özgür, Junguk Hur

Abstract: Detecting protein-protein interactions (PPIs) is crucial for understanding genetic mechanisms, disease pathogenesis, and drug design. However, with the fast-paced growth of biomedical literature, there is a growing need for automated and accurate extraction of PPIs to facilitate scientific knowledge discovery. Pre-trained language models, such as generative pre-trained transformers (GPT) and bidir… ▽ More Detecting protein-protein interactions (PPIs) is crucial for understanding genetic mechanisms, disease pathogenesis, and drug design. However, with the fast-paced growth of biomedical literature, there is a growing need for automated and accurate extraction of PPIs to facilitate scientific knowledge discovery. Pre-trained language models, such as generative pre-trained transformers (GPT) and bidirectional encoder representations from transformers (BERT), have shown promising results in natural language processing (NLP) tasks. We evaluated the performance of PPI identification of multiple GPT and BERT models using three manually curated gold-standard corpora: Learning Language in Logic (LLL) with 164 PPIs in 77 sentences, Human Protein Reference Database with 163 PPIs in 145 sentences, and Interaction Extraction Performance Assessment with 335 PPIs in 486 sentences. BERT-based models achieved the best overall performance, with BioBERT achieving the highest recall (91.95%) and F1-score (86.84%) and PubMedBERT achieving the highest precision (85.25%). Interestingly, despite not being explicitly trained for biomedical texts, GPT-4 achieved commendable performance, comparable to the top-performing BERT models. It achieved a precision of 88.37%, a recall of 85.14%, and an F1-score of 86.49% on the LLL dataset. These results suggest that GPT models can effectively detect PPIs from text data, offering promising avenues for application in biomedical literature mining. Further research could explore how these models might be fine-tuned for even more specialized tasks within the biomedical domain. △ Less

Submitted 12 December, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

arXiv:2301.02079 [pdf, other]

PEAK: Explainable Privacy Assistant through Automated Knowledge Extraction

Authors: Gonul Ayci, Arzucan Özgür, Murat Şensoy, Pınar Yolum

Abstract: In the realm of online privacy, privacy assistants play a pivotal role in empowering users to manage their privacy effectively. Although recent studies have shown promising progress in tackling tasks such as privacy violation detection and personalized privacy recommendations, a crucial aspect for widespread user adoption is the capability of these systems to provide explanations for their decisio… ▽ More In the realm of online privacy, privacy assistants play a pivotal role in empowering users to manage their privacy effectively. Although recent studies have shown promising progress in tackling tasks such as privacy violation detection and personalized privacy recommendations, a crucial aspect for widespread user adoption is the capability of these systems to provide explanations for their decision-making processes. This paper presents a privacy assistant for generating explanations for privacy decisions. The privacy assistant focuses on discovering latent topics, identifying explanation categories, establishing explanation schemes, and generating automated explanations. The generated explanations can be used by users to understand the recommendations of the privacy assistant. Our user study of real-world privacy dataset of images shows that users find the generated explanations useful and easy to understand. Additionally, the generated explanations can be used by privacy assistants themselves to improve their decision-making. We show how this can be realized by incorporating the generated explanations into a state-of-the-art privacy assistant. △ Less

Submitted 31 May, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

Comments: 43 pages, 14 figures

arXiv:2211.10041 [pdf, other]

The communication cost of security and privacy in federated frequency estimation

Authors: Wei-Ning Chen, Ayfer Özgür, Graham Cormode, Akash Bharadwaj

Abstract: We consider the federated frequency estimation problem, where each user holds a private item $X_i$ from a size-$d$ domain and a server aims to estimate the empirical frequency (i.e., histogram) of $n$ items with $n \ll d$. Without any security and privacy considerations, each user can communicate its item to the server by using $\log d$ bits. A naive application of secure aggregation protocols wou… ▽ More We consider the federated frequency estimation problem, where each user holds a private item $X_i$ from a size-$d$ domain and a server aims to estimate the empirical frequency (i.e., histogram) of $n$ items with $n \ll d$. Without any security and privacy considerations, each user can communicate its item to the server by using $\log d$ bits. A naive application of secure aggregation protocols would, however, require $d\log n$ bits per user. Can we reduce the communication needed for secure aggregation, and does security come with a fundamental cost in communication? In this paper, we develop an information-theoretic model for secure aggregation that allows us to characterize the fundamental cost of security and privacy in terms of communication. We show that with security (and without privacy) $Ω\left( n \log d \right)$ bits per user are necessary and sufficient to allow the server to compute the frequency distribution. This is significantly smaller than the $d\log n$ bits per user needed by the naive scheme, but significantly higher than the $\log d$ bits per user needed without security. To achieve differential privacy, we construct a linear scheme based on a noisy sketch which locally perturbs the data and does not require a trusted server (a.k.a. distributed differential privacy). We analyze this scheme under $\ell_2$ and $\ell_\infty$ loss. By using our information-theoretic framework, we show that the scheme achieves the optimal accuracy-privacy trade-off with optimal communication cost, while matching the performance in the centralized case where data is stored in the central server. △ Less

Submitted 18 November, 2022; originally announced November 2022.

arXiv:2209.00981 [pdf, other]

Exploiting Pretrained Biochemical Language Models for Targeted Drug Design

Authors: Gökçe Uludoğan, Elif Ozkirimli, Kutlu O. Ulgen, Nilgün Karalı, Arzucan Özgür

Abstract: Motivation: The development of novel compounds targeting proteins of interest is one of the most important tasks in the pharmaceutical industry. Deep generative models have been applied to targeted molecular design and have shown promising results. Recently, target-specific molecule generation has been viewed as a translation between the protein language and the chemical language. However, such a… ▽ More Motivation: The development of novel compounds targeting proteins of interest is one of the most important tasks in the pharmaceutical industry. Deep generative models have been applied to targeted molecular design and have shown promising results. Recently, target-specific molecule generation has been viewed as a translation between the protein language and the chemical language. However, such a model is limited by the availability of interacting protein-ligand pairs. On the other hand, large amounts of unlabeled protein sequences and chemical compounds are available and have been used to train language models that learn useful representations. In this study, we propose exploiting pretrained biochemical language models to initialize (i.e. warm start) targeted molecule generation models. We investigate two warm start strategies: (i) a one-stage strategy where the initialized model is trained on targeted molecule generation (ii) a two-stage strategy containing a pre-finetuning on molecular generation followed by target specific training. We also compare two decoding strategies to generate compounds: beam search and sampling. Results: The results show that the warm-started models perform better than a baseline model trained from scratch. The two proposed warm-start strategies achieve similar results to each other with respect to widely used metrics from benchmarks. However, docking evaluation of the generated compounds for a number of novel proteins suggests that the one-stage strategy generalizes better than the two-stage strategy. Additionally, we observe that beam search outperforms sampling in both docking evaluation and benchmark metrics for assessing compound quality. Availability and implementation: The source code is available at https://github.com/boun-tabi/biochemical-lms-for-drug-design and the materials are archived in Zenodo at https://doi.org/10.5281/zenodo.6832145 △ Less

Submitted 2 September, 2022; originally announced September 2022.

Comments: 12 pages, to appear in Bioinformatics

arXiv:2207.11782 [pdf, other]

Enhancements to the BOUN Treebank Reflecting the Agglutinative Nature of Turkish

Authors: Büşra Marşan, Salih Furkan Akkurt, Muhammet Şen, Merve Gürbüz, Onur Güngör, Şaziye Betül Özateş, Suzan Üsküdarlı, Arzucan Özgür, Tunga Güngör, Balkız Öztürk

Abstract: In this study, we aim to offer linguistically motivated solutions to resolve the issues of the lack of representation of null morphemes, highly productive derivational processes, and syncretic morphemes of Turkish in the BOUN Treebank without diverging from the Universal Dependencies framework. In order to tackle these issues, new annotation conventions were introduced by splitting certain lemma… ▽ More In this study, we aim to offer linguistically motivated solutions to resolve the issues of the lack of representation of null morphemes, highly productive derivational processes, and syncretic morphemes of Turkish in the BOUN Treebank without diverging from the Universal Dependencies framework. In order to tackle these issues, new annotation conventions were introduced by splitting certain lemmas and employing the MISC (miscellaneous) tab in the UD framework to denote derivation. Representational capabilities of the re-annotated treebank were tested on a LSTM-based dependency parser and an updated version of the BoAT Tool is introduced. △ Less

Submitted 24 July, 2022; originally announced July 2022.

Comments: This is a peer reviewed article that has been presented in The International Conference on Agglutinative Language Technologies as a challenge of Natural Language Processing (ALTNLP) 2022

arXiv:2207.09916 [pdf, other]

The Poisson binomial mechanism for secure and private federated learning

Authors: Wei-Ning Chen, Ayfer Özgür, Peter Kairouz

Abstract: We introduce the Poisson Binomial mechanism (PBM), a discrete differential privacy mechanism for distributed mean estimation (DME) with applications to federated learning and analytics. We provide a tight analysis of its privacy guarantees, showing that it achieves the same privacy-accuracy trade-offs as the continuous Gaussian mechanism. Our analysis is based on a novel bound on the Rényi diverge… ▽ More We introduce the Poisson Binomial mechanism (PBM), a discrete differential privacy mechanism for distributed mean estimation (DME) with applications to federated learning and analytics. We provide a tight analysis of its privacy guarantees, showing that it achieves the same privacy-accuracy trade-offs as the continuous Gaussian mechanism. Our analysis is based on a novel bound on the Rényi divergence of two Poisson binomial distributions that may be of independent interest. Unlike previous discrete DP schemes based on additive noise, our mechanism encodes local information into a parameter of the binomial distribution, and hence the output distribution is discrete with bounded support. Moreover, the support does not increase as the privacy budget $\varepsilon \rightarrow 0$ as in the case of additive schemes which require the addition of more noise to achieve higher privacy; on the contrary, the support becomes smaller as $\varepsilon \rightarrow 0$. The bounded support enables us to combine our mechanism with secure aggregation (SecAgg), a multi-party cryptographic protocol, without the need of performing modular clipping which results in an unbiased estimator of the sum of the local vectors. This in turn allows us to apply it in the private FL setting and provide an upper bound on the convergence rate of the SGD algorithm. Moreover, since the support of the output distribution becomes smaller as $\varepsilon \rightarrow 0$, the communication cost of our scheme decreases with the privacy constraint $\varepsilon$, outperforming all previous distributed DP schemes based on additive noise in the high privacy or low communication regimes. △ Less

Submitted 9 July, 2022; originally announced July 2022.

Comments: 25 pages

arXiv:2205.06544 [pdf, other]

Uncertainty-aware Personal Assistant for Making Personalized Privacy Decisions

Authors: Gonul Ayci, Murat Sensoy, Arzucan Özgür, Pınar Yolum

Abstract: Many software systems, such as online social networks enable users to share information about themselves. While the action of sharing is simple, it requires an elaborate thought process on privacy: what to share, with whom to share, and for what purposes. Thinking about these for each piece of content to be shared is tedious. Recent approaches to tackle this problem build personal assistants that… ▽ More Many software systems, such as online social networks enable users to share information about themselves. While the action of sharing is simple, it requires an elaborate thought process on privacy: what to share, with whom to share, and for what purposes. Thinking about these for each piece of content to be shared is tedious. Recent approaches to tackle this problem build personal assistants that can help users by learning what is private over time and recommending privacy labels such as private or public to individual content that a user considers sharing. However, privacy is inherently ambiguous and highly personal. Existing approaches to recommend privacy decisions do not address these aspects of privacy sufficiently. Ideally, a personal assistant should be able to adjust its recommendation based on a given user, considering that user's privacy understanding. Moreover, the personal assistant should be able to assess when its recommendation would be uncertain and let the user make the decision on her own. Accordingly, this paper proposes a personal assistant that uses evidential deep learning to classify content based on its privacy label. An important characteristic of the personal assistant is that it can model its uncertainty in its decisions explicitly, determine that it does not know the answer, and delegate from making a recommendation when its uncertainty is high. By factoring in the user's own understanding of privacy, such as risk factors or own labels, the personal assistant can personalize its recommendations per user. We evaluate our proposed personal assistant using a well-known data set. Our results show that our personal assistant can accurately identify uncertain cases, personalize them to its user's needs, and thus helps users preserve their privacy well. △ Less

Submitted 28 July, 2022; v1 submitted 13 May, 2022; originally announced May 2022.

Comments: 24 pages, 11 figures, 7 tables

arXiv:2205.04185 [pdf, other]

A Dataset and BERT-based Models for Targeted Sentiment Analysis on Turkish Texts

Authors: M. Melih Mutlu, Arzucan Özgür

Abstract: Targeted Sentiment Analysis aims to extract sentiment towards a particular target from a given text. It is a field that is attracting attention due to the increasing accessibility of the Internet, which leads people to generate an enormous amount of data. Sentiment analysis, which in general requires annotated data for training, is a well-researched area for widely studied languages such as Englis… ▽ More Targeted Sentiment Analysis aims to extract sentiment towards a particular target from a given text. It is a field that is attracting attention due to the increasing accessibility of the Internet, which leads people to generate an enormous amount of data. Sentiment analysis, which in general requires annotated data for training, is a well-researched area for widely studied languages such as English. For low-resource languages such as Turkish, there is a lack of such annotated data. We present an annotated Turkish dataset suitable for targeted sentiment analysis. We also propose BERT-based models with different architectures to accomplish the task of targeted sentiment analysis. The results demonstrate that the proposed models outperform the traditional sentiment analysis models for the targeted sentiment analysis task. △ Less

Submitted 9 May, 2022; originally announced May 2022.

arXiv:2111.01387 [pdf, other]

Understanding Entropic Regularization in GANs

Authors: Daria Reshetova, Yikun Bai, Xiugang Wu, Ayfer Ozgur

Abstract: Generative Adversarial Networks are a popular method for learning distributions from data by modeling the target distribution as a function of a known distribution. The function, often referred to as the generator, is optimized to minimize a chosen distance measure between the generated and target distributions. One commonly used measure for this purpose is the Wasserstein distance. However, Wasse… ▽ More Generative Adversarial Networks are a popular method for learning distributions from data by modeling the target distribution as a function of a known distribution. The function, often referred to as the generator, is optimized to minimize a chosen distance measure between the generated and target distributions. One commonly used measure for this purpose is the Wasserstein distance. However, Wasserstein distance is hard to compute and optimize, and in practice entropic regularization techniques are used to improve numerical convergence. The influence of regularization on the learned solution, however, remains not well-understood. In this paper, we study how several popular entropic regularizations of Wasserstein distance impact the solution in a simple benchmark setting where the generator is linear and the target distribution is high-dimensional Gaussian. We show that entropy regularization promotes the solution sparsification, while replacing the Wasserstein distance with the Sinkhorn divergence recovers the unregularized solution. Both regularization techniques remove the curse of dimensionality suffered by Wasserstein distance. We show that the optimal generator can be learned to accuracy $ε$ with $O(1/ε^2)$ samples from the target distribution. We thus conclude that these regularization techniques can improve the quality of the generator learned from empirical data for a large class of distributions. △ Less

Submitted 2 November, 2021; originally announced November 2021.

Comments: 29 pages, 7 figures

arXiv:2110.03189 [pdf, other]

Pointwise Bounds for Distribution Estimation under Communication Constraints

Authors: Wei-Ning Chen, Peter Kairouz, Ayfer Özgür

Abstract: We consider the problem of estimating a $d$-dimensional discrete distribution from its samples observed under a $b$-bit communication constraint. In contrast to most previous results that largely focus on the global minimax error, we study the local behavior of the estimation error and provide \emph{pointwise} bounds that depend on the target distribution $p$. In particular, we show that the… ▽ More We consider the problem of estimating a $d$-dimensional discrete distribution from its samples observed under a $b$-bit communication constraint. In contrast to most previous results that largely focus on the global minimax error, we study the local behavior of the estimation error and provide \emph{pointwise} bounds that depend on the target distribution $p$. In particular, we show that the $\ell_2$ error decays with $O\left(\frac{\lVert p\rVert_{1/2}}{n2^b}\vee \frac{1}{n}\right)$ (In this paper, we use $a\vee b$ and $a \wedge b$ to denote $\max(a, b)$ and $\min(a,b)$ respectively.) when $n$ is sufficiently large, hence it is governed by the \emph{half-norm} of $p$ instead of the ambient dimension $d$. For the achievability result, we propose a two-round sequentially interactive estimation scheme that achieves this error rate uniformly over all $p$. Our scheme is based on a novel local refinement idea, where we first use a standard global minimax scheme to localize $p$ and then use the remaining samples to locally refine our estimate. We also develop a new local minimax lower bound with (almost) matching $\ell_2$ error, showing that any interactive scheme must admit a $Ω\left( \frac{\lVert p \rVert_{{(1+δ)}/{2}}}{n2^b}\right)$ $\ell_2$ error for any $δ> 0$. The lower bound is derived by first finding the best parametric sub-model containing $p$, and then upper bounding the quantized Fisher information under this model. Our upper and lower bounds together indicate that the $\mathcal{H}_{1/2}(p) = \log(\lVert p \rVert_{{1}/{2}})$ bits of communication is both sufficient and necessary to achieve the optimal (centralized) performance, where $\mathcal{H}_{{1}/{2}}(p)$ is the Rényi entropy of order $2$. Therefore, under the $\ell_2$ loss, the correct measure of the local communication complexity at $p$ is its Rényi entropy. △ Less

Submitted 29 October, 2021; v1 submitted 7 October, 2021; originally announced October 2021.

arXiv:2110.00202 [pdf, other]

Batched Thompson Sampling

Authors: Cem Kalkanli, Ayfer Ozgur

Abstract: We introduce a novel anytime Batched Thompson sampling policy for multi-armed bandits where the agent observes the rewards of her actions and adjusts her policy only at the end of a small number of batches. We show that this policy simultaneously achieves a problem dependent regret of order $O(\log(T))$ and a minimax regret of order $O(\sqrt{T\log(T)})$ while the number of batches can be bounded b… ▽ More We introduce a novel anytime Batched Thompson sampling policy for multi-armed bandits where the agent observes the rewards of her actions and adjusts her policy only at the end of a small number of batches. We show that this policy simultaneously achieves a problem dependent regret of order $O(\log(T))$ and a minimax regret of order $O(\sqrt{T\log(T)})$ while the number of batches can be bounded by $O(\log(T))$ independent of the problem instance over a time horizon $T$. We also show that in expectation the number of batches used by our policy can be bounded by an instance dependent bound of order $O(\log\log(T))$. These results indicate that Thompson sampling maintains the same performance in this batched setting as in the case when instantaneous feedback is available after each action, while requiring minimal feedback. These results also indicate that Thompson sampling performs competitively with recently proposed algorithms tailored for the batched setting. These algorithms optimize the batch structure for a given time horizon $T$ and prioritize exploration in the beginning of the experiment to eliminate suboptimal actions. We show that Thompson sampling combined with an adaptive batching strategy can achieve a similar performance without knowing the time horizon $T$ of the problem and without having to carefully optimize the batch structure to achieve a target regret bound (i.e. problem dependent vs minimax regret) for a given $T$. △ Less

Submitted 1 October, 2021; originally announced October 2021.

Comments: This work is accepted to Thirty-fifth Conference on Neural Information Processing Systems, NeurIPS 2021

arXiv:2110.00158 [pdf, other]

doi 10.1109/ISIT45174.2021.9517843

Asymptotic Performance of Thompson Sampling in the Batched Multi-Armed Bandits

Authors: Cem Kalkanli, Ayfer Ozgur

Abstract: We study the asymptotic performance of the Thompson sampling algorithm in the batched multi-armed bandit setting where the time horizon $T$ is divided into batches, and the agent is not able to observe the rewards of her actions until the end of each batch. We show that in this batched setting, Thompson sampling achieves the same asymptotic performance as in the case where instantaneous feedback i… ▽ More We study the asymptotic performance of the Thompson sampling algorithm in the batched multi-armed bandit setting where the time horizon $T$ is divided into batches, and the agent is not able to observe the rewards of her actions until the end of each batch. We show that in this batched setting, Thompson sampling achieves the same asymptotic performance as in the case where instantaneous feedback is available after each action, provided that the batch sizes increase subexponentially. This result implies that Thompson sampling can maintain its performance even if it receives delayed feedback in $ω(\log T)$ batches. We further propose an adaptive batching scheme that reduces the number of batches to $Θ(\log T)$ while maintaining the same performance. Although the batched multi-armed bandit setting has been considered in several recent works, previous results rely on tailored algorithms for the batched setting, which optimize the batch structure and prioritize exploration in the beginning of the experiment to eliminate suboptimal actions. We show that Thompson sampling, on the other hand, is able to achieve a similar asymptotic performance in the batched setting without any modifications. △ Less

Submitted 30 September, 2021; originally announced October 2021.

Comments: This work was presented in 2021 IEEE International Symposium on Information Theory (ISIT)

Journal ref: IEEE International Symposium on Information Theory (ISIT), 2021, pp. 539-544

arXiv:2109.11389 [pdf, other]

doi 10.1017/S1351324920000443

Cluster-based Mention Typing for Named Entity Disambiguation

Authors: Arda Çelebi, Arzucan Özgür

Abstract: An entity mention in text such as "Washington" may correspond to many different named entities such as the city "Washington D.C." or the newspaper "Washington Post." The goal of named entity disambiguation is to identify the mentioned named entity correctly among all possible candidates. If the type (e.g. location or person) of a mentioned entity can be correctly predicted from the context, it may… ▽ More An entity mention in text such as "Washington" may correspond to many different named entities such as the city "Washington D.C." or the newspaper "Washington Post." The goal of named entity disambiguation is to identify the mentioned named entity correctly among all possible candidates. If the type (e.g. location or person) of a mentioned entity can be correctly predicted from the context, it may increase the chance of selecting the right candidate by assigning low probability to the unlikely ones. This paper proposes cluster-based mention typing for named entity disambiguation. The aim of mention typing is to predict the type of a given mention based on its context. Generally, manually curated type taxonomies such as Wikipedia categories are used. We introduce cluster-based mention typing, where named entities are clustered based on their contextual similarities and the cluster ids are assigned as types. The hyperlinked mentions and their context in Wikipedia are used in order to obtain these cluster-based types. Then, mention typing models are trained on these mentions, which have been labeled with their cluster-based types through distant supervision. At the named entity disambiguation phase, first the cluster-based types of a given mention are predicted and then, these types are used as features in a ranking model to select the best entity among the candidates. We represent entities at multiple contextual levels and obtain different clusterings (and thus typing models) based on each level. As each clustering breaks the entity space differently, mention typing based on each clustering discriminates the mention differently. When predictions from all typing models are used together, our system achieves better or comparable results based on randomization tests with respect to the state-of-the-art levels on four defacto test sets. △ Less

Submitted 23 September, 2021; originally announced September 2021.

Comments: 46 pages, 11 figures, 14 tables

Journal ref: Nat. Lang. Eng. 28 (2022) 1-37

arXiv:2109.04712 [pdf, other]

Balancing Methods for Multi-label Text Classification with Long-Tailed Class Distribution

Authors: Yi Huang, Buse Giledereli, Abdullatif Köksal, Arzucan Özgür, Elif Ozkirimli

Abstract: Multi-label text classification is a challenging task because it requires capturing label dependencies. It becomes even more challenging when class distribution is long-tailed. Resampling and re-weighting are common approaches used for addressing the class imbalance problem, however, they are not effective when there is label dependency besides class imbalance because they result in oversampling o… ▽ More Multi-label text classification is a challenging task because it requires capturing label dependencies. It becomes even more challenging when class distribution is long-tailed. Resampling and re-weighting are common approaches used for addressing the class imbalance problem, however, they are not effective when there is label dependency besides class imbalance because they result in oversampling of common labels. Here, we introduce the application of balancing loss functions for multi-label text classification. We perform experiments on a general domain dataset with 90 labels (Reuters-21578) and a domain-specific dataset from PubMed with 18211 labels. We find that a distribution-balanced loss function, which inherently addresses both the class imbalance and label linkage problems, outperforms commonly used loss functions. Distribution balancing methods have been successfully used in the image recognition field. Here, we show their effectiveness in natural language processing. Source code is available at https://github.com/Roche/BalancedLossNLP. △ Less

Submitted 15 October, 2021; v1 submitted 10 September, 2021; originally announced September 2021.

Comments: EMNLP 2021

arXiv:2107.05556 [pdf, other]

DebiasedDTA: A Framework for Improving the Generalizability of Drug-Target Affinity Prediction Models

Authors: Rıza Özçelik, Alperen Bağ, Berk Atıl, Melih Barsbey, Arzucan Özgür, Elif Özkırımlı

Abstract: Computational models that accurately predict the binding affinity of an input protein-chemical pair can accelerate drug discovery studies. These models are trained on available protein-chemical interaction datasets, which may contain dataset biases that may lead the model to learn dataset-specific patterns, instead of generalizable relationships. As a result, the prediction performance of models d… ▽ More Computational models that accurately predict the binding affinity of an input protein-chemical pair can accelerate drug discovery studies. These models are trained on available protein-chemical interaction datasets, which may contain dataset biases that may lead the model to learn dataset-specific patterns, instead of generalizable relationships. As a result, the prediction performance of models drops for previously unseen biomolecules, $\textit{i.e.}$ the prediction models cannot generalize to biomolecules outside of the dataset. The latest approaches that aim to improve model generalizability either have limited applicability or introduce the risk of degrading prediction performance. Here, we present DebiasedDTA, a novel drug-target affinity (DTA) prediction model training framework that addresses dataset biases to improve the generalizability of affinity prediction models. DebiasedDTA reweights the training samples to mitigate the effect of dataset biases and is applicable to most DTA prediction models. The results suggest that models trained in the DebiasedDTA framework can achieve improved generalizability in predicting the interactions of the previously unseen biomolecules, as well as performance improvements on those previously seen. Extensive experiments with different biomolecule representations, model architectures, and datasets demonstrate that DebiasedDTA can upgrade DTA prediction models irrespective of the biomolecule representation, model architecture, and training dataset. Last but not least, we release DebiasedDTA as an open-source python library to enable other researchers to debias their own predictors and/or develop their own debiasing methods. We believe that this python library will corroborate and foster research to develop more generalizable DTA prediction models. △ Less

Submitted 8 January, 2023; v1 submitted 4 July, 2021; originally announced July 2021.

arXiv:2106.08597 [pdf, ps, other]

Breaking The Dimension Dependence in Sparse Distribution Estimation under Communication Constraints

Authors: Wei-Ning Chen, Peter Kairouz, Ayfer Özgür

Abstract: We consider the problem of estimating a $d$-dimensional $s$-sparse discrete distribution from its samples observed under a $b$-bit communication constraint. The best-known previous result on $\ell_2$ estimation error for this problem is $O\left( \frac{s\log\left( {d}/{s}\right)}{n2^b}\right)$. Surprisingly, we show that when sample size $n$ exceeds a minimum threshold $n^*(s, d, b)$, we can achiev… ▽ More We consider the problem of estimating a $d$-dimensional $s$-sparse discrete distribution from its samples observed under a $b$-bit communication constraint. The best-known previous result on $\ell_2$ estimation error for this problem is $O\left( \frac{s\log\left( {d}/{s}\right)}{n2^b}\right)$. Surprisingly, we show that when sample size $n$ exceeds a minimum threshold $n^*(s, d, b)$, we can achieve an $\ell_2$ estimation error of $O\left( \frac{s}{n2^b}\right)$. This implies that when $n>n^*(s, d, b)$ the convergence rate does not depend on the ambient dimension $d$ and is the same as knowing the support of the distribution beforehand. We next ask the question: ``what is the minimum $n^*(s, d, b)$ that allows dimension-free convergence?''. To upper bound $n^*(s, d, b)$, we develop novel localization schemes to accurately and efficiently localize the unknown support. For the non-interactive setting, we show that $n^*(s, d, b) = O\left( \min \left( {d^2\log^2 d}/{2^b}, {s^4\log^2 d}/{2^b}\right) \right)$. Moreover, we connect the problem with non-adaptive group testing and obtain a polynomial-time estimation scheme when $n = \tildeΩ\left({s^4\log^4 d}/{2^b}\right)$. This group testing based scheme is adaptive to the sparsity parameter $s$, and hence can be applied without knowing it. For the interactive setting, we propose a novel tree-based estimation scheme and show that the minimum sample-size needed to achieve dimension-free convergence can be further reduced to $n^*(s, d, b) = \tilde{O}\left( {s^2\log^2 d}/{2^b} \right)$. △ Less

Submitted 16 June, 2021; originally announced June 2021.

arXiv:2105.13793 [pdf, other]

A systematic review of transfer learning based approaches for diabetic retinopathy detection

Authors: Burcu Oltu, Büşra Kübra Karaca, Hamit Erdem, Atilla Özgür

Abstract: Cases of diabetes and related diabetic retinopathy (DR) have been increasing at an alarming rate in modern times. Early detection of DR is an important problem since it may cause permanent blindness in the late stages. In the last two decades, many different approaches have been applied in DR detection. Reviewing academic literature shows that deep neural networks (DNNs) have become the most prefe… ▽ More Cases of diabetes and related diabetic retinopathy (DR) have been increasing at an alarming rate in modern times. Early detection of DR is an important problem since it may cause permanent blindness in the late stages. In the last two decades, many different approaches have been applied in DR detection. Reviewing academic literature shows that deep neural networks (DNNs) have become the most preferred approach for DR detection. Among these DNN approaches, Convolutional Neural Network (CNN) models are the most used ones in the field of medical image classification. Designing a new CNN architecture is a tedious and time-consuming approach. Additionally, training an enormous number of parameters is also a difficult task. Due to this reason, instead of training CNNs from scratch, using pre-trained models has been suggested in recent years as transfer learning approach. Accordingly, the present study as a review focuses on DNN and Transfer Learning based applications of DR detection considering 38 publications between 2015 and 2020. The published papers are summarized using 9 figures and 10 tables, giving information about 22 pre-trained CNN models, 12 DR data sets and standard performance metrics. △ Less

Submitted 28 May, 2021; originally announced May 2021.

Comments: 25 pages 9 figures 10 tables

arXiv:2103.04014 [pdf, ps, other]

Over-the-Air Statistical Estimation

Authors: Chuan-Zheng Lee, Leighton Pate Barnes, Ayfer Ozgur

Abstract: We study schemes and lower bounds for distributed minimax statistical estimation over a Gaussian multiple-access channel (MAC) under squared error loss, in a framework combining statistical estimation and wireless communication. First, we develop "analog" joint estimation-communication schemes that exploit the superposition property of the Gaussian MAC and we characterize their risk in terms of th… ▽ More We study schemes and lower bounds for distributed minimax statistical estimation over a Gaussian multiple-access channel (MAC) under squared error loss, in a framework combining statistical estimation and wireless communication. First, we develop "analog" joint estimation-communication schemes that exploit the superposition property of the Gaussian MAC and we characterize their risk in terms of the number of nodes and dimension of the parameter space. Then, we derive information-theoretic lower bounds on the minimax risk of any estimation scheme restricted to communicate the samples over a given number of uses of the channel and show that the risk achieved by our proposed schemes is within a logarithmic factor of these lower bounds. We compare both achievability and lower bound results to previous "digital" lower bounds, where nodes transmit errorless bits at the Shannon capacity of the MAC, showing that estimation schemes that leverage the physical layer offer a drastic reduction in estimation error over digital schemes relying on a physical-layer abstraction. △ Less

Submitted 5 March, 2021; originally announced March 2021.

Comments: 12 pages, 5 figures

arXiv:2102.05802 [pdf, ps, other]

Fisher Information and Mutual Information Constraints

Authors: Leighton Pate Barnes, Ayfer Ozgur

Abstract: We consider the processing of statistical samples $X\sim P_θ$ by a channel $p(y|x)$, and characterize how the statistical information from the samples for estimating the parameter $θ\in\mathbb{R}^d$ can scale with the mutual information or capacity of the channel. We show that if the statistical model has a sub-Gaussian score function, then the trace of the Fisher information matrix for estimating… ▽ More We consider the processing of statistical samples $X\sim P_θ$ by a channel $p(y|x)$, and characterize how the statistical information from the samples for estimating the parameter $θ\in\mathbb{R}^d$ can scale with the mutual information or capacity of the channel. We show that if the statistical model has a sub-Gaussian score function, then the trace of the Fisher information matrix for estimating $θ$ from $Y$ can scale at most linearly with the mutual information between $X$ and $Y$. We apply this result to obtain minimax lower bounds in distributed statistical estimation problems, and obtain a tight preconstant for Gaussian mean estimation. We then show how our Fisher information bound can also imply mutual information or Jensen-Shannon divergence based distributed strong data processing inequalities. △ Less

Submitted 8 July, 2021; v1 submitted 10 February, 2021; originally announced February 2021.

arXiv:2101.02405 [pdf, other]

Adaptive Group Testing on Networks with Community Structure: The Stochastic Block Model

Authors: Surin Ahn, Wei-Ning Chen, Ayfer Ozgur

Abstract: Group testing was conceived during World War II to identify soldiers infected with syphilis using as few tests as possible, and it has attracted renewed interest during the COVID-19 pandemic. A long-standing assumption in the probabilistic variant of the group testing problem is that individuals are infected by the disease independently. However, this assumption rarely holds in practice, as diseas… ▽ More Group testing was conceived during World War II to identify soldiers infected with syphilis using as few tests as possible, and it has attracted renewed interest during the COVID-19 pandemic. A long-standing assumption in the probabilistic variant of the group testing problem is that individuals are infected by the disease independently. However, this assumption rarely holds in practice, as diseases often spread through interactions between individuals and therefore cause infections to be correlated. Inspired by characteristics of COVID-19 and other infectious diseases, we introduce an infection model over networks which generalizes the traditional i.i.d. model from probabilistic group testing. Under this model, we ask whether knowledge of the network structure can be leveraged to perform group testing more efficiently, focusing specifically on community-structured graphs drawn from the stochastic block model. We prove that a simple community-aware algorithm outperforms the baseline binary splitting algorithm when the model parameters are conducive to "strong community structure." Moreover, our novel lower bounds imply that the community-aware algorithm is order-optimal in certain parameter regimes. We extend our bounds to the noisy setting and support our results with numerical experiments. △ Less

Submitted 17 November, 2022; v1 submitted 7 January, 2021; originally announced January 2021.

Comments: 27 pages, 5 figures. Presented in part at the 2021 IEEE International Symposium on Information Theory (ISIT). Restructured the paper and added new results for the noisy setting

arXiv:2011.03917 [pdf, ps, other]

Asymptotic Convergence of Thompson Sampling

Authors: Cem Kalkanli, Ayfer Ozgur

Abstract: Thompson sampling has been shown to be an effective policy across a variety of online learning tasks. Many works have analyzed the finite time performance of Thompson sampling, and proved that it achieves a sub-linear regret under a broad range of probabilistic settings. However its asymptotic behavior remains mostly underexplored. In this paper, we prove an asymptotic convergence result for Thomp… ▽ More Thompson sampling has been shown to be an effective policy across a variety of online learning tasks. Many works have analyzed the finite time performance of Thompson sampling, and proved that it achieves a sub-linear regret under a broad range of probabilistic settings. However its asymptotic behavior remains mostly underexplored. In this paper, we prove an asymptotic convergence result for Thompson sampling under the assumption of a sub-linear Bayesian regret, and show that the actions of a Thompson sampling agent provide a strongly consistent estimator of the optimal action. Our results rely on the martingale structure inherent in Thompson sampling. △ Less

Submitted 8 November, 2020; originally announced November 2020.

arXiv:2010.09381 [pdf, other]

The RELX Dataset and Matching the Multilingual Blanks for Cross-Lingual Relation Classification

Authors: Abdullatif Köksal, Arzucan Özgür

Abstract: Relation classification is one of the key topics in information extraction, which can be used to construct knowledge bases or to provide useful information for question answering. Current approaches for relation classification are mainly focused on the English language and require lots of training data with human annotations. Creating and annotating a large amount of training data for low-resource… ▽ More Relation classification is one of the key topics in information extraction, which can be used to construct knowledge bases or to provide useful information for question answering. Current approaches for relation classification are mainly focused on the English language and require lots of training data with human annotations. Creating and annotating a large amount of training data for low-resource languages is impractical and expensive. To overcome this issue, we propose two cross-lingual relation classification models: a baseline model based on Multilingual BERT and a new multilingual pretraining setup, which significantly improves the baseline with distant supervision. For evaluation, we introduce a new public benchmark dataset for cross-lingual relation classification in English, French, German, Spanish, and Turkish, called RELX. We also provide the RELX-Distant dataset, which includes hundreds of thousands of sentences with relations from Wikipedia and Wikidata collected by distant supervision for these languages. Our code and data are available at: https://github.com/boun-tabi/RELX △ Less

Submitted 19 October, 2020; originally announced October 2020.

Comments: Findings of EMNLP 2020

arXiv:2009.02526 [pdf, other]

Vapur: A Search Engine to Find Related Protein-Compound Pairs in COVID-19 Literature

Authors: Abdullatif Köksal, Hilal Dönmez, Rıza Özçelik, Elif Ozkirimli, Arzucan Özgür

Abstract: Coronavirus Disease of 2019 (COVID-19) created dire consequences globally and triggered an intense scientific effort from different domains. The resulting publications created a huge text collection in which finding the studies related to a biomolecule of interest is challenging for general purpose search engines because the publications are rich in domain specific terminology. Here, we present Va… ▽ More Coronavirus Disease of 2019 (COVID-19) created dire consequences globally and triggered an intense scientific effort from different domains. The resulting publications created a huge text collection in which finding the studies related to a biomolecule of interest is challenging for general purpose search engines because the publications are rich in domain specific terminology. Here, we present Vapur: an online COVID-19 search engine specifically designed to find related protein - chemical pairs. Vapur is empowered with a relation-oriented inverted index that is able to retrieve and group studies for a query biomolecule with respect to its related entities. The inverted index of Vapur is automatically created with a BioNLP pipeline and integrated with an online user interface. The online interface is designed for the smooth traversal of the current literature by domain researchers and is publicly available at https://tabilab.cmpe.boun.edu.tr/vapur/ . △ Less

Submitted 13 October, 2020; v1 submitted 5 September, 2020; originally announced September 2020.

Comments: EMNLP 2020 - COVID-19 Workshop

arXiv:2008.10249 [pdf, other]

Information Constrained Optimal Transport: From Talagrand, to Marton, to Cover

Authors: Yikun Bai, Xiugang Wu, Ayfer Ozgur

Abstract: The optimal transport problem studies how to transport one measure to another in the most cost-effective way and has wide range of applications from economics to machine learning. In this paper, we introduce and study an information constrained variation of this problem. Our study yields a strengthening and generalization of Talagrand's celebrated transportation cost inequality. Following Marton's… ▽ More The optimal transport problem studies how to transport one measure to another in the most cost-effective way and has wide range of applications from economics to machine learning. In this paper, we introduce and study an information constrained variation of this problem. Our study yields a strengthening and generalization of Talagrand's celebrated transportation cost inequality. Following Marton's approach, we show that the new transportation cost inequality can be used to recover old and new concentration of measure results. Finally, we provide an application of this new inequality to network information theory. We show that it can be used to recover almost immediately a recent solution to a long-standing open problem posed by Cover regarding the capacity of the relay channel. △ Less

Submitted 24 August, 2020; originally announced August 2020.

arXiv:2007.11707 [pdf, other]

Breaking the Communication-Privacy-Accuracy Trilemma

Authors: Wei-Ning Chen, Peter Kairouz, Ayfer Özgür

Abstract: Two major challenges in distributed learning and estimation are 1) preserving the privacy of the local samples; and 2) communicating them efficiently to a central server, while achieving high accuracy for the end-to-end task. While there has been significant interest in addressing each of these challenges separately in the recent literature, treatments that simultaneously address both challenges a… ▽ More Two major challenges in distributed learning and estimation are 1) preserving the privacy of the local samples; and 2) communicating them efficiently to a central server, while achieving high accuracy for the end-to-end task. While there has been significant interest in addressing each of these challenges separately in the recent literature, treatments that simultaneously address both challenges are still largely missing. In this paper, we develop novel encoding and decoding mechanisms that simultaneously achieve optimal privacy and communication efficiency in various canonical settings. In particular, we consider the problems of mean estimation and frequency estimation under $\varepsilon$-local differential privacy and $b$-bit communication constraints. For mean estimation, we propose a scheme based on Kashin's representation and random sampling, with order-optimal estimation error under both constraints. For frequency estimation, we present a mechanism that leverages the recursive structure of Walsh-Hadamard matrices and achieves order-optimal estimation error for all privacy levels and communication budgets. As a by-product, we also construct a distribution estimation mechanism that is rate-optimal for all privacy regimes and communication constraints, extending recent work that is limited to $b=1$ and $\varepsilon=O(1)$. Our results demonstrate that intelligent encoding under joint privacy and communication constraints can yield a performance that matches the optimal accuracy achievable under either constraint alone. △ Less

Submitted 20 April, 2021; v1 submitted 22 July, 2020; originally announced July 2020.

Comments: 35 pages, 9 figures, submitted to NeurIPS 2020

arXiv:2006.08160 [pdf, other]

Lower Bounds and a Near-Optimal Shrinkage Estimator for Least Squares using Random Projections

Authors: Srivatsan Sridhar, Mert Pilanci, Ayfer Özgür

Abstract: In this work, we consider the deterministic optimization using random projections as a statistical estimation problem, where the squared distance between the predictions from the estimator and the true solution is the error metric. In approximately solving a large scale least squares problem using Gaussian sketches, we show that the sketched solution has a conditional Gaussian distribution with th… ▽ More In this work, we consider the deterministic optimization using random projections as a statistical estimation problem, where the squared distance between the predictions from the estimator and the true solution is the error metric. In approximately solving a large scale least squares problem using Gaussian sketches, we show that the sketched solution has a conditional Gaussian distribution with the true solution as its mean. Firstly, tight worst case error lower bounds with explicit constants are derived for any estimator using the Gaussian sketch, and the classical sketching is shown to be the optimal unbiased estimator. For biased estimators, the lower bound also incorporates prior knowledge about the true solution. Secondly, we use the James-Stein estimator to derive an improved estimator for the least squares solution using the Gaussian sketch. An upper bound on the expected error of this estimator is derived, which is smaller than the error of the classical Gaussian sketch solution for any given data. The upper and lower bounds match when the SNR of the true solution is known to be small and the data matrix is well conditioned. Empirically, this estimator achieves smaller error on simulated and real datasets, and works for other common sketching methods as well. △ Less

Submitted 15 June, 2020; originally announced June 2020.

Comments: This work has been submitted to the IEEE Journal on Selected Areas in Information Theory (JSAIT) - Special Issue on Estimation and Inference, and is awaiting review. This document contains 37 pages and 14 figures

arXiv:2006.01760 [pdf, other]

Modelling of daily reference evapotranspiration using deep neural network in different climates

Authors: Atilla Özgür, Sevim Seda Yamaç

Abstract: Precise and reliable estimation of reference evapotranspiration (ET o ) is an essential for the irrigation and water resources management. ET o is difficult to predict due to its complex processes. This complexity can be solved using machine learning methods. This study investigates the performance of artificial neural network (ANN) and deep neural network (DNN) models for estimating daily ET o .… ▽ More Precise and reliable estimation of reference evapotranspiration (ET o ) is an essential for the irrigation and water resources management. ET o is difficult to predict due to its complex processes. This complexity can be solved using machine learning methods. This study investigates the performance of artificial neural network (ANN) and deep neural network (DNN) models for estimating daily ET o . Previously proposed ANN and DNN methods have been realized, and their performances have been compared. Six input data including maximum air temperature (T max ), minimum air temperature (T min ), solar radiation (R n ), maximum relative humidity (RH max ), minimum relative humidity (RH min ) and wind speed (U 2 ) are used from 4 meteorological stations (Adana, Aksaray, Isparta and Niğde) during 1999-2018 in Turkey. The results have shown that our proposed DNN models achieves satisfactory accuracy for daily ET o estimation compared to previous ANN and DNN models. The best performance has been observed with the proposed model of DNN with SeLU activation function (P-DNN-SeLU) in Aksaray with coefficient of determination (R 2 ) of 0.9934, root mean square error (RMSE) of 0.2073 and mean absolute error (MAE) of 0.1590, respectively. Therefore, the P-DNN-SeLU model could be recommended for estimation of ET o in other climate zones of the world. △ Less

Submitted 19 June, 2020; v1 submitted 2 June, 2020; originally announced June 2020.

ACM Class: I.2

arXiv:2005.10848 [pdf, other]

doi 10.1109/JSAIT.2020.3041804

Global Multiclass Classification and Dataset Construction via Heterogeneous Local Experts

Authors: Surin Ahn, Ayfer Ozgur, Mert Pilanci

Abstract: In the domains of dataset construction and crowdsourcing, a notable challenge is to aggregate labels from a heterogeneous set of labelers, each of whom is potentially an expert in some subset of tasks (and less reliable in others). To reduce costs of hiring human labelers or training automated labeling systems, it is of interest to minimize the number of labelers while ensuring the reliability of… ▽ More In the domains of dataset construction and crowdsourcing, a notable challenge is to aggregate labels from a heterogeneous set of labelers, each of whom is potentially an expert in some subset of tasks (and less reliable in others). To reduce costs of hiring human labelers or training automated labeling systems, it is of interest to minimize the number of labelers while ensuring the reliability of the resulting dataset. We model this as the problem of performing $K$-class classification using the predictions of smaller classifiers, each trained on a subset of $[K]$, and derive bounds on the number of classifiers needed to accurately infer the true class of an unlabeled sample under both adversarial and stochastic assumptions. By exploiting a connection to the classical set cover problem, we produce a near-optimal scheme for designing such configurations of classifiers which recovers the well known one-vs.-one classification approach as a special case. Experiments with the MNIST and CIFAR-10 datasets demonstrate the favorable accuracy (compared to a centralized classifier) of our aggregation scheme applied to classifiers trained on subsets of the data. These results suggest a new way to automatically label data or adapt an existing set of local classifiers to larger-scale multiclass problems. △ Less

Submitted 5 January, 2021; v1 submitted 21 May, 2020; originally announced May 2020.

Comments: 27 pages, 8 figures, to be published in IEEE Journal on Selected Areas in Information Theory (JSAIT) - Special Issue on Estimation and Inference

arXiv:2005.10783 [pdf, ps, other]

Fisher information under local differential privacy

Authors: Leighton Pate Barnes, Wei-Ning Chen, Ayfer Ozgur

Abstract: We develop data processing inequalities that describe how Fisher information from statistical samples can scale with the privacy parameter $\varepsilon$ under local differential privacy constraints. These bounds are valid under general conditions on the distribution of the score of the statistical model, and they elucidate under which conditions the dependence on $\varepsilon$ is linear, quadratic… ▽ More We develop data processing inequalities that describe how Fisher information from statistical samples can scale with the privacy parameter $\varepsilon$ under local differential privacy constraints. These bounds are valid under general conditions on the distribution of the score of the statistical model, and they elucidate under which conditions the dependence on $\varepsilon$ is linear, quadratic, or exponential. We show how these inequalities imply order optimal lower bounds for private estimation for both the Gaussian location model and discrete distribution estimation for all levels of privacy $\varepsilon>0$. We further apply these inequalities to sparse Bernoulli models and demonstrate privacy mechanisms and estimators with order-matching squared $\ell^2$ error. △ Less

Submitted 21 May, 2020; originally announced May 2020.

arXiv:2005.10761 [pdf, other]

rTop-k: A Statistical Estimation Approach to Distributed SGD

Authors: Leighton Pate Barnes, Huseyin A. Inan, Berivan Isik, Ayfer Ozgur

Abstract: The large communication cost for exchanging gradients between different nodes significantly limits the scalability of distributed training for large-scale learning models. Motivated by this observation, there has been significant recent interest in techniques that reduce the communication cost of distributed Stochastic Gradient Descent (SGD), with gradient sparsification techniques such as top-k a… ▽ More The large communication cost for exchanging gradients between different nodes significantly limits the scalability of distributed training for large-scale learning models. Motivated by this observation, there has been significant recent interest in techniques that reduce the communication cost of distributed Stochastic Gradient Descent (SGD), with gradient sparsification techniques such as top-k and random-k shown to be particularly effective. The same observation has also motivated a separate line of work in distributed statistical estimation theory focusing on the impact of communication constraints on the estimation efficiency of different statistical models. The primary goal of this paper is to connect these two research lines and demonstrate how statistical estimation models and their analysis can lead to new insights in the design of communication-efficient training techniques. We propose a simple statistical estimation model for the stochastic gradients which captures the sparsity and skewness of their distribution. The statistically optimal communication scheme arising from the analysis of this model leads to a new sparsification technique for SGD, which concatenates random-k and top-k, considered separately in the prior literature. We show through extensive experiments on both image and language domains with CIFAR-10, ImageNet, and Penn Treebank datasets that the concatenated application of these two sparsification methods consistently and significantly outperforms either method applied alone. △ Less

Submitted 2 December, 2020; v1 submitted 21 May, 2020; originally announced May 2020.

arXiv:2004.01277 [pdf, other]

The Courtade-Kumar Most Informative Boolean Function Conjecture and a Symmetrized Li-Médard Conjecture are Equivalent

Authors: Leighton Pate Barnes, Ayfer Özgür

Abstract: We consider the Courtade-Kumar most informative Boolean function conjecture for balanced functions, as well as a conjecture by Li and Médard that dictatorship functions also maximize the $L^α$ norm of $T_pf$ for $1\leqα\leq2$ where $T_p$ is the noise operator and $f$ is a balanced Boolean function. By using a result due to Laguerre from the 1880's, we are able to bound how many times an $L^α$-norm… ▽ More We consider the Courtade-Kumar most informative Boolean function conjecture for balanced functions, as well as a conjecture by Li and Médard that dictatorship functions also maximize the $L^α$ norm of $T_pf$ for $1\leqα\leq2$ where $T_p$ is the noise operator and $f$ is a balanced Boolean function. By using a result due to Laguerre from the 1880's, we are able to bound how many times an $L^α$-norm related quantity can cross zero as a function of $α$, and show that these two conjectures are essentially equivalent. △ Less

Submitted 2 April, 2020; originally announced April 2020.

arXiv:2002.10416 [pdf, other]

Resources for Turkish Dependency Parsing: Introducing the BOUN Treebank and the BoAT Annotation Tool

Authors: Utku Türk, Furkan Atmaca, Şaziye Betül Özateş, Gözde Berk, Seyyit Talha Bedir, Abdullatif Köksal, Balkız Öztürk Başaran, Tunga Güngör, Arzucan Özgür

Abstract: In this paper, we introduce the resources that we developed for Turkish dependency parsing, which include a novel manually annotated treebank (BOUN Treebank), along with the guidelines we adopted, and a new annotation tool (BoAT). The manual annotation process we employed was shaped and implemented by a team of four linguists and five Natural Language Processing (NLP) specialists. Decisions regard… ▽ More In this paper, we introduce the resources that we developed for Turkish dependency parsing, which include a novel manually annotated treebank (BOUN Treebank), along with the guidelines we adopted, and a new annotation tool (BoAT). The manual annotation process we employed was shaped and implemented by a team of four linguists and five Natural Language Processing (NLP) specialists. Decisions regarding the annotation of the BOUN Treebank were made in line with the Universal Dependencies (UD) framework as well as our recent efforts for unifying the Turkish UD treebanks through manual re-annotation. To the best of our knowledge, BOUN Treebank is the largest Turkish treebank. It contains a total of 9,761 sentences from various topics including biographical texts, national newspapers, instructional texts, popular culture articles, and essays. In addition, we report the parsing results of a state-of-the-art dependency parser obtained over the BOUN Treebank as well as two other treebanks in Turkish. Our results demonstrate that the unification of the Turkish annotation scheme and the introduction of a more comprehensive treebank lead to improved performance with regard to dependency parsing. △ Less

Submitted 16 September, 2021; v1 submitted 24 February, 2020; originally announced February 2020.

Comments: Language Resource and Evaluation

arXiv:2002.10116 [pdf, other]

doi 10.1109/ACCESS.2022.3202947

A Hybrid Approach to Dependency Parsing: Combining Rules and Morphology with Deep Learning

Authors: Şaziye Betül Özateş, Arzucan Özgür, Tunga Güngör, Balkız Öztürk

Abstract: Fully data-driven, deep learning-based models are usually designed as language-independent and have been shown to be successful for many natural language processing tasks. However, when the studied language is low-resourced and the amount of training data is insufficient, these models can benefit from the integration of natural language grammar-based information. We propose two approaches to depen… ▽ More Fully data-driven, deep learning-based models are usually designed as language-independent and have been shown to be successful for many natural language processing tasks. However, when the studied language is low-resourced and the amount of training data is insufficient, these models can benefit from the integration of natural language grammar-based information. We propose two approaches to dependency parsing especially for languages with restricted amount of training data. Our first approach combines a state-of-the-art deep learning-based parser with a rule-based approach and the second one incorporates morphological information into the parser. In the rule-based approach, the parsing decisions made by the rules are encoded and concatenated with the vector representations of the input words as additional information to the deep network. The morphology-based approach proposes different methods to include the morphological structure of words into the parser network. Experiments are conducted on the IMST-UD Treebank and the results suggest that integration of explicit knowledge about the target language to a neural parser through a rule-based parsing system and morphological analysis leads to more accurate annotations and hence, increases the parsing performance in terms of attachment scores. The proposed methods are developed for Turkish, but can be adapted to other languages as well. △ Less

Submitted 24 February, 2020; originally announced February 2020.

Comments: 25 pages, 7 figures

ACM Class: I.2.7

arXiv:2002.06053 [pdf, other]

doi 10.1016/j.drudis.2020.01.020

Exploring Chemical Space using Natural Language Processing Methodologies for Drug Discovery

Authors: Hakime Öztürk, Arzucan Özgür, Philippe Schwaller, Teodoro Laino, Elif Ozkirimli

Abstract: Text-based representations of chemicals and proteins can be thought of as unstructured languages codified by humans to describe domain-specific knowledge. Advances in natural language processing (NLP) methodologies in the processing of spoken languages accelerated the application of NLP to elucidate hidden knowledge in textual representations of these biochemical entities and then use it to constr… ▽ More Text-based representations of chemicals and proteins can be thought of as unstructured languages codified by humans to describe domain-specific knowledge. Advances in natural language processing (NLP) methodologies in the processing of spoken languages accelerated the application of NLP to elucidate hidden knowledge in textual representations of these biochemical entities and then use it to construct models to predict molecular properties or to design novel molecules. This review outlines the impact made by these advances on drug discovery and aims to further the dialogue between medicinal chemists and computer scientists. △ Less

Submitted 10 February, 2020; originally announced February 2020.

arXiv:1912.04977 [pdf, other]

Advances and Open Problems in Federated Learning

Authors: Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D'Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson , et al. (34 additional authors not shown)

Abstract: Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs re… ▽ More Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges. △ Less

Submitted 8 March, 2021; v1 submitted 10 December, 2019; originally announced December 2019.

Comments: Published in Foundations and Trends in Machine Learning Vol 4 Issue 1. See: https://www.nowpublishers.com/article/Details/MAL-083

arXiv:1910.01625 [pdf, ps, other]

Minimax Bounds for Distributed Logistic Regression

Authors: Leighton Pate Barnes, Ayfer Ozgur

Abstract: We consider a distributed logistic regression problem where labeled data pairs $(X_i,Y_i)\in \mathbb{R}^d\times\{-1,1\}$ for $i=1,\ldots,n$ are distributed across multiple machines in a network and must be communicated to a centralized estimator using at most $k$ bits per labeled pair. We assume that the data $X_i$ come independently from some distribution $P_X$, and that the distribution of… ▽ More We consider a distributed logistic regression problem where labeled data pairs $(X_i,Y_i)\in \mathbb{R}^d\times\{-1,1\}$ for $i=1,\ldots,n$ are distributed across multiple machines in a network and must be communicated to a centralized estimator using at most $k$ bits per labeled pair. We assume that the data $X_i$ come independently from some distribution $P_X$, and that the distribution of $Y_i$ conditioned on $X_i$ follows a logistic model with some parameter $θ\in\mathbb{R}^d$. By using a Fisher information argument, we give minimax lower bounds for estimating $θ$ under different assumptions on the tail of the distribution $P_X$. We consider both $\ell^2$ and logistic losses, and show that for the logistic loss our sub-Gaussian lower bound is order-optimal and cannot be improved. △ Less

Submitted 3 October, 2019; originally announced October 2019.

Showing 1–50 of 96 results for author: Ozgur, A