
Showing 1–50 of 96 results for author: Ozgur, A

  1. arXiv:2503.08838  [pdf, ps, other]

    cs.CL q-bio.QM

    evoBPE: Evolutionary Protein Sequence Tokenization

    Authors: Burak Suyunu, Özdeniz Dolu, Arzucan Özgür

    Abstract: Recent advancements in computational biology have drawn compelling parallels between protein sequences and linguistic structures, highlighting the need for sophisticated tokenization methods that capture the intricate evolutionary dynamics of protein sequences. Current subword tokenization techniques, primarily developed for natural language processing, often fail to represent protein sequences' c…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: 13 pages, 8 figures, 1 table, 1 algorithm

  2. arXiv:2503.03043  [pdf, ps, other]

    cs.LG cs.CR

    Leveraging Randomness in Model and Data Partitioning for Privacy Amplification

    Authors: Andy Dong, Wei-Ning Chen, Ayfer Ozgur

    Abstract: We study how inherent randomness in the training process -- where each sample (or client in federated learning) contributes only to a randomly selected portion of training -- can be leveraged for privacy amplification. This includes (1) data partitioning, where a sample participates in only a subset of training iterations, and (2) model partitioning, where a sample updates only a subset of the mod…

    Submitted 1 June, 2025; v1 submitted 4 March, 2025; originally announced March 2025.

  3. arXiv:2502.09659  [pdf]

    cs.CL cs.AI cs.CY

    Cancer Vaccine Adjuvant Name Recognition from Biomedical Literature using Large Language Models

    Authors: Hasin Rehana, Jie Zheng, Leo Yeh, Benu Bansal, Nur Bengisu Çam, Christianah Jemiyo, Brett McGregor, Arzucan Özgür, Yongqun He, Junguk Hur

    Abstract: Motivation: An adjuvant is a chemical incorporated into vaccines that enhances their efficacy by improving the immune response. Identifying adjuvant names from cancer vaccine studies is essential for furthering research and enhancing immunotherapies. However, the manual curation from the constantly expanding biomedical literature poses significant challenges. This study explores the automated reco…

    Submitted 12 February, 2025; originally announced February 2025.

    Comments: 10 pages, 6 figures, 4 tables

  4. arXiv:2411.17669  [pdf, ps, other]

    cs.CL q-bio.QM

    Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods

    Authors: Burak Suyunu, Enes Taylan, Arzucan Özgür

    Abstract: Tokenization is a crucial step in processing protein sequences for machine learning models, as proteins are complex sequences of amino acids that require meaningful segmentation to capture their functional and structural properties. However, existing subword tokenization methods, developed primarily for human language, may be inadequate for protein sequences, which have unique patterns and constra…

    Submitted 26 November, 2024; originally announced November 2024.

    Comments: 8 pages, 9 figures

  5. arXiv:2407.06718  [pdf, other]

    cs.AI

    A Simple Architecture for Enterprise Large Language Model Applications based on Role based security and Clearance Levels using Retrieval-Augmented Generation or Mixture of Experts

    Authors: Atilla Özgür, Yılmaz Uygun

    Abstract: This study proposes a simple architecture for enterprise applications of Large Language Models (LLMs) that supports role-based security and NATO clearance levels. Our proposal aims to address the limitations of current LLMs in handling security and information access. The proposed architecture could be used while utilizing Retrieval-Augmented Generation (RAG) and fine tuning of Mixture of experts models (Mo…

    Submitted 9 July, 2024; originally announced July 2024.

    ACM Class: D.2.11; I.2.7

  6. arXiv:2406.10036  [pdf, other]

    cs.IT

    Information Compression in the AI Era: Recent Advances and Future Challenges

    Authors: Jun Chen, Yong Fang, Ashish Khisti, Ayfer Ozgur, Nir Shlezinger, Chao Tian

    Abstract: This survey article focuses on emerging connections between the fields of machine learning and data compression. While fundamental limits of classical (lossy) data compression are established using rate-distortion theory, the connections to machine learning have resulted in new theoretical analysis and application areas. We survey recent works on task-based and goal-oriented compression, the rate…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: arXiv admin note: text overlap with arXiv:2002.04290

  7. arXiv:2405.20782  [pdf, other]

    cs.CR cs.IT stat.ML

    Universal Exact Compression of Differentially Private Mechanisms

    Authors: Yanxiao Liu, Wei-Ning Chen, Ayfer Özgür, Cheuk Ting Li

    Abstract: To reduce the communication cost of differential privacy mechanisms, we introduce a novel construction, called Poisson private representation (PPR), designed to compress and simulate any local randomizer while ensuring local differential privacy. Unlike previous simulation-based local differential privacy mechanisms, PPR exactly preserves the joint distribution of the data and the output of the or…

    Submitted 10 November, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: 33 pages, 5 figures

  8. arXiv:2402.01895  [pdf, ps, other]

    cs.IT math.ST

    $L_q$ Lower Bounds on Distributed Estimation via Fisher Information

    Authors: Wei-Ning Chen, Ayfer Özgür

    Abstract: The Van Trees inequality, also known as the Bayesian Cramér-Rao lower bound, is a powerful tool for establishing lower bounds for minimax estimation through Fisher information. It easily adapts to different statistical models and often yields tight bounds. Recently, its application has been extended to distributed estimation with privacy and communication constraints where it yields order-wise optimal…

    Submitted 26 April, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

  9. arXiv:2311.04375  [pdf, ps, other]

    cs.CR stat.AP

    Federated Experiment Design under Distributed Differential Privacy

    Authors: Wei-Ning Chen, Graham Cormode, Akash Bharadwaj, Peter Romov, Ayfer Özgür

    Abstract: Experiment design has a rich history dating back over a century and has found many critical applications across various fields since then. The use and collection of users' data in experiments often involve sensitive personal information, so additional measures to protect individual privacy are required during data collection, storage, and usage. In this work, we focus on the rigorous protection of…

    Submitted 7 November, 2023; originally announced November 2023.

  10. arXiv:2307.10634  [pdf, other]

    q-bio.GN cs.CL cs.LG

    Generative Language Models on Nucleotide Sequences of Human Genes

    Authors: Musa Nuri Ihtiyar, Arzucan Ozgur

    Abstract: Language models, primarily transformer-based ones, have achieved colossal success in NLP. To be more precise, studies like BERT in NLU and works such as GPT-3 for NLG are very crucial. DNA sequences are very close to natural language in terms of structure, so if the DNA-related bioinformatics domain is concerned, discriminative models, like DNABert, exist. Yet, the generative side of the coin is mainly…

    Submitted 20 July, 2023; originally announced July 2023.

  11. arXiv:2307.06422  [pdf, other]

    cs.LG

    Differentially Private Decoupled Graph Convolutions for Multigranular Topology Protection

    Authors: Eli Chien, Wei-Ning Chen, Chao Pan, Pan Li, Ayfer Özgür, Olgica Milenkovic

    Abstract: GNNs can inadvertently expose sensitive user information and interactions through their model predictions. To address these privacy concerns, Differential Privacy (DP) protocols are employed to control the trade-off between provable privacy protection and model utility. Applying standard DP approaches to GNNs directly is not advisable for two main reasons. First, the prediction of node labels,…

    Submitted 14 October, 2023; v1 submitted 12 July, 2023; originally announced July 2023.

    Comments: NeurIPS 2023

  12. arXiv:2306.09547  [pdf, other]

    cs.LG cs.CR cs.IT

    Training generative models from privatized data

    Authors: Daria Reshetova, Wei-Ning Chen, Ayfer Özgür

    Abstract: Local differential privacy is a powerful method for privacy-preserving data collection. In this paper, we develop a framework for training Generative Adversarial Networks (GANs) on differentially privatized data. We show that entropic regularization of optimal transport - a popular regularization method in the literature that has often been leveraged for its computational benefits - enables the ge…

    Submitted 29 February, 2024; v1 submitted 15 June, 2023; originally announced June 2023.

  13. arXiv:2306.04924  [pdf, other]

    cs.LG cs.CR cs.DC cs.IT stat.ML

    Exact Optimality of Communication-Privacy-Utility Tradeoffs in Distributed Mean Estimation

    Authors: Berivan Isik, Wei-Ning Chen, Ayfer Ozgur, Tsachy Weissman, Albert No

    Abstract: We study the mean estimation problem under communication and local differential privacy constraints. While previous work has proposed \emph{order}-optimal algorithms for the same problem (i.e., asymptotically optimal as we spend more bits), \emph{exact} optimality (in the non-asymptotic setting) still has not been achieved. In this work, we take a step towards characterizing the \emph{exact}-optim…

    Submitted 28 October, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

    Comments: Published at the Conference on Neural Information Processing Systems (NeurIPS), 2023

  14. arXiv:2304.01541  [pdf, other]

    stat.ML cs.CR cs.LG

    Privacy Amplification via Compression: Achieving the Optimal Privacy-Accuracy-Communication Trade-off in Distributed Mean Estimation

    Authors: Wei-Ning Chen, Dan Song, Ayfer Ozgur, Peter Kairouz

    Abstract: Privacy and communication constraints are two major bottlenecks in federated learning (FL) and analytics (FA). We study the optimal accuracy of mean and frequency estimation (canonical models for FL and FA respectively) under joint communication and $(\varepsilon, δ)$-differential privacy (DP) constraints. We show that in order to achieve the optimal error under $(\varepsilon, δ)$-DP, it is suffic…

    Submitted 4 April, 2023; originally announced April 2023.

  15. arXiv:2303.17728  [pdf]

    cs.CL cs.AI

    Evaluation of GPT and BERT-based models on identifying protein-protein interactions in biomedical text

    Authors: Hasin Rehana, Nur Bengisu Çam, Mert Basmaci, Jie Zheng, Christianah Jemiyo, Yongqun He, Arzucan Özgür, Junguk Hur

    Abstract: Detecting protein-protein interactions (PPIs) is crucial for understanding genetic mechanisms, disease pathogenesis, and drug design. However, with the fast-paced growth of biomedical literature, there is a growing need for automated and accurate extraction of PPIs to facilitate scientific knowledge discovery. Pre-trained language models, such as generative pre-trained transformers (GPT) and bidir…

    Submitted 12 December, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

  16. arXiv:2301.02079  [pdf, other]

    cs.AI cs.CR cs.HC

    PEAK: Explainable Privacy Assistant through Automated Knowledge Extraction

    Authors: Gonul Ayci, Arzucan Özgür, Murat Şensoy, Pınar Yolum

    Abstract: In the realm of online privacy, privacy assistants play a pivotal role in empowering users to manage their privacy effectively. Although recent studies have shown promising progress in tackling tasks such as privacy violation detection and personalized privacy recommendations, a crucial aspect for widespread user adoption is the capability of these systems to provide explanations for their decisio…

    Submitted 31 May, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

    Comments: 43 pages, 14 figures

  17. arXiv:2211.10041  [pdf, other]

    cs.IT cs.DS

    The communication cost of security and privacy in federated frequency estimation

    Authors: Wei-Ning Chen, Ayfer Özgür, Graham Cormode, Akash Bharadwaj

    Abstract: We consider the federated frequency estimation problem, where each user holds a private item $X_i$ from a size-$d$ domain and a server aims to estimate the empirical frequency (i.e., histogram) of $n$ items with $n \ll d$. Without any security and privacy considerations, each user can communicate its item to the server by using $\log d$ bits. A naive application of secure aggregation protocols wou…

    Submitted 18 November, 2022; originally announced November 2022.

  18. arXiv:2209.00981  [pdf, other]

    cs.LG cs.CL q-bio.BM q-bio.QM stat.ML

    Exploiting Pretrained Biochemical Language Models for Targeted Drug Design

    Authors: Gökçe Uludoğan, Elif Ozkirimli, Kutlu O. Ulgen, Nilgün Karalı, Arzucan Özgür

    Abstract: Motivation: The development of novel compounds targeting proteins of interest is one of the most important tasks in the pharmaceutical industry. Deep generative models have been applied to targeted molecular design and have shown promising results. Recently, target-specific molecule generation has been viewed as a translation between the protein language and the chemical language. However, such a…

    Submitted 2 September, 2022; originally announced September 2022.

    Comments: 12 pages, to appear in Bioinformatics

  19. arXiv:2207.11782  [pdf, other]

    cs.CL

    Enhancements to the BOUN Treebank Reflecting the Agglutinative Nature of Turkish

    Authors: Büşra Marşan, Salih Furkan Akkurt, Muhammet Şen, Merve Gürbüz, Onur Güngör, Şaziye Betül Özateş, Suzan Üsküdarlı, Arzucan Özgür, Tunga Güngör, Balkız Öztürk

    Abstract: In this study, we aim to offer linguistically motivated solutions to resolve the issues of the lack of representation of null morphemes, highly productive derivational processes, and syncretic morphemes of Turkish in the BOUN Treebank without diverging from the Universal Dependencies framework. In order to tackle these issues, new annotation conventions were introduced by splitting certain lemma…

    Submitted 24 July, 2022; originally announced July 2022.

    Comments: This is a peer reviewed article that has been presented in The International Conference on Agglutinative Language Technologies as a challenge of Natural Language Processing (ALTNLP) 2022

  20. arXiv:2207.09916  [pdf, other]

    cs.CR cs.IT cs.LG stat.ML

    The Poisson binomial mechanism for secure and private federated learning

    Authors: Wei-Ning Chen, Ayfer Özgür, Peter Kairouz

    Abstract: We introduce the Poisson Binomial mechanism (PBM), a discrete differential privacy mechanism for distributed mean estimation (DME) with applications to federated learning and analytics. We provide a tight analysis of its privacy guarantees, showing that it achieves the same privacy-accuracy trade-offs as the continuous Gaussian mechanism. Our analysis is based on a novel bound on the Rényi diverge…

    Submitted 9 July, 2022; originally announced July 2022.

    Comments: 25 pages

  21. arXiv:2205.06544  [pdf, other]

    cs.AI

    Uncertainty-aware Personal Assistant for Making Personalized Privacy Decisions

    Authors: Gonul Ayci, Murat Sensoy, Arzucan Özgür, Pınar Yolum

    Abstract: Many software systems, such as online social networks, enable users to share information about themselves. While the action of sharing is simple, it requires an elaborate thought process on privacy: what to share, with whom to share, and for what purposes. Thinking about these for each piece of content to be shared is tedious. Recent approaches to tackle this problem build personal assistants that…

    Submitted 28 July, 2022; v1 submitted 13 May, 2022; originally announced May 2022.

    Comments: 24 pages, 11 figures, 7 tables

  22. arXiv:2205.04185  [pdf, other]

    cs.CL cs.LG

    A Dataset and BERT-based Models for Targeted Sentiment Analysis on Turkish Texts

    Authors: M. Melih Mutlu, Arzucan Özgür

    Abstract: Targeted Sentiment Analysis aims to extract sentiment towards a particular target from a given text. It is a field that is attracting attention due to the increasing accessibility of the Internet, which leads people to generate an enormous amount of data. Sentiment analysis, which in general requires annotated data for training, is a well-researched area for widely studied languages such as Englis…

    Submitted 9 May, 2022; originally announced May 2022.

  23. arXiv:2111.01387  [pdf, other]

    cs.LG stat.ML

    Understanding Entropic Regularization in GANs

    Authors: Daria Reshetova, Yikun Bai, Xiugang Wu, Ayfer Ozgur

    Abstract: Generative Adversarial Networks are a popular method for learning distributions from data by modeling the target distribution as a function of a known distribution. The function, often referred to as the generator, is optimized to minimize a chosen distance measure between the generated and target distributions. One commonly used measure for this purpose is the Wasserstein distance. However, Wasse…

    Submitted 2 November, 2021; originally announced November 2021.

    Comments: 29 pages, 7 figures

  24. arXiv:2110.03189  [pdf, other]

    cs.IT

    Pointwise Bounds for Distribution Estimation under Communication Constraints

    Authors: Wei-Ning Chen, Peter Kairouz, Ayfer Özgür

    Abstract: We consider the problem of estimating a $d$-dimensional discrete distribution from its samples observed under a $b$-bit communication constraint. In contrast to most previous results that largely focus on the global minimax error, we study the local behavior of the estimation error and provide \emph{pointwise} bounds that depend on the target distribution $p$. In particular, we show that the…

    Submitted 29 October, 2021; v1 submitted 7 October, 2021; originally announced October 2021.

  25. arXiv:2110.00202  [pdf, other]

    cs.LG

    Batched Thompson Sampling

    Authors: Cem Kalkanli, Ayfer Ozgur

    Abstract: We introduce a novel anytime Batched Thompson sampling policy for multi-armed bandits where the agent observes the rewards of her actions and adjusts her policy only at the end of a small number of batches. We show that this policy simultaneously achieves a problem dependent regret of order $O(\log(T))$ and a minimax regret of order $O(\sqrt{T\log(T)})$ while the number of batches can be bounded b…

    Submitted 1 October, 2021; originally announced October 2021.

    Comments: This work is accepted to Thirty-fifth Conference on Neural Information Processing Systems, NeurIPS 2021

  26. Asymptotic Performance of Thompson Sampling in the Batched Multi-Armed Bandits

    Authors: Cem Kalkanli, Ayfer Ozgur

    Abstract: We study the asymptotic performance of the Thompson sampling algorithm in the batched multi-armed bandit setting where the time horizon $T$ is divided into batches, and the agent is not able to observe the rewards of her actions until the end of each batch. We show that in this batched setting, Thompson sampling achieves the same asymptotic performance as in the case where instantaneous feedback i…

    Submitted 30 September, 2021; originally announced October 2021.

    Comments: This work was presented in 2021 IEEE International Symposium on Information Theory (ISIT)

    Journal ref: IEEE International Symposium on Information Theory (ISIT), 2021, pp. 539-544

  27. Cluster-based Mention Typing for Named Entity Disambiguation

    Authors: Arda Çelebi, Arzucan Özgür

    Abstract: An entity mention in text such as "Washington" may correspond to many different named entities such as the city "Washington D.C." or the newspaper "Washington Post." The goal of named entity disambiguation is to identify the mentioned named entity correctly among all possible candidates. If the type (e.g. location or person) of a mentioned entity can be correctly predicted from the context, it may…

    Submitted 23 September, 2021; originally announced September 2021.

    Comments: 46 pages, 11 figures, 14 tables

    Journal ref: Nat. Lang. Eng. 28 (2022) 1-37

  28. arXiv:2109.04712  [pdf, other]

    cs.CL

    Balancing Methods for Multi-label Text Classification with Long-Tailed Class Distribution

    Authors: Yi Huang, Buse Giledereli, Abdullatif Köksal, Arzucan Özgür, Elif Ozkirimli

    Abstract: Multi-label text classification is a challenging task because it requires capturing label dependencies. It becomes even more challenging when class distribution is long-tailed. Resampling and re-weighting are common approaches used for addressing the class imbalance problem; however, they are not effective when there is label dependency besides class imbalance because they result in oversampling o…

    Submitted 15 October, 2021; v1 submitted 10 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021

  29. arXiv:2107.05556  [pdf, other]

    q-bio.QM cs.LG

    DebiasedDTA: A Framework for Improving the Generalizability of Drug-Target Affinity Prediction Models

    Authors: Rıza Özçelik, Alperen Bağ, Berk Atıl, Melih Barsbey, Arzucan Özgür, Elif Özkırımlı

    Abstract: Computational models that accurately predict the binding affinity of an input protein-chemical pair can accelerate drug discovery studies. These models are trained on available protein-chemical interaction datasets, which may contain dataset biases that may lead the model to learn dataset-specific patterns, instead of generalizable relationships. As a result, the prediction performance of models d…

    Submitted 8 January, 2023; v1 submitted 4 July, 2021; originally announced July 2021.

  30. arXiv:2106.08597  [pdf, ps, other]

    stat.ML cs.LG

    Breaking The Dimension Dependence in Sparse Distribution Estimation under Communication Constraints

    Authors: Wei-Ning Chen, Peter Kairouz, Ayfer Özgür

    Abstract: We consider the problem of estimating a $d$-dimensional $s$-sparse discrete distribution from its samples observed under a $b$-bit communication constraint. The best-known previous result on $\ell_2$ estimation error for this problem is $O\left( \frac{s\log\left( {d}/{s}\right)}{n2^b}\right)$. Surprisingly, we show that when sample size $n$ exceeds a minimum threshold $n^*(s, d, b)$, we can achiev…

    Submitted 16 June, 2021; originally announced June 2021.

  31. arXiv:2105.13793  [pdf, other]

    eess.IV cs.CV

    A systematic review of transfer learning based approaches for diabetic retinopathy detection

    Authors: Burcu Oltu, Büşra Kübra Karaca, Hamit Erdem, Atilla Özgür

    Abstract: Cases of diabetes and related diabetic retinopathy (DR) have been increasing at an alarming rate in modern times. Early detection of DR is an important problem since it may cause permanent blindness in the late stages. In the last two decades, many different approaches have been applied in DR detection. Reviewing academic literature shows that deep neural networks (DNNs) have become the most prefe…

    Submitted 28 May, 2021; originally announced May 2021.

    Comments: 25 pages, 9 figures, 10 tables

  32. arXiv:2103.04014  [pdf, ps, other]

    cs.IT cs.DC math.ST stat.ML

    Over-the-Air Statistical Estimation

    Authors: Chuan-Zheng Lee, Leighton Pate Barnes, Ayfer Ozgur

    Abstract: We study schemes and lower bounds for distributed minimax statistical estimation over a Gaussian multiple-access channel (MAC) under squared error loss, in a framework combining statistical estimation and wireless communication. First, we develop "analog" joint estimation-communication schemes that exploit the superposition property of the Gaussian MAC and we characterize their risk in terms of th…

    Submitted 5 March, 2021; originally announced March 2021.

    Comments: 12 pages, 5 figures

  33. arXiv:2102.05802  [pdf, ps, other]

    cs.IT math.ST

    Fisher Information and Mutual Information Constraints

    Authors: Leighton Pate Barnes, Ayfer Ozgur

    Abstract: We consider the processing of statistical samples $X\sim P_θ$ by a channel $p(y|x)$, and characterize how the statistical information from the samples for estimating the parameter $θ\in\mathbb{R}^d$ can scale with the mutual information or capacity of the channel. We show that if the statistical model has a sub-Gaussian score function, then the trace of the Fisher information matrix for estimating…

    Submitted 8 July, 2021; v1 submitted 10 February, 2021; originally announced February 2021.

  34. arXiv:2101.02405  [pdf, other]

    cs.SI stat.ME

    Adaptive Group Testing on Networks with Community Structure: The Stochastic Block Model

    Authors: Surin Ahn, Wei-Ning Chen, Ayfer Ozgur

    Abstract: Group testing was conceived during World War II to identify soldiers infected with syphilis using as few tests as possible, and it has attracted renewed interest during the COVID-19 pandemic. A long-standing assumption in the probabilistic variant of the group testing problem is that individuals are infected by the disease independently. However, this assumption rarely holds in practice, as diseas…

    Submitted 17 November, 2022; v1 submitted 7 January, 2021; originally announced January 2021.

    Comments: 27 pages, 5 figures. Presented in part at the 2021 IEEE International Symposium on Information Theory (ISIT). Restructured the paper and added new results for the noisy setting

  35. arXiv:2011.03917  [pdf, ps, other]

    cs.LG math.ST

    Asymptotic Convergence of Thompson Sampling

    Authors: Cem Kalkanli, Ayfer Ozgur

    Abstract: Thompson sampling has been shown to be an effective policy across a variety of online learning tasks. Many works have analyzed the finite time performance of Thompson sampling, and proved that it achieves a sub-linear regret under a broad range of probabilistic settings. However, its asymptotic behavior remains mostly underexplored. In this paper, we prove an asymptotic convergence result for Thomp…

    Submitted 8 November, 2020; originally announced November 2020.

  36. arXiv:2010.09381  [pdf, other]

    cs.CL

    The RELX Dataset and Matching the Multilingual Blanks for Cross-Lingual Relation Classification

    Authors: Abdullatif Köksal, Arzucan Özgür

    Abstract: Relation classification is one of the key topics in information extraction, which can be used to construct knowledge bases or to provide useful information for question answering. Current approaches for relation classification are mainly focused on the English language and require lots of training data with human annotations. Creating and annotating a large amount of training data for low-resource…

    Submitted 19 October, 2020; originally announced October 2020.

    Comments: Findings of EMNLP 2020

  37. arXiv:2009.02526  [pdf, other]

    cs.IR cs.LG q-bio.MN

    Vapur: A Search Engine to Find Related Protein-Compound Pairs in COVID-19 Literature

    Authors: Abdullatif Köksal, Hilal Dönmez, Rıza Özçelik, Elif Ozkirimli, Arzucan Özgür

    Abstract: Coronavirus Disease of 2019 (COVID-19) created dire consequences globally and triggered an intense scientific effort from different domains. The resulting publications created a huge text collection in which finding the studies related to a biomolecule of interest is challenging for general purpose search engines because the publications are rich in domain specific terminology. Here, we present Va…

    Submitted 13 October, 2020; v1 submitted 5 September, 2020; originally announced September 2020.

    Comments: EMNLP 2020 - COVID-19 Workshop

  38. arXiv:2008.10249  [pdf, other]

    cs.IT math.FA math.PR stat.ML

    Information Constrained Optimal Transport: From Talagrand, to Marton, to Cover

    Authors: Yikun Bai, Xiugang Wu, Ayfer Ozgur

    Abstract: The optimal transport problem studies how to transport one measure to another in the most cost-effective way and has a wide range of applications from economics to machine learning. In this paper, we introduce and study an information constrained variation of this problem. Our study yields a strengthening and generalization of Talagrand's celebrated transportation cost inequality. Following Marton's…

    Submitted 24 August, 2020; originally announced August 2020.

  39. arXiv:2007.11707  [pdf, other]

    cs.LG cs.CR cs.IT stat.ML

    Breaking the Communication-Privacy-Accuracy Trilemma

    Authors: Wei-Ning Chen, Peter Kairouz, Ayfer Özgür

    Abstract: Two major challenges in distributed learning and estimation are 1) preserving the privacy of the local samples; and 2) communicating them efficiently to a central server, while achieving high accuracy for the end-to-end task. While there has been significant interest in addressing each of these challenges separately in the recent literature, treatments that simultaneously address both challenges a…

    Submitted 20 April, 2021; v1 submitted 22 July, 2020; originally announced July 2020.

    Comments: 35 pages, 9 figures, submitted to NeurIPS 2020

  40. arXiv:2006.08160  [pdf, other]

    math.OC cs.IT stat.ML

    Lower Bounds and a Near-Optimal Shrinkage Estimator for Least Squares using Random Projections

    Authors: Srivatsan Sridhar, Mert Pilanci, Ayfer Özgür

    Abstract: In this work, we consider the deterministic optimization using random projections as a statistical estimation problem, where the squared distance between the predictions from the estimator and the true solution is the error metric. In approximately solving a large scale least squares problem using Gaussian sketches, we show that the sketched solution has a conditional Gaussian distribution with th…

    Submitted 15 June, 2020; originally announced June 2020.

    Comments: This work has been submitted to the IEEE Journal on Selected Areas in Information Theory (JSAIT) - Special Issue on Estimation and Inference, and is awaiting review. This document contains 37 pages and 14 figures

  41. arXiv:2006.01760  [pdf, other]

    cs.LG stat.ML

    Modelling of daily reference evapotranspiration using deep neural network in different climates

    Authors: Atilla Özgür, Sevim Seda Yamaç

    Abstract: Precise and reliable estimation of reference evapotranspiration (ET$_o$) is essential for irrigation and water resources management. ET$_o$ is difficult to predict due to its complex processes. This complexity can be solved using machine learning methods. This study investigates the performance of artificial neural network (ANN) and deep neural network (DNN) models for estimating daily ET$_o$.…

    Submitted 19 June, 2020; v1 submitted 2 June, 2020; originally announced June 2020.

    ACM Class: I.2

  42. arXiv:2005.10848  [pdf, other]

    cs.LG cs.IT stat.ML

    Global Multiclass Classification and Dataset Construction via Heterogeneous Local Experts

    Authors: Surin Ahn, Ayfer Ozgur, Mert Pilanci

    Abstract: In the domains of dataset construction and crowdsourcing, a notable challenge is to aggregate labels from a heterogeneous set of labelers, each of whom is potentially an expert in some subset of tasks (and less reliable in others). To reduce costs of hiring human labelers or training automated labeling systems, it is of interest to minimize the number of labelers while ensuring the reliability of…

    Submitted 5 January, 2021; v1 submitted 21 May, 2020; originally announced May 2020.

    Comments: 27 pages, 8 figures, to be published in IEEE Journal on Selected Areas in Information Theory (JSAIT) - Special Issue on Estimation and Inference

  43. arXiv:2005.10783  [pdf, ps, other]

    cs.IT math.ST stat.ML

    Fisher information under local differential privacy

    Authors: Leighton Pate Barnes, Wei-Ning Chen, Ayfer Ozgur

    Abstract: We develop data processing inequalities that describe how Fisher information from statistical samples can scale with the privacy parameter $\varepsilon$ under local differential privacy constraints. These bounds are valid under general conditions on the distribution of the score of the statistical model, and they elucidate under which conditions the dependence on $\varepsilon$ is linear, quadratic…

    Submitted 21 May, 2020; originally announced May 2020.

  44. arXiv:2005.10761  [pdf, other]

    cs.LG cs.IT math.ST stat.ML

    rTop-k: A Statistical Estimation Approach to Distributed SGD

    Authors: Leighton Pate Barnes, Huseyin A. Inan, Berivan Isik, Ayfer Ozgur

    Abstract: The large communication cost for exchanging gradients between different nodes significantly limits the scalability of distributed training for large-scale learning models. Motivated by this observation, there has been significant recent interest in techniques that reduce the communication cost of distributed Stochastic Gradient Descent (SGD), with gradient sparsification techniques such as top-k a…

    Submitted 2 December, 2020; v1 submitted 21 May, 2020; originally announced May 2020.

  45. arXiv:2004.01277  [pdf, other]

    cs.IT

    The Courtade-Kumar Most Informative Boolean Function Conjecture and a Symmetrized Li-Médard Conjecture are Equivalent

    Authors: Leighton Pate Barnes, Ayfer Özgür

    Abstract: We consider the Courtade-Kumar most informative Boolean function conjecture for balanced functions, as well as a conjecture by Li and Médard that dictatorship functions also maximize the $L^\alpha$ norm of $T_pf$ for $1\leq\alpha\leq2$ where $T_p$ is the noise operator and $f$ is a balanced Boolean function. By using a result due to Laguerre from the 1880s, we are able to bound how many times an $L^\alpha$-norm…

    Submitted 2 April, 2020; originally announced April 2020.

  46. arXiv:2002.10416  [pdf, other]

    cs.CL

    Resources for Turkish Dependency Parsing: Introducing the BOUN Treebank and the BoAT Annotation Tool

    Authors: Utku Türk, Furkan Atmaca, Şaziye Betül Özateş, Gözde Berk, Seyyit Talha Bedir, Abdullatif Köksal, Balkız Öztürk Başaran, Tunga Güngör, Arzucan Özgür

    Abstract: In this paper, we introduce the resources that we developed for Turkish dependency parsing, which include a novel manually annotated treebank (BOUN Treebank), along with the guidelines we adopted, and a new annotation tool (BoAT). The manual annotation process we employed was shaped and implemented by a team of four linguists and five Natural Language Processing (NLP) specialists. Decisions regard…

    Submitted 16 September, 2021; v1 submitted 24 February, 2020; originally announced February 2020.

    Comments: Language Resource and Evaluation

  47. A Hybrid Approach to Dependency Parsing: Combining Rules and Morphology with Deep Learning

    Authors: Şaziye Betül Özateş, Arzucan Özgür, Tunga Güngör, Balkız Öztürk

    Abstract: Fully data-driven, deep learning-based models are usually designed as language-independent and have been shown to be successful for many natural language processing tasks. However, when the studied language is low-resourced and the amount of training data is insufficient, these models can benefit from the integration of natural language grammar-based information. We propose two approaches to depen…

    Submitted 24 February, 2020; originally announced February 2020.

    Comments: 25 pages, 7 figures

    ACM Class: I.2.7

  48. arXiv:2002.06053  [pdf, other]

    q-bio.BM cs.CL cs.LG stat.ML

    Exploring Chemical Space using Natural Language Processing Methodologies for Drug Discovery

    Authors: Hakime Öztürk, Arzucan Özgür, Philippe Schwaller, Teodoro Laino, Elif Ozkirimli

    Abstract: Text-based representations of chemicals and proteins can be thought of as unstructured languages codified by humans to describe domain-specific knowledge. Advances in natural language processing (NLP) methodologies in the processing of spoken languages accelerated the application of NLP to elucidate hidden knowledge in textual representations of these biochemical entities and then use it to constr…

    Submitted 10 February, 2020; originally announced February 2020.

  49. arXiv:1912.04977  [pdf, other]

    cs.LG cs.CR stat.ML

    Advances and Open Problems in Federated Learning

    Authors: Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D'Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, et al. (34 additional authors not shown)

    Abstract: Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs re…

    Submitted 8 March, 2021; v1 submitted 10 December, 2019; originally announced December 2019.

    Comments: Published in Foundations and Trends in Machine Learning Vol 4 Issue 1. See: https://www.nowpublishers.com/article/Details/MAL-083

  50. arXiv:1910.01625  [pdf, ps, other]

    cs.IT math.ST

    Minimax Bounds for Distributed Logistic Regression

    Authors: Leighton Pate Barnes, Ayfer Ozgur

    Abstract: We consider a distributed logistic regression problem where labeled data pairs $(X_i,Y_i)\in \mathbb{R}^d\times\{-1,1\}$ for $i=1,\ldots,n$ are distributed across multiple machines in a network and must be communicated to a centralized estimator using at most $k$ bits per labeled pair. We assume that the data $X_i$ come independently from some distribution $P_X$, and that the distribution of…

    Submitted 3 October, 2019; originally announced October 2019.