Showing 1–19 of 19 results for author: Preoţiuc-Pietro, D

  1. arXiv:2505.23804  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Calibrating LLMs for Text-to-SQL Parsing by Leveraging Sub-clause Frequencies

    Authors: Terrance Liu, Shuyi Wang, Daniel Preotiuc-Pietro, Yash Chandarana, Chirag Gupta

    Abstract: While large language models (LLMs) achieve strong performance on text-to-SQL parsing, they sometimes exhibit unexpected failures in which they are confidently incorrect. Building trustworthy text-to-SQL systems thus requires eliciting reliable uncertainty measures from the LLM. In this paper, we study the problem of providing a calibrated confidence score that conveys the likelihood of an output q…

    Submitted 26 May, 2025; originally announced May 2025.

  2. arXiv:2505.15070  [pdf, ps, other]

    cs.IR cs.CL

    An Alternative to FLOPS Regularization to Effectively Productionize SPLADE-Doc

    Authors: Aldo Porco, Dhruv Mehra, Igor Malioutov, Karthik Radhakrishnan, Moniba Keymanesh, Daniel Preoţiuc-Pietro, Sean MacAvaney, Pengxiang Cheng

    Abstract: Learned Sparse Retrieval (LSR) models encode text as weighted term vectors, which need to be sparse to leverage inverted index structures during retrieval. SPLADE, the most popular LSR model, uses FLOPS regularization to encourage vector sparsity during training. However, FLOPS regularization does not ensure sparsity among terms - only within a given query or document. Terms with very high Documen…

    Submitted 20 May, 2025; originally announced May 2025.

    Comments: Accepted as a short paper at SIGIR 2025

  3. arXiv:2504.11626  [pdf, other]

    cs.CL cs.AI

    Improving Instruct Models for Free: A Study on Partial Adaptation

    Authors: Ozan İrsoy, Pengxiang Cheng, Jennifer L. Chen, Daniel Preoţiuc-Pietro, Shiyue Zhang, Duccio Pappadopulo

    Abstract: Instruct models, obtained from various instruction tuning or post-training steps, are commonly deemed superior and more usable than their base counterpart. While the model gains instruction following ability, instruction tuning may lead to forgetting the knowledge from pre-training or it may encourage the model to be overly conversational or verbose. This, in turn, can lead to degradation of in-co…

    Submitted 15 April, 2025; originally announced April 2025.

    Comments: Author ordering chosen at random

  4. arXiv:2403.16668  [pdf, other]

    cs.CL cs.SI

    Who is bragging more online? A large scale analysis of bragging in social media

    Authors: Mali Jin, Daniel Preoţiuc-Pietro, A. Seza Doğruöz, Nikolaos Aletras

    Abstract: Bragging is the act of uttering statements that are likely to be positively viewed by others and it is extensively employed in human communication with the aim to build a positive self-image of oneself. Social media is a natural platform for users to employ bragging in order to gain admiration, respect, attention and followers from their audiences. Yet, little is known about the scale of bragging…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: Accepted at LREC-COLING 2024

  5. arXiv:2309.07990  [pdf, other]

    cs.CL

    Leveraging Contextual Information for Effective Entity Salience Detection

    Authors: Rajarshi Bhowmik, Marco Ponza, Atharva Tendle, Anant Gupta, Rebecca Jiang, Xingyu Lu, Qian Zhao, Daniel Preotiuc-Pietro

    Abstract: In text documents such as news articles, the content and key events usually revolve around a subset of all the entities mentioned in a document. These entities, often deemed as salient entities, provide useful cues of the aboutness of a document to a reader. Identifying the salience of entities was found helpful in several downstream applications such as search, ranking, and entity-centric summari…

    Submitted 2 April, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

  6. arXiv:2309.07794  [pdf, other]

    cs.CL cs.LG cs.SI

    Improving Multimodal Classification of Social Media Posts by Leveraging Image-Text Auxiliary Tasks

    Authors: Danae Sánchez Villegas, Daniel Preoţiuc-Pietro, Nikolaos Aletras

    Abstract: Effectively leveraging multimodal information from social media posts is essential to various downstream tasks such as sentiment analysis, sarcasm detection or hate speech classification. Jointly modeling text and images is challenging because cross-modal semantics might be hidden or the relation between image and text is weak. However, prior work on multimodal classification of social media posts…

    Submitted 3 February, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: Accepted at EACL 2024 Findings

  7. arXiv:2309.06991  [pdf, other]

    cs.LG cs.CL stat.ML

    Unsupervised Contrast-Consistent Ranking with Language Models

    Authors: Niklas Stoehr, Pengxiang Cheng, Jing Wang, Daniel Preotiuc-Pietro, Rajarshi Bhowmik

    Abstract: Language models contain ranking-based knowledge and are powerful solvers of in-context ranking tasks. For instance, they may have parametric knowledge about the ordering of countries by size or may be able to rank product reviews by sentiment. We compare pairwise, pointwise and listwise prompting techniques to elicit a language model's ranking knowledge. However, we find that even with careful cal…

    Submitted 3 February, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

    Comments: Long Paper at EACL 2024

  8. arXiv:2305.16252  [pdf, other]

    cs.CL

    Overcoming Catastrophic Forgetting in Massively Multilingual Continual Learning

    Authors: Genta Indra Winata, Lingjue Xie, Karthik Radhakrishnan, Shijie Wu, Xisen Jin, Pengxiang Cheng, Mayank Kulkarni, Daniel Preotiuc-Pietro

    Abstract: Real-life multilingual systems should be able to efficiently incorporate new languages as data distributions fed to the system evolve and shift over time. To do this, systems need to handle the issue of catastrophic forgetting, where the model performance drops for languages or tasks seen further in its past. In this paper, we study catastrophic forgetting, as well as methods to minimize this, in…

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: ACL 2023 Findings

  9. arXiv:2212.09849  [pdf, other]

    cs.CL cs.LG

    Dataless Knowledge Fusion by Merging Weights of Language Models

    Authors: Xisen Jin, Xiang Ren, Daniel Preotiuc-Pietro, Pengxiang Cheng

    Abstract: Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. Oftentimes fine-tuned models are readily available but their training data is not, due to data privacy or intellectual property concerns. This creates a barrier to fusing knowledge across individual models to yield a better single model. In this paper, we study the problem of merging indiv…

    Submitted 21 May, 2025; v1 submitted 19 December, 2022; originally announced December 2022.

    Comments: ICLR 2023; The code is available at https://github.com/bloomberg/dataless-model-merging and https://github.com/AuCson/RegMean-LLama3-8B. Fixed typos

  10. arXiv:2205.03313  [pdf, other]

    cs.CL

    Combining Humor and Sarcasm for Improving Political Parody Detection

    Authors: Xiao Ao, Danae Sánchez Villegas, Daniel Preoţiuc-Pietro, Nikolaos Aletras

    Abstract: Parody is a figurative device used for mimicking entities for comedic or critical purposes. Parody is intentionally humorous and often involves sarcasm. This paper explores jointly modelling these figurative tropes with the goal of improving performance of political parody detection in tweets. To this end, we present a multi-encoder model that combines three parallel encoders to enrich parody-spec…

    Submitted 6 May, 2022; originally announced May 2022.

    Comments: Accepted at NAACL 2022

  11. arXiv:2204.02213  [pdf, other]

    cs.CL

    EntSUM: A Data Set for Entity-Centric Summarization

    Authors: Mounica Maddela, Mayank Kulkarni, Daniel Preotiuc-Pietro

    Abstract: Controllable summarization aims to provide summaries that take into account user-specified aspects and preferences to better assist them with their information need, as opposed to the standard summarization setup which builds a single generic summary of a document. We introduce a human-annotated data set EntSUM for controllable summarization with a focus on named entities as the aspects to control.…

    Submitted 5 April, 2022; originally announced April 2022.

    Comments: Accepted at ACL 2022

  12. arXiv:2203.05840  [pdf, other]

    cs.CL

    Automatic Identification and Classification of Bragging in Social Media

    Authors: Mali Jin, Daniel Preoţiuc-Pietro, A. Seza Doğruöz, Nikolaos Aletras

    Abstract: Bragging is a speech act employed with the goal of constructing a favorable self-image through positive statements about oneself. It is widespread in daily communication and especially popular in social media, where users aim to build a positive image of their persona directly or indirectly. In this paper, we present the first large scale study of bragging in computational linguistics, building on…

    Submitted 11 March, 2022; originally announced March 2022.

    Comments: Accepted at ACL 2022

  13. arXiv:2009.14734  [pdf, other]

    cs.CL cs.SI

    Point-of-Interest Type Inference from Social Media Text

    Authors: Danae Sánchez Villegas, Daniel Preoţiuc-Pietro, Nikolaos Aletras

    Abstract: Physical places help shape how we perceive the experiences we have there. For the first time, we study the relationship between social media text and the type of the place from where it was posted, whether a park, restaurant, or someplace else. To facilitate this, we introduce a novel data set of $\sim$200,000 English tweets published from 2,761 different points-of-interest in the U.S., enriched w…

    Submitted 2 October, 2020; v1 submitted 30 September, 2020; originally announced September 2020.

    Comments: Accepted at AACL-IJCNLP 2020

  14. arXiv:2004.13878  [pdf, other]

    cs.CL

    Analyzing Political Parody in Social Media

    Authors: Antonis Maronikolakis, Danae Sanchez Villegas, Daniel Preotiuc-Pietro, Nikolaos Aletras

    Abstract: Parody is a figurative device used to imitate an entity for comedic or critical purposes and represents a widespread phenomenon in social media through many popular parody accounts. In this paper, we present the first computational study of parody. We introduce a new publicly available data set of tweets from real politicians and their corresponding parody accounts. We run a battery of supervised…

    Submitted 1 May, 2020; v1 submitted 28 April, 2020; originally announced April 2020.

  15. arXiv:1906.03890  [pdf, ps, other]

    cs.CL cs.SI

    Automatically Identifying Complaints in Social Media

    Authors: Daniel Preotiuc-Pietro, Mihaela Gaman, Nikolaos Aletras

    Abstract: Complaining is a basic speech act regularly used in human and computer mediated communication to express a negative mismatch between reality and expectations in a particular situation. Automatically identifying complaints in social media is of utmost importance for organizations or brands to improve the customer experience or in developing dialogue systems for handling and responding to complaints…

    Submitted 10 June, 2019; originally announced June 2019.

    Comments: Accepted at ACL 2019

  16. arXiv:1906.00790  [pdf, other]

    cs.CL

    Multi-task Pairwise Neural Ranking for Hashtag Segmentation

    Authors: Mounica Maddela, Wei Xu, Daniel Preoţiuc-Pietro

    Abstract: Hashtags are often employed on social media and beyond to add metadata to a textual utterance with the goal of increasing discoverability, aiding search, or providing additional semantics. However, the semantic content of hashtags is not straightforward to infer as these represent ad-hoc conventions which frequently include multiple words joined together and can include abbreviations and unorthodo…

    Submitted 13 June, 2019; v1 submitted 3 June, 2019; originally announced June 2019.

    Comments: 12 pages, ACL 2019

  17. arXiv:1904.02670  [pdf, other]

    cs.HC cs.SI

    What Twitter Profile and Posted Images Reveal About Depression and Anxiety

    Authors: Sharath Chandra Guntuku, Daniel Preotiuc-Pietro, Johannes C. Eichstaedt, Lyle H. Ungar

    Abstract: Previous work has found strong links between the choice of social media images and users' emotions, demographics and personality traits. In this study, we examine which attributes of profile and posted images are associated with depression and anxiety of Twitter users. We used a sample of 28,749 Facebook users to build a language prediction model of survey-reported depression and anxiety, and vali…

    Submitted 4 April, 2019; originally announced April 2019.

    Comments: ICWSM 2019

  18. arXiv:1808.09600  [pdf, ps, other]

    cs.SI cs.CY

    The Remarkable Benefit of User-Level Aggregation for Lexical-based Population-Level Predictions

    Authors: Salvatore Giorgi, Daniel Preotiuc-Pietro, Anneke Buffone, Daniel Rieman, Lyle H. Ungar, H. Andrew Schwartz

    Abstract: Nowcasting based on social media text promises to provide unobtrusive and near real-time predictions of community-level outcomes. These outcomes are typically regarding people, but the data is often aggregated without regard to users in the Twitter populations of each community. This paper describes a simple yet effective method for building community-level models using Twitter language aggregated…

    Submitted 28 August, 2018; originally announced August 2018.

    Comments: To appear in the proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)

  19. arXiv:1606.03561  [pdf, other]

    cs.IR cs.SI

    Sub-Story Detection in Twitter with Hierarchical Dirichlet Processes

    Authors: P. K. Srijith, Mark Hepple, Kalina Bontcheva, Daniel Preotiuc-Pietro

    Abstract: Social media has now become the de facto information source on real world events. The challenge, however, due to the high volume and velocity nature of social media streams, is in how to follow all posts pertaining to a given event over time, a task referred to as story detection. Moreover, there are often several different stories pertaining to a given event, which we refer to as sub-stories and…

    Submitted 11 June, 2016; originally announced June 2016.