Search | arXiv e-print repository

A Survey on Multilingual Mental Disorders Detection from Social Media Data

Authors: Ana-Maria Bucur, Marcos Zampieri, Tharindu Ranasinghe, Fabio Crestani

Abstract: The increasing prevalence of mental health disorders globally highlights the urgent need for effective digital screening methods that can be used in multilingual contexts. Most existing studies, however, focus on English data, overlooking critical mental health signals that may be present in non-English texts. To address this important gap, we present the first survey on the detection of mental health disorders using multilingual social media data. We investigate the cultural nuances that influence online language patterns and self-disclosure behaviors, and how these factors can impact the performance of NLP tools. Additionally, we provide a comprehensive list of multilingual data collections that can be used for developing NLP models for mental health screening. Our findings can inform the design of effective multilingual mental health screening tools that can meet the needs of diverse populations, ultimately improving mental health outcomes on a global scale.

Submitted 21 May, 2025; originally announced May 2025.

arXiv:2503.21513 [pdf, other]

Datasets for Depression Modeling in Social Media: An Overview

Authors: Ana-Maria Bucur, Andreea-Codrina Moldovan, Krutika Parvatikar, Marcos Zampieri, Ashiqur R. KhudaBukhsh, Liviu P. Dinu

Abstract: Depression is the most common mental health disorder, and its prevalence increased during the COVID-19 pandemic. As one of the most extensively researched psychological conditions, recent research has increasingly focused on leveraging social media data to enhance traditional methods of depression screening. This paper addresses the growing interest in interdisciplinary research on depression, and aims to support early-career researchers by providing a comprehensive and up-to-date list of datasets for analyzing and predicting depression through social media data. We present an overview of datasets published between 2019 and 2024. We also make the comprehensive list of datasets available online as a continuously updated resource, with the hope that it will facilitate further interdisciplinary research into the linguistic expressions of depression on social media.

Submitted 27 March, 2025; originally announced March 2025.

Comments: Accepted to CLPsych Workshop, NAACL 2025

arXiv:2502.11926 [pdf, ps, other]

BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages

Authors: Shamsuddeen Hassan Muhammad, Nedjma Ousidhoum, Idris Abdulmumin, Jan Philip Wahle, Terry Ruas, Meriem Beloucif, Christine de Kock, Nirmal Surange, Daniela Teodorescu, Ibrahim Said Ahmad, David Ifeoluwa Adelani, Alham Fikri Aji, Felermino D. M. A. Ali, Ilseyar Alimova, Vladimir Araujo, Nikolay Babakov, Naomi Baes, Ana-Maria Bucur, Andiswa Bukula, Guanqun Cao, Rodrigo Tufino Cardenas, Rendi Chevi, Chiamaka Ijeoma Chukwuneke, Alexandra Ciobotaru, Daryna Dementieva, et al. (23 additional authors not shown)

Abstract: People worldwide use language in subtle and complex ways to express emotions. Although emotion recognition--an umbrella term for several NLP tasks--impacts various applications within NLP and beyond, most work in this area has focused on high-resource languages. This has led to significant disparities in research efforts and proposed solutions, particularly for under-resourced languages, which often lack high-quality annotated datasets. In this paper, we present BRIGHTER--a collection of multi-labeled, emotion-annotated datasets in 28 different languages and across several domains. BRIGHTER primarily covers low-resource languages from Africa, Asia, Eastern Europe, and Latin America, with instances labeled by fluent speakers. We highlight the challenges related to the data collection and annotation processes, and then report experimental results for monolingual and crosslingual multi-label emotion identification, as well as emotion intensity recognition. We analyse the variability in performance across languages and text domains, both with and without the use of LLMs, and show that the BRIGHTER datasets represent a meaningful step towards addressing the gap in text-based emotion recognition.

Submitted 29 May, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

Comments: Accepted at ACL2025 (Main)

arXiv:2501.08981 [pdf]

doi 10.1080/1331677X.2015.1106330

Discretionary vs nondiscretionary in fiscal mechanism. Non-automatic fiscal stabilisers vs automatic fiscal stabilisers

Authors: Vasile Bratian, Amelia Bucur, Camelia Oprean, Cristina Tanasescu

Abstract: The goal of the present study is to increase the intelligibility of macroeconomic phenomena triggered by governmental intervention in the economy by means of fiscal policies. During cyclical movements, fiscal policy can play an important role in helping to stabilise the economy. But discretionary policy usually implies implementation lags and is not automatically reversed when economic conditions change. In contrast, automatic fiscal stabilisers (SFA) ensure a prompter and self-correcting fiscal response. The present study aims to tackle the topic of the discretionary vs nondiscretionary characteristic of fiscal stabilisers (SF). In this context, the scope of the research undertaking is to launch a scientific debate over the definitions of the concepts of non-automatic fiscal stabilisers (SfnA) and SFAs. We describe how we can quantify the discretionary and non-discretionary character of fiscal policy by analysing the structure of the conventional budget balance (SBc), the budget balance associated with the current GDP. In the final part of this article, we propose a quantitative equilibrium model for establishing the mathematical prerequisites for an SF to become automatic. Likewise, on the basis of the proposed mathematical model, we have performed a qualitative analysis of the influence factors.

Submitted 15 January, 2025; originally announced January 2025.

Journal ref: Economic Research, 29, 1, 2016, 1-17

arXiv:2501.07881 [pdf]

An Approach on the Modelling of Long Economic Cycles in the Context of Sustainable Development

Authors: Cristina Tanasescu, Amelia Bucur, Camelia Oprean-Stan

Abstract: Economic cycles are a theme that has been addressed more and more in the specialised literature. Their analysis is very useful for long-term predictions, for finding solutions for economic growth, and for detecting economic crises. At the same time, many scientific and research papers underline the importance of sustainable development in present and future society. In this paper we intend to contribute to the study of the cycles of a sustainable economy, and we analyse them with the purpose of creating a sustainable economy in mind. We demonstrate that the curves that represent these cycles graphically are no longer simple logistic, bi-logistic or multi-logistic curves, but plane curves obtained by composing logistic functions with the function of sustainable development or with the function that mathematically models its economic component. We present an interpretation of mathematical models within the framework of sustainable development.

Submitted 14 January, 2025; originally announced January 2025.

Journal ref: Revista Economica,68,4,2016

arXiv:2410.08793 [pdf, ps, other]

On the State of NLP Approaches to Modeling Depression in Social Media: A Post-COVID-19 Outlook

Authors: Ana-Maria Bucur, Andreea-Codrina Moldovan, Krutika Parvatikar, Marcos Zampieri, Ashiqur R. KhudaBukhsh, Liviu P. Dinu

Abstract: Computational approaches to predicting mental health conditions in social media have been substantially explored in the past years. Multiple reviews have been published on this topic, providing the community with comprehensive accounts of the research in this area. Among all mental health conditions, depression is the most widely studied due to its worldwide prevalence. The COVID-19 global pandemic, starting in early 2020, has had a great impact on mental health worldwide. Harsh measures employed by governments to slow the spread of the virus (e.g., lockdowns) and the subsequent economic downturn experienced in many countries have significantly impacted people's lives and mental health. Studies have shown a substantial increase of above 50% in the rate of depression in the population. In this context, we present a review on natural language processing (NLP) approaches to modeling depression in social media, providing the reader with a post-COVID-19 outlook. This review contributes to the understanding of the impacts of the pandemic on modeling depression in social media. We outline how state-of-the-art approaches and new datasets have been used in the context of the COVID-19 pandemic. Finally, we also discuss ethical issues in collecting and processing mental health data, considering fairness, accountability, and ethics.

Submitted 7 March, 2025; v1 submitted 11 October, 2024; originally announced October 2024.

arXiv:2409.11074 [pdf, other]

RoMath: A Mathematical Reasoning Benchmark in Romanian

Authors: Adrian Cosma, Ana-Maria Bucur, Emilian Radoi

Abstract: Mathematics has long been conveyed through natural language, primarily for human understanding. With the rise of mechanized mathematics and proof assistants, there is a growing need to understand informal mathematical text, yet most existing benchmarks focus solely on English, overlooking other languages. This paper introduces RoMath, a Romanian mathematical reasoning benchmark suite comprising three subsets: Baccalaureate, Competitions and Synthetic, which cover a range of mathematical domains and difficulty levels, aiming to improve non-English language models and promote multilingual AI development. By focusing on Romanian, a low-resource language with unique linguistic features, RoMath addresses the limitations of Anglo-centric models and emphasizes the need for dedicated resources beyond simple automatic translation. We benchmark several open-weight language models, highlighting the importance of creating resources for underrepresented languages. Code and datasets are made available.

Submitted 20 May, 2025; v1 submitted 17 September, 2024; originally announced September 2024.

Comments: 5 Figures, 11 Tables

arXiv:2401.02746 [pdf, other]

Reading Between the Frames: Multi-Modal Depression Detection in Videos from Non-Verbal Cues

Authors: David Gimeno-Gómez, Ana-Maria Bucur, Adrian Cosma, Carlos-David Martínez-Hinarejos, Paolo Rosso

Abstract: Depression, a prominent contributor to global disability, affects a substantial portion of the population. Efforts to detect depression from social media texts have been prevalent, yet only a few works explored depression detection from user-generated video content. In this work, we address this research gap by proposing a simple and flexible multi-modal temporal model capable of discerning non-verbal depression cues from diverse modalities in noisy, real-world videos. We show that, for in-the-wild videos, using additional high-level non-verbal cues is crucial to achieving good performance, and we extracted and processed audio speech embeddings, face emotion embeddings, face, body and hand landmarks, and gaze and blinking information. Through extensive experiments, we show that our model achieves state-of-the-art results on three key benchmark datasets for depression detection from video by a substantial margin. Our code is publicly available on GitHub.

Submitted 5 January, 2024; originally announced January 2024.

Comments: Accepted at 46th European Conference on Information Retrieval (ECIR 2024)

arXiv:2310.10568 [pdf, ps, other]

Frobenius sign separation for abelian varieties

Authors: Alina Bucur, Francesc Fité, Kiran S. Kedlaya

Abstract: Let A and A' be nonzero abelian varieties defined over a number field k such that Hom(A,A')=0. Under the Generalized Riemann hypothesis for motivic L-functions attached to A and A', we show that there exists a prime p of k of good reduction for A and A' at which the Frobenius traces of A and A' are nonzero and differ by sign, and such that the norm of p is O_{k,g,g'}(log(2NN')^2), where N and N' respectively denote the absolute conductors of A and A'. We also make the dependence of the big-O constant on k and the dimensions g,g' of A,A' explicit up to an effectively computable absolute constant. Our method extends that of Chen, Park, and Swaminathan who considered the case in which A and A' are elliptic curves.

Submitted 3 April, 2025; v1 submitted 16 October, 2023; originally announced October 2023.

Comments: 10 pages. Accepted in Proc. Amer. Math. Soc. Several changes were made in order to turn the implied constant in the O-notation into a universal constant independent of the field of definition and the dimensions. Includes material formerly appearing in arXiv:2002.08807

MSC Class: 11G10; 11G05; 11R44; 11M41

arXiv:2307.16045 [pdf, other]

Automatic Extraction of the Romanian Academic Word List: Data and Methods

Authors: Ana-Maria Bucur, Andreea Dincă, Mădălina Chitez, Roxana Rogobete

Abstract: This paper presents the methodology and data used for the automatic extraction of the Romanian Academic Word List (Ro-AWL). Academic Word Lists are useful in both L2 and L1 teaching contexts. For the Romanian language, no such resource exists so far. Ro-AWL has been generated by combining methods from corpus and computational linguistics with L2 academic writing approaches. We use two types of data: (a) existing data, such as the Romanian Frequency List based on the ROMBAC corpus, and (b) self-compiled data, such as the expert academic writing corpus EXPRES. For constructing the academic word list, we follow the methodology for building the Academic Vocabulary List for the English language. The distribution of Ro-AWL features (general distribution, POS distribution) into four disciplinary datasets is in line with previous research. Ro-AWL is freely available and can be used for teaching, research and NLP applications.

Submitted 29 July, 2023; originally announced July 2023.

arXiv:2307.02313 [pdf, other]

Utilizing ChatGPT Generated Data to Retrieve Depression Symptoms from Social Media

Authors: Ana-Maria Bucur

Abstract: In this work, we present the contribution of the BLUE team in the eRisk Lab task on searching for symptoms of depression. The task consists of retrieving and ranking Reddit social media sentences that convey symptoms of depression from the BDI-II questionnaire. Given that synthetic data provided by LLMs have been proven to be a reliable method for augmenting data and fine-tuning downstream models, we chose to generate synthetic data using ChatGPT for each of the symptoms of the BDI-II questionnaire. We designed a prompt such that the generated data contains more richness and semantic diversity than the BDI-II responses for each question and, at the same time, contains emotional and anecdotal experiences that are specific to the more intimate way of sharing experiences on Reddit. We perform semantic search and rank the sentences' relevance to the BDI-II symptoms by cosine similarity. We used two state-of-the-art transformer-based models (MentalRoBERTa and a variant of MPNet) for embedding the social media posts and the original and generated responses of the BDI-II. Our results show that using sentence embeddings from a model designed for semantic search outperforms the approach using embeddings from a model pre-trained on mental health data. Furthermore, the generated synthetic data proved too specific for this task; the approach relying simply on the BDI-II responses had the best performance.

Submitted 6 July, 2023; v1 submitted 5 July, 2023; originally announced July 2023.
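
Illustrative note on the entry above: the abstract describes ranking sentences by the cosine similarity between their embeddings and the embeddings of BDI-II responses. The following is a minimal sketch of that ranking step only, not the authors' code. The three-dimensional vectors, the sentence labels, and the function names `cosine_similarity` and `rank_by_similarity` are hypothetical; in practice the vectors would come from a sentence-embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, candidates):
    """Sort (text, vector) pairs by similarity to query_vec, highest first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for embeddings (assumed values, for illustration only).
symptom_vec = [1.0, 0.0, 1.0]
candidates = [
    ("sentence A", [0.9, 0.1, 1.1]),
    ("sentence B", [0.0, 1.0, 0.0]),
    ("sentence C", [1.0, 1.0, 0.0]),
]

ranking = rank_by_similarity(symptom_vec, candidates)
for text, score in ranking:
    print(f"{score:.3f}  {text}")
```

With real embeddings the same function would be applied once per symptom, keeping the top-ranked sentences for each.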

arXiv:2301.05453 [pdf, other]

It's Just a Matter of Time: Detecting Depression with Time-Enriched Multimodal Transformers

Authors: Ana-Maria Bucur, Adrian Cosma, Paolo Rosso, Liviu P. Dinu

Abstract: Depression detection from user-generated content on the internet has been a long-lasting topic of interest in the research community, providing valuable screening tools for psychologists. The ubiquitous use of social media platforms lays out the perfect avenue for exploring mental health manifestations in posts and interactions with other users. Current methods for depression detection from social media mainly focus on text processing, and only a few also utilize images posted by users. In this work, we propose a flexible time-enriched multimodal transformer architecture for detecting depression from social media posts, using pretrained models for extracting image and text embeddings. Our model operates directly at the user-level, and we enrich it with the relative time between posts by using time2vec positional embeddings. Moreover, we propose another model variant, which can operate on randomly sampled and unordered sets of posts to be more robust to dataset noise. We show that our method, using EmoBERTa and CLIP embeddings, surpasses other methods on two multimodal datasets, obtaining state-of-the-art results of 0.931 F1 score on a popular multimodal Twitter dataset, and 0.902 F1 score on the only multimodal Reddit dataset.

Submitted 6 February, 2023; v1 submitted 13 January, 2023; originally announced January 2023.
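
Illustrative note on the entry above: the abstract mentions time2vec positional embeddings computed from the relative time between posts. As a rough sketch of the general Time2Vec form (one linear component followed by sine components), and not of the paper's implementation, the function below uses fixed frequencies and phases. In a trained model these would be learned parameters; the names `time2vec`, `omegas`, and `phis` and all numeric values are assumptions made for this example.

```python
import math


def time2vec(tau, omegas, phis):
    """Map a scalar time value to a vector.

    Component 0 is linear in tau; the remaining components are sin(omega * tau + phi).
    """
    if not omegas or len(omegas) != len(phis):
        raise ValueError("omegas and phis must be non-empty and of equal length")
    linear = omegas[0] * tau + phis[0]
    periodic = [math.sin(w * tau + p) for w, p in zip(omegas[1:], phis[1:])]
    return [linear] + periodic


# Hypothetical fixed parameters; a real model would learn these.
omegas = [0.5, 1.0, 2.0]
phis = [0.0, 0.0, 0.0]

# Hypothetical relative time gaps (e.g. in days) between consecutive posts.
for gap in (0.0, 2.0, 7.5):
    print(gap, time2vec(gap, omegas, phis))
```

The linear component lets the model see elapsed time directly, while the periodic components can capture recurring patterns.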

Comments: Accepted at ECIR 2023

arXiv:2209.13579 [pdf, ps, other]

Power-saving error terms for the number of $D_4$-quartic extensions over a number field ordered by discriminant

Authors: Alina Bucur, Alexandra Florea, Allechar Serrano López, Ila Varma

Abstract: We study the asymptotic count of dihedral quartic extensions over a fixed number field with bounded norm of the relative discriminant. The main term of this count (including a summation formula for the constant) can be found in the literature (see Cohen--Diaz y Diaz--Olivier for the statement without proof and see Klüners for a proof), but a power-saving for the error term has not been explicitly determined except in the case that the base field is $\mathbb{Q}$. In this article, we describe the argument for obtaining both the explicit main term and a power-saving error term for the number of $D_4$-quartic extensions over a general base number field ordered by the norms of their relative discriminants. We also give an extensive overview of the history and development of number field asymptotics.

Submitted 27 September, 2022; originally announced September 2022.

arXiv:2207.00753 [pdf, other]

An End-to-End Set Transformer for User-Level Classification of Depression and Gambling Disorder

Authors: Ana-Maria Bucur, Adrian Cosma, Liviu P. Dinu, Paolo Rosso

Abstract: This work proposes a transformer architecture for user-level classification of gambling addiction and depression that is trainable end-to-end. As opposed to other methods that operate at the post level, we process a set of social media posts from a particular individual, to make use of the interactions between posts and eliminate label noise at the post level. We exploit the fact that, by not injecting positional encodings, multi-head attention is permutation invariant and we process randomly sampled sets of texts from a user after being encoded with a modern pretrained sentence encoder (RoBERTa / MiniLM). Moreover, our architecture is interpretable with modern feature attribution methods and allows for automatic dataset creation by identifying discriminating posts in a user's text-set. We perform ablation studies on hyper-parameters and evaluate our method for the eRisk 2022 Lab on early detection of signs of pathological gambling and early risk detection of depression. The method proposed by our team BLUE obtained the best ERDE5 score of 0.015, and the second-best ERDE50 score of 0.009 for pathological gambling detection. For the early detection of depression, we obtained the second-best ERDE50 of 0.027.

Submitted 2 July, 2022; originally announced July 2022.
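
Illustrative note on the entry above: the abstract relies on attention without positional encodings treating its input as an unordered set. The sketch below checks that property on a toy example and is not the paper's architecture. It assumes a single attention head with identity query/key/value projections followed by mean pooling; the two-dimensional "post embeddings" and the names `self_attention`, `mean_pool`, and `encode_user` are made up for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(rows):
    """Scaled dot-product self-attention with identity projections (a simplification)."""
    d = len(rows[0])
    out = []
    for q in rows:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in rows]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, rows)) for j in range(d)])
    return out


def mean_pool(rows):
    """Average the rows into a single vector."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def encode_user(posts):
    """User-level vector: attention over the set of posts, then mean pooling."""
    return mean_pool(self_attention(posts))


# Hypothetical post embeddings for one user.
posts = [[0.2, 1.0], [1.5, -0.3], [-0.7, 0.4], [0.0, 0.9]]
reordered = posts[::-1]  # same posts, different order

original_vec = encode_user(posts)
reordered_vec = encode_user(reordered)
print(original_vec)
print(reordered_vec)  # matches the line above up to floating-point rounding
```

The per-post attention outputs are reordered along with the inputs; it is the pooling step that makes the final user-level vector independent of post order.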

arXiv:2204.13569 [pdf, other]

Life is not Always Depressing: Exploring the Happy Moments of People Diagnosed with Depression

Authors: Ana-Maria Bucur, Adrian Cosma, Liviu P. Dinu

Abstract: In this work, we explore the relationship between depression and manifestations of happiness in social media. While the majority of works surrounding depression focus on symptoms, psychological research shows that there is a strong link between seeking happiness and being diagnosed with depression. We make use of the Positive-Unlabeled learning paradigm to automatically extract happy moments from social media posts of both controls and users diagnosed with depression, and qualitatively analyze them with linguistic tools such as LIWC and keyness information. We show that the life of depressed individuals is not always bleak, with positive events related to friends and family being more noteworthy to their lives compared to the more mundane happy events reported by control users.

Submitted 8 May, 2022; v1 submitted 28 April, 2022; originally announced April 2022.

Comments: Accepted to LREC 2022

arXiv:2202.07543 [pdf, other]

BLUE at Memotion 2.0 2022: You have my Image, my Text and my Transformer

Authors: Ana-Maria Bucur, Adrian Cosma, Ioan-Bogdan Iordache

Abstract: Memes are prevalent on the internet and continue to grow and evolve alongside our culture. An automatic understanding of memes propagating on the internet can shed light on the general sentiment and cultural attitudes of people. In this work, we present team BLUE's solution for the second edition of the MEMOTION shared task. We showcase two approaches for meme classification (i.e. sentiment, humour, offensive, sarcasm and motivation levels) using a text-only method using BERT, and a Multi-Modal-Multi-Task transformer network that operates on both the meme image and its caption to output the final scores. In both approaches, we leverage state-of-the-art pretrained models for text (BERT, Sentence Transformer) and image processing (EfficientNetV4, CLIP). Through our efforts, we obtain first place in task A, second place in task B and third place in task C. In addition, our team obtained the highest average score for all three tasks.

Submitted 4 April, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

Comments: De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, co-located with AAAI 2022

arXiv:2110.02869 [pdf, other]

Sequence-to-Sequence Lexical Normalization with Multilingual Transformers

Authors: Ana-Maria Bucur, Adrian Cosma, Liviu P. Dinu

Abstract: Current benchmark tasks for natural language processing contain text that is qualitatively different from the text used in informal day-to-day digital communication. This discrepancy has led to severe performance degradation of state-of-the-art NLP models when fine-tuned on real-world data. One way to resolve this issue is through lexical normalization, which is the process of transforming non-standard text, usually from social media, into a more standardized form. In this work, we propose a sentence-level sequence-to-sequence model based on mBART, which frames the problem as a machine translation problem. As noisy text is a pervasive problem across languages, not just English, we leverage the multi-lingual pre-training of mBART to fine-tune it to our data. While current approaches mainly operate at the word or subword level, we argue that this approach is straightforward from a technical standpoint and builds upon existing pre-trained transformer networks. Our results show that while word-level, intrinsic performance evaluation is behind other methods, our model improves performance on extrinsic, downstream tasks through normalization compared to models operating on raw, unprocessed social media text.

Submitted 12 October, 2021; v1 submitted 6 October, 2021; originally announced October 2021.

Comments: In Proceedings of the 7th Workshop on Noisy User-generated Text (WNUT 2021), EMNLP 2021

arXiv:2109.11167 [pdf, ps, other]

Geometric generalizations of the square sieve, with an application to cyclic covers

Authors: Alina Bucur, Alina Carmen Cojocaru, Matilde N. Lalín, Lillian B. Pierce

Abstract: We formulate a general problem: given projective schemes $\mathbb{Y}$ and $\mathbb{X}$ over a global field $K$ and a $K$-morphism $η$ from $\mathbb{Y}$ to $\mathbb{X}$ of finite degree, how many points in $\mathbb{X}(K)$ of height at most $B$ have a pre-image under $η$ in $\mathbb{Y}(K)$? This problem is inspired by a well-known conjecture of Serre on quantitative upper bounds for the number of points of bounded height on an irreducible projective variety defined over a number field. We give a non-trivial answer to the general problem when $K=\mathbb{F}_q(T)$ and $\mathbb{Y}$ is a prime degree cyclic cover of $\mathbb{X}=\mathbb{P}_{K}^n$. Our tool is a new geometric sieve, which generalizes the polynomial sieve to a geometric setting over global function fields.

Submitted 22 August, 2022; v1 submitted 23 September, 2021; originally announced September 2021.

Comments: Appendix by Joseph Rabinoff, 40 pages

arXiv:2108.00279 [pdf, other]

A Psychologically Informed Part-of-Speech Analysis of Depression in Social Media

Authors: Ana-Maria Bucur, Ioana R. Podină, Liviu P. Dinu

Abstract: In this work, we provide an extensive part-of-speech analysis of the discourse of social media users with depression. Research in psychology revealed that depressed users tend to be self-focused, more preoccupied with themselves and ruminate more about their lives and emotions. Our work aims to make use of large-scale datasets and computational methods for a quantitative exploration of discourse. We use the publicly available depression dataset from the Early Risk Prediction on the Internet Workshop (eRisk) 2018 and extract part-of-speech features and several indices based on them. Our results reveal statistically significant differences between the depressed and non-depressed individuals confirming findings from the existing psychology literature. Our work provides insights regarding the way in which depressed individuals are expressing themselves on social media platforms, allowing for better-informed computational models to help monitor and prevent mental illnesses.

Submitted 31 July, 2021; originally announced August 2021.

Comments: Accepted to RANLP 2021

arXiv:2106.16175 [pdf, other]

Early Risk Detection of Pathological Gambling, Self-Harm and Depression Using BERT

Authors: Ana-Maria Bucur, Adrian Cosma, Liviu P. Dinu

Abstract: Early risk detection of mental illnesses has a massive positive impact upon the well-being of people. The eRisk workshop has been at the forefront of enabling interdisciplinary research in developing computational methods to automatically estimate early risk factors for mental issues such as depression, self-harm, anorexia and pathological gambling. In this paper, we present the contributions of the BLUE team in the 2021 edition of the workshop, in which we tackle the problems of early detection of gambling addiction, self-harm and estimating depression severity from social media posts. We employ pre-trained BERT transformers and data crawled automatically from mental health subreddits and obtain reasonable results on all three tasks.

Submitted 30 June, 2021; originally announced June 2021.

Comments: Accepted to Early Risk Prediction on the Internet Workshop, Conference and Labs of the Evaluation Forum (CLEF 2021)

arXiv:2105.14888 [pdf, other]

An Exploratory Analysis of the Relation Between Offensive Language and Mental Health

Authors: Ana-Maria Bucur, Marcos Zampieri, Liviu P. Dinu

Abstract: In this paper, we analyze the interplay between the use of offensive language and mental health. We acquired publicly available datasets created for offensive language identification and depression detection and we train computational models to compare the use of offensive language in social media posts written by groups of individuals with and without self-reported depression diagnosis. We also look at samples written by groups of individuals whose posts show signs of depression according to recent related studies. Our analysis indicates that offensive language is more frequently used in the samples written by individuals with self-reported depression as well as individuals showing signs of depression. The results discussed here open new avenues in research in politeness/offensiveness and mental health.

Submitted 24 June, 2021; v1 submitted 31 May, 2021; originally announced May 2021.

Comments: Accepted to Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

arXiv:2011.01695 [pdf, ps, other]

Detecting Early Onset of Depression from Social Media Text using Learned Confidence Scores

Authors: Ana-Maria Bucur, Liviu P. Dinu

Abstract: Computational research on mental health disorders from written texts covers an interdisciplinary area between natural language processing and psychology. A crucial aspect of this problem is prevention and early diagnosis, as suicide resulting from depression is the second leading cause of death for young adults. In this work, we focus on methods for detecting the early onset of depression from social media texts, in particular from Reddit. To that end, we explore the eRisk 2018 dataset and achieve good results with regard to the state of the art by leveraging topic analysis and learned confidence scores to guide the decision process.

Submitted 3 November, 2020; originally announced November 2020.

Comments: Accepted at Seventh Italian Conference on Computational Linguistics CLiC-it 2020

arXiv:2002.08807 [pdf, ps, other]

Effective Sato-Tate conjecture for abelian varieties and applications

Authors: Alina Bucur, Francesc Fité, Kiran S. Kedlaya

Abstract: From the generalized Riemann hypothesis for motivic L-functions, we derive an effective version of the Sato-Tate conjecture for an abelian variety A defined over a number field k with connected Sato-Tate group. By effective we mean that we give an upper bound on the error term in the count predicted by the Sato-Tate measure that only depends on certain invariants of A. We discuss three applications of this conditional result. First, for an abelian variety defined over k, we consider a variant of Linnik's problem for abelian varieties that asks for an upper bound on the least norm of a prime whose normalized Frobenius trace lies in a given interval. Second, for an elliptic curve defined over k with complex multiplication, we determine (up to multiplication by a nonzero constant) the asymptotic number of primes whose Frobenius trace attains the integral part of the Hasse-Weil bound. Third, for a pair of abelian varieties defined over k with no common factors up to k-isogeny, we find an upper bound on the least norm of a prime at which the respective Frobenius traces have opposite sign.

Submitted 13 October, 2023; v1 submitted 20 February, 2020; originally announced February 2020.

Comments: 25 pages; refereed version. A table of notation has been added at the end. §5 has been removed; its content will be incorporated into a subsequent paper

MSC Class: 11G10; 11G05; 11R44

arXiv:1703.04336 [pdf, other]

A Visual Representation of Wittgenstein's Tractatus Logico-Philosophicus

Authors: Anca Bucur, Sergiu Nisioi

Abstract: In this paper we present a data visualization method together with its potential usefulness in digital humanities and philosophy of language. We compile a multilingual parallel corpus from different versions of Wittgenstein's Tractatus Logico-Philosophicus, including the original in German and translations into English, Spanish, French, and Russian. Using this corpus, we compute a similarity measure between propositions and render a visual network of relations for different languages.

Submitted 13 March, 2017; originally announced March 2017.

Comments: Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)

arXiv:1610.00164 [pdf, ps, other]

Traces, high powers and one level density for families of curves over finite fields

Authors: Alina Bucur, Edgar Costa, Chantal David, João Guerreiro, David Lowry-Duda

Abstract: The zeta function of a curve $C$ over a finite field may be expressed in terms of the characteristic polynomial of a unitary matrix $Θ_C$. We develop and present a new technique to compute the expected value of $\mathrm{Tr}(Θ_C^n)$ for various moduli spaces of curves of genus $g$ over a fixed finite field in the limit as $g$ is large, generalizing and extending the work of Rudnick and Chinis. This is achieved by using function field zeta functions, explicit formulae, and the densities of prime polynomials with prescribed ramification types at certain places as given by Bucur, David, Feigon, Kaplan, Lalín and Wood [BDF$^+$16] and by Zhao. We extend [BDF$^+$16] by describing explicit dependence on the place and give an explicit proof of the Lindelöf bound for function field Dirichlet $L$-functions $L(1/2 + it, χ)$. As applications, we compute the one-level density for hyperelliptic curves, cyclic $\ell$-covers, and cubic non-Galois covers.

Submitted 1 October, 2016; originally announced October 2016.

Comments: 24 pages

arXiv:1505.07136 [pdf, ps, other]

The distribution of $\mathbb{F}_q$-points on cyclic $\ell$-covers of genus $g$

Authors: Alina Bucur, Chantal David, Brooke Feigon, Nathan Kaplan, Matilde Lalín, Ekin Ozman, Melanie Matchett Wood

Abstract: We study fluctuations in the number of points of $\ell$-cyclic covers of the projective line over the finite field $\mathbb{F}_q$ when $q \equiv 1 \mod \ell$ is fixed and the genus tends to infinity. The distribution is given as a sum of $q+1$ i.i.d. random variables. This was settled for hyperelliptic curves by Kurlberg and Rudnick, while statistics were obtained for certain components of the moduli space of $\ell$-cyclic covers by Bucur, David, Feigon and Lalín. In this paper, we obtain statistics for the distribution of the number of points as the covers vary over the full moduli space of $\ell$-cyclic covers of genus $g$. This is achieved by relating $\ell$-covers to cyclic function field extensions, and counting such extensions with prescribed ramification and splitting conditions at a finite number of primes.

Submitted 26 May, 2015; originally announced May 2015.

arXiv:1503.03276 [pdf, ps, other]

Statistics for biquadratic covers of the projective line over finite fields

Authors: Elisa Lorenzo, Giulio Meleleo, Piermarco Milione, Alina Bucur

Abstract: We study the distribution of the traces of the Frobenius endomorphism of genus $g$ curves which are quartic non-cyclic covers of $\mathbb{P}^{1}_{\mathbb{F}_{q}}$, as the curve varies in an irreducible component of the moduli space. We show that for $q$ fixed, the limiting distribution of the trace of Frobenius equals the sum of $q + 1$ independent random discrete variables. We also show that when both $g$ and $q$ go to infinity, the normalized trace has a standard complex Gaussian distribution. Finally, we extend these computations to the general case of arbitrary covers of $\mathbb{P}^{1}_{\mathbb{F}_{q}}$ with Galois group isomorphic to $r$ copies of $\mathbb{Z}/2\mathbb{Z}$. For $r = 1$, we recover the already known hyperelliptic case. We also include an appendix by Alina Bucur giving the heuristic of these distributions.

Submitted 20 October, 2015; v1 submitted 11 March, 2015; originally announced March 2015.

arXiv:1304.7876 [pdf, ps, other]

Statistics for ordinary Artin-Schreier covers and other $p$-rank strata

Authors: Alina Bucur, Chantal David, Brooke Feigon, Matilde Lalin

Abstract: We study the distribution of the number of points and of the zeroes of the zeta function in different $p$-rank strata of Artin-Schreier covers over $\mathbb{F}_q$ when $q$ is fixed and the genus goes to infinity. The $p$-rank strata considered include the ordinary family, the whole family, and the family of curves with $p$-rank equal to $p-1$. While the zeta zeroes always approach the standard Gaussian distribution, the number of points over $\mathbb{F}_q$ has a distribution that varies with the specific family.

Submitted 30 April, 2013; originally announced April 2013.

Comments: 34 pages

MSC Class: 11G20 (Primary); 11M50; 14G15 (Secondary)

arXiv:1301.0139 [pdf, ps, other]

An application of the effective Sato-Tate conjecture

Authors: Alina Bucur, Kiran S. Kedlaya

Abstract: Based on the Lagarias-Odlyzko effectivization of the Chebotarev density theorem, Kumar Murty gave an effective version of the Sato-Tate conjecture for an elliptic curve conditional on analytic continuation and Riemann hypothesis for the symmetric power $L$-functions. We use Murty's analysis to give a similar conditional effectivization of the generalized Sato-Tate conjecture for an arbitrary motive. As an application, we give a conditional upper bound of the form $O((\log N)^2 (\log \log 2N)^2)$ for the smallest prime at which two given rational elliptic curves with conductor at most $N$ have Frobenius traces of opposite sign.

Submitted 7 June, 2015; v1 submitted 1 January, 2013; originally announced January 2013.

Comments: 12 pages; v2: refereed version

MSC Class: 11G05; 11R44

arXiv:1111.4701 [pdf, ps, other]

Distribution of zeta zeroes of Artin-Schreier curves

Authors: Alina Bucur, Chantal David, Brooke Feigon, Matilde Lalin, Kaneenika Sinha

Abstract: We study the distribution of the zeroes of the zeta functions of the family of Artin-Schreier covers of the projective line over $\mathbb{F}_q$ when $q$ is fixed and the genus goes to infinity. We consider both the global and the mesoscopic regimes, proving that when the genus goes to infinity, the number of zeroes with angles in a prescribed non-trivial subinterval of $[-π,π)$ has a standard Gaussian distribution (when properly normalized).

Submitted 29 December, 2012; v1 submitted 20 November, 2011; originally announced November 2011.

Comments: 22 pages

MSC Class: 11G20 (Primary) 11M50; 14G15 (Secondary)

arXiv:1003.5222 [pdf, ps, other]

The probability that a complete intersection is smooth

Authors: Alina Bucur, Kiran S. Kedlaya

Abstract: Given a smooth subscheme of a projective space over a finite field, we compute the probability that its intersection with a fixed number of hypersurface sections of large degree is smooth of the expected dimension. This generalizes the case of a single hypersurface, due to Poonen. We use this result to give a probabilistic model for the number of rational points of such a complete intersection. A somewhat surprising corollary is that the number of rational points on a random smooth intersection of two surfaces in projective 3-space is strictly less than the number of points on the projective line.

Submitted 15 October, 2012; v1 submitted 26 March, 2010; originally announced March 2010.

Comments: 14 pages; v3: final journal version

MSC Class: 14G15; 11M38

arXiv:0912.4761 [pdf, ps, other]

doi 10.1016/j.jnt.2010.05.009

The fluctuations in the number of points of smooth plane curves over finite fields

Authors: Alina Bucur, Chantal David, Brooke Feigon, Matilde Lalín

Abstract: In this note, we study the fluctuations in the number of points of smooth projective plane curves over finite fields $\mathbb{F}_q$ as $q$ is fixed and the genus varies. More precisely, we show that these fluctuations are predicted by a natural probabilistic model, in which the points of the projective plane impose independent conditions on the curve. The main tool we use is a geometric sieving process introduced by Poonen.

Submitted 23 December, 2009; originally announced December 2009.

Comments: 12 pages

MSC Class: 11G20; 11T55; 11G25

Journal ref: J. Number Theory 130 (2010), pp. 2528-2541

arXiv:0907.5434 [pdf, ps, other]

doi 10.1093/imrn/rnp162

Statistics for traces of cyclic trigonal curves over finite fields

Authors: Alina Bucur, Chantal David, Brooke Feigon, Matilde Lalín

Abstract: We study the variation of the trace of the Frobenius endomorphism associated to a cyclic trigonal curve of genus $g$ over a field of $q$ elements as the curve varies in an irreducible component of the moduli space. We show that for $q$ fixed and $g$ increasing, the limiting distribution of the trace of the Frobenius equals the sum of $q+1$ independent random variables taking the value $0$ with probability $2/(q+2)$ and $1$, $e^{2\pi i/3}$, $e^{4\pi i/3}$ each with probability $q/(3(q+2))$. This extends the work of Kurlberg and Rudnick who considered the same limit for hyperelliptic curves. We also show that when both $g$ and $q$ go to infinity, the normalized trace has a standard complex Gaussian distribution and how to generalize these results to $p$-fold covers of the projective line.

Submitted 11 September, 2009; v1 submitted 30 July, 2009; originally announced July 2009.

Comments: 30 pages, added statement and sketch of proof in Section 7 for generalization of results to p-fold covers of the projective line, the final version of this article will be published in International Mathematics Research Notices

MSC Class: 11G20; 11T55; 11G25

Journal ref: Int. Math. Res. Not. IMRN 2010, no. 5, 932--967

Showing 1–33 of 33 results for author: Bucur, A