
Showing 1–8 of 8 results for author: Arora, A

Searching in archive econ.
  1. arXiv:2504.15448  [pdf, ps, other]

    econ.GN cs.CL

    Visualizing Public Opinion on X: A Real-Time Sentiment Dashboard Using VADER and DistilBERT

    Authors: Yanampally Abhiram Reddy, Siddhi Agarwal, Vikram Parashar, Arshiya Arora

    Abstract: In the age of social media, understanding public sentiment toward major corporations is crucial for investors, policymakers, and researchers. This paper presents a comprehensive sentiment analysis system tailored for corporate reputation monitoring, combining Natural Language Processing (NLP) and machine learning techniques to accurately interpret public opinion in real time. The methodology integ…

    Submitted 1 June, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

    Comments: 19 pages, 2 figures

  2. arXiv:2406.15593  [pdf, other]

    cs.CL econ.GN

    News Deja Vu: Connecting Past and Present with Semantic Search

    Authors: Brevin Franklin, Emily Silcock, Abhishek Arora, Tom Bryan, Melissa Dell

    Abstract: Social scientists and the general public often analyze contemporary events by drawing parallels with the past, a process complicated by the vast, noisy, and unstructured nature of historical texts. For example, hundreds of millions of page scans from historical newspapers have been noisily transcribed. Traditional sparse methods for searching for relevant material in these vast corpora, e.g., with…

    Submitted 19 December, 2024; v1 submitted 21 June, 2024; originally announced June 2024.

  3. arXiv:2406.15576  [pdf, other]

    cs.CL econ.GN

    Contrastive Entity Coreference and Disambiguation for Historical Texts

    Authors: Abhishek Arora, Emily Silcock, Leander Heldring, Melissa Dell

    Abstract: Massive-scale historical document collections are crucial for social science research. Despite increasing digitization, these documents typically lack unique cross-document identifiers for individuals mentioned within the texts, as well as individual identifiers from external knowledgebases like Wikipedia/Wikidata. Existing entity disambiguation methods often fall short in accuracy for historical…

    Submitted 21 June, 2024; originally announced June 2024.

  4. arXiv:2406.09490  [pdf, other]

    cs.CL econ.GN

    Newswire: A Large-Scale Structured Database of a Century of Historical News

    Authors: Emily Silcock, Abhishek Arora, Luca D'Amico-Wong, Melissa Dell

    Abstract: In the U.S. historically, local newspapers drew their content largely from newswires like the Associated Press. Historians argue that newswires played a pivotal role in creating a national identity and shared understanding of the world, but there is no comprehensive archive of the content sent over newswires. We reconstruct such an archive by applying a customized deep learning pipeline to hundred…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: arXiv admin note: text overlap with arXiv:2306.17810, arXiv:2308.12477

  5. arXiv:2310.10050  [pdf, other]

    cs.CV cs.CL econ.GN

    EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge

    Authors: Tom Bryan, Jacob Carlson, Abhishek Arora, Melissa Dell

    Abstract: Billions of public domain documents remain trapped in hard copy or lack an accurate digitization. Modern natural language processing methods cannot be used to index, retrieve, and summarize their texts; conduct computational textual analyses; or extract information for statistical analyses, and these texts cannot be incorporated into language model training. Given the diversity and sheer quantity…

    Submitted 16 October, 2023; originally announced October 2023.

  6. arXiv:2308.12477  [pdf, other]

    cs.CL cs.CV econ.GN

    American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers

    Authors: Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Zejiang Shen, Luca D'Amico-Wong, Quan Le, Pablo Querubin, Leander Heldring

    Abstract: Existing full text datasets of U.S. public domain newspapers do not recognize the often complex layouts of newspaper scans, and as a result the digitized content scrambles texts from articles, headlines, captions, advertisements, and other layout regions. OCR quality can also be low. This study develops a novel, deep learning pipeline for extracting full article texts from newspaper images and app…

    Submitted 23 August, 2023; originally announced August 2023.

  7. arXiv:2305.14672  [pdf, other]

    cs.CL cs.CV econ.GN

    Quantifying Character Similarity with Vision Transformers

    Authors: Xinmei Yang, Abhishek Arora, Shao-Yu Jheng, Melissa Dell

    Abstract: Record linkage is a bedrock of quantitative social science, as analyses often require linking data from multiple, noisy sources. Off-the-shelf string matching methods are widely used, as they are straightforward and cheap to implement and scale. Not all character substitutions are equally probable, and for some settings there are widely used handcrafted lists denoting which string substitutions ar…

    Submitted 23 May, 2023; originally announced May 2023.

  8. arXiv:2304.03464  [pdf, other]

    cs.CV cs.CL econ.GN

    Linking Representations with Multimodal Contrastive Learning

    Authors: Abhishek Arora, Xinmei Yang, Shao-Yu Jheng, Melissa Dell

    Abstract: Many applications require linking individuals, firms, or locations across datasets. Most widely used methods, especially in social science, do not employ deep learning, with record linkage commonly approached using string matching techniques. Moreover, existing methods do not exploit the inherently multimodal nature of documents. In historical record linkage applications, documents are typically n…

    Submitted 21 June, 2024; v1 submitted 6 April, 2023; originally announced April 2023.