Showing 1–4 of 4 results for author: Sudhir, S

  1. arXiv:2506.06541  [pdf, ps, other]

    cs.DB cs.AI cs.MA

    KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes

    Authors: Eugenie Lai, Gerardo Vitagliano, Ziyu Zhang, Sivaprasad Sudhir, Om Chabra, Anna Zeng, Anton A. Zabreyko, Chenning Li, Ferdi Kossmann, Jialin Ding, Jun Chen, Markos Markakis, Matthew Russo, Weiyang Wang, Ziniu Wu, Michael J. Cafarella, Lei Cao, Samuel Madden, Tim Kraska

    Abstract: Constructing real-world data-to-insight pipelines often involves data extraction from data lakes, data integration across heterogeneous data sources, and diverse operations from data cleaning to analysis. The design and implementation of data science pipelines require domain knowledge, technical expertise, and even project-specific insights. AI systems have shown remarkable reasoning, coding, and…

    Submitted 6 June, 2025; originally announced June 2025.

  2. arXiv:2505.14661  [pdf, ps, other]

    cs.DB cs.AI

    Abacus: A Cost-Based Optimizer for Semantic Operator Systems

    Authors: Matthew Russo, Sivaprasad Sudhir, Gerardo Vitagliano, Chunwei Liu, Tim Kraska, Samuel Madden, Michael Cafarella

    Abstract: LLMs enable an exciting new class of data processing applications over large collections of unstructured documents. Several new programming frameworks have enabled developers to build these applications by composing them out of semantic operators: a declarative set of AI-powered data transformations with natural language specifications. These include LLM-powered maps, filters, joins, etc. used for…

    Submitted 17 June, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

    Comments: 16 pages, 6 figures

    ACM Class: H.2.4; I.2.5

  3. Is Audio Spoof Detection Robust to Laundering Attacks?

    Authors: Hashim Ali, Surya Subramani, Shefali Sudhir, Raksha Varahamurthy, Hafiz Malik

    Abstract: Voice-cloning (VC) systems have seen an exceptional increase in the realism of synthesized speech in recent years. The high quality of synthesized speech and the availability of low-cost VC services have given rise to many potential abuses of this technology. Several detection methodologies have been proposed over the years that can detect voice spoofs with reasonably good accuracy. However, these…

    Submitted 26 August, 2024; originally announced August 2024.

    Comments: Conference Paper

  4. arXiv:2106.05664  [pdf, other]

    cs.CL cs.AI

    Ruddit: Norms of Offensiveness for English Reddit Comments

    Authors: Rishav Hada, Sohi Sudhir, Pushkar Mishra, Helen Yannakoudakis, Saif M. Mohammad, Ekaterina Shutova

    Abstract: On social media platforms, hateful and offensive language negatively impact the mental well-being of users and the participation of people from diverse backgrounds. Automatic methods to detect offensive language have largely relied on datasets with categorical labels. However, comments can vary in their degree of offensiveness. We create the first dataset of English language Reddit comments that h…

    Submitted 25 January, 2022; v1 submitted 10 June, 2021; originally announced June 2021.

    Comments: Camera-ready version in ACL 2021