Showing 1–50 of 275 results for author: Rao, A

Searching in archive cs.
  1. arXiv:2509.21917  [pdf, ps, other]

    cs.CV cs.MM

    Taming Flow-based I2V Models for Creative Video Editing

    Authors: Xianghao Kong, Hansheng Chen, Yuwei Guo, Lvmin Zhang, Gordon Wetzstein, Maneesh Agrawala, Anyi Rao

    Abstract: Although image editing techniques have advanced significantly, video editing, which aims to manipulate videos according to user intent, remains an emerging challenge. Most existing image-conditioned video editing methods either require inversion with model-specific design or need extensive optimization, limiting their capability of leveraging up-to-date image-to-video (I2V) models to transfer the…

    Submitted 26 September, 2025; originally announced September 2025.

  2. arXiv:2509.21263  [pdf, ps, other]

    cs.CV

    Dense Semantic Matching with VGGT Prior

    Authors: Songlin Yang, Tianyi Wei, Yushi Lan, Zeqi Xiao, Anyi Rao, Xingang Pan

    Abstract: Semantic matching aims to establish pixel-level correspondences between instances of the same category and represents a fundamental task in computer vision. Existing approaches suffer from two limitations: (i) Geometric Ambiguity: Their reliance on 2D foundation model features (e.g., Stable Diffusion, DINO) often fails to disambiguate symmetric structures, requiring extra fine-tuning yet lacking g…

    Submitted 25 September, 2025; originally announced September 2025.

  3. arXiv:2509.19941  [pdf, ps, other]

    cs.CL cs.AI

    CorIL: Towards Enriching Indian Language to Indian Language Parallel Corpora and Machine Translation Systems

    Authors: Soham Bhattacharjee, Mukund K Roy, Yathish Poojary, Bhargav Dave, Mihir Raj, Vandan Mujadia, Baban Gain, Pruthwik Mishra, Arafat Ahsan, Parameswari Krishnamurthy, Ashwath Rao, Gurpreet Singh Josan, Preeti Dubey, Aadil Amin Kak, Anna Rao Kulkarni, Narendra VG, Sunita Arora, Rakesh Balbantray, Prasenjit Majumdar, Karunesh K Arora, Asif Ekbal, Dipti Mishra Sharma

    Abstract: India's linguistic landscape is one of the most diverse in the world, comprising over 120 major languages and approximately 1,600 additional languages, with 22 officially recognized as scheduled languages in the Indian Constitution. Despite recent progress in multilingual neural machine translation (NMT), high-quality parallel corpora for Indian languages remain scarce, especially across varied do…

    Submitted 24 September, 2025; originally announced September 2025.

  4. arXiv:2509.18638  [pdf, ps, other]

    cs.CV cs.AI

    Learning neuroimaging models from health system-scale data

    Authors: Yiwei Lyu, Samir Harake, Asadur Chowdury, Soumyanil Banerjee, Rachel Gologorsky, Shixuan Liu, Anna-Katharina Meissner, Akshay Rao, Chenhui Zhao, Akhil Kondepudi, Cheng Jiang, Xinhai Hou, Rushikesh S. Joshi, Volker Neuschmelting, Ashok Srinivasan, Dawn Kleindorfer, Brian Athey, Vikas Gulani, Aditya Pandey, Honglak Lee, Todd Hollon

    Abstract: Neuroimaging is a ubiquitous tool for evaluating patients with neurological diseases. The global demand for magnetic resonance imaging (MRI) studies has risen steadily, placing significant strain on health systems, prolonging turnaround times, and intensifying physician burnout \cite{Chen2017-bt, Rula2024-qp-1}. These challenges disproportionately impact patients in low-resource and rural settings…

    Submitted 23 September, 2025; originally announced September 2025.

  5. arXiv:2509.14296  [pdf, ps, other]

    cs.DB

    Spezi Data Pipeline: Streamlining FHIR-based Interoperable Digital Health Data Workflows

    Authors: Vasiliki Bikia, Paul Schmiedmayer, Aydin Zahedivash, Lauren Aalami, Adrit Rao, Vishnu Ravi, Matthew Turk, Scott R. Ceresnak, Oliver Aalami

    Abstract: The increasing adoption of digital health technologies has amplified the need for robust, interoperable solutions to manage complex healthcare data. We present the Spezi Data Pipeline, an open-source Python toolkit designed to streamline the analysis of digital health data, from secure access and retrieval to processing, visualization, and export. The Pipeline is integrated into the larger Stanfor…

    Submitted 16 September, 2025; originally announced September 2025.

  6. arXiv:2508.06486  [pdf, ps, other]

    cs.DS math.NA

    Does block size matter in randomized block Krylov low-rank approximation?

    Authors: Tyler Chen, Ethan N. Epperly, Raphael A. Meyer, Christopher Musco, Akash Rao

    Abstract: We study the problem of computing a rank-$k$ approximation of a matrix using randomized block Krylov iteration. Prior work has shown that, for block size $b = 1$ or $b = k$, a $(1 + \varepsilon)$-factor approximation to the best rank-$k$ approximation can be obtained after $\tilde O(k/\sqrt{\varepsilon})$ matrix-vector products with the target matrix. On the other hand, when $b$ is between $1$ and…

    Submitted 8 August, 2025; originally announced August 2025.

    MSC Class: 65F55 65F15 ACM Class: G.1.3; F.2.1

  7. arXiv:2507.22051  [pdf, ps, other]

    cs.HC

    DataSway: Vivifying Metaphoric Visualization with Animation Clip Generation and Coordination

    Authors: Liwenhan Xie, Jiayi Zhou, Anyi Rao, Huamin Qu, Xinhuan Shu

    Abstract: Animating metaphoric visualizations brings data to life, enhancing the comprehension of abstract data encodings and fostering deeper engagement. However, creators face significant challenges in designing these animations, such as crafting motions that align semantically with the metaphors, maintaining faithful data representation during animation, and seamlessly integrating interactivity. We propo…

    Submitted 30 July, 2025; v1 submitted 29 July, 2025; originally announced July 2025.

    Comments: 19 pages, 5 figures; Website: https://shellywhen.github.io/projects/DataSway

  8. arXiv:2507.20355  [pdf, ps, other]

    cs.HC

    CineVision: An Interactive Pre-visualization Storyboard System for Director-Cinematographer Collaboration

    Authors: Zheng Wei, Hongtao Wu, Lvmin Zhang, Xian Xu, Yefeng Zheng, Pan Hui, Maneesh Agrawala, Huamin Qu, Anyi Rao

    Abstract: Effective communication between directors and cinematographers is fundamental in film production, yet traditional approaches relying on visual references and hand-drawn storyboards often lack the efficiency and precision necessary during pre-production. We present CineVision, an AI-driven platform that integrates scriptwriting with real-time visual pre-visualization to bridge this communication ga…

    Submitted 4 August, 2025; v1 submitted 27 July, 2025; originally announced July 2025.

    Comments: UIST 2025

  9. arXiv:2507.13385  [pdf, ps, other]

    cs.CV cs.LG

    Using Multiple Input Modalities Can Improve Data-Efficiency and O.O.D. Generalization for ML with Satellite Imagery

    Authors: Arjun Rao, Esther Rolf

    Abstract: A large variety of geospatial data layers is available around the world ranging from remotely-sensed raster data like satellite imagery, digital elevation models, predicted land cover maps, and human-annotated data, to data derived from environmental sensors such as air temperature or wind speed data. A large majority of machine learning models trained on satellite imagery (SatML), however, are de…

    Submitted 15 July, 2025; originally announced July 2025.

    Comments: 17 pages, 9 figures, 7 tables. Accepted to TerraBytes@ICML 2025

  10. arXiv:2507.06261  [pdf, ps, other]

    cs.CL cs.AI

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Authors: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, et al. (3284 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal unde…

    Submitted 22 July, 2025; v1 submitted 7 July, 2025; originally announced July 2025.

    Comments: 72 pages, 17 figures

  11. arXiv:2507.05443  [pdf, ps, other]

    cs.CL cs.CY

    Gendered Divides in Online Discussions about Reproductive Rights

    Authors: Ashwin Rao, Sze Yuh Nina Wang, Kristina Lerman

    Abstract: The U.S. Supreme Court's 2022 ruling in Dobbs v. Jackson Women's Health Organization marked a turning point in the national debate over reproductive rights. While the ideological divide over abortion is well documented, less is known about how gender and local sociopolitical contexts interact to shape public discourse. Drawing on nearly 10 million abortion-related posts on X (formerly Twitter) fro…

    Submitted 7 July, 2025; originally announced July 2025.

  12. arXiv:2507.05201  [pdf, ps, other]

    cs.AI cs.CL cs.CV

    MedGemma Technical Report

    Authors: Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Brush, Kenneth Philbrick, et al. (56 additional authors not shown)

    Abstract: Artificial intelligence (AI) has significant potential in healthcare applications, but its training and deployment faces challenges due to healthcare's diverse data, complex tasks, and the need to preserve privacy. Foundation models that perform well on medical tasks and require less task-specific tuning data are critical to accelerate the development of healthcare AI applications. We introduce Me…

    Submitted 12 July, 2025; v1 submitted 7 July, 2025; originally announced July 2025.

  13. arXiv:2507.02708  [pdf, ps, other]

    cs.RO

    Optimizing Start Locations in Ergodic Search for Disaster Response

    Authors: Ananya Rao, Alyssa Hargis, David Wettergreen, Howie Choset

    Abstract: In disaster response scenarios, deploying robotic teams effectively is crucial for improving situational awareness and enhancing search and rescue operations. The use of robots in search and rescue has been studied but the question of where to start robot deployments has not been addressed. This work addresses the problem of optimally selecting starting locations for robots with heterogeneous capa…

    Submitted 30 July, 2025; v1 submitted 3 July, 2025; originally announced July 2025.

  14. arXiv:2506.18882  [pdf, ps, other]

    cs.CV

    Light of Normals: Unified Feature Representation for Universal Photometric Stereo

    Authors: Hong Li, Houyuan Chen, Chongjie Ye, Zhaoxi Chen, Bohan Li, Shaocong Xu, Xianda Guo, Xuhui Liu, Yikai Wang, Baochang Zhang, Satoshi Ikehata, Boxin Shi, Anyi Rao, Hao Zhao

    Abstract: Universal photometric stereo (PS) is defined by two factors: it must (i) operate under arbitrary, unknown lighting conditions and (ii) avoid reliance on specific illumination models. Despite progress (e.g., SDM UniPS), two challenges remain. First, current encoders cannot guarantee that illumination and normal information are decoupled. To enforce decoupling, we introduce LINO UniPS with two key c…

    Submitted 27 September, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

    Comments: Home: https://houyuanchen111.github.io/lino.github.io Github: https://github.com/houyuanchen111/LINO_UniPS HuggingFace Demo: https://huggingface.co/spaces/houyuanchen/lino

  15. arXiv:2506.12724  [pdf, ps, other]

    cs.CV

    Dynamic Modality Scheduling for Multimodal Large Models via Confidence, Uncertainty, and Semantic Consistency

    Authors: Hiroshi Tanaka, Anika Rao, Hana Satou, Michael Johnson, Sofia García

    Abstract: Multimodal Large Models (MLLMs) have achieved remarkable progress in vision-language understanding and generation tasks. However, existing MLLMs typically rely on static modality fusion strategies, which treat all modalities equally regardless of their instance-level reliability or semantic contribution. This often leads to suboptimal performance, especially in scenarios with noisy, missing, or mi…

    Submitted 15 June, 2025; originally announced June 2025.

  16. arXiv:2506.06964  [pdf, ps, other]

    cs.CL cs.LG

    Learning to Clarify by Reinforcement Learning Through Reward-Weighted Fine-Tuning

    Authors: Subhojyoti Mukherjee, Viet Dac Lai, Raghavendra Addanki, Ryan Rossi, Seunghyun Yoon, Trung Bui, Anup Rao, Jayakumar Subramanian, Branislav Kveton

    Abstract: Question answering (QA) agents automatically answer questions posed in natural language. In this work, we learn to ask clarifying questions in QA agents. The key idea in our method is to simulate conversations that contain clarifying questions and learn from them using reinforcement learning (RL). To make RL practical, we propose and analyze offline RL objectives that can be viewed as reward-weigh…

    Submitted 7 June, 2025; originally announced June 2025.

    Comments: 39 pages

  17. arXiv:2506.00049  [pdf, ps, other]

    cs.IR cs.AI

    Rethinking Hybrid Retrieval: When Small Embeddings and LLM Re-ranking Beat Bigger Models

    Authors: Arjun Rao, Hanieh Alipour, Nick Pendar

    Abstract: This paper presents a comparison of embedding models in tri-modal hybrid retrieval for Retrieval-Augmented Generation (RAG) systems. We investigate the fusion of dense semantic, sparse lexical, and graph-based embeddings, focusing on the performance of the MiniLM-v6 and BGE-Large architectures. Contrary to conventional assumptions, our results show that the compact MiniLM-v6 outperforms the larger…

    Submitted 28 May, 2025; originally announced June 2025.

  18. arXiv:2505.21862  [pdf, ps, other]

    cs.CV

    Towards Scalable Language-Image Pre-training for 3D Medical Imaging

    Authors: Chenhui Zhao, Yiwei Lyu, Asadur Chowdury, Edward Harake, Akhil Kondepudi, Akshay Rao, Xinhai Hou, Honglak Lee, Todd Hollon

    Abstract: The scalability of current language-image pre-training for 3D medical imaging, such as CT and MRI, is constrained by the need for radiologists to manually curate raw clinical studies. In this work, we pioneer pre-training directly on uncurated studies, which both aligns more closely with the radiologist's workflow and provides a natural path to scalability. However, the unique structure of such da…

    Submitted 25 September, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

  19. arXiv:2505.10913  [pdf, other]

    cs.LG

    Automated Identification of Logical Errors in Programs: Advancing Scalable Analysis of Student Misconceptions

    Authors: Muntasir Hoq, Ananya Rao, Reisha Jaishankar, Krish Piryani, Nithya Janapati, Jessica Vandenberg, Bradford Mott, Narges Norouzi, James Lester, Bita Akram

    Abstract: In Computer Science (CS) education, understanding factors contributing to students' programming difficulties is crucial for effective learning support. By identifying specific issues students face, educators can provide targeted assistance to help them overcome obstacles and improve learning outcomes. While identifying sources of struggle, such as misconceptions, in real-time can be challenging in…

    Submitted 16 May, 2025; originally announced May 2025.

    Comments: Accepted for publication at the 18th International Conference on Educational Data Mining (EDM), 2025

    ACM Class: K.3.1

  20. arXiv:2505.06537  [pdf, ps, other]

    cs.CV cs.AI

    ProFashion: Prototype-guided Fashion Video Generation with Multiple Reference Images

    Authors: Xianghao Kong, Qiaosong Qi, Yuanbin Wang, Anyi Rao, Biaolong Chen, Aixi Zhang, Si Liu, Hao Jiang

    Abstract: Fashion video generation aims to synthesize temporally consistent videos from reference images of a designated character. Despite significant progress, existing diffusion-based methods only support a single reference image as input, severely limiting their capability to generate view-consistent fashion videos, especially when there are different patterns on the clothes from different perspectives.…

    Submitted 10 May, 2025; originally announced May 2025.

  21. arXiv:2504.16091  [pdf, other]

    cs.CR math.HO

    Post-Quantum Homomorphic Encryption: A Case for Code-Based Alternatives

    Authors: Siddhartha Siddhiprada Bhoi, Arathi Arakala, Amy Beth Corman, Asha Rao

    Abstract: Homomorphic Encryption (HE) allows secure and privacy-protected computation on encrypted data without the need to decrypt it. Since Shor's algorithm rendered prime factorisation and discrete logarithm-based ciphers insecure with quantum computations, researchers have been working on building post-quantum homomorphic encryption (PQHE) algorithms. Most of the current PQHE algorithms are secured by L…

    Submitted 28 March, 2025; originally announced April 2025.

  22. Evaluation and Incident Prevention in an Enterprise AI Assistant

    Authors: Akash V. Maharaj, David Arbour, Daniel Lee, Uttaran Bhattacharya, Anup Rao, Austin Zane, Avi Feller, Kun Qian, Yunyao Li

    Abstract: Enterprise AI Assistants are increasingly deployed in domains where accuracy is paramount, making each erroneous output a potentially significant incident. This paper presents a comprehensive framework for monitoring, benchmarking, and continuously improving such complex, multi-component systems under active development by multiple teams. Our approach encompasses three key elements: (1) a hierarch…

    Submitted 11 April, 2025; originally announced April 2025.

    Comments: 7 pages, 5 figures. Accepted at IAAI-25

  23. arXiv:2504.08832  [pdf, other]

    cs.CY cs.AI

    Generative AI in Collaborative Academic Report Writing: Advantages, Disadvantages, and Ethical Considerations

    Authors: Mahshid Sadeghpour, Arathi Arakala, Asha Rao

    Abstract: The availability and abundance of GenAI tools to administer tasks traditionally managed by people have raised concerns, particularly within the education and academic sectors, as some students may highly rely on these tools to complete the assignments designed to enable learning. This article focuses on informing students about the significance of investing their time during their studies on devel…

    Submitted 10 April, 2025; originally announced April 2025.

    Comments: 21 pages, 5 figures

  24. arXiv:2504.08296  [pdf, other]

    cs.CV

    Generative AI for Film Creation: A Survey of Recent Advances

    Authors: Ruihan Zhang, Borou Yu, Jiajian Min, Yetong Xin, Zheng Wei, Juncheng Nemo Shi, Mingzhen Huang, Xianghao Kong, Nix Liu Xin, Shanshan Jiang, Praagya Bahuguna, Mark Chan, Khushi Hora, Lijian Yang, Yongqi Liang, Runhe Bian, Yunlei Liu, Isabela Campillo Valencia, Patricia Morales Tredinick, Ilia Kozlov, Sijia Jiang, Peiwen Huang, Na Chen, Xuanxuan Liu, Anyi Rao

    Abstract: Generative AI (GenAI) is transforming filmmaking, equipping artists with tools like text-to-image and image-to-video diffusion, neural radiance fields, avatar generation, and 3D synthesis. This paper examines the adoption of these technologies in filmmaking, analyzing workflows from recent AI-driven films to understand how GenAI contributes to character creation, aesthetic styling, and narration.…

    Submitted 11 April, 2025; originally announced April 2025.

    Comments: Accepted at CVPR 2025 CVEU workshop: AI for Creative Visual Content Generation Editing and Understanding

  25. arXiv:2503.19786  [pdf, other]

    cs.CL cs.AI

    Gemma 3 Technical Report

    Authors: Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Beyer, Xiaohai Zhai, Anton Tsitsulin, et al. (191 additional authors not shown)

    Abstract: We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achie…

    Submitted 25 March, 2025; originally announced March 2025.

  26. arXiv:2503.12295  [pdf, other]

    cs.LG math.NA

    Towards Learning High-Precision Least Squares Algorithms with Sequence Models

    Authors: Jerry Liu, Jessica Grogan, Owen Dugan, Ashish Rao, Simran Arora, Atri Rudra, Christopher Ré

    Abstract: This paper investigates whether sequence models can learn to perform numerical algorithms, e.g. gradient descent, on the fundamental problem of least squares. Our goal is to inherit two properties of standard algorithms from numerical analysis: (1) machine precision, i.e. we want to obtain solutions that are accurate to near floating point error, and (2) numerical generality, i.e. we want them to…

    Submitted 15 March, 2025; originally announced March 2025.

    Comments: 75 pages, 18 figures. ICLR 2025

  27. arXiv:2503.04998  [pdf, other]

    cs.RO

    Multi-Agent Ergodic Exploration under Smoke-Based, Time-Varying Sensor Visibility Constraints

    Authors: Elena Wittemyer, Ananya Rao, Ian Abraham, Howie Choset

    Abstract: In this work, we consider the problem of multi-agent informative path planning (IPP) for robots whose sensor visibility continuously changes as a consequence of a time-varying natural phenomenon. We leverage ergodic trajectory optimization (ETO), which generates paths such that the amount of time an agent spends in an area is proportional to the expected information in that area. We focus specific…

    Submitted 6 March, 2025; originally announced March 2025.

    Comments: Accepted to ICRA 2025

  28. arXiv:2503.04641  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    Simulating the Real World: A Unified Survey of Multimodal Generative Models

    Authors: Yuqi Hu, Longguang Wang, Xian Liu, Ling-Hao Chen, Yuwei Guo, Yukai Shi, Ce Liu, Anyi Rao, Zeyu Wang, Hui Xiong

    Abstract: Understanding and replicating the real world is a critical challenge in Artificial General Intelligence (AGI) research. To achieve this, many existing approaches, such as world models, aim to capture the fundamental principles governing the physical world, enabling more accurate simulations and meaningful interactions. However, current methods often treat different modalities, including 2D (images…

    Submitted 12 August, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

    Comments: Repository for the related papers at https://github.com/ALEEEHU/World-Simulator

  29. arXiv:2502.15017  [pdf, other]

    cs.LG cs.CR

    Interpreting Adversarial Attacks and Defences using Architectures with Enhanced Interpretability

    Authors: Akshay G Rao, Chandrashekhar Lakshminarayanan, Arun Rajkumar

    Abstract: Adversarial attacks in deep learning represent a significant threat to the integrity and reliability of machine learning models. Adversarial training has been a popular defence technique against these adversarial attacks. In this work, we capitalize on a network architecture, namely Deep Linearly Gated Networks (DLGN), which has better interpretation capabilities than regular deep network architec…

    Submitted 20 February, 2025; originally announced February 2025.

    Comments: Publication accepted at AAAI Deployable AI conference 2025 (proof - https://sites.google.com/view/dai-2025/accepted-papers?authuser=0) Total 17 pages

  30. arXiv:2502.08590  [pdf, other]

    cs.CV

    Light-A-Video: Training-free Video Relighting via Progressive Light Fusion

    Authors: Yujie Zhou, Jiazi Bu, Pengyang Ling, Pan Zhang, Tong Wu, Qidong Huang, Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Anyi Rao, Jiaqi Wang, Li Niu

    Abstract: Recent advancements in image relighting models, driven by large-scale datasets and pre-trained diffusion models, have enabled the imposition of consistent lighting. However, video relighting still lags, primarily due to the excessive training costs and the scarcity of diverse, high-quality video relighting datasets. A simple application of image relighting models on a frame-by-frame basis leads to…

    Submitted 12 March, 2025; v1 submitted 12 February, 2025; originally announced February 2025.

    Comments: Project Page: https://bujiazi.github.io/light-a-video.github.io/

  31. arXiv:2501.14249  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Humanity's Last Exam

    Authors: Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dmitry Dodonov, Tung Nguyen, Jaeho Lee, Daron Anderson, Mikhail Doroshenko, Alun Cennyth Stokes, et al. (1087 additional authors not shown)

    Abstract: Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of…

    Submitted 25 September, 2025; v1 submitted 24 January, 2025; originally announced January 2025.

    Comments: 29 pages, 6 figures

  32. arXiv:2412.18708  [pdf]

    cs.AI

    CAG: Chunked Augmented Generation for Google Chrome's Built-in Gemini Nano

    Authors: Vivek Vellaiyappan Surulimuthu, Aditya Karnam Gururaj Rao

    Abstract: We present Chunked Augmented Generation (CAG), an architecture specifically designed to overcome the context window limitations of Google Chrome's built-in Gemini Nano model. While Chrome's integration of Gemini Nano represents a significant advancement in bringing AI capabilities directly to the browser, its restricted context window poses challenges for processing large inputs. CAG addresses thi…

    Submitted 24 December, 2024; originally announced December 2024.

    Comments: 36 pages, 19 figures

    MSC Class: 68T01 (Primary) ACM Class: I.2.0; I.2.1; I.2.7

  33. arXiv:2412.14414  [pdf, other]

    cs.SI cs.CL cs.CY

    In-Group Love, Out-Group Hate: A Framework to Measure Affective Polarization via Contentious Online Discussions

    Authors: Buddhika Nettasinghe, Ashwin Rao, Bohan Jiang, Allon Percus, Kristina Lerman

    Abstract: Affective polarization, the emotional divide between ideological groups marked by in-group love and out-group hate, has intensified in the United States, driving contentious issues like masking and lockdowns during the COVID-19 pandemic. Despite its societal impact, existing models of opinion change fail to account for emotional dynamics nor offer methods to quantify affective polarization robustl…

    Submitted 18 December, 2024; originally announced December 2024.

  34. arXiv:2412.07019  [pdf, other]

    cs.CL cs.CY

    Assessing the Impact of Conspiracy Theories Using Large Language Models

    Authors: Bohan Jiang, Dawei Li, Zhen Tan, Xinyi Zhou, Ashwin Rao, Kristina Lerman, H. Russell Bernard, Huan Liu

    Abstract: Measuring the relative impact of CTs is important for prioritizing responses and allocating resources effectively, especially during crises. However, assessing the actual impact of CTs on the public poses unique challenges. It requires not only the collection of CT-specific knowledge but also diverse information from social, psychological, and cultural dimensions. Recent advancements in large lang…

    Submitted 9 December, 2024; originally announced December 2024.

  35. arXiv:2412.06487  [pdf, other]

    eess.IV cs.CV cs.LG

    Improving text-conditioned latent diffusion for cancer pathology

    Authors: Aakash Madhav Rao, Debayan Gupta

    Abstract: The development of generative models in the past decade has allowed for hyperrealistic data synthesis. While potentially beneficial, this synthetic data generation process has been relatively underexplored in cancer histopathology. One algorithm for synthesising a realistic image is diffusion; it iteratively converts an image to noise and learns the recovery process from this noise [Wang and Vasto…

    Submitted 9 December, 2024; originally announced December 2024.

  36. arXiv:2412.01273  [pdf, other]

    cs.HC cs.CV

    AR-Facilitated Safety Inspection and Fall Hazard Detection on Construction Sites

    Authors: Jiazhou Liu, Aravinda S. Rao, Fucai Ke, Tim Dwyer, Benjamin Tag, Pari Delir Haghighi

    Abstract: Together with industry experts, we are exploring the potential of head-mounted augmented reality to facilitate safety inspections on high-rise construction sites. A particular concern in the industry is inspecting perimeter safety screens on higher levels of construction sites, intended to prevent falls of people and objects. We aim to support workers performing this inspection task by tracking wh…

    Submitted 2 December, 2024; originally announced December 2024.

    Comments: 2 pages, 1 figure, ISMAR24 Workshop Paper

  37. arXiv:2412.00224  [pdf, other]

    cs.AI cs.DB cs.MA

    An AI-Driven Data Mesh Architecture Enhancing Decision-Making in Infrastructure Construction and Public Procurement

    Authors: Saurabh Mishra, Mahendra Shinde, Aniket Yadav, Bilal Ayyub, Anand Rao

    Abstract: Infrastructure construction, often dubbed an "industry of industries," is closely linked with government spending and public procurement, offering significant opportunities for improved efficiency and productivity through better transparency and information access. By leveraging these opportunities, we can achieve notable gains in productivity, cost savings, and broader economic benefits. Our appr…

    Submitted 29 November, 2024; originally announced December 2024.

  38. arXiv:2411.12516  [pdf, other]

    cond-mat.mes-hall cs.CV cs.ET cs.LG quant-ph

    Modular Autonomous Virtualization System for Two-Dimensional Semiconductor Quantum Dot Arrays

    Authors: Anantha S. Rao, Donovan Buterakos, Barnaby van Straaten, Valentin John, Cécile X. Yu, Stefan D. Oosterhout, Lucas Stehouwer, Giordano Scappucci, Menno Veldhorst, Francesco Borsoi, Justyna P. Zwolak

    Abstract: Arrays of gate-defined semiconductor quantum dots are among the leading candidates for building scalable quantum processors. High-fidelity initialization, control, and readout of spin qubit registers require exquisite and targeted control over key Hamiltonian parameters that define the electrostatic environment. However, due to the tight gate pitch, capacitive crosstalk between gates hinders indep…

    Submitted 6 May, 2025; v1 submitted 19 November, 2024; originally announced November 2024.

    Comments: 14 pages, 5 figures, 9 pages of supplemental material

    Journal ref: Phys. Rev. X 15, 021034 (2025)

  39. arXiv:2411.08981  [pdf, other]

    cs.AI eess.SY

    Reliability, Resilience and Human Factors Engineering for Trustworthy AI Systems

    Authors: Saurabh Mishra, Anand Rao, Ramayya Krishnan, Bilal Ayyub, Amin Aria, Enrico Zio

    Abstract: As AI systems become integral to critical operations across industries and services, ensuring their reliability and safety is essential. We offer a framework that integrates established reliability and resilience engineering principles into AI systems. By applying traditional metrics such as failure rate and Mean Time Between Failures (MTBF) along with resilience engineering and human reliability…

    Submitted 13 November, 2024; originally announced November 2024.

  40. arXiv:2410.10570  [pdf, other]

    cs.HC eess.SY

    Mindalogue: LLM-Powered Nonlinear Interaction for Effective Learning and Task Exploration

    Authors: Rui Zhang, Ziyao Zhang, Fengliang Zhu, Jiajie Zhou, Anyi Rao

    Abstract: Current generative AI models like ChatGPT, Claude, and Gemini are widely used for knowledge dissemination, task decomposition, and creative thinking. However, their linear interaction methods often force users to repeatedly compare and copy contextual information when handling complex tasks, increasing cognitive load and operational costs. Moreover, the ambiguity in model responses requires users…

    Submitted 15 October, 2024; v1 submitted 14 October, 2024; originally announced October 2024.

    Comments: 17 pages, 9 figures

    MSC Class: 68U35 (Primary); 68T20 (Secondary)

    ACM Class: H.5.2

  41. arXiv:2410.09076  [pdf, other]

    cs.CL

    Llettuce: An Open Source Natural Language Processing Tool for the Translation of Medical Terms into Uniform Clinical Encoding

    Authors: James Mitchell-White, Reza Omdivar, Esmond Urwin, Karthikeyan Sivakumar, Ruizhe Li, Andy Rae, Xiaoyan Wang, Theresia Mina, John Chambers, Grazziela Figueredo, Philip R Quinlan

    Abstract: This paper introduces Llettuce, an open-source tool designed to address the complexities of converting medical terms into OMOP standard concepts. Unlike existing solutions such as the Athena database search and Usagi, which struggle with semantic nuances and require substantial manual input, Llettuce leverages advanced natural language processing, including large language models and fuzzy matching…

    Submitted 4 October, 2024; originally announced October 2024.

  42. arXiv:2410.04129  [pdf, ps, other]

    eess.SY cs.RO math.OC

    Trajectory elongation strategies with minimum curvature discontinuities for a Dubins vehicle

    Authors: Aditya K. Rao, Twinkle Tripathy

    Abstract: In this paper, we present strategies for designing curvature-bounded trajectories of any desired length between any two given oriented points. The proposed trajectory is constructed by the concatenation of three circular arcs of varying radii. Such a trajectory guarantees a complete coverage of the maximum set of reachable lengths while minimising the number of changeover points in the trajectory…

    Submitted 5 October, 2024; originally announced October 2024.

    Comments: Preprint submitted to Automatica

  43. arXiv:2410.03224  [pdf, other]

    cs.HC cs.AI cs.CV cs.GR

    ScriptViz: A Visualization Tool to Aid Scriptwriting based on a Large Movie Database

    Authors: Anyi Rao, Jean-Peïc Chou, Maneesh Agrawala

    Abstract: Scriptwriters usually rely on their mental visualization to create a vivid story by using their imagination to see, feel, and experience the scenes they are writing. Besides mental visualization, they often refer to existing images or scenes in movies and analyze the visual elements to create a certain mood or atmosphere. In this paper, we develop ScriptViz to provide external visualization based…

    Submitted 4 October, 2024; originally announced October 2024.

    Comments: Accepted in the 37th Annual ACM Symposium on User Interface Software and Technology (UIST'24). Webpage: https://virtualfilmstudio.github.io/projects/scriptviz

  44. arXiv:2408.17424  [pdf, other]

    cs.CV cs.HC

    CinePreGen: Camera Controllable Video Previsualization via Engine-powered Diffusion

    Authors: Yiran Chen, Anyi Rao, Xuekun Jiang, Shishi Xiao, Ruiqing Ma, Zeyu Wang, Hui Xiong, Bo Dai

    Abstract: With advancements in video generative AI models (e.g., SORA), creators are increasingly using these techniques to enhance video previsualization. However, they face challenges with incomplete and mismatched AI workflows. Existing methods mainly rely on text descriptions and struggle with camera placement, a key component of previsualization. To address these issues, we introduce CinePreGen, a visu…

    Submitted 30 August, 2024; originally announced August 2024.

  45. arXiv:2408.05895  [pdf, other]

    cs.CY cs.CR

    Gender of Recruiter Makes a Difference: A study into Cybersecurity Graduate Recruitment

    Authors: Joanne L. Hall, Asha Rao

    Abstract: An ever-widening workforce gap exists in the global cybersecurity industry but diverse talent is underutilized. The global cybersecurity workforce is only 25% female. Much research exists on the effect of gender bias on the hiring of women into the technical workforce, but little on how the gender of the recruiter (gender difference) affects recruitment decisions. This research reveals differences…

    Submitted 11 August, 2024; originally announced August 2024.

    Comments: 22 pages, 4 figures

  46. arXiv:2408.00118  [pdf, other]

    cs.CL cs.AI

    Gemma 2: Improving Open Language Models at a Practical Size

    Authors: Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin, Nikola Momchev, Matt Hoffman, et al. (173 additional authors not shown)

    Abstract: In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We al…

    Submitted 2 October, 2024; v1 submitted 31 July, 2024; originally announced August 2024.

  47. arXiv:2407.21783  [pdf, other]

    cs.AI cs.CL cs.CV

    The Llama 3 Herd of Models

    Authors: Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, et al. (536 additional authors not shown)

    Abstract: Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical…

    Submitted 23 November, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

  48. arXiv:2407.05483  [pdf, other]

    cs.CL cs.LG

    Just read twice: closing the recall gap for recurrent language models

    Authors: Simran Arora, Aman Timalsina, Aaryan Singhal, Benjamin Spector, Sabri Eyuboglu, Xinyi Zhao, Ashish Rao, Atri Rudra, Christopher Ré

    Abstract: Recurrent large language models that compete with Transformers in language modeling perplexity are emerging at a rapid rate (e.g., Mamba, RWKV). Excitingly, these architectures use a constant amount of memory during inference. However, due to the limited memory, recurrent LMs cannot recall and use all the information in long contexts leading to brittle in-context learning (ICL) quality. A key chal…

    Submitted 7 July, 2024; originally announced July 2024.

  49. arXiv:2407.01802  [pdf, ps, other]

    cs.CC

    An XOR Lemma for Deterministic Communication Complexity

    Authors: Siddharth Iyer, Anup Rao

    Abstract: We prove a lower bound on the communication complexity of computing the $n$-fold xor of an arbitrary function $f$, in terms of the communication complexity and rank of $f$. We prove that $D(f^{\oplus n}) \geq n \cdot \Big(\frac{\Omega(D(f))}{\log \mathsf{rk}(f)} - \log \mathsf{rk}(f)\Big)$, where here $D(f), D(f^{\oplus n})$ represent the deterministic communication complexity, and $\mathsf{rk}(f)$ is…

    Submitted 1 July, 2024; originally announced July 2024.

  50. arXiv:2406.12702  [pdf, other]

    cs.CL

    [WIP] Jailbreak Paradox: The Achilles' Heel of LLMs

    Authors: Abhinav Rao, Monojit Choudhury, Somak Aditya

    Abstract: We introduce two paradoxes concerning jailbreak of foundation models: First, it is impossible to construct a perfect jailbreak classifier, and second, a weaker model cannot consistently detect whether a stronger (in a pareto-dominant sense) model is jailbroken or not. We provide formal proofs for these paradoxes and a short case study on Llama and GPT4-o to demonstrate this. We discuss broader the…

    Submitted 20 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.