Skip to main content

Showing 1–50 of 117 results for author: Qian, S

Searching in archive cs. Search in all archives.
.
  1. arXiv:2504.14221  [pdf, other

    cs.CV

    Real-IAD D3: A Real-World 2D/Pseudo-3D/3D Dataset for Industrial Anomaly Detection

    Authors: Wenbing Zhu, Lidong Wang, Ziqing Zhou, Chengjie Wang, Yurui Pan, Ruoyi Zhang, Zhuhao Chen, Linjie Cheng, Bin-Bin Gao, Jiangning Zhang, Zhenye Gan, Yuxie Wang, Yulong Chen, Shuguang Qian, Mingmin Chi, Bo Peng, Lizhuang Ma

    Abstract: The increasing complexity of industrial anomaly detection (IAD) has positioned multimodal detection methods as a focal area of machine vision research. However, dedicated multimodal datasets specifically tailored for IAD remain limited. Pioneering datasets like MVTec 3D have laid essential groundwork in multimodal IAD by incorporating RGB+3D data, but still face challenges in bridging the gap with… ▽ More

    Submitted 19 April, 2025; originally announced April 2025.

    Comments: 13 pages. Dataset and code: https://realiad4ad.github.io/Real-IAD D3

  2. arXiv:2504.10878  [pdf, other

    cs.CV cs.AI cs.LG

    Large Language Model-Informed Feature Discovery Improves Prediction and Interpretation of Credibility Perceptions of Visual Content

    Authors: Yilang Peng, Sijia Qian, Yingdan Lu, Cuihua Shen

    Abstract: In today's visually dominated social media landscape, predicting the perceived credibility of visual content and understanding what drives human judgment are crucial for countering misinformation. However, these tasks are challenging due to the diversity and richness of visual features. We introduce a Large Language Model (LLM)-informed feature discovery framework that leverages multimodal LLMs, s… ▽ More

    Submitted 15 April, 2025; originally announced April 2025.

    Comments: 26 pages

    ACM Class: I.4.9; J.4

  3. arXiv:2503.23746  [pdf, other

    cs.CV cs.CL cs.LG cs.MM cs.SI

    Short-video Propagation Influence Rating: A New Real-world Dataset and A New Large Graph Model

    Authors: Dizhan Xue, Jing Cui, Shengsheng Qian, Chuanrui Hu, Changsheng Xu

    Abstract: Short-video platforms have gained immense popularity, captivating the interest of millions, if not billions, of users globally. Recently, researchers have highlighted the significance of analyzing the propagation of short-videos, which typically involves discovering commercial values, public opinions, user behaviors, etc. This paper proposes a new Short-video Propagation Influence Rating (SPIR) ta… ▽ More

    Submitted 31 March, 2025; originally announced March 2025.

  4. arXiv:2503.23512  [pdf, other

    cs.CL

    SCORE: Story Coherence and Retrieval Enhancement for AI Narratives

    Authors: Qiang Yi, Yangfan He, Jianhui Wang, Xinyuan Song, Shiyao Qian, Xinhang Yuan, Miao Zhang, Li Sun, Keqin Li, Kuan Lu, Menghao Huo, Jiaqi Chen, Tianyu Shi

    Abstract: Large Language Models (LLMs) can generate creative and engaging narratives from user-specified input, but maintaining coherence and emotional depth throughout these AI-generated stories remains a challenge. In this work, we propose SCORE, a framework for Story Coherence and Retrieval Enhancement, designed to detect and resolve narrative inconsistencies. By tracking key item statuses and generating… ▽ More

    Submitted 21 April, 2025; v1 submitted 30 March, 2025; originally announced March 2025.

  5. arXiv:2503.20519  [pdf, other

    cs.CV

    MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation

    Authors: Jinnan Chen, Lingting Zhu, Zeyu Hu, Shengju Qian, Yugang Chen, Xin Wang, Gim Hee Lee

    Abstract: Recent advances in auto-regressive transformers have revolutionized generative modeling across different domains, from language processing to visual generation, demonstrating remarkable capabilities. However, applying these advances to 3D generation presents three key challenges: the unordered nature of 3D data conflicts with sequential next-token prediction paradigm, conventional vector quantizat… ▽ More

    Submitted 20 April, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

    Comments: CVPR 2025 Highlight: https://jinnan-chen.github.io/projects/MAR-3D/

  6. arXiv:2503.18461  [pdf, other

    cs.CV

    MuMA: 3D PBR Texturing via Multi-Channel Multi-View Generation and Agentic Post-Processing

    Authors: Lingting Zhu, Jingrui Ye, Runze Zhang, Zeyu Hu, Yingda Yin, Lanjiong Li, Jinnan Chen, Shengju Qian, Xin Wang, Qingmin Liao, Lequan Yu

    Abstract: Current methods for 3D generation still fall short in physically based rendering (PBR) texturing, primarily due to limited data and challenges in modeling multi-channel materials. In this work, we propose MuMA, a method for 3D PBR texturing through Multi-channel Multi-view generation and Agentic post-processing. Our approach features two key innovations: 1) We opt to model shaded and albedo appear… ▽ More

    Submitted 24 March, 2025; originally announced March 2025.

    Comments: 17 pages, 14 figures

  7. arXiv:2503.16158  [pdf, other

    cs.CL

    Automatically Generating Chinese Homophone Words to Probe Machine Translation Estimation Systems

    Authors: Shenbin Qian, Constantin Orăsan, Diptesh Kanojia, Félix do Carmo

    Abstract: Evaluating machine translation (MT) of user-generated content (UGC) involves unique challenges such as checking whether the nuance of emotions from the source are preserved in the target text. Recent studies have proposed emotion-related datasets, frameworks and models to automatically evaluate MT quality of Chinese UGC, without relying on reference translations. However, whether these models are… ▽ More

    Submitted 20 March, 2025; originally announced March 2025.

    Comments: Accepted to the 10th Workshop on Noisy and User-generated Text at NAACL 2025

  8. arXiv:2502.10810  [pdf, other

    cs.CV

    SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding

    Authors: Zhenyu Yang, Yuhang Hu, Zemin Du, Dizhan Xue, Shengsheng Qian, Jiahong Wu, Fan Yang, Weiming Dong, Changsheng Xu

    Abstract: Despite the significant advancements of Large Vision-Language Models (LVLMs) on established benchmarks, there remains a notable gap in suitable evaluation regarding their applicability in the emerging domain of long-context streaming video understanding. Current benchmarks for video understanding typically emphasize isolated single-instance text inputs and fail to evaluate the capacity to sustain… ▽ More

    Submitted 15 February, 2025; originally announced February 2025.

    Comments: ICLR 2025 Accept (Spotlight)

  9. arXiv:2501.16237  [pdf, other

    cs.LG physics.ins-det

    Application of Structured State Space Models to High energy physics with locality-sensitive hashing

    Authors: Cheng Jiang, Sitian Qian

    Abstract: Modern high-energy physics (HEP) experiments are increasingly challenged by the vast size and complexity of their datasets, particularly regarding large-scale point cloud processing and long sequences. In this study, to address these challenges, we explore the application of structured state space models (SSMs), proposing one of the first trials to integrate local-sensitive hashing into either a h… ▽ More

    Submitted 27 January, 2025; originally announced January 2025.

    Comments: 6 figures, accepted by AISTATS 2025 as poster, camera ready versions to be updated

  10. arXiv:2501.04473  [pdf, other

    cs.CL

    When LLMs Struggle: Reference-less Translation Evaluation for Low-resource Languages

    Authors: Archchana Sindhujan, Diptesh Kanojia, Constantin Orasan, Shenbin Qian

    Abstract: This paper investigates the reference-less evaluation of machine translation for low-resource language pairs, known as quality estimation (QE). Segment-level QE is a challenging cross-lingual language understanding task that provides a quality score (0-100) to the translated output. We comprehensively evaluate large language models (LLMs) in zero/few-shot scenarios and perform instruction fine-tun… ▽ More

    Submitted 8 January, 2025; originally announced January 2025.

  11. arXiv:2412.13877  [pdf, other

    cs.RO cs.AI

    RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation

    Authors: Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, Shichao Fan, Xinhua Wang, Fei Liao, Zhen Zhao, Guangyu Li, Zhao Jin, Lecheng Wang, Jilei Mao, Ning Liu, Pei Ren, Qiang Zhang, Yaoxu Lyu, Mengzhen Liu, Jingyang He, Yulin Luo , et al. (12 additional authors not shown)

    Abstract: In this paper, we introduce RoboMIND (Multi-embodiment Intelligence Normative Data for Robot Manipulation), a dataset containing 107k demonstration trajectories across 479 diverse tasks involving 96 object classes. RoboMIND is collected through human teleoperation and encompasses comprehensive robotic-related information, including multi-view observations, proprioceptive robot state information, a… ▽ More

    Submitted 14 February, 2025; v1 submitted 18 December, 2024; originally announced December 2024.

  12. arXiv:2412.10892  [pdf, other

    cs.LG cs.AI

    Know Unreported Roadway Incidents in Real-time: Early Traffic Anomaly Detection

    Authors: Haocheng Duan, Hao Wu, Sean Qian

    Abstract: This research aims to know traffic anomalies as early as possible. A traffic anomaly refers to a generic incident on the road that influences traffic flow and calls for urgent traffic management measures. `Knowing'' the occurrence of a traffic anomaly is twofold: the ability to detect this anomaly before it is reported anywhere, or it may be such that an anomaly can be predicted before it actually… ▽ More

    Submitted 23 April, 2025; v1 submitted 14 December, 2024; originally announced December 2024.

  13. arXiv:2412.00851  [pdf, other

    cs.CV

    DynSUP: Dynamic Gaussian Splatting from An Unposed Image Pair

    Authors: Weihang Li, Weirong Chen, Shenhan Qian, Jiajie Chen, Daniel Cremers, Haoang Li

    Abstract: Recent advances in 3D Gaussian Splatting have shown promising results. Existing methods typically assume static scenes and/or multiple images with prior poses. Dynamics, sparse views, and unknown poses significantly increase the problem complexity due to insufficient geometric constraints. To overcome this challenge, we propose a method that can use only two images without prior poses to fit Gauss… ▽ More

    Submitted 1 December, 2024; originally announced December 2024.

  14. arXiv:2411.10478  [pdf, other

    cs.LG cs.AI

    Large Language Models for Constructing and Optimizing Machine Learning Workflows: A Survey

    Authors: Yang Gu, Hengyu You, Jian Cao, Muran Yu, Haoran Fan, Shiyou Qian

    Abstract: Building effective machine learning (ML) workflows to address complex tasks is a primary focus of the Automatic ML (AutoML) community and a critical step toward achieving artificial general intelligence (AGI). Recently, the integration of Large Language Models (LLMs) into ML workflows has shown great potential for automating and enhancing various stages of the ML pipeline. This survey provides a c… ▽ More

    Submitted 25 December, 2024; v1 submitted 11 November, 2024; originally announced November 2024.

  15. arXiv:2411.08561  [pdf, other

    cs.SE cs.AI

    LogLLM: Log-based Anomaly Detection Using Large Language Models

    Authors: Wei Guan, Jian Cao, Shiyou Qian, Jianqi Gao, Chun Ouyang

    Abstract: Software systems often record important runtime information in logs to help with troubleshooting. Log-based anomaly detection has become a key research area that aims to identify system issues through log data, ultimately enhancing the reliability of software systems. Traditional deep learning methods often struggle to capture the semantic information embedded in log data, which is typically organ… ▽ More

    Submitted 13 April, 2025; v1 submitted 13 November, 2024; originally announced November 2024.

  16. arXiv:2411.04799  [pdf, other

    cs.CL cs.AI

    Kwai-STaR: Transform LLMs into State-Transition Reasoners

    Authors: Xingyu Lu, Yuhang Hu, Changyi Liu, Tianke Zhang, Zhenyu Yang, Zhixiang Ding, Shengsheng Qian, Meng Du, Ruiwen Kang, Kaiyu Tang, Fan Yang, Tingting Gao, Di Zhang, Hai-Tao Zheng, Bin Wen

    Abstract: Mathematical reasoning presents a significant challenge to the cognitive capabilities of LLMs. Various methods have been proposed to enhance the mathematical ability of LLMs. However, few recognize the value of state transition for LLM reasoning. In this work, we define mathematical problem-solving as a process of transiting from an initial unsolved state to the final resolved state, and propose K… ▽ More

    Submitted 12 November, 2024; v1 submitted 7 November, 2024; originally announced November 2024.

    Comments: 6 pages, 2 figures

  17. arXiv:2410.18362  [pdf, other

    cs.SE cs.CL cs.CV

    WAFFLE: Multi-Modal Model for Automated Front-End Development

    Authors: Shanchao Liang, Nan Jiang, Shangshu Qian, Lin Tan

    Abstract: Web development involves turning UI designs into functional webpages, which can be difficult for both beginners and experienced developers due to the complexity of HTML's hierarchical structures and styles. While Large Language Models (LLMs) have shown promise in generating source code, two major challenges persist in UI-to-HTML code generation: (1) effectively representing HTML's hierarchical str… ▽ More

    Submitted 23 October, 2024; originally announced October 2024.

  18. arXiv:2410.10319  [pdf, other

    cs.CV cs.MM

    Spatial-Aware Efficient Projector for MLLMs via Multi-Layer Feature Aggregation

    Authors: Shun Qian, Bingquan Liu, Chengjie Sun, Zhen Xu, Baoxun Wang

    Abstract: The projector plays a crucial role in multi-modal language models (MLLMs). The number of visual tokens it outputs affects the efficiency of the MLLM, while the quality of the visual tokens influences the visual understanding capabilities of the MLLM. Current explorations on the projector focus on reducing the number of visual tokens to improve efficiency, often overlooking the inherent spatial dis… ▽ More

    Submitted 14 October, 2024; originally announced October 2024.

    Comments: 10 pages, 3 figures

  19. arXiv:2410.06338  [pdf, other

    cs.CL

    Are Large Language Models State-of-the-art Quality Estimators for Machine Translation of User-generated Content?

    Authors: Shenbin Qian, Constantin Orăsan, Diptesh Kanojia, Félix do Carmo

    Abstract: This paper investigates whether large language models (LLMs) are state-of-the-art quality estimators for machine translation of user-generated content (UGC) that contains emotional expressions, without the use of reference translations. To achieve this, we employ an existing emotion-related dataset with human-annotated errors and calculate quality evaluation scores based on the Multi-dimensional Q… ▽ More

    Submitted 8 October, 2024; originally announced October 2024.

  20. arXiv:2410.03278  [pdf, other

    cs.CL

    What do Large Language Models Need for Machine Translation Evaluation?

    Authors: Shenbin Qian, Archchana Sindhujan, Minnie Kabra, Diptesh Kanojia, Constantin Orăsan, Tharindu Ranasinghe, Frédéric Blain

    Abstract: Leveraging large language models (LLMs) for various natural language processing tasks has led to superlative claims about their performance. For the evaluation of machine translation (MT), existing research shows that LLMs are able to achieve results comparable to fine-tuned multilingual pre-trained language models. In this paper, we explore what translation information, such as the source, refere… ▽ More

    Submitted 9 October, 2024; v1 submitted 4 October, 2024; originally announced October 2024.

    Comments: Accepted to EMNLP 2024 Main Conference

  21. arXiv:2410.03277  [pdf, other

    cs.CL

    A Multi-task Learning Framework for Evaluating Machine Translation of Emotion-loaded User-generated Content

    Authors: Shenbin Qian, Constantin Orăsan, Diptesh Kanojia, Félix do Carmo

    Abstract: Machine translation (MT) of user-generated content (UGC) poses unique challenges, including handling slang, emotion, and literary devices like irony and sarcasm. Evaluating the quality of these translations is challenging as current metrics do not focus on these ubiquitous features of UGC. To address this issue, we utilize an existing emotion-related dataset that includes emotion labels and human-… ▽ More

    Submitted 4 October, 2024; originally announced October 2024.

  22. arXiv:2409.18996  [pdf, other

    cs.CL cs.AI cs.CV cs.LG cs.MM

    From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models

    Authors: Shengsheng Qian, Zuyi Zhou, Dizhan Xue, Bing Wang, Changsheng Xu

    Abstract: Cross-modal reasoning (CMR), the intricate process of synthesizing and drawing inferences across divergent sensory modalities, is increasingly recognized as a crucial capability in the progression toward more sophisticated and anthropomorphic artificial intelligence systems. Large Language Models (LLMs) represent a class of AI algorithms specifically engineered to parse, produce, and engage with h… ▽ More

    Submitted 18 September, 2024; originally announced September 2024.

    ACM Class: A.1

  23. Crafting Synthetic Realities: Examining Visual Realism and Misinformation Potential of Photorealistic AI-Generated Images

    Authors: Qiyao Peng, Yingdan Lu, Yilang Peng, Sijia Qian, Xinyi Liu, Cuihua Shen

    Abstract: Advances in generative models have created Artificial Intelligence-Generated Images (AIGIs) nearly indistinguishable from real photographs. Leveraging a large corpus of 30,824 AIGIs collected from Instagram and Twitter, and combining quantitative content analysis with qualitative analysis, this study unpacks AI photorealism of AIGIs from four key dimensions, content, human, aesthetic, and producti… ▽ More

    Submitted 14 March, 2025; v1 submitted 25 September, 2024; originally announced September 2024.

  24. arXiv:2409.16803  [pdf, other

    eess.AS cs.SD

    Incorporating Spatial Cues in Modular Speaker Diarization for Multi-channel Multi-party Meetings

    Authors: Ruoyu Wang, Shutong Niu, Gaobin Yang, Jun Du, Shuangqing Qian, Tian Gao, Jia Pan

    Abstract: Although fully end-to-end speaker diarization systems have made significant progress in recent years, modular systems often achieve superior results in real-world scenarios due to their greater adaptability and robustness. Historically, modular speaker diarization methods have seldom discussed how to leverage spatial cues from multi-channel speech. This paper proposes a three-stage modular system… ▽ More

    Submitted 25 September, 2024; originally announced September 2024.

    Comments: 5 pages, Submitted to ICASSP 2025

  25. arXiv:2409.03282  [pdf, other

    cs.LG eess.SP

    Interpretable mixture of experts for time series prediction under recurrent and non-recurrent conditions

    Authors: Zemian Ke, Haocheng Duan, Sean Qian

    Abstract: Non-recurrent conditions caused by incidents are different from recurrent conditions that follow periodic patterns. Existing traffic speed prediction studies are incident-agnostic and use one single model to learn all possible patterns from these drastically diverse conditions. This study proposes a novel Mixture of Experts (MoE) model to improve traffic speed prediction under two separate conditi… ▽ More

    Submitted 5 September, 2024; originally announced September 2024.

  26. arXiv:2409.02041  [pdf, other

    eess.AS cs.SD

    The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge

    Authors: Shutong Niu, Ruoyu Wang, Jun Du, Gaobin Yang, Yanhui Tu, Siyuan Wu, Shuangqing Qian, Huaxin Wu, Haitao Xu, Xueyang Zhang, Guolong Zhong, Xindi Yu, Jieru Chen, Mengzhi Wang, Di Cai, Tian Gao, Genshun Wan, Feng Ma, Jia Pan, Jianqing Gao

    Abstract: This technical report outlines our submission system for the CHiME-8 NOTSOFAR-1 Challenge. The primary difficulty of this challenge is the dataset recorded across various conference rooms, which captures real-world complexities such as high overlap rates, background noises, a variable number of speakers, and natural conversation styles. To address these issues, we optimized the system in several a… ▽ More

    Submitted 24 October, 2024; v1 submitted 3 September, 2024; originally announced September 2024.

  27. arXiv:2408.13945  [pdf, other

    eess.IV cs.CV physics.med-ph

    Personalized Topology-Informed Localization of Standard 12-Lead ECG Electrode Placement from Incomplete Cardiac MRIs for Efficient Cardiac Digital Twins

    Authors: Lei Li, Hannah Smith, Yilin Lyu, Julia Camps, Shuang Qian, Blanca Rodriguez, Abhirup Banerjee, Vicente Grau

    Abstract: Cardiac digital twins (CDTs) offer personalized in-silico cardiac representations for the inference of multi-scale properties tied to cardiac mechanisms. The creation of CDTs requires precise information about the electrode position on the torso, especially for the personalized electrocardiogram (ECG) calibration. However, current studies commonly rely on additional acquisition of torso imaging an… ▽ More

    Submitted 25 February, 2025; v1 submitted 25 August, 2024; originally announced August 2024.

  28. arXiv:2407.12248  [pdf, other

    cs.DC

    Mitigating Interference of Microservices with a Scoring Mechanism in Large-scale Clusters

    Authors: Dingyu Yang, Kangpeng Zheng, Shiyou Qian, Jian Cao, Guangtao Xue

    Abstract: Co-locating latency-critical services (LCSs) and best-effort jobs (BEJs) constitute the principal approach for enhancing resource utilization in production. Nevertheless, the co-location practice hurts the performance of LCSs due to resource competition, even when employing isolation technology. Through an extensive analysis of voluminous real trace data derived from two production clusters, we ob… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

    Journal ref: Journal of Supercomputing 2025

  29. arXiv:2407.07364  [pdf, other

    cs.LG cs.AI eess.SY

    Real-time system optimal traffic routing under uncertainties -- Can physics models boost reinforcement learning?

    Authors: Zemian Ke, Qiling Zou, Jiachao Liu, Sean Qian

    Abstract: System optimal traffic routing can mitigate congestion by assigning routes for a portion of vehicles so that the total travel time of all vehicles in the transportation system can be reduced. However, achieving real-time optimal routing poses challenges due to uncertain demands and unknown system dynamics, particularly in expansive transportation networks. While physics model-based methods are sen… ▽ More

    Submitted 10 July, 2024; originally announced July 2024.

  30. arXiv:2407.06192  [pdf, other

    cs.CV cs.AI cs.CL

    Multi-Object Hallucination in Vision-Language Models

    Authors: Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David F. Fouhey, Joyce Chai

    Abstract: Large vision language models (LVLMs) often suffer from object hallucination, producing objects not present in the given images. While current benchmarks for object hallucination primarily concentrate on the presence of a single object class rather than individual entities, this work systematically investigates multi-object hallucination, examining how models misperceive (e.g., invent nonexistent o… ▽ More

    Submitted 31 October, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

    Comments: Accepted to NeurIPS 2024 | Project page: https://multi-object-hallucination.github.io/

  31. arXiv:2406.18158  [pdf, other

    cs.RO cs.CV

    3D-MVP: 3D Multiview Pretraining for Robotic Manipulation

    Authors: Shengyi Qian, Kaichun Mo, Valts Blukis, David F. Fouhey, Dieter Fox, Ankit Goyal

    Abstract: Recent works have shown that visual pretraining on egocentric datasets using masked autoencoders (MAE) can improve generalization for downstream robotics tasks. However, these approaches pretrain only on 2D images, while many robotics applications require 3D scene understanding. In this work, we propose 3D-MVP, a novel approach for 3D Multi-View Pretraining using masked autoencoders. We leverage R… ▽ More

    Submitted 23 March, 2025; v1 submitted 26 June, 2024; originally announced June 2024.

    Comments: CVPR 2025

  32. arXiv:2406.17777  [pdf, other

    cs.CV

    Text-Animator: Controllable Visual Text Video Generation

    Authors: Lin Liu, Quande Liu, Shengju Qian, Yuan Zhou, Wengang Zhou, Houqiang Li, Lingxi Xie, Qi Tian

    Abstract: Video generation is a challenging yet pivotal task in various industries, such as gaming, e-commerce, and advertising. One significant unresolved aspect within T2V is the effective visualization of text within generated videos. Despite the progress achieved in Text-to-Video~(T2V) generation, current methods still cannot effectively visualize texts in videos directly, as they mainly focus on summar… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Project Page: https://laulampaul.github.io/text-animator.html

  33. arXiv:2406.16321  [pdf, other

    cs.LG cs.AI

    Mosaic of Modalities: A Comprehensive Benchmark for Multimodal Graph Learning

    Authors: Jing Zhu, Yuhang Zhou, Shengyi Qian, Zhongmou He, Tong Zhao, Neil Shah, Danai Koutra

    Abstract: Graph machine learning has made significant strides in recent years, yet the integration of visual information with graph structure and its potential for improving performance in downstream tasks remains an underexplored area. To address this critical gap, we introduce the Multimodal Graph Benchmark (MM-GRAPH), a pioneering benchmark that incorporates both visual and textual information into graph… ▽ More

    Submitted 30 March, 2025; v1 submitted 24 June, 2024; originally announced June 2024.

    Comments: CVPR 2025

  34. arXiv:2406.15781  [pdf, other

    cs.CL

    DABL: Detecting Semantic Anomalies in Business Processes Using Large Language Models

    Authors: Wei Guan, Jian Cao, Jianqi Gao, Haiyan Zhao, Shiyou Qian

    Abstract: Detecting anomalies in business processes is crucial for ensuring operational success. While many existing methods rely on statistical frequency to detect anomalies, it's important to note that infrequent behavior doesn't necessarily imply undesirability. To address this challenge, detecting anomalies from a semantic viewpoint proves to be a more effective approach. However, current semantic anoma… ▽ More

    Submitted 22 June, 2024; originally announced June 2024.

  35. arXiv:2406.15769  [pdf, other

    cs.DC

    Humas: A Heterogeneity- and Upgrade-aware Microservice Auto-scaling Framework in Large-scale Data Centers

    Authors: Qin Hua, Dingyu Yang, Shiyou Qian, Jian Cao, Guangtao Xue, Minglu Li

    Abstract: An effective auto-scaling framework is essential for microservices to ensure performance stability and resource efficiency under dynamic workloads. As revealed by many prior studies, the key to efficient auto-scaling lies in accurately learning performance patterns, i.e., the relationship between performance metrics and workloads in data-driven schemes. However, we notice that there are two signif… ▽ More

    Submitted 22 June, 2024; originally announced June 2024.

    Comments: 14 pages; 27 figures

  36. arXiv:2406.05132  [pdf, other

    cs.CV cs.AI cs.CL cs.LG cs.RO

    3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination

    Authors: Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F. Fouhey, Joyce Chai

    Abstract: The integration of language and 3D perception is crucial for embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is a lack of large-scale datasets with dense gr… ▽ More

    Submitted 20 March, 2025; v1 submitted 7 June, 2024; originally announced June 2024.

    Comments: CVPR 2025. Project website: https://3d-grand.github.io

  37. arXiv:2406.04640  [pdf, other

    cs.LG

    LinkGPT: Teaching Large Language Models To Predict Missing Links

    Authors: Zhongmou He, Jing Zhu, Shengyi Qian, Joyce Chai, Danai Koutra

    Abstract: Large Language Models (LLMs) have shown promising results on various language and vision tasks. Recently, there has been growing interest in applying LLMs to graph-based tasks, particularly on Text-Attributed Graphs (TAGs). However, most studies have focused on node classification, while the use of LLMs for link prediction (LP) remains understudied. In this work, we propose a new task on LLMs, whe… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

  38. arXiv:2406.03007  [pdf, other

    cs.CL cs.AI cs.CR cs.LG

    BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents

    Authors: Yifei Wang, Dizhan Xue, Shengjie Zhang, Shengsheng Qian

    Abstract: With the prosperity of large language models (LLMs), powerful LLM-based intelligent agents have been developed to provide customized services with a set of user-defined tools. State-of-the-art methods for constructing LLM agents adopt trained LLMs and further fine-tune them on data for the agent task. However, we show that such methods are vulnerable to our proposed backdoor attacks named BadAgent… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Accepted by ACL 2024

  39. arXiv:2404.19026  [pdf, other

    cs.CV

    MeGA: Hybrid Mesh-Gaussian Head Avatar for High-Fidelity Rendering and Head Editing

    Authors: Cong Wang, Di Kang, He-Yi Sun, Shen-Han Qian, Zi-Xuan Wang, Linchao Bao, Song-Hai Zhang

    Abstract: Creating high-fidelity head avatars from multi-view videos is a core issue for many AR/VR applications. However, existing methods usually struggle to obtain high-quality renderings for all different head components simultaneously since they use one single representation to model components with drastically different characteristics (e.g., skin vs. hair). In this paper, we propose a Hybrid Mesh-Gau… ▽ More

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: Project page: https://conallwang.github.io/MeGA_Pages/

  40. arXiv:2404.18219  [pdf, other

    physics.ins-det cs.LG hep-ex hep-ph physics.data-an

    BUFF: Boosted Decision Tree based Ultra-Fast Flow matching

    Authors: Cheng Jiang, Sitian Qian, Huilin Qu

    Abstract: Tabular data stands out as one of the most frequently encountered types in high energy physics. Unlike commonly homogeneous data such as pixelated images, simulating high-dimensional tabular data and accurately capturing their correlations are often quite challenging, even with the most advanced architectures. Based on the findings that tree-based models surpass the performance of deep learning mo… ▽ More

    Submitted 28 April, 2024; originally announced April 2024.

    Comments: 9 pages, 10 figures, 1 additional figure in appendix

  41. arXiv:2404.15275  [pdf, other

    cs.CV

    ID-Animator: Zero-Shot Identity-Preserving Human Video Generation

    Authors: Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, Jie Zhang

    Abstract: Generating high-fidelity human video with specified identities has attracted significant attention in the content generation community. However, existing techniques struggle to strike a balance between training efficiency and identity preservation, either requiring tedious case-by-case fine-tuning or usually missing identity details in the video generation process. In this study, we present \textb… ▽ More

    Submitted 25 June, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

    Comments: Project Page: https://id-animator.github.io/

  42. arXiv:2404.02445  [pdf, other

    cs.DC

    MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms

    Authors: Jiaang Duan, Shiyou Qian, Dingyu Yang, Hanwen Hu, Jian Cao, Guangtao Xue

    Abstract: With its elastic power and a pay-as-you-go cost model, the deployment of deep learning inference services (DLISs) on serverless platforms is emerging as a prevalent trend. However, the varying resource requirements of different layers in DL models hinder resource utilization and increase costs, when DLISs are deployed as a single function on serverless platforms. To tackle this problem, we propose… ▽ More

    Submitted 3 April, 2024; originally announced April 2024.

  43. arXiv:2403.20041  [pdf

    cs.CL

    Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

    Authors: Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie

    Abstract: The Large Language Model (LLM) is widely employed for tasks such as intelligent assistants, text summarization, translation, and multi-modality on mobile phones. However, the current methods for on-device LLM deployment maintain slow inference speed, which causes poor user experience. To facilitate high-efficiency LLM deployment on device GPUs, we propose four optimization techniques: (a) a symbol… ▽ More

    Submitted 5 July, 2024; v1 submitted 29 March, 2024; originally announced March 2024.

    Comments: 21 pages, 6 figures, fix "E0M4" spell mistake, fix FLOPS to TFLOPS

  44. arXiv:2403.19622  [pdf, other

    cs.RO cs.CV

    RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents

    Authors: Zeren Chen, Zhelun Shi, Xiaoya Lu, Lehan He, Sucheng Qian, Zhenfei Yin, Wanli Ouyang, Jing Shao, Yu Qiao, Cewu Lu, Lu Sheng

    Abstract: Achieving generalizability in solving out-of-distribution tasks is one of the ultimate goals of learning robotic manipulation. Recent progress of Vision-Language Models (VLMs) has shown that VLM-based task planners can alleviate the difficulty of solving novel tasks, by decomposing the compounded tasks as a plan of sequentially executing primitive-level skills that have been already mastered. It i… ▽ More

    Submitted 1 February, 2025; v1 submitted 28 March, 2024; originally announced March 2024.

    Comments: 18 pages, 11 figures, 7 tables. Accepted by NeurIPS 2024 Workshop

  45. arXiv:2403.16999  [pdf, other

    cs.CV

    Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

    Authors: Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, Hongsheng Li

    Abstract: Multi-Modal Large Language Models (MLLMs) have demonstrated impressive performance in various VQA tasks. However, they often lack interpretability and struggle with complex visual inputs, especially when the resolution of the input image is high or when the interested region that could provide key information for answering the question is small. To address these challenges, we collect and introduc… ▽ More

    Submitted 4 November, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

    Comments: Project Page: https://hao-shao.com/projects/viscot.html

  46. arXiv:2403.12580  [pdf, other

    cs.CV

    Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection

    Authors: Chengjie Wang, Wenbing Zhu, Bin-Bin Gao, Zhenye Gan, Jianning Zhang, Zhihao Gu, Shuguang Qian, Mingang Chen, Lizhuang Ma

    Abstract: Industrial anomaly detection (IAD) has garnered significant attention and experienced rapid development. However, the recent development of IAD approach has encountered certain difficulties due to dataset limitations. On the one hand, most of the state-of-the-art methods have achieved saturation (over 99% in AUROC) on mainstream datasets such as MVTec, and the differences of methods cannot be well… ▽ More

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: It is accepted by CVPR2024

  47. arXiv:2401.17095  [pdf, other

    cs.LG cs.AI

    Traffic estimation in unobserved network locations using data-driven macroscopic models

    Authors: Pablo Guarda, Sean Qian

    Abstract: This paper leverages macroscopic models and multi-source spatiotemporal data collected from automatic traffic counters and probe vehicles to accurately estimate traffic flow and travel time in links where these measurements are unavailable. This problem is critical in transportation planning applications where the sensor coverage is low and the planned interventions have network-wide impacts. The… ▽ More

    Submitted 30 January, 2024; originally announced January 2024.

    Comments: 34 pages, 28 figures, 6 tables

  48. arXiv:2401.06341  [pdf, other

    cs.CV cs.RO

    AffordanceLLM: Grounding Affordance from Vision Language Models

    Authors: Shengyi Qian, Weifeng Chen, Min Bai, Xiong Zhou, Zhuowen Tu, Li Erran Li

    Abstract: Affordance grounding refers to the task of finding the area of an object with which one can interact. It is a fundamental but challenging task, as a successful solution requires the comprehensive understanding of a scene in multiple aspects including detection, localization, and recognition of objects with their parts, of geo-spatial configuration/layout of the scene, of 3D shapes and physics, as… ▽ More

    Submitted 17 April, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

  49. arXiv:2312.07955  [pdf, other

    cs.CV cs.AI cs.CR cs.LG

    Erasing Self-Supervised Learning Backdoor by Cluster Activation Masking

    Authors: Shengsheng Qian, Dizhan Xue, Yifei Wang, Shengjie Zhang, Huaiwen Zhang, Changsheng Xu

    Abstract: Self-Supervised Learning (SSL) is an effective paradigm for learning representations from unlabeled data, such as text, images, and videos. However, researchers have recently found that SSL is vulnerable to backdoor attacks. The attacker can embed hidden SSL backdoors via a few poisoned examples in the training dataset and maliciously manipulate the behavior of downstream models. To defend against… ▽ More

    Submitted 1 November, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

  50. arXiv:2312.04302  [pdf, other

    cs.CV cs.CL

    Prompt Highlighter: Interactive Control for Multi-Modal LLMs

    Authors: Yuechen Zhang, Shengju Qian, Bohao Peng, Shu Liu, Jiaya Jia

    Abstract: This study targets a critical aspect of multi-modal LLMs' (LLMs&VLMs) inference: explicit controllable text generation. Multi-modal LLMs empower multi-modality understanding with the capability of semantic generation yet bring less explainability and heavier reliance on prompt contents due to their autoregressive generative nature. While manipulating prompt formats could improve outputs, designing… ▽ More

    Submitted 20 March, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

    Comments: CVPR 2024; Project Page: https://julianjuaner.github.io/projects/PromptHighlighter