Multimedia

Authors and titles for June 2025

Total of 153 entries

[1] arXiv:2506.00868 [pdf, html, other]: Title: Multiverse Through Deepfakes: The MultiFakeVerse Dataset of Person-Centric Visual and Conceptual Manipulations

Parul Gupta, Shreya Ghosh, Tom Gedeon, Thanh-Toan Do, Abhinav Dhall

Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
[2] arXiv:2506.01211 [pdf, html, other]: Title: Iola Walker: A Mobile Footfall Detection System for Music Composition

Will James

Subjects: Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[3] arXiv:2506.01668 [pdf, html, other]: Title: Small Stickers, Big Meanings: A Multilingual Sticker Semantic Understanding Dataset with a Gamified Approach

Heng Er Metilda Chee, Jiayin Wang, Zhiqiang Guo, Weizhi Ma, Min Zhang

Subjects: Multimedia (cs.MM); Information Retrieval (cs.IR)
[4] arXiv:2506.02380 [pdf, html, other]: Title: EyeNavGS: A 6-DoF Navigation Dataset and Record-n-Replay Software for Real-World 3DGS Scenes in VR

Zihao Ding, Cheng-Tse Lee, Mufeng Zhu, Tao Guan, Yuan-Chun Sun, Cheng-Hsin Hsu, Yao Liu

Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
[5] arXiv:2506.02414 [pdf, html, other]: Title: StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion

Fengjin Li, Jie Wang, Yadong Niu, Yongqing Wang, Meng Meng, Jian Luan, Zhiyong Wu

Comments: 5 pages, 2 figures, Accepted by Interspeech 2025, Demo: this https URL

Subjects: Multimedia (cs.MM); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[6] arXiv:2506.02997 [pdf, html, other]: Title: Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation

Yongqi Wang, Chunlei Zhang, Hangting Chen, Zhou Zhao, Dong Yu

Subjects: Multimedia (cs.MM)
[7] arXiv:2506.03530 [pdf, other]: Title: How Far Are We from Predicting Missing Modalities with Foundation Models?

Guanzhou Ke, Yi Xie, Xiaoli Wang, Guoqing Chao, Bo Wang, Shengfeng He

Subjects: Multimedia (cs.MM); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
[8] arXiv:2506.05851 [pdf, html, other]: Title: DeepFake Doctor: Diagnosing and Treating Audio-Video Fake Detection

Marcel Klemt, Carlotta Segna, Anna Rohrbach

Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[9] arXiv:2506.05987 [pdf, html, other]: Title: The JPEG XL Image Coding System: History, Features, Coding Tools, Design Rationale, and Future

Jon Sneyers, Jyrki Alakuijala, Luca Versari, Zoltán Szabadka, Sami Boukortt, Amnon Cohen-Tidhar, Moritz Firsching, Evgenii Kliuchnikov, Tal Lev-Ami, Eric Portis, Thomas Richter, Osamu Watanabe

Comments: 73 pages, 62 figures

Subjects: Multimedia (cs.MM)
[10] arXiv:2506.06018 [pdf, html, other]: Title: Optimization-Free Universal Watermark Forgery with Regenerative Diffusion Models

Chaoyi Zhu, Zaitang Li, Renyi Yang, Robert Birke, Pin-Yu Chen, Tsung-Yi Ho, Lydia Y. Chen

Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
[11] arXiv:2506.06037 [pdf, html, other]: Title: SVD: Spatial Video Dataset

M. H. Izadimehr, Milad Ghanbari, Guodong Chen, Wei Zhou, Xiaoshuai Hao, Mallesham Dasari, Christian Timmerer, Hadi Amirpour

Subjects: Multimedia (cs.MM)
[12] arXiv:2506.06691 [pdf, html, other]: Title: An Efficient Digital Watermarking Technique for Small Scale devices

Kaushik Talathi, Aparna Santra Biswas

Comments: 28 pages, 11 figures, 4 tables

Subjects: Multimedia (cs.MM); Cryptography and Security (cs.CR)
[13] arXiv:2506.06743 [pdf, html, other]: Title: The State-of-the-Art in Lifelog Retrieval: A Review of Progress at the ACM Lifelog Search Challenge Workshop 2022-24

Allie Tran, Werner Bailer, Duc-Tien Dang-Nguyen, Graham Healy, Steve Hodges, Björn Þór Jónsson, Luca Rossetto, Klaus Schoeffmann, Minh-Triet Tran, Lucia Vadicamo, Cathal Gurrin

Subjects: Multimedia (cs.MM); Information Retrieval (cs.IR)
[14] arXiv:2506.06938 [pdf, other]: Title: Experimental Evaluation of Static Image Sub-Region-Based Search Models Using CLIP

Bastian Jäckl, Vojtěch Kloda, Daniel A. Keim, Jakub Lokoč

Comments: 14 pages, 4 figures, 2 tables

Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
[15] arXiv:2506.07076 [pdf, html, other]: Title: Harmony-Aware Music-driven Motion Synthesis with Perceptual Constraint on UGC Datasets

Xinyi Wu, Haohong Wang, Aggelos K. Katsaggelos

Subjects: Multimedia (cs.MM)
[16] arXiv:2506.09506 [pdf, html, other]: Title: Dynamic Sub-region Search in Homogeneous Collections Using CLIP

Bastian Jäckl, Vojtěch Kloda, Daniel A. Keim, Jakub Lokoč

Comments: 18 pages, 4 figures, 5 tables

Subjects: Multimedia (cs.MM)
[17] arXiv:2506.09795 [pdf, html, other]: Title: Learning Quality from Complexity and Structure: A Feature-Fused XGBoost Model for Video Quality Assessment

Amritha Premkumar, Prajit T Rajendran, Vignesh V Menon

Comments: ICME 2025

Subjects: Multimedia (cs.MM)
[18] arXiv:2506.10001 [pdf, html, other]: Title: Semantic Communication-Enabled Cloud-Edge-End-collaborative Metaverse Services Architecture

Yuxuan Li, Sheng Jinag, Bizhu Wang

Comments: arXiv admin note: text overlap with arXiv:2407.13764 by other authors

Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
[19] arXiv:2506.10002 [pdf, html, other]: Title: EQ-TAA: Equivariant Traffic Accident Anticipation via Diffusion-Based Accident Video Synthesis

Jianwu Fang, Lei-Lei Li, Zhedong Zheng, Hongkai Yu, Jianru Xue, Zhengguo Li, Tat-Seng Chua

Comments: Accepted by IEEE-TMM

Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[20] arXiv:2506.10003 [pdf, html, other]: Title: Integrating multimedia documents in 3D city models for a better understanding of territories

C. Gautier, J. Delanoy, G. Gesquière

Comments: 8 pages, 11 figures

Journal-ref: sprs-annals-X-4-W2-2022-69-2022

Subjects: Multimedia (cs.MM); Human-Computer Interaction (cs.HC)
[21] arXiv:2506.10004 [pdf, other]: Title: Immersive Multimedia Communication: State-of-the-Art on eXtended Reality Streaming

Haopeng Wang, Haiwei Dong, Abdulmotaleb El Saddik

Comments: accepted by ACM Transactions on Multimedia Computing, Communications, and Applications

Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Networking and Internet Architecture (cs.NI)
[22] arXiv:2506.10006 [pdf, other]: Title: HER2 Expression Prediction with Flexible Multi-Modal Inputs via Dynamic Bidirectional Reconstruction

Jie Qin, Wei Yang, Yan Su, Yiran Zhu, Weizhen Li, Yunyue Pan, Chengchang Pan, Honggang Qi

Comments: 7 pages, 5 figures, 3 tables, submitted to the 33rd ACM International Conference on Multimedia (ACM MM 2025)

Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[23] arXiv:2506.10007 [pdf, html, other]: Title: Controllable Expressive 3D Facial Animation via Diffusion in a Unified Multimodal Space

Kangwei Liu, Junwu Liu, Xiaowei Yi, Jinlin Guo, Yun Cao

Comments: Accepted by ICME 2025

Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[24] arXiv:2506.10008 [pdf, html, other]: Title: Structured Graph Representations for Visual Narrative Reasoning: A Hierarchical Framework for Comics

Yi-Chun Chen

Comments: This paper has been submitted to ACM Multimedia 2025 and is currently under review

Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[25] arXiv:2506.10010 [pdf, other]: Title: Multimodal Emotion Coupling via Speech-to-Facial and Bodily Gestures in Dyadic Interaction

Von Ralph Dane Marquez Herbuela, Yukie Nagai

Subjects: Multimedia (cs.MM); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[26] arXiv:2506.10011 [pdf, html, other]: Title: WDMIR: Wavelet-Driven Multimodal Intent Recognition

Weiyin Gong, Kai Zhang, Yanghai Zhang, Qi Liu, Xinjie Sun, Junyu Lu, Linbo Zhu

Comments: Accepted at IJCAI 2025, 9 pages, 6 figures

Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
[27] arXiv:2506.10012 [pdf, other]: Title: Thief of Truth: VR comics about the relationship between AI and humans

Joonhyung Bae

Subjects: Multimedia (cs.MM); Human-Computer Interaction (cs.HC)
[28] arXiv:2506.10013 [pdf, html, other]: Title: Immersive Fantasy Based on Digital Nostalgia: Environmental Narratives for the Korean Millennials and Gen Z

Yerin Doh, Joonhyung Bae

Comments: Accepted at ISEA 2025 (International Symposium on Electronic Art)

Subjects: Multimedia (cs.MM); Computers and Society (cs.CY)
[29] arXiv:2506.10016 [pdf, other]: Title: A Survey of Generative Categories and Techniques in Multimodal Large Language Models

Longzhen Han, Awes Mubarak, Almas Baimagambetov, Nikolaos Polatidis, Thar Baker

Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[30] arXiv:2506.10416 [pdf, html, other]: Title: Can Sound Replace Vision in LLaVA With Token Substitution?

Ali Vosoughi, Jing Bi, Pinxin Liu, Yunlong Tang, Chenliang Xu

Comments: 29 pages including references and appendices

Subjects: Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[31] arXiv:2506.14803 [pdf, html, other]: Title: Omnidirectional Video Super-Resolution using Deep Learning

Arbind Agrahari Baniya, Tsz-Kwan Lee, Peter W. Eklund, Sunil Aryal

Journal-ref: in IEEE Transactions on Multimedia, vol. 26, pp. 540-554, 2024

Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[32] arXiv:2506.16258 [pdf, html, other]: Title: ViFusion: In-Network Tensor Fusion for Scalable Video Feature Indexing

Yisu Wang, Yixiang Zhu, Xinjiao Li, Yulong Zhang, Ruilong Wu, Dirk Kutscher

Subjects: Multimedia (cs.MM)
[33] arXiv:2506.16495 [pdf, html, other]: Title: DT-UFC: Universal Large Model Feature Coding via Peaky-to-Balanced Distribution Transformation

Changsheng Gao, Zijie Liu, Li Li, Dong Liu, Xiaoyan Sun, Weisi Lin

Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
[34] arXiv:2506.17623 [pdf, html, other]: Title: Can Generated Images Serve as a Viable Modality for Text-Centric Multimodal Learning?

Yuesheng Huang, Peng Zhang, Riliang Liu, Jiaqi Liang

Comments: 4 figures, 7 tables

Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
[35] arXiv:2506.18055 [pdf, html, other]: Title: Face-Voice Association for Audiovisual Active Speaker Detection in Egocentric Recordings

Jason Clarke, Yoshihiko Gotoh, Stefan Goetze

Comments: Accepted to EUSIPCO 2025. 5 pages, 1 figure. To appear in the Proceedings of the 33rd European Signal Processing Conference (EUSIPCO), September 8-12, 2025, Palermo, Italy

Subjects: Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[36] arXiv:2506.19769 [pdf, html, other]: Title: A Survey of Multi-sensor Fusion Perception for Embodied AI: Background, Methods, Challenges and Prospects

Shulan Ruan, Rongwei Wang, Xuchen Shen, Huijie Liu, Baihui Xiao, Jun Shi, Kun Zhang, Zhenya Huang, Yu Liu, Enhong Chen, You He

Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
[37] arXiv:2506.20944 [pdf, html, other]: Title: E-FreeM2: Efficient Training-Free Multi-Scale and Cross-Modal News Verification via MLLMs

Van-Hoang Phan, Long-Khanh Pham, Dang Vu, Anh-Duy Tran, Minh-Son Dao

Comments: Accepted to AsiaCCS 2025 @ SCID

Subjects: Multimedia (cs.MM); Cryptography and Security (cs.CR)
[38] arXiv:2506.21865 [pdf, html, other]: Title: RiverEcho: Real-Time Interactive Digital System for Ancient Yellow River Culture

Haofeng Wang, Yilin Guo, Zehao Li, Tong Yue, Yizong Wang, Enci Zhang, Rongqun Lin, Feng Gao, Shiqi Wang, Siwei Ma

Comments: IEEE International Conference on Multimedia and Expo Workshop, 2025 (Accepted)

Subjects: Multimedia (cs.MM); Computation and Language (cs.CL)
[39] arXiv:2506.23484 [pdf, html, other]: Title: TAG-WM: Tamper-Aware Generative Image Watermarking via Diffusion Inversion Sensitivity

Yuzhuo Chen, Zehua Ma, Han Fang, Weiming Zhang, Nenghai Yu

Comments: Accepted by ICCV 2025 (2025 IEEE/CVF International Conference on Computer Vision)

Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
[40] arXiv:2506.23707 [pdf, html, other]: Title: Efficient and Accurate Image Provenance Analysis: A Scalable Pipeline for Large-scale Images

Jiewei Lai, Lan Zhang, Chen Tang, Pengcheng Sun

Comments: 25 pages, 6 figures

Subjects: Multimedia (cs.MM)
[41] arXiv:2506.00562 (cross-list from cs.CV) [pdf, html, other]: Title: SEED: A Benchmark Dataset for Sequential Facial Attribute Editing with Diffusion Models

Yule Zhu, Ping Liu, Zhedong Zheng, Wei Liu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[42] arXiv:2506.00667 (cross-list from cs.CV) [pdf, html, other]: Title: Scene Detection Policies and Keyframe Extraction Strategies for Large-Scale Video Analysis

Vasilii Korolkov

Comments: 24 pages, 8 figures, submitted as a preprint. ArXiv preprint only, not submitted to a journal yet

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[43] arXiv:2506.00854 (cross-list from cs.CL) [pdf, html, other]: Title: EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG

Jacky Tai-Yu Lu, Jung Chiang, Chi-Sheng Chen, Anna Nai-Yun Tung, Hsiang Wei Hu, Yuan Chiao Cheng

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Neurons and Cognition (q-bio.NC)
[44] arXiv:2506.00974 (cross-list from cs.CV) [pdf, html, other]: Title: Camera Trajectory Generation: A Comprehensive Survey of Methods, Metrics, and Future Directions

Zahra Dehghanian, Pouya Ardekhani, Amir Vahedi, Hamid Beigy, Hamid R. Rabiee

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[45] arXiv:2506.01109 (cross-list from cs.CV) [pdf, html, other]: Title: CountingFruit: Real-Time 3D Fruit Counting with Language-Guided Semantic Gaussian Splatting

Fengze Li, Yangle Liu, Jieming Ma, Hai-Ning Liang, Yaochun Shen, Huangxiang Li, Zhijing Wu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[46] arXiv:2506.01319 (cross-list from cs.SD) [pdf, html, other]: Title: Learning Sparsity for Effective and Efficient Music Performance Question Answering

Xingjian Diao, Tianzhen Yang, Chunhui Zhang, Weiyi Wu, Ming Cheng, Jiang Gui

Comments: Accepted to the main conference of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)

Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[47] arXiv:2506.01478 (cross-list from cs.LG) [pdf, html, other]: Title: MUDI: A Multimodal Biomedical Dataset for Understanding Pharmacodynamic Drug-Drug Interactions

Tung-Lam Ngo, Ba-Hoang Tran, Duy-Cat Can, Trung-Hieu Do, Oliver Y. Chén, Hoang-Quynh Le

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Multimedia (cs.MM); Quantitative Methods (q-bio.QM)
[48] arXiv:2506.01482 (cross-list from cs.LG) [pdf, html, other]: Title: Automatic Stage Lighting Control: Is it a Rule-Driven Process or Generative Task?

Zijian Zhao, Dian Jin, Zijing Zhou, Xiaoyu Zhang

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[49] arXiv:2506.01822 (cross-list from cs.CV) [pdf, html, other]: Title: GSCodec Studio: A Modular Framework for Gaussian Splat Compression

Sicheng Li, Chengzhen Wu, Hao Li, Xiang Gao, Yiyi Liao, Lu Yu

Comments: Repository of the project: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[50] arXiv:2506.01850 (cross-list from cs.CV) [pdf, html, other]: Title: MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs

Wayner Barrios, Andrés Villa, Juan León Alcázar, SouYoung Jin, Bernard Ghanem

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
[51] arXiv:2506.02083 (cross-list from cs.SD) [pdf, html, other]: Title: LASPA: Language Agnostic Speaker Disentanglement with Prefix-Tuned Cross-Attention

Aditya Srinivas Menon, Raj Prakash Gohil, Kumud Tripathi, Pankaj Wasnik

Comments: Accepted at Interspeech 2025, Netherlands

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
[52] arXiv:2506.02401 (cross-list from cs.SD) [pdf, html, other]: Title: Trusted Fake Audio Detection Based on Dirichlet Distribution

Chi Ding, Junxiao Xue, Cong Wang, Hao Zhou

Subjects: Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[53] arXiv:2506.02574 (cross-list from eess.IV) [pdf, html, other]: Title: Dynamic mapping from static labels: remote sensing dynamic sample generation with temporal-spectral embedding

Shuai Yuan, Shuang Chen, Tianwu Lin, Jie Wang, Peng Gong

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[54] arXiv:2506.03144 (cross-list from cs.CV) [pdf, html, other]: Title: MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query

Wei Chow, Yuan Gao, Linfeng Li, Xian Wang, Qi Xu, Hang Song, Lingdong Kong, Ran Zhou, Yi Zeng, Yidong Cai, Botian Jiang, Shilin Xu, Jiajun Zhang, Minghui Qiu, Xiangtai Li, Tianshu Yang, Siliang Tang, Juncheng Li

Comments: Preprint; Project Page, Code, and Dataset at: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
[55] arXiv:2506.03150 (cross-list from cs.CV) [pdf, html, other]: Title: IllumiCraft: Unified Geometry and Illumination Diffusion for Controllable Video Generation

Yuanze Lin, Yi-Wen Chen, Yi-Hsuan Tsai, Ronald Clark, Ming-Hsuan Yang

Comments: Tech Report

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
[56] arXiv:2506.03364 (cross-list from eess.AS) [pdf, html, other]: Title: Towards Source Attribution of Singing Voice Deepfake with Multimodal Foundation Models

Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Priyabrata Mallick, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma

Comments: Accepted to INTERSPEECH 2025

Subjects: Audio and Speech Processing (eess.AS); Multimedia (cs.MM); Sound (cs.SD)
[57] arXiv:2506.03378 (cross-list from eess.AS) [pdf, html, other]: Title: SNIFR: Boosting Fine-Grained Child Harmful Content Detection Through Audio-Visual Alignment with Cascaded Cross-Transformer

Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Abu Osama Siddiqui, Sarthak Jain, Priyabrata Mallick, Jaya Sai Kiran Patibandla, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma

Comments: Accepted to INTERSPEECH 2025

Subjects: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[58] arXiv:2506.03594 (cross-list from cs.GR) [pdf, html, other]: Title: SplArt: Articulation Estimation and Part-Level Reconstruction with 3D Gaussian Splatting

Shengjie Lin, Jiading Fang, Muhammad Zubair Irshad, Vitor Campagnolo Guizilini, Rares Andrei Ambrus, Greg Shakhnarovich, Matthew R. Walter

Comments: this https URL

Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Robotics (cs.RO)
[59] arXiv:2506.03831 (cross-list from cs.SD) [pdf, html, other]: Title: Conformer-based Ultrasound-to-Speech Conversion

Ibrahim Ibrahimov, Zainkó Csaba, Gábor Gosztolya

Comments: accepted to Interspeech 2025

Subjects: Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[60] arXiv:2506.04070 (cross-list from cs.CL) [pdf, html, other]: Title: LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward

Yi Zhao, Siqi Wang, Jing Li

Subjects: Computation and Language (cs.CL); Multimedia (cs.MM)
[61] arXiv:2506.04214 (cross-list from cs.CV) [pdf, html, other]: Title: Sounding that Object: Interactive Object-Aware Image to Audio Generation

Tingle Li, Baihe Huang, Xiaobin Zhuang, Dongya Jia, Jiawei Chen, Yuping Wang, Zhuo Chen, Gopala Anumanchipalli, Yuxuan Wang

Comments: ICML 2025

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[62] arXiv:2506.04444 (cross-list from cs.CV) [pdf, html, other]: Title: Photoreal Scene Reconstruction from an Egocentric Device

Zhaoyang Lv, Maurizio Monge, Ka Chen, Yufeng Zhu, Michael Goesele, Jakob Engel, Zhao Dong, Richard Newcombe

Comments: Paper accepted to SIGGRAPH Conference Paper 2025

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
[63] arXiv:2506.04555 (cross-list from cs.CV) [pdf, html, other]: Title: Enhancing Frequency for Single Image Super-Resolution with Learnable Separable Kernels

Heng Tian

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[64] arXiv:2506.04755 (cross-list from cs.CV) [pdf, html, other]: Title: Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning

Shenshen Li, Kaiyuan Deng, Lei Wang, Hao Yang, Chong Peng, Peng Yan, Fumin Shen, Heng Tao Shen, Xing Xu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[65] arXiv:2506.04858 (cross-list from cs.HC) [pdf, html, other]: Title: Beyond the Desktop: XR-Driven Segmentation with Meta Quest 3 and MX Ink

Lisle Faray de Paiva, Gijs Luijten, Ana Sofia Ferreira Santos, Moon Kim, Behrus Puladi, Jens Kleesiek, Jan Egger

Comments: 10 pages

Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Graphics (cs.GR); Multimedia (cs.MM)
[66] arXiv:2506.05384 (cross-list from cs.CV) [pdf, html, other]: Title: Q-Ponder: A Unified Training Pipeline for Reasoning-based Visual Quality Assessment

Zhuoxuan Cai, Jian Zhang, Xinbin Yuan, Peng-Tao Jiang, Wenxiang Chen, Bowen Tang, Lujian Yao, Qiyuan Wang, Jinwen Chen, Bo Li

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[67] arXiv:2506.05395 (cross-list from cs.CV) [pdf, html, other]: Title: TriPSS: A Tri-Modal Keyframe Extraction Framework Using Perceptual, Structural, and Semantic Representations

Mert Can Cakmak, Nitin Agarwal, Diwash Poudel

Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM); Image and Video Processing (eess.IV)
[68] arXiv:2506.05414 (cross-list from cs.CV) [pdf, other]: Title: SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing

Mingfei Chen, Zijun Cui, Xiulong Liu, Jinlin Xiang, Caleb Zheng, Jingyuan Li, Eli Shlizerman

Comments: Project website with demo videos: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[69] arXiv:2506.05538 (cross-list from cs.LG) [pdf, html, other]: Title: SocialDF: Benchmark Dataset and Detection Model for Mitigating Harmful Deepfake Content on Social Media Platforms

Arnesh Batra, Anushk Kumar, Jashn Khemani, Arush Gumber, Arhan Jain, Somil Gupta

Subjects: Machine Learning (cs.LG); Multimedia (cs.MM)
[70] arXiv:2506.05683 (cross-list from cs.LG) [pdf, html, other]: Title: Multi-Modal Multi-Task Federated Foundation Models for Next-Generation Extended Reality Systems: Towards Privacy-Preserving Distributed Intelligence in AR/VR/MR

Fardis Nadimi, Payam Abdisarabshali, Kasra Borazjani, Jacob Chakareski, Seyyedali Hosseinalipour

Comments: 16 pages, 4 Figures, 8 Tables

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multimedia (cs.MM)
[71] arXiv:2506.06407 (cross-list from cs.CR) [pdf, other]: Title: TimeWak: Temporal Chained-Hashing Watermark for Time Series Data

Zhi Wen Soi, Chaoyi Zhu, Fouad Abiad, Aditya Shankar, Jeroen M. Galjaard, Huijuan Wang, Lydia Y. Chen

Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
[72] arXiv:2506.07050 (cross-list from cs.CV) [pdf, html, other]: Title: From Swath to Full-Disc: Advancing Precipitation Retrieval with Multimodal Knowledge Expansion

Zheng Wang, Kai Ying, Bin Xu, Chunjiao Wang, Cong Bai

Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
[73] arXiv:2506.07138 (cross-list from cs.CV) [pdf, html, other]: Title: Learning Compact Vision Tokens for Efficient Large Multimodal Models

Hao Tang, Chengchao Shen

Comments: The source code and trained weights are available at this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
[74] arXiv:2506.07634 (cross-list from eess.AS) [pdf, html, other]: Title: SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement

Chenyu Yang, Shuai Wang, Hangting Chen, Wei Tan, Jianwei Yu, Haizhou Li

Comments: Submitted to NeurIPS 2025

Subjects: Audio and Speech Processing (eess.AS); Multimedia (cs.MM)
[75] arXiv:2506.07863 (cross-list from cs.CV) [pdf, html, other]: Title: VIVAT: Virtuous Improving VAE Training through Artifact Mitigation

Lev Novitskiy, Viacheslav Vasilev, Maria Kovaleva, Vladimir Arkhipkin, Denis Dimitrov

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
[76] arXiv:2506.08200 (cross-list from cs.HC) [pdf, html, other]: Title: AffectMachine-Pop: A controllable expert system for real-time pop music generation

Kat R. Agres, Adyasha Dash, Phoebe Chua, Stefan K. Ehrlich

Journal-ref: 2025 AAAI Workshop on Artificial Intelligence for Music, 39th Annual AAAI Conference on Artificial Intelligence. Philadelphia, PA, USA

Subjects: Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
[77] arXiv:2506.08493 (cross-list from cs.CV) [pdf, html, other]: Title: Context-aware TFL: A Universal Context-aware Contrastive Learning Framework for Temporal Forgery Localization

Qilin Yin, Wei Lu, Xiangyang Luo, Xiaochun Cao

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[78] arXiv:2506.08524 (cross-list from cs.SD) [pdf, html, other]: Title: Teaching Physical Awareness to LLMs through Sounds

Weiguo Wang, Andy Nie, Wenrui Zhou, Yi Kai, Chengchen Hu

Comments: ICML 2025

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Robotics (cs.RO); Audio and Speech Processing (eess.AS)
[79] arXiv:2506.08591 (cross-list from cs.CV) [pdf, html, other]: Title: Diversity-Guided MLP Reduction for Efficient Large Vision Transformers

Chengchao Shen, Hourun Zhu, Gongfan Fang, Jianxin Wang, Xinchao Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
[80] arXiv:2506.09650 (cross-list from cs.CV) [pdf, html, other]: Title: HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios

Kunyu Peng, Junchao Huang, Xiangsheng Huang, Di Wen, Junwei Zheng, Yufan Chen, Kailun Yang, Jiamin Wu, Chongqing Hao, Rainer Stiefelhagen

Comments: The code is available at this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Robotics (cs.RO); Image and Video Processing (eess.IV)
[81] arXiv:2506.09792 (cross-list from cs.SD) [pdf, html, other]: Title: Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction

Wenxuan Wu, Shuai Wang, Xixin Wu, Helen Meng, Haizhou Li

Comments: Accepted by Interspeech 2025

Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[82] arXiv:2506.09999 (cross-list from cs.LG) [pdf, html, other]: Title: Leveraging Pre-Trained Models for Multimodal Class-Incremental Learning under Adaptive Fusion

Yukun Chen, Zihuan Qiu, Fanman Meng, Hongliang Li, Linfeng Xu, Qingbo Wu

Subjects: Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[83] arXiv:2506.10005 (cross-list from cs.CV) [pdf, html, other]: Title: Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models

Sridhar S, Nithin A, Shakeel Rifath, Vasantha Raj

Comments: 10 pages, seven figures about Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR); Multimedia (cs.MM)
[84] arXiv:2506.10009 (cross-list from eess.IV) [pdf, html, other]: Title: The Iris File Extension

Ryan Erik Landvater, Michael David Olp, Mustafa Yousif, Ulysses Balis

Comments: 17 pages, 7 figures

Subjects: Image and Video Processing (eess.IV); Multimedia (cs.MM)
[85] arXiv:2506.10452 (cross-list from cs.CV) [pdf, html, other]: Title: Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts

Guowei Zhong, Ruohong Huan, Mingzhen Wu, Ronghua Liang, Peng Chen

Comments: Submitted to TAC. The code is available at this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
[86] arXiv:2506.10574 (cross-list from cs.CV) [pdf, html, other]: Title: DanceChat: Large Language Model-Guided Music-to-Dance Generation

Qing Wang, Xiaohang Yang, Yilan Dong, Naveen Raj Govindaraj, Gregory Slabaugh, Shanxin Yuan

Comments: check demos at this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[87] arXiv:2506.10857 (cross-list from cs.CV) [pdf, html, other]: Title: VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, Zhenxiang Li, Zhongying Tu, Conghui He, Yu Qiao, Yali Wang, Yi Wang, Limin Wang

Comments: Technical Report

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[88] arXiv:2506.10932 (cross-list from cs.HC) [pdf, other]: Title: Video-Mediated Emotion Disclosure: Expressions of Fear, Sadness, and Joy by People with Schizophrenia on YouTube

Jiaying Lizzy Liu, Yan Zhang

Comments: 10 pages

Journal-ref: ASIS&T 2025

Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Multimedia (cs.MM)
[89] arXiv:2506.10941 (cross-list from cs.CV) [pdf, other]: Title: VINCIE: Unlocking In-context Image Editing from Video

Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin, Yichun Shi, Yicong Li, Wenjie Wang, Tat-Seng Chua, Lu Jiang

Comments: Project page: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
[90] arXiv:2506.11036 (cross-list from cs.LG) [pdf, html, other]: Title: Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification

Yang Qin, Chao Chen, Zhihang Fu, Dezhong Peng, Xi Peng, Peng Hu

Subjects: Machine Learning (cs.LG); Multimedia (cs.MM)
[91] arXiv:2506.11521 (cross-list from cs.CR) [pdf, html, other]: Title: Investigating Vulnerabilities and Defenses Against Audio-Visual Attacks: A Comprehensive Survey Emphasizing Multimodal Models

Jinming Wen, Xinyi Wu, Shuai Zhao, Yanhao Jia, Yuwen Li

Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[92] arXiv:2506.11737 (cross-list from cs.CV) [pdf, html, other]: Title: Quizzard@INOVA Challenge 2025 -- Track A: Plug-and-Play Technique in Interleaved Multi-Image Model

Dinh Viet Cuong, Hoang-Bao Le, An Pham Ngoc Nguyen, Liting Zhou, Cathal Gurrin

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
[93] arXiv:2506.11934 (cross-list from cs.SI) [pdf, html, other]: Title: Temporal Dynamics of Emotions in Italian Online Soccer Fandoms

Salvatore Citraro, Giovanni Mauro, Emanuele Ferragina

Subjects: Social and Information Networks (cs.SI); Multimedia (cs.MM)
[94] arXiv:2506.12269 (cross-list from eess.IV) [pdf, html, other]: Title: ICME 2025 Grand Challenge on Video Super-Resolution for Video Conferencing

Babak Naderi, Ross Cutler, Juhee Cho, Nabakumar Khongbantabam, Dejan Ivkovic

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[95] arXiv:2506.12573 (cross-list from cs.SD) [pdf, html, other]: Title: Video-Guided Text-to-Music Generation Using Public Domain Movie Collections

Haven Kim, Zachary Novack, Weihan Xu, Julian McAuley, Hao-Wen Dong

Comments: ISMIR 2025 regular paper. Dataset, code, and demo available at this https URL

Subjects: Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[96] arXiv:2506.12935 (cross-list from cs.CL) [pdf, html, other]: Title: SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models

Xingjian Diao, Chunhui Zhang, Keyi Kong, Weiyi Wu, Chiyu Ma, Zhongyu Ouyang, Peijun Qing, Soroush Vosoughi, Jiang Gui

Subjects: Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[97] arXiv:2506.13001 (cross-list from cs.SD) [pdf, html, other]: Title: Personalizable Long-Context Symbolic Music Infilling with MIDI-RWKV

Christian Zhou-Zheng, Philippe Pasquier

Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[98] arXiv:2506.13038 (cross-list from cs.CV) [pdf, html, other]: Title: HKD4VLM: A Progressive Hybrid Knowledge Distillation Framework for Robust Multimodal Hallucination and Factuality Detection in VLMs

Zijian Zhang, Xuecheng Wu, Danlei Huang, Siyu Yan, Chong Peng, Xuezhi Cao

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[99] arXiv:2506.13971 (cross-list from eess.AS) [pdf, html, other]: Title: Multimodal Fusion with Semi-Supervised Learning Minimizes Annotation Quantity for Modeling Videoconference Conversation Experience

Andrew Chang, Chenkai Hu, Ji Qi, Zhuojian Wei, Kexin Zhang, Viswadruth Akkaraju, David Poeppel, Dustin Freeman

Comments: Interspeech 2025

Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM)
[100] arXiv:2506.14223 (cross-list from cs.SD) [pdf, html, other]: Title: Fretting-Transformer: Encoder-Decoder Model for MIDI to Tablature Transcription

Anna Hamberger, Sebastian Murgul, Jochen Schmidt, Michael Heizmann

Comments: Accepted to the 50th International Computer Music Conference (ICMC), 2025

Subjects: Sound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[101] arXiv:2506.14396 (cross-list from cs.SD) [pdf, html, other]: Title: Manipulated Regions Localization For Partially Deepfake Audio: A Survey

Jiayi He, Jiangyan Yi, Jianhua Tao, Siding Zeng, Hao Gu

Subjects: Sound (cs.SD); Multimedia (cs.MM)
[102] arXiv:2506.14427 (cross-list from eess.AS) [pdf, html, other]: Title: M3SD: Multi-modal, Multi-scenario and Multi-language Speaker Diarization Dataset

Shilong Wu

Subjects: Audio and Speech Processing (eess.AS); Multimedia (cs.MM)
[103] arXiv:2506.14771 (cross-list from eess.IV) [pdf, html, other]: Title: Empirical Studies of Large Scale Environment Scanning by Consumer Electronics

Mengyuan Wang, Yang Liu, Haopeng Wang, Haiwei Dong, Abdulmotaleb El Saddik

Comments: Accepted by IEEE Consumer Electronics Magazine

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Multimedia (cs.MM)
[104] arXiv:2506.14805 (cross-list from cs.CV) [pdf, html, other]: Title: Argus Inspection: Do Multimodal Large Language Models Possess the Eye of Panoptes?

Yang Yao, Lingyu Li, Jiaxin Song, Chiyu Chen, Zhenqi He, Yixu Wang, Xin Wang, Tianle Gu, Jie Li, Yan Teng, Yingchun Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
[105] arXiv:2506.14824 (cross-list from cs.LG) [pdf, html, other]: Title: FedNano: Toward Lightweight Federated Tuning for Pretrained Multimodal Large Language Models

Yao Zhang, Hewei Gao, Haokun Chen, Weiguo Li, Yunpu Ma, Volker Tresp

Comments: 12 pages, 3 figures

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[106] arXiv:2506.15154 (cross-list from cs.SD) [pdf, html, other]: Title: SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning

Anuradha Chopra, Abhinaba Roy, Dorien Herremans

Comments: 14 pages, 2 figures, Accepted to AIMC 2025

Journal-ref: Proceedings of the 6th Conference on AI Music Creativity (AIMC 2025), Brussels, Belgium, September 10th - 12th, 2025

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[107] arXiv:2506.15228 (cross-list from eess.IV) [pdf, html, other]: Title: ABC: Adaptive BayesNet Structure Learning for Computational Scalable Multi-task Image Compression

Yufeng Zhang, Wenrui Dai, Hang Yu, Shizhan Liu, Junhui Hou, Jianguo Li, Weiyao Lin

Subjects: Image and Video Processing (eess.IV); Multimedia (cs.MM)
[108] arXiv:2506.15276 (cross-list from cs.CV) [pdf, html, other]: Title: MSNeRV: Neural Video Representation with Multi-Scale Feature Fusion

Jun Zhu, Xinfeng Zhang, Lv Tang, JunHao Jiang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
[109] arXiv:2506.15298 (cross-list from cs.CV) [pdf, html, other]: Title: MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering

Xinqi Fan, Jingting Li, John See, Moi Hoon Yap, Wen-Huang Cheng, Xiaobai Li, Xiaopeng Hong, Su-Jing Wang, Adrian K. Davison

Comments: Micro-Expression Grand Challenge (MEGC) at ACM MM 2025

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[110] arXiv:2506.15677 (cross-list from cs.AI) [pdf, other]: Title: Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence

Yining Hong, Rui Sun, Bingxuan Li, Xingcheng Yao, Maxine Wu, Alexander Chien, Da Yin, Ying Nian Wu, Zhecan James Wang, Kai-Wei Chang

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Robotics (cs.RO)
[111] arXiv:2506.15759 (cross-list from cs.SD) [pdf, html, other]: Title: Sonic4D: Spatial Audio Generation for Immersive 4D Scene Exploration

Siyi Xie, Hanxin Zhu, Tianyu He, Xin Li, Zhibo Chen

Comments: 17 pages, 7 figures. Project page: this https URL

Subjects: Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[112] arXiv:2506.15937 (cross-list from cs.CV) [pdf, html, other]: Title: Beyond Audio and Pose: A General-Purpose Framework for Video Synchronization

Yosub Shin, Igor Molybog

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[113] arXiv:2506.16116 (cross-list from eess.IV) [pdf, html, other]: Title: Enhanced Dermatology Image Quality Assessment via Cross-Domain Training

Ignacio Hernández Montilla, Alfonso Medela, Paola Pasquali, Andy Aguilar, Taig Mac Carthy, Gerardo Fernández, Antonio Martorell, Enrique Onieva

Comments: 9 pages, 4 figures. This manuscript has been accepted to the 2025 12th International Conference on Bioinformatics Research and Applications (ICBRA 2025). It will be published in International Conference Proceedings by ACM, which will be archived in ACM Digital Library, indexed by Ei Compendex and Scopus

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[114] arXiv:2506.16273 (cross-list from cs.CV) [pdf, html, other]: Title: Fine-grained Image Retrieval via Dual-Vision Adaptation

Xin Jiang, Meiqi Cao, Hao Tang, Fei Shen, Zechao Li

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[115] arXiv:2506.16633 (cross-list from cs.CL) [pdf, html, other]: Title: GeoGuess: Multimodal Reasoning based on Hierarchy of Visual Information in Street View

Fenghua Cheng, Jinxiang Wang, Sen Wang, Zi Huang, Xue Li

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[116] arXiv:2506.16745 (cross-list from cs.CV) [pdf, html, other]: Title: Class Agnostic Instance-level Descriptor for Visual Instance Search

Qi-Ying Sun, Wan-Lei Zhao, Yi-Bo Miao, Chong-Wah Ngo

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[117] arXiv:2506.16784 (cross-list from cs.CV) [pdf, html, other]: Title: TextBraTS: Text-Guided Volumetric Brain Tumor Segmentation with Innovative Dataset Development and Fusion Module Exploration

Xiaoyu Shi, Rahul Kumar Jain, Yinhao Li, Ruibo Hou, Jingliang Cheng, Jie Bai, Guohua Zhao, Lanfen Lin, Rui Xu, Yen-wei Chen

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[118] arXiv:2506.17016 (cross-list from cs.LG) [pdf, html, other]: Title: The Hidden Cost of an Image: Quantifying the Energy Consumption of AI Image Generation

Giulia Bertazzini, Chiara Albisani, Daniele Baracchi, Dasara Shullani, Roberto Verdecchia

Subjects: Machine Learning (cs.LG); Multimedia (cs.MM)
[119] arXiv:2506.17342 (cross-list from cs.LG) [pdf, html, other]: Title: Adaptive Social Metaverse Streaming based on Federated Multi-Agent Deep Reinforcement Learning

Zijian Long, Haopeng Wang, Haiwei Dong, Abdulmotaleb El Saddik

Comments: Accepted by IEEE Transactions on Computational Social Systems

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Networking and Internet Architecture (cs.NI)
[120] arXiv:2506.17351 (cross-list from cs.SD) [pdf, html, other]: Title: Zero-Shot Cognitive Impairment Detection from Speech Using AudioLLM

Mostafa Shahin, Beena Ahmed, Julien Epps

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[121] arXiv:2506.17499 (cross-list from cs.LG) [pdf, html, other]: Title: Episode-specific Fine-tuning for Metric-based Few-shot Learners with Optimization-based Training

Xuanyu Zhuang, Geoffroy Peeters, Gaël Richard

Subjects: Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
[122] arXiv:2506.17707 (cross-list from cs.CV) [pdf, html, other]: Title: Programmable-Room: Interactive Textured 3D Room Meshes Generation Empowered by Large Language Models

Jihyun Kim, Junho Park, Kyeongbo Kong, Suk-Ju Kang

Comments: Accepted by IEEE Transactions on Multimedia

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[123] arXiv:2506.17912 (cross-list from cs.CV) [pdf, html, other]: Title: PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis

Chuhao Jin, Haosen Li, Bingzi Zhang, Che Liu, Xiting Wang, Ruihua Song, Wenbing Huang, Ying Qin, Fuzheng Zhang, Di Zhang

Comments: 14 pages, 7 figures

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[124] arXiv:2506.18021 (cross-list from cs.CV) [pdf, html, other]: Title: On the Robustness of Human-Object Interaction Detection against Distribution Shift

Chi Xie, Shuang Liang, Jie Li, Feng Zhu, Rui Zhao, Yichen Wei, Shengjie Zhao

Comments: This work has been submitted to the IEEE for possible publication

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[125] arXiv:2506.18034 (cross-list from cs.CV) [pdf, html, other]: Title: Pre-Trained LLM is a Semantic-Aware and Generalizable Segmentation Booster

Fenghe Tang, Wenxin Ma, Zhiyang He, Xiaodong Tao, Zihang Jiang, S. Kevin Zhou

Comments: Accepted by MICCAI 2025. Code: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[126] arXiv:2506.18866 (cross-list from cs.CV) [pdf, html, other]: Title: OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation

Qijun Gan, Ruizi Yang, Jianke Zhu, Shaofei Xue, Steven Hoi

Comments: Project page: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[127] arXiv:2506.18881 (cross-list from cs.CV) [pdf, html, other]: Title: Let Your Video Listen to Your Music!

Xinyu Zhang, Dong Gong, Zicheng Duan, Anton van den Hengel, Lingqiao Liu

Comments: project page: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[128] arXiv:2506.18898 (cross-list from cs.CV) [pdf, html, other]: Title: Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, Lu Jiang

Comments: Project page: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
[129] arXiv:2506.19051 (cross-list from eess.IV) [pdf, html, other]: Title: NIC-RobustBench: A Comprehensive Open-Source Toolkit for Neural Image Compression and Robustness Analysis

Georgii Bychkov, Khaled Abud, Egor Kovalev, Alexander Gushchin, Dmitriy Vatolin, Anastasia Antsiferova

Comments: arXiv admin note: text overlap with arXiv:2411.11795

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[130] arXiv:2506.20070 (cross-list from cs.IR) [pdf, html, other]: Title: Multimodal Information Retrieval for Open World with Edit Distance Weak Supervision

KMA Solaiman, Bharat Bhargava

Comments: Submitted to ICDE'24. An earlier version of this paper appeared on TechRxiv: this https URL, uploaded on February 05, 2023

Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
[131] arXiv:2506.20214 (cross-list from cs.CV) [pdf, html, other]: Title: UniCode$^2$: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation

Yanzhe Chen (Yen-chieh Chan), Huasong Zhong, Yan Li, Zhenheng Yang

Comments: 19 pages, 5 figures

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[132] arXiv:2506.20370 (cross-list from cs.CV) [pdf, html, other]: Title: InvZW: Invariant Feature Learning via Noise-Adversarial Training for Robust Image Zero-Watermarking

Abdullah All Tanvir, Xin Zhong

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
[133] arXiv:2506.20494 (cross-list from cs.LG) [pdf, html, other]: Title: Multimodal Representation Learning and Fusion

Qihang Jin, Enze Ge, Yuhang Xie, Hongying Luo, Junhao Song, Ziqian Bi, Chia Xin Liang, Jibin Guan, Joe Yeong, Junfeng Hao

Subjects: Machine Learning (cs.LG); Multimedia (cs.MM)
[134] arXiv:2506.20548 (cross-list from cs.CV) [pdf, html, other]: Title: Pay Less Attention to Deceptive Artifacts: Robust Detection of Compressed Deepfakes on Online Social Networks

Manyi Li, Renshuai Tao, Yufan Liu, Chuangchuang Tan, Haotong Qin, Bing Li, Yunchao Wei, Yao Zhao

Comments: 20 pages, 10 figures

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
[135] arXiv:2506.20609 (cross-list from cs.SD) [pdf, html, other]: Title: Deciphering GunType Hierarchy through Acoustic Analysis of Gunshot Recordings

Ankit Shah, Rita Singh, Bhiksha Raj, Alexander Hauptmann

Comments: 4 pages + 1 References

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[136] arXiv:2506.20817 (cross-list from cs.IR) [pdf, html, other]: Title: RAG-VisualRec: An Open Resource for Vision- and Text-Enhanced Retrieval-Augmented Generation in Recommendation

Ali Tourani, Fatemeh Nazary, Yashar Deldjoo

Comments: 20 pages, 6 figures, 5 tables

Subjects: Information Retrieval (cs.IR); Multimedia (cs.MM)
[137] arXiv:2506.20947 (cross-list from cs.CV) [pdf, html, other]: Title: Hierarchical Sub-action Tree for Continuous Sign Language Recognition

Dejie Yang, Zhu Xu, Xinjie Gao, Yang Liu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[138] arXiv:2506.21272 (cross-list from cs.GR) [pdf, html, other]: Title: FairyGen: Storied Cartoon Video from a Single Child-Drawn Character

Jiayi Zheng, Xiaodong Cun

Comments: Project Page: this https URL; Code: this https URL

Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[139] arXiv:2506.21298 (cross-list from cs.SD) [pdf, html, other]: Title: Exploring Adapter Design Tradeoffs for Low Resource Music Generation

Atharva Mehta, Shivam Chauhan, Monojit Choudhury

Comments: 9 pages, 5 figures

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[140] arXiv:2506.21552 (cross-list from cs.CV) [pdf, html, other]: Title: Whole-Body Conditioned Egocentric Video Prediction

Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik

Comments: Project Page: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Robotics (cs.RO)
[141] arXiv:2506.21851 (cross-list from cs.CV) [pdf, html, other]: Title: End-to-End RGB-IR Joint Image Compression With Channel-wise Cross-modality Entropy Model

Haofeng Wang, Fangtao Zhou, Qi Zhang, Zeyuan Chen, Enci Zhang, Zhao Wang, Xiaofeng Huang, Siwei Ma

Comments: IEEE International Conference on Systems, Man, and Cybernetics (SMC) 2025, under review

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
[142] arXiv:2506.21862 (cross-list from cs.CV) [pdf, html, other]: Title: LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs

Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou

Comments: 21 pages, 4 figures, 7 tables

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
[143] arXiv:2506.21885 (cross-list from cs.CV) [pdf, html, other]: Title: Integrating Multi-Modal Sensors: A Review of Fusion Techniques for Intelligent Vehicles

Chuheng Wei, Ziye Qin, Ziyan Zhang, Guoyuan Wu, Matthew J. Barth

Comments: Accepted by IEEE IV 2025

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Robotics (cs.RO)
[144] arXiv:2506.21912 (cross-list from cs.CV) [pdf, html, other]: Title: Generating Attribute-Aware Human Motions from Textual Prompt

Xinghan Wang, Kun Xu, Fei Li, Cao Sheng, Jiazhong Yu, Yadong Mu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[145] arXiv:2506.22036 (cross-list from cs.LG) [pdf, html, other]: Title: Hyper-modal Imputation Diffusion Embedding with Dual-Distillation for Federated Multimodal Knowledge Graph Completion

Ying Zhang, Yu Zhao, Xuhui Sui, Baohang Zhou, Xiangrui Cai, Li Shen, Xiaojie Yuan, Dacheng Tao

Comments: Submitted to the IEEE for possible publication

Subjects: Machine Learning (cs.LG); Multimedia (cs.MM)
[146] arXiv:2506.22237 (cross-list from cs.SD) [pdf, html, other]: Title: Fine-Tuning MIDI-to-Audio Alignment using a Neural Network on Piano Roll and CQT Representations

Sebastian Murgul, Moritz Reiser, Michael Heizmann, Christoph Seibert

Comments: 9 pages, 3 figures, 6 tables

Subjects: Sound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[147] arXiv:2506.22790 (cross-list from eess.IV) [pdf, html, other]: Title: ICME 2025 Generalizable HDR and SDR Video Quality Measurement Grand Challenge

Yixu Chen, Bowen Chen, Hai Wei, Alan C. Bovik, Baojun Li, Wei Sun, Linhan Cao, Kang Fu, Dandan Zhu, Jun Jia, Menghan Hu, Xiongkuo Min, Guangtao Zhai, Dounia Hammou, Fei Yin, Rafal Mantiuk, Amritha Premkumar, Prajit T Rajendran, Vignesh V Menon

Comments: ICME 2025 Grand Challenges

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[148] arXiv:2506.22871 (cross-list from cs.LG) [pdf, html, other]: Title: P$^2$U: Progressive Precision Update For Efficient Model Distribution

Homayun Afrabandpey, Hamed Rezazadegan Tavakoli

Subjects: Machine Learning (cs.LG); Multimedia (cs.MM)
[149] arXiv:2506.22926 (cross-list from cs.HC) [pdf, html, other]: Title: Coordinated 2D-3D Visualization of Volumetric Medical Data in XR with Multimodal Interactions

Qixuan Liu, Shi Qiu, Yinqiao Wang, Xiwen Wu, Kenneth Siu Ho Chok, Chi-Wing Fu, Pheng-Ann Heng

Comments: IEEE VIS 2025 Short Paper

Subjects: Human-Computer Interaction (cs.HC); Graphics (cs.GR); Multimedia (cs.MM)
[150] arXiv:2506.22967 (cross-list from cs.CV) [pdf, html, other]: Title: ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment

Amir Aghdam, Vincent Tao Hu

Comments: Preprint manuscript - Project page: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
[151] arXiv:2506.23066 (cross-list from cs.CV) [pdf, html, other]: Title: CoreMark: Toward Robust and Universal Text Watermarking Technique

Jiale Meng, Yiming Li, Zheming Lu, Zewei He, Hao Luo, Tianwei Zhang

Comments: 10 pages, 16 figures

Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Multimedia (cs.MM)
[152] arXiv:2506.23151 (cross-list from cs.CV) [pdf, html, other]: Title: MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation

Vladislav Bargatin, Egor Chistov, Alexander Yakovenko, Dmitriy Vatolin

Comments: Accepted at ICCV 2025

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[153] arXiv:2506.23254 (cross-list from cs.CV) [pdf, other]: Title: PixelBoost: Leveraging Brownian Motion for Realistic-Image Super-Resolution

Aradhana Mishra, Bumshik Lee

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)
