Search | arXiv e-print repository

R-PINN: Recovery-type a-posteriori estimator enhanced adaptive PINN

Authors: Rongxin Lu, Jiwei Jia, Young Ju Lee, Zheng Lu, Chensong Zhang

Abstract: In recent years, with the advancements in machine learning and neural networks, algorithms using physics-informed neural networks (PINNs) to solve PDEs have gained widespread applications. While these algorithms are well-suited for a wide range of equations, they often exhibit suboptimal performance when applied to equations with large local gradients, resulting in substantial localized errors. To address this issue, this paper proposes an adaptive PINN algorithm designed to improve accuracy in such cases. The core idea of the algorithm is to adaptively adjust the distribution of collocation points based on the recovery-type a-posteriori error of the current numerical solution, enabling a better approximation of the true solution. This approach is inspired by the adaptive finite element method. By combining the recovery-type a-posteriori estimator, a gradient-recovery estimator commonly used in the adaptive finite element method (FEM), with PINNs, we introduce the Recovery-type a-posteriori estimator enhanced adaptive PINN (R-PINN) and compare its performance with a typical adaptive PINN algorithm, FI-PINN. Our results demonstrate that R-PINN achieves faster convergence with fewer adaptive points and significantly outperforms FI-PINN in cases with multiple regions of large error. Notably, our method is a hybrid numerical approach for solving partial differential equations, integrating adaptive FEM with PINNs.
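The abstract's core idea, redistributing collocation points according to an error estimate, can be illustrated with a minimal sketch. This is not the authors' R-PINN algorithm: the function `resample_collocation_points`, the toy indicator, and the choice of sampling with probability proportional to the indicator are all assumptions made for illustration, and the recovery-type estimator itself is replaced by a hypothetical stand-in.

```python
import random


def resample_collocation_points(candidates, error_indicator, n_new, seed=0):
    """Draw n_new points from candidates, favoring large error indicators.

    Hypothetical helper: sampling probability is proportional to the
    (non-negative) indicator value at each candidate point.
    """
    weights = [max(error_indicator(x), 0.0) for x in candidates]
    if sum(weights) == 0.0:
        # No error information at all: fall back to uniform sampling.
        weights = [1.0] * len(candidates)
    rng = random.Random(seed)
    return rng.choices(candidates, weights=weights, k=n_new)


def toy_indicator(x):
    """Stand-in for an a-posteriori error estimate, peaked near x = 0.5."""
    return 1.0 / (1e-3 + (x - 0.5) ** 2)


# Uniform candidate grid on [0, 1] (an assumption for this toy example).
candidates = [i / 200 for i in range(201)]
new_points = resample_collocation_points(candidates, toy_indicator, n_new=50)
near_peak = sum(1 for x in new_points if abs(x - 0.5) < 0.1)
print(f"{near_peak} of {len(new_points)} new points lie within 0.1 of the peak")
```

In an actual adaptive loop, the new points would be added to the training set and the network retrained before the indicator is evaluated again.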

Submitted 11 June, 2025; originally announced June 2025.

arXiv:2506.08071 [pdf, ps, other]

CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems

Authors: Aniket Rege, Zinnia Nie, Mahesh Ramesh, Unmesh Raskar, Zhuoran Yu, Aditya Kusupati, Yong Jae Lee, Ramya Korlakai Vinayak

Abstract: Popular text-to-image (T2I) systems are trained on web-scraped data, which is heavily Amero and Euro-centric, underrepresenting the cultures of the Global South. To analyze these biases, we introduce CuRe, a novel and scalable benchmarking and scoring suite for cultural representativeness that leverages the marginal utility of attribute specification to T2I systems as a proxy for human judgments. Our CuRe benchmark dataset has a novel categorical hierarchy built from the crowdsourced Wikimedia knowledge graph, with 300 cultural artifacts across 32 cultural subcategories grouped into six broad cultural axes (food, art, fashion, architecture, celebrations, and people). Our dataset's categorical hierarchy enables CuRe scorers to evaluate T2I systems by analyzing their response to increasing the informativeness of text conditioning, enabling fine-grained cultural comparisons. We empirically observe much stronger correlations of our class of scorers to human judgments of perceptual similarity, image-text alignment, and cultural diversity across image encoders (SigLIP 2, AIMV2 and DINOv2), vision-language models (OpenCLIP, SigLIP 2, Gemini 2.0 Flash) and state-of-the-art text-to-image systems, including three variants of Stable Diffusion (1.5, XL, 3.5 Large), FLUX.1 [dev], Ideogram 2.0, and DALL-E 3. The code and dataset are open-sourced and available at https://aniketrege.github.io/cure/.

Submitted 9 June, 2025; originally announced June 2025.

Comments: 41 pages, 22 figures, 17 tables

arXiv:2505.23806 [pdf, ps, other]

MedOrchestra: A Hybrid Cloud-Local LLM Approach for Clinical Data Interpretation

Authors: Sihyeon Lee, Hyunjoo Song, Jong-chan Lee, Yoon Jin Lee, Boram Lee, Hee-Eon Lim, Dongyeong Kim, Jinwook Seo, Bohyoung Kim

Abstract: Deploying large language models (LLMs) in clinical settings faces critical trade-offs: cloud LLMs, with their extensive parameters and superior performance, pose risks to sensitive clinical data privacy, while local LLMs preserve privacy but often fail at complex clinical interpretation tasks. We propose MedOrchestra, a hybrid framework where a cloud LLM decomposes complex clinical tasks into manageable subtasks and prompt generation, while a local LLM executes these subtasks in a privacy-preserving manner. Without accessing clinical data, the cloud LLM generates and validates subtask prompts using clinical guidelines and synthetic test cases. The local LLM executes subtasks locally and synthesizes outputs generated by the cloud LLM. We evaluate MedOrchestra on pancreatic cancer staging using 100 radiology reports under NCCN guidelines. On free-text reports, MedOrchestra achieves 70.21% accuracy, outperforming local model baselines (without guideline: 48.94%, with guideline: 56.59%) and board-certified clinicians (gastroenterologists: 59.57%, surgeons: 65.96%, radiologists: 55.32%). On structured reports, MedOrchestra reaches 85.42% accuracy, showing clear superiority across all settings.

Submitted 27 May, 2025; originally announced May 2025.

arXiv:2505.21954 [pdf, ps, other]

UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios

Authors: Le Thien Phuc Nguyen, Zhuoran Yu, Khoa Quang Nhat Cao, Yuwei Guo, Tu Ho Manh Pham, Tuan Tai Nguyen, Toan Ngo Duc Vo, Lucas Poon, Soochahn Lee, Yong Jae Lee

Abstract: We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes - such as multiple visible speakers speaking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that the ASD task remains far from solved under realistic conditions. Nevertheless, models trained on UniTalk demonstrate stronger generalization to modern "in-the-wild" datasets like Talkies and ASW, as well as to AVA. UniTalk thus establishes a new benchmark for active speaker detection, providing researchers with a valuable resource for developing and evaluating versatile and resilient models. Dataset: https://huggingface.co/datasets/plnguyen2908/UniTalk-ASD Code: https://github.com/plnguyen2908/UniTalk-ASD-code

Submitted 28 May, 2025; originally announced May 2025.

arXiv:2505.20289 [pdf, other]

VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection

Authors: Zeyi Huang, Yuyang Ji, Anirudh Sundara Rajan, Zefan Cai, Wen Xiao, Junjie Hu, Yong Jae Lee

Abstract: We introduce VisTA, a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and combine tools from a diverse library based on empirical performance. Existing methods for tool-augmented reasoning either rely on training-free prompting or large-scale fine-tuning; both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA leverages end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool selection strategies, using task outcomes as feedback signals. Through Group Relative Policy Optimization (GRPO), our framework enables an agent to autonomously discover effective tool-selection pathways without requiring explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, and BlindTest benchmarks demonstrate that VisTA achieves substantial performance gains over training-free baselines, especially on out-of-distribution examples. These results highlight VisTA's ability to enhance generalization, adaptively utilize diverse tools, and pave the way for flexible, experience-driven visual reasoning systems.

Submitted 26 May, 2025; originally announced May 2025.

arXiv:2505.20021 [pdf, ps, other]

Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models

Authors: Hyunsik Chae, Seungwoo Yoon, Jaden Park, Chloe Yewon Chun, Yongin Cho, Mu Cai, Yong Jae Lee, Ernest K. Ryu

Abstract: Recent Vision-Language Models (VLMs) have demonstrated impressive multimodal comprehension and reasoning capabilities, yet they often struggle with trivially simple visual tasks. In this work, we focus on the domain of basic 2D Euclidean geometry and systematically categorize the fundamental, indivisible visual perception skills, which we refer to as atomic visual skills. We then introduce the Atomic Visual Skills Dataset (AVSD) for evaluating VLMs on the atomic visual skills. Using AVSD, we benchmark state-of-the-art VLMs and find that they struggle with these tasks, even though the tasks are trivial for adult humans. Our findings highlight the need for purpose-built datasets to train and evaluate VLMs on atomic, rather than composite, visual perception tasks.

Submitted 26 May, 2025; originally announced May 2025.

Comments: 69 pages, 16 figures

arXiv:2505.00880 [pdf]

A Model of UV-Blue Absorbance in Bulk Liquid of Venusian Cloud Aerosols Is Consistent with Efficient Organic Absorbers at High Concentrations

Authors: Jan Spacek, Yeon J. Lee, Paul B. Rimmer, Janusz J. Petkowski

Abstract: At visible wavelengths, Venus appears serene and pale-yellow, but since the 1920s, observers have noted high-contrast features in the ultraviolet. These features track the roughly 4-day superrotation of the upper cloud deck and vary widely over time and space. The identity of the UV absorber(s), active between at least 280 and 500 nm, remains unknown, as no proposed candidate fully matches all observational data. From remote observations of Venus, and accounting for light scattering by sub-micrometer droplets, we modeled the 365-455 nm absorbance per cm of the bulk liquids forming Venus's clouds. Assuming a uniform distribution in mode 1 and 2 particles across a 6 km layer below the cloud top at 65 km, we constrain the bulk absorbance with a peak at A375 nm being 2942 per cm. This extremely high absorbance implies the presence of a highly efficient absorber, most likely conjugated organics, at relatively high concentration, e.g., about 25 g/L for porphyrin-type pigments. Inorganic absorbers, with molar absorption coefficients typically in the range of 1,000-10,000 per M per cm, would either need to comprise a large portion of the aerosols or are simply not light absorbent enough, even if present in pure form. We emphasize that all candidate absorbers must be evaluated against Venus's reflectance curve using (i) known molar absorption coefficients, (ii) realistic atmospheric distributions, and (iii) appropriate particle size distributions. The upcoming Rocket Lab mission will test the hypothesis of organics in Venus's clouds.
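The concentration argument in this abstract can be checked with a short calculation. The sketch below assumes the Beer-Lambert relation (absorbance per unit path length equals molar absorption coefficient times molar concentration), which the abstract does not name explicitly; the function name and constants are illustrative, and only the figures 2942 per cm, 1,000-10,000 per M per cm, and 25 g/L come from the abstract.

```python
def required_molar_concentration(absorbance_per_cm, molar_absorptivity):
    """Assumed Beer-Lambert relation: A / l = epsilon * c, so c = (A / l) / epsilon.

    absorbance_per_cm  -- absorbance per cm of path length
    molar_absorptivity -- epsilon, in per M per cm
    Returns the molar concentration c in mol/L.
    """
    if molar_absorptivity <= 0:
        raise ValueError("molar_absorptivity must be positive")
    return absorbance_per_cm / molar_absorptivity


PEAK_ABSORBANCE_PER_CM = 2942.0   # peak bulk absorbance quoted in the abstract
PIGMENT_MASS_CONC_G_PER_L = 25.0  # porphyrin-type pigment example from the abstract

# Concentrations an inorganic absorber would need across the quoted epsilon range.
for epsilon in (1_000.0, 10_000.0):
    c = required_molar_concentration(PEAK_ABSORBANCE_PER_CM, epsilon)
    print(f"epsilon = {epsilon:>7.0f} per M per cm -> c = {c:.3f} mol/L")

# Mass-specific absorptivity implied by the 25 g/L example (L per g per cm).
mass_absorptivity = PEAK_ABSORBANCE_PER_CM / PIGMENT_MASS_CONC_G_PER_L
print(f"implied mass absorptivity = {mass_absorptivity:.2f} L per g per cm")
```

Under these assumptions an inorganic absorber would need a concentration of roughly 0.3 to 3 mol/L, which is consistent with the abstract's statement that such absorbers would have to make up a large portion of the aerosol.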

Submitted 1 May, 2025; originally announced May 2025.

arXiv:2504.20998 [pdf, other]

YoChameleon: Personalized Vision and Language Generation

Authors: Thao Nguyen, Krishna Kumar Singh, Jing Shi, Trung Bui, Yong Jae Lee, Yuheng Li

Abstract: Large Multimodal Models (e.g., GPT-4, Gemini, Chameleon) have evolved into powerful tools with millions of users. However, they remain generic models and lack personalized knowledge of specific user concepts. Previous work has explored personalization for text generation, yet it remains unclear how these methods can be adapted to new modalities, such as image generation. In this paper, we introduce Yo'Chameleon, the first attempt to study personalization for large multimodal models. Given 3-5 images of a particular concept, Yo'Chameleon leverages soft-prompt tuning to embed subject-specific information to (i) answer questions about the subject and (ii) recreate pixel-level details to produce images of the subject in new contexts. Yo'Chameleon is trained with (i) a self-prompting optimization mechanism to balance performance across multiple modalities, and (ii) a "soft-positive" image generation approach to enhance image quality in a few-shot setting.

Submitted 29 April, 2025; originally announced April 2025.

Comments: CVPR 2025; Project page: https://thaoshibe.github.io/YoChameleon

arXiv:2504.20996 [pdf, other]

X-Fusion: Introducing New Modality to Frozen Large Language Models

Authors: Sicheng Mo, Thao Nguyen, Xun Huang, Siddharth Srinivasan Iyer, Yijun Li, Yuchen Liu, Abhishek Tandon, Eli Shechtman, Krishna Kumar Singh, Yong Jae Lee, Bolei Zhou, Yuheng Li

Abstract: We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM's parameters frozen while integrating vision-specific information for both understanding and generation. Our experiments demonstrate that X-Fusion consistently outperforms alternative architectures on both image-to-text and text-to-image tasks. We find that incorporating understanding-focused data improves generation quality, reducing image data noise enhances overall performance, and feature alignment accelerates convergence for smaller models but has minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.

Submitted 29 April, 2025; originally announced April 2025.

Comments: Project Page: https://sichengmo.github.io/XFusion/

arXiv:2504.00557 [pdf, other]

Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features

Authors: Jewon Lee, Ki-Ung Song, Seungmin Yang, Donguk Lim, Jaeyeon Kim, Wooksu Shin, Bo-Kyeong Kim, Yong Jae Lee, Tae-Ho Kim

Abstract: Visual token reduction lowers inference costs caused by extensive image features in large vision-language models (LVLMs). Unlike relevant studies that prune tokens in self-attention-only LVLMs, our work uniquely addresses cross-attention-based models, which achieve superior performance. We identify that the key-value (KV) cache size for image tokens in cross-attention layers significantly exceeds that of text tokens in self-attention layers, posing a major compute bottleneck. To mitigate this issue, we exploit the sparse nature in cross-attention maps to selectively prune redundant visual features. Our Trimmed Llama effectively reduces KV cache demands without requiring additional training. By benefiting from 50%-reduced visual features, our model can reduce inference latency and memory usage while achieving benchmark parity.

Submitted 1 April, 2025; originally announced April 2025.

Comments: accepted at CVPR 2025 Workshop on ELVM

arXiv:2503.19559 [pdf, other]

Combined Annual Modulation Dark Matter Search with COSINE-100 and ANAIS-112

Authors: N. Carlin, J. Y. Cho, J. J. Choi, S. Choi, A. C. Ezeribe, L. E. França, C. Ha, I. S. Hahn, S. J. Hollick, S. B. Hong, E. J. Jeon, H. W. Joo, W. G. Kang, M. Kauer, B. H. Kim, H. J. Kim, J. Kim, K. W. Kim, S. H. Kim, S. K. Kim, W. K. Kim, Y. D. Kim, Y. H. Kim, Y. J. Ko, D. H. Lee, et al. (49 additional authors not shown)

Abstract: The annual modulation signal, claimed to be consistent with dark matter as observed by DAMA/LIBRA in a sodium-iodide based detector, has persisted for over two decades. COSINE-100 and ANAIS-112 were designed to test the claim directly using the same target material. COSINE-100, located at Yangyang Underground Laboratory in South Korea, and ANAIS-112, located at Canfranc Underground Laboratory in Spain, have been taking data since 2016 and 2017, respectively. Each experiment published its respective results independently. In this paper, we present the results of an annual modulation search as a test of the signal observed by DAMA/LIBRA with the first three respective years of data from COSINE-100 and ANAIS-112. Using a Markov Chain Monte Carlo method, we find best fit values for modulation amplitude of $-0.0002 {\pm} 0.0026$ cpd/kg/keV in the 1-6 keV and $0.0021 {\pm} 0.0028$ cpd/kg/keV in the 2-6 keV energy regions. These results are not compatible with DAMA/LIBRA's assertion for their observation of annual modulation at $3.7σ$ and $2.6σ$, respectively. A simple combination of the newly released 6-year datasets from both experiments finds values consistent with no modulation at $0.0005 {\pm} 0.0019$ cpd/kg/keV in the 1-6 keV and $0.0027 {\pm} 0.0021$ cpd/kg/keV in the 2-6 keV energy regions with $4.68σ$ and $3.53σ$ respective exclusions of the DAMA/LIBRA signal.

Submitted 25 March, 2025; originally announced March 2025.

Comments: 6 pages, 4 figures, 3 tables

arXiv:2503.13058 [pdf, other]

Do Vision Models Develop Human-Like Progressive Difficulty Understanding?

Authors: Zeyi Huang, Utkarsh Ojha, Yuyang Ji, Donghyun Lee, Yong Jae Lee

Abstract: When a human undertakes a test, their responses likely follow a pattern: if they answered an easy question $(2 \times 3)$ incorrectly, they would likely answer a more difficult one $(2 \times 3 \times 4)$ incorrectly; and if they answered a difficult question correctly, they would likely answer the easy one correctly. Anything else hints at memorization. Do current visual recognition models exhibit a similarly structured learning capacity? In this work, we consider the task of image classification and study if those models' responses follow that pattern. Since real images aren't labeled with difficulty, we first create a dataset of 100 categories, 10 attributes, and 3 difficulty levels using recent generative models: for each category (e.g., dog) and attribute (e.g., occlusion), we generate images of increasing difficulty (e.g., a dog without occlusion, a dog only partly visible). We find that most of the models do in fact behave similarly to the aforementioned pattern around 80-90% of the time. Using this property, we then explore a new way to evaluate those models. Instead of testing the model on every possible test image, we create an adaptive test akin to the GRE, in which the model's performance on the current round of images determines the test images in the next round. This allows the model to skip over questions too easy/hard for itself, and helps us get its overall performance in fewer steps.

Submitted 17 March, 2025; originally announced March 2025.

arXiv:2503.06349 [pdf, other]

doi 10.1145/3706599.3720147

Fits like a Flex-Glove: Automatic Design of Personalized FPCB-Based Tactile Sensing Gloves

Authors: Devin Murphy, Yichen Li, Crystal Owens, Layla Stanton, Young Joong Lee, Paul Pu Liang, Yiyue Luo, Antonio Torralba, Wojciech Matusik

Abstract: Resistive tactile sensing gloves have captured the interest of researchers spanning diverse domains, such as robotics, healthcare, and human-computer interaction. However, existing fabrication methods often require labor-intensive assembly or costly equipment, limiting accessibility. Leveraging flexible printed circuit board (FPCB) technology, we present an automated pipeline for generating resistive tactile sensing glove design files solely from a simple hand photo on legal-size paper, which can be readily supplied to commercial board houses for manufacturing. Our method enables cost-effective, accessible production at under \$130 per glove with sensor assembly times under 15 minutes. Sensor performance was characterized under varying pressure loads, and a preliminary user evaluation showcases four unique automatically manufactured designs, evaluated for their reliability and comfort.

Submitted 8 March, 2025; originally announced March 2025.

Comments: 8 pages, 6 figures, to be published in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '25)

arXiv:2502.18530 [pdf, ps, other]

IMPROVE: Iterative Model Pipeline Refinement and Optimization Leveraging LLM Experts

Authors: Eric Xue, Ke Chen, Zeyi Huang, Yuyang Ji, Yong Jae Lee, Haohan Wang

Abstract: Large language model (LLM) agents have emerged as a promising solution to automate the workflow of machine learning, but most existing methods share a common limitation: they attempt to optimize entire pipelines in a single step before evaluation, making it difficult to attribute improvements to specific changes. This lack of granularity leads to unstable optimization and slower convergence, limiting their effectiveness. To address this, we introduce Iterative Refinement, a novel strategy for LLM-driven ML pipeline design inspired by how human ML experts iteratively refine models, focusing on one component at a time rather than making sweeping changes all at once. By systematically updating individual components based on real training feedback, Iterative Refinement improves overall model performance. We also provide some theoretical evidence of the superior properties of Iterative Refinement. Further, we implement this strategy in IMPROVE, an end-to-end LLM agent framework for automating and optimizing object classification pipelines. Through extensive evaluations across datasets of varying sizes and domains, we demonstrate that Iterative Refinement enables IMPROVE to consistently achieve better performance over existing zero-shot LLM-based approaches.

Submitted 1 June, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

arXiv:2502.07778 [pdf, other]

Stay-Positive: A Case for Ignoring Real Image Features in Fake Image Detection

Authors: Anirudh Sundara Rajan, Yong Jae Lee

Abstract: Detecting AI-generated images is a challenging yet essential task. A primary difficulty arises from the detector's tendency to rely on spurious patterns, such as compression artifacts, which can influence its decisions. These issues often stem from specific patterns that the detector associates with the real data distribution, making it difficult to isolate the actual generative traces. We argue that an image should be classified as fake if and only if it contains artifacts introduced by the generative model. Based on this premise, we propose Stay Positive, an algorithm designed to constrain the detector's focus to generative artifacts while disregarding those associated with real data. Experimental results demonstrate that detectors trained with Stay Positive exhibit reduced susceptibility to spurious correlations, leading to improved generalization and robustness to post-processing. Additionally, unlike detectors that associate artifacts with real images, those that focus purely on fake artifacts are better at detecting inpainted real images.

Submitted 25 May, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

arXiv:2502.05353 [pdf, other]

Point-Identifying Semiparametric Sample Selection Models with No Excluded Variable

Authors: Dongwoo Kim, Young Jun Lee

Abstract: Sample selection is pervasive in applied economic studies. This paper develops semiparametric selection models that achieve point identification without relying on exclusion restrictions, an assumption long believed necessary for identification in semiparametric selection models. Our identification conditions require at least one continuously distributed covariate and certain nonlinearity in the selection process. We propose a two-step plug-in estimator that is root-n-consistent, asymptotically normal, and computationally straightforward (readily available in statistical software), allowing for heteroskedasticity. Our approach provides a middle ground between Lee (2009)'s nonparametric bounds and Honoré and Hu (2020)'s linear selection bounds, while ensuring point identification. Simulation evidence confirms its excellent finite-sample performance. We apply our method to estimate the racial and gender wage disparity using data from the US Current Population Survey. Our estimates tend to lie outside the Honoré and Hu bounds.

Submitted 7 February, 2025; originally announced February 2025.

arXiv:2502.04650 [pdf, other]

Context images for Venus Express radio occultation measurements: A search for a correlation between temperature structure and UV contrasts in the clouds of Venus

Authors: Maarten Roos-Serote, Colin Wilson, Ryan MacDonald, Silvia Tellmann, Yeon Joo Lee, Igor Khatuntsev

Abstract: Venus exhibits strong and changing contrasts at ultraviolet wavelengths apparently related to the clouds and the dynamics in the cloud layer, but to date their origin continues to be unknown. We investigate the nature of the UV contrasts exhibited by Venus clouds by examining possible correlations between the thermal structure inferred from radio occultation data and UV brightness from imagery data, both observed with Venus Express. We analyse Venus Express images obtained from 11 hours before to a few hours after the time of radio occultation measurements of the same area. We account for the advection of clouds by zonal and meridional winds and apply a phase angle correction to compensate for the changing viewing geometry. We find a possible anti-correlation between UV-brightness and atmospheric temperature in the 65-70 km altitude range for low latitudes. Heating in this altitude and latitude region due to an increase in the UV-absorber has been predicted by radiative forcing studies. The predictions roughly match our observed temperature amplitude between UV-dark and UV-bright regions. We find no evidence for any correlation between UV-brightness and static stability in the atmosphere in the 50-80 km altitude region. This could be the first observational evidence for a direct link between UV-brightness and atmospheric temperature in the 65-70 km altitude region in the clouds of Venus.

Submitted 6 February, 2025; originally announced February 2025.

Comments: 10 pages, 12 figures, submitted to A&A January 2025

arXiv:2501.13665 [pdf, other]

Limits on WIMP dark matter with NaI(Tl) crystals in three years of COSINE-100 data

Authors: G. H. Yu, N. Carlin, J. Y. Cho, J. J. Choi, S. Choi, A. C. Ezeribe, L. E. Franca, C. Ha, I. S. Hahn, S. J. Hollick, E. J. Jeon, H. W. Joo, W. G. Kang, M. Kauer, B. H. Kim, H. J. Kim, J. Kim, K. W. Kim, S. H. Kim, S. K. Kim, W. K. Kim, Y. D. Kim, Y. H. Kim, Y. J. Ko, D. H. Lee, et al. (34 additional authors not shown)

Abstract: We report limits on WIMP dark matter derived from three years of data collected by the COSINE-100 experiment with NaI(Tl) crystals, achieving an improved energy threshold of 0.7 keV. This lowered threshold enhances sensitivity in the sub-GeV mass range, extending the reach for direct detection of low-mass dark matter. Although no excess of WIMP-like events was observed, the increased sensitivity enabled a model-independent comparison between the expected WIMP signal rate (based on mass limits from our data) and DAMA's reported modulation amplitude. Our findings strongly disfavor the DAMA signal as originating from WIMP interactions, fully excluding DAMA/LIBRA 3$σ$ allowed regions and providing enhanced WIMP mass limits by an order of magnitude in the spin-independent model compared to previous results. In the spin-dependent model, cross-section upper limits were obtained in the mass range [0.1-5.0] GeV/c$^2$, with additional sensitivity to sub-GeV WIMPs through the inclusion of the Migdal effect. These results represent substantial progress in low-mass dark matter exploration and reinforce constraints on the longstanding DAMA claim.

Submitted 23 January, 2025; originally announced January 2025.

arXiv:2501.11899 [pdf, other]

LASER: Lip Landmark Assisted Speaker Detection for Robustness

Authors: Le Thien Phuc Nguyen, Zhuoran Yu, Yong Jae Lee

Abstract: Active Speaker Detection (ASD) aims to identify speaking individuals in complex visual scenes. While humans can easily detect speech by matching lip movements to audio, current ASD models struggle to establish this correspondence, often misclassifying non-speaking instances when audio and lip movements are unsynchronized. To address this limitation, we propose Lip landmark Assisted Speaker dEtection for Robustness (LASER). Unlike models that rely solely on facial frames, LASER explicitly focuses on lip movements by integrating lip landmarks in training. Specifically, given a face track, LASER extracts frame-level visual features and the 2D coordinates of lip landmarks using a lightweight detector. These coordinates are encoded into dense feature maps, providing spatial and structural information on lip positions. Recognizing that landmark detectors may sometimes fail under challenging conditions (e.g., low resolution, occlusions, extreme angles), we incorporate an auxiliary consistency loss to align predictions from both lip-aware and face-only features, ensuring reliable performance even when lip data is absent. Extensive experiments across multiple datasets show that LASER outperforms state-of-the-art models, especially in scenarios with desynchronized audio and visuals, demonstrating robust performance in real-world video contexts. Code is available at https://github.com/plnguyen2908/LASER_ASD.

Submitted 21 January, 2025; originally announced January 2025.

arXiv:2501.04851 [pdf, ps, other]

Polynomially growing integer sequences all whose terms are composite

Authors: Dan Ismailescu, Yunkyu James Lee

Abstract: We identify pairs of positive integers $(t, d)$ with the property that the integer sequence with general term $\lfloor n^t/d \rfloor$ contains at most finitely many primes.

Submitted 8 January, 2025; originally announced January 2025.

Comments: 12 pages, 1 table

MSC Class: 11B50; 11A41; 11Y55
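
The property described in the abstract above can be explored numerically. The following is a minimal sketch, not the authors' method: it lists the primes among the first terms of $\lfloor n^t/d \rfloor$ for a chosen pair. The function names and the pair $(t, d) = (2, 3)$ are illustrative assumptions and are not taken from the abstract. For that pair an elementary factorization explains the result: $n = 3k$ gives $3k^2$, $n = 3k+1$ gives $k(3k+2)$, and $n = 3k+2$ gives $(k+1)(3k+1)$, so only small $n$ can produce a prime.

```python
from math import isqrt


def is_prime(m: int) -> bool:
    """Trial-division primality test; adequate for small exploratory runs."""
    if m < 2:
        return False
    if m % 2 == 0:
        return m == 2
    for p in range(3, isqrt(m) + 1, 2):
        if m % p == 0:
            return False
    return True


def primes_in_sequence(t: int, d: int, n_max: int) -> list[tuple[int, int]]:
    """Return (n, floor(n**t / d)) for every 1 <= n <= n_max whose term is prime."""
    return [
        (n, n**t // d)  # integer floor division gives floor(n^t / d) exactly
        for n in range(1, n_max + 1)
        if is_prime(n**t // d)
    ]


if __name__ == "__main__":
    # (t, d) = (2, 3) is a hypothetical example pair chosen for illustration.
    print(primes_in_sequence(2, 3, 1000))  # prints [(3, 3), (4, 5)]
```

A finite search like this can only suggest that a pair has the property; establishing it for all $n$ requires an argument such as the factorization noted above.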

arXiv:2501.04336 [pdf, other]

Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

Authors: Zeyi Huang, Yuyang Ji, Xiaofang Wang, Nikhil Mehta, Tong Xiao, Donghyun Lee, Sigmund Vanvalkenburgh, Shengxin Zha, Bolin Lai, Licheng Yu, Ning Zhang, Yong Jae Lee, Miao Liu

Abstract: Long-form video understanding with Large Vision Language Models is challenged by the need to analyze temporally dispersed yet spatially concentrated key moments within limited context windows. In this work, we introduce VideoMindPalace, a new framework inspired by the "Mind Palace", which organizes critical video moments into a topologically structured semantic graph. VideoMindPalace organizes key information through (i) hand-object tracking and interaction, (ii) clustered activity zones representing specific areas of recurring activities, and (iii) environment layout mapping, allowing natural language parsing by LLMs to provide grounded insights on spatio-temporal and 3D context. In addition, we propose the Video MindPalace Benchmark (VMB), to assess human-like reasoning, including spatial localization, temporal reasoning, and layout-aware sequential understanding. Evaluated on VMB and established video QA datasets, including EgoSchema, NExT-QA, IntentQA, and the Active Memories Benchmark, VideoMindPalace demonstrates notable gains in spatio-temporal coherence and human-aligned reasoning, advancing long-form video analysis capabilities in VLMs.

Submitted 8 January, 2025; originally announced January 2025.

arXiv:2501.02791 [pdf, other]

Orthogonal greedy algorithm for linear operator learning with shallow neural network

Authors: Ye Lin, Jiwei Jia, Young Ju Lee, Ran Zhang

Abstract: Greedy algorithms, particularly the orthogonal greedy algorithm (OGA), have proven effective in training shallow neural networks for fitting functions and solving partial differential equations (PDEs). In this paper, we extend the application of OGA to the tasks of linear operator learning, which is equivalent to learning the kernel function through integral transforms. Firstly, a novel greedy algorithm is developed for kernel estimation rate in a new semi-inner product, which can be utilized to approximate the Green's function of linear PDEs from data. Secondly, we introduce the OGA for point-wise kernel estimation to further improve the approximation rate, achieving orders of accuracy improvement across various tasks and baseline models. In addition, we provide a theoretical analysis on the kernel estimation problem and the optimal approximation rates for both algorithms, establishing their efficacy and potential for future applications in PDEs and operator learning tasks.

Submitted 6 January, 2025; originally announced January 2025.

arXiv:2411.05256 [pdf, ps, other]

doi 10.1088/1748-0221/20/06/T06006

Radiopurity measurements of liquid scintillator for the COSINE-100 Upgrade

Authors: J. Kim, C. Ha, S. H. Kim, W. K. Kim, Y. D. Kim, Y. J. Ko, E. K. Lee, H. Lee, H. S. Lee, I. S. Lee, J. Lee, S. H. Lee, S. M. Lee, Y. J. Lee, G. H. Yu

Abstract: A new 2,400 L liquid scintillator has been produced for the COSINE-100 Upgrade, which is under construction at Yemilab for the next COSINE dark matter experiment phase. The linear-alkyl-benzene-based scintillator is designed to serve as a veto for NaI(Tl) crystal targets and a separate platform for rare event searches. We measured its radiopurity using a sample held in a custom-made 445 mL cylindrical Teflon container equipped with two 3-inch photomultiplier tubes. Analyses show activity levels of $0.091 \pm 0.042$ mBq/kg for $^{238}$U and $0.012 \pm 0.007$ mBq/kg for $^{232}$Th.

Submitted 30 June, 2025; v1 submitted 7 November, 2024; originally announced November 2024.

Journal ref: J. Instrum. 20 (2025) T06006

arXiv:2410.20743 [pdf]

Routing Light Emission from Monolayer MoS$_2$ by Mie Resonances of Crystalline Silicon Nanospheres

Authors: Keisuke Ozawa, Hiroshi Sugimoto, Daisuke Shima, Tatsuki Hinamoto, Mojtaba Karimi Habil, Yan Joe Lee, Søren Raza, Keisuke Imaeda, Kosei Ueno, Mark L. Brongersma, Minoru Fujii

Abstract: A dielectric Mie-resonant nanoantenna is capable of controlling the directionality of the emission from nearby quantum emitters through the excitation of multiple degenerate Mie resonances. A crystalline silicon nanosphere (Si NS) is a promising candidate for a dielectric nanoantenna because crystalline Si has a large refractive index (3.8 at 650 nm) and a small imaginary part of the complex refractive index (0.015 at 650 nm) as an optical material. In this work, we control the emission directionality of excitons supported by monolayer transition metal dichalcogenides (1L-TMDCs) using a Si NS. We first discuss the condition to extract the emission preferentially towards the Si NS side from the analytical calculations. We then study the photoluminescence (PL) of 1L-TMDCs on which differently sized single Si NSs are placed. We show that the PL spectral shape strongly depends on the emission direction, and that the emission toward the Si NS side (top) with respect to the opposite side (bottom) is the largest at wavelengths between the magnetic dipole and electric dipole Mie resonances of a Si NS. Finally, we quantitatively discuss the spectral shape of the top-to-bottom ratio from numerical simulations.

Submitted 28 October, 2024; originally announced October 2024.

Comments: 8 pages, 5 figures

arXiv:2410.19122 [pdf, ps, other]

Greedy Algorithm for Neural Networks for Indefinite Elliptic Problems

Authors: Qingguo Hong, Jiwei Jia, Young Ju Lee, Ziqian Li

Abstract: The paper presents an a priori error analysis of the shallow neural network approximation to the solution of the indefinite elliptic equation and a cutting-edge implementation of the Orthogonal Greedy Algorithm (OGA) tailored to overcome the challenges of indefinite elliptic problems, a domain where conventional approaches often struggle because of nontraditional difficulties caused by the lack of coerciveness. A rigorous a priori error analysis that shows the neural network's ability to approximate indefinite problems is confirmed numerically by OGA methods. We also present a discretization error analysis of the relevant numerical quadrature. In particular, massive numerical implementations are conducted to justify the theory, some of which showcase the OGA's superior performance in comparison to the traditional finite element method. This advancement illustrates the potential of neural networks enhanced by OGA to solve intricate computational problems more efficiently, thereby marking a significant leap forward in the application of machine learning techniques to mathematical problem-solving.

Submitted 24 October, 2024; originally announced October 2024.

Comments: 17 pages

arXiv:2410.11835 [pdf, other]

Aligned Datasets Improve Detection of Latent Diffusion-Generated Images

Authors: Anirudh Sundara Rajan, Utkarsh Ojha, Jedidiah Schloesser, Yong Jae Lee

Abstract: As latent diffusion models (LDMs) democratize image generation capabilities, there is a growing need to detect fake images. A good detector should focus on the generative model's fingerprints while ignoring image properties such as semantic content, resolution, file format, etc. Fake image detectors are usually built in a data-driven way, where a model is trained to separate real from fake images. Existing works primarily investigate network architecture choices and training recipes. In this work, we argue that in addition to these algorithmic choices, we also require a well-aligned dataset of real/fake images to train a robust detector. For the family of LDMs, we propose a very simple way to achieve this: we reconstruct all the real images using the LDM's autoencoder, without any denoising operation. We then train a model to separate these real images from their reconstructions. The fakes created this way are extremely similar to the real ones in almost every aspect (e.g., size, aspect ratio, semantic content), which forces the model to look for the LDM decoder's artifacts. We empirically show that this way of creating aligned real/fake datasets, which also sidesteps the computationally expensive denoising process, helps in building a detector that focuses less on spurious correlations, something that a very popular existing method is susceptible to. Finally, to demonstrate just how effective the alignment in a dataset can be, we build a detector using images that are not natural objects, and present promising results. Overall, our work identifies the subtle but significant issues that arise when training a fake image detector and proposes a simple and inexpensive solution to address these problems.

Submitted 26 February, 2025; v1 submitted 15 October, 2024; originally announced October 2024.

arXiv:2410.10818 [pdf, other]

TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

Authors: Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, Jianwei Yang

Abstract: Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Due to the lack of fine-grained temporal annotations, existing video benchmarks mostly resemble static image benchmarks and are incompetent at evaluating models for temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs, derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating various temporal understanding and reasoning abilities such as action frequency, motion magnitude, event order, etc. Moreover, it enables evaluations on various tasks like both video question answering and captioning, both short and long video understanding, as well as different models such as multimodal video embedding models and text generation models. Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench, demonstrating a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we notice a critical pitfall for multi-choice QA where LLMs can detect the subtle changes in negative captions and find a centralized description as a cue for its prediction, where we propose Multiple Binary Accuracy (MBA) to correct such bias. We hope that TemporalBench can foster research on improving models' temporal reasoning capabilities. Both dataset and evaluation code will be made available.

Submitted 15 October, 2024; v1 submitted 14 October, 2024; originally announced October 2024.

Comments: Project Page: https://temporalbench.github.io/

arXiv:2410.02763 [pdf, other]

Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos

Authors: Jianrui Zhang, Mu Cai, Yong Jae Lee

Abstract: There has been growing sentiment recently that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension. As a result, both academia and industry are gradually shifting their attention towards the more complex challenges posed by understanding long-form videos. However, is this really the case? Our studies indicate that LMMs still lack many fundamental reasoning capabilities even when dealing with short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation benchmark encompassing 1000 short and natural video-caption pairs. We demonstrate that existing LMMs severely struggle to distinguish temporal differences between different actions and object transformations. For example, the best model GPT-4o only obtains ~50% on our text and video scores, showing a large gap compared to the human baseline of ~90%. All open-source multimodal models and CLIP-based models perform much worse, producing mostly random chance performance. Through this work, we shed light onto the fact that temporal reasoning in short videos is a problem yet to be fully solved. The dataset and evaluation code are available at https://vinoground.github.io.

Submitted 3 October, 2024; originally announced October 2024.

Comments: Project Page: https://vinoground.github.io

arXiv:2410.00905 [pdf, other]

Removing Distributional Discrepancies in Captions Improves Image-Text Alignment

Authors: Yuheng Li, Haotian Liu, Mu Cai, Yijun Li, Eli Shechtman, Zhe Lin, Yong Jae Lee, Krishna Kumar Singh

Abstract: In this paper, we introduce a model designed to improve the prediction of image-text alignment, targeting the challenge of compositional understanding in current visual-language models. Our approach focuses on generating high-quality training datasets for the alignment task by producing mixed-type negative captions derived from positive ones. Critically, we address the distribution imbalance between positive and negative captions to ensure that the alignment model does not depend solely on textual information but also considers the associated images for predicting alignment accurately. By creating this enhanced training data, we fine-tune an existing leading visual-language model to boost its capability in understanding alignment. Our model significantly outperforms current top-performing methods across various datasets. We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment. Project page: https://yuheng-li.github.io/LLaVA-score/

Submitted 1 October, 2024; originally announced October 2024.

arXiv:2409.16551 [pdf, ps, other]

fOGA: Orthogonal Greedy Algorithm for Fractional Laplace Equations

Authors: Ruitong Shan, Young Ju Lee, Jiwei Jia

Abstract: In this paper, we explore the finite difference approximation of the fractional Laplace operator in conjunction with a neural network method for solving it. We discretized the fractional Laplace operator using the Riemann-Liouville formula relevant to fractional equations. A shallow neural network was constructed to address the discrete fractional operator, coupled with the OGA algorithm. To validate the feasibility of our approach, we conducted numerical experiments, testing both the Laplace operator and the fractional Laplace operator, yielding favorable convergence results.

Submitted 24 September, 2024; originally announced September 2024.

Comments: 15 pages

arXiv:2409.13226 [pdf, other]

COSINE-100 Full Dataset Challenges the Annual Modulation Signal of DAMA/LIBRA

Authors: N. Carlin, J. Y. Cho, J. J. Choi, S. Choi, A. C. Ezeribe, L. E. Franca, C. Ha, I. S. Hahn, S. J. Hollick, E. J. Jeon, H. W. Joo, W. G. Kang, M. Kauer, B. H. Kim, H. J. Kim, J. Kim, K. W. Kim, S. H. Kim, S. K. Kim, W. K. Kim, Y. D. Kim, Y. H. Kim, Y. J. Ko, D. H. Lee, E. K. Lee, et al. (34 additional authors not shown)

Abstract: For over 25 years, the DAMA/LIBRA collaboration has claimed to observe an annual modulation signal, suggesting the existence of dark matter interactions. However, no other experiments have replicated their result using different detector materials. To address this puzzle, the COSINE-100 collaboration conducted a model-independent test using 106 kg of sodium iodide as detectors, the same target material as DAMA/LIBRA. Analyzing data collected over 6.4 years, with improved energy calibration and time-dependent background description, we found no evidence of an annual modulation signal, challenging the DAMA/LIBRA result with a confidence level greater than 3$σ$. This finding represents a significant step toward resolving the long-standing debate surrounding DAMA/LIBRA's dark matter claim, indicating that the observed modulation is unlikely to be caused by dark matter interactions.

Submitted 20 September, 2024; originally announced September 2024.

arXiv:2409.12963 [pdf, other]

Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner

Authors: Yuzhang Shang, Bingxin Xu, Weitai Kang, Mu Cai, Yuheng Li, Zehao Wen, Zhen Dong, Kurt Keutzer, Yong Jae Lee, Yan Yan

Abstract: Advancements in Large Language Models (LLMs) inspire various strategies for integrating video modalities. A key approach is Video-LLMs, which incorporate an optimizable interface linking sophisticated video encoders to LLMs. However, due to computation and data limitations, these Video-LLMs are typically pre-trained to process only short videos, limiting their broader application for understanding longer video content. Additionally, fine-tuning Video-LLMs to handle longer videos is cost-prohibitive. Consequently, it becomes essential to explore the interpolation of Video-LLMs under a completely training-free setting. In this paper, we first identify the primary challenges in interpolating Video-LLMs: (1) the video encoder and modality alignment projector are fixed, preventing the integration of additional frames into Video-LLMs, and (2) the LLM backbone is limited in its content length capabilities, which complicates the processing of an increased number of video tokens. To address these challenges, we propose a specific INTerPolation method for Video-LLMs (INTP-Video-LLMs). We introduce an alternative video token rearrangement technique that circumvents limitations imposed by the fixed video encoder and alignment projector. Furthermore, we introduce a training-free LLM context window extension method to enable Video-LLMs to understand a correspondingly increased number of visual tokens.

Submitted 1 October, 2024; v1 submitted 19 September, 2024; originally announced September 2024.

arXiv:2409.06827 [pdf, other]

Cross-Modal Self-Supervised Learning with Effective Contrastive Units for LiDAR Point Clouds

Authors: Mu Cai, Chenxu Luo, Yong Jae Lee, Xiaodong Yang

Abstract: 3D perception in LiDAR point clouds is crucial for a self-driving vehicle to properly act in a 3D environment. However, manually labeling point clouds is hard and costly. There has been a growing interest in self-supervised pre-training of 3D perception models. Following the success of contrastive learning in images, current methods mostly conduct contrastive pre-training on point clouds only. Yet an autonomous driving vehicle is typically supplied with multiple sensors including cameras and LiDAR. In this context, we systematically study single modality, cross-modality, and multi-modality for contrastive learning of point clouds, and show that cross-modality wins over other alternatives. In addition, considering the huge difference between the training sources in 2D images and 3D point clouds, it remains unclear how to design more effective contrastive units for LiDAR. We therefore propose the instance-aware and similarity-balanced contrastive units that are tailored for self-driving point clouds. Extensive experiments reveal that our approach achieves remarkable performance gains over various point cloud models across the downstream perception tasks of LiDAR-based 3D object detection and 3D semantic segmentation on four popular benchmarks: Waymo Open Dataset, nuScenes, SemanticKITTI and ONCE.

Submitted 10 September, 2024; originally announced September 2024.

Comments: IROS 2024

arXiv:2409.04353 [pdf, other]

Whole Heart Perfusion with High-Multiband Simultaneous Multislice Imaging via Linear Phase Modulated Extended Field of View (SMILE)

Authors: Shen Zhao, Junyu Wang, Xitong Wang, Sizhuo Liu, Quan Chen, Kevin Kai Li, Yoo Jin Lee, Michael Salerno

Abstract: Purpose: To develop a simultaneous multislice (SMS) first-pass perfusion technique that can achieve whole heart coverage with high multi-band factors, while avoiding the issue of slice leakage. Methods: The proposed Simultaneous Multislice Imaging via Linear phase modulated Extended field of view (SMILE) treats the SMS acquisition and reconstruction within an extended field of view framework, allowing arbitrary under-sampling of phase encoding lines of the extended k-space matrix and enabling the direct application of 2D parallel imaging reconstruction techniques. We presented a theoretical framework that offers insights into the performance of SMILE. We performed retrospective comparison on 28 subjects and prospective perfusion experiments on 43 patients undergoing routine clinical CMR studies with SMILE at multiband (MB) factors of 3-5, with a net acceleration rate ($R$) of 8 and 10 respectively, and compared SMILE to conventional SMS techniques using standard FOV 2D CAIPI acquisition and standard 2D slice separation techniques including split-slice GRAPPA and ROCK-SPIRiT. Results: Retrospective studies demonstrated 5.2 to 8.0 dB improvement in signal to error ratio (SER) of SMILE over CAIPI perfusion. Prospective studies showed good image quality with grades of 4.1 $\pm$ 0.7 for MB = 3, $R$ = 8 and 3.5 $\pm$ 1.0 for MB = 5, $R$ = 10 (5-point Likert scale). Conclusion: The theoretical derivation and experimental results validate SMILE's improved performance at high acceleration and MB as compared to the existing 2D CAIPI SMS acquisition and reconstruction techniques for first-pass myocardial perfusion imaging.

Submitted 27 January, 2025; v1 submitted 6 September, 2024; originally announced September 2024.

Comments: 18 pages, 13 figures

arXiv:2408.14688 [pdf, other]

doi 10.1088/1748-0221/19/12/P12013

Lowering threshold of NaI(Tl) scintillator to 0.7 keV in the COSINE-100 experiment

Authors: G. H. Yu, N. Carlin, J. Y. Cho, J. J. Choi, S. Choi, A. C. Ezeribe, L. E. França, C. Ha, I. S. Hahn, S. J. Hollick, E. J. Jeon, H. W. Joo, W. G. Kang, M. Kauer, B. H. Kim, H. J. Kim, J. Kim, K. W. Kim, S. H. Kim, S. K. Kim, W. K. Kim, Y. D. Kim, Y. H. Kim, Y. J. Ko, D. H. Lee, et al. (34 additional authors not shown)

Abstract: COSINE-100 is a direct dark matter search experiment, with the primary goal of testing the annual modulation signal observed by DAMA/LIBRA, using the same target material, NaI(Tl). In previous analyses, we achieved the same 1 keV energy threshold used in DAMA/LIBRA's analysis that reported an annual modulation signal with 11.6$σ$ significance. In this article, we report an improved analysis that lowered the threshold to 0.7 keV, thanks to the application of a Multi-Layer Perceptron network and a new likelihood parameter with waveforms in the frequency domain. The lower threshold would enable a better comparison of COSINE-100 with new DAMA results with a 0.75 keV threshold and account for differences in quenching factors. Furthermore, the lower threshold can enhance COSINE-100's sensitivity to sub-GeV dark matter searches.

Submitted 22 December, 2024; v1 submitted 26 August, 2024; originally announced August 2024.

Journal ref: JINST 19 P12013 (2024)

arXiv:2408.14419 [pdf, ps, other]

CHARTOM: A Visual Theory-of-Mind Benchmark for LLMs on Misleading Charts

Authors: Shubham Bharti, Shiyun Cheng, Jihyun Rho, Jianrui Zhang, Mu Cai, Yong Jae Lee, Martina Rau, Xiaojin Zhu

Abstract: We introduce CHARTOM, a visual theory-of-mind benchmark designed to evaluate multimodal large language models' capability to understand and reason about misleading data visualizations through charts. CHARTOM consists of carefully designed charts and associated questions that require a language model to not only correctly comprehend the factual content in the chart (the FACT question) but also judge whether the chart will be misleading to a human reader (the MIND question), a dual capability with significant societal benefits. We detail the construction of our benchmark including its calibration on human performance and estimation of MIND ground truth called the Human Misleadingness Index. We evaluated several leading LLMs -- including GPT, Claude, Gemini, Qwen, Llama, and Llava series models -- on the CHARTOM dataset and found that it was challenging to all models both on FACT and MIND questions. This highlights the limitations of current LLMs and presents a significant opportunity for future LLMs to improve on understanding misleading charts.

Submitted 28 June, 2025; v1 submitted 26 August, 2024; originally announced August 2024.

arXiv:2408.09806 [pdf, other]

Improved background modeling for dark matter search with COSINE-100

Authors: G. H. Yu, N. Carlin, J. Y. Cho, J. J. Choi, S. Choi, A. C. Ezeribe, L. E. Franca, C. Ha, I. S. Hahn, S. J. Hollick, E. J. Jeon, H. W. Joo, W. G. Kang, M. Kauer, B. H. Kim, H. J. Kim, J. Kim, K. W. Kim, S. H. Kim, S. K. Kim, W. K. Kim, Y. D. Kim, Y. H. Kim, Y. J. Ko, D. H. Lee, et al. (33 additional authors not shown)

Abstract: COSINE-100 aims to conclusively test the claimed dark matter annual modulation signal detected by the DAMA/LIBRA collaboration. DAMA/LIBRA has released updated analysis results by lowering the energy threshold to 0.75 keV through various upgrades. They have consistently claimed to have observed the annual modulation. In COSINE-100, it is crucial to lower the energy threshold for a direct comparison with DAMA/LIBRA, which also enhances the sensitivity of the search for low-mass dark matter, enabling COSINE-100 to explore this area. Therefore, it is essential to have a precise and quantitative understanding of the background spectrum across all energy ranges. This study expands the background modeling from 0.7 to 4000 keV using 2.82 years of COSINE-100 data. The modeling has been improved to describe the background spectrum across all energy ranges accurately. Assessments of the background spectrum are presented, considering the nonproportionality of NaI(Tl) crystals at both low and high energies and the characteristic X-rays produced by the interaction of external backgrounds with materials such as copper. Additionally, constraints on the fit parameters obtained from the alpha spectrum modeling fit are integrated into this model. These improvements are detailed in the paper.

Submitted 19 August, 2024; originally announced August 2024.

arXiv:2407.10972 [pdf, other]

VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation

Authors: Bocheng Zou, Mu Cai, Jianrui Zhang, Yong Jae Lee

Abstract: In the realm of vision models, the primary mode of representation is using pixels to rasterize the visual world. Yet this is not always the best or unique way to represent visual content, especially for designers and artists who depict the world using geometry primitives such as polygons. Vector graphics (VG), on the other hand, offer a textual representation of visual content, which can be more concise and powerful for content like cartoons, sketches and scientific figures. Recent studies have shown promising results on processing vector graphics with capable Large Language Models (LLMs). However, such works focus solely on qualitative results, understanding, or a specific type of vector graphics. We propose VGBench, a comprehensive benchmark for LLMs on handling vector graphics through diverse aspects, including (a) both visual understanding and generation, (b) evaluation of various vector graphics formats, (c) diverse question types, (d) a wide range of prompting techniques, (e) multiple LLMs and (f) comparison with VLMs on rasterized representations. Evaluating on our collected 4279 understanding and 5845 generation samples, we find that LLMs show strong capability on both aspects while exhibiting less desirable performance on low-level formats (SVG). Both the data and the evaluation pipeline will be open-sourced at https://vgbench.github.io.

Submitted 29 August, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

Comments: Project Page: https://vgbench.github.io

arXiv:2407.09541 [pdf, other]

MATE: Meet At The Embedding -- Connecting Images with Long Texts

Authors: Young Kyun Jang, Junmo Kang, Yong Jae Lee, Donghyun Kim

Abstract: While advancements in Vision Language Models (VLMs) have significantly improved the alignment of visual and textual data, these models primarily focus on aligning images with short descriptive captions. This focus limits their ability to handle complex text interactions, particularly with longer texts such as lengthy captions or documents, which have not been extensively explored yet. In this paper, we introduce Meet At The Embedding (MATE), a novel approach that combines the capabilities of VLMs with Large Language Models (LLMs) to overcome this challenge without the need for additional image-long text pairs. Specifically, we replace the text encoder of the VLM with a pretrained LLM-based encoder that excels in understanding long texts. To bridge the gap between VLM and LLM, MATE incorporates a projection module that is trained in a multi-stage manner. It starts by aligning the embeddings from the VLM text encoder with those from the LLM using extensive text pairs. This module is then employed to seamlessly align image embeddings closely with LLM embeddings. We propose two new cross-modal retrieval benchmarks to assess the task of connecting images with long texts (lengthy captions / documents). Extensive experimental results demonstrate that MATE effectively connects images with long texts, uncovering diverse semantic relationships.

Submitted 26 June, 2024; originally announced July 2024.

arXiv:2407.03593 [pdf, other]

Green Multigrid Network

Authors: Ye Lin, Young Ju Lee, Jiwei Jia

Abstract: GreenLearning networks (GL) directly learn Green's function in physical space, making them an interpretable model for capturing unknown solution operators of partial differential equations (PDEs). For many PDEs, the corresponding Green's function exhibits asymptotic smoothness. In this paper, we propose a framework named Green Multigrid networks (GreenMGNet), an operator learning algorithm designed for a class of asymptotically smooth Green's functions. Compared with the pioneering GL, the new framework presents itself with better accuracy and efficiency, thereby achieving a significant improvement. GreenMGNet is composed of two technical novelties. First, Green's function is modeled as a piecewise function to take into account its singular behavior in some parts of the hyperplane. Such a piecewise function is then approximated by a neural network with augmented output (AugNN) so that it can capture singularity accurately. Second, the asymptotic smoothness property of Green's function is used to leverage the Multi-Level Multi-Integration (MLMI) algorithm for both the training and inference stages. Several test cases of operator learning are presented to demonstrate the accuracy and effectiveness of the proposed method. On average, GreenMGNet achieves $3.8\%$ to $39.15\%$ accuracy improvement. To match the accuracy level of GL, GreenMGNet requires only about $10\%$ of the full grid data, resulting in a $55.9\%$ and $92.5\%$ reduction in training time and GPU memory cost for one-dimensional test problems, and a $37.7\%$ and $62.5\%$ reduction for two-dimensional test problems.

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2406.20095 [pdf, other]

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Authors: Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo

Abstract: Vision Language Models (VLMs) have recently been leveraged to generate robotic actions, forming Vision-Language-Action (VLA) models. However, directly adapting a pretrained VLM for robotic control remains challenging, particularly when constrained by a limited number of robot demonstrations. In this work, we introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations and enables an efficient transfer of a pretrained VLM into a powerful VLA, motivated by the success of visual instruction tuning in Computer Vision. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets, aligning robotic actions with image pixel coordinates. Further, we enhance this dataset in a self-supervised manner by defining six auxiliary tasks, without requiring any additional action annotations. We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control. Through experiments across multiple simulated and real-world tasks, we demonstrate that LLaRA achieves state-of-the-art performance while preserving the generalization capabilities of large language models. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.

Submitted 30 January, 2025; v1 submitted 28 June, 2024; originally announced June 2024.

Comments: ICLR 2025

arXiv:2406.09400 [pdf, other]

Yo'LLaVA: Your Personalized Language and Vision Assistant

Authors: Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, Yong Jae Lee

Abstract: Large Multimodal Models (LMMs) have shown remarkable capabilities across a variety of tasks (e.g., image captioning, visual question answering). While broad, their knowledge remains generic (e.g., recognizing a dog), and they are unable to handle personalized subjects (e.g., recognizing a user's pet dog). Human reasoning, in contrast, typically operates within the context of specific subjects in our surroundings. For example, one might ask, "What should I buy for my dog's birthday?"; as opposed to a generic inquiry about "What should I buy for a dog's birthday?". Similarly, when looking at a friend's image, the interest lies in seeing their activities (e.g., "my friend is holding a cat"), rather than merely observing generic human actions (e.g., "a man is holding a cat"). In this paper, we introduce the novel task of personalizing LMMs, so that they can have conversations about a specific subject. We propose Yo'LLaVA, which learns to embed a personalized subject into a set of latent tokens given a handful of example images of the subject. Our qualitative and quantitative analyses reveal that Yo'LLaVA can learn the concept more efficiently using fewer tokens and more effectively encode the visual attributes compared to strong prompting baselines (e.g., LLaVA).

Submitted 4 December, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

Comments: NeurIPS 2024; Project page: https://thaoshibe.github.io/YoLLaVA

arXiv:2405.18089 [pdf, other]

Semi-nonparametric models of multidimensional matching: an optimal transport approach

Authors: Dongwoo Kim, Young Jun Lee

Abstract: This paper proposes empirically tractable multidimensional matching models, focusing on worker-job matching. We generalize the parametric model proposed by Lindenlaub (2017), which relies on the assumption of joint normality of observed characteristics of workers and jobs. In our paper, we allow unrestricted distributions of characteristics and show identification of the production technology, and equilibrium wage and matching functions using tools from optimal transport theory. Given identification, we propose efficient, consistent, asymptotically normal sieve estimators. We revisit Lindenlaub's empirical application and show that, between 1990 and 2010, the U.S. economy experienced much larger technological progress favoring cognitive abilities than the original findings suggest. Furthermore, our flexible model specifications provide a significantly better fit for patterns in the evolution of wage inequality.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.17430 [pdf, other]

Matryoshka Multimodal Models

Authors: Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee

Abstract: Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning. These models first embed images into a fixed large number of visual tokens and then feed them into a Large Language Model (LLM). However, this design causes an excessive number of tokens for dense visual scenarios such as high-resolution images and videos, leading to great inefficiency. While token pruning/merging methods do exist, they produce a single length output for each image and do not afford flexibility in trading off information density vs. efficiency. Inspired by the concept of Matryoshka Dolls, we propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens that capture information across multiple coarse-to-fine granularities. Our approach offers several unique benefits for LMMs: (1) One can explicitly control the visual granularity per test instance during inference, e.g., adjusting the number of tokens used to represent an image based on the anticipated complexity or simplicity of the content; (2) M3 provides a framework for analyzing the granularity needed for existing datasets, where we find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens; (3) Our approach provides a foundation to explore the best trade-off between performance and visual token length at sample level, where our investigation reveals that a large gap exists between the oracle upper bound and current fixed-scale representations.

Submitted 29 July, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

Comments: Project Page: https://matryoshka-mm.github.io/

arXiv:2405.16894 [pdf, ps, other]

An Unconstrained Formulation of Some Constrained Partial Differential Equations and its Application to Finite Neuron Methods

Authors: Jiwei Jia, Young Ju Lee, Ruitong Shan

Abstract: In this paper, we present a new framework in which a PDE with constraints is formulated as a sequence of PDEs with no constraints, whose solutions converge to the solution of the PDE with constraints. This framework is then used to build a novel finite neuron method to solve the 2nd order elliptic equations with the Dirichlet boundary condition. Our algorithm is the first algorithm proven to lead to shallow neural network solutions with an optimal H1 norm error. We show that a widely used penalized PDE, which imposes the Dirichlet boundary condition weakly, can be interpreted as the first element of the sequence of PDEs within our framework. Furthermore, numerically, we show that it may not lead to the solution with the optimal H1 norm error bound in general. On the other hand, we theoretically demonstrate that the second and later elements of a sequence of PDEs can lead to an adequate solution with the optimal H1 norm error bound. A number of sample tests are performed to confirm the effectiveness of the proposed algorithm and the relevant theory.

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.13455 [pdf, ps, other]

Carleson measures for weighted Bergman--Zygmund spaces

Authors: Hong Rae Cho, Hyungwoon Koo, Young Joo Lee, Atte Pennanen, Jouni Rättyä, Fanglei Wu

Abstract: For $0<p<\infty$, $Ψ:[0,\infty)\to(0,\infty)$ and a finite positive Borel measure $μ$ on the unit disc $\mathbb{D}$, the Lebesgue--Zygmund space $L^p_{μ,Ψ}$ consists of all measurable functions $f$ such that $\lVert f \rVert_{L_{μ, Ψ}^{p}}^p =\int_{\mathbb{D}}|f|^pΨ(|f|)\,dμ< \infty$. For an integrable radial function $ω$ on $\mathbb{D}$, the corresponding weighted Bergman--Zygmund space $A_{ω, Ψ}^{p}$ is the set of all analytic functions in $L_{μ, Ψ}^{p}$ with $dμ=ω\,dA$. The purpose of the paper is to characterize bounded (and compact) embeddings $A_{ω,Ψ}^{p}\subset L_{μ, Φ}^{q}$, when $0<p\le q<\infty$, the functions $Ψ$ and $Φ$ are essentially monotonic, and $Ψ,Φ,ω$ satisfy certain doubling properties. The tools developed on the way to the main results are applied to characterize bounded and compact integral operators acting from $A^p_{ω,Ψ}$ to $A^q_{ν,Φ}$, provided $ν$ admits the same doubling property as $ω$.

Submitted 22 May, 2024; originally announced May 2024.

arXiv:2404.19167 [pdf]

Advancing low-field MRI with a universal denoising imaging transformer: Towards fast and high-quality imaging

Authors: Zheren Zhu, Azaan Rehman, Xiaozhi Cao, Congyu Liao, Yoo Jin Lee, Michael Ohliger, Hui Xue, Yang Yang

Abstract: Recent developments in low-field (LF) magnetic resonance imaging (MRI) systems present remarkable opportunities for affordable and widespread MRI access. A robust denoising method to overcome the intrinsic low signal-to-noise ratio (SNR) barrier is critical to the success of LF MRI. However, current data-driven MRI denoising methods predominantly handle magnitude images and rely on customized models with constrained data diversity and quantity, which exhibit limited generalizability in clinical applications across diverse MRI systems, pulse sequences, and organs. In this study, we present ImT-MRD: a complex-valued imaging transformer trained on a vast number of clinical MRI scans aiming at universal MR denoising for LF systems. Compared with averaging multiple-repeated scans for higher image SNR, the model obtains better image quality from fewer repetitions, demonstrating its capability for accelerating scans under various clinical settings. Moreover, with its complex-valued image input, the model can denoise intermediate results before advanced post-processing and prepare high-quality data for further MRI research. By delivering universal and accurate denoising across clinical and research tasks, our model holds great promise to expedite the evolution of LF MRI for accessible and equal biomedical applications.

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2403.18881 [pdf]

Transmission IR Microscopy for the Quantitation of Biomolecular Mass In Live Cells

Authors: Yow-Ren Chang, Seong-Min Kim, Young Jong Lee

Abstract: Absolute quantity imaging of biomolecules on a single cell level is critical for measurement assurance in biosciences and bioindustries. While infrared (IR) transmission microscopy is a powerful label-free imaging modality capable of chemical quantification, its applicability to hydrated biological samples remains challenging due to the strong water absorption. We overcome this challenge by applying a solvent absorption compensation (SAC) technique to a home-built quantum cascade laser IR microscope. SAC-IR microscopy improves the chemical sensitivity considerably by adjusting the incident light intensity to pre-compensate the IR absorption by water while retaining the full dynamic range. We demonstrate the label-free chemical imaging of key biomolecules of a cell, such as protein, fatty acid, and nucleic acid, with sub-cellular spatial resolution. By imaging live fibroblast cells over twelve hours, we monitor the mass change of the three molecular species of single cells at various phases, including cell division. While the current live-cell imaging demonstration involved three wavenumbers, more wavenumber images could measure more biomolecules in live cells with higher accuracy. As a label-free method to measure absolute quantities of various molecules in a cell, SAC-IR microscopy can potentially become a standard chemical characterization tool for live cells in biology, medicine, and biotechnology.

Submitted 27 March, 2024; originally announced March 2024.

Comments: Body: 19 pages, 5 figures. Supplemental: 11 pages, 6 figures

arXiv:2403.15388 [pdf, other]

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

Authors: Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan

Abstract: Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model. LMMs typically take in a fixed and large amount of visual tokens, such as the penultimate layer features in the CLIP visual encoder, as the prefix content. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which further increases the number of visual tokens significantly. However, due to the inherent design of the Transformer architecture, the computational costs of these models tend to increase quadratically with the number of input tokens. To tackle this problem, we explore a token reduction mechanism that identifies significant spatial redundancy among visual tokens. In response, we propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs. Specifically, to measure the importance of each token, we exploit the sparsity observed in the visual encoder, characterized by the sparse distribution of attention scores between the class token and visual tokens. This sparsity enables us to dynamically select the most crucial visual tokens to retain. Subsequently, we cluster the selected (unpruned) tokens based on their key similarity and merge them with the unpruned tokens, effectively supplementing and enhancing their informational content. Empirically, when applied to LLaVA-1.5, our approach can compress the visual tokens by 14 times on average, and achieve comparable performance across diverse visual question-answering and reasoning tasks. Code and checkpoints are at https://llava-prumerge.github.io/.

Submitted 22 May, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

Comments: Project page: https://llava-prumerge.github.io/

arXiv:2403.02638 [pdf, other]

Real-time portable muography with Hankuk Atmospheric-muon Wide Landscaping : HAWL

Authors: J. Seo, N. Carlin, D. F. F. S. Cavalcante, J. S. Chung, L. E. Franca, C. Ha, J. Kim, J. Y. Kim, H. Kimku, B. C. Koh, Y. J. Lee, B. B. Manzato, S. W. Oh, R. L. C. Pitta, S. J. Won

Abstract: Cosmic ray muons prove valuable across various fields, from particle physics experiments to non-invasive tomography, thanks to their high flux and exceptional penetrating capability. Utilizing a scintillator detector, one can effectively study the topography of mountains situated above tunnels and underground spaces. The Hankuk Atmospheric-muon Wide Landscaping (HAWL) project successfully charts the mountainous region of eastern Korea by measuring cosmic ray muons with a detector in motion. The real-time muon flux measurement shows a tunnel length accuracy of 6.0 %, with a detectable overburden range spanning from 8 to 400 meter-water-equivalent depth. This is the first real-time portable muon tomography.

Submitted 4 August, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

Comments: 10 pages, 12 figures

Showing 1–50 of 182 results for author: Lee, Y J