-
A Training-Free Style-Personalization via Scale-wise Autoregressive Model
Authors:
Kyoungmin Lee,
Jihun Park,
Jongmin Gim,
Wonhyeok Choi,
Kyumin Hwang,
Jaeyeul Kim,
Sunghoon Im
Abstract:
We present a training-free framework for style-personalized image generation that controls content and style information during inference using a scale-wise autoregressive model. Our method employs a three-path design--content, style, and generation--each guided by a corresponding text prompt, enabling flexible and efficient control over image semantics without any additional training. A central c…
▽ More
We present a training-free framework for style-personalized image generation that controls content and style information during inference using a scale-wise autoregressive model. Our method employs a three-path design--content, style, and generation--each guided by a corresponding text prompt, enabling flexible and efficient control over image semantics without any additional training. A central contribution of this work is a step-wise and attention-wise intervention analysis. Through systematic prompt and feature injection, we find that early-to-middle generation steps play a pivotal role in shaping both content and style, and that query features predominantly encode content-specific information. Guided by these insights, we introduce two targeted mechanisms: Key Stage Attention Sharing, which aligns content and style during the semantically critical steps, and Adaptive Query Sharing, which reinforces content semantics in later steps through similarity-aware query blending. Extensive experiments demonstrate that our method achieves competitive style fidelity and prompt fidelity compared to fine-tuned baselines, while offering faster inference and greater deployment flexibility.
△ Less
Submitted 6 July, 2025;
originally announced July 2025.
-
Reasoning or Not? A Comprehensive Evaluation of Reasoning LLMs for Dialogue Summarization
Authors:
Keyan Jin,
Yapeng Wang,
Leonel Santos,
Tao Fang,
Xu Yang,
Sio Kei Im,
Hugo Gonçalo Oliveira
Abstract:
Dialogue summarization is a challenging task with significant practical value in customer service, meeting analysis, and conversational AI. Although large language models (LLMs) have achieved substantial progress in summarization tasks, the performance of step-by-step reasoning architectures-specifically Long Chain-of-Thought (CoT) implementations such as OpenAI-o1 and DeepSeek-R1-remains unexplor…
▽ More
Dialogue summarization is a challenging task with significant practical value in customer service, meeting analysis, and conversational AI. Although large language models (LLMs) have achieved substantial progress in summarization tasks, the performance of step-by-step reasoning architectures-specifically Long Chain-of-Thought (CoT) implementations such as OpenAI-o1 and DeepSeek-R1-remains unexplored for dialogue scenarios requiring concurrent abstraction and conciseness. In this work, we present the first comprehensive and systematic evaluation of state-of-the-art reasoning LLMs and non-reasoning LLMs across three major paradigms-generic, role-oriented, and query-oriented dialogue summarization. Our study spans diverse languages, domains, and summary lengths, leveraging strong benchmarks (SAMSum, DialogSum, CSDS, and QMSum) and advanced evaluation protocols that include both LLM-based automatic metrics and human-inspired criteria. Contrary to trends in other reasoning-intensive tasks, our findings show that explicit stepwise reasoning does not consistently improve dialogue summarization quality. Instead, reasoning LLMs are often prone to verbosity, factual inconsistencies, and less concise summaries compared to their non-reasoning counterparts. Through scenario-specific analyses and detailed case studies, we further identify when and why explicit reasoning may fail to benefit-or even hinder-summarization in complex dialogue contexts. Our work provides new insights into the limitations of current reasoning LLMs and highlights the need for targeted modeling and evaluation strategies for real-world dialogue summarization.
△ Less
Submitted 2 July, 2025;
originally announced July 2025.
-
ADAptation: Reconstruction-based Unsupervised Active Learning for Breast Ultrasound Diagnosis
Authors:
Yaofei Duan,
Yuhao Huang,
Xin Yang,
Luyi Han,
Xinyu Xie,
Zhiyuan Zhu,
Ping He,
Ka-Hou Chan,
Ligang Cui,
Sio-Kei Im,
Dong Ni,
Tao Tan
Abstract:
Deep learning-based diagnostic models often suffer performance drops due to distribution shifts between training (source) and test (target) domains. Collecting and labeling sufficient target domain data for model retraining represents an optimal solution, yet is limited by time and scarce resources. Active learning (AL) offers an efficient approach to reduce annotation costs while maintaining perf…
▽ More
Deep learning-based diagnostic models often suffer performance drops due to distribution shifts between training (source) and test (target) domains. Collecting and labeling sufficient target domain data for model retraining represents an optimal solution, yet is limited by time and scarce resources. Active learning (AL) offers an efficient approach to reduce annotation costs while maintaining performance, but struggles to handle the challenge posed by distribution variations across different datasets. In this study, we propose a novel unsupervised Active learning framework for Domain Adaptation, named ADAptation, which efficiently selects informative samples from multi-domain data pools under limited annotation budget. As a fundamental step, our method first utilizes the distribution homogenization capabilities of diffusion models to bridge cross-dataset gaps by translating target images into source-domain style. We then introduce two key innovations: (a) a hypersphere-constrained contrastive learning network for compact feature clustering, and (b) a dual-scoring mechanism that quantifies and balances sample uncertainty and representativeness. Extensive experiments on four breast ultrasound datasets (three public and one in-house/multi-center) across five common deep classifiers demonstrate that our method surpasses existing strong AL-based competitors, validating its effectiveness and generalization for clinical domain adaptation. The code is available at the anonymized link: https://github.com/miccai25-966/ADAptation.
△ Less
Submitted 1 July, 2025;
originally announced July 2025.
-
DeRAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction
Authors:
Solee Im,
Wonjun Lee,
Jinmyeong An,
Yunsu Kim,
Jungseul Ok,
Gary Geunbae Lee
Abstract:
We present DeRAGEC, a method for improving Named Entity (NE) correction in Automatic Speech Recognition (ASR) systems. By extending the Retrieval-Augmented Generative Error Correction (RAGEC) framework, DeRAGEC employs synthetic denoising rationales to filter out noisy NE candidates before correction. By leveraging phonetic similarity and augmented definitions, it refines noisy retrieved NEs using…
▽ More
We present DeRAGEC, a method for improving Named Entity (NE) correction in Automatic Speech Recognition (ASR) systems. By extending the Retrieval-Augmented Generative Error Correction (RAGEC) framework, DeRAGEC employs synthetic denoising rationales to filter out noisy NE candidates before correction. By leveraging phonetic similarity and augmented definitions, it refines noisy retrieved NEs using in-context learning, requiring no additional training. Experimental results on CommonVoice and STOP datasets show significant improvements in Word Error Rate (WER) and NE hit ratio, outperforming baseline ASR and RAGEC methods. Specifically, we achieved a 28% relative reduction in WER compared to ASR without postprocessing. Our source code is publicly available at: https://github.com/solee0022/deragec
△ Less
Submitted 9 June, 2025;
originally announced June 2025.
-
Bridging Geometric and Semantic Foundation Models for Generalized Monocular Depth Estimation
Authors:
Sanggyun Ma,
Wonjoon Choi,
Jihun Park,
Jaeyeul Kim,
Seunghun Lee,
Jiwan Seo,
Sunghoon Im
Abstract:
We present Bridging Geometric and Semantic (BriGeS), an effective method that fuses geometric and semantic information within foundation models to enhance Monocular Depth Estimation (MDE). Central to BriGeS is the Bridging Gate, which integrates the complementary strengths of depth and segmentation foundation models. This integration is further refined by our Attention Temperature Scaling techniqu…
▽ More
We present Bridging Geometric and Semantic (BriGeS), an effective method that fuses geometric and semantic information within foundation models to enhance Monocular Depth Estimation (MDE). Central to BriGeS is the Bridging Gate, which integrates the complementary strengths of depth and segmentation foundation models. This integration is further refined by our Attention Temperature Scaling technique. It finely adjusts the focus of the attention mechanisms to prevent over-concentration on specific features, thus ensuring balanced performance across diverse inputs. BriGeS capitalizes on pre-trained foundation models and adopts a strategy that focuses on training only the Bridging Gate. This method significantly reduces resource demands and training time while maintaining the model's ability to generalize effectively. Extensive experiments across multiple challenging datasets demonstrate that BriGeS outperforms state-of-the-art methods in MDE for complex scenes, effectively handling intricate structures and overlapping objects.
△ Less
Submitted 29 May, 2025;
originally announced May 2025.
-
Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders
Authors:
James Oldfield,
Shawn Im,
Yixuan Li,
Mihalis A. Nicolaou,
Ioannis Patras,
Grigorios G Chrysos
Abstract:
Multilayer perceptrons (MLPs) are an integral part of large language models, yet their dense representations render them difficult to understand, edit, and steer. Recent methods learn interpretable approximations via neuron-level sparsity, yet fail to faithfully reconstruct the original mapping--significantly increasing model's next-token cross-entropy loss. In this paper, we advocate for moving t…
▽ More
Multilayer perceptrons (MLPs) are an integral part of large language models, yet their dense representations render them difficult to understand, edit, and steer. Recent methods learn interpretable approximations via neuron-level sparsity, yet fail to faithfully reconstruct the original mapping--significantly increasing model's next-token cross-entropy loss. In this paper, we advocate for moving to layer-level sparsity to overcome the accuracy trade-off in sparse layer approximation. Under this paradigm, we introduce Mixture of Decoders (MxDs). MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers. Through a flexible form of tensor factorization, each sparsely activating MxD sublayer implements a linear transformation with full-rank weights--preserving the original decoders' expressive capacity even under heavy sparsity. Experimentally, we show that MxDs significantly outperform state-of-the-art methods (e.g., Transcoders) on the sparsity-accuracy frontier in language models with up to 3B parameters. Further evaluations on sparse probing and feature steering demonstrate that MxDs learn similarly specialized features of natural language--opening up a promising new avenue for designing interpretable yet faithful decompositions. Our code is included at: https://github.com/james-oldfield/MxD/.
△ Less
Submitted 27 May, 2025;
originally announced May 2025.
-
Visual Instruction Bottleneck Tuning
Authors:
Changdae Oh,
Jiatong Li,
Shawn Im,
Yixuan Li
Abstract:
Despite widespread adoption, multimodal large language models (MLLMs) suffer performance degradation when encountering unfamiliar queries under distribution shifts. Existing methods to improve MLLM generalization typically require either more instruction data or larger advanced model architectures, both of which incur non-trivial human labor or computational costs. In this work, we take an alterna…
▽ More
Despite widespread adoption, multimodal large language models (MLLMs) suffer performance degradation when encountering unfamiliar queries under distribution shifts. Existing methods to improve MLLM generalization typically require either more instruction data or larger advanced model architectures, both of which incur non-trivial human labor or computational costs. In this work, we take an alternative approach to enhance the robustness of MLLMs under distribution shifts, from a representation learning perspective. Inspired by the information bottleneck (IB) principle, we derive a variational lower bound of the IB for MLLMs and devise a practical implementation, Visual Instruction Bottleneck Tuning (Vittle). We then provide a theoretical justification of Vittle by revealing its connection to an information-theoretic robustness metric of MLLM. Empirical validation of three MLLMs on open-ended and closed-form question answering and object hallucination detection tasks over 45 datasets, including 30 shift scenarios, demonstrates that Vittle consistently improves the MLLM's robustness under shifts by pursuing the learning of a minimal sufficient representation.
△ Less
Submitted 20 May, 2025;
originally announced May 2025.
-
Online Distributed Queue Length Estimation
Authors:
Aditya Bhaskara,
Sreenivas Gollapudi,
Sungjin Im,
Kostas Kollias,
Kamesh Munagala
Abstract:
Queue length monitoring is a commonly arising problem in numerous applications such as queue management systems, scheduling, and traffic monitoring. Motivated by such applications, we formulate a queue monitoring problem, where there is a FIFO queue with arbitrary arrivals and departures, and a server needs to monitor the length of a queue by using decentralized pings from packets in the queue. Pa…
▽ More
Queue length monitoring is a commonly arising problem in numerous applications such as queue management systems, scheduling, and traffic monitoring. Motivated by such applications, we formulate a queue monitoring problem, where there is a FIFO queue with arbitrary arrivals and departures, and a server needs to monitor the length of a queue by using decentralized pings from packets in the queue. Packets can send pings informing the server about the number of packets ahead of them in the queue. Via novel online policies and lower bounds, we tightly characterize the trade-off between the number of pings sent and the accuracy of the server's real time estimates. Our work studies the trade-off under various arrival and departure processes, including constant-rate, Poisson, and adversarial processes.
△ Less
Submitted 25 April, 2025;
originally announced April 2025.
-
A Training-Free Style-aligned Image Generation with Scale-wise Autoregressive Model
Authors:
Jihun Park,
Jongmin Gim,
Kyoungmin Lee,
Minseok Oh,
Minwoo Choi,
Jaeyeul Kim,
Woo Chool Park,
Sunghoon Im
Abstract:
We present a training-free style-aligned image generation method that leverages a scale-wise autoregressive model. While large-scale text-to-image (T2I) models, particularly diffusion-based methods, have demonstrated impressive generation quality, they often suffer from style misalignment across generated image sets and slow inference speeds, limiting their practical usability. To address these is…
▽ More
We present a training-free style-aligned image generation method that leverages a scale-wise autoregressive model. While large-scale text-to-image (T2I) models, particularly diffusion-based methods, have demonstrated impressive generation quality, they often suffer from style misalignment across generated image sets and slow inference speeds, limiting their practical usability. To address these issues, we propose three key components: initial feature replacement to ensure consistent background appearance, pivotal feature interpolation to align object placement, and dynamic style injection, which reinforces style consistency using a schedule function. Unlike previous methods requiring fine-tuning or additional training, our approach maintains fast inference while preserving individual content details. Extensive experiments show that our method achieves generation quality comparable to competing approaches, significantly improves style alignment, and delivers inference speeds over six times faster than the fastest model.
△ Less
Submitted 8 April, 2025;
originally announced April 2025.
-
Efficient Algorithms for Cardinality Estimation and Conjunctive Query Evaluation With Simple Degree Constraints
Authors:
Sungjin Im,
Benjamin Moseley,
Hung Q. Ngo,
Kirk Pruhs
Abstract:
Cardinality estimation and conjunctive query evaluation are two of the most fundamental problems in database query processing. Recent work proposed, studied, and implemented a robust and practical information-theoretic cardinality estimation framework. In this framework, the estimator is the cardinality upper bound of a conjunctive query subject to ``degree-constraints'', which model a rich set of…
▽ More
Cardinality estimation and conjunctive query evaluation are two of the most fundamental problems in database query processing. Recent work proposed, studied, and implemented a robust and practical information-theoretic cardinality estimation framework. In this framework, the estimator is the cardinality upper bound of a conjunctive query subject to ``degree-constraints'', which model a rich set of input data statistics. For general degree constraints, computing this bound is computationally hard. Researchers have naturally sought efficiently computable relaxed upper bounds that are as tight as possible. The polymatroid bound is the tightest among those relaxed upper bounds. While it is an open question whether the polymatroid bound can be computed in polynomial-time in general, it is known to be computable in polynomial-time for some classes of degree constraints.
Our focus is on a common class of degree constraints called simple degree constraints. Researchers had not previously determined how to compute the polymatroid bound in polynomial time for this class of constraints. Our first main result is a polynomial time algorithm to compute the polymatroid bound given simple degree constraints. Our second main result is a polynomial-time algorithm to compute a ``proof sequence'' establishing this bound. This proof sequence can then be incorporated in the PANDA-framework to give a faster algorithm to evaluate a conjunctive query. In addition, we show computational limitations to extending our results to broader classes of degree constraints. Finally, our technique leads naturally to a new relaxed upper bound called the {\em flow bound}, which is computationally tractable.
△ Less
Submitted 3 April, 2025;
originally announced April 2025.
-
Intrinsic Image Decomposition for Robust Self-supervised Monocular Depth Estimation on Reflective Surfaces
Authors:
Wonhyeok Choi,
Kyumin Hwang,
Minwoo Choi,
Kiljoon Han,
Wonjoon Choi,
Mingyu Shin,
Sunghoon Im
Abstract:
Self-supervised monocular depth estimation (SSMDE) has gained attention in the field of deep learning as it estimates depth without requiring ground truth depth maps. This approach typically uses a photometric consistency loss between a synthesized image, generated from the estimated depth, and the original image, thereby reducing the need for extensive dataset acquisition. However, the convention…
▽ More
Self-supervised monocular depth estimation (SSMDE) has gained attention in the field of deep learning as it estimates depth without requiring ground truth depth maps. This approach typically uses a photometric consistency loss between a synthesized image, generated from the estimated depth, and the original image, thereby reducing the need for extensive dataset acquisition. However, the conventional photometric consistency loss relies on the Lambertian assumption, which often leads to significant errors when dealing with reflective surfaces that deviate from this model. To address this limitation, we propose a novel framework that incorporates intrinsic image decomposition into SSMDE. Our method synergistically trains for both monocular depth estimation and intrinsic image decomposition. The accurate depth estimation facilitates multi-image consistency for intrinsic image decomposition by aligning different view coordinate systems, while the decomposition process identifies reflective areas and excludes corrupted gradients from the depth training process. Furthermore, our framework introduces a pseudo-depth generation and knowledge distillation technique to further enhance the performance of the student model across both reflective and non-reflective surfaces. Comprehensive evaluations on multiple datasets show that our approach significantly outperforms existing SSMDE baselines in depth prediction, especially on reflective surfaces.
△ Less
Submitted 28 March, 2025;
originally announced March 2025.
-
Towards Lossless Implicit Neural Representation via Bit Plane Decomposition
Authors:
Woo Kyoung Han,
Byeonghun Lee,
Hyunmin Cho,
Sunghoon Im,
Kyong Hwan Jin
Abstract:
We quantify the upper bound on the size of the implicit neural representation (INR) model from a digital perspective. The upper bound of the model size increases exponentially as the required bit-precision increases. To this end, we present a bit-plane decomposition method that makes INR predict bit-planes, producing the same effect as reducing the upper bound of the model size. We validate our hy…
▽ More
We quantify the upper bound on the size of the implicit neural representation (INR) model from a digital perspective. The upper bound of the model size increases exponentially as the required bit-precision increases. To this end, we present a bit-plane decomposition method that makes INR predict bit-planes, producing the same effect as reducing the upper bound of the model size. We validate our hypothesis that reducing the upper bound leads to faster convergence with constant model size. Our method achieves lossless representation in 2D image and audio fitting, even for high bit-depth signals, such as 16-bit, which was previously unachievable. We pioneered the presence of bit bias, which INR prioritizes as the most significant bit (MSB). We expand the application of the INR task to bit depth expansion, lossless image compression, and extreme network quantization. Our source code is available at https://github.com/WooKyoungHan/LosslessINR
△ Less
Submitted 20 March, 2025; v1 submitted 28 February, 2025;
originally announced February 2025.
-
Adaptive Quantum Scaling Model for Histogram Distribution-based Quantum Watermarking
Authors:
Zheng Xing,
Chan-Tong Lam,
Xiaochen Yuan,
Sio-Kei Im,
Penousal Machado
Abstract:
The development of quantum image representation and quantum measurement techniques has made quantum image processing research a hot topic. In this paper, a novel Adaptive Quantum Scaling Model (AQSM) is first proposed for scrambling watermark images. Then, on the basis of the proposed AQSM, a novel quantum watermarking scheme is presented. Unlike existing quantum watermarking schemes with fixed em…
▽ More
The development of quantum image representation and quantum measurement techniques has made quantum image processing research a hot topic. In this paper, a novel Adaptive Quantum Scaling Model (AQSM) is first proposed for scrambling watermark images. Then, on the basis of the proposed AQSM, a novel quantum watermarking scheme is presented. Unlike existing quantum watermarking schemes with fixed embedding scales, the proposed method can flexibly embed watermarks of different sizes. In order to improve the robustness of the watermarking algorithm, a novel Histogram Distribution-based Watermarking Mechanism (HDWM) is proposed, which utilizes the histogram distribution property of the watermark image to determine the embedding strategy. In order to improve the accuracy of extracted watermark information, a quantum refining method is suggested, which can realize a certain error correction. The required key quantum circuits are designed. Finally, the effectiveness and robustness of the proposed quantum watermarking method are evaluated by simulation experiments on three image size scales. The results demonstrate the invisibility and good robustness of the watermarking algorithm.
△ Less
Submitted 31 March, 2025; v1 submitted 25 February, 2025;
originally announced February 2025.
-
Self-supervised Monocular Depth Estimation Robust to Reflective Surface Leveraged by Triplet Mining
Authors:
Wonhyeok Choi,
Kyumin Hwang,
Wei Peng,
Minwoo Choi,
Sunghoon Im
Abstract:
Self-supervised monocular depth estimation (SSMDE) aims to predict the dense depth map of a monocular image, by learning depth from RGB image sequences, eliminating the need for ground-truth depth labels. Although this approach simplifies data acquisition compared to supervised methods, it struggles with reflective surfaces, as they violate the assumptions of Lambertian reflectance, leading to ina…
▽ More
Self-supervised monocular depth estimation (SSMDE) aims to predict the dense depth map of a monocular image, by learning depth from RGB image sequences, eliminating the need for ground-truth depth labels. Although this approach simplifies data acquisition compared to supervised methods, it struggles with reflective surfaces, as they violate the assumptions of Lambertian reflectance, leading to inaccurate training on such surfaces. To tackle this problem, we propose a novel training strategy for an SSMDE by leveraging triplet mining to pinpoint reflective regions at the pixel level, guided by the camera geometry between different viewpoints. The proposed reflection-aware triplet mining loss specifically penalizes the inappropriate photometric error minimization on the localized reflective regions while preserving depth accuracy in non-reflective areas. We also incorporate a reflection-aware knowledge distillation method that enables a student model to selectively learn the pixel-level knowledge from reflective and non-reflective regions. This results in robust depth estimation across areas. Evaluation results on multiple datasets demonstrate that our method effectively enhances depth quality on reflective surfaces and outperforms state-of-the-art SSMDE baselines.
△ Less
Submitted 20 February, 2025;
originally announced February 2025.
-
LabTOP: A Unified Model for Lab Test Outcome Prediction on Electronic Health Records
Authors:
Sujeong Im,
Jungwoo Oh,
Edward Choi
Abstract:
Lab tests are fundamental for diagnosing diseases and monitoring patient conditions. However, frequent testing can be burdensome for patients, and test results may not always be immediately available. To address these challenges, we propose LabTOP, a unified model that predicts lab test outcomes by leveraging a language modeling approach on EHR data. Unlike conventional methods that estimate only…
▽ More
Lab tests are fundamental for diagnosing diseases and monitoring patient conditions. However, frequent testing can be burdensome for patients, and test results may not always be immediately available. To address these challenges, we propose LabTOP, a unified model that predicts lab test outcomes by leveraging a language modeling approach on EHR data. Unlike conventional methods that estimate only a subset of lab tests or classify discrete value ranges, LabTOP performs continuous numerical predictions for a diverse range of lab items. We evaluate LabTOP on three publicly available EHR datasets and demonstrate that it outperforms existing methods, including traditional machine learning models and state-of-the-art large language models. We also conduct extensive ablation studies to confirm the effectiveness of our design choices. We believe that LabTOP will serve as an accurate and generalizable framework for lab test outcome prediction, with potential applications in clinical decision support and early detection of critical conditions.
△ Less
Submitted 5 July, 2025; v1 submitted 19 February, 2025;
originally announced February 2025.
-
A Unified Understanding and Evaluation of Steering Methods
Authors:
Shawn Im,
Yixuan Li
Abstract:
Steering methods provide a practical approach to controlling large language models by applying steering vectors to intermediate activations, guiding outputs toward desired behaviors while avoiding retraining. Despite their growing importance, the field lacks a unified understanding and consistent evaluation across tasks and datasets, hindering progress. This paper introduces a unified framework fo…
▽ More
Steering methods provide a practical approach to controlling large language models by applying steering vectors to intermediate activations, guiding outputs toward desired behaviors while avoiding retraining. Despite their growing importance, the field lacks a unified understanding and consistent evaluation across tasks and datasets, hindering progress. This paper introduces a unified framework for analyzing and evaluating steering methods, formalizing their core principles and offering theoretical insights into their effectiveness. Through comprehensive empirical evaluations on multiple-choice and open-ended text generation tasks, we validate these insights, identifying key factors that influence performance and demonstrating the superiority of certain methods. Our work bridges theoretical and practical perspectives, offering actionable guidance for advancing the design, optimization, and deployment of steering methods in LLMs.
△ Less
Submitted 4 February, 2025;
originally announced February 2025.
-
Diagrammatics of information
Authors:
Mee Seong Im,
Clement Kam,
Caden Pici
Abstract:
We introduce a diagrammatic perspective for Shannon entropy created by the first author and Mikhail Khovanov and connect it to information theory and mutual information. We also give two complete proofs that the $5$-term dilogarithm deforms to the $4$-term infinitesimal dilogarithm.
We introduce a diagrammatic perspective for Shannon entropy created by the first author and Mikhail Khovanov and connect it to information theory and mutual information. We also give two complete proofs that the $5$-term dilogarithm deforms to the $4$-term infinitesimal dilogarithm.
△ Less
Submitted 3 February, 2025;
originally announced February 2025.
-
Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach
Authors:
Changdae Oh,
Zhen Fang,
Shawn Im,
Xuefeng Du,
Yixuan Li
Abstract:
Multimodal large language models (MLLMs) have shown promising capabilities but struggle under distribution shifts, where evaluation data differ from instruction tuning distributions. Although previous works have provided empirical evaluations, we argue that establishing a formal framework that can characterize and quantify the risk of MLLMs is necessary to ensure the safe and reliable application…
▽ More
Multimodal large language models (MLLMs) have shown promising capabilities but struggle under distribution shifts, where evaluation data differ from instruction tuning distributions. Although previous works have provided empirical evaluations, we argue that establishing a formal framework that can characterize and quantify the risk of MLLMs is necessary to ensure the safe and reliable application of MLLMs in the real world. By taking an information-theoretic perspective, we propose the first theoretical framework that enables the quantification of the maximum risk of MLLMs under distribution shifts. Central to our framework is the introduction of Effective Mutual Information (EMI), a principled metric that quantifies the relevance between input queries and model responses. We derive an upper bound for the EMI difference between in-distribution (ID) and out-of-distribution (OOD) data, connecting it to visual and textual distributional discrepancies. Extensive experiments on real benchmark datasets, spanning 61 shift scenarios, empirically validate our theoretical insights.
△ Less
Submitted 24 May, 2025; v1 submitted 1 February, 2025;
originally announced February 2025.
-
DyPCL: Dynamic Phoneme-level Contrastive Learning for Dysarthric Speech Recognition
Authors:
Wonjun Lee,
Solee Im,
Heejin Do,
Yunsu Kim,
Jungseul Ok,
Gary Geunbae Lee
Abstract:
Dysarthric speech recognition often suffers from performance degradation due to the intrinsic diversity of dysarthric severity and extrinsic disparity from normal speech. To bridge these gaps, we propose a Dynamic Phoneme-level Contrastive Learning (DyPCL) method, which leads to obtaining invariant representations across diverse speakers. We decompose the speech utterance into phoneme segments for…
▽ More
Dysarthric speech recognition often suffers from performance degradation due to the intrinsic diversity of dysarthric severity and extrinsic disparity from normal speech. To bridge these gaps, we propose a Dynamic Phoneme-level Contrastive Learning (DyPCL) method, which leads to obtaining invariant representations across diverse speakers. We decompose the speech utterance into phoneme segments for phoneme-level contrastive learning, leveraging dynamic connectionist temporal classification alignment. Unlike prior studies focusing on utterance-level embeddings, our granular learning allows discrimination of subtle parts of speech. In addition, we introduce dynamic curriculum learning, which progressively transitions from easy negative samples to difficult-to-distinguishable negative samples based on phonetic similarity of phoneme. Our approach to training by difficulty levels alleviates the inherent variability of speakers, better identifying challenging speeches. Evaluated on the UASpeech dataset, DyPCL outperforms baseline models, achieving an average 22.10\% relative reduction in word error rate (WER) across the overall dysarthria group.
△ Less
Submitted 3 February, 2025; v1 submitted 31 January, 2025;
originally announced January 2025.
-
TARDiS : Text Augmentation for Refining Diversity and Separability
Authors:
Kyungmin Kim,
SangHun Im,
GiBaeg Kim,
Heung-Seon Oh
Abstract:
Text augmentation (TA) is a critical technique for text classification, especially in few-shot settings. This paper introduces a novel LLM-based TA method, TARDiS, to address challenges inherent in the generation and alignment stages of two-stage TA methods. For the generation stage, we propose two generation processes, SEG and CEG, incorporating multiple class-specific prompts to enhance diversit…
▽ More
Text augmentation (TA) is a critical technique for text classification, especially in few-shot settings. This paper introduces a novel LLM-based TA method, TARDiS, to address challenges inherent in the generation and alignment stages of two-stage TA methods. For the generation stage, we propose two generation processes, SEG and CEG, incorporating multiple class-specific prompts to enhance diversity and separability. For the alignment stage, we introduce a class adaptation (CA) method to ensure that generated examples align with their target classes through verification and modification. Experimental results demonstrate TARDiS's effectiveness, outperforming state-of-the-art LLM-based TA methods in various few-shot text classification tasks. An in-depth analysis confirms the detailed behaviors at each stage.
△ Less
Submitted 5 January, 2025;
originally announced January 2025.
-
Binary Search with Distributional Predictions
Authors:
Michael Dinitz,
Sungjin Im,
Thomas Lavastida,
Benjamin Moseley,
Aidin Niaparast,
Sergei Vassilvitskii
Abstract:
Algorithms with (machine-learned) predictions is a powerful framework for combining traditional worst-case algorithms with modern machine learning. However, the vast majority of work in this space assumes that the prediction itself is non-probabilistic, even if it is generated by some stochastic process (such as a machine learning system). This is a poor fit for modern ML, particularly modern neur…
▽ More
Algorithms with (machine-learned) predictions is a powerful framework for combining traditional worst-case algorithms with modern machine learning. However, the vast majority of work in this space assumes that the prediction itself is non-probabilistic, even if it is generated by some stochastic process (such as a machine learning system). This is a poor fit for modern ML, particularly modern neural networks, which naturally generate a distribution. We initiate the study of algorithms with distributional predictions, where the prediction itself is a distribution. We focus on one of the simplest yet fundamental settings: binary search (or searching a sorted array). This setting has one of the simplest algorithms with a point prediction, but what happens if the prediction is a distribution? We show that this is a richer setting: there are simple distributions where using the classical prediction-based algorithm with any single prediction does poorly. Motivated by this, as our main result, we give an algorithm with query complexity $O(H(p) + \log η)$, where $H(p)$ is the entropy of the true distribution $p$ and $η$ is the earth mover's distance between $p$ and the predicted distribution $\hat p$. This also yields the first distributionally-robust algorithm for the classical problem of computing an optimal binary search tree given a distribution over target keys. We complement this with a lower bound showing that this query complexity is essentially optimal (up to constants), and experiments validating the practical usefulness of our algorithm.
△ Less
Submitted 24 November, 2024;
originally announced November 2024.
-
Strategic Facility Location via Predictions
Authors:
Qingyun Chen,
Nick Gravin,
Sungjin Im
Abstract:
The facility location with strategic agents is a canonical problem in the literature on mechanism design without money. Recently, Agrawal et. al. considered this problem in the context of machine learning augmented algorithms, where the mechanism designer is also given a prediction of the optimal facility location. An ideal mechanism in this framework produces an outcome that is close to the socia…
▽ More
The facility location with strategic agents is a canonical problem in the literature on mechanism design without money. Recently, Agrawal et. al. considered this problem in the context of machine learning augmented algorithms, where the mechanism designer is also given a prediction of the optimal facility location. An ideal mechanism in this framework produces an outcome that is close to the social optimum when the prediction is accurate (consistency) and gracefully degrades as the prediction deviates from the truth, while retaining some of the worst-case approximation guarantees (robustness). The previous work only addressed this problem in the two-dimensional Euclidean space providing optimal trade-offs between robustness and consistency guarantees for deterministic mechanisms.
We consider the problem for \emph{general} metric spaces. Our only assumption is that the metric is continuous, meaning that any pair of points must be connected by a continuous shortest path. We introduce a novel mechanism that in addition to agents' reported locations takes a predicted optimal facility location $\hat{o}$. We call this mechanism $\texttt{Harmonic}$, as it selects one of the reported locations $\tilde{\ell}_i$ with probability inversely proportional to $d(\hat{o},\tilde{\ell}_i)+ Δ$ for a constant parameter $Δ$. While \harm \ mechanism is not truthful, we can \emph{characterize the set of undominated strategies} for each agent $i$ as solely consisting of the points on a shortest path from their true location $\ell_i$ to the predicted location $\hat{o}$. We further derive \emph{consistency and robustness guarantees on the Price of Anarchy (PoA)} for the game induced by the mechanism.
△ Less
Submitted 9 October, 2024;
originally announced October 2024.
-
Challenges and Future Directions of Data-Centric AI Alignment
Authors:
Min-Hsuan Yeh,
Jeffrey Wang,
Xuefeng Du,
Seongheon Park,
Leitian Tao,
Shawn Im,
Yixuan Li
Abstract:
As AI systems become increasingly capable and influential, ensuring their alignment with human values, preferences, and goals has become a critical research focus. Current alignment methods primarily focus on designing algorithms and loss functions but often underestimate the crucial role of data. This paper advocates for a shift towards data-centric AI alignment, emphasizing the need to enhance t…
▽ More
As AI systems become increasingly capable and influential, ensuring their alignment with human values, preferences, and goals has become a critical research focus. Current alignment methods primarily focus on designing algorithms and loss functions but often underestimate the crucial role of data. This paper advocates for a shift towards data-centric AI alignment, emphasizing the need to enhance the quality and representativeness of data used in aligning AI systems. In this position paper, we highlight key challenges associated with both human-based and AI-based feedback within the data-centric alignment framework. Through qualitative analysis, we identify multiple sources of unreliability in human feedback, as well as problems related to temporal drift, context dependence, and AI-based feedback failing to capture human values due to inherent model limitations. We propose future research directions, including improved feedback collection practices, robust data-cleaning methodologies, and rigorous feedback verification processes. We call for future research into these critical directions to ensure, addressing gaps that persist in understanding and improving data-centric alignment practices.
△ Less
Submitted 1 May, 2025; v1 submitted 2 October, 2024;
originally announced October 2024.
-
BurstM: Deep Burst Multi-scale SR using Fourier Space with Optical Flow
Authors:
EungGu Kang,
Byeonghun Lee,
Sunghoon Im,
Kyong Hwan Jin
Abstract:
Multi frame super-resolution(MFSR) achieves higher performance than single image super-resolution (SISR), because MFSR leverages abundant information from multiple frames. Recent MFSR approaches adapt the deformable convolution network (DCN) to align the frames. However, the existing MFSR suffers from misalignments between the reference and source frames due to the limitations of DCN, such as smal…
▽ More
Multi frame super-resolution(MFSR) achieves higher performance than single image super-resolution (SISR), because MFSR leverages abundant information from multiple frames. Recent MFSR approaches adapt the deformable convolution network (DCN) to align the frames. However, the existing MFSR suffers from misalignments between the reference and source frames due to the limitations of DCN, such as small receptive fields and the predefined number of kernels. From these problems, existing MFSR approaches struggle to represent high-frequency information. To this end, we propose Deep Burst Multi-scale SR using Fourier Space with Optical Flow (BurstM). The proposed method estimates the optical flow offset for accurate alignment and predicts the continuous Fourier coefficient of each frame for representing high-frequency textures. In addition, we have enhanced the network flexibility by supporting various super-resolution (SR) scale factors with the unimodel. We demonstrate that our method has the highest performance and flexibility than the existing MFSR methods. Our source code is available at https://github.com/Egkang-Luis/burstm
△ Less
Submitted 21 September, 2024;
originally announced September 2024.
-
Entropy, cocycles, and their diagrammatics
Authors:
Mee Seong Im,
Mikhail Khovanov
Abstract:
The first part of the paper explains how to encode a one-cocycle and a two-cocycle on a group $G$ with values in its representation by networks of planar trivalent graphs with edges labelled by elements of $G$, elements of the representation floating in the regions, and suitable rules for manipulation of these diagrams. When the group is a semidirect product, there is a similar presentation via ov…
▽ More
The first part of the paper explains how to encode a one-cocycle and a two-cocycle on a group $G$ with values in its representation by networks of planar trivalent graphs with edges labelled by elements of $G$, elements of the representation floating in the regions, and suitable rules for manipulation of these diagrams. When the group is a semidirect product, there is a similar presentation via overlapping networks for the two subgroups involved.
M. Kontsevich and J.-L. Cathelineau have shown how to interpret the entropy of a finite random variable and infinitesimal dilogarithms, including their four-term functional relations, via 2-cocycles on the group of affine symmetries of a line.
We convert their construction into a diagrammatical calculus evaluating planar networks that describe morphisms in suitable monoidal categories. In particular, the four-term relations become equalities of networks analogous to associativity equations. The resulting monoidal categories complement existing categorical and operadic approaches to entropy.
△ Less
Submitted 8 October, 2024; v1 submitted 12 September, 2024;
originally announced September 2024.
-
Keyword-Aware ASR Error Augmentation for Robust Dialogue State Tracking
Authors:
Jihyun Lee,
Solee Im,
Wonjun Lee,
Gary Geunbae Lee
Abstract:
Dialogue State Tracking (DST) is a key part of task-oriented dialogue systems, identifying important information in conversations. However, its accuracy drops significantly in spoken dialogue environments due to named entity errors from Automatic Speech Recognition (ASR) systems. We introduce a simple yet effective data augmentation method that targets those entities to improve the robustness of D…
▽ More
Dialogue State Tracking (DST) is a key part of task-oriented dialogue systems, identifying important information in conversations. However, its accuracy drops significantly in spoken dialogue environments due to named entity errors from Automatic Speech Recognition (ASR) systems. We introduce a simple yet effective data augmentation method that targets those entities to improve the robustness of DST model. Our novel method can control the placement of errors using keyword-highlighted prompts while introducing phonetically similar errors. As a result, our method generated sufficient error patterns on keywords, leading to improved accuracy in noised and low-accuracy ASR environments.
△ Less
Submitted 10 September, 2024;
originally announced September 2024.
-
Online Scheduling via Gradient Descent for Weighted Flow Time Minimization
Authors:
Qingyun Chen,
Sungjin Im,
Aditya Petety
Abstract:
In this paper, we explore how a natural generalization of Shortest Remaining Processing Time (SRPT) can be a powerful \emph{meta-algorithm} for online scheduling. The meta-algorithm processes jobs to maximally reduce the objective of the corresponding offline scheduling problem of the remaining jobs: minimizing the total weighted completion time of them (the residual optimum). We show that it achi…
▽ More
In this paper, we explore how a natural generalization of Shortest Remaining Processing Time (SRPT) can be a powerful \emph{meta-algorithm} for online scheduling. The meta-algorithm processes jobs to maximally reduce the objective of the corresponding offline scheduling problem of the remaining jobs: minimizing the total weighted completion time of them (the residual optimum). We show that it achieves scalability for minimizing total weighted flow time when the residual optimum exhibits \emph{supermodularity}. Scalability here means it is $O(1)$-competitive with an arbitrarily small speed augmentation advantage over the adversary, representing the best possible outcome achievable for various scheduling problems.
Thanks to this finding, our approach does not require the residual optimum to have a closed mathematical form. Consequently, we can obtain the schedule by solving a linear program, which makes our approach readily applicable to a rich body of applications. Furthermore, by establishing a novel connection to \emph{substitute valuations in Walrasian markets}, we show how to achieve supermodularity, thereby obtaining scalable algorithms for various scheduling problems, such as matroid scheduling, generalized network flow, and generalized arbitrary speed-up curves, etc., and this is the \emph{first} non-trivial or scalable algorithm for many of them.
△ Less
Submitted 4 September, 2024;
originally announced September 2024.
-
Style-Editor: Text-driven object-centric style editing
Authors:
Jihun Park,
Jongmin Gim,
Kyoungmin Lee,
Seunghun Lee,
Sunghoon Im
Abstract:
We present Text-driven object-centric style editing model named Style-Editor, a novel method that guides style editing at an object-centric level using textual inputs. The core of Style-Editor is our Patch-wise Co-Directional (PCD) loss, meticulously designed for precise object-centric editing that are closely aligned with the input text. This loss combines a patch directional loss for text-guided…
▽ More
We present Text-driven object-centric style editing model named Style-Editor, a novel method that guides style editing at an object-centric level using textual inputs. The core of Style-Editor is our Patch-wise Co-Directional (PCD) loss, meticulously designed for precise object-centric editing that are closely aligned with the input text. This loss combines a patch directional loss for text-guided style direction and a patch distribution consistency loss for even CLIP embedding distribution across object regions. It ensures a seamless and harmonious style editing across object regions. Key to our method are the Text-Matched Patch Selection (TMPS) and Pre-fixed Region Selection (PRS) modules for identifying object locations via text, eliminating the need for segmentation masks. Lastly, we introduce an Adaptive Background Preservation (ABP) loss to maintain the original style and structural essence of the image's background. This loss is applied to dynamically identified background areas. Extensive experiments underline the effectiveness of our approach in creating visually coherent and textually aligned style editing.
△ Less
Submitted 8 April, 2025; v1 submitted 15 August, 2024;
originally announced August 2024.
-
On the Generalization of Preference Learning with DPO
Authors:
Shawn Im,
Yixuan Li
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities but often struggle to align with human preferences, leading to harmful or undesirable outputs. Preference learning, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a crucial component for ensuring that LLMs align with human values. Despite the widespread adopt…
▽ More
Large language models (LLMs) have demonstrated remarkable capabilities but often struggle to align with human preferences, leading to harmful or undesirable outputs. Preference learning, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a crucial component for ensuring that LLMs align with human values. Despite the widespread adoption in real-world systems, a thorough theoretical understanding of the generalization guarantees for these models remain lacking. This paper bridges that gap by introducing a new theoretical framework to analyze the generalization guarantees of models trained with direct preference optimization (DPO). While existing generalization theory often focuses on overparameterized models achieving near-optimal loss or models independent of the training process, our framework rigorously assesses how well models generalize after a finite number of gradient steps, reflecting real-world LLM training practices. By analyzing the reward margin associated with each sample and its trajectory throughout training, we can effectively bound the generalization error. We derive learning guarantees showing that, under specific conditions, models trained with DPO can correctly discern preferred responses on unseen data with high probability. These insights are empirically validated on contemporary LLMs, underscoring the practical relevance of our theoretical findings.
△ Less
Submitted 6 December, 2024; v1 submitted 6 August, 2024;
originally announced August 2024.
-
Flow4D: Leveraging 4D Voxel Network for LiDAR Scene Flow Estimation
Authors:
Jaeyeul Kim,
Jungwan Woo,
Ukcheol Shin,
Jean Oh,
Sunghoon Im
Abstract:
Understanding the motion states of the surrounding environment is critical for safe autonomous driving. These motion states can be accurately derived from scene flow, which captures the three-dimensional motion field of points. Existing LiDAR scene flow methods extract spatial features from each point cloud and then fuse them channel-wise, resulting in the implicit extraction of spatio-temporal fe…
▽ More
Understanding the motion states of the surrounding environment is critical for safe autonomous driving. These motion states can be accurately derived from scene flow, which captures the three-dimensional motion field of points. Existing LiDAR scene flow methods extract spatial features from each point cloud and then fuse them channel-wise, resulting in the implicit extraction of spatio-temporal features. Furthermore, they utilize 2D Bird's Eye View and process only two frames, missing crucial spatial information along the Z-axis and the broader temporal context, leading to suboptimal performance. To address these limitations, we propose Flow4D, which temporally fuses multiple point clouds after the 3D intra-voxel feature encoder, enabling more explicit extraction of spatio-temporal features through a 4D voxel network. However, while using 4D convolution improves performance, it significantly increases the computational load. For further efficiency, we introduce the Spatio-Temporal Decomposition Block (STDB), which combines 3D and 1D convolutions instead of using heavy 4D convolution. In addition, Flow4D further improves performance by using five frames to take advantage of richer temporal information. As a result, the proposed method achieves a 45.9% higher performance compared to the state-of-the-art while running in real-time, and won 1st place in the 2024 Argoverse 2 Scene Flow Challenge. The code is available at https://github.com/dgist-cvlab/Flow4D.
△ Less
Submitted 10 July, 2024;
originally announced July 2024.
-
Context-Aware Video Instance Segmentation
Authors:
Seunghun Lee,
Jiwan Seo,
Kiljoon Han,
Minwoo Choi,
Sunghoon Im
Abstract:
In this paper, we introduce the Context-Aware Video Instance Segmentation (CAVIS), a novel framework designed to enhance instance association by integrating contextual information adjacent to each object. To efficiently extract and leverage this information, we propose the Context-Aware Instance Tracker (CAIT), which merges contextual data surrounding the instances with the core instance features…
▽ More
In this paper, we introduce the Context-Aware Video Instance Segmentation (CAVIS), a novel framework designed to enhance instance association by integrating contextual information adjacent to each object. To efficiently extract and leverage this information, we propose the Context-Aware Instance Tracker (CAIT), which merges contextual data surrounding the instances with the core instance features to improve tracking accuracy. Additionally, we introduce the Prototypical Cross-frame Contrastive (PCC) loss, which ensures consistency in object-level features across frames, thereby significantly enhancing instance matching accuracy. CAVIS demonstrates superior performance over state-of-the-art methods on all benchmark datasets in video instance segmentation (VIS) and video panoptic segmentation (VPS). Notably, our method excels on the OVIS dataset, which is known for its particularly challenging videos.
△ Less
Submitted 3 July, 2024;
originally announced July 2024.
-
Online Load and Graph Balancing for Random Order Inputs
Authors:
Sungjin Im,
Ravi Kumar,
Shi Li,
Aditya Petety,
Manish Purohit
Abstract:
Online load balancing for heterogeneous machines aims to minimize the makespan (maximum machine workload) by scheduling arriving jobs with varying sizes on different machines. In the adversarial setting, where an adversary chooses not only the collection of job sizes but also their arrival order, the problem is well-understood and the optimal competitive ratio is known to be $Θ(\log m)$ where $m$…
▽ More
Online load balancing for heterogeneous machines aims to minimize the makespan (maximum machine workload) by scheduling arriving jobs with varying sizes on different machines. In the adversarial setting, where an adversary chooses not only the collection of job sizes but also their arrival order, the problem is well-understood and the optimal competitive ratio is known to be $Θ(\log m)$ where $m$ is the number of machines. In the more realistic random arrival order model, the understanding is limited. Previously, the best lower bound on the competitive ratio was only $Ω(\log \log m)$.
We significantly improve this bound by showing an $Ω( \sqrt {\log m})$ lower bound, even for the restricted case where each job has a unit size on two machines and infinite size on the others. On the positive side, we propose an $O(\log m/\log \log m)$-competitive algorithm, demonstrating that better performance is possible in the random arrival model.
△ Less
Submitted 20 May, 2024; v1 submitted 13 May, 2024;
originally announced May 2024.
-
JDEC: JPEG Decoding via Enhanced Continuous Cosine Coefficients
Authors:
Woo Kyoung Han,
Sunghoon Im,
Jaedeok Kim,
Kyong Hwan Jin
Abstract:
We propose a practical approach to JPEG image decoding, utilizing a local implicit neural representation with continuous cosine formulation. The JPEG algorithm significantly quantizes discrete cosine transform (DCT) spectra to achieve a high compression rate, inevitably resulting in quality degradation while encoding an image. We have designed a continuous cosine spectrum estimator to address the…
▽ More
We propose a practical approach to JPEG image decoding, utilizing a local implicit neural representation with continuous cosine formulation. The JPEG algorithm significantly quantizes discrete cosine transform (DCT) spectra to achieve a high compression rate, inevitably resulting in quality degradation while encoding an image. We have designed a continuous cosine spectrum estimator to address the quality degradation issue that restores the distorted spectrum. By leveraging local DCT formulations, our network has the privilege to exploit dequantization and upsampling simultaneously. Our proposed model enables decoding compressed images directly across different quality factors using a single pre-trained model without relying on a conventional JPEG decoder. As a result, our proposed network achieves state-of-the-art performance in flexible color image JPEG artifact removal tasks. Our source code is available at https://github.com/WooKyoungHan/JDEC.
△ Less
Submitted 2 April, 2024;
originally announced April 2024.
-
Fashion Style Editing with Generative Human Prior
Authors:
Chaerin Kong,
Seungyong Lee,
Soohyeok Im,
Wonsuk Yang
Abstract:
Image editing has been a long-standing challenge in the research community with its far-reaching impact on numerous applications. Recently, text-driven methods started to deliver promising results in domains like human faces, but their applications to more complex domains have been relatively limited. In this work, we explore the task of fashion style editing, where we aim to manipulate the fashio…
▽ More
Image editing has been a long-standing challenge in the research community with its far-reaching impact on numerous applications. Recently, text-driven methods started to deliver promising results in domains like human faces, but their applications to more complex domains have been relatively limited. In this work, we explore the task of fashion style editing, where we aim to manipulate the fashion style of human imagery using text descriptions. Specifically, we leverage a generative human prior and achieve fashion style editing by navigating its learned latent space. We first verify that the existing text-driven editing methods fall short for our problem due to their overly simplified guidance signal, and propose two directions to reinforce the guidance: textual augmentation and visual referencing. Combined with our empirical findings on the latent space structure, our Fashion Style Editing framework (FaSE) successfully projects abstract fashion concepts onto human images and introduces exciting new applications to the field.
△ Less
Submitted 2 April, 2024;
originally announced April 2024.
-
HyperCLOVA X Technical Report
Authors:
Kang Min Yoo,
Jaegeun Han,
Sookyo In,
Heewon Jeon,
Jisu Jeong,
Jaewook Kang,
Hyunwook Kim,
Kyung-Min Kim,
Munhyong Kim,
Sungju Kim,
Donghyun Kwak,
Hanock Kwak,
Se Jung Kwon,
Bado Lee,
Dongsoo Lee,
Gichang Lee,
Jooho Lee,
Baeseong Park,
Seongjin Shin,
Joonsang Yu,
Seolki Baek,
Sumin Byeon,
Eungsup Cho,
Dooseok Choe,
Jeesung Han
, et al. (371 additional authors not shown)
Abstract:
We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment t…
▽ More
We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of the inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization ability to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries in developing their sovereign LLMs.
△ Less
Submitted 13 April, 2024; v1 submitted 2 April, 2024;
originally announced April 2024.
-
Understanding the Learning Dynamics of Alignment with Human Feedback
Authors:
Shawn Im,
Yixuan Li
Abstract:
Aligning large language models (LLMs) with human intentions has become a critical task for safely deploying models in real-world systems. While existing alignment approaches have seen empirical success, theoretically understanding how these methods affect model behavior remains an open question. Our work provides an initial attempt to theoretically analyze the learning dynamics of human preference…
▽ More
Aligning large language models (LLMs) with human intentions has become a critical task for safely deploying models in real-world systems. While existing alignment approaches have seen empirical success, theoretically understanding how these methods affect model behavior remains an open question. Our work provides an initial attempt to theoretically analyze the learning dynamics of human preference alignment. We formally show how the distribution of preference datasets influences the rate of model updates and provide rigorous guarantees on the training accuracy. Our theory also reveals an intricate phenomenon where the optimization is prone to prioritizing certain behaviors with higher preference distinguishability. We empirically validate our findings on contemporary LLMs and alignment tasks, reinforcing our theoretical insights and shedding light on considerations for future alignment approaches. Disclaimer: This paper contains potentially offensive text; reader discretion is advised.
△ Less
Submitted 6 August, 2024; v1 submitted 27 March, 2024;
originally announced March 2024.
-
Multi-task Learning for Real-time Autonomous Driving Leveraging Task-adaptive Attention Generator
Authors:
Wonhyeok Choi,
Mingyu Shin,
Hyukzae Lee,
Jaehoon Cho,
Jaehyeon Park,
Sunghoon Im
Abstract:
Real-time processing is crucial in autonomous driving systems due to the imperative of instantaneous decision-making and rapid response. In real-world scenarios, autonomous vehicles are continuously tasked with interpreting their surroundings, analyzing intricate sensor data, and making decisions within split seconds to ensure safety through numerous computer vision tasks. In this paper, we presen…
▽ More
Real-time processing is crucial in autonomous driving systems due to the imperative of instantaneous decision-making and rapid response. In real-world scenarios, autonomous vehicles are continuously tasked with interpreting their surroundings, analyzing intricate sensor data, and making decisions within split seconds to ensure safety through numerous computer vision tasks. In this paper, we present a new real-time multi-task network adept at three vital autonomous driving tasks: monocular 3D object detection, semantic segmentation, and dense depth estimation. To counter the challenge of negative transfer, which is the prevalent issue in multi-task learning, we introduce a task-adaptive attention generator. This generator is designed to automatically discern interrelations across the three tasks and arrange the task-sharing pattern, all while leveraging the efficiency of the hard-parameter sharing approach. To the best of our knowledge, the proposed model is pioneering in its capability to concurrently handle multiple tasks, notably 3D object detection, while maintaining real-time processing speeds. Our rigorously optimized network, when tested on the Cityscapes-3D datasets, consistently outperforms various baseline models. Moreover, an in-depth ablation study substantiates the efficacy of the methodologies integrated into our framework.
△ Less
Submitted 6 March, 2024;
originally announced March 2024.
-
Analysis of the Two-Step Heterogeneous Transfer Learning for Laryngeal Blood Vessel Classification: Issue and Improvement
Authors:
Xinyi Fang,
Xu Yang,
Chak Fong Chong,
Kei Long Wong,
Yapeng Wang,
Tiankui Zhang,
Sio-Kei Im
Abstract:
Accurate classification of laryngeal vascular as benign or malignant is crucial for early detection of laryngeal cancer. However, organizations with limited access to laryngeal vascular images face challenges due to the lack of large and homogeneous public datasets for effective learning. Distinguished from the most familiar works, which directly transfer the ImageNet pre-trained models to the tar…
▽ More
Accurate classification of laryngeal vascular as benign or malignant is crucial for early detection of laryngeal cancer. However, organizations with limited access to laryngeal vascular images face challenges due to the lack of large and homogeneous public datasets for effective learning. Distinguished from the most familiar works, which directly transfer the ImageNet pre-trained models to the target domain for fine-tuning, this work pioneers exploring two-step heterogeneous transfer learning (THTL) for laryngeal lesion classification with nine deep-learning models, utilizing the diabetic retinopathy color fundus images, semantically non-identical yet vascular images, as the intermediate domain. Attention visualization technique, Layer Class Activate Map (LayerCAM), reveals a novel finding that yet the intermediate and the target domain both reflect vascular structure to a certain extent, the prevalent radial vascular pattern in the intermediate domain prevents learning the features of twisted and tangled vessels that distinguish the malignant class in the target domain, summarizes a vital rule for laryngeal lesion classification using THTL. To address this, we introduce an enhanced fine-tuning strategy in THTL called Step-Wise Fine-Tuning (SWFT) and apply it to the ResNet models. SWFT progressively refines model performance by accumulating fine-tuning layers from back to front, guided by the visualization results of LayerCAM. Comparison with the original THTL approach shows significant improvements. For ResNet18, the accuracy and malignant recall increases by 26.1% and 79.8%, respectively, while for ResNet50, these indicators improve by 20.4% and 62.2%, respectively.
△ Less
Submitted 14 April, 2024; v1 submitted 29 February, 2024;
originally announced February 2024.
-
Category-wise Fine-Tuning: Resisting Incorrect Pseudo-Labels in Multi-Label Image Classification with Partial Labels
Authors:
Chak Fong Chong,
Xinyi Fang,
Jielong Guo,
Yapeng Wang,
Wei Ke,
Chan-Tong Lam,
Sio-Kei Im
Abstract:
Large-scale image datasets are often partially labeled, where only a few categories' labels are known for each image. Assigning pseudo-labels to unknown labels to gain additional training signals has become prevalent for training deep classification models. However, some pseudo-labels are inevitably incorrect, leading to a notable decline in the model classification performance. In this paper, we…
▽ More
Large-scale image datasets are often partially labeled, where only a few categories' labels are known for each image. Assigning pseudo-labels to unknown labels to gain additional training signals has become prevalent for training deep classification models. However, some pseudo-labels are inevitably incorrect, leading to a notable decline in the model classification performance. In this paper, we propose a novel method called Category-wise Fine-Tuning (CFT), aiming to reduce model inaccuracies caused by the wrong pseudo-labels. In particular, CFT employs known labels without pseudo-labels to fine-tune the logistic regressions of trained models individually to calibrate each category's model predictions. Genetic Algorithm, seldom used for training deep models, is also utilized in CFT to maximize the classification performance directly. CFT is applied to well-trained models, unlike most existing methods that train models from scratch. Hence, CFT is general and compatible with models trained with different methods and schemes, as demonstrated through extensive experiments. CFT requires only a few seconds for each category for calibration with consumer-grade GPUs. We achieve state-of-the-art results on three benchmarking datasets, including the CheXpert chest X-ray competition dataset (ensemble mAUC 93.33%, single model 91.82%), partially labeled MS-COCO (average mAP 83.69%), and Open Image V3 (mAP 85.31%), outperforming the previous bests by 0.28%, 2.21%, 2.50%, and 0.91%, respectively. The single model on CheXpert has been officially evaluated by the competition server, endorsing the correctness of the result. The outstanding results and generalizability indicate that CFT could be substantial and prevalent for classification model development. Code is available at: https://github.com/maxium0526/category-wise-fine-tuning.
△ Less
Submitted 30 January, 2024;
originally announced January 2024.
-
Data Exchange Markets via Utility Balancing
Authors:
Aditya Bhaskara,
Sreenivas Gollapudi,
Sungjin Im,
Kostas Kollias,
Kamesh Munagala,
Govind S. Sankar
Abstract:
This paper explores the design of a balanced data-sharing marketplace for entities with heterogeneous datasets and machine learning models that they seek to refine using data from other agents. The goal of the marketplace is to encourage participation for data sharing in the presence of such heterogeneity. Our market design approach for data sharing focuses on interim utility balance, where partic…
▽ More
This paper explores the design of a balanced data-sharing marketplace for entities with heterogeneous datasets and machine learning models that they seek to refine using data from other agents. The goal of the marketplace is to encourage participation for data sharing in the presence of such heterogeneity. Our market design approach for data sharing focuses on interim utility balance, where participants contribute and receive equitable utility from refinement of their models. We present such a market model for which we study computational complexity, solution existence, and approximation algorithms for welfare maximization and core stability. We finally support our theoretical insights with simulations on a mean estimation task inspired by road traffic delay estimation.
△ Less
Submitted 23 January, 2024;
originally announced January 2024.
-
Depth-discriminative Metric Learning for Monocular 3D Object Detection
Authors:
Wonhyeok Choi,
Mingyu Shin,
Sunghoon Im
Abstract:
Monocular 3D object detection poses a significant challenge due to the lack of depth information in RGB images. Many existing methods strive to enhance the object depth estimation performance by allocating additional parameters for object depth estimation, utilizing extra modules or data. In contrast, we introduce a novel metric learning scheme that encourages the model to extract depth-discrimina…
▽ More
Monocular 3D object detection poses a significant challenge due to the lack of depth information in RGB images. Many existing methods strive to enhance the object depth estimation performance by allocating additional parameters for object depth estimation, utilizing extra modules or data. In contrast, we introduce a novel metric learning scheme that encourages the model to extract depth-discriminative features regardless of the visual attributes without increasing inference time and model size. Our method employs the distance-preserving function to organize the feature space manifold in relation to ground-truth object depth. The proposed (K, B, eps)-quasi-isometric loss leverages predetermined pairwise distance restriction as guidance for adjusting the distance among object descriptors without disrupting the non-linearity of the natural feature manifold. Moreover, we introduce an auxiliary head for object-wise depth estimation, which enhances depth quality while maintaining the inference time. The broad applicability of our method is demonstrated through experiments that show improvements in overall performance when integrated into various baselines. The results show that our method consistently improves the performance of various baselines by 23.51% and 5.78% on average across KITTI and Waymo, respectively.
△ Less
Submitted 2 January, 2024;
originally announced January 2024.
-
Boolean TQFTs with accumulating defects, sofic systems, and automata for infinite words
Authors:
Paul Gustafson,
Mee Seong Im,
Mikhail Khovanov
Abstract:
Any finite state automaton gives rise to a Boolean one-dimensional TQFT with defects and inner endpoints of cobordisms. This paper extends the correspondence to Boolean TQFTs where defects accumulate toward inner endpoints, relating such TQFTs and topological theories to sofic systems and $ω$-automata.
Any finite state automaton gives rise to a Boolean one-dimensional TQFT with defects and inner endpoints of cobordisms. This paper extends the correspondence to Boolean TQFTs where defects accumulate toward inner endpoints, relating such TQFTs and topological theories to sofic systems and $ω$-automata.
△ Less
Submitted 25 December, 2023;
originally announced December 2023.
-
Polynomial Time Convergence of the Iterative Evaluation of Datalogo Programs
Authors:
Sungjin Im,
Benjamin Moseley,
Hung Q. Ngo,
Kirk Pruhs
Abstract:
Datalogo is an extension of Datalog that allows for aggregation and recursion over an arbitrary commutative semiring. Like Datalog, Datalogo programs can be evaluated via the natural iterative algorithm until a fixed point is reached. However unlike Datalog, the natural iterative evaluation of some Datalogo programs over some semirings may not converge. It is known that the commutative semirings f…
▽ More
Datalogo is an extension of Datalog that allows for aggregation and recursion over an arbitrary commutative semiring. Like Datalog, Datalogo programs can be evaluated via the natural iterative algorithm until a fixed point is reached. However unlike Datalog, the natural iterative evaluation of some Datalogo programs over some semirings may not converge. It is known that the commutative semirings for which the iterative evaluation of Datalogo programs is guaranteed to converge are exactly those semirings that are stable [7]. Previously, the best known upper bound on the number of iterations until convergence over $p$-stable semirings is $\sum_{i=1}^n (p+2)^i = Θ(p^n)$ steps, where $n$ is (essentially) the output size. We establish that, in fact, the natural iterative evaluation of a Datalogoprogram over a $p$-stable semiring converges within a polynomial number of iterations. In particular our upper bound is $O( σp n^2( n^2 \lg λ+ \lg σ))$ where $σ$ is the number of elements in the semiring present in either the input databases or the Datalogo program, and $λ$ is the maximum number of terms in any product in the Datalogo program.
△ Less
Submitted 21 February, 2024; v1 submitted 21 December, 2023;
originally announced December 2023.
-
Rethinking LiDAR Domain Generalization: Single Source as Multiple Density Domains
Authors:
Jaeyeul Kim,
Jungwan Woo,
Jeonghoon Kim,
Sunghoon Im
Abstract:
In the realm of LiDAR-based perception, significant strides have been made, yet domain generalization remains a substantial challenge. The performance often deteriorates when models are applied to unfamiliar datasets with different LiDAR sensors or deployed in new environments, primarily due to variations in point cloud density distributions. To tackle this challenge, we propose a Density Discrimi…
▽ More
In the realm of LiDAR-based perception, significant strides have been made, yet domain generalization remains a substantial challenge. The performance often deteriorates when models are applied to unfamiliar datasets with different LiDAR sensors or deployed in new environments, primarily due to variations in point cloud density distributions. To tackle this challenge, we propose a Density Discriminative Feature Embedding (DDFE) module, capitalizing on the observation that a single source LiDAR point cloud encompasses a spectrum of densities. The DDFE module is meticulously designed to extract density-specific features within a single source domain, facilitating the recognition of objects sharing similar density characteristics across different LiDAR sensors. In addition, we introduce a simple yet effective density augmentation technique aimed at expanding the spectrum of density in source data, thereby enhancing the capabilities of the DDFE. Our DDFE stands out as a versatile and lightweight domain generalization module. It can be seamlessly integrated into various 3D backbone networks, where it has demonstrated superior performance over current state-of-the-art domain generalization methods. Code is available at https://github.com/dgist-cvlab/MultiDensityDG.
△ Less
Submitted 16 July, 2024; v1 submitted 19 December, 2023;
originally announced December 2023.
-
Evaluating the Utility of Model Explanations for Model Development
Authors:
Shawn Im,
Jacob Andreas,
Yilun Zhou
Abstract:
One of the motivations for explainable AI is to allow humans to make better and more informed decisions regarding the use and deployment of AI models. But careful evaluations are needed to assess whether this expectation has been fulfilled. Current evaluations mainly focus on algorithmic properties of explanations, and those that involve human subjects often employ subjective questions to test hum…
▽ More
One of the motivations for explainable AI is to allow humans to make better and more informed decisions regarding the use and deployment of AI models. But careful evaluations are needed to assess whether this expectation has been fulfilled. Current evaluations mainly focus on algorithmic properties of explanations, and those that involve human subjects often employ subjective questions to test human's perception of explanation usefulness, without being grounded in objective metrics and measurements. In this work, we evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development. We conduct a mixed-methods user study involving image data to evaluate saliency maps generated by SmoothGrad, GradCAM, and an oracle explanation on two tasks: model selection and counterfactual simulation. To our surprise, we did not find evidence of significant improvement on these tasks when users were provided with any of the saliency maps, even the synthetic oracle explanation designed to be simple to understand and highly indicative of the answer. Nonetheless, explanations did help users more accurately describe the models. These findings suggest caution regarding the usefulness and potential for misunderstanding in saliency-based explanations.
△ Less
Submitted 10 December, 2023;
originally announced December 2023.
-
On the Convergence Rate of Linear Datalogo over Stable Semirings
Authors:
Sungjin Im,
Benjamin Moseley,
Hung Ngo,
Kirk Pruhs
Abstract:
Datalogo is an extension of Datalog, where instead of a program being a collection of union of conjunctive queries over the standard Boolean semiring, a program may now be a collection of sum-product queries over an arbitrary commutative partially ordered pre-semiring. Datalogo is more powerful than Datalog in that its additional algebraic structure alows for supporting recursion with aggregation.…
▽ More
Datalogo is an extension of Datalog, where instead of a program being a collection of union of conjunctive queries over the standard Boolean semiring, a program may now be a collection of sum-product queries over an arbitrary commutative partially ordered pre-semiring. Datalogo is more powerful than Datalog in that its additional algebraic structure alows for supporting recursion with aggregation. At the same time, Datalogo retains the syntactic and semantic simplicity of Datalog: Datalogo has declarative least fixpoint semantics. The least fixpoint can be found via the naïve evaluation algorithm that repeatedly applies the immediate consequence operator until no further change is possible.
It was shown in~\cite{Khamis0PSW22} that, when the underlying semiring is $p$-stable, then the naïve evaluation of any Datalogo program over the semiring converges in a finite number of steps. However, the upper bounds on the rate of convergence were exponential in the number $n$ of ground IDB atoms.
This paper establishes polynomial upper bounds on the convergence rate of the naïve algorithm on {\bf linear} Datalogo programs, which is quite common in practice. In particular, the main result of this paper is that the convergence rate of linear Datalogo programs under any $p$-stable semiring is $O(pn^3)$. Next, we study the convergence rate in terms of the number of elements in the semiring for linear Datalogo programs. When $L$ is the number of elements, we show that the convergence rate is bounded by $O(pn \log L)$. This significantly improves the convergence rate for small $L$.
△ Less
Submitted 5 July, 2025; v1 submitted 29 November, 2023;
originally announced November 2023.
-
Rotation Matters: Generalized Monocular 3D Object Detection for Various Camera Systems
Authors:
SungHo Moon,
JinWoo Bae,
SungHoon Im
Abstract:
Research on monocular 3D object detection is being actively studied, and as a result, performance has been steadily improving. However, 3D object detection performance is significantly reduced when applied to a camera system different from the system used to capture the training datasets. For example, a 3D detector trained on datasets from a passenger car mostly fails to regress accurate 3D boundi…
▽ More
Research on monocular 3D object detection is being actively studied, and as a result, performance has been steadily improving. However, 3D object detection performance is significantly reduced when applied to a camera system different from the system used to capture the training datasets. For example, a 3D detector trained on datasets from a passenger car mostly fails to regress accurate 3D bounding boxes for a camera mounted on a bus. In this paper, we conduct extensive experiments to analyze the factors that cause performance degradation. We find that changing the camera pose, especially camera orientation, relative to the road plane caused performance degradation. In addition, we propose a generalized 3D object detection method that can be universally applied to various camera systems. We newly design a compensation module that corrects the estimated 3D bounding box location and heading direction. The proposed module can be applied to most of the recent 3D object detection networks. It increases AP3D score (KITTI moderate, IoU $> 70\%$) about 6-to-10-times above the baselines without additional training. Both quantitative and qualitative results show the effectiveness of the proposed method.
△ Less
Submitted 8 October, 2023;
originally announced October 2023.
-
Implicit Neural Image Stitching
Authors:
Minsu Kim,
Jaewon Lee,
Byeonghun Lee,
Sunghoon Im,
Kyong Hwan Jin
Abstract:
Existing frameworks for image stitching often provide visually reasonable stitchings. However, they suffer from blurry artifacts and disparities in illumination, depth level, etc. Although the recent learning-based stitchings relax such disparities, the required methods impose sacrifice of image qualities failing to capture high-frequency details for stitched images. To address the problem, we pro…
▽ More
Existing frameworks for image stitching often provide visually reasonable stitchings. However, they suffer from blurry artifacts and disparities in illumination, depth level, etc. Although the recent learning-based stitchings relax such disparities, the required methods impose sacrifice of image qualities failing to capture high-frequency details for stitched images. To address the problem, we propose a novel approach, implicit Neural Image Stitching (NIS) that extends arbitrary-scale super-resolution. Our method estimates Fourier coefficients of images for quality-enhancing warps. Then, the suggested model blends color mismatches and misalignment in the latent space and decodes the features into RGB values of stitched images. Our experiments show that our approach achieves improvement in resolving the low-definition imaging of the previous deep image stitching with favorable accelerated image-enhancing methods. Our source code is available at https://github.com/minshu-kim/NIS.
△ Less
Submitted 21 January, 2024; v1 submitted 4 September, 2023;
originally announced September 2023.
-
From finite state automata to tangle cobordisms: a TQFT journey from one to four dimensions
Authors:
Mee Seong Im,
Mikhail Khovanov
Abstract:
This is a brief introduction to link homology theories that categorify Reshetikhin--Turaev $\mathsf{SL}(N)$-quantum link invariants. A recently discovered surprising connection between finite state automata and Boolean TQFTs in dimension one is explained as a warm-up.
This is a brief introduction to link homology theories that categorify Reshetikhin--Turaev $\mathsf{SL}(N)$-quantum link invariants. A recently discovered surprising connection between finite state automata and Boolean TQFTs in dimension one is explained as a warm-up.
△ Less
Submitted 17 September, 2023; v1 submitted 1 September, 2023;
originally announced September 2023.
-
Publicly Shareable Clinical Large Language Model Built on Synthetic Clinical Notes
Authors:
Sunjun Kweon,
Junu Kim,
Jiyoun Kim,
Sujeong Im,
Eunbyeol Cho,
Seongsu Bae,
Jungwoo Oh,
Gyubok Lee,
Jong Hak Moon,
Seng Chan You,
Seungjin Baek,
Chang Hoon Han,
Yoon Bin Jung,
Yohan Jo,
Edward Choi
Abstract:
The development of large language models tailored for handling patients' clinical notes is often hindered by the limited accessibility and usability of these notes due to strict privacy regulations. To address these challenges, we first create synthetic large-scale clinical notes using publicly available case reports extracted from biomedical literature. We then use these synthetic notes to train…
▽ More
The development of large language models tailored for handling patients' clinical notes is often hindered by the limited accessibility and usability of these notes due to strict privacy regulations. To address these challenges, we first create synthetic large-scale clinical notes using publicly available case reports extracted from biomedical literature. We then use these synthetic notes to train our specialized clinical large language model, Asclepius. While Asclepius is trained on synthetic data, we assess its potential performance in real-world applications by evaluating it using real clinical notes. We benchmark Asclepius against several other large language models, including GPT-3.5-turbo and other open-source alternatives. To further validate our approach using synthetic notes, we also compare Asclepius with its variants trained on real clinical notes. Our findings convincingly demonstrate that synthetic clinical notes can serve as viable substitutes for real ones when constructing high-performing clinical language models. This conclusion is supported by detailed evaluations conducted by both GPT-4 and medical professionals. All resources including weights, codes, and data used in the development of Asclepius are made publicly accessible for future research. (https://github.com/starmpcc/Asclepius)
△ Less
Submitted 29 July, 2024; v1 submitted 1 September, 2023;
originally announced September 2023.