-
System Prompt Optimization with Meta-Learning
Authors:
Yumin Choi,
Jinheon Baek,
Sung Ju Hwang
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both the task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, and largely overlooked the sy…
▽ More
Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both the task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, and largely overlooked the system prompt that is, once optimized, applicable across different tasks and domains. Motivated by this, we introduce the novel problem of bilevel system prompt optimization, whose objective is to design system prompts that are robust to diverse user prompts and transferable to unseen tasks. To tackle this problem, we then propose a meta-learning framework, which meta-learns the system prompt by optimizing it over various user prompts across multiple datasets, while simultaneously updating the user prompts in an iterative manner to ensure synergy between them. We conduct experiments on 14 unseen datasets spanning 5 different domains, on which we show that our approach produces system prompts that generalize effectively to diverse user prompts. Also, our findings reveal that the optimized system prompt enables rapid adaptation even to unseen tasks, requiring fewer optimization steps for test-time user prompts while achieving improved performance.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities
Authors:
Woongyeong Yeo,
Kangsan Kim,
Soyeong Jeong,
Jinheon Baek,
Sung Ju Hwang
Abstract:
Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing RAG approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In…
▽ More
Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing RAG approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In contrast, real-world queries vary widely in the type of knowledge they require, which a single type of knowledge source cannot address. To address this, we introduce UniversalRAG, a novel RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. Specifically, motivated by the observation that forcing all modalities into a unified representation space derived from a single combined corpus causes a modality gap, where the retrieval tends to favor items from the same modality as the query, we propose a modality-aware routing mechanism that dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it. Also, beyond modality, we organize each modality into multiple granularity levels, enabling fine-tuned retrieval tailored to the complexity and scope of the query. We validate UniversalRAG on 8 benchmarks spanning multiple modalities, showing its superiority over modality-specific and unified baselines.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Authors:
Minju Seo,
Jinheon Baek,
Seongyun Lee,
Sung Ju Hwang
Abstract:
Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent…
▽ More
Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into functional code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, specifically from the original paper authors, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins. Code is available at: https://github.com/going-doer/Paper2Code.
△ Less
Submitted 26 April, 2025; v1 submitted 23 April, 2025;
originally announced April 2025.
-
ShieldGemma 2: Robust and Tractable Image Content Moderation
Authors:
Wenjun Zeng,
Dana Kurniawan,
Ryan Mullins,
Yuchi Liu,
Tamoghna Saha,
Dirichi Ike-Njoku,
Jindong Gu,
Yiwen Song,
Cai Xu,
Jingjing Zhou,
Aparna Joshi,
Shravan Dheep,
Mani Malek,
Hamid Palangi,
Joon Baek,
Rick Pereira,
Karthik Narasimhan
Abstract:
We introduce ShieldGemma 2, a 4B parameter image content moderation model built on Gemma 3. This model provides robust safety risk predictions across the following key harm categories: Sexually Explicit, Violence \& Gore, and Dangerous Content for synthetic images (e.g. output of any image generation model) and natural images (e.g. any image input to a Vision-Language Model). We evaluated on both…
▽ More
We introduce ShieldGemma 2, a 4B parameter image content moderation model built on Gemma 3. This model provides robust safety risk predictions across the following key harm categories: Sexually Explicit, Violence \& Gore, and Dangerous Content for synthetic images (e.g. output of any image generation model) and natural images (e.g. any image input to a Vision-Language Model). We evaluated on both internal and external benchmarks to demonstrate state-of-the-art performance compared to LlavaGuard \citep{helff2024llavaguard}, GPT-4o mini \citep{hurst2024gpt}, and the base Gemma 3 model \citep{gemma_2025} based on our policies. Additionally, we present a novel adversarial data generation pipeline which enables a controlled, diverse, and robust image generation. ShieldGemma 2 provides an open image moderation tool to advance multimodal safety and responsible AI development.
△ Less
Submitted 8 April, 2025; v1 submitted 1 April, 2025;
originally announced April 2025.
-
Factored Agents: Decoupling In-Context Learning and Memorization for Robust Tool Use
Authors:
Nicholas Roth,
Christopher Hidey,
Lucas Spangher,
William F. Arnold,
Chang Ye,
Nick Masiewicki,
Jinoo Baek,
Peter Grabowski,
Eugene Ie
Abstract:
In this paper, we propose a novel factored agent architecture designed to overcome the limitations of traditional single-agent systems in agentic AI. Our approach decomposes the agent into two specialized components: (1) a large language model (LLM) that serves as a high level planner and in-context learner, which may use dynamically available information in user prompts, (2) a smaller language mo…
▽ More
In this paper, we propose a novel factored agent architecture designed to overcome the limitations of traditional single-agent systems in agentic AI. Our approach decomposes the agent into two specialized components: (1) a large language model (LLM) that serves as a high level planner and in-context learner, which may use dynamically available information in user prompts, (2) a smaller language model which acts as a memorizer of tool format and output. This decoupling addresses prevalent issues in monolithic designs, including malformed, missing, and hallucinated API fields, as well as suboptimal planning in dynamic environments. Empirical evaluations demonstrate that our factored architecture significantly improves planning accuracy and error resilience, while elucidating the inherent trade-off between in-context learning and static memorization. These findings suggest that a factored approach is a promising pathway for developing more robust and adaptable agentic AI systems.
△ Less
Submitted 2 April, 2025; v1 submitted 28 March, 2025;
originally announced March 2025.
-
Mitigating Trade-off: Stream and Query-guided Aggregation for Efficient and Effective 3D Occupancy Prediction
Authors:
Seokha Moon,
Janghyun Baek,
Giseop Kim,
Jinkyu Kim,
Sunwook Choi
Abstract:
3D occupancy prediction has emerged as a key perception task for autonomous driving, as it reconstructs 3D environments to provide a comprehensive scene understanding. Recent studies focus on integrating spatiotemporal information obtained from past observations to improve prediction accuracy, using a multi-frame fusion approach that processes multiple past frames together. However, these methods…
▽ More
3D occupancy prediction has emerged as a key perception task for autonomous driving, as it reconstructs 3D environments to provide a comprehensive scene understanding. Recent studies focus on integrating spatiotemporal information obtained from past observations to improve prediction accuracy, using a multi-frame fusion approach that processes multiple past frames together. However, these methods struggle with a trade-off between efficiency and accuracy, which significantly limits their practicality. To mitigate this trade-off, we propose StreamOcc, a novel framework that aggregates spatio-temporal information in a stream-based manner. StreamOcc consists of two key components: (i) Stream-based Voxel Aggregation, which effectively accumulates past observations while minimizing computational costs, and (ii) Query-guided Aggregation, which recurrently aggregates instance-level features of dynamic objects into corresponding voxel features, refining fine-grained details of dynamic objects. Experiments on the Occ3D-nuScenes dataset show that StreamOcc achieves state-of-the-art performance in real-time settings, while reducing memory usage by more than 50% compared to previous methods.
△ Less
Submitted 27 March, 2025;
originally announced March 2025.
-
DynASyn: Multi-Subject Personalization Enabling Dynamic Action Synthesis
Authors:
Yongjin Choi,
Chanhun Park,
Seung Jun Baek
Abstract:
Recent advances in text-to-image diffusion models spurred research on personalization, i.e., a customized image synthesis, of subjects within reference images. Although existing personalization methods are able to alter the subjects' positions or to personalize multiple subjects simultaneously, they often struggle to modify the behaviors of subjects or their dynamic interactions. The difficulty is…
▽ More
Recent advances in text-to-image diffusion models spurred research on personalization, i.e., a customized image synthesis, of subjects within reference images. Although existing personalization methods are able to alter the subjects' positions or to personalize multiple subjects simultaneously, they often struggle to modify the behaviors of subjects or their dynamic interactions. The difficulty is attributable to overfitting to reference images, which worsens if only a single reference image is available. We propose DynASyn, an effective multi-subject personalization from a single reference image addressing these challenges. DynASyn preserves the subject identity in the personalization process by aligning concept-based priors with subject appearances and actions. This is achieved by regularizing the attention maps between the subject token and images through concept-based priors. In addition, we propose concept-based prompt-and-image augmentation for an enhanced trade-off between identity preservation and action diversity. We adopt an SDE-based editing guided by augmented prompts to generate diverse appearances and actions while maintaining identity consistency in the augmented images. Experiments show that DynASyn is capable of synthesizing highly realistic images of subjects with novel contexts and dynamic interactions with the surroundings, and outperforms baseline methods in both quantitative and qualitative aspects.
△ Less
Submitted 22 March, 2025;
originally announced March 2025.
-
Research on Superalignment Should Advance Now with Parallel Optimization of Competence and Conformity
Authors:
HyunJin Kim,
Xiaoyuan Yi,
Jing Yao,
Muhua Huang,
JinYeong Bak,
James Evans,
Xing Xie
Abstract:
The recent leap in AI capabilities, driven by big generative models, has sparked the possibility of achieving Artificial General Intelligence (AGI) and further triggered discussions on Artificial Superintelligence (ASI), a system surpassing all humans across all domains. This gives rise to the critical research question of: If we realize ASI, how do we align it with human values, ensuring it benef…
▽ More
The recent leap in AI capabilities, driven by big generative models, has sparked the possibility of achieving Artificial General Intelligence (AGI) and further triggered discussions on Artificial Superintelligence (ASI), a system surpassing all humans across all domains. This gives rise to the critical research question of: If we realize ASI, how do we align it with human values, ensuring it benefits rather than harms human society, a.k.a., the Superalignment problem. Despite ASI being regarded by many as solely a hypothetical concept, in this paper, we argue that superalignment is achievable and research on it should advance immediately, through simultaneous and alternating optimization of task competence and value conformity. We posit that superalignment is not merely a safeguard for ASI but also necessary for its realization. To support this position, we first provide a formal definition of superalignment rooted in the gap between capability and capacity and elaborate on our argument. Then we review existing paradigms, explore their interconnections and limitations, and illustrate a potential path to superalignment centered on two fundamental principles. We hope this work sheds light on a practical approach for developing the value-aligned next-generation AI, garnering greater benefits and reducing potential harms for humanity.
△ Less
Submitted 7 March, 2025;
originally announced March 2025.
-
GUIDE-CoT: Goal-driven and User-Informed Dynamic Estimation for Pedestrian Trajectory using Chain-of-Thought
Authors:
Sungsik Kim,
Janghyun Baek,
Jinkyu Kim,
Jaekoo Lee
Abstract:
While Large Language Models (LLMs) have recently shown impressive results in reasoning tasks, their application to pedestrian trajectory prediction remains challenging due to two key limitations: insufficient use of visual information and the difficulty of predicting entire trajectories. To address these challenges, we propose Goal-driven and User-Informed Dynamic Estimation for pedestrian traject…
▽ More
While Large Language Models (LLMs) have recently shown impressive results in reasoning tasks, their application to pedestrian trajectory prediction remains challenging due to two key limitations: insufficient use of visual information and the difficulty of predicting entire trajectories. To address these challenges, we propose Goal-driven and User-Informed Dynamic Estimation for pedestrian trajectory using Chain-of-Thought (GUIDE-CoT). Our approach integrates two innovative modules: (1) a goal-oriented visual prompt, which enhances goal prediction accuracy combining visual prompts with a pretrained visual encoder, and (2) a chain-of-thought (CoT) LLM for trajectory generation, which generates realistic trajectories toward the predicted goal. Moreover, our method introduces controllable trajectory generation, allowing for flexible and user-guided modifications to the predicted paths. Through extensive experiments on the ETH/UCY benchmark datasets, our method achieves state-of-the-art performance, delivering both high accuracy and greater adaptability in pedestrian trajectory prediction. Our code is publicly available at https://github.com/ai-kmu/GUIDE-CoT.
△ Less
Submitted 9 March, 2025;
originally announced March 2025.
-
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
Authors:
Simon A. Aytes,
Jinheon Baek,
Sung Ju Hwang
Abstract:
Recent advances in large language models have demonstrated remarkable reasoning capabilities through Chain of Thought (CoT) prompting, but often at the cost of excessive verbosity in their intermediate outputs, which increases computational overhead. We introduce Sketch-of-Thought (SoT), a novel prompting framework that combines cognitive-inspired reasoning paradigms with linguistic constraints to…
▽ More
Recent advances in large language models have demonstrated remarkable reasoning capabilities through Chain of Thought (CoT) prompting, but often at the cost of excessive verbosity in their intermediate outputs, which increases computational overhead. We introduce Sketch-of-Thought (SoT), a novel prompting framework that combines cognitive-inspired reasoning paradigms with linguistic constraints to minimize token usage while preserving reasoning accuracy. SoT is designed as a flexible framework that can incorporate any custom reasoning paradigms based on cognitive science, and we instantiate it with three such paradigms - Conceptual Chaining, Chunked Symbolism, and Expert Lexicons - each tailored to different reasoning tasks and selected dynamically via a lightweight routing model. Through comprehensive evaluation across 15 reasoning datasets with multiple languages and multimodal scenarios, we demonstrate that SoT achieves token reductions of 76% with negligible accuracy impact. In certain domains like mathematical and multi-hop reasoning, it even improves accuracy while using significantly fewer tokens. Our code is publicly available: https://www.github.com/SimonAytes/SoT.
△ Less
Submitted 7 March, 2025;
originally announced March 2025.
-
Enhancing the Product Quality of the Injection Process Using eXplainable Artificial Intelligence
Authors:
Jisoo Hong,
Yongmin Hong,
Jung-Woo Baek,
Sung-Woo Kang
Abstract:
The injection molding process is a traditional technique for making products in various industries such as electronics and automobiles via solidifying liquid resin into certain molds. Although the process is not related to creating the main part of engines or semiconductors, this manufacturing methodology sets the final form of the products. Re-cently, research has continued to reduce the defect r…
▽ More
The injection molding process is a traditional technique for making products in various industries such as electronics and automobiles via solidifying liquid resin into certain molds. Although the process is not related to creating the main part of engines or semiconductors, this manufacturing methodology sets the final form of the products. Re-cently, research has continued to reduce the defect rate of the injection molding process. This study proposes an optimal injection molding process control system to reduce the defect rate of injection molding products with XAI (eXplainable Artificial Intelligence) ap-proaches. Boosting algorithms (XGBoost and LightGBM) are used as tree-based classifiers for predicting whether each product is normal or defective. The main features to control the process for improving the product are extracted by SHapley Additive exPlanations, while the individual conditional expectation analyzes the optimal control range of these extracted features. To validate the methodology presented in this work, the actual injection molding AI manufacturing dataset provided by KAMP (Korea AI Manufacturing Platform) is employed for the case study. The results reveal that the defect rate decreases from 1.00% (Original defect rate) to 0.21% with XGBoost and 0.13% with LightGBM, respectively.
△ Less
Submitted 4 March, 2025;
originally announced March 2025.
-
Hiring under Congestion and Algorithmic Monoculture: Value of Strategic Behavior
Authors:
Jackie Baek,
Hamsa Bastani,
Shihan Chen
Abstract:
We study the impact of strategic behavior in a setting where firms compete to hire from a shared pool of applicants, and firms use a common algorithm to evaluate them. Each applicant is associated with a scalar score that is observed by all firms, provided by the algorithm. Firms simultaneously make interview decisions, where the number of interviews is capacity-constrained. Job offers are given t…
▽ More
We study the impact of strategic behavior in a setting where firms compete to hire from a shared pool of applicants, and firms use a common algorithm to evaluate them. Each applicant is associated with a scalar score that is observed by all firms, provided by the algorithm. Firms simultaneously make interview decisions, where the number of interviews is capacity-constrained. Job offers are given to those who pass the interview, and an applicant who receives multiple offers accepts one of them uniformly at random. We fully characterize the set of Nash equilibria under this model. Defining social welfare as the total number of applicants who find a job, we then compare the social welfare at a Nash equilibrium to a naive baseline where all firms interview applicants with the highest scores. We show that the Nash equilibrium greatly improves upon social welfare compared to the naive baseline, especially when the interview capacity is small and the number of firms is large. We also show that the price of anarchy is small, providing further appeal for the equilibrium solution.
We then study how the firms may converge to a Nash equilibrium. We show that when firms make interview decisions sequentially and each firm takes the best response action assuming they are the last to act, this process converges to an equilibrium when interview capacities are small. However, we show that the task of computing the best response is difficult if firms have to use its own historical samples to estimate it, while this task becomes trivial if firms have information on the degree of competition for each applicant. Therefore, converging to an equilibrium can be greatly facilitated if firms have information on the level of competition for each applicant.
△ Less
Submitted 27 February, 2025;
originally announced February 2025.
-
Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments
Authors:
Patomporn Payoungkhamdee,
Pume Tuchinda,
Jinheon Baek,
Samuel Cahyawijaya,
Can Udomcharoenchaikit,
Potsawee Manakul,
Peerat Limkonchotiwat,
Ekapol Chuangsuwanich,
Sarana Nutanong
Abstract:
Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages due to the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge…
▽ More
Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages due to the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge to generating programs from non-English questions. We propose a framework to evaluate PoT by separating multilingual reasoning from code execution to examine (i) the impact of fine-tuning on question-reasoning alignment and (ii) how reasoning quality affects answer correctness. Our findings demonstrate that PoT fine-tuning substantially enhances multilingual reasoning, outperforming CoT fine-tuned models. We further demonstrate a strong correlation between reasoning quality (measured through code quality) and answer accuracy, highlighting its potential as a test-time performance improvement heuristic.
△ Less
Submitted 25 February, 2025;
originally announced February 2025.
-
Toward Responsible Federated Large Language Models: Leveraging a Safety Filter and Constitutional AI
Authors:
Eunchung Noh,
Jeonghun Baek
Abstract:
Recent research has increasingly focused on training large language models (LLMs) using federated learning, known as FedLLM. However, responsible AI (RAI), which aims to ensure safe responses, remains underexplored in the context of FedLLM. In FedLLM, client data used for training may contain harmful content, leading to unsafe LLMs that generate harmful responses. Aggregating such unsafe LLMs into…
▽ More
Recent research has increasingly focused on training large language models (LLMs) using federated learning, known as FedLLM. However, responsible AI (RAI), which aims to ensure safe responses, remains underexplored in the context of FedLLM. In FedLLM, client data used for training may contain harmful content, leading to unsafe LLMs that generate harmful responses. Aggregating such unsafe LLMs into the global model and distributing them to clients may result in the widespread deployment of unsafe LLMs. To address this issue, we incorporate two well-known RAI methods into FedLLM: the safety filter and constitutional AI. Our experiments demonstrate that these methods significantly enhance the safety of the LLM, achieving over a 20% improvement on AdvBench, a benchmark for evaluating safety performance.
△ Less
Submitted 23 February, 2025;
originally announced February 2025.
-
Harnessing PDF Data for Improving Japanese Large Multimodal Models
Authors:
Jeonghun Baek,
Akiko Aizawa,
Kiyoharu Aizawa
Abstract:
Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited due to the lack of high-quality training data. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. To address this, we explore the potential of Japanese PDF data as a training resourc…
▽ More
Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited due to the lack of high-quality training data. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. To address this, we explore the potential of Japanese PDF data as a training resource, an area that remains largely underutilized. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs through layout analysis, OCR, and vision-language pairing, removing the need for manual annotation. Additionally, we construct instruction data from extracted image-text pairs to enrich the training data. To evaluate the effectiveness of PDF-derived data, we train Japanese LMMs and assess their performance on the Japanese LMM Benchmark. Our results demonstrate substantial improvements, with performance gains ranging from 3.9% to 13.8% on Heron-Bench. Further analysis highlights the impact of PDF-derived data on various factors, such as model size and language models, reinforcing its value as a multimodal resource for Japanese LMMs. We plan to make the source code and data publicly available upon acceptance.
△ Less
Submitted 20 February, 2025;
originally announced February 2025.
-
Beyond Turn-taking: Introducing Text-based Overlap into Human-LLM Interactions
Authors:
JiWoo Kim,
Minsuk Chang,
JinYeong Bak
Abstract:
Traditional text-based human-AI interactions often adhere to a strict turn-taking approach. In this research, we propose a novel approach that incorporates overlapping messages, mirroring natural human conversations. Through a formative study, we observed that even in text-based contexts, users instinctively engage in overlapping behaviors like "A: Today I went to-" "B: yeah." To capitalize on the…
▽ More
Traditional text-based human-AI interactions often adhere to a strict turn-taking approach. In this research, we propose a novel approach that incorporates overlapping messages, mirroring natural human conversations. Through a formative study, we observed that even in text-based contexts, users instinctively engage in overlapping behaviors like "A: Today I went to-" "B: yeah." To capitalize on these insights, we developed OverlapBot, a prototype chatbot where both AI and users can initiate overlapping. Our user study revealed that OverlapBot was perceived as more communicative and immersive than traditional turn-taking chatbot, fostering faster and more natural interactions. Our findings contribute to the understanding of design space for overlapping interactions. We also provide recommendations for implementing overlap-capable AI interactions to enhance the fluidity and engagement of text-based conversations.
△ Less
Submitted 29 January, 2025;
originally announced January 2025.
-
Dreamweaver: Learning Compositional World Models from Pixels
Authors:
Junyeob Baek,
Yi-Fu Wu,
Gautam Singh,
Sungjin Ahn
Abstract:
Humans have an innate ability to decompose their perceptions of the world into objects and their attributes, such as colors, shapes, and movement patterns. This cognitive process enables us to imagine novel futures by recombining familiar concepts. However, replicating this ability in artificial intelligence systems has proven challenging, particularly when it comes to modeling videos into composi…
▽ More
Humans have an innate ability to decompose their perceptions of the world into objects and their attributes, such as colors, shapes, and movement patterns. This cognitive process enables us to imagine novel futures by recombining familiar concepts. However, replicating this ability in artificial intelligence systems has proven challenging, particularly when it comes to modeling videos into compositional concepts and generating unseen, recomposed futures without relying on auxiliary data, such as text, masks, or bounding boxes. In this paper, we propose Dreamweaver, a neural architecture designed to discover hierarchical and compositional representations from raw videos and generate compositional future simulations. Our approach leverages a novel Recurrent Block-Slot Unit (RBSU) to decompose videos into their constituent objects and attributes. In addition, Dreamweaver uses a multi-future-frame prediction objective to capture disentangled representations for dynamic concepts more effectively as well as static concepts. In experiments, we demonstrate our model outperforms current state-of-the-art baselines for world modeling when evaluated under the DCI framework across multiple datasets. Furthermore, we show how the modularized concept representations of our model enable compositional imagination, allowing the generation of novel videos by recombining attributes from previously seen objects. cun-bjy.github.io/dreamweaver-website
△ Less
Submitted 10 April, 2025; v1 submitted 23 January, 2025;
originally announced January 2025.
-
Real-time Verification and Refinement of Language Model Text Generation
Authors:
Joonho Ko,
Jinheon Baek,
Sung Ju Hwang
Abstract:
Large language models (LLMs) have shown remarkable performance across a wide range of natural language tasks. However, a critical challenge remains in that they sometimes generate factually incorrect answers. To address this, while many previous work has focused on identifying errors in their generation and further refining them, they are slow in deployment since they are designed to verify the re…
▽ More
Large language models (LLMs) have shown remarkable performance across a wide range of natural language tasks. However, a critical challenge remains in that they sometimes generate factually incorrect answers. To address this, while many previous work has focused on identifying errors in their generation and further refining them, they are slow in deployment since they are designed to verify the response from LLMs only after their entire generation (from the first to last tokens) is done. Further, we observe that once LLMs generate incorrect tokens early on, there is a higher likelihood that subsequent tokens will also be factually incorrect. To this end, in this work, we propose Streaming-VR (Streaming Verification and Refinement), a novel approach designed to enhance the efficiency of verification and refinement of LLM outputs. Specifically, the proposed Streaming-VR enables on-the-fly verification and correction of tokens as they are being generated, similar to a streaming process, ensuring that each subset of tokens is checked and refined in real-time by another LLM as the LLM constructs its response. Through comprehensive evaluations on multiple datasets, we demonstrate that our approach not only enhances the factual accuracy of LLMs, but also offers a more efficient solution compared to prior refinement methods.
△ Less
Submitted 13 April, 2025; v1 submitted 13 January, 2025;
originally announced January 2025.
-
VideoRAG: Retrieval-Augmented Generation over Video Corpus
Authors:
Soyeong Jeong,
Kangsan Kim,
Jinheon Baek,
Sung Ju Hwang
Abstract:
Retrieval-Augmented Generation (RAG) is a powerful strategy for improving the factual accuracy of models by retrieving external knowledge relevant to queries and incorporating it into the generation process. However, existing approaches primarily focus on text, with some recent advancements considering images, and they largely overlook videos, a rich source of multimodal knowledge capable of repre…
▽ More
Retrieval-Augmented Generation (RAG) is a powerful strategy for improving the factual accuracy of models by retrieving external knowledge relevant to queries and incorporating it into the generation process. However, existing approaches primarily focus on text, with some recent advancements considering images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing contextual details more effectively than any other modality. While very recent studies explore the use of videos in response generation, they either predefine query-associated videos without retrieval or convert videos into textual descriptions losing multimodal richness. To tackle these, we introduce VideoRAG, a framework that not only dynamically retrieves videos based on their relevance with queries but also utilizes both visual and textual information. The operation of VideoRAG is powered by recent Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and the seamless integration of retrieved videos jointly with queries for response generation. Also, inspired by that the context size of LVLMs may not be sufficient to process all frames in extremely long videos and not all frames are equally important, we introduce a video frame selection mechanism to extract the most informative subset of frames, along with a strategy to extract textual information from videos (as it can aid the understanding of video content) when their subtitles are not available. We experimentally validate the effectiveness of VideoRAG, showcasing that it is superior to relevant baselines. Code is available at https://github.com/starsuzi/VideoRAG.
△ Less
Submitted 4 March, 2025; v1 submitted 10 January, 2025;
originally announced January 2025.
-
Efficient Long Context Language Model Retrieval with Compression
Authors:
Minju Seo,
Jinheon Baek,
Seongyun Lee,
Sung Ju Hwang
Abstract:
Long Context Language Models (LCLMs) have emerged as a new paradigm to perform Information Retrieval (IR), which enables the direct ingestion and retrieval of information by processing an entire corpus in their single context, showcasing the potential to surpass traditional sparse and dense retrieval methods. However, processing a large number of passages within in-context for retrieval is computa…
▽ More
Long Context Language Models (LCLMs) have emerged as a new paradigm to perform Information Retrieval (IR), which enables the direct ingestion and retrieval of information by processing an entire corpus in their single context, showcasing the potential to surpass traditional sparse and dense retrieval methods. However, processing a large number of passages within in-context for retrieval is computationally expensive, and handling their representations during inference further exacerbates the processing time; thus, we aim to make LCLM retrieval more efficient and potentially more effective with passage compression. Specifically, we propose a new compression approach tailored for LCLM retrieval, which is trained to maximize the retrieval performance while minimizing the length of the compressed passages. To accomplish this, we generate the synthetic data, where compressed passages are automatically created and labeled as chosen or rejected according to their retrieval success for a given query, and we train the proposed Compression model for Long context Retrieval (CoLoR) with this data via preference optimization while adding the length regularization loss on top of it to enforce brevity. Through extensive experiments on 9 datasets, we show that CoLoR improves the retrieval performance by 6% while compressing the in-context size by a factor of 1.91.
△ Less
Submitted 24 December, 2024;
originally announced December 2024.
-
Revisiting In-Context Learning with Long Context Language Models
Authors:
Jinheon Baek,
Sun Jae Lee,
Prakhar Gupta,
Geunseob Oh,
Siddharth Dalmia,
Prateek Kolhar
Abstract:
In-Context Learning (ICL) is a technique by which language models make predictions based on examples provided in their input context. Previously, their context window size imposed a limit on the number of examples that can be shown, making example selection techniques crucial for identifying the maximally effective set of examples. However, the recent advent of Long Context Language Models (LCLMs)…
▽ More
In-Context Learning (ICL) is a technique by which language models make predictions based on examples provided in their input context. Previously, their context window size imposed a limit on the number of examples that can be shown, making example selection techniques crucial for identifying the maximally effective set of examples. However, the recent advent of Long Context Language Models (LCLMs) has significantly increased the number of examples that can be included in context, raising an important question of whether ICL performance in a many-shot regime is still sensitive to the method of sample selection. To answer this, we revisit these approaches in the context of LCLMs through extensive experiments on 18 datasets spanning 4 tasks. Surprisingly, we observe that sophisticated example selection techniques do not yield significant improvements over a simple random sample selection method. Instead, we find that the advent of LCLMs has fundamentally shifted the challenge of ICL from that of selecting the most effective examples to that of collecting sufficient examples to fill the context window. Specifically, in certain datasets, including all available examples does not fully utilize the context window; however, by augmenting the examples in context with a simple data augmentation approach, we substantially improve ICL performance by 5%.
△ Less
Submitted 6 January, 2025; v1 submitted 22 December, 2024;
originally announced December 2024.
-
The Road to Artificial SuperIntelligence: A Comprehensive Survey of Superalignment
Authors:
HyunJin Kim,
Xiaoyuan Yi,
Jing Yao,
Jianxun Lian,
Muhua Huang,
Shitong Duan,
JinYeong Bak,
Xing Xie
Abstract:
The emergence of large language models (LLMs) has sparked the possibility of about Artificial Superintelligence (ASI), a hypothetical AI system surpassing human intelligence. However, existing alignment paradigms struggle to guide such advanced AI systems. Superalignment, the alignment of AI systems with human values and safety requirements at superhuman levels of capability aims to addresses two…
▽ More
The emergence of large language models (LLMs) has sparked the possibility of about Artificial Superintelligence (ASI), a hypothetical AI system surpassing human intelligence. However, existing alignment paradigms struggle to guide such advanced AI systems. Superalignment, the alignment of AI systems with human values and safety requirements at superhuman levels of capability aims to addresses two primary goals -- scalability in supervision to provide high-quality guidance signals and robust governance to ensure alignment with human values. In this survey, we examine scalable oversight methods and potential solutions for superalignment. Specifically, we explore the concept of ASI, the challenges it poses, and the limitations of current alignment paradigms in addressing the superalignment problem. Then we review scalable oversight methods for superalignment. Finally, we discuss the key challenges and propose pathways for the safe and continual improvement of ASI systems. By comprehensively reviewing the current literature, our goal is provide a systematical introduction of existing methods, analyze their strengths and limitations, and discuss potential future directions.
△ Less
Submitted 25 December, 2024; v1 submitted 20 December, 2024;
originally announced December 2024.
-
Self-Training Meets Consistency: Improving LLMs' Reasoning with Consistency-Driven Rationale Evaluation
Authors:
Jaehyeok Lee,
Keisuke Sakaguchi,
JinYeong Bak
Abstract:
Self-training approach for large language models (LLMs) improves reasoning abilities by training the models on their self-generated rationales. Previous approaches have labeled rationales that produce correct answers for a given question as appropriate for training. However, a single measure risks misjudging rationale quality, leading the models to learn flawed reasoning patterns. To address this…
▽ More
Self-training approach for large language models (LLMs) improves reasoning abilities by training the models on their self-generated rationales. Previous approaches have labeled rationales that produce correct answers for a given question as appropriate for training. However, a single measure risks misjudging rationale quality, leading the models to learn flawed reasoning patterns. To address this issue, we propose CREST (Consistency-driven Rationale Evaluation for Self-Training), a self-training framework that further evaluates each rationale through follow-up questions and leverages this evaluation to guide its training. Specifically, we introduce two methods: (1) filtering out rationales that frequently result in incorrect answers on follow-up questions and (2) preference learning based on mixed preferences from rationale evaluation results of both original and follow-up questions. Experiments on three question-answering datasets using open LLMs show that CREST not only improves the logical robustness and correctness of rationales but also improves reasoning abilities compared to previous self-training approaches.
△ Less
Submitted 6 February, 2025; v1 submitted 10 November, 2024;
originally announced November 2024.
-
GlocalCLIP: Object-agnostic Global-Local Prompt Learning for Zero-shot Anomaly Detection
Authors:
Jiyul Ham,
Yonggon Jung,
Jun-Geol Baek
Abstract:
Zero-shot anomaly detection (ZSAD) is crucial for detecting anomalous patterns in target datasets without using training samples, specifically in scenarios where there are distributional differences between the target domain and training data or where data scarcity arises because of restricted access. Although recently pretrained vision-language models demonstrate strong zero-shot performance acro…
▽ More
Zero-shot anomaly detection (ZSAD) is crucial for detecting anomalous patterns in target datasets without using training samples, specifically in scenarios where there are distributional differences between the target domain and training data or where data scarcity arises because of restricted access. Although recently pretrained vision-language models demonstrate strong zero-shot performance across various visual tasks, they focus on learning class semantics, which makes their direct application to ZSAD challenging. To address this scenario, we propose GlocalCLIP, which uniquely separates global and local prompts and jointly optimizes them. This approach enables the object-agnostic glocal semantic prompt to effectively capture general normal and anomalous patterns without dependency on specific objects in the image. We refine the text prompts for more precise adjustments by utilizing deep-text prompt tuning in the text encoder. In the vision encoder, we apply V-V attention layers to capture detailed local image features. Finally, we introduce glocal contrastive learning to improve the complementary learning of global and local prompts, effectively detecting anomalous patterns across various domains. The generalization performance of GlocalCLIP in ZSAD was demonstrated on 15 real-world datasets from both the industrial and medical domains, achieving superior performance compared to existing methods. Code will be made available at https://github.com/YUL-git/GlocalCLIP.
△ Less
Submitted 8 December, 2024; v1 submitted 9 November, 2024;
originally announced November 2024.
-
Rethinking Code Refinement: Learning to Judge Code Efficiency
Authors:
Minju Seo,
Jinheon Baek,
Sung Ju Hwang
Abstract:
Large Language Models (LLMs) have demonstrated impressive capabilities in understanding and generating codes. Due to these capabilities, many recent methods are proposed to automatically refine the codes with LLMs. However, we should rethink that the refined codes (from LLMs and even humans) are not always more efficient than their original versions. On the other hand, running two different versio…
▽ More
Large Language Models (LLMs) have demonstrated impressive capabilities in understanding and generating codes. Due to these capabilities, many recent methods are proposed to automatically refine the codes with LLMs. However, we should rethink that the refined codes (from LLMs and even humans) are not always more efficient than their original versions. On the other hand, running two different versions of codes and comparing them every time is not ideal and time-consuming. Therefore, in this work, we propose a novel method based on the code language model that is trained to judge the efficiency between two different codes (generated across humans and machines) by either classifying the superior one or predicting the relative improvement. We validate our method on multiple programming languages with multiple refinement steps, demonstrating that the proposed method can effectively distinguish between more and less efficient versions of code.
△ Less
Submitted 29 October, 2024;
originally announced October 2024.
-
JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation
Authors:
Shota Onohara,
Atsuyuki Miyai,
Yuki Imajuku,
Kazuki Egashira,
Jeonghun Baek,
Xiang Yue,
Graham Neubig,
Kiyoharu Aizawa
Abstract:
Accelerating research on Large Multimodal Models (LMMs) in non-English languages is crucial for enhancing user experiences across broader populations. In this paper, we introduce JMMMU (Japanese MMMU), the first large-scale Japanese benchmark designed to evaluate LMMs on expert-level tasks based on the Japanese cultural context. To facilitate comprehensive culture-aware evaluation, JMMMU features…
▽ More
Accelerating research on Large Multimodal Models (LMMs) in non-English languages is crucial for enhancing user experiences across broader populations. In this paper, we introduce JMMMU (Japanese MMMU), the first large-scale Japanese benchmark designed to evaluate LMMs on expert-level tasks based on the Japanese cultural context. To facilitate comprehensive culture-aware evaluation, JMMMU features two complementary subsets: (i) culture-agnostic (CA) subset, where the culture-independent subjects (e.g., Math) are selected and translated into Japanese, enabling one-to-one comparison with its English counterpart MMMU; and (ii) culture-specific (CS) subset, comprising newly crafted subjects that reflect Japanese cultural context. Using the CA subset, we observe performance drop in many LMMs when evaluated in Japanese, which is purely attributable to language variation. Using the CS subset, we reveal their inadequate Japanese cultural understanding. Further, by combining both subsets, we identify that some LMMs perform well on the CA subset but not on the CS subset, exposing a shallow understanding of the Japanese language that lacks depth in cultural understanding. We hope this work will not only help advance LMM performance in Japanese but also serve as a guideline to create high-standard, culturally diverse benchmarks for multilingual LMM development. The project page is https://mmmu-japanese-benchmark.github.io/JMMMU/.
△ Less
Submitted 19 March, 2025; v1 submitted 22 October, 2024;
originally announced October 2024.
-
Unified Multimodal Interleaved Document Representation for Retrieval
Authors:
Jaewoo Lee,
Joonho Ko,
Jinheon Baek,
Soyeong Jeong,
Sung Ju Hwang
Abstract:
Information Retrieval (IR) methods aim to identify documents relevant to a query, which have been widely applied in various natural language tasks. However, existing approaches typically consider only the textual content within documents, overlooking the fact that documents can contain multiple modalities, including images and tables. Also, they often segment each long document into multiple discr…
▽ More
Information Retrieval (IR) methods aim to identify documents relevant to a query, which have been widely applied in various natural language tasks. However, existing approaches typically consider only the textual content within documents, overlooking the fact that documents can contain multiple modalities, including images and tables. Also, they often segment each long document into multiple discrete passages for embedding, which prevents them from capturing the overall document context and interactions between paragraphs. To address these two challenges, we propose a method that holistically embeds documents interleaved with multiple modalities by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents into passages, instead of representing and retrieving passages individually, we further merge the representations of segmented passages into one single document representation, while we additionally introduce a reranking strategy to decouple and identify the relevant passage within the document if necessary. Then, through extensive experiments on diverse IR scenarios considering both the textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to the consideration of the multimodal information within documents.
△ Less
Submitted 16 December, 2024; v1 submitted 3 October, 2024;
originally announced October 2024.
-
Tuning Fast Memory Size based on Modeling of Page Migration for Tiered Memory
Authors:
Shangye Chen,
Jin Huang,
Shuangyan Yang,
Jie Liu,
Huaicheng Li,
Dimitrios Nikolopoulos,
Junhee Ryu,
Jinho Baek,
Kwangsik Shin,
Dong Li
Abstract:
Tiered memory, built upon a combination of fast memory and slow memory, provides a cost-effective solution to meet ever-increasing requirements from emerging applications for large memory capacity. Reducing the size of fast memory is valuable to improve memory utilization in production and reduce production costs because fast memory tends to be expensive. However, deciding the fast memory size is…
▽ More
Tiered memory, built upon a combination of fast memory and slow memory, provides a cost-effective solution to meet ever-increasing requirements from emerging applications for large memory capacity. Reducing the size of fast memory is valuable to improve memory utilization in production and reduce production costs because fast memory tends to be expensive. However, deciding the fast memory size is challenging because there is a complex interplay between application characterization and the overhead of page migration used to mitigate the impact of limited fast memory capacity. In this paper, we introduce a system, Tuna, to decide fast memory size based on modeling of page migration. Tuna uses micro-benchmarking to model the impact of page migration on application performance using three metrics. Tuna decides the fast memory size based on offline modeling results and limited information on workload telemetry. Evaluating with common big-memory applications and using 5% as the performance loss target, we show that Tuna in combination with a page management system (TPP) saves fast memory by 8.5% on average (up to 16%). This is in contrast to the 5% saving in fast memory reported by Microsoft Pond for the same workloads (BFS and SSSP) and the same performance loss target.
△ Less
Submitted 30 September, 2024;
originally announced October 2024.
-
Formalizing Mason-Stothers Theorem and its Corollaries in Lean 4
Authors:
Jineon Baek,
Seewoo Lee
Abstract:
The ABC conjecture implies many conjectures and theorems in number theory, including the celebrated Fermat's Last Theorem. Mason-Stothers Theorem is a function field analogue of the ABC conjecture that admits a much more elementary proof with many interesting consequences, including a polynomial version of Fermat's Last Theorem. While years of dedicated effort are expected for a full formalization…
▽ More
The ABC conjecture implies many conjectures and theorems in number theory, including the celebrated Fermat's Last Theorem. Mason-Stothers Theorem is a function field analogue of the ABC conjecture that admits a much more elementary proof with many interesting consequences, including a polynomial version of Fermat's Last Theorem. While years of dedicated effort are expected for a full formalization of Fermat's Last Theorem, the simple proof of Mason-Stothers Theorem and its corollaries calls for an immediate formalization.
We formalize an elementary proof of by Snyder in Lean 4, and also formalize many consequences of Mason-Stothers, including (i) non-solvability of Fermat-Cartan equations in polynomials, (ii) non-parametrizability of a certain elliptic curve, and (iii) Davenport's Theorem. We compare our work to existing formalizations of Mason-Stothers by Eberl in Isabelle and Wagemaker in Lean 3 respectively. Our formalization is based on the mathlib4 library of Lean 4, and is currently being ported back to mathlib4.
△ Less
Submitted 27 August, 2024;
originally announced August 2024.
-
Perturb-and-Compare Approach for Detecting Out-of-Distribution Samples in Constrained Access Environments
Authors:
Heeyoung Lee,
Hoyoon Byun,
Changdae Oh,
JinYeong Bak,
Kyungwoo Song
Abstract:
Accessing machine learning models through remote APIs has been gaining prevalence following the recent trend of scaling up model parameters for increased performance. Even though these models exhibit remarkable ability, detecting out-of-distribution (OOD) samples remains a crucial safety concern for end users as these samples may induce unreliable outputs from the model. In this work, we propose a…
▽ More
Accessing machine learning models through remote APIs has been gaining prevalence following the recent trend of scaling up model parameters for increased performance. Even though these models exhibit remarkable ability, detecting out-of-distribution (OOD) samples remains a crucial safety concern for end users as these samples may induce unreliable outputs from the model. In this work, we propose an OOD detection framework, MixDiff, that is applicable even when the model's parameters or its activations are not accessible to the end user. To bypass the access restriction, MixDiff applies an identical input-level perturbation to a given target sample and a similar in-distribution (ID) sample, then compares the relative difference in the model outputs of these two samples. MixDiff is model-agnostic and compatible with existing output-based OOD detection methods. We provide theoretical analysis to illustrate MixDiff's effectiveness in discerning OOD samples that induce overconfident outputs from the model and empirically demonstrate that MixDiff consistently enhances the OOD detection performance on various datasets in vision and text domains.
△ Less
Submitted 19 August, 2024;
originally announced August 2024.
-
Harmful Suicide Content Detection
Authors:
Kyumin Park,
Myung Jae Baik,
YeongJun Hwang,
Yen Shin,
HoJae Lee,
Ruda Lee,
Sang Min Lee,
Je Young Hannah Sun,
Ah Rah Lee,
Si Yeun Yoon,
Dong-ho Lee,
Jihyung Moon,
JinYeong Bak,
Kyunghyun Cho,
Jong-Woo Paik,
Sungjoon Park
Abstract:
Harmful suicide content on the Internet is a significant risk factor inducing suicidal thoughts and behaviors among vulnerable populations. Despite global efforts, existing resources are insufficient, specifically in high-risk regions like the Republic of Korea. Current research mainly focuses on understanding negative effects of such content or suicide risk in individuals, rather than on automati…
▽ More
Harmful suicide content on the Internet is a significant risk factor inducing suicidal thoughts and behaviors among vulnerable populations. Despite global efforts, existing resources are insufficient, specifically in high-risk regions like the Republic of Korea. Current research mainly focuses on understanding negative effects of such content or suicide risk in individuals, rather than on automatically detecting the harmfulness of content. To fill this gap, we introduce a harmful suicide content detection task for classifying online suicide content into five harmfulness levels. We develop a multi-modal benchmark and a task description document in collaboration with medical professionals, and leverage large language models (LLMs) to explore efficient methods for moderating such content. Our contributions include proposing a novel detection task, a multi-modal Korean benchmark with expert annotations, and suggesting strategies using LLMs to detect illegal and harmful content. Owing to the potential harm involved, we publicize our implementations and benchmark, incorporating an ethical verification process.
△ Less
Submitted 2 June, 2024;
originally announced July 2024.
-
KpopMT: Translation Dataset with Terminology for Kpop Fandom
Authors:
JiWoo Kim,
Yunsu Kim,
JinYeong Bak
Abstract:
While machines learn from existing corpora, humans have the unique capability to establish and accept new language systems. This makes human form unique language systems within social groups. Aligning with this, we focus on a gap remaining in addressing translation challenges within social groups, where in-group members utilize unique terminologies. We propose KpopMT dataset, which aims to fill th…
▽ More
While machines learn from existing corpora, humans have the unique capability to establish and accept new language systems. This makes human form unique language systems within social groups. Aligning with this, we focus on a gap remaining in addressing translation challenges within social groups, where in-group members utilize unique terminologies. We propose KpopMT dataset, which aims to fill this gap by enabling precise terminology translation, choosing Kpop fandom as an initiative for social groups given its global popularity. Expert translators provide 1k English translations for Korean posts and comments, each annotated with specific terminology within social groups' language systems. We evaluate existing translation systems including GPT models on KpopMT to identify their failure cases. Results show overall low scores, underscoring the challenges of reflecting group-specific terminologies and styles in translation. We make KpopMT publicly available.
△ Less
Submitted 10 July, 2024;
originally announced July 2024.
-
MentalAgora: A Gateway to Advanced Personalized Care in Mental Health through Multi-Agent Debating and Attribute Control
Authors:
Yeonji Lee,
Sangjun Park,
Kyunghyun Cho,
JinYeong Bak
Abstract:
As mental health issues globally escalate, there is a tremendous need for advanced digital support systems. We introduce MentalAgora, a novel framework employing large language models enhanced by interaction between multiple agents for tailored mental health support. This framework operates through three stages: strategic debating, tailored counselor creation, and response generation, enabling the…
▽ More
As mental health issues globally escalate, there is a tremendous need for advanced digital support systems. We introduce MentalAgora, a novel framework employing large language models enhanced by interaction between multiple agents for tailored mental health support. This framework operates through three stages: strategic debating, tailored counselor creation, and response generation, enabling the dynamic customization of responses based on individual user preferences and therapeutic needs. We conduct experiments utilizing a high-quality evaluation dataset TherapyTalk crafted with mental health professionals, shwoing that MentalAgora generates expert-aligned and user preference-enhanced responses. Our evaluations, including experiments and user studies, demonstrate that MentalAgora aligns with professional standards and effectively meets user preferences, setting a new benchmark for digital mental health interventions.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
Pose-dIVE: Pose-Diversified Augmentation with Diffusion Model for Person Re-Identification
Authors:
Inès Hyeonsu Kim,
JoungBin Lee,
Woojeong Jin,
Soowon Son,
Kyusun Cho,
Junyoung Seo,
Min-Seop Kwak,
Seokju Cho,
JeongYeol Baek,
Byeongwon Lee,
Seungryong Kim
Abstract:
Person re-identification (Re-ID) often faces challenges due to variations in human poses and camera viewpoints, which significantly affect the appearance of individuals across images. Existing datasets frequently lack diversity and scalability in these aspects, hindering the generalization of Re-ID models to new camera systems. We propose Pose-dIVE, a novel data augmentation approach that incorpor…
▽ More
Person re-identification (Re-ID) often faces challenges due to variations in human poses and camera viewpoints, which significantly affect the appearance of individuals across images. Existing datasets frequently lack diversity and scalability in these aspects, hindering the generalization of Re-ID models to new camera systems. We propose Pose-dIVE, a novel data augmentation approach that incorporates sparse and underrepresented human pose and camera viewpoint examples into the training data, addressing the limited diversity in the original training data distribution. Our objective is to augment the training dataset to enable existing Re-ID models to learn features unbiased by human pose and camera viewpoint variations. To achieve this, we leverage the knowledge of pre-trained large-scale diffusion models. By conditioning the diffusion model on both the human pose and camera viewpoint concurrently through the SMPL model, we generate training data with diverse human poses and camera viewpoints. Experimental results demonstrate the effectiveness of our method in addressing human pose bias and enhancing the generalizability of Re-ID models compared to other data augmentation-based Re-ID approaches.
△ Less
Submitted 15 October, 2024; v1 submitted 23 June, 2024;
originally announced June 2024.
-
Database-Augmented Query Representation for Information Retrieval
Authors:
Soyeong Jeong,
Jinheon Baek,
Sukmin Cho,
Sung Ju Hwang,
Jong C. Park
Abstract:
Information retrieval models that aim to search for the documents relevant to the given query have shown many successes, which have been applied to diverse tasks. However, the query provided by the user is oftentimes very short, which challenges the retrievers to correctly fetch relevant documents. To tackle this, existing studies have proposed expanding the query with a couple of additional (user…
▽ More
Information retrieval models that aim to search for the documents relevant to the given query have shown many successes, which have been applied to diverse tasks. However, the query provided by the user is oftentimes very short, which challenges the retrievers to correctly fetch relevant documents. To tackle this, existing studies have proposed expanding the query with a couple of additional (user-related) features related to the query. Yet, they may be suboptimal to effectively augment the query, though there is plenty of information available to augment it in a relational database. Motivated by this, we present a novel retrieval framework called Database-Augmented Query representation (DAQu), which augments the original query with various (query-related) metadata across multiple tables. In addition, as the number of features in the metadata can be very large and there is no order among them, we encode them with our graph-based set encoding strategy, which considers hierarchies of features in the database without order. We validate DAQu in diverse retrieval scenarios that can incorporate metadata from the relational database, demonstrating that ours significantly enhances overall retrieval performance, compared to existing query augmentation methods.
△ Less
Submitted 23 June, 2024;
originally announced June 2024.
-
Social Learning with Bounded Rationality: Negative Reviews Persist under Newest First
Authors:
Jackie Baek,
Atanas Dinev,
Thodoris Lykouris
Abstract:
We study a model of social learning from reviews where customers are computationally limited and make purchases based on reading only the first few reviews displayed by the platform. Under this bounded rationality, we establish that the review ordering policy can have a significant impact. In particular, the popular Newest First ordering induces a negative review to persist as the most recent revi…
▽ More
We study a model of social learning from reviews where customers are computationally limited and make purchases based on reading only the first few reviews displayed by the platform. Under this bounded rationality, we establish that the review ordering policy can have a significant impact. In particular, the popular Newest First ordering induces a negative review to persist as the most recent review longer than a positive review. This phenomenon, which we term the Cost of Newest First, can make the long-term revenue unboundedly lower than a counterpart where reviews are exogenously drawn for each customer.
We show that the impact of the Cost of Newest First can be mitigated under dynamic pricing, which allows the price to depend on the set of displayed reviews. Under the optimal dynamic pricing policy, the revenue loss is at most a factor of 2. On the way, we identify a structural property for this optimal dynamic pricing: the prices should ensure that the probability of a purchase is always the same, regardless of the state of reviews. We also study an extension of the model where customers put more weight on more recent reviews (and discount older reviews based on their time of posting), and we show that Newest First is still not the optimal ordering policy if customers discount slowly.
Lastly, we corroborate our theoretical findings using a real-world review dataset. We find that the average rating of the first page of reviews is statistically significantly smaller than the overall average rating, which is in line with our theoretical results.
△ Less
Submitted 22 August, 2024; v1 submitted 11 June, 2024;
originally announced June 2024.
-
PlanDQ: Hierarchical Plan Orchestration via D-Conductor and Q-Performer
Authors:
Chang Chen,
Junyeob Baek,
Fei Deng,
Kenji Kawaguchi,
Caglar Gulcehre,
Sungjin Ahn
Abstract:
Despite the recent advancements in offline RL, no unified algorithm could achieve superior performance across a broad range of tasks. Offline \textit{value function learning}, in particular, struggles with sparse-reward, long-horizon tasks due to the difficulty of solving credit assignment and extrapolation errors that accumulates as the horizon of the task grows.~On the other hand, models that ca…
▽ More
Despite the recent advancements in offline RL, no unified algorithm could achieve superior performance across a broad range of tasks. Offline \textit{value function learning}, in particular, struggles with sparse-reward, long-horizon tasks due to the difficulty of solving credit assignment and extrapolation errors that accumulates as the horizon of the task grows.~On the other hand, models that can perform well in long-horizon tasks are designed specifically for goal-conditioned tasks, which commonly perform worse than value function learning methods on short-horizon, dense-reward scenarios. To bridge this gap, we propose a hierarchical planner designed for offline RL called PlanDQ. PlanDQ incorporates a diffusion-based planner at the high level, named D-Conductor, which guides the low-level policy through sub-goals. At the low level, we used a Q-learning based approach called the Q-Performer to accomplish these sub-goals. Our experimental results suggest that PlanDQ can achieve superior or competitive performance on D4RL continuous control benchmark tasks as well as AntMaze, Kitchen, and Calvin as long-horizon tasks.
△ Less
Submitted 10 June, 2024;
originally announced June 2024.
-
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
Authors:
David Romero,
Chenyang Lyu,
Haryo Akbarianto Wibowo,
Teresa Lynn,
Injy Hamed,
Aditya Nanda Kishore,
Aishik Mandal,
Alina Dragonetti,
Artem Abzaliev,
Atnafu Lambebo Tonja,
Bontu Fufa Balcha,
Chenxi Whitehouse,
Christian Salamea,
Dan John Velasco,
David Ifeoluwa Adelani,
David Le Meur,
Emilio Villa-Cueva,
Fajri Koto,
Fauzan Farooqui,
Frederico Belcavello,
Ganzorig Batnasan,
Gisela Vallejo,
Grainne Caulfield,
Guido Ivetta,
Haiyue Song
, et al. (51 additional authors not shown)
Abstract:
Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason on knowledge present in both visual and textual data. However, most of the current VQA models use datasets that are primarily focused on English and a few major world languages, with images that are typically Western-centric. While recen…
▽ More
Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason on knowledge present in both visual and textual data. However, most of the current VQA models use datasets that are primarily focused on English and a few major world languages, with images that are typically Western-centric. While recent efforts have tried to increase the number of languages covered on VQA datasets, they still lack diversity in low-resource languages. More importantly, although these datasets often extend their linguistic range via translation or some other approaches, they usually keep images the same, resulting in narrow cultural representation. To address these limitations, we construct CVQA, a new Culturally-diverse multilingual Visual Question Answering benchmark, designed to cover a rich set of languages and cultures, where we engage native speakers and cultural experts in the data collection process. As a result, CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions. We then benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models. This benchmark can serve as a probing evaluation suite for assessing the cultural capability and bias of multimodal models and hopefully encourage more research efforts toward increasing cultural awareness and linguistic diversity in this field.
△ Less
Submitted 4 November, 2024; v1 submitted 9 June, 2024;
originally announced June 2024.
-
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Authors:
Seungone Kim,
Juyoung Suk,
Ji Yong Cho,
Shayne Longpre,
Chaeeun Kim,
Dongkeun Yoon,
Guijin Son,
Yejin Cho,
Sheikh Shafayat,
Jinheon Baek,
Sue Hyun Park,
Hyeonbin Hwang,
Jinkyung Jo,
Hyowon Cho,
Haebin Shin,
Seongyun Lee,
Hanseok Oh,
Noah Lee,
Namgyu Ho,
Se June Joo,
Miyoung Ko,
Yoonjoo Lee,
Hyungjoo Chae,
Jamin Shin,
Joel Jang
, et al. (7 additional authors not shown)
Abstract:
As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria like helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on spec…
▽ More
As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria like helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 103 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available at https://github.com/prometheus-eval/prometheus-eval/tree/main/BiGGen-Bench.
△ Less
Submitted 25 March, 2025; v1 submitted 9 June, 2024;
originally announced June 2024.
-
ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models
Authors:
Jinheon Baek,
Sujay Kumar Jauhar,
Silviu Cucerzan,
Sung Ju Hwang
Abstract:
The pace of scientific research, vital for improving human life, is complex, slow, and needs specialized expertise. Meanwhile, novel, impactful research often stems from both a deep understanding of prior work, and a cross-pollination of ideas across domains and fields. To enhance the productivity of researchers, we propose ResearchAgent, which leverages the encyclopedic knowledge and linguistic r…
▽ More
The pace of scientific research, vital for improving human life, is complex, slow, and needs specialized expertise. Meanwhile, novel, impactful research often stems from both a deep understanding of prior work, and a cross-pollination of ideas across domains and fields. To enhance the productivity of researchers, we propose ResearchAgent, which leverages the encyclopedic knowledge and linguistic reasoning capabilities of Large Language Models (LLMs) to assist them in their work. This system automatically defines novel problems, proposes methods and designs experiments, while iteratively refining them based on the feedback from collaborative LLM-powered reviewing agents. Specifically, starting with a core scientific paper, ResearchAgent is augmented not only with relevant publications by connecting information over an academic graph but also entities retrieved from a knowledge store derived from shared underlying concepts mined across numerous papers. Then, mimicking a scientific approach to improving ideas with peer discussions, we leverage multiple LLM-based ReviewingAgents that provide reviews and feedback via iterative revision processes. These reviewing agents are instantiated with human preference-aligned LLMs whose criteria for evaluation are elicited from actual human judgments via LLM prompting. We experimentally validate our ResearchAgent on scientific publications across multiple disciplines, showing its effectiveness in generating novel, clear, and valid ideas based on both human and model-based evaluation results. Our initial foray into AI-mediated scientific research has important implications for the development of future systems aimed at supporting researchers in their ideation and operationalization of novel work.
△ Less
Submitted 9 February, 2025; v1 submitted 11 April, 2024;
originally announced April 2024.
-
The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability
Authors:
Stephen Casper,
Jieun Yun,
Joonhyuk Baek,
Yeseong Jung,
Minhwan Kim,
Kiwan Kwon,
Saerom Park,
Hayden Moore,
David Shriver,
Marissa Connor,
Keltin Grimes,
Angus Nicolson,
Arush Tagade,
Jessica Rumbelow,
Hieu Minh Nguyen,
Dylan Hadfield-Menell
Abstract:
Interpretability techniques are valuable for helping humans understand and oversee AI systems. The SaTML 2024 CNN Interpretability Competition solicited novel methods for studying convolutional neural networks (CNNs) at the ImageNet scale. The objective of the competition was to help human crowd-workers identify trojans in CNNs. This report showcases the methods and results of four featured compet…
▽ More
Interpretability techniques are valuable for helping humans understand and oversee AI systems. The SaTML 2024 CNN Interpretability Competition solicited novel methods for studying convolutional neural networks (CNNs) at the ImageNet scale. The objective of the competition was to help human crowd-workers identify trojans in CNNs. This report showcases the methods and results of four featured competition entries. It remains challenging to help humans reliably diagnose trojans via interpretability tools. However, the competition's entries have contributed new techniques and set a new record on the benchmark from Casper et al., 2023.
△ Less
Submitted 3 April, 2024;
originally announced April 2024.
-
Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity
Authors:
Soyeong Jeong,
Jinheon Baek,
Sukmin Cho,
Sung Ju Hwang,
Jong C. Park
Abstract:
Retrieval-Augmented Large Language Models (LLMs), which incorporate the non-parametric knowledge from external knowledge bases into LLMs, have emerged as a promising approach to enhancing response accuracy in several tasks, such as Question-Answering (QA). However, even though there are various approaches dealing with queries of different complexities, they either handle simple queries with unnece…
▽ More
Retrieval-Augmented Large Language Models (LLMs), which incorporate the non-parametric knowledge from external knowledge bases into LLMs, have emerged as a promising approach to enhancing response accuracy in several tasks, such as Question-Answering (QA). However, even though there are various approaches dealing with queries of different complexities, they either handle simple queries with unnecessary computational overhead or fail to adequately address complex multi-step queries; yet, not all user requests fall into only one of the simple or complex categories. In this work, we propose a novel adaptive QA framework, that can dynamically select the most suitable strategy for (retrieval-augmented) LLMs from the simplest to the most sophisticated ones based on the query complexity. Also, this selection process is operationalized with a classifier, which is a smaller LM trained to predict the complexity level of incoming queries with automatically collected labels, obtained from actual predicted outcomes of models and inherent inductive biases in datasets. This approach offers a balanced strategy, seamlessly adapting between the iterative and single-step retrieval-augmented LLMs, as well as the no-retrieval methods, in response to a range of query complexities. We validate our model on a set of open-domain QA datasets, covering multiple query complexities, and show that ours enhances the overall efficiency and accuracy of QA systems, compared to relevant baselines including the adaptive retrieval approaches. Code is available at: https://github.com/starsuzi/Adaptive-RAG.
△ Less
Submitted 28 March, 2024; v1 submitted 21 March, 2024;
originally announced March 2024.
-
Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks
Authors:
Minju Seo,
Jinheon Baek,
James Thorne,
Sung Ju Hwang
Abstract:
Despite large successes of recent language models on diverse tasks, they suffer from severe performance degeneration in low-resource settings with limited training data available. Many existing works tackle this problem by generating synthetic data from the training data and then training models on them, recently using Large Language Models (LLMs). However, in low-resource settings, the amount of…
▽ More
Despite large successes of recent language models on diverse tasks, they suffer from severe performance degeneration in low-resource settings with limited training data available. Many existing works tackle this problem by generating synthetic data from the training data and then training models on them, recently using Large Language Models (LLMs). However, in low-resource settings, the amount of seed data samples to use for data augmentation is very small, which makes generated samples suboptimal and less diverse. To tackle this challenge, we propose a novel method that augments training data by incorporating a wealth of examples from other datasets, along with the given training data. Specifically, we first retrieve the relevant instances from other datasets, such as their input-output pairs or contexts, based on their similarities with the given seed data, and then prompt LLMs to generate new samples with the contextual information within and across the original and retrieved samples. This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone. We validate our proposed Retrieval-Augmented Data Augmentation (RADA) framework on multiple datasets under low-resource settings of training and test-time data augmentation scenarios, on which it outperforms existing LLM-powered data augmentation baselines.
△ Less
Submitted 20 February, 2024;
originally announced February 2024.
-
Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution
Authors:
Xin Yuan,
Jinoo Baek,
Keyang Xu,
Omer Tov,
Hongliang Fei
Abstract:
We propose an efficient diffusion-based text-to-video super-resolution (SR) tuning approach that leverages the readily learned capacity of pixel level image diffusion model to capture spatial information for video generation. To accomplish this goal, we design an efficient architecture by inflating the weightings of the text-to-image SR model into our video generation framework. Additionally, we i…
▽ More
We propose an efficient diffusion-based text-to-video super-resolution (SR) tuning approach that leverages the readily learned capacity of pixel level image diffusion model to capture spatial information for video generation. To accomplish this goal, we design an efficient architecture by inflating the weightings of the text-to-image SR model into our video generation framework. Additionally, we incorporate a temporal adapter to ensure temporal coherence across video frames. We investigate different tuning approaches based on our inflated architecture and report trade-offs between computational costs and super-resolution quality. Empirical evaluation, both quantitative and qualitative, on the Shutterstock video dataset, demonstrates that our approach is able to perform text-to-video SR generation with good visual quality and temporal consistency. To evaluate temporal coherence, we also present visualizations in video format in https://drive.google.com/drive/folders/1YVc-KMSJqOrEUdQWVaI-Yfu8Vsfu_1aO?usp=sharing .
△ Less
Submitted 18 January, 2024;
originally announced January 2024.
-
N-Adaptive Ritz Method: A Neural Network Enriched Partition of Unity for Boundary Value Problems
Authors:
Jonghyuk Baek,
Yanran Wang,
J. S. Chen
Abstract:
Conventional finite element methods are known to be tedious in adaptive refinements due to their conformal regularity requirements. Further, the enrichment functions for adaptive refinements are often not readily available in general applications. This work introduces a novel neural network-enriched Partition of Unity (NN-PU) approach for solving boundary value problems via artificial neural netwo…
▽ More
Conventional finite element methods are known to be tedious in adaptive refinements due to their conformal regularity requirements. Further, the enrichment functions for adaptive refinements are often not readily available in general applications. This work introduces a novel neural network-enriched Partition of Unity (NN-PU) approach for solving boundary value problems via artificial neural networks with a potential energy-based loss function minimization. The flexibility and adaptivity of the NN function space are utilized to capture complex solution patterns that the conventional Galerkin methods fail to capture. The NN enrichment is constructed by combining pre-trained feature-encoded NN blocks with an additional untrained NN block. The pre-trained NN blocks learn specific local features during the offline stage, enabling efficient enrichment of the approximation space during the online stage through the Ritz-type energy minimization. The NN enrichment is introduced under the Partition of Unity (PU) framework, ensuring convergence of the proposed method. The proposed NN-PU approximation and feature-encoded transfer learning forms an adaptive approximation framework, termed the neural-refinement (n-refinement), for solving boundary value problems. Demonstrated by solving various elasticity problems, the proposed method offers accurate solutions while notably reducing the computational cost compared to the conventional adaptive refinement in the mesh-based methods.
△ Less
Submitted 16 January, 2024;
originally announced January 2024.
-
Context Enhanced Transformer for Single Image Object Detection
Authors:
Seungjun An,
Seonghoon Park,
Gyeongnyeon Kim,
Jeongyeol Baek,
Byeongwon Lee,
Seungryong Kim
Abstract:
With the increasing importance of video data in real-world applications, there is a rising need for efficient object detection methods that utilize temporal information. While existing video object detection (VOD) techniques employ various strategies to address this challenge, they typically depend on locally adjacent frames or randomly sampled images within a clip. Although recent Transformer-bas…
▽ More
With the increasing importance of video data in real-world applications, there is a rising need for efficient object detection methods that utilize temporal information. While existing video object detection (VOD) techniques employ various strategies to address this challenge, they typically depend on locally adjacent frames or randomly sampled images within a clip. Although recent Transformer-based VOD methods have shown promising results, their reliance on multiple inputs and additional network complexity to incorporate temporal information limits their practical applicability. In this paper, we propose a novel approach to single image object detection, called Context Enhanced TRansformer (CETR), by incorporating temporal context into DETR using a newly designed memory module. To efficiently store temporal information, we construct a class-wise memory that collects contextual information across data. Additionally, we present a classification-based sampling technique to selectively utilize the relevant memory for the current image. In the testing, We introduce a test-time memory adaptation method that updates individual memory functions by considering the test distribution. Experiments with CityCam and ImageNet VID datasets exhibit the efficiency of the framework on various video systems. The project page and code will be made available at: https://ku-cvlab.github.io/CETR.
△ Less
Submitted 26 December, 2023; v1 submitted 22 December, 2023;
originally announced December 2023.
-
Cross-Lingual Learning in Multilingual Scene Text Recognition
Authors:
Jeonghun Baek,
Yusuke Matsui,
Kiyoharu Aizawa
Abstract:
In this paper, we investigate cross-lingual learning (CLL) for multilingual scene text recognition (STR). CLL transfers knowledge from one language to another. We aim to find the condition that exploits knowledge from high-resource languages for improving performance in low-resource languages. To do so, we first examine if two general insights about CLL discussed in previous works are applied to m…
▽ More
In this paper, we investigate cross-lingual learning (CLL) for multilingual scene text recognition (STR). CLL transfers knowledge from one language to another. We aim to find the condition that exploits knowledge from high-resource languages for improving performance in low-resource languages. To do so, we first examine if two general insights about CLL discussed in previous works are applied to multilingual STR: (1) Joint learning with high- and low-resource languages may reduce performance on low-resource languages, and (2) CLL works best between typologically similar languages. Through extensive experiments, we show that two general insights may not be applied to multilingual STR. After that, we show that the crucial condition for CLL is the dataset size of high-resource languages regardless of the kind of high-resource languages. Our code, data, and models are available at https://github.com/ku21fan/CLL-STR.
△ Less
Submitted 17 December, 2023;
originally announced December 2023.
-
3D Teeth Reconstruction from Panoramic Radiographs using Neural Implicit Functions
Authors:
Sihwa Park,
Seongjun Kim,
In-Seok Song,
Seung Jun Baek
Abstract:
Panoramic radiography is a widely used imaging modality in dental practice and research. However, it only provides flattened 2D images, which limits the detailed assessment of dental structures. In this paper, we propose Occudent, a framework for 3D teeth reconstruction from panoramic radiographs using neural implicit functions, which, to the best of our knowledge, is the first work to do so. For…
▽ More
Panoramic radiography is a widely used imaging modality in dental practice and research. However, it only provides flattened 2D images, which limits the detailed assessment of dental structures. In this paper, we propose Occudent, a framework for 3D teeth reconstruction from panoramic radiographs using neural implicit functions, which, to the best of our knowledge, is the first work to do so. For a given point in 3D space, the implicit function estimates whether the point is occupied by a tooth, and thus implicitly determines the boundaries of 3D tooth shapes. Firstly, Occudent applies multi-label segmentation to the input panoramic radiograph. Next, tooth shape embeddings as well as tooth class embeddings are generated from the segmentation outputs, which are fed to the reconstruction network. A novel module called Conditional eXcitation (CX) is proposed in order to effectively incorporate the combined shape and class embeddings into the implicit function. The performance of Occudent is evaluated using both quantitative and qualitative measures. Importantly, Occudent is trained and validated with actual panoramic radiographs as input, distinct from recent works which used synthesized images. Experiments demonstrate the superiority of Occudent over state-of-the-art methods.
△ Less
Submitted 28 November, 2023;
originally announced November 2023.
-
PEMA: An Offsite-Tunable Plug-in External Memory Adaptation for Language Models
Authors:
HyunJin Kim,
Young Jin Kim,
JinYeong Bak
Abstract:
Pre-trained language models (PLMs) show impressive performance in various downstream NLP tasks. However, pre-training large language models demands substantial memory and training compute. Furthermore, due to the substantial resources required, many PLM weights are confidential. Consequently, users are compelled to share their data with model owners for fine-tuning specific tasks. To overcome the…
▽ More
Pre-trained language models (PLMs) show impressive performance in various downstream NLP tasks. However, pre-training large language models demands substantial memory and training compute. Furthermore, due to the substantial resources required, many PLM weights are confidential. Consequently, users are compelled to share their data with model owners for fine-tuning specific tasks. To overcome the limitations, we introduce Plug-in External Memory Adaptation (PEMA), a Parameter-Efficient Fine-Tuning (PEFT) method, enabling PLM fine-tuning without requiring access to all the weights. PEMA integrates with context representations from test data during inference to perform downstream tasks. It uses external memory to store PLM-generated context representations mapped with target tokens. Our method utilizes weight matrices of LoRA-like bottlenecked adapter in the PLM's final layer to enhance efficiency. Our approach also includes Gradual Unrolling, a novel interpolation strategy to improve generation quality. We validate PEMA's effectiveness through experiments on syntactic and real datasets for machine translation and style transfer. Our findings show that PEMA outperforms other PEFT approaches in memory and latency efficiency for training, and also excels in maintaining sentence meaning and generating appropriate language and styles.
△ Less
Submitted 29 March, 2024; v1 submitted 14 November, 2023;
originally announced November 2023.
-
Knowledge-Augmented Large Language Models for Personalized Contextual Query Suggestion
Authors:
Jinheon Baek,
Nirupama Chandrasekaran,
Silviu Cucerzan,
Allen herring,
Sujay Kumar Jauhar
Abstract:
Large Language Models (LLMs) excel at tackling various natural language tasks. However, due to the significant costs involved in re-training or fine-tuning them, they remain largely static and difficult to personalize. Nevertheless, a variety of applications could benefit from generations that are tailored to users' preferences, goals, and knowledge. Among them is web search, where knowing what a…
▽ More
Large Language Models (LLMs) excel at tackling various natural language tasks. However, due to the significant costs involved in re-training or fine-tuning them, they remain largely static and difficult to personalize. Nevertheless, a variety of applications could benefit from generations that are tailored to users' preferences, goals, and knowledge. Among them is web search, where knowing what a user is trying to accomplish, what they care about, and what they know can lead to improved search experiences. In this work, we propose a novel and general approach that augments an LLM with relevant context from users' interaction histories with a search engine in order to personalize its outputs. Specifically, we construct an entity-centric knowledge store for each user based on their search and browsing activities on the web, which is then leveraged to provide contextually relevant LLM prompt augmentations. This knowledge store is light-weight, since it only produces user-specific aggregate projections of interests and knowledge onto public knowledge graphs, and leverages existing search log infrastructure, thereby mitigating the privacy, compliance, and scalability concerns associated with building deep user profiles for personalization. We validate our approach on the task of contextual query suggestion, which requires understanding not only the user's current search context but also what they historically know and care about. Through a number of experiments based on human evaluation, we show that our approach is significantly better than several other LLM-powered baselines, generating query suggestions that are contextually more relevant, personalized, and useful.
△ Less
Submitted 19 February, 2024; v1 submitted 9 November, 2023;
originally announced November 2023.