-
Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models
Authors:
Hyunsik Chae,
Seungwoo Yoon,
Jaden Park,
Chloe Yewon Chun,
Yongin Cho,
Mu Cai,
Yong Jae Lee,
Ernest K. Ryu
Abstract:
Recent Vision-Language Models (VLMs) have demonstrated impressive multimodal comprehension and reasoning capabilities, yet they often struggle with trivially simple visual tasks. In this work, we focus on the domain of basic 2D Euclidean geometry and systematically categorize the fundamental, indivisible visual perception skills, which we refer to as atomic visual skills. We then introduce the Atomic Visual Skills Dataset (AVSD) for evaluating VLMs on the atomic visual skills. Using AVSD, we benchmark state-of-the-art VLMs and find that they struggle with these tasks, despite being trivial for adult humans. Our findings highlight the need for purpose-built datasets to train and evaluate VLMs on atomic, rather than composite, visual perception tasks.
Submitted 26 May, 2025;
originally announced May 2025.
-
Detecting Offensive Memes with Social Biases in Singapore Context Using Multimodal Large Language Models
Authors:
Cao Yuxuan,
Wu Jiayang,
Alistair Cheong Liang Chuen,
Bryan Shan Guanrong,
Theodore Lee Chong Jen,
Sherman Chann Zhi Shen
Abstract:
Traditional online content moderation systems struggle to classify modern multimodal means of communication, such as memes, a highly nuanced and information-dense medium. This task is especially hard in a culturally diverse society like Singapore, where low-resource languages are used and extensive knowledge of local context is needed to interpret online content. We curate a large collection of 112K memes labeled by GPT-4V for fine-tuning a VLM to classify offensive memes in the Singapore context. We show the effectiveness of fine-tuned VLMs on our dataset, and propose a pipeline containing OCR, translation and a 7-billion parameter-class VLM. Our solutions reach 80.62% accuracy and 0.8192 AUROC on a held-out test set, and can greatly aid humans in moderating online content. The dataset, code, and model weights have been open-sourced at https://github.com/aliencaocao/vlm-for-memes-aisg.
Submitted 8 March, 2025; v1 submitted 25 February, 2025;
originally announced February 2025.
-
Estimating the Spectral Moments of the Kernel Integral Operator from Finite Sample Matrices
Authors:
Chanwoo Chun,
SueYeon Chung,
Daniel D. Lee
Abstract:
Analyzing the structure of sampled features from an input data distribution is challenging when constrained by limited measurements in both the number of inputs and features. Traditional approaches often rely on the eigenvalue spectrum of the sample covariance matrix derived from finite measurement matrices; however, these spectra are sensitive to the size of the measurement matrix, leading to biased insights. In this paper, we introduce a novel algorithm that provides unbiased estimates of the spectral moments of the kernel integral operator in the limit of infinite inputs and features from finitely sampled measurement matrices. Our method, based on dynamic programming, is efficient and capable of estimating the moments of the operator spectrum. We demonstrate the accuracy of our estimator on radial basis function (RBF) kernels, highlighting its consistency with the theoretical spectra. Furthermore, we showcase the practical utility and robustness of our method in understanding the geometry of learned representations in neural networks.
Submitted 8 February, 2025; v1 submitted 23 October, 2024;
originally announced October 2024.
-
Tailoring Generative Adversarial Networks for Smooth Airfoil Design
Authors:
Joyjit Chattoraj,
Jian Cheng Wong,
Zhang Zexuan,
Manna Dai,
Xia Yingzhi,
Li Jichao,
Xu Xinxing,
Ooi Chin Chun,
Yang Feng,
Dao My Ha,
Liu Yong
Abstract:
In the realm of aerospace design, achieving smooth curves is paramount, particularly when crafting objects such as airfoils. Generative Adversarial Network (GAN), a widely employed generative AI technique, has proven instrumental in synthesizing airfoil designs. However, a common limitation of GAN is the inherent lack of smoothness in the generated airfoil surfaces. To address this issue, we present a GAN model featuring a customized loss function built to produce seamlessly contoured airfoil designs. Additionally, our model demonstrates a substantial increase in design diversity compared to a conventional GAN augmented with a post-processing smoothing filter.
Submitted 17 April, 2024;
originally announced April 2024.
-
Towards Diverse and Effective Question-Answer Pair Generation from Children Storybooks
Authors:
Sugyeong Eo,
Hyeonseok Moon,
Jinsung Kim,
Yuna Hur,
Jeongwook Kim,
Songeun Lee,
Changwoo Chun,
Sungsoo Park,
Heuiseok Lim
Abstract:
Recent advances in QA pair generation (QAG) have raised interest in applying this technique to the educational field. However, the diversity of QA types remains a challenge despite its contributions to comprehensive learning and assessment of children. In this paper, we propose a QAG framework that enhances QA type diversity by producing different interrogative sentences and implicit/explicit answers. Our framework comprises a QFS-based answer generator, an iterative QA generator, and a relevancy-aware ranker. The two generators aim to expand the number of candidates while covering various types. The ranker trained on the in-context negative samples clarifies the top-N outputs based on the ranking score. Extensive evaluations and detailed analyses demonstrate that our approach outperforms previous state-of-the-art results by significant margins, achieving improved diversity and quality. Our task-oriented processes are consistent with real-world demand, which highlights our system's high applicability.
Submitted 11 June, 2023;
originally announced June 2023.
-
Sparsity-depth Tradeoff in Infinitely Wide Deep Neural Networks
Authors:
Chanwoo Chun,
Daniel D. Lee
Abstract:
We investigate how sparse neural activity affects the generalization performance of a deep Bayesian neural network at the large width limit. To this end, we derive a neural network Gaussian Process (NNGP) kernel with rectified linear unit (ReLU) activation and a predetermined fraction of active neurons. Using the NNGP kernel, we observe that the sparser networks outperform the non-sparse networks at shallow depths on a variety of datasets. We validate this observation by extending the existing theory on the generalization error of kernel-ridge regression.
Submitted 17 May, 2023;
originally announced May 2023.
-
Meat Freshness Prediction
Authors:
Bhargav Sagiraju,
Nathan Casanova,
Lam Ivan Chuen Chun,
Manan Lohia,
Toshinori Yoshiyasu
Abstract:
In most retail stores, the number of days since initial processing is used as a proxy for estimating the freshness of perishable foods, or freshness is assessed manually by an employee. While the former method can lead to wastage, as some fresh foods might get disposed of after a fixed number of days, the latter can be time-consuming, expensive and impractical at scale. This project aims to propose a Machine Learning (ML) based approach that evaluates freshness of food based on live data. For the current scope, it only considers meat as the subject of analysis and attempts to classify pieces of meat as fresh, half-fresh or spoiled. The final model achieved an accuracy of above 90% and relatively high performance in terms of the cost of misclassification. It is expected that the technology will contribute to the optimization of the client's business operation, reducing the risk of selling defective or rotten products that can entail serious monetary, non-monetary and health-based consequences while also achieving higher corporate value as a sustainable company by reducing food wastage through timely sales and disposal.
Submitted 1 May, 2023;
originally announced May 2023.
-
Transport Capacity Optimization for Resource Allocation in Tera-IoT Networks
Authors:
Cheol Jeong,
Chang-Jae Chun,
Won-Yong Shin,
Il-Min Kim
Abstract:
We present a new adaptive resource optimization strategy that jointly allocates the subwindow and transmit power in multi-device terahertz (THz) band Internet of Things (Tera-IoT) networks. Unlike the prior studies focusing mostly on maximizing the sum distance, we incorporate both rate and transmission distance into the objective function of our problem formulation with key features of THz bands, including the spreading and molecular absorption losses. More specifically, as a performance metric of Tera-IoT networks, we adopt the transport capacity (TC), which is defined as the sum of the rate-distance products over all users. This metric has been widely adopted in large-scale ad hoc networks, and would also be appropriate for evaluating the performance of various Tera-IoT applications. We then formulate an optimization problem that aims at maximizing the TC. Moreover, motivated by the importance of the transmission distance that is very limited due to the high path loss in THz bands, our optimization problem is extended to the case of allocating the subwindow, transmit power, and transmission distance. We show how to solve our problems via an effective two-stage resource allocation strategy. We demonstrate the superiority of our adaptive solution over benchmark methods via intensive numerical evaluations for various environmental setups of large-scale Tera-IoT networks.
Submitted 29 January, 2022;
originally announced January 2022.
-
2.5D Image based Robotic Grasping
Authors:
Song Yaoxian,
Cheng Chun,
Fei Yuejiao,
Li Xiangqing,
Yu Changbin
Abstract:
We consider the problem of robotic grasping using depth + RGB information sampled from a real sensor. We design an encoder-decoder neural network to predict grasp policy in real time. This method can fuse the advantages of depth and RGB images at the same time and is robust to grasp and observation height. We evaluate our method in a physical robotic system and propose an open-loop algorithm to realize robotic grasp operation. We analyze the experimental results from multiple perspectives, and the results show that our method is competitive with the state of the art in grasp performance, real-time operation, and model size. The video is available at https://youtu.be/Wxw_r5a8qV0
Submitted 31 May, 2019;
originally announced May 2019.
-
Deep Learning Based Joint Pilot Design and Channel Estimation for Multiuser MIMO Channels
Authors:
Chang-Jae Chun,
Jae-Mo Kang,
Il-Min Kim
Abstract:
In this paper, we propose a joint pilot design and channel estimation scheme based on the deep learning (DL) technique for multiuser multiple-input multiple output (MIMO) channels. To this end, we construct a pilot designer using two-layer neural networks (TNNs) and a channel estimator using deep neural networks (DNNs), which are jointly trained to minimize the mean square error (MSE) of channel estimation. To effectively reduce the interference among the multiple users, we also use the successive interference cancellation (SIC) technique in the channel estimation process. The numerical results demonstrate that the proposed scheme considerably outperforms the state-of-the-art linear minimum mean square error (LMMSE) based channel estimation scheme.
Submitted 10 December, 2018;
originally announced December 2018.
-
Channel Tracking for Wireless Energy Transfer: A Deep Recurrent Neural Network Approach
Authors:
Jae-Mo Kang,
Chang-Jae Chun,
Il-Min Kim,
Dong In Kim
Abstract:
In this paper, we study channel tracking for the wireless energy transfer (WET) system, which is practically a very important, but challenging problem. Regarding the time-varying channels as a sequence to be predicted, we exploit the recurrent neural network (RNN) technique for channel tracking. Particularly, combining the deep long short-term memory (LSTM) RNN with the deep feedforward neural network, we develop a novel channel tracking scheme for the WET system, which estimates the channel state information (CSI) at the energy transmitter based on the previous CSI estimates, and the current and previous harvested energy feedback information from the energy receiver. Numerical results demonstrate the superior performance and effectiveness of the proposed scheme.
Submitted 7 December, 2018;
originally announced December 2018.
-
Dynamic Power Splitting for SWIPT with Nonlinear Energy Harvesting in Ergodic Fading Channel
Authors:
Jae-Mo Kang,
Chang-Jae Chun,
Il-Min Kim,
Dong In Kim
Abstract:
We study the dynamic power splitting for simultaneous wireless information and power transfer (SWIPT) in the ergodic fading channel. Considering the nonlinearity of practical energy harvesting circuits, we adopt the realistic nonlinear energy harvesting (EH) model rather than the idealistic linear EH model. To characterize the optimal rate-energy (R-E) tradeoff, we consider the problem of maximizing the R-E region, which is nonconvex. We solve this challenging problem for two different cases of the channel state information (CSI): (i) when the CSI is known only at the receiver (the CSIR case) and (ii) when the CSI is known at both the transmitter and the receiver (the CSI case). For these two cases, we develop the corresponding optimal dynamic power splitting schemes. To address the complexity issue, we also propose suboptimal schemes with low complexity. Comparing the proposed schemes to the existing schemes, we provide various useful and interesting insights into the dynamic power splitting for the nonlinear EH. Furthermore, we extend the analysis to the scenarios of the partial CSI at the transmitter and the harvested energy maximization. Numerical results demonstrate that the proposed schemes significantly outperform the existing schemes and that the proposed suboptimal scheme performs very close to the optimal scheme at a much lower complexity.
Submitted 27 August, 2018; v1 submitted 19 April, 2018;
originally announced April 2018.