Search | arXiv e-print repository

Subspace-based Approximate Hessian Method for Zeroth-Order Optimization

Authors: Dongyoon Kim, Sungjae Lee, Wonjin Lee, Kwang In Kim

Abstract: Zeroth-order optimization addresses problems where gradient information is inaccessible or impractical to compute. While most existing methods rely on first-order approximations, incorporating second-order (curvature) information can, in principle, significantly accelerate convergence. However, the high cost of function evaluations required to estimate Hessian matrices often limits practical applicability. We present the subspace-based approximate Hessian (ZO-SAH) method, a zeroth-order optimization algorithm that mitigates these costs by focusing on randomly selected two-dimensional subspaces. Within each subspace, ZO-SAH estimates the Hessian by fitting a quadratic polynomial to the objective function and extracting its second-order coefficients. To further reduce function-query costs, ZO-SAH employs a periodic subspace-switching strategy that reuses function evaluations across optimization steps. Experiments on eight benchmark datasets, including logistic regression and deep neural network training tasks, demonstrate that ZO-SAH achieves significantly faster convergence than existing zeroth-order methods.

Submitted 8 July, 2025; originally announced July 2025.

Comments: 20 pages, 8 figures
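
The quadratic-fit step described in the ZO-SAH abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the six-point sampling layout, and the step size h are hypothetical, and the directions u and v that span the subspace are passed in rather than drawn at random. The sketch fits c0 + c1 s + c2 t + c3 s^2 + c4 s t + c5 t^2 to six function values and reads the 2x2 Hessian off the second-order coefficients.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] without mutating the inputs.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def subspace_hessian(f, x, u, v, h=0.1):
    """Estimate the 2x2 Hessian of f restricted to span{u, v} around x.

    Assumption: six sample points in subspace coordinates (s, t), which is
    exactly enough to determine the six coefficients of a 2D quadratic.
    """
    pts = [(0.0, 0.0), (h, 0.0), (-h, 0.0), (0.0, h), (0.0, -h), (h, h)]
    rows, vals = [], []
    for s, t in pts:
        point = [xi + s * ui + t * vi for xi, ui, vi in zip(x, u, v)]
        rows.append([1.0, s, t, s * s, s * t, t * t])
        vals.append(f(point))  # one function query per sample point
    c = solve_linear(rows, vals)
    # Second derivatives of the fitted quadratic: 2*c3, c4, 2*c5.
    return [[2.0 * c[3], c[4]], [c[4], 2.0 * c[5]]]


# Usage on a toy objective whose restricted Hessian is known in closed form.
def toy(p):
    return 3 * p[0] ** 2 + 2 * p[0] * p[1] + p[1] ** 2 + p[2]


H = subspace_hessian(toy, [1.0, 2.0, 3.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

For an objective that is exactly quadratic in the subspace, the fit recovers the restricted Hessian up to floating-point error; for general objectives the result is only a local approximation whose quality depends on h.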

arXiv:2505.17475 [pdf, ps, other]

PoseBH: Prototypical Multi-Dataset Training Beyond Human Pose Estimation

Authors: Uyoung Jeong, Jonathan Freer, Seungryul Baek, Hyung Jin Chang, Kwang In Kim

Abstract: We study multi-dataset training (MDT) for pose estimation, where skeletal heterogeneity presents a unique challenge that existing methods have yet to address. In traditional domains, e.g., regression and classification, MDT typically relies on dataset merging or multi-head supervision. However, the diversity of skeleton types and limited cross-dataset supervision complicate integration in pose estimation. To address these challenges, we introduce PoseBH, a new MDT framework that tackles keypoint heterogeneity and limited supervision through two key techniques. First, we propose nonparametric keypoint prototypes that learn within a unified embedding space, enabling seamless integration across skeleton types. Second, we develop a cross-type self-supervision mechanism that aligns keypoint predictions with keypoint embedding prototypes, providing supervision without relying on teacher-student models or additional augmentations. PoseBH substantially improves generalization across whole-body and animal pose datasets, including COCO-WholeBody, AP-10K, and APT-36K, while preserving performance on standard human pose benchmarks (COCO, MPII, and AIC). Furthermore, our learned keypoint embeddings transfer effectively to hand shape estimation (InterHand2.6M) and human body shape estimation (3DPW). The code for PoseBH is available at: https://github.com/uyoung-jeong/PoseBH.

Submitted 23 May, 2025; originally announced May 2025.

Comments: accepted to CVPR 2025

arXiv:2503.15035 [pdf, other]

GraspCorrect: Robotic Grasp Correction via Vision-Language Model-Guided Feedback

Authors: Sungjae Lee, Yeonjoo Hong, Kwang In Kim

Abstract: Despite significant advancements in robotic manipulation, achieving consistent and stable grasping remains a fundamental challenge, often limiting the successful execution of complex tasks. Our analysis reveals that even state-of-the-art policy models frequently exhibit unstable grasping behaviors, leading to failure cases that create bottlenecks in real-world robotic applications. To address these challenges, we introduce GraspCorrect, a plug-and-play module designed to enhance grasp performance through vision-language model-guided feedback. GraspCorrect employs an iterative visual question-answering framework with two key components: grasp-guided prompting, which incorporates task-specific constraints, and object-aware sampling, which ensures the selection of physically feasible grasp candidates. By iteratively generating intermediate visual goals and translating them into joint-level actions, GraspCorrect significantly improves grasp stability and consistently enhances task success rates across existing policy models in the RLBench and CALVIN datasets.

Submitted 19 March, 2025; originally announced March 2025.

arXiv:2412.07629 [pdf, other]

Piece of Table: A Divide-and-Conquer Approach for Selecting Subtables in Table Question Answering

Authors: Wonjin Lee, Kyumin Kim, Sungjae Lee, Jihun Lee, Kwang In Kim

Abstract: Applying language models (LMs) to tables is challenging due to the inherent structural differences between two-dimensional tables and one-dimensional text for which the LMs were originally designed. Furthermore, when applying linearized tables to LMs, the maximum token lengths often imposed in self-attention calculations make it difficult to comprehensively understand the context spread across large tables. To address these challenges, we present PieTa (Piece of Table), a new framework for subtable-based question answering (QA). PieTa operates through an iterative process of dividing tables into smaller windows, using LMs to select relevant cells within each window, and merging these cells into a subtable. This multi-resolution approach captures dependencies across multiple rows and columns while avoiding the limitations caused by long context inputs. Instantiated as a simple iterative subtable union algorithm, PieTa demonstrates improved performance over previous subtable-based QA approaches.

Submitted 19 February, 2025; v1 submitted 10 December, 2024; originally announced December 2024.
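
The divide, select, and merge loop described in the PieTa abstract above might look roughly like the sketch below. It is an illustration only, not the published algorithm: the window size, the function names, and the keyword-based selector are hypothetical stand-ins (the abstract says a language model performs the selection), and only a single pass over the windows is shown rather than the full iterative procedure.

```python
def windows(n_rows, n_cols, height, width):
    """Yield (row_range, col_range) tiles that cover an n_rows x n_cols table."""
    for r in range(0, n_rows, height):
        for c in range(0, n_cols, width):
            yield (range(r, min(r + height, n_rows)),
                   range(c, min(c + width, n_cols)))


def select_subtable(table, select_cells, height=2, width=2):
    """Union the cells chosen in each window, then keep their rows and columns.

    select_cells is a hypothetical callable standing in for the language model:
    it receives a list of (row, col, value) triples for one window and returns
    the set of (row, col) coordinates it considers relevant.
    """
    picked = set()
    for rows, cols in windows(len(table), len(table[0]), height, width):
        window = [(r, c, table[r][c]) for r in rows for c in cols]
        picked |= set(select_cells(window))
    keep_rows = sorted({r for r, _ in picked})
    keep_cols = sorted({c for _, c in picked})
    return [[table[r][c] for c in keep_cols] for r in keep_rows]


# Usage with a toy keyword matcher in place of a language model.
def keyword_selector(window, keywords=("pop", "Busan")):
    return {(r, c) for r, c, value in window if value in keywords}


table = [
    ["city", "pop", "area"],
    ["Seoul", "9.4", "605"],
    ["Busan", "3.3", "770"],
    ["Daegu", "2.4", "883"],
]
subtable = select_subtable(table, keyword_selector)
```

Because each window is small, the selector never sees the whole table at once, which is the property the abstract relies on to avoid long context inputs.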

arXiv:2406.04772 [pdf, other]

REP: Resource-Efficient Prompting for Rehearsal-Free Continual Learning

Authors: Sungho Jeon, Xinyue Ma, Kwang In Kim, Myeongjae Jeon

Abstract: Recent rehearsal-free methods, guided by prompts, excel in vision-related continual learning (CL) with drifting data but lack resource efficiency, making real-world deployment challenging. In this paper, we introduce Resource-Efficient Prompting (REP), which improves the computational and memory efficiency of prompt-based rehearsal-free methods while minimizing accuracy trade-offs. Our approach employs swift prompt selection to refine input data using a carefully provisioned model and introduces adaptive token merging (AToM) and layer dropping (ALD) for efficient prompt updates. AToM and ALD selectively skip data and model layers while preserving task-specific features during new-task learning. Extensive experiments on multiple image classification datasets demonstrate REP's superior resource efficiency over state-of-the-art ViT- and CNN-based methods.

Submitted 16 February, 2025; v1 submitted 7 June, 2024; originally announced June 2024.

arXiv:2311.17094 [pdf, other]

In Search of a Data Transformation That Accelerates Neural Field Training

Authors: Junwon Seo, Sangyoon Lee, Kwang In Kim, Jaeho Lee

Abstract: Neural field is an emerging paradigm in data representation that trains a neural network to approximate the given signal. A key obstacle that prevents its widespread adoption is the encoding speed: generating neural fields requires an overfitting of a neural network, which can take a significant number of SGD steps to reach the desired fidelity level. In this paper, we delve into the impacts of data transformations on the speed of neural field training, specifically focusing on how permuting pixel locations affects the convergence speed of SGD. Counterintuitively, we find that randomly permuting the pixel locations can considerably accelerate the training. To explain this phenomenon, we examine the neural field training through the lens of PSNR curves, loss landscapes, and error patterns. Our analyses suggest that the random pixel permutations remove the easy-to-fit patterns, which facilitate easy optimization in the early stage but hinder capturing fine details of the signal.

Submitted 26 March, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

Comments: CVPR 2024
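
The data transformation studied in the abstract above, a random permutation of pixel locations, can be illustrated with the short sketch below. It covers only the permutation and its inverse, under the assumption that the image is flattened to a list of pixel values and that the permutation is stored so the original layout can be restored after fitting; the function names and the seed parameter are hypothetical, and no neural field training is shown.

```python
import random


def permute_pixels(pixels, seed=0):
    """Return the pixels in a random order together with the permutation used.

    perm[j] is the original index of the pixel placed at position j.
    """
    rng = random.Random(seed)  # fixed seed so the permutation is reproducible
    perm = list(range(len(pixels)))
    rng.shuffle(perm)
    return [pixels[i] for i in perm], perm


def restore_pixels(permuted, perm):
    """Undo permute_pixels by sending each value back to its original index."""
    restored = [None] * len(perm)
    for position, original_index in enumerate(perm):
        restored[original_index] = permuted[position]
    return restored


# Usage: a flattened 2x3 grayscale image.
image = [10, 20, 30, 40, 50, 60]
shuffled, perm = permute_pixels(image, seed=42)
recovered = restore_pixels(shuffled, perm)
```

A network would be fitted to the shuffled signal, and the stored permutation applied in reverse to the network's output to recover the image in its original layout.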

arXiv:2309.14072 [pdf, other]

BoIR: Box-Supervised Instance Representation for Multi-Person Pose Estimation

Authors: Uyoung Jeong, Seungryul Baek, Hyung Jin Chang, Kwang In Kim

Abstract: Single-stage multi-person human pose estimation (MPPE) methods have shown great performance improvements, but existing methods fail to disentangle features by individual instances under crowded scenes. In this paper, we propose a bounding box-level instance representation learning called BoIR, which simultaneously solves instance detection, instance disentanglement, and instance-keypoint association problems. Our new instance embedding loss provides a learning signal on the entire area of the image with bounding box annotations, achieving globally consistent and disentangled instance representation. Our method exploits multi-task learning of bottom-up keypoint estimation, bounding box regression, and contrastive instance embedding learning, without additional computational cost during inference. BoIR is effective for crowded scenes, outperforming state-of-the-art on COCO val (0.8 AP), COCO test-dev (0.5 AP), CrowdPose (4.9 AP), and OCHuman (3.5 AP). Code will be available at https://github.com/uyoung-jeong/BoIR

Submitted 2 November, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

Comments: Accepted to BMVC 2023, 19 pages including the appendix, 6 figures, 7 tables

Journal ref: BMVC. 34 (2023) 763-764

arXiv:2301.02761 [pdf, other]

Active Learning Guided by Efficient Surrogate Learners

Authors: Yunpyo An, Suyeong Park, Kwang In Kim

Abstract: Re-training a deep learning model each time a single data point receives a new label is impractical due to the inherent complexity of the training process. Consequently, existing active learning (AL) algorithms tend to adopt a batch-based approach where, during each AL iteration, a set of data points is collectively chosen for annotation. However, this strategy frequently leads to redundant sampling, ultimately eroding the efficacy of the labeling procedure. In this paper, we introduce a new AL algorithm that harnesses the power of a Gaussian process surrogate in conjunction with the neural network principal learner. Our proposed model adeptly updates the surrogate learner for every new data instance, enabling it to emulate and capitalize on the continuous learning dynamics of the neural network without necessitating a complete retraining of the principal model for each individual label. Experiments on four benchmark datasets demonstrate that this approach yields significant enhancements, either rivaling or aligning with the performance of state-of-the-art techniques.

Submitted 17 December, 2023; v1 submitted 6 January, 2023; originally announced January 2023.

arXiv:2208.00874 [pdf, other]

S$^2$Contact: Graph-based Network for 3D Hand-Object Contact Estimation with Semi-Supervised Learning

Authors: Tze Ho Elden Tse, Zhongqun Zhang, Kwang In Kim, Ales Leonardis, Feng Zheng, Hyung Jin Chang

Abstract: Despite the recent efforts in accurate 3D annotations in hand and object datasets, there still exist gaps in 3D hand and object reconstructions. Existing works leverage contact maps to refine inaccurate hand-object pose estimations and generate grasps given object models. However, they require explicit 3D supervision, which is seldom available, and are therefore limited to constrained settings, e.g., where thermal cameras observe residual heat left on manipulated objects. In this paper, we propose a novel semi-supervised framework that allows us to learn contact from monocular images. Specifically, we leverage visual and geometric consistency constraints in large-scale datasets for generating pseudo-labels in semi-supervised learning and propose an efficient graph-based network to infer contact. Our semi-supervised learning framework achieves a favourable improvement over the existing supervised learning methods trained on data with 'limited' annotations. Notably, our proposed model is able to achieve superior results with less than half the network parameters and memory access cost when compared with the commonly-used PointNet-based approach. We show benefits from using a contact map that rules hand-object interactions to produce more accurate reconstructions. We further demonstrate that training with pseudo-labels can extend contact map estimations to out-of-domain objects and generalise better across multiple datasets.

Submitted 3 August, 2023; v1 submitted 1 August, 2022; originally announced August 2022.

Comments: Accepted to ECCV 2022

arXiv:2204.13062 [pdf, other]

Collaborative Learning for Hand and Object Reconstruction with Attention-guided Graph Convolution

Authors: Tze Ho Elden Tse, Kwang In Kim, Ales Leonardis, Hyung Jin Chang

Abstract: Estimating the pose and shape of hands and objects under interaction finds numerous applications including augmented and virtual reality. Existing approaches for hand and object reconstruction require explicitly defined physical constraints and known objects, which limits their application domains. Our algorithm is agnostic to object models, and it learns the physical rules governing hand-object interaction. This requires automatically inferring the shapes and physical interaction of hands and (potentially unknown) objects. We seek to approach this challenging problem by proposing a collaborative learning strategy where two branches of deep networks are learning from each other. Specifically, we transfer hand mesh information to the object branch and vice versa for the hand branch. The resulting optimisation (training) problem can be unstable, and we address this via two strategies: (i) attention-guided graph convolution which helps identify and focus on mutual occlusion and (ii) unsupervised associative loss which facilitates the transfer of information between the branches. Experiments using four widely-used benchmarks show that our framework achieves beyond state-of-the-art accuracy in 3D pose estimation, as well as recovers dense 3D hand and object shapes. Each technical component above contributes meaningfully in the ablation study.

Submitted 27 April, 2022; originally announced April 2022.

Comments: Accepted to CVPR 2022

arXiv:2111.02865 [pdf, other]

Testing using Privileged Information by Adapting Features with Statistical Dependence

Authors: Kwang In Kim, James Tompkin

Abstract: Given an imperfect predictor, we exploit additional features at test time to improve the predictions made, without retraining and without knowledge of the prediction function. This scenario arises if training labels or data are proprietary, restricted, or no longer available, or if training itself is prohibitively expensive. We assume that the additional features are useful if they exhibit strong statistical dependence to the underlying perfect predictor. Then, we empirically estimate and strengthen the statistical dependence between the initial noisy predictor and the additional features via manifold denoising. As an example, we show that this approach leads to improvement in real-world visual attribute ranking. Project webpage: http://www.jamestompkin.com/tupi

Submitted 4 November, 2021; originally announced November 2021.

Comments: Published at ICCV 2021. Webpage: http://www.jamestompkin.com/tupi

arXiv:2107.07330 [pdf, other]

DynaDog+T: A Parametric Animal Model for Synthetic Canine Image Generation

Authors: Jake Deane, Sinead Kearney, Kwang In Kim, Darren Cosker

Abstract: Synthetic data is becoming increasingly common for training computer vision models for a variety of tasks. Notably, such data has been applied in tasks related to humans such as 3D pose estimation where data is either difficult to create or obtain in realistic settings. Comparatively, there has been less work into synthetic animal data and its uses for training models. Consequently, we introduce a parametric canine model, DynaDog+T, for generating synthetic canine images and data which we use for a common computer vision task, binary segmentation, which would otherwise be difficult due to the lack of available data.

Submitted 20 July, 2021; v1 submitted 15 July, 2021; originally announced July 2021.

Comments: CV4Animals Workshop in CVPR 2021. Update to correct minor spelling and grammar mistakes in supplementary material

arXiv:2106.13215 [pdf]

GaussiGAN: Controllable Image Synthesis with 3D Gaussians from Unposed Silhouettes

Authors: Youssef A. Mejjati, Isa Milefchik, Aaron Gokaslan, Oliver Wang, Kwang In Kim, James Tompkin

Abstract: We present an algorithm that learns a coarse 3D representation of objects from unposed multi-view 2D mask supervision, then uses it to generate detailed mask and image texture. In contrast to existing voxel-based methods for unposed object reconstruction, our approach learns to represent the generated shape and pose with a set of self-supervised canonical 3D anisotropic Gaussians via a perspective camera, and a set of per-image transforms. We show that this approach can robustly estimate a 3D space for the camera and object, while recent baselines sometimes struggle to reconstruct coherent 3D spaces in this setting. We show results on synthetic datasets with realistic lighting, and demonstrate object insertion with interactive posing. With our work, we help move towards structured representations that handle more real-world variation in learning-based object reconstruction.

Submitted 24 June, 2021; originally announced June 2021.

arXiv:2008.05413 [pdf, other]

Look here! A parametric learning based approach to redirect visual attention

Authors: Youssef Alami Mejjati, Celso F. Gomez, Kwang In Kim, Eli Shechtman, Zoya Bylinskii

Abstract: Across photography, marketing, and website design, being able to direct the viewer's attention is a powerful tool. Motivated by professional workflows, we introduce an automatic method to make an image region more attention-capturing via subtle image edits that maintain realism and fidelity to the original. From an input image and a user-provided mask, our GazeShiftNet model predicts a distinct set of global parametric transformations to be applied to the foreground and background image regions separately. We present the results of quantitative and qualitative experiments that demonstrate improvements over prior state-of-the-art. In contrast to existing attention shifting algorithms, our global parametric approach better preserves image semantics and avoids typical generative artifacts. Our edits enable inference at interactive rates on any image size, and easily generalize to videos. Extensions of our model allow for multi-style edits and the ability to both increase and attenuate attention in an image region. Furthermore, users can customize the edited images by dialing the edits up or down via interpolations in parameter space. This paper presents a practical tool that can simplify future image editing pipelines.

Submitted 12 August, 2020; originally announced August 2020.

Comments: To appear in ECCV 2020

arXiv:2007.08012 [pdf, other]

Combining Task Predictors via Enhancing Joint Predictability

Authors: Kwang In Kim, Christian Richardt, Hyung Jin Chang

Abstract: Predictor combination aims to improve a (target) predictor of a learning task based on the (reference) predictors of potentially relevant tasks, without having access to the internals of individual predictors. We present a new predictor combination algorithm that improves the target by i) measuring the relevance of references based on their capabilities in predicting the target, and ii) strengthening such estimated relevance. Unlike existing predictor combination approaches that only exploit pairwise relationships between the target and each reference, and thereby ignore potentially useful dependence among references, our algorithm jointly assesses the relevance of all references by adopting a Bayesian framework. This also offers a rigorous way to automatically select only relevant references. Based on experiments on seven real-world datasets from visual attribute ranking and multi-class classification scenarios, we demonstrate that our algorithm offers a significant performance gain and broadens the application range of existing predictor combination approaches.

Submitted 15 July, 2020; originally announced July 2020.

arXiv:2004.07788 [pdf, other]

RGBD-Dog: Predicting Canine Pose from RGBD Sensors

Authors: Sinead Kearney, Wenbin Li, Martin Parsons, Kwang In Kim, Darren Cosker

Abstract: The automatic extraction of animal 3D pose from images without markers is of interest in a range of scientific fields. Most work to date predicts animal pose from RGB images, based on 2D labelling of joint positions. However, due to the difficult nature of obtaining training data, no ground truth dataset of 3D animal motion is available to quantitatively evaluate these approaches. In addition, a lack of 3D animal pose data also makes it difficult to train 3D pose-prediction methods in a similar manner to the popular field of body-pose prediction. In our work, we focus on the problem of 3D canine pose estimation from RGBD images, recording a diverse range of dog breeds with several Microsoft Kinect v2s, simultaneously obtaining the 3D ground truth skeleton via a motion capture system. We generate a dataset of synthetic RGBD images from this data. A stacked hourglass network is trained to predict 3D joint locations, which is then constrained using prior models of shape and pose. We evaluate our model on both synthetic and real RGBD images and compare our results to previously published work fitting canine models to images. Finally, despite our training set consisting only of dog data, visual inspection implies that our network can produce good predictions for images of other quadrupeds -- e.g. horses or cats -- when their pose is similar to that contained in our training set.

Submitted 16 April, 2020; originally announced April 2020.

Comments: 18 pages, 16 figures, to be published in CVPR 2020

arXiv:2002.04709 [pdf, other]

Task-Aware Variational Adversarial Active Learning

Authors: Kwanyoung Kim, Dongwon Park, Kwang In Kim, Se Young Chun

Abstract: Often, labeling a large amount of data is challenging due to high labeling cost, limiting the application domain of deep learning techniques. Active learning (AL) tackles this by querying the most informative samples to be annotated from the unlabeled pool. Two promising directions for AL that have been recently explored are the task-agnostic approach, which selects data points that are far from the current labeled pool, and the task-aware approach, which relies on the perspective of the task model. Unfortunately, the former does not exploit structures from tasks and the latter does not seem to utilize the overall data distribution well. Here, we propose task-aware variational adversarial AL (TA-VAAL), which modifies task-agnostic VAAL, which considers the data distribution of both labeled and unlabeled pools, by relaxing task learning loss prediction to ranking loss prediction and by using a ranking conditional generative adversarial network to embed normalized ranking loss information in VAAL. Our proposed TA-VAAL outperforms the state of the art on various benchmark datasets for classification with balanced / imbalanced labels as well as semantic segmentation, and its task-aware and task-agnostic AL properties were confirmed with our in-depth analyses.

Submitted 8 December, 2020; v1 submitted 11 February, 2020; originally announced February 2020.

Comments: 14 pages, 13 figures, 1 table

arXiv:2001.02595 [pdf, other]

Generating Object Stamps

Authors: Youssef Alami Mejjati, Zejiang Shen, Michael Snower, Aaron Gokaslan, Oliver Wang, James Tompkin, Kwang In Kim

Abstract: We present an algorithm to generate diverse foreground objects and composite them into background images using a GAN architecture. Given an object class, a user-provided bounding box, and a background image, we first use a mask generator to create an object shape, and then use a texture generator to fill the mask such that the texture integrates with the background. By separating the problem of object insertion into these two stages, we show that our model allows us to improve the realism of diverse object generation that also agrees with the provided background image. Our results on the challenging COCO dataset show improved overall quality and diversity compared to state-of-the-art object insertion approaches.

Submitted 10 January, 2020; v1 submitted 1 January, 2020; originally announced January 2020.

Comments: 27 pages, 25 figures, 11 tables. Paper under review

arXiv:1905.04967 [pdf, other]

Implicit Filter Sparsification In Convolutional Neural Networks

Authors: Dushyant Mehta, Kwang In Kim, Christian Theobalt

Abstract: We show implicit filter level sparsity manifests in convolutional neural networks (CNNs) which employ Batch Normalization and ReLU activation, and are trained with adaptive gradient descent techniques and L2 regularization or weight decay. Through an extensive empirical study (Mehta et al., 2019) we hypothesize the mechanism behind the sparsification process, and find surprising links to certain filter sparsification heuristics proposed in literature. Emergence of, and the subsequent pruning of selective features is observed to be one of the contributing mechanisms, leading to feature sparsity at par or better than certain explicit sparsification / pruning approaches. In this workshop article we summarize our findings, and point out corollaries of selective-feature penalization which could also be employed as heuristics for filter pruning.

Submitted 13 May, 2019; originally announced May 2019.

Comments: ODML-CDNNR 2019 (ICML'19 workshop) extended abstract of the CVPR 2019 paper "On Implicit Filter Level Sparsity in Convolutional Neural Networks, Mehta et al." (arXiv:1811.12495)

arXiv:1904.05159 [pdf, other]

Joint Manifold Diffusion for Combining Predictions on Decoupled Observations

Authors: Kwang In Kim, Hyung Jin Chang

Abstract: We present a new predictor combination algorithm that improves a given task predictor based on potentially relevant reference predictors. Existing approaches are limited in that, to discover the underlying task dependence, they either require known parametric forms of all predictors or access to a single fixed dataset on which all predictors are jointly evaluated. To overcome these limitations, we design a new non-parametric task dependence estimation procedure that automatically aligns evaluations of heterogeneous predictors across disjoint feature sets. Our algorithm is instantiated as a robust manifold diffusion process that jointly refines the estimated predictor alignments and the corresponding task dependence. We apply this algorithm to the relative attributes ranking problem and demonstrate that it not only broadens the application range of predictor combination approaches but also outperforms existing methods even when applied to classical predictor combination settings.

Submitted 10 April, 2019; originally announced April 2019.

Comments: Published at CVPR 2019

arXiv:1904.04196 [pdf, other]

Pushing the Envelope for RGB-based Dense 3D Hand Pose Estimation via Neural Rendering

Authors: Seungryul Baek, Kwang In Kim, Tae-Kyun Kim

Abstract: Estimating 3D hand meshes from single RGB images is challenging, due to intrinsic 2D-3D mapping ambiguities and limited training data. We adopt a compact parametric 3D hand model that represents deformable and articulated hand meshes. To achieve the model fitting to RGB images, we investigate and contribute in three ways: 1) Neural rendering: inspired by recent work on human body, our hand mesh estimator (HME) is implemented by a neural network and a differentiable renderer, supervised by 2D segmentation masks and 3D skeletons. HME demonstrates good performance for estimating diverse hand shapes and improves pose estimation accuracies. 2) Iterative testing refinement: Our fitting function is differentiable. We iteratively refine the initial estimate using the gradients, in the spirit of iterative model fitting methods like ICP. The idea is supported by the latest research on human body. 3) Self-data augmentation: collecting sized RGB-mesh (or segmentation mask)-skeleton triplets for training is a big hurdle. Once the model is successfully fitted to input RGB images, its meshes i.e. shapes and articulations, are realistic, and we augment view-points on top of estimated dense hand poses. Experiments using three RGB-based benchmarks show that our framework offers beyond state-of-the-art accuracy in 3D pose estimation, as well as recovers dense 3D hand shapes. Each technical component above meaningfully improves the accuracy in the ablation study.

Submitted 9 April, 2019; v1 submitted 8 April, 2019; originally announced April 2019.

Comments: Accepted to CVPR 2019

arXiv:1811.12495 [pdf, other]

On Implicit Filter Level Sparsity in Convolutional Neural Networks

Authors: Dushyant Mehta, Kwang In Kim, Christian Theobalt

Abstract: We investigate filter level sparsity that emerges in convolutional neural networks (CNNs) which employ Batch Normalization and ReLU activation, and are trained with adaptive gradient descent techniques and L2 regularization or weight decay. We conduct an extensive experimental study casting our initial findings into hypotheses and conclusions about the mechanisms underlying the emergent filter level sparsity. This study allows new insight into the performance gap observed between adaptive and non-adaptive gradient descent methods in practice. Further, analysis of the effect of training strategies and hyperparameters on the sparsity leads to practical suggestions in designing CNN training strategies enabling us to explore the tradeoffs between feature selectivity, network capacity, and generalization performance. Lastly, we show that the implicit sparsity can be harnessed for neural network speedup at par or better than explicit sparsification / pruning approaches, with no modifications to the typical training pipeline required.

Submitted 5 April, 2019; v1 submitted 29 November, 2018; originally announced November 2018.

Comments: Accepted at CVPR 2019

arXiv:1808.04325 [pdf, other]

Improving Shape Deformation in Unsupervised Image-to-Image Translation

Authors: Aaron Gokaslan, Vivek Ramanujan, Daniel Ritchie, Kwang In Kim, James Tompkin

Abstract: Unsupervised image-to-image translation techniques are able to map local texture between two domains, but they are typically unsuccessful when the domains require larger shape change. Inspired by semantic segmentation, we introduce a discriminator with dilated convolutions that is able to use information from across the entire image to train a more context-aware generator. This is coupled with a multi-scale perceptual loss that is better able to represent error in the underlying shape of objects. We demonstrate that this design is more capable of representing shape deformation in a challenging toy dataset, plus in complex mappings with significant dataset variation between humans, dolls, and anime faces, and between cats and dogs.

Submitted 17 January, 2019; v1 submitted 13 August, 2018; originally announced August 2018.

arXiv:1806.03891 [pdf, other]

Multi-Task Deep Networks for Depth-Based 6D Object Pose and Joint Registration in Crowd Scenarios

Authors: Juil Sock, Kwang In Kim, Caner Sahin, Tae-Kyun Kim

Abstract: In bin-picking scenarios, multiple instances of an object of interest are stacked in a pile randomly, and hence, the instances are inherently subjected to the challenges: severe occlusion, clutter, and similar-looking distractors. Most existing methods are, however, for single isolated object instances, while some recent methods tackle crowd scenarios as post-refinement which accounts multiple object relations. In this paper, we address recovering 6D poses of multiple instances in bin-picking scenarios in depth modality by multi-task learning in deep neural networks. Our architecture jointly learns multiple sub-tasks: 2D detection, depth, and 3D pose estimation of individual objects; and joint registration of multiple objects. For training data generation, depth images of physically plausible object pose configurations are generated by a 3D object model in a physics simulation, which yields diverse occlusion patterns to learn. We adopt a state-of-the-art object detector, and 2D offsets are further estimated via a network to refine misaligned 2D detections. The depth and 3D pose estimator is designed to generate multiple hypotheses per detection. This allows the joint registration network to learn occlusion patterns and remove physically implausible pose hypotheses. We apply our architecture on both synthetic (our own and Sileane dataset) and real (a public Bin-Picking dataset) data, showing that it significantly outperforms state-of-the-art methods by 15-31% in average precision.

Submitted 11 June, 2018; originally announced June 2018.

arXiv:1806.02311 [pdf, other]

Unsupervised Attention-guided Image to Image Translation

Authors: Youssef A. Mejjati, Christian Richardt, James Tompkin, Darren Cosker, Kwang In Kim

Abstract: Current unsupervised image-to-image translation techniques struggle to focus their attention on individual objects without altering the background or the way multiple objects interact within a scene. Motivated by the important role of attention in human perception, we tackle this limitation by introducing unsupervised attention mechanisms that are jointly adversarially trained with the generators and discriminators. We demonstrate qualitatively and quantitatively that our approach is able to attend to relevant regions in the image without requiring supervision, and that by doing so it achieves more realistic mappings compared to recent approaches.

Submitted 8 November, 2018; v1 submitted 6 June, 2018; originally announced June 2018.

Journal ref: NIPS 2018

arXiv:1805.04497 [pdf, other]

Augmented Skeleton Space Transfer for Depth-based Hand Pose Estimation

Authors: Seungryul Baek, Kwang In Kim, Tae-Kyun Kim

Abstract: Crucial to the success of training a depth-based 3D hand pose estimator (HPE) is the availability of comprehensive datasets covering diverse camera perspectives, shapes, and pose variations. However, collecting such annotated datasets is challenging. We propose to complete existing databases by generating new database entries. The key idea is to synthesize data in the skeleton space (instead of doing so in the depth-map space) which enables an easy and intuitive way of manipulating data entries. Since the skeleton entries generated in this way do not have the corresponding depth map entries, we exploit them by training a separate hand pose generator (HPG) which synthesizes the depth map from the skeleton entries. By training the HPG and HPE in a single unified optimization framework enforcing that 1) the HPE agrees with the paired depth and skeleton entries; and 2) the HPG-HPE combination satisfies the cyclic consistency (both the input and the output of HPG-HPE are skeletons) observed via the newly generated unpaired skeletons, our algorithm constructs a HPE which is robust to variations that go beyond the coverage of the existing database. Our training algorithm adopts the generative adversarial networks (GAN) training process. As a by-product, we obtain a hand pose discriminator (HPD) that is capable of picking out realistic hand poses. Our algorithm exploits this capability to refine the initial skeleton estimates in testing, further improving the accuracy.
We test our algorithm on four challenging benchmark datasets (ICVL, MSRA, NYU and Big Hand 2.2M datasets) and demonstrate that our approach outperforms or is on par with state-of-the-art methods quantitatively and qualitatively.

Submitted 11 May, 2018; originally announced May 2018.

Comments: Accepted to CVPR 2018

arXiv:1804.04082 [pdf, other]

Ranking CGANs: Subjective Control over Semantic Image Attributes

Authors: Yassir Saquil, Kwang In Kim, Peter Hall

Abstract: In this paper, we investigate the use of generative adversarial networks in the task of image generation according to subjective measures of semantic attributes. Unlike the standard (CGAN) that generates images from discrete categorical labels, our architecture handles both continuous and discrete scales. Given pairwise comparisons of images, our model, called RankCGAN, performs two tasks: it learns to rank images using a subjective measure; and it learns a generative model that can be controlled by that measure. RankCGAN associates each subjective measure of interest to a distinct dimension of some latent space. We perform experiments on UT-Zap50K, PubFig and OSR datasets and demonstrate that the model is expressive and diverse enough to conduct two-attribute exploration and image editing.

Submitted 24 July, 2018; v1 submitted 11 April, 2018; originally announced April 2018.

arXiv:1706.03863 [pdf, other]

Criteria Sliders: Learning Continuous Database Criteria via Interactive Ranking

Authors: James Tompkin, Kwang In Kim, Hanspeter Pfister, Christian Theobalt

Abstract: Large databases are often organized by hand-labeled metadata, or criteria, which are expensive to collect. We can use unsupervised learning to model database variation, but these models are often high dimensional, complex to parameterize, or require expert knowledge. We learn low-dimensional continuous criteria via interactive ranking, so that the novice user need only describe the relative ordering of examples. This is formed as semi-supervised label propagation in which we maximize the information gained from a limited number of examples. Further, we actively suggest data points to the user to rank in a more informative way than existing work. Our efficient approach allows users to interactively organize thousands of data points along 1D and 2D continuous sliders. We experiment with datasets of imagery and geometry to demonstrate that our tool is useful for quickly assessing and organizing the content of large databases.

Submitted 12 June, 2017; originally announced June 2017.

arXiv:1706.02003 [pdf, other]

Deep Convolutional Decision Jungle for Image Classification

Authors: Seungryul Baek, Kwang In Kim, Tae-Kyun Kim

Abstract: We propose a novel method called deep convolutional decision jungle (CDJ) and its learning algorithm for image classification. The CDJ maintains the structure of standard convolutional neural networks (CNNs), i.e. multiple layers of multiple response maps fully connected. Each response map (or node) in both the convolutional and fully-connected layers selectively responds to class labels s.t. each data sample travels via a specific soft route of those activated nodes. The proposed method CDJ automatically learns features, whereas decision forests and jungles require pre-defined feature sets. Compared to CNNs, the method embeds the benefits of using data-dependent discriminative functions, which better handles multi-modal/heterogeneous data; further, the method offers more diverse sparse network responses, which in turn can be used for cost-effective learning/classification. The network is learnt by combining conventional softmax and proposed entropy losses in each layer. The entropy loss, as used in decision tree growing, measures the purity of data activation according to the class label distribution. The back-propagation rule for the proposed loss function is derived from stochastic gradient descent (SGD) optimization of CNNs. We show that our proposed method outperforms state-of-the-art methods on three public image classification benchmarks and one face verification dataset.
We also demonstrate the use of auxiliary data labels, when available, which helps our method to learn more discriminative routing and representations and leads to improved classification.

Submitted 18 May, 2018; v1 submitted 6 June, 2017; originally announced June 2017.

arXiv:1610.09334 [pdf, ps, other]

Real-time Online Action Detection Forests using Spatio-temporal Contexts

Authors: Seungryul Baek, Kwang In Kim, Tae-Kyun Kim

Abstract: Online action detection (OAD) is challenging since 1) robust yet computationally expensive features cannot be straightforwardly used due to the real-time processing requirements and 2) the localization and classification of actions have to be performed even before they are fully observed. We propose a new random forest (RF)-based online action detection framework that addresses these challenges. Our algorithm uses computationally efficient skeletal joint features. High accuracy is achieved by using robust convolutional neural network (CNN)-based features which are extracted from the raw RGBD images, plus the temporal relationships between the current frame of interest, and the past and future frames. While these high-quality features are not available in real-time testing scenario, we demonstrate that they can be effectively exploited in training RF classifiers: We use these spatio-temporal contexts to craft RF's new split functions improving RFs' leaf node statistics. Experiments with challenging MSRAction3D, G3D, and OAD datasets demonstrate that our algorithm significantly improves the accuracy over the state-of-the-art online action detection algorithms while achieving the real-time efficiency of existing skeleton-based RF classifiers.

Submitted 28 October, 2016; originally announced October 2016.

arXiv:1602.06439 [pdf, ps, other]

doi 10.1109/ICCV.2015.318

Context-guided diffusion for label propagation on graphs

Authors: Kwang In Kim, James Tompkin, Hanspeter Pfister, Christian Theobalt

Abstract: Existing approaches for diffusion on graphs, e.g., for label propagation, are mainly focused on isotropic diffusion, which is induced by the commonly-used graph Laplacian regularizer. Inspired by the success of diffusivity tensors for anisotropic diffusion in image processing, we present anisotropic diffusion on graphs and the corresponding label propagation algorithm. We develop positive definite diffusivity operators on the vector bundles of Riemannian manifolds, and discretize them to diffusivity operators on graphs. This enables us to easily define new robust diffusivity operators which significantly improve semi-supervised learning performance over existing diffusion algorithms.

Submitted 20 February, 2016; originally announced February 2016.

arXiv:1602.03808 [pdf, other]

doi 10.1109/CVPR.2015.7298831

Semi-supervised Learning with Explicit Relationship Regularization

Authors: Kwang In Kim, James Tompkin, Hanspeter Pfister, Christian Theobalt

Abstract: In many learning tasks, the structure of the target space of a function holds rich information about the relationships between evaluations of functions on different data points. Existing approaches attempt to exploit this relationship information implicitly by enforcing smoothness on function evaluations only. However, what happens if we explicitly regularize the relationships between function evaluations? Inspired by homophily, we regularize based on a smooth relationship function, either defined from the data or with labels. In experiments, we demonstrate that this significantly improves the performance of state-of-the-art algorithms in semi-supervised classification and in spectral data embedding for constrained clustering and dimensionality reduction.

Submitted 11 February, 2016; originally announced February 2016.

Comments: Accepted version of paper published at CVPR 2015, http://dx.doi.org/10.1109/CVPR.2015.7298831

arXiv:1602.03805 [pdf, other]

doi 10.1109/CVPR.2015.7299186

Local High-order Regularization on Data Manifolds

Authors: Kwang In Kim, James Tompkin, Hanspeter Pfister, Christian Theobalt

Abstract: The common graph Laplacian regularizer is well-established in semi-supervised learning and spectral dimensionality reduction. However, as a first-order regularizer, it can lead to degenerate functions in high-dimensional manifolds. The iterated graph Laplacian enables high-order regularization, but it has a high computational complexity and so cannot be applied to large problems. We introduce a new regularizer which is globally high order and so does not suffer from the degeneracy of the graph Laplacian regularizer, but is also sparse for efficient computation in semi-supervised learning applications. We reduce computational complexity by building a local first-order approximation of the manifold as a surrogate geometry, and construct our high-order regularizer based on local derivative evaluations therein. Experiments on human body shape and pose analysis demonstrate the effectiveness and efficiency of our method.

Submitted 11 February, 2016; originally announced February 2016.

Comments: Accepted version of paper published at CVPR 2015, http://dx.doi.org/10.1109/CVPR.2015.7299186

Showing 1–33 of 33 results for author: Kim, K I