Search | arXiv e-print repository

arXiv:2503.19945 [pdf, other]

Optimizing Breast Cancer Detection in Mammograms: A Comprehensive Study of Transfer Learning, Resolution Reduction, and Multi-View Classification

Authors: Daniel G. P. Petrini, Hae Yong Kim

Abstract: This study explores open questions in the application of machine learning for breast cancer detection in mammograms. Current approaches often employ a two-stage transfer learning process: first, adapting a backbone model trained on natural images to develop a patch classifier, which is then used to create a single-view whole-image classifier. Additionally, many studies leverage both mammographic views to enhance model performance. In this work, we systematically investigate five key questions: (1) Is the intermediate patch classifier essential for optimal performance? (2) Do backbone models that excel in natural image classification consistently outperform others on mammograms? (3) When reducing mammogram resolution for GPU processing, does the learn-to-resize technique outperform conventional methods? (4) Does incorporating both mammographic views in a two-view classifier significantly improve detection accuracy? (5) How do these findings vary when analyzing low-quality versus high-quality mammograms? By addressing these questions, we developed models that outperform previous results for both single-view and two-view classifiers. Our findings provide insights into model architecture and transfer learning strategies contributing to more accurate and efficient mammogram analysis.

Submitted 25 March, 2025; originally announced March 2025.

Comments: 8 pages

arXiv:2502.00497 [pdf]

Convolutional Fourier Analysis Network (CFAN): A Unified Time-Frequency Approach for ECG Classification

Authors: Sam Jeong, Hae Yong Kim

Abstract: Machine learning has revolutionized biomedical signal analysis, particularly in electrocardiogram (ECG) classification. While convolutional neural networks (CNNs) excel at automatic feature extraction, the optimal integration of time- and frequency-domain information remains unresolved. This study introduces the Convolutional Fourier Analysis Network (CFAN), a novel architecture that unifies time-frequency analysis by embedding Fourier principles directly into CNN layers. We evaluate CFAN against four benchmarks - spectrogram-based 2D CNN (SPECT); 1D CNN (CNN1D); Fourier-based 1D CNN (FFT1D); and CNN1D with integrated Fourier Analysis Network (CNN1D-FAN) - across three ECG tasks: arrhythmia classification (MIT-BIH), identity recognition (ECG-ID), and apnea detection (Apnea-ECG). CFAN achieved state-of-the-art performance, surpassing all competing methods with accuracies of 98.95% (MIT-BIH), 96.83% (ECG-ID), and 95.01% (Apnea-ECG). Notably, on ECG-ID and Apnea-ECG, CFAN demonstrated statistically significant improvements over the second-best method (CNN1D-FAN, $p \leq 0.02$), further validating its superior performance. Key innovations include CONV-FAN blocks that combine sine, cosine and GELU activations in convolutional layers to capture periodic features and joint time-frequency learning without spectrogram conversion. Our results highlight CFAN's potential for broader biomedical and signal classification applications.

Submitted 13 May, 2025; v1 submitted 1 February, 2025; originally announced February 2025.
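
The CONV-FAN idea named in the CFAN abstract above (sine, cosine and GELU activations inside convolutional layers) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names, the channel layout (cosine and sine channels sharing one set of kernels, GELU channels using their own kernels and biases) and the unpadded 1D convolution are all assumptions made for illustration.

```python
import math


def conv1d_valid(signal, kernel):
    """Slide `kernel` over `signal` without padding (cross-correlation)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def conv_fan_block(signal, periodic_kernels, gelu_kernels, gelu_biases):
    """Hypothetical CONV-FAN-style block returning a list of output channels.

    Each periodic kernel yields two channels (cos and sin of the same
    convolution output); each GELU kernel yields one channel.
    """
    channels = []
    for kernel in periodic_kernels:
        pre = conv1d_valid(signal, kernel)
        channels.append([math.cos(v) for v in pre])
        channels.append([math.sin(v) for v in pre])
    for kernel, bias in zip(gelu_kernels, gelu_biases):
        pre = conv1d_valid(signal, kernel)
        channels.append([gelu(v + bias) for v in pre])
    return channels


# Toy input standing in for a short ECG segment (made-up values).
ecg_like = [0.0, 0.5, 1.0, 0.5, 0.0]
features = conv_fan_block(ecg_like, [[1.0, 0.0]], [[1.0, 1.0]], [0.0])
```

In this sketch the periodic channels are bounded in [-1, 1] regardless of input scale, while the GELU channels behave like an ordinary convolutional feature map. That is one way to read "joint time-frequency learning without spectrogram conversion".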

arXiv:2411.18995 [pdf, other]

MVFormer: Diversifying Feature Normalization and Token Mixing for Efficient Vision Transformers

Authors: Jongseong Bae, Susang Kim, Minsu Cho, Ha Young Kim

Abstract: Active research is currently underway to enhance the efficiency of vision transformers (ViTs). Most studies have focused solely on effective token mixers, overlooking the potential relationship with normalization. To boost diverse feature learning, we propose two components: a normalization module called multi-view normalization (MVN) and a token mixer called multi-view token mixer (MVTM). The MVN integrates three differently normalized features via batch, layer, and instance normalization using a learnable weighted sum. Each normalization method outputs a different distribution, generating distinct features. Thus, the MVN is expected to offer diverse pattern information to the token mixer, resulting in beneficial synergy. The MVTM is a convolution-based multiscale token mixer with local, intermediate, and global filters, and it incorporates stage specificity by configuring various receptive fields for the token mixer at each stage, efficiently capturing ranges of visual patterns. We propose a novel ViT model, multi-vision transformer (MVFormer), adopting the MVN and MVTM in the MetaFormer block, the generalized ViT scheme. Our MVFormer outperforms state-of-the-art convolution-based ViTs on image classification, object detection, and instance and semantic segmentation with the same or fewer parameters and MACs. In particular, the MVFormer variants MVFormer-T, S, and B achieve 83.4%, 84.3%, and 84.6% top-1 accuracy, respectively, on the ImageNet-1K benchmark.

Submitted 28 November, 2024; originally announced November 2024.
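
The multi-view normalization (MVN) described in the MVFormer abstract above, a weighted sum of batch-, layer- and instance-normalized features, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's module. Features are nested lists shaped [batch][channel][position]; the three mixing weights are plain floats that a real model would learn; affine parameters and running statistics are omitted; and the choice of axes for layer normalization (all channels and positions of one sample) is an assumption.

```python
import math

EPS = 1e-5  # numerical-stability constant (assumed value)


def _standardize(values):
    """Return `values` shifted and scaled to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    inv_std = 1.0 / math.sqrt(var + EPS)
    return [(v - mean) * inv_std for v in values]


def batch_norm(x):
    """Normalize each channel over all samples and positions."""
    n, c, length = len(x), len(x[0]), len(x[0][0])
    out = [[None] * c for _ in range(n)]
    for ch in range(c):
        flat = _standardize([v for i in range(n) for v in x[i][ch]])
        for i in range(n):
            out[i][ch] = flat[i * length:(i + 1) * length]
    return out


def layer_norm(x):
    """Normalize each sample over all of its channels and positions."""
    length = len(x[0][0])
    out = []
    for sample in x:
        flat = _standardize([v for channel in sample for v in channel])
        out.append([flat[k * length:(k + 1) * length] for k in range(len(sample))])
    return out


def instance_norm(x):
    """Normalize each (sample, channel) pair over its positions."""
    return [[_standardize(channel) for channel in sample] for sample in x]


def multi_view_norm(x, weights):
    """Weighted sum of the three normalized views; `weights` = (w_bn, w_ln, w_in)."""
    w_bn, w_ln, w_in = weights
    bn, ln, inn = batch_norm(x), layer_norm(x), instance_norm(x)
    return [
        [
            [w_bn * a + w_ln * b + w_in * c for a, b, c in zip(ch_bn, ch_ln, ch_in)]
            for ch_bn, ch_ln, ch_in in zip(s_bn, s_ln, s_in)
        ]
        for s_bn, s_ln, s_in in zip(bn, ln, inn)
    ]
```

With a batch of one sample, the batch-normalized and instance-normalized views coincide, so the three views differ only once several samples or channels are present.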

arXiv:2411.17248 [pdf, other]

DiffSLT: Enhancing Diversity in Sign Language Translation via Diffusion Model

Authors: JiHwan Moon, Jihoon Park, Jungeun Kim, Jongseong Bae, Hyeongwoo Jeon, Ha Young Kim

Abstract: Sign language translation (SLT) is challenging, as it involves converting sign language videos into natural language. Previous studies have prioritized accuracy over diversity. However, diversity is crucial for handling lexical and syntactic ambiguities in machine translation, suggesting it could similarly benefit SLT. In this work, we propose DiffSLT, a novel gloss-free SLT framework that leverages a diffusion model, enabling diverse translations while preserving sign language semantics. DiffSLT transforms random noise into the target latent representation, conditioned on the visual features of the input video. To enhance visual conditioning, we design a Guidance Fusion Module, which fully utilizes the multi-level spatiotemporal information of the visual features. We also introduce DiffSLT-P, a DiffSLT variant that conditions on pseudo-glosses and visual features, providing key textual guidance and reducing the modality gap. As a result, DiffSLT and DiffSLT-P significantly improve diversity over previous gloss-free SLT methods and achieve state-of-the-art performance on two SLT datasets, thereby markedly improving translation quality.

Submitted 26 November, 2024; originally announced November 2024.

Comments: Project page: https://diffslt.github.io/

arXiv:2411.16789 [pdf, other]

Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation

Authors: Jungeun Kim, Hyeongwoo Jeon, Jongseong Bae, Ha Young Kim

Abstract: Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language. For SLT models to perform this task successfully, they must bridge the modality gap and identify subtle variations in sign language components to understand their meanings accurately. To address these challenges, we propose a novel gloss-free SLT framework called Multimodal Sign Language Translation (MMSLT), which leverages the representational capabilities of off-the-shelf multimodal large language models (MLLMs). Specifically, we generate detailed textual descriptions of sign language components using MLLMs. Then, through our proposed multimodal-language pre-training module, we integrate these description features with sign video features to align them within the spoken sentence space. Our approach achieves state-of-the-art performance on benchmark datasets PHOENIX14T and CSL-Daily, highlighting the potential of MLLMs to be effectively utilized in SLT.

Submitted 25 November, 2024; originally announced November 2024.

arXiv:2411.16129 [pdf, other]

Three Cars Approaching within 100m! Enhancing Distant Geometry by Tri-Axis Voxel Scanning for Camera-based Semantic Scene Completion

Authors: Jongseong Bae, Junwoo Ha, Ha Young Kim

Abstract: Camera-based Semantic Scene Completion (SSC) is gaining attention in the 3D perception field. However, properties such as perspective and occlusion lead to the underestimation of the geometry in distant regions, posing a critical issue for safety-focused autonomous driving systems. To tackle this, we propose ScanSSC, a novel camera-based SSC model composed of a Scan Module and Scan Loss, both designed to enhance distant scenes by leveraging context from near-viewpoint scenes. The Scan Module uses axis-wise masked attention, with each axis employing near-to-far cascade masking that enables distant voxels to capture relationships with preceding voxels. In addition, the Scan Loss computes the cross-entropy along each axis between cumulative logits and corresponding class distributions in a near-to-far direction, thereby propagating rich context-aware signals to distant voxels. Leveraging the synergy between these components, ScanSSC achieves state-of-the-art performance, with IoUs of 44.54 and 48.29, and mIoUs of 17.40 and 20.14 on the SemanticKITTI and SSCBench-KITTI-360 benchmarks.

Submitted 25 March, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

Comments: Accepted to CVPR 2025

arXiv:2410.01531 [pdf, other]

TiVaT: A Transformer with a Single Unified Mechanism for Capturing Asynchronous Dependencies in Multivariate Time Series Forecasting

Authors: Junwoo Ha, Hyukjae Kwon, Sungsoo Kim, Kisu Lee, Seungjae Park, Ha Young Kim

Abstract: Multivariate time series (MTS) forecasting is vital across various domains but remains challenging due to the need to simultaneously model temporal and inter-variate dependencies. Existing channel-dependent models, where Transformer-based models dominate, process these dependencies separately, limiting their capacity to capture complex interactions such as lead-lag dynamics. To address this issue, we propose TiVaT (Time-variate Transformer), a novel architecture incorporating a single unified module, a Joint-Axis (JA) attention module, that concurrently processes temporal and variate modeling. The JA attention module dynamically selects relevant features to particularly capture asynchronous interactions. In addition, we introduce distance-aware time-variate sampling in the JA attention, a novel mechanism that extracts significant patterns through a learned 2D embedding space while reducing noise. Extensive experiments demonstrate TiVaT's overall performance across diverse datasets, particularly excelling in scenarios with intricate asynchronous dependencies.

Submitted 30 January, 2025; v1 submitted 2 October, 2024; originally announced October 2024.

Comments: 15 pages

MSC Class: I.2.0

arXiv:2407.12514 [pdf, other]

On Initializing Transformers with Pre-trained Embeddings

Authors: Ha Young Kim, Niranjan Balasubramanian, Byungkon Kang

Abstract: It has become common practice now to use random initialization schemes, rather than the pre-trained embeddings, when training transformer-based models from scratch. Indeed, we find that pre-trained word embeddings from GloVe, and some sub-word embeddings extracted from language models such as T5 and mT5 fare much worse compared to random initialization. This is counter-intuitive given the well-known representational and transfer-learning advantages of pre-training. Interestingly, we also find that BERT and mBERT embeddings fare better than random initialization, showing the advantages of pre-trained representations. In this work, we posit two potential factors that contribute to these mixed results: the model sensitivity to parameter distribution and the embedding interactions with position encodings. We observe that pre-trained GloVe, T5, and mT5 embeddings have a wider distribution of values. As argued in the initialization studies, such large value initializations can lead to poor training because of saturated outputs. Further, the larger embedding values can, in effect, absorb the smaller position encoding values when added together, thus losing position information. Standardizing the pre-trained embeddings to a narrow range (e.g. as prescribed by Xavier) leads to substantial gains for GloVe, T5, and mT5 embeddings. On the other hand, BERT pre-trained embeddings, while larger, are still relatively closer to the Xavier initialization range, which may allow them to effectively transfer the pre-trained knowledge.

Submitted 17 July, 2024; originally announced July 2024.

ACM Class: I.2.7
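
The standardization step mentioned in the abstract above (bringing pre-trained embeddings into the narrow range prescribed by Xavier initialization) might look like the following sketch. The abstract does not give the exact procedure, so the recipe here is an assumption: shift the whole matrix to zero mean and rescale it to the Xavier standard deviation sqrt(2 / (fan_in + fan_out)). Using the vocabulary size and embedding dimension as the two fans is likewise an assumption, and the "pre-trained" matrix is random stand-in data.

```python
import math
import random


def xavier_std(fan_in, fan_out):
    """Standard deviation of Xavier (Glorot) initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def standardize_embeddings(emb, target_std):
    """Shift `emb` (list of rows) to zero mean and rescale to `target_std`."""
    values = [v for row in emb for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        # A constant matrix carries no information to rescale.
        return [[0.0 for _ in row] for row in emb]
    scale = target_std / std
    return [[(v - mean) * scale for v in row] for row in emb]


# Stand-in for widely distributed pre-trained embeddings (made-up data).
random.seed(0)
vocab_size, dim = 50, 8
wide = [[random.gauss(0.0, 5.0) for _ in range(dim)] for _ in range(vocab_size)]
narrow = standardize_embeddings(wide, xavier_std(vocab_size, dim))
```

A global rescaling preserves the relative geometry of the embedding vectors (all pairwise differences shrink by the same factor), which is why it can plausibly keep pre-trained structure while matching the scale of the position encodings.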

arXiv:2401.16688 [pdf]

doi 10.1109/ACCESS.2024.3422259

Characterization of Magnetic Labyrinthine Structures Through Junctions and Terminals Detection Using Template Matching and CNN

Authors: Vinícius Yu Okubo, Kotaro Shimizu, B. S. Shivaram, Hae Yong Kim

Abstract: Defects influence diverse properties of materials, shaping their structural, mechanical, and electronic characteristics. Among a variety of materials exhibiting unique defects, magnets exhibit diverse nano- to micro-scale defects and have been intensively studied in materials science. Specifically, defects in magnetic labyrinthine patterns, called junctions and terminals, are ubiquitous and serve as points of interest. While detecting and characterizing such defects is crucial for understanding magnets, systematically investigating large-scale images containing over a thousand closely packed junctions and terminals remains a formidable challenge. This study introduces a new technique called TM-CNN (Template Matching - Convolutional Neural Network) designed to detect a multitude of small objects in images, such as the defects in magnetic labyrinthine patterns. TM-CNN was used to identify 641,649 such structures in 444 experimental images, and the results were explored to deepen understanding of magnetic materials. It employs a two-stage detection approach combining template matching, used in initial detection, with a convolutional neural network, used to eliminate incorrect identifications. To train a CNN classifier, it is necessary to annotate a large number of training images. This difficulty prevents the use of CNN in many practical applications. TM-CNN significantly reduces the manual workload for creating training images by automatically making most of the annotations and leaving only a small number of corrections to human reviewers. In testing, TM-CNN achieved an impressive F1 score of 0.991, far outperforming traditional template matching and CNN-based object detection algorithms.

Submitted 18 July, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

Comments: 12 pages, 7 figures, published in IEEE Access

Journal ref: IEEE Access, vol. 12, pp. 92419 - 92430, 2024

arXiv:2309.00372 [pdf, other]

On the Localization of Ultrasound Image Slices within Point Distribution Models

Authors: Lennart Bastian, Vincent Bürgin, Ha Young Kim, Alexander Baumann, Benjamin Busam, Mahdi Saleh, Nassir Navab

Abstract: Thyroid disorders are most commonly diagnosed using high-resolution Ultrasound (US). Longitudinal nodule tracking is a pivotal diagnostic protocol for monitoring changes in pathological thyroid morphology. This task, however, imposes a substantial cognitive load on clinicians due to the inherent challenge of maintaining a mental 3D reconstruction of the organ. We thus present a framework for automated US image slice localization within a 3D shape representation to ease how such sonographic diagnoses are carried out. Our proposed method learns a common latent embedding space between US image patches and the 3D surface of an individual's thyroid shape, or a statistical aggregation in the form of a statistical shape model (SSM), via contrastive metric learning. Using cross-modality registration and Procrustes analysis, we leverage features from our model to register US slices to a 3D mesh representation of the thyroid shape. We demonstrate that our multi-modal registration framework can localize images on the 3D surface topology of a patient-specific organ and the mean shape of an SSM. Experimental results indicate slice positions can be predicted within an average of 1.2 mm of the ground-truth slice location on the patient-specific 3D anatomy and 4.6 mm on the SSM, exemplifying its usefulness for slice localization during sonographic acquisitions. Code is publicly available: https://github.com/vuenc/slice-to-shape

Submitted 1 September, 2023; originally announced September 2023.

Comments: ShapeMI Workshop @ MICCAI 2023; 12 pages 2 figures

arXiv:2308.15791 [pdf, other]

Neural Video Compression with Temporal Layer-Adaptive Hierarchical B-frame Coding

Authors: Yeongwoong Kim, Suyong Bahk, Seungeon Kim, Won Hee Lee, Dokwan Oh, Hui Yong Kim

Abstract: Neural video compression (NVC) is a rapidly evolving video coding research area, with some models achieving superior coding efficiency compared to the latest video coding standard Versatile Video Coding (VVC). In conventional video coding standards, the hierarchical B-frame coding, which utilizes a bidirectional prediction structure for higher compression, has been well studied and exploited. In NVC, however, limited research has investigated the hierarchical B scheme. In this paper, we propose an NVC model exploiting hierarchical B-frame coding with temporal layer-adaptive optimization. We first extend an existing unidirectional NVC model to a bidirectional model, which achieves -21.13% BD-rate gain over the unidirectional baseline model. However, this model faces challenges when applied to sequences with complex or large motions, leading to performance degradation. To address this, we introduce temporal layer-adaptive optimization, incorporating methods such as temporal layer-adaptive quality scaling (TAQS) and temporal layer-adaptive latent scaling (TALS). The final model with the proposed methods achieves an impressive BD-rate gain of -39.86% against the baseline. It also resolves the challenges in sequences with large or complex motions with up to -49.13% more BD-rate gains than the simple bidirectional extension. This improvement is attributed to the allocation of more bits to lower temporal layers, thereby enhancing overall reconstruction quality with fewer bits. Since our method has little dependency on a specific NVC model architecture, it can serve as a general tool for extending unidirectional NVC models to the ones with hierarchical B-frame coding.

Submitted 5 September, 2023; v1 submitted 30 August, 2023; originally announced August 2023.

arXiv:2307.01227 [pdf, other]

ESGCN: Edge Squeeze Attention Graph Convolutional Network for Traffic Flow Forecasting

Authors: Sangrok Lee, Ha Young Kim

Abstract: Traffic forecasting is a highly challenging task owing to the dynamical spatio-temporal dependencies of traffic flows. To handle this, we focus on modeling the spatio-temporal dynamics and propose a network termed Edge Squeeze Graph Convolutional Network (ESGCN) to forecast traffic flow in multiple regions. ESGCN consists of two modules: W-module and ES module. W-module is a fully node-wise convol… ▽ More Traffic forecasting is a highly challenging task owing to the dynamical spatio-temporal dependencies of traffic flows. To handle this, we focus on modeling the spatio-temporal dynamics and propose a network termed Edge Squeeze Graph Convolutional Network (ESGCN) to forecast traffic flow in multiple regions. ESGCN consists of two modules: W-module and ES module. W-module is a fully node-wise convolutional network. It encodes the time-series of each traffic region separately and decomposes the time-series at various scales to capture fine and coarse features. The ES module models the spatio-temporal dynamics using Graph Convolutional Network (GCN) and generates an Adaptive Adjacency Matrix (AAM) with temporal features. To improve the accuracy of AAM, we introduce three key concepts. 1) Using edge features to directly capture the spatiotemporal flow representation among regions. 2) Applying an edge attention mechanism to GCN to extract the AAM from the edge features. Here, the attention mechanism can effectively determine important spatio-temporal adjacency relations. 3) Proposing a novel node contrastive loss to suppress obstructed connections and emphasize related connections. Experimental results show that ESGCN achieves state-of-the-art performance by a large margin on four real-world datasets (PEMS03, 04, 07, and 08) with a low computational cost. △ Less

Submitted 12 July, 2023; v1 submitted 3 July, 2023; originally announced July 2023.

Comments: 7 Pages, 3 figures

arXiv:2306.16670 [pdf, other]

doi 10.1109/TCSVT.2023.3302858

End-to-End Learnable Multi-Scale Feature Compression for VCM

Authors: Yeongwoong Kim, Hyewon Jeong, Janghyun Yu, Younhee Kim, Jooyoung Lee, Se Yoon Jeong, Hui Yong Kim

Abstract: The proliferation of deep learning-based machine vision applications has given rise to a new type of compression, so-called video coding for machine (VCM). VCM differs from traditional video coding in that it is optimized for machine vision performance instead of human visual quality. In the feature compression track of MPEG-VCM, multi-scale features extracted from images are subject to compression. Recent feature compression works have demonstrated that the versatile video coding (VVC) standard-based approach can achieve a BD-rate reduction of up to 96% against MPEG-VCM feature anchor. However, it is still sub-optimal as VVC was not designed for extracted features but for natural images. Moreover, the high encoding complexity of VVC makes it difficult to design a lightweight encoder without sacrificing performance. To address these challenges, we propose a novel multi-scale feature compression method that enables both the end-to-end optimization on the extracted features and the design of lightweight encoders. The proposed model combines a learnable compressor with a multi-scale feature fusion network so that the redundancy in the multi-scale features is effectively removed. Instead of simply cascading the fusion network and the compression network, we integrate the fusion and encoding processes in an interleaved way. Our model first encodes a larger-scale feature to obtain a latent representation and then fuses the latent with a smaller-scale feature. This process is successively performed until the smallest-scale feature is fused and then the encoded latent at the final stage is entropy-coded for transmission. The results show that our model outperforms previous approaches by at least 52% BD-rate reduction and has $5\times$ to $27\times$ less encoding time for object detection...

Submitted 8 August, 2023; v1 submitted 29 June, 2023; originally announced June 2023.

Comments: 13 pages, accepted by IEEE Transactions on Circuits and Systems for Video Technology

arXiv:2304.07515 [pdf, other]

S3M: Scalable Statistical Shape Modeling through Unsupervised Correspondences

Authors: Lennart Bastian, Alexander Baumann, Emily Hoppe, Vincent Bürgin, Ha Young Kim, Mahdi Saleh, Benjamin Busam, Nassir Navab

Abstract: Statistical shape models (SSMs) are an established way to represent the anatomy of a population with various clinically relevant applications. However, they typically require domain expertise and labor-intensive landmark annotations to construct. We address these shortcomings by proposing an unsupervised method that leverages deep geometric features and functional correspondences to simultaneously learn local and global shape structures across population anatomies. Our pipeline significantly improves unsupervised correspondence estimation for SSMs compared to baseline methods, even on highly irregular surface topologies. We demonstrate this for two different anatomical structures: the thyroid and a multi-chamber heart dataset. Furthermore, our method is robust enough to learn from noisy neural network predictions, potentially enabling scaling SSMs to larger patient populations without manual segmentation annotation.

Submitted 24 July, 2023; v1 submitted 15 April, 2023; originally announced April 2023.

Comments: Accepted at MICCAI 2023. 13 pages, 6 figures

arXiv:2303.02328 [pdf, ps, other]

Decompose, Adjust, Compose: Effective Normalization by Playing with Frequency for Domain Generalization

Authors: Sangrok Lee, Jongseong Bae, Ha Young Kim

Abstract: Domain generalization (DG) is a principal task to evaluate the robustness of computer vision models. Many previous studies have used normalization for DG. In normalization, statistics and normalized features are regarded as style and content, respectively. However, it has a content variation problem when removing style because the boundary between content and style is unclear. This study addresses this problem from the frequency domain perspective, where amplitude and phase are considered as style and content, respectively. First, we verify the quantitative phase variation of normalization through the mathematical derivation of the Fourier transform formula. Then, based on this, we propose a novel normalization method, PCNorm, which eliminates only style while preserving content through spectral decomposition. Furthermore, we propose advanced PCNorm variants, CCNorm and SCNorm, which adjust the degrees of variations in content and style, respectively. Thus, they can learn domain-agnostic representations for DG. With the normalization methods, we propose ResNet-variant models, DAC-P and DAC-SC, which are robust to the domain gap. The proposed models outperform other recent DG methods. The DAC-SC achieves an average state-of-the-art performance of 65.6% on five datasets: PACS, VLCS, Office-Home, DomainNet, and TerraIncognita.

Submitted 15 March, 2023; v1 submitted 4 March, 2023; originally announced March 2023.

Comments: 10 pages, 6 figures, Conference on Computer Vision and Pattern Recognition 2023

arXiv:2111.03664 [pdf, other]

doi 10.1109/TASLP.2023.3297955

Oracle Teacher: Leveraging Target Information for Better Knowledge Distillation of CTC Models

Authors: Ji Won Yoon, Hyung Yong Kim, Hyeonseung Lee, Sunghwan Ahn, Nam Soo Kim

Abstract: Knowledge distillation (KD), best known as an effective method for model compression, aims at transferring the knowledge of a bigger network (teacher) to a much smaller network (student). Conventional KD methods usually employ the teacher model trained in a supervised manner, where output labels are treated only as targets. Extending this supervised scheme further, we introduce a new type of teacher model for connectionist temporal classification (CTC)-based sequence models, namely Oracle Teacher, that leverages both the source inputs and the output labels as the teacher model's input. Since the Oracle Teacher learns a more accurate CTC alignment by referring to the target information, it can provide the student with more optimal guidance. One potential risk for the proposed approach is a trivial solution that the model's output directly copies the target input. Based on a many-to-one mapping property of the CTC algorithm, we present a training strategy that can effectively prevent the trivial solution and thus enables utilizing both source and target inputs for model training. Extensive experiments are conducted on two sequence learning tasks: speech recognition and scene text recognition. From the experimental results, we empirically show that the proposed model improves the students across these tasks while achieving a considerable speed-up in the teacher model's training time.

Submitted 11 August, 2023; v1 submitted 5 November, 2021; originally announced November 2021.

Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing
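
The abstract above relies on the many-to-one mapping property of CTC: many frame-level paths collapse to the same label sequence. The following is a minimal sketch of that standard collapse rule only (merge repeated symbols, then drop blanks), not of the Oracle Teacher training strategy itself. The blank symbol `-` and the function name `ctc_collapse` are assumptions chosen for illustration.

```python
from itertools import groupby

BLANK = "-"  # hypothetical blank symbol chosen for this illustration


def ctc_collapse(path):
    """Map a frame-level CTC path to its label sequence.

    Step 1: merge runs of identical consecutive symbols.
    Step 2: remove blank symbols.
    """
    merged = [symbol for symbol, _ in groupby(path)]
    return [symbol for symbol in merged if symbol != BLANK]


if __name__ == "__main__":
    # Three different frame-level paths that all collapse to the same labels.
    for path in ["cc-aa-t", "c-a-t--", "-ccaatt"]:
        print(path, "->", "".join(ctc_collapse(path)))
```

Because a blank separates genuine repeats, `a-a` collapses to `aa` while `aa` collapses to `a`; this is why distinct alignments can share one target sequence.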

arXiv:2110.01606 [pdf]

doi 10.1109/ACCESS.2022.3193250

Breast Cancer Diagnosis in Two-View Mammography Using End-to-End Trained EfficientNet-Based Convolutional Network

Authors: Daniel G. P. Petrini, Carlos Shimizu, Rosimeire A. Roela, Gabriel V. Valente, Maria A. A. K. Folgueira, Hae Yong Kim

Abstract: Some recent studies have described deep convolutional neural networks to diagnose breast cancer in mammograms with similar or even superior performance to that of human experts. One of the best techniques does two transfer learnings: the first uses a model trained on natural images to create a "patch classifier" that categorizes small subimages; the second uses the patch classifier to scan the whole mammogram and create the "single-view whole-image classifier". We propose to make a third transfer learning to obtain a "two-view classifier" to use the two mammographic views: bilateral craniocaudal and mediolateral oblique. We use EfficientNet as the basis of our model. We "end-to-end" train the entire system using CBIS-DDSM dataset. To ensure statistical robustness, we test our system twice using: (a) 5-fold cross validation; and (b) the original training/test division of the dataset. Our technique reached an AUC of 0.9344 using 5-fold cross validation (accuracy, sensitivity and specificity are 85.13% at the equal error rate point of ROC). Using the original dataset division, our technique achieved an AUC of 0.8483, as far as we know the highest reported AUC for this problem, although the subtle differences in the testing conditions of each work do not allow for an accurate comparison. The inference code and model are available at https://github.com/dpetrini/two-views-classifier

Submitted 3 August, 2022; v1 submitted 1 October, 2021; originally announced October 2021.

Comments: Updated to published version in IEEE Access

Journal ref: IEEE Access, vol. 10, pp. 77723-77731, 2022
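
The abstract above reports an AUC together with accuracy, sensitivity and specificity at the equal error rate point of the ROC curve. As a rough illustration of how those quantities relate, the sketch below sweeps a threshold over classifier scores, integrates the ROC curve with the trapezoidal rule, and picks the threshold where sensitivity and specificity are closest. The function names and the toy labels and scores are assumptions made up for this example; they are not data or code from the paper.

```python
import numpy as np


def roc_points(labels, scores):
    """Return (fpr, tpr) arrays obtained by sweeping a threshold over the scores."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    # Start above every score (nothing predicted positive), then lower the threshold.
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1]))
    tpr = np.array([(scores[labels] >= t).mean() for t in thresholds])
    fpr = np.array([(scores[~labels] >= t).mean() for t in thresholds])
    return fpr, tpr


def auc_and_eer_point(labels, scores):
    """Return (auc, sensitivity, specificity) with the latter two at the EER point."""
    fpr, tpr = roc_points(labels, scores)
    # Trapezoidal area under the ROC curve.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    specificity = 1.0 - fpr
    # Equal error rate point: sensitivity and specificity are (nearly) equal.
    i = int(np.argmin(np.abs(tpr - specificity)))
    return auc, float(tpr[i]), float(specificity[i])


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    y = [0, 0, 0, 0, 1, 1, 1, 1]
    s = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
    print(auc_and_eer_point(y, s))
```

When the two classes are balanced, as in the toy data, accuracy at the equal error rate point equals the shared sensitivity and specificity value, which is consistent with the abstract quoting a single percentage for all three.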

arXiv:2007.12903 [pdf, other]

doi 10.24963/ijcai.2020/518

Robust Front-End for Multi-Channel ASR using Flow-Based Density Estimation

Authors: Hyeongju Kim, Hyeonseung Lee, Woo Hyun Kang, Hyung Yong Kim, Nam Soo Kim

Abstract: For multi-channel speech recognition, speech enhancement techniques such as denoising or dereverberation are conventionally applied as a front-end processor. Deep learning-based front-ends using such techniques require aligned clean and noisy speech pairs which are generally obtained via data simulation. Recently, several joint optimization techniques have been proposed to train the front-end without parallel data within an end-to-end automatic speech recognition (ASR) scheme. However, the ASR objective is sub-optimal and insufficient for fully training the front-end, which still leaves room for improvement. In this paper, we propose a novel approach which incorporates flow-based density estimation for the robust front-end using non-parallel clean and noisy speech. Experimental results on the CHiME-4 dataset show that the proposed method outperforms the conventional techniques where the front-end is trained only with ASR objective.

Submitted 25 July, 2020; originally announced July 2020.

Comments: 7 pages, 3 figures

Journal ref: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020

arXiv:1810.10327 [pdf, other]

BshapeNet: Object Detection and Instance Segmentation with Bounding Shape Masks

Authors: Ba Rom Kang, Ha Young Kim

Abstract: Recent object detectors use four-coordinate bounding box (bbox) regression to predict object locations. Providing additional information indicating the object positions and coordinates will improve detection performance. Thus, we propose two types of masks: a bbox mask and a bounding shape (bshape) mask, to represent the object's bbox and boundary shape, respectively. For each of these types, we consider two variants: the Thick model and the Scored model, both of which have the same morphology but differ in ways to make their boundaries thicker. To evaluate the proposed masks, we design extended frameworks by adding a bshape mask (or a bbox mask) branch to a Faster R-CNN framework, and call this BshapeNet (or BboxNet). Further, we propose BshapeNet+, a network that combines a bshape mask branch with a Mask R-CNN to improve instance segmentation as well as detection. Among our proposed models, BshapeNet+ demonstrates the best performance in both tasks and achieves highly competitive results with state of the art (SOTA) models. Particularly, it improves the detection results over Faster R-CNN+RoIAlign (37.3% and 28.9%) with a detection AP of 42.4% and 32.3% on MS COCO test-dev and Cityscapes val, respectively. Furthermore, for small objects, it achieves 24.9% AP on COCO test-dev, a significant improvement over previous SOTA models. For instance segmentation, it is substantially superior to Mask R-CNN on both test datasets.

Submitted 31 July, 2019; v1 submitted 15 October, 2018; originally announced October 2018.

Comments: 10 pages, 6 figures
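
The abstract above describes a bbox mask that marks an object's bounding box with a thickened boundary. The sketch below shows one simple way such a target could be rasterized: a binary mask that is 1 on a box outline of a chosen thickness and 0 elsewhere. This is an assumption-laden illustration, not the paper's construction; the function name, the inclusive coordinate convention, and thickening the outline inward are all choices made here, and the Scored variant is not modeled.

```python
import numpy as np


def bbox_outline_mask(height, width, box, thickness=1):
    """Binary mask that is 1 on the outline of `box` and 0 elsewhere.

    box is (x0, y0, x1, y1) in inclusive pixel coordinates (an assumed
    convention). The outline is thickened inward by `thickness` pixels.
    """
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.uint8)
    # Fill the whole box first.
    mask[y0:y1 + 1, x0:x1 + 1] = 1
    # Then clear the interior, if any interior remains at this thickness.
    iy0, iy1 = y0 + thickness, y1 + 1 - thickness
    ix0, ix1 = x0 + thickness, x1 + 1 - thickness
    if iy0 < iy1 and ix0 < ix1:
        mask[iy0:iy1, ix0:ix1] = 0
    return mask


if __name__ == "__main__":
    print(bbox_outline_mask(6, 6, (1, 1, 4, 4), thickness=1))
```

A thicker outline gives a mask branch more positive pixels to learn from than a one-pixel border would, which is one plausible reason to thicken the boundary.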

Showing 1–19 of 19 results for author: Kim, H Y