-
WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks
Authors:
Atsuyuki Miyai,
Zaiying Zhao,
Kazuki Egashira,
Atsuki Sato,
Tatsumi Sunada,
Shota Onohara,
Hiromasa Yamanishi,
Mashiro Toyooka,
Kunato Nishina,
Ryoma Maeda,
Kiyoharu Aizawa,
Toshihiko Yamasaki
Abstract:
Powered by a large language model (LLM), a web browsing agent operates web browsers in a human-like manner and offers a highly transparent path toward automating a wide range of everyday tasks. As web agents become increasingly capable and demonstrate proficiency in general browsing tasks, a critical question emerges: Can they go beyond general browsing to robustly handle tasks that are tedious and complex, or chores that humans often avoid doing themselves? In this paper, we introduce WebChoreArena, a new fully reproducible benchmark comprising 532 carefully curated tasks designed to extend the scope of WebArena beyond general browsing to more labor-intensive and tedious tasks. WebChoreArena systematically integrates three key challenges: (i) Massive Memory tasks requiring accurate retrieval of large amounts of information in the observations, (ii) Calculation tasks demanding precise mathematical reasoning, and (iii) Long-Term Memory tasks necessitating long-term memory across multiple webpages. Built on top of the fully reproducible and widely adopted four WebArena simulation environments, WebChoreArena ensures strict reproducibility and enables fair, direct comparisons with the established WebArena benchmark, offering key insights into agent progress. Our experimental results demonstrate that as LLMs evolve, represented by GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro, significant improvements in performance are observed on WebChoreArena. These findings suggest that WebChoreArena is well-suited to measure the advancement of state-of-the-art LLMs with greater clarity. Nevertheless, the results also indicate that even with Gemini 2.5 Pro, there remains substantial room for improvement compared to WebArena, highlighting the increased challenges posed by WebChoreArena.
Submitted 2 June, 2025;
originally announced June 2025.
-
MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding
Authors:
Jeonghun Baek,
Kazuki Egashira,
Shota Onohara,
Atsuyuki Miyai,
Yuki Imajuku,
Hikaru Ikuta,
Kiyoharu Aizawa
Abstract:
Manga, or Japanese comics, is a richly multimodal narrative form that blends images and text in complex ways. Teaching large multimodal models (LMMs) to understand such narratives at a human-like level could help manga creators reflect on and refine their stories. To this end, we introduce two benchmarks for multimodal manga understanding: MangaOCR, which targets in-page text recognition, and MangaVQA, a novel benchmark designed to evaluate contextual understanding through visual question answering. MangaVQA consists of 526 high-quality, manually constructed question-answer pairs, enabling reliable evaluation across diverse narrative and visual scenarios. Building on these benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive experiments, including comparisons with proprietary models such as GPT-4o and Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model provide a comprehensive foundation for evaluating and advancing LMMs in the richly narrative domain of manga.
Submitted 26 May, 2025;
originally announced May 2025.
-
Prototype sub-wavelength structure anti-reflection coating on alumina filters for ground-based CMB telescopes
Authors:
Kosuke Aizawa,
Ryosuke Akizawa,
Scott Cray,
Shaul Hanany,
Shotaro Kawano,
Jürgen Koch,
Kuniaki Konishi,
Rex Lam,
Tomotake Matsumura,
Haruyuki Sakurai,
Ryota Takaku
Abstract:
We present designs and fabrication of sub-wavelength anti-reflection (AR) structures on alumina for infrared absorptive filters with passbands near 30, 125, and 250 GHz. These bands are widely used by ground-based instruments measuring the cosmic microwave background radiation. The designs are tuned to provide reflectance of 2% or less for fractional bandwidths between 51% and 72%, with each of the three primary bands containing two sub-bands. We make the sub-wavelength structures (SWS), which resemble a two-dimensional array of pyramids, using laser ablation. We measure the shapes of the fabricated pyramids and show that for incidence angles up to 20 degrees the predicted in-band average reflectance is 2% or less, in agreement with the design. The band average instrumental polarization is less than $3\times 10^{-3}$.
Submitted 16 May, 2025;
originally announced May 2025.
-
World Food Atlas Project
Authors:
Ali Rostami,
Z Xie,
A Ishino,
Y Yamakata,
K Aizawa,
Ramesh Jain
Abstract:
A coronavirus pandemic is forcing people to be "at home" all over the world. In a life of hardly ever going out, we have realized how the food we eat affects our bodies. What can we do to know our food better and control it better? To give us a clue, we are trying to build a World Food Atlas (WFA) that collects all the knowledge about food in the world. In this paper, we present two of our trials. The first is the Food Knowledge Graph (FKG), which is a graphical representation of knowledge about food and ingredient relationships derived from recipes and food nutrition data. The second is FoodLog Athl and RecipeLog, which are applications for collecting people's detailed records of their food habits. We also discuss several problems that we try to solve to build the WFA by integrating these two ideas.
Submitted 25 April, 2025;
originally announced April 2025.
-
Experimental Studies on Spatial Resolution of a Delay-Line Current-Biased Kinetic-Inductance Detector
Authors:
The Dang Vu,
Hiroaki Shishido,
Kazuya Aizawa,
Takayuki Oku,
Kenichi Oikawa,
Masahide Harada,
Kenji M. Kojima,
Shigeyuki Miyajima,
Kazuhiko Soyama,
Tomio Koyama,
Mutsuo Hidaka,
Soh Y. Suzuki,
Manobu M. Tanaka,
Masahiko Machida,
Shuichi Kawamata,
Takekazu Ishida
Abstract:
A current-biased kinetic inductance detector (CB-KID) is a novel superconducting detector to construct a neutron transmission imaging system. The characteristics of a superconducting neutron detector have been systematically studied to improve spatial resolution of our CB-KID neutron detector. In this study, we investigated the distribution of spatial resolutions under different operating conditions and examined the homogeneity of spatial resolutions in the detector in detail. We used a commercial standard Gd Siemens-star pattern as a conventional method to estimate the spatial resolution, and a lab-made 10B-dot array intended to examine detailed profiles on a distribution of spatial resolutions. We found that discrepancy in propagation velocities in the detector affected the uniformity of the spatial resolutions in neutron imaging. We analyzed the ellipsoidal line profiles along the circumferences of several different test circles in the Siemens-star image to find a distribution of spatial resolutions. Note that we succeeded in controlling the detector temperature precisely enough to realize stable propagation velocities of the signals in the detector to achieve the best spatial resolution with a delay-line CB-KID technique.
Submitted 16 April, 2025;
originally announced April 2025.
-
Harnessing PDF Data for Improving Japanese Large Multimodal Models
Authors:
Jeonghun Baek,
Akiko Aizawa,
Kiyoharu Aizawa
Abstract:
Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited due to the lack of high-quality training data. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. To address this, we explore the potential of Japanese PDF data as a training resource, an area that remains largely underutilized. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs through layout analysis, OCR, and vision-language pairing, removing the need for manual annotation. Additionally, we construct instruction data from extracted image-text pairs to enrich the training data. To evaluate the effectiveness of PDF-derived data, we train Japanese LMMs and assess their performance on the Japanese LMM Benchmark. Our results demonstrate substantial improvements, with performance gains ranging from 2.1% to 13.8% on Heron-Bench. Further analysis highlights the impact of PDF-derived data on various factors, such as model size and language models, reinforcing its value as a multimodal resource for Japanese LMMs.
Submitted 31 May, 2025; v1 submitted 20 February, 2025;
originally announced February 2025.
-
A Benchmark and Evaluation for Real-World Out-of-Distribution Detection Using Vision-Language Models
Authors:
Shiho Noda,
Atsuyuki Miyai,
Qing Yu,
Go Irie,
Kiyoharu Aizawa
Abstract:
Out-of-distribution (OOD) detection is a task that detects OOD samples during inference to ensure the safety of deployed models. However, conventional benchmarks have reached performance saturation, making it difficult to compare recent OOD detection methods. To address this challenge, we introduce three novel OOD detection benchmarks that enable a deeper understanding of method characteristics and reflect real-world conditions. First, we present ImageNet-X, designed to evaluate performance under challenging semantic shifts. Second, we propose ImageNet-FS-X for full-spectrum OOD detection, assessing robustness to covariate shifts (feature distribution shifts). Finally, we propose Wilds-FS-X, which extends these evaluations to real-world datasets, offering a more comprehensive testbed. Our experiments reveal that recent CLIP-based OOD detection methods struggle to varying degrees across the three proposed benchmarks, and none of them consistently outperforms the others. We hope the community goes beyond specific benchmarks and includes more challenging conditions reflecting real-world scenarios. The code is available at https://github.com/hoshi23/OOD-X-Benchmarks.
Submitted 29 May, 2025; v1 submitted 30 January, 2025;
originally announced January 2025.
-
JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation
Authors:
Shota Onohara,
Atsuyuki Miyai,
Yuki Imajuku,
Kazuki Egashira,
Jeonghun Baek,
Xiang Yue,
Graham Neubig,
Kiyoharu Aizawa
Abstract:
Accelerating research on Large Multimodal Models (LMMs) in non-English languages is crucial for enhancing user experiences across broader populations. In this paper, we introduce JMMMU (Japanese MMMU), the first large-scale Japanese benchmark designed to evaluate LMMs on expert-level tasks based on the Japanese cultural context. To facilitate comprehensive culture-aware evaluation, JMMMU features two complementary subsets: (i) a culture-agnostic (CA) subset, where the culture-independent subjects (e.g., Math) are selected and translated into Japanese, enabling one-to-one comparison with its English counterpart MMMU; and (ii) a culture-specific (CS) subset, comprising newly crafted subjects that reflect the Japanese cultural context. Using the CA subset, we observe a performance drop in many LMMs when evaluated in Japanese, which is purely attributable to language variation. Using the CS subset, we reveal their inadequate Japanese cultural understanding. Further, by combining both subsets, we identify that some LMMs perform well on the CA subset but not on the CS subset, exposing a shallow understanding of the Japanese language that lacks depth in cultural understanding. We hope this work will not only help advance LMM performance in Japanese but also serve as a guideline to create high-standard, culturally diverse benchmarks for multilingual LMM development. The project page is https://mmmu-japanese-benchmark.github.io/JMMMU/.
Submitted 19 March, 2025; v1 submitted 22 October, 2024;
originally announced October 2024.
-
FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation
Authors:
Yuki Imajuku,
Yoko Yamakata,
Kiyoharu Aizawa
Abstract:
Research on food image understanding using recipe data has been a long-standing focus due to the diversity and complexity of the data. Moreover, food is inextricably linked to people's lives, making it a vital research area for practical applications such as dietary management. Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities, not only in their vast knowledge but also in their ability to handle languages naturally. While English is predominantly used, they can also support multiple languages including Japanese. This suggests that MLLMs are expected to significantly improve performance in food image understanding tasks. We fine-tuned the open MLLMs LLaVA-1.5 and Phi-3 Vision on a Japanese recipe dataset and benchmarked their performance against the closed model GPT-4o. We then evaluated the content of generated recipes, including ingredients and cooking procedures, using 5,000 evaluation samples that comprehensively cover Japanese food culture. Our evaluation demonstrates that the open models trained on recipe data outperform GPT-4o, the current state-of-the-art model, in ingredient generation. Our model achieved an F1 score of 0.531, surpassing GPT-4o's F1 score of 0.481, indicating a higher level of accuracy. Furthermore, our model exhibited comparable performance to GPT-4o in generating cooking procedure text.
Submitted 3 March, 2025; v1 submitted 27 September, 2024;
originally announced September 2024.
-
Training-Free Sketch-Guided Diffusion with Latent Optimization
Authors:
Sandra Zhang Ding,
Jiafeng Mao,
Kiyoharu Aizawa
Abstract:
Based on recent advanced diffusion models, Text-to-image (T2I) generation models have demonstrated their capabilities to generate diverse and high-quality images. However, leveraging their potential for real-world content creation, particularly in providing users with precise control over the image generation result, poses a significant challenge. In this paper, we propose an innovative training-free pipeline that extends existing text-to-image generation models to incorporate a sketch as an additional condition. To generate new images with a layout and structure closely resembling the input sketch, we find that these core features of a sketch can be tracked with the cross-attention maps of diffusion models. We introduce latent optimization, a method that refines the noisy latent at each intermediate step of the generation process using cross-attention maps to ensure that the generated images adhere closely to the desired structure outlined in the reference sketch. Through latent optimization, our method enhances the accuracy of image generation, offering users greater control and customization options in content creation.
Submitted 7 May, 2025; v1 submitted 30 August, 2024;
originally announced September 2024.
-
Investigating the Perception of Facial Anonymization Techniques in 360° Videos
Authors:
Leslie Wöhler,
Satoshi Ikehata,
Kiyoharu Aizawa
Abstract:
In this work, we investigate facial anonymization techniques in 360° videos and assess their influence on the perceived realism, anonymization effect, and presence of participants. In comparison to traditional footage, 360° videos can convey engaging, immersive experiences that accurately represent the atmosphere of real-world locations. As the entire environment is captured simultaneously, it is necessary to anonymize the faces of bystanders in recordings of public spaces. Since this alters the video content, the perceived realism and immersion could be reduced. To understand these effects, we compare non-anonymized and anonymized 360° videos using blurring, black boxes, and face-swapping shown either on a regular screen or in a head-mounted display (HMD).
Our results indicate significant differences in the perception of the anonymization techniques. We find that face-swapping is the most realistic and least disruptive; however, participants raised concerns regarding the effectiveness of the anonymization. Furthermore, we observe that presence is affected by facial anonymization in the HMD condition. Overall, the results underscore the need for facial anonymization techniques that balance both photo-realism and a sense of privacy.
Submitted 8 August, 2024;
originally announced August 2024.
-
Multi-dimensional optimisation of the scanning strategy for the LiteBIRD space mission
Authors:
Y. Takase,
L. Vacher,
H. Ishino,
G. Patanchon,
L. Montier,
S. L. Stever,
K. Ishizaka,
Y. Nagano,
W. Wang,
J. Aumont,
K. Aizawa,
A. Anand,
C. Baccigalupi,
M. Ballardini,
A. J. Banday,
R. B. Barreiro,
N. Bartolo,
S. Basak,
M. Bersanelli,
M. Bortolami,
T. Brinckmann,
E. Calabrese,
P. Campeti,
E. Carinos,
A. Carones
, et al. (83 additional authors not shown)
Abstract:
Large angular scale surveys in the absence of atmosphere are essential for measuring the primordial $B$-mode power spectrum of the Cosmic Microwave Background (CMB). Since this proposed measurement is about three to four orders of magnitude fainter than the temperature anisotropies of the CMB, in-flight calibration of the instruments and active suppression of systematic effects are crucial. We investigate the effect of changing the parameters of the scanning strategy on the in-flight calibration effectiveness, the suppression of the systematic effects themselves, and the ability to distinguish systematic effects by null-tests. Next-generation missions such as LiteBIRD, modulated by a Half-Wave Plate (HWP), will be able to observe polarisation using a single detector, eliminating the need to combine several detectors to measure polarisation, as done in many previous experiments and hence avoiding the consequent systematic effects. While the HWP is expected to suppress many systematic effects, some of them will remain. We use an analytical approach to comprehensively address the mitigation of these systematic effects and identify the characteristics of scanning strategies that are the most effective for implementing a variety of calibration strategies in the multi-dimensional space of common spacecraft scan parameters. We also present Falcons, a fast spacecraft scanning simulator that we developed to investigate this scanning parameter space.
Submitted 15 November, 2024; v1 submitted 6 August, 2024;
originally announced August 2024.
-
Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey
Authors:
Atsuyuki Miyai,
Jingkang Yang,
Jingyang Zhang,
Yifei Ming,
Yueqian Lin,
Qing Yu,
Go Irie,
Shafiq Joty,
Yixuan Li,
Hai Li,
Ziwei Liu,
Toshihiko Yamasaki,
Kiyoharu Aizawa
Abstract:
Detecting out-of-distribution (OOD) samples is crucial for ensuring the safety of machine learning systems and has shaped the field of OOD detection. Meanwhile, several other problems are closely related to OOD detection, including anomaly detection (AD), novelty detection (ND), open set recognition (OSR), and outlier detection (OD). To unify these problems, a generalized OOD detection framework was proposed, taxonomically categorizing these five problems. However, Vision Language Models (VLMs) such as CLIP have significantly changed the paradigm and blurred the boundaries between these fields, again confusing researchers. In this survey, we first present a generalized OOD detection v2, encapsulating the evolution of AD, ND, OSR, OOD detection, and OD in the VLM era. Our framework reveals that, with some field inactivity and integration, the demanding challenges have become OOD detection and AD. In addition, we also highlight the significant shift in the definition, problem settings, and benchmarks; we thus feature a comprehensive review of the methodology for OOD detection, including the discussion over other related tasks to clarify their relationship to OOD detection. Finally, we explore the advancements in the emerging Large Vision Language Model (LVLM) era, such as GPT-4V. We conclude this survey with open challenges and future directions.
Submitted 31 July, 2024;
originally announced July 2024.
-
MangaUB: A Manga Understanding Benchmark for Large Multimodal Models
Authors:
Hikaru Ikuta,
Leslie Wöhler,
Kiyoharu Aizawa
Abstract:
Manga is a popular medium that combines stylized drawings and text to convey stories. As manga panels differ from natural images, computational systems traditionally had to be designed specifically for manga. Recently, the adaptive nature of modern large multimodal models (LMMs) shows possibilities for more general approaches. To provide an analysis of the current capability of LMMs for manga understanding tasks and identifying areas for their improvement, we design and evaluate MangaUB, a novel manga understanding benchmark for LMMs. MangaUB is designed to assess the recognition and understanding of content shown in a single panel as well as conveyed across multiple panels, allowing for a fine-grained analysis of a model's various capabilities required for manga understanding. Our results show strong performance on the recognition of image content, while understanding the emotion and information conveyed across multiple panels is still challenging, highlighting future work towards LMMs for manga understanding.
Submitted 26 July, 2024;
originally announced July 2024.
-
LiteBIRD Science Goals and Forecasts. Mapping the Hot Gas in the Universe
Authors:
M. Remazeilles,
M. Douspis,
J. A. Rubiño-Martín,
A. J. Banday,
J. Chluba,
P. de Bernardis,
M. De Petris,
C. Hernández-Monteagudo,
G. Luzzi,
J. Macias-Perez,
S. Masi,
T. Namikawa,
L. Salvati,
H. Tanimura,
K. Aizawa,
A. Anand,
J. Aumont,
C. Baccigalupi,
M. Ballardini,
R. B. Barreiro,
N. Bartolo,
S. Basak,
M. Bersanelli,
D. Blinov,
M. Bortolami
, et al. (82 additional authors not shown)
Abstract:
We assess the capabilities of the LiteBIRD mission to map the hot gas distribution in the Universe through the thermal Sunyaev-Zeldovich (SZ) effect. Our analysis relies on comprehensive simulations incorporating various sources of Galactic and extragalactic foreground emission, while accounting for specific instrumental characteristics of LiteBIRD, such as detector sensitivities, frequency-dependent beam convolution, inhomogeneous sky scanning, and $1/f$ noise. We implement a tailored component-separation pipeline to map the thermal SZ Compton $y$-parameter over 98% of the sky. Despite lower angular resolution for galaxy cluster science, LiteBIRD provides full-sky coverage and, compared to the Planck satellite, enhanced sensitivity, as well as more frequency bands to enable the construction of an all-sky $y$-map, with reduced foreground contamination at large and intermediate angular scales. By combining LiteBIRD and Planck channels in the component-separation pipeline, we obtain an optimal $y$-map that leverages the advantages of both experiments, with the higher angular resolution of the Planck channels enabling the recovery of compact clusters beyond the LiteBIRD beam limitations, and the numerous sensitive LiteBIRD channels further mitigating foregrounds. The added value of LiteBIRD is highlighted through the examination of maps, power spectra, and one-point statistics of the various sky components. After component separation, the $1/f$ noise from LiteBIRD is effectively mitigated below the thermal SZ signal at all multipoles. Cosmological constraints on $S_8=\sigma_8\left(\Omega_{\rm m}/0.3\right)^{0.5}$ obtained from the LiteBIRD-Planck combined $y$-map power spectrum exhibit a 15% reduction in uncertainty compared to constraints from Planck alone. This improvement can be attributed to the increased portion of uncontaminated sky available in the LiteBIRD-Planck combined $y$-map.
Submitted 23 October, 2024; v1 submitted 24 July, 2024;
originally announced July 2024.
-
The LiteBIRD mission to explore cosmic inflation
Authors:
T. Ghigna,
A. Adler,
K. Aizawa,
H. Akamatsu,
R. Akizawa,
E. Allys,
A. Anand,
J. Aumont,
J. Austermann,
S. Azzoni,
C. Baccigalupi,
M. Ballardini,
A. J. Banday,
R. B. Barreiro,
N. Bartolo,
S. Basak,
A. Basyrov,
S. Beckman,
M. Bersanelli,
M. Bortolami,
F. Bouchet,
T. Brinckmann,
P. Campeti,
E. Carinos,
A. Carones
, et al. (134 additional authors not shown)
Abstract:
LiteBIRD, the next-generation cosmic microwave background (CMB) experiment, aims for a launch in Japan's fiscal year 2032, marking a major advancement in the exploration of primordial cosmology and fundamental physics. Orbiting the Sun-Earth Lagrangian point L2, this JAXA-led strategic L-class mission will conduct a comprehensive mapping of the CMB polarization across the entire sky. During its 3-year mission, LiteBIRD will employ three telescopes within 15 unique frequency bands (ranging from 34 through 448 GHz), targeting a sensitivity of 2.2\,$μ$K-arcmin and a resolution of 0.5$^\circ$ at 100\,GHz. Its primary goal is to measure the tensor-to-scalar ratio $r$ with an uncertainty $δr = 0.001$, including systematic errors and margin. If $r \geq 0.01$, LiteBIRD expects to achieve a $>5σ$ detection in the $\ell=$2-10 and $\ell=$11-200 ranges separately, providing crucial insight into the early Universe. We describe LiteBIRD's scientific objectives, the application of systems engineering to mission requirements, the anticipated scientific impact, and the operations and scanning strategies vital to minimizing systematic effects. We will also highlight LiteBIRD's synergies with concurrent CMB projects.
Submitted 4 June, 2024;
originally announced June 2024.
-
Privacy Protection and Video Manipulation in Immersive Media
Authors:
Leslie Wöhler,
Satoshi Ikehata,
Kiyoharu Aizawa
Abstract:
In comparison to traditional footage, 360° videos can convey engaging, immersive experiences and can even be used to create interactive virtual environments. As with regular recordings, these videos must respect the privacy of recorded people and could be targets of video manipulation. However, due to properties such as enhanced presence, their effects on users might differ from those of traditional, non-immersive content. Therefore, we are interested in how changes to real-world footage, such as adding privacy protection or applying video manipulations, could mitigate or introduce harm in the resulting immersive media.
Submitted 23 April, 2024;
originally announced May 2024.
-
Zero-Shot Character Identification and Speaker Prediction in Comics via Iterative Multimodal Fusion
Authors:
Yingxuan Li,
Ryota Hinami,
Kiyoharu Aizawa,
Yusuke Matsui
Abstract:
Recognizing characters and predicting speakers of dialogue are critical for comic processing tasks, such as voice generation or translation. However, because characters vary by comic title, supervised learning approaches such as training character classifiers, which require specific annotations for each comic title, are infeasible. This motivates us to propose a novel zero-shot approach, allowing machines to identify characters and predict speaker names based solely on unannotated comic images. In spite of their importance in real-world applications, these tasks have largely remained unexplored due to challenges in story comprehension and multimodal integration. Recent large language models (LLMs) have shown great capability for text understanding and reasoning, while their application to multimodal content analysis is still an open problem. To address this problem, we propose an iterative multimodal framework, the first to employ multimodal information for both character identification and speaker prediction tasks. Our experiments demonstrate the effectiveness of the proposed framework, establishing a robust baseline for these tasks. Furthermore, since our method requires no training data or annotations, it can be used as-is on any comic series.
Submitted 4 September, 2024; v1 submitted 22 April, 2024;
originally announced April 2024.
-
Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models
Authors:
Atsuyuki Miyai,
Jingkang Yang,
Jingyang Zhang,
Yifei Ming,
Qing Yu,
Go Irie,
Yixuan Li,
Hai Li,
Ziwei Liu,
Kiyoharu Aizawa
Abstract:
This paper introduces a novel task to evaluate the robust understanding capability of Large Multimodal Models (LMMs), termed Unsolvable Problem Detection (UPD). Multiple-choice question answering (MCQA) is widely used to assess the understanding capability of LMMs, but it does not guarantee that LMMs truly comprehend the answer. UPD assesses the LMM's ability to withhold answers when encountering unsolvable problems of MCQA, verifying whether the model truly understands the answer. UPD encompasses three problems: Absent Answer Detection (AAD), Incompatible Answer Set Detection (IASD), and Incompatible Visual Question Detection (IVQD), covering unsolvable cases like answer-lacking or incompatible choices and image-question mismatches. For the evaluation, we introduce the MM-UPD Bench, a benchmark for assessing performance across various ability dimensions. Our experiments reveal that most LMMs, even those that demonstrate adequate performance on existing benchmarks, struggle significantly with MM-UPD, underscoring a novel aspect of trustworthiness that current benchmarks have overlooked. A detailed analysis shows that LMMs have different bottlenecks, and that chain-of-thought and self-reflection improve performance for LMMs whose bottleneck lies in their LLM capability. We hope our insights will enhance the broader understanding and development of more reliable LMMs. The code is available at https://github.com/AtsuMiyai/UPD.
Submitted 9 June, 2025; v1 submitted 29 March, 2024;
originally announced March 2024.
-
Entity-NeRF: Detecting and Removing Moving Entities in Urban Scenes
Authors:
Takashi Otonari,
Satoshi Ikehata,
Kiyoharu Aizawa
Abstract:
Recent advancements in the study of Neural Radiance Fields (NeRF) for dynamic scenes often involve explicit modeling of scene dynamics. However, this approach faces challenges in modeling scene dynamics in urban environments, where moving objects of various categories and scales are present. In such settings, it becomes crucial to effectively eliminate moving objects to accurately reconstruct static backgrounds. Our research introduces an innovative method, termed here as Entity-NeRF, which combines the strengths of knowledge-based and statistical strategies. This approach utilizes entity-wise statistics, leveraging entity segmentation and stationary entity classification through thing/stuff segmentation. To assess our methodology, we created an urban scene dataset masked with moving objects. Our comprehensive experiments demonstrate that Entity-NeRF notably outperforms existing techniques in removing moving objects and reconstructing static urban backgrounds, both quantitatively and qualitatively.
Submitted 24 March, 2024;
originally announced March 2024.
-
Cross-Lingual Learning in Multilingual Scene Text Recognition
Authors:
Jeonghun Baek,
Yusuke Matsui,
Kiyoharu Aizawa
Abstract:
In this paper, we investigate cross-lingual learning (CLL) for multilingual scene text recognition (STR). CLL transfers knowledge from one language to another. We aim to find the condition under which knowledge from high-resource languages can be exploited to improve performance in low-resource languages. To do so, we first examine whether two general insights about CLL discussed in previous works apply to multilingual STR: (1) Joint learning with high- and low-resource languages may reduce performance on low-resource languages, and (2) CLL works best between typologically similar languages. Through extensive experiments, we show that these two general insights may not apply to multilingual STR. After that, we show that the crucial condition for CLL is the dataset size of the high-resource languages, regardless of which high-resource languages are used. Our code, data, and models are available at https://github.com/ku21fan/CLL-STR.
Submitted 17 December, 2023;
originally announced December 2023.
-
The Lottery Ticket Hypothesis in Denoising: Towards Semantic-Driven Initialization
Authors:
Jiafeng Mao,
Xueting Wang,
Kiyoharu Aizawa
Abstract:
Text-to-image diffusion models allow users control over the content of generated images. Still, text-to-image generation occasionally fails, requiring users to generate dozens of images under the same text prompt before they obtain a satisfying result. We formulate the lottery ticket hypothesis in denoising: randomly initialized Gaussian noise images contain special pixel blocks (winning tickets) that naturally tend to be denoised into specific content independently. The generation failure in standard text-to-image synthesis is caused by the gap between the optimal and actual spatial distribution of winning tickets in initial noisy images. To close this gap, we implement semantic-driven initial image construction, creating initial noise from known winning tickets for each concept mentioned in the prompt. We conduct a series of experiments that verify the properties of winning tickets and demonstrate their generalizability across images and prompts. Our results show that aggregating winning tickets into the initial noise image effectively induces the model to generate the specified object at the corresponding location. Project Page: https://ut-mao.github.io/noise.github.io
Submitted 8 October, 2024; v1 submitted 12 December, 2023;
originally announced December 2023.
-
Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation
Authors:
Daichi Horita,
Naoto Inoue,
Kotaro Kikuchi,
Kota Yamaguchi,
Kiyoharu Aizawa
Abstract:
Content-aware graphic layout generation aims to automatically arrange visual elements along with a given content, such as an e-commerce product image. In this paper, we argue that the current layout generation approaches suffer from the limited training data for the high-dimensional layout structure. We show that a simple retrieval augmentation can significantly improve the generation quality. Our model, which is named Retrieval-Augmented Layout Transformer (RALF), retrieves nearest neighbor layout examples based on an input image and feeds these results into an autoregressive generator. Our model can apply retrieval augmentation to various controllable generation tasks and yield high-quality layouts within a unified architecture. Our extensive experiments show that RALF successfully generates content-aware layouts in both constrained and unconstrained settings and significantly outperforms the baselines.
Submitted 15 April, 2024; v1 submitted 22 November, 2023;
originally announced November 2023.
-
Can Pre-trained Networks Detect Familiar Out-of-Distribution Data?
Authors:
Atsuyuki Miyai,
Qing Yu,
Go Irie,
Kiyoharu Aizawa
Abstract:
Out-of-distribution (OOD) detection is critical for safety-sensitive machine learning applications and has been extensively studied, yielding a plethora of methods in the literature. However, most studies on OOD detection did not use pre-trained models and instead trained a backbone from scratch. In recent years, transferring knowledge from large pre-trained models to downstream tasks by lightweight tuning has become mainstream for training in-distribution (ID) classifiers. In bridging the gap between the practice of OOD detection and current classifiers, a unique and crucial problem is that samples whose information the networks already know often arrive as OOD input. We consider that such data may significantly affect the performance of large pre-trained networks because the discriminability of these OOD data depends on the pre-training algorithm. Here, we define such OOD data as PT-OOD (Pre-Trained OOD) data. In this paper, we aim to reveal the effect of PT-OOD on the OOD detection performance of pre-trained networks from the perspective of pre-training algorithms. To achieve this, we explore the PT-OOD detection performance of supervised and self-supervised pre-training algorithms with linear-probing tuning, the most common efficient tuning method. Through our experiments and analysis, we find that the low linear separability of PT-OOD in the feature space heavily degrades the PT-OOD detection performance, and that self-supervised models are more vulnerable to PT-OOD than supervised pre-trained models, even with state-of-the-art detection methods. To address this vulnerability, we further propose a unique solution for large-scale pre-trained models: leveraging the powerful instance-by-instance discriminative representations of pre-trained models and detecting OOD in the feature space independent of the ID decision boundaries. The code will be available via https://github.com/AtsuMiyai/PT-OOD.
Submitted 12 October, 2023; v1 submitted 1 October, 2023;
originally announced October 2023.
-
Orientation mapping of YbSn$_3$ single crystals based on Bragg-dip analysis using a delay-line superconducting sensor
Authors:
Hiroaki Shishido,
The Dang Vu,
Kazuya Aizawa,
Kenji M. Kojima,
Tomio Koyama,
Kenichi Oikawa,
Masahide Harada,
Takayuki Oku,
Kazuhiko Soyama,
Shigeyuki Miyajima,
Mutsuo Hidaka,
Soh Y. Suzuki,
Manobu M. Tanaka,
Shuichi Kawamata,
Takekazu Ishida
Abstract:
Recent progress in high-power pulsed neutron sources has stimulated the development of the Bragg-dip and Bragg-edge analysis methods using a two-dimensional neutron detector with high temporal resolution to resolve the neutron energy by the time-of-flight method. The delay-line current-biased kinetic-inductance detector (CB-KID) is a two-dimensional superconducting sensor with a high temporal resolution and multi-hit capability. We demonstrate that the delay-line CB-KID with a $^{10}$B neutron conversion layer can be applied to high-spatial-resolution neutron transmission imaging and spectroscopy up to 100 eV. We observed dip structures in the transmission spectrum induced by Bragg diffraction and nuclear resonance absorption in YbSn$_3$ single crystals. We successfully drew the orientation mapping of YbSn$_3$ crystals based on the analysis of observed Bragg-dip positions in the transmission spectrum.
Submitted 9 August, 2023;
originally announced August 2023.
-
Open-Set Domain Adaptation with Visual-Language Foundation Models
Authors:
Qing Yu,
Go Irie,
Kiyoharu Aizawa
Abstract:
Unsupervised domain adaptation (UDA) has proven to be very effective in transferring knowledge obtained from a source domain with labeled data to a target domain with unlabeled data. Owing to the lack of labeled data in the target domain and the possible presence of unknown classes, open-set domain adaptation (ODA) has emerged as a potential solution to identify these classes during the training phase. Although existing ODA approaches aim to solve the distribution shifts between the source and target domains, most methods fine-tuned ImageNet pre-trained models on the source domain with the adaptation on the target domain. Recent visual-language foundation models (VLFM), such as Contrastive Language-Image Pre-Training (CLIP), are robust to many distribution shifts and, therefore, should substantially improve the performance of ODA. In this work, we explore generic ways to adopt CLIP, a popular VLFM, for ODA. We investigate the performance of zero-shot prediction using CLIP, and then propose an entropy optimization strategy to assist the ODA models with the outputs of CLIP. The proposed approach achieves state-of-the-art results on various benchmarks, demonstrating its effectiveness in addressing the ODA problem.
Submitted 30 July, 2023;
originally announced July 2023.
-
Manga109Dialog: A Large-scale Dialogue Dataset for Comics Speaker Detection
Authors:
Yingxuan Li,
Kiyoharu Aizawa,
Yusuke Matsui
Abstract:
The expanding market for e-comics has spurred interest in the development of automated methods to analyze comics. For further understanding of comics, an automated approach is needed to link text in comics to characters speaking the words. Comics speaker detection research has practical applications, such as automatic character assignment for audiobooks, automatic translation according to characters' personalities, and inference of character relationships and stories.
To deal with the problem of insufficient speaker-to-text annotations, we created a new annotation dataset Manga109Dialog based on Manga109. Manga109Dialog is the world's largest comics speaker annotation dataset, containing 132,692 speaker-to-text pairs. We further divided our dataset into different levels by prediction difficulties to evaluate speaker detection methods more appropriately. Unlike existing methods mainly based on distances, we propose a deep learning-based method using scene graph generation models. Due to the unique features of comics, we enhance the performance of our proposed model by considering the frame reading order. We conducted experiments using Manga109Dialog and other datasets. Experimental results demonstrate that our scene-graph-based approach outperforms existing methods, achieving a prediction accuracy of over 75%.
Submitted 22 April, 2024; v1 submitted 30 June, 2023;
originally announced June 2023.
-
LoCoOp: Few-Shot Out-of-Distribution Detection via Prompt Learning
Authors:
Atsuyuki Miyai,
Qing Yu,
Go Irie,
Kiyoharu Aizawa
Abstract:
We present a novel vision-language prompt learning approach for few-shot out-of-distribution (OOD) detection. Few-shot OOD detection aims to detect OOD images from classes that are unseen during training using only a few labeled in-distribution (ID) images. While prompt learning methods such as CoOp have shown effectiveness and efficiency in few-shot ID classification, they still face limitations in OOD detection due to the potential presence of ID-irrelevant information in text embeddings. To address this issue, we introduce a new approach called Local regularized Context Optimization (LoCoOp), which performs OOD regularization that utilizes portions of CLIP local features as OOD features during training. CLIP's local features contain many ID-irrelevant nuisances (e.g., backgrounds), and by learning to push them away from the ID class text embeddings, we can remove the nuisances in the ID class text embeddings and enhance the separation between ID and OOD. Experiments on the large-scale ImageNet OOD detection benchmarks demonstrate the superiority of our LoCoOp over zero-shot, fully supervised detection methods and prompt learning methods. Notably, even in a one-shot setting (just one label per class), LoCoOp outperforms existing zero-shot and fully supervised detection methods. The code will be available via https://github.com/AtsuMiyai/LoCoOp.
Submitted 25 October, 2023; v1 submitted 2 June, 2023;
originally announced June 2023.
-
Guided Image Synthesis via Initial Image Editing in Diffusion Model
Authors:
Jiafeng Mao,
Xueting Wang,
Kiyoharu Aizawa
Abstract:
Diffusion models can generate high-quality images by denoising pure Gaussian noise images. While previous research has primarily focused on improving the control of image generation through adjusting the denoising process, we propose a novel direction of manipulating the initial noise to control the generated image. Through experiments on Stable Diffusion, we show that blocks of pixels in the initial latent images have a preference for generating specific content, and that modifying these blocks can significantly influence the generated image. In particular, we show that modifying a part of the initial image affects the corresponding region of the generated image while leaving other regions unaffected, which is useful for repainting tasks. Furthermore, we find that the generation preferences of pixel blocks are primarily determined by their values, rather than their position. By moving pixel blocks with a tendency to generate user-desired content to user-specified regions, our approach achieves state-of-the-art performance in layout-to-image generation. Our results highlight the flexibility and power of initial image manipulation in controlling the generated image. Project Page: https://ut-mao.github.io/swap.github.io/
Submitted 8 October, 2024; v1 submitted 5 May, 2023;
originally announced May 2023.
-
GL-MCM: Global and Local Maximum Concept Matching for Zero-Shot Out-of-Distribution Detection
Authors:
Atsuyuki Miyai,
Qing Yu,
Go Irie,
Kiyoharu Aizawa
Abstract:
Zero-shot out-of-distribution (OOD) detection is a task that detects OOD images during inference with only in-distribution (ID) class names. Existing methods assume ID images contain a single, centered object, and do not consider the more realistic multi-object scenarios, where both ID and OOD objects are present. To meet the needs of many users, the detection method must have the flexibility to adapt to the type of ID images. To this end, we present Global-Local Maximum Concept Matching (GL-MCM), which incorporates local image scores as an auxiliary score to enhance the separability of global and local visual features. Due to the simple ensemble score function design, GL-MCM can control the type of ID images with a single weight parameter. Experiments on ImageNet and multi-object benchmarks demonstrate that GL-MCM outperforms baseline zero-shot methods and is comparable to fully supervised methods. Furthermore, GL-MCM offers strong flexibility in adjusting the target type of ID images. The code is available via https://github.com/AtsuMiyai/GL-MCM.
Submitted 21 January, 2025; v1 submitted 10 April, 2023;
originally announced April 2023.
-
Comprehensive Comparisons of Uniform Quantization in Deep Image Compression
Authors:
Koki Tsubota,
Kiyoharu Aizawa
Abstract:
In deep image compression, uniform quantization is applied to latent representations obtained by using an auto-encoder architecture for reducing bits and entropy coding. Quantization is a problem encountered in the end-to-end training of deep image compression. Quantization's gradient is zero, and it cannot backpropagate meaningful gradients. Many methods have been proposed to address the approximations of quantization to obtain gradients. However, there have not been equitable comparisons among them. In this study, we comprehensively compare the existing approximations of uniform quantization. Furthermore, we evaluate possible combinations of quantizers for the decoder and the entropy model, as the approximated quantizers can be different for them. We conduct experiments using three network architectures on two test datasets. The experimental results reveal that the best approximated quantization differs by the network architectures, and the best approximations of the three are different from the original ones used for the architectures. We also show that the combination of quantizers that uses universal quantization for the entropy model and differentiable soft quantization for the decoder is a comparatively good choice for different architectures and datasets.
Submitted 1 March, 2023;
originally announced March 2023.
-
Non-uniform Sampling Strategies for NeRF on 360° images
Authors:
Takashi Otonari,
Satoshi Ikehata,
Kiyoharu Aizawa
Abstract:
In recent years, the performance of novel view synthesis using perspective images has dramatically improved with the advent of neural radiance fields (NeRF). This study proposes two novel techniques that effectively build NeRF for 360° omnidirectional images. Because a 360° image in ERP format has spatial distortion in its high-latitude regions and a 360° wide viewing angle, NeRF's general ray sampling strategy is ineffective. Hence, the view synthesis accuracy of NeRF is limited and learning is not efficient. We propose two non-uniform ray sampling schemes for NeRF to suit 360° images: distortion-aware ray sampling and content-aware ray sampling. We created an evaluation dataset, Synth360, using Replica and SceneCity models of indoor and outdoor scenes, respectively. In experiments, we show that our proposal successfully builds 360° image NeRF in terms of both accuracy and efficiency. The proposal is widely applicable to advanced variants of NeRF. DietNeRF, AugNeRF, and NeRF++ combined with the proposed techniques further improve the performance. Moreover, we show that our proposed method enhances the quality of real-world scenes in 360° images. Synth360: https://drive.google.com/drive/folders/1suL9B7DO2no21ggiIHkH3JF3OecasQLb.
Submitted 7 December, 2022;
originally announced December 2022.
-
A Structure-Guided Diffusion Model for Large-Hole Image Completion
Authors:
Daichi Horita,
Jiaolong Yang,
Dong Chen,
Yuki Koyama,
Kiyoharu Aizawa,
Nicu Sebe
Abstract:
Image completion techniques have made significant progress in filling missing regions (i.e., holes) in images. However, large-hole completion remains challenging due to limited structural information. In this paper, we address this problem by integrating explicit structural guidance into diffusion-based image completion, forming our structure-guided diffusion model (SGDM). It consists of two cascaded diffusion probabilistic models: structure and texture generators. The structure generator generates an edge image representing plausible structures within the holes, which is then used for guiding the texture generation process. To train both generators jointly, we devise a novel strategy that leverages optimal Bayesian denoising, which denoises the output of the structure generator in a single step and thus allows backpropagation. Our diffusion-based approach enables a diversity of plausible completions, while the editable edges allow for editing parts of an image. Our experiments on natural scene (Places) and face (CelebA-HQ) datasets demonstrate that our method achieves a superior or comparable visual quality compared to state-of-the-art approaches. The code is available for research purposes at https://github.com/UdonDa/Structure_Guided_Diffusion_Model.
Submitted 6 September, 2023; v1 submitted 18 November, 2022;
originally announced November 2022.
-
Universal Deep Image Compression via Content-Adaptive Optimization with Adapters
Authors:
Koki Tsubota,
Hiroaki Akutsu,
Kiyoharu Aizawa
Abstract:
Deep image compression performs better than conventional codecs, such as JPEG, on natural images. However, deep image compression is learning-based and encounters a problem: the compression performance deteriorates significantly for out-of-domain images. In this study, we highlight this problem and address a novel task: universal deep image compression. This task aims to compress images belonging to arbitrary domains, such as natural images, line drawings, and comics. To address this problem, we propose a content-adaptive optimization framework; this framework uses a pre-trained compression model and adapts the model to a target image during compression. Adapters are inserted into the decoder of the model. For each input image, our framework optimizes the latent representation extracted by the encoder and the adapter parameters in terms of rate-distortion. The adapter parameters are additionally transmitted per image. For the experiments, a benchmark dataset containing uncompressed images of four domains (natural images, line drawings, comics, and vector arts) is constructed and the proposed universal deep compression is evaluated. Finally, the proposed model is compared with non-adaptive and existing adaptive compression models. The comparison reveals that the proposed model outperforms these. The code and dataset are publicly available at https://github.com/kktsubota/universal-dic.
Submitted 2 November, 2022;
originally announced November 2022.
-
Rethinking Rotation in Self-Supervised Contrastive Learning: Adaptive Positive or Negative Data Augmentation
Authors:
Atsuyuki Miyai,
Qing Yu,
Daiki Ikami,
Go Irie,
Kiyoharu Aizawa
Abstract:
Rotation is frequently listed as a candidate for data augmentation in contrastive learning but seldom provides satisfactory improvements. We argue that this is because the rotated image is always treated as either positive or negative. The semantics of an image can be rotation-invariant or rotation-variant, so whether the rotated image is treated as positive or negative should be determined based on the content of the image. Therefore, we propose a novel augmentation strategy, adaptive Positive or Negative Data Augmentation (PNDA), in which an original and its rotated image are a positive pair if they are semantically close and a negative pair if they are semantically different. To achieve PNDA, we first determine whether rotation is positive or negative on an image-by-image basis in an unsupervised way. Then, we apply PNDA to contrastive learning frameworks. Our experiments showed that PNDA improves the performance of contrastive learning. The code is available at https://github.com/AtsuMiyai/rethinking_rotation.
Submitted 24 November, 2022; v1 submitted 23 October, 2022;
originally announced October 2022.
-
Saliency-based Multiple Region of Interest Detection from a Single 360° image
Authors:
Yuuki Sawabe,
Satoshi Ikehata,
Kiyoharu Aizawa
Abstract:
360° images are informative: they contain omnidirectional visual information around the camera. However, the area covered by a 360° image is much larger than the human field of view, so important information in different view directions is easily overlooked. To tackle this issue, we propose a method for predicting the optimal set of Regions of Interest (RoIs) from a single 360° image using visual saliency as a clue. To deal with the scarce, strongly biased training data of existing single 360° image saliency prediction datasets, we also propose a data augmentation method based on spherical random data rotation. From the predicted saliency map and redundant candidate regions, we obtain the optimal set of RoIs considering both the saliency within a region and the Intersection-over-Union (IoU) between regions. We conduct a subjective evaluation to show that the proposed method can select regions that properly summarize the input 360° image.
Submitted 8 September, 2022;
originally announced September 2022.
-
Evaluating the Stability of Deep Image Quality Assessment With Respect to Image Scaling
Authors:
Koki Tsubota,
Hiroaki Akutsu,
Kiyoharu Aizawa
Abstract:
Image quality assessment (IQA) is a fundamental metric for image processing tasks (e.g., compression). With full-reference IQAs, traditional IQAs, such as PSNR and SSIM, have been used. Recently, IQAs based on deep neural networks (deep IQAs), such as LPIPS and DISTS, have also been used. It is known that image scaling is inconsistent among deep IQAs, as some perform down-scaling as pre-processing, whereas others instead use the original image size. In this paper, we show that the image scale is an influential factor that affects deep IQA performance. We comprehensively evaluate four deep IQAs on the same five datasets, and the experimental results show that image scale significantly influences IQA performance. We found that the most appropriate image scale is often neither the default nor the original size, and the choice differs depending on the methods and datasets used. We visualized the stability and found that PieAPP is the most stable among the four deep IQAs.
Submitted 20 July, 2022;
originally announced July 2022.
-
COO: Comic Onomatopoeia Dataset for Recognizing Arbitrary or Truncated Texts
Authors:
Jeonghun Baek,
Yusuke Matsui,
Kiyoharu Aizawa
Abstract:
Recognizing irregular texts has been a challenging topic in text recognition. To encourage research on this topic, we provide a novel comic onomatopoeia dataset (COO), which consists of onomatopoeia texts in Japanese comics. COO has many arbitrary texts, such as extremely curved, partially shrunk texts, or arbitrarily placed texts. Furthermore, some texts are separated into several parts. Each part is a truncated text and is not meaningful by itself. These parts should be linked to represent the intended meaning. Thus, we propose a novel task that predicts the link between truncated texts. We conduct three tasks to detect the onomatopoeia region and capture its intended meaning: text detection, text recognition, and link prediction. Through extensive experiments, we analyze the characteristics of the COO. Our data and code are available at https://github.com/ku21fan/COO-Comic-Onomatopoeia.
Submitted 11 July, 2022;
originally announced July 2022.
-
SVG Vector Font Generation for Chinese Characters with Transformer
Authors:
Haruka Aoki,
Kiyoharu Aizawa
Abstract:
Designing fonts for Chinese characters is highly labor-intensive and time-consuming. While the latest methods successfully generate the English alphabet vector font, despite the high demand for automatic font generation, Chinese vector font generation has been an unsolved problem owing to its complex shape and numerous characters. This study addressed the problem of automatically generating Chinese vector fonts from only a single style and content reference. We proposed a novel network architecture with Transformer and loss functions to capture structural features without differentiable rendering. Although the dataset range was still limited to the sans-serif family, we successfully generated the Chinese vector font for the first time using the proposed method.
Submitted 21 June, 2022;
originally announced June 2022.
-
Intersection Prediction from Single 360° Image via Deep Detection of Possible Direction of Travel
Authors:
Naoki Sugimoto,
Satoshi Ikehata,
Kiyoharu Aizawa
Abstract:
Movie-Map, an interactive first-person-view map that engages the user in a simulated walking experience, comprises short 360° video segments separated by traffic intersections that are seamlessly connected according to the viewer's direction of travel. However, in wide urban-scale areas with numerous intersecting roads, manual intersection segmentation requires significant human effort. Therefore, automatic identification of intersections from 360° videos is an important problem for scaling up Movie-Map. In this paper, we propose a novel method that identifies an intersection from individual frames in 360° videos. Instead of formulating the intersection identification as a standard binary classification task with a 360° image as input, we identify an intersection based on the number of possible directions of travel (PDoT) that a neural network detects in perspective images projected in eight directions from a single 360° image, which allows various types of intersections to be handled. We constructed a large-scale 360° Image Intersection Identification (iii360) dataset for training and evaluation, where 360° videos were collected from various areas such as a school campus, downtown, a suburb, and Chinatown, and demonstrate that our PDoT-based method achieves 88% accuracy, which is significantly better than that achieved by the direct naive binary-classification-based method. The source code and a partial dataset will be shared with the community after the paper is published.
Submitted 10 April, 2022;
originally announced April 2022.
-
Distortion-Aware Self-Supervised 360° Depth Estimation from A Single Equirectangular Projection Image
Authors:
Yuya Hasegawa,
Ikehata Satoshi,
Kiyoharu Aizawa
Abstract:
360° images have become widely available over the last few years. This paper proposes a new technique for depth prediction from a single 360° image in open environments. Depth prediction from a single 360° image is not easy for two reasons. One is the limitation of supervision datasets: the currently available dataset is limited to indoor scenes. The other is the problems caused by the equirectangular projection (ERP) format commonly used for 360° images, namely coordinates and distortion. Only one existing method deals with these problems; it uses cube map projection to produce six perspective images and applies self-supervised learning using motion pictures for perspective depth prediction. Unlike the existing method, we use the ERP format directly. We propose a framework that uses ERP directly, with coordinate conversion of correspondences and a distortion-aware upsampling module to deal with the ERP-related problems, and extend a self-supervised learning method to open environments. For the experiments, we first built a dataset for evaluation and quantitatively evaluated depth prediction in outdoor scenes. We show that our method outperforms the state-of-the-art technique.
Submitted 3 April, 2022;
originally announced April 2022.
-
Field-of-View IoU for Object Detection in 360° Images
Authors:
Miao Cao,
Satoshi Ikehata,
Kiyoharu Aizawa
Abstract:
360° cameras have gained popularity over the last few years. In this paper, we propose two fundamental techniques -- Field-of-View IoU (FoV-IoU) and 360Augmentation for object detection in 360° images. Although most object detection neural networks designed for the perspective images are applicable to 360° images in equirectangular projection (ERP) format, their performance deteriorates owing to the distortion in ERP images. Our method can be readily integrated with existing perspective object detectors and significantly improves the performance. The FoV-IoU computes the intersection-over-union of two Field-of-View bounding boxes in a spherical image which could be used for training, inference, and evaluation while 360Augmentation is a data augmentation technique specific to 360° object detection task which randomly rotates a spherical image and solves the bias due to the sphere-to-plane projection. We conduct extensive experiments on the 360indoor dataset with different types of perspective object detectors and show the consistent effectiveness of our method.
Submitted 22 September, 2022; v1 submitted 7 February, 2022;
originally announced February 2022.
-
Noisy Annotation Refinement for Object Detection
Authors:
Jiafeng Mao,
Qing Yu,
Yoko Yamakata,
Kiyoharu Aizawa
Abstract:
Supervised training of object detectors requires well-annotated large-scale datasets, whose production is costly. Therefore, some efforts have been made to obtain annotations in economical ways, such as crowdsourcing. However, datasets obtained by these methods tend to contain noisy annotations such as inaccurate bounding boxes and incorrect class labels. In this study, we propose a new problem setting of training object detectors on datasets with entangled noises of annotations of class labels and bounding boxes. Our proposed method efficiently decouples the entangled noises, corrects the noisy annotations, and subsequently trains the detector using the corrected annotations. We verified the effectiveness of our proposed method and compared it with the baseline on noisy datasets with different noise levels. The experimental results show that our proposed method significantly outperforms the baseline.
Submitted 7 December, 2021; v1 submitted 20 October, 2021;
originally announced October 2021.
-
High Spatial Resolution Neutron Transmission Imaging Using a Superconducting Two-Dimensional Detector
Authors:
Hiroaki Shishido,
Kazuma Nishimura,
The Dang Vu,
Kazuya Aizawa,
Kenji M. Kojima,
Tomio Koyama,
Kenichi Oikawa,
Masahide Harada,
Takayuki Oku,
Kazuhiko Soyama,
Shigeyuki Miyajima,
Mutsuo Hidaka,
Soh Y. Suzuki,
Manobu M. Tanaka,
Shuichi Kawamata,
Takekazu Ishida
Abstract:
Neutron imaging is one of the most powerful tools for nondestructive inspection owing to the unique characteristics of neutron beams, such as high permeability for many heavy metals, high sensitivity for certain light elements, and isotope selectivity owing to a specific nuclear reaction between an isotope and neutrons. In this study, we employed a superconducting detector, the current-biased kinetic-inductance detector (CB-KID), for neutron imaging using a pulsed neutron source. We employed the delay-line method, and high spatial resolution imaging with only four reading channels was achieved. We also performed wavelength-resolved neutron imaging by the time-of-flight method for the pulsed neutron source. We obtained the neutron transmission images of a Gd-Al alloy sample, inside which single crystals of GdAl3 were grown, using the delay-line CB-KID. Single crystals were well imaged, in both shapes and distributions, throughout the Gd-Al alloy. We identified Gd nuclei via neutron transmissions that exhibited characteristic suppression above the neutron wavelength of 0.03 nm. In addition, the ^{155}Gd resonance dip, a dip structure of the transmission caused by the nuclear reaction between an isotope and neutrons, was observed even when the number of events was summed over a limited area of 15 × 12 μm^2. Gd selective imaging was performed using the resonance dip of ^{155}Gd, and it showed clear Gd distribution even with a limited neutron wavelength range of 1 pm.
Submitted 8 October, 2021;
originally announced October 2021.
-
Practical tests of neutron transmission imaging with a superconducting kinetic-inductance sensor
Authors:
The Dang Vu,
Hiroaki Shishido,
Kazuya Aizawa,
Kenji M. Kojima,
Tomio Koyama,
Kenichi Oikawa,
Masahide Harada,
Takayuki Oku,
Kazuhiko Soyama,
Shigeyuki Miyajima,
Mutsuo Hidaka,
Soh Y. Suzuki,
Manobu M. Tanakai,
Alex Malins,
Masahiko Machida,
Shuichi Kawamata,
Takekazu Ishida
Abstract:
Samples were examined using a superconducting (Nb) neutron imaging system employing a delay-line technique, which in previous studies was shown to have high spatial resolution. We found excellent correspondence between neutron transmission and scanning electron microscope (SEM) images of Gd islands with sizes between 15 and 130 micrometers, which were thermally sprayed onto a Si substrate. Neutron transmission images could be used to identify tiny voids in a thermally sprayed continuous Gd2O3 film on a Si substrate, which could not be seen in SEM images. We also found that neutron transmission images revealed pattern formations, mosaic features and co-existing dendritic phases in Wood's metal samples with constituent elements Bi, Pb, Sn and Cd. These results demonstrate the merits of the current-biased kinetic inductance detector (CB-KID) system for practical studies in materials science. Moreover, we found that operating the detector at a more optimal temperature (7.9 K) appreciably improved the effective detection efficiency when compared to previous studies conducted at 4 K. This is because the effective size of hot-spots in the superconducting meanderline planes increases with temperature, which makes particle detections more likely.
Submitted 8 May, 2021;
originally announced May 2021.
-
NTIRE 2021 Challenge on Perceptual Image Quality Assessment
Authors:
Jinjin Gu,
Haoming Cai,
Chao Dong,
Jimmy S. Ren,
Yu Qiao,
Shuhang Gu,
Radu Timofte,
Manri Cheon,
Sungjun Yoon,
Byungyeon Kang,
Junwoo Lee,
Qing Zhang,
Haiyang Guo,
Yi Bin,
Yuqing Hou,
Hengliang Luo,
Jingyu Guo,
Zirui Wang,
Hai Wang,
Wenming Yang,
Qingyan Bai,
Shuwei Shi,
Weihao Xia,
Mingdeng Cao,
Jiahao Wang
, et al. (25 additional authors not shown)
Abstract:
This paper reports on the NTIRE 2021 challenge on perceptual image quality assessment (IQA), held in conjunction with the New Trends in Image Restoration and Enhancement (NTIRE) workshop at CVPR 2021. As a new type of image processing technology, perceptual image processing algorithms based on Generative Adversarial Networks (GAN) have produced images with more realistic textures. These output images have completely different characteristics from traditional distortions and thus pose a new challenge for IQA methods to evaluate their visual quality. In comparison with previous IQA challenges, the training and testing datasets in this challenge include the outputs of perceptual image processing algorithms and the corresponding subjective scores. Thus, they can be used to develop and evaluate IQA methods on GAN-based distortions. The challenge has 270 registered participants in total. In the final testing stage, 13 participating teams submitted their models and fact sheets. Almost all of them achieved much better results than existing IQA methods, while the winning method demonstrates state-of-the-art performance.
Submitted 28 June, 2021; v1 submitted 7 May, 2021;
originally announced May 2021.
-
A Novel Perspective for Positive-Unlabeled Learning via Noisy Labels
Authors:
Daiki Tanaka,
Daiki Ikami,
Kiyoharu Aizawa
Abstract:
Positive-unlabeled learning refers to the process of training a binary classifier using only positive and unlabeled data. Although unlabeled data can contain positive data, all unlabeled data are regarded as negative data in existing positive-unlabeled learning methods, which results in diminished performance. We provide a new perspective on this problem: we consider unlabeled data as noisy-labeled data and introduce a new formulation of PU learning as a problem of joint optimization of noisy-labeled data. This research presents a methodology that assigns initial pseudo-labels to unlabeled data, which are then used as noisy-labeled data, and trains a deep neural network using the noisy-labeled data. Experimental results demonstrate that the proposed method significantly outperforms the state-of-the-art methods on several benchmark datasets.
Submitted 8 March, 2021;
originally announced March 2021.
-
What If We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels
Authors:
Jeonghun Baek,
Yusuke Matsui,
Kiyoharu Aizawa
Abstract:
The scene text recognition (STR) task has a common practice: all state-of-the-art STR models are trained on large synthetic data. In contrast to this practice, training STR models only on fewer real labels (STR with fewer labels) is important when we have to train STR models without synthetic data: for handwritten or artistic texts that are difficult to generate synthetically and for languages other than English for which we do not always have synthetic data. However, there has been implicit common knowledge that training STR models on real data is nearly impossible because real data is insufficient. We consider that this common knowledge has obstructed the study of STR with fewer labels. In this work, we would like to reactivate STR with fewer labels by disproving the common knowledge. We consolidate recently accumulated public real data and show that we can train STR models satisfactorily only with real labeled data. Subsequently, we find simple data augmentation to fully exploit real data. Furthermore, we improve the models by collecting unlabeled data and introducing semi- and self-supervised methods. As a result, we obtain a model competitive with state-of-the-art methods. To the best of our knowledge, this is the first study that 1) shows sufficient performance by only using real labels and 2) introduces semi- and self-supervised methods into STR with fewer labels. Our code and data are available at https://github.com/ku21fan/STR-Fewer-Labels
Submitted 5 June, 2021; v1 submitted 7 March, 2021;
originally announced March 2021.
-
Building Movie Map -- A Tool for Exploring Areas in a City -- and its Evaluation
Authors:
Naoki Sugimoto,
Yoshihito Ebine,
Kiyoharu Aizawa
Abstract:
We propose a new Movie Map system, with an interface for exploring cities. The system consists of four stages: acquisition, analysis, management, and interaction. In the acquisition stage, omnidirectional videos are taken along streets in target areas. Frames of the video are localized on the map, intersections are detected, and videos are segmented. Turning views at intersections are subsequently generated. By connecting the video segments following the specified movement in an area, we can view the streets better. The interface allows for easy exploration of a target area, and it can show virtual billboards of stores in the view. We conducted user studies to compare our system to Google Street View (GSV) in a scenario where users could freely move and explore to find a landmark. The experiment showed that our system had a better user experience than GSV.
Submitted 17 November, 2020;
originally announced November 2020.
-
Few-Shot Font Generation with Deep Metric Learning
Authors:
Haruka Aoki,
Koki Tsubota,
Hikaru Ikuta,
Kiyoharu Aizawa
Abstract:
Designing fonts for languages with a large number of characters, such as Japanese and Chinese, is an extremely labor-intensive and time-consuming task. In this study, we addressed the problem of automatically generating Japanese typographic fonts from only a few font samples, where the synthesized glyphs are expected to have coherent characteristics, such as skeletons, contours, and serifs. Existing methods often fail to generate fine glyph images when the number of style reference glyphs is extremely limited. Herein, we proposed a simple but powerful framework for extracting better style features. This framework introduces deep metric learning to style encoders. We performed experiments using black-and-white and shape-distinctive font datasets and demonstrated the effectiveness of the proposed framework.
Submitted 4 November, 2020;
originally announced November 2020.