Search | arXiv e-print repository

Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos

Authors: Giulio Cesare Mastrocinque Santo, Patrícia Izar, Irene Delval, Victor de Napole Gregolin, Nina S. T. Hirata

Abstract: Video recordings of nonhuman primates in their natural habitat are a common source for studying their behavior in the wild. We fine-tune pre-trained video-text foundational models for the specific domain of capuchin monkeys, with the goal of developing useful computational models to help researchers retrieve useful clips from videos. We focus on the challenging problem of training a model based solely on raw, unlabeled video footage, using weak audio descriptions sometimes provided by field collaborators. We leverage recent advances in Multimodal Large Language Models (MLLMs) and Vision-Language Models (VLMs) to address the extremely noisy nature of both video and audio content. Specifically, we propose a two-fold approach: an agentic data treatment pipeline and a fine-tuning process. The data processing pipeline automatically extracts clean and semantically aligned video-text pairs from the raw videos, which are subsequently used to fine-tune Microsoft's pre-trained X-CLIP model through Low-Rank Adaptation (LoRA). We obtained an uplift in $Hits@5$ of $167\%$ for the 16-frame model and an uplift of $114\%$ for the 8-frame model on our domain data. Moreover, based on $NDCG@K$ results, our model is able to rank well most of the considered behaviors, while the tested raw pre-trained models are not able to rank them at all. The code will be made available upon acceptance.

Submitted 8 May, 2025; originally announced May 2025.
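
The abstract above reports retrieval quality as $Hits@5$ and $NDCG@K$. As a minimal sketch of how such metrics are commonly computed (not the authors' evaluation code), the Python below assumes each query yields a list of relevance grades ordered by the model's ranking; the function names and the choice to build the ideal ranking from the retrieved list itself are assumptions made for illustration.

```python
import math

def hits_at_k(ranked_relevance, k):
    """Return 1.0 if any of the top-k results is relevant, else 0.0."""
    return 1.0 if any(rel > 0 for rel in ranked_relevance[:k]) else 0.0

def dcg_at_k(ranked_relevance, k):
    """Discounted cumulative gain: each grade is discounted by log2(rank + 1)."""
    # enumerate() is 0-based, so rank = pos + 1 and the discount is log2(pos + 2).
    return sum(rel / math.log2(pos + 2)
               for pos, rel in enumerate(ranked_relevance[:k]))

def ndcg_at_k(ranked_relevance, k):
    """DCG normalised by the DCG of the best possible ordering of the same list."""
    # Assumption: the ideal ranking is a re-sort of the retrieved grades only.
    ideal = dcg_at_k(sorted(ranked_relevance, reverse=True), k)
    return dcg_at_k(ranked_relevance, k) / ideal if ideal > 0 else 0.0

if __name__ == "__main__":
    # Hypothetical query: the only relevant clip is ranked third.
    grades = [0, 0, 1, 0, 0, 0]
    print(hits_at_k(grades, 5), ndcg_at_k(grades, 5))
```

Averaging `hits_at_k` over all queries gives the dataset-level Hits@K; an "uplift" is then the relative change of that average between the pre-trained and fine-tuned models.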

arXiv:2502.04602

Extracting and Understanding the Superficial Knowledge in Alignment

Authors: Runjin Chen, Gabriel Jacob Perin, Xuxi Chen, Xilun Chen, Yan Han, Nina S. T. Hirata, Junyuan Hong, Bhavya Kailkhura

Abstract: Alignment of large language models (LLMs) with human values and preferences, often achieved through fine-tuning based on human feedback, is essential for ensuring safe and responsible AI behaviors. However, the process typically requires substantial data and computation resources. Recent studies have revealed that alignment might be attainable at lower costs through simpler methods, such as in-context learning. This leads to the question: Is alignment predominantly superficial? In this paper, we delve into this question and provide a quantitative analysis. We formalize the concept of superficial knowledge, defining it as knowledge that can be acquired through easy token restyling, without affecting the model's ability to capture underlying causal relationships between tokens. We propose a method to extract and isolate superficial knowledge from aligned models, focusing on the shallow modifications to the final token selection process. By comparing models augmented only with superficial knowledge to fully aligned models, we quantify the superficial portion of alignment. Our findings reveal that while superficial knowledge constitutes a significant portion of alignment, particularly in safety and detoxification tasks, it is not the whole story. Tasks requiring reasoning and contextual understanding still rely on deeper knowledge. Additionally, we demonstrate two practical advantages of isolated superficial knowledge: (1) it can be transferred between models, enabling efficient offsite alignment of larger models using extracted superficial knowledge from smaller models, and (2) it is recoverable, allowing for the restoration of alignment in compromised models without sacrificing performance.

Submitted 6 February, 2025; originally announced February 2025.

arXiv:2501.04750

Efficient License Plate Recognition in Videos Using Visual Rhythm and Accumulative Line Analysis

Authors: Victor Nascimento Ribeiro, Nina S. T. Hirata

Abstract: Video-based Automatic License Plate Recognition (ALPR) involves extracting vehicle license plate text information from video captures. Traditional systems typically rely heavily on high-end computing resources and utilize multiple frames to recognize license plates, leading to increased computational overhead. In this paper, we propose two methods capable of efficiently extracting exactly one frame per vehicle and recognizing its license plate characters from this single image, thus significantly reducing computational demands. The first method uses Visual Rhythm (VR) to generate time-spatial images from videos, while the second employs Accumulative Line Analysis (ALA), a novel algorithm based on single-line video processing for real-time operation. Both methods leverage YOLO for license plate detection within the frame and a Convolutional Neural Network (CNN) for Optical Character Recognition (OCR) to extract textual information. Experiments on real videos demonstrate that the proposed methods achieve results comparable to traditional frame-by-frame approaches, with processing speeds three times faster.

Submitted 8 January, 2025; originally announced January 2025.

Comments: Accepted for presentation at the Conference on Graphics, Patterns and Images (SIBGRAPI) 2024

arXiv:2501.04534

doi 10.5753/sibgrapi.est.2023.27473

Combining YOLO and Visual Rhythm for Vehicle Counting

Authors: Victor Nascimento Ribeiro, Nina S. T. Hirata

Abstract: Video-based vehicle detection and counting play a critical role in managing transport infrastructure. Traditional image-based counting methods usually involve two main steps: initial detection and subsequent tracking, which are applied to all video frames, leading to a significant increase in computational complexity. To address this issue, this work presents an alternative and more efficient method for vehicle detection and counting. The proposed approach eliminates the need for a tracking step and focuses solely on detecting vehicles in key video frames, thereby increasing its efficiency. To achieve this, we developed a system that combines YOLO, for vehicle detection, with Visual Rhythm, a way to create time-spatial images that allows us to focus on frames that contain useful information. Additionally, this method can be used for counting in any application involving unidirectional moving targets to be detected and identified. Experimental analysis using real videos shows that the proposed method achieves a mean counting accuracy of around 99.15% over a set of videos, with a processing speed three times faster than tracking-based approaches.

Submitted 8 January, 2025; originally announced January 2025.

Comments: Accepted for presentation at the Conference on Graphics, Patterns and Images (SIBGRAPI) 2023
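
The abstract above describes Visual Rhythm as a way to build time-spatial images from video. A minimal sketch of the general idea, assuming grayscale frames stored as nested lists and one fixed counting line per frame, is shown below. The run-counting step is a deliberately simplified stand-in introduced here for illustration; the paper itself pairs Visual Rhythm with YOLO detection, and the function names, the `background` line, and the `threshold` parameter are all hypothetical.

```python
def visual_rhythm(frames, row):
    """Stack one pixel row per frame; the result has one line per time step."""
    return [list(frame[row]) for frame in frames]

def count_crossings(vr_image, background, threshold):
    """Count runs of consecutive time steps whose line differs from the background.

    Simplified stand-in for a detector: each contiguous run is treated as one
    object crossing the counting line.
    """
    crossings = 0
    active = False
    for line in vr_image:
        # Largest per-pixel deviation from the empty-road line at this time step.
        diff = max(abs(p - b) for p, b in zip(line, background))
        occupied = diff > threshold
        if occupied and not active:
            crossings += 1  # a new run starts here
        active = occupied
    return crossings

if __name__ == "__main__":
    empty = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
    car = [[0, 0, 0], [0, 255, 0], [0, 0, 0]]
    # Hypothetical clip: one object spans two frames, a second spans one frame.
    frames = [empty, car, car, empty, car, empty]
    vr = visual_rhythm(frames, row=1)
    print(count_crossings(vr, background=[0, 0, 0], threshold=50))
```

Because only one row per frame is kept, the time-spatial image is far smaller than the full video, which is the source of the efficiency gain the abstract refers to.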

arXiv:2501.02270

Efficient Video-Based ALPR System Using YOLO and Visual Rhythm

Authors: Victor Nascimento Ribeiro, Nina S. T. Hirata

Abstract: Automatic License Plate Recognition (ALPR) involves extracting vehicle license plate information from an image or a video capture. These systems have gained popularity due to the wide availability of low-cost surveillance cameras and advances in Deep Learning. Typically, video-based ALPR systems rely on multiple frames to detect the vehicle and recognize the license plates. Therefore, we propose a system capable of extracting exactly one frame per vehicle and recognizing its license plate characters from this single image using an Optical Character Recognition (OCR) model. Early experiments show that this methodology is viable.

Submitted 8 January, 2025; v1 submitted 4 January, 2025; originally announced January 2025.

Comments: Accepted to CVPR 2024

arXiv:2406.06538

doi 10.1109/ICPR56361.2022.9956133

Understanding attention-based encoder-decoder networks: a case study with chess scoresheet recognition

Authors: Sergio Y. Hayashi, Nina S. T. Hirata

Abstract: Deep neural networks are largely used for complex prediction tasks. There is plenty of empirical evidence of their successful end-to-end training for a diversity of tasks. Success is often measured based solely on the final performance of the trained network, and explanations on when, why and how they work are less emphasized. In this paper we study encoder-decoder recurrent neural networks with attention mechanisms for the task of reading handwritten chess scoresheets. Rather than prediction performance, our concern is to better understand how learning occurs in this type of network. We characterize the task in terms of three subtasks, namely input-output alignment, sequential pattern recognition, and handwriting recognition, and experimentally investigate which factors affect their learning. We identify competition, collaboration and dependence relations between the subtasks, and argue that such knowledge might help one to better balance factors to properly train a network.

Submitted 23 April, 2024; originally announced June 2024.

Comments: This work was accepted and published in the 2022 26th International Conference on Pattern Recognition (ICPR)

Journal ref: 2022 26th International Conference on Pattern Recognition (ICPR)

arXiv:2404.09925

doi 10.1093/mnras/stae971

The Quasar Catalogue for S-PLUS DR4 (QuCatS) and the estimation of photometric redshifts

Authors: L. Nakazono, R. R. Valença, G. Soares, R. Izbicki, Ž. Ivezić, E. V. R. Lima, N. S. T. Hirata, L. Sodré Jr., R. Overzier, F. Almeida-Fernandes, G. B. Oliveira Schwarz, W. Schoenell, A. Kanaan, T. Ribeiro, C. Mendes de Oliveira

Abstract: The advent of massive broad-band photometric surveys enabled photometric redshift estimates for unprecedented numbers of galaxies and quasars. These estimates can be improved using better algorithms or by obtaining complementary data such as narrow-band photometry, and broad-band photometry over an extended wavelength range. We investigate the impact of both approaches on photometric redshifts for quasars using data from Southern Photometric Local Universe Survey (S-PLUS) DR4, Galaxy Evolution Explorer (GALEX) DR6/7, and the unWISE catalog for the Wide-field Infrared Survey Explorer (WISE) in three machine learning methods: Random Forest, Flexible Conditional Density Estimation (FlexCoDE), and Bayesian Mixture Density Network (BMDN). Including narrow-band photometry improves the root-mean-square error by 11% in comparison to a model trained with only broad-band photometry. Narrow-band information only provided an improvement of 3.8% when GALEX and WISE colours were included. Thus narrow bands play a more important role for objects that do not have GALEX or WISE counterparts, which respectively make up 92% and 25% of S-PLUS data considered here. Nevertheless, the inclusion of narrow-band information provided better estimates of the probability density functions obtained with FlexCoDE and BMDN. We publicly release a value-added catalogue of photometrically selected quasars with the photo-z predictions from all methods studied here. The catalogue provided with this work covers the S-PLUS DR4 area (~3000 deg$^2$), containing 645 980, 244 912, 144 991 sources with the probability of being a quasar higher than 80%, 90%, 95% up to r < 21.3 and good photometry quality in the detection image. More quasar candidates can be retrieved from the S-PLUS data base by considering less restrictive selection criteria.

Submitted 23 August, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

Journal ref: Monthly Notices of the Royal Astronomical Society, 2024, 531, 327-339

arXiv:2202.12941

doi 10.1016/j.nima.2022.166497

Digital Signal Analysis based on Convolutional Neural Networks for Active Target Time Projection Chambers

Authors: G. F. Fortino, J. C. Zamora, L. E. Tamayose, N. S. T. Hirata, V. Guimaraes

Abstract: An algorithm for digital signal analysis using convolutional neural networks (CNN) was developed in this work. The main objective of this algorithm is to make the analysis of experiments with active target time projection chambers more efficient. The code is divided into three steps: baseline correction, signal deconvolution, and peak detection and integration. The CNNs were able to learn the signal processing models with relative errors of less than 6%. The analysis based on CNNs provides the same results as the traditional deconvolution algorithms, but is considerably more efficient in terms of computing time (about 65 times faster). This opens up new possibilities to improve existing codes and to simplify the analysis of the large amount of data produced in active target experiments.

Submitted 14 February, 2022; originally announced February 2022.

arXiv:2106.11986

doi 10.1093/mnras/stab1835

On the discovery of stars, quasars, and galaxies in the Southern Hemisphere with S-PLUS DR2

Authors: L. Nakazono, C. Mendes de Oliveira, N. S. T. Hirata, S. Jeram, C. Queiroz, Stephen S. Eikenberry, A. H. Gonzalez, R. Abramo, R. Overzier, M. Espadoto, A. Martinazzo, L. Sampedro, F. R. Herpich, F. Almeida-Fernandes, A. Werle, C. E. Barbosa, L. Sodré Jr., E. V. Lima, M. L. Buzzo, A. Cortesi, K. Menéndez-Delmestre, S. Akras, Alvaro Alvarez-Candal, A. R. Lopes, E. Telles, et al. (3 additional authors not shown)

Abstract: This paper provides a catalogue of stars, quasars, and galaxies for the Southern Photometric Local Universe Survey Data Release 2 (S-PLUS DR2) in the Stripe 82 region. We show that a 12-band filter system (5 Sloan-like and 7 narrow bands) allows better performance for object classification than the usual analysis based solely on broad bands (regardless of infrared information). Moreover, we show that our classification is robust against missing values. Using spectroscopically confirmed sources retrieved from the Sloan Digital Sky Survey DR16 and DR14Q, we train a random forest classifier with the 12 S-PLUS magnitudes + 4 morphological features. A second random forest classifier is trained with the addition of the W1 (3.4 $μ$m) and W2 (4.6 $μ$m) magnitudes from the Wide-field Infrared Survey Explorer (WISE). Forty-four percent of our catalogue have WISE counterparts and are provided with classification from both models. We achieve 95.76% (52.47%) of quasar purity, 95.88% (92.24%) of quasar completeness, 99.44% (98.17%) of star purity, 98.22% (78.56%) of star completeness, 98.04% (81.39%) of galaxy purity, and 98.8% (85.37%) of galaxy completeness for the first (second) classifier, for which the metrics were calculated on objects with (without) WISE counterpart. A total of 2,926,787 objects that are not in our spectroscopic sample were labelled, obtaining 335,956 quasars, 1,347,340 stars, and 1,243,391 galaxies. From those, 7.4%, 76.0%, and 58.4% were classified with probabilities above 80%. The catalogue with classification and probabilities for Stripe 82 S-PLUS DR2 is available for download.

Submitted 4 November, 2021; v1 submitted 22 June, 2021; originally announced June 2021.

Comments: 27 pages, 22 figures. Updated to reflect the published version. Data products are available in https://splus.cloud/ website

Journal ref: Monthly Notices of the Royal Astronomical Society, 2021, 507, 5847-5868

arXiv:2004.11336

Self-supervised Learning for Astronomical Image Classification

Authors: Ana Martinazzo, Mateus Espadoto, Nina S. T. Hirata

Abstract: In Astronomy, a huge amount of image data is generated daily by photometric surveys, which scan the sky to collect data from stars, galaxies and other celestial objects. In this paper, we propose a technique to leverage unlabeled astronomical images to pre-train deep convolutional neural networks, in order to learn a domain-specific feature extractor which improves the results of machine learning techniques in setups with small amounts of labeled data available. We show that our technique produces results which are in many cases better than using ImageNet pre-training.

Submitted 25 June, 2020; v1 submitted 23 April, 2020; originally announced April 2020.

Comments: Accepted for ICPR 2020

arXiv:1912.06199

Greenery Segmentation In Urban Images By Deep Learning

Authors: Artur A. M. Oliveira, Nina S. T. Hirata, Roberto Hirata Jr

Abstract: Vegetation is a relevant feature in the urban scenery and its awareness can be measured in an image by the Green View Index (GVI). Previous approaches to estimating the GVI were based on heuristic image processing and, more recently, on deep learning networks (DLN). By leveraging some recent DLN architectures tuned to the image segmentation problem and exploiting a weighting strategy in the loss function (LF), we improved on previously reported results on similar datasets.

Submitted 12 December, 2019; originally announced December 2019.

Comments: Supplemental material can be found at http://greenery_data.arturao.org/

MSC Class: I.4.6; I.5.4

ACM Class: I.4.6; I.5.4

arXiv:1907.01567

doi 10.1093/mnras/stz1985

The Southern Photometric Local Universe Survey (S-PLUS): improved SEDs, morphologies and redshifts with 12 optical filters

Authors: C. Mendes de Oliveira, T. Ribeiro, W. Schoenell, A. Kanaan, R. A. Overzier, A. Molino, L. Sampedro, P. Coelho, C. E. Barbosa, A. Cortesi, M. V. Costa-Duarte, F. R. Herpich, J. A. Hernandez-Jimenez, V. M. Placco, H. S. Xavier, L. R. Abramo, R. K. Saito, A. L. Chies-Santos, A. Ederoclite, R. Lopes de Oliveira, D. R. Gonçalves, S. Akras, L. A. Almeida, F. Almeida-Fernandes, T. C. Beers, et al. (120 additional authors not shown)

Abstract: The Southern Photometric Local Universe Survey (S-PLUS) is imaging ~9300 deg^2 of the celestial sphere in twelve optical bands using a dedicated 0.8 m robotic telescope, the T80-South, at the Cerro Tololo Inter-American Observatory, Chile. The telescope is equipped with a 9.2k by 9.2k e2v detector with 10 um pixels, resulting in a field-of-view of 2 deg^2 with a plate scale of 0.55"/pixel. The survey consists of four main subfields, which include two non-contiguous fields at high Galactic latitudes (8000 deg^2 at |b| > 30 deg) and two areas of the Galactic plane and bulge (for an additional 1300 deg^2). S-PLUS uses the Javalambre 12-band magnitude system, which includes the 5 u, g, r, i, z broad-band filters and 7 narrow-band filters centered on prominent stellar spectral features: the Balmer jump/[OII], Ca H+K, H-delta, G-band, Mg b triplet, H-alpha, and the Ca triplet. S-PLUS delivers accurate photometric redshifts (delta_z/(1+z) = 0.02 or better) for galaxies with r < 20 AB mag and redshift < 0.5, thus producing a 3D map of the local Universe over a volume of more than 1 (Gpc/h)^3. The final S-PLUS catalogue will also enable the study of star formation and stellar populations in and around the Milky Way and nearby galaxies, as well as searches for quasars, variable sources, and low-metallicity stars. In this paper we introduce the main characteristics of the survey, illustrated with science verification data highlighting the unique capabilities of S-PLUS. We also present the first public data release of ~336 deg^2 of the Stripe-82 area, which is available at http://datalab.noao.edu/splus.

Submitted 2 September, 2019; v1 submitted 2 July, 2019; originally announced July 2019.

Comments: Updated to reflect the published version (MNRAS, 489, 241). For a short introductory video of the S-PLUS project, see https://youtu.be/yc5kHrHU9Jk - The S-PLUS Data Release 1 is available at http://datalab.noao.edu/splus

arXiv:1902.07958

Deep Learning Multidimensional Projections

Authors: Mateus Espadoto, Nina S. T. Hirata, Alexandru C. Telea

Abstract: Dimensionality reduction methods, also known as projections, are frequently used for exploring multidimensional data in machine learning, data science, and information visualization. Among these, t-SNE and its variants have become very popular for their ability to visually separate distinct data clusters. However, such methods are computationally expensive for large datasets, suffer from stability problems, and cannot directly handle out-of-sample data. We propose a learning approach to construct such projections. We train a deep neural network based on a collection of samples from a given data universe, and their corresponding projections, and next use the network to infer projections of data from the same, or similar, universes. Our approach generates projections with similar characteristics as the learned ones, is computationally two to three orders of magnitude faster than SNE-class methods, has no complex-to-set user parameters, handles out-of-sample data in a stable manner, and can be used to learn any projection technique. We demonstrate our proposal on several real-world high dimensional datasets from machine learning.

Submitted 21 February, 2019; originally announced February 2019.

arXiv:1712.04833

Symbol detection in online handwritten graphics using Faster R-CNN

Authors: Frank D. Julca-Aguilar, Nina S. T. Hirata

Abstract: Symbol detection techniques in online handwritten graphics (e.g. diagrams and mathematical expressions) consist of methods specifically designed for a single graphic type. In this work, we evaluate the Faster R-CNN object detection algorithm as a general method for detection of symbols in handwritten graphics. We evaluate different configurations of the Faster R-CNN method, and point out issues relative to the handwritten nature of the data. Considering the online recognition context, we evaluate efficiency and accuracy trade-offs of using Deep Neural Networks of different complexities as feature extractors. We evaluate the method on publicly available flowchart and mathematical expression (CROHME-2016) datasets. Results show that Faster R-CNN can be effectively used on both datasets, enabling the possibility of developing general methods for symbol detection, and furthermore, general graphic understanding methods that could be built on top of the algorithm.

Submitted 13 December, 2017; originally announced December 2017.

Comments: Submitted to DAS-2018

arXiv:1709.06476

Image operator learning coupled with CNN classification and its application to staff line removal

Authors: Frank D. Julca-Aguilar, Nina S. T. Hirata

Abstract: Many image transformations can be modeled by image operators that are characterized by pixel-wise local functions defined on a finite support window. In image operator learning, these functions are estimated from training data using machine learning techniques. Input size is usually a critical issue when using learning algorithms, and it limits the size of practicable windows. We propose the use of convolutional neural networks (CNNs) to overcome this limitation. The problem of removing staff-lines in music score images is chosen to evaluate the effects of window and convolutional mask sizes on the learned image operator performance. Results show that the CNN based solution outperforms previous ones obtained using conventional learning algorithms or heuristic algorithms, indicating the potential of CNNs as base classifiers in image operator learning. The implementations will be made available on the TRIOSlib project site.

Submitted 19 September, 2017; originally announced September 2017.

Comments: To appear in ICDAR 2017

arXiv:1709.06389

A General Framework for the Recognition of Online Handwritten Graphics

Authors: Frank Julca-Aguilar, Harold Mouchère, Christian Viard-Gaudin, Nina S. T. Hirata

Abstract: We propose a new framework for the recognition of online handwritten graphics. Three main features of the framework are its ability to treat symbol and structural level information in an integrated way, its flexibility with respect to different families of graphics, and means to control the tradeoff between recognition effectiveness and computational cost. We model a graphic as a labeled graph generated from a graph grammar. Non-terminal vertices represent subcomponents, terminal vertices represent symbols, and edges represent relations between subcomponents or symbols. We then model the recognition problem as a graph parsing problem: given an input stroke set, we search for a parse tree that represents the best interpretation of the input. Our graph parsing algorithm generates multiple interpretations (consistent with the grammar) and then we extract an optimal interpretation according to a cost function that takes into consideration the likelihood scores of symbols and structures. The parsing algorithm consists in recursively partitioning the stroke set according to structures defined in the grammar and it does not impose constraints present in some previous works (e.g. stroke ordering). By avoiding such constraints and thanks to the powerful representativeness of graphs, our approach can be adapted to the recognition of different graphic notations. We show applications to the recognition of mathematical expressions and flowcharts. Experimentation shows that our method obtains state-of-the-art accuracy in both applications.

Submitted 19 September, 2017; originally announced September 2017.

Comments: Submitted to TPAMI

Showing 1–16 of 16 results for author: Hirata, N S T