Showing 1–17 of 17 results for author: Florencio, D

  1. arXiv:2501.05452  [pdf, other]

    cs.CV cs.CL

    ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding

    Authors: Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, Cha Zhang

    Abstract: Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multihop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that…

    Submitted 9 January, 2025; originally announced January 2025.

    Comments: Project link: https://zeyofu.github.io/ReFocus/

  2. arXiv:2305.14571  [pdf, other]

    cs.CL

    From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding

    Authors: Li Sun, Florian Luisier, Kayhan Batmanghelich, Dinei Florencio, Cha Zhang

    Abstract: Current state-of-the-art models for natural language understanding require a preprocessing step to convert raw text into discrete tokens. This process known as tokenization relies on a pre-built vocabulary of words or sub-word morphemes. This fixed vocabulary limits the model's robustness to spelling errors and its capacity to adapt to new domains. In this work, we introduce a novel open-vocabular…

    Submitted 29 May, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Accepted to ACL 2023 Main Conference

  3. Diffusion-based Document Layout Generation

    Authors: Liu He, Yijuan Lu, John Corring, Dinei Florencio, Cha Zhang

    Abstract: We develop a diffusion-based approach for various document layout sequence generation. Layout sequences specify the contents of a document design in an explicit format. Our novel diffusion-based approach works in the sequence domain rather than the image domain in order to permit more complex and realistic layouts. We also introduce a new metric, Document Earth Mover's Distance (Doc-EMD). By consi…

    Submitted 19 March, 2023; originally announced March 2023.

    Journal ref: Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14187. Springer, Cham

  4. arXiv:2208.08201  [pdf, other]

    cs.CL cs.AI

    Understanding Long Documents with Different Position-Aware Attentions

    Authors: Hai Pham, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang

    Abstract: Despite several successes in document understanding, the practical task for long document understanding is largely under-explored due to several challenges in computation and how to efficiently absorb long multimodal input. Most current transformer-based approaches only deal with short documents and employ solely textual information for attention due to its prohibitive computation and memory limit…

    Submitted 17 August, 2022; originally announced August 2022.

  5. arXiv:2111.06738  [pdf, other]

    cs.CV

    Improving Structured Text Recognition with Regular Expression Biasing

    Authors: Baoguang Shi, Wenfeng Cheng, Yijuan Lu, Cha Zhang, Dinei Florencio

    Abstract: We study the problem of recognizing structured text, i.e. text that follows certain formats, and propose to improve the recognition accuracy of structured text by specifying regular expressions (regexes) for biasing. A biased recognizer recognizes text that matches the specified regexes with significantly improved accuracy, at the cost of a generally small degradation on other text. The biasing is…

    Submitted 10 November, 2021; originally announced November 2021.

  6. arXiv:2109.10282  [pdf, other]

    cs.CL cs.CV

    TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

    Authors: Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei

    Abstract: Text recognition is a long-standing research problem for document digitalization. Existing approaches are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image T…

    Submitted 6 September, 2022; v1 submitted 21 September, 2021; originally announced September 2021.

    Comments: Work in Progress

  7. arXiv:2104.08836  [pdf, other]

    cs.CL

    LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding

    Authors: Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei

    Abstract: Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually-rich document understanding tasks recently, which demonstrates the great potential for joint learning across different modalities. In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually-rich doc…

    Submitted 9 September, 2021; v1 submitted 18 April, 2021; originally announced April 2021.

    Comments: Work in progress

  8. arXiv:2012.14740  [pdf, other]

    cs.CL

    LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

    Authors: Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou

    Abstract: Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a…

    Submitted 9 January, 2022; v1 submitted 29 December, 2020; originally announced December 2020.

    Comments: ACL 2021 main conference

  9. arXiv:2012.04638  [pdf, other]

    cs.CV

    TAP: Text-Aware Pre-training for Text-VQA and Text-Caption

    Authors: Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, Jiebo Luo

    Abstract: In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks. These two tasks aim at reading and understanding scene text in images for question answering and image caption generation, respectively. In contrast to the conventional vision-language pre-training that fails to capture scene text and its relationship with the visual and text modalities, TAP explicitly inc…

    Submitted 8 December, 2020; originally announced December 2020.

  10. arXiv:1811.07275  [pdf, other]

    cs.CV cs.LG

    RePr: Improved Training of Convolutional Filters

    Authors: Aaditya Prakash, James Storer, Dinei Florencio, Cha Zhang

    Abstract: A well-trained Convolutional Neural Network can easily be pruned without significant loss of performance. This is because of unnecessary overlap in the features captured by the network's filters. Innovations in network architecture such as skip/dense connections and Inception units have mitigated this problem to some extent, but these improvements come with increased computation and memory require…

    Submitted 25 February, 2019; v1 submitted 18 November, 2018; originally announced November 2018.

    Comments: CVPR 2019

  11. A Fusion Framework for Camouflaged Moving Foreground Detection in the Wavelet Domain

    Authors: Shuai Li, Dinei Florencio, Wanqing Li, Yaqin Zhao, Chris Cook

    Abstract: Detecting camouflaged moving foreground objects has been known to be difficult due to the similarity between the foreground objects and the background. Conventional methods cannot distinguish the foreground from background due to the small differences between them and thus suffer from under-detection of the camouflaged foreground objects. In this paper, we present a fusion framework to address thi…

    Submitted 16 April, 2018; originally announced April 2018.

    Comments: 13 pages, accepted by IEEE TIP

  12. arXiv:1802.05383  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS eess.SP

    Deep Learning Based Speech Beamforming

    Authors: Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, Dinei Florencio, Mark Hasegawa-Johnson

    Abstract: Multi-channel speech enhancement with ad-hoc sensors has been a challenging task. Speech model guided beamforming algorithms are able to recover natural sounding speech, but the speech models tend to be oversimplified or the inference would otherwise be too complicated. On the other hand, deep learning based enhancement approaches are able to learn complicated speech distributions and perform effi…

    Submitted 14 February, 2018; originally announced February 2018.

    Comments: Accepted in The 43rd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2018)

  13. arXiv:1707.03166  [pdf, other]

    cs.CV

    Foreground Detection in Camouflaged Scenes

    Authors: Shuai Li, Dinei Florencio, Yaqin Zhao, Chris Cook, Wanqing Li

    Abstract: Foreground detection has been widely studied for decades due to its importance in many practical applications. Most of the existing methods assume foreground and background show visually distinct characteristics and thus the foreground can be detected once a good background model is obtained. However, there are many situations where this is not the case. Of particular interest in video surveillanc…

    Submitted 11 July, 2017; originally announced July 2017.

    Comments: IEEE International Conference on Image Processing, 2017

  14. Joint Denoising / Compression of Image Contours via Shape Prior and Context Tree

    Authors: Amin Zheng, Gene Cheung, Dinei Florencio

    Abstract: With the advent of depth sensing technologies, the extraction of object contours in images---a common and important pre-processing step for later higher-level computer vision tasks like object detection and human action recognition---has become easier. However, acquisition noise in captured depth images means that detected contours suffer from unavoidable errors. In this paper, we propose to joint…

    Submitted 30 April, 2017; originally announced May 2017.

  15. arXiv:1605.02427  [pdf, other]

    cs.SD

    Speech Enhancement In Multiple-Noise Conditions using Deep Neural Networks

    Authors: Anurag Kumar, Dinei Florencio

    Abstract: In this paper we consider the problem of speech enhancement in real-world like conditions where multiple noises can simultaneously corrupt speech. Most of the current literature on speech enhancement focus primarily on presence of single noise in corrupted speech which is far from real-world environments. Specifically, we deal with improving speech quality in office environment where multiple stat…

    Submitted 9 May, 2016; originally announced May 2016.

  16. Context Tree based Image Contour Coding using A Geometric Prior

    Authors: Amin Zheng, Gene Cheung, Dinei Florencio

    Abstract: If object contours in images are coded efficiently as side information, then they can facilitate advanced image / video coding techniques, such as graph Fourier transform coding or motion prediction of arbitrarily shaped pixel blocks. In this paper, we study the problem of lossless and lossy compression of detected contours in images. Specifically, we first convert a detected object contour compos…

    Submitted 27 April, 2016; originally announced April 2016.

  17. Precision Enhancement of 3D Surfaces from Multiple Compressed Depth Maps

    Authors: Pengfei Wan, Gene Cheung, Philip A. Chou, Dinei Florencio, Cha Zhang, Oscar C. Au

    Abstract: In texture-plus-depth representation of a 3D scene, depth maps from different camera viewpoints are typically lossily compressed via the classical transform coding / coefficient quantization paradigm. In this paper we propose to reduce distortion of the decoded depth maps due to quantization. The key observation is that depth maps from different viewpoints constitute multiple descriptions (MD) of…

    Submitted 24 February, 2014; originally announced May 2014.

    Comments: This work was accepted as an ongoing work paper in IEEE MMSP'2013