Search | arXiv e-print repository

RAVEN: An Agentic Framework for Multimodal Entity Discovery from Large-Scale Video Collections

Abstract: We present RAVEN, an adaptive AI agent framework designed for multimodal entity discovery and retrieval in large-scale video collections. Synthesizing information across visual, audio, and textual modalities, RAVEN autonomously processes video data to produce structured, actionable representations for downstream tasks. Key contributions include (1) a category understanding step to infer video themes and general-purpose entities, (2) a schema generation mechanism that dynamically defines domain-specific entities and attributes, and (3) a rich entity extraction process that leverages semantic retrieval and schema-guided prompting. RAVEN is designed to be model-agnostic, allowing the integration of different vision-language models (VLMs) and large language models (LLMs) based on application-specific requirements. This flexibility supports diverse applications in personalized search, content discovery, and scalable information retrieval, enabling practical applications across vast datasets.

Submitted 3 March, 2025; originally announced April 2025.

Comments: Presented at AI Agent for Information Retrieval: Generating and Ranking (Agent4IR) @ AAAI 2025 [https://sites.google.com/view/ai4ir/aaai-2025]

arXiv:2501.00290 [pdf, ps, other]

Zero-dilation indices and numerical ranges

Authors: Kennett L. Dela Rosa

Abstract: The zero-dilation index $d(A) $ of a matrix $A$ is the largest integer $k$ for which $\begin{bmatrix}0_k& *\\ * & *\end{bmatrix}$ is unitarily similar to $A$. In this study, the zero-dilation indices of certain block matrices are considered, namely, the block matrix analogues of companion matrices and upper triangular KMS matrices, respectively shown as \[\mathcal{C}=\begin{bmatrix} 0& \bigoplus_{j=1}^{m-1}A_j \\ B_0& [B_j]_{j=1}^{m-1}\end{bmatrix}\ \mbox{and}\ \mathcal{K}=\begin{bmatrix}0& A& A^2&\cdots& A^{m-1}\\ 0 & 0& A& \ddots& \vdots\\ 0& 0 &0 &\ddots& A^2\\ \vdots& \vdots &\vdots & \ddots& A\\ 0& 0 & 0& \cdots &0\end{bmatrix}\] where $\mathcal{C}$ and $\mathcal{K}$ are $mn$-by-$mn$ and $A_j,B_j,A$ are $n$-by-$n$. Provided $\bigoplus_{j=1}^{m-1}A_j$ is nonsingular, it is proved that $d(\mathcal{C})$ satisfies the following: if $m\geq 3$ is odd (respectively, $m\geq 2$ is even), then $\frac{(m-1)n}{2}\leq d(\mathcal{C})\leq \frac{(m+1)n}{2}$ (respectively, $ d(\mathcal{C})= \frac{mn}{2}$). In the odd $m$ case, examples are given showing that it is possible to get as zero-dilation index each integer value between $\frac{(m-1)n}{2} $ and $\frac{(m+1)n}{2}$. On the other hand, $d(\mathcal{K})$ is proved to be equal to the number of nonnegative eigenvalues of $(\mathcal{K}+\mathcal{K}^*)/2$. Alternative characterizations of $d(\mathcal{K})$ are given. The circularity of the numerical range of $\mathcal{K} $ is also considered.

Submitted 31 December, 2024; originally announced January 2025.

Comments: 25 pages

MSC Class: 15A45; 15A60; 15B99; 47A12; 47A20
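
As a small sanity check of the eigenvalue characterization quoted in the abstract of arXiv:2501.00290, consider the smallest case $m=2$, $n=1$, $A=[1]$. This choice is made here purely for illustration and is not taken from the paper.

```latex
\[
\mathcal{K}=\begin{bmatrix}0&1\\0&0\end{bmatrix},\qquad
\frac{\mathcal{K}+\mathcal{K}^*}{2}=\begin{bmatrix}0&\tfrac12\\ \tfrac12&0\end{bmatrix},\qquad
\text{eigenvalues: } \tfrac12,\ -\tfrac12 .
\]
```

Exactly one eigenvalue is nonnegative, so the count is 1. This agrees with the definition directly: $\mathcal{K}$ already has a zero in its $(1,1)$ entry, so $d(\mathcal{K})\geq 1$, while $d(\mathcal{K})=2$ would force $\mathcal{K}=0_2$.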

arXiv:2409.13339 [pdf, ps, other]

On commutators of unipotent matrices of index 2

Authors: Kennett L. Dela Rosa, Juan Paolo C. Santos

Abstract: A commutator of unipotent matrices of index 2 is a matrix of the form $XYX^{-1}Y^{-1}$, where $X$ and $Y$ are unipotent matrices of index 2, that is, $X\ne I_n$, $Y\ne I_n$, and $(X-I_n)^2=(Y-I_n)^2=0_n$. If $n>2$ and $\mathbb F$ is a field with $|\mathbb F|\geq 4$, then it is shown that every $n\times n$ matrix over $\mathbb F$ with determinant 1 is a product of at most four commutators of unipotent matrices of index 2. Consequently, every $n\times n$ matrix over $\mathbb F$ with determinant 1 is a product of at most eight unipotent matrices of index 2. Conditions on $\mathbb F$ are given that improve the upper bound on the commutator factors from four to three or two. The situation for $n=2$ is also considered. This study reveals a connection between factorability into commutators of unipotent matrices and properties of $\mathbb F$ such as its characteristic or its set of perfect squares.

Submitted 20 September, 2024; originally announced September 2024.

Comments: 23 pages

MSC Class: 15A21; 15A23; 15B33; 15B99; 20H20
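
The definition in the abstract of arXiv:2409.13339 can be illustrated with a minimal sketch. It works with integer matrices only (the paper treats general fields $\mathbb F$), and the helper names and the $2\times 2$ example matrices are hypothetical choices made here. It relies on the identity $X^{-1}=2I_n-X$, which follows from $(X-I_n)^2=0_n$.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def is_unipotent_index2(x):
    """Check X != I and (X - I)^2 = 0, as in the abstract's definition."""
    n = len(x)
    eye = identity(n)
    d = [[x[i][j] - eye[i][j] for j in range(n)] for i in range(n)]
    zero = [[0] * n for _ in range(n)]
    return x != eye and matmul(d, d) == zero


def inverse_unipotent_index2(x):
    """(X - I)^2 = 0 gives X(2I - X) = I, so the inverse is 2I - X."""
    n = len(x)
    eye = identity(n)
    return [[2 * eye[i][j] - x[i][j] for j in range(n)] for i in range(n)]


def commutator(x, y):
    """Return X Y X^{-1} Y^{-1} for unipotent X, Y of index 2."""
    xy = matmul(x, y)
    inv = matmul(inverse_unipotent_index2(x), inverse_unipotent_index2(y))
    return matmul(xy, inv)


# Hypothetical example: two elementary 2x2 unipotent matrices of index 2.
X = [[1, 1], [0, 1]]
Y = [[1, 0], [1, 1]]
C = commutator(X, Y)
det_C = C[0][0] * C[1][1] - C[0][1] * C[1][0]
print(C, det_C)
```

The resulting commutator has determinant 1, as any commutator of invertible matrices must; the paper's results concern the converse direction, that is, which determinant-1 matrices factor into few such commutators.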

arXiv:2407.14625 [pdf, other]

Benchmarking deep learning models for bearing fault diagnosis using the CWRU dataset: A multi-label approach

Authors: Rodrigo Kobashikawa Rosa, Danilo Braga, Danilo Silva

Abstract: This paper proposes a novel approach for modeling the problem of fault diagnosis using the Case Western Reserve University (CWRU) bearing fault dataset. Although the dataset is considered a standard reference for testing new algorithms, the typical dataset division suffers from data leakage, as shown by Hendriks et al. (2022) and Abburi et al. (2023), leading to papers reporting over-optimistic results. While their proposed division significantly mitigates this issue, it does not eliminate it entirely. Moreover, their proposed multi-class classification task can still lead to an unrealistic scenario by excluding the possibility of more than one fault type occurring at the same or different locations. As advocated in this paper, a multi-label formulation (detecting the presence of each type of fault for each location) can solve both issues, leading to a scenario closer to reality. Additionally, this approach mitigates the heavy class imbalance of the CWRU dataset, where faulty cases appear much more frequently than healthy cases, even though the opposite is more likely to occur in practice. A multi-label formulation also enables a more precise evaluation using prevalence-independent evaluation metrics for binary classification, such as the ROC curve. Finally, this paper proposes a more realistic dataset division that allows for more diversity in the training dataset while keeping the division free from data leakage. The results show that this new division can significantly improve performance while enabling a fine-grained error analysis.
As an application of our approach, a comparative benchmark is performed using several state-of-the-art deep learning models applied to 1D and 2D signal representations in time and/or frequency domains.

Submitted 19 July, 2024; originally announced July 2024.
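
The multi-label formulation described in the abstract of arXiv:2407.14625 (one binary target per fault type per location) might be encoded roughly as follows. This is only a sketch: the location names, fault-type names, and the `encode` helper are hypothetical and are not taken from the paper.

```python
# Hypothetical label space: one binary target per (location, fault type) pair.
LOCATIONS = ("drive_end", "fan_end")                 # assumed location names
FAULT_TYPES = ("inner_race", "outer_race", "ball")   # assumed fault-type names
LABELS = [(loc, fault) for loc in LOCATIONS for fault in FAULT_TYPES]


def encode(faults):
    """Map a collection of (location, fault type) pairs to a 0/1 vector.

    An empty collection encodes a healthy bearing (all zeros); several
    simultaneous faults simply set several entries to 1.
    """
    present = set(faults)
    unknown = present - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if label in present else 0 for label in LABELS]


healthy = encode([])
combined = encode([("drive_end", "inner_race"), ("fan_end", "ball")])
print(healthy, combined)
```

Unlike a single multi-class target, this encoding can represent a healthy sample and co-occurring faults, and each output can be scored separately with a binary metric such as the ROC curve mentioned in the abstract.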

arXiv:2405.17706 [pdf, other]

Video Enriched Retrieval Augmented Generation Using Aligned Video Captions

Authors: Kevin Dela Rosa

Abstract: In this work, we propose the use of "aligned visual captions" as a mechanism for integrating information contained within videos into retrieval augmented generation (RAG) based chat assistant systems. These captions can describe the visual and audio content of videos in a large corpus while being in a textual format that is easy to reason about and incorporate into large language model (LLM) prompts; they also typically require less multimedia content to be inserted into the multimodal LLM context window, which typical configurations can aggressively fill by sampling video frames from the source video. Furthermore, visual captions can be adapted to specific use cases by prompting the original foundation model or captioner for particular visual details, or by fine-tuning. In hopes of helping advance progress in this area, we curate a dataset and describe automatic evaluation procedures on common RAG tasks.

Submitted 27 May, 2024; originally announced May 2024.

Comments: SIGIR 2024 Workshop on Multimodal Representation and Retrieval (MRR 2024)

arXiv:2312.01671 [pdf, other]

Multimodality-guided Image Style Transfer using Cross-modal GAN Inversion

Authors: Hanyu Wang, Pengxiang Wu, Kevin Dela Rosa, Chen Wang, Abhinav Shrivastava

Abstract: Image Style Transfer (IST) is an interdisciplinary topic of computer vision and art that continuously attracts researchers' interests. Different from traditional Image-guided Image Style Transfer (IIST) methods that require a style reference image as input to define the desired style, recent works start to tackle the problem in a text-guided manner, i.e., Text-guided Image Style Transfer (TIST). Compared to IIST, such approaches provide more flexibility with text-specified styles, which are useful in scenarios where the style is hard to define with reference images. Unfortunately, many TIST approaches produce undesirable artifacts in the transferred images. To address this issue, we present a novel method to achieve much improved style transfer based on text guidance. Meanwhile, to offer more flexibility than IIST and TIST, our method allows style inputs from multiple sources and modalities, enabling MultiModality-guided Image Style Transfer (MMIST). Specifically, we realize MMIST with a novel cross-modal GAN inversion method, which generates style representations consistent with specified styles. Such style representations facilitate style transfer and in principle generalize any IIST method to MMIST. Large-scale experiments and user studies demonstrate that our method achieves state-of-the-art performance on the TIST task. Furthermore, comprehensive qualitative results confirm the effectiveness of our method on the MMIST task and cross-modal style interpolation.

Submitted 4 December, 2023; originally announced December 2023.

Comments: WACV 2024. Project website: https://hywang66.github.io/mmist/

arXiv:2309.16249 [pdf, other]

FORB: A Flat Object Retrieval Benchmark for Universal Image Embedding

Authors: Pengxiang Wu, Siman Wang, Kevin Dela Rosa, Derek Hao Hu

Abstract: Image retrieval is a fundamental task in computer vision. Despite recent advances in this field, many techniques have been evaluated on a limited number of domains, with a small number of instance categories. Notably, most existing works only consider domains like 3D landmarks, making it difficult to generalize the conclusions made by these works to other domains, e.g., logo and other 2D flat objects. To bridge this gap, we introduce a new dataset for benchmarking visual search methods on flat images with diverse patterns. Our flat object retrieval benchmark (FORB) supplements the commonly adopted 3D object domain, and more importantly, it serves as a testbed for assessing the image embedding quality on out-of-distribution domains. In this benchmark we investigate the retrieval accuracy of representative methods in terms of candidate ranks, as well as matching score margin, a viewpoint which is largely ignored by many works. Our experiments not only highlight the challenges and rich heterogeneity of FORB, but also reveal the hidden properties of different retrieval strategies. The proposed benchmark is a growing project and we expect to expand it in both quantity and variety of objects. The dataset and supporting code are available at https://github.com/pxiangwu/FORB/.

Submitted 28 September, 2023; originally announced September 2023.

Comments: NeurIPS 2023 Datasets and Benchmarks Track

arXiv:2011.10678 [pdf, other]

Open-Vocabulary Object Detection Using Captions

Authors: Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, Shih-Fu Chang

Abstract: Despite the remarkable accuracy of deep neural networks in object detection, they are costly to train and scale due to supervision requirements. Particularly, learning more object categories typically requires proportionally more bounding box annotations. Weakly supervised and zero-shot learning techniques have been explored to scale object detectors to more categories with less supervision, but they have not been as successful and widely adopted as supervised models. In this paper, we put forth a novel formulation of the object detection problem, namely open-vocabulary object detection, which is more general, more practical, and more effective than weakly supervised and zero-shot approaches. We propose a new method to train object detectors using bounding box annotations for a limited set of object categories, as well as image-caption pairs that cover a larger variety of objects at a significantly lower cost. We show that the proposed method can detect and localize objects for which no bounding box annotation is provided during training, at a significantly higher accuracy than zero-shot approaches. Meanwhile, objects with bounding box annotation can be detected almost as accurately as supervised methods, which is significantly better than weakly supervised baselines. Accordingly, we establish a new state of the art for scalable object detection.

Submitted 14 March, 2021; v1 submitted 20 November, 2020; originally announced November 2020.

Comments: To be presented at CVPR 2021 (oral paper)

arXiv:2004.05288 [pdf, other]

Location of Ritz values in the numerical range of normal matrices

Authors: Kennett L. Dela Rosa, Hugo J. Woerdeman

Abstract: Let $μ_1$ be a complex number in the numerical range $W(A)$ of a normal matrix $A$. In the case when no eigenvalues of $A$ lie in the interior of $W(A)$, we identify the smallest convex region containing all possible complex numbers $μ_2$ for which $\begin{bmatrix}μ_1& *\\0& μ_2\end{bmatrix}$ is a $2$-by-$2$ compression of $A$.

Submitted 9 May, 2020; v1 submitted 10 April, 2020; originally announced April 2020.

Comments: 32 pages

MSC Class: 15A18; 15A29; 15A60; 47A12; 47A20

arXiv:2002.05069 [pdf]

Real-time forecasts of the 2019-nCoV epidemic in China from February 5th to February 24th, 2020

Authors: K. Roosa, Y. Lee, R. Luo, A. Kirpich, R. Rothenberg, J. M. Hyman, P. Yan, G. Chowell

Abstract: The initial cluster of severe pneumonia cases that triggered the 2019-nCoV epidemic was identified in Wuhan, China in December 2019. While early cases of the disease were linked to a wet market, human-to-human transmission has driven the rapid spread of the virus throughout China. The ongoing outbreak presents a challenge for modelers, as limited data are available on the early growth trajectory, and the epidemiological characteristics of the novel coronavirus are yet to be fully elucidated. We provide timely short-term forecasts of the cumulative number of confirmed reported cases in Hubei province, the epicenter of the epidemic, and for the overall trajectory in China, excluding the province of Hubei. We collect daily reported cumulative case data for the 2019-nCoV outbreak for each Chinese province from the National Health Commission of China. Here, we provide 5, 10, and 15 day forecasts for five consecutive days, February 5th through February 9th, with quantified uncertainty based on a generalized logistic growth model, the Richards growth model, and a sub-epidemic wave model. Our most recent forecasts reported here based on data up until February 9, 2020, largely agree across the three models presented and suggest an average range of 7,409-7,496 additional cases in Hubei and 1,128-1,929 additional cases in other provinces within the next five days. Models also predict an average total cumulative case count between 37,415 - 38,028 in Hubei and 11,588 - 13,499 in other provinces by February 24, 2020.
Mean estimates and uncertainty bounds for both Hubei and other provinces have remained relatively stable in the last three reporting dates (February 7th - 9th). Our forecasts suggest that the containment strategies implemented in China are successfully reducing transmission and that the epidemic growth has slowed in recent days.

Submitted 12 February, 2020; originally announced February 2020.

Comments: 6 figures
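
Two of the growth models named in the abstract of arXiv:2002.05069 can be sketched as follows. The ODE forms used here, $C'(t)=rC^p(1-C/K)$ for the generalized logistic model and $C'(t)=rC\,[1-(C/K)^a]$ for the Richards model, are the commonly used ones and are an assumption, since the abstract does not state them. All parameter values are hypothetical, the integrator is a plain forward Euler scheme, and none of this reproduces the paper's fitting or uncertainty quantification.

```python
def generalized_logistic(r, p, K):
    """Assumed form: dC/dt = r * C**p * (1 - C/K)."""
    return lambda c: r * c ** p * (1 - c / K)


def richards(r, a, K):
    """Assumed form: dC/dt = r * C * (1 - (C/K)**a)."""
    return lambda c: r * c * (1 - (c / K) ** a)


def simulate(rate, c0, days, dt=0.1):
    """Forward-Euler integration of dC/dt = rate(C) from C(0) = c0."""
    steps = int(round(days / dt))
    c = c0
    trajectory = [c]
    for _ in range(steps):
        c += dt * rate(c)
        trajectory.append(c)
    return trajectory


# Hypothetical parameters, chosen only to show the shape of the curves.
K = 1000.0
glm_curve = simulate(generalized_logistic(r=0.5, p=0.9, K=K), c0=10.0, days=30)
richards_curve = simulate(richards(r=0.5, a=1.2, K=K), c0=10.0, days=30)
print(round(glm_curve[-1], 1), round(richards_curve[-1], 1))
```

Both curves rise from the initial count and level off below the assumed final size `K`; in a forecasting setting, parameters such as `r`, `p`, `a`, and `K` would be estimated from the reported cumulative case data rather than fixed by hand.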

arXiv:1806.00941 [pdf, ps, other]

Bounds for Finite Semiprimitive Permutation Groups: Order, Base Size, and Minimal Degree

Authors: Luke Morgan, Cheryl E. Praeger, Kyle Rosa

Abstract: In this paper we study finite semiprimitive permutation groups, that is, groups in which each normal subgroup is transitive or semiregular. We give bounds on the order, base size, minimal degree, fixity, and chief length of an arbitrary finite semiprimitive group in terms of its degree. To establish these bounds, we classify finite semiprimitive groups that induce the alternating or symmetric group on the set of orbits of an intransitive normal subgroup.

Submitted 3 June, 2018; originally announced June 2018.

MSC Class: 20B15; 20H30; 20B05

arXiv:1712.05520 [pdf, ps, other]

Bounding the composition length of primitive permutation groups and completely reducible linear groups

Authors: S. P. Glasby, Cheryl E. Praeger, Kyle Rosa, Gabriel Verret

Abstract: We obtain upper bounds on the composition length of a finite permutation group in terms of the degree and the number of orbits, and analogous bounds for primitive, quasiprimitive and semiprimitive groups. Similarly, we obtain upper bounds on the composition length of a finite completely reducible linear group in terms of some of its parameters. In almost all cases we show that the bounds are sharp, and describe the extremal examples.

Submitted 14 March, 2018; v1 submitted 14 December, 2017; originally announced December 2017.

Comments: 23 pages; a few minor corrections following the referee's comments

MSC Class: 20B15; 20H30; 20B05

arXiv:1008.4385 [pdf, ps, other]

doi 10.1088/0004-637X/725/1/1082

The Advection of Supergranules by the Sun's Axisymmetric Flows

Authors: David H. Hathaway, Peter E. Williams, Kevin Dela Rosa, Manfred Cuntz

Abstract: We show that the motions of supergranules are consistent with a model in which they are simply advected by the axisymmetric flows in the Sun's surface shear layer. We produce a 10-day series of simulated Doppler images at a 15-minute cadence that reproduces most spatial and temporal characteristics seen in the SOHO/MDI Doppler data. Our simulated data have a spectrum of cellular flows with just two components -- a granule component that peaks at spherical wavenumbers of about 4000 and a supergranule component that peaks at wavenumbers of about 110. We include the advection of these cellular components by the axisymmetric flows -- differential rotation and meridional flow -- whose variations with latitude and depth (wavenumber) are consistent with observations. We mimic the evolution of the cellular pattern by introducing random variations to the phases of the spectral components at rates that reproduce the levels of cross-correlation as functions of time and latitude. Our simulated data do not include any wave-like characteristics for the supergranules yet can reproduce the rotation characteristics previously attributed to wave-like behavior. We find rotation rates which appear faster than the actual rotation rates and attribute this to the projection effects. We find that the measured meridional flow does accurately represent the actual flow and that the observations indicate poleward flow to $65^\circ$-$70^\circ$ latitude with equatorward counter cells in the polar regions.

Submitted 25 August, 2010; originally announced August 2010.

Comments: 15 pages, 8 figures

Showing 1–13 of 13 results for author: Rosa, K