-
Navigation of a Three-Link Microswimmer via Deep Reinforcement Learning
Authors:
Yuyang Lai,
Sina Heydari,
On Shun Pak,
Yi Man
Abstract:
Motile microorganisms develop effective swimming gaits to adapt to complex biological environments. Translating this adaptability to smart microrobots presents significant challenges in motion planning and stroke design. In this work, we explore the use of reinforcement learning (RL) to develop stroke patterns for targeted navigation in a three-link swimmer model at low Reynolds numbers. Specifically, we design two RL-based strategies: one focusing on maximizing velocity (Velocity-Focused Strategy) and another balancing velocity with energy consumption (Energy-Aware Strategy). Our results demonstrate how the use of different reward functions influences the resulting stroke patterns developed via RL, which are compared with those obtained from traditional optimization methods. Furthermore, we showcase the capability of the RL-powered swimmer in adapting its stroke patterns in performing different navigation tasks, including tracing complex trajectories and pursuing moving targets. Taken together, this work highlights the potential of reinforcement learning as a versatile tool for designing efficient and adaptive microswimmers capable of sophisticated maneuvers in complex environments.
Submitted 29 May, 2025;
originally announced June 2025.
-
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought
Authors:
Yunze Man,
De-An Huang,
Guilin Liu,
Shiwei Sheng,
Shilong Liu,
Liang-Yan Gui,
Jan Kautz,
Yu-Xiong Wang,
Zhiding Yu
Abstract:
Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks, yet they often struggle with vision-centric scenarios where precise visual focus is needed for accurate reasoning. In this paper, we introduce Argus to address these limitations with a new visual attention grounding mechanism. Our approach employs object-centric grounding as visual chain-of-thought signals, enabling more effective goal-conditioned visual attention during multimodal reasoning tasks. Evaluations on diverse benchmarks demonstrate that Argus excels in both multimodal reasoning tasks and referring object grounding tasks. Extensive analysis further validates various design choices of Argus, and reveals the effectiveness of explicit language-guided visual region-of-interest engagement in MLLMs, highlighting the importance of advancing multimodal intelligence from a visual-centric perspective. Project page: https://yunzeman.github.io/argus/
Submitted 29 May, 2025;
originally announced May 2025.
-
Construct to Commitment: The Effect of Narratives on Economic Growth
Authors:
Hanyuan Jiang,
Yi Man
Abstract:
We study how government-led narratives through mass media evolve from construct, a mechanism for framing expectations, into commitment, a sustainable pillar for growth. We propose the "Narratives-Construct-Commitment (NCC)" framework outlining the mechanism and institutionalization of narratives, and formalize it as a dynamic Bayesian game. Using the Innovation-Driven Development Strategy (2016) as a case study, we identify the narrative shock from high-frequency financial data and trace its impact using the local projection method. By shaping expectations, credible narratives institutionalize investment incentives, channel resources into R&D, and facilitate sustained improvements in total factor productivity (TFP). Our findings strive to provide insights into the New Quality Productive Forces initiative, highlighting the role of narratives in transforming vision into tangible economic growth.
Submitted 29 April, 2025;
originally announced April 2025.
-
AgMMU: A Comprehensive Agricultural Multimodal Understanding and Reasoning Benchmark
Authors:
Aruna Gauba,
Irene Pi,
Yunze Man,
Ziqi Pang,
Vikram S. Adve,
Yu-Xiong Wang
Abstract:
We curate a dataset AgMMU for evaluating and developing vision-language models (VLMs) to produce factually accurate answers for knowledge-intensive expert domains. Our AgMMU concentrates on one of the most socially beneficial domains, agriculture, which requires connecting detailed visual observation with precise knowledge to diagnose, e.g., pest identification, management instructions, etc. As a core uniqueness of our dataset, all facts, questions, and answers are extracted from 116,231 conversations between real-world users and authorized agricultural experts. After a three-step dataset curation pipeline with GPT-4o, LLaMA models, and human verification, AgMMU features an evaluation set of 5,460 multiple-choice questions (MCQs) and open-ended questions (OEQs). We also provide a development set that contains 205,399 pieces of agricultural knowledge information, including disease identification, symptoms descriptions, management instructions, insect and pest identification, and species identification. As a multimodal factual dataset, it reveals that existing VLMs face significant challenges with questions requiring both detailed perception and factual knowledge. Moreover, open-source VLMs still demonstrate a substantial performance gap compared to proprietary ones. To advance knowledge-intensive VLMs, we conduct fine-tuning experiments using our development set, which improves LLaVA-1.5 evaluation accuracy by up to 3.1%. We hope that AgMMU can serve both as an evaluation benchmark dedicated to agriculture and a development suite for incorporating knowledge-intensive expertise into general-purpose VLMs.
Submitted 14 April, 2025;
originally announced April 2025.
-
PaintScene4D: Consistent 4D Scene Generation from Text Prompts
Authors:
Vinayak Gupta,
Yunze Man,
Yu-Xiong Wang
Abstract:
Recent advances in diffusion models have revolutionized 2D and 3D content creation, yet generating photorealistic dynamic 4D scenes remains a significant challenge. Existing dynamic 4D generation methods typically rely on distilling knowledge from pre-trained 3D generative models, often fine-tuned on synthetic object datasets. Consequently, the resulting scenes tend to be object-centric and lack photorealism. While text-to-video models can generate more realistic scenes with motion, they often struggle with spatial understanding and provide limited control over camera viewpoints during rendering. To address these limitations, we present PaintScene4D, a novel text-to-4D scene generation framework that departs from conventional multi-view generative models in favor of a streamlined architecture that harnesses video generative models trained on diverse real-world datasets. Our method first generates a reference video using a video generation model, and then employs a strategic camera array selection for rendering. We apply a progressive warping and inpainting technique to ensure both spatial and temporal consistency across multiple viewpoints. Finally, we optimize multi-view images using a dynamic renderer, enabling flexible camera control based on user preferences. Adopting a training-free architecture, our PaintScene4D efficiently produces realistic 4D scenes that can be viewed from arbitrary trajectories. The code will be made publicly available. Our project page is at https://paintscene4d.github.io/
Submitted 28 March, 2025; v1 submitted 5 December, 2024;
originally announced December 2024.
-
RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
Authors:
Ziqi Pang,
Tianyuan Zhang,
Fujun Luan,
Yunze Man,
Hao Tan,
Kai Zhang,
William T. Freeman,
Yu-Xiong Wang
Abstract:
We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Unlike previous decoder-only AR models that rely on a predefined generation order, RandAR removes this inductive bias, unlocking new capabilities in decoder-only generation. Our essential design enables random order by inserting a "position instruction token" before each image token to be predicted, representing the spatial location of the next image token. Trained on randomly permuted token sequences -- a more challenging task than fixed-order generation, RandAR achieves comparable performance to its conventional raster-order counterpart. More importantly, decoder-only transformers trained from random orders acquire new capabilities. For the efficiency bottleneck of AR models, RandAR adopts parallel decoding with KV-Cache at inference time, enjoying 2.5x acceleration without sacrificing generation quality. Additionally, RandAR supports inpainting, outpainting and resolution extrapolation in a zero-shot manner. We hope RandAR inspires new directions for decoder-only visual generation models and broadens their applications across diverse scenarios. Our project page is at https://rand-ar.github.io/.
Submitted 2 December, 2024;
originally announced December 2024.
-
AdaCM$^2$: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction
Authors:
Yuanbin Man,
Ying Huang,
Chengming Zhang,
Bingzhe Li,
Wei Niu,
Miao Yin
Abstract:
The advancements in large language models (LLMs) have propelled the improvement of video understanding tasks by incorporating LLMs with visual models. However, most existing LLM-based models (e.g., VideoLLaMA, VideoChat) are constrained to processing short-duration videos. Recent work attempts to understand long-term videos by extracting and compressing visual features into a fixed memory size. Nevertheless, those methods leverage only the visual modality to merge video tokens and overlook the correlation between visual and textual queries, leading to difficulties in effectively handling complex question-answering tasks. To address the challenges of long videos and complex prompts, we propose AdaCM$^2$, which, for the first time, introduces an adaptive cross-modality memory reduction approach to video-text alignment in an auto-regressive manner on video streams. Our extensive experiments on various video understanding tasks, such as video captioning, video question answering, and video classification, demonstrate that AdaCM$^2$ achieves state-of-the-art performance across multiple datasets while significantly reducing memory usage. Notably, it achieves a 4.5% improvement across multiple tasks in the LVU dataset with a GPU memory consumption reduction of up to 65%.
Submitted 4 April, 2025; v1 submitted 19 November, 2024;
originally announced November 2024.
-
Differentiable architecture search with multi-dimensional attention for spiking neural networks
Authors:
Yilei Man,
Linhai Xie,
Shushan Qiao,
Yumei Zhou,
Delong Shang
Abstract:
Spiking Neural Networks (SNNs) have gained enormous popularity in the field of artificial intelligence due to their low power consumption. However, the majority of SNN methods directly inherit the structure of Artificial Neural Networks (ANNs), usually leading to sub-optimal model performance in SNNs. To alleviate this problem, we integrate a Neural Architecture Search (NAS) method and propose Multi-Attention Differentiable Architecture Search (MA-DARTS) to directly automate the search for the optimal network structure of SNNs. Initially, we defined a differentiable two-level search space and conducted experiments within the micro architecture under a fixed layer. Then, we incorporated a multi-dimensional attention mechanism and implemented the MA-DARTS algorithm in this search space. Comprehensive experiments demonstrate that our model achieves state-of-the-art performance on classification compared to other methods under the same parameters, with 94.40% accuracy on the CIFAR10 dataset and 76.52% accuracy on the CIFAR100 dataset. Additionally, we monitored and assessed the number of spikes (NoS) in each cell during the whole experiment. Notably, the number of spikes of the whole model stabilized at approximately 110K in validation and 100K in training on the datasets.
Submitted 1 November, 2024;
originally announced November 2024.
-
SceneCraft: Layout-Guided 3D Scene Generation
Authors:
Xiuyu Yang,
Yunze Man,
Jun-Kun Chen,
Yu-Xiong Wang
Abstract:
The creation of complex 3D scenes tailored to user specifications has been a tedious and challenging task with traditional 3D modeling tools. Although some pioneering methods have achieved automatic text-to-3D generation, they are generally limited to small-scale scenes with restricted control over the shape and texture. We introduce SceneCraft, a novel method for generating detailed indoor scenes that adhere to textual descriptions and spatial layout preferences provided by users. Central to our method is a rendering-based technique, which converts 3D semantic layouts into multi-view 2D proxy maps. Furthermore, we design a semantic and depth conditioned diffusion model to generate multi-view images, which are used to learn a neural radiance field (NeRF) as the final scene representation. Without the constraints of panorama image generation, we surpass previous methods in supporting complicated indoor space generation beyond a single room, even as complicated as a whole multi-bedroom apartment with irregular shapes and layouts. Through experimental analysis, we demonstrate that our method significantly outperforms existing approaches in complex indoor scene generation with diverse textures, consistent geometry, and realistic visual quality. Code and more results are available at: https://orangesodahub.github.io/SceneCraft
Submitted 8 May, 2025; v1 submitted 11 October, 2024;
originally announced October 2024.
-
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
Authors:
Yunze Man,
Shuhong Zheng,
Zhipeng Bao,
Martial Hebert,
Liang-Yan Gui,
Yu-Xiong Wang
Abstract:
Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations in language-related tasks. These insights challenge some conventional understandings, provide novel perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks. Code: https://github.com/YunzeMan/Lexicon3D
Submitted 8 May, 2025; v1 submitted 5 September, 2024;
originally announced September 2024.
-
Floating No More: Object-Ground Reconstruction from a Single Image
Authors:
Yunze Man,
Yichen Sheng,
Jianming Zhang,
Liang-Yan Gui,
Yu-Xiong Wang
Abstract:
Recent advancements in 3D object reconstruction from single images have primarily focused on improving the accuracy of object shapes. Yet, these techniques often fail to accurately capture the inter-relation between the object, ground, and camera. As a result, the reconstructed objects often appear floating or tilted when placed on flat surfaces. This limitation significantly affects 3D-aware image editing applications like shadow rendering and object pose manipulation. To address this issue, we introduce ORG (Object Reconstruction with Ground), a novel task aimed at reconstructing 3D object geometry in conjunction with the ground surface. Our method uses two compact pixel-level representations to depict the relationship between camera, object, and ground. Experiments show that the proposed ORG model can effectively reconstruct object-ground geometry on unseen data, significantly enhancing the quality of shadow generation and pose manipulation compared to conventional single-image 3D reconstruction techniques.
Submitted 26 July, 2024;
originally announced July 2024.
-
Situational Awareness Matters in 3D Vision Language Reasoning
Authors:
Yunze Man,
Liang-Yan Gui,
Yu-Xiong Wang
Abstract:
Being able to carry out complicated vision language reasoning tasks in 3D space represents a significant milestone in developing household robots and human-centered embodied AI. In this work, we demonstrate that a critical and distinct challenge in 3D vision language reasoning is situational awareness, which incorporates two key components: (1) The autonomous agent grounds its self-location based on a language prompt. (2) The agent answers open-ended questions from the perspective of its calculated position. To address this challenge, we introduce SIG3D, an end-to-end Situation-Grounded model for 3D vision language reasoning. We tokenize the 3D scene into sparse voxel representation and propose a language-grounded situation estimator, followed by a situated question answering module. Experiments on the SQA3D and ScanQA datasets show that SIG3D outperforms state-of-the-art models in situation estimation and question answering by a large margin (e.g., an enhancement of over 30% on situation estimation accuracy). Subsequent analysis corroborates our architectural design choices, explores the distinct functions of visual and textual tokens, and highlights the importance of situational awareness in the domain of 3D question answering.
Submitted 26 June, 2024; v1 submitted 11 June, 2024;
originally announced June 2024.
-
The Explicit values of the UBCT, the LBCT and the DBCT of the inverse function
Authors:
Yuying Man,
Nian Li,
Zhen Liu,
Xiangyong Zeng
Abstract:
Substitution boxes (S-boxes) play a significant role in ensuring the resistance of block ciphers against various attacks. The Upper Boomerang Connectivity Table (UBCT), the Lower Boomerang Connectivity Table (LBCT) and the Double Boomerang Connectivity Table (DBCT) of a given S-box are crucial tools to analyze its security concerning specific attacks. However, there are currently no related results for this research. The inverse function is crucial for constructing S-boxes of block ciphers with good cryptographic properties in symmetric cryptography. Therefore, extensive research has been conducted on the inverse function, exploring various properties related to standard attacks. Thanks to the recent advancements in boomerang cryptanalysis, particularly the introduction of concepts such as UBCT, LBCT, and DBCT, this paper aims to further investigate the properties of the inverse function $F(x)=x^{2^n-2}$ over $\mathbb{F}_{2^n}$ for arbitrary $n$. As a consequence, by carrying out certain finer manipulations of solving specific equations over $\mathbb{F}_{2^n}$, we give all entries of the UBCT and LBCT of $F(x)$ over $\mathbb{F}_{2^n}$ for arbitrary $n$. Besides, based on the results of the UBCT and LBCT for the inverse function, we determine that $F(x)$ is hard when $n$ is odd. Furthermore, we completely compute all entries of the DBCT of $F(x)$ over $\mathbb{F}_{2^n}$ for arbitrary $n$. Additionally, we provide the precise number of elements with a given entry by means of the values of some Kloosterman sums. Further, we determine the double boomerang uniformity of $F(x)$ over $\mathbb{F}_{2^n}$ for arbitrary $n$. Our in-depth analysis of the DBCT of $F(x)$ contributes to a better evaluation of the S-box's resistance against boomerang attacks.
Submitted 18 April, 2024;
originally announced April 2024.
-
Analytical Insight of Earth: A Cloud-Platform of Intelligent Computing for Geospatial Big Data
Authors:
Hao Xu,
Yuanbin Man,
Mingyang Yang,
Jichao Wu,
Qi Zhang,
Jing Wang
Abstract:
The rapid accumulation of Earth observation data presents a formidable challenge for the processing capabilities of traditional remote sensing desktop software, particularly when it comes to analyzing expansive geographical areas and prolonged temporal sequences. Cloud computing has emerged as a transformative solution, surmounting the barriers traditionally associated with the management and computation of voluminous datasets. This paper introduces the Analytical Insight of Earth (AI Earth), an innovative remote sensing intelligent computing cloud platform, powered by the robust Alibaba Cloud infrastructure. AI Earth provides an extensive collection of publicly available remote sensing datasets, along with a suite of computational tools powered by a high-performance computing engine. Furthermore, it provides a variety of classic deep learning (DL) models and a novel remote sensing large vision segmentation model tailored to different recognition tasks. The platform enables users to upload their unique samples for model training and to deploy third-party models, thereby increasing the accessibility and openness of DL applications. This platform will facilitate researchers in leveraging remote sensing data for large-scale applied research in areas such as resources, environment, ecology, and climate.
Submitted 26 December, 2023;
originally announced December 2023.
-
On the second-order zero differential spectra of some power functions over finite fields
Authors:
Yuying Man,
Nian Li,
Zejun Xiang,
Xiangyong Zeng
Abstract:
Boukerrou et al. (IACR Trans. Symmetric Cryptol. 2020(1), 331-362) introduced the notion of Feistel Boomerang Connectivity Table (FBCT), the Feistel counterpart of the Boomerang Connectivity Table (BCT), and the Feistel boomerang uniformity (which is the same as the second-order zero differential uniformity in even characteristic). FBCT is a crucial table for the analysis of the resistance of block ciphers to powerful attacks such as differential and boomerang attacks. It is worth noting that the coefficients of FBCT are related to the second-order zero differential spectra of functions. In this paper, by carrying out certain finer manipulations of solving specific equations over the finite field $\mathbb{F}_{p^n}$, we explicitly determine the second-order zero differential spectra of some power functions with low differential uniformity, and show that our considered functions also have low second-order zero differential uniformity. Our study pushes further former investigations on second-order zero differential uniformity and Feistel boomerang differential uniformity for a power function $F$.
Submitted 27 October, 2023;
originally announced October 2023.
-
Frozen Transformers in Language Models Are Effective Visual Encoder Layers
Authors:
Ziqi Pang,
Ziyang Xie,
Yunze Man,
Yu-Xiong Wang
Abstract:
This paper reveals that large language models (LLMs), despite being trained solely on textual data, are surprisingly strong encoders for purely visual tasks in the absence of language. Even more intriguingly, this can be achieved by a simple yet previously overlooked strategy -- employing a frozen transformer block from pre-trained LLMs as a constituent encoder layer to directly process visual tokens. Our work pushes the boundaries of leveraging LLMs for computer vision tasks, significantly departing from conventional practices that typically necessitate a multi-modal vision-language setup with associated language prompts, inputs, or outputs. We demonstrate that our approach consistently enhances performance across a diverse range of tasks, encompassing pure 2D and 3D visual recognition tasks (e.g., image and point cloud classification), temporal modeling tasks (e.g., action recognition), non-semantic tasks (e.g., motion forecasting), and multi-modal tasks (e.g., 2D/3D visual question answering and image-text retrieval). Such improvements are a general phenomenon, applicable to various types of LLMs (e.g., LLaMA and OPT) and different LLM transformer blocks. We additionally propose the information filtering hypothesis to explain the effectiveness of pre-trained LLMs in visual encoding -- the pre-trained LLM transformer blocks discern informative visual tokens and further amplify their effect. This hypothesis is empirically supported by the observation that the feature activation, after training with LLM transformer blocks, exhibits a stronger focus on relevant regions. We hope that our work inspires new perspectives on utilizing LLMs and deepening our understanding of their underlying mechanisms. Code is available at https://github.com/ziqipang/LM4VisualEncoding.
Submitted 6 May, 2024; v1 submitted 19 October, 2023;
originally announced October 2023.
-
In-depth analysis of S-boxes over binary finite fields concerning their differential and Feistel boomerang differential uniformities
Authors:
Yuying Man,
Sihem Mesnager,
Nian Li,
Xiangyong Zeng,
Xiaohu Tang
Abstract:
Substitution boxes (S-boxes) play a significant role in ensuring the resistance of block ciphers against various attacks. The Difference Distribution Table (DDT), the Feistel Boomerang Connectivity Table (FBCT), the Feistel Boomerang Difference Table (FBDT) and the Feistel Boomerang Extended Table (FBET) of a given S-box are crucial tools to analyze its security concerning specific attacks. However, the results on them are rare. In this paper, we investigate the properties of the power function $F(x):=x^{2^{m+1}-1}$ over the finite field $\mathbb{F}_{2^n}$ of order $2^n$ where $n=2m$ or $n=2m+1$ ($m$ stands for a positive integer). As a consequence, by carrying out certain finer manipulations of solving specific equations over $\mathbb{F}_{2^n}$, we give explicit values of all entries of the DDT, the FBCT, the FBDT and the FBET of the investigated power functions. From the theoretical point of view, our study pushes further former investigations on differential and Feistel boomerang differential uniformities for a novel power function $F$. From a cryptographic point of view, when considering a Feistel block cipher involving $F$, our in-depth analysis helps select $F$ resistant to differential attacks, Feistel differential attacks and Feistel boomerang attacks, respectively.
Submitted 4 September, 2023;
originally announced September 2023.
-
DualCross: Cross-Modality Cross-Domain Adaptation for Monocular BEV Perception
Authors:
Yunze Man,
Liang-Yan Gui,
Yu-Xiong Wang
Abstract:
Closing the domain gap between training and deployment and incorporating multiple sensor modalities are two challenging yet critical topics for self-driving. Existing work focuses on only one of these topics, overlooking the simultaneous domain and modality shift that pervasively exists in real-world scenarios. A model trained with multi-sensor data collected in Europe may need to run in Asia with a subset of input sensors available. In this work, we propose DualCross, a cross-modality cross-domain adaptation framework to facilitate the learning of a more robust monocular bird's-eye-view (BEV) perception model, which transfers the point cloud knowledge from a LiDAR sensor in one domain during the training phase to the camera-only testing scenario in a different domain. This work results in the first open analysis of cross-domain cross-sensor perception and adaptation for monocular 3D tasks in the wild. We benchmark our approach on large-scale datasets under a wide range of domain shifts and show state-of-the-art results against various baselines.
Submitted 11 June, 2024; v1 submitted 5 May, 2023;
originally announced May 2023.
-
BotTriNet: A Unified and Efficient Embedding for Social Bots Detection via Metric Learning
Authors:
Jun Wu,
Xuesong Ye,
Yanyuet Man
Abstract:
The rapid and accurate identification of bot accounts in online social networks is an ongoing challenge. In this paper, we propose BOTTRINET, a unified embedding framework that leverages the textual content posted by accounts to detect bots. Our approach is based on the premise that account personalities and habits can be revealed through their contextual content. To achieve this, we designed a triplet network that refines raw embeddings using metric learning techniques. The BOTTRINET framework produces word, sentence, and account embeddings, which we evaluate on a real-world dataset, CRESCI2017, consisting of three bot account categories and five bot sample sets. Our approach achieves state-of-the-art performance on two content-intensive bot sets, with an average accuracy of 98.34% and F1 score of 97.99%. Moreover, our method makes a significant breakthrough on four content-less bot sets, with an average accuracy improvement of 11.52% and an average F1 score increase of 16.70%. Our contribution is twofold: First, we propose a unified and effective framework that combines various embeddings for bot detection. Second, we demonstrate that metric learning techniques can be applied in this context to refine raw embeddings and improve classification performance. Our approach outperforms prior work and sets a new standard for bot detection in social networks.
Submitted 6 May, 2023; v1 submitted 6 April, 2023;
originally announced April 2023.
-
Several new infinite classes of 0-APN power functions over $\mathbb{F}_{2^n}$
Authors:
Yuying Man,
Shizhu Tian,
Nian Li,
Xiangyong Zeng
Abstract:
The investigation of partially APN functions has attracted a lot of research interest recently. In this paper, we present several new infinite classes of 0-APN power functions over $\mathbb{F}_{2^n}$ by using the multivariate method and resultant elimination, and show that these 0-APN power functions are CCZ-inequivalent to the known ones.
Submitted 9 December, 2022;
originally announced December 2022.
-
On the Differential Properties of the Power Mapping $x^{p^m+2}$
Authors:
Yuying Man,
Yongbo Xia,
Chunlei Li,
Tor Helleseth
Abstract:
Let $m$ be a positive integer and $p$ a prime. In this paper, we investigate the differential properties of the power mapping $x^{p^m+2}$ over $\mathbb{F}_{p^n}$, where $n=2m$ or $n=2m-1$. For the case $n=2m$, by transforming the derivative equation of $x^{p^m+2}$ and studying some related equations, we completely determine the differential spectrum of this power mapping. For the case $n=2m-1$, the derivative equation can be transformed to a polynomial of degree $p+3$. The problem is more difficult and we obtain partial results about the differential spectrum of $x^{p^m+2}$.
Submitted 17 April, 2022;
originally announced April 2022.
-
Fast Graph Neural Tangent Kernel via Kronecker Sketching
Authors:
Shunhua Jiang,
Yunze Man,
Zhao Song,
Zheng Yu,
Danyang Zhuo
Abstract:
Many deep learning tasks have to deal with graphs (e.g., protein structures, social networks, source code abstract syntax trees). Due to the importance of these tasks, people turned to Graph Neural Networks (GNNs) as the de facto method for learning on graphs. GNNs have become widely applied due to their convincing performance. Unfortunately, one major barrier to using GNNs is that GNNs require substantial time and resources to train. Recently, Graph Neural Tangent Kernel (GNTK) [Du, Hou, Salakhutdinov, Poczos, Wang and Xu 19] has emerged as a new method for learning on graph data. GNTK is an application of Neural Tangent Kernel (NTK) [Jacot, Gabriel and Hongler 18] (a kernel method) on graph data, and solving NTK regression is equivalent to using gradient descent to train an infinitely wide neural network. The key benefit of using GNTK is that, similar to any kernel method, GNTK's parameters can be solved directly in a single step. This can avoid time-consuming gradient descent. Meanwhile, sketching has become increasingly used in speeding up various optimization problems, including solving kernel regression. Given a kernel matrix of $n$ graphs, using sketching in solving kernel regression can reduce the running time to $o(n^3)$. But unfortunately such methods usually require extensive knowledge about the kernel matrix beforehand, while in the case of GNTK we find that the construction of the kernel matrix is already $O(n^2N^4)$, assuming each graph has $N$ nodes. The kernel matrix construction time can be a major performance bottleneck when the size of graphs $N$ increases. A natural question to ask is thus whether we can speed up the kernel matrix construction to improve GNTK regression's end-to-end running time. This paper provides the first algorithm to construct the kernel matrix in $o(n^2N^3)$ running time.
Submitted 4 December, 2021;
originally announced December 2021.
-
Multi-Echo LiDAR for 3D Object Detection
Authors:
Yunze Man,
Xinshuo Weng,
Prasanna Kumar Sivakuma,
Matthew O'Toole,
Kris Kitani
Abstract:
LiDAR sensors can be used to obtain a wide range of measurement signals other than a simple 3D point cloud, and those signals can be leveraged to improve perception tasks like 3D object detection. A single laser pulse can be partially reflected by multiple objects along its path, resulting in multiple measurements called echoes. Multi-echo measurement can provide information about object contours and semi-transparent surfaces which can be used to better identify and locate objects. LiDAR can also measure surface reflectance (intensity of laser pulse return), as well as ambient light of the scene (sunlight reflected by objects). These signals are already available in commercial LiDAR devices but have not been used in most LiDAR-based detection models. We present a 3D object detection model which leverages the full spectrum of measurement signals provided by LiDAR. First, we propose a multi-signal fusion (MSF) module to combine (1) the reflectance and ambient features extracted with a 2D CNN, and (2) point cloud features extracted using a 3D graph neural network (GNN). Second, we propose a multi-echo aggregation (MEA) module to combine the information encoded in different sets of echo points. Compared with traditional single-echo point cloud methods, our proposed Multi-Signal LiDAR Detector (MSLiD) extracts richer context information from a wider range of sensing measurements and achieves more accurate 3D object detection. Experiments show that by incorporating the multi-modality of LiDAR, our method outperforms the state-of-the-art by up to 9.1%.
Submitted 23 July, 2021;
originally announced July 2021.
-
Multi-Modality Task Cascade for 3D Object Detection
Authors:
Jinhyung Park,
Xinshuo Weng,
Yunze Man,
Kris Kitani
Abstract:
Point clouds and RGB images are naturally complementary modalities for 3D visual understanding: the former provides sparse but accurate locations of points on objects, while the latter contains dense color and texture information. Despite this potential for close sensor fusion, many methods train two models in isolation and use simple feature concatenation to represent 3D sensor data. This separated training scheme results in potentially sub-optimal performance and prevents 3D tasks from being used to benefit 2D tasks that are often useful on their own. To provide a more integrated approach, we propose a novel Multi-Modality Task Cascade network (MTC-RCNN) that leverages 3D box proposals to improve 2D segmentation predictions, which are then used to further refine the 3D boxes. We show that including a 2D network between two stages of 3D modules significantly improves both 2D and 3D task performance. Moreover, to prevent the 3D module from over-relying on the overfitted 2D predictions, we propose a dual-head 2D segmentation training and inference scheme, allowing the 2nd 3D module to learn to interpret imperfect 2D segmentation predictions. Evaluating our model on the challenging SUN RGB-D dataset, we improve upon state-of-the-art results of both single modality and fusion networks by a large margin (+3.8 mAP@0.25). Code will be released at https://github.com/Divadi/MTC_RCNN.
Submitted 8 July, 2021;
originally announced July 2021.
-
Graph Neural Networks for 3D Multi-Object Tracking
Authors:
Xinshuo Weng,
Yongxin Wang,
Yunze Man,
Kris Kitani
Abstract:
3D Multi-object tracking (MOT) is crucial to autonomous systems. Recent work often uses a tracking-by-detection pipeline, where the feature of each object is extracted independently to compute an affinity matrix. Then, the affinity matrix is passed to the Hungarian algorithm for data association. A key process of this pipeline is to learn discriminative features for different objects in order to reduce confusion during data association. To that end, we propose two innovative techniques: (1) instead of obtaining the features for each object independently, we propose a novel feature interaction mechanism by introducing Graph Neural Networks; (2) instead of obtaining the features from either 2D or 3D space as in prior work, we propose a novel joint feature extractor to learn appearance and motion features from 2D and 3D space. Through experiments on the KITTI dataset, our proposed method achieves state-of-the-art 3D MOT performance. Our project website is at http://www.xinshuoweng.com/projects/GNN3DMOT.
Submitted 20 August, 2020;
originally announced August 2020.
-
GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking with Multi-Feature Learning
Authors:
Xinshuo Weng,
Yongxin Wang,
Yunze Man,
Kris Kitani
Abstract:
3D Multi-object tracking (MOT) is crucial to autonomous systems. Recent work uses a standard tracking-by-detection pipeline, where feature extraction is first performed independently for each object in order to compute an affinity matrix. Then the affinity matrix is passed to the Hungarian algorithm for data association. A key process of this standard pipeline is to learn discriminative features for different objects in order to reduce confusion during data association. In this work, we propose two techniques to improve the discriminative feature learning for MOT: (1) instead of obtaining features for each object independently, we propose a novel feature interaction mechanism by introducing the Graph Neural Network. As a result, the feature of one object is informed of the features of other objects so that the object feature can lean towards objects with similar features (i.e., objects probably with the same ID) and deviate from objects with dissimilar features (i.e., objects probably with different IDs), leading to a more discriminative feature for each object; (2) instead of obtaining the feature from either 2D or 3D space as in prior work, we propose a novel joint feature extractor to learn appearance and motion features from 2D and 3D space simultaneously. As features from different modalities often have complementary information, the joint feature can be more discriminative than the feature from each individual modality. To ensure that the joint feature extractor does not heavily rely on one modality, we also propose an ensemble training paradigm. Through extensive evaluation, our proposed method achieves state-of-the-art performance on KITTI and nuScenes 3D MOT benchmarks. Our code will be made available at https://github.com/xinshuoweng/GNN3DMOT
Submitted 12 June, 2020;
originally announced June 2020.
-
GhostImage: Remote Perception Attacks against Camera-based Image Classification Systems
Authors:
Yanmao Man,
Ming Li,
Ryan Gerdes
Abstract:
In vision-based object classification systems imaging sensors perceive the environment and machine learning is then used to detect and classify objects for decision-making purposes; e.g., to maneuver an automated vehicle around an obstacle or to raise an alarm to indicate the presence of an intruder in surveillance settings. In this work we demonstrate how the perception domain can be remotely and unobtrusively exploited to enable an attacker to create spurious objects or alter an existing object. An automated system relying on a detection/classification framework subject to our attack could be made to undertake actions with catastrophic results due to attacker-induced misperception.
We focus on camera-based systems and show that it is possible to remotely project adversarial patterns into camera systems by exploiting two common effects in optical imaging systems, viz., lens flare/ghost effects and auto-exposure control. To improve the robustness of the attack to channel effects, we generate optimal patterns by integrating adversarial machine learning techniques with a trained end-to-end channel model. We experimentally demonstrate our attacks using a low-cost projector, on three different image datasets, in indoor and outdoor environments, and with three different cameras. Experimental results show that, depending on the projector-camera distance, attack success rates can reach as high as 100% and under targeted conditions.
Submitted 23 June, 2020; v1 submitted 21 January, 2020;
originally announced January 2020.
-
A Semi-Supervised Framework for Automatic Pixel-Wise Breast Cancer Grading of Histological Images
Authors:
Yanyuet Man,
Xiangyun Ding,
Xingcheng Yao,
Han Bao
Abstract:
Throughout the world, breast cancer is one of the leading causes of female death. Recently, deep learning methods have been developed to automatically grade breast cancer from histological slides. However, the performance of existing deep learning models is limited due to the lack of large annotated biomedical datasets. One promising way to relieve the annotating burden is to leverage the unannotated datasets to enhance the trained model. In this paper, we first apply an active learning method to breast cancer grading, and propose a semi-supervised framework based on an expectation maximization (EM) model. The proposed EM approach is based on collaborative filtering among the annotated and unannotated datasets. The collaborative filtering method effectively extracts useful and credible datasets from the unannotated images. Results of pixel-wise prediction of whole-slide images (WSI) demonstrate that the proposed method not only outperforms state-of-the-art methods, but also significantly reduces the annotation cost by over 70%.
Submitted 8 March, 2022; v1 submitted 2 July, 2019;
originally announced July 2019.
-
Deep Q Learning Driven CT Pancreas Segmentation with Geometry-Aware U-Net
Authors:
Yunze Man,
Yangsibo Huang,
Junyi Feng,
Xi Li,
Fei Wu
Abstract:
Segmentation of the pancreas is important for medical image analysis, yet it faces great challenges of class imbalance, background distractions and non-rigid geometrical features. To address these difficulties, we introduce a Deep Q Network (DQN) driven approach with deformable U-Net to accurately segment the pancreas by explicitly interacting with contextual information and extracting anisotropic features from the pancreas. The DQN-based model learns a context-adaptive localization policy to produce a visually tightened and precise localization bounding box of the pancreas. Furthermore, deformable U-Net captures geometry-aware information of the pancreas by learning geometrically deformable filters for feature extraction. Experiments on the NIH dataset validate the effectiveness of the proposed framework in pancreas segmentation.
Submitted 19 April, 2019;
originally announced April 2019.
-
GroundNet: Monocular Ground Plane Normal Estimation with Geometric Consistency
Authors:
Yunze Man,
Xinshuo Weng,
Xi Li,
Kris Kitani
Abstract:
We focus on estimating the 3D orientation of the ground plane from a single image. We formulate the problem as an inter-mingled multi-task prediction problem by jointly optimizing for pixel-wise surface normal direction, ground plane segmentation, and depth estimates. Specifically, our proposed model, GroundNet, first estimates the depth and surface normal in two separate streams, from which two ground plane normals are then computed deterministically. To leverage the geometric correlation between depth and normal, we propose to add a consistency loss on top of the computed ground plane normals. In addition, a ground segmentation stream is used to isolate the ground regions so that we can selectively back-propagate parameter updates through only the ground regions in the image. Our method achieves the top-ranked performance on ground plane normal estimation and horizon line detection on the real-world outdoor datasets of ApolloScape and KITTI, improving the performance of previous art by up to 17.7% relatively.
Submitted 9 August, 2019; v1 submitted 17 November, 2018;
originally announced November 2018.
-
IntelliAd Understanding In-APP Ad Costs From Users Perspective
Authors:
Cuiyun Gao,
Hui Xu,
Yichuan Man,
Yangfan Zhou,
Michael R. Lyu
Abstract:
Ads are an important revenue source for mobile app development, especially for free apps, whose expense can be compensated by ad revenue. The benefits of ads also come with costs. For example, too many ads can interfere with the user experience, leading to lower user retention and ultimately reduced earnings. In this paper, we aim to understand ad costs from the users' perspective. We utilize app reviews, which are widely recognized as expressions of user perceptions, to identify the ad costs that concern users. Four types of ad costs, i.e., number of ads, memory/CPU overhead, traffic usage, and battery consumption, have been discovered from user reviews. To verify whether different ad integration schemes generate different ad costs, we first obtain the commonly used ad schemes from 104 popular apps, and then design a framework named IntelliAd to automatically measure the ad costs of each scheme. To demonstrate whether these costs indeed influence users' reactions, we finally observe the correlations between the measured ad costs and the user perceptions. We discover that the costs related to memory/CPU overhead and battery consumption are of greater concern to users, while traffic usage is of less concern in spite of its obvious variations among different schemes in the experiments. Our experimental results provide developers with suggestions on better incorporating ads into apps while ensuring the user experience.
Submitted 12 July, 2016;
originally announced July 2016.