-
RoboPearls: Editable Video Simulation for Robot Manipulation
Authors:
Tao Tang,
Likui Zhang,
Youpeng Wen,
Kaidong Zhang,
Jia-Wang Bian,
xia zhou,
Tianyi Yan,
Kun Zhan,
Peng Jia,
Hefeng Wu,
Liang Lin,
Xiaodan Liang
Abstract:
The development of generalist robot manipulation policies has seen significant progress, driven by large-scale demonstration data across diverse environments. However, the high cost and inefficiency of collecting real-world demonstrations hinder the scalability of data acquisition. While existing simulation platforms enable controlled environments for robotic learning, the challenge of bridging th…
▽ More
The development of generalist robot manipulation policies has seen significant progress, driven by large-scale demonstration data across diverse environments. However, the high cost and inefficiency of collecting real-world demonstrations hinder the scalability of data acquisition. While existing simulation platforms enable controlled environments for robotic learning, the challenge of bridging the sim-to-real gap remains. To address these challenges, we propose RoboPearls, an editable video simulation framework for robotic manipulation. Built on 3D Gaussian Splatting (3DGS), RoboPearls enables the construction of photo-realistic, view-consistent simulations from demonstration videos, and supports a wide range of simulation operators, including various object manipulations, powered by advanced modules like Incremental Semantic Distillation (ISD) and 3D regularized NNFM Loss (3D-NNFM). Moreover, by incorporating large language models (LLMs), RoboPearls automates the simulation production process in a user-friendly manner through flexible command interpretation and execution. Furthermore, RoboPearls employs a vision-language model (VLM) to analyze robotic learning issues to close the simulation loop for performance enhancement. To demonstrate the effectiveness of RoboPearls, we conduct extensive experiments on multiple datasets and scenes, including RLBench, COLOSSEUM, Ego4D, Open X-Embodiment, and a real-world robot, which demonstrate our satisfactory simulation performance.
△ Less
Submitted 28 June, 2025;
originally announced June 2025.
-
Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning
Authors:
Sheng Chen,
Peiyu He,
Jiaxin Hu,
Ziyang Liu,
Yansheng Wang,
Tao Xu,
Chi Zhang,
Chongchong Zhang,
Chao An,
Shiyu Cai,
Duo Cao,
Kangping Chen,
Shuai Chu,
Tianwei Chu,
Mingdi Dan,
Min Du,
Weiwei Fang,
Pengyou Fu,
Junkai Hu,
Xiaowei Jiang,
Zhaodi Jiang,
Fuxuan Li,
Jun Li,
Minghui Li,
Mingyao Li
, et al. (46 additional authors not shown)
Abstract:
Modern robot navigation systems encounter difficulties in diverse and complex indoor environments. Traditional approaches rely on multiple modules with small models or rule-based systems and thus lack adaptability to new environments. To address this, we developed Astra, a comprehensive dual-model architecture, Astra-Global and Astra-Local, for mobile robot navigation. Astra-Global, a multimodal L…
▽ More
Modern robot navigation systems encounter difficulties in diverse and complex indoor environments. Traditional approaches rely on multiple modules with small models or rule-based systems and thus lack adaptability to new environments. To address this, we developed Astra, a comprehensive dual-model architecture, Astra-Global and Astra-Local, for mobile robot navigation. Astra-Global, a multimodal LLM, processes vision and language inputs to perform self and goal localization using a hybrid topological-semantic graph as the global map, and outperforms traditional visual place recognition methods. Astra-Local, a multitask network, handles local path planning and odometry estimation. Its 4D spatial-temporal encoder, trained through self-supervised learning, generates robust 4D features for downstream tasks. The planning head utilizes flow matching and a novel masked ESDF loss to minimize collision risks for generating local trajectories, and the odometry head integrates multi-sensor inputs via a transformer encoder to predict the relative pose of the robot. Deployed on real in-house mobile robots, Astra achieves high end-to-end mission success rate across diverse indoor environments.
△ Less
Submitted 6 June, 2025;
originally announced June 2025.
-
Geometry-aware Temporal Aggregation Network for Monocular 3D Lane Detection
Authors:
Huan Zheng,
Wencheng Han,
Tianyi Yan,
Cheng-zhong Xu,
Jianbing Shen
Abstract:
Monocular 3D lane detection aims to estimate 3D position of lanes from frontal-view (FV) images. However, current monocular 3D lane detection methods suffer from two limitations, including inaccurate geometric information of the predicted 3D lanes and difficulties in maintaining lane integrity. To address these issues, we seek to fully exploit the potential of multiple input frames. First, we aim…
▽ More
Monocular 3D lane detection aims to estimate 3D position of lanes from frontal-view (FV) images. However, current monocular 3D lane detection methods suffer from two limitations, including inaccurate geometric information of the predicted 3D lanes and difficulties in maintaining lane integrity. To address these issues, we seek to fully exploit the potential of multiple input frames. First, we aim at enhancing the ability to perceive the geometry of scenes by leveraging temporal geometric consistency. Second, we strive to improve the integrity of lanes by revealing more instance information from temporal sequences. Therefore, we propose a novel Geometry-aware Temporal Aggregation Network (GTA-Net) for monocular 3D lane detection. On one hand, we develop the Temporal Geometry Enhancement Module (TGEM), which exploits geometric consistency across successive frames, facilitating effective geometry perception. On the other hand, we present the Temporal Instance-aware Query Generation (TIQG), which strategically incorporates temporal cues into query generation, thereby enabling the exploration of comprehensive instance information. Experiments demonstrate that our GTA-Net achieves SoTA results, surpassing existing monocular 3D lane detection solutions.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
Tracing Cross-chain Transactions between EVM-based Blockchains: An Analysis of Ethereum-Polygon Bridges
Authors:
Tao Yan,
Chuanshan Huang,
Claudio J. Tessone
Abstract:
Ethereum's scalability has been a major concern due to its limited transaction throughput and high fees. To address these limitations, Polygon has emerged as a sidechain solution that facilitates asset transfers between Ethereum and Polygon, thereby improving scalability and reducing costs. However, current cross-chain transactions, particularly those between Ethereum and Polygon, lack transparenc…
▽ More
Ethereum's scalability has been a major concern due to its limited transaction throughput and high fees. To address these limitations, Polygon has emerged as a sidechain solution that facilitates asset transfers between Ethereum and Polygon, thereby improving scalability and reducing costs. However, current cross-chain transactions, particularly those between Ethereum and Polygon, lack transparency and traceability. This paper proposes a method to track cross-chain transactions across EVM-compatible blockchains. It leverages the unique feature that user addresses are consistent across EVM-compatible blockchains. We develop a matching heuristic algorithm that links transactions between the source and target chains by combining transaction time, value, and token identification. Applying our methodology to over 2 million cross-chain transactions (August 2020-August 2023) between Ethereum and Polygon, we achieve matching rates of up to 99.65% for deposits and 92.78% for withdrawals, across different asset types including Ether, ERC-20 tokens, and NFTs. In addition, we provide a comprehensive analysis of various properties and characteristics of cross-chain transactions. Our methodology and findings contribute to a better understanding of cross-chain transaction dynamics and bridge performance, with implications for improving bridge efficiency and security in cross-chain operations.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
Network Analysis of Uniswap: Centralization and Fragility in the Decentralized Exchange Market
Authors:
Tao Yan,
Claudio J. Tessone
Abstract:
The Uniswap is a Decentralized Exchange (DEX) protocol that facilitates automatic token exchange without the need for traditional order books. Every pair of tokens forms a liquidity pool on Uniswap, and each token can be paired with any other token to create liquidity pools. This characteristic motivates us to employ a complex network approach to analyze the features of the Uniswap market. This re…
▽ More
The Uniswap is a Decentralized Exchange (DEX) protocol that facilitates automatic token exchange without the need for traditional order books. Every pair of tokens forms a liquidity pool on Uniswap, and each token can be paired with any other token to create liquidity pools. This characteristic motivates us to employ a complex network approach to analyze the features of the Uniswap market. This research presents a comprehensive analysis of the Uniswap network using complex network methods. The network on October 31, 2023, is built to observe its recent features, showcasing both scale-free and core-periphery properties. By employing node and edge-betweenness metrics, we detect the most important tokens and liquidity pools. Additionally, we construct daily networks spanning from the beginning of Uniswap V2 on May 5, 2020, until October 31, 2023, and our findings demonstrate that the network becomes increasingly fragile over time. Furthermore, we conduct a robustness analysis by simulating the deletion of nodes to estimate the impact of some extreme events such as the Terra collapse. The results indicate that the Uniswap network exhibits robustness, yet it is notably fragile when deleting tokens with high betweenness centrality. This finding highlights that, despite being a decentralized exchange, Uniswap exhibits significant centralization tendencies in terms of token network connectivity and the distribution of TVL across nodes (tokens) and edges (liquidity pools).
△ Less
Submitted 10 March, 2025;
originally announced March 2025.
-
DeFine: A Decomposed and Fine-Grained Annotated Dataset for Long-form Article Generation
Authors:
Ming Wang,
Fang Wang,
Minghao Hu,
Li He,
Haiyang Wang,
Jun Zhang,
Tianwei Yan,
Li Li,
Zhunchen Luo,
Wei Luo,
Xiaoying Bai,
Guotong Geng
Abstract:
Long-form article generation (LFAG) presents challenges such as maintaining logical consistency, comprehensive topic coverage, and narrative coherence across extended articles. Existing datasets often lack both the hierarchical structure and fine-grained annotation needed to effectively decompose tasks, resulting in shallow, disorganized article generation. To address these limitations, we introdu…
▽ More
Long-form article generation (LFAG) presents challenges such as maintaining logical consistency, comprehensive topic coverage, and narrative coherence across extended articles. Existing datasets often lack both the hierarchical structure and fine-grained annotation needed to effectively decompose tasks, resulting in shallow, disorganized article generation. To address these limitations, we introduce DeFine, a Decomposed and Fine-grained annotated dataset for long-form article generation. DeFine is characterized by its hierarchical decomposition strategy and the integration of domain-specific knowledge with multi-level annotations, ensuring granular control and enhanced depth in article generation. To construct the dataset, a multi-agent collaborative pipeline is proposed, which systematically segments the generation process into four parts: Data Miner, Cite Retreiver, Q&A Annotator and Data Cleaner. To validate the effectiveness of DeFine, we designed and tested three LFAG baselines: the web retrieval, the local retrieval, and the grounded reference. We fine-tuned the Qwen2-7b-Instruct model using the DeFine training dataset. The experimental results showed significant improvements in text quality, specifically in topic coverage, depth of information, and content fidelity. Our dataset publicly available to facilitate future research.
△ Less
Submitted 10 March, 2025;
originally announced March 2025.
-
Promote, Suppress, Iterate: How Language Models Answer One-to-Many Factual Queries
Authors:
Tianyi Lorena Yan,
Robin Jia
Abstract:
To answer one-to-many factual queries (e.g., listing cities of a country), a language model (LM) must simultaneously recall knowledge and avoid repeating previous answers. How are these two subtasks implemented and integrated internally? Across multiple datasets and models, we identify a promote-then-suppress mechanism: the model first recalls all answers, and then suppresses previously generated…
▽ More
To answer one-to-many factual queries (e.g., listing cities of a country), a language model (LM) must simultaneously recall knowledge and avoid repeating previous answers. How are these two subtasks implemented and integrated internally? Across multiple datasets and models, we identify a promote-then-suppress mechanism: the model first recalls all answers, and then suppresses previously generated ones. Specifically, LMs use both the subject and previous answer tokens to perform knowledge recall, with attention propagating subject information and MLPs promoting the answers. Then, attention attends to and suppresses previous answer tokens, while MLPs amplify the suppression signal. Our mechanism is corroborated by extensive experimental evidence: in addition to using early decoding and causal tracing, we analyze how components use different tokens by introducing both Token Lens, which decodes aggregated attention updates from specified tokens, and a knockout method that analyzes changes in MLP outputs after removing attention to specified tokens. Overall, we provide new insights into how LMs' internal components interact with different input tokens to support complex factual recall. Code is available at https://github.com/Lorenayannnnn/how-lms-answer-one-to-many-factual-queries.
△ Less
Submitted 5 March, 2025; v1 submitted 27 February, 2025;
originally announced February 2025.
-
Extremal Self-Dual Codes and Linear Complementary Dual Codes from Double Circulant Codes
Authors:
Wenyu Han,
Tongjiang Yan,
Ming Yan
Abstract:
This paper explores extremal self-dual double circulant (DC) codes and linear complementary dual (LCD) codes of arbitrary length over the Galois field $\mathbb F_2$. We establish the sufficient and necessary conditions for DC codes and bordered DC codes to be self-dual and identify the conditions for self-dual DC codes of length up to 44 to be extremal or non-extremal. Additionally, The self-duali…
▽ More
This paper explores extremal self-dual double circulant (DC) codes and linear complementary dual (LCD) codes of arbitrary length over the Galois field $\mathbb F_2$. We establish the sufficient and necessary conditions for DC codes and bordered DC codes to be self-dual and identify the conditions for self-dual DC codes of length up to 44 to be extremal or non-extremal. Additionally, The self-duality and extremality between DC codes and bordered DC codes are also examined. Finally, sufficient conditions for bordered DC codes to be LCD codes over $\mathbb F_2$ under Euclidean inner product are presented.
△ Less
Submitted 20 February, 2025;
originally announced February 2025.
-
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
Authors:
Jonathan Roberts,
Mohammad Reza Taesiri,
Ansh Sharma,
Akash Gupta,
Samuel Roberts,
Ioana Croitoru,
Simion-Vlad Bogolin,
Jialu Tang,
Florian Langer,
Vyas Raina,
Vatsal Raina,
Hanyi Xiong,
Vishaal Udandarao,
Jingyi Lu,
Shiyang Chen,
Sam Purkis,
Tianshuo Yan,
Wenye Lin,
Gyungin Shin,
Qiaochu Yang,
Anh Totti Nguyen,
David I. Atkinson,
Aaditya Baranwal,
Alexandru Coca,
Mikah Dang
, et al. (9 additional authors not shown)
Abstract:
Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for l…
▽ More
Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench-a lightweight visual reasoning benchmark that is entirely impossible for contemporary frontier LMMs. Our benchmark consists of 100 manually curated questions and 334 less difficult subquestions. We evaluate 20 LMMs on ZeroBench, all of which score 0.0%, and rigorously analyse the errors. To encourage progress in visual understanding, we publicly release ZeroBench.
△ Less
Submitted 6 March, 2025; v1 submitted 13 February, 2025;
originally announced February 2025.
-
Knowledge Editing for Large Language Model with Knowledge Neuronal Ensemble
Authors:
Yongchang Li,
Yujin Zhu,
Tao Yan,
Shijian Fan,
Gang Wu,
Liang Xu
Abstract:
As real-world knowledge is constantly evolving, ensuring the timeliness and accuracy of a model's knowledge is crucial. This has made knowledge editing in large language models increasingly important. However, existing knowledge editing methods face several challenges, including parameter localization coupling, imprecise localization, and a lack of dynamic interaction across layers. In this paper,…
▽ More
As real-world knowledge is constantly evolving, ensuring the timeliness and accuracy of a model's knowledge is crucial. This has made knowledge editing in large language models increasingly important. However, existing knowledge editing methods face several challenges, including parameter localization coupling, imprecise localization, and a lack of dynamic interaction across layers. In this paper, we propose a novel knowledge editing method called Knowledge Neuronal Ensemble (KNE). A knowledge neuronal ensemble represents a group of neurons encoding specific knowledge, thus mitigating the issue of frequent parameter modification caused by coupling in parameter localization. The KNE method enhances the precision and accuracy of parameter localization by computing gradient attribution scores for each parameter at each layer. During the editing process, only the gradients and losses associated with the knowledge neuronal ensemble are computed, with error backpropagation performed accordingly, ensuring dynamic interaction and collaborative updates among parameters. Experimental results on three widely used knowledge editing datasets show that the KNE method significantly improves the accuracy of knowledge editing and achieves, or even exceeds, the performance of the best baseline methods in portability and locality metrics.
△ Less
Submitted 29 December, 2024;
originally announced December 2024.
-
OLiDM: Object-aware LiDAR Diffusion Models for Autonomous Driving
Authors:
Tianyi Yan,
Junbo Yin,
Xianpeng Lang,
Ruigang Yang,
Cheng-Zhong Xu,
Jianbing Shen
Abstract:
To enhance autonomous driving safety in complex scenarios, various methods have been proposed to simulate LiDAR point cloud data. Nevertheless, these methods often face challenges in producing high-quality, diverse, and controllable foreground objects. To address the needs of object-aware tasks in 3D perception, we introduce OLiDM, a novel framework capable of generating high-fidelity LiDAR data a…
▽ More
To enhance autonomous driving safety in complex scenarios, various methods have been proposed to simulate LiDAR point cloud data. Nevertheless, these methods often face challenges in producing high-quality, diverse, and controllable foreground objects. To address the needs of object-aware tasks in 3D perception, we introduce OLiDM, a novel framework capable of generating high-fidelity LiDAR data at both the object and the scene levels. OLiDM consists of two pivotal components: the Object-Scene Progressive Generation (OPG) module and the Object Semantic Alignment (OSA) module. OPG adapts to user-specific prompts to generate desired foreground objects, which are subsequently employed as conditions in scene generation, ensuring controllable outputs at both the object and scene levels. This also facilitates the association of user-defined object-level annotations with the generated LiDAR scenes. Moreover, OSA aims to rectify the misalignment between foreground objects and background scenes, enhancing the overall quality of the generated objects. The broad effectiveness of OLiDM is demonstrated across various LiDAR generation tasks, as well as in 3D perception tasks. Specifically, on the KITTI-360 dataset, OLiDM surpasses prior state-of-the-art methods such as UltraLiDAR by 17.5 in FPD. Additionally, in sparse-to-dense LiDAR completion, OLiDM achieves a significant improvement over LiDARGen, with a 57.47\% increase in semantic IoU. Moreover, OLiDM enhances the performance of mainstream 3D detectors by 2.4\% in mAP and 1.9\% in NDS, underscoring its potential in advancing object-aware 3D tasks. Code is available at: https://yanty123.github.io/OLiDM.
△ Less
Submitted 22 December, 2024;
originally announced December 2024.
-
JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs
Authors:
Hongyi Li,
Jiawei Ye,
Jie Wu,
Tianjie Yan,
Chu Wang,
Zhixin Li
Abstract:
Large Language Models (LLMs) aligned with human feedback have recently garnered significant attention. However, it remains vulnerable to jailbreak attacks, where adversaries manipulate prompts to induce harmful outputs. Exploring jailbreak attacks enables us to investigate the vulnerabilities of LLMs and further guides us in enhancing their security. Unfortunately, existing techniques mainly rely…
▽ More
Large Language Models (LLMs) aligned with human feedback have recently garnered significant attention. However, it remains vulnerable to jailbreak attacks, where adversaries manipulate prompts to induce harmful outputs. Exploring jailbreak attacks enables us to investigate the vulnerabilities of LLMs and further guides us in enhancing their security. Unfortunately, existing techniques mainly rely on handcrafted templates or generated-based optimization, posing challenges in scalability, efficiency and universality. To address these issues, we present JailPO, a novel black-box jailbreak framework to examine LLM alignment. For scalability and universality, JailPO meticulously trains attack models to automatically generate covert jailbreak prompts. Furthermore, we introduce a preference optimization-based attack method to enhance the jailbreak effectiveness, thereby improving efficiency. To analyze model vulnerabilities, we provide three flexible jailbreak patterns. Extensive experiments demonstrate that JailPO not only automates the attack process while maintaining effectiveness but also exhibits superior performance in efficiency, universality, and robustness against defenses compared to baselines. Additionally, our analysis of the three JailPO patterns reveals that attacks based on complex templates exhibit higher attack strength, whereas covert question transformations elicit riskier responses and are more likely to bypass defense mechanisms.
△ Less
Submitted 20 December, 2024;
originally announced December 2024.
-
MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt
Authors:
Yuhao Wang,
Xuehu Liu,
Tianyu Yan,
Yang Liu,
Aihua Zheng,
Pingping Zhang,
Huchuan Lu
Abstract:
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary image information from different modalities. Recently, large-scale pre-trained models like CLIP have demonstrated impressive performance in traditional single-modal object ReID tasks. However, they remain unexplored for multi-modal object ReID. Furthermore, current multi-modal aggregation metho…
▽ More
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary image information from different modalities. Recently, large-scale pre-trained models like CLIP have demonstrated impressive performance in traditional single-modal object ReID tasks. However, they remain unexplored for multi-modal object ReID. Furthermore, current multi-modal aggregation methods have obvious limitations in dealing with long sequences from different modalities. To address above issues, we introduce a novel framework called MambaPro for multi-modal object ReID. To be specific, we first employ a Parallel Feed-Forward Adapter (PFA) for adapting CLIP to multi-modal object ReID. Then, we propose the Synergistic Residual Prompt (SRP) to guide the joint learning of multi-modal features. Finally, leveraging Mamba's superior scalability for long sequences, we introduce Mamba Aggregation (MA) to efficiently model interactions between different modalities. As a result, MambaPro could extract more robust features with lower complexity. Extensive experiments on three multi-modal object ReID benchmarks (i.e., RGBNT201, RGBNT100 and MSVR310) validate the effectiveness of our proposed methods. The source code is available at https://github.com/924973292/MambaPro.
△ Less
Submitted 14 December, 2024;
originally announced December 2024.
-
k-HyperEdge Medoids for Clustering Ensemble
Authors:
Feijiang Li,
Jieting Wang,
Liuya zhang,
Yuhua Qian,
Shuai jin,
Tao Yan,
Liang Du
Abstract:
Clustering ensemble has been a popular research topic in data science due to its ability to improve the robustness of the single clustering method. Many clustering ensemble methods have been proposed, most of which can be categorized into clustering-view and sample-view methods. The clustering-view method is generally efficient, but it could be affected by the unreliability that existed in base cl…
▽ More
Clustering ensemble has been a popular research topic in data science due to its ability to improve the robustness of the single clustering method. Many clustering ensemble methods have been proposed, most of which can be categorized into clustering-view and sample-view methods. The clustering-view method is generally efficient, but it could be affected by the unreliability that existed in base clustering results. The sample-view method shows good performance, while the construction of the pairwise sample relation is time-consuming. In this paper, the clustering ensemble is formulated as a k-HyperEdge Medoids discovery problem and a clustering ensemble method based on k-HyperEdge Medoids that considers the characteristics of the above two types of clustering ensemble methods is proposed. In the method, a set of hyperedges is selected from the clustering view efficiently, then the hyperedges are diffused and adjusted from the sample view guided by a hyperedge loss function to construct an effective k-HyperEdge Medoid set. The loss function is mainly reduced by assigning samples to the hyperedge with the highest degree of belonging. Theoretical analyses show that the solution can approximate the optimal, the assignment method can gradually reduce the loss function, and the estimation of the belonging degree is statistically reasonable. Experiments on artificial data show the working mechanism of the proposed method. The convergence of the method is verified by experimental analysis of twenty data sets. The effectiveness and efficiency of the proposed method are also verified on these data, with nine representative clustering ensemble algorithms as reference.
△ Less
Submitted 11 December, 2024;
originally announced December 2024.
-
MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image
Authors:
Shezheng Song,
Chengxiang He,
Shasha Li,
Shan Zhao,
Chengyu Wang,
Tianwei Yan,
Xiaopeng Li,
Qian Wan,
Jun Ma,
Jie Yu,
Xiaoguang Mao
Abstract:
Multimodal large language models (MLLMs) have shown remarkable progress in high-level semantic tasks such as visual question answering, image captioning, and emotion recognition. However, despite advancements, there remains a lack of standardized benchmarks for evaluating MLLMs performance in multi-object sentiment analysis, a key task in semantic understanding. To address this gap, we introduce M…
▽ More
Multimodal large language models (MLLMs) have shown remarkable progress in high-level semantic tasks such as visual question answering, image captioning, and emotion recognition. However, despite advancements, there remains a lack of standardized benchmarks for evaluating MLLMs performance in multi-object sentiment analysis, a key task in semantic understanding. To address this gap, we introduce MOSABench, a novel evaluation dataset designed specifically for multi-object sentiment analysis. MOSABench includes approximately 1,000 images with multiple objects, requiring MLLMs to independently assess the sentiment of each object, thereby reflecting real-world complexities. Key innovations in MOSABench include distance-based target annotation, post-processing for evaluation to standardize outputs, and an improved scoring mechanism. Our experiments reveal notable limitations in current MLLMs: while some models, like mPLUG-owl and Qwen-VL2, demonstrate effective attention to sentiment-relevant features, others exhibit scattered focus and performance declines, especially as the spatial distance between objects increases. This research underscores the need for MLLMs to enhance accuracy in complex, multi-object sentiment analysis tasks and establishes MOSABench as a foundational tool for advancing sentiment analysis capabilities in MLLMs.
△ Less
Submitted 25 November, 2024;
originally announced December 2024.
-
DrivingSphere: Building a High-fidelity 4D World for Closed-loop Simulation
Authors:
Tianyi Yan,
Dongming Wu,
Wencheng Han,
Junpeng Jiang,
Xia Zhou,
Kun Zhan,
Cheng-zhong Xu,
Jianbing Shen
Abstract:
Autonomous driving evaluation requires simulation environments that closely replicate actual road conditions, including real-world sensory data and responsive feedback loops. However, many existing simulations need to predict waypoints along fixed routes on public datasets or synthetic photorealistic data, \ie, open-loop simulation usually lacks the ability to assess dynamic decision-making. While…
▽ More
Autonomous driving evaluation requires simulation environments that closely replicate actual road conditions, including real-world sensory data and responsive feedback loops. However, many existing simulations need to predict waypoints along fixed routes on public datasets or synthetic photorealistic data, \ie, open-loop simulation usually lacks the ability to assess dynamic decision-making. While the recent efforts of closed-loop simulation offer feedback-driven environments, they cannot process visual sensor inputs or produce outputs that differ from real-world data. To address these challenges, we propose DrivingSphere, a realistic and closed-loop simulation framework. Its core idea is to build 4D world representation and generate real-life and controllable driving scenarios. In specific, our framework includes a Dynamic Environment Composition module that constructs a detailed 4D driving world with a format of occupancy equipping with static backgrounds and dynamic objects, and a Visual Scene Synthesis module that transforms this data into high-fidelity, multi-view video outputs, ensuring spatial and temporal consistency. By providing a dynamic and realistic simulation environment, DrivingSphere enables comprehensive testing and validation of autonomous driving algorithms, ultimately advancing the development of more reliable autonomous cars. The benchmark will be publicly released.
△ Less
Submitted 17 November, 2024;
originally announced November 2024.
-
$M^3EL$: A Multi-task Multi-topic Dataset for Multi-modal Entity Linking
Authors:
Fang Wang,
Shenglin Yin,
Xiaoying Bai,
Minghao Hu,
Tianwei Yan,
Yi Liang
Abstract:
Multi-modal Entity Linking (MEL) is a fundamental component for various downstream tasks. However, existing MEL datasets suffer from small scale, scarcity of topic types and limited coverage of tasks, making them incapable of effectively enhancing the entity linking capabilities of multi-modal models. To address these obstacles, we propose a dataset construction pipeline and publish $M^3EL$, a lar…
▽ More
Multi-modal Entity Linking (MEL) is a fundamental component for various downstream tasks. However, existing MEL datasets suffer from small scale, scarcity of topic types and limited coverage of tasks, making them incapable of effectively enhancing the entity linking capabilities of multi-modal models. To address these obstacles, we propose a dataset construction pipeline and publish $M^3EL$, a large-scale dataset for MEL. $M^3EL$ includes 79,625 instances, covering 9 diverse multi-modal tasks, and 5 different topics. In addition, to further improve the model's adaptability to multi-modal tasks, We propose a modality-augmented training strategy. Utilizing $M^3EL$ as a corpus, train the $\textit{CLIP}_{\textit{ND}}$ model based on $\textit{CLIP} (\textit{ViT}-\textit{B}-\textit{32})$, and conduct a comparative analysis with an existing multi-modal baselines. Experimental results show that the existing models perform far below expectations (ACC of 49.4%-75.8%), After analysis, it was obtained that small dataset sizes, insufficient modality task coverage, and limited topic diversity resulted in poor generalisation of multi-modal models. Our dataset effectively addresses these issues, and the $\textit{CLIP}_{\textit{ND}}$ model fine-tuned with $M^3EL$ shows a significant improvement in accuracy, with an average improvement of 9.3% to 25% across various tasks. Our dataset is available at https://anonymous.4open.science/r/M3EL.
△ Less
Submitted 8 October, 2024;
originally announced October 2024.
-
Q-value Regularized Decision ConvFormer for Offline Reinforcement Learning
Authors:
Teng Yan,
Zhendong Ruan,
Yaobang Cai,
Yu Han,
Wenxian Li,
Yang Zhang
Abstract:
As a data-driven paradigm, offline reinforcement learning (Offline RL) has been formulated as sequence modeling, where the Decision Transformer (DT) has demonstrated exceptional capabilities. Unlike previous reinforcement learning methods that fit value functions or compute policy gradients, DT adjusts the autoregressive model based on the expected returns, past states, and actions, using a causal…
▽ More
As a data-driven paradigm, offline reinforcement learning (Offline RL) has been formulated as sequence modeling, where the Decision Transformer (DT) has demonstrated exceptional capabilities. Unlike previous reinforcement learning methods that fit value functions or compute policy gradients, DT adjusts the autoregressive model based on the expected returns, past states, and actions, using a causally masked Transformer to output the optimal action. However, due to the inconsistency between the sampled returns within a single trajectory and the optimal returns across multiple trajectories, it is challenging to set an expected return to output the optimal action and stitch together suboptimal trajectories. Decision ConvFormer (DC) is easier to understand in the context of modeling RL trajectories within a Markov Decision Process compared to DT. We propose the Q-value Regularized Decision ConvFormer (QDC), which combines the understanding of RL trajectories by DC and incorporates a term that maximizes action values using dynamic programming methods during training. This ensures that the expected returns of the sampled actions are consistent with the optimal returns. QDC achieves excellent performance on the D4RL benchmark, outperforming or approaching the optimal level in all tested environments. It particularly demonstrates outstanding competitiveness in trajectory stitching capability.
△ Less
Submitted 12 September, 2024;
originally announced September 2024.
-
Research on the Construction of Maximum Distance Separable Codes via Arbitrary twisted Generalized Reed-Solomon Codes
Authors:
Chun'e Zhao,
Wenping Ma,
Tongjiang Yan,
Yuhua Sun
Abstract:
Maximum distance separable (MDS) codes have significant combinatorial and cryptographic applications due to their certain optimality. Generalized Reed-Solomon (GRS) codes are the most prominent MDS codes. Twisted generalized Reed-Solomon (TGRS) codes may not necessarily be MDS. It is meaningful to study the conditions under which TGRS codes are MDS. In this paper, we study a general class of TGRS…
▽ More
Maximum distance separable (MDS) codes have significant combinatorial and cryptographic applications due to their certain optimality. Generalized Reed-Solomon (GRS) codes are the most prominent MDS codes. Twisted generalized Reed-Solomon (TGRS) codes may not necessarily be MDS. It is meaningful to study the conditions under which TGRS codes are MDS. In this paper, we study a general class of TGRS (A-TGRS) codes which include all the known special ones. First, we obtain a new explicit expression of the inverse of the Vandermonde matrix. Based on this, we further derive an equivalent condition under which an A-TGRS code is MDS. According to this, the A-TGRS MDS codes include nearly all the known related results in the previous literatures. More importantly, we also obtain many other classes of MDS TGRS codes with new parameter matrices. In addition, we present a new method to compute the inverse of the lower triangular Toplitz matrix by a linear feedback shift register, which will be very useful in many research fields.
△ Less
Submitted 21 August, 2024;
originally announced August 2024.
-
Multi-Scale and Detail-Enhanced Segment Anything Model for Salient Object Detection
Authors:
Shixuan Gao,
Pingping Zhang,
Tianyu Yan,
Huchuan Lu
Abstract:
Salient Object Detection (SOD) aims to identify and segment the most prominent objects in images. Advanced SOD methods often utilize various Convolutional Neural Networks (CNN) or Transformers for deep feature extraction. However, these methods still deliver low performance and poor generalization in complex cases. Recently, Segment Anything Model (SAM) has been proposed as a visual fundamental mo…
▽ More
Salient Object Detection (SOD) aims to identify and segment the most prominent objects in images. Advanced SOD methods often utilize various Convolutional Neural Networks (CNN) or Transformers for deep feature extraction. However, these methods still deliver low performance and poor generalization in complex cases. Recently, Segment Anything Model (SAM) has been proposed as a visual fundamental model, which gives strong segmentation and generalization capabilities. Nonetheless, SAM requires accurate prompts of target objects, which are unavailable in SOD. Additionally, SAM lacks the utilization of multi-scale and multi-level information, as well as the incorporation of fine-grained details. To address these shortcomings, we propose a Multi-scale and Detail-enhanced SAM (MDSAM) for SOD. Specifically, we first introduce a Lightweight Multi-Scale Adapter (LMSA), which allows SAM to learn multi-scale information with very few trainable parameters. Then, we propose a Multi-Level Fusion Module (MLFM) to comprehensively utilize the multi-level information from the SAM's encoder. Finally, we propose a Detail Enhancement Module (DEM) to incorporate SAM with fine-grained details. Experimental results demonstrate the superior performance of our model on multiple SOD datasets and its strong generalization on other segmentation tasks. The source code is released at https://github.com/BellyBeauty/MDSAM.
△ Less
Submitted 8 August, 2024;
originally announced August 2024.
-
MMedAgent: Learning to Use Medical Tools with Multi-modal Agent
Authors:
Binxu Li,
Tiankai Yan,
Yuanting Pan,
Jie Luo,
Ruiyang Ji,
Jiayuan Ding,
Zhe Xu,
Shilong Liu,
Haoyu Dong,
Zihao Lin,
Yixin Wang
Abstract:
Multi-Modal Large Language Models (MLLMs), despite being successful, exhibit limited generality and often fall short when compared to specialized models. Recently, LLM-based agents have been developed to address these challenges by selecting appropriate specialized models as tools based on user inputs. However, such advancements have not been extensively explored within the medical domain. To brid…
▽ More
Multi-Modal Large Language Models (MLLMs), despite being successful, exhibit limited generality and often fall short when compared to specialized models. Recently, LLM-based agents have been developed to address these challenges by selecting appropriate specialized models as tools based on user inputs. However, such advancements have not been extensively explored within the medical domain. To bridge this gap, this paper introduces the first agent explicitly designed for the medical field, named \textbf{M}ulti-modal \textbf{Med}ical \textbf{Agent} (MMedAgent). We curate an instruction-tuning dataset comprising six medical tools solving seven tasks across five modalities, enabling the agent to choose the most suitable tools for a given task. Comprehensive experiments demonstrate that MMedAgent achieves superior performance across a variety of medical tasks compared to state-of-the-art open-source methods and even the closed-source model, GPT-4o. Furthermore, MMedAgent exhibits efficiency in updating and integrating new medical tools. Codes and models are all available.
△ Less
Submitted 5 October, 2024; v1 submitted 2 July, 2024;
originally announced July 2024.
-
CAVM: Conditional Autoregressive Vision Model for Contrast-Enhanced Brain Tumor MRI Synthesis
Authors:
Lujun Gui,
Chuyang Ye,
Tianyi Yan
Abstract:
Contrast-enhanced magnetic resonance imaging (MRI) is pivotal in the pipeline of brain tumor segmentation and analysis. Gadolinium-based contrast agents, as the most commonly used contrast agents, are expensive and may have potential side effects, and it is desired to obtain contrast-enhanced brain tumor MRI scans without the actual use of contrast agents. Deep learning methods have been applied t…
▽ More
Contrast-enhanced magnetic resonance imaging (MRI) is pivotal in the pipeline of brain tumor segmentation and analysis. Gadolinium-based contrast agents, as the most commonly used contrast agents, are expensive and may have potential side effects, and it is desired to obtain contrast-enhanced brain tumor MRI scans without the actual use of contrast agents. Deep learning methods have been applied to synthesize virtual contrast-enhanced MRI scans from non-contrast images. However, as this synthesis problem is inherently ill-posed, these methods fall short in producing high-quality results. In this work, we propose Conditional Autoregressive Vision Model (CAVM) for improving the synthesis of contrast-enhanced brain tumor MRI. As the enhancement of image intensity grows with a higher dose of contrast agents, we assume that it is less challenging to synthesize a virtual image with a lower dose, where the difference between the contrast-enhanced and non-contrast images is smaller. Thus, CAVM gradually increases the contrast agent dosage and produces higher-dose images based on previous lower-dose ones until the final desired dose is achieved. Inspired by the resemblance between the gradual dose increase and the Chain-of-Thought approach in natural language processing, CAVM uses an autoregressive strategy with a decomposition tokenizer and a decoder. Specifically, the tokenizer is applied to obtain a more compact image representation for computational efficiency, and it decomposes the image into dose-variant and dose-invariant tokens. Then, a masked self-attention mechanism is developed for autoregression that gradually increases the dose of the virtual image based on the dose-variant tokens. Finally, the updated dose-variant tokens corresponding to the desired dose are decoded together with dose-invariant tokens to produce the final contrast-enhanced MRI.
△ Less
Submitted 23 June, 2024;
originally announced June 2024.
-
MDeRainNet: An Efficient Macro-pixel Image Rain Removal Network
Authors:
Tao Yan,
Weijiang He,
Chenglong Wang,
Cihang Wei,
Xiangjie Zhu,
Yinghui Wang,
Rynson W. H. Lau
Abstract:
Since rainy weather always degrades image quality and poses significant challenges to most computer vision-based intelligent systems, image de-raining has been a hot research topic. Fortunately, in a rainy light field (LF) image, background obscured by rain streaks in one sub-view may be visible in the other sub-views, and implicit depth information and recorded 4D structural information may benef…
▽ More
Since rainy weather always degrades image quality and poses significant challenges to most computer vision-based intelligent systems, image de-raining has been a hot research topic. Fortunately, in a rainy light field (LF) image, background obscured by rain streaks in one sub-view may be visible in the other sub-views, and implicit depth information and recorded 4D structural information may benefit rain streak detection and removal. However, existing LF image rain removal methods either do not fully exploit the global correlations of 4D LF data or only utilize partial sub-views, resulting in sub-optimal rain removal performance and no-equally good quality for all de-rained sub-views. In this paper, we propose an efficient network, called MDeRainNet, for rain streak removal from LF images. The proposed network adopts a multi-scale encoder-decoder architecture, which directly works on Macro-pixel images (MPIs) to improve the rain removal performance. To fully model the global correlation between the spatial and the angular information, we propose an Extended Spatial-Angular Interaction (ESAI) module to merge them, in which a simple and effective Transformer-based Spatial-Angular Interaction Attention (SAIA) block is also proposed for modeling long-range geometric correlations and making full use of the angular information. Furthermore, to improve the generalization performance of our network on real-world rainy scenes, we propose a novel semi-supervised learning framework for our MDeRainNet, which utilizes multi-level KL loss to bridge the domain gap between features of synthetic and real-world rain streaks and introduces colored-residue image guided contrastive regularization to reconstruct rain-free images. Extensive experiments conducted on synthetic and real-world LFIs demonstrate that our method outperforms the state-of-the-art methods both quantitatively and qualitatively.
△ Less
Submitted 23 June, 2025; v1 submitted 15 June, 2024;
originally announced June 2024.
-
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Authors:
Fei Wang,
Xingyu Fu,
James Y. Huang,
Zekun Li,
Qin Liu,
Xiaogeng Liu,
Mingyu Derek Ma,
Nan Xu,
Wenxuan Zhou,
Kai Zhang,
Tianyi Lorena Yan,
Wenjie Jacky Mo,
Hsiang-Hui Liu,
Pan Lu,
Chunyuan Li,
Chaowei Xiao,
Kai-Wei Chang,
Dan Roth,
Sheng Zhang,
Hoifung Poon,
Muhao Chen
Abstract:
We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a…
▽ More
We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a pairwise manner, where each standard instance is paired with an unanswerable variant that has minimal semantic differences, in order for a reliable assessment. Evaluated upon 20 recent multi-modal LLMs, our results reveal that even the best-performing models like GPT-4o and Gemini Pro find it challenging to solve MuirBench, achieving 68.0% and 49.3% in accuracy. Open-source multimodal LLMs trained on single images can hardly generalize to multi-image questions, hovering below 33.3% in accuracy. These results highlight the importance of MuirBench in encouraging the community to develop multimodal LLMs that can look beyond a single image, suggesting potential pathways for future improvements.
△ Less
Submitted 1 July, 2024; v1 submitted 13 June, 2024;
originally announced June 2024.
-
PTA: Enhancing Multimodal Sentiment Analysis through Pipelined Prediction and Translation-based Alignment
Authors:
Shezheng Song,
Shasha Li,
Shan Zhao,
Chengyu Wang,
Xiaopeng Li,
Jie Yu,
Qian Wan,
Jun Ma,
Tianwei Yan,
Wentao Ma,
Xiaoguang Mao
Abstract:
Multimodal aspect-based sentiment analysis (MABSA) aims to understand opinions in a granular manner, advancing human-computer interaction and other fields. Traditionally, MABSA methods use a joint prediction approach to identify aspects and sentiments simultaneously. However, we argue that joint models are not always superior. Our analysis shows that joint models struggle to align relevant text to…
▽ More
Multimodal aspect-based sentiment analysis (MABSA) aims to understand opinions in a granular manner, advancing human-computer interaction and other fields. Traditionally, MABSA methods use a joint prediction approach to identify aspects and sentiments simultaneously. However, we argue that joint models are not always superior. Our analysis shows that joint models struggle to align relevant text tokens with image patches, leading to misalignment and ineffective image utilization.
In contrast, a pipeline framework first identifies aspects through MATE (Multimodal Aspect Term Extraction) and then aligns these aspects with image patches for sentiment classification (MASC: Multimodal Aspect-Oriented Sentiment Classification). This method is better suited for multimodal scenarios where effective image use is crucial. We present three key observations: (a) MATE and MASC have different feature requirements, with MATE focusing on token-level features and MASC on sequence-level features; (b) the aspect identified by MATE is crucial for effective image utilization; and (c) images play a trivial role in previous MABSA methods due to high noise.
Based on these observations, we propose a pipeline framework that first predicts the aspect and then uses translation-based alignment (TBA) to enhance multimodal semantic consistency for better image utilization. Our method achieves state-of-the-art (SOTA) performance on widely used MABSA datasets Twitter-15 and Twitter-17. This demonstrates the effectiveness of the pipeline approach and its potential to provide valuable insights for future MABSA research.
For reproducibility, the code and checkpoint will be released.
△ Less
Submitted 13 June, 2024; v1 submitted 22 May, 2024;
originally announced June 2024.
-
Asymmetrical estimator for training encapsulated deep photonic neural networks
Authors:
Yizhi Wang,
Minjia Chen,
Chunhui Yao,
Jie Ma,
Ting Yan,
Richard Penty,
Qixiang Cheng
Abstract:
Photonic neural networks (PNNs) are fast in-propagation and high bandwidth paradigms that aim to popularize reproducible NN acceleration with higher efficiency and lower cost. However, the training of PNN is known to be challenging, where the device-to-device and system-to-system variations create imperfect knowledge of the PNN. Despite backpropagation (BP)-based training algorithms being the indu…
▽ More
Photonic neural networks (PNNs) are fast in-propagation and high bandwidth paradigms that aim to popularize reproducible NN acceleration with higher efficiency and lower cost. However, the training of PNN is known to be challenging, where the device-to-device and system-to-system variations create imperfect knowledge of the PNN. Despite backpropagation (BP)-based training algorithms being the industry standard for their robustness, generality, and fast gradient convergence for digital training, existing PNN-BP methods rely heavily on accurate intermediate state extraction or extensive computational resources for deep PNNs (DPNNs). The truncated photonic signal propagation and the computation overhead bottleneck DPNN's operation efficiency and increase system construction cost. Here, we introduce the asymmetrical training (AsyT) method, tailored for encapsulated DPNNs, where the signal is preserved in the analogue photonic domain for the entire structure. AsyT offers a lightweight solution for DPNNs with minimum readouts, fast and energy-efficient operation, and minimum system footprint. AsyT's ease of operation, error tolerance, and generality aim to promote PNN acceleration in a widened operational scenario despite the fabrication variations and imperfect controls. We demonstrated AsyT for encapsulated DPNN with integrated photonic chips, repeatably enhancing the performance from in-silico BP for different network structures and datasets.
△ Less
Submitted 13 February, 2025; v1 submitted 28 May, 2024;
originally announced May 2024.
-
Unified End-to-End V2X Cooperative Autonomous Driving
Authors:
Zhiwei Li,
Bozhen Zhang,
Lei Yang,
Tianyu Shen,
Nuo Xu,
Ruosen Hao,
Weiting Li,
Tao Yan,
Huaping Liu
Abstract:
V2X cooperation, through the integration of sensor data from both vehicles and infrastructure, is considered a pivotal approach to advancing autonomous driving technology. Current research primarily focuses on enhancing perception accuracy, often overlooking the systematic improvement of accident prediction accuracy through end-to-end learning, leading to insufficient attention to the safety issue…
▽ More
V2X cooperation, through the integration of sensor data from both vehicles and infrastructure, is considered a pivotal approach to advancing autonomous driving technology. Current research primarily focuses on enhancing perception accuracy, often overlooking the systematic improvement of accident prediction accuracy through end-to-end learning, leading to insufficient attention to the safety issues of autonomous driving. To address this challenge, this paper introduces the UniE2EV2X framework, a V2X-integrated end-to-end autonomous driving system that consolidates key driving modules within a unified network. The framework employs a deformable attention-based data fusion strategy, effectively facilitating cooperation between vehicles and infrastructure. The main advantages include: 1) significantly enhancing agents' perception and motion prediction capabilities, thereby improving the accuracy of accident predictions; 2) ensuring high reliability in the data fusion process; 3) superior end-to-end perception compared to modular approaches. Furthermore, We implement the UniE2EV2X framework on the challenging DeepAccident, a simulation dataset designed for V2X cooperative driving.
△ Less
Submitted 6 May, 2024;
originally announced May 2024.
-
MAS-SAM: Segment Any Marine Animal with Aggregated Features
Authors:
Tianyu Yan,
Zifu Wan,
Xinhao Deng,
Pingping Zhang,
Yang Liu,
Huchuan Lu
Abstract:
Recently, Segment Anything Model (SAM) shows exceptional performance in generating high-quality object masks and achieving zero-shot image segmentation. However, as a versatile vision model, SAM is primarily trained with large-scale natural light images. In underwater scenes, it exhibits substantial performance degradation due to the light scattering and absorption. Meanwhile, the simplicity of th…
▽ More
Recently, Segment Anything Model (SAM) shows exceptional performance in generating high-quality object masks and achieving zero-shot image segmentation. However, as a versatile vision model, SAM is primarily trained with large-scale natural light images. In underwater scenes, it exhibits substantial performance degradation due to the light scattering and absorption. Meanwhile, the simplicity of the SAM's decoder might lead to the loss of fine-grained object details. To address the above issues, we propose a novel feature learning framework named MAS-SAM for marine animal segmentation, which involves integrating effective adapters into the SAM's encoder and constructing a pyramidal decoder. More specifically, we first build a new SAM's encoder with effective adapters for underwater scenes. Then, we introduce a Hypermap Extraction Module (HEM) to generate multi-scale features for a comprehensive guidance. Finally, we propose a Progressive Prediction Decoder (PPD) to aggregate the multi-scale features and predict the final segmentation results. When grafting with the Fusion Attention Module (FAM), our method enables to extract richer marine information from global contextual cues to fine-grained local details. Extensive experiments on four public MAS datasets demonstrate that our MAS-SAM can obtain better results than other typical segmentation methods. The source code is available at https://github.com/Drchip61/MAS-SAM.
△ Less
Submitted 9 May, 2024; v1 submitted 24 April, 2024;
originally announced April 2024.
-
Fantastic Animals and Where to Find Them: Segment Any Marine Animal with Dual SAM
Authors:
Pingping Zhang,
Tianyu Yan,
Yang Liu,
Huchuan Lu
Abstract:
As an important pillar of underwater intelligence, Marine Animal Segmentation (MAS) involves segmenting animals within marine environments. Previous methods don't excel in extracting long-range contextual features and overlook the connectivity between discrete pixels. Recently, Segment Anything Model (SAM) offers a universal framework for general segmentation tasks. Unfortunately, trained with nat…
▽ More
As an important pillar of underwater intelligence, Marine Animal Segmentation (MAS) involves segmenting animals within marine environments. Previous methods don't excel in extracting long-range contextual features and overlook the connectivity between discrete pixels. Recently, Segment Anything Model (SAM) offers a universal framework for general segmentation tasks. Unfortunately, trained with natural images, SAM does not obtain the prior knowledge from marine images. In addition, the single-position prompt of SAM is very insufficient for prior guidance. To address these issues, we propose a novel feature learning framework, named Dual-SAM for high-performance MAS. To this end, we first introduce a dual structure with SAM's paradigm to enhance feature learning of marine images. Then, we propose a Multi-level Coupled Prompt (MCP) strategy to instruct comprehensive underwater prior information, and enhance the multi-level features of SAM's encoder with adapters. Subsequently, we design a Dilated Fusion Attention Module (DFAM) to progressively integrate multi-level features from SAM's encoder. Finally, instead of directly predicting the masks of marine animals, we propose a Criss-Cross Connectivity Prediction (C$^3$P) paradigm to capture the inter-connectivity between discrete pixels. With dual decoders, it generates pseudo-labels and achieves mutual supervision for complementary feature representations, resulting in considerable improvements over previous techniques. Extensive experiments verify that our proposed method achieves state-of-the-art performances on five widely-used MAS datasets. The code is available at https://github.com/Drchip61/Dual_SAM.
△ Less
Submitted 7 April, 2024;
originally announced April 2024.
-
DWE+: Dual-Way Matching Enhanced Framework for Multimodal Entity Linking
Authors:
Shezheng Song,
Shasha Li,
Shan Zhao,
Xiaopeng Li,
Chengyu Wang,
Jie Yu,
Jun Ma,
Tianwei Yan,
Bin Ji,
Xiaoguang Mao
Abstract:
Multimodal entity linking (MEL) aims to utilize multimodal information (usually textual and visual information) to link ambiguous mentions to unambiguous entities in knowledge base. Current methods facing main issues: (1)treating the entire image as input may contain redundant information. (2)the insufficient utilization of entity-related information, such as attributes in images. (3)semantic inco…
▽ More
Multimodal entity linking (MEL) aims to utilize multimodal information (usually textual and visual information) to link ambiguous mentions to unambiguous entities in knowledge base. Current methods facing main issues: (1)treating the entire image as input may contain redundant information. (2)the insufficient utilization of entity-related information, such as attributes in images. (3)semantic inconsistency between the entity in knowledge base and its representation. To this end, we propose DWE+ for multimodal entity linking. DWE+ could capture finer semantics and dynamically maintain semantic consistency with entities. This is achieved by three aspects: (a)we introduce a method for extracting fine-grained image features by partitioning the image into multiple local objects. Then, hierarchical contrastive learning is used to further align semantics between coarse-grained information(text and image) and fine-grained (mention and visual objects). (b)we explore ways to extract visual attributes from images to enhance fusion feature such as facial features and identity. (c)we leverage Wikipedia and ChatGPT to capture the entity representation, achieving semantic enrichment from both static and dynamic perspectives, which better reflects the real-world entity semantics. Experiments on Wikimel, Richpedia, and Wikidiverse datasets demonstrate the effectiveness of DWE+ in improving MEL performance. Specifically, we optimize these datasets and achieve state-of-the-art performance on the enhanced datasets. The code and enhanced datasets are released on https://github.com/season1blue/DWET
△ Less
Submitted 7 April, 2024;
originally announced April 2024.
-
Deep Reinforcement Learning in Autonomous Car Path Planning and Control: A Survey
Authors:
Yiyang Chen,
Chao Ji,
Yunrui Cai,
Tong Yan,
Bo Su
Abstract:
Combining data-driven applications with control systems plays a key role in recent Autonomous Car research. This thesis offers a structured review of the latest literature on Deep Reinforcement Learning (DRL) within the realm of autonomous vehicle Path Planning and Control. It collects a series of DRL methodologies and algorithms and their applications in the field, focusing notably on their roles…
▽ More
Combining data-driven applications with control systems plays a key role in recent Autonomous Car research. This thesis offers a structured review of the latest literature on Deep Reinforcement Learning (DRL) within the realm of autonomous vehicle Path Planning and Control. It collects a series of DRL methodologies and algorithms and their applications in the field, focusing notably on their roles in trajectory planning and dynamic control. In this review, we delve into the application outcomes of DRL technologies in this domain. By summarizing these literatures, we highlight potential challenges, aiming to offer insights that might aid researchers engaged in related fields.
△ Less
Submitted 30 March, 2024;
originally announced April 2024.
-
Monotonic Paraphrasing Improves Generalization of Language Model Prompting
Authors:
Qin Liu,
Fei Wang,
Nan Xu,
Tianyi Yan,
Tao Meng,
Muhao Chen
Abstract:
Performance of large language models (LLMs) may vary with different prompts or instructions of even the same task. One commonly recognized factor for this phenomenon is the model's familiarity with the given prompt or instruction, which is typically estimated by its perplexity. However, finding the prompt with the lowest perplexity is challenging, given the enormous space of possible prompting phr…
▽ More
Performance of large language models (LLMs) may vary with different prompts or instructions of even the same task. One commonly recognized factor for this phenomenon is the model's familiarity with the given prompt or instruction, which is typically estimated by its perplexity. However, finding the prompt with the lowest perplexity is challenging, given the enormous space of possible prompting phrases. In this paper, we propose monotonic paraphrasing (MonoPara), an end-to-end decoding strategy that paraphrases given prompts or instructions into their lower perplexity counterparts based on an ensemble of a paraphrase LM for prompt (or instruction) rewriting, and a target LM (i.e. the prompt or instruction executor) that constrains the generation for lower perplexity. The ensemble decoding process can efficiently paraphrase the original prompt without altering its semantic meaning, while monotonically decreasing the perplexity of each generation as calculated by the target LM. We explore in detail both greedy and search-based decoding as two alternative decoding schemes of MonoPara. Notably, MonoPara does not require any training and can monotonically lower the perplexity of the paraphrased prompt or instruction, leading to improved performance of zero-shot LM prompting as evaluated on a wide selection of tasks. In addition, MonoPara is also shown to effectively improve LMs' generalization on perturbed and unseen task instructions.
△ Less
Submitted 2 November, 2024; v1 submitted 24 March, 2024;
originally announced March 2024.
-
Distributed Robust Learning based Formation Control of Mobile Robots based on Bioinspired Neural Dynamics
Authors:
Zhe Xu,
Tao Yan,
Simon X. Yang,
S. Andrew Gadsden,
Mohammad Biglarbegian
Abstract:
This paper addresses the challenges of distributed formation control in multiple mobile robots, introducing a novel approach that enhances real-world practicability. We first introduce a distributed estimator using a variable structure and cascaded design technique, eliminating the need for derivative information to improve the real time performance. Then, a kinematic tracking control method is de…
▽ More
This paper addresses the challenges of distributed formation control in multiple mobile robots, introducing a novel approach that enhances real-world practicability. We first introduce a distributed estimator using a variable structure and cascaded design technique, eliminating the need for derivative information to improve the real time performance. Then, a kinematic tracking control method is developed utilizing a bioinspired neural dynamic-based approach aimed at providing smooth control inputs and effectively resolving the speed jump issue. Furthermore, to address the challenges for robots operating with completely unknown dynamics and disturbances, a learning-based robust dynamic controller is developed. This controller provides real time parameter estimates while maintaining its robustness against disturbances. The overall stability of the proposed method is proved with rigorous mathematical analysis. At last, multiple comprehensive simulation studies have shown the advantages and effectiveness of the proposed method.
△ Less
Submitted 23 March, 2024;
originally announced March 2024.
-
GCAM: Gaussian and causal-attention model of food fine-grained recognition
Authors:
Guohang Zhuang,
Yue Hu,
Tianxing Yan,
JiaZhan Gao
Abstract:
Currently, most food recognition relies on deep learning for category classification. However, these approaches struggle to effectively distinguish between visually similar food samples, highlighting the pressing need to address fine-grained issues in food recognition. To mitigate these challenges, we propose the adoption of a Gaussian and causal-attention model for fine-grained object recognition…
▽ More
Currently, most food recognition relies on deep learning for category classification. However, these approaches struggle to effectively distinguish between visually similar food samples, highlighting the pressing need to address fine-grained issues in food recognition. To mitigate these challenges, we propose the adoption of a Gaussian and causal-attention model for fine-grained object recognition.In particular, we train to obtain Gaussian features over target regions, followed by the extraction of fine-grained features from the objects, thereby enhancing the feature mapping capabilities of the target regions. To counteract data drift resulting from uneven data distributions, we employ a counterfactual reasoning approach. By using counterfactual interventions, we analyze the impact of the learned image attention mechanism on network predictions, enabling the network to acquire more useful attention weights for fine-grained image recognition. Finally, we design a learnable loss strategy to balance training stability across various modules, ultimately improving the accuracy of the final target recognition. We validate our approach on four relevant datasets, demonstrating its excellent performance across these four datasets.We experimentally show that GCAM surpasses state-of-the-art methods on the ETH-FOOD101, UECFOOD256, and Vireo-FOOD172 datasets. Furthermore, our approach also achieves state-of-the-art performance on the CUB-200 dataset.
△ Less
Submitted 17 March, 2024;
originally announced March 2024.
-
CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner
Authors:
Tingbing Yan,
Wenzheng Zeng,
Yang Xiao,
Xingyu Tong,
Bo Tan,
Zhiwen Fang,
Zhiguo Cao,
Joey Tianyi Zhou
Abstract:
Most existing one-shot skeleton-based action recognition focuses on raw low-level information (e.g., joint location), and may suffer from local information loss and low generalization ability. To alleviate these, we propose to leverage text description generated from large language models (LLM) that contain high-level human knowledge, to guide feature learning, in a global-local-global way. Partic…
▽ More
Most existing one-shot skeleton-based action recognition focuses on raw low-level information (e.g., joint location), and may suffer from local information loss and low generalization ability. To alleviate these, we propose to leverage text description generated from large language models (LLM) that contain high-level human knowledge, to guide feature learning, in a global-local-global way. Particularly, during training, we design $2$ prompts to gain global and local text descriptions of each action from an LLM. We first utilize the global text description to guide the skeleton encoder focus on informative joints (i.e.,global-to-local). Then we build non-local interaction between local text and joint features, to form the final global representation (i.e., local-to-global). To mitigate the asymmetry issue between the training and inference phases, we further design a dual-branch architecture that allows the model to perform novel class inference without any text input, also making the additional inference cost neglectable compared with the base skeleton encoder. Extensive experiments on three different benchmarks show that CrossGLG consistently outperforms the existing SOTA methods with large margins, and the inference cost (model size) is only $2.8$\% than the previous SOTA. CrossGLG can also serve as a plug-and-play module that can substantially enhance the performance of different SOTA skeleton encoders with a neglectable cost during inference. The source code will be released soon.
△ Less
Submitted 15 March, 2024;
originally announced March 2024.
-
Motion Guided Token Compression for Efficient Masked Video Modeling
Authors:
Yukun Feng,
Yangming Shi,
Fengze Liu,
Tan Yan
Abstract:
Recent developments in Transformers have achieved notable strides in enhancing video comprehension. Nonetheless, the O($N^2$) computation complexity associated with attention mechanisms presents substantial computational hurdles when dealing with the high dimensionality of videos. This challenge becomes particularly pronounced when striving to increase the frames per second (FPS) to enhance the mo…
▽ More
Recent developments in Transformers have achieved notable strides in enhancing video comprehension. Nonetheless, the O($N^2$) computation complexity associated with attention mechanisms presents substantial computational hurdles when dealing with the high dimensionality of videos. This challenge becomes particularly pronounced when striving to increase the frames per second (FPS) to enhance the motion capturing capabilities. Such a pursuit is likely to introduce redundancy and exacerbate the existing computational limitations. In this paper, we initiate by showcasing the enhanced performance achieved through an escalation in the FPS rate. Additionally, we present a novel approach, Motion Guided Token Compression (MGTC), to empower Transformer models to utilize a smaller yet more representative set of tokens for comprehensive video representation. Consequently, this yields substantial reductions in computational burden and remains seamlessly adaptable to increased FPS rates. Specifically, we draw inspiration from video compression algorithms and scrutinize the variance between patches in consecutive video frames across the temporal dimension. The tokens exhibiting a disparity below a predetermined threshold are then masked. Notably, this masking strategy effectively addresses video redundancy while conserving essential information. Our experiments, conducted on widely examined video recognition datasets, Kinetics-400, UCF101 and HMDB51, demonstrate that elevating the FPS rate results in a significant top-1 accuracy score improvement of over 1.6, 1.6 and 4.0. By implementing MGTC with the masking ratio of 25\%, we further augment accuracy by 0.1 and simultaneously reduce computational costs by over 31\% on Kinetics-400. Even within a fixed computational budget, higher FPS rates paired with MGTC sustain performance gains when compared to lower FPS settings.
△ Less
Submitted 10 January, 2024;
originally announced February 2024.
-
"It Must Be Gesturing Towards Me": Gesture-Based Interaction between Autonomous Vehicles and Pedestrians
Authors:
Xiang Chang,
Zihe Chen,
Xiaoyan Dong,
Yuxin Cai,
Tingmin Yan,
Haolin Cai,
Zherui Zhou,
Guyue Zhou,
Jiangtao Gong
Abstract:
Interacting with pedestrians understandably and efficiently is one of the toughest challenges faced by autonomous vehicles (AVs) due to the limitations of current algorithms and external human-machine interfaces (eHMIs). In this paper, we design eHMIs based on gestures inspired by the most popular method of interaction between pedestrians and human drivers. Eight common gestures were selected to c…
▽ More
Interacting with pedestrians understandably and efficiently is one of the toughest challenges faced by autonomous vehicles (AVs) due to the limitations of current algorithms and external human-machine interfaces (eHMIs). In this paper, we design eHMIs based on gestures inspired by the most popular method of interaction between pedestrians and human drivers. Eight common gestures were selected to convey AVs' yielding or non-yielding intentions at uncontrolled crosswalks from previous literature. Through a VR experiment (N1 = 31) and a following online survey (N2 = 394), we discovered significant differences in the usability of gesture-based eHMIs compared to current eHMIs. Good gesture-based eHMIs increase the efficiency of pedestrian-AV interaction while ensuring safety. Poor gestures, however, cause misinterpretation. The underlying reasons were explored: ambiguity regarding the recipient of the signal and whether the gestures are precise, polite, and familiar to pedestrians. Based on this empirical evidence, we discuss potential opportunities and provide valuable insights into developing comprehensible gesture-based eHMIs in the future to support better interaction between AVs and other road users.
△ Less
Submitted 22 February, 2024;
originally announced February 2024.
-
An Error-Matching Exclusion Method for Accelerating Visual SLAM
Authors:
Shaojie Zhang,
Yinghui Wang,
Jiaxing Ma,
Wei Li,
Jinlong Yang,
Tao Yan,
Yukai Wang,
Liangyi Huang,
Mingfeng Wang,
Ibragim R. Atadjanov
Abstract:
In Visual SLAM, achieving accurate feature matching consumes a significant amount of time, severely impacting the real-time performance of the system. This paper proposes an accelerated method for Visual SLAM by integrating GMS (Grid-based Motion Statistics) with RANSAC (Random Sample Consensus) for the removal of mismatched features. The approach first utilizes the GMS algorithm to estimate the q…
▽ More
In Visual SLAM, achieving accurate feature matching consumes a significant amount of time, severely impacting the real-time performance of the system. This paper proposes an accelerated method for Visual SLAM by integrating GMS (Grid-based Motion Statistics) with RANSAC (Random Sample Consensus) for the removal of mismatched features. The approach first utilizes the GMS algorithm to estimate the quantity of matched pairs within the neighborhood and ranks the matches based on their confidence. Subsequently, the Random Sample Consensus (RANSAC) algorithm is employed to further eliminate mismatched features. To address the time-consuming issue of randomly selecting all matched pairs, this method transforms it into the problem of prioritizing sample selection from high-confidence matches. This enables the iterative solution of the optimal model. Experimental results demonstrate that the proposed method achieves a comparable accuracy to the original GMS-RANSAC while reducing the average runtime by 24.13% on the KITTI, TUM desk, and TUM doll datasets.
△ Less
Submitted 25 February, 2024; v1 submitted 22 February, 2024;
originally announced February 2024.
-
A Feature Matching Method Based on Multi-Level Refinement Strategy
Authors:
Shaojie Zhang,
Yinghui Wang,
Jiaxing Ma,
Wei Li,
Jinlong Yang,
Tao Yan,
Yukai Wang,
Liangyi Huang,
Mingfeng Wang,
Ibragim R. Atadjanov
Abstract:
Feature matching is a fundamental and crucial process in visual SLAM, and precision has always been a challenging issue in feature matching. In this paper, based on a multi-level fine matching strategy, we propose a new feature matching method called KTGP-ORB. This method utilizes the similarity of local appearance in the Hamming space generated by feature descriptors to establish initial correspo…
▽ More
Feature matching is a fundamental and crucial process in visual SLAM, and precision has always been a challenging issue in feature matching. In this paper, based on a multi-level fine matching strategy, we propose a new feature matching method called KTGP-ORB. This method utilizes the similarity of local appearance in the Hamming space generated by feature descriptors to establish initial correspondences. It combines the constraint of local image motion smoothness, uses the GMS algorithm to enhance the accuracy of initial matches, and finally employs the PROSAC algorithm to optimize matches, achieving precise matching based on global grayscale information in Euclidean space. Experimental results demonstrate that the KTGP-ORB method reduces the error by an average of 29.92% compared to the ORB algorithm in complex scenes with illumination variations and blur.
△ Less
Submitted 25 February, 2024; v1 submitted 20 February, 2024;
originally announced February 2024.
-
A Robust Error-Resistant View Selection Method for 3D Reconstruction
Authors:
Shaojie Zhang,
Yinghui Wang,
Bin Nan,
Wei Li,
Jinlong Yang,
Tao Yan,
Yukai Wang,
Liangyi Huang,
Mingfeng Wang,
Ibragim R. Atadjanov
Abstract:
To address the issue of increased triangulation uncertainty caused by selecting views with small camera baselines in Structure from Motion (SFM) view selection, this paper proposes a robust error-resistant view selection method. The method utilizes a triangulation-based computation to obtain an error-resistant model, which is then used to construct an error-resistant matrix. The sorting results of…
▽ More
To address the issue of increased triangulation uncertainty caused by selecting views with small camera baselines in Structure from Motion (SFM) view selection, this paper proposes a robust error-resistant view selection method. The method utilizes a triangulation-based computation to obtain an error-resistant model, which is then used to construct an error-resistant matrix. The sorting results of each row in the error-resistant matrix determine the candidate view set for each view. By traversing the candidate view sets of all views and completing the missing views based on the error-resistant matrix, the integrity of 3D reconstruction is ensured. Experimental comparisons between this method and the exhaustive method with the highest accuracy in the COLMAP program are conducted in terms of average reprojection error and absolute trajectory error in the reconstruction results. The proposed method demonstrates an average reduction of 29.40% in reprojection error accuracy and 5.07% in absolute trajectory error on the TUM dataset and DTU dataset.
△ Less
Submitted 25 February, 2024; v1 submitted 17 February, 2024;
originally announced February 2024.
-
Analyzing Reward Dynamics and Decentralization in Ethereum 2.0: An Advanced Data Engineering Workflow and Comprehensive Datasets for Proof-of-Stake Incentives
Authors:
Tao Yan,
Shengnan Li,
Benjamin Kraner,
Luyao Zhang,
Claudio J. Tessone
Abstract:
Ethereum 2.0, as the preeminent smart contract blockchain platform, guarantees the precise execution of applications without third-party intervention. At its core, this system leverages the Proof-of-Stake (PoS) consensus mechanism, which utilizes a stochastic process to select validators for block proposal and validation, consequently rewarding them for their contributions. However, the implementa…
▽ More
Ethereum 2.0, as the preeminent smart contract blockchain platform, guarantees the precise execution of applications without third-party intervention. At its core, this system leverages the Proof-of-Stake (PoS) consensus mechanism, which utilizes a stochastic process to select validators for block proposal and validation, consequently rewarding them for their contributions. However, the implementation of blockchain technology often diverges from its central tenet of decentralized consensus, presenting significant analytical challenges. Our study collects consensus reward data from the Ethereum Beacon chain and conducts a comprehensive analysis of reward distribution and evolution, categorizing them into attestation, proposer and sync committee rewards. To evaluate the degree of decentralization in PoS Ethereum, we apply several inequality indices, including the Shannon entropy, the Gini Index, the Nakamoto Coefficient, and the Herfindahl-Hirschman Index (HHI). Our comprehensive dataset is publicly available on Harvard Dataverse, and our analytical methodologies are accessible via GitHub, promoting open-access research. Additionally, we provide insights on utilizing our data for future investigations focused on assessing, augmenting, and refining the decentralization, security, and efficiency of blockchain systems.
△ Less
Submitted 16 February, 2024;
originally announced February 2024.
-
Contrastive Instruction Tuning
Authors:
Tianyi Lorena Yan,
Fei Wang,
James Y. Huang,
Wenxuan Zhou,
Fan Yin,
Aram Galstyan,
Wenpeng Yin,
Muhao Chen
Abstract:
Instruction tuning has been used as a promising approach to improve the performance of large language models (LLMs) on unseen tasks. However, current LLMs exhibit limited robustness to unseen instructions, generating inconsistent outputs when the same instruction is phrased with slightly varied forms or language styles. This behavior indicates LLMs' lack of robustness to textual variations and gen…
▽ More
Instruction tuning has been used as a promising approach to improve the performance of large language models (LLMs) on unseen tasks. However, current LLMs exhibit limited robustness to unseen instructions, generating inconsistent outputs when the same instruction is phrased with slightly varied forms or language styles. This behavior indicates LLMs' lack of robustness to textual variations and generalizability to unseen instructions, potentially leading to trustworthiness issues. Accordingly, we propose Contrastive Instruction Tuning, which maximizes the similarity between the hidden representations of semantically equivalent instruction-instance pairs while minimizing the similarity between semantically different ones. To facilitate this approach, we augment the existing FLAN collection by paraphrasing task instructions. Experiments on the PromptBench benchmark show that CoIN consistently improves LLMs' robustness to unseen instructions with variations across character, word, sentence, and semantic levels by an average of +2.5% in accuracy. Code is available at https://github.com/luka-group/CoIN.
△ Less
Submitted 6 June, 2024; v1 submitted 16 February, 2024;
originally announced February 2024.
-
Region Feature Descriptor Adapted to High Affine Transformations
Authors:
Shaojie Zhang,
Yinghui Wang,
Bin Nan,
Wei Li,
Jinlong Yang,
Tao Yan,
Yukai Wang,
Liangyi Huang,
Mingfeng Wang,
Ibragim R. Atadjanov
Abstract:
To address the issue of feature descriptors being ineffective in representing grayscale feature information when images undergo high affine transformations, leading to a rapid decline in feature matching accuracy, this paper proposes a region feature descriptor based on simulating affine transformations using classification. The proposed method initially categorizes images with different affine de…
▽ More
To address the issue of feature descriptors being ineffective in representing grayscale feature information when images undergo high affine transformations, leading to a rapid decline in feature matching accuracy, this paper proposes a region feature descriptor based on simulating affine transformations using classification. The proposed method initially categorizes images with different affine degrees to simulate affine transformations and generate a new set of images. Subsequently, it calculates neighborhood information for feature points on this new image set. Finally, the descriptor is generated by combining the grayscale histogram of the maximum stable extremal region to which the feature point belongs and the normalized position relative to the grayscale centroid of the feature point's region. Experimental results, comparing feature matching metrics under affine transformation scenarios, demonstrate that the proposed descriptor exhibits higher precision and robustness compared to existing classical descriptors. Additionally, it shows robustness when integrated with other descriptors.
△ Less
Submitted 25 February, 2024; v1 submitted 15 February, 2024;
originally announced February 2024.
-
A Highlight Removal Method for Capsule Endoscopy Images
Authors:
Shaojie Zhang,
Yinghui Wang,
Peixuan Liu,
Wei Li,
Jinlong Yang,
Tao Yan,
Yukai Wang,
Liangyi Huang,
Mingfeng Wang,
Ibragim R. Atadjanov
Abstract:
The images captured by Wireless Capsule Endoscopy (WCE) always exhibit specular reflections, and removing highlights while preserving the color and texture in the region remains a challenge. To address this issue, this paper proposes a highlight removal method for capsule endoscopy images. Firstly, the confidence and feature terms of the highlight region's edges are computed, where confidence is o…
▽ More
The images captured by Wireless Capsule Endoscopy (WCE) always exhibit specular reflections, and removing highlights while preserving the color and texture in the region remains a challenge. To address this issue, this paper proposes a highlight removal method for capsule endoscopy images. Firstly, the confidence and feature terms of the highlight region's edges are computed, where confidence is obtained by the ratio of known pixels in the RGB space's R channel to the B channel within a window centered on the highlight region's edge pixel, and feature terms are acquired by multiplying the gradient vector of the highlight region's edge pixel with the iso-intensity line. Subsequently, the confidence and feature terms are assigned different weights and summed to obtain the priority of all highlight region's edge pixels, and the pixel with the highest priority is identified. Then, the variance of the highlight region's edge pixels is used to adjust the size of the sample block window, and the best-matching block is searched in the known region based on the RGB color similarity and distance between the sample block and the window centered on the pixel with the highest priority. Finally, the pixels in the best-matching block are copied to the highest priority highlight removal region to achieve the goal of removing the highlight region. Experimental results demonstrate that the proposed method effectively removes highlights from WCE images, with a lower coefficient of variation in the highlight removal region compared to the Crinimisi algorithm and DeepGin method. Additionally, the color and texture in the highlight removal region are similar to those in the surrounding areas, and the texture is continuous.
△ Less
Submitted 25 February, 2024; v1 submitted 10 February, 2024;
originally announced February 2024.
-
Empirical and Theoretical Analysis of Liquid Staking Protocols
Authors:
Krzysztof Gogol,
Benjamin Kraner,
Malte Schlosser,
Tao Yan,
Claudio Tessone,
Burkhard Stiller
Abstract:
Liquid staking has become the largest category of decentralized finance protocols in terms of total value locked. However, few studies exist on its implementation designs or underlying risks. The liquid staking protocols allow for earning staking rewards without the disadvantage of locking the capital at the validators. Yet, they are seen by some as a threat to the Proof-of-Stake blockchain securi…
▽ More
Liquid staking has become the largest category of decentralized finance protocols in terms of total value locked. However, few studies exist on its implementation designs or underlying risks. The liquid staking protocols allow for earning staking rewards without the disadvantage of locking the capital at the validators. Yet, they are seen by some as a threat to the Proof-of-Stake blockchain security.
This paper is the first work that classifies liquid staking implementations. It analyzes the historical performance of major liquid staking tokens in comparison to the traditional staking for the largest Proof-of-Stake blockchains. Furthermore, the research investigates the impact of centralization, maximum extractable value and the migration of Ethereum from Proof-of-Work to Proof-of-Stake on the tokens' performance. Examining the tracking error of the liquid stacking providers to the staking rewards shows that they are persistent and cannot be explained by macro-variables of the currency, such as the variance or return.
△ Less
Submitted 29 January, 2024;
originally announced January 2024.
-
A Dual-way Enhanced Framework from Text Matching Point of View for Multimodal Entity Linking
Authors:
Shezheng Song,
Shan Zhao,
Chengyu Wang,
Tianwei Yan,
Shasha Li,
Xiaoguang Mao,
Meng Wang
Abstract:
Multimodal Entity Linking (MEL) aims at linking ambiguous mentions with multimodal information to entity in Knowledge Graph (KG) such as Wikipedia, which plays a key role in many applications. However, existing methods suffer from shortcomings, including modality impurity such as noise in raw image and ambiguous textual entity representation, which puts obstacles to MEL. We formulate multimodal en…
▽ More
Multimodal Entity Linking (MEL) aims at linking ambiguous mentions with multimodal information to entity in Knowledge Graph (KG) such as Wikipedia, which plays a key role in many applications. However, existing methods suffer from shortcomings, including modality impurity such as noise in raw image and ambiguous textual entity representation, which puts obstacles to MEL. We formulate multimodal entity linking as a neural text matching problem where each multimodal information (text and image) is treated as a query, and the model learns the mapping from each query to the relevant entity from candidate entities. This paper introduces a dual-way enhanced (DWE) framework for MEL: (1) our model refines queries with multimodal data and addresses semantic gaps using cross-modal enhancers between text and image information. Besides, DWE innovatively leverages fine-grained image attributes, including facial characteristic and scene feature, to enhance and refine visual features. (2)By using Wikipedia descriptions, DWE enriches entity semantics and obtains more comprehensive textual representation, which reduces between textual representation and the entities in KG. Extensive experiments on three public benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance, indicating the superiority of our model. The code is released on https://github.com/season1blue/DWE
△ Less
Submitted 31 July, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
TransY-Net:Learning Fully Transformer Networks for Change Detection of Remote Sensing Images
Authors:
Tianyu Yan,
Zifu Wan,
Pingping Zhang,
Gong Cheng,
Huchuan Lu
Abstract:
In the remote sensing field, Change Detection (CD) aims to identify and localize the changed regions from dual-phase images over the same places. Recently, it has achieved great progress with the advances of deep learning. However, current methods generally deliver incomplete CD regions and irregular CD boundaries due to the limited representation ability of the extracted visual features. To relie…
▽ More
In the remote sensing field, Change Detection (CD) aims to identify and localize the changed regions from dual-phase images over the same places. Recently, it has achieved great progress with the advances of deep learning. However, current methods generally deliver incomplete CD regions and irregular CD boundaries due to the limited representation ability of the extracted visual features. To relieve these issues, in this work we propose a novel Transformer-based learning framework named TransY-Net for remote sensing image CD, which improves the feature extraction from a global view and combines multi-level visual features in a pyramid manner. More specifically, the proposed framework first utilizes the advantages of Transformers in long-range dependency modeling. It can help to learn more discriminative global-level features and obtain complete CD regions. Then, we introduce a novel pyramid structure to aggregate multi-level visual features from Transformers for feature enhancement. The pyramid structure grafted with a Progressive Attention Module (PAM) can improve the feature representation ability with additional inter-dependencies through spatial and channel attentions. Finally, to better train the whole framework, we utilize the deeply-supervised learning with multiple boundary-aware loss functions. Extensive experiments demonstrate that our proposed method achieves a new state-of-the-art performance on four optical and two SAR image CD benchmarks. The source code is released at https://github.com/Drchip61/TransYNet.
△ Less
Submitted 22 October, 2023;
originally announced October 2023.
-
Gem5Pred: Predictive Approaches For Gem5 Simulation Time
Authors:
Tian Yan,
Xueyang Li,
Sifat Ut Taki,
Saeid Mehrdad
Abstract:
Gem5, an open-source, flexible, and cost-effective simulator, is widely recognized and utilized in both academic and industry fields for hardware simulation. However, the typically time-consuming nature of simulating programs on Gem5 underscores the need for a predictive model that can estimate simulation time. As of now, no such dataset or model exists. In response to this gap, this paper makes a…
▽ More
Gem5, an open-source, flexible, and cost-effective simulator, is widely recognized and utilized in both academic and industry fields for hardware simulation. However, the typically time-consuming nature of simulating programs on Gem5 underscores the need for a predictive model that can estimate simulation time. As of now, no such dataset or model exists. In response to this gap, this paper makes a novel contribution by introducing a unique dataset specifically created for this purpose. We also conducted analysis of the effects of different instruction types on the simulation time in Gem5. After this, we employ three distinct models leveraging CodeBERT to execute the prediction task based on the developed dataset. Our superior regression model achieves a Mean Absolute Error (MAE) of 0.546, while our top-performing classification model records an Accuracy of 0.696. Our models establish a foundation for future investigations on this topic, serving as benchmarks against which subsequent models can be compared. We hope that our contribution can simulate further research in this field. The dataset we used is available at https://github.com/XueyangLiOSU/Gem5Pred.
△ Less
Submitted 10 October, 2023;
originally announced October 2023.
-
Has Sentiment Returned to the Pre-pandemic Level? A Sentiment Analysis Using U.S. College Subreddit Data from 2019 to 2022
Authors:
Tian Yan,
Fang Liu
Abstract:
As impact of COVID-19 pandemic winds down, both individuals and society gradually return to pre-pandemic activities. This study aims to explore how people's emotions have changed from the pre-pandemic during the pandemic to post-emergency period and whether it has returned to pre-pandemic level. We collected Reddit data in 2019 (pre-pandemic), 2020 (peak pandemic), 2021, and 2022 (late stages of p…
▽ More
As impact of COVID-19 pandemic winds down, both individuals and society gradually return to pre-pandemic activities. This study aims to explore how people's emotions have changed from the pre-pandemic during the pandemic to post-emergency period and whether it has returned to pre-pandemic level. We collected Reddit data in 2019 (pre-pandemic), 2020 (peak pandemic), 2021, and 2022 (late stages of pandemic, transitioning period to post-emergency period) from subreddits in 128 universities/colleges in the U.S., and a set of school-level characteristics. We predicted two sets of sentiments from a pre-trained Robustly Optimized BERT pre-training approach (RoBERTa) and graph attention network (GAT) that leverages both rich semantic and relational information among posted messages and then applied a logistic stacking method to obtain the final sentiment classification. After obtaining sentiment label for each message, we used a generalized linear mixed-effects model to estimate temporal trend in sentiment from 2019 to 2022 and how school-level factors may affect sentiment. Compared to the year 2019, the odds of negative sentiment in years 2020, 2021, and 2022 are 24%, 4.3%, and 10.3% higher, respectively, which are all statistically significant(adjusted $p$<0.05). Our study findings suggest a partial recovery in the sentiment composition in the post-pandemic-emergency era. The results align with common expectations and provide a detailed quantification of how sentiments have evolved from 2019 to 2022.
△ Less
Submitted 15 September, 2023;
originally announced September 2023.
-
MS23D: A 3D Object Detection Method Using Multi-Scale Semantic Feature Points to Construct 3D Feature Layer
Authors:
Yongxin Shao,
Aihong Tan,
Binrui Wang,
Tianhong Yan,
Zhetao Sun,
Yiyang Zhang,
Jiaxin Liu
Abstract:
LiDAR point clouds can effectively depict the motion and posture of objects in three-dimensional space. Many studies accomplish the 3D object detection by voxelizing point clouds. However, in autonomous driving scenarios, the sparsity and hollowness of point clouds create some difficulties for voxel-based methods. The sparsity of point clouds makes it challenging to describe the geometric features…
▽ More
LiDAR point clouds can effectively depict the motion and posture of objects in three-dimensional space. Many studies accomplish the 3D object detection by voxelizing point clouds. However, in autonomous driving scenarios, the sparsity and hollowness of point clouds create some difficulties for voxel-based methods. The sparsity of point clouds makes it challenging to describe the geometric features of objects. The hollowness of point clouds poses difficulties for the aggregation of 3D features. We propose a two-stage 3D object detection framework, called MS23D. (1) We propose a method using voxel feature points from multi-branch to construct the 3D feature layer. Using voxel feature points from different branches, we construct a relatively compact 3D feature layer with rich semantic features. Additionally, we propose a distance-weighted sampling method, reducing the loss of foreground points caused by downsampling and allowing the 3D feature layer to retain more foreground points. (2) In response to the hollowness of point clouds, we predict the offsets between deep-level feature points and the object's centroid, making them as close as possible to the object's centroid. This enables the aggregation of these feature points with abundant semantic features. For feature points from shallow-level, we retain them on the object's surface to describe the geometric features of the object. To validate our approach, we evaluated its effectiveness on both the KITTI and ONCE datasets.
△ Less
Submitted 10 August, 2024; v1 submitted 31 August, 2023;
originally announced August 2023.