-
Mapping at First Sense: A Lightweight Neural Network-Based Indoor Structures Prediction Method for Robot Autonomous Exploration
Authors:
Haojia Gao,
Haohua Que,
Kunrong Li,
Weihao Shan,
Mingkai Liu,
Rong Zhao,
Lei Mu,
Xinghua Yang,
Qi Wei,
Fei Qiao
Abstract:
Autonomous exploration in unknown environments is a critical challenge in robotics, particularly for applications such as indoor navigation, search and rescue, and service robotics. Traditional exploration strategies, such as frontier-based methods, often struggle to efficiently utilize prior knowledge of structural regularities in indoor spaces. To address this limitation, we propose Mapping at First Sense, a lightweight neural network-based approach that predicts unobserved areas in local maps, thereby enhancing exploration efficiency. The core of our method, SenseMapNet, integrates convolutional and transformer-based architectures to infer occluded regions while maintaining computational efficiency for real-time deployment on resource-constrained robots. Additionally, we introduce SenseMapDataset, a curated dataset constructed from KTH and HouseExpo environments, which facilitates training and evaluation of neural models for indoor exploration. Experimental results demonstrate that SenseMapNet achieves an SSIM (structural similarity) of 0.78, LPIPS (perceptual quality) of 0.68, and an FID (feature distribution alignment) of 239.79, outperforming conventional methods in map reconstruction quality. Compared to traditional frontier-based exploration, our method reduces exploration time by 46.5% (from 2335.56s to 1248.68s) while maintaining a high coverage rate (88%) and achieving a reconstruction accuracy of 88%. The proposed method represents a promising step toward efficient, learning-driven robotic exploration in structured environments.
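The 46.5% figure follows from the two exploration times quoted in the abstract. A minimal check, where the function name `relative_reduction` is purely illustrative and not taken from the paper:

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the fractional reduction of `improved` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Exploration times quoted in the abstract, in seconds.
frontier_time = 2335.56   # traditional frontier-based exploration
sensemap_time = 1248.68   # Mapping at First Sense

reduction_pct = round(100 * relative_reduction(frontier_time, sensemap_time), 1)
print(reduction_pct)  # 46.5
```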
Submitted 5 April, 2025;
originally announced April 2025.
-
SenseExpo: Efficient Autonomous Exploration with Prediction Information from Lightweight Neural Networks
Authors:
Haojia Gao,
Haohua Que,
Hoiian Au,
Weihao Shan,
Mingkai Liu,
Yusen Qin,
Lei Mu,
Rong Zhao,
Xinghua Yang,
Qi Wei,
Fei Qiao
Abstract:
This paper proposes SenseExpo, an efficient autonomous exploration framework based on a lightweight prediction network, which addresses the limitations of traditional methods in computational overhead and environmental generalization. By integrating Generative Adversarial Networks (GANs), Transformer, and Fast Fourier Convolution (FFC), we designed a lightweight prediction model with merely 709k parameters. Our smallest model achieves better performance on the KTH dataset than U-Net (24.5M) and LaMa (51M), delivering a PSNR of 9.026 and an SSIM of 0.718, which represents a 38.7% PSNR improvement over the 51M-parameter LaMa model. Cross-domain testing demonstrates its strong generalization capability, with an FID score of 161.55 on the HouseExpo dataset, significantly outperforming comparable methods. Regarding exploration efficiency, on the KTH dataset, SenseExpo reduces exploration time by approximately 67.9% compared to MapEx. On the MRPB 1.0 dataset, SenseExpo reduces exploration time by roughly 77.1% compared to MapEx. Deployed as a plug-and-play ROS node, the framework seamlessly integrates with existing navigation systems, providing an efficient solution for resource-constrained devices.
Submitted 20 March, 2025;
originally announced March 2025.
-
Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders
Authors:
Weiqiao Shan,
Yuang Li,
Yuhao Zhang,
Yingfeng Luo,
Chen Xu,
Xiaofeng Zhao,
Long Meng,
Yunfei Lu,
Min Zhang,
Hao Yang,
Tong Xiao,
Jingbo Zhu
Abstract:
Connecting audio encoders with large language models (LLMs) allows the LLM to perform various audio understanding tasks, such as automatic speech recognition (ASR) and audio captioning (AC). Most research focuses on training an adapter layer to generate a unified audio feature for the LLM. However, different tasks may require distinct features that emphasize either semantic or acoustic aspects, making task-specific audio features more desirable. In this paper, we propose Prompt-aware Mixture (PaM) to enhance the Speech LLM that uses multiple audio encoders. Our approach involves using different experts to extract different features based on the prompt that indicates different tasks. Experiments demonstrate that with PaM, only one Speech LLM surpasses the best performances achieved by all single-encoder Speech LLMs on ASR, Speaker Number Verification, and AC tasks. PaM also outperforms other feature fusion baselines, such as concatenation and averaging.
Submitted 20 February, 2025;
originally announced February 2025.
-
Optimizing Speech Multi-View Feature Fusion through Conditional Computation
Authors:
Weiqiao Shan,
Yuhao Zhang,
Yuchen Han,
Bei Li,
Xiaofeng Zhao,
Yuang Li,
Min Zhang,
Hao Yang,
Tong Xiao,
Jingbo Zhu
Abstract:
Recent advancements have highlighted the efficacy of self-supervised learning (SSL) features in various speech-related tasks, providing lightweight and versatile multi-view speech representations. However, our study reveals that while SSL features expedite model convergence, they conflict with traditional spectral features like FBanks in terms of update directions. In response, we propose a novel generalized feature fusion framework grounded in conditional computation, featuring a gradient-sensitive gating network and a multi-stage dropout strategy. This framework mitigates feature conflicts and bolsters model robustness to multi-view input features. By integrating SSL and spectral features, our approach accelerates convergence and maintains performance on par with spectral models across multiple speech translation tasks on the MuST-C dataset.
Submitted 14 January, 2025;
originally announced January 2025.
-
Early Exit Is a Natural Capability in Transformer-based Models: An Empirical Study on Early Exit without Joint Optimization
Authors:
Weiqiao Shan,
Long Meng,
Tong Zheng,
Yingfeng Luo,
Bei Li,
Junxin Wang,
Tong Xiao,
Jingbo Zhu
Abstract:
Large language models (LLMs) exhibit exceptional performance across various downstream tasks. However, they encounter limitations due to slow inference speeds stemming from their extensive parameters. Early exit (EE) is an approach that aims to accelerate auto-regressive decoding. EE generates outputs from intermediate layers instead of using the whole model, which offers a promising solution to this challenge. However, the additional output layers and joint optimization used in conventional EE hinder the application of EE in LLMs.
In this paper, we explore the possibility of EE in LLMs without additional output layers and joint optimization. Our findings indicate that EE is a natural capability within transformer-based models. While joint optimization does not give the model EE capability, it must be employed to improve the accuracy of locating the optimal EE layer through gating functions. Additionally, our study reveals patterns in EE behavior from a sub-word perspective based on the LLaMA model, as well as the potential for EE based on sub-layers.
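The idea of exiting early without additional output layers can be sketched as follows: the single final output head is reused on each intermediate hidden state, and decoding stops at the first layer whose prediction is confident enough. This is only an illustrative toy under stated assumptions. The confidence-threshold rule, the function `early_exit_decode`, and the toy layers are hypothetical and are not the paper's gating functions.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_decode(
    hidden: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    output_head: Callable[[Vector], Vector],
    threshold: float,
) -> Tuple[int, int]:
    """Apply layers in order, reusing the one shared output head at every depth.

    Returns (predicted_token_id, number_of_layers_used). The max-probability
    threshold is an assumed exit criterion, chosen only for illustration.
    """
    if not layers:
        raise ValueError("at least one layer is required")
    probs: Vector = []
    depth = 0
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        probs = softmax(output_head(hidden))
        if max(probs) >= threshold:
            break  # confident enough: skip the remaining layers
    return probs.index(max(probs)), depth


# Toy model: each "layer" doubles the hidden state; the head is the identity.
toy_layers = [lambda h: [2 * x for x in h]] * 3
identity_head = lambda h: list(h)

result = early_exit_decode([1.0, 0.0], toy_layers, identity_head, threshold=0.95)
print(result)  # (0, 2): token 0 is emitted after 2 of the 3 layers
```

With a threshold that is never reached, the loop simply runs all layers, which recovers ordinary full-depth decoding.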
Submitted 2 December, 2024;
originally announced December 2024.
-
UVLLM: An Automated Universal RTL Verification Framework using LLMs
Authors:
Yuchen Hu,
Junhao Ye,
Ke Xu,
Jialin Sun,
Shiyue Zhang,
Xinyao Jiao,
Dingrong Pan,
Jie Zhou,
Ning Wang,
Weiwei Shan,
Xinwei Fang,
Xi Wang,
Nan Guan,
Zhe Jiang
Abstract:
Verifying hardware designs in embedded systems is crucial but often labor-intensive and time-consuming. While existing solutions have improved automation, they frequently rely on unrealistic assumptions. To address these challenges, we introduce a novel framework, UVLLM, which combines Large Language Models (LLMs) with the Universal Verification Methodology (UVM) to relax these assumptions. UVLLM significantly enhances the automation of testing and repairing error-prone Register Transfer Level (RTL) code, a critical aspect of verification development. Unlike existing methods, UVLLM ensures that all errors are triggered during verification, achieving a syntax error fix rate of 86.99% and a functional error fix rate of 71.92% on our proposed benchmark. These results demonstrate a substantial improvement in verification efficiency. Additionally, our study highlights the current limitations of LLM applications, particularly their reliance on extensive training data. We emphasize the transformative potential of LLMs in hardware design verification and suggest promising directions for future research in AI-driven hardware design methodologies. The dataset and code repository: https://anonymous.4open.science/r/UVLLM/.
Submitted 25 November, 2024;
originally announced November 2024.
-
AnalogGym: An Open and Practical Testing Suite for Analog Circuit Synthesis
Authors:
Jintao Li,
Haochang Zhi,
Ruiyu Lyu,
Wangzhen Li,
Zhaori Bi,
Keren Zhu,
Yanhan Zeng,
Weiwei Shan,
Changhao Yan,
Fan Yang,
Yun Li,
Xuan Zeng
Abstract:
Recent advances in machine learning (ML) for automating analog circuit synthesis have been significant, yet challenges remain. A critical gap is the lack of a standardized evaluation framework, compounded by various process design kits (PDKs), simulation tools, and a limited variety of circuit topologies. These factors hinder direct comparisons and the validation of algorithms. To address these shortcomings, we introduced AnalogGym, an open-source testing suite designed to provide fair and comprehensive evaluations. AnalogGym includes 30 circuit topologies in five categories: sensing front ends, voltage references, low dropout regulators, amplifiers, and phase-locked loops. It supports several technology nodes for academic and commercial applications and is compatible with commercial simulators such as Cadence Spectre, Synopsys HSPICE, and the open-source simulator Ngspice. AnalogGym standardizes the assessment of ML algorithms in analog circuit synthesis and promotes reproducibility with its open datasets and detailed benchmark specifications. AnalogGym's user-friendly design allows researchers to easily adapt it for robust, transparent comparisons of state-of-the-art methods, while also exposing them to real-world industrial design challenges, enhancing the practical relevance of their work. Additionally, we have conducted a comprehensive comparison study of various analog sizing methods on AnalogGym, highlighting the capabilities and advantages of different approaches. AnalogGym is available in the GitHub repository https://github.com/CODA-Team/AnalogGym. The documentation is also available at http://coda-team.github.io/AnalogGym/.
Submitted 13 September, 2024;
originally announced September 2024.
-
Decoupling Feature Representations of Ego and Other Modalities for Incomplete Multi-modal Brain Tumor Segmentation
Authors:
Kaixiang Yang,
Wenqi Shan,
Xudong Li,
Xuan Wang,
Xikai Yang,
Xi Wang,
Pheng-Ann Heng,
Qiang Li,
Zhiwei Wang
Abstract:
Multi-modal brain tumor segmentation typically involves four magnetic resonance imaging (MRI) modalities, while incomplete modalities significantly degrade performance. Existing solutions employ explicit or implicit modality adaptation, aligning features across modalities or learning a fused feature robust to modality incompleteness. They share a common goal of encouraging each modality to express both itself and the others. However, the two expression abilities are entangled as a whole in a seamless feature space, resulting in prohibitive learning burdens. In this paper, we propose DeMoSeg to enhance the modality adaptation by Decoupling the task of representing the ego and other Modalities for robust incomplete multi-modal Segmentation. The decoupling is extremely lightweight, simply using two convolutions to map each modality onto four feature sub-spaces. The first sub-space expresses itself (Self-feature), while the remaining sub-spaces substitute for other modalities (Mutual-features). The Self- and Mutual-features interactively guide each other through a carefully-designed Channel-wised Sparse Self-Attention (CSSA). After that, a Radiologist-mimic Cross-modality expression Relationships (RCR) is introduced to have available modalities provide Self-feature and also 'lend' their Mutual-features to compensate for the absent ones by exploiting the clinical prior knowledge. The benchmark results on BraTS2020, BraTS2018 and BraTS2015 verify DeMoSeg's superiority thanks to the alleviated modality adaptation difficulty. Concretely, for BraTS2020, DeMoSeg increases Dice by at least 0.92%, 2.95% and 4.95% on whole tumor, tumor core and enhanced tumor regions, respectively, compared to other state-of-the-art methods. Code is available at https://github.com/kk42yy/DeMoSeg
Submitted 16 August, 2024;
originally announced August 2024.
-
MEIC: Re-thinking RTL Debug Automation using LLMs
Authors:
Ke Xu,
Jialin Sun,
Yuchen Hu,
Xinwei Fang,
Weiwei Shan,
Xi Wang,
Zhe Jiang
Abstract:
The deployment of Large Language Models (LLMs) for code debugging (e.g., C and Python) is widespread, benefiting from their ability to understand and interpret intricate concepts. However, in the semiconductor industry, utilising LLMs to debug Register Transfer Level (RTL) code is still insufficient, largely due to the underrepresentation of RTL-specific data in training sets. This work introduces a novel framework, Make Each Iteration Count (MEIC), which contrasts with traditional one-shot LLM-based debugging methods that heavily rely on prompt engineering, model tuning, and model training. MEIC utilises LLMs in an iterative process to overcome the limitation of LLMs in RTL code debugging, which is suitable for identifying and correcting both syntax and function errors, while effectively managing the uncertainties inherent in LLM operations. To evaluate our framework, we provide an open-source dataset comprising 178 common RTL programming errors. The experimental results demonstrate that the proposed debugging framework achieves a fix rate of 93% for syntax errors and 78% for function errors, with up to a 48x speedup in the debugging process compared with experienced engineers. The dataset and code repository: https://anonymous.4open.science/r/Verilog-Auto-Debug-6E7F/.
Submitted 10 May, 2024;
originally announced May 2024.
-
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
Authors:
Kristen Grauman,
Andrew Westbury,
Lorenzo Torresani,
Kris Kitani,
Jitendra Malik,
Triantafyllos Afouras,
Kumar Ashutosh,
Vijay Baiyya,
Siddhant Bansal,
Bikram Boote,
Eugene Byrne,
Zach Chavis,
Joya Chen,
Feng Cheng,
Fu-Jen Chu,
Sean Crane,
Avijit Dasgupta,
Jing Dong,
Maria Escobar,
Cristhian Forigua,
Abrham Gebreselasie,
Sanjay Haresh,
Jing Huang,
Md Mohaiminul Islam,
Suyog Jain
, et al. (76 additional authors not shown)
Abstract:
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions -- including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community. Project page: http://ego-exo4d-data.org/
Submitted 25 September, 2024; v1 submitted 30 November, 2023;
originally announced November 2023.
-
PartialFormer: Modeling Part Instead of Whole for Machine Translation
Authors:
Tong Zheng,
Bei Li,
Huiwen Bao,
Jiale Wang,
Weiqiao Shan,
Tong Xiao,
Jingbo Zhu
Abstract:
The design choices in Transformer feed-forward neural networks have resulted in significant computational and parameter overhead. In this work, we emphasize the importance of hidden dimensions in designing lightweight FFNs, a factor often overlooked in previous architectures. Guided by this principle, we introduce PartialFormer, a parameter-efficient Transformer architecture utilizing multiple smaller FFNs to reduce parameters and computation while maintaining essential hidden dimensions. These smaller FFNs are integrated into a multi-head attention mechanism for effective collaboration. We also propose a tailored head scaling strategy to enhance PartialFormer's capabilities. Furthermore, we present a residual-like attention calculation to improve depth scaling within PartialFormer. Extensive experiments on 9 translation tasks and 1 abstractive summarization task validate the effectiveness of our PartialFormer approach on machine translation and summarization tasks. Our code will be available at: https://github.com/zhengkid/PartialFormer.
Submitted 5 June, 2024; v1 submitted 23 October, 2023;
originally announced October 2023.
-
Collaborative Route Planning of UAVs, Workers and Cars for Crowdsensing in Disaster Response
Authors:
Lei Han,
Chunyu Tu,
Zhiwen Yu,
Zhiyong Yu,
Weihua Shan,
Liang Wang,
Bin Guo
Abstract:
Efficiently obtaining up-to-date information in a disaster-stricken area is the key to successful disaster response. Unmanned aerial vehicles (UAVs), workers and cars can collaborate to accomplish sensing tasks, such as data collection, in disaster-stricken areas. In this paper, we explicitly address the route planning for a group of agents, including UAVs, workers, and cars, with the goal of maximizing the task completion rate. We propose MANF-RL-RP, a heterogeneous multi-agent route planning algorithm that incorporates several efficient designs, including global-local dual information processing and a tailored model structure for heterogeneous multi-agent systems. Global-local dual information processing encompasses the extraction and dissemination of spatial features from global information, as well as the partitioning and filtering of local information from individual agents. To construct the model structure for heterogeneous multi-agent systems, we design a common data structure to represent the states of different agents, prove the Markovian property of the agents' decision-making process to simplify the model structure, and design a reasonable reward function to train the model. Finally, we conduct detailed experiments based on rich simulation data. In comparison to the baseline algorithms, namely Greedy-SC-RP and MANF-DNN-RP, MANF-RL-RP exhibits a significant improvement in task completion rate.
Submitted 21 August, 2023;
originally announced August 2023.
-
Human 3D Avatar Modeling with Implicit Neural Representation: A Brief Survey
Authors:
Mingyang Sun,
Dingkang Yang,
Dongliang Kou,
Yang Jiang,
Weihua Shan,
Zhe Yan,
Lihua Zhang
Abstract:
A human 3D avatar is one of the important elements in the metaverse, and the modeling effect directly affects people's visual experience. However, the human body has a complex topology and diverse details, so it is often expensive, time-consuming, and laborious to build a satisfactory model. Recent studies have proposed a novel method, implicit neural representation, which is a continuous representation method and can describe objects with arbitrary topology at arbitrary resolution. Researchers have applied implicit neural representation to human 3D avatar modeling and obtained better results than traditional methods. This paper comprehensively reviews the application of implicit neural representation in human body modeling. First, we introduce three implicit representations (occupancy field, SDF, and NeRF) and classify the literature investigated in this paper. Then the applications of implicit modeling methods to the body, hand, and head are compared and analyzed. Finally, we point out the shortcomings of current work and provide suggestions for researchers.
Submitted 6 June, 2023;
originally announced June 2023.
-
Diffusion-Based 3D Human Pose Estimation with Multi-Hypothesis Aggregation
Authors:
Wenkang Shan,
Zhenhua Liu,
Xinfeng Zhang,
Zhao Wang,
Kai Han,
Shanshe Wang,
Siwei Ma,
Wen Gao
Abstract:
In this paper, a novel Diffusion-based 3D Pose estimation (D3DP) method with Joint-wise reProjection-based Multi-hypothesis Aggregation (JPMA) is proposed for probabilistic 3D human pose estimation. On the one hand, D3DP generates multiple possible 3D pose hypotheses for a single 2D observation. It gradually diffuses the ground truth 3D poses to a random distribution, and learns a denoiser conditioned on 2D keypoints to recover the uncontaminated 3D poses. The proposed D3DP is compatible with existing 3D pose estimators and lets users balance efficiency and accuracy during inference through two customizable parameters. On the other hand, JPMA is proposed to assemble multiple hypotheses generated by D3DP into a single 3D pose for practical use. It reprojects 3D pose hypotheses to the 2D camera plane, selects the best hypothesis joint-by-joint based on the reprojection errors, and combines the selected joints into the final pose. The proposed JPMA conducts aggregation at the joint level and makes use of the 2D prior information, both of which have been overlooked by previous approaches. Extensive experiments on Human3.6M and MPI-INF-3DHP datasets show that our method outperforms the state-of-the-art deterministic and probabilistic approaches by 1.5% and 8.9%, respectively. Code is available at https://github.com/paTRICK-swk/D3DP.
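The joint-by-joint selection described for JPMA can be sketched in a few lines. This is a simplified illustration rather than the authors' implementation. The pinhole projection with a principal point at the origin, the `focal` parameter, and the function names `project` and `aggregate_jointwise` are all assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]
Point2 = Tuple[float, float]


def project(p: Point3, focal: float = 1.0) -> Point2:
    """Assumed pinhole projection onto the 2D camera plane (principal point at 0, 0)."""
    x, y, z = p
    return (focal * x / z, focal * y / z)


def aggregate_jointwise(
    hypotheses: Sequence[Sequence[Point3]],
    keypoints_2d: Sequence[Point2],
    focal: float = 1.0,
) -> List[Point3]:
    """For each joint, keep the hypothesis whose reprojection is closest to the 2D keypoint."""
    final_pose: List[Point3] = []
    for j, target in enumerate(keypoints_2d):
        best = min(
            hypotheses,
            key=lambda pose: math.dist(project(pose[j], focal), target),
        )
        final_pose.append(best[j])
    return final_pose


# Two hypothetical 3D pose hypotheses with two joints each.
hyp_a = [(0.0, 0.0, 2.0), (1.0, 1.0, 2.0)]
hyp_b = [(0.2, 0.0, 2.0), (0.5, 0.5, 2.0)]
observed_2d = [(0.0, 0.0), (0.25, 0.25)]

pose = aggregate_jointwise([hyp_a, hyp_b], observed_2d)
print(pose)  # [(0.0, 0.0, 2.0), (0.5, 0.5, 2.0)]
```

Here joint 0 is taken from the first hypothesis and joint 1 from the second, so the final pose mixes joints from different hypotheses, which is the point of aggregating at the joint level rather than at the pose level.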
Submitted 22 August, 2023; v1 submitted 21 March, 2023;
originally announced March 2023.
-
Accuracy and Fidelity Comparison of Luna and DALL-E 2 Diffusion-Based Image Generation Systems
Authors:
Michael Cahyadi,
Muhammad Rafi,
William Shan,
Jurike Moniaga,
Henry Lucky
Abstract:
We qualitatively compare the accuracy and fidelity of two diffusion-based image generation systems, namely DALL-E 2 and Luna, which differ massively in training datasets, algorithmic approaches, prompt resolution, and output upscaling. The methodology used is a qualitative benchmark created by Saharia et al., and in our research we conclude that DALL-E 2 significantly edges out Luna in both alignment and fidelity comparisons.
Submitted 27 February, 2023; v1 submitted 5 January, 2023;
originally announced January 2023.
-
P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose Estimation
Authors:
Wenkang Shan,
Zhenhua Liu,
Xinfeng Zhang,
Shanshe Wang,
Siwei Ma,
Wen Gao
Abstract:
This paper introduces a novel Pre-trained Spatial Temporal Many-to-One (P-STMO) model for the 2D-to-3D human pose estimation task. To reduce the difficulty of capturing spatial and temporal information, we divide this task into two stages: pre-training (Stage I) and fine-tuning (Stage II). In Stage I, a self-supervised pre-training sub-task, termed masked pose modeling, is proposed. The human joints in the input sequence are randomly masked in both spatial and temporal domains. A general form of denoising auto-encoder is exploited to recover the original 2D poses, and in this way the encoder becomes capable of capturing spatial and temporal dependencies. In Stage II, the pre-trained encoder is loaded into the STMO model and fine-tuned. The encoder is followed by a many-to-one frame aggregator to predict the 3D pose in the current frame. In particular, an MLP block is utilized as the spatial feature extractor in STMO, which yields better performance than other methods. In addition, a temporal downsampling strategy is proposed to diminish data redundancy. Extensive experiments on two benchmarks show that our method outperforms state-of-the-art methods with fewer parameters and less computational overhead. For example, our P-STMO model achieves 42.1mm MPJPE on the Human3.6M dataset when using 2D poses from CPN as inputs. Meanwhile, it brings a 1.5-7.1 times speedup over state-of-the-art methods. Code is available at https://github.com/paTRICK-swk/P-STMO.
Submitted 28 July, 2022; v1 submitted 15 March, 2022;
originally announced March 2022.
-
Improving Robustness and Accuracy via Relative Information Encoding in 3D Human Pose Estimation
Authors:
Wenkang Shan,
Haopeng Lu,
Shanshe Wang,
Xinfeng Zhang,
Wen Gao
Abstract:
Most existing 3D human pose estimation approaches focus on predicting 3D positional relationships between the root joint and other human joints (local motion) instead of the overall trajectory of the human body (global motion). Despite the great progress achieved by these approaches, they are not robust to global motion, and lack the ability to accurately predict local motion with a small movement range. To alleviate these two problems, we propose a relative information encoding method that yields positional and temporal enhanced representations. First, we encode positional information by utilizing relative coordinates of 2D poses to enhance the consistency between the input and output distribution. The same posture with different absolute 2D positions can be mapped to a common representation, which helps the prediction resist interference from global motion. Second, we encode temporal information by establishing the connection between the current pose and other poses of the same person within a period of time. More attention is paid to the movement changes before and after the current pose, resulting in better prediction performance on local motion with a small movement range. The ablation studies validate the effectiveness of the proposed relative information encoding method. In addition, we introduce a multi-stage optimization method to the whole framework to further exploit the positional and temporal enhanced representations. Our method outperforms state-of-the-art methods on two public datasets. Code is available at https://github.com/paTRICK-swk/Pose3D-RIE.
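The positional part of the encoding can be illustrated with a small sketch showing how the same posture at two different absolute image positions maps to one common representation. Expressing every joint relative to a single root joint (index 0 here) is an assumption made for illustration, and `to_relative` is a hypothetical helper. The paper's exact choice of reference points may differ.

```python
from typing import List, Sequence, Tuple

Point2 = Tuple[float, float]


def to_relative(pose_2d: Sequence[Point2], root_index: int = 0) -> List[Point2]:
    """Express each 2D joint relative to the assumed root joint."""
    rx, ry = pose_2d[root_index]
    return [(x - rx, y - ry) for x, y in pose_2d]


# The same posture observed at two different absolute positions in the image.
pose_here = [(10.0, 20.0), (12.0, 25.0), (9.0, 18.0)]
pose_shifted = [(x + 100.0, y - 50.0) for x, y in pose_here]

print(to_relative(pose_here))     # [(0.0, 0.0), (2.0, 5.0), (-1.0, -2.0)]
print(to_relative(pose_shifted))  # same list: the global offset is removed
```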
Submitted 29 July, 2021;
originally announced July 2021.
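The positional encoding idea in the abstract above (the same posture at different absolute 2D positions mapping to a common representation) can be illustrated with a minimal sketch. It assumes the simplest possible form of relative coordinates, namely subtracting a root joint from every joint; the function name `to_root_relative` and the choice of joint 0 as root are hypothetical and are not taken from the paper or its repository, which may encode relative information differently.

```python
def to_root_relative(pose_2d, root_index=0):
    """Express every 2D joint relative to a chosen root joint.

    pose_2d is a list of (x, y) tuples; root_index is an assumed
    convention (joint 0 as root), not something stated in the abstract.
    """
    root_x, root_y = pose_2d[root_index]
    return [(x - root_x, y - root_y) for x, y in pose_2d]


# The same posture placed at two different absolute image positions.
pose = [(100.0, 200.0), (110.0, 180.0), (90.0, 220.0)]
shifted_pose = [(x + 35.0, y - 12.0) for x, y in pose]

# After root-relative encoding both poses share one representation.
encoded = to_root_relative(pose)
encoded_shifted = to_root_relative(shifted_pose)
```

In this sketch a global translation of the whole body cancels out, which is the property the abstract attributes to relative coordinates.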
-
Interpolation of Microscale Stress and Strain Fields Based on Mechanical Models
Authors:
Wenzhe Shan,
Udo Nackenhorst
Abstract:
In this short contribution we introduce a new procedure to recover the stress and strain fields for particle systems by mechanical models. Numerical tests for simple loading conditions have shown an excellent match between the estimated values and the reference values. The estimated stress field is also consistent with the so-called Quasicontinuum stress field, which suggests its potential application for scale bridging techniques. The estimated stress fields for complicated loading conditions, such as defects and indentation, are also demonstrated.
Submitted 20 April, 2021;
originally announced April 2021.
-
Computing Arlequin coupling coefficient for concurrent FE-MD approaches
Authors:
Wenzhe Shan,
Udo Nackenhorst
Abstract:
The Arlequin coupling coefficient is essential for concurrent FE-MD models with overlapping domains, but calculating its value is difficult when the geometry of the coupling region is complicated. In this work, we introduce a general procedure for the preprocessing of a concurrent FE-MD model, given that the mesh and atoms have already been created. The procedure is independent of the geometry of the coupling region and can be used for both 2D and 3D problems. It includes determining the relative positions of atoms inside the FE elements in the coupling region, as well as computing the Arlequin coupling coefficient for an arbitrary point inside the coupling region or on its boundary. Two approaches are provided for determining the coefficient: the direct approach and the temperature approach.
Submitted 19 April, 2021;
originally announced April 2021.
-
Spectral Roll-off Points Variations: Exploring Useful Information in Feature Maps by Its Variations
Authors:
Yunkai Yu,
Yuyang You,
Zhihong Yang,
Guozheng Liu,
Peiyao Li,
Zhicheng Yang,
Wenjing Shan
Abstract:
Useful information (UI) is an elusive concept in neural networks. A quantitative measurement of UI is absent, although variations in UI can be recognized from prior knowledge. The communication bandwidth of feature maps decreases after downscaling operations, but UI flows smoothly after training due to its lower Nyquist frequency. Inspired by the low-Nyquist-frequency nature of UI, we propose the use of spectral roll-off points (SROPs) to estimate UI variations. The computation of an SROP is extended from a 1-D signal to a 2-D image by the rotation invariance required in image classification tasks. SROP statistics across feature maps are implemented as layer-wise useful information estimates. We design sanity checks to explore SROP variations when UI variations are produced by variations in model input, model architecture, and training stages. SROP variations are synchronized with UI variations in various randomized and sufficiently trained model structures. Therefore, SROP variations are an accurate and convenient sign of UI variations, which promotes the explainability of data representations with respect to frequency-domain knowledge.
Submitted 12 August, 2021; v1 submitted 30 January, 2021;
originally announced February 2021.
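For readers unfamiliar with the term used in the abstract above, a spectral roll-off point of a 1-D signal is commonly defined as the frequency below which a fixed fraction of the total spectral energy lies. The sketch below shows only that common 1-D definition. The 0.85 fraction, the use of the power spectrum, and the function name `spectral_rolloff_bin` are assumptions for illustration; the paper's 2-D, rotation-invariant extension and its layer-wise statistics are not reproduced here.

```python
import cmath
import math


def spectral_rolloff_bin(signal, fraction=0.85):
    """Return the lowest DFT bin k such that bins 0..k hold at least
    `fraction` of the one-sided spectral power (assumed definition)."""
    n = len(signal)
    power = []
    # Naive DFT over the one-sided spectrum (bins 0 .. n // 2).
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0:
        return 0  # an all-zero signal has no energy to roll off
    cumulative = 0.0
    for k, p in enumerate(power):
        cumulative += p
        if cumulative >= fraction * total:
            return k
    return len(power) - 1


n = 32
# A pure low-frequency tone, and the same tone plus a higher-frequency one.
low = [math.cos(2 * math.pi * 3 * t / n) for t in range(n)]
mixed = [
    math.cos(2 * math.pi * 3 * t / n) + math.cos(2 * math.pi * 12 * t / n)
    for t in range(n)
]
rolloff_low = spectral_rolloff_bin(low)
rolloff_mixed = spectral_rolloff_bin(mixed)
```

Adding high-frequency content moves the roll-off bin upward, which is the kind of variation the abstract proposes to track.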
-
Adversarial Data Encryption
Authors:
Yingdong Hu,
Liang Zhang,
Wei Shan,
Xiaoxiao Qin,
Jing Qi,
Zhenzhou Wu,
Yang Yuan
Abstract:
In the big data era, many organizations face the dilemma of data sharing. Regular data sharing is often necessary for human-centered discussion and communication, especially in medical scenarios. However, unprotected data sharing may also lead to data leakage. Inspired by adversarial attacks, we propose a method for data encryption such that the encrypted data look identical to the original version to human beings but are misleading to machine learning methods. To show the effectiveness of our method, we collaborate with the Beijing Tiantan Hospital, which has a world-leading neurological center. We invite three doctors to manually inspect our encryption method based on real-world medical images. The results show that the encrypted images can be used for diagnosis by the doctors, but not by machine learning methods.
Submitted 11 February, 2020; v1 submitted 10 February, 2020;
originally announced February 2020.