-
LONGER: Scaling Up Long Sequence Modeling in Industrial Recommenders
Authors:
Zheng Chai,
Qin Ren,
Xijun Xiao,
Huizhi Yang,
Bo Han,
Sijun Zhang,
Di Chen,
Hui Lu,
Wenlin Zhao,
Lele Yu,
Xionghang Xie,
Shiru Ren,
Xiang Sun,
Yaocheng Tan,
Peng Xu,
Yuchao Zheng,
Di Wu
Abstract:
Modeling ultra-long user behavior sequences is critical for capturing both long- and short-term preferences in industrial recommender systems. Existing solutions typically rely on two-stage retrieval or indirect modeling paradigms, incuring upstream-downstream inconsistency and computational inefficiency. In this paper, we present LONGER, a Long-sequence Optimized traNsformer for GPU-Efficient Rec…
▽ More
Modeling ultra-long user behavior sequences is critical for capturing both long- and short-term preferences in industrial recommender systems. Existing solutions typically rely on two-stage retrieval or indirect modeling paradigms, incuring upstream-downstream inconsistency and computational inefficiency. In this paper, we present LONGER, a Long-sequence Optimized traNsformer for GPU-Efficient Recommenders. LONGER incorporates (i) a global token mechanism for stabilizing attention over long contexts, (ii) a token merge module with lightweight InnerTransformers and hybrid attention strategy to reduce quadratic complexity, and (iii) a series of engineering optimizations, including training with mixed-precision and activation recomputation, KV cache serving, and the fully synchronous model training and serving framework for unified GPU-based dense and sparse parameter updates. LONGER consistently outperforms strong baselines in both offline metrics and online A/B testing in both advertising and e-commerce services at ByteDance, validating its consistent effectiveness and industrial-level scaling laws. Currently, LONGER has been fully deployed at more than 10 influential scenarios at ByteDance, serving billion users.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
Generalizable Cervical Cancer Screening via Large-scale Pretraining and Test-Time Adaptation
Authors:
Hao Jiang,
Cheng Jin,
Huangjing Lin,
Yanning Zhou,
Xi Wang,
Jiabo Ma,
Li Ding,
Jun Hou,
Runsheng Liu,
Zhizhong Chai,
Luyang Luo,
Huijuan Shi,
Yinling Qian,
Qiong Wang,
Changzhong Li,
Anjia Han,
Ronald Cheong Kin Chan,
Hao Chen
Abstract:
Cervical cancer is a leading malignancy in female reproductive system. While AI-assisted cytology offers a cost-effective and non-invasive screening solution, current systems struggle with generalizability in complex clinical scenarios. To address this issue, we introduced Smart-CCS, a generalizable Cervical Cancer Screening paradigm based on pretraining and adaptation to create robust and general…
▽ More
Cervical cancer is a leading malignancy in female reproductive system. While AI-assisted cytology offers a cost-effective and non-invasive screening solution, current systems struggle with generalizability in complex clinical scenarios. To address this issue, we introduced Smart-CCS, a generalizable Cervical Cancer Screening paradigm based on pretraining and adaptation to create robust and generalizable screening systems. To develop and validate Smart-CCS, we first curated a large-scale, multi-center dataset named CCS-127K, which comprises a total of 127,471 cervical cytology whole-slide images collected from 48 medical centers. By leveraging large-scale self-supervised pretraining, our CCS models are equipped with strong generalization capability, potentially generalizing across diverse scenarios. Then, we incorporated test-time adaptation to specifically optimize the trained CCS model for complex clinical settings, which adapts and refines predictions, improving real-world applicability. We conducted large-scale system evaluation among various cohorts. In retrospective cohorts, Smart-CCS achieved an overall area under the curve (AUC) value of 0.965 and sensitivity of 0.913 for cancer screening on 11 internal test datasets. In external testing, system performance maintained high at 0.950 AUC across 6 independent test datasets. In prospective cohorts, our Smart-CCS achieved AUCs of 0.947, 0.924, and 0.986 in three prospective centers, respectively. Moreover, the system demonstrated superior sensitivity in diagnosing cervical cancer, confirming the accuracy of our cancer screening results by using histology findings for validation. Interpretability analysis with cell and slide predictions further indicated that the system's decision-making aligns with clinical practice. Smart-CCS represents a significant advancement in cancer screening across diverse clinical contexts.
△ Less
Submitted 12 February, 2025;
originally announced February 2025.
-
Large Memory Network for Recommendation
Authors:
Hui Lu,
Zheng Chai,
Yuchao Zheng,
Zhe Chen,
Deping Xie,
Peng Xu,
Xun Zhou,
Di Wu
Abstract:
Modeling user behavior sequences in recommender systems is essential for understanding user preferences over time, enabling personalized and accurate recommendations for improving user retention and enhancing business values. Despite its significance, there are two challenges for current sequential modeling approaches. From the spatial dimension, it is difficult to mutually perceive similar users'…
▽ More
Modeling user behavior sequences in recommender systems is essential for understanding user preferences over time, enabling personalized and accurate recommendations for improving user retention and enhancing business values. Despite its significance, there are two challenges for current sequential modeling approaches. From the spatial dimension, it is difficult to mutually perceive similar users' interests for a generalized intention understanding; from the temporal dimension, current methods are generally prone to forgetting long-term interests due to the fixed-length input sequence. In this paper, we present Large Memory Network (LMN), providing a novel idea by compressing and storing user history behavior information in a large-scale memory block. With the elaborated online deployment strategy, the memory block can be easily scaled up to million-scale in the industry. Extensive offline comparison experiments, memory scaling up experiments, and online A/B test on Douyin E-Commerce Search (ECS) are performed, validating the superior performance of LMN. Currently, LMN has been fully deployed in Douyin ECS, serving millions of users each day.
△ Less
Submitted 17 February, 2025; v1 submitted 8 February, 2025;
originally announced February 2025.
-
Adaptive Domain Scaling for Personalized Sequential Modeling in Recommenders
Authors:
Zheng Chai,
Hui Lu,
Di Chen,
Qin Ren,
Yuchao Zheng,
Xun Zhou
Abstract:
Users generally exhibit complex behavioral patterns and diverse intentions in multiple business scenarios of super applications like Douyin, presenting great challenges to current industrial multi-domain recommenders. To mitigate the discrepancies across diverse domains, researches and industrial practices generally emphasize sophisticated network structures to accomodate diverse data distribution…
▽ More
Users generally exhibit complex behavioral patterns and diverse intentions in multiple business scenarios of super applications like Douyin, presenting great challenges to current industrial multi-domain recommenders. To mitigate the discrepancies across diverse domains, researches and industrial practices generally emphasize sophisticated network structures to accomodate diverse data distributions, while neglecting the inherent understanding of user behavioral sequence from the multi-domain perspective. In this paper, we present Adaptive Domain Scaling (ADS) model, which comprehensively enhances the personalization capability in target-aware sequence modeling across multiple domains. Specifically, ADS comprises of two major modules, including personalized sequence representation generation (PSRG) and personalized candidate representation generation (PCRG). The modules contribute to the tailored multi-domain learning by dynamically learning both the user behavioral sequence item representation and the candidate target item representation under different domains, facilitating adaptive user intention understanding. Experiments are performed on both a public dataset and two billion-scaled industrial datasets, and the extensive results verify the high effectiveness and compatibility of ADS. Besides, we conduct online experiments on two influential business scenarios including Douyin Advertisement Platform and Douyin E-commerce Service Platform, both of which show substantial business improvements. Currently, ADS has been fully deployed in many recommendation services at ByteDance, serving billions of users.
△ Less
Submitted 28 April, 2025; v1 submitted 8 February, 2025;
originally announced February 2025.
-
BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data
Authors:
Xuwu Wang,
Qiwen Cui,
Yunzhe Tao,
Yiran Wang,
Ziwei Chai,
Xiaotian Han,
Boyi Liu,
Jianbo Yuan,
Jing Su,
Guoyin Wang,
Tingkai Liu,
Liyu Chen,
Tianyi Liu,
Tao Sun,
Yufeng Zhang,
Sirui Zheng,
Quanzeng You,
Yang Yang,
Hongxia Yang
Abstract:
Large language models (LLMs) have become increasingly pivotal across various domains, especially in handling complex data types. This includes structured data processing, as exemplified by ChartQA and ChatGPT-Ada, and multimodal unstructured data processing as seen in Visual Question Answering (VQA). These areas have attracted significant attention from both industry and academia. Despite this, th…
▽ More
Large language models (LLMs) have become increasingly pivotal across various domains, especially in handling complex data types. This includes structured data processing, as exemplified by ChartQA and ChatGPT-Ada, and multimodal unstructured data processing as seen in Visual Question Answering (VQA). These areas have attracted significant attention from both industry and academia. Despite this, there remains a lack of unified evaluation methodologies for these diverse data handling scenarios. In response, we introduce BabelBench, an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution. BabelBench incorporates a dataset comprising 247 meticulously curated problems that challenge the models with tasks in perception, commonsense reasoning, logical reasoning, and so on. Besides the basic capabilities of multimodal understanding, structured data processing as well as code generation, these tasks demand advanced capabilities in exploration, planning, reasoning and debugging. Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement. The insights derived from our comprehensive analysis offer valuable guidance for future research within the community. The benchmark data can be found at https://github.com/FFD8FFE/babelbench.
△ Less
Submitted 1 October, 2024;
originally announced October 2024.
-
PRANCE: Joint Token-Optimization and Structural Channel-Pruning for Adaptive ViT Inference
Authors:
Ye Li,
Chen Tang,
Yuan Meng,
Jiajun Fan,
Zenghao Chai,
Xinzhu Ma,
Zhi Wang,
Wenwu Zhu
Abstract:
We introduce PRANCE, a Vision Transformer compression framework that jointly optimizes the activated channels and reduces tokens, based on the characteristics of inputs. Specifically, PRANCE~ leverages adaptive token optimization strategies for a certain computational budget, aiming to accelerate ViTs' inference from a unified data and architectural perspective. However, the joint framework poses…
▽ More
We introduce PRANCE, a Vision Transformer compression framework that jointly optimizes the activated channels and reduces tokens, based on the characteristics of inputs. Specifically, PRANCE~ leverages adaptive token optimization strategies for a certain computational budget, aiming to accelerate ViTs' inference from a unified data and architectural perspective. However, the joint framework poses challenges to both architectural and decision-making aspects. Firstly, while ViTs inherently support variable-token inference, they do not facilitate dynamic computations for variable channels. To overcome this limitation, we propose a meta-network using weight-sharing techniques to support arbitrary channels of the Multi-head Self-Attention and Multi-layer Perceptron layers, serving as a foundational model for architectural decision-making. Second, simultaneously optimizing the structure of the meta-network and input data constitutes a combinatorial optimization problem with an extremely large decision space, reaching up to around $10^{14}$, making supervised learning infeasible. To this end, we design a lightweight selector employing Proximal Policy Optimization for efficient decision-making. Furthermore, we introduce a novel "Result-to-Go" training mechanism that models ViTs' inference process as a Markov decision process, significantly reducing action space and mitigating delayed-reward issues during training. Extensive experiments demonstrate the effectiveness of PRANCE~ in reducing FLOPs by approximately 50\%, retaining only about 10\% of tokens while achieving lossless Top-1 accuracy. Additionally, our framework is shown to be compatible with various token optimization techniques such as pruning, merging, and sequential pruning-merging strategies. The code is available at \href{https://github.com/ChildTang/PRANCE}{https://github.com/ChildTang/PRANCE}.
△ Less
Submitted 6 July, 2024;
originally announced July 2024.
-
STAR: Skeleton-aware Text-based 4D Avatar Generation with In-Network Motion Retargeting
Authors:
Zenghao Chai,
Chen Tang,
Yongkang Wong,
Mohan Kankanhalli
Abstract:
The creation of 4D avatars (i.e., animated 3D avatars) from text description typically uses text-to-image (T2I) diffusion models to synthesize 3D avatars in the canonical space and subsequently applies animation with target motions. However, such an optimization-by-animation paradigm has several drawbacks. (1) For pose-agnostic optimization, the rendered images in canonical pose for naive Score Di…
▽ More
The creation of 4D avatars (i.e., animated 3D avatars) from text description typically uses text-to-image (T2I) diffusion models to synthesize 3D avatars in the canonical space and subsequently applies animation with target motions. However, such an optimization-by-animation paradigm has several drawbacks. (1) For pose-agnostic optimization, the rendered images in canonical pose for naive Score Distillation Sampling (SDS) exhibit domain gap and cannot preserve view-consistency using only T2I priors, and (2) For post hoc animation, simply applying the source motions to target 3D avatars yields translation artifacts and misalignment. To address these issues, we propose Skeleton-aware Text-based 4D Avatar generation with in-network motion Retargeting (STAR). STAR considers the geometry and skeleton differences between the template mesh and target avatar, and corrects the mismatched source motion by resorting to the pretrained motion retargeting techniques. With the informatively retargeted and occlusion-aware skeleton, we embrace the skeleton-conditioned T2I and text-to-video (T2V) priors, and propose a hybrid SDS module to coherently provide multi-view and frame-consistent supervision signals. Hence, STAR can progressively optimize the geometry, texture, and motion in an end-to-end manner. The quantitative and qualitative experiments demonstrate our proposed STAR can synthesize high-quality 4D avatars with vivid animations that align well with the text description. Additional ablation studies shows the contributions of each component in STAR. The source code and demos are available at: \href{https://star-avatar.github.io}{https://star-avatar.github.io}.
△ Less
Submitted 7 June, 2024;
originally announced June 2024.
-
PDNNet: PDN-Aware GNN-CNN Heterogeneous Network for Dynamic IR Drop Prediction
Authors:
Yuxiang Zhao,
Zhuomin Chai,
Xun Jiang,
Yibo Lin,
Runsheng Wang,
Ru Huang
Abstract:
IR drop on the power delivery network (PDN) is closely related to PDN's configuration and cell current consumption. As the integrated circuit (IC) design is growing larger, dynamic IR drop simulation becomes computationally unaffordable and machine learning based IR drop prediction has been explored as a promising solution. Although CNN-based methods have been adapted to IR drop prediction task in…
▽ More
IR drop on the power delivery network (PDN) is closely related to PDN's configuration and cell current consumption. As the integrated circuit (IC) design is growing larger, dynamic IR drop simulation becomes computationally unaffordable and machine learning based IR drop prediction has been explored as a promising solution. Although CNN-based methods have been adapted to IR drop prediction task in several works, the shortcomings of overlooking PDN configuration is non-negligible. In this paper, we consider not only how to properly represent cell-PDN relation, but also how to model IR drop following its physical nature in the feature aggregation procedure. Thus, we propose a novel graph structure, PDNGraph, to unify the representations of the PDN structure and the fine-grained cell-PDN relation. We further propose a dual-branch heterogeneous network, PDNNet, incorporating two parallel GNN-CNN branches to favorably capture the above features during the learning process. Several key designs are presented to make the dynamic IR drop prediction highly effective and interpretable. We are the first work to apply graph structure to deep-learning based dynamic IR drop prediction method. Experiments show that PDNNet outperforms the state-of-the-art CNN-based methods and achieves 545x speedup compared to the commercial tool, which demonstrates the superiority of our method.
△ Less
Submitted 5 December, 2024; v1 submitted 27 March, 2024;
originally announced March 2024.
-
An Expert is Worth One Token: Synergizing Multiple Expert LLMs as Generalist via Expert Token Routing
Authors:
Ziwei Chai,
Guoyin Wang,
Jing Su,
Tianjie Zhang,
Xuanwen Huang,
Xuwu Wang,
Jingjing Xu,
Jianbo Yuan,
Hongxia Yang,
Fei Wu,
Yang Yang
Abstract:
We present Expert-Token-Routing, a unified generalist framework that facilitates seamless integration of multiple expert LLMs. Our framework represents expert LLMs as special expert tokens within the vocabulary of a meta LLM. The meta LLM can route to an expert LLM like generating new tokens. Expert-Token-Routing not only supports learning the implicit expertise of expert LLMs from existing instru…
▽ More
We present Expert-Token-Routing, a unified generalist framework that facilitates seamless integration of multiple expert LLMs. Our framework represents expert LLMs as special expert tokens within the vocabulary of a meta LLM. The meta LLM can route to an expert LLM like generating new tokens. Expert-Token-Routing not only supports learning the implicit expertise of expert LLMs from existing instruction dataset but also allows for dynamic extension of new expert LLMs in a plug-and-play manner. It also conceals the detailed collaboration process from the user's perspective, facilitating interaction as though it were a singular LLM. Our framework outperforms various existing multi-LLM collaboration paradigms across benchmarks that incorporate six diverse expert domains, demonstrating effectiveness and robustness in building generalist LLM system via synergizing multiple expert LLMs.
△ Less
Submitted 11 June, 2024; v1 submitted 25 March, 2024;
originally announced March 2024.
-
Can GNN be Good Adapter for LLMs?
Authors:
Xuanwen Huang,
Kaiqiao Han,
Yang Yang,
Dezheng Bao,
Quanjin Tao,
Ziwei Chai,
Qi Zhu
Abstract:
Recently, large language models (LLMs) have demonstrated superior capabilities in understanding and zero-shot learning on textual data, promising significant advances for many text-related domains. In the graph domain, various real-world scenarios also involve textual data, where tasks and node features can be described by text. These text-attributed graphs (TAGs) have broad applications in social…
▽ More
Recently, large language models (LLMs) have demonstrated superior capabilities in understanding and zero-shot learning on textual data, promising significant advances for many text-related domains. In the graph domain, various real-world scenarios also involve textual data, where tasks and node features can be described by text. These text-attributed graphs (TAGs) have broad applications in social media, recommendation systems, etc. Thus, this paper explores how to utilize LLMs to model TAGs. Previous methods for TAG modeling are based on million-scale LMs. When scaled up to billion-scale LLMs, they face huge challenges in computational costs. Additionally, they also ignore the zero-shot inference capabilities of LLMs. Therefore, we propose GraphAdapter, which uses a graph neural network (GNN) as an efficient adapter in collaboration with LLMs to tackle TAGs. In terms of efficiency, the GNN adapter introduces only a few trainable parameters and can be trained with low computation costs. The entire framework is trained using auto-regression on node text (next token prediction). Once trained, GraphAdapter can be seamlessly fine-tuned with task-specific prompts for various downstream tasks. Through extensive experiments across multiple real-world TAGs, GraphAdapter based on Llama 2 gains an average improvement of approximately 5\% in terms of node classification. Furthermore, GraphAdapter can also adapt to other language models, including RoBERTa, GPT-2. The promising results demonstrate that GNNs can serve as effective adapters for LLMs in TAG modeling.
△ Less
Submitted 20 February, 2024;
originally announced February 2024.
-
Deep Rib Fracture Instance Segmentation and Classification from CT on the RibFrac Challenge
Authors:
Jiancheng Yang,
Rui Shi,
Liang Jin,
Xiaoyang Huang,
Kaiming Kuang,
Donglai Wei,
Shixuan Gu,
Jianying Liu,
Pengfei Liu,
Zhizhong Chai,
Yongjie Xiao,
Hao Chen,
Liming Xu,
Bang Du,
Xiangyi Yan,
Hao Tang,
Adam Alessio,
Gregory Holste,
Jiapeng Zhang,
Xiaoming Wang,
Jianye He,
Lixuan Che,
Hanspeter Pfister,
Ming Li,
Bingbing Ni
Abstract:
Rib fractures are a common and potentially severe injury that can be challenging and labor-intensive to detect in CT scans. While there have been efforts to address this field, the lack of large-scale annotated datasets and evaluation benchmarks has hindered the development and validation of deep learning algorithms. To address this issue, the RibFrac Challenge was introduced, providing a benchmar…
▽ More
Rib fractures are a common and potentially severe injury that can be challenging and labor-intensive to detect in CT scans. While there have been efforts to address this field, the lack of large-scale annotated datasets and evaluation benchmarks has hindered the development and validation of deep learning algorithms. To address this issue, the RibFrac Challenge was introduced, providing a benchmark dataset of over 5,000 rib fractures from 660 CT scans, with voxel-level instance mask annotations and diagnosis labels for four clinical categories (buckle, nondisplaced, displaced, or segmental). The challenge includes two tracks: a detection (instance segmentation) track evaluated by an FROC-style metric and a classification track evaluated by an F1-style metric. During the MICCAI 2020 challenge period, 243 results were evaluated, and seven teams were invited to participate in the challenge summary. The analysis revealed that several top rib fracture detection solutions achieved performance comparable or even better than human experts. Nevertheless, the current rib fracture classification solutions are hardly clinically applicable, which can be an interesting area in the future. As an active benchmark and research resource, the data and online evaluation of the RibFrac Challenge are available at the challenge website. As an independent contribution, we have also extended our previous internal baseline by incorporating recent advancements in large-scale pretrained networks and point-based rib segmentation techniques. The resulting FracNet+ demonstrates competitive performance in rib fracture detection, which lays a foundation for further research and development in AI-assisted rib fracture detection and diagnosis.
△ Less
Submitted 14 February, 2024;
originally announced February 2024.
-
Enhancing Cross-domain Link Prediction via Evolution Process Modeling
Authors:
Xuanwen Huang,
Wei Chow,
Yize Zhu,
Yang Wang,
Ziwei Chai,
Chunping Wang,
Lei Chen,
Yang Yang
Abstract:
This work proposes DyExpert, a dynamic graph model for cross-domain link prediction. It can explicitly model historical evolving processes to learn the evolution pattern of a specific downstream graph and subsequently make pattern-specific link predictions. DyExpert adopts a decode-only transformer and is capable of efficiently parallel training and inference by \textit{conditioned link generation…
▽ More
This work proposes DyExpert, a dynamic graph model for cross-domain link prediction. It can explicitly model historical evolving processes to learn the evolution pattern of a specific downstream graph and subsequently make pattern-specific link predictions. DyExpert adopts a decode-only transformer and is capable of efficiently parallel training and inference by \textit{conditioned link generation} that integrates both evolution modeling and link prediction. DyExpert is trained by extensive dynamic graphs across diverse domains, comprising 6M dynamic edges. Extensive experiments on eight untrained graphs demonstrate that DyExpert achieves state-of-the-art performance in cross-domain link prediction. Compared to the advanced baseline under the same setting, DyExpert achieves an average of 11.40% improvement Average Precision across eight graphs. More impressive, it surpasses the fully supervised performance of 8 advanced baselines on 6 untrained graphs.
△ Less
Submitted 5 February, 2025; v1 submitted 3 February, 2024;
originally announced February 2024.
-
MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation
Authors:
Jiaqi Chen,
Bingqian Lin,
Ran Xu,
Zhenhua Chai,
Xiaodan Liang,
Kwan-Yee K. Wong
Abstract:
Embodied agents equipped with GPT as their brains have exhibited extraordinary decision-making and generalization abilities across various tasks. However, existing zero-shot agents for vision-and-language navigation (VLN) only prompt GPT-4 to select potential locations within localized environments, without constructing an effective "global-view" for the agent to understand the overall environment…
▽ More
Embodied agents equipped with GPT as their brains have exhibited extraordinary decision-making and generalization abilities across various tasks. However, existing zero-shot agents for vision-and-language navigation (VLN) only prompt GPT-4 to select potential locations within localized environments, without constructing an effective "global-view" for the agent to understand the overall environment. In this work, we present a novel map-guided GPT-based agent, dubbed MapGPT, which introduces an online linguistic-formed map to encourage global exploration. Specifically, we build an online map and incorporate it into the prompts that include node information and topological relationships, to help GPT understand the spatial environment. Benefiting from this design, we further propose an adaptive planning mechanism to assist the agent in performing multi-step path planning based on a map, systematically exploring multiple candidate nodes or sub-goals step by step. Extensive experiments demonstrate that our MapGPT is applicable to both GPT-4 and GPT-4V, achieving state-of-the-art zero-shot performance on R2R and REVERIE simultaneously (~10% and ~12% improvements in SR), and showcasing the newly emergent global thinking and path planning abilities of the GPT.
△ Less
Submitted 20 June, 2024; v1 submitted 14 January, 2024;
originally announced January 2024.
-
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks
Authors:
Xueyu Hu,
Ziyu Zhao,
Shuang Wei,
Ziwei Chai,
Qianli Ma,
Guoyin Wang,
Xuwu Wang,
Jing Su,
Jingjing Xu,
Ming Zhu,
Yao Cheng,
Jianbo Yuan,
Jiwei Li,
Kun Kuang,
Yang Yang,
Hongxia Yang,
Fei Wu
Abstract:
In this paper, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks. These tasks require agents to end-to-end solving complex tasks by interacting with an execution environment. This benchmark contains DAEval, a dataset consisting of 257 data analysis questions derived from 52 CSV files, and an agent framework which incorpora…
▽ More
In this paper, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks. These tasks require agents to end-to-end solving complex tasks by interacting with an execution environment. This benchmark contains DAEval, a dataset consisting of 257 data analysis questions derived from 52 CSV files, and an agent framework which incorporates LLMs to serve as data analysis agents for both serving and evaluation. Since data analysis questions are often open-ended and hard to evaluate without human supervision, we adopt a format-prompting technique to convert each question into a closed-form format so that they can be automatically evaluated. Our extensive benchmarking of 34 LLMs uncovers the current challenges encountered in data analysis tasks. In addition, building on top of our agent framework, we develop a specialized agent, DAAgent, which surpasses GPT-3.5 by 3.9% on DABench. Evaluation datasets and toolkits for InfiAgent-DABench are released at https://github.com/InfiAgent/InfiAgent .
△ Less
Submitted 11 March, 2024; v1 submitted 10 January, 2024;
originally announced January 2024.
-
Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models
Authors:
Guangji Bai,
Zheng Chai,
Chen Ling,
Shiyu Wang,
Jiaying Lu,
Nan Zhang,
Tingwei Shi,
Ziyang Yu,
Mengdan Zhu,
Yifei Zhang,
Xinyuan Song,
Carl Yang,
Yue Cheng,
Liang Zhao
Abstract:
The burgeoning field of Large Language Models (LLMs), exemplified by sophisticated models like OpenAI's ChatGPT, represents a significant advancement in artificial intelligence. These models, however, bring forth substantial challenges in the high consumption of computational, memory, energy, and financial resources, especially in environments with limited resource capabilities. This survey aims t…
▽ More
The burgeoning field of Large Language Models (LLMs), exemplified by sophisticated models like OpenAI's ChatGPT, represents a significant advancement in artificial intelligence. These models, however, bring forth substantial challenges in the high consumption of computational, memory, energy, and financial resources, especially in environments with limited resource capabilities. This survey aims to systematically address these challenges by reviewing a broad spectrum of techniques designed to enhance the resource efficiency of LLMs. We categorize methods based on their optimization focus: computational, memory, energy, financial, and network resources and their applicability across various stages of an LLM's lifecycle, including architecture design, pretraining, finetuning, and system design. Additionally, the survey introduces a nuanced categorization of resource efficiency techniques by their specific resource types, which uncovers the intricate relationships and mappings between various resources and corresponding optimization techniques. A standardized set of evaluation metrics and datasets is also presented to facilitate consistent and fair comparisons across different models and techniques. By offering a comprehensive overview of the current sota and identifying open research avenues, this survey serves as a foundational reference for researchers and practitioners, aiding them in developing more sustainable and efficient LLMs in a rapidly evolving landscape.
△ Less
Submitted 29 December, 2024; v1 submitted 31 December, 2023;
originally announced January 2024.
-
GraphLLM: Boosting Graph Reasoning Ability of Large Language Model
Authors:
Ziwei Chai,
Tianjie Zhang,
Liang Wu,
Kaiqiao Han,
Xiaohai Hu,
Xuanwen Huang,
Yang Yang
Abstract:
The advancement of Large Language Models (LLMs) has remarkably pushed the boundaries towards artificial general intelligence (AGI), with their exceptional ability on understanding diverse types of information, including but not limited to images and audio. Despite this progress, a critical gap remains in empowering LLMs to proficiently understand and reason on graph data. Recent studies underscore…
▽ More
The advancement of Large Language Models (LLMs) has remarkably pushed the boundaries towards artificial general intelligence (AGI), with their exceptional ability on understanding diverse types of information, including but not limited to images and audio. Despite this progress, a critical gap remains in empowering LLMs to proficiently understand and reason on graph data. Recent studies underscore LLMs' underwhelming performance on fundamental graph reasoning tasks. In this paper, we endeavor to unearth the obstacles that impede LLMs in graph reasoning, pinpointing the common practice of converting graphs into natural language descriptions (Graph2Text) as a fundamental bottleneck. To overcome this impediment, we introduce GraphLLM, a pioneering end-to-end approach that synergistically integrates graph learning models with LLMs. This synergy equips LLMs with the ability to proficiently interpret and reason on graph data, harnessing the superior expressive power of graph learning models. Our empirical evaluations across four fundamental graph reasoning tasks validate the effectiveness of GraphLLM. The results exhibit a substantial average accuracy enhancement of 54.44%, alongside a noteworthy context reduction of 96.45% across various graph reasoning tasks.
△ Less
Submitted 9 October, 2023;
originally announced October 2023.
-
Staleness-Alleviated Distributed GNN Training via Online Dynamic-Embedding Prediction
Authors:
Guangji Bai,
Ziyang Yu,
Zheng Chai,
Yue Cheng,
Liang Zhao
Abstract:
Despite the recent success of Graph Neural Networks (GNNs), it remains challenging to train GNNs on large-scale graphs due to neighbor explosions. As a remedy, distributed computing becomes a promising solution by leveraging abundant computing resources (e.g., GPU). However, the node dependency of graph data increases the difficulty of achieving high concurrency in distributed GNN training, which…
▽ More
Despite the recent success of Graph Neural Networks (GNNs), it remains challenging to train GNNs on large-scale graphs due to neighbor explosions. As a remedy, distributed computing becomes a promising solution by leveraging abundant computing resources (e.g., GPU). However, the node dependency of graph data increases the difficulty of achieving high concurrency in distributed GNN training, which suffers from the massive communication overhead. To address it, Historical value approximation is deemed a promising class of distributed training techniques. It utilizes an offline memory to cache historical information (e.g., node embedding) as an affordable approximation of the exact value and achieves high concurrency. However, such benefits come at the cost of involving dated training information, leading to staleness, imprecision, and convergence issues. To overcome these challenges, this paper proposes SAT (Staleness-Alleviated Training), a novel and scalable distributed GNN training framework that reduces the embedding staleness adaptively. The key idea of SAT is to model the GNN's embedding evolution as a temporal graph and build a model upon it to predict future embedding, which effectively alleviates the staleness of the cached historical embedding. We propose an online algorithm to train the embedding predictor and the distributed GNN alternatively and further provide a convergence analysis. Empirically, we demonstrate that SAT can effectively reduce embedding staleness and thus achieve better performance and convergence speed on multiple large-scale graph datasets.
△ Less
Submitted 10 December, 2023; v1 submitted 25 August, 2023;
originally announced August 2023.
-
Divide and Adapt: Active Domain Adaptation via Customized Learning
Authors:
Duojun Huang,
Jichang Li,
Weikai Chen,
Junshi Huang,
Zhenhua Chai,
Guanbin Li
Abstract:
Active domain adaptation (ADA) aims to improve the model adaptation performance by incorporating active learning (AL) techniques to label a maximally-informative subset of target samples. Conventional AL methods do not consider the existence of domain shift, and hence, fail to identify the truly valuable samples in the context of domain adaptation. To accommodate active learning and domain adaptio…
▽ More
Active domain adaptation (ADA) aims to improve the model adaptation performance by incorporating active learning (AL) techniques to label a maximally-informative subset of target samples. Conventional AL methods do not consider the existence of domain shift, and hence, fail to identify the truly valuable samples in the context of domain adaptation. To accommodate active learning and domain adaption, the two naturally different tasks, in a collaborative framework, we advocate that a customized learning strategy for the target data is the key to the success of ADA solutions. We present Divide-and-Adapt (DiaNA), a new ADA framework that partitions the target instances into four categories with stratified transferable properties. With a novel data subdivision protocol based on uncertainty and domainness, DiaNA can accurately recognize the most gainful samples. While sending the informative instances for annotation, DiaNA employs tailored learning strategies for the remaining categories. Furthermore, we propose an informativeness score that unifies the data partitioning criteria. This enables the use of a Gaussian mixture model (GMM) to automatically sample unlabeled data into the proposed four categories. Thanks to the "divideand-adapt" spirit, DiaNA can handle data with large variations of domain gap. In addition, we show that DiaNA can generalize to different domain adaptation settings, such as unsupervised domain adaptation (UDA), semi-supervised domain adaptation (SSDA), source-free domain adaptation (SFDA), etc.
△ Less
Submitted 21 July, 2023;
originally announced July 2023.
-
Exploration and Exploitation of Unlabeled Data for Open-Set Semi-Supervised Learning
Authors:
Ganlong Zhao,
Guanbin Li,
Yipeng Qin,
Jinjin Zhang,
Zhenhua Chai,
Xiaolin Wei,
Liang Lin,
Yizhou Yu
Abstract:
In this paper, we address a complex but practical scenario in semi-supervised learning (SSL) named open-set SSL, where unlabeled data contain both in-distribution (ID) and out-of-distribution (OOD) samples. Unlike previous methods that only consider ID samples to be useful and aim to filter out OOD ones completely during training, we argue that the exploration and exploitation of both ID and OOD s…
▽ More
In this paper, we address a complex but practical scenario in semi-supervised learning (SSL) named open-set SSL, where unlabeled data contain both in-distribution (ID) and out-of-distribution (OOD) samples. Unlike previous methods that only consider ID samples to be useful and aim to filter out OOD ones completely during training, we argue that the exploration and exploitation of both ID and OOD samples can benefit SSL. To support our claim, i) we propose a prototype-based clustering and identification algorithm that explores the inherent similarity and difference among samples at feature level and effectively cluster them around several predefined ID and OOD prototypes, thereby enhancing feature learning and facilitating ID/OOD identification; ii) we propose an importance-based sampling method that exploits the difference in importance of each ID and OOD sample to SSL, thereby reducing the sampling bias and improving the training. Our proposed method achieves state-of-the-art in several challenging benchmarks, and improves upon existing SSL methods even when ID samples are totally absent in unlabeled data.
△ Less
Submitted 30 June, 2023;
originally announced June 2023.
-
Deep Omni-supervised Learning for Rib Fracture Detection from Chest Radiology Images
Authors:
Zhizhong Chai,
Luyang Luo,
Huangjing Lin,
Pheng-Ann Heng,
Hao Chen
Abstract:
Deep learning (DL)-based rib fracture detection has shown promise of playing an important role in preventing mortality and improving patient outcome. Normally, developing DL-based object detection models requires a huge amount of bounding box annotation. However, annotating medical data is time-consuming and expertise-demanding, making obtaining a large amount of fine-grained annotations extremely…
▽ More
Deep learning (DL)-based rib fracture detection has shown promise of playing an important role in preventing mortality and improving patient outcome. Normally, developing DL-based object detection models requires a huge amount of bounding box annotation. However, annotating medical data is time-consuming and expertise-demanding, making obtaining a large amount of fine-grained annotations extremely infeasible. This poses a pressing need {for} developing label-efficient detection models to alleviate radiologists' labeling burden. To tackle this challenge, the literature on object detection has witnessed an increase of weakly-supervised and semi-supervised approaches, yet still lacks a unified framework that leverages various forms of fully-labeled, weakly-labeled, and unlabeled data. In this paper, we present a novel omni-supervised object detection network, ORF-Netv2, to leverage as much available supervision as possible. Specifically, a multi-branch omni-supervised detection head is introduced with each branch trained with a specific type of supervision. A co-training-based dynamic label assignment strategy is then proposed to enable flexible and robust learning from the weakly-labeled and unlabeled data. Extensive evaluation was conducted for the proposed framework with three rib fracture datasets on both chest CT and X-ray. By leveraging all forms of supervision, ORF-Netv2 achieves mAPs of 34.7, 44.7, and 19.4 on the three datasets, respectively, surpassing the baseline detector which uses only box annotations by mAP gains of 3.8, 4.8, and 5.0, respectively. Furthermore, ORF-Netv2 consistently outperforms other competitive label-efficient methods over various scenarios, showing a promising framework for label-efficient fracture detection. The code is available at: https://github.com/zhizhongchai/ORF-Net.
△ Less
Submitted 19 January, 2024; v1 submitted 23 June, 2023;
originally announced June 2023.
-
HybridNet: Dual-Branch Fusion of Geometrical and Topological Views for VLSI Congestion Prediction
Authors:
Yuxiang Zhao,
Zhuomin Chai,
Yibo Lin,
Runsheng Wang,
Ru Huang
Abstract:
Accurate early congestion prediction can prevent unpleasant surprises at the routing stage, playing a crucial character in assisting designers to iterate faster in VLSI design cycles. In this paper, we introduce a novel strategy to fully incorporate topological and geometrical features of circuits by making several key designs in our network architecture. To be more specific, we construct two indi…
▽ More
Accurate early congestion prediction can prevent unpleasant surprises at the routing stage, playing a crucial character in assisting designers to iterate faster in VLSI design cycles. In this paper, we introduce a novel strategy to fully incorporate topological and geometrical features of circuits by making several key designs in our network architecture. To be more specific, we construct two individual graphs (geometry-graph, topology-graph) with distinct edge construction schemes according to their unique properties. We then propose a dual-branch network with different encoder layers in each pathway and aggregate representations with a sophisticated fusion strategy. Our network, named HybridNet, not only provides a simple yet effective way to capture the geometric interactions of cells, but also preserves the original topological relationships in the netlist. Experimental results on the ISPD2015 benchmarks show that we achieve an improvement of 10.9% compared to previous methods.
△ Less
Submitted 12 June, 2023; v1 submitted 7 May, 2023;
originally announced May 2023.
-
Towards Effective Collaborative Learning in Long-Tailed Recognition
Authors:
Zhengzhuo Xu,
Zenghao Chai,
Chengyin Xu,
Chun Yuan,
Haiqin Yang
Abstract:
Real-world data usually suffers from severe class imbalance and long-tailed distributions, where minority classes are significantly underrepresented compared to the majority ones. Recent research prefers to utilize multi-expert architectures to mitigate the model uncertainty on the minority, where collaborative learning is employed to aggregate the knowledge of experts, i.e., online distillation.…
▽ More
Real-world data usually suffers from severe class imbalance and long-tailed distributions, where minority classes are significantly underrepresented compared to the majority ones. Recent research prefers to utilize multi-expert architectures to mitigate the model uncertainty on the minority, where collaborative learning is employed to aggregate the knowledge of experts, i.e., online distillation. In this paper, we observe that the knowledge transfer between experts is imbalanced in terms of class distribution, which results in limited performance improvement of the minority classes. To address it, we propose a re-weighted distillation loss by comparing two classifiers' predictions, which are supervised by online distillation and label annotations, respectively. We also emphasize that feature-level distillation will significantly improve model performance and increase feature robustness. Finally, we propose an Effective Collaborative Learning (ECL) framework that integrates a contrastive proxy task branch to further improve feature quality. Quantitative and qualitative experiments on four standard datasets demonstrate that ECL achieves state-of-the-art performance and the detailed ablation studies manifest the effectiveness of each component in ECL.
△ Less
Submitted 5 May, 2023;
originally announced May 2023.
-
RIDCP: Revitalizing Real Image Dehazing via High-Quality Codebook Priors
Authors:
Rui-Qi Wu,
Zheng-Peng Duan,
Chun-Le Guo,
Zhi Chai,
Chong-Yi Li
Abstract:
Existing dehazing approaches struggle to process real-world hazy images owing to the lack of paired real data and robust priors. In this work, we present a new paradigm for real image dehazing from the perspectives of synthesizing more realistic hazy data and introducing more robust priors into the network. Specifically, (1) instead of adopting the de facto physical scattering model, we rethink th…
▽ More
Existing dehazing approaches struggle to process real-world hazy images owing to the lack of paired real data and robust priors. In this work, we present a new paradigm for real image dehazing from the perspectives of synthesizing more realistic hazy data and introducing more robust priors into the network. Specifically, (1) instead of adopting the de facto physical scattering model, we rethink the degradation of real hazy images and propose a phenomenological pipeline considering diverse degradation types. (2) We propose a Real Image Dehazing network via high-quality Codebook Priors (RIDCP). Firstly, a VQGAN is pre-trained on a large-scale high-quality dataset to obtain the discrete codebook, encapsulating high-quality priors (HQPs). After replacing the negative effects brought by haze with HQPs, the decoder equipped with a novel normalized feature alignment module can effectively utilize high-quality features and produce clean results. However, although our degradation pipeline drastically mitigates the domain gap between synthetic and real data, it is still intractable to avoid it, which challenges HQPs matching in the wild. Thus, we re-calculate the distance when matching the features to the HQPs by a controllable matching operation, which facilitates finding better counterparts. We provide a recommendation to control the matching based on an explainable solution. Users can also flexibly adjust the enhancement degree as per their preference. Extensive experiments verify the effectiveness of our data synthesis pipeline and the superior performance of RIDCP in real image dehazing.
△ Less
Submitted 8 April, 2023;
originally announced April 2023.
-
Towards Accurate Post-Training Quantization for Vision Transformer
Authors:
Yifu Ding,
Haotong Qin,
Qinghua Yan,
Zhenhua Chai,
Junjie Liu,
Xiaolin Wei,
Xianglong Liu
Abstract:
Vision transformer emerges as a potential architecture for vision tasks. However, the intense computation and non-negligible delay hinder its application in the real world. As a widespread model compression technique, existing post-training quantization methods still cause severe performance drops. We find the main reasons lie in (1) the existing calibration metric is inaccurate in measuring the q…
▽ More
Vision transformer emerges as a potential architecture for vision tasks. However, the intense computation and non-negligible delay hinder its application in the real world. As a widespread model compression technique, existing post-training quantization methods still cause severe performance drops. We find the main reasons lie in (1) the existing calibration metric is inaccurate in measuring the quantization influence for extremely low-bit representation, and (2) the existing quantization paradigm is unfriendly to the power-law distribution of Softmax. Based on these observations, we propose a novel Accurate Post-training Quantization framework for Vision Transformer, namely APQ-ViT. We first present a unified Bottom-elimination Blockwise Calibration scheme to optimize the calibration metric to perceive the overall quantization disturbance in a blockwise manner and prioritize the crucial quantization errors that influence more on the final output. Then, we design a Matthew-effect Preserving Quantization for Softmax to maintain the power-law character and keep the function of the attention mechanism. Comprehensive experiments on large-scale classification and detection datasets demonstrate that our APQ-ViT surpasses the existing post-training quantization methods by convincing margins, especially in lower bit-width settings (e.g., averagely up to 5.17% improvement for classification and 24.43% for detection on W4A4). We also highlight that APQ-ViT enjoys versatility and works well on diverse transformer variants.
△ Less
Submitted 24 March, 2023;
originally announced March 2023.
-
HiFace: High-Fidelity 3D Face Reconstruction by Learning Static and Dynamic Details
Authors:
Zenghao Chai,
Tianke Zhang,
Tianyu He,
Xu Tan,
Tadas Baltrušaitis,
HsiangTao Wu,
Runnan Li,
Sheng Zhao,
Chun Yuan,
Jiang Bian
Abstract:
3D Morphable Models (3DMMs) demonstrate great potential for reconstructing faithful and animatable 3D facial surfaces from a single image. The facial surface is influenced by the coarse shape, as well as the static detail (e,g., person-specific appearance) and dynamic detail (e.g., expression-driven wrinkles). Previous work struggles to decouple the static and dynamic details through image-level s…
▽ More
3D Morphable Models (3DMMs) demonstrate great potential for reconstructing faithful and animatable 3D facial surfaces from a single image. The facial surface is influenced by the coarse shape, as well as the static detail (e,g., person-specific appearance) and dynamic detail (e.g., expression-driven wrinkles). Previous work struggles to decouple the static and dynamic details through image-level supervision, leading to reconstructions that are not realistic. In this paper, we aim at high-fidelity 3D face reconstruction and propose HiFace to explicitly model the static and dynamic details. Specifically, the static detail is modeled as the linear combination of a displacement basis, while the dynamic detail is modeled as the linear interpolation of two displacement maps with polarized expressions. We exploit several loss functions to jointly learn the coarse shape and fine details with both synthetic and real-world datasets, which enable HiFace to reconstruct high-fidelity 3D shapes with animatable details. Extensive quantitative and qualitative experiments demonstrate that HiFace presents state-of-the-art reconstruction quality and faithfully recovers both the static and dynamic details. Our project page can be found at https://project-hiface.github.io.
△ Less
Submitted 23 August, 2023; v1 submitted 20 March, 2023;
originally announced March 2023.
-
SEAM: Searching Transferable Mixed-Precision Quantization Policy through Large Margin Regularization
Authors:
Chen Tang,
Kai Ouyang,
Zenghao Chai,
Yunpeng Bai,
Yuan Meng,
Zhi Wang,
Wenwu Zhu
Abstract:
Mixed-precision quantization (MPQ) suffers from the time-consuming process of searching the optimal bit-width allocation i.e., the policy) for each layer, especially when using large-scale datasets such as ISLVRC-2012. This limits the practicality of MPQ in real-world deployment scenarios. To address this issue, this paper proposes a novel method for efficiently searching for effective MPQ policie…
▽ More
Mixed-precision quantization (MPQ) suffers from the time-consuming process of searching the optimal bit-width allocation i.e., the policy) for each layer, especially when using large-scale datasets such as ISLVRC-2012. This limits the practicality of MPQ in real-world deployment scenarios. To address this issue, this paper proposes a novel method for efficiently searching for effective MPQ policies using a small proxy dataset instead of the large-scale dataset used for training the model. Deviating from the established norm of employing a consistent dataset for both model training and MPQ policy search stages, our approach, therefore, yields a substantial enhancement in the efficiency of MPQ exploration. Nonetheless, using discrepant datasets poses challenges in searching for a transferable MPQ policy. Driven by the observation that quantization noise of sub-optimal policy exerts a detrimental influence on the discriminability of feature representations -- manifesting as diminished class margins and ambiguous decision boundaries -- our method aims to identify policies that uphold the discriminative nature of feature representations, i.e., intra-class compactness and inter-class separation. This general and dataset-independent property makes us search for the MPQ policy over a rather small-scale proxy dataset and then the policy can be directly used to quantize the model trained on a large-scale dataset. Our method offers several advantages, including high proxy data utilization, no excessive hyper-parameter tuning, and high searching efficiency. We search high-quality MPQ policies with the proxy dataset that has only 4% of the data scale compared to the large-scale target dataset, achieving the same accuracy as searching directly on the latter, improving MPQ searching efficiency by up to 300 times.
△ Less
Submitted 22 August, 2023; v1 submitted 14 February, 2023;
originally announced February 2023.
-
ERA-Solver: Error-Robust Adams Solver for Fast Sampling of Diffusion Probabilistic Models
Authors:
Shengmeng Li,
Luping Liu,
Zenghao Chai,
Runnan Li,
Xu Tan
Abstract:
Though denoising diffusion probabilistic models (DDPMs) have achieved remarkable generation results, the low sampling efficiency of DDPMs still limits further applications. Since DDPMs can be formulated as diffusion ordinary differential equations (ODEs), various fast sampling methods can be derived from solving diffusion ODEs. However, we notice that previous sampling methods with fixed analytica…
▽ More
Though denoising diffusion probabilistic models (DDPMs) have achieved remarkable generation results, the low sampling efficiency of DDPMs still limits further applications. Since DDPMs can be formulated as diffusion ordinary differential equations (ODEs), various fast sampling methods can be derived from solving diffusion ODEs. However, we notice that previous sampling methods with fixed analytical form are not robust with the error in the noise estimated from pretrained diffusion models. In this work, we construct an error-robust Adams solver (ERA-Solver), which utilizes the implicit Adams numerical method that consists of a predictor and a corrector. Different from the traditional predictor based on explicit Adams methods, we leverage a Lagrange interpolation function as the predictor, which is further enhanced with an error-robust strategy to adaptively select the Lagrange bases with lower error in the estimated noise. Experiments on Cifar10, LSUN-Church, and LSUN-Bedroom datasets demonstrate that our proposed ERA-Solver achieves 5.14, 9.42, and 9.69 Fenchel Inception Distance (FID) for image generation, with only 10 network evaluations.
△ Less
Submitted 6 February, 2023; v1 submitted 30 January, 2023;
originally announced January 2023.
-
Learning Imbalanced Data with Vision Transformers
Authors:
Zhengzhuo Xu,
Ruikang Liu,
Shuo Yang,
Zenghao Chai,
Chun Yuan
Abstract:
The real-world data tends to be heavily imbalanced and severely skew the data-driven deep neural networks, which makes Long-Tailed Recognition (LTR) a massive challenging task. Existing LTR methods seldom train Vision Transformers (ViTs) with Long-Tailed (LT) data, while the off-the-shelf pretrain weight of ViTs always leads to unfair comparisons. In this paper, we systematically investigate the V…
▽ More
The real-world data tends to be heavily imbalanced and severely skew the data-driven deep neural networks, which makes Long-Tailed Recognition (LTR) a massive challenging task. Existing LTR methods seldom train Vision Transformers (ViTs) with Long-Tailed (LT) data, while the off-the-shelf pretrain weight of ViTs always leads to unfair comparisons. In this paper, we systematically investigate the ViTs' performance in LTR and propose LiVT to train ViTs from scratch only with LT data. With the observation that ViTs suffer more severe LTR problems, we conduct Masked Generative Pretraining (MGP) to learn generalized features. With ample and solid evidence, we show that MGP is more robust than supervised manners. In addition, Binary Cross Entropy (BCE) loss, which shows conspicuous performance with ViTs, encounters predicaments in LTR. We further propose the balanced BCE to ameliorate it with strong theoretical groundings. Specially, we derive the unbiased extension of Sigmoid and compensate extra logit margins to deploy it. Our Bal-BCE contributes to the quick convergence of ViTs in just a few epochs. Extensive experiments demonstrate that with MGP and Bal-BCE, LiVT successfully trains ViTs well without any additional data and outperforms comparable state-of-the-art methods significantly, e.g., our ViT-B achieves 81.0% Top-1 accuracy in iNaturalist 2018 without bells and whistles. Code is available at https://github.com/XuZhengzhuo/LiVT.
△ Less
Submitted 8 March, 2023; v1 submitted 4 December, 2022;
originally announced December 2022.
-
DRKF: Distilled Rotated Kernel Fusion for Efficient Rotation Invariant Descriptors in Local Feature Matching
Authors:
Ranran Huang,
Jiancheng Cai,
Chao Li,
Zhuoyuan Wu,
Xinmin Liu,
Zhenhua Chai
Abstract:
The performance of local feature descriptors degrades in the presence of large rotation variations. To address this issue, we present an efficient approach to learning rotation invariant descriptors. Specifically, we propose Rotated Kernel Fusion (RKF) which imposes rotations on the convolution kernel to improve the inherent nature of CNN. Since RKF can be processed by the subsequent re-parameteri…
▽ More
The performance of local feature descriptors degrades in the presence of large rotation variations. To address this issue, we present an efficient approach to learning rotation invariant descriptors. Specifically, we propose Rotated Kernel Fusion (RKF) which imposes rotations on the convolution kernel to improve the inherent nature of CNN. Since RKF can be processed by the subsequent re-parameterization, no extra computational costs will be introduced in the inference stage. Moreover, we present Multi-oriented Feature Aggregation (MOFA) which aggregates features extracted from multiple rotated versions of the input image and can provide auxiliary knowledge for the training of RKF by leveraging the distillation strategy. We refer to the distilled RKF model as DRKF. Besides the evaluation on a rotation-augmented version of the public dataset HPatches, we also contribute a new dataset named DiverseBEV which is collected during the drone's flight and consists of bird's eye view images with large viewpoint changes and camera rotations. Extensive experiments show that our method can outperform other state-of-the-art techniques when exposed to large rotation variations.
△ Less
Submitted 5 January, 2024; v1 submitted 22 September, 2022;
originally announced September 2022.
-
HyP$^2$ Loss: Beyond Hypersphere Metric Space for Multi-label Image Retrieval
Authors:
Chengyin Xu,
Zenghao Chai,
Zhengzhuo Xu,
Chun Yuan,
Yanbo Fan,
Jue Wang
Abstract:
Image retrieval has become an increasingly appealing technique with broad multimedia application prospects, where deep hashing serves as the dominant branch towards low storage and efficient retrieval. In this paper, we carried out in-depth investigations on metric learning in deep hashing for establishing a powerful metric space in multi-label scenarios, where the pair loss suffers high computati…
▽ More
Image retrieval has become an increasingly appealing technique with broad multimedia application prospects, where deep hashing serves as the dominant branch towards low storage and efficient retrieval. In this paper, we carried out in-depth investigations on metric learning in deep hashing for establishing a powerful metric space in multi-label scenarios, where the pair loss suffers high computational overhead and converge difficulty, while the proxy loss is theoretically incapable of expressing the profound label dependencies and exhibits conflicts in the constructed hypersphere space. To address the problems, we propose a novel metric learning framework with Hybrid Proxy-Pair Loss (HyP$^2$ Loss) that constructs an expressive metric space with efficient training complexity w.r.t. the whole dataset. The proposed HyP$^2$ Loss focuses on optimizing the hypersphere space by learnable proxies and excavating data-to-data correlations of irrelevant pairs, which integrates sufficient data correspondence of pair-based methods and high-efficiency of proxy-based methods. Extensive experiments on four standard multi-label benchmarks justify the proposed method outperforms the state-of-the-art, is robust among different hash bits and achieves significant performance gains with a faster, more stable convergence speed. Our code is available at https://github.com/JerryXu0129/HyP2-Loss.
△ Less
Submitted 14 August, 2022;
originally announced August 2022.
-
Underwater Ranker: Learn Which Is Better and How to Be Better
Authors:
Chunle Guo,
Ruiqi Wu,
Xin Jin,
Linghao Han,
Zhi Chai,
Weidong Zhang,
Chongyi Li
Abstract:
In this paper, we present a ranking-based underwater image quality assessment (UIQA) method, abbreviated as URanker. The URanker is built on the efficient conv-attentional image Transformer. In terms of underwater images, we specially devise (1) the histogram prior that embeds the color distribution of an underwater image as histogram token to attend global degradation and (2) the dynamic cross-sc…
▽ More
In this paper, we present a ranking-based underwater image quality assessment (UIQA) method, abbreviated as URanker. The URanker is built on the efficient conv-attentional image Transformer. In terms of underwater images, we specially devise (1) the histogram prior that embeds the color distribution of an underwater image as histogram token to attend global degradation and (2) the dynamic cross-scale correspondence to model local degradation. The final prediction depends on the class tokens from different scales, which comprehensively considers multi-scale dependencies. With the margin ranking loss, our URanker can accurately rank the order of underwater images of the same scene enhanced by different underwater image enhancement (UIE) algorithms according to their visual quality. To achieve that, we also contribute a dataset, URankerSet, containing sufficient results enhanced by different UIE algorithms and the corresponding perceptual rankings, to train our URanker. Apart from the good performance of URanker, we found that a simple U-shape UIE network can obtain promising performance when it is coupled with our pre-trained URanker as additional supervision. In addition, we also propose a normalization tail that can significantly improve the performance of UIE networks. Extensive experiments demonstrate the state-of-the-art performance of our method. The key designs of our method are discussed. We will release our dataset and code.
△ Less
Submitted 26 November, 2022; v1 submitted 14 August, 2022;
originally announced August 2022.
-
CircuitNet: An Open-Source Dataset for Machine Learning Applications in Electronic Design Automation (EDA)
Authors:
Zhuomin Chai,
Yuxiang Zhao,
Yibo Lin,
Wei Liu,
Runsheng Wang,
Ru Huang
Abstract:
The electronic design automation (EDA) community has been actively exploring machine learning (ML) for very large-scale integrated computer-aided design (VLSI CAD). Many studies explored learning-based techniques for cross-stage prediction tasks in the design flow to achieve faster design convergence. Although building ML models usually requires a large amount of data, most studies can only genera…
▽ More
The electronic design automation (EDA) community has been actively exploring machine learning (ML) for very large-scale integrated computer-aided design (VLSI CAD). Many studies explored learning-based techniques for cross-stage prediction tasks in the design flow to achieve faster design convergence. Although building ML models usually requires a large amount of data, most studies can only generate small internal datasets for validation because of the lack of large public datasets. In this essay, we present the first open-source dataset called CircuitNet for ML tasks in VLSI CAD.
△ Less
Submitted 31 August, 2022; v1 submitted 31 July, 2022;
originally announced August 2022.
-
ORF-Net: Deep Omni-supervised Rib Fracture Detection from Chest CT Scans
Authors:
Zhizhong Chai,
Huangjing Lin,
Luyang Luo,
Pheng-Ann Heng,
Hao Chen
Abstract:
Most of the existing object detection works are based on the bounding box annotation: each object has a precise annotated box. However, for rib fractures, the bounding box annotation is very labor-intensive and time-consuming because radiologists need to investigate and annotate the rib fractures on a slice-by-slice basis. Although a few studies have proposed weakly-supervised methods or semi-supe…
▽ More
Most of the existing object detection works are based on the bounding box annotation: each object has a precise annotated box. However, for rib fractures, the bounding box annotation is very labor-intensive and time-consuming because radiologists need to investigate and annotate the rib fractures on a slice-by-slice basis. Although a few studies have proposed weakly-supervised methods or semi-supervised methods, they could not handle different forms of supervision simultaneously. In this paper, we proposed a novel omni-supervised object detection network, which can exploit multiple different forms of annotated data to further improve the detection performance. Specifically, the proposed network contains an omni-supervised detection head, in which each form of annotation data corresponds to a unique classification branch. Furthermore, we proposed a dynamic label assignment strategy for different annotated forms of data to facilitate better learning for each branch. Moreover, we also design a confidence-aware classification loss to emphasize the samples with high confidence and further improve the model's performance. Extensive experiments conducted on the testing dataset show our proposed method outperforms other state-of-the-art approaches consistently, demonstrating the efficacy of deep omni-supervised learning on improving rib fracture detection performance.
△ Less
Submitted 5 July, 2022;
originally announced July 2022.
-
Distributed Graph Neural Network Training with Periodic Stale Representation Synchronization
Authors:
Zheng Chai,
Guangji Bai,
Liang Zhao,
Yue Cheng
Abstract:
Despite the recent success of Graph Neural Networks, it remains challenging to train a GNN on large graphs with millions of nodes and billions of edges, which are prevalent in many graph-based applications. Traditional sampling-based methods accelerate GNN training by dropping edges and nodes, which impairs the graph integrity and model performance. Differently, distributed GNN algorithms accelera…
▽ More
Despite the recent success of Graph Neural Networks, it remains challenging to train a GNN on large graphs with millions of nodes and billions of edges, which are prevalent in many graph-based applications. Traditional sampling-based methods accelerate GNN training by dropping edges and nodes, which impairs the graph integrity and model performance. Differently, distributed GNN algorithms accelerate GNN training by utilizing multiple computing devices and can be classified into two types: "partition-based" methods enjoy low communication costs but suffer from information loss due to dropped edges, while "propagation-based" methods avoid information loss but suffer from prohibitive communication overhead caused by the neighbor explosion. To jointly address these problems, this paper proposes DIGEST (DIstributed Graph reprEsentation SynchronizaTion), a novel distributed GNN training framework that synergizes the complementary strength of both categories of existing methods. We propose to allow each device to utilize the stale representations of its neighbors in other subgraphs during subgraph parallel training. This way, our method preserves global graph information from neighbors to avoid information loss and reduce communication costs. Our convergence analysis demonstrates that DIGEST enjoys a state-of-the-art convergence rate. Extensive experimental evaluation on large, real-world graph datasets shows that DIGEST achieves up to 21.82 speedups without compromising performance compared to state-of-the-art distributed GNN training frameworks.
△ Less
Submitted 2 October, 2022; v1 submitted 31 May, 2022;
originally announced June 2022.
-
Image Harmonization by Matching Regional References
Authors:
Ziyue Zhu,
Zhao Zhang,
Zheng Lin,
Ruiqi Wu,
Zhi Chai,
Chun-Le Guo
Abstract:
To achieve visual consistency in composite images, recent image harmonization methods typically summarize the appearance pattern of global background and apply it to the global foreground without location discrepancy. However, for a real image, the appearances (illumination, color temperature, saturation, hue, texture, etc) of different regions can vary significantly. So previous methods, which tr…
▽ More
To achieve visual consistency in composite images, recent image harmonization methods typically summarize the appearance pattern of global background and apply it to the global foreground without location discrepancy. However, for a real image, the appearances (illumination, color temperature, saturation, hue, texture, etc) of different regions can vary significantly. So previous methods, which transfer the appearance globally, are not optimal. Trying to solve this issue, we firstly match the contents between the foreground and background and then adaptively adjust every foreground location according to the appearance of its content-related background regions. Further, we design a residual reconstruction strategy, that uses the predicted residual to adjust the appearance, and the composite foreground to reserve the image details. Extensive experiments demonstrate the effectiveness of our method. The source code will be available publicly.
△ Less
Submitted 10 April, 2022;
originally announced April 2022.
-
REALY: Rethinking the Evaluation of 3D Face Reconstruction
Authors:
Zenghao Chai,
Haoxian Zhang,
Jing Ren,
Di Kang,
Zhengzhuo Xu,
Xuefei Zhe,
Chun Yuan,
Linchao Bao
Abstract:
The evaluation of 3D face reconstruction results typically relies on a rigid shape alignment between the estimated 3D model and the ground-truth scan. We observe that aligning two shapes with different reference points can largely affect the evaluation results. This poses difficulties for precisely diagnosing and improving a 3D face reconstruction method. In this paper, we propose a novel evaluati…
▽ More
The evaluation of 3D face reconstruction results typically relies on a rigid shape alignment between the estimated 3D model and the ground-truth scan. We observe that aligning two shapes with different reference points can largely affect the evaluation results. This poses difficulties for precisely diagnosing and improving a 3D face reconstruction method. In this paper, we propose a novel evaluation approach with a new benchmark REALY, consists of 100 globally aligned face scans with accurate facial keypoints, high-quality region masks, and topology-consistent meshes. Our approach performs region-wise shape alignment and leads to more accurate, bidirectional correspondences during computing the shape errors. The fine-grained, region-wise evaluation results provide us detailed understandings about the performance of state-of-the-art 3D face reconstruction methods. For example, our experiments on single-image based reconstruction methods reveal that DECA performs the best on nose regions, while GANFit performs better on cheek regions. Besides, a new and high-quality 3DMM basis, HIFI3D++, is further derived using the same procedure as we construct REALY to align and retopologize several 3D face datasets. We will release REALY, HIFI3D++, and our new evaluation pipeline at https://realy3dface.com.
△ Less
Submitted 19 July, 2022; v1 submitted 18 March, 2022;
originally announced March 2022.
-
Conquering Data Variations in Resolution: A Slice-Aware Multi-Branch Decoder Network
Authors:
Shuxin Wang,
Shilei Cao,
Zhizhong Chai,
Dong Wei,
Kai Ma,
Liansheng Wang,
Yefeng Zheng
Abstract:
Fully convolutional neural networks have made promising progress in joint liver and liver tumor segmentation. Instead of following the debates over 2D versus 3D networks (for example, pursuing the balance between large-scale 2D pretraining and 3D context), in this paper, we novelly identify the wide variation in the ratio between intra- and inter-slice resolutions as a crucial obstacle to the perf…
▽ More
Fully convolutional neural networks have made promising progress in joint liver and liver tumor segmentation. Instead of following the debates over 2D versus 3D networks (for example, pursuing the balance between large-scale 2D pretraining and 3D context), in this paper, we novelly identify the wide variation in the ratio between intra- and inter-slice resolutions as a crucial obstacle to the performance. To tackle the mismatch between the intra- and inter-slice information, we propose a slice-aware 2.5D network that emphasizes extracting discriminative features utilizing not only in-plane semantics but also out-of-plane coherence for each separate slice. Specifically, we present a slice-wise multi-input multi-output architecture to instantiate such a design paradigm, which contains a Multi-Branch Decoder (MD) with a Slice-centric Attention Block (SAB) for learning slice-specific features and a Densely Connected Dice (DCD) loss to regularize the inter-slice predictions to be coherent and continuous. Based on the aforementioned innovations, we achieve state-of-the-art results on the MICCAI 2017 Liver Tumor Segmentation (LiTS) dataset. Besides, we also test our model on the ISBI 2019 Segmentation of THoracic Organs at Risk (SegTHOR) dataset, and the result proves the robustness and generalizability of the proposed method in other segmentation tasks.
△ Less
Submitted 7 March, 2022;
originally announced March 2022.
-
Compressing Models with Few Samples: Mimicking then Replacing
Authors:
Huanyu Wang,
Junjie Liu,
Xin Ma,
Yang Yong,
Zhenhua Chai,
Jianxin Wu
Abstract:
Few-sample compression aims to compress a big redundant model into a small compact one with only few samples. If we fine-tune models with these limited few samples directly, models will be vulnerable to overfit and learn almost nothing. Hence, previous methods optimize the compressed model layer-by-layer and try to make every layer have the same outputs as the corresponding layer in the teacher mo…
▽ More
Few-sample compression aims to compress a big redundant model into a small compact one with only few samples. If we fine-tune models with these limited few samples directly, models will be vulnerable to overfit and learn almost nothing. Hence, previous methods optimize the compressed model layer-by-layer and try to make every layer have the same outputs as the corresponding layer in the teacher model, which is cumbersome. In this paper, we propose a new framework named Mimicking then Replacing (MiR) for few-sample compression, which firstly urges the pruned model to output the same features as the teacher's in the penultimate layer, and then replaces teacher's layers before penultimate with a well-tuned compact one. Unlike previous layer-wise reconstruction methods, our MiR optimizes the entire network holistically, which is not only simple and effective, but also unsupervised and general. MiR outperforms previous methods with large margins. Codes will be available soon.
△ Less
Submitted 7 January, 2022;
originally announced January 2022.
-
Contrastive Attention Network with Dense Field Estimation for Face Completion
Authors:
Xin Ma,
Xiaoqiang Zhou,
Huaibo Huang,
Gengyun Jia,
Zhenhua Chai,
Xiaolin Wei
Abstract:
Most modern face completion approaches adopt an autoencoder or its variants to restore missing regions in face images. Encoders are often utilized to learn powerful representations that play an important role in meeting the challenges of sophisticated learning tasks. Specifically, various kinds of masks are often presented in face images in the wild, forming complex patterns, especially in this ha…
▽ More
Most modern face completion approaches adopt an autoencoder or its variants to restore missing regions in face images. Encoders are often utilized to learn powerful representations that play an important role in meeting the challenges of sophisticated learning tasks. Specifically, various kinds of masks are often presented in face images in the wild, forming complex patterns, especially in this hard period of COVID-19. It's difficult for encoders to capture such powerful representations under this complex situation. To address this challenge, we propose a self-supervised Siamese inference network to improve the generalization and robustness of encoders. It can encode contextual semantics from full-resolution images and obtain more discriminative representations. To deal with geometric variations of face images, a dense correspondence field is integrated into the network. We further propose a multi-scale decoder with a novel dual attention fusion module (DAF), which can combine the restored and known regions in an adaptive manner. This multi-scale architecture is beneficial for the decoder to utilize discriminative representations learned from encoders into images. Extensive experiments clearly demonstrate that the proposed approach not only achieves more appealing results compared with state-of-the-art methods but also improves the performance of masked face recognition dramatically.
△ Less
Submitted 19 December, 2021;
originally announced December 2021.
-
HHF: Hashing-guided Hinge Function for Deep Hashing Retrieval
Authors:
Chengyin Xu,
Zenghao Chai,
Zhengzhuo Xu,
Hongjia Li,
Qiruyi Zuo,
Lingyu Yang,
Chun Yuan
Abstract:
Deep hashing has shown promising performance in large-scale image retrieval. However, latent codes extracted by Deep Neural Networks (DNNs) will inevitably lose semantic information during the binarization process, which damages the retrieval accuracy and makes it challenging. Although many existing approaches perform regularization to alleviate quantization errors, we figure out an incompatible c…
▽ More
Deep hashing has shown promising performance in large-scale image retrieval. However, latent codes extracted by Deep Neural Networks (DNNs) will inevitably lose semantic information during the binarization process, which damages the retrieval accuracy and makes it challenging. Although many existing approaches perform regularization to alleviate quantization errors, we figure out an incompatible conflict between metric learning and quantization learning. The metric loss penalizes the inter-class distances to push different classes unconstrained far away. Worse still, it tends to map the latent code deviate from ideal binarization point and generate severe ambiguity in the binarization process. Based on the minimum distance of the binary linear code, we creatively propose Hashing-guided Hinge Function (HHF) to avoid such conflict. In detail, the carefully-designed inflection point, which relies on the hash bit length and category numbers, is explicitly adopted to balance the metric term and quantization term. Such a modification prevents the network from falling into local metric optimal minima in deep hashing. Extensive experiments in CIFAR-10, CIFAR-100, ImageNet, and MS-COCO show that HHF consistently outperforms existing techniques, and is robust and flexible to transplant into other methods. Code is available at https://github.com/JerryXu0129/HHF.
△ Less
Submitted 12 January, 2022; v1 submitted 3 December, 2021;
originally announced December 2021.
-
Semantic-Sparse Colorization Network for Deep Exemplar-based Colorization
Authors:
Yunpeng Bai,
Chao Dong,
Zenghao Chai,
Andong Wang,
Zhengzhuo Xu,
Chun Yuan
Abstract:
Exemplar-based colorization approaches rely on reference image to provide plausible colors for target gray-scale image. The key and difficulty of exemplar-based colorization is to establish an accurate correspondence between these two images. Previous approaches have attempted to construct such a correspondence but are faced with two obstacles. First, using luminance channels for the calculation o…
▽ More
Exemplar-based colorization approaches rely on reference image to provide plausible colors for target gray-scale image. The key and difficulty of exemplar-based colorization is to establish an accurate correspondence between these two images. Previous approaches have attempted to construct such a correspondence but are faced with two obstacles. First, using luminance channels for the calculation of correspondence is inaccurate. Second, the dense correspondence they built introduces wrong matching results and increases the computation burden. To address these two problems, we propose Semantic-Sparse Colorization Network (SSCN) to transfer both the global image style and detailed semantic-related colors to the gray-scale image in a coarse-to-fine manner. Our network can perfectly balance the global and local colors while alleviating the ambiguous matching problem. Experiments show that our method outperforms existing methods in both quantitative and qualitative evaluation and achieves state-of-the-art performance.
△ Less
Submitted 18 July, 2022; v1 submitted 2 December, 2021;
originally announced December 2021.
-
Towards Calibrated Model for Long-Tailed Visual Recognition from Prior Perspective
Authors:
Zhengzhuo Xu,
Zenghao Chai,
Chun Yuan
Abstract:
Real-world data universally confronts a severe class-imbalance problem and exhibits a long-tailed distribution, i.e., most labels are associated with limited instances. The naïve models supervised by such datasets would prefer dominant labels, encounter a serious generalization challenge and become poorly calibrated. We propose two novel methods from the prior perspective to alleviate this dilemma…
▽ More
Real-world data universally confronts a severe class-imbalance problem and exhibits a long-tailed distribution, i.e., most labels are associated with limited instances. The naïve models supervised by such datasets would prefer dominant labels, encounter a serious generalization challenge and become poorly calibrated. We propose two novel methods from the prior perspective to alleviate this dilemma. First, we deduce a balance-oriented data augmentation named Uniform Mixup (UniMix) to promote mixup in long-tailed scenarios, which adopts advanced mixing factor and sampler in favor of the minority. Second, motivated by the Bayesian theory, we figure out the Bayes Bias (Bayias), an inherent bias caused by the inconsistency of prior, and compensate it as a modification on standard cross-entropy loss. We further prove that both the proposed methods ensure the classification calibration theoretically and empirically. Extensive experiments verify that our strategies contribute to a better-calibrated model, and their combination achieves state-of-the-art performance on CIFAR-LT, ImageNet-LT, and iNaturalist 2018.
△ Less
Submitted 6 November, 2021;
originally announced November 2021.
-
MoDeRNN: Towards Fine-grained Motion Details for Spatiotemporal Predictive Learning
Authors:
Zenghao Chai,
Zhengzhuo Xu,
Chun Yuan
Abstract:
Spatiotemporal predictive learning (ST-PL) aims at predicting the subsequent frames via limited observed sequences, and it has broad applications in the real world. However, learning representative spatiotemporal features for prediction is challenging. Moreover, chaotic uncertainty among consecutive frames exacerbates the difficulty in long-term prediction. This paper concentrates on improving pre…
▽ More
Spatiotemporal predictive learning (ST-PL) aims at predicting the subsequent frames via limited observed sequences, and it has broad applications in the real world. However, learning representative spatiotemporal features for prediction is challenging. Moreover, chaotic uncertainty among consecutive frames exacerbates the difficulty in long-term prediction. This paper concentrates on improving prediction quality by enhancing the correspondence between the previous context and the current state. We carefully design Detail Context Block (DCB) to extract fine-grained details and improve the isolated correlation between upper context state and current input state. We integrate DCB with standard ConvLSTM and introduce Motion Details RNN (MoDeRNN) to capture fine-grained spatiotemporal features and improve the expression of latent states of RNNs to achieve significant quality. Experiments on Moving MNIST and Typhoon datasets demonstrate the effectiveness of the proposed method. MoDeRNN outperforms existing state-of-the-art techniques qualitatively and quantitatively with lower computation loads.
△ Less
Submitted 12 February, 2022; v1 submitted 25 October, 2021;
originally announced October 2021.
-
LOF: Structure-Aware Line Tracking based on Optical Flow
Authors:
Meixiang Quan,
Zheng Chai,
Xiao Liu
Abstract:
Lines provide the significantly richer geometric structural information about the environment than points, so lines are widely used in recent Visual Odometry (VO) works. Since VO with lines use line tracking results to locate and map, line tracking is a crucial component in VO. Although the state-of-the-art line tracking methods have made great progress, they are still heavily dependent on line de…
▽ More
Lines provide the significantly richer geometric structural information about the environment than points, so lines are widely used in recent Visual Odometry (VO) works. Since VO with lines use line tracking results to locate and map, line tracking is a crucial component in VO. Although the state-of-the-art line tracking methods have made great progress, they are still heavily dependent on line detection or the predicted line segments. In order to relieve the dependencies described above to track line segments completely, accurately, and robustly at higher computational efficiency, we propose a structure-aware Line tracking algorithm based entirely on Optical Flow (LOF). Firstly, we propose a gradient-based strategy to sample pixels on lines that are suitable for line optical flow calculation. Then, in order to align the lines by fully using the structural relationship between the sampled points on it and effectively removing the influence of sampled points on it occluded by other objects, we propose a two-step structure-aware line segment alignment method. Furthermore, we propose a line refinement method to refine the orientation, position, and endpoints of the aligned line segments. Extensive experimental results demonstrate that the proposed LOF outperforms the state-of-the-art performance in line tracking accuracy, robustness, and efficiency, which also improves the location accuracy and robustness of VO system with lines.
△ Less
Submitted 17 September, 2021;
originally announced September 2021.
-
A real-time global re-localization framework for 3D LiDAR SLAM
Authors:
Ziqi Chai,
Xiaoyu Shi,
Yan Zhou,
Zhenhua Xiong
Abstract:
Simultaneous localization and mapping (SLAM) has been a hot research field in the past years. Against the backdrop of more affordable 3D LiDAR sensors, research on 3D LiDAR SLAM is becoming increasingly popular. Furthermore, the re-localization problem with a point cloud map is the foundation for other SLAM applications. In this paper, a template matching framework is proposed to re-localize a rob…
▽ More
Simultaneous localization and mapping (SLAM) has been a hot research field in the past years. Against the backdrop of more affordable 3D LiDAR sensors, research on 3D LiDAR SLAM is becoming increasingly popular. Furthermore, the re-localization problem with a point cloud map is the foundation for other SLAM applications. In this paper, a template matching framework is proposed to re-localize a robot globally in a 3D LiDAR map. This presents two main challenges. First, most global descriptors for point cloud can only be used for place detection under a small local area. Therefore, in order to re-localize globally in the map, point clouds and descriptors(templates) are densely collected using a reconstructed mesh model at an offline stage by a physical simulation engine to expand the functional distance of point cloud descriptors. Second, the increased number of collected templates makes the matching stage too slow to meet the real-time requirement, for which a cascade matching method is presented for better efficiency. In the experiments, the proposed framework achieves 0.2-meter accuracy at about 10Hz matching speed using pure python implementation with 100k templates, which is effective and efficient for SLAM applications.
△ Less
Submitted 1 September, 2021;
originally announced September 2021.
-
Asynchronous Federated Learning for Sensor Data with Concept Drift
Authors:
Yujing Chen,
Zheng Chai,
Yue Cheng,
Huzefa Rangwala
Abstract:
Federated learning (FL) involves multiple distributed devices jointly training a shared model without any of the participants having to reveal their local data to a centralized server. Most of previous FL approaches assume that data on devices are fixed and stationary during the training process. However, this assumption is unrealistic because these devices usually have varying sampling rates and…
▽ More
Federated learning (FL) involves multiple distributed devices jointly training a shared model without any of the participants having to reveal their local data to a centralized server. Most of previous FL approaches assume that data on devices are fixed and stationary during the training process. However, this assumption is unrealistic because these devices usually have varying sampling rates and different system configurations. In addition, the underlying distribution of the device data can change dynamically over time, which is known as concept drift. Concept drift makes the learning process complicated because of the inconsistency between existing and upcoming data. Traditional concept drift handling techniques such as chunk based and ensemble learning-based methods are not suitable in the federated learning frameworks due to the heterogeneity of local devices. We propose a novel approach, FedConD, to detect and deal with the concept drift on local devices and minimize the effect on the performance of models in asynchronous FL. The drift detection strategy is based on an adaptive mechanism which uses the historical performance of the local models. The drift adaptation is realized by adjusting the regularization parameter of objective function on each local device. Additionally, we design a communication strategy on the server side to select local updates in a prudent fashion and speed up model convergence. Experimental evaluations on three evolving data streams and two image datasets show that \model~detects and handles concept drift, and also reduces the overall communication cost compared to other baseline methods.
△ Less
Submitted 31 August, 2021;
originally announced September 2021.
-
Trash to Treasure: Harvesting OOD Data with Cross-Modal Matching for Open-Set Semi-Supervised Learning
Authors:
Junkai Huang,
Chaowei Fang,
Weikai Chen,
Zhenhua Chai,
Xiaolin Wei,
Pengxu Wei,
Liang Lin,
Guanbin Li
Abstract:
Open-set semi-supervised learning (open-set SSL) investigates a challenging but practical scenario where out-of-distribution (OOD) samples are contained in the unlabeled data. While the mainstream technique seeks to completely filter out the OOD samples for semi-supervised learning (SSL), we propose a novel training mechanism that could effectively exploit the presence of OOD data for enhanced fea…
▽ More
Open-set semi-supervised learning (open-set SSL) investigates a challenging but practical scenario where out-of-distribution (OOD) samples are contained in the unlabeled data. While the mainstream technique seeks to completely filter out the OOD samples for semi-supervised learning (SSL), we propose a novel training mechanism that could effectively exploit the presence of OOD data for enhanced feature learning while avoiding its adverse impact on the SSL. We achieve this goal by first introducing a warm-up training that leverages all the unlabeled data, including both the in-distribution (ID) and OOD samples. Specifically, we perform a pretext task that enforces our feature extractor to obtain a high-level semantic understanding of the training images, leading to more discriminative features that can benefit the downstream tasks. Since the OOD samples are inevitably detrimental to SSL, we propose a novel cross-modal matching strategy to detect OOD samples. Instead of directly applying binary classification, we train the network to predict whether the data sample is matched to an assigned one-hot class label. The appeal of the proposed cross-modal matching over binary classification is the ability to generate a compatible feature space that aligns with the core classification task. Extensive experiments show that our approach substantially lifts the performance on open-set SSL and outperforms the state-of-the-art by a large margin.
△ Less
Submitted 12 August, 2021;
originally announced August 2021.
-
Method Towards CVPR 2021 SimLocMatch Challenge
Authors:
Xiaopeng Bi,
Ran Yan,
Zheng Chai,
Haotian Zhang,
Xiao Liu
Abstract:
This report describes Megvii-3D team's approach towards SimLocMatch Challenge @ CVPR 2021 Image Matching Workshop.
This report describes Megvii-3D team's approach towards SimLocMatch Challenge @ CVPR 2021 Image Matching Workshop.
△ Less
Submitted 10 August, 2021; v1 submitted 10 August, 2021;
originally announced August 2021.
-
Method Towards CVPR 2021 Image Matching Challenge
Authors:
Xiaopeng Bi,
Yu Chen,
Xinyang Liu,
Dehao Zhang,
Ran Yan,
Zheng Chai,
Haotian Zhang,
Xiao Liu
Abstract:
This report describes Megvii-3D team's approach towards CVPR 2021 Image Matching Workshop.
This report describes Megvii-3D team's approach towards CVPR 2021 Image Matching Workshop.
△ Less
Submitted 10 August, 2021; v1 submitted 10 August, 2021;
originally announced August 2021.
-
Towards Quantized Model Parallelism for Graph-Augmented MLPs Based on Gradient-Free ADMM Framework
Authors:
Junxiang Wang,
Hongyi Li,
Zheng Chai,
Yongchao Wang,
Yue Cheng,
Liang Zhao
Abstract:
While Graph Neural Networks (GNNs) are popular in the deep learning community, they suffer from several challenges including over-smoothing, over-squashing, and gradient vanishing. Recently, a series of models have attempted to relieve these issues by first augmenting the node features and then imposing node-wise functions based on Multi-Layer Perceptron (MLP), which are widely referred to as GA-M…
▽ More
While Graph Neural Networks (GNNs) are popular in the deep learning community, they suffer from several challenges including over-smoothing, over-squashing, and gradient vanishing. Recently, a series of models have attempted to relieve these issues by first augmenting the node features and then imposing node-wise functions based on Multi-Layer Perceptron (MLP), which are widely referred to as GA-MLP models. However, while GA-MLP models enjoy deeper architectures for better accuracy, their efficiency largely deteriorates. Moreover, popular acceleration techniques such as stochastic-version or data-parallelism cannot be effectively applied due to the dependency among samples (i.e., nodes) in graphs. To address these issues, in this paper, instead of data parallelism, we propose a parallel graph deep learning Alternating Direction Method of Multipliers (pdADMM-G) framework to achieve model parallelism: parameters in each layer of GA-MLP models can be updated in parallel. The extended pdADMM-G-Q algorithm reduces communication costs by introducing the quantization technique. Theoretical convergence to a (quantized) stationary point of the pdADMM-G algorithm and the pdADMM-G-Q algorithm is provided with a sublinear convergence rate $o(1/k)$, where $k$ is the number of iterations. Extensive experiments demonstrate the convergence of two proposed algorithms. Moreover, they lead to a more massive speedup and better performance than all state-of-the-art comparison methods on nine benchmark datasets. Last but not least, the proposed pdADMM-G-Q algorithm reduces communication overheads by up to $45\%$ without loss of performance. Our code is available at \url{https://github.com/xianggebenben/pdADMM-G}.
△ Less
Submitted 16 November, 2022; v1 submitted 20 May, 2021;
originally announced May 2021.