-
Multi-Agent Reasoning for Cardiovascular Imaging Phenotype Analysis
Authors:
Weitong Zhang,
Mengyun Qiao,
Chengqi Zang,
Steven Niederer,
Paul M Matthews,
Wenjia Bai,
Bernhard Kainz
Abstract:
Identifying the associations between imaging phenotypes and disease risk factors and outcomes is essential for understanding disease mechanisms and improving diagnosis and prognosis models. However, traditional approaches rely on human-driven hypothesis testing and selection of association factors, often overlooking complex, non-linear dependencies among imaging phenotypes and other multi-modal da…
▽ More
Identifying the associations between imaging phenotypes and disease risk factors and outcomes is essential for understanding disease mechanisms and improving diagnosis and prognosis models. However, traditional approaches rely on human-driven hypothesis testing and selection of association factors, often overlooking complex, non-linear dependencies among imaging phenotypes and other multi-modal data. To address this, we introduce a Multi-agent Exploratory Synergy for the Heart (MESHAgents) framework that leverages large language models as agents to dynamically elicit, surface, and decide confounders and phenotypes in association studies, using cardiovascular imaging as a proof of concept. Specifically, we orchestrate a multi-disciplinary team of AI agents -- spanning cardiology, biomechanics, statistics, and clinical research -- which spontaneously generate and converge on insights through iterative, self-organizing reasoning. The framework dynamically synthesizes statistical correlations with multi-expert consensus, providing an automated pipeline for phenome-wide association studies (PheWAS). We demonstrate the system's capabilities through a population-based study of imaging phenotypes of the heart and aorta. MESHAgents autonomously uncovered correlations between imaging phenotypes and a wide range of non-imaging factors, identifying additional confounder variables beyond standard demographic factors. Validation on diagnosis tasks reveals that MESHAgents-discovered phenotypes achieve performance comparable to expert-selected phenotypes, with mean AUC differences as small as -0.004 on disease classification tasks. Notably, the recall score improves for 6 out of 9 disease types. Our framework provides clinically relevant imaging phenotypes with transparent reasoning, offering a scalable alternative to expert-driven methods.
△ Less
Submitted 4 July, 2025;
originally announced July 2025.
-
Uncover Treasures in DCT: Advancing JPEG Quality Enhancement by Exploiting Latent Correlations
Authors:
Jing Yang,
Qunliang Xing,
Mai Xu,
Minglang Qiao
Abstract:
Joint Photographic Experts Group (JPEG) achieves data compression by quantizing Discrete Cosine Transform (DCT) coefficients, which inevitably introduces compression artifacts. Most existing JPEG quality enhancement methods operate in the pixel domain, suffering from the high computational costs of decoding. Consequently, direct enhancement of JPEG images in the DCT domain has gained increasing at…
▽ More
Joint Photographic Experts Group (JPEG) achieves data compression by quantizing Discrete Cosine Transform (DCT) coefficients, which inevitably introduces compression artifacts. Most existing JPEG quality enhancement methods operate in the pixel domain, suffering from the high computational costs of decoding. Consequently, direct enhancement of JPEG images in the DCT domain has gained increasing attention. However, current DCT-domain methods often exhibit limited performance. To address this challenge, we identify two critical types of correlations within the DCT coefficients of JPEG images. Building on this insight, we propose an Advanced DCT-domain JPEG Quality Enhancement (AJQE) method that fully exploits these correlations. The AJQE method enables the adaptation of numerous well-established pixel-domain models to the DCT domain, achieving superior performance with reduced computational complexity. Compared to the pixel-domain counterparts, the DCT-domain models derived by our method demonstrate a 0.35 dB improvement in PSNR and a 60.5% increase in enhancement throughput on average.
△ Less
Submitted 26 June, 2025;
originally announced June 2025.
-
Unsupervised Evolutionary Cell Type Matching via Entropy-Minimized Optimal Transport
Authors:
Mu Qiao
Abstract:
Identifying evolutionary correspondences between cell types across species is a fundamental challenge in comparative genomics and evolutionary biology. Existing approaches often rely on either reference-based matching, which imposes asymmetry by designating one species as the reference, or projection-based matching, which may increase computational complexity and obscure biological interpretabilit…
▽ More
Identifying evolutionary correspondences between cell types across species is a fundamental challenge in comparative genomics and evolutionary biology. Existing approaches often rely on either reference-based matching, which imposes asymmetry by designating one species as the reference, or projection-based matching, which may increase computational complexity and obscure biological interpretability at the cell-type level. Here, we present OT-MESH, an unsupervised computational framework leveraging entropy-regularized optimal transport (OT) to systematically determine cross-species cell type homologies. Our method uniquely integrates the Minimize Entropy of Sinkhorn (MESH) technique to refine the OT plan. It begins by selecting genes with high Signal-to-Noise Ratio (SNR) to capture the most informative features, from which a cost matrix is constructed using cosine distances between cell-type centroids. Importantly, the MESH procedure iteratively refines the cost matrix, leading to a transport plan with significantly enhanced sparsity and interpretability of the resulting correspondence matrices. Applied to retinal bipolar cells (BCs) and retinal ganglion cells (RGCs) from mouse and macaque, OT-MESH accurately recovers known evolutionary relationships and uncovers novel correspondences, one of which was independently validated experimentally. Thus, our framework offers a principled, scalable, symmetric, and interpretable solution for evolutionary cell type mapping, facilitating deeper insights into cellular specialization and conservation across species.
△ Less
Submitted 30 May, 2025;
originally announced May 2025.
-
$α$-GAN by Rényi Cross Entropy
Authors:
Ni Ding,
Miao Qiao,
Jiaxing Xu,
Yiping Ke,
Xiaoyu Zhang
Abstract:
This paper proposes $α$-GAN, a generative adversarial network using Rényi measures. The value function is formulated, by Rényi cross entropy, as an expected certainty measure incurred by the discriminator's soft decision as to where the sample is from, true population or the generator. The discriminator tries to maximize the Rényi certainty about sample source, while the generator wants to reduce…
▽ More
This paper proposes $α$-GAN, a generative adversarial network using Rényi measures. The value function is formulated, by Rényi cross entropy, as an expected certainty measure incurred by the discriminator's soft decision as to where the sample is from, true population or the generator. The discriminator tries to maximize the Rényi certainty about sample source, while the generator wants to reduce it by injecting fake samples. This forms a min-max problem with the solution parameterized by the Rényi order $α$. This $α$-GAN reduces to vanilla GAN at $α= 1$, where the value function is exactly the binary cross entropy. The optimization of $α$-GAN is over probability (vector) space. It is shown that the gradient is exponentially enlarged when Rényi order is in the range $α\in (0,1)$. This makes convergence faster, which is verified by experimental results. A discussion shows that choosing $α\in (0,1)$ may be able to solve some common problems, e.g., vanishing gradient. A following observation reveals that this range has not been fully explored in the existing Rényi version GANs.
△ Less
Submitted 20 May, 2025;
originally announced May 2025.
-
Contrastive Learning for Continuous Touch-Based Authentication
Authors:
Mengyu Qiao,
Yunpeng Zhai,
Yang Wang
Abstract:
Smart mobile devices have become indispensable in modern daily life, where sensitive information is frequently processed, stored, and transmitted-posing critical demands for robust security controls. Given that touchscreens are the primary medium for human-device interaction, continuous user authentication based on touch behavior presents a natural and seamless security solution. While existing me…
▽ More
Smart mobile devices have become indispensable in modern daily life, where sensitive information is frequently processed, stored, and transmitted-posing critical demands for robust security controls. Given that touchscreens are the primary medium for human-device interaction, continuous user authentication based on touch behavior presents a natural and seamless security solution. While existing methods predominantly adopt binary classification under single-modal learning settings, we propose a unified contrastive learning framework for continuous authentication in a non-disruptive manner. Specifically, the proposed method leverages a Temporal Masked Autoencoder to extract temporal patterns from raw multi-sensor data streams, capturing continuous motion and gesture dynamics. The pre-trained TMAE is subsequently integrated into a Siamese Temporal-Attentive Convolutional Network within a contrastive learning paradigm to model both sequential and cross-modal patterns. To further enhance performance, we incorporate multi-head attention and channel attention mechanisms to capture long-range dependencies and optimize inter-channel feature integration. Extensive experiments on public benchmarks and a self-collected dataset demonstrate that our approach outperforms state-of-the-art methods, offering a reliable and effective solution for user authentication on mobile devices.
△ Less
Submitted 24 April, 2025;
originally announced April 2025.
-
Towards Generalizable Deepfake Detection with Spatial-Frequency Collaborative Learning and Hierarchical Cross-Modal Fusion
Authors:
Mengyu Qiao,
Runze Tian,
Yang Wang
Abstract:
The rapid evolution of deep generative models poses a critical challenge to deepfake detection, as detectors trained on forgery-specific artifacts often suffer significant performance degradation when encountering unseen forgeries. While existing methods predominantly rely on spatial domain analysis, frequency domain operations are primarily limited to feature-level augmentation, leaving frequency…
▽ More
The rapid evolution of deep generative models poses a critical challenge to deepfake detection, as detectors trained on forgery-specific artifacts often suffer significant performance degradation when encountering unseen forgeries. While existing methods predominantly rely on spatial domain analysis, frequency domain operations are primarily limited to feature-level augmentation, leaving frequency-native artifacts and spatial-frequency interactions insufficiently exploited. To address this limitation, we propose a novel detection framework that integrates multi-scale spatial-frequency analysis for universal deepfake detection. Our framework comprises three key components: (1) a local spectral feature extraction pipeline that combines block-wise discrete cosine transform with cascaded multi-scale convolutions to capture subtle spectral artifacts; (2) a global spectral feature extraction pipeline utilizing scale-invariant differential accumulation to identify holistic forgery distribution patterns; and (3) a multi-stage cross-modal fusion mechanism that incorporates shallow-layer attention enhancement and deep-layer dynamic modulation to model spatial-frequency interactions. Extensive evaluations on widely adopted benchmarks demonstrate that our method outperforms state-of-the-art deepfake detection methods in both accuracy and generalizability.
△ Less
Submitted 23 April, 2025;
originally announced April 2025.
-
Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
Authors:
ByteDance Seed,
:,
Jiaze Chen,
Tiantian Fan,
Xin Liu,
Lingjun Liu,
Zhiqi Lin,
Mingxuan Wang,
Chengyi Wang,
Xiangpeng Wei,
Wenyuan Xu,
Yufeng Yuan,
Yu Yue,
Lin Yan,
Qiying Yu,
Xiaochen Zuo,
Chi Zhang,
Ruofei Zhu,
Zhecheng An,
Zhihao Bai,
Yu Bao,
Xingyan Bin,
Jiangjie Chen,
Feng Chen,
Hongmin Chen
, et al. (249 additional authors not shown)
Abstract:
We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For in…
▽ More
We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For instance, it surpasses DeepSeek R1 by 8% in win rate on non-reasoning tasks, indicating its broader applicability. Compared to other state-of-the-art reasoning models, Seed1.5-Thinking is a Mixture-of-Experts (MoE) model with a relatively small size, featuring 20B activated and 200B total parameters. As part of our effort to assess generalized reasoning, we develop two internal benchmarks, BeyondAIME and Codeforces, both of which will be publicly released to support future research. Model trial link: https://www.volcengine.com/experience/ark.
△ Less
Submitted 29 April, 2025; v1 submitted 10 April, 2025;
originally announced April 2025.
-
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Authors:
Qiying Yu,
Zheng Zhang,
Ruofei Zhu,
Yufeng Yuan,
Xiaochen Zuo,
Yu Yue,
Weinan Dai,
Tiantian Fan,
Gaohong Liu,
Lingjun Liu,
Xin Liu,
Haibin Lin,
Zhiqi Lin,
Bole Ma,
Guangming Sheng,
Yuxuan Tong,
Chi Zhang,
Mofan Zhang,
Wang Zhang,
Hang Zhu,
Jinhua Zhu,
Jiaze Chen,
Jiangjie Chen,
Chengyi Wang,
Hongli Yu
, et al. (10 additional authors not shown)
Abstract:
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in OpenAI o1 blog and DeepSeek R1 technical report), thus the community still struggles to reproduce their RL training results. We propose the $\textbf{D}$ecouple…
▽ More
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in OpenAI o1 blog and DeepSeek R1 technical report), thus the community still struggles to reproduce their RL training results. We propose the $\textbf{D}$ecoupled Clip and $\textbf{D}$ynamic s$\textbf{A}$mpling $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{DAPO}$) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
△ Less
Submitted 19 May, 2025; v1 submitted 18 March, 2025;
originally announced March 2025.
-
Truthfulness of Decision-Theoretic Calibration Measures
Authors:
Mingda Qiao,
Eric Zhao
Abstract:
Calibration measures quantify how much a forecaster's predictions violates calibration, which requires that forecasts are unbiased conditioning on the forecasted probabilities. Two important desiderata for a calibration measure are its decision-theoretic implications (i.e., downstream decision-makers that best-respond to the forecasts are always no-regret) and its truthfulness (i.e., a forecaster…
▽ More
Calibration measures quantify how much a forecaster's predictions violates calibration, which requires that forecasts are unbiased conditioning on the forecasted probabilities. Two important desiderata for a calibration measure are its decision-theoretic implications (i.e., downstream decision-makers that best-respond to the forecasts are always no-regret) and its truthfulness (i.e., a forecaster approximately minimizes error by always reporting the true probabilities). Existing measures satisfy at most one of the properties, but not both.
We introduce a new calibration measure termed subsampled step calibration, $\mathsf{StepCE}^{\textsf{sub}}$, that is both decision-theoretic and truthful. In particular, on any product distribution, $\mathsf{StepCE}^{\textsf{sub}}$ is truthful up to an $O(1)$ factor whereas prior decision-theoretic calibration measures suffer from an $e^{-Ω(T)}$-$Ω(\sqrt{T})$ truthfulness gap. Moreover, in any smoothed setting where the conditional probability of each event is perturbed by a noise of magnitude $c > 0$, $\mathsf{StepCE}^{\textsf{sub}}$ is truthful up to an $O(\sqrt{\log(1/c)})$ factor, while prior decision-theoretic measures have an $e^{-Ω(T)}$-$Ω(T^{1/3})$ truthfulness gap. We also prove a general impossibility result for truthful decision-theoretic forecasting: any complete and decision-theoretic calibration measure must be discontinuous and non-truthful in the non-smoothed setting.
△ Less
Submitted 4 March, 2025;
originally announced March 2025.
-
Optimal $k$-Secretary with Logarithmic Memory
Authors:
Mingda Qiao,
Wei Zhang
Abstract:
We study memory-bounded algorithms for the $k$-secretary problem. The algorithm of Kleinberg (2005) achieves an optimal competitive ratio of $1 - O(1/\sqrt{k})$, yet a straightforward implementation requires $Ω(k)$ memory.
Our main result is a $k$-secretary algorithm that matches the optimal competitive ratio using $O(\log k)$ words of memory. We prove this result by establishing a general reduc…
▽ More
We study memory-bounded algorithms for the $k$-secretary problem. The algorithm of Kleinberg (2005) achieves an optimal competitive ratio of $1 - O(1/\sqrt{k})$, yet a straightforward implementation requires $Ω(k)$ memory.
Our main result is a $k$-secretary algorithm that matches the optimal competitive ratio using $O(\log k)$ words of memory. We prove this result by establishing a general reduction from $k$-secretary to (random-order) quantile estimation, the problem of finding the $k$-th largest element in a stream. We show that a quantile estimation algorithm with an $O(k^α)$ expected error (in terms of the rank) gives a $(1 - O(1/k^{1-α}))$-competitive $k$-secretary algorithm with $O(1)$ extra words.
We then introduce a new quantile estimation algorithm that achieves an $O(\sqrt{k})$ expected error bound using $O(\log k)$ memory. Of independent interest, we give a different algorithm that uses $O(\sqrt{k})$ words and finds the $k$-th largest element exactly with high probability, generalizing a result of Munro and Paterson (1980).
△ Less
Submitted 13 February, 2025;
originally announced February 2025.
-
OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery
Authors:
Siqi Fan,
Yuguang Xie,
Bowen Cai,
Ailin Xie,
Gaochao Liu,
Mu Qiao,
Jie Xing,
Zaiqing Nie
Abstract:
Understanding the chemical structure from a graphical representation of a molecule is a challenging image caption task that would greatly benefit molecule-centric scientific discovery. Variations in molecular images and caption subtasks pose a significant challenge in both image representation learning and task modeling. Yet, existing methods only focus on a specific caption task that translates a…
▽ More
Understanding the chemical structure from a graphical representation of a molecule is a challenging image caption task that would greatly benefit molecule-centric scientific discovery. Variations in molecular images and caption subtasks pose a significant challenge in both image representation learning and task modeling. Yet, existing methods only focus on a specific caption task that translates a molecular image into its graph structure, i.e., OCSR. In this paper, we propose the Optical Chemical Structure Understanding (OCSU) task, which extends low-level recognition to multilevel understanding and aims to translate chemical structure diagrams into readable strings for both machine and chemist. To facilitate the development of OCSU technology, we explore both OCSR-based and OCSR-free paradigms. We propose DoubleCheck to enhance OCSR performance via attentive feature enhancement for local ambiguous atoms. It can be cascaded with existing SMILES-based molecule understanding methods to achieve OCSU. Meanwhile, Mol-VL is a vision-language model end-to-end optimized for OCSU. We also construct Vis-CheBI20, the first large-scale OCSU dataset. Through comprehensive experiments, we demonstrate the proposed approaches excel at providing chemist-readable caption for chemical structure diagrams, which provide solid baselines for further research. Our code, model, and data are open-sourced at https://github.com/PharMolix/OCSU.
△ Less
Submitted 22 May, 2025; v1 submitted 26 January, 2025;
originally announced January 2025.
-
Reach Measurement, Optimization and Frequency Capping In Targeted Online Advertising Under k-Anonymity
Authors:
Yuan Gao,
Mu Qiao
Abstract:
The growth in the use of online advertising to foster brand awareness over recent years is largely attributable to the ubiquity of social media. One pivotal technology contributing to the success of online brand advertising is frequency capping, a mechanism that enables marketers to control the number of times an ad is shown to a specific user. However, the very foundation of this technology is be…
▽ More
The growth in the use of online advertising to foster brand awareness over recent years is largely attributable to the ubiquity of social media. One pivotal technology contributing to the success of online brand advertising is frequency capping, a mechanism that enables marketers to control the number of times an ad is shown to a specific user. However, the very foundation of this technology is being scrutinized as the industry gravitates towards advertising solutions that prioritize user privacy. This paper delves into the issue of reach measurement and optimization within the context of $k$-anonymity, a privacy-preserving model gaining traction across major online advertising platforms. We outline how to report reach within this new privacy landscape and demonstrate how probabilistic discounting, a probabilistic adaptation of traditional frequency capping, can be employed to optimize campaign performance. Experiments are performed to assess the trade-off between user privacy and the efficacy of online brand advertising. Notably, we discern a significant dip in performance as long as privacy is introduced, yet this comes with a limited additional cost for advertising platforms to offer their users more privacy.
△ Less
Submitted 8 January, 2025;
originally announced January 2025.
-
Leakage-Robust Bayesian Persuasion
Authors:
Nika Haghtalab,
Mingda Qiao,
Kunhe Yang
Abstract:
We introduce the concept of leakage-robust Bayesian persuasion. Situated between public persuasion [KG11, CCG23, Xu20] and private persuasion [AB19], leakage-robust persuasion considers a setting where one or more signals privately sent by a sender to the receivers may be leaked. We study the design of leakage-robust persuasion schemes and quantify the price of robustness using two formalisms:
-…
▽ More
We introduce the concept of leakage-robust Bayesian persuasion. Situated between public persuasion [KG11, CCG23, Xu20] and private persuasion [AB19], leakage-robust persuasion considers a setting where one or more signals privately sent by a sender to the receivers may be leaked. We study the design of leakage-robust persuasion schemes and quantify the price of robustness using two formalisms:
- The first notion, $k$-worst-case persuasiveness, requires a scheme to remain persuasive as long as each receiver observes at most $k$ leaked signals. We quantify the Price of Worst-case Robustness (PoWR$_k$) -- i.e., the gap in sender's utility as compared to the optimal private scheme -- as $Θ(\min\{2^k,n\})$ for supermodular sender utilities and $Θ(k)$ for submodular or XOS utilities, where $n$ is the number of receivers. This result also establishes that in some instances, $Θ(\log k)$ leakages are sufficient for the utility of the optimal leakage-robust persuasion to degenerate to that of public persuasion.
- The second notion, expected downstream utility robustness, relaxes the persuasiveness and considers the impact on sender's utility when receivers best respond to their observations. By quantifying the Price of Downstream Robustness (PoDR) as the gap between the sender's expected utility over random leakage patterns as compared to private persuasion, we show that over several natural and structured distributions of leakage patterns, PoDR improves PoWR to $Θ(k)$ or even $Θ(1)$, where $k$ is the maximum number of leaked signals observable to each receiver across leakage patterns in the distribution.
En route to these results, we show that subsampling and masking are general-purpose algorithmic paradigms for transforming private persuasion signaling schemes to leakage-robust ones, with minmax optimal loss in the sender's utility.
△ Less
Submitted 25 November, 2024;
originally announced November 2024.
-
MaCTG: Multi-Agent Collaborative Thought Graph for Automatic Programming
Authors:
Zixiao Zhao,
Jing Sun,
Zhe Hou,
Zhiyuan Wei,
Cheng-Hao Cai,
Miao Qiao,
Jin Song Dong
Abstract:
With the rapid advancement of Large Language Models (LLMs), LLM-based approaches have demonstrated strong problem-solving capabilities across various domains. However, in automatic programming, a single LLM is typically limited to function-level code generation, while multi-agent systems composed of multiple LLMs often suffer from inefficient task planning. This lack of structured coordination can…
▽ More
With the rapid advancement of Large Language Models (LLMs), LLM-based approaches have demonstrated strong problem-solving capabilities across various domains. However, in automatic programming, a single LLM is typically limited to function-level code generation, while multi-agent systems composed of multiple LLMs often suffer from inefficient task planning. This lack of structured coordination can lead to cascading hallucinations, where accumulated errors across agents result in suboptimal workflows and excessive computational costs. To overcome these challenges, we introduce MaCTG (Multi-Agent Collaborative Thought Graph), a novel multi-agent framework that employs a dynamic graph structure to facilitate precise task allocation and controlled collaboration among LLM agents. MaCTG autonomously assigns agent roles based on programming requirements, dynamically refines task distribution through context-aware adjustments, and systematically verifies and integrates project-level code, effectively reducing hallucination errors and improving overall accuracy. MaCTG enhances cost-effectiveness by implementing a hybrid LLM deployment, where proprietary models handle complex reasoning, while open-source models are used for routine coding and validation tasks. To evaluate MaCTG's effectiveness, we applied it to traditional image processing auto-programming tasks, achieving a state-of-the-art accuracy of 83.33%. Additionally, by leveraging its hybrid LLM configuration, MaCTG significantly reduced operational costs by 89.09% compared to existing multi-agent frameworks, demonstrating its efficiency, scalability, and real-world applicability.
△ Less
Submitted 21 April, 2025; v1 submitted 24 October, 2024;
originally announced October 2024.
-
A personalized time-resolved 3D mesh generative model for unveiling normal heart dynamics
Authors:
Mengyun Qiao,
Kathryn A McGurk,
Shuo Wang,
Paul M. Matthews,
Declan P O Regan,
Wenjia Bai
Abstract:
Understanding the structure and motion of the heart is crucial for diagnosing and managing cardiovascular diseases, the leading cause of global death. There is wide variation in cardiac shape and motion patterns, influenced by demographic, anthropometric and disease factors. Unravelling normal patterns of shape and motion, and understanding how each individual deviates from the norm, would facilit…
▽ More
Understanding the structure and motion of the heart is crucial for diagnosing and managing cardiovascular diseases, the leading cause of global death. There is wide variation in cardiac shape and motion patterns, influenced by demographic, anthropometric and disease factors. Unravelling normal patterns of shape and motion, and understanding how each individual deviates from the norm, would facilitate accurate diagnosis and personalised treatment strategies. To this end, we developed a conditional generative model, MeshHeart, to learn the distribution of shape and motion patterns for the left and right ventricles of the heart. To model the high-dimensional spatio-temporal mesh data, MeshHeart employs a geometric encoder to represent cardiac meshes in a latent space, and a temporal Transformer to model the motion dynamics of latent representations. Based on MeshHeart, we investigate the latent space of 3D+t cardiac mesh sequences and propose a distance metric, latent delta, which quantifies the deviation of a real heart from its personalised normative pattern. In experiments using a large cardiac magnetic resonance image dataset of 38,309 subjects from the UK Biobank, MeshHeart demonstrates high performance in cardiac mesh sequence reconstruction and generation. Latent space features are discriminative for cardiac disease classification, whereas latent delta exhibits strong correlations with clinical phenotypes in phenome-wide association studies. The code and the trained model are released to support further research.
△ Less
Submitted 2 June, 2025; v1 submitted 20 September, 2024;
originally announced September 2024.
-
Contrasformer: A Brain Network Contrastive Transformer for Neurodegenerative Condition Identification
Authors:
Jiaxing Xu,
Kai He,
Mengcheng Lan,
Qingtian Bian,
Wei Li,
Tieying Li,
Yiping Ke,
Miao Qiao
Abstract:
Understanding neurological disorder is a fundamental problem in neuroscience, which often requires the analysis of brain networks derived from functional magnetic resonance imaging (fMRI) data. Despite the prevalence of Graph Neural Networks (GNNs) and Graph Transformers in various domains, applying them to brain networks faces challenges. Specifically, the datasets are severely impacted by the no…
▽ More
Understanding neurological disorder is a fundamental problem in neuroscience, which often requires the analysis of brain networks derived from functional magnetic resonance imaging (fMRI) data. Despite the prevalence of Graph Neural Networks (GNNs) and Graph Transformers in various domains, applying them to brain networks faces challenges. Specifically, the datasets are severely impacted by the noises caused by distribution shifts across sub-populations and the neglect of node identities, both obstruct the identification of disease-specific patterns. To tackle these challenges, we propose Contrasformer, a novel contrastive brain network Transformer. It generates a prior-knowledge-enhanced contrast graph to address the distribution shifts across sub-populations by a two-stream attention mechanism. A cross attention with identity embedding highlights the identity of nodes, and three auxiliary losses ensure group consistency. Evaluated on 4 functional brain network datasets over 4 different diseases, Contrasformer outperforms the state-of-the-art methods for brain networks by achieving up to 10.8\% improvement in accuracy, which demonstrates its efficacy in neurological disorder identification. Case studies illustrate its interpretability, especially in the context of neuroscience. This paper provides a solution for analyzing brain networks, offering valuable insights into neurological disorders. Our code is available at \url{https://github.com/AngusMonroe/Contrasformer}.
△ Less
Submitted 17 September, 2024;
originally announced September 2024.
-
Truthfulness of Calibration Measures
Authors:
Nika Haghtalab,
Mingda Qiao,
Kunhe Yang,
Eric Zhao
Abstract:
We initiate the study of the truthfulness of calibration measures in sequential prediction. A calibration measure is said to be truthful if the forecaster (approximately) minimizes the expected penalty by predicting the conditional expectation of the next outcome, given the prior distribution of outcomes. Truthfulness is an important property of calibration measures, ensuring that the forecaster i…
▽ More
We initiate the study of the truthfulness of calibration measures in sequential prediction. A calibration measure is said to be truthful if the forecaster (approximately) minimizes the expected penalty by predicting the conditional expectation of the next outcome, given the prior distribution of outcomes. Truthfulness is an important property of calibration measures, ensuring that the forecaster is not incentivized to exploit the system with deliberate poor forecasts. This makes it an essential desideratum for calibration measures, alongside typical requirements, such as soundness and completeness.
We conduct a taxonomy of existing calibration measures and their truthfulness. Perhaps surprisingly, we find that all of them are far from being truthful. That is, under existing calibration measures, there are simple distributions on which a polylogarithmic (or even zero) penalty is achievable, while truthful prediction leads to a polynomial penalty. Our main contribution is the introduction of a new calibration measure termed the Subsampled Smooth Calibration Error (SSCE) under which truthful prediction is optimal up to a constant multiplicative factor.
△ Less
Submitted 20 November, 2024; v1 submitted 18 July, 2024;
originally announced July 2024.
-
MIPI 2024 Challenge on Few-shot RAW Image Denoising: Methods and Results
Authors:
Xin Jin,
Chunle Guo,
Xiaoming Li,
Zongsheng Yue,
Chongyi Li,
Shangchen Zhou,
Ruicheng Feng,
Yuekun Dai,
Peiqing Yang,
Chen Change Loy,
Ruoqi Li,
Chang Liu,
Ziyi Wang,
Yao Du,
Jingjing Yang,
Long Bao,
Heng Sun,
Xiangyu Kong,
Xiaoxia Xing,
Jinlong Wu,
Yuanyang Xue,
Hyunhee Park,
Sejun Song,
Changho Kim,
Jingfan Tan
, et al. (17 additional authors not shown)
Abstract:
The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photogra…
▽ More
The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photography and imaging (MIPI). Building on the achievements of the previous MIPI Workshops held at ECCV 2022 and CVPR 2023, we introduce our third MIPI challenge including three tracks focusing on novel image sensors and imaging algorithms. In this paper, we summarize and review the Few-shot RAW Image Denoising track on MIPI 2024. In total, 165 participants were successfully registered, and 7 teams submitted results in the final testing phase. The developed solutions in this challenge achieved state-of-the-art erformance on Few-shot RAW Image Denoising. More details of this challenge and the link to the dataset can be found at https://mipichallenge.org/MIPI2024.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
QUBIQ: Uncertainty Quantification for Biomedical Image Segmentation Challenge
Authors:
Hongwei Bran Li,
Fernando Navarro,
Ivan Ezhov,
Amirhossein Bayat,
Dhritiman Das,
Florian Kofler,
Suprosanna Shit,
Diana Waldmannstetter,
Johannes C. Paetzold,
Xiaobin Hu,
Benedikt Wiestler,
Lucas Zimmer,
Tamaz Amiranashvili,
Chinmay Prabhakar,
Christoph Berger,
Jonas Weidner,
Michelle Alonso-Basant,
Arif Rashid,
Ujjwal Baid,
Wesam Adel,
Deniz Ali,
Bhakti Baheti,
Yingbin Bai,
Ishaan Bhatt,
Sabri Can Cetindag
, et al. (55 additional authors not shown)
Abstract:
Uncertainty in medical image segmentation tasks, especially inter-rater variability, arising from differences in interpretations and annotations by various experts, presents a significant challenge in achieving consistent and reliable image segmentation. This variability not only reflects the inherent complexity and subjective nature of medical image interpretation but also directly impacts the de…
▽ More
Uncertainty in medical image segmentation tasks, especially inter-rater variability, arising from differences in interpretations and annotations by various experts, presents a significant challenge in achieving consistent and reliable image segmentation. This variability not only reflects the inherent complexity and subjective nature of medical image interpretation but also directly impacts the development and evaluation of automated segmentation algorithms. Accurately modeling and quantifying this variability is essential for enhancing the robustness and clinical applicability of these algorithms. We report the set-up and summarize the benchmark results of the Quantification of Uncertainties in Biomedical Image Quantification Challenge (QUBIQ), which was organized in conjunction with International Conferences on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2020 and 2021. The challenge focuses on the uncertainty quantification of medical image segmentation which considers the omnipresence of inter-rater variability in imaging datasets. The large collection of images with multi-rater annotations features various modalities such as MRI and CT; various organs such as the brain, prostate, kidney, and pancreas; and different image dimensions 2D-vs-3D. A total of 24 teams submitted different solutions to the problem, combining various baseline models, Bayesian neural networks, and ensemble model techniques. The obtained results indicate the importance of the ensemble models, as well as the need for further research to develop efficient 3D methods for uncertainty quantification methods in 3D segmentation tasks.
△ Less
Submitted 24 June, 2024; v1 submitted 19 March, 2024;
originally announced May 2024.
-
A Foundation Model for Brain Lesion Segmentation with Mixture of Modality Experts
Authors:
Xinru Zhang,
Ni Ou,
Berke Doga Basaran,
Marco Visentin,
Mengyun Qiao,
Renyang Gu,
Cheng Ouyang,
Yaou Liu,
Paul M. Matthew,
Chuyang Ye,
Wenjia Bai
Abstract:
Brain lesion segmentation plays an essential role in neurological research and diagnosis. As brain lesions can be caused by various pathological alterations, different types of brain lesions tend to manifest with different characteristics on different imaging modalities. Due to this complexity, brain lesion segmentation methods are often developed in a task-specific manner. A specific segmentation…
▽ More
Brain lesion segmentation plays an essential role in neurological research and diagnosis. As brain lesions can be caused by various pathological alterations, different types of brain lesions tend to manifest with different characteristics on different imaging modalities. Due to this complexity, brain lesion segmentation methods are often developed in a task-specific manner. A specific segmentation model is developed for a particular lesion type and imaging modality. However, the use of task-specific models requires predetermination of the lesion type and imaging modality, which complicates their deployment in real-world scenarios. In this work, we propose a universal foundation model for 3D brain lesion segmentation, which can automatically segment different types of brain lesions for input data of various imaging modalities. We formulate a novel Mixture of Modality Experts (MoME) framework with multiple expert networks attending to different imaging modalities. A hierarchical gating network combines the expert predictions and fosters expertise collaboration. Furthermore, we introduce a curriculum learning strategy during training to avoid the degeneration of each expert network and preserve their specialization. We evaluated the proposed method on nine brain lesion datasets, encompassing five imaging modalities and eight lesion types. The results show that our model outperforms state-of-the-art universal models and provides promising generalization to unseen datasets.
△ Less
Submitted 16 July, 2024; v1 submitted 16 May, 2024;
originally announced May 2024.
-
Synthesizing Realistic Data for Table Recognition
Authors:
Qiyu Hou,
Jun Wang,
Meixuan Qiao,
Lujun Tian
Abstract:
To overcome the limitations and challenges of current automatic table data annotation methods and random table data synthesis approaches, we propose a novel method for synthesizing annotation data specifically designed for table recognition. This method utilizes the structure and content of existing complex tables, facilitating the efficient creation of tables that closely replicate the authentic…
▽ More
To overcome the limitations and challenges of current automatic table data annotation methods and random table data synthesis approaches, we propose a novel method for synthesizing annotation data specifically designed for table recognition. This method utilizes the structure and content of existing complex tables, facilitating the efficient creation of tables that closely replicate the authentic styles found in the target domain. By leveraging the actual structure and content of tables from Chinese financial announcements, we have developed the first extensive table annotation dataset in this domain. We used this dataset to train several recent deep learning-based end-to-end table recognition models. Additionally, we have established the inaugural benchmark for real-world complex tables in the Chinese financial announcement domain, using it to assess the performance of models trained on our synthetic data, thereby effectively validating our method's practicality and effectiveness. Furthermore, we applied our synthesis method to augment the FinTabNet dataset, extracted from English financial announcements, by increasing the proportion of tables with multiple spanning cells to introduce greater complexity. Our experiments show that models trained on this augmented dataset achieve comprehensive improvements in performance, especially in the recognition of tables with multiple spanning cells.
△ Less
Submitted 9 July, 2024; v1 submitted 17 April, 2024;
originally announced April 2024.
-
Scalable $k$-clique Densest Subgraph Search
Authors:
Xiaowei Ye,
Miao Qiao,
Rong-Hua Li,
Qi Zhang,
Guoren Wang
Abstract:
In this paper, we present a collection of novel and scalable algorithms designed to tackle the challenges inherent in the $k$-clique densest subgraph problem (\kcdsp) within network analysis. We propose \psctl, a novel algorithm based on the Frank-Wolfe approach for addressing \kcdsp, effectively solving a distinct convex programming problem. \textcolor{black}{\psctl is able to approximate \kcdsp…
▽ More
In this paper, we present a collection of novel and scalable algorithms designed to tackle the challenges inherent in the $k$-clique densest subgraph problem (\kcdsp) within network analysis. We propose \psctl, a novel algorithm based on the Frank-Wolfe approach for addressing \kcdsp, effectively solving a distinct convex programming problem. \textcolor{black}{\psctl is able to approximate \kcdsp with near optimal guarantees.} The notable advantage of \psctl lies in its time complexity, which is independent of the count of $k$-cliques, resulting in remarkable efficiency in practical applications. Additionally, we present \spath, a sampling-based algorithm with the capability to handle networks on an unprecedented scale, reaching up to $1.8\times 10^9$ edges. By leveraging the \ccpath algorithm as a uniform $k$-clique sampler, \spath ensures the efficient processing of large-scale network data, accompanied by a detailed analysis of accuracy guarantees. Together, these contributions represent a significant advancement in the field of $k$-clique densest subgraph discovery. In experimental evaluations, our algorithms demonstrate orders of magnitude faster performance compared to the current state-of-the-art solutions.
△ Less
Submitted 8 March, 2024;
originally announced March 2024.
-
Platforms for Efficient and Incentive-Aware Collaboration
Authors:
Nika Haghtalab,
Mingda Qiao,
Kunhe Yang
Abstract:
Collaboration is crucial for reaching collective goals. However, its effectiveness is often undermined by the strategic behavior of individual agents -- a fact that is captured by a high Price of Stability (PoS) in recent literature [Blum et al., 2021]. Implicit in the traditional PoS analysis is the assumption that agents have full knowledge of how their tasks relate to one another. We offer a ne…
▽ More
Collaboration is crucial for reaching collective goals. However, its effectiveness is often undermined by the strategic behavior of individual agents -- a fact that is captured by a high Price of Stability (PoS) in recent literature [Blum et al., 2021]. Implicit in the traditional PoS analysis is the assumption that agents have full knowledge of how their tasks relate to one another. We offer a new perspective on bringing about efficient collaboration among strategic agents using information design. Inspired by the growing importance of collaboration in machine learning (such as platforms for collaborative federated learning and data cooperatives), we propose a framework where the platform has more information about how the agents' tasks relate to each other than the agents themselves. We characterize how and to what degree such platforms can leverage their information advantage to steer strategic agents toward efficient collaboration.
Concretely, we consider collaboration networks where each node is a task type held by one agent, and each task benefits from contributions made in their inclusive neighborhood of tasks. This network structure is known to the agents and the platform, but only the platform knows each agent's real location -- from the agents' perspective, their location is determined by a random permutation. We employ private Bayesian persuasion and design two families of persuasive signaling schemes that the platform can use to ensure a small total workload when agents follow the signal. The first family aims to achieve the minmax optimal approximation ratio compared to the optimal collaboration, which is shown to be $Θ(\sqrt{n})$ for unit-weight graphs, $Θ(n^{2/3})$ for graphs with constant minimum edge weights, and $O(n^{3/4})$ for general weighted graphs. The second family ensures per-instance strict improvement compared to full information disclosure.
△ Less
Submitted 20 November, 2024; v1 submitted 23 February, 2024;
originally announced February 2024.
-
Collaborative Learning with Different Labeling Functions
Authors:
Yuyang Deng,
Mingda Qiao
Abstract:
We study a variant of Collaborative PAC Learning, in which we aim to learn an accurate classifier for each of the $n$ data distributions, while minimizing the number of samples drawn from them in total. Unlike in the usual collaborative learning setup, it is not assumed that there exists a single classifier that is simultaneously accurate for all distributions.
We show that, when the data distri…
▽ More
We study a variant of Collaborative PAC Learning, in which we aim to learn an accurate classifier for each of the $n$ data distributions, while minimizing the number of samples drawn from them in total. Unlike in the usual collaborative learning setup, it is not assumed that there exists a single classifier that is simultaneously accurate for all distributions.
We show that, when the data distributions satisfy a weaker realizability assumption, which appeared in [Crammer and Mansour, 2012] in the context of multi-task learning, sample-efficient learning is still feasible. We give a learning algorithm based on Empirical Risk Minimization (ERM) on a natural augmentation of the hypothesis class, and the analysis relies on an upper bound on the VC dimension of this augmented class.
In terms of the computational efficiency, we show that ERM on the augmented hypothesis class is NP-hard, which gives evidence against the existence of computationally efficient learners in general. On the positive side, for two special cases, we give learners that are both sample- and computationally-efficient.
△ Less
Submitted 22 May, 2024; v1 submitted 15 February, 2024;
originally announced February 2024.
-
On the Distance from Calibration in Sequential Prediction
Authors:
Mingda Qiao,
Letian Zheng
Abstract:
We study a sequential binary prediction setting where the forecaster is evaluated in terms of the calibration distance, which is defined as the $L_1$ distance between the predicted values and the set of predictions that are perfectly calibrated in hindsight. This is analogous to a calibration measure recently proposed by Błasiok, Gopalan, Hu and Nakkiran (STOC 2023) for the offline setting. The ca…
▽ More
We study a sequential binary prediction setting where the forecaster is evaluated in terms of the calibration distance, which is defined as the $L_1$ distance between the predicted values and the set of predictions that are perfectly calibrated in hindsight. This is analogous to a calibration measure recently proposed by Błasiok, Gopalan, Hu and Nakkiran (STOC 2023) for the offline setting. The calibration distance is a natural and intuitive measure of deviation from perfect calibration, and satisfies a Lipschitz continuity property which does not hold for many popular calibration measures, such as the $L_1$ calibration error and its variants.
We prove that there is a forecasting algorithm that achieves an $O(\sqrt{T})$ calibration distance in expectation on an adversarially chosen sequence of $T$ binary outcomes. At the core of this upper bound is a structural result showing that the calibration distance is accurately approximated by the lower calibration distance, which is a continuous relaxation of the former. We then show that an $O(\sqrt{T})$ lower calibration distance can be achieved via a simple minimax argument and a reduction to online learning on a Lipschitz class.
On the lower bound side, an $Ω(T^{1/3})$ calibration distance is shown to be unavoidable, even when the adversary outputs a sequence of independent random bits, and has an additional ability to early stop (i.e., to stop producing random bits and output the same bit in the remaining steps). Interestingly, without this early stopping, the forecaster can achieve a much smaller calibration distance of $\mathrm{polylog}(T)$.
△ Less
Submitted 27 May, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
-
A Combinatorial Approach to Robust PCA
Authors:
Weihao Kong,
Mingda Qiao,
Rajat Sen
Abstract:
We study the problem of recovering Gaussian data under adversarial corruptions when the noises are low-rank and the corruptions are on the coordinate level. Concretely, we assume that the Gaussian noises lie in an unknown $k$-dimensional subspace $U \subseteq \mathbb{R}^d$, and $s$ randomly chosen coordinates of each data point fall into the control of an adversary. This setting models the scenari…
▽ More
We study the problem of recovering Gaussian data under adversarial corruptions when the noises are low-rank and the corruptions are on the coordinate level. Concretely, we assume that the Gaussian noises lie in an unknown $k$-dimensional subspace $U \subseteq \mathbb{R}^d$, and $s$ randomly chosen coordinates of each data point fall into the control of an adversary. This setting models the scenario of learning from high-dimensional yet structured data that are transmitted through a highly-noisy channel, so that the data points are unlikely to be entirely clean.
Our main result is an efficient algorithm that, when $ks^2 = O(d)$, recovers every single data point up to a nearly-optimal $\ell_1$ error of $\tilde O(ks/d)$ in expectation. At the core of our proof is a new analysis of the well-known Basis Pursuit (BP) method for recovering a sparse signal, which is known to succeed under additional assumptions (e.g., incoherence or the restricted isometry property) on the underlying subspace $U$. In contrast, we present a novel approach via studying a natural combinatorial problem and show that, over the randomness in the support of the sparse signal, a high-probability error bound is possible even if the subspace $U$ is arbitrary.
△ Less
Submitted 27 November, 2023;
originally announced November 2023.
-
BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine
Authors:
Yizhen Luo,
Jiahuan Zhang,
Siqi Fan,
Kai Yang,
Yushuai Wu,
Mu Qiao,
Zaiqing Nie
Abstract:
Foundation models (FMs) have exhibited remarkable performance across a wide range of downstream tasks in many domains. Nevertheless, general-purpose FMs often face challenges when confronted with domain-specific problems, due to their limited access to the proprietary training data in a particular domain. In biomedicine, there are various biological modalities, such as molecules, proteins, and cel…
▽ More
Foundation models (FMs) have exhibited remarkable performance across a wide range of downstream tasks in many domains. Nevertheless, general-purpose FMs often face challenges when confronted with domain-specific problems, due to their limited access to the proprietary training data in a particular domain. In biomedicine, there are various biological modalities, such as molecules, proteins, and cells, which are encoded by the language of life and exhibit significant modality gaps with human natural language. In this paper, we introduce BioMedGPT, an open multimodal generative pre-trained transformer (GPT) for biomedicine, to bridge the gap between the language of life and human natural language. BioMedGPT allows users to easily ``communicate'' with diverse biological modalities through free text, which is the first of its kind. BioMedGPT aligns different biological modalities with natural language via a large generative language model, namely, BioMedGPT-LM. We publish BioMedGPT-10B, which unifies the feature spaces of molecules, proteins, and natural language via encoding and alignment. Through fine-tuning, BioMedGPT-10B outperforms or is on par with human and significantly larger general-purpose foundation models on the biomedical QA task. It also demonstrates promising performance in the molecule QA and protein QA tasks, which could greatly accelerate the discovery of new drugs and therapeutic targets. In addition, BioMedGPT-LM-7B is the first large generative language model based on Llama2 in the biomedical domain, therefore is commercial friendly. Both BioMedGPT-10B and BioMedGPT-LM-7B are open-sourced to the research community. In addition, we publish the datasets that are meticulously curated for the alignment of multi-modalities, i.e., PubChemQA and UniProtQA. All the models, codes, and datasets are available at \url{https://github.com/PharMolix/OpenBioMed}.
△ Less
Submitted 21 August, 2023; v1 submitted 18 August, 2023;
originally announced August 2023.
-
LesionMix: A Lesion-Level Data Augmentation Method for Medical Image Segmentation
Authors:
Berke Doga Basaran,
Weitong Zhang,
Mengyun Qiao,
Bernhard Kainz,
Paul M. Matthews,
Wenjia Bai
Abstract:
Data augmentation has become a de facto component of deep learning-based medical image segmentation methods. Most data augmentation techniques used in medical imaging focus on spatial and intensity transformations to improve the diversity of training images. They are often designed at the image level, augmenting the full image, and do not pay attention to specific abnormalities within the image. H…
▽ More
Data augmentation has become a de facto component of deep learning-based medical image segmentation methods. Most data augmentation techniques used in medical imaging focus on spatial and intensity transformations to improve the diversity of training images. They are often designed at the image level, augmenting the full image, and do not pay attention to specific abnormalities within the image. Here, we present LesionMix, a novel and simple lesion-aware data augmentation method. It performs augmentation at the lesion level, increasing the diversity of lesion shape, location, intensity and load distribution, and allowing both lesion populating and inpainting. Experiments on different modalities and different lesion datasets, including four brain MR lesion datasets and one liver CT lesion dataset, demonstrate that LesionMix achieves promising performance in lesion image segmentation, outperforming several recent Mix-based data augmentation methods. The code will be released at https://github.com/dogabasaran/lesionmix.
△ Less
Submitted 17 August, 2023;
originally announced August 2023.
-
Contrastive Graph Pooling for Explainable Classification of Brain Networks
Authors:
Jiaxing Xu,
Qingtian Bian,
Xinhang Li,
Aihu Zhang,
Yiping Ke,
Miao Qiao,
Wei Zhang,
Wei Khang Jeremy Sim,
Balázs Gulyás
Abstract:
Functional magnetic resonance imaging (fMRI) is a commonly used technique to measure neural activation. Its application has been particularly important in identifying underlying neurodegenerative conditions such as Parkinson's, Alzheimer's, and Autism. Recent analysis of fMRI data models the brain as a graph and extracts features by graph neural networks (GNNs). However, the unique characteristics…
▽ More
Functional magnetic resonance imaging (fMRI) is a commonly used technique to measure neural activation. Its application has been particularly important in identifying underlying neurodegenerative conditions such as Parkinson's, Alzheimer's, and Autism. Recent analysis of fMRI data models the brain as a graph and extracts features by graph neural networks (GNNs). However, the unique characteristics of fMRI data require a special design of GNN. Tailoring GNN to generate effective and domain-explainable features remains challenging. In this paper, we propose a contrastive dual-attention block and a differentiable graph pooling method called ContrastPool to better utilize GNN for brain networks, meeting fMRI-specific requirements. We apply our method to 5 resting-state fMRI brain network datasets of 3 diseases and demonstrate its superiority over state-of-the-art baselines. Our case study confirms that the patterns extracted by our method match the domain knowledge in neuroscience literature, and disclose direct and interesting insights. Our contributions underscore the potential of ContrastPool for advancing the understanding of brain networks and neurodegenerative conditions. The source code is available at https://github.com/AngusMonroe/ContrastPool.
△ Less
Submitted 6 September, 2024; v1 submitted 7 July, 2023;
originally announced July 2023.
-
M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization
Authors:
Che Liu,
Sibo Cheng,
Chen Chen,
Mengyun Qiao,
Weitong Zhang,
Anand Shah,
Wenjia Bai,
Rossella Arcucci
Abstract:
Medical vision-language models enable co-learning and integrating features from medical imaging and clinical text. However, these models are not easy to train and the latent representation space can be complex. Here we propose a novel way for pre-training and regularising medical vision-language models. The proposed method, named Medical vision-language pre-training with Frozen language models and…
▽ More
Medical vision-language models enable co-learning and integrating features from medical imaging and clinical text. However, these models are not easy to train and the latent representation space can be complex. Here we propose a novel way for pre-training and regularising medical vision-language models. The proposed method, named Medical vision-language pre-training with Frozen language models and Latent spAce Geometry optimization (M-FLAG), leverages a frozen language model for training stability and efficiency and introduces a novel orthogonality loss to harmonize the latent space geometry. We demonstrate the potential of the pre-trained model on three downstream tasks: medical image classification, segmentation, and object detection. Extensive experiments across five public datasets demonstrate that M-FLAG significantly outperforms existing medical vision-language pre-training approaches and reduces the number of parameters by 78\%. Notably, M-FLAG achieves outstanding performance on the segmentation task while using only 1\% of the RSNA dataset, even outperforming ImageNet pre-trained models that have been fine-tuned using 100\% of the data.
△ Less
Submitted 19 July, 2023; v1 submitted 17 July, 2023;
originally announced July 2023.
-
Structure Diagram Recognition in Financial Announcements
Authors:
Meixuan Qiao,
Jun Wang,
Junfu Xiang,
Qiyu Hou,
Ruixuan Li
Abstract:
Accurately extracting structured data from structure diagrams in financial announcements is of great practical importance for building financial knowledge graphs and further improving the efficiency of various financial applications. First, we proposed a new method for recognizing structure diagrams in financial announcements, which can better detect and extract different types of connecting lines…
▽ More
Accurately extracting structured data from structure diagrams in financial announcements is of great practical importance for building financial knowledge graphs and further improving the efficiency of various financial applications. First, we proposed a new method for recognizing structure diagrams in financial announcements, which can better detect and extract different types of connecting lines, including straight lines, curves, and polylines of different orientations and angles. Second, we developed a two-stage method to efficiently generate the industry's first benchmark of structure diagrams from Chinese financial announcements, where a large number of diagrams were synthesized and annotated using an automated tool to train a preliminary recognition model with fairly good performance, and then a high-quality benchmark can be obtained by automatically annotating the real-world structure diagrams using the preliminary model and then making few manual corrections. Finally, we experimentally verified the significant performance advantage of our structure diagram recognition method over previous methods.
△ Less
Submitted 1 May, 2023; v1 submitted 25 April, 2023;
originally announced April 2023.
-
Feature-Conditioned Cascaded Video Diffusion Models for Precise Echocardiogram Synthesis
Authors:
Hadrien Reynaud,
Mengyun Qiao,
Mischa Dombrowski,
Thomas Day,
Reza Razavi,
Alberto Gomez,
Paul Leeson,
Bernhard Kainz
Abstract:
Image synthesis is expected to provide value for the translation of machine learning methods into clinical practice. Fundamental problems like model robustness, domain transfer, causal modelling, and operator training become approachable through synthetic data. Especially, heavily operator-dependant modalities like Ultrasound imaging require robust frameworks for image and video generation. So far…
▽ More
Image synthesis is expected to provide value for the translation of machine learning methods into clinical practice. Fundamental problems like model robustness, domain transfer, causal modelling, and operator training become approachable through synthetic data. Especially, heavily operator-dependant modalities like Ultrasound imaging require robust frameworks for image and video generation. So far, video generation has only been possible by providing input data that is as rich as the output data, e.g., image sequence plus conditioning in, video out. However, clinical documentation is usually scarce and only single images are reported and stored, thus retrospective patient-specific analysis or the generation of rich training data becomes impossible with current approaches. In this paper, we extend elucidated diffusion models for video modelling to generate plausible video sequences from single images and arbitrary conditioning with clinical parameters. We explore this idea within the context of echocardiograms by looking into the variation of the Left Ventricle Ejection Fraction, the most essential clinical metric gained from these examinations. We use the publicly available EchoNet-Dynamic dataset for all our experiments. Our image to sequence approach achieves an $R^2$ score of 93%, which is 38 points higher than recently proposed sequence to sequence generation methods. Code and models will be available at: https://github.com/HReynaud/EchoDiffusion.
△ Less
Submitted 21 February, 2024; v1 submitted 22 March, 2023;
originally announced March 2023.
-
GNN-Ensemble: Towards Random Decision Graph Neural Networks
Authors:
Wenqi Wei,
Mu Qiao,
Divyesh Jadav
Abstract:
Graph Neural Networks (GNNs) have enjoyed wide spread applications in graph-structured data. However, existing graph based applications commonly lack annotated data. GNNs are required to learn latent patterns from a limited amount of training data to perform inferences on a vast amount of test data. The increased complexity of GNNs, as well as a single point of model parameter initialization, usua…
▽ More
Graph Neural Networks (GNNs) have enjoyed wide spread applications in graph-structured data. However, existing graph based applications commonly lack annotated data. GNNs are required to learn latent patterns from a limited amount of training data to perform inferences on a vast amount of test data. The increased complexity of GNNs, as well as a single point of model parameter initialization, usually lead to overfitting and sub-optimal performance. In addition, it is known that GNNs are vulnerable to adversarial attacks. In this paper, we push one step forward on the ensemble learning of GNNs with improved accuracy, generalization, and adversarial robustness. Following the principles of stochastic modeling, we propose a new method called GNN-Ensemble to construct an ensemble of random decision graph neural networks whose capacity can be arbitrarily expanded for improvement in performance. The essence of the method is to build multiple GNNs in randomly selected substructures in the topological space and subfeatures in the feature space, and then combine them for final decision making. These GNNs in different substructure and subfeature spaces generalize their classification in complementary ways. Consequently, their combined classification performance can be improved and overfitting on the training data can be effectively reduced. In the meantime, we show that GNN-Ensemble can significantly improve the adversarial robustness against attacks on GNNs.
△ Less
Submitted 20 March, 2023;
originally announced March 2023.
-
CHeart: A Conditional Spatio-Temporal Generative Model for Cardiac Anatomy
Authors:
Mengyun Qiao,
Shuo Wang,
Huaqi Qiu,
Antonio de Marvao,
Declan P. O'Regan,
Daniel Rueckert,
Wenjia Bai
Abstract:
Two key questions in cardiac image analysis are to assess the anatomy and motion of the heart from images; and to understand how they are associated with non-imaging clinical factors such as gender, age and diseases. While the first question can often be addressed by image segmentation and motion tracking algorithms, our capability to model and to answer the second question is still limited. In th…
▽ More
Two key questions in cardiac image analysis are to assess the anatomy and motion of the heart from images; and to understand how they are associated with non-imaging clinical factors such as gender, age and diseases. While the first question can often be addressed by image segmentation and motion tracking algorithms, our capability to model and to answer the second question is still limited. In this work, we propose a novel conditional generative model to describe the 4D spatio-temporal anatomy of the heart and its interaction with non-imaging clinical factors. The clinical factors are integrated as the conditions of the generative modelling, which allows us to investigate how these factors influence the cardiac anatomy. We evaluate the model performance in mainly two tasks, anatomical sequence completion and sequence generation. The model achieves a high performance in anatomical sequence completion, comparable to or outperforming other state-of-the-art generative models. In terms of sequence generation, given clinical conditions, the model can generate realistic synthetic 4D sequential anatomies that share similar distributions with the real data.
△ Less
Submitted 30 November, 2023; v1 submitted 30 January, 2023;
originally announced January 2023.
-
Graph Convolutional Neural Networks with Diverse Negative Samples via Decomposed Determinant Point Processes
Authors:
Wei Duan,
Junyu Xuan,
Maoying Qiao,
Jie Lu
Abstract:
Graph convolutional networks (GCNs) have achieved great success in graph representation learning by extracting high-level features from nodes and their topology. Since GCNs generally follow a message-passing mechanism, each node aggregates information from its first-order neighbour to update its representation. As a result, the representations of nodes with edges between them should be positively…
▽ More
Graph convolutional networks (GCNs) have achieved great success in graph representation learning by extracting high-level features from nodes and their topology. Since GCNs generally follow a message-passing mechanism, each node aggregates information from its first-order neighbour to update its representation. As a result, the representations of nodes with edges between them should be positively correlated and thus can be considered positive samples. However, there are more non-neighbour nodes in the whole graph, which provide diverse and useful information for the representation update. Two non-adjacent nodes usually have different representations, which can be seen as negative samples. Besides the node representations, the structural information of the graph is also crucial for learning. In this paper, we used quality-diversity decomposition in determinant point processes (DPP) to obtain diverse negative samples. When defining a distribution on diverse subsets of all non-neighbouring nodes, we incorporate both graph structure information and node representations. Since the DPP sampling process requires matrix eigenvalue decomposition, we propose a new shortest-path-base method to improve computational efficiency. Finally, we incorporate the obtained negative samples into the graph convolution operation. The ideas are evaluated empirically in experiments on node classification tasks. These experiments show that the newly proposed methods not only improve the overall performance of standard representation learning but also significantly alleviate over-smoothing problems.
△ Less
Submitted 6 September, 2023; v1 submitted 5 December, 2022;
originally announced December 2022.
-
Data-Driven Network Neuroscience: On Data Collection and Benchmark
Authors:
Jiaxing Xu,
Yunhan Yang,
David Tse Jung Huang,
Sophi Shilpa Gururajapathy,
Yiping Ke,
Miao Qiao,
Alan Wang,
Haribalan Kumar,
Josh McGeown,
Eryn Kwon
Abstract:
This paper presents a comprehensive and quality collection of functional human brain network data for potential research in the intersection of neuroscience, machine learning, and graph analytics. Anatomical and functional MRI images have been used to understand the functional connectivity of the human brain and are particularly important in identifying underlying neurodegenerative conditions such…
▽ More
This paper presents a comprehensive and quality collection of functional human brain network data for potential research in the intersection of neuroscience, machine learning, and graph analytics. Anatomical and functional MRI images have been used to understand the functional connectivity of the human brain and are particularly important in identifying underlying neurodegenerative conditions such as Alzheimer's, Parkinson's, and Autism. Recently, the study of the brain in the form of brain networks using machine learning and graph analytics has become increasingly popular, especially to predict the early onset of these conditions. A brain network, represented as a graph, retains rich structural and positional information that traditional examination methods are unable to capture. However, the lack of publicly accessible brain network data prevents researchers from data-driven explorations. One of the main difficulties lies in the complicated domain-specific preprocessing steps and the exhaustive computation required to convert the data from MRI images into brain networks. We bridge this gap by collecting a large amount of MRI images from public databases and a private source, working with domain experts to make sensible design choices, and preprocessing the MRI images to produce a collection of brain network datasets. The datasets originate from 6 different sources, cover 4 brain conditions, and consist of a total of 2,702 subjects. We test our graph datasets on 12 machine learning models to provide baselines and validate the data quality on a recent graph analysis model. To lower the barrier to entry and promote the research in this interdisciplinary field, we release our brain network data and complete preprocessing details including codes at https://doi.org/10.17608/k6.auckland.21397377 and https://github.com/brainnetuoa/data_driven_network_neuroscience.
△ Less
Submitted 29 October, 2023; v1 submitted 10 November, 2022;
originally announced November 2022.
-
Centralization Problem for Opinion Convergence in Decentralized Networks
Authors:
Yiping Liu,
Jiamou Liu,
Bakhadyr Khoussaino,
Miao Qiao,
Bo Yan
Abstract:
This paper aims to provide a new perspective on the interplay between decentralization -- a prevalent character of multi-agent systems -- and centralization, i.e., the task of imposing central control to meet system-level goals. In particular, in the context of networked opinion dynamic model, the paper proposes and discusses a framework for centralization. More precisely, a decentralized network…
▽ More
This paper aims to provide a new perspective on the interplay between decentralization -- a prevalent character of multi-agent systems -- and centralization, i.e., the task of imposing central control to meet system-level goals. In particular, in the context of networked opinion dynamic model, the paper proposes and discusses a framework for centralization. More precisely, a decentralized network consists of autonomous agents and their social structure that is unknown and dynamic. Centralization is a process of appointing agents in the network to act as access units who provide information and exert influence over their local surroundings. We discuss centralization for the DeGroot model of opinion dynamics, aiming to enforce opinion convergence using the minimum number of access units. We show that the key to the centralization process lies in selecting access units so that they form a dominating set. We then propose algorithms under a new local algorithmic framework, namely prowling, to accomplish this task. To validate our algorithm, we perform systematic experiments over both real-world and synthetic networks and verify that our algorithm outperforms benchmarks.
△ Less
Submitted 28 October, 2022;
originally announced October 2022.
-
A Fourier Approach to Mixture Learning
Authors:
Mingda Qiao,
Guru Guruganesh,
Ankit Singh Rawat,
Avinava Dubey,
Manzil Zaheer
Abstract:
We revisit the problem of learning mixtures of spherical Gaussians. Given samples from mixture $\frac{1}{k}\sum_{j=1}^{k}\mathcal{N}(μ_j, I_d)$, the goal is to estimate the means $μ_1, μ_2, \ldots, μ_k \in \mathbb{R}^d$ up to a small error. The hardness of this learning problem can be measured by the separation $Δ$ defined as the minimum distance between all pairs of means. Regev and Vijayaraghava…
▽ More
We revisit the problem of learning mixtures of spherical Gaussians. Given samples from mixture $\frac{1}{k}\sum_{j=1}^{k}\mathcal{N}(μ_j, I_d)$, the goal is to estimate the means $μ_1, μ_2, \ldots, μ_k \in \mathbb{R}^d$ up to a small error. The hardness of this learning problem can be measured by the separation $Δ$ defined as the minimum distance between all pairs of means. Regev and Vijayaraghavan (2017) showed that with $Δ= Ω(\sqrt{\log k})$ separation, the means can be learned using $\mathrm{poly}(k, d)$ samples, whereas super-polynomially many samples are required if $Δ= o(\sqrt{\log k})$ and $d = Ω(\log k)$. This leaves open the low-dimensional regime where $d = o(\log k)$.
In this work, we give an algorithm that efficiently learns the means in $d = O(\log k/\log\log k)$ dimensions under separation $d/\sqrt{\log k}$ (modulo doubly logarithmic factors). This separation is strictly smaller than $\sqrt{\log k}$, and is also shown to be necessary. Along with the results of Regev and Vijayaraghavan (2017), our work almost pins down the critical separation threshold at which efficient parameter learning becomes possible for spherical Gaussian mixtures. More generally, our algorithm runs in time $\mathrm{poly}(k)\cdot f(d, Δ, ε)$, and is thus fixed-parameter tractable in parameters $d$, $Δ$ and $ε$.
Our approach is based on estimating the Fourier transform of the mixture at carefully chosen frequencies, and both the algorithm and its analysis are simple and elementary. Our positive results can be easily extended to learning mixtures of non-Gaussian distributions, under a mild condition on the Fourier spectrum of the distribution.
△ Less
Submitted 5 October, 2022; v1 submitted 5 October, 2022;
originally announced October 2022.
-
Learning from the Dark: Boosting Graph Convolutional Neural Networks with Diverse Negative Samples
Authors:
Wei Duan,
Junyu Xuan,
Maoying Qiao,
Jie Lu
Abstract:
Graph Convolutional Neural Networks (GCNs) has been generally accepted to be an effective tool for node representations learning. An interesting way to understand GCNs is to think of them as a message passing mechanism where each node updates its representation by accepting information from its neighbours (also known as positive samples). However, beyond these neighbouring nodes, graphs have a lar…
▽ More
Graph Convolutional Neural Networks (GCNs) has been generally accepted to be an effective tool for node representations learning. An interesting way to understand GCNs is to think of them as a message passing mechanism where each node updates its representation by accepting information from its neighbours (also known as positive samples). However, beyond these neighbouring nodes, graphs have a large, dark, all-but forgotten world in which we find the non-neighbouring nodes (negative samples). In this paper, we show that this great dark world holds a substantial amount of information that might be useful for representation learning. Most specifically, it can provide negative information about the node representations. Our overall idea is to select appropriate negative samples for each node and incorporate the negative information contained in these samples into the representation updates. Moreover, we show that the process of selecting the negative samples is not trivial. Our theme therefore begins by describing the criteria for a good negative sample, followed by a determinantal point process algorithm for efficiently obtaining such samples. A GCN, boosted by diverse negative samples, then jointly considers the positive and negative information when passing messages. Experimental evaluations show that this idea not only improves the overall performance of standard representation learning but also significantly alleviates over-smoothing problems.
△ Less
Submitted 3 October, 2022;
originally announced October 2022.
-
Online Pen Testing
Authors:
Mingda Qiao,
Gregory Valiant
Abstract:
We study a "pen testing" problem, in which we are given $n$ pens with unknown amounts of ink $X_1, X_2, \ldots, X_n$, and we want to choose a pen with the maximum amount of remaining ink in it. The challenge is that we cannot access each $X_i$ directly; we only get to write with the $i$-th pen until either a certain amount of ink is used, or the pen runs out of ink. In both cases, this testing red…
▽ More
We study a "pen testing" problem, in which we are given $n$ pens with unknown amounts of ink $X_1, X_2, \ldots, X_n$, and we want to choose a pen with the maximum amount of remaining ink in it. The challenge is that we cannot access each $X_i$ directly; we only get to write with the $i$-th pen until either a certain amount of ink is used, or the pen runs out of ink. In both cases, this testing reduces the remaining ink in the pen and thus the utility of selecting it.
Despite this significant lack of information, we show that it is possible to approximately maximize our utility up to an $O(\log n)$ factor. Formally, we consider two different setups: the "prophet" setting, in which each $X_i$ is independently drawn from some distribution $\mathcal{D}_i$, and the "secretary" setting, in which $(X_i)_{i=1}^n$ is a random permutation of arbitrary $a_1, a_2, \ldots, a_n$. We derive the optimal competitive ratios in both settings up to constant factors. Our algorithms are surprisingly robust: (1) In the prophet setting, we only require one sample from each $\mathcal{D}_i$, rather than a full description of the distribution; (2) In the secretary setting, the algorithm also succeeds under an arbitrary permutation, if an estimate of the maximum $a_i$ is given.
Our techniques include a non-trivial online sampling scheme from a sequence with an unknown length, as well as the construction of a hard, non-uniform distribution over permutations. Both might be of independent interest. We also highlight some immediate open problems and discuss several directions for future research.
△ Less
Submitted 21 November, 2022; v1 submitted 2 October, 2022;
originally announced October 2022.
-
Generative Modelling of the Ageing Heart with Cross-Sectional Imaging and Clinical Data
Authors:
Mengyun Qiao,
Berke Doga Basaran,
Huaqi Qiu,
Shuo Wang,
Yi Guo,
Yuanyuan Wang,
Paul M. Matthews,
Daniel Rueckert,
Wenjia Bai
Abstract:
Cardiovascular disease, the leading cause of death globally, is an age-related disease. Understanding the morphological and functional changes of the heart during ageing is a key scientific question, the answer to which will help us define important risk factors of cardiovascular disease and monitor disease progression. In this work, we propose a novel conditional generative model to describe the…
▽ More
Cardiovascular disease, the leading cause of death globally, is an age-related disease. Understanding the morphological and functional changes of the heart during ageing is a key scientific question, the answer to which will help us define important risk factors of cardiovascular disease and monitor disease progression. In this work, we propose a novel conditional generative model to describe the changes of 3D anatomy of the heart during ageing. The proposed model is flexible and allows integration of multiple clinical factors (e.g. age, gender) into the generating process. We train the model on a large-scale cross-sectional dataset of cardiac anatomies and evaluate on both cross-sectional and longitudinal datasets. The model demonstrates excellent performance in predicting the longitudinal evolution of the ageing heart and modelling its data distribution. The codes are available at https://github.com/MengyunQ/AgeHeart.
△ Less
Submitted 10 October, 2022; v1 submitted 28 August, 2022;
originally announced August 2022.
-
Subject-Specific Lesion Generation and Pseudo-Healthy Synthesis for Multiple Sclerosis Brain Images
Authors:
Berke Doga Basaran,
Mengyun Qiao,
Paul M. Matthews,
Wenjia Bai
Abstract:
Understanding the intensity characteristics of brain lesions is key for defining image-based biomarkers in neurological studies and for predicting disease burden and outcome. In this work, we present a novel foreground-based generative method for modelling the local lesion characteristics that can both generate synthetic lesions on healthy images and synthesize subject-specific pseudo-healthy imag…
▽ More
Understanding the intensity characteristics of brain lesions is key for defining image-based biomarkers in neurological studies and for predicting disease burden and outcome. In this work, we present a novel foreground-based generative method for modelling the local lesion characteristics that can both generate synthetic lesions on healthy images and synthesize subject-specific pseudo-healthy images from pathological images. Furthermore, the proposed method can be used as a data augmentation module to generate synthetic images for training brain image segmentation networks. Experiments on multiple sclerosis (MS) brain images acquired on magnetic resonance imaging (MRI) demonstrate that the proposed method can generate highly realistic pseudo-healthy and pseudo-pathological brain images. Data augmentation using the synthetic images improves the brain image segmentation performance compared to traditional data augmentation methods as well as a recent lesion-aware data augmentation technique, CarveMix. The code will be released at https://github.com/dogabasaran/lesion-synthesis.
△ Less
Submitted 3 August, 2022;
originally announced August 2022.
-
Open Problem: Properly learning decision trees in polynomial time?
Authors:
Guy Blanc,
Jane Lange,
Mingda Qiao,
Li-Yang Tan
Abstract:
The authors recently gave an $n^{O(\log\log n)}$ time membership query algorithm for properly learning decision trees under the uniform distribution (Blanc et al., 2021). The previous fastest algorithm for this problem ran in $n^{O(\log n)}$ time, a consequence of Ehrenfeucht and Haussler (1989)'s classic algorithm for the distribution-free setting. In this article we highlight the natural open pr…
▽ More
The authors recently gave an $n^{O(\log\log n)}$ time membership query algorithm for properly learning decision trees under the uniform distribution (Blanc et al., 2021). The previous fastest algorithm for this problem ran in $n^{O(\log n)}$ time, a consequence of Ehrenfeucht and Haussler (1989)'s classic algorithm for the distribution-free setting. In this article we highlight the natural open problem of obtaining a polynomial-time algorithm, discuss possible avenues towards obtaining it, and state intermediate milestones that we believe are of independent interest.
△ Less
Submitted 29 June, 2022;
originally announced June 2022.
-
MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining
Authors:
Pengyuan Lyu,
Chengquan Zhang,
Shanshan Liu,
Meina Qiao,
Yangliu Xu,
Liang Wu,
Kun Yao,
Junyu Han,
Errui Ding,
Jingdong Wang
Abstract:
Text images contain both visual and linguistic information. However, existing pre-training techniques for text recognition mainly focus on either visual representation learning or linguistic knowledge learning. In this paper, we propose a novel approach MaskOCR to unify vision and language pre-training in the classical encoder-decoder recognition framework. We adopt the masked image modeling appro…
▽ More
Text images contain both visual and linguistic information. However, existing pre-training techniques for text recognition mainly focus on either visual representation learning or linguistic knowledge learning. In this paper, we propose a novel approach MaskOCR to unify vision and language pre-training in the classical encoder-decoder recognition framework. We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images, which allows us to learn strong visual representations. In contrast to introducing linguistic knowledge with an additional language model, we directly pre-train the sequence decoder. Specifically, we transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder using a proposed masked image-language modeling scheme. Significantly, the encoder is frozen during the pre-training phase of the sequence decoder. Experimental results demonstrate that our proposed method achieves superior performance on benchmark datasets, including Chinese and English text images.
△ Less
Submitted 9 October, 2023; v1 submitted 1 June, 2022;
originally announced June 2022.
-
Progressive Training of A Two-Stage Framework for Video Restoration
Authors:
Meisong Zheng,
Qunliang Xing,
Minglang Qiao,
Mai Xu,
Lai Jiang,
Huaida Liu,
Ying Chen
Abstract:
As a widely studied task, video restoration aims to enhance the quality of the videos with multiple potential degradations, such as noises, blurs and compression artifacts. Among video restorations, compressed video quality enhancement and video super-resolution are two of the main tacks with significant values in practical scenarios. Recently, recurrent neural networks and transformers attract in…
▽ More
As a widely studied task, video restoration aims to enhance the quality of the videos with multiple potential degradations, such as noises, blurs and compression artifacts. Among video restorations, compressed video quality enhancement and video super-resolution are two of the main tacks with significant values in practical scenarios. Recently, recurrent neural networks and transformers attract increasing research interests in this field, due to their impressive capability in sequence-to-sequence modeling. However, the training of these models is not only costly but also relatively hard to converge, with gradient exploding and vanishing problems. To cope with these problems, we proposed a two-stage framework including a multi-frame recurrent network and a single-frame transformer. Besides, multiple training strategies, such as transfer learning and progressive training, are developed to shorten the training time and improve the model performance. Benefiting from the above technical contributions, our solution wins two champions and a runner-up in the NTIRE 2022 super-resolution and quality enhancement of compressed video challenges. Code is available at https://github.com/ryanxingql/winner-ntire22-vqe.
△ Less
Submitted 4 February, 2023; v1 submitted 21 April, 2022;
originally announced April 2022.
-
NTIRE 2022 Challenge on Super-Resolution and Quality Enhancement of Compressed Video: Dataset, Methods and Results
Authors:
Ren Yang,
Radu Timofte,
Meisong Zheng,
Qunliang Xing,
Minglang Qiao,
Mai Xu,
Lai Jiang,
Huaida Liu,
Ying Chen,
Youcheng Ben,
Xiao Zhou,
Chen Fu,
Pei Cheng,
Gang Yu,
Junyi Li,
Renlong Wu,
Zhilu Zhang,
Wei Shang,
Zhengyao Lv,
Yunjin Chen,
Mingcai Zhou,
Dongwei Ren,
Kai Zhang,
Wangmeng Zuo,
Pavel Ostyakov
, et al. (54 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2022 Challenge on Super-Resolution and Quality Enhancement of Compressed Video. In this challenge, we proposed the LDV 2.0 dataset, which includes the LDV dataset (240 videos) and 95 additional videos. This challenge includes three tracks. Track 1 aims at enhancing the videos compressed by HEVC at a fixed QP. Track 2 and Track 3 target both the super-resolution and qua…
▽ More
This paper reviews the NTIRE 2022 Challenge on Super-Resolution and Quality Enhancement of Compressed Video. In this challenge, we proposed the LDV 2.0 dataset, which includes the LDV dataset (240 videos) and 95 additional videos. This challenge includes three tracks. Track 1 aims at enhancing the videos compressed by HEVC at a fixed QP. Track 2 and Track 3 target both the super-resolution and quality enhancement of HEVC compressed video. They require x2 and x4 super-resolution, respectively. The three tracks totally attract more than 600 registrations. In the test phase, 8 teams, 8 teams and 12 teams submitted the final results to Tracks 1, 2 and 3, respectively. The proposed methods and solutions gauge the state-of-the-art of super-resolution and quality enhancement of compressed video. The proposed LDV 2.0 dataset is available at https://github.com/RenYang-home/LDV_dataset. The homepage of this challenge (including open-sourced codes) is at https://github.com/RenYang-home/NTIRE22_VEnh_SR.
△ Less
Submitted 25 April, 2022; v1 submitted 20 April, 2022;
originally announced April 2022.
-
Towards a New Science of Disinformation
Authors:
Claudio S. Pinhanez,
German H. Flores,
Marisa A. Vasconcelos,
Mu Qiao,
Nick Linck,
Rogério de Paula,
Yuya J. Ong
Abstract:
How can we best address the dangerous impact that deep learning-generated fake audios, photographs, and videos (a.k.a. deepfakes) may have in personal and societal life? We foresee that the availability of cheap deepfake technology will create a second wave of disinformation where people will receive specific, personalized disinformation through different channels, making the current approaches to…
▽ More
How can we best address the dangerous impact that deep learning-generated fake audios, photographs, and videos (a.k.a. deepfakes) may have in personal and societal life? We foresee that the availability of cheap deepfake technology will create a second wave of disinformation where people will receive specific, personalized disinformation through different channels, making the current approaches to fight disinformation obsolete. We argue that fake media has to be seen as an upcoming cybersecurity problem, and we have to shift from combating its spread to a prevention and cure framework where users have available ways to verify, challenge, and argue against the veracity of each piece of media they are exposed to. To create the technologies behind this framework, we propose that a new Science of Disinformation is needed, one which creates a theoretical framework both for the processes of communication and consumption of false content. Key scientific and technological challenges facing this research agenda are listed and discussed in the light of state-of-art technologies for fake media generation and detection, argument finding and construction, and how to effectively engage users in the prevention and cure processes.
△ Less
Submitted 17 March, 2022;
originally announced April 2022.
-
Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos
Authors:
Minglang Qiao,
Yufan Liu,
Mai Xu,
Xin Deng,
Bing Li,
Weiming Hu,
Ali Borji
Abstract:
Visual and audio events simultaneously occur and both attract attention. However, most existing saliency prediction works ignore the influence of audio and only consider vision modality. In this paper, we propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video by leveraging visual, audio and face information. Specifically, we first…
▽ More
Visual and audio events simultaneously occur and both attract attention. However, most existing saliency prediction works ignore the influence of audio and only consider vision modality. In this paper, we propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video by leveraging visual, audio and face information. Specifically, we first introduce a large-scale database of multi-face video in visual-audio condition (MVVA), containing eye-tracking data and sound source annotations. Using this database, we find that sound influences human attention, and conversly attention offers a cue to determine sound source on multi-face video. Guided by these findings, a visual-audio multi-task network (VAM-Net) is introduced to predict saliency and locate sound source. VAM-Net consists of three branches corresponding to visual, audio and face modalities. Visual branch has a two-stream architecture to capture spatial and temporal information. Face and audio branches encode audio signals and faces, respectively. Finally, a spatio-temporal multi-modal graph (STMG) is constructed to model the interaction among multiple faces. With joint optimization of these branches, the intrinsic correlation of the tasks of saliency prediction and sound source localization is utilized and their performance is boosted by each other. Experiments show that the proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
△ Less
Submitted 5 November, 2021;
originally announced November 2021.
-
Dual-view Snapshot Compressive Imaging via Optical Flow Aided Recurrent Neural Network
Authors:
Ruiying Lu,
Bo Chen,
Guanliang Liu,
Ziheng Cheng,
Mu Qiao,
Xin Yuan
Abstract:
Dual-view snapshot compressive imaging (SCI) aims to capture videos from two field-of-views (FoVs) using a 2D sensor (detector) in a single snapshot, achieving joint FoV and temporal compressive sensing, and thus enjoying the advantages of low-bandwidth, low-power, and low-cost. However, it is challenging for existing model-based decoding algorithms to reconstruct each individual scene, which usua…
▽ More
Dual-view snapshot compressive imaging (SCI) aims to capture videos from two field-of-views (FoVs) using a 2D sensor (detector) in a single snapshot, achieving joint FoV and temporal compressive sensing, and thus enjoying the advantages of low-bandwidth, low-power, and low-cost. However, it is challenging for existing model-based decoding algorithms to reconstruct each individual scene, which usually require exhaustive parameter tuning with extremely long running time for large scale data. In this paper, we propose an optical flow-aided recurrent neural network for dual video SCI systems, which provides high-quality decoding in seconds. Firstly, we develop a diversity amplification method to enlarge the differences between scenes of two FoVs, and design a deep convolutional neural network with dual branches to separate different scenes from the single measurement. Secondly, we integrate the bidirectional optical flow extracted from adjacent frames with the recurrent neural network to jointly reconstruct each video in a sequential manner. Extensive results on both simulation and real data demonstrate the superior performance of our proposed model in a short inference time. The code and data are available at https://github.com/RuiyingLu/OFaNet-for-Dual-view-SCI.
△ Less
Submitted 11 September, 2021;
originally announced September 2021.
-
Properly learning decision trees in almost polynomial time
Authors:
Guy Blanc,
Jane Lange,
Mingda Qiao,
Li-Yang Tan
Abstract:
We give an $n^{O(\log\log n)}$-time membership query algorithm for properly and agnostically learning decision trees under the uniform distribution over $\{\pm 1\}^n$. Even in the realizable setting, the previous fastest runtime was $n^{O(\log n)}$, a consequence of a classic algorithm of Ehrenfeucht and Haussler. Our algorithm shares similarities with practical heuristics for learning decision tr…
▽ More
We give an $n^{O(\log\log n)}$-time membership query algorithm for properly and agnostically learning decision trees under the uniform distribution over $\{\pm 1\}^n$. Even in the realizable setting, the previous fastest runtime was $n^{O(\log n)}$, a consequence of a classic algorithm of Ehrenfeucht and Haussler. Our algorithm shares similarities with practical heuristics for learning decision trees, which we augment with additional ideas to circumvent known lower bounds against these heuristics. To analyze our algorithm, we prove a new structural result for decision trees that strengthens a theorem of O'Donnell, Saks, Schramm, and Servedio. While the OSSS theorem says that every decision tree has an influential variable, we show how every decision tree can be "pruned" so that every variable in the resulting tree is influential.
△ Less
Submitted 1 November, 2021; v1 submitted 1 September, 2021;
originally announced September 2021.