-
Incorporating Correlated Nugget Effects in Multivariate Spatial Models: An Application to Argo Ocean Data
Authors:
Damilya Saduakhas,
David Bolin,
Xiaotian Jin,
Alexandre B. Simas,
Jonas Wallin
Abstract:
Accurate analysis of global oceanographic data, such as temperature and salinity profiles from the Argo program, requires geostatistical models capable of capturing complex spatial dependencies. This study introduces Gaussian and non-Gaussian hierarchical multivariate Matérn-SPDE models with correlated nugget effects to account for small-scale variability and measurement error correlations. Using simulations and Argo data, we demonstrate that incorporating correlated nugget effects significantly improves the accuracy of parameter estimation and spatial prediction in both Gaussian and non-Gaussian multivariate spatial processes. When applied to global ocean temperature and salinity data, our model yields lower correlation estimates between fields compared to models that assume independent noise. This suggests that traditional models may overestimate the underlying field correlation. By separating these effects, our approach captures fine-scale oceanic patterns more effectively. These findings show the importance of relaxing the assumption of independent measurement errors in multivariate hierarchical models.
Submitted 3 June, 2025;
originally announced June 2025.
-
Generalized Tree-Informed Mixed Model Regression
Authors:
Jeremiah Allis,
Xin Jin,
Riddhi Ghosh
Abstract:
The standard regression tree method applied to observations within clusters poses both methodological and implementation challenges. Effectively leveraging these data requires methods that account for both individual-level and sample-level effects. We propose Generalized Tree-Informed Mixed Model (GTIMM), which replaces the linear fixed effect in a generalized linear mixed model (GLMM) with the output of a regression tree. Traditional parameter estimation and prediction techniques, such as the expectation-maximization algorithm, scale poorly in high-dimensional settings, creating a computational bottleneck. To address this, we employ a quasi-likelihood framework with stochastic gradient descent for optimized parameter estimation. Additionally, we establish a theoretical bound for the mean squared prediction error. The predictive performance of our method is evaluated through simulations and compared with existing approaches. Finally, we apply our model to predict country-level GDP based on trade, foreign direct investment, unemployment, inflation, and geographic region.
Submitted 3 March, 2025;
originally announced March 2025.
-
Revisiting and Benchmarking Graph Autoencoders: A Contrastive Learning Perspective
Authors:
Jintang Li,
Ruofan Wu,
Yuchang Zhu,
Huizhe Zhang,
Xinzhou Jin,
Guibin Zhang,
Zulun Zhu,
Zibin Zheng,
Liang Chen
Abstract:
Graph autoencoders (GAEs) are self-supervised learning models that can learn meaningful representations of graph-structured data by reconstructing the input graph from a low-dimensional latent space. Over the past few years, GAEs have gained significant attention in academia and industry. In particular, the recent advent of GAEs with masked autoencoding schemes marks a significant advancement in graph self-supervised learning research. While numerous GAEs have been proposed, the underlying mechanisms of GAEs are not well understood, and a comprehensive benchmark for GAEs is still lacking. In this work, we bridge the gap between GAEs and contrastive learning by establishing conceptual and methodological connections. We revisit the GAEs studied in previous works and demonstrate how contrastive learning principles can be applied to GAEs. Motivated by these insights, we introduce lrGAE (left-right GAE), a general and powerful GAE framework that leverages contrastive learning principles to learn meaningful representations. Our proposed lrGAE not only facilitates a deeper understanding of GAEs but also sets a new benchmark for GAEs across diverse graph-based learning tasks. The source code for lrGAE, including the baselines and all the code for reproducing the results, is publicly available at https://github.com/EdisonLeeeee/lrGAE.
Submitted 14 October, 2024;
originally announced October 2024.
-
Demystifying Language Model Forgetting with Low-rank Example Associations
Authors:
Xisen Jin,
Xiang Ren
Abstract:
Large Language models (LLMs) suffer from forgetting of upstream knowledge when fine-tuned. Despite efforts on mitigating forgetting, few have investigated how forgotten upstream examples are dependent on newly learned tasks. Insights on such dependencies enable efficient and targeted mitigation of forgetting. In this paper, we empirically analyze forgetting that occurs in $N$ upstream examples of language modeling or instruction-tuning after fine-tuning LLMs on one of $M$ new tasks, visualized in $M\times N$ matrices. We show that the matrices are often well-approximated with low-rank matrices, indicating the dominance of simple associations between the learned tasks and forgotten upstream examples. Leveraging the analysis, we predict forgetting of upstream examples when fine-tuning LLMs on unseen tasks with matrix completion over the empirical associations. This enables fast identification of most forgotten examples without expensive inference on the entire upstream data. Despite simplicity, the approach outperforms prior approaches that learn semantic relationships of learned tasks and upstream examples with LMs. We demonstrate the practical utility of our analysis by showing statistically significantly reduced forgetting as we upweight predicted examples for replay during fine-tuning.
Submitted 18 May, 2025; v1 submitted 20 June, 2024;
originally announced June 2024.
-
Two-Sample Hypothesis Testing for Large Random Graphs of Unequal Size
Authors:
Xin Jin,
Kit Chan,
Ian Barnett,
Riddhi Pratim Ghosh
Abstract:
Two-sample hypothesis testing for large graphs is popular in cognitive science, probabilistic machine learning and artificial intelligence. While numerous methods have been proposed in the literature to address this problem, less attention has been devoted to scenarios involving graphs of unequal size or situations where there are only one or a few samples of graphs. In this article, we propose a Frobenius test statistic tailored for small sample sizes and unequal-sized random graphs to test whether they are generated from the same model or not. Our approach involves an algorithm for generating bootstrapped adjacency matrices from estimated community-wise edge probability matrices, forming the basis of the Frobenius test statistic. We derive the asymptotic distribution of the proposed test statistic and validate its stability and efficiency in detecting minor differences in underlying models through simulations. Furthermore, we explore its application to fMRI data where we are able to distinguish brain activity patterns when subjects are exposed to sentences and pictures for two different stimuli and the control group.
Submitted 16 February, 2024;
originally announced February 2024.
-
What Will My Model Forget? Forecasting Forgotten Examples in Language Model Refinement
Authors:
Xisen Jin,
Xiang Ren
Abstract:
Language models deployed in the wild make errors. However, simply updating the model with the corrected error instances causes catastrophic forgetting -- the updated model makes errors on instances learned during the instruction tuning or upstream training phase. Randomly replaying upstream data yields unsatisfactory performance and often comes with high variance and poor controllability. To this end, we try to forecast upstream examples that will be forgotten due to a model update for improved controllability of the replay process and interpretability. We train forecasting models given a collection of online learned examples and corresponding forgotten upstream pre-training examples. We propose a partially interpretable forecasting model based on the observation that changes in pre-softmax logit scores of pretraining examples resemble that of online learned examples, which performs decently on BART but fails on T5 models. We further show a black-box classifier based on inner products of example representations achieves better forecasting performance over a series of setups. Finally, we show that we reduce forgetting of upstream pretraining examples by replaying examples that are forecasted to be forgotten, demonstrating the practical utility of forecasting example forgetting.
Submitted 9 December, 2024; v1 submitted 2 February, 2024;
originally announced February 2024.
-
The Building Data Genome Directory -- An open, comprehensive data sharing platform for building performance research
Authors:
Xiaoyu Jin,
Chun Fu,
Hussain Kazmi,
Atilla Balint,
Ada Canaydin,
Matias Quintana,
Filip Biljecki,
Fu Xiao,
Clayton Miller
Abstract:
The building sector plays a crucial role in the worldwide decarbonization effort, accounting for significant portions of energy consumption and environmental effects. However, the scarcity of open data sources is a continuous challenge for built environment researchers and practitioners. Although several efforts have been made to consolidate existing open datasets, no database currently offers a comprehensive collection of building data types with all subcategories and time granularities (e.g., year, month, and sub-hour). This paper presents the Building Data Genome Directory, an open data-sharing platform serving as a one-stop shop for the data necessary for vital categories of building energy research. The data directory is an online portal (http://buildingdatadirectory.org/) that allows filtering and discovering valuable datasets. The directory covers meter, building-level, and aggregated community-level data at the spatial scale and year-to-minute level at the temporal scale. The datasets were consolidated from a comprehensive exploration of sources, including governments, research institutes, and online energy dashboards. The results of this effort include the aggregation of 60 datasets pertaining to building energy ontologies, building energy models, building energy and water data, electric vehicle data, weather data, building information data, text-mining-based research data, image data of buildings, fault detection diagnosis data and occupant data. A crowdsourcing mechanism in the platform allows users to submit datasets they suggest for inclusion by filling out an online form. This directory can fuel research and applications on building energy efficiency, which is an essential step toward addressing the world's energy and environmental challenges.
Submitted 3 July, 2023;
originally announced July 2023.
-
Deep Learning Hamiltonian Monte Carlo
Authors:
Sam Foreman,
Xiao-Yong Jin,
James C. Osborn
Abstract:
We generalize the Hamiltonian Monte Carlo algorithm with a stack of neural network layers and evaluate its ability to sample from different topologies in a two-dimensional lattice gauge theory. We demonstrate that our model is able to successfully mix between modes of different topologies, significantly reducing the computational cost required to generate independent gauge field configurations. Our implementation is available at https://github.com/saforem2/l2hmc-qcd.
Submitted 7 May, 2021;
originally announced May 2021.
-
Domain Adaptation for Time Series Forecasting via Attention Sharing
Authors:
Xiaoyong Jin,
Youngsuk Park,
Danielle C. Maddix,
Hao Wang,
Yuyang Wang
Abstract:
Recently, deep neural networks have gained increasing popularity in the field of time series forecasting. A primary reason for their success is their ability to effectively capture complex temporal dynamics across multiple related time series. The advantages of these deep forecasters only start to emerge in the presence of a sufficient amount of data. This poses a challenge for typical forecasting problems in practice, where there is a limited number of time series or observations per time series, or both. To cope with this data scarcity issue, we propose a novel domain adaptation framework, Domain Adaptation Forecaster (DAF). DAF leverages statistical strengths from a relevant domain with abundant data samples (source) to improve the performance on the domain of interest with limited data (target). In particular, we use an attention-based shared module with a domain discriminator across domains and private modules for individual domains. We induce domain-invariant latent features (queries and keys) and retrain domain-specific features (values) simultaneously to enable joint training of forecasters on source and target domains. A main insight is that our design of aligning keys allows the target domain to leverage source time series even with different characteristics. Extensive experiments on various domains demonstrate that our proposed method outperforms state-of-the-art baselines on synthetic and real-world datasets, and ablation studies verify the effectiveness of our design choices.
Submitted 21 June, 2022; v1 submitted 12 February, 2021;
originally announced February 2021.
-
Inter-Series Attention Model for COVID-19 Forecasting
Authors:
Xiaoyong Jin,
Yu-Xiang Wang,
Xifeng Yan
Abstract:
The COVID-19 pandemic has had an unprecedented impact all over the world since early 2020. During this public health crisis, reliable forecasting of the disease becomes critical for resource allocation and administrative planning. The results from compartmental models such as SIR and SEIR are widely cited by the CDC and news media. With more and more COVID-19 data becoming available, we examine the following question: Can a direct data-driven approach without modeling the disease spreading dynamics outperform the widely cited compartmental models and their variants? In this paper, we show the possibility. It is observed that as COVID-19 spreads at different speeds and scales in different geographic regions, it is highly likely that similar progression patterns are shared among these regions within different time periods. This intuition leads us to develop a new neural forecasting model, called Attention Crossing Time Series (ACTS), that makes forecasts by comparing patterns across time series obtained from multiple regions. The attention mechanism originally developed for natural language processing can be leveraged and generalized to materialize this idea. In 13 out of 18 tests, including forecasting newly confirmed cases, hospitalizations and deaths, ACTS outperforms all the leading COVID-19 forecasters highlighted by the CDC.
Submitted 5 April, 2021; v1 submitted 24 October, 2020;
originally announced October 2020.
-
On Transferability of Bias Mitigation Effects in Language Model Fine-Tuning
Authors:
Xisen Jin,
Francesco Barbieri,
Brendan Kennedy,
Aida Mostafazadeh Davani,
Leonardo Neves,
Xiang Ren
Abstract:
Fine-tuned language models have been shown to exhibit biases against protected groups in a host of modeling tasks such as text classification and coreference resolution. Previous works focus on detecting these biases, reducing bias in data representations, and using auxiliary training objectives to mitigate bias during fine-tuning. Although these techniques achieve bias reduction for the task and domain at hand, the effects of bias mitigation may not directly transfer to new tasks, requiring additional data collection and customized annotation of sensitive attributes, and re-evaluation of appropriate fairness metrics. We explore the feasibility and benefits of upstream bias mitigation (UBM) for reducing bias on downstream tasks, by first applying bias mitigation to an upstream model through fine-tuning and subsequently using it for downstream fine-tuning. We find, in extensive experiments across hate speech detection, toxicity detection, occupation prediction, and coreference resolution tasks over various bias factors, that the effects of UBM are indeed transferable to new downstream tasks or domains via fine-tuning, creating less biased downstream models than directly fine-tuning on the downstream task or transferring from a vanilla upstream model. Though challenges remain, we show that UBM promises more efficient and accessible bias mitigation in LM fine-tuning.
Submitted 11 April, 2021; v1 submitted 24 October, 2020;
originally announced October 2020.
-
On Efficient Constructions of Checkpoints
Authors:
Yu Chen,
Zhenming Liu,
Bin Ren,
Xin Jin
Abstract:
Efficient construction of checkpoints/snapshots is a critical tool for training and diagnosing deep learning models. In this paper, we propose a lossy compression scheme for checkpoint constructions (called LC-Checkpoint). LC-Checkpoint simultaneously maximizes the compression rate and optimizes the recovery speed, under the assumption that SGD is used to train the model. LC-Checkpoint uses quantization and priority promotion to store the most crucial information for SGD to recover, and then uses Huffman coding to leverage the non-uniform distribution of the gradient scales. Our extensive experiments show that LC-Checkpoint achieves a compression rate of up to $28\times$ and a recovery speedup of up to $5.77\times$ over a state-of-the-art algorithm (SCAR).
Submitted 27 September, 2020;
originally announced September 2020.
-
VAFL: a Method of Vertical Asynchronous Federated Learning
Authors:
Tianyi Chen,
Xiao Jin,
Yuejiao Sun,
Wotao Yin
Abstract:
Horizontal federated learning (FL) handles multi-client data that share the same set of features, and vertical FL trains a better predictor that combines all the features from different clients. This paper targets solving vertical FL in an asynchronous fashion, and develops a simple FL method. The new method allows each client to run stochastic gradient algorithms without coordination with other clients, so it is suitable for intermittent connectivity of clients. This method further uses a new technique of perturbed local embedding to ensure data privacy and improve communication efficiency. Theoretically, we present the convergence rate and privacy level of our method for strongly convex, nonconvex and even nonsmooth objectives separately. Empirically, we apply our method to FL on various image and healthcare datasets. The results compare favorably to centralized and synchronous FL methods.
Submitted 12 July, 2020;
originally announced July 2020.
-
Gradient-based Editing of Memory Examples for Online Task-free Continual Learning
Authors:
Xisen Jin,
Arka Sadhu,
Junyi Du,
Xiang Ren
Abstract:
We explore task-free continual learning (CL), in which a model is trained to avoid catastrophic forgetting in the absence of explicit task boundaries or identities. Among many efforts on task-free CL, a notable family of approaches is memory-based: these methods store and replay a subset of training examples. However, the utility of stored seen examples may diminish over time since CL models are continually updated. Here, we propose Gradient based Memory EDiting (GMED), a framework for editing stored examples in continuous input space via gradient updates, in order to create more "challenging" examples for replay. GMED-edited examples remain similar to their unedited forms, but can yield increased loss in the upcoming model updates, thereby making the future replays more effective in overcoming catastrophic forgetting. By construction, GMED can be seamlessly applied in conjunction with other memory-based CL algorithms to bring further improvement. Experiments validate the effectiveness of GMED, and our best method significantly outperforms baselines and previous state-of-the-art on five out of six datasets. Code can be found at https://github.com/INK-USC/GMED.
Submitted 7 December, 2021; v1 submitted 27 June, 2020;
originally announced June 2020.
-
Towards Hierarchical Importance Attribution: Explaining Compositional Semantics for Neural Sequence Models
Authors:
Xisen Jin,
Zhongyu Wei,
Junyi Du,
Xiangyang Xue,
Xiang Ren
Abstract:
The impressive performance of neural networks on natural language processing tasks is attributed to their ability to model complicated word and phrase compositions. To explain how the model handles semantic compositions, we study hierarchical explanation of neural network predictions. We identify non-additivity and context-independent importance attributions within hierarchies as two desirable properties for highlighting word and phrase compositions. We show that some prior efforts on hierarchical explanations, e.g. contextual decomposition, do not satisfy the desired properties mathematically, leading to inconsistent explanation quality in different models. In this paper, we start by proposing a formal and general way to quantify the importance of each word and phrase. Following the formulation, we propose the Sampling and Contextual Decomposition (SCD) algorithm and the Sampling and Occlusion (SOC) algorithm. Human and metrics evaluation on both LSTM models and BERT Transformer models on multiple datasets show that our algorithms outperform prior hierarchical explanation algorithms. Our algorithms help to visualize semantic composition captured by models, extract classification rules and improve human trust of models. Project page: https://inklab.usc.edu/hiexpl/
Submitted 15 June, 2020; v1 submitted 7 November, 2019;
originally announced November 2019.
-
Progressive Feature Polishing Network for Salient Object Detection
Authors:
Bo Wang,
Quan Chen,
Min Zhou,
Zhiqiang Zhang,
Xiaogang Jin,
Kun Gai
Abstract:
Features matter for salient object detection. Existing methods mainly focus on designing a sophisticated structure to incorporate multi-level features and filter out cluttered features. We present Progressive Feature Polishing Network (PFPN), a simple yet effective framework to progressively polish the multi-level features to be more accurate and representative. By employing multiple Feature Polishing Modules (FPMs) in a recurrent manner, our approach is able to detect salient objects with fine details without any post-processing. An FPM updates the features of each level in parallel by directly incorporating all higher-level context information. Moreover, it keeps the dimensions and hierarchical structures of the feature maps, which makes it flexible to integrate with any CNN-based model. Empirical experiments show that our results improve monotonically with an increasing number of FPMs. Without bells and whistles, PFPN outperforms the state-of-the-art methods significantly on five benchmark datasets under various evaluation metrics.
Submitted 14 November, 2019;
originally announced November 2019.
-
You May Not Need Order in Time Series Forecasting
Authors:
Yunkai Zhang,
Qiao Jiang,
Shurui Li,
Xiaoyong Jin,
Xueying Ma,
Xifeng Yan
Abstract:
Time series forecasting with limited data is a challenging yet critical task. While transformers have achieved outstanding performances in time series forecasting, they often require many training samples due to the large number of trainable parameters. In this paper, we propose a training technique for transformers that prepares the training windows through random sampling. As input time steps need not be consecutive, the number of distinct samples increases from linearly to combinatorially many. By breaking the temporal order, this technique also helps transformers to capture dependencies among time steps in finer granularity. We achieve competitive results compared to the state-of-the-art on real-world datasets.
Submitted 21 October, 2019;
originally announced October 2019.
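The claim in the abstract above, that the number of distinct training samples grows from linear to combinatorial once time steps need not be consecutive, can be illustrated with a small counting sketch. This is not the authors' code: the function names are hypothetical, and keeping the sampled indices in increasing order is an assumption made only for this illustration.

```python
import math
import random


def count_consecutive_windows(series_len: int, window_len: int) -> int:
    """Number of sliding windows made of consecutive time steps (linear in series_len)."""
    return max(series_len - window_len + 1, 0)


def count_sampled_windows(series_len: int, window_len: int) -> int:
    """Number of index subsets of size window_len (combinatorial in series_len)."""
    return math.comb(series_len, window_len)


def sample_window(series, window_len, rng):
    """Draw one training window from randomly chosen, not necessarily consecutive, steps.

    Assumption for this sketch: the chosen indices are sorted, so the window
    keeps increasing time order while skipping steps.
    """
    indices = sorted(rng.sample(range(len(series)), window_len))
    return indices, [series[i] for i in indices]


if __name__ == "__main__":
    series = list(range(100))
    print(count_consecutive_windows(len(series), 10))
    print(count_sampled_windows(len(series), 10))
    print(sample_window(series, 10, random.Random(0)))
```

For a series of length 100 and windows of length 10, the consecutive scheme yields 91 windows, while the subset count is `math.comb(100, 10)`, which is many orders of magnitude larger.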
-
Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting
Authors:
Shiyang Li,
Xiaoyong Jin,
Yao Xuan,
Xiyou Zhou,
Wenhu Chen,
Yu-Xiang Wang,
Xifeng Yan
Abstract:
Time series forecasting is an important problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situation. In this paper, we propose to tackle such forecasting problem with Transformer [1]. Although impressed by its performance in our preliminary study, we found its two major weaknesses: (1) locality-agnostics: the point-wise dot-product self-attention in canonical Transformer architecture is insensitive to local context, which can make the model prone to anomalies in time series; (2) memory bottleneck: space complexity of canonical Transformer grows quadratically with sequence length $L$, making directly modeling long time series infeasible. In order to solve these two issues, we first propose convolutional self-attention by producing queries and keys with causal convolution so that local context can be better incorporated into attention mechanism. Then, we propose LogSparse Transformer with only $O(L(\log L)^{2})$ memory cost, improving forecasting accuracy for time series with fine granularity and strong long-term dependencies under constrained memory budget. Our experiments on both synthetic data and real-world datasets show that it compares favorably to the state-of-the-art.
Submitted 3 January, 2020; v1 submitted 29 June, 2019;
originally announced July 2019.
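The causal convolution mentioned in the abstract above can be sketched for a scalar series as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `causal_conv1d`, the plain-list representation, and the zero left-padding are all choices made for this example.

```python
def causal_conv1d(x, kernel):
    """Convolve x with kernel so that output[t] depends only on x[0..t].

    The series is left-padded with zeros (an assumption of this sketch),
    so kernel[-1] weights the current step and kernel[0] the oldest step.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(x))
    ]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    print(causal_conv1d(x, [0.5, 0.5]))  # [0.5, 1.5, 2.5, 3.5]
```

Because each output mixes a step with its immediate past only, a query or key built this way carries local shape information without looking at future values.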
-
Selective Transfer with Reinforced Transfer Network for Partial Domain Adaptation
Authors:
Zhihong Chen,
Chao Chen,
Zhaowei Cheng,
Boyuan Jiang,
Ke Fang,
Xinyu Jin
Abstract:
One crucial aspect of partial domain adaptation (PDA) is how to select the relevant source samples in the shared classes for knowledge transfer. Previous PDA methods tackle this problem by re-weighting the source samples based on their high-level information (deep features). However, because of the domain shift between the source and target domains, using only the deep features for sample selection is defective. We argue that it is more reasonable to additionally exploit pixel-level information for the PDA problem, as the appearance difference between outlier source classes and target classes is significantly large. In this paper, we propose a reinforced transfer network (RTNet), which utilizes both high-level and pixel-level information for the PDA problem. Our RTNet is composed of a reinforced data selector (RDS) based on reinforcement learning (RL), which filters out the outlier source samples, and a domain adaptation model, which minimizes the domain discrepancy in the shared label space. Specifically, in the RDS, we design a novel reward based on the reconstruction errors of selected source samples on the target generator, which introduces pixel-level information to guide the learning of the RDS. In addition, we develop a state containing high-level information, which is used by the RDS for sample selection. The proposed RDS is a general module that can be easily integrated into existing DA models to make them fit the PDA situation. Extensive experiments indicate that RTNet can achieve state-of-the-art performance for PDA tasks on several benchmark datasets.
Submitted 27 February, 2020; v1 submitted 26 May, 2019;
originally announced May 2019.
-
Recurrent Event Network: Autoregressive Structure Inference over Temporal Knowledge Graphs
Authors:
Woojeong Jin,
Meng Qu,
Xisen Jin,
Xiang Ren
Abstract:
Knowledge graph reasoning is a critical task in natural language processing. The task becomes more challenging on temporal knowledge graphs, where each fact is associated with a timestamp. Most existing methods focus on reasoning at past timestamps and are not able to predict facts happening in the future. This paper proposes Recurrent Event Network (RE-NET), a novel autoregressive architecture for predicting future interactions. The occurrence of a fact (event) is modeled as a probability distribution conditioned on temporal sequences of past knowledge graphs. Specifically, our RE-NET employs a recurrent event encoder to encode past facts and uses a neighborhood aggregator to model the connection of facts at the same timestamp. Future facts can then be inferred in a sequential manner based on the two modules. We evaluate our proposed method via link prediction at future times on five public datasets. Through extensive experiments, we demonstrate the strength of RE-NET, especially on multi-step inference over future timestamps, and achieve state-of-the-art performance on all five datasets. Code and data can be found at https://github.com/INK-USC/RE-Net.
Submitted 6 October, 2020; v1 submitted 11 April, 2019;
originally announced April 2019.
-
On the Sensitivity of Adversarial Robustness to Input Data Distributions
Authors:
Gavin Weiguang Ding,
Kry Yik Chau Lui,
Xiaomeng Jin,
Luyu Wang,
Ruitong Huang
Abstract:
Neural networks are vulnerable to small adversarial perturbations. The existing literature has largely focused on understanding and mitigating the vulnerability of learned models. In this paper, we demonstrate an intriguing phenomenon about the most popular robust training method in the literature, adversarial training: adversarial robustness, unlike clean accuracy, is sensitive to the input data distribution. Even semantics-preserving transformations of the input data distribution can cause significantly different robustness for an adversarially trained model that is both trained and evaluated on the new distribution. Our discovery of this sensitivity to the data distribution is based on a study that disentangles the behaviors of clean accuracy and robust accuracy of the Bayes classifier. Empirical investigations further confirm our finding. We construct semantically identical variants of MNIST and CIFAR10, and show that standardly trained models achieve comparable clean accuracies on them, but adversarially trained models achieve significantly different robust accuracies. This counter-intuitive phenomenon indicates that the input data distribution alone can affect the adversarial robustness of trained neural networks, not necessarily the tasks themselves. Lastly, we discuss the practical implications for evaluating adversarial robustness, and make initial attempts to understand this complex phenomenon.
Submitted 21 February, 2019;
originally announced February 2019.
-
advertorch v0.1: An Adversarial Robustness Toolbox based on PyTorch
Authors:
Gavin Weiguang Ding,
Luyu Wang,
Xiaomeng Jin
Abstract:
advertorch is a toolbox for adversarial robustness research. It contains various implementations for attacks, defenses and robust training methods. advertorch is built on PyTorch (Paszke et al., 2017), and leverages the advantages of the dynamic computational graph to provide concise and efficient reference implementations. The code is licensed under the LGPL license and is open sourced at https://github.com/BorealisAI/advertorch .
Submitted 20 February, 2019;
originally announced February 2019.
-
FPDeep: Scalable Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters
Authors:
Tong Geng,
Tianqi Wang,
Ang Li,
Xi Jin,
Martin Herbordt
Abstract:
Deep Neural Networks (DNNs) have revolutionized numerous applications, but the demand for ever more performance remains unabated. Scaling DNN computations to larger clusters is generally done by distributing tasks in batch mode using methods such as distributed synchronous SGD. Among the issues with this approach is that to make the distributed cluster work with high utilization, the workload distributed to each node must be large, which implies nontrivial growth in the SGD mini-batch size.
In this paper, we propose a framework called FPDeep, which uses a hybrid of model and layer parallelism to configure distributed reconfigurable clusters to train DNNs. This approach has numerous benefits. First, the design does not suffer from batch size growth. Second, novel workload and weight partitioning leads to balanced loads of both among nodes. Third, the entire system is a fine-grained pipeline. This leads to high parallelism and utilization and also minimizes the time that features need to be cached while waiting for back-propagation. As a result, storage demand is reduced to the point where only on-chip memory is used for the convolution layers. We evaluate FPDeep with the AlexNet, VGG-16, and VGG-19 benchmarks. Experimental results show that FPDeep has good scalability to a large number of FPGAs, with the limiting factor being the FPGA-to-FPGA bandwidth. With 6 transceivers per FPGA, FPDeep shows linearity up to 83 FPGAs. Energy efficiency is evaluated with respect to GOPs/J. FPDeep provides, on average, 6.36x higher energy efficiency than comparable GPU servers.
Submitted 21 June, 2020; v1 submitted 4 January, 2019;
originally announced January 2019.
-
Parameter Transfer Extreme Learning Machine based on Projective Model
Authors:
Chao Chen,
Boyuan Jiang,
Xinyu Jin
Abstract:
In recent years, transfer learning has attracted much attention in the machine learning community. In this paper, we mainly focus on the task of parameter transfer under the framework of the extreme learning machine (ELM). Unlike existing parameter transfer approaches, which incorporate the source model information into the target by regularizing the difference between the source and target domain parameters, an intuitively appealing projective model is proposed to bridge the source and target model parameters. Specifically, we formulate the parameter transfer in ELM networks by means of parameter projection, and train the model by optimizing the projection matrix and classifier parameters jointly. Furthermore, the $\ell_{2,1}$-norm structured sparsity penalty is imposed on the source domain parameters, which encourages joint feature selection and parameter transfer. To evaluate the effectiveness of the proposed method, comprehensive experiments on several commonly used domain adaptation datasets are presented. The results show that the proposed method significantly outperforms the non-transfer ELM networks and other classical transfer learning methods.
Submitted 14 September, 2018; v1 submitted 4 September, 2018;
originally announced September 2018.
-
Joint Domain Alignment and Discriminative Feature Learning for Unsupervised Deep Domain Adaptation
Authors:
Chao Chen,
Zhihong Chen,
Boyuan Jiang,
Xinyu Jin
Abstract:
Recently, considerable effort has been devoted to deep domain adaptation in the computer vision and machine learning communities. However, most existing work concentrates only on learning shared feature representations by minimizing the distribution discrepancy across different domains. Because domain alignment approaches can only reduce, but not remove, the domain shift, target domain samples distributed near the edge of the clusters, or far from their corresponding class centers, are easily misclassified by the hyperplane learned from the source domain. To alleviate this issue, we propose joint domain alignment and discriminative feature learning, which could benefit both domain alignment and final classification. Specifically, an instance-based discriminative feature learning method and a center-based discriminative feature learning method are proposed, both of which guarantee domain-invariant features with better intra-class compactness and inter-class separability. Extensive experiments show that learning the discriminative features in the shared feature space can significantly boost the performance of deep domain adaptation methods.
Submitted 3 November, 2018; v1 submitted 28 August, 2018;
originally announced August 2018.
-
Deep Learning Detection Networks in MIMO Decode-Forward Relay Channels
Authors:
Xianglan Jin,
Hyoung-Nam Kim
Abstract:
In this paper, we consider signal detection algorithms in a multiple-input multiple-output (MIMO) decode-forward (DF) relay channel with one source, one relay, and one destination. The existing suboptimal near maximum likelihood (NML) detector and the NML with two-level pair-wise error probability (NMLw2PEP) detector achieve excellent performance with instantaneous channel state information (CSI) of the source-relay (SR) link and with statistical CSI of the SR link, respectively. However, the NML detectors require an exponentially increasing complexity as the number of transmit antennas increases. Using deep learning algorithms, NML-based detection networks (NMLDNs) are proposed with and without the CSI of the SR link at the destination. The NMLDNs detect signals in changing channels after a single training using a large number of randomly distributed channels. The detection networks require much lower detection complexity than the exhaustive search NML detectors while exhibiting good performance. To evaluate the performance, we introduce semidefinite relaxation detectors with polynomial complexity based on the NML detectors. Additionally, new linear detectors based on the zero gradient of the NML metrics are proposed. Applying various detection algorithms at the relay (DetR) and detection algorithms at the destination (DetD), we present some DetR-DetD methods in MIMO DF relay channels. An appropriate DetR-DetD method can be employed according to the required error probability and detection complexity. The complexity analysis and simulation results validate the arguments of this paper.
Submitted 11 July, 2018;
originally announced July 2018.
-
General solutions for nonlinear differential equations: a rule-based self-learning approach using deep reinforcement learning
Authors:
Shiyin Wei,
Xiaowei Jin,
Hui Li
Abstract:
A universal rule-based self-learning approach using deep reinforcement learning (DRL) is proposed for the first time to solve nonlinear ordinary differential equations and partial differential equations. The solver consists of a deep neural network-structured actor that outputs candidate solutions, and a critic derived only from physical rules (governing equations and boundary and initial conditions). Solutions in discretized time are treated as multiple tasks sharing the same governing equation, and the current step parameters provide an ideal initialization for the next owing to the temporal continuity of the solutions, which shows a transfer learning characteristic and indicates that the DRL solver has captured the intrinsic nature of the equation. The approach is verified through solving the Schrödinger, Navier-Stokes, Burgers', Van der Pol, and Lorenz equations and an equation of motion. The results indicate that the approach gives solutions with high accuracy, and the solution process promises to get faster.
Submitted 29 May, 2019; v1 submitted 13 May, 2018;
originally announced May 2018.
-
Stochastic Conjugate Gradient Algorithm with Variance Reduction
Authors:
Xiao-Bo Jin,
Xu-Yao Zhang,
Kaizhu Huang,
Guang-Gang Geng
Abstract:
Conjugate gradient (CG) methods are an important class of methods for solving linear equations and nonlinear optimization problems. In this paper, we propose a new stochastic CG algorithm with variance reduction and we prove its linear convergence with the Fletcher and Reeves method for strongly convex and smooth functions. We experimentally demonstrate that the CG with variance reduction algorithm converges faster than its counterparts for four learning models, which may be convex, nonconvex or nonsmooth. In addition, its area under the curve performance on six large-scale data sets is comparable to that of the LIBLINEAR solver for the L2-regularized L2-loss but with a significant improvement in computational efficiency.
Submitted 16 October, 2018; v1 submitted 26 October, 2017;
originally announced October 2017.
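For orientation, the two ingredients named in the abstract above can be written in their standard textbook forms. This is a sketch under assumptions: the variance-reduced gradient is shown in the common SVRG style, and the symbols $w_k$, $\tilde{w}$, $g_k$, $d_k$, $\eta_k$, and $f_i$ are notation introduced here. The abstract does not state the exact estimator or step-size rule used in the paper.

```latex
\begin{align}
  F(w) &= \frac{1}{n} \sum_{i=1}^{n} f_i(w)
    && \text{(finite-sum objective)} \\
  g_k &= \nabla f_{i_k}(w_k) - \nabla f_{i_k}(\tilde{w}) + \nabla F(\tilde{w})
    && \text{(variance-reduced gradient at snapshot } \tilde{w}\text{)} \\
  \beta_k^{\mathrm{FR}} &= \frac{\lVert g_k \rVert^2}{\lVert g_{k-1} \rVert^2}
    && \text{(Fletcher--Reeves coefficient)} \\
  d_k &= -g_k + \beta_k^{\mathrm{FR}} \, d_{k-1}, \qquad
  w_{k+1} = w_k + \eta_k \, d_k
    && \text{(conjugate direction and update)}
\end{align}
```

The estimator $g_k$ is unbiased for $\nabla F(w_k)$ when $i_k$ is drawn uniformly, and its variance shrinks as $w_k$ and $\tilde{w}$ approach the optimum, which is what makes a linear convergence rate plausible for strongly convex, smooth objectives.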
-
Linear NDCG and Pair-wise Loss
Authors:
Xiao-Bo Jin,
Guang-Gang Geng
Abstract:
Linear NDCG is used for measuring the performance of the Web content quality assessment in ECML/PKDD Discovery Challenge 2010. In this paper, we will prove that the DCG error equals a new pair-wise loss.
Submitted 10 March, 2013;
originally announced March 2013.