-
ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists
Authors:
Jie Ruan,
Inderjeet Nair,
Shuyang Cao,
Amy Liu,
Sheza Munir,
Micah Pollens-Dempsey,
Tiffany Chiang,
Lucy Kates,
Nicholas David,
Sihan Chen,
Ruxin Yang,
Yuqian Yang,
Jasmine Gump,
Tessa Bialek,
Vivek Sankaran,
Margo Schlanger,
Lu Wang
Abstract:
This paper introduces ExpertLongBench, an expert-level benchmark containing 11 tasks from 9 domains that reflect realistic expert workflows and applications. Beyond question answering, the application-driven tasks in ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and strict adherence to domain-specific requirements. Notably, each task in ExpertLongBench includes a rubric, designed or validated by domain experts, to specify task requirements and guide output evaluation. Furthermore, we propose CLEAR, an evaluation framework that supports accurate evaluation of long-form model outputs in our benchmark. To achieve fine-grained, expert-aligned evaluation, CLEAR derives checklists from both model outputs and references by extracting information corresponding to items in the task-specific rubric. Checklist items for model outputs are then compared with corresponding items for reference outputs to assess their correctness, enabling grounded evaluation. We benchmark 11 large language models (LLMs) and analyze components in CLEAR, showing that (1) existing LLMs, with the top performer achieving only a 26.8% F1 score, require significant improvement for expert-level tasks; (2) models can generate content corresponding to the required aspects, though often not accurately; and (3) accurate checklist extraction and comparison in CLEAR can be achieved by open-weight models for more scalable and low-cost usage.
Submitted 1 June, 2025;
originally announced June 2025.
-
Semi-Self Representation Learning for Crowdsourced WiFi Trajectories
Authors:
Yu-Lin Kuo,
Yu-Chee Tseng,
Ting-Hui Chiang,
Yan-Ann Chen
Abstract:
WiFi fingerprint-based localization has been studied intensively. Point-based solutions rely on position annotations of WiFi fingerprints. Trajectory-based solutions, however, require end-position annotations of WiFi trajectories, where a WiFi trajectory is a multivariate time series of signal features. A trajectory dataset is much larger than a pointwise dataset as the number of potential trajectories in a field may grow exponentially with respect to the size of the field. This work presents a semi-self representation learning solution, where a large dataset $C$ of crowdsourced unlabeled WiFi trajectories can be automatically labeled by a much smaller dataset $\tilde C$ of labeled WiFi trajectories. The size of $\tilde C$ only needs to be proportional to the size of the physical field, while the unlabeled $C$ could be much larger. This is made possible through a novel ``cut-and-flip'' augmentation scheme based on the meet-in-the-middle paradigm. A two-stage learning consisting of trajectory embedding followed by endpoint embedding is proposed for the unlabeled $C$. Then the learned representations are labeled by $\tilde C$ and connected to a neural-based localization network. The result, while delivering promising accuracy, significantly relieves the burden of human annotations for trajectory-based localization.
Submitted 2 April, 2025;
originally announced April 2025.
-
Token embeddings violate the manifold hypothesis
Authors:
Michael Robinson,
Sourya Dey,
Tony Chiang
Abstract:
A full understanding of the behavior of a large language model (LLM) requires our understanding of its input token space. If this space differs from our assumptions, our understanding of and conclusions about the LLM will likely be flawed. We elucidate the structure of the token embeddings both empirically and theoretically. We present a novel statistical test assuming that the neighborhood around each token has a relatively flat and smooth structure as the null hypothesis. Failing to reject the null is uninformative, but rejecting it at a specific token $ψ$ implies an irregularity in the token subspace in a $ψ$-neighborhood, $B(ψ)$. The structure assumed in the null is a generalization of a manifold with boundary called a \emph{smooth fiber bundle} (which can be split into two spatial regimes -- small and large radius), so we denote our new hypothesis test as the ``fiber bundle hypothesis.'' Failure to reject the null hypothesis is uninformative, but rejecting it at $ψ$ indicates a statistically significant irregularity at $B(ψ)$. By running our test over several open-source LLMs, each with unique token embeddings, we find that the null is frequently rejected, and so the evidence suggests that the token subspace is not a fiber bundle and hence also not a manifold. As a consequence of our findings, when an LLM is presented with two semantically equivalent prompts, if one prompt contains a token implicated by our test, the response to that prompt will likely exhibit less stability than the other.
Submitted 28 May, 2025; v1 submitted 1 April, 2025;
originally announced April 2025.
-
Assessing Generative Models for Structured Data
Authors:
Reilly Cannon,
Nicolette M. Laird,
Caesar Vazquez,
Andy Lin,
Amy Wagler,
Tony Chiang
Abstract:
Synthetic tabular data generation has emerged as a promising method to address limited data availability and privacy concerns. With the sharp increase in the performance of large language models in recent years, researchers have been interested in applying these models to the generation of tabular data. However, little is known about the quality of the generated tabular data from large language models. The predominant method for assessing the quality of synthetic tabular data is the train-synthetic-test-real approach, where the artificial examples are compared to the original by how well machine learning models, trained separately on the real and synthetic sets, perform in some downstream tasks. This method does not directly measure how closely the distribution of generated data approximates that of the original. This paper introduces rigorous methods for directly assessing synthetic tabular data against real data by looking at inter-column dependencies within the data. We find that large language models (GPT-2), both when queried via few-shot prompting and when fine-tuned, and GAN (CTGAN) models do not produce data with dependencies that mirror the original real data. Results from this study can inform future practice in synthetic data generation to improve data quality.
Submitted 26 March, 2025;
originally announced March 2025.
-
The Rotary Position Embedding May Cause Dimension Inefficiency in Attention Heads for Long-Distance Retrieval
Authors:
Ting-Rui Chiang,
Dani Yogatama
Abstract:
The Rotary Position Embedding (RoPE) is widely used in the attention heads of many large language models (LLM). It rotates dimensions in the query and the key vectors by different angles according to their positions in the input sequence. For long context modeling, the range of positions may vary a lot, and thus RoPE rotates some dimensions by a great range of angles. We hypothesize that the wide range of rotation angles may prevent LLMs from utilizing those dimensions. To validate this hypothesis, we present a controlled experiment showing that applying RoPE causes low utility of certain dimensions. Our analyses on three LLMs also indicate that these dimensions do not help LLMs do long-context question answering.
Submitted 16 February, 2025;
originally announced February 2025.
-
Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training
Authors:
Yao-Ching Yu,
Tsun-Han Chiang,
Cheng-Wei Tsai,
Chien-Ming Huang,
Wen-Kwang Tsao
Abstract:
Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. However, in cybersecurity, we have noticed a lack of open-source datasets, with a particular lack of high-quality cybersecurity pretraining corpora, even though much research indicates that LLMs acquire their knowledge during pretraining. To address this, we present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-tuning, and reasoning distillation with cybersecurity-specific self-reflection data. Extensive ablation studies demonstrate their effectiveness on public cybersecurity benchmarks. In particular, continual pre-training on our dataset yields a 15.88% improvement in the aggregate score, while reasoning distillation leads to a 10% gain in security certification (CISSP). We will release all datasets and trained cybersecurity LLMs under the ODC-BY and MIT licenses to encourage further research in the community. For access to all datasets and model weights, please refer to https://huggingface.co/collections/trendmicro-ailab/primus-67b1fd27052b802b4af9d243.
Submitted 1 June, 2025; v1 submitted 16 February, 2025;
originally announced February 2025.
-
SVM/SVR Kernels as Quantum Propagators
Authors:
Nan-Hong Kuo,
Tsung-Wei Chiang,
Renata Wong
Abstract:
In this work, we establish the equivalence between Support Vector Machine (SVM) kernels and quantum Green's functions. Drawing on the analogy between margin maximization in SVMs and action extremization in Lagrangian mechanics, we show that many standard kernels correspond naturally to Green's functions and that this correspondence arises from the inversion of physical operators. We further demonstrate how positive semi-definiteness, which is essential for valid SVM kernels, aligns with the spectral properties that ensure well-defined Green's functions.
We employ the Kernel Polynomial Method (KPM) to create custom kernels for cases where the commonly employed kernels don't lead to a convergence. These custom kernels approximate the desired Green's functions. We furthermore demonstrate numerically on examples taken from physical problems, such as electrical conductivity, scattering amplitudes, photonic crystals, and energy levels of anharmonic oscillators, that selecting kernel functions that mirror the mathematical form of the associated Green's function can significantly enhance the predictive accuracy of machine learning models.
Submitted 16 February, 2025;
originally announced February 2025.
-
In-plane anisotropy of charge density wave fluctuations in 1$T$-TiSe$_2$
Authors:
Xuefei Guo,
Anshul Kogar,
Jans Henke,
Felix Flicker,
Fernando de Juan,
Stella X. -L. Sun,
Issam Khayr,
Yingying Peng,
Sangjun Lee,
Matthew J. Krogstad,
Stephan Rosenkranz,
Raymond Osborn,
Jacob P. C. Ruff,
David B. Lioi,
Goran Karapetrov,
Daniel J. Campbell,
Johnpierre Paglione,
Jasper van Wezel,
Tai C. Chiang,
Peter Abbamonte
Abstract:
We report measurements of anisotropic triple-$q$ charge density wave (CDW) fluctuations in the transition metal dichalcogenide 1$T$-TiSe$_2$ over a large volume of reciprocal space with X-ray diffuse scattering. Above the transition temperature, $T_{\text{CDW}}$, the out-of-plane diffuse scattering is characterized by rod-like structures which indicate that the CDW fluctuations in neighboring layers are largely decoupled. In addition, the in-plane diffuse scattering is marked by ellipses which reveal that the in-plane fluctuations are anisotropic. Our analysis of the diffuse scattering line shapes and orientations suggests that the three charge density wave components contain independent phase fluctuations. At $T_{\text{CDW}}$, long range coherence is established in both the in-plane and out-of-plane directions, consistent with the large observed value of the CDW gap compared to $T_{\text{CDW}}$, and the predicted presence of a hierarchy of energy scales.
Submitted 17 January, 2025;
originally announced January 2025.
-
Measurement of the dynamic charge susceptibility near the charge density wave transition in ErTe$_3$
Authors:
Dipanjan Chaudhuri,
Qianni Jiang,
Xuefei Guo,
Jin Chen,
Caitlin S. Kengle,
Farzaneh Hoveyda-Marashi,
Camille Bernal-Choban,
Niels de Vries,
Tai-Chang Chiang,
Eduardo Fradkin,
Ian R. Fisher,
Peter Abbamonte
Abstract:
A charge density wave (CDW) is a phase of matter characterized by a periodic modulation of the valence electron density accompanied by a distortion of the lattice structure. The microscopic details of CDW formation are closely tied to the dynamic charge susceptibility, $χ(q,ω)$, which describes the behavior of electronic collective modes. Despite decades of extensive study, the behavior of $χ(q,ω)$ in the vicinity of a CDW transition has never been measured with high energy resolution ($\sim$meV). Here, we investigate the canonical CDW transition in ErTe$_3$ using momentum-resolved electron energy loss spectroscopy (M-EELS), a technique uniquely sensitive to valence band charge excitations. Unlike phonons in these materials, which undergo conventional softening due to the Kohn anomaly at the CDW wavevector, the electronic excitations display purely relaxational dynamics that are well described by a diffusive model. The diffusivity peaks around 250 K, just below the critical temperature. Additionally, we report, for the first time, a divergence in the real part of $χ(q,ω)$ in the static limit ($ω\rightarrow 0$), a phenomenon predicted to characterize CDWs since the 1970s. These results highlight the importance of energy- and momentum-resolved measurements of electronic susceptibility and demonstrate the power of M-EELS as a versatile probe of charge dynamics in materials.
Submitted 18 March, 2025; v1 submitted 22 November, 2024;
originally announced November 2024.
-
Conformally invariant charge fluctuations in a strange metal
Authors:
Xuefei Guo,
Jin Chen,
Farzaneh Hoveyda-Marashi,
Simon L. Bettler,
Dipanjan Chaudhuri,
Caitlin S. Kengle,
John A. Schneeloch,
Ruidan Zhang,
Genda Gu,
Tai-Chang Chiang,
Alexei M. Tsvelik,
Thomas Faulkner,
Philip W. Phillips,
Peter Abbamonte
Abstract:
The strange metal is a peculiar phase of matter in which the electron scattering rate, $τ^{-1} \sim k_B T/\hbar$, which determines the electrical resistance, is universal across a wide family of materials and determined only by fundamental constants. In 1989, theorists hypothesized that this universality would manifest as scale-invariant behavior in the dynamic charge susceptibility, $χ''(q,ω)$. Here, we present momentum-resolved inelastic electron scattering measurements of the strange metal Bi$_2$Sr$_2$CaCu$_2$O$_{8+x}$ showing that the susceptibility has the scale-invariant form $χ''(q,ω) = T^{-ν} f(ω/T)$, with exponent $ν= 0.93$. We find the response is consistent with conformal invariance, meaning the dynamics may be thought of as occurring on a circle of radius $1/T$ in imaginary time, characterized by conformal dimension $Δ= 0.05$. Our study indicates that the strange metal is a universal phenomenon whose properties are not determined by microscopic properties of a particular material.
Submitted 17 November, 2024;
originally announced November 2024.
-
The tidal evolution of anisotropic subhaloes: A new pathway to creating isotropic and cored satellites
Authors:
Barry T. Chiang,
Frank C. van den Bosch,
Hsi-Yu Schive
Abstract:
It is common practice, both in dynamical modelling and in idealised numerical simulations, to assume that galaxies and/or dark matter haloes are spherical and have isotropic velocity distributions, such that their distribution functions are ergodic. However, there is no good reason to assume that this assumption is accurate. In this paper we use idealised $N$-body simulations to study the tidal evolution of subhaloes that are anisotropic at infall. We show that the detailed velocity anisotropy has a large impact on the subhalo's mass loss rate. In particular, subhaloes that are radially anisotropic experience much more mass loss than their tangentially anisotropic counterparts. In fact, in the former case, the stripping of highly radial orbits can cause a rapid cusp-to-core transformation, without having to resort to any baryonic feedback processes. Once the tidal radius becomes comparable to the radius of the core thus formed, the subhalo is tidally disrupted. Subhaloes that at infall are tangentially anisotropic are far more resilient to tidal stripping, and are never disrupted when simulated with sufficient resolution. We show that the preferential stripping of more radial orbits, combined with re-virialisation post stripping, causes an isotropisation of the subhalo's velocity distributions. This implies that subhaloes that have experienced significant mass loss are expected to be close to isotropic, which may alleviate the mass-anisotropy degeneracies that hamper the dynamical modelling of Milky Way satellites.
Submitted 5 November, 2024;
originally announced November 2024.
-
LocateBench: Evaluating the Locating Ability of Vision Language Models
Authors:
Ting-Rui Chiang,
Joshua Robinson,
Xinyan Velocity Yu,
Dani Yogatama
Abstract:
The ability to locate an object in an image according to natural language instructions is crucial for many real-world applications. In this work we propose LocateBench, a high-quality benchmark dedicated to evaluating this ability. We experiment with multiple prompting approaches, and measure the accuracy of several large vision language models. We find that even the accuracy of the strongest model, GPT-4o, lags behind human accuracy by more than 10%.
Submitted 17 October, 2024;
originally announced October 2024.
-
Reprojection Errors as Prompts for Efficient Scene Coordinate Regression
Authors:
Ting-Ru Liu,
Hsuan-Kung Yang,
Jou-Min Liu,
Chun-Wei Huang,
Tsung-Chih Chiang,
Quan Kong,
Norimasa Kobori,
Chun-Yi Lee
Abstract:
Scene coordinate regression (SCR) methods have emerged as a promising area of research due to their potential for accurate visual localization. However, many existing SCR approaches train on samples from all image regions, including dynamic objects and texture-less areas. Utilizing these areas for optimization during training can potentially hamper the overall performance and efficiency of the model. In this study, we first perform an in-depth analysis to validate the adverse impacts of these areas. Drawing inspiration from our analysis, we then introduce an error-guided feature selection (EGFS) mechanism, in tandem with the use of the Segment Anything Model (SAM). This mechanism seeds low reprojection areas as prompts and expands them into error-guided masks, and then utilizes these masks to sample points and filter out problematic areas in an iterative manner. The experiments demonstrate that our method outperforms existing SCR approaches that do not rely on 3D information on the Cambridge Landmarks and Indoor6 datasets.
Submitted 6 September, 2024;
originally announced September 2024.
-
Understanding Generative AI Content with Embedding Models
Authors:
Max Vargas,
Reilly Cannon,
Andrew Engel,
Anand D. Sarwate,
Tony Chiang
Abstract:
Constructing high-quality features is critical to any quantitative data analysis. While feature engineering was historically addressed by carefully hand-crafting data representations based on domain expertise, deep neural networks (DNNs) now offer a radically different approach. DNNs implicitly engineer features by transforming their input data into hidden feature vectors called embeddings. For embedding vectors produced by foundation models -- which are trained to be useful across many contexts -- we demonstrate that simple and well-studied dimensionality-reduction techniques such as Principal Component Analysis uncover inherent heterogeneity in input data concordant with human-understandable explanations. Of the many applications for this framework, we find empirical evidence that there is intrinsic separability between real samples and those generated by artificial intelligence (AI).
Submitted 22 February, 2025; v1 submitted 19 August, 2024;
originally announced August 2024.
-
Point-SAM: Promptable 3D Segmentation Model for Point Clouds
Authors:
Yuchen Zhou,
Jiayuan Gu,
Tung Yen Chiang,
Fanbo Xiang,
Hao Su
Abstract:
The development of 2D foundation models for image segmentation has been significantly advanced by the Segment Anything Model (SAM). However, achieving similar success in 3D models remains a challenge due to issues such as non-unified data formats, poor model scalability, and the scarcity of labeled data with diverse masks. To this end, we propose a 3D promptable segmentation model Point-SAM, focusing on point clouds. We employ an efficient transformer-based architecture tailored for point clouds, extending SAM to the 3D domain. We then distill the rich knowledge from 2D SAM for Point-SAM training by introducing a data engine to generate part-level and object-level pseudo-labels at scale from 2D SAM. Our model outperforms state-of-the-art 3D segmentation models on several indoor and outdoor benchmarks and demonstrates a variety of applications, such as interactive 3D annotation and zero-shot 3D instance proposal. Codes and demo can be found at https://github.com/zyc00/Point-SAM.
Submitted 2 December, 2024; v1 submitted 25 June, 2024;
originally announced June 2024.
-
Evidence of directional structural superlubricity and Lévy flights in a van der Waals heterostructure
Authors:
Maxime Le Ster,
Paweł Krukowski,
Maciej Rogala,
Paweł Dabrowski,
Iaroslav Lutsyk,
Klaudia Toczek,
Krzysztof Podlaski,
Tefvik O. Mendeş,
Francesca Genuzio,
Andrea Locatelli,
Guan Bian,
Tai-Chang Chiang,
Simon A. Brown,
Paweł J. Kowalczyk
Abstract:
Structural superlubricity is a special frictionless contact in which two crystals are in incommensurate arrangement such that relative in-plane translation is associated with vanishing energy barrier crossing. So far, it has been realized in multilayer graphene and other van der Waals two-dimensional crystals with hexagonal or triangular crystalline symmetries, leading to isotropic frictionless contacts. Directional structural superlubricity, to date unrealized in two-dimensional systems, is possible when the reciprocal lattices of the two crystals coincide in one direction only. Here, we evidence directional structural superlubricity in an $α$-bismuthene/graphite van der Waals system, manifested by spontaneous hopping of the islands over hundreds of nanometres at room temperature, resolved by low-energy electron microscopy and supported by registry simulations. Statistical analysis of individual and collective $α$-bismuthene island populations reveals a heavy-tailed distribution of the hopping lengths and sticking times indicative of Lévy flight dynamics, largely unobserved in condensed-matter systems.
Submitted 16 July, 2024; v1 submitted 24 June, 2024;
originally announced June 2024.
-
Measuring training variability from stochastic optimization using robust nonparametric testing
Authors:
Sinjini Banerjee,
Tim Marrinan,
Reilly Cannon,
Tony Chiang,
Anand D. Sarwate
Abstract:
Deep neural network training often involves stochastic optimization, meaning each run will produce a different model. This implies that hyperparameters of the training process, such as the random seed itself, can potentially have significant influence on the variability in the trained models. Measuring model quality by summary statistics, such as test accuracy, can obscure this dependence. We propose a robust hypothesis testing framework and a novel summary statistic, the $α$-trimming level, to measure model similarity. Applying hypothesis testing directly with the $α$-trimming level is challenging because we cannot accurately describe the distribution under the null hypothesis. Our framework addresses this issue by determining how closely an approximate distribution resembles the expected distribution of a group of individually trained models and using this approximation as our reference. We then use the $α$-trimming level to suggest how many training runs should be sampled to ensure that an ensemble is a reliable representative of the true model performance. We also show how to use the $α$-trimming level to measure model variability and demonstrate experimentally that it is more expressive than performance metrics like validation accuracy, churn, or expected calibration error when taken alone. An application of fine-tuning over random seed in transfer learning illustrates the advantage of our new metric.
Submitted 15 April, 2025; v1 submitted 12 June, 2024;
originally announced June 2024.
-
Galactic disc heating by density granulation in fuzzy dark matter simulations
Authors:
Hsun-Yeong Yang,
Barry T. Chiang,
Guan-Ming Su,
Hsi-Yu Schive,
Tzihong Chiueh,
Jeremiah P. Ostriker
Abstract:
Fuzzy dark matter (FDM), an attractive dark matter candidate comprising ultralight bosons (axions) with a particle mass $m_a\sim10^{-22}$ eV, is motivated by the small-scale challenges of cold dark matter and features a kpc-size de Broglie wavelength. Quantum wave interference inside an FDM halo gives rise to stochastically fluctuating density granulation; the resulting gravitational perturbations could drive significant disc thickening, providing a natural explanation for galactic thick discs. Here we present the first self-consistent simulations of FDM haloes and stellar discs, exploring $m_a=0.2-1.2\times10^{-22}$ eV and halo masses $M_\text{h} = 0.7-2.8\times10^{11}$ M$_\odot$. Disc thickening is observed in all simulated systems. The disc heating rates are approximately constant in time and increase substantially with decreasing $m_a$, reaching $dh/dt \simeq 0.04$ ($0.4$) kpc Gyr$^{-1}$ and $dσ_z^2/dt \simeq4$ ($150$) km$^2$s$^{-2}$Gyr$^{-1}$ for $m_a=1.2$ ($0.2$) $\times10^{-22}$ eV and $M_\text{h} =7\times10^{10} \text{M}_\odot$, where $h$ is the disc scale height and $σ_z$ is the vertical velocity dispersion. These simulated heating rates agree within a factor of two with the theoretical estimates of Chiang et al., confirming that the rough estimate of Church et al. overpredicts the granulation-driven disc heating rate by two orders of magnitude. However, the simulation-inferred heating rates scale less steeply than the theoretically predicted relation $dσ^2_z/dt \propto m_a^{-3}$. Finally, we examine the applicability of the Fokker-Planck approximation in FDM granulation modelling and the robustness of the $m_a$ exclusion bound derived from the Galactic disc kinematics.
Submitted 14 March, 2024;
originally announced March 2024.
-
Understanding In-Context Learning with a Pelican Soup Framework
Authors:
Ting-Rui Chiang,
Dani Yogatama
Abstract:
Many existing theoretical analyses of in-context learning for natural language processing are based on latent variable models that leave gaps between theory and practice. We aim to close these gaps by proposing a theoretical framework, the Pelican Soup Framework. In this framework, we introduce (1) the notion of a common sense knowledge base, (2) a general formalism for natural language classification tasks, and (3) the notion of meaning association. Under this framework, we can establish an $\mathcal{O}(1/T)$ loss bound for in-context learning, where $T$ is the number of example-label pairs in the demonstration. Compared with previous works, our bound reflects the effect of the choice of verbalizers and the effect of instruction tuning. An additional notion of \textit{atom concepts} makes it possible for our framework to explain the generalization to tasks unseen in the language model training data. Finally, we propose a toy setup, Calcutec, and a digit addition task that mimics types of distribution shifts a model needs to overcome to perform in-context learning. We also experiment with GPT2-Large on real-world NLP tasks. Our empirical results demonstrate the efficacy of our framework in explaining in-context learning.
Submitted 15 February, 2024;
originally announced February 2024.
-
On Retrieval Augmentation and the Limitations of Language Model Training
Authors:
Ting-Rui Chiang,
Xinyan Velocity Yu,
Joshua Robinson,
Ollie Liu,
Isabelle Lee,
Dani Yogatama
Abstract:
Augmenting a language model (LM) with $k$-nearest neighbors ($k$NN) retrieval on its training data alone can decrease its perplexity, though the underlying reasons for this remain elusive. In this work, we rule out one previously posited possibility -- the "softmax bottleneck." We then create a new dataset to evaluate LM generalization ability in the setting where training data contains additional information that is not causally relevant. This task is challenging even for GPT-3.5 Turbo. We show that, for both GPT-2 and Mistral 7B, $k$NN retrieval augmentation consistently improves performance in this setting. Finally, to make $k$NN retrieval more accessible, we propose using a multi-layer perceptron model that maps datastore keys to values as a drop-in replacement for traditional retrieval. This reduces storage costs by over 25x.
Submitted 2 April, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
Efficient kernel surrogates for neural network-based regression
Authors:
Saad Qadeer,
Andrew Engel,
Amanda Howard,
Adam Tsou,
Max Vargas,
Panos Stinis,
Tony Chiang
Abstract:
Despite their immense promise in performing a variety of learning tasks, a theoretical understanding of the limitations of Deep Neural Networks (DNNs) has so far eluded practitioners. This is partly due to the inability to determine the closed forms of the learned functions, making it harder to study their generalization properties on unseen datasets. Recent work has shown that randomly initialized DNNs in the infinite width limit converge to kernel machines relying on a Neural Tangent Kernel (NTK) with known closed form. These results suggest, and experimental evidence corroborates, that empirical kernel machines can also act as surrogates for finite width DNNs. The high computational cost of assembling the full NTK, however, makes this approach infeasible in practice, motivating the need for low-cost approximations. In the current work, we study the performance of the Conjugate Kernel (CK), an efficient approximation to the NTK that has been observed to yield fairly similar results. For the regression problem of smooth functions and logistic regression classification, we show that the CK performance is only marginally worse than that of the NTK and, in certain cases, is shown to be superior. In particular, we establish bounds for the relative test losses, verify them with numerical tests, and identify the regularity of the kernel as the key determinant of performance. In addition to providing a theoretical grounding for using CKs instead of NTKs, our framework suggests a recipe for improving DNN accuracy inexpensively. We present a demonstration of this on the foundation model GPT-2 by comparing its performance on a classification task using a conventional approach and our prescription. We also show how our approach can be used to improve physics-informed operator network training for regression tasks as well as convolutional neural network training for vision classification tasks.
Submitted 24 January, 2024; v1 submitted 28 October, 2023;
originally announced October 2023.
-
The Distributional Hypothesis Does Not Fully Explain the Benefits of Masked Language Model Pretraining
Authors:
Ting-Rui Chiang,
Dani Yogatama
Abstract:
We analyze the masked language modeling pretraining objective function from the perspective of the distributional hypothesis. We investigate whether better sample efficiency and the better generalization capability of models pretrained with masked language modeling can be attributed to the semantic similarity encoded in the pretraining data's distributional property. Via a synthetic dataset, our analysis suggests that distributional property indeed leads to the better sample efficiency of pretrained masked language models, but does not fully explain the generalization capability. We also conduct analyses over two real-world datasets and demonstrate that the distributional property does not explain the generalization ability of pretrained natural language models either. Our results illustrate our limited understanding of model pretraining and provide future research directions.
Submitted 24 October, 2023;
originally announced October 2023.
-
Computational and Systems Biology Advances to Enable Bioagent-Agnostic Signatures
Authors:
Andy Lin,
Cameron Torres,
Errett C. Hobbs,
Jaydeep Bardhan,
Stephen B. Aley,
Charles T. Spencer,
Karen L. Taylor,
Tony Chiang
Abstract:
Enumerated threat agent lists have long driven biodefense priorities. The global SARS-CoV-2 pandemic demonstrated the limitations of searching for known threat agents as compared to a more agnostic approach. Recent technological advances are enabling agent-agnostic biodefense, especially through the integration of multi-modal observations of host-pathogen interactions directed by a human immunological model. Although well-developed technical assays exist for many aspects of human-pathogen interaction, the analytic methods and pipelines to combine and holistically interpret the results of such assays are immature and require further investments to exploit new technologies. In this manuscript, we discuss potential immunologically based bioagent-agnostic approaches and the computational tool gaps the community should prioritize filling.
Submitted 28 February, 2024; v1 submitted 20 October, 2023;
originally announced October 2023.
-
Foundation Model's Embedded Representations May Detect Distribution Shift
Authors:
Max Vargas,
Adam Tsou,
Andrew Engel,
Tony Chiang
Abstract:
Sampling biases can cause distribution shifts between train and test datasets for supervised learning tasks, obscuring our ability to understand the generalization capacity of a model. This is especially important considering the wide adoption of pre-trained foundational neural networks -- whose behavior remains poorly understood -- for transfer learning (TL) tasks. We present a case study for TL on the Sentiment140 dataset and show that many pre-trained foundation models encode different representations of Sentiment140's manually curated test set $M$ from the automatically labeled training set $P$, confirming that a distribution shift has occurred. We argue training on $P$ and measuring performance on $M$ is a biased measure of generalization. Experiments on pre-trained GPT-2 show that the features learnable from $P$ do not improve (and in fact hamper) performance on $M$. Linear probes on pre-trained GPT-2's representations are robust and may even outperform overall fine-tuning, implying a fundamental importance for discerning distribution shift in train/test splits for model interpretation.
Submitted 2 February, 2024; v1 submitted 20 October, 2023;
originally announced October 2023.
-
Visual Forecasting as a Mid-level Representation for Avoidance
Authors:
Hsuan-Kung Yang,
Tsung-Chih Chiang,
Ting-Ru Liu,
Chun-Wei Huang,
Jou-Min Liu,
Chun-Yi Lee
Abstract:
The challenge of navigation in environments with dynamic objects continues to be a central issue in the study of autonomous agents. While predictive methods hold promise, their reliance on precise state information makes them less practical for real-world implementation. This study presents visual forecasting as an innovative alternative. By introducing intuitive visual cues, this approach projects the future trajectories of dynamic objects to improve agent perception and enable anticipatory actions. Our research explores two distinct strategies for conveying predictive information through visual forecasting: (1) sequences of bounding boxes, and (2) augmented paths. To validate the proposed visual forecasting strategies, we initiate evaluations in simulated environments using the Unity engine and then extend these evaluations to real-world scenarios to assess both practicality and effectiveness. The results confirm the viability of visual forecasting as a promising solution for navigation and obstacle avoidance in dynamic environments.
Submitted 17 September, 2023;
originally announced October 2023.
-
Robust Nonparametric Hypothesis Testing to Understand Variability in Training Neural Networks
Authors:
Sinjini Banerjee,
Reilly Cannon,
Tim Marrinan,
Tony Chiang,
Anand D. Sarwate
Abstract:
Training a deep neural network (DNN) often involves stochastic optimization, which means each run will produce a different model. Several works suggest this variability is negligible when models have the same performance, which in the case of classification is test accuracy. However, models with similar test accuracy may not be computing the same function. We propose a new measure of closeness between classification models based on the output of the network before thresholding. Our measure is based on a robust hypothesis-testing framework and can be adapted to other quantities derived from trained models.
Submitted 30 September, 2023;
originally announced October 2023.
-
Exploring Learned Representations of Neural Networks with Principal Component Analysis
Authors:
Amit Harlev,
Andrew Engel,
Panos Stinis,
Tony Chiang
Abstract:
Understanding feature representation for deep neural networks (DNNs) remains an open question within the general field of explainable AI. We use principal component analysis (PCA) to study the performance of a k-nearest neighbors classifier (k-NN), nearest class-centers classifier (NCC), and support vector machines on the learned layer-wise representations of a ResNet-18 trained on CIFAR-10. We show that in certain layers, as little as 20% of the intermediate feature-space variance is necessary for high-accuracy classification and that across all layers, the first ~100 PCs completely determine the performance of the k-NN and NCC classifiers. We relate our findings to neural collapse and provide partial evidence for the related phenomenon of intermediate neural collapse. Our preliminary work provides three distinct yet interpretable surrogate models for feature representation with an affine linear model the best performing. We also show that leveraging several surrogate models affords us a clever method to estimate where neural collapse may initially occur within the DNN.
Submitted 26 September, 2023;
originally announced September 2023.
-
Minibatching Offers Improved Generalization Performance for Second Order Optimizers
Authors:
Eric Silk,
Swarnita Chakraborty,
Nairanjana Dasgupta,
Anand D. Sarwate,
Andrew Lumsdaine,
Tony Chiang
Abstract:
Training deep neural networks (DNNs) used in modern machine learning is computationally expensive. Machine learning scientists, therefore, rely on stochastic first-order methods for training, coupled with significant hand-tuning, to obtain good performance. To better understand performance variability of different stochastic algorithms, including second-order methods, we conduct an empirical study that treats performance as a response variable across multiple training sessions of the same model. Using 2-factor Analysis of Variance (ANOVA) with interactions, we show that batch size used during training has a statistically significant effect on the peak accuracy of the methods, and that full batch largely performed the worst. In addition, we found that second-order optimizers (SOOs) generally exhibited significantly lower variance at specific batch sizes, suggesting they may require less hyperparameter tuning, leading to a reduced overall time to solution for model training.
Submitted 25 May, 2023;
originally announced July 2023.
-
Consistency between reflection M-EELS and optical spectroscopy measurements of the long-wavelength density response of Bi$_2$Sr$_2$CaCu$_2$O$_{8+x}$
Authors:
Jin Chen,
Xuefei Guo,
Christian Boyd,
Simon Bettler,
Caitlin Kengle,
Dipanjan Chaudhuri,
Farzaneh Hoveyda,
Ali Husain,
John Schneeloch,
Genda Gu,
Philip Phillips,
Bruno Uchoa,
Tai-Chang Chiang,
Peter Abbamonte
Abstract:
The density fluctuation spectrum captures many fundamental properties of strange metals. Using momentum-resolved electron energy-loss spectroscopy (M-EELS), we recently showed that the density response of the strange metal Bi$_2$Sr$_2$CaCu$_2$O$_{8+x}$ (Bi-2212) at large momentum, $q$, exhibits a constant-in-frequency continuum [Mitrano, PNAS $\textbf{115}$, 5392 (2018); Husain, PRX $\textbf{9}$, 041062 (2019)] reminiscent of the marginal Fermi liquid (MFL) hypothesis of the late 1980s [Varma, PRL $\textbf{63}$, 1996 (1989)]. However, reconciling this observation with infrared (IR) optics experiments, which show a well-defined plasmon excitation at $q \sim 0$, has been challenging. Here we report M-EELS measurements of Bi-2212 using 4$\times$ improved momentum resolution, allowing us to reach the optical limit. For momenta $q<0.04$ r.l.u., the M-EELS data show a plasmon feature that is quantitatively consistent with IR optics. For $q>0.04$ r.l.u., the spectra become incoherent with an MFL-like, constant-in-frequency form. We speculate that, at finite frequency, $ω$, and nonzero $q$, some attribute of this Planckian metal randomizes the probe electron, causing it to lose information about its own momentum.
Submitted 13 December, 2023; v1 submitted 6 June, 2023;
originally announced June 2023.
-
Faithful and Efficient Explanations for Neural Networks via Neural Tangent Kernel Surrogate Models
Authors:
Andrew Engel,
Zhichao Wang,
Natalie S. Frank,
Ioana Dumitriu,
Sutanay Choudhury,
Anand Sarwate,
Tony Chiang
Abstract:
A recent trend in explainable AI research has focused on surrogate modeling, where neural networks are approximated as simpler ML algorithms such as kernel machines. A second trend has been to utilize kernel functions in various explain-by-example or data attribution tasks. In this work, we combine these two trends to analyze approximate empirical neural tangent kernels (eNTK) for data attribution. Approximation is critical for eNTK analysis due to the high computational cost to compute the eNTK. We define new approximate eNTK and perform novel analysis on how well the resulting kernel machine surrogate models correlate with the underlying neural network. We introduce two new random projection variants of approximate eNTK which allow users to tune the time and memory complexity of their calculation. We conclude that kernel machines using approximate neural tangent kernel as the kernel function are effective surrogate models, with the introduced trace NTK the most consistent performer. Open source software allowing users to efficiently calculate kernel functions in the PyTorch framework is available (https://github.com/pnnl/projection\_ntk).
Submitted 11 March, 2024; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Observation of 2D Weyl Fermion States in Epitaxial Bismuthene
Authors:
Qiangsheng Lu,
P. V. Sreenivasa Reddy,
Hoyeon Jeon,
Alessandro R. Mazza,
Matthew Brahlek,
Weikang Wu,
Shengyuan A. Yang,
Jacob Cook,
Clayton Conner,
Xiaoqian Zhang,
Amarnath Chakraborty,
Yueh-Ting Yao,
Hung-Ju Tien,
Chun-Han Tseng,
Po-Yuan Yang,
Shang-Wei Lien,
Hsin Lin,
Tai-Chang Chiang,
Giovanni Vignale,
An-Ping Li,
Tay-Rong Chang,
Rob G. Moore,
Guang Bian
Abstract:
A two-dimensional (2D) Weyl semimetal featuring a spin-polarized linear band dispersion and a nodal Fermi surface is a new topological phase of matter. It is a solid-state realization of Weyl fermions in an intrinsic 2D system. The nontrivial topology of 2D Weyl cones guarantees the existence of a new form of topologically protected boundary states, Fermi string edge states. In this work, we report the realization of a 2D Weyl semimetal in monolayer-thick epitaxial bismuthene grown on SnS(Se) substrate. The intrinsic band gap of bismuthene is eliminated by the space-inversion-symmetry-breaking substrate perturbations, resulting in a gapless spin-polarized Weyl band dispersion. The linear dispersion and spin polarization of the Weyl fermion states are observed in our spin and angle-resolved photoemission measurements. In addition, the scanning tunneling microscopy/spectroscopy reveals a pronounced local density of states at the edge, suggesting the existence of Fermi string edge states. These results open the door for the experimental exploration of the exotic properties of Weyl fermion states in reduced dimensions.
Submitted 6 March, 2023;
originally announced March 2023.
-
Virtual Guidance as a Mid-level Representation for Navigation with Augmented Reality
Authors:
Hsuan-Kung Yang,
Tsung-Chih Chiang,
Jou-Min Liu,
Ting-Ru Liu,
Chun-Wei Huang,
Tsu-Ching Hsiao,
Chun-Yi Lee
Abstract:
In the context of autonomous navigation, effectively conveying abstract navigational cues to agents in dynamic environments presents significant challenges, particularly when navigation information is derived from diverse modalities such as both vision and high-level language descriptions. To address this issue, we introduce a novel technique termed `Virtual Guidance,' which is designed to visually represent non-visual instructional signals. These visual cues are overlaid onto the agent's camera view and serve as comprehensible navigational guidance signals. To validate the concept of virtual guidance, we propose a sim-to-real framework that enables the transfer of the trained policy from simulated environments to the real world, ensuring the adaptability of virtual guidance in practical scenarios. We evaluate and compare the proposed method against a non-visual guidance baseline through detailed experiments in simulation. The experimental results demonstrate that the proposed virtual guidance approach outperforms the baseline methods across multiple scenarios and offers clear evidence of its effectiveness in autonomous navigation tasks.
Submitted 14 March, 2025; v1 submitted 5 March, 2023;
originally announced March 2023.
-
Influence of Structural Defects on Charge Density Waves in 1T-TaS2
Authors:
I. Lutsyk,
K. Szalowski,
P. Krukowski,
P. Dabrowski,
M. Rogala,
W. Kozlowski,
M. Le Ster,
M. Piskorski,
D. A. Kowalczyk,
W. Rys,
R. Dunal,
A. Nadolska,
K. Toczek,
P. Przybysz,
E. Lacinska,
J. Binder,
A. Wysmolek,
N. Olszowska,
J. J. Kolodziej,
M. Gmitra,
T. Hattori,
Y. Kuwahara,
G. Bian,
T. -C. Chiang,
P. J. Kowalczyk
Abstract:
The influence of intrinsic defects of 1T-TaS2 on charge density waves (CDW) is studied using scanning tunneling microscopy and spectroscopy (STM, STS), angle-resolved photoelectron spectroscopy (ARPES), and density functional theory (DFT). We identify several types of structural defects and find that most have a local character limited to a single CDW site, with a single exception which effectively behaves as a dopant, leading to band bending and affecting multiple neighboring sites. While only one type of defect can be observed by STM topographic imaging, all defects are easily resolved by local density of states (LDOS) mapping with STS. We correlate the atomically resolved STM periodicity of defect-free 1T-TaS2 to the top sulfur atoms and introduce a tiling of the surface using equiangular hexagons. DFT calculations (with Coulomb interactions included) are used to investigate the electronic structure upon introducing a sulfur vacancy or substituting sulfur with oxygen. The sulfur vacancy is characterized by metallic properties and is identified as the origin of one of the experimentally observed defects. In the latter case, the oxidation of 1T-TaS2 is found to result in the loss of the magnetic properties expected in defect-free material.
Submitted 1 March, 2023;
originally announced March 2023.
-
Dimensional crossover and symmetry transformation of the charge density waves in VSe2
Authors:
P. Chen,
Y. -H. Chan,
R. -Y. Liu,
H. T. Zhang,
Q. Gao,
A. -V. Fedorov,
M. Y. Chou,
T. -C. Chiang
Abstract:
Collective phenomena in solids can be sensitive to the dimensionality of the system; a case of special interest is VSe2, which shows a ($\sqrt{7} \times \sqrt{3}$) charge density wave (CDW) in the single layer with the three-fold symmetry in the normal phase spontaneously broken, in contrast to the ($4 \times 4$) in-plane CDW in the bulk. Angle-resolved photoemission spectroscopy (ARPES) from VSe2 ranging from a single layer to the bulk reveals the evolution of the electronic structure including the Fermi surface contours and the CDW gap. At a thickness of two layers, the ARPES maps are already nearly bulklike, but the transition temperature TC for the ($4 \times 4$) CDW is much higher than the bulk value of 110 K. These results can be understood as a result of dimensional crossover of phonon instability driven by a competition of nesting vectors. Our study provides key insights into the CDW mechanisms and offers a perspective in the search and control of emergent phases in quantum materials.
Submitted 20 February, 2023;
originally announced February 2023.
-
Evidence of high-temperature exciton condensation in a two-dimensional semimetal
Authors:
Qiang Gao,
Yang-hao Chan,
Yuzhe Wang,
Haotian Zhang,
Jinxu Pu,
Shengtao Cui,
Yichen Yang,
Zhengtai Liu,
Dawei Shen,
Zhe Sun,
Juan Jiang,
Tai C. Chiang,
Peng Chen
Abstract:
Electrons and holes can spontaneously form excitons and condense in a semimetal or semiconductor, as predicted decades ago. This type of Bose condensation can happen at much higher temperatures in comparison with dilute atomic gases. Two-dimensional (2D) materials with reduced Coulomb screening around the Fermi level are promising for realizing such a system. Here we report a change in the band structure accompanied by a phase transition at about 180 K in single-layer ZrTe2 based on angle-resolved photoemission spectroscopy (ARPES) measurements. Below the transition temperature, gap opening and development of an ultra-flat band top around the zone center are observed. This gap and the phase transition are rapidly suppressed with extra carrier densities introduced by adding more layers or dopants on the surface. The results suggest the formation of an excitonic insulating ground state in single-layer ZrTe2, and the findings are rationalized by first principles calculations and a self-consistent mean-field theory. Our study provides evidence for exciton condensation in a 2D semimetal and demonstrates strong dimensionality effects on the formation of intrinsic bound electron-hole pairs in solids.
Submitted 7 February, 2023;
originally announced February 2023.
-
Edge States of α-Bismuthene Nanostructures
Authors:
Sara Salehitaleghani,
Tobias Maerkl,
Pawel J Kowalczyk,
Maxime Le Ster,
Xiaoxiong Wang,
Guang Bian,
Tai-Chang Chiang,
Simon A Brown
Abstract:
We present a systematic investigation of the edge states of two-dimensional α-bismuthene (α-Bi) structures self-assembled on HOPG substrates, using scanning tunnelling microscopy and scanning tunnelling spectroscopy. The measurements are carried out for 3ML, 5ML and 7ML thick Bi structures. Our spectroscopy studies reveal clear features at the edges of the 5ML and 7ML thick structures, and the positions of the edge states (ESs) coincide with the topographical step edges. In contrast, in 3ML structures the ESs appear to be absent and instead new states are sometimes observed, far from the topographical edge. These states are associated with a moiré pattern and result from strain-induced modulation of the topology. Our observations demonstrate the impact on the edge states of coupling to adjacent structures.
Submitted 1 January, 2023;
originally announced January 2023.
-
ExReg: Wide-range Photo Exposure Correction via a Multi-dimensional Regressor with Attention
Authors:
Tzu-Hao Chiang,
Hao-Chien Hsueh,
Ching-Chun Hsiao,
Ching-Chun Huang
Abstract:
Photo exposure correction is widely investigated, but fewer studies focus on correcting under- and over-exposed images simultaneously. Three issues remain open in handling under- and over-exposed images in a unified way. First, a locally adaptive exposure adjustment may be more flexible than learning a global mapping. Second, determining suitable exposure values locally is an ill-posed problem. Third, photos with the same content but different exposures may not reach consistent adjustment results. To this end, we propose a novel exposure correction network, ExReg, which addresses these challenges by formulating exposure correction as a multi-dimensional regression process. Given an input image, a compact multi-exposure generation network is introduced to generate images with different exposure conditions for multi-dimensional regression and exposure correction in the next stage. An auxiliary module is designed to predict the region-wise exposure values, guiding the proposed Encoder-Decoder ANP (Attentive Neural Processes) to regress the final corrected image. The experimental results show that ExReg can generate well-exposed results and outperform the SOTA method by 1.3 dB in PSNR for extensive exposure problems. In addition, given the same image under various exposures for testing, the corrected results are more visually consistent and physically accurate.
Submitted 14 December, 2022;
originally announced December 2022.
-
Can ultralight dark matter explain the age-velocity dispersion relation of the Milky Way disc: A revised and improved treatment
Authors:
Barry T. Chiang,
Jeremiah P. Ostriker,
Hsi-Yu Schive
Abstract:
Ultralight axion-like particles $m_a \sim 10^{-22}$ eV, or Fuzzy Dark Matter (FDM), behave comparably to cold dark matter (CDM) on cosmological scales and exhibit a kpc-size de Broglie wavelength capable of alleviating established (sub-)galactic-scale problems of CDM. Substructures inside an FDM halo incur gravitational potential perturbations, resulting in stellar heating sufficient to account for the Galactic disc thickening over a Hubble time, as first demonstrated by Church et al. We present a more sophisticated treatment that incorporates the full baryon and dark matter distributions of the Milky Way and adopts stellar disc kinematics inferred from recent Gaia, APOGEE, and LAMOST surveys. Ubiquitous density granulation and subhalo passages respectively drive inner disc thickening and flaring of the outer disc, resulting in an observationally consistent `U-shaped' disc vertical velocity dispersion profile with the global minimum located near the solar radius. The observed age-velocity dispersion relation in the solar vicinity can be explained by the FDM-substructure-induced heating and places an exclusion bound $m_a \gtrsim 0.4\times10^{-22}$ eV. We assess non-trivial uncertainties in the empirical core-halo relation, FDM subhalo mass function and tidal stripping, and stellar heating estimate. The mass range $m_a\simeq 0.5-0.7\times10^{-22}$ eV favoured by the observed thick disc kinematics is in tension with several exclusion bounds inferred from dwarf density profiles, stellar streams, and Milky Way satellite populations, which could be significantly relaxed due to the aforesaid uncertainties. Additionally, strongly anisotropic heating could help explain the formation of ultra-thin disc galaxies.
Submitted 5 December, 2022; v1 submitted 14 November, 2022;
originally announced November 2022.
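The "kpc-size de Broglie wavelength" quoted in the abstract above can be checked with a short back-of-the-envelope calculation, $λ = h / (m_a v)$. The sketch below is only illustrative: the characteristic velocity of 200 km/s is an assumed, typical Milky Way value that is not taken from the abstract, and the function name is hypothetical.

```python
# Rough de Broglie wavelength of an ultralight particle, lambda = h / (m v).
# Assumption (not from the abstract): a characteristic velocity of 200 km/s.

H_C_EV_M = 1.23984198e-6   # h*c in eV*m
C_KM_S = 299792.458        # speed of light in km/s
KPC_IN_M = 3.0857e19       # one kiloparsec in metres


def de_broglie_wavelength_kpc(mass_ev: float, velocity_km_s: float) -> float:
    """Return h / (m v) in kpc for a mass given in eV/c^2 and a speed in km/s."""
    beta = velocity_km_s / C_KM_S            # v / c, dimensionless
    wavelength_m = H_C_EV_M / (mass_ev * beta)
    return wavelength_m / KPC_IN_M


# m_a ~ 1e-22 eV at an assumed 200 km/s gives a wavelength of order 0.6 kpc.
wavelength_kpc = de_broglie_wavelength_kpc(1e-22, 200.0)
print(f"{wavelength_kpc:.2f} kpc")
```

Because the wavelength scales as $1/m_a$, lighter particles within the mass range discussed in the abstract give proportionally longer wavelengths.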
-
Spectral Evolution and Invariance in Linear-width Neural Networks
Authors:
Zhichao Wang,
Andrew Engel,
Anand Sarwate,
Ioana Dumitriu,
Tony Chiang
Abstract:
We investigate the spectral properties of linear-width feed-forward neural networks, where the sample size is asymptotically proportional to network width. Empirically, we show that the spectra of the weight matrices in this high-dimensional regime are invariant when trained by gradient descent for small constant learning rates; we provide a theoretical justification for this observation and prove the invariance of the bulk spectra for both conjugate and neural tangent kernels. We demonstrate similar characteristics when training with stochastic gradient descent with small learning rates. When the learning rate is large, we exhibit the emergence of an outlier whose corresponding eigenvector is aligned with the training data structure. We also show that after adaptive gradient training, where a lower test error and feature learning emerge, both weight and kernel matrices exhibit heavy-tailed behavior. Simple examples are provided to explain when heavy tails can lead to better generalization. We exhibit different spectral properties such as invariant bulk, spike, and heavy-tailed distribution from a two-layer neural network using different training strategies, and then correlate them with feature learning. Analogous phenomena also appear when we train conventional neural networks with real-world data. We conclude that monitoring the evolution of the spectra during training is an essential step toward understanding the training dynamics and feature learning.
Submitted 7 November, 2023; v1 submitted 11 November, 2022;
originally announced November 2022.
-
Anharmonic multiphonon origin of the valence plasmon in SrTi1-xNbxO3
Authors:
Caitlin S. Kengle,
Samantha I. Rubeck,
Melinda Rak,
Jin Chen,
Faren Hoveyda,
Simon Bettler,
Ali Husain,
Matteo Mitrano,
Alexander Edelman,
Peter Littlewood,
Tai-Chang Chiang,
Fahad Mahmood,
Peter Abbamonte
Abstract:
Doped SrTi1-xNbxO3 exhibits superconductivity and a mid-infrared optical response reminiscent of copper-oxide superconductors. Strangely, its plasma frequency, omega_p, increases by a factor of ~3 when cooling from 300 K to 20 K, without any accepted explanation. Here, we present momentum-resolved electron energy loss spectroscopy (M-EELS) measurements of SrTi1-xNbxO3 at nonzero momentum, q. We find that the infrared feature previously identified as a plasmon is present at large q in insulating SrTiO3, where it exhibits the same temperature dependence and may be identified as an anharmonic, multiphonon background. Doping with Nb increases its peak energy and total spectral weight, drawing this background to lower q where it becomes visible in IR optics experiments. We conclude that the "plasmon" in doped SrTi1-xNbxO3 is not a free-carrier mode, but a composite excitation that inherits its unusual properties from the lattice anharmonicity of the insulator.
Submitted 26 October, 2022;
originally announced October 2022.
-
DialCrowd 2.0: A Quality-Focused Dialog System Crowdsourcing Toolkit
Authors:
Jessica Huynh,
Ting-Rui Chiang,
Jeffrey Bigham,
Maxine Eskenazi
Abstract:
Dialog system developers need high-quality data to train, fine-tune and assess their systems. They often use crowdsourcing for this since it provides large quantities of data from many workers. However, the data may not be of sufficiently good quality. This can be due to the way that the requester presents a task and how they interact with the workers. This paper introduces DialCrowd 2.0 to help requesters obtain higher quality data by, for example, presenting tasks more clearly and facilitating effective communication with workers. DialCrowd 2.0 guides developers in creating improved Human Intelligence Tasks (HITs) and is directly applicable to the workflows used currently by developers and researchers.
Submitted 25 July, 2022;
originally announced July 2022.
-
TorchNTK: A Library for Calculation of Neural Tangent Kernels of PyTorch Models
Authors:
Andrew Engel,
Zhichao Wang,
Anand D. Sarwate,
Sutanay Choudhury,
Tony Chiang
Abstract:
We introduce torchNTK, a Python library to calculate the empirical neural tangent kernel (NTK) of neural network models in the PyTorch framework. We provide an efficient method to calculate the NTK of multilayer perceptrons. We compare the explicit differentiation implementation against autodifferentiation implementations, which have the benefit of extending the utility of the library to any architecture supported by PyTorch, such as convolutional networks. A feature of the library is that we expose the user to layerwise NTK components, and show that in some regimes a layerwise calculation is more memory efficient. We conduct preliminary experiments to demonstrate use cases for the software and probe the NTK.
Submitted 24 May, 2022;
originally announced May 2022.
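To make the notion of an empirical NTK with layerwise components concrete, here is a minimal NumPy sketch. It does not use or reproduce the torchNTK API, which is not described in the abstract; the two-layer tanh network, the function names, and the explicit (hand-derived) gradients are all illustrative assumptions. The kernel entry for inputs x and x' is the inner product of the parameter gradients of the scalar output, and it splits into one term per layer.

```python
import numpy as np


def layer_gradients(x, W1, w2):
    """Per-layer gradients of f(x) = w2 . tanh(W1 @ x) with respect to W1 and w2."""
    h = np.tanh(W1 @ x)                       # hidden activations, shape (width,)
    grad_W1 = np.outer(w2 * (1.0 - h**2), x)  # df/dW1, shape (width, input_dim)
    grad_w2 = h                               # df/dw2, shape (width,)
    return grad_W1.ravel(), grad_w2


def empirical_ntk(X, W1, w2):
    """Return the total empirical NTK and its per-layer components for inputs X."""
    grads = [layer_gradients(x, W1, w2) for x in X]
    J1 = np.stack([g[0] for g in grads])      # Jacobian w.r.t. layer 1, (n, width*input_dim)
    J2 = np.stack([g[1] for g in grads])      # Jacobian w.r.t. layer 2, (n, width)
    layers = {"layer1": J1 @ J1.T, "layer2": J2 @ J2.T}
    return layers["layer1"] + layers["layer2"], layers


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                   # 5 samples, 3 input features
W1 = rng.normal(size=(8, 3)) / np.sqrt(3)     # hidden width 8
w2 = rng.normal(size=8) / np.sqrt(8)

K_total, K_layers = empirical_ntk(X, W1, w2)
print(K_total.shape)
```

Computing each layer's Gram matrix separately means only one layer's Jacobian needs to be held in memory at a time, which is one plausible reading of why a layerwise calculation can be more memory efficient.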
-
Breaking Down Multilingual Machine Translation
Authors:
Ting-Rui Chiang,
Yi-Pei Chen,
Yi-Ting Yeh,
Graham Neubig
Abstract:
While multilingual training is now an essential ingredient in machine translation (MT) systems, recent work has demonstrated that it has different effects in different multilingual settings, such as many-to-one, one-to-many, and many-to-many learning. These training settings expose the encoder and the decoder in a machine translation model to different data distributions. In this paper, we examine how different varieties of multilingual training contribute to learning these two components of the MT model. Specifically, we compare bilingual models with encoders and/or decoders initialized by multilingual training. We show that multilingual training is beneficial to encoders in general, while it only benefits decoders for low-resource languages (LRLs). We further find the important attention heads for each language pair and compare their correlations during inference. Our analysis sheds light on how multilingual translation models work and enables us to propose methods to improve performance by training with highly related languages. Our many-to-one models for high-resource languages and one-to-many models for LRLs outperform the best results reported by Aharoni et al. (2019).
Submitted 3 April, 2022; v1 submitted 15 October, 2021;
originally announced October 2021.
-
Are you doing what I say? On modalities alignment in ALFRED
Authors:
Ting-Rui Chiang,
Yi-Ting Yeh,
Ta-Chung Chi,
Yau-Shian Wang
Abstract:
ALFRED is a recently proposed benchmark that requires a model to complete tasks in simulated house environments specified by instructions in natural language. We hypothesize that the key to success is accurately aligning the text modality with visual inputs. Motivated by this, we inspect how well existing models can align these modalities using our proposed intrinsic metric, boundary adherence score (BAS). The results show that previous models indeed fail to perform proper alignment. To address this issue, we introduce approaches aimed at improving model alignment and demonstrate how improved alignment improves end-task performance.
Submitted 11 October, 2021;
originally announced October 2021.
-
On a Benefit of Mask Language Modeling: Robustness to Simplicity Bias
Authors:
Ting-Rui Chiang
Abstract:
Despite the success of pretrained masked language models (MLM), why MLM pretraining is useful is still a question not fully answered. In this work we theoretically and empirically show that MLM pretraining makes models robust to lexicon-level spurious features, partly answering the question. We theoretically show that, when we can model the distribution of a spurious feature $Π$ conditioned on the context, then (1) $Π$ is at least as informative as the spurious feature, and (2) learning from $Π$ is at least as simple as learning from the spurious feature. Therefore, MLM pretraining rescues the model from the simplicity bias caused by the spurious feature. We also explore the efficacy of MLM pretraining in causal settings. Finally, we close the gap between our theories and real-world practices by conducting experiments on the hate speech detection and named entity recognition tasks.
Submitted 11 October, 2021;
originally announced October 2021.
-
Observation of Unpinned Two-Dimensional Dirac States in Antimony Single Layers with Phosphorene Structure
Authors:
Qiangsheng Lu,
Matthew Snyder,
Kyle Y. Chen,
Xiaoqian Zhang,
Jacob Cook,
Duy Tung Nguyen,
P. V. Sreenivasa Reddy,
Tay-Rong Chang,
Pawel J. Kowalczyk,
Simon A. Brown,
Tai-Chang Chiang,
Shengyuan A. Yang,
Guang Bian
Abstract:
The discovery of graphene has stimulated enormous interest in two-dimensional (2D) electron gas with linear band structure. 2D Dirac materials possess many intriguing physical properties such as high carrier mobility and zero-energy Landau level thanks to the relativistic dispersion and chiral spin/pseudospin texture. 2D Dirac states discovered so far are exclusively pinned at high-symmetry points of the Brillouin zone, for example, surface Dirac states at $\overline{\Gamma}$ in topological insulators Bi$_2$Se(Te)$_3$ and Dirac cones at $K$ and $K'$ in graphene. In this work, we report the realization of 2D Dirac states at generic $k$-points in antimony atomic layers with phosphorene structure (i.e., $α$-antimonene). The unpinned nature enables versatile ways to control the locations of the Dirac points in momentum space. In addition, dispersions around the unpinned Dirac points exhibit intrinsically anisotropic behaviors due to the reduced symmetry of generic momentum points. These properties make the $α$-antimonene films a promising platform for exploring interesting physics in unpinned 2D Dirac fermions that are distinct from the conventional Dirac states in graphene.
Submitted 18 October, 2021; v1 submitted 10 October, 2021;
originally announced October 2021.
-
Improving Dialogue State Tracking by Joint Slot Modeling
Authors:
Ting-Rui Chiang,
Yi-Ting Yeh
Abstract:
Dialogue state tracking models play an important role in a task-oriented dialogue system. However, most of them model the slot types as conditionally independent given the input. We discover that this may cause the model to be confused by slot types that share the same data type. To mitigate this issue, we propose TripPy-MRF and TripPy-LSTM, which model the slots jointly. Our results show that they are able to alleviate the confusion mentioned above, and they push the state of the art on the MultiWoZ 2.1 dataset from 58.7 to 61.3. Our implementation is available at https://github.com/CTinRay/Trippy-Joint.
Submitted 14 November, 2021; v1 submitted 28 September, 2021;
originally announced September 2021.
-
Relating Neural Text Degeneration to Exposure Bias
Authors:
Ting-Rui Chiang,
Yun-Nung Chen
Abstract:
This work focuses on relating two mysteries in neural-based text generation: exposure bias and text degeneration. Despite the long time since exposure bias was first mentioned and the numerous studies on its remedy, to our knowledge, its impact on text generation has not yet been verified. Text degeneration is a problem that the widely used pre-trained language model GPT-2 was recently found to suffer from (Holtzman et al., 2020). Motivated by the unknown cause of text degeneration, in this paper we attempt to relate these two mysteries. Specifically, we first qualitatively and quantitatively identify mistakes made before text degeneration occurs. Then we investigate the significance of the mistakes by inspecting the hidden states in GPT-2. Our results show that text degeneration is likely to be partly caused by exposure bias. We also study the self-reinforcing mechanism of text degeneration, explaining why the mistakes amplify. In sum, our study provides a more concrete foundation for further investigation of exposure bias and text degeneration problems.
Submitted 17 September, 2021;
originally announced September 2021.
-
Kramers-Weyl fermions in the chiral charge density wave material (TaSe$_4$)$_2$I
Authors:
Soyeun Kim,
Robert C. McKay,
Nina Bielinski,
Chengxi Zhao,
Meng-Kai Lin,
Joseph A. Hlevyack,
Xuefei Guo,
Sung-Kwan Mo,
Peter Abbamonte,
Tai-Chang Chiang,
André Schleife,
Daniel P. Shoemaker,
Barry Bradlyn,
Fahad Mahmood
Abstract:
The quasi-one-dimensional chiral charge density wave (CDW) material (TaSe$_4$)$_2$I has been recently predicted to host Kramers-Weyl (KW) fermions which should exist in the vicinity of high symmetry points in the Brillouin zone in chiral materials with strong spin-orbit coupling. However, direct spectroscopic evidence of KW fermions is limited. Here we use helicity-dependent laser-based angle resolved photoemission spectroscopy (ARPES) in conjunction with tight-binding and first-principles calculations to identify KW fermions in (TaSe$_4$)$_2$I. We find that topological and symmetry considerations place distinct constraints on the (pseudo-) spin texture and the observed spectra around a KW node. We further reveal an interplay between the spin texture around the chiral KW node and the onset of CDW order in (TaSe$_4$)$_2$I. Our findings highlight the unique topological nature of (TaSe$_4$)$_2$I and provide a pathway for identifying KW fermions in other chiral materials.
Submitted 24 August, 2021;
originally announced August 2021.
-
Why Can You Lay Off Heads? Investigating How BERT Heads Transfer
Authors:
Ting-Rui Chiang,
Yun-Nung Chen
Abstract:
The huge size of the widely used BERT family models has led to recent efforts on model distillation. The main goal of distillation is to create a task-agnostic pre-trained model that can be fine-tuned on downstream tasks without fine-tuning its full-sized version. Despite the progress of distillation, to what degree and for what reason a task-agnostic model can be created from distillation has not been well studied. The mechanisms behind transfer learning of those BERT models are not well investigated either. Therefore, this work focuses on analyzing the acceptable performance deduction during distillation in order to guide future distillation procedures. Specifically, we first inspect the prunability of the Transformer heads in RoBERTa and ALBERT using the head importance estimation proposed by Michel et al. (2019), and then check the coherence of the important heads between the pre-trained task and downstream tasks. Hence, the acceptable deduction of performance on the pre-trained task when distilling a model can be derived from the results, and we further compare the behavior of the pruned model before and after fine-tuning. Our studies provide guidance for future directions in BERT family model distillation.
Submitted 13 June, 2021;
originally announced June 2021.