-
Persona Features Control Emergent Misalignment
Authors:
Miles Wang,
Tom Dupré la Tour,
Olivia Watkins,
Alex Makelov,
Ryan A. Chi,
Samuel Miserendino,
Johannes Heidecke,
Tejal Patwardhan,
Dan Mossing
Abstract:
Understanding how language models generalize behaviors from their training to a broader deployment distribution is an important problem in AI safety. Betley et al. discovered that fine-tuning GPT-4o on intentionally insecure code causes "emergent misalignment," where models give stereotypically malicious responses to unrelated prompts. We extend this work, demonstrating emergent misalignment acros…
▽ More
Understanding how language models generalize behaviors from their training to a broader deployment distribution is an important problem in AI safety. Betley et al. discovered that fine-tuning GPT-4o on intentionally insecure code causes "emergent misalignment," where models give stereotypically malicious responses to unrelated prompts. We extend this work, demonstrating emergent misalignment across diverse conditions, including reinforcement learning on reasoning models, fine-tuning on various synthetic datasets, and in models without safety training. To investigate the mechanisms behind this generalized misalignment, we apply a "model diffing" approach using sparse autoencoders to compare internal model representations before and after fine-tuning. This approach reveals several "misaligned persona" features in activation space, including a toxic persona feature which most strongly controls emergent misalignment and can be used to predict whether a model will exhibit such behavior. Additionally, we investigate mitigation strategies, discovering that fine-tuning an emergently misaligned model on just a few hundred benign samples efficiently restores alignment.
△ Less
Submitted 24 June, 2025;
originally announced June 2025.
-
modeLing: A Novel Dataset for Testing Linguistic Reasoning in Language Models
Authors:
Nathan A. Chi,
Teodor Malchev,
Riley Kong,
Ryan A. Chi,
Lucas Huang,
Ethan A. Chi,
R. Thomas McCoy,
Dragomir Radev
Abstract:
We introduce modeLing, a novel benchmark of Linguistics Olympiad-style puzzles which tests few-shot reasoning in AI systems. Solving these puzzles necessitates inferring aspects of a language's grammatical structure from a small number of examples. Such puzzles provide a natural testbed for language models, as they require compositional generalization and few-shot inductive reasoning. Consisting s…
▽ More
We introduce modeLing, a novel benchmark of Linguistics Olympiad-style puzzles which tests few-shot reasoning in AI systems. Solving these puzzles necessitates inferring aspects of a language's grammatical structure from a small number of examples. Such puzzles provide a natural testbed for language models, as they require compositional generalization and few-shot inductive reasoning. Consisting solely of new puzzles written specifically for this work, modeLing has no risk of appearing in the training data of existing AI systems: this ameliorates the risk of data leakage, a potential confounder for many prior evaluations of reasoning. Evaluating several large open source language models and GPT on our benchmark, we observe non-negligible accuracy, demonstrating few-shot emergent reasoning ability which cannot merely be attributed to shallow memorization. However, imperfect model performance suggests that modeLing can be used to measure further progress in linguistic reasoning.
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages
Authors:
Andrew M. Bean,
Simi Hellsten,
Harry Mayne,
Jabez Magomere,
Ethan A. Chi,
Ryan Chi,
Scott A. Hale,
Hannah Rose Kirk
Abstract:
In this paper, we present the LingOly benchmark, a novel benchmark for advanced reasoning abilities in large language models. Using challenging Linguistic Olympiad puzzles, we evaluate (i) capabilities for in-context identification and generalisation of linguistic patterns in very low-resource or extinct languages, and (ii) abilities to follow complex task instructions. The LingOly benchmark cover…
▽ More
In this paper, we present the LingOly benchmark, a novel benchmark for advanced reasoning abilities in large language models. Using challenging Linguistic Olympiad puzzles, we evaluate (i) capabilities for in-context identification and generalisation of linguistic patterns in very low-resource or extinct languages, and (ii) abilities to follow complex task instructions. The LingOly benchmark covers more than 90 mostly low-resource languages, minimising issues of data contamination, and contains 1,133 problems across 6 formats and 5 levels of human difficulty. We assess performance with both direct accuracy and comparison to a no-context baseline to penalise memorisation. Scores from 11 state-of-the-art LLMs demonstrate the benchmark to be challenging, and models perform poorly on the higher difficulty problems. On harder problems, even the top model only achieved 38.7% accuracy, a 24.7% improvement over the no-context baseline. Large closed models typically outperform open models, and in general, the higher resource the language, the better the scores. These results indicate, in absence of memorisation, true multi-step out-of-domain reasoning remains a challenge for current language models.
△ Less
Submitted 31 October, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
Premise Order Matters in Reasoning with Large Language Models
Authors:
Xinyun Chen,
Ryan A. Chi,
Xuezhi Wang,
Denny Zhou
Abstract:
Large language models (LLMs) have accomplished remarkable reasoning performance in various domains. However, in the domain of reasoning tasks, we discover a frailty: LLMs are surprisingly brittle to the ordering of the premises, despite the fact that such ordering does not alter the underlying task. In particular, we observe that LLMs achieve the best performance when the premise order aligns with…
▽ More
Large language models (LLMs) have accomplished remarkable reasoning performance in various domains. However, in the domain of reasoning tasks, we discover a frailty: LLMs are surprisingly brittle to the ordering of the premises, despite the fact that such ordering does not alter the underlying task. In particular, we observe that LLMs achieve the best performance when the premise order aligns with the context required in intermediate reasoning steps. For example, in deductive reasoning tasks, presenting the premises in the same order as the ground truth proof in the prompt (as opposed to random ordering) drastically increases the model's accuracy. We first examine the effect of premise ordering on deductive reasoning on a variety of LLMs, and our evaluation shows that permuting the premise order can cause a performance drop of over 30%. In addition, we release the benchmark R-GSM, based on GSM8K, to examine the ordering effect for mathematical problem-solving, and we again observe a significant drop in accuracy, relative to the original GSM8K benchmark.
△ Less
Submitted 28 May, 2024; v1 submitted 13 February, 2024;
originally announced February 2024.
-
BLESS: Benchmarking Large Language Models on Sentence Simplification
Authors:
Tannon Kew,
Alison Chi,
Laura Vásquez-Rodríguez,
Sweta Agrawal,
Dennis Aumiller,
Fernando Alva-Manchego,
Matthew Shardlow
Abstract:
We present BLESS, a comprehensive performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS). We examine how well off-the-shelf LLMs can solve this challenging task, assessing a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news,…
▽ More
We present BLESS, a comprehensive performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS). We examine how well off-the-shelf LLMs can solve this challenging task, assessing a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting. Our analysis considers a suite of automatic metrics as well as a large-scale quantitative investigation into the types of common edit operations performed by the different models. Furthermore, we perform a manual qualitative analysis on a subset of model outputs to better gauge the quality of the generated simplifications. Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines. Additionally, we find that certain LLMs demonstrate a greater range and diversity of edit operations. Our performance benchmark will be available as a resource for the development of future TS methods and evaluation metrics.
△ Less
Submitted 24 October, 2023;
originally announced October 2023.
-
Learning to Paraphrase Sentences to Different Complexity Levels
Authors:
Alison Chi,
Li-Kuang Chen,
Yi-Chen Chang,
Shu-Hui Lee,
Jason S. Chang
Abstract:
While sentence simplification is an active research topic in NLP, its adjacent tasks of sentence complexification and same-level paraphrasing are not. To train models on all three tasks, we present two new unsupervised datasets. We compare these datasets, one labeled by a weak classifier and the other by a rule-based approach, with a single supervised dataset. Using these three datasets for traini…
▽ More
While sentence simplification is an active research topic in NLP, its adjacent tasks of sentence complexification and same-level paraphrasing are not. To train models on all three tasks, we present two new unsupervised datasets. We compare these datasets, one labeled by a weak classifier and the other by a rule-based approach, with a single supervised dataset. Using these three datasets for training, we perform extensive experiments on both multitasking and prompting strategies. Compared to other systems trained on unsupervised parallel data, models trained on our weak classifier labeled dataset achieve state-of-the-art performance on the ASSET simplification benchmark. Our models also outperform previous work on sentence level targeting. Finally, we establish how a handful of Large Language Models perform on these tasks under a zero-shot setting.
△ Less
Submitted 4 August, 2023;
originally announced August 2023.
-
Multiobjective Logistics Optimization for Automated ATM Cash Replenishment Process
Authors:
Bui Tien Thanh,
Dinh Van Tuan,
Tuan Anh Chi,
Nguyen Van Dai,
Nguyen Tai Quang Dinh,
Nguyen Thu Thuy,
Nguyen Thi Xuan Hoa
Abstract:
In the digital transformation era, integrating digital technology into every aspect of banking operations improves process automation, cost efficiency, and service level improvement. Although logistics for ATM cash is a crucial task that impacts operating costs and consumer satisfaction, there has been little effort to enhance it. Specifically, in Vietnam, with a market of more than 20,000 ATMs na…
▽ More
In the digital transformation era, integrating digital technology into every aspect of banking operations improves process automation, cost efficiency, and service level improvement. Although logistics for ATM cash is a crucial task that impacts operating costs and consumer satisfaction, there has been little effort to enhance it. Specifically, in Vietnam, with a market of more than 20,000 ATMs nationally, research and technological solutions that can resolve this issue remain scarce. In this paper, we generalized the vehicle routing problem for ATM cash replenishment, suggested a mathematical model and then offered a tool to evaluate various situations. When being evaluated on the simulated dataset, our proposed model and method produced encouraging results with the benefits of cutting ATM cash operating costs.
△ Less
Submitted 22 July, 2023; v1 submitted 23 April, 2023;
originally announced April 2023.
-
Improving Radiology Report Generation Systems by Removing Hallucinated References to Non-existent Priors
Authors:
Vignav Ramesh,
Nathan Andrew Chi,
Pranav Rajpurkar
Abstract:
Current deep learning models trained to generate radiology reports from chest radiographs are capable of producing clinically accurate, clear, and actionable text that can advance patient care. However, such systems all succumb to the same problem: making hallucinated references to non-existent prior reports. Such hallucinations occur because these models are trained on datasets of real-world pati…
▽ More
Current deep learning models trained to generate radiology reports from chest radiographs are capable of producing clinically accurate, clear, and actionable text that can advance patient care. However, such systems all succumb to the same problem: making hallucinated references to non-existent prior reports. Such hallucinations occur because these models are trained on datasets of real-world patient reports that inherently refer to priors. To this end, we propose two methods to remove references to priors in radiology reports: (1) a GPT-3-based few-shot approach to rewrite medical reports without references to priors; and (2) a BioBERT-based token classification approach to directly remove words referring to priors. We use the aforementioned approaches to modify MIMIC-CXR, a publicly available dataset of chest X-rays and their associated free-text radiology reports; we then retrain CXR-RePaiR, a radiology report generation system, on the adapted MIMIC-CXR dataset. We find that our re-trained model--which we call CXR-ReDonE--outperforms previous report generation methods on clinical metrics, achieving an average BERTScore of 0.2351 (2.57% absolute improvement). We expect our approach to be broadly valuable in enabling current radiology report generation systems to be more directly integrated into clinical pipelines.
△ Less
Submitted 13 October, 2022; v1 submitted 26 September, 2022;
originally announced October 2022.
-
Neural Generation Meets Real People: Building a Social, Informative Open-Domain Dialogue Agent
Authors:
Ethan A. Chi,
Ashwin Paranjape,
Abigail See,
Caleb Chiam,
Trenton Chang,
Kathleen Kenealy,
Swee Kiat Lim,
Amelia Hardy,
Chetanya Rastogi,
Haojun Li,
Alexander Iyabor,
Yutong He,
Hari Sowrirajan,
Peng Qi,
Kaushik Ram Sadagopan,
Nguyet Minh Phu,
Dilara Soylu,
Jillian Tang,
Avanika Narayan,
Giovanni Campagna,
Christopher D. Manning
Abstract:
We present Chirpy Cardinal, an open-domain social chatbot. Aiming to be both informative and conversational, our bot chats with users in an authentic, emotionally intelligent way. By integrating controlled neural generation with scaffolded, hand-written dialogue, we let both the user and bot take turns driving the conversation, producing an engaging and socially fluent experience. Deployed in the…
▽ More
We present Chirpy Cardinal, an open-domain social chatbot. Aiming to be both informative and conversational, our bot chats with users in an authentic, emotionally intelligent way. By integrating controlled neural generation with scaffolded, hand-written dialogue, we let both the user and bot take turns driving the conversation, producing an engaging and socially fluent experience. Deployed in the fourth iteration of the Alexa Prize Socialbot Grand Challenge, Chirpy Cardinal handled thousands of conversations per day, placing second out of nine bots with an average user rating of 3.58/5.
△ Less
Submitted 16 January, 2023; v1 submitted 25 July, 2022;
originally announced July 2022.
-
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Authors:
Aarohi Srivastava,
Abhinav Rastogi,
Abhishek Rao,
Abu Awal Md Shoeb,
Abubakar Abid,
Adam Fisch,
Adam R. Brown,
Adam Santoro,
Aditya Gupta,
Adrià Garriga-Alonso,
Agnieszka Kluska,
Aitor Lewkowycz,
Akshat Agarwal,
Alethea Power,
Alex Ray,
Alex Warstadt,
Alexander W. Kocurek,
Ali Safaya,
Ali Tazarv,
Alice Xiang,
Alicia Parrish,
Allen Nie,
Aman Hussain,
Amanda Askell,
Amanda Dsouza
, et al. (426 additional authors not shown)
Abstract:
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…
▽ More
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
△ Less
Submitted 12 June, 2023; v1 submitted 9 June, 2022;
originally announced June 2022.
-
Wrist-Squeezing Force Feedback Improves Accuracy and Speed in Robotic Surgery Training
Authors:
Sergio Machaca,
Eric Cao,
Amy Chi,
Gina Adrales,
Katherine J Kuchenbecker,
Jeremy D Brown
Abstract:
Current robotic minimally invasive surgery (RMIS) platforms provide surgeons with no haptic feedback of the robot's physical interactions. This limitation forces surgeons to rely heavily on visual feedback and can make it challenging for surgical trainees to manipulate tissue gently. Prior research has demonstrated that haptic feedback can increase task accuracy in RMIS training. However, it remai…
▽ More
Current robotic minimally invasive surgery (RMIS) platforms provide surgeons with no haptic feedback of the robot's physical interactions. This limitation forces surgeons to rely heavily on visual feedback and can make it challenging for surgical trainees to manipulate tissue gently. Prior research has demonstrated that haptic feedback can increase task accuracy in RMIS training. However, it remains unclear whether these improvements represent a fundamental improvement in skill, or if they simply stem from re-prioritizing accuracy over task completion time. In this study, we provide haptic feedback of the force applied by the surgical instruments using custom wrist-squeezing devices. We hypothesize that individuals receiving haptic feedback will increase accuracy (produce less force) while increasing their task completion time, compared to a control group receiving no haptic feedback. To test this hypothesis, N=21 novice participants were asked to repeatedly complete a ring rollercoaster surgical training task as quickly as possible. Results show that participants receiving haptic feedback apply significantly less force (0.67 N) than the control group, and they complete the task no faster or slower than the control group after twelve repetitions. Furthermore, participants in the feedback group decreased their task completion times significantly faster (7.68%) than participants in the control group (5.26%). This form of haptic feedback, therefore, has the potential to help trainees improve their technical accuracy without compromising speed.
△ Less
Submitted 31 March, 2023; v1 submitted 13 May, 2022;
originally announced May 2022.
-
Classifying Autism from Crowdsourced Semi-Structured Speech Recordings: A Machine Learning Approach
Authors:
Nathan A. Chi,
Peter Washington,
Aaron Kline,
Arman Husic,
Cathy Hou,
Chloe He,
Kaitlyn Dunlap,
Dennis Wall
Abstract:
Autism spectrum disorder (ASD) is a neurodevelopmental disorder which results in altered behavior, social development, and communication patterns. In past years, autism prevalence has tripled, with 1 in 54 children now affected. Given that traditional diagnosis is a lengthy, labor-intensive process, significant attention has been given to developing systems that automatically screen for autism. Pr…
▽ More
Autism spectrum disorder (ASD) is a neurodevelopmental disorder which results in altered behavior, social development, and communication patterns. In past years, autism prevalence has tripled, with 1 in 54 children now affected. Given that traditional diagnosis is a lengthy, labor-intensive process, significant attention has been given to developing systems that automatically screen for autism. Prosody abnormalities are among the clearest signs of autism, with affected children displaying speech idiosyncrasies including echolalia, monotonous intonation, atypical pitch, and irregular linguistic stress patterns. In this work, we present a suite of machine learning approaches to detect autism in self-recorded speech audio captured from autistic and neurotypical (NT) children in home environments. We consider three methods to detect autism in child speech: first, Random Forests trained on extracted audio features (including Mel-frequency cepstral coefficients); second, convolutional neural networks (CNNs) trained on spectrograms; and third, fine-tuned wav2vec 2.0--a state-of-the-art Transformer-based ASR model. We train our classifiers on our novel dataset of cellphone-recorded child speech audio curated from Stanford's Guess What? mobile game, an app designed to crowdsource videos of autistic and neurotypical children in a natural home environment. The Random Forest classifier achieves 70% accuracy, the fine-tuned wav2vec 2.0 model achieves 77% accuracy, and the CNN achieves 79% accuracy when classifying children's audio as either ASD or NT. Our models were able to predict autism status when training on a varied selection of home audio clips with inconsistent recording quality, which may be more generalizable to real world conditions. These results demonstrate that machine learning methods offer promise in detecting autism automatically from speech without specialized equipment.
△ Less
Submitted 3 January, 2022;
originally announced January 2022.
-
Preliminary investigation into how limb choice affects kinesthetic perception
Authors:
Mohit Singhala,
Amy Chi,
Maria Coleman,
Jeremy D. Brown
Abstract:
We have a limited understanding of how we integrate haptic information in real-time from our upper limbs to perform complex bimanual tasks, an ability that humans routinely employ to perform tasks of varying levels of difficulty. In order to understand how information from both limbs is used to create a unified percept, it is important to study both the limbs separately first. Prevalent theories h…
▽ More
We have a limited understanding of how we integrate haptic information in real-time from our upper limbs to perform complex bimanual tasks, an ability that humans routinely employ to perform tasks of varying levels of difficulty. In order to understand how information from both limbs is used to create a unified percept, it is important to study both the limbs separately first. Prevalent theories highlighting the role of central nervous system (CNS) in accounting for internal body dynamics seem to suggest that both upper limbs should be equally sensitive to external stimuli. However, there is empirical proof demonstrating a perceptual difference in our upper limbs for tasks like shape discrimination, prompting the need to study effects of limb choice on kinesthetic perception. In this manuscript, we start evaluating Just Noticeable Difference (JND) for stiffness for both forearms separately. Early results validate the need for a more thorough investigation of limb choice on kinesthetic perception.
△ Less
Submitted 22 July, 2021;
originally announced July 2021.
-
Deep Subjecthood: Higher-Order Grammatical Features in Multilingual BERT
Authors:
Isabel Papadimitriou,
Ethan A. Chi,
Richard Futrell,
Kyle Mahowald
Abstract:
We investigate how Multilingual BERT (mBERT) encodes grammar by examining how the high-order grammatical feature of morphosyntactic alignment (how different languages define what counts as a "subject") is manifested across the embedding spaces of different languages. To understand if and how morphosyntactic alignment affects contextual embedding spaces, we train classifiers to recover the subjecth…
▽ More
We investigate how Multilingual BERT (mBERT) encodes grammar by examining how the high-order grammatical feature of morphosyntactic alignment (how different languages define what counts as a "subject") is manifested across the embedding spaces of different languages. To understand if and how morphosyntactic alignment affects contextual embedding spaces, we train classifiers to recover the subjecthood of mBERT embeddings in transitive sentences (which do not contain overt information about morphosyntactic alignment) and then evaluate them zero-shot on intransitive sentences (where subjecthood classification depends on alignment), within and across languages. We find that the resulting classifier distributions reflect the morphosyntactic alignment of their training languages. Our results demonstrate that mBERT representations are influenced by high-level grammatical features that are not manifested in any one input sentence, and that this is robust across languages. Further examining the characteristics that our classifiers rely on, we find that features such as passive voice, animacy and case strongly correlate with classification decisions, suggesting that mBERT does not encode subjecthood purely syntactically, but that subjecthood embedding is continuous and dependent on semantic and discourse factors, as is proposed in much of the functional linguistics literature. Together, these results provide insight into how grammatical features manifest in contextual embedding spaces, at a level of abstraction not covered by previous work.
△ Less
Submitted 26 January, 2021;
originally announced January 2021.
-
Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment
Authors:
Ethan A. Chi,
Julian Salazar,
Katrin Kirchhoff
Abstract:
Non-autoregressive models greatly improve decoding speed over typical sequence-to-sequence models, but suffer from degraded performance. Infilling and iterative refinement models make up some of this gap by editing the outputs of a non-autoregressive model, but are constrained in the edits that they can make. We propose iterative realignment, where refinements occur over latent alignments rather t…
▽ More
Non-autoregressive models greatly improve decoding speed over typical sequence-to-sequence models, but suffer from degraded performance. Infilling and iterative refinement models make up some of this gap by editing the outputs of a non-autoregressive model, but are constrained in the edits that they can make. We propose iterative realignment, where refinements occur over latent alignments rather than output sequence space. We demonstrate this in speech recognition with Align-Refine, an end-to-end Transformer-based model which refines connectionist temporal classification (CTC) alignments to allow length-changing insertions and deletions. Align-Refine outperforms Imputer and Mask-CTC, matching an autoregressive baseline on WSJ at 1/14th the real-time factor and attaining a LibriSpeech test-other WER of 9.0% without an LM. Our model is strong even in one iteration with a shallower decoder.
△ Less
Submitted 24 October, 2020;
originally announced October 2020.
-
Finding Universal Grammatical Relations in Multilingual BERT
Authors:
Ethan A. Chi,
John Hewitt,
Christopher D. Manning
Abstract:
Recent work has found evidence that Multilingual BERT (mBERT), a transformer-based multilingual masked language model, is capable of zero-shot cross-lingual transfer, suggesting that some aspects of its representations are shared cross-lingually. To better understand this overlap, we extend recent work on finding syntactic trees in neural networks' internal representations to the multilingual sett…
▽ More
Recent work has found evidence that Multilingual BERT (mBERT), a transformer-based multilingual masked language model, is capable of zero-shot cross-lingual transfer, suggesting that some aspects of its representations are shared cross-lingually. To better understand this overlap, we extend recent work on finding syntactic trees in neural networks' internal representations to the multilingual setting. We show that subspaces of mBERT representations recover syntactic tree distances in languages other than English, and that these subspaces are approximately shared across languages. Motivated by these results, we present an unsupervised analysis method that provides evidence mBERT learns representations of syntactic dependency labels, in the form of clusters which largely agree with the Universal Dependencies taxonomy. This evidence suggests that even without explicit supervision, multilingual masked language models learn certain linguistic universals.
△ Less
Submitted 20 May, 2020; v1 submitted 9 May, 2020;
originally announced May 2020.
-
SGVAE: Sequential Graph Variational Autoencoder
Authors:
Bowen Jing,
Ethan A. Chi,
Jillian Tang
Abstract:
Generative models of graphs are well-known, but many existing models are limited in scalability and expressivity. We present a novel sequential graphical variational autoencoder operating directly on graphical representations of data. In our model, the encoding and decoding of a graph as is framed as a sequential deconstruction and construction process, respectively, enabling the the learning of a…
▽ More
Generative models of graphs are well-known, but many existing models are limited in scalability and expressivity. We present a novel sequential graphical variational autoencoder operating directly on graphical representations of data. In our model, the encoding and decoding of a graph as is framed as a sequential deconstruction and construction process, respectively, enabling the the learning of a latent space. Experiments on a cycle dataset show promise, but highlight the need for a relaxation of the distribution over node permutations.
△ Less
Submitted 16 December, 2019;
originally announced December 2019.
-
Limitless HTTP in an HTTPS World: Inferring the Semantics of the HTTPS Protocol without Decryption
Authors:
Blake Anderson,
Andrew Chi,
Scott Dunlop,
David McGrew
Abstract:
We present new analytic techniques for inferring HTTP semantics from passive observations of HTTPS that can infer the value of important fields including the status-code, Content-Type, and Server, and the presence or absence of several additional HTTP header fields, e.g., Cookie and Referer. Our goals are twofold: to better understand the limitations of the confidentiality of HTTPS, and to explore…
▽ More
We present new analytic techniques for inferring HTTP semantics from passive observations of HTTPS that can infer the value of important fields including the status-code, Content-Type, and Server, and the presence or absence of several additional HTTP header fields, e.g., Cookie and Referer. Our goals are twofold: to better understand the limitations of the confidentiality of HTTPS, and to explore benign uses of traffic analysis such as application troubleshooting and malware detection that could replace HTTPS interception and static private keys in some scenarios. We found that our techniques improve the efficacy of malware detection, but they do not enable more powerful website fingerprinting attacks against Tor. Our broader set of results raises concerns about the confidentiality goals of TLS relative to a user's expectation of privacy, warranting future research.
We apply our methods to the semantics of both HTTP/1.1 and HTTP/2 on data collected from automated runs of Firefox 58.0, Chrome 63.0, and Tor Browser 7.0.11 in a lab setting, and from applications running in a malware sandbox. We obtain ground truth plaintext for a diverse set of applications from the malware sandbox by extracting the key material needed for decryption from RAM post-execution. We developed an iterative approach to simultaneously solve several multi-class (field values) and binary (field presence) classification problems, and we show that our inference algorithm achieves an unweighted $F_1$ score greater than 0.900 for most HTTP fields examined.
△ Less
Submitted 29 May, 2018;
originally announced May 2018.
-
Server-side verification of client behavior in cryptographic protocols
Authors:
Andrew Chi,
Robert Cochran,
Marie Nesfield,
Michael K. Reiter,
Cynthia Sturton
Abstract:
Numerous exploits of client-server protocols and applications involve modifying clients to behave in ways that untampered clients would not, such as crafting malicious packets. In this paper, we demonstrate practical verification of a cryptographic protocol client's messaging behavior as being consistent with the client program it is believed to be running. Moreover, we accomplish this without mod…
▽ More
Numerous exploits of client-server protocols and applications involve modifying clients to behave in ways that untampered clients would not, such as crafting malicious packets. In this paper, we demonstrate practical verification of a cryptographic protocol client's messaging behavior as being consistent with the client program it is believed to be running. Moreover, we accomplish this without modifying the client in any way, and without knowing all of the client-side inputs driving its behavior. Our toolchain for verifying a client's messages explores multiple candidate execution paths in the client concurrently, an innovation that we show is both specifically useful for cryptographic protocol clients and more generally useful for client applications of other types, as well. In addition, our toolchain includes a novel approach to symbolically executing the client software in multiple passes that defers expensive functions until their inputs can be inferred and concretized. We demonstrate client verification on OpenSSL to show that, e.g., Heartbleed exploits can be detected without Heartbleed-specific filtering and within seconds of the first malicious packet, and that verification of legitimate clients can keep pace with, e.g., Gmail workloads.
△ Less
Submitted 13 March, 2016;
originally announced March 2016.