Showing 1–50 of 72 results for author: Schneider, N

  1. arXiv:2504.20304  [pdf, other]

    cs.CL cs.AI

    UD-English-CHILDES: A Collected Resource of Gold and Silver Universal Dependencies Trees for Child Language Interactions

    Authors: Xiulin Yang, Zhuoxuan Ju, Lanni Bu, Zoey Liu, Nathan Schneider

    Abstract: CHILDES is a widely used resource of transcribed child and child-directed speech. This paper introduces UD-English-CHILDES, the first officially released Universal Dependencies (UD) treebank derived from previously dependency-annotated CHILDES data with consistent and unified annotation guidelines. Our corpus harmonizes annotations from 11 children and their caregivers, totaling over 48k sentences…

    Submitted 4 May, 2025; v1 submitted 28 April, 2025; originally announced April 2025.

  2. arXiv:2504.01162  [pdf, ps, other]

    cs.IR

    Information Retrieval for Climate Impact

    Authors: Maarten de Rijke, Bart van den Hurk, Flora Salim, Alaa Al Khourdajie, Nan Bai, Renato Calzone, Declan Curran, Getnet Demil, Lesley Frew, Noah Gießing, Mukesh Kumar Gupta, Maria Heuss, Sanaa Hobeichi, David Huard, Jingwei Kang, Ana Lucic, Tanwi Mallick, Shruti Nath, Andrew Okem, Barbara Pernici, Thilina Rajapakse, Hira Saleem, Harry Scells, Nicole Schneider, Damiano Spina, et al. (6 additional authors not shown)

    Abstract: The purpose of the MANILA24 Workshop on information retrieval for climate impact was to bring together researchers from academia, industry, governments, and NGOs to identify and discuss core research problems in information retrieval to assess climate change impacts. The workshop aimed to foster collaboration by bringing communities together that have so far not been very well connected -- informa…

    Submitted 1 April, 2025; originally announced April 2025.

    Comments: Report on the MANILA24 Workshop

    ACM Class: H.3.3

  3. arXiv:2503.18751  [pdf, other]

    cs.CL cs.AI

    Construction Identification and Disambiguation Using BERT: A Case Study of NPN

    Authors: Wesley Scivetti, Nathan Schneider

    Abstract: Construction Grammar hypothesizes that knowledge of a language consists chiefly of knowledge of form-meaning pairs ("constructions") that include vocabulary, general grammar rules, and even idiosyncratic patterns. Recent work has shown that transformer language models represent at least some constructional patterns, including ones where the construction is rare overall. In this work, we probe BE…

    Submitted 24 March, 2025; originally announced March 2025.

    Comments: 8 pages, ACL long-paper format (preprint)

  4. arXiv:2411.14003  [pdf, other]

    cs.LG stat.ML

    Generative Intervention Models for Causal Perturbation Modeling

    Authors: Nora Schneider, Lars Lorch, Niki Kilbertus, Bernhard Schölkopf, Andreas Krause

    Abstract: We consider the problem of predicting perturbation effects via causal models. In many applications, it is a priori unknown which mechanisms of a system are modified by an external perturbation, even though the features of the perturbation are available. For example, in genomics, some properties of a drug may be known, but not their causal effects on the regulatory pathways of cells. We propose a g…

    Submitted 21 November, 2024; originally announced November 2024.

  5. arXiv:2405.05966  [pdf, other]

    cs.CL cs.AI

    Natural Language Processing RELIES on Linguistics

    Authors: Juri Opitz, Shira Wein, Nathan Schneider

    Abstract: Large Language Models (LLMs) have become capable of generating highly fluent text in certain languages, without modules specially designed to capture grammar or semantic coherence. What does this mean for the future of linguistic expertise in NLP? We highlight several aspects in which NLP (still) relies on linguistics, or where linguistic thinking can illuminate new directions. We argue our case a…

    Submitted 10 March, 2025; v1 submitted 9 May, 2024; originally announced May 2024.

    Comments: To appear in Computational Linguistics. This is a pre-MIT Press publication version

  6. arXiv:2403.17748  [pdf, other]

    cs.CL

    UCxn: Typologically Informed Annotation of Constructions Atop Universal Dependencies

    Authors: Leonie Weissweiler, Nina Böbel, Kirian Guiller, Santiago Herrera, Wesley Scivetti, Arthur Lorenzi, Nurit Melnik, Archna Bhatia, Hinrich Schütze, Lori Levin, Amir Zeldes, Joakim Nivre, William Croft, Nathan Schneider

    Abstract: The Universal Dependencies (UD) project has created an invaluable collection of treebanks with contributions in over 140 languages. However, the UD annotations do not tell the full story. Grammatical constructions that convey meaning through a particular combination of several morphosyntactic elements -- for example, interrogative sentences with special markers and/or word orders -- are not labele…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: LREC-COLING 2024

  7. arXiv:2403.14273  [pdf, other]

    cs.NE cs.AI

    Reactor Optimization Benchmark by Reinforcement Learning

    Authors: Deborah Schwarcz, Nadav Schneider, Gal Oren, Uri Steinitz

    Abstract: Neutronic calculations for reactors are a daunting task when using Monte Carlo (MC) methods. As high-performance computing has advanced, the simulation of a reactor is nowadays more readily done, but design and optimization with multiple parameters is still a computational challenge. MC transport simulations, coupled with machine learning techniques, offer promising avenues for enhancing the effic…

    Submitted 21 March, 2024; originally announced March 2024.

  8. arXiv:2402.09126  [pdf, other]

    cs.DC cs.AI cs.CL cs.LG cs.SE

    MPIrigen: MPI Code Generation through Domain-Specific Language Models

    Authors: Nadav Schneider, Niranjan Hasabnis, Vy A. Vo, Tal Kadosh, Neva Krien, Mihai Capotă, Guy Tamir, Ted Willke, Nesreen Ahmed, Yuval Pinter, Timothy Mattson, Gal Oren

    Abstract: The imperative need to scale computation across numerous nodes highlights the significance of efficient parallel computing, particularly in the realm of Message Passing Interface (MPI) integration. The challenging parallel programming task of generating MPI-based parallel programs has remained unexplored. This study first investigates the performance of state-of-the-art language models in generati…

    Submitted 23 April, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

  9. arXiv:2312.13322  [pdf, ps, other]

    cs.PL cs.AI cs.LG cs.SE

    MonoCoder: Domain-Specific Code Language Model for HPC Codes and Tasks

    Authors: Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Mihai Capota, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren

    Abstract: With easier access to powerful compute resources, there is a growing trend in AI for software development to develop large language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size and demand expensive compute resources for training. This is partly because LLMs for HPC tasks are obtained by finetun…

    Submitted 19 September, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

  10. arXiv:2311.06965  [pdf, other]

    cs.LG stat.ML

    Anchor Data Augmentation

    Authors: Nora Schneider, Shirin Goshtasbpour, Fernando Perez-Cruz

    Abstract: We propose a novel algorithm for data augmentation in nonlinear over-parametrized regression. Our data augmentation algorithm borrows from the literature on causality and extends the recently proposed Anchor regression (AR) method for data augmentation, which is in contrast to the current state-of-the-art domain-agnostic solutions that rely on the Mixup literature. Our Anchor Data Augmentation (AD…

    Submitted 27 November, 2023; v1 submitted 12 November, 2023; originally announced November 2023.

  11. arXiv:2311.00268  [pdf, other]

    cs.CL

    Syntactic Inductive Bias in Transformer Language Models: Especially Helpful for Low-Resource Languages?

    Authors: Luke Gessler, Nathan Schneider

    Abstract: A line of work on Transformer-based language models such as BERT has attempted to use syntactic inductive bias to enhance the pretraining process, on the theory that building syntactic structure into the training process should reduce the amount of data needed for training. But such methods are often tested for high-resource languages such as English. In this work, we investigate whether these met…

    Submitted 31 October, 2023; originally announced November 2023.

    Comments: Accepted at CoNLL 2023

  12. arXiv:2308.09440  [pdf, other]

    cs.CL cs.PL

    Scope is all you need: Transforming LLMs for HPC Code

    Authors: Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren

    Abstract: With easier access to powerful compute resources, there is a growing trend in the field of AI for software development to develop larger and larger language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size (e.g., billions of parameters) and demand expensive compute resources for training. We found…

    Submitted 29 September, 2023; v1 submitted 18 August, 2023; originally announced August 2023.

  13. arXiv:2308.08206  [pdf, other]

    cs.CV cs.AI

    Explainable Multi-View Deep Networks Methodology for Experimental Physics

    Authors: Nadav Schneider, Muriel Tzdaka, Galit Sturm, Guy Lazovski, Galit Bar, Gilad Oren, Raz Gvishi, Gal Oren

    Abstract: Physical experiments often involve multiple imaging representations, such as X-ray scans and microscopic images. Deep learning models have been widely used for supervised analysis in these experiments. Combining different image representations is frequently required to analyze and make a decision properly. Consequently, multi-view data has emerged - datasets where each sample is described by views…

    Submitted 27 July, 2024; v1 submitted 16 August, 2023; originally announced August 2023.

  14. Efficient Domain Adaptation of Sentence Embeddings Using Adapters

    Authors: Tim Schopf, Dennis N. Schneider, Florian Matthes

    Abstract: Sentence embeddings enable us to capture the semantic similarity of short texts. Most sentence embedding models are trained for general semantic textual similarity tasks. Therefore, to use sentence embeddings in a particular domain, the model must be adapted to it in order to achieve good results. Usually, this is done by fine-tuning the entire sentence embedding model for the domain of interest.…

    Submitted 24 September, 2023; v1 submitted 6 July, 2023; originally announced July 2023.

    Comments: Accepted to the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023)

    ACM Class: I.2.7

  15. arXiv:2306.00936  [pdf, other]

    cs.CL cs.IR

    AMR4NLI: Interpretable and robust NLI measures from semantic graphs

    Authors: Juri Opitz, Shira Wein, Julius Steen, Anette Frank, Nathan Schneider

    Abstract: The task of natural language inference (NLI) asks whether a given premise (expressed in NL) entails a given NL hypothesis. NLI benchmarks contain human ratings of entailment, but the meaning relationships driving these ratings are not formalized. Can the underlying sentence pair relationships be made more explicit in an interpretable yet robust fashion? We compare semantic structures to represent…

    Submitted 5 September, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: International Conference on Computational Semantics (IWCS 2023); v2 fixes an imprecise sentence below Eq. 5

  16. arXiv:2305.17347  [pdf, other]

    cs.CL

    CGELBank Annotation Manual v1.1

    Authors: Brett Reynolds, Nathan Schneider, Aryaman Arora

    Abstract: CGELBank is a treebank and associated tools based on a syntactic formalism for English derived from the Cambridge Grammar of the English Language. This document lays out the particularities of the CGELBank annotation scheme.

    Submitted 4 June, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

  17. arXiv:2305.14719  [pdf, other]

    cs.CL

    CuRIAM: Corpus re Interpretation and Metalanguage in U.S. Supreme Court Opinions

    Authors: Michael Kranzlein, Nathan Schneider, Kevin Tobia

    Abstract: Most judicial decisions involve the interpretation of legal texts; as such, judicial opinion requires the use of language as a medium to comment on or draw attention to other language. Language used this way is called metalanguage. We develop an annotation schema for categorizing types of legal metalanguage and apply our schema to a set of U.S. Supreme Court opinions, yielding a corpus totaling 59…

    Submitted 24 May, 2023; originally announced May 2023.

  18. arXiv:2305.11999  [pdf, other]

    cs.DC cs.AI cs.LG cs.PF

    Advising OpenMP Parallelization via a Graph-Based Approach with Transformers

    Authors: Tal Kadosh, Nadav Schneider, Niranjan Hasabnis, Timothy Mattson, Yuval Pinter, Gal Oren

    Abstract: There is an ever-present need for shared memory parallelization schemes to exploit the full potential of multi-core architectures. The most common parallelization API addressing this need today is OpenMP. Nevertheless, writing parallel code manually is complex and effort-intensive. Thus, many deterministic source-to-source (S2S) compilers have emerged, intending to automate the process of translat…

    Submitted 16 May, 2023; originally announced May 2023.

  19. arXiv:2305.09438  [pdf, other]

    cs.DC cs.CL cs.LG

    MPI-rical: Data-Driven MPI Distributed Parallelism Assistance with Transformers

    Authors: Nadav Schneider, Tal Kadosh, Niranjan Hasabnis, Timothy Mattson, Yuval Pinter, Gal Oren

    Abstract: Message Passing Interface (MPI) plays a crucial role in distributed memory parallelization across multiple nodes. However, parallelizing MPI code manually, and specifically, performing domain decomposition, is a challenging, error-prone task. In this paper, we address this problem by developing MPI-RICAL, a novel data-driven, programming-assistance tool that assists programmers in writing domain d…

    Submitted 30 August, 2023; v1 submitted 16 May, 2023; originally announced May 2023.

  20. arXiv:2305.03041  [pdf, other]

    cs.LG q-bio.QM

    Are VAEs Bad at Reconstructing Molecular Graphs?

    Authors: Hagen Muenkler, Hubert Misztela, Michal Pikusa, Marwin Segler, Nadine Schneider, Krzysztof Maziarz

    Abstract: Many contemporary generative models of molecules are variational auto-encoders of molecular graphs. One term in their training loss pertains to reconstructing the input, yet reconstruction capabilities of state-of-the-art models have not yet been thoroughly compared on a large and chemically diverse dataset. In this work, we show that when several state-of-the-art generative models are evaluated u…

    Submitted 4 May, 2023; originally announced May 2023.

    Comments: Published at the ELLIS Workshop on Machine Learning for Molecules (ML4Molecules 2022)

  21. arXiv:2304.11501  [pdf, other]

    cs.CL

    Lost in Translationese? Reducing Translation Effect Using Abstract Meaning Representation

    Authors: Shira Wein, Nathan Schneider

    Abstract: Translated texts bear several hallmarks distinct from texts originating in the language. Though individual translated texts are often fluent and preserve meaning, at a large scale, translated texts have statistical tendencies which distinguish them from text originally written in the language ("translationese") and can affect model performance. We frame the novel task of translationese reduction a…

    Submitted 29 January, 2024; v1 submitted 22 April, 2023; originally announced April 2023.

    Comments: EACL 2024 Camera-ready

  22. arXiv:2304.01179  [pdf, other]

    cs.CL cs.AI

    Hate Speech Targets Detection in Parler using BERT

    Authors: Nadav Schneider, Shimon Shouei, Saleem Ghantous, Elad Feldman

    Abstract: Online social networks have become a fundamental component of our everyday life. Unfortunately, these platforms are also a stage for hate speech. Popular social networks have regularized rules against hate speech. Consequently, social networks like Parler and Gab advocating and claiming to be free speech platforms have evolved. These platforms have become a district for hate speech against diverse…

    Submitted 3 April, 2023; originally announced April 2023.

  23. arXiv:2302.08808  [pdf, other]

    cs.CV cs.AI cs.LG

    Paint it Black: Generating paintings from text descriptions

    Authors: Mahnoor Shahid, Mark Koch, Niklas Schneider

    Abstract: Two distinct tasks, generating photorealistic pictures from given text prompts and transferring the style of a painting to a real image to make it appear as though it were done by an artist, have been addressed many times, and several approaches have been proposed to accomplish them. However, the intersection of these two, i.e., generating paintings from a given caption, is a relatively unexplore…

    Submitted 17 February, 2023; originally announced February 2023.

  24. arXiv:2302.00636  [pdf, other]

    cs.CL

    Are UD Treebanks Getting More Consistent? A Report Card for English UD

    Authors: Amir Zeldes, Nathan Schneider

    Abstract: Recent efforts to consolidate guidelines and treebanks in the Universal Dependencies project raise the expectation that joint training and dataset comparison is increasingly possible for high-resource languages such as English, which have multiple corpora. Focusing on the two largest UD English treebanks, we examine progress in data consolidation and answer several questions: Are UD English treeba…

    Submitted 1 February, 2023; originally announced February 2023.

    Comments: Proceedings of the Sixth Workshop on Universal Dependencies (UDW 2023)

  25. arXiv:2212.08999  [pdf, other]

    cs.CL

    Sentence-level Feedback Generation for English Language Learners: Does Data Augmentation Help?

    Authors: Shabnam Behzad, Amir Zeldes, Nathan Schneider

    Abstract: In this paper, we present strong baselines for the task of Feedback Comment Generation for Writing Learning. Given a sentence and an error span, the task is to generate a feedback comment explaining the error. Sentences and feedback comments are both in English. We experiment with LLMs and also create multiple pseudo datasets for the task, investigating how it affects the performance of our system…

    Submitted 17 December, 2022; originally announced December 2022.

    Comments: GenChal 2022: FCG, INLG 2023

  26. arXiv:2211.05965  [pdf, other]

    cs.HC cs.IR

    Using dynamic circles and squares to visualize spatio-temporal variation

    Authors: Harsh Patel, Nicole Schneider, Hanan Samet

    Abstract: Visualizations such as bar charts, scatter plots, and objects on geographical maps often convey critical information, including exact and relative numeric values, using shapes. The choice of shape and method of encoding information is often arbitrary, or based on convention. However, past studies have shown that the human eye can be fooled by visual representations. The Ebbinghaus illusion demon…

    Submitted 10 November, 2022; originally announced November 2022.

  27. arXiv:2210.03018  [pdf, other]

    cs.CL

    Measuring Fine-Grained Semantic Equivalence with Abstract Meaning Representation

    Authors: Shira Wein, Zhuxin Wang, Nathan Schneider

    Abstract: Identifying semantically equivalent sentences is important for many cross-lingual and mono-lingual NLP tasks. Current approaches to semantic equivalence take a loose, sentence-level approach to "equivalence," despite previous evidence that fine-grained differences and implicit content have an effect on human understanding (Roth and Anthonio, 2021) and system performance (Briakou and Carpuat, 2021)…

    Submitted 6 October, 2022; originally announced October 2022.

  28. arXiv:2210.00394  [pdf, other]

    cs.CL

    CGELBank: CGEL as a Framework for English Syntax Annotation

    Authors: Brett Reynolds, Aryaman Arora, Nathan Schneider

    Abstract: We introduce the syntactic formalism of the Cambridge Grammar of the English Language (CGEL) to the world of treebanking through the CGELBank project. We discuss some issues in linguistic analysis that arose in adapting the formalism to corpus annotation, followed by quantitative and qualitative comparisons with parallel UD and PTB treebanks. We argue that CGEL provides a good tradeoff be…

    Submitted 1 October, 2022; originally announced October 2022.

    Comments: 11 pages (8 main text)

    MSC Class: 68T50

    ACM Class: I.2.7

  29. arXiv:2208.07196  [pdf, other]

    cs.CV cs.LG physics.ins-det

    Determining HEDP Foams' Quality with Multi-View Deep Learning Classification

    Authors: Nadav Schneider, Matan Rusanovsky, Raz Gvishi, Gal Oren

    Abstract: High energy density physics (HEDP) experiments commonly involve a dynamic wave-front propagating inside a low-density foam. This effect affects its density and hence, its transparency. A common problem in foam production is the creation of defective foams. Accurate information on their dimension and homogeneity is required to classify the foams' quality. Therefore, those parameters are being chara…

    Submitted 10 August, 2022; originally announced August 2022.

  30. arXiv:2205.03955  [pdf, ps, other]

    cs.CL

    MASALA: Modelling and Analysing the Semantics of Adpositions in Linguistic Annotation of Hindi

    Authors: Aryaman Arora, Nitin Venkateswaran, Nathan Schneider

    Abstract: We present a completed, publicly available corpus of annotated semantic relations of adpositions and case markers in Hindi. We used the multilingual SNACS annotation scheme, which has been applied to a variety of typologically diverse languages. Building on past work examining linguistic problems in SNACS annotation, we use language models to attempt automatic labelling of SNACS supersenses in Hin…

    Submitted 8 May, 2022; originally announced May 2022.

    Comments: 9 pages (6 main text). To appear at LREC 2022

    ACM Class: I.2.7

  31. arXiv:2205.00395  [pdf, other]

    cs.CL

    ELQA: A Corpus of Metalinguistic Questions and Answers about English

    Authors: Shabnam Behzad, Keisuke Sakaguchi, Nathan Schneider, Amir Zeldes

    Abstract: We present ELQA, a corpus of questions and answers in and about the English language. Collected from two online forums, the >70k questions (from English learners and others) cover wide-ranging topics including grammar, meaning, fluency, and etymology. The answers include descriptions of general properties of English vocabulary and grammar as well as explanations about specific (correct and incorre…

    Submitted 3 July, 2023; v1 submitted 1 May, 2022; originally announced May 2022.

    Comments: Accepted to ACL 2023

  32. arXiv:2204.07663  [pdf, other]

    cs.CL

    Spanish Abstract Meaning Representation: Annotation of a General Corpus

    Authors: Shira Wein, Lucia Donatelli, Ethan Ricker, Calvin Engstrom, Alex Nelson, Nathan Schneider

    Abstract: The Abstract Meaning Representation (AMR) formalism, designed originally for English, has been adapted to a number of languages. We build on previous work proposing the annotation of AMR in Spanish, which resulted in the release of 50 Spanish AMR annotations for the fictional text "The Little Prince." In this work, we present the first sizable, general annotation project for Spanish Abstract Meani…

    Submitted 15 April, 2022; originally announced April 2022.

  33. arXiv:2112.08513  [pdf, other]

    cs.CL

    DocAMR: Multi-Sentence AMR Representation and Evaluation

    Authors: Tahira Naseem, Austin Blodgett, Sadhana Kumaravel, Tim O'Gorman, Young-Suk Lee, Jeffrey Flanigan, Ramón Fernandez Astudillo, Radu Florian, Salim Roukos, Nathan Schneider

    Abstract: Despite extensive research on parsing of English sentences into Abstract Meaning Representation (AMR) graphs, which are compared to gold graphs via the Smatch metric, full-document parsing into a unified graph representation lacks well-defined representation and evaluation. Taking advantage of a super-sentential level of coreference annotation from previous work, we introduce a simple algorithm…

    Submitted 6 May, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

    MSC Class: I.2.7

  34. arXiv:2112.07874  [pdf, other]

    cs.CL

    Linguistic Frameworks Go Toe-to-Toe at Neuro-Symbolic Language Modeling

    Authors: Jakob Prange, Nathan Schneider, Lingpeng Kong

    Abstract: We examine the extent to which, in principle, linguistic graph representations can complement and improve neural language modeling. With an ensemble setup consisting of a pretrained Transformer and ground-truth graphs from one of 7 different formalisms, we find that, overall, semantic constituency structures are most useful to language modeling performance -- outpacing syntactic constituency struc…

    Submitted 10 May, 2022; v1 submitted 14 December, 2021; originally announced December 2021.

    Comments: Accepted to NAACL 2022 (slight typesetting divergences to NAACL camera-ready due to TexLive 2020/2021 mismatches)

  35. arXiv:2110.12243  [pdf, other]

    cs.CL

    PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English

    Authors: Michael Kranzlein, Emma Manning, Siyao Peng, Shira Wein, Aryaman Arora, Bradford Salen, Nathan Schneider

    Abstract: We present the Prepositions Annotated with Supersense Tags in Reddit International English ("PASTRIE") corpus, a new dataset containing manually annotated preposition supersenses of English data from presumed speakers of four L1s: English, French, German, and Spanish. The annotations are comprehensive, covering all preposition types and tokens in the sample. Along with the corpus, we provide analy…

    Submitted 23 October, 2021; originally announced October 2021.

    Comments: Expanded from the version published at the Linguistic Annotation Workshop 2020

  36. arXiv:2109.11491  [pdf, other]

    cs.CL

    Putting Words in BERT's Mouth: Navigating Contextualized Vector Spaces with Pseudowords

    Authors: Taelin Karidi, Yichu Zhou, Nathan Schneider, Omri Abend, Vivek Srikumar

    Abstract: We present a method for exploring regions around individual points in a contextualized vector space (particularly, BERT space), as a way to investigate how these regions correspond to word senses. By inducing a contextualized "pseudoword" as a stand-in for a static embedding in the input layer, and then performing masked prediction of a word in the sentence, we are able to investigate the geometry…

    Submitted 4 October, 2021; v1 submitted 23 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021 camera-ready version

  37. arXiv:2109.10952  [pdf, other]

    cs.CL

    Cross-linguistically Consistent Semantic and Syntactic Annotation of Child-directed Speech

    Authors: Ida Szubert, Omri Abend, Nathan Schneider, Samuel Gibbon, Louis Mahon, Sharon Goldwater, Mark Steedman

    Abstract: This paper proposes a methodology for constructing corpora of child-directed speech (CDS) paired with sentential logical forms, and uses this method to create two such corpora, in English and Hebrew. The approach enforces a cross-linguistically consistent representation, building on recent advances in dependency representation and semantic parsing. Specifically, the approach involves two step…

    Submitted 14 March, 2024; v1 submitted 22 September, 2021; originally announced September 2021.

  38. arXiv:2109.09780  [pdf, other]

    cs.CL

    BERT Has Uncommon Sense: Similarity Ranking for Word Sense BERTology

    Authors: Luke Gessler, Nathan Schneider

    Abstract: An important question concerning contextualized word embedding (CWE) models like BERT is how well they can represent different word senses, especially those in the long tail of uncommon senses. Rather than build a WSD system as in previous work, we investigate contextualized embedding neighborhoods directly, formulating a query-by-example nearest neighbor retrieval task and examining ranking perfo…

    Submitted 20 September, 2021; originally announced September 2021.

    Comments: Accepted at BlackboxNLP 2021

  39. Making Heads and Tails of Models with Marginal Calibration for Sparse Tagsets

    Authors: Michael Kranzlein, Nelson F. Liu, Nathan Schneider

    Abstract: For interpreting the behavior of a probabilistic model, it is useful to measure a model's calibration--the extent to which it produces reliable confidence scores. We address the open problem of calibration for tagging models with sparse tagsets, and recommend strategies to measure and reduce calibration error (CE) in such models. We show that several post-hoc recalibration techniques all reduce ca…

    Submitted 15 September, 2021; originally announced September 2021.

  40. arXiv:2108.12928  [pdf, other]

    cs.CL

    Mischievous Nominal Constructions in Universal Dependencies

    Authors: Nathan Schneider, Amir Zeldes

    Abstract: While the highly multilingual Universal Dependencies (UD) project provides extensive guidelines for clausal structure as well as structure within canonical nominal phrases, a standard treatment is lacking for many "mischievous" nominal phenomena that break the mold. As a result, numerous inconsistencies within and across corpora can be found, even in languages with extensive UD treebanking work, s…

    Submitted 25 December, 2021; v1 submitted 29 August, 2021; originally announced August 2021.

    Comments: Extended version of the paper that is published in Proceedings of the Fifth Workshop on Universal Dependencies (UDW, SyntaxFest 2021), with additional sections on adverbial NPs and numbers/measurements

  41. arXiv:2107.07970  [pdf, other]

    cs.CL

    How Vulnerable Are Automatic Fake News Detection Methods to Adversarial Attacks?

    Authors: Camille Koenders, Johannes Filla, Nicolai Schneider, Vinicius Woloszyn

    Abstract: As the spread of false information on the internet has increased dramatically in recent years, more and more attention is being paid to automated fake news detection. Some fake news detection methods are already quite successful. Nevertheless, there are still many vulnerabilities in the detection algorithms. The reason for this is that fake news publishers can structure and formulate their texts i…

    Submitted 16 July, 2021; originally announced July 2021.

    Comments: 9 pages, Github: https://github.com/nicolaischneider/FakeNewsDetectionVulnerability

  42. arXiv:2107.04523  [pdf, other]

    cs.CV

    Learning Cascaded Detection Tasks with Weakly-Supervised Domain Adaptation

    Authors: Niklas Hanselmann, Nick Schneider, Benedikt Ortelt, Andreas Geiger

    Abstract: In order to handle the challenges of autonomous driving, deep learning has proven to be crucial in tackling increasingly complex tasks, such as 3D detection or instance segmentation. State-of-the-art approaches for image-based detection tasks tackle this complexity by operating in a cascaded fashion: they first extract a 2D bounding box based on which additional attributes, e.g. instance masks, ar…

    Submitted 9 July, 2021; originally announced July 2021.

    Comments: Accepted to IEEE IV 2021

  43. arXiv:2106.06002  [pdf, other]

    cs.CL

    Probabilistic, Structure-Aware Algorithms for Improved Variety, Accuracy, and Coverage of AMR Alignments

    Authors: Austin Blodgett, Nathan Schneider

    Abstract: We present algorithms for aligning components of Abstract Meaning Representation (AMR) graphs to spans in English sentences. We leverage unsupervised learning in combination with heuristics, taking the best of both worlds from previous AMR aligners. Our unsupervised models, however, are more sensitive to graph substructures, without requiring a separate syntactic parse. Our approach covers a wider…

    Submitted 10 June, 2021; originally announced June 2021.

    Comments: ACL 2021 Camera-ready

  44. arXiv:2103.14961  [pdf, other]

    cs.CL

    Supersense and Sensibility: Proxy Tasks for Semantic Annotation of Prepositions

    Authors: Luke Gessler, Shira Wein, Nathan Schneider

    Abstract: Prepositional supersense annotation is time-consuming and requires expert training. Here, we present two sensible methods for obtaining prepositional supersense annotations by eliciting surface substitution and similarity judgments. Four pilot studies suggest that both methods have potential for producing prepositional supersense annotations that are comparable in quality to expert annotations.

    Submitted 27 March, 2021; originally announced March 2021.

    Comments: Presented at LAW XIV in 2020

  45. arXiv:2103.03864  [pdf, other]

    cs.LG q-bio.QM

    Learning to Extend Molecular Scaffolds with Structural Motifs

    Authors: Krzysztof Maziarz, Henry Jackson-Flux, Pashmina Cameron, Finton Sirockin, Nadine Schneider, Nikolaus Stiefl, Marwin Segler, Marc Brockschmidt

    Abstract: Recent advancements in deep learning-based modeling of molecules promise to accelerate in silico drug discovery. A plethora of generative models is available, building molecules either atom-by-atom and bond-by-bond or fragment-by-fragment. However, many drug discovery projects require a fixed scaffold to be present in the generated molecule, and incorporating that constraint has only recently been…

    Submitted 12 May, 2024; v1 submitted 5 March, 2021; originally announced March 2021.

    Comments: Published at the 10th International Conference on Learning Representations (ICLR 2022)

  46. arXiv:2103.01399  [pdf, other]

    cs.CL

    Hindi-Urdu Adposition and Case Supersenses v1.0

    Authors: Aryaman Arora, Nitin Venkateswaran, Nathan Schneider

    Abstract: These are the guidelines for the application of SNACS (Semantic Network of Adposition and Case Supersenses; Schneider et al. 2018) to Modern Standard Hindi of Delhi. SNACS is an inventory of 50 supersenses (semantic labels) for labelling the use of adpositions and case markers with respect to both lexical-semantic function and relation to the underlying context. The English guidelines (Schneider e…

    Submitted 1 March, 2021; originally announced March 2021.

    ACM Class: I.2.7

  47. arXiv:2012.15810  [pdf, other]

    cs.CL

    UCCA's Foundational Layer: Annotation Guidelines v2.1

    Authors: Omri Abend, Nathan Schneider, Dotan Dvir, Jakob Prange, Ari Rappoport

    Abstract: This is the annotation manual for Universal Conceptual Cognitive Annotation (UCCA; Abend and Rappoport, 2013), specifically the Foundational Layer. UCCA is a graph-based semantic annotation scheme based on typological linguistic principles. It has been applied to several languages; for ease of exposition these guidelines give examples mainly in English. New annotators may wish to start with the tu…

    Submitted 31 December, 2020; originally announced December 2020.

  48. arXiv:2012.01285  [pdf, other]

    cs.CL

    Supertagging the Long Tail with Tree-Structured Decoding of Complex Categories

    Authors: Jakob Prange, Nathan Schneider, Vivek Srikumar

    Abstract: Although current CCG supertaggers achieve high accuracy on the standard WSJ test set, few systems make use of the categories' internal structure that will drive the syntactic derivation during parsing. The tagset is traditionally truncated, discarding the many rare and complex category types in the long tail. However, supertags are themselves trees. Rather than give up on rare tags, we investigate…

    Submitted 11 December, 2020; v1 submitted 2 December, 2020; originally announced December 2020.

    Comments: Accepted to appear in TACL; Authors' final version, pre-MIT Press publication

  49. arXiv:2011.00834  [pdf, other]

    cs.CL

    Comparison by Conversion: Reverse-Engineering UCCA from Syntax and Lexical Semantics

    Authors: Daniel Hershcovich, Nathan Schneider, Dotan Dvir, Jakob Prange, Miryam de Lhoneux, Omri Abend

    Abstract: Building robust natural language understanding systems will require a clear characterization of whether and how various linguistic meaning representations complement each other. To perform a systematic comparative analysis, we evaluate the mapping between meaning representations from different frameworks using two complementary methods: (i) a rule-based converter, and (ii) a supervised delexicaliz…

    Submitted 2 November, 2020; originally announced November 2020.

    Comments: COLING 2020 camera ready

  50. arXiv:2009.12470  [pdf]

    cs.CY

    Effective Voice: Beyond Exit and Affect in Online Communities

    Authors: Seth Frey, Nathan Schneider

    Abstract: Online communities provide ample opportunities for user self-expression but generally lack the means for average users to exercise direct control over community policies. This paper sets out to identify a set of strategies and techniques through which the voices of participants might be better heard through defined mechanisms for institutional governance. Drawing on Albert O. Hirschman's distincti…

    Submitted 20 February, 2021; v1 submitted 25 September, 2020; originally announced September 2020.

    Comments: ~9000 words

    ACM Class: J.4; K.4.3