Search | arXiv e-print repository

OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages

Authors: Chester Palen-Michel, Maxwell Pickering, Maya Kruse, Jonne Sälevä, Constantine Lignos

Abstract: We present OpenNER 1.0, a standardized collection of openly-available named entity recognition (NER) datasets. OpenNER contains 36 NER corpora that span 52 languages, human-annotated in varying named entity ontologies. We correct annotation format issues, standardize the original datasets into a uniform representation with consistent entity type names across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. We provide baseline results using three pretrained multilingual language models and two large language models to compare the performance of recent models and facilitate future research in NER. We find that no single model is best in all languages and that significant work remains to obtain high performance from LLMs on the NER task.

Submitted 26 June, 2025; v1 submitted 12 December, 2024; originally announced December 2024.

Comments: Under review

arXiv:2311.02192 [pdf, other]

doi 10.56553/popets-2025-0062

Automating Governing Knowledge Commons and Contextual Integrity (GKC-CI) Privacy Policy Annotations with Large Language Models

Authors: Jake Chanenson, Madison Pickering, Noah Apthorpe

Abstract: Identifying contextual integrity (CI) and governing knowledge commons (GKC) parameters in privacy policy texts can facilitate normative privacy analysis. However, GKC-CI annotation has heretofore required manual or crowdsourced effort. This paper demonstrates that high-accuracy GKC-CI parameter annotation of privacy policies can be performed automatically using large language models. We fine-tune 50 open-source and proprietary models on 21,588 ground truth GKC-CI annotations from 16 privacy policies. Our best performing model has an accuracy of 90.65%, which is comparable to the accuracy of experts on the same task. We apply our best performing model to 456 privacy policies from a variety of online services, demonstrating the effectiveness of scaling GKC-CI annotation for privacy policy exploration and analysis. We publicly release our model training code, training and testing data, an annotation visualizer, and all annotated policies for future GKC-CI research.

Submitted 9 December, 2024; v1 submitted 3 November, 2023; originally announced November 2023.

Comments: 29 pages, 18 figures, 11 tables; camera-ready version

arXiv:2303.08014 [pdf]

Do large language models resemble humans in language use?

Authors: Zhenguang G. Cai, Xufeng Duan, David A. Haslett, Shuqi Wang, Martin J. Pickering

Abstract: Large language models (LLMs) such as ChatGPT and Vicuna have shown remarkable capacities in comprehending and producing language. However, their internal workings remain a black box, and it is unclear whether LLMs and chatbots can develop humanlike characteristics in language use. Cognitive scientists have devised many experiments that probe, and have made great progress in explaining, how people comprehend and produce language. We subjected ChatGPT and Vicuna to 12 of these experiments ranging from sounds to dialogue, preregistered and with 1000 runs (i.e., iterations) per experiment. ChatGPT and Vicuna replicated the human pattern of language use in 10 and 7 out of the 12 experiments, respectively. The models associated unfamiliar words with different meanings depending on their forms, continued to access recently encountered meanings of ambiguous words, reused recent sentence structures, attributed causality as a function of verb semantics, and accessed different meanings and retrieved different words depending on an interlocutor's identity. In addition, ChatGPT, but not Vicuna, nonliterally interpreted implausible sentences that were likely to have been corrupted by noise, drew reasonable inferences, and overlooked semantic fallacies in a sentence. Finally, unlike humans, neither model preferred using shorter words to convey less informative content, nor did they use context to resolve syntactic ambiguities. We discuss how these convergences and divergences may result from the transformer architecture. Overall, these experiments demonstrate that LLMs such as ChatGPT (and Vicuna to a lesser extent) are humanlike in many aspects of human language processing.

Submitted 25 March, 2024; v1 submitted 10 March, 2023; originally announced March 2023.

arXiv:2112.03653 [pdf, ps, other]

A Specification for Typed Template Haskell

Authors: Matthew Pickering, Andres Löh, Nicolas Wu

Abstract: Multi-stage programming is a proven technique that provides predictable performance characteristics by controlling code generation. We propose a core semantics for Typed Template Haskell, an extension of Haskell that supports multi-stage programming that interacts well with polymorphism and qualified types. Our semantics relates a declarative source language with qualified types to a core language based on the polymorphic lambda calculus augmented with multi-stage constructs.

Submitted 7 December, 2021; originally announced December 2021.

arXiv:1805.06798 [pdf, other]

Generic Deriving of Generic Traversals

Authors: Csongor Kiss, Matthew Pickering, Nicolas Wu

Abstract: Functional programmers have an established tradition of using traversals as a design pattern to work with recursive data structures. The technique is so prolific that a whole host of libraries have been designed to help in the task of automatically providing traversals by analysing the generic structure of data types. More recently, lenses have entered the functional scene and have proved themselves to be a simple and versatile mechanism for working with product types. They make it easy to focus on the salient parts of a data structure in a composable and reusable manner. In this paper, we use the combination of lenses and traversals to give rise to an expressive and flexible library for querying and modifying complex data structures. Furthermore, since our lenses and traversals are based on the generic shape of data, we are able to use this information to produce code that is as efficient as hand-written versions. The technique leverages the structure of data to produce generic abstractions that are then eliminated by the standard workhorses of modern functional compilers: inlining and specialisation.

Submitted 17 May, 2018; originally announced May 2018.

Comments: 28 pages, ICFP

arXiv:1703.10857 [pdf]

doi 10.22152/programming-journal.org/2017/1/7

Profunctor Optics: Modular Data Accessors

Authors: Matthew Pickering, Jeremy Gibbons, Nicolas Wu

Abstract: CONTEXT: Data accessors allow one to read and write components of a data structure, such as the fields of a record, the variants of a union, or the elements of a container. These data accessors are collectively known as optics; they are fundamental to programs that manipulate complex data. INQUIRY: Individual data accessors for simple data structures are easy to write, for example as pairs of "getter" and "setter" methods. However, it is not obvious how to combine data accessors, in such a way that data accessors for a compound data structure are composed out of smaller data accessors for the parts of that structure. Generally, one has to write a sequence of statements or declarations that navigate step by step through the data structure, accessing one level at a time - which is to say, data accessors are traditionally not first-class citizens, combinable in their own right. APPROACH: We present a framework for modular data access, in which individual data accessors for simple data structures may be freely combined to obtain more complex data accessors for compound data structures. Data accessors become first-class citizens. The framework is based around the notion of profunctors, a flexible generalization of functions. KNOWLEDGE: The language features required are higher-order functions ("lambdas" or "closures"), parametrized types ("generics" or "abstract types"), and some mechanism for separating interfaces from implementations ("abstract classes" or "modules"). We use Haskell as a vehicle in which to present our constructions, but languages such as Java, C#, or Scala that provide the necessary features should work just as well. GROUNDING: We provide implementations of all our constructions, in the form of a literate program: the manuscript file for the paper is also the source code for the program, and the extracted code is available separately for evaluation. We also prove the essential properties demonstrating that our profunctor-based representations are precisely equivalent to the more familiar concrete representations. IMPORTANCE: Our results should pave the way to simpler ways of writing programs that access the components of compound data structures.

Submitted 31 March, 2017; originally announced March 2017.

Journal ref: The Art, Science, and Engineering of Programming, 2017, Vol. 1, Issue 2, Article 7
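
The abstract above describes data accessors as pairs of "getter" and "setter" methods and asks how accessors for a compound structure can be composed from accessors for its parts. The following is a minimal Python sketch of that idea using the familiar concrete representation only; it is not the profunctor encoding the paper develops, and it is not taken from the paper's Haskell code. The names `Lens`, `then`, `field`, `Person`, and `Address` are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable


@dataclass(frozen=True)
class Lens:
    """Hypothetical concrete accessor: a getter paired with a non-mutating setter."""
    get: Callable[[Any], Any]
    put: Callable[[Any, Any], Any]  # (whole, new_part) -> new_whole

    def then(self, inner: "Lens") -> "Lens":
        """Compose two accessors: focus with `self` first, then with `inner`."""
        return Lens(
            get=lambda whole: inner.get(self.get(whole)),
            put=lambda whole, part: self.put(
                whole, inner.put(self.get(whole), part)
            ),
        )


def field(name: str) -> Lens:
    """Build an accessor for one named field of a frozen dataclass record."""
    return Lens(
        get=lambda rec: getattr(rec, name),
        put=lambda rec, value: replace(rec, **{name: value}),
    )


# Illustrative record types (assumptions, not from the paper).
@dataclass(frozen=True)
class Address:
    city: str


@dataclass(frozen=True)
class Person:
    name: str
    address: Address


# An accessor for the nested field, built from two smaller accessors.
person_city = field("address").then(field("city"))

alice = Person("Alice", Address("Oxford"))
moved = person_city.put(alice, "Bristol")  # returns a new Person; alice is unchanged
```

Here `person_city` is an ordinary value that can be passed around and composed further, which is one way to read the abstract's phrase "first-class citizens"; the setter returns an updated copy rather than mutating the record in place.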

arXiv:1605.06180 [pdf, ps, other]

doi 10.1016/j.jcss.2016.04.005

Modeling and performance evaluation of stealthy false data injection attacks on smart grid in the presence of corrupted measurements

Authors: Adnan Anwar, Abdun Naser Mahmood, Mark Pickering

Abstract: The false data injection (FDI) attack cannot be detected by the traditional anomaly detection techniques used in the energy system state estimators. In this paper, we demonstrate how FDI attacks can be constructed blindly, i.e., without system knowledge, including topological connectivity and line reactance information. Our analysis reveals that existing FDI attacks become detectable (consequently unsuccessful) by the state estimator if the data contains grossly corrupted measurements such as device malfunction and communication errors. The proposed sparse optimization based stealthy attacks construction strategy overcomes this limitation by separating the gross errors from the measurement matrix. Extensive theoretical modeling and experimental evaluation show that the proposed technique performs more stealthily (has less relative error) and efficiently (fast enough to maintain time requirement) compared to other methods on IEEE benchmark test systems.

Submitted 19 May, 2016; originally announced May 2016.

Comments: Keywords: Smart grid, False data injection, Blind attack, Principal component analysis (PCA), Journal of Computer and System Sciences, Elsevier, 2016

Showing 1–7 of 7 results for author: Pickering, M