Search | arXiv e-print repository

arXiv:2506.12242 [pdf]

Large Language Models for History, Philosophy, and Sociology of Science: Interpretive Uses, Methodological Challenges, and Critical Perspectives

Authors: Arno Simons, Michael Zichert, Adrian Wüthrich

Abstract: This paper explores the use of large language models (LLMs) as research tools in the history, philosophy, and sociology of science (HPSS). LLMs are remarkably effective at processing unstructured text and inferring meaning from context, offering new affordances that challenge long-standing divides between computational and interpretive methods. This raises both opportunities and challenges for HPSS, which emphasizes interpretive methodologies and understands meaning as context-dependent, ambiguous, and historically situated. We argue that HPSS is uniquely positioned not only to benefit from LLMs' capabilities but also to interrogate their epistemic assumptions and infrastructural implications. To this end, we first offer a concise primer on LLM architectures and training paradigms tailored to non-technical readers. We frame LLMs not as neutral tools but as epistemic infrastructures that encode assumptions about meaning, context, and similarity, conditioned by their training data, architecture, and patterns of use. We then examine how computational techniques enhanced by LLMs, such as structuring data, detecting patterns, and modeling dynamic processes, can be applied to support interpretive research in HPSS. Our analysis compares full-context and generative models, outlines strategies for domain and task adaptation (e.g., continued pretraining, fine-tuning, and retrieval-augmented generation), and evaluates their respective strengths and limitations for interpretive inquiry in HPSS. We conclude with four lessons for integrating LLMs into HPSS: (1) model selection involves interpretive trade-offs; (2) LLM literacy is foundational; (3) HPSS must define its own benchmarks and corpora; and (4) LLMs should enhance, not replace, interpretive methods.

Submitted 13 June, 2025; originally announced June 2025.

Comments: 27 pages, 2 tables

ACM Class: A.1; I.2.1; I.2.7; J.4; J.5

arXiv:2501.10728 [pdf, other]

ParkView: Visualizing Monotone Interleavings

Authors: Thijs Beurskens, Steven van den Broek, Arjen Simons, Willem Sonke, Kevin Verbeek, Tim Ophelders, Michael Hoffmann, Bettina Speckmann

Abstract: Merge trees are a powerful tool from topological data analysis that is frequently used to analyze scalar fields. The similarity between two merge trees can be captured by an interleaving: a pair of maps between the trees that jointly preserve ancestor relations in the trees. Interleavings can have a complex structure; visualizing them requires a sense of (drawing) order which is not inherent in this purely topological concept. However, in practice it is often desirable to introduce additional geometric constraints, which leads to variants such as labeled or monotone interleavings. Monotone interleavings respect a given order on the leaves of the merge trees and hence have the potential to be visualized in a clear and comprehensive manner. In this paper, we introduce ParkView: a schematic, scalable encoding for monotone interleavings. ParkView captures both maps of the interleaving using an optimal decomposition of both trees into paths and corresponding branches. We prove several structural properties of monotone interleavings, which support a sparse visual encoding using active paths and hedges that can be linked using a maximum of 6 colors for merge trees of arbitrary size. We show how to compute an optimal path-branch decomposition in linear time and illustrate ParkView on a number of real-world datasets.

Submitted 14 February, 2025; v1 submitted 18 January, 2025; originally announced January 2025.

arXiv:2411.17920 [pdf, other]

doi 10.1007/978-3-031-82697-9_12

Visual Complexity of Point Set Mappings

Authors: Wouter Meulemans, Arjen Simons, Kevin Verbeek

Abstract: We study the visual complexity of animated transitions between point sets. Although there exist many metrics for point set similarity, these metrics are not adequate for this purpose, as they typically treat each point separately. Instead, we propose to look at translations of entire subsets/groups of points to measure the visual complexity of a transition between two point sets. Specifically, given two labeled point sets A and B in R^d, the goal is to compute the cheapest transformation that maps all points in A to their corresponding point in B, where the translation of a group of points counts as a single operation in terms of complexity. In this paper we identify several problem dimensions involving group translations that may be relevant to various applications, and study the algorithmic complexity of the resulting problems. Specifically, we consider different restrictions on the groups that can be translated, and different optimization functions. For most of the resulting problem variants we are able to provide polynomial time algorithms, or establish that they are NP-hard. For the remaining open problems we either provide an approximation algorithm or establish the NP-hardness of a restricted version of the problem. Furthermore, our problem classification can easily be extended with additional problem dimensions giving rise to new problem variants that can be studied in future work.

Submitted 26 November, 2024; originally announced November 2024.

Comments: 17 pages, 4 figures
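
Illustration (not from the paper): the idea that translating a whole group of points counts as one operation can be sketched for the simplest conceivable variant. The sketch assumes that every point moves at most once and that any subset may form a group, so the cost is just the number of distinct non-zero displacement vectors. The function names and the sample data are hypothetical, and the paper's restricted variants and optimization functions are not modeled here.

```python
from collections import defaultdict


def group_by_displacement(a, b):
    """Group point labels by the translation vector that maps a[label] to b[label]."""
    groups = defaultdict(list)
    for label, src in a.items():
        dst = b[label]
        displacement = tuple(d - s for s, d in zip(src, dst))
        groups[displacement].append(label)
    return dict(groups)


def single_move_cost(a, b):
    """Count group translations, assuming each point moves at most once.

    Points sharing a displacement can move together as one operation;
    points with zero displacement stay put and cost nothing.
    """
    groups = group_by_displacement(a, b)
    return sum(1 for displacement in groups if any(displacement))


# Hypothetical labeled point sets in the plane (integer coordinates).
A = {"p": (0, 0), "q": (1, 0), "r": (5, 5), "s": (2, 2)}
B = {"p": (2, 1), "q": (3, 1), "r": (5, 6), "s": (2, 2)}

print(group_by_displacement(A, B))
print(single_move_cost(A, B))
```

Here "p" and "q" share the displacement (2, 1) and move as one group, "r" moves alone, and "s" does not move.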

arXiv:2411.14877 [pdf, other]

Astro-HEP-BERT: A bidirectional language model for studying the meanings of concepts in astrophysics and high energy physics

Authors: Arno Simons

Abstract: I present Astro-HEP-BERT, a transformer-based language model specifically designed for generating contextualized word embeddings (CWEs) to study the meanings of concepts in astrophysics and high-energy physics. Built on a general pretrained BERT model, Astro-HEP-BERT underwent further training over three epochs using the Astro-HEP Corpus, a dataset I curated from 21.84 million paragraphs extracted from more than 600,000 scholarly articles on arXiv, all belonging to at least one of these two scientific domains. The project demonstrates both the effectiveness and feasibility of adapting a bidirectional transformer for applications in the history, philosophy, and sociology of science (HPSS). The entire training process was conducted using freely available code, pretrained weights, and text inputs, completed on a single MacBook Pro Laptop (M2/96GB). Preliminary evaluations indicate that Astro-HEP-BERT's CWEs perform comparably to domain-adapted BERT models trained from scratch on larger datasets for domain-specific word sense disambiguation and induction and related semantic change analyses. This suggests that retraining general language models for specific scientific domains can be a cost-effective and efficient strategy for HPSS researchers, enabling high performance without the need for extensive training from scratch.

Submitted 22 November, 2024; originally announced November 2024.

Comments: 7 pages, 4 figures, 1 table

ACM Class: I.2.6; I.2.7; J.4

arXiv:2411.14073 [pdf, other]

Meaning at the Planck scale? Contextualized word embeddings for doing history, philosophy, and sociology of science

Authors: Arno Simons

Abstract: This paper explores the potential of contextualized word embeddings (CWEs) as a new tool in the history, philosophy, and sociology of science (HPSS) for studying contextual and evolving meanings of scientific concepts. Using the term "Planck" as a test case, I evaluate five BERT-based models with varying degrees of domain-specific pretraining, including my custom model Astro-HEP-BERT, trained on the Astro-HEP Corpus, a dataset containing 21.84 million paragraphs from 600,000 articles in astrophysics and high-energy physics. For this analysis, I compiled two labeled datasets: (1) the Astro-HEP-Planck Corpus, consisting of 2,900 labeled occurrences of "Planck" sampled from 1,500 paragraphs in the Astro-HEP Corpus, and (2) a physics-related Wikipedia dataset comprising 1,186 labeled occurrences of "Planck" across 885 paragraphs. Results demonstrate that the domain-adapted models outperform the general-purpose ones in disambiguating the target term, predicting its known meanings, and generating high-quality sense clusters, as measured by a novel purity indicator I developed. Additionally, this approach reveals semantic shifts in the target term over three decades in the unlabeled Astro-HEP Corpus, highlighting the emergence of the Planck space mission as a dominant sense. The study underscores the importance of domain-specific pretraining for analyzing scientific language and demonstrates the cost-effectiveness of adapting pretrained models for HPSS research. By offering a scalable and transferable method for modeling the meanings of scientific concepts, CWEs open up new avenues for investigating the socio-historical dynamics of scientific discourses.

Submitted 21 November, 2024; originally announced November 2024.

Comments: 18 pages, 7 figures (1 in the Supplement)

ACM Class: I.2.6; I.2.7; J.4

arXiv:1409.5452 [pdf, other]

Windows into Geometric Events: Data Structures for Time-Windowed Querying of Temporal Point Sets

Authors: Michael J. Bannister, William E. Devanny, Michael T. Goodrich, Joseph A. Simons, Lowell Trott

Abstract: We study geometric data structures for sets of point-based temporal events, answering time-windowed queries, i.e., given a contiguous time interval we answer common geometric queries about the point events with time stamps in this interval. The geometric queries we consider include queries based on the skyline, convex hull, and proximity relations of the point set. We provide space efficient data structures which answer queries in polylogarithmic time.

Submitted 18 September, 2014; originally announced September 2014.

Comments: CCCG 2014
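
Illustration (not from the paper): a time-windowed skyline query can be stated concretely with a naive baseline that filters the events by time stamp and then recomputes the skyline from scratch. This is only a sketch of what the query returns; it takes linear-or-worse time per query and is not the polylogarithmic data structure the abstract describes. The event format `(time, x, y)` and the function names are assumptions made for this example.

```python
def skyline(points):
    """Return the maximal points: those not dominated in both x and y."""
    result = []
    best_y = float("-inf")
    # Sweep from largest x to smallest; a point is kept only if its y
    # exceeds every y seen so far (otherwise an earlier point dominates it).
    for x, y in sorted(points, key=lambda p: (-p[0], -p[1])):
        if y > best_y:
            result.append((x, y))
            best_y = y
    return result


def windowed_skyline(events, t_start, t_end):
    """Naive time-windowed query: filter by time stamp, then compute the skyline."""
    in_window = [(x, y) for t, x, y in events if t_start <= t <= t_end]
    return skyline(in_window)


# Hypothetical events as (time, x, y) triples.
events = [(1, 1, 5), (2, 3, 4), (3, 2, 2), (4, 5, 1), (5, 4, 6)]

print(windowed_skyline(events, 1, 4))
print(windowed_skyline(events, 2, 5))
```

The two windows give different skylines because the event at time 5 dominates two points that are maximal in the earlier window.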

arXiv:1409.0597 [pdf, other]

Data-Oblivious Graph Algorithms in Outsourced External Memory

Authors: Michael T. Goodrich, Joseph A. Simons

Abstract: Motivated by privacy preservation for outsourced data, data-oblivious external memory is a computational framework where a client performs computations on data stored at a semi-trusted server in a way that does not reveal her data to the server. This approach facilitates collaboration and reliability over traditional frameworks, and it provides privacy protection, even though the server has full access to the data and he can monitor how it is accessed by the client. The challenge is that even if data is encrypted, the server can learn information based on the client data access pattern; hence, access patterns must also be obfuscated. We investigate privacy-preserving algorithms for outsourced external memory that are based on the use of data-oblivious algorithms, that is, algorithms where each possible sequence of data accesses is independent of the data values. We give new efficient data-oblivious algorithms in the outsourced external memory model for a number of fundamental graph problems. Our results include new data-oblivious external-memory methods for constructing minimum spanning trees, performing various traversals on rooted trees, answering least common ancestor queries on trees, computing biconnected components, and forming open ear decompositions. None of our algorithms make use of constant-time random oracles.

Submitted 1 September, 2014; originally announced September 2014.

Comments: 20 pages

MSC Class: 68

ACM Class: F.2.0
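
Illustration (not from the paper): the property that the sequence of data accesses is independent of the data values can be seen in a small sorting example. The sketch below uses odd-even transposition sort, a classic sorting network that is swapped in here purely for illustration; it is not one of the paper's graph algorithms, and it ignores encryption and external-memory block transfers. The `trace` parameter is a hypothetical hook for recording which cell pairs are touched.

```python
def oblivious_sort(values, trace=None):
    """Sort with odd-even transposition sort, recording accessed index pairs.

    The pairs that are compared depend only on len(values), never on the
    values themselves, and both cells are rewritten after every comparison,
    so an observer of the access pattern learns nothing about the data.
    """
    data = list(values)
    n = len(data)
    for round_index in range(n):
        start = round_index % 2  # alternate between even and odd pairs
        for i in range(start, n - 1, 2):
            if trace is not None:
                trace.append((i, i + 1))
            low = min(data[i], data[i + 1])
            high = max(data[i], data[i + 1])
            data[i], data[i + 1] = low, high  # always write both cells
    return data


trace_a, trace_b = [], []
print(oblivious_sort([3, 1, 2, 0], trace_a))
print(oblivious_sort([0, 1, 2, 3], trace_b))
print(trace_a == trace_b)  # same access pattern for different inputs
```

A data-dependent algorithm such as quicksort would produce different traces for the two inputs, which is exactly the leakage the oblivious model rules out.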

arXiv:1308.5741 [pdf, other]

Fixed parameter tractability of crossing minimization of almost-trees

Authors: Michael J. Bannister, David Eppstein, Joseph A. Simons

Abstract: We investigate exact crossing minimization for graphs that differ from trees by a small number of additional edges, for several variants of the crossing minimization problem. In particular, we provide fixed parameter tractable algorithms for the 1-page book crossing number, the 2-page book crossing number, and the minimum number of crossed edges in 1-page and 2-page book drawings.

Submitted 26 August, 2013; originally announced August 2013.

Comments: Graph Drawing 2013

arXiv:1306.3482 [pdf, other]

Set-Difference Range Queries

Authors: David Eppstein, Michael T. Goodrich, Joseph A. Simons

Abstract: We introduce the problem of performing set-difference range queries, where answers to queries are set-theoretic symmetric differences between sets of items in two geometric ranges. We describe a general framework for answering such queries based on a novel use of data-streaming sketches we call signed symmetric-difference sketches. We show that such sketches can be realized using invertible Bloom filters (IBFs), which can be composed, differenced, and searched so as to solve set-difference range queries in a wide range of scenarios.

Submitted 14 June, 2013; originally announced June 2013.

arXiv:1204.4714 [pdf, other]

Dynamic Planar Point Location with Sub-Logarithmic Local Updates

Authors: Maarten Löffler, Joe Simons, Darren Strash

Abstract: We study planar point location in a collection of disjoint fat regions, and investigate the complexity of \emph {local updates}: replacing any region by a different region that is "similar" to the original region. (i.e., the size differs by at most a constant factor, and distance between the two regions is a constant times that size). We show that it is possible to create a linear size data structure that allows for insertions, deletions, and queries in logarithmic time, and allows for local updates in sub-logarithmic time on a pointer machine.

Submitted 22 February, 2013; v1 submitted 20 April, 2012; originally announced April 2012.

arXiv:1109.0312 [pdf, other]

Fully Retroactive Approximate Range and Nearest Neighbor Searching

Authors: Michael T. Goodrich, Joseph A. Simons

Abstract: We describe fully retroactive dynamic data structures for approximate range reporting and approximate nearest neighbor reporting. We show how to maintain, for any positive constant $d$, a set of $n$ points in $\R^d$ indexed by time such that we can perform insertions or deletions at any point in the timeline in $O(\log n)$ amortized time. We support, for any small constant $ε>0$, $(1+ε)$-approximate range reporting queries at any point in the timeline in $O(\log n + k)$ time, where $k$ is the output size. We also show how to answer $(1+ε)$-approximate nearest neighbor queries for any point in the past or present in $O(\log n)$ time.

Submitted 1 September, 2011; originally announced September 2011.

Comments: 24 pages, 4 figures. To appear at the 22nd International Symposium on Algorithms and Computation (ISAAC 2011)

ACM Class: E.0

arXiv:1108.5361 [pdf, other]

doi 10.7155/jgaa.00312

Confluent Hasse diagrams

Authors: David Eppstein, Joseph A. Simons

Abstract: We show that a transitively reduced digraph has a confluent upward drawing if and only if its reachability relation has order dimension at most two. In this case, we construct a confluent upward drawing with $O(n^2)$ features, in an $O(n) \times O(n)$ grid in $O(n^2)$ time. For the digraphs representing series-parallel partial orders we show how to construct a drawing with $O(n)$ features in an $O(n) \times O(n)$ grid in $O(n)$ time from a series-parallel decomposition of the partial order. Our drawings are optimal in the number of confluent junctions they use.

Submitted 13 November, 2013; v1 submitted 26 August, 2011; originally announced August 2011.

Comments: 20 pages, 13 figures

ACM Class: D.2.2; G.2.2

Journal ref: J. Graph Algorithms & Applications 17(7): 689-710, 2013

arXiv:1108.4705 [pdf, other]

doi 10.7155/jgaa.00263

Inapproximability of Orthogonal Compaction

Authors: Michael J. Bannister, David Eppstein, Joseph A. Simons

Abstract: We show that several problems of compacting orthogonal graph drawings to use the minimum number of rows, area, length of longest edge or total edge length cannot be approximated better than within a polynomial factor of optimal in polynomial time unless P = NP. We also provide a fixed-parameter-tractable algorithm for testing whether a drawing can be compacted to a small number of rows.

Submitted 26 February, 2012; v1 submitted 23 August, 2011; originally announced August 2011.

Comments: Updated to the final version to appear in the Journal of Graph Algorithms and Applications

Journal ref: J. Graph Algorithms & Applications 16(3): 651-673, 2012

arXiv:1106.4092 [pdf, ps, other]

doi 10.4204/EPTCS.55.3

Building a refinement checker for Z

Authors: John Derrick, Siobhán North, Anthony J. H. Simons

Abstract: In previous work we have described how refinements can be checked using a temporal logic based model-checker, and how we have built a model-checker for Z by providing a translation of Z into the SAL input language. In this paper we draw these two strands of work together and discuss how we have implemented refinement checking in our Z2SAL toolset. The net effect of this work is that the SAL toolset can be used to check refinements between Z specifications supplied as input files written in the LaTeX mark-up. Two examples are used to illustrate the approach and compare it with a manual translation and refinement check.

Submitted 21 June, 2011; originally announced June 2011.

Comments: In Proceedings Refine 2011, arXiv:1106.3488

Journal ref: EPTCS 55, 2011, pp. 37-52

arXiv:0802.2258 [pdf]

doi 10.1109/ENC.2005.52

Using Alloy to model-check visual design notations

Authors: Anthony J. H. Simons, Carlos Alberto Fernandez-y-Fernandez

Abstract: This paper explores the process of validation for the abstract syntax of a graphical notation. We define a unified specification for five of the UML diagrams used by the Discovery Method and, in this document, we illustrate how diagrams can be represented in Alloy and checked against our specification in order to know if these are valid under the Discovery notation.

Submitted 15 February, 2008; originally announced February 2008.

Comments: 8 pages

ACM Class: I.6.4; D.3.1; I.3.5

Journal ref: Simons, A.J.H. and Fernandez-y-Fernandez, C.A., Using Alloy to model-check visual design notations. In Sixth Mexican Int. Conf. on C S, (Mexico, 2005), IEEE, 121-128

Showing 1–15 of 15 results for author: Simons, A