Search | arXiv e-print repository

From Pixels to Histopathology: A Graph-Based Framework for Interpretable Whole Slide Image Analysis

Authors: Alexander Weers, Alexander H. Berger, Laurin Lux, Peter Schüffler, Daniel Rueckert, Johannes C. Paetzold

Abstract: The histopathological classification of whole-slide images (WSIs) is a fundamental task in digital pathology; yet it requires extensive time and expertise from specialists. While deep learning methods show promising results, they typically process WSIs by dividing them into artificial patches, which inherently prevents a network from learning from the entire image context, disregards natural tissue structures and compromises interpretability. Our method overcomes this limitation through a novel graph-based framework that constructs WSI graph representations. The WSI-graph efficiently captures essential histopathological information in a compact form. We build tissue representations (nodes) that follow biological boundaries rather than arbitrary patches all while providing interpretable features for explainability. Through adaptive graph coarsening guided by learned embeddings, we progressively merge regions while maintaining discriminative local features and enabling efficient global information exchange. In our method's final step, we solve the diagnostic task through a graph attention network. We empirically demonstrate strong performance on multiple challenging tasks such as cancer stage classification and survival prediction, while also identifying predictive factors using Integrated Gradients. Our implementation is publicly available at https://github.com/HistoGraph31/pix2pathology

Submitted 14 March, 2025; originally announced March 2025.

Comments: 11 pages, 2 figures

arXiv:2503.09808 [pdf, other]

Fine-tuning Vision Language Models with Graph-based Knowledge for Explainable Medical Image Analysis

Authors: Chenjun Li, Laurin Lux, Alexander H. Berger, Martin J. Menten, Mert R. Sabuncu, Johannes C. Paetzold

Abstract: Accurate staging of Diabetic Retinopathy (DR) is essential for guiding timely interventions and preventing vision loss. However, current staging models are hardly interpretable, and most public datasets contain no clinical reasoning or interpretation beyond image-level labels. In this paper, we present a novel method that integrates graph representation learning with vision-language models (VLMs) to deliver explainable DR diagnosis. Our approach leverages optical coherence tomography angiography (OCTA) images by constructing biologically informed graphs that encode key retinal vascular features such as vessel morphology and spatial connectivity. A graph neural network (GNN) then performs DR staging while integrated gradients highlight critical nodes and edges and their individual features that drive the classification decisions. We collect this graph-based knowledge which attributes the model's prediction to physiological structures and their characteristics. We then transform it into textual descriptions for VLMs. We perform instruction-tuning with these textual descriptions and the corresponding image to train a student VLM. This final agent can classify the disease and explain its decision in a human interpretable way solely based on a single image input. Experimental evaluations on both proprietary and public datasets demonstrate that our method not only improves classification accuracy but also offers more clinically interpretable results. An expert study further demonstrates that our method provides more accurate diagnostic explanations and paves the way for precise localization of pathologies in OCTA images.

Submitted 12 March, 2025; originally announced March 2025.

Comments: 11 pages, 3 figures

arXiv:2502.16697 [pdf, other]

Interpretable Retinal Disease Prediction Using Biology-Informed Heterogeneous Graph Representations

Authors: Laurin Lux, Alexander H. Berger, Maria Romeo Tricas, Alaa E. Fayed, Sobha Sivaprasada, Linus Kreitner, Jonas Weidner, Martin J. Menten, Daniel Rueckert, Johannes C. Paetzold

Abstract: Interpretability is crucial to enhance trust in machine learning models for medical diagnostics. However, most state-of-the-art image classifiers based on neural networks are not interpretable. As a result, clinicians often resort to known biomarkers for diagnosis, although biomarker-based classification typically performs worse than large neural networks. This work proposes a method that surpasses the performance of established machine learning models while simultaneously improving prediction interpretability for diabetic retinopathy staging from optical coherence tomography angiography (OCTA) images. Our method is based on a novel biology-informed heterogeneous graph representation that models retinal vessel segments, intercapillary areas, and the foveal avascular zone (FAZ) in a human-interpretable way. This graph representation allows us to frame diabetic retinopathy staging as a graph-level classification task, which we solve using an efficient graph neural network. We benchmark our method against well-established baselines, including classical biomarker-based classifiers, convolutional neural networks (CNNs), and vision transformers. Our model outperforms all baselines on two datasets. Crucially, we use our biology-informed graph to provide explanations of unprecedented detail. Our approach surpasses existing methods in precisely localizing and identifying critical vessels or intercapillary areas. In addition, we give informative and human-interpretable attributions to critical characteristics. Our work contributes to the development of clinical decision-support tools in ophthalmology.

Submitted 23 February, 2025; originally announced February 2025.

arXiv:2412.14619 [pdf, other]

Pitfalls of topology-aware image segmentation

Authors: Alexander H. Berger, Laurin Lux, Alexander Weers, Martin Menten, Daniel Rueckert, Johannes C. Paetzold

Abstract: Topological correctness, i.e., the preservation of structural integrity and specific characteristics of shape, is a fundamental requirement for medical imaging tasks, such as neuron or vessel segmentation. Despite the recent surge in topology-aware methods addressing this challenge, their real-world applicability is hindered by flawed benchmarking practices. In this paper, we identify critical pitfalls in model evaluation that include inadequate connectivity choices, overlooked topological artifacts in ground truth annotations, and inappropriate use of evaluation metrics. Through detailed empirical analysis, we uncover these issues' profound impact on the evaluation and ranking of segmentation methods. Drawing from our findings, we propose a set of actionable recommendations to establish fair and robust evaluation standards for topology-aware medical image segmentation methods.

Submitted 19 December, 2024; originally announced December 2024.

Comments: Code is available at https://github.com/AlexanderHBerger/topo-pitfalls

arXiv:2412.10760 [pdf, other]

Fixed Order Scheduling with Deadlines

Authors: Andre Berger, Arman Rouhani, Marc Schröder

Abstract: This paper studies a scheduling problem in a parallel machine setting, where each machine must adhere to a predetermined fixed order for processing the jobs. Given $n$ jobs, each with processing times and deadlines, we aim to minimize the number of machines while ensuring deadlines are met and the fixed order is maintained. We show that the first-fit algorithm solves the problem optimally with unit processing times and is a 2-approximation in the following four cases: (1) the order aligns with non-increasing slacks, (2) the order aligns with non-decreasing slacks, (3) the order aligns with non-increasing deadlines, and (4) the optimal solution uses at most 3 machines. For the general problem we provide an $O(\log n)$-approximation.

Submitted 15 May, 2025; v1 submitted 14 December, 2024; originally announced December 2024.
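
The first-fit rule named in the abstract above can be illustrated with a minimal sketch. It assumes one plausible reading of the problem, which the abstract does not spell out: jobs arrive in the fixed order, each machine processes its jobs back to back from time 0, and a job goes to the lowest-indexed machine on which it still finishes by its deadline. The function name `first_fit` and the `(processing_time, deadline)` tuple layout are hypothetical choices for illustration, not the authors' code.

```python
from typing import List, Tuple


def first_fit(jobs: List[Tuple[int, int]]) -> List[List[int]]:
    """Assign jobs, given in their fixed order, to machines by first-fit.

    jobs: list of (processing_time, deadline) pairs in the fixed order.
    Returns one list of job indices per opened machine.
    Assumption: every machine starts at time 0 and runs its jobs
    back to back in the fixed order.
    """
    loads: List[int] = []             # current completion time per machine
    assignment: List[List[int]] = []  # job indices per machine

    for j, (p, d) in enumerate(jobs):
        if p > d:
            raise ValueError(f"job {j} cannot meet its deadline on any machine")
        for i, load in enumerate(loads):
            if load + p <= d:         # job j finishes in time on machine i
                loads[i] += p
                assignment[i].append(j)
                break
        else:
            # No open machine fits the job, so open a new one.
            loads.append(p)
            assignment.append([j])
    return assignment


if __name__ == "__main__":
    # Unit processing times, the case the abstract states first-fit solves optimally.
    print(first_fit([(1, 1), (1, 1), (1, 2), (1, 2)]))
```

In this example the first two jobs both have deadline 1, so they cannot share a machine; the remaining two jobs then fill the second time slot on each machine.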

arXiv:2411.03228 [pdf, other]

Topograph: An efficient Graph-Based Framework for Strictly Topology Preserving Image Segmentation

Authors: Laurin Lux, Alexander H. Berger, Alexander Weers, Nico Stucki, Daniel Rueckert, Ulrich Bauer, Johannes C. Paetzold

Abstract: Topological correctness plays a critical role in many image segmentation tasks, yet most networks are trained using pixel-wise loss functions, such as Dice, neglecting topological accuracy. Existing topology-aware methods often lack robust topological guarantees, are limited to specific use cases, or impose high computational costs. In this work, we propose a novel, graph-based framework for topologically accurate image segmentation that is both computationally efficient and generally applicable. Our method constructs a component graph that fully encodes the topological information of both the prediction and ground truth, allowing us to efficiently identify topologically critical regions and aggregate a loss based on local neighborhood information. Furthermore, we introduce a strict topological metric capturing the homotopy equivalence between the union and intersection of prediction-label pairs. We formally prove the topological guarantees of our approach and empirically validate its effectiveness on binary and multi-class datasets. Our loss demonstrates state-of-the-art performance with up to fivefold faster loss computation compared to persistent homology methods.

Submitted 17 April, 2025; v1 submitted 5 November, 2024; originally announced November 2024.

arXiv:2409.11731 [pdf, other]

Performance and Robustness of Signal-Dependent vs. Signal-Independent Binaural Signal Matching with Wearable Microphone Arrays

Authors: Ami Berger, Vladimir Tourbabin, Jacob Donley, Zamir Ben-Hur, Boaz Rafaely

Abstract: The increasing popularity of spatial audio in applications such as teleconferencing, entertainment, and virtual reality has led to the recent developments of binaural reproduction methods. However, only a few of these methods are well-suited for wearable and mobile arrays, which typically consist of a small number of microphones. One such method is binaural signal matching (BSM), which has been shown to produce high-quality binaural signals for wearable arrays. However, BSM may be suboptimal in cases of high direct-to-reverberant ratio (DRR) as it is based on the diffuse sound field assumption. To overcome this limitation, previous studies incorporated sound-field models other than diffuse. However, performance may be sensitive to signal estimation errors. This paper aims to provide a systematic and comprehensive analysis of signal-dependent vs. signal-independent BSM, so that the benefits and limitations of the methods become clearer. Two signal-dependent BSM-based methods designed for high DRR scenarios that incorporate a sound field model composed of direct and reverberant components are investigated mathematically, using simulations, and finally validated by a listening test, and compared to the signal-independent BSM. The results show that signal-dependent BSM can significantly improve performance, in particular in the direction of the source, while presenting only a negligible degradation in other directions. Furthermore, when source direction estimation is inaccurate, performance of the signal-dependent BSM degrades to equal that of the signal-independent BSM, presenting a desired robustness quality.

Submitted 14 February, 2025; v1 submitted 18 September, 2024; originally announced September 2024.

arXiv:2407.05714 [pdf, other]

Implementing a hybrid approach in a knowledge engineering process to manage technical advice relating to feedback from the operation of complex sensitive equipment

Authors: Alain Claude Hervé Berger, Sébastien Boblet, Thierry Cartié, Jean-Pierre Cotton, François Vexler

Abstract: How can technical advice on operating experience feedback be managed efficiently in an organization that has never used knowledge engineering techniques and methods? This article explains how an industrial company in the nuclear and defense sectors adopted such an approach, adapted to its "TA KM" organizational context and falling within the ISO30401 framework, to build a complete system with a "SARBACANES" application to support its business processes and perpetuate its know-how and expertise in a knowledge base. Over and above the classic transfer of knowledge between experts and business specialists, SARBACANES also reveals the ability of this type of engineering to deliver multi-functional operation. Modeling was accelerated by the use of a tool adapted to this type of operation: the Ardans Knowledge Maker platform.

Submitted 8 July, 2024; originally announced July 2024.

Comments: in French language. 35es Journées francophones d'Ingénierie des Connaissances (IC 2024) @ Plate-Forme Intelligence Artificielle (PFIA 2024), Association Française pour l'Intelligence Artificielle; Laboratoire L3i La Rochelle Université, Jul 2024, La Rochelle, France

arXiv:2403.11001 [pdf, other]

doi 10.1007/978-3-031-72111-3_68

Topologically Faithful Multi-class Segmentation in Medical Images

Authors: Alexander H. Berger, Nico Stucki, Laurin Lux, Vincent Buergin, Suprosanna Shit, Anna Banaszak, Daniel Rueckert, Ulrich Bauer, Johannes C. Paetzold

Abstract: Topological accuracy in medical image segmentation is a highly important property for downstream applications such as network analysis and flow modeling in vessels or cell counting. Recently, significant methodological advancements have brought well-founded concepts from algebraic topology to binary segmentation. However, these approaches have been underexplored in multi-class segmentation scenarios, where topological errors are common. We propose a general loss function for topologically faithful multi-class segmentation extending the recent Betti matching concept, which is based on induced matchings of persistence barcodes. We project the N-class segmentation problem to N single-class segmentation tasks, which allows us to use 1-parameter persistent homology, making training of neural networks computationally feasible. We validate our method on a comprehensive set of four medical datasets with highly variant topological characteristics. Our loss formulation significantly enhances topological correctness in cardiac, cell, artery-vein, and Circle of Willis segmentation.

Submitted 9 October, 2024; v1 submitted 16 March, 2024; originally announced March 2024.

Journal ref: MICCAI 2024, Lecture Notes in Computer Science, vol. 15008, pp. 721-731, 2024

arXiv:2403.06601 [pdf, other]

Cross-domain and Cross-dimension Learning for Image-to-Graph Transformers

Authors: Alexander H. Berger, Laurin Lux, Suprosanna Shit, Ivan Ezhov, Georgios Kaissis, Martin J. Menten, Daniel Rueckert, Johannes C. Paetzold

Abstract: Direct image-to-graph transformation is a challenging task that involves solving object detection and relationship prediction in a single model. Due to this task's complexity, large training datasets are rare in many domains, making the training of deep-learning methods challenging. This data sparsity necessitates transfer learning strategies akin to the state-of-the-art in general computer vision. In this work, we introduce a set of methods enabling cross-domain and cross-dimension learning for image-to-graph transformers. We propose (1) a regularized edge sampling loss to effectively learn object relations in multiple domains with different numbers of edges, (2) a domain adaptation framework for image-to-graph transformers aligning image- and graph-level features from different domains, and (3) a projection function that allows using 2D data for training 3D transformers. We demonstrate our method's utility in cross-domain and cross-dimension experiments, where we utilize labeled data from 2D road networks for simultaneous learning in vastly different target domains. Our method consistently outperforms standard transfer learning and self-supervised pretraining on challenging benchmarks, such as retinal or whole-brain vessel graph extraction.

Submitted 5 December, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

arXiv:2401.05740 [pdf, ps, other]

An improved bound for the price of anarchy for related machine scheduling

Authors: Andre Berger, Arman Rouhani, Marc Schröder

Abstract: In this paper, we introduce an improved upper bound for the efficiency of Nash equilibria in utilitarian scheduling games on related machines. The machines have varying speeds and adhere to the Shortest Processing Time (SPT) policy as the global order for job processing. The goal of each job is to minimize its completion time, while the social objective is to minimize the sum of completion times. Our main result provides an upper bound of $2-\frac{1}{2\cdot(2m-1)}$ on the price of anarchy for the general case of $m$ machines. We improve this bound to 3/2 for the case of two machines, and to $2-\frac{1}{2\cdot m}$ for the general case of $m$ machines when the machines have divisible speeds.

Submitted 11 January, 2024; originally announced January 2024.

arXiv:2311.03697 [pdf, other]

Towards Autonomous Crop Monitoring: Inserting Sensors in Cluttered Environments

Authors: Moonyoung Lee, Aaron Berger, Dominic Guri, Kevin Zhang, Lisa Coffee, George Kantor, Oliver Kroemer

Abstract: We present a contact-based phenotyping robot platform that can autonomously insert nitrate sensors into cornstalks to proactively monitor macronutrient levels in crops. This task is challenging because inserting such sensors requires sub-centimeter precision in an environment which contains high levels of clutter, lighting variation, and occlusion. To address these challenges, we develop a robust perception-action pipeline to detect and grasp stalks, and create a custom robot gripper which mechanically aligns the sensor before inserting it into the stalk. Through experimental validation on 48 unique stalks in a cornfield in Iowa, we demonstrate our platform's capability of detecting a stalk with 94% success, grasping a stalk with 90% success, and inserting a sensor with 60% success. In addition to developing an autonomous phenotyping research platform, we share key challenges and insights obtained from deployment in the field. Our research platform is open-sourced, with additional information available at https://kantor-lab.github.io/cornbot.

Submitted 6 November, 2023; originally announced November 2023.

arXiv:2308.08674 [pdf, other]

Approximating Min-Diameter: Standard and Bichromatic

Authors: Aaron Berger, Jenny Kaufmann, Virginia Vassilevska Williams

Abstract: The min-diameter of a directed graph $G$ is a measure of the largest distance between nodes. It is equal to the maximum min-distance $d_{min}(u,v)$ across all pairs $u,v \in V(G)$, where $d_{min}(u,v) = \min(d(u,v), d(v,u))$. Our work provides a $O(m^{1.426}n^{0.288})$-time $3/2$-approximation algorithm for min-diameter in DAGs, and a faster $O(m^{0.713}n)$-time almost-$3/2$-approximation variant. (An almost-$\alpha$-approximation algorithm determines the min-diameter to within a multiplicative factor of $\alpha$ plus constant additive error.) By a conditional lower bound result of [Abboud et al, SODA 2016], a better than $3/2$-approximation can't be achieved in truly subquadratic time under the Strong Exponential Time Hypothesis (SETH), so our result is conditionally tight. We additionally obtain a new conditional lower bound for min-diameter approximation in general directed graphs, showing that under SETH, one cannot achieve an approximation factor below 2 in truly subquadratic time. We also present the first study of approximating bichromatic min-diameter, which is the maximum min-distance between oppositely colored vertices in a 2-colored graph.

Submitted 16 August, 2023; originally announced August 2023.

Comments: ESA 2023
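
To make the min-diameter definition in the abstract above concrete, here is a brute-force sketch that evaluates $\max_{u,v} \min(d(u,v), d(v,u))$ directly on an unweighted directed graph. It is only an exact baseline that runs one breadth-first search per node, so it takes roughly $O(n(n+m))$ time; it is not the paper's subquadratic approximation algorithm. The adjacency-dict input format and the function names are hypothetical choices for illustration, and unreachable pairs are treated as being at infinite distance.

```python
import math
from collections import deque
from itertools import combinations
from typing import Dict, Hashable, Iterable


def bfs_distances(adj: Dict[Hashable, Iterable[Hashable]], source: Hashable) -> Dict[Hashable, int]:
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def min_diameter(adj: Dict[Hashable, Iterable[Hashable]]) -> float:
    """Exact min-diameter: the maximum over pairs {u, v} of min(d(u,v), d(v,u)).

    Returns math.inf if some pair is unreachable in both directions.
    """
    nodes = set(adj) | {v for nbrs in adj.values() for v in nbrs}
    dist = {u: bfs_distances(adj, u) for u in nodes}
    best = 0.0
    for u, v in combinations(nodes, 2):
        d_min = min(dist[u].get(v, math.inf), dist[v].get(u, math.inf))
        best = max(best, d_min)
    return best


if __name__ == "__main__":
    path_dag = {"a": ["b"], "b": ["c"], "c": []}
    print(min_diameter(path_dag))  # the pair (a, c) is two hops apart
```

On the three-node path the farthest pair is `a` and `c`, reachable only from `a` to `c` in two hops, so the min-diameter is 2.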

arXiv:2308.06111 [pdf, other]

Improving Zero-Shot Text Matching for Financial Auditing with Large Language Models

Authors: Lars Hillebrand, Armin Berger, Tobias Deußer, Tim Dilmaghani, Mohamed Khaled, Bernd Kliem, Rüdiger Loitz, Maren Pielka, David Leonhard, Christian Bauckhage, Rafet Sifa

Abstract: Auditing financial documents is a very tedious and time-consuming process. As of today, it can already be simplified by employing AI-based solutions to recommend relevant text passages from a report for each legal requirement of rigorous accounting standards. However, these methods need to be fine-tuned regularly, and they require abundant annotated data, which is often lacking in industrial environments. Hence, we present ZeroShotALI, a novel recommender system that leverages a state-of-the-art large language model (LLM) in conjunction with a domain-specifically optimized transformer-based text-matching solution. We find that a two-step approach of first retrieving a number of best matching document sections per legal requirement with a custom BERT-based model and second filtering these selections using an LLM yields significant performance improvements over existing approaches.

Submitted 14 August, 2023; v1 submitted 11 August, 2023; originally announced August 2023.

Comments: Accepted at DocEng 2023, 4 pages, 1 figure, 2 tables

arXiv:2301.09545 [pdf, other]

doi 10.1145/3544548.3581175

The Entoptic Field Camera as Metaphor-Driven Research-through-Design with AI Technologies

Authors: Jesse Josua Benjamin, Heidi Biggs, Arne Berger, Julija Rukanskaitė, Michael Heidt, Nick Merrill, James Pierce, Joseph Lindley

Abstract: Artificial intelligence (AI) technologies are widely deployed in smartphone photography; and prompt-based image synthesis models have rapidly become commonplace. In this paper, we describe a Research-through-Design (RtD) project which explores this shift in the means and modes of image production via the creation and use of the Entoptic Field Camera. Entoptic phenomena usually refer to perceptions of floaters or bright blue dots stemming from the physiological interplay of the eye and brain. We use the term entoptic as a metaphor to investigate how the material interplay of data and models in AI technologies shapes human experiences of reality. Through our case study using first-person design and a field study, we offer implications for critical, reflective, more-than-human and ludic design to engage AI technologies; the conceptualisation of an RtD research space which contributes to AI literacy discourses; and outline a research trajectory concerning materiality and design affordances of AI technologies.

Submitted 23 January, 2023; originally announced January 2023.

Comments: To be published in Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23), April 23--28, 2023, Hamburg, Germany

arXiv:2210.10561 [pdf, other]

doi 10.1145/3517745.3561452

Illuminating Large-Scale IPv6 Scanning in the Internet

Authors: Philipp Richter, Oliver Gasser, Arthur Berger

Abstract: While scans of the IPv4 space are ubiquitous, today little is known about scanning activity in the IPv6 Internet. In this work, we present a longitudinal and detailed empirical study on large-scale IPv6 scanning behavior in the Internet, based on firewall logs captured at some 230,000 hosts of a major Content Distribution Network (CDN). We develop methods to identify IPv6 scans, assess current and past levels of IPv6 scanning activity, and study dominant characteristics of scans, including scanner origins, targeted services, and insights on how scanners find target IPv6 addresses. Where possible, we compare our findings to what can be assessed from publicly available traces. Our work identifies and highlights new challenges to detect scanning activity in the IPv6 Internet, and uncovers that today's scans of the IPv6 space show widely different characteristics when compared to the more well-known IPv4 scans.

Submitted 19 October, 2022; originally announced October 2022.

Journal ref: in Proceedings of the ACM Internet Measurement Conference (IMC), 2022

arXiv:2201.10491 [pdf]

Playing The Ethics Card: Ethical Aspects In Design Tools For Inspiration And Education

Authors: Albrecht Kurze, Arne Berger

Abstract: This paper relates findings of our own research in the domain of co-design tools in terms of ethical aspects and their opportunities for inspiration and in HCI education. We overview a number of selected general-purpose HCI/design tools as well as domain specific tools for the Internet of Things. These tools are often card-based, not only suitable for workshops with co-designers but also for internal workshops with students to include these aspects in the build-up of their expertise, sometimes even in a playful way.

Submitted 25 January, 2022; originally announced January 2022.

Comments: Workshop Co-designing Resources for Ethics Education in HCI at Conference on Human Factors in Computing Systems (CHI 21). May 9, 2021

arXiv:2101.04035 [pdf, other]

doi 10.1145/3411764.3445481

Machine Learning Uncertainty as a Design Material: A Post-Phenomenological Inquiry

Authors: Jesse Josua Benjamin, Arne Berger, Nick Merrill, James Pierce

Abstract: Design research is important for understanding and interrogating how emerging technologies shape human experience. However, design research with Machine Learning (ML) is relatively underdeveloped. Crucially, designers have not found a grasp on ML uncertainty as a design opportunity rather than an obstacle. The technical literature points to data and model uncertainties as two main properties of ML. Through post-phenomenology, we position uncertainty as one defining material attribute of ML processes which mediate human experience. To understand ML uncertainty as a design material, we investigate four design research case studies involving ML. We derive three provocative concepts: thingly uncertainty: ML-driven artefacts have uncertain, variable relations to their environments; pattern leakage: ML uncertainty can lead to patterns shaping the world they are meant to represent; and futures creep: ML technologies texture human relations to time with uncertainty. Finally, we outline design research trajectories and sketch a post-phenomenological approach to human-ML relations.

Submitted 11 January, 2021; originally announced January 2021.

Comments: Accepted to ACM 2021 CHI Conference on Human Factors in Computing Systems (CHI 2021)

ACM Class: H.5.0

arXiv:2009.12115 [pdf, other]

doi 10.1145/3424954.3424968

Towards Reconstructing Multi-Step Cyber Attacks in Modern Cloud Environments with Tripwires

Authors: Mario Kahlhofer, Michael Hölzl, Andreas Berger

Abstract: Rapidly-changing cloud environments that consist of heavily interconnected components are difficult to secure. Existing solutions often try to correlate many weak indicators to identify and reconstruct multi-step cyber attacks. The lack of a true, causal link between most of these indicators still leaves administrators with a lot of false-positives to browse through. We argue that cyber deception can improve the precision of attack detection systems, if used in a structured, and automatic way, i.e., in the form of so-called tripwires that ultimately span an attack graph, which assists attack reconstruction algorithms. This paper proposes an idea for a framework that combines cyber deception, automatic tripwire injection and attack graphs, which eventually enables us to reconstruct multi-step cyber attacks in modern cloud environments.

Submitted 25 September, 2020; originally announced September 2020.

Comments: To be published in European Interdisciplinary Cybersecurity Conference (EICC 2020)

arXiv:2008.10709 [pdf, ps, other]

Memoryless Worker-Task Assignment with Polylogarithmic Switching Cost

Authors: Aaron Berger, William Kuszmaul, Adam Polak, Jonathan Tidor, Nicole Wein

Abstract: We study the basic problem of assigning memoryless workers to tasks with dynamically changing demands. Given a set of $w$ workers and a multiset $T \subseteq[t]$ of $|T|=w$ tasks, a memoryless worker-task assignment function is any function $φ$ that assigns the workers $[w]$ to the tasks $T$ based only on the current value of $T$. The assignment function $φ$ is said to have switching cost at most $k$ if, for every task multiset $T$, changing the contents of $T$ by one task changes $φ(T)$ by at most $k$ worker assignments. The goal of memoryless worker task assignment is to construct an assignment function with the smallest possible switching cost. In past work, the problem of determining the optimal switching cost has been posed as an open question. There are no known sub-linear upper bounds, and after considerable effort, the best known lower bound remains 4 (ICALP 2020). We show that it is possible to achieve polylogarithmic switching cost. We give a construction via the probabilistic method that achieves switching cost $O(\log w \log (wt))$ and an explicit construction that achieves switching cost $\operatorname{polylog} (wt)$. We also prove a super-constant lower bound on switching cost: we show that for any value of $w$, there exists a value of $t$ for which the optimal switching cost is $w$. Thus it is not possible to achieve a switching cost that is sublinear strictly as a function of $w$. Finally, we present an application of the worker-task assignment problem to a metric embeddings problem. In particular, we use our results to give the first low-distortion embedding from sparse binary vectors into low-dimensional Hamming space.

Submitted 28 April, 2022; v1 submitted 24 August, 2020; originally announced August 2020.

Comments: ICALP 2022

arXiv:2004.12195 [pdf, other]

QURATOR: Innovative Technologies for Content and Data Curation

Authors: Georg Rehm, Peter Bourgonje, Stefanie Hegele, Florian Kintzel, Julián Moreno Schneider, Malte Ostendorff, Karolina Zaczynska, Armin Berger, Stefan Grill, Sören Räuchle, Jens Rauenbusch, Lisa Rutenburg, André Schmidt, Mikka Wild, Henry Hoffmann, Julian Fink, Sarah Schulz, Jurica Seva, Joachim Quantz, Joachim Böttger, Josefine Matthey, Rolf Fricke, Jan Thomsen, Adrian Paschke, Jamal Al Qundus, et al. (15 additional authors not shown)

Abstract: In all domains and sectors, the demand for intelligent systems to support the processing and generation of digital content is rapidly increasing. The availability of vast amounts of content and the pressure to publish new content quickly and in rapid succession requires faster, more efficient and smarter processing and generation methods. With a consortium of ten partners from research and industry and a broad range of expertise in AI, Machine Learning and Language Technologies, the QURATOR project, funded by the German Federal Ministry of Education and Research, develops a sustainable and innovative technology platform that provides services to support knowledge workers in various industries to address the challenges they face when curating digital content. The project's vision and ambition is to establish an ecosystem for content curation technologies that significantly pushes the current state of the art and transforms its region, the metropolitan area Berlin-Brandenburg, into a global centre of excellence for curation technologies.

Submitted 25 April, 2020; originally announced April 2020.

Comments: Proceedings of QURATOR 2020: The conference for intelligent content solutions, Berlin, Germany, February 2020

arXiv:1912.13273 [pdf]

From Ideation to Implications: Directions for the Internet of Things in the Home

Authors: Albrecht Kurze, Arne Berger, Teresa Denefleh

Abstract: In this paper we give a brief overview of our approaches and ongoing work for future directions of the Internet of Things (IoT) with a focus on the IoT in the home. We highlight some of our activities including tools and methods for an ideation-driven approach as well as for an implications-driven approach. We point to some findings of workshops and empirical field-studies. We show examples for new classes of idiosyncratic IoT devices, how implications emerge by (mis)using sensor data and how users interacted with IoT systems in shared spaces.

Submitted 31 December, 2019; originally announced December 2019.

Comments: Proceedings of the CHI 2019 Workshop on New Directions for the IoT: Automate, Share, Build, and Care, (arXiv:1906.06089)

Report number: IOTD/2019/05

arXiv:1911.09890 [pdf, other]

Degree-Bounded Generalized Polymatroids and Approximating the Metric Many-Visits TSP

Authors: Kristóf Bérczi, André Berger, Matthias Mnich, Roland Vincze

Abstract: In the Bounded Degree Matroid Basis Problem, we are given a matroid and a hypergraph on the same ground set, together with costs for the elements of that set as well as lower and upper bounds $f(\varepsilon)$ and $g(\varepsilon)$ for each hyperedge $\varepsilon$. The objective is to find a minimum-cost basis $B$ such that $f(\varepsilon) \leq |B \cap \varepsilon| \leq g(\varepsilon)$ for each hyperedge $\varepsilon$. Király et al. (Combinatorica, 2012) provided an algorithm that finds a basis of cost at most the optimum value which violates the lower and upper bounds by at most $2 Δ-1$, where $Δ$ is the maximum degree of the hypergraph. When only lower or only upper bounds are present for each hyperedge, this additive error is decreased to $Δ-1$. We consider an extension of the matroid basis problem to generalized polymatroids, or g-polymatroids, and additionally allow element multiplicities. The Bounded Degree g-polymatroid Element Problem with Multiplicities takes as input a g-polymatroid $Q(p,b)$ instead of a matroid, and besides the lower and upper bounds, each hyperedge $\varepsilon$ has element multiplicities $m_\varepsilon$. Building on the approach of Király et al., we provide an algorithm for finding a solution of cost at most the optimum value, having the same additive approximation guarantee. As an application, we develop a $1.5$-approximation for the metric Many-Visits TSP, where the goal is to find a minimum-cost tour that visits each city $v$ a positive $r(v)$ number of times. Our approach combines our algorithm for the Bounded Degree g-polymatroid Element Problem with Multiplicities with the principle of Christofides' algorithm from 1976 for the (single-visit) metric TSP, whose approximation guarantee it matches.

Submitted 14 December, 2019; v1 submitted 22 November, 2019; originally announced November 2019.

Comments: 17 pages

arXiv:1805.06265 [pdf, ps, other]

Integrated Bounds for Disintegrated Storage

Authors: Alon Berger, Idit Keidar, Alexander Spiegelman

Abstract: We point out a somewhat surprising similarity between non-authenticated Byzantine storage, coded storage, and certain emulations of shared registers from smaller ones. A common characteristic in all of these is the inability of reads to safely return a value obtained in a single atomic access to shared storage. We collectively refer to such systems as disintegrated storage, and show integrated space lower bounds for asynchronous regular wait-free emulations in all of them. In a nutshell, if readers are invisible, then the storage cost of such systems is inherently exponential in the size of written values; otherwise, it is at least linear in the number of readers. Our bounds are asymptotically tight to known algorithms, and thus justify their high costs.

Submitted 6 August, 2018; v1 submitted 16 May, 2018; originally announced May 2018.

arXiv:1804.06361 [pdf, other]

A time- and space-optimal algorithm for the many-visits TSP

Authors: André Berger, László Kozma, Matthias Mnich, Roland Vincze

Abstract: The many-visits traveling salesperson problem (MV-TSP) asks for an optimal tour of $n$ cities that visits each city $c$ a prescribed number $k_c$ of times. Travel costs may be asymmetric, and visiting a city twice in a row may incur a non-zero cost. The MV-TSP problem finds applications in scheduling, geometric approximation, and Hamiltonicity of certain graph families. The fastest known algorithm for MV-TSP is due to Cosmadakis and Papadimitriou (SICOMP, 1984). It runs in time $n^{O(n)} + O(n^3 \log \sum_c k_c )$ and requires $n^{Θ(n)}$ space. An interesting feature of the Cosmadakis-Papadimitriou algorithm is its \emph{logarithmic} dependence on the total length $\sum_c k_c$ of the tour, allowing the algorithm to handle instances with very long tours. The \emph{superexponential} dependence on the number of cities in both the time and space complexity, however, renders the algorithm impractical for all but the narrowest range of this parameter. In this paper we improve upon the Cosmadakis-Papadimitriou algorithm, giving an MV-TSP algorithm that runs in time $2^{O(n)}$, i.e.\ \emph{single-exponential} in the number of cities, using \emph{polynomial} space. Our algorithm is deterministic, and arguably both simpler and easier to analyse than the original approach of Cosmadakis and Papadimitriou. It involves an optimization over directed spanning trees and a recursive, centroid-based decomposition of trees.

Submitted 21 April, 2020; v1 submitted 17 April, 2018; originally announced April 2018.

Comments: Small fixes, journal version

arXiv:1710.02984 [pdf]

doi 10.1038/s41598-017-12925-z

Algorithm guided outlining of 105 pancreatic cancer liver metastases in Ultrasound

Authors: Alexander Hann, Lucas Bettac, Mark M. Haenle, Tilmann Graeter, Andreas W. Berger, Jens Dreyhaupt, Dieter Schmalstieg, Wolfram G. Zoller, Jan Egger

Abstract: Manual segmentation of hepatic metastases in ultrasound images acquired from patients suffering from pancreatic cancer is common practice. Semiautomatic measurements promising assistance in this process are often assessed using a small number of lesions performed by examiners who already know the algorithm. In this work, we present the application of an algorithm for the segmentation of liver metastases due to pancreatic cancer using a set of 105 different images of metastases. The algorithm and the two examiners had never assessed the images before. The examiners first performed a manual segmentation and, after five weeks, a semiautomatic segmentation using the algorithm. They were satisfied in up to 90% of the cases with the semiautomatic segmentation results. Using the algorithm was significantly faster and resulted in a median Dice similarity score of over 80%. Estimation of the inter-operator variability by using the intra class correlation coefficient was good with 0.8. In conclusion, the algorithm facilitates fast and accurate segmentation of liver metastases, comparable to the current gold standard of manual segmentation.

Submitted 9 October, 2017; originally announced October 2017.

Comments: 7 pages, 3 Figures, 3 Tables, 46 References

Journal ref: Sci Rep. 2017 Oct 6;7(1):12779
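
The abstract above reports a median Dice similarity score of over 80% between manual and semiautomatic segmentations. As a reminder of what that score measures, here is a minimal sketch of the standard Dice coefficient, 2|A ∩ B| / (|A| + |B|). Representing each segmentation as a set of pixel coordinates is an assumption made for illustration, and the two toy masks are hypothetical; neither comes from the paper.

```python
def dice(mask_a, mask_b):
    """Dice similarity of two segmentations given as collections of pixel coordinates."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty masks count as identical
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical toy masks (not data from the paper).
manual = {(0, 0), (0, 1), (1, 0), (1, 1)}
semiautomatic = {(0, 1), (1, 1), (1, 2)}

# Two shared pixels, mask sizes 4 and 3: 2 * 2 / (4 + 3)
print(round(dice(manual, semiautomatic), 3))  # 0.571
```

A score of 1.0 means the two outlines cover exactly the same pixels; 0.0 means they do not overlap at all.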

arXiv:1707.03900 [pdf, other]

kIP: a Measured Approach to IPv6 Address Anonymization

Authors: David Plonka, Arthur Berger

Abstract: Privacy-minded Internet service operators anonymize IPv6 addresses by truncating them to a fixed length, perhaps due to long-standing use of this technique with IPv4 and a belief that it's "good enough." We claim that simple anonymization by truncation is suspect since it does not entail privacy guarantees nor does it take into account some common address assignment practices observed today. To investigate, with standard activity logs as input, we develop a counting method to determine a lower bound on the number of active IPv6 addresses that are simultaneously assigned, such as those of clients that access World-Wide Web services. In many instances, we find that these empirical measurements offer no evidence that truncating IPv6 addresses to a fixed number of bits, e.g., 48 in common practice, protects individuals' privacy. To remedy this problem, we propose kIP anonymization, an aggregation method that ensures a certain level of address privacy. Our method adaptively determines variable truncation lengths using parameter k, the desired number of active (rather than merely potential) addresses, e.g., 32 or 256, that can not be distinguished from each other once anonymized. We describe our implementation and present first results of its application to millions of real IPv6 client addresses active over a week's time, demonstrating both feasibility at large scale and ability to automatically adapt to each network's address assignment practice and synthesize a set of anonymous aggregates (prefixes), each of which is guaranteed to cover (contain) at least k of the active addresses. Each address is anonymized by truncating it to the length of its longest matching prefix in that set.

Submitted 12 July, 2017; originally announced July 2017.
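
The final step described in the kIP abstract above, truncating each address to its longest matching prefix in a set of aggregates, can be sketched with Python's standard `ipaddress` module. This is only an illustration of that one step under stated assumptions: the function name `anonymize` and the aggregate set are hypothetical, and the sketch does not show how kIP synthesizes aggregates that each cover at least k active addresses.

```python
import ipaddress


def anonymize(address, aggregates):
    """Return the longest prefix in `aggregates` that contains `address`, or None."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in aggregates if addr in net]
    if not matches:
        return None  # no aggregate covers this address
    return max(matches, key=lambda net: net.prefixlen)


# Hypothetical aggregate set using documentation prefixes; in kIP, each prefix
# would be chosen so that it covers at least k active addresses.
AGGREGATES = [
    ipaddress.ip_network("2001:db8::/32"),
    ipaddress.ip_network("2001:db8:1::/48"),
]

print(anonymize("2001:db8:1:2::3", AGGREGATES))  # 2001:db8:1::/48
```

The linear scan keeps the sketch short; a real deployment over millions of addresses would more plausibly use a prefix trie for the lookup.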

arXiv:1609.05137 [pdf, other]

doi 10.1016/j.mex.2018.06.018

A unifying framework for fast randomization of ecological networks with fixed (node) degrees

Authors: Corrie Jacobien Carstens, Annabell Berger, Giovanni Strona

Abstract: The switching model is a Markov chain approach to sample graphs with fixed degree sequence uniformly at random. The recently invented Curveball algorithm for bipartite graphs applies several switches simultaneously (`trades'). Here, we introduce Curveball algorithms for simple (un)directed graphs which use single or simultaneous trades. We show experimentally that these algorithms converge magnitudes faster than the corresponding switching models.

Submitted 26 July, 2018; v1 submitted 16 September, 2016; originally announced September 2016.

Journal ref: Corrie Jacobien Carstens, Annabell Berger, Giovanni Strona, A unifying framework for fast randomization of ecological networks with fixed (node) degrees, MethodsX, Volume 5, 2018, Pages 773-780
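
The switching model mentioned in the abstract above rests on one elementary move: pick two edges and exchange endpoints so that every node keeps its degree. The sketch below shows a single such switch for a simple undirected graph. It is an illustration under assumptions (edges stored as sorted tuples, rejected moves leave the graph unchanged, hypothetical function names) and is not the authors' Curveball algorithm, which applies several switches at once.

```python
import random
from collections import Counter


def norm(u, v):
    """Store an undirected edge as a sorted tuple."""
    return (u, v) if u < v else (v, u)


def try_switch(edges, rng):
    """Attempt one switch {a,b},{c,d} -> {a,d},{c,b}; return the (possibly unchanged) edge set."""
    (a, b), (c, d) = rng.sample(sorted(edges), 2)
    if rng.random() < 0.5:
        c, d = d, c  # choose one of the two possible rewirings
    if a == d or c == b:
        return edges  # rejected: would create a self-loop
    e1, e2 = norm(a, d), norm(c, b)
    if e1 in edges or e2 in edges:
        return edges  # rejected: would create a multi-edge
    return (edges - {(a, b), norm(c, d)}) | {e1, e2}


def degrees(edges):
    """Degree of every node in the edge set."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


rng = random.Random(0)
graph = {norm(i, (i + 1) % 8) for i in range(8)}  # hypothetical input: an 8-cycle
start_degrees = degrees(graph)
for _ in range(200):
    graph = try_switch(graph, rng)
print(degrees(graph) == start_degrees)  # True
```

Each accepted switch removes one edge at each of the four endpoints and adds one back, which is why the degree sequence is invariant.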

arXiv:1607.04597 [pdf, ps, other]

Query Complexity of Mastermind Variants

Authors: Aaron Berger, Christopher Chute, Matthew Stone

Abstract: We study variants of Mastermind, a popular board game in which the objective is sequence reconstruction. In this two-player game, the so-called \textit{codemaker} constructs a hidden sequence $H = (h_1, h_2, \ldots, h_n)$ of colors selected from an alphabet $\mathcal{A} = \{1,2,\ldots, k\}$ (\textit{i.e.,} $h_i\in\mathcal{A}$ for all $i\in\{1,2,\ldots, n\}$). The game then proceeds in turns, each of which consists of two parts: in turn $t$, the second player (the \textit{codebreaker}) first submits a query sequence $Q_t = (q_1, q_2, \ldots, q_n)$ with $q_i\in \mathcal{A}$ for all $i$, and second receives feedback $Δ(Q_t, H)$, where $Δ$ is some agreed-upon function of distance between two sequences with $n$ components. The game terminates when $Q_t = H$, and the codebreaker seeks to end the game in as few turns as possible. Throughout we let $f(n,k)$ denote the smallest integer such that the codebreaker can determine any $H$ in $f(n,k)$ turns. We prove three main results: First, when $H$ is known to be a permutation of $\{1,2,\ldots, n\}$, we prove that $f(n, n)\ge n - \log\log n$ for all sufficiently large $n$. Second, we show that Knuth's Minimax algorithm identifies any $H$ in at most $nk$ queries. Third, when feedback is not received until all queries have been submitted, we show that $f(n,k)=Ω(n\log k)$.

Submitted 25 September, 2017; v1 submitted 15 July, 2016; originally announced July 2016.

Comments: Revised and trimmed- 17 pages

MSC Class: 91A46; 68Q25

arXiv:1606.04327 [pdf, other]

doi 10.1145/2987443.2987445

Entropy/IP: Uncovering Structure in IPv6 Addresses

Authors: Pawel Foremski, David Plonka, Arthur Berger

Abstract: In this paper, we introduce Entropy/IP: a system that discovers Internet address structure based on analyses of a subset of IPv6 addresses known to be active, i.e., training data, gleaned by readily available passive and active means. The system is completely automated and employs a combination of information-theoretic and machine learning techniques to probabilistically model IPv6 addresses. We present results showing that our system is effective in exposing structural characteristics of portions of the IPv6 Internet address space populated by active client, service, and router addresses. In addition to visualizing the address structure for exploration, the system uses its models to generate candidate target addresses for scanning. For each of 15 evaluated datasets, we train on 1K addresses and generate 1M candidates for scanning. We achieve some success in 14 datasets, finding up to 40% of the generated addresses to be active. In 11 of these datasets, we find active network identifiers (e.g., /64 prefixes or `subnets') not seen in training. Thus, we provide the first evidence that it is practical to discover subnets and hosts by scanning probabilistically selected areas of the IPv6 address space not known to contain active hosts a priori.

Submitted 21 November, 2016; v1 submitted 14 June, 2016; originally announced June 2016.

Comments: Paper presented at the ACM IMC 2016 in Santa Monica, USA (https://dl.acm.org/citation.cfm?id=2987445). Live Demo site available at http://www.entropy-ip.com/

Journal ref: IMC '16 Proceedings of the 2016 ACM on Internet Measurement Conference, pp. 167-181

arXiv:1606.00360 [pdf, other]

doi 10.1145/2987443.2987473

Beyond Counting: New Perspectives on the Active IPv4 Address Space

Authors: Philipp Richter, Georgios Smaragdakis, David Plonka, Arthur Berger

Abstract: In this study, we report on techniques and analyses that enable us to capture Internet-wide activity at individual IP address-level granularity by relying on server logs of a large commercial content delivery network (CDN) that serves close to 3 trillion HTTP requests on a daily basis. Across the whole of 2015, these logs recorded client activity involving 1.2 billion unique IPv4 addresses, the highest ever measured, in agreement with recent estimates. Monthly client IPv4 address counts showed constant growth for years prior, but since 2014, the IPv4 count has stagnated while IPv6 counts have grown. Thus, it seems we have entered an era marked by increased complexity, one in which the sole enumeration of active IPv4 addresses is of little use to characterize recent growth of the Internet as a whole. With this observation in mind, we consider new points of view in the study of global IPv4 address activity. Our analysis shows significant churn in active IPv4 addresses: the set of active IPv4 addresses varies by as much as 25% over the course of a year. Second, by looking across the active addresses in a prefix, we are able to identify and attribute activity patterns to network restructurings, user behaviors, and, in particular, various address assignment practices. Third, by combining spatio-temporal measures of address utilization with measures of traffic volume, and sampling-based estimates of relative host counts, we present novel perspectives on worldwide IPv4 address activity, including empirical observation of under-utilization in some areas, and complete utilization, or exhaustion, in others.

Submitted 9 September, 2016; v1 submitted 1 June, 2016; originally announced June 2016.

Comments: in Proceedings of ACM IMC 2016

arXiv:1508.04740 [pdf, ps, other]

doi 10.1371/journal.pone.0147935

Marathon: An open source software library for the analysis of Markov-Chain Monte Carlo algorithms

Authors: Steffen Rechner, Annabell Berger

Abstract: In this paper, we consider the Markov-Chain Monte Carlo (MCMC) approach for random sampling of combinatorial objects. The running time of such an algorithm depends on the total mixing time of the underlying Markov chain and is unknown in general. For some Markov chains, upper bounds on this total mixing time exist but are too large to be applicable in practice. We try to answer the question, whether the total mixing time is close to its upper bounds, or if there is a significant gap between them. In doing so, we present the software library marathon which is designed to support the analysis of MCMC based sampling algorithms. The main application of this library is to compute properties of so-called state graphs which represent the structure of Markov chains. We use marathon to investigate the quality of several bounding methods on four well-known Markov chains for sampling perfect matchings and bipartite graph realizations. In a set of experiments, we compute the total mixing time and several of its bounds for a large number of input instances. We find that the upper bound gained by the famous canonical path method is several magnitudes larger than the total mixing time and deteriorates with growing input size. In contrast, the spectral bound is found to be a precise approximation of the total mixing time.

Submitted 14 September, 2016; v1 submitted 19 August, 2015; originally announced August 2015.

arXiv:1506.08134 [pdf, other]

doi 10.1145/2815675.2815678

Temporal and Spatial Classification of Active IPv6 Addresses

Authors: David Plonka, Arthur Berger

Abstract: There is striking volume of World-Wide Web activity on IPv6 today. In early 2015, one large Content Distribution Network handles 50 billion IPv6 requests per day from hundreds of millions of IPv6 client addresses; billions of unique client addresses are observed per month. Address counts, however, obscure the number of hosts with IPv6 connectivity to the global Internet. There are numerous address assignment and subnetting options in use; privacy addresses and dynamic subnet pools significantly inflate the number of active IPv6 addresses. As the IPv6 address space is vast, it is infeasible to comprehensively probe every possible unicast IPv6 address. Thus, to survey the characteristics of IPv6 addressing, we perform a year-long passive measurement study, analyzing the IPv6 addresses gleaned from activity logs for all clients accessing a global CDN. The goal of our work is to develop flexible classification and measurement methods for IPv6, motivated by the fact that its addresses are not merely more numerous; they are different in kind. We introduce the notion of classifying addresses and prefixes in two ways: (1) temporally, according to their instances of activity to discern which addresses can be considered stable; (2) spatially, according to the density or sparsity of aggregates in which active addresses reside. We present measurement and classification results numerically and visually that: provide details on IPv6 address use and structure in global operation across the past year; establish the efficacy of our classification methods; and demonstrate that such classification can clarify dimensions of the Internet that otherwise appear quite blurred by current IPv6 addressing practices.

Submitted 17 July, 2015; v1 submitted 26 June, 2015; originally announced June 2015.

arXiv:1504.06779 [pdf, other]

Computational Cost Reduction in Learned Transform Classifications

Authors: Emerson Lopes Machado, Cristiano Jacques Miosso, Ricardo von Borries, Murilo Coutinho, Pedro de Azevedo Berger, Thiago Marques, Ricardo Pezzuol Jacobi

Abstract: We present a theoretical analysis and empirical evaluations of a novel set of techniques for computational cost reduction of classifiers that are based on learned transform and soft-threshold. By modifying optimization procedures for dictionary and classifier training, as well as the resulting dictionary entries, our techniques make it possible to reduce the bit precision and to replace each floating-point multiplication by a single integer bit shift. We also show how the optimization algorithms in some dictionary training methods can be modified to penalize higher-energy dictionaries. We applied our techniques with the classifier Learning Algorithm for Soft-Thresholding, testing on the datasets used in its original paper. Our results indicate it is feasible to use solely sums and bit shifts of integers to classify at test time with a limited reduction of the classification accuracy. These low-power operations are a valuable trade-off in FPGA implementations as they increase the classification throughput while decreasing both energy consumption and manufacturing cost.

Submitted 30 April, 2016; v1 submitted 25 April, 2015; originally announced April 2015.

arXiv:1406.1605 [pdf, ps, other]

Energy Efficient and Reliable Wireless Sensor Networks - An Extension to IEEE 802.15.4e

Authors: Achim Berger, Markus Pichler, Werner Haslmayr, Andreas Springer

Abstract: Collecting sensor data in industrial environments from up to some tens of battery-powered sensor nodes with sampling rates up to 100 Hz requires energy-aware protocols, which avoid collisions and long listening phases. The IEEE 802.15.4 standard focuses on energy-aware wireless sensor networks (WSNs) and the Task Group 4e has published an amendment to fulfill up to 100 sensor value transmissions per second per sensor node (Low Latency Deterministic Network (LLDN) mode) to satisfy demands of factory automation. To improve the reliability of the data collection in the star topology of the LLDN mode, we propose a relay strategy, which can be performed within the LLDN schedule. Furthermore, we propose an extension of the star topology to collect data from two-hop sensor nodes. The proposed Retransmission Mode enables power savings in the sensor node of more than 33%, while reducing the packet loss by up to 50%. To reach this performance, an optimum spatial distribution is necessary, which is discussed in detail.

Submitted 6 June, 2014; originally announced June 2014.

arXiv:1404.4249 [pdf, other]

Broder's Chain Is Not Rapidly Mixing

Authors: Annabell Berger, Steffen Rechner

Abstract: We prove that Broder's Markov chain for approximate sampling near-perfect and perfect matchings is not rapidly mixing for Hamiltonian, regular, threshold and planar bipartite graphs, filling a gap in the literature. In the second part we experimentally compare Broder's chain with the Markov chain by Jerrum, Sinclair and Vigoda from 2004. For the first time, we provide a systematic experimental investigation of mixing time bounds for these Markov chains. We observe that the exact total mixing time is in many cases significantly lower than known upper bounds using canonical path or multicommodity flow methods, even if the structure of an underlying state graph is known. In contrast we observe comparatively tighter upper bounds using spectral gaps.

Submitted 16 April, 2014; originally announced April 2014.

Comments: Keywords: sampling of matchings, rapidly mixing Markov chains, permanent of a matrix, random generation, monomer-dimer systems, Markov chain Monte Carlo

arXiv:1212.5443 [pdf, ps, other]

The Connection between the Number of Realizations for Degree Sequences and Majorization

Authors: Annabell Berger

Abstract: The \emph{graph realization problem} is to find for given nonnegative integers $a_1,\dots,a_n$ a simple graph (no loops or multiple edges) such that each vertex $v_i$ has degree $a_i.$ Given pairs of nonnegative integers $(a_1,b_1),\dots,(a_n,b_n),$ (i) the \emph{bipartite realization problem} asks whether there is a bipartite graph (no loops or multiple edges) such that vectors $(a_1,\dots,a_n)$ and $(b_1,\dots,b_n)$ correspond to the lists of degrees in the two partite sets, (ii) the \emph{digraph realization problem} is to find a digraph (no loops or multiple arcs) such that each vertex $v_i$ has indegree $a_i$ and outdegree $b_i.$ The classic literature provides characterizations for the existence of such realizations that are strongly related to the concept of majorization. Aigner and Triesch (1994) extended this approach to a more general result for graphs, leading to an efficient realization algorithm and a short and simple proof for the Erdős-Gallai Theorem. We extend this approach to the bipartite realization problem and the digraph realization problem. Our main result is the connection between majorization and the number of realizations for a degree list in all three problems. We show: if degree list $S'$ majorizes $S$ in a certain sense, then $S$ possesses more realizations than $S'.$ We prove that constant lists possess the largest number of realizations for fixed $n$ and a fixed number of arcs $m$ when $n$ divides $m.$ So-called \emph{minconvex lists} for graphs and bipartite graphs or \emph{opposed minconvex lists} for digraphs maximize the number of realizations when $n$ does not divide $m$.

Submitted 1 July, 2014; v1 submitted 21 December, 2012; originally announced December 2012.

Comments: 30 pages. There was a mistake in case 3 and case 4 in the proof of Proposition 10 (current version). I corrected it. For that I added a further result in Proposition 9

arXiv:1203.3636 [pdf, other]

How to Attack the NP-complete Dag Realization Problem in Practice

Authors: Annabell Berger, Matthias Müller-Hannemann

Abstract: We study the following fundamental realization problem of directed acyclic graphs (dags). Given a sequence S:=(a_1,b_1),...,(a_n, b_n) with a_i, b_i in Z_0^+, does there exist a dag (no parallel arcs allowed) with labeled vertex set V:= {v_1,...,v_n} such that for all v_i in V indegree and outdegree of v_i match exactly the given numbers a_i and b_i, respectively? Recently this decision problem has been shown to be NP-complete by Nichterlein (2011). However, we can show that several important classes of sequences are efficiently solvable. In previous work (Berger and Mueller-Hannemann, FCT2011), we have proved that yes-instances always have a special kind of topological order which allows us to drastically reduce the number of possible topological orderings in most cases. This leads to an exact exponential-time algorithm which significantly improves upon a straightforward approach. Moreover, a combination of this exponential-time algorithm with a special strategy gives a linear-time algorithm. Interestingly, in systematic experiments we observed that we could solve a huge majority of all instances by the linear-time heuristic. This motivates us to develop characteristics like dag density and "distance to provably easy sequences" which can give us an indicator of how easy or difficult it is to realize a given sequence. Furthermore, we propose a randomized algorithm which exploits our structural insight on topological sortings and uses a number of reduction rules. We observe that it clearly outperforms all other variants and behaves surprisingly well for almost all instances. Another striking observation is that our simple linear-time algorithm solves, without any exception, a set of real-world instances from different domains, namely ordered binary decision diagrams (OBDDs), train and flight schedules, as well as instances derived from food-web networks.

Submitted 16 March, 2012; originally announced March 2012.

Comments: 20 pages, 11 figures, extended abstract to appear in Proceedings of SEA 2012

arXiv:0912.0685 [pdf, ps, other]

Uniform sampling of undirected and directed graphs with a fixed degree sequence

Authors: Annabell Berger, Matthias Müller-Hannemann

Abstract: Many applications in network analysis require algorithms to sample uniformly at random from the set of all graphs with a prescribed degree sequence. We present a Markov chain based approach which converges to the uniform distribution of all realizations for both the directed and undirected case. It remains an open challenge whether these Markov chains are rapidly mixing. For the case of directed graphs, we also explain in this paper that a popular switching algorithm fails in general to sample uniformly at random because the state graph of the Markov chain decomposes into different isomorphic components. We call degree sequences for which the state graph is strongly connected arc swap sequences. To handle arbitrary degree sequences, we develop two different solutions. The first uses an additional operation (a reorientation of induced directed 3-cycles) which makes the state graph strongly connected, the second selects randomly one of the isomorphic components and samples inside it. Our main contribution is a precise characterization of arc swap sequences, leading to an efficient recognition algorithm. Finally, we point out some interesting consequences for network analysis.

Submitted 5 March, 2010; v1 submitted 3 December, 2009; originally announced December 2009.

ACM Class: F.2.2; G.2.2; G.2.3

arXiv:cmp-lg/9706018 [pdf, ps, other]

A Model of Lexical Attraction and Repulsion

Authors: Doug Beeferman, Adam Berger, John Lafferty

Abstract: This paper introduces new methods based on exponential families for modeling the correlations between words in text and speech. While previous work assumed the effects of word co-occurrence statistics to be constant over a window of several hundred words, we show that their influence is nonstationary on a much smaller time scale. Empirical data drawn from English and Japanese text, as well as conversational speech, reveals that the "attraction" between words decays exponentially, while stylistic and syntactic constraints create a "repulsion" between words that discourages close co-occurrence. We show that these characteristics are well described by simple mixture models based on two-stage exponential distributions which can be trained using the EM algorithm. The resulting distance distributions can then be incorporated as penalizing features in an exponential language model.

Submitted 16 June, 1997; v1 submitted 12 June, 1997; originally announced June 1997.

Comments: 8 pages, LaTeX source and postscript figures for ACL/EACL'97 paper

arXiv:cmp-lg/9706016 [pdf, ps, other]

Text Segmentation Using Exponential Models

Authors: Doug Beeferman, Adam Berger, John Lafferty

Abstract: This paper introduces a new statistical approach to partitioning text automatically into coherent segments. Our approach enlists both short-range and long-range language models to help it sniff out likely sites of topic changes in text. To aid its search, the system consults a set of simple lexical hints it has learned to associate with the presence of boundaries through inspection of a large corpus of annotated data. We also propose a new probabilistically motivated error metric for use by the natural language processing and information retrieval communities, intended to supersede precision and recall for appraising segmentation algorithms. Qualitative assessment of our algorithm as well as evaluation using this new metric demonstrate the effectiveness of our approach in two very different domains, Wall Street Journal articles and the TDT Corpus, a collection of newswire articles and broadcast news transcripts.

Submitted 12 June, 1997; v1 submitted 11 June, 1997; originally announced June 1997.

Comments: 12 pages, LaTeX source and postscript figures for EMNLP-2 paper

Showing 1–41 of 41 results for author: Berger, A