Showing 1–10 of 10 results for author: Thiele, M

  1. arXiv:2506.03817  [pdf, ps, other]

    cs.LG

    Survey of Active Learning Hyperparameters: Insights from a Large-Scale Experimental Grid

    Authors: Julius Gonsior, Tim Rieß, Anja Reusch, Claudio Hartmann, Maik Thiele, Wolfgang Lehner

    Abstract: Annotating data is a time-consuming and costly task, but it is inherently required for supervised machine learning. Active Learning (AL) is an established method that minimizes human labeling effort by iteratively selecting the most informative unlabeled samples for expert annotation, thereby improving the overall classification performance. Even though AL has been known for decades, AL is still r…

    Submitted 4 June, 2025; originally announced June 2025.

  2. arXiv:2403.03504  [pdf, other]

    cs.DS cs.DM

    Graph Visualization for Blockchain Data

    Authors: Marcell Dietl, Andre Gemünd, Daniel Oeltz, Felix M. Thiele, Christian Werner

    Abstract: In this report, we introduce a novel approach to visualize extremely large graphs efficiently. Our method combines two force-directed algorithms, Kamada-Kawai and ForceAtlas2, to handle different graph components based on their node count. Additionally, we suggest utilizing the Fast Multipole method to enhance the speed of ForceAtlas2. Although initially designed for analyzing bitcoin transaction…

    Submitted 6 March, 2024; originally announced March 2024.

  3. arXiv:2210.03005  [pdf, other]

    cs.LG cs.AI cs.CL cs.DB

    To Softmax, or not to Softmax: that is the question when applying Active Learning for Transformer Models

    Authors: Julius Gonsior, Christian Falkenberg, Silvio Magino, Anja Reusch, Maik Thiele, Wolfgang Lehner

    Abstract: Despite achieving state-of-the-art results in nearly all Natural Language Processing applications, fine-tuning Transformer-based language models still requires a significant amount of labeled data to work. A well-known technique to reduce the amount of human effort in acquiring a labeled dataset is Active Learning (AL): an iterative process in which only the minimal amount of samples is l…

    Submitted 6 October, 2022; originally announced October 2022.

  4. ImitAL: Learned Active Learning Strategy on Synthetic Data

    Authors: Julius Gonsior, Maik Thiele, Wolfgang Lehner

    Abstract: Active Learning (AL) is a well-known standard method for efficiently obtaining annotated data by first labeling the samples that contain the most information based on a query strategy. In the past, a large variety of such query strategies has been proposed, with each generation of new strategies increasing the runtime and adding more complexity. However, to the best of our knowledge, none of t…

    Submitted 24 August, 2022; originally announced August 2022.

    Comments: arXiv admin note: text overlap with arXiv:2108.07670

  5. arXiv:2108.07670  [pdf, other]

    cs.LG cs.AI

    ImitAL: Learning Active Learning Strategies from Synthetic Data

    Authors: Julius Gonsior, Maik Thiele, Wolfgang Lehner

    Abstract: One of the biggest challenges that complicates applied supervised machine learning is the need for huge amounts of labeled data. Active Learning (AL) is a well-known standard method for efficiently obtaining labeled data by first labeling the samples that contain the most information based on a query strategy. Although many methods for query strategies have been proposed in the past, no clear supe…

    Submitted 17 August, 2021; originally announced August 2021.

  6. arXiv:2106.12224  [pdf, other]

    cs.DC

    Revisiting the Arguments for Edge Computing Research

    Authors: Blesson Varghese, Eyal de Lara, Aaron Ding, Cheol-Ho Hong, Flavio Bonomi, Schahram Dustdar, Paul Harvey, Peter Hewkin, Weisong Shi, Mark Thiele, Peter Willis

    Abstract: This article argues that low latency, high bandwidth, device proliferation, sustainable digital infrastructure, and data privacy and sovereignty continue to motivate the need for edge computing research even though its initial concepts were formulated more than a decade ago.

    Submitted 23 June, 2021; originally announced June 2021.

  7. arXiv:2105.14867  [pdf, other]

    cs.DB

    Accurate and Efficient Time Series Matching by Season- and Trend-aware Symbolic Approximation -- Extended Version Including Additional Evaluation and Proofs

    Authors: Lars Kegel, Claudio Hartmann, Maik Thiele, Wolfgang Lehner

    Abstract: Processing and analyzing time series datasets have become a central issue in many domains requiring data management systems to support time series as a native data type. A crucial prerequisite of these systems is time series matching, which still is a challenging problem. A time series is a high-dimensional data type, its representation is storage-, and its comparison is time-consuming. Among th…

    Submitted 11 October, 2021; v1 submitted 31 May, 2021; originally announced May 2021.

  8. arXiv:1911.12674  [pdf, other]

    cs.DB cs.CL cs.LG

    RETRO: Relation Retrofitting For In-Database Machine Learning on Textual Data

    Authors: Michael Günther, Maik Thiele, Wolfgang Lehner

    Abstract: There are massive amounts of textual data residing in databases, valuable for many machine learning (ML) tasks. Since ML techniques depend on numerical input representations, word embeddings are increasingly utilized to convert symbolic representations such as text into meaningful numbers. However, a naive one-to-one mapping of each word in a database to a word embedding vector is not sufficient a…

    Submitted 22 January, 2020; v1 submitted 28 November, 2019; originally announced November 2019.

    Comments: 14 pages

    MSC Class: H.2.8; H.3.3; I.2.7 ACM Class: H.2.8; H.3.3; I.2.7

  9. arXiv:1806.03901  [pdf, other]

    cs.DC

    A Cost-based Storage Format Selector for Materialization in Big Data Frameworks

    Authors: Rana Faisal Munir, Alberto Abelló, Oscar Romero, Maik Thiele, Wolfgang Lehner

    Abstract: Modern big data frameworks (such as Hadoop and Spark) allow multiple users to do large-scale analysis simultaneously. Typically, users deploy Data-Intensive Workflows (DIWs) for their analytical tasks. These DIWs of different users share many common parts (i.e., 50-80%), which can be materialized to reuse them in future executions. The materialization improves the overall processing time of DIWs an…

    Submitted 11 June, 2018; originally announced June 2018.

  10. arXiv:1205.2465  [pdf, other]

    cs.DB

    Identifying And Weighting Integration Hypotheses On Open Data Platforms

    Authors: Julian Eberius, Katrin Braunschweig, Maik Thiele, Wolfgang Lehner

    Abstract: Open data platforms such as data.gov or opendata.socrata.com provide a huge amount of valuable information. Their free-for-all nature, the lack of publishing standards and the multitude of domains and authors represented on these platforms lead to new integration and standardization problems. At the same time, crowd-based data integration techniques are emerging as a new way of dealing with these p…

    Submitted 11 May, 2012; originally announced May 2012.

    Comments: Presented at the First International Workshop On Open Data, WOD-2012 (https://arxiv.boxedpaper.com/abs/1204.3726)

    Report number: WOD/2012/NANTES/11 ACM Class: J.3; H.2.m