-
Unraveling Spatio-Temporal Foundation Models via the Pipeline Lens: A Comprehensive Review
Authors:
Yuchen Fang,
Hao Miao,
Yuxuan Liang,
Liwei Deng,
Yue Cui,
Ximu Zeng,
Yuyang Xia,
Yan Zhao,
Torben Bach Pedersen,
Christian S. Jensen,
Xiaofang Zhou,
Kai Zheng
Abstract:
Spatio-temporal deep learning models aim to utilize useful patterns in spatio-temporal data to support tasks like prediction. However, previous deep learning models designed for specific tasks typically require separate training for each use case, leading to increased computational and storage costs. To address this issue, spatio-temporal foundation models have emerged, offering a unified framework capable of solving multiple spatio-temporal tasks. These foundation models achieve remarkable success by learning general knowledge with spatio-temporal data or transferring the general capabilities of pre-trained language models. While previous surveys have explored spatio-temporal data and methodologies separately, they have ignored a comprehensive examination of how foundation models are designed, selected, pre-trained, and adapted. As a result, the overall pipeline for spatio-temporal foundation models remains unclear. To bridge this gap, we innovatively provide an up-to-date review of previous spatio-temporal foundation models from the pipeline perspective. The pipeline begins with an introduction to different types of spatio-temporal data, followed by details of data preprocessing and embedding techniques. The pipeline then presents a novel data property taxonomy to divide existing methods according to data sources and dependencies, providing efficient and effective model design and selection for researchers. On this basis, we further illustrate the training objectives of primitive models, as well as the adaptation techniques of transferred models. Overall, our survey provides a clear and structured pipeline to understand the connection between core elements of spatio-temporal foundation models while guiding researchers to get started quickly. Additionally, we introduce emerging opportunities such as multi-objective training in the field of spatio-temporal foundation models.
Submitted 2 June, 2025;
originally announced June 2025.
-
Duluth at SemEval-2025 Task 7: TF-IDF with Optimized Vector Dimensions for Multilingual Fact-Checked Claim Retrieval
Authors:
Shujauddin Syed,
Ted Pedersen
Abstract:
This paper presents the Duluth approach to the SemEval-2025 Task 7 on Multilingual and Crosslingual Fact-Checked Claim Retrieval. We implemented a TF-IDF-based retrieval system with experimentation on vector dimensions and tokenization strategies. Our best-performing configuration used word-level tokenization with a vocabulary size of 15,000 features, achieving an average success@10 score of 0.78 on the development set and 0.69 on the test set across ten languages. Our system showed stronger performance on higher-resource languages but still lagged significantly behind the top-ranked system, which achieved 0.96 average success@10. Our findings suggest that though advanced neural architectures are increasingly dominant in multilingual retrieval tasks, properly optimized traditional methods like TF-IDF remain competitive baselines, especially in limited compute resource scenarios.
Submitted 18 May, 2025;
originally announced May 2025.
-
Extending the SAREF4ENER Ontology with Flexibility Based on FlexOffers
Authors:
Fabio Lilliu,
Amir Laadhar,
Christian Thomsen,
Diego Reforgiato Recupero,
Torben Bach Pedersen
Abstract:
A key element to support the increased amounts of renewable energy in the energy system is flexibility, i.e., the possibility of changing energy loads in time and amount. Many flexibility models have been designed; however, exact models fail to scale for long time horizons or many devices. Because of this, the FlexOffer (FO) model has been designed to provide device-independent approximations of flexibility with good accuracy, and much better scaling for long time horizons and many devices. An important aspect of the real-life implementation of energy flexibility is enabling flexible data exchange with many types of smart energy appliances and market systems, e.g., in smart buildings. For this, ontologies standardizing data formats are required. However, the current industry standard ontology for integrating smart devices for energy purposes, SAREF for Energy Flexibility (SAREF4ENER), only has limited support for flexibility and thus cannot support important use cases. In this paper, we propose an extension of SAREF4ENER that integrates full support for the complete FlexOffer model, including advanced use cases, while maintaining backward compatibility. This novel ontology module can accurately describe flexibility for advanced devices such as electric vehicles, batteries, and heat pumps. It can also capture the inherent uncertainty associated with many flexible load types.
Submitted 18 April, 2025; v1 submitted 4 April, 2025;
originally announced April 2025.
-
CAMEO: Autocorrelation-Preserving Line Simplification for Lossy Time Series Compression
Authors:
Carlos Enrique Muñiz-Cuza,
Matthias Boehm,
Torben Bach Pedersen
Abstract:
Time series data from a variety of sensors and IoT devices need effective compression to reduce storage and I/O bandwidth requirements. While most time series databases and systems rely on lossless compression, lossy techniques offer even greater space-saving with a small loss in precision. However, the unknown impact on downstream analytics applications requires a semi-manual trial-and-error exploration. We initiate work on lossy compression that provides guarantees on complex statistical features (which are strongly correlated with the accuracy of the downstream analytics). Specifically, we propose a new lossy compression method that provides guarantees on the autocorrelation and partial-autocorrelation functions (ACF/PACF) of a time series. Our method leverages line simplification techniques as well as incremental maintenance of aggregates, blocking, and parallelization strategies for effective and efficient compression. The results show that our method improves compression ratios by 2x on average and up to 54x on selected datasets, compared to previous lossy and lossless compression methods. Moreover, we maintain -- and sometimes even improve -- the forecasting accuracy by preserving the autocorrelation properties of the time series. Our framework is extensible to multivariate time series and other statistical features of the time series.
Submitted 24 January, 2025;
originally announced January 2025.
-
Digital Twin-Empowered Voltage Control for Power Systems
Authors:
Jiachen Xu,
Yushuai Li,
Torben Bach Pedersen,
Yuqiang He,
Kim Guldstrand Larsen,
Tianyi Li
Abstract:
Emerging digital twin technology has the potential to revolutionize voltage control in power systems. However, the state-of-the-art digital twin method suffers from low computational and sampling efficiency, which hinders its applications. To address this issue, we propose a Gumbel-Consistency Digital Twin (GC-DT) method that enhances voltage control with improved computational and sampling efficiency. First, the proposed method incorporates a Gumbel-based strategy improvement that leverages the Gumbel-top trick to enhance non-repetitive sampling actions and reduce the reliance on Monte Carlo Tree Search simulations, thereby improving computational efficiency. Second, a consistency loss function aligns predicted hidden states with actual hidden states in the latent space, which increases both prediction accuracy and sampling efficiency. Experiments on IEEE 123-bus, 34-bus, and 13-bus systems demonstrate that the proposed GC-DT outperforms the state-of-the-art DT method in both computational and sampling efficiency.
Submitted 9 December, 2024;
originally announced December 2024.
-
Data-Driven Prescriptive Analytics Applications: A Comprehensive Survey
Authors:
Martin Moesmann,
Torben Bach Pedersen
Abstract:
Prescriptive Analytics (PSA), an emerging business analytics field suggesting concrete options for solving business problems, has seen an increasing amount of interest after more than a decade of multidisciplinary research. This paper is a comprehensive survey of existing applications within PSA in terms of their use cases, methodologies, and possible future research directions. To ensure a manageable scope, we focus on PSA applications that develop data-driven, automatic workflows, i.e., Data-Driven PSA (DPSA). Following a systematic methodology, we identify and include 104 papers in our survey. As our key contributions, we derive a number of novel taxonomies of the field and use them to analyse the field's temporal development. In terms of use cases, we derive 10 application domains for DPSA, from Healthcare to Manufacturing, and subsumed problem types within each. In terms of individual method usage, we derive 5 method types and map them to a comprehensive taxonomy of method usage within DPSA applications, covering mathematical optimization, data mining and machine learning, probabilistic modelling, domain expertise, as well as simulations. As for combined method usage, we provide a statistical overview of how different method usage combinations are distributed and derive 2 generic workflow patterns along with subsumed workflow patterns, combining methods by either sequential or simultaneous relationships. Finally, we derive 5 possible research directions based on frequently recurring issues among surveyed papers, suggesting new frontiers in terms of methods, tools, and use cases.
Submitted 22 May, 2025; v1 submitted 21 November, 2024;
originally announced December 2024.
-
Modular assurance of an Autonomous Ferry using Contract-Based Design and Simulation-based Verification Principles
Authors:
Jon Arne Glomsrud,
Stephanie Kemna,
Chanjei Vasanthan,
Luman Zhao,
Dag McGeorge,
Tom Arne Pedersen,
Tobias Rye Torben,
Børge Rokseth,
Dong Trong Nguyen
Abstract:
With the introduction of autonomous technology into our society, e.g., autonomous shipping, it is important to assess and assure the safety of autonomous systems in a real-world context. Simulation-based testing is a common approach to attempt to verify performance of autonomous systems, but assurance also requires formal evidence. This paper introduces the Assurance of Digital Assets (ADA) framework, a structured method for the assurance of digital assets, i.e., novel, complex, or intelligent systems enabled by digital technologies, using contract-based design. Results are shown for an autonomous ferry assurance case, focusing on collision avoidance during the ferry's transit. Further, we discuss the role of simulation-based testing in verifying compliance with contract specifications, to build the necessary evidence for an assurance case.
Submitted 30 October, 2024; v1 submitted 6 August, 2024;
originally announced August 2024.
-
Decentralized Multi-Party Multi-Network AI for Global Deployment of 6G Wireless Systems
Authors:
Merim Dzaferagic,
Marco Ruffini,
Nina Slamnik-Krijestorac,
Joao F. Santos,
Johann Marquez-Barja,
Christos Tranoris,
Spyros Denazis,
Thomas Kyriakakis,
Panagiotis Karafotis,
Luiz DaSilva,
Shashi Raj Pandey,
Junya Shiraishi,
Petar Popovski,
Soren Kejser Jensen,
Christian Thomsen,
Torben Bach Pedersen,
Holger Claussen,
Jinfeng Du,
Gil Zussman,
Tingjun Chen,
Yiran Chen,
Seshu Tirupathi,
Ivan Seskar,
Daniel Kilper
Abstract:
Multiple visions of 6G networks elicit Artificial Intelligence (AI) as a central, native element. When 6G systems are deployed at a large scale, end-to-end AI-based solutions will necessarily have to encompass both the radio and the fiber-optical domain. This paper introduces the Decentralized Multi-Party, Multi-Network AI (DMMAI) framework for integrating AI into 6G networks deployed at scale. DMMAI harmonizes AI-driven controls across diverse network platforms and thus facilitates networks that autonomously configure, monitor, and repair themselves. This is particularly crucial at the network edge, where advanced applications meet heightened functionality and security demands. The radio/optical integration is vital due to the current compartmentalization of AI research within these domains, which lacks a comprehensive understanding of their interaction. Our approach explores multi-network orchestration and AI control integration, filling a critical gap in standardized frameworks for AI-driven coordination in 6G networks. The DMMAI framework is a step towards a global standard for AI in 6G, aiming to establish reference use cases, data and model management methods, and benchmarking platforms for future AI/ML solutions.
Submitted 15 April, 2024;
originally announced July 2024.
-
An Explainable and Conformal AI Model to Detect Temporomandibular Joint Involvement in Children Suffering from Juvenile Idiopathic Arthritis
Authors:
Lena Todnem Bach Christensen,
Dikte Straadt,
Stratos Vassis,
Christian Marius Lillelund,
Peter Bangsgaard Stoustrup,
Ruben Pauwels,
Thomas Klit Pedersen,
Christian Fischer Pedersen
Abstract:
Juvenile idiopathic arthritis (JIA) is the most common rheumatic disease during childhood and adolescence. The temporomandibular joints (TMJ) are among the most frequently affected joints in patients with JIA, and mandibular growth is especially vulnerable to arthritic changes of the TMJ in children. A clinical examination is the most cost-effective method to diagnose TMJ involvement, but clinicians find it difficult to interpret and inaccurate when used only on clinical examinations. This study implemented an explainable artificial intelligence (AI) model that can help clinicians assess TMJ involvement. The classification model was trained using Random Forest on 6154 clinical examinations of 1035 pediatric patients (67% female, 33% male) and evaluated on its ability to correctly classify TMJ involvement or not on a separate test set. Most notably, the results show that the model can classify patients within two years of their first examination as having TMJ involvement with a precision of 0.86 and a sensitivity of 0.7. The results show promise for an AI model in the assessment of TMJ involvement in children and as a decision support tool.
Submitted 2 May, 2024;
originally announced May 2024.
-
Is It Really You Who Forgot the Password? When Account Recovery Meets Risk-Based Authentication
Authors:
Andre Büttner,
Andreas Thue Pedersen,
Stephan Wiefling,
Nils Gruschka,
Luigi Lo Iacono
Abstract:
Risk-based authentication (RBA) is used in online services to protect user accounts from unauthorized takeover. RBA commonly uses contextual features that indicate a suspicious login attempt when the characteristic attributes of the login context deviate from known and thus expected values. Previous research on RBA and anomaly detection in authentication has mainly focused on the login process. However, recent attacks have revealed vulnerabilities in other parts of the authentication process, specifically in the account recovery function. Consequently, to ensure comprehensive authentication security, the use of anomaly detection in the context of account recovery must also be investigated.
This paper presents the first study to investigate risk-based account recovery (RBAR) in the wild. We analyzed the adoption of RBAR by five prominent online services (that are known to use RBA). Our findings confirm the use of RBAR at Google, LinkedIn, and Amazon. Furthermore, we provide insights into the different RBAR mechanisms of these services and explore the impact of multi-factor authentication on them. Based on our findings, we create a first maturity model for RBAR challenges. The goal of our work is to help developers, administrators, and policy-makers gain an initial understanding of RBAR and to encourage further research in this direction.
Submitted 18 March, 2024;
originally announced March 2024.
-
A Comparative Study of Rapidly-exploring Random Tree Algorithms Applied to Ship Trajectory Planning and Behavior Generation
Authors:
Trym Tengesdal,
Tom Arne Pedersen,
Tor Arne Johansen
Abstract:
Rapidly Exploring Random Tree (RRT) algorithms, notably used for nonholonomic vehicle navigation in complex environments, are often not thoroughly evaluated for their specific challenges. This paper presents a first such comparison study of the variants Potential-Quick RRT* (PQ-RRT*), Informed RRT* (IRRT*), RRT*, and RRT, in maritime single-query nonholonomic motion planning. Additionally, the practicalities of using these algorithms in maritime environments are discussed and outlined. We also contend that these algorithms are beneficial not only for trajectory planning in Collision Avoidance Systems (CAS) but also for CAS verification when used as vessel behavior generators.
Optimal RRT variants tend to produce more distance-optimal paths but require more computational time due to complex tree wiring and nearest neighbor searches. Our findings, supported by Welch's t-test at a significance level of Alpha = 0.05, indicate that PQ-RRT* slightly outperforms IRRT* and RRT* in achieving shorter trajectory length but at the expense of higher tuning complexity and longer run-times. Based on the results, we argue that these RRT algorithms are better suited for smaller-scale problems or environments with low obstacle congestion ratio. This is attributed to the curse of dimensionality, and trade-off with available memory and computational resources.
Submitted 17 April, 2024; v1 submitted 2 March, 2024;
originally announced March 2024.
-
SemEval-2017 Task 4: Sentiment Analysis in Twitter using BERT
Authors:
Rupak Kumar Das,
Ted Pedersen
Abstract:
This paper uses the BERT model, which is a transformer-based architecture, to solve task 4A, English Language, Sentiment Analysis in Twitter of SemEval2017. BERT is a very powerful large language model for classification tasks when the amount of training data is small. For this experiment, we have used the BERT(BASE) model, which has 12 hidden layers. This model provides better accuracy, precision, recall, and F1 score than the Naive Bayes baseline model. It performs better in binary classification subtasks than in the multi-class classification subtasks. We also considered all kinds of ethical issues during this experiment, as Twitter data contains personal and sensitive information. The dataset and code used in our experiment can be found in this GitHub repository.
Submitted 19 June, 2024; v1 submitted 15 January, 2024;
originally announced January 2024.
-
Domain Adaptation for Time series Transformers using One-step fine-tuning
Authors:
Subina Khanal,
Seshu Tirupathi,
Giulio Zizzo,
Ambrish Rawat,
Torben Bach Pedersen
Abstract:
The recent breakthrough of Transformers in deep learning has drawn significant attention of the time series community due to their ability to capture long-range dependencies. However, like other deep learning models, Transformers face limitations in time series prediction, including insufficient temporal understanding, generalization challenges, and data shift issues for the domains with limited data. Additionally, addressing the issue of catastrophic forgetting, where models forget previously learned information when exposed to new data, is another critical aspect that requires attention in enhancing the robustness of Transformers for time series tasks. To address these limitations, in this paper, we pre-train the time series Transformer model on a source domain with sufficient data and fine-tune it on the target domain with limited data. We introduce the One-step fine-tuning approach, adding some percentage of source domain data to the target domains, providing the model with diverse time series instances. We then fine-tune the pre-trained model using a gradual unfreezing technique. This helps enhance the model's performance in time series prediction for domains with limited data. Extensive experimental results on two real-world datasets show that our approach improves over the state-of-the-art baselines by 4.35% and 11.54% for indoor temperature and wind power prediction, respectively.
Submitted 12 January, 2024;
originally announced January 2024.
-
Creating and Querying Data Cubes in Python using pyCube
Authors:
Sigmundur Vang,
Christian Thomsen,
Torben Bach Pedersen
Abstract:
Data cubes are used for analyzing large data sets usually contained in data warehouses. The most popular data cube tools use graphical user interfaces (GUI) to do the data analysis. Traditionally this was fine since data analysts were not expected to be technical people. However, in the subsequent decades the data landscape changed dramatically, requiring companies to employ large teams of highly technical data scientists in order to manage and use the ever-increasing amount of data. These data scientists generally use tools like Python, interactive notebooks, pandas, etc., while modern data cube tools are still GUI based. This paper proposes a Python-based data cube tool called pyCube. pyCube is able to semi-automatically create data cubes for data stored in an RDBMS and manages the data cube metadata. pyCube's programmatic interface enables data scientists to query data cubes by specifying the expected metadata of the result. pyCube is experimentally evaluated on Star Schema Benchmark (SSB). The results show that pyCube vastly outperforms different implementations of SSB queries in pandas in both runtime and memory while being easier to read and write.
Submitted 28 January, 2024; v1 submitted 13 December, 2023;
originally announced December 2023.
-
Teacher-Student Reinforcement Learning for Mapless Navigation using a Planetary Space Rover
Authors:
Anton Bjørndahl Mortensen,
Emil Tribler Pedersen,
Laia Vives Benedicto,
Lionel Burg,
Mads Rossen Madsen,
Simon Bøgh
Abstract:
We address the challenge of enhancing navigation autonomy for planetary space rovers using reinforcement learning (RL). The ambition of future space missions necessitates advanced autonomous navigation capabilities for rovers to meet mission objectives. RL's potential in robotic autonomy is evident, but its reliance on simulations poses a challenge. Transferring policies to real-world scenarios often encounters the "reality gap", disrupting the transition from virtual to physical environments. The reality gap is exacerbated in the context of mapless navigation on Mars and Moon-like terrains, where unpredictable terrains and environmental factors play a significant role. Effective navigation requires a method attuned to these complexities and real-world data noise. We introduce a novel two-stage RL approach using offline noisy data. Our approach employs a teacher-student policy learning paradigm, inspired by the "learning by cheating" method. The teacher policy is trained in simulation. Subsequently, the student policy is trained on noisy data, aiming to mimic the teacher's behaviors while being more robust to real-world uncertainties. Our policies are transferred to a custom-designed rover for real-world testing. Comparative analyses between the teacher and student policies reveal that our approach offers improved behavioral performance, heightened noise resilience, and more effective sim-to-real transfer.
Submitted 22 September, 2023;
originally announced September 2023.
-
Efficient Generalized Temporal Pattern Mining in Big Time Series Using Mutual Information
Authors:
Van Long Ho,
Nguyen Ho,
Torben Bach Pedersen,
Panagiotis Papapetrou
Abstract:
Big time series are increasingly available from an ever wider range of IoT-enabled sensors deployed in various environments. Significant insights can be gained by mining temporal patterns from these time series. Temporal pattern mining (TPM) extends traditional pattern mining by adding event time intervals into extracted patterns, making them more expressive at the expense of increased time and space complexities. Besides frequent temporal patterns (FTPs), which occur frequently in the entire dataset, another useful type of temporal patterns are so-called rare temporal patterns (RTPs), which appear rarely but with high confidence. Mining rare temporal patterns yields additional challenges. For FTP mining, the temporal information and complex relations between events already create an exponential search space. For RTP mining, the support measure is set very low, leading to a further combinatorial explosion and potentially producing too many uninteresting patterns. Thus, there is a need for a generalized approach which can mine both frequent and rare temporal patterns. This paper presents our Generalized Temporal Pattern Mining from Time Series (GTPMfTS) approach with the following specific contributions: (1) The end-to-end GTPMfTS process taking time series as input and producing frequent/rare temporal patterns as output. (2) The efficient Generalized Temporal Pattern Mining (GTPM) algorithm mines frequent and rare temporal patterns using efficient data structures for fast retrieval of events and patterns during the mining process, and employs effective pruning techniques for significantly faster mining. (3) An approximate version of GTPM that uses mutual information, a measure of data correlation, to prune unpromising time series from the search space.
Submitted 19 June, 2023;
originally announced June 2023.
-
Goal-Oriented Scheduling in Sensor Networks with Application Timing Awareness
Authors:
Josefine Holm,
Federico Chiariotti,
Anders E. Kalør,
Beatriz Soret,
Torben Bach Pedersen,
Petar Popovski
Abstract:
Taking inspiration from linguistics, the communication theory community has recently shown significant interest in pragmatic, or goal-oriented, communication. In this paper, we tackle the problem of pragmatic communication with multiple clients with different, and potentially conflicting, objectives. We capture the goal-oriented aspect through the metric of Value of Information (VoI), which considers the estimation of the remote process as well as the timing constraints. However, the most common definition of VoI is simply the Mean Square Error (MSE) of the whole system state, regardless of the relevance for a specific client. Our work aims to overcome this limitation by including different summary statistics, i.e., value functions of the state, for separate clients, and a diversified query process on the client side, expressed through the fact that different applications may request different functions of the process state at different times. A query-aware Deep Reinforcement Learning (DRL) solution based on statically defined VoI can outperform naive approaches by 15-20%.
Submitted 6 June, 2023;
originally announced June 2023.
-
A Comparative Study on Unsupervised Anomaly Detection for Time Series: Experiments and Analysis
Authors:
Yan Zhao,
Liwei Deng,
Xuanhao Chen,
Chenjuan Guo,
Bin Yang,
Tung Kieu,
Feiteng Huang,
Torben Bach Pedersen,
Kai Zheng,
Christian S. Jensen
Abstract:
The continued digitization of societal processes translates into a proliferation of time series data that cover applications such as fraud detection, intrusion detection, and energy management, where anomaly detection is often essential to enable reliability and safety. Many recent studies target anomaly detection for time series data. Indeed, the area of time series anomaly detection is characterized by diverse data, methods, and evaluation strategies, and comparisons in existing studies consider only part of this diversity, which makes it difficult to select the best method for a particular problem setting. To address this shortcoming, we introduce taxonomies for data, methods, and evaluation strategies, provide a comprehensive overview of unsupervised time series anomaly detection using the taxonomies, and systematically evaluate and compare state-of-the-art traditional as well as deep learning techniques. In the empirical study using nine publicly available datasets, we apply the most commonly used performance evaluation metrics to typical methods under a fair implementation standard. Based on the structuring offered by the taxonomies, we report on empirical studies and provide guidelines, in the form of comparative tables, for choosing the methods most suitable for particular application settings. Finally, we propose research directions for this dynamic field.
Submitted 10 September, 2022;
originally announced September 2022.
-
Mining Seasonal Temporal Patterns in Time Series
Authors:
Van Long Ho,
Nguyen Ho,
Torben Bach Pedersen
Abstract:
Very large time series are increasingly available from an ever wider range of IoT-enabled sensors, from which significant insights can be obtained through mining temporal patterns from them. A useful type of pattern found in many real-world applications exhibits periodic occurrences, and is thus called a seasonal temporal pattern (STP). Compared to regular patterns, mining seasonal temporal patterns is more challenging since traditional measures such as support and confidence do not capture the seasonality characteristics. Further, the anti-monotonicity property does not hold for STPs, thus resulting in an exponential search space. This paper presents our Frequent Seasonal Temporal Pattern Mining from Time Series (FreqSTPfTS) solution providing: (1) The first solution for seasonal temporal pattern mining (STPM) from time series that can mine STPs at different data granularities. (2) The STPM algorithm that uses efficient data structures and two pruning techniques to reduce the search space and speed up the mining process. (3) An approximate version of STPM that uses mutual information, a measure of data correlation, to prune unpromising time series from the search space. (4) An extensive experimental evaluation showing that STPM outperforms the baseline in runtime and memory consumption, and can scale to big datasets. The approximate STPM is up to an order of magnitude faster and less memory consuming than the baseline, while maintaining high accuracy.
Submitted 9 January, 2023; v1 submitted 28 June, 2022;
originally announced June 2022.
-
A Unified Approach for Multi-Scale Synchronous Correlation Search in Big Time Series -- Full Version
Authors:
Nguyen Ho,
Van Long Ho,
Torben Bach Pedersen,
Mai Vu,
Christophe A. N. Biscio
Abstract:
The wide deployment of IoT sensors has enabled the collection of very big time series across different domains, from which advanced analytics can be performed to find unknown relationships, most importantly the correlations between them. However, current approaches for correlation search on time series are limited to only a single temporal scale and simple types of relations, and cannot handle noise effectively. This paper presents the integrated SYnchronous COrrelation Search (iSYCOS) framework to find multi-scale correlations in big time series. Specifically, iSYCOS integrates top-down and bottom-up approaches into a single auto-configured framework capable of efficiently extracting complex window-based correlations from big time series using mutual information (MI). Moreover, iSYCOS includes a novel MI-based theory to identify noise in the data, which is used to perform pruning to improve iSYCOS performance. In addition, we design a distributed version of iSYCOS that can scale out in a Spark cluster to handle big time series. Our extensive experimental evaluation on synthetic and real-world datasets shows that iSYCOS can auto-configure on a given dataset to find complex multi-scale correlations. The pruning and optimisations can improve iSYCOS performance by up to an order of magnitude, and the distributed iSYCOS can scale out linearly on a computing cluster.
Submitted 19 April, 2022;
originally announced April 2022.
-
Methods for Efficient Unfolding of Colored Petri Nets
Authors:
Alexander Bilgram,
Peter G. Jensen,
Thomas Pedersen,
Jiri Srba,
Peter H. Taankvist
Abstract:
Colored Petri nets offer a compact and user-friendly representation of the traditional P/T nets. Colored nets with finite color ranges can be unfolded into the underlying P/T nets, though at the expense of an exponential explosion in size. We present two novel techniques based on static analysis in order to reduce the size of unfolded colored nets. The first method identifies colors that behave equivalently and groups them into equivalence classes, potentially reducing the number of used colors. The second method overapproximates the sets of colors that can appear in places and excludes colors that can never be present in a given place. Both methods are complementary and the combined approach allows us to significantly reduce the size of multiple colored Petri nets from the Model Checking Contest benchmark. We compare the performance of our unfolder with state-of-the-art techniques implemented in the tools MCC, Spike and ITS-Tools, and while our approach is competitive w.r.t. unfolding time, it also outperforms the existing approaches both in the size of unfolded nets and in the number of answered model checking queries from the 2021 Model Checking Contest.
Submitted 11 October, 2023; v1 submitted 12 April, 2022;
originally announced April 2022.
-
Finding Representative Sampling Subsets in Sensor Graphs using Time Series Similarities
Authors:
Roshni Chakraborty,
Josefine Holm,
Torben Bach Pedersen,
Petar Popovski
Abstract:
With the increasing use of IoT-enabled sensors, it is important to have effective methods for querying the sensors. For example, in a dense network of battery-driven temperature sensors, it is often possible to query (sample) just a subset of the sensors at any given time, since the values of the non-sampled sensors can be estimated from the sampled values. If we can divide the set of sensors into disjoint so-called representative sampling subsets that each represent the other sensors sufficiently well, we can alternate the sampling between the sampling subsets and thus, increase battery life significantly. In this paper, we formulate the problem of finding representative sampling subsets as a graph problem on a so-called sensor graph with the sensors as nodes. Our proposed solution, SubGraphSample, consists of two phases. In Phase-I, we create edges in the sensor graph based on the similarities between the time series of sensor values, analyzing six different techniques based on proven time series similarity metrics. In Phase-II, we propose two new techniques and extend four existing ones to find the maximal number of representative sampling subsets. Finally, we propose AutoSubGraphSample which auto-selects the best technique for Phase-I and Phase-II for a given dataset. Our extensive experimental evaluation shows that our approach can yield significant battery life improvements within realistic error bounds.
Submitted 18 February, 2022; v1 submitted 17 February, 2022;
originally announced February 2022.
-
Optimal Scheduling of Flexible Power-to-X Technologies in the Day-ahead Electricity Market
Authors:
Neeraj Dhanraj Bokde,
Tim T Pedersen,
Gorm Bruun Andresen
Abstract:
The ambitious CO2 emission targets of the Paris Agreement are achievable only with renewable energy, CO2-free power generation, new policies, and planning. The main motivation of this paper is that future green fuels from power-to-X assets should be produced from power with the lowest possible emissions while still keeping the cost of electricity low. To this end, we propose a power-to-X scheduling framework that is capable of co-optimizing CO2 emission intensity and electricity prices in day-ahead electricity market scheduling. Three realistic models for local production units are developed for flexible dispatch and the impact on electricity market scheduling is examined. Furthermore, the possible benefits of using the trade-off between CO2 emission intensity and electricity prices in scheduling are discussed. We find that there is a non-linear trade-off between CO2 emission intensity and cost, favoring a weighted optimization between the two objectives.
Submitted 19 October, 2021;
originally announced October 2021.
-
Evolutionary Clustering of Streaming Trajectories
Authors:
Tianyi Li,
Lu Chen,
Christian S. Jensen,
Torben Bach Pedersen,
Jilin Hu
Abstract:
The widespread deployment of smartphones and location-enabled, networked in-vehicle devices renders it increasingly feasible to collect streaming trajectory data of moving objects. The continuous clustering of such data can enable a variety of real-time services, such as identifying representative paths or common moving trends among objects in real-time. However, little attention has so far been given to the quality of clusters -- for example, it is beneficial to smooth short-term fluctuations in clusters to achieve robustness to exceptional data.
We propose the notion of evolutionary clustering of streaming trajectories, abbreviated ECO, that enhances streaming-trajectory clustering quality by means of temporal smoothing that prevents abrupt changes in clusters across successive timestamps. Employing the notions of snapshot and historical trajectory costs, we formalize ECO and then formulate ECO as an optimization problem and prove that ECO can be performed approximately in linear time, thus eliminating the iterative processes employed in previous studies. Further, we propose a minimal-group structure and a seed point shifting strategy to facilitate temporal smoothing. Finally, we present all algorithms underlying ECO along with a set of optimization techniques. Extensive experiments with two real-life datasets offer insight into ECO and show that it outperforms state-of-the-art solutions in terms of both clustering quality and efficiency.
Submitted 23 September, 2021;
originally announced September 2021.
-
SemEval-2021 Task 11: NLPContributionGraph -- Structuring Scholarly NLP Contributions for a Research Knowledge Graph
Authors:
Jennifer D'Souza,
Sören Auer,
Ted Pedersen
Abstract:
There is currently a gap between the natural language expression of scholarly publications and their structured semantic content modeling to enable intelligent content search. With the volume of research growing exponentially every year, a search feature operating over semantically structured content is compelling. The SemEval-2021 Shared Task NLPContributionGraph (a.k.a. 'the NCG task') tasks participants with developing automated systems that structure contributions from NLP scholarly articles in the English language. Being the first of its kind in the SemEval series, the task released structured data from NLP scholarly articles at three levels of information granularity, i.e., at sentence-level, phrase-level, and phrases organized as triples toward Knowledge Graph (KG) building. The sentence-level annotations comprised the few sentences about the article's contribution. The phrase-level annotations were scientific term and predicate phrases from the contribution sentences. Finally, the triples constituted the research overview KG. For the Shared Task, participating systems were then expected to automatically classify contribution sentences, extract scientific terms and relations from the sentences, and organize them as KG triples.
Overall, the task drew a strong participation demographic of seven teams and 27 participants. The best end-to-end task system classified contribution sentences at 57.27% F1, phrases at 46.41% F1, and triples at 22.28% F1. While the absolute performance in generating triples remains low, the conclusion of this article highlights the difficulty of producing such data and, as a consequence, of modeling it.
Submitted 15 October, 2021; v1 submitted 10 June, 2021;
originally announced June 2021.
-
Query Age of Information: Freshness in Pull-Based Communication
Authors:
Federico Chiariotti,
Josefine Holm,
Anders E. Kalør,
Beatriz Soret,
Søren K. Jensen,
Torben B. Pedersen,
Petar Popovski
Abstract:
Age of Information (AoI) has become an important concept in communications, as it allows system designers to measure the freshness of the information available to remote monitoring or control processes. However, its definition tacitly assumes that new information is used at any time, which is not always the case: the instants at which information is collected and used are dependent on a certain query process. We propose a model that accounts for the discrete time nature of many monitoring processes, considering a pull-based communication model in which the freshness of information is only important when the receiver generates a query: if the monitoring process is not using the value, the age of the last update is irrelevant. We then define the Age of Information at Query (QAoI), a more general metric that fits the pull-based scenario, and show how its optimization can lead to very different choices from traditional push-based AoI optimization when using a Packet Erasure Channel (PEC) and with limited link availability. Our results show that QAoI-aware optimization can significantly reduce the average and worst-case perceived age for both periodic and stochastic queries.
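The difference between time-averaged AoI and QAoI can be illustrated with a small sketch. This is not the paper's model: the slotted timeline, the convention that age resets to 0 in a slot where an update is delivered, and the names age_trace, average_age, and query_age are all assumptions made for illustration.

```python
def age_trace(num_slots, delivery_slots):
    """Return the age in each slot, assuming age resets to 0 on delivery
    and otherwise grows by 1 per slot (illustrative convention)."""
    deliveries = set(delivery_slots)
    ages = []
    age = 0
    for t in range(num_slots):
        age = 0 if t in deliveries else age + 1
        ages.append(age)
    return ages


def average_age(ages):
    """Push-style metric: age averaged over every slot."""
    return sum(ages) / len(ages)


def query_age(ages, query_slots):
    """Pull-style metric: age averaged only over slots in which a query arrives."""
    return sum(ages[t] for t in query_slots) / len(query_slots)


# Hypothetical schedule: updates delivered in slots 2 and 7, queries in slots 3 and 8.
ages = age_trace(10, [2, 7])
overall = average_age(ages)            # averages over all 10 slots
at_query = query_age(ages, [3, 8])     # only the slots that are actually queried
```

In this toy schedule the updates land just before the queries, so the age seen at query time is lower than the age averaged over all slots, which is the kind of gap a query-aware scheduler would try to exploit.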
Submitted 12 January, 2022; v1 submitted 14 May, 2021;
originally announced May 2021.
-
Explainability in CNN Models By Means of Z-Scores
Authors:
David Malmgren-Hansen,
Allan Aasbjerg Nielsen,
Leif Toudal Pedersen
Abstract:
This paper explores the similarities of output layers in Neural Networks (NNs) with logistic regression to explain the importance of inputs by Z-scores. The network analyzed, a network for fusion of Synthetic Aperture Radar (SAR) and Microwave Radiometry (MWR) data, is applied to prediction of arctic sea ice. With the analysis, the importance of MWR relative to SAR is found to favor MWR components. Further, as the model represents image features at different scales, the relative importance of these is analyzed as well. The suggested methodology offers a simple and easy framework for analyzing output layer components and can reduce the number of components for further analysis with, e.g., common NN visualization methods.
Submitted 11 February, 2021;
originally announced February 2021.
-
The Forgotten Document-Oriented Database Management Systems: An Overview and Benchmark of Native XML DODBMSes in Comparison with JSON DODBMSes
Authors:
Ciprian-Octavian Truică,
Elena-Simona Apostol,
Jérôme Darmont,
Torben Bach Pedersen
Abstract:
In the current context of Big Data, a multitude of new NoSQL solutions for storing, managing, and extracting information and patterns from semi-structured data have been proposed and implemented. These solutions were developed to relieve the issue of rigid data structures present in relational databases, by introducing semi-structured and flexible schema design. As current data generated by different sources and devices, especially from IoT sensors and actuators, use either XML or JSON format, depending on the application, database technologies that store and query semi-structured data in XML format are needed. Thus, Native XML Databases, which were initially designed to manipulate XML data using standardized querying languages, i.e., XQuery and XPath, were rebranded as NoSQL Document-Oriented Databases Systems. Currently, the majority of these solutions have been replaced with the more modern JSON based Database Management Systems. However, we believe that XML-based solutions can still deliver performance in executing complex queries on heterogeneous collections. Unfortunately nowadays, research lacks a clear comparison of the scalability and performance for database technologies that store and query documents in XML versus the more modern JSON format. Moreover, to the best of our knowledge, there are no Big Data-compliant benchmarks for such database technologies. In this paper, we present a comparison for selected Document-Oriented Database Systems that either use the XML format to encode documents, i.e., BaseX, eXist-db, and Sedna, or the JSON format, i.e., MongoDB, CouchDB, and Couchbase. To underline the performance differences we also propose a benchmark that uses a heterogeneous complex schema on a large DBLP corpus.
Submitted 3 February, 2021;
originally announced February 2021.
-
A Multidisciplinary Definition of Privacy Labels: The Story of Princess Privacy and the Seven Helpers
Authors:
Johanna Johansen,
Tore Pedersen,
Simone Fischer-Hübner,
Christian Johansen,
Gerardo Schneider,
Arnold Roosendaal,
Harald Zwingelberg,
Anders Jakob Sivesind,
Josef Noll
Abstract:
Privacy is currently in distress and in need of rescue, much like princesses in the all-familiar fairytales. We employ storytelling and metaphors from fairytales to make our arguments reader-friendly and streamlined, showing how the complex concept of Privacy Labeling (the 'knight in shining armor') can be a solution to the current state of Privacy (the 'princess in distress'). We give a precise definition of Privacy Labeling (PL), painting a panoptic portrait from seven different perspectives (the 'seven helpers'): Business, Legal, Regulatory, Usability and Human Factors, Educative, Technological, and Multidisciplinary. We describe a common vision, proposing several important 'traits of character' of PL as well as identifying 'undeveloped potentialities', i.e., open problems on which the community can focus. More specifically, this position paper identifies the stakeholders of the PL and their needs with regard to privacy, describing what PL should be and look like in order to address these needs. Throughout the paper, we highlight goals, characteristics, open problems, and starting points for creating what we consider to be the ideal PL. In the end, we present three approaches to establish and manage PL, through self-evaluations, certifications, or community endeavors. Based on these, we sketch a roadmap for future developments.
Submitted 9 May, 2021; v1 submitted 3 December, 2020;
originally announced December 2020.
-
Freshness on Demand: Optimizing Age of Information for the Query Process
Authors:
Josefine Holm,
Anders E. Kalør,
Federico Chiariotti,
Beatriz Soret,
Søren K. Jensen,
Torben B. Pedersen,
Petar Popovski
Abstract:
Age of Information (AoI) has become an important concept in communications, as it allows system designers to measure the freshness of the information available to remote monitoring or control processes. However, its definition tacitly assumes that new information is used at any time, which is not always the case: the instants at which information is collected and used are dependent on a certain query process. We propose a model that accounts for the discrete time nature of many monitoring processes, considering a pull-based communication model in which the freshness of information is only important when the receiver generates a query. We then define the Age of Information at Query (QAoI), a more general metric that fits the pull-based scenario, and show how its optimization can lead to very different choices from traditional push-based AoI optimization when using a Packet Erasure Channel (PEC).
Submitted 2 November, 2020;
originally announced November 2020.
-
On Efficient and Scalable Time-Continuous Spatial Crowdsourcing -- Full Version
Authors:
Ting Wang,
Xike Xie,
Xin Cao,
Torben Bach Pedersen,
Yang Wang,
Mingjun Xiao
Abstract:
The proliferation of advanced mobile terminals has opened up a new crowdsourcing avenue, spatial crowdsourcing, to utilize the crowd potential to perform real-world tasks. In this work, we study a new type of spatial crowdsourcing, called time-continuous spatial crowdsourcing (TCSC in short). It supports broad applications for long-term continuous spatial data acquisition, ranging from environmental monitoring to traffic surveillance in citizen science and crowdsourcing projects. However, due to limited budgets and limited availability of workers in practice, the data collected is often incomplete, incurring a data deficiency problem. To tackle that, in this work, we first propose an entropy-based quality metric, which captures the joint effects of incompleteness in data acquisition and imprecision in data interpolation. Based on that, we investigate quality-aware task assignment methods for both single- and multi-task scenarios. We show the NP-hardness of the single-task case, and design polynomial-time algorithms with guaranteed approximation ratios. We study novel indexing and pruning techniques for further enhancing the performance in practice. Then, we extend the solution to multi-task scenarios and devise a parallel framework for speeding up the process of optimization. We conduct extensive experiments on both real and synthetic datasets to show the effectiveness of our proposals.
Submitted 29 October, 2020;
originally announced October 2020.
-
Efficient Temporal Pattern Mining in Big Time Series Using Mutual Information -- Full Version
Authors:
Van Long Ho,
Nguyen Ho,
Torben Bach Pedersen
Abstract:
Very large time series are increasingly available from an ever wider range of IoT-enabled sensors deployed in different environments. Significant insights can be gained by mining temporal patterns from these time series. Unlike traditional pattern mining, temporal pattern mining (TPM) adds event time intervals into extracted patterns, making them more expressive at the expense of increased mining time complexity. Existing TPM methods either cannot scale to large datasets, or work only on pre-processed temporal events rather than on time series. This paper presents our Frequent Temporal Pattern Mining from Time Series (FTPMfTS) approach which provides: (1) The end-to-end FTPMfTS process taking time series as input and producing frequent temporal patterns as output. (2) The efficient Hierarchical Temporal Pattern Graph Mining (HTPGM) algorithm that uses efficient data structures for fast support and confidence computation, and employs effective pruning techniques for significantly faster mining. (3) An approximate version of HTPGM that uses mutual information, a measure of data correlation known from information theory, to prune unpromising time series from the search space. (4) An extensive experimental evaluation showing that HTPGM outperforms the baselines in runtime and memory consumption, and can scale to big datasets. The approximate HTPGM is up to two orders of magnitude faster and less memory consuming than the baselines, while retaining high accuracy.
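The idea of using mutual information to discard weakly related time series can be sketched as follows. This is a minimal illustration, not the HTPGM algorithm: it assumes the series have already been discretized into symbols of equal length, and the function names and the threshold value are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two symbol sequences
    of equal length, estimated from their joint and marginal counts."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def keep_correlated_pairs(series, threshold):
    """Keep only the pairs of series whose mutual information reaches the
    threshold; other pairs would be skipped by a later mining step.
    The thresholding rule is an assumption made for this sketch."""
    return [
        (a, b)
        for a, b in combinations(sorted(series), 2)
        if mutual_information(series[a], series[b]) >= threshold
    ]


# Hypothetical on/off sensor readings encoded as 0 and 1.
series = {
    "a": [0, 0, 1, 1],
    "b": [0, 0, 1, 1],  # identical to "a": shares one full bit
    "c": [0, 1, 0, 1],  # empirically independent of "a" and "b"
}
promising = keep_correlated_pairs(series, threshold=0.5)
```

Here only the pair ("a", "b") survives the filter, so a subsequent pattern search would not spend time on combinations involving "c".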
Submitted 17 November, 2021; v1 submitted 7 October, 2020;
originally announced October 2020.
-
Modeling all alternative solutions for highly renewable energy systems
Authors:
Tim T. Pedersen,
Marta Victoria,
Morten G. Rasmussen,
Gorm B. Andresen
Abstract:
As the world is transitioning towards highly renewable energy systems, advanced tools are needed to analyze such complex networks. Energy system design is, however, challenged by real-world objective functions consisting of a blurry mix of technical and socioeconomic agendas, with limitations that cannot always be clearly stated. As a result, it is highly likely that solutions which are techno-economically suboptimal will be preferable. Here, we present a method capable of determining the continuum containing all techno-economically near-optimal solutions, moving the field of energy system modeling from discrete solutions to a new era where continuous solution ranges are available. The presented method is applied to study a range of technical and socioeconomic metrics on a model of the European electricity system. The near-optimal region is found to be relatively flat allowing for solutions that are slightly more expensive than the optimum but better in terms of equality, land use, and implementation time.
Submitted 29 June, 2021; v1 submitted 2 October, 2020;
originally announced October 2020.
-
Duluth at SemEval-2020 Task 7: Using Surprise as a Key to Unlock Humorous Headlines
Authors:
Shuning Jin,
Yue Yin,
XianE Tang,
Ted Pedersen
Abstract:
We use pretrained transformer-based language models in SemEval-2020 Task 7: Assessing the Funniness of Edited News Headlines. Inspired by the incongruity theory of humor, we use a contrastive approach to capture the surprise in the edited headlines. In the official evaluation, our system gets 0.531 RMSE in Subtask 1, 11th among 49 submissions. In Subtask 2, our system gets 0.632 accuracy, 9th among 32 submissions.
Submitted 6 September, 2020;
originally announced September 2020.
-
Duluth at SemEval-2019 Task 6: Lexical Approaches to Identify and Categorize Offensive Tweets
Authors:
Ted Pedersen
Abstract:
This paper describes the Duluth systems that participated in SemEval-2019 Task 6, Identifying and Categorizing Offensive Language in Social Media (OffensEval). For the most part these systems took traditional Machine Learning approaches that built classifiers from lexical features found in manually labeled training data. However, our most successful system for classifying a tweet as offensive (or not) was a rule-based blacklist approach, and we also experimented with combining the training data from two different but related SemEval tasks. Our best systems in each of the three OffensEval tasks placed in the middle of the comparative evaluation, ranking 57th of 103 in task A, 39th of 75 in task B, and 44th of 65 in task C.
Submitted 25 July, 2020;
originally announced July 2020.
-
Duluth at SemEval-2020 Task 12: Offensive Tweet Identification in English with Logistic Regression
Authors:
Ted Pedersen
Abstract:
This paper describes the Duluth systems that participated in SemEval-2020 Task 12, Multilingual Offensive Language Identification in Social Media (OffensEval-2020). We participated in the three English language tasks. Our systems provide a simple Machine Learning baseline using logistic regression. We trained our models on the distantly supervised training data made available by the task organizers and used no other resources. As might be expected, we did not rank highly in the comparative evaluation: 79th of 85 in Task A, 34th of 43 in Task B, and 24th of 39 in Task C. We carried out a qualitative analysis of our results and found that the class labels in the gold standard data are somewhat noisy. We hypothesize that the extremely high accuracy (> 90%) of the top ranked systems may reflect methods that learn the training data very well but may not generalize to the task of identifying offensive language in English. This analysis includes examples of tweets that despite being mildly redacted are still offensive.
Submitted 25 July, 2020;
originally announced July 2020.
-
High-Level ETL for Semantic Data Warehouses -- Full Version
Authors:
Rudra Pratap Deb Nath,
Oscar Romero,
Torben Bach Pedersen,
Katja Hose
Abstract:
The popularity of the Semantic Web (SW) encourages organizations to organize and publish semantic data using the RDF model. This growth poses new requirements to Business Intelligence (BI) technologies to enable On-Line Analytical Processing (OLAP)-like analysis over semantic data. The incorporation of semantic data into a Data Warehouse (DW) is not supported by the traditional Extract-Transform-Load (ETL) tools because they do not consider semantic issues in the integration process. In this paper, we propose a layer-based integration process and a set of high-level RDF-based ETL constructs required to define, map, extract, process, transform, integrate, update, and load (multidimensional) semantic data. Different to other ETL tools, we automate the ETL data flows by creating metadata at the schema level. Therefore, it relieves ETL developers from the burden of manual mapping at the ETL operation level. We create a prototype, named Semantic ETL Construct (SETLCONSTRUCT), based on the innovative ETL constructs proposed here. To evaluate SETLCONSTRUCT, we create a multidimensional semantic DW by integrating a Danish Business dataset and an EU Subsidy dataset using it and compare it with the previous programmable framework SETLPROG in terms of productivity, development time and performance. The evaluation shows that 1) SETLCONSTRUCT uses 92% fewer Number of Typed Characters (NOTC) than SETLPROG, and SETLAUTO (the extension of SETLCONSTRUCT for generating ETL execution flow automatically) further reduces the Number of Used Concepts (NOUC) by another 25%; 2) using SETLCONSTRUCT, the development time is almost cut in half compared to SETLPROG, and is cut by another 27% using SETLAUTO; 3) SETLCONSTRUCT is scalable and has similar performance compared to SETLPROG.
Submitted 12 June, 2020;
originally announced June 2020.
-
Studying the Transfer of Biases from Programmers to Programs
Authors:
Johanna Johansen,
Tore Pedersen,
Christian Johansen
Abstract:
It is generally agreed that one origin of machine bias lies in characteristics of the dataset on which the algorithms are trained, i.e., the data does not warrant a generalized inference. We, however, hypothesize that a different 'mechanism', hitherto not articulated in the literature, may also be responsible for machine bias, namely that biases may originate from (i) the programmers' cultural background, such as education or line of work, or (ii) the contextual programming environment, such as software requirements or developer tools. Combining an experimental and comparative design, we studied the effects of cultural metaphors and contextual metaphors, and tested whether each of these would 'transfer' from the programmer to the program, thus constituting a machine bias. The results show (i) that cultural metaphors influence the programmer's choices and (ii) that 'induced' contextual metaphors can be used to moderate or exacerbate the effects of the cultural metaphors. This supports our hypothesis that biases in automated systems do not always originate from within the machine's training data. Instead, machines may also 'replicate' and 'reproduce' biases from the programmers' cultural background by the transfer of cultural metaphors into the programming process. Implications for academia and professional practice range from the micro programming level to the macro level of national regulations or education, and span all societal domains where software-based systems are operating, such as the popular AI-based automated decision support systems.
Submitted 13 December, 2020; v1 submitted 17 May, 2020;
originally announced May 2020.
-
Multidimensional Enrichment of Spatial RDF Data for SOLAP -- Full Version
Authors:
Nurefsan Gür,
Torben Bach Pedersen,
Katja Hose,
Mikael Midtgaard
Abstract:
Large volumes of spatial data and multidimensional data are being published on the Semantic Web, which has led to new opportunities for advanced analysis, such as Spatial Online Analytical Processing (SOLAP). The RDF Data Cube (QB) and QB4OLAP vocabularies have been widely used for annotating and publishing statistical and multidimensional RDF data. Although such statistical data sets might have spatial information, such as coordinates, the lack of spatial semantics and spatial multidimensional concepts in QB4OLAP and QB prevents users from employing SOLAP queries over spatial data using SPARQL. The QB4SOLAP vocabulary, on the other hand, fully supports annotating spatial and multidimensional data on the Semantic Web and enables users to query endpoints with SOLAP operators in SPARQL. To bridge the gap between QB/QB4OLAP and QB4SOLAP, we propose an RDF2SOLAP enrichment model that automatically annotates spatial multidimensional concepts with QB4SOLAP and in doing so enables SOLAP on existing QB and QB4OLAP data on the Semantic Web. Furthermore, we present and evaluate a wide range of enrichment algorithms and apply them on a non-trivial real-world use case involving governmental open data with complex geometry types.
Submitted 16 February, 2020;
originally announced February 2020.
-
Multi-Source Spatial Entity Linkage
Authors:
Suela Isaj,
Torben Bach Pedersen,
Esteban Zimányi
Abstract:
Besides the traditional cartographic data sources, spatial information can also be derived from location-based sources. However, even though different location-based sources refer to the same physical world, each one has only partial coverage of the spatial entities, describes them with different attributes, and sometimes provides contradicting information. Hence, we introduce the spatial entity linkage problem, which finds which pairs of spatial entities belong to the same physical spatial entity. Our proposed solution (QuadSky) starts with a time-efficient spatial blocking technique (QuadFlex), compares pairwise the spatial entities in the same block, ranks the pairs using Pareto optimality with the SkyRank algorithm, and finally, classifies the pairs with our novel SkyEx-* family of algorithms that yield 0.85 precision and 0.85 recall for a manually labeled dataset of 1,500 pairs and 0.87 precision and 0.6 recall for a semi-manually labeled dataset of 777,452 pairs. Moreover, we provide a theoretical guarantee and formalize the SkyEx-FES algorithm that explores only 27% of the skylines without any loss in F-measure. Furthermore, our fully unsupervised algorithm SkyEx-D approximates the optimal result with an F-measure loss of just 0.01. Finally, QuadSky provides the best trade-off between precision and recall, and the best F-measure compared to the existing baselines and clustering techniques, and approximates the results of supervised learning solutions.
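As an illustration of the Pareto-optimality idea mentioned in this abstract, the following is a minimal sketch of extracting a skyline (the non-dominated set) from similarity vectors. It is not the SkyRank or SkyEx algorithm; the function names `dominates` and `skyline`, the tuple layout, and the "higher is better" convention are assumptions made for the example.

```python
def dominates(a, b):
    """Return True if vector a Pareto-dominates b (assumed: higher is better).

    a dominates b when it is at least as good in every dimension
    and strictly better in at least one.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(points):
    """Return the points that no other point dominates (quadratic-time sketch)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (name similarity, address similarity) scores for candidate pairs.
pairs = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9)]
front = skyline(pairs)
```

Here `(0.4, 0.4)` is dropped because `(0.5, 0.5)` is better in both dimensions, while the other three pairs are mutually incomparable and remain on the skyline.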
Submitted 29 April, 2020; v1 submitted 20 November, 2019;
originally announced November 2019.
-
AMIC: An Adaptive Information Theoretic Method to Identify Multi-Scale Temporal Correlations in Big Time Series Data -- Accepted Version
Authors:
Nguyen Ho,
Huy Vo,
Mai Vu,
Torben Bach Pedersen
Abstract:
Recent developments in computing, sensing and crowd-sourced data have resulted in an explosion in the availability of quantitative information. The possibilities of analyzing this so-called Big Data to inform research and the decision-making process are virtually endless. In general, analyses have to be done across multiple data sets in order to bring out the most value of Big Data. A first important step is to identify temporal correlations between data sets. Given the characteristics of Big Data in terms of volume and velocity, techniques that identify correlations not only need to be fast and scalable, but also need to help users in ordering the correlations across temporal scales so that they can focus on important relationships. In this paper, we present AMIC (Adaptive Mutual Information-based Correlation), a method based on mutual information to identify correlations at multiple temporal scales in large time series. Discovered correlations are suggested to users in an order based on the strength of the relationships. Our method supports an adaptive streaming technique that minimizes duplicated computation and is implemented on top of Apache Spark for scalability. We also provide a comprehensive evaluation on the effectiveness and the scalability of AMIC using both synthetic and real-world data sets.
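For readers unfamiliar with mutual information as a correlation measure, the sketch below estimates it between two equal-length series using simple equal-width binning. It is a generic histogram estimator, not the AMIC method; the function name `mutual_information`, the `bins` parameter, and the binning strategy are assumptions chosen for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys, bins=4):
    """Estimate mutual information in bits between two equal-length series.

    Values are discretized into `bins` equal-width buckets (an assumed,
    simplistic choice), then MI is computed from the empirical joint and
    marginal frequencies.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("series must be non-empty and of equal length")

    def discretize(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # constant series: one bucket
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = discretize(xs), discretize(ys)
    n = len(xs)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    return sum(
        (c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


series = [0, 1, 2, 3, 0, 1, 2, 3]
mi_identical = mutual_information(series, series)   # series fully determines itself
mi_constant = mutual_information(series, [5] * 8)   # constant carries no information
```

An identical pair of series yields the entropy of the binned series, while pairing with a constant series yields zero, which matches the intuition that stronger dependence gives a higher score.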
Submitted 7 July, 2019; v1 submitted 24 June, 2019;
originally announced June 2019.
-
Scalable Model-Based Management of Correlated Dimensional Time Series in ModelarDB+
Authors:
Søren Kejser Jensen,
Torben Bach Pedersen,
Christian Thomsen
Abstract:
To monitor critical infrastructure, high quality sensors sampled at a high frequency are increasingly used. However, as they produce huge amounts of data, only simple aggregates are stored. This removes outliers and fluctuations that could indicate problems. As a remedy, we present a model-based approach for managing time series with dimensions that exploits correlation in and among time series. Specifically, we propose compressing groups of correlated time series using an extensible set of model types within a user-defined error bound (possibly zero). We name this new category of model-based compression methods for time series Multi-Model Group Compression (MMGC). We present the first MMGC method GOLEMM and extend model types to compress time series groups. We propose primitives for users to effectively define groups for differently sized data sets, and based on these, an automated grouping method using only the time series dimensions. We propose algorithms for executing simple and multi-dimensional aggregate queries on models. Last, we implement our methods in the Time Series Management System (TSMS) ModelarDB (ModelarDB+). Our evaluation shows that compared to widely used formats, ModelarDB+ provides up to 13.7 times faster ingestion due to high compression, 113 times better compression due to the adaptivity of GOLEMM, 630 times faster aggregates by using models, and close to linear scalability. It is also extensible and supports online query processing.
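To make the notion of compressing within a user-defined error bound concrete, here is a minimal sketch that greedily fits piecewise-constant segments to a single series so that every value lies within an absolute bound of its segment's constant. It is not GOLEMM and does not handle groups of series or multiple model types; the function names `compress_constant` and `decompress`, the absolute-error bound, and the midrange representative are assumptions for the example.

```python
def compress_constant(values, error_bound):
    """Greedily split a series into (start, end, constant) segments.

    A segment is extended while its value range stays within
    2 * error_bound, so the midrange represents every value in the
    segment with absolute error at most error_bound.
    """
    if not values:
        return []
    segments = []
    start = 0
    lo = hi = values[0]
    for i in range(1, len(values)):
        new_lo, new_hi = min(lo, values[i]), max(hi, values[i])
        if new_hi - new_lo > 2 * error_bound:
            # The current point does not fit: close the segment and start a new one.
            segments.append((start, i - 1, (lo + hi) / 2))
            start, lo, hi = i, values[i], values[i]
        else:
            lo, hi = new_lo, new_hi
    segments.append((start, len(values) - 1, (lo + hi) / 2))
    return segments


def decompress(segments):
    """Rebuild an approximate series from (start, end, constant) segments."""
    return [c for start, end, c in segments for _ in range(end - start + 1)]


readings = [1.0, 1.1, 0.9, 5.0, 5.2]
segs = compress_constant(readings, error_bound=0.2)
approx = decompress(segs)
```

With an error bound of zero the same routine only merges runs of identical values, which corresponds to the lossless case the abstract mentions as "possibly zero".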
Submitted 29 June, 2021; v1 submitted 25 March, 2019;
originally announced March 2019.
-
Seed-Driven Geo-Social Data Extraction -- Full Version
Authors:
Suela Isaj,
Torben Bach Pedersen
Abstract:
Geo-social data has been an attractive source for a variety of problems such as mining mobility patterns, link prediction, location recommendation, and influence maximization. However, new geo-social data is increasingly unavailable and suffers from several limitations. In this paper, we aim to remedy the problem of effective data extraction from geo-social data sources. We first identify and categorize the limitations of extracting geo-social data. In order to overcome the limitations, we propose a novel seed-driven approach that uses the points of one source as the seed to feed as queries for the others. We additionally handle differences between, and dynamics within, the sources by proposing three variants for optimizing search radius. Furthermore, we provide an optimization based on recursive clustering to minimize the number of requests and an adaptive procedure to learn the specific data distribution of each source. Our comprehensive experiments with six popular sources show that our seed-driven approach yields 14.3 times more data overall, while our request-optimized algorithm retrieves up to 95% of the data with less than 16% of the requests. Thus, our proposed seed-driven approach sets new standards for effective and efficient extraction of geo-social data.
Submitted 23 June, 2019; v1 submitted 20 January, 2019;
originally announced January 2019.
-
UMDSub at SemEval-2018 Task 2: Multilingual Emoji Prediction Multi-channel Convolutional Neural Network on Subword Embedding
Authors:
Zhenduo Wang,
Ted Pedersen
Abstract:
This paper describes the UMDSub system that participated in Task 2 of SemEval-2018. We developed a system that predicts an emoji given the raw text in an English tweet. The system is a Multi-channel Convolutional Neural Network based on subword embeddings for the representation of tweets. This model improves on character or word based methods by about 2%. Our system placed 21st of 48 participating systems in the official evaluation.
Submitted 25 May, 2018;
originally announced May 2018.
-
UMDuluth-CS8761 at SemEval-2018 Task 9: Hypernym Discovery using Hearst Patterns, Co-occurrence frequencies and Word Embeddings
Authors:
Arshia Z. Hassan,
Manikya S. Vallabhajosyula,
Ted Pedersen
Abstract:
Hypernym Discovery is the task of identifying potential hypernyms for a given term. A hypernym is a more generalized word that is super-ordinate to more specific words. This paper explores several approaches that rely on co-occurrence frequencies of word pairs, Hearst Patterns based on regular expressions, and word embeddings created from the UMBC corpus. Our system Babbage participated in Subtask 1A for English and placed 6th of 19 systems when identifying concept hypernyms, and 12th of 18 systems for entity hypernyms.
Submitted 25 May, 2018;
originally announced May 2018.
-
Duluth UROP at SemEval-2018 Task 2: Multilingual Emoji Prediction with Ensemble Learning and Oversampling
Authors:
Shuning Jin,
Ted Pedersen
Abstract:
This paper describes the Duluth UROP systems that participated in SemEval-2018 Task 2, Multilingual Emoji Prediction. We relied on a variety of ensembles made up of classifiers using Naive Bayes, Logistic Regression, and Random Forests. We used unigram and bigram features and tried to offset the skewness of the data through the use of oversampling. Our task evaluation results place us 19th of 48 systems in the English evaluation, and 5th of 21 in the Spanish. After the evaluation we realized that some simple changes to preprocessing could significantly improve our results. After making these changes we attained results that would have placed us sixth in the English evaluation, and second in the Spanish.
Submitted 25 May, 2018;
originally announced May 2018.
-
Adaptive User-Oriented Direct Load-Control of Residential Flexible Devices
Authors:
Davide Frazzetto,
Bijay Neupane,
Torben Bach Pedersen,
Thomas Dyhre Nielsen
Abstract:
Demand Response (DR) schemes are effective tools to maintain a dynamic balance in energy markets with higher integration of fluctuating renewable energy sources. DR schemes can be used to harness residential devices' flexibility and to utilize it to achieve social and financial objectives. However, existing DR schemes suffer from low user participation as they fail at taking into account the users' requirements. First, DR schemes are highly demanding for the users, as users need to provide direct information, e.g. via surveys, on their energy consumption preferences. Second, the user utility models based on these surveys are hard-coded and do not adapt over time. Third, the existing scheduling techniques require the users to input their energy requirements on a daily basis. As an alternative, this paper proposes a DR scheme for user-oriented direct load-control of residential appliances operations. Instead of relying on user surveys to evaluate the user utility, we propose an online data-driven approach for estimating user utility functions, purely based on available load consumption data, that adaptively models the users' preference over time. Our scheme is based on a day-ahead scheduling technique that transparently prescribes the users with optimal device operation schedules that take into account both financial benefits and user-perceived quality of service. To model day-ahead user energy demand and flexibility, we propose a probabilistic approach for generating flexibility models under uncertainty. Results on both real-world and simulated datasets show that our DR scheme can provide significant financial benefits while preserving the user-perceived quality of service.
Submitted 9 May, 2018;
originally announced May 2018.
-
Day-ahead Trading of Aggregated Energy Flexibility - Full Version
Authors:
Emmanouil Valsomatzis,
Torben Bach Pedersen,
Alberto Abello
Abstract:
Flexibility of small loads, in particular from Electric Vehicles (EVs), has recently attracted a lot of interest due to their possibility of participating in the energy market and the new commercial potentials. Different from existing work, the aggregation techniques proposed in this paper produce flexible aggregated loads from EVs taking into account technical market requirements. They can be further transformed into the so-called flexible orders and be traded in the day-ahead market by a Balance Responsible Party (BRP). As a result, the BRP can achieve at least 20% cost reduction on average in energy purchase compared to traditional charging based on 2017 real electricity prices from the Danish electricity market.
Submitted 24 May, 2018; v1 submitted 6 May, 2018;
originally announced May 2018.
-
Utilizing Device-level Demand Forecasting for Flexibility Markets - Full Version
Authors:
Bijay Neupane,
Torben Bach Pedersen,
Bo Thiesson
Abstract:
The uncertainty in the power supply due to fluctuating Renewable Energy Sources (RES) has severe (financial and other) implications for energy market players. In this paper, we present a device-level Demand Response (DR) scheme that captures the atomic (all available) flexibilities in energy demand and provides the largest possible solution space to generate demand/supply schedules that minimize market imbalances. We evaluate the effectiveness and feasibility of widely used forecasting models for device-level flexibility analysis. In a typical device-level flexibility forecast, a market player is more concerned with the utility that the demand flexibility brings to the market, rather than the intrinsic forecast accuracy. In this regard, we provide comprehensive predictive modeling and scheduling of demand flexibility from household appliances to demonstrate the (financial and otherwise) viability of introducing flexibility-based DR in the Danish/Nordic market. Further, we investigate the correlation between the potential utility and the accuracy of the demand forecast model. Furthermore, we perform a number of experiments to determine the data granularity that provides the best financial reward to market players for adopting the proposed DR scheme. A cost-benefit analysis of forecast results shows that even with somewhat low forecast accuracy, market players can achieve regulation cost savings of 54% of the theoretically optimal.
Submitted 2 May, 2018;
originally announced May 2018.
-
Time Series Management Systems: A Survey
Authors:
Søren Kejser Jensen,
Torben Bach Pedersen,
Christian Thomsen
Abstract:
The collection of time series data increases as more monitoring and automation are being deployed. These deployments range in scale from an Internet of Things (IoT) device located in a household to enormous distributed Cyber-Physical Systems (CPSs) producing large volumes of data at high velocity. To store and analyze these vast amounts of data, specialized Time Series Management Systems (TSMSs) have been developed to overcome the limitations of general purpose Database Management Systems (DBMSs) for time series management. In this paper, we present a thorough analysis and classification of TSMSs developed through academic or industrial research and documented through publications. Our classification is organized into categories based on the architectures observed during our analysis. In addition, we provide an overview of each system with a focus on the motivational use case that drove the development of the system, the functionality for storage and querying of time series a system implements, the components the system is composed of, and the capabilities of each system with regard to Stream Processing and Approximate Query Processing (AQP). Last, we provide a summary of research directions proposed by other researchers in the field and present our vision for a next generation TSMS.
Submitted 3 October, 2017;
originally announced October 2017.