-
JobHop: A Large-Scale Dataset of Career Trajectories
Authors:
Iman Johary,
Raphael Romero,
Alexandru C. Mara,
Tijl De Bie
Abstract:
Understanding labor market dynamics is essential for policymakers, employers, and job seekers. However, comprehensive datasets that capture real-world career trajectories are scarce. In this paper, we introduce JobHop, a large-scale public dataset derived from anonymized resumes provided by VDAB, the public employment service in Flanders, Belgium. Using Large Language Models (LLMs), we process unstructured resume data to extract structured career information, which is then mapped to standardized ESCO occupation codes using a multi-label classification model. The result is a rich dataset of over 2.3 million work experiences, extracted from and grouped into more than 391,000 user resumes, offering valuable insights into real-world occupational transitions. This dataset enables diverse applications, such as analyzing labor market mobility, job stability, and the effects of career breaks on occupational transitions. It also supports career path prediction and other data-driven decision-making processes. To illustrate its potential, we explore key dataset characteristics, including job distributions, career breaks, and job transitions, demonstrating its value for advancing labor market research.
Submitted 12 May, 2025;
originally announced May 2025.
-
Large Language Models Reflect the Ideology of their Creators
Authors:
Maarten Buyl,
Alexander Rogiers,
Sander Noels,
Guillaume Bied,
Iris Dominguez-Catena,
Edith Heiter,
Iman Johary,
Alexandru-Cristian Mara,
Raphaël Romero,
Jefrey Lijffijt,
Tijl De Bie
Abstract:
Large language models (LLMs) are trained on vast amounts of data to generate natural language, enabling them to perform tasks like text summarization and question answering. These models have become popular in artificial intelligence (AI) assistants like ChatGPT and already play an influential role in how humans access information. However, the behavior of LLMs varies depending on their design, training, and use.
In this paper, we prompt a diverse panel of popular LLMs to describe a large number of prominent personalities with political relevance, in all six official languages of the United Nations. By identifying and analyzing moral assessments reflected in their responses, we find normative differences between LLMs from different geopolitical regions, as well as between the responses of the same LLM when prompted in different languages. Among models from the United States alone, we find that popularly hypothesized disparities in political views are reflected in significant normative differences related to progressive values. Among Chinese models, we characterize a division between internationally- and domestically-focused models.
Our results show that the ideological stance of an LLM appears to reflect the worldview of its creators. This poses the risk of political instrumentalization and raises concerns around technological and regulatory efforts with the stated aim of making LLMs ideologically 'unbiased'.
Submitted 30 January, 2025; v1 submitted 24 October, 2024;
originally announced October 2024.
-
A Systematic Evaluation of Node Embedding Robustness
Authors:
Alexandru Mara,
Jefrey Lijffijt,
Stephan Günnemann,
Tijl De Bie
Abstract:
Node embedding methods map network nodes to low-dimensional vectors that can be subsequently used in a variety of downstream prediction tasks. The popularity of these methods has grown significantly in recent years, yet their robustness to perturbations of the input data is still poorly understood. In this paper, we assess the empirical robustness of node embedding models to random and adversarial poisoning attacks. Our systematic evaluation covers representative embedding methods based on Skip-Gram, matrix factorization, and deep neural networks. We compare edge addition, deletion and rewiring attacks computed using network properties as well as node labels. We also investigate the performance of popular node classification attack baselines that assume full knowledge of the node labels. We report qualitative results via embedding visualization and quantitative results in terms of downstream node classification and network reconstruction performance. We find that node classification results are impacted more than network reconstruction ones, that degree-based and label-based attacks are on average the most damaging, and that label heterophily can strongly influence attack performance.
Submitted 30 November, 2022; v1 submitted 16 September, 2022;
originally announced September 2022.
-
CSNE: Conditional Signed Network Embedding
Authors:
Alexandru Mara,
Yoosof Mashayekhi,
Jefrey Lijffijt,
Tijl De Bie
Abstract:
Signed networks are mathematical structures that encode positive and negative relations between entities such as friend/foe or trust/distrust. Recently, several papers studied the construction of useful low-dimensional representations (embeddings) of these networks for the prediction of missing relations or signs. Existing embedding methods for sign prediction generally enforce different notions of status or balance theories in their optimization function. These theories, however, are often inaccurate or incomplete, which negatively impacts method performance.
In this context, we introduce conditional signed network embedding (CSNE). Our probabilistic approach models structural information about the signs in the network separately from fine-grained detail. Structural information is represented in the form of a prior, while the embedding itself is used for capturing fine-grained information. These components are then integrated in a rigorous manner. CSNE's accuracy depends on the existence of sufficiently powerful structural priors for modelling signed networks, currently unavailable in the literature. Thus, as a second main contribution, which we find to be highly valuable in its own right, we also introduce a novel approach to construct priors based on the Maximum Entropy (MaxEnt) principle. These priors can model the polarity of nodes (the degree to which their links are positive) as well as signed triangle counts (a measure of the degree to which structural balance holds in a network).
Experiments on a variety of real-world networks confirm that CSNE outperforms the state-of-the-art on the task of sign prediction. Moreover, the MaxEnt priors on their own, while less accurate than full CSNE, achieve accuracies competitive with the state-of-the-art at very limited computational cost, thus providing an excellent runtime-accuracy trade-off in resource-constrained situations.
Submitted 25 May, 2020; v1 submitted 19 May, 2020;
originally announced May 2020.
-
Benchmarking Network Embedding Models for Link Prediction: Are We Making Progress?
Authors:
Alexandru Mara,
Jefrey Lijffijt,
Tijl De Bie
Abstract:
Network embedding methods map a network's nodes to vectors in an embedding space, in such a way that these representations are useful for estimating some notion of similarity or proximity between pairs of nodes in the network. The quality of these node representations is then showcased through results of downstream prediction tasks. Commonly used benchmark tasks such as link prediction, however, present complex evaluation pipelines and an abundance of design choices. This, together with a lack of standardized evaluation setups, can obscure the real progress in the field. In this paper, we aim to shed light on the state of the art of network embedding methods for link prediction and show, using a consistent evaluation pipeline, that only limited progress has been made in recent years. The newly conducted benchmark that we present here, including 17 embedding methods, also shows that many approaches are outperformed even by simple heuristics. Finally, we argue that standardized evaluation tools can repair this situation and boost future progress in this field.
Submitted 3 September, 2020; v1 submitted 25 February, 2020;
originally announced February 2020.
-
Block-Approximated Exponential Random Graphs
Authors:
Florian Adriaens,
Alexandru Mara,
Jefrey Lijffijt,
Tijl De Bie
Abstract:
An important challenge in the field of exponential random graphs (ERGs) is the fitting of non-trivial ERGs on large graphs. By utilizing fast matrix block-approximation techniques, we propose an approximate framework for such non-trivial ERGs that results in dyadic independence (i.e., edge independent) distributions, while being able to meaningfully model both local information of the graph (e.g., degrees) as well as global information (e.g., clustering coefficient, assortativity, etc.) if desired. This allows one to efficiently generate random networks with properties similar to those of an observed network, and the models can be used for several downstream tasks such as link prediction. Our methods are scalable to sparse graphs consisting of millions of nodes. Empirical evaluation demonstrates competitiveness in terms of both speed and accuracy with state-of-the-art methods -- which are typically based on embedding the graph into some low-dimensional space -- for link prediction, showcasing the potential of a more direct and interpretable probabilistic model for this task.
Submitted 26 August, 2020; v1 submitted 14 February, 2020;
originally announced February 2020.
-
Semi-supervised Learning in Network-Structured Data via Total Variation Minimization
Authors:
Alexander Jung,
Alfred O. Hero III,
Alexandru Mara,
Saeed Jahromi,
Ayelet Heimowitz,
Yonina C. Eldar
Abstract:
We propose and analyze a method for semi-supervised learning from partially-labeled network-structured data. Our approach is based on a graph signal recovery interpretation under a clustering hypothesis that labels of data points belonging to the same well-connected subset (cluster) have similar values. This lends itself naturally to learning the labels by total variation (TV) minimization, which we solve by applying a recently proposed primal-dual method for non-smooth convex optimization. The resulting algorithm allows for a highly scalable implementation using message passing over the underlying empirical graph, which renders the algorithm suitable for big data applications. By applying tools of compressed sensing, we derive a sufficient condition on the underlying network structure such that TV minimization recovers clusters in the empirical graph of the data. In particular, we show that the proposed primal-dual method amounts to maximizing network flows over the empirical graph of the dataset. Moreover, the learning accuracy of the proposed algorithm is linked to the set of network flows between data points having known labels. The effectiveness and scalability of our approach are verified by numerical experiments.
Submitted 2 November, 2019; v1 submitted 28 January, 2019;
originally announced January 2019.
-
EvalNE: A Framework for Evaluating Network Embeddings on Link Prediction
Authors:
Alexandru Mara,
Jefrey Lijffijt,
Tijl De Bie
Abstract:
In this paper we present EvalNE, a Python toolbox for evaluating network embedding methods on link prediction tasks. Link prediction is one of the most popular choices for evaluating the quality of network embeddings. However, the complexity of this task requires a carefully designed evaluation pipeline in order to provide consistent, reproducible and comparable results. EvalNE simplifies this process by providing automation and abstraction of tasks such as hyper-parameter tuning and model validation, edge sampling and negative edge sampling, computation of edge embeddings from node embeddings, and evaluation metrics. The toolbox allows for the evaluation of any off-the-shelf embedding method without the need to write extra code. Moreover, it can also be used for evaluating any other link prediction method, and integrates several link prediction heuristics as baselines.
Submitted 22 January, 2019;
originally announced January 2019.
-
Recovery Conditions and Sampling Strategies for Network Lasso
Authors:
Alexandru Mara,
Alexander Jung
Abstract:
The network Lasso is a recently proposed convex optimization method for machine learning from massive network-structured datasets, i.e., big data over networks. It is a variant of the well-known least absolute shrinkage and selection operator (Lasso), which underlies many methods in learning and signal processing involving sparse models. Highly scalable implementations of the network Lasso can be obtained by state-of-the-art proximal methods, e.g., the alternating direction method of multipliers (ADMM). By generalizing the concept of the compatibility condition put forward by van de Geer and Buehlmann as a powerful tool for the analysis of plain Lasso, we derive a sufficient condition, i.e., the network compatibility condition (NCC), on the underlying network topology such that network Lasso accurately learns a clustered underlying graph signal. This network compatibility condition relates the location of the sampled nodes to the clustering structure of the network. In particular, the NCC informs the choice of which nodes to sample, or in machine learning terms, which data points provide the most information if labeled.
Submitted 3 September, 2017;
originally announced September 2017.
-
Semi-Supervised Learning via Sparse Label Propagation
Authors:
Alexander Jung,
Alfred O. Hero III,
Alexandru Mara,
Saeed Jahromi
Abstract:
This work proposes a novel method for semi-supervised learning from partially labeled massive network-structured datasets, i.e., big data over networks. We model the underlying hypothesis, which relates data points to labels, as a graph signal, defined over some graph (network) structure intrinsic to the dataset. Following the key principle of supervised learning, i.e., similar inputs yield similar outputs, we require the graph signals induced by labels to have small total variation. Accordingly, we formulate the problem of learning the labels of data points as a non-smooth convex optimization problem which amounts to balancing between the empirical loss, i.e., the discrepancy with some partially available label information, and the smoothness quantified by the total variation of the learned graph signal. We solve this optimization problem by appealing to a recently proposed preconditioned variant of the popular primal-dual method by Pock and Chambolle, which results in a sparse label propagation algorithm. This learning algorithm allows for a highly scalable implementation as message passing over the underlying data graph. By applying concepts of compressed sensing to the learning problem, we are also able to provide a transparent sufficient condition on the underlying network structure such that accurate learning of the labels is possible. We also present an implementation of the message passing formulation that is highly scalable in big data frameworks.
Submitted 15 May, 2017; v1 submitted 5 December, 2016;
originally announced December 2016.
-
Scalable Semi-Supervised Learning over Networks using Nonsmooth Convex Optimization
Authors:
Alexander Jung,
Alfred O. Hero III,
Alexandru Mara,
Sabeur Aridhi
Abstract:
We propose a scalable method for semi-supervised (transductive) learning from massive network-structured datasets. Our approach to semi-supervised learning is based on representing the underlying hypothesis as a graph signal with small total variation. Requiring a small total variation of the graph signal representing the underlying hypothesis corresponds to the central smoothness assumption that forms the basis for semi-supervised learning, i.e., input points forming clusters have similar output values or labels. We formulate the learning problem as a nonsmooth convex optimization problem which we solve by appealing to Nesterov's optimal first-order method for nonsmooth optimization. We also provide a message passing formulation of the learning method which allows for a highly scalable implementation in big data frameworks.
Submitted 2 November, 2016;
originally announced November 2016.
-
Elaboration of a new tool for weather data sequences generation
Authors:
Laetitia Adelard,
Thierry Alex Mara,
Harry Boyer,
Jean Claude Gatina
Abstract:
This paper presents RUNEOLE, a new software tool used to provide weather data for building physics. RUNEOLE combines three modules for the description, the modelling and the generation of weather data. The first module is dedicated to the description of each climatic variable included in the database. Graphic representation is possible (with histograms, for example). The second module provides mathematical tools to compare statistical distributions, determine daily characteristic evolutions, find typical days, and compute the correlations between the different climatic variables. The third module produces artificial weather data files adapted to different simulation codes. This tool can then be used in HVAC system evaluation, or in the study of thermal comfort. The studied buildings can then be tested under different thermal, aeraulic, and radiative loads, leading to a better understanding of their behaviour, for example in humid climates.
Submitted 21 December, 2012;
originally announced December 2012.
-
Black box modelling of HVAC system: improving the performances of neural networks
Authors:
Eric Fock,
Thierry Alex Mara,
Alfred Jean Philippe Lauret,
Harry Boyer
Abstract:
This paper deals with neural network modelling of HVAC systems. In order to improve neural network performance, a method based on sensitivity analysis is applied. The same technique is also used to compute the relevance of each input. To avoid prediction errors in dry coil conditions, a metamodel for each capacity is derived from the neural networks. The regression coefficients of the polynomial forms are identified through the use of spectral analysis. These methods based on sensitivity and spectral analysis lead to an optimized neural network model, with regard to both its architecture and its predictions.
Submitted 21 December, 2012;
originally announced December 2012.
-
Use of BESTEST procedure to improve a building thermal simulation program
Authors:
Ted Soubdhan,
Thierry Alex Mara,
Harry Boyer,
Anis Younès
Abstract:
Validation of building energy simulation programs is of major interest to both users and modellers. To achieve such a task, it is essential to apply a methodology based on a priori testing and empirical validation. A priori testing consists in verifying that the models embedded in a program and their implementation are correct. This should be achieved before carrying out experiments. The aim of this report is to present results from the application of the BESTEST procedure to our code. We emphasise how it helps find bugs in our program and how it makes it possible to qualify models of heat transfer by conduction.
Submitted 20 December, 2012;
originally announced December 2012.
-
A Comparison between CODYRUN and TRNSYS, simulation models for thermal buildings behaviour
Authors:
Franck Lucas,
Thierry Alex Mara,
François Garde,
Harry Boyer
Abstract:
Simulation codes of thermal behaviour could significantly improve housing construction design. Among the existing software, CODYRUN and TRNSYS are calculation codes with different designs. CODYRUN is exclusively dedicated to housing thermal behaviour, whereas TRNSYS is more generally used on any thermal system. The purpose of this article is to compare these two instruments in two different conditions. We first model a mono-zone test cell, and analyse the results by means of signal processing methods. Then, we model a real case of multi-zone housing, representative of housing in wet tropical climates. We can thus evaluate the influence of meteorological and building description data on model errors.
Submitted 18 December, 2012;
originally announced December 2012.
-
A validation methodology aid for improving a thermal building model: Case of diffuse radiation accounting in a tropical climate
Authors:
A. J. P. Lauret,
T. A. Mara,
H. Boyer,
L. Adelard,
F. Garde
Abstract:
As part of our efforts to complete the software CODYRUN validation, we chose as test building a block of flats constructed in Reunion Island, which has a humid tropical climate. The sensitivity analysis allowed us to study the effects of both diffuse and direct solar radiation on our model of this building. With regard to the choice and location of sensors, this stage of the study also led us to measure the solar radiation falling on the windows. The comparison of measured and predicted radiation clearly showed that our predictions over-estimated the incoming solar radiation, and we were able to trace the problem to the algorithm which calculates diffuse solar radiation. By calculating view factors between the windows and the associated shading devices, changes to the original program allowed us to improve the predictions, and so this article shows the importance of sensitivity analysis in this area of research.
Submitted 17 December, 2012;
originally announced December 2012.
-
Building ventilation: A pressure airflow model computer generation and elements of validation
Authors:
H. Boyer,
A. P. Lauret,
L. Adelard,
T. A. Mara
Abstract:
The calculation of airflows is of great importance for detailed building thermal simulation computer codes, these airflows most frequently constituting an important thermal coupling between the building and the outside on one hand, and the different thermal zones on the other. The driving effects of air movement, which are the wind and the thermal buoyancy, are briefly outlined and we look closely at their coupling in the case of buildings, by exploring the difficulties associated with large openings. Some numerical problems tied to solving the resulting non-linear system are also covered. The numerical implementation of this airflow model, which is part of a detailed simulation software (CODYRUN), is explained, with emphasis on the data organization and processing that allow the calculation of the airflows. Comparisons are then made between the model results and, on the one hand, analytical expressions and, on the other hand, experimental measurements in the case of a collective dwelling.
Submitted 17 December, 2012;
originally announced December 2012.