-
GneissWeb: Preparing High Quality Data for LLMs at Scale
Authors:
Hajar Emami Gohari,
Swanand Ravindra Kadhe,
Syed Yousaf Shah,
Constantin Adam,
Abdulhamid Adebayo,
Praneet Adusumilli,
Farhan Ahmed,
Nathalie Baracaldo Angel,
Santosh Borse,
Yuan-Chi Chang,
Xuan-Hong Dang,
Nirmit Desai,
Ravital Eres,
Ran Iwamoto,
Alexei Karve,
Yan Koyfman,
Wei-Han Lee,
Changchang Liu,
Boris Lublinsky,
Takuyo Ohko,
Pablo Pesce,
Maroun Touma,
Shiqiang Wang,
Shalisha Witherspoon,
Herbert Woisetschlager,
David Wood, et al. (6 additional authors not shown)
Abstract:
Data quantity and quality play a vital role in determining the performance of Large Language Models (LLMs). High-quality data, in particular, can significantly boost the LLM's ability to generalize on a wide range of downstream tasks. Large pre-training datasets for leading LLMs remain inaccessible to the public, whereas many open datasets are small in size (less than 5 trillion tokens), limiting their suitability for training large models.
In this paper, we introduce GneissWeb, a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. Our GneissWeb recipe that produced the dataset consists of sharded exact sub-string deduplication and a judiciously constructed ensemble of quality filters. GneissWeb achieves a favorable trade-off between data quality and quantity, producing models that outperform models trained on state-of-the-art open large datasets (5+ trillion tokens).
We show that models trained using the GneissWeb dataset outperform those trained on FineWeb-V1.1.0 by 2.73 percentage points in terms of average score computed on a set of 11 commonly used benchmarks (both zero-shot and few-shot) for pre-training dataset evaluation. When the evaluation set is extended to 20 benchmarks (both zero-shot and few-shot), models trained using GneissWeb still achieve a 1.75 percentage point advantage over those trained on FineWeb-V1.1.0.
Submitted 18 February, 2025;
originally announced February 2025.
-
Finding path and cycle counting formulae in graphs with Deep Reinforcement Learning
Authors:
Jason Piquenot,
Maxime Bérar,
Pierre Héroux,
Jean-Yves Ramel,
Romain Raveaux,
Sébastien Adam
Abstract:
This paper presents Grammar Reinforcement Learning (GRL), a reinforcement learning algorithm that uses Monte Carlo Tree Search (MCTS) and a transformer architecture that models a Pushdown Automaton (PDA) within a context-free grammar (CFG) framework. Taking as a use case the problem of efficiently counting paths and cycles in graphs, a key challenge in network analysis, computer science, biology, and social sciences, GRL discovers new matrix-based formulas for path/cycle counting that improve computational efficiency by factors of two to six w.r.t. state-of-the-art approaches. Our contributions include: (i) a framework for generating gramformers that operate within a CFG, (ii) the development of GRL for optimizing formulas within grammatical structures, and (iii) the discovery of novel formulas for graph substructure counting, leading to significant computational improvements.
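As a point of reference for what a matrix-based counting formula looks like, the sketch below uses two classical identities on the adjacency matrix A of a simple undirected graph: the number of triangles is trace(A^3)/6, and the number of two-edge paths is (sum of entries of A^2 minus trace(A^2))/2. These are textbook formulas, not the formulas discovered by GRL, and all function names here are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def trace(m):
    """Sum of the diagonal entries of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))


def count_triangles(adj):
    # Each triangle produces 6 closed walks of length 3
    # (3 possible start nodes times 2 directions).
    a3 = matmul(matmul(adj, adj), adj)
    return trace(a3) // 6


def count_two_edge_paths(adj):
    # (A^2)[i][j] counts walks i -> k -> j. Removing the diagonal leaves
    # walks with distinct endpoints; each undirected path is seen twice.
    a2 = matmul(adj, adj)
    total = sum(sum(row) for row in a2)
    return (total - trace(a2)) // 2


# Example graph (an assumption for illustration): the 4-cycle 0-1-2-3-0
# with the extra chord 0-2.
adj = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
]
triangles = count_triangles(adj)
two_edge_paths = count_two_edge_paths(adj)
```

Formulas of this kind trade explicit enumeration for a few matrix products, which is why shortening them can translate directly into lower computational cost.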
Submitted 23 January, 2025; v1 submitted 2 October, 2024;
originally announced October 2024.
-
Temporal receptive field in dynamic graph learning: A comprehensive analysis
Authors:
Yannis Karmim,
Leshanshui Yang,
Raphaël Fournier S'Niehotta,
Clément Chatelain,
Sébastien Adam,
Nicolas Thome
Abstract:
Dynamic link prediction is a critical task in the analysis of evolving networks, with applications ranging from recommender systems to economic exchanges. However, the concept of the temporal receptive field, which refers to the temporal context that models use for making predictions, has been largely overlooked and insufficiently analyzed in existing research. In this study, we present a comprehensive analysis of the temporal receptive field in dynamic graph learning. By examining multiple datasets and models, we formalize the role of the temporal receptive field and highlight its crucial influence on predictive accuracy. Our results demonstrate that an appropriately chosen temporal receptive field can significantly enhance model performance, while for some models, overly large windows may introduce noise and reduce accuracy. We conduct extensive benchmarking to validate our findings, ensuring that all experiments are fully reproducible. Code is available at https://github.com/ykrmm/BenchmarkTW.
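One simple way to picture a temporal receptive field is as a fixed-length time window over past edge events. The sketch below takes that reading as an assumption; the paper's formal definition may differ, and the event format and function name are hypothetical.

```python
from typing import List, Tuple

# (source node, destination node, timestamp); a hypothetical event format.
Edge = Tuple[int, int, float]


def edges_in_receptive_field(events: List[Edge], t_query: float,
                             window: float) -> List[Edge]:
    """Return the events with t_query - window <= timestamp < t_query."""
    return [e for e in events if t_query - window <= e[2] < t_query]


events = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 5.0), (0, 3, 8.0), (1, 3, 9.5)]

# A short window keeps only recent interactions; a long one keeps all history.
short_context = edges_in_receptive_field(events, t_query=10.0, window=2.0)
long_context = edges_in_receptive_field(events, t_query=10.0, window=9.0)
```

Under this reading, the window length is the quantity being tuned: a model given `long_context` sees more history but also more potentially stale interactions.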
Submitted 19 July, 2024; v1 submitted 17 July, 2024;
originally announced July 2024.
-
Improving the quality of individual-level online information tracking: challenges of existing approaches and introduction of a new content- and long-tail sensitive academic solution
Authors:
Silke Adam,
Mykola Makhortykh,
Michaela Maier,
Viktor Aigenseer,
Aleksandra Urman,
Teresa Gil Lopez,
Clara Christner,
Ernesto de León,
Roberto Ulloa
Abstract:
This article evaluates the quality of data collection in individual-level desktop information tracking used in the social sciences and shows that the existing approaches face sampling issues, validity issues due to the lack of content-level data and their disregard of the variety of devices and long-tail consumption patterns as well as transparency and privacy issues. To overcome some of these problems, the article introduces a new academic tracking solution, WebTrack, an open source tracking tool maintained by a major European research institution. The design logic, the interfaces and the backend requirements for WebTrack, followed by a detailed examination of strengths and weaknesses of the tool, are discussed. Finally, using data from 1185 participants, the article empirically illustrates how an improvement in the data collection through WebTrack leads to new innovative shifts in the processing of tracking data. As WebTrack allows collecting the content people are exposed to on more than classical news platforms, we can strongly improve the detection of politics-related information consumption in tracking data with the application of automated content analysis compared to traditional approaches that rely on the list-based identification of news.
Submitted 5 March, 2024;
originally announced March 2024.
-
Dynamic Graph Representation Learning with Neural Networks: A Survey
Authors:
Leshanshui Yang,
Sébastien Adam,
Clément Chatelain
Abstract:
In recent years, Dynamic Graph (DG) representations have been increasingly used for modeling dynamic systems due to their ability to integrate both topological and temporal information in a compact representation. Dynamic graphs make it possible to efficiently handle applications such as social network prediction, recommender systems, traffic forecasting, or electroencephalography analysis, which cannot be addressed using standard numeric representations. As a direct consequence of the emergence of dynamic graph representations, dynamic graph learning has emerged as a new machine learning problem, combining challenges from both sequential/temporal data processing and static graph learning. In this research area, the Dynamic Graph Neural Network (DGNN) has become the state-of-the-art approach, and a plethora of models have been proposed in recent years. This paper aims at providing a review of problems and models related to dynamic graph learning. The various dynamic graph supervised learning settings are analysed and discussed. We identify the similarities and differences between existing models with respect to the way time information is modeled. Finally, general guidelines for a DGNN designer when faced with a dynamic graph learning problem are provided.
Submitted 12 April, 2023;
originally announced April 2023.
-
Technical report: Graph Neural Networks go Grammatical
Authors:
Jason Piquenot,
Aldo Moscatelli,
Maxime Bérar,
Pierre Héroux,
Romain Raveaux,
Jean-Yves Ramel,
Sébastien Adam
Abstract:
This paper introduces a framework for formally establishing a connection between a portion of an algebraic language and a Graph Neural Network (GNN). The framework leverages Context-Free Grammars (CFG) to organize algebraic operations into generative rules that can be translated into a GNN layer model. As CFGs derived directly from a language tend to contain redundancies in their rules and variables, we present a grammar reduction scheme. By applying this strategy, we define a CFG that conforms to the third-order Weisfeiler-Lehman (3-WL) test using MATLANG. From this 3-WL CFG, we derive a GNN model, named G$^2$N$^2$, which is provably 3-WL compliant. Through various experiments, we demonstrate the superior efficiency of G$^2$N$^2$ compared to other 3-WL GNNs across numerous downstream tasks. Specifically, one experiment highlights the benefits of grammar reduction within our framework.
Submitted 4 October, 2023; v1 submitted 2 March, 2023;
originally announced March 2023.
-
Panning for gold: Lessons learned from the platform-agnostic automated detection of political content in textual data
Authors:
Mykola Makhortykh,
Ernesto de León,
Aleksandra Urman,
Clara Christner,
Maryna Sydorova,
Silke Adam,
Michaela Maier,
Teresa Gil-Lopez
Abstract:
The growing availability of data about online information behaviour enables new possibilities for political communication research. However, the volume and variety of these data make them difficult to analyse and prompt the need for developing automated content approaches relying on a broad range of natural language processing techniques (e.g. machine learning- or neural network-based ones). In this paper, we discuss how these techniques can be used to detect political content across different platforms. Using three validation datasets, which include a variety of political and non-political textual documents from online platforms, we systematically compare the performance of three groups of detection techniques relying on dictionaries, supervised machine learning, or neural networks. We also examine the impact of different modes of data preprocessing (e.g. stemming and stopword removal) on the low-cost implementations of these techniques using a large set (n = 66) of detection models. Our results show the limited impact of preprocessing on model performance, with the best results for less noisy data being achieved by neural network- and machine-learning-based models, in contrast to the more robust performance of dictionary-based models on noisy data.
Submitted 1 July, 2022;
originally announced July 2022.
-
A Set Membership Approach to Discovering Feature Relevance and Explaining Neural Classifier Decisions
Authors:
Stavros P. Adam,
Aristidis C. Likas
Abstract:
Neural classifiers are nonlinear systems providing decisions on the classes of patterns for a given problem they have learned. The output computed by a classifier for each pattern constitutes an approximation of the output of some unknown function, mapping pattern data to their respective classes. The lack of knowledge of such a function, along with the complexity of neural classifiers, especially when these are deep learning architectures, does not permit obtaining information on how specific predictions have been made. Hence, these powerful learning systems are considered black boxes, and in critical applications their use tends to be considered inappropriate. Gaining insight into such a black box operation constitutes one approach to interpreting the operation of neural classifiers and assessing the validity of their decisions. In this paper we tackle this problem by introducing a novel methodology for discovering which features are considered relevant by a trained neural classifier and how they affect the classifier's output, thus obtaining an explanation of its decision. Although feature relevance has received much attention in the machine learning literature, here we reconsider it in terms of nonlinear parameter estimation targeted by a set membership approach which is based on interval analysis. Hence, the proposed methodology builds on sound mathematical approaches and the results obtained constitute a reliable estimation of the classifier's decision premises.
Submitted 4 June, 2023; v1 submitted 5 April, 2022;
originally announced April 2022.
-
Breaking the Limits of Message Passing Graph Neural Networks
Authors:
Muhammet Balcilar,
Pierre Héroux,
Benoit Gaüzère,
Pascal Vasseur,
Sébastien Adam,
Paul Honeine
Abstract:
Since Message Passing (Graph) Neural Networks (MPNNs) have linear complexity with respect to the number of nodes when applied to sparse graphs, they have been widely implemented and still attract a lot of interest even though their theoretical expressive power is limited to the first-order Weisfeiler-Lehman test (1-WL). In this paper, we show that if the graph convolution supports are designed in the spectral domain by a non-linear custom function of eigenvalues and masked with an arbitrarily large receptive field, the MPNN is theoretically more powerful than the 1-WL test and experimentally as powerful as existing 3-WL models, while remaining spatially localized. Moreover, by designing custom filter functions, outputs can have various frequency components that allow the convolution process to learn different relationships between a given input graph signal and its associated properties. So far, the best 3-WL equivalent graph neural networks have a computational complexity in $\mathcal{O}(n^3)$ with memory usage in $\mathcal{O}(n^2)$, rely on non-local update mechanisms, and do not provide the spectral richness of output profile. The proposed method overcomes all of these problems and reaches state-of-the-art results in many downstream tasks.
Submitted 8 June, 2021;
originally announced June 2021.
-
Soap serialization effect on communication nodes and protocols
Authors:
Ali Baba Dauda,
Mohammed Sani Adam,
Muhammad Ahmad Mustapha,
Audu Musa Mabu,
Suleiman Mustafa
Abstract:
Although serialization improves the transmission of data through better utilization of bandwidth, its impact on communication systems is not fully accounted for. This research used Simple Object Access Protocol (SOAP) Web services to exchange serialized and normal messages via Hypertext Transfer Protocol (HTTP) and Java Message Service (JMS). We implemented two web services as server and client endpoints and transmitted SOAP messages as payload. We analyzed the effect of unserialized and serialized messages on the computing resources based on the response time and overhead at both server and client endpoints. The analysis identified the reasons for high response time and the causes of overhead. We provide some insights on resource utilization and trade-offs when choosing a messaging format or transmission protocol. This study is vital for resource management in edge computing and data centers.
Submitted 23 December, 2020;
originally announced December 2020.
-
Bridging the Gap Between Spectral and Spatial Domains in Graph Neural Networks
Authors:
Muhammet Balcilar,
Guillaume Renton,
Pierre Heroux,
Benoit Gauzere,
Sebastien Adam,
Paul Honeine
Abstract:
This paper aims at revisiting Graph Convolutional Neural Networks by bridging the gap between spectral and spatial design of graph convolutions. We theoretically demonstrate some equivalence of the graph convolution process regardless of whether it is designed in the spatial or the spectral domain. The resulting general framework allows a spectral analysis of the most popular ConvGNNs, explaining their performance and showing their limits. Moreover, the proposed framework is used to design new convolutions in the spectral domain with a custom frequency profile while applying them in the spatial domain. We also propose a generalization of the depthwise separable convolution framework for graph convolutional networks, which allows decreasing the total number of trainable parameters while keeping the capacity of the model. To the best of our knowledge, such a framework has never been used in the GNN literature. Our proposals are evaluated on both transductive and inductive graph learning problems. The results obtained show the relevance of the proposed method and provide some of the first experimental evidence of transferability of spectral filter coefficients from one graph to another. Our source code is publicly available at: https://github.com/balcilar/Spectral-Designed-Graph-Convolutions
Submitted 25 March, 2020;
originally announced March 2020.
-
Neural Networks Regularization Through Class-wise Invariant Representation Learning
Authors:
Soufiane Belharbi,
Clément Chatelain,
Romain Hérault,
Sébastien Adam
Abstract:
Training deep neural networks is known to require a large number of training samples. However, in many applications only a few training samples are available. In this work, we tackle the issue of training neural networks for classification tasks when few training samples are available. We attempt to solve this issue by proposing a new regularization term that constrains the hidden layers of a network to learn class-wise invariant representations. In our regularization framework, learning invariant representations is generalized to the class membership where samples with the same class should have the same representation. Numerical experiments over MNIST and its variants showed that our proposal helps improve the generalization of neural networks, particularly when trained with few samples. We provide the source code of our framework at https://github.com/sbelharbi/learning-class-invariant-features.
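The idea that samples of the same class should share a representation can be illustrated with a simple penalty: the mean squared Euclidean distance between hidden vectors of same-class pairs. This is an assumed form chosen for illustration; the abstract does not specify the regularizer, and the function name and toy data below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def classwise_invariance_penalty(hidden, labels):
    """Mean squared Euclidean distance over all same-class pairs.

    hidden: list of hidden-layer vectors (lists of floats), one per sample.
    labels: list of class labels, aligned with `hidden`.
    Returns 0.0 when no class has at least two samples.
    """
    by_class = defaultdict(list)
    for h, y in zip(hidden, labels):
        by_class[y].append(h)

    total, pairs = 0.0, 0
    for vectors in by_class.values():
        for u, v in combinations(vectors, 2):
            total += sum((a - b) ** 2 for a, b in zip(u, v))
            pairs += 1
    return total / pairs if pairs else 0.0


# Toy hidden representations for four samples in two classes.
hidden = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [5.0, 7.0]]
labels = [0, 0, 1, 1]
penalty = classwise_invariance_penalty(hidden, labels)
```

In training, a term like this would be added to the supervised loss with a weighting coefficient, so that the network is pushed toward representations that vary little within a class.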
Submitted 22 December, 2017; v1 submitted 6 September, 2017;
originally announced September 2017.
-
Graph edit distance: a new binary linear programming formulation
Authors:
Julien Lerouge,
Zeina Abu-Aisheh,
Romain Raveaux,
Pierre Héroux,
Sébastien Adam
Abstract:
Graph edit distance (GED) is a powerful and flexible graph matching paradigm that can be used to address different tasks in structural pattern recognition, machine learning, and data mining. In this paper, some new binary linear programming formulations for computing the exact GED between two graphs are proposed. A major strength of the formulations lies in their genericity, since the GED can be computed between directed or undirected fully attributed graphs (i.e. with attributes on both vertices and edges). Moreover, a relaxation of the domain constraints in the formulations provides efficient lower bound approximations of the GED. A complete experimental study comparing the proposed formulations with 4 state-of-the-art algorithms for exact and approximate graph edit distances is provided. By considering both the quality of the proposed solution and the efficiency of the algorithms as performance criteria, the results show that none of the compared methods dominates the others in the Pareto sense. As a consequence, when faced with a given real-world problem, a trade-off between quality and efficiency has to be chosen w.r.t. the application constraints. In this context, this paper provides a guide that can be used to choose the appropriate method.
Submitted 21 May, 2015;
originally announced May 2015.
-
Deep Neural Networks Regularization for Structured Output Prediction
Authors:
Soufiane Belharbi,
Romain Hérault,
Clément Chatelain,
Sébastien Adam
Abstract:
A deep neural network model is a powerful framework for learning representations. Usually, it is used to learn the relation $x \to y$ by exploiting the regularities in the input $x$. In structured output prediction problems, $y$ is multi-dimensional and structural relations often exist between the dimensions. The motivation of this work is to learn the output dependencies that may lie in the output data in order to improve the prediction accuracy. Unfortunately, feedforward networks are unable to exploit the relations between the outputs. In order to overcome this issue, we propose in this paper a regularization scheme for training neural networks for these particular tasks using a multi-task framework. Our scheme aims at incorporating the learning of the output representation $y$ in the training process in an unsupervised fashion while learning the supervised mapping function $x \to y$.
We evaluate our framework on a facial landmark detection problem, which is a typical structured output task. We show over two challenging public datasets (LFPW and HELEN) that our regularization scheme improves the generalization of deep neural networks and accelerates their training. The use of unlabeled data and label-only data is also explored, showing an additional improvement of the results. We provide an open-source implementation (https://github.com/sbelharbi/structured-output-ae) of our framework.
Submitted 30 October, 2017; v1 submitted 28 April, 2015;
originally announced April 2015.
-
Towards Interactive, Incremental Programming of ROS Nodes
Authors:
Sorin Adam,
Ulrik Pagh Schultz
Abstract:
Writing software for controlling robots is a complex task, usually demanding command of many programming languages and requiring significant experimentation. We believe that a bottom-up development process that complements traditional component- and MDSD-based approaches can facilitate experimentation. We propose the use of an internal DSL providing both a tool to interactively create ROS nodes and a behaviour-replacement mechanism to interactively reshape existing ROS nodes by wrapping the external interfaces (the publish/subscribe topics), dynamically controlled using the Python command line interface.
Submitted 15 December, 2014;
originally announced December 2014.
-
Reliability Conditions in Quadrature Algorithms
Authors:
Gh. Adam,
S. Adam,
N. M. Plakida
Abstract:
The detection of insufficiently resolved or ill-conditioned integrand structures is critical for the reliability assessment of the quadrature rule outputs. We discuss a method of analysis of the profile of the integrand at the quadrature knots which allows inferences approaching the theoretical 100% rate of success, under error estimate sharpening. The proposed procedure is of the highest interest for the solution of parametric integrals arising in complex physical models.
Submitted 6 March, 2003;
originally announced March 2003.