-
Community-centric modeling of citation dynamics explains collective citation patterns in science, law, and patents
Authors:
Sadamori Kojaku,
Robert Mahari,
Sandro Claudio Lera,
Esteban Moro,
Alex Pentland,
Yong-Yeol Ahn
Abstract:
Many human knowledge systems, such as science, law, and invention, are built on documents and the citations that link them. Citations, while serving multiple purposes, primarily function as a way to explicitly document the use of prior work and thus have become central to the study of knowledge systems. Analyzing citation dynamics has revealed statistical patterns that shed light on knowledge production, recognition, and formalization, and has helped identify key mechanisms driving these patterns. However, most quantitative findings are confined to scientific citations, raising the question of universality of these findings. Moreover, existing models of individual citation trajectories fail to explain phenomena such as delayed recognition, calling for a unifying framework. Here, we analyze a newly available corpus of U.S. case law, in addition to scientific and patent citation networks, to show that they share remarkably similar citation patterns, including a heavy-tailed distribution of sleeping beauties. We propose a holistic model that captures the three core mechanisms driving collective dynamics and replicates the elusive phenomenon of delayed recognition. We demonstrate that the model not only replicates observed citation patterns, but also better predicts future successes by considering the whole system. Our work offers insights into key mechanisms that govern large-scale patterns of collective human knowledge systems and may provide generalizable perspectives on discovery and innovation across domains.
Submitted 27 January, 2025; v1 submitted 26 January, 2025;
originally announced January 2025.
-
Matrix-weighted networks for modeling multidimensional dynamics
Authors:
Yu Tian,
Sadamori Kojaku,
Hiroki Sayama,
Renaud Lambiotte
Abstract:
Networks are powerful tools for modeling interactions in complex systems. While traditional networks use scalar edge weights, many real-world systems involve multidimensional interactions. For example, in social networks, individuals often have multiple interconnected opinions that can affect different opinions of other individuals, which can be better characterized by matrices. We propose a novel, general framework for modeling such multidimensional interacting dynamics: matrix-weighted networks (MWNs). We present the mathematical foundations of MWNs and examine consensus dynamics and random walks within this context. Our results reveal that the coherence of MWNs gives rise to non-trivial steady states that generalize the notions of communities and structural balance in traditional networks.
Submitted 7 October, 2024;
originally announced October 2024.
-
Implicit degree bias in the link prediction task
Authors:
Rachith Aiyappa,
Xin Wang,
Munjung Kim,
Ozgur Can Seckin,
Jisung Yoon,
Yong-Yeol Ahn,
Sadamori Kojaku
Abstract:
Link prediction -- a task of distinguishing actual hidden edges from random unconnected node pairs -- is one of the quintessential tasks in graph machine learning. Despite being widely accepted as a universal benchmark and a downstream task for representation learning, the validity of the link prediction benchmark itself has rarely been questioned. Here, we show that the common edge sampling procedure in the link prediction task has an implicit bias toward high-degree nodes and produces a highly skewed evaluation that favors methods overly dependent on node degree, to the extent that a "null" link prediction method based solely on node degree can yield nearly optimal performance. We propose a degree-corrected link prediction task that offers a more reasonable assessment that aligns better with the performance in the recommendation task. Finally, we demonstrate that the degree-corrected benchmark can more effectively train graph machine-learning models by reducing overfitting to node degrees and facilitating the learning of relevant structures in graphs.
Submitted 29 May, 2024; v1 submitted 23 May, 2024;
originally announced May 2024.
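A toy illustration of the degree bias mentioned in the abstract above, not the paper's benchmark: when edges are drawn uniformly at random, a node appears as an endpoint in proportion to its degree. The graph and the helper name `endpoint_share` are assumptions made up for this sketch.

```python
from collections import Counter


def endpoint_share(edges):
    """Return, for each node, the fraction of all edge endpoints it accounts for.

    This equals the chance that a given endpoint of a uniformly sampled edge
    is that node, i.e. degree / (2 * number of edges).
    """
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    total = 2 * len(edges)
    return {node: c / total for node, c in counts.items()}


# Hypothetical toy graph: a hub (node 0) linked to five leaves, plus one leaf-leaf edge.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 2)]
share = endpoint_share(edges)

# The hub (degree 5) takes 5/12 of the endpoints; a degree-1 leaf takes only 1/12.
hub_share = share[0]
leaf_share = share[3]
```

In this toy graph the hub is five times more likely than a degree-1 leaf to show up in a sampled edge, which is the kind of skew a degree-only predictor could exploit.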
-
Iterative embedding and reweighting of complex networks reveals community structure
Authors:
Bianka Kovács,
Sadamori Kojaku,
Gergely Palla,
Santo Fortunato
Abstract:
Graph embeddings learn the structure of networks and represent it in low-dimensional vector spaces. Community structure is one of the features that are recognized and reproduced by embeddings. We show that an iterative procedure, in which a graph is repeatedly embedded and its links are reweighted based on the geometric proximity between the nodes, reinforces intra-community links and weakens inter-community links, making the clusters of the initial network more visible and more easily detectable. The geometric separation between the communities can become so strong that even a very simple parsing of the links may recover the communities as isolated components with surprisingly high precision. Furthermore, when used as a pre-processing step, our embedding and reweighting procedure can improve the performance of traditional community detection algorithms.
Submitted 16 February, 2024;
originally announced February 2024.
-
Representing the Disciplinary Structure of Physics: A Comparative Evaluation of Graph and Text Embedding Methods
Authors:
Isabel Constantino,
Sadamori Kojaku,
Santo Fortunato,
Yong-Yeol Ahn
Abstract:
Recent advances in machine learning offer new ways to represent and study scholarly works and the space of knowledge. Graph and text embeddings provide a convenient vector representation of scholarly works based on citations and text. Yet, it is unclear whether their representations are consistent or provide different views of the structure of science. Here, we compare graph and text embedding by testing their ability to capture the hierarchical structure of the Physics and Astronomy Classification Scheme (PACS) of papers published by the American Physical Society (APS). We also provide a qualitative comparison of the overall structure of the graph and text embeddings for reference. We find that neural network-based methods outperform traditional methods and graph embedding methods such as node2vec are better than other methods at capturing the PACS structure. Our results call for further investigations into how different contexts of scientific papers are captured by different methods, and how we can combine and leverage such information in an interpretable manner.
Submitted 12 February, 2025; v1 submitted 29 August, 2023;
originally announced August 2023.
-
Network community detection via neural embeddings
Authors:
Sadamori Kojaku,
Filippo Radicchi,
Yong-Yeol Ahn,
Santo Fortunato
Abstract:
Recent advances in machine learning research have produced powerful neural graph embedding methods, which learn useful, low-dimensional vector representations of network data. These neural methods for graph embedding excel in graph machine learning tasks and are now widely adopted. However, how and why these methods work -- particularly how network structure gets encoded in the embedding -- remain largely unexplained. Here, we show that node2vec -- a shallow, linear neural network -- encodes communities into separable clusters better than random partitioning down to the information-theoretic detectability limit for the stochastic block models. We show that this is due to the equivalence between the embedding learned by node2vec and the spectral embedding via the eigenvectors of the symmetric normalized Laplacian matrix. Numerical simulations demonstrate that node2vec is capable of learning communities on sparse graphs generated by the stochastic blockmodel, as well as on sparse degree-heterogeneous networks. Our results highlight the features of graph neural networks that enable them to separate communities in embedding space.
Submitted 1 November, 2024; v1 submitted 23 June, 2023;
originally announced June 2023.
-
Residual2Vec: Debiasing graph embedding with random graphs
Authors:
Sadamori Kojaku,
Jisung Yoon,
Isabel Constantino,
Yong-Yeol Ahn
Abstract:
Graph embedding maps a graph into a convenient vector-space representation for graph analysis and machine learning applications. Many graph embedding methods hinge on a sampling of context nodes based on random walks. However, random walks can be a biased sampler due to the structural properties of graphs. Most notably, random walks are biased by the degree of each node, where a node is sampled proportionally to its degree. The implication of such biases has not been clear, particularly in the context of graph representation learning. Here, we investigate the impact of the random walks' bias on graph embedding and propose residual2vec, a general graph embedding method that can debias various structural biases in graphs by using random graphs. We demonstrate that this debiasing not only improves link prediction and clustering performance but also allows us to explicitly model salient structural properties in graph embedding.
Submitted 14 October, 2021;
originally announced October 2021.
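The abstract above notes that a random walk samples nodes in proportion to their degree. A minimal sketch of that fact on a hypothetical four-node graph follows; it only checks the sampling bias by power iteration and is not an implementation of residual2vec. The function name `stationary_distribution` and the graph are assumptions for illustration.

```python
def stationary_distribution(adj, steps=1000):
    """Power-iterate a simple random walk on an undirected graph.

    `adj` maps each node to the list of its neighbours. The graph is assumed
    connected and non-bipartite so that the iteration converges.
    """
    nodes = sorted(adj)
    p = {u: 1 / len(nodes) for u in nodes}
    for _ in range(steps):
        nxt = {u: 0.0 for u in nodes}
        for u in nodes:
            for v in adj[u]:
                # The walker at u moves to each neighbour with equal probability.
                nxt[v] += p[u] / len(adj[u])
        p = nxt
    return p


# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 0.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
pi = stationary_distribution(adj)

# Degree-proportional prediction: degree / sum of degrees.
total_degree = sum(len(nbrs) for nbrs in adj.values())
expected = {u: len(adj[u]) / total_degree for u in adj}
```

Here `pi` and `expected` agree to numerical precision, so node 0 (degree 3) is visited three times as often as node 3 (degree 1).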
-
Unsupervised embedding of trajectories captures the latent structure of scientific migration
Authors:
Dakota Murray,
Jisung Yoon,
Sadamori Kojaku,
Rodrigo Costas,
Woo-Sung Jung,
Staša Milojević,
Yong-Yeol Ahn
Abstract:
Human migration and mobility drive major societal phenomena including epidemics, economies, innovation, and the diffusion of ideas. Although human mobility and migration have been heavily constrained by geographic distance throughout history, advances and globalization are making other factors such as language and culture increasingly important. Advances in neural embedding models, originally designed for natural language, provide an opportunity to tame this complexity and open new avenues for the study of migration. Here, we demonstrate the ability of the model word2vec to encode nuanced relationships between discrete locations from migration trajectories, producing an accurate, dense, continuous, and meaningful vector-space representation. The resulting representation provides a functional distance between locations, as well as a digital double that can be distributed, re-used, and itself interrogated to understand the many dimensions of migration. We show that the unique power of word2vec to encode migration patterns stems from its mathematical equivalence with the gravity model of mobility. Focusing on the case of scientific migration, we apply word2vec to a database of three million migration trajectories of scientists derived from the affiliations listed on their publication records. Using techniques that leverage its semantic structure, we demonstrate that embeddings can learn the rich structure that underpins scientific migration, such as cultural, linguistic, and prestige relationships at multiple levels of granularity. Our results provide a theoretical foundation and methodological framework for using neural embeddings to represent and understand migration both within and beyond science.
Submitted 17 November, 2023; v1 submitted 4 December, 2020;
originally announced December 2020.
-
Detecting anomalous citation groups in journal networks
Authors:
Sadamori Kojaku,
Giacomo Livan,
Naoki Masuda
Abstract:
The ever-increasing competitiveness in the academic publishing market incentivizes journal editors to pursue higher impact factors. This translates into journals becoming more selective, and, ultimately, into higher publication standards. However, the fixation on higher impact factors leads some journals to artificially boost impact factors through the coordinated effort of a "citation cartel" of journals. "Citation cartel" behavior has become increasingly common in recent years, with several instances being reported. Here, we propose an algorithm -- named CIDRE -- to detect anomalous groups of journals that exchange citations at excessively high rates when compared against a null model that accounts for scientific communities and journal size. CIDRE detects more than half of the journals suspended from Journal Citation Reports due to anomalous citation behavior in the year of suspension or in advance. Furthermore, CIDRE detects many new anomalous groups, where the impact factors of the member journals are lifted substantially higher by the citations from other member journals. We describe a number of such examples in detail and discuss the implications of our findings with regard to the current academic climate.
Submitted 15 July, 2021; v1 submitted 18 September, 2020;
originally announced September 2020.
-
The effectiveness of backward contact tracing in networks
Authors:
Sadamori Kojaku,
Laurent Hébert-Dufresne,
Enys Mones,
Sune Lehmann,
Yong-Yeol Ahn
Abstract:
Discovering and isolating infected individuals is a cornerstone of epidemic control. Because many infectious diseases spread through close contacts, contact tracing is a key tool for case discovery and control. However, although contact tracing has been performed widely, the mathematical understanding of contact tracing has not been fully established and it has not been clearly understood what determines the efficacy of contact tracing. Here, we reveal that, compared with "forward" tracing---tracing to whom disease spreads---"backward" tracing---tracing from whom disease spreads---is profoundly more effective. The effectiveness of backward tracing is due to simple but overlooked biases arising from the heterogeneity in contacts. Using simulations on both synthetic and high-resolution empirical contact datasets, we show that even at a small probability of detecting infected individuals, strategically executed contact tracing can prevent a significant fraction of further transmissions. We also show that---in terms of the number of prevented transmissions per isolation---case isolation combined with a small amount of contact tracing is more efficient than case isolation alone. By demonstrating that backward contact tracing is highly effective at discovering super-spreading events, we argue that the potential effectiveness of contact tracing has been underestimated. Therefore, there is a critical need for revisiting current contact tracing strategies so that they leverage all forms of biases. Our results also have important consequences for digital contact tracing because it will be crucial to incorporate the capability for backward and deep tracing while adhering to the privacy-preserving requirements of these new platforms.
Submitted 14 September, 2020; v1 submitted 5 May, 2020;
originally announced May 2020.
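One standard bias that arises from heterogeneous contact numbers can be shown with a few lines of arithmetic: a person reached by following a contact has, on average, more contacts than a person picked at random. The sketch below assumes contacts are wired without degree correlations (a configuration-model-style assumption) and uses a made-up degree list; it is a generic illustration, not the simulation described in the abstract above.

```python
def mean_degree(degrees):
    """Average number of contacts of a person chosen uniformly at random."""
    return sum(degrees) / len(degrees)


def mean_degree_via_contact(degrees):
    """Average number of contacts of a person reached by following a contact.

    A person with k contacts sits at the end of k contact links, so they are
    reached with weight k, giving <k^2> / <k>.
    """
    return sum(k * k for k in degrees) / sum(degrees)


# Hypothetical population: four people with one contact each and one with ten.
degrees = [1, 1, 1, 1, 10]
avg_random_person = mean_degree(degrees)
avg_traced_person = mean_degree_via_contact(degrees)
```

For this degree list a random person has 2.8 contacts on average, while a person reached through a contact has 104/14 (about 7.4), so tracing along contacts preferentially lands on highly connected individuals.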
-
Constructing networks by filtering correlation matrices: A null model approach
Authors:
Sadamori Kojaku,
Naoki Masuda
Abstract:
Network analysis has been applied to various correlation matrix data. Thresholding on the value of the pairwise correlation is probably the most straightforward and common method to create a network from a correlation matrix. However, there have been criticisms of this thresholding approach, such as an inability to filter out spurious correlations, which have led to proposals of alternative methods to overcome some of the problems. We propose a method to create networks from correlation matrices based on optimisation with regularisation, where we lay an edge between each pair of nodes if and only if the edge is unexpected from a null model. The proposed algorithm is advantageous in that it can be combined with different types of null models. Moreover, the algorithm can select the most plausible null model from a set of candidate null models using a model selection criterion. For three economic data sets, we find that the configuration model for correlation matrices is often preferred to standard null models. For country-level product export data, the present method better predicts main products exported from countries than sample correlation matrices do.
Submitted 26 March, 2019;
originally announced March 2019.
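For reference, the simple thresholding baseline that the abstract above describes (and criticizes) can be sketched as follows. This is not the null-model method the paper proposes; the function name `threshold_network`, the matrix values, and the threshold are assumptions chosen for illustration.

```python
def threshold_network(corr, threshold):
    """Return the edges (i, j), i < j, whose pairwise correlation is at least `threshold`.

    `corr` is a symmetric correlation matrix given as a list of lists.
    """
    n = len(corr)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if corr[i][j] >= threshold
    ]


# Hypothetical 3 x 3 correlation matrix.
corr = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.4],
    [0.1, 0.4, 1.0],
]
edges = threshold_network(corr, threshold=0.3)
strict_edges = threshold_network(corr, threshold=0.5)
```

The resulting network depends entirely on the chosen cut-off: at 0.3 the pairs (0, 1) and (1, 2) are connected, while at 0.5 only (0, 1) remains.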
-
Multiscale core-periphery structure in a global liner shipping network
Authors:
Sadamori Kojaku,
Mengqiao Xu,
Haoxiang Xia,
Naoki Masuda
Abstract:
Maritime transport accounts for the majority of trade by volume, of which 70% in value is carried by container ships that transit regular routes on fixed schedules in the ocean. In the present paper, we analyse a data set of global liner shipping as a network of ports. In particular, we construct the network of the ports as the one-mode projection of a bipartite network composed of ports and ship routes. Like other transportation networks, global liner shipping networks may have core-periphery structure, where a core and a periphery are groups of densely and sparsely interconnected nodes, respectively. Core-periphery structure may have practical implications for understanding the robustness, efficiency and uneven development of international transportation systems. We develop an algorithm to detect core-periphery pairs in a network, which allows one to find core and peripheral nodes on different scales and uses a configuration model that accounts for the fact that the network is obtained by the one-mode projection of a bipartite network. We also find that most ports are core (as opposed to peripheral) ports and that ports in some countries in Europe, America and Asia belong to a global core-periphery pair across different scales, whereas ports in other countries do not.
Submitted 25 January, 2019; v1 submitted 14 August, 2018;
originally announced August 2018.
-
Configuration model for correlation matrices preserving the node strength
Authors:
Naoki Masuda,
Sadamori Kojaku,
Yukie Sano
Abstract:
Correlation matrices are a major type of multivariate data. To examine properties of a given correlation matrix, a common practice is to compare the same quantity between the original correlation matrix and reference correlation matrices, such as those derived from random matrix theory, that partially preserve properties of the original matrix. We propose a model to generate such reference correlation and covariance matrices for the given matrix. Correlation matrices are often analysed as networks, which are heterogeneous across nodes in terms of the total connectivity to other nodes for each node. Given this background, the present algorithm generates random networks that preserve the expectation of total connectivity of each node to other nodes, akin to configuration models for conventional networks. Our algorithm is derived from the maximum entropy principle. We will apply the proposed algorithm to measurement of clustering coefficients and community detection, both of which require a null model to assess the statistical significance of the obtained results.
Submitted 22 July, 2018; v1 submitted 22 June, 2018;
originally announced June 2018.
-
Structural changes in the interbank market across the financial crisis from multiple core-periphery analysis
Authors:
Sadamori Kojaku,
Giulio Cimini,
Guido Caldarelli,
Naoki Masuda
Abstract:
Interbank markets are often characterised in terms of a core-periphery network structure, with a highly interconnected core of banks holding the market together, and a periphery of banks connected mostly to the core but not internally. This paradigm has recently been challenged for short time scales, where interbank markets seem better characterised by a bipartite structure with more core-periphery connections than inside the core. Using a novel core-periphery detection method on the eMID interbank market, we enrich this picture by showing that the network is actually characterised by multiple core-periphery pairs. Moreover, a transition from core-periphery to bipartite structures occurs by shortening the temporal scale of data aggregation. We further show how the global financial crisis transformed the market, in terms of composition, multiplicity and internal organisation of core-periphery pairs. By unveiling such a fine-grained organisation and transformation of the interbank market, our method can find important applications in the understanding of how distress can propagate over financial networks.
Submitted 14 February, 2018;
originally announced February 2018.
-
A generalised significance test for individual communities in networks
Authors:
Sadamori Kojaku,
Naoki Masuda
Abstract:
Many empirical networks have community structure, in which nodes are densely interconnected within each community (i.e., a group of nodes) and sparsely across different communities. Like other local and meso-scale structures of networks, communities are generally heterogeneous in various aspects such as the size, density of edges, connectivity to other communities and significance. In the present study, we propose a method to statistically test the significance of individual communities in a given network. Compared to the previous methods, the present algorithm is unique in that it accepts different community-detection algorithms and the corresponding quality function for single communities. The present method requires that the quality of each community can be quantified and that community detection is performed as optimisation of such a quality function summed over the communities. Various community detection algorithms including modularity maximisation and graph partitioning meet this criterion. Our method estimates a distribution of the quality function for randomised networks to calculate a likelihood of each community in the given network. We illustrate our algorithm using synthetic and empirical networks.
Submitted 9 May, 2018; v1 submitted 1 December, 2017;
originally announced December 2017.
-
Core-periphery structure requires something else in the network
Authors:
Sadamori Kojaku,
Naoki Masuda
Abstract:
A network with core-periphery structure consists of core nodes that are densely interconnected. In contrast to community structure, which is a different meso-scale structure of networks, core nodes can be connected to peripheral nodes and peripheral nodes are not densely interconnected. Although core-periphery structure sounds reasonable, we argue that it is merely accounted for by heterogeneous degree distributions, if one partitions a network into a single core block and a single periphery block, which the famous Borgatti-Everett algorithm and many succeeding algorithms assume. In other words, there is a strong tendency that high-degree and low-degree nodes are judged to be core and peripheral nodes, respectively. To discuss core-periphery structure beyond the expectation of the node's degree (as described by the configuration model), we propose that one needs to assume at least one block of nodes apart from the focal core-periphery structure, such as a different core-periphery pair, community or nodes not belonging to any meso-scale structure. We propose a scalable algorithm to detect pairs of core and periphery in networks, controlling for the effect of the node's degree. We illustrate our algorithm using various empirical networks.
Submitted 27 April, 2018; v1 submitted 19 October, 2017;
originally announced October 2017.
-
Finding multiple core-periphery pairs in networks
Authors:
Sadamori Kojaku,
Naoki Masuda
Abstract:
With a core-periphery structure of networks, core nodes are densely interconnected, peripheral nodes are connected to core nodes to different extents, and peripheral nodes are sparsely interconnected. Core-periphery structure composed of a single core and periphery has been identified for various networks. However, analogous to the observation that many empirical networks are composed of densely interconnected groups of nodes, i.e., communities, a network may be better regarded as a collection of multiple cores and peripheries. We propose a scalable algorithm to detect multiple non-overlapping groups of core-periphery structure in a network. We illustrate our algorithm using synthesised and empirical networks. For example, we find distinct core-periphery pairs with different political leanings in a network of political blogs and separation between international and domestic subnetworks of airports in some single countries in a world-wide airport network.
Submitted 22 November, 2017; v1 submitted 22 February, 2017;
originally announced February 2017.