Search | arXiv e-print repository

Property-Isometric Variational Autoencoders for Sequence Modeling and Design

Authors: Elham Sadeghi, Xianqi Deng, I-Hsin Lin, Stacy M. Copp, Petko Bogdanov

Abstract: Biological sequence design (DNA, RNA, or peptides) with desired functional properties has applications in discovering novel nanomaterials, biosensors, antimicrobial drugs, and beyond. One common challenge is optimizing complex high-dimensional properties such as target emission spectra of DNA-mediated fluorescent nanoparticles, photo and chemical stability, and antimicrobial activity of peptides across target microbes. Existing models rely on simple binary labels (e.g., binding/non-binding) rather than high-dimensional complex properties. To address this gap, we propose a geometry-preserving variational autoencoder framework, called PrIVAE, which learns latent sequence embeddings that respect the geometry of their property space. Specifically, we model the property space as a high-dimensional manifold that can be locally approximated by a nearest neighbor graph, given an appropriately defined distance measure. We employ the property graph to guide the sequence latent representations using (1) graph neural network encoder layers and (2) an isometric regularizer. PrIVAE learns a property-organized latent space that enables rational design of new sequences with desired properties by employing the trained decoder. We evaluate the utility of our framework for two generative tasks: (1) design of DNA sequences that template fluorescent metal nanoclusters and (2) design of antimicrobial peptides. The trained models retain high reconstruction accuracy while organizing the latent space according to properties. Beyond in silico experiments, we also employ sampled sequences for wet lab design of DNA nanoclusters, resulting in up to 16.1-fold enrichment of rare-property nanoclusters compared to their abundance in training data, demonstrating the practical utility of our framework.

Submitted 16 September, 2025; originally announced September 2025.

Comments: 20 pages, 6 figures, preprint

arXiv:2509.10033 [pdf, ps, other]

Sparse Coding Representation of 2-way Data

Authors: Boya Ma, Abram Magner, Maxwell McNeil, Petko Bogdanov

Abstract: Sparse dictionary coding represents signals as linear combinations of a few dictionary atoms. It has been applied to images, time series, graph signals and multi-way spatio-temporal data by jointly employing temporal and spatial dictionaries. Data-agnostic analytical dictionaries, such as the discrete Fourier transform, wavelets and graph Fourier, have seen wide adoption due to efficient implementations and good practical performance. On the other hand, dictionaries learned from data offer sparser and more accurate solutions but require learning of both the dictionaries and the coding coefficients. This becomes especially challenging for multi-dictionary scenarios since encoding coefficients correspond to all atom combinations from the dictionaries. To address this challenge, we propose a low-rank coding model for 2-dictionary scenarios and study its data complexity. Namely, we establish a bound on the number of samples needed to learn dictionaries that generalize to unseen samples from the same distribution. We propose a convex relaxation solution, called AODL, whose exact solution we show also solves the original problem. We then solve this relaxation via alternating optimization between the sparse coding matrices and the learned dictionaries, which we prove to be convergent. We demonstrate its quality for data reconstruction and missing value imputation in both synthetic and real-world datasets. For a fixed reconstruction quality, AODL learns up to 90% sparser solutions compared to non-low-rank and analytical (fixed) dictionary baselines. In addition, the learned dictionaries reveal interpretable insights into patterns present within the samples used for training.

Submitted 12 September, 2025; originally announced September 2025.

arXiv:2406.06960 [pdf, ps, other]

Low Rank Multi-Dictionary Selection at Scale

Authors: Boya Ma, Maxwell McNeil, Abram Magner, Petko Bogdanov

Abstract: The sparse dictionary coding framework represents signals as a linear combination of a few predefined dictionary atoms. It has been employed for images, time series, graph signals and recently for 2-way (or 2D) spatio-temporal data employing jointly temporal and spatial dictionaries. Large and over-complete dictionaries enable high-quality models, but also pose scalability challenges which are exacerbated in multi-dictionary settings. Hence, an important problem that we address in this paper is: How to scale multi-dictionary coding for large dictionaries and datasets? We propose a multi-dictionary atom selection technique for low-rank sparse coding named LRMDS. To enable scalability to large dictionaries and datasets, it progressively selects groups of row-column atom pairs based on their alignment with the data and performs convex relaxation coding via the corresponding sub-dictionaries. We demonstrate both theoretically and experimentally that when the data has a low-rank encoding with a sparse subset of the atoms, LRMDS is able to select them with strong guarantees under mild assumptions. Furthermore, we demonstrate the scalability and quality of LRMDS in both synthetic and real-world datasets and for a range of coding dictionaries. It achieves 3X to 10X speed-up compared to baselines, while obtaining up to two orders of magnitude improvement in representation quality on some of the real world datasets given a fixed target number of atoms.

Submitted 11 June, 2024; originally announced June 2024.

Comments: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 25-29, 2024, Barcelona, Spain

arXiv:2309.09717 [pdf, other]

Multi-Dictionary Tensor Decomposition

Authors: Maxwell McNeil, Petko Bogdanov

Abstract: Tensor decomposition methods are popular tools for analysis of multi-way datasets from social media, healthcare, spatio-temporal domains, and others. Widely adopted models such as Tucker and canonical polyadic decomposition (CPD) follow a data-driven philosophy: they decompose a tensor into factors that approximate the observed data well. In some cases side information is available about the tensor modes. For example, in a temporal user-item purchases tensor, a user influence graph, an item similarity graph, and knowledge about seasonality or trends in the temporal mode may be available. Such side information may enable more succinct and interpretable tensor decomposition models and improved quality in downstream tasks. We propose a framework for Multi-Dictionary Tensor Decomposition (MDTD) which takes advantage of prior structural information about tensor modes in the form of coding dictionaries to obtain sparsely encoded tensor factors. We derive a general optimization algorithm for MDTD that handles both complete input and input with missing values. Our framework handles large sparse tensors typical to many real-world application domains. We demonstrate MDTD's utility via experiments with both synthetic and real-world datasets. It learns more concise models than dictionary-free counterparts and improves (i) reconstruction quality (60% fewer non-zero coefficients coupled with smaller error); (ii) missing values imputation quality (two-fold MSE reduction with up to orders of magnitude time savings) and (iii) the estimation of the tensor rank. MDTD's quality improvements do not come with a running time premium: it can decompose 19 GB datasets in less than a minute. It can also impute missing values in sparse billion-entry tensors more accurately and scalably than state-of-the-art competitors.

Submitted 18 September, 2023; originally announced September 2023.

arXiv:2212.12771 [pdf, ps, other]

Unsupervised Instance and Subnetwork Selection for Network Data

Authors: Lin Zhang, Nicholas Moskwa, Melinda Larsen, Petko Bogdanov

Abstract: Unlike tabular data, features in network data are interconnected within a domain-specific graph. Examples of this setting include gene expression overlaid on a protein interaction network (PPI) and user opinions in a social network. Network data is typically high-dimensional (large number of nodes) and often contains outlier snapshot instances and noise. In addition, it is often non-trivial and time-consuming to annotate instances with global labels (e.g., disease or normal). How can we jointly select discriminative subnetworks and representative instances for network data without supervision? We address these challenges within an unsupervised framework for joint subnetwork and instance selection in network data, called UISS, via a convex self-representation objective. Given an unlabeled network dataset, UISS identifies representative instances while ignoring outliers. It outperforms state-of-the-art baselines on both discriminative subnetwork selection and representative instance selection, achieving up to 10% accuracy improvement on all real-world data sets we use for evaluation. When employed for exploratory analysis in RNA-seq network samples from multiple studies it produces interpretable and informative summaries.

Submitted 24 December, 2022; originally announced December 2022.

arXiv:2109.10937 [pdf, other]

Temporal Scale Estimation for Oversampled Network Cascades: Theory, Algorithms, and Experiment

Authors: Abram Magner, Carolyn Kaminski, Petko Bogdanov

Abstract: Spreading processes on graphs arise in a host of application domains, from the study of online social networks to viral marketing to epidemiology. Various discrete-time probabilistic models for spreading processes have been proposed. These are used for downstream statistical estimation and prediction problems, often involving messages or other information that is transmitted along with infections caused by the process. It is thus important to design models of cascade observation that take into account phenomena that lead to uncertainty about the process state at any given time. We highlight one such phenomenon, temporal distortion, caused by a misalignment between the rate at which observations of a cascade process are made and the rate at which the process itself operates, and argue that failure to correct for it results in degradation of performance on downstream statistical tasks. To address these issues, we formulate the clock estimation problem in terms of a natural distortion measure. We give a clock estimation algorithm, which we call FastClock, that runs in linear time in the size of its input and is provably statistically accurate for a broad range of model parameters when cascades are generated from the independent cascade process with known parameters and when the underlying graph is Erdős-Rényi. We further give empirical results on the performance of our algorithm in comparison to the state-of-the-art estimator, a likelihood proxy maximization-based estimator implemented via dynamic programming. We find that, in a broad parameter regime, our algorithm substantially outperforms the dynamic programming algorithm in terms of both running time and accuracy.

Submitted 22 September, 2021; originally announced September 2021.

Comments: 31 pages

arXiv:2106.13517 [pdf, other]

doi 10.1145/3447548.3467379

Temporal Graph Signal Decomposition

Authors: Maxwell McNeil, Lin Zhang, Petko Bogdanov

Abstract: Temporal graph signals are multivariate time series with individual components associated with nodes of a fixed graph structure. Data of this kind arises in many domains including activity of social network users, sensor network readings over time, and time course gene expression within the interaction network of a model organism. Traditional matrix decomposition methods applied to such data fall short of exploiting structural regularities encoded in the underlying graph and also in the temporal patterns of the signal. How can we take into account such structure to obtain a succinct and interpretable representation of temporal graph signals? We propose a general, dictionary-based framework for temporal graph signal decomposition (TGSD). The key idea is to learn a low-rank, joint encoding of the data via a combination of graph and time dictionaries. We propose a highly scalable decomposition algorithm for both complete and incomplete data, and demonstrate its advantage for matrix decomposition, imputation of missing values, temporal interpolation, clustering, period estimation, and rank estimation in synthetic and real-world data ranging from traffic patterns to social media activity. Our framework achieves 28% reduction in RMSE compared to baselines for temporal interpolation when as many as 75% of the observations are missing. It scales best among baselines taking under 20 seconds on 3.5 million data points and produces the most parsimonious models. To the best of our knowledge, TGSD is the first framework to jointly model graph signals by temporal and graph dictionaries.

Submitted 25 June, 2021; originally announced June 2021.

Comments: 9 main pages, 2 supplement; to be published in the research track of the Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2021), August 14 through August 18, 2021, Virtual Event, Singapore

arXiv:2005.04149 [pdf, other]

LinksIQ: Robust and Efficient Modulation Recognition with Imperfect Spectrum Scans

Authors: Wei Xiong, Karyn Doke, Petko Bogdanov, Mariya Zheleva

Abstract: While critical for the practical progress of spectrum sharing, modulation recognition has so far been investigated under unrealistic assumptions: (i) a transmitter's bandwidth must be scanned alone and in full, (ii) prior knowledge of the technology must be available and (iii) a transmitter must be trustworthy. In reality these assumptions cannot be readily met, as a transmitter's bandwidth may only be scanned intermittently, partially, or alongside other transmitters, and modulation obfuscation may be introduced by short-lived scans or malicious activity. This paper presents LinksIQ, which bridges the gap between real-world spectrum sensing and the growing body of modrec methods designed under simplifying assumptions. Our key insight is that ordered IQ samples form distinctive patterns across modulations, which persist even with scan deficiencies. We mine these patterns through a Fisher Kernel framework and employ a lightweight linear support vector machine for modulation classification. LinksIQ is robust to noise, scan partiality and data biases without utilizing prior knowledge of transmitter technology. Its accuracy consistently outperforms baselines in both simulated and real traces. We evaluate LinksIQ performance in a testbed using two popular SDR platforms, RTL-SDR and USRP. We demonstrate high detection accuracy (0.74) even with a $20 RTL-SDR scanning at 50% transmitter overlap. This constitutes an average of 43% improvement over existing counterparts employed on RTL-SDR scans. We also explore the effects of platform-aware classifier training and discuss implications on real-world modrec system design. Our results demonstrate the feasibility of low-cost transmitter fingerprinting at scale.

Submitted 7 May, 2020; originally announced May 2020.

arXiv:1904.00791 [pdf]

DSL: Discriminative Subgraph Learning via Sparse Self-Representation

Authors: Lin Zhang, Petko Bogdanov

Abstract: The goal in network state prediction (NSP) is to classify the global state (label) associated with features embedded in a graph. This graph structure encoding feature relationships is the key distinctive aspect of NSP compared to classical supervised learning. NSP arises in various applications: gene expression samples embedded in a protein-protein interaction (PPI) network, temporal snapshots of infrastructure or sensor networks, and fMRI coherence network samples from multiple subjects, to name a few. Instances from these domains are typically "wide" (more features than samples), and thus, feature sub-selection is required for robust and generalizable prediction. How to best employ the network structure in order to learn succinct connected subgraphs encompassing the most discriminative features becomes a central challenge in NSP. Prior work employs connected subgraph sampling or graph smoothing within optimization frameworks, resulting in either large variance of quality or weak control over the connectivity of selected subgraphs. In this work we propose an optimization framework for discriminative subgraph learning (DSL) which simultaneously enforces (i) sparsity, (ii) connectivity and (iii) high discriminative power of the resulting subgraphs of features. Our optimization algorithm is a single-step solution for the NSP and the associated feature selection problem. It is rooted in the rich literature on maximal-margin optimization, spectral graph methods and sparse subspace self-representation. DSL simultaneously ensures solution interpretability and superior predictive power (up to 16% improvement in challenging instances compared to baselines), with execution times up to an hour for large instances.

Submitted 24 March, 2019; originally announced April 2019.

Comments: 9 pages

Journal ref: SIAM International Conference on Data Mining (SDM) 2019

arXiv:1807.08888 [pdf, other]

An Efficient System for Subgraph Discovery

Authors: Aparna Joshi, Yu Zhang, Petko Bogdanov, Jeong-Hyon Hwang

Abstract: Subgraph discovery in a single data graph (finding subsets of vertices and edges satisfying user-specified criteria) is an essential and general graph analytics operation with a wide spectrum of applications. Depending on the criteria, subgraphs of interest may correspond to cliques of friends in social networks, interconnected entities in RDF data, or frequent patterns in protein interaction networks, to name a few. Existing systems usually examine a large number of subgraphs while employing many computers and often produce an enormous result set of subgraphs. How can we enable fast discovery of only the most relevant subgraphs while minimizing the computational requirements? We present Nuri, a general subgraph discovery system that allows users to succinctly specify subgraphs of interest and criteria for ranking them. Given such specifications, Nuri efficiently finds the k most relevant subgraphs using only a single computer. It prioritizes (i.e., expands earlier than others) subgraphs that are more likely to expand into the desired subgraphs (prioritized subgraph expansion) and proactively discards irrelevant subgraphs from which the desired subgraphs cannot be constructed (pruning). Nuri can also efficiently store and retrieve a large number of subgraphs on disk without being limited by the size of main memory. We demonstrate using both real and synthetic datasets that Nuri on a single core outperforms the closest alternative distributed system consuming 40 times more computational resources by more than 2 orders of magnitude for clique discovery and 1 order of magnitude for subgraph isomorphism and pattern mining.

Submitted 23 July, 2018; originally announced July 2018.

arXiv:1709.04033 [pdf, other]

Local Community Detection in Dynamic Networks

Authors: Daniel J. DiTursi, Gaurav Ghosh, Petko Bogdanov

Abstract: Given a time-evolving network, how can we detect communities over periods of high internal and low external interactions? To address this question we generalize traditional local community detection in graphs to the setting of dynamic networks. Adopting existing static-network approaches in an "aggregated" graph of all temporal interactions is not appropriate for the problem as dynamic communities may be short-lived and thus lost when mixing interactions over long periods. Hence, dynamic community mining requires the detection of both the community nodes and an optimal time interval in which they are actively interacting. We propose a filter-and-verify framework for dynamic community detection. To scale to long intervals of graph evolution, we employ novel spectral bounds for dynamic community conductance to filter suboptimal periods in near-linear time. We also design a time-and-graph-aware locality sensitive hashing family to effectively spot promising community cores. Our method PHASR discovers communities of consistently higher quality (2 to 67 times better) than those of baselines. At the same time, our bounds allow for pruning between 55% and 95% of the search space, resulting in significant savings in running time compared to exhaustive alternatives for even modest time intervals of graph evolution.

Submitted 12 September, 2017; originally announced September 2017.

Comments: extended version of paper in ICDM 2017

arXiv:1709.04015 [pdf, other]

Network Clocks: Detecting the Temporal Scale of Information Diffusion

Authors: Daniel J. DiTursi, Gregorios A. Katsios, Petko Bogdanov

Abstract: Information diffusion models typically assume a discrete timeline in which an information token spreads in the network. Since users in real-world networks vary significantly in their intensity and periods of activity, our objective in this work is to answer: How to determine a temporal scale that best agrees with the observed information propagation within a network? A key limitation of existing approaches is that they aggregate the timeline into fixed-size windows, which may not fit all network nodes' activity periods. We propose the notion of a heterogeneous network clock: a mapping of events to discrete timestamps that best explains their occurrence according to a given cascade propagation model. We focus on the widely-adopted independent cascade (IC) model and formalize the optimal clock as the one that maximizes the likelihood of all observed cascades. The single optimal clock (OC) problem can be solved exactly in polynomial time. However, we prove that learning multiple optimal clocks (kOC), corresponding to temporal patterns of groups of network nodes, is NP-hard. We propose scalable solutions that run in almost linear time in the total number of cascade activations and discuss approximation guarantees for each variant. Our algorithms and their detected clocks enable improved cascade size classification (up to 8 percent F1 lift) and improved missing cascade data inference (0.15 better recall). We also demonstrate that the network clocks exhibit consistency within the type of content diffusing in the network and are robust with respect to the propagation probability parameters of the IC model.

Submitted 12 September, 2017; originally announced September 2017.

Comments: extended version of paper from ICDM 2017

arXiv:1609.08228 [pdf, other]

Towards Scalable Network Delay Minimization

Authors: Sourav Medya, Petko Bogdanov, Ambuj Singh

Abstract: Reduction of end-to-end network delays is an optimization task with applications in multiple domains. Low delays enable improved information flow in social networks, quick spread of ideas in collaboration networks, low travel times for vehicles on road networks and increased rate of packets in the case of communication networks. Delay reduction can be achieved by both improving the propagation capabilities of individual nodes and adding edges in the network. One of the main challenges in such design problems is that the effects of local changes are not independent, and as a consequence, there is a combinatorial search-space of possible improvements. Thus, minimizing the cumulative propagation delay requires novel scalable and data-driven approaches. In this paper, we consider the problem of network delay minimization via node upgrades. Although the problem is NP-hard, we show that probabilistic approximation for a restricted version can be obtained. We design scalable and high-quality techniques for the general setting based on sampling and targeted to different models of delay distribution. Our methods scale almost linearly with the graph size and consistently outperform competitors in quality.

Submitted 26 September, 2016; originally announced September 2016.

arXiv:1512.06173 [pdf, ps, other]

Discriminative Subnetworks with Regularized Spectral Learning for Global-state Network Data

Authors: Xuan Hong Dang, Ambuj K. Singh, Petko Bogdanov, Hongyuan You, Bayyuan Hsu

Abstract: Data mining practitioners are facing challenges from data with network structure. In this paper, we address a specific class of global-state networks, which comprises a set of network instances sharing a similar structure yet having different values at local nodes. Each instance is associated with a global state which indicates the occurrence of an event. The objective is to uncover a small set of discriminative subnetworks that can optimally classify global network values. Unlike most existing studies which explore an exponential subnetwork space, we address this difficult problem by adopting a space transformation approach. Specifically, we present an algorithm that optimizes a constrained dual-objective function to learn a low-dimensional subspace that is capable of discriminating networks labelled by different global states, while reconciling with the common network topology shared across instances. Our algorithm takes an appealing approach from spectral graph learning and we show that the globally optimum solution can be achieved via matrix eigen-decomposition.

Submitted 18 December, 2015; originally announced December 2015.

Comments: manuscript for the ECML 2014 paper

arXiv:1510.05058 [pdf, other]

doi 10.1109/ICDE.2017.64

A Distance Measure for the Analysis of Polar Opinion Dynamics in Social Networks

Authors: Victor Amelkin, Ambuj Singh, Petko Bogdanov

Abstract: Analysis of opinion dynamics in social networks plays an important role in today's life. For applications such as predicting users' political preference, it is particularly important to be able to analyze the dynamics of competing opinions. While observing the evolution of polar opinions of a social network's users over time, can we tell when the network "behaved" abnormally? Furthermore, can we predict how the opinions of the users will change in the future? Do opinions evolve according to existing network opinion dynamics models? To answer such questions, it is not sufficient to study individual user behavior, since opinions can spread far beyond users' egonets. We need a method to analyze opinion dynamics of all network users simultaneously and capture the effect of individuals' behavior on the global evolution pattern of the social network. In this work, we introduce Social Network Distance (SND), a distance measure that quantifies the "cost" of evolution of one snapshot of a social network into another snapshot under various models of polar opinion propagation. SND has the rich semantics of a transportation problem, yet is computable in time linear in the number of users, which makes SND applicable to the analysis of large-scale online social networks. In our experiments with synthetic and real-world Twitter data, we demonstrate the utility of our distance measure for anomalous event detection. It achieves a true positive rate of 0.83, twice as high as that of alternatives. When employed for opinion prediction in Twitter, our method's accuracy is 75.63%, which is 7.5% higher than that of the next best method. Source Code: https://cs.ucsb.edu/~victor/pub/ucsb/dbl/snd/

Submitted 16 October, 2015; originally announced October 2015.

ACM Class: G.2.2; H.2.8; I.5.3

arXiv:1307.0309 [pdf, ps, other]

doi 10.1145/2492517.2492621

The Social Media Genome: Modeling Individual Topic-Specific Behavior in Social Media

Authors: Petko Bogdanov, Michael Busch, Jeff Moehli, Ambuj K. Singh, Boleslaw K. Szymanski

Abstract: Information propagation in social media depends not only on the static follower structure but also on the topic-specific user behavior. Hence novel models incorporating dynamic user behavior are needed. To this end, we propose a model for individual social media users, termed a genotype. The genotype is a per-topic summary of a user's interest, activity and susceptibility to adopt new information. We demonstrate that user genotypes remain invariant within a topic by adopting them for classification of new information spread in large-scale real networks. Furthermore, we extract topic-specific influence backbone structures based on information adoption and show that they differ significantly from the static follower network. When employed for influence prediction of new content spread, our genotype model and influence backbones enable more than 20% improvement, compared to purely structural features. We also demonstrate that knowledge of user genotypes and influence backbones allows for the design of effective strategies for latency minimization of topic-specific information spread.

Submitted 1 July, 2013; originally announced July 2013.

Comments: ASONAM 2013, 7 pages

Journal ref: Proc. 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining ASONAM, Niagara Falls, Canada, August 25-28, 2013, pp. 236-242

Showing 1–16 of 16 results for author: Bogdanov, P