Causal Discovery using Compression-Complexity Measures

SY, Pranay; Nagaraj, Nithin

doi:10.1016/j.jbi.2021.103724

arXiv:2010.09336 (cs)

COVID-19 e-print

Important: e-prints posted on arXiv are not peer-reviewed by arXiv; they should not be relied upon without context to guide clinical practice or health-related behavior and should not be reported in news media as established information without consulting multiple experts in the field.

[Submitted on 19 Oct 2020 (v1), last revised 17 Mar 2021 (this version, v3)]

Title: Causal Discovery using Compression-Complexity Measures

Authors: Pranay SY, Nithin Nagaraj

Abstract: Causal inference is one of the most fundamental problems across all domains of science. We address the problem of inferring a causal direction from two observed discrete symbolic sequences $X$ and $Y$. We present a framework that relies on lossless compressors to infer context-free grammars (CFGs) from sequence pairs and quantifies the extent to which the grammar inferred from one sequence compresses the other sequence. We infer that $X$ causes $Y$ if the grammar inferred from $X$ compresses $Y$ better than the grammar inferred from $Y$ compresses $X$. To put this notion into practice, we propose three models that use two Compression-Complexity Measures (CCMs), Lempel-Ziv (LZ) complexity and Effort-To-Compress (ETC), to infer CFGs and discover causal directions without requiring temporal structure. We evaluate these models on synthetic and real-world benchmarks and empirically observe performance competitive with current state-of-the-art methods. Lastly, we present two applications of the proposed models for causal inference directly from pairs of genome sequences belonging to the SARS-CoV-2 virus. Using a large number of sequences, we show that our models capture directed causal information exchange between sequence pairs, presenting opportunities for future applications to address key issues such as contact tracing, motif discovery, and the evolution of virulence and pathogenicity.
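
The decision rule in the abstract (infer that $X$ causes $Y$ if the grammar learned from $X$ compresses $Y$ better than the reverse) can be illustrated with a minimal sketch. This is not one of the three models proposed in the paper. It assumes an LZ78-style phrase dictionary as a crude stand-in for an inferred grammar, and the function names `lz78_phrases`, `cross_parse_cost` and `infer_direction` are hypothetical, introduced only for illustration.

```python
def lz78_phrases(seq):
    """Return the set of phrases produced by an LZ78-style incremental parse."""
    phrases = set()
    current = ""
    for symbol in seq:
        current += symbol
        if current not in phrases:
            # A new phrase ends here; start collecting the next one.
            phrases.add(current)
            current = ""
    return phrases


def cross_parse_cost(source, target):
    """Count the chunks needed to cover `target` with phrases learned from `source`.

    Assumption: greedy longest match stands in for "compressing with the
    other sequence's grammar". A position with no matching phrase of
    length >= 2 is consumed as a single symbol and costs one chunk.
    """
    phrases = lz78_phrases(source)
    max_len = max((len(p) for p in phrases), default=1)
    cost, i = 0, 0
    while i < len(target):
        step = 1
        for length in range(min(max_len, len(target) - i), 1, -1):
            if target[i:i + length] in phrases:
                step = length
                break
        i += step
        cost += 1
    return cost


def infer_direction(x, y):
    """Apply the abstract's rule: the better cross-compressor is the cause."""
    cost_xy = cross_parse_cost(x, y)  # phrases from x used to cover y
    cost_yx = cross_parse_cost(y, x)  # phrases from y used to cover x
    if cost_xy < cost_yx:
        return "X -> Y"
    if cost_yx < cost_xy:
        return "Y -> X"
    return "undecided"


if __name__ == "__main__":
    print(infer_direction("aababc", "abcabc"))
```

The raw chunk counts are only comparable when the two sequences have similar lengths; a practical variant would presumably normalize the costs, and that choice is left open here.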

Comments:	Accepted version with major revisions to results and discussion. 17 pages, 9 figures
Subjects:	Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
MSC classes:	62D20 (Primary) 68P30, 68Q30, 94A17 (Secondary)
Cite as:	arXiv:2010.09336 [cs.LG]
	(or arXiv:2010.09336v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2010.09336
Journal reference:	Pranay SY, & Nagaraj, N. (2021). Causal discovery using compression-complexity measures. Journal of Biomedical Informatics, 103724
Related DOI:	https://doi.org/10.1016/j.jbi.2021.103724

Submission history

From: Pranay Yadav
[v1] Mon, 19 Oct 2020 09:19:56 UTC (319 KB)
[v2] Thu, 22 Oct 2020 11:46:08 UTC (319 KB)
[v3] Wed, 17 Mar 2021 10:45:26 UTC (400 KB)
