Unsupervised clustering of file dialects according to monotonic decompositions of mixtures

Robinson, Michael; Altman, Tate; Lam, Denley; Li, Letitia W.

arXiv:2304.09082 (cs)

[Submitted on 9 Feb 2023]

Abstract: This paper proposes an unsupervised classification method that partitions a set of files into non-overlapping dialects based on their behaviors, as determined by the messages produced by a collection of programs that consume them. The pattern of messages serves as the signature of a particular kind of behavior, with the understanding that some messages are likely to co-occur while others are not, so patterns of messages can be used to classify files into dialects. A dialect is defined by a subset of messages, called the required messages. Once files are conditioned upon a dialect and its required messages, the remaining messages are statistically independent.
With this definition of dialect in hand, we present a greedy algorithm that deduces candidate dialects from a dataset consisting of a matrix of file-message data, demonstrate its performance on several file formats, and prove conditions under which it is optimal. We show that an analyst needs to consider fewer dialects than distinct message patterns, which reduces the analyst's cognitive load when studying a complex format.
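
To make the vocabulary above concrete, the following is a minimal illustrative sketch and not the algorithm from the paper. It assumes a toy Boolean file-message matrix (`FILES`), two hand-picked candidate sets of required messages (`CANDIDATES`), and a hypothetical helper `assign_dialects` that gives each file to the largest candidate whose required messages it all produces. All names and data here are invented for illustration; the paper's greedy procedure, its independence condition, and its optimality proof are not reproduced.

```python
# Toy file-message matrix (hypothetical data): each row is a file, each
# column is a message; 1 means some consuming program emitted that message.
FILES = {
    "f1": (1, 1, 0, 0),
    "f2": (1, 1, 1, 0),
    "f3": (1, 1, 0, 1),
    "f4": (0, 0, 1, 1),
    "f5": (0, 0, 1, 0),
}

# Hypothetical candidate dialects, each given by its required message indices.
CANDIDATES = [frozenset({0, 1}), frozenset({2})]


def assign_dialects(files, candidates):
    """Assign each file to one candidate, trying larger required sets first.

    A file matches a candidate when it produces every required message.
    Files that match no candidate are mapped to None. Because each file
    receives exactly one label, the resulting dialects do not overlap.
    """
    ordered = sorted(candidates, key=len, reverse=True)
    assignment = {}
    for name, row in files.items():
        present = {j for j, hit in enumerate(row) if hit}
        assignment[name] = next((req for req in ordered if req <= present), None)
    return assignment


if __name__ == "__main__":
    assignment = assign_dialects(FILES, CANDIDATES)
    n_patterns = len(set(FILES.values()))
    n_dialects = len(set(assignment.values()))
    print("distinct message patterns:", n_patterns)
    print("dialects in use:", n_dialects)
    for name, required in assignment.items():
        print(name, "->", sorted(required) if required is not None else None)
```

In this toy matrix every file has a different message pattern, yet the files fall into only two groups once they are labeled by required messages, which is the kind of reduction the abstract describes for an analyst.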

Subjects:	Programming Languages (cs.PL); Computation and Language (cs.CL); Information Retrieval (cs.IR)
MSC classes:	62P30 (Primary), 06A07 (Secondary)
ACM classes:	D.3.4
Cite as:	arXiv:2304.09082 [cs.PL]
	(or arXiv:2304.09082v1 [cs.PL] for this version)
	https://doi.org/10.48550/arXiv.2304.09082

Submission history

From: Michael Robinson
[v1] Thu, 9 Feb 2023 21:15:36 UTC (234 KB)
