Is Tokenization Needed for Masked Particle Modelling?

Leigh, Matthew; Klein, Samuel; Charton, François; Golling, Tobias; Heinrich, Lukas; Kagan, Michael; Ochoa, Inês; Osadchy, Margarita

High Energy Physics - Phenomenology

arXiv:2409.12589 (hep-ph)

[Submitted on 19 Sep 2024 (v1), last revised 1 Oct 2024 (this version, v2)]

Abstract: In this work, we significantly enhance masked particle modeling (MPM), a self-supervised learning scheme for constructing highly expressive representations of unordered sets relevant to developing foundation models for high-energy physics. In MPM, a model is trained to recover the missing elements of a set, a learning objective that requires no labels and can be applied directly to experimental data. We achieve significant performance improvements over previous work on MPM by addressing inefficiencies in the implementation and incorporating a more powerful decoder. We compare several pre-training tasks and introduce new reconstruction methods that utilize conditional generative models without data tokenization or discretization. We show that these new methods outperform the tokenized learning objective from the original MPM on a new test bed for foundation models for jets, which includes a wide variety of downstream tasks relevant to jet physics, such as classification, secondary vertex finding, and track identification.
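
As a rough illustration of the masked-set objective described in the abstract, the sketch below hides a random subset of a jet's constituents and scores a prediction only on the hidden ones. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the 40% mask fraction, the zero placeholder for hidden constituents, and the plain mean-squared-error loss are all hypothetical choices made for illustration. In particular, the regression loss is a simple stand-in and does not represent the conditional generative reconstruction methods the paper introduces.

```python
import numpy as np


def mask_particles(jet, mask_fraction=0.4, rng=None):
    """Hide a random subset of a jet's constituents (hypothetical helper).

    jet: array of shape (n_particles, n_features), treated as an unordered set.
    Returns the corrupted jet and a boolean mask (True = hidden constituent).
    """
    rng = np.random.default_rng() if rng is None else rng
    n_particles = jet.shape[0]
    # Assumed mask fraction; always hide at least one constituent.
    n_masked = max(1, int(round(mask_fraction * n_particles)))
    hidden = rng.choice(n_particles, size=n_masked, replace=False)

    mask = np.zeros(n_particles, dtype=bool)
    mask[hidden] = True

    corrupted = jet.copy()
    corrupted[mask] = 0.0  # zero placeholder standing in for a learned mask token
    return corrupted, mask


def masked_regression_loss(prediction, target, mask):
    """Mean squared error computed over the hidden constituents only."""
    diff = prediction[mask] - target[mask]
    return float(np.mean(diff ** 2))


# Usage: a toy "jet" of 10 constituents with 3 features each.
rng = np.random.default_rng(0)
jet = rng.normal(size=(10, 3))
corrupted, mask = mask_particles(jet, mask_fraction=0.4, rng=rng)

# A real model would map `corrupted` to a prediction; here the corrupted
# input itself serves as a trivial baseline prediction.
baseline_loss = masked_regression_loss(corrupted, jet, mask)
```

Because the loss is evaluated only where `mask` is true, no labels are needed: the targets are the original values of the hidden constituents.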

Subjects:	High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG)
Cite as:	arXiv:2409.12589 [hep-ph]
	(or arXiv:2409.12589v2 [hep-ph] for this version)
	https://doi.org/10.48550/arXiv.2409.12589

Submission history

From: Matthew Leigh
[v1] Thu, 19 Sep 2024 09:12:29 UTC (319 KB)
[v2] Tue, 1 Oct 2024 11:40:11 UTC (319 KB)
