UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity

Fu, Yicheng; Anantha, Raviteja; Vashisht, Prabal; Cheng, Jianpeng; Littwin, Etai

arXiv:2409.04081 (cs)

[Submitted on 6 Sep 2024 (v1), last revised 2 Oct 2024 (this version, v3)]

Abstract: Generating user intent from a sequence of user interface (UI) actions is a core challenge in comprehensive UI understanding. Recent advancements in multimodal large language models (MLLMs) have led to substantial progress in this area, but their large parameter counts, heavy compute requirements, and high latency make them impractical for scenarios requiring lightweight, on-device solutions with low latency or heightened privacy. Additionally, the lack of high-quality datasets has hindered the development of such lightweight models. To address these challenges, we propose UI-JEPA, a novel framework that employs masking strategies to learn abstract UI embeddings from unlabeled data through self-supervised learning, combined with an LLM decoder fine-tuned for user intent prediction. We also introduce two new UI-grounded multimodal datasets, "Intent in the Wild" (IIW) and "Intent in the Tame" (IIT), designed for few-shot and zero-shot UI understanding tasks. IIW consists of 1.7K videos across 219 intent categories, while IIT contains 914 videos across 10 categories. We establish the first baselines for these datasets, showing that representations learned using a JEPA-style objective, combined with an LLM decoder, can achieve user intent predictions that match the performance of state-of-the-art large MLLMs, but with significantly reduced annotation and deployment resources. Measured by intent similarity scores, UI-JEPA outperforms GPT-4 Turbo and Claude 3.5 Sonnet by 10.0% and 7.2% respectively, averaged across the two datasets. Notably, UI-JEPA achieves this performance with a 50.5x reduction in computational cost and a 6.6x improvement in latency on the IIW dataset. These results underscore the effectiveness of UI-JEPA, highlighting its potential for lightweight, high-performance UI understanding.
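
To make the phrase "JEPA-style objective" concrete, the following is a minimal toy sketch of masked prediction in embedding space. It is not the authors' implementation: the linear encoders, the mean-pooled context, the single shared prediction for all masked patches, and every name (`linear`, `jepa_style_loss`, `ctx_w`, `tgt_w`, `pred_w`) are assumptions made for illustration. The abstract does not specify the architecture or the loss.

```python
def linear(weights, vec):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def jepa_style_loss(patches, mask, ctx_w, tgt_w, pred_w):
    """Toy JEPA-style loss: predict embeddings of masked patches from visible ones.

    patches: list of feature vectors, one per patch (hypothetical input format).
    mask:    list of bools, True where a patch is hidden from the context encoder.
    The loss is computed between embeddings, not between raw inputs.
    """
    visible = [linear(ctx_w, p) for p, m in zip(patches, mask) if not m]
    targets = [linear(tgt_w, p) for p, m in zip(patches, mask) if m]
    if not visible or not targets:
        raise ValueError("need at least one visible and one masked patch")

    # Summarize the visible patches by mean pooling (a simplifying assumption).
    dim = len(visible[0])
    context = [sum(v[i] for v in visible) / len(visible) for i in range(dim)]

    # The predictor maps the context summary to one guess shared by all masked
    # patches; a realistic predictor would also condition on patch position.
    prediction = linear(pred_w, context)

    # Mean squared error in embedding space.
    errors = [(a - b) ** 2 for t in targets for a, b in zip(prediction, t)]
    return sum(errors) / len(errors)


identity = [[1.0, 0.0], [0.0, 1.0]]
loss = jepa_style_loss([[0.0, 0.0], [2.0, 0.0]], [False, True],
                       identity, identity, identity)
```

In this sketch only the encoder side would be trained with unlabeled data; per the abstract, a separately fine-tuned LLM decoder then maps the learned embeddings to an intent description.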

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Cite as:	arXiv:2409.04081 [cs.CL]
	(or arXiv:2409.04081v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2409.04081

Submission history

From: Yicheng Fu
[v1] Fri, 6 Sep 2024 07:44:44 UTC (17,543 KB)
[v2] Fri, 13 Sep 2024 21:08:40 UTC (17,543 KB)
[v3] Wed, 2 Oct 2024 05:00:57 UTC (17,543 KB)
