SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation

Wang, Xin; Wang, Yasheng; Mi, Fei; Zhou, Pingyi; Wan, Yao; Liu, Xiao; Li, Li; Wu, Hao; Liu, Jin; Jiang, Xin

Computer Science > Computation and Language

arXiv:2108.04556 (cs)

[Submitted on 10 Aug 2021 (v1), last revised 9 Sep 2021 (this version, v3)]

Abstract: Code representation learning, which aims to encode the semantics of source code into distributed vectors, plays an important role in recent deep-learning-based models for code intelligence. Recently, many pre-trained language models for source code (e.g., CuBERT and CodeBERT) have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code search, code clone detection, and program translation. Current approaches typically treat source code as a plain sequence of tokens, or inject structural information (e.g., AST and data flow) into sequential model pre-training. To further explore the properties of programming languages, this paper proposes SynCoBERT, a syntax-guided multi-modal contrastive pre-training approach for better code representations. Specifically, we design two novel pre-training objectives originating from the symbolic and syntactic properties of source code, i.e., Identifier Prediction (IP) and AST Edge Prediction (TEP), which predict identifiers and the edges between pairs of AST nodes, respectively. Meanwhile, to exploit the complementary information in the semantically equivalent modalities of code (i.e., code, comment, AST), we propose a multi-modal contrastive learning strategy that maximizes the mutual information among different modalities. Extensive experiments on four downstream tasks related to code intelligence show that SynCoBERT advances the state of the art with the same pre-training corpus and model size.
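To make the two objectives named in the abstract more concrete, the sketch below shows one plausible way to derive supervision targets for them from a code snippet. It is an illustration under stated assumptions, not the authors' pipeline: it uses Python's standard `tokenize` and `ast` modules as the parser, and the helper names `identifier_labels`, `ast_edges`, and `edge_label` are hypothetical. The abstract does not specify how tokens are tagged or how node pairs are sampled, so those choices here are assumptions.

```python
import ast
import io
import keyword
import tokenize

# Layout-only and comment tokens that carry no identifier/non-identifier signal.
_SKIPPED = (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
            tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT)


def identifier_labels(source):
    """IP-style targets (assumed form): 1 if a token is an identifier, else 0."""
    pairs = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        # A NAME token that is not a reserved keyword is treated as an identifier.
        is_identifier = tok.type == tokenize.NAME and not keyword.iskeyword(tok.string)
        pairs.append((tok.string, int(is_identifier)))
    return pairs


def ast_edges(source):
    """Number AST nodes in pre-order and collect (parent, child) index pairs."""
    labels, edges = [], set()

    def visit(node, parent_index):
        index = len(labels)
        labels.append(type(node).__name__)
        if parent_index is not None:
            edges.add((parent_index, index))
        for child in ast.iter_child_nodes(node):
            visit(child, index)

    visit(ast.parse(source), None)
    return labels, edges


def edge_label(edges, i, j):
    """TEP-style target (assumed form): 1 if nodes i and j share a tree edge."""
    return int((i, j) in edges or (j, i) in edges)


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b\n"
    print(identifier_labels(snippet))
    node_types, tree_edges = ast_edges(snippet)
    print(node_types)
    print(sorted(tree_edges))
```

In a pre-training setup of the kind the abstract describes, a model would be trained to predict the per-token labels from token representations and the pairwise edge labels from pairs of node representations. The model and loss are omitted here because the abstract does not describe them.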

Comments:	9 pages, 3 figures, 5 tables
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Cite as:	arXiv:2108.04556 [cs.CL]
	(or arXiv:2108.04556v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2108.04556

Submission history

From: Xin Wang
[v1] Tue, 10 Aug 2021 10:08:21 UTC (148 KB)
[v2] Mon, 23 Aug 2021 07:52:14 UTC (127 KB)
[v3] Thu, 9 Sep 2021 12:45:58 UTC (167 KB)
