Filtered Corpus Training (FiCT) Shows that Language Models can Generalize from Indirect Evidence

Patil, Abhinav; Jumelet, Jaap; Chiu, Yu Ying; Lapastora, Andy; Shen, Peter; Wang, Lexie; Willrich, Clevis; Steinert-Threlkeld, Shane

arXiv:2405.15750 (cs)

[Submitted on 24 May 2024 (v1), last revised 6 Aug 2024 (this version, v2)]

Abstract: This paper introduces Filtered Corpus Training, a method that trains language models (LMs) on corpora with certain linguistic constructions filtered out from the training data, and uses it to measure the ability of LMs to perform linguistic generalization on the basis of indirect evidence. We apply the method to both LSTM and Transformer LMs (of roughly comparable size), developing filtered corpora that target a wide range of linguistic phenomena. Our results show that while transformers are better qua LMs (as measured by perplexity), both models perform equally and surprisingly well on linguistic generalization measures, suggesting that they are capable of generalizing from indirect evidence.
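
The filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how constructions are detected, so the regex below (`TOY_PATTERN`, which flags sentences containing the token "ever") and the helper names `contains_target` and `filter_corpus` are hypothetical stand-ins chosen only for illustration.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical stand-in for a linguistic filter: it flags any sentence that
# contains the token "ever". A real filter for a construction would likely
# need richer linguistic analysis than a regex.
TOY_PATTERN = re.compile(r"\bever\b", re.IGNORECASE)


def contains_target(sentence: str) -> bool:
    """Return True if the sentence contains the (toy) targeted construction."""
    return TOY_PATTERN.search(sentence) is not None


def filter_corpus(
    sentences: Iterable[str], is_target: Callable[[str], bool]
) -> List[str]:
    """Keep only the sentences that do NOT contain the targeted construction."""
    return [s for s in sentences if not is_target(s)]


corpus = [
    "Nobody has ever seen it.",
    "The cat sat on the mat.",
    "Have you ever been there?",
    "Every student left.",
]

# The filtered corpus is what a model would be trained on; the removed
# construction can then appear only at evaluation time.
filtered = filter_corpus(corpus, contains_target)
print(filtered)
```

Under these assumptions, a model trained on `filtered` never sees the targeted construction directly, so any success on test items containing it has to come from indirect evidence.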

Comments:	Forthcoming in Transactions of the Association for Computational Linguistics (TACL). This is a pre-MIT Press publication version. For code and trained models, see this http URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2405.15750 [cs.CL]
	(or arXiv:2405.15750v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2405.15750

Submission history

From: Shane Steinert-Threlkeld
[v1] Fri, 24 May 2024 17:47:20 UTC (977 KB)
[v2] Tue, 6 Aug 2024 22:29:11 UTC (464 KB)
