Showing 1–2 of 2 results for author: Goldin, G

Search v0.5.6 released 2020-02-24

arXiv:2407.20581 [pdf, ps, other]

cs.CL

Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings

Authors: Gili Goldin, Shuly Wintner

Abstract: We present Knesset-DictaBERT, a large Hebrew language model fine-tuned on the Knesset Corpus, which comprises Israeli parliamentary proceedings. The model is based on the DictaBERT architecture and demonstrates significant improvements in understanding parliamentary language according to the MLM task. We provide a detailed evaluation of the model's performance, showing improvements in perplexity and accuracy over the baseline DictaBERT model.

Submitted 30 July, 2024; originally announced July 2024.

Comments: 3 pages, 1 table

MSC Class: 68T50

arXiv:2405.18115 [pdf, other]

cs.CL

DOI: 10.1007/s10579-025-09833-4

The Knesset Corpus: An Annotated Corpus of Hebrew Parliamentary Proceedings

Authors: Gili Goldin, Nick Howell, Noam Ordan, Ella Rabinovich, Shuly Wintner

Abstract: We present the Knesset Corpus, a corpus of Hebrew parliamentary proceedings containing over 30 million sentences (over 384 million tokens) from all the (plenary and committee) protocols held in the Israeli parliament between 1998 and 2022. Sentences are annotated with morpho-syntactic information and are associated with detailed meta-information reflecting demographic and political properties of the speakers, based on a large database of parliament members and factions that we compiled. We discuss the structure and composition of the corpus and the various processing steps we applied to it. To demonstrate the utility of this novel dataset we present two use cases. We show that the corpus can be used to examine historical developments in the style of political discussions by showing a reduction in lexical richness in the proceedings over time. We also investigate some differences between the styles of men and women speakers. These use cases exemplify the potential of the corpus to shed light on important trends in the Israeli society, supporting research in linguistics, political science, communication, law, etc.

Submitted 28 May, 2024; originally announced May 2024.

Comments: 28 pages, 7 figures

MSC Class: 68T50; ACM Class: I.2.7
