Comparison of Unsupervised Metrics for Evaluating Judicial Decision Extraction

Litvak, Ivan Leonidovich; Kostin, Anton; Lashkin, Fedor; Maksiyan, Tatiana; Lagutin, Sergey

Abstract: The rapid advancement of artificial intelligence in legal natural language processing demands scalable methods for evaluating text extraction from judicial decisions. This study evaluates 16 unsupervised metrics, including novel formulations, to assess the quality of extracting seven semantic blocks from 1,000 anonymized Russian judicial decisions, validated against 7,168 expert reviews on a 1--5 Likert scale. These metrics, spanning document-based, semantic, structural, pseudo-ground truth, and legal-specific categories, operate without pre-annotated ground truth. Bootstrapped correlations, Lin's concordance correlation coefficient (CCC), and mean absolute error (MAE) reveal that Term Frequency Coherence (Pearson $r = 0.540$, Lin CCC = 0.512, MAE = 0.127) and Coverage Ratio/Block Completeness (Pearson $r = 0.513$, Lin CCC = 0.443, MAE = 0.139) best align with expert ratings, while Legal Term Density (Pearson $r = -0.479$, Lin CCC = -0.079, MAE = 0.394) shows a strong negative correlation. The LLM Evaluation Score (mean = 0.849, Pearson $r = 0.382$, Lin CCC = 0.325, MAE = 0.197) showed moderate alignment, but its performance, obtained with gpt-4.1-mini via g4f, suggests limited specialization for legal texts. These findings highlight that unsupervised metrics, including LLM-based approaches, enable scalable screening but, given their moderate correlations and low CCC values, cannot fully replace human judgment in high-stakes legal contexts. This work advances legal NLP by providing annotation-free evaluation tools, with implications for judicial analytics and ethical AI deployment.
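The three agreement statistics named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the percentile bootstrap, the rescaling of 1--5 Likert ratings to $[0, 1]$, and the toy data are all assumptions made for illustration. Lin's CCC is computed in its standard form, $2\,\mathrm{cov}(x, y) / (\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2)$, which penalizes shifts in location and scale that Pearson's $r$ ignores.

```python
import numpy as np


def likert_to_unit(ratings):
    """Map 1-5 Likert ratings onto [0, 1] (assumed rescaling, not from the paper)."""
    return (np.asarray(ratings, dtype=float) - 1.0) / 4.0


def pearson_r(x, y):
    """Pearson correlation between two equal-length score vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient (population moments)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return float(2.0 * cov / (x.var() + y.var() + (mx - my) ** 2))


def mae(x, y):
    """Mean absolute error between two equal-length score vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean(np.abs(x - y)))


def bootstrap_ci(stat, x, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a paired statistic (assumed procedure)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    values = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample pairs with replacement
        values[b] = stat(x[idx], y[idx])
    # nanquantile skips degenerate resamples where the statistic is undefined
    lo, hi = np.nanquantile(values, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(lo), float(hi)


# Toy data for illustration only; these are not values from the study.
expert = likert_to_unit([1, 2, 3, 4, 5, 3, 4, 2])
metric = np.array([0.10, 0.20, 0.50, 0.80, 0.90, 0.60, 0.70, 0.30])

print("Pearson r:", pearson_r(metric, expert))
print("Lin CCC:  ", lin_ccc(metric, expert))
print("MAE:      ", mae(metric, expert))
print("95% CI r: ", bootstrap_ci(pearson_r, metric, expert))
```

A metric that tracks expert ratings but is systematically offset from them keeps a high Pearson $r$ while its CCC drops, which is one way the gap between $r$ and CCC reported for several metrics can arise.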

Comments:	28 pages
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
ACM classes:	H.3.3; I.2.8; I.2.7
Cite as:	arXiv:2510.01792 [cs.CL]
	(or arXiv:2510.01792v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2510.01792
