Extraction of Product Specifications from the Web -- Going Beyond Tables and Lists

Gangadhar, Govind Krishnan; Kulkarni, Ashish

doi:10.1145/3493700.3493713

arXiv:2201.02896 (cs)

[Submitted on 8 Jan 2022]

Abstract: E-commerce product pages on the web often present product specification data in structured tabular blocks. Extraction of these product attribute-value specifications has benefited applications such as product catalogue curation, search, and question answering. However, across different websites, a wide variety of HTML elements (such as <table>, <ul>, <div>, <span>, and <dl>) is used to render these blocks, which makes their automatic extraction a challenge. Most current research has focused on extracting product specifications from tables and lists and, therefore, suffers from low recall when applied to a large-scale extraction setting. In this paper, we present a product specification extraction approach that goes beyond tables and lists and generalizes across the diverse HTML elements used for rendering specification blocks. Using a combination of hand-coded features and deep-learned spatial and token features, we first identify the specification blocks on a product page. We then extract the product attribute-value pairs from these blocks, following an approach inspired by wrapper induction. We created a labeled dataset of product specifications extracted from 14,111 diverse specification blocks taken from a range of different product websites. Our experiments show the efficacy of our approach compared to current specification extraction models and support our claim about its applicability to large-scale product specification extraction.
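To make the markup-variety problem concrete, the following is a minimal Python sketch of how the same attribute-value data can be recovered from a <table>, a <dl>, and a <ul>. It is not the method of the paper, which relies on learned features and a wrapper-induction-inspired step. It is a hand-written heuristic, and the names `CELL_TAGS`, `CellCollector`, and `extract_pairs`, as well as the sample markup, are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags whose text is treated as one "cell" of a specification block.
# This set is an assumption for the sketch, not taken from the paper.
CELL_TAGS = {"th", "td", "dt", "dd", "li"}


class CellCollector(HTMLParser):
    """Collects the text of each cell-like element, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []       # finished cell texts
        self._buffer = None   # text parts of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag in CELL_TAGS:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in CELL_TAGS and self._buffer is not None:
            # Join the parts and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.cells.append(text)
            self._buffer = None


def extract_pairs(html):
    """Return (attribute, value) pairs found in a simple specification block."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()

    pairs = []
    pending = None  # attribute name still waiting for its value
    for cell in parser.cells:
        if pending is not None:
            # The previous cell was an attribute name; this cell is its value.
            pairs.append((pending, cell))
            pending = None
            continue
        name, sep, value = cell.partition(":")
        if sep and value.strip():
            # "Weight: 1.2 kg" style: name and value share one cell.
            pairs.append((name.strip(), value.strip()))
        else:
            # No inline value: treat the cell as an attribute name.
            pending = name.strip()
    return pairs


table_html = ("<table><tr><th>Brand</th><td>Acme</td></tr>"
              "<tr><th>Aspect ratio</th><td>16:9</td></tr></table>")
dl_html = "<dl><dt>Color</dt><dd>Black</dd></dl>"
ul_html = "<ul><li>Weight: 1.2 kg</li><li><span>Battery</span>: 10 h</li></ul>"

for block in (table_html, dl_html, ul_html):
    print(extract_pairs(block))
```

Even this toy version needs separate handling for paired cells and for inline "name: value" items, and it ignores nested cells and <div>/<span> layouts entirely, which hints at why fixed rules for tables and lists tend to miss many specification blocks.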

Comments:	9 pages, 7 figures, 9th ACM IKDD CODS and 27th COMAD
Subjects:	Information Retrieval (cs.IR); Machine Learning (cs.LG)
Cite as:	arXiv:2201.02896 [cs.IR]
	(or arXiv:2201.02896v1 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.2201.02896
Related DOI:	https://doi.org/10.1145/3493700.3493713

Submission history

From: Govind Krishnan Gangadhar
[v1] Sat, 8 Jan 2022 22:25:32 UTC (505 KB)
