Effective Blog Pages Extractor for Better UGC Accessing

Zhao, Kui; Wang, Yi; Hu, Xia; Wang, Can

doi:10.1109/ICISCE.2016.86


arXiv:1708.07935 (cs)

[Submitted on 26 Aug 2017]


Abstract: Blogs are becoming an increasingly popular medium for information publishing. Besides the main content, most blog pages nowadays also contain noisy information such as advertisements. Removing these unrelated elements not only improves the user experience but also makes it easier to adapt the content to various devices such as mobile phones. Though template-based extractors are highly accurate, they are expensive to maintain: a large number of templates must be developed, and an extractor fails once its template is updated. To address these issues, we present a novel template-independent content extractor for blog pages. First, we convert a blog page into a DOM tree, in which every element of the page, including the title and body blocks, corresponds to a subtree. Then we construct a candidate set of subtrees for the title block and for the body block, and extract both spatial and content features for the elements contained in each subtree. SVM classifiers for the title and body blocks are trained using these features. Finally, the classifiers are used to extract the main content from blog pages. We test our extractor on 2,250 blog pages crawled from nine blog sites with clearly different styles and templates. Experimental results verify the effectiveness of our extractor.
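The first steps of the pipeline described in the abstract (DOM tree, subtree candidates, per-subtree features) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not list the actual features, so the text length, link density, and tree depth used here are assumed stand-ins for content features, and the candidate tag set is hypothetical. Spatial features would require rendering the page and are omitted, as is the SVM training step that would consume these feature dictionaries.

```python
from html.parser import HTMLParser

# Assumption: elements that never have a closing tag.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
# Assumption: tags worth considering as title/body candidates.
CANDIDATE_TAGS = {"div", "article", "section", "h1", "h2"}


class Node:
    """One element of a simplified DOM tree."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""  # non-whitespace characters directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += "".join(data.split())


def iter_subtrees(node):
    """Yield every descendant of node in document order."""
    for child in node.children:
        yield child
        yield from iter_subtrees(child)


def text_length(node):
    return len(node.text) + sum(text_length(c) for c in node.children)


def link_text_length(node):
    if node.tag == "a":
        return text_length(node)
    return sum(link_text_length(c) for c in node.children)


def depth(node):
    d = 0
    while node.parent is not None:
        node = node.parent
        d += 1
    return d


def features(node):
    """Illustrative content features for one candidate subtree."""
    total = text_length(node)
    return {
        "tag": node.tag,
        "id": node.attrs.get("id"),
        "depth": depth(node),
        "text_len": total,
        "link_density": link_text_length(node) / total if total else 0.0,
    }


def candidate_features(html):
    """Parse html and return a feature dict per candidate subtree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return [features(n) for n in iter_subtrees(builder.root)
            if n.tag in CANDIDATE_TAGS]
```

Calling `candidate_features(page_html)` returns one dictionary per candidate subtree. In a setup like the one the abstract describes, labeled vectors of this kind would be used to train one classifier for title blocks and another for body blocks; a navigation block made up entirely of links, for example, would show a link density of 1.0 and could be separated from the body on that basis.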

Comments:	2016 3rd International Conference on Information Science and Control Engineering (ICISCE)
Subjects:	Information Retrieval (cs.IR)
Cite as:	arXiv:1708.07935 [cs.IR]
	(or arXiv:1708.07935v1 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.1708.07935
Related DOI:	https://doi.org/10.1109/ICISCE.2016.86

Submission history

From: Kui Zhao
[v1] Sat, 26 Aug 2017 05:56:32 UTC (673 KB)
