Multilingual datasets for Main content extraction from web pages

dataset

modified on 2022-05-26, 00:50

Please read the detailed description before using this dataset: https://github.com/dreamwayjgs/main-content-extraction-assessment-framework#demo-datasets

This dataset is intended for research on main content extraction from web pages. It is distributed as an archived MongoDB file and a PostgreSQL dump file.

The dataset contains crawled MHTML files of web pages in nine languages, including Korean, Japanese, Indonesian, French, Russian, Arabic (Saudi Arabia), and Chinese.
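
Because MHTML is a MIME multipart/related container, a page's HTML can be recovered with Python's standard `email` package. The following is a minimal sketch, not part of the dataset's own tooling: the function name `extract_html_from_mhtml` and the hand-written `SAMPLE` document are hypothetical, and how the MHTML bytes are stored inside the MongoDB archive or the PostgreSQL dump is not described here, so loading them is left to the linked description.

```python
import email
from email import policy


def extract_html_from_mhtml(raw: bytes) -> str:
    """Return the first text/html part of an MHTML (MIME multipart/related) document."""
    message = email.message_from_bytes(raw, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding and decodes the charset.
            return part.get_content()
    raise ValueError("no text/html part found")


# Hypothetical, hand-written MHTML document used only for illustration.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; type="text/html"; '
    b'boundary="----SampleBoundary"\r\n'
    b"\r\n"
    b"------SampleBoundary\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"Content-Location: https://example.com/\r\n"
    b"\r\n"
    b"<html><body><p>Hello</p></body></html>\r\n"
    b"------SampleBoundary--\r\n"
)

if __name__ == "__main__":
    print(extract_html_from_mhtml(SAMPLE))
```

Real crawled files usually carry further parts (images, stylesheets) after the HTML part; this sketch returns only the first `text/html` part and ignores the rest.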

Related Resources:

- Main Content Extraction Framework: https://github.com/dreamwayjgs/main-content-extraction-assessment-framework

- GCE Algorithm (on the above framework): https://gitlab.com/dreamwayjgs/main-content-extractor-v2

Keywords

Web Technologies (excl. Web Search)

Information Retrieval and Web Search

Licence

CC BY 4.0
