3 files

Multilingual datasets for Main content extraction from web pages

Total download size: 5.21 GB. This item is shared privately.
dataset
modified on 2022-05-26, 00:50

Please read the detailed description before using this dataset: https://github.com/dreamwayjgs/main-content-extraction-assessment-framework#demo-datasets


This dataset supports research on main content extraction from web pages. It is distributed as an archived MongoDB file and a PostgreSQL dump file.


This dataset contains crawled MHTML files of web pages in nine languages, including Korean, Japanese, Indonesian, French, Russian, Arabic (Saudi Arabia), and Chinese.
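
Because MHTML is a MIME multipart format, a crawled page can usually be opened with a standard MIME parser once its raw bytes have been retrieved from the database. The following is a minimal sketch in Python, assuming the MHTML content is already available as a `bytes` object; the function name `extract_html_from_mhtml` is hypothetical, and how the pages are actually stored in the MongoDB and PostgreSQL files is described in the linked repository, not here.

```python
import email
from email import policy


def extract_html_from_mhtml(raw: bytes) -> str:
    """Return the first text/html part of an MHTML document.

    Hypothetical helper: assumes `raw` holds one complete MHTML file.
    """
    # MHTML is a MIME multipart/related message, so the standard
    # library email parser can split it into its parts.
    message = email.message_from_bytes(raw, policy=policy.default)

    # walk() visits the container and every nested part in order;
    # the first text/html part is normally the page itself.
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding and decodes
            # the text with the charset declared on the part.
            return part.get_content()

    raise ValueError("no text/html part found in the MHTML data")
```

The returned string is the page's HTML source, which can then be passed to an HTML parser or to a content extraction algorithm. Images, stylesheets, and other embedded resources remain available as the other parts of the parsed message.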


Related Resources:

- Main Content Extraction Framework: https://github.com/dreamwayjgs/main-content-extraction-assessment-framework

- GCE Algorithm (on the above framework): https://gitlab.com/dreamwayjgs/main-content-extractor-v2