Multilingual datasets for Main content extraction from web pages
Size: 5.21 GB
dataset
Modified on 2022-05-26, 00:50
Please read the detailed description before using this dataset: https://github.com/dreamwayjgs/main-content-extraction-assessment-framework#demo-datasets
This dataset is intended for research on main content extraction from web pages. It is distributed as an archived MongoDB file and a PostgreSQL dump file.
The dataset contains crawled MHTML files of web pages in nine languages, including Korean, Japanese, Indonesian, French, Russian, Arabic (Saudi Arabia), and Chinese.
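Because the pages are stored as MHTML (a MIME multipart container), the main HTML document can usually be recovered with Python's standard `email` package. The following is a minimal sketch, not part of the dataset's tooling: the function name `extract_html_from_mhtml` is hypothetical, and it assumes you have already exported one MHTML record from the database as raw bytes.

```python
import email
from email import policy
from typing import Optional


def extract_html_from_mhtml(raw: bytes) -> Optional[str]:
    """Return the first text/html part of an MHTML document, or None.

    Hypothetical helper: MHTML is a MIME multipart/related message, so the
    standard-library email parser can split it into its parts.
    """
    msg = email.message_from_bytes(raw, policy=policy.default)
    for part in msg.walk():
        # The first text/html part is normally the page itself; later parts
        # hold embedded resources such as images and stylesheets.
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding (for example
            # quoted-printable) and decodes using the declared charset.
            return part.get_content()
    return None
```

For a record saved to disk (the path here is an assumption), usage would look like `html = extract_html_from_mhtml(open("page.mhtml", "rb").read())`. Taking the first `text/html` part is a simplification; pages with frames may carry several HTML parts.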
Related Resources:
- Main Content Extraction Framework: https://github.com/dreamwayjgs/main-content-extraction-assessment-framework
- GCE Algorithm (on the above framework): https://gitlab.com/dreamwayjgs/main-content-extractor-v2