Vision Based Page Segmentation: Extended and Improved Algorithm

Web pages consist of different segments that serve different purposes. The most common segment types are the header, the left or right columns, and the main content. These segments may in turn contain several subparts; for example, a news web page may contain more than one article. To detect the different segments of a web page, we first need to construct its block structure, and visual cues are very useful in this process. Although it is one of the most popular algorithms for this purpose, the Vision-based Page Segmentation algorithm (VIPS algorithm) needs improvement in its most important part, visual block extraction. We defined some additional terms and identified further visual cues to extend the visual block extraction part. In this technical report, the deficiencies of the VIPS algorithm are explained and new rules are defined. Moreover, our implementation of the VIPS algorithm is introduced.