Publication detail

Two-Phase Categorization of Web Documents

BARTÍK, V.; BURGET, R.

Original Title

Two-Phase Categorization of Web Documents

Type

conference paper

Language

en

Original Abstract

The number of pages on the World Wide Web is growing constantly, and there is a need to process them efficiently and extract useful knowledge from them. Web page categorization is an important task in this area. The method proposed here takes both visual and textual information into account and consists of two phases. In the first phase, web page areas obtained by segmentation are classified based on their visual properties; in the second phase, pages are classified using the results of the first phase together with textual information. Several experiments with web pages taken from news websites are presented in the final part of the paper.
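
The two phases described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the block classes, their weights, the visual rules in `classify_block`, the smoothed TF-IDF variant and the nearest-neighbour decision are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter

# Hypothetical block classes and weights (not taken from the paper).
BLOCK_WEIGHTS = {"heading": 3.0, "main_text": 1.0, "noise": 0.0}

def classify_block(block):
    """Phase 1 (toy rules): label a segmented page area by visual properties only."""
    if block["font_size"] >= 20:
        return "heading"
    if block["link_ratio"] > 0.5:
        return "noise"  # e.g. navigation menus and link lists
    return "main_text"

def weighted_term_counts(page):
    """Term frequencies in which each occurrence counts with its block's class weight."""
    counts = Counter()
    for block in page:
        weight = BLOCK_WEIGHTS[classify_block(block)]
        if weight == 0:
            continue  # ignore blocks classified as noise
        for term in block["text"].lower().split():
            counts[term] += weight
    return counts

def tfidf_vectors(pages):
    """Phase 2, step 1: TF-IDF vectors built from the block-weighted frequencies."""
    tfs = [weighted_term_counts(page) for page in pages]
    n = len(pages)
    df = Counter(term for tf in tfs for term in tf)
    # Smoothed inverse document frequency (an assumed, commonly used variant).
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}
    return [{term: c * idf[term] for term, c in tf.items()} for tf in tfs]

def cosine(a, b):
    dot = sum(v * b.get(term, 0.0) for term, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def categorize(page, training):
    """Phase 2, step 2: return the label of the most similar labelled page."""
    vectors = tfidf_vectors([p for p, _ in training] + [page])
    query = vectors[-1]
    best = max(range(len(training)), key=lambda i: cosine(query, vectors[i]))
    return training[best][1]

# Made-up example pages; each block carries text plus two visual features.
menu = {"text": "home politics sport economy", "font_size": 10, "link_ratio": 1.0}
sport_page = [
    {"text": "Football final", "font_size": 24, "link_ratio": 0.0},
    {"text": "The team won the match", "font_size": 12, "link_ratio": 0.0},
    menu,
]
politics_page = [
    {"text": "Election results", "font_size": 24, "link_ratio": 0.0},
    {"text": "The parliament voted on the budget", "font_size": 12, "link_ratio": 0.0},
    menu,
]
query_page = [
    {"text": "Match report", "font_size": 22, "link_ratio": 0.0},
    {"text": "The team played football", "font_size": 12, "link_ratio": 0.0},
    menu,
]
training = [(sport_page, "sport"), (politics_page, "politics")]
print(categorize(query_page, training))
```

In this sketch the navigation menu is labelled as noise in the first phase, so its words do not influence the term weights used in the second phase, while heading terms count three times as much as body text.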

Keywords

Web page categorization, visual block classification, term weighting, TF-IDF, page segmentation

RIV year

2010

Released

01.11.2010

Publisher

Institute for Systems and Technologies of Information, Control and Communication

Location

Valencia

ISBN

978-989-8425-28-7

Book

Proceedings of the International Conference on Knowledge Discovery and Information Retrieval

Edition

Not specified

Edition number

Not specified

Pages from

458

Pages to

462

Pages count

5

BibTeX


@inproceedings{BUT34415,
  author="Vladimír {Bartík} and Radek {Burget}",
  title="Two-Phase Categorization of Web Documents",
  annote="The number of pages on the World Wide Web is permanently growing and there is
a need to process pages efficiently and obtain some useful knowledge from them.
Web page categorization is a very important issue in this area. The method
proposed here takes both visual and textual information into consideration. It
consists of two phases. In the first phase, web page areas obtained by
segmentation are classified based on their visual properties, and in the second
phase, pages are classified, based on information from the first phase and
textual information. Several experiments with web pages taken from news web sites
are presented in the final part of the paper.",
  address="Valencia",
  booktitle="Proceedings of the International Conference on Knowledge Discovery and Information Retrieval",
  howpublished="print",
  isbn="978-989-8425-28-7",
  year="2010",
  month="november",
  pages="458--462",
  publisher="Institute for Systems and Technologies of Information, Control and Communication",
  type="conference paper"
}