Publication detail

Visual Area Classification for Article Identification in Web Documents

BURGET, R.

Original Title

Visual Area Classification for Article Identification in Web Documents

English Title

Visual Area Classification for Article Identification in Web Documents

Type

conference paper

Language

en

Original Abstract

In the World Wide Web, the news and other articles are usually published in complex HTML documents containing many types of additional information that is not explicitly marked. In this paper, we propose a visual information analysis approach to the article discovery in complex HTML documents. We use a classification approach for the identification of the important parts of the article within the page and we propose an algorithm for the detection of the article bounds within the page. Finally, we provide the results of an experimental evaluation.

English abstract

In the World Wide Web, the news and other articles are usually published in complex HTML documents containing many types of additional information that is not explicitly marked. In this paper, we propose a visual information analysis approach to the article discovery in complex HTML documents. We use a classification approach for the identification of the important parts of the article within the page and we propose an algorithm for the detection of the article bounds within the page. Finally, we provide the results of an experimental evaluation.

Keywords

article extraction, document cleaning, page segmentation, visual analysis

RIV year

2010

Released

30.08.2010

Publisher

IEEE Computer Society

Location

Bilbao

ISBN

978-0-7695-4174-7

Book

21st International Workshop on Databases and Expert Systems Applications

Edition

Not specified

Edition number

Not specified

Pages from

171

Pages to

175

Pages count

5

BibTeX


@inproceedings{BUT35628,
  author="Radek {Burget}",
  title="Visual Area Classification for Article Identification in Web Documents",
  annote="In the World Wide Web, the news and other articles are usually published in
complex HTML documents containing many types of additional information that is
not explicitly marked. In this paper, we propose a visual information analysis
approach to the article discovery in complex HTML documents. We use
a classification approach for the identification of the important parts of the
article within the page and we propose an algorithm for the detection of the
article bounds within the page. Finally, we provide the results of an
experimental evaluation.",
  address="Bilbao",
  booktitle="21st International Workshop on Databases and Expert Systems Applications",
  chapter="35628",
  edition="Not specified",
  howpublished="print",
  institution="IEEE Computer Society",
  year="2010",
  month="august",
  pages="171--175",
  publisher="IEEE Computer Society",
  type="conference paper"
}