Publication detail

Layout Based Information Extraction from HTML Documents

Original title

Layout Based Information Extraction from HTML Documents

English title

Layout Based Information Extraction from HTML Documents

Language

en

Original abstract

We propose a method of information extraction from HTML documents based on modelling the visual information in the document. A page segmentation algorithm is used for detecting the document layout and, subsequently, the extraction process is based on the analysis of mutual positions of the detected blocks and their visual features. This approach is more robust than the traditional DOM-based methods and it opens new possibilities for the extraction task specification.

English abstract

We propose a method of information extraction from HTML documents based on modelling the visual information in the document. A page segmentation algorithm is used for detecting the document layout and, subsequently, the extraction process is based on the analysis of mutual positions of the detected blocks and their visual features. This approach is more robust than the traditional DOM-based methods and it opens new possibilities for the extraction task specification.

BibTeX


@inproceedings{BUT28821,
  author="Radek {Burget}",
  title="Layout Based Information Extraction from HTML Documents",
  annote="We propose a method of information extraction from HTML documents based on
modelling the visual information in the document. A page segmentation algorithm
is used for detecting the document layout and subsequently, the extraction
process is based on the analysis of mutual positions of the detected blocks and
their visual features. This approach is more robust than the traditional
DOM-based methods and it opens new possibilities for the extraction task
specification.",
  address="IEEE Computer Society",
  booktitle="9th International Conference on Document Analysis and Recognition ICDAR 2007",
  chapter="28821",
  howpublished="print",
  institution="IEEE Computer Society",
  year="2007",
  month="september",
  pages="624--629",
  publisher="IEEE Computer Society",
  type="conference paper"
}