Product detail

Content collector and document analysis for the M-Eco project

OTRUSINA, L. SMRŽ, P. JEŘÁBEK, J. MAREK, T. RYLKO, V. SZNAPKA, J. ŠAFÁŘ, M. UHERČÍK, M.

Product type

software

Abstract

The system collects data from various sources and makes it accessible to other components of the M-Eco project. The collection focuses on three groups of data: multimedia data such as broadcast news from TV and radio, online news data from MedISys, and social media content from blogs, forums and Twitter messages.

The multimedia data is collected and transcribed by SAIL's Media Mining Indexing System (MMI), which then provides the transcriptions to MedISys via an RSS feed. Links to the original content are included in this RSS feed for later retrieval. MedISys passes these RSS feeds, together with additional annotations and the online news data that MedISys itself collects, to the document analysis component for further processing. The third source of data comprises social media content collected from MedWorm, Twitter, about 85 discussion forums and 45 blogs, written mainly in German.

Collected documents are pre-processed. Pre-processing includes filtering of irrelevant data, named entity recognition, parsing, tagging, etc. The result is a set of tagged documents that is stored in the annotated text repository and made available via web services to the indicator detection and signal generation process.
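The pre-processing flow described above (read feed items, filter irrelevant ones, tag named entities) can be illustrated with a minimal sketch. It is not the project's implementation: the sample feed, the term lists, and the function names `read_items`, `is_relevant`, `tag_entities` and `preprocess` are all hypothetical, and a plain dictionary lookup stands in for the finite state automaton and geonames.org resources named in the keywords.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample feed; the layout of the real MedISys feeds is an assumption.
SAMPLE_RSS = """<rss version="2.0"><channel>
<title>sample</title>
<item><title>Flu outbreak reported in Berlin</title><link>http://example.org/1</link></item>
<item><title>Football results from the weekend</title><link>http://example.org/2</link></item>
</channel></rss>"""

# Hypothetical term lists, used only for illustration.
RELEVANT_TERMS = {"flu", "outbreak", "fever"}
GAZETTEER = {"berlin": "LOCATION", "flu": "DISEASE"}


def read_items(rss_text):
    """Extract title and link of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def is_relevant(doc):
    """Keep a document if its title contains at least one relevant term."""
    return any(tok in RELEVANT_TERMS for tok in tokenize(doc["title"]))


def tag_entities(doc):
    """Attach (token, label) pairs found by gazetteer lookup."""
    entities = [
        (tok, GAZETTEER[tok]) for tok in tokenize(doc["title"]) if tok in GAZETTEER
    ]
    return {**doc, "entities": entities}


def preprocess(rss_text):
    """Filter irrelevant items, then tag the remaining ones."""
    return [tag_entities(d) for d in read_items(rss_text) if is_relevant(d)]


if __name__ == "__main__":
    for doc in preprocess(SAMPLE_RSS):
        print(doc["link"], doc["entities"])
```

Each tagged document keeps the link to its original content, mirroring the role the links play in the RSS feed described above.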

Keywords

named entity recognition, geonames.org, finite state automaton, Twitter, MedISys, M-Eco

Create date

14. 12. 2012

Location

https://github.com/iotrusina/M-Eco-WP3-package

Possibilities of use

A licence must always be obtained before the result can be used by another party

Licence fee

The licensor does not require a licence fee for the result

www

https://github.com/iotrusina/M-Eco-WP3-package
