Content collector and document analysis for the M-Eco project
OTRUSINA, L. SMRŽ, P. JEŘÁBEK, J. MAREK, T. RYLKO, V. SZNAPKA, J. ŠAFÁŘ, M. UHERČÍK, M.
The system collects data from various sources and makes them accessible to other components of the M-Eco project. The collection focuses on three groups of data: multimedia data such as broadcast news from TV and radio, online news data from MedISys, and social media content from blogs, forums and Twitter messages.

The multimedia data is collected and transcribed by SAIL's Media Mining Indexing System (MMI), which then provides the transcriptions to MedISys via an RSS feed. The feed includes links to the original content for later retrieval. MedISys passes these RSS feeds, together with additional annotations and the online news data that it collects, to the document analysis component for further processing. The third group of data gathered by the content collector is social media content from MedWorm, Twitter, about 85 discussion forums and 45 blogs, written mainly in German.

Collected documents are pre-processed. This step includes filtering of irrelevant data, named entity recognition, parsing, tagging, etc. The result is a set of tagged documents, which is stored in the annotated text repository and made available via web services for the indicator detection and signal generation process.
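The pre-processing step described above can be illustrated with a minimal sketch. It is not the M-Eco implementation: the `Document` class, the tiny gazetteer, the list of relevant terms and all function names are hypothetical, and a plain dictionary lookup stands in for the project's actual named entity recognition, parsing and tagging components.

```python
from dataclasses import dataclass, field

# Hypothetical gazetteer and relevance terms, for illustration only.
GAZETTEER = {"berlin": "LOCATION", "munich": "LOCATION", "influenza": "DISEASE"}
RELEVANT_TERMS = {"influenza", "fever", "outbreak"}
PUNCTUATION = ".,;:!?()\"'"


@dataclass
class Document:
    source: str                                # e.g. "twitter", "medisys"
    text: str
    tags: list = field(default_factory=list)   # (token, label) pairs


def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, lowercase."""
    tokens = (raw.strip(PUNCTUATION).lower() for raw in text.split())
    return [tok for tok in tokens if tok]


def is_relevant(doc):
    """Keep a document only if it mentions at least one relevant term."""
    return any(tok in RELEVANT_TERMS for tok in tokenize(doc.text))


def tag_entities(doc):
    """Attach (token, label) pairs for tokens found in the gazetteer."""
    doc.tags = [(tok, GAZETTEER[tok]) for tok in tokenize(doc.text) if tok in GAZETTEER]
    return doc


def preprocess(docs):
    """Filter out irrelevant documents, then tag the remaining ones."""
    return [tag_entities(doc) for doc in docs if is_relevant(doc)]


if __name__ == "__main__":
    collected = [
        Document("twitter", "Influenza outbreak reported in Berlin."),
        Document("blog", "Nice weather today."),
    ]
    for doc in preprocess(collected):
        print(doc.source, doc.tags)
```

In this sketch the filter runs before tagging, so irrelevant documents never reach the more expensive analysis steps; the tagged documents returned by `preprocess` correspond to what the abstract calls the annotated text repository contents.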
named entity recognition, geonames.org, finite state automaton, Twitter, MedISys, M-Eco
Use of the result by another party always requires obtaining a licence
The licence provider does not require a licence fee for the result