The MOOD project is at the forefront of European research on infectious disease surveillance and modelling from a data science perspective, investigating the impact of global warming on disease outbreaks and proposing innovations for building One Health systems across Europe and the world.
The table below lists all publications to which the MOOD project contributed. Use the filter to select the most relevant articles.
Menya, Edmond; Roche, Mathieu; Interdonato, Roberto; Owuor, Dickson
EpidGPT: A Combined Strategy to Discriminate Between Redundant and New Information for Epidemiological Surveillance Systems Journal Article
In: Natural Language Processing and Information Systems, pp. 439–454, 2024, ISBN: 978-3-031-70238-9.
Abstract | Links | BibTeX | Tags: Text mining
@article{Menya2024,
title = {EpidGPT: A Combined Strategy to Discriminate Between Redundant and New Information for Epidemiological Surveillance Systems},
author = {Edmond Menya and Mathieu Roche and Roberto Interdonato and Dickson Owuor},
doi = {https://doi.org/10.1007/978-3-031-70239-6_30},
isbn = {978-3-031-70238-9},
year = {2024},
date = {2024-09-20},
journal = {Natural Language Processing and Information Systems},
pages = {439–454},
abstract = {Textual documents such as online news articles have become a key source in epidemiological surveillance such as being used in the detection of new and re-emerging diseases. However, such sources suffer redundancies with the need to automate the process of identifying novel information. In this paper, we propose a framework for learning novel thematic information in epidemiological news documents. Our approach involves both extraction and classification of new, duplicate, additional and/or missing pieces of relevant information in epidemiological news documents. Firstly, we propose an initial step to solve the limited data problem where fewer gold labelled datasets exists for training text-based epidemiological surveillance systems. This initial step is built using extractive question answering technique whereby we automate the process of extracting relevant thematic features inclusive of disease and host names, location and date of reported events and reported number of cases in order to create a large silver labelled dataset. We then propose a main step where we build a novelty information classification model that is trained using our large silver labeled dataset. We then test our novelty classifier model alongside competitive ones on the challenge of detecting whether there is novel, redundant and/or missing information in a target epidemiological news article. We later carry out ablation studies on the most informative document segments in epidemiological news articles.},
keywords = {Text mining},
pubstate = {published},
tppubtype = {article}
}
Menya, Edmond; Interdonato, Roberto; Owuor, Dickson; Roche, Mathieu
Explainable epidemiological thematic features for event based disease surveillance Journal Article
In: Expert Systems with Applications, vol. 250, 2024.
Abstract | Links | BibTeX | Tags: OpenDataSet, Text mining
@article{nokey,
title = {Explainable epidemiological thematic features for event based disease surveillance},
author = {Edmond Menya and Roberto Interdonato and Dickson Owuor and Mathieu Roche},
url = {https://www.sciencedirect.com/science/article/pii/S0957417424007607?via%3Dihub},
doi = {https://doi.org/10.1016/j.eswa.2024.123894},
year = {2024},
date = {2024-09-15},
urldate = {2024-09-15},
journal = {Expert Systems with Applications},
volume = {250},
abstract = {Event based disease surveillance (EBS) systems are biosurveillance systems that have the ability to detect and alert on (re)-emerging infectious diseases by monitoring acute public or animal health event patterns from sources such as blogs, online news reports and curated expert accounts. These information rich sources, however, are largely unstructured text data requiring novel text mining techniques to achieve EBS goals such as epidemiological text classification. The main objective of this research was to improve epidemiological text classification by proposing a novel technique of enriching thematic features using a weak supervision approach. In our approach, we train and test a mixed domain language model named EpidBioELECTRA to first enrich thematic features which are then used to improve epidemiological text classification. We train EpidBioELECTRA on a large dataset which we create consisting of 70,700 annotated documents that includes 70,400 labeled thematic features. We empirically compare EpidBioELECTRA with both general purpose language models and domain specific language models in the task of epidemiological corpus classification. Our findings show that epidemiological classification systems work best with language models pre-trained using both epidemiological and biomedical corpora with a continual pre-training strategy. EpidBioELECTRA improves epidemiological document classification by 19.2 F1 score points as compared to its vanilla implementation BioELECTRA. We observe this by the comparison of BioELECTRA versus EpidBioELECTRA on our most challenging dataset PADI-Web where our approach records 92.33 precision score, 94.62 recall score and 93.46 F1 score. We also experiment the impact of increasing context length of train documents in epidemiological document classification and found out that this improves the classification task by 7.79 F1 score points as recorded by EpidBioELECTRA’s performance. We also compute Almost Stochastic Order (ASO) scores to track EpidBioELECTRA’s statistical dominance. In addition, we carry out ablation studies on our proposed thematic feature enrichment approach using explainable AI techniques. We present explanations for the most critical thematic features and how they influence epidemiological classification task. We found out that biomedical features (such as mentions of names of diseases and symptoms) are the most influential while spatio-temporal features (such as the mention of date of a given disease outbreak) are the least influential in epidemiological document classification. Our model can easily be extended to fit other domains.},
keywords = {OpenDataSet, Text mining},
pubstate = {published},
tppubtype = {article}
}
Viau, Laetitia; Azé, Jérôme; Chen, Fati; Pompidor, Pierre; Poncelet, Pascal; Raveneau, Vincent; Rodriguez, Nancy; Sallaberry, Arnaud
Epid Data Explorer: A Visualization Tool for Exploring and Comparing Spatio-Temporal Epidemiological Data Journal Article
In: Health Informatics Journal, 2024.
Abstract | Links | BibTeX | Tags: Text mining
@article{Viau2024b,
title = {Epid Data Explorer: A Visualization Tool for Exploring and Comparing Spatio-Temporal Epidemiological Data},
author = {Laetitia Viau and Jérôme Azé and Fati Chen and Pierre Pompidor and Pascal Poncelet and Vincent Raveneau and Nancy Rodriguez and Arnaud Sallaberry},
doi = {https://doi.org/10.1177/14604582241279720},
year = {2024},
date = {2024-09-03},
journal = {Health Informatics Journal},
abstract = {The analysis of large sets of spatio-temporal data is a fundamental challenge in epidemiological research. As the quantity and the complexity of such kind of data increases, automatic analysis approaches, such as statistics, data mining, machine learning, etc., can be used to extract useful information. While these approaches have proven effective, they require a priori knowledge of the information being sought, and some interesting insights into the data may be missed. To bridge this gap, information visualization offers a set of techniques for not only presenting known information, but also exploring data without having a hypothesis formulated beforehand. In this paper, we introduce Epid Data Explorer (EDE), a visualization tool that enables exploration of spatio-temporal epidemiological data},
keywords = {Text mining},
pubstate = {published},
tppubtype = {article}
}
SYED, Mehtab Alam; ARSEVSKA, Elena; ROCHE, Mathieu; TEISSEIRE, Maguelonne
GeospaCy: A tool for extraction and geographical referencing of spatial expressions in textual data Proceedings
2024.
Abstract | Links | BibTeX | Tags: Text mining
@proceedings{nokey,
title = {GeospaCy: A tool for extraction and geographical referencing of spatial expressions in textual data},
author = {Mehtab Alam SYED and Elena ARSEVSKA and Mathieu ROCHE and Maguelonne TEISSEIRE},
url = {https://aclanthology.org/2024.eacl-demo.13},
year = {2024},
date = {2024-03-01},
journal = {Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations},
abstract = {Spatial information in text enables to understand the geographical context and relationships within text for better decision-making across various domains such as disease surveillance, disaster management and other location based services. Therefore, it is crucial to understand the precise geographical context for location-sensitive applications. In response to this necessity, we introduce the GeospaCy software tool, designed for the extraction and georeferencing of spatial information present in textual data. GeospaCy fulfils two primary objectives: 1) Geoparsing, which involves extracting spatial expressions, encompassing place names and associated spatial relations within the text data, and 2) Geocoding, which facilitates the assignment of geographical coordinates to the spatial expressions extracted during the Geoparsing task. Geoparsing is evaluated with a disease news article dataset consisting of event information, whereas a qualitative evaluation of geographical coordinates (polygons/geometries) of spatial expressions is performed by end-users for Geocoding task.
},
keywords = {Text mining},
pubstate = {published},
tppubtype = {proceedings}
}
SYED, Mehtab Alam; ARSEVSKA, Elena; ROCHE, Mathieu; TEISSEIRE, Maguelonne
GeospartRE: Extraction and Geocoding of spatial relation entities in textual documents Journal Article
In: Cartography and Geographic Information Science, 2023.
Links | BibTeX | Tags: OpenDataSet, Text mining
@article{nokey,
title = {GeospartRE: Extraction and Geocoding of spatial relation entities in textual documents},
author = {Mehtab Alam SYED and Elena ARSEVSKA and Mathieu ROCHE and Maguelonne TEISSEIRE},
doi = {https://doi.org/10.1080/15230406.2023.2264753},
year = {2023},
date = {2023-11-30},
urldate = {2023-11-30},
journal = {Cartography and Geographic Information Science},
keywords = {OpenDataSet, Text mining},
pubstate = {published},
tppubtype = {article}
}
Dias, Hélder; Guimarães, Artur; Martins, Bruno; Roche, Mathieu
Unsupervised Key-Phrase Extraction from Long Texts with Multilingual Sentence Transformers Journal Article
In: pp. 141–155, 2023.
Abstract | Links | BibTeX | Tags: Text mining
@article{nokey,
title = {Unsupervised Key-Phrase Extraction from Long Texts with Multilingual Sentence Transformers},
author = {Hélder Dias and Artur Guimarães and Bruno Martins and Mathieu Roche },
url = {https://link.springer.com/chapter/10.1007/978-3-031-45275-8_10},
doi = {https://doi.org/10.1007/978-3-031-45275-8_10},
year = {2023},
date = {2023-10-08},
pages = {141–155},
abstract = {Key-phrase extraction concerns retrieving a small set of phrases that encapsulate the core concepts of an input textual document. As in other text mining tasks, current methods often rely on pre-trained neural language models. Using these models, the state-of-the-art supervised systems for key-phrase extraction require large amounts of labelled data and generalize poorly outside the training domain, while unsupervised approaches generally present a lower accuracy. This paper presents a multilingual unsupervised approach to key-phrase extraction, improving upon previous methods in several ways (e.g., using representations from pre-trained Transformer models, while supporting the processing of long documents). Experimental results on datasets covering multiple languages and domains attest to the quality of the results.},
keywords = {Text mining},
pubstate = {published},
tppubtype = {article}
}
Valentin, Sarah; Boudoua, Bahdja; Sewalk, Kara; Arınık, Nejat; Roche, Mathieu; Lancelot, Renaud; Arsevska, Elena
Dissemination of information in event-based surveillance, a case study of Avian Influenza Journal Article
In: PLoS ONE, 2023.
Abstract | Links | BibTeX | Tags: HPAI (Avian Influenza), OpenDataSet, Text mining
@article{nokey,
title = {Dissemination of information in event-based surveillance, a case study of Avian Influenza},
author = {Sarah Valentin and Bahdja Boudoua and Kara Sewalk and Nejat Arınık and Mathieu Roche and Renaud Lancelot and Elena Arsevska },
url = {https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0285341},
doi = {10.1371/journal.pone.0285341},
year = {2023},
date = {2023-09-05},
urldate = {2023-09-05},
journal = {PLoS ONE},
abstract = {Event-Based Surveillance (EBS) tools, such as HealthMap and PADI-web, monitor online news reports and other unofficial sources, with the primary aim to provide timely information to users from health agencies on disease outbreaks occurring worldwide. In this work, we describe how outbreak-related information disseminates from a primary source, via a secondary source, to a definitive aggregator, an EBS tool, during the 2018/19 avian influenza season. We analysed 337 news items from the PADI-web and 115 news articles from HealthMap EBS tools reporting avian influenza outbreaks in birds worldwide between July 2018 and June 2019. We used the sources cited in the news to trace the path of each outbreak. We built a directed network with nodes representing the sources (characterised by type, specialisation, and geographical focus) and edges representing the flow of information. We calculated the degree as a centrality measure to determine the importance of the nodes in information dissemination. We analysed the role of the sources in early detection (detection of an event before its official notification) to the World Organisation for Animal Health (WOAH) and late detection. A total of 23% and 43% of the avian influenza outbreaks detected by the PADI-web and HealthMap, respectively, were shared on time before their notification. For both tools, national and local veterinary authorities were the primary sources of early detection. The early detection component mainly relied on the dissemination of nationally acknowledged events by online news and press agencies, bypassing international reporting to the WOAH. WOAH was the major secondary source for late detection, occupying a central position between national authorities and disseminator sources, such as online news. PADI-web and HealthMap were highly complementary in terms of detected sources, explaining why 90% of the events were detected by only one of the tools.
We show that current EBS tools can provide timely outbreak-related information and priority news sources to improve digital disease surveillance.},
keywords = {HPAI (Avian Influenza), OpenDataSet, Text mining},
pubstate = {published},
tppubtype = {article}
}
Decoupes, Rémy; Roche, Mathieu; Teisseire, Maguelonne
GeoNLPlify: A spatial data augmentation enhancing text classification for crisis monitoring Journal Article
In: Intelligent Data Analysis, pp. 1-25, 2023.
Abstract | Links | BibTeX | Tags: OpenDataSet, Text mining
@article{nokey,
title = {GeoNLPlify: A spatial data augmentation enhancing text classification for crisis monitoring},
author = {Rémy Decoupes and Mathieu Roche and Maguelonne Teisseire},
url = {https://content.iospress.com/articles/intelligent-data-analysis/ida230040},
doi = {10.3233/IDA-230040},
year = {2023},
date = {2023-07-06},
urldate = {2023-07-06},
journal = {Intelligent Data Analysis},
pages = {1-25},
abstract = {Crises such as natural disasters and public health emergencies generate vast amounts of text data, making it challenging to classify the information into relevant categories. Acquiring expert-labeled data for such scenarios can be difficult, leading to limited training datasets for text classification by fine-tuning BERT-like models. Unfortunately, traditional data augmentation techniques only slightly improve F1-scores. How can data augmentation be used to obtain better results in this applied domain? In this paper, using neural network explicability methods, we aim to highlight that fine-tuned BERT-like models on crisis corpora give too much importance to spatial information to make their predictions. This overfitting of spatial information limits their ability to generalize especially when the event which occurs in a place has evolved and changed since the training dataset has been built. To reduce this bias, we propose GeoNLPlify, a novel data augmentation technique that leverages spatial information to generate new labeled data for text classification related to crises. Our approach aims to address overfitting without necessitating modifications to the underlying model architecture, distinguishing it from other prevalent methods employed to combat overfitting. Our results show that GeoNLPlify significantly improves F1-scores, demonstrating the potential of the spatial information for data augmentation for crisis-related text classification tasks. In order to evaluate the contribution of our method, GeoNLPlify is applied to three public datasets (PADI-web, CrisisNLP and SST2) and compared with classical natural language processing data augmentations.},
keywords = {OpenDataSet, Text mining},
pubstate = {published},
tppubtype = {article}
}
Arınık, Nejat; Bortel, Wim Van; Boudoua, Bahdja; Busani, Luca; Decoupes, Rémy; Interdonato, Roberto; Kafando, Rodrique; van Kleef, Esther; Roche, Mathieu; Syed, Mehtab Alam; Teisseire, Maguelonne
An annotated dataset for event-based surveillance of antimicrobial resistance Journal Article
In: Data in Brief, 2023.
Abstract | Links | BibTeX | Tags: AMR (Antimicrobial Resistance), OpenDataSet, Text mining
@article{nokey,
title = {An annotated dataset for event-based surveillance of antimicrobial resistance},
author = {Nejat Arınık and Wim Van Bortel and Bahdja Boudoua and Luca Busani and Rémy Decoupes and Roberto Interdonato and Rodrique Kafando and Esther van Kleef and Mathieu Roche and Mehtab Alam Syed and Maguelonne Teisseire
},
url = {https://www.sciencedirect.com/science/article/pii/S2352340922010733?via%3Dihub},
doi = {10.1016/j.dib.2022.108870},
year = {2023},
date = {2023-02-08},
urldate = {2023-02-08},
journal = {Data in Brief},
abstract = {This paper presents an annotated dataset used in the MOOD Antimicrobial Resistance (AMR) hackathon, hosted in Montpellier, June 2022. The collected data concerns unstructured data from news items, scientific publications and national or international reports, collected from four event-based surveillance (EBS) Systems, i.e. ProMED, PADI-web, HealthMap and MedISys. Data was annotated by relevance for epidemic intelligence (EI) purposes with the help of AMR experts and an annotation guideline. Extracted data were intended to include relevant events on the emergence and spread of AMR such as reports on AMR trends, discovery of new drug-bug resistances, or new AMR genes in human, animal or environmental reservoirs. This dataset can be used to train or evaluate classification approaches to automatically identify written text on AMR events across the different reservoirs and sectors of One Health (i.e. human, animal, food, environmental sources, such as soil and waste water) in unstructured data (e.g. news, tweets) and classify these events by relevance for EI purposes.
},
keywords = {AMR (Antimicrobial Resistance), OpenDataSet, Text mining},
pubstate = {published},
tppubtype = {article}
}
Valentin, Sarah; Arsevska, Elena; Mercier, Alizé; Falala, Sylvain; Rabatel, Julien; Lancelot, Renaud; Roche, Mathieu
PADI-web: An Event-Based Surveillance System for Detecting, Classifying and Processing Online News Conference
Human Language Technology. Challenges for Computer Science and Linguistics, vol. 12598, Springer International Publishing, 2022, ISBN: 978-3-030-66526-5.
Abstract | Links | BibTeX | Tags: ASF (African Swine Fever), HPAI (Avian Influenza), Text mining
@conference{10.1007/978-3-030-66527-2_7,
title = {PADI-web: An Event-Based Surveillance System for Detecting, Classifying and Processing Online News},
author = {Sarah Valentin and Elena Arsevska and Alizé Mercier and Sylvain Falala and Julien Rabatel and Renaud Lancelot and Mathieu Roche},
editor = {Vetulani, Zygmunt and Paroubek, Patrick and Kubis, Marek},
url = {https://link.springer.com/chapter/10.1007/978-3-030-66527-2_7},
doi = {https://doi.org/10.1007/978-3-030-66527-2_7},
isbn = {978-3-030-66526-5},
year = {2022},
date = {2022-12-31},
urldate = {2022-12-31},
booktitle = {Human Language Technology. Challenges for Computer Science and Linguistics},
volume = {12598},
pages = {87-101},
publisher = {Springer International Publishing},
abstract = {The Platform for Automated Extraction of Animal Disease Information from the Web (PADI-web) is a multilingual text mining tool for automatic detection, classification, and extraction of disease outbreak information from online news articles. PADI-web currently monitors the Web for nine animal infectious diseases and eight syndromes in five animal hosts. The classification module is based on a supervised machine learning approach to filter the relevant news with an overall accuracy of 0.94. The classification of relevant news between 5 topic categories (confirmed, suspected or unknown outbreak, preparedness and impact) obtained an overall accuracy of 0.75. In the first six months of its implementation (January--June 2016), PADI-web detected 73{%} of the outbreaks of African swine fever; 20{%} of foot-and-mouth disease; 13{%} of bluetongue, and 62{%} of highly pathogenic avian influenza. The information extraction module of PADI-web obtained F-scores of 0.80 for locations, 0.85 for dates, 0.95 for diseases, 0.95 for hosts, and 0.85 for case numbers},
keywords = {ASF (African Swine Fever), HPAI (Avian Influenza), Text mining},
pubstate = {published},
tppubtype = {conference}
}
Valentin, Sarah; Arsevska, Elena; et al.
Elaboration of a new framework for fine-grained epidemiological annotation Journal Article
In: 2022.
Abstract | Links | BibTeX | Tags: OpenDataSet, Text mining
@article{nokey,
title = {Elaboration of a new framework for fine-grained epidemiological annotation},
author = {Sarah Valentin and Elena Arsevska and others},
url = {https://www.nature.com/articles/s41597-022-01743-2},
doi = {10.1038/s41597-022-01743-2},
year = {2022},
date = {2022-10-26},
urldate = {2022-10-26},
abstract = {Event-based surveillance (EBS) gathers information from a variety of data sources, including online news articles. Unlike the data from formal reporting, the EBS data are not structured, and their interpretation can overwhelm epidemic intelligence (EI) capacities in terms of available human resources. Therefore, diverse EBS systems that automatically process (all or part of) the acquired nonstructured data from online news articles have been developed. These EBS systems (e.g., GPHIN, HealthMap, MedISys, ProMED, PADI-web) can use annotated data to improve the surveillance systems. This paper describes a framework for the annotation of epidemiological information in animal disease-related news articles. We provide annotation guidelines that are generic and applicable to both animal and zoonotic infectious diseases, regardless of the pathogen involved or its mode of transmission (e.g., vector-borne, airborne, by contact). The framework relies on the successive annotation of all the sentences from a news article. The annotator evaluates the sentences in a specific epidemiological context, corresponding to the publication date of the news article.
},
keywords = {OpenDataSet, Text mining},
pubstate = {published},
tppubtype = {article}
}
Schaeffer, Camille; Interdonato, Roberto; Lancelot, Renaud; Roche, Mathieu; Teisseire, Maguelonne
Labeled entities from social media data related to avian influenza disease Journal Article Forthcoming
In: Data in Brief, vol. 43, pp. 108317, Forthcoming, ISSN: 2352-3409.
Abstract | Links | BibTeX | Tags: HPAI (Avian Influenza), OpenDataSet, Text mining
@article{SCHAEFFER2022108317,
title = {Labeled entities from social media data related to avian influenza disease},
author = {Camille Schaeffer and Roberto Interdonato and Renaud Lancelot and Mathieu Roche and Maguelonne Teisseire},
url = {https://www.sciencedirect.com/science/article/pii/S2352340922005194},
doi = {https://doi.org/10.1016/j.dib.2022.108317},
issn = {2352-3409},
year = {2022},
date = {2022-08-01},
urldate = {2022-08-01},
journal = {Data in Brief},
volume = {43},
pages = {108317},
abstract = {This dataset is composed by spatial (e.g. location) and thematic (e.g. diseases, symptoms, virus) entities concerning avian influenza in social media (textual) data in English. It was created from three corpora: the first one includes 10 transcriptions of YouTube videos and 70 tweets manually annotated. The second corpus is composed by the same textual data but automatically annotated with Named Entity Recognition (NER) tools. These two corpora have been built to evaluate NER tools and apply them to a bigger corpus. The third corpus is composed of 100 YouTube transcriptions automatically annotated with NER tools. The aim of the annotation task is to recognize spatial information such as the names of the cities and epidemiological information such as the names of the diseases. An annotation guideline is provided in order to ensure a unified annotation and to help the annotators. This dataset can be used to train or evaluate Natural Language Processing (NLP) approaches such as specialized entity recognition.},
keywords = {HPAI (Avian Influenza), OpenDataSet, Text mining},
pubstate = {forthcoming},
tppubtype = {article}
}
Roche, Mathieu; Arsevska, Elena; Valentin, Sarah; Falala, Sylvain; Rabatel, Julien; Lancelot, Renaud
How Textual Datasets Enhance the PADI-Web Tool? Journal Article
In: SciTePress, 2022.
Abstract | Links | BibTeX | Tags: Text mining
@article{nokey,
title = {How Textual Datasets Enhance the PADI-Web Tool?},
author = {Mathieu Roche and Elena Arsevska and Sarah Valentin and Sylvain Falala and Julien Rabatel and Renaud Lancelot
},
url = {https://www.scitepress.org/Link.aspx?doi=10.5220/0011590400003318},
doi = {10.5220/0011590400003318},
year = {2022},
date = {2022-07-27},
urldate = {2022-07-27},
journal = {SciTePress},
abstract = {The ability to rapidly detect outbreaks of emerging infectious diseases is a health priority of global health agencies. In this context, event-based surveillance (EBS) systems gather outbreak-related information from heterogeneous data sources, including online news articles. EBS systems, thus, increasingly marshal text-mining methods to alleviate the amount of manual curation of the freely available text. This paper documents the use of datasets obtained through an EBS system, PADI-Web (Platform for Automated extraction of Disease Information from the web), dedicated to digital outbreak detection in animal health. This paper describes the datasets used for improving 3 important tasks related to PADI-Web, i.e., news classification, information extraction and dissemination.},
keywords = {Text mining},
pubstate = {published},
tppubtype = {article}
}
Syed, Mehtab Alam; Arsevska, Elena; Roche, Mathieu; Teisseire, Maguelonne
A Data-Driven Score Model to Assess Online News Articles in Event-Based Surveillance System Conference
Information Management and Big Data, vol. 1577, Springer International Publishing, 2022.
Abstract | BibTeX | Tags: Text mining
@conference{10.1007/978-3-031-04447-2_18,
title = {A Data-Driven Score Model to Assess Online News Articles in Event-Based Surveillance System},
author = {Mehtab Alam Syed and Elena Arsevska and Mathieu Roche and Maguelonne Teisseire},
editor = {Juan Antonio Lossio-Ventura and Eduardo Díaz and Carlos Gavidia-Calderon and Alan Demétrius Baria Valejo and Hugo Alatrista-Salas},
year = {2022},
date = {2022-04-20},
urldate = {2022-04-20},
booktitle = {Information Management and Big Data},
volume = {1577},
pages = {264-280},
publisher = {Springer International Publishing},
abstract = {Online news sources are popular resources for learning about current health situations and developing event-based surveillance (EBS) systems. However, having access to diverse information originating from multiple sources can misinform stakeholders, eventually leading to false health risks. The existing literature contains several techniques for performing data quality evaluation to minimize the effects of misleading information. However, these methods only rely on the extraction of spatiotemporal information for representing health events. To address this research gap, a score-based technique is proposed to quantify the data quality of online news articles through three assessment measures: 1) news article metadata, 2) content analysis, and 3) epidemiological entity extraction with NLP to weight the contextual information. The results are calculated using classification metrics with two evaluation approaches: 1) a strict approach and 2) a flexible approach. The obtained results show significant enhancement in the data quality by filtering irrelevant news, which can potentially reduce false alert generation in EBS systems.},
keywords = {Text mining},
pubstate = {published},
tppubtype = {conference}
}
Valentin, Sarah; Lancelot, Renaud; Roche, Mathieu
Fusion of spatiotemporal and thematic features of textual data for animal disease surveillance Journal Article
In: Information Processing in Agriculture, 2022, ISSN: 2214-3173.
Abstract | Links | BibTeX | Tags: Text mining
@article{VALENTIN2022,
title = {Fusion of spatiotemporal and thematic features of textual data for animal disease surveillance},
author = {Sarah Valentin and Renaud Lancelot and Mathieu Roche},
url = {https://www.sciencedirect.com/science/article/pii/S2214317322000312},
doi = {https://doi.org/10.1016/j.inpa.2022.03.004},
issn = {2214-3173},
year = {2022},
date = {2022-03-28},
journal = {Information Processing in Agriculture},
abstract = {Several internet-based surveillance systems have been created to monitor the web for animal health surveillance. These systems collect a large amount of news dealing with outbreaks related to animal diseases. Automatically identifying news articles that describe the same outbreak event is a key step to quickly detect relevant epidemiological information while alleviating manual curation of news content. This paper addresses the task of retrieving news articles that are related in epidemiological terms. We tackle this issue using text mining and feature fusion methods. The main objective of this paper is to identify a textual representation in which two articles that share the same epidemiological content are close. We compared two types of representations (i.e., features) to represent the documents: (i) morphosyntactic features (i.e., selection and transformation of all terms from the news, based on classical textual processing steps) and (ii) lexicosemantic features (i.e., selection, transformation and fusion of epidemiological terms including diseases, hosts, locations and dates). We compared two types of term weighting (i.e., Boolean and TF-IDF) for both representations. To combine and transform lexicosemantic features, we compared two data fusion techniques (i.e., early fusion and late fusion) and the effect of features generalisation, while evaluating the relative importance of each type of feature. We conducted our analysis using a corpus composed of a subset of news articles in English related to animal disease outbreaks. Our results showed that the combination of relevant lexicosemantic (epidemiological) features using fusion methods improves classical morphosyntactic representation in the context of disease-related news retrieval.
The lexicosemantic representation based on TF-IDF and feature generalisation (F-measure = 0.92, r-precision = 0.58) outperformed the morphosyntactic representation (F-measure = 0.89, r-precision = 0.45), while reducing the features space. Converting the features into lower granular features (i.e., generalisation) contributed to improving the results of the lexicosemantic representation. Our results showed no difference between the early and late fusion approaches. Temporal features performed poorly on their own. Conversely, spatial features were the most discriminative features, highlighting the need for robust methods for spatial entity extraction, disambiguation and representation in internet-based surveillance systems.},
keywords = {Text mining},
pubstate = {published},
tppubtype = {article}
}
Roche, Mathieu; Teisseire, Maguelonne
Integrating Textual Data into Heterogeneous Data Ingestion Processing Conference
2021 IEEE International Conference on Big Data (Big Data), IEEE, Orlando, FL, USA, 2022, ISBN: 978-1-6654-3902-2.
@conference{9671759,
title = {Integrating Textual Data into Heterogeneous Data Ingestion Processing},
author = {Mathieu Roche and Maguelonne Teisseire},
url = {https://ieeexplore.ieee.org/document/9671759},
doi = {10.1109/BigData52589.2021.9671759},
isbn = {978-1-6654-3902-2},
year = {2022},
date = {2022-01-13},
urldate = {2022-01-13},
booktitle = {2021 IEEE International Conference on Big Data (Big Data)},
pages = {6008-6010},
publisher = {IEEE},
address = {Orlando, FL, USA},
abstract = {In this abstract, two methods for integrating textual data and textual features into ingestion processing are summarized. The first method involves integrating all features, including textual features, into dedicated frameworks, such as by using machine learning techniques. In the second method, text and textual features, such as keywords, are used to explain results returned by heterogeneous data mining. In this context, it is necessary to link data (e.g., databases, images, etc.) and/or obtained results with textual data (e.g., documents and keywords).},
keywords = {Text mining},
pubstate = {published},
tppubtype = {conference}
}
Syed, Mehtab; Arsevska, Elena; Roche, Mathieu; Teisseire, Maguelonne
Feature Selection for Sentiment Classification of COVID-19 Tweets: H-TFIDF Featuring BERT Proceedings Article
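In: Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies - HEALTHINF, pp. 648-656, SciTePress, 2022, ISBN: 978-989-758-552-4.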
@inproceedings{healthinf22,
title = {Feature Selection for Sentiment Classification of COVID-19 Tweets: H-TFIDF Featuring BERT},
author = {Mehtab Syed and Elena Arsevska and Mathieu Roche and Maguelonne Teisseire},
publisher = {SciTePress},
url = {https://www.scitepress.org/Link.aspx?doi=10.5220/0010887800003123},
doi = {10.5220/0010887800003123},
isbn = {978-989-758-552-4},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
booktitle = {Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies - HEALTHINF},
pages = {648-656},
abstract = {In the first quarter of 2020, the World Health Organization (WHO) declared COVID-19 a public health emergency around the globe. Different users from all over the world shared their opinions about COVID-19 on social media platforms such as Twitter and Facebook. At the beginning of the pandemic, it became relevant to assess public opinions regarding COVID-19 using data available on social media. We used a recently proposed hierarchy-based measure for tweet analysis (H-TFIDF) for feature extraction over sentiment classification of tweets. We assessed how H-TFIDF and concatenation of H-TFIDF with bidirectional encoder representations from transformers (BH-TFIDF) perform over state-of-the-art bag-of-words (BOW) and term frequency-inverse document frequency (TF-IDF) features for sentiment classification of COVID-19 tweets. A uniform experimental setup of the training-test (90% and 10%) split scheme was used to train the classifier. Moreover, evaluation was performed with the gold standard expert labeled dataset to measure precision for each binary classified class.},
keywords = {Covid-19 (Coronavirus), OpenDataSet, Text mining},
pubstate = {published},
tppubtype = {inproceedings}
}
Syed, Mehtab Alam; Decoupes, Remy; Arsevska, Elena; Roche, Mathieu; Teisseire, Maguelonne
Spatial opinion mining from COVID-19 twitter data Journal Article
In: International Journal of Infectious Diseases, vol. 116, iss. 549, pp. 527, 2021.
@article{nokey,
title = {Spatial opinion mining from COVID-19 twitter data},
author = {Mehtab Alam Syed and Remy Decoupes and Elena Arsevska and Mathieu Roche and Maguelonne Teisseire},
url = {https://www.ijidonline.com/article/S1201-9712(21)00957-7/pdf},
doi = {10.1016/j.ijid.2021.12.065},
year = {2021},
date = {2021-11-06},
urldate = {2021-11-06},
journal = {International Journal of Infectious Diseases},
volume = {116},
issue = {549},
pages = {527},
abstract = {In the first quarter of 2020, the World Health Organization (WHO) declared COVID-19 a public health emergency around the globe. Users from all over the world therefore shared their thoughts about COVID-19 on social media platforms such as Twitter and Facebook. It is thus important to analyze public opinions about COVID-19 from different regions over different periods of time. To address the spatial analysis issue, a previous work proposed H-TF-IDF (Hierarchy-based measure for tweet analysis) for term extraction from tweet data. In this work, we focus on sentiment analysis performed on terms selected by H-TFIDF for spatial groups of tweets, in order to understand local situations during the ongoing COVID-19 epidemic over different time frames.},
keywords = {Covid-19 (Coronavirus), Text mining},
pubstate = {published},
tppubtype = {article}
}
Valentin, Sarah; Lancelot, Renaud; Roche, Mathieu
Identifying associations between epidemiological entities in news data for animal disease surveillance Journal Article
In: Artificial Intelligence in Agriculture, vol. 5, pp. 163-174, 2021, ISSN: 2589-7217.
@article{VALENTIN2021163,
title = {Identifying associations between epidemiological entities in news data for animal disease surveillance},
author = {Sarah Valentin and Renaud Lancelot and Mathieu Roche},
url = {https://www.sciencedirect.com/science/article/pii/S2589721721000246},
doi = {10.1016/j.aiia.2021.07.003},
issn = {2589-7217},
year = {2021},
date = {2021-01-01},
journal = {Artificial Intelligence in Agriculture},
volume = {5},
pages = {163-174},
abstract = {Event-based surveillance systems are at the crossroads of human and animal (and plant and ecosystem) health, epidemiology, statistics, and informatics. Thus, their deployment faces many challenges specific to each domain and their intersections, such as relations among automation, artificial intelligence, and expertise. In this context, our work pertains to the extraction of epidemiological events in textual data (i.e. news) by unsupervised methods. We define the event extraction task as detecting pairs of epidemiological entities (e.g. a disease name and location). The quality of the ranked lists of pairs was evaluated using specific ranking evaluation metrics. We used a publicly available annotated corpus of 438 documents (i.e. news articles) related to animal disease events. The statistical approach was able to detect event-related pairs of epidemiological features with a good trade-off between precision and recall. Our results showed that using a window of words outperformed document-based and sentence-based approaches, while reducing the probability of detecting false pairs. Our results indicated that Mutual Information was less adapted than the Dice coefficient for ranking pairs of features in the event extraction framework. We believe that Mutual Information would be more relevant for rare pair detection (i.e. weak signals), but requires higher manual curation to avoid false positive extraction pairs. Moreover, generalising the country-level spatial features enabled better discrimination (i.e. ranking) of relevant disease-location pairs for event extraction.},
keywords = {Text mining},
pubstate = {published},
tppubtype = {article}
}
Li, Sabrina L; Messina, Jane P; Pybus, Oliver G; Kraemer, Moritz U G; Gardner, Lauren
A review of models applied to the geographic spread of Zika virus Journal Article
In: Transactions of The Royal Society of Tropical Medicine and Hygiene, vol. 115, no. 9, pp. 956-964, 2021, ISSN: 0035-9203.
@article{10.1093/trstmh/trab009,
title = {A review of models applied to the geographic spread of Zika virus},
author = {Sabrina L Li and Jane P Messina and Oliver G Pybus and Moritz U G Kraemer and Lauren Gardner},
url = {https://doi.org/10.1093/trstmh/trab009},
doi = {10.1093/trstmh/trab009},
issn = {0035-9203},
year = {2021},
date = {2021-01-01},
urldate = {2021-01-01},
journal = {Transactions of The Royal Society of Tropical Medicine and Hygiene},
volume = {115},
number = {9},
pages = {956-964},
abstract = {In recent years, Zika virus (ZIKV) has expanded its geographic range and in 2015–2016 caused a substantial epidemic linked to a surge in developmental and neurological complications in newborns. Mathematical models are powerful tools for assessing ZIKV spread and can reveal important information for preventing future outbreaks. We reviewed the literature and retrieved modelling studies that were developed to understand the spatial epidemiology of ZIKV spread and risk. We classified studies by type, scale, aim and applications and discussed their characteristics, strengths and limitations. We examined the main objectives of these models and evaluated the effectiveness of integrating epidemiological and phylogeographic data, along with socioenvironmental risk factors that are known to contribute to vector–human transmission. We also assessed the promising application of human mobility data as a real-time indicator of ZIKV spread. Lastly, we summarised model validation methods used in studies to ensure accuracy in models and modelled outcomes. Models are helpful for understanding ZIKV spread and their characteristics should be carefully considered when developing future modelling studies to improve arbovirus surveillance.},
keywords = {Text mining, Zika},
pubstate = {published},
tppubtype = {article}
}