One key element of epidemic intelligence is the capturing, filtering and verification of non-structured epidemiological data from a variety of informal sources (e.g., social media & electronic media news monitoring), also known as Event-based Surveillance (EBS).
Within the MOOD project, PADI-web has been developed as an EBS tool to monitor animal health using information drawn directly from online news articles. This online tool is set to improve the MOOD platform for epidemic intelligence and change the way animal disease outbreaks are monitored in Europe and globally.
The tool was built by a multidisciplinary team. On the epidemiology side, it includes MOOD epidemiologists Renaud Lancelot (ASTRE – CIRAD), Elena Arsevska (1st thesis on PADI-web – CIRAD), Sarah Valentin (2nd thesis – CIRAD) and Carléne Trevennec (ASTRE). On the computing side, it includes computer scientists and developers Mathieu Roche (text mining – TETIS – CIRAD), Sylvain Falala (training – ASTRE – INRAE) and Julien Rabatel (IT development and data science), as well as several students.
At the last MOOD science webinar on March 30th, we hosted Julien Rabatel, a freelance developer currently collaborating with CIRAD and directly involved in building PADI-web, to dive deep into the platform’s main functionalities and what it can provide to the end-user. Although the tool is fully operational, PADI-web is still evolving, and end-users will need to be involved to address its opportunities and current limitations.
PADI-web in short
PADI-web, short for Platform for Automated extraction of Disease Information from the web, can be accessed online. It was created to meet the needs of the International Health Monitoring Unit (VSI) of the French Animal Health Epidemiological Surveillance Platform (ESA). By automatically highlighting health issues mentioned in online news articles (web media), the tool saved the unit considerable time in detecting and monitoring potential health risks for the French territory. The tool is now under continuous development to serve the epidemic intelligence activities of the public and veterinary health agencies involved in the MOOD project.
The goal of PADI-web is to use this unofficial information from online news as a complementary addition to official health monitoring sources for the detection of emerging animal infectious diseases and the extraction of outbreak information.
Despite its focus on animal health, Julien specified that ‘During the development of PADI-web we always had in mind that we wanted it to be generic so that it could be applied to other domains of epidemiology than animal health’, making it a generic and highly customizable tool. This characteristic allows for a broad range of end-users, from domain experts (e.g., epidemiologists, modellers and risk assessors) to data scientists and developers (e.g., data visualization experts and text miners).
By processing hundreds of Google News articles per day, PADI-web monitors a wide range of animal diseases, from African swine fever to avian influenza and West Nile virus, as well as new diseases, through fully automated, machine-learning-based pipelines.
A fully-automated, but customizable data pipeline
1. Data collection
“Every few hours, PADI-web queries Google News with a combination of keywords (like disease names, symptoms and host names) in order to retrieve only potentially relevant articles on animal epidemiology.”
Google News provides structured results in the form of RSS feeds, from which PADI-web retrieves the metadata of each article (publication date, title and hyperlink to the web page).
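The collection step can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not PADI-web’s actual code: the keyword lists, the function names and the sample feed are hypothetical, and the Google News RSS search URL pattern is an assumption about how such a query might be built.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List
from urllib.parse import urlencode

# Hypothetical keyword lists; PADI-web's real query terms are not shown here.
DISEASES = ["african swine fever", "avian influenza"]
HOSTS = ["pig", "poultry"]


def build_feed_url(disease: str, host: str) -> str:
    """Build an RSS search URL for one disease/host keyword pair (assumed URL pattern)."""
    query = f'"{disease}" {host}'
    return "https://news.google.com/rss/search?" + urlencode({"q": query, "hl": "en"})


def parse_feed(rss_xml: str) -> List[Dict[str, str]]:
    """Extract title, link and publication date from each <item> of an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return articles


# A hand-written sample feed, so the sketch runs without network access.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>ASF outbreak reported</title>
<link>https://example.org/asf</link>
<pubDate>Mon, 03 Apr 2023 08:00:00 GMT</pubDate></item>
</channel></rss>"""

print(build_feed_url(DISEASES[0], HOSTS[0]))
print(parse_feed(SAMPLE_FEED))
```

In a real deployment the feed text would be downloaded from the URL on a schedule; the parsing step stays the same.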
2. Webpage processing
The article is then ‘cleaned’ by removing unnecessary information and elements (advertisements, pictures, etc.) so that the raw text is easily accessible. To simplify the pipeline, the PADI-web team decided to use English as the working language of the tool: a language detection step is followed, where needed, by machine-learning-based translation (Microsoft Azure Translator). At this point, the text is ready to be processed.
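As an illustration of the cleaning part of this step only, the sketch below strips scripts, navigation and similar elements from a page and keeps the visible text. It is a simplified stand-in built on Python’s standard `html.parser` module; the list of skipped tags and the sample page are assumptions, and the language detection and translation steps are not shown.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the content of non-article elements."""

    SKIPPED = {"script", "style", "nav", "aside", "footer", "figure"}  # assumed list
    BLOCKS = {"p", "div", "br", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCKS and self._skip_depth == 0:
            self._chunks.append(" ")  # keep adjacent blocks from running together

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join the pieces, then collapse all runs of whitespace.
        return " ".join("".join(self._chunks).split())


def clean_html(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


SAMPLE_PAGE = (
    "<html><body><script>var a = 1;</script><nav>Menu</nav>"
    "<p>Avian influenza confirmed in <b>ducks</b>.</p>"
    "<aside>Advert</aside></body></html>"
)
print(clean_html(SAMPLE_PAGE))
```

Real news pages are far messier than this sample, so a production pipeline would typically rely on a dedicated content-extraction library rather than a fixed list of tags.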
3. Data classification
“This step is mainly, initially, here to decide whether an article is relevant or not,” said Julien. PADI-web collects a wealth of articles daily, but not all of them are of epidemiological interest: according to the team, an article is considered relevant from an epidemiological perspective if it relates to a new suspected or unknown outbreak.
To ensure an effective ‘yes-or-no’ classification, this process is automated and based on supervised machine learning. The classification module is customizable. It is used not only to determine the topic of an article (outbreak declaration, consequences, alert/preparedness, or general information) but also for ‘sentence classification’, which identifies the type of information contained in a given sentence. Although the system is highly automated, users can correct wrong classifications, and the classification models are retrained every day on the new examples to improve the tool’s performance over time.
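To make the idea of supervised relevance classification concrete, here is a small, self-contained sketch. It uses a multinomial Naive Bayes classifier written in plain Python as a stand-in; the article does not say which algorithm PADI-web uses, and the training sentences and the ‘relevant’/‘irrelevant’ labels are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one smoothing (illustrative stand-in)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def fit(self, texts, labels):
        # Counts accumulate, so calling fit again adds new examples.
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        tokens = tokenize(text)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy examples; a real model needs far more labelled articles.
TRAIN = [
    ("New outbreak of avian influenza confirmed on a poultry farm", "relevant"),
    ("African swine fever cases detected in wild boar", "relevant"),
    ("Suspected outbreak of bluetongue reported in sheep", "relevant"),
    ("Football club wins the national championship", "irrelevant"),
    ("Stock markets fall after interest rate decision", "irrelevant"),
    ("New smartphone released with improved camera", "irrelevant"),
]
model = NaiveBayesClassifier().fit(*zip(*TRAIN))

# A user correction becomes one more training example.
model.fit(["Mystery disease kills cattle in several villages"], ["relevant"])

print(model.predict("Outbreak of swine fever reported on a farm"))
```

The second call to `fit` mirrors the feedback loop described above: each corrected article is added to the labelled examples that the next training run learns from.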
4. Information extraction
The goal of PADI-web’s information extraction module is to automatically detect pieces of information within the collected text “that can be of interest for animal epidemiology. This can be location, host species, diseases, case numbers, dates and so on”, according to Julien.
Fishing key information from the sea of the internet is not new to text miners: Julien’s group applied the Named Entity Recognition component of the Python library spaCy (http://spacy.io), which includes a model that recognizes common generic entities in English, such as names of places, dates and names of organizations. However, spaCy was not designed to extract epidemiological information, so Julien and his team had to ‘build a dataset of around 500 articles whose entities have been labelled, and then we had trained a specific model in SpaCy for animal epidemiology’ to recognize the entities that are relevant to PADI-web. Training this dedicated spaCy model has therefore been fundamental to providing relevant information to the end-users.
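A labelled dataset of this kind is usually stored as text plus character-offset spans. The sketch below shows one way to build such records in plain Python. The `(text, {"entities": [(start, end, label)]})` layout is a common convention for preparing spaCy training data, but the label names, the example sentence and the `annotate` helper are hypothetical and not taken from PADI-web.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]                     # (start offset, end offset, label)
Annotation = Tuple[str, Dict[str, List[Span]]]  # (text, {"entities": [...]})


def annotate(text: str, mentions: List[Tuple[str, str]]) -> Annotation:
    """Turn (surface string, label) pairs into character-offset entity spans.

    Mentions must be listed in the order in which they appear in the text.
    """
    entities: List[Span] = []
    cursor = 0
    for surface, label in mentions:
        start = text.find(surface, cursor)
        if start == -1:
            raise ValueError(f"{surface!r} not found after position {cursor}")
        end = start + len(surface)
        entities.append((start, end, label))
        cursor = end  # later mentions are searched after this one
    return text, {"entities": entities}


TEXT = "Avian influenza killed 200 ducks in Normandy on 12 March."
MENTIONS = [
    ("Avian influenza", "DISEASE"),  # hypothetical label set
    ("200", "CASES"),
    ("ducks", "HOST"),
    ("Normandy", "LOCATION"),
    ("12 March", "DATE"),
]

EXAMPLE = annotate(TEXT, MENTIONS)
for start, end, label in EXAMPLE[1]["entities"]:
    print(label, TEXT[start:end])
```

Records in this shape can then be converted into whatever format the chosen training tool expects; checking that each span slices back to the intended surface string is a cheap guard against annotation errors.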
5. Outputs
The PADI-web pipeline converges into a set of outputs tailored to its end-users.
“PADI-web has very different types of end-users, so we have to provide different types of outputs,” described Julien. PADI-web offers four output formats:
1. A public user interface (website);
2. A customizable notification email service – daily or weekly – collecting new articles of interest with some basic information in support of epidemic monitoring and surveillance from veterinary health agencies (cf. image);
3. Various export formats (CSV, XLSX, JSON, HTML) for almost every type of data extracted by PADI-web (for instance, the location of an outbreak and its surrounding contextual information), designed specifically for data scientists;
4. A JSON API that lets developers use PADI-web capabilities in their own applications.
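To give a feel for the export formats, the following sketch serializes a couple of extracted records to CSV and JSON with Python’s standard library. The records and field names are invented for illustration and do not reflect PADI-web’s actual export schema or API.

```python
import csv
import io
import json

# Hypothetical extracted records; field names and values are assumptions.
RECORDS = [
    {"disease": "African swine fever", "host": "wild boar",
     "location": "Poland", "published": "2023-03-28"},
    {"disease": "Avian influenza", "host": "ducks",
     "location": "France", "published": "2023-03-29"},
]


def to_csv(records):
    """Serialize a list of flat dictionaries to CSV text with a header row."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialize the same records to pretty-printed JSON."""
    return json.dumps(records, indent=2, ensure_ascii=False)


print(to_csv(RECORDS))
print(to_json(RECORDS))
```

CSV suits spreadsheet-style analysis, while JSON keeps nested contextual information intact, which is one reason a tool might offer both.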
A screenshot of the PADI-web output
Current and future work to improve PADI-web
There is plenty of room to improve this EBS system, and the PADI-web team’s to-do list is rich in improvements and novel additions. For example, in the One Health context, a new instance of PADI-web will be dedicated to plant health, as an addition to the currently available animal health platform.
“We have in progress the work in visual analytics of detected events with some of the MOOD partners; for instance, there is work on space-time visualization of detected outbreaks in collaboration with partners in France [ed: LIRMM and UM]; there is also some work on risk mapping using extracted outbreak data in collaboration with partners in Belgium [ed: ULB, AVIA-gis, Institute of Tropical Medicine],” continued Rabatel.
“One part of my job as a developer is to integrate them [ed: new modules] while maintaining the main functionality of the platform,” he concluded.
In a joint effort to improve European health monitoring, MOOD is developing new and customized epidemic intelligence tools, such as PADI-web, that are complementary to those already in place. The involvement of end-users across European countries is therefore key to shaping innovative and sustainable health threat monitoring systems, and a series of co-development workshops is scheduled over the coming months.
Do you want to collaborate or help improve PADI-web? Then contact: padi-web@cirad.fr