openVirus: Tools and resources for COVID-19 knowledge

petermr · March 16, 2020, 11:15pm

I’ve just joined this list, so apologies if this is in the wrong category.

This project is about using R tools for COVID-19 semantic knowledge. Everything is Open.

With Thomas Shafee of La Trobe University, Melbourne, we’ve started a resource to collect published knowledge on COVID-19, such as preprints, articles, theses and grey literature. We then make the fulltext semantic and publish it for re-use, such as textmining, searching, machine-learning, etc. We currently use a rich directory structure (“CProject” from contentmine.org) and expose all content as XML, HTML, PNG, etc. Wherever possible this is linked to Wikidata, which has over 60 million chunks of data or metadata as triples. I believe it is the best scientific metadata in the world, as it includes identifiers from many well-known authorities.

There is an exciting opportunity to join R and Wikidata.

Here’s what Thomas wrote to the Wikimedia community:

This email is going out to people whose githubs indicate an interest in two or more of: R, textmining, Wikidata and COVID19.

Peter Murray-Rust (cc’d) and I are looking to put together an integrated resource of current published information & data on COVID19 (see the petermr/openVirus repo for details).

A few example subgoals:

Immediate term: be able to edit wikidata from R via the API

Short term: pull all coronavirus published papers from ePMC

textmine for main topics and broader topics (all topics should match to wikidata items)

write main topics to Wikidata items for those published papers

publish broader topics of each paper in a separate open database

Medium term: as above, but for bioRxiv, medRxiv, SciELO, Redalyc etc.

Medium term: as above, but for paywalled articles

Medium term: more in-depth text analysis

Long term: make process applicable to other topics

Please let us know if you’d be interested in being involved (and feel free to forward/adapt this email)!

Thomas Shafee

I believe that we need not just the COVID19 literature but everything on viruses, epidemics, etc. Our answers will come not just from the biological and chemical sciences but from maths/physics, engineering, statistics, modelling, economics, political science, law, philosophy… They will also come from countries and people beyond North/West universities, and I’ve made contact with Arianna Becerril-Grande, who runs AMelicA and Redalyc in Mexico. I’m writing scrapers for them…
Wikidata is the most advanced way of managing multilinguality and multidisciplinarity, and we tackle this through ContentMine dictionaries (see Wikidata:WikiFactMine). The idea is simple: we create lots of dictionaries, each of which is a list of terms linked to Wikidata IDs. They are easy to create: at the simplest you supply a list of terms and the AMI software (petermr/ami3 on GitHub) looks them up in Wikidata. There are also tools for extracting dictionaries from Wikidata (via SPARQL) or from Wikipedia (links from pages, categories, templates, etc.).
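To make the dictionary idea concrete, here is a minimal Python sketch, not the AMI implementation. It assumes a SPARQL query has already returned results in the standard SPARQL JSON layout with variables named `item` and `itemLabel` (those variable names are an assumption), and it turns them into a term-to-ID lookup. The IDs in the sample are placeholders, not real Wikidata items, and the function names are hypothetical.

```python
import re

WIKIDATA_ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def dictionary_from_sparql_json(result):
    """Build {lower-cased term: Wikidata ID} from a SPARQL JSON result.

    Assumes each binding has 'item' (entity URI) and 'itemLabel' (term).
    """
    dictionary = {}
    for binding in result["results"]["bindings"]:
        uri = binding["item"]["value"]
        label = binding["itemLabel"]["value"]
        # Strip the entity prefix so only the ID part remains.
        if uri.startswith(WIKIDATA_ENTITY_PREFIX):
            uri = uri[len(WIKIDATA_ENTITY_PREFIX):]
        dictionary[label.lower()] = uri
    return dictionary


def annotate(text, dictionary):
    """Return (term, wikidata_id) pairs for dictionary terms found in text."""
    lowered = text.lower()
    hits = []
    for term, qid in sorted(dictionary.items()):
        # Match whole terms only, so "virus" does not match "viruses".
        if re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", lowered):
            hits.append((term, qid))
    return hits


# Hand-written sample in SPARQL JSON layout; the IDs are placeholders.
SAMPLE_RESULT = {
    "results": {
        "bindings": [
            {
                "item": {"value": WIKIDATA_ENTITY_PREFIX + "Q-EXAMPLE-1"},
                "itemLabel": {"value": "Coronavirus"},
            },
            {
                "item": {"value": WIKIDATA_ENTITY_PREFIX + "Q-EXAMPLE-2"},
                "itemLabel": {"value": "epidemic"},
            },
        ]
    }
}

if __name__ == "__main__":
    dictionary = dictionary_from_sparql_json(SAMPLE_RESULT)
    print(annotate("A coronavirus outbreak can become an epidemic.", dictionary))
```

The point of the sketch is that a dictionary is just data: any language that can read a list of terms and IDs can use it.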
The workflow is then roughly:

search a repository (e.g. EuropePMC, biorxiv…) and download fulltext
normalize and chop it into sections (this greatly increases precision)
search with whatever tools (Thomas and I will want to adjust “Textmining in R” to this)
analyze and display (again R will be very valuable)
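The middle two steps (chop into sections, then search) can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not what AMI does: the paper is plain text, section headings sit on their own lines, and the dictionary terms are single words. All names here are hypothetical.

```python
import re
from collections import Counter


def split_sections(fulltext, headings):
    """Chop plain text into {section name: body}.

    A line whose stripped, lower-cased text is in `headings` starts a new
    section; anything before the first heading goes into "front".
    """
    sections = {}
    current = "front"
    buffer = []
    for line in fulltext.splitlines():
        name = line.strip().lower()
        if name in headings:
            sections[current] = "\n".join(buffer).strip()
            current = name
            buffer = []
        else:
            buffer.append(line)
    sections[current] = "\n".join(buffer).strip()
    return sections


def count_terms(sections, terms):
    """Count single-word dictionary terms separately in each section."""
    counts = {}
    for name, body in sections.items():
        words = re.findall(r"\w+", body.lower())
        counts[name] = Counter(word for word in words if word in terms)
    return counts


PAPER = """Example title
Introduction
The virus spread quickly.
Methods
We sequenced the virus and compared each genome.
Results
One genome differed.
"""

HEADINGS = {"introduction", "methods", "results"}
TERMS = {"virus", "genome"}

if __name__ == "__main__":
    per_section = count_terms(split_sections(PAPER, HEADINGS), TERMS)
    for name, counter in per_section.items():
        print(name, dict(counter))
```

Counting per section is what allows a search to be restricted, for example, to terms that occur in the methods rather than anywhere in the paper.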

Examples and a full tutorial (on plant crops) are at petermr/tigr2ess on GitHub (materials for the TIGR2ESS workshop in Delhi, Feb 2019, a joint UK (Cambridge) - India project on food security). The software has expanded since then and covers semantic capture from bitmaps (e.g. graphs as PNG, which may be useful later).

My AMI software is written in Java, but the key resources are the CProject structure (basically a Unix filestore) and the dictionaries. I believe that glueware can be written in many languages: data structures are more important than the code.
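As an illustration of why a plain filestore makes glueware easy, here is a short Python sketch that walks a project directory. The layout it assumes (one sub-directory per paper, each holding a `fulltext.xml`) is an assumption made for the example and may not match the real CProject conventions; the function names are hypothetical.

```python
import tempfile
from pathlib import Path


def list_ctrees(cproject_dir, required_file="fulltext.xml"):
    """Return sorted names of sub-directories that contain `required_file`."""
    root = Path(cproject_dir)
    return sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir() and (child / required_file).is_file()
    )


def make_demo_cproject(root):
    """Create a tiny throwaway project tree for the demo (assumed layout)."""
    root = Path(root)
    (root / "paper_a").mkdir()
    (root / "paper_a" / "fulltext.xml").write_text("<article/>", encoding="utf-8")
    (root / "paper_b").mkdir()  # a paper directory with no fulltext yet
    (root / "notes.txt").write_text("not a paper", encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        make_demo_cproject(tmp)
        print(list_ctrees(tmp))
```

Nothing here depends on the Java code: the directory tree itself is the interface.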

The project will be published continuously (i.e. daily) as OpenNotebookScience in WikiJournalMedicine (example: WikiJournal of Medicine/Western African Ebola virus epidemic - Wikiversity). WJM’s ethos follows Wikimedia’s and is to communicate high-quality knowledge to and from the world, rather than to gather glory.

Peter Murray-Rust
(Peter Murray-Rust - Wikipedia)

stefanie · March 16, 2020, 11:38pm

Thank you for posting this here for our community to see.

petermr · March 17, 2020, 12:06am

Thanks for the welcome.
Have just discovered Wikidata:WikiProject COVID-19,
so there can be a lot of synergy.
