I’ve just joined this list, so apologies if this is in the wrong category.
This project is about using R tools for COVID-19 semantic knowledge. Everything is Open.
With Thomas Shafee of La Trobe University, Melbourne, we’ve started a resource to collect published knowledge on COVID-19, such as preprints, articles, theses and grey literature. We then make the fulltext semantic and publish it for re-use in textmining, searching, machine-learning, etc. We currently use a rich directory structure (“CProject” from contentmine.org) and expose all content as XML, HTML, PNG, etc. Wherever possible this is linked to Wikidata, which has over 60 million chunks of data or metadata as triples. I believe it is the best scientific metadata in the world, as it includes identifiers from many well-known authorities.
There is an exciting opportunity to join R and Wikidata.
Here’s what Thomas wrote to the Wikimedia community:
This email is going out to people whose GitHub profiles indicate an interest in two or more of: R, textmining, Wikidata and COVID19.
Peter Murray-Rust (cc’d) and I are looking to put together an integrated resource of current published information & data on COVID19 (see the petermr/openVirus repo for details).
A few example subgoals:
Immediate term: be able to edit Wikidata from R via the API
Short term: pull all coronavirus published papers from ePMC
- textmine for main topics and broader topics (all topics should match to Wikidata items)
- write main topics to Wikidata items for those published papers
- publish broader topics of each paper in a separate open database
Medium term: as above, but for bioRxiv, medRxiv, SciELO, Redalyc etc.
Medium term: as above, but for paywalled articles
Medium term: more in-depth text analysis
Long term: make process applicable to other topics
Please let us know if you’d be interested in being involved (and feel free to forward/adapt this email)!
I believe that we need not just the COVID19 literature but everything on viruses, epidemics, etc. Our answers will come not just from bio- and chemical sciences but from maths/physics, engineering, statistics, modelling, economics, political science, law, philosophy… They will also come from countries and people beyond North/West universities, and I’ve made contact with Arianna Becerril-García, who runs AmeliCA and Redalyc in Mexico. I’m writing scrapers for them…
Wikidata is the most advanced way of managing multilinguality and multidisciplinarity, and we tackle this through ContentMine dictionaries (see Wikidata:WikiFactMine). The idea is simple - we create lots of dictionaries, each of which is a list of terms linked to Wikidata IDs. They are easy to create - at the simplest you supply a list of terms and the AMI software (http://github.com/petermr/ami3) looks them up in Wikidata. There are also tools for extracting dictionaries from Wikidata (via SPARQL) or from Wikipedia (links from pages, categories, templates, etc.).
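To make the dictionary idea concrete, here is a minimal sketch in Python (chosen only for brevity; the same thing could be written in R or Java). It is not AMI's dictionary format. The `DICTIONARY` contents and the `annotate` function are hypothetical, and the two Wikidata IDs are what I believe to be the items for these concepts, so check them against Wikidata before relying on them.

```python
import re

# Hypothetical mini-dictionary: term -> Wikidata ID.
# The IDs are believed to be the Wikidata items for these concepts,
# but should be verified against Wikidata before any real use.
DICTIONARY = {
    "COVID-19": "Q84263196",
    "SARS-CoV-2": "Q82069695",
}


def annotate(text, dictionary):
    """Return (term, wikidata_id, offset) for every dictionary term found in text."""
    hits = []
    for term, qid in dictionary.items():
        # Match the term only when it is not embedded in a longer word.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE
        )
        for match in pattern.finditer(text):
            hits.append((term, qid, match.start()))
    # Report hits in the order they occur in the text.
    return sorted(hits, key=lambda hit: hit[2])


if __name__ == "__main__":
    for hit in annotate("SARS-CoV-2 causes COVID-19.", DICTIONARY):
        print(hit)
```

Because every hit carries a Wikidata ID rather than just a string, the same annotation can be displayed in any language Wikidata has a label for.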
The workflow is then roughly:
- search a repository (e.g. EuropePMC, bioRxiv…) and download fulltext
- normalize and chop it into sections (this greatly increases precision)
- search with whatever tools (Thomas and I will want to adjust “Textmining in R” to this)
- analyze and display (again R will be very valuable)
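The "chop into sections, then search" steps can be sketched as follows. This is a toy illustration in Python under stated assumptions, not the AMI implementation: it assumes the fulltext is already plain text with each section heading on its own line, and the `HEADINGS` list, `split_sections` and `count_terms` are hypothetical names.

```python
import re
from collections import Counter

# Assumed section headings; real articles label their sections in many ways.
HEADINGS = ("Abstract", "Introduction", "Methods", "Results", "Discussion")


def split_sections(fulltext):
    """Split plain text into {heading: body}; a line equal to a known heading starts a section."""
    sections = {}
    current = "Front"  # holds any text that precedes the first heading
    for line in fulltext.splitlines():
        stripped = line.strip()
        if stripped in HEADINGS:
            current = stripped
            sections.setdefault(current, [])
        elif stripped:
            sections.setdefault(current, []).append(stripped)
    return {name: " ".join(lines) for name, lines in sections.items()}


def count_terms(sections, terms):
    """Count case-insensitive whole-term occurrences of each term in each section."""
    counts = {}
    for name, body in sections.items():
        found = Counter()
        for term in terms:
            pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
            n = len(re.findall(pattern, body, re.IGNORECASE))
            if n:
                found[term] = n
        counts[name] = dict(found)
    return counts


if __name__ == "__main__":
    paper = (
        "Introduction\nCoronavirus outbreaks are studied.\n"
        "Methods\nWe cite a coronavirus review and used PCR.\n"
        "Results\nPCR detected coronavirus."
    )
    print(count_terms(split_sections(paper), ["coronavirus", "PCR"]))
```

Counting per section is what gives the gain in precision: a term in Methods or Results usually says more about what the paper did than the same term in a passing citation in the Introduction.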
Examples and a full tutorial (on plant crops) are at petermr/tigr2ess on GitHub (materials for the TIGR2ESS workshop in Delhi, Feb 2019, a joint UK (Cambridge) - India project on food security). The software has expanded since then and covers semantic capture from bitmaps (e.g. graphs as PNG, which may be useful later).
My AMI software is written in Java, but the key resources are the CProject structure (basically a Unix filestore) and the dictionaries. I believe that glueware can be written in many languages - data structures are more important than the code.
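As an illustration of how little glueware a filestore needs, here is a short Python sketch that walks a project directory and reports what each per-article subdirectory contains. It assumes only a layout of one subdirectory per article under a project root; the function names and the file name `fulltext.xml` in the usage note are illustrative assumptions, not a specification of CProject.

```python
from pathlib import Path


def summarize_project(project_dir):
    """Map each per-article subdirectory name to the sorted file names inside it."""
    project = Path(project_dir)
    summary = {}
    for tree in sorted(p for p in project.iterdir() if p.is_dir()):
        summary[tree.name] = sorted(f.name for f in tree.iterdir() if f.is_file())
    return summary


def trees_with(project_dir, filename):
    """List the article subdirectories that contain a file with the given name."""
    return [
        name
        for name, files in summarize_project(project_dir).items()
        if filename in files
    ]
```

For example, `trees_with("my_project", "fulltext.xml")` would list the articles for which XML fulltext has been downloaded (both names here are hypothetical). The same few lines translate directly to R using `list.dirs` and `list.files`.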
The project will be published continuously (i.e. daily) as Open Notebook Science in the WikiJournal of Medicine (example: WikiJournal of Medicine/Western African Ebola virus epidemic on Wikiversity). WJM’s ethos follows Wikimedia’s: to communicate high-quality knowledge to and from the world, rather than to gather glory.
(Peter Murray-Rust - Wikipedia)