New package idea: Public Data depository using Google BigQuery?

Hi all,

It would be great to get your thoughts on a new package that I am starting to develop (https://github.com/iainmwallace/DataDepository)

The idea is to make it easier for people to publish tabular datasets of interest to Google BigQuery. These datasets could be either primary datasets or datasets created by cleaning/standardizing/integrating one or more primary datasets before beginning an analysis. This would greatly benefit secondary users of the data.

Datasets stored in BigQuery are immediately available to explore via the web UI or via a REST API. They can also be combined with any other public dataset in BigQuery to create a new public dataset (there is a rough example of such a join after the list of datasets below).

I have three datasets loaded as a proof of concept:
Compound names from PubChem mapped onto InChIKeys
Compound activities from ChEMBL enhanced with InChIKeys
Count of compounds appearing in databases based on UniChem
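
For example, the first two tables can be joined on InChIKey directly in the BigQuery web UI. Here is a rough sketch of such a join (the table and column names below are illustrative, as the schemas may still change):

SELECT p.compound_name, c.standard_type, c.standard_value
FROM [decisive-coder-171820:Pubchem.compound_names] p
JOIN [decisive-coder-171820:Chembl.activities] c
ON p.inchikey = c.inchikey
LIMIT 100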

Is this something that might be of general interest? If so, any suggestions on how best to implement it would be great :slight_smile:

Thanks,

Iain

Hi @iainmwallace

Great idea! My first thought is about price. If I remember correctly, it costs money to query BigQuery. Is there some small amount of querying that is free?

Can you explain what the pkg does? Not quite clear to me.

Hi @sckott,

Thanks - at the moment the package is just two helper functions for getting a data frame into a table in BigQuery.
The first function (data2gcs) splits a data frame into many small zipped JSON files and uploads them to a Google Cloud Storage folder. The second function (loadJsonFiles2BigQuery) loads specified JSON files from Google Cloud Storage into a specific table. I want to add a third helper function that lets a user attach metadata to the table, such as a table description and column descriptions.
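
Roughly, I imagine usage looking something like this (the argument names are still in flux, so treat this as a sketch rather than the final API):

library(DataDepository)

# Step 1: split a data frame into small zipped JSON files and upload them
# to a folder in a Google Cloud Storage bucket
data2gcs(my_data_frame, bucket = "my-public-bucket", folder = "chembl_activities")

# Step 2: load those JSON files from Cloud Storage into a BigQuery table
loadJsonFiles2BigQuery(bucket = "my-public-bucket", folder = "chembl_activities",
                       project = "decisive-coder-171820", dataset = "Chembl",
                       table = "activities")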

Regarding costs, there is no fee for the server it is hosted on; rather, there is a small fee for storing data (the first 10 GB are free, then $0.02 per additional GB per month - i.e. about $20 per month for 1 TB) and a fee for querying the data (the first 1 TB processed each month is free, then $5 per additional TB).

So there is a reasonable amount available for free (in addition to the $300 sign-up credit). If you store a 10 GB table, you could run about 100 SELECT * queries per month, since each full-table query processes all 10 GB and 10 GB x 100 = 1 TB, the free monthly query quota. Querying only the columns you need reduces the amount of data processed, and viewing the preview of a table is free.

This query, which returns all compounds in ChEMBL that have been tested in clinical trials, processes about 14 MB:
SELECT md_pref_name, md_max_phase
FROM [decisive-coder-171820:Chembl.annotated_molecule_dictionary]
WHERE md_max_phase > 0

Hope that helps explain it, but let me know if anything isn’t clear.

Ideally, the package would make it easy for scientists to upload their own datasets in such a way that they are easy for others to find and re-use.

Cheers,

Iain

Hi Iain,

Thanks for the description of the package.

How difficult is the authorization setup? Does it need OAuth, or some kind of API key?

I think it’s possible this could be useful to folks - I know CKAN has been really useful, but it seems to have been taken up mostly in the open government space. It’s also a pretty different product, since it’s a host-it-yourself kind of thing, whereas here Google does everything for you.

As a separate issue, are you interested in submitting to ropensci? No pressure either way, but if so, we can discuss whether it fits as well.

Hi Scott,

At the moment it uses authentication from two different packages (bigrquery and bigQueryR). I will migrate to bigQueryR, as there is a nice example of how to use it with a Shiny app.
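
If I remember the bigQueryR API correctly, the authentication step is just an interactive OAuth flow in the browser, roughly:

library(bigQueryR)

# Opens a browser window for Google OAuth and caches the token locally
bqr_auth()

# Once authenticated, list the BigQuery projects the account can access
bqr_list_projects()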

The big gap in the process is how to make the datasets discoverable, as there doesn’t seem to be a way to search across all public projects. This might change, or perhaps I could make a public registry that anyone could write to.
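
To make the registry idea concrete: it could itself just be a public BigQuery table that the package appends a row to whenever someone publishes a dataset, and that anyone could then search. Something like this (the project, table, and column names here are entirely made up):

SELECT project_id, dataset_name, table_name, description, contact
FROM [some-public-project:registry.public_datasets]
WHERE description CONTAINS 'chembl'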

Yeah, it would be cool to submit it to ropensci. The onboarding process sounds very welcoming :slight_smile:
Cheers,

Iain


Yeah, the discovery thing seems important - not being able to search across projects.