Figshare and other options for private data repositories

At my organization we have a bunch of data lying around on people's hard drives and in Dropbox accounts that needs to be packaged up and made available to more people internally, and eventually publicly. My instinct is to document these datasets and put them in private figshare repositories, which can be made public when people are ready.

A couple of questions:

  • It seems that the figshare API is in transition and rfigshare is on hiatus until that's figured out. Do we anticipate that this will be sorted out, and any idea whether the new API or rfigshare will support figshare "projects" and collaborations?
  • Are there other similar options? Repositories that allow private deposits, shared with collaborators until it's time to make them public? I'm also considering using private GitHub repos for this, but GitHub isn't designed for metadata, and I'd prefer not to use up all our private repos on a bunch of small datasets.

Really great question! I think this is an area that we’ll be focusing more on.

But meanwhile a few quick thoughts from me:

I think you've nailed the key issue here: the ideal workflow is one that allows individuals, labs, and organizations to leverage data repositories from day one, rather than as some publication/post-publication step, and that the key to making that work is having privacy controls for sharing among collaborators only.

Personally, I think the most mature option in this area that I know (a little) about is DataONE, and I'm keenly watching for the (re-)release of the R dataone package, https://github.com/DataONEorg/rdataone/tree/master/dataone/vignettes. This should provide more useful metadata than figshare ever did, and finer-grained access controls (the API for figshare's private files was never really great as a collaboration tool, though I remain optimistic that the situation will improve after the API rewrite).

Note that DataONE doesn't necessarily mean EML, which is used on the KNB and offers much richer metadata (at the cost of a much higher barrier to entry). In my opinion, internal use cases (along with more user-friendly tooling) are the key to making that worth it: for example, if it's valuable to you to be able to query your private organization's data for all files that pertain to species X in geographical region Y during time interval Z. So I'm hoping to really move EML forward once the dataone package is back online, but the dataone package is probably quite useful for this context even without touching the EML side. However, I haven't had a chance to play with the dataone package's new authentication model (which just uses tokens, I believe; so much more streamlined than the old one!) to have a good feeling for how it works with multiple users, so I only mean to highlight it as a possibility.

The other thing I wish I knew more about in the space is Max & co’s dat project, which may just be what you want. But I know nothing (“github for data?”), so maybe someone else can fill us in? @karthik ?

Yeah, GitHub itself is an intriguing option, particularly with their new support for large binary objects. You could of course just leave a metadata file in the repo: a README of course, but perhaps also a .json file like many software languages are starting to use for metadata (in place of R's use of the dcf format), or some standard schema with JSON-LD. Sure, that's not "ideal", but whether that matters really depends on the use case, and the approach has some elegant simplicity.

For instance, the Zenodo import can already read a .zenodo.json metadata file to make sure it gets the right metadata from the repository. That's already an almost-complete example for your use case: you collaborate on a private GitHub repo, and when you publish it you also import it to Zenodo to get your DOI with all its aura of persistence and cite-ability, and Zenodo just sucks up the metadata in .zenodo.json and passes it along to DataCite, where it joins the world of globally indexed data. Of course this is limited to only the metadata those things understand, and we're working to make that pipeline actually work better now via http://codemeta.github.io
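
To make that concrete, here is a minimal sketch of what writing such a file from R could look like; the field names follow Zenodo's deposition metadata as I understand it, and all the values are invented placeholders:

```r
# Minimal sketch: write a .zenodo.json so the Zenodo-GitHub import picks up
# sensible metadata when the repo is published.  Field names follow Zenodo's
# deposition metadata as I understand it; all values are placeholders.
library(jsonlite)

meta <- list(
  title       = "Example field observations, 2010-2014",
  description = "Observations collected internally; see README for details.",
  upload_type = "dataset",
  license     = "cc-by-4.0",
  keywords    = c("ecology", "observations"),
  creators    = list(list(name = "Lastname, Firstname",
                          affiliation = "Example Organization"))
)

write_json(meta, ".zenodo.json", auto_unbox = TRUE, pretty = TRUE)
```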

Oh, perhaps there is also the OKFN of course, http://data.okfn.org/ . The Economist had a piece recently that seemed to imply OKFN had surprisingly good uptake in the non-academic world… I'd love to learn more if anyone is using it in this context.

Thanks! Can you help me with some DataONE ignorance? My understanding is that it's a coordination layer for many repositories, so we'll need a repository appropriate to our topics and privacy needs, or possibly create our own member node (actually a longer-term possibility should some projects get off the ground).

Karissa’s presentation on Dat from the last community call did make it sound like the next phase of Dat was very much trying to be a “github for data” that replaces ad-hoc tools like Dropbox for researchers. I’m looking out for that keenly.

Good points on metadata on GitHub. This seems like the most ready-to-go option out there, that at minimum can be adopted as an interim solution. I’d be interested to see if figshare or other repositories set up a similarly easy solution.

I have been developing something along these lines with Will Cornwell, Matt Pennell and Daniel Falster.

repo: https://github.com/richfitz/datastorr
vignette that outlines the idea: http://richfitz.github.io/datastorr/vignettes/datastorr.html

The idea is to store data in github releases (which can be any size up to 2GB) and fetch them as they are requested. This works with private data and is easy to work with programmatically via access tokens or OAuth. While the tooling is focussed on R, this would work for other platforms easily enough via the GitHub API. The code in the github repo is a very small, possibly autogenerated, package that manages data downloads, versioning, caching and uploads.
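
For anyone curious what this looks like underneath the tooling, here is a rough sketch of the GitHub API calls involved, written with httr rather than datastorr itself; the repo name, file name, and token handling are placeholders:

```r
# Rough sketch of fetching a file attached to a GitHub release via the API,
# which is roughly what datastorr automates.  "org/repo" and the file name
# are placeholders; a personal access token is needed for private repos.
library(httr)

auth <- add_headers(Authorization = paste("token", Sys.getenv("GITHUB_PAT")))

# Look up the latest release and its attached assets
rel   <- content(GET("https://api.github.com/repos/org/repo/releases/latest", auth))
asset <- rel$assets[[1]]

# Downloading an asset through the API requires the octet-stream Accept header
GET(asset$url, auth,
    add_headers(Accept = "application/octet-stream"),
    write_disk("mydata.csv", overwrite = TRUE))

dat <- read.csv("mydata.csv")
```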

This is totally separate from the metadata issues that Carl brings up. I think that our approach will work well with the OKFN data package approach, especially because the required metadata can be generated from the DESCRIPTION file.
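
As a rough illustration of that last point (this is not datastorr's actual code), the DESCRIPTION fields map fairly directly onto a minimal datapackage.json:

```r
# Illustrative sketch: derive a minimal OKFN datapackage.json from a package's
# DESCRIPTION file.  The field mapping and the resource entry are assumptions,
# not datastorr's implementation.
library(jsonlite)

desc <- as.list(read.dcf("DESCRIPTION")[1, ])

pkg <- list(
  name        = tolower(desc$Package),
  title       = desc$Title,
  description = desc$Description,
  licenses    = list(list(name = desc$License)),
  resources   = list(list(path = "mydata.csv", format = "csv"))  # placeholder
)

write_json(pkg, "datapackage.json", auto_unbox = TRUE, pretty = TRUE)
```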

Good question @noamross

How big is the data? If biggish, will that provide an obstacle/limit your options? E.g., would you need LFS on github?

One option is CKAN (http://ckan.org/). I think you'd have to set up your own CKAN instance, but maybe there are public ones you could use, though I don't know if they'd allow private data. One benefit to this (and ckanr would benefit if you used it) is that we're working on an R client: https://github.com/ropensci/ckanr
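
Roughly, depositing with ckanr could look something like the sketch below; the instance URL, API key, and argument details are from memory of the ckanr docs, so treat them as assumptions:

```r
# Sketch: push a small dataset to a CKAN instance with ckanr.  The URL and
# API key are placeholders, and whether private datasets are allowed depends
# on how the instance is configured.
library(ckanr)

ckanr_setup(url = "https://ckan.example.org", key = "my-api-key")

pkg <- package_create(name    = "field-observations-2014",
                      title   = "Field observations, 2014",
                      private = TRUE)

resource_create(package_id  = pkg$id,
                name        = "observations.csv",
                description = "Raw observations",
                upload      = "observations.csv")
```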

@richfitz That's awesome. I will definitely pilot it for the first of these datasets and give you feedback, although I wonder how easy it is to "flip the switch", not so much from private to public as from private to deposit in an institutional repository. Ideally I'd like to do what Carl describes, just add a .json metadata file and have Zenodo/Figshare/DataONE import everything, but here the repository and the releases are structured differently, so I'm not sure how that process would work.

@sckott The data aren’t very big. We have genomic data that’s bigger, but it has its own infrastructure. These are mostly ecological and sociological observations of tens to thousands of data points.

I'm unfamiliar with what is required by the Zenodo/Figshare/DataONE import step you describe, but if it's mostly a case of dropping a json file in the appropriate place it should be manageable. If you have thoughts or can elaborate further, that would be useful, as our use case is people with "small to medium sized" data like you have.

@noamross You're right about DataONE being an interoperability layer that makes client tools work with multiple repositories: DataONE links repositories like the KNB, Dryad, USGS, and dozens of others, together containing over 170K data sets, all accessible from one location; the KNB itself hosts about 25K of those data sets.

As Carl mentioned, the KNB is a repo that was built to support the types of use cases you describe, and that supports the DataONE REST API. So, you can upload data to the KNB using many tools, including curl, python, java, and R, among others. We are putting the finishing touches on a new dataone package for R that will support a simple token-based authentication scheme, as Carl hinted at.

When uploading data, you can control access by keeping data completely private, sharing it with specific collaborators or groups, or making it publicly accessible. You can also provide updated versions of your data, and assign a DOI to published objects, all through the R dataone library. We support many metadata standards, so you can provide rich data and metadata that supports searching across space, time, and taxa, and we provide detailed information on data access for all data in the system (and soon we'll be adding citation counts from the literature). You can see the search interface for the KNB Data Repository and for the whole DataONE network to get a feel for the types of data we store, index, and preserve.
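
For a feel of how that might look from R once the package is out, here is a hedged sketch (class and function names from the dataone/datapack packages as I understand them; the node, token, and ORCID iD are placeholders):

```r
# Hedged sketch: upload a private data package to a DataONE member node (e.g.
# the KNB) and grant one collaborator read access.  The node id, token, and
# ORCID iD are placeholders; see the dataone package docs for the real API.
library(dataone)
library(datapack)

options(dataone_token = "my-auth-token")   # token-based authentication

d1c <- D1Client("PROD", "urn:node:KNB")

dp  <- new("DataPackage")
obj <- new("DataObject", format = "text/csv", filename = "observations.csv")

# Share with a single collaborator; the object stays private otherwise
obj <- addAccessRule(obj, "http://orcid.org/0000-0000-0000-0000", "read")
dp  <- addMember(dp, obj)

packageId <- uploadDataPackage(d1c, dp, public = FALSE)
```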

I would recommend using the R EML package that Carl has been developing, as it provides a nice programmatic way to build up an effective metadata record, and then uploading that to the KNB using the R dataone library.
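
A minimal sketch of that, using the list-style interface of the EML package (the constructor interface has changed across versions, so treat the details as illustrative rather than the recommended pattern):

```r
# Illustrative sketch: build a minimal EML record and validate it with the
# EML package.  The identifier and all metadata values are placeholders, and
# the list-style interface shown here may differ across package versions.
library(EML)

person <- list(individualName = list(givenName = "Firstname", surName = "Lastname"))

doc <- list(
  packageId = "urn:uuid:00000000-0000-0000-0000-000000000000",  # placeholder
  system    = "uuid",
  dataset   = list(
    title    = "Example field observations, 2010-2014",
    creator  = person,
    contact  = person,
    abstract = "Observations collected internally; see README for details."
  )
)

write_eml(doc, "metadata.xml")
eml_validate("metadata.xml")
```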

If you are looking for something not R-specific, consider our DataLad (http://datalad.org) for version-controlled management of your (or someone else's) data alongside the code. Since it uses git-annex, you can offload your data files to one of the many supported backends. In DataLad we also extend that with the ability to publish to Figshare as a tarball (the export-to-figshare plugin), and eventually to Zenodo and others.