Really great question! I think this is an area that we’ll be focusing more on.
But in the meantime, a few quick thoughts from me:
I think you’ve nailed the key issue here: the ideal workflow is one that allows individuals, labs, and organizations to leverage data repositories from day one, rather than only as a publication or post-publication step, and the key to making that work is having privacy controls that restrict sharing to collaborators.
Personally, I think the most mature option in this area that I know (a little) about is DataONE, and I’m keenly watching for the (re-)release of the R dataone package, https://github.com/DataONEorg/rdataone/tree/master/dataone/vignettes. This should provide more useful metadata than figshare ever did, along with finer-grained access controls (figshare’s API for private files was never much good as a collaboration tool, though I remain optimistic that the situation will improve after their API rewrite).
Note that DataONE doesn’t necessarily mean EML, which is used on the KNB and offers much richer metadata (at the cost of a much higher barrier to entry). In my opinion, internal use cases (along with more user-friendly tooling) are the key to making that investment worth it: for instance, if it’s valuable to you to be able to query your organization’s private data for all files that pertain to species X in geographical region Y during time interval Z. So I’m hoping to really move EML forward once the dataone package is back online, though the dataone package is probably quite useful in this context even without touching the EML side. However, I haven’t had a chance to play with the dataone package’s new authentication model (which I believe just uses tokens; so much more streamlined than the old one!) to get a good feel for how it works with multiple users, so I only mean to highlight it as a possibility.
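To make the “species X, region Y, time interval Z” idea concrete: DataONE exposes its search index through a Solr-style query interface, so a query like that is mostly a matter of composing field filters. Here’s a rough Python sketch of building such a query string. The field names (`scientificName`, `beginDate`, `endDate`, and the bounding-box coordinates) are my assumptions about the index schema, not verified, so check the DataONE query documentation before relying on them:

```python
def dataone_solr_query(species, south, north, west, east, start, end):
    """Compose a Solr-style query string for: all datasets mentioning
    `species` within a bounding box, overlapping a date range.

    NOTE: field names here are guesses at the DataONE index schema,
    for illustration only.
    """
    clauses = [
        'scientificName:"{}"'.format(species),
        # dataset bounding box falls inside the requested box
        "southBoundCoord:[{} TO *]".format(south),
        "northBoundCoord:[* TO {}]".format(north),
        "westBoundCoord:[{} TO *]".format(west),
        "eastBoundCoord:[* TO {}]".format(east),
        # dataset coverage falls inside the requested interval
        "beginDate:[{}T00:00:00Z TO *]".format(start),
        "endDate:[* TO {}T00:00:00Z]".format(end),
    ]
    return " AND ".join(clauses)

q = dataone_solr_query("Ursus arctos", 40, 50, -125, -110,
                       "2010-01-01", "2015-12-31")
print(q)
```

In R, I’d expect the resulting query string to be passed to the dataone package’s query interface against a coordinating node; the point is just that once metadata is structured, this kind of cross-dataset question becomes a one-liner.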
The other thing I wish I knew more about in the space is Max & co’s dat project, which may just be what you want. But I know nothing (“github for data?”), so maybe someone else can fill us in? @karthik ?
Yeah, GitHub itself is an intriguing option, particularly with their new support for large binary objects. You could of course just include a metadata file (a README, of course, but perhaps also a .json file, as many software languages are starting to use for metadata in place of R’s dcf format, or some standard schema expressed with JSON-LD). Sure, that’s not “ideal”, but whether that matters really depends on the use case, and the approach has an elegant simplicity. For instance, the Zenodo import can already read a zenodo.json metadata file to make sure it gets the right metadata from the repository. That’s already an almost-complete example for your use case: you collaborate on a private GitHub repo, and when you publish, you also import it to Zenodo to get your DOI with all its aura of persistence and cite-ability, and Zenodo just sucks up the metadata in zenodo.json and passes it along to DataCite, where it joins the world of globally indexed data. Of course this is limited to only the metadata those tools understand, and we’re working to make that pipeline actually work better now via http://codemeta.github.io.
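For anyone who hasn’t seen one, here is a minimal sketch of what such a metadata file might contain. The field names follow Zenodo’s deposit metadata (title, upload_type, creators, and so on); the values are made up for illustration:

```json
{
  "title": "Occurrence records for species X, region Y, 2010-2015",
  "upload_type": "dataset",
  "description": "Field observations collected by the example lab; shared privately with collaborators before publication.",
  "creators": [
    {"name": "Doe, Jane", "affiliation": "Example University"}
  ],
  "keywords": ["occurrence", "biodiversity", "region Y"],
  "license": "cc-zero",
  "access_right": "open"
}
```

The nice part is that this file lives in the repo from day one, versioned alongside the data, and only gets consumed by Zenodo at publication time.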
Oh, and there is also the OKFN of course, http://data.okfn.org/. The Economist had a piece recently that seemed to imply OKFN had surprisingly good uptake in the non-academic world; I’d love to learn more if anyone is using it in this context.