dataspice, codebook and ropensci scope

Hey there!

I just saw dataspice in the summary of the unconf2018.
Given this, I’m a bit surprised that ropensci rejected my package codebook during a pre-submission inquiry as out of scope. My idea was to offer data holders some nice plots and websites as an incentive to produce machine-readable metadata for others. I just now saw that emldown by Carl Boettiger had the exact same pitch a few months earlier (but based around EML rather than attributes). Unfortunately I didn’t know about emldown, and I also wasn’t pointed to it during the onboarding process.
But some of the ropensci people whom I was in touch with before the pre-submission inquiry (e.g. Maëlle Salmon) have now built dataspice in ropenscilabs. It has very similar core functionality to my package (website generation for a metadata codebook + JSON-LD generation).
Obviously, I don’t want copyright over an idea or anything like that. dataspice seems really cool and goes above and beyond what my package does (with EML support and a Shiny app for entering metadata). It just seems that a joint effort could have been even better, and the kind of community input dataspice got is what I was hoping for when I submitted the pre-submission inquiry, i.e. that people would collaborate to avoid reinventing the wheel. I’d appreciate hearing whether that is on me for misunderstanding what ropensci is about.

If it’s just on me, I’d like to know whether dataspice will be maintained and brought to CRAN (it seems this didn’t happen with emldown), because then I should probably stop developing codebook.

Hi @ruben!

Thanks for reaching out. I can see how the overlapping-but-different spaces of the work we do through unconferences, internal development, and onboarding can be confusing. The emphasis of the unconference is on experimentation, exploration and community-building - folks work on all kinds of things that wouldn’t be in scope for onboarding. This year I can think of a few statistical methods packages, for instance, which we wouldn’t onboard either. Similarly, some of the work done by our core staff is focused on filling gaps in R infrastructure and wouldn’t be in scope. We want to encourage the community to produce things like codebook and these tools even though we don’t think they’re a great match for our review process. Onboarding is narrower because some things, like reports or visualizations, are tough to judge objectively and there are many opinions about what the best output would look like; also, our reviewer base and process aren’t designed to evaluate things like algorithmic correctness.

That said, we could do a better job of connecting developers like you with others in the community who are working on similar topics, and of enabling collaboration even if your packages don’t go through the onboarding process. I didn’t work on the dataspice team and I’m not sure about further plans for that package. Usually after the conference it takes a while for projects to regroup and see whether there’s interest and availability to continue development on the initial experiments. But if they’re moving forward with it, I’m sure they’d be interested in discussing how the packages can work compatibly or how you can contribute to each other’s work.

Hey @ruben, I’m really glad you reached out. At the time of unconf18, I hadn’t seen codebook and it looks great. And I do see the similarities. I don’t see any reason why we can’t collaborate.

It seems like codebook is aimed at a specific audience but I wasn’t totally sure after reading the docs. I see references to ‘survey data’, ‘psychological scales’, and ‘computing reliabilities’. Is codebook general or more narrowly scoped?

Another thing that intrigued me was your mention of grabbing metadata from R attributes, which I’m totally unfamiliar with. Would you be willing to point me to an example? This sounds really useful.


Hi @noamross. Thanks for the clarifications. This was difficult to understand from the outside, and it makes getting “inside” look harder than it maybe really is. So yes, maybe something could be improved there.

@brycem: Because my background is in psychological and sociological survey data, I began with features that I knew would be useful in my community. Part of my motivation for submitting to ropensci was to hear what else could make it useful for other communities. In ecology, probably better handling of location data, for example! I tend to think that a large enough part of the problem of documenting data and generating metadata is shared across communities for it to make sense to build one package that handles various communities’ needs well.

Regarding attributes: it’s useful if these attributes exist. Data imported via haven from SPSS or Stata files, for example, has variable and value labels. Data imported via my survey software formr.org has additional technical metadata on the survey questions, and I have started doing something similar for QualtRics. It’s probably worth discussing the merits and demerits of my approach of storing metadata in attributes vs. dataspice’s approach of storing it in additional CSV files. I like the attributes since they travel with the data when you merge and subset, and they make it easy to access the metadata from the data object in R. But attributes can also get lost quite easily, especially outside the tidyverse. So storing metadata in files is useful too, though in my case I’ve only implemented writing JSON-LD(-ish) so far.
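
To illustrate (a minimal sketch; the file name and the age/gender columns are made up):

```r
library(haven)

# hypothetical SPSS file; haven carries variable and value labels
# into R as attributes on each column
df <- read_sav("survey.sav")

attr(df$age, "label")      # e.g. "Age of respondent"
attr(df$gender, "labels")  # e.g. c(male = 1, female = 2)

# the labels travel along when you subset
sub <- df[df$age > 30, ]
attr(sub$age, "label")     # still there
```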

Since I’ve already failed to spot some overlap between packages (sorry about that, and thanks for reaching out!): regarding QualtRics, do you mean qualtRics the R package or Qualtrics the survey tool? We’ve recently onboarded https://github.com/ropensci/qualtRics (we also know of https://github.com/earthlab/qtoolkit, which is younger; not sure about their overlap).

I meant the R package. We had a bit of back and forth, and I’ve got a working test case with the test data he sent me, but it takes some manual translation and I haven’t tested it with a real study yet. I didn’t know about qtoolkit.


Yes, I think so! At least as far as citation-level metadata and a bit beyond that too. I think we had kinda aimed at only documenting as far as Schema.org lets us, while providing outlets into richer metadata standards such as EML or ISO 19115. Kind of like a metadata onboarding of sorts. I’m not sure if the whole group thinks of it this way.

That’s super cool. It’s really too bad Excel/CSV are so common.

Yeah, we’re definitely aiming at a low bar by letting the user keep their data as-is, though that’s how every metadata standard I’ve seen works. Does the psych/socio survey community not tend to ship XML metadata around with their datasets, like ISO or EML? I think I’m totally naive as to what data these communities store and how they document and archive it.

Yes, schema.org is not that rich. It doesn’t even have the number of observations, nor what is being observed (humans? mice? solar systems?). I thought it was still evolving and extensible though, and that it had a lot going for it compared to old, steep-learning-curve systems like DDI. Really not an expert though.
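
To make that concrete, here is roughly what a schema.org Dataset description gives you (a minimal sketch via jsonlite; all field values are made up, and this is not dataspice’s actual output):

```r
library(jsonlite)

# a minimal schema.org Dataset description; note there is no standard
# slot for the number of observations or the unit of observation
meta <- list(
  "@context" = "https://schema.org/",
  "@type" = "Dataset",
  name = "Example survey",
  description = "A made-up dataset, just for illustration",
  variableMeasured = list(
    list("@type" = "PropertyValue",
         name = "age",
         description = "Age of respondent")
  )
)
toJSON(meta, auto_unbox = TRUE, pretty = TRUE)
```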

I think datasets packaged with metadata are nice, that way you cannot accidentally separate them so easily, especially if you’re not clued in. In my community, sharing data in the proprietary formats that allow this (SPSS, Stata, etc.) is common (although people don’t think of this as sharing metadata). But that isn’t indexed in search engines, I think. Also, these formats are not documented, so Evan Miller (I think) reverse-engineers them all. Now that more and more people are switching to sharing open formats, they tend towards pretty poor practice with codebooks (CSVs for data, plus mostly PDFs and some Excel for documentation, usually with nonstandard names).

I don’t know any file format (except simply storing R data frames in RDA/RDS) that openly supports storing data attributes in the same file. I guess one exists? But it’s also not super important to me. Because a lot of human-subjects data cannot be shared openly owing to identification concerns, I thought getting a way to share a high-level summary of the data that can be read by humans and machines was important. Basically, sharing as much as you can without disclosing too much.

The only other metadata standard that seems common in social science is DDI. There, you share a separate XML file, but there are few open readers and almost no open writers. Also, it’s super huge.

I think datasets packaged with metadata are nice, that way you cannot accidentally separate them so easily, especially if you’re not clued in.

Totally agree with this, and it’s great that you can make use of such metadata inclusion in SPSS & Stata outputs. In some ways what we were looking to do was provide some basic templates and tooling for researchers to be able to “get together” some of that metadata in the absence of anything else.

One thing we discussed (that I’m still really interested in) but didn’t get far enough to develop functionality for is indeed how best to make metadata available during analysis. There was also talk of enriching the basic schema.org metadata fields with some extras that would be useful in a research context, handling factor levels and including data type being the most basic ones discussed. Extracting such stored metadata from SPSS and Stata outputs using haven, and making it easily accessible through packages like labelled and sjlabelled, seems a really cool approach in your domain. I’d love to learn more and explore whether such an approach might be adapted to work more generically within the dataspice workflow.
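
For example, something along these lines (a sketch with made-up data; var_label(), val_labels() and look_for() are real labelled accessors, the data is hypothetical):

```r
library(labelled)

# made-up data standing in for something read via haven::read_sav()
df <- data.frame(age = c(25, 31, 47), gender = c(1, 2, 1))
var_label(df$age) <- "Age of respondent"
val_labels(df$gender) <- c(male = 1, female = 2)

# the metadata is then at hand during analysis
var_label(df$age)           # "Age of respondent"
val_labels(df$gender)       # male = 1, female = 2
look_for(df, "respondent")  # search variables by their labels
```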

I don’t know any file format (except simply storing R data frames in RDA/RDS) that openly supports storing data attributes in the same file

Perhaps exploring the potential of csvy for including metadata in the YAML header of a CSV could be an option (using the csvy package to read and write). I imagine care will be needed with duplicating metadata storage (i.e. in the dataspice framework we’ve already got metadata in CSV format which then feeds into a JSON format). However, if appropriate hooks are established in the metadata creation/storing/serving workflow, perhaps csvy could act as an analogous file format to what SPSS and Stata export, and therefore be used to label data up with attributes for analysis automatically. Otherwise, exploring whether we could just use the dataspice tabular metadata sheets or the generated metadata JSON to label up analysis data could be an option.
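
Roughly, a csvy file keeps a YAML frontmatter block above ordinary CSV rows, so one file could carry both data and metadata (a sketch; the file contents and field names are made up, and whether labels round-trip into R attributes is something to check):

```r
# "example.csvy" might look like this (YAML frontmatter + plain CSV):
#
# ---
# name: example
# fields:
#   - name: age
#     title: Age of respondent
#   - name: country
# ---
# age,country
# 25,DE
# 31,US

library(csvy)
df <- read_csvy("example.csvy")
str(attributes(df))  # the header metadata should end up in attributes here
```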

Anyways, there seem to be some really fertile conversations to be had here and much to learn by speaking across domains. If you’d like to be involved @ruben, perhaps we can move some of these discussions to issues on GitHub and see how we might combine some of the approaches in dataspice and codebook and learn from each other?

@annakrystalli Yes, you’ve definitely got the superior (any, really) workflow for getting metadata in when it doesn’t exist yet.
My RStudio addin is an attempt to make it more easily available during analysis (something that Stata/SPSS users are very much used to). But one could also, for example, generalize some of what I did with my plots to get easier auto-labelling of plots from metadata.
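
For instance, something along these lines (a made-up sketch, not codebook’s actual implementation):

```r
library(ggplot2)
library(labelled)

df <- data.frame(age = c(25, 31, 47, 52, 38))
var_label(df$age) <- "Age of respondent"

# take the axis title from the variable label instead of retyping it
ggplot(df, aes(x = age)) +
  geom_histogram(bins = 5) +
  xlab(var_label(df$age))
```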

csvy sounds cool; I didn’t know about that, or forgot about it. Yeah, duplication and metadata proliferation are something to worry about, but having it all ship in one file is definitely worth aiming for. And interoperability with SPSS/Stata is pretty useful too. I’m not sure whether you based the dataspice CSV files on any particular standard (EML?) or whether it was more ad hoc?

Happy to talk more when I return from holidays in ca. 3 weeks, so that we don’t duplicate work and can exchange ideas. Btw, I think we’ve met at your ReproHack at OpenCONN.

Yes let’s definitely talk after your holiday! I’m super interested in the addins too. And awesome to hear you were at the ReproHack!

I tried responding to a few issues in dataspice where it seemed useful. Ideally, I’d like to hatch some sort of plan to coordinate and avoid duplicating work, or to make sure there’s a single good package that unites the strengths of both.
