Data license visibility

Data license visibility: is rOpenSci doing enough?

This issue has been touched on at various times by the rOpenSci community, but I have not seen a thorough discussion. Please point it out if I’ve missed it!

Background

Many R packages now provide access to data. Some (most?) provide a mechanism for retrieving data from a remote source via an API or similar (let’s call these “API packages” for want of a better term), whereas other packages provide data directly bundled into the package (call these “data packages”).

The problem: it remains uncommon for such packages to actively inform the user of the license and conditions under which they are accessing data.

Why I think this is an issue

Many data sets are released under a formal license. That license governs what the end user can do with the data and under what conditions. In the case of packages, the code license may be (and probably usually is) separate from and different to the data license. The code may have been developed or funded by different people/organisations to those responsible for the data: both deserve credit for their contributions.

In terms of formal data licensing, CC-BY licenses are common, and require the user to properly acknowledge the source of the data. Some licenses might impose additional restrictions (e.g. no commercial use). Many data sets are released without a formal license but with a request or requirement that users cite the source of the data.

It is important that users are aware of the license or conditions under which they are accessing data. Not knowing this might lead to users unknowingly breaching license conditions, or inadvertently failing to conform with requests that the data provider has made (e.g. acknowledging the source). Even if data are released without restriction (e.g. under a CC0 waiver or similar), users should know what the conditions governing the data are.

Scientists are sometimes reluctant to release their data, and “I won’t get recognition for it” or “my data will be misused” are common reasons for this. It’s probably true to say that this is less of an issue now than it has been in the past, but it nevertheless remains true in some quarters.

The data management and wider communities have worked to establish community norms on data citation and good data behaviour, and to build trust within communities to facilitate an open data culture. Users who don’t comply with such norms — knowingly or otherwise — risk undermining some of that trust. The data license is often tied up with the data’s metadata record or other documentation, and so drawing attention to the license also gives the opportunity to draw the user’s attention to the data documentation, which might include caveats on its re-use.

But …

… isn’t it the user’s responsibility to educate themselves about the data they are using?

Of course. But rOpenSci has the opportunity to play a role in promoting good data-use behaviour, and I don’t believe that this opportunity is currently being exercised as fully as it might.

Where are we now?

A quick (and definitely incomplete) audit of current rOpenSci packages suggests that the overwhelming majority of data/API packages provide links to the sources of the data in the package README, vignette, package-level documentation, or similar. Far fewer packages actively provide the user with a clear statement of the data license, restrictions, or (if appropriate) required citation. Some packages certainly do (see examples below) but it does not seem to be standard practice.

Some packages provide access to a data source that requires the user to have an account with the provider organisation: in those cases, one could reasonably assume that the source organisation will make the user aware of the conditions governing data use.

OK, so what

The point of this rant is to provoke some discussion within the community. Is this seen as a worthwhile consideration? If it is, should data license visibility be a routine part of rOpenSci practice? Should it be added to the packaging guide?

How might we achieve this? Suggestions welcome, but I don’t think that this process needs to be particularly onerous. For example:

  • the README should not only include the data source URLs but also the license and citation details, along with links to metadata records or other documentation
  • consider making this information more prominently visible to users (who might not, after all, read the README) by using packageStartupMessage()
  • add the data license details to the LICENSE file (see igraphdata for an example)
  • add the data citation details to the inst/CITATION file so that citation("mypackage") gives both the package citation and the data citation. Note that a distinction should be maintained between the package license and citation, and the data license and citation. The intention is not for package authors to lose credit or visibility of their contributions, but rather to promote those of the data providers
  • if the package retrieves data from multiple sources, consider including source information in data objects, or provide metadata or functions to help the user know which data came from which source (see e.g. the indicator_metadata function in fingertipsR, or the data source template in bowerbird).
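
As a rough illustration of the packageStartupMessage() suggestion above, here is a minimal sketch of what a package's R/zzz.R might contain. The package name "mypackage", the provider name, the license and the URL are all placeholders made up for this example, not real terms.

```r
# Hypothetical data-terms notice for an imaginary package "mypackage".
# The provider name, license and URL are placeholders, not real terms.
data_terms_notice <- function() {
  paste(
    "Data (c) Example Data Provider, released under CC BY 4.0.",
    "See https://example.org/terms for the full conditions of use.",
    "Run citation(\"mypackage\") for package and data citations.",
    sep = "\n"
  )
}

# R calls .onAttach() when the package is attached with library().
# packageStartupMessage() is used rather than message() so that users can
# silence the notice with suppressPackageStartupMessages().
.onAttach <- function(libname, pkgname) {
  packageStartupMessage(data_terms_notice())
}
```

Keeping the notice text in its own function means the same wording can be reused in the package-level documentation, so the README, help page and startup message do not drift apart.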

If you are a package developer working with a data source that does not have a clear license/terms of use, encourage the provider to give one.

As a final note, and to end with a positive example, see the bomrang README, which has links to its Bureau of Meteorology data sources, along with:

All data is copyright Australia Bureau of Meteorology
BOM Copyright Notice http://reg.bom.gov.au/other/copyright.shtml

And furthermore, when using the package:

> library(bomrang)

Data (c) Australian Government Bureau of Meteorology,
Creative Commons (CC) Attribution 3.0 licence or
Public Access Licence (PAL) as appropriate.
See http://www.bom.gov.au/other/copyright.shtml

If you use bomrang, please cite it.
See `citation('bomrang')` for the proper citation.

Bravo @adamhsparks and coauthors!


Thanks, @raymondben, for the recognition here. I’ve not thought about it more broadly as you have, but in my own work I’ve done this with bomrang, GSODR, getCRUCLdata and the upcoming nasapower package. It’s essential, I think, that we as package authors recognise that other scientists have made these data available, and that we honor their policies so that they keep offering the data and others will be inclined to do so too.

The NCEI places restrictions on the use of GSOD data outside the US (non-commercial only), and the POWER team asks for recognition if you use the data, which is fair: I expect a citation if you use my package to access it. So both packages have statements on loading and in the README to make clear that they only provide interfaces to the data. For the CRU data, I reference the official paper in the README and have a message on startup as well.

So, yes, I think it’s important and would be good to see rOpenSci recognise this and lead the way since we have so many packages of this nature.


@adamhsparks, your point “these packages only provide interfaces to the data” is a key one in my mind. I’m far from a licensing expert, but it seems clear that this means that the user is the one accessing the data from the provider. The package just provides the mechanism (but the mechanism could equally be a browser, or a command-line script, or an envelope and a nicely-worded request). So I do think it’s good behaviour on the part of package authors to let users know that by using the package, they are effectively entering into a license agreement with the data provider (if, of course, the data are licensed).
The situation may be more complicated with data packages (i.e. data bundled into package) because that could be viewed as data redistribution or republication rather than a data access mechanism. But I’ll leave that to legal/licensing experts!


@raymondben Thanks for raising this issue; this is an important but tricky discussion.

In my understanding, under US law, “data” are considered “facts” that are discovered, not “created”, and thus are not subject to US copyright law, which applies only to “creative works.” Software, database architecture and design, etc. may be. I agree that claims that data is being released under the provisions of various Creative Commons licenses are common, but with the exception of the CC0 public domain declaration, I do not believe Creative Commons supports this use. This makes the role of rOpenSci in communicating these issues a bit less clear to me.

There are obviously protections other than copyright that may be relevant. I believe data can still be considered intellectual property: for instance, in the US, private data that gives your company a special advantage may be protected as a “trade secret”, but it’s unclear whether that has any hold on publicly available data. Data may also qualify for patent protection if associated with a patent. I also do not have any knowledge about data IP protection outside the US.

I don’t believe this excludes having custom agreements between two parties governing the use of given data, I’m just not aware of any ‘standard’ license other than CC0 that plays the role that approved open source software licenses or creative commons licenses play for data. Happy to hear more about how best to address this issue from others!

Some refs: See https://www.copyright.gov/docs/regstat092303.html, “Explanatory materials such as introductions or footnotes to databases may also be copyrightable. But in no case is the data itself (as distinguished from its selection, coordination or arrangement) copyrightable”, or Victoria Stodden’s work, e.g. http://web.stanford.edu/~vcs/papers/ijclp-STODDEN-2009.pdf.

P.S. I agree entirely that we should encourage community norms like data citation. Many database providers also indicate that they would want a particular scientific paper cited. I totally support this and I think using and encouraging use and awareness of the citation() mechanism in R is a great way to do this. I just think it is also important to keep distinct such legal questions of copyright, etc, from questions of community norms, such as citation.

(And of course there’s a whole third rail in which some agreements require authorship as a term of re-use)

Carl beat me to writing mostly the same thing - I’m wary of passing through data “licenses” and “copyrights” that aren’t applicable or enforceable, and especially of playing the role of arbiter of which ones are valid. That said, we should very much make sure we are promoting good norms of data citation and a practice of generous and transparent credit. I’d say anywhere a package provides citation information (README, citation(), package-level docs etc.), there should be citation information for both the package and the data source, and a link to relevant re-use documentation from the provider.

(Relatedly, package citation information is in our guide but we probably aren’t sufficiently vigilant about it).

Finally, our packaging guide currently discourages package startup messages. I think most users prefer this, but it’s open to discussion.

I prefer not to have startup messages as well. However, I think it’s extremely important to make sure that the users are aware of the conditions that may come with the data being accessed.

Happy to move everything into citation() and README for my packages though to be in compliance. That seems OK to me. Hopefully the users pay attention.
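
For anyone wanting to try the citation() route, here is a minimal sketch of entries that keep the package citation and the data citation distinct. Every title, author, year and URL is a placeholder for an imaginary package. The objects are assigned and combined here only so the snippet runs as a standalone script; in an actual inst/CITATION file the two bibentry() calls would typically just appear as top-level expressions.

```r
# Sketch of citation entries for a hypothetical package "mypackage".
# All titles, authors, years and URLs below are placeholders.

# Entry 1: the software itself.
pkg_cite <- bibentry(
  bibtype = "Manual",
  title   = "mypackage: Access to Example Data from R",
  author  = person("A.", "Developer"),
  year    = "2018",
  note    = "R package version 0.1.0",
  url     = "https://example.org/mypackage",
  header  = "To cite the mypackage software:"
)

# Entry 2: the data that the package gives access to.
data_cite <- bibentry(
  bibtype = "Misc",
  title   = "Example Observational Dataset",
  author  = person("Example Data Provider"),
  year    = "2018",
  note    = "Released under CC BY 4.0",
  url     = "https://example.org/data",
  header  = "To cite the data accessed through mypackage:"
)

# Combine both entries so they are shown together, each under its own header.
cites <- c(pkg_cite, data_cite)
print(cites)
```

The separate headers are the point of the exercise: the user sees at a glance which citation credits the package authors and which credits the data provider.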

Thanks @noamross, @cboettig for your comments — very useful. A quick followup:

I had hoped not to get into the waters of whether data licenses are actually applicable or enforceable, because it gets complicated quite quickly and depends on jurisdiction and circumstance. However, such licenses definitely exist. The Australian, NZ, and UK governments routinely use CC-BY or similar data licenses, and e.g. OpenStreetMap uses a license requiring that credit be given.

You both suggest (I’ll use Noam’s words here) that we should not be “playing the role of arbiter of which ones are valid” — I entirely agree. I don’t believe that it’s necessary or even wise for rOpenSci or a package author to make any comment on data license or copyright validity at all (partly why I didn’t talk about this earlier). My point — and perhaps I didn’t articulate this very well — is that the package author’s role is to communicate the conditions under which the provider is releasing the data. It’s then up to the user to decide if those conditions are applicable or enforceable to their particular use. Those “conditions” might be a formal license, a request for citation, or a “this data is public domain, go nuts” disclaimer.

My view is much the same as your phrasing of “promoting good norms of data citation and a practice of generous and transparent credit” (with which, of course, I entirely agree), but mine perhaps goes slightly beyond that, in the sense that it is protecting package users by drawing their attention to potentially problematic data terms and conditions (or, conversely, helping users to reassure themselves that the data are unencumbered by drawing their attention to the CC0 disclaimer or whatever it might be). Again, the point is not to make a determination on those T&C’s, but simply to make the package user aware of them.

Here’s an example (not to pick on originr, but it provides an illustrative case in point): originr provides access to data from the Global Invasive Species Database (GISD). originr isn’t republishing or redistributing these data; it’s simply providing a convenient mechanism for a user to obtain the data, so any terms and conditions relating to data usage apply (in my opinion) to the user, not the package authors. Now, interestingly, the GISD T&C’s include the clause that “you may not, under any circumstances, repost, redistribute, transmit or sub-license the data in any way, part or form”. An R user might thus, quite unknowingly, download some data via originr and pass it on to someone else, thereby potentially violating that condition. Whether or not that condition is sensible or enforceable is beside the point as far as my argument goes. I simply think that it would be good, routine package author practice to draw the user’s attention to such conditions.

Not all the data is in the US though and some are associated with/part of copyrighted materials. BOM does copyright their data here in Australia as bomrang indicates. The CRU data are part of a published paper, etc.

Good discussion to have.

And, interestingly to us non-US folks, while federal US govt data are public domain by default in the US, they are not automatically public domain outside of the US. Hooray for data transparency.

Yeah, startup messages are a tricky one (thus my “consider” phrasing about adding one). Nobody likes a deluge of startup messages, hence the packaging guidelines of avoiding them unless they are important. I guess the challenge is deciding on a reasonable definition of “important”. If the data are free to use with no restrictions, is that “important”? Probably not, I’d say. If the data provider threatens prison terms for misuse, is that “important”? Perhaps so, at least in the sense that package users would probably want to know about it. Everything in between, like requests for attribution - are those “important”? There probably isn’t a single, simple answer.

Yes, the restrictions on the GSOD data OUTSIDE the US were something I considered to be important and interesting as a US citizen abroad. The BOM copyright info could likely go in the citation() header as well. I just checked the 3rd party app on my phone: that information is in the “about” that you have to actively open.

Requests for notification and citations, probably not as important and can be in README and citation() headers for sure.

Thanks for the discussion @raymondben - it’s definitely worth talking about

+1 for adding it to the packaging guidelines and perhaps even enforcing that submitted pkgs follow whatever guidelines we agree on.

I don’t like pkg startup messages either, but this is one case where startup messages seem more warranted.

Do we want to add guidelines for datasets to our packaging guide, @maelle @annakrystalli @lincoln @karthik @noamross?

I do something like this in rgbif where users can pass in output of a call to get data to another function that spits out citations for all the datasets used therein.
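
To make the general pattern concrete (this is not rgbif's actual code, just a minimal sketch), the idea is that each record carries a source identifier and a helper looks up citation details for the sources present in a result. The names `sources`, `get_data()` and `data_citations()`, and all citation strings, are invented for illustration.

```r
# Hypothetical registry of data sources; in a real package this might come
# from the provider's API or from a table shipped with the package.
sources <- data.frame(
  source_id = c("src_a", "src_b"),
  citation  = c("Provider A (2017). Dataset A. https://example.org/a",
                "Provider B (2018). Dataset B. https://example.org/b"),
  license   = c("CC BY 4.0", "CC0 1.0"),
  stringsAsFactors = FALSE
)

# Stand-in for a function that retrieves data: every row records its source.
get_data <- function() {
  data.frame(
    value     = c(1.2, 3.4, 5.6),
    source_id = c("src_a", "src_b", "src_a"),
    stringsAsFactors = FALSE
  )
}

# Return one citation and license per source actually present in `x`.
data_citations <- function(x, registry = sources) {
  used <- unique(x$source_id)
  registry[registry$source_id %in% used, c("source_id", "citation", "license")]
}

obs <- get_data()
print(data_citations(obs))
```

Because the lookup is driven by the data object rather than the package, a user who filters down to a subset of records gets only the citations and licenses that apply to what they kept.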

That’s great! I presume (having not rummaged around in your code) that you are using GBIF’s DOI service, and this may not be so easy for other data providers? Is there an opportunity to fill a void in the rOpenSci packageverse with something that would make this easier for other package developers? Even if a general solution isn’t yet feasible, it might at least encourage consistency across packages, in terms of how users can get citations/T&C’s for their data.

In rgbif we use the datasets route (GBIF registry API), which contains citation info for each particular dataset. The DOIs for GBIF are only associated with their download API (GBIF occurrence API). For those, we do include a citation for the download/DOI itself, as well as a citation for each dataset included in the downloaded data.

What do you have in mind?

Good conversation going on here. Couple thoughts I had while reading this:

  • Probably fairly obvious but, if this ends up getting added to the packaging guidelines, it would also be worth outlining what to do in the case that the data provider doesn’t have clear Terms of Use/license/etc. In such cases: is simply providing a link to the data source in a package’s README sufficient?

  • @sckott’s example for citing multiple data sources works well if the owner actually provides citation information for all ingested sources, but what are we going to do about data sources that pull from other data sources but don’t provide access to information about these sources? For example, one of my packages (kiwisR - not part of rOpenSci yet but hoping to submit soon) provides a wrapper for querying databases which often contain a mix of unique data and data ingested from other sources. How deep does the rabbit hole go here?

I suppose the tldr for both of these points is: to what extent are package authors responsible for gaps in the data sharing/stewardship policies of a public dataset’s owner?

Good point about what to do when no terms are given. We definitely should include something about that. I don’t know the answer.

On nested data sources: don’t know. I’d imagine something like this would make sense: “Package authors should make all reasonable efforts to document/provide dataset citation information”. I’d imagine there’s lots of variation here, so reviewers and pkg authors would have a back and forth to work this out.

My opinion: ultimately, they are not. And there are lots of datasets out there that don’t have clear T&C/license, or which are a mishmash of other data and therefore difficult to uniquely trace back. I think that’s an issue to be tackled by the data management and broader science communities. I would not want to see package development (and downstream benefits to data users) held back by those issues.
I would suggest though that there is an opportunity for package authors to improve clarity in some cases (contact the data providers, say that the community wants to use their data but also wants to give appropriate recognition/wants clarity on data lineage, please help us do so). In my experience, data providers are often quite open to such queries, and in some cases are prepared to adapt their T&Cs to better suit the community. But if that comes to naught, or the data lineage is a hideous mess, that should not stop us from building the software.
