Data license visibility: is rOpenSci doing enough?
This issue has been touched on at various times by the rOpenSci community, but I have not seen a thorough discussion. Please point it out if I’ve missed it!
Background
Many R packages now provide access to data. Some (most?) provide a mechanism for retrieving data from a remote source via an API or similar (let’s call these “API packages” for want of a better term), whereas other packages provide data directly bundled into the package (call these “data packages”).
The problem: it remains uncommon for such packages to actively inform the user of the license and conditions under which they are accessing data.
Why I think this is an issue
Many data sets are released under a formal license. That license governs what the end user can do with the data and under what conditions. In the case of packages, the code license may be, and probably usually is, separate to and different from the data license. The code may have been developed or funded by different people/organisations to those responsible for the data: both deserve credit for their contributions.
In terms of formal data licensing, CC-BY licenses are common, and require the user to properly acknowledge the source of the data. Some licenses might impose additional restrictions (e.g. no commercial use). Many data sets are released without a formal license but with a request or requirement that users cite the source of the data.
It is important that users are aware of the license or conditions under which they are accessing data. Not knowing this might lead to users unknowingly breaching license conditions, or inadvertently failing to conform with requests that the data provider has made (e.g. acknowledging the source). Even if data are released without restriction (e.g. under a CC0 waiver or similar), users should know what the conditions governing the data are.
Scientists are sometimes reluctant to release their data, and “I won’t get recognition for it” or “my data will be misused” are common reasons for this. It’s probably true to say that this is less of an issue now than it has been in the past, but it nevertheless remains true in some quarters.
The data management and wider communities have worked to establish community norms on data citation and good data behaviour, and to build trust within communities to facilitate an open data culture. Users who don’t comply with such norms — knowingly or otherwise — risk undermining some of that trust. The data license is often tied up with the data’s metadata record or other documentation, and so drawing attention to the license also gives the opportunity to draw the user’s attention to the data documentation, which might include caveats on its re-use.
But …
… isn’t it the user’s responsibility to educate themselves about the data they are using?
Of course. But rOpenSci has the opportunity to play a role in promoting good data-use behaviour, and I don’t believe that this opportunity is currently being exercised as fully as it might be.
Where are we now?
A quick (and definitely incomplete) audit of current rOpenSci packages suggests that the overwhelming majority of data/API packages provide links to the sources of the data in the package README, vignette, package-level documentation, or similar. Far fewer packages actively provide the user with a clear statement of the data license, restrictions, or (if appropriate) required citation. Some packages certainly do (see examples below) but it does not seem to be standard practice.
Some packages provide access to a data source that requires the user to have an account with the provider organisation: in those cases, one could reasonably assume that the source organisation will make the user aware of the conditions governing data use.
OK, so what?
The point of this rant is to provoke some discussion within the community. Is this seen as a worthwhile consideration? If it is, should data license visibility be a routine part of rOpenSci practice? Should it be added to the packaging guide?
How might we achieve this? Suggestions welcome, but I don’t think that this process needs to be particularly onerous. For example:
- the README should not only include the data source URLs but also the license and citation details, along with links to metadata records or other documentation
- consider making this information more prominently visible to users (who might not, after all, read the README) by using `packageStartupMessage()`
- add the data license details to the `LICENSE` file (see igraphdata for an example), and
- add the data citation details to the `inst/CITATION` file so that `citation("mypackage")` gives both the package citation and the data citation. Note that a distinction should be maintained between the package license and citation, and the data license and citation. The intention is not for package authors to lose credit or visibility of their contributions, but rather to promote those of the data providers
- if the package retrieves data from multiple sources, consider including source information in data objects, or provide metadata or functions to help the user know which data came from which source (see e.g. the `indicator_metadata` function in fingertipsR, or the data source template in bowerbird).
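To make a few of these suggestions concrete, here is a minimal sketch of what they might look like in a package. Everything specific in it is a placeholder I have made up for illustration: the package name `mypackage`, the "Example Data Provider", the `example.org` URL, the CC-BY 4.0 license, the year, and the `with_source()` helper with its `"data_source"` attribute. It is one possible approach, not a recommended standard.

```r
# --- R/zzz.R (hypothetical package "mypackage") -----------------------------
# Show the data license and a citation pointer when the package is attached.
.onAttach <- function(libname, pkgname) {
  packageStartupMessage(
    "Data (c) Example Data Provider, licensed under CC-BY 4.0.\n",
    "See https://example.org/data-license for conditions of use.\n",
    "Run citation(\"", pkgname, "\") for the package and data citations."
  )
}

# --- inst/CITATION -----------------------------------------------------------
# Separate entries keep the package citation distinct from the data citation.
# They are assigned to names here only so the sketch can be run as a script;
# in a real CITATION file they would typically be bare bibentry() calls.
package_entry <- utils::bibentry(
  bibtype = "Manual",
  title   = "mypackage: Access to Example Data",   # placeholder
  author  = utils::person("A.", "Developer"),      # placeholder
  year    = "2018"                                 # placeholder
)
data_entry <- utils::bibentry(
  bibtype = "Misc",
  title   = "Example Data Set",                    # placeholder
  author  = utils::person("Example Data Provider"),
  year    = "2018",                                # placeholder
  url     = "https://example.org/data",
  note    = "Data licensed under CC-BY 4.0"
)

# --- R/data.R ----------------------------------------------------------------
# Hypothetical helper: tag a returned data object with where it came from,
# so that users of multi-source packages can check the origin of each object.
with_source <- function(x, source, license, citation = NULL) {
  attr(x, "data_source") <- list(
    source   = source,
    license  = license,
    citation = citation
  )
  x
}

# Usage: attach the source details to a result, then read them back.
obs <- with_source(
  data.frame(station = "A", temp = 21.5),
  source  = "Example Data Provider",
  license = "CC-BY 4.0"
)
attr(obs, "data_source")$license
```

Using `packageStartupMessage()` rather than `message()` or `cat()` means users who do not want to see the notice can silence it with `suppressPackageStartupMessages()`. One caveat on the attribute approach: some operations on data objects may not preserve custom attributes, so a separate metadata function (as in fingertipsR) may be more robust.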
If you are a package developer working with a data source that does not have a clear license/terms of use, encourage the provider to give one.
As a final note, and to end with a positive example, see the bomrang README, which has links to its Bureau of Meteorology data sources, along with:
All data is copyright Australia Bureau of Meteorology
BOM Copyright Notice http://reg.bom.gov.au/other/copyright.shtml
And furthermore, when using the package:
> library(bomrang)
Data (c) Australian Government Bureau of Meteorology,
Creative Commons (CC) Attribution 3.0 licence or
Public Access Licence (PAL) as appropriate.
See http://www.bom.gov.au/other/copyright.shtml
If you use bomrang, please cite it.
See `citation('bomrang')` for the proper citation.
Bravo @adamhsparks and coauthors!