Copyright practices for *analysis* code

For those of you who post analysis code (as opposed to R packages) on GitHub or other public sites, how do you license it? CC-BY? Something else?

For, say, the code used to create ggplot2, there seem to be a few solid options, each with pros & cons. For the code I write to create a plot using ggplot2, though, I’m… confused. :upside_down_face: My most common use case is a repo with an RMarkdown file, often with an R script or two of functions - possibly considered a research compendium. So it’s a combination of code that could become “software,” and code that is used to calculate numbers or create plots.

Related discussions & resources:

  • This thread is related, but refers primarily to a website, which seems to lean more heavily toward “creative”-ish content and I’m not sure how much it applies to code for analysis.
  • The BC Gov team has a great guide and uses Apache 2.0 for basically all code, as I understand it.
5 Likes

While I’m still analyzing, I don’t add any license; I delay the decision until the results are published (if ever). Then I think about what to do. I really don’t mind someone else using the code, but if they do something with it, I’d like them to attribute it back. So I will probably lean toward MIT while asking for a citation in the README of the repository.

1 Like

Having a license, whatever it is, is much, much better than not having one at all. So, if you can’t decide, at least put something. You can always change the license later (for future versions - you obviously cannot change the past).

There are many options, depending on your objectives. For scientific work, I often find it useful to put myself in the shoes of people who’ll find my work. If they find the artifact (code, document, …) useful, they’ll ask themselves questions like:

  • Am I allowed to use this in what I’m doing right now?

  • If I improve it and contribute back, will my contribution be available to others? Is it worth my time? Is it fair?

  • Am I allowed to improve upon it and then redistribute at my own will? (e.g. part of a package, a document, a publication, …)

If I find work of interest without a license, I try to reach out to the authors and convince them to add a license. Working with non-licensed artifacts is quite uncertain. You may think you’ll do something quick, but then you find a bug and you fix it, and then you’re in limbo as to whether you can share that fixed version with the world or not.

If one of the copyright holders (typically all authors) passes away, then I don’t even wanna think what it takes to get a license in place. Legally, will that be the end of that artifact? I assume this will be a more common problem as more and more artifacts are produced each day. BTW, Software Freedom Conservancy is one organization that thinks about the copyright beyond the life of software maintainers.

EDIT: Added link to Software Freedom Conservancy.

2 Likes

As a general rule of thumb, if my code is just analysis stuff, I put it out there under a public domain dedication (https://choosealicense.com/licenses/unlicense/). The exceptions are when my code is part of a simulation project, an R package, a Shiny app, or an academic paper; then I default to the GNU GPL v3 (https://choosealicense.com/licenses/gpl-3.0/). The question I ask myself when picking between these two is whether I think I should be cited in a paper for my contributions; if so, I go GPL.

1 Like

I was so excited about this question that I signed up for the rOpenSci Discuss forum!

This is an area where I haven’t been particularly consistent but I feel like I’m getting closer to a “standard” for myself to follow. I had originally started using MIT and GPL v3 licenses in my GitHub repositories. This always felt unsatisfying, in part because I feel like the spirit (and content) of these licenses is so focused on software.

I switched to using CC-BY-SA licenses in most of my analysis and teaching materials on GitHub. I like that the CC licenses encompass more of what I see when I look in my research/teaching projects. There is code, yes, but there is also a lot of narrative in my .Rmd files. There are often plots, maps, and other forms of “results” (like tables, statistical output, etc). To me, the CC family of licenses are well equipped to describe these various products that make up a research repo.

I also think that the CC family of licenses are well known enough, and are clearly articulated enough, that folks who stumble into my world for one reason or another can answer the questions that @HenrikBengtsson lays out. The GPL license, which I use for my R packages, feels so specifically tailored to software that it feels less appropriate in this context.

I’ve recently started using Open Science Framework, which does not give CC-BY-SA licenses as an option (you can see their documentation on licensing here). To keep things compatible, I’m going to move my research repos that also feed OSF projects to CC-BY.

This has raised a question for me about whether I should transition all of my GitHub content from CC-BY-SA to CC-BY. I like the copy-left protections that come with the slightly more restrictive license (for the same reason I use GPL v3 in my R packages). Haven’t decided one way or another on this…

2 Likes

I love that you were so excited about licenses you signed up for the board!

2 Likes

Thanks so much for the discussion, @llrs @HenrikBengtsson @kylehamilton @chris.prener! I’m glad there’s not a simple solution I’m missing.

I’ve been licensing everything, as Henrik suggests, but basically using either GNU or MIT because GitHub makes those easy to both use and find info on. CC- of some kind seems to make more sense, though, for the points that Chris lays out. But my understanding of all the legal differences is fairly limited at this point, so I’m curious to see if the community lands on a standard eventually.

2 Likes

GitHub has, for at least the last year or so, “recognized” the CC licenses: it parses their plain text variants and displays the license type both in a user’s or org’s repo list and at the top of the Code tab, next to where the number of commits etc. are displayed. You can see that behavior here on one of my repos. This repo contains plain text versions of all the CC v4.0 licenses.

What GitHub doesn’t currently offer are CC licenses built into the choose a license tool. There is a relatively recent issue on this topic, however.

Ah, nice! I hadn’t noticed that - thanks for pointing it out. Yes, the built-in tool is exactly what I meant - it’s so difficult to copy and paste license text :wink: Glad to know there’s a conversation about it!

1 Like

I’ve seen a few repos where authors put two licenses in a LICENSE file, one for code and one for prose. I like this and would be inclined to do so with MIT and CC-BY licenses for most of my stuff. What do people think of this practice? It seems like it makes things harder for automated tooling, but besides that it best reflects the spirit of what I’m going for.

In the spirit of @HenrikBengtsson, I also like to think of a user who isn’t as familiar with licensing, and to put a short, plain language explanation of my preferred reuse/citation approach in the README or CONTRIBUTING file.
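As a rough sketch of what that plain language section could look like, here is a small Python helper that builds the text. It assumes MIT for code and CC-BY 4.0 for prose, as mentioned above; the function name `reuse_notice`, the section heading, and all of the wording are hypothetical and only meant as a starting point.

```python
def reuse_notice(code_license="MIT", prose_license="CC-BY 4.0", citation=None):
    """Return a plain-language 'License and reuse' section for a README.

    All wording here is illustrative; adjust it to what you actually intend.
    """
    lines = [
        "## License and reuse",
        "",
        f"- Code (scripts and code chunks): {code_license}",
        f"- Prose, figures, and other outputs: {prose_license}",
        "",
        "You are welcome to reuse and adapt this work under those terms.",
    ]
    if citation:
        # A citation request is a community norm, not a license condition.
        lines += ["", f"If you use it in a publication, please cite: {citation}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder citation; replace with your own.
    print(reuse_notice(citation="<your preferred citation>"))
```

Keeping the citation request separate from the license lines makes it clear that it is a request and not an extra legal condition.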

2 Likes

I used to do this, but stopped because I use .Rmd notebooks pretty extensively and it was unclear how those fit into a GPL & CC-BY-SA logic. I follow a literate programming model for my data cleaning and analysis notebooks, so there is both prose/narrative and code in them, as well as plots and often maps. Where I got hung up was whether the code chunks were covered by one license and the narrative by another. Not sure what the right answer is, but in part because of this I’ve abandoned the practice and switched to just using CC licenses. Would be curious to hear what others’ thoughts are…

Going to adopt this - I think it is a great idea!

2 Likes

IANAL and all, but here’s my take.

In my reading of the Creative Commons FAQ on CC licenses and software and related links, listing both a CC and Software license would mean that both apply to all content, unless explicitly indicated otherwise.

Version 4.0 of CC’s Attribution-ShareAlike (BY-SA) license is one-way compatible with the GNU General Public License version 3.0 (GPLv3). This compatibility mechanism is designed for situations in which content is integrated into software code in a way that makes it difficult or impossible to distinguish the two. There are special considerations required before using this compatibility mechanism. Read more about it here.

For instance, StackOverflow runs under a CC-BY-SA license, so technically if you re-use stuff from StackOverflow, you should probably use CC-BY-SA, and this would allow derivatives of your code to be released with a GPL license.

Note that I think the situation is somewhat simpler if your code & other content is all original to you and not encumbered by viral clauses of licenses like BY-SA or GPL. I think any CC-BY or CC-BY-SA would thus be fine for most Rmd notebooks etc, while the CC-*-ND licenses would clearly prevent re-use.


You can also declare CC0, which is valid for software and other content. Then your content is compatible with GPL, CC, or any other (open or otherwise) license someone wants to use downstream.

Because CC0 forgoes the “attribution” clause of CC-BY, I think as academics we often conflate that with waiving the expectation of citation. IMHO, citation is a question of academic norms and integrity that is wholly separate from the legal protections of copyright. (For instance, no one would argue it was okay to omit all citations to Newton or Shakespeare, whose works are all in the public domain).

CC0 also makes no mention of patent rights, except to say that it makes no mention of patent rights:

No trademark or patent rights held by Affirmer are waived, abandoned,
surrendered, licensed or otherwise affected by this document

For this reason, the Open Source Initiative does not recognize CC0 as an OSI-approved license (as opposed to licenses like MIT & BSD, which simply avoid mentioning the word “patent” one way or another). Go figure.

2 Likes

That’s a really interesting approach, @noamross - I like trying to cover all the bases with the most appropriate solution. I do think for a typical analyst working on an Rmd file (or Jupyter notebook, or…), who probably isn’t paying such close attention to licenses, having a recommendation along the lines of “this is the best general purpose solution we’ve found” would be really helpful, if such a thing is reasonable to determine. So far it sounds like CC-BY may be the best option for that, or CC-BY-SA if you have opinions on licensing for derivative works (please correct me if I’m misunderstanding the discussion!).

I also love the idea of adding reuse/citation info to the README! And totally agree with @cboettig on legalities vs community norms. I think most of us are trying to do the right thing, but don’t always know the “right” way to do it.

3 Likes

The Data Elixir newsletter for this week posted this help article with a rundown of licensing options. Thought some folks might find it useful!

2 Likes

I’ve encountered the MIT license in the wild: https://github.com/natematias/poweranalysis-onlinebehavior/blob/30154624de55ccfe7507cb8b20106cf4b60f0fb0/Choosing-Sample-and-Estimators.ipynb

Because the license was stuck at the very end of the notebook, I actually saw it, for whatever that’s worth (instead of it sitting in another file in the repo, the package DESCRIPTION, etc).
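For anyone who wants to do the same, here is a minimal sketch of appending a license notice as the last Markdown cell of a Jupyter notebook, using only Python’s standard library. I don’t know how the author of the notebook above added theirs; the function names and the notice text are my own assumptions, and the code relies on the notebook being in nbformat 4 JSON.

```python
import json
import uuid


def append_license_cell(notebook, notice):
    """Append a Markdown cell containing `notice` to a notebook dict (nbformat 4)."""
    cell = {"cell_type": "markdown", "metadata": {}, "source": notice}
    # Notebook format 4.5 and later expects every cell to carry an id.
    if notebook.get("nbformat_minor", 0) >= 5:
        cell["id"] = uuid.uuid4().hex[:8]
    notebook.setdefault("cells", []).append(cell)
    return notebook


def add_license_to_file(path, notice):
    """Read the .ipynb file at `path`, append the notice cell, and write it back."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    append_license_cell(notebook, notice)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(notebook, f, indent=1)


if __name__ == "__main__":
    # In-memory demo on an empty notebook; the notice text is hypothetical.
    demo = {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 4}
    append_license_cell(demo, "## License\n\nReleased under the MIT License.")
    print(demo["cells"][-1]["source"])
```

To use it on a real file, something like `add_license_to_file("analysis.ipynb", notice)` would do, where the file name is just a placeholder.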

1 Like