Copyleft data, re: models

I have a question for all you copyleft nerds…

Is there a license which requires not only that data be open, but that any models built using that data also be open? There is a huge wave of data feeding into prediction models that then deliver solutions back to consumers. I don’t think one could overestimate the impact that data -> model -> prediction systems have and will have.

We are building a small lab to measure food and soil quality data. I’m wondering if I can require that anyone who builds models based on our data (which they definitely will) make those models public.

What do you guys think?

Is it simply a matter of releasing data under CC BY-SA? Are CC licenses good for data? Or is there a database-specific license that is also copyleft?

Well, that’s kind of my question. Reading through CC BY-SA 4.0, it seems to hinge on this definition:

Adapted Material means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image.

So there are two requirements for a work to fit this definition:

  1. material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material …
  2. Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission

For the copyleft clause to require that models derived from the data also be open, both requirements must hold. (1) is plausible, as models are “based upon” the licensed material. However, (2) fails because the data itself isn’t translated, altered, arranged, transformed, or otherwise modified (I dunno, maybe arranged?).

So a CC license feels like a maybe, but not quite… My guess is that this is a step further than folks have gone in the past. @Javier do you have an opinion here?


Hi @gbathree,

I have not done much thinking about open data. I do remember listening to Luis Villa (a very competent lawyer with lots of experience in open source) talk about that topic in a legal meeting. At the time, he was sceptical about copyleft for data. Here are three of his blog posts, which basically explain his thinking about this issue:

It looks like his basic premise is that data is (most often) not covered by copyright, but rather by database rights. Hope it helps! Cheers,


You might be able to impose conditions like that, but not under any existing data licenses that I’m aware of and it wouldn’t then be ‘open data’ because it would contravene the Open Definition:

Open means anyone can freely access, use, modify, and share for any purpose (subject, at most, to requirements that preserve provenance and openness).

It’s a bit more inflexible as a term than ‘open source’ - basically only attribution and share-alike are acceptable terms.

The issue with using copyright is that you have to be sure the data is copyrighted for that to work, and, as Javier points out, factual data doesn’t usually attract copyright.

Interesting thing to think through, though, as an example of being ‘less open’ in one respect (if measured against the Open Definition) in order to promote more openness…


Yes, this tension is real and has also been recognised in the case of software. The GPLv3 text summarises it eloquently in one sentence:

To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights.


Hi @gbathree,

Indeed, as Jenny pointed out above, adding such a requirement would render the license non-free. But there are some reciprocity licenses with that mindset, as discussed in some pages and one paper:

Article: Between copyleft and copyfarleft: advance reciprocity for the commons. By Miguel Said Vieira & Primavera De Filippi. Journal of Peer Production, Issue #4: Value and currency


Ok - so this is a learning experience for me. Thanks everyone for your thoughtful responses.

@rpez though that’s one interpretation or pathway for what I’m saying, I personally don’t really like copyfarleft. I’m not interested in making determinations about the values, organization, or anything else of the people using the material. I just want to propagate openness down the line wherever it goes, which I suppose is GPL v3.

@jcm80 GPL v3 is certainly one step less open. I suppose the most open option is simply to use the public domain and call it a day. While that’s fine, I think the success of software (and especially software libraries) in building massive amounts of publicly available code suggests that this relatively small step can have massive positive impacts. So that feels like a worthwhile step to me if possible.

@Javier Those three articles are a great read! Had he not written them, it would have taken years of ‘lawyering’, as he put it, to reach reasonable conclusions about these things. I think his suggestion of using norms rather than copyright to enforce copyleft makes sense; that’s probably the route we’ll go.

Finally - while those articles clearly indicate the perils of data copyleft protection, that doesn’t change the fact that it is without question one of the next great frontiers of bottled human knowledge. The future machines which we will be having conversations with, the miniaturized devices which will tell us all kinds of things about our food/soil/bodies, the sources of disaster prediction… these are all models built on data. And anyone building them will tell you that the data is 90% of the work.

So it’s a strange situation where if I built 90% of software code and made it GPL v3, and you added the last 10%, of course you could not make that private.

But if I collect all the data (90% of the work) and share it, and you feed it into your proprietary algorithm (10% of the work), you can make the result private.

This simply feels like a real problem that needs to be solved.

Reading this also helped me, and may help others interested in understanding data and copyright, at least in the US.

Thanks all –

A couple of talks at the upcoming FOSDEM 2018 which seem interesting:

Recordings are typically made available a few weeks after the event.