Research rights in the evolving generative artificial intelligence landscape
Take-home points:
- There is considerable legal uncertainty about protections for authors' rights, and creators have concerns about fairness and infringement
- Researchers also face uncertainty when conducting research using Gen AI tools because of copyright and e-resource licensing restrictions
- Libraries and publishers are often at odds in license negotiation about how materials may be used with AI for research purposes
- Researchers must confirm licensing terms before using digital resources for AI research
- Traditional publishing agreements have broad transfer rights that arguably give latitude to publishers to sell content to AI companies for training purposes
- Open access publishing and open licensing let authors retain copyright, but the legal protections these licenses provide are unclear
- Read and understand all aspects of publishing agreements, and negotiate for ‘opt out’ clauses for AI training if this reflects your values
What is Gen AI?
Gen AI tools use algorithms based on various kinds of models (such as diffusion models or large language models) to generate output. To do this, they first need to be trained on large data samples. The internet, with over a billion pages of content, is a tempting source for the large amounts of data needed to train tools like Stable Diffusion or ChatGPT. AI companies do not always explicitly disclose how much data they use or how they gather it, but it can be inferred from Gen AI output that data was obtained from the open internet. Companies like Meta have not denied the use of publicly accessible (but questionably legal) shadow libraries to train their early AI models.
How does Copyright factor in?
While copyright law makes a distinction between data and copyright protected works, full works like articles or photographs are ingested to train Generative AI models so that data can be extracted and weighed by the Gen AI algorithm. By detecting patterns and rules from the data, the Gen AI “learns” and creates output based on this training to respond to user prompts.
Even though they are publicly available on the internet, the works used for this training typically have copyright protection (sometimes in addition to terms and conditions which prohibit copying on the sites the works come from). There is currently legal uncertainty around whether copyright law permits this scraping or downloading, or if terms and conditions (which are like an agreement between a website user and the website owner) can prohibit these actions.
Gen AI companies have largely taken the position that copyright exemptions like Fair Dealing in Canada and Fair Use in the United States permit this copying, saying that the copying is being done for training or research purposes to generate something “new” that has a different goal or market than the original works used for training. They liken it to a human reading a text and then creating a new work based on what they have learned.
As a counterpoint, many creators and publishers have taken the position that this uncompensated copying is simply infringement. They point to examples of Gen AI output that is substantially similar to their original copyright protected work, and they argue that this output could compete in the market for their original works. Studies have shown that entering generic pop cultural prompts like “videogame plumber” or “animated toys” into popular Gen AI image generators produces images that default to well-known copyrighted icons like Nintendo’s Mario or Disney’s Toy Story characters.
To generate increasingly sophisticated output, large quantities of high-quality data are needed. Academic publications are a prime source for just this type of data. Whether training on copyright protected works without first securing permission is “fair” is currently unclear; several cases in both the United States and Canada are pending before the courts to determine this issue.
Terms and conditions apply
Copyright is not the only legal consideration with Gen AI; contracts also provide an important framework. The terms and conditions or licenses that apply to library e-resources, websites, apps, and other aspects of our digital lives are contracts that often dictate how digital copyright protected works can be used and by whom. This impacts researchers both in terms of how they can use copyright protected materials and how their own scholarship can be used with Gen AI.
Libraries' role and e-resource licensing
The primary role of the Electronic Resources Librarian at the University of Manitoba is to negotiate pricing and license agreements for all different types of e-resources – electronic journals, e-books, streaming films, digital music scores, and so on. One of the main goals of any license negotiation is to secure the widest array of user rights while limiting the University's risk. On the other side of the negotiation table, the publisher or vendor is attempting to safeguard their licensed materials and ensure there will be repercussions in the case of misuse.
Publishers and vendors are reworking their license agreements to include specific clauses regarding how and whether their licensed content can be used together with Gen AI. At the same time, students, faculty, and researchers at UM may want or need to use Gen AI in the course of their work. Protecting their right to conduct their research or studies as they see fit has now become more challenging.
Some publishers have attempted to forbid all usage of their materials in Gen AI; others have tried to include punitive measures against any users caught violating these restrictions. Libraries have been scrambling to understand what kinds of Gen AI-related clauses we should and should not be accepting in license agreements. The University of California at Berkeley Library was one of the first to offer up guidance and sample licensing language for librarians to use when countering publishers. Consortia like CRKN, OCUL, and COPPUL have been doing a lot of this work in Canada, spending months negotiating with publishers to reach acceptable compromises on Gen AI licensing language.
The language some major publishers have settled on tends to be complicated, stipulating that the Gen AI tool used by researchers must not be external or public-facing and must not retain licensed content (which is not always easy to determine). In this way, publishers are putting the onus on users to determine which specific Gen AI tools are acceptable to use.
Gen AI restrictions are now visible in Library Search by clicking the ‘Show License’ button next to the resource access link. While attempts have been made to make the language as clear and concise as possible, users should consult the UM guide to Gen AI tools for further information: https://libguides.lib.umanitoba.ca/ai/tools
Licensing and the journal publishing agreement
Publishing agreements are a necessary step towards publication. While they can seem esoteric and shrouded in legalese, understanding their terms is essential because they set out the rights of both the publisher and the author. In traditional academic journal publishing agreements, authors have typically transferred (or “assigned”) the copyright in their manuscript to the publisher in exchange for the benefit of being published. The transfer of copyright means that the publisher can control all future publication and grant permissions for uses of the article, although the author may also retain limited rights, like the ability to make copies of their pre-print manuscript available.
The movement towards open scholarship and Open Access to academic publications has brought a change to the traditional academic journal publishing agreement. An open publishing agreement frequently allows authors to retain their copyright while applying an open license like Creative Commons. These Creative Commons licenses allow others to do a variety of things with the work that they are applied to, including making it openly available to read and, in some cases, adapt.
Because of the transfer of copyright in traditional publishing agreements, publishers have broad rights to determine how academic content can be used, and they are not required to compensate or consult with authors when they do so. Since many Gen AI companies have proceeded with training on copyright protected works despite the uncertainty about the fairness of this practice, some publishers have instead decided to create certainty and ensure compensation by entering licensing agreements with Gen AI companies that grant them permission to train.
An Open Access work under a Creative Commons license may not always require further licensing for AI training, because the license itself may already permit the use. Creative Commons licenses also do not restrict the use of copyright exemptions, so they cannot prevent AI training done on a “fairness” basis. This means there are two potential ways in which Gen AI companies may legally train on research journal publications regardless of their agreements with publishers, although whether “fairness” qualifies as a basis for this is yet to be determined.
Where does this leave researchers?
Many researchers have concerns about their work being used in ways that were never contemplated when they signed publishing agreements or chose open licensing, but the current legal and regulatory atmosphere remains unsettled. Right now, a “watch and wait” approach is necessary. We are seeing regulation in the European Union based on an “opt out” model, where authors can request that their works be excluded from training data, but the North American approach is still evolving. Creative Commons has suggested developing “preference signals” to express creator preferences for AI training, but it is unlikely they would have legal effect to prevent training.
If researchers do not want their work used for Gen AI, it is important that they address this issue in any publishing agreement and that they know publishers’ stances on AI training and whether those stances align with their own. Remember that a publishing agreement can always be negotiated. Applying statements to academic works expressing the desire to “opt out” of AI training would also not go amiss during this period of uncertainty. While this may not ultimately prevent Gen AI training, it is a proactive step that researchers can take to voice their concerns while the Gen AI landscape continues to develop.
At the same time, if researchers want to use copyright protected materials with Gen AI for their own research, they must first confirm any terms and conditions or licensing that apply to the materials. Gen AI presents both challenges and opportunities for researchers, and a new balance must be struck between how we all use copyright protected materials and how we allow others to use the materials that we have created.
If you have questions or would like a session about research publications, licensing, or copyright and Gen AI, please contact: LIBRESRV@umanitoba.ca or um.copyright@umanitoba.ca