(This post is co-authored with Sneha Jain, Partner, Saikrishna & Associates)
The scope of copyright liability of Generative AI (‘genAI’) models is a hot topic globally. Copyright issues that stem from genAI technology can be categorized under four heads. All the litigations in the United States fall under one of these four heads:
- Allegation of copyright infringement due to copying/storage of copyrighted works as data sets for the purpose of training models;
- Allegation of copyright infringement due to substantial similarity of the output produced, as well as the output produced being based on the inputted copyrighted work;
- Allegation of copyright infringement due to lack of attribution or lack of disclosure/tampering with Rights Management Information;
- Whether genAI models can be “authors” for the purposes of copyright law.
Most defense briefs in the various litigations filed in the US till now have relied upon the transformative fair use defense to avoid copyright liability. Relying on the idea-expression dichotomy, these briefs have argued that genAI models have not copied any protectable copyright “expression” but only copied unprotectable ideas.
While the contours of the idea-expression dichotomy, the merger doctrine, as well as protectable subject matter, as applied in India, remain largely similar to US copyright jurisprudence, the transformative fair use defense, as it has developed in the US, is not statutorily available under Indian law (though it arguably is available under judge-made law, see Syndicate of The Press for the University of Cambridge v. B.D. Bhandari and Anr., (2011) SCC OnLine Del 3215). A question then arises: how will such litigations fare under Indian copyright law? Will genAI tool providers like ChatGPT, Sora, SDXL Turbo, Google’s Music LM etc., face incremental risk under Indian law, even if they succeed in their transformative fair use defense under US law?
Through this series of articles, we will be exploring the peculiarities of Indian copyright law that may pose incremental risk to genAI tool developers, as well as models. Before we dive into legal issues, it is crucial to understand how a genAI tool is created and how it works.
How are genAI tools developed:
Visualize how a child learns reading and writing: by copying, imitating and repeatedly tracing the alphabet (ABCs), followed by simple words, sentences and so on. Similarly, visualize how a child learns to speak: by listening to and repeating sounds and words spoken by a parent, teacher or other caregiver. Having learnt how to read, write and speak, the same child, being exposed to a wide spectrum of social, cultural and informational content and experiences, is not only intrinsically shaped by such content and experiences but also shapes the cultural realm through her contributions.
It is this process of being shaped by, and at the same time shaping back, the cultural realm that genAI is mimicking through its algorithms, which read the vast data sets of content and information available in digital form (‘training sets’) and extract ‘knowledge’ from the training sets. The ‘knowledge’ is nothing but the meta-information embedded within the training sets. This knowledge extraction happens by, firstly, breaking and categorizing the data into fundamental ‘tokens’; secondly, identifying statistical patterns from the placement of such tokens to learn the relevance and context of each word in a sentence; and thirdly, applying the knowledge to predict answers based on the statistical patterns learnt. Thus, what a genAI system most likely tends to do is “produce a ‘reasonable continuation’ of whatever text it’s got so far”. It essentially mimics the process of learning and knowledge sharing adopted by a human mind, by converting words into numbers (tokens) and finding massive statistical patterns for learning through the numbers. In other words, creators of genAI tools/models are attempting to create a human brain through computers, as opposed to through natural conception or IVF.
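The three steps described above (tokenizing, identifying statistical patterns, and predicting a continuation) can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: it splits on whitespace and counts which word follows which, whereas real genAI models use subword tokens and neural networks with billions of parameters. The function names and the sample sentence are hypothetical and chosen purely for illustration.

```python
from collections import Counter, defaultdict

def tokenize(text):
    """Step 1: break text into lowercase word tokens (real systems use subword tokens)."""
    return text.lower().split()

def build_vocab(tokens):
    """Convert words into numbers: give each distinct token an integer id."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def learn_patterns(tokens):
    """Step 2: count how often each token is followed by each other token."""
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(follows, token):
    """Step 3: return the most frequent continuation seen in training, or None."""
    if token not in follows:
        return None
    return follows[token].most_common(1)[0][0]

# Hypothetical one-sentence "training set".
corpus = "the cat sat on the mat and the cat slept"
tokens = tokenize(corpus)
vocab = build_vocab(tokens)
patterns = learn_patterns(tokens)

print(vocab["the"])                   # integer id assigned to "the"
print(predict_next(patterns, "the"))  # most frequent word after "the"
```

In this toy, prediction consults only the table of counts built in step 2; how far that observation carries over to large models is a separate question that the sketch does not settle.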
Whether storage by genAI systems is copyright infringement?
The current stage of training genAI models involves making copies (fixing) of data sets, which include copyright protectable works, and storing them for varied periods. Storage of data sets for the purposes of training can happen in three distinct ways:
- Storage throughout the subsistence and use of the models.
- Storage until the data is extracted and absorbed.
- No storage, and use of Federated or Collaborative learning, where data sets are not stored on a centralized cloud server. The training happens through data on decentralized servers, i.e., without storing data on any particular server.
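The third approach can be illustrated with a minimal sketch of federated averaging. It rests on simplifying assumptions: a one-parameter model (y = w * x), three hypothetical data holders, and plain gradient descent. The names `local_update` and `federated_round` are invented for this example and do not describe any particular vendor's system.

```python
def local_update(weight, data, lr=0.01, steps=50):
    """Run gradient descent for y = w * x on one holder's local data."""
    for _ in range(steps):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight

def federated_round(global_weight, clients):
    """Each holder trains locally; only the resulting weights are combined."""
    local_weights = [local_update(global_weight, data) for data in clients]
    sizes = [len(data) for data in clients]
    # Average the local weights, weighted by how many examples each holder has.
    return sum(w * n for w, n in zip(local_weights, sizes)) / sum(sizes)

# Hypothetical local data sets; each list stays with its own holder.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(4.0, 8.0), (5.0, 10.0)],
]

w = 0.0
for _ in range(20):
    w = federated_round(w, clients)

print(round(w, 3))  # approaches 2.0, the slope shared by all three data sets
```

In this sketch the combining step receives only numeric weights and example counts from each holder, never the underlying (x, y) pairs.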
It is important to note that even where a copy of the work is made and stored, that copy is used by the model developers solely for extracting, through the model, the meta-information contained within the expression of the content, and is not exposed to any human. Copying and storing are two different acts or uses of a copyrighted work. In training genAI models, the model does read the content in order to tokenize it, adjust the model’s weights and parameters, and gauge the logic of the next possible sequence. It is, however, not reading or enjoying a copyrighted work in the context in which a copyrighted work is meant to be seen, heard or enjoyed. For instance, a musician does not produce a song for the primary purpose of it being used for training. The primary purpose of the song is entertainment.
Under the Indian Copyright Act, the exclusive right of reproduction is conferred on owners of literary, dramatic, musical and artistic works, sound recordings and cinematographic films, as well as on the owners of performers’ rights and broadcast reproduction rights. While the contours of the right may be different for each of these, the common thread is that reproduction and storage mostly go hand in hand. The Copyright Act distinctly provides an exclusive right to the copyright owner of a literary, dramatic or musical work, under Section 14(a)(i), to reproduce the work in any material form, including the storing of it in any medium by electronic means. It also provides an exclusive right to the copyright owner of an artistic work, under Section 14(c)(i), to reproduce the work, including storing it in any material form. In the context of cinematographic films and sound recordings, Sections 14(d)(i) and 14(e)(i) distinctly provide an exclusive right to copyright owners to make a copy of the film/sound recording, including storing of it in any medium. Neither “reproduction” nor “copy” is defined in the Act. However, the definition of an “infringing copy” under Section 2(m) of the Act clearly differentiates the concepts of “reproduction” and “making a copy”, as applicable to different sets of works. Arguably, this is to remove any association of physicality from literary, artistic, dramatic or musical works (i.e., underlying works) and to show that what is relevant is reproduction of their forms of expression, and not the mere act of making copies, which may not be for the purpose of reproducing the expression. The dictionary meaning of reproduction is to create or bring into existence again, and of copy is to imitate or transcribe. Even the Delhi High Court has read the meaning of copy expansively to include imitation of the substance copied, and not merely a physical copy, see MRF v. Metro Tyres, (2019) SCC OnLine Del 8973.
Reproduction includes the act of storing the expression of the work in any medium by electronic means. This deeming fiction of including “storage” within the meaning of reproduction was brought in by the 1994 Amendment to the Copyright Act to comply with TRIPS, which extended protection to broadcasters and producers of phonograms. The Parliamentary Standing Committee Report in 2010 clarified that storage was to be held to be infringing specifically qua Internet Service Providers, who would unauthorizedly store content to provide exposure to the same for impermissible purposes.
The reproduction right protects recompense in the primary market for the owner of the work. It protects the owner of copyright from losing economic returns through substitution in its primary market, whether by the act of copying the expression of the work or by unauthorizedly exposing the expressive originality of the work. This right is limited by various doctrines that have been developed by courts. For instance, courts do not extend the primary market of the work to ideas embedded within the expression. The idea-expression dichotomy clearly recognizes that protection is limited to the expressive form, and that the right extends only to preventing unauthorized reproduction of the expressive form of the work. This dichotomy has even been recognised in Article 9.2 of the TRIPS Agreement, which also explains that protection extends only to the original way in which the information or idea is expressed, and not to the information or idea embedded in the work. The Supreme Court has also recognised, while providing helpful guidance on the meaning of what constitutes a “copy” under the Act, that the fundamental fact to be determined for a violation of copyright is whether the manner, arrangement, situation to situation, scene to scene with minor changes or super additions have been adopted, as against the mere idea or information embedded, see R.G. Anand v. Delux Films, (1978) 4 SCC 118. Even the Division Bench of the Calcutta High Court has recognised that ideas embedded within works are not protected, and only if the expression is appropriated would it form subject matter of copyright protection, see Barbara Taylor Bradford v. Sahara Media, (2003) SCC OnLine Cal 323. The rationale stems from the principle that copyright does not give an exclusive right over the information, experiences or facts embedded, but only over the concrete form in which these ideas are developed.
Thus, unless reproduction, including storage, is for the purpose of exploiting or substituting the market of the copyright owner in this concrete form, it would not be copyright’s concern. This principle reflects the complex compromise that copyright strikes with the freedom of speech, where the use of speech is restricted only to the extent of reproduction of its concrete form, in order to incentivize and acknowledge the creator of the concrete form of the speech, but not as regards the idea or information embedded within the speech. The Division Bench of the Delhi High Court has also recognised that copyright consciously restricts its application to ensure it does not override concerns of Article 19(1)(a) of the Constitution of India, see Wiley Eastern Ltd. v. Indian Institute of Management, (1995) SCC OnLine Del 784.
The merger doctrine, also recognised in India, further limits protection in those cases where the ideas can only be expressed in a limited number of ways, are functional, or are core to the genre of expression. Here as well, protection is limited to the concrete expressive form of the work and does not extend so as to monopolize, in any way, the idea embedded. Moreover, the de minimis rule further limits protection, in that trivial parts of a work, which do not form a substantial part of its expressive form, are not protected when used.
The focus of the reproduction right, as can be seen from these limiting doctrines, is on unauthorized exposure to, or consumption of, the expressive forms of the work, as against use to extract ideas or the meta-information embedded in the works. In fact, these doctrines make sure that copyright does not stifle the flow of ideas, but protects the expressive form in which these ideas are embedded in order to provide economic incentives for people to clothe these ideas in different original expressions.
The question is whether copying or storing that is completely non-expressive or non-consumptive is an act of infringement. That is, copying that does not involve appropriating the expression of the work or exposing the expression to any human being, but is only for the purpose of extracting meta-information for adjusting the model’s weights and parameters and training the genAI model. Would extraction of ideas constitute an existing market?
A few examples which scholars quote are as follows. Can reproduction of a book for use as a doorknob (a purpose for which the book hasn’t been written or published) be infringement, merely because a copy of the physical book was made? Can storage of student papers on plagiarism software, to detect whether a student plagiarized their paper from other papers available on the internet, be infringing use/copy/storage? See A.V. v. Iparadigms, Limited Liability Company, 544 F. Supp. 2d 473 (E.D. Va. 2008). Can web-crawling software that makes cached copies of works on the internet, in order to enable search engines to respond to search queries by matching them with cached data, be infringing use/copy/storage that is a part of the reproduction right? See Field v. Google, 412 F. Supp. 2d 1106 (D. Nev. 2006). Can use of a book for following the procedures provided therein be infringement of the reproduction right? Can use of books for allowing search of the said books by search engines amount to infringement of the reproduction right? See Authors Guild v. Google, Case No. 13-4829 (2d Cir. 2015). These are questions that Courts will have to grapple with in the coming times.
A purposive interpretation of the meaning of “Reproduction, including Storing of works” within Section 14 of the Copyright Act would probably exclude an exclusive right over storage that is not for the purpose of expressive reproduction and is only for the purpose of extracting meta-information, which is in any case shielded by the limiting idea-expression dichotomy in copyright law. The physical fact of storage or copying would be irrelevant to such an analysis, as long as the form of expression, i.e., the protected element in the work, is not being exposed to anyone.
To the contrary, however, a literal construction of the said provision would probably lead to a conclusion that extends the primary market of the copyright owner even to the mere storage or copying of the work, irrespective of whether the same is for a reproductive purpose (in an expressive context) or not.
Which way the courts will go remains to be seen.
Transient Storage
Even if storage is considered to be infringing under Section 14 read with Section 51 of the Copyright Act, Section 52(1)(b) and (c) specifically provide for exemption of transient or incidental storage of a work purely in the technical process of electronic transmission, and transient or incidental storage for the purpose of providing electronic links or access where the same is not expressly prohibited or infringing.
Courts will have to grapple with the questions of (a) whether storage of training data sets for the training period can be considered transient; and (b) whether storage of training data sets would amount to being incidental to providing access to the genAI model to extract meta-information.
The concept of “transient and incidental storage” was somewhat clarified by the Delhi High Court in MySpace Inc. v. Super Cassettes Industries Ltd, (2016) SCC OnLine Del 6382. In MySpace, the Court was dealing with the question of whether MySpace could be obligated to monitor, review and report any infringing content of Super Cassettes on its platform. The Court, while analyzing the purpose of the transient or incidental storage exception, held transient to mean temporary, and incidental to mean subordinate to something of greater importance. This was deemed to include “cached data”, or other data generated automatically to improve performance of the core permissible function. Moreover, the text of the Copyright (Amendment) Bill which introduced Section 52(1)(c) shows that storage is permissible when exposure as a result of storage is permissible and non-infringing.
Thus, it is arguable that storage for the sole purpose and functionality of training, which arguably is a transformative and permissible purpose, would be incidental storage that is permissible under the said section. However, Courts are yet to clarify this.
On the aspect of temporary storage, legality would depend on how long the storage lasts. If the data set is automatically removed once the meta-information used for training is extracted, it is arguable that storage would be transient and temporary, all the more because no human is exposed to the stored copy. However, Courts would have to render more clarity on this aspect.
In any case, the next part of this series will delve deeper into use for extractive purposes and whether any of the defenses under Section 52 (including fair dealing for private or personal use, and the question of use of illegal copies as against lawfully acquired copies) would extend to “use” at the training stage of genAI models, whether by the AI itself or by the facilitator, i.e., the company building the AI.
Views of the authors are personal. This Article first appeared on the website of Saikrishna and Associates here: https://www.saikrishnaassociates.com/indian-copyright-law-and-generative-ai/
(Image generated on Dall-E)