Generative AI tools such as ChatGPT are being trained on copyrighted news material, according to the US-based News/Media Alliance.
Research in a new white paper indicates the “pervasive, unauthorised use of publisher content” and the impact this may have on publishers.
N/MA, which represents more than 2,200 publishers, has reported to the US Copyright Office on the use of publisher content to power generative artificial intelligence technologies. Its three publications document the unauthorised use of publisher content by GAI developers, the impact this may have on the sustainability and availability of high-quality original content, and the legal implications of such use.
“GAI systems have been developed by copying massive amounts of the expressive material published by the Alliance’s members, almost always without authorisation or compensation, to create new products and services that frequently compete with Alliance member publishers,” the group says.
While recognising “the exciting potential of GAI models and applications to improve aspects of our lives” and supporting the “principled development” of these systems, it says the development must not come at the expense of publishers and journalists who invest considerable time and resources producing material that keeps communities informed, safe and entertained, and holds government officials and other decision makers in check.
The Alliance says the group and its members would welcome working with GAI developers to help build and grow these technologies in a sustainable and responsible manner.
While the copyright office submission and white paper discuss the wider publisher landscape in the face of the GAI revolution, including relevant principles of copyright law, the accompanying technical analysis documents the extent to which GAI developers rely on high-quality journalistic content to power their models.
In particular, it says results show GAI developers have copied and used news, magazine and digital media content to train large language models. “Popular curated datasets underlying LLMs significantly overweight publisher content by a factor ranging from over 5 to almost 100 as compared to the generic collection of content that the well-known entity Common Crawl has scraped from the web.
Other studies show that news and digital media ranks third among all categories of sources in Google’s C4 training set, which was used to develop Google’s GAI-powered products like Bard. Half of the top ten sites represented in the data set are news outlets.”
The LLMs also copy and use publisher content in their outputs, it says: they can reproduce the content on which they were trained, demonstrating that the models retain and can memorise the expressive material of the training works.
Alliance president and chief executive Danielle Coffey says the research and analysis show that AI companies and developers are not only engaging in unauthorised copying of members’ content to train their products, but are using it pervasively and to a greater extent than other sources.
“This shows they recognize our unique value, and yet most of these developers are not obtaining proper permissions through licensing agreements or compensating publishers for the use of this content. This diminishment of high-quality, human-created content harms not only publishers but the sustainability of AI models themselves and the availability of reliable, trustworthy information.”
Coffey says generative AI systems should be held responsible and accountable, just like any other business. “This white paper demonstrates that these systems rely on journalistic and creative content, which have the benefit of investment in quality on the front end, as well as publishers who are required by law to take responsibility for the content they share with the public.
“Continued unauthorised use will harm existing markets that acknowledge the value of archived and real-time quality content, and over time the GAI models themselves will deteriorate.
“You get out what you put in. It is critical that our copyright protections are properly enforced and that high standards of quality and accountability are the foundation of these and other new technologies.”
Image: Wikimedia Commons/James Grills