CHRIS ROPER: How AI is remaking news at lightspeed

First it became clear that thanks to the digital revolution, print journalism was heading for the museum — now AI is on the verge of sending news websites to the same place

Picture: REUTERS/Dado Ruvic/Illustration

Last year, three authors filed a class-action lawsuit in California alleging that Anthropic, one of the big four AI labs, trained its AI model using pirated copies of their books. They described it as large-scale theft, happening without permission and certainly without compensation.

The embarrassingly named Claude chatbot is Anthropic’s main product, a competitor to OpenAI’s ChatGPT, Google’s Gemini and Meta’s Llama.

Claude is marketed as “helpful, harmless and honest”, supposedly owing to Anthropic’s research on “constitutional AI”, which it says is a training method whereby models are guided by written principles rather than purely human feedback.

“The system uses a set of principles to make judgments about outputs, hence the term ‘constitutional’,” Anthropic says. “At a high level, the constitution guides the model to take on the normative behaviour described in the constitution — here, helping to avoid toxic or discriminatory outputs, avoiding helping a human engage in illegal or unethical activities, and broadly creating an AI system that is helpful, honest and harmless.”

Perhaps the I in AI stands for irony. The lawsuit alleges that Anthropic downloaded as many as 7-million pirated books from what are called “shadow libraries”. This included at least 196,640 books from the Books3 data set (part of something called The Pile, which is a huge open-source data set created in 2020 by the nonprofit research group EleutherAI), at least 5-million books from Library Genesis, and 2-million books from Pirate Library Mirror. The 7-million pirated book copies were aggregated into a “central library” by Anthropic, to train dear old Claude.

A federal judge in the US ruled that training with legally purchased books was fair use because the AI’s use was deeply transformative, comparable to a reader internalising various texts to then create something new. He did, however, find that storing more than 7-million pirated books in a central “library” was not fair use and constituted copyright infringement.

That cleared the way for a trial that would have determined the damages. But in August, Anthropic settled with the authors. The company was in “a unique situation”, Reuters quoted Cornell Law School professor James Grimmelmann as saying — as much as $1-trillion in damages was at stake in its worst-case scenario. The settlement still has to receive judicial approval, but was expected to be finalised this week.

AI models also use news articles in training large language models (LLMs). Again ironically, given that AI is poised to destroy journalism, this is because journalistic content offers high-quality, reliable and well-edited information. This trusted information contributes to reducing bias, to grounding outputs in accurate sources and to mitigating issues with hallucinations and misinformation. News articles make up a substantial part of the training data for leading LLMs, allowing these models to learn patterns, context and styles that are, in general, prevalent in trustworthy reporting.

In Joburg this week, the M20, a G20-related media conference, was held to highlight media, journalism and information integrity issues. One of the sessions at the conference, organised by the South African National Editors’ Forum, Media Monitoring Africa and Alt.Advisory, dealt with a policy brief titled “AI’s impact on the intellectual property rights of journalists”.

The brief’s findings are that the uncompensated use of journalistic content by LLMs is both undermining the production of quality news and threatening the integrity of the information ecosystem as a whole. Its author suggests that we need a global initiative to establish norms and rules when it comes to AI and copyright. It suggests some ways for news publishers to get fair compensation for the use of their data, including fixed-fee licensing or negotiated frameworks.

Let’s be clear about this. The impact of AI on journalism is huge. The use of journalistic content without payment undermines the production of original news, and erodes the much-needed quality news ecosystem.

Why would you bother sinking time and money into producing original news when most people are only reading it on their AI chatbot? That scenario is rapidly approaching, in a world where search engine optimisation is being replaced by generative engine optimisation.

As the M20 policy brief points out, AI responses to search queries — Google’s AI Overviews and AI Mode, for example — are replacing traditional search tools and eroding traffic to news sites. And the AI-generated answers are often so good that people don’t bother to visit news sites for more information, even if the chatbot provides links.

It’s perhaps too soon to see what the impact will really be, but recent figures from Similarweb indicate that between February 2024 and February 2025 the top 500 news sites collectively lost 64-million referral visits from traditional search, while AI chatbot referrals rose by only 5.5-million.

The real killer, though, is the way AI can erode trust in the information ecosystem by spreading disinformation through hallucinations and enabling deepfakes, and by stripping out news-brand trust from the consumer relationship. And as news organisations stop producing robust journalism, information integrity degrades further, because incorrect AI outputs re-enter the system as new training data. We have entered the age of AI slop, and the fear is that truth will drown in it.

What’s the solution? According to Jeff Jarvis, a journalism professor, co-host of the podcast AI Inside and unlikely anarchist, we shouldn’t try to litigate.

Jarvis has criticised the push for mandatory licensing: he believes that using data for training is fair use even without payment, and that compulsory licensing would damage the information ecosystem. He’s also dismissive of deals between AI firms and big publishers, such as the one between The New York Times and Amazon that allows Amazon to use editorial content from the paper in Alexa and as training material for its own AI models.

He thinks that the AI firms are just buying the silence of these publishers, that it’s just lobbying, and that these deals exclude small publishers and undermine democratic journalism. Of course, it’s easier to adopt this attitude in the minority world. African newsrooms are a lot more vulnerable in some ways. But then again, it’s doubtful that any AI firm will bother to ask them.

In a January 2024 US Senate judiciary subcommittee hearing, Jarvis stood out as the only dissenting voice against calls for compulsory licensing of journalistic content by AI companies, arguing that fair use is a fundamental tool for journalists, and that restricting it could harm journalistic freedom. “I am concerned about all the talk I hear about limiting fair use … Fair use is used every day by journalists. We ingest data … and put it out in a different way.”

He compared the AI licensing debate to when “newspapers complained about radio … In the end, democracy was better served because journalists could read each other and use each other’s information.”

Fundamentally, the question before news organisations shouldn’t be how to stop the AI onslaught. They can’t, in the same way they couldn’t stop the social media giants from destroying traditional media.

News sites now are in the same position that print news was at the beginning of the social media era. Doomed, as the clickbaiters would shriek. The only possible solution is one that is already being acted on by the new breed of news creators and news influencers: to develop a new understanding of what news is, of who is consuming it, and of where and how it is being consumed.

What does it mean to be a journalist in this new ecosystem, and what constitutes effective, useful and impactful journalism? That’s a complex question, though in some ways it’s the same answer as it’s always been. The practice of trusted journalism requires sturdy ethical and procedural safeguards, but the culmination of journalism is always at the intersection of the user and the product.

The key to survival, and indeed to thriving, is going to be making sure that journalists understand those users, and in fact embody the same information impulses.
