Does AI web scraping breach copyright or merely gather facts? Should you embrace the rapid rise of AI by retooling strategies to chase citations instead of clicks?
These are the issues addressed in an INMA product and tech blog, in which EidosMedia chief marketing and chief product officer Massimo Barsotti quotes extensively from UK trade publication Press Gazette.
“Despite obtaining content from willing publishers, AI bots continue to ‘scrape’ other web content without permission,” he says.
Barsotti says that since the 1990s, that permission has been relayed by a website’s robots.txt file – the gatekeeper informing hungry website crawlers what content is fair game and what is off-limits. “But the robots.txt file is more of a courteous suggestion than an enforceable boundary,” he says, quoting Human Security’s Bryan Becker, who calls it “a sign that says ‘please do not come in if you’re one of these things’ and there’s nothing there to stop you. It’s just always been a standard of the Internet to respect it.”
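To illustrate just how advisory that boundary is, here is a minimal Python sketch using the standard library’s urllib.robotparser: a well-behaved crawler asks robots.txt for permission before fetching, but nothing stops a scraper from skipping the check entirely. The domain, paths and bot names are illustrative only.

```python
# Minimal sketch: how a well-behaved crawler consults robots.txt before fetching.
# Nothing in the protocol enforces compliance; a scraper can simply skip this check.
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt asking one AI bot to stay out of the articles section.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /articles/

User-agent: *
Disallow:
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

url = "https://example-publisher.com/articles/latest"  # illustrative URL
for agent in ("ExampleAIBot", "OrdinaryBrowser"):
    if robots.can_fetch(agent, url):
        print(f"{agent}: robots.txt permits this fetch")
    else:
        print(f"{agent}: asked to stay out (honouring that is voluntary)")
```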
Barsotti quotes Press Gazette’s finding that publishers opting to block AI companies from visiting their websites altogether “have only given rise to third-party content-scrapers, ‘which openly boast about how they can get through paywalls, effectively steal content to order, allowing AI companies to answer “live” news queries with stolen information from publishers’.”
He says the publication even admits to “using third-party scrapers to access paywalled content on the Financial Times website”.
Among allies in the AI resistance, the Internet Engineering Task Force’s AI Preference Working Group (AIPREF) is one of the largest and most influential. Its principal objective is to contain AI scrapers through two interrelated mechanisms. The first is to establish “a common vocabulary to express authors’ and publishers’ preferences regarding use of their content for AI training and related tasks”. The second is new and improved boundaries: AIPREF wants to develop a “means of attaching that vocabulary to content on the internet, either by embedding it in the content or by formats similar to robots.txt, and a standard mechanism to reconcile multiple expressions of preferences”.
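AIPREF has not yet finalised either the vocabulary or the attachment format, so any concrete syntax is speculative. Purely to illustrate the robots.txt-style idea of machine-readable usage preferences, the Python sketch below parses a made-up preference file; the field names and values are hypothetical, not the working group’s.

```python
# Hypothetical sketch only: AIPREF has not finalised any syntax.
# This imagines a robots.txt-style file expressing usage preferences,
# purely to illustrate the "common vocabulary + attachment" idea.
HYPOTHETICAL_PREF_FILE = """\
User-Agent: *
Content-Usage: ai-train=n, search-index=y
"""

def parse_preferences(text):
    """Parse the made-up key=value vocabulary into a dict per user agent."""
    prefs, current_agent = {}, None
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            current_agent = value
            prefs[current_agent] = {}
        elif field == "content-usage" and current_agent is not None:
            for pair in value.split(","):
                key, _, flag = pair.strip().partition("=")
                prefs[current_agent][key] = (flag == "y")
    return prefs

print(parse_preferences(HYPOTHETICAL_PREF_FILE))
# {'*': {'ai-train': False, 'search-index': True}}
```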
Barsotti says “for those on the frontlines of the fight”, the promise of new protocols and the hope of AI compliance are not enough.
“Increasingly, publishers are fighting back with emerging countermeasures,” he says.
These include AI tarpits – traps that bog AI crawlers down in an ‘infinite maze’ of static files – as well as techniques that poison AI models or introduce a ‘proof of work’ defence.
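To make the tarpit idea concrete, here is a minimal sketch using Python’s standard-library HTTP server: every request under an invented /maze/ path returns a slow, auto-generated page whose links only lead deeper into the maze, so a crawler that blindly follows links burns time and requests without reaching real content. The paths, delay and page content are illustrative; real tarpits add bot detection and more sophisticated decoys.

```python
# Minimal tarpit sketch (illustrative only): every maze page links to more
# maze pages, so a link-following crawler wastes requests on decoy content.
import hashlib
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

def maze_page(path):
    """Generate a deterministic decoy page whose links lead deeper into the maze."""
    seed = hashlib.sha256(path.encode()).hexdigest()
    links = "".join(
        f'<a href="/maze/{seed[i:i + 8]}">section {i}</a><br>' for i in range(0, 40, 8)
    )
    return f"<html><body><p>Archive fragment {seed[:12]}</p>{links}</body></html>"

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/maze" or self.path.startswith("/maze/"):
            time.sleep(2)  # slow the crawler down on every request
            body = maze_page(self.path).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), TarpitHandler).serve_forever()
```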
Barsotti says one of the most recent and significant blows against AI scraping has been landed by infrastructure provider Cloudflare, which has reversed its original ‘opt-out’ model and now blocks AI bots by default. In this it has been backed by more than a dozen major news and media publishers, including the Associated Press, The Atlantic, BuzzFeed, Condé Nast, DMGT, Dotdash Meredith, Fortune, Gannett, The Independent, Sky News, Time and Ziff Davis.
“Cloudflare is also offering a more aggressive approach called AI Labyrinth, a tarpit/poisoning-inspired tool designed to ensnare AI scrapers. Citing a Cloudflare blog post, The Verge explained that when Labyrinth ‘detects inappropriate bot behaviour’, the free, opt-in tool lures crawlers down a path of links to AI-generated decoy pages that ‘slow down, confuse, and waste the resources’ of those acting in bad faith.”
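The pattern The Verge describes, detecting suspicious crawling and then diverting it toward decoys, can be sketched in a few lines. The heuristics below (a blocklist of declared AI user agents and a crude request-rate threshold) are invented for illustration and are not Cloudflare’s actual detection signals.

```python
# Sketch of the "detect, then divert" pattern described for Labyrinth-style tools.
# The heuristics and thresholds are invented; real systems use far richer signals.
KNOWN_AI_AGENTS = ("GPTBot", "CCBot", "ClaudeBot")  # illustrative AI crawler names

def classify_request(headers, requests_last_minute):
    """Decide whether to serve the real page or divert into decoy pages."""
    ua = headers.get("User-Agent", "")
    if any(bot.lower() in ua.lower() for bot in KNOWN_AI_AGENTS):
        return "divert"   # declared AI crawler: send it into the maze
    if requests_last_minute > 120:
        return "divert"   # crude rate heuristic standing in for real bot detection
    return "serve"        # ordinary reader: serve the real page

print(classify_request({"User-Agent": "GPTBot/1.2"}, requests_last_minute=5))    # divert
print(classify_request({"User-Agent": "Mozilla/5.0"}, requests_last_minute=10))  # serve
```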
He says the age of publishers watching passively as AI bots scrape their content is over. “Some, like The Guardian and The Wall Street Journal, are striking deals and throwing open the gates to AI. Others are communicating firm boundaries, setting technical traps, and collaborating with like-minded leaders to develop effective defences.
“Whether AI web scrapers can be brought to heel remains to be seen, but for the publishers resisting the AI takeover, it’s clear the fight must continue if they want to retain control over their content.”