
Community note: AI LLM and BOT forum data scraping

Posted: Tue Jul 15, 2025 1:34 am
by totalmotorcycle
Good day TMW community!

I've noticed over the last year that AI chatbots and LLMs (Large Language Models) such as Co-Pilot, Gemini, Grok, etc. have been aggressively data harvesting anything they can scrape (take) over the internet. As our TMW forums have been around for 23 years (since 2002), that's a lot of data they can source.

Because of this, my bandwidth use across the forums has been overwhelming. Over 50% of TMW's bandwidth use has been to feed these AI LLMs, which is not only costing a lot of money but also slowing the entire site down. I get no revenue from AI models visiting TMW; in fact, by scraping (stealing) the information, AI diverts visitors AWAY from our forums and website.

Thus, I have to make some hard decisions soon on what to do with the forums as things are getting out of hand with bandwidth and the AI data theft.

If you have any suggestions, please feel free to raise your hand. Right now I'm looking at all options including:

1. Limiting access to the forum posts to logged-in registered users only.
2. Banning all bots and AI models from the site.
3. Shutting down and removing the forums completely.
4. Honeypot (AI traps) to stop the steal.
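
For anyone curious what option 4 could look like in practice, here is a minimal sketch of a honeypot, not anything TMW actually runs. It assumes a hidden trap URL that no human visitor or cooperative crawler would ever request; any client that does request it gets blocked from then on. The path `TRAP_PATH`, the `Honeypot` class, and the IP addresses are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical trap URL (an assumption for this sketch): linked invisibly in
# pages and disallowed in robots.txt, so neither human visitors nor
# cooperative crawlers should ever request it.
TRAP_PATH = "/trap/do-not-follow"


@dataclass
class Honeypot:
    # Client addresses that have requested the trap URL at least once.
    blocked: set[str] = field(default_factory=set)

    def handle(self, client_ip: str, path: str) -> int:
        """Return an HTTP-style status code for one request."""
        if path == TRAP_PATH:
            # Only a scraper ignoring the rules ends up here.
            self.blocked.add(client_ip)
        if client_ip in self.blocked:
            return 403  # Forbidden: this client tripped the trap.
        return 200  # OK: serve the page as usual.


if __name__ == "__main__":
    hp = Honeypot()
    print(hp.handle("203.0.113.5", "/forum/"))   # normal request
    print(hp.handle("203.0.113.5", TRAP_PATH))   # trips the trap
    print(hp.handle("203.0.113.5", "/forum/"))   # now blocked
```

Blocking by IP address is the simplest choice for a sketch; a real setup would likely need expiry and care around shared addresses.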

Thank you for your attention in this matter. (I can't believe I just typed that!).

Mike

Re: Community note: AI LLM and BOT forum data scraping

Posted: Tue Jul 15, 2025 3:58 pm
by pchast
That's a difficult decision..
One of the expressed reasons for the historical data is to help newbies.

Perhaps a 2 tiered approach?

Re: Community note: AI LLM and BOT forum data scraping

Posted: Wed Jul 16, 2025 9:06 am
by totalmotorcycle
pchast wrote: Tue Jul 15, 2025 3:58 pm That's a difficult decision..
One of the expressed reasons for the historical data is to help newbies.

Perhaps a 2 tiered approach?
100% with your thinking there. I really, really don't want to hurt the forums, and I feel the information they contain is valuable for the riding community. Plus, we have a great community IMO.

What I've done right now is:

1. Restricted guest access to the forum. The majority of the record number of guests (most users ever online was 289,791 on June 27th, 2025, 2:02 am) were bots and AI LLM scrapers. So right now, guests get a "teaser" of 2 forums and won't see the other forums until they log in. Bots and AI LLMs don't log in as they don't have accounts.
2. Banned all bots from seeing ANY messages in the forums. What they can't see, they can't scrape. But only the GOOD bots will respect that. These are currently the most common bots visiting the forum:

Bing [Bot]
Amazon [Bot]
Semrush [Bot]
Google Adsense [Bot]
Google [Bot]
Ahrefs [Bot]
Majestic-12 [Bot]
Google Feedfetcher
YaCy [Bot]
AdsBot [Google]
Yahoo [Bot]
DuckDuckGo [Bot]
Baidu [Spider]
MSNbot Media
Ask Jeeves [Bot]
Alexa [Bot]
Exabot [Bot]
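
The post does not say how the bot ban is implemented. As an illustration of why only the "good" bots are affected, here is a rough sketch of one common opt-out mechanism, robots.txt, using Python's standard `urllib.robotparser`. A cooperative crawler runs a check like this before fetching a page; a scraper that ignores the rules simply skips it. The rules, the `/forum/` path, the `ExampleBot` name, and the `example.com` URLs are all assumptions for the example, not TMW's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (an assumption for this sketch): ask every crawler
# to stay out of everything under /forum/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /forum/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A well-behaved bot would skip the first URL and may fetch the second.
    print(is_allowed("ExampleBot", "https://example.com/forum/viewtopic.php?t=1"))
    print(is_allowed("ExampleBot", "https://example.com/index.html"))
```

Nothing in this check is enforced by the server, which is why it stops only the crawlers that choose to run it.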

Now I'm actively watching the guest numbers, the bandwidth and new spam registrations to see how they change.

Overall, this whole situation makes me unhappy, and I'm not amused with AI LLMs right now.

Mike