Guarding My Git Forge Against AI Scrapers

(vulpinecitrus.info)

121 points | by todsacerdoti 11 hours ago

28 comments

  • mappu 8 hours ago
    Gitea has a builtin defense against this, `REQUIRE_SIGNIN_VIEW=expensive`, that completely stopped AI traffic issues for me and cut my VPS's bandwidth usage by 95%.
  • kstrauser 2 hours ago
    Anubis cut the traffic to my little personal Forgejo instance (nothing particularly interesting on it) from about 600K hits per day to about 1000.

    That’s the kind of result that ensures we’ll be seeing anime girls all over the web in the near future.

  • craftkiller 1 hour ago
    On my forge, I mirror some large repos that I use for CI jobs so I'm not putting unfair load on the upstream project's repos. Those are the only repos large enough to cause problems with the asshole AI scrapers. My solution was to put the web interface for those repos behind oauth2-proxy (while leaving the direct git access open to not impact my CI jobs). It made my CPU usage drop 80% instantly, while still leaving my (significantly smaller) personal projects fully open for anyone to browse unimpeded.
  • FabCH 7 hours ago
    If you don't need global access, I have found that geoblocking is the best first step, especially if you are in a small country with a small footprint and can get away with blocking the rest of the world. But even if you live in the US, excluding Russia, India, Iran and a few others will cut your traffic by a double-digit percentage.

    In the article, quite a few listed sources of traffic would simply be completely unable to access the server if the author could get away with a geoblock.
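
    If you want to play with the idea before touching your firewall, here is a minimal sketch of a country check using the MaxMind GeoLite2 database via the geoip2 Python package (the database path and allowlist are placeholders; in practice you would enforce this at the firewall or reverse proxy, not in application code):

        # Hypothetical sketch: allow a handful of countries, drop everything else.
        import geoip2.database
        import geoip2.errors

        ALLOWED = {"CH", "DE", "FR", "IT", "AT"}  # placeholder allowlist

        reader = geoip2.database.Reader("GeoLite2-Country.mmdb")

        def is_allowed(ip: str) -> bool:
            try:
                country = reader.country(ip).country.iso_code
            except geoip2.errors.AddressNotFoundError:
                return False  # unknown IPs get dropped too
            return country in ALLOWED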

    • krupan 3 hours ago
      This makes me a little sad. There's an ideal built into the Internet, that it has no borders, that individuals around the world can connect directly. Blocking an entire geographic region because of a few bad actors kills that. I see why it's done, but it's unfortunate
      • FabCH 3 hours ago
        I know what you mean.

        But the numbers don't lie. In my case, I locked down to a fairly small group of European countries and the server went from about 1500 bot scans per day down to 0.

        The tradeoff is just too big to ignore.

    • komali2 7 hours ago
      Reminds me of when 4chan banned Russia entirely to stop DDOSes. I can't find it but there was a funny post from Hiro saying something like "couldn't figure out how to stop the ddos. Banned Russia. Ddos ended. So Russia is banned. /Shrug"
      • ralferoo 2 hours ago
        Similarly, for my e-mail server, I manually add spammers into my exim local_sender_blacklist a single domain at a time. About a month into doing this, I just gave up and added *@*.ru and that instantly cut out around 80% of the spam e-mail.

        It's funny observing their tactics though. On the whole, spammers have moved from bare domain to various prefixes like @outreach.domain, @msg.domain, @chat.domain, @mail.domain, @contact.domain and most recently @email.domain.

        It's also interesting watching the common parts before the @. Most recently I've seen a lot of marketing@, before that chat@, and about a month after I blocked that, chat1@. I mostly block *@domain though, so I'm less aware of these trends.
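
        If you want to play with the same kind of wildcard matching outside exim, a rough Python equivalent looks like this (the patterns and addresses are made up, not my actual blacklist):

            # Rough illustration of wildcard sender blocking.
            from fnmatch import fnmatch

            BLOCKLIST = ["*@*.ru", "*@outreach.*", "*@msg.*", "marketing@*"]  # made-up patterns

            def is_blocked(sender: str) -> bool:
                sender = sender.lower()
                return any(fnmatch(sender, pattern) for pattern in BLOCKLIST)

            print(is_blocked("chat1@outreach.example.com"))  # True
            print(is_blocked("alice@example.org"))           # False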

    • ThatPlayer 5 hours ago
      We've had a similar discussion at my work: e-commerce that only ships to North America, so blocking anyone outside of that is an option.

      Or I might try and put up Anubis only for them.

      • FabCH 3 hours ago
        Be slightly careful with commerce websites, because GeoIP databases are not perfect in my experience.

        I accidentally got locked out of my server when I connected over Starlink, which IP-maps to the US even though I was physically in Greece.

        As practical advice, I would use a blocklist for commerce websites, and an allowlist for infra/personal.

  • Bender 2 hours ago
    Do git clients support HTTP/2.0 yet? Or could they use SSH? I ask because I block most of the bots by requiring HTTP/2.0 even on my silliest of throw-away sites. I agree their caching method is good and should be used when much of the content is cacheable. Blocking specific IPs is a never-ending game of whack-a-mole. I do block some data-center ASNs, as I do not expect real people to come from them even though they could; it's an acceptable trade-off for my junk. There is a lot people can learn from capturing TCP SYN packets for a day and comparing them to access logs to sort out bots vs. legit people. There are quite a few headers that a browser will send that most bots do not, and many bots also fail to send a valid TCP MSS and TCP window.

    Anyway, test some scrapers and bots here [1] and let me know if they get through. A successful response will show "Can your bot see this? If so you win 10 bot points." and a figlet banner. Read-only SFTP login is "mirror" and no pw.

    [Edit] - I should add that I require clients to say they accept English, optionally in addition to other languages, but not a couple of combinations that are blocked, e.g. en,de-DE,de is good, de-DE,de will fail, just because. Not suggesting anyone do this.

    [1] - https://mirror.newsdump.org/bot_test.txt
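
    To make the [Edit] rule above concrete, here is a rough Python sketch of that kind of Accept-Language check (the allow/deny combinations are examples only, not my real rules, and the real thing lives in the web server config):

        # Toy version of the Accept-Language heuristic: the client must accept
        # English, and a couple of specific combinations are denied outright.
        DENIED_COMBOS = {("de-DE", "de")}  # example of a blocked combination

        def accept_language_ok(header: str) -> bool:
            # Strip quality values like ";q=0.8" and normalise.
            langs = tuple(part.split(";")[0].strip() for part in header.split(",") if part.strip())
            if langs in DENIED_COMBOS:
                return False
            return any(lang == "en" or lang.startswith("en-") for lang in langs)

        print(accept_language_ok("en,de-DE,de"))  # True
        print(accept_language_ok("de-DE,de"))     # False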

    • cortesoft 1 hour ago
      > I do block some data-centers ASN's as I do not expect real people to come from them even though they could.

      My company runs our VPN from our datacenter (although we have our own IP block, which hopefully doesn’t get blocked)

      • Bender 1 hour ago
        It's of course optional to block whatever one finds appropriate for their use case. My hobby stuff is not revenue generating so I have more options at my disposal.

        Those with revenue-generating systems should capture TCP SYN traffic for a while, monitor access logs and give it the old college try to correlate bots vs. legit users by traffic characteristics. Sometimes generalizations can be derived from the correlation, and some of those generalizations can be permitted or denied. There really isn't a one-size-fits-all solution, but hopefully my example can give ideas for additional directions to go. Git repos are probably the hardest to protect, since I presume many of the git libraries and tools use older protocols and may look a lot like bots. If one could get people to clone/commit with SSH, there are additional protections that can be utilized at that layer.

        [Edit] Other options lie outside one's network, such as submitting pull requests or feature requests to the maintainers of the git libraries so that their HTTP requests look a lot more like a real browser and stand out from 99% of the bots. The vast majority of bots use really old libraries.

  • PeterStuer 2 hours ago
    "Self-hosting anything that is deemed "content" openly on the web in 2025 is a battle of attrition between you and forces who are able to buy tens of thousands of proxies to ruin your service for data they can resell."

    I do wonder though. Content scrapers that truly value data would stand to benefit from heuristics that try to be as efficient as possible in terms of information per query. Wastefulness of the described type doesn't just load your servers, but also their whole processing pipeline on their end.

    But there is a different class of player that gains more from nuisance maximization: dominant anti-bot/DDoS service providers, especially those with ambitions of becoming the ultimate internet middleman. Their cost for creating this nuisance is near zero, as they have zero interest in doing anything with the responses. They just want to annoy you until you cave and install their "free" service; then they can turn around and charge interested parties for access to your data.

  • yunnpp 20 minutes ago
    Thanks for putting that together. Not my daily cup but it seems like a good reference for server setup.
  • dspillett 7 hours ago
    > VNPT and Bunny Communications are home/mobile ISPs. i cannot ascertain for sure that their IPs are from domestic users, but it seems worrisome that these are among the top scraping sources once you remove the most obviously malicious actors.

    This will be in part people on home connections tinkering with LLMs, blindly running some scraper instead of (or as well as) using the common pre-scraped data-sets and their own data. A chunk of it will be from people who have been compromised (perhaps by installing/updating a browser add-in or “free” VPN client that has become, or always was, nefarious) and whose home connections are being farmed out by VPN providers selling “domestic IP” services to people running scrapers.

    • simonw 3 hours ago
      I have trouble imagining any home LLM tinkerer who tries to run a naive scraper against the rest of the internet as part of their experiments.

      Much more likely are those companies that pay people (or trick them) into running proxies on their home networks to help with giant scraping projects that want to rotate through thousands of "real" IPs.

    • ArcHound 7 hours ago
      Disagree on the method:

      I recall that bot farms use pre-paid SIM cards for their data connections so that their traffic comes from a good residential ASN.

      No client compromise required; it's network abuse that gives you a good reputation if you use mobile data.

      But yes, selling botnets made of compromised devices is also a thing.

  • GoblinSlayer 2 hours ago
    >Iocaine has served 38.16GB of garbage

    And what is the effect?

    I opened https://iocaine.madhouse-project.org/ and it gave the generated maze thinking I'm an AI :)

    >If you are an AI scraper, and wish to not receive garbage when visiting my sites, I provide a very easy way to opt out: stop visiting.

  • dirkc 9 hours ago
    I'm not 100% against AI, but I do cheer loudly when I see things like this!

    I'm also left wondering about what other things you could do? For example - I have several friends that built their own programming languages, I wonder what the impact would be if you translate lots of repositories to your own language and host it for bots to scrape? Could you introduce sufficient bias in a LLM to make an esoteric programming language popular?

    • hurturue 8 hours ago
      Russia already does that - it poisons the net for future LLM pretraining data.

      It's called "LLM grooming":

      https://thebulletin.org/2025/03/russian-networks-flood-the-i...

      • brabel 6 hours ago
        This article shows no evidence for anything it claims. None. All of that while claiming we can’t believe almost anything we read online… well you’re god damn right.

        > undermining democracy around the globe is arguably Russia’s foremost foreign policy objective.

        Right, because Russia is such a cartoonish villain it has no interest in pursuing its own development and good relations with any other country, all it cares about is annoying the democratic countries with propaganda about their own messed up politics.

        When did it become acceptable for journalists to make bold, generalizing claims against whole nations without a single piece of direct, falsifiable evidence, and worse, to make claims like this that can be easily dismissed as obviously false by quickly looking at Russia's policies and its diplomatic interactions with other countries?!

        • nightpool 3 hours ago
          They link multiple sources, including a Sunshine Foundation report summarizing other research into the area, and a NewsGuard report where they tested claims from the Pravda network directly against leading LLM chatbots: https://static1.squarespace.com/static/6612cbdfd9a9ce56ef931... https://www.newsguardtech.com/special-reports/generative-ai-...
        • frogperson 3 hours ago
          Can you point me to any examples of russia doing something good or helping anyone except billionaires? No? Then their reputation is well deserved.
        • ekropotin 2 hours ago
          As a Russian, I have to say that Putin is indeed way too focused on geopolitics instead of internal state of affairs.
        • nutjob2 6 hours ago
          > Right, because Russia is such a cartoonish villain it has no interest in pursuing its own development and good relations with any other country, all it cares about is annoying the democratic countries with propaganda about their own messed up politics.

          That's actually pretty much spot on.

          • brabel 3 hours ago
            When you start believing that there is only good and bad, black and white, them vs. us, you know for sure you’ve been brainwashed. That goes for both sides.
            • hurturue 3 hours ago
              So, between 0 (good) and 100 (bad), what would your gray "badness/evilness" score be for the following: Russia, US, China, EU?

              Yes, I know, it's not a linear axis, it's a multi-dimensional perspective thing. So do a PCA/projection and spit out one number, according to your values/beliefs.

            • nutjob2 2 hours ago
              For someone who complains about unsupported claims, you seem to make a lot of them.

              The fact that you think this is something to do with "both sides" instead of a simple question of facts really gives you away.

    • zwnow 9 hours ago
      > Could you introduce sufficient bias in a LLM to make an esoteric programming language popular?

      Wasn't there a study a while back showing that a small sample of data is good enough to poison an LLM? So I'd say it for sure is possible.

  • wrxd 7 hours ago
    I wonder if this is going to push more and more services to be hidden from the public internet.

    My personal services are only accessible from my own LAN or via a VPN. If I wanted to share it with a few friends I would use something like Tailscale and invite them to my tailnet. If the number of people grows I would put everything behind a login-wall.

    This of course doesn't cover services I genuinely might want to be exposed to the public. In that case the fight with the bots is on, assuming I decide I want to bother at all

  • qudat 6 hours ago
    This is a good illustration of why letting websites have direct access to git is not a great idea. I started creating static versions of my projects with great success: https://git.erock.io
    • drzaiusx11 5 hours ago
      Do solutions like Gitea not have prebuilt indexes of the git file contents? I know GitHub does this to some extent, especially for main repo pages. Seems wild that the default for a web forge would be to hit the actual git server on every HTTP GET request.
      • danudey 1 hour ago
        The author discusses his efforts in trying caching; in most use cases, it makes no sense to pre-cache every possible piece of content (because real users don't need to load that much of the repository that fast), and in the case of bot scrapers it doesn't help to cache because they're only fetching each file once.
  • sodimel 8 hours ago
    I, too, am self-hosting some projects on an old computer. And the fact that you can "hear the internet" (with the fans spinning up) is really cool (unless you're trying to sleep while being scraped).
  • stevetron 3 hours ago
    I was setting up a small system to do website serving, mostly just experimental, to try out some code: learning how to use nginx as a reverse proxy, and learning how to use dynamic DNS services since I am on a dynamic IP at home. Early on, I discovered lots of traffic and lots of hard drive activity. The HD activity was from logging. It seemed I was under incessant polling from China. Strange: it's a new dynamic URL. I eventually got this down to almost nothing by setting up the firewall to reject traffic from China. That was, of course, before AI scrapers. I don't know what it would do now.
  • hashar 9 hours ago
    I do not understand why the scrapers do not do it in a smarter way: clone the repositories and fetch from there on a daily or so basis. I have witnessed one going through every single blame and log link across all branches and redoing it every few hours! It sounds like they did not even try to optimize their scrapers.
    • dspillett 7 hours ago
      > I do not understand why the scrapers do not do it in a smarter way

      If you mean scrapers in terms of the bots, it is because they are basically scraping web content via HTTP(S) generally, without specific optimisations using other protocols at all. Depending on the use case intended for the model being trained, your content might not matter at all, but it is easier just to collect it and let it be useless than to optimise it away⁰. For models where your code in git repos is going to be significant for the end use, the web scraping generally proves to be sufficient so any push to write specific optimisations for bots for git repos would come from academic interest rather than an actual need.

      If you mean scrapers in terms of the people using them, they are largely akin to “script kiddies” just running someone else's scraper to populate their model.

      If you mean scrapers in terms of the people writing them, then the fact that plain web scraping is sufficient, as mentioned above, is likely the significant factor.

      > why the scrapers do not do it in a smarter way

      A lot of the behaviours seen are easier to reason about if you stop considering scrapers (the people using scraper bots) to be intelligent, respectful, caring people who might give a damn about the network as a whole, or who might care about doing things optimally. Things make more sense if you consider them to be in the same bucket as spammers, who are out for a quick lazy gain for themselves and don't care, or even have the foresight to realise, how much it might inconvenience¹ anyone else.

      ----

      [0] the fact this load might be inconvenient to you is immaterial to the scraper

      [1] The ones that do realise they might cause an inconvenience usually take the view that it is only a small one: how can the inconvenience little old them are imposing really be that significant? They don't take the extra step of considering how many people like them are out there thinking the same. Or they think that if other people are doing it, what is the harm in just one more? Or they just take the view “why should I care if getting what I want inconveniences anyone else?”.

    • ACCount37 9 hours ago
      Because that kind of optimization takes effort. And a lot of it.

      Recognize that a website is a Git repo web interface. Invoke elaborate Git-specific logic. Get the repo link, git clone it, process cloned data, mark for re-indexing, and then keep re-indexing the site itself but only for things that aren't included in the repo itself - like issues and pull request messages.

      The scrapers that are designed with effort usually aren't the ones webmasters end up complaining about. The ones that go for quantity over quality are the worst offenders. AI inference-time data intake with no caching whatsoever is the second worst offender.
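
      Even a rough version of that Git-specific path is noticeably more code than blind link-following. A purely hypothetical sketch of the "clone instead of crawl" step (the URL patterns are assumptions about how Gitea/Forgejo-style forges lay out their routes):

          # Hypothetical sketch: if a URL looks like a forge's file/blame/log view,
          # clone the repo once instead of crawling every rendered page.
          import re
          import subprocess
          from urllib.parse import urlsplit

          REPO_VIEW = re.compile(r"^/(?P<owner>[^/]+)/(?P<repo>[^/]+)/(src|blame|commits|raw)/")

          def handle_url(url: str, cloned: set) -> bool:
              """Return True if the URL was satisfied by a git clone instead of an HTTP fetch."""
              parts = urlsplit(url)
              m = REPO_VIEW.match(parts.path)
              if not m:
                  return False  # not a repo file view; crawl normally (issues, PRs, wiki...)
              repo_url = f"{parts.scheme}://{parts.netloc}/{m['owner']}/{m['repo']}.git"
              if repo_url not in cloned:
                  subprocess.run(["git", "clone", "--mirror", repo_url,
                                  f"{m['owner']}__{m['repo']}.git"], check=True)
                  cloned.add(repo_url)
              return True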

    • FieryMechanic 9 hours ago
      The way most scrapers work (I've written plenty of them) is that you just basically get the page and all the links and just drill down.
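
      The whole thing is only a few lines; a bare-bones sketch with requests and BeautifulSoup (kept same-host and page-limited here just to keep it short):

          # Minimal "grab the page, collect the links, drill down" crawler sketch.
          from collections import deque
          from urllib.parse import urljoin, urlsplit

          import requests
          from bs4 import BeautifulSoup

          def crawl(start_url: str, max_pages: int = 100) -> None:
              host = urlsplit(start_url).netloc
              queue, seen = deque([start_url]), {start_url}
              while queue and len(seen) <= max_pages:
                  url = queue.popleft()
                  resp = requests.get(url, timeout=10)
                  if "text/html" not in resp.headers.get("Content-Type", ""):
                      continue
                  for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
                      link = urljoin(url, a["href"]).split("#")[0]
                      if urlsplit(link).netloc == host and link not in seen:
                          seen.add(link)
                          queue.append(link)
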
      • conartist6 6 hours ago
        So the easiest strategy to hamper them if you know you're serving a page to an AI bot is simply to take all the hyperlinks off the page...?

        That doesn't even sound all that bad if you happen to catch a human. You could even tell them pretty explicitly with a banner that they were browsing the site in no-links mode for AI bots. Put one link to an FAQ page in the banner since that at least is easily cached
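
        Mechanically the rewrite is trivial; a rough sketch with BeautifulSoup, just to illustrate the idea (not tied to any particular forge):

            # Replace every hyperlink with its plain text for untrusted visitors.
            from bs4 import BeautifulSoup

            def strip_links(html: str, trusted: bool) -> str:
                if trusted:
                    return html
                soup = BeautifulSoup(html, "html.parser")
                for a in soup.find_all("a"):
                    a.unwrap()  # keep the link text, drop the <a href=...> around it
                return str(soup)

            print(strip_links('<p>See <a href="/docs">the docs</a>.</p>', trusted=False))
            # -> <p>See the docs.</p>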

        • FieryMechanic 6 hours ago
          When I used to build these scrapers for people, I would usually pretend to be a browser. This normally meant changing the UA and making the headers look like a real browser's. Obviously this would fail against more advanced bot-detection techniques.

          Failing that I would use Chrome / Phantom JS or similar to browse the page in a real headless browser.

          • conartist6 6 hours ago
            I guess my point is that since it's a subtle interference that leaves the explicitly requested code/content fully intact, you could just do it as a blanket measure for all non-authenticated users. The real benefit is that you don't need to hide that you're doing it, or why...
            • conartist6 6 hours ago
              You could add a feature kind of like "unlocked article sharing" where you can generate a token that lives in a cache so that if I'm logged in and I want to send you a link to a public page and I want the links to display for you, then I'd send you a sharing link that included a token good for, say, 50 page views with full hyperlink rendering. After that it just degrades to a page without hyperlinks again and you need someone with an account to generate you a new token (or to make an account yourself).

              Surely someone would write a scraper to get around this, but it couldn't be a completely plain HTTPS scraper, which in theory should help a lot.
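
              The token bookkeeping itself is tiny; a toy in-memory sketch (a real version would live in Redis or the app database):

                  # Toy share-token store: each token is good for N link-enabled page views.
                  import secrets
                  from typing import Optional

                  tokens = {}  # token -> remaining views (in-memory for the sketch)

                  def issue_token(views: int = 50) -> str:
                      token = secrets.token_urlsafe(16)
                      tokens[token] = views
                      return token

                  def consume(token: Optional[str]) -> bool:
                      """True if this request should get full hyperlink rendering."""
                      if token and tokens.get(token, 0) > 0:
                          tokens[token] -= 1
                          return True
                      return False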

              • conartist6 6 hours ago
                I would build a little stoplight status dot into the page header. Red if you're fully untrusted. Yellow if you're semi-trusted by a token, and it shows you the status of the token, e.g. the number of requests remaining on it. Green if you're logged in or on a trusted subnet or something. The status widget would link to all the relevant docs about the trust system. No attempt would be made to hide the workings of the trust system.
      • tigranbs 9 hours ago
        And obviously, you need things fast, so you parallelize a bunch!
        • FieryMechanic 8 hours ago
          I was collecting UK bank account sort code numbers (buying a database at the time cost a huge amount of money). I had spent a bunch of time using asyncio to speed up scraping and wondered why it was still going so slowly; it turned out I had left Fiddler profiling in the background.
    • immibis 9 hours ago
      Because they don't have any reason to give any shits. 90% of their collected data is probably completely useless, but they don't have any incentive to stop collecting useless data, since their compute and bandwidth is completely free (someone else pays for it).

      They don't even use the Wikipedia dumps. They're extremely stupid.

      Actually there's not even any evidence they have anything to do with AI. They could be one of the many organisations trying to shut down the free exchange of knowledge, without collecting anything.

  • pabs3 10 hours ago
    > the difference in power usage caused by scraping costs us ~60 euros a year
  • jepj57 1 hour ago
    What about a copyright notice on websites stating that anyone using your site for training grants the owner of the site an eternal, non-revocable license to the model and must provide a copy of the model upon request? At least then there would be SOME benefit.
    • adastra22 1 hour ago
      Contract law doesn’t work that way.
  • evgpbfhnr 8 hours ago
    I had the same problem on our home server. I just stopped the git forge due to lack of time.

    For what it's worth, most requests kept coming in for ~4 days after everything returned plain 404 errors. Millions of them. And there are still some now, weeks later...

  • zoobab 4 hours ago
    Use stagit: static pages served with a simple nginx setup are blazing fast and should withstand any scrapers.
    • toastal 6 minutes ago
      Darcs by its nature can just be hosted by an HTTP server too, without needing a special tool. I use H2O with a small mruby script to throttle IPs.
  • klaussilveira 7 hours ago
    I wish there was a public database of corporate ASNs and IPs, so we wouldn't have to rely on Cloudflare or any third-party service to detect that an IP is not from a household.
    • wrxd 7 hours ago
      Scrapers use residential VPNs so such a database would help only up to a certain point
    • ronsor 2 hours ago
      There is... It's literally available in every RIR database through WHOIS.
    • eddyg 3 hours ago
      Just search for "residential proxies" and you'll see why this wouldn't help.
  • frogperson 4 hours ago
    Could this be solved with an EULA and some language that non-human readers will be billed at $1 per page? Make all users agree to it. They either pay up or they are breaching contract.

    Is this viable?

    • hamdingers 2 hours ago
      Say you have identified a non-human reader: you have a (probably fake) user agent and an IP address. How do you imagine you'll extract a dollar from that?
    • kstrauser 2 hours ago
      Most of my scraper traffic came from China and Brazil. How am I going to enforce that?
    • grayhatter 3 hours ago
      > Is this viable?

      no

      for many reasons

  • captn3m0 9 hours ago
    I switched to rgit instead of running Gitea.
  • krupan 3 hours ago
    In case you didn't read to the end:

    "This is depressing. Profoundly depressing. i look at the statistics board for my reverse-proxy and i never see less than 96.7% of requests classified as bots at any given moment. The web is filled with crap, bots that pretend to be real people to flood you. All of that because i want to have my little corner of the internet where i put my silly little code for other people to see."

  • ArcHound 9 hours ago
    Seems like you're cooking up a solid bot detection solution. I'd recommend adding JA3/JA4+ into the mix, I had good results against dumb scrapers.

    Also, have you considered Captchas for first contact/rate-limit?

    If you have smart scrapers, then good luck. I recall that bot farms use pre-paid SIM cards for their data connections so that their traffic comes from a good residential ASN. They also have a lot of IPs and overall well-made headless browsers with JS support. Then it's a battle of JS quirks where the official implementation differs from the headless one.

  • hurturue 8 hours ago
    in general the consensus on HN is that the web should be free, scraping public content should be allowed, and net neutrality is desired.

    do we want to change that? do we want to require scrapers to pay for network usage, like the ISPs were demanding from Netflix? is net neutrality a bad thing after all?

    • johneth 5 hours ago
      I think, for many, the web should be free for humans.

      When scraping was mainly used to build things like search indexes which are ultimately mutually beneficial to both the website owner and the search engine, and the scrapers were not abusive, nobody really had a problem.

      But for generative AI training and access, with scrapers that DDoS everything in sight and ultimately cause visits to the websites to fall significantly, merely returning a mangled copy of their content to the user, scraping is a bad thing. It also doesn't help that the generative AI companies haven't paid most people for their training data.

    • wrxd 7 hours ago
      The general consensus here is also that a DDOS attack is bad. I haven't seen objections against respectful scraping. You can say many things about AI scrapers but I wouldn't call them respectful at all.
      • microtherion 23 minutes ago
        a) There are too damn many of them.

        b) They have a complete lack of respect for robots.txt

        I'm starting to think that aggressive scrapers are part of an ongoing business tactic against the decentralized web. Gmail makes self hosted mail servers jump through arduous and poorly documented hoops, and now self hosted services are being DDOSed by hordes of scrapers…

      • BenjiWiebe 6 hours ago
        Do people truly dislike an organic DDoS?

        So much real human traffic that it brings their site down?

        I mean yes it's a problem, but it's a good problem.

        • voidUpdate 3 hours ago
          If my website got hugged to death, I would be very happy. If my website got scraped to hell and back by people putting it into the plagiarism machine so that it can regurgitate my content without giving me any attribution, I would be very displeased
      • charcircuit 7 hours ago
        Yet HN does it when linking to poorly optimized sites. I doubt people running forges would complain about AI scrapers if their sites were optimized for serving the static content that is being requested.
    • komali2 7 hours ago
      I'm completely happy for everything to be free. Free as in freedom, especially! Agpl3, creative commons, let's do it!

      But for some reason corporations don't want that, I guess they want to be allowed to just take from the commons and give nothing in return :/

    • WhyOhWhyQ 8 hours ago
      If net neutrality is a trojan horse for 'Sam Altman and the Anthropic guy own everything I do', then I voice my support for a different path.
    • dns_snek 7 hours ago
      Net neutrality has nothing to do with how content publishers treat visitors, it's about ISPs who try to interfere based on the content of the traffic instead of just providing "dumb pipes" (infrastructure) like they're supposed to.

      I can't speak for everyone, but the web should be free and scraping should be allowed insofar that it promotes dissemination of knowledge and data in a sustainable way that benefits our society and generations to come. You're doing the thing where you're trying to pervert the original intent behind those beliefs.

      I see this as a clear example of the paradox of tolerance.

      • pelotron 2 hours ago
        Just as private businesses are allowed "no shirt, no shoes, no service" policies, my website should be allowed a "no heartbeat, no qualia, no HTTP 200".
  • xyzal 8 hours ago
    Does anyone have an idea how to generate, say, insecure code en masse? I think it should be the next frontier. Don't feed them a random bytestream, but toxic waste.
    • tpxl 16 minutes ago
      Create a few insecure implementations, parse them into an AST, then turn them back into code (basically compile/decompile) except rename the variables and reorder stuff where you can without affecting the result.
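
      In Python the round trip is almost free with the ast module; a tiny sketch of just the rename step (reordering is left out):

          # Parse code, rename every name consistently, and emit it back out.
          import ast

          class Renamer(ast.NodeTransformer):
              def __init__(self):
                  self.mapping = {}

              def visit_Name(self, node):
                  new = self.mapping.setdefault(node.id, f"var_{len(self.mapping)}")
                  return ast.copy_location(ast.Name(id=new, ctx=node.ctx), node)

          src = "total = 0\nfor item in values:\n    total += item\n"
          tree = Renamer().visit(ast.parse(src))
          print(ast.unparse(tree))  # same logic, different surface form
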
    • moooo99 8 hours ago
      Ironically, probably the fastest way to create insecure code is by asking AI chatbots to code