06:48 -- collum joined #evergreen
09:09 -- dguarrac joined #evergreen
09:47 -- mmorgan joined #evergreen
10:05 -- dguarrac joined #evergreen
10:05 -- Jaysal joined #evergreen
10:05 -- dluch joined #evergreen
10:05 -- scottangel joined #evergreen
10:34 -- redavis joined #evergreen
10:37 -- redavis joined #evergreen
10:56 -- Ntwali joined #evergreen
11:03 -- sandbergja joined #evergreen
11:38 -- Christineb joined #evergreen
12:00 -- jihpringle joined #evergreen
13:10 -- jihpringle joined #evergreen
13:15 <StomproJ> Has anyone observed getting thousands (90K today) of requests like "GET /eg/opac/mylist/delete?anchor=record_444643&record=444643"
13:15 <StomproJ> From random IPs.
13:16 <StomproJ> Bots being botty?
13:19 <csharp_> looking...
13:20 <csharp_> yes, seeing similar from 177.37.128.0/17 (Brazil)
13:20 <csharp_> and 138.122.0.0/16 (Uruguay)
13:21 <StomproJ> It looks like there is a redirection loop for that request.
13:21 <jeffdavis> We've been seeing a lot of that lately, esp. in the past month or so. I've been meaning to email the dev list about it, actually.
13:21 <csharp_> whatchall doin', Latin America?
13:21 <Bmagic> yes! same. We're using nginx to block whole countries to help solve it
13:21 <csharp_> Bmagic: GeoIP?
13:21 <Bmagic> right
13:22 <csharp_> I set that up (with Apache) on the git server since they JUST WON'T LEAVE US ALONE!
13:22 <Bmagic> I think the real solution is to teach Evergreen to detect bots and dynamically deny requests before they make it down the whole stack to the DB and back
13:22 * csharp_ shakes fist at the heavens
13:22 <StomproJ> The ChatGPT bot took us completely down in December; we quickly switched from the Pound proxy to HAProxy so I could have rate limits and queuing.
13:22 <csharp_> redirect to https://www.youtube.com/watch?v=dQw4w9WgXcQ\
13:23 <Bmagic> The first step towards fixing this issue was nginx's leaky bucket feature. But the bots defeat that because they are coming from whole blocks of IPs
13:23 <csharp_> https://www.youtube.com/watch?v=dQw4w9WgXcQ even
13:23 <Bmagic> like, 10k IPs, each allowed 10 requests per second via leaky bucket rules
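The leaky-bucket setup Bmagic describes maps to nginx's `limit_req` directives. A minimal sketch, assuming an upstream named `evergreen_backend` and an illustrative zone name and rate; as noted above, a per-IP bucket like this is exactly what a botnet spread across 10k addresses sidesteps:

```nginx
# Shared zone keyed on client IP: ~10 requests/second with a small burst.
limit_req_zone $binary_remote_addr zone=opac:10m rate=10r/s;

server {
    listen 80;
    location /eg/opac/ {
        limit_req zone=opac burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://evergreen_backend;   # assumed upstream name
    }
}
```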
13:24 <csharp_> I take an hour's logs, isolate the IP column, sort -n and that makes it obvious which ranges to block
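The triage csharp_ describes can be scripted. A sketch assuming a combined-format log at `access.log` (the path is an assumption), collapsing each client IP to its /24 and ranking by hit count so heavy ranges float to the top:

```shell
# Field 1 of a combined-format log is the client IP; collapse to /24 and count.
awk '{ split($1, o, "."); print o[1] "." o[2] "." o[3] ".0/24" }' access.log \
  | sort | uniq -c | sort -rn | head -20
```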
13:24 <Bmagic> yep, I do the same, but it's starting to become a full-time job
13:24 <csharp_> but I've over-blocked too - see: Dyrcona not being able to connect to the git server
13:24 <csharp_> yeah
13:24 <csharp_> @hate bots
13:24 <pinesol> csharp_: The operation succeeded. csharp_ hates bots.
13:25 <StomproJ> I had fail2ban set up for a while for certain searches.
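For reference, a fail2ban arrangement like StomproJ mentions could look like the pair below, adapted here to the mylist flood discussed above rather than his search patterns. The filter name, log path, regex, and thresholds are all assumptions to tune locally:

```ini
; /etc/fail2ban/filter.d/eg-opac-abuse.conf  (hypothetical filter name)
[Definition]
; Match the flood pattern in the Apache access log.
failregex = ^<HOST> .* "GET /eg/opac/mylist/(delete|move)

; /etc/fail2ban/jail.d/eg-opac-abuse.conf
[eg-opac-abuse]
enabled  = true
port     = http,https
filter   = eg-opac-abuse
logpath  = /var/log/apache2/access.log
maxretry = 30
findtime = 60
bantime  = 3600
```

Note the over-blocking caveat raised just above: bans keyed on single IPs do little against a botnet that sends only 1-31 requests per address.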
13:25 <csharp_> also, I think even the non-CN ranges are still actually China
13:25 <Bmagic> GeoIP seems to be holding (for now) - they can beat that when they decide to use an AWS data center in the USA
13:26 <csharp_> the requests are usually the same across the botnet
13:26 <jeffdavis> oh no, the bots got StomproJ!
13:26 <csharp_> see, the bots made StomproJ quit!
13:26 <Bmagic> Evergreen should be able to see the query string and realize that it's not valid before it makes it down to the database. There are obvious patterns
13:26 -- Stompro joined #evergreen
13:26 <csharp_> @rescue StomproJ
13:26 <pinesol> csharp_: http://images.cryhavok.org/d/1291-1/Computer+Rage.gif
13:26 <csharp_> hey, that worked!
13:26 <Bmagic> csharp_++
13:27 <csharp_> Stompro: glad you brought it up - we're getting slammed with those requests right now
13:28 <Bmagic> bug 1913617
13:28 <pinesol> Launchpad bug 1913617 in OpenSRF "NGINX could use a DOS mitigation example" [Undecided,New] https://launchpad.net/bugs/1913617
13:30 <Bmagic> ^^ leaky bucket idea. But the current attacks aren't caught by that because of the 10k-IP usage. Another thought I had: can we use the IP itself as a pattern? Like, when we're seeing hits from the same Class B block inside a 20-30 minute period that we don't normally see?
13:31 <Bmagic> perhaps we could add a mechanism that would establish a baseline, and then look for anomalies, in terms of large numbers of requests from non-baseline IPs
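The Class-B baseline idea could be prototyped outside Evergreen first. A shell sketch where the log path, the `baseline_16s.txt` known-good list, and the 1000-hit threshold are all assumptions, flagging /16 blocks that are high-volume in the current window but absent from the baseline:

```shell
# Count requests per /16 in the current log window, then report blocks
# that are both high-volume and not on the known-good baseline list.
awk '{ split($1, o, "."); print o[1] "." o[2] }' access.log \
  | sort | uniq -c | sort -rn \
  | while read -r count block; do
      if [ "$count" -gt 1000 ] && ! grep -qx "$block" baseline_16s.txt; then
        echo "suspect /16: $block ($count hits)"
      fi
    done
```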
13:33 <Stompro> I've been looking real hard at Cloudflare also. Let them deal with it.
13:33 <Bmagic> Cloudflare is a fine idea, but as soon as you turn on the proxy, certain parts of the Evergreen staff client break
13:33 <Bmagic> That was the very first thing I did
13:34 <Stompro> Bmagic++ trying the things out since forever.
13:34 <Bmagic> So, if you go Cloudflare, you'd need to set up a different DNS name for staff client traffic. And that, honestly, might be the best way to go. It's the patron OPAC that is getting all the trouble
13:36 <Bmagic> I really do love Cloudflare. I think their servers would do a great job of figuring this traffic out. They operate at a scale beyond what I can imagine.
13:37 <Bmagic> A wrinkle with staff-only DNS: the server will still respond to OPAC requests. Even parts of the staff catalog interface invoke the patron OPAC, so we'd need to figure out how to allow logged-in staff to talk to the OPAC but *not* the rest of the internet
13:38 <Stompro> How about putting the OPAC behind EZproxy and making everyone log in first :-)
13:38 <Bmagic> So the bots would just start probing the staff DNS entry, because it's a domain name on public record.
13:40 <Stompro> 729539 of those mylist/delete queries since Jan 2nd. Looking to see from how many IPs.
13:40 <Bmagic> thinking more about it, perhaps a clever Apache rewrite rule could *not* rewrite the bare domain to /eg/opac/home, so the page wouldn't load. And you'd need to know the top-secret URL (/eg/staff) in order to interact with the server
13:44 <csharp_> Bmagic: we have Cloudflare, but proxying is not on yet
13:44 <Bmagic> same
13:45 <Bmagic> go ahead, turn it on. It works (mostly)
13:46 <Bmagic> There will be some weird behavior that you can't figure out. Then you'll realize (oh, did I turn on the Cloudflare proxy?) - let me turn that off and see if that fixes it. And boom, it will. I can't remember exactly which pages in the staff client suffer from it.
13:47 <csharp_> we'll have to start it on one of our test servers after next month's upgrade, if we can hold out that long
13:49 <Stompro> 32880 unique IPs, most making only 1 or 31 requests, a handful at 62.
14:00 <csharp_> damn - they really are at it today
14:10 -- jihpringle joined #evergreen
14:11 <Stompro> I'm going to try dropping those requests with HAProxy and see if they get the hint: "http-request silent-drop if { path_beg /eg/opac/mylist/delete }"
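For context, that `http-request silent-drop` rule sits in a frontend (or listen) section; HAProxy discards the matched request without responding or sending a reset, so each bot connection hangs until it times out. The section, backend, and certificate names below are illustrative:

```haproxy
frontend eg_https
    bind :443 ssl crt /etc/haproxy/certs/eg.pem
    # Silently discard the mylist flood before it reaches Apache.
    http-request silent-drop if { path_beg /eg/opac/mylist/delete }
    default_backend eg_apache
```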
14:13 <sandbergja> Re: teaching Evergreen to identify bots and not do expensive work for them: in bug 1895695, I used a CPAN module called Duadua. I thought it was a nice module (although it only checks the user agent, so it only helps with the bots who aren't liars): https://metacpan.org/pod/Duadua
14:13 <pinesol> Launchpad bug 1895695 in Evergreen "Wishlist: add a link tracker for the course materials module" [Wishlist,Confirmed] https://launchpad.net/bugs/1895695
14:24 <Bmagic> sandbergja++
14:32 -- mantis1 joined #evergreen
14:40 <Stompro> Has anyone looked at using something like Google reCAPTCHA v3 to only let hoomans access certain pages on the OPAC?
14:40 <eby> there's a whole bots channel on the code4lib Slack now because everyone's repositories are getting slammed by the AI scrapers. Quite a few went with Cloudflare Turnstile
14:41 <eby> but that is more of a middleware solution
14:42 <Stompro> eby++ thanks.
14:42 <eby> https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker
14:43 <eby> https://github.com/ai-robots-txt/ai.robots.txt
14:46 <eby> and if you happen to be on AWS: https://aws.amazon.com/about-aws/whats-new/2024/09/aws-waf-bot-control-managed-group-rule-bot-detection-capabilities/
14:54 <Stompro> I don't think what is hitting us now is reading robots.txt, since they are not identifying themselves in the UA.
14:54 <JBoyer> I've got some thoughts on the bot stuff.
14:54 <JBoyer> I don't think we should teach 'evergreen' to do anything about this. Evergreen isn't a web server, load balancer, or sysadmin, though it needs those things. We should share mitigations and ideas about it, maybe including them in the EG docs, though it's not an Evergreen-specific issue.
14:54 <JBoyer> Decisions about this traffic should happen before requests hit Evergreen's mod_perl-laden Apache processes.
14:54 <JBoyer> This is something Aspen tries to do that irritates me. If you're letting every request through to the PHP and a pile of DB queries just to be told to go away, your server is still going to be swamped, just a little bit later.
14:55 <JBoyer> robots.txt is useless except to request that the big 3 search engines maybe (maybe!) don't go here or there, and I think only Google supports Crawl-Delay, if they still do. (Which is also useless if you have many vhosts.)
14:55 <JBoyer> Sysadmins should have the large registrars' whois lookups bookmarked. More and more, if ARIN tells me that the whole /8 I'm looking at is registered at APNIC or RIPE, it's probably all going to be blocked without any further research. (But don't do that with "early registration" blocks.)
14:55 <JBoyer> The internet sucks in 2025 and I'm weary of all of it.
15:07 <redavis> JBoyer++
15:08 <JBoyer> Also, these kids, my lawn, why.
15:08 <csharp_> JBoyer++
15:08 <csharp_> so grumpy today
15:08 <csharp_> I am grumpy, that is
15:08 * redavis is just generally tired...of things.
15:08 <csharp_> also, whois is "connect: Network is unreachable"
15:08 <csharp_> and that's making me more miserable
15:09 <redavis> lol, whois is also tired...and unreachable.
15:09 <csharp_> my bash wrapper to use whois to grab the range and block it via iptables depends on that working, obvs
15:11 <jeffdavis> IMO, AI companies are jockeying for position and strip-mining us in the process.
15:12 <sandbergja> jeffdavis++
15:12 <redavis> I think you might be right.
15:13 * redavis has a moment thinking about infrastructure and what it was built for and what it wasn't...and what kind of capacity it has and doesn't.
15:22 <redavis> Everything's fine.
15:31 -- mantis1 left #evergreen
15:33 -- justdoglet joined #evergreen
15:49 <justdoglet> Dealing with the bots here is definitely getting towards a full-time job. Russia and environs as well as Turkiye and Brasil are taking up much of my time today.
15:51 <csharp_> justdoglet: they must be targeting all EG sites - that's all I've done this afternoon
15:52 <justdoglet> We're seeing the mylist/deletes, as well as a number of other things. In talking with one of our institutions, it seems Apple may be routing iCloud Private Relay traffic through netblocks in Korea and other places, making things trickier.
15:52 <csharp_> ugh
15:53 <justdoglet> csharp_: Sounds about right. These things seem to come in waves -- in our work chat, I often give a daily bot weather roundup. I'm amazed at how things change, yet stay the same. :|
15:53 <csharp_> :-(
16:11 <Bmagic> JBoyer++ # I completely agree. Though my attitude has evolved to think that we do need to get something baked into Evergreen to do *something*. Otherwise, I'm spending all day blocking IP subnets. I am aware of the global flag memcached IP tracking. It would be a dream if Evergreen/Apache somewhere in the stack could thwart this type of traffic, based upon XXXX
16:12 <jeffdavis> Would a collaborative IP blocklist be useful?
16:13 <Bmagic> jeffdavis: I think so
16:14 <Bmagic> Though, I've been swapping nginx configs over to GeoIP, so I have fewer and fewer IP subnets to contribute, because I'm not hunting these down anymore
16:14 <justdoglet> https://www.irccloud.com/pastebin/YEIpMVfq/
16:15 <justdoglet> I've got a few to share, though they need some cleanup.
16:17 <Bmagic> lol, 7k lines of blocks is a good start. The issue becomes: over time, the IPs change hands
16:18 <justdoglet> IPs for Claudebot. https://www.irccloud.com/pastebin/Bt1c3TsK/claudebot.txt
16:19 <justdoglet> Bmagic: Yes, IPs do change hands -- and the rate at which they change hands varies, depending on the provider. I assume things like DigitalOcean change more than other providers, for example. I also have data on Internet threats spanning a year, and have noticed that IPs don't change hands as often as we think.
16:24 <justdoglet> Chinanet netblocks. https://www.irccloud.com/pastebin/FL7hsEUw/chinanet.txt
16:24 <Bmagic> we're using this, FYI: https://docs.nginx.com/nginx/admin-guide/security-controls/controlling-access-by-geoip/
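The guide linked there pairs nginx's `ngx_http_geoip_module` with a `map` on the resolved country code. A sketch of that pattern for the http context; the database path, the country list, and the upstream name are all assumptions (block whatever your own logs justify):

```nginx
# Requires nginx built with ngx_http_geoip_module and a MaxMind
# country database (path is an assumption).
geoip_country /usr/share/GeoIP/GeoIP.dat;

map $geoip_country_code $deny_country {
    default 0;
    CN 1;   # example entries only
    BR 1;
    UY 1;
}

server {
    location /eg/opac/ {
        if ($deny_country) { return 403; }
        proxy_pass http://evergreen_backend;   # assumed upstream name
    }
}
```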
16:24 <justdoglet> China Unicom netblocks. https://www.irccloud.com/pastebin/dK9MJnIK/unicom.txt
16:25 <justdoglet> That's a start -- those are the hardest on the db server (Claudebot) and most voluminous (Chinanet/Unicom).
16:26 <Bmagic> it would be "fun" to reverse SYN flood the abusive IPs using the same established TCP HTTP session
16:29 <justdoglet> I've considered getting destructive, but I want to keep my job. :) On my personal sites I am rigging up an arrangement where Claudebot/OpenAI/etc. IPs get random text from a local LLM, and Chinanet, etc. are just firewalled.
16:30 <Bmagic> :)
16:33 <justdoglet> A couple more lists -- miscellaneous bots, typically one-offs, and Amazonbot, who has flattened our app servers more than once.
16:33 <Bmagic> It costs them less CPU to query the server than it does us to answer their call. It's a war of CPU, and the deck is stacked in their favor 10 to 1, by my estimation. We're using AWS; they're using AWS. It's datacenter to datacenter, one in the US and one in Singapore. They're paying for the vCPUs and we are too, but they win.
16:33 <justdoglet> Miscellaneous bots. https://www.irccloud.com/pastebin/vBgApy54/miscbots.txt
16:34 <justdoglet> Amazonbot. May be highly variable, as they control the IP pools and have tonnes to pull from. https://www.irccloud.com/pastebin/ozLIPUov/amazonbot.txt
16:35 <justdoglet> Yes, absolutely -- they have the upper hand in terms of resources. That's why I've been focusing on excluding malicious traffic rather than attempting to get EG to do magic with it.
16:41 <Bmagic> the Amazonbot IPs that you list are the recent offenders for me
16:43 <justdoglet> Their IP pool _seems_ to be relatively constrained -- I don't think I see more than one new Amazonbot IP every two weeks, now that I've got what I assume is most of them. I'll just note that since they can swap netblocks with other parts of their business to get fresh netblocks at any time, IPs from that list may wind up elsewhere (like S3 or Route 53) eventually.
16:48 <justdoglet> (There are patterns in the IPv4 space, of course -- 3/8 is mostly Amazon, 4/8 is mostly Level3/CenturyLink, 6/8 is all US military, and so on. I saw someone mention massaging the data to get a list of IPs and hit counts to make it easier to zero in on bots. That's one strategy I use often, too.)
16:58 <justdoglet> In addition to the mylist/deletes, are other folks also seeing the mylist/moves? They used to come in groups of 31 requests for me, but recently they've been variable. Like the mylist/deletes, it's bouncing off a URL and getting 302s, but never following the 302 (I'm assuming the 302 isn't to the same URL).
16:59 <justdoglet> https://www.irccloud.com/pastebin/ILCwVHMg/
17:13 -- mmorgan left #evergreen
17:20 -- sandbergja joined #evergreen
17:41 <Bmagic> justdoglet: yes, same here
17:42 <justdoglet> Okay, I'm happy I'm not the only one. I can't imagine this is useful traffic, so I've been firewalling based on it.
17:42 <Bmagic> my thoughts exactly
17:47 <Bmagic> That URL pattern shouldn't be accessible unless logged in, IMO
17:52 <Bmagic> but Evergreen does allow anonymous basket manipulation, for temporary baskets. So there are legit uses for those patterns. Moving those lines down in EGCatLoader.pm beneath "requires login" could help cut that malicious DOS out. But that's not the extent of these bots' patterns
17:55 <pinesol> News from commits: LP#1851721 Hold Shelf Owning Library Column Needed <https://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=bd7f1a6885d3cbe9661ce5f8eedf91aefe08d2b4>
17:59 <justdoglet> I do see some request chains from potential bots bounce off /eg/opac/temp_warn, but the ones that are all 302s don't. I'm too lazy to dig into the code, but it wouldn't surprise me if those 302s are the ones that try to redirect through temp_warn.
18:03 <justdoglet> Speaking of logged-in stuff, I see more than a few attempts to access reporter output that doesn't exist. I don't know EG's strategy for numbering report output, so it's hard to tell if these requests are random or whether they're trying values yanked from another installation they've been on.
18:04 <Bmagic> yeah, I get the feeling that they are using some kind of intelligence, riffing off of other query strings. "Learning" our ways, and slightly altering the query string to trip us up.
18:05 <Bmagic> I saw someone search for "Way of water", and then the bots started doing it. Crazy stuff
18:05 <justdoglet> Yep, that's very odd.
18:06 <Bmagic> reminds me of the Dodo birds in Ice Age
18:06 <justdoglet> No searches for "way of water" here today. :)
18:08 <Bmagic> I think that search died out a couple of years ago. RIP, "way of water". I've been working on this over the last 10 years, off and on. In that example, I ended up introducing a cron job to kill DB searches for that phrase. That's not in place anymore, of course.
18:10 <justdoglet> We just have a cron job to kill DB searches that take too long. We're blessed with reasonably punchy DB servers, so we haven't had to get too tricky with the auto-killing of searches so far.
18:14 <Bmagic> We do the same