Evergreen ILS Website

IRC log for #evergreen, 2025-01-23


All times shown according to the server's local time.

Time Nick Message
06:48 collum joined #evergreen
09:09 dguarrac joined #evergreen
09:47 mmorgan joined #evergreen
10:05 dguarrac joined #evergreen
10:05 Jaysal joined #evergreen
10:05 dluch joined #evergreen
10:05 scottangel joined #evergreen
10:34 redavis joined #evergreen
10:37 redavis joined #evergreen
10:56 Ntwali joined #evergreen
11:03 sandbergja joined #evergreen
11:38 Christineb joined #evergreen
12:00 jihpringle joined #evergreen
13:10 jihpringle joined #evergreen
13:15 StomproJ Has anyone observed getting thousands (90K today) of requests like "GET /eg/opac/mylist/delete?anchor=record_444643&record=444643"
13:15 StomproJ From random IPs.
13:16 StomproJ Bots being botty?
13:19 csharp_ looking...
13:20 csharp_ yes, seeing similar from 177.37.128.0/17 (Brazil)
13:20 csharp_ and 138.122.0.0/16 (Uruguay)
13:21 StomproJ It looks like there is a redirection loop for that request.
13:21 jeffdavis We've been seeing a lot of that lately, esp in the past month or so. I've been meaning to email the dev list about it actually.
13:21 csharp_ whatchall doin' Latin America?
13:21 Bmagic yes! same. We're using nginx to block whole countries to help solve it
13:21 csharp_ Bmagic: GeoIP?
13:21 Bmagic right
13:22 csharp_ I set that up (with Apache) on the git server since they JUST WON'T LEAVE US ALONE!
13:22 Bmagic I think the real solution is to teach evergreen to detect bots and dynamically deny requests before they make it down the whole stack to the DB and back
13:22 * csharp_ shakes fist at the heavens
13:22 StomproJ ChatGPT bot took us completely down in December, quickly switched from PoundProxy to HAProxy so I could have rate limits and queuing.
13:22 csharp_ redirect to https://www.youtube.com/watch?v=dQw4w9WgXcQ\
13:23 Bmagic The first step towards fixing this issue was nginx's leaky bucket feature. But the bots defeat that because they are coming from whole Blocks of IP's
13:23 csharp_ https://www.youtube.com/watch?v=dQw4w9WgXcQ even
13:23 Bmagic like, 10k IP's each allowed 10 requests per second via leaky bucket rules
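Why per-IP leaky bucket limits stop helping once the traffic is spread over thousands of addresses can be shown with a small simulation. This is only a sketch of the general leaky bucket idea, not nginx's implementation; the class name, the rate and capacity of 10, and the sample addresses are all assumptions made up for illustration.

```python
from collections import defaultdict


class LeakyBucket:
    """Per-client bucket: holds up to `capacity` requests, drains `rate` per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.level = 0.0
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Drain whatever leaked out since this client's previous request.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False


def count_allowed(requests, rate=10.0, capacity=10.0):
    """requests: iterable of (timestamp_seconds, ip). Returns how many are admitted."""
    buckets = defaultdict(lambda: LeakyBucket(rate, capacity))
    return sum(1 for ts, ip in requests if buckets[ip].allow(ts))


# 1,000 requests at the same instant from one IP: only the bucket capacity gets through.
single_source = [(0.0, "10.0.0.1")] * 1000
# 1,000 requests at the same instant, one per IP: every bucket is empty, so all pass.
many_sources = [(0.0, f"10.0.{i // 250}.{i % 250}") for i in range(1000)]

print(count_allowed(single_source))  # 10
print(count_allowed(many_sources))   # 1000
```

The limiter does its job against a single noisy client, but a botnet that sends each address only a handful of requests never fills any one bucket.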
13:24 csharp_ I take an hour's logs, isolate the IP column, sort -n and that makes it obvious which ranges to block
13:24 Bmagic yep, I do the same, but it's starting to become a full time job
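The "isolate the IP column and sort" approach can be taken one step further by grouping addresses into their enclosing network, so that busy ranges stand out even when no single IP does. A minimal sketch, assuming an access log whose first whitespace-separated field is the client IPv4 address; the function name, the /16 grouping, and the sample log lines are invented for illustration.

```python
import ipaddress
from collections import Counter


def hits_by_network(log_lines, prefix_len=16):
    """Count requests per enclosing IPv4 network of the given prefix length."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # line does not start with an IP address
        if addr.version != 4:
            continue  # this sketch only handles IPv4
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        counts[str(net)] += 1
    return counts.most_common()


# Made-up sample lines in a common access-log shape.
sample = [
    '177.37.130.4 - - "GET /eg/opac/mylist/delete?record=1 HTTP/1.1" 302',
    '177.37.201.9 - - "GET /eg/opac/mylist/delete?record=2 HTTP/1.1" 302',
    '138.122.7.20 - - "GET /eg/opac/mylist/delete?record=3 HTTP/1.1" 302',
    '192.0.2.10 - - "GET /eg/opac/home HTTP/1.1" 200',
]

for network, hits in hits_by_network(sample):
    print(network, hits)
```

A wide prefix makes ranges obvious but also makes over-blocking easier, so it may be worth rerunning with `prefix_len=24` before acting on a result.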
13:24 csharp_ but I've over-blocked too - See: Dyrcona not being able to connect to the git server
13:24 csharp_ yeah
13:24 csharp_ @hate bots
13:24 pinesol csharp_: The operation succeeded.  csharp_ hates bots.
13:25 StomproJ I had fail2ban setup for a while for certain searches.
13:25 csharp_ also, I think even the non-CN ranges are still actually China
13:25 Bmagic GeoIP seems to be holding (for now) - they can beat that when they decide to use AWS data center in the USA
13:25 csharp_ the requests are usually the same across the botnet
13:26 jeffdavis oh no, the bots got StomproJ!
13:26 csharp_ see, the bots made StomproJ quit!
13:26 Bmagic Evergreen should be able to see the querystring and realize that it's not valid before it makes it down to the database. There are obvious patterns
13:26 Stompro joined #evergreen
13:26 csharp_ @rescue StomproJ
13:26 pinesol csharp_: http://images.cryhavok.org/d/1291-1/Computer+Rage.gif
13:26 csharp_ hey, that worked!
13:26 Bmagic csharp_++
13:27 csharp_ Stompro: glad you brought it up - we're getting slammed with those requests right now
13:28 Bmagic bug 1913617
13:28 pinesol Launchpad bug 1913617 in OpenSRF "NGINX could use a DOS mitigation example" [Undecided,New] https://launchpad.net/bugs/1913617
13:30 Bmagic ^^ leaky bucket idea. But, the current attacks aren't caught by that because of the 10k IP usage. Another thought I had was: can we use the IP itself as a pattern? Like, when we're seeing hits from the same Class B block inside a 20-30 minute period, that we don't normally see?
13:31 Bmagic perhaps we could add a mechanism that would establish a baseline, and then look for anomalies, in terms of large numbers of requests from non-baseline IP's
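The baseline idea could look roughly like the following: count hits per /16 ("Class B" sized) block in a quiet reference period, then flag blocks whose hits in the recent window are far above what the baseline saw. This is a rough sketch rather than anything Evergreen or nginx provides; the function names, the thresholds, and the assumption that both inputs cover comparable lengths of time are all hypothetical.

```python
from collections import Counter


def slash16(ip: str) -> str:
    """Return the /16 block for a dotted-quad IPv4 address."""
    a, b, *_ = ip.split(".")
    return f"{a}.{b}.0.0/16"


def unusual_blocks(baseline_ips, window_ips, min_hits=100, ratio=10.0):
    """Flag /16 blocks that are busy now but rare or absent in the baseline."""
    base = Counter(slash16(ip) for ip in baseline_ips)
    recent = Counter(slash16(ip) for ip in window_ips)
    flagged = []
    for block, hits in recent.items():
        if hits < min_hits:
            continue  # too quiet to matter
        # The +1 avoids dividing by zero for blocks never seen in the baseline.
        if hits / (base[block] + 1) >= ratio:
            flagged.append((block, hits, base[block]))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Made-up data: one block that is always busy, one that suddenly appears.
baseline = ["192.0.2.1"] * 500 + ["177.37.1.1"] * 2
window = ["192.0.2.1"] * 480 + [f"177.37.{i % 256}.{i % 200}" for i in range(900)]

print(unusual_blocks(baseline, window))  # [('177.37.0.0/16', 900, 2)]
```

The regular block is left alone because its volume matches the baseline, while the block that goes from 2 hits to 900 is flagged even though each address in it sends only a few requests.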
13:33 Stompro I've been looking real hard at Cloudflare also.  Let them deal with it.
13:33 Bmagic Cloudflare is a fine idea, but as soon as you turn on the Proxy, certain parts of the Evergreen staff client break
13:33 Bmagic That was the very first thing I did
13:34 Stompro Bmagic++ trying the things out since forever.
13:34 Bmagic So, if you go Cloudflare, you'd need to setup a different DNS name for staff client traffic. And that, honestly, might be the best way to go. It's the patron OPAC that is getting all the troubles
13:36 Bmagic I really do love Cloudflare. I think their servers would do a great job of figuring this traffic out. They operate at a scale beyond what I can imagine.
13:37 Bmagic A wrinkle: DNS for staff only: the server will still respond to OPAC requests. Even parts of the staff catalog interface invoke the patron OPAC, so we'd need to figure out how to allow the logged-in staff to talk to the OPAC but not* the rest of the internet
13:38 Stompro How about putting the opac behind ezproxy and making everyone log in first :-)
13:38 Bmagic So the bots would just start probing the staff DNS entry because it's a domain name on public record.
13:40 Stompro 729539 of those mylist/delete queries since Jan 2nd.  Looking to see from how many IPs.
13:40 Bmagic thinking more about it, perhaps a clever apache rewrite rule could not* rewrite the bare domain to /eg/opac/home, so the page wouldn't load. And you'd need to know the top secret URL (/eg/staff) in order to interact with the server
13:44 csharp_ Bmagic: we have Cloudflare, but proxying is not on yet
13:44 Bmagic same
13:45 Bmagic go ahead, turn it on. It works (mostly)
13:46 Bmagic There will be some weird behavior that you can't figure out. Then you'll realize (oh, did I turn on the cloudflare proxy?) - let me turn that off and see if that fixes it. And boom, it will. I can't remember exactly which pages in the staff client suffer from it.
13:47 csharp_ we'll have to start it on one of our test servers after next month's upgrade if we can hold out that long
13:49 Stompro 32880 unique IPs, most making 1 or 31 requests only, a handful at 62.
14:00 csharp_ damn - they really are at it today
14:10 jihpringle joined #evergreen
14:11 Stompro I'm going to try dropping those requests with haproxy and see if they get the hint.  "http-request silent-drop if { path_beg /eg/opac/mylist/delete }"
14:13 sandbergja Re: teaching evergreen to identify bots and not do expensive work for them: in bug 1895695, I used a CPAN module called Duadua.  I thought it was a nice module (although it only checks user agent, so it only helps with the bots who aren't liars): https://metacpan.org/pod/Duadua
14:13 pinesol Launchpad bug 1895695 in Evergreen "Wishlist: add a link tracker for the course materials module" [Wishlist,Confirmed] https://launchpad.net/bugs/1895695
14:24 Bmagic sandbergja++
14:32 mantis1 joined #evergreen
14:40 Stompro Has anyone looked at using something like google recaptcha V3 to only let hoomans access certain pages on the opac?
14:40 eby there's a whole bots channel on the code4lib slack now because everyone's repositories are getting slammed by the AI scrapers. Quite a few went with the Cloudflare Turnstile
14:40 eby but that is more of a middleware solution
14:41 Stompro eby++ thanks.
14:42 eby https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker
14:42 eby https://github.com/ai-robots-txt/ai.robots.txt
14:43 eby and if you happen to be on AWS https://aws.amazon.com/about-aws/whats-new/2024/09/aws-waf-bot-control-managed-group-rule-bot-detection-capabilities/
14:46 Stompro I don't think what is hitting us now is reading the robots.txt since they are not identifying themselves in the UA.
14:54 JBoyer I've got some thoughts on the bot stuff.
14:54 JBoyer I don't think we should teach 'evergreen' to do anything about this. Evergreen isn't a web server, load balancer, or sysadmin, though it needs those things. We should share mitigations and ideas about it, maybe including in the Eg docs, though it's not an evergreen-specific issue.
14:54 JBoyer Decisions about this traffic should happen before requests hit Evergreen's mod_perl-laden apache processes.
14:54 JBoyer This is something Aspen tries to do that irritates me. If you're letting every request through to the php and a pile of db queries just to be told to go away your server is still going to be swamped, just a little bit later.
14:54 JBoyer robots.txt is useless except to request that the big 3 search engines maybe (maybe!) don't go here or there, and I think only Google supports Crawl-Delay, if they still do. (Which is also useless if you have many vhosts.)
14:55 JBoyer Sysadmins should have the large registrar's whois lookups bookmarked. More and more if ARIN tells me that the whole /8 I'm looking at is registered at APNIC or RIPE it's probably all going to be blocked without any further research. (But don't do that with "early registration" blocks)
14:55 JBoyer The internet sucks in 2025 and I'm weary of all of it.
14:55 redavis JBoyer++
15:07 JBoyer Also, these kids, my lawn, why.
15:08 csharp_ JBoyer++
15:08 csharp_ so grumpy today
15:08 csharp_ I am grumpy that is
15:08 * redavis is just generally tired...of things.
15:08 csharp_ also whois is connect: Network is unreachable
15:08 csharp_ and that's making me more miserable
15:09 redavis lol, whois is also tired...and unreachable.
15:09 csharp_ my bash wrapper to use whois to grab the range and block it via IPTables depends on that working, obvs
15:11 jeffdavis IMO, AI companies are jockeying for position and strip-mining us in the process.
15:12 sandbergja jeffdavis++
15:12 redavis I think you might be right.
15:13 * redavis has a moment thinking about infrastructure and what it was built for and what it wasn't...and what kind of capacity it has and doesn't.
15:22 redavis Everything's fine.
15:31 mantis1 left #evergreen
15:33 justdoglet joined #evergreen
15:49 justdoglet Dealing with the bots here is definitely getting towards a full-time job. Russia and environs as well as Turkiye and Brasil are taking up much of my time today.
15:51 csharp_ justdoglet: they must be targeting all EG sites - that's all I've done this afternoon
15:52 justdoglet We're seeing the mylist/deletes, as well as a number of other things. In talking with one of our institutions, it seems Apple may be routing iCloud Private Relay traffic through netblocks in Korea and other places, making things trickier.
15:52 csharp_ ugh
15:53 justdoglet csharp_: Sounds about right. These things seem to come in waves -- in our work chat, I often give a daily bot weather roundup. I'm amazed at how things change, yet stay the same. :|
15:53 csharp_ :-(
16:11 Bmagic JBoyer++ # I completely agree. Though my attitude has evolved to think that we do need to get something baked into Evergreen to do something*. Otherwise, I'm spending all day blocking IP subnets. I am aware of the global flag memcached IP tracking. It would be a dream if Evergreen/Apache somewhere in the stack could thwart this type of traffic, based upon XXXX
16:12 jeffdavis Would a collaborative IP blocklist be useful?
16:13 Bmagic jeffdavis: I think so
16:14 Bmagic Though, I've been swapping nginx configs over to GeoIP, so I have fewer and fewer IP subnets to contribute because I'm not hunting these down anymore
16:14 justdoglet https://www.irccloud.com/pastebin/YEIpMVfq/
16:15 justdoglet I've got a few to share, though needs some cleanup.
16:17 Bmagic lol 7k lines of blocks is a good start. The issue becomes: over time the IP's change hands
16:18 justdoglet IPs for claudebot. https://www.irccloud.com/pastebin/Bt1c3TsK/claudebot.txt
16:19 justdoglet Bmagic: Yes, IPs do change hands -- and the rate at which they change hands varies, depending on provider. I assume things like DigitalOcean change more than other providers, for example. I also have data on Internet threats spanning a year, and have noticed that IPs don't change hands as often as we think.
16:24 justdoglet Chinanet netblocks. https://www.irccloud.com/pastebin/FL7hsEUw/chinanet.txt
16:24 Bmagic we're using this FYI: https://docs.nginx.com/nginx/admin-guide/security-controls/controlling-access-by-geoip/
16:24 justdoglet China Unicom netblocks. https://www.irccloud.com/pastebin/dK9MJnIK/unicom.txt
16:25 justdoglet That's a start -- those are the hardest on the db server (Claudebot) and most voluminous (Chinanet/Unicom).
16:26 Bmagic it would be "fun" to reverse SYN flood the abusive IP's using the same established TCP HTTP session
16:29 justdoglet I've considered getting destructive, but I want to keep my job. :) On my personal sites I am rigging up an arrangement where Claudebot/OpenAI/etc. IPs get random text from a local LLM, and Chinanet, etc. are just firewalled.
16:30 Bmagic :)
16:33 justdoglet A couple more lists -- miscellaneous bots, typically one-offs, and Amazonbot, who has flattened our app servers more than once.
16:33 Bmagic It costs them less CPU to query the server than it does us to answer their call. It's a war of CPU, and the deck is stacked in their favor 10 to 1 by my estimation. We're using AWS, they're using AWS. It's datacenter to datacenter, one in the US and one in Singapore. They're paying for the vCPU's and we are too, but they win.
16:33 justdoglet Miscellaneous bots. https://www.irccloud.com/pastebin/vBgApy54/miscbots.txt
16:34 justdoglet Amazonbot. May be highly variable as they control the IP pools and have tonnes to pull from. https://www.irccloud.com/pastebin/ozLIPUov/amazonbot.txt
16:35 justdoglet Yes, absolutely -- they have the upper ground in terms of resources. That's why I've been focusing on exclusion of malicious traffic rather than attempting to get EG to do magic with it.
16:41 Bmagic the amazonbot IP's that you list are the recent offenders for me
16:43 justdoglet Their IP pool _seems_ to be relatively constrained -- I don't think I see more than one Amazonbot IP every two weeks now that I've got what I assume is most of them. I just note that they can swap netblocks with other parts of their business to get fresh netblocks at any time, IPs from that list may wind up elsewhere (like S3 or Route 53) eventually.
16:48 justdoglet (There are patterns in the IPv4 space, of course -- 3/8 is mostly Amazon, 4/8 is mostly Level3/CenturyLink, 6/8 is all US military, and so on. I saw someone mention massaging the data to get a list of IPs and hit counts to make it easier to zero in on bots. That's one strategy I use often, too.)
16:58 justdoglet In addition to the mylist/deletes, are other folks also seeing the mylist/moves? They used to come in groups of 31 requests for me, but recently they've been variable. Like the mylist/deletes, it's bouncing off an URL and getting 302s, but never following the 302 (I'm assuming the 302 isn't to the same URL).
16:59 justdoglet https://www.irccloud.com/pastebin/ILCwVHMg/
17:13 mmorgan left #evergreen
17:20 sandbergja joined #evergreen
17:41 Bmagic justdoglet: yes, same here
17:42 justdoglet Okay, I'm happy I'm not the only one. I can't imagine this is useful traffic, so I've been firewalling based on it.
17:42 Bmagic my thoughts exactly
17:47 Bmagic That URL pattern shouldn't be accessible unless logged in, IMO
17:52 Bmagic but, Evergreen does allow anonymous basket manipulation, for temporary baskets. So, there are legit uses for those patterns. Moving those lines down in EGCatLoader.pm beneath "requires login" could help cut that malicious DOS out. But that's not the extent of these bots' patterns
17:55 pinesol News from commits: LP#1851721 Hold Shelf Owning Library Column Needed <https://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=bd7f1a6885d3cbe9661ce5f8eedf91aefe08d2b4>
17:59 justdoglet I do see some request chains from potential bots bounce off /eg/opac/temp_warn, but ones that are all 302s don't. I'm too lazy to dig into the code, but it wouldn't surprise me if those 302s are the ones that try to redirect through temp_warn.
18:03 justdoglet Speaking of logged-in stuff, I see more than a few attempts to access reporter output that doesn't exist. I don't know EG's strategy for numbering report output, so it's hard to tell if these requests are random or they're trying values yanked from another installation they've been on.
18:04 Bmagic yeah, I get the feeling that they are using some kind of intelligence, riffing off of other querystrings. "Learning" our ways, and slightly altering the querystring to trip us up.
18:05 Bmagic I saw someone search for "Way of water", and then the bots started doing it. Crazy stuff
18:05 justdoglet Yep, that's very odd.
18:06 Bmagic reminds me of the Dodo birds in Ice Age
18:06 justdoglet No searches for "way of water" here today. :)
18:08 Bmagic I think that search died out a couple years ago. RIP way of water. I've been working on this over the last 10 years off and on. In that example, I ended up introducing a cronjob to kill DB searches for that phrase. That's not in place anymore of course.
18:10 justdoglet We just have a cron job to kill DB searches that take too long. We're blessed with reasonably punchy DB servers, so we haven't had to get too tricky with the auto-killing of searches so far.
18:14 Bmagic We do the same
