IRC log for #evergreen, 2021-01-25

All times shown according to the server's local time.

Time Nick Message
00:09 sandbergja joined #evergreen
01:00 sandbergja joined #evergreen
07:23 rjackson_isl_hom joined #evergreen
07:47 mantis1 joined #evergreen
07:56 rfrasur joined #evergreen
08:40 mmorgan joined #evergreen
08:44 mmorgan left #evergreen
08:51 Dyrcona joined #evergreen
08:55 yar joined #evergreen
08:55 rfrasur joined #evergreen
08:56 mmorgan joined #evergreen
09:00 mmorgan left #evergreen
09:05 jvwoolf joined #evergreen
09:07 alynn26 joined #evergreen
09:24 dbwells joined #evergreen
09:54 mmorgan joined #evergreen
09:54 csharp berick: I was able to test your patch over the weekend and it works as expected - because drone exhaustion was dire this morning, I've applied it to PINES production and am watching carefully
09:54 csharp so far so good though
09:55 csharp if we don't see any problems by noon or so, I'll sign off/commit
09:57 berick csharp++
10:00 Cocopuff2018 joined #evergreen
10:09 JBoyer Hi! If you haven't seen my Dev Meeting Poll on evergreen-dev or evergreen-general, it's here: https://forms.gle/ypTx2zqLbW7eVoj99
10:10 JBoyer Not a lot of responses yet, so if you haven't filled it out yet please do!
10:11 berick ohhh.. i saw "new developer" in the subject and thought that's what it was about.
10:16 csharp JBoyer++ # thanks for the poke
11:09 Dyrcona Funny how sometimes the more you work on code, the worse it gets.
11:18 * Dyrcona thinks it may be time to bite the bullet and refactor Circulate.pm.
11:34 Bmagic general production brick question: Anyone else see an occasional brick spike really high CPU? Caused by a dramatic increase of apache2 processes? In this example, the machines are using nginx proxy (standard setup) - monitoring the number of nginx and apache processes, we see they are usually 9 and 18 respectively
11:35 Bmagic sometimes, though, I see the number of apache processes spike up over 200!
11:35 berick Bmagic: could be related to the same drone exhaustion issues we've been battling, web clients just blasting requests
11:36 Bmagic yeah?
11:38 Dyrcona Bmagic: I see it when someone is spidering us, like EBSCO did last week. Apache maxed out on a brick or two.
11:38 Dyrcona Drone exhaustion doesn't seem to be a cause in our case, at least not to the same extent.
11:39 Bmagic I just witnessed a single machine go out of control and die. Then the rest of the machines followed suit. Drone exhaustion is defined by the number of allowed children getting maxed?
11:40 Dyrcona Bmagic: Yes.
11:40 rfrasur joined #evergreen
11:40 berick oh yeah, ebsco could certainly be it too
11:40 Dyrcona We don't usually hit max number of Apache processes unless someone is being a bad actor, like spamming searches or unapi requests.
11:40 Bmagic so, when the number of allowed drones max, there should be some log messages like "no children" - In my case, I didn't see that
11:42 Bmagic It seems there should* be a way to mitigate that kind of thing
11:43 Dyrcona Bmagic: There are several. One is to configure connection limiting in a proxy. Oh hey! You have nginx in front of Apache.
11:43 Bmagic I'm listening
11:43 Dyrcona The rest is an exercise for the reader. :)
11:46 Bmagic max apache workers is set to 250, which is a hold over setting from when apache was up front
11:47 Dyrcona Ours is 150, IIRC.
11:47 Dyrcona You're likely to run out of RAM before you hit 250, unless you have heaps.
11:48 Bmagic nginx makes everything play nicer - but sometimes (once or twice a week) apache processes go from less than 20 to over 200. you're suggesting a configuration on nginx to limit the number of connections? will that deny legit requests? Is it the number of connections from the same client?
11:48 Bmagic 32GB memory on each brick with swap - though, swap doesn't get touched oddly
11:48 csharp berick: still seeing higher-than-desired open-ils.actor drone counts, but definitely improved
11:48 mmorgan JBoyer: FYI RE: the form, I also heard the "new developer" confusion from a colleague
11:48 sandbergja joined #evergreen
11:49 Dyrcona It's usually configured by IP address, so it could throttle legit requests if you have a lot of sites with NAT.
11:49 * csharp starts a new Confusted Developers' group
11:49 csharp er... Confused, even
11:49 csharp appropriate typo, I guess
11:49 berick csharp: gotcha, good to know
11:49 Dyrcona Confuzzled.
11:49 berick heh
11:50 csharp berick: do you have a reliable way of identifying which client-side actions are at fault?
11:50 berick csharp: if you want to experiment, you could modify MAX_PARALLEL_REQUESTS in opensrf.js, make it lower
11:50 csharp I was counting messages per threadtrace as a start
11:51 csharp berick: does that require an opensrf restart to take effect?
11:51 Dyrcona Speaking of confused developers, my comments about Circulate.pm earlier were not a joke. I'm reviewing code that attempts to add a feature to circulation and as the work progresses, the code gets less functional, not because the developer is bad, but because the Circulate.pm code is poorly organized.
11:52 berick csharp: hm, not really.  i would probably look for bursts from a single IP address in the activity log.  i imagine in most cases the threadtraces will be different
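A minimal sketch of the "bursts from a single IP address" check berick describes, assuming the activity log has already been parsed into (timestamp, ip) pairs sorted by time. The function name `find_bursts`, the 10-second window, and the threshold of 50 are hypothetical choices for illustration, not Evergreen or OpenSRF settings, and parsing the log itself is left out because its format is not given here.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def find_bursts(events, window_seconds=10, threshold=50):
    """Return {ip: peak_count} for IPs that reach `threshold` requests
    inside any sliding window of `window_seconds`.

    `events` is an iterable of (timestamp, ip) pairs sorted by timestamp.
    Both the window and the threshold are illustrative assumptions.
    """
    window = timedelta(seconds=window_seconds)
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    peaks = {}
    for ts, ip in events:
        q = recent[ip]
        q.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while ts - q[0] > window:
            q.popleft()
        if len(q) >= threshold:
            peaks[ip] = max(peaks.get(ip, 0), len(q))
    return peaks
```

Feeding it parsed log entries and sorting the result by peak count would surface the clients worth a closer look; as Dyrcona notes later, sites behind NAT can legitimately share one address.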
11:53 berick csharp: no
11:53 berick no restart needed
11:53 Dyrcona csharp: It should require an OpenSRF restart, but will likely require you to clear the cache in your browser, and the cache may be why it's not having a huge impact.
11:53 berick exactly..
11:53 Dyrcona grr... s/should/shouldn't/
11:53 csharp berick: Dyrcona: good to know - I'll wait it out for now
11:53 berick i wouldn't be surprised if a lot of clients aren't using the new code yet
11:53 Dyrcona I'll check back with you in a year....
11:53 csharp I thought that might be the case
11:54 berick Dyrcona: agree circulate.pm has not aged well
11:54 csharp Dyrcona: because cache expiry is set super high?
11:54 Dyrcona javascript is access + 1 year, and was before the infamous commit went in.
11:55 csharp argh
11:55 Dyrcona Of course, browsers should send an If-Modified-Since: and the server should serve the JS again, but you never know.
11:55 Dyrcona We turned the main ExpiresActive off in our configuration recently.
11:56 csharp Dyrcona: +1 re: Circulate.pm
11:56 berick csharp: curl -I 'https://gapines.org/js/dojo/opensrf/opensrf.js'
11:56 berick Expires: Tue, 26 Jan 2021 10:56:01 GMT
11:57 berick so not a year at least
11:57 csharp whew
11:57 csharp well, I'll see how it looks tomorrow
11:57 * csharp hops in TARDIS to take a look
12:02 jihpringle joined #evergreen
12:04 Dyrcona csharp: Looks like you're set to 18 hours.
12:06 Dyrcona Nifty! We're apparently doing HTTP/2.
12:14 jeffdavis The super-long default cache expire times seem based on the use of cache-busting in the OPAC, not sure they make sense for JS in the Ang/AngJS client era?
12:15 jeffdavis We're using 18 hours here too.
12:23 Bmagic apachetop -f /var/log/apache2/other_vhosts_access.log
12:30 JBoyer mmorgan++ berick++ csharp++
12:30 JBoyer I'll re-send with a proper subject. :/
12:36 Bmagic Dyrcona: do you impose limits? Anyone? Reading this article gives me some ideas: https://www.nginx.com/blog/rate-limiting-nginx/
12:40 Dyrcona Bmagic: No, we don't currently use rate limits. That article does look interesting.
12:41 Dyrcona Ideally, you would have 1 proxy in front of all your bricks that would handle the rate limits, but rate limiting per brick would still help.
12:42 Bmagic It seems that we may be hitting the same issue as csharp - bug 1912834 ?
12:42 pinesol Launchpad bug 1912834 in OpenSRF "Browser client should limit the number of parallel requests" [High,New] https://launchpad.net/bugs/1912834
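The idea in the bug title, capping how many requests a client has in flight at once, can be illustrated with a short sketch. This is not the OpenSRF JavaScript patch; it is a Python analogue using `asyncio.Semaphore`, and `run_limited` and its `max_parallel` parameter are hypothetical names chosen for illustration.

```python
import asyncio


async def run_limited(jobs, max_parallel=5):
    """Run the coroutine factories in `jobs`, never more than
    `max_parallel` at the same time, and return results in order.

    `max_parallel=5` is an illustrative default, not a documented setting.
    """
    sem = asyncio.Semaphore(max_parallel)

    async def guarded(job):
        # Each job waits here until one of the `max_parallel` slots frees up.
        async with sem:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))
```

The trade-off is the one that comes up later in the log: a cap protects server-side drones, but a screen that issues many requests now queues them, so slow batches can look like a hung page.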
12:42 Dyrcona Yeah, we all are from time to time.
12:43 Dyrcona Rate limiting should still be implemented to protect you from the people who drop books or cats on keyboards and other malicious actors, deliberate or not.
12:44 Bmagic using that apachetop command, I can observe osrf-http-translator request burst up over 10/second with totals over 300 in a short period. That may or may not be an issue though.
12:44 JBoyer Something that I've seen make a difference in the past is to be absolutely certain that there are reasonable timeouts in place on your proxy, especially for websocketd since it doesn't have a timeout of its own. That won't make this kind of thing stop entirely, but may help free up some drones.
12:45 Dyrcona I've not seen a spike in OpenSRF requests push our Apache counts up to max. When that happens, it's something more persistent, like 300 requests for the same search terms or 30,000 unapi requests, etc.
12:46 JBoyer And yeah, nothing will help those kinds of things but rate limiting.
12:47 Bmagic example https://ibb.co/FxsLxqX
12:49 Bmagic that snapshot doesn't show some of the higher numbers, but anyways, that's what I'm looking at to try and figure out where to "fix" this
12:50 Dyrcona Rate limiting on unapi would be a start.
12:52 Dyrcona @decide make it better or make it worse
12:52 pinesol Dyrcona: go with make it worse
12:53 Dyrcona Yeah, pinesol, that is easier than making it better, at least for now. :)
13:41 mrisher joined #evergreen
13:42 mrisher joined #evergreen
13:46 jihpringle joined #evergreen
14:18 Dyrcona Hopefully, I didn't make it that much worse. :)
14:22 jihpringle joined #evergreen
14:24 Cocopuff2018 joined #evergreen
14:36 Dyrcona Anyone else been getting this lately:
14:36 Dyrcona warning: inexact rename detection was skipped due to too many files.
14:36 Dyrcona warning: you may want to set your merge.renamelimit variable to at least 1760 and retry the command.
14:36 Dyrcona I see it when I'm backporting from master to 3.5 or 3.2.
14:36 berick Dyrcona: yes, and beware if you bump the merge.renamelimit it will slow to a crawl :(
14:36 berick it did for me, anyway
14:37 Dyrcona Well mine is set to 1493 because that was suggested to me earlier by a git merge/cherry-pick.
14:42 Dyrcona It's the documentation that's causing it, isn't it?
14:56 Bmagic Dyrcona: lol, I was about to report something very similar. ejabberd log: 2021-01-25 14:49:59.550 [error] <0.441.0>@ejabberd_listener:accept:311 (#Port<0.8999>) Failed TCP accept: too many open files
14:57 Bmagic kernel file limits are all the way to the max: cat /proc/sys/fs/file-max   result 65535
14:58 Bmagic aha! but ejabberd user "ulimit -n" is still 1024
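The gap Bmagic found, a high kernel-wide ceiling alongside a low per-process limit, can be checked with a short sketch using Python's standard `resource` module. It reports the limits of the process that runs it, so it would need to run as the user in question (ejabberd here) to be meaningful; the helper names `fd_limits` and `system_file_max` are hypothetical.

```python
import resource


def fd_limits():
    """Return (soft, hard) open-file limits for the current process.
    The soft value is what `ulimit -n` reports."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def system_file_max(path="/proc/sys/fs/file-max"):
    """Return the kernel-wide open-file ceiling on Linux,
    or None if the file is missing or unreadable."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    soft, hard = fd_limits()
    print(f"per-process soft limit: {soft}")
    print(f"per-process hard limit: {hard}")
    print(f"system-wide file-max:   {system_file_max()}")
```

A process hits "too many open files" at its own soft limit, regardless of how high the system-wide value is set.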
15:01 Dyrcona Bmagic: That is very different.
15:01 Bmagic :)
15:06 Dyrcona Bmagic try asking for more and see what happens.
15:06 Bmagic no doubt, just need to bake it into the brick build
15:06 Bmagic I ran into this before, solved it, and now it's back
15:08 Dyrcona TBH, I've not had problems with limits on Linux. OpenBSD on the other hand.... They like to set low defaults.
15:09 Dyrcona Well, not had problems on Linux in quite a while.
15:10 csharp berick: so we're getting complaints from catalogers that buckets aren't working - could that be the new patch at work?
15:11 csharp I'm not seeing anything system-side that shows trouble
15:11 berick csharp: hard to say, it could be
15:11 Dyrcona Bmagic: Switch to FreeBSD: open files                          (-n) 116487
15:11 Dyrcona
15:11 Bmagic oh sure
15:11 mmorgan JBoyer: Having trouble filling out the Developer Meeting form. For example, if someone has a recurring meeting the third Wednesday of each month, how would they say they are available the first or second Wednesday of the month?
15:12 berick csharp: if you have specifics i could try to verify
15:12 csharp berick: trying to get those
15:12 Dyrcona Probably because they have too many things in the bucket, and it's timing out processing 5 at a time.
15:13 csharp that makes sense
15:13 Dyrcona It does, and it could be wrong. :)
15:14 csharp I guess I'm going to need their console messages to see the console.warn messages to see if they're hitting the limit
15:14 JBoyer mmorgan, it is an imprecise instrument. :/ You could fill out everything that would work and say what's no good in the text box, or avoid combinations that aren't entirely open.
15:15 JBoyer I was trying to avoid having 20-30 different options to choose from but that may have been easier to use.
15:15 csharp "When I first scan a barcode in Item Status then go to that record I can view holdings, holds and MARC but if I then go back to one of those tabs nothing happens. It looks like it is trying to load the page but nothing ever happens."
15:15 mmorgan Ok, gotcha.
15:15 csharp "I am having to refresh the screen every time I add holding and also every time I print a label."
15:16 csharp "I can't get items to transfer when I do get Holdings View to open.  This is transferring between libraries in the same holdings view. Refreshing the screen doesn't help."
15:16 csharp quotes from ACTUAL pines catalogers
15:16 csharp several "me too" messages along with
15:17 csharp I'll rollback the change for now, I think
15:17 Dyrcona Screenshots or it didn't happen.
15:18 berick csharp: EG 3.6?
15:18 JBoyer "We replaced these ACTUAL Pines catalogers' MARC records with Folgers Crystals. Let's see if they notice."
15:18 Dyrcona JBoyer++
15:18 csharp berick: yep
15:18 * JBoyer realizes he's only barely old enough for that joke to land
15:18 mmorgan JBoyer++ :)
15:18 csharp JBoyer++
15:19 csharp JBoyer: I would have never guessed that given the accuracy
15:19 * csharp fills it to the rim, with Brim
15:20 JBoyer I apparently have very little control over what I will lose in a minute vs. what I will remember decades later than it could ever be useful.
15:20 csharp https://www.youtube.com/watch?v=aGY7maLpA1I
15:20 mmorgan JBoyer: You're not alone in that! Wish we could repurpose that ROM!
15:20 berick csharp: i was just able to reproduce
15:21 csharp oh?
15:21 berick csharp: it does seem likely the patch is at fault (sigh)
15:23 Dyrcona csharp++ # For boldly going where no Evergreen sysadmin has gone before.
15:24 csharp Dyrcona: all while staying in my living room office for nearly a year!
15:25 csharp berick: I rolled the patch out, but I can apply any fixes to a test server
15:25 csharp working without the patch means I have to babysit all day long
15:26 * berick nods
15:30 mantis1 left #evergreen
15:32 mmorgan So, poking at Curbside, I have hours open set as 2:00pm - 5:30pm, 4 hour appointment slots. As a patron, I'm offered appointments at 2:00am, 6:00am, 10:00am, 2:00pm. Anyone seen this problem?
15:34 jihpringle joined #evergreen
15:57 mmorgan Re: my curbside question, nevermind. Hours of operation showed PM in client, but stored AM in db :-/
15:57 sandbergja joined #evergreen
16:00 Dyrcona mmorgan: I'd double check the servers' timezone settings.
16:02 mmorgan Dyrcona: Thanks, will do that.
16:30 berick csharp: fyi working/user/berick/lp1912834-max-parallel-net-v2
16:33 csharp berick: cool - I'll take a look in a bit
16:37 troy__ joined #evergreen
16:39 rfrasur joined #evergreen
17:19 mmorgan left #evergreen
17:56 Cocopuff2018 joined #evergreen
18:01 pinesol News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
18:47 sandbergja joined #evergreen
21:26 sandbergja joined #evergreen
21:52 JBoyer joined #evergreen
22:15 JBoyer joined #evergreen
