00:09  sandbergja joined #evergreen
01:00  sandbergja joined #evergreen
07:23  rjackson_isl_hom joined #evergreen
07:47  mantis1 joined #evergreen
07:56  rfrasur joined #evergreen
08:40  mmorgan joined #evergreen
08:44  mmorgan left #evergreen
08:51  Dyrcona joined #evergreen
08:55  yar joined #evergreen
08:55  rfrasur joined #evergreen
08:56  mmorgan joined #evergreen
09:00  mmorgan left #evergreen
09:05  jvwoolf joined #evergreen
09:07  alynn26 joined #evergreen
09:24  dbwells joined #evergreen
09:54  mmorgan joined #evergreen
09:54  <csharp> berick: I was able to test your patch over the weekend and it works as expected - because drone exhaustion was dire this morning, I've applied it to PINES production and am watching carefully
09:54  <csharp> so far so good though
09:55  <csharp> if we don't see any problems by noon or so, I'll sign off/commit
09:57  <berick> csharp++
10:00  Cocopuff2018 joined #evergreen
10:09  <JBoyer> Hi! If you haven't seen my Dev Meeting Poll on evergreen-dev or evergreen-general, it's here: https://forms.gle/ypTx2zqLbW7eVoj99
10:10  <JBoyer> Not a lot of responses yet, so if you haven't filled it out yet please do!
10:11  <berick> ohhh.. i saw "new developer" in the subject and thought that's what it was about.
10:16  <csharp> JBoyer++ # thanks for the poke
11:09  <Dyrcona> Funny how sometimes the more you work on code, the worse it gets.
11:18  * Dyrcona thinks it may be time to bite the bullet and refactor Circulate.pm.
11:34  <Bmagic> general production brick question: Anyone else see an occasional brick spike to really high CPU, caused by a dramatic increase of apache2 processes? In this example, the machines are using an nginx proxy (standard setup) - monitoring the number of nginx and apache processes, we see they are usually 9 and 18 respectively
11:35  <Bmagic> sometimes, though, I see the number of apache processes spike up over 200!
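For anyone wanting to reproduce the check Bmagic describes above, a quick way to watch the worker counts (assuming Debian/Ubuntu-style process names, nginx and apache2) is:

    # print nginx and apache2 process counts every 5 seconds
    watch -n 5 'echo -n "nginx: "; pgrep -c nginx; echo -n "apache2: "; pgrep -c apache2'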
11:35  <berick> Bmagic: could be related to the same drone exhaustion issues we've been battling, web clients just blasting requests
11:36  <Bmagic> yeah?
11:38  <Dyrcona> Bmagic: I see it when someone is spidering us, like EBSCO did last week. Apache maxed out on a brick or two.
11:38  <Dyrcona> Drone exhaustion doesn't seem to be a cause in our case, at least not to the same extent.
11:39  <Bmagic> I just witnessed a single machine go out of control and die. Then the rest of the machines followed suit. Drone exhaustion is defined by the number of allowed children getting maxed?
11:40  <Dyrcona> Bmagic: Yes.
11:40  rfrasur joined #evergreen
11:40  <berick> oh yeah, ebsco could certainly be it too
11:40  <Dyrcona> We don't usually hit max number of Apache processes unless someone is being a bad actor, like spamming searches or unapi requests.
11:40  <Bmagic> so, when the number of allowed drones maxes out, there should be some log messages like "no children" - in my case, I didn't see that
11:42  <Bmagic> It seems there should* be a way to mitigate that kind of thing
11:43  <Dyrcona> Bmagic: There are several. One is to configure connection limiting in a proxy. Oh hey! You have nginx in front of Apache.
11:43  <Bmagic> I'm listening
11:43  <Dyrcona> The rest is an exercise for the reader. :)
11:46  <Bmagic> max apache workers is set to 250, which is a holdover setting from when apache was up front
11:47  <Dyrcona> Ours is 150, IIRC.
11:47  <Dyrcona> You're likely to run out of RAM before you hit 250, unless you have heaps.
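A rough way to sanity-check the numbers in this exchange, assuming an Ubuntu-style layout under /etc/apache2 and the prefork MPM (the cap is MaxRequestWorkers, called MaxClients in older configs):

    # find the current worker cap
    grep -Rni 'maxrequestworkers\|maxclients' /etc/apache2/
    # estimate per-child memory to see where 250 workers would put you
    ps -C apache2 -o rss= | awk '{sum+=$1; n++} END {if (n) printf "%d procs, avg %.0f MB each\n", n, sum/n/1024}'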
11:48  <Bmagic> nginx makes everything play nicer - but sometimes (once or twice a week) apache processes go from less than 20 to over 200. you're suggesting a configuration on nginx to limit the number of connections? will that deny legit requests? Is it the number of connections from the same client?
11:48  <Bmagic> 32GB memory on each brick with swap - though, swap doesn't get touched oddly
11:48  <csharp> berick: still seeing higher-than-desired open-ils.actor drone counts, but definitely improved
11:48  <mmorgan> JBoyer: FYI RE: the form, I also heard the "new developer" confusion from a colleague
11:48  sandbergja joined #evergreen
11:49  <Dyrcona> It's usually configured by IP address, so it could throttle legit requests if you have a lot of sites with NAT.
11:49  * csharp starts a new Confusted Developers' group
11:49  <csharp> er... Confused, even
11:49  <csharp> appropriate typo, I guess
11:49  <berick> csharp: gotcha, good to know
11:49  <Dyrcona> Confuzzled.
11:49  <berick> heh
11:50  <csharp> berick: do you have a reliable way of identifying which client-side actions are at fault?
11:50  <berick> csharp: if you want to experiment, you could modify MAX_PARALLEL_REQUESTS in opensrf.js, make it lower
11:50  <csharp> I was counting messages per threadtrace as a start
11:51  <csharp> berick: does that require an opensrf restart to take effect?
11:51  <Dyrcona> Speaking of confused developers, my comments about Circulate.pm earlier were not a joke. I'm reviewing code that attempts to add a feature to circulation, and as the work progresses, the code gets less functional, not because the developer is bad, but because the Circulate.pm code is poorly organized.
11:52  <berick> csharp: hm, not really. i would probably look for bursts from a single IP address in the activity log. i imagine in most cases the threadtraces will be different
11:53  <berick> csharp: no
11:53  <berick> no restart needed
11:53  <Dyrcona> csharp: It should require an OpenSRF restart, but will likely require you to clear the cache in your browser, and cache may be why it's not having a huge impact.
11:53  <berick> exactly..
11:53  <Dyrcona> grr... s/should/shouldn't/
11:53  <csharp> berick: Dyrcona: good to know - I'll wait it out for now
11:53  <berick> i wouldn't be surprised if a lot of clients aren't using the new code yet
11:53  <Dyrcona> I'll check back with you in a year....
11:53  <csharp> I thought that might be the case
11:54  <berick> Dyrcona: agree circulate.pm has not aged well
11:54  <csharp> Dyrcona: because cache expiry is set super high?
11:54  <Dyrcona> javascript is access + 1 year, and was before the infamous commit went in.
11:55  <csharp> argh
11:55  <Dyrcona> Of course, browsers should send an If-Modified-Since: and the server should serve the JS again, but you never know.
11:55  <Dyrcona> We turned the main ExpiresActive off in our configuration recently.
11:56  <csharp> Dyrcona: +1 re: Circulate.pm
11:56  <berick> csharp: curl -I 'https://gapines.org/js/dojo/opensrf/opensrf.js'
11:56  <berick> Expires: Tue, 26 Jan 2021 10:56:01 GMT
11:57  <berick> so not a year at least
11:57  <csharp> whew
11:57  <csharp> well, I'll see how it looks tomorrow
11:57  * csharp hops in TARDIS to take a look
12:02  jihpringle joined #evergreen
12:04  <Dyrcona> csharp: Looks like you're set to 18 hours.
12:06  <Dyrcona> Nifty! We're apparently doing HTTP/2.
12:14  <jeffdavis> The super-long default cache expire times seem based on the use of cache-busting in the OPAC, not sure they make sense for JS in the Ang/AngJS client era?
12:15  <jeffdavis> We're using 18 hours here too.
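To repeat berick's check against your own bricks (example.org is a placeholder for your catalog hostname; the lifetimes themselves come from mod_expires directives such as ExpiresActive/ExpiresDefault/ExpiresByType in the Apache vhost config):

    # show only the caching-related response headers for the OpenSRF JS
    curl -sI 'https://example.org/js/dojo/opensrf/opensrf.js' | grep -iE '^(expires|cache-control|last-modified)'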
12:23  <Bmagic> apachetop -f /var/log/apache2/other_vhosts_access.log
12:30  <JBoyer> mmorgan++ berick++ csharp++
12:30  <JBoyer> I'll re-send with a proper subject. :/
12:36  <Bmagic> Dyrcona: do you impose limits? Anyone? Reading this article gives me some ideas: https://www.nginx.com/blog/rate-limiting-nginx/
12:40  <Dyrcona> Bmagic: No, we don't currently use rate limits. That article does look interesting.
12:41  <Dyrcona> Ideally, you would have 1 proxy in front of all your bricks that would handle the rate limits, but rate limiting per brick would still help.
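For reference, the approach in that nginx article boils down to a per-IP limit_req zone on the proxy. A minimal sketch with illustrative names and rates; keep whatever proxy_pass you already have:

    # in the http {} block
    limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;

    # in the server/location block that fronts Apache
    location / {
        limit_req zone=per_ip burst=20 nodelay;
        proxy_pass https://localhost:443;   # hypothetical upstream; match your existing config
    }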
12:42  <Bmagic> It seems that we may be hitting the same issue as csharp - bug 1912834?
12:42  <pinesol> Launchpad bug 1912834 in OpenSRF "Browser client should limit the number of parallel requests" [High,New] https://launchpad.net/bugs/1912834
12:42  <Dyrcona> Yeah, we all are from time to time.
12:43  <Dyrcona> Rate limiting should still be implemented to protect you from the people who drop books or cats on keyboards, and from other bad actors, deliberate or not.
12:44  <Bmagic> using that apachetop command, I can observe osrf-http-translator requests burst up over 10/second with totals over 300 in a short period. That may or may not be an issue though.
12:44  <JBoyer> Something that I've seen make a difference in the past is to be absolutely certain that there are reasonable timeouts in place on your proxy, especially for websocketd since it doesn't have a timeout of its own. That won't make this kind of thing stop entirely, but may help free up some drones.
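A hedged sketch of what JBoyer means on the nginx side; the location path and backend port are assumptions about a typical websocketd setup, but the proxy timeout directives are standard nginx:

    # location that proxies websocket traffic through to websocketd
    location /osrf-websocket-translator {
        proxy_pass https://localhost:7682;       # hypothetical backend; use your existing upstream
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 5m;   # idle websocket connections get dropped instead of pinning drones
        proxy_send_timeout 5m;
    }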
12:45  <Dyrcona> I've not seen a spike in OpenSRF requests push our Apache counts up to max. When that happens, it's something more persistent, like 300 requests for the same search terms or 30,000 unapi requests, etc.
12:46  <JBoyer> And yeah, nothing will help those kinds of things but rate limiting.
12:47  <Bmagic> example https://ibb.co/FxsLxqX
12:49  <Bmagic> that snapshot doesn't show some of the higher numbers, but anyways, that's what I'm looking at to try and figure out where to "fix" this
12:50  <Dyrcona> Rate limiting on unapi would be a start.
12:52  <Dyrcona> @decide make it better or make it worse
12:52  <pinesol> Dyrcona: go with make it worse
12:53  <Dyrcona> Yeah, pinesol, that is easier than making it better, at least for now. :)
13:41  mrisher joined #evergreen
13:42  mrisher joined #evergreen
13:46  jihpringle joined #evergreen
14:18  <Dyrcona> Hopefully, I didn't make it that much worse. :)
14:22  jihpringle joined #evergreen
14:24  Cocopuff2018 joined #evergreen
14:36  <Dyrcona> Anyone else been getting this lately:
14:36  <Dyrcona> warning: inexact rename detection was skipped due to too many files.
14:36  <Dyrcona> warning: you may want to set your merge.renamelimit variable to at least 1760 and retry the command.
14:36  <Dyrcona> I see it when I'm backporting from master to 3.5 or 3.2.
14:36  <berick> Dyrcona: yes, and beware if you bump the merge.renamelimit it will slow to a crawl :(
14:36  <berick> it did for me, anyway
14:37  <Dyrcona> Well mine is set to 1493 because that was suggested to me earlier by a git merge/cherry-pick.
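The fix git is hinting at is just a config bump, per-repo or global (berick's caveat above about rename detection slowing to a crawl still applies):

    git config merge.renamelimit 1760
    # diff.renamelimit covers the same warning from plain diff / log --follow
    git config diff.renamelimit 1760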
14:42  <Dyrcona> It's the documentation that's causing it, isn't it?
14:56  <Bmagic> Dyrcona: lol, I was about to report something very similar. ejabberd log: 2021-01-25 14:49:59.550 [error] <0.441.0>@ejabberd_listener:accept:311 (#Port<0.8999>) Failed TCP accept: too many open files
14:57  <Bmagic> kernel file limits are all the way to the max: cat /proc/sys/fs/file-max result 65535
14:58  <Bmagic> aha! but ejabberd user "ulimit -n" is still 1024
15:01  <Dyrcona> Bmagic: That is very different.
15:01  <Bmagic> :)
15:06  <Dyrcona> Bmagic try asking for more and see what happens.
15:06  <Bmagic> no doubt, just need to bake it into the brick build
15:06  <Bmagic> I ran into this before, solved it, and now it's back
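For baking the fix into a brick build, the per-user limit is what matters here, not fs.file-max. A sketch, assuming ejabberd runs as the ejabberd user; on systemd-managed installs the unit's LimitNOFILE= setting is what counts rather than limits.conf:

    # what the running process actually has
    cat /proc/$(pgrep -f ejabberd | head -1)/limits | grep -i 'open files'
    # raise it via PAM limits (takes effect after ejabberd is restarted in a fresh session)
    echo 'ejabberd soft nofile 65536' | sudo tee -a /etc/security/limits.conf
    echo 'ejabberd hard nofile 65536' | sudo tee -a /etc/security/limits.conf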
15:08  <Dyrcona> TBH, I've not had problems with limits on Linux. OpenBSD on the other hand.... They like to set low defaults.
15:09  <Dyrcona> Well, not had problems on Linux in quite a while.
15:10  <csharp> berick: so we're getting complaints from catalogers that buckets aren't working - could that be the new patch at work?
15:11  <csharp> I'm not seeing anything system-side that shows trouble
15:11  <berick> csharp: hard to say, it could be
15:11  <Dyrcona> Bmagic: Switch to FreeBSD: open files (-n) 116487
15:11  <Bmagic> oh sure
15:11  <mmorgan> JBoyer: Having trouble filling out the Developer Meeting form. For example, if someone has a recurring meeting the third Wednesday of each month, how would they say they are available the first or second Wednesday of the month?
15:12  <berick> csharp: if you have specifics i could try to verify
15:12  <csharp> berick: trying to get those
15:12  <Dyrcona> Probably because they have too many things in the bucket, and it's timing out processing 5 at a time.
15:13  <csharp> that makes sense
15:13  <Dyrcona> It does, and it could be wrong. :)
15:14  <csharp> I guess I'm going to need their console output to see the console.warn messages and find out if they're hitting the limit
15:14  <JBoyer> mmorgan, it is an imprecise instrument. :/ You could fill out everything that would work and say what's no good in the text box, or avoid combinations that aren't entirely open.
15:15  <JBoyer> I was trying to avoid having 20-30 different options to choose from but that may have been easier to use.
15:15  <csharp> "When I first scan a barcode in Item Status then go to that record I can view holdings, holds and MARC but if I then go back to one of those tabs nothing happens. It looks like it is trying to load the page but nothing ever happens."
15:15  <mmorgan> Ok, gotcha.
15:16  <csharp> "I am having to refresh the screen every time I add holding and also every time I print a label."
15:16  <csharp> "I can't get items to transfer when I do get Holdings View to open. This is transferring between libraries in the same holdings view. Refreshing the screen doesn't help."
15:16  <csharp> quotes from ACTUAL pines catalogers
15:16  <csharp> several "me too" messages along with
15:17  <csharp> I'll roll back the change for now, I think
15:17  <Dyrcona> Screenshots or it didn't happen.
15:18  <berick> csharp: EG 3.6?
15:18  <JBoyer> "We replaced these ACTUAL Pines catalogers' MARC records with Folgers Crystals. Let's see if they notice."
15:18  <Dyrcona> JBoyer++
15:18  <csharp> berick: yep
15:18  * JBoyer realizes he's only barely old enough for that joke to land
15:18  <mmorgan> JBoyer++ :)
15:18  <csharp> JBoyer++
15:19  <csharp> JBoyer: I would have never guessed that given the accuracy
15:19  * csharp fills it to the rim, with Brim
15:20  <JBoyer> I apparently have very little control over what I will lose in a minute vs. what I will remember decades after it could ever be useful.
15:20  <csharp> https://www.youtube.com/watch?v=aGY7maLpA1I
15:20  <mmorgan> JBoyer: You're not alone in that! Wish we could repurpose that ROM!
15:20  <berick> csharp: i was just able to reproduce
15:21  <csharp> oh?
15:21  <berick> csharp: it does seem likely the patch is at fault (sigh)
15:23  <Dyrcona> csharp++ # For boldly going where no Evergreen sysadmin has gone before.
15:24  <csharp> Dyrcona: all while staying in my living room office for nearly a year!
15:25  <csharp> berick: I rolled the patch out, but I can apply any fixes to a test server
15:25  <csharp> working without the patch means I have to babysit all day long
15:26  * berick nods
15:30  mantis1 left #evergreen
15:32  <mmorgan> So, poking at Curbside, I have hours open set as 2:00pm - 5:30pm, 4 hour appointment slots. As a patron, I'm offered appointments at 2:00am, 6:00am, 10:00am, 2:00pm. Anyone seen this problem?
15:34  jihpringle joined #evergreen
15:57  <mmorgan> Re: my curbside question, nevermind. Hours of operation showed PM in client, but stored AM in db :-/
15:57  sandbergja joined #evergreen
16:00  <Dyrcona> mmorgan: I'd double check the servers' timezone settings.
16:02  <mmorgan> Dyrcona: Thanks, will do that.
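For the timezone check Dyrcona suggests, both the OS setting and what PostgreSQL reports are worth comparing; the database user and name below are assumptions:

    # OS-level zone on the app and db servers (systemd hosts)
    timedatectl | grep 'Time zone'
    # what PostgreSQL will use for timestamptz conversions
    psql -U evergreen -d evergreen -c 'SHOW timezone;'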
16:30  <berick> csharp: fyi working/user/berick/lp1912834-max-parallel-net-v2
16:33  <csharp> berick: cool - I'll take a look in a bit
16:37  troy__ joined #evergreen
16:39  rfrasur joined #evergreen
17:19  mmorgan left #evergreen
17:56  Cocopuff2018 joined #evergreen
18:01  <pinesol> News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
18:47  sandbergja joined #evergreen
21:26  sandbergja joined #evergreen
21:52  JBoyer joined #evergreen
22:15  JBoyer joined #evergreen