IRC log for #evergreen, 2020-12-14

All times shown according to the server's local time.

Time	Nick	Message
03:22		jamesrf joined #evergreen
03:22		bshum joined #evergreen
03:22		mrisher joined #evergreen
03:22		abowling joined #evergreen
03:22		akilsdonk joined #evergreen
03:22		rhamby joined #evergreen
03:22		miker joined #evergreen
03:22		phasefx joined #evergreen
03:22		abneiman joined #evergreen
03:22		JBoyer joined #evergreen
03:22		laurie joined #evergreen
03:22		jeffdavis joined #evergreen
03:22		jonadab joined #evergreen
03:22		yar joined #evergreen
03:22		drigney joined #evergreen
03:22		jweston joined #evergreen
03:22		pinesol joined #evergreen
03:22		csharp joined #evergreen
03:22		pastebot joined #evergreen
03:22		eby joined #evergreen
03:22		troy__ joined #evergreen
03:22		devted joined #evergreen
03:22		ejk_ joined #evergreen
03:22		Bmagic joined #evergreen
03:22		dickreckard joined #evergreen
03:22		awitter joined #evergreen
03:22		book`_ joined #evergreen
03:22		jeff joined #evergreen
03:22		egbuilder joined #evergreen
03:22		genpaku joined #evergreen
03:22		kip joined #evergreen
03:22		gmcharlt joined #evergreen
03:22		yeats joined #evergreen
03:22		dbs joined #evergreen
03:22		RBecker joined #evergreen
03:22		berick joined #evergreen
03:22		dluch joined #evergreen
06:01	pinesol	News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
07:26		Dyrcona joined #evergreen
07:35		rjackson_isl_hom joined #evergreen
08:05		mantis1 joined #evergreen
08:34		collum joined #evergreen
08:38		mmorgan joined #evergreen
08:57		rfrasur joined #evergreen
09:19		nfBurton joined #evergreen
09:46		tlittle joined #evergreen
10:18	csharp	can anyone advise me on the best way to troubleshoot NOT CONNECTED TO THE NETWORK errors?
10:18	csharp	2020-12-14 09:24:44 brick04-head open-ils.actor: [ERR :103194:EX.pm:66:16079558581032824] Exception: OpenSRF::EX::Session 2020-12-14T09:24:44 OpenSRF::Utils::Logger /usr/local/share/perl/5.22.1/OpenSRF/Utils/Logger.pm:243 Session Error: o
10:18	csharp	pensrfpublic.brick04-head.gapines.org/_brick04-head_1607955858.346341_103282 IS NOT CONNECTED TO THE NETWORK!!!
10:19	csharp	what can I glean from that that might point me to potential causes?
10:19	csharp	_brick04-head_1607955858.346341_103282 - does this contain useful information?
10:19	csharp	it doesn't seem consistent from log error to log error
10:20	csharp	oh wait - maybe it is
10:21	* csharp	rolls up sleeves to dive into C code
10:23	berick	csharp: the main info of value there is the pid, brick, log trace, and time. there's nothing particularly meaningful in the error message, apart from 'not connected'
10:24	berick	there are likely errors preceding this one w/ more info
10:25	Dyrcona	csharp: Most of the time when I look into it, the NOT CONNECTED message comes $TIMEOUT_VALUE + 1 second after the request.
10:25	csharp	berick: Dyrcona: thanks for the pointers
10:26	Dyrcona	I just accept it as "normal."
10:27	csharp	our libraries will not accept that :-(
10:27	csharp	well, I mean that the resulting "system instability" on the end users' end is not acceptable to them
10:28		dbwells joined #evergreen
10:28	csharp	and this is a "new" problem that popped up in the last couple of months, happening a couple of times per week
10:28	csharp	starting to feel like the bad old days of PINES
10:28	Dyrcona	We get it multiple times a day, and always have AFAIK.
10:29	Dyrcona	Throw more hardware at it. ;)
10:29	csharp	I did up the actor max_children in hopes of forestalling it
10:29	Dyrcona	I find it is not really related to running out of drones.
10:29	csharp	but it's like adding lanes to the freeway - they just fill up the new lanes
10:30	Dyrcona	Sometimes, it may be.
10:30	csharp	in these cases, when I discover the problem, actor drones are near 100%
10:30	Dyrcona	Usually looks like a client timing out, waiting on CStore (i.e. the database), going away, and the router not having anywhere to send the response that finally comes back.
10:31	csharp	2020-12-14 09:24:12 brick04-head open-ils.actor: [WARN:108187:Server.pm:200:16076285143703913435] server: no children available, waiting... consider increasing max_children for this application higher than 192 in the OpenSRF configuration if this message occurs frequently
10:31	csharp	that preceded the outage by about 30 seconds
10:31	Dyrcona	I don't always find that those messages coincide.
10:31	Dyrcona	They often do, though.
10:32	Dyrcona	And, I just now see this email from Nagios, so maybe I'll take a look: CRIT: 4 NOT CONNECTEDs returned this hour: (Top server this hour: 3 bd2-bh5)
10:37	Dyrcona	It's something else in our case. bd2-bh5 is only running 8% of open-ils.actor drones, 12/150.
10:38	csharp	ok
10:38	csharp	so is this only discoverable by filtering the logs for NOT CONNECTED errors?
10:39	csharp	osrf_control --diagnostic doesn't seem to know about this problem
10:39	Dyrcona	Well, no. Diagnostic only reports how many are currently running. And, the not connected often look like clients to me, but what do I know? :)
10:40	Dyrcona	Also, I think I just lost the firewall or something at CW MARS....
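The log filtering discussed above could be sketched roughly as follows. This is a minimal sketch, not an Evergreen or OpenSRF tool: the line layout is assumed from the two log lines csharp pasted, and the function name `count_not_connected` is hypothetical. It tallies NOT CONNECTED errors per date, hour, and brick, similar in spirit to the hourly Nagios count Dyrcona mentions.

```python
import re
from collections import Counter

# Assumed line layout, based on the log lines quoted in the channel:
# "YYYY-MM-DD HH:MM:SS <host> <service>: [LEVEL:pid:file:line:trace] message"
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<hour>\d{2}):\d{2}:\d{2} "
    r"(?P<host>\S+) (?P<service>\S+): "
)


def count_not_connected(lines):
    """Return a Counter keyed by (date, hour, host) for NOT CONNECTED lines."""
    counts = Counter()
    for line in lines:
        # Skip everything except the session error of interest.
        if "NOT CONNECTED TO THE NETWORK" not in line:
            continue
        m = LINE_RE.match(line)
        if m:
            counts[(m["date"], m["hour"], m["host"])] += 1
    return counts
```

Passing an open log file (any iterable of lines) and calling `.most_common()` on the result would list the busiest brick and hour first. As berick notes, the lines preceding each hit are likely to carry more information than the error itself.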
11:22	* mmorgan	reads the backscroll with interest
11:24	mmorgan	We see the NOT CONNECTEDs at times as well. Does this manifest to the user as a frozen client?
11:25	Dyrcona	mmorgan: It could, depending on what's going on.
11:26	* mmorgan	nods
11:34		Christineb joined #evergreen
11:46	csharp	mmorgan: in our multi-brick setup, it appears to the end user to be "unstable" because it sometimes works (when they're hitting working bricks)
11:47	mmorgan	csharp: Ok, thanks, we have had that experience.
11:48	berick	csharp: testing the fixes for bug 1896285 might help
11:48	pinesol	Launchpad bug 1896285 in Evergreen "Use batch methods for multi-row grid actions" [Medium,Confirmed] https://launchpad.net/bugs/1896285
12:09	csharp	berick: that's on my to-do for this exact reason - thanks
12:09		jihpringle joined #evergreen
12:14	Dyrcona	Using nginx as a proxy on the brick head, I find it seems to be hit or miss with logging the remote IP in the apache logs. It appears to only be logging it for certain errors, but I need to do a more thorough investigation.
12:16	Dyrcona	Y'know, maybe it's my log format. Never mind.
12:21	Dyrcona	Eh, no. RemoteIP appears to not be working. I'm getting 127.0.0.1 for pretty much all of the log entries.
12:24	Dyrcona	Hmm. Could be my nginx configuration is wrong.....
12:34	Dyrcona	Yeah. I think I have the wrong variable being used.
12:34	* Dyrcona	will fix it for tomorrow morning.
13:11		alynn26 joined #evergreen
13:21	tlittle	In the Angular fm-editor, it defines what should be shown as placeholder text. Can you modify that per modal, or do you just currently auto-inherit that and that's the end of it? I'm looking at bug 1906862 again and want to confirm that I haven't just missed it somewhere that you can do that.
13:21	pinesol	Launchpad bug 1906862 in Evergreen "Angular Providers: Should not show text in entry fields" [Undecided,New] https://launchpad.net/bugs/1906862
13:55		sandbergja joined #evergreen
14:00	berick	tlittle: the placeholders are hard coded. i could imagine a feature that disables placeholders, though, and/or lets you specify them
14:04	tlittle	berick thanks! When I was poking around earlier, I was pondering whether that would be something that you could do through the fm-editor TS file, kind of like how you can specify fieldorder. Even if it was just "show placeholders"=yes/no.
14:05	berick	tlittle: yep, .ts and the .html file. could add a @Input() hidePlaceholders = false then avoid adding them in the html if the value is true
14:07	tlittle	Oh neat! Maybe I'll take a crack at that. :) berick++
14:20		jihpringle joined #evergreen
14:35	jeff	How it started: UPDATE action.circulation AS circ SET due_date = '2020-04-16 23:59:59-0400' WHERE [...]
14:35	jeff	How it's going: UPDATE action.circulation AS circ SET due_date = '2020-12-28 23:59:59-0500' WHERE […]
14:35	jeff	(oh, we were so optimistic back in March!)
14:45	mmorgan	jeff: Or maybe just naive
14:46	Dyrcona	So, I'm not getting remote ips in my Apache logs with nginx as the proxy. I tried fixing the configuration on a test vm, but I'm still getting 127.0.0.1 after restarting both nginx and apache.
14:46	Dyrcona	Does anyone have an example configuration that works?
14:47	Dyrcona	jeff: We still have rolling updates for due dates at two of our member libraries.
14:52		jihpringle joined #evergreen
14:55	Dyrcona	I've tried passing $remote_addr and $proxy_protocol_addr in the X-Forwarded-For header but neither works.
14:57	Dyrcona	I started out with $proxy_add_x_forwarded_for.
14:59	Dyrcona	mod_remoteip is enabled and the directives to use the X-Forwarded-For header are set up, along with the internal proxy ip address.
14:59	berick	was about to ask
15:05		laurie joined #evergreen
15:05	berick	Dyrcona: are you only seeing this issue with websockets requests?
15:09	Dyrcona	No. It's with regular Apache requests.
15:10	Dyrcona	AFAICT, it should be working, and I thought it was working once.
15:10	Dyrcona	I do see the remote ip on some SSL info/error messages.
15:23	Dyrcona	I guess I'll stick with what I've got since none of the other changes seem to work, either.
15:29		mantis1 left #evergreen
15:46	jeff	I laughed:
15:46	jeff	Due to a recent COVID-19 exposure, the library is closed until Dec 28. Curbside service is also suspended. Your item(s) including TEN LESSONS FOR A POST-PANDEMIC WORLD will be held until service resumes. More info will be posted on the library web site.
15:48	berick	heh
15:49	mmorgan	Probably won't need that book for a while yet, anyway :-/
15:59	csharp	jeff++
16:09		Cocopuff2018 joined #evergreen
16:57	csharp	happened again - at 4:30 p.m. EST, our open-ils.actor drone count was 12/192 - just now it's 192/192
16:57	csharp	and a wall of NOT CONNECTED errors
16:58		sandbergja joined #evergreen
16:59	csharp	berick: I tested your branches for bug 1896285 on a smallish test server and it kept the open-ils.actor count low but I saw a spike in pcrud drones (small use case)
16:59	pinesol	Launchpad bug 1896285 in Evergreen "Use batch methods for multi-row grid actions" [Medium,Confirmed] https://launchpad.net/bugs/1896285
16:59	csharp	berick: would the fix for the patron buckets (which I don't think are widely used in PINES) help?
17:00	berick	the fixes only address the specific work flows
17:00	csharp	I'm interested in tracking down the exact call(s) that spiked this brick
17:00	csharp	ok
17:00	csharp	that's what I thought
17:00	berick	if you find more, i'll do what I can to patch
17:00	csharp	berick++
17:02		sandbergja joined #evergreen
17:02		mmorgan left #evergreen
17:05	csharp	berick: looks like a crazy sh*t ton of these: 2020-12-14 16:45:37 brick01-head gateway: [ACT:61996:osrf-websocket-stdio.c:559:16079823286199641] [127.0.0.1] [] open-ils.actor open-ils.actor.ou_setting.ancestor_default.batch 178, ["cat.default_copy_status_normal"],
17:05	csharp	they started about 10-12 seconds before the NOT CONNECTED errors
17:07	berick	yeah, looks like it's called with each new copy, which could be a lot
17:10	csharp	I can confirm that the same call is repeated over and over during this morning's problems too
17:10	berick	csharp: mind adding a note to https://bugs.launchpad.net/evergreen/+bug/1896285 ?
17:10	pinesol	Launchpad bug 1896285 in Evergreen "Use batch methods for multi-row grid actions" [Medium,Confirmed]
17:12	csharp	berick: done - thanks!
17:19		Dyrcona joined #evergreen
17:20	Dyrcona	I signed back in to say that we ran out of open-ils.actor drones this afternoon on brick 6, the one that I replaced this morning. It happened just a bit after I clocked out for the day.
17:20	csharp	Dyrcona: did you see the scrollback from the last 20-30 mins?
17:20	Dyrcona	We need to fix the cause of this instead of increasing the number of drones to paper over it.
17:21	csharp	yeah, that didn't help us at all
17:21	csharp	open-ils.actor.ou_setting.ancestor_default.batch 178, ["cat.default_copy_status_normal"]
17:21	csharp	see if that's happening in crazy numbers in your activity log
17:22	berick	i've reproduced it and am working on a patch now
17:23	csharp	berick: awesome
17:25	Dyrcona	Oh. I've seen that and more. Over 30,000 requests for the same setting from the same client within a matter of minutes.
17:25	Dyrcona	We're still on 3.2.
17:26	csharp	Dyrcona: good to know
17:26	csharp	we're on 3.4, heading to 3.6 next month
17:40	Dyrcona	We'll be going to 3.6 in April, probably.
17:40	Dyrcona	We make big jumps these days.
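One way to look for the repeated open-ils.actor.ou_setting.ancestor_default.batch calls in an activity log is sketched below. It is an illustration only: the regular expression is modeled on the single gateway line csharp pasted, the function name `top_calls` is hypothetical, and the empty third bracketed field is left uninterpreted.

```python
import re
from collections import Counter

# Assumed layout, modeled on the gateway activity line quoted in the channel:
# "... [ACT:pid:file:line:trace] [client-ip] [] <service> <method> <params>"
ACT_RE = re.compile(
    r"\[ACT:[^\]]*\] \[(?P<client>[^\]]*)\] \[[^\]]*\] "
    r"(?P<service>\S+) (?P<method>\S+)"
)


def top_calls(lines, limit=5):
    """Return the most frequent (client, method) pairs among ACT lines."""
    counts = Counter()
    for line in lines:
        m = ACT_RE.search(line)
        if m:
            counts[(m["client"], m["method"])] += 1
    return counts.most_common(limit)
```

Running it over the minutes before an outage should make a flood like the one described here stand out at the top of the list. Behind a local proxy the client field may only show 127.0.0.1, as in the quoted line, so the grouping by client is only useful where the real address is logged.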
17:41		sandbergja_ joined #evergreen
17:41		sandbergja joined #evergreen
17:43	sandbergja	I have a hold that really seems like it should target a specific copy. action.hold_request_permit_test says that everything's good. But retargeting the hold never targets that (or any other) copy. Any tips for my next line of troubleshooting?
17:45	sandbergja	Never mind, it actually is targeting it properly. It just won't go into transit when we check it in, despite being the targeted copy
17:46	sandbergja	And never mind my never mind -- I was looking at the wrong column
17:47	sandbergja	nothing in current_copy
17:56		dbwells joined #evergreen
18:00	pinesol	News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
18:01	Dyrcona	So, I tried restarting the service, and the Listener would not die. I had to kill it with -9, i.e. fire.
18:01		sandbergja__ joined #evergreen
18:03		jihpringle joined #evergreen
18:04	berick	csharp: fix pushed
18:50	csharp	berick: rock on - will test very soon
20:21		sandbergja__ joined #evergreen
20:33	csharp	berick - tested fine on my test server - I'll let you know tomorrow how it looks with PINES data
22:34		sandbergja__ joined #evergreen