Time | Nick | Message
00:49 | | bshum joined #evergreen
05:23 | | Keith__isl joined #evergreen
06:00 | pinesol | News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
07:21 | | rjackson_isl_hom joined #evergreen
07:56 | | Keith_isl joined #evergreen
08:25 | | mantis1 joined #evergreen
08:41 | | mmorgan joined #evergreen
08:42 | csharp_ | there's nothing consistent in the logs that would explain why open-ils.actor keeps dropping its connection to the network
08:43 | csharp_ | I'm looking for frequency of calls (varies), types of calls (vary), anything going on at the OS level (nothing special, no duress), segfault-type stuff (none so far)
08:44 | csharp_ | and ejabberd would probably be emitting more than "Connection failed: timeout" if there was a serious resource problem, I would think?
08:45 | csharp_ | at this point I'm willing to conclude that there's a bug in OpenSRF (possibly upstream in perl or C or something)
08:46 | csharp_ | so I guess it's time to start adding debug logging to the perl stuff?
08:46 | csharp_ | (I hope miker is watching since this is one of his areas of expertise)
09:23 | | rfrasur joined #evergreen
09:36 | | collum joined #evergreen
09:53 | miker | why are my ears burning...
09:56 |
miker |
csharp_: since it's specific to one service, I think it's more likely how that service is used / the shape of calls to it. to clarify, symptomatic result of whatever is happening is that the listener ends up being deregistered from the router, is that right? as in, if you use srfsh to ask for the list of services at the router, you see actor is no listed |
10:02 | miker | you can use this command in srfsh to do that, btw, as long as your router's ejabberd username is exactly "router": request router opensrf.router.info.stats.class.all
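A minimal sketch of that check as it might look at a srfsh prompt; only the request line comes from miker's message above, and the prompt text and surrounding setup are assumptions:

    srfsh# request router opensrf.router.info.stats.class.all

If open-ils.actor is missing from the class list that comes back, the listener has been deregistered from the router, which is the symptom being described here.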
10:07 | csharp_ | miker: yes, that's what we're seeing
10:07 | miker | csharp_: so, the absolute most common cause of that is having a c2s shaper installed in ejabberd. the way shapers work, a single very large stanza (say, huge template as parameter) can cause a pause -- ejabberd will deliver the payload and then wait for the amount of time it would have taken if the bytes/sec were strictly honored -- and the router sees that as a >1s delay from the listener, at which point it drops the listener from the list for the class
10:09 | miker | if you have entirely standalone bricks, with no cross-registration, this presents as an immediate listener failure. if you're cross-registering, you have more buffer because the other actor instances can take up the slack while you detect that situation and tell the service to rereg (or restart it)
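To make the shaper mechanism concrete with purely hypothetical numbers (neither figure comes from this conversation): if the c2s shaper were capped at 50 KB/s and a single call carried a 500 KB template as a parameter, the enforced pause works out to

    500 KB / 50 KB per second = 10 s of throttling delay

which is far past the >1 s cutoff at which the router drops the listener, as described above.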
10:11 | csharp_ | I do have c2s_shaper: none in all my ejabberd configs as recommended in bug 1941653
10:11 | pinesol | Launchpad bug 1941653 in OpenSRF "Update recommended shaper settings" [Medium,New] https://launchpad.net/bugs/1941653
10:11 | csharp_ | no cross registration
10:13 | miker | what version of ejabberd are you using?
10:13 | csharp_ | 18.01
10:14 | miker | ok, I think it's spelled differently (version differences) based on the docs. you have an ejabberd_c2s section, right?
10:15 | miker | specifically, I believe it should be "shaper: none" rather than "c2s_shaper: none"
10:16 | miker | can you share a paste of the relevant listen: block?
10:16 | csharp_ | yes, one sec
10:17 | miker | (and thanks a tonne, ejabberd, for changing the spelling and format of your config options all the time)
10:17 | csharp_ | https://pastebin.com/DYNVG8Pi
10:18 | csharp_ | seriously
10:21 | miker | yeah, instead of naming a shaper, in the listen: block just say "shaper: none"
10:21 | csharp_ | ok - just a sec
10:23 | miker | or, just remove that entirely. fwiw, because "none" is the default we just don't have a shaper: line
10:23 | miker | side note: because max-concurrent-sessions defaults to Inf, we don't specify that either
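For orientation, a minimal sketch of what the ejabberd_c2s listener entry in ejabberd.yml might look like with the shaper disabled as miker suggests; the port and ip values are generic examples, not taken from csharp_'s paste:

    listen:
      -
        port: 5222
        ip: "::"
        module: ejabberd_c2s
        shaper: none

Per miker's note, omitting the shaper: line entirely should behave the same way, since "none" is the default.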
10:24 | csharp_ | given my pain of the last few days, gonna stick with explicit over implicit :-)
10:26 | miker | fair. I'm looking at https://docs.ejabberd.im/admin/configuration/listen-options/#shaper for this, btw
10:29 | csharp_ | ok, change is applied, now to watch
10:29 | miker | ah, one other thing that might be important in a high-traffic env (though it should /not/ be something that only one service suffers from) is the max_fsm_queue setting https://docs.ejabberd.im/admin/configuration/listen-options/#max-fsm-queue
10:29 | miker | we've doubled that to 20000
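Continuing the listener sketch above (still illustrative, not csharp_'s actual config), max_fsm_queue is a listen option, so raising it as miker describes might look like:

    # added inside the same ejabberd_c2s listen entry shown earlier
    max_fsm_queue: 20000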
10:35 | csharp_ | still busted :-(
10:40 | csharp_ | I removed the backlog designation as well and restarted ejabberd - still watching - I'll look at the fsm queue stuff next
10:49 | miker | csharp_: here's an example of our ENTIRE .yml, sans hostname, IP and auth setup: https://pastebin.com/gJ1pGwVE
10:50 | csharp_ | miker: the main thing I'm having trouble with from the troubleshooting perspective is the lack of useful logs - we're planning to turn on debug-level opensrf logs for a while to see if we can capture things, but I'm surprised ejabberd isn't providing better info
10:50 | csharp_ | fsm increase in place, but the issue is still happening btw
10:51 | miker | oh, ejabberd logs are terrible
10:53 | miker | are you seeing this in the logs: We lost the last node in the class, responding with error and removing...
10:53 | csharp_ | yes
10:53 | miker | it's WARN level
10:53 | csharp_ | constantly
10:53 | miker | k
10:54 | csharp_ | 30 times in the last hour across 6 servers
10:54 | csharp_ | sorry - all day
10:54 | csharp_ | since 7:48 a.m.
10:54 | csharp_ | actually, not that far off though - 28 in the last hour 10 mins
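A quick way to do that kind of counting, sketched with an assumed syslog destination; the actual log path and rotation scheme on these servers are not shown in this conversation:

    # count occurrences of the router WARN on one brick (path is hypothetical)
    grep -c 'We lost the last node in the class' /var/log/opensrf/osrfsys.log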
10:57 | miker | always for actor?
10:57 | csharp_ | yes
10:58 | csharp_ | this feels like a threshold issue since actor is by far the most called service
11:00 | csharp_ | I realized I had not applied max_fsm_queue in the "right" place in the file, so my change was getting overwritten by the default (assuming the lower down value in the file is the last read)
11:00 | csharp_ | so applying that for real now
11:00 | | jvwoolf joined #evergreen
11:01 | miker | and you applied all three of the recent patches we talked about earlier in the week?
11:01 | | jvwoolf1 joined #evergreen
11:02 | csharp_ | yes, that's right
11:08 | miker | do you see much in the logs about the backlog queue? "server: backlog queue size is now" etc
11:12 | csharp_ | no - it doesn't look like the number of active children corresponds with the drops at all
11:13 | miker | k, that's what I expected, but to be sure there's not some new source of pauses in ejabberd communication...
11:16 | miker | do you see log lines with: server: restarting after fatal crash...
11:17 | miker | preceded by "server: died with error " and an error message?
11:19 | csharp_ | I see 'died with error' but not the restarting message
11:20 | csharp_ | just cranked up logging to debug everywhere in opensrf_core.xml to see more into what's going on
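For anyone following along, OpenSRF's log verbosity is set per section in opensrf_core.xml; a minimal sketch of the change being described, assuming syslog-based logging and OpenSRF's numeric levels (4 = DEBUG, 3 = INFO):

    <!-- illustrative fragment, not csharp_'s actual opensrf_core.xml -->
    <logfile>syslog</logfile>
    <syslog>local6</syslog>
    <loglevel>4</loglevel>  <!-- was 3 (INFO); 4 enables DEBUG output -->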
11:22 | miker | thoughts and prayers, rsyslogd
11:23 | csharp_ | heh osrfsys.11.log just hit 4.3 GB
11:23 | csharp_ | miker: pray for grep
11:23 | csharp_ | of course, it's calm right this second :-
11:23 | csharp_ | :-?
11:28 | csharp_ | what the hell? it's quiet now
11:29 | berick | could be one of those bugs where it's affected by the log level.
11:29 | berick | the logging bogs everything down, reducing overall throughput
11:29 | csharp_ | huh
11:29 | * csharp_ | pipes all logs to /dev/null and calls it a day
11:39 | csharp_ | berick: apparently you're correct - when I disabled debug logs, I see the problem again :-(
11:39 | csharp_ | WELL THAT'S JUST GREAT
11:44 | csharp_ | * open-ils.actor [72141] uptime=02:42 cputime=00:01:16 #drones=194/192 101%
11:44 | csharp_ | that's not something I've seen before
12:01 | miker | you really have 192 drones allowed?
12:01 | jeff | keep in mind that --diagnostic has some limitations/quirks: for one thing, the max number of drones is always read from the config on disk, not from whatever config is active. (that particular quirk is not likely related to what you're seeing right now, unless you recently LOWERED the number of drones)
12:01 | jeff | (I recently noticed that quirk and was annoyed by it, but hadn't looked for / opened a bug on it)
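The status line csharp_ pasted at 11:44 comes from OpenSRF's service control script; a hedged example of gathering it, where the --localhost flag assumes the single-domain invocation style used in the OpenSRF docs:

    # print listener/drone status for each registered service
    osrf_control --localhost --diagnostic

As jeff notes, the drone maximum shown in that report is read from the on-disk configuration rather than from whatever configuration the running services loaded.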
12:03 | csharp_ | miker: yeah - lower values have resulted in saturation
12:03 | | jihpringle joined #evergreen
12:03 | csharp_ | jeff: interesting
12:04 | csharp_ | miker: happy to provide additional configs to you if you think it would be helpful
12:05 | csharp_ | issue hasn't happened since 2022-01-21 11:41:59
12:09 | miker | csharp_: 1) that's good (except for the log spew) 2) no, this has got to be a too-many-bytes-too-fast issue, which is backed up by the logging-slows-us-down-enough-to-succeed situation. and the lack of the "restarting" log message means ejabberd is doing the disconnecting of the socket, not some internal explosion that is separate from the socket.
12:10 | miker | no, meaning, no I don't think more configs will be helpful right now
12:10 | | terranm joined #evergreen
12:17 | csharp_ | and... it's back
12:17 | csharp_ | miker: thanks
12:25 | mmorgan | csharp_: Someone went to lunch and just came back?
12:26 | miker | mmorgan: heh... good thought. could be unrelated to changing the logging, just usage difference
12:37 | csharp_ | that's one of my theories - a cataloger somewhere with a particular workflow is unwittingly making the state's libraries miserable
12:41 | miker | csharp_: are you seeing router log messages like this: "Error sending message from " ... ?
12:44 | csharp_ | not in osrfsys logs, no
12:45 | miker | k
12:46 | miker | well... I need brain fuel. I'll give it some more thought, but I'm running out of ideas
12:47 | csharp_ | miker: thanks
12:52 | miker | csharp_: one last thought, nobody has said to you "hey, I keep trying to look at this "special" user that we merge all our deleted users into, and evergreen never lets me see it" right? or "my 400 templates won't save"
12:54 | jeff | I've spent a few spare cycles trying to think about the best way for csharp to redact a pcap file in a way that doesn't limit its usefulness. :-)
12:57 | csharp_ | miker: there have been complaints, but sorting the potential causes from the symptoms (also considering the inaccuracy of end user reporting) is pretty much useless :-/
12:57 | csharp_ | but I'll look for patters
12:57 | csharp_ | patterns also
12:58 | csharp_ | it does seem less intense right now, but mmorgan's out-to-lunch theory may be in play
13:01 | jeff | still have your opensrf log level cranked to debug?
13:08 | | collum joined #evergreen
13:11 | csharp_ | jeff: nope - all at normal levels
13:12 | mmorgan | @blame the patters
13:12 | pinesol | mmorgan: but I never knew until this day that it was the patters all along
13:12 | mmorgan | :)
13:12 | terranm | lol
13:12 | csharp_ | @band add The Patters
13:12 | pinesol | csharp_: Band 'The Patters' added to list
13:43 | csharp_ | the > 100% osrf_control --diagnostic thing is because when you restart a specific service, it isn't killing the orphans left over from the previous process (because it doesn't know about them)
13:43 | csharp_ | only --kill-with-fire takes care of them (or a manual kill of each process)
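A hedged sketch of what that cleanup might look like from a shell; the drone process-title pattern is an assumption about how the Perl services name themselves and should be checked against ps output first:

    # look for leftover actor drones that a plain service restart did not reap (pattern is an assumption)
    pgrep -af 'OpenSRF Drone \[open-ils.actor\]'
    # the heavy hammer mentioned above: stop everything, orphans included
    osrf_control --localhost --kill-with-fire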
13:44 | csharp_ | maybe that's a bug? not sure what you'd want in that situation
14:03 | | jihpringle joined #evergreen
14:33 | csharp_ | this is just unacceptable - it's back to constant
14:33 | csharp_ | just saw it happen to cstore too
15:05 | | mmorgan1 joined #evergreen
15:07 | | collum joined #evergreen
15:19 | | book` joined #evergreen
16:50 | | jvwoolf1 left #evergreen
17:09 | | mmorgan left #evergreen
18:00 | pinesol | News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
23:10 | | Keith_isl joined #evergreen