00:46 -- sandbergja joined #evergreen
04:37 -- jamesrf joined #evergreen
05:01 <pinesol> News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
07:00 -- JBoyer joined #evergreen
07:02 -- agoben joined #evergreen
07:10 -- rjackson_isl joined #evergreen
07:25 -- bdljohn joined #evergreen
08:03 -- _bott_ joined #evergreen
08:14 -- jamesrf joined #evergreen
08:25 -- bos20k joined #evergreen
08:37 -- terran joined #evergreen
08:43 -- mmorgan joined #evergreen
08:49 -- Dyrcona joined #evergreen
08:51 -- littlet joined #evergreen
09:19 -- yboston joined #evergreen
09:48 <csharp> we need to consider something to distribute the load of individual opensrf services - 3 of our 6 EG app servers are maxed out on open-ils.actor drones (and pushing pcrud limits) while the other 3 are underutilized
09:49 <csharp> so we're seeing backlogs on one server while another one is just sitting there
09:49 <csharp> this is the trade-off of moving to websocketd I guess
09:49 <csharp> for one we worry about spinning apache2 procs, and the other consumes our drones and DB connections
09:50 <Dyrcona> csharp: You can set up one central ejabberd that shares communications among all of your bricks.
09:51 <Dyrcona> As for consuming DB connections, you need to reconsider your postgres configuration in light of your typical load.
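Dyrcona's point about sizing Postgres for the actual connection load mostly comes down to a few postgresql.conf knobs. A minimal sketch follows; the values are placeholders, not recommendations, since the right numbers depend on RAM, core count, and how many drones across all bricks can hold connections at once. A connection pooler such as PgBouncer is the other common lever once drone counts approach max_connections.

    # postgresql.conf -- illustrative excerpt; all values are placeholders
    max_connections = 1000          # ceiling on concurrent backends (drones, cron jobs, etc.)
    shared_buffers = 16GB           # often sized around a quarter of RAM
    work_mem = 64MB                 # per sort/hash node, multiplied across concurrent queries
    effective_cache_size = 48GB     # planner hint, roughly the size of the OS page cache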
09:53 -- sandbergja joined #evergreen
09:53 <Dyrcona> You also might want to look at your Apache configuration because I've not seen many spinning Apache2 processes in a while.
09:54 <csharp> yeah - not seeing those since moving off apache2-websockets
09:55 <Dyrcona> Well, I was thinking of main Apache2 processes, but yeah... websocketd helps.
09:55 <Dyrcona> You might want to consider setting up 1 machine/vm to run ejabberd and the routers for your whole cluster.
09:56 <Dyrcona> You should then be able to run listeners on all of your other machines, configure them to use the master ejabberd, and your load will get spread out differently.
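Roughly, the layout Dyrcona describes is one machine running ejabberd (and the OpenSRF routers) for the whole cluster, with every brick's listeners logging in to it. A sketch of the central piece, assuming a modern ejabberd with YAML config; the domain names are placeholders:

    # /etc/ejabberd/ejabberd.yml on the single shared XMPP host (sketch)
    # (older installs use the Erlang-style ejabberd.cfg instead)
    hosts:
      - "private.cluster.example.org"
      - "public.cluster.example.org"

Each brick's opensrf_core.xml would then use these shared domains in place of private.localhost/public.localhost, with DNS or /etc/hosts resolving them to the central machine, so all listeners register against the same routers.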
09:57 <Dyrcona> I also question how well our load balancer actually balances, because I often see a disparity in the use of our bricks, though it's often someone holding the enter key for too long.
09:58 <bshum> That or you'll see all the loads spike up on every server because the services are all being used :)
09:58 <berick> csharp: hm, I'm surprised websocketd would have any impact on that. if the idle timeout is the same, then clients should connect and disconnect with the same general distribution.
09:59 <berick> csharp: i suspect your apache2-websockets idle timeout was lower than 5 minutes (which is your websocketd timeout IIRC)
09:59 <berick> i think the default for apache was maybe 2 minutes
09:59 <Dyrcona> berick: I think csharp was referring to the fact that apache2-websocket processes would go ballistic from time to time, not that it made a difference in drone usage.
09:59 <Dyrcona> At least, that's what I meant about websocketd making a difference.
10:04 <csharp> berick: so the idle timeout with websocketd is set in the nginx config, right? (as opposed to the apache config for apache2-websockets where we left the nginx timeouts higher)
10:04 <berick> csharp: correct
10:05 <berick> as a test, maybe knock it down to 1 minute
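The timeout being discussed lives in the nginx proxy block that fronts websocketd. This is only a sketch: the location path and backend port follow the stock OpenSRF/Evergreen nginx example as best I recall (verify against your own site config); the timeout directives are standard nginx.

    # nginx site config (sketch) -- websocket proxy in front of websocketd
    location /osrf-websocket-translator {
        proxy_pass http://127.0.0.1:7682;        # local websocketd listener; scheme/port are assumptions
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 1m;                   # idle clients dropped after 1 minute (was 5m)
        proxy_send_timeout 1m;
    }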
10:05 <miker> csharp: opensrf is already able to distribute load. you can create a single xmpp domain, as Dyrcona suggests, or just tell the services about all the existing domains (which requires they have non-localhost addresses and unique names)
10:05 <csharp> berick: thanks
10:05 * Dyrcona uses non-localhost addresses anyway, but doesn't do cross-brick communication, yet.
10:06 -- bos20k_ joined #evergreen
10:06 <Dyrcona> You do want non-routable addresses or good firewall rules. :)
10:06 <csharp> miker: Dyrcona: yeah - I want to experiment with that - I think we're going to have to since this keeps coming up (and yeah, we already use non-localhost names)
10:07 <Dyrcona> We used to do that at MVLC with the utility server sharing drones/listeners with the staff side.
10:08 <Dyrcona> It was two ejabberd servers talking to each other, IIRC.
10:13 <JBoyer> Also, how are you doing your load balancing? dedicated hardware, ldirectord, the-other-ones-whose-names-escape-me-currently?
10:24 -- collum joined #evergreen
10:28 -- mmorgan joined #evergreen
10:53 -- Christineb joined #evergreen
11:18 <csharp> JBoyer: ldirectord
11:19 <csharp> berick: reducing the idle timeout to 1m has helped, for sure.
11:26 <berick> csharp: glad to hear it
11:33 -- khuckins joined #evergreen
11:49 <JBoyer> And if you're not using the wlc load balancing method, that may also help.
11:51 <csharp> scheduler=wlc - looks like we are
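The scheduler=wlc that csharp quotes comes from ldirectord's config. A rough sketch of that stanza is below; the addresses and health-check options are placeholders, but scheduler=wlc is the weighted least-connection method, which matches csharp's later description of new sessions going to the brick with the fewest active connections.

    # /etc/ha.d/ldirectord.cf (sketch; IPs and checks are placeholders)
    virtual=203.0.113.10:443
            real=10.0.0.11:443 gate
            real=10.0.0.12:443 gate
            real=10.0.0.13:443 gate
            scheduler=wlc        # weighted least-connection
            protocol=tcp
            checktype=connect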
12:00 <JBoyer> I see. It sounds like the shorter timeout has helped; that would likely have a bigger impact than the lb, I imagine.
12:03 <csharp> yeah - significantly calmer
12:07 -- jihpringle joined #evergreen
12:36 -- sandbergja joined #evergreen
14:40 -- yboston joined #evergreen
14:48 -- Christineb joined #evergreen
15:32 <gmcharlt> berick: miker: re bug 1820747, can you confirm that the code in question is indeed well and truly dead?
15:32 <pinesol> Launchpad bug 1820747 in Evergreen "dead code cleanup: remove open-ils.search.added_content.*" [Low,New] https://launchpad.net/bugs/1820747
15:33 <gmcharlt> an ex-handler, even? shuffled off the mortal coil?
15:35 <Dyrcona> "He's just pinin' for his 'fianced...."
15:40 <miker> gmcharlt: will look, sec
15:44 <berick> forgot all about that stuff
15:45 <berick> gmcharlt: i trust your research there.
15:48 <miker> gmcharlt: hrm. it looks to me like we still eval it into place if no added content provider is defined in opensrf.xml, no? I haven't looked past the named commit yet
15:49 <miker> gmcharlt: we don't in master... ok, it does look dead to me, indeed
15:51 <miker> looks like we can drop a test! :)
15:52 <miker> Open-ILS/src/perlmods/t/06-OpenILS-Application-Search.t specifically
15:52 <miker> or, one in there
15:53 -- bdljohn joined #evergreen
16:25 <csharp> ...and we're back to drone saturation
16:25 <csharp> might have to revert to apache2-websockets
16:25 <csharp> it's annoying to have to stomp out the spinning procs, but it's not causing this
16:26 <csharp> memcached connections are full, drone limits are full, higher-than-normal DB connections
16:26 <Dyrcona> csharp: Are you sure that you're just not overly busy and need to add a brick or two? I don't see these problems with websocketd, but I'm also not using nginx as a proxy.
16:27 <csharp> never seen it like this
16:27 <Dyrcona> Load hit 127 on our main db server this morning with 100+ queries going at once. No phone calls from outside, and it looks like "normal" use, though someone probably dropped a book on an enter key or something.
16:27 <csharp> it's not like March 18th is expected to be a special day like the day after Memorial Day, etc.
16:28 <Dyrcona> I see crazy db load, but almost never drone saturation these days.
16:28 <Dyrcona> Sometimes, we're being attacked with attempted SQL injection, but I didn't see any signs of that earlier today.
16:29 <csharp> DB server's system load is fine, but it's at ~760 active procs
16:29 <csharp> limit is 1000
16:29 <csharp> brick load is high-ish, but not insane
16:30 <Dyrcona> I find the db server load goes up 1 for each actively running query; note this is different from the number of connected procs.
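The distinction Dyrcona draws here (connected backends versus queries actually running) can be checked directly in Postgres. A minimal sketch, assuming PostgreSQL 9.4+ for the FILTER clause (both sites in this conversation are on 9.5):

    -- connected backends vs. actively running queries
    SELECT count(*)                                 AS connected,
           count(*) FILTER (WHERE state = 'active') AS active
    FROM pg_stat_activity;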
16:31 <Dyrcona> csharp: What pg version, out of curiosity?
16:31 <berick> csharp: wth, how many nginx procs and websocketd procs are alive?
16:32 <csharp> berick: 17 nginx procs per brick
16:32 <csharp> between 27 and 31 websocketd procs per brick
16:32 <csharp> Dyrcona: 9.5
16:33 <Dyrcona> csharp: Thanks. Same here.
16:35 <berick> hm, those aren't particularly high numbers
16:35 <csharp> yeah - looking for why it seems to be limited to 16
16:35 <csharp> 1 master proc and 16 workers
16:36 <berick> any other changes deployed with websocketd or was it done alone?
16:36 <Dyrcona> Well, I've got to head out. I'll check the logs this evening.
16:36 <csharp> berick: I upgraded to opensrf 3.1.0
16:36 <csharp> and then I upgrade to EG 3.2.3
16:36 <csharp> (from 3.2.2)
16:36 <csharp> *upgraded
16:37 <csharp> all the same evening, but incrementally
16:37 <berick> ok, so websocketd being the problem is the theory, but it could be related to any of those changes?
16:37 <berick> you were on opensrf 3.0 before that?
16:37 <csharp> could be, I guess?
16:37 <csharp> yep, opensrf 3.0.2, I think
16:38 <bshum> How many cores does the server have? (just read a thing that said nginx workers is like 1 per core)
16:39 <csharp> worth mentioning that we have not gotten complaints from the front end about this, but the logs tell a different story
16:39 <csharp> and nagios is blowing up my phone
16:39 <csharp> bshum: yep - 16 cores
16:40 <csharp> we can up the cores, but that seems like the wrong direction to go
16:40 <bshum> No, that shouldn't be the problem
16:40 <bshum> I was just trying to explain why you were capped :)
16:40 <bshum> Err solution/problem?
16:40 <bshum> Haha
16:40 <csharp> bshum: thanks for looking into that - it's not specified in the config as far as I can tell
16:40 <berick> csharp: something else that may require investigation is bug #1729610
16:40 <pinesol> Launchpad bug 1729610 in OpenSRF "allow requests to be queued if max_children limit is hit" [Wishlist,Fix released] https://launchpad.net/bugs/1729610
16:41 <csharp> berick: ah - that's interesting - I did see the queue in the logs
16:41 <csharp> it mentions the backlog
16:41 <csharp> might be why we're not getting complaints
16:42 <berick> yeah, it could be helping in this situation for sure.
16:42 <csharp> 2019-03-18 16:41:53 brick03-head open-ils.actor: [WARN:26954:Server.pm:212:] server: backlog queue size is now 35
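A quick way to keep an eye on those warnings while tuning, sketched with a placeholder path since the log location depends on where your syslog setup routes OpenSRF output:

    # tail and count the backlog warnings for a service (log path is site-specific)
    grep 'backlog queue size' /path/to/osrf-warn.log | tail -n 20
    grep -c 'Backlog queue full' /path/to/osrf-warn.log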
16:42 <berick> but i mention it only because it's a change down in the osrf layers that could affect this territory
16:42 <jeff> default for nginx worker_connections is 512. it's not like Apache with mpm_prefork...
16:42 <csharp> yeah
16:45 <jeff> on at least Debian 9 (stretch), worker_connections is 768.
16:46 <csharp> yeah - it's 768 here
16:47 <csharp> but I now wonder whether that means you need 768 cores to get that high :-)
16:47 <jeff> Negative.
16:48 <bshum> It's like each process has that many max connections
16:48 <csharp> https://pastebin.com/TF2JgLPh is our config (should be stock ubuntu 16.04)
16:48 <jeff> That's what I'm trying to say, is that unlike Apache with mpm_prefork, nginx is closer to mpm_event. There isn't a 1:1 client:worker relationship.
16:48 <bshum> So 16x768
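The two directives jeff and bshum are multiplying live in the main nginx.conf. A sketch of the stock Debian/Ubuntu defaults under discussion (values as reported in the conversation; check your own file):

    # nginx.conf (stock Debian/Ubuntu layout, roughly)
    worker_processes auto;          # one worker per core -> 16 workers on a 16-core box
    events {
        worker_connections 768;     # per worker, so roughly 16 x 768 concurrent connections
    }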
16:48 <berick> csharp: if you feel like experimenting, you could try setting max_backlog_queue (under <unix_config>) to some very low number (say, 1) in opensrf.xml for busy services. the goal being to see if you can rule that out as a potential issue.
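In opensrf.xml terms, berick's experiment amounts to a per-service stanza like the sketch below. Only max_backlog_queue is the knob under discussion; the surrounding child-process values are placeholders of the kind a typical config already carries, and, as noted later in the log, the queue defaults to 1000 when the element is absent.

    <!-- opensrf.xml sketch: service entry under <apps> -->
    <open-ils.actor>
      <unix_config>
        <max_requests>1000</max_requests>
        <min_children>10</min_children>
        <max_children>120</max_children>
        <max_backlog_queue>1</max_backlog_queue> <!-- experiment; defaults to 1000 when unset -->
      </unix_config>
    </open-ils.actor>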
16:48 <csharp> jeff: gotcha
16:48 <csharp> berick: I'll try that
16:50 <jeff> The fact that you have "so few" nginx processes is probably not a problem. I could be wrong, though! ;-)
16:50 <berick> in addition to what jeff and bshum said, websocketd /forks/ each connection, so the number of websocketd procs equals the number of connected websocket clients. the other nginx threads/procs are just proxying stuff.
16:51 <berick> html, js, etc.
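That fork-per-connection behavior is why the websocketd process count is a rough proxy for concurrent websocket clients on a brick, while the nginx worker count stays fixed regardless of load. A trivial check:

    # rough head counts on a brick; websocketd ~= connected websocket clients
    pgrep -c websocketd
    pgrep -c nginx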
16:51 <csharp> good to know that
16:51 <bshum> Admittedly I haven't run into the backlog issue, but doesn't that just sound like the max children is getting hit somewhere and it might be time to check all the configured opensrf service values?
16:52 <bshum> I presume that the web client uses slightly different resources than XUL did
16:53 <berick> bshum: i'm just wondering since it's also very new code if there could be bugs lurking. worst case would be a feedback loop where requests get duplicated or something.
16:53 <berick> it's a change to the Sacred Innards
16:53 <bshum> Right.
16:55 * csharp notes that the example opensrf.xml in master doesn't include max_backlog_queue
16:55 -- sandbergja_ joined #evergreen
16:55 <csharp> (Evergreen master)
16:56 <csharp> OpenSRF 3.1.0 *does* have that setting
16:56 <berick> yeah, it defaults to 1000
16:56 <jeff> Am now imagining a partially-carved pumpkin on the table in front of us, Sacred Innards strewn about.
16:56 <bshum> oops, loose end
16:57 <csharp> @band add Sacred Innards
16:57 <pinesol> csharp: Band 'Sacred Innards' added to list
16:58 <csharp> sounds more like the album name than the band name though
17:00 <bshum> jeff++ berick++ csharp++ # I miss production issues (not really)
17:00 <csharp> it's a mixed bag - the soul-crushing lows are often followed by dizzying highs
17:01 <pinesol> News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
17:01 * bshum now wants pumpkin pie
17:01 <bshum> It's too early for this.
17:02 * bshum wanders off, but will read the rest of this adventure story later
17:05 -- mmorgan left #evergreen
17:22 -- yboston joined #evergreen
17:23 <csharp> ok, I've set the max_backlog_queue to 1 for open-ils.actor - started on one brick and it didn't explode, so activating the change on the others
17:23 <csharp> I can see in the logs that it's defaulting to 1000 for services without a setting specified
17:23 <berick> csharp++
17:25 <csharp> Backlog queue full for open-ils.actor; forced to drop message 0.59448590011106451552944256474 from opensrfpublic.brick03-head.gapines.org/_brick03-head_1552944269.425670_13415
17:26 <berick> ok, that's expected (assuming load)
17:26 <csharp> I'm thinking now that queuing is helping more than hurting, but we'll see
17:26 <berick> it very well could be
17:27 <berick> actor queue filled up pretty fast
17:27 <csharp> on that one brick - yeah
17:27 <csharp> it's maxed, but others are underutilized
17:29 <berick> ok, so the symptoms of stuff getting clobbered are essentially the same?
17:29 <berick> only now they're dropped instead of queued?
17:30 <csharp> yeah
17:30 <csharp> so far anyway
17:30 <csharp> things have calmed in general
17:30 <csharp> I guess that was kids getting out of school and hitting the library
17:32 <berick> ok, if it's acting the same, then likely the queueing is not the source of the problem (or setting a low value is not bypassing a bug)
17:32 <csharp> agreed
17:33 <berick> csharp: so your load balance setup, does it distribute in a loop or stick to the most recently used?
17:33 <csharp> it should be weighted for the server with the least connections
17:33 <berick> gotcha
17:38 <berick> csharp: do most/all of the bricks experience the issue or is it sticky?
17:40 <csharp> most/all
17:40 * berick nods
17:41 <csharp> this feels like something that could be fixed with tuning configs though
17:42 <csharp> at this very moment, everything is pretty balanced across (after restarting opensrf to reactivate backlog queuing)
17:42 <csharp> also just generally calmer
17:43 <berick> yeah, probably over the hump
17:44 <berick> well, keep us posted, especially if you decide to revert websocketd to test.
17:44 <csharp> thanks!
17:44 <berick> guessing tests now won't prove much
17:45 <csharp> agreed
19:33 -- Dyrcona joined #evergreen
20:26 -- gryffonophenomen joined #evergreen
20:54 -- sandbergja joined #evergreen
23:29 -- jamesrf joined #evergreen