Time | Nick | Message
05:21 | | eby joined #evergreen
11:34 | pastebot | "Bmagic" at 64.57.241.14 pasted "Is this Chunking bug? FYI - EG 2.12.7 OpenSRF 2.5.2" (7 lines) at http://paste.evergreen-ils.org/521
11:54 | Bmagic | The funny thing is - after that error occurs, I can still log in to that brick and do staff functions and anything else. I can't seem to make it break. I haven't restarted the services....
12:16 | miker | Bmagic: that error says a /client/ connection went away during the request. could be an apache backend died or was killed. something sending a large stanza through apache /could/ cause that error, but it seems unlikely to do so during a search request. so, no, not likely a chunking issue
12:27 | Bmagic | miker: thanks! That helps. Not enough apache children?
12:30 | Bmagic | miker: StartServers 20, MinSpareServers 5, MaxSpareServers 15, MaxRequestWorkers 150, MaxClients 150, MaxRequestsPerChild 10000
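For readability, the prefork values Bmagic reports would sit in Apache's MPM configuration roughly like this; this is only the reported numbers laid out, a sketch assuming Apache 2.4 with mpm_prefork on a Debian-style layout (the file path, and the note that MaxClients is just the pre-2.4 alias of MaxRequestWorkers, are assumptions, not part of the log):

    # /etc/apache2/mods-available/mpm_prefork.conf  (path is an assumption)
    <IfModule mpm_prefork_module>
        StartServers            20
        MinSpareServers          5
        MaxSpareServers         15
        MaxRequestWorkers      150
        MaxClients             150    # pre-2.4 alias; kept only because it was reported
        MaxRequestsPerChild  10000
    </IfModule>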
12:32 | miker | that's the tail end of a search call, so the (presumably, apache backend) client went away while waiting on a response, during an active request, before getting the data from open-ils.search. I don't recall the exact ordering of the search steps in 2.12's mod_perl code, but I can't think of anything OTTOMH that it would do in parallel with the main search request, so I imagine it was just waiting on the data (up to 70s, IIRC)... maybe OOM killed it
12:33 | Bmagic | ps -ef | grep apache2 | wc -l - is reporting things approaching 150
12:34 | Bmagic | if it hits 150, apache will kill them?
12:34 | miker | IOW, I doubt it's related to too few apache backends -- the apache process in question was actively trying to service an HTTP request. I don't see why apache would kill it -- if it reached max backends, it would just refuse or stall new requests
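If the server really were hitting its backend limit, Apache generally says so in its error log; a hedged way to check, assuming a stock Debian/Ubuntu log location:

    # Apache 2.4 logs a notice when it reaches the MaxRequestWorkers limit
    grep -i "MaxRequestWorkers" /var/log/apache2/error.log
    # current count of apache2 processes, without a grep pipeline counting itself
    pgrep -c apache2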
12:37 | Bmagic | hmm, memory usage looks ok. 16GB total with 3GB free, but of course that is right now and not during the issue
12:38 | miker | `dmesg -T` will tell you if the OOM killer was at fault
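A minimal sketch of that check; the grep patterns match the kernel's usual OOM wording, though the exact phrasing varies by kernel version:

    # look for OOM-killer activity with human-readable timestamps
    dmesg -T | grep -iE "out of memory|killed process|oom-killer"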
12:44 | pastebot | "miker" at 64.57.241.14 pasted "Full trace" (89 lines) at http://paste.evergreen-ils.org/523
12:46 | Bmagic | miker: dmesg -T shows nothing with today's date
12:53 | Bmagic | so, it started at 11:14:43 and got the response at 11:15:54? (top two lines)
12:54 | miker | search call timed out on the client end (at 70s, the second log line), apache may have recycled the backend after it generated the "no hits" page, but then the request finally finished 11s later. it tried to send the result to the apache client, but it was gone by then so it got the not-connected response from ejabberd
12:55 | Bmagic | ah, it took too long. So do I tweak something?
12:58 | Bmagic | search got improved in 3.x? Postgres is 9.5 with default_statistics_target = 50, maintenance_work_mem = 1GB, constraint_exclusion = on, checkpoint_completion_target = 0.9, effective_cache_size = 160GB, work_mem = 224MB, wal_buffers = 8MB, shared_buffers = 8GB, max_connections = 550
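Laid out as they would appear in postgresql.conf, the settings Bmagic lists are simply these (values as reported for PostgreSQL 9.5, not a recommendation):

    # postgresql.conf excerpt, values as reported in the chat
    default_statistics_target    = 50
    maintenance_work_mem         = 1GB
    constraint_exclusion         = on
    checkpoint_completion_target = 0.9
    effective_cache_size         = 160GB
    work_mem                     = 224MB
    wal_buffers                  = 8MB
    shared_buffers               = 8GB
    max_connections              = 550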
12:59 | Bmagic | the DB machine has 480GB memory and the database on disk is 305GB
13:01 | miker | could just be a cold cache, could be tuning, hard to say...
13:01 | Bmagic | those numbers look sane to you?
13:03 | miker | they don't look crazy, certainly
13:03 | Bmagic | perhaps we can increase the 70s limit?
13:05 | Bmagic | I have another system with 256GB db on disk and only 100GB of memory. Running 3.0.2 with no NOT CONNECTED TO NETWORK errors. I am wondering if the 3.x code sped it up enough to stay under the 70s limit
13:09 | miker | well, the NOT CONNECTED won't happen every time
13:09 | miker | first of all
13:09 | miker | it only happens when the client goes away between the timeout and the eventual completion of the search
13:09 | miker | it's not something to worry about, generally, when a client disappears
13:09 | Bmagic | well alright then
13:11 | miker | unless that client was critical, of course. like, it was consuming responses and you need that to happen. this one, though, the end user already got a "no hits" page from the server
13:11 | Bmagic | I see
13:11 | Bmagic | I was contemplating a script that restarts the services automatically when NOT CONNECTED is grepped out of the log... but that is sounding like a bad idea
13:14 | Bmagic | miker++
13:15 | Bmagic | Thank you for lending your immense wisdom on A FRIGGIN Saturday. I really really appreciate it!
13:16 | miker | np
13:18 | miker | you do want to notice when an actual service backend (and, more critically, a listener) goes away and you see a not-connected message. but simply restarting the service may not be the right action. it can depend on why the process died
13:19 | Bmagic | right, I am not sure of the right search criteria for that. I was thinking grep "NOT CONNECTED" | grep -v "search"
13:20 | Bmagic | but it's more complicated. Like I need to find the not connected line, then grep the threadID, grep again for the first line of that, and parse the time to figure out if it's the 70s issue
13:24 | Bmagic | or maybe just counting the number of occurrences and, if it's more than, say, 5, restart services
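A minimal sketch of the counting approach Bmagic floats here (and which miker cautions may not be the right reflex). The log path, threshold, and restart command are all assumptions that would need to match the local install:

    #!/bin/bash
    # Count NOT CONNECTED errors that are not from the search service (per the grep idea above)
    LOG=/openils/var/log/osrfsys.log     # assumption: default OpenSRF log location
    THRESHOLD=5                          # assumption: "more than, say, 5"
    count=$(grep "NOT CONNECTED" "$LOG" | grep -vc "search")
    if [ "$count" -gt "$THRESHOLD" ]; then
        echo "$(date): $count NOT CONNECTED errors seen, restarting OpenSRF" >> /var/log/osrf_restart.log
        # assumption: osrf_control is the site's usual restart mechanism
        osrf_control --localhost --restart-all
    fi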
19:22 | | beanjammin joined #evergreen