Time | Nick | Message
05:21 | | eby joined #evergreen
11:34 | pastebot | "Bmagic" at 64.57.241.14 pasted "Is this Chunking bug? FYI - EG 2.12.7 OpenSRF 2.5.2" (7 lines) at http://paste.evergreen-ils.org/521
11:54 | Bmagic | The funny thing is - after that error occurs, I can still log in to that brick and do staff functions and anything else. I can't seem to make it break. I haven't restarted the services....
12:16 | miker | Bmagic: that error says a /client/ connection went away during the request. could be an apache backend died or was killed. something sending a large stanza through apache /could/ cause that error, but it seems unlikely to do so during a search request. so, no, not likely a chunking issue
12:27 | Bmagic | miker: thanks! That helps. Not enough apache children?
12:30 | Bmagic | miker: StartServers 20, MinSpareServers 5, MaxSpareServers 15, MaxRequestWorkers 150, MaxClients 150, MaxRequestsPerChild 10000
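For readability, the prefork values Bmagic reports would sit in Apache's MPM configuration roughly like this; this is only the reported numbers laid out, a sketch assuming Apache 2.4 with mpm_prefork on a Debian-style layout (the file path, and the note that MaxClients is just the pre-2.4 alias of MaxRequestWorkers, are assumptions, not part of the log):

    # /etc/apache2/mods-available/mpm_prefork.conf  (path is an assumption)
    <IfModule mpm_prefork_module>
        StartServers            20
        MinSpareServers          5
        MaxSpareServers         15
        MaxRequestWorkers      150
        MaxClients             150    # pre-2.4 alias; kept only because it was reported
        MaxRequestsPerChild  10000
    </IfModule>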
12:32 | miker | that's the tail end of a search call, so the (presumably, apache backend) client went away while waiting on a response, during an active request, before getting the data from open-ils.search. I don't recall the exact ordering of the search steps in 2.12's mod_perl code, but I can't think of anything OTTOMH that it would do in parallel with the main search request, so I imagine it was just waiting on the data (up to 70s, IIRC)... maybe OOM killed it
12:33 | Bmagic | ps -ef | grep apache2 | wc -l - is reporting things approaching 150
12:34 | Bmagic | if it hits 150, apache will kill them?
12:34 | miker | IOW, I doubt it's related to too few apache backends -- the apache process in question was actively trying to service an HTTP request. I don't see why apache would kill it -- if it reached max backends, it would just refuse or stall new requests
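If the server really were hitting its backend limit, Apache generally says so in its error log; a hedged way to check, assuming a stock Debian/Ubuntu log location:

    # Apache 2.4 logs a notice when it reaches the MaxRequestWorkers limit
    grep -i "MaxRequestWorkers" /var/log/apache2/error.log
    # current count of apache2 processes, without a grep pipeline counting itself
    pgrep -c apache2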
12:37 | Bmagic | hmm, memory usage looks ok. 16GB total with 3GB free, but of course that is right now and not during the issue
12:38 | miker | `dmesg -T` will tell you if the OOM killer was at fault
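A minimal sketch of that check; the grep patterns match the kernel's usual OOM wording, though the exact phrasing varies by kernel version:

    # look for OOM-killer activity with human-readable timestamps
    dmesg -T | grep -iE "out of memory|killed process|oom-killer"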
12:44 | pastebot | "miker" at 64.57.241.14 pasted "Full trace" (89 lines) at http://paste.evergreen-ils.org/523
12:46 | Bmagic | miker: dmesg -T shows nothing with today's date
12:53 | Bmagic | so, it started at 11:14:43 and got the response at 11:15:54? (top two lines)
12:54 | miker | search call timed out on the client end (at 70s, the second log line), apache may have recycled the backend after it generated the "no hits" page, but then the request finally finished 11s later. it tried to send the result to the apache client, but it was gone by then so it got the not-connected response from ejabberd
12:55 | Bmagic | ah, it took too long. So do I tweak something?
12:58 | Bmagic | search got improved in 3.x? Postgres is 9.5 with default_statistics_target = 50, maintenance_work_mem = 1GB, constraint_exclusion = on, checkpoint_completion_target = 0.9, effective_cache_size = 160GB, work_mem = 224MB, wal_buffers = 8MB, shared_buffers = 8GB, max_connections = 550
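Laid out as they would appear in postgresql.conf, the settings Bmagic lists are simply these (values as reported for PostgreSQL 9.5, not a recommendation):

    # postgresql.conf excerpt, values as reported in the chat
    default_statistics_target    = 50
    maintenance_work_mem         = 1GB
    constraint_exclusion         = on
    checkpoint_completion_target = 0.9
    effective_cache_size         = 160GB
    work_mem                     = 224MB
    wal_buffers                  = 8MB
    shared_buffers               = 8GB
    max_connections              = 550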
12:59 | Bmagic | the DB machine has 480GB memory and the database on disk is 305GB
13:01 | miker | could just be a cold cache, could be tuning, hard to say...
13:01 | Bmagic | those numbers look sane to you?
13:03 | miker | they don't look crazy, certainly
13:03 | Bmagic | perhaps we can increase the 70s limit?
13:05 | Bmagic | I have another system with 256GB db on disk and only 100GB of memory. Running 3.0.2 with no NOT CONNECTED TO NETWORK errors. I am wondering if the 3.x code sped it up enough to stay under the 70s limit
13:09 | miker | well, the NOT CONNECTED won't happen every time
13:09 | miker | first of all
13:09 | miker | it only happens when the client goes away between the timeout and the eventual completion of the search
13:09 | miker | it's not something to worry about, generally, when a client disappears
13:09 | Bmagic | well alright then
13:11 | miker | unless that client was critical, of course. like, it was consuming responses and you need that to happen. this one, though, the end user already got a "no hits" page from the server
13:11 | Bmagic | I see
13:11 | Bmagic | I was contemplating a script that restarts the services automatically when NOT CONNECTED is grepped out of the log... but that is sounding like a bad idea
13:14 | Bmagic | miker++
13:15 | Bmagic | Thank you for lending your immense wisdom on A FRIGGIN Saturday. I really really appreciate it!
13:16 | miker | np
13:18 | miker | you do want to notice when an actual service backend (and, more critically, a listener) goes away and you see a not-connected message. but simply restarting the service may not be the right action. it can depend on why the process died
13:19 | Bmagic | right, I am not sure of the right search criteria for that. I was thinking grep "NOT CONNECTED" | grep -v "search"
13:20 | Bmagic | but it's more complicated. Like I need to find the not connected line, then grep the threadID, grep again for the first line of that, and parse the time to figure out if it's the 70s issue
13:24 | Bmagic | or maybe just counting the number of occurrences and, if it's more than, say, 5, restart services
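A minimal sketch of the counting approach Bmagic floats here (and which miker cautions may not be the right reflex). The log path, threshold, and restart command are all assumptions that would need to match the local install:

    #!/bin/bash
    # Count NOT CONNECTED errors that are not from the search service (per the grep idea above)
    LOG=/openils/var/log/osrfsys.log     # assumption: default OpenSRF log location
    THRESHOLD=5                          # assumption: "more than, say, 5"
    count=$(grep "NOT CONNECTED" "$LOG" | grep -vc "search")
    if [ "$count" -gt "$THRESHOLD" ]; then
        echo "$(date): $count NOT CONNECTED errors seen, restarting OpenSRF" >> /var/log/osrf_restart.log
        # assumption: osrf_control is the site's usual restart mechanism
        osrf_control --localhost --restart-all
    fi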
19:22 | | beanjammin joined #evergreen