Evergreen ILS Website

IRC log for #evergreen, 2018-03-03


All times shown according to the server's local time.

Time Nick Message
05:21 eby joined #evergreen
11:34 pastebot "Bmagic" at 64.57.241.14 pasted "Is this Chunking bug? FYI - EG 2.12.7 OpenSRF 2.5.2" (7 lines) at http://paste.evergreen-ils.org/521
11:54 Bmagic The funny thing is - After that error occurs, I can still log in to that brick and do staff functions and anything else. I can't seem to make it break. I haven't restarted the services....
12:16 miker Bmagic: that error says a /client/ connection went away during the request. could be an apache backend died or was killed. something sending a large stanza through apache /could/ cause that error, but it seems unlikely to do so during a search request. so, no, not likely a chunking issue
12:27 Bmagic miker: thanks! That helps. Not enough apache children?
12:30 Bmagic miker: StartServers 20, MinSpareServers 5, MaxSpareServers 15, MaxRequestWorkers 150, MaxClients 150, MaxRequestsPerChild 10000
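For reference, prefork settings like those would normally live in Apache's MPM configuration, roughly as below (the file path is a Debian-style assumption, and MaxClients / MaxRequestsPerChild are simply the pre-2.4 spellings of MaxRequestWorkers / MaxConnectionsPerChild):

    # /etc/apache2/mods-available/mpm_prefork.conf   (path assumed; adjust to your layout)
    <IfModule mpm_prefork_module>
        StartServers           20
        MinSpareServers         5
        MaxSpareServers        15
        MaxRequestWorkers     150    # "MaxClients" on Apache < 2.4
        MaxRequestsPerChild 10000    # "MaxConnectionsPerChild" on Apache >= 2.4
    </IfModule>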
12:32 miker that's the tail end of a search call, so the (presumably, apache backend) client went away while waiting on a response, during an active request, before getting the data from open-ils.search. I don't recall the exact ordering of the search steps in 2.12's mod_perl code, but I can't think of anything OTTOMH that it would do in parallel with the main search request, so I imagine it was just waiting on the data (up to 70s, IIRC)... maybe OOM killed it
12:33 Bmagic ps -ef | grep apache2 | wc -l - is reporting things approaching 150
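A slightly cleaner count, assuming the processes are named apache2 on this distribution (they may be httpd elsewhere), avoids matching the grep itself:

    pgrep -c apache2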
12:34 Bmagic if it hits 150, apache will kill them?
12:34 miker IOW, I doubt it's related to too few apache backends -- the apache process in question was actively trying to service an HTTP request.  I don't see why apache would kill it -- if it reached max backends, it would just refuse or stall new requests
12:37 Bmagic hmm memory usage looks ok. 16GB total with 3GB free but of course that is right now and not during the issue
12:38 miker `dmesg -T` will tell you if the OOM killer was at fault
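For example, something like the following surfaces OOM-killer activity with human-readable timestamps (the grep pattern is a loose sketch, not an exhaustive match):

    dmesg -T | grep -iE 'out of memory|oom-killer|killed process'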
12:44 pastebot "miker" at 64.57.241.14 pasted "Full trace" (89 lines) at http://paste.evergreen-ils.org/523
12:46 Bmagic miker: dmesg -T shows nothing with today's date
12:53 Bmagic so, it started at 11:14:43 and got the response at 11:15:54? (top two lines)
12:54 miker search call timed out on the client end (at 70s, the second log line), apache may have recycled the backend after it generated the "no hits" page, but then the request finally finished 11s later. it tried to send the result to the apache client, but it was gone by then so it got the not-connected response from ejabberd
12:55 Bmagic ah, it took too long. So do I tweak something?
12:58 Bmagic search got improved in 3.x? Postgres is 9.5 with default_statistics_target = 50, maintenance_work_mem = 1GB, constraint_exclusion = on, checkpoint_completion_target = 0.9, effective_cache_size = 160GB, work_mem = 224MB, wal_buffers = 8MB, shared_buffers = 8GB, max_connections = 550
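Laid out as a postgresql.conf excerpt, the values quoted above read as follows (anything not listed is assumed to be at its default):

    max_connections = 550
    shared_buffers = 8GB
    work_mem = 224MB
    maintenance_work_mem = 1GB
    wal_buffers = 8MB
    checkpoint_completion_target = 0.9
    constraint_exclusion = on
    effective_cache_size = 160GB
    default_statistics_target = 50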
12:59 Bmagic the DB machine has 480GB memory and the database on disk is 305GB
13:01 miker could just be a cold cache, could be tuning, hard to say...
13:01 Bmagic those numbers look sane to you?
13:03 miker they don't look crazy, certainly
13:03 Bmagic perhaps we can increase the 70s limit?
13:05 Bmagic I have another system with 256GB db on disk and only 100GB of memory. Running 3.0.2 with no NOT CONNECTED TO NETWORK errors. I am wondering if the 3.x code sped it up enough to stay under the 70s limit
13:09 miker well, the NOT CONNECTED won't happen every time
13:09 miker first of all
13:09 miker it only happens when the client goes away between the timeout and the eventual completion of the search
13:09 miker it's not something to worry about, generally, when a client disappears
13:09 Bmagic well alright then
13:11 miker unless that client was critical, of course. like, it was consuming responses and you need that to happen. this one, though, the end user already got a "no hits" page from the server
13:11 Bmagic I see
13:11 Bmagic I was contemplating a script that restarts the services automatically when NOT CONNECTED is grepped out of the log... but that is sounding like a bad idea
13:14 Bmagic miker++
13:15 Bmagic Thank you for lending your immense wisdom on A FRIGGIN Saturday. I really really appreciate it!
13:16 miker np
13:18 miker you do want to notice when an actual service backend (and, more critically, a listener) goes away and you see a not-connected message. but simply restarting the service may not be the right action. it can depend on why the process died
13:19 Bmagic right, I am not sure of the right search criteria for that. I was thinking grep "NOT CONNECTED" | grep -v "search"
13:20 Bmagic but it's more complicated. Like, I need to find the not-connected line, then grep for the threadID, grep again for the first line of that thread, parse the time, and figure out if it's the 70s issue
13:24 Bmagic or maybe just counting the number of occurrences and if it's more than, say 5, restart services
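A minimal sketch of that counting approach in shell, assuming the default OpenSRF log location and a hypothetical threshold; the restart itself is left commented out because, per the discussion above, a blind restart may well be the wrong response:

    #!/bin/bash
    # Hypothetical watchdog: count recent non-search "NOT CONNECTED" lines
    # and complain if there are more than a handful.
    LOG=/openils/var/log/osrfsys.log   # assumed default path
    THRESHOLD=5                        # arbitrary; tune to taste

    count=$(tail -n 10000 "$LOG" | grep 'NOT CONNECTED' | grep -vc 'search')

    if [ "$count" -gt "$THRESHOLD" ]; then
        echo "$(date): saw $count non-search NOT CONNECTED lines; investigate" >&2
        # osrf_control --localhost --restart-all   # only if you are sure a restart is warranted
    fi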
19:22 beanjammin joined #evergreen
