IRC log for #evergreen, 2019-03-18

All times shown according to the server's local time.

Time Nick Message
00:46 sandbergja joined #evergreen
04:37 jamesrf joined #evergreen
05:01 pinesol News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
07:00 JBoyer joined #evergreen
07:02 agoben joined #evergreen
07:10 rjackson_isl joined #evergreen
07:25 bdljohn joined #evergreen
08:03 _bott_ joined #evergreen
08:14 jamesrf joined #evergreen
08:25 bos20k joined #evergreen
08:37 terran joined #evergreen
08:43 mmorgan joined #evergreen
08:49 Dyrcona joined #evergreen
08:51 littlet joined #evergreen
09:19 yboston joined #evergreen
09:48 csharp we need to consider something to distribute the load of individual opensrf services - 3 of our 6 EG app servers are maxed out on open-ils.actor drones (and pushing pcrud limits) while the other 3 are underutilized
09:49 csharp so we're seeing backlogs on one server where another one is just sitting there
09:49 csharp this is the trade-off of moving to websocketd I guess
09:49 csharp with one we worry about spinning apache2 procs, and the other consumes our drones and DB connections
09:50 Dyrcona csharp: You can set up one central ejabberd that shares communications among all of your bricks.
09:51 Dyrcona As for consuming DB connections, you need to reconsider your postgres configuration in light of your typical load.
09:53 sandbergja joined #evergreen
09:53 Dyrcona You also might want to look at your Apache configuration because I've not seen many spinning Apache2 processes in a while.
09:54 csharp yeah - not seeing those since moving off apache2-websockets
09:55 Dyrcona Well, I was thinking of main Apache2 processes, but yeah... websocketd helps.
09:55 Dyrcona You might want to consider setting up 1 machine/vm to run ejabberd and the routers for your whole cluster.
09:56 Dyrcona You should then be able to run listeners on all of your other machines, configure them to use the master ejabberd, and your load will get spread out differently.
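
For anyone following along: the setup Dyrcona describes boils down to serving one shared pair of private/public XMPP domains from a single ejabberd host and pointing every brick at it. A minimal sketch, assuming ejabberd 16.x; the hostnames below are placeholders, not real PINES names:

    # /etc/ejabberd/ejabberd.yml on the shared XMPP host (illustrative domains)
    hosts:
      - "private.xmpp.example.org"
      - "public.xmpp.example.org"

Each brick's opensrf_core.xml would then reference those shared domains instead of its own per-brick ones, so listeners anywhere in the cluster can hand work to drones on any machine.
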
09:57 Dyrcona I also question how well our load balancer actually balances, because I often see a disparity in the use of our bricks, though it's often someone holding the enter key for too long.
09:58 bshum That or you'll see all the loads spike up on every server because the services are all being used :)
09:58 berick csharp: hm, I'm surprised websocketd would have any impact on that.  if the idle timeout is the same, then clients should connect and disconnect with the same general distribution.
09:59 berick csharp: i suspect your apache2-websockets idle timeout was lower than 5 minutes (which is your websocketd timeout IIRC)
09:59 berick i think the default for apache was maybe 2 minutes
09:59 Dyrcona berick: I think csharp was referring to the fact that apache2-websocket processes would go ballistic from time to time, not that it made a difference in drone usage.
09:59 Dyrcona At least, that's what I meant about websocketd making a difference.
10:04 csharp berick: so the idle timeout with websocketd is set in the nginx config, right? (as opposed to the apache config for apache2-websockets where we left the nginx timeouts higher)
10:04 berick csharp: correct
10:05 berick as a test, maybe knock it down to 1 minute
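
The idle timeout being discussed lives in the nginx stanza that proxies WebSocket traffic to websocketd. A minimal sketch, assuming a layout like Evergreen's example vhost; the location path and backend port are illustrative and may differ locally:

    # nginx vhost fragment in front of websocketd (hypothetical path/port)
    location /osrf-websocket-translator {
        proxy_pass https://127.0.0.1:7682;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 1m;    # the idle timeout in question; 5m in the shipped example
        proxy_send_timeout 1m;
    }
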
10:05 miker csharp: opensrf is already able to distribute load. you can create a single xmpp domain, as Dyrcona suggests, or just tell the services about all the existing domains (which requires they have non-localhost addresses and unique names)
10:05 csharp berick: thanks
10:05 * Dyrcona uses non-localhost addresses anyway, but doesn't do cross-brick communication, yet.
10:06 bos20k_ joined #evergreen
10:06 Dyrcona You do want non-routable addresses or good firewall rules. :)
10:06 csharp miker: Dyrcona: yeah - I want to experiment with that - I think we're going to have to since this keeps coming up (and yeah, we already use non-localhost names)
10:07 Dyrcona We used to do that at MVLC with the utility server sharing drones/listeners with the staff side.
10:08 Dyrcona It was two ejabberd servers talking to each other, IIRC.
10:13 JBoyer Also, how are you doing your load balancing? dedicated hardware, ldirectord, the-other-ones-whose-names-escape-me-currently?
10:24 collum joined #evergreen
10:28 mmorgan joined #evergreen
10:53 Christineb joined #evergreen
11:18 csharp JBoyer: ldirectord
11:19 csharp berick: reducing the idle timeout to 1m has helped, for sure.
11:26 berick csharp: glad to hear it
11:33 khuckins joined #evergreen
11:49 JBoyer And if you're not using the wlc load balancing method that may also help.
11:51 csharp scheduler=wlc - looks like we are
12:00 JBoyer I see. It sounds like the shorter timeout has helped, that would likely have a bigger impact than the lb I imagine.
12:03 csharp yeah - significantly calmer
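
For reference, wlc (weighted least-connection) is selected per virtual service in ldirectord.cf. A rough sketch with placeholder addresses, showing only the directives relevant here:

    # ldirectord.cf fragment (hypothetical addresses)
    virtual=203.0.113.10:443
            real=10.0.0.11:443 gate
            real=10.0.0.12:443 gate
            scheduler=wlc      # route new connections to the least-loaded real server
            protocol=tcp
            checktype=connect
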
12:07 jihpringle joined #evergreen
12:36 sandbergja joined #evergreen
14:40 yboston joined #evergreen
14:48 Christineb joined #evergreen
15:32 gmcharlt berick: miker: re bug 1820747, can you confirm that the code in question is indeed well and truly dead?
15:32 pinesol Launchpad bug 1820747 in Evergreen "dead code cleanup: remove open-ils.search.added_content.*" [Low,New] https://launchpad.net/bugs/1820747
15:33 gmcharlt an ex-handler, even? shuffled off the mortal coil?
15:35 Dyrcona "He's just pinin' for his 'fianced...."
15:40 miker gmcharlt: will look, sec
15:44 berick forgot all about that stuff
15:45 berick gmcharlt: i trust your research there.
15:48 miker gmcharlt: hrm. it looks to me like we still eval it into place if no added content provider is defined in opensrf.xml, no? I haven't looked past the named commit yet
15:49 miker gmcharlt: we don't in master... ok, it does look dead to me, indeed
15:51 miker looks like we can drop a test! :)
15:52 miker Open-ILS/src/perlmods/t/06-OpenILS-Application-Search.t specifically
15:52 miker or, one in there
15:53 bdljohn joined #evergreen
16:25 csharp ...and we're back to drone saturation
16:25 csharp might have to revert to apache2-websockets
16:25 csharp it's annoying to have to stomp out the spinning procs, but it's not causing this
16:26 csharp memcached connections are full, drone limits are full, higher-than-normal DB connections
16:26 Dyrcona csharp: Are you sure you're not just overly busy and need to add a brick or two? I don't see these problems with websocketd, but I'm also not using nginx as a proxy.
16:27 csharp never seen it like this
16:27 Dyrcona Load hit 127 on our main db server this morning with 100+ queries going at once. No phone calls from outside, and it looks like "normal" use, though someone probably dropped a book on an enter key or something.
16:27 csharp it's not like March 18th is expected to be a special day like the day after Memorial Day, etc.
16:28 Dyrcona I see crazy db load, but almost never drone saturation these days.
16:28 Dyrcona Sometimes, we're being attacked with attempted SQL injection, but I didn't see any signs of that earlier today.
16:29 csharp DB server's system load is fine, but it's at ~760 active procs
16:29 csharp limit is 1000
16:29 csharp brick load is high-ish, but not insane
16:30 Dyrcona I find the db server load goes up by 1 for each actively running query; note this is different from the number of connected procs.
16:31 Dyrcona csharp: What pg version, out of curiosity?
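
One way to see the distinction Dyrcona is drawing between connected backends and queries actually executing is pg_stat_activity (the state column exists on PostgreSQL 9.2 and later):

    -- counts per connection state; the 'active' rows are the queries driving load
    SELECT state, count(*)
      FROM pg_stat_activity
     GROUP BY state
     ORDER BY count(*) DESC;
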
16:31 berick csharp: wth, how many nginx procs and websocketd procs are alive?
16:32 csharp berick: 17 nginx procs per brick
16:32 csharp between 27 and 31 websocketd procs per brick
16:32 csharp Dyrcona: 9.5
16:33 Dyrcona csharp: Thanks. Same here.
16:35 berick hm, those aren't particularly high numbers
16:35 csharp yeah - looking for why it seems to be limited to 16
16:35 csharp 1 master proc and 16 workers
16:36 berick any other changes deployed with websocketd or was it done alone?
16:36 Dyrcona Well, I've got to head out. I'll check the logs this evening.
16:36 csharp berick: I upgraded to opensrf 3.1.0
16:36 csharp and then I upgraded to EG 3.2.3
16:36 csharp (from 3.2.2)
16:37 csharp all the same evening, but incrementally
16:37 berick ok, so websocketd being the problem is the theory, but it could be related to any of those changes?
16:37 berick you were on opensrf 3.0 before that?
16:37 csharp could be, I guess?
16:37 csharp yep, opensrf 3.0.2, I think
16:38 bshum How many cores does the server have?  (just read a thing that said nginx workers is like 1 per core)
16:39 csharp worth mentioning that we have not gotten complaints from the front end about this, but the logs tell a different story
16:39 csharp and nagios is blowing up my phone
16:39 csharp bshum: yep - 16 cores
16:40 csharp we can up the cores, but that seems like the wrong direction to go
16:40 bshum No, that shouldn't be the problem
16:40 bshum I was just trying to explain why you were capped :)
16:40 bshum Err solution/problem?
16:40 bshum Haha
16:40 csharp bshum: thanks for looking into that - it's not specified in the config as far as I can tell
16:40 berick csharp: something else that may require investigation is bug #1729610
16:40 pinesol Launchpad bug 1729610 in OpenSRF "allow requests to be queued if max_children limit is hit" [Wishlist,Fix released] https://launchpad.net/bugs/1729610
16:41 csharp berick: ah - that's interesting - I did see the queue in the logs
16:41 csharp it mentions the backlog
16:41 csharp might be why we're not getting complaints
16:42 berick yeah, it could be helping in this situation for sure.
16:42 csharp 2019-03-18 16:41:53 brick03-head open-ils.actor: [WARN:26954:Server.pm:212:] server: backlog queue size is now 35
16:42 berick but i mention it only because it's a change down in the osrf layers that could affect this territory
16:42 jeff default for nginx worker_connections is 512. it's not like Apache with mpm_prefork...
16:42 csharp yeah
16:45 jeff on at least Debian 9 (stretch), worker_connections is 768.
16:46 csharp yeah - it's 768 here
16:47 csharp but I now wonder whether that means you need 768 cores to get that high :-)
16:47 jeff Negative.
16:48 bshum It's like each process has that many max connections
16:48 csharp https://pastebin.com/TF2JgLPh is our config (should be stock ubuntu 16.04)
16:48 jeff That's what I'm trying to say: unlike Apache with mpm_prefork, nginx is closer to mpm_event. There isn't a 1:1 client:worker relationship.
16:48 bshum So 16x768
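
The numbers above correspond to two settings in the packaged nginx.conf. Upstream's default worker_connections is 512, as jeff notes; the Debian/Ubuntu config raises it, and worker_processes auto gives one worker per core (the values shown are the stock Ubuntu 16.04 defaults, so check the local file):

    # /etc/nginx/nginx.conf (stock Debian/Ubuntu defaults)
    worker_processes auto;        # one worker per CPU core -> 16 workers on a 16-core brick

    events {
        worker_connections 768;   # per-worker cap -> roughly 16 x 768 concurrent connections
    }
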
16:48 berick csharp: if you feel like experimenting, you could try setting max_backlog_queue (under <unix_config>) to some very low number (say, 1) in opensrf.xml for busy services.    the goal being to see if you can rule that out as a potential issue.
16:48 csharp jeff: gotcha
16:48 csharp berick: I'll try that
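
A minimal sketch of that experiment in opensrf.xml; the surrounding <unix_config> values are placeholders, and the queue falls back to a default of 1000 when the element is absent:

    <!-- opensrf.xml fragment for one busy service (illustrative values) -->
    <open-ils.actor>
      <unix_config>
        <max_requests>1000</max_requests>
        <min_children>15</min_children>
        <max_children>120</max_children>
        <max_backlog_queue>1</max_backlog_queue>  <!-- test value; effectively disables queueing -->
      </unix_config>
    </open-ils.actor>
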
16:50 jeff The fact that you have "so few" nginx processes is probably not a problem. I could be wrong, though! ;-)
16:50 berick in addition to what jeff and bshum said, websocketd /forks/ for each connection, so the number of websocketd procs equals the number of connected websocket clients.  the other nginx threads/procs are just proxying stuff.
16:51 berick html, js, etc.
16:51 csharp good to know that
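
Since websocketd forks per client, a quick way to check how many WebSocket clients a brick is holding open is simply to count its processes (the count includes the parent listener, so subtract one):

    # rough per-brick client count; -f matches the full command line
    pgrep -c -f websocketd
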
16:51 bshum Admittedly I haven't run into the backlog issue, but doesn't that just sound like the max children is getting hit somewhere and it might be time to check all the configured opensrf service values?
16:52 bshum I presume that the web client uses slightly different resources than XUL did
16:53 berick bshum: i'm just wondering since it's also very new code if there could be bugs lurking.  worst case would be a feedback loop where requests get duplicated or something.
16:53 berick it's a change to the Sacred Innards
16:53 bshum Right.
16:55 * csharp notes that the example opensrf.xml in master doesn't include max_backlog_queue
16:55 sandbergja_ joined #evergreen
16:55 csharp (Evergreen master)
16:56 csharp OpenSRF 3.1.0 *does* have that setting
16:56 berick yeah, it defaults to 1000
16:56 jeff Am now imagining a partially-carved pumpkin on the table in front of us, Sacred Innards strewn about.
16:56 bshum oops, loose end
16:57 csharp @band add Sacred Innards
16:57 pinesol csharp: Band 'Sacred Innards' added to list
16:58 csharp sounds more like the album name than the band name though
17:00 bshum jeff++ berick++ csharp++ # I miss production issues (not really)
17:00 csharp it's a mixed bag - the soul-crushing lows are often followed by dizzying highs
17:01 pinesol News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
17:01 * bshum now wants pumpkin pie
17:01 bshum It's too early for this.
17:02 * bshum wanders off, but will read the rest of this adventure story later
17:05 mmorgan left #evergreen
17:22 yboston joined #evergreen
17:23 csharp ok, I've set the max_backlog_queue to 1 for open-ils.actor - started on one brick and it didn't explode, so activating the change on the others
17:23 csharp I can see in the logs that it's defaulting to 1000 for services without a setting specified
17:23 berick csharp++
17:25 csharp Backlog queue full for open-ils.actor; forced to drop message 0.59448590011106451552944256474 from opensrf@public.brick03-head.gapines.org/_brick03-head_1552944269.425670_13415
17:26 berick ok, that's expected (assuming load)
17:26 csharp I'm thinking now that queuing is helping more than hurting, but we'll see
17:26 berick it very well could be
17:27 berick actor queue filled up pretty fast
17:27 csharp on that one brick - yeah
17:27 csharp it's maxed, but others are underutilized
17:29 berick ok, so the symptoms of stuff getting clobbered are essentially the same?
17:29 berick only now they're dropped instead of queued?
17:30 csharp yeah
17:30 csharp so far anyway
17:30 csharp things have calmed in general
17:30 csharp I guess that was kids getting out of school and hitting the library
17:32 berick ok, if it's acting the same, then likely the queueing is not the source of the problem (or setting a low value is not bypassing a bug)
17:32 csharp agreed
17:33 berick csharp: so your load balancer setup, does it distribute in a loop (round-robin) or stick to the most recently used?
17:33 csharp it should be weighted for the server with the least connections
17:33 berick gotcha
17:38 berick csharp: do most/all of the bricks experience the issue or is it sticky?
17:40 csharp most/all
17:40 * berick nods
17:41 csharp this feels like something that could be fixed with tuning configs though
17:42 csharp at this very moment, everything is pretty balanced across the bricks (after restarting opensrf to reactivate backlog queuing)
17:42 csharp also just generally calmer
17:43 berick yeah, probably over the hump
17:44 berick well, keep us posted, especially if you decide to revert websocketd to test.
17:44 csharp thanks!
17:44 berick guessing tests now won't prove much
17:45 csharp agreed
19:33 Dyrcona joined #evergreen
20:26 gryffonophenomen joined #evergreen
20:54 sandbergja joined #evergreen
23:29 jamesrf joined #evergreen
