00:46 -- sandbergja joined #evergreen
04:37 -- jamesrf joined #evergreen
05:01 <pinesol> News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
07:00 -- JBoyer joined #evergreen
07:02 -- agoben joined #evergreen
07:10 -- rjackson_isl joined #evergreen
07:25 -- bdljohn joined #evergreen
08:03 -- _bott_ joined #evergreen
08:14 -- jamesrf joined #evergreen
08:25 -- bos20k joined #evergreen
08:37 -- terran joined #evergreen
08:43 -- mmorgan joined #evergreen
08:49 -- Dyrcona joined #evergreen
08:51 -- littlet joined #evergreen
09:19 -- yboston joined #evergreen
09:48 <csharp> we need to consider something to distribute the load of individual opensrf services - 3 of our 6 EG app servers are maxed out on open-ils.actor drones (and pushing pcrud limits) while the other 3 are underutilized
09:49 <csharp> so we're seeing backlogs on one server while another one is just sitting there
09:49 <csharp> this is the trade-off of moving to websocketd I guess
09:49 <csharp> for one we worry about spinning apache2 procs, and the other consumes our drones and DB connections
09:50 <Dyrcona> csharp: You can set up one central ejabberd that shares communications among all of your bricks.
09:51 <Dyrcona> As for consuming DB connections, you need to reconsider your postgres configuration in light of your typical load.
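Dyrcona's point about sizing Postgres for the actual connection load mostly comes down to a few postgresql.conf knobs. A minimal sketch follows; the values are placeholders, not recommendations, since the right numbers depend on RAM, core count, and how many drones across all bricks can hold connections at once. A connection pooler such as PgBouncer is the other common lever once drone counts approach max_connections.

    # postgresql.conf -- illustrative excerpt; all values are placeholders
    max_connections = 1000          # ceiling on concurrent backends (drones, cron jobs, etc.)
    shared_buffers = 16GB           # often sized around a quarter of RAM
    work_mem = 64MB                 # per sort/hash node, multiplied across concurrent queries
    effective_cache_size = 48GB     # planner hint, roughly the size of the OS page cache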
09:53 -- sandbergja joined #evergreen
09:53 <Dyrcona> You also might want to look at your Apache configuration because I've not seen many spinning Apache2 processes in a while.
09:54 <csharp> yeah - not seeing those since moving off apache2-websockets
09:55 <Dyrcona> Well, I was thinking of main Apache2 processes, but yeah... websocketd helps.
09:55 <Dyrcona> You might want to consider setting up 1 machine/vm to run ejabberd and the routers for your whole cluster.
09:56 <Dyrcona> You should then be able to run listeners on all of your other machines, configure them to use the master ejabberd, and your load will get spread out differently.
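Roughly, the layout Dyrcona describes is one machine running ejabberd (and the OpenSRF routers) for the whole cluster, with every brick's listeners logging in to it. A sketch of the central piece, assuming a modern ejabberd with YAML config; the domain names are placeholders:

    # /etc/ejabberd/ejabberd.yml on the single shared XMPP host (sketch)
    # (older installs use the Erlang-style ejabberd.cfg instead)
    hosts:
      - "private.cluster.example.org"
      - "public.cluster.example.org"

Each brick's opensrf_core.xml would then use these shared domains in place of private.localhost/public.localhost, with DNS or /etc/hosts resolving them to the central machine, so all listeners register against the same routers.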
09:57 <Dyrcona> I also question how well our load balancer actually balances, because I often see a disparity in the use of our bricks, though it's often someone holding the enter key for too long.
09:58 <bshum> That or you'll see all the loads spike up on every server because the services are all being used :)
09:58 <berick> csharp: hm, I'm surprised websocketd would have any impact on that. if the idle timeout is the same, then clients should connect and disconnect with the same general distribution.
09:59 <berick> csharp: i suspect your apache2-websockets idle timeout was lower than 5 minutes (which is your websocketd timeout IIRC)
09:59 <berick> i think the default for apache was maybe 2 minutes
09:59 <Dyrcona> berick: I think csharp was referring to the fact that apache2-websocket processes would go ballistic from time to time, not that it made a difference in drone usage.
09:59 <Dyrcona> At least, that's what I meant about websocketd making a difference.
10:04 <csharp> berick: so the idle timeout with websocketd is set in the nginx config, right? (as opposed to the apache config for apache2-websockets where we left the nginx timeouts higher)
10:04 <berick> csharp: correct
10:05 <berick> as a test, maybe knock it down to 1 minute
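The timeout being discussed lives in the nginx proxy block that fronts websocketd. This is only a sketch: the location path and backend port follow the stock OpenSRF/Evergreen nginx example as best I recall (verify against your own site config); the timeout directives are standard nginx.

    # nginx site config (sketch) -- websocket proxy in front of websocketd
    location /osrf-websocket-translator {
        proxy_pass http://127.0.0.1:7682;        # local websocketd listener; scheme/port are assumptions
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 1m;                   # idle clients dropped after 1 minute (was 5m)
        proxy_send_timeout 1m;
    }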
10:05 <miker> csharp: opensrf is already able to distribute load. you can create a single xmpp domain, as Dyrcona suggests, or just tell the services about all the existing domains (which requires they have non-localhost addresses and unique names)
10:05 <csharp> berick: thanks
10:05 * Dyrcona uses non-localhost addresses anyway, but doesn't do cross-brick communication, yet.
10:06 -- bos20k_ joined #evergreen
10:06 <Dyrcona> You do want non-routable addresses or good firewall rules. :)
10:06 <csharp> miker: Dyrcona: yeah - I want to experiment with that - I think we're going to have to since this keeps coming up (and yeah, we already use non-localhost names)
10:07 <Dyrcona> We used to do that at MVLC with the utility server sharing drones/listeners with the staff side.
10:08 <Dyrcona> It was two ejabberd servers talking to each other, IIRC.
10:13 <JBoyer> Also, how are you doing your load balancing? dedicated hardware, ldirectord, the-other-ones-whose-names-escape-me-currently?
10:24 -- collum joined #evergreen
10:28 -- mmorgan joined #evergreen
10:53 -- Christineb joined #evergreen
11:18 <csharp> JBoyer: ldirectord
11:19 <csharp> berick: reducing the idle timeout to 1m has helped, for sure.
11:26 <berick> csharp: glad to hear it
11:33 -- khuckins joined #evergreen
11:49 <JBoyer> And if you're not using the wlc load balancing method, that may also help.
11:51 <csharp> scheduler=wlc - looks like we are
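The scheduler=wlc that csharp quotes comes from ldirectord's config. A rough sketch of that stanza is below; the addresses and health-check options are placeholders, but scheduler=wlc is the weighted least-connection method, which matches csharp's later description of new sessions going to the brick with the fewest active connections.

    # /etc/ha.d/ldirectord.cf (sketch; IPs and checks are placeholders)
    virtual=203.0.113.10:443
            real=10.0.0.11:443 gate
            real=10.0.0.12:443 gate
            real=10.0.0.13:443 gate
            scheduler=wlc        # weighted least-connection
            protocol=tcp
            checktype=connect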
12:00 <JBoyer> I see. It sounds like the shorter timeout has helped; that would likely have a bigger impact than the lb, I imagine.
12:03 <csharp> yeah - significantly calmer
12:07 -- jihpringle joined #evergreen
12:36 -- sandbergja joined #evergreen
14:40 -- yboston joined #evergreen
14:48 -- Christineb joined #evergreen
15:32 <gmcharlt> berick: miker: re bug 1820747, can you confirm that the code in question is indeed well and truly dead?
15:32 <pinesol> Launchpad bug 1820747 in Evergreen "dead code cleanup: remove open-ils.search.added_content.*" [Low,New] https://launchpad.net/bugs/1820747
15:33 <gmcharlt> an ex-handler, even? shuffled off the mortal coil?
15:35 <Dyrcona> "He's just pinin' for his 'fianced...."
15:40 <miker> gmcharlt: will look, sec
15:44 <berick> forgot all about that stuff
15:45 <berick> gmcharlt: i trust your research there.
15:48 <miker> gmcharlt: hrm. it looks to me like we still eval it into place if no added content provider is defined in opensrf.xml, no? I haven't looked past the named commit yet
15:49 <miker> gmcharlt: we don't in master... ok, it does look dead to me, indeed
15:51 <miker> looks like we can drop a test! :)
15:52 <miker> Open-ILS/src/perlmods/t/06-OpenILS-Application-Search.t specifically
15:52 <miker> or, one in there
15:53 -- bdljohn joined #evergreen
16:25 <csharp> ...and we're back to drone saturation
16:25 <csharp> might have to revert to apache2-websockets
16:25 <csharp> it's annoying to have to stomp out the spinning procs, but it's not causing this
16:26 <csharp> memcached connections are full, drone limits are full, higher-than-normal DB connections
16:26 <Dyrcona> csharp: Are you sure that you're just not overly busy and need to add a brick or two? I don't see these problems with websocketd, but I'm also not using nginx as a proxy.
16:27 <csharp> never seen it like this
16:27 <Dyrcona> Load hit 127 on our main db server this morning with 100+ queries going at once. No phone calls from outside, and it looks like "normal" use, though someone probably dropped a book on an enter key or something.
16:27 <csharp> it's not like March 18th is expected to be a special day like the day after Memorial Day, etc.
16:28 <Dyrcona> I see crazy db load, but almost never drone saturation these days.
16:28 <Dyrcona> Sometimes, we're being attacked with attempted SQL injection, but I didn't see any signs of that earlier today.
16:29 <csharp> DB server's system load is fine, but it's at ~760 active procs
16:29 <csharp> limit is 1000
16:29 <csharp> brick load is high-ish, but not insane
16:30 <Dyrcona> I find the db server load goes up 1 for each actively running query; note this is different from the number of connected procs.
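The distinction Dyrcona draws here (connected backends versus queries actually running) can be checked directly in Postgres. A minimal sketch, assuming PostgreSQL 9.4+ for the FILTER clause (both sites in this conversation are on 9.5):

    -- connected backends vs. actively running queries
    SELECT count(*)                                 AS connected,
           count(*) FILTER (WHERE state = 'active') AS active
    FROM pg_stat_activity;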
16:31 <Dyrcona> csharp: What pg version, out of curiosity?
16:31 <berick> csharp: wth, how many nginx procs and websocketd procs are alive?
16:32 <csharp> berick: 17 nginx procs per brick
16:32 <csharp> between 27 and 31 websocketd procs per brick
16:32 <csharp> Dyrcona: 9.5
16:33 <Dyrcona> csharp: Thanks. Same here.
16:35 <berick> hm, those aren't particularly high numbers
16:35 <csharp> yeah - looking for why it seems to be limited to 16
16:35 <csharp> 1 master proc and 16 workers
16:36 <berick> any other changes deployed with websocketd or was it done alone?
16:36 <Dyrcona> Well, I've got to head out. I'll check the logs this evening.
16:36 <csharp> berick: I upgraded to opensrf 3.1.0
16:36 <csharp> and then I upgrade to EG 3.2.3
16:36 <csharp> (from 3.2.2)
16:36 <csharp> *upgraded
16:37 <csharp> all the same evening, but incrementally
16:37 <berick> ok, so websocketd being the problem is the theory, but it could be related to any of those changes?
16:37 <berick> you were on opensrf 3.0 before that?
16:37 <csharp> could be, I guess?
16:37 <csharp> yep, opensrf 3.0.2, I think
16:38 <bshum> How many cores does the server have? (just read a thing that said nginx workers is like 1 per core)
16:39 <csharp> worth mentioning that we have not gotten complaints from the front end about this, but the logs tell a different story
16:39 <csharp> and nagios is blowing up my phone
16:39 <csharp> bshum: yep - 16 cores
16:40 <csharp> we can up the cores, but that seems like the wrong direction to go
16:40 <bshum> No, that shouldn't be the problem
16:40 <bshum> I was just trying to explain why you were capped :)
16:40 <bshum> Err solution/problem?
16:40 <bshum> Haha
16:40 <csharp> bshum: thanks for looking into that - it's not specified in the config as far as I can tell
16:40 <berick> csharp: something else that may require investigation is bug #1729610
16:40 <pinesol> Launchpad bug 1729610 in OpenSRF "allow requests to be queued if max_children limit is hit" [Wishlist,Fix released] https://launchpad.net/bugs/1729610
16:41 <csharp> berick: ah - that's interesting - I did see the queue in the logs
16:41 <csharp> it mentions the backlog
16:41 <csharp> might be why we're not getting complaints
16:42 <berick> yeah, it could be helping in this situation for sure.
16:42 <csharp> 2019-03-18 16:41:53 brick03-head open-ils.actor: [WARN:26954:Server.pm:212:] server: backlog queue size is now 35
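A quick way to keep an eye on those warnings while tuning, sketched with a placeholder path since the log location depends on where your syslog setup routes OpenSRF output:

    # tail and count the backlog warnings for a service (log path is site-specific)
    grep 'backlog queue size' /path/to/osrf-warn.log | tail -n 20
    grep -c 'Backlog queue full' /path/to/osrf-warn.log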
16:42 <berick> but i mention it only because it's a change down in the osrf layers that could affect this territory
16:42 <jeff> default for nginx worker_connections is 512. it's not like Apache with mpm_prefork...
16:42 <csharp> yeah
16:45 <jeff> on at least Debian 9 (stretch), worker_connections is 768.
16:46 <csharp> yeah - it's 768 here
16:47 <csharp> but I now wonder whether that means you need 768 cores to get that high :-)
16:47 <jeff> Negative.
16:48 <bshum> It's like each process has that many max connections
16:48 <csharp> https://pastebin.com/TF2JgLPh is our config (should be stock ubuntu 16.04)
16:48 <jeff> That's what I'm trying to say, is that unlike Apache with mpm_prefork, nginx is closer to mpm_event. There isn't a 1:1 client:worker relationship.
16:48 <bshum> So 16x768
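The two directives jeff and bshum are multiplying live in the main nginx.conf. A sketch of the stock Debian/Ubuntu defaults under discussion (values as reported in the conversation; check your own file):

    # nginx.conf (stock Debian/Ubuntu layout, roughly)
    worker_processes auto;          # one worker per core -> 16 workers on a 16-core box
    events {
        worker_connections 768;     # per worker, so roughly 16 x 768 concurrent connections
    }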
16:48 <berick> csharp: if you feel like experimenting, you could try setting max_backlog_queue (under <unix_config>) to some very low number (say, 1) in opensrf.xml for busy services. the goal being to see if you can rule that out as a potential issue.
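In opensrf.xml terms, berick's experiment amounts to a per-service stanza like the sketch below. Only max_backlog_queue is the knob under discussion; the surrounding child-process values are placeholders of the kind a typical config already carries, and, as noted later in the log, the queue defaults to 1000 when the element is absent.

    <!-- opensrf.xml sketch: service entry under <apps> -->
    <open-ils.actor>
      <unix_config>
        <max_requests>1000</max_requests>
        <min_children>10</min_children>
        <max_children>120</max_children>
        <max_backlog_queue>1</max_backlog_queue> <!-- experiment; defaults to 1000 when unset -->
      </unix_config>
    </open-ils.actor>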
16:48 <csharp> jeff: gotcha
16:48 <csharp> berick: I'll try that
16:50 <jeff> The fact that you have "so few" nginx processes is probably not a problem. I could be wrong, though! ;-)
16:50 <berick> in addition to what jeff and bshum said, websocketd /forks/ each connection, so the number of websocketd procs equals the number of connected websocket clients. the other nginx threads/procs are just proxying stuff.
16:51 <berick> html, js, etc.
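That fork-per-connection behavior is why the websocketd process count is a rough proxy for concurrent websocket clients on a brick, while the nginx worker count stays fixed regardless of load. A trivial check:

    # rough head counts on a brick; websocketd ~= connected websocket clients
    pgrep -c websocketd
    pgrep -c nginx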
16:51 <csharp> good to know that
16:51 <bshum> Admittedly I haven't run into the backlog issue, but doesn't that just sound like the max children is getting hit somewhere and it might be time to check all the configured opensrf service values?
16:52 <bshum> I presume that the web client uses slightly different resources than XUL did
16:53 <berick> bshum: i'm just wondering since it's also very new code if there could be bugs lurking. worst case would be a feedback loop where requests get duplicated or something.
16:53 <berick> it's a change to the Sacred Innards
16:53 <bshum> Right.
16:55 * csharp notes that the example opensrf.xml in master doesn't include max_backlog_queue
16:55 -- sandbergja_ joined #evergreen
16:55 <csharp> (Evergreen master)
16:56 <csharp> OpenSRF 3.1.0 *does* have that setting
16:56 <berick> yeah, it defaults to 1000
16:56 <jeff> Am now imagining a partially-carved pumpkin on the table in front of us, Sacred Innards strewn about.
16:56 <bshum> oops, loose end
16:57 <csharp> @band add Sacred Innards
16:57 <pinesol> csharp: Band 'Sacred Innards' added to list
16:58 <csharp> sounds more like the album name than the band name though
17:00 <bshum> jeff++ berick++ csharp++ # I miss production issues (not really)
17:00 <csharp> it's a mixed bag - the soul-crushing lows are often followed by dizzying highs
17:01 <pinesol> News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
17:01 * bshum now wants pumpkin pie
17:01 <bshum> It's too early for this.
17:02 * bshum wanders off, but will read the rest of this adventure story later
17:05 -- mmorgan left #evergreen
17:22 -- yboston joined #evergreen
17:23 <csharp> ok, I've set the max_backlog_queue to 1 for open-ils.actor - started on one brick and it didn't explode, so activating the change on the others
17:23 <csharp> I can see in the logs that it's defaulting to 1000 for services without a setting specified
17:23 <berick> csharp++
17:25 <csharp> Backlog queue full for open-ils.actor; forced to drop message 0.59448590011106451552944256474 from opensrfpublic.brick03-head.gapines.org/_brick03-head_1552944269.425670_13415
17:26 <berick> ok, that's expected (assuming load)
17:26 <csharp> I'm thinking now that queuing is helping more than hurting, but we'll see
17:26 <berick> it very well could be
17:27 <berick> actor queue filled up pretty fast
17:27 <csharp> on that one brick - yeah
17:27 <csharp> it's maxed, but others are underutilized
17:29 <berick> ok, so the symptoms of stuff getting clobbered are essentially the same?
17:29 <berick> only now they're dropped instead of queued?
17:30 <csharp> yeah
17:30 <csharp> so far anyway
17:30 <csharp> things have calmed in general
17:30 <csharp> I guess that was kids getting out of school and hitting the library
17:32 <berick> ok, if it's acting the same, then likely the queueing is not the source of the problem (or setting a low value is not bypassing a bug)
17:32 <csharp> agreed
17:33 <berick> csharp: so your load balance setup, does it distribute in a loop or stick to the most recently used?
17:33 <csharp> it should be weighted for the server with the least connections
17:33 <berick> gotcha
17:38 <berick> csharp: do most/all of the bricks experience the issue or is it sticky?
17:40 <csharp> most/all
17:40 * berick nods
17:41 <csharp> this feels like something that could be fixed with tuning configs though
17:42 <csharp> at this very moment, everything is pretty balanced across (after restarting opensrf to reactivate backlog queuing)
17:42 <csharp> also just generally calmer
17:43 <berick> yeah, probably over the hump
17:44 <berick> well, keep us posted, especially if you decide to revert websocketd to test.
17:44 <csharp> thanks!
17:44 <berick> guessing tests now won't prove much
17:45 <csharp> agreed
19:33 -- Dyrcona joined #evergreen
20:26 -- gryffonophenomen joined #evergreen
20:54 -- sandbergja joined #evergreen
23:29 -- jamesrf joined #evergreen