Time | Nick | Message
00:49 | | bshum joined #evergreen
05:23 | | Keith__isl joined #evergreen
06:00 | pinesol | News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
07:21 | | rjackson_isl_hom joined #evergreen
07:56 | | Keith_isl joined #evergreen
08:25 | | mantis1 joined #evergreen
08:41 | | mmorgan joined #evergreen
08:42 | csharp_ | there's nothing consistent in the logs that would explain why open-ils.actor keeps dropping its connection to the network
08:43 | csharp_ | I'm looking for frequency of calls (varies), types of calls (vary), anything going on at the OS level (nothing special, no duress), segfault-type stuff (none so far)
08:44 | csharp_ | and ejabberd would probably be emitting more than "Connection failed: timeout" if there was a serious resource problem, I would think?
08:45 | csharp_ | at this point I'm willing to conclude that there's a bug in OpenSRF (possibly upstream in perl or C or something)
08:46 | csharp_ | so I guess it's time to start adding debug logging to the perl stuff?
08:46 | csharp_ | (I hope miker is watching since this is one of his areas of expertise)
09:23 | | rfrasur joined #evergreen
09:36 | | collum joined #evergreen
09:53 | miker | why are my ears burning...
09:56 |
miker |
csharp_: since it's specific to one service, I think it's more likely how that service is used / the shape of calls to it. to clarify, symptomatic result of whatever is happening is that the listener ends up being deregistered from the router, is that right? as in, if you use srfsh to ask for the list of services at the router, you see actor is no listed |
10:02 | miker | you can use this command in srfsh to do that, btw, as long as your router's ejabberd username is exactly "router": request router opensrf.router.info.stats.class.all
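A minimal sketch of that check as it might look at a srfsh prompt; only the request line comes from miker's message above, and the prompt text and surrounding setup are assumptions:

    srfsh# request router opensrf.router.info.stats.class.all

If open-ils.actor is missing from the class list that comes back, the listener has been deregistered from the router, which is the symptom being described here.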
10:07 | csharp_ | miker: yes, that's what we're seeing
10:07 | miker | csharp_: so, the absolute most common cause of that is having a c2s shaper installed in ejabberd. the way shapers work, a single very large stanza (say, huge template as parameter) can cause a pause -- ejabberd will deliver the payload and then wait for the amount of time it would have taken if the bytes/sec were strictly honored -- and the router sees that as a >1s delay from the listener, at which point it drops the listener from the list for the class
10:09 | miker | if you have entirely standalone bricks, with no cross-registration, this presents as an immediate listener failure. if you're cross-registering, you have more buffer because the other actor instances can take up the slack while you detect that situation and tell the service to rereg (or restart it)
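To make the shaper mechanism concrete with purely hypothetical numbers (neither figure comes from this conversation): if the c2s shaper were capped at 50 KB/s and a single call carried a 500 KB template as a parameter, the enforced pause works out to

    500 KB / 50 KB per second = 10 s of throttling delay

which is far past the >1 s cutoff at which the router drops the listener, as described above.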
10:11 | csharp_ | I do have c2s_shaper: none in all my ejabberd configs as recommended in bug 1941653
10:11 | pinesol | Launchpad bug 1941653 in OpenSRF "Update recommended shaper settings" [Medium,New] https://launchpad.net/bugs/1941653
10:11 | csharp_ | no cross registration
10:13 | miker | what version of ejabberd are you using?
10:13 | csharp_ | 18.01
10:14 | miker | ok, I think it's spelled differently (version differences) based on the docs. you have an ejabberd_c2s section, right?
10:15 | miker | specifically, I believe it should be "shaper: none" rather than "c2s_shaper: none"
10:16 | miker | can you share a paste of the relevant listen: block?
10:16 | csharp_ | yes, one sec
10:17 | miker | (and thanks a tonne, ejabberd, for changing the spelling and format of your config options all the time)
10:17 | csharp_ | https://pastebin.com/DYNVG8Pi
10:18 | csharp_ | seriously
10:21 | miker | yeah, instead of naming a shaper, in the listen: block just say "shaper: none"
10:21 | csharp_ | ok - just a sec
10:23 | miker | or, just remove that entirely. fwiw, because "none" is the default we just don't have a shaper: line
10:23 | miker | side note: because max-concurrent-sessions defaults to Inf, we don't specify that either
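For orientation, a minimal sketch of what the ejabberd_c2s listener entry in ejabberd.yml might look like with the shaper disabled as miker suggests; the port and ip values are generic examples, not taken from csharp_'s paste:

    listen:
      -
        port: 5222
        ip: "::"
        module: ejabberd_c2s
        shaper: none

Per miker's note, omitting the shaper: line entirely should behave the same way, since "none" is the default.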
10:24 | csharp_ | given my pain of the last few days, gonna stick with explicit over implicit :-)
10:26 | miker | fair. I'm looking at https://docs.ejabberd.im/admin/configuration/listen-options/#shaper for this, btw
10:29 | csharp_ | ok, change is applied, now to watch
10:29 | miker | ah, one other thing that might be important in a high-traffic env (though it should /not/ be something that only one service suffers from) is the max_fsm_queue setting https://docs.ejabberd.im/admin/configuration/listen-options/#max-fsm-queue
10:29 | miker | we've doubled that to 20000
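Continuing the listener sketch above (still illustrative, not csharp_'s actual config), max_fsm_queue is a listen option, so raising it as miker describes might look like:

    # added inside the same ejabberd_c2s listen entry shown earlier
    max_fsm_queue: 20000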
10:35 | csharp_ | still busted :-(
10:40 | csharp_ | I removed the backlog designation as well and restarted ejabberd - still watching - I'll look at the fsm queue stuff next
10:49 | miker | csharp_: here's an example of our ENTIRE .yml, sans hostname, IP and auth setup: https://pastebin.com/gJ1pGwVE
10:50 | csharp_ | miker: the main thing I'm having trouble with from the troubleshooting perspective is the lack of useful logs - we're planning to turn on debug-level opensrf logs for a while to see if we can capture things, but I'm surprised ejabberd isn't providing better info
10:50 | csharp_ | fsm increase in place, but the issue is still happening btw
10:51 | miker | oh, ejabberd logs are terrible
10:53 | miker | are you seeing this in the logs: We lost the last node in the class, responding with error and removing...
10:53 | csharp_ | yes
10:53 | miker | it's WARN level
10:53 | csharp_ | constantly
10:53 | miker | k
10:54 | csharp_ | 30 times in the last hour across 6 servers
10:54 | csharp_ | sorry - all day
10:54 | csharp_ | since 7:48 a.m.
10:54 | csharp_ | actually, not that far off though - 28 in the last hour 10 mins
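A quick way to do that kind of counting, sketched with an assumed syslog destination; the actual log path and rotation scheme on these servers are not shown in this conversation:

    # count occurrences of the router WARN on one brick (path is hypothetical)
    grep -c 'We lost the last node in the class' /var/log/opensrf/osrfsys.log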
10:57 | miker | always for actor?
10:57 | csharp_ | yes
10:58 | csharp_ | this feels like a threshold issue since actor is by far the most called service
11:00 | csharp_ | I realized I had not applied max_fsm_queue in the "right" place in the file, so my change was getting overwritten by the default (assuming the lower down value in the file is the last read)
11:00 | csharp_ | so applying that for real now
11:00 | | jvwoolf joined #evergreen
11:01 | miker | and you applied all three of the recent patches we talked about earlier in the week?
11:01 | | jvwoolf1 joined #evergreen
11:02 | csharp_ | yes, that's right
11:08 | miker | do you see much in the logs about the backlog queue? "server: backlog queue size is now" etc
11:12 | csharp_ | no - it doesn't look like the number of active children corresponds with the drops at all
11:13 | miker | k, that's what I expected, but to be sure there's not some new source of pauses in ejabberd communication...
11:16 | miker | do you see log lines with: server: restarting after fatal crash...
11:17 | miker | preceded by "server: died with error " and an error message?
11:19 | csharp_ | I see 'died with error' but not the restarting message
11:20 | csharp_ | just cranked up logging to debug everywhere in opensrf_core.xml to see more into what's going on
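For anyone following along, OpenSRF's log verbosity is set per section in opensrf_core.xml; a minimal sketch of the change being described, assuming syslog-based logging and OpenSRF's numeric levels (4 = DEBUG, 3 = INFO):

    <!-- illustrative fragment, not csharp_'s actual opensrf_core.xml -->
    <logfile>syslog</logfile>
    <syslog>local6</syslog>
    <loglevel>4</loglevel>  <!-- was 3 (INFO); 4 enables DEBUG output -->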
11:22 | miker | thoughts and prayers, rsyslogd
11:23 | csharp_ | heh osrfsys.11.log just hit 4.3 GB
11:23 | csharp_ | miker: pray for grep
11:23 | csharp_ | of course, it's calm right this second :-
11:23 | csharp_ | :-?
11:28 | csharp_ | what the hell? it's quiet now
11:29 | berick | could be one of those bugs where it's affected by the log level.
11:29 | berick | the logging bogs everything down, reducing overall throughput
11:29 | csharp_ | huh
11:29 | * csharp_ | pipes all logs to /dev/null and calls it a day
11:39 | csharp_ | berick: apparently you're correct - when I disabled debug logs, I see the problem again :-(
11:39 | csharp_ | WELL THAT'S JUST GREAT
11:44 | csharp_ | * open-ils.actor [72141] uptime=02:42 cputime=00:01:16 #drones=194/192 101%
11:44 | csharp_ | that's not something I've seen before
12:01 | miker | you really have 192 drones allowed?
12:01 | jeff | keep in mind that --diagnostic has some limitations/quirks: for one thing, the max number of drones is always read from the config on disk, not from whatever config is active. (that particular quirk is not likely related to what you're seeing right now, unless you recently LOWERED the number of drones)
12:01 | jeff | (I recently noticed that quirk and was annoyed by it, but hadn't looked for / opened a bug on it)
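The status line csharp_ pasted at 11:44 comes from OpenSRF's service control script; a hedged example of gathering it, where the --localhost flag assumes the single-domain invocation style used in the OpenSRF docs:

    # print listener/drone status for each registered service
    osrf_control --localhost --diagnostic

As jeff notes, the drone maximum shown in that report is read from the on-disk configuration rather than from whatever configuration the running services loaded.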
12:03 | csharp_ | miker: yeah - lower values have resulted in saturation
12:03 | | jihpringle joined #evergreen
12:03 | csharp_ | jeff: interesting
12:04 | csharp_ | miker: happy to provide additional configs to you if you think it would be helpful
12:05 | csharp_ | issue hasn't happened since 2022-01-21 11:41:59
12:09 | miker | csharp_: 1) that's good (except for the log spew) 2) no, this has got to be a too-many-bytes-too-fast issue, which is backed up by the logging-slows-us-down-enough-to-succeed situation. and the lack of the "restarting" log message means ejabberd is doing the disconnecting of the socket, not some internal explosion that is separate from the socket.
12:10 | miker | no, meaning, no I don't think more configs will be helpful right now
12:10 | | terranm joined #evergreen
12:17 | csharp_ | and... it's back
12:17 | csharp_ | miker: thanks
12:25 | mmorgan | csharp_: Someone went to lunch and just came back?
12:26 | miker | mmorgan: heh... good thought. could be unrelated to changing the logging, just usage difference
12:37 | csharp_ | that's one of my theories - a cataloger somewhere with a particular workflow is unwittingly making the state's libraries miserable
12:41 | miker | csharp_: are you seeing router log messages like this: "Error sending message from " ... ?
12:44 | csharp_ | not in osrfsys logs, no
12:45 | miker | k
12:46 | miker | well... I need brain fuel. I'll give it some more thought, but I'm running out of ideas
12:47 | csharp_ | miker: thanks
12:52 | miker | csharp_: one last thought, nobody has said to you "hey, I keep trying to look at this "special" user that we merge all our deleted users into, and evergreen never lets me see it" right? or "my 400 templates won't save"
12:54 | jeff | I've spent a few spare cycles trying to think about the best way for csharp to redact a pcap file in a way that doesn't limit its usefulness. :-)
12:57 | csharp_ | miker: there have been complaints, but sorting the potential causes from the symptoms (also considering the inaccuracy of end user reporting) is pretty much useless :-/
12:57 | csharp_ | but I'll look for patters
12:57 | csharp_ | patterns also
12:58 | csharp_ | it does seem less intense right now, but mmorgan's out-to-lunch theory may be in play
13:01 | jeff | still have your opensrf log level cranked to debug?
13:08 | | collum joined #evergreen
13:11 | csharp_ | jeff: nope - all at normal levels
13:12 | mmorgan | @blame the patters
13:12 | pinesol | mmorgan: but I never knew until this day that it was the patters all along
13:12 | mmorgan | :)
13:12 | terranm | lol
13:12 | csharp_ | @band add The Patters
13:12 | pinesol | csharp_: Band 'The Patters' added to list
13:43 | csharp_ | the > 100% osrf_control --diagnostic thing is because when you restart a specific service, it isn't killing the orphans left over from the previous process (because it doesn't know about them)
13:43 | csharp_ | only --kill-with-fire takes care of them (or a manual kill of each process)
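A hedged sketch of what that cleanup might look like from a shell; the drone process-title pattern is an assumption about how the Perl services name themselves and should be checked against ps output first:

    # look for leftover actor drones that a plain service restart did not reap (pattern is an assumption)
    pgrep -af 'OpenSRF Drone \[open-ils.actor\]'
    # the heavy hammer mentioned above: stop everything, orphans included
    osrf_control --localhost --kill-with-fire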
13:44 | csharp_ | maybe that's a bug? not sure what you'd want in that situation
14:03 | | jihpringle joined #evergreen
14:33 | csharp_ | this is just unacceptable - it's back to constant
14:33 | csharp_ | just saw it happen to cstore too
15:05 | | mmorgan1 joined #evergreen
15:07 | | collum joined #evergreen
15:19 | | book` joined #evergreen
16:50 | | jvwoolf1 left #evergreen
17:09 | | mmorgan left #evergreen
18:00 | pinesol | News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
23:10 | | Keith_isl joined #evergreen