
IRC log for #evergreen, 2022-01-21


All times shown according to the server's local time.

Time Nick Message
00:49 bshum joined #evergreen
05:23 Keith__isl joined #evergreen
06:00 pinesol News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
07:21 rjackson_isl_hom joined #evergreen
07:56 Keith_isl joined #evergreen
08:25 mantis1 joined #evergreen
08:41 mmorgan joined #evergreen
08:42 csharp_ there's nothing consistent in the logs that would explain why open-ils.actor keeps dropping its connection to the network
08:43 csharp_ I'm looking for frequency of calls (varies), types of calls (vary), anything going on at the OS level (nothing special, no duress), segfault-type stuff (none so far)
08:44 csharp_ and ejabberd would probably be emitting more than "Connection failed: timeout" if there was a serious resource problem, I would think?
08:45 csharp_ at this point I'm willing to conclude that there's a bug in OpenSRF (possibly upstream in perl or C or something)
08:46 csharp_ so I guess it's time to start adding debug logging to the perl stuff?
08:46 csharp_ (I hope miker is watching since this is one of his areas of expertise)
09:23 rfrasur joined #evergreen
09:36 collum joined #evergreen
09:53 miker why are my ears burning...
09:56 miker csharp_: since it's specific to one service, I think it's more likely how that service is used / the shape of calls to it. to clarify, the symptomatic result of whatever is happening is that the listener ends up being deregistered from the router, is that right? as in, if you use srfsh to ask the router for the list of services, you see actor is not listed
10:02 miker you can use this command in srfsh to do that, btw, as long as your router's ejabberd username is exactly "router": request router opensrf.router.info.stats.class.all
10:04 csharp_ miker: yes, that's what we're seeing
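(Aside: a minimal sketch of the srfsh check miker describes above, assuming the router's ejabberd username is exactly "router" as he notes.)

    # run from the srfsh prompt; asks the router to report every registered service class
    srfsh# request router opensrf.router.info.stats.class.all

If open-ils.actor is absent from the returned class list, its listener has been deregistered from the router, which is the symptom confirmed here.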
10:07 miker csharp_: so, the absolute most common cause of that is having a c2s shaper installed in ejabberd. the way shapers work, a single very large stanza (say, a huge template as a parameter) can cause a pause -- ejabberd will deliver the payload and then wait for the amount of time it would have taken if the bytes/sec limit were strictly honored -- and the router sees that as a >1s delay from the listener, at which point it drops the listener from the list for the class
10:09 miker if you have entirely standalone bricks, with no cross-registration, this presents as an immediate listener failure. if you're cross-registering, you have more buffer because the other actor instances can take up the slack while you detect that situation and tell the service to rereg (or restart it)
10:11 csharp_ I do have c2s_shaper: none in all my ejabberd configs as recommended in bug 1941653
10:11 pinesol Launchpad bug 1941653 in OpenSRF "Update recommended shaper settings" [Medium,New] https://launchpad.net/bugs/1941653
10:11 csharp_ no cross registration
10:13 miker what version of ejabberd are you using?
10:13 csharp_ 18.01
10:14 miker ok, I think it's spelled differently (version differences) based on the docs. you have an ejabberd_c2s section, right?
10:15 miker specifically, I believe it should be "shaper: none" rather than "c2s_shaper: none"
10:16 miker can you share a paste of the relevant listen: block?
10:16 csharp_ yes, one sec
10:17 miker (and thanks a tonne, ejabberd, for changing the spelling and format of your config options all the time)
10:17 csharp_ https://pastebin.com/DYNVG8Pi
10:18 csharp_ seriously
10:21 miker yeah, instead of naming a shaper, in the listen: block just say "shaper: none"
10:21 csharp_ ok - just a sec
10:23 miker or, just remove that entirely. fwiw, because "none" is the default we just don't have a shaper: line
10:23 miker side note: because max-concurrent-sessions defaults to Inf, we don't specify that either
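(Aside: a sketch of the listen: block shape being discussed, in the YAML style of ejabberd 18.x; the port and other values shown are illustrative, not csharp_'s actual config.)

    listen:
      -
        port: 5222
        module: ejabberd_c2s
        # Per miker: name no shaper here at all. A c2s shaper throttles large
        # stanzas, and the resulting >1s pause makes the router drop the
        # listener for the class.
        shaper: none
        # "none" is also the default, so deleting the line has the same effect.

Keeping the explicit line rather than relying on the default is the "explicit over implicit" choice made just below.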
10:24 csharp_ given my pain of the last few days, gonna stick with explicit over implicit :-)
10:26 miker fair. I'm looking at https://docs.ejabberd.im/admin/configuration/listen-options/#shaper for this, btw
10:29 csharp_ ok, change is applied, now to watch
10:29 miker ah, one other thing that might be important in a high-traffic env (though it should /not/ be something that only one service suffers from) is the max_fsm_queue setting https://docs.ejabberd.im/admin/configuration/listen-options/#max-fsm-queue
10:29 miker we've doubled that to 20000
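(Aside: a sketch of where that option sits, following the linked listen-options page; the surrounding values are illustrative, and 20000 is the doubled figure miker mentions.)

    listen:
      -
        port: 5222
        module: ejabberd_c2s
        shaper: none
        # cap on the per-connection message queue; raising it gives a busy
        # OpenSRF listener more headroom before ejabberd drops the connection
        max_fsm_queue: 20000

As csharp_ later finds (11:00), the setting also has to land in the right place in the file or it gets overridden.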
10:35 csharp_ still busted :-(
10:40 csharp_ I removed the backlog designation as well and restarted ejabberd - still watching - I'll look at the fsm queue stuff next
10:49 miker csharp_: here's an example of our ENTIRE .yml, sans hostname, IP and auth setup: https://pastebin.com/gJ1pGwVE
10:50 csharp_ miker: the main thing I'm having trouble with from the troubleshooting perspective is the lack of useful logs - we're planning to turn on debug-level opensrf logs for a while to see if we can capture things, but I'm surprised ejabberd isn't providing better info
10:50 csharp_ fsm increase in place, but the issue is still happening btw
10:51 miker oh, ejabberd logs are terrible
10:53 miker are you seeing this in the logs: We lost the last node in the class, responding with error and removing...
10:53 csharp_ yes
10:53 miker it's WARN level
10:53 csharp_ constantly
10:53 miker k
10:54 csharp_ 30 times in the last hour across 6 servers
10:54 csharp_ sorry - all day
10:54 csharp_ since 7:48 a.m.
10:54 csharp_ actually, not that far off though - 28 in the last hour and 10 minutes
10:57 miker always for actor?
10:57 csharp_ yes
10:58 csharp_ this feels like a threshold issue since actor is by far the most called service
11:00 csharp_ I realized I had not applied max_fsm_queue in the "right" place in the file, so my change was getting overridden by the default (assuming the value lower down in the file is the one read last)
11:00 csharp_ so applying that for real now
11:00 jvwoolf joined #evergreen
11:01 miker and you applied all three of the recent patches we talked about earlier in the week?
11:01 jvwoolf1 joined #evergreen
11:02 csharp_ yes, that's right
11:08 miker do you see much in the logs about the backlog queue? "server: backlog queue size is now" etc
11:12 csharp_ no - it doesn't look like the number of active children corresponds with the drops at all
11:13 miker k, that's what I expected, but to be sure there's not some new source of pauses in ejabberd communication...
11:16 miker do you see log lines with: server: restarting after fatal crash...
11:17 miker preceded by "server: died with error " and an error message?
11:19 csharp_ I see 'died with error' but not the restarting message
11:20 csharp_ just cranked up logging to debug everywhere in opensrf_core.xml to see more into what's going on
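(Aside: the knob being turned here is the <loglevel> element in opensrf_core.xml; a sketch, with OpenSRF's numeric levels noted for context.)

    <!-- 1=ERROR, 2=WARN, 3=INFO, 4=DEBUG, 5=INTERNAL; each logging section of
         opensrf_core.xml carries its own <loglevel>, so "debug everywhere"
         means setting them all to 4 -->
    <loglevel>4</loglevel>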
11:22 miker thoughts and prayers, rsyslogd
11:23 csharp_ heh osrfsys.11.log just hit 4.3
11:23 csharp_ GB
11:23 csharp_ miker: pray for grep
11:23 csharp_ of course, it's calm right this second :-
11:23 csharp_ :-?
11:28 csharp_ what the hell? it's quiet now
11:29 berick could be one of those bugs where it's affected by the log level.
11:29 berick the logging bogs everything down, reducing overall throughput
11:29 csharp_ huh
11:29 * csharp_ pipes all logs to /dev/null and calls it a day
11:39 csharp_ berick: apparently you're correct - when I disabled debug logs, I see the problem again :-(
11:39 csharp_ WELL THAT'S JUST GREAT
11:44 csharp_ * open-ils.actor           [72141] uptime=02:42       cputime=00:01:16    #drones=194/192 101%
11:44 csharp_ that's not something I've seen before
12:01 miker you really have 192 drones allowed?
12:01 jeff keep in mind that --diagnostic has some limitations/quirks: for one thing, the max number of drones is always read from the config on disk, not from whatever config is active. (that particular quirk is not likely related to what you're seeing right now, unless you recently LOWERED the number of drones)
12:01 jeff (I recently noticed that quirk and was annoyed by it, but hadn't looked for / opened a bug on it)
12:03 csharp_ miker: yeah - lower values have resulted in saturation
12:03 jihpringle joined #evergreen
12:03 csharp_ jeff: interesting
12:04 csharp_ miker: happy to provide additional configs to you if you think it would be helpful
12:05 csharp_ issue hasn't happened since 2022-01-21 11:41:59
12:09 miker csharp_: 1) that's good (except for the log spew) 2) no, this has got to be a too-many-bytes-too-fast issue, which is backed up by the logging-slows-us-down-enough-to-succeed situation. and the lack of the "restarting" log message means ejabberd is doing the disconnecting of the socket, not some internal explosion that is separate from the socket.
12:10 miker no, meaning, no I don't think more configs will be helpful right now
12:10 terranm joined #evergreen
12:17 csharp_ and... it's back
12:17 csharp_ miker: thanks
12:25 mmorgan csharp_: Someone went to lunch and just came back?
12:26 miker mmorgan: heh... good thought. could be unrelated to changing the logging, just usage difference
12:37 csharp_ that's one of my theories - a cataloger somewhere with a particular workflow is unwittingly making the state's libraries miserable
12:41 miker csharp_: are you seeing router log messages like this: "Error sending message from " ... ?
12:44 csharp_ not in osrfsys logs, no
12:45 miker k
12:46 miker well... I need brain fuel. I'll give it some more thought, but I'm running out of ideas
12:47 csharp_ miker: thanks
12:52 miker csharp_: one last thought, nobody has said to you "hey, I keep trying to look at this "special" user that we merge all our deleted users into, and evergreen never lets me see it" right? or "my 400 templates won't save"
12:54 jeff I've spent a few spare cycles trying to think about the best way for csharp to redact a pcap file in a way that doesn't limit its usefulness. :-)
12:57 csharp_ miker: there have been complaints, but sorting the potential causes from the symptoms (also considering the inaccuracy of end user reporting) is pretty much useless :-/
12:57 csharp_ but I'll look for patters
12:57 csharp_ patterns also
12:58 csharp_ it does seem less intense right now, but mmorgan's out-to-lunch theory may be in play
13:01 jeff still have your opensrf log level cranked to debug?
13:08 collum joined #evergreen
13:11 csharp_ jeff: nope - all at normal levels
13:12 mmorgan @blame the patters
13:12 pinesol mmorgan: but I never knew until this day that it was the patters all along
13:12 mmorgan :)
13:12 terranm lol
13:12 csharp_ @band add The Patters
13:12 pinesol csharp_: Band 'The Patters' added to list
13:43 csharp_ the > 100% osrf_control --diagnostic thing is because when you restart a specific service, it isn't killing the orphans left over from the previous process (because it doesn't know about them)
13:43 csharp_ only --kill-with-fire takes care of them (or a manual kill of each process)
13:44 csharp_ maybe that's a bug? not sure what you'd want in that situation
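(Aside: a sketch of the sequence csharp_ is describing, using the osrf_control flags that appear in this log; the service name is just an example.)

    # a per-service restart spawns a new listener but does not reap drones
    # belonging to the previous process tree:
    osrf_control --localhost --restart --service open-ils.actor
    # --diagnostic then over-counts drones (e.g. the 194/192, 101% line above)
    # because the orphans are still running; a full teardown clears them:
    osrf_control --localhost --diagnostic
    osrf_control --localhost --kill-with-fire
    osrf_control --localhost --start-all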
14:03 jihpringle joined #evergreen
14:33 csharp_ this is just unacceptable - it's back to constant
14:33 csharp_ just saw it happen to cstore too
15:05 mmorgan1 joined #evergreen
15:07 collum joined #evergreen
15:19 book` joined #evergreen
16:50 jvwoolf1 left #evergreen
17:09 mmorgan left #evergreen
18:00 pinesol News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
23:10 Keith_isl joined #evergreen
