Time |
Nick |
Message |
00:22 |
|
jvwoolf joined #evergreen |
06:00 |
pinesol |
News from qatests: Testing Success <http://testing.evergreen-ils.org/~live> |
07:30 |
|
rjackson_isl_hom joined #evergreen |
08:13 |
|
mantis1 joined #evergreen |
08:39 |
|
mmorgan joined #evergreen |
08:55 |
|
Dyrcona joined #evergreen |
09:12 |
|
Keith__isl joined #evergreen |
09:24 |
|
Keith_isl joined #evergreen |
09:26 |
csharp_ |
anyone have experience changing/setting the "timeout" listen option in ejabberd? (https://docs.ejabberd.im/admin/configuration/listen-options/#timeout) |
09:27 |
csharp_ |
default is 5 seconds (per the current docs - having trouble finding docs for 18.01, the version on Ubuntu 18.04) |
09:27 |
csharp_ |
there's also "send_timeout", which is set to 30 seconds |
09:28 |
csharp_ |
sorry - 15 seconds |
09:42 |
|
jvwoolf joined #evergreen |
10:09 |
csharp_ |
2022-01-19 17:51:27 brick04-head open-ils.auth: [INFO:80766:transport_session.c:653:1642626979107748628] Received <error> message with type cancel and code 503 - is another one of my dead ends |
10:26 |
JBoyer |
csharp_, I've seen some references to that code also being sent for a login failure. Do all of your opensrf_core.xml files have the correct user / pass for all of the ejabberd users / instances? And with that, are all of the expected accounts registered everywhere they should be? |
10:28 |
Dyrcona |
JBoyer: I would think if that were the case, csharp_ would have more consistent issues with whichever brick it was misconfigured on, but I could be wrong. |
10:30 |
JBoyer |
Me too, but that is a weird error to see. |
10:30 |
JBoyer |
Or ejabberd is *really* unhappy about... something. |
10:30 |
JBoyer |
I suppose another potential avenue is did the OS get upgraded or are these new VMs with a fresh install? |
10:31 |
csharp_ |
these were fresh installed back in October |
10:33 |
csharp_ |
my current paths are 1) something is different with ejabberd 18.04 and we need to add/tweak a config option 2) something in perl 5.26 is breaking something deep in the guts of OpenSRF 3) some specific type of message coming from the client is formatted in a way that OpenSRF/ejabberd chokes on |
10:33 |
csharp_ |
or something at the Linux kernel/resource level |
10:35 |
csharp_ |
the "Received <error> message with type cancel and code 503" message may be a red herring - it's occurred 60 times this hour and while we've seen the open-ils.actor breakage a couple of times, we haven't seen it *60* times |
10:36 |
csharp_ |
plus, looking through some of the ejabberd debug logs I gathered yesterday, that's happening for services aside from open-ils.actor |
10:42 |
csharp_ |
TCP: request_sock_TCP: Possible SYN flooding on port 5222. Sending cookies. Check SNMP counters. |
10:45 |
csharp_ |
I need to rule this^^ out as a cause before proceeding I think |
10:48 |
Dyrcona |
We're using Ubuntu 18.04 and occasionally have this problem. I suspect that jeffdavis was on the right track with file descriptor limits or something like that. You just have too much traffic for the machine. |
10:48 |
Dyrcona |
We're also still on Evergreen 3.5 in production. |
10:49 |
Dyrcona |
csharp_: You just upgraded production didn't you? Was it to 3.8 or 3.7? |
10:53 |
Dyrcona |
csharp_: You're running OpenSRF 3.2.2, right? |
10:54 |
csharp_ |
Dyrcona: OpenSRF 3.2.2 - Evergreen 3.8.0 |
10:54 |
csharp_ |
we saw it occasionally pre-upgrade - now constant |
10:56 |
Dyrcona |
csharp_: My gut thinks that the issues with the web staff client making excess backend calls have increased with 3.8. My gut could be wrong. I am hungry at the moment. |
10:57 |
Dyrcona |
csharp_: Have you tried increasing any of the file descriptor limits as suggested in the article shared by jeffdavis yesterday? |
10:59 |
Dyrcona |
I guess it matters, too, if you have increased any of the max children settings when you upgraded. Having more children running would lead to more connections as requests are handled. |
11:01 |
Dyrcona |
FWIW, we're on OpenSRF 3.2.2 and Evergreen 3.5.3(ish) in production on Ubuntu 18.04 and we get these messages, but not so much that it interferes with production. (No ticket, no problem. Right?) |
11:02 |
JBoyer |
And yeah, if there's an open files limit issue that could potentially lead to those SYN flooding messages as things keep trying to connect. They can connect to port 5222, but when the port's file descriptor is dup'd for a new child to listen to it will fail at that point and won't look like a connection refused. |
11:02 |
JBoyer |
(Some paraphrasing from memory in there, but even if the specifics are off the end result would be the same) |
11:02 |
Dyrcona |
Yeahp. |
11:07 |
Dyrcona |
The places in the OpenSRF code where the errors are coming from look like it would be on first connection or getting the first response from ejabberd. |
11:09 |
Dyrcona |
RE Chrome issues from yesterday: I've also had to reload GMail more frequently to get it to autofill email recipients. |
11:11 |
csharp_ |
ulimit shows 'unlimited' for every user we've tested: opensrf, root, ejabberd |
11:11 |
csharp_ |
@quote add < Dyrcona> csharp_: My gut could be wrong. I am hungry at the moment. |
11:11 |
pinesol |
csharp_: The operation succeeded. Quote #221 added. |
11:12 |
csharp_ |
oh - didn't mean to keep the csharp_ part :-) |
11:17 |
Dyrcona |
:) |
11:18 |
Dyrcona |
"unlimited" doesn't mean unlimited. It means use the system limits, which you have to change possibly in multiple places depending on if you're using PAM or not, which you probably are on Ubuntu 18.04. |
11:20 |
Dyrcona |
csharp_: That post from metajack that jeffdavis shared tells you what to do: https://metajack.im/2008/09/23/file-descriptors-are-yummy-or-common-pitfalls-of-ejabberd/ |
11:21 |
Dyrcona |
I highly recommend trying those steps and seeing what happens. |
11:25 |
Dyrcona |
"unlimited" usually equals 1024. |
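One way to check the open-files limit a running process actually has, rather than what ulimit reports in a shell, is to read /proc/<pid>/limits on Linux. The sketch below assumes that file's usual column layout (limit name, soft limit, hard limit, units). The function name max_open_files and the proc_root parameter are illustrative and not part of any Evergreen, OpenSRF, or ejabberd tooling.

```python
from pathlib import Path


def max_open_files(pid, proc_root="/proc"):
    """Return the (soft, hard) 'Max open files' limits of a process as strings.

    Values are returned as text because the kernel may report either a
    number or the literal word 'unlimited'.
    """
    limits_file = Path(proc_root) / str(pid) / "limits"
    label = "Max open files"
    for line in limits_file.read_text().splitlines():
        if line.startswith(label):
            # Remaining columns: soft limit, hard limit, units
            soft, hard = line[len(label):].split()[:2]
            return soft, hard
    raise LookupError(f"no '{label}' row found in {limits_file}")
```

For example, max_open_files(1234) would report the limits for PID 1234. Reading another user's process may require root.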
11:39 |
csharp_ |
Dyrcona: thanks for that info |
11:40 |
csharp_ |
I've applied the changes suggested in the article - nothing's broken immediately, so there's that :-) |
11:41 |
csharp_ |
nope - still busted |
11:42 |
Dyrcona |
You rebooted? |
11:46 |
csharp_ |
yes |
11:46 |
csharp_ |
gonna have to walk away from this for a while - I'm despondent |
11:47 |
Dyrcona |
Yeah, I'm going to get some lunch. |
12:03 |
|
jihpringle joined #evergreen |
12:07 |
csharp_ |
at this point I'm tempted to revert the new cataloging UI stuff |
12:07 |
csharp_ |
or move back to Ubuntu 16.04 or something |
12:07 |
csharp_ |
anyway - haven't gotten to lunch yet, so walking away for real now |
12:08 |
jvwoolf |
@bartender csharp_ |
12:08 |
* pinesol |
fills a pint glass with Samuel Adams Boston Ale (Stock Ale), and sends it sliding down the bar to csharp_ (http://beeradvocate.com/beer/profile/35/1193/) |
12:17 |
* Dyrcona |
intercepts the beer for csharp_ and sends him a sparkling water instead. |
12:17 |
Dyrcona |
Are any sites on 3.7 seeing this issue? |
12:19 |
mmorgan |
Dyrcona: We're on 3.7, not seeing the same issue as csharp_ |
12:24 |
jvwoolf |
FWIW, we ARE seeing the same thing on 3.6.5 running OpenSRF 3.2.2 and Ubuntu 18.04 |
12:25 |
jvwoolf |
We just upgraded to opensrf 3.2.2 as part of our 3.6.5 upgrade |
12:25 |
* mmorgan |
notes that we are running debian, not ubuntu |
12:29 |
csharp_ |
mmorgan: what version of debian? |
12:29 |
csharp_ |
Dyrcona++ # saving me from the beer :-) |
12:29 |
jvwoolf |
csharp_: Apologies! |
12:30 |
csharp_ |
jvwoolf: nah - no worries |
12:30 |
csharp_ |
I'm 10.5 years sober - it's no longer a struggle for me |
12:31 |
jvwoolf |
@tea csharp_ |
12:31 |
* pinesol |
brews and pours a pot of Wild Snow Sprout Tea, and sends it sliding down the bar to csharp_ (http://ratetea.com/tea/wild-tea-qi/wild-snow-sprout-tea/6447/) |
12:31 |
jvwoolf |
Is that better? |
12:31 |
csharp_ |
there you go! |
12:31 |
Dyrcona |
That sounds interesting. I may have to try Wild Snow Sprout Tea. |
12:32 |
Dyrcona |
csharp_: Have you checked how many files ejabberd has open? I just checked our brick 1, and it has 506 open files at the moment. |
12:32 |
mmorgan |
csharp_: debian 10 |
12:32 |
|
jihpringle joined #evergreen |
12:33 |
Dyrcona |
mmorgan: Thanks for letting me know. How painful is it for production? |
12:33 |
csharp_ |
mmorgan: thanks |
12:34 |
Dyrcona |
Also, Debian 10 is about the same age as Ubuntu 18, right? It's the same ejabberd version more or less. |
12:35 |
csharp_ |
822, 739, 640, 669, 879, 765 (bricks 1-6) |
12:35 |
Dyrcona |
OK. That's close to the default limit but not there. I wonder if it's just some other resource limit? |
12:38 |
mmorgan |
Dyrcona: Do you mean how painful is 3.7? If so, not too (knocks wood). |
12:39 |
mmorgan |
Disclaimer: I'm not the sysadmin, so don't have all the gory details of the upgrade/reingest/etc. :) |
12:42 |
Dyrcona |
mmorgan: OK. I think that answers my question. I was wondering how painful this issue is for your users on 3.7. |
12:43 |
* Dyrcona |
starts to wonder if it is some other erlang bug or ejabberd issue. |
12:43 |
mmorgan |
Dyrcona: We don't seem to be experiencing the same issue as csharp_ |
12:44 |
* Dyrcona |
needs an upgrade. |
12:44 |
Dyrcona |
I misread your earlier statement. |
12:50 |
mmorgan |
@tea Dyrcona |
12:50 |
* pinesol |
brews and pours a pot of Earl Grey Decaffeinated Black Tea, and sends it sliding down the bar to Dyrcona (http://ratetea.com/tea/bigelow/earl-grey-decaf/87/) |
12:51 |
csharp_ |
lager_file_backend dropped 1 messages in the last second that exceeded the limit of 100 messages/sec |
12:53 |
Dyrcona |
erlang logging framework..... |
12:53 |
csharp_ |
eh - ok |
12:53 |
csharp_ |
logger/lager - hilarious |
12:55 |
Dyrcona |
It looks like ejabberd uses it. Not sure that's the root of the problem but might be worth trying to rule it out. |
12:59 |
|
mmorgan1 joined #evergreen |
13:00 |
Dyrcona |
I found a bunch of other erlang processes running on one of our bricks, but each only has 3 files open. |
13:03 |
|
mmorgan joined #evergreen |
13:09 |
|
Ohiojoe joined #evergreen |
13:12 |
Dyrcona |
y'know. That limit is per user and those 3 files each would add up to the total. Sure enough, they're all running as ejabberd. |
13:20 |
Dyrcona |
In my case, that's an additional 48 to 50 files open. |
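The tally described above can be sketched by walking /proc and counting the entries in each process's fd directory for one user. This is a minimal sketch, not a tool from the discussion. The function name open_fds_by_pid and the proc_root parameter are hypothetical. Linux enforces the open-files limit (RLIMIT_NOFILE) per process, so the per-PID counts are the ones to compare against it. The sum is shown only because the conversation adds up totals per user.

```python
import os


def open_fds_by_pid(uid, proc_root="/proc"):
    """Map PID -> number of open file descriptors for processes owned by uid."""
    counts = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        proc_dir = os.path.join(proc_root, entry)
        try:
            if os.stat(proc_dir).st_uid != uid:
                continue
            counts[int(entry)] = len(os.listdir(os.path.join(proc_dir, "fd")))
        except OSError:
            # The process exited mid-scan, or we lack permission to inspect it
            continue
    return counts
```

Assuming the service runs as a user named ejabberd, the uid could be looked up with pwd.getpwnam("ejabberd").pw_uid, and sum(open_fds_by_pid(uid).values()) gives the combined count. Reading another user's fd directories generally requires root.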
13:24 |
|
mantis1 joined #evergreen |
13:29 |
|
ohiojoe joined #evergreen |
13:31 |
|
terranm joined #evergreen |
13:35 |
|
rjackson_isl_hom joined #evergreen |
13:53 |
|
ohiojoe joined #evergreen |
13:55 |
ohiojoe |
hello out there |
13:55 |
terranm |
I just found out that since the upgrade to 3.8 the Notification Action Triggers we have that are creating messages for the Patron Message Center are setting Patron Visible to No |
14:20 |
mmorgan |
terranm: Maybe because the database table default is pub = false? |
14:22 |
terranm |
Probably |
14:22 |
terranm |
There's no way to control that in the Notification Action Triggers interface |
14:22 |
terranm |
Trying to find where the code for that is... |
14:29 |
Dyrcona |
It might be Open-ILS/src/perlmods/lib/OpenILS/Application/Trigger/Reactor/ProcessMessage.pm but that looks like it just processes a template. |
14:32 |
csharp_ |
it's in Application/Trigger/Event.pm in the react sub |
14:32 |
terranm |
Looking at that one now |
14:33 |
csharp_ |
it specifies there that pub should be "t", but I can see that it's coming through in the logs as undef and NULL |
14:33 |
terranm |
It looks like it's trying to set pub to t |
14:33 |
terranm |
Jinx |
14:35 |
csharp_ |
ac037f5143b3 is the relevant commit |
14:35 |
pinesol |
csharp_: [evergreen|Jason Etheridge] lp1846354 toward consolidated patron notes - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=ac037f5> |
14:35 |
terranm |
Is it something like Perl wanting the boolean to be t instead of 't' ... or 1? |
14:36 |
csharp_ |
I think 't' should work... |
14:37 |
Dyrcona |
What's there looks correct. You use 't' for a database boolean, not 0 or 1. |
14:37 |
terranm |
hmm |
14:38 |
terranm |
Oh, there's also EventGroup.pm and it looks like that one is neglecting to set the boolean |
14:38 |
Dyrcona |
Ah, there you gol! |
14:38 |
Dyrcona |
Go, even. |
14:38 |
terranm |
Fix pending... |
14:39 |
csharp_ |
terranm++ |
14:40 |
Dyrcona |
I don't like that it's redundant code. That ought to be refactored to a single method to set the user message values that could be called from either Event.pm or EventGroup.pm, but I'll leave that as an exercise for myself for later. :) |
14:45 |
terranm |
+1 |
14:45 |
Dyrcona |
terranm++ |
14:46 |
terranm |
https://bugs.launchpad.net/evergreen/+bug/1958573 |
14:46 |
pinesol |
Launchpad bug 1958573 in Evergreen "Action triggers that create messages for Patron Message Center are setting visiblity to false" [High,New] |
14:46 |
terranm |
patch ready for testing |
14:46 |
Dyrcona |
Yeah, I got the email. I've not seen it in the wild, but I think looking at the code qualifies me to confirm it. |
14:49 |
mmorgan |
terranm++ |
14:54 |
|
terranm joined #evergreen |
14:56 |
|
rjackson_isl_hom joined #evergreen |
16:31 |
terranm |
Perl change in place and verified that it's working in our production environment with grouped PMC messages |
16:33 |
|
jihpringle joined #evergreen |
17:03 |
|
mmorgan left #evergreen |
17:52 |
|
mantis1 joined #evergreen |
18:00 |
pinesol |
News from qatests: Testing Success <http://testing.evergreen-ils.org/~live> |
19:08 |
|
jihpringle joined #evergreen |
19:44 |
|
jihpringle joined #evergreen |
23:08 |
|
Keith_isl joined #evergreen |