Time |
Nick |
Message |
01:58 |
|
dbs joined #evergreen |
06:01 |
pinesol |
News from qatests: Testing Success <http://testing.evergreen-ils.org/~live> |
07:26 |
|
rjackson_isl_hom joined #evergreen |
08:12 |
|
collum joined #evergreen |
08:13 |
|
Dyrcona joined #evergreen |
08:23 |
Dyrcona |
Bleh... Sites that stick a floating banner at the top of every page when you try to print.... |
08:23 |
Dyrcona |
@blame Linked-In for ruining Print to PDF |
08:23 |
pinesol |
Dyrcona: Linked-In HAXORED Dyrcona's SERVERZ!!!! for ruining Print to PDF |
08:24 |
Dyrcona |
@blame Linked-In for other things |
08:24 |
pinesol |
Dyrcona: Linked-In broke Evergreen. for other things |
08:26 |
Dyrcona |
Also, this is an interesting read moderately related to Evergreen: https://www.linkedin.com/pulse/why-majority-our-mfa-so-phishable-roger-grimes |
08:35 |
|
mmorgan joined #evergreen |
08:51 |
|
rfrasur joined #evergreen |
09:22 |
|
awitter joined #evergreen |
10:54 |
|
alynn26 joined #evergreen |
11:13 |
|
Christineb joined #evergreen |
11:40 |
Bmagic |
so, I'm opening the firewall for the SIP server to receive connections, and it gets hammered so hard, it's falling over. It seems that it gets stuck in a loop where SIPServer.pm throws an error about login timeout (Auth takes a few seconds sometimes) - and I think that's causing the external SIP clients to try again and again. |
11:41 |
Bmagic |
Is there a way to make SIPServer.pm wait a little longer on Auth before giving up? |
11:46 |
|
jihpringle joined #evergreen |
12:15 |
csharp_ |
Bmagic: in our experience, it's going to be one or two clients out there that hammer the server so you might block by IP until you can see it working |
12:21 |
|
jvwoolf joined #evergreen |
12:27 |
berick |
Bmagic: what's the exact log message? |
12:29 |
berick |
in any event, you can configure the login timeout in oils_sip.xml. default is 60 seconds |
12:30 |
berick |
but the login eval can fail for other reasons, which should show in the error logs |
12:47 |
Dyrcona |
Bmagic: You might try installing the Socket::Linux Perl module. SIPServer will only use it if it is available, and it helps with timeout issues and clients. (I'm not saying it will fix your current problem, but it won't hurt.) |
12:50 |
csharp_ |
this issue reminds me that I want to get back to testing berick's SIP proxy thing :-) |
12:51 |
csharp_ |
gmcharlt++ # great presentation this morning on VuFind/Evergreen |
12:56 |
|
collum joined #evergreen |
13:10 |
|
jvwoolf1 joined #evergreen |
13:11 |
|
jvwoolf2 joined #evergreen |
13:15 |
Bmagic |
Dyrcona: let me see if I can get some logs that make sense and I'll post em |
13:15 |
Bmagic |
It seems that ejabberd dies after a certain amount of pressure |
13:15 |
Bmagic |
and then the logs get tons of ejabberd errors, but not at first |
13:16 |
Bmagic |
Dyrcona: and that perl module is already installed |
13:34 |
Dyrcona |
Well, I didn't make any promises. :) |
13:36 |
Bmagic |
Dyrcona: The first sign of an issue comes from this message: |
13:36 |
Bmagic |
[2021-11-15 13:33:31] SIPServer.pm [ERR :68922:AppSession.pm:132:] Attempting to build a client session as a server Session ID [<removed>], remote_id [opensrfprivate.localhost/opensrf.settings_drone_at_localhost_66952] |
13:36 |
Bmagic |
not enough "settings" drones? |
13:39 |
Bmagic |
Around the same time, in the ejabberd log: |
13:40 |
Bmagic |
2021-11-15 13:33:31.849 [info] <0.873.0>@ejabberd_c2s:process_terminated:271 (tcp|<0.873.0>) Closing c2s session for opensrfprivate.localhost/client_at_server.org_68922: Connection failed: connection closed |
13:41 |
Bmagic |
ulimits are set to 65535 |
13:41 |
Dyrcona |
Bmagic: I think the suggestion of figuring out which hosts are hammering you and then blocking them is the better way to go. |
13:42 |
Bmagic |
hmm |
13:43 |
Bmagic |
I shied away from that because that didn't change. Same set of vendors/logins. Upgrading to a new version of Evergreen did happen |
13:49 |
Bmagic |
The CPU on the machine spikes to unstable levels in about 60 seconds. I'll have to do some netstat's real quick and figure out the IP before it goes unresponsive |
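A minimal sketch of the "figure out the IP" step, assuming the output of `netstat -tn` has been captured as text before the machine goes unresponsive. The function name `count_remote_ips`, the port 6001, and the sample addresses are illustrative assumptions, not values taken from this log.

```python
from collections import Counter


def count_remote_ips(netstat_output, local_port):
    """Count ESTABLISHED TCP connections per remote IP for one local port."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto, recv-q, send-q, local addr, remote addr, state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state != "ESTABLISHED":
            continue
        if local.rsplit(":", 1)[-1] != str(local_port):
            continue
        counts[remote.rsplit(":", 1)[0]] += 1
    return counts


# Made-up sample in the usual `netstat -tn` column layout (assumption).
SAMPLE = """\
tcp        0      0 10.0.0.5:6001           192.0.2.10:51234        ESTABLISHED
tcp        0      0 10.0.0.5:6001           192.0.2.10:51235        ESTABLISHED
tcp        0      0 10.0.0.5:6001           198.51.100.7:40000      ESTABLISHED
tcp        0      0 10.0.0.5:5222           127.0.0.1:33000         ESTABLISHED
tcp        0      0 10.0.0.5:6001           203.0.113.9:40100       TIME_WAIT
"""

if __name__ == "__main__":
    # Busiest remote hosts first.
    for ip, n in count_remote_ips(SAMPLE, 6001).most_common():
        print(ip, n)
```

Sorting by count puts the clients most likely to be hammering the server at the top, which is the list you would want before blocking anything by IP.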
13:50 |
Dyrcona |
Are remote client IPs logged by SIPServer? I don't remember. |
13:53 |
Dyrcona |
Bmagic: Is the real problem that auth is taking too long to complete? |
13:53 |
Bmagic |
I don't think so anymore |
13:53 |
Bmagic |
ejabberd is unstable somehow |
13:54 |
Dyrcona |
ejabberd could be a symptom of too many connections, or you could need to assign more resources (RAM or CPU) to your VM. |
13:55 |
Dyrcona |
Assuming this is a VM... |
13:55 |
Dyrcona |
On an unrelated note, I think that I may have fixed the data load issues with new Pg versions. |
14:02 |
Bmagic |
I did consider increasing the number of CPU's - thinking that it was dogpiling to a point where ejabberd died. It's theoretical at this point. It's strange that it worked and worked for years on the same machine until the upgrade.... |
14:03 |
Dyrcona |
Could be the software is botched. I sometimes get a bad install when building test VMs and have to redo it. |
14:04 |
Dyrcona |
You upgraded from what version to which version? |
14:06 |
Bmagic |
3.5.4 -> 3.7.2 |
14:06 |
Bmagic |
I'm on my 4th rebuilt VM |
14:07 |
Bmagic |
I wonder if I could install 3.5.4 and SIP would still work |
14:07 |
Dyrcona |
Bmagic: It probably would. |
14:07 |
Bmagic |
I'm gonna try it |
14:17 |
Dyrcona |
Yeah. Making a minor change to the env_create.sql solves the Perl test failure. Some of the pgtap tests still fail on Pg 14. |
14:17 |
Dyrcona |
The data load fix is to add ORDER BY id in two of the functions. |
14:26 |
jeff |
well that's embarrassingly obvious in hindsight. nice catch! |
14:26 |
jeff |
Don't I feel silly for combing release notes looking for subtle breaking-to-us changes. |
14:27 |
|
jihpringle joined #evergreen |
14:34 |
Dyrcona |
:) |
14:35 |
Dyrcona |
jeff: We may still have some of those. I'm looking at a vandelay test that breaks, and it looks like the import item is not being created. |
14:37 |
Dyrcona |
import.item.invalid.circ_modifier Hm.... |
14:38 |
Dyrcona |
That's bizarre because the TEST circ modifier is there, and that's what the record tries to use. |
15:01 |
Dyrcona |
ugh. I think this might be a change in xpath, libxml behavior. |
15:06 |
Keith-isl |
I find myself stumped: have a library that gets an 'Offline transaction upload failed' error when they...well...try and upload (not process) offline transactions. |
15:07 |
Keith-isl |
Issue occurs on all the workstations in their system, but offline uploads seem to work fine for different OU's. |
15:09 |
Keith-isl |
If we create a workstation locally and do offline transactions as the problem OU and try and upload to an offline session created at that OU, we can replicate the error |
15:11 |
Dyrcona |
Yeahp: Correctly handle relative path expressions in xmltable(), xpath(), and other XML-handling functions (Markus Winand) |
15:11 |
Keith-isl |
Only place I can think to start troubleshooting is library is in another time zone, but otherwise I'm out of usual suspects to try and lean on for leads. |
15:12 |
Dyrcona |
Keith-isl: What does your Require line for offline look like in eg.conf? |
15:13 |
Dyrcona |
Oh never mind. I thought you couldn't replicate the error locally. /me should pay more attention. |
15:14 |
jeffdavis |
Keith-isl: are you forcing HTTPS? |
15:14 |
Keith-isl |
Dyrcona: Oh good - I don't have access to eg.conf anyway. I'm just a low-level investigator; configs are above my paygrade. :) |
15:15 |
Keith-isl |
jeffdavis I believe so. I have a list of other libraries in a different time zone and have been trying to create offline sessions / uploads as those OU's. So far I haven't run into any difficulties like I have with patient 0. |
15:15 |
alynn26 |
here is the offline config file |
15:15 |
alynn26 |
<Directory "/openils/var/cgi-bin/offline"> |
15:15 |
alynn26 |
AddHandler cgi-script .pl |
15:15 |
alynn26 |
AllowOverride None |
15:15 |
alynn26 |
Options +ExecCGI |
15:15 |
alynn26 |
Require all granted |
15:15 |
alynn26 |
</Directory> |
15:16 |
Dyrcona |
alynn26++ |
15:16 |
jeffdavis |
It used to be the case that /cgi-bin/offline/offline.pl could not support HTTPS so we had to make an exception to allow unencrypted connections for that one path. Not sure offhand if that's still an issue. |
15:16 |
Dyrcona |
That looks right. |
15:16 |
alynn26 |
the really weird thing is that it works for other ous |
15:17 |
Dyrcona |
I was about to ask if this library is still using XULRunner, but that's probably not it if you have the problem with their OU locally. |
15:18 |
Dyrcona |
It could be that their ou is at the wrong depth, but that should cause a whole raft of other problems. |
15:20 |
Dyrcona |
jeff: Regarding the release notes, I think we'll need to audit and fix our relative XPATH code. "*[tag='value']" doesn't work anymore, but "//*[code='value']" does. I don't think the former is technically correct, but the latter is. |
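The difference between a child-relative path and a descendant path can be sketched outside the database. This example uses Python's `xml.etree.ElementTree`, which implements only a subset of XPath and is not what PostgreSQL uses, so it illustrates the path semantics only, not the Pg behavior change itself. The document and element names are made up, and `.//` stands in for `//` because ElementTree does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <code> sits two levels below the root element.
root = ET.fromstring(
    "<record><datafield><subfield><code>value</code></subfield></datafield></record>"
)


def matching_tags(path):
    """Return the tag names of the elements selected by path, relative to root."""
    return [el.tag for el in root.findall(path)]


# Child-relative: only direct children of <record> are candidates.
child_only = matching_tags("*[code='value']")

# Descendant: any element below <record> is a candidate.
any_depth = matching_tags(".//*[code='value']")

if __name__ == "__main__":
    print("child-relative:", child_only)
    print("descendant:", any_depth)
```

Here the child-relative form selects nothing, because `<datafield>` has no `<code>` child, while the descendant form finds `<subfield>`. Code that relied on a relative path matching at any depth would see the same kind of change once the path is evaluated strictly.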
15:21 |
Keith-isl |
Other thing that probably doesn't help anyone else, but issue isn't occurring on our Migration server running 3.7, but is occurring on production server running 3.4 |
15:22 |
Keith-isl |
(More or less talking out loud to myself, unsure if that's actually useful in coming to any sort of determination) |
15:23 |
Dyrcona |
Well, something's different between the two, and it's probably not just the Evergreen version. I've had fun things like this happen before. |
15:23 |
Dyrcona |
Not with offline specifically, though. |
15:24 |
Dyrcona |
Have you checked permissions on their offline directory? |
15:26 |
* Keith-isl |
bats eyelashes at alynn26 |
15:35 |
|
collum joined #evergreen |
15:37 |
csharp_ |
so setting ejabberd loglevel to 5 (debug) works.... until is stops logging anything for some reason |
15:37 |
csharp_ |
*it |
15:38 |
csharp_ |
still trying to see if there's anything pointing to the cause of open-ils.actor falling offline |
15:46 |
Dyrcona |
csharp_: Did you run out of disk space? |
15:48 |
csharp_ |
Dyrcona: that was my first thought, but nope :-/ |
15:49 |
Dyrcona |
Speaking of disk space, setting up Did You Mean on our training server only uses about 10GB more space, the db grew from 380GB to 390GB. |
15:55 |
Bmagic |
Dyrcona++ # interesting, I wonder what my problem was |
15:56 |
Bmagic |
it was a production server, and the disk grew super huge without my knowing, until the backups kicked in, and I saw what was going on. I truncated the table and it lowered the disk usage by over 100GB |
15:57 |
Bmagic |
Also - on this SIP issue, my working theory is it's the latest revision of 18.04. Something kernel something.... Trying a previous build. I'll keep you posted :) |
16:00 |
jeff |
and then the backups kicked in... and my shoes started to squeak... |
16:00 |
Bmagic |
jeff++ # I love it, not sure what it means. Kick = shoes? |
16:05 |
jeff |
obscure recurring line in an early 90s song... original was "and then the horns kicked in / and my shoes started to squeak": https://open.spotify.com/track/0aSoxyXaiEGYioFxNzvzcz |
16:05 |
jeff |
Before long, I was coming up / On this really weird part of my dream |
16:05 |
jeff |
You know, the part where / I know how to tap dance |
16:05 |
jeff |
But I can only do it / While wearing golf shoes |
16:06 |
jeff |
received an absurd amount of radio airtime for how odd it was (or because of how odd it was?) |
16:06 |
Bmagic |
dang |
16:07 |
Bmagic |
at no point during your incoherent ramblings could anything be considered a rational thought. I award you no points and may god have mercy on your soul |
16:08 |
jeff |
about four years before that movie. :-) |
16:09 |
Bmagic |
:) |
16:14 |
Dyrcona |
Interesting. I've found a record that produces a metabib.metarecord entry on Pg 14, but doesn't on Pg 10. It's one of the two inserted for the lp1731960_test_preserving_bookbag_entries.pg tests. |
16:18 |
Dyrcona |
Ah, bizarre. It's the opposite of the behavior on Pg 10. So, the first one gets the metarecord created in Pg 11, but it's the second one that does in Pg 10. |
16:19 |
Dyrcona |
Or, I should say, gets chosen as the master record. |
16:21 |
Dyrcona |
Looks like we get different numbers of metarecords created with Pg11+. So that's something to look into. I probably won't fix it today. |
16:24 |
Dyrcona |
Well, I found some more places that need fixes for relative XPath. |
16:33 |
Dyrcona |
Crazy. Looks like a bad XPATH was causing a bug in pre-Pg11 with metarecord creation. |
16:34 |
Dyrcona |
After fixing a XPath value in biblio.extract_quality, I now get the same results in Pg10 that I get/got in Pg14. |
16:35 |
Dyrcona |
That seems weird.... Maybe I had the versions backwards earlier when I said that I got less on Pg 14? |
16:36 |
Dyrcona |
However, now that it's fixed, the same record is chosen as the master record in Pg 10 as on Pg 14, so that fixes the pgtap test. |
16:38 |
Dyrcona |
Tests are great! :) |
16:43 |
Dyrcona |
I guess I didn't say. I thought that I got fewer on Pg 14, but whatever.... You're tired of my rambling. :) |
16:45 |
Bmagic |
Dyrcona: found oom messages in dmesg finally... adding swap |
16:45 |
Dyrcona |
Just noticed this with make check: WARNING: the following files are missing in your kit: lib/OpenILS/Utils/ISBN.pm Please inform the author. |
16:46 |
Dyrcona |
Bmagic: Well, you should add RAM before swap, but I still think you have some other pathological condition happening. |
16:46 |
|
jvwoolf2 left #evergreen |
16:49 |
Bmagic |
found another one: TCP: request_sock_TCP: Possible SYN flooding on port 5222. Sending cookies. Check SNMP counters. |
16:50 |
Dyrcona |
Well, 5222 is ejabberd. So you're getting a lot of connections from OpenSRF, but I think your SIPServer being hammered is causing that. |
16:50 |
Bmagic |
without a doubt |
16:50 |
Bmagic |
it should* handle it though |
16:51 |
Bmagic |
I'm moving up to a bigger machine now |
16:53 |
Dyrcona |
You said something that made it sound like auth was timing out earlier. Did you look into that? |
16:53 |
Bmagic |
if ejabberd could hang in there, and accommodate the login requests, eventually it would even out. It's all about handling the initial opening |
16:54 |
Bmagic |
Auth was timing out because ejabberd got pulled out from under it |
16:54 |
Dyrcona |
Bmagic: Have you changed the ejabberd shaper settings? |
16:55 |
Bmagic |
when I open the firewall, the requests start coming in. Logs look great. Successful login all over the place. Until about 30 seconds later or more, Ejabberd errors start flooding the logs |
16:55 |
Bmagic |
yeah, shaper is 10m, maybe go more? |
16:56 |
Dyrcona |
Ten million? I think that's too much. It's supposed to be 10,000 or so, IIRC. |
16:57 |
Dyrcona |
How many connections do you get? We run 4 SIPServers and we currently have about 30 connections each. We typically run about 40 each. |
16:57 |
Bmagic |
Sorry, that was max_stanza_size |
16:57 |
Bmagic |
shaper is Evergreen's standard 50000 |
16:58 |
Bmagic |
which, now that you mention it, I'm trying 100000 (that's bandwidth limits right?) |
16:59 |
Bmagic |
surely ejabberd would log something if we hit these things? Nothing... Perhaps I need to set it to debug... looking for that |
17:02 |
Dyrcona |
Bmagic: We suggest 500,000 IIRC. |
17:02 |
Bmagic |
"Change shaper: normal and fast values to 500000 " |
17:02 |
Bmagic |
that's what it was set to |
17:04 |
Dyrcona |
All right you said 50,000, guess it was a typo. |
17:04 |
Bmagic |
yeah, typo |
17:04 |
Bmagic |
https://serverfault.com/questions/518862/will-increasing-net-core-somaxconn-make-a-difference |
17:04 |
Bmagic |
I'm chasing this barking tree now |
17:06 |
Dyrcona |
Bmagic: If you're really overrunning somaxconn, you probably need another SIP vm. |
17:06 |
Bmagic |
could be |
17:06 |
Dyrcona |
128 simultaneous SIP connections is A LOT. |
17:06 |
|
mmorgan left #evergreen |
17:07 |
Dyrcona |
Well, it's normal, but we spread it over 4 VMs. |
17:07 |
Bmagic |
yeah, but I think you have to add all the internal connections for ejabberd, opensrf, etc |
17:07 |
Dyrcona |
Yes. |
17:08 |
Bmagic |
making that 128 balloon to thousands |
17:08 |
Dyrcona |
Not quite that many, probably triple. |
17:10 |
Dyrcona |
Also, somaxconn isn't number of total connections, it's the number of backlog connections. |
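To make the backlog point concrete: on Linux, `net.core.somaxconn` caps the queue of connections waiting to be accepted on a listening socket, and a larger backlog passed to `listen()` is silently reduced to that cap. A minimal sketch, assuming a Linux host that exposes the value under `/proc`; the helper names are hypothetical.

```python
from pathlib import Path


def read_somaxconn(path="/proc/sys/net/core/somaxconn"):
    """Return the listen-backlog cap as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def effective_backlog(requested, somaxconn):
    """Backlog actually applied to listen(): the request capped at somaxconn."""
    return requested if somaxconn is None else min(requested, somaxconn)


if __name__ == "__main__":
    cap = read_somaxconn()
    print("somaxconn:", cap)
    print("listen(1024) would queue at most:", effective_backlog(1024, cap))
```

The number says nothing about how many connections are open in total. It only bounds how many can wait in line while the server is too busy to accept them.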
17:16 |
Bmagic |
hmmm |
17:17 |
Bmagic |
that's what's mentioned in dmesg |
17:17 |
Bmagic |
this machine is 4CPU 16GB memory. Trying 8CPU 32GB memory now |
17:23 |
alynn26 |
Dyrcona, thanks for the help earlier. The directories had gotten their permissions screwed. |
17:23 |
alynn26 |
Dyrcona++ |
17:24 |
Dyrcona |
alynn26: Glad my suggestion was helpful. |
17:25 |
Dyrcona |
Bmagic: Ours are 16GB and 16CPU |
17:26 |
Bmagic |
it certainly uses the CPU, at least initially |
17:26 |
Dyrcona |
They don't usually that much RAM. |
17:26 |
Dyrcona |
use that much |
17:26 |
Dyrcona |
Yeah, it does use CPU. |
17:27 |
Dyrcona |
We don't share drones with bricks and SIP servers, but that is a possibility. |
17:27 |
Bmagic |
and then, pretty quickly the kernel does something where ejabberd is not allowed to talk, then the whole thing comes crashing down. That's what I think atm |
17:27 |
Dyrcona |
That sounds like you need another SIP server VM. |
17:28 |
Bmagic |
yeah, I'm trying that now. Updated DNS for the sip URL to hit the production load balancer, where it can spread over 7 bricks |
17:29 |
Dyrcona |
I have had OOM killer stop ejabberd before, but that has been a long time. You can tell OOM killer not to kill ejabberd or at least, consider other things first. |
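One way to ask the OOM killer to consider other processes first is the per-process `oom_score_adj` file on Linux, which accepts values from -1000 (never kill) to 1000 (kill first). A minimal sketch, assuming you already know the ejabberd PID and run as root, since lowering the value requires privileges. The function name and the `proc_root` parameter are hypothetical and exist only to make the helper testable.

```python
from pathlib import Path

OOM_SCORE_ADJ_MIN = -1000  # exempts the process from the OOM killer entirely
OOM_SCORE_ADJ_MAX = 1000   # makes the process the preferred victim


def set_oom_score_adj(pid, score, proc_root="/proc"):
    """Write score to <proc_root>/<pid>/oom_score_adj and return the path written."""
    if not OOM_SCORE_ADJ_MIN <= score <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"score must be between {OOM_SCORE_ADJ_MIN} and {OOM_SCORE_ADJ_MAX}")
    target = Path(proc_root) / str(pid) / "oom_score_adj"
    target.write_text(f"{score}\n")
    return target
```

A call such as `set_oom_score_adj(ejabberd_pid, -500)` would make ejabberd a less likely victim without exempting it completely. The setting is lost when the process restarts, so a service-manager option such as systemd's `OOMScoreAdjust=` may be the more durable place for it.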
17:29 |
Bmagic |
and I'm crossing my fingers that whatever the issue for the single machine isn't a problem for those machines... which would be, uh, not good |
17:29 |
Dyrcona |
Yeah. |
17:42 |
Dyrcona |
I should sign out, but whatever. I think I'll fix that OpenILS/Utils/ISBN.pm message. |
17:42 |
Bmagic |
heh |
17:42 |
Bmagic |
I'm having a great time myself |
17:43 |
Bmagic |
I'm dealing with "It's as if millions of SIP connections cried out, and were silenced" |
17:44 |
Bmagic |
a disturbance in the production force |
17:53 |
Bmagic |
holy cow |
17:53 |
Bmagic |
8cpu did it |
17:54 |
Bmagic |
CPU spiked for a short time, and calmed back down. Never saw it go over 4 |
17:57 |
Dyrcona |
Bmagic++ |
17:58 |
Dyrcona |
Just in time to go home, too! |
18:01 |
pinesol |
News from qatests: Testing Success <http://testing.evergreen-ils.org/~live> |
18:01 |
Dyrcona |
All right. I'm signing out now. Good night, #evergreen! |
18:18 |
|
jihpringle joined #evergreen |