Evergreen ILS Website

IRC log for #evergreen, 2021-11-15

| Channels | #evergreen index | Today | | Search | Google Search | Plain-Text | summary | Join Webchat

All times shown according to the server's local time.

Time Nick Message
01:58 dbs joined #evergreen
06:01 pinesol News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
07:26 rjackson_isl_hom joined #evergreen
08:12 collum joined #evergreen
08:13 Dyrcona joined #evergreen
08:23 Dyrcona Bleh... Sites that stick a floating banner at the top of every page when you try to print....
08:23 Dyrcona @blame Linked-In for ruining Print to PDF
08:23 pinesol Dyrcona: Linked-In HAXORED Dyrcona's SERVERZ!!!! for ruining Print to PDF
08:24 Dyrcona @blame Linked-In for other things
08:24 pinesol Dyrcona: Linked-In broke Evergreen. for other things
08:26 Dyrcona Also, this is an interesting read moderately related to Evergreen: https://www.linkedin.com/pulse/why-majo​rity-our-mfa-so-phishable-roger-grimes
08:35 mmorgan joined #evergreen
08:51 rfrasur joined #evergreen
09:22 awitter joined #evergreen
10:54 alynn26 joined #evergreen
11:13 Christineb joined #evergreen
11:40 Bmagic so, I'm opening the firewall for the SIP server to receive connections, and it gets hammered so hard, it's falling over. It seems that it gets stuck in a loop where SIPServer.pm throws an error about login timeout (Auth takes a few seconds sometimes) - and I think that's causing the external SIP clients to try again and again.
11:41 Bmagic Is there a way to make SIPServer.pm wait a little longer on Auth before giving up?
11:46 jihpringle joined #evergreen
12:15 csharp_ Bmagic: in our experience, it's going to be one or two clients out there that hammer the server so you might block by IP until you can see it working
12:21 jvwoolf joined #evergreen
12:27 berick Bmagic: what's the exact log message?
12:29 berick in any event, you can configure the login timeout in oils_sip.xml.  default is 60 seconds
12:30 berick but the login eval can fail for other reasons, which should show in the error logs
12:47 Dyrcona Bmagic: You might try installing the Socket::Linux Perl module. SIPServer will only use it if it is available, and it helps with timeout issues and clients. (I'm not saying it will fix your current problem, but it won't hurt.)
12:50 csharp_ this issue reminds me that I want to get back to testing berick's SIP proxy thing :-)
12:51 csharp_ gmcharlt++ # great presentation this morning on VuFind/Evergreen
12:56 collum joined #evergreen
13:10 jvwoolf1 joined #evergreen
13:11 jvwoolf2 joined #evergreen
13:15 Bmagic Dyrcona: let me see if I can get some logs that make sense and I'll post em
13:15 Bmagic It seems that ejabberd dies after a certain amount of pressure
13:15 Bmagic and then the logs get tons of ejabberd errors, but not at first
13:16 Bmagic Dyrcona: and that perl module is already installed
13:34 Dyrcona Well, I didn't make any promises. :)
13:36 Bmagic Dyrcona: The first signs of an issue comes from this message:
13:36 Bmagic [2021-11-15 13:33:31] SIPServer.pm [ERR :68922:AppSession.pm:132:] Attempting to build a client session as a server Session ID [<removed>], remote_id [opensrf@private.localhost/opensrf.sett​ings_drone_at_localhost_66952]
13:36 Bmagic not enough "settings" drones?
13:39 Bmagic Around the same time, in the ejabberd log:
13:40 Bmagic 2021-11-15 13:33:31.849 [info] <0.873.0>@ejabberd_c2s:process_terminated:271 (tcp|<0.873.0>) Closing c2s session for opensrf@private.localhost/client_at_server.org_68922: Connection failed: connection closed
13:41 Bmagic ulimits are set to 65535
13:41 Dyrcona Bmagic: I think the suggestion of figuring out which hosts are hammering you and then blocking them is the better way to go.
13:42 Bmagic hmm
13:43 Bmagic I shyed away from that because that didn't change. Same set of vendors/logins. Upgrading to a new version of Evergreen did happen
13:49 Bmagic The CPU on the machine spikes to unstable levels in about 60 seconds. I'll have to do some netstat's real quick and figure out the IP before it goes unresponsive
13:50 Dyrcona Are remote client IPs logged by SIPServer? I don't remember.
13:53 Dyrcona Bmagic: Is the real problem that auth is taking too long to complete?
13:53 Bmagic I don't think so anymore
13:53 Bmagic ejabberd is unstable somehow
13:54 Dyrcona ejabberd could be a symptom of too many connections, or you could need to assign more resources RAM or CPU to your VM.
13:55 Dyrcona Assuming this is a VM...
13:55 Dyrcona On an unrelated note, I think that I may have fixed the data load issues with new Pg versions.
14:02 Bmagic I did consider increasing the number of CPU's - thinking that it was dogpiling to a point where ejabberd died. It's theoretical at this point. It's strange that it worked and worked for years on the same machine until the upgrade....
14:03 Dyrcona Could be the software is botched. I sometimes get a bad install when building test VMs and have to redo it.
14:04 Dyrcona You upgraded from what version to which vesion?
14:06 Bmagic 3.5.4 -> 3.7.2
14:06 Bmagic I'm on my 4th rebuilt VM
14:07 Bmagic I wonder if I could install 3.5.4 and SIP would still work
14:07 Dyrcona Bmagic: It probably would.
14:07 Bmagic I'm gonna try it
14:17 Dyrcona Yeah. Making a minor change to the env_create.sql solves the Perl test failure. Some of the pgtap tests still fail on Pg 14.
14:17 Dyrcona The data load fix is to add ORDER BY id in two of the functions.
14:26 jeff well that's embarassingly obvious in hindsight. nice catch!
14:26 jeff Don't I feel silly for combing release notes looking for subtle breaking-to-us changes.
14:27 jihpringle joined #evergreen
14:34 Dyrcona :)
14:35 Dyrcona jeff: We may still have some of those. I'm looking at a vandelay test that breaks, and it looks like the import item is not being created.
14:37 Dyrcona import.item.invalid.circ_modifier Hm....
14:38 Dyrcona That's bizarre because the TEST circ modifier is there, and that's what the record tries to use.
15:01 Dyrcona ugh. I think this might be a change in xpath, libxml behavior.
15:06 Keith-isl I find myself stumped: have a library that gets an 'Offline transaction upload failed' error when they...well...try and upload (not process) offline transactions.
15:07 Keith-isl Issue occurs on all the workstations in their system, but offline uploads seem to work fine for different OU's.
15:09 Keith-isl If we create a workstation locally and do offline transactions as the problem OU and try and upload to an offline session created at that OU, we can replicate the error
15:11 Dyrcona Yeahp: Correctly handle relative path expressions in xmltable(), xpath(), and other XML-handling functions (Markus Winand)
15:11 Keith-isl Only place I can think to start troubleshooting is library is in another time zone, but otherwise I'm out of usual suspects to try and lean on for leads.
15:12 Dyrcona Keith-isl: What does your Require line for offline look like in eg.conf?
15:13 Dyrcona Oh never mind. I thought you couldn't replicate the error locally. /me should pay more attention.
15:14 jeffdavis Keith-isl: are you forcing HTTPS?
15:14 Keith-isl Dyrcona: Oh good - I don't have access to eg.conf anyway. I'm just a low-level investigator; configs are above my paygrade. :)
15:15 Keith-isl jeffdavis I believe so. I have a list of other libraries in a different time zone and have been trying to create offline sessions / uploads as those OU's. So far I haven't run into any difficulties like I have with patient 0.
15:15 alynn26 here is the offline config file
15:15 alynn26 <Directory "/openils/var/cgi-bin/offline">
15:15 alynn26 AddHandler cgi-script .pl
15:15 alynn26 AllowOverride None
15:15 alynn26 Options +ExecCGI
15:15 alynn26 Require all granted
15:15 alynn26 </Directory>
15:16 Dyrcona alynn26++
15:16 jeffdavis It used to be the case that /cgi-bin/offline/offline.pl could not support HTTPS so we had to make an exception to allow unencrypted connections for that one path. Not sure offhand if that's still an issue.
15:16 Dyrcona That looks right.
15:16 alynn26 the really weird thing is that it works for other ous
15:17 Dyrcona I was about to ask if this library is still using XULRunner, but that's probably not it if you have the problem with their OU locally.
15:18 Dyrcona It could be that their ou is at the wrong depth, but that should cause a whole raft of other problems.
15:20 Dyrcona jeff: Regarding the release notes, I think we'll need to audit and fix our relative XPATH code. "*[tag='value']" doesn't work anymore, but "//*[code='value']" does. I don't think the former is technically correct, but the latter is.
15:21 Keith-isl Other thing that probably doesn't help anyone else, but issue isn't occurring on our Migration server running 3.7, but is occurring on production server running 3.4
15:22 Keith-isl (More or less talking out loud to myself, unsure if that's actually useful in coming to any sort of determination)
15:23 Dyrcona Well, something's different between the two, and its probably not just the Evergreen version. I've had fun things like this happen before.
15:23 Dyrcona Not with offline specifically, though.
15:24 Dyrcona Have you checked permissions on their offline directory?
15:26 * Keith-isl bats eyelashes at alynn26
15:35 collum joined #evergreen
15:37 csharp_ so setting ejabberd loglevel to 5 (debug) works.... until is stops logging anything for some reason
15:37 csharp_ *it
15:38 csharp_ still trying to see if there's anything pointing to the cause of open-ils.actor falling offline
15:46 Dyrcona csharp_: Did you run out of disk space?
15:48 csharp_ Dyrcona: that was my first thought, but nope :-/
15:49 Dyrcona Speaking of disk space, setting up Did You Mean on our training server only uses about 10GB more space, the db grew from 380GB to 390GB.
15:55 Bmagic Dyrcona++ # interesting, I wonder what my problem was
15:56 Bmagic it was a production server, and the disk grew super huge without my knowing, until the backups kicked in, and I saw what was going on. I truncated the table and it went lower the disk usage by over 100GB
15:57 Bmagic Also - on this SIP issue, my working theory is it's the latest revision of 18.04. Something kernel something.... Trying a previous biuld. I'll keep you posted :)
16:00 jeff and then the backups kicked in... and my shoes started to squeak...
16:00 Bmagic jeff++ # I love it, not sure what it means. Kick = shoes?
16:05 jeff obscure recurring line in an early 90s song... original was "and then the horns kicked in / and my shoes started to squeak": https://open.spotify.com/tr​ack/0aSoxyXaiEGYioFxNzvzcz
16:05 jeff Before long, I was coming up / On this really weird part of my dream
16:05 jeff You know, the part where / I know how to tap dance
16:05 jeff But I can only do it / While wearing golf shoes
16:06 jeff received an absurd amount of radio airtime for how odd it was (or because of how odd it was?)
16:06 Bmagic dang
16:07 Bmagic at no point during you incoherent ramblings could anything be considered a rational thought. I award you no points and may god have mercy on your soul
16:08 jeff about four years before that movie. :-)
16:09 Bmagic :)
16:14 Dyrcona Interesting. I've found a record that produces a metabib.metarecord entry on Pg 14, but doesn't on Pg 10. It's one of the two inserted for the lp1731960_test_preserving_bookbag_entries.pg tests.
16:18 Dyrcona Ah, bizarre. It's the opposite of the behavior on Pg 10. So, the first one gets the metarecord created in Pg 11, but it's the second one that does in Pg 10.
16:19 Dyrcona Or, I should say, gets chosen as the master record.
16:21 Dyrcona Looks like we get different numbers of metarecords created with Pg11+. So that's something to look into. I probably won't fix it today.
16:24 Dyrcona Well, I found some more places that need fixes for relative XPath.
16:33 Dyrcona Crazy. Looks like a bad XPATH was causing a bug in pre-Pg11 with metarecord creation.
16:34 Dyrcona After fixing a XPath value in biblio.extract_quality, I now get the same results in Pg10 that I get/got in Pg14.
16:35 Dyrcona That seems weird.... Maybe I had the versions backwards earlier when I said that I got less on Pg 14?
16:36 Dyrcona However, now that it's fixed, the same record is chosen as the master record in Pg 10 as on Pg 14, so that fixes the pgtap test.
16:38 Dyrcona Tests are great! :)
16:43 Dyrcona I guess I didn't say. I thought that I got fewer on Pg 14, but whatever.... You're tired of my rambling. :)
16:45 Bmagic Dyrcona: found oom messages in dmesg finally... adding swap
16:45 Dyrcona Just noticed this with make check: WARNING: the following files are missing in your kit: lib/OpenILS/Utils/ISBN.pm Please inform the author.
16:46 Dyrcona Bmagic: Well, you should add RAM before swap, but I still think you have some other pathological condition happening.
16:46 jvwoolf2 left #evergreen
16:49 Bmagic found another one: TCP: request_sock_TCP: Possible SYN flooding on port 5222. Sending cookies.  Check SNMP counters.
16:50 Dyrcona Well, 5222 is ejabberd. So you're getting a lot of connections from OpenSRF, but I think your SIPServer being hammered is causing that.
16:50 Bmagic without a doubt
16:50 Bmagic it should* handle it though
16:51 Bmagic I'm moving up to a bigger machine now
16:53 Dyrcona You said something that made it sound like auth was timing out earlier. Did you look into that?
16:53 Bmagic if ejabberd could hang in there, and accomodate the login requests, eventually it would even out. It's all about handling the initial opening
16:54 Bmagic Auth was timing out because ejabberd got pulled out from under it
16:54 Dyrcona Bmagic: Have you changed the ejabberd shaper settings?
16:55 Bmagic when I open the firewall, the requests start coming in. Logs look great. Successfull login all over the place. Until about 30 seconds later or more, Ejabberd errors start flooding the logs
16:55 Bmagic yeah, shaper is 10m, maybe go more?
16:56 Dyrcona Ten million? I think that's too much. It's supposed to 10,000 or so, IIRC.
16:57 Dyrcona How many connections do you get? We run 4 SIPServers and we currently have about 30 connections each. We typically run about 40 each.
16:57 Bmagic Sorry, that was max_stanza_size
16:57 Bmagic shaper is Evergreen's standard 50000
16:58 Bmagic which, now that you mention it, I'm trying 100000 (that's bandwidth limits right?)
16:59 Bmagic surely ejabberd would log something if we hit these things? Nothing... Perhaps I need to set it to debug... looking for that
17:02 Dyrcona Bmagic; We suggest 500,000 IIRC.
17:02 Bmagic "Change shaper: normal and fast values to 500000 "
17:02 Bmagic that's what it was set to
17:04 Dyrcona All right you said 50,000, guess it was a typo.
17:04 Bmagic yeah, typo
17:04 Bmagic https://serverfault.com/questions/518862/will-i​ncreasing-net-core-somaxconn-make-a-difference
17:04 Bmagic I'm chasing this barking tree now
17:06 Dyrcona Bmagic: If you're really overrunning somaxconn, you probably need another SIP vm.
17:06 Bmagic could be
17:06 Dyrcona 128 simultaneous SIP connections is A LOT.
17:06 mmorgan left #evergreen
17:07 Dyrcona Well, it's normal, but we spread it over 4 VMs.
17:07 Bmagic yeah, but I think you have to add all the internal connections for ejabberd, opensrf, etc
17:07 Dyrcona Yes.
17:08 Bmagic making that 128 balloon to thousands
17:08 Dyrcona Not quite that many, probably triple.
17:10 Dyrcona Also, somaxconn isn't number of total connections, it's the number of backlog connections.
17:16 Bmagic hmmm
17:17 Bmagic that's what's mentioned in dmesg
17:17 Bmagic this machine is 4CPU 16GB memory. Trying 8CPU 32GB memory now
17:23 alynn26 Dyrcona, thanks for the help earlier.  The directories had gotten their permissions screwed.
17:23 alynn26 Dyrcona++
17:24 Dyrcona alynn26: Glad my suggestion was helpful.
17:25 Dyrcona Bmagic: Ours are 16GB and 16CPU
17:26 Bmagic it certainly uses the CPU, at least initially
17:26 Dyrcona They don't usually that much RAM.
17:26 Dyrcona use that much
17:26 Dyrcona Yeah, it does use CPU.
17:27 Dyrcona We don't share drones with bricks and SIP servers, but that is a possibility.
17:27 Bmagic and then, pretty quickly the kernel does something where ejabberd is not allowed to talk, then the whole thing comes crashing down. That's what I think atm
17:27 Dyrcona That sounds like you need another SIP server VM.
17:28 Bmagic yeah, I'm trying that now. Updated DNS for the sip URL to hit the production load balancer, where it can spread over 7 bricks
17:29 Dyrcona I have had OOM killer stop ejabberd before, but that has been a long time. You can tell OOM killer not to kill ejabberd or at least, consider other things first.
17:29 Bmagic and I'm crossing my fingers that whatever the issue for the single machine isn't a problem for those machines... which would be, uh, not good
17:29 Dyrcona Yeah.
17:42 Dyrcona I should sign out, but whatever. I think I'll fix that OpenILS/Utils/ISBN.pm message.
17:42 Bmagic heh
17:42 Bmagic I'm having a great time myself
17:43 Bmagic I'm dealing with "It's as if millions of SIP connections cried out, and were silenced"
17:44 Bmagic a disturbance in the production force
17:53 Bmagic holy cow
17:53 Bmagic 8cpu did it
17:54 Bmagic CPU spiked for a short time, and calmed back down. Never saw it go over 4
17:57 Dyrcona Bmagic++
17:58 Dyrcona Just in time to go home, too!
18:01 pinesol News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
18:01 Dyrcona All right. I'm signing out now. Good night, #evergreen!
18:18 jihpringle joined #evergreen

| Channels | #evergreen index | Today | | Search | Google Search | Plain-Text | summary | Join Webchat