IRC log for #evergreen, 2021-11-15

All times shown according to the server's local time.

Time	Nick	Message
01:58		dbs joined #evergreen
06:01	pinesol	News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
07:26		rjackson_isl_hom joined #evergreen
08:12		collum joined #evergreen
08:13		Dyrcona joined #evergreen
08:23	Dyrcona	Bleh... Sites that stick a floating banner at the top of every page when you try to print....
08:23	Dyrcona	@blame Linked-In for ruining Print to PDF
08:23	pinesol	Dyrcona: Linked-In HAXORED Dyrcona's SERVERZ!!!! for ruining Print to PDF
08:24	Dyrcona	@blame Linked-In for other things
08:24	pinesol	Dyrcona: Linked-In broke Evergreen. for other things
08:26	Dyrcona	Also, this is an interesting read moderately related to Evergreen: https://www.linkedin.com/pulse/why-majority-our-mfa-so-phishable-roger-grimes
08:35		mmorgan joined #evergreen
08:51		rfrasur joined #evergreen
09:22		awitter joined #evergreen
10:54		alynn26 joined #evergreen
11:13		Christineb joined #evergreen
11:40	Bmagic	so, I'm opening the firewall for the SIP server to receive connections, and it gets hammered so hard, it's falling over. It seems that it gets stuck in a loop where SIPServer.pm throws an error about login timeout (Auth takes a few seconds sometimes) - and I think that's causing the external SIP clients to try again and again.
11:41	Bmagic	Is there a way to make SIPServer.pm wait a little longer on Auth before giving up?
11:46		jihpringle joined #evergreen
12:15	csharp_	Bmagic: in our experience, it's going to be one or two clients out there that hammer the server so you might block by IP until you can see it working
12:21		jvwoolf joined #evergreen
12:27	berick	Bmagic: what's the exact log message?
12:29	berick	in any event, you can configure the login timeout in oils_sip.xml. default is 60 seconds
12:30	berick	but the login eval can fail for other reasons, which should show in the error logs
12:47	Dyrcona	Bmagic: You might try installing the Socket::Linux Perl module. SIPServer will only use it if it is available, and it helps with timeout issues and clients. (I'm not saying it will fix your current problem, but it won't hurt.)
12:50	csharp_	this issue reminds me that I want to get back to testing berick's SIP proxy thing :-)
12:51	csharp_	gmcharlt++ # great presentation this morning on VuFind/Evergreen
12:56		collum joined #evergreen
13:10		jvwoolf1 joined #evergreen
13:11		jvwoolf2 joined #evergreen
13:15	Bmagic	Dyrcona: let me see if I can get some logs that make sense and I'll post em
13:15	Bmagic	It seems that ejabberd dies after a certain amount of pressure
13:15	Bmagic	and then the logs get tons of ejabberd errors, but not at first
13:16	Bmagic	Dyrcona: and that perl module is already installed
13:34	Dyrcona	Well, I didn't make any promises. :)
13:36	Bmagic	Dyrcona: The first signs of an issue comes from this message:
13:36	Bmagic	[2021-11-15 13:33:31] SIPServer.pm [ERR :68922:AppSession.pm:132:] Attempting to build a client session as a server Session ID [<removed>], remote_id [opensrfprivate.localhost/opensrf.settings_drone_at_localhost_66952]
13:36	Bmagic	not enough "settings" drones?
13:39	Bmagic	Around the same time, in the ejabberd log:
13:40	Bmagic	2021-11-15 13:33:31.849 [info] <0.873.0>@ejabberd_c2s:process_terminated:271 (tcp\|<0.873.0>) Closing c2s session for opensrfprivate.localhost/client_at_server.org_68922: Connection failed: connection closed
13:41	Bmagic	ulimits are set to 65535
13:41	Dyrcona	Bmagic: I think the suggestion of figuring out which hosts are hammering you and then blocking them is the better way to go.
13:42	Bmagic	hmm
13:43	Bmagic	I shyed away from that because that didn't change. Same set of vendors/logins. Upgrading to a new version of Evergreen did happen
13:49	Bmagic	The CPU on the machine spikes to unstable levels in about 60 seconds. I'll have to do some netstat's real quick and figure out the IP before it goes unresponsive
13:50	Dyrcona	Are remote client IPs logged by SIPServer? I don't remember.
13:53	Dyrcona	Bmagic: Is the real problem that auth is taking too long to complete?
13:53	Bmagic	I don't think so anymore
13:53	Bmagic	ejabberd is unstable somehow
13:54	Dyrcona	ejabberd could be a symptom of too many connections, or you could need to assign more resources RAM or CPU to your VM.
13:55	Dyrcona	Assuming this is a VM...
13:55	Dyrcona	On an unrelated note, I think that I may have fixed the data load issues with new Pg versions.
14:02	Bmagic	I did consider increasing the number of CPU's - thinking that it was dogpiling to a point where ejabberd died. It's theoretical at this point. It's strange that it worked and worked for years on the same machine until the upgrade....
14:03	Dyrcona	Could be the software is botched. I sometimes get a bad install when building test VMs and have to redo it.
14:04	Dyrcona	You upgraded from what version to which vesion?
14:06	Bmagic	3.5.4 -> 3.7.2
14:06	Bmagic	I'm on my 4th rebuilt VM
14:07	Bmagic	I wonder if I could install 3.5.4 and SIP would still work
14:07	Dyrcona	Bmagic: It probably would.
14:07	Bmagic	I'm gonna try it
14:17	Dyrcona	Yeah. Making a minor change to the env_create.sql solves the Perl test failure. Some of the pgtap tests still fail on Pg 14.
14:17	Dyrcona	The data load fix is to add ORDER BY id in two of the functions.
14:26	jeff	well that's embarassingly obvious in hindsight. nice catch!
14:26	jeff	Don't I feel silly for combing release notes looking for subtle breaking-to-us changes.
14:27		jihpringle joined #evergreen
14:34	Dyrcona	:)
14:35	Dyrcona	jeff: We may still have some of those. I'm looking at a vandelay test that breaks, and it looks like the import item is not being created.
14:37	Dyrcona	import.item.invalid.circ_modifier Hm....
14:38	Dyrcona	That's bizarre because the TEST circ modifier is there, and that's what the record tries to use.
15:01	Dyrcona	ugh. I think this might be a change in xpath, libxml behavior.
15:06	Keith-isl	I find myself stumped: have a library that gets an 'Offline transaction upload failed' error when they...well...try and upload (not process) offline transactions.
15:07	Keith-isl	Issue occurs on all the workstations in their system, but offline uploads seem to work fine for different OU's.
15:09	Keith-isl	If we create a workstation locally and do offline transactions as the problem OU and try and upload to an offline session created at that OU, we can replicate the error
15:11	Dyrcona	Yeahp: Correctly handle relative path expressions in xmltable(), xpath(), and other XML-handling functions (Markus Winand)
15:11	Keith-isl	Only place I can think to start troubleshooting is library is in another time zone, but otherwise I'm out of usual suspects to try and lean on for leads.
15:12	Dyrcona	Keith-isl: What does your Require line for offline look like in eg.conf?
15:13	Dyrcona	Oh never mind. I thought you couldn't replicate the error locally. /me should pay more attention.
15:14	jeffdavis	Keith-isl: are you forcing HTTPS?
15:14	Keith-isl	Dyrcona: Oh good - I don't have access to eg.conf anyway. I'm just a low-level investigator; configs are above my paygrade. :)
15:15	Keith-isl	jeffdavis I believe so. I have a list of other libraries in a different time zone and have been trying to create offline sessions / uploads as those OU's. So far I haven't run into any difficulties like I have with patient 0.
15:15	alynn26	here is the offline config file
15:15	alynn26	<Directory "/openils/var/cgi-bin/offline">
15:15	alynn26	AddHandler cgi-script .pl
15:15	alynn26	AllowOverride None
15:15	alynn26	Options +ExecCGI
15:15	alynn26	Require all granted
15:15	alynn26	</Directory>
15:16	Dyrcona	alynn26++
15:16	jeffdavis	It used to be the case that /cgi-bin/offline/offline.pl could not support HTTPS so we had to make an exception to allow unencrypted connections for that one path. Not sure offhand if that's still an issue.
15:16	Dyrcona	That looks right.
15:16	alynn26	the really weird thing is that it works for other ous
15:17	Dyrcona	I was about to ask if this library is still using XULRunner, but that's probably not it if you have the problem with their OU locally.
15:18	Dyrcona	It could be that their ou is at the wrong depth, but that should cause a whole raft of other problems.
15:20	Dyrcona	jeff: Regarding the release notes, I think we'll need to audit and fix our relative XPATH code. "[tag='value']" doesn't work anymore, but "//[code='value']" does. I don't think the former is technically correct, but the latter is.
15:21	Keith-isl	Other thing that probably doesn't help anyone else, but issue isn't occurring on our Migration server running 3.7, but is occurring on production server running 3.4
15:22	Keith-isl	(More or less talking out loud to myself, unsure if that's actually useful in coming to any sort of determination)
15:23	Dyrcona	Well, something's different between the two, and its probably not just the Evergreen version. I've had fun things like this happen before.
15:23	Dyrcona	Not with offline specifically, though.
15:24	Dyrcona	Have you checked permissions on their offline directory?
15:26	* Keith-isl	bats eyelashes at alynn26
15:35		collum joined #evergreen
15:37	csharp_	so setting ejabberd loglevel to 5 (debug) works.... until is stops logging anything for some reason
15:37	csharp_	*it
15:38	csharp_	still trying to see if there's anything pointing to the cause of open-ils.actor falling offline
15:46	Dyrcona	csharp_: Did you run out of disk space?
15:48	csharp_	Dyrcona: that was my first thought, but nope :-/
15:49	Dyrcona	Speaking of disk space, setting up Did You Mean on our training server only uses about 10GB more space, the db grew from 380GB to 390GB.
15:55	Bmagic	Dyrcona++ # interesting, I wonder what my problem was
15:56	Bmagic	it was a production server, and the disk grew super huge without my knowing, until the backups kicked in, and I saw what was going on. I truncated the table and it went lower the disk usage by over 100GB
15:57	Bmagic	Also - on this SIP issue, my working theory is it's the latest revision of 18.04. Something kernel something.... Trying a previous biuld. I'll keep you posted :)
16:00	jeff	and then the backups kicked in... and my shoes started to squeak...
16:00	Bmagic	jeff++ # I love it, not sure what it means. Kick = shoes?
16:05	jeff	obscure recurring line in an early 90s song... original was "and then the horns kicked in / and my shoes started to squeak": https://open.spotify.com/track/0aSoxyXaiEGYioFxNzvzcz
16:05	jeff	Before long, I was coming up / On this really weird part of my dream
16:05	jeff	You know, the part where / I know how to tap dance
16:05	jeff	But I can only do it / While wearing golf shoes
16:06	jeff	received an absurd amount of radio airtime for how odd it was (or because of how odd it was?)
16:06	Bmagic	dang
16:07	Bmagic	at no point during you incoherent ramblings could anything be considered a rational thought. I award you no points and may god have mercy on your soul
16:08	jeff	about four years before that movie. :-)
16:09	Bmagic	:)
16:14	Dyrcona	Interesting. I've found a record that produces a metabib.metarecord entry on Pg 14, but doesn't on Pg 10. It's one of the two inserted for the lp1731960_test_preserving_bookbag_entries.pg tests.
16:18	Dyrcona	Ah, bizarre. It's the opposite of the behavior on Pg 10. So, the first one gets the metarecord created in Pg 11, but it's the second one that does in Pg 10.
16:19	Dyrcona	Or, I should say, gets chosen as the master record.
16:21	Dyrcona	Looks like we get different numbers of metarecords created with Pg11+. So that's something to look into. I probably won't fix it today.
16:24	Dyrcona	Well, I found some more places that need fixes for relative XPath.
16:33	Dyrcona	Crazy. Looks like a bad XPATH was causing a bug in pre-Pg11 with metarecord creation.
16:34	Dyrcona	After fixing a XPath value in biblio.extract_quality, I now get the same results in Pg10 that I get/got in Pg14.
16:35	Dyrcona	That seems weird.... Maybe I had the versions backwards earlier when I said that I got less on Pg 14?
16:36	Dyrcona	However, now that it's fixed, the same record is chosen as the master record in Pg 10 as on Pg 14, so that fixes the pgtap test.
16:38	Dyrcona	Tests are great! :)
16:43	Dyrcona	I guess I didn't say. I thought that I got fewer on Pg 14, but whatever.... You're tired of my rambling. :)
16:45	Bmagic	Dyrcona: found oom messages in dmesg finally... adding swap
16:45	Dyrcona	Just noticed this with make check: WARNING: the following files are missing in your kit: lib/OpenILS/Utils/ISBN.pm Please inform the author.
16:46	Dyrcona	Bmagic: Well, you should add RAM before swap, but I still think you have some other pathological condition happening.
16:46		jvwoolf2 left #evergreen
16:49	Bmagic	found another one: TCP: request_sock_TCP: Possible SYN flooding on port 5222. Sending cookies. Check SNMP counters.
16:50	Dyrcona	Well, 5222 is ejabberd. So you're getting a lot of connections from OpenSRF, but I think your SIPServer being hammered is causing that.
16:50	Bmagic	without a doubt
16:50	Bmagic	it should* handle it though
16:51	Bmagic	I'm moving up to a bigger machine now
16:53	Dyrcona	You said something that made it sound like auth was timing out earlier. Did you look into that?
16:53	Bmagic	if ejabberd could hang in there, and accomodate the login requests, eventually it would even out. It's all about handling the initial opening
16:54	Bmagic	Auth was timing out because ejabberd got pulled out from under it
16:54	Dyrcona	Bmagic: Have you changed the ejabberd shaper settings?
16:55	Bmagic	when I open the firewall, the requests start coming in. Logs look great. Successfull login all over the place. Until about 30 seconds later or more, Ejabberd errors start flooding the logs
16:55	Bmagic	yeah, shaper is 10m, maybe go more?
16:56	Dyrcona	Ten million? I think that's too much. It's supposed to 10,000 or so, IIRC.
16:57	Dyrcona	How many connections do you get? We run 4 SIPServers and we currently have about 30 connections each. We typically run about 40 each.
16:57	Bmagic	Sorry, that was max_stanza_size
16:57	Bmagic	shaper is Evergreen's standard 50000
16:58	Bmagic	which, now that you mention it, I'm trying 100000 (that's bandwidth limits right?)
16:59	Bmagic	surely ejabberd would log something if we hit these things? Nothing... Perhaps I need to set it to debug... looking for that
17:02	Dyrcona	Bmagic; We suggest 500,000 IIRC.
17:02	Bmagic	"Change shaper: normal and fast values to 500000 "
17:02	Bmagic	that's what it was set to
17:04	Dyrcona	All right you said 50,000, guess it was a typo.
17:04	Bmagic	yeah, typo
17:04	Bmagic	https://serverfault.com/questions/518862/will-increasing-net-core-somaxconn-make-a-difference
17:04	Bmagic	I'm chasing this barking tree now
17:06	Dyrcona	Bmagic: If you're really overrunning somaxconn, you probably need another SIP vm.
17:06	Bmagic	could be
17:06	Dyrcona	128 simultaneous SIP connections is A LOT.
17:06		mmorgan left #evergreen
17:07	Dyrcona	Well, it's normal, but we spread it over 4 VMs.
17:07	Bmagic	yeah, but I think you have to add all the internal connections for ejabberd, opensrf, etc
17:07	Dyrcona	Yes.
17:08	Bmagic	making that 128 balloon to thousands
17:08	Dyrcona	Not quite that many, probably triple.
17:10	Dyrcona	Also, somaxconn isn't number of total connections, it's the number of backlog connections.
17:16	Bmagic	hmmm
17:17	Bmagic	that's what's mentioned in dmesg
17:17	Bmagic	this machine is 4CPU 16GB memory. Trying 8CPU 32GB memory now
17:23	alynn26	Dyrcona, thanks for the help earlier. The directories had gotten their permissions screwed.
17:23	alynn26	Dyrcona++
17:24	Dyrcona	alynn26: Glad my suggestion was helpful.
17:25	Dyrcona	Bmagic: Ours are 16GB and 16CPU
17:26	Bmagic	it certainly uses the CPU, at least initially
17:26	Dyrcona	They don't usually that much RAM.
17:26	Dyrcona	use that much
17:26	Dyrcona	Yeah, it does use CPU.
17:27	Dyrcona	We don't share drones with bricks and SIP servers, but that is a possibility.
17:27	Bmagic	and then, pretty quickly the kernel does something where ejabberd is not allowed to talk, then the whole thing comes crashing down. That's what I think atm
17:27	Dyrcona	That sounds like you need another SIP server VM.
17:28	Bmagic	yeah, I'm trying that now. Updated DNS for the sip URL to hit the production load balancer, where it can spread over 7 bricks
17:29	Dyrcona	I have had OOM killer stop ejabberd before, but that has been a long time. You can tell OOM killer not to kill ejabberd or at least, consider other things first.
17:29	Bmagic	and I'm crossing my fingers that whatever the issue for the single machine isn't a problem for those machines... which would be, uh, not good
17:29	Dyrcona	Yeah.
17:42	Dyrcona	I should sign out, but whatever. I think I'll fix that OpenILS/Utils/ISBN.pm message.
17:42	Bmagic	heh
17:42	Bmagic	I'm having a great time myself
17:43	Bmagic	I'm dealing with "It's as if millions of SIP connections cried out, and were silenced"
17:44	Bmagic	a disturbance in the production force
17:53	Bmagic	holy cow
17:53	Bmagic	8cpu did it
17:54	Bmagic	CPU spiked for a short time, and calmed back down. Never saw it go over 4
17:57	Dyrcona	Bmagic++
17:58	Dyrcona	Just in time to go home, too!
18:01	pinesol	News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
18:01	Dyrcona	All right. I'm signing out now. Good night, #evergreen!
18:18		jihpringle joined #evergreen