IRC log for #evergreen, 2022-01-25

All times shown according to the server's local time.

Time	Nick	Message
06:02	pinesol	News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
07:06		JBoyer_ joined #evergreen
07:38		rjackson_isl_hom joined #evergreen
07:49		collum joined #evergreen
07:58	JBoyer	Shopping for storage during a chip shortage is a real Soylent Drive situation. :-/ "They're made of chips!"
08:33		jvwoolf joined #evergreen
08:35		mantis1 joined #evergreen
08:36		rfrasur joined #evergreen
08:40		mmorgan joined #evergreen
08:44		jvwoolf1 joined #evergreen
08:46		Dyrcona joined #evergreen
09:13		terranm joined #evergreen
09:49	miker	csharp_: re mod_stream_mgmt from yesterday, that has to be supported (and requested) by the client, and our xmpp implementation doesn't know about stream management, so it's def not related to that. at least it's something to cross off the list?
09:50	csharp_	miker: thanks
09:51	csharp_	I think at this point we're going to try to reroute cataloging to the AngJS UIs and see if there's an improvement
09:51	csharp_	stabbing in the dark on this since logs are so opaque
09:52	csharp_	also interested in automated ways to detect the outage and automatically restart the down service
09:53	csharp_	osrf_control --diagnostic is slow to realize that a listener is dead
09:53	csharp_	osrf_control --service X --restart is leaving orphaned drone procs running too
09:58	csharp_	(which is a limitation of the way that sub is implemented in that the proc that it would normally signal is no longer running and it doesn't do anything further to look for orphans)
10:17	miker	csharp_: is actor the only perl service that has such a high drone count? (c is so much faster that I wouldn't be surprised if it effectively never had this issue) ... I wonder if you're bumping up against a takes-just-a-little-too-long to do all the child pipe management stuff between listener servicing
10:17	pastebot	"JBoyer" at 168.25.130.30 pasted "A simple script to clean up orphaned Drones" (28 lines) at http://paste.evergreen-ils.org/14443
10:18	JBoyer	csharp_, Not sure it helps much but I use a modified version of the above to cleanup orphaned drones en mass
10:22	csharp_	JBoyer: thanks!
10:23	JBoyer	As for the service, you could add a script to opensrf's cron on service machines to check once a minute whether a process matching 'OpenSRF Listener [open-ils.actor]' is found, and call osrf_control to restart it if not. Obviously not a fix, but may allow you to at least breathe a bit while tracking it down.
10:23	csharp_	miker: hmm - so if I'm understanding correctly, we would reduce the max children for actor to see if that helps?
10:23	csharp_	we typically aren't hitting the max
10:23	csharp_	(meaning there's room to reduce)
10:24	berick	was also wondering if maybe you've hit some practical limit
10:24	JBoyer	100+ is awfully high unless it were your only machine running the service, and even then 40-50 may be more than enough.
10:25	miker	well, or spread them out. but I admit it's hard to imagine that you have 1152 (192 x 6) concurrent actor requests.
10:26	csharp_	we kept bumping it up to accommodate the actor drone parallel request issues
10:33	csharp_	we have 120 open-ils.circ max children - I could probably drop that down too if needed, but I've moved actor from 192 to 72 per server
10:33		nfBurton joined #evergreen
10:38	Dyrcona	We have 150 max actor children per brick and still run out sometimes.
10:39	Dyrcona	I sometimes consider sharing the drones among all the bricks, but haven't gotten to that point quite yet.
10:41	csharp_	just lost actor again after that change
10:43	csharp_	yeah - it's going offline again in multiple places
10:44	csharp_	like three servers over the course of a few minutes - this is how it's going
10:44	csharp_	and the number of actor drones running is not apparently relevant
10:45	csharp_	meaning that they were low, percentage-wise
10:45	csharp_	I really wish I could see into what's actually getting dropped
10:46	csharp_	so far that's my biggest frustration
10:46	csharp_	also possibly notable that the issue starts right before 11:00 a.m. each day
10:47	Bmagic	csharp_ check ulimits! That's bitten me a few times
10:48	csharp_	Bmagic: I made recommended adjustments per Dyrcona and jeffdavis
10:48	Bmagic	oh, I guess i didn't see that chat up there
10:48		Keith-isl joined #evergreen
10:49	nfBurton	Is there a link to what to do to rerun action_triggers? I had an issue with a template I fixed and new ones are running but trying to run the past week again keeps coming up as invalid....
10:49	Bmagic	nfBurton: I've always "reset" the row in action_trigger.event so that it "looks" like it needs to be run
10:49	nfBurton	Same
10:49	nfBurton	I get Use of uninitialized value in subtraction (-) at /usr/local/share/perl/5.22.1/OpenSRF/AppSession.pm line 952. In logs.
10:50	nfBurton	The old ones just won't complete. But I cleared the event to rerun
10:50	csharp_	and now I'm maxxed out on actor children - going to bump them back up
10:51	mmorgan	nfBurton: What kind of triggers are they?
10:51	nfBurton	Hold Notification Email
10:51	Bmagic	just to be clear, I don't delete the row in action_trigger.event. I update the columns to mostly be null and status to "pending", usually referring to another row that is pending to figure out which columns to nullify
10:51	nfBurton	update action_trigger."event" set run_time = NOW(), add_time = NOW(), start_time = null, update_time = null, update_process = null, state = 'pending' where event_def = 5 and state = 'invalid' and DATE(add_time) > ('2022-01-18')
10:51	Bmagic	yeah, like that :)
10:51	nfBurton	yeah
10:51	mmorgan	nfBurton: Maybe they're invalid because the holds have been picked up?
10:52	nfBurton	They still are all 'invalid'
10:52	berick	csharp_: did you ever get it to fail with debug logging?
10:52	nfBurton	hmmm maybe
10:52	Bmagic	nfBurton: that means that the validator checked into the details and decided that the trigger shouldn't fire
10:52	Bmagic	what's the validator set to on the trigger def?
10:52	nfBurton	I just wanted to be sure I didn't miss something cause it's over 300 for the past week. That makes sense though I guess
10:53	csharp_	berick: yes
10:53	nfBurton	It's too bad it doesn't have an error output or something for it to tell why
10:53	csharp_	but the logs are so dense I'm kind of lost for what to look for
10:54	nfBurton	Validator - HoldIsAvailable - Check that an item is on the hold shelf
10:54	Bmagic	nfBurton: you could create a machine special to run these. Changing the granulairty to something outside of your normal. And on this temporary machine, you can edit the Validator code to log more stuff
10:54	mmorgan	nfBurton: In my experience, triggers with invalid state are not errors, what Bmagic said. Hold is no longer ready for pickup, item is no longer overdue, etc.
10:55	berick	csharp_: hm, ok. i missed that. one more question, anything interesting in /openils/var/log/open-ils.actor_stderr.log on the afffected server?
10:55	nfBurton	So yeah, them being checked out makes sense. Thanks for the sanity check. I do get an error in the logs that basically looks like it errored on a complete time.
10:55	nfBurton	So I was unsure of why
10:56	Bmagic	the held item has a status other than "on holds shelf" - then yeah, it should invalidate
10:56	nfBurton	But it's because they didn't actually complete because there wasn't a need
10:56	nfBurton	cool
11:04	Dyrcona	nfburton: If the validation fails, then the events should have a status of "invalid" after the run. A state of "error" usually means something else.
11:05	nfBurton	Yeah, I got that now. Invalid sounds as bad, but doesn't have the error output lol
11:05	nfBurton	I expected way more to run
11:11	Dyrcona	csharp_: I'm out of ideas for the time being.
11:11	csharp_	berick: there are messages there, but they don't correspond timewise with the issue
11:11	* berick	nods
11:13	berick	csharp_: for the debug logs, i would start by grepping just on the PID of the listener that dies and for a brief period leading up to its death. may be a longshot.
11:14	Dyrcona	nfBurton: I have had events get an 'error' status with no error output sometimes.
11:14	csharp_	berick: thanks - I'll take a look again
11:54		jihpringle joined #evergreen
12:27		awitter joined #evergreen
12:28	csharp_	hmm - the timing may be different than was I was assuming. Now that I'm catching these in real time, it looks like we see "WS received XMPP error message in response to thread=0.55655771755080671643131332366 and recipient=routerpublic.brick01-head.gapines.org/open-ils.actor. Likely the recipient is not accessible/available."
12:29	csharp_	but ps still shows the listener as running at that poing
12:29	csharp_	point
12:30	csharp_	learned this because my attempt to speed up checking to less than once per minute (while true; do blah && sleep 20; done) haven't restarted the process
12:30	csharp_	also explains maybe why osrf_control doesn't notice it immediately too
12:33		abowling left #evergreen
12:44		collum joined #evergreen
12:44	pinesol	[evergreen\|gmontimantis] Docs: Update barcode_completion_grid.jpg - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=be24afa>
13:14	pinesol	[evergreen\|gmontimantis] Docs: updating z39.50 overlay doc and images - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=0e977d4>
13:14	pinesol	[evergreen\|Jane Sandberg] Docs: use asciidoc ordered list rather than adding numbers manually - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=7ad46ac>
13:31		sandbergja joined #evergreen
13:36		terranm joined #evergreen
13:40	alynn26	We had a hicup on our database server last Monday, and since then we been getting "No connection to the server" messages Failed to set timezone messges
13:41		sandbergja joined #evergreen
13:44		terranm joined #evergreen
13:45		jihpringle joined #evergreen
13:46	Dyrcona	alynn26: What do you mean "hiccup?"
13:47	alynn26	It just started not allowing connections.
13:49	jeffdavis	Have you restarted Evergreen services since then?
13:49	jeff	that specific "Failed to set timezone" message likely comes from OpenILS/Application/Storage/Driver/Pg/storage.pm
13:50	alynn26	jeffdavis, several times, even rebooted the db to see if that helped.
13:50	alynn26	The Timezone messages are the only thing that is different from before
13:51	jeff	it should show what value it's trying to set.
13:51	jeff	"Failed to set timezone: something-here"
13:51	jeff	does it?
13:51	alynn26	America/New_York
13:51	alynn26	which is correct.
13:52	alynn26	open-ils.storage: [WARN:2167:Application.pm:624:1642450292163715] no_tz.open-ils.storage.transaction.begin: DBD::Pg::db do failed: no connection to the server [for Statement "SET LOCAL timezone TO ?;" with ParamValues: 1='America/New_York'] at /usr/local/share/perl/5.26.1/OpenILS/Application/Storage/Driver/Pg/storage.pm line 65.
13:52	alynn26	open-ils.storage: [WARN:2167:storage.pm:69:1642450292163715] Failed to set timezone: America/New_York
13:54	jeff	"failed: no connection to the server"
13:54	jeff	how long has that process been running?
13:57	alynn26	most of them when we see it are already died off.
13:57	jeff	ah, that probably makes sense. it probably doesn't stick around after that point.
13:58	jeff	is open-ils.storage connecting directly to the database server, or is there something like pgbouncer between it and the database server?
13:58	alynn26	direct connection
13:58	jeff	as part of fixing the "hiccup", did the connection details change for the database server?
13:58	jeff	are you close to your max connections on the database server?
13:59	alynn26	nope they did not. no where near close to max connections,
13:59	jeff	are there errors in the postgresql logs regarding these clients?
14:00	jeff	did the process in question do something interesting / log something interesting before the "no connection to the server" / "Failed to set timezone" warning?
14:02	alynn26	nope they did not. No other errors that I can find or see. Been looking at log files the last two days. and everything was normal, then wam. no connections.
14:05	Dyrcona	alynn26: Is the db in recovery mode?
14:06	Dyrcona	You can tell by doing "pgrep -af postgres" on the db server I think.
14:07	jeff	What was the cause/fix of the "hiccup"? Did the timezone warnings start right away, or not for a while?
14:08	alynn26	No, the db is not in recovery mode.
14:11	alynn26	One other issue we have is we have a lot of "postgres: 9.6/main: evergreen evergreen 10.1.2.17(55738) idle
14:11	alynn26	"
14:13	alynn26	the fix was to kill out most of the idle processes. which cleared up ram.
14:14	Dyrcona	alynn26: Have you verified that the correct postgresql configuration is in place?
14:14	Dyrcona	It could be you don't have enough connections available.
14:15	alynn26	we have 6000 and usually running around 2000
14:16	JBoyer	alynn26, Those "idle" processes are PostgreSQL backends that are actively listening to a connection, such as a cstore drone or something. They will automatically go away when the connection closes. (like when a drone is recycled or otherwise goes away)
14:16	JBoyer	killing them on the db directly can cause problems for Postgres and the OpenSRF processes talking to them.
14:18	Dyrcona	If 10.1.2.17 has a larger than usual number of connections, you might want to log in to it and see what's going on there. Maybe something is misbehaving?
15:49		jihpringle joined #evergreen
15:51	pinesol	[evergreen\|Bill Erickson] LP1948035 Update Node/Angular Deps for Ang 12 - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=a46938a>
15:51	pinesol	[evergreen\|Bill Erickson] LP1948035 Angular 12 updates - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=9b7d6dd>
15:51	pinesol	[evergreen\|Jane Sandberg] LP1948035 (follow-up): don't use commonjs in spec tsconfig - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=e7b98ca>
16:15		awitter joined #evergreen
16:36	Dyrcona	I've noticed that the Angular staff catalog in 3.7 doesn't use the preferred library when searching. Is that expected, known, etc.?
16:43	terranm	We're in 3.8, but if we search by the whole consortium, the preferred library's copies are floating to the top of the holdings list as expected.
16:45	terranm	The only thing that I know for sure that is buggy with the Angular search preferences is the stickiness of the Exclude Electronic Resources checkbox
16:45	jihpringle	Drycona: yup, that's a known issue in 3.7
16:45	jihpringle	Dyrcona ^^
16:46	Dyrcona	I'm looking into patron search freezing when we try to place a hold. There sure seems to be a lot going on in the logs when I do that.
16:46	Dyrcona	terranm++ jihpringle++
16:47	jihpringle	I think this fix in 3.7.2 fixed part of it - https://bugs.launchpad.net/evergreen/+bug/1913807
16:47	pinesol	Launchpad bug 1913807 in Evergreen 3.6 "Angular staff catalog search results no longer displays holdings count for preferred library" [Medium,Fix released]
16:48	Dyrcona	I think I'm getting timeouts: Jan 25 16:43:55 egdev open-ils.search: [ERR :33653:EX.pm:66:16431467173365213] Exception: OpenSRF::EX::Session 2022-01-25T16:43:55 OpenSRF::Utils::Logger /usr/local/share/perl/5.30.0/OpenSRF/Utils/Logger.pm:243 Session Error: opensrfpublic.localhost/_egdev_1643146717.550417_33652 IS NOT CONNECTED TO THE NETWORK!!!
16:48	Dyrcona	Looks like a client gave up waiting on a reply.
16:49	terranm	jihpringle++
16:49	Dyrcona	jihpringle: I should have that one. I've installed 3.7.2.
16:49	jihpringle	I thought there was another related bug but I can't find it
16:50	Dyrcona	Is patron search doing something stupid, like trying to retrieve every patron in the database?
16:50	mmorgan	Dyrcona: Just to clarify, the angular staff catalog has its own Catalog Preferences with preferred library and default search library settings.
16:51	jihpringle	we're running 3.7.0(ish) and the angular catalogue definitely doesn't follow what it says in the catalog preferences for preferred library (The preferred library is used to show copies and URIs regardless of the library searched.)
16:52	terranm	I can verify that in 3.8 the holdings count for preferred library shows up on the search results page. However, if you go into the item details to see the actual holdings, only the holdings within the search limits show up
16:54	Dyrcona	OK.
16:54	eby	Any user permission gurus on? Trying to wrap my head around the depth/scopes.
16:54	mmorgan	Dyrcona: So this is using the Patron Search from the Place Hold screen in the Angular Staff Catalog?
16:54	Dyrcona	So, I see a lot of searches for holds, copies, bibs, and bib feeds.
16:54	Dyrcona	mmorgan: Yes.
16:54	terranm	Dyrcona: is the Patron Search modal freezing when you first pull it up or after you search? We did encounter this bug and signed off on the fix: https://bugs.launchpad.net/evergreen/+bug/1955927
16:54	pinesol	Launchpad bug 1955927 in Evergreen "Patron search modal on Holds screen fails with barcode search" [Medium,Confirmed]
16:55	Dyrcona	It freezes when I first open it up. My last log activity was at 4:46 and it is still sitting there. I can't close the modal.
16:55	jihpringle	terranm: are you meaning the item details in the search results or in the actual record?
16:55	terranm	jihpringle: in the actual record - lemme look at in the search results...
16:56	terranm	eby: I'm not a guru, but feel free to ask!
16:56	mmorgan	eby: Can't claim guru status, but ... What terranm said!!
16:56	jihpringle	I just tried on our 3.8 test server and I'm not seeing the preferred library item details showing in the grid on the search results screen (just in the counts)
16:57	Dyrcona	I'll have to look at this more tomorrow, but I think my problem is cstore timeouts talking to the database.
16:57	eby	So say have a staff with cancel hold permission at system level. When they go to cancel a hold on another account is that system looking at the patron home library, where the hold is placed or going to or ...?
16:58	terranm	jihpringle: I'm seeing the same thing. The only thing I see for the preferred library is the hold count.
16:59	jihpringle	terranm: you'd expect to see the item details in those other places, right? I think this is a bug
16:59	eby	the docs seem to mostly focus on user home library but I presume that differs depending on if their own account or someone elses
17:00	jihpringle	eby: we had something similar come up, I can't remember exactly what happened but I think it actually looks at either the pickup library or requestor library
17:02	mmorgan	eby: We haven't come across this, we have UPDATE_HOLD at the consortium level.
17:04	terranm	eby: same here - we have ours set at the consortium level, so we haven't looked into that. I would expect it to either be based on the patron's home library or on the pickup library.
17:04	terranm	jihpringle: +1 to that being a bug
17:05	jihpringle	terranm: speaking of bugs, are you planning to submit a bug for the old notes/alerts/messages surfacing? we're seeing that on our test server too
17:06	jeff	eby: if you are attempting to cancel a hold, the permission check on CANCEL_HOLDS is not scoped to any org unit -- if you have it anywhere, you can cancel a hold.
17:07	eby	Thanks I'll do some more testing. It seemed like it was checking based on the staff's home library which didn't make sense to me. But I'll dig more.
17:07	jeff	eby: "uncancel" is checked at the current hold pickup lib (it also tests the CANCEL_HOLDS permission)
17:07	jeff	so, in theory you can cancel a hold which you cannot then uncancel (without more work)
17:08	eby	ok that is what i was seeing jeff. it was sending the home library which of course would always allow regardless of what you were looking at
17:08	eby	(if you can cancel your own)
17:08	terranm	jihpringle: I haven't been viewing that as a bug since those old messages were still in the system and not deleted. However, we are going to do a cleanup project to mark specific types of old messages deleted.
17:08	jeff	eby: actually, let me fix my first statement... it wasn't quite accurate/precise.
17:09	jihpringle	terranm: we're still early days in our testing but we'd be very interested in what you find as you do your cleanup project since it looks like we'll need to do the same when we upgrade this summer
17:09	jeff	eby: when you attempt to cancel a hold, the permission is checked without passing an org unit -- which means that the permission check is performed based on the org unit of the workstation associated with the user's login session.
17:09	jeff	eby: that might explain what you were seeing better.
17:09	terranm	jeff++
17:10		terranm joined #evergreen
17:12	eby	jeff: thanks. my logs showed the permission.usr_has_perm getting the org unit id but have been trying to get an idea of where it was pulled / actually checked
17:16	jihpringle	terranm: https://bugs.launchpad.net/evergreen/+bug/1959052
17:16	pinesol	Launchpad bug 1959052 in Evergreen "Preferred Library copies details and URIs don't display in search results" [Undecided,New]
17:16	eby	jeff: but my melcat woes will never end as you well know
17:24		mmorgan left #evergreen
17:26	jeff	eby: this i know.
17:26	jeff	eby: I'm not sure how it took until today for us to run into this, but it looks like item renewal messages can fail (and loop) when the patron account in Evergreen is expired.
17:28	jeff	that might be me failing to set a new(ish) org unit setting regarding renewals and expired accounts. I'll look into it.
17:28	eby	I think I did run into that but may have tweaked permissions
17:29	jeff	for now, I'm off. another day!
18:02	pinesol	News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>