Evergreen ILS Website

IRC log for #evergreen, 2022-01-25

| Channels | #evergreen index | Today | | Search | Google Search | Plain-Text | summary | Join Webchat

All times shown according to the server's local time.

Time Nick Message
06:02 pinesol News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
07:06 JBoyer_ joined #evergreen
07:38 rjackson_isl_hom joined #evergreen
07:49 collum joined #evergreen
07:58 JBoyer Shopping for storage during a chip shortage is a real Soylent Drive situation. :-/ "They're made of chips!"
08:33 jvwoolf joined #evergreen
08:35 mantis1 joined #evergreen
08:36 rfrasur joined #evergreen
08:40 mmorgan joined #evergreen
08:44 jvwoolf1 joined #evergreen
08:46 Dyrcona joined #evergreen
09:13 terranm joined #evergreen
09:49 miker csharp_: re mod_stream_mgmt from yesterday, that has to be supported (and requested) by the client, and our xmpp implementation doesn't know about stream management, so it's def not related to that. at least it's something to cross off the list?
09:50 csharp_ miker: thanks
09:51 csharp_ I think at this point we're going to try to reroute cataloging to the AngJS UIs and see if there's an improvement
09:51 csharp_ stabbing in the dark on this since logs are so opaque
09:52 csharp_ also interested in automated ways to detect the outage and automatically restart the down service
09:53 csharp_ osrf_control --diagnostic is slow to realize that a listener is dead
09:53 csharp_ osrf_control --service X --restart is leaving orphaned drone procs running too
09:58 csharp_ (which is a limitation of the way that sub is implemented in that the proc that it would normally signal is no longer running and it doesn't do anything further to look for orphans)
10:17 miker csharp_: is actor the only perl service that has such a high drone count? (c is so much faster that I wouldn't be surprised if it effectively never had this issue) ... I wonder if you're bumping up against a takes-just-a-little-too-long to do all the child pipe management stuff between listener servicing
10:17 pastebot "JBoyer" at 168.25.130.30 pasted "A simple script to clean up orphaned Drones" (28 lines) at http://paste.evergreen-ils.org/14443
10:18 JBoyer csharp_, Not sure it helps much but I use a modified version of the above to cleanup orphaned drones en mass
10:22 csharp_ JBoyer: thanks!
10:23 JBoyer As for the service, you could add a script to opensrf's cron on service machines to check once a minute whether a process matching 'OpenSRF Listener [open-ils.actor]' is found, and call osrf_control to restart it if not. Obviously not a fix, but may allow you to at least breathe a bit while tracking it down.
10:23 csharp_ miker: hmm - so if I'm understanding correctly, we would reduce the max children for actor to see if that helps?
10:23 csharp_ we typically aren't hitting the max
10:23 csharp_ (meaning there's room to reduce)
10:24 berick was also wondering if maybe you've hit some practical limit
10:24 JBoyer 100+ is awfully high unless it were your only machine running the service, and even then 40-50 may be more than enough.
10:25 miker well, or spread them out. but I admit it's hard to imagine that you have 1152 (192 x 6) concurrent actor requests.
10:26 csharp_ we kept bumping it up to accommodate the actor drone parallel request issues
10:33 csharp_ we have 120 open-ils.circ max children - I could probably drop that down too if needed, but I've moved actor from 192 to 72 per server
10:33 nfBurton joined #evergreen
10:38 Dyrcona We have 150 max actor children per brick and still run out sometimes.
10:39 Dyrcona I sometimes consider sharing the drones among all the bricks, but haven't gotten to that point quite yet.
10:41 csharp_ just lost actor again after that change
10:43 csharp_ yeah - it's going offline again in multiple places
10:44 csharp_ like three servers over the course of a few minutes - this is how it's going
10:44 csharp_ and the number of actor drones running is not apparently relevant
10:45 csharp_ meaning that they were low, percentage-wise
10:45 csharp_ I really wish I could see into what's actually getting dropped
10:46 csharp_ so far that's my biggest frustration
10:46 csharp_ also possibly notable that the issue starts right before 11:00 a.m. each day
10:47 Bmagic csharp_ check ulimits! That's bitten me a few times
10:48 csharp_ Bmagic: I made recommended adjustments per Dyrcona and jeffdavis
10:48 Bmagic oh, I guess i didn't see that chat up there
10:48 Keith-isl joined #evergreen
10:49 nfBurton Is there a link to what to do to rerun action_triggers? I had an issue with a template I fixed and new ones are running but trying to run the past week again keeps coming up as invalid....
10:49 Bmagic nfBurton: I've always "reset" the row in action_trigger.event so that it "looks" like it needs to be run
10:49 nfBurton Same
10:49 nfBurton I get Use of uninitialized value in subtraction (-) at /usr/local/share/perl/5.22.1/OpenSRF/AppSession.pm line 952. In logs.
10:50 nfBurton The old ones just won't complete. But I cleared the event to rerun
10:50 csharp_ and now I'm maxxed out on actor children - going to bump them back up
10:51 mmorgan nfBurton: What kind of triggers are they?
10:51 nfBurton Hold Notification Email
10:51 Bmagic just to be clear, I don't delete the row in action_trigger.event. I update the columns to mostly be null and status to "pending", usually referring to another row that is pending to figure out which columns to nullify
10:51 nfBurton update action_trigger."event" set run_time = NOW(), add_time = NOW(), start_time = null, update_time = null, update_process = null, state = 'pending' where event_def = 5 and state = 'invalid' and DATE(add_time) > ('2022-01-18')
10:51 Bmagic yeah, like that :)
10:51 nfBurton yeah
10:51 mmorgan nfBurton: Maybe they're invalid because the holds have been picked up?
10:52 nfBurton They still are all 'invalid'
10:52 berick csharp_: did you ever get it to fail with debug logging?
10:52 nfBurton hmmm maybe
10:52 Bmagic nfBurton: that means that the validator checked into the details and decided that the trigger shouldn't fire
10:52 Bmagic what's the validator set to on the trigger def?
10:52 nfBurton I just wanted to be sure I didn't miss something cause it's over 300 for the past week. That makes sense though I guess
10:53 csharp_ berick: yes
10:53 nfBurton It's too bad it doesn't have an error output or something for it to tell why
10:53 csharp_ but the logs are so dense I'm kind of lost for what to look for
10:54 nfBurton Validator - HoldIsAvailable - Check that an item is on the hold shelf
10:54 Bmagic nfBurton: you could create a machine special to run these. Changing the granulairty to something outside of your normal. And on this temporary machine, you can edit the Validator code to log more stuff
10:54 mmorgan nfBurton: In my experience, triggers with invalid state are not errors, what Bmagic said. Hold is no longer ready for pickup, item is no longer overdue, etc.
10:55 berick csharp_: hm, ok.  i missed that.  one more question, anything interesting in /openils/var/log/open-ils.actor_stderr.log on the afffected server?
10:55 nfBurton So yeah, them being checked out makes sense. Thanks for the sanity check. I do get an error in the logs that basically looks like it errored on a complete time.
10:55 nfBurton So I was unsure of why
10:56 Bmagic the held item has a status other than "on holds shelf" - then yeah, it should invalidate
10:56 nfBurton But it's because they didn't actually complete because there wasn't a need
10:56 nfBurton cool
11:04 Dyrcona nfburton: If the validation fails, then the events should have a status of "invalid" after the run. A state of "error" usually means something else.
11:05 nfBurton Yeah, I got that now. Invalid sounds as bad, but doesn't have the error output lol
11:05 nfBurton I expected way more to run
11:11 Dyrcona csharp_: I'm out of ideas for the time being.
11:11 csharp_ berick: there are messages there, but they don't correspond timewise with the issue
11:11 * berick nods
11:13 berick csharp_:  for the debug logs, i would start by grepping just on the PID of the listener that dies and for a brief period leading up to its death.  may be a longshot.
11:14 Dyrcona nfBurton: I have had events get an 'error' status with no error output sometimes.
11:14 csharp_ berick: thanks - I'll take a look again
11:54 jihpringle joined #evergreen
12:27 awitter joined #evergreen
12:28 csharp_ hmm - the timing may be different than was I was assuming.  Now that I'm catching these in real time, it looks like we see "WS received XMPP error message in response to thread=0.55655771755080671643131332366 and recipient=router@public.brick01-head.gapines.org/open-ils.actor.  Likely the recipient is not accessible/available."
12:29 csharp_ but ps still shows the listener as running at that poing
12:29 csharp_ point
12:30 csharp_ learned this because my attempt to speed up checking to less than once per minute (while true; do blah && sleep 20; done) haven't restarted the process
12:30 csharp_ also explains maybe why osrf_control doesn't notice it immediately too
12:33 abowling left #evergreen
12:44 collum joined #evergreen
12:44 pinesol [evergreen|gmontimantis] Docs: Update barcode_completion_grid.jpg - <http://git.evergreen-ils.org/?p=​Evergreen.git;a=commit;h=be24afa>
13:14 pinesol [evergreen|gmontimantis] Docs: updating z39.50 overlay doc and images - <http://git.evergreen-ils.org/?p=​Evergreen.git;a=commit;h=0e977d4>
13:14 pinesol [evergreen|Jane Sandberg] Docs: use asciidoc ordered list rather than adding numbers manually - <http://git.evergreen-ils.org/?p=​Evergreen.git;a=commit;h=7ad46ac>
13:31 sandbergja joined #evergreen
13:36 terranm joined #evergreen
13:40 alynn26 We had a hicup on our database server last Monday, and since then we been getting "No connection to the server" messages Failed to set timezone messges
13:41 sandbergja joined #evergreen
13:44 terranm joined #evergreen
13:45 jihpringle joined #evergreen
13:46 Dyrcona alynn26: What do you mean "hiccup?"
13:47 alynn26 It just started not allowing connections.
13:49 jeffdavis Have you restarted Evergreen services since then?
13:49 jeff that specific "Failed to set timezone" message likely comes from OpenILS/Application/Storage/Driver/Pg/storage.pm
13:50 alynn26 jeffdavis, several times, even rebooted the db to see if that helped.
13:50 alynn26 The Timezone messages are the only thing that is different from before
13:51 jeff it should show what value it's trying to set.
13:51 jeff "Failed to set timezone: something-here"
13:51 jeff does it?
13:51 alynn26 America/New_York
13:51 alynn26 which is correct.
13:52 alynn26 open-ils.storage: [WARN:2167:Application.pm:624:1642450292163715] no_tz.open-ils.storage.transaction.begin: DBD::Pg::db do failed: no connection to the server [for Statement "SET LOCAL timezone TO ?;" with ParamValues: 1='America/New_York'] at /usr/local/share/perl/5.26.1/OpenILS/Ap​plication/Storage/Driver/Pg/storage.pm line 65.
13:52 alynn26 open-ils.storage: [WARN:2167:storage.pm:69:1642450292163715] Failed to set timezone: America/New_York
13:54 jeff "failed: no connection to the server"
13:54 jeff how long has that process been running?
13:57 alynn26 most of them when we see it are already died off.
13:57 jeff ah, that probably makes sense. it probably doesn't stick around after that point.
13:58 jeff is open-ils.storage connecting directly to the database server, or is there something like pgbouncer between it and the database server?
13:58 alynn26 direct connection
13:58 jeff as part of fixing the "hiccup", did the connection details change for the database server?
13:58 jeff are you close to your max connections on the database server?
13:59 alynn26 nope they did not.  no where near close to max connections,
13:59 jeff are there errors in the postgresql logs regarding these clients?
14:00 jeff did the process in question do something interesting / log something interesting before the "no connection to the server" / "Failed to set timezone" warning?
14:02 alynn26 nope they did not.  No other errors that I can find or see.  Been looking at log files the last two days.  and everything was normal, then wam.  no connections.
14:05 Dyrcona alynn26: Is the db in recovery mode?
14:06 Dyrcona You can tell by doing "pgrep -af postgres" on the db server I think.
14:07 jeff What was the cause/fix of the "hiccup"? Did the timezone warnings start right away, or not for a while?
14:08 alynn26 No, the db is not in recovery mode.
14:11 alynn26 One other issue we have is we have a lot of "postgres: 9.6/main: evergreen evergreen 10.1.2.17(55738) idle
14:11 alynn26 "
14:13 alynn26 the fix was to kill out most of the idle processes. which cleared up ram.
14:14 Dyrcona alynn26: Have you verified that the correct postgresql configuration is in place?
14:14 Dyrcona It could be you don't have enough connections available.
14:15 alynn26 we have 6000 and usually running around 2000
14:16 JBoyer alynn26, Those "idle" processes are PostgreSQL backends that are actively listening to a connection, such as a cstore drone or something. They will automatically go away when the connection closes. (like when a drone is recycled or otherwise goes away)
14:16 JBoyer killing them on the db directly can cause problems for Postgres and the OpenSRF processes talking to them.
14:18 Dyrcona If 10.1.2.17 has a larger than usual number of connections, you might want to log in to it and see what's going on there. Maybe something is misbehaving?
15:49 jihpringle joined #evergreen
15:51 pinesol [evergreen|Bill Erickson] LP1948035 Update Node/Angular Deps for Ang 12 - <http://git.evergreen-ils.org/?p=​Evergreen.git;a=commit;h=a46938a>
15:51 pinesol [evergreen|Bill Erickson] LP1948035 Angular 12 updates - <http://git.evergreen-ils.org/?p=​Evergreen.git;a=commit;h=9b7d6dd>
15:51 pinesol [evergreen|Jane Sandberg] LP1948035 (follow-up): don't use commonjs in spec tsconfig - <http://git.evergreen-ils.org/?p=​Evergreen.git;a=commit;h=e7b98ca>
16:15 awitter joined #evergreen
16:36 Dyrcona I've noticed that the Angular staff catalog in 3.7 doesn't use the preferred library when searching. Is that expected, known, etc.?
16:43 terranm We're in 3.8, but if we search by the whole consortium, the preferred library's copies are floating to the top of the holdings list as expected.
16:45 terranm The only thing that I know for sure that is buggy with the Angular search preferences is the stickiness of the Exclude Electronic Resources checkbox
16:45 jihpringle Drycona: yup, that's a known issue in 3.7
16:45 jihpringle Dyrcona ^^
16:46 Dyrcona I'm looking into patron search freezing when we try to place a hold. There sure seems to be a lot going on in the logs when I do that.
16:46 Dyrcona terranm++ jihpringle++
16:47 jihpringle I think this fix in 3.7.2 fixed part of it - https://bugs.launchpad.net/evergreen/+bug/1913807
16:47 pinesol Launchpad bug 1913807 in Evergreen 3.6 "Angular staff catalog search results no longer displays holdings count for preferred library" [Medium,Fix released]
16:48 Dyrcona I think I'm getting timeouts: Jan 25 16:43:55 egdev open-ils.search: [ERR :33653:EX.pm:66:16431467173365213] Exception: OpenSRF::EX::Session 2022-01-25T16:43:55 OpenSRF::Utils::Logger /usr/local/share/perl/5.30.0​/OpenSRF/Utils/Logger.pm:243 Session Error: opensrf@public.localhost/_egdev_1643146717.550417_33652 IS NOT CONNECTED TO THE NETWORK!!!
16:48 Dyrcona Looks like a client gave up waiting on a reply.
16:49 terranm jihpringle++
16:49 Dyrcona jihpringle: I should have that one. I've installed 3.7.2.
16:49 jihpringle I thought there was another related bug but I can't find it
16:50 Dyrcona Is patron search doing something stupid, like trying to retrieve every patron in the database?
16:50 mmorgan Dyrcona: Just to clarify, the angular staff catalog has its own Catalog Preferences with preferred library and default search library settings.
16:51 jihpringle we're running 3.7.0(ish) and the angular catalogue definitely doesn't follow what it says in the catalog preferences for preferred library (The preferred library is used to show copies and URIs regardless of the library searched.)
16:52 terranm I can verify that in 3.8 the holdings count for preferred library shows up on the search results page. However, if you go into the item details to see the actual holdings, only the holdings within the search limits show up
16:54 Dyrcona OK.
16:54 eby Any user permission gurus on? Trying to wrap my head around the depth/scopes.
16:54 mmorgan Dyrcona: So this is using the Patron Search from the Place Hold screen in the Angular Staff Catalog?
16:54 Dyrcona So, I see a lot of searches for holds, copies, bibs, and bib feeds.
16:54 Dyrcona mmorgan: Yes.
16:54 terranm Dyrcona: is the Patron Search modal freezing when you first pull it up or after you search? We did encounter this bug and signed off on the fix: https://bugs.launchpad.net/evergreen/+bug/1955927
16:54 pinesol Launchpad bug 1955927 in Evergreen "Patron search modal on Holds screen fails with barcode search" [Medium,Confirmed]
16:55 Dyrcona It freezes when I first open it up. My last log activity was at 4:46 and it is still sitting there. I can't close the modal.
16:55 jihpringle terranm: are you meaning the item details in the search results or in the actual record?
16:55 terranm jihpringle: in the actual record - lemme look at in the search results...
16:56 terranm eby: I'm not a guru, but feel free to ask!
16:56 mmorgan eby: Can't claim guru status, but ... What terranm said!!
16:56 jihpringle I just tried on our 3.8 test server and I'm not seeing the preferred library item details showing in the grid on the search results screen (just in the counts)
16:57 Dyrcona I'll have to look at this more tomorrow, but I think my problem is cstore timeouts talking to the database.
16:57 eby So say have a staff with cancel hold permission at system level. When they go to cancel a hold on another account is that system looking at the patron home library, where the hold is placed or going to or ...?
16:58 terranm jihpringle: I'm seeing the same thing. The only thing I see for the preferred library is the hold count.
16:59 jihpringle terranm: you'd expect to see the item details in those other places, right?  I think this is a bug
16:59 eby the docs seem to mostly focus on user home library but I presume that differs depending on if their own account or someone elses
17:00 jihpringle eby: we had something similar come up, I can't remember exactly what happened but I think it actually looks at either the pickup library or requestor library
17:02 mmorgan eby: We haven't come across this, we have UPDATE_HOLD at the consortium level.
17:04 terranm eby: same here - we have ours set at the consortium level, so we haven't looked into that. I would expect it to either be based on the patron's home library or on the pickup library.
17:04 terranm jihpringle: +1 to that being a bug
17:05 jihpringle terranm: speaking of bugs, are you planning to submit a bug for the old notes/alerts/messages surfacing?  we're seeing that on our test server too
17:06 jeff eby: if you are attempting to cancel a hold, the permission check on CANCEL_HOLDS is not scoped to any org unit -- if you have it anywhere, you can cancel a hold.
17:07 eby Thanks I'll do some more testing. It seemed like it was checking based on the staff's home library which didn't make sense to me. But I'll dig more.
17:07 jeff eby: "uncancel" is checked at the current hold pickup lib (it also tests the CANCEL_HOLDS permission)
17:07 jeff so, in theory you can cancel a hold which you cannot then uncancel (without more work)
17:08 eby ok that is what i was seeing jeff. it was sending the home library which of course would always allow regardless of what you were looking at
17:08 eby (if you can cancel your own)
17:08 terranm jihpringle: I haven't been viewing that as a bug since those old messages were still in the system and not deleted. However, we are going to do a cleanup project to mark specific types of old messages deleted.
17:08 jeff eby: actually, let me fix my first statement... it wasn't quite accurate/precise.
17:09 jihpringle terranm: we're still early days in our testing but we'd be very interested in what you find as you do your cleanup project since it looks like we'll need to do the same when we upgrade this summer
17:09 jeff eby: when you attempt to cancel a hold, the permission is checked without passing an org unit -- which means that the permission check is performed based on the org unit of the workstation associated with the user's login session.
17:09 jeff eby: that might explain what you were seeing better.
17:09 terranm jeff++
17:10 terranm joined #evergreen
17:12 eby jeff: thanks. my logs showed the permission.usr_has_perm getting the org unit id but have been trying to get an idea of where it was pulled / actually checked
17:16 jihpringle terranm: https://bugs.launchpad.net/evergreen/+bug/1959052
17:16 pinesol Launchpad bug 1959052 in Evergreen "Preferred Library copies details and URIs don't display in search results" [Undecided,New]
17:16 eby jeff: but my melcat woes will never end as you well know
17:24 mmorgan left #evergreen
17:26 jeff eby: this i know.
17:26 jeff eby: I'm not sure how it took until today for us to run into this, but it looks like item renewal messages can fail (and loop) when the patron account in Evergreen is expired.
17:28 jeff that might be me failing to set a new(ish) org unit setting regarding renewals and expired accounts. I'll look into it.
17:28 eby I think I did run into that but may have tweaked permissions
17:29 jeff for now, I'm off. another day!
18:02 pinesol News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>

| Channels | #evergreen index | Today | | Search | Google Search | Plain-Text | summary | Join Webchat