06:02 <pinesol> News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
07:06 *** JBoyer_ joined #evergreen
07:38 *** rjackson_isl_hom joined #evergreen
07:49 *** collum joined #evergreen
07:58 <JBoyer> Shopping for storage during a chip shortage is a real Soylent Drive situation. :-/ "They're made of chips!"
08:33 *** jvwoolf joined #evergreen
08:35 *** mantis1 joined #evergreen
08:36 *** rfrasur joined #evergreen
08:40 *** mmorgan joined #evergreen
08:44 *** jvwoolf1 joined #evergreen
08:46 *** Dyrcona joined #evergreen
09:13 *** terranm joined #evergreen
09:49 <miker> csharp_: re mod_stream_mgmt from yesterday, that has to be supported (and requested) by the client, and our xmpp implementation doesn't know about stream management, so it's def not related to that. at least it's something to cross off the list?
09:50 <csharp_> miker: thanks
09:51 <csharp_> I think at this point we're going to try to reroute cataloging to the AngJS UIs and see if there's an improvement
09:51 <csharp_> stabbing in the dark on this since logs are so opaque
09:52 <csharp_> also interested in automated ways to detect the outage and automatically restart the down service
09:53 <csharp_> osrf_control --diagnostic is slow to realize that a listener is dead
09:53 <csharp_> osrf_control --service X --restart is leaving orphaned drone procs running too
09:58 <csharp_> (which is a limitation of the way that sub is implemented, in that the proc that it would normally signal is no longer running and it doesn't do anything further to look for orphans)
10:17 <miker> csharp_: is actor the only perl service that has such a high drone count? (c is so much faster that I wouldn't be surprised if it effectively never had this issue) ... I wonder if you're bumping up against a takes-just-a-little-too-long to do all the child pipe management stuff between listener servicing
10:17 <pastebot> "JBoyer" at 168.25.130.30 pasted "A simple script to clean up orphaned Drones" (28 lines) at http://paste.evergreen-ils.org/14443
10:18 <JBoyer> csharp_, Not sure it helps much but I use a modified version of the above to clean up orphaned drones en masse
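JBoyer's actual script is at the paste link above. As a rough sketch of the idea only (not his code): an orphaned drone is one whose parent listener has died, leaving it reparented to PID 1. The "OpenSRF Drone" process title is an assumption to verify with ps on your own servers before trusting the match.

```shell
#!/bin/sh
# Hypothetical orphan-drone finder -- NOT the pasted script. Assumes
# drones set a process title like "OpenSRF Drone [open-ils.actor]".
SERVICE="${1:-open-ils.actor}"

orphaned_drones() {
    # A drone whose parent PID is 1 has lost its listener.
    pgrep -f "OpenSRF Drone \[$SERVICE\]" |
    while read -r pid; do
        ppid=$(ps -o ppid= -p "$pid" | tr -d ' ')
        if [ "$ppid" = "1" ]; then
            printf '%s\n' "$pid"
        fi
    done
}

# List candidates first; only pipe to kill once the matches look right:
#   orphaned_drones | xargs -r kill
orphaned_drones
```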
10:22 <csharp_> JBoyer: thanks!
10:23 <JBoyer> As for the service, you could add a script to opensrf's cron on service machines to check once a minute whether a process matching 'OpenSRF Listener [open-ils.actor]' is found, and call osrf_control to restart it if not. Obviously not a fix, but may allow you to at least breathe a bit while tracking it down.
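A minimal sketch of that cron check might look like the following. The listener process title is quoted from the discussion and the restart invocation mirrors the osrf_control flags mentioned earlier, but treat paths and flags as assumptions to verify against your install (and note the later finding in this log that ps can still show a listener that has stopped answering).

```shell
#!/bin/sh
# Hypothetical watchdog for a single service, run from the opensrf
# user's crontab, e.g.:  * * * * * /home/opensrf/bin/actor-watchdog.sh
SERVICE="open-ils.actor"

listener_running() {
    pgrep -f "OpenSRF Listener \[$SERVICE\]" >/dev/null
}

if ! listener_running; then
    echo "$(date): no listener for $SERVICE; restarting" >&2
    # Flags mirror the osrf_control usage discussed above; guarded so
    # the sketch degrades gracefully where osrf_control is absent.
    if command -v osrf_control >/dev/null 2>&1; then
        osrf_control --service "$SERVICE" --restart
    fi
fi
```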
10:23 <csharp_> miker: hmm - so if I'm understanding correctly, we would reduce the max children for actor to see if that helps?
10:23 <csharp_> we typically aren't hitting the max
10:23 <csharp_> (meaning there's room to reduce)
10:24 <berick> was also wondering if maybe you've hit some practical limit
10:24 <JBoyer> 100+ is awfully high unless it were your only machine running the service, and even then 40-50 may be more than enough.
10:25 <miker> well, or spread them out. but I admit it's hard to imagine that you have 1152 (192 x 6) concurrent actor requests.
10:26 <csharp_> we kept bumping it up to accommodate the actor drone parallel request issues
10:33 <csharp_> we have 120 open-ils.circ max children - I could probably drop that down too if needed, but I've moved actor from 192 to 72 per server
10:33 *** nfBurton joined #evergreen
10:38 <Dyrcona> We have 150 max actor children per brick and still run out sometimes.
10:39 <Dyrcona> I sometimes consider sharing the drones among all the bricks, but haven't gotten to that point quite yet.
10:41 <csharp_> just lost actor again after that change
10:43 <csharp_> yeah - it's going offline again in multiple places
10:44 <csharp_> like three servers over the course of a few minutes - this is how it's going
10:44 <csharp_> and the number of actor drones running is not apparently relevant
10:45 <csharp_> meaning that they were low, percentage-wise
10:45 <csharp_> I really wish I could see into what's actually getting dropped
10:46 <csharp_> so far that's my biggest frustration
10:46 <csharp_> also possibly notable that the issue starts right before 11:00 a.m. each day
10:47 <Bmagic> csharp_: check ulimits! That's bitten me a few times
10:48 <csharp_> Bmagic: I made recommended adjustments per Dyrcona and jeffdavis
10:48 <Bmagic> oh, I guess I didn't see that chat up there
10:48 *** Keith-isl joined #evergreen
10:49 <nfBurton> Is there a link to what to do to rerun action_triggers? I had an issue with a template (now fixed) and new ones are running, but trying to rerun the past week keeps coming up as invalid....
10:49 <Bmagic> nfBurton: I've always "reset" the row in action_trigger.event so that it "looks" like it needs to be run
10:49 <nfBurton> Same
10:49 <nfBurton> I get "Use of uninitialized value in subtraction (-) at /usr/local/share/perl/5.22.1/OpenSRF/AppSession.pm line 952." in the logs.
10:50 <nfBurton> The old ones just won't complete. But I cleared the event to rerun
10:50 <csharp_> and now I'm maxed out on actor children - going to bump them back up
10:51 <mmorgan> nfBurton: What kind of triggers are they?
10:51 <nfBurton> Hold Notification Email
10:51 <Bmagic> just to be clear, I don't delete the row in action_trigger.event. I update the columns to mostly be null and status to "pending", usually referring to another row that is pending to figure out which columns to nullify
10:51 <nfBurton> update action_trigger."event" set run_time = NOW(), add_time = NOW(), start_time = null, update_time = null, update_process = null, state = 'pending' where event_def = 5 and state = 'invalid' and DATE(add_time) > ('2022-01-18')
10:51 <Bmagic> yeah, like that :)
10:51 <nfBurton> yeah
10:51 <mmorgan> nfBurton: Maybe they're invalid because the holds have been picked up?
10:52 <nfBurton> They still are all 'invalid'
10:52 <berick> csharp_: did you ever get it to fail with debug logging?
10:52 <nfBurton> hmmm maybe
10:52 <Bmagic> nfBurton: that means that the validator checked into the details and decided that the trigger shouldn't fire
10:52 <Bmagic> what's the validator set to on the trigger def?
10:52 <nfBurton> I just wanted to be sure I didn't miss something cause it's over 300 for the past week. That makes sense though I guess
10:53 <csharp_> berick: yes
10:53 <nfBurton> It's too bad it doesn't have an error output or something for it to tell why
10:53 <csharp_> but the logs are so dense I'm kind of lost for what to look for
10:54 <nfBurton> Validator - HoldIsAvailable - Check that an item is on the hold shelf
10:54 <Bmagic> nfBurton: you could create a machine special to run these, changing the granularity to something outside of your normal. And on this temporary machine, you can edit the Validator code to log more stuff
10:54 <mmorgan> nfBurton: In my experience, triggers with invalid state are not errors - what Bmagic said. Hold is no longer ready for pickup, item is no longer overdue, etc.
10:55 <berick> csharp_: hm, ok. i missed that. one more question: anything interesting in /openils/var/log/open-ils.actor_stderr.log on the affected server?
10:55 <nfBurton> So yeah, them being checked out makes sense. Thanks for the sanity check. I do get an error in the logs that basically looks like it errored on a complete time.
10:55 <nfBurton> So I was unsure of why
10:56 <Bmagic> the held item has a status other than "on holds shelf" - then yeah, it should invalidate
10:56 <nfBurton> But it's because they didn't actually complete because there wasn't a need
10:56 <nfBurton> cool
11:04 <Dyrcona> nfBurton: If the validation fails, then the events should have a status of "invalid" after the run. A state of "error" usually means something else.
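When rerunning a batch like this, a quick state summary shows how the events landed after the run. The table and column names come from the UPDATE quoted above; event_def 5, the date, and the evergreen/evergreen psql credentials are that example's values, not universal ones.

```shell
#!/bin/sh
# Summarize action_trigger.event states for one event definition.
# event_def 5 and the date match the earlier example UPDATE; the
# connection details are assumptions for your site.
SQL="select state, count(*)
       from action_trigger.event
      where event_def = 5
        and date(add_time) > '2022-01-18'
      group by state
      order by count(*) desc;"

# Guarded so the sketch is harmless where psql or the db is absent.
if command -v psql >/dev/null 2>&1; then
    psql -U evergreen -d evergreen -c "$SQL" || true
fi
```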
11:05 <nfBurton> Yeah, I got that now. Invalid sounds as bad, but doesn't have the error output lol
11:05 <nfBurton> I expected way more to run
11:11 <Dyrcona> csharp_: I'm out of ideas for the time being.
11:11 <csharp_> berick: there are messages there, but they don't correspond timewise with the issue
11:11 * berick nods
11:13 <berick> csharp_: for the debug logs, i would start by grepping just on the PID of the listener that dies, and for a brief period leading up to its death. may be a longshot.
11:14 <Dyrcona> nfBurton: I have had events get an 'error' status with no error output sometimes.
11:14 <csharp_> berick: thanks - I'll take a look again
11:54 *** jihpringle joined #evergreen
12:27 *** awitter joined #evergreen
12:28 <csharp_> hmm - the timing may be different than I was assuming. Now that I'm catching these in real time, it looks like we see "WS received XMPP error message in response to thread=0.55655771755080671643131332366 and recipient=routerpublic.brick01-head.gapines.org/open-ils.actor. Likely the recipient is not accessible/available."
12:29 <csharp_> but ps still shows the listener as running at that point
12:30 <csharp_> learned this because my attempt to speed up checking to less than once per minute (while true; do blah && sleep 20; done) hasn't restarted the process
12:30 <csharp_> also explains maybe why osrf_control doesn't notice it immediately too
12:33 *** abowling left #evergreen
12:44 *** collum joined #evergreen
12:44 <pinesol> [evergreen|gmontimantis] Docs: Update barcode_completion_grid.jpg - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=be24afa>
13:14 <pinesol> [evergreen|gmontimantis] Docs: updating z39.50 overlay doc and images - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=0e977d4>
13:14 <pinesol> [evergreen|Jane Sandberg] Docs: use asciidoc ordered list rather than adding numbers manually - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=7ad46ac>
13:31 *** sandbergja joined #evergreen
13:36 *** terranm joined #evergreen
13:40 <alynn26> We had a hiccup on our database server last Monday, and since then we've been getting "No connection to the server" messages and "Failed to set timezone" messages
13:41 *** sandbergja joined #evergreen
13:44 *** terranm joined #evergreen
13:45 *** jihpringle joined #evergreen
13:46 <Dyrcona> alynn26: What do you mean "hiccup?"
13:47 <alynn26> It just started not allowing connections.
13:49 <jeffdavis> Have you restarted Evergreen services since then?
13:49 <jeff> that specific "Failed to set timezone" message likely comes from OpenILS/Application/Storage/Driver/Pg/storage.pm
13:50 <alynn26> jeffdavis: several times; even rebooted the db to see if that helped.
13:50 <alynn26> The timezone messages are the only thing that is different from before
13:51 <jeff> it should show what value it's trying to set.
13:51 <jeff> "Failed to set timezone: something-here"
13:51 <jeff> does it?
13:51 <alynn26> America/New_York
13:51 <alynn26> which is correct.
13:52 <alynn26> open-ils.storage: [WARN:2167:Application.pm:624:1642450292163715] no_tz.open-ils.storage.transaction.begin: DBD::Pg::db do failed: no connection to the server [for Statement "SET LOCAL timezone TO ?;" with ParamValues: 1='America/New_York'] at /usr/local/share/perl/5.26.1/OpenILS/Application/Storage/Driver/Pg/storage.pm line 65.
13:52 <alynn26> open-ils.storage: [WARN:2167:storage.pm:69:1642450292163715] Failed to set timezone: America/New_York
13:54 <jeff> "failed: no connection to the server"
13:54 <jeff> how long has that process been running?
13:57 <alynn26> most of them have already died off by the time we see it.
13:57 <jeff> ah, that probably makes sense. it probably doesn't stick around after that point.
13:58 <jeff> is open-ils.storage connecting directly to the database server, or is there something like pgbouncer between it and the database server?
13:58 <alynn26> direct connection
13:58 <jeff> as part of fixing the "hiccup", did the connection details change for the database server?
13:58 <jeff> are you close to your max connections on the database server?
13:59 <alynn26> nope, they did not. nowhere near close to max connections.
13:59 <jeff> are there errors in the postgresql logs regarding these clients?
14:00 <jeff> did the process in question do something interesting / log something interesting before the "no connection to the server" / "Failed to set timezone" warning?
14:02 <alynn26> nope, they did not. No other errors that I can find or see. Been looking at log files the last two days, and everything was normal, then wham: no connections.
14:05 <Dyrcona> alynn26: Is the db in recovery mode?
14:06 <Dyrcona> You can tell by doing "pgrep -af postgres" on the db server, I think.
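Alongside the pgrep check, PostgreSQL will answer directly: pg_is_in_recovery() is a standard server function. The psql user below is an assumption; adjust for your site.

```shell
#!/bin/sh
# Look for recovery/startup processes in the process table...
pgrep -af postgres | grep -i -e recover -e startup

# ...or ask the server itself; pg_is_in_recovery() returns t or f.
SQL="select pg_is_in_recovery();"
if command -v psql >/dev/null 2>&1; then
    psql -U postgres -Atc "$SQL" || true
fi
```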
14:07 <jeff> What was the cause/fix of the "hiccup"? Did the timezone warnings start right away, or not for a while?
14:08 <alynn26> No, the db is not in recovery mode.
14:11 <alynn26> One other issue we have is a lot of "postgres: 9.6/main: evergreen evergreen 10.1.2.17(55738) idle"
14:13 <alynn26> the fix was to kill off most of the idle processes, which cleared up RAM.
14:14 <Dyrcona> alynn26: Have you verified that the correct postgresql configuration is in place?
14:14 <Dyrcona> It could be you don't have enough connections available.
14:15 <alynn26> we have 6000 and usually running around 2000
14:16 <JBoyer> alynn26, Those "idle" processes are PostgreSQL backends that are actively listening on a connection, such as from a cstore drone or something. They will automatically go away when the connection closes (like when a drone is recycled or otherwise goes away).
14:16 <JBoyer> killing them on the db directly can cause problems for Postgres and the OpenSRF processes talking to them.
14:18 <Dyrcona> If 10.1.2.17 has a larger than usual number of connections, you might want to log in to it and see what's going on there. Maybe something is misbehaving?
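To see which app server those idle backends belong to, pg_stat_activity can group them by client address and state; the view and these columns are standard in PostgreSQL 9.2+, while the credentials below are assumptions.

```shell
#!/bin/sh
# Count backends per client address and state, busiest first.
SQL="select client_addr, state, count(*)
       from pg_stat_activity
      group by client_addr, state
      order by count(*) desc;"

# Guarded so the sketch is harmless where psql or the db is absent.
if command -v psql >/dev/null 2>&1; then
    psql -U postgres -d evergreen -c "$SQL" || true
fi
```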
15:49 *** jihpringle joined #evergreen
15:51 <pinesol> [evergreen|Bill Erickson] LP1948035 Update Node/Angular Deps for Ang 12 - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=a46938a>
15:51 <pinesol> [evergreen|Bill Erickson] LP1948035 Angular 12 updates - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=9b7d6dd>
15:51 <pinesol> [evergreen|Jane Sandberg] LP1948035 (follow-up): don't use commonjs in spec tsconfig - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=e7b98ca>
16:15 *** awitter joined #evergreen
16:36 <Dyrcona> I've noticed that the Angular staff catalog in 3.7 doesn't use the preferred library when searching. Is that expected, known, etc.?
16:43 <terranm> We're on 3.8, but if we search by the whole consortium, the preferred library's copies are floating to the top of the holdings list as expected.
16:45 <terranm> The only thing that I know for sure is buggy with the Angular search preferences is the stickiness of the Exclude Electronic Resources checkbox
16:45 <jihpringle> Drycona: yup, that's a known issue in 3.7
16:45 <jihpringle> Dyrcona ^^
16:46 <Dyrcona> I'm looking into patron search freezing when we try to place a hold. There sure seems to be a lot going on in the logs when I do that.
16:46 <Dyrcona> terranm++ jihpringle++
16:47 <jihpringle> I think this fix in 3.7.2 fixed part of it - https://bugs.launchpad.net/evergreen/+bug/1913807
16:47 <pinesol> Launchpad bug 1913807 in Evergreen 3.6 "Angular staff catalog search results no longer displays holdings count for preferred library" [Medium,Fix released]
16:48 <Dyrcona> I think I'm getting timeouts: Jan 25 16:43:55 egdev open-ils.search: [ERR :33653:EX.pm:66:16431467173365213] Exception: OpenSRF::EX::Session 2022-01-25T16:43:55 OpenSRF::Utils::Logger /usr/local/share/perl/5.30.0/OpenSRF/Utils/Logger.pm:243 Session Error: opensrfpublic.localhost/_egdev_1643146717.550417_33652 IS NOT CONNECTED TO THE NETWORK!!!
16:48 <Dyrcona> Looks like a client gave up waiting on a reply.
16:49 <terranm> jihpringle++
16:49 <Dyrcona> jihpringle: I should have that one. I've installed 3.7.2.
16:49 <jihpringle> I thought there was another related bug but I can't find it
16:50 <Dyrcona> Is patron search doing something stupid, like trying to retrieve every patron in the database?
16:50 <mmorgan> Dyrcona: Just to clarify, the angular staff catalog has its own Catalog Preferences with preferred library and default search library settings.
16:51 <jihpringle> we're running 3.7.0(ish) and the angular catalogue definitely doesn't follow what it says in the catalog preferences for preferred library ("The preferred library is used to show copies and URIs regardless of the library searched.")
16:52 <terranm> I can verify that in 3.8 the holdings count for preferred library shows up on the search results page. However, if you go into the item details to see the actual holdings, only the holdings within the search limits show up
16:54 <Dyrcona> OK.
16:54 <eby> Any user permission gurus on? Trying to wrap my head around the depth/scopes.
16:54 <mmorgan> Dyrcona: So this is using the Patron Search from the Place Hold screen in the Angular Staff Catalog?
16:54 <Dyrcona> So, I see a lot of searches for holds, copies, bibs, and bib feeds.
16:54 <Dyrcona> mmorgan: Yes.
16:54 <terranm> Dyrcona: is the Patron Search modal freezing when you first pull it up or after you search? We did encounter this bug and signed off on the fix: https://bugs.launchpad.net/evergreen/+bug/1955927
16:54 <pinesol> Launchpad bug 1955927 in Evergreen "Patron search modal on Holds screen fails with barcode search" [Medium,Confirmed]
16:55 <Dyrcona> It freezes when I first open it up. My last log activity was at 4:46 and it is still sitting there. I can't close the modal.
16:55 <jihpringle> terranm: are you meaning the item details in the search results or in the actual record?
16:55 <terranm> jihpringle: in the actual record - lemme look at the search results...
16:56 <terranm> eby: I'm not a guru, but feel free to ask!
16:56 <mmorgan> eby: Can't claim guru status, but ... What terranm said!!
16:56 <jihpringle> I just tried on our 3.8 test server and I'm not seeing the preferred library item details showing in the grid on the search results screen (just in the counts)
16:57 <Dyrcona> I'll have to look at this more tomorrow, but I think my problem is cstore timeouts talking to the database.
16:57 <eby> So say you have staff with cancel hold permission at the system level. When they go to cancel a hold on another account, is the system looking at the patron home library, where the hold is placed, where it's going to, or ...?
16:58 <terranm> jihpringle: I'm seeing the same thing. The only thing I see for the preferred library is the hold count.
16:59 <jihpringle> terranm: you'd expect to see the item details in those other places, right? I think this is a bug
16:59 <eby> the docs seem to mostly focus on the user's home library, but I presume that differs depending on whether it's their own account or someone else's
17:00 <jihpringle> eby: we had something similar come up; I can't remember exactly what happened, but I think it actually looks at either the pickup library or requestor library
17:02 <mmorgan> eby: We haven't come across this; we have UPDATE_HOLD at the consortium level.
17:04 <terranm> eby: same here - we have ours set at the consortium level, so we haven't looked into that. I would expect it to either be based on the patron's home library or on the pickup library.
17:04 <terranm> jihpringle: +1 to that being a bug
17:05 <jihpringle> terranm: speaking of bugs, are you planning to submit a bug for the old notes/alerts/messages surfacing? we're seeing that on our test server too
17:06 <jeff> eby: if you are attempting to cancel a hold, the permission check on CANCEL_HOLDS is not scoped to any org unit -- if you have it anywhere, you can cancel a hold.
17:07 <eby> Thanks, I'll do some more testing. It seemed like it was checking based on the staff's home library, which didn't make sense to me. But I'll dig more.
17:07 <jeff> eby: "uncancel" is checked at the current hold pickup lib (it also tests the CANCEL_HOLDS permission)
17:07 <jeff> so, in theory you can cancel a hold which you cannot then uncancel (without more work)
17:08 <eby> ok, that is what i was seeing, jeff. it was sending the home library, which of course would always allow regardless of what you were looking at
17:08 <eby> (if you can cancel your own)
17:08 <terranm> jihpringle: I haven't been viewing that as a bug, since those old messages were still in the system and not deleted. However, we are going to do a cleanup project to mark specific types of old messages deleted.
17:08 <jeff> eby: actually, let me fix my first statement... it wasn't quite accurate/precise.
17:09 <jihpringle> terranm: we're still early days in our testing, but we'd be very interested in what you find as you do your cleanup project, since it looks like we'll need to do the same when we upgrade this summer
17:09 <jeff> eby: when you attempt to cancel a hold, the permission is checked without passing an org unit -- which means that the permission check is performed based on the org unit of the workstation associated with the user's login session.
17:09 <jeff> eby: that might explain what you were seeing better.
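The permission.usr_has_perm function that turns up in eby's logs later in this discussion offers a way to poke at this from psql. A sketch only: the (user id, permission code, org unit id) argument order is my assumption to verify against your schema, and the ids and credentials are placeholders.

```shell
#!/bin/sh
# Hypothetical permission probe. The usr_has_perm(usr, perm, ou)
# argument order and the example ids are assumptions to verify.
USR_ID=1        # actor.usr id of the staff account
ORG_UNIT=101    # org unit to test the permission at
SQL="select permission.usr_has_perm($USR_ID, 'CANCEL_HOLDS', $ORG_UNIT);"

# Guarded so the sketch is harmless where psql or the db is absent.
if command -v psql >/dev/null 2>&1; then
    psql -U evergreen -d evergreen -Atc "$SQL" || true
fi
```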
17:10 <terranm> jeff++
17:10 *** terranm joined #evergreen
17:12 <eby> jeff: thanks. my logs showed permission.usr_has_perm getting the org unit id, but I've been trying to get an idea of where it was pulled from / actually checked
17:16 <jihpringle> terranm: https://bugs.launchpad.net/evergreen/+bug/1959052
17:16 <pinesol> Launchpad bug 1959052 in Evergreen "Preferred Library copies details and URIs don't display in search results" [Undecided,New]
17:16 <eby> jeff: but my melcat woes will never end, as you well know
17:24 *** mmorgan left #evergreen
17:26 <jeff> eby: this i know.
17:26 <jeff> eby: I'm not sure how it took until today for us to run into this, but it looks like item renewal messages can fail (and loop) when the patron account in Evergreen is expired.
17:28 <jeff> that might be me failing to set a new(ish) org unit setting regarding renewals and expired accounts. I'll look into it.
17:28 <eby> I think I did run into that but may have tweaked permissions
17:29 <jeff> for now, I'm off. another day!
18:02 <pinesol> News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>