Evergreen ILS Website

IRC log for #evergreen, 2023-06-27

| Channels | #evergreen index | Today | | Search | Google Search | Plain-Text | summary | Join Webchat

All times shown according to the server's local time.

Time Nick Message
08:02 BDorsey joined #evergreen
08:56 mmorgan joined #evergreen
09:19 Dyrcona joined #evergreen
09:28 mantis joined #evergreen
09:53 mantis joined #evergreen
10:18 jvwoolf joined #evergreen
11:17 Christineb joined #evergreen
11:33 Bmagic I'm digging around in the logs trying to figure out why a trigger ended in "error". Does this sound like the culprit? Creating new event: NO_SESSION
11:37 berick Bmagic: maybe, but "event" in the message is not referring to a trigger event
11:37 Bmagic I see that now
11:37 Bmagic so, probably not related
11:37 mmorgan Bmagic: What type of trigger?
11:38 Bmagic checkout.due SendEmail
11:40 mmorgan Ok, yeah, probably not related to the NO_SESSION.
11:42 Bmagic I think my grep isn't getting enough contextual logs, trying 1000 lines instead of 100
11:42 berick nothing in error log?  or event.error_output ?
11:43 Bmagic nothing in error_output, not sure what you're referring to: error log
11:43 berick well, grep ERR
11:44 Bmagic that's what I'm hunting down.
11:44 * berick nods
11:46 Bmagic yeah, grep wasn't giving me the lines I needed. This looks like the issue open-ils.cstore 2023-06-27 08:47:14 [ERR :1303100:oils_sql.c:2376:] No active transaction -- required for CREATE
11:47 Bmagic followed by Event reacting failed with Can't call method "id" on an undefined value at /usr/local/share/perl/5.30.0/OpenI​LS/Application/Trigger/Reactor.pm line 564
11:50 mmorgan Perhaps the circ transaction is already closed?
11:51 mmorgan Though that might end up invalid rather than error.
11:54 Bmagic it's trying to create the event_output row in the database and failing
11:54 Bmagic I'm thinking OpenSRF drone/router crash
11:56 berick 'No active transaction' can also happen if a cstore backend is idle too long / keepalive timeout expires.
11:58 berick which would be preceeded by INFO logs for "No request was received in"...
11:59 Bmagic hmmm, that's interesting. What's the setting for timeout in this area of the world?
12:04 berick Bmagic: opensrf.xml -> cstore -> keepalive.  the bigger question though is what is causing the timeout.
12:04 berick some other processing is going on while cstore has an open transaction
12:06 Bmagic yeah, it gets busy at night
12:07 Bmagic based on the logs, it seems that the hold targeter was running at the same time, but that seems "normal"
12:08 berick what's it trying to CREATE when it raised that 'no active transactin' error?
12:09 jihpringle joined #evergreen
12:09 berick heh, why ain't it transactin'?
12:14 mmorgan :)
12:18 Bmagic berick: request en-US open-ils.cstore.direct.action​_trigger.event_output.create <new object>
12:22 berick ah so at this point an error has already occurred
12:22 berick is there no "Error processing Trigger template" preceeding it in the logs?
12:23 Bmagic no, not seeing that
12:24 Bmagic The log contains the processed template immdiately above the error line :No active transaction -- required for CREATE
12:25 berick oh, sorry, i misread event_output.create as error_output
12:25 Dyrcona Bmagic: Do you know what patron this email would go to? Is there enough info in the logs to find out?
12:27 Bmagic yep, I can walk the target back to the circulation. Though, the template does not have the patron's email address in it, so I assume this particular email wouldn't have been able to be sent, but, I would expect to have created an output in action_trigger.event_output. Which, we've learned didn't happen because of the 6 second timeout for the DB transation
12:27 Dyrcona I've seen where triggers will blow up because some account had 100+ overdue items and it took too long to gather all the data.
12:27 Bmagic well, that could be part of the issue here. There does seem to be a lot of data in the template output
12:27 Dyrcona Bmagic: I think that event is grouped by usr, no? This means it will pull together all of that usr's circulations for the email.
12:28 Bmagic yes, grouped by usr
12:29 Bmagic 15 items overdue
12:29 Dyrcona I don't recall if template data was there or if it was complete if it was there. This used to happen at CW MARS a lot. I struggled with this a couple of years ago. I made some adjustments here and there and it mostly stopped being an issue.
12:30 Dyrcona Only 15? That isn't very many... Unless. Does one or more of them have a crazy long title?
12:31 Dyrcona Another thing: You could dump the template output column to a file and do 'wc' on it to see how big it is. You could length or whatever the function is in the database. It might be too much data for ejabberd to handle.
12:32 Dyrcona That was on of my issues with sending autorenewal emails, as I recall. I had some that were > 2MB so I bumped the ejabberd setting up to 10MB.
12:32 Dyrcona IF that's not it, I'll keep brainstorming with you.
12:33 Bmagic ejabberd has  max_stanza_size: 10485760
12:35 Dyrcona The event exists in the database with template output? (Just making sure)
12:36 Bmagic psql -c "select template from action_trigger.event_definition where id=493"|wc
12:37 Bmagic 34     184    5095
12:37 Dyrcona That's the template, not the output.
12:37 Bmagic the event output does not exist, because ^^^ above
12:38 Dyrcona Ok. I thought you might have had output in the database.
12:38 Bmagic though, I can copy/paste the resulting output from the log and get a size estimate
12:39 Dyrcona You can try processing the hook from the command line with a 'tail -f ' on the log in another window.
12:39 Bmagic 4KB resulting file when I excerpt the template output into a file on the hard drive
12:43 Dyrcona Well, I don't know at this point what the problem is. I've seen things like this where events just fail or get stuck in 'collected' status. I'm assuming it's a bug with either directing messages to multiple backend or possibly a drone going away and the listener not keeping up.
12:43 Bmagic berick Dyrcona mmorgan: https://pastebin.com/rS6M6UnB
12:43 berick well, " error starting database transaction" pretty much kills everything
12:44 Bmagic I believe we figured out that it was a timeout issue. There is a log farther up the log that says "open-ils.cstore 2023-06-27 08:47:13 [INFO:1303721:osrf_prefork.c:505:] No request was received in 6 seconds, exiting stateful session"
12:44 Dyrcona Yeah, I'd say check the Pg logs for that timestamp and see if something turns up there.
12:44 Dyrcona Maybe it hit some limit in the database?
12:45 Bmagic based on the distance in the logs, I think it actually took more than 6 seconds for the code to finally get to the part where it was going to create the output row
12:45 berick so.. "error starting db transaction" could be because we are no longer in a connected session.. fits the timeout scenario.
12:46 berick yeah, the interesting bit is before these logs
12:46 berick between 08:47:07 and 08:47:13 or thereabouts
12:47 Bmagic let me see if I can narrow down the relavant stuff. Does cstore make a log entry when the transaction is started? I'm tracing the PID 1303721
12:48 mmorgan Bmagic: To be clear, it's just this one event that failed? All others processed normally?
12:48 Bmagic for this group, they all failed
12:49 berick Bmagic: what logs are there for that open-ils.trigger processing leading up to "error starting database transaction"
12:49 Bmagic oh but I'm realizing now that there are 15 ate rows, one for each acirc row
12:49 berick ?
12:49 Bmagic so, it's just one patron, one group, 15 "error" ate
12:49 berick s/processing/process/
12:51 mmorgan ok, so 15 events -> one event_output row (that failed to be created).
12:51 Bmagic ok, let me see if I can get that
12:51 Bmagic mmorgan: yep
12:52 Dyrcona Doesn't xact_begin start a new session if there isn't one? Could the session have gone away, but the drone not know it?
12:56 berick Dyrcona: that's what i'm thinking.  the session timed out while compiling the template, it then tried to send a BEGIN on a disconnected session.
12:57 Bmagic it almost has to be
12:57 berick and it didn't know it was disconnected because no recv() / queue_wait calls occurred before the "am I still connected" test in editor's xact_begin() sub.
12:57 Dyrcona I'd still check the database logs for anything weird around that time.
12:58 Bmagic db logs don't have anything interesting around that time
12:58 Bmagic lots of " WARNING:  there is no transaction in progress"
12:59 Dyrcona Well, that's a bummer.
13:00 Bmagic we do have a clear log that says "No request was received in 6 seconds, exiting stateful session"  - so, I wouldn't expect that the postgres service would have expierenced an error
13:06 Dyrcona Can you tell if that's for the same cstore drone/process?
13:13 * Dyrcona just added 64 tests to live_t/20-hold-targeter.t.
13:21 sleary joined #evergreen
13:25 * Dyrcona wishes there was a decent way to test the pull list from the backend, but the results are not predictable enough.
13:37 Dyrcona joined #evergreen
13:38 Bmagic sorry guys, I was pulled away, now, I'm back at it
13:38 * Dyrcona is switching internet connections.
13:39 Bmagic I'm stuck on this command: zgrep open-ils.trigger /openils/var/log/osrfsys.log.3.gz |grep "2023-06-27 08:4"|less
13:39 Bmagic I'm getting nothing
13:39 Bmagic which doesn't make sense, because I know there are lines in there with those. Maybe I have an bash escape issue, regex something
13:40 jvwoolf joined #evergreen
13:41 Dyrcona joined #evergreen
13:45 Dyrcona Bmaric are you sure that osrfsys.log.3.gz contains rows for that time period?
13:45 Dyrcona Bmagic ^^
13:45 Bmagic yeah, that's what I'm trying to confirm. The original grep shows the filename, which is why I used that filename. But I think because I used -C 1000, the filename isn't correct
13:46 Bmagic now I'm getting hits, to answer the question about what the action trigger logs look like above* the 6 second timeout line
13:47 Dyrcona OK. I was going to suggest using -F on the grep or zgrep, or even eliminating the second grep and try the timestamp via zgrep.
13:51 collum joined #evergreen
13:54 * Dyrcona considers adding more tests... but needs to refresh his memory on making cstore requests from srfsh.
13:54 Bmagic berick: RE: "between 08:47:07 and 08:47:13 or thereabouts" https://pastebin.com/8agDcMnX
13:55 Bmagic the event ID that I'm tracking: 123020652
13:59 Bmagic do we start the transaction (the one that we're going to use to write the output to the database), and then* start generating the template? Leaving that transaction open for the duration?
14:04 Dyrcona Bmagic: I think it might.
14:05 Bmagic I think I'm going to give up. My curiousity ate my morning already
14:06 Bmagic ate (hahaha, that's action_trigger.event)
14:07 Dyrcona :)\
14:08 Dyrcona You can run the hook again and see if the event processes this time.
14:08 Bmagic It does succeed most of the time
14:08 Bmagic I was just curious about this time
14:09 Dyrcona Yeah. It happens more often than one would like.
14:19 Dyrcona Well, I think that's enough tests. We can always add more later.
14:25 berick oddly it tries to create the transaction after it's compiled the template.  "trigger: writing 3273 bytes to template output" happens after template compilation.
14:26 berick then cstore sends a bunch of stuff back after the begin.  that part I don't get.
14:43 Dyrcona Eighteen commits rebased down to 1, and I'm ready to share the code...
15:04 Dyrcona I suppose we have functional tests for the staff client that can be run. I guess I should learn how those work.
15:11 * eeevil reads up and sees Bmagic's adventure... ah yes, a Very Expensive Template Output (esp a grouped set of expensive events) could take longer than the session keepalive time (configured in opensrf.xml, per service) and it does look like that. the fast fix is to increase the keepalive time. I know of EG instances that have in the past had a timeout of 12s (slower hardware, and before some code changes lowered the timeout risk), but I don't see
15:11 * eeevil any currently. if it's right at the edge of the time limit, that'd give breathing room to figure out where we need to insert more interaction and avoid that particular session timeout in A/T land.
15:19 Bmagic eeevil++ berick++ Dyrcona++ mmorgan++ # thanks for the ideas. timeout 6 > 12 seems easy enough. What's the drawback? Tying up more Drones, and running up against max_children?
15:20 berick exactly
15:20 eeevil well, the session ends as soon as the client says "disconnect", so only sessions that /want/ to last longer than 6s will, and only up to 12s, so you might see a few cstores sit idle for up to 6s longer than today
15:21 eeevil but, if you've only found this one instance, then you're pretty safe
15:22 eeevil if you're already at the point of running out of drones, this won't make that better. but, you just need more drones
15:23 Bmagic trigger drones seem to hang out at less than 10. cstore drones, can go up to like 100 sometimes, on the utility server. I've got that set to 250 max. I think that is still OK
15:24 Dyrcona I experimented with jacking the timeout up in production in the past and didn't really have any issues with running out of drones.
15:24 Dyrcona Let me check some git history to see if I still have the values that I tried.
15:24 eeevil if trigger is you're only client-culprit for this, then the risk is tying up that many (10-ish) cstores. seems safe
15:29 Dyrcona Hm... I remember having this issue 9 to 7 years ago at MVLC and jacking the timeout up to low double digits. I'd have to dig my MVLC code out to find it.
15:29 Bmagic makes sense
15:29 Bmagic I like 12. I'll see what that can do for me in the coming weeks
15:30 mantis left #evergreen
15:31 Dyrcona I think I ended up at 15, but it was kind of arbitrary. I remember it was a "snow day" and we had things go haywire.
15:31 Bmagic haha
15:45 Bmagic is it possible that when this error occurs, that we can see cascading issues across other triggers that might be running at the same time?
15:46 Bmagic Like, one patron with expensive template composition, could cause the trigger drone to error out, which, in turn could affect other drones? or other trigger drones that were using the same cstore drone? something like that?
15:47 eeevil if they're separate, there shouldn't be any interaction
15:47 Bmagic if the AT batch was processing 3000 AT's and, this issue happened in the first 15 AT's, the rest of the 2985 triggers could bomb too?
15:49 eeevil I wouldn't expect that, but it depends on the details. each a/t environment should be getting its own cstore editor, so later events (or event groups) shouldn't be impacted. but it depends on other factors, like the parallelism setting, etc
15:51 Bmagic I have messed with the trigger parallel setting. It's at 3 today
15:51 Bmagic collect 3, react 3
15:54 Dyrcona What I saw was some failed and most succeeded. It was often more than 1 event affected.
15:55 Dyrcona It also wasn't time related. Some events after the failed events succeeded. I think it had to do with the cstore that the trigger drones were talking too.
16:03 Bmagic eeevil++
17:05 mmorgan left #evergreen
17:12 JBoyer joined #evergreen
21:42 sleary joined #evergreen

| Channels | #evergreen index | Today | | Search | Google Search | Plain-Text | summary | Join Webchat