Time |
Nick |
Message |
08:02 |
|
BDorsey joined #evergreen |
08:56 |
|
mmorgan joined #evergreen |
09:19 |
|
Dyrcona joined #evergreen |
09:28 |
|
mantis joined #evergreen |
09:53 |
|
mantis joined #evergreen |
10:18 |
|
jvwoolf joined #evergreen |
11:17 |
|
Christineb joined #evergreen |
11:33 |
Bmagic |
I'm digging around in the logs trying to figure out why a trigger ended in "error". Does this sound like the culprit? Creating new event: NO_SESSION |
11:37 |
berick |
Bmagic: maybe, but "event" in the message is not referring to a trigger event |
11:37 |
Bmagic |
I see that now |
11:37 |
Bmagic |
so, probably not related |
11:37 |
mmorgan |
Bmagic: What type of trigger? |
11:38 |
Bmagic |
checkout.due SendEmail |
11:40 |
mmorgan |
Ok, yeah, probably not related to the NO_SESSION. |
11:42 |
Bmagic |
I think my grep isn't getting enough contextual logs, trying 1000 lines instead of 100 |
11:42 |
berick |
nothing in error log? or event.error_output ? |
11:43 |
Bmagic |
nothing in error_output, not sure what you're referring to: error log |
11:43 |
berick |
well, grep ERR |
11:44 |
Bmagic |
that's what I'm hunting down. |
11:44 |
* berick |
nods |
11:46 |
Bmagic |
yeah, grep wasn't giving me the lines I needed. This looks like the issue open-ils.cstore 2023-06-27 08:47:14 [ERR :1303100:oils_sql.c:2376:] No active transaction -- required for CREATE |
11:47 |
Bmagic |
followed by Event reacting failed with Can't call method "id" on an undefined value at /usr/local/share/perl/5.30.0/OpenILS/Application/Trigger/Reactor.pm line 564 |
11:50 |
mmorgan |
Perhaps the circ transaction is already closed? |
11:51 |
mmorgan |
Though that might end up invalid rather than error. |
11:54 |
Bmagic |
it's trying to create the event_output row in the database and failing |
11:54 |
Bmagic |
I'm thinking OpenSRF drone/router crash |
11:56 |
berick |
'No active transaction' can also happen if a cstore backend is idle too long / keepalive timeout expires. |
11:58 |
berick |
which would be preceeded by INFO logs for "No request was received in"... |
11:59 |
Bmagic |
hmmm, that's interesting. What's the setting for timeout in this area of the world? |
12:04 |
berick |
Bmagic: opensrf.xml -> cstore -> keepalive. the bigger question though is what is causing the timeout. |
12:04 |
berick |
some other processing is going on while cstore has an open transaction |
12:06 |
Bmagic |
yeah, it gets busy at night |
12:07 |
Bmagic |
based on the logs, it seems that the hold targeter was running at the same time, but that seems "normal" |
12:08 |
berick |
what's it trying to CREATE when it raised that 'no active transactin' error? |
12:09 |
|
jihpringle joined #evergreen |
12:09 |
berick |
heh, why ain't it transactin'? |
12:14 |
mmorgan |
:) |
12:18 |
Bmagic |
berick: request en-US open-ils.cstore.direct.action_trigger.event_output.create <new object> |
12:22 |
berick |
ah so at this point an error has already occurred |
12:22 |
berick |
is there no "Error processing Trigger template" preceeding it in the logs? |
12:23 |
Bmagic |
no, not seeing that |
12:24 |
Bmagic |
The log contains the processed template immdiately above the error line :No active transaction -- required for CREATE |
12:25 |
berick |
oh, sorry, i misread event_output.create as error_output |
12:25 |
Dyrcona |
Bmagic: Do you know what patron this email would go to? Is there enough info in the logs to find out? |
12:27 |
Bmagic |
yep, I can walk the target back to the circulation. Though, the template does not have the patron's email address in it, so I assume this particular email wouldn't have been able to be sent, but, I would expect to have created an output in action_trigger.event_output. Which, we've learned didn't happen because of the 6 second timeout for the DB transation |
12:27 |
Dyrcona |
I've seen where triggers will blow up because some account had 100+ overdue items and it took too long to gather all the data. |
12:27 |
Bmagic |
well, that could be part of the issue here. There does seem to be a lot of data in the template output |
12:27 |
Dyrcona |
Bmagic: I think that event is grouped by usr, no? This means it will pull together all of that usr's circulations for the email. |
12:28 |
Bmagic |
yes, grouped by usr |
12:29 |
Bmagic |
15 items overdue |
12:29 |
Dyrcona |
I don't recall if template data was there or if it was complete if it was there. This used to happen at CW MARS a lot. I struggled with this a couple of years ago. I made some adjustments here and there and it mostly stopped being an issue. |
12:30 |
Dyrcona |
Only 15? That isn't very many... Unless. Does one or more of them have a crazy long title? |
12:31 |
Dyrcona |
Another thing: You could dump the template output column to a file and do 'wc' on it to see how big it is. You could length or whatever the function is in the database. It might be too much data for ejabberd to handle. |
12:32 |
Dyrcona |
That was on of my issues with sending autorenewal emails, as I recall. I had some that were > 2MB so I bumped the ejabberd setting up to 10MB. |
12:32 |
Dyrcona |
IF that's not it, I'll keep brainstorming with you. |
12:33 |
Bmagic |
ejabberd has max_stanza_size: 10485760 |
12:35 |
Dyrcona |
The event exists in the database with template output? (Just making sure) |
12:36 |
Bmagic |
psql -c "select template from action_trigger.event_definition where id=493"|wc |
12:37 |
Bmagic |
34 184 5095 |
12:37 |
Dyrcona |
That's the template, not the output. |
12:37 |
Bmagic |
the event output does not exist, because ^^^ above |
12:38 |
Dyrcona |
Ok. I thought you might have had output in the database. |
12:38 |
Bmagic |
though, I can copy/paste the resulting output from the log and get a size estimate |
12:39 |
Dyrcona |
You can try processing the hook from the command line with a 'tail -f ' on the log in another window. |
12:39 |
Bmagic |
4KB resulting file when I excerpt the template output into a file on the hard drive |
12:43 |
Dyrcona |
Well, I don't know at this point what the problem is. I've seen things like this where events just fail or get stuck in 'collected' status. I'm assuming it's a bug with either directing messages to multiple backend or possibly a drone going away and the listener not keeping up. |
12:43 |
Bmagic |
berick Dyrcona mmorgan: https://pastebin.com/rS6M6UnB |
12:43 |
berick |
well, " error starting database transaction" pretty much kills everything |
12:44 |
Bmagic |
I believe we figured out that it was a timeout issue. There is a log farther up the log that says "open-ils.cstore 2023-06-27 08:47:13 [INFO:1303721:osrf_prefork.c:505:] No request was received in 6 seconds, exiting stateful session" |
12:44 |
Dyrcona |
Yeah, I'd say check the Pg logs for that timestamp and see if something turns up there. |
12:44 |
Dyrcona |
Maybe it hit some limit in the database? |
12:45 |
Bmagic |
based on the distance in the logs, I think it actually took more than 6 seconds for the code to finally get to the part where it was going to create the output row |
12:45 |
berick |
so.. "error starting db transaction" could be because we are no longer in a connected session.. fits the timeout scenario. |
12:46 |
berick |
yeah, the interesting bit is before these logs |
12:46 |
berick |
between 08:47:07 and 08:47:13 or thereabouts |
12:47 |
Bmagic |
let me see if I can narrow down the relavant stuff. Does cstore make a log entry when the transaction is started? I'm tracing the PID 1303721 |
12:48 |
mmorgan |
Bmagic: To be clear, it's just this one event that failed? All others processed normally? |
12:48 |
Bmagic |
for this group, they all failed |
12:49 |
berick |
Bmagic: what logs are there for that open-ils.trigger processing leading up to "error starting database transaction" |
12:49 |
Bmagic |
oh but I'm realizing now that there are 15 ate rows, one for each acirc row |
12:49 |
berick |
? |
12:49 |
Bmagic |
so, it's just one patron, one group, 15 "error" ate |
12:49 |
berick |
s/processing/process/ |
12:51 |
mmorgan |
ok, so 15 events -> one event_output row (that failed to be created). |
12:51 |
Bmagic |
ok, let me see if I can get that |
12:51 |
Bmagic |
mmorgan: yep |
12:52 |
Dyrcona |
Doesn't xact_begin start a new session if there isn't one? Could the session have gone away, but the drone not know it? |
12:56 |
berick |
Dyrcona: that's what i'm thinking. the session timed out while compiling the template, it then tried to send a BEGIN on a disconnected session. |
12:57 |
Bmagic |
it almost has to be |
12:57 |
berick |
and it didn't know it was disconnected because no recv() / queue_wait calls occurred before the "am I still connected" test in editor's xact_begin() sub. |
12:57 |
Dyrcona |
I'd still check the database logs for anything weird around that time. |
12:58 |
Bmagic |
db logs don't have anything interesting around that time |
12:58 |
Bmagic |
lots of " WARNING: there is no transaction in progress" |
12:59 |
Dyrcona |
Well, that's a bummer. |
13:00 |
Bmagic |
we do have a clear log that says "No request was received in 6 seconds, exiting stateful session" - so, I wouldn't expect that the postgres service would have expierenced an error |
13:06 |
Dyrcona |
Can you tell if that's for the same cstore drone/process? |
13:13 |
* Dyrcona |
just added 64 tests to live_t/20-hold-targeter.t. |
13:21 |
|
sleary joined #evergreen |
13:25 |
* Dyrcona |
wishes there was a decent way to test the pull list from the backend, but the results are not predictable enough. |
13:37 |
|
Dyrcona joined #evergreen |
13:38 |
Bmagic |
sorry guys, I was pulled away, now, I'm back at it |
13:38 |
* Dyrcona |
is switching internet connections. |
13:39 |
Bmagic |
I'm stuck on this command: zgrep open-ils.trigger /openils/var/log/osrfsys.log.3.gz |grep "2023-06-27 08:4"|less |
13:39 |
Bmagic |
I'm getting nothing |
13:39 |
Bmagic |
which doesn't make sense, because I know there are lines in there with those. Maybe I have an bash escape issue, regex something |
13:40 |
|
jvwoolf joined #evergreen |
13:41 |
|
Dyrcona joined #evergreen |
13:45 |
Dyrcona |
Bmaric are you sure that osrfsys.log.3.gz contains rows for that time period? |
13:45 |
Dyrcona |
Bmagic ^^ |
13:45 |
Bmagic |
yeah, that's what I'm trying to confirm. The original grep shows the filename, which is why I used that filename. But I think because I used -C 1000, the filename isn't correct |
13:46 |
Bmagic |
now I'm getting hits, to answer the question about what the action trigger logs look like above* the 6 second timeout line |
13:47 |
Dyrcona |
OK. I was going to suggest using -F on the grep or zgrep, or even eliminating the second grep and try the timestamp via zgrep. |
13:51 |
|
collum joined #evergreen |
13:54 |
* Dyrcona |
considers adding more tests... but needs to refresh his memory on making cstore requests from srfsh. |
13:54 |
Bmagic |
berick: RE: "between 08:47:07 and 08:47:13 or thereabouts" https://pastebin.com/8agDcMnX |
13:55 |
Bmagic |
the event ID that I'm tracking: 123020652 |
13:59 |
Bmagic |
do we start the transaction (the one that we're going to use to write the output to the database), and then* start generating the template? Leaving that transaction open for the duration? |
14:04 |
Dyrcona |
Bmagic: I think it might. |
14:05 |
Bmagic |
I think I'm going to give up. My curiousity ate my morning already |
14:06 |
Bmagic |
ate (hahaha, that's action_trigger.event) |
14:07 |
Dyrcona |
:)\ |
14:08 |
Dyrcona |
You can run the hook again and see if the event processes this time. |
14:08 |
Bmagic |
It does succeed most of the time |
14:08 |
Bmagic |
I was just curious about this time |
14:09 |
Dyrcona |
Yeah. It happens more often than one would like. |
14:19 |
Dyrcona |
Well, I think that's enough tests. We can always add more later. |
14:25 |
berick |
oddly it tries to create the transaction after it's compiled the template. "trigger: writing 3273 bytes to template output" happens after template compilation. |
14:26 |
berick |
then cstore sends a bunch of stuff back after the begin. that part I don't get. |
14:43 |
Dyrcona |
Eighteen commits rebased down to 1, and I'm ready to share the code... |
15:04 |
Dyrcona |
I suppose we have functional tests for the staff client that can be run. I guess I should learn how those work. |
15:11 |
* eeevil |
reads up and sees Bmagic's adventure... ah yes, a Very Expensive Template Output (esp a grouped set of expensive events) could take longer than the session keepalive time (configured in opensrf.xml, per service) and it does look like that. the fast fix is to increase the keepalive time. I know of EG instances that have in the past had a timeout of 12s (slower hardware, and before some code changes lowered the timeout risk), but I don't see |
15:11 |
* eeevil |
any currently. if it's right at the edge of the time limit, that'd give breathing room to figure out where we need to insert more interaction and avoid that particular session timeout in A/T land. |
15:19 |
Bmagic |
eeevil++ berick++ Dyrcona++ mmorgan++ # thanks for the ideas. timeout 6 > 12 seems easy enough. What's the drawback? Tying up more Drones, and running up against max_children? |
15:20 |
berick |
exactly |
15:20 |
eeevil |
well, the session ends as soon as the client says "disconnect", so only sessions that /want/ to last longer than 6s will, and only up to 12s, so you might see a few cstores sit idle for up to 6s longer than today |
15:21 |
eeevil |
but, if you've only found this one instance, then you're pretty safe |
15:22 |
eeevil |
if you're already at the point of running out of drones, this won't make that better. but, you just need more drones |
15:23 |
Bmagic |
trigger drones seem to hang out at less than 10. cstore drones, can go up to like 100 sometimes, on the utility server. I've got that set to 250 max. I think that is still OK |
15:24 |
Dyrcona |
I experimented with jacking the timeout up in production in the past and didn't really have any issues with running out of drones. |
15:24 |
Dyrcona |
Let me check some git history to see if I still have the values that I tried. |
15:24 |
eeevil |
if trigger is you're only client-culprit for this, then the risk is tying up that many (10-ish) cstores. seems safe |
15:29 |
Dyrcona |
Hm... I remember having this issue 9 to 7 years ago at MVLC and jacking the timeout up to low double digits. I'd have to dig my MVLC code out to find it. |
15:29 |
Bmagic |
makes sense |
15:29 |
Bmagic |
I like 12. I'll see what that can do for me in the coming weeks |
15:30 |
|
mantis left #evergreen |
15:31 |
Dyrcona |
I think I ended up at 15, but it was kind of arbitrary. I remember it was a "snow day" and we had things go haywire. |
15:31 |
Bmagic |
haha |
15:45 |
Bmagic |
is it possible that when this error occurs, that we can see cascading issues across other triggers that might be running at the same time? |
15:46 |
Bmagic |
Like, one patron with expensive template composition, could cause the trigger drone to error out, which, in turn could affect other drones? or other trigger drones that were using the same cstore drone? something like that? |
15:47 |
eeevil |
if they're separate, there shouldn't be any interaction |
15:47 |
Bmagic |
if the AT batch was processing 3000 AT's and, this issue happened in the first 15 AT's, the rest of the 2985 triggers could bomb too? |
15:49 |
eeevil |
I wouldn't expect that, but it depends on the details. each a/t environment should be getting its own cstore editor, so later events (or event groups) shouldn't be impacted. but it depends on other factors, like the parallelism setting, etc |
15:51 |
Bmagic |
I have messed with the trigger parallel setting. It's at 3 today |
15:51 |
Bmagic |
collect 3, react 3 |
15:54 |
Dyrcona |
What I saw was some failed and most succeeded. It was often more than 1 event affected. |
15:55 |
Dyrcona |
It also wasn't time related. Some events after the failed events succeeded. I think it had to do with the cstore that the trigger drones were talking too. |
16:03 |
Bmagic |
eeevil++ |
17:05 |
|
mmorgan left #evergreen |
17:12 |
|
JBoyer joined #evergreen |
21:42 |
|
sleary joined #evergreen |