IRC log for #evergreen, 2023-06-27

All times shown according to the server's local time.

Time	Nick	Message
08:02		BDorsey joined #evergreen
08:56		mmorgan joined #evergreen
09:19		Dyrcona joined #evergreen
09:28		mantis joined #evergreen
09:53		mantis joined #evergreen
10:18		jvwoolf joined #evergreen
11:17		Christineb joined #evergreen
11:33	Bmagic	I'm digging around in the logs trying to figure out why a trigger ended in "error". Does this sound like the culprit? Creating new event: NO_SESSION
11:37	berick	Bmagic: maybe, but "event" in the message is not referring to a trigger event
11:37	Bmagic	I see that now
11:37	Bmagic	so, probably not related
11:37	mmorgan	Bmagic: What type of trigger?
11:38	Bmagic	checkout.due SendEmail
11:40	mmorgan	Ok, yeah, probably not related to the NO_SESSION.
11:42	Bmagic	I think my grep isn't getting enough contextual logs, trying 1000 lines instead of 100
11:42	berick	nothing in error log? or event.error_output ?
11:43	Bmagic	nothing in error_output, not sure what you're referring to: error log
11:43	berick	well, grep ERR
11:44	Bmagic	that's what I'm hunting down.
11:44	* berick	nods
11:46	Bmagic	yeah, grep wasn't giving me the lines I needed. This looks like the issue open-ils.cstore 2023-06-27 08:47:14 [ERR :1303100:oils_sql.c:2376:] No active transaction -- required for CREATE
11:47	Bmagic	followed by Event reacting failed with Can't call method "id" on an undefined value at /usr/local/share/perl/5.30.0/OpenILS/Application/Trigger/Reactor.pm line 564
11:50	mmorgan	Perhaps the circ transaction is already closed?
11:51	mmorgan	Though that might end up invalid rather than error.
11:54	Bmagic	it's trying to create the event_output row in the database and failing
11:54	Bmagic	I'm thinking OpenSRF drone/router crash
11:56	berick	'No active transaction' can also happen if a cstore backend is idle too long / keepalive timeout expires.
11:58	berick	which would be preceeded by INFO logs for "No request was received in"...
11:59	Bmagic	hmmm, that's interesting. What's the setting for timeout in this area of the world?
12:04	berick	Bmagic: opensrf.xml -> cstore -> keepalive. the bigger question though is what is causing the timeout.
12:04	berick	some other processing is going on while cstore has an open transaction
12:06	Bmagic	yeah, it gets busy at night
12:07	Bmagic	based on the logs, it seems that the hold targeter was running at the same time, but that seems "normal"
12:08	berick	what's it trying to CREATE when it raised that 'no active transactin' error?
12:09		jihpringle joined #evergreen
12:09	berick	heh, why ain't it transactin'?
12:14	mmorgan	:)
12:18	Bmagic	berick: request en-US open-ils.cstore.direct.action_trigger.event_output.create <new object>
12:22	berick	ah so at this point an error has already occurred
12:22	berick	is there no "Error processing Trigger template" preceeding it in the logs?
12:23	Bmagic	no, not seeing that
12:24	Bmagic	The log contains the processed template immdiately above the error line :No active transaction -- required for CREATE
12:25	berick	oh, sorry, i misread event_output.create as error_output
12:25	Dyrcona	Bmagic: Do you know what patron this email would go to? Is there enough info in the logs to find out?
12:27	Bmagic	yep, I can walk the target back to the circulation. Though, the template does not have the patron's email address in it, so I assume this particular email wouldn't have been able to be sent, but, I would expect to have created an output in action_trigger.event_output. Which, we've learned didn't happen because of the 6 second timeout for the DB transation
12:27	Dyrcona	I've seen where triggers will blow up because some account had 100+ overdue items and it took too long to gather all the data.
12:27	Bmagic	well, that could be part of the issue here. There does seem to be a lot of data in the template output
12:27	Dyrcona	Bmagic: I think that event is grouped by usr, no? This means it will pull together all of that usr's circulations for the email.
12:28	Bmagic	yes, grouped by usr
12:29	Bmagic	15 items overdue
12:29	Dyrcona	I don't recall if template data was there or if it was complete if it was there. This used to happen at CW MARS a lot. I struggled with this a couple of years ago. I made some adjustments here and there and it mostly stopped being an issue.
12:30	Dyrcona	Only 15? That isn't very many... Unless. Does one or more of them have a crazy long title?
12:31	Dyrcona	Another thing: You could dump the template output column to a file and do 'wc' on it to see how big it is. You could length or whatever the function is in the database. It might be too much data for ejabberd to handle.
12:32	Dyrcona	That was on of my issues with sending autorenewal emails, as I recall. I had some that were > 2MB so I bumped the ejabberd setting up to 10MB.
12:32	Dyrcona	IF that's not it, I'll keep brainstorming with you.
12:33	Bmagic	ejabberd has max_stanza_size: 10485760
12:35	Dyrcona	The event exists in the database with template output? (Just making sure)
12:36	Bmagic	psql -c "select template from action_trigger.event_definition where id=493"\|wc
12:37	Bmagic	34 184 5095
12:37	Dyrcona	That's the template, not the output.
12:37	Bmagic	the event output does not exist, because ^^^ above
12:38	Dyrcona	Ok. I thought you might have had output in the database.
12:38	Bmagic	though, I can copy/paste the resulting output from the log and get a size estimate
12:39	Dyrcona	You can try processing the hook from the command line with a 'tail -f ' on the log in another window.
12:39	Bmagic	4KB resulting file when I excerpt the template output into a file on the hard drive
12:43	Dyrcona	Well, I don't know at this point what the problem is. I've seen things like this where events just fail or get stuck in 'collected' status. I'm assuming it's a bug with either directing messages to multiple backend or possibly a drone going away and the listener not keeping up.
12:43	Bmagic	berick Dyrcona mmorgan: https://pastebin.com/rS6M6UnB
12:43	berick	well, " error starting database transaction" pretty much kills everything
12:44	Bmagic	I believe we figured out that it was a timeout issue. There is a log farther up the log that says "open-ils.cstore 2023-06-27 08:47:13 [INFO:1303721:osrf_prefork.c:505:] No request was received in 6 seconds, exiting stateful session"
12:44	Dyrcona	Yeah, I'd say check the Pg logs for that timestamp and see if something turns up there.
12:44	Dyrcona	Maybe it hit some limit in the database?
12:45	Bmagic	based on the distance in the logs, I think it actually took more than 6 seconds for the code to finally get to the part where it was going to create the output row
12:45	berick	so.. "error starting db transaction" could be because we are no longer in a connected session.. fits the timeout scenario.
12:46	berick	yeah, the interesting bit is before these logs
12:46	berick	between 08:47:07 and 08:47:13 or thereabouts
12:47	Bmagic	let me see if I can narrow down the relavant stuff. Does cstore make a log entry when the transaction is started? I'm tracing the PID 1303721
12:48	mmorgan	Bmagic: To be clear, it's just this one event that failed? All others processed normally?
12:48	Bmagic	for this group, they all failed
12:49	berick	Bmagic: what logs are there for that open-ils.trigger processing leading up to "error starting database transaction"
12:49	Bmagic	oh but I'm realizing now that there are 15 ate rows, one for each acirc row
12:49	berick	?
12:49	Bmagic	so, it's just one patron, one group, 15 "error" ate
12:49	berick	s/processing/process/
12:51	mmorgan	ok, so 15 events -> one event_output row (that failed to be created).
12:51	Bmagic	ok, let me see if I can get that
12:51	Bmagic	mmorgan: yep
12:52	Dyrcona	Doesn't xact_begin start a new session if there isn't one? Could the session have gone away, but the drone not know it?
12:56	berick	Dyrcona: that's what i'm thinking. the session timed out while compiling the template, it then tried to send a BEGIN on a disconnected session.
12:57	Bmagic	it almost has to be
12:57	berick	and it didn't know it was disconnected because no recv() / queue_wait calls occurred before the "am I still connected" test in editor's xact_begin() sub.
12:57	Dyrcona	I'd still check the database logs for anything weird around that time.
12:58	Bmagic	db logs don't have anything interesting around that time
12:58	Bmagic	lots of " WARNING: there is no transaction in progress"
12:59	Dyrcona	Well, that's a bummer.
13:00	Bmagic	we do have a clear log that says "No request was received in 6 seconds, exiting stateful session" - so, I wouldn't expect that the postgres service would have expierenced an error
13:06	Dyrcona	Can you tell if that's for the same cstore drone/process?
13:13	* Dyrcona	just added 64 tests to live_t/20-hold-targeter.t.
13:21		sleary joined #evergreen
13:25	* Dyrcona	wishes there was a decent way to test the pull list from the backend, but the results are not predictable enough.
13:37		Dyrcona joined #evergreen
13:38	Bmagic	sorry guys, I was pulled away, now, I'm back at it
13:38	* Dyrcona	is switching internet connections.
13:39	Bmagic	I'm stuck on this command: zgrep open-ils.trigger /openils/var/log/osrfsys.log.3.gz \|grep "2023-06-27 08:4"\|less
13:39	Bmagic	I'm getting nothing
13:39	Bmagic	which doesn't make sense, because I know there are lines in there with those. Maybe I have an bash escape issue, regex something
13:40		jvwoolf joined #evergreen
13:41		Dyrcona joined #evergreen
13:45	Dyrcona	Bmaric are you sure that osrfsys.log.3.gz contains rows for that time period?
13:45	Dyrcona	Bmagic ^^
13:45	Bmagic	yeah, that's what I'm trying to confirm. The original grep shows the filename, which is why I used that filename. But I think because I used -C 1000, the filename isn't correct
13:46	Bmagic	now I'm getting hits, to answer the question about what the action trigger logs look like above* the 6 second timeout line
13:47	Dyrcona	OK. I was going to suggest using -F on the grep or zgrep, or even eliminating the second grep and try the timestamp via zgrep.
13:51		collum joined #evergreen
13:54	* Dyrcona	considers adding more tests... but needs to refresh his memory on making cstore requests from srfsh.
13:54	Bmagic	berick: RE: "between 08:47:07 and 08:47:13 or thereabouts" https://pastebin.com/8agDcMnX
13:55	Bmagic	the event ID that I'm tracking: 123020652
13:59	Bmagic	do we start the transaction (the one that we're going to use to write the output to the database), and then* start generating the template? Leaving that transaction open for the duration?
14:04	Dyrcona	Bmagic: I think it might.
14:05	Bmagic	I think I'm going to give up. My curiousity ate my morning already
14:06	Bmagic	ate (hahaha, that's action_trigger.event)
14:07	Dyrcona	:)\
14:08	Dyrcona	You can run the hook again and see if the event processes this time.
14:08	Bmagic	It does succeed most of the time
14:08	Bmagic	I was just curious about this time
14:09	Dyrcona	Yeah. It happens more often than one would like.
14:19	Dyrcona	Well, I think that's enough tests. We can always add more later.
14:25	berick	oddly it tries to create the transaction after it's compiled the template. "trigger: writing 3273 bytes to template output" happens after template compilation.
14:26	berick	then cstore sends a bunch of stuff back after the begin. that part I don't get.
14:43	Dyrcona	Eighteen commits rebased down to 1, and I'm ready to share the code...
15:04	Dyrcona	I suppose we have functional tests for the staff client that can be run. I guess I should learn how those work.
15:11	* eeevil	reads up and sees Bmagic's adventure... ah yes, a Very Expensive Template Output (esp a grouped set of expensive events) could take longer than the session keepalive time (configured in opensrf.xml, per service) and it does look like that. the fast fix is to increase the keepalive time. I know of EG instances that have in the past had a timeout of 12s (slower hardware, and before some code changes lowered the timeout risk), but I don't see
15:11	* eeevil	any currently. if it's right at the edge of the time limit, that'd give breathing room to figure out where we need to insert more interaction and avoid that particular session timeout in A/T land.
15:19	Bmagic	eeevil++ berick++ Dyrcona++ mmorgan++ # thanks for the ideas. timeout 6 > 12 seems easy enough. What's the drawback? Tying up more Drones, and running up against max_children?
15:20	berick	exactly
15:20	eeevil	well, the session ends as soon as the client says "disconnect", so only sessions that /want/ to last longer than 6s will, and only up to 12s, so you might see a few cstores sit idle for up to 6s longer than today
15:21	eeevil	but, if you've only found this one instance, then you're pretty safe
15:22	eeevil	if you're already at the point of running out of drones, this won't make that better. but, you just need more drones
15:23	Bmagic	trigger drones seem to hang out at less than 10. cstore drones, can go up to like 100 sometimes, on the utility server. I've got that set to 250 max. I think that is still OK
15:24	Dyrcona	I experimented with jacking the timeout up in production in the past and didn't really have any issues with running out of drones.
15:24	Dyrcona	Let me check some git history to see if I still have the values that I tried.
15:24	eeevil	if trigger is you're only client-culprit for this, then the risk is tying up that many (10-ish) cstores. seems safe
15:29	Dyrcona	Hm... I remember having this issue 9 to 7 years ago at MVLC and jacking the timeout up to low double digits. I'd have to dig my MVLC code out to find it.
15:29	Bmagic	makes sense
15:29	Bmagic	I like 12. I'll see what that can do for me in the coming weeks
15:30		mantis left #evergreen
15:31	Dyrcona	I think I ended up at 15, but it was kind of arbitrary. I remember it was a "snow day" and we had things go haywire.
15:31	Bmagic	haha
15:45	Bmagic	is it possible that when this error occurs, that we can see cascading issues across other triggers that might be running at the same time?
15:46	Bmagic	Like, one patron with expensive template composition, could cause the trigger drone to error out, which, in turn could affect other drones? or other trigger drones that were using the same cstore drone? something like that?
15:47	eeevil	if they're separate, there shouldn't be any interaction
15:47	Bmagic	if the AT batch was processing 3000 AT's and, this issue happened in the first 15 AT's, the rest of the 2985 triggers could bomb too?
15:49	eeevil	I wouldn't expect that, but it depends on the details. each a/t environment should be getting its own cstore editor, so later events (or event groups) shouldn't be impacted. but it depends on other factors, like the parallelism setting, etc
15:51	Bmagic	I have messed with the trigger parallel setting. It's at 3 today
15:51	Bmagic	collect 3, react 3
15:54	Dyrcona	What I saw was some failed and most succeeded. It was often more than 1 event affected.
15:55	Dyrcona	It also wasn't time related. Some events after the failed events succeeded. I think it had to do with the cstore that the trigger drones were talking too.
16:03	Bmagic	eeevil++
17:05		mmorgan left #evergreen
17:12		JBoyer joined #evergreen
21:42		sleary joined #evergreen