| Time | Nick | Message |
| 07:19 | | collum joined #evergreen |
| 08:40 | | mmorgan joined #evergreen |
| 08:53 | | smayo joined #evergreen |
| 10:36 | | sandbergja joined #evergreen |
| 11:07 | | sandbergja joined #evergreen |
| 11:20 | | redavis joined #evergreen |
| 11:58 | Bmagic | would anyone be able to look at this and identify why it took 9 seconds? https://explain.depesz.com/s/DcyQ |
| 11:59 | Bmagic | it seems the sort is the most expensive |
| 12:16 | | Christineb joined #evergreen |
| 12:31 | sandbergja | I have no answers, but I was curious about the 5,782,762 heap fetches on the index-only scan. Is that typical? |
| 12:50 | | jvwoolf joined #evergreen |
| 13:22 | jeffdavis | This looks relevant: https://www.postgresql.org/docs/current/indexes-index-only-scans.html |
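The page jeffdavis links ties heap fetches during an index-only scan to the visibility map. A quick diagnostic along those lines, not run in the conversation, would be to compare all-visible pages to total pages for the table the thread later focuses on:

```sql
-- Rough visibility-map coverage check; a low relallvisible relative to
-- relpages usually means many heap fetches until the table is vacuumed.
SELECT relpages, relallvisible
  FROM pg_class
 WHERE oid = 'asset.copy_vis_attr_cache'::regclass;
```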
| 13:27 | eeevil | Bmagic: the index only scan near the bottom is 6.2s. afaict, all the Sort nodes were fast; the one at 7 was ~.25ms on 949 rows. the nested loop at 9 (and everything inside that) holds all the cost |
| 13:29 | Bmagic | eeevil++ # of course, now, do I understand that perhaps that index needs re-created? |
| 13:31 | eeevil | jeffdavis: indeed, it does seem relevant ... I wonder if we may just want to add the vis_attr_vector column as a covered col on the record index. |
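A sketch of the covering index eeevil describes, with the index name and PG 11+ INCLUDE syntax assumed rather than taken from the Evergreen schema:

```sql
-- Hypothetical covering index: keeps the record lookup index-only while
-- carrying vis_attr_vector in the index leaf pages.
CREATE INDEX CONCURRENTLY copy_vis_attr_cache_record_cover_idx
    ON asset.copy_vis_attr_cache (record)
    INCLUDE (vis_attr_vector);
```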
| 13:33 | eeevil | Bmagic: no, the planner thinks that the best way to attack the problem at hand is to scan the table for attr matches and then probe the record index to see if the row is one we want. we want it to look at the record column first, then probe the attr array. without knowing the search that drove this query, I can't guess at why it decided that, though |
| 13:35 | Bmagic | I have the query: https://pastebin.com/szEH1f8G |
| 13:35 | eeevil | if it was just a search based on record attrs (like, "show me all english records") then there's not a lot we can do. but ... you could try adding a gin index over asset.copy_vis_attr_cache.vis_attr_vector if this is a common pattern with your users and your data |
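A sketch of that GIN index, assuming vis_attr_vector is an integer array queried through the intarray operators; the operator class and index name are assumptions to check against the actual column type:

```sql
-- gin__int_ops comes from the intarray extension; the default array
-- opclass also works if only && / @> / <@ are used against the column.
CREATE EXTENSION IF NOT EXISTS intarray;
CREATE INDEX CONCURRENTLY copy_vis_attr_cache_vav_gin_idx
    ON asset.copy_vis_attr_cache
 USING gin (vis_attr_vector gin__int_ops);
```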
| 13:35 |
Bmagic |
a generic keyword search on the staff-side |
| 13:37 |
Bmagic |
I noticed that the exact same search on the patron side was fast |
| 13:38 |
Bmagic |
but I didn't capture it's underlying query and analyze it (yet). I'm concerned that the disk full event last week did something to the database. |
| 13:39 |
eeevil |
that's extremely, exceedingly, unlikely, unless the hardware lied to the software about writes right around the out of space event |
| 13:40 |
Bmagic |
well, that's reassuring |
| 13:40 |
sandbergja |
just to check, has asset.copy_vis_attr_cache been vacuumed recently? |
| 13:40 |
Bmagic |
checking.. |
| 13:40 |
eeevil |
well, step one is probably vac-analyze the whole db |
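For the "has it been vacuumed recently" question, the statistics views answer it directly; a small sketch using the table name from the thread:

```sql
-- When was this table last vacuumed/analyzed, manually or by autovacuum?
SELECT relname, last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze
  FROM pg_stat_user_tables
 WHERE schemaname = 'asset'
   AND relname = 'copy_vis_attr_cache';
```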
| 13:40 | eeevil | sandbergja: great minds ;) |
| 13:40 | Bmagic | that's been done |
| 13:41 | Bmagic | but I wanted to report to you when* |
| 13:41 | Bmagic | it's not been more than 2 days |
| 13:42 | eeevil | that circ_lib list looks like it's probably ... all of them. is that true? |
| 13:43 | Bmagic | yes, all |
| 13:43 | eeevil | ah, yes ... site(CONS) |
| 13:47 | Bmagic | last vacuum 10/19 |
| 13:47 | Bmagic | maybe I'll run it now on just that table and re-analyze the search query? |
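A minimal version of that experiment, assuming the slow plan came from running EXPLAIN ANALYZE on the query in the pastebin:

```sql
-- Vacuum and analyze just the suspect table, then re-run the explain and
-- compare the heap fetch count and total runtime against the earlier plan.
VACUUM (ANALYZE, VERBOSE) asset.copy_vis_attr_cache;
-- EXPLAIN (ANALYZE, BUFFERS) <search query from the pastebin>;
```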
| 13:50 | Bmagic | well, I did that. Brought the query down to 3 seconds. I'd say that's a smoking gun right there |
| 13:55 | | berick joined #evergreen |
| 13:56 | Bmagic | now that the query is fast, I've found another slowdown at the opensrf level: Message processing duration 41.322778 |
| 13:57 | Bmagic | which came from this call: CALL: open-ils.cstore open-ils.cstore.json_query.atomic {"from":["asset.record_has_holdable_copy",1917632]} |
| 13:57 | Bmagic | the staff client starts to show results from "better homes and gardens", loads two results, then waits for 41 seconds before loading the third result. This is the call that it's stuck on |
| 14:01 | Bmagic | explain analyze SELECT * FROM asset.record_has_holdable_copy( '1917632' ) doesn't help much. Looking at the function now |
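EXPLAIN on the outer SELECT only shows the function call itself; one way to see the plans of the statements running inside the function is auto_explain with nested-statement logging. A session-level sketch, assuming the auto_explain module is installed and you have sufficient privileges:

```sql
-- Log plans for statements executed inside functions called in this session;
-- output goes to the PostgreSQL server log.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;
SET auto_explain.log_analyze = on;
SET auto_explain.log_nested_statements = on;
SELECT * FROM asset.record_has_holdable_copy('1917632');
```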
| 14:01 | sandbergja | Bmagic: for my understanding, after the vacuum, did the number of heap fetches go down? |
| 14:02 | Bmagic | I'll post the explain post-vacuum |
| 14:03 | eeevil | I thought we were eliding those "all orgs" filters :( ... here's a possible option to do that: https://lumen.esilibrary.com/p/ZI3fRH25.html |
| 14:03 | Bmagic | https://explain.depesz.com/s/QFwc |
| 14:07 | Bmagic | bummer, when I excerpt the query from the function and explain analyze that (plugging in the copy ID and the OU ID) it runs in 50ms |
| 14:08 | Bmagic | but it takes 40 seconds when it's executed as a function |
| 14:08 | sandbergja | 😮 |
| 14:10 | sandbergja | what if you try it without specifying an org unit id, and using evergreen.org_top() instead? |
| 14:11 | Bmagic | hmmmm, here's a clue: SELECT * FROM asset.record_has_holdable_copy( '1717933' ) takes 40 seconds, SELECT * FROM asset.record_has_holdable_copy( '1717933', 1 ) takes 100ms |
| 14:11 | Bmagic | it's the lack of the second argument |
| 14:12 | Bmagic | I can't seem to find a function with that name that only takes one argument |
| 14:12 | sandbergja | it's the same function, but the second param defaults to NULL |
| 14:12 | sandbergja | (which sends it down the SELECT id FROM actor.org_unit_descendants(SELECT id FROM evergreen.org_top()) path via the COALESCE) |
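One way to confirm there is only one function and that the second parameter simply has a default is a small catalog query (purely illustrative, not something run in the conversation):

```sql
-- Shows the full argument list, including any DEFAULT clauses.
SELECT p.oid::regprocedure            AS signature,
       pg_get_function_arguments(p.oid) AS arguments
  FROM pg_proc p
  JOIN pg_namespace n ON n.oid = p.pronamespace
 WHERE n.nspname = 'asset'
   AND p.proname = 'record_has_holdable_copy';
```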
| 14:13 | Bmagic | sandbergja: I see what you're getting at and I agree. I think this "SELECT id FROM evergreen.org_top()" is the issue |
| 14:14 | Bmagic | ha! and it's fast by itself. Slow when embedded in the coalesce... freaky |
| 14:15 | Bmagic | https://explain.depesz.com/s/7k8f <-- plugging "NULL" in for the first argument in the coalesce |
| 14:18 | Bmagic | getting closer |
| 14:20 | mmorgan | claiming 1492 |
| 14:22 | sandbergja | haha I guess it's a little late to make a columbus day joke |
| 14:26 | Bmagic | lol |
| 14:26 | csharp_ | never too late to discover america |
| 14:26 | Bmagic | 26 seconds to perform an index scan on call_number_pkey seems like the issue, but I just vacuum analyzed asset.call_number with no effect |
| 14:28 | csharp_ | Bmagic: when a function is the problem I often copy out the CREATE FUNCTION statement into a file and add a bunch of debugging (maybe rename the function under a different namespace so you aren't altering the original) |
| 14:28 | csharp_ | RAISE NOTICE 'blah blah'; |
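A sketch of that debugging pattern, copying the function under a scratch name and timing it with RAISE NOTICE; the schema, function name, and placeholder body here are hypothetical, not the real Evergreen function:

```sql
CREATE SCHEMA IF NOT EXISTS scratch;

CREATE OR REPLACE FUNCTION scratch.record_has_holdable_copy_debug(rid BIGINT)
RETURNS BOOL AS $$
DECLARE
    t0     TIMESTAMPTZ := clock_timestamp();
    result BOOL;
BEGIN
    RAISE NOTICE 'looking for holdable copies on record %', rid;
    -- paste the real query from the production function here
    SELECT TRUE INTO result;
    RAISE NOTICE 'finished in %', clock_timestamp() - t0;
    RETURN COALESCE(result, FALSE);
END;
$$ LANGUAGE plpgsql;
```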
| 14:29 | Bmagic | yep, I've used that trick. Though, this function is so simple, I didn't need to create another function. I used the query inside the function and explain analyzed it |
| 14:29 | csharp_ | gotcha |
| 14:29 | Bmagic | And I'm getting the same slowness |
| 14:29 | Bmagic | 11 seconds to scan asset.copy and 26 seconds to scan asset.call_number looks sus to me |
| 14:29 | csharp_ | seq scans? |
| 14:30 | Bmagic | (the last explain I posted above) |
| 14:31 | Bmagic | acp was a bitmap heap scan and acn was index scan |
| 14:31 | Bmagic | what's insane is this: |
| 14:31 | Bmagic | (SELECT id FROM actor.org_unit_descendants(COALESCE(1,(SELECT id FROM evergreen.org_top())))) 100ms |
| 14:31 | Bmagic | (SELECT id FROM actor.org_unit_descendants(COALESCE(null,(SELECT id FROM evergreen.org_top())))) 41 seconds |
| 14:32 | Bmagic | (those queries by themselves don't perform any differently, but inside the bigger query (from the function) it makes a huge difference) |
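One plausible explanation, an assumption on the editor's part rather than something verified in the thread: with a non-null constant as the first COALESCE argument, the planner folds the expression to that constant at plan time, so actor.org_unit_descendants() gets a plain literal; with NULL it has to keep the subselect, and the surrounding joins get planned differently. The folding itself is easy to observe:

```sql
-- The whole expression should collapse to the constant 1; no subplan appears.
EXPLAIN (VERBOSE) SELECT COALESCE(1, (SELECT id FROM evergreen.org_top()));
-- With NULL first, the plan keeps an InitPlan for the org_top() subselect.
EXPLAIN (VERBOSE) SELECT COALESCE(NULL::INT, (SELECT id FROM evergreen.org_top()));
```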
| 14:32 | csharp_ | Time: 1.946 ms and Time: 2.598 ms on our DB respectively |
| 14:33 | Bmagic | right, same here |
| 14:33 | Bmagic | here's the larger function https://pastebin.com/LuLKFPsJ |
| 14:33 | sandbergja | if you're in a situation where you could add an extension, I would be curious where plprofiler thinks that asset.record_has_holdable_copy is spending its time |
| 14:34 | csharp_ | SELECT * FROM asset.record_has_holdable_copy( '1717933' ) = 13.691 ms, SELECT * FROM asset.record_has_holdable_copy( '1717933', 1 ), 4.104 ms |
| 14:34 | Bmagic | yeah! you're seeing the same thing then |
| 14:35 | Bmagic | in my case it's a bit more dramatic but still, what gives. Those two calls are the same thing |
| 14:35 | csharp_ | the query you posted took me 4.363 ms |
| 14:35 | Bmagic | yeah, now swap out "1" for "null" and see how long it takes |
| 14:35 | pinesol | News from commits: LP2120713 Stamp upgrade script <http://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=f5fb30feb76f249a279b95cb240c43f212ff3461> |
| 14:35 | pinesol | News from commits: LP#2120713: Fix for merging bibs with deleted monograph parts <http://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=926afbbd2850fd9cd1b4068ba0b9ab5bc547aa0c> |
| 14:35 | csharp_ | 4.093 ms |
| 14:36 | Bmagic | huh? really? |
| 14:36 | csharp_ | yeah - hmmm |
| 14:36 | Bmagic | I would have expected more like 13 seconds like you reported above |
| 14:36 | csharp_ | these are all ms |
| 14:37 | Bmagic | oh right, nvm, I have the issue, not you |
| 14:37 | Bmagic | it's almost certainly an index or vacuum issue somewhere in here |
| 14:37 | csharp_ | Bmagic: how is the machine? RAM ok? maybe some config tuning needed? |
| 14:37 | csharp_ | assuming this is a Google cloud box |
| 14:38 | csharp_ | (we're virtualized too on ITS's VMWare) |
| 14:38 | Bmagic | PG16, 768GB memory, shared_buffers = 192GB, max_connections = 800, effective_cache_size = 500GB, work_mem = 512MB |
| 14:41 | csharp_ | shared_buffers | 128MB |
| 14:41 | csharp_ | effective_cache_size | 352GB |
| 14:41 | csharp_ | max_connections | 2500 |
| 14:41 | csharp_ | work_mem | 1536MB |
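For comparing setups across servers, a convenience query can pull the same handful of settings in one go; the list below simply mirrors the values pasted above:

```sql
-- Current values and units for the settings discussed in the thread.
SELECT name, setting, unit
  FROM pg_settings
 WHERE name IN ('shared_buffers', 'effective_cache_size',
                'work_mem', 'max_connections');
```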
| 14:42 | Bmagic | 2500 dang |
| 14:42 | csharp_ | we have like 500GB RAM, 28 processing cores |
| 14:42 | csharp_ | load avg hovering between 8 and 11, which is probably normal for us |
| 14:42 | csharp_ | PG 11 (!) |
| 14:42 | Bmagic | we have 768GB memory and 96 CPU |
| 14:43 | csharp_ | huh |
| 14:43 | Bmagic | maybe the shared_buffers is waaay too high? |
| 14:43 | csharp_ | seems like you'd be doing better than us, depending on your DB data size |
| 14:44 | csharp_ | I always have to look up the config settings - I never remember them between config changes |
| 14:44 | Bmagic | it was* 8GB, but changed it up to try and fix performance |
| 14:46 | csharp_ | did you change kernel default settings for SHMMAX, etc.? |
| 14:46 | csharp_ | https://www.postgresql.org/docs/current/kernel-resources.html for reference |
| 14:49 | csharp_ | looks like we didn't change those on this server, but we have in the past and it made a big difference in performance |
| 14:49 | csharp_ | (in the past on previous servers, always bare metal) |
| 14:50 | Bmagic | no, I've not touched those kernel settings... maybe that's part of the issue? |
| 14:50 | csharp_ | also possible ITS did some VM-level stuff to accommodate our RAM/load |
| 14:50 | Bmagic | mind sharing the settings you use for those? |
| 14:50 | csharp_ | they totally underestimated our impact |
| 14:51 | csharp_ | Bmagic: I'll dig for records :-) |
| 14:51 | Bmagic | it looks like, according to that article, the default linux values are generally ok |
| 14:51 | csharp_ | Bmagic: might be irrelevant - just remembering things we've touched before to address performance |
| 14:52 | Bmagic | we do alter one kernel setting: vm.nr_hugepages |
| 14:52 | csharp_ | I remember that we set one of the PG config settings too high one time and it all came to a crawl |
| 14:52 | csharp_ | maybe shared_buffers? |
| 14:52 | Bmagic | yeah, shared_buffers is way out of alignment compared to yours |
| 14:52 | mmorgan | sandbergja: Late to the party, but that ship has sailed. |
| 14:52 | * mmorgan | ducks |
| 14:53 | Bmagic | mmorgan++ |
| 14:53 | sandbergja | heh |
| 14:53 | csharp_ | mmorgan: aaaayyyy |
| 14:54 | Bmagic | cat /proc/sys/kernel/shmmax: 18446744073692774399 |
| 14:54 | Bmagic | that is a number |
| 14:54 | csharp_ | Bmagic: yeah, that's ours too - has to be the default for Debian/Ubuntu |
| 14:55 | Bmagic | same for shmall |
| 14:55 | csharp_ | same |
| 14:55 | csharp_ | ok, so probably not kernel |
| 14:55 | Bmagic | yeah, that number is so large, I can't imagine it's imposing any restrictions |
| 14:56 | csharp_ | maybe play with setting shared_buffers during off hours and see if anything changes? |
| 14:56 | Bmagic | I'm hopeful it's shared_buffers. I have a test machine with the same database on it, I'll increase the buffer to 192GB, and see if it's slow, then back to 8GB |
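A sketch of that A/B test on the test box, using the values from the conversation; note that shared_buffers only takes effect after a restart, as mentioned just below:

```sql
-- Bump shared_buffers on the test machine, restart, time the slow call,
-- then drop back to 8GB and compare.
ALTER SYSTEM SET shared_buffers = '192GB';
-- restart PostgreSQL, then confirm the new value took effect:
SHOW shared_buffers;
-- time the problem call at each setting:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM asset.record_has_holdable_copy('1717933');
-- afterwards: ALTER SYSTEM SET shared_buffers = '8GB'; and restart again
```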
| 14:58 | csharp_ | https://www.postgresql.org/docs/current/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY - note the advice about that setting |
| 14:58 | csharp_ | also that it requires a restart |
| 14:58 | csharp_ | sorry, probably over-advising |
| 14:59 | csharp_ | sometimes I'm chatty in the channel for the logs |
| 15:00 | csharp_ | and *sometimes*, in the middle of the night when I'm upgrading and something goes wrong, my *own* advice actually helps :-) |
| 15:00 | Bmagic | I'm happy to have you chatty |
| 15:00 | csharp_ | blah blah blah |
| 15:00 | Bmagic | I bet it's the shared_buffers |
| 15:01 | csharp_ | actually thinking about doing a session about PINES system admin |
| 15:01 | csharp_ | for the EG conf |
| 15:01 | csharp_ | possibly more than one - one more general about PINES administration and one with tech details |
| 15:02 | Bmagic | I'll attend! |
| 15:02 | csharp_ | gotta make that trip to Seattle appear worth it and all :-) |
| 15:02 | csharp_ | <narrator>It *was* worth it</narrator> |
| 15:04 | Bmagic | can't wait for the in person one! It's been 2 years |
| 15:04 | Bmagic | you've got me starting to think about presentations of my own, I should do one this time |
| 15:12 | csharp_ | Bmagic++ |
| 15:30 | | _collum joined #evergreen |
| 15:43 | | collum joined #evergreen |
| 17:06 | | mmorgan left #evergreen |
| 18:33 | | collum joined #evergreen |
| 21:26 | | sandbergja joined #evergreen |
| 21:36 | | collum joined #evergreen |
| 23:10 | | sandbergja joined #evergreen |