IRC log for #evergreen, 2025-10-22

All times shown according to the server's local time.

Time	Nick	Message
00:38		collum joined #evergreen
03:08		collum joined #evergreen
06:18		collum joined #evergreen
08:02		collum joined #evergreen
08:13		collum joined #evergreen
08:14		collum joined #evergreen
08:32		mantis joined #evergreen
08:35		mmorgan joined #evergreen
08:50		redavis joined #evergreen
09:10		Dyrcona joined #evergreen
09:54		smayo joined #evergreen
09:58		smayo joined #evergreen
10:16	Bmagic	I've changed shared_buffers back to 8GB and this query is still slow. Here's the latest: https://explain.depesz.com/s/9iN0
10:31		Christineb joined #evergreen
10:32	Dyrcona	Bmagic: I'm looking at that one, too. It's a lot faster in production and looks like the problem is the join from copy.location to copy_location.id.
10:33	Dyrcona	grr. s/production/development/
10:33	Dyrcona	I meant to say it is faster on the development database.
10:36	Dyrcona	Looks like they have the same indexes and fkey relationships. They're also the same Pg version.
10:38	Dyrcona	vacuum analyze asset.copy doesn't help. Neither did vacuum analyze asset.copy_location.
10:38	pinesol	News from commits: Docs: 3.14.11 release notes <http://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=9f5de769e03eba30a73507041a5b8eb1d22f57bd>
10:40	Dyrcona	This is development: https://explain.depesz.com/s/iI6k . Notice there are a lot fewer rows to skip.
10:42	Dyrcona	Interesting how different the scans are. I suspect there's a missing index or something.
11:02	Bmagic	Dyrcona++
11:03	Bmagic	The mystery is, it's fast when you swap out "null" for 1 in the coalesce
11:04	Bmagic	which is the result anyway if the coalesce runs the second argument
11:05	Bmagic	I think it means: PG is making a bad plan when the first argument is null, and a good plan when the first argument isn't null
11:07	Bmagic	maybe, the function "(SELECT id FROM evergreen.org_top())" is getting ran against each row, instead of ran once
11:11	Dyrcona	Well, why is that different in development and production because it's fast in development.
11:30	csharp_	Dyrcona: Bmagic: not sure if it helps, but you may want to play around with https://www.postgresql.org/docs/current/runtime-config-query.html#GUC-DEFAULT-STATISTICS-TARGET values or setting stats on tables manually with (ALTER TABLE SET STATISTICS...)
11:31	csharp_	I've had meh results with changing that in the past - the query planner appears to be pretty stubborn
11:31	Bmagic	I saw you had yours at 1000. I have 200 for that value
11:32	csharp_	yeah, we upped it to allow for better planning - again, meh
11:32	Bmagic	right, I've reviewed that setting in the past
11:32	csharp_	eeevil almost certainly has better ideas about how to set things like that
11:32	Bmagic	it doesn't explain the test machines not having an issue
11:32	csharp_	(pretty sure he was the one to point me to that setting in the past)
11:33	* eeevil	reads up...
11:34	csharp_	if it's literally the same plan on test/development/production/whatever, then the stats stuff might be a red herring
11:36	csharp_	having run PG on multiple types of machines/VMs/environments, I can say that some of the differences I've seen can only be attributed to the supernatural :-)
11:36	Dyrcona	It's not the same plan. It's the same query and different data, about 200 rows less in development.
11:36	Bmagic	My theory is the plan is different between the two. The databases are identical in terms of general size and shape of the tables involved. call number 1717933 happens to be one with lots of connections
11:37	csharp_	select * from pg_ghost;
11:38	eeevil	I'm confused about what is being tested here
11:39	Bmagic	eeevil: click the "query" tab on those explains
11:39	eeevil	I did
11:39	Dyrcona	I was leaning in the direction of tweaking the query planner. It's using more indexes in development and doing more scans in production.
11:39	eeevil	where's that query coming from
11:39	Bmagic	That query is what's embedded in SELECT * FROM asset.record_has_holdable_copy( '1717933')
11:40	Bmagic	which is called from the staff client when searching, as each search result is loaded into the Angular interface
11:42	Dyrcona	There are 7,786 copies on the bib record.
11:44	eeevil	well, we can elide that in-query part by using a variable (untested): https://lumen.esilibrary.com/p/Qoz3xd59.html
11:44	eeevil	or stick the whole list in an array and use "= ANY"
11:45	eeevil	or use a join to actor.org_unit_descendants()
11:45	eeevil	and even skip all that when the ou input is null -- why join when when it will always succeed?
11:46	Bmagic	eeevil: the solution logic is that PG could* rerun evergreen.org_top() over millions of rows, but if it's in the declared statement, it's only going to run once. Which I think is an upgrade even if your database isn't suffering from this bad plan issue
11:48	Bmagic	that was a question, forgot the question mark
11:50	eeevil	yes, if that's where (much of) the cost is coming from
11:51	eeevil	if coalesce sees its first input is a non-null literal, that coalesce node in the plan tree is replaced with the literal. so, yes, that's the effective benefit
11:52	Bmagic	I think it's an easy win IMO. I'll make a LP for "asset.record_has_holdable_copy can be slow"
11:52	eeevil	however, I don't think that's the case. it's only executing once: row 3 of the slow plan -> Function Scan on org_top (cost=0.25..0.26 rows=1 width=4) (actual time=0.621..0.623 rows=1 loops=1)
11:54	Dyrcona	Removing the coalesce speeds it up.
11:54	Dyrcona	AND acp.circ_lib IN (SELECT id FROM actor.org_unit_descendants((evergreen.org_top()).id));
11:55	Dyrcona	Why are we doing the coalesce? Are there cases where org_top() could return null?
11:57	eeevil	Dyrcona: we're NOT, in the actual function
11:57	eeevil	we're passing $2, an optional (default NULL) org id
11:58	eeevil	interestingly, we're not testing that the acn is not deleted. no biggie, IMO, but ... odd
11:58	Bmagic	right, the function needs to accomodate an OU
11:58	Dyrcona	Hm. I'll look at the actual function, then. I thought Bmagic was testing a verbatim copy/paste with values plugged in.
11:59	Bmagic	Dyrcona: exactly, copy/pasted the function and plugged in null for the OU, and the ID number for the bib
11:59	Dyrcona	Well, if the acn is deleted, it should have all deleted copies, so if the copy isn't deleted, then neither should the acn be deleted.
12:00	Dyrcona	Well, eeevil said the function doesn't do the coalesce... So I'll still have to look at the function.
12:00		jihpringle joined #evergreen
12:02	eeevil	Dyrcona: huh? sure it does https://lumen.esilibrary.com/p/YT5CMU10.html
12:03	eeevil	(and yes, re acn.deleted, acp.deleted is a pretty safe proxy)
12:03	Bmagic	bug 2129571
12:03	pinesol	Launchpad bug 2129571 in Evergreen "asset.record_has_holdable_copy can be slow" [Undecided,New] https://launchpad.net/bugs/2129571
12:05	eeevil	Bmagic: https://lumen.esilibrary.com/p/HBAe8E30.html if you want to test my first-order patch above ... works for me
12:06	Dyrcona	eeevil: I guess I didn't understand something you said earlier or whatever, but I can say that removing the coalesce makes it faster on our production database. The development database doesn't care and they should be about the same, execpt for settings.
12:07	Bmagic	eeevil++ # testing
12:07	Dyrcona	With an ou id, the coalesce doesn't seem to make a difference.
12:08	eeevil	Dyrcona: ah, I confused things, pardon. I should use all my words.
12:08	pinesol	News from commits: Docs: 3.15.6 release notes <http://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=11842140c303686b69042ff9e32bd908b24ed042>
12:10	eeevil	yeah, with an ou id the org_top function doesn't run (coalesce stops at the first non-null input -- and the planner knows that so it doesn't even emit a plan node for the coalesce when $2 is not null)
12:10	Dyrcona	So, the problem is really a planner setting because our development Pg 16 is still zippy.
12:13	Bmagic	eeevil: your suggested function solves it for me
12:14	Bmagic	I'd also like to say that the change is straightfoward, I don't see it having negative side-effects. The function inputs and outputs should be unchanged
12:16	Bmagic	eeevil: since we're opening this can, do we want to go further? You had some other ideas about skipping the function entirely for null values of OU
12:17	eeevil	something about putting the function there (as a coalesced param to another function) is convincing the planner that scanning the tables /in toto/ will be faster than maybe calling the function. it may depend on the stats related to the first bib used, but that's unlikely. I think it's a planner reasoning issue in param-nested functions in a plan that has to work for both NULL and NOT NULL cases, so it probably CANNOT optimize the call away, ever. (
12:17	eeevil	different versions of PG obviously have different optimizations for function cached plans, so 16 may be smart enough for the old version)
12:17	Bmagic	abneiman++
12:18	Dyrcona	eeevil: Both of the databases where I'm testing this are Pg 16, and they show radically different performance.
12:18	eeevil	Bmagic: well, not the whole function. just the filter we're focusing on. there's no difference between "and is in the FULL list of orgs" and just not asking if it's in the full list
12:19	eeevil	Dyrcona: ah, well, I dunno then.
12:19	eeevil	stats, jit, work_mem, parallel workers, etc
12:20	Bmagic	this AND : "AND acp.circ_lib IN (SELECT id FROM actor.org_unit_descendants(v_ou))" could be omitted if ou is null, right?
12:20	Dyrcona	Yes, that's what I said earlier, just without examples. Ironic that the faster one is basically set to defaults. Maybe we should stop tuning our databases? :)
12:21	Dyrcona	Bmagic: No, you want the list of descendant org units.
12:21	eeevil	right. 2 queries in the function, gated by, say "IF ou IS NULL OR ou = (evergreen.org_top()).id THEN" ... query without that AND "ELSE" query with the AND, but no coalesce
12:21	Bmagic	I'm thinking the function could execute a different SELECT statement if the second argument is omitted.
12:21	Bmagic	jinx
12:22	Bmagic	I like that even better
12:23	eeevil	it would be nearly as fast (to the point of no measurable difference, I'd wager) and more maintainable to just build an array of orgs and use "= ANY"
12:23	Dyrcona	IIUC, that function is looking for any holdable copies at the org unit or at any of its descendants, so you want the select on actor.org_descendants() in either case.
12:23	Bmagic	if you don't specify a circ_lib, it's all circ_libs
12:24	eeevil	Dyrcona: only if there is an ou passed in. if ou is NULL, we use all orgs for the "is it here" test. that's the same as not asking the question -- it's always true
12:24	Dyrcona	What would make sense, and maybe I missed a comment to that effect, would be to set an array variable at the top and then use the ANY in the query. We could probably use ANY even in the current form.
12:25	Dyrcona	eeevil: Yes, I elided that bit thinking it was obvious.
12:25	Dyrcona	Maybe I should use all the words, but verbosity is a drag.
12:26	Dyrcona	I was responding to what read to me like Bmagic saying we didn't need the call to org_descendants.
12:26	eeevil	heh...yes, I did say that. (re = ANY). we could also use a lateral join, which could be faster when an ou /is/ supplied. but these are micro-optimizations, let's just get it back to 1-3ms, IMO
12:27	Bmagic	I don't think we need to call org_descendants when ou isn't specified
12:28	Dyrcona	Bmagic: You would need to drop that whole AND line. My interpretation of what you meant was probably wrong.
12:28	Bmagic	I'll make a paste for concreteness
12:29	eeevil	yes, but the tradeoff is then we have to maintain 2 queries in the function. if we're going that route, we should first remove the descendents call from the query, with =ANY. then split the logic and see if that's even faster.
12:29	Dyrcona	We wouldn't need to check circ_lib at all if it is null is what I think you meant.
12:29	eeevil	Dyrcona: I believe we have all arrived on the same page from different edges
12:29	Dyrcona	eeevil++
12:31	Bmagic	https://pastebin.com/nLv3zzVf
12:31	Bmagic	I just put that function on my test machine and it dropped the execution time from 2.2ms to 1.3ms
12:32	Bmagic	I lifted the exact logic from eeevil's comment above
12:35	Dyrcona	Bmagic: I don't like it. Running different code based on the second input variable being NULL is a maintenance bug waiting to happen.
12:36	Bmagic	I can see that, offer the ANY version
12:36	Dyrcona	I think eeevil's original version is fine, actually.
12:37	Bmagic	I agree
12:38	Bmagic	I have time to make the branch if you all don't but I'd like to give eeevil the commit credit somehow
12:39		collum joined #evergreen
12:39	Dyrcona	Here's one implementation of the ANY version: https://paste.debian.net/1402171/
12:40	Dyrcona	I've tried it on our development database to make sure it does the thing.
12:44	Dyrcona	Bmagic: Co-authored-by: in the commit message. You want me to put my paste in a branch?
12:45	Bmagic	well, we probably need to decide on a winner
12:46	Dyrcona	$2 should probably be changed to ou to be consistent.
12:47	csharp_	@dunno add I WANT MY TWO DOLLARS!
12:47	pinesol	csharp_: The operation succeeded. Dunno #101 added.
12:47	Dyrcona	Well, I think eeevil's original solution is fine. If we want to got the ANY route, then I prefer mine. :)
12:47	Bmagic	FWIW, on my test machine: ANY = 8ms, eeevil's single = 2.1ms, split IF = 1.2ms
12:48	Dyrcona	Yeahp. Is .9ms worth a potential maintenance issue?
12:48	Bmagic	nope, I don't think so
12:49	Bmagic	plus, it's slower when provided a second argument
12:49	eeevil	Dyrcona: yes, your =ANY is what I was thinking of, if we go that way.
12:49	eeevil	Bmagic: you can use the "Co-authored-by:" commit tag
12:50	eeevil	oh, nm, Dyrconasaid that
12:50	Bmagic	happy to do that
12:50	Bmagic	which one are we going with?
12:50	Dyrcona	On my test database, the ANY version is 4.777ms.
12:51	Bmagic	and race it against eeevil's original
12:51	Dyrcona	And the original is 3.763ms.... :(
12:51	Bmagic	the original, original? or eeevil's original proposed fix?
12:51	eeevil	which function? the one with the least amount of maintenance overhead is my preference. if single-query =ANY is as fast as my first one, go with =ANY.
12:52	eeevil	(because =ANY /can/ be faster, even if it's not always)
12:52	Bmagic	they are super close in speed, with ANY taking an additional 1ms (sometimes)
12:54	Bmagic	I just tested a bunch of random integers as the second argument. The ANY performs the same as eeevil's OG
12:55	Bmagic	I'm leaning towards ANY
12:55	Dyrcona	OK, go with ANY.
12:55	Bmagic	was there a nit pick on the variable name from Dyrcona's paste?
12:56	Dyrcona	I was thinking that $2 should be replaced with ou, mainly because we're not using $1 for rid.
12:56	eeevil	+1 to named params, yes
12:56	Dyrcona	I didn't make that change when I pasted my first attempt.
12:57	eeevil	the bike shed better be blue, or I quit
12:57	Bmagic	Evergreen main calls it out by $2, which I thought was strange yesterday
12:57	Bmagic	branching....
12:57	eeevil	(no, I don't care much about the variable names, as long as they're not crazy)
12:58	jeff_	sub patron_mumble_search...
12:58	Dyrcona	It's the old way of addressing parameters in functions. We've upgraded past any version that does not support named parameters.
12:58	Bmagic	my $annoy_eeevil_crazy_name;
12:59	jeff	and I misremembered. it's only in the method name now (or only ever was).
12:59	Dyrcona	On my test database, eeevil's implementation outperforms the any version without a second parameter almost every time.
13:02	csharp_	jeff++ # I had the same thought about crazy search
13:04		jihpringle joined #evergreen
13:06	Bmagic	https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/blake/lp2129571_slow_asset_record_has_holdable_copy
13:07	Bmagic	did I do that right?
13:08	Bmagic	just saw CTLF endings, fixed it, force pushed
13:12	Dyrcona	Bmagic: That looks right to me.
13:18	Dyrcona	eeevil's original change is still faster for me in production, but almost 2ms than the ANY version.
13:20	Dyrcona	But either is much faster than the 61,000ms that the original takes on the example record.
13:25	Bmagic	hmm, we're going back to preferring non-ANY?
13:35	Dyrcona	IDK, but the non-ANY version has been faster for me. I'm going to put the any version in production, though.
13:35	Dyrcona	I added both versions to production with different function names to compare. The non-any is faster. Maybe we should add it to the bug and let people time the two?
13:37	Dyrcona	Any version: 4.086 ms . Non-any version: 2.327 ms.
13:37	Dyrcona	Those are both on record 1717933, The New York Times Book Review with something like 7,786 copies.
13:37	Dyrcona	No second argument.
13:39	Dyrcona	I wonder if the problem is the coalesce inside the function parameters?
13:40	Dyrcona	Both of them get slower when I add a second argument, and the non-any version still takes almost twice as long.
13:41	Dyrcona	Grrr.... The any version takes about twice as long as the non-any version: 58.664 ms vs. 25.795 ms.
13:48	Dyrcona	So, I'm putting the any version in production. Testing on one record really isn't very scientific, and I agree that ANY on a pre-fetched array can be faster than IN against a query.
13:59	Dyrcona	Bmagic++
13:59	Dyrcona	eeevil++
14:08	Dyrcona	I have another implementation to test that mixes the two approaches. It uses eeevil's original code to set a v_ou variable and then uses that variable to retrieve an array of ous. We'll see if that's faster.
14:15	Dyrcona	Nope. eeevil's original suggestion is still faster.
15:00	Bmagic	A hole in one we would say! eeevil++ miker++ Dyrcona++
15:44		mantis left #evergreen
15:51	Bmagic	what's this query from PG logs? duration: 5300.009 ms statement: FETCH 100 FROM "mrRTSJ2s3b5wuI61";
15:53	Bmagic	On the front end, it's pcrud.search.aou, but in the DB it comes out like that
15:53	jeffdavis	it's related to Postgres cursors
15:54	Bmagic	it appears we have a problem with PG cursors as well. Which is fishy, because it's slow ~5 seconds retrieving pcrud.search.aou when parent_ou is null during the opensrf call
15:54	Bmagic	It's interesting to me that the previous DB function that was slow when OU was null, is also slo when parent_ou is null in this context
16:01	Bmagic	anyone have insight on that? I'm only seeing CURSORS setup for metabib.staged_browse, but I'm sure it's used for caching for hot tables like actor.org_unit. There has to be something awry on this DB for that cursor to take 5 seconds right?
16:12		jihpringle joined #evergreen
16:16	jeffdavis	looks like cursors were added to pcrud in 3.14-beta
16:18	Dyrcona	pcrud and cstore use generated aliases for tables when they build queries, IIRC.
16:19	Dyrcona	I still think something's off on this particular db server because another Pg 16 database does not have these issues.
16:22	jeffdavis	this issue is ringing a bell but I can't recall or dig up specifics, sorry :(
16:31	Bmagic	I think we're close
17:12		mmorgan left #evergreen
19:05		jihpringle joined #evergreen
21:54		stephengwills joined #evergreen