Evergreen ILS Website

IRC log for #evergreen, 2025-10-22

| Channels | #evergreen index | Today | | Search | Google Search | Plain-Text | summary | Join Webchat

All times shown according to the server's local time.

Time Nick Message
00:38 collum joined #evergreen
03:08 collum joined #evergreen
06:18 collum joined #evergreen
08:02 collum joined #evergreen
08:13 collum joined #evergreen
08:14 collum joined #evergreen
08:32 mantis joined #evergreen
08:35 mmorgan joined #evergreen
08:50 redavis joined #evergreen
09:10 Dyrcona joined #evergreen
09:54 smayo joined #evergreen
09:58 smayo joined #evergreen
10:16 Bmagic I've changed shared_buffers back to 8GB and this query is still slow. Here's the latest: https://explain.depesz.com/s/9iN0
10:31 Christineb joined #evergreen
10:32 Dyrcona Bmagic: I'm looking at that one, too. It's a lot faster in production and looks like the problem is the join from copy.location to copy_location.id.
10:33 Dyrcona grr. s/production/development/
10:33 Dyrcona I meant to say it is faster on the development database.
10:36 Dyrcona Looks like they have the same indexes and fkey relationships. They're also the same Pg version.
10:38 Dyrcona vacuum analyze asset.copy doesn't help. Neither did vacuum analyze asset.copy_location.
10:38 pinesol News from commits: Docs: 3.14.11 release notes <http://git.evergreen-ils.org/?p=Ev​ergreen.git;a=commitdiff;h=9f5de76​9e03eba30a73507041a5b8eb1d22f57bd>
10:40 Dyrcona This is development: https://explain.depesz.com/s/iI6k . Notice there are a lot fewer rows to skip.
10:42 Dyrcona Interesting how different the scans are. I suspect there's a missing index or something.
11:02 Bmagic Dyrcona++
11:03 Bmagic The mystery is, it's fast when you swap out "null" for 1 in the coalesce
11:04 Bmagic which is the result anyway if the coalesce runs the second argument
11:05 Bmagic I think it means: PG is making a bad plan when the first argument is null, and a good plan when the first argument isn't null
11:07 Bmagic maybe, the function "(SELECT id FROM evergreen.org_top())" is getting ran against each row, instead of ran once
11:11 Dyrcona Well, why is that different in development and production because it's fast in development.
11:30 csharp_ Dyrcona: Bmagic: not sure if it helps, but you may want to play around with https://www.postgresql.org/docs/current/runtime-​config-query.html#GUC-DEFAULT-STATISTICS-TARGET values or setting stats on tables manually with (ALTER TABLE SET STATISTICS...)
11:31 csharp_ I've had meh results with changing that in the past - the query planner appears to be pretty stubborn
11:31 Bmagic I saw you had yours at 1000. I have 200 for that value
11:32 csharp_ yeah, we upped it to allow for better planning - again, meh
11:32 Bmagic right, I've reviewed that setting in the past
11:32 csharp_ eeevil almost certainly has better ideas about how to set things like that
11:32 Bmagic it doesn't explain the test machines not having an issue
11:32 csharp_ (pretty sure he was the one to point me to that setting in the past)
11:33 * eeevil reads up...
11:34 csharp_ if it's literally the same plan on test/development/production/whatever, then the stats stuff might be a red herring
11:36 csharp_ having run PG on multiple types of machines/VMs/environments, I can say that some of the differences I've seen can only be attributed to the supernatural :-)
11:36 Dyrcona It's not the same plan. It's the same query and different data, about 200 rows less in development.
11:36 Bmagic My theory is the plan is different between the two. The databases are identical in terms of general size and shape of the tables involved. call number 1717933 happens to be one with lots of connections
11:37 csharp_ select * from pg_ghost;
11:38 eeevil I'm confused about what is being tested here
11:39 Bmagic eeevil: click the "query" tab on those explains
11:39 eeevil I did
11:39 Dyrcona I was leaning in the direction of tweaking the query planner. It's using more indexes in development and doing more scans in production.
11:39 eeevil where's that query coming from
11:39 Bmagic That query is what's embedded in  SELECT * FROM asset.record_has_holdable_copy( '1717933')
11:40 Bmagic which is called from the staff client when searching, as each search result is loaded into the Angular interface
11:42 Dyrcona There are 7,786 copies on the bib record.
11:44 eeevil well, we can elide that in-query part by using a variable (untested): https://lumen.esilibrary.com/p/Qoz3xd59.html
11:44 eeevil or stick the whole list in an array and use "= ANY"
11:45 eeevil or use a join to actor.org_unit_descendants()
11:45 eeevil and even skip all that when the ou input is null -- why join when when it will always succeed?
11:46 Bmagic eeevil: the solution logic is that PG could* rerun evergreen.org_top() over millions of rows, but if it's in the declared statement, it's only going to run once. Which I think is an upgrade even if your database isn't suffering from this bad plan issue
11:48 Bmagic that was a question, forgot the question mark
11:50 eeevil yes, if that's where (much of) the cost is coming from
11:51 eeevil if coalesce sees its first input is a non-null literal, that coalesce node in the plan tree is replaced with the literal. so, yes, that's the effective benefit
11:52 Bmagic I think it's an easy win IMO. I'll make a LP for "asset.record_has_holdable_copy can be slow"
11:52 eeevil however, I don't think that's the case. it's only executing once: row 3 of the slow plan -> Function Scan on org_top (cost=0.25..0.26 rows=1 width=4) (actual time=0.621..0.623 rows=1 loops=1)
11:54 Dyrcona Removing the coalesce speeds it up.
11:54 Dyrcona AND acp.circ_lib IN (SELECT id FROM actor.org_unit_descendants(​(evergreen.org_top()).id));
11:55 Dyrcona Why are we doing the coalesce? Are there cases where org_top() could return null?
11:57 eeevil Dyrcona: we're NOT, in the actual function
11:57 eeevil we're passing $2, an optional (default NULL) org id
11:58 eeevil interestingly, we're not testing that the acn is not deleted. no biggie, IMO, but ... odd
11:58 Bmagic right, the function needs to accomodate an OU
11:58 Dyrcona Hm. I'll look at the actual function, then. I thought Bmagic was testing a verbatim copy/paste with values plugged in.
11:59 Bmagic Dyrcona: exactly, copy/pasted the function and plugged in null for the OU, and the ID number for the bib
11:59 Dyrcona Well, if the acn is deleted, it should have all deleted copies, so if the copy isn't deleted, then neither should the acn be deleted.
12:00 Dyrcona Well, eeevil said the function doesn't do the coalesce... So I'll still have to look at the function.
12:00 jihpringle joined #evergreen
12:02 eeevil Dyrcona: huh? sure it does https://lumen.esilibrary.com/p/YT5CMU10.html
12:03 eeevil (and yes, re acn.deleted, acp.deleted is a pretty safe proxy)
12:03 Bmagic bug 2129571
12:03 pinesol Launchpad bug 2129571 in Evergreen "asset.record_has_holdable_copy can be slow" [Undecided,New] https://launchpad.net/bugs/2129571
12:05 eeevil Bmagic: https://lumen.esilibrary.com/p/HBAe8E30.html if you want to test my first-order patch above ... works for me
12:06 Dyrcona eeevil: I guess I didn't understand something you said earlier or whatever, but I can say that removing the coalesce makes it faster on our production database. The development database doesn't care and they should be about the same, execpt for settings.
12:07 Bmagic eeevil++ # testing
12:07 Dyrcona With an ou id, the coalesce doesn't seem to make a difference.
12:08 eeevil Dyrcona: ah, I confused things, pardon. I should use all my words.
12:08 pinesol News from commits: Docs: 3.15.6 release notes <http://git.evergreen-ils.org/?p=Ev​ergreen.git;a=commitdiff;h=1184214​0c303686b69042ff9e32bd908b24ed042>
12:10 eeevil yeah, with an ou id the org_top function doesn't run (coalesce stops at the first non-null input -- and the planner knows that so it doesn't even emit a plan node for the coalesce when $2 is not null)
12:10 Dyrcona So, the problem is really a planner setting because our development Pg 16 is still zippy.
12:13 Bmagic eeevil: your suggested function solves it for me
12:14 Bmagic I'd also like to say that the change is straightfoward, I don't see it having negative side-effects. The function inputs and outputs should be unchanged
12:16 Bmagic eeevil: since we're opening this can, do we want to go further? You had some other ideas about skipping the function entirely for null values of OU
12:17 eeevil something about putting the function there (as a coalesced param to another function) is convincing the planner that scanning the tables /in toto/ will be faster than maybe calling the function. it may depend on the stats related to the first bib used, but that's unlikely. I think it's a planner reasoning issue in param-nested functions in a plan that has to work for both NULL and NOT NULL cases, so it probably CANNOT optimize the call away, ever. (
12:17 eeevil different versions of PG obviously have different optimizations for function cached plans, so 16 may be smart enough for the old version)
12:17 Bmagic abneiman++
12:18 Dyrcona eeevil: Both of the databases where I'm testing this are Pg 16, and they show radically different performance.
12:18 eeevil Bmagic: well, not the whole function. just the filter we're focusing on.  there's no difference between "and is in the FULL list of orgs" and just not asking if it's in the full list
12:19 eeevil Dyrcona: ah, well, I dunno then.
12:19 eeevil stats, jit, work_mem, parallel workers, etc
12:20 Bmagic this AND : "AND acp.circ_lib IN (SELECT id FROM actor.org_unit_descendants(v_ou))" could be omitted if ou is null, right?
12:20 Dyrcona Yes, that's what I said earlier, just without examples. Ironic that the faster one is basically set to defaults. Maybe we should stop tuning our databases? :)
12:21 Dyrcona Bmagic: No, you want the list of descendant org units.
12:21 eeevil right. 2 queries in the function, gated by, say "IF ou IS NULL OR ou = (evergreen.org_top()).id THEN" ... query without that AND "ELSE" query with the AND, but no coalesce
12:21 Bmagic I'm thinking the function could execute a different SELECT statement if the second argument is omitted.
12:21 Bmagic jinx
12:22 Bmagic I like that even better
12:23 eeevil it would be nearly as fast (to the point of no measurable difference, I'd wager) and more maintainable to just build an array of orgs and use "= ANY"
12:23 Dyrcona IIUC, that function is looking for any holdable copies at the org unit or at any of its descendants, so you want the select on actor.org_descendants() in either case.
12:23 Bmagic if you don't specify a circ_lib, it's all circ_libs
12:24 eeevil Dyrcona: only if there is an ou passed in. if ou is NULL, we use all orgs for the "is it here" test. that's the same as not asking the question -- it's always true
12:24 Dyrcona What would make sense, and maybe I missed a comment to that effect, would be to set an array variable at the top and then use the ANY in the query. We could probably use ANY even in the current form.
12:25 Dyrcona eeevil: Yes, I elided that bit thinking it was obvious.
12:25 Dyrcona Maybe I should use all the words, but verbosity is a drag.
12:26 Dyrcona I was responding to what read to me like Bmagic saying we didn't need the call to org_descendants.
12:26 eeevil heh...yes, I did say that. (re = ANY). we could also use a lateral join, which could be faster when an ou /is/ supplied. but these are micro-optimizations, let's just get it back to 1-3ms, IMO
12:27 Bmagic I don't think we need to call org_descendants when ou isn't specified
12:28 Dyrcona Bmagic: You would need to drop that whole AND line. My interpretation of what you meant was probably wrong.
12:28 Bmagic I'll make a paste for concreteness
12:29 eeevil yes, but the tradeoff is then we have to maintain 2 queries in the function. if we're going that route, we should first remove the descendents call from the query, with =ANY. then split the logic and see if that's even faster.
12:29 Dyrcona We wouldn't need to check circ_lib at all if it is null is what I think you meant.
12:29 eeevil Dyrcona: I believe we have all arrived on the same page from different edges
12:29 Dyrcona eeevil++
12:31 Bmagic https://pastebin.com/nLv3zzVf
12:31 Bmagic I just put that function on my test machine and it dropped the execution time from 2.2ms to 1.3ms
12:32 Bmagic I lifted the exact logic from eeevil's comment above
12:35 Dyrcona Bmagic: I don't like it. Running different code based on the second input variable being NULL is a maintenance bug waiting to happen.
12:36 Bmagic I can see that, offer the ANY version
12:36 Dyrcona I think eeevil's original version is fine, actually.
12:37 Bmagic I agree
12:38 Bmagic I have time to make the branch if you all don't but I'd like to give eeevil the commit credit somehow
12:39 collum joined #evergreen
12:39 Dyrcona Here's one implementation of the ANY version: https://paste.debian.net/1402171/
12:40 Dyrcona I've tried it on our development database to make sure it does the thing.
12:44 Dyrcona Bmagic: Co-authored-by: in the commit message. You want me to put my paste in a branch?
12:45 Bmagic well, we probably need to decide on a winner
12:46 Dyrcona $2 should probably be changed to ou to be consistent.
12:47 csharp_ @dunno add I WANT MY TWO DOLLARS!
12:47 pinesol csharp_: The operation succeeded.  Dunno #101 added.
12:47 Dyrcona Well, I think eeevil's original solution is fine. If we want to got the ANY route, then I prefer mine. :)
12:47 Bmagic FWIW, on my test machine: ANY = 8ms, eeevil's single = 2.1ms, split IF = 1.2ms
12:48 Dyrcona Yeahp. Is .9ms worth a potential maintenance issue?
12:48 Bmagic nope, I don't think so
12:49 Bmagic plus, it's slower when provided a second argument
12:49 eeevil Dyrcona: yes, your =ANY is what I was thinking of, if we go that way.
12:49 eeevil Bmagic: you can use the "Co-authored-by:" commit tag
12:50 eeevil oh, nm, Dyrconasaid that
12:50 Bmagic happy to do that
12:50 Bmagic which one are we going with?
12:50 Dyrcona On my test database, the ANY version is 4.777ms.
12:51 Bmagic and race it against eeevil's original
12:51 Dyrcona And the original is 3.763ms.... :(
12:51 Bmagic the original, original? or eeevil's original proposed fix?
12:51 eeevil which function? the one with the least amount of maintenance overhead is my preference. if single-query =ANY is as fast as my first one, go with =ANY.
12:52 eeevil (because =ANY /can/ be faster, even if it's not always)
12:52 Bmagic they are super close in speed, with ANY taking an additional 1ms (sometimes)
12:54 Bmagic I just tested a bunch of random integers as the second argument. The ANY performs the same as  eeevil's OG
12:55 Bmagic I'm leaning towards ANY
12:55 Dyrcona OK, go with ANY.
12:55 Bmagic was there a nit pick on the variable name from Dyrcona's paste?
12:56 Dyrcona I was thinking that $2 should be replaced with ou, mainly because we're not using $1 for rid.
12:56 eeevil +1 to named params, yes
12:56 Dyrcona I didn't make that change when I pasted my first attempt.
12:57 eeevil the bike shed better be blue, or I quit
12:57 Bmagic Evergreen main calls it out by $2, which I thought was strange yesterday
12:57 Bmagic branching....
12:57 eeevil (no, I don't care much about the variable names, as long as they're not crazy)
12:58 jeff_ sub patron_mumble_search...
12:58 Dyrcona It's the old way of addressing parameters in functions. We've upgraded past any version that does not support named parameters.
12:58 Bmagic my $annoy_eeevil_crazy_name;
12:59 jeff and I misremembered. it's only in the method name now (or only ever was).
12:59 Dyrcona On my test database, eeevil's implementation outperforms the any version without a second parameter almost every time.
13:02 csharp_ jeff++ # I had the same thought about crazy search
13:04 jihpringle joined #evergreen
13:06 Bmagic https://git.evergreen-ils.org/?p=working/Ever​green.git;a=shortlog;h=refs/heads/user/blake/​lp2129571_slow_asset_record_has_holdable_copy
13:07 Bmagic did I do that right?
13:08 Bmagic just saw CTLF endings, fixed it, force pushed
13:12 Dyrcona Bmagic: That looks right to me.
13:18 Dyrcona eeevil's original change is still faster for me in production, but almost 2ms than the ANY version.
13:20 Dyrcona But either is much faster than the 61,000ms that the original takes on the example record.
13:25 Bmagic hmm, we're going back to preferring non-ANY?
13:35 Dyrcona IDK, but the non-ANY version has been faster for me. I'm going to put the any version in production, though.
13:35 Dyrcona I added both versions to production with different function names to compare. The non-any is faster. Maybe we should add it to the bug and let people time the two?
13:37 Dyrcona Any version: 4.086 ms . Non-any version: 2.327 ms.
13:37 Dyrcona Those are both on record 1717933, The New York Times Book Review with something like 7,786 copies.
13:37 Dyrcona No second argument.
13:39 Dyrcona I wonder if the problem is the coalesce inside the function parameters?
13:40 Dyrcona Both of them get slower when I add a second argument, and the non-any version still takes almost twice as long.
13:41 Dyrcona Grrr.... The any version takes about twice as long as the non-any version: 58.664 ms vs. 25.795 ms.
13:48 Dyrcona So, I'm putting the any version in production. Testing on one record really isn't very scientific, and I agree that ANY on a pre-fetched array can be faster than IN against a query.
13:59 Dyrcona Bmagic++
13:59 Dyrcona eeevil++
14:08 Dyrcona I have another implementation to test that mixes the two approaches. It uses eeevil's original code to set a v_ou variable and then uses that variable to retrieve an array of ous. We'll see if that's faster.
14:15 Dyrcona Nope. eeevil's original suggestion is still faster.
15:00 Bmagic A hole in one we would say! eeevil++ miker++ Dyrcona++
15:44 mantis left #evergreen
15:51 Bmagic what's this query from PG logs? duration: 5300.009 ms  statement: FETCH 100 FROM "mrRTSJ2s3b5wuI61";
15:53 Bmagic On the front end, it's pcrud.search.aou, but in the DB it comes out like that
15:53 jeffdavis it's related to Postgres cursors
15:54 Bmagic it appears we have a problem with PG cursors as well. Which is fishy, because it's slow ~5 seconds retrieving pcrud.search.aou when parent_ou is null during the opensrf call
15:54 Bmagic It's interesting to me that the previous DB function that was slow when OU was null, is also slo when parent_ou is null in this context
16:01 Bmagic anyone have insight on that? I'm only seeing CURSORS setup for metabib.staged_browse, but I'm sure it's used for caching for hot tables like actor.org_unit. There has to be something awry on this DB for that cursor to take 5 seconds right?
16:12 jihpringle joined #evergreen
16:16 jeffdavis looks like cursors were added to pcrud in 3.14-beta
16:18 Dyrcona pcrud and cstore use generated aliases for tables when they build queries, IIRC.
16:19 Dyrcona I still think something's off on this particular db server because another Pg 16 database does not have these issues.
16:22 jeffdavis this issue is ringing a bell but I can't recall or dig up specifics, sorry :(
16:31 Bmagic I think we're close
17:12 mmorgan left #evergreen
19:05 jihpringle joined #evergreen
21:54 stephengwills joined #evergreen

| Channels | #evergreen index | Today | | Search | Google Search | Plain-Text | summary | Join Webchat