| Time |
Nick |
Message |
| 00:38 |
|
collum joined #evergreen |
| 03:08 |
|
collum joined #evergreen |
| 06:18 |
|
collum joined #evergreen |
| 08:02 |
|
collum joined #evergreen |
| 08:13 |
|
collum joined #evergreen |
| 08:14 |
|
collum joined #evergreen |
| 08:32 |
|
mantis joined #evergreen |
| 08:35 |
|
mmorgan joined #evergreen |
| 08:50 |
|
redavis joined #evergreen |
| 09:10 |
|
Dyrcona joined #evergreen |
| 09:54 |
|
smayo joined #evergreen |
| 09:58 |
|
smayo joined #evergreen |
| 10:16 |
Bmagic |
I've changed shared_buffers back to 8GB and this query is still slow. Here's the latest: https://explain.depesz.com/s/9iN0 |
| 10:31 |
|
Christineb joined #evergreen |
| 10:32 |
Dyrcona |
Bmagic: I'm looking at that one, too. It's a lot faster in production and looks like the problem is the join from copy.location to copy_location.id. |
| 10:33 |
Dyrcona |
grr. s/production/development/ |
| 10:33 |
Dyrcona |
I meant to say it is faster on the development database. |
| 10:36 |
Dyrcona |
Looks like they have the same indexes and fkey relationships. They're also the same Pg version. |
| 10:38 |
Dyrcona |
vacuum analyze asset.copy doesn't help. Neither did vacuum analyze asset.copy_location. |
| 10:38 |
pinesol |
News from commits: Docs: 3.14.11 release notes <http://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=9f5de769e03eba30a73507041a5b8eb1d22f57bd> |
| 10:40 |
Dyrcona |
This is development: https://explain.depesz.com/s/iI6k . Notice there are a lot fewer rows to skip. |
| 10:42 |
Dyrcona |
Interesting how different the scans are. I suspect there's a missing index or something. |
| 11:02 |
Bmagic |
Dyrcona++ |
| 11:03 |
Bmagic |
The mystery is, it's fast when you swap out "null" for 1 in the coalesce |
| 11:04 |
Bmagic |
which is the result anyway if the coalesce runs the second argument |
| 11:05 |
Bmagic |
I think it means: PG is making a bad plan when the first argument is null, and a good plan when the first argument isn't null |
| 11:07 |
Bmagic |
maybe, the function "(SELECT id FROM evergreen.org_top())" is getting ran against each row, instead of ran once |
| 11:11 |
Dyrcona |
Well, why is that different in development and production because it's fast in development. |
| 11:30 |
csharp_ |
Dyrcona: Bmagic: not sure if it helps, but you may want to play around with https://www.postgresql.org/docs/current/runtime-config-query.html#GUC-DEFAULT-STATISTICS-TARGET values or setting stats on tables manually with (ALTER TABLE SET STATISTICS...) |
| 11:31 |
csharp_ |
I've had meh results with changing that in the past - the query planner appears to be pretty stubborn |
| 11:31 |
Bmagic |
I saw you had yours at 1000. I have 200 for that value |
| 11:32 |
csharp_ |
yeah, we upped it to allow for better planning - again, meh |
| 11:32 |
Bmagic |
right, I've reviewed that setting in the past |
| 11:32 |
csharp_ |
eeevil almost certainly has better ideas about how to set things like that |
| 11:32 |
Bmagic |
it doesn't explain the test machines not having an issue |
| 11:32 |
csharp_ |
(pretty sure he was the one to point me to that setting in the past) |
| 11:33 |
* eeevil |
reads up... |
| 11:34 |
csharp_ |
if it's literally the same plan on test/development/production/whatever, then the stats stuff might be a red herring |
| 11:36 |
csharp_ |
having run PG on multiple types of machines/VMs/environments, I can say that some of the differences I've seen can only be attributed to the supernatural :-) |
| 11:36 |
Dyrcona |
It's not the same plan. It's the same query and different data, about 200 rows less in development. |
| 11:36 |
Bmagic |
My theory is the plan is different between the two. The databases are identical in terms of general size and shape of the tables involved. call number 1717933 happens to be one with lots of connections |
| 11:37 |
csharp_ |
select * from pg_ghost; |
| 11:38 |
eeevil |
I'm confused about what is being tested here |
| 11:39 |
Bmagic |
eeevil: click the "query" tab on those explains |
| 11:39 |
eeevil |
I did |
| 11:39 |
Dyrcona |
I was leaning in the direction of tweaking the query planner. It's using more indexes in development and doing more scans in production. |
| 11:39 |
eeevil |
where's that query coming from |
| 11:39 |
Bmagic |
That query is what's embedded in SELECT * FROM asset.record_has_holdable_copy( '1717933') |
| 11:40 |
Bmagic |
which is called from the staff client when searching, as each search result is loaded into the Angular interface |
| 11:42 |
Dyrcona |
There are 7,786 copies on the bib record. |
| 11:44 |
eeevil |
well, we can elide that in-query part by using a variable (untested): https://lumen.esilibrary.com/p/Qoz3xd59.html |
| 11:44 |
eeevil |
or stick the whole list in an array and use "= ANY" |
| 11:45 |
eeevil |
or use a join to actor.org_unit_descendants() |
| 11:45 |
eeevil |
and even skip all that when the ou input is null -- why join when when it will always succeed? |
| 11:46 |
Bmagic |
eeevil: the solution logic is that PG could* rerun evergreen.org_top() over millions of rows, but if it's in the declared statement, it's only going to run once. Which I think is an upgrade even if your database isn't suffering from this bad plan issue |
| 11:48 |
Bmagic |
that was a question, forgot the question mark |
| 11:50 |
eeevil |
yes, if that's where (much of) the cost is coming from |
| 11:51 |
eeevil |
if coalesce sees its first input is a non-null literal, that coalesce node in the plan tree is replaced with the literal. so, yes, that's the effective benefit |
| 11:52 |
Bmagic |
I think it's an easy win IMO. I'll make a LP for "asset.record_has_holdable_copy can be slow" |
| 11:52 |
eeevil |
however, I don't think that's the case. it's only executing once: row 3 of the slow plan -> Function Scan on org_top (cost=0.25..0.26 rows=1 width=4) (actual time=0.621..0.623 rows=1 loops=1) |
| 11:54 |
Dyrcona |
Removing the coalesce speeds it up. |
| 11:54 |
Dyrcona |
AND acp.circ_lib IN (SELECT id FROM actor.org_unit_descendants((evergreen.org_top()).id)); |
| 11:55 |
Dyrcona |
Why are we doing the coalesce? Are there cases where org_top() could return null? |
| 11:57 |
eeevil |
Dyrcona: we're NOT, in the actual function |
| 11:57 |
eeevil |
we're passing $2, an optional (default NULL) org id |
| 11:58 |
eeevil |
interestingly, we're not testing that the acn is not deleted. no biggie, IMO, but ... odd |
| 11:58 |
Bmagic |
right, the function needs to accomodate an OU |
| 11:58 |
Dyrcona |
Hm. I'll look at the actual function, then. I thought Bmagic was testing a verbatim copy/paste with values plugged in. |
| 11:59 |
Bmagic |
Dyrcona: exactly, copy/pasted the function and plugged in null for the OU, and the ID number for the bib |
| 11:59 |
Dyrcona |
Well, if the acn is deleted, it should have all deleted copies, so if the copy isn't deleted, then neither should the acn be deleted. |
| 12:00 |
Dyrcona |
Well, eeevil said the function doesn't do the coalesce... So I'll still have to look at the function. |
| 12:00 |
|
jihpringle joined #evergreen |
| 12:02 |
eeevil |
Dyrcona: huh? sure it does https://lumen.esilibrary.com/p/YT5CMU10.html |
| 12:03 |
eeevil |
(and yes, re acn.deleted, acp.deleted is a pretty safe proxy) |
| 12:03 |
Bmagic |
bug 2129571 |
| 12:03 |
pinesol |
Launchpad bug 2129571 in Evergreen "asset.record_has_holdable_copy can be slow" [Undecided,New] https://launchpad.net/bugs/2129571 |
| 12:05 |
eeevil |
Bmagic: https://lumen.esilibrary.com/p/HBAe8E30.html if you want to test my first-order patch above ... works for me |
| 12:06 |
Dyrcona |
eeevil: I guess I didn't understand something you said earlier or whatever, but I can say that removing the coalesce makes it faster on our production database. The development database doesn't care and they should be about the same, execpt for settings. |
| 12:07 |
Bmagic |
eeevil++ # testing |
| 12:07 |
Dyrcona |
With an ou id, the coalesce doesn't seem to make a difference. |
| 12:08 |
eeevil |
Dyrcona: ah, I confused things, pardon. I should use all my words. |
| 12:08 |
pinesol |
News from commits: Docs: 3.15.6 release notes <http://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=11842140c303686b69042ff9e32bd908b24ed042> |
| 12:10 |
eeevil |
yeah, with an ou id the org_top function doesn't run (coalesce stops at the first non-null input -- and the planner knows that so it doesn't even emit a plan node for the coalesce when $2 is not null) |
| 12:10 |
Dyrcona |
So, the problem is really a planner setting because our development Pg 16 is still zippy. |
| 12:13 |
Bmagic |
eeevil: your suggested function solves it for me |
| 12:14 |
Bmagic |
I'd also like to say that the change is straightfoward, I don't see it having negative side-effects. The function inputs and outputs should be unchanged |
| 12:16 |
Bmagic |
eeevil: since we're opening this can, do we want to go further? You had some other ideas about skipping the function entirely for null values of OU |
| 12:17 |
eeevil |
something about putting the function there (as a coalesced param to another function) is convincing the planner that scanning the tables /in toto/ will be faster than maybe calling the function. it may depend on the stats related to the first bib used, but that's unlikely. I think it's a planner reasoning issue in param-nested functions in a plan that has to work for both NULL and NOT NULL cases, so it probably CANNOT optimize the call away, ever. ( |
| 12:17 |
eeevil |
different versions of PG obviously have different optimizations for function cached plans, so 16 may be smart enough for the old version) |
| 12:17 |
Bmagic |
abneiman++ |
| 12:18 |
Dyrcona |
eeevil: Both of the databases where I'm testing this are Pg 16, and they show radically different performance. |
| 12:18 |
eeevil |
Bmagic: well, not the whole function. just the filter we're focusing on. there's no difference between "and is in the FULL list of orgs" and just not asking if it's in the full list |
| 12:19 |
eeevil |
Dyrcona: ah, well, I dunno then. |
| 12:19 |
eeevil |
stats, jit, work_mem, parallel workers, etc |
| 12:20 |
Bmagic |
this AND : "AND acp.circ_lib IN (SELECT id FROM actor.org_unit_descendants(v_ou))" could be omitted if ou is null, right? |
| 12:20 |
Dyrcona |
Yes, that's what I said earlier, just without examples. Ironic that the faster one is basically set to defaults. Maybe we should stop tuning our databases? :) |
| 12:21 |
Dyrcona |
Bmagic: No, you want the list of descendant org units. |
| 12:21 |
eeevil |
right. 2 queries in the function, gated by, say "IF ou IS NULL OR ou = (evergreen.org_top()).id THEN" ... query without that AND "ELSE" query with the AND, but no coalesce |
| 12:21 |
Bmagic |
I'm thinking the function could execute a different SELECT statement if the second argument is omitted. |
| 12:21 |
Bmagic |
jinx |
| 12:22 |
Bmagic |
I like that even better |
| 12:23 |
eeevil |
it would be nearly as fast (to the point of no measurable difference, I'd wager) and more maintainable to just build an array of orgs and use "= ANY" |
| 12:23 |
Dyrcona |
IIUC, that function is looking for any holdable copies at the org unit or at any of its descendants, so you want the select on actor.org_descendants() in either case. |
| 12:23 |
Bmagic |
if you don't specify a circ_lib, it's all circ_libs |
| 12:24 |
eeevil |
Dyrcona: only if there is an ou passed in. if ou is NULL, we use all orgs for the "is it here" test. that's the same as not asking the question -- it's always true |
| 12:24 |
Dyrcona |
What would make sense, and maybe I missed a comment to that effect, would be to set an array variable at the top and then use the ANY in the query. We could probably use ANY even in the current form. |
| 12:25 |
Dyrcona |
eeevil: Yes, I elided that bit thinking it was obvious. |
| 12:25 |
Dyrcona |
Maybe I should use all the words, but verbosity is a drag. |
| 12:26 |
Dyrcona |
I was responding to what read to me like Bmagic saying we didn't need the call to org_descendants. |
| 12:26 |
eeevil |
heh...yes, I did say that. (re = ANY). we could also use a lateral join, which could be faster when an ou /is/ supplied. but these are micro-optimizations, let's just get it back to 1-3ms, IMO |
| 12:27 |
Bmagic |
I don't think we need to call org_descendants when ou isn't specified |
| 12:28 |
Dyrcona |
Bmagic: You would need to drop that whole AND line. My interpretation of what you meant was probably wrong. |
| 12:28 |
Bmagic |
I'll make a paste for concreteness |
| 12:29 |
eeevil |
yes, but the tradeoff is then we have to maintain 2 queries in the function. if we're going that route, we should first remove the descendents call from the query, with =ANY. then split the logic and see if that's even faster. |
| 12:29 |
Dyrcona |
We wouldn't need to check circ_lib at all if it is null is what I think you meant. |
| 12:29 |
eeevil |
Dyrcona: I believe we have all arrived on the same page from different edges |
| 12:29 |
Dyrcona |
eeevil++ |
| 12:31 |
Bmagic |
https://pastebin.com/nLv3zzVf |
| 12:31 |
Bmagic |
I just put that function on my test machine and it dropped the execution time from 2.2ms to 1.3ms |
| 12:32 |
Bmagic |
I lifted the exact logic from eeevil's comment above |
| 12:35 |
Dyrcona |
Bmagic: I don't like it. Running different code based on the second input variable being NULL is a maintenance bug waiting to happen. |
| 12:36 |
Bmagic |
I can see that, offer the ANY version |
| 12:36 |
Dyrcona |
I think eeevil's original version is fine, actually. |
| 12:37 |
Bmagic |
I agree |
| 12:38 |
Bmagic |
I have time to make the branch if you all don't but I'd like to give eeevil the commit credit somehow |
| 12:39 |
|
collum joined #evergreen |
| 12:39 |
Dyrcona |
Here's one implementation of the ANY version: https://paste.debian.net/1402171/ |
| 12:40 |
Dyrcona |
I've tried it on our development database to make sure it does the thing. |
| 12:44 |
Dyrcona |
Bmagic: Co-authored-by: in the commit message. You want me to put my paste in a branch? |
| 12:45 |
Bmagic |
well, we probably need to decide on a winner |
| 12:46 |
Dyrcona |
$2 should probably be changed to ou to be consistent. |
| 12:47 |
csharp_ |
@dunno add I WANT MY TWO DOLLARS! |
| 12:47 |
pinesol |
csharp_: The operation succeeded. Dunno #101 added. |
| 12:47 |
Dyrcona |
Well, I think eeevil's original solution is fine. If we want to got the ANY route, then I prefer mine. :) |
| 12:47 |
Bmagic |
FWIW, on my test machine: ANY = 8ms, eeevil's single = 2.1ms, split IF = 1.2ms |
| 12:48 |
Dyrcona |
Yeahp. Is .9ms worth a potential maintenance issue? |
| 12:48 |
Bmagic |
nope, I don't think so |
| 12:49 |
Bmagic |
plus, it's slower when provided a second argument |
| 12:49 |
eeevil |
Dyrcona: yes, your =ANY is what I was thinking of, if we go that way. |
| 12:49 |
eeevil |
Bmagic: you can use the "Co-authored-by:" commit tag |
| 12:50 |
eeevil |
oh, nm, Dyrconasaid that |
| 12:50 |
Bmagic |
happy to do that |
| 12:50 |
Bmagic |
which one are we going with? |
| 12:50 |
Dyrcona |
On my test database, the ANY version is 4.777ms. |
| 12:51 |
Bmagic |
and race it against eeevil's original |
| 12:51 |
Dyrcona |
And the original is 3.763ms.... :( |
| 12:51 |
Bmagic |
the original, original? or eeevil's original proposed fix? |
| 12:51 |
eeevil |
which function? the one with the least amount of maintenance overhead is my preference. if single-query =ANY is as fast as my first one, go with =ANY. |
| 12:52 |
eeevil |
(because =ANY /can/ be faster, even if it's not always) |
| 12:52 |
Bmagic |
they are super close in speed, with ANY taking an additional 1ms (sometimes) |
| 12:54 |
Bmagic |
I just tested a bunch of random integers as the second argument. The ANY performs the same as eeevil's OG |
| 12:55 |
Bmagic |
I'm leaning towards ANY |
| 12:55 |
Dyrcona |
OK, go with ANY. |
| 12:55 |
Bmagic |
was there a nit pick on the variable name from Dyrcona's paste? |
| 12:56 |
Dyrcona |
I was thinking that $2 should be replaced with ou, mainly because we're not using $1 for rid. |
| 12:56 |
eeevil |
+1 to named params, yes |
| 12:56 |
Dyrcona |
I didn't make that change when I pasted my first attempt. |
| 12:57 |
eeevil |
the bike shed better be blue, or I quit |
| 12:57 |
Bmagic |
Evergreen main calls it out by $2, which I thought was strange yesterday |
| 12:57 |
Bmagic |
branching.... |
| 12:57 |
eeevil |
(no, I don't care much about the variable names, as long as they're not crazy) |
| 12:58 |
jeff_ |
sub patron_mumble_search... |
| 12:58 |
Dyrcona |
It's the old way of addressing parameters in functions. We've upgraded past any version that does not support named parameters. |
| 12:58 |
Bmagic |
my $annoy_eeevil_crazy_name; |
| 12:59 |
jeff |
and I misremembered. it's only in the method name now (or only ever was). |
| 12:59 |
Dyrcona |
On my test database, eeevil's implementation outperforms the any version without a second parameter almost every time. |
| 13:02 |
csharp_ |
jeff++ # I had the same thought about crazy search |
| 13:04 |
|
jihpringle joined #evergreen |
| 13:06 |
Bmagic |
https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/blake/lp2129571_slow_asset_record_has_holdable_copy |
| 13:07 |
Bmagic |
did I do that right? |
| 13:08 |
Bmagic |
just saw CTLF endings, fixed it, force pushed |
| 13:12 |
Dyrcona |
Bmagic: That looks right to me. |
| 13:18 |
Dyrcona |
eeevil's original change is still faster for me in production, but almost 2ms than the ANY version. |
| 13:20 |
Dyrcona |
But either is much faster than the 61,000ms that the original takes on the example record. |
| 13:25 |
Bmagic |
hmm, we're going back to preferring non-ANY? |
| 13:35 |
Dyrcona |
IDK, but the non-ANY version has been faster for me. I'm going to put the any version in production, though. |
| 13:35 |
Dyrcona |
I added both versions to production with different function names to compare. The non-any is faster. Maybe we should add it to the bug and let people time the two? |
| 13:37 |
Dyrcona |
Any version: 4.086 ms . Non-any version: 2.327 ms. |
| 13:37 |
Dyrcona |
Those are both on record 1717933, The New York Times Book Review with something like 7,786 copies. |
| 13:37 |
Dyrcona |
No second argument. |
| 13:39 |
Dyrcona |
I wonder if the problem is the coalesce inside the function parameters? |
| 13:40 |
Dyrcona |
Both of them get slower when I add a second argument, and the non-any version still takes almost twice as long. |
| 13:41 |
Dyrcona |
Grrr.... The any version takes about twice as long as the non-any version: 58.664 ms vs. 25.795 ms. |
| 13:48 |
Dyrcona |
So, I'm putting the any version in production. Testing on one record really isn't very scientific, and I agree that ANY on a pre-fetched array can be faster than IN against a query. |
| 13:59 |
Dyrcona |
Bmagic++ |
| 13:59 |
Dyrcona |
eeevil++ |
| 14:08 |
Dyrcona |
I have another implementation to test that mixes the two approaches. It uses eeevil's original code to set a v_ou variable and then uses that variable to retrieve an array of ous. We'll see if that's faster. |
| 14:15 |
Dyrcona |
Nope. eeevil's original suggestion is still faster. |
| 15:00 |
Bmagic |
A hole in one we would say! eeevil++ miker++ Dyrcona++ |
| 15:44 |
|
mantis left #evergreen |
| 15:51 |
Bmagic |
what's this query from PG logs? duration: 5300.009 ms statement: FETCH 100 FROM "mrRTSJ2s3b5wuI61"; |
| 15:53 |
Bmagic |
On the front end, it's pcrud.search.aou, but in the DB it comes out like that |
| 15:53 |
jeffdavis |
it's related to Postgres cursors |
| 15:54 |
Bmagic |
it appears we have a problem with PG cursors as well. Which is fishy, because it's slow ~5 seconds retrieving pcrud.search.aou when parent_ou is null during the opensrf call |
| 15:54 |
Bmagic |
It's interesting to me that the previous DB function that was slow when OU was null, is also slo when parent_ou is null in this context |
| 16:01 |
Bmagic |
anyone have insight on that? I'm only seeing CURSORS setup for metabib.staged_browse, but I'm sure it's used for caching for hot tables like actor.org_unit. There has to be something awry on this DB for that cursor to take 5 seconds right? |
| 16:12 |
|
jihpringle joined #evergreen |
| 16:16 |
jeffdavis |
looks like cursors were added to pcrud in 3.14-beta |
| 16:18 |
Dyrcona |
pcrud and cstore use generated aliases for tables when they build queries, IIRC. |
| 16:19 |
Dyrcona |
I still think something's off on this particular db server because another Pg 16 database does not have these issues. |
| 16:22 |
jeffdavis |
this issue is ringing a bell but I can't recall or dig up specifics, sorry :( |
| 16:31 |
Bmagic |
I think we're close |
| 17:12 |
|
mmorgan left #evergreen |
| 19:05 |
|
jihpringle joined #evergreen |
| 21:54 |
|
stephengwills joined #evergreen |