| Time | Nick | Message |
| 07:19 | | collum joined #evergreen |
| 08:40 | | mmorgan joined #evergreen |
| 08:53 | | smayo joined #evergreen |
| 10:36 | | sandbergja joined #evergreen |
| 11:07 | | sandbergja joined #evergreen |
| 11:20 | | redavis joined #evergreen |
| 11:58 | Bmagic | would anyone be able to look at this and identify why it took 9 seconds? https://explain.depesz.com/s/DcyQ |
| 11:59 | Bmagic | it seems the sort is the most expensive |
| 12:16 | | Christineb joined #evergreen |
| 12:31 | sandbergja | I have no answers, but I was curious about the 5,782,762 heap fetches on the index-only scan. Is that typical? |
| 12:50 | | jvwoolf joined #evergreen |
| 13:22 | jeffdavis | This looks relevant: https://www.postgresql.org/docs/current/indexes-index-only-scans.html |
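The page jeffdavis links ties heap fetches during an index-only scan to the visibility map. A quick diagnostic along those lines, not run in the conversation, would be to compare all-visible pages to total pages for the table the thread later focuses on:

```sql
-- Rough visibility-map coverage check; a low relallvisible relative to
-- relpages usually means many heap fetches until the table is vacuumed.
SELECT relpages, relallvisible
  FROM pg_class
 WHERE oid = 'asset.copy_vis_attr_cache'::regclass;
```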
| 13:27 | eeevil | Bmagic: the index only scan near the bottom is 6.2s. afaict, all the Sort nodes were fast; the one at 7 was ~.25ms on 949 rows. the nested loop at 9 (and everything inside that) holds all the cost |
| 13:29 | Bmagic | eeevil++ # of course, now, do I understand that perhaps that index needs re-created? |
| 13:31 | eeevil | jeffdavis: indeed, it does seem relevant ... I wonder if we may just want to add the vis_attr_vector column as a covered col on the record index. |
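A sketch of the covering index eeevil describes, with the index name and PG 11+ INCLUDE syntax assumed rather than taken from the Evergreen schema:

```sql
-- Hypothetical covering index: keeps the record lookup index-only while
-- carrying vis_attr_vector in the index leaf pages.
CREATE INDEX CONCURRENTLY copy_vis_attr_cache_record_cover_idx
    ON asset.copy_vis_attr_cache (record)
    INCLUDE (vis_attr_vector);
```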
| 13:33 | eeevil | Bmagic: no, the planner thinks that the best way to attack the problem at hand is to scan the table for attr matches and then probe the record index to see if the row is one we want. we want it to look at the record column first, then probe the attr array. without knowing the search that drove this query, I can't guess at why it decided that, though |
| 13:35 | Bmagic | I have the query: https://pastebin.com/szEH1f8G |
| 13:35 | eeevil | if it was just a search based on record attrs (like, "show me all english records") then there's not a lot we can do. but ... you could try adding a gin index over asset.copy_vis_attr_cache.vis_attr_vector if this is a common pattern with your users and your data |
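A sketch of that GIN index, assuming vis_attr_vector is an integer array queried through the intarray operators; the operator class and index name are assumptions to check against the actual column type:

```sql
-- gin__int_ops comes from the intarray extension; the default array
-- opclass also works if only && / @> / <@ are used against the column.
CREATE EXTENSION IF NOT EXISTS intarray;
CREATE INDEX CONCURRENTLY copy_vis_attr_cache_vav_gin_idx
    ON asset.copy_vis_attr_cache
 USING gin (vis_attr_vector gin__int_ops);
```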
| 13:35 |
Bmagic |
a generic keyword search on the staff-side |
| 13:37 |
Bmagic |
I noticed that the exact same search on the patron side was fast |
| 13:38 |
Bmagic |
but I didn't capture it's underlying query and analyze it (yet). I'm concerned that the disk full event last week did something to the database. |
| 13:39 |
eeevil |
that's extremely, exceedingly, unlikely, unless the hardware lied to the software about writes right around the out of space event |
| 13:40 |
Bmagic |
well, that's reassuring |
| 13:40 |
sandbergja |
just to check, has asset.copy_vis_attr_cache been vacuumed recently? |
| 13:40 |
Bmagic |
checking.. |
| 13:40 |
eeevil |
well, step one is probably vac-analyze the whole db |
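For the "has it been vacuumed recently" question, the statistics views answer it directly; a small sketch using the table name from the thread:

```sql
-- When was this table last vacuumed/analyzed, manually or by autovacuum?
SELECT relname, last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze
  FROM pg_stat_user_tables
 WHERE schemaname = 'asset'
   AND relname = 'copy_vis_attr_cache';
```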
| 13:40 | eeevil | sandbergja: great minds ;) |
| 13:40 | Bmagic | that's been done |
| 13:41 | Bmagic | but I wanted to report to you when* |
| 13:41 | Bmagic | it's not been more than 2 days |
| 13:42 | eeevil | that circ_lib list looks like it's probably ... all of them. is that true? |
| 13:43 | Bmagic | yes, all |
| 13:43 | eeevil | ah, yes ... site(CONS) |
| 13:47 | Bmagic | last vacuum 10/19 |
| 13:47 | Bmagic | maybe I'll run it now on just that table and re-analyze the search query? |
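A minimal version of that experiment, assuming the slow plan came from running EXPLAIN ANALYZE on the query in the pastebin:

```sql
-- Vacuum and analyze just the suspect table, then re-run the explain and
-- compare the heap fetch count and total runtime against the earlier plan.
VACUUM (ANALYZE, VERBOSE) asset.copy_vis_attr_cache;
-- EXPLAIN (ANALYZE, BUFFERS) <search query from the pastebin>;
```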
| 13:50 | Bmagic | well, I did that. Brought the query down to 3 seconds. I'd say that's a smoking gun right there |
| 13:55 | | berick joined #evergreen |
| 13:56 | Bmagic | now that the query is fast, I've found another slowdown at the opensrf level: Message processing duration 41.322778 |
| 13:57 | Bmagic | which came from this call: CALL: open-ils.cstore open-ils.cstore.json_query.atomic {"from":["asset.record_has_holdable_copy",1917632]} |
| 13:57 | Bmagic | the staff client starts to show results from "better homes and gardens", loads two results, then waits for 41 seconds before loading the third result. This is the call that it's stuck on |
| 14:01 | Bmagic | explain analyze SELECT * FROM asset.record_has_holdable_copy( '1917632' ) doesn't help much. Looking at the function now |
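EXPLAIN on the outer SELECT only shows the function call itself; one way to see the plans of the statements running inside the function is auto_explain with nested-statement logging. A session-level sketch, assuming the auto_explain module is installed and you have sufficient privileges:

```sql
-- Log plans for statements executed inside functions called in this session;
-- output goes to the PostgreSQL server log.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;
SET auto_explain.log_analyze = on;
SET auto_explain.log_nested_statements = on;
SELECT * FROM asset.record_has_holdable_copy('1917632');
```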
| 14:01 | sandbergja | Bmagic: for my understanding, after the vacuum, did the number of heap fetches go down? |
| 14:02 | Bmagic | I'll post the explain post-vacuum |
| 14:03 | eeevil | I thought we were eliding those "all orgs" filters :( ... here's a possible option to do that: https://lumen.esilibrary.com/p/ZI3fRH25.html |
| 14:03 | Bmagic | https://explain.depesz.com/s/QFwc |
| 14:07 | Bmagic | bummer, when I excerpt the query from the function and explain analyze that (plugging in the copy ID and the OU ID) it runs in 50ms |
| 14:08 | Bmagic | but it takes 40 seconds when it's executed as a function |
| 14:08 | sandbergja | 😮 |
| 14:10 | sandbergja | what if you try it without specifying an org unit id, and using evergreen.org_top() instead? |
| 14:11 | Bmagic | hmmmm, here's a clue: SELECT * FROM asset.record_has_holdable_copy( '1717933' ) takes 40 seconds, SELECT * FROM asset.record_has_holdable_copy( '1717933', 1 ) takes 100ms |
| 14:11 | Bmagic | it's the lack of the second argument |
| 14:12 | Bmagic | I can't seem to find a function with that name that only takes one argument |
| 14:12 | sandbergja | it's the same function, but the second param defaults to NULL |
| 14:12 | sandbergja | (which sends it down the SELECT id FROM actor.org_unit_descendants(SELECT id FROM evergreen.org_top()) path via the COALESCE) |
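One way to confirm there is only one function and that the second parameter simply has a default is a small catalog query (purely illustrative, not something run in the conversation):

```sql
-- Shows the full argument list, including any DEFAULT clauses.
SELECT p.oid::regprocedure            AS signature,
       pg_get_function_arguments(p.oid) AS arguments
  FROM pg_proc p
  JOIN pg_namespace n ON n.oid = p.pronamespace
 WHERE n.nspname = 'asset'
   AND p.proname = 'record_has_holdable_copy';
```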
| 14:13 | Bmagic | sandbergja: I see what you're getting at and I agree. I think this "SELECT id FROM evergreen.org_top()" is the issue |
| 14:14 | Bmagic | ha! and it's fast by itself. Slow when embedded in the coalesce... freaky |
| 14:15 | Bmagic | https://explain.depesz.com/s/7k8f <-- plugging "NULL" in for the first argument in the coalesce |
| 14:18 | Bmagic | getting closer |
| 14:20 | mmorgan | claiming 1492 |
| 14:22 | sandbergja | haha I guess it's a little late to make a columbus day joke |
| 14:26 | Bmagic | lol |
| 14:26 | csharp_ | never too late to discover america |
| 14:26 | Bmagic | 26 seconds to perform an index scan on call_number_pkey seems like the issue, but I just vacuum analyzed asset.call_number with no effect |
| 14:28 | csharp_ | Bmagic: when a function is the problem I often copy out the CREATE FUNCTION statement into a file and add a bunch of debugging (maybe rename the function under a different namespace so you aren't altering the original) |
| 14:28 | csharp_ | RAISE NOTICE 'blah blah'; |
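A sketch of that debugging pattern, copying the function under a scratch name and timing it with RAISE NOTICE; the schema, function name, and placeholder body here are hypothetical, not the real Evergreen function:

```sql
CREATE SCHEMA IF NOT EXISTS scratch;

CREATE OR REPLACE FUNCTION scratch.record_has_holdable_copy_debug(rid BIGINT)
RETURNS BOOL AS $$
DECLARE
    t0     TIMESTAMPTZ := clock_timestamp();
    result BOOL;
BEGIN
    RAISE NOTICE 'looking for holdable copies on record %', rid;
    -- paste the real query from the production function here
    SELECT TRUE INTO result;
    RAISE NOTICE 'finished in %', clock_timestamp() - t0;
    RETURN COALESCE(result, FALSE);
END;
$$ LANGUAGE plpgsql;
```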
| 14:29 | Bmagic | yep, I've used that trick. Though, this function is so simple, I didn't need to create another function. I used the query inside the function and explain analyzed it |
| 14:29 | csharp_ | gotcha |
| 14:29 | Bmagic | And I'm getting the same slowness |
| 14:29 | Bmagic | 11 seconds to scan asset.copy and 26 seconds to scan asset.call_number looks sus to me |
| 14:29 | csharp_ | seq scans? |
| 14:30 | Bmagic | (the last explain I posted above) |
| 14:31 | Bmagic | acp was a bitmap heap scan and acn was index scan |
| 14:31 | Bmagic | what's insane is this: |
| 14:31 | Bmagic | (SELECT id FROM actor.org_unit_descendants(COALESCE(1,(SELECT id FROM evergreen.org_top())))) 100ms |
| 14:31 | Bmagic | (SELECT id FROM actor.org_unit_descendants(COALESCE(null,(SELECT id FROM evergreen.org_top())))) 41 seconds |
| 14:32 | Bmagic | (those queries by themselves don't perform any differently, but inside the bigger query (from the function) it makes a huge difference) |
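One plausible explanation, an assumption on the editor's part rather than something verified in the thread: with a non-null constant as the first COALESCE argument, the planner folds the expression to that constant at plan time, so actor.org_unit_descendants() gets a plain literal; with NULL it has to keep the subselect, and the surrounding joins get planned differently. The folding itself is easy to observe:

```sql
-- The whole expression should collapse to the constant 1; no subplan appears.
EXPLAIN (VERBOSE) SELECT COALESCE(1, (SELECT id FROM evergreen.org_top()));
-- With NULL first, the plan keeps an InitPlan for the org_top() subselect.
EXPLAIN (VERBOSE) SELECT COALESCE(NULL::INT, (SELECT id FROM evergreen.org_top()));
```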
| 14:32 | csharp_ | Time: 1.946 ms and Time: 2.598 ms on our DB respectively |
| 14:33 | Bmagic | right, same here |
| 14:33 | Bmagic | here's the larger function https://pastebin.com/LuLKFPsJ |
| 14:33 | sandbergja | if you're in a situation where you could add an extension, I would be curious where plprofiler thinks that asset.record_has_holdable_copy is spending its time |
| 14:34 | csharp_ | SELECT * FROM asset.record_has_holdable_copy( '1717933' ) = 13.691 ms, SELECT * FROM asset.record_has_holdable_copy( '1717933', 1 ), 4.104 ms |
| 14:34 | Bmagic | yeah! you're seeing the same thing then |
| 14:35 | Bmagic | in my case it's a bit more dramatic but still, what gives. Those two calls are the same thing |
| 14:35 | csharp_ | the query you posted took me 4.363 ms |
| 14:35 | Bmagic | yeah, now swap out "1" for "null" and see how long it takes |
| 14:35 | pinesol | News from commits: LP2120713 Stamp upgrade script <http://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=f5fb30feb76f249a279b95cb240c43f212ff3461> |
| 14:35 | pinesol | News from commits: LP#2120713: Fix for merging bibs with deleted monograph parts <http://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=926afbbd2850fd9cd1b4068ba0b9ab5bc547aa0c> |
| 14:35 | csharp_ | 4.093 ms |
| 14:36 | Bmagic | huh? really? |
| 14:36 | csharp_ | yeah - hmmm |
| 14:36 | Bmagic | I would have expected more like 13 seconds like you reported above |
| 14:36 | csharp_ | these are all ms |
| 14:37 | Bmagic | oh right, nvm, I have the issue, not you |
| 14:37 | Bmagic | it's almost certainly an index or vacuum issue somewhere in here |
| 14:37 | csharp_ | Bmagic: how is the machine? RAM ok? maybe some config tuning needed? |
| 14:37 | csharp_ | assuming this is a Google cloud box |
| 14:38 | csharp_ | (we're virtualized too on ITS's VMWare) |
| 14:38 | Bmagic | PG16, 768GB memory, shared_buffers = 192GB, max_connections = 800, effective_cache_size = 500GB, work_mem = 512MB |
| 14:41 | csharp_ | shared_buffers | 128MB |
| 14:41 | csharp_ | effective_cache_size | 352GB |
| 14:41 | csharp_ | max_connections | 2500 |
| 14:41 | csharp_ | work_mem | 1536MB |
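For comparing setups across servers, a convenience query can pull the same handful of settings in one go; the list below simply mirrors the values pasted above:

```sql
-- Current values and units for the settings discussed in the thread.
SELECT name, setting, unit
  FROM pg_settings
 WHERE name IN ('shared_buffers', 'effective_cache_size',
                'work_mem', 'max_connections');
```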
| 14:42 | Bmagic | 2500 dang |
| 14:42 | csharp_ | we have like 500GB RAM, 28 processing cores |
| 14:42 | csharp_ | load avg hovering between 8 and 11, which is probably normal for us |
| 14:42 | csharp_ | PG 11 (!) |
| 14:42 | Bmagic | we have 768GB memory and 96 CPU |
| 14:43 | csharp_ | huh |
| 14:43 | Bmagic | maybe the shared_buffers is waaay too high? |
| 14:43 | csharp_ | seems like you'd be doing better than us, depending on your DB data size |
| 14:44 | csharp_ | I always have to look up the config settings - I never remember them between config changes |
| 14:44 | Bmagic | it was* 8GB, but changed it up to try and fix performance |
| 14:46 | csharp_ | did you change kernel default settings for SHMMAX, etc.? |
| 14:46 | csharp_ | https://www.postgresql.org/docs/current/kernel-resources.html for reference |
| 14:49 | csharp_ | looks like we didn't change those on this server, but we have in the past and it made a big difference in performance |
| 14:49 | csharp_ | (in the past on previous servers, always bare metal) |
| 14:50 | Bmagic | no, I've not touched those kernel settings... maybe that's part of the issue? |
| 14:50 | csharp_ | also possible ITS did some VM-level stuff to accommodate our RAM/load |
| 14:50 | Bmagic | mind sharing the settings you use for those? |
| 14:50 | csharp_ | they totally underestimated our impact |
| 14:51 | csharp_ | Bmagic: I'll dig for records :-) |
| 14:51 | Bmagic | it looks like, according to that article, the default linux values are generally ok |
| 14:51 | csharp_ | Bmagic: might be irrelevant - just remembering things we've touched before to address performance |
| 14:52 | Bmagic | we do alter one kernel setting: vm.nr_hugepages |
| 14:52 | csharp_ | I remember that we set one of the PG config settings too high one time and it all came to a crawl |
| 14:52 | csharp_ | maybe shared_buffers? |
| 14:52 | Bmagic | yeah, shared_buffers is way out of alignment compared to yours |
| 14:52 | mmorgan | sandbergja: Late to the party, but that ship has sailed. |
| 14:52 | * mmorgan | ducks |
| 14:53 | Bmagic | mmorgan++ |
| 14:53 | sandbergja | heh |
| 14:53 | csharp_ | mmorgan: aaaayyyy |
| 14:54 | Bmagic | cat /proc/sys/kernel/shmmax: 18446744073692774399 |
| 14:54 | Bmagic | that is a number |
| 14:54 | csharp_ | Bmagic: yeah, that's ours too - has to be the default for Debian/Ubuntu |
| 14:55 | Bmagic | same for shmall |
| 14:55 | csharp_ | same |
| 14:55 | csharp_ | ok, so probably not kernel |
| 14:55 | Bmagic | yeah, that number is so large, I can't imagine it's imposing any restrictions |
| 14:56 | csharp_ | maybe play with setting shared_buffers during off hours and see if anything changes? |
| 14:56 | Bmagic | I'm hopeful it's shared_buffers. I have a test machine with the same database on it, I'll increase the buffer to 192GB, and see if it's slow, then back to 8GB |
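A sketch of that A/B test on the test box, using the values from the conversation; note that shared_buffers only takes effect after a restart, as mentioned just below:

```sql
-- Bump shared_buffers on the test machine, restart, time the slow call,
-- then drop back to 8GB and compare.
ALTER SYSTEM SET shared_buffers = '192GB';
-- restart PostgreSQL, then confirm the new value took effect:
SHOW shared_buffers;
-- time the problem call at each setting:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM asset.record_has_holdable_copy('1717933');
-- afterwards: ALTER SYSTEM SET shared_buffers = '8GB'; and restart again
```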
| 14:58 | csharp_ | https://www.postgresql.org/docs/current/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY - note the advice about that setting |
| 14:58 | csharp_ | also that it requires a restart |
| 14:58 | csharp_ | sorry, probably over-advising |
| 14:59 | csharp_ | sometimes I'm chatty in the channel for the logs |
| 15:00 | csharp_ | and *sometimes*, in the middle of the night when I'm upgrading and something goes wrong, my *own* advice actually helps :-) |
| 15:00 | Bmagic | I'm happy to have you chatty |
| 15:00 | csharp_ | blah blah blah |
| 15:00 | Bmagic | I bet it's the shared_buffers |
| 15:01 | csharp_ | actually thinking about doing a session about PINES system admin |
| 15:01 | csharp_ | for the EG conf |
| 15:01 | csharp_ | possibly more than one - one more general about PINES administration and one with tech details |
| 15:02 | Bmagic | I'll attend! |
| 15:02 | csharp_ | gotta make that trip to Seattle appear worth it and all :-) |
| 15:02 | csharp_ | <narrator>It *was* worth it</narrator> |
| 15:04 | Bmagic | can't wait for the in person one! It's been 2 years |
| 15:04 | Bmagic | you've got me starting to think about presentations of my own, I should do one this time |
| 15:12 | csharp_ | Bmagic++ |
| 15:30 | | _collum joined #evergreen |
| 15:43 | | collum joined #evergreen |
| 17:06 | | mmorgan left #evergreen |
| 18:33 | | collum joined #evergreen |
| 21:26 | | sandbergja joined #evergreen |
| 21:36 | | collum joined #evergreen |
| 23:10 | | sandbergja joined #evergreen |