
IRC log for #evergreen, 2025-10-21


All times shown according to the server's local time.

Time Nick Message
07:19 collum joined #evergreen
08:40 mmorgan joined #evergreen
08:53 smayo joined #evergreen
10:36 sandbergja joined #evergreen
11:07 sandbergja joined #evergreen
11:20 redavis joined #evergreen
11:58 Bmagic would anyone be able to look at this and identify why it took 9 seconds? https://explain.depesz.com/s/DcyQ
11:59 Bmagic it seems the sort is the most expensive
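
A plan like the one pasted to explain.depesz.com above is normally captured with EXPLAIN and the generated search SQL (which appears later in the log); a minimal illustrative sketch against a table from that discussion, using a record id that shows up later:

    -- Capture timing and buffer counts so heap fetches and sort costs are visible
    EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
    SELECT record, vis_attr_vector
      FROM asset.copy_vis_attr_cache
     WHERE record = 1917632;
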
12:16 Christineb joined #evergreen
12:31 sandbergja I have no answers, but I was curious about the 5,782,762 heap fetches on the index-only scan.  Is that typical?
12:50 jvwoolf joined #evergreen
13:22 jeffdavis This looks relevant: https://www.postgresql.org/docs/current/indexes-index-only-scans.html
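
Heap fetches on an index-only scan mean the visibility map doesn't cover those pages; a quick way to see how much of the table is marked all-visible (a sketch, with the table assumed from the later discussion):

    -- Low relallvisible relative to relpages forces heap fetches on index-only scans
    SELECT relname, relpages, relallvisible,
           round(100.0 * relallvisible / GREATEST(relpages, 1), 1) AS pct_all_visible
      FROM pg_class
     WHERE oid = 'asset.copy_vis_attr_cache'::regclass;
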
13:27 eeevil Bmagic: the index only scan near the bottom is 6.2s. afaict, all the Sort nodes were fast; the one at 7 was ~.25ms on 949 rows. the nested loop at 9 (and everything inside that) holds all the cost
13:29 Bmagic eeevil++ # of course, now, do I understand that perhaps that index needs re-created?
13:31 eeevil jeffdavis: indeed, it does seem relevant ... I wonder if we may just want to add the vis_attr_vector column as a covered col on the record index.
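
A sketch of that idea (index name hypothetical; whether record is the right key, and whether the extra column is worth the index bloat, would need checking against the actual schema and workload):

    -- Covering index: lets an index-only scan on record also return vis_attr_vector
    CREATE INDEX CONCURRENTLY copy_vis_attr_cache_record_covered_idx
        ON asset.copy_vis_attr_cache (record)
        INCLUDE (vis_attr_vector);
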
13:33 eeevil Bmagic: no, the planner thinks that the best way to attack the problem at hand is to scan the table for attr matches and then probe the record index to see if the row is one we want. we want it to look at the record column first, then probe the attr array. without knowing the search that drove this query, I can't guess at why it decided that, though
13:35 Bmagic I have the query: https://pastebin.com/szEH1f8G
13:35 eeevil if it was just a search based on record attrs (like, "show me all english records") then there's not a lot we can do. but ... you could try adding a gin index over asset.copy_vis_attr_cache.vis_attr_vector if this is a common pattern with your users and your data
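
A sketch of the suggested GIN index (index name hypothetical; if the visibility tests use the intarray @@ / query_int operators rather than plain array containment, the gin__int_ops operator class from the intarray extension would be needed instead of the default):

    -- Index the attribute vector for fast containment/overlap checks
    CREATE INDEX CONCURRENTLY copy_vis_attr_cache_vector_gin_idx
        ON asset.copy_vis_attr_cache
        USING gin (vis_attr_vector);
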
13:35 Bmagic a generic keyword search on the staff-side
13:37 Bmagic I noticed that the exact same search on the patron side was fast
13:38 Bmagic but I didn't capture its underlying query and analyze it (yet). I'm concerned that the disk full event last week did something to the database.
13:39 eeevil that's extremely, exceedingly, unlikely, unless the hardware lied to the software about writes right around the out of space event
13:40 Bmagic well, that's reassuring
13:40 sandbergja just to check, has asset.copy_vis_attr_cache been vacuumed recently?
13:40 Bmagic checking..
13:40 eeevil well, step one is probably vac-analyze the whole db
13:40 eeevil sandbergja: great minds ;)
13:40 Bmagic that's been done
13:41 Bmagic but I wanted to report to you when*
13:41 Bmagic it's not been more than 2 days
13:42 eeevil that circ_lib list looks like it's probably ... all of them. is that true?
13:43 Bmagic yes, all
13:43 eeevil ah, yes ... site(CONS)
13:47 Bmagic last vacuum 10/19
13:47 Bmagic maybe I'll run it now on just that table and re-analyze the search query?
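
For reference, a sketch of checking the table's vacuum history and then vacuuming just that table:

    -- When was the table last vacuumed and analyzed (manually or by autovacuum)?
    SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
      FROM pg_stat_user_tables
     WHERE schemaname = 'asset' AND relname = 'copy_vis_attr_cache';

    -- Targeted vacuum + analyze of the suspect table
    VACUUM (ANALYZE, VERBOSE) asset.copy_vis_attr_cache;
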
13:50 Bmagic well, I did that. Brought the query down to 3 seconds. I'd say that's a smoking gun right there
13:55 berick joined #evergreen
13:56 Bmagic now that the query is fast, I've found another slow down at the opensrf level: Message processing duration 41.322778
13:57 Bmagic which came from this call: CALL: open-ils.cstore open-ils.cstore.json_query.atomic {"from":["asset.record_has_holdable_copy",1917632]}
13:57 Bmagic the staff client starts to show results from "better homes and gardens" loads two results, then waits for 41 seconds before loading the third result. This is the call that it's stuck on
14:01 Bmagic explain analyze SELECT * FROM asset.record_has_holdable_copy( '1917632' ) doesn't help much. Looking at the function now
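
EXPLAIN on the function call only shows an opaque Function Scan; one way to see the plans of the statements running inside it is auto_explain with nested-statement logging (a sketch; needs the auto_explain contrib module and superuser, and the plans land in the server log):

    -- Log plans for every statement executed inside functions, this session only
    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = 0;
    SET auto_explain.log_analyze = on;
    SET auto_explain.log_nested_statements = on;
    SELECT * FROM asset.record_has_holdable_copy(1917632);
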
14:01 sandbergja Bmagic: for my understanding, after the vacuum, did the number of heap fetches go down?
14:02 Bmagic I'll post the explain post-vacuum
14:03 eeevil I thought we were eliding those "all orgs" filters :(  ... here's a possible option to do that: https://lumen.esilibrary.com/p/ZI3fRH25.html
14:03 Bmagic https://explain.depesz.com/s/QFwc
14:07 Bmagic bummer, when I excerpt the query from the function and explain analyze that (plugging in the copy ID and the OU ID) it runs in 50ms
14:08 Bmagic but it takes 40 seconds when it's executed as a function
14:08 sandbergja 😮
14:10 sandbergja what if you try it without specifying an org unit id, and using evergreen.org_top() instead?
14:11 Bmagic hmmmm, here's a clue: SELECT * FROM asset.record_has_holdable_copy( '1717933' ) takes 40 seconds, SELECT * FROM asset.record_has_holdable_copy( '1717933', 1 ) takes 100ms
14:11 Bmagic it's the lack of the second argument
14:12 Bmagic I can't seem to find a function with that name that only takes one argument
14:12 sandbergja it's the same function, but the second param defaults to NULL
14:12 sandbergja (which sends it down the SELECT id FROM actor.org_unit_descendants(SELECT id FROM evergreen.org_top()) path via the COALESCE)
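
Roughly the shape being described, as a simplified sketch (not the actual Evergreen definition; the real function also checks holdability, status, deleted flags, and so on):

    -- Simplified sketch: the second parameter defaults to NULL, and the COALESCE
    -- falls back to the whole org tree when no org unit is passed in
    CREATE OR REPLACE FUNCTION record_has_holdable_copy_sketch (rid BIGINT, ou INT DEFAULT NULL)
    RETURNS BOOL AS $$
        SELECT EXISTS (
            SELECT 1
              FROM asset.copy acp
              JOIN asset.call_number acn ON (acn.id = acp.call_number)
             WHERE acn.record = rid
               AND acp.circ_lib IN (
                    SELECT id FROM actor.org_unit_descendants(
                        COALESCE(ou, (SELECT id FROM evergreen.org_top()))
                    )
               )
        );
    $$ LANGUAGE SQL;
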
14:13 Bmagic sandbergja: I see what you're getting at and I agree. I think this "SELECT id FROM evergreen.org_top()" is the issue
14:14 Bmagic ha! and it's fast by itself. Slow when embedded in the coalesce... freaky
14:15 Bmagic https://explain.depesz.com/s/7k8f  <-- plugging "NULL" in for the first argument in the coalesce
14:18 Bmagic getting closer
14:20 mmorgan claiming 1492
14:22 sandbergja haha I guess it's a little late to make a columbus day joke
14:26 Bmagic lol
14:26 csharp_ never too late to discover america
14:26 Bmagic 26 seconds to perform an index scan on call_number_pkey seems like the issue, but I just vacuum analyzed asset.call_number with no effect
14:28 csharp_ Bmagic: when a function is the problem I often copy out the CREATE FUNCTION statement into a file and add a bunch of debugging (maybe rename the function under a different namespace so you aren't altering the original)
14:28 csharp_ RAISE NOTICE 'blah blah';
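
A minimal illustration of that trick, here just wrapping the original call with timing notices rather than copying the whole body (function name hypothetical):

    -- Debug wrapper: prints RAISE NOTICE timing around the original function call
    CREATE OR REPLACE FUNCTION debug_record_has_holdable_copy (rid BIGINT, ou INT DEFAULT NULL)
    RETURNS BOOL AS $$
    DECLARE
        t0  TIMESTAMPTZ := clock_timestamp();
        res BOOL;
    BEGIN
        RAISE NOTICE 'checking record % (org unit %)', rid, ou;
        SELECT asset.record_has_holdable_copy(rid, ou) INTO res;
        RAISE NOTICE 'finished in %', clock_timestamp() - t0;
        RETURN res;
    END;
    $$ LANGUAGE plpgsql;
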
14:29 Bmagic yep, I've used that trick. Though, this function is so simple, I didn't need to create another function. I used the query inside the function and explain analyzed it
14:29 csharp_ gotcha
14:29 Bmagic And I'm getting the same slowness
14:29 Bmagic 11 seconds to scan asset.copy and 26 seconds to scan asset.call_number looks sus to me
14:29 csharp_ seq scans?
14:29 Bmagic (the last explain I posted above)
14:30 Bmagic acp was a bitmap heap scan and acn was index scan
14:31 Bmagic what's insane is this:
14:31 Bmagic (SELECT id FROM actor.org_unit_descendants(COALESCE(1,(SELECT id FROM evergreen.org_top()))))  100ms
14:31 Bmagic (SELECT id FROM actor.org_unit_descendants(COALESCE(null,(SELECT id FROM evergreen.org_top()))))  41 seconds
14:32 Bmagic (those queries by themselves don't perform any different, but inside the bigger query (from the function) it makes a huge difference)
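
If the planner really does handle a concrete second argument better than the COALESCE over a subselect, one hedged workaround is a thin wrapper that resolves the default before calling the original (a sketch only; function name hypothetical, and it relies on the two-argument call being the fast path, as observed above):

    -- Resolve the NULL default up front, then call the original with both arguments
    CREATE OR REPLACE FUNCTION record_has_holdable_copy_presolved (rid BIGINT, ou INT DEFAULT NULL)
    RETURNS BOOL AS $$
    BEGIN
        IF ou IS NULL THEN
            SELECT id INTO ou FROM evergreen.org_top();
        END IF;
        RETURN asset.record_has_holdable_copy(rid, ou);
    END;
    $$ LANGUAGE plpgsql;
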
14:32 csharp_ Time: 1.946 ms and Time: 2.598 ms on our DB respectively
14:33 Bmagic right, same here
14:33 Bmagic here's the larger function https://pastebin.com/LuLKFPsJ
14:33 sandbergja if you're in a situation where you could add an extension, I would be curious where plprofiler thinks that asset.record_has_holdable_copy is spending its time
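
For reference, getting plprofiler in place is roughly this (a sketch; the extension and its companion command-line tool have to be installed on the database server first, and the report-generation workflow is best taken from the plprofiler documentation):

    -- Make the profiler extension available in this database
    CREATE EXTENSION IF NOT EXISTS plprofiler;
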
14:34 csharp_ SELECT * FROM asset.record_has_holdable_copy( '1717933' ) = 13.691 ms, SELECT * FROM asset.record_has_holdable_copy( '1717933', 1 ), 4.104 ms
14:34 Bmagic yeah! you're seeing the same thing then
14:35 Bmagic in my case it's a bit more dramatic but still, what gives. Those two calls are the same thing
14:35 csharp_ the query you posted took me 4.363 ms
14:35 Bmagic yeah, now swap out "1" for "null" and see how long it takes
14:35 pinesol News from commits: LP2120713 Stamp upgrade script <http://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=f5fb30feb76f249a279b95cb240c43f212ff3461>
14:35 pinesol News from commits: LP#2120713: Fix for merging bibs with deleted monograph parts <http://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=926afbbd2850fd9cd1b4068ba0b9ab5bc547aa0c>
14:35 csharp_ 4.093 ms
14:36 Bmagic huh? really?
14:36 csharp_ yeah - hmmm
14:36 Bmagic I would have expected more like 13 seconds like you reported above
14:36 csharp_ these are all ms
14:37 Bmagic oh right, nvm, I have the issue, not you
14:37 Bmagic it's almost certainly an index or vacuum issue somewhere in here
14:37 csharp_ Bmagic: how is the machine?  RAM ok?  maybe some config tuning needed?
14:37 csharp_ assuming this is a Google cloud box
14:38 csharp_ (we're virtualized too on ITS's VMWare)
14:38 Bmagic PG16, 768GB memory, shared_buffers = 192GB, max_connections = 800, effective_cache_size = 500GB, work_mem = 512MB
14:41 csharp_ shared_buffers | 128MB
14:41 csharp_ effective_cache_size | 352GB
14:41 csharp_ max_connections | 2500
14:41 csharp_ work_mem | 1536MB
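
A quick way to pull the same set of knobs for comparison on either system (a sketch):

    -- Current values, units, and where each setting came from
    SELECT name, setting, unit, source
      FROM pg_settings
     WHERE name IN ('shared_buffers', 'effective_cache_size',
                    'work_mem', 'max_connections');
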
14:42 Bmagic 2500 dang
14:42 csharp_ we have like 500GB RAM, 28 processing cores
14:42 csharp_ load avg hovering between 8 and 11, which is probably normal for us
14:42 csharp_ PG 11 (!)
14:42 Bmagic we have 768GB memory and 96 CPU
14:43 csharp_ huh
14:43 Bmagic maybe the shared_buffers is waaay too high?
14:43 csharp_ seems like you'd be doing better than us, depending on your DB data size
14:44 csharp_ I always have to look up the config settings - I never remember them between config changes
14:44 Bmagic it was* 8GB, but changed it up to try and fix performance
14:46 csharp_ did you change kernel default settings for SHMMAX, etc.?
14:46 csharp_ https://www.postgresql.org/docs/current/kernel-resources.html for reference
14:49 csharp_ looks like we didn't change those on this server, but we have in the past and it made a big difference in performance
14:49 csharp_ (in the past on previous servers, always bare metal)
14:50 Bmagic no, I've not touched those kernel settings... maybe that's part of the issue?
14:50 csharp_ also possible ITS did some VM-level stuff to accommodate our RAM/load
14:50 Bmagic mind sharing the settings you use for those?
14:50 csharp_ they totally underestimated our impact
14:51 csharp_ Bmagic: I'll dig for records :-)
14:51 Bmagic it looks like, according to that article, the default linux values are generally ok
14:51 csharp_ Bmagic: might be irrelevant - just remembering things we've touched before to address performance
14:52 Bmagic we do alter one kernel setting: vm.nr_hugepages
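
Since vm.nr_hugepages is managed by hand, it's worth confirming it still covers a 192GB shared_buffers; a sketch of what to check on the PostgreSQL side (the second setting exists in PG 15 and later):

    -- Is the server requesting huge pages, and how many would it need?
    SHOW huge_pages;
    SHOW shared_memory_size_in_huge_pages;  -- compare against vm.nr_hugepages
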
14:52 csharp_ I remember that we set one of the PG config settings too high one time and it all came to a crawl
14:52 csharp_ maybe shared_buffers?
14:52 Bmagic yeah, shared_buffers is way out of alignment compared to yours
14:52 mmorgan sandbergja: Late to the party, but that ship has sailed.
14:52 * mmorgan ducks
14:53 Bmagic mmorgan++
14:53 sandbergja heh
14:53 csharp_ mmorgan: aaaayyyy
14:54 Bmagic cat /proc/sys/kernel/shmmax:  18446744073692774399
14:54 Bmagic that is a number
14:54 csharp_ Bmagic: yeah, that's ours too - has to be the default for Debian/Ubuntu
14:55 Bmagic same for shmall
14:55 csharp_ same
14:55 csharp_ ok, so probably not kernel
14:55 Bmagic yeah, that number is so large, I can't imagine it's imposing any restrictions
14:56 csharp_ maybe play with setting shared_buffers during off hours and see if anything changes?
14:56 Bmagic I'm hopeful it's shared_buffers. I have a test machine with the same database on it, I'll increase the buffer to 192, and see if it's slow, then back to 8GB
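
A sketch of flipping shared_buffers on the test box (ALTER SYSTEM writes postgresql.auto.conf, and the change only takes effect after a restart):

    -- Try the smaller setting, restart, re-run the slow query, then switch back
    ALTER SYSTEM SET shared_buffers = '8GB';
    -- ...restart PostgreSQL and compare, then:
    ALTER SYSTEM SET shared_buffers = '192GB';
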
14:58 csharp_ https://www.postgresql.org/docs/current/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY - note the advice about that setting
14:58 csharp_ also that it requires a restart
14:58 csharp_ sorry, probably over-advising
14:59 csharp_ sometimes I'm chatty in the channel for the logs
15:00 csharp_ and *sometimes*, in the middle of the night when I'm upgrading and something goes wrong, my *own* advice actually helps :-)
15:00 Bmagic I'm happy to have you chatty
15:00 csharp_ blah blah blah
15:00 Bmagic I bet it's the shared_buffers
15:01 csharp_ actually thinking about doing a session about PINES system admin
15:01 csharp_ for the EG conf
15:01 csharp_ possibly more than one - one more general about PINES administration and one with tech details
15:02 Bmagic I'll attend!
15:02 csharp_ gotta make that trip to Seattle appear worth it and all :-)
15:02 csharp_ <narrator>It *was* worth it</narrator>
15:04 Bmagic can't wait for the in person one! It's been 2 years
15:04 Bmagic you've got me starting to think about presentations of my own, I should do one this time
15:12 csharp_ Bmagic++
15:30 _collum joined #evergreen
15:43 collum joined #evergreen
17:06 mmorgan left #evergreen
18:33 collum joined #evergreen
21:26 sandbergja joined #evergreen
21:36 collum joined #evergreen
23:10 sandbergja joined #evergreen
