Evergreen ILS Website

IRC log for #evergreen, 2023-10-30


All times shown according to the server's local time.

Time Nick Message
04:39 StomproJ joined #evergreen
06:54 redavis joined #evergreen
07:59 BDorsey joined #evergreen
08:12 collum joined #evergreen
09:09 Dyrcona joined #evergreen
09:44 dguarrac joined #evergreen
09:44 Bmagic Dyrcona: my exporter did 2.2 million in 8 hours. Not sure if I gave you the final number or not
09:56 Dyrcona Thanks, Bmagic. I'm going to look at Stompro's cursor code today. I think it could be refactored a little bit and made to work for authorities, too. I'll see how that goes.
09:57 Bmagic sweet
09:57 Dyrcona I'm 99% convinced that my issue is Perl making its way through the massive result set.
10:01 sleary joined #evergreen
10:08 Stompro Dyrcona, By getting rid of a bunch of the acps_for_bre arrays and greps and using hashes instead, I'm seeing a 90% speedup in that subroutine also.
10:09 Stompro I'm cleaning that up and then I'll share.  And I also switched acps_for_bre to using a prepared statement, and that also helped.
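
The prepared-statement change, sketched at the SQL level (the real acps_for_bre code is Perl DBI, and the join and column list here are assumptions, not the actual marc_export query):

    -- Prepare once at startup, execute per bib record; avoids re-planning
    -- the same query for every record exported.
    PREPARE acps_for_bre (BIGINT) AS
        SELECT acp.*
          FROM asset.copy acp
          JOIN asset.call_number acn ON acn.id = acp.call_number
         WHERE acn.record = $1
           AND NOT acp.deleted;

    EXECUTE acps_for_bre(671579);
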
10:09 Dyrcona Stompro: Cool! Push it to the collab branch. I'm doing something else right now.
10:13 Dyrcona Hmm.. OpenSRF router died on development VM sometime in the last week.
10:14 Dyrcona Oh! And, it is the Redis branch.
10:14 Bmagic that might have something to do with it :)
10:15 Dyrcona Did I miss a bug report/issue?
10:16 Dyrcona I should update this bad boy to main+Redis. It's currently on Evergreen 3.10.
10:17 Dyrcona Next week after the database refresh.
10:17 Bmagic Bad boys: whatcha gonna do?
10:17 Dyrcona heh.
10:17 Dyrcona Services seem to be hung restarting. Do I need to --reset-message-bus?
10:19 Dyrcona Well, now I've got a mess.... Some services running, some not. Does the router change PID but keep going?
10:19 * Dyrcona tries again.
10:20 berick Dyrcona: i posted a breaking change friday (?) but fixed it an hour or 2 later, if that offers any clues
10:21 berick the opensrf branch
10:21 berick which, btw, forces reset-message-bus to run when needed so it's not manually required anymore.
10:22 Dyrcona berick: I think my branch predates either change. I'll grab the latest and plan to upgrade Evergreen to main next week.
10:22 Stompro With my export of 200K bibs I can do that at a rate of 2.5 million an hour, 1.25 million with items... although that may be faster now.
10:22 Dyrcona Though I guess the deadline for 3.12 is Wednesday....
10:23 Dyrcona Stompro: Cool. I'll take a look at your changes and give them a go. I plan to refactor the cursor code into a new/old function so things can be simplified. I'll see how your changes affect that.
10:37 Dyrcona berick: Nice change for batch_size from u64 to u32. Wouldn't want to try exporting more than 4 billion records in a single go. :)
10:42 * Dyrcona updates OpenSRF to latest Redis collab.
10:43 BAMkubasa joined #evergreen
10:50 mantis1 joined #evergreen
10:55 briank joined #evergreen
11:15 Dyrcona OK. I have the latest Rust marc-export test running. I'll see how long this takes.
11:27 jihpringle joined #evergreen
12:14 Christineb joined #evergreen
12:18 eeevil berick: drive-by question (from immediately post-hack-a-way) incoming! meetings will probably keep me away from here for the rest of the day, but want to get it to you asap for RediSRF(?) goodness
12:20 eeevil berick: I'm looking at opensrf/redis anew via the collab branches. I have two questions that jump out at me: is the --reset-message-bus documented beyond "you must do this at redis restart" somewhere? (the commit it comes in with isn't obvious to me; also, I want to help remove any instances of boilerplate stuff that we can automate or fold into other actions as early as possible -- ideally before merge -- and I know this is under discussion or at
12:20 eeevil least being thought about); and, while the config and password list files don't require it themselves, it /looks/ like the usernames are pinned as "opensrf", "router", and "gateway" in the code that implements each part. is that correct and do you see any reason we /can't/ replace those strings with the value pulled from opensrf_core.xml as with ejabberd?  the result being ACL-separated user sets that share a redis instance but can't talk to each
12:20 eeevil other or guess each other's user and queue names.
12:32 Dyrcona eeevil: berick said earlier that he modified osrf_control to do --reset-message-bus as required, so it's not necessary now. I'm testing that and other updates. That's a good question about the names. (I generally don't change them.)
12:33 Bmagic I'm troubleshooting a database issue. explain analyze on an insert into biblio.record_entry results in 6+ seconds. The analyze shows me that it spends the majority of the time in reporter.simple_rec_trigger(). Which is odd, because if it's an INSERT, it should just skip down to RETURN. Weird right?
12:33 Dyrcona Bmagic: No. It still builds the rmsr entry.
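
One way to see exactly what the trigger does on INSERT is to pull the function source straight from the catalog:

    -- Dump the body of the trigger function to see what runs on INSERT.
    SELECT prosrc
      FROM pg_proc
     WHERE proname = 'simple_rec_trigger';
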
12:34 Dyrcona Wait until you update an authority with hundreds of bibs "attached." :)
12:35 Bmagic this is EG 3.11.1 with queued ingest. I updated the global flag to turn off queued ingest, and tested it. and vice versa. it's 13 seconds for an insert with it turned off, and 6 seconds with it turned on. But still, 6 seconds seems insane for an insert on biblio.record_entry, especially if we're deferring all of the metabib stuff for the ingester process to come later
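
The queued-ingest switches live in config.global_flag; a sketch for inspecting and toggling them (the specific flag name in the UPDATE is hypothetical and should be checked against the schema):

    -- List the queued-ingest flags and their current state.
    SELECT name, enabled
      FROM config.global_flag
     WHERE name LIKE 'ingest.queued%';

    -- Toggle one off for testing (flag name is hypothetical).
    UPDATE config.global_flag
       SET enabled = FALSE
     WHERE name = 'ingest.queued.biblio.insert';
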
12:35 Dyrcona Updating/inserting bibs and auths can be slow because of all the triggers.
12:36 Bmagic I got here because it's breaking the course reserves interface. Taking too long
12:37 * Dyrcona colour me unsurprised.
12:37 Dyrcona "Color me your color, baby. Color me your car."
12:44 jeff
12:44 jeff er, hi.
12:49 Dyrcona Hi, jeff!
12:58 Dyrcona Hm. This gets tricky.
13:03 Dyrcona I wonder if Marque::Authority and Marque::Biblio should be refactored to have a common base class: Marque::Record? Seems a bit crazy for something that's all in one file.... Maybe if we split it all out some day?
13:17 Dyrcona Ooh. Found an unrelated bug with the error output that I'll patch with a separate commit.
13:32 Dyrcona Stompro: I pushed a couple of commits to the marc_export collab branch: 1) a refactor of the cursor code that adds a new CLI option and expands the cursor to work with authorities, and 2) a fix for using a wrong variable when reporting errors for an invalid --type parameter.
13:36 Stompro Dyrcona, I'm playing around with requesting the acns and acps in batch, to see if performing fewer individual SQL requests helps.
13:44 Dyrcona OK. You might want to rebase your branch on the working repo.
13:57 Bmagic Dyrcona: eeevil: and anyone else. Dissecting my query pathway through the DB, I found a line in reporter.simple_rec_update() that seems to be the culprit https://explain.depesz.com/s/WK9m
14:02 Dyrcona It's a unique constraint check causing the issue?
14:02 Dyrcona Maybe I'm not understanding that correctly.
14:02 Bmagic essentially it's this: INSERT INTO reporter.materialized_simple_record SELECT DISTINCT ON (id) * FROM reporter.old_super_simple_record WHERE id = 671579;
14:02 Dyrcona Left joins are also a big hit.
14:03 Dyrcona That's where the unique comes from then....
14:03 Bmagic which is buried in reporter.simple_rec_update(ID, BOOL)
14:04 Dyrcona Well, maybe that is due for a refactor, too. :)
14:04 Bmagic select count(*) from reporter.old_super_simple_record took 31 seconds to complete
14:05 Dyrcona Yeah, I think that's a view, and it's probably where all of the left joins happen.
14:06 Dyrcona I'd have to look to make sure, though.
14:11 Bmagic maybe the issue is vacuum analyze after* queued ingest has finished
14:12 Dyrcona You're saying to do a vacuum analyze after queued ingest, or one is happening already?
14:12 Bmagic I'm saying that I'm not sure I did that step after the queued ingest finished ingesting the 300k bibs I migrated
14:13 Dyrcona Well, it probably wouldn't hurt to do the affected tables.
14:13 Bmagic That's most likely it, thanks for being my sounding wall!
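
Assuming stale statistics are the culprit, refreshing them on the tables behind the views is cheap to try (the table list here is illustrative, not exhaustive):

    -- Refresh planner statistics on tables feeding the slow views.
    VACUUM ANALYZE biblio.record_entry;
    VACUUM ANALYZE reporter.materialized_simple_record;
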
14:16 Dyrcona Bmagic: This is the most likely suspect: metabib.wide_display_entry. It's a view, and reporter.old_super_simple_record is built from metabib.wide_display_entry.
14:17 Bmagic yep
14:17 Bmagic inspecting that view was "fun"
14:18 Dyrcona Oof.... Yet more views involved....
14:18 Dyrcona View built on views built on views.
14:18 Bmagic that's right
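
One way to unwind the stack is to ask PostgreSQL for each definition directly:

    -- Pretty-print a view's definition to see what it is built on.
    SELECT pg_get_viewdef('reporter.old_super_simple_record'::regclass, true);
    SELECT pg_get_viewdef('metabib.wide_display_entry'::regclass, true);
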
14:25 Dyrcona A View to a pg_kill.
14:25 Dyrcona OK, pg_cancel_backend.... :)
14:35 BAMkubasa joined #evergreen
14:52 jonadab joined #evergreen
15:31 pinesol News from commits: LP#1989151: Add links to secondary management interfaces <https://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=03b93d925c69fd3b055bba0678b33f191aa58060>
15:39 jihpringle joined #evergreen
15:43 jihpringle joined #evergreen
15:55 Dyrcona So, I'm using the Redis branch and I have a script that is failing to authenticate using Cronscript->authenticate. I'll see if I can figure it out.
15:55 Bmagic Dyrcona++
15:58 Dyrcona It might not be the Redis branch. It might be my code. I'm pretty sure this used to work, though.
15:59 Dyrcona I'm going to try it on a non-Redis VM, but that one is also a version behind the Redis VM when it comes to Evergreen.
16:03 Dyrcona Ok, can_have_users is false on the org_unit. I think that prevents login, doesn't it?
16:03 Dyrcona I'll change it and see what happens.
16:03 Bmagic that might be it
16:06 Dyrcona Ugh..... It's the ou_type....
16:07 Dyrcona Right.... Need to change the update.
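
A quick check of which org units can hold user accounts (the flag lives on the org unit type, not the org unit itself):

    -- Whether logins are allowed is a property of the org unit type.
    SELECT aou.shortname, aout.name AS ou_type, aout.can_have_users
      FROM actor.org_unit aou
      JOIN actor.org_unit_type aout ON aout.id = aou.ou_type;
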
16:13 Dyrcona Authentication failed....
16:14 Dyrcona Parent ou, too, maybe?
16:14 Dyrcona No. That shouldn't be it.
16:15 Dyrcona My update worked... I can see it in the database, and I bet I have to run autogen.sh...
16:16 Dyrcona Nope. Not even after autogen.sh and --restart-all.... Something else is wrong. Guess I'll check the logs.
16:21 Dyrcona Heh. I'm an idiot. I changed my Evergreen password since I wrote the original script that I copied and modified..... Out-of-date, hard-coded credentials were the issue.
16:22 Dyrcona Now, should I get an empty array from open-ils.circ.checkin.override?
16:28 Dyrcona OK. I guess so. Looks like we hit a return; at the end.
16:29 Dyrcona I suppose I could write something to verify the copy and circulation states.
16:33 Bmagic I think my issue might be a PG15 thing, specifically PG15 with a replicated secondary DB. I'm not sure what setting I need to tweak, but I can confirm that after a VACUUM FULL on metabib.*, it's still slow. But when I run the query on the secondary database, it's nice and fast. A clue?
16:38 Bmagic Maybe I can change PG configuration to eliminate the secondary DB to prove it has something to do with that (or not)
16:38 Dyrcona Nope. None of them checked in.
16:40 Dyrcona Bmagic: Could be. I often have queries fail on the secondary DB, 'cause changed rows from replication are needed. I know that's not your issue, but you shouldn't be inserting into a replicated db, so I'm not sure what you're checking.
16:40 Bmagic This is my test query: explain analyze select * from reporter.old_super_simple_record where id=671579
16:41 Bmagic 6+ seconds on db1. and less than one second on db2
16:42 Dyrcona OK. That's right.
16:43 Dyrcona Hm... I need to find the circulations. The copies are all Lost status.... They're also deleted, and I thought using the copy_id would resolve the latter.
16:44 Dyrcona Guess I'll pick that up in the morning or some other time.
16:46 mantis1 left #evergreen
16:47 jeffdavis Bmagic: just to confirm, you've done ANALYZE as well as VACUUM FULL on your metabib tables in the primary db?
16:47 Bmagic jeffdavis: yep
16:50 Dyrcona You did a VACUUM FULL not just a VACUUM or VACUUM ANALYZE?
16:51 Bmagic VACUUM FULL ANALYZE <table>
16:51 Bmagic Repeated for all tables in metabib
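
For the repeat-for-every-table step, psql can generate and run the statements itself (psql-only; note that VACUUM FULL takes an exclusive lock on each table):

    -- Generate one statement per table in metabib, then run each (\gexec).
    SELECT format('VACUUM FULL ANALYZE %I.%I', schemaname, tablename)
      FROM pg_tables
     WHERE schemaname = 'metabib' \gexec
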
16:52 Bmagic retested, and was disappointed. Restarted postgres, still disappointed
16:55 jeffdavis Is it consistently ~6 seconds even for other IDs? Like, not 4 or 8 seconds?
16:59 jeffdavis We were having an issue on a test server where calls to open-ils.pcrud.search.circ were consistently taking about 6.5s per circ. I never got to the bottom of it (fortunately we're not seeing that in production). It's a long shot but maybe there is a connection? Don't want to send you down a rabbit hole though!
17:00 Dyrcona There are only 1 or 2 tables involved in that function. Everything else is 4 views that percolate up.
17:00 jeffdavis But the flesh fields for the problematic call include wide_display_entry.
17:01 Dyrcona Which is a view, IIRC.
17:01 * Dyrcona calls it a day. I'll be back tomorrow if the creek don't rise.
17:01 jeffdavis CALL: open-ils.pcrud open-ils.pcrud.search.circ "[...]",{"id":[...]},{"flesh":4,"flesh_fields":{"circ":["target_copy","workstation","checkin_workstation"],"acp":["call_number","holds_count","status","circ_lib","location","floating","age_protect","parts"],"acpm":["part"],"acn":["record","owning_lib","prefix","suffix"],"bre":["wide_display_entry"]},"select":{"bre":["id"]},"order_by":{"circ":["xact_start"]}}
17:02 jeff Bmagic: it would be interesting to compare the explain (analyze, buffers) with the query on the two postgresql instances. database config the same between the two? hardware?
17:02 jeff Bmagic: what replication method are you using?
17:03 Bmagic jeff: I've got the secondary DB in standby mode, configured "primary_conninfo" to connect to the primary db. Plus, if they get out of sync, we have "restore_command" configured to pull the wal logs from the NFS share where DB1 writes them
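
With that setup, the standard monitoring views confirm whether the standby is actually streaming:

    -- On the primary: which standbys are connected and how far along.
    SELECT client_addr, state, replay_lsn FROM pg_stat_replication;

    -- On the standby: receiver status.
    SELECT status, sender_host FROM pg_stat_wal_receiver;
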
17:03 jeff basic health checks look good? no long-running transactions, etc?
17:04 Bmagic the secondary DB has effective_cache_size = 64GB, compared to primary effective_cache_size = 200GB
17:05 Bmagic primary: work_mem = 512MB. secondary: work_mem = 160MB
17:05 Bmagic otherwise, the two machines have the same postgresql.conf if you ignore the bits for the secondary needing to know how to connect to primary
17:06 jeff ok, I can see how that would likely result in different planner decisions. buffers will likely be your friend in determining the difference.
17:06 jeff (but that may not give you your answer.)
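
A side-by-side of the planner-relevant settings on the two instances makes the difference concrete:

    -- Run on both instances and diff the output.
    SELECT name, setting, unit
      FROM pg_settings
     WHERE name IN ('work_mem', 'effective_cache_size',
                    'shared_buffers', 'random_page_cost');
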
17:06 Bmagic jeff: I'm afraid I'm not following the buffers idea. Do I need to issue a slightly different query to get more information?
17:07 Bmagic or are you just talking about "free -f" ?
17:07 jeff passing the buffers argument to explain, so you get useful information on what IO took place.
17:08 jeff EXPLAIN (ANALYZE, BUFFERS) SELECT...;
17:08 jeff https://www.postgresql.org/docs/15/using-explain.html#USING-EXPLAIN-ANALYZE "EXPLAIN has a BUFFERS option that can be used with ANALYZE to get even more run time statistics"
17:09 jeff https://explain.depesz.com/ is helpful for examining/sharing the results, if you're not already familiar with it.
17:09 Bmagic right, I found that as you were typing. sorry to make you explain
17:09 jeff ah, i see what you did there.
17:09 Bmagic :)
17:09 jeff out for now, back later!
17:10 Bmagic https://explain.depesz.com/s/Df5A
17:11 Bmagic so that seems to have revealed that it's sequential scanning display_field_map
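
A session-local planner experiment can confirm whether the seq scan is a choice or a necessity (if an index path exists, this will expose it):

    -- Discourage seq scans for this session only, then re-test.
    SET enable_seqscan = off;
    EXPLAIN (ANALYZE, BUFFERS)
        SELECT * FROM reporter.old_super_simple_record WHERE id = 671579;
    RESET enable_seqscan;
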
17:16 jeffdavis I can confirm we're seeing the same behavior on our test server with your example query (explain (analyze, buffers) select * from reporter.old_super_simple_record where id=x)
17:16 Bmagic so weird, two databases, one a replica of the other. I just synced up those two settings in postgresql.conf and restarted the primary database. The query is still sequentially scanning that table instead of using an index. I run the same query against the secondary database and it's fast
17:17 Bmagic jeffdavis: yours is sequential scanning?
17:17 jeffdavis test db (slow): https://explain.depesz.com/s/pkwY
17:17 jeffdavis prod db (fast): https://explain.depesz.com/s/UWu5
17:19 Bmagic jeffdavis: interesting. In my case, it's prod that's slow and the secondary database is fast
17:20 Bmagic what's even more interesting is that for your fast case, it's seq scanning display_field_map and apparently that's not taking much time at all
17:24 Bmagic jeffdavis: PG14?
17:24 jeffdavis yes
17:25 Bmagic jeffdavis++
17:25 jeffdavis The servers are pretty different, the test db is the product of pg_dump/pg_restore rather than replication
17:25 Bmagic yeah, I'm tempted to dump/restore as well
17:26 Bmagic I'm out of ideas, so that one comes to mind :)
17:27 jeffdavis well, we're seeing it after dump/restore, but who knows, maybe it would help!
17:27 Bmagic PG voodoo
19:13 ejk_ joined #evergreen
19:14 jmurray_isl joined #evergreen
21:17 jeff Bmagic: most of the time in your slow case above is taken on reading config.metabib_field, which is pretty surprising. how bloated is that table? i can't imagine how bloated it would have to be to slow things down that much. are simple reads of that table slow on the slow db instance?
21:22 jeff 5+ seconds to read 64 rows is... interesting. I wonder how different things look if you disable parallelism with (i think) SET max_parallel_workers_per_gather = 0;
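
Comparing physical size to row count makes bloat obvious for a 64-row table:

    -- A 64-row table should occupy a handful of pages; anything more is bloat.
    SELECT pg_size_pretty(pg_total_relation_size(oid)) AS total_size,
           relpages, reltuples
      FROM pg_class
     WHERE oid = 'config.metabib_field'::regclass;
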
21:28 Bmagic jeff: I vacuum full'ed that table just for fun, no change. I updated max_parallel_workers_per_gather = 0, restarted pg, no improvement. It actually increased to 7 seconds for the query. I love the ideas though, keep 'em coming!
21:29 Bmagic Also: I dumped, restored the DB, that didn't help either :(
21:30 jeff also... something seems amiss with those row estimates. rows=4215263689798428837764998235160576 !!
21:31 Bmagic jeff: hmm, I didn't see that! Wow, that is peculiar
21:33 Bmagic I'm trying to find the place where you see that, I'm not finding it
21:34 jeff where I see what?
21:34 Bmagic the row estimate with that crazy number
21:35 jeff both are high, but the "fast" explain is higher, I think. it's the first line:
21:35 jeff Hash Left Join  (cost=131123.56..8937592211221534614396133393301504.00 rows=4215263689798428837764998235160576 width=274) (actual time=4.154..4.186 rows=1 loops=1)
21:35 jeff the web UI turns estimated rows over actual rows into the "rows x" column
21:36 Bmagic which explain URL?
21:37 jeff it's similarly high on a pg10 db here, scaled down a bit due to smaller bib count, I'm sure. that's probably not quite related to the slow/fast discrep here, but I could be wrong.
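
When estimates blow up like that, the per-column statistics are the place to look; rebuilding them is cheap (standard catalog views, nothing Evergreen-specific):

    -- What the planner believes about the table's columns.
    SELECT attname, n_distinct, null_frac
      FROM pg_stats
     WHERE schemaname = 'config' AND tablename = 'metabib_field';

    -- Rebuild statistics if they look stale.
    ANALYZE config.metabib_field;
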
21:37 jeff the paste a few lines back is from the source here: https://explain.depesz.com/s/UWu5#source
21:37 jeff OH
21:38 jeff dangit, that's jeffdavis's
21:38 jeff sorry.
21:38 jeff Bmagic: did you have a prod/replica set of explain outputs?
21:38 jeff i think I only saw the one, https://explain.depesz.com/s/Df5A
21:39 jeff so yeah, almost everything I said above was related to jeffdavis, but i had it in my head as yours, so some of what i said probably doesn't make sense.
21:39 Bmagic I posted two explains. One that explained the INSERT, and another for the plain select statement.
21:39 Bmagic INSERT: https://explain.depesz.com/s/WK9m#query
21:40 Bmagic SELECT: https://explain.depesz.com/s/Df5A#query
21:40 jeff ah. it would be interesting to see the same SELECT query on your "fast" and "slow" db instances.
21:42 Bmagic on DB2 (fast) https://explain.depesz.com/s/TPmv
21:45 jeff is the query output identical, at least at a glance?
21:46 jeff (on both fast and slow)
21:46 jeff (i'm guessing you'd have noticed and called attention to things if your main and replica were not agreeing here...)
21:47 jeff oh, https://explain.depesz.com/s/TPmv#source doesn't seem to be using EXPLAIN (ANALYZE, BUFFERS)
21:48 Bmagic yeah, I don't have the db2 example with buffers
21:49 jeff does something still use reporter.old_super_simple_record?
21:49 Bmagic and I can't run it now because I have the second DB broken at the moment. I'm working on getting it back in sync and I'll post the BUFFERS version. It'll be about 45 minutes if history is any indicator
21:49 jeff I mean, there is a reason rmsr exists... ;-)
21:49 jeff ah, ok.
21:49 Bmagic jeff: yep, it's called when inserting new rows into biblio.record_entry
21:50 jeff good luck! going to carve a pumpkin.
21:50 Bmagic good luck to you too!
22:11 sandbergja joined #evergreen
22:34 Bmagic jeff: I'm not sure if this is good, but after dumping, restoring, and re-syncing the two DBs, they are more similar in speed for this explain analyze. Though DB2 is still winning. But neither one is acceptable now
22:35 Bmagic DB1: https://explain.depesz.com/s/srWZ
22:35 Bmagic DB2: https://explain.depesz.com/s/OWwJ
