Time |
Nick |
Message |
04:39 |
|
StomproJ joined #evergreen |
06:54 |
|
redavis joined #evergreen |
07:59 |
|
BDorsey joined #evergreen |
08:12 |
|
collum joined #evergreen |
09:09 |
|
Dyrcona joined #evergreen |
09:44 |
|
dguarrac joined #evergreen |
09:44 |
Bmagic |
Dyrcona: my exporter did 2.2 million in 8 hours. Not sure if I gave you the final number or not |
09:56 |
Dyrcona |
Thanks, Bmagic. I'm going to look at Stompro's cursor code today. I think it could be refactored a little bit and made to work for authorities, too. I'll see how that goes. |
09:57 |
Bmagic |
sweet |
09:57 |
Dyrcona |
I'm 99% convinced that my issue is Perl making its way through the massive result set. |
10:01 |
|
sleary joined #evergreen |
10:08 |
Stompro |
Dyrcona, By getting rid of a bunch of the acps_for_bre arrays and greps and using hashes instead, I'm seeing a 90% speedup in that subroutine also. |
10:09 |
Stompro |
I'm cleaning that up and then I'll share. And I also switched acps_for_bre to using a prepared statement, and that also helped. |
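(For illustration, a SQL-level sketch of the prepared-statement idea; the actual change is in marc_export's Perl/DBI code, and the query shape here is assumed rather than taken from it:)

    -- Plan the per-bib copy lookup once, then execute it per record:
    PREPARE acps_for_bre (BIGINT) AS
        SELECT acp.*
          FROM asset.copy acp
          JOIN asset.call_number acn ON acn.id = acp.call_number
         WHERE acn.record = $1
           AND NOT acp.deleted;
    EXECUTE acps_for_bre (1);  -- example bib ID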
10:09 |
Dyrcona |
Stompro: Cool! Push it to the collab branch. I'm doing something else right now. |
10:13 |
Dyrcona |
Hmm.. OpenSRF router died on development VM sometime in the last week. |
10:14 |
Dyrcona |
Oh! And, it is the Redis branch. |
10:14 |
Bmagic |
that might have something to do with it :) |
10:15 |
Dyrcona |
Did I miss a bug report/issue? |
10:16 |
Dyrcona |
I should update this bad boy to main+Redis. It's currently on Evergreen 3.10. |
10:17 |
Dyrcona |
Next week after the database refresh. |
10:17 |
Bmagic |
Bad boys: whatcha gonna do? |
10:17 |
Dyrcona |
heh. |
10:17 |
Dyrcona |
Services seem to be hung restarting. Do I need to --reset-message-bus? |
10:19 |
Dyrcona |
Well, now I've got a mess.... Some services running, some not. Does the router change PID but keep going? |
10:19 |
* Dyrcona |
tries again. |
10:20 |
berick |
Dyrcona: i posted a breaking change friday (?) but fixed it an hour or 2 later, if that offers any clues |
10:21 |
berick |
the opensrf branch |
10:21 |
berick |
which, btw, forces reset-message-bus to run when needed so it's not manually required anymore. |
10:22 |
Dyrcona |
berick: I think my branch predates either change. I'll grab the latest and plan to upgrade Evergreen to main next week. |
10:22 |
Stompro |
With my export of 200K bibs I can do that at a rate of 2.5 million an hour, 1.25 million with items... although that may be faster now. |
10:22 |
Dyrcona |
Though I guess the deadline for 3.12 is Wednesday.... |
10:23 |
Dyrcona |
Stompro: Cool. I'll take a look at your changes and give them a go. I plan to refactor the cursor code into a new/old function so things can be simplified. I'll see how your changes affect that. |
10:37 |
Dyrcona |
berick: Nice change for batch_size from u64 to u32. Wouldn't want to try exporting more than 4 billion records in a single go. :) |
10:42 |
* Dyrcona |
updates OpenSRF to latest Redis collab. |
10:43 |
|
BAMkubasa joined #evergreen |
10:50 |
|
mantis1 joined #evergreen |
10:55 |
|
briank joined #evergreen |
11:15 |
Dyrcona |
OK. I have the latest Rust marc-export test running. I'll see how long this takes. |
11:27 |
|
jihpringle joined #evergreen |
12:14 |
|
Christineb joined #evergreen |
12:18 |
eeevil |
berick: drive by question (from immediately-post hack-a-way) incoming! meetings will probably keep me away from here for the rest of the day, but want to get it to you asap for RediSRF(?) goodness |
12:20 |
eeevil |
berick: I'm looking at opensrf/redis anew via the collab branches. I have two questions that jump out at me: is the --reset-message-bus documented beyond "you must do this at redis restart" somewhere? (the commit it comes in with isn't obvious to me; also, I want to help remove any instances of boilerplate stuff that we can automate or fold into other actions as early as possible -- ideally before merge -- and I know this is under discussion or at |
12:20 |
eeevil |
least being thought about); and, while the config and password list files don't require it themselves, it /looks/ like the usernames are pinned as "opensrf", "router", and "gateway" in the code that implements each part. is that correct and do you see any reason we /can't/ replace those strings with the value pulled from opensrf_core.xml as with ejabberd? the result being ACL-separated user sets that share a redis instance but can't talk to each |
12:20 |
eeevil |
other or guess each other's user and queue names. |
12:32 |
Dyrcona |
eeevil: berick said earlier that he modified osrf_control to do --reset-message-bus as required, so it's not necessary now. I'm testing that and other updates. That's a good question about the names. (I generally don't change them.) |
12:33 |
Bmagic |
I'm troubleshooting a database issue. explain analyze on an insert into biblio.record_entry results in 6+ seconds. The analyze shows me that it spends the majority of the time in reporter.simple_rec_trigger(). Which is odd, because if it's an INSERT, it should just skip down to RETURN. Weird right? |
12:33 |
Dyrcona |
Bmagic: No. It still builds the rmsr entry. |
12:34 |
Dyrcona |
Wait until you update an authority with hundreds of bibs "attached." :) |
12:35 |
Bmagic |
this is EG 3.11.1 with queued ingest. I updated the global flag to turn off queued ingest and tested it, and vice versa. It's 13 seconds for an insert with it turned off, and 6 seconds with it turned on. But still, 6 seconds seems insane for an insert on biblio.record_entry, especially if we're deferring all of the metabib stuff for the ingester process to come later |
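(A quick way to see which ingest-related global flags are in play; the "ingest." prefix is an assumption about the flag names, so adjust the pattern as needed:)

    -- List ingest-related global flags and their current state:
    SELECT name, enabled, value
      FROM config.global_flag
     WHERE name LIKE 'ingest.%'
     ORDER BY name;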
12:35 |
Dyrcona |
Updating/inserting bibs and auths can be slow because of all the triggers. |
12:36 |
Bmagic |
I got here because it's breaking the course reserves interface. Taking too long |
12:37 |
* Dyrcona |
colour me unsurprised. |
12:37 |
Dyrcona |
"Color me your color, baby. Color me your car." |
12:44 |
jeff |
|
12:44 |
jeff |
er, hi. |
12:49 |
Dyrcona |
Hi, jeff! |
12:58 |
Dyrcona |
Hm. This gets tricky. |
13:03 |
Dyrcona |
I wonder if Marque::Authority and Marque::Biblio should be refactored to have a common base class: Marque::Record? Seems a bit crazy for something that's all in one file.... Maybe if we split it all out some day? |
13:17 |
Dyrcona |
Ooh. Found an unrelated bug with the error output that I'll patch with a separate commit. |
13:32 |
Dyrcona |
Stompro: I pushed a couple of commits to the marc_export collab branch: 1) a refactor of the cursor code that adds a new CLI option and expands the cursor to work with authorities, and 2) a fix for using a wrong variable when reporting errors for an invalid --type parameter. |
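(A rough SQL-level sketch of what the cursor approach looks like; the real logic is in marc_export's Perl, and the column list and batch size here are illustrative:)

    BEGIN;
    DECLARE bib_cursor NO SCROLL CURSOR FOR
        SELECT id, marc FROM biblio.record_entry WHERE NOT deleted;
    -- Fetch in batches instead of pulling the whole result set into memory at once:
    FETCH FORWARD 10000 FROM bib_cursor;  -- repeat until no rows come back
    CLOSE bib_cursor;
    COMMIT;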
13:36 |
Stompro |
Dyrcona, I'm playing around with requesting the acns and acps in batch, to see if performing fewer individual SQL requests helps. |
13:44 |
Dyrcona |
OK. You might want to rebase your branch on the working repo. |
13:57 |
Bmagic |
Dyrcona: eeevil: and anyone else. Disecting my query pathway through the DB, I found a line in reporter.simple_rec_update() that seems to be the culprit https://explain.depesz.com/s/WK9m |
13:59 |
Bmagic |
Disecting/Dissecting |
14:02 |
Dyrcona |
It's a unique constraint check causing the issue? |
14:02 |
Dyrcona |
Maybe I'm not understanding that correctly. |
14:02 |
Bmagic |
essentially it's this: INSERT INTO reporter.materialized_simple_record SELECT DISTINCT ON (id) * FROM reporter.old_super_simple_record WHERE id = 671579; |
14:02 |
Dyrcona |
Left joins are also a big hit. |
14:03 |
Dyrcona |
That's where the unique comes from then.... |
14:03 |
Bmagic |
which is buried in reporter.simple_rec_update(ID, BOOL) |
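(Paraphrased shape of that refresh, based on the query quoted above rather than the function's actual source:)

    -- reporter.simple_rec_update(id, ...) boils down to roughly:
    DELETE FROM reporter.materialized_simple_record WHERE id = 671579;
    INSERT INTO reporter.materialized_simple_record
        SELECT DISTINCT ON (id) *
          FROM reporter.old_super_simple_record
         WHERE id = 671579;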
14:04 |
Dyrcona |
Well, maybe that is due for a refactor, too. :) |
14:04 |
Bmagic |
select count(*) from reporter.old_super_simple_record took 31 seconds to complete |
14:05 |
Dyrcona |
Yeah, I think that's a view, and it's probably where all of the left joins happen. |
14:06 |
Dyrcona |
I'd have to look to make sure, though. |
14:11 |
Bmagic |
maybe the issue is vacuum analyze after* queued ingest has finished |
14:12 |
Dyrcona |
You're saying to do a vacuum analyze after queued ingest, or one is happening already? |
14:12 |
Bmagic |
I'm saying that I'm not sure I did that step after the queued ingest finished ingesting the 300k bibs I migrated |
14:13 |
Dyrcona |
Well, it probably wouldn't hurt to do the affected tables. |
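(For example; the table list here is illustrative, not exhaustive:)

    -- Refresh planner statistics on tables touched by the bulk load:
    VACUUM (ANALYZE) biblio.record_entry;
    VACUUM (ANALYZE) metabib.display_entry;
    VACUUM (ANALYZE) reporter.materialized_simple_record;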
14:13 |
Bmagic |
That's most likely it, thanks for being my sounding wall! |
14:16 |
Dyrcona |
Bmagic: This is the most likely suspect: metabib.wide_display_entry. It's a view, and reporter.old_super_simple_record is built from metabib.wide_display_entry. |
14:17 |
Bmagic |
yep |
14:17 |
Bmagic |
inspecting that view was "fun" |
14:18 |
Dyrcona |
Oof.... Yet more views involved.... |
14:18 |
Dyrcona |
View built on views built on views. |
14:18 |
Bmagic |
that's right |
14:25 |
Dyrcona |
A View to a pg_kill. |
14:25 |
Dyrcona |
OK, pg_cancel_backend.... :) |
14:35 |
|
BAMkubasa joined #evergreen |
14:52 |
|
jonadab joined #evergreen |
15:31 |
pinesol |
News from commits: LP#1989151: Add links to secondary management interfaces <https://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=03b93d925c69fd3b055bba0678b33f191aa58060> |
15:39 |
|
jihpringle joined #evergreen |
15:43 |
|
jihpringle joined #evergreen |
15:55 |
Dyrcona |
So, I'm using the Redis branch and I have a script that is failing to authenticate using Cronscript->authenticate. I'll see if I can figure it out. |
15:55 |
Bmagic |
Dyrcona++ |
15:58 |
Dyrcona |
It might not be the Redis branch. It might be my code. I'm pretty sure this used to work, though. |
15:59 |
Dyrcona |
I'm going to try it on a non-Redis VM, but that one is also a version behind the Redis VM when it comes to Evergreen. |
16:03 |
Dyrcona |
Ok, can_have_users is false on the org_unit. I think that prevents login, doesn't it? |
16:03 |
Dyrcona |
I'll change it and see what happens. |
16:03 |
Bmagic |
that might be it |
16:06 |
Dyrcona |
Ugh..... It's the ou_type.... |
16:07 |
Dyrcona |
Right.... Need to change the update. |
16:13 |
Dyrcona |
Authentication failed.... |
16:14 |
Dyrcona |
Parent ou, too, maybe? |
16:14 |
Dyrcona |
No. That shouldn't be it. |
16:15 |
Dyrcona |
My update worked... I can see it in the database, and I bet I have to run autogen.sh... |
16:16 |
Dyrcona |
Nope. Not even after autogen.sh and --restart-all.... Something else is wrong. Guess I'll check the logs. |
16:21 |
Dyrcona |
Heh. I'm an idiot. I changed my Evergreen password since I wrote the original script that I copied and modified..... Out of date, hard coded credentials were the issue. |
16:22 |
Dyrcona |
Now, should I get an empty array from open-ils.circ.checkin.override? |
16:28 |
Dyrcona |
OK. I guess so. Looks like we hit a return; at the end. |
16:29 |
Dyrcona |
I suppose I could write something to verify the copy and circulation states. |
16:33 |
Bmagic |
I think my issue might be a PG15 thing, specifically PG15 with a replicated secondary DB. I'm not sure what setting I need to tweak, but I can confirm that after a VACUUM FULL on metabib.*, it's still slow. But when I run the query on the secondary database, it's nice and fast. A clue? |
16:38 |
Bmagic |
Maybe I can change PG configuration to eliminate the secondary DB to prove it has something to do with that (or not) |
16:38 |
Dyrcona |
Nope. None of them checked in. |
16:40 |
Dyrcona |
Bmagic: Could be. I often have queries fail on the secondary DB, 'cause changed rows from replication are needed. I know that's not your issue, but you shouldn't be inserting into a replicated db, so I'm not sure what you're checking. |
16:40 |
Bmagic |
This is my test query: explain analyze select * from reporter.old_super_simple_record where id=671579 |
16:41 |
Bmagic |
6+ seconds on db1. and less than one second on db2 |
16:42 |
Dyrcona |
OK. That's right. |
16:43 |
Dyrcona |
Hm... I need to find the circulations. The copies are all Lost status.... They're also deleted, and I thought using the copy_id would resolve the latter. |
16:44 |
Dyrcona |
Guess I'll pick that up in the morning or some other time. |
16:46 |
|
mantis1 left #evergreen |
16:47 |
jeffdavis |
Bmagic: just to confirm, you've done ANALYZE as well as VACUUM FULL on your metabib tables in the primary db? |
16:47 |
Bmagic |
jeffdavis: yep |
16:50 |
Dyrcona |
You did a VACUUM FULL not just a VACUUM or VACUUM ANALYZE? |
16:51 |
Bmagic |
VACUUM FULL ANALYZE <table> |
16:51 |
Bmagic |
Repeated for all tables in metabib |
16:52 |
Bmagic |
retested, and was dissappointed. Restarted postgres, still dissappointed |
16:52 |
Bmagic |
disappointed |
16:55 |
jeffdavis |
Is it consistently ~6 seconds even for other IDs? Like, not 4 or 8 seconds? |
16:59 |
jeffdavis |
We were having an issue on a test server where calls to open-ils.pcrud.search.circ were consistently taking about 6.5s per circ. I never got to the bottom of it (fortunately we're not seeing that in production). It's a long shot but maybe there is a connection? Don't want to send you down a rabbit hole though! |
17:00 |
Dyrcona |
There are only 1 or 2 tables involved in that function. Everything else is 4 views that percolate up. |
17:00 |
jeffdavis |
But the flesh fields for the problematic call include wide_display_entry. |
17:01 |
Dyrcona |
Which is a view, IIRC. |
17:01 |
* Dyrcona |
calls it a day. I'll be back tomorrow if the creek don't rise. |
17:01 |
jeffdavis |
CALL: open-ils.pcrud open-ils.pcrud.search.circ "[...]",{"id":[...]},{"flesh":4,"flesh_fields":{"circ":["target_copy","workstation","checkin_workstation"],"acp":["call_number","holds_count","status","circ_lib","location","floating","age_protect","parts"],"acpm":["part"],"acn":["record","owning_lib","prefix","suffix"],"bre":["wide_display_entry"]},"select":{"bre":["id"]},"order_by":{"circ":["xact_start"]}} |
17:02 |
jeff |
Bmagic: it would be interesting to compare the explain (analyze, buffers) with the query on the two postgresql instances. database config the same between the two? hardware? |
17:02 |
jeff |
Bmagic: what replication method are you using? |
17:03 |
Bmagic |
jeff: I've got the secondary DB in standby mode, configured "primary_conninfo" to connect to the primary db. Plus, if they get out of sync, we have "restore_command" configured to pull the wal logs from the NFS share where DB1 writes them |
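(One quick health check for that setup, run on the primary; assumes PG 10+ column names:)

    -- Is the standby connected, streaming, and keeping up?
    SELECT client_addr, state,
           pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
      FROM pg_stat_replication;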
17:03 |
jeff |
basic health checks look good? no long-running trng transactions, etc? |
17:04 |
Bmagic |
the secondary DB has effective_cache_size = 64GB, compared to primary effective_cache_size = 200GB |
17:04 |
jeff |
s/trng // |
17:05 |
Bmagic |
primary: work_mem = 512MB. secondary: work_mem = 160MB |
17:05 |
Bmagic |
otherwise, the two machines have the same postgresql.conf if you ignore the bits for the secondary needing to know how to connect to primary |
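(A quick way to diff the planner-relevant settings on the two instances, for example:)

    -- Run on each instance and compare the output:
    SELECT name, setting, unit
      FROM pg_settings
     WHERE name IN ('work_mem', 'effective_cache_size', 'shared_buffers',
                    'random_page_cost', 'max_parallel_workers_per_gather')
     ORDER BY name;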
17:06 |
jeff |
ok, I can see how that would likely result in different planner decisions. buffers will likely be your friend in determining the difference. |
17:06 |
jeff |
(but that may not give you your answer.) |
17:06 |
Bmagic |
jeff: I'm afraid I'm not following the buffers idea. Do I need to issue a slightly different query to get more information? |
17:07 |
Bmagic |
or are you just talking about "free -f" ? |
17:07 |
jeff |
passing the buffers argument to explain, so you get useful information on what IO took place. |
17:08 |
jeff |
EXPLAIN (ANALYZE, BUFFERS) SELECT...; |
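(For example, with the test query from earlier plugged in:)

    EXPLAIN (ANALYZE, BUFFERS)
        SELECT * FROM reporter.old_super_simple_record WHERE id = 671579;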
17:08 |
jeff |
https://www.postgresql.org/docs/15/using-explain.html#USING-EXPLAIN-ANALYZE "EXPLAIN has a BUFFERS option that can be used with ANALYZE to get even more run time statistics" |
17:09 |
jeff |
https://explain.depesz.com/ is helpful for examining/sharing the results, if you're not already familiar with it. |
17:09 |
Bmagic |
right, I found that as you were typing. sorry to make you explain |
17:09 |
jeff |
ah, i see what you did there. |
17:09 |
Bmagic |
:) |
17:09 |
jeff |
out for now, back later! |
17:10 |
Bmagic |
https://explain.depesz.com/s/Df5A |
17:11 |
Bmagic |
so that seems to have revealed that it's sequential scanning display_field_map |
17:16 |
jeffdavis |
I can confirm we're seeing the same behavior on our test server with your example query (explain (analyze, buffers) select * from reporter.old_super_simple_record where id=x) |
17:16 |
Bmagic |
so weird, two databases, one a replicant of the other. I just synced up those two settings in postgresql.conf and restarted the primary database. The query is still sequentially scanning that table instead of using an index. I run the same query against the secondary database and it's fast |
17:17 |
Bmagic |
jeffdavis: yours is sequential scanning? |
17:17 |
jeffdavis |
test db (slow): https://explain.depesz.com/s/pkwY |
17:17 |
jeffdavis |
prod db (fast): https://explain.depesz.com/s/UWu5 |
17:19 |
Bmagic |
jeffdavis: interesting. In my case, it's prod that's slow and the secondary database is fast |
17:20 |
Bmagic |
what's even more interesting is that for your fast case, it's seq scanning display_field_map and apparently that's not taking much time at all |
17:24 |
Bmagic |
jeffdavis: PG14? |
17:24 |
jeffdavis |
yes |
17:25 |
Bmagic |
jeffdavis++ |
17:25 |
jeffdavis |
The servers are pretty different, the test db is the product of pg_dump/pg_restore rather than replication |
17:25 |
Bmagic |
yeah, I'm tempted to dump/restore as well |
17:26 |
Bmagic |
I'm out of ideas, so that one comes to mind :) |
17:27 |
jeffdavis |
well, we're seeing it after dump/restore, but who knows, maybe it would help! |
17:27 |
Bmagic |
PG voodoo |
19:13 |
|
ejk_ joined #evergreen |
19:14 |
|
jmurray_isl joined #evergreen |
21:17 |
jeff |
Bmagic: most of the time in your slow case above is taken on reading config.metabib_field, which is pretty surprising. how bloated is that table? i can't imagine how bloated it would have to be to slow things down that much. are simple reads of that table slow on the slow db instance? |
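(A couple of quick checks along those lines, for example:)

    -- How big is config.metabib_field on disk, and how long does a bare read take?
    SELECT pg_size_pretty(pg_total_relation_size('config.metabib_field'));
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM config.metabib_field;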
21:22 |
jeff |
5+ seconds to read 64 rows is... interesting. I wonder how different things look if you disable parallelism with (i think) SET max_parallel_workers_per_gather = 0; |
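(That is, as a session-level test, so no config change or restart is needed; for example:)

    SET max_parallel_workers_per_gather = 0;
    EXPLAIN (ANALYZE, BUFFERS)
        SELECT * FROM reporter.old_super_simple_record WHERE id = 671579;
    RESET max_parallel_workers_per_gather;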
21:28 |
Bmagic |
jeff: I vacuum full'ed that table just for fun, no change. I updated max_parallel_workers_per_gather = 0, restarted pg, no difference. It actually increased to 7 seconds for the query. I love the ideas though, keep em coming! |
21:29 |
Bmagic |
Also: I dumped, restored the DB, that didn't help either :( |
21:30 |
jeff |
also... something seems amiss with those row estimates. rows=4215263689798428837764998235160576 !! |
21:31 |
Bmagic |
jeff: hmm, I didn't see that! Wow, that is peculiar |
21:33 |
Bmagic |
I'm trying to find the place where you see that, I'm not finding it |
21:34 |
jeff |
where I see what? |
21:34 |
Bmagic |
the row estimate with that crazy number |
21:35 |
jeff |
both are high, but the "fast" explain is higher, I think. it's the first line: |
21:35 |
jeff |
Hash Left Join (cost=131123.56..8937592211221534614396133393301504.00 rows=4215263689798428837764998235160576 width=274) (actual time=4.154..4.186 rows=1 loops=1) |
21:35 |
jeff |
the web UI turns estimated rows over actual rows into the "rows x" column |
21:36 |
Bmagic |
which explain URL? |
21:37 |
jeff |
it's similarly high on a pg10 db here, scaled down a bit due to smaller bib count, I'm sure. that's probably not quite related to the slow/fast discrepancy here, but I could be wrong. |
21:37 |
jeff |
the paste a few lines back is from the source here: https://explain.depesz.com/s/UWu5#source |
21:37 |
jeff |
OH |
21:38 |
jeff |
dangit, that's jeffdavis's |
21:38 |
jeff |
sorry. |
21:38 |
jeff |
Bmagic: did you have a prod/replica set of explain outputs? |
21:38 |
jeff |
i think I only saw the one, https://explain.depesz.com/s/Df5A |
21:39 |
jeff |
so yeah, almost everything I said above was related to jeffdavis, but i had it in my head as yours, so some of what i said probably doesn't make sense. |
21:39 |
Bmagic |
I posted two explains. One that explained the INSERT, and another for the plain select statement. |
21:39 |
Bmagic |
INSERT: https://explain.depesz.com/s/WK9m#query |
21:40 |
Bmagic |
SELECT: https://explain.depesz.com/s/Df5A#query |
21:40 |
jeff |
ah. it would be interesting to see the same SELECT query on your "fast" and "slow" db instances. |
21:42 |
Bmagic |
on DB2 (fast) https://explain.depesz.com/s/TPmv |
21:45 |
jeff |
is the query output identical, at least at a glance? |
21:46 |
jeff |
(on both fast and slow) |
21:46 |
jeff |
(i'm guessing you'd have noticed and called attention to things if your main and replica were not agreeing here...) |
21:47 |
jeff |
oh, https://explain.depesz.com/s/TPmv#source doesn't seem to be using EXPLAIN (ANALYZE, BUFFERS) |
21:48 |
Bmagic |
yeah, I don't have the db2 example with buggers |
21:48 |
Bmagic |
buffers* lol |
21:49 |
jeff |
does something still use reporter.old_super_simple_record? |
21:49 |
Bmagic |
and I can't run it now because I have the second DB broken at the moment. I'm working on getting it back in sync and I'll post the BUFFERS version. It'll be about 45 minutes if history is any indicator |
21:49 |
jeff |
I mean, there is a reason rmsr exists... ;-) |
21:49 |
jeff |
ah, ok. |
21:49 |
Bmagic |
jeff: yep, it's called when inserting new rows into biblio.record_entry |
21:50 |
jeff |
good luck! going to carve a pumpkin. |
21:50 |
Bmagic |
good luck to you too! |
22:11 |
|
sandbergja joined #evergreen |
22:34 |
Bmagic |
jeff: I'm not sure if this is good, but after dumping, restoring, and re-syncing the two DBs, they are more similar in speed for this explain analyze. Though DB2 is still winning. But neither one is acceptable now |
22:35 |
Bmagic |
DB1: https://explain.depesz.com/s/srWZ |
22:35 |
Bmagic |
DB2: https://explain.depesz.com/s/OWwJ |