Time |
Nick |
Message |
04:39 |
|
StomproJ joined #evergreen |
06:54 |
|
redavis joined #evergreen |
07:59 |
|
BDorsey joined #evergreen |
08:12 |
|
collum joined #evergreen |
09:09 |
|
Dyrcona joined #evergreen |
09:44 |
|
dguarrac joined #evergreen |
09:44 |
Bmagic |
Dyrcona: my exporter did 2.2 million in 8 hours. Not sure if I gave you the final number or not |
09:56 |
Dyrcona |
Thanks, Bmagic. I'm going to look at Stompro's cursor code today. I think it could be refactored a little bit and made to work for authorities, too. I'll see how that goes. |
09:57 |
Bmagic |
sweet |
09:57 |
Dyrcona |
I'm 99% convinced that my issue is Perl making its way through the massive result set. |
10:01 |
|
sleary joined #evergreen |
10:08 |
Stompro |
Dyrcona, By getting rid of a bunch of the acps_for_bre arrays and greps and using hashes instead, I'm seeing a 90% speedup in that subroutine also. |
10:09 |
Stompro |
I'm cleaning that up and then I'll share. And I also switched acps_for_bre to using a prepared statement, and that also helped. |
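(For illustration, a SQL-level sketch of the prepared-statement idea; the actual change is in marc_export's Perl/DBI code, and the query shape here is assumed rather than taken from it:)

    -- Plan the per-bib copy lookup once, then execute it per record:
    PREPARE acps_for_bre (BIGINT) AS
        SELECT acp.*
          FROM asset.copy acp
          JOIN asset.call_number acn ON acn.id = acp.call_number
         WHERE acn.record = $1
           AND NOT acp.deleted;
    EXECUTE acps_for_bre (1);  -- example bib ID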
10:09 |
Dyrcona |
Stompro: Cool! Push it to the collab branch. I'm doing something else right now. |
10:13 |
Dyrcona |
Hmm.. OpenSRF router died on development VM sometime in the last week. |
10:14 |
Dyrcona |
Oh! And, it is the Redis branch. |
10:14 |
Bmagic |
that might have something to do with it :) |
10:15 |
Dyrcona |
Did I miss a bug report/issue? |
10:16 |
Dyrcona |
I should update this bad boy to main+Redis. It's currently on Evergreen 3.10. |
10:17 |
Dyrcona |
Next week after the database refresh. |
10:17 |
Bmagic |
Bad boys: whatcha gonna do? |
10:17 |
Dyrcona |
heh. |
10:17 |
Dyrcona |
Services seem to be hung restarting. Do I need to --reset-message-bus? |
10:19 |
Dyrcona |
Well, now I've got a mess.... Some services running, some not. Does the router change PID but keep going? |
10:19 |
* Dyrcona |
tries again. |
10:20 |
berick |
Dyrcona: i posted a breaking change friday (?) but fixed it an hour or 2 later, if that offers any clues |
10:21 |
berick |
the opensrf branch |
10:21 |
berick |
which, btw, forces reset-message-bus to run when needed so it's not manually required anymore. |
10:22 |
Dyrcona |
berick: I think my branch predates either change. I'll grab the latest and plan to upgrade Evergreen to main next week. |
10:22 |
Stompro |
With my export of 200K bibs I can do that at a rate of 2.5 million an hour, 1.25 million with items... although that may be faster now. |
10:22 |
Dyrcona |
Though I guess the deadline for 3.12 is Wednesday.... |
10:23 |
Dyrcona |
Stompro: Cool. I'll take a look at your changes and give them a go. I plan to refactor the cursor code into a new/old function so things can be simplified. I'll see how your changes affect that. |
10:37 |
Dyrcona |
berick: Nice change for batch_size from u64 to u32. Wouldn't want to try exporting more than 4 billion records in a single go. :) |
10:42 |
* Dyrcona |
updates OpenSRF to latest Redis collab. |
10:43 |
|
BAMkubasa joined #evergreen |
10:50 |
|
mantis1 joined #evergreen |
10:55 |
|
briank joined #evergreen |
11:15 |
Dyrcona |
OK. I have the latest Rust marc-export test running. I'll see how long this takes. |
11:27 |
|
jihpringle joined #evergreen |
12:14 |
|
Christineb joined #evergreen |
12:18 |
eeevil |
berick: drive by question (from immediately-post hack-a-way) incoming! meetings will probably keep me away from here for the rest of the day, but want to get it to you asap for RediSRF(?) goodness |
12:20 |
eeevil |
berick: I'm looking at opensrf/redis anew via the collab branches. I have two questions that jump out at me: is the --reset-message-bus documented beyond "you must do this at redis restart" somewhere? (the commit it comes in with isn't obvious to me; also, I want to help remove any instances of boilerplate stuff that we can automate or fold into other actions as early as possible -- ideally before merge -- and I know this is under discussion or at |
12:20 |
eeevil |
least being thought about); and, while the config and password list files don't require it themselves, it /looks/ like the usernames are pinned as "opensrf", "router", and "gateway" in the code that implements each part. is that correct and do you see any reason we /can't/ replace those strings with the value pulled from opensrf_core.xml as with ejabberd? the result being ACL-separated user sets that share a redis instance but can't talk to each |
12:20 |
eeevil |
other or guess each other's user and queue names. |
12:32 |
Dyrcona |
eeevil: berick said earlier that he modified osrf_control to do --reset-message-bus as required, so it's not necessary now. I'm testing that and other updates. That's a good question about the names. (I generally don't change them.) |
12:33 |
Bmagic |
I'm troubleshooting a database issue. explain analyze on an insert into biblio.record_entry results in 6+ seconds. The analyze shows me that it spends the majority of the time in reporter.simple_rec_trigger(). Which is odd, because if it's an INSERT, it should just skip down to RETURN. Weird right? |
12:33 |
Dyrcona |
Bmagic: No. It still builds the rmsr entry. |
12:34 |
Dyrcona |
Wait until you update an authority with hundreds of bibs "attached." :) |
12:35 |
Bmagic |
this is EG 3.11.1 with queued ingest. I updated the global flag to turn off queued ingest and tested it, and vice versa. It's 13 seconds for an insert with it turned off, and 6 seconds with it turned on. But still, 6 seconds seems insane for an insert on biblio.record_entry, especially if we're deferring all of the metabib stuff for the ingester process to come later |
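(A quick way to see which ingest-related global flags are in play; the "ingest." prefix is an assumption about the flag names, so adjust the pattern as needed:)

    -- List ingest-related global flags and their current state:
    SELECT name, enabled, value
      FROM config.global_flag
     WHERE name LIKE 'ingest.%'
     ORDER BY name;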
12:35 |
Dyrcona |
Updating/inserting bibs and auths can be slow because of all the triggers. |
12:36 |
Bmagic |
I got here because it's breaking the course reserves interface. Taking too long |
12:37 |
* Dyrcona |
colour me unsurprised. |
12:37 |
Dyrcona |
"Color me your color, baby. Color me your car." |
12:44 |
jeff |
|
12:44 |
jeff |
er, hi. |
12:49 |
Dyrcona |
Hi, jeff! |
12:58 |
Dyrcona |
Hm. This gets tricky. |
13:03 |
Dyrcona |
I wonder if Marque::Authority and Marque::Biblio should be refactored to have a common base class: Marque::Record? Seems a bit crazy for something that's all in one file.... Maybe if we split it all out some day? |
13:17 |
Dyrcona |
Ooh. Found an unrelated bug with the error output that I'll patch with a separate commit. |
13:32 |
Dyrcona |
Stompro: I pushed a couple of commits to the marc_export collab branch: 1) a refactor of the cursor code that adds a new CLI option and expands the cursor to work with authorities, and 2) a fix for using a wrong variable when reporting errors for an invalid --type parameter. |
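(A rough SQL-level sketch of what the cursor approach looks like; the real logic is in marc_export's Perl, and the column list and batch size here are illustrative:)

    BEGIN;
    DECLARE bib_cursor NO SCROLL CURSOR FOR
        SELECT id, marc FROM biblio.record_entry WHERE NOT deleted;
    -- Fetch in batches instead of pulling the whole result set into memory at once:
    FETCH FORWARD 10000 FROM bib_cursor;  -- repeat until no rows come back
    CLOSE bib_cursor;
    COMMIT;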
13:36 |
Stompro |
Dyrcona, I'm playing around with requesting the acns and acps in batch, to see if performing fewer individual SQL requests helps. |
13:44 |
Dyrcona |
OK. You might want to rebase your branch on the working repo. |
13:57 |
Bmagic |
Dyrcona: eeevil: and anyone else. Disecting my query pathway through the DB, I found a line in reporter.simple_rec_update() that seems to be the culprit https://explain.depesz.com/s/WK9m |
13:59 |
Bmagic |
Disecting/Dissecting |
14:02 |
Dyrcona |
It's a unique constraint check causing the issue? |
14:02 |
Dyrcona |
Maybe I'm not understanding that correctly. |
14:02 |
Bmagic |
essentially it's this: INSERT INTO reporter.materialized_simple_record SELECT DISTINCT ON (id) * FROM reporter.old_super_simple_record WHERE id = 671579; |
14:02 |
Dyrcona |
Left joins are also a big hit. |
14:03 |
Dyrcona |
That's where the unique comes from then.... |
14:03 |
Bmagic |
which is buried in reporter.simple_rec_update(ID, BOOL) |
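(Paraphrased shape of that refresh, based on the query quoted above rather than the function's actual source:)

    -- reporter.simple_rec_update(id, ...) boils down to roughly:
    DELETE FROM reporter.materialized_simple_record WHERE id = 671579;
    INSERT INTO reporter.materialized_simple_record
        SELECT DISTINCT ON (id) *
          FROM reporter.old_super_simple_record
         WHERE id = 671579;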
14:04 |
Dyrcona |
Well, maybe that is due for a refactor, too. :) |
14:04 |
Bmagic |
select count(*) from reporter.old_super_simple_record took 31 seconds to complete |
14:05 |
Dyrcona |
Yeah, I think that's a view, and it's probably where all of the left joins happen. |
14:06 |
Dyrcona |
I'd have to look to make sure, though. |
14:11 |
Bmagic |
maybe the issue is vacuum analyze after* queued ingest has finished |
14:12 |
Dyrcona |
You're saying to do a vacuum analyze after queued ingest, or one is happening already? |
14:12 |
Bmagic |
I'm saying that I'm not sure I did that step after the queued ingest finished ingesting the 300k bibs I migrated |
14:13 |
Dyrcona |
Well, it probably wouldn't hurt to do the affected tables. |
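(For example; the table list here is illustrative, not exhaustive:)

    -- Refresh planner statistics on tables touched by the bulk load:
    VACUUM (ANALYZE) biblio.record_entry;
    VACUUM (ANALYZE) metabib.display_entry;
    VACUUM (ANALYZE) reporter.materialized_simple_record;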
14:13 |
Bmagic |
That's most likely it, thanks for being my sounding wall! |
14:16 |
Dyrcona |
Bmagic: This is the most likely suspect: metabib.wide_display_entry. It's a view, and reporter.old_super_simple_record is built from metabib.wide_display_entry. |
14:17 |
Bmagic |
yep |
14:17 |
Bmagic |
inspecting that view was "fun" |
14:18 |
Dyrcona |
Oof.... Yet more views involved.... |
14:18 |
Dyrcona |
View built on views built on views. |
14:18 |
Bmagic |
that's right |
14:25 |
Dyrcona |
A View to a pg_kill. |
14:25 |
Dyrcona |
OK, pg_cancel_backend.... :) |
14:35 |
|
BAMkubasa joined #evergreen |
14:52 |
|
jonadab joined #evergreen |
15:31 |
pinesol |
News from commits: LP#1989151: Add links to secondary management interfaces <https://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=03b93d925c69fd3b055bba0678b33f191aa58060> |
15:39 |
|
jihpringle joined #evergreen |
15:43 |
|
jihpringle joined #evergreen |
15:55 |
Dyrcona |
So, I'm using the Redis branch and I have a script that is failing to authenticate using Cronscript->authenticate. I'll see if I can figure it out. |
15:55 |
Bmagic |
Dyrcona++ |
15:58 |
Dyrcona |
It might not be the Redis branch. It might be my code. I'm pretty sure this used to work, though. |
15:59 |
Dyrcona |
I'm going to try it on a non-Redis VM, but that one is also a version behind the Redis VM when it comes to Evergreen. |
16:03 |
Dyrcona |
Ok, can_have_users is false on the org_unit. I think that prevents login, doesn't it? |
16:03 |
Dyrcona |
I'll change it and see what happens. |
16:03 |
Bmagic |
that might be it |
16:06 |
Dyrcona |
Ugh..... It's the ou_type.... |
16:07 |
Dyrcona |
Right.... Need to change the update. |
16:13 |
Dyrcona |
Authentication failed.... |
16:14 |
Dyrcona |
Parent ou, too, maybe? |
16:14 |
Dyrcona |
No. That shouldn't be it. |
16:15 |
Dyrcona |
My update worked... I can see it in the database, and I bet I have to run autogen.sh... |
16:16 |
Dyrcona |
Nope. Not even after autogen.sh and --restart-all.... Something else is wrong. Guess I'll check the logs. |
16:21 |
Dyrcona |
Heh. I'm an idiot. I changed my Evergreen password since I wrote the original script that I copied and modified..... Out of date, hard coded credentials were the issue. |
16:22 |
Dyrcona |
Now, should I get an empty array from open-ils.circ.checkin.override? |
16:28 |
Dyrcona |
OK. I guess so. Looks like we hit a return; at the end. |
16:29 |
Dyrcona |
I suppose I could write something to verify the copy and circulation states. |
16:33 |
Bmagic |
I think my issue might be a PG15 thing, specifically PG15 with a replicated secondary DB. I'm not sure what setting I need to tweak, but I can confirm that after a VACUUM FULL on metabib.*, it's still slow. But when I run the query on the secondary database, it's nice and fast. A clue? |
16:38 |
Bmagic |
Maybe I can change PG configuration to eliminate the secondary DB to prove it has something to do with that (or not) |
16:38 |
Dyrcona |
Nope. None of them checked in. |
16:40 |
Dyrcona |
Bmagic: Could be. I often have queries fail on the secondary DB, 'cause changed rows from replication are needed. I know that's not your issue, but you shouldn't be inserting into a replicated db, so I'm not sure what you're checking. |
16:40 |
Bmagic |
This is my test query: explain analyze select * from reporter.old_super_simple_record where id=671579 |
16:41 |
Bmagic |
6+ seconds on db1. and less than one second on db2 |
16:42 |
Dyrcona |
OK. That's right. |
16:43 |
Dyrcona |
Hm... I need to find the circulations. The copies are all Lost status.... They're also deleted, and I thought using the copy_id would resolve the latter. |
16:44 |
Dyrcona |
Guess I'll pick that up in the morning or some other time. |
16:46 |
|
mantis1 left #evergreen |
16:47 |
jeffdavis |
Bmagic: just to confirm, you've done ANALYZE as well as VACUUM FULL on your metabib tables in the primary db? |
16:47 |
Bmagic |
jeffdavis: yep |
16:50 |
Dyrcona |
You did a VACUUM FULL not just a VACUUM or VACUUM ANALYZE? |
16:51 |
Bmagic |
VACUUM FULL ANALYZE <table> |
16:51 |
Bmagic |
Repeated for all tables in metabib |
16:52 |
Bmagic |
retested, and was dissappointed. Restarted postgres, still dissappointed |
16:52 |
Bmagic |
disappointed |
16:55 |
jeffdavis |
Is it consistently ~6 seconds even for other IDs? Like, not 4 or 8 seconds? |
16:59 |
jeffdavis |
We were having an issue on a test server where calls to open-ils.pcrud.search.circ were consistently taking about 6.5s per circ. I never got to the bottom of it (fortunately we're not seeing that in production). It's a long shot but maybe there is a connection? Don't want to send you down a rabbit hole though! |
17:00 |
Dyrcona |
There are only 1 or 2 tables involved in that function. Everything else is 4 views that percolate up. |
17:00 |
jeffdavis |
But the flesh fields for the problematic call include wide_display_entry. |
17:01 |
Dyrcona |
Which is a view, IIRC. |
17:01 |
* Dyrcona |
calls it a day. I'll be back tomorrow if the creek don't rise. |
17:01 |
jeffdavis |
CALL: open-ils.pcrud open-ils.pcrud.search.circ "[...]",{"id":[...]},{"flesh":4,"flesh_fields":{"circ":["target_copy","workstation","checkin_workstation"],"acp":["call_number","holds_count","status","circ_lib","location","floating","age_protect","parts"],"acpm":["part"],"acn":["record","owning_lib","prefix","suffix"],"bre":["wide_display_entry"]},"select":{"bre":["id"]},"order_by":{"circ":["xact_start"]}} |
17:02 |
jeff |
Bmagic: it would be interesting to compare the explain (analyze, buffers) with the query on the two postgresql instances. database config the same between the two? hardware? |
17:02 |
jeff |
Bmagic: what replication method are you using? |
17:03 |
Bmagic |
jeff: I've got the secondary DB in standby mode, configured "primary_conninfo" to connect to the primary db. Plus, if they get out of sync, we have "restore_command" configured to pull the wal logs from the NFS share where DB1 writes them |
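(One quick health check for that setup, run on the primary; assumes PG 10+ column names:)

    -- Is the standby connected, streaming, and keeping up?
    SELECT client_addr, state,
           pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
      FROM pg_stat_replication;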
17:03 |
jeff |
basic health checks look good? no long-running trng transactions, etc? |
17:04 |
Bmagic |
the secondary DB has effective_cache_size = 64GB, compared to primary effective_cache_size = 200GB |
17:04 |
jeff |
s/trng // |
17:05 |
Bmagic |
primary: work_mem = 512MB. secondary: work_mem = 160MB |
17:05 |
Bmagic |
otherwise, the two machines have the same postgresql.conf if you ignore the bits for the secondary needing to know how to connect to primary |
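(A quick way to diff the planner-relevant settings on the two instances, for example:)

    -- Run on each instance and compare the output:
    SELECT name, setting, unit
      FROM pg_settings
     WHERE name IN ('work_mem', 'effective_cache_size', 'shared_buffers',
                    'random_page_cost', 'max_parallel_workers_per_gather')
     ORDER BY name;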
17:06 |
jeff |
ok, I can see how that would likely result in different planner decisions. buffers will likely be your friend in determining the difference. |
17:06 |
jeff |
(but that may not give you your answer.) |
17:06 |
Bmagic |
jeff: I'm afraid I'm not following the buffers idea. Do I need to issue a slightly different query to get more information? |
17:07 |
Bmagic |
or are you just talking about "free -f" ? |
17:07 |
jeff |
passing the buffers argument to explain, so you get useful information on what IO took place. |
17:08 |
jeff |
EXPLAIN (ANALYZE, BUFFERS) SELECT...; |
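(For example, with the test query from earlier plugged in:)

    EXPLAIN (ANALYZE, BUFFERS)
        SELECT * FROM reporter.old_super_simple_record WHERE id = 671579;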
17:08 |
jeff |
https://www.postgresql.org/docs/15/using-explain.html#USING-EXPLAIN-ANALYZE "EXPLAIN has a BUFFERS option that can be used with ANALYZE to get even more run time statistics" |
17:09 |
jeff |
https://explain.depesz.com/ is helpful for examining/sharing the results, if you're not already familiar with it. |
17:09 |
Bmagic |
right, I found that as you were typing. sorry to make you explain |
17:09 |
jeff |
ah, i see what you did there. |
17:09 |
Bmagic |
:) |
17:09 |
jeff |
out for now, back later! |
17:10 |
Bmagic |
https://explain.depesz.com/s/Df5A |
17:11 |
Bmagic |
so that seems to have revealed that it's sequential scanning display_field_map |
17:16 |
jeffdavis |
I can confirm we're seeing the same behavior on our test server with your example query (explain (analyze, buffers) select * from reporter.old_super_simple_record where id=x) |
17:16 |
Bmagic |
so weird, two databases, one a replicant of the other. I just synced up those two settings in postgresql.conf and restarted the primary database. The query is still sequentially scanning that table instead of using an index. I run the same query against the secondary database and it's fast |
17:17 |
Bmagic |
jeffdavis: yours is sequential scanning? |
17:17 |
jeffdavis |
test db (slow): https://explain.depesz.com/s/pkwY |
17:17 |
jeffdavis |
prod db (fast): https://explain.depesz.com/s/UWu5 |
17:19 |
Bmagic |
jeffdavis: interesting. In my case, it's prod that's slow and the secondary database is fast |
17:20 |
Bmagic |
what's even more interesting is that for your fast case, it's seq scanning display_field_map and apparently that's not taking much time at all |
17:24 |
Bmagic |
jeffdavis: PG14? |
17:24 |
jeffdavis |
yes |
17:25 |
Bmagic |
jeffdavis++ |
17:25 |
jeffdavis |
The servers are pretty different, the test db is the product of pg_dump/pg_restore rather than replication |
17:25 |
Bmagic |
yeah, I'm tempted to dump/restore as well |
17:26 |
Bmagic |
I'm out of ideas, so that one comes to mind :) |
17:27 |
jeffdavis |
well, we're seeing it after dump/restore, but who knows, maybe it would help! |
17:27 |
Bmagic |
PG voodoo |
19:13 |
|
ejk_ joined #evergreen |
19:14 |
|
jmurray_isl joined #evergreen |
21:17 |
jeff |
Bmagic: most of the time in your slow case above is taken on reading config.metabib_field, which is pretty surprising. how bloated is that table? i can't imagine how bloated it would have to be to slow things down that much. are simple reads of that table slow on the slow db instance? |
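(A couple of quick checks along those lines, for example:)

    -- How big is config.metabib_field on disk, and how long does a bare read take?
    SELECT pg_size_pretty(pg_total_relation_size('config.metabib_field'));
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM config.metabib_field;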
21:22 |
jeff |
5+ seconds to read 64 rows is... interesting. I wonder how different things look if you disable parallelism with (i think) SET max_parallel_workers_per_gather = 0; |
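(That is, as a session-level test, so no config change or restart is needed; for example:)

    SET max_parallel_workers_per_gather = 0;
    EXPLAIN (ANALYZE, BUFFERS)
        SELECT * FROM reporter.old_super_simple_record WHERE id = 671579;
    RESET max_parallel_workers_per_gather;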
21:28 |
Bmagic |
jeff: I vacuum full'ed that table just for fun, no change. I updated max_parallel_workers_per_gather = 0, restarted pg, no difference. It actually increased to 7 seconds for the query. I love the ideas though, keep em coming! |
21:29 |
Bmagic |
Also: I dumped, restored the DB, that didn't help either :( |
21:30 |
jeff |
also... something seems amiss with those row estimates. rows=4215263689798428837764998235160576 !! |
21:31 |
Bmagic |
jeff: hmm, I didn't see that! Wow, that is peculiar |
21:33 |
Bmagic |
I'm trying to find the place where you see that, I'm not finding it |
21:34 |
jeff |
where I see what? |
21:34 |
Bmagic |
the row estimate with that crazy number |
21:35 |
jeff |
both are high, but the "fast" explain is higher, I think. it's the first line: |
21:35 |
jeff |
Hash Left Join (cost=131123.56..8937592211221534614396133393301504.00 rows=4215263689798428837764998235160576 width=274) (actual time=4.154..4.186 rows=1 loops=1) |
21:35 |
jeff |
the web UI turns estimated rows over actual rows into the "rows x" column |
21:36 |
Bmagic |
which explain URL? |
21:37 |
jeff |
it's similarly high on a pg10 db here, scaled down a bit due to smaller bib count, I'm sure. that's probably not quite related to the slow/fast discrepancy here, but I could be wrong. |
21:37 |
jeff |
the paste a few lines back is from the source here: https://explain.depesz.com/s/UWu5#source |
21:37 |
jeff |
OH |
21:38 |
jeff |
dangit, that's jeffdavis's |
21:38 |
jeff |
sorry. |
21:38 |
jeff |
Bmagic: did you have a prod/replica set of explain outputs? |
21:38 |
jeff |
i think I only saw the one, https://explain.depesz.com/s/Df5A |
21:39 |
jeff |
so yeah, almost everything I said above was related to jeffdavis, but i had it in my head as yours, so some of what i said probably doesn't make sense. |
21:39 |
Bmagic |
I posted two explains. One that explained the INSERT, and another for the plain select statement. |
21:39 |
Bmagic |
INSERT: https://explain.depesz.com/s/WK9m#query |
21:40 |
Bmagic |
SELECT: https://explain.depesz.com/s/Df5A#query |
21:40 |
jeff |
ah. it would be interesting to see the same SELECT query on your "fast" and "slow" db instances. |
21:42 |
Bmagic |
on DB2 (fast) https://explain.depesz.com/s/TPmv |
21:45 |
jeff |
is the query output identical, at least at a glance? |
21:46 |
jeff |
(on both fast and slow) |
21:46 |
jeff |
(i'm guessing you'd have noticed and called attention to things if your main and replica were not agreeing here...) |
21:47 |
jeff |
oh, https://explain.depesz.com/s/TPmv#source doesn't seem to be using EXPLAIN (ANALYZE, BUFFERS) |
21:48 |
Bmagic |
yeah, I don't have the db2 example with buggers |
21:48 |
Bmagic |
buffers* lol |
21:49 |
jeff |
does something still use reporter.old_super_simple_record? |
21:49 |
Bmagic |
and I can't run it now because I have the second DB broken at the moment. I'm working on getting it back in sync and I'll post the BUFFERS version. It'll be about 45 minutes if history is any indicator |
21:49 |
jeff |
I mean, there is a reason rmsr exists... ;-) |
21:49 |
jeff |
ah, ok. |
21:49 |
Bmagic |
jeff: yep, it's called when inserting new rows into biblio.record_entry |
21:50 |
jeff |
good luck! going to carve a pumpkin. |
21:50 |
Bmagic |
good luck to you too! |
22:11 |
|
sandbergja joined #evergreen |
22:34 |
Bmagic |
jeff: I'm not sure if this is good, but after dumping, restoring, and re-syncing the two DBs, they are more similar in speed for this explain analyze. Though DB2 is still winning. But neither one is acceptable now |
22:35 |
Bmagic |
DB1: https://explain.depesz.com/s/srWZ |
22:35 |
Bmagic |
DB2: https://explain.depesz.com/s/OWwJ |