Time |
Nick |
Message |
00:09 |
jeff |
Bmagic: taking 4-6s to seq scan a table containing 26 rows is... neat. :-P |
00:11 |
jeff |
is SELECT * FROM config.display_field_map; slow? |
04:24 |
|
Guest13 joined #evergreen |
04:26 |
Guest13 |
Hello, I have a question about the Evergreen software. Is there someone who can explain it to me in an easy way :))
06:24 |
|
redavis joined #evergreen |
08:06 |
|
BDorsey joined #evergreen |
08:35 |
|
mmorgan joined #evergreen |
08:37 |
|
dguarrac joined #evergreen |
08:54 |
|
smayo joined #evergreen |
09:05 |
|
Dyrcona joined #evergreen |
09:16 |
|
mantis1 joined #evergreen |
09:21 |
|
sandbergja joined #evergreen |
09:25 |
Dyrcona |
berick: I'm getting the following with the latest Redis branch for Opensrf: osrf_control -l --start-all |
09:25 |
Dyrcona |
[auth] WRONGPASS invalid username-password pair, at /usr/share/perl5/Redis.pm line 311. |
09:26 |
Dyrcona |
This is after a VM reboot, so it looks like the auto --reset-message-bus isn't always working. |
09:26 |
Dyrcona |
This is what I have checked out/installed: 437e8c67 (HEAD -> collab/berick/lp2017941-opensrf-on-redis-v3, working/collab/berick/lp2017941-opensrf-on-redis-v3) LP2017941 Auto-reset bus accounts; simplify and fix |
09:33 |
Dyrcona |
For the logs: `osrf_control -l --reset-message-bus` fixes it. |
09:38 |
berick |
Dyrcona: hm, ok. it does require a ./configure to rebuild the osrf_control |
09:40 |
Dyrcona |
Hmm. Maybe I just did make. |
09:40 |
Dyrcona |
make should detect if ./configure is necessary, though. |
09:47 |
Bmagic |
jeff: that table doesn't exist config.display_field_map. I thought it was talking about metabib.display_entry |
09:48 |
Dyrcona |
OK. Explicitly doing ./configure fixed it. That's probably a bug in how we're using autotools. Make with autotools is supposed to detect when configure is needed. |
09:49 |
berick |
Dyrcona: cool, good to know |
09:49 |
Dyrcona |
Bmagic: If you're still hunting the same thing from yesterday, metabib.display_entry is (I think) the only actual table involved in all of those views. There might be one other.
09:50 |
Dyrcona |
Also, I'm trying to figure out why a bunch of checkins fail silently. I think I'll just override all events instead of playing whack-a-mole.
09:51 |
Bmagic |
jeff: lol, sorry, nevermind, that table does exist. select * takes .59 |
09:51 |
Bmagic |
Seq Scan on display_field_map cdfm (cost=0.00..1.32 rows=1 width=37) (actual time=6375.566..6375.573 rows=1 loops=1)
09:52 |
Bmagic |
something is seriously wrong. I agree jeff. 6 seconds to seq scan a 26 row table. Doesn't make any sense |
09:53 |
Bmagic |
Dyrcona: yep, same thing from yesterday |
09:54 |
Bmagic |
Dyrcona: |
09:54 |
Bmagic |
DB1: https://explain.depesz.com/s/srWZ |
09:54 |
Bmagic |
DB2: https://explain.depesz.com/s/OWwJ |
09:55 |
Bmagic |
I had a thought that maybe postgres "buffered" that table, and it was reusing a method that it used before, instead of "re-thinking". I figured a postgres restart would have cleared its brain and caused it to re-think stuff?
09:57 |
eeevil |
Bmagic: I'm coming in late, and might have missed something, but the seq scan line you pasted at 9:51 (for me) says .007 milliseconds to scan cdfm, not 6 seconds |
09:58 |
Bmagic |
eeevil: I pasted an example of a fast one and a slow one. Both machines contain the same database. PG15, DB1->DB2 replica. DB1 is slow, DB2 is fast. I decided to pg_dump DB1 and then delete the database, and pg_restore. Then resync DB2. Now they're both slow
09:59 |
Bmagic |
I'm considering an index for config.display_field_map.name |
09:59 |
eeevil |
what I mean is, in "Seq Scan on display_field_map cdfm (cost=0.00..1.32 rows=1 width=37) (actual time=6375.566..6375.573 rows=1 loops=1)" the actual time is a start..finish range, measured in ms |
10:00 |
Bmagic |
eeevil: oh, so maybe I misunderstand where it's slow? |
10:01 |
eeevil |
I think explain.depesz might be confused, and leading you astray
10:01 |
Bmagic |
where are you seeing the 6 second spike? metabib_field_pkey? |
10:07 |
Bmagic |
the index didn't help |
10:07 |
Bmagic |
another idea I had was to truncate metabib.display_entry and reingest |
10:11 |
eeevil |
yeah, I do see exactly where the 6s are coming from: JIT |
10:12 |
eeevil |
it's right there in the JIT timing summary: Timing: Generation 30.673 ms, Inlining 29.530 ms, Optimization 3683.782 ms, Emission 2662.593 ms, Total 6406.579 ms |
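To make the units concrete, using the numbers pasted above: `actual time=start..end` in a plan node is a millisecond range, so the seq scan itself takes well under a millisecond, while the JIT summary at the bottom accounts for the ~6 seconds. A small sketch (only parsing the strings already quoted in-channel):

```python
import re

# Node line from the plan pasted above; `actual time=start..end` is in ms.
node = ("Seq Scan on display_field_map cdfm (cost=0.00..1.32 rows=1 width=37) "
        "(actual time=6375.566..6375.573 rows=1 loops=1)")
start, end = map(float, re.search(r"actual time=([\d.]+)\.\.([\d.]+)", node).groups())
print(f"seq scan itself: {end - start:.3f} ms")  # ~0.007 ms

# JIT summary from the same plan: this is where the time actually goes.
jit = ("Timing: Generation 30.673 ms, Inlining 29.530 ms, "
       "Optimization 3683.782 ms, Emission 2662.593 ms, Total 6406.579 ms")
total = float(re.search(r"Total ([\d.]+) ms", jit).group(1))
print(f"JIT overhead: {total:.0f} ms")           # ~6407 ms
```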
10:13 |
Bmagic |
ah, all the way at the bottom |
10:14 |
eeevil |
I would recommend turning off all the JIT options individually, and then consider turning them back on one at a time. probably start with inlining (and generation if needed for inlining, I don't recall if it is)
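For reference, the JIT knobs live in postgresql.conf (PostgreSQL 11+). A sketch of the relevant settings; the values shown are the stock defaults, not a recommendation, and -1 on any of the cost thresholds means that stage never triggers:

```
# postgresql.conf -- JIT-related settings (PostgreSQL 11+)
jit = on                          # master switch; 'off' disables JIT entirely
jit_above_cost = 100000           # estimated cost above which JIT kicks in; -1 disables
jit_inline_above_cost = 500000    # cost above which JIT inlines functions; -1 disables
jit_optimize_above_cost = 500000  # cost above which expensive optimization passes run; -1 disables
```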
10:15 |
Bmagic |
researching.... |
10:16 |
eeevil |
(so, in defence of explain.depesz, it's doing the best it can to highlight the JIT slowdown, but it def does still require understanding of what's being presented) |
10:19 |
Bmagic |
eeevil++ # jit_above_cost = -1 fixed it |
10:19 |
Bmagic |
good night that was bothering me |
10:20 |
Bmagic |
6000ms down to 15ms |
10:20 |
Bmagic |
so now I have to figure out what other things are broken with that disabled |
10:20 |
berick |
TIL about PG JIT |
10:21 |
Bmagic |
berick++ lol |
10:21 |
Bmagic |
I had to Google TIL |
10:22 |
Bmagic |
TIL about TIL |
10:22 |
berick |
heh |
10:31 |
eeevil |
disabling JIT shouldn't break anything. it'll just behave more like pg 10 or 11. which is to say, more predictable, sometimes a little slower, and sometimes (as here) much faster |
10:33 |
Bmagic |
eeevil++ |
10:45 |
eeevil |
JIT is really good for DW and analytics, ESPECIALLY when you have, say, a pile of GPUs and the citus column store extension installed and you're folding proteins or something. stuff where there's a LOT of data, it's extremely well controlled in type and datum size, but you don't know the cardinality or distribution of the data. TBH, that's basically the opposite of EG's data for the most part ;)
10:46 |
Bmagic |
I didn't turn it completely off. There is a setting for just straight turning it off "jit = on" |
10:46 |
Bmagic |
this is the one setting that I changed where it changed the outcome of the analyze: "jit_above_cost = -1" |
10:47 |
Bmagic |
that fixed course reserves :) |
10:50 |
Bmagic |
jeff jeffdavis: Pinging you all just to make sure you see the fix. It should probably be baked into our install instructions, come to think of it
10:54 |
|
briank joined #evergreen |
10:55 |
eeevil |
I'd recommend the install instructions just say "Do not turn on Postgres' JIT capabilities. Evergreen's queries, especially complex ones used for search, are intentionally tuned for non-JIT execution and JIT has been shown to be harmful in some circumstances." then, once we know how/when/what to use from the JIT toolbox, we can change that rec. just IMO... |
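If the docs adopt that wording, the corresponding change is a single line in postgresql.conf (the file's path varies by distribution; a reload or restart applies it):

```
# postgresql.conf -- per the recommendation above, leave PG's JIT disabled for Evergreen
jit = off
```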
10:57 |
eeevil |
the core problem is one of estimation -- we need to invest time/energy into finding the most likely universal cross-column and cross-table stats to configure, because PG will try to use JIT when the stats say "you're going to have to compare 1 BEEEELLION INTEGERS", and spend multiple seconds setting that up for a query, and then the stats were wrong and it compares, like, 3 rows. |
10:59 |
eeevil |
but we can't simply blanket every instance in stats gathering config, because that's a speed issue both coming and going, and MOAR STATS is not necessarily better stats |
11:00 |
Bmagic |
I'll submit the change to our install instructions |
11:05 |
berick |
eeevil: i'm working on a tech ref doc for redis bus addresses / login accounts. https://gist.github.com/berick/b1d26f7179b97635c71c9ac91ac38584 |
11:07 |
eeevil |
berick++ |
11:07 |
eeevil |
thanks! |
11:11 |
eeevil |
berick: so, right off the bat, I think I understood correctly that the first part of the bus address is pinned, and (right now) you couldn't have two separate instances running services and not seeing each other, right? to be able to have bus addresses and redis account paired is the thing I'm looking for, if by convention (some param that defaults to "opensrf" for the bus prefix) or explicitly/intentionally (the bus prefix is the login user name) |
11:13 |
berick |
eeevil: 'opensrf' as part of the bus prefix is convention and unrelated to the redis account. just wanted it to have some kind of prefix. running separate service instances on one Redis instance just means giving each its own domain.
11:15 |
eeevil |
I was advocating for the user name as the prefix so that we don't have to have (maintain, update, restart) dns changes all the time |
11:15 |
eeevil |
is there a reason /not/ to just make the prefix be the user that the redis-connecting thing uses? |
11:17 |
berick |
eeevil: top of head, that would probably work, but would mean some router changes .. i think, need to verify |
11:34 |
jeff |
eeevil++ imparting the JIT knowledge |
11:34 |
eeevil |
eeevil-- # being a curmudgeon re JIT ;) |
11:35 |
jeff |
And agreed, I think explain.depesz.com might need to tweak its pre-JIT logic surrounding the exclusive column. no longer is it "the deepest node has nothing else consuming time" |
11:37 |
berick |
eeevil: thinking through some options.. keeping the 'opensrf' prefix is simplest way forward (for ACL rules, code changes, etc.) but i think we could get the same benefit using router addresses like opensrf:router:$domain:$login -- then teach the services to register to specific routers by domain+login
11:37 |
eeevil |
berick: hrm... I wouldn't think that would be necessary, since the router is named WRT the services and the clients (they know where to send router-distributed requests) and routers get registrations from the servers, so they know where they should be sending messages to be distributed. however, I have /not/ looked at the redis-ized router code, so I'm probably making assumptions about what's recorded and what's computed
11:38 |
eeevil |
I think that removes the ACL-based cross-"user" protection of queues, though
11:39
eeevil
if the $login part is in the wildcard section of the ACL protection, then any client-user connection to redis can send requests at any service, right?
11:41 |
berick |
hm well.. the router ACLs could be done w/o wildcards. e.g. ~opensrf:router:private.localhost:router-01 |
11:41 |
berick |
if we're at the level of specifying specific routers, there's not really any need for the wildcard |
11:46 |
eeevil |
not sure I follow. from the client's perspective, we need 3 bits of info to make a request: 1) service name 2) my local domain 3) the router "prefix" (this is least needed, but will eventually allow us to go routerless, I believe, even with cross-domain HA/LB) |
11:47 |
eeevil |
we already specify those things in opensrf_core.xml (concretely) |
11:48 |
eeevil |
the router prefix is //config/opensrf/router_name |
11:53 |
eeevil |
put another way, imagine translating from an XMPP jid to a redis queue name. jid structure is $router_name@$local_domain/$service_name, and the equiv redis queue might be $router_name:$service_name:$local_domain and the ACL could allow $client_user_name to write to $router_name:* queues |
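The translation described above is mechanical. A hypothetical helper (the function name and code are invented here for illustration, not taken from OpenSRF) showing the jid-to-queue mapping:

```python
def jid_to_queue(jid: str) -> str:
    """Map an XMPP jid ($router_name@$local_domain/$service_name) to the
    equivalent redis queue name ($router_name:$service_name:$local_domain)."""
    router_name, rest = jid.split("@", 1)
    local_domain, service_name = rest.split("/", 1)
    return f"{router_name}:{service_name}:{local_domain}"

print(jid_to_queue("router@private.localhost/open-ils.cat"))
# router:open-ils.cat:private.localhost
```

Under this scheme an ACL allowing `$client_user_name` to write to `$router_name:*` covers every queue the helper can produce for that router.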
11:54 |
Dyrcona |
Stompro: Your latest commit in the collab branch makes a huge difference. After 45 or so minutes, the output is where it previously was after 2 days.
11:56 |
Stompro |
Great, you probably have much larger @orgs, @shelves, @prefix arrays, the grep lookups were probably taking a long time. |
11:57 |
eeevil |
for non-trivial setups, the redis user and ACL file will be machine-generated (for us, at least, and I hope for others, to avoid human error at the config file level), so I'm really not personally concerned about the contents being big or messy. but(!) for trivial setups, it's still extremely simple: the router_name will be "router" and the ACL will be "opensrf can write to router:*"
11:57 |
Dyrcona |
Stompro: Yeah. I'll have to count some of those after lunch. |
12:04 |
berick |
eeevil: what you're describing would almost certainly work. my thought behind e.g. opensrf:router:private.localhost:router-01 is mainly to limit the amount of changes needed to support the use case, and it seems equally machine-generatable.
12:10 |
eeevil |
what's router-01 in this case (IOW, why don't we just have The User Name for an EG instance's routers)? each domain will just have one router, right? so just have it be '$router:$local_domain:$service' (or '$router:$service:$local_domain') where $router might be literally "router" for dev systems.
12:12 |
eeevil |
I mean, if having a global prefix of 'opensrf:' in front of every queue has a benefit, +1. to me that seems like noise we don't need, but I can't fight too hard against namespacing "all of opensrf" into "opensrf:" either. |
12:13 |
berick |
eeevil: router-01 in this example is the router's username. but wait, "each domain will just have one router, right?" -- i thought the point of this convo was that you needed multiple routers per domain (to avoid dns hassle, etc.). |
12:14 |
eeevil |
setting aside an opensrf redis namespace prefix (if that's what it really is), what I'm looking for is the ability to have a set of redis queues that follow a pattern that clients and services can construct in a known way, and /definitely/ segregates one instance's queues from all others that might happen to live inside the same redis instance
12:15 |
eeevil |
we're def talking past each other to some degree :) |
12:23 |
eeevil |
so each /EG instance/ will have exactly 1 router per redis (xmpp) listening hostname. (we actually have 4 domain types, not just public and private). if the router+service queue (xmpp jid for incoming requests) has a pattern like this: "library_A_router_name:open-ils.cat:public.domain.name" (though "library_A_router_name" wouldn't be guessable like that) then we can make ACLs for the redis user called "library_A_client" such that it can put messages
12:23 |
eeevil |
onto the queues that match "library_A_router_name:*", and services can put messages on queues that match "library_A_client:*". the router-to-service pattern is similarly simple, and from the router's perspective can just be based on where it got registrations from.
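A hedged sketch of what that scheme could look like as Redis ACL rules (usernames are taken from the example above; the passwords are placeholders, and real machine-generated rules would grant something narrower than +@all):

```
# users.acl -- illustrative only
# clients for library A may touch the router/service request queues:
user library_A_client on >changeme1 ~library_A_router_name:* +@all
# services/routers for library A may touch client response queues:
user library_A_router_name on >changeme2 ~library_A_client:* +@all
```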
12:30 |
eeevil |
(longer term, I believe we can get to routerless with this pattern of "$purpose_user_name:$access_domain:$purpose_marker" for queues. purpose_user_name includes what opensrf_core.xml calls router_name, and username for services and clients; access_domain is, essentially, public, private, etc (mapping from the actual-dns multi-domain stuff from xmpp and being how we say "I live here (usually private), but I will also answer requests over
12:30 |
eeevil |
there (usually public). and, purpose_marker is "open-ils.cat" or "router" (where services send registrations, because recall that's a pseudo-service!) or "client:$host:$pid" for response-collection queues) |
12:31 |
berick |
eeevil: in your setup, will you have e.g. 2 instances of open-ils.actor listeners running on one domain, where each receives requests via a different router? |
12:32 |
eeevil |
and for each group of "purpose_user_name" that is also a redis user, the ACLs allow their EG-instance peers to talk at them, as appropriate |
12:33 |
|
jihpringle joined #evergreen |
12:35 |
eeevil |
berick: if you mean 2 instances of open-ils.actor /from 2 different Evergreen instances/, then yes, every day and all day long. and they should never get confused. in XMPP world, that's super easy because the user part of the jid is part of the "queue" (aka "message destination"), and we can say "hey, services and routers for library X, your xmpp username is 123abc456. also, clients for library X, your xmpp username is 987xyz654." |
12:37 |
eeevil |
if we map jid to queue name, and put the router/service username at the front, we can say "user 987bcd654 can write to queues matching 123abc456:*" and have the exact same (better! it's actively protected) separation |
12:37 |
eeevil |
library Y, with router/service user of kdfskljkl and client user of r8r23823 cannot see or touch library X's data |
12:44 |
eeevil |
maybe there's a missing assumption here... |
12:45 |
berick |
eeevil: well, the code uses domains for segregation. running multiple listeners for separate EG instances on one domain is unexpected. |
12:46 |
berick |
it can work w/ some tweaking and I don't think it will need a full shakedown of the addresses
12:46 |
eeevil |
I do not consider (conceptually) the xmpp or redis server as being special or tied to a specific EG instance, ever. it's just a message passing system, and the topology of that layer should not be prescribed. |
12:47 |
berick |
of course, Redis can host numerous EG instances, it just assumes that no 2 use the same domain name.
12:48 |
eeevil |
berick: are you against having the prefix be, literally, the username of the redis user, and just having that default to "opensrf"? |
12:49 |
eeevil |
ah, well, that's where DNS management comes in, by making the domain a separator rather than the user |
12:50 |
eeevil |
I'm not saying you shouldn't be able to have domains be the separator, if that's what you want to use for your setup. but I /am/ saying that I want the /user/ to be the separator in mine. |
12:50 |
eeevil |
so, let's just make them both use existing configuration data |
12:51 |
eeevil |
I'll change my router_name and username elements in opensrf_core, and you can change your domains and manage separation in dns |
12:51 |
berick |
ottomh, i'm not against it, i would need to consider the ramifications, but it does mean more code changes, hence my hesitation. |
12:51 |
berick |
well i think we can use router name w/o having to restructure everything |
12:52 |
eeevil |
those are options we have available today with xmpp, and frankly, having hosted the most number of instances and learned the lessons from that, we do kinda need to retain that ability |
12:53 |
eeevil |
eeevil-- # more curmudgeonliness |
12:55 |
berick |
another thing to consider: the code as-is supports hungry-hippo style direct-to-drone delivery by sharing a well-known service address on a domain. if we have multiple services on a domain for different eg instances, that won't work -- they'll gobble each other's requests. surely there's other ways to accommodate it, but i don't want to lose that.
13:00 |
eeevil |
well, if we put the router/service (let's just simplify for the moment and combine those 2, even though you /could/ have different accounts) at the front, we can /still/ do that. because the client knows how to construct a "router" destination via //config/opensrf/router_name, and the services /can/ know how to listen to that queue via their own name as the prefix. so, maybe a patch, but it can still be handled. that does /not/ get us HA/LB |
13:00 |
eeevil |
routerless by itself, but it isn't any different than "bare service name" WRT hungry-hungry-hippo routing |
13:01 |
eeevil |
"slap my/the-routers username on the front" is really no harder than just dumping them message on a hard-coded "opensrf:$whatever" queue |
13:01 |
eeevil |
s/them/the/ |
13:05 |
eeevil |
meta-question: if a patch for opensrf-on-redis showed up that did what I'm advocating (allow what we can do in xmpp land, default to the string "opensrf" just like opensrf_core.xml does now), will there be much push-back? unless I'm severely misunderstanding both how redis works and what I'm asking for, I'm really only talking about replacing the hard-coded string "opensrf" with the redis user name |
13:07 |
eeevil |
(that's a meta-question for all interested in opensrf-on-redis, not just berick ;) ) |
13:08 |
berick |
it's a little more than that. the routing is more domain based than specific end-point based (for hungry-hippo drones). there's also a bug in the current implementation i realized during this convo i need to fix.
13:09 |
berick |
give me a couple days to try and cover the use case? if nothing else it would help me get my brain back into that territory |
13:13 |
eeevil |
berick++ |
13:14 |
berick |
and now afk for a bit cuz mtgs |
13:15 |
|
dmoore joined #evergreen |
14:30 |
|
Rogan joined #evergreen |
14:52 |
* Dyrcona |
suspects marc_export exposes/causes a memory leak in Perl 5. |
15:37 |
|
mantis1 left #evergreen |
15:43 |
Dyrcona |
Stompro: 4.5 hours versus 5+ days. I'll take it! |
15:43 |
Dyrcona |
Stompro++ |
15:47 |
Stompro |
Dyrcona, that is great to hear. Was that for just one library, or everything? |
15:52 |
Dyrcona |
That was more or less everything: a simulation of exporting for Aspen. |
15:52 |
mmorgan |
Stompro++ |
15:57 |
Stompro |
Nice, I have a few more speed-ups. I'm playing around with threading to see if I can get several processes doing the MARC::Record creation step, since that is where 50% of the time is being taken up... so maybe we can get that down to 2 hours. :-)
15:59 |
Dyrcona |
Stompro: When you say threading do you mean Perl threads or multiple processes? Either way, I want to stop you there. :) |
15:59 |
Stompro |
perl threads |
16:00 |
berick |
heh |
16:00 |
Dyrcona |
Perl threads will not work. When Encode gets invoked, and at some point it will, Perl will choke and die. |
16:01 |
Stompro |
Oh, bummer. Thanks for the heads up. |
16:02 |
Dyrcona |
I updated the bug with a pullrequest tag and "partial" signoff branch. I'm happy at this point. I think you got the big things. |
16:36 |
Bmagic |
Dyrcona++ Stompro++ |
17:20 |
|
jihpringle joined #evergreen |
17:21 |
|
mmorgan left #evergreen |
19:03 |
|
sandbergja joined #evergreen |