Evergreen ILS Website

IRC log for #evergreen, 2023-10-31

| Channels | #evergreen index | Today | | Search | Google Search | Plain-Text | summary | Join Webchat

All times shown according to the server's local time.

Time Nick Message
00:09 jeff Bmagic: taking 4-6s to seq scan a table containing 26 rows is... neat. :-P
00:11 jeff is SELECT * FROM config.display_field_map; slow?
04:24 Guest13 joined #evergreen
04:26 Guest13 Hello, I have a question about evergreen software. Is there someone who can explain it to me in easy way :))
06:24 redavis joined #evergreen
08:06 BDorsey joined #evergreen
08:35 mmorgan joined #evergreen
08:37 dguarrac joined #evergreen
08:54 smayo joined #evergreen
09:05 Dyrcona joined #evergreen
09:16 mantis1 joined #evergreen
09:21 sandbergja joined #evergreen
09:25 Dyrcona berick: I'm getting the following with the latest Redis branch for Opensrf: osrf_control -l --start-all
09:25 Dyrcona [auth] WRONGPASS invalid username-password pair,  at /usr/share/perl5/Redis.pm line 311.
09:26 Dyrcona This is after a VM reboot, so it looks like the auto --reset-message-bus isn't always working.
09:26 Dyrcona This is what I have checked out/installed: 437e8c67 (HEAD -> collab/berick/lp2017941-opensrf-on-redis-v3, working/collab/berick/lp2017941-opensrf-on-redis-v3) LP2017941 Auto-reset bus accounts; simplify and fix
09:33 Dyrcona For the logs: `osrf_control -l --reset-message-bus` fixes it.
09:38 berick Dyrcona: hm, ok.   it does require a ./configure to rebuild the osrf_control
09:40 Dyrcona Hmm. Maybe I just did make.
09:40 Dyrcona make should detect if ./configure is necessary, though.
09:47 Bmagic jeff: that table doesn't exist config.display_field_map. I thought it was talking about metabib.display_entry
09:48 Dyrcona OK. Explicitly doing ./configure fixed it. That's probably a bug in how we're using autotools. Make with autotools is supposed to detect when configure is needed.
09:49 berick Dyrcona: cool, good to know
09:49 Dyrcona Bmagic: If you're still hunting the same thing from yesterday, metabib.display_entry is (I think), the only actual table involved in all of those views. There might be one other.
09:50 Dyrcona Also, I'm trying to figure out why a bunch of checkins fail silently. I think I'll just override all events instead of playing whack-a-mole.
09:51 Bmagic jeff: lol, sorry, nevermind, that table does exist. select * takes .59
09:51 Bmagic Seq Scan on display_field_map cdfm  (cost=0.00..1.32 rows=1 width=37) (actual time=6375.566..6375.573 rows=1 loops=1)
09:52 Bmagic something is seriously wrong. I agree jeff. 6 seconds to seq scan a 26 row table. Doesn't make any sense
09:53 Bmagic Dyrcona: yep, same thing from yesterday
09:54 Bmagic Dyrcona:
09:54 Bmagic DB1: https://explain.depesz.com/s/srWZ
09:54 Bmagic DB2: https://explain.depesz.com/s/OWwJ
09:55 Bmagic I had a thought that maybe postgres "buffered" that table, and it was reusing a method that it used before instead of "re-thinking". I figured a postgres restart would have cleared its brain and caused it to re-think stuff?
09:57 eeevil Bmagic: I'm coming in late, and might have missed something, but the seq scan line you pasted at 9:51 (for me) says .007 milliseconds to scan cdfm, not 6 seconds
09:58 Bmagic eeevil: I pasted an example of a fast one and a slow one. Both machines containing the same database. PG15, DB1->DB2 replicant. DB1 is slow, DB2 is fast. I decided to pg_dump DB1 and then delete the database, and pg_restore. Then resync DB2. Now they're both slow
09:59 Bmagic I'm considering an index for config.display_field_map.name
09:59 eeevil what I mean is, in "Seq Scan on display_field_map cdfm  (cost=0.00..1.32 rows=1 width=37) (actual time=6375.566..6375.573 rows=1 loops=1)" the actual time is a start..finish range, measured in ms
10:00 Bmagic eeevil: oh, so maybe I misunderstand where it's slow?
10:01 eeevil I think explain.depesz might be confused, and leading you astray
10:01 Bmagic where are you seeing the 6 second spike? metabib_field_pkey?
10:07 Bmagic the index didn't help
10:07 Bmagic another idea I had was to truncate metabib.display_entry and reingest
10:11 eeevil yeah, I do see exactly where the 6s are coming from: JIT
10:12 eeevil it's right there in the JIT timing summary: Timing: Generation 30.673 ms, Inlining 29.530 ms, Optimization 3683.782 ms, Emission 2662.593 ms, Total 6406.579 ms
10:13 Bmagic ah, all the way at the bottom
10:14 eeevil I would recommend turning off all the JIT options individually, and then consider turning them back on one at a time. probably start with inlining (and generation if needed for inlining, I don't recall if it is)
10:15 Bmagic researching....
10:16 eeevil (so, in defence of explain.depesz, it's doing the best it can to highlight the JIT slowdown, but it def does still require understanding of what's being presented)
10:19 Bmagic eeevil++ # jit_above_cost = -1 fixed it
10:19 Bmagic good night that was bothering me
10:20 Bmagic 6000ms down to 15ms
10:20 Bmagic so now I have to figure out what other things are broken with that disabled
10:20 berick TIL about PG JIT
10:21 Bmagic berick++ lol
10:21 Bmagic I had to Google TIL
10:22 Bmagic TIL about TIL
10:22 berick heh
10:31 eeevil disabling JIT shouldn't break anything. it'll just behave more like pg 10 or 11. which is to say, more predictable, sometimes a little slower, and sometimes (as here) much faster
10:33 Bmagic eeevil++
10:45 eeevil JIT is really good for DW and analytics, ESPECIALLY when you have, say, a pile of GPUs and the citus column store extension installed and you're folding proteins or something. stuff where there's a LOT of data, it's extremely well controlled in type and datum size, but you don't know the cardinality or distribution of the data. TBH, that's basically the opposite of EG's data for the most part ;)
10:46 Bmagic I didn't turn it completely off. There is a setting for just straight turning it off: "jit = off"
10:46 Bmagic this is the one setting that I changed where it changed the outcome of the analyze: "jit_above_cost = -1"
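The two settings discussed above can be tried per-session before touching postgresql.conf. A minimal sketch, assuming only the table name from the log; these are standard PostgreSQL GUCs, and -1 means cost-based JIT never triggers:

```sql
-- Per-session experiment matching the fix above:
SET jit_above_cost = -1;   -- the setting Bmagic changed
EXPLAIN (ANALYZE) SELECT * FROM config.display_field_map;

-- Or, per eeevil's recommendation, disable JIT outright:
SET jit = off;
```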
10:47 Bmagic that fixed course reserves :)
10:50 Bmagic jeff jeffdavis: Pinging you all just to make sure you see the fix. It should probably be baked into our install instructions, come to think of it
10:54 briank joined #evergreen
10:55 eeevil I'd recommend the install instructions just say "Do not turn on Postgres' JIT capabilities. Evergreen's queries, especially complex ones used for search, are intentionally tuned for non-JIT execution and JIT has been shown to be harmful in some circumstances."  then, once we know how/when/what to use from the JIT toolbox, we can change that rec.  just IMO...
10:57 eeevil the core problem is one of estimation -- we need to invest time/energy into finding the most likely universal cross-column and cross-table stats to configure, because PG will try to use JIT when the stats say "you're going to have to compare 1 BEEEELLION INTEGERS", and spend multiple seconds setting that up for a query, and then the stats were wrong and it compares, like, 3 rows.
10:59 eeevil but we can't simply blanket every instance in stats gathering config, because that's a speed issue both coming and going, and MOAR STATS is not necessarily better stats
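One hedged illustration of the targeted stats-gathering eeevil describes, using PostgreSQL's extended statistics on a single table rather than blanket stats-gathering config; the statistics name and column choices are invented for the example:

```sql
-- Sketch only: collect cross-column dependency statistics for one table
-- the planner misestimates, instead of raising stats targets everywhere.
CREATE STATISTICS IF NOT EXISTS mde_source_field_deps (dependencies)
    ON source, field FROM metabib.display_entry;
ANALYZE metabib.display_entry;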
11:00 Bmagic I'll submit the change to our install instructions
11:05 berick eeevil: i'm working on a tech ref doc for redis bus addresses / login accounts.  https://gist.github.com/berick/b1d26f7179b97635c71c9ac91ac38584
11:07 eeevil berick++
11:07 eeevil thanks!
11:11 eeevil berick: so, right off the bat, I think I understood correctly that the first part of the bus address is pinned, and (right now) you couldn't have two separate instances running services and not seeing each other, right? to be able to have bus addresses and redis account paired is the thing I'm looking for, if by convention (some param that defaults to "opensrf" for the bus prefix) or explicitly/intentionally (the bus prefix is the login user name)
11:13 berick eeevil: 'opensrf' as part of the bus prefix is convention and unrelated to the redis account.  just wanted it to have some kind of prefix.  running separate service instances on one Redis instance just means giving each its own domain.
11:15 eeevil I was advocating for the user name as the prefix so that we don't have to have (maintain, update, restart) dns changes all the time
11:15 eeevil is there a reason /not/ to just make the prefix be the user that the redis-connecting thing uses?
11:17 berick eeevil: top of head, that would probably work, but would mean some router changes .. i think, need to verify
11:34 jeff eeevil++ imparting the JIT knowledge
11:34 eeevil eeevil-- # being a curmudgeon re JIT ;)
11:35 jeff And agreed, I think explain.depesz.com might need to tweak its pre-JIT logic surrounding the exclusive column. no longer is it "the deepest node has nothing else consuming time"
11:37 berick eeevil: thinking through some options..  keeping the 'opensrf' prefix is simplest way forward (for ACL rules, code changes, etc.) but i think we could get the same benefit using router addresses like opensrf:router:$domain:$login -- then teach the services to register to specific routers by domain+login
11:37 eeevil berick: hrm... I wouldn't think that would be necessary, since the router is named WRT the services and the clients (they know where to send router-distributed requests) and routers get registrations from the servers, so know where they should be sending messages to be distributed. however, I have /not/ looked at the redis-ized router code, so I'm probably making assumptions about what's recorded and what's computed
11:38 eeevil I think that removes the ACL-based cross-"user" protection of queues, though
11:39 eeevil if the $login part is in the wildcard section of the ACL protection, then any client-user connection to redis can send requests at any service, right?
11:41 berick hm well.. the router ACLs could be done w/o wildcards.  e.g. ~opensrf:router:private.localhost:router-01
11:41 berick if we're at the level of specifying specific routers, there's not really any need for the wildcard
11:46 eeevil not sure I follow. from the client's perspective, we need 3 bits of info to make a request: 1) service name 2) my local domain 3) the router "prefix" (this is least needed, but will eventually allow us to go routerless, I believe, even with cross-domain HA/LB)
11:47 eeevil we already specify those things in opensrf_core.xml (concretely)
11:48 eeevil the router prefix is //config/opensrf/router_name
11:53 eeevil put another way, imagine translating from an XMPP jid to a redis queue name. jid structure is $router_name@$local_domain/$service_name, and the equiv redis queue might be $router_name:$service_name:$local_domain and the ACL could allow $client_user_name to write to $router_name:* queues
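The jid-to-queue translation eeevil sketches above can be written out concretely. All names below are invented for illustration and come from no real config:

```shell
# Illustrative only: translate an XMPP jid into the equivalent Redis
# queue name using the scheme described above.
router_name="router"
local_domain="private.localhost"
service_name="open-ils.cat"

# XMPP jid structure: $router_name@$local_domain/$service_name
jid="${router_name}@${local_domain}/${service_name}"

# Equivalent Redis queue name: $router_name:$service_name:$local_domain
queue="${router_name}:${service_name}:${local_domain}"

# An ACL would then allow a client user to write to $router_name:* queues.
echo "$jid"
echo "$queue"
```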
11:54 Dyrcona Stompro: Your latest commit in the collab branch makes a huge difference. After 45 or so minutes, the output is where it was at 2 days before.
11:56 Stompro Great, you probably have much larger @orgs, @shelves, @prefix arrays, the grep lookups were probably taking a long time.
11:57 eeevil for non-trivial setups, the redis user and ACL file will be machine-generated (for us, at least, and I hope for others to avoid human error at the config file level), so I'm really not personally concerned about the contents being big or messy. but(!) for trivial setups, it's still extremely simple: the router_name will be "router" and the ACL will be "opensrf can write to router:*"
11:57 Dyrcona Stompro: Yeah. I'll have to count some of those after lunch.
12:04 berick eeevil: what you're describing would almost certainly work.  my thought behind e.g. opensrf:router:private.localhost:router-01 is mainly to limit the amount of changes needed to support the use case, and it seems equally machine-generatable.
12:10 eeevil what's router-01 in this case (IOW, why don't we just have The User Name for an EG instance's routers)? each domain will just have one router, right? so just having it be '$router:$local_domain:$service' (or '$router:$service:$local_domain') where $router might be literally "router" for dev systems.
12:12 eeevil I mean, if having a global prefix of 'opensrf:' in front of every queue has a benefit, +1. to me that seems like noise we don't need, but I can't fight too hard against namespacing "all of opensrf" into "opensrf:" either.
12:13 berick eeevil: router-01 in this example is the router's username.  but wait, "each domain will just have one router, right?" -- i thought the point of this convo was that you needed multiple routers per domain (to avoid dns hassle, etc.).
12:14 eeevil setting aside an opensrf redis namespace prefix (if that's what it really is), what I'm looking for is the ability to have a set of redis queues that follow a pattern that clients and services can construct in a known way, and /definitely/ segregates one instance's queues from all others that might happen to live inside the same redis instance
12:15 eeevil we're def talking past each other to some degree :)
12:23 eeevil so each /EG instance/ will have exactly 1 router per redis (xmpp) listening hostname. (we actually have 4 domain types, not just public and private). if the router+service queue (xmpp jid for incoming requests) has a pattern like this: "library_A_router_name:open-ils.cat:public.domain.name" (though "library_A_router_name" wouldn't be guessable like that) then we can make ACLs for the redis user called "library_A_client" such that it can put messages
12:23 eeevil onto the queues that match "library_A_router_name:*", and services can put messages on queues that match "library_A_client:*".  the router-to-service pattern is similarly simple, and from the router's perspective can just be based on where it got registrations from.
12:30 eeevil (longer term, I believe we can get to routerless with this pattern of "$purpose_user_name:$access_domain:$purpose_marker" for queues.  purpose_user_name includes what opensrf_core.xml calls router_name, and username for services and clients; access_domain is, essentially, public, private, etc (mapping from the actual-dns multi-domain stuff from xmpp, and being how we say "I live here (usually private), but I will also answer requests over
12:30 eeevil there (usually public)").  and, purpose_marker is "open-ils.cat" or "router" (where services send registrations, because recall that's a pseudo-service!) or "client:$host:$pid" for response-collection queues)
12:31 berick eeevil: in your setup, will you have e.g. 2 instances of open-ils.actor listeners running on one domain, where each receives requests via a different router?
12:32 eeevil and for each group of "purpose_user_name" that is also a redis user, the ACLs allow their EG-instance peers to talk at them, as appropriate
12:33 jihpringle joined #evergreen
12:35 eeevil berick: if you mean 2 instances of open-ils.actor /from 2 different Evergreen instances/, then yes, every day and all day long. and they should never get confused. in XMPP world, that's super easy because the user part of the jid is part of the "queue" (aka "message destination"), and we can say "hey, services and routers for library X, your xmpp username is 123abc456. also, clients for library X, your xmpp username is 987xyz654."
12:37 eeevil if we map jid to queue name, and put the router/service username at the front, we can say "user 987xyz654 can write to queues matching 123abc456:*" and have the exact same (better! it's actively protected) separation
12:37 eeevil library Y, with router/service user of kdfskljkl and client user of r8r23823 cannot see or touch library X's data
12:44 eeevil maybe there's a missing assumption here...
12:45 berick eeevil: well, the code uses domains for segregation.  running multiple listeners for separate EG instances on one domain is unexpected.
12:46 berick it can work w/ some tweaking and I don't think it will need a full shakedown of the addresses
12:46 eeevil I do not consider (conceptually) the xmpp or redis server as being special or tied to a specific EG instance, ever.  it's just a message passing system, and the topology of that layer should not be prescribed.
12:47 berick of course, Redis can host numerous EG instances, it just assumes that no 2 use the same domain name.
12:48 eeevil berick: are you against having the prefix be, literally, the username of the redis user, and just having that default to "opensrf"?
12:49 eeevil ah, well, that's where DNS management comes in, by making the domain a separator rather than the user
12:50 eeevil I'm not saying you shouldn't be able to have domains be the separator, if that's what you want to use for your setup. but I /am/ saying that I want the /user/ to be the separator in mine.
12:50 eeevil so, let's just make them both use existing configuration data
12:51 eeevil I'll change my router_name and username elements in opensrf_core, and you can change your domains and manage separation in dns
12:51 berick ottomh, i'm not against it, i would need to consider the ramifications, but it does mean more code changes, hence my hesitation.
12:51 berick well i think we can use router name w/o having to restructure everything
12:52 eeevil those are options we have available today with xmpp, and frankly, having hosted the most number of instances and learned the lessons from that, we do kinda need to retain that ability
12:53 eeevil eeevil-- # more curmudgeonliness
12:55 berick another thing to consider:  the code as-is supports hungry-hippo style direct-to-drone delivery by sharing a well-known service address on a domain.  if we have multiple services on a domain for different eg instances, that won't work -- they'll gobble each other's requests.  surely there's other ways to accommodate it, but i don't want to lose that.
13:00 eeevil well, if we put the router/service (let's just simplify for the moment and combine those 2, even though you /could/ have different accounts) at the front, we can /still/ do that. because the client knows how to construct a "router" destination via //config/opensrf/router_name, and the services /can/ know how to listen to that queue via their own name as the prefix. so, maybe a patch, but it can still be handled. that does /not/ get us HA/LB
13:00 eeevil routerless by itself, but it isn't any different than "bare service name" WRT hungry-hungry-hippo routing
13:01 eeevil "slap my/the-router's username on the front" is really no harder than just dumping the message on a hard-coded "opensrf:$whatever" queue
13:05 eeevil meta-question: if a patch for opensrf-on-redis showed up that did what I'm advocating (allow what we can do in xmpp land, default to the string "opensrf" just like opensrf_core.xml does now), will there be much push-back?  unless I'm severely misunderstanding both how redis works and what I'm asking for, I'm really only talking about replacing the hard-coded string "opensrf" with the redis user name
13:07 eeevil (that's a meta-question for all interested in opensrf-on-redis, not just berick ;) )
13:08 berick it's a little more than that.  the routing is more domain-based than specific-endpoint-based (for hungry-hippo drones).  there's also a bug in the current implementation i realized during this convo i need to fix.
13:09 berick give me a couple days to try and cover the use case?  if nothing else it would help me get my brain back into that territory
13:13 eeevil berick++
13:14 berick and now afk for a bit cuz mtgs
13:15 dmoore joined #evergreen
14:30 Rogan joined #evergreen
14:52 * Dyrcona suspects marc_export exposes/causes a memory leak in Perl 5.
15:37 mantis1 left #evergreen
15:43 Dyrcona Stompro: 4.5 hours versus 5+ days. I'll take it!
15:43 Dyrcona Stompro++
15:47 Stompro Dyrcona, that is great to hear.  Was that for just one library, or everything?
15:52 Dyrcona That was more or less everything: a simulation of exporting for Aspen.
15:52 mmorgan Stompro++
15:57 Stompro Nice, I have a few more speed ups, i'm playing around with threading to see if I can get several processes doing the MARC::Record creation step, since that is where 50% of the time is being taken up... so maybe we can get that down to 2 hours. :-)
15:59 Dyrcona Stompro: When you say threading do you mean Perl threads or multiple processes? Either way, I want to stop you there. :)
15:59 Stompro perl threads
16:00 berick heh
16:00 Dyrcona Perl threads will not work. When Encode gets invoked, and at some point it will, Perl will choke and die.
16:01 Stompro Oh, bummer.  Thanks for the heads up.
16:02 Dyrcona I updated the bug with a pullrequest tag and "partial" signoff branch. I'm happy at this point. I think you got the big things.
16:36 Bmagic Dyrcona++ Stompro++
17:20 jihpringle joined #evergreen
17:21 mmorgan left #evergreen
19:03 sandbergja joined #evergreen
