Time |
Nick |
Message |
00:09 |
jeff |
Bmagic: taking 4-6s to seq scan a table containing 26 rows is... neat. :-P |
00:11 |
jeff |
is SELECT * FROM config.display_field_map; slow? |
04:24 |
|
Guest13 joined #evergreen |
04:26 |
Guest13 |
Hello, I have a question about the Evergreen software. Is there someone who can explain it to me in an easy way :))
06:24 |
|
redavis joined #evergreen |
08:06 |
|
BDorsey joined #evergreen |
08:35 |
|
mmorgan joined #evergreen |
08:37 |
|
dguarrac joined #evergreen |
08:54 |
|
smayo joined #evergreen |
09:05 |
|
Dyrcona joined #evergreen |
09:16 |
|
mantis1 joined #evergreen |
09:21 |
|
sandbergja joined #evergreen |
09:25 |
Dyrcona |
berick: I'm getting the following with the latest Redis branch for Opensrf: osrf_control -l --start-all |
09:25 |
Dyrcona |
[auth] WRONGPASS invalid username-password pair, at /usr/share/perl5/Redis.pm line 311. |
09:26 |
Dyrcona |
This is after a VM reboot, so it looks like the auto --reset-message-bus isn't always working. |
09:26 |
Dyrcona |
This is what I have checked out/installed: 437e8c67 (HEAD -> collab/berick/lp2017941-opensrf-on-redis-v3, working/collab/berick/lp2017941-opensrf-on-redis-v3) LP2017941 Auto-reset bus accounts; simplify and fix |
09:33 |
Dyrcona |
For the logs: `osrf_control -l --reset-message-bus` fixes it. |
09:38 |
berick |
Dyrcona: hm, ok. it does require a ./configure to rebuild the osrf_control |
09:40 |
Dyrcona |
Hmm. Maybe I just did make. |
09:40 |
Dyrcona |
make should detect if ./configure is necessary, though. |
09:47 |
Bmagic |
jeff: that table doesn't exist config.display_field_map. I thought it was talking about metabib.display_entry |
09:48 |
Dyrcona |
OK. Explicitly doing ./configure fixed it. That's probably a bug in how we're using autotools. Make with autotools is supposed to detect when configure is needed. |
09:49 |
berick |
Dyrcona: cool, good to know |
09:49 |
Dyrcona |
Bmagic: If you're still hunting the same thing from yesterday, metabib.display_entry is (I think) the only actual table involved in all of those views. There might be one other.
09:50 |
Dyrcona |
Also, I'm trying to figure out why a bunch of checkins fail silently. I think I'll just override all events instead of playing whack-a-mole.
09:51 |
Bmagic |
jeff: lol, sorry, nevermind, that table does exist. select * takes .59 |
09:51 |
Bmagic |
Seq Scan on display_field_map cdfm (cost=0.00..1.32 rows=1 width=37) (actual time=6375.566..6375.573 rows=1 loops=1)
09:52 |
Bmagic |
something is seriously wrong. I agree jeff. 6 seconds to seq scan a 26 row table. Doesn't make any sense |
09:53 |
Bmagic |
Dyrcona: yep, same thing from yesterday |
09:54 |
Bmagic |
Dyrcona: |
09:54 |
Bmagic |
DB1: https://explain.depesz.com/s/srWZ |
09:54 |
Bmagic |
DB2: https://explain.depesz.com/s/OWwJ |
09:55 |
Bmagic |
I had a thought that maybe postgres "buffered" that table, and it was reusing a method that it used before, instead of "re-thinking". I figured a postgres restart would have cleared its brain and caused it to re-think stuff?
09:57 |
eeevil |
Bmagic: I'm coming in late, and might have missed something, but the seq scan line you pasted at 9:51 (for me) says .007 milliseconds to scan cdfm, not 6 seconds |
09:58 |
Bmagic |
eeevil: I pasted an example of a fast one and a slow one. Both machines contain the same database. PG15, DB1->DB2 replica. DB1 is slow, DB2 is fast. I decided to pg_dump DB1 and then delete the database, and pg_restore. Then resync DB2. Now they're both slow
09:59 |
Bmagic |
I'm considering an index for config.display_field_map.name |
09:59 |
eeevil |
what I mean is, in "Seq Scan on display_field_map cdfm (cost=0.00..1.32 rows=1 width=37) (actual time=6375.566..6375.573 rows=1 loops=1)" the actual time is a start..finish range, measured in ms |
10:00 |
Bmagic |
eeevil: oh, so maybe I misunderstand where it's slow? |
10:01 |
eeevil |
I think explain.depesz might be confused, and leading you astray
10:01 |
Bmagic |
where are you seeing the 6 second spike? metabib_field_pkey? |
10:07 |
Bmagic |
the index didn't help |
10:07 |
Bmagic |
another idea I had was to truncate metabib.display_entry and reingest |
10:11 |
eeevil |
yeah, I do see exactly where the 6s are coming from: JIT |
10:12 |
eeevil |
it's right there in the JIT timing summary: Timing: Generation 30.673 ms, Inlining 29.530 ms, Optimization 3683.782 ms, Emission 2662.593 ms, Total 6406.579 ms |
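To make the units concrete, using the numbers pasted above: `actual time=start..end` in a plan node is a millisecond range, so the seq scan itself takes well under a millisecond, while the JIT summary at the bottom accounts for the ~6 seconds. A small sketch (only parsing the strings already quoted in-channel):

```python
import re

# Node line from the plan pasted above; `actual time=start..end` is in ms.
node = ("Seq Scan on display_field_map cdfm (cost=0.00..1.32 rows=1 width=37) "
        "(actual time=6375.566..6375.573 rows=1 loops=1)")
start, end = map(float, re.search(r"actual time=([\d.]+)\.\.([\d.]+)", node).groups())
print(f"seq scan itself: {end - start:.3f} ms")  # ~0.007 ms

# JIT summary from the same plan: this is where the time actually goes.
jit = ("Timing: Generation 30.673 ms, Inlining 29.530 ms, "
       "Optimization 3683.782 ms, Emission 2662.593 ms, Total 6406.579 ms")
total = float(re.search(r"Total ([\d.]+) ms", jit).group(1))
print(f"JIT overhead: {total:.0f} ms")           # ~6407 ms
```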
10:13 |
Bmagic |
ah, all the way at the bottom |
10:14 |
eeevil |
I would recommend turning off all the JIT options individually, and then consider turning them back on one at a time. probably start with inlining (and generation if needed for inlining, I don't recall if it is)
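For reference, the JIT knobs live in postgresql.conf (PostgreSQL 11+). A sketch of the relevant settings; the values shown are the stock defaults, not a recommendation, and -1 on any of the cost thresholds means that stage never triggers:

```
# postgresql.conf -- JIT-related settings (PostgreSQL 11+)
jit = on                          # master switch; 'off' disables JIT entirely
jit_above_cost = 100000           # estimated cost above which JIT kicks in; -1 disables
jit_inline_above_cost = 500000    # cost above which JIT inlines functions; -1 disables
jit_optimize_above_cost = 500000  # cost above which expensive optimization passes run; -1 disables
```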
10:15 |
Bmagic |
researching.... |
10:16 |
eeevil |
(so, in defence of explain.depesz, it's doing the best it can to highlight the JIT slowdown, but it def does still require understanding of what's being presented) |
10:19 |
Bmagic |
eeevil++ # jit_above_cost = -1 fixed it |
10:19 |
Bmagic |
good night that was bothering me |
10:20 |
Bmagic |
6000ms down to 15ms |
10:20 |
Bmagic |
so now I have to figure out what other things are broken with that disabled |
10:20 |
berick |
TIL about PG JIT |
10:21 |
Bmagic |
berick++ lol |
10:21 |
Bmagic |
I had to Google TIL |
10:22 |
Bmagic |
TIL about TIL |
10:22 |
berick |
heh |
10:31 |
eeevil |
disabling JIT shouldn't break anything. it'll just behave more like pg 10 or 11. which is to say, more predictable, sometimes a little slower, and sometimes (as here) much faster |
10:33 |
Bmagic |
eeevil++ |
10:45 |
eeevil |
JIT is really good for DW and analytics, ESPECIALLY when you have, say, a pile of GPUs and the citus column store extension installed and you're folding proteins or something. stuff where there's a LOT of data, it's extremely well controlled in type and datum size, but you don't know the cardinality or distribution of the data. TBH, that's basically the opposite of EG's data for the most part ;)
10:46 |
Bmagic |
I didn't turn it completely off. There is a setting for just straight turning it off "jit = on" |
10:46 |
Bmagic |
this is the one setting that I changed where it changed the outcome of the analyze: "jit_above_cost = -1" |
10:47 |
Bmagic |
that fixed course reserves :) |
10:50 |
Bmagic |
jeff jeffdavis: Pinging you all just to make sure you see the fix. It should probably be baked into our install instructions, come to think of it
10:54 |
|
briank joined #evergreen |
10:55 |
eeevil |
I'd recommend the install instructions just say "Do not turn on Postgres' JIT capabilities. Evergreen's queries, especially complex ones used for search, are intentionally tuned for non-JIT execution and JIT has been shown to be harmful in some circumstances." then, once we know how/when/what to use from the JIT toolbox, we can change that rec. just IMO... |
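If the docs adopt that wording, the corresponding change is a single line in postgresql.conf (the file's path varies by distribution; a reload or restart applies it):

```
# postgresql.conf -- per the recommendation above, leave PG's JIT disabled for Evergreen
jit = off
```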
10:57 |
eeevil |
the core problem is one of estimation -- we need to invest time/energy into finding the most likely universal cross-column and cross-table stats to configure, because PG will try to use JIT when the stats say "you're going to have to compare 1 BEEEELLION INTEGERS", and spend multiple seconds setting that up for a query, and then the stats were wrong and it compares, like, 3 rows. |
10:59 |
eeevil |
but we can't simply blanket every instance in stats gathering config, because that's a speed issue both coming and going, and MOAR STATS is not necessarily better stats |
11:00 |
Bmagic |
I'll submit the change to our install instructions |
11:05 |
berick |
eeevil: i'm working on a tech ref doc for redis bus addresses / login accounts. https://gist.github.com/berick/b1d26f7179b97635c71c9ac91ac38584 |
11:07 |
eeevil |
berick++ |
11:07 |
eeevil |
thanks! |
11:11 |
eeevil |
berick: so, right off the bat, I think I understood correctly that the first part of the bus address is pinned, and (right now) you couldn't have two separate instances running services and not seeing each other, right? to be able to have bus addresses and redis account paired is the thing I'm looking for, if by convention (some param that defaults to "opensrf" for the bus prefix) or explicitly/intentionally (the bus prefix is the login user name) |
11:13 |
berick |
eeevil: 'opensrf' as part of the bus prefix is convention and unrelated to the redis account. just wanted it to have some kind of prefix. running separate service instances on one Redis instance just means giving each its own domain.
11:15 |
eeevil |
I was advocating for the user name as the prefix so that we don't have to have (maintain, update, restart) dns changes all the time |
11:15 |
eeevil |
is there a reason /not/ to just make the prefix be the user that the redis-connecting thing uses? |
11:17 |
berick |
eeevil: top of head, that would probably work, but would mean some router changes .. i think, need to verify |
11:34 |
jeff |
eeevil++ imparting the JIT knowledge |
11:34 |
eeevil |
eeevil-- # being a curmudgeon re JIT ;) |
11:35 |
jeff |
And agreed, I think explain.depesz.com might need to tweak its pre-JIT logic surrounding the exclusive column. no longer is it "the deepest node has nothing else consuming time" |
11:37 |
berick |
eeevil: thinking through some options.. keeping the 'opensrf' prefix is simplest way forward (for ACL rules, code changes, etc.) but i think we could get the same benefit using router addresses like opensrf:router:$domain:$login -- then teach the services to register to specific routers by domain+login
11:37 |
eeevil |
berick: hrm... I wouldn't think that would be necessary, since the router is named WRT the services and the clients (they know where to send router-distributed requests) and routers get registrations from the servers, so they know where they should be sending messages to be distributed. however, I have /not/ looked at the redis-ized router code, so I'm probably making assumptions about what's recorded and what's computed
11:38 |
eeevil |
I think that removes the ACL-based cross-"user" protection of queues, though
11:39
eeevil
if the $login part is in the wildcard section of the ACL protection, then any client-user connection to redis can send requests at any service, right?
11:41 |
berick |
hm well.. the router ACLs could be done w/o wildcards. e.g. ~opensrf:router:private.localhost:router-01 |
11:41 |
berick |
if we're at the level of specifying specific routers, there's not really any need for the wildcard |
11:46 |
eeevil |
not sure I follow. from the client's perspective, we need 3 bits of info to make a request: 1) service name 2) my local domain 3) the router "prefix" (this is least needed, but will eventually allow us to go routerless, I believe, even with cross-domain HA/LB) |
11:47 |
eeevil |
we already specify those things in opensrf_core.xml (concretely) |
11:48 |
eeevil |
the router prefix is //config/opensrf/router_name |
11:53 |
eeevil |
put another way, imagine translating from an XMPP jid to a redis queue name. jid structure is $router_name@$local_domain/$service_name, and the equiv redis queue might be $router_name:$service_name:$local_domain and the ACL could allow $client_user_name to write to $router_name:* queues |
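The translation described above is mechanical. A hypothetical helper (the function name and code are invented here for illustration, not taken from OpenSRF) showing the jid-to-queue mapping:

```python
def jid_to_queue(jid: str) -> str:
    """Map an XMPP jid ($router_name@$local_domain/$service_name) to the
    equivalent redis queue name ($router_name:$service_name:$local_domain)."""
    router_name, rest = jid.split("@", 1)
    local_domain, service_name = rest.split("/", 1)
    return f"{router_name}:{service_name}:{local_domain}"

print(jid_to_queue("router@private.localhost/open-ils.cat"))
# router:open-ils.cat:private.localhost
```

Under this scheme an ACL allowing `$client_user_name` to write to `$router_name:*` covers every queue the helper can produce for that router.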
11:54 |
Dyrcona |
Stompro: Your latest commit in the collab branch makes a huge difference. After 45 or so minutes, the output is where it previously was after 2 days.
11:56 |
Stompro |
Great, you probably have much larger @orgs, @shelves, @prefix arrays, the grep lookups were probably taking a long time. |
11:57 |
eeevil |
for non-trivial setups, the redis user and ACL file will be machine-generated (for us, at least, and I hope for others, to avoid human error at the config file level), so I'm really not personally concerned about the contents being big or messy. but(!) for trivial setups, it's still extremely simple: the router_name will be "router" and the ACL will be "opensrf can write to router:*"
11:57 |
Dyrcona |
Stompro: Yeah. I'll have to count some of those after lunch. |
12:04 |
berick |
eeevil: what you're describing would almost certainly work. my thought behind e.g. opensrf:router:private.localhost:router-01 is mainly to limit the amount of changes needed to support the use case, and it seems equally machine-generatable.
12:10 |
eeevil |
what's router-01 in this case (IOW, why don't we just have The User Name for an EG instance's routers)? each domain will just have one router, right? so just have it be '$router:$local_domain:$service' (or '$router:$service:$local_domain') where $router might be literally "router" for dev systems.
12:12 |
eeevil |
I mean, if having a global prefix of 'opensrf:' in front of every queue has a benefit, +1. to me that seems like noise we don't need, but I can't fight too hard against namespacing "all of opensrf" into "opensrf:" either. |
12:13 |
berick |
eeevil: router-01 in this example is the router's username. but wait, "each domain will just have one router, right?" -- i thought the point of this convo was that you needed multiple routers per domain (to avoid dns hassle, etc.). |
12:14 |
eeevil |
setting aside an opensrf redis namespace prefix (if that's what it really is), what I'm looking for is the ability to have a set of redis queues that follow a pattern that clients and services can construct in a known way, and /definitely/ segregates one instance's queues from all others that might happen to live inside the same redis instance
12:15 |
eeevil |
we're def talking past each other to some degree :) |
12:23 |
eeevil |
so each /EG instance/ will have exactly 1 router per redis (xmpp) listening hostname. (we actually have 4 domain types, not just public and private). if the router+service queue (xmpp jid for incoming requests) has a pattern like this: "library_A_router_name:open-ils.cat:public.domain.name" (though "library_A_router_name" wouldn't be guessable like that) then we can make ACLs for the redis user called "library_A_client" such that it can put messages
12:23 |
eeevil |
onto the queues that match "library_A_router_name:*", and services can put messages on queues that match "library_A_client:*". the router-to-service pattern is similarly simple, and from the router's perspective can just be based on where it got registrations from.
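A hedged sketch of what that scheme could look like as Redis ACL rules (usernames are taken from the example above; the passwords are placeholders, and real machine-generated rules would grant something narrower than +@all):

```
# users.acl -- illustrative only
# clients for library A may touch the router/service request queues:
user library_A_client on >changeme1 ~library_A_router_name:* +@all
# services/routers for library A may touch client response queues:
user library_A_router_name on >changeme2 ~library_A_client:* +@all
```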
12:30 |
eeevil |
(longer term, I believe we can get to routerless with this pattern of "$purpose_user_name:$access_domain:$purpose_marker" for queues. purpose_user_name includes what opensrf_core.xml calls router_name, and username for services and clients; access_domain is, essentially, public, private, etc (mapping from the actual-dns multi-domain stuff from xmpp and being how we say "I live here (usually private), but I will also answer requests over
12:30 |
eeevil |
there (usually public). and, purpose_marker is "open-ils.cat" or "router" (where services send registrations, because recall that's a pseudo-service!) or "client:$host:$pid" for response-collection queues) |
12:31 |
berick |
eeevil: in your setup, will you have e.g. 2 instances of open-ils.actor listeners running on one domain, where each receives requests via a different router? |
12:32 |
eeevil |
and for each group of "purpose_user_name" that is also a redis user, the ACLs allow their EG-instance peers to talk at them, as appropriate |
12:33 |
|
jihpringle joined #evergreen |
12:35 |
eeevil |
berick: if you mean 2 instances of open-ils.actor /from 2 different Evergreen instances/, then yes, every day and all day long. and they should never get confused. in XMPP world, that's super easy because the user part of the jid is part of the "queue" (aka "message destination"), and we can say "hey, services and routers for library X, your xmpp username is 123abc456. also, clients for library X, your xmpp username is 987xyz654." |
12:37 |
eeevil |
if we map jid to queue name, and put the router/service username at the front, we can say "user 987bcd654 can write to queues matching 123abc456:*" and have the exact same (better! it's actively protected) separation |
12:37 |
eeevil |
library Y, with router/service user of kdfskljkl and client user of r8r23823 cannot see or touch library X's data |
12:44 |
eeevil |
maybe there's a missing assumption here... |
12:45 |
berick |
eeevil: well, the code uses domains for segregation. running multiple listeners for separate EG instances on one domain is unexpected. |
12:46 |
berick |
it can work w/ some tweaking and I don't think it will need a full shakedown of the addresses
12:46 |
eeevil |
I do not consider (conceptually) the xmpp or redis server as being special or tied to a specific EG instance, ever. it's just a message passing system, and the topology of that layer should not be prescribed. |
12:47 |
berick |
of course, Redis can host numerous EG instances, it just assumes that no 2 use the same domain name.
12:48 |
eeevil |
berick: are you against having the prefix be, literally, the username of the redis user, and just having that default to "opensrf"? |
12:49 |
eeevil |
ah, well, that's where DNS management comes in, by making the domain a separator rather than the user |
12:50 |
eeevil |
I'm not saying you shouldn't be able to have domains be the separator, if that's what you want to use for your setup. but I /am/ saying that I want the /user/ to be the separator in mine. |
12:50 |
eeevil |
so, let's just make them both use existing configuration data |
12:51 |
eeevil |
I'll change my router_name and username elements in opensrf_core, and you can change your domains and manage separation in dns |
12:51 |
berick |
ottomh, i'm not against it, i would need to consider the ramifications, but it does mean more code changes, hence my hesitation. |
12:51 |
berick |
well i think we can use router name w/o having to restructure everything |
12:52 |
eeevil |
those are options we have available today with xmpp, and frankly, having hosted the most number of instances and learned the lessons from that, we do kinda need to retain that ability |
12:53 |
eeevil |
eeevil-- # more curmudgeonliness |
12:55 |
berick |
another thing to consider: the code as-is supports hungry-hippo style direct-to-drone delivery by sharing a well-known service address on a domain. if we have multiple services on a domain for different eg instances, that won't work -- they'll gobble each other's requests. surely there's other ways to accommodate it, but i don't want to lose that.
13:00 |
eeevil |
well, if we put the router/service (let's just simplify for the moment and combine those 2, even though you /could/ have different accounts) at the front, we can /still/ do that. because the client knows how to construct a "router" destination via //config/opensrf/router_name, and the services /can/ know how to listen to that queue via their own name as the prefix. so, maybe a patch, but it can still be handled. that does /not/ get us HA/LB |
13:00 |
eeevil |
routerless by itself, but it isn't any different than "bare service name" WRT hungry-hungry-hippo routing |
13:01 |
eeevil |
"slap my/the-routers username on the front" is really no harder than just dumping them message on a hard-coded "opensrf:$whatever" queue |
13:01 |
eeevil |
s/them/the/ |
13:05 |
eeevil |
meta-question: if a patch for opensrf-on-redis showed up that did what I'm advocating (allow what we can do in xmpp land, default to the string "opensrf" just like opensrf_core.xml does now), will there be much push-back? unless I'm severely misunderstanding both how redis works and what I'm asking for, I'm really only talking about replacing the hard-coded string "opensrf" with the redis user name |
13:07 |
eeevil |
(that's a meta-question for all interested in opensrf-on-redis, not just berick ;) ) |
13:08 |
berick |
it's a little more than that. the routing is more domain based than specific end-point based (for hungry-hippo drones). there's also a bug in the current implementation i realized during this convo i need to fix.
13:09 |
berick |
give me a couple days to try and cover the use case? if nothing else it would help me get my brain back into that territory |
13:13 |
eeevil |
berick++ |
13:14 |
berick |
and now afk for a bit cuz mtgs |
13:15 |
|
dmoore joined #evergreen |
14:30 |
|
Rogan joined #evergreen |
14:52 |
* Dyrcona |
suspects marc_export exposes/causes a memory leak in Perl 5. |
15:37 |
|
mantis1 left #evergreen |
15:43 |
Dyrcona |
Stompro: 4.5 hours versus 5+ days. I'll take it! |
15:43 |
Dyrcona |
Stompro++ |
15:47 |
Stompro |
Dyrcona, that is great to hear. Was that for just one library, or everything? |
15:52 |
Dyrcona |
That was more or less everything: a simulation of exporting for Aspen. |
15:52 |
mmorgan |
Stompro++ |
15:57 |
Stompro |
Nice, I have a few more speed-ups. I'm playing around with threading to see if I can get several processes doing the MARC::Record creation step, since that is where 50% of the time is being taken up... so maybe we can get that down to 2 hours. :-)
15:59 |
Dyrcona |
Stompro: When you say threading do you mean Perl threads or multiple processes? Either way, I want to stop you there. :) |
15:59 |
Stompro |
perl threads |
16:00 |
berick |
heh |
16:00 |
Dyrcona |
Perl threads will not work. When Encode gets invoked, and at some point it will, Perl will choke and die. |
16:01 |
Stompro |
Oh, bummer. Thanks for the heads up. |
16:02 |
Dyrcona |
I updated the bug with a pullrequest tag and "partial" signoff branch. I'm happy at this point. I think you got the big things. |
16:36 |
Bmagic |
Dyrcona++ Stompro++ |
17:20 |
|
jihpringle joined #evergreen |
17:21 |
|
mmorgan left #evergreen |
19:03 |
|
sandbergja joined #evergreen |