Time | Nick | Message
05:06 | | dbwells_ joined #evergreen
06:40 | | rlefaive joined #evergreen
07:04 | | jboyer-isl joined #evergreen
07:36 | | rjackson_isl joined #evergreen
08:01 | | ericar joined #evergreen
08:04 | | mrpeters joined #evergreen
08:06 | | mrpeters1 joined #evergreen
08:40 | | mmorgan joined #evergreen
08:45 | | mmorgan1 joined #evergreen
08:51 | | mmorgan joined #evergreen
08:52 | | collum joined #evergreen
09:19 | | maryj joined #evergreen
09:24 | | ericar_ joined #evergreen
09:34 | kmlussier | @coffee
09:35 | kmlussier | Sigh...pinesol_green is failing me.
09:36 | tsbere | kmlussier: Fairly easy to do, given the lack of pinesol_green in the channel...
09:37 | * mmorgan | pours a cup of NOBLE's finest and sends it sliding down the bar to kmlussier.
09:37 | | yboston joined #evergreen
09:38 | * jeff | kicks the bot
09:40 | | pinesol_green joined #evergreen
09:40 | jeff | @coffee kmlussier
09:40 | * pinesol_green | brews and pours a cup of Indonesia Blue Batak Peaberry, and sends it sliding down the bar to kmlussier
09:40 | mmorgan | Phew! jeff++
09:42 | * jeff | also kicks mail
09:42 | kmlussier | jeff++ Thank you! You're a lifesaver.
09:54 | jeff | @coffee
09:54 | * pinesol_green | brews and pours a cup of Panama Elida Estate, and sends it sliding down the bar to jeff
10:13 | | mllewellyn joined #evergreen
10:14 | | ericar_ joined #evergreen
10:14 | | jwoodard joined #evergreen
10:20 | | berick joined #evergreen
10:34 | | Christineb joined #evergreen
10:37 | csharp | @who is NOT CONNECTED TO THE NETWORK!!!!?
10:37 | pinesol_green | bwicksall is NOT CONNECTED TO THE NETWORK.
10:38 | | collum joined #evergreen
10:51 | | abneiman joined #evergreen
11:02 | berick | @weather 27712
11:02 | pinesol_green | berick: Durham, NC :: Clear :: 24F/-4C | Tuesday: Plentiful sunshine. High near 30F. Winds NW at 10 to 15 mph. Tuesday Night: A clear sky. Low 16F. Winds light and variable. | Updated: 6m ago
11:09 | jwoodard | one of my many staff accounts expired today
11:10 | jwoodard | took me a minute to realize that 20125 is not a valid year in EG
11:12 | * kmlussier | tries to imagine where Evergreen will be in 20125
11:13 | csharp | we're up on 2.9.1, but we're seeing cstore resource starvation
11:13 | berick | apparently it's a post code in Milan.
11:14 | * berick | looks closer, sees a Milan disco named "Alcatraz"
11:14 | csharp | and we had to up our cap of memcached connections
11:14 | csharp | I'm assuming the latter issue is for OPAC caching purposes
11:14 | berick | csharp: excessive cstores will cause extra memcache connections
11:15 | csharp | cstore is hitting max_children on all bricks
11:15 | berick | well, wait... trying to remember if cstore connects to memcache. looking...
11:15 | tsbere | berick: It should for "connected" drones at a minimum...
11:15 | berick | tsbere: not necessarily
11:17 | berick | ok, yes, all C drones connect to memcache
11:17 | berick | couldn't remember if it was on-demand
11:18 | berick | (but they connect regardless of whether they are in use -- connect on drone fork/startup)
11:18 | berick | so, yeah, excessive cstores could explain the rise in memcache connections
11:18 | berick | csharp: so the question is what's leaving cstores out to dry
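A rough way to watch the correlation berick describes is to count cstore drones against established memcached connections on a brick. This is only a sketch, not something from the log: the process-name pattern and memcached's default port (11211) are assumptions to adjust for your layout.

    #!/bin/bash
    # Count cstore drones vs. memcached connections every 30s. Each C drone
    # connects to memcached at fork time, so the two numbers should move together.
    # Assumptions: drone processes carry "open-ils.cstore" in their command line
    # and memcached listens on its default port, 11211.
    while true; do
        drones=$(ps ax | grep -c '[o]pen-ils.cstore')
        memc=$(netstat -tn 2>/dev/null | grep -c ':11211 ')
        echo "$(date '+%H:%M:%S')  cstore drones: ${drones}  memcached connections: ${memc}"
        sleep 30
    done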
11:20 | csharp | we're upping the max_children to 96, since that would accommodate the per-brick db connections we're seeing
11:20 | berick | which means looking for "No request was received in" log entries and tracing them back to the original API call.
11:20 | berick | which is unfortunately kind of a pain
11:20 | csharp | berick: thanks for the tip
11:33 | | sandbergja joined #evergreen
11:47 | csharp | berick: I can't tell whether the messages I'm seeing are the cause of the problem or just a symptom - there are many many 'No request was received in' entries
11:47 | csharp | berick: any thoughts on how I might rule things out?
11:47 | berick | csharp: first, limit to those created by cstore
11:50 | berick | i would probably limit my searching to a single hour (or maybe less). extract the thread trace from the 'no request was received' message
11:50 | csharp | I'm with you so far
11:50 | berick | from there work back to the first API call encountered for that thread trace
11:50 | berick | which should be the first log line for each thread trace
11:50 | berick | (though sometimes not, for unknown reasons)
11:51 | csharp | okay good
11:51 | berick | and be sure to check osrfsys.log not activity log, since tpac stuff does not go to the activity log (w/ a few exceptions)
11:52 | csharp | cool - that's where I am
11:52 | berick | that really messed me up when jason s. was having a similar problem
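A minimal sketch of the tracing berick lays out above, assuming osrfsys.log lives under /openils/var/log/ and that you eyeball a sample line first to see where the thread trace sits; both the path and the awk field number are guesses to adapt.

    #!/bin/bash
    # 1. Collect cstore "No request was received in" entries for one hour of interest.
    LOG=/openils/var/log/osrfsys.log
    grep 'open-ils.cstore' "$LOG" | grep 'No request was received in' \
        | grep ' 12:' > /tmp/cstore_timeouts.txt

    # 2. Pull the thread trace out of one entry (field position is a guess --
    #    check a real line first), then list the earliest entries for that trace;
    #    the first one is normally the originating API call.
    trace=$(head -1 /tmp/cstore_timeouts.txt | awk '{print $5}')
    grep "$trace" "$LOG" | head -20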
11:55 | miker | csharp: did you upgrade opensrf as well? (guessing yes)
12:03 | csharp | miker: we're on 2.4.1
12:05 | csharp | berick: using your method, I see open-ils.circ functions - many checkin, hold_pull_list.print, and some checkout.full
12:05 | csharp | I also see some bookbag retrieval and some open-ils.search
12:07 | berick | csharp: you may want to compare your results to a pre-2.9.1 log file to see what's changed
12:09 | | jihpringle joined #evergreen
12:12 | csharp | I can see on first glance that a checkin from last week took less than 1 second and the one I'm looking at took 9 seconds
12:12 | tsbere | csharp: Thinking about things, at one point we upped our cstore keepalive value from the default 6 up to 12. We were getting a lot of things returning in 7 seconds...
12:15 | csharp | actually a checkin from this morning took < 1 sec
12:15 | berick | csharp: beware they could be taking longer as a symptom of the loss of cstores
12:16 | csharp | yeah
12:16 | csharp | that's why I feel like I'm chasing my tail
12:17 | tsbere | csharp: If I recall, before I made that change to the keepalive we were running into a lot of cstore issues. Maybe a similar change will help you?
12:18 | csharp | tsbere: so something would take longer than 6 seconds, then there was a cascade?
12:18 | csharp | I'll give it a shot
12:18 | berick | uh, be careful with that too. extending the timeout can exacerbate cstore exhaustion.
12:19 | tsbere | csharp: The drone would take longer than 6 seconds to respond and end up in limbo until it timed out, resulting in more drones being spawned, that were asked to do things that took more than 6 seconds to respond...
12:19 | tsbere | csharp: As berick just mentioned though, setting it too high is a different issue
12:19 | berick | each abandoned cstore drone waits $timeout seconds before exiting. If $timeout is higher, they will wait idle longer, causing more cstores to be required.
12:19 | | kitteh_ joined #evergreen
12:20 | csharp | berick: where is $timeout defined?
12:20 | tsbere | csharp: We actually bumped our utility server all the way up to 120 on the cstore keepalive due to some A/T queries timing out, but said server doesn't serve outside requests at all.
12:20 | berick | csharp: opensrf.xml keepalive
12:21 | csharp | berick: cool thanks
12:21 | * berick | would actually consider lowering the timeout by a second to allow the drones to recover faster until the root of the problem is found.
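For reference, the keepalive berick points at (and the max_children cap csharp is raising) both sit in the open-ils.cstore stanza of opensrf.xml. A small sketch for eyeballing the current values, assuming the stock /openils/conf path; the affected services need a restart after any change.

    # Show the current cstore keepalive and child limits (path assumes a stock
    # /openils install); restart the affected OpenSRF services after editing.
    sed -n '/<open-ils.cstore>/,/<\/open-ils.cstore>/p' /openils/conf/opensrf.xml \
        | grep -E 'keepalive|min_children|max_children'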
12:22 | csharp | @decide Bill or Tom
12:22 | pinesol_green | csharp: That's a tough one...
12:22 | csharp | pinesol_green: exactly
12:22 | pinesol_green | Factoid 'exactly' not found
12:22 | pinesol_green | csharp: have you tried local mean solar time for the named city as the reference point?
12:22 | berick | heh
12:23 | tsbere | csharp: Do you know how to override that for a specific brick? Only applying an increase/decrease to one brick could give you a side by side "is this helping, hurting, or doing nothing?" comparison
12:28 | csharp | things have calmed significantly - I'm going to watch and wait
12:29 | csharp | I appreciate everyone's help and advice!
12:29 | csharp | berick++ tsbere++
12:49 | | bmills joined #evergreen
12:59 | | vlewis joined #evergreen
13:03 | miker | csharp: so, either folks have given up, or the DB is now primed and fast enough that timeouts aren't hit on the client side ... ;)
13:12 | csharp | miker: I think they gave up - we're asking them to come off standalone and back on live
13:12 | csharp | I can detect slowness from where I am too, though - something's definitely off
13:19 | csharp | okay - we're back to losing cstore children
13:23 | jboyer-isl | csharp: By chance was your database rebuilt from the ground up?
13:24 | jboyer-isl | (I'd have asked earlier but it looked like you were doing ok)
13:26 | csharp | jboyer-isl: well, I moved from 9.3 to 9.4, but it was using pg_upgrade
13:27 | jboyer-isl | I should have been more clear. Was the config updated? Since you moved from 9.3 to 9.4, is the 9.4 db looking for its config in /etc/postgresql/9.3... or 9.4..., etc.?
13:29 | jboyer-isl | I assume it's extremely unlikely that any setting names have changed, but that's another thing to consider.
13:31 | dbs | csharp: I'm slow to the discussion, but have you checked the query plans? I don't think 9.3->9.4 automatically updates statistics
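dbs's point is a known pg_upgrade gotcha: planner statistics are not carried over, so plans can be bad until a fresh ANALYZE runs. A hedged sketch follows; the database name and login are assumptions, and pg_upgrade also emits an analyze_new_cluster.sh helper that does the same thing incrementally.

    # Rebuild planner statistics after pg_upgrade, then sanity-check a plan.
    # The database name "evergreen" and the postgres login are assumptions.
    psql -U postgres -d evergreen -c 'ANALYZE VERBOSE;'
    psql -U postgres -d evergreen -c 'EXPLAIN ANALYZE SELECT id FROM actor.usr LIMIT 10;'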
13:31 | berick | could confirm whether the DB is the culprit by comparing query times to pre-upgrade logs. if you're only logging queries > some duration, just counting the number of lines in the logs is a good indicator.
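berick's line-counting trick assumes log_min_duration_statement is set, so only statements over that threshold reach the Postgres log. A rough sketch; the log paths are examples only.

    # Compare the volume of slow statements before and after the upgrade.
    # Assumes duration logging was enabled in both clusters; paths are examples.
    grep -c 'duration:' /var/log/postgresql/postgresql-9.3-main.log.1
    grep -c 'duration:' /var/log/postgresql/postgresql-9.4-main.log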
13:32 | dbs | If the database itself is slow, opensrf config tweakery isn't going to help a lot :/
13:32 | berick | dbs: exactly
13:39 | csharp | dbs: jboyer-isl: berick: I think it is the DB
13:39 | jboyer-isl | :(
13:39 | csharp | timings between servers of similar specs on nearly the same dataset are 3x slower on our prod server
13:40 | berick | csharp: hmm, we recently had a pile of prevent-wraparound vacuums kick off mid-day. It caused a noticeable slow down.
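Wraparound-prevention vacuums like the ones berick mentions show up in pg_stat_activity while they run, so they are easy to check for; a small sketch (database name again assumed).

    # Any autovacuum workers running right now, including the
    # "(to prevent wraparound)" variety?
    psql -U postgres -d evergreen -c \
        "SELECT pid, query_start, query FROM pg_stat_activity WHERE query LIKE 'autovacuum:%';"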
13:40 | csharp | (both on 9.4)
13:40 | | DPearl joined #evergreen
13:40 | csharp | I'm doing an analyze right now
13:41 | csharp | and we have a vacuum analyze cron that runs weekly (having lived through xact wraparound horror)
13:41 | berick | k, cool, good to rule it out
13:43 | jboyer-isl | I suppose now is a good time to ask, any vacuum / vacuum analyzes run after the upgrade scripts did their thing?
13:43 | csharp | jboyer-isl: yep
13:43 | jboyer-isl | that's good
13:43 | csharp | before and after
13:44 | csharp | and actually did a vacuum full on one table after it bloated with some post upgrade db work
14:13 | | vlewis_ joined #evergreen
15:08 | | krvmga joined #evergreen
15:13 | DPearl | Got a problem. I want to add a field to vmp and have it show up as new column in the vandelay/inc/profiles.tt2 table. I add field to fm_IDL.xml, the database, autogen.sh. Did restart-all of opensrf, apache. Problem: Only header of table appears (rows are gone), new column doesn't have label as header, just the db name. Also, appears hung (throbber spinning). What did I forget to do?
15:15 | tsbere | DPearl: You may have not added the new field correctly. I would probably need to see your changes at a minimum.
15:16 | tsbere | DPearl: oh, and thinking about it, there are two copies of fm_IDL.xml, one for the "reporter" that is web accessible, did that one get updated? (It is generated from the main one, but I don't recall what part of install does it)
15:17 | DPearl | tsbere: It has to be the latter. I recall that I have been bitten by this before. I may have been assuming incorrectly that this other xml was derived during the autogen. Thanks.
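The second copy tsbere mentions is the web-accessible IDL used by the reporter and some interfaces; keeping it in sync after an IDL change is normally a copy plus autogen plus a restart. The paths below assume a stock /openils layout and are worth checking against your install notes.

    # Sync the web-accessible IDL with the master copy after editing it,
    # then regenerate the derived JS/caches. Paths assume a stock /openils install.
    cp /openils/conf/fm_IDL.xml /openils/var/web/reports/fm_IDL.xml
    /openils/bin/autogen.sh
    # ...then restart OpenSRF services and Apache as usual.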
15:44 | csharp | what is the script/command/whatever to translated opensrf JSON into SQL?
15:45 | csharp | s/translated/translate/
15:52 | DPearl | tsbere: Fixing the reporter fm_IDL.xml didn't address the problem. I'll send you my changes.
15:52 | tsbere | csharp: Are you referring to the json query cstore call?
15:59 | csharp | tsbere: yeah
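One hedged way to answer csharp's question (not spelled out in the log): feed the JSON through open-ils.cstore's json_query method from srfsh and read the SQL it generates out of the cstore log at a verbose log level. The query below is only an illustration.

    # Hypothetical example: run a JSON query through cstore, then look for the
    # generated SELECT in the log (requires a verbose/debug log level for cstore,
    # and assumes your srfsh accepts piped input -- otherwise run it interactively).
    echo 'request open-ils.cstore open-ils.cstore.json_query {"from":"aou","select":{"aou":["id","shortname"]},"where":{"parent_ou":null}}' | srfsh
    grep -i 'from actor.org_unit' /openils/var/log/osrfsys.log | tail -5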
16:00 | csharp | berick: I'm still looking at DB performance as a factor, but a comparison of a checkin today vs. a checkin pre-upgrade is basically identical, just slower
16:00 | csharp | 1s versus like 9s
16:00 | csharp | I'm hitting a brick wall with the DB stuff - not much on the web about it
16:00 | tsbere | csharp: Same checkin modifiers?
16:01 | csharp | tsbere: I don't see a way to tell from the logs :-/
16:02 | csharp | there are people who report 9.4 being slower than 9.3, but they appear to be shouted down/dismissed by the PG folks
16:02 | csharp | I *can* verify that we are using 9.4 and not some ghost or hybrid of 9.3
16:02 | csharp | and that the configuration settings were brought along basically as-is
16:03 | csharp | queries are fast, then slower, then faster - same query can take 1.5s, then 3.8s
16:03 | tsbere | csharp: how is memory/swap usage on the DB server?
16:04 | csharp | memory is topped out, but it's mostly in the cache - there is some swapping
16:04 | csharp | but I can tell you it's always been that way
16:05 | csharp | we're on SSDs
16:10 | | abowling joined #evergreen
16:19 | dbwells | csharp: Maybe you are hitting some variation of bug #1527731 (see also bug #1499086). We bumped up our join_collapse_limit as a more generalized solution, but if it is actually the exact same issue, applying jeffdavis's patch might be a better place to start.
16:19 | pinesol_green | Launchpad bug 1527731 in Evergreen "Query for OPAC copies can be extremely slow" [Undecided,New] https://launchpad.net/bugs/1527731
16:19 | pinesol_green | Launchpad bug 1499086 in Evergreen 2.9 "Slowness/timeout on loading bookbags in OPAC" [Medium,Triaged] https://launchpad.net/bugs/1499086
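dbwells's join_collapse_limit workaround is a plain Postgres planner setting (default 8); a sketch of checking and raising it on 9.4, where the value 12 is only an illustration, not his number.

    # Check the current value, then raise it cluster-wide and reload.
    psql -U postgres -d evergreen -c 'SHOW join_collapse_limit;'
    psql -U postgres -c "ALTER SYSTEM SET join_collapse_limit = 12;"
    psql -U postgres -c "SELECT pg_reload_conf();"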
16:28 | | vlewis joined #evergreen
16:31 | csharp | dbwells: thanks for the tip
16:32 | | jlitrell joined #evergreen
17:02 | | vlewis_ joined #evergreen
17:06 | | mmorgan left #evergreen
17:17 | csharp | okay - I'm currently thinking about rolling back to PG 9.3 - before I get that drastic, does anyone have any other suggestions?
17:18 | csharp | staff client is so slow logging in that it's causing network failure messages
17:18 | csharp | staff can't checkin, can't checkout
17:18 | csharp | patrons are getting angry
17:19 | csharp | I can say for sure that we weren't seen any of this on 2.7.2/PG 9.3
17:20 | csharp | s/seen/seeing/
17:21 | kmlussier | :(
17:21 | csharp | kmlussier: please take this as your warning not to rush into upgrading to 9.4
17:22 | csharp | or at least not to do so in tandem with major EG upgrades :-/
17:22 | kmlussier | csharp: I have no bits of wisdom to offer, but I'll send positive thoughts that things will get better soon.
17:22 | csharp | kmlussier: thanks!
17:22 | * csharp | heads home to regroup and decide what's next for the evening
20:44 | | sandbergja joined #evergreen
22:50 | | dbwells_ joined #evergreen