Evergreen ILS Website

IRC log for #evergreen, 2016-01-19


All times shown according to the server's local time.

Time Nick Message
05:06 dbwells_ joined #evergreen
06:40 rlefaive joined #evergreen
07:04 jboyer-isl joined #evergreen
07:36 rjackson_isl joined #evergreen
08:01 ericar joined #evergreen
08:04 mrpeters joined #evergreen
08:06 mrpeters1 joined #evergreen
08:40 mmorgan joined #evergreen
08:45 mmorgan1 joined #evergreen
08:51 mmorgan joined #evergreen
08:52 collum joined #evergreen
09:19 maryj joined #evergreen
09:24 ericar_ joined #evergreen
09:34 kmlussier @coffee
09:35 kmlussier Sigh...pinesol_green is failing me.
09:36 tsbere kmlussier: Fairly easy to do, given the lack of pinesol_green in the channel...
09:37 * mmorgan pours a cup of NOBLE's finest and sends it sliding down the bar to kmlussier.
09:37 yboston joined #evergreen
09:38 * jeff kicks the bot
09:40 pinesol_green joined #evergreen
09:40 jeff @coffee kmlussier
09:40 * pinesol_green brews and pours a cup of Indonesia Blue Batak Peaberry, and sends it sliding down the bar to kmlussier
09:40 mmorgan Phew! jeff++
09:42 * jeff also kicks mail
09:42 kmlussier jeff++ Thank you! You're a lifesaver.
09:54 jeff @coffee
09:54 * pinesol_green brews and pours a cup of Panama Elida Estate, and sends it sliding down the bar to jeff
10:13 mllewellyn joined #evergreen
10:14 ericar_ joined #evergreen
10:14 jwoodard joined #evergreen
10:20 berick joined #evergreen
10:34 Christineb joined #evergreen
10:37 csharp @who is NOT CONNECTED TO THE NETWORK!!!!?
10:37 pinesol_green bwicksall is NOT CONNECTED TO THE NETWORK.
10:38 collum joined #evergreen
10:51 abneiman joined #evergreen
11:02 berick @weather 27712
11:02 pinesol_green berick: Durham, NC :: Clear :: 24F/-4C | Tuesday: Plentiful sunshine. High near 30F. Winds NW at 10 to 15 mph. Tuesday Night: A clear sky. Low 16F. Winds light and variable. | Updated: 6m ago
11:09 jwoodard one of my many staff accounts expired today
11:10 jwoodard took me a minute to realize that 20125 is not a valid year in EG
11:12 * kmlussier tries to imagine where Evergreen will be in 20125
11:13 csharp we're up on 2.9.1, but we're seeing cstore resource starvation
11:13 berick apparently it's a post code in Milan.
11:14 * berick looks closer, sees a Milan disco named "Alcatraz"
11:14 csharp and we had to up our cap of memcached connections
11:14 csharp I'm assuming the latter issue is for OPAC caching purposes
11:14 berick csharp: excessive cstores will cause extra memcache connections
11:14 csharp cstore is hitting max_children on all bricks
11:15 berick well, wait...  trying to remember if cstore connects to memcache.  looking...
11:15 tsbere berick: It should for "connected" drones at a minimum...
11:15 berick tsbere: not necessarily
11:17 berick ok, yes, all C drones connect to memcache
11:17 berick couldn't remember if it was on-demand
11:18 berick (but they connect regardless of whether they are in use -- connect on drone fork/startup)
11:18 berick so, yeah, excessive cstores could explain the rise in memcache connections
11:18 berick csharp: so the question is what's leaving cstores out to dry
11:20 csharp we're upping the max_children to 96, since that would accommodate the per-brick db connections we're seeing
11:20 berick which means looking for "No request was received in" log entries and tracing them back to the original API call.
11:20 berick which is unfortunately kind of a pain
11:20 csharp berick: thanks for the tip
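A quick way to sanity-check the cstore/memcached relationship berick describes is to compare the number of cstore drones on a brick against memcached's connection count. A rough sketch only; the memcached host/port and the process-name grep are assumptions about a typical install:

    # count running cstore drones on this brick
    ps ax | grep -c '[o]pen-ils.cstore'

    # compare against memcached's current connection count (add -w 1 if nc hangs)
    echo stats | nc -w 1 memcached-host 11211 | grep curr_connections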
11:33 sandbergja joined #evergreen
11:47 csharp berick: I can't tell whether the messages I'm seeing are the cause of the problem or just a symptom - there are many many 'No request was received in' entries
11:47 csharp berick: any thoughts on how I might rule things out?
11:47 berick csharp: first, limit to those created by cstore
11:50 berick i would probably limit my searching to a single hour (or maybe less).  extract the thread trace from the 'no request was received' message
11:50 csharp I'm with you so far
11:50 berick from there work back to the first API call encountered for that thread trace
11:50 berick which should be the first log line for each thread trace
11:50 berick (though sometimes not, for unknown reasons)
11:50 csharp okay good
11:51 berick and be sure to check osrfsys.log not activity log, since tpac stuff does not go to the activity log (w/ a few exceptions)
11:52 csharp cool - that's where I am
11:52 berick that really messed me up when jason s. was having a similar problem
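The tracing procedure berick lays out above can be roughed out with grep. A sketch only, assuming the stock osrfsys.log location; the hour window and thread-trace placeholder need to be filled in from your own log lines, and the exact position of the trace in the bracketed log tag varies by OpenSRF version:

    LOG=/openils/var/log/osrfsys.log   # assumed default location
    HOUR='2016-01-19 11:'              # limit the search to a single hour

    # 1. cstore timeouts in that hour; note the thread trace on each line
    grep "$HOUR" "$LOG" | grep 'open-ils.cstore' | grep 'No request was received in'

    # 2. the first line logged for a given thread trace is usually the originating API call
    grep 'THREAD_TRACE_HERE' "$LOG" | head -n 20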
11:55 miker csharp: did you upgrade opensrf as well? (guessing yes)
12:03 csharp miker: we're on 2.4.1
12:05 csharp berick: using your method, I see open-ils.circ functions - many checkin, hold_pull_list.print, and some checkout.full
12:05 csharp I also see some bookbag retrieval and some open-ils.search
12:07 berick csharp: you may want to compare your results to a pre-2.9.1 log file to see what's changed
12:09 jihpringle joined #evergreen
12:12 csharp I can see on first glance that a checkin from last week took less than 1 second and the one I'm looking at took 9 seconds
12:12 tsbere csharp: Thinking about things, at one point we upped our cstore keepalive value from the default 6 up to 12. We were getting a lot of things returning in 7 seconds...
12:15 csharp actually a checkin from this morning took < 1 sec
12:15 berick csharp: beware they could be taking longer as a symptom of the loss of cstores
12:16 csharp yeah
12:16 csharp that's why I feel like I'm chasing my tail
12:17 tsbere csharp: If I recall, before I made that change to the keepalive we were running into a lot of cstore issues. Maybe a similar change will help you?
12:18 csharp tsbere: so something would take longer than 6 seconds, then there was a cascade?
12:18 csharp I'll give it a shot
12:18 berick uh, be careful with that too.  extending the timeout can exacerbate cstore exhaustion.
12:19 tsbere csharp: The drone would take longer than 6 seconds to respond and end up in limbo until it timed out, resulting in more drones being spawned, that were asked to do things that took more than 6 seconds to respond...
12:19 tsbere csharp: As berick just mentioned though, setting it too high is a different issue
12:19 berick each abandoned cstore drone waits $timeout seconds before exiting.  If $timeout is higher, they will wait idle longer, causing more cstores to be required.
12:19 kitteh_ joined #evergreen
12:20 csharp berick: where is $timeout defined?
12:20 tsbere csharp: We actually bumped our utility server all the way up to 120 on the cstore keepalive due to some A/T queries timing out, but said server doesn't serve outside requests at all.
12:20 berick csharp: opensrf.xml keepalive
12:21 csharp berick: cool thanks
12:21 * berick would actually consider lowering the timeout by a second to allow the drones to recover faster until the root of the problem is found.
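For reference, both knobs discussed here live in opensrf.xml in the open-ils.cstore stanza. An illustrative excerpt using the values mentioned above (a 96-child cap, keepalive left at the default 6), not a recommendation; and since each brick reads its own copy of the file, editing just one brick first gives the side-by-side comparison tsbere suggests below. A service restart on that brick is needed for the change to take effect.

    <!-- excerpt from /openils/conf/opensrf.xml; element layout per a typical install -->
    <open-ils.cstore>
        <keepalive>6</keepalive>              <!-- seconds an abandoned drone waits before exiting -->
        <unix_config>
            <max_children>96</max_children>   <!-- per-brick drone cap -->
        </unix_config>
    </open-ils.cstore>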
12:22 csharp @decide Bill or Tom
12:22 pinesol_green csharp: That's a tough one...
12:22 csharp pinesol_green: exactly
12:22 pinesol_green Factoid 'exactly' not found
12:22 pinesol_green csharp: have you tried local mean solar time for the named city as the reference point?
12:22 berick heh
12:23 tsbere csharp: Do you know how to override that for a specific brick? Only applying an increase/decrease to one brick could give you a side by side "is this helping, hurting, or doing nothing?" comparison
12:28 csharp things have calmed significantly - I'm going to watch and wait
12:29 csharp I appreciate everyone's help and advice!
12:29 csharp berick++ tsbere++
12:49 bmills joined #evergreen
12:59 vlewis joined #evergreen
13:03 miker csharp: so, either folks have given up, or the DB is now primed and fast enough that timeouts aren't hit on the client side ... ;)
13:12 csharp miker: I think they gave up - we're asking them to come off standalone and back on live
13:12 csharp I can detect slowness from where I am too, though - something's definitely off
13:19 csharp okay - we're back to losing cstore children
13:23 jboyer-isl csharp: By chance was your database rebuilt from the ground up?
13:24 jboyer-isl (I'd have asked earlier but it looked like you were doing ok)
13:26 csharp jboyer-isl: well, I moved from 9.3 to 9.4, but it was using pg_upgrade
13:27 jboyer-isl I should have been more clear. Was the config updated? since you moved from 9.3 to 9.4, is the 9.4 db looking for its config in /etc/postgresql/9.3... or 9.4..., etc.?
13:29 jboyer-isl I assume it's extremely unlikely that any setting names have changed, but that's another thing to consider.
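jboyer-isl's question can be settled directly from psql with standard introspection commands (connection details are placeholders):

    # confirm which config file and data directory the running cluster actually loaded
    psql -U evergreen -h your-db-host -c "SHOW config_file;"
    psql -U evergreen -h your-db-host -c "SHOW data_directory;"
    psql -U evergreen -h your-db-host -c "SELECT version();"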
13:31 dbs csharp: I'm slow to the discussion, but have you checked the query plans? I don't think 9.3->9.4 automatically updates statistics
13:31 berick could confirm whether the DB is the culprit by comparing query times to pre-upgrade logs.  if you're only logging queries > some duration, just counting the number of lines in the logs is a good indicator.
13:32 dbs If the database itself is slow, opensrf config tweakery isn't going to help a lot :/
13:32 berick dbs: exactly
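Two quick checks for the points dbs and berick raise: pg_upgrade does not carry planner statistics into the new cluster (it normally leaves an analyze_new_cluster.sh helper behind for exactly this), and if log_min_duration_statement is enabled, counting slow-query lines before and after the upgrade is a cheap comparison. A sketch; the log paths are Debian-style defaults and will differ elsewhere:

    # regenerate planner statistics on the 9.4 cluster
    vacuumdb --all --analyze-only      # or: psql -c "ANALYZE VERBOSE;" per database

    # rough slow-query comparison, assuming log_min_duration_statement is set
    grep -c 'duration:' /path/to/pre-upgrade/postgresql-9.3.log
    grep -c 'duration:' /var/log/postgresql/postgresql-9.4-main.log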
13:39 csharp dbs: jboyer-isl: berick: I think it is the DB
13:39 jboyer-isl :(
13:39 csharp timings between servers of similar specs on nearly the same dataset are 3x slower on our prod server
13:40 berick csharp: hmm, we recently had a pile of prevent-wraparound vacuums kick off mid-day.  It caused a noticeable slow down.
13:40 csharp (both on 9.4)
13:40 DPearl joined #evergreen
13:40 csharp I'm doing an analyze right now
13:41 csharp and we have a vacuum analyze cron that runs weekly (having lived through xact wraparound horror)
13:41 berick k, cool, good to rule it out
13:43 jboyer-isl I suppose now is a good time to ask, were any vacuums / vacuum analyzes run after the upgrade scripts did their thing?
13:43 csharp jboyer-isl: yep
13:43 jboyer-isl that's good
13:43 csharp before and after
13:44 csharp and actually did a vacuum full on one table after it bloated with some post upgrade db work
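A weekly vacuum-analyze cron like the one csharp mentions might look roughly like this; the schedule, user, and file name are illustrative:

    # /etc/cron.d/evergreen-db-maintenance (illustrative)
    # Sunday 03:00, run as the postgres user: vacuum and analyze every database
    0 3 * * 0  postgres  vacuumdb --all --analyze --quiet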
14:13 vlewis_ joined #evergreen
15:08 krvmga joined #evergreen
15:13 DPearl Got a problem.  I want to add a field to vmp and have it show up as a new column in the vandelay/inc/profiles.tt2 table.  I added the field to fm_IDL.xml and the database, and ran autogen.sh.  Did a restart-all of opensrf and apache.  Problem: only the header of the table appears (rows are gone), and the new column has no label as its header, just the db name.  Also, it appears hung (throbber spinning).  What did I forget to do?
15:15 tsbere DPearl: You may have not added the new field correctly. I would probably need to see your changes at a minimum.
15:16 tsbere DPearl: oh, and thinking about it, there are two copies of fm_IDL.xml, one for the "reporter" that is web accessible, did that one get updated? (It is generated from the main one, but I don't recall what part of install does it)
15:17 DPearl tsbere: It has to be the latter.  I recall that I have been bitten by this before. I may have been assuming incorrectly that this other xml was derived during the autogen.  Thanks.
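The gotcha tsbere is pointing at: the IDL exists in two places, the copy the services read and a web-accessible copy used by browser-side code such as the reporter, and autogen.sh alone may not refresh the latter. A sketch of the usual sequence after an IDL change, with commonly used install paths and commands that may differ on your system:

    # 1. edit the service-side IDL
    vi /openils/conf/fm_IDL.xml

    # 2. refresh the web-accessible copy (assumed path)
    cp /openils/conf/fm_IDL.xml /openils/var/web/reports/fm_IDL.xml

    # 3. regenerate cached JS/fieldmapper output, then restart services and apache
    /openils/bin/autogen.sh
    osrf_control --localhost --restart-all   # command name/flags vary by OpenSRF version
    sudo service apache2 restart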
15:44 csharp what is the script/command/whatever to translate opensrf JSON into SQL?
15:52 DPearl tsbere: Fixing the reporter fm_IDL.xml didn't address the problem.  I'll send you my changes.
15:52 tsbere csharp: Are you referring to the json query cstore call?
15:59 csharp tsbere: yeah
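If the json query cstore call is what's meant, it can be exercised from srfsh, and with the log level turned up cstore should log the SQL it builds; an illustrative session, where the query itself is just a made-up example against asset.copy (acp):

    srfsh# request open-ils.cstore open-ils.cstore.json_query {"from":"acp","where":{"barcode":"312345678901234"}}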
16:00 csharp berick: I'm still looking at DB performance as a factor, but a comparison of a checkin today vs. a checkin pre-upgrade is basically identical, just slower
16:00 csharp 1s versus like 9s
16:00 csharp I'm hitting a brick wall with the DB stuff - not much on the web about it
16:00 tsbere csharp: Same checkin modifiers?
16:01 csharp tsbere: I don't see a way to tell from the logs :-/
16:02 csharp there are people who report 9.4 being slower than 9.3, but they appear to be shouted down/dismissed by the PG folks
16:02 csharp I *can* verify that we are using 9.4 and not some ghost or hybrid of 9.3
16:02 csharp and that the configuration settings were brought along basically as-is
16:03 csharp queries are fast, then slower, then faster - same query can take 1.5s, then 3.8s
16:03 tsbere csharp: how is memory/swap usage on the DB server?
16:04 csharp memory is topped out, but it's mostly in the cache - there is some swapping
16:04 csharp but I can tell you it's always been that way
16:05 csharp we're on SSDs
16:10 abowling joined #evergreen
16:19 dbwells csharp: Maybe you are hitting some variation of bug #1527731 (see also bug #1499086).  We bumped up our join_collapse_limit as a more generalized solution, but if it is actually the exact same issue, applying jeffdavis's patch might be a better place to start.
16:19 pinesol_green Launchpad bug 1527731 in Evergreen "Query for OPAC copies can be extremely slow" [Undecided,New] https://launchpad.net/bugs/1527731
16:19 pinesol_green Launchpad bug 1499086 in Evergreen 2.9 "Slowness/timeout on loading bookbags in OPAC" [Medium,Triaged] https://launchpad.net/bugs/1499086
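dbwells's join_collapse_limit suggestion can be trialed per-session before persisting anything; a sketch, with the new value chosen arbitrarily (the PostgreSQL default is 8):

    -- try it in one session against a known-slow query first
    SET join_collapse_limit = 12;
    EXPLAIN ANALYZE <paste the slow OPAC copy query here>;

    -- if it helps, make it the cluster default (PostgreSQL 9.4+)
    ALTER SYSTEM SET join_collapse_limit = 12;
    SELECT pg_reload_conf();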
16:28 vlewis joined #evergreen
16:31 csharp dbwells: thanks for the tip
16:32 jlitrell joined #evergreen
17:02 vlewis_ joined #evergreen
17:06 mmorgan left #evergreen
17:17 csharp okay - I'm currently thinking about rolling back to PG 9.3 - before I get that drastic, does anyone have any other suggestions?
17:18 csharp staff client is so slow logging in that it's causing network failure messages
17:18 csharp staff can't checkin, can't checkout
17:18 csharp patrons are getting angry
17:19 csharp I can say for sure that we weren't seeing any of this on 2.7.2/PG 9.3
17:21 kmlussier :(
17:21 csharp kmlussier: please take this as your warning not to rush into upgrading to 9.4
17:22 csharp or at least not to do so in tandem with major EG upgrades :-/
17:22 kmlussier csharp: I have no bits of wisdom to offer, but I'll send positive thoughts that things will get better soon.
17:22 csharp kmlussier: thanks!
17:22 * csharp heads home to regroup and decide what's next for the evening
20:44 sandbergja joined #evergreen
22:50 dbwells_ joined #evergreen
