Evergreen ILS Website

IRC log for #evergreen, 2016-01-19


All times shown according to the server's local time.

Time Nick Message
05:06 dbwells_ joined #evergreen
06:40 rlefaive joined #evergreen
07:04 jboyer-isl joined #evergreen
07:36 rjackson_isl joined #evergreen
08:01 ericar joined #evergreen
08:04 mrpeters joined #evergreen
08:06 mrpeters1 joined #evergreen
08:40 mmorgan joined #evergreen
08:45 mmorgan1 joined #evergreen
08:51 mmorgan joined #evergreen
08:52 collum joined #evergreen
09:19 maryj joined #evergreen
09:24 ericar_ joined #evergreen
09:34 kmlussier @coffee
09:35 kmlussier Sigh...pinesol_green is failing me.
09:36 tsbere kmlussier: Fairly easy to do, given the lack of pinesol_green in the channel...
09:37 * mmorgan pours a cup of NOBLE's finest and sends it sliding down the bar to kmlussier.
09:37 yboston joined #evergreen
09:38 * jeff kicks the bot
09:40 pinesol_green joined #evergreen
09:40 jeff @coffee kmlussier
09:40 * pinesol_green brews and pours a cup of Indonesia Blue Batak Peaberry, and sends it sliding down the bar to kmlussier
09:40 mmorgan Phew! jeff++
09:42 * jeff also kicks mail
09:42 kmlussier jeff++ Thank you! You're a lifesaver.
09:54 jeff @coffee
09:54 * pinesol_green brews and pours a cup of Panama Elida Estate, and sends it sliding down the bar to jeff
10:13 mllewellyn joined #evergreen
10:14 ericar_ joined #evergreen
10:14 jwoodard joined #evergreen
10:20 berick joined #evergreen
10:34 Christineb joined #evergreen
10:37 csharp @who is NOT CONNECTED TO THE NETWORK!!!!?
10:37 pinesol_green bwicksall is NOT CONNECTED TO THE NETWORK.
10:38 collum joined #evergreen
10:51 abneiman joined #evergreen
11:02 berick @weather 27712
11:02 pinesol_green berick: Durham, NC :: Clear :: 24F/-4C | Tuesday: Plentiful sunshine. High near 30F. Winds NW at 10 to 15 mph. Tuesday Night: A clear sky. Low 16F. Winds light and variable. | Updated: 6m ago
11:09 jwoodard one of my many staff accounts expired today
11:10 jwoodard took me a minute to realize that 20125 is not a valid year in EG
11:12 * kmlussier tries to imagine where Evergreen will be in 20125
11:13 csharp we're up on 2.9.1, but we're seeing cstore resource starvation
11:13 berick apparently it's a post code in Milan.
11:14 * berick looks closer, sees a Milan disco named "Alcatraz"
11:14 csharp and we had to up our cap of memcached connections
11:14 csharp I'm assuming the latter issue is for OPAC caching purposes
11:14 berick csharp: excessive cstores will cause extra memcache connections
11:14 csharp cstore is hitting max_children on all bricks
11:15 berick well, wait...  trying to remember if cstore connects to memcache.  looking...
11:15 tsbere berick: It should for "connected" drones at a minimum...
11:15 berick tsbere: not necessarily
11:17 berick ok, yes, all C drones connect to memcache
11:17 berick couldn't remember if it was on-demand
11:18 berick (but they connect regardless of whether they are in use -- connect on drone fork/startup)
11:18 berick so, yeah, excessive cstores could explain the rise in memcache connections
11:18 berick csharp: so the question is what's leaving cstores out to dry
11:20 csharp we're upping the max_children to 96, since that would accommodate the per-brick db connections we're seeing
11:20 berick which means looking for "No request was received in" log entries and tracing them back to the original API call.
11:20 berick which is unfortunately kind of a pain
11:20 csharp berick: thanks for the tip
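A quick way to sanity-check the cstore/memcached relationship berick describes is to compare the number of cstore drones on a brick against memcached's connection count. A rough sketch only; the memcached host/port and the process-name grep are assumptions about a typical install:

    # count running cstore drones on this brick
    ps ax | grep -c '[o]pen-ils.cstore'

    # compare against memcached's current connection count (add -w 1 if nc hangs)
    echo stats | nc -w 1 memcached-host 11211 | grep curr_connections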
11:33 sandbergja joined #evergreen
11:47 csharp berick: I can't tell whether the messages I'm seeing are the cause of the problem or just a symptom - there are many many 'No request was received in' entries
11:47 csharp berick: any thoughts on how I might rule things out?
11:47 berick csharp: first, limit to those created by cstore
11:50 berick i would probably limit my searching to a single hour (or maybe less).  extract the thread trace from the 'no request was received' message
11:50 csharp I'm with you so far
11:50 berick from there work back to the first API call encountered for that thread trace
11:50 berick which should be the first log line for each thread trace
11:50 berick (though sometimes not, for unknown reasons)
11:50 csharp okay good
11:51 berick and be sure to check osrfsys.log not activity log, since tpac stuff does not go to the activity log (w/ a few exceptions)
11:52 csharp cool - that's where I am
11:52 berick that really messed me up when jason s. was having a similar problem
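The tracing procedure berick lays out above can be roughed out with grep. A sketch only, assuming the stock osrfsys.log location; the hour window and thread-trace placeholder need to be filled in from your own log lines, and the exact position of the trace in the bracketed log tag varies by OpenSRF version:

    LOG=/openils/var/log/osrfsys.log   # assumed default location
    HOUR='2016-01-19 11:'              # limit the search to a single hour

    # 1. cstore timeouts in that hour; note the thread trace on each line
    grep "$HOUR" "$LOG" | grep 'open-ils.cstore' | grep 'No request was received in'

    # 2. the first line logged for a given thread trace is usually the originating API call
    grep 'THREAD_TRACE_HERE' "$LOG" | head -n 20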
11:55 miker csharp: did you upgrade opensrf as well? (guessing yes)
12:03 csharp miker: we're on 2.4.1
12:05 csharp berick: using your method, I see open-ils.circ functions - many checkin, hold_pull_list.print, and some checkout.full
12:05 csharp I also see some bookbag retrieval and some open-ils.search
12:07 berick csharp: you may want to compare your results to a pre-2.9.1 log file to see what's changed
12:09 jihpringle joined #evergreen
12:12 csharp I can see on first glance that a checkin from last week took less than 1 second and the one I'm looking at took 9 seconds
12:12 tsbere csharp: Thinking about things, at one point we upped our cstore keepalive value from the default 6 up to 12. We were getting a lot of things returning in 7 seconds...
12:15 csharp actually a checkin from this morning took < 1 sec
12:15 berick csharp: beware they could be taking longer as a symptom of the loss of cstores
12:16 csharp yeah
12:16 csharp that's why I feel like I'm chasing my tail
12:17 tsbere csharp: If I recall, before I made that change to the keepalive we were running into a lot of cstore issues. Maybe a similar change will help you?
12:18 csharp tsbere: so something would take longer than 6 seconds, then there was a cascade?
12:18 csharp I'll give it a shot
12:18 berick uh, be careful with that too.  extending the timeout can exacerbate cstore exhaustion.
12:19 tsbere csharp: The drone would take longer than 6 seconds to respond and end up in limbo until it timed out, resulting in more drones being spawned, that were asked to do things that took more than 6 seconds to respond...
12:19 tsbere csharp: As berick just mentioned though, setting it too high is a different issue
12:19 berick each abandoned cstore drone waits $timeout seconds before exiting.  If $timeout is higher, they will wait idle longer, causing more cstores to be required.
12:19 kitteh_ joined #evergreen
12:20 csharp berick: where is $timeout defined?
12:20 tsbere csharp: We actually bumped our utility server all the way up to 120 on the cstore keepalive due to some A/T queries timing out, but said server doesn't serve outside requests at all.
12:20 berick csharp: opensrf.xml keepalive
12:21 csharp berick: cool thanks
12:21 * berick would actually consider lowering the timeout by a second to allow the drones to recover faster until the root of the problem is found.
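For reference, both knobs discussed here live in opensrf.xml in the open-ils.cstore stanza. An illustrative excerpt using the values mentioned above (a 96-child cap, keepalive left at the default 6), not a recommendation; and since each brick reads its own copy of the file, editing just one brick first gives the side-by-side comparison tsbere suggests below. A service restart on that brick is needed for the change to take effect.

    <!-- excerpt from /openils/conf/opensrf.xml; element layout per a typical install -->
    <open-ils.cstore>
        <keepalive>6</keepalive>              <!-- seconds an abandoned drone waits before exiting -->
        <unix_config>
            <max_children>96</max_children>   <!-- per-brick drone cap -->
        </unix_config>
    </open-ils.cstore>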
12:22 csharp @decide Bill or Tom
12:22 pinesol_green csharp: That's a tough one...
12:22 csharp pinesol_green: exactly
12:22 pinesol_green Factoid 'exactly' not found
12:22 pinesol_green csharp: have you tried local mean solar time for the named city as the reference point?
12:22 berick heh
12:23 tsbere csharp: Do you know how to override that for a specific brick? Only applying an increase/decrease to one brick could give you a side by side "is this helping, hurting, or doing nothing?" comparison
12:28 csharp things have calmed significantly - I'm going to watch and wait
12:29 csharp I appreciate everyone's help and advice!
12:29 csharp berick++ tsbere++
12:49 bmills joined #evergreen
12:59 vlewis joined #evergreen
13:03 miker csharp: so, either folks have given up, or the DB is now primed and fast enough that timeouts aren't hit on the client side ... ;)
13:12 csharp miker: I think they gave up - we're asking them to come off standalone and back on live
13:12 csharp I can detect slowness from where I am too, though - something's definitely off
13:19 csharp okay - we're back to losing cstore children
13:23 jboyer-isl csharp: By chance was your database rebuilt from the ground up?
13:24 jboyer-isl (I'd have asked earlier but it looked like you were doing ok)
13:26 csharp jboyer-isl: well, I moved from 9.3 to 9.4, but it was using pg_upgrade
13:27 jboyer-isl I should have been more clear. Was the config updated? since you moved from 9.3 to 9.4, is the 9.4 db looking for its config in /etc/postgresql/9.3... or 9.4..., etc.?
13:29 jboyer-isl I assume it's extremely unlikely that any setting names have changed, but that's another thing to consider.
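jboyer-isl's question can be settled directly from psql with standard introspection commands (connection details are placeholders):

    # confirm which config file and data directory the running cluster actually loaded
    psql -U evergreen -h your-db-host -c "SHOW config_file;"
    psql -U evergreen -h your-db-host -c "SHOW data_directory;"
    psql -U evergreen -h your-db-host -c "SELECT version();"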
13:31 dbs csharp: I'm slow to the discussion, but have you checked the query plans? I don't think 9.3->9.4 automatically updates statistics
13:31 berick could confirm whether the DB is the culprit by comparing query times to pre-upgrade logs.  if you're only logging queries > some duration, just counting the number of lines in the logs is a good indicator.
13:32 dbs If the database itself is slow, opensrf config tweakery isn't going to help a lot :/
13:32 berick dbs: exactly
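Two quick checks for the points dbs and berick raise: pg_upgrade does not carry planner statistics into the new cluster (it normally leaves an analyze_new_cluster.sh helper behind for exactly this), and if log_min_duration_statement is enabled, counting slow-query lines before and after the upgrade is a cheap comparison. A sketch; the log paths are Debian-style defaults and will differ elsewhere:

    # regenerate planner statistics on the 9.4 cluster
    vacuumdb --all --analyze-only      # or: psql -c "ANALYZE VERBOSE;" per database

    # rough slow-query comparison, assuming log_min_duration_statement is set
    grep -c 'duration:' /path/to/pre-upgrade/postgresql-9.3.log
    grep -c 'duration:' /var/log/postgresql/postgresql-9.4-main.log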
13:39 csharp dbs: jboyer-isl: berick: I think it is the DB
13:39 jboyer-isl :(
13:39 csharp timings between servers of similar specs on nearly the same dataset are 3x slower on our prod server
13:40 berick csharp: hmm, we recently had a pile of prevent-wraparound vacuums kick off mid-day.  It caused a noticeable slow down.
13:40 csharp (both on 9.4)
13:40 DPearl joined #evergreen
13:40 csharp I'm doing an analyze right now
13:41 csharp and we have a vacuum analyze cron that runs weekly (having lived through xact wraparound horror)
13:41 berick k, cool, good to rule it out
13:43 jboyer-isl I suppose now is a good time to ask, were any vacuums / vacuum analyzes run after the upgrade scripts did their thing?
13:43 csharp jboyer-isl: yep
13:43 jboyer-isl that's good
13:43 csharp before and after
13:44 csharp and actually did a vacuum full on one table after it bloated with some post upgrade db work
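A weekly vacuum-analyze cron like the one csharp mentions might look roughly like this; the schedule, user, and file name are illustrative:

    # /etc/cron.d/evergreen-db-maintenance (illustrative)
    # Sunday 03:00, run as the postgres user: vacuum and analyze every database
    0 3 * * 0  postgres  vacuumdb --all --analyze --quiet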
14:13 vlewis_ joined #evergreen
15:08 krvmga joined #evergreen
15:13 DPearl Got a problem.  I want to add a field to vmp and have it show up as a new column in the vandelay/inc/profiles.tt2 table.  I added the field to fm_IDL.xml and the database, and ran autogen.sh.  Did a restart-all of opensrf and apache.  Problem: only the header of the table appears (rows are gone), and the new column has no label as its header, just the db name.  Also, it appears hung (throbber spinning).  What did I forget to do?
15:15 tsbere DPearl: You may have not added the new field correctly. I would probably need to see your changes at a minimum.
15:16 tsbere DPearl: oh, and thinking about it, there are two copies of fm_IDL.xml, one for the "reporter" that is web accessible, did that one get updated? (It is generated from the main one, but I don't recall what part of install does it)
15:17 DPearl tsbere: It has to be the latter.  I recall that I have been bitten by this before. I may have been assuming incorrectly that this other xml was derived during the autogen.  Thanks.
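The gotcha tsbere is pointing at: the IDL exists in two places, the copy the services read and a web-accessible copy used by browser-side code such as the reporter, and autogen.sh alone may not refresh the latter. A sketch of the usual sequence after an IDL change, with commonly used install paths and commands that may differ on your system:

    # 1. edit the service-side IDL
    vi /openils/conf/fm_IDL.xml

    # 2. refresh the web-accessible copy (assumed path)
    cp /openils/conf/fm_IDL.xml /openils/var/web/reports/fm_IDL.xml

    # 3. regenerate cached JS/fieldmapper output, then restart services and apache
    /openils/bin/autogen.sh
    osrf_control --localhost --restart-all   # command name/flags vary by OpenSRF version
    sudo service apache2 restart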
15:44 csharp what is the script/command/whatever to translate opensrf JSON into SQL?
15:52 DPearl tsbere: Fixing the reporter fm_IDL.xml didn't address the problem.  I'll send you my changes.
15:52 tsbere csharp: Are you referring to the json query cstore call?
15:59 csharp tsbere: yeah
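If the json query cstore call is what's meant, it can be exercised from srfsh, and with the log level turned up cstore should log the SQL it builds; an illustrative session, where the query itself is just a made-up example against asset.copy (acp):

    srfsh# request open-ils.cstore open-ils.cstore.json_query {"from":"acp","where":{"barcode":"312345678901234"}}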
16:00 csharp berick: I'm still looking at DB performance as a factor, but a comparison of a checkin today vs. a checkin pre-upgrade is basically identical, just slower
16:00 csharp 1s versus like 9s
16:00 csharp I'm hitting a brick wall with the DB stuff - not much on the web about it
16:00 tsbere csharp: Same checkin modifiers?
16:01 csharp tsbere: I don't see a way to tell from the logs :-/
16:02 csharp there are people who report 9.4 being slower than 9.3, but they appear to be shouted down/dismissed by the PG folks
16:02 csharp I *can* verify that we are using 9.4 and not some ghost or hybrid of 9.3
16:02 csharp and that the configuration settings were brought along basically as-is
16:03 csharp queries are fast, then slower, then faster - same query can take 1.5s, then 3.8s
16:03 tsbere csharp: how is memory/swap usage on the DB server?
16:04 csharp memory is topped out, but it's mostly in the cache - there is some swapping
16:04 csharp but I can tell you it's always been that way
16:05 csharp we're on SSDs
16:10 abowling joined #evergreen
16:19 dbwells csharp: Maybe you are hitting some variation of bug #1527731 (see also bug #1499086).  We bumped up our join_collapse_limit as a more generalized solution, but if it is actually the exact same issue, applying jeffdavis's patch might be a better place to start.
16:19 pinesol_green Launchpad bug 1527731 in Evergreen "Query for OPAC copies can be extremely slow" [Undecided,New] https://launchpad.net/bugs/1527731
16:19 pinesol_green Launchpad bug 1499086 in Evergreen 2.9 "Slowness/timeout on loading bookbags in OPAC" [Medium,Triaged] https://launchpad.net/bugs/1499086
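dbwells's join_collapse_limit suggestion can be trialed per-session before persisting anything; a sketch, with the new value chosen arbitrarily (the PostgreSQL default is 8):

    -- try it in one session against a known-slow query first
    SET join_collapse_limit = 12;
    EXPLAIN ANALYZE <paste the slow OPAC copy query here>;

    -- if it helps, make it the cluster default (PostgreSQL 9.4+)
    ALTER SYSTEM SET join_collapse_limit = 12;
    SELECT pg_reload_conf();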
16:28 vlewis joined #evergreen
16:31 csharp dbwells: thanks for the tip
16:32 jlitrell joined #evergreen
17:02 vlewis_ joined #evergreen
17:06 mmorgan left #evergreen
17:17 csharp okay - I'm currently thinking about rolling back to PG 9.3 - before I get that drastic, does anyone have any other suggestions?
17:18 csharp staff client is so slow logging in that it's causing network failure messages
17:18 csharp staff can't checkin, can't checkout
17:18 csharp patrons are getting angry
17:19 csharp I can say for sure that we weren't seeing any of this on 2.7.2/PG 9.3
17:21 kmlussier :(
17:21 csharp kmlussier: please take this as your warning not to rush into upgrading to 9.4
17:22 csharp or at least not to do so in tandem with major EG upgrades :-/
17:22 kmlussier csharp: I have no bits of wisdom to offer, but I'll send positive thoughts that things will get better soon.
17:22 csharp kmlussier: thanks!
17:22 * csharp heads home to regroup and decide what's next for the evening
20:44 sandbergja joined #evergreen
22:50 dbwells_ joined #evergreen
