IRC log for #evergreen, 2017-11-01

All times shown according to the server's local time.

Time	Nick	Message
00:54		ariez joined #evergreen
00:54	pastebot	"ariez" at 64.57.241.14 pasted "CANNOT CREATE MARC RECORD IN EVERGREEN ILS 3.0" (6 lines) at http://paste.evergreen-ils.org/909
06:01	pinesol_green	News from qatests: Test Success <http://testing.evergreen-ils.org/~live>
07:38		dwgreen joined #evergreen
08:14		sandbergja joined #evergreen
08:57		bos20k joined #evergreen
09:00		yboston joined #evergreen
09:01		_adb joined #evergreen
09:40		jvwoolf joined #evergreen
09:44		mmorgan joined #evergreen
10:01		dbwells_ joined #evergreen
11:44		khuckins joined #evergreen
12:21		mmorgan1 joined #evergreen
12:32		rlefaive joined #evergreen
12:39		jihpringle joined #evergreen
12:39		jvwoolf joined #evergreen
12:43		mmorgan joined #evergreen
13:39		mmorgan joined #evergreen
13:58		khuckins_ joined #evergreen
13:59		khuckins_ joined #evergreen
14:13	jeffdavis	The 10-year anniversary of Sitka's first library going live on Evergreen is coming up in a few days.
14:14	bshum	Yay!
14:14	phasefx	yay!
14:15	bshum	jeffdavis: Where's the party going to be?
14:15	bshum	:D
14:16	jeffdavis	An exellent question :)
14:19		khuckins_ joined #evergreen
14:27	berick	jeffdavis++
14:38	berick	ooh, starting to occaisonally get something similar to bug 1704396. just regular searches though.
14:38	* berick	cranks up the logging
14:38	pinesol_green	Launchpad bug 1704396 in Evergreen "Slowness for metarecord and one-hit searches in 2.12" [High,Confirmed] https://launchpad.net/bugs/1704396
14:41	mmorgan	berick: It does happen with regular searches, too, but more often with those mentioned in the bug.
14:41	berick	mmorgan: oh good, thanks for confirming
14:43	dbwells	berick: I can't recall if I commented yet on that bug or not, but we've seen it (or something just like it) with simple open-ils.search.biblio.copy_counts.location.summary.retrieve calls. It seems like the only commonality is sufficient working of a storage service, but it might be even more generalizable than that.
14:44	berick	dbwells: i see...
14:45	dbwells	berick: I'd love to see some debug logs if you can catch it in the act. We can reliably reproduce it on production only, and I've been too gunshy to crank up to debug on production at this point.
14:48	berick	sorry dev VM, you're disk is about to smoke
14:49	berick	heh, ejabberd log level 5 is certainly effective
14:49	dbwells	My testbed when I was poking at it for an hour or two ended up being a simple srfsh file of 500 open-ils.search.biblio.copy_counts.location.summary.retrieve calls. It almost always failed in production within those 500, but never on dev with the same test :(
14:50	berick	dbwells: with the same or different args each time?
14:50	dbwells	berick: exactly the same
14:51	* berick	tries that
14:52		khuckins__ joined #evergreen
14:53	csharp	jeffdavis++
14:58	jeffdavis	evergreen++ # That first library is still on EG, I guess it must be a pretty okay ILS.
14:59	csharp	heh
15:09	miker	berick: re 1704396, I've added a ton of logging, and right now it's pointing at the process just going away when it tries to select() on the socket
15:09	miker	the process being the search drone
15:14	berick	miker: that's consisten w/ what I'm seeing. (finally caught it again). storage tries to send a reply to open-ils.search, jabber replies with a service-unavailable message.
15:14	berick	search process is gone
15:15	miker	right
15:15	miker	it dies inside the gather() that tries to fetch the result of that storage call
15:15	miker	I've confirmed that much, and added more logging in XMPPReader to see where in its wait() it's actually dying... trying to provoke it again
15:16	miker	(and wondering if that logging will "fix" it, because that would be annoying...)
15:17	miker	well, presuming it's in wait() ... but everything between gather() and wait() is really just test-and-pass, with a little timeout adjustment thrown in
15:31	miker	berick: and to board up one rabbit hole for you to avoid, I elided dispatch() and it still occurs :(
15:31		kmlussier joined #evergreen
15:32	berick	miker: thanks.
15:43		khuckins__ joined #evergreen
15:43	miker	wheeee... confirmed that it's the select in XMPPReader.pm around line 215. I put a log message right before and after that, and the one before is the last I see of that process
15:45	kmlussier	miker++ #Progress
15:46	miker	well, not seeing a solution ATM... that code is wrapped (way up the stack) in a try{}, which is implemented as an eval(), and should not be able to take the process down completely TTBOMK
15:47	kmlussier	miker: Yes, well, it's still progress. Every little piece of the puzzle helps.
15:47	miker	:)
15:47	berick	yeah, this is good.
15:48	berick	wonder if it's an uncaught signal.
15:51	miker	berick: I thought about that too, so I used 'use sigtrap' to install a signal handler that logs when a signal arrives that we don't have a handler installed for ... I'm not seeing those log lines show up
15:53	miker	(which isn't to say I got that quite right... it's worth another set of eyes, certainly)
15:56	berick	oh cool, never heard of sigtrap
16:03	pastebot	"miker" at 64.57.241.14 pasted "new sig handlers" (5 lines) at http://paste.evergreen-ils.org/910
16:03	miker	berick: so, I replaced Server.pm:526 with the following, inside the eval {} block:
16:03	miker	er, that -^
16:03	miker	s/following/preceding/
16:04	miker	and I do see the ones for TERM when a child is reaped due to max_spare_children
16:04	berick	gotcha
17:11	berick	miker: i tried an alternate approach and did a $SIG{$_} = sub { ... } for keys %SIG. started seeing drumroll... sporadic SIGALRMs.
17:11	berick	applied this patch https://bugs.launchpad.net/opensrf/+bug/1697029/comments/9 with a 'IGNORE' and problem appears to have gone away
17:11	pinesol_green	Launchpad bug 1697029 in OpenSRF "Service Listeners Crash when Using an Undefined Value for Syscalls" [High,Fix released]
17:12	berick	have not yet been able to reproduce in the time I normally am able to, anyway
17:12	berick	will keep trying
17:13	dbwells	berick++
17:13	kmlussier	berick: Was that patch not part of the code that was committed?
17:14	miker	hrm... we don't use alarm() in opensrf proper...
17:14	berick	miker: it's coming from the facet stuff I believe
17:15	miker	yep... just grep for that
17:15	berick	i reproduce by running copy counts API calls in parallel to catalog searches
17:15	dbwells	I am going to try and confirm, if I can.
17:15	miker	and, it's because we don't alarm(0)
17:15	berick	kmlussier: it's not in master...
17:15	miker	in the facet part
17:16	* kmlussier	keeps her fingers crossed.
17:16	jeffdavis	kmlussier: I think we all generally assumed that the SIGALRM thing in that ticket was a false lead
17:16	jeffdavis	so nothing further happened with that specific patch
17:16	jeffdavis	s/ticket/LP bug/
17:17	kmlussier	Sometimes, the slowness will disappear for a while after you apply a promising patch, making you think it's gone, only to come back in full force when you least suspect it.
17:17	kmlussier	It's sneaky
17:18	miker	berick: alternate proposal... at OpenILS/Application/Search/Biblio.pm:1417 say `alarm(0);` and likewise immediately after the close of the eval block
17:18	* miker	tests
17:19	berick	miker: +1 to adding that for sure
17:19	miker	that'll clear the alarm timer
17:25	dbwells	My usual breakage routine seems to be failing me tonight. Was probably relying unknowingly on some other load which doesn't exist at 5:25pm :)
17:25	* dbwells	keeps trying to push his server over
17:27	dbwells	It sounds like the masters here have things well in hand, at any rate :)
17:30		jvwoolf left #evergreen
17:30	dbwells	yay, finally broke it
17:41	miker	berick: I think you just found the final SIGALRM-shaped jigsaw puzzle piece under that couch cushion... some testing tomorrow will confirm (or deny), and if it's good I'll post a patch
17:44	berick	miker: awesome. beware the lint!
17:44	berick	and dog toys and old nickels and
17:50	dbwells	miker: berick: adding an IGNORE a la berick's suggestion also seems to work here. I certainly can contribute nothing regarding the correctness of other points being discussed :) There seems to be some light at the end of the tunnel, though!
17:54	berick	miker's suggestion is needed regardless and should resolve it in this instance. but we may also want to wrap select() in an ALRM IGNORE so this kind of thing won't bite us so readily in the future.
17:54	berick	not sure if globally ignoring it is the best strategy
17:54	berick	thanks for confirming, dbwells
17:54	miker	yeah, IGNORE works, but I'm a little nervous about futzing with sig handers in EG-proper. calling alarm(0) has the effect of canceling the timer ... there's a theoretical race condition that we could plug with a 'local $SIG{ALRM} = ...' or saving off the previous value, but the race is just one opcode wide (the longjmp from the die() to the next command, which would be alarm(0)) so my preference is to change less (and avoid direct %SIG mangling) if
17:54	miker	possible...
17:57	miker	(we at least have the fig leaf of the alarm() abstraction, rather than touching %SIG directly in the EG code, IOW)
17:58	miker	also
17:58	miker	@blame miker
17:58	pinesol_green	miker: Forget it, Jake. It's just miker.
17:58	miker	for those spurious ALRMs
18:00	berick	miker: when does that alarm code come into play? if the caller has the cache key, doesn't that imply the data is in the cache?
18:02	pinesol_green	News from qatests: Test Success <http://testing.evergreen-ils.org/~live>
18:02	berick	in any event, suppose we could also replace alarm() w/ a loop that counts up to the timeout if it continues to be problematic.
18:02	miker	1) the main search call uses respond_complete() to send back the results and the provisional facet key early, then calculates the facets and stores them 2) the caller gets the key and says "search! give me the facets!" in a second call 3) if the first call's post-respond_complete facet logic isn't done yet, the second search call waits until it sees that it is complete, sleeping 50ms at a time
18:03	miker	we could
18:03	miker	now, it's always wanted to do that
18:04	berick	aha, i see the respond_complete now, before cache_facets
18:04	miker	but because we were using method_lookup, the first request didn't actually return until the facet logic was done -- respond_complete doesn't send to the client directly
18:05	miker	but dispatch() instead of run() on a method_lookup call does talk directly to the client
18:05	miker	so now they can actually be parallel, instead of just pretending to be
18:06	miker	and the mod_perl code is rearranged to do a bunch of work between the result return and the facet request, giving facets parallel time to be calculated
18:07	miker	shaves ~1.5 secs off the average "several hundred results" search in terms of time-to-render
18:08	berick	right, thought it was something like, but was missing the piece where it returned early
18:09	miker	originally the facet timeout was 2s, then 4s in 3.0 IIRC, but I'm going to propose bumping it up to 10 as a max for facet timeout ... there are still instances of a result page not showing facets on the first load
18:09	miker	esp on all-in-one test servers
18:12	kmlussier	I see that happening a lot in 2.12. I haven't noticed it as much in 3.0.
18:33	jeffdavis	^ exciting!
18:50	kmlussier	jeffdavis: Yes, indeed! :)
19:52	csharp	and, just like that, https://blog.angular.io/version-5-0-0-of-angular-now-available-37e414935ced
23:52	kmlussier	I'm noticing an issue with Located URI searching using the Conerto dataset.
23:53	kmlussier	In 2.12, I can do a keyword search for 'hobbit' and retrieve one of the ebook test records with a located URI owned by the consortium.
23:53	kmlussier	In 3.0, I get no results.