Evergreen ILS Website

IRC log for #evergreen, 2017-11-01

| Channels | #evergreen index | Today | | Search | Google Search | Plain-Text | summary | Join Webchat

All times shown according to the server's local time.

Time Nick Message
00:54 ariez joined #evergreen
00:54 pastebot "ariez" at 64.57.241.14 pasted "CANNOT CREATE MARC RECORD IN EVERGREEN ILS 3.0" (6 lines) at http://paste.evergreen-ils.org/909
06:01 pinesol_green News from qatests: Test Success <http://testing.evergreen-ils.org/~live>
07:38 dwgreen joined #evergreen
08:14 sandbergja joined #evergreen
08:57 bos20k joined #evergreen
09:00 yboston joined #evergreen
09:01 _adb joined #evergreen
09:40 jvwoolf joined #evergreen
09:44 mmorgan joined #evergreen
10:01 dbwells_ joined #evergreen
11:44 khuckins joined #evergreen
12:21 mmorgan1 joined #evergreen
12:32 rlefaive joined #evergreen
12:39 jihpringle joined #evergreen
12:39 jvwoolf joined #evergreen
12:43 mmorgan joined #evergreen
13:39 mmorgan joined #evergreen
13:58 khuckins_ joined #evergreen
13:59 khuckins_ joined #evergreen
14:13 jeffdavis The 10-year anniversary of Sitka's first library going live on Evergreen is coming up in a few days.
14:14 bshum Yay!
14:14 phasefx yay!
14:15 bshum jeffdavis: Where's the party going to be?
14:15 bshum :D
14:16 jeffdavis An exellent question :)
14:19 khuckins_ joined #evergreen
14:27 berick jeffdavis++
14:38 berick ooh, starting to occaisonally get something similar to bug 1704396.  just regular searches though.
14:38 * berick cranks up the logging
14:38 pinesol_green Launchpad bug 1704396 in Evergreen "Slowness for metarecord and one-hit searches in 2.12" [High,Confirmed] https://launchpad.net/bugs/1704396
14:41 mmorgan berick: It does happen with regular searches, too, but more often with those mentioned in the bug.
14:41 berick mmorgan: oh good, thanks for confirming
14:43 dbwells berick: I can't recall if I commented yet on that bug or not, but we've seen it (or something just like it) with simple open-ils.search.biblio.copy_co​unts.location.summary.retrieve calls.  It seems like the only commonality is sufficient working of a storage service, but it might be even more generalizable than that.
14:44 berick dbwells: i see...
14:45 dbwells berick: I'd love to see some debug logs if you can catch it in the act.  We can reliably reproduce it on production only, and I've been too gunshy to crank up to debug on production at this point.
14:48 berick sorry dev VM, you're disk is about to smoke
14:49 berick heh, ejabberd log level 5 is certainly effective
14:49 dbwells My testbed when I was poking at it for an hour or two ended up being a simple srfsh file of 500 open-ils.search.biblio.copy_co​unts.location.summary.retrieve calls.  It almost always failed in production within those 500, but never on dev with the same test :(
14:50 berick dbwells: with the same or different args each time?
14:50 dbwells berick: exactly the same
14:51 * berick tries that
14:52 khuckins__ joined #evergreen
14:53 csharp jeffdavis++
14:58 jeffdavis evergreen++ # That first library is still on EG, I guess it must be a pretty okay ILS.
14:59 csharp heh
15:09 miker berick: re 1704396, I've added a ton of logging, and right now it's pointing at the process just going away when it tries to select() on the socket
15:09 miker the process being the search drone
15:14 berick miker: that's consisten w/ what I'm seeing.  (finally caught it again).  storage tries to send a reply to open-ils.search, jabber replies with a service-unavailable message.
15:14 berick search process is gone
15:15 miker right
15:15 miker it dies inside the gather() that tries to fetch the result of that storage call
15:15 miker I've confirmed that much, and added more logging in XMPPReader to see where in its wait() it's actually dying... trying to provoke it again
15:16 miker (and wondering if that logging will "fix" it, because that would be annoying...)
15:17 miker well, presuming it's in wait() ... but everything between gather() and wait() is really just test-and-pass, with a little timeout adjustment thrown in
15:31 miker berick: and to board up one rabbit hole for you to avoid, I elided dispatch() and it still occurs :(
15:31 kmlussier joined #evergreen
15:32 berick miker: thanks.
15:43 khuckins__ joined #evergreen
15:43 miker wheeee... confirmed that it's the select in XMPPReader.pm around line 215. I put a log message right before and after that, and the one before is the last I see of that process
15:45 kmlussier miker++ #Progress
15:46 miker well, not seeing a solution ATM... that code is wrapped (way up the stack) in a try{}, which is implemented as an eval(), and should not be able to take the process down completely TTBOMK
15:47 kmlussier miker: Yes, well, it's still progress. Every little piece of the puzzle helps.
15:47 miker :)
15:47 berick yeah, this is good.
15:48 berick wonder if it's an uncaught signal.
15:51 miker berick: I thought about that too, so I used 'use sigtrap' to install a signal handler that logs when a signal arrives that we don't have a handler installed for ... I'm not seeing those log lines show up
15:53 miker (which isn't to say I got that quite right... it's worth another set of eyes, certainly)
15:56 berick oh cool, never heard of sigtrap
16:03 pastebot "miker" at 64.57.241.14 pasted "new sig handlers" (5 lines) at http://paste.evergreen-ils.org/910
16:03 miker berick: so, I replaced Server.pm:526 with the following, inside the eval {} block:
16:03 miker er, that -^
16:03 miker s/following/preceding/
16:04 miker and I do see the ones for TERM when a child is reaped due to max_spare_children
16:04 berick gotcha
17:11 berick miker: i tried an alternate approach and did a $SIG{$_} = sub { ... } for keys %SIG.  started seeing *drumroll...*  sporadic SIGALRMs.
17:11 berick applied this patch https://bugs.launchpad.net/op​ensrf/+bug/1697029/comments/9 with a 'IGNORE' and problem appears to have gone away
17:11 pinesol_green Launchpad bug 1697029 in OpenSRF "Service Listeners Crash when Using an Undefined Value for Syscalls" [High,Fix released]
17:12 berick have not yet been able to reproduce in the time I normally am able to, anyway
17:12 berick will keep trying
17:13 dbwells berick++
17:13 kmlussier berick: Was that patch not part of the code that was committed?
17:14 miker hrm... we don't use alarm() in opensrf proper...
17:14 berick miker: it's coming from the facet stuff I believe
17:15 miker yep... just grep for that
17:15 berick i reproduce by running copy counts API calls in parallel to catalog searches
17:15 dbwells I am going to try and confirm, if I can.
17:15 miker and, it's because we don't alarm(0)
17:15 berick kmlussier: it's not in master...
17:15 miker in the facet part
17:16 * kmlussier keeps her fingers crossed.
17:16 jeffdavis kmlussier: I think we all generally assumed that the SIGALRM thing in that ticket was a false lead
17:16 jeffdavis so nothing further happened with that specific patch
17:16 jeffdavis s/ticket/LP bug/
17:17 kmlussier Sometimes, the slowness will disappear for a while after you apply a promising patch, making you think it's gone, only to come back in full force when you least suspect it.
17:17 kmlussier It's sneaky
17:18 miker berick: alternate proposal... at OpenILS/Application/Search/Biblio.pm:1417 say `alarm(0);` and likewise immediately after the close of the eval block
17:18 * miker tests
17:19 berick miker: +1 to adding that for sure
17:19 miker that'll clear the alarm timer
17:25 dbwells My usual breakage routine seems to be failing me tonight.  Was probably relying unknowingly on some other load which doesn't exist at 5:25pm :)
17:25 * dbwells keeps trying to push his server over
17:27 dbwells It sounds like the masters here have things well in hand, at any rate :)
17:30 jvwoolf left #evergreen
17:30 dbwells yay, finally broke it
17:41 miker berick: I think you just found the final SIGALRM-shaped jigsaw puzzle piece under that couch cushion... some testing tomorrow will confirm (or deny), and if it's good I'll post a patch
17:44 berick miker: awesome.  beware the lint!
17:44 berick and dog toys and old nickels and
17:50 dbwells miker: berick: adding an IGNORE a la berick's suggestion also seems to work here.  I certainly can contribute nothing regarding the correctness of other points being discussed :)  There seems to be some light at the end of the tunnel, though!
17:54 berick miker's suggestion is needed regardless and should resolve it in this instance.  but we may also want to wrap select() in an ALRM IGNORE so this kind of thing won't bite us so readily in the future.
17:54 berick not sure if globally ignoring it is the best strategy
17:54 berick thanks for confirming, dbwells
17:54 miker yeah, IGNORE works, but I'm a little nervous about futzing with sig handers in EG-proper. calling alarm(0) has the effect of canceling the timer ... there's a theoretical race condition that we could plug with a 'local $SIG{ALRM} = ...' or saving off the previous value, but the race is just one opcode wide (the longjmp from the die() to the next command, which would be alarm(0)) so my preference is to change less (and avoid direct %SIG mangling) if
17:54 miker possible...
17:57 miker (we at least have the fig leaf of the alarm() abstraction, rather than touching %SIG directly in the EG code, IOW)
17:58 miker also
17:58 miker @blame miker
17:58 pinesol_green miker: Forget it, Jake. It's just miker.
17:58 miker for those spurious ALRMs
18:00 berick miker: when does that alarm code come into play?  if the caller has the cache key, doesn't that imply the data is in the cache?
18:02 pinesol_green News from qatests: Test Success <http://testing.evergreen-ils.org/~live>
18:02 berick in any event, suppose we could also replace alarm() w/ a loop that counts up to the timeout if it continues to be problematic.
18:02 miker 1) the main search call uses respond_complete() to send back the results and the provisional facet key early, then calculates the facets and stores them 2) the caller gets the key and says "search! give me the facets!" in a second call 3) if the first call's post-respond_complete facet logic isn't done yet, the second search call waits until it sees that it is complete, sleeping 50ms at a time
18:03 miker we could
18:03 miker now, it's always wanted to do that
18:04 berick aha, i see the respond_complete now, before cache_facets
18:04 miker but because we were using method_lookup, the first request didn't actually return until the facet logic was done -- respond_complete doesn't send to the client directly
18:05 miker but dispatch() instead of run() on a method_lookup call does talk directly to the client
18:05 miker so now they can actually be parallel, instead of just pretending to be
18:06 miker and the mod_perl code is rearranged to do a bunch of work between the result return and the facet request, giving facets parallel time to be calculated
18:07 miker shaves ~1.5 secs off the average "several hundred results" search in terms of time-to-render
18:08 berick right, thought it was something like, but was missing the piece where it returned early
18:09 miker originally the facet timeout was 2s, then 4s in 3.0 IIRC, but I'm going to propose bumping it up to 10 as a max for facet timeout ... there are still instances of a result page not showing facets on the first load
18:09 miker esp on all-in-one test servers
18:12 kmlussier I see that happening a lot in 2.12. I haven't noticed it as much in 3.0.
18:33 jeffdavis ^ exciting!
18:50 kmlussier jeffdavis: Yes, indeed! :)
19:52 csharp and, just like that, https://blog.angular.io/version-5-0-0-​of-angular-now-available-37e414935ced
23:52 kmlussier I'm noticing an issue with Located URI searching using the Conerto dataset.
23:53 kmlussier In 2.12, I can do a keyword search for 'hobbit' and retrieve one of the ebook test records with a located URI owned by the consortium.
23:53 kmlussier In 3.0, I get no results.

| Channels | #evergreen index | Today | | Search | Google Search | Plain-Text | summary | Join Webchat