Time |
Nick |
Message |
00:54 |
|
ariez joined #evergreen |
00:54 |
pastebot |
"ariez" at 64.57.241.14 pasted "CANNOT CREATE MARC RECORD IN EVERGREEN ILS 3.0" (6 lines) at http://paste.evergreen-ils.org/909 |
06:01 |
pinesol_green |
News from qatests: Test Success <http://testing.evergreen-ils.org/~live> |
07:38 |
|
dwgreen joined #evergreen |
08:14 |
|
sandbergja joined #evergreen |
08:57 |
|
bos20k joined #evergreen |
09:00 |
|
yboston joined #evergreen |
09:01 |
|
_adb joined #evergreen |
09:40 |
|
jvwoolf joined #evergreen |
09:44 |
|
mmorgan joined #evergreen |
10:01 |
|
dbwells_ joined #evergreen |
11:44 |
|
khuckins joined #evergreen |
12:21 |
|
mmorgan1 joined #evergreen |
12:32 |
|
rlefaive joined #evergreen |
12:39 |
|
jihpringle joined #evergreen |
12:39 |
|
jvwoolf joined #evergreen |
12:43 |
|
mmorgan joined #evergreen |
13:39 |
|
mmorgan joined #evergreen |
13:58 |
|
khuckins_ joined #evergreen |
13:59 |
|
khuckins_ joined #evergreen |
14:13 |
jeffdavis |
The 10-year anniversary of Sitka's first library going live on Evergreen is coming up in a few days. |
14:14 |
bshum |
Yay! |
14:14 |
phasefx |
yay! |
14:15 |
bshum |
jeffdavis: Where's the party going to be? |
14:15 |
bshum |
:D |
14:16 |
jeffdavis |
An exellent question :) |
14:19 |
|
khuckins_ joined #evergreen |
14:27 |
berick |
jeffdavis++ |
14:38 |
berick |
ooh, starting to occaisonally get something similar to bug 1704396. just regular searches though. |
14:38 |
* berick |
cranks up the logging |
14:38 |
pinesol_green |
Launchpad bug 1704396 in Evergreen "Slowness for metarecord and one-hit searches in 2.12" [High,Confirmed] https://launchpad.net/bugs/1704396 |
14:41 |
mmorgan |
berick: It does happen with regular searches, too, but more often with those mentioned in the bug. |
14:41 |
berick |
mmorgan: oh good, thanks for confirming |
14:43 |
dbwells |
berick: I can't recall if I commented yet on that bug or not, but we've seen it (or something just like it) with simple open-ils.search.biblio.copy_counts.location.summary.retrieve calls. It seems like the only commonality is sufficient working of a storage service, but it might be even more generalizable than that. |
14:44 |
berick |
dbwells: i see... |
14:45 |
dbwells |
berick: I'd love to see some debug logs if you can catch it in the act. We can reliably reproduce it on production only, and I've been too gunshy to crank up to debug on production at this point. |
14:48 |
berick |
sorry dev VM, you're disk is about to smoke |
14:49 |
berick |
heh, ejabberd log level 5 is certainly effective |
14:49 |
dbwells |
My testbed when I was poking at it for an hour or two ended up being a simple srfsh file of 500 open-ils.search.biblio.copy_counts.location.summary.retrieve calls. It almost always failed in production within those 500, but never on dev with the same test :( |
14:50 |
berick |
dbwells: with the same or different args each time? |
14:50 |
dbwells |
berick: exactly the same |
14:51 |
* berick |
tries that |
14:52 |
|
khuckins__ joined #evergreen |
14:53 |
csharp |
jeffdavis++ |
14:58 |
jeffdavis |
evergreen++ # That first library is still on EG, I guess it must be a pretty okay ILS. |
14:59 |
csharp |
heh |
15:09 |
miker |
berick: re 1704396, I've added a ton of logging, and right now it's pointing at the process just going away when it tries to select() on the socket |
15:09 |
miker |
the process being the search drone |
15:14 |
berick |
miker: that's consisten w/ what I'm seeing. (finally caught it again). storage tries to send a reply to open-ils.search, jabber replies with a service-unavailable message. |
15:14 |
berick |
search process is gone |
15:15 |
miker |
right |
15:15 |
miker |
it dies inside the gather() that tries to fetch the result of that storage call |
15:15 |
miker |
I've confirmed that much, and added more logging in XMPPReader to see where in its wait() it's actually dying... trying to provoke it again |
15:16 |
miker |
(and wondering if that logging will "fix" it, because that would be annoying...) |
15:17 |
miker |
well, presuming it's in wait() ... but everything between gather() and wait() is really just test-and-pass, with a little timeout adjustment thrown in |
15:31 |
miker |
berick: and to board up one rabbit hole for you to avoid, I elided dispatch() and it still occurs :( |
15:31 |
|
kmlussier joined #evergreen |
15:32 |
berick |
miker: thanks. |
15:43 |
|
khuckins__ joined #evergreen |
15:43 |
miker |
wheeee... confirmed that it's the select in XMPPReader.pm around line 215. I put a log message right before and after that, and the one before is the last I see of that process |
15:45 |
kmlussier |
miker++ #Progress |
15:46 |
miker |
well, not seeing a solution ATM... that code is wrapped (way up the stack) in a try{}, which is implemented as an eval(), and should not be able to take the process down completely TTBOMK |
15:47 |
kmlussier |
miker: Yes, well, it's still progress. Every little piece of the puzzle helps. |
15:47 |
miker |
:) |
15:47 |
berick |
yeah, this is good. |
15:48 |
berick |
wonder if it's an uncaught signal. |
15:51 |
miker |
berick: I thought about that too, so I used 'use sigtrap' to install a signal handler that logs when a signal arrives that we don't have a handler installed for ... I'm not seeing those log lines show up |
15:53 |
miker |
(which isn't to say I got that quite right... it's worth another set of eyes, certainly) |
15:56 |
berick |
oh cool, never heard of sigtrap |
16:03 |
pastebot |
"miker" at 64.57.241.14 pasted "new sig handlers" (5 lines) at http://paste.evergreen-ils.org/910 |
16:03 |
miker |
berick: so, I replaced Server.pm:526 with the following, inside the eval {} block: |
16:03 |
miker |
er, that -^ |
16:03 |
miker |
s/following/preceding/ |
16:04 |
miker |
and I do see the ones for TERM when a child is reaped due to max_spare_children |
16:04 |
berick |
gotcha |
17:11 |
berick |
miker: i tried an alternate approach and did a $SIG{$_} = sub { ... } for keys %SIG. started seeing *drumroll...* sporadic SIGALRMs. |
17:11 |
berick |
applied this patch https://bugs.launchpad.net/opensrf/+bug/1697029/comments/9 with a 'IGNORE' and problem appears to have gone away |
17:11 |
pinesol_green |
Launchpad bug 1697029 in OpenSRF "Service Listeners Crash when Using an Undefined Value for Syscalls" [High,Fix released] |
17:12 |
berick |
have not yet been able to reproduce in the time I normally am able to, anyway |
17:12 |
berick |
will keep trying |
17:13 |
dbwells |
berick++ |
17:13 |
kmlussier |
berick: Was that patch not part of the code that was committed? |
17:14 |
miker |
hrm... we don't use alarm() in opensrf proper... |
17:14 |
berick |
miker: it's coming from the facet stuff I believe |
17:15 |
miker |
yep... just grep for that |
17:15 |
berick |
i reproduce by running copy counts API calls in parallel to catalog searches |
17:15 |
dbwells |
I am going to try and confirm, if I can. |
17:15 |
miker |
and, it's because we don't alarm(0) |
17:15 |
berick |
kmlussier: it's not in master... |
17:15 |
miker |
in the facet part |
17:16 |
* kmlussier |
keeps her fingers crossed. |
17:16 |
jeffdavis |
kmlussier: I think we all generally assumed that the SIGALRM thing in that ticket was a false lead |
17:16 |
jeffdavis |
so nothing further happened with that specific patch |
17:16 |
jeffdavis |
s/ticket/LP bug/ |
17:17 |
kmlussier |
Sometimes, the slowness will disappear for a while after you apply a promising patch, making you think it's gone, only to come back in full force when you least suspect it. |
17:17 |
kmlussier |
It's sneaky |
17:18 |
miker |
berick: alternate proposal... at OpenILS/Application/Search/Biblio.pm:1417 say `alarm(0);` and likewise immediately after the close of the eval block |
17:18 |
* miker |
tests |
17:19 |
berick |
miker: +1 to adding that for sure |
17:19 |
miker |
that'll clear the alarm timer |
17:25 |
dbwells |
My usual breakage routine seems to be failing me tonight. Was probably relying unknowingly on some other load which doesn't exist at 5:25pm :) |
17:25 |
* dbwells |
keeps trying to push his server over |
17:27 |
dbwells |
It sounds like the masters here have things well in hand, at any rate :) |
17:30 |
|
jvwoolf left #evergreen |
17:30 |
dbwells |
yay, finally broke it |
17:41 |
miker |
berick: I think you just found the final SIGALRM-shaped jigsaw puzzle piece under that couch cushion... some testing tomorrow will confirm (or deny), and if it's good I'll post a patch |
17:44 |
berick |
miker: awesome. beware the lint! |
17:44 |
berick |
and dog toys and old nickels and |
17:50 |
dbwells |
miker: berick: adding an IGNORE a la berick's suggestion also seems to work here. I certainly can contribute nothing regarding the correctness of other points being discussed :) There seems to be some light at the end of the tunnel, though! |
17:54 |
berick |
miker's suggestion is needed regardless and should resolve it in this instance. but we may also want to wrap select() in an ALRM IGNORE so this kind of thing won't bite us so readily in the future. |
17:54 |
berick |
not sure if globally ignoring it is the best strategy |
17:54 |
berick |
thanks for confirming, dbwells |
17:54 |
miker |
yeah, IGNORE works, but I'm a little nervous about futzing with sig handers in EG-proper. calling alarm(0) has the effect of canceling the timer ... there's a theoretical race condition that we could plug with a 'local $SIG{ALRM} = ...' or saving off the previous value, but the race is just one opcode wide (the longjmp from the die() to the next command, which would be alarm(0)) so my preference is to change less (and avoid direct %SIG mangling) if |
17:54 |
miker |
possible... |
17:57 |
miker |
(we at least have the fig leaf of the alarm() abstraction, rather than touching %SIG directly in the EG code, IOW) |
17:58 |
miker |
also |
17:58 |
miker |
@blame miker |
17:58 |
pinesol_green |
miker: Forget it, Jake. It's just miker. |
17:58 |
miker |
for those spurious ALRMs |
18:00 |
berick |
miker: when does that alarm code come into play? if the caller has the cache key, doesn't that imply the data is in the cache? |
18:02 |
pinesol_green |
News from qatests: Test Success <http://testing.evergreen-ils.org/~live> |
18:02 |
berick |
in any event, suppose we could also replace alarm() w/ a loop that counts up to the timeout if it continues to be problematic. |
18:02 |
miker |
1) the main search call uses respond_complete() to send back the results and the provisional facet key early, then calculates the facets and stores them 2) the caller gets the key and says "search! give me the facets!" in a second call 3) if the first call's post-respond_complete facet logic isn't done yet, the second search call waits until it sees that it is complete, sleeping 50ms at a time |
18:03 |
miker |
we could |
18:03 |
miker |
now, it's always wanted to do that |
18:04 |
berick |
aha, i see the respond_complete now, before cache_facets |
18:04 |
miker |
but because we were using method_lookup, the first request didn't actually return until the facet logic was done -- respond_complete doesn't send to the client directly |
18:05 |
miker |
but dispatch() instead of run() on a method_lookup call does talk directly to the client |
18:05 |
miker |
so now they can actually be parallel, instead of just pretending to be |
18:06 |
miker |
and the mod_perl code is rearranged to do a bunch of work between the result return and the facet request, giving facets parallel time to be calculated |
18:07 |
miker |
shaves ~1.5 secs off the average "several hundred results" search in terms of time-to-render |
18:08 |
berick |
right, thought it was something like, but was missing the piece where it returned early |
18:09 |
miker |
originally the facet timeout was 2s, then 4s in 3.0 IIRC, but I'm going to propose bumping it up to 10 as a max for facet timeout ... there are still instances of a result page not showing facets on the first load |
18:09 |
miker |
esp on all-in-one test servers |
18:12 |
kmlussier |
I see that happening a lot in 2.12. I haven't noticed it as much in 3.0. |
18:33 |
jeffdavis |
^ exciting! |
18:50 |
kmlussier |
jeffdavis: Yes, indeed! :) |
19:52 |
csharp |
and, just like that, https://blog.angular.io/version-5-0-0-of-angular-now-available-37e414935ced |
23:52 |
kmlussier |
I'm noticing an issue with Located URI searching using the Conerto dataset. |
23:53 |
kmlussier |
In 2.12, I can do a keyword search for 'hobbit' and retrieve one of the ebook test records with a located URI owned by the consortium. |
23:53 |
kmlussier |
In 3.0, I get no results. |