Time |
Nick |
Message |
02:50 |
|
genpaku_ joined #evergreen |
04:30 |
pinesol_green |
News from qatests: Test Success <http://testing.evergreen-ils.org/~live> |
07:07 |
|
rjackson_isl joined #evergreen |
07:12 |
|
JBoyer joined #evergreen |
07:31 |
|
agoben joined #evergreen |
08:04 |
|
littlet joined #evergreen |
08:06 |
|
rlefaive joined #evergreen |
08:42 |
|
mmorgan joined #evergreen |
08:48 |
|
bos20k joined #evergreen |
08:53 |
|
kmlussier joined #evergreen |
08:55 |
|
umarzuki joined #evergreen |
08:56 |
umarzuki |
i got "Received no data from server" when testing srfsh request opensrf.math add 2,2 |
08:57 |
umarzuki |
shell output > https://pastebin.com/id9VWfrf |
09:00 |
umarzuki |
srfsh.log https://pastebin.com/ppnbtLfR |
09:23 |
Bmagic |
umarzuki: Trying to get the Evergreen server setup for the first time? |
09:25 |
Bmagic |
umarzuki: You might be interested in the "one command to install it all" solution https://hub.docker.com/r/mobiusoffice/evergreen-ils/ |
09:28 |
|
Dyrcona joined #evergreen |
09:35 |
|
yboston joined #evergreen |
09:43 |
|
jvwoolf joined #evergreen |
09:45 |
|
maryj joined #evergreen |
09:45 |
bshum |
umarzuki: Could be an ejabberd authentication issue. What Linux distribution are you installing on? |
09:46 |
bshum |
Also, I know special characters in ejabberd passwords can break things, so depending on what you set the ejabberd user passwords to, that could cause an issue
09:49 |
kmlussier |
Bmagic: Have you considered adding that solution to https://wiki.evergreen-ils.org/doku.php?id=server_installation:semi_automated? |
09:49 |
Bmagic |
kmlussier: Yeah, I knew it belonged somewhere |
09:52 |
umarzuki |
Bmagic: yes |
09:52 |
umarzuki |
bshum: ubuntu xenial 16.04 |
09:53 |
umarzuki |
I'm installing this on vmware workstation |
09:56 |
bshum |
With Xenial, another issue that's cropped up lately has been problems where the firewall blocks ejabberd from communicating |
09:57 |
bshum |
I thought berick found a workaround for the ansible installer, not sure if that'll help |
09:58 |
umarzuki |
bshum: firewalld service not enabled and iptables -L shows no rules |
09:58 |
umarzuki |
I'm just using zaq12wsx for password, testing purposes only |
10:00 |
bshum |
That doesn't seem too overly complex and should have worked then |
10:00 |
bshum |
Hmm |
10:01 |
Dyrcona |
umarzuki: Did you change the auth_password_format to plain in ejabberd.yml?
10:01 |
umarzuki |
Dyrcona: yes |
10:02 |
Dyrcona |
umarzuki: Looks like Ejabberd is only listening on tcp6 and not 4. |
10:02 |
Dyrcona |
From your netstat output, that is. |
10:05 |
berick |
bshum: the issue I had to fix was registering ejabberd users as root won't work in 16.04. you have to sudo to 'ejabberd'. or modify apparmor |
10:06 |
Dyrcona |
berick: I've seen that, too, but I think umarzuki's issue is the ejabberd configuration. |
10:09 |
umarzuki |
Dyrcona: ip: "::" |
10:09 |
umarzuki |
that one in ejabberd.yml? |
10:09 |
Dyrcona |
Yes, you probably want to change that one. |
10:10 |
Dyrcona |
I'd also make sure that the entries in the hosts section are correct and all there. |
10:12 |
Dyrcona |
http://evergreen-ils.org/documentation/install/OpenSRF/README_2_5_0.html#_configure_the_ejabberd_server |
10:12 |
Dyrcona |
In case you missed it. |
10:12 |
|
mmorgan1 joined #evergreen |
10:13 |
Dyrcona |
I think we should make the link to OpenSRF more obvious on the installation page. |
10:17 |
umarzuki |
Dyrcona: ejabberd now on ipv4 but I'm still seeing same error
10:17 |
umarzuki |
tcp 0 0 0.0.0.0:5222 0.0.0.0:* |
10:18 |
umarzuki |
srfsh 2017-06-05 22:14:43 [WARN:33200:osrf_stack.c:144:1496672082332000] * Jabber Error is for top level remote id [routerprivate.localhost/opensrf.math], no one to send my message to! Cutting request short... |
10:18 |
Dyrcona |
umarzuki: You restarted osrf services after restarting ejabberd? |
10:19 |
Dyrcona |
You might also need the --force-clean-process option to osrf_control. |
10:21 |
Dyrcona |
umarzuki: You should also have logs in /var/log/ejabberd/. I'd look at those to see if they give any clues. |
10:22 |
jeff |
umarzuki: what version of opensrf, and did you follow all of the recommended changes for the ejabberd config, including (as applicable) max stanza sizes and max sessions, etc? |
10:23 |
umarzuki |
yes, 2.5.0 |
10:24 |
jeff |
are you able to post your ejabberd config file somewhere that we can examine it? |
10:27 |
umarzuki |
jeff: ejabberd log https://pastebin.com/r4she46e |
10:28 |
umarzuki |
ejabberd.yml https://pastebin.com/jUrEwDBL |
10:31 |
jeff |
umarzuki: you should check your setting for max_user_sessions -- it appears to be at 10, and not the recommended 10000 |
10:34 |
bshum |
jeff++ # good eye on that |
10:35 |
umarzuki |
jeff: thanks |
10:36 |
umarzuki |
I mistook that part where I only change all to 10000
10:46 |
|
mmorgan joined #evergreen |
10:49 |
|
sandbergja joined #evergreen |
10:52 |
|
collum joined #evergreen |
10:52 |
bshum |
JBoyer: cesardv: I'll be curious to see what you guys find with those settings carrying over between retrieved users in the last comments on https://bugs.launchpad.net/evergreen/+bug/1642035 ; I think I saw a similar issue occurring with the patron stat cats where the last entry made on one user was also showing up when retrieving the next user (who didn't have an entry yet) |
10:52 |
pinesol_green |
Launchpad bug 1642035 in Evergreen "Web Staff Client - Problems saving notification preferences" [Undecided,Confirmed] |
10:59 |
cesardv |
bshum: yeah I noticed that while doing some testing... I'm still pretty new to Evergreen, but yea seems like (according to a comment on regctl.js) there's caching of data unless you explicitly click out of the patron tabs
11:02 |
kmlussier |
JBoyer / agoben: Can you remind me what the hack-a-way date is? |
11:06 |
agoben |
The EOB day is November 6; the Hack-away is November 7-9. (I've run into an issue with our location from last year, so am working on alternate options for the venue. Will announce as soon as possible.) |
11:07 |
berick |
thanks agoben , was wondering myself |
11:08 |
agoben |
Yup, that was the week that everyone seemed to be available, so should be a good showing :) |
11:11 |
kmlussier |
agoben: Thanks! Do you want me to add it to the dev calendar or should I wait until you have a venue lined up? |
11:14 |
agoben |
kmlussier: give me this week to try to lock in the venue. I don't anticipate changing the date, but would feel better if everyone could go ahead and get started with lodging arrangements asap to coincide. |
11:14 |
kmlussier |
agoben: Sounds good |
11:48 |
|
_adb joined #evergreen |
12:10 |
|
jihpringle joined #evergreen |
12:35 |
Dyrcona |
If a script that starts a cstore transaction times out, that transaction will never commit, will it? |
12:35 |
berick |
Dyrcona: correct. |
12:36 |
Dyrcona |
That's what I thought. Guess I'll kill the script and cancel the pg backend. |
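A minimal sketch of the point confirmed above, that work done inside a transaction is lost if the session ends before the commit. It uses Python's sqlite3 module as a stand-in for cstore and PostgreSQL, so the table name purge_log and the helper functions are hypothetical and only illustrate the general behavior.

```python
import os
import sqlite3
import tempfile


def run_without_commit(path):
    """Insert a row, then drop the connection without committing."""
    conn = sqlite3.connect(path)
    # sqlite3 opens a transaction implicitly before this INSERT.
    conn.execute("INSERT INTO purge_log (note) VALUES ('aged circs')")
    conn.close()  # simulates the script dying before its commit


def count_rows(path):
    """Count rows visible to a fresh connection."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("SELECT COUNT(*) FROM purge_log").fetchone()[0]
    finally:
        conn.close()


def demo():
    """Return how many rows survive an uncommitted insert."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        conn = sqlite3.connect(path)
        conn.execute("CREATE TABLE purge_log (note TEXT)")
        conn.commit()
        conn.close()
        run_without_commit(path)
        return count_rows(path)
    finally:
        os.remove(path)
```

Calling demo() shows whether the uncommitted insert is visible afterwards. On a real PostgreSQL server the abandoned backend may also need to be cancelled, as mentioned above.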
12:38 |
Dyrcona |
I guess a follow up question is does the transaction matter if the cstore calls a stored procedure directly. |
12:40 |
Dyrcona |
It probably does.... |
12:42 |
* Dyrcona |
runs the stored procedure via psql command line. |
12:48 |
|
littlet joined #evergreen |
13:03 |
|
jvwoolf joined #evergreen |
13:03 |
|
rlefaive_ joined #evergreen |
13:17 |
berick |
Dyrcona: indeed, the transaction still matters. |
13:18 |
|
maryj_ joined #evergreen |
13:18 |
Dyrcona |
So, what happens if the transaction commits while the stored procedure is still running? Because that is what appears to happen.
13:18 |
* Dyrcona |
is looking at purge_circulations.srfsh
13:19 |
berick |
probably get an error about having no open transaction to commit |
13:20 |
berick |
after the json_query times out, the next cstore request goes to a different cstore drone |
13:21 |
berick |
well, not 100% on that.. i'd have to look at the srfsh code |
13:22 |
|
kmatthiesen joined #evergreen |
13:23 |
berick |
in any event, psql++ |
13:24 |
Dyrcona |
OK. I did Ctrl-c on the running srfsh, so I don't see "no transaction to commit" in the logs.
13:24 |
Dyrcona |
When you haven't purged circulation for five years, it takes a while. :) |
13:24 |
berick |
yeah |
13:25 |
berick |
more so now that circ history doesn't live in the circ tables. all that old stuff gets aged. |
13:34 |
Dyrcona |
Well, we had some incidental aging when purging users. |
13:40 |
|
maryj joined #evergreen |
13:45 |
jeffdavis |
The open-ils.search service died on one of our servers again: "server: died with error Can't use an undefined value as a symbol reference at /usr/local/share/perl/5.18.2/OpenSRF/Server.pm line 307." |
13:46 |
jeff |
another OOM kill? |
13:50 |
jeffdavis |
Doesn't look like it. |
14:00 |
jeffdavis |
I presume it's the $child var that's undefined there. So OpenSRF::Server->run() is either passing the request to an idle child that doesn't actually exist, or else failing to spawn a child for some reason. |
14:04 |
jeffdavis |
As before, I don't see any helpful log messages prior to that: no "error creating data socketpair" or "child process died" for example. |
14:06 |
jeffdavis |
Either of which would indicate an error while spawning a child. |
14:13 |
berick |
jeffdavis: or pipe_to_child is undef, because the child died and was reaped (which removes all of $child->{*} fields) while the parent was in the middle of the write_child() function. |
14:18 |
berick |
so some time after write_child() is called but before syswrite() is called, SIGCHLD occurs, $child is reaped and becomes a dead husk. |
14:18 |
berick |
then syswrite($child->{pipe_to_child}, ..) fails because pipe_to_child is undef
14:19 |
berick |
if the child exited gracefully, there would be no log entry (unless loglevel is highest) |
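A minimal sketch of the race described above, written in Python rather than the Perl of Server.pm. The Child class, reap(), and the before_write hook are hypothetical stand-ins and not OpenSRF code. The hook simulates SIGCHLD arriving after write_child() is entered but before the write happens, and the guard shows one way to fail softly instead of dying on an undefined handle.

```python
class Child:
    """Hypothetical stand-in for the parent's per-drone record."""

    def __init__(self, pid):
        self.pid = pid
        self.pipe_to_child = []  # a list stands in for the pipe handle


def reap(child):
    """Simulate the SIGCHLD handler clearing the record's fields."""
    child.pid = None
    child.pipe_to_child = None


def write_child(child, data, before_write=None):
    """Write data to the child, returning False instead of crashing."""
    if before_write is not None:
        before_write()  # the window in which SIGCHLD could fire
    pipe = child.pipe_to_child if child is not None else None
    if pipe is None:
        # The child was reaped (or never existed); report it to the caller.
        return False
    pipe.append(data)
    return True
```

For example, write_child(d, "req", before_write=lambda: reap(d)) exercises the reaped-in-the-middle case. A real fix would also log the pid before it is cleared so the drone's last activity can be traced.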
14:22 |
berick |
a few log lines in Server.pm would probably shed a lot of light on it |
14:22 |
jeffdavis |
Yeah, I'll add some logging to warn if that scenario arises. |
14:22 |
jeffdavis |
Is that likely to occur naturally, do you think? This is the second time in two weeks that we've seen this problem. |
14:25 |
berick |
if my theory is correct, it's atypical, but not impossible for a child to finish what it's doing while another request is en route to the child. |
14:26 |
|
Dyrcona joined #evergreen |
14:27 |
berick |
jeffdavis: does the timing of these coincide with any updates? osrf or eg? |
14:29 |
jeffdavis |
First incident was just after we upgraded to OSRF 2.5 / EG 2.12.1 from 2.4/2.10.2 |
14:30 |
berick |
ah, ok |
14:35 |
JBoyer |
berick, jeffdavis, I'll add that we first saw this same thing after that same upgrade (though from Eg 2.11.x rather than 2.10) |
14:35 |
berick |
JBoyer: what osrf are you on? |
14:35 |
JBoyer |
2.5 |
14:36 |
berick |
as in, you upgraded osrf at same time? |
14:36 |
JBoyer |
yes, |
14:36 |
berick |
k |
14:38 |
berick |
for logging, knowing whether $child is undef, whether $child->{pipe_to_child} is undef, and logging $child->{pid} would be good starts. with the pid it should be possible to backtrack what the child was doing prior to the problem
14:38 |
berick |
e.g. find the api call, whether it processed stuff already, etc. |
14:51 |
JBoyer |
Well this should help: I graph the # of every opensrf service by the minute and there's no way any open-ils.search children should have been reaped at any time all day. |
14:51 |
JBoyer |
UNLESS, I have completely misunderstood the min_spare_children value for Net::Server. |
14:52 |
berick |
JBoyer: the drones (gracefully) die and re-spawn throughout the day. |
14:52 |
berick |
each time they hit max-requests |
14:52 |
JBoyer |
Oh, yes, yes. |
14:53 |
JBoyer |
(I was momentarily worried that min_spare... had to be larger than min_children)
14:54 |
jeffdavis |
JBoyer: to confirm, you're seeing the same thing as me in your logs? open-ils.search dies with that error message and no other errors leading up to it? |
14:55 |
JBoyer |
jeffdavis, yup. |
15:03 |
JBoyer |
My assumption is that it's probably a race condition between the write_child call on lines 719 or 193 and actually calling syswrite on line 307 (the call on 185 seems fairly safe since there aren't any errors with the message from line 449). Am I correct in thinking that's your opinion too, berick ? |
15:03 |
JBoyer |
(To be fair, I'm making that assumption based on berick's suggestions and questions but don't want to speak for him necessarily.) |
15:09 |
berick |
now i'm questioning my logic. write_child is only called to start a new conversation. i can't think of any reason a child would gracefully die after becoming idle (or getting spawned), but before it received its first request.
15:10 |
|
babel_ joined #evergreen |
15:10 |
berick |
unless it was killed by an external process |
15:12 |
berick |
jeffdavis' comments about an idle child not existing or spawn failing make more sense to me now |
15:19 |
berick |
notably spawn_child is not checking for a fork() failure (undef response). |
15:19 |
berick |
though i would only expect that to be a problem if memory was exhausted, in which case, chaos everywhere |
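A minimal sketch of checking for a failed fork, assuming a Python analogue of the Perl code under discussion. In Perl, fork() returns undef on failure, while Python's os.fork() raises OSError. The name spawn_child is borrowed from the conversation, but this body is hypothetical and the fork argument is injectable only so the failure path can be exercised without exhausting memory.

```python
import os


def spawn_child(fork=os.fork):
    """Return (pid, None) on success or (None, reason) if fork fails."""
    try:
        pid = fork()
    except OSError as exc:
        # Typically EAGAIN or ENOMEM; the caller can log this and retry
        # rather than continuing with a child record that has no process.
        return None, "fork failed: %s" % exc
    # In a real server the pid == 0 branch would run the drone loop here.
    return pid, None
```

The caller would then refuse to register a child when the first element is None, which avoids handing a request to a drone that was never created.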
15:20 |
berick |
i think I already asked this, but jeffdavis JBoyer you're not seeing anything in /openils/var/log/open-ils.search_stderr.log ? |
15:22 |
jeff |
there is the scenario where the listener becomes a drone when it fails to fork, but again -- that's generally due to memory exhaustion and is often coupled with the oom-killer killing... something. |
15:23 |
jeff |
bug 1546683 |
15:23 |
pinesol_green |
Launchpad bug 1546683 in OpenSRF "fork() failure results in Perl service Listeners becoming Drones" [Undecided,New] https://launchpad.net/bugs/1546683 |
15:23 |
JBoyer |
I have (plenty...) of open-ils.storage.biblio.multiclass.staged.search_fts.atomic failures for ISBN searches that end with a '*' |
15:23 |
JBoyer |
But that's been going on for ages and open-ils.search dying is rather new. |
15:24 |
JBoyer |
(lack of date/timestamps in those files makes it difficult to say) |
15:25 |
berick |
JBoyer: oh, right |
15:46 |
|
collum joined #evergreen |
15:55 |
|
Jillianne joined #evergreen |
16:30 |
pinesol_green |
News from qatests: Test Success <http://testing.evergreen-ils.org/~live> |
17:14 |
|
mmorgan left #evergreen |
19:33 |
|
jihpringle_ joined #evergreen |
22:20 |
|
genpaku joined #evergreen |