IRC log for #evergreen, 2017-06-05

All times shown according to the server's local time.

Time	Nick	Message
02:50		genpaku_ joined #evergreen
04:30	pinesol_green	News from qatests: Test Success <http://testing.evergreen-ils.org/~live>
07:07		rjackson_isl joined #evergreen
07:12		JBoyer joined #evergreen
07:31		agoben joined #evergreen
08:04		littlet joined #evergreen
08:06		rlefaive joined #evergreen
08:42		mmorgan joined #evergreen
08:48		bos20k joined #evergreen
08:53		kmlussier joined #evergreen
08:55		umarzuki joined #evergreen
08:56	umarzuki	i got "Received no data from server" when testing srfsh request opensrf.math add 2,2
08:57	umarzuki	shell output > https://pastebin.com/id9VWfrf
09:00	umarzuki	srfsh.log https://pastebin.com/ppnbtLfR
09:23	Bmagic	umarzuki: Trying to get the Evergreen server setup for the first time?
09:25	Bmagic	umarzuki: You might be interested in the "one command to install it all" solution https://hub.docker.com/r/mobiusoffice/evergreen-ils/
09:28		Dyrcona joined #evergreen
09:35		yboston joined #evergreen
09:43		jvwoolf joined #evergreen
09:45		maryj joined #evergreen
09:45	bshum	umarzuki: Could be an ejabberd authentication issue. What Linux distribution are you installing on?
09:46	bshum	Also, I know special characters in ejabberd passwords can break things, so depending on what you chose for the ejabberd user passwords, that could cause issues
09:49	kmlussier	Bmagic: Have you considered adding that solution to https://wiki.evergreen-ils.org/doku.php?id=server_installation:semi_automated?
09:49	Bmagic	kmlussier: Yeah, I knew it belonged somewhere
09:52	umarzuki	Bmagic: yes
09:52	umarzuki	bshum: ubuntu xenial 16.04
09:53	umarzuki	I'm installing this on vmware workstation
09:56	bshum	With Xenial, another issue that's cropped up lately is the firewall blocking ejabberd from communicating
09:57	bshum	I thought berick found a workaround for the ansible installer, not sure if that'll help
09:58	umarzuki	bshum: firewalld service not enabled and iptables -L shows no rules
09:58	umarzuki	I'm just using zaq12wsx for password, testing purposes only
10:00	bshum	That doesn't seem too overly complex and should have worked then
10:00	bshum	Hmm
10:01	Dyrcona	umarzuki: Did you change the auth_password_format to plain in ejabberd.yml?
10:01	umarzuki	Dyrcona: yes
10:02	Dyrcona	umarzuki: Looks like Ejabberd is only listening on tcp6 and not 4.
10:02	Dyrcona	From your netstat output, that is.
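Dyrcona's diagnosis can also be checked from the client side. The following is a minimal Python sketch that assumes ejabberd should accept IPv4 clients on port 5222, the standard XMPP client port. It attempts a plain IPv4 TCP connection and reports whether that connection succeeded. The function name is hypothetical and is not part of OpenSRF or ejabberd.

```python
import socket


def ipv4_port_open(port: int, host: str = "127.0.0.1", timeout: float = 2.0) -> bool:
    """Return True if an IPv4 TCP connection to host:port succeeds."""
    # AF_INET forces an IPv4 attempt. Whether a listener bound to "::" accepts it
    # depends on that socket's IPV6_V6ONLY setting, which is what this probes.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))
        except OSError:  # refused, unreachable, or timed out
            return False
    return True
```

For example, if ipv4_port_open(5222) returns False while netstat lists only a tcp6 socket, the listener's ip setting in ejabberd.yml is a likely suspect.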
10:05	berick	bshum: the issue I had to fix was registering ejabberd users as root won't work in 16.04. you have to sudo to 'ejabberd'. or modify apparmor
10:06	Dyrcona	berick: I've seen that, too, but I think umarzuki's issue is the ejabberd configuration.
10:09	umarzuki	Dyrcona: ip: "::"
10:09	umarzuki	that one in ejabberd.yml?
10:09	Dyrcona	Yes, you probably want to change that one.
10:10	Dyrcona	I'd also make sure that the entries in the hosts section are correct and all there.
10:12	Dyrcona	http://evergreen-ils.org/documentation/install/OpenSRF/README_2_5_0.html#_configure_the_ejabberd_server
10:12	Dyrcona	In case you missed it.
10:12		mmorgan1 joined #evergreen
10:13	Dyrcona	I think we should make the link to OpenSRF more obvious on the installation page.
10:17	umarzuki	Dyrcona: ejabberd now on ipv4 but I'm still seeing same error
10:17	umarzuki	tcp 0 0 0.0.0.0:5222 0.0.0.0:*
10:18	umarzuki	srfsh 2017-06-05 22:14:43 [WARN:33200:osrf_stack.c:144:1496672082332000] * Jabber Error is for top level remote id [routerprivate.localhost/opensrf.math], no one to send my message to! Cutting request short...
10:18	Dyrcona	umarzuki: You restarted osrf services after restarting ejabberd?
10:19	Dyrcona	You might also need the --force-clean-process option to osrf_control.
10:21	Dyrcona	umarzuki: You should also have logs in /var/log/ejabberd/. I'd look at those to see if they give any clues.
10:22	jeff	umarzuki: what version of opensrf, and did you follow all of the recommended changes for the ejabberd config, including (as applicable) max stanza sizes and max sessions, etc?
10:23	umarzuki	yes, 2.5.0
10:24	jeff	are you able to post your ejabberd config file somewhere that we can examine it?
10:27	umarzuki	jeff: ejabberd log https://pastebin.com/r4she46e
10:28	umarzuki	ejabberd.yml https://pastebin.com/jUrEwDBL
10:31	jeff	umarzuki: you should check your setting for max_user_sessions -- it appears to be at 10, and not the recommended 10000
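The settings raised in this thread (auth_password_format, the listener ip, and max_user_sessions) can be checked mechanically. The following is a rough Python sketch that operates on an already-parsed ejabberd.yml. Loading the YAML is not shown, because that needs a third-party parser. The nesting (listen as a list of dicts, access -> max_user_sessions -> all) is an assumption based on the ejabberd 16.x file shipped with Ubuntu 16.04. The values "plain" and 10000 come from the conversation above. The function name is hypothetical.

```python
from typing import Any, Dict, List


def ejabberd_config_problems(cfg: Dict[str, Any]) -> List[str]:
    """Flag the ejabberd.yml settings discussed in this log.

    cfg is assumed to be the parsed YAML, laid out like the ejabberd 16.x
    file on Ubuntu 16.04 (top-level 'listen', 'auth_password_format', 'access').
    """
    problems: List[str] = []
    if cfg.get("auth_password_format") != "plain":
        problems.append("auth_password_format should be plain")
    for listener in cfg.get("listen", []):
        if listener.get("ip") == "::":
            problems.append(f"listener on port {listener.get('port')} is bound to '::'")
    # Assumed nesting: access -> max_user_sessions -> all
    sessions = cfg.get("access", {}).get("max_user_sessions", {}).get("all", 0)
    if sessions < 10000:
        problems.append(f"max_user_sessions is {sessions}, recommended 10000")
    return problems
```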
10:34	bshum	jeff++ # good eye on that
10:35	umarzuki	jeff: thanks
10:36	umarzuki	I misread that part, where I only changed all to 10000
10:46		mmorgan joined #evergreen
10:49		sandbergja joined #evergreen
10:52		collum joined #evergreen
10:52	bshum	JBoyer: cesardv: I'll be curious to see what you guys find with those settings carrying over between retrieved users in the last comments on https://bugs.launchpad.net/evergreen/+bug/1642035 ; I think I saw a similar issue occurring with the patron stat cats where the last entry made on one user was also showing up when retrieving the next user (who didn't have an entry yet)
10:52	pinesol_green	Launchpad bug 1642035 in Evergreen "Web Staff Client - Problems saving notification preferences" [Undecided,Confirmed]
10:59	cesardv	bshum: yeah, I noticed that while doing some testing... I'm still pretty new to Evergreen, but it seems like (according to a comment in regctl.js) data is cached unless you explicitly click out of the patron tabs
11:02	kmlussier	JBoyer / agoben: Can you remind me what the hack-a-way date is?
11:06	agoben	The EOB day is November 6; the Hack-away is November 7-9. (I've run into an issue with our location from last year, so am working on alternate options for the venue. Will announce as soon as possible.)
11:07	berick	thanks agoben , was wondering myself
11:08	agoben	Yup, that was the week that everyone seemed to be available, so should be a good showing :)
11:11	kmlussier	agoben: Thanks! Do you want me to add it to the dev calendar or should I wait until you have a venue lined up?
11:14	agoben	kmlussier: give me this week to try to lock in the venue. I don't anticipate changing the date, but would feel better if everyone could go ahead and get started with lodging arrangements asap to coincide.
11:14	kmlussier	agoben: Sounds good
11:48		_adb joined #evergreen
12:10		jihpringle joined #evergreen
12:35	Dyrcona	If a script that starts a cstore transaction times out, that transaction will never commit, will it?
12:35	berick	Dyrcona: correct.
12:36	Dyrcona	That's what I thought. Guess I'll kill the script and cancel the pg backend.
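The rule Dyrcona and berick agree on is that work done inside a transaction that never commits is discarded. PostgreSQL and cstore can't be demonstrated in a self-contained way, so the sketch below substitutes Python's built-in sqlite3 module to illustrate the same general behavior. It writes a row inside a transaction, drops the connection without COMMIT (the "timed-out script"), and counts what a fresh connection can see. The table and function names are made up for illustration.

```python
import sqlite3


def rows_after_abandoned_transaction(db_path: str) -> int:
    """Write inside a transaction, drop the connection without COMMIT, count survivors."""
    setup = sqlite3.connect(db_path)
    setup.execute("CREATE TABLE IF NOT EXISTS purged (id INTEGER)")
    setup.commit()
    setup.close()

    # The "script that times out": sqlite3 implicitly opens a transaction
    # before the INSERT, and close() does NOT commit it.
    worker = sqlite3.connect(db_path)
    worker.execute("INSERT INTO purged VALUES (1)")
    worker.close()

    # A fresh connection sees only committed data.
    checker = sqlite3.connect(db_path)
    (count,) = checker.execute("SELECT COUNT(*) FROM purged").fetchone()
    checker.close()
    return count
```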
12:38	Dyrcona	I guess a follow up question is does the transaction matter if the cstore calls a stored procedure directly.
12:40	Dyrcona	It probably does....
12:42	* Dyrcona	runs the stored procedure via psql command line.
12:48		littlet joined #evergreen
13:03		jvwoolf joined #evergreen
13:03		rlefaive_ joined #evergreen
13:17	berick	Dyrcona: indeed, the transaction still matters.
13:18		maryj_ joined #evergreen
13:18	Dyrcona	So, what happens if the transaction commits while the stored procedure is still running? Because that is what appears to happen.
13:18	* Dyrcona	is looking at purge_circulations.srfsh
13:19	berick	probably get an error about having no open transaction to commit
13:20	berick	after the json_query times out, the next cstore request goes to a different cstore drone
13:21	berick	well, not 100% on that.. i'd have to look at the srfsh code
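berick's guess is that a COMMIT which lands on a different drone than the one holding the transaction finds no open transaction. The following is a minimal sketch of that situation that uses Python's sqlite3 as a stand-in, with each connection playing the role of a separate cstore drone. SQLite raises an error here. PostgreSQL itself only issues a warning for a COMMIT outside a transaction, so the error berick expects would presumably come from cstore's own handling rather than from the database.

```python
import sqlite3


def commit_on_other_connection(db_path: str = ":memory:") -> str:
    """BEGIN on one connection, COMMIT on another; return what the second one reports."""
    # isolation_level=None disables sqlite3's implicit BEGIN, so the only
    # transaction boundaries are the statements sent explicitly below.
    drone_a = sqlite3.connect(db_path, isolation_level=None)
    drone_b = sqlite3.connect(db_path, isolation_level=None)
    try:
        drone_a.execute("BEGIN")  # the transaction exists only on drone_a
        try:
            drone_b.execute("COMMIT")  # drone_b never saw the BEGIN
        except sqlite3.OperationalError as err:
            return str(err)
        return "committed without error"
    finally:
        if drone_a.in_transaction:
            drone_a.execute("ROLLBACK")
        drone_a.close()
        drone_b.close()
```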
13:22		kmatthiesen joined #evergreen
13:23	berick	in any event, psql++
13:24	Dyrcona	OK. I did Ctrl-c on the running srfsh, so I don't see a "no transaction to commit" error in the logs.
13:24	Dyrcona	When you haven't purged circulation for five years, it takes a while. :)
13:24	berick	yeah
13:25	berick	more so now that circ history doesn't live in the circ tables. all that old stuff gets aged.
13:34	Dyrcona	Well, we had some incidental aging when purging users.
13:40		maryj joined #evergreen
13:45	jeffdavis	The open-ils.search service died on one of our servers again: "server: died with error Can't use an undefined value as a symbol reference at /usr/local/share/perl/5.18.2/OpenSRF/Server.pm line 307."
13:46	jeff	another OOM kill?
13:50	jeffdavis	Doesn't look like it.
14:00	jeffdavis	I presume it's the $child var that's undefined there. So OpenSRF::Server->run() is either passing the request to an idle child that doesn't actually exist, or else failing to spawn a child for some reason.
14:04	jeffdavis	As before, I don't see any helpful log messages prior to that: no "error creating data socketpair" or "child process died" for example.
14:06	jeffdavis	Either of which would indicate an error while spawning a child.
14:13	berick	jeffdavis: or pipe_to_child is undef, because the child died and was reaped (which removes all of $child->{*} fields) while the parent was in the middle of the write_child() function.
14:18	berick	so some time after write_child() is called but before syswrite() is called, SIGCHLD occurs, $child is reaped and becomes a dead husk.
14:18	berick	then syswrite($child->{pipe_to_child}, ..) fails because pipe_to_child is undef
14:19	berick	if the child exited gracefully, there would be no log entry (unless loglevel is highest)
14:22	berick	a few log lines in Server.pm would probably shed a lot of light on it
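berick's theory can be simulated deterministically. The following is a Python sketch, not the OpenSRF::Server code. It uses a hypothetical Child record whose reap() clears its fields, mimicking the "dead husk" described above. An injectable hook runs between looking up the child and writing to it, which plays the role of a SIGCHLD arriving mid-call. A plain list stands in for the pipe.

```python
from typing import Callable, List, Optional


class Child:
    """Hypothetical stand-in for the listener's record of one drone."""

    def __init__(self, pid: int, pipe: List[str]):
        self.pid: Optional[int] = pid
        # A plain list plays the role of the pipe to the drone.
        self.pipe_to_child: Optional[List[str]] = pipe

    def reap(self) -> None:
        """Clear every field, as berick describes reaping doing to $child."""
        self.pid = None
        self.pipe_to_child = None


def write_child(child: Child, msg: str,
                before_write: Optional[Callable[[], None]] = None) -> None:
    """Send msg to the drone; before_write lets a test inject a SIGCHLD-style reap."""
    if before_write is not None:
        before_write()  # the "signal handler" runs between lookup and write
    # Raises AttributeError if the child was reaped: the Python analogue of
    # Perl's "Can't use an undefined value as a symbol reference".
    child.pipe_to_child.append(msg)
```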
14:22	jeffdavis	Yeah, I'll add some logging to warn if that scenario arises.
14:22	jeffdavis	Is that likely to occur naturally, do you think? This is the second time in two weeks that we've seen this problem.
14:25	berick	if my theory is correct, it's atypical, but not impossible for a child to finish what it's doing while another request is en route to the child.
14:26		Dyrcona joined #evergreen
14:27	berick	jeffdavis: does the timing of these coincide with any updates? osrf or eg?
14:29	jeffdavis	First incident was just after we upgraded to OSRF 2.5 / EG 2.12.1 from 2.4/2.10.2
14:30	berick	ah, ok
14:35	JBoyer	berick, jeffdavis, I'll add that we first saw this same thing after that same upgrade (though from Eg 2.11.x rather than 2.10)
14:35	berick	JBoyer: what osrf are you on?
14:35	JBoyer	2.5
14:36	berick	as in, you upgraded osrf at same time?
14:36	JBoyer	yes,
14:36	berick	k
14:38	berick	for logging, know if $child is undef, $child->{pipe_to_child} is undef, and logging $child->{pid} would be good starts. with the pid it should be possible to backtrack what the child was doing prior to the problem
14:38	berick	e.g. find the api call, whether it processed stuff already, etc.
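berick's logging suggestions (note whether the child is undefined, whether pipe_to_child is undefined, and the child's pid) could take the shape of a guard in front of the write. The following is a Python sketch of that idea, not a patch to Server.pm. The child is any object with optional pid and pipe_to_child attributes, and a list stands in for the real pipe. The function and logger names are hypothetical.

```python
import logging
from typing import Any, Optional

log = logging.getLogger("osrf_server_sketch")


def checked_write(child: Any, msg: str) -> bool:
    """Refuse to write to a missing or reaped child, logging what is known about it."""
    if child is None:
        log.warning("write_child: child is undefined")
        return False
    pid: Optional[int] = getattr(child, "pid", None)
    pipe = getattr(child, "pipe_to_child", None)
    if pipe is None:
        # The pid (if still present) lets you backtrack what the drone was doing.
        log.warning("write_child: pipe_to_child is undefined for pid %s", pid)
        return False
    log.debug("write_child: sending request to pid %s", pid)
    pipe.append(msg)  # stand-in for syswrite() on the real pipe
    return True
```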
14:51	JBoyer	Well this should help: I graph the # of every opensrf service by the minute and there's no way any open-ils.search children should have been reaped at any time all day.
14:51	JBoyer	UNLESS, I have completely misunderstood the min_spare_children value for Net::Server.
14:52	berick	JBoyer: the drones (gracefully) die and re-spawn throughout the day.
14:52	berick	each time they hit max-requests
14:52	JBoyer	Oh, yes, yes.
14:53	JBoyer	(I was momentarily worried that min_spare... had to be larger than min_children)
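berick's point explains why a per-minute graph of child counts can look flat even while drones are being reaped all day. Each drone exits gracefully after its request limit, and a replacement takes its place. The following is a toy Python sketch of one pool slot. It assumes the replacement is spawned immediately on exit, so the live count never changes. The function name is hypothetical.

```python
def respawns_for(total_requests: int, max_requests: int) -> int:
    """Count drone replacements for one pool slot serving total_requests.

    Assumes a drone exits as soon as it has served max_requests and the
    listener immediately spawns a replacement.
    """
    respawns = 0
    served = 0  # requests handled by the current drone
    for _ in range(total_requests):
        served += 1
        if served == max_requests:
            respawns += 1  # drone exits gracefully; a fresh one takes its place
            served = 0
    return respawns
```

For example, respawns_for(100, 30) gives 3: three graceful exits and reaps, with one drone alive the whole time.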
14:54	jeffdavis	JBoyer: to confirm, you're seeing the same thing as me in your logs? open-ils.search dies with that error message and no other errors leading up to it?
14:55	JBoyer	jeffdavis, yup.
15:03	JBoyer	My assumption is that it's probably a race condition between the write_child call on lines 719 or 193 and actually calling syswrite on line 307 (the call on 185 seems fairly safe since there aren't any errors with the message from line 449). Am I correct in thinking that's your opinion too, berick ?
15:03	JBoyer	(To be fair, I'm making that assumption based on berick's suggestions and questions but don't want to speak for him necessarily.)
15:09	berick	now i'm questioning my logic. write_child is only called to start a new conversation. i can't think of any reason a child would gracefully die after becoming idle (or getting spawned), but before it received its first request.
15:10		babel_ joined #evergreen
15:10	berick	unless it was killed by an external process
15:12	berick	jeffdavis' comments about an idle child not existing or spawn failing make more sense to me now
15:19	berick	notably spawn_child is not checking for a fork() failure (undef response).
15:19	berick	though i would only expect that to be a problem if memory was exhausted, in which case, chaos everywhere
15:20	berick	i think I already asked this, but jeffdavis JBoyer you're not seeing anything in /openils/var/log/open-ils.search_stderr.log ?
15:22	jeff	there is the scenario where the listener becomes a drone when it fails to fork, but again -- that's generally due to memory exhaustion and is often coupled with the oom-killer killing... something.
15:23	jeff	bug 1546683
15:23	pinesol_green	Launchpad bug 1546683 in OpenSRF "fork() failure results in Perl service Listeners becoming Drones" [Undecided,New] https://launchpad.net/bugs/1546683
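One plausible way an unchecked fork() failure turns a listener into a drone is the following. In Perl, fork returns undef on failure, and undef is false, so a test that only checks truthiness sends the failed parent down the child branch. The Python sketch below mimics that Perl convention with an injected fork function that returns None on failure. Python's own os.fork() raises OSError instead. The sketch assumes that the spawn code tests only truthiness, which the log does not confirm. Both function names are hypothetical.

```python
from typing import Callable, Optional


def spawn_child(fork: Callable[[], Optional[int]]) -> str:
    """Three-way check of a Perl-style fork result (None = failure)."""
    pid = fork()
    if pid is None:
        return "fork failed: stay the listener and log it"
    if pid == 0:
        return "child: become a drone"
    return f"parent: register drone {pid}"


def naive_spawn_child(fork: Callable[[], Optional[int]]) -> str:
    """Truthiness-only check: a failed fork (None) falls into the child branch."""
    pid = fork()
    if pid:
        return f"parent: register drone {pid}"
    return "child: become a drone"  # reached by the real child AND by a failed fork
```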
15:23	JBoyer	I have (plenty...) of open-ils.storage.biblio.multiclass.staged.search_fts.atomic failures for ISBN searches that end with a '*'
15:23	JBoyer	But that's been going on for ages and open-ils.search dying is rather new.
15:24	JBoyer	(lack of date/timestamps in those files makes it difficult to say)
15:25	berick	JBoyer: oh, right
15:46		collum joined #evergreen
15:55		Jillianne joined #evergreen
16:30	pinesol_green	News from qatests: Test Success <http://testing.evergreen-ils.org/~live>
17:14		mmorgan left #evergreen
19:33		jihpringle_ joined #evergreen
22:20		genpaku joined #evergreen