07:20 kworstell-isl joined #evergreen
08:01 BDorsey joined #evergreen
09:01 dguarrac joined #evergreen
09:04 Dyrcona joined #evergreen
09:05 <Dyrcona> Hm. Had issues with a couple of overnight jobs on the dev system, but the router is still running.
09:06 <Dyrcona> No problems with Autorenew on the other test vm, either.
09:07 <Dyrcona> Got this in the email from the badge_score_generator: [auth] WRONGPASS invalid username-password pair, at /usr/share/perl5/Redis.pm line 311.
09:08 <Dyrcona> Haven't seen that one before. All the services are running OK.
09:09 <Dyrcona> refresh carousels reported "Unable to bootstrap client for requests." When I saw that I thought it might have been because of a crash.
09:13 <JBoyer> Is opensrf's .srfsh.xml set up correctly for redis there? I think both of those are run from .srfsh scripts, so if that's got the wrong password or router location, those seem like the errors I would expect
09:13 <Dyrcona> Looks like something's up with services on the training/dev system, though. I'm getting an internal server error when I try to use the staff client.
09:13 <Dyrcona> JBoyer: I think so, I'll double check.
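[Note: the bus credentials can also be tested directly, outside of srfsh. A minimal sketch, assuming the username and password copied from .srfsh.xml / opensrf_core.xml -- the "opensrf" account name below is only a placeholder:
    redis-cli --user opensrf --pass 'password-from-config' PING
A PONG reply means the credentials work; a WRONGPASS reply would match the badge_score_generator error above.]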
09:15 <Dyrcona> I don't like the way the opensrf_core.xml gets commented by the ansible/docker set up. It's confusing. "--><password>...</password><--" as an example
09:16 <berick> it's def a stop-gap
09:16 <Dyrcona> Yeah, the opensrf user password looks right.
09:17 <Dyrcona> I'm going to try the angular staff client and opac, and if they both give internal server errors, I'll restart services.
09:18 <Dyrcona> OPAC gives me an internal server error. eg2 just white screens.
09:19 <Dyrcona> Oh, and I did restart Apache after I first got the internal server errors.
09:20 <berick> Dyrcona: are both router instances running?
09:20 <Dyrcona> berick: --diagnostic said everything was running.
09:21 <Dyrcona> I just restarted services and apache, and I can login.
09:22 <Dyrcona> * router [1812] uptime=21:31:28 cputime=00:00:04
09:22 <Dyrcona> * router [1813] uptime=21:31:28 cputime=00:01:52
09:22 <Dyrcona> That's from before.
09:23 * berick nods
09:23 <csharp_> routers be routin'
09:32 <Dyrcona> Nothing in the system logs for opensrf_router except for the segfault yesterday morning.
09:33 <Dyrcona> Redis errors in /openils/var/log/router-public.log
09:35 <Dyrcona> Is opensrf.settings supposed to be registered on the private domain?
09:35 <berick> yes
09:35 <Dyrcona> The private router log says it isn't, but that could be from before the restart.
09:38 <Dyrcona> Hm... It's tough to say, but I think that's from just before I restarted services.
09:39 <Dyrcona> Does anything log service start/stop/restart?
09:40 <Dyrcona> Ugh. It corresponds with the second of the restart.
09:40 <Dyrcona> I suppose I can restart again and see if it comes up again.
09:41 <Dyrcona> If I could stop making typos for just 5 minutes!
09:42 <Dyrcona> OK. Looks like the "not registered" errors are from before the previous restart. They don't repeat with this one.
09:43 <Dyrcona> router 2024-12-20 09:18:38 [ERR :1813:transport_connection.c:255:] REDIS Error [] BLPOP opensrf:router:router:private.localhost 5
09:43 <Dyrcona> router 2024-12-20 09:18:38 [ERR :1813:transport_connection.c:263:] Resting for a few seconds after bus failure...
09:44 <Dyrcona> That repeats over and over again in the private router log since about midnight.
09:44 <Dyrcona> Same thing in the public router log.
09:45 <Dyrcona> Modulo the process id and public versus private.
09:46 <Dyrcona> berick: Should I open a bug on Lp?
09:47 mmorgan joined #evergreen
09:47 <berick> it's not much to go on :\ any other nearby errors or in the redis logs?
09:48 <berick> good to verify redis didn't restart for some reason as well
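[Note: a minimal way to check for a Redis restart, assuming the distribution's systemd unit is named redis-server (adjust if the unit name differs):
    systemctl status redis-server              # current PID and "Active: ... since" start time
    journalctl -u redis-server --since today   # any stops/starts or crashes logged today]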
09:51 <Dyrcona> Well, is the number after the file name the line number? That might be something to start with.
09:52 <berick> yeah we know where BLPOP happens, but unfortunately that's not much more useful than "something happened while talking to redis"
09:53 <Dyrcona> All right. I'm looking. I go to journalctl first these days, but I haven't found much there. A bunch of apache stuff about missing files and ssl.
09:54 <berick> thanks for digging
09:54 <csharp_> ok - the "5" is the timeout
09:54 <csharp_> len = snprintf(command_buf, 256, "BLPOP %s %d", recipient, timeout);
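[Note: BLPOP blocks for up to the given number of seconds and simply returns nil on timeout, so the 5-second wait itself is not the failure; it is the error reply that the router logs as a bus failure. The same call can be issued by hand against the queue named in the router error above (add credentials as needed):
    redis-cli BLPOP opensrf:router:router:private.localhost 5]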
09:55 <Dyrcona> Today's redis-server.log is 0 bytes. Yesterday's is compressed. I'll have a look at it.
09:56 <Dyrcona> 593:M 19 Dec 2024 16:41:51.794 # Guru Meditation: Redis aborting for OUT OF MEMORY. Allocating 32774 bytes! #server.c:5206
09:56 <Dyrcona> 593:M 19 Dec 2024 16:41:51.794 # (forcing SIGSEGV in order to print the stack trace)
09:57 <berick> interesting
09:57 <Dyrcona> There's a backtrace for Redis.
09:59 <berick> based on the VM, how much memory do you think it had to consume to reach that point?
09:59 <Dyrcona> Don't think much was going on then: fine generator and/or hold_targeter maybe. I don't know right now if anyone was using the server.
09:59 <Dyrcona> The server has about 30GB.
09:59 <berick> i guess the other question is what actually ate the memory
10:00 <Dyrcona> Free says 29Gi and no swap.
10:00 <Dyrcona> Right now 24Gi is available with 23Gi buffered/cached.
10:00 <Dyrcona> 4.2Gi used and 2.0Gi "free"
10:01 <Dyrcona> Looks like Redis restarted right after that, and that's where the log ends.
10:04 <Dyrcona> Someone was using acq about that time. This is 3.14, so the new Angular acq.
10:08 <Dyrcona> I've emailed one of our internal lists to find out if whoever was using acq had any issues and could report anything.
10:08 <csharp_> @blame acq
10:08 <pinesol> csharp_: acq stole bradl's tux doll!
10:08 <csharp_> gotta remove that blame
10:08 <Dyrcona> IIRC, csharp_ you have the tux doll in question?
10:09 <csharp_> I do, in my office
10:09 <csharp_> I'll share a pic sometime :-)
10:10 <csharp_> also I'll share a pic of the unicorn on a noose that used to hang on the "previous ILS" $1M+ Sun server
10:10 <Dyrcona> I hesitate to check yesterday's osrfsys.log. It's 202MB compressed.
10:12 <berick> Dyrcona: since Bmagic was also having issues on a utility server, I'm leaning toward it being a utility process that's causing the issue (hold targeter, fine generator, etc.).
10:15 <berick> if a backend was sending replies faster than the caller could collect them, it would cause messages to pile up in Redis. but I'm having trouble imagining where ~30G of such content would be coming from.
10:15 <berick> hm unless the caller just disconnected
10:16 <Dyrcona> OK. I suspect something like that, too.
10:16 <berick> but even then that's a lot
10:16 <berick> maybe over time though
10:16 <Dyrcona> No lock files from that time hanging around.
10:17 <Dyrcona> The badge score generator lock file is there from 4:30 am, though.
10:18 <Dyrcona> At 16:41, it would most likely be the fine generator.
10:18 <berick> Dyrcona: how many days had it been running OK?
10:19 <Dyrcona> I don't actually know. I was out of the office for a week, and didn't get back until yesterday.
10:21 <Dyrcona> I did recently switch the fine generator to run at 30 minutes past the hour because we were getting regular emails from the monitor about the number of processes running when it and the hold targeter started at the same time.
10:22 <Dyrcona> We're running password reset every minute and the pending A/T runner every half hour.
10:23 redavis joined #evergreen
10:23 <Dyrcona> We have a SQL job sending data to Bywater for our test Aspen installation every 5 minutes. That could be using a lot of RAM.....
10:25 <Dyrcona> The person testing acq yesterday said that they did see Evergreen stop working at that time. They also encountered Lp 2086786. I don't think it's related.
10:25 <pinesol> Launchpad bug 2086786 in Evergreen 3.13 "Acq: Multi-Branch Libraries Can't See All Their Branch Funds in Invoices" [High,Fix committed] https://launchpad.net/bugs/2086786
10:26 <Dyrcona> Someone else mentioned Evergreen "crashing" at that time.
10:29 <berick> Dyrcona: mind checking periodically to see if Redis memory usage is slowly climbing? if that's the case, I have a script that will help
10:29 <Dyrcona> OK. I'll do that.
10:29 <berick> thanks
10:31 <Dyrcona> Right now, I have to figure out where reports should be and probably make a link. I'm also told reports aren't working. I think they're working and the output is going somewhere other than /openils/var/web/reporter.
10:35 <Dyrcona> And, that was easy.
10:38 <Dyrcona> Hm... it's 85MB of VM, rss is 13MB at the moment.
10:41 <Dyrcona> FWIW, I'm monitoring it this way: ps -o etime,time,rss,size,vsize 15997 [that's the actual PID]
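[Note: ps gives the OS view of the redis-server process; Redis's own accounting can be watched alongside it with INFO memory. A minimal sketch -- only the 60-second interval is arbitrary (add credentials as needed):
    watch -n 60 "redis-cli INFO memory | grep -E 'used_memory_human|used_memory_peak_human'"]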
10:45 <Dyrcona> I have not had it crash on the other test server, but I don't usually run cron, and it's almost never used.
10:46 <Dyrcona> It should be quiet next week. I'll try recompiling OpenSRF and Evergreen with debug options turned on. That might help, particularly if I can figure out how to reproduce a crash.
10:57 <Dyrcona> Think I'll do that on my other test vm, too, if I haven't already. I'll do that after I test the 3.12.10 tarball.
11:03 sandbergja joined #evergreen
11:19 Christineb joined #evergreen
12:24 kworstell-isl joined #evergreen
12:24 <redavis> Dyrcona++ mjdriscoll++
12:59 <mmorgan> Dyrcona++ mjdriscoll++
13:17 <Dyrcona> storage and hold_targeter services crashed at some point. I missed catching it live.
13:18 <Dyrcona> storage won't run again....
13:21 <Dyrcona> today's redis-server.log is still empty. I wonder if it's a case where Redis needs to be restarted when the log is rotated?
13:22 <berick> no, restarts aren't necessary
13:23 <Dyrcona> ERR open-ils.storage Has PID file entry [61045], which matches no running open-ils.storage processes
13:23 <Dyrcona> That's after attempting to stop and then start storage again.
13:23 <Dyrcona> I think Redis or the router is hosed again.
13:24 <Dyrcona> diagnostic shows everything but storage and hold-targeter running.
13:25 <Dyrcona> storage is connecting to the database.
13:26 <Dyrcona> storage std_err log has not changed since yesterday, so whatever is happening is not getting logged.
13:27 <Dyrcona> Dec 20 13:25:34 cwmars-dev kernel: __vm_enough_memory: pid: 61502, comm: gce_workload_ce, not enough memory for the allocation
13:27 <Dyrcona> Free says I have plenty available.
13:29 <Dyrcona> Does redis spawn processes? I only see 1 when I pgrep -af redis
13:29 <Dyrcona> and it is still only using 85MB.
13:29 <berick> no, just the one process
13:30 <Dyrcona> OK. I just restarted redis-server, too.
13:32 <Dyrcona> Non-evergreen processes are saying that they cannot allocate memory, or were until a couple of minutes ago. See what I pasted above. Looks like an OpenSRF Listener could not allocate RAM at 11:30 am.
13:33 <Dyrcona> Dec 20 11:30:05 cwmars-dev kernel: __vm_enough_memory: pid: 50517, comm: OpenSRF Listene, not enough memory for the allocation
13:33 <Dyrcona> Dec 20 12:00:02 cwmars-dev kernel: __vm_enough_memory: pid: 50598, comm: OpenSRF Listene, not enough memory for the allocation
13:33 <Dyrcona> Cut off, so I don't know which is which.
13:36 <Dyrcona> Well, there's this: -rw-rw-r-- 1 opensrf opensrf 5 Dec 20 11:30 generate_fines-LOCK
13:37 <berick> can you tell from dmesg or similar what proc had the initial OOM issue?
13:38 <Dyrcona> [18852.962748] __vm_enough_memory: pid: 15109, comm: postgres, not enough memory for the allocation
13:38 <Dyrcona> That's the earliest from dmesg. There's no timestamp, but should be since the reboot this morning.
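[Note: dmesg can print wall-clock timestamps, which would line these entries up against the cron schedule:
    dmesg -T | grep -i vm_enough_memory
The -T flag converts the seconds-since-boot prefix into a human-readable date.]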
13:39 <Dyrcona> I'm going to disable the incremental updates for Aspen.
13:53 * Dyrcona wonders if some kernel memory tuning would help, like changing cache/buffer settings?
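[Note: the __vm_enough_memory messages above are printed when an allocation fails the kernel's overcommit accounting check, so the overcommit policy is worth confirming before tuning cache/buffer behavior. A minimal sketch:
    sysctl vm.overcommit_memory vm.overcommit_ratio
    grep -i commit /proc/meminfo    # CommitLimit vs Committed_AS
With vm.overcommit_memory=2 the kernel rejects allocations once Committed_AS reaches CommitLimit, even while free/available memory still looks healthy.]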
14:11 <Dyrcona> And, it's happening again. Just got an email that no children are available for open-ils.trigger. Went to run osrf_control --diagnostic and that segfaulted.
14:11 <Dyrcona> free says 1.6Gi free, but nothing will run.
14:12 <Dyrcona> redis-server pid changed.
14:13 <Dyrcona> 2.5 minutes ago it changed, and is not using much memory.
14:15 <Dyrcona> Can I blame google_guest_agent using 2.2GB of virtual memory? or websocketd using 1GB?
14:16 <Dyrcona> For resident memory, several storage drones are using 280MB or more.
14:16 <Dyrcona> with virt over 300
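[Note: a one-shot way to get the same ranking for every process on the box; resident set size (RSS) is the column that matters most when the question is what is consuming physical RAM:
    ps -eo pid,comm,rss,vsz --sort=-rss | head -20]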
14:17 <Dyrcona> I thought postgres would be up there, too, but it's not.
14:18 <Dyrcona> My conclusion: this vm just doesn't have enough RAM.
14:18 <csharp_> Dyrcona: that's what it sounds like to me
14:18 <Dyrcona> redis-server pid changed again.
14:19 <Dyrcona> I'm going to stop the cron jobs.
14:19 <Dyrcona> reboot it.
14:20 <Dyrcona> Clark died along the way. Probably when reporter-store died.
14:22 <Dyrcona> ran --stop-all and some drones are still hanging out.
14:22 <Dyrcona> What's that option? --kill-with-fire?
14:23 <Dyrcona> I bet these drones are hanging around from when the listeners crashed earlier.
14:31 <Dyrcona> Don't think I'll start SIPServer or simple2zoom, either.
14:38 <Dyrcona> Yeah, the top 4 users of virt memory are all related to GCP and docker.
14:38 <Dyrcona> Should this thing be running docker?
14:40 * Dyrcona has more reason to not like docker.
16:08 <pinesol> News from commits: Forward port 3.13.6 -> 3.13.7 ugrade script <https://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=e77e94dfc25c5da2f32a072d9316334e350adbd4>
16:38 <pinesol> News from commits: Forward-port 3.12.9-3.12.10 upgrade script <https://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=467133d1de7778f9ed46ed05d70d707a75c78fac>
17:11 mmorgan left #evergreen