Time | Nick | Message
07:09 |
|
collum joined #evergreen |
07:57 |
|
BDorsey joined #evergreen |
08:45 |
|
mmorgan joined #evergreen |
08:46 |
|
dguarrac joined #evergreen |
08:55 |
|
Dyrcona joined #evergreen |
09:01 |
|
mmorgan1 joined #evergreen |
09:10 |
Dyrcona |
Looks like the memory problem is more the OpenSRF drones than PostgreSQL. `top` says that postgres is using as much ram as a single storage drone.... |
09:16 |
Dyrcona |
Restarting services doesn't make much difference, either. top reports an idle storage drone using 331272KB of VIRT memory. I've seen them hit 345MB each when in use. |
09:16 |
* Dyrcona |
checks a couple of other systems for comparison. |
09:18 |
Dyrcona |
Memory use of a storage drone is comparable on another system running Redis (333340K virt) and Ejabberd (303464K virt). |
09:19 |
Dyrcona |
So, 330MB per process is no big deal, right? Well, not until you're running 70 to 100 of them... |
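For scale: 330MB across 70 to 100 drones is on the order of 23-33GB before anything else runs. A rough way to put a live number on it with ps, assuming the drones show up in the process list as "OpenSRF Drone [service]" (adjust the pattern to match your install):

    ps -eo rss=,args= | awk '/[O]penSRF Drone/ {rss += $1; n++}
        END {printf "%d drones, ~%d MB total RSS\n", n, rss/1024}'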
09:20 |
Dyrcona |
Before I restarted services, I saw search and acq drones approaching the same memory use, but they typically start out lower. |
09:20 |
Dyrcona |
I was thinking of cutting the number of requests that the drones on this server are allowed to process to see if that helps, but these numbers suggest to me that it won't make that much difference. |
09:21 |
Dyrcona |
As for redis crashing, I suspect, but do not know, that it is likely making an additional copy of the data as messages get passed around, thereby increasing the memory pressure. |
09:22 |
Dyrcona |
BTW, beam.smp running Ejabberd on the one system is using 4224472K virt. That's 4GB! |
09:23 |
Dyrcona |
Redis doesn't even show up in top -c sorted by memory consumption when it's idle. |
09:26 |
Dyrcona |
On the mostly unused test system Redis has VSZ of 74520K or 74MB. On the one where I just restarted it, the VSZ is 76460K. |
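For comparison, Redis can report its own view of memory, which is sometimes easier to reason about than VSZ; a quick check, assuming redis-cli can reach the server on its default port/socket:

    redis-cli INFO memory | grep -E '^used_memory_human|^used_memory_rss_human'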
09:28 |
Dyrcona |
Bmagic || berick : See my monologue above if you get this notification. If not, it's in the logs for future reference. |
09:29 |
Dyrcona |
So, I'm going to rung the fine generator and keep an eye on top sorted by memory consumption. |
09:32 |
Dyrcona |
s/rung/run/ but you get it. |
09:37 |
Dyrcona |
OK. Going to restart the VM, first. |
09:39 |
Dyrcona |
Huh. Had to 'kill' simple2zoom twice, once for each process. Guess they're not spawned in the same process group? |
09:41 |
Dyrcona |
Gotta remember to start websocketd. It's not set up with systemd. |
09:41 |
Dyrcona |
Think I'll do these tests on a couple of other VMs to see what difference, if any, there is in memory use between Redis and Ejabberd. |
09:42 |
berick |
Dyrcona: read it, but i'm unclear what the takeaway is |
09:42 |
Dyrcona |
Well, I'm not sure there's a takeaway, yet, except that storage drones use a lot of memory regardless. |
09:44 |
Dyrcona |
I'm in the process of ferreting out where the high memory use happens, and I'm logging it publicly rather than keeping private notes. It's a quiet day in IRC, so I don't think anyone will really mind much. If they do, no one has said anything yet. |
09:44 |
berick |
no, i appreciate it |
09:45 |
Dyrcona |
:) |
09:45 |
berick |
just unsure if there was a call to action |
09:45 |
berick |
in any event, i'm here if you find anything |
09:46 |
Dyrcona |
berick++ |
09:47 |
Dyrcona |
FWIW, apache processes start out using around 147MB VIRT before I even start anything else. |
09:48 |
Dyrcona |
snapd is using 1.69GB VIRT. Think that's mostly buf/cache. |
09:49 |
Dyrcona |
I think I was just sorting them by RES before which is the resident set size in RAM. |
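For reference, procps-ng top (the usual one on Ubuntu/Debian) can start already sorted on a memory column instead of toggling it interactively:

    top -c -o %MEM    # sort by resident memory share from the start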
09:51 |
Dyrcona |
Also, VSZ/VIRT is probably not what I really want to measure anyway. That includes memory-mapped files, which don't actually consume RAM. |
09:51 |
Dyrcona |
RES is what's actually used, but it's more complicated because I'm concerned about buffers/cache which top apparently doesn't report. |
09:54 |
Dyrcona |
And, shared memory with forked processes complicates things further. Those forked drones may not each actually be using all the RAM individually. |
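PSS (proportional set size) splits shared pages evenly among the processes that map them, so it gives a fairer per-drone figure than RES when the drones fork from a common parent. A minimal sketch, assuming a kernel new enough to have /proc/<pid>/smaps_rollup, permission to read the drones' /proc entries, and the same "OpenSRF Drone" process-title match as above:

    for pid in $(pgrep -f 'OpenSRF Drone'); do
        printf '%s: ' "$pid"
        awk '/^Pss:/ {print $2, "kB PSS"}' "/proc/$pid/smaps_rollup"
    done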
09:54 |
Dyrcona |
I need to write better tools for monitoring memory. Linux has gotten way too complicated over the past 30 years. |
09:54 |
Dyrcona |
In the 2.x kernel days, it was easier to know what you're talking about. |
09:55 |
Dyrcona |
Enough chatter. I have to restart services. |
10:05 |
Dyrcona |
Idle storage drones still have an RSS of 275MB. That's pretty big. |
10:05 |
eeevil |
Dyrcona: the large size of open-ils.storage is due to it generating ~16k methods at startup. it's always been north of 180MB, but newer perls seem greedier |
10:06 |
Dyrcona |
eeevil++ |
10:11 |
Dyrcona |
Running the fine generator adds a little to the storage drones' memory use. It also caused 2 additional drones to run. I have 4 storage drones going right now, and all get used at various times. |
10:12 |
Dyrcona |
RSS fluctuates between 279MB and 290MB per drone. |
10:13 |
Dyrcona |
Redis RSS jumped from 12.2MB to 13.5MB. |
10:14 |
|
mantis1 joined #evergreen |
10:14 |
Dyrcona |
The storage drone's RSS doesn't drop when the fine generator ends. |
10:15 |
Dyrcona |
Perl-- |
10:16 |
Dyrcona |
RSS of redis dropped to 12.7MB after the fine generator completed. |
10:17 |
* Dyrcona |
tries the hold_targeter |
10:17 |
Dyrcona |
hold_targeter.pl --osrf-config /openils/conf/opensrf_core.xml --soft-retarget-interval='1 hour' # for the logs. |
10:18 |
Dyrcona |
Redis RSS is back up to 14.6MB and circ drones are hitting around 100MB. |
10:19 |
Dyrcona |
BTW, I'm mostly trying to chase down Bmagic's hypothesis that Redis is causing an increase in RAM usage. |
10:22 |
Dyrcona |
buffers/cache has jumped to 17Gi. |
10:23 |
Dyrcona |
And, 22Gi now. The hold targeter has an effect on something. |
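Those buffers/cache figures presumably come from free; watching it alongside the job is a cheap way to correlate the two (the interval is arbitrary):

    watch -n 5 free -h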
10:25 |
Dyrcona |
There's so much nonsense on stackoverflow. I can't use that site any more.... |
10:25 |
berick |
Dyrcona: redis-server is using 22G? |
10:25 |
berick |
oh, no, i misread that i think |
10:26 |
Dyrcona |
No, that's total buffers/cache. |
10:27 |
Dyrcona |
And, it's up to 24G now. I'm trying to figure out what's in there: Pg? drones? |
10:27 |
Dyrcona |
Redsi is only using 14.4MB RSS and 80.6MB VSZ. |
10:28 |
Dyrcona |
Redsi? :) |
10:30 |
Dyrcona |
I'm starting to think that redis-server is not the problem, but it may be more efficient than Ejabberd leading to more RAM use elsewhere. |
10:34 |
Dyrcona |
Monitoring by CPU usage, I see 7 or 8 Pg processes going, using about 30-44MB each. |
10:34 |
berick |
fwiw, my hold targeters also get bloated. i have open-ils.hold-targeter set to <max_requests>10</max_requests> on my utility server from way back |
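For reference, one way to see what a service's drones are currently allowed, assuming the stock /openils/conf/opensrf.xml layout with max_requests under each service's unix_config stanza:

    xmllint --xpath '//open-ils.hold-targeter/unix_config/max_requests/text()' \
        /openils/conf/opensrf.xml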
10:35 |
Dyrcona |
berick: I might try that, but only 2 hold targeter drones are running at around 100MB each. |
10:37 |
* Dyrcona |
goes to look up how to find out what's using buffer/cache from a reliable source. I see so many contradictory answers on stackoverflow, all accepted. |
10:40 |
Dyrcona |
rhel++ # The documentation anyway. |
10:41 |
Dyrcona |
And, it's out of date... rhel-- |
10:42 |
Dyrcona |
oh, never mind. It hasn't been set, the /proc file doesn't exist.... rhel++ :) |
10:45 |
Dyrcona |
All right maybe the documentation is out of date... or good for RHEL only.... When I try to set the kernel pagecache percentage, my system complains. |
10:48 |
Dyrcona |
OK. The setting I want is vm.vfs_cache_pressure in sysctl. |
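The experiment described here as commands; 150 (and the 200 tried later) are just the values under test, not recommendations. The default is 100, and higher values bias reclaim toward the dentry/inode caches:

    sysctl vm.vfs_cache_pressure               # show the current value
    sudo sysctl -w vm.vfs_cache_pressure=150   # prefer reclaiming dentries/inodes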
10:48 |
Bmagic |
Once I moved a certain group of action triggers over to a different utility server, neither the old nor the new utility server has failed to process the cron jobs. They've been running for 3+ weeks without restarting services. |
10:48 |
Dyrcona |
Bmagic: We're having issues with action triggers disabled. |
10:48 |
Dyrcona |
No cron jobs, just me running stuff manually. |
10:49 |
Dyrcona |
I'm going to bump vm.vfs_cache_pressure to 150 to see if that helps. |
10:49 |
Dyrcona |
Before I do that, I'm going to run the thing that I think crashed it last Friday. |
10:50 |
Bmagic |
So, I think I can conclude that the issue appears to show itself when enough OpenSRF-related things are running at the same time, or within a short period of time perhaps. Dyrcona: it's actually nice to know that you can reliably cause a problem. That's where I got stuck: even when I kicked off the job by hand, it wouldn't fail |
10:52 |
Dyrcona |
Well, I'm going to try again. It's our big dump of "everything" that goes to Aspen. I think that was running Friday when there was a crash, after I had run the fine generator and hold targeter. |
10:53 |
Dyrcona |
I suspect that I just need to tell the kernel to clear the cache buffers more often. |
10:56 |
Dyrcona |
Well, something crashed, but not redis. |
10:58 |
Dyrcona |
postgres |
10:58 |
Dyrcona |
Not the main one, just the child/children running the query. |
11:00 |
Dyrcona |
I'll try again with the vfs_cache_pressure at 150. |
11:00 |
|
sandbergja joined #evergreen |
11:01 |
Dyrcona |
Oh, right. We use marc_export for part of it. |
11:03 |
Dyrcona |
Nope. Same result. I'm pretty sure that I don't have to restart for this setting to have an effect. |
11:03 |
Dyrcona |
I'll increase the cache pressure to 200. I think that's the maximum. |
11:05 |
Dyrcona |
An "idle" postgresql process is using 20% of CPU.... |
11:06 |
Dyrcona |
Nope. Looks like I might have to manually clear the cache. |
11:07 |
jeff |
in what way did the postgres child crash? |
11:08 |
Dyrcona |
No memory for an allocation. |
11:08 |
Dyrcona |
Yes, I may have to cleanup the temp area on disk. |
11:10 |
Dyrcona |
I'll worry about that once I get the memory use sorted. |
11:13 |
Dyrcona |
sudo bash -c "echo 1 > /proc/sys/vm/drop_caches" # Works great! |
11:13 |
Dyrcona |
From 22Gi to 343Gi. I could clear the dentries and inodes, too, but meh. |
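For completeness, the other drop_caches values from the kernel documentation: 1 drops the pagecache, 2 drops reclaimable slab objects such as dentries and inodes, and 3 drops both. Running sync first writes out dirty pages so more of the cache is actually droppable:

    sync
    sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"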
11:13 |
Dyrcona |
Now, let's see if the process runs. |
11:24 |
Dyrcona |
Nope. Still crashed. |
11:26 |
Dyrcona |
It was using only 13G of buffers/cache |
11:26 |
jeff |
what does the log entry look like? have you shared it above already? I looked, but didn't see it in a quick scan. |
11:26 |
Dyrcona |
Dec 30 11:16:30 cwmars-dev kernel: __vm_enough_memory: pid: 5321, comm: postgres, not enough memory for the allocation |
11:27 |
jeff |
what does the postgresql log entry look like? |
11:29 |
Dyrcona |
2024-12-30 11:16:30.615 EST [5321] evergreenevergreen ERROR: out of memory |
11:29 |
Dyrcona |
2024-12-30 11:16:30.615 EST [5321] evergreenevergreen DETAIL: Failed on request of size 8192 in memory context "ErrorContext". |
11:29 |
Dyrcona |
It may not have actually crashed, but the process is gone. |
11:30 |
Dyrcona |
There are two additional lines about a cursor not existing and failing to fetch from the cursor, but I think the above are more relevant. |
11:35 |
Dyrcona |
Looks like Pg cleaned up any temporary files. |
11:40 |
Dyrcona |
At any rate, redis-server is not gobbling up lots of RAM. |
11:41 |
Dyrcona |
I blame Perl and I blame Postgres. |
11:43 |
Dyrcona |
Doing anything with the database cranks up the buffers/cache. |
11:44 |
jeff |
expected, since you're reading from disk and the kernel is caching that read data. |
11:45 |
Dyrcona |
Yes. |
11:47 |
Dyrcona |
My testing points the finger away from redis and more towards running everything on 1 machine. |
11:47 |
jeff |
how much physical memory and how much swap does this machine/vm/whatnot have? |
11:47 |
jeff |
can you share the output of this query somewhere? SELECT name, setting, unit, source FROM pg_settings WHERE name ~ 'mem'; |
11:48 |
Dyrcona |
It says 30G of memory, no swap. (I've considered adding swap, but haven't bothered because it would only delay the inevitable.) |
11:49 |
Dyrcona |
Everything is default except shared mem type = posix |
11:50 |
Dyrcona |
I cut shared_buffers from the default (128MB) down to 64MB last week. |
11:50 |
jeff |
that's incredibly tiny. that might be your issue right there. |
11:51 |
jeff |
but I'm not sure you'd hit out of memory issues, more likely "terrible performance". |
11:51 |
jeff |
but since the planner takes shared buffers into consideration, you might be getting a worst-possible plan that's contributing. |
11:51 |
Dyrcona |
Right, and performance did not change much. I think we just need a larger VM. |
11:52 |
Bmagic |
30GB may not be enough memory to simulate production, considering the DB size on disk > physical memory * 14 |
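A quick way to confirm that ratio, if useful; the database name and local peer auth as the postgres user are assumptions:

    sudo -u postgres psql -d evergreen -c \
        "SELECT pg_size_pretty(pg_database_size(current_database()));"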
11:52 |
jeff |
I thought that postgresql was a lot more verbose in these kinds of out of memory errors; the format of the message, claiming it's out of memory in the ErrorContext context, almost sounds like it didn't have enough memory to construct the more detailed out-of-memory error. |
11:52 |
Dyrcona |
Yeah, that's what I think happened. |
11:53 |
Dyrcona |
Bmagic: Yes. I don't remember what the previous server was set up as. |
11:53 |
Bmagic |
same |
11:53 |
Dyrcona |
We can worry about that next week. |
11:53 |
jeff |
any chance of getting the output of this, sanitized of any sensitive data if any? SELECT name, setting, unit, source FROM pg_settings WHERE source <> 'default'; |
11:53 |
Dyrcona |
I think I'll start most of the cron jobs and see what happens. |
11:55 |
Dyrcona |
None of it looks sensitive. Just a sec. |
11:56 |
jeff |
have you already tried with a shared_buffers value closer to the conventional starting point of 25% to 40% of physical memory? |
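A sketch of what that change would look like, assuming a target of roughly 25% of the 30G box; shared_buffers only takes effect after a restart, and the actual service name varies by distro and PostgreSQL version:

    sudo -u postgres psql -c "ALTER SYSTEM SET shared_buffers = '8GB';"
    sudo systemctl restart postgresql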
11:59 |
Dyrcona |
jeff: https://drive.google.com/file/d/1dJ5wsVWyEt0DL9hIwaMq34xaTrVeh73a/view?usp=sharing |
11:59 |
Dyrcona |
Well, I think that would cause the database to use more RAM and starve the other processes even faster. |
12:08 |
|
kmlussier joined #evergreen |
12:37 |
sandbergja |
Dyrcona: purely out of curiosity, what is `cat /proc/sys/vm/overcommit_memory`? |
12:38 |
Dyrcona |
sandbergja: 2 |
12:39 |
* Dyrcona |
has to look that setting up again to remember what it does exactly. I have tweaked it in the past on other servers. |
12:40 |
Dyrcona |
Maybe I should set it to 0 to allow heuristic overcommit? Pretty sure that I don't want 1. |
12:40 |
sandbergja |
I came across some redis documentation saying that redis prefers 1? Let me see if I can dig it up |
12:41 |
sandbergja |
This thing: https://redis.io/docs/latest/develop/get-started/faq/#background-saving-fails-with-a-fork-error-on-linux |
12:42 |
Dyrcona |
So, I'm considering adding a small swap file, not more than 8GB, and taking jeff's suggestion and changing shared_buffers in Pg (requires a Pg restart). |
12:43 |
Dyrcona |
sandbergja: That's interesting. I'll see if I'm getting that error. |
12:44 |
sandbergja |
I think that 0 or 1 would make the "not enough memory for the allocation" errors less likely, at the tradeoff of being more likely to hit the oom_killer? |
12:44 |
berick |
w/ our documented setup, redis disk saving should not be happening |
12:46 |
Dyrcona |
I got this "Dec 27 15:52:16 cwmars-dev systemd[1]: redis-server.service: Failed to fork: Cannot allocate memory" about the time of the crash on Friday. |
12:48 |
Dyrcona |
I can certainly try changing the overcommit memory setting. I know I've done it before (for Postgres?), but I don't remember exactly when or why, and I probably don't have good notes. (I should go back to keeping an admin notebook/journal.) |
12:49 |
Dyrcona |
Too bad -r doesn't work with zgrep. |
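zgrep indeed has no -r, but find can supply the recursion; the path and pattern here are only illustrative:

    find /var/log -name '*.gz' -exec zgrep -H 'Cannot allocate memory' {} +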
12:51 |
collum |
I got curious about the different settings - https://www.kernel.org/doc/Documentation/vm/overcommit-accounting |
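Condensed from that document: 0 is heuristic overcommit (the default), 1 always overcommits, and 2 refuses allocations once Committed_AS would exceed CommitLimit, which is swap plus overcommit_ratio percent of RAM. With no swap and the default ratio of 50, the commit limit on a 30G box is only about 15G, which would explain allocations failing while plenty of memory is still sitting in cache. Checking and switching it:

    cat /proc/sys/vm/overcommit_memory
    grep -E 'CommitLimit|Committed_AS' /proc/meminfo
    sudo sysctl -w vm.overcommit_memory=0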
12:51 |
Dyrcona |
collum++ I was just looking at that document. |
12:55 |
|
sandbergja joined #evergreen |
12:55 |
sandbergja |
berick++ thanks, gotcha! |
12:57 |
sandbergja |
haha I like this phrase from the documentation collum shared: "a seriously wild allocation" |
12:59 |
Dyrcona |
"A wild allocation appears." |
12:59 |
sandbergja |
haha |
12:59 |
berick |
i can't say 100% that disabling persistence means it won't ever fork, but if Redis is using ~15MB of ram and it can't fork, then you're just out of ram. |
13:01 |
Dyrcona |
Yes, that's what I think. I'm just out of RAM. Today it has been postgres. Friday it was redis. |
13:01 |
berick |
Dyrcona: "it was Redis" -- redis crashed or it ate the ram? |
13:02 |
Dyrcona |
Redis crashed. I don't think redis is eating ram. If it were, I think I would have seen it while running the scheduled jobs. I think it's Perl and Pg eating the ram. |
13:03 |
* berick |
nods |
13:07 |
Dyrcona |
right... add a swap file.... been a while... |
13:11 |
|
troy joined #evergreen |
13:13 |
Dyrcona |
So, 6GB of swap might be OK. |
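A minimal sketch of adding that swap file; the size and path are just the ones discussed, and fallocate is fine on ext4 but some filesystems want dd instead:

    sudo fallocate -l 6G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    # to persist across reboots, add to /etc/fstab:
    #   /swapfile none swap sw 0 0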
13:20 |
Dyrcona |
As much as I like running systems without swap, it's usually necessary to have some. |
13:23 |
Dyrcona |
I think I'll clear the cache again and set overcommit_memory to 0 for heuristic overcommit, then run some scheduled jobs, maybe even autorenew, and see what happens. |
13:44 |
Dyrcona |
I'm not the only one using the dev machine right now. Load hit 15 while I'm running the hold targeter. I know at least one person is looking at circulation settings. |
13:45 |
Dyrcona |
Memory looks good after setting overcommit_memory. |
13:45 |
Dyrcona |
Not using any swap yet, either. |
13:46 |
Dyrcona |
redis never seems to go much over 14.5MB. |
14:03 |
|
kmlussier1 joined #evergreen |
14:03 |
|
kmlussier1 left #evergreen |
14:10 |
Dyrcona |
I think things will be OK as long as I don't run the aspen stuff. |
15:11 |
Dyrcona |
berick: I see some pcrud drones using as much memory as storage, or more. I may have to limit the number of requests that they're allowed to process. I'll consider that depending on how things go over the next few days. |
15:12 |
Dyrcona |
I'm running our Daily-PD-2 granularity action trigger, which is essentially a combined autorenew and 2-day courtesy notice. RAM use is getting tight. |
15:14 |
Dyrcona |
No memory errors so far, though. |
15:24 |
berick |
hm, haven't seen that w/ pcrud. |
15:26 |
berick |
pcrud's using 1/4 of what storage uses pretty consistently here |
15:26 |
berick |
w/ 2000 max_requests for both |
15:32 |
Dyrcona |
Hmm. It's going to be difficult for me to confirm that right now; they're not showing up among the top memory users in top. |
15:32 |
Dyrcona |
I'll have to double check later. |
15:34 |
Dyrcona |
It's still chugging away on autorenewal with 280MB of free RAM and 0.8MB of swap used. |
15:34 |
Dyrcona |
Correction. It ticked up to 7.2MB of swap. |
15:35 |
Dyrcona |
And, autorenew just finished and things look good! |
15:35 |
Dyrcona |
I suspect the overcommit_memory setting helped more than anything else I've done. sandbergja++ |
15:41 |
Dyrcona |
berick: top -c output sorted by most to least memory usage. These are the top 10 processes for memory use on the machine right now: https://drive.google.com/file/d/1OBh7QyMsBaCR6auiiBfJtR-Ak0H5mzJW/view?usp=sharing |
15:46 |
berick |
interesting they're using more than storage, but the amounts used aren't crazy. |
15:46 |
Dyrcona |
For certain definitions of crazy. :) |
15:46 |
Dyrcona |
No, I mostly agree. |
15:47 |
Dyrcona |
Looks like they've actually been doing more stuff, based on the run time. |
15:49 |
Dyrcona |
I plan to set shared buffers back to the default in a bit and restart Pg. I'll run the same tests again tomorrow. If that's all good, I'll enable all but our Aspen scheduled jobs on Thursday. |
15:52 |
Dyrcona |
Looking at the overcommit documentation, adding the swap may have helped in ways other than just being an overflow space. It looks like the swap amount is used in the overcommit calculations for both the heuristic and the "don't overcommit" modes. |
15:54 |
Dyrcona |
I'm curious to look at that part of the kernel code, but not so curious that I'm going to do it right now. |
16:00 |
Dyrcona |
Without doing anything, buffers/cache dropped by half just now. I think the cache pressure setting is also helping. |
16:01 |
Dyrcona |
Guess I'll wait to restart. Looks like someone is doing something. I see search and actor and pcrud drones becoming active. |
16:14 |
Bmagic |
anyone know off-hand any tips for loading POs via MARC where only the PO and none of the bib records get inserted into biblio.record_entry, despite "import non-matching record" being checked? The resulting bib queue also has zero records in it. Nothing I can (re)import there either. |
16:25 |
Dyrcona |
Oof. Dunno about that one. |
16:27 |
Dyrcona |
I will say: if these settings work tomorrow, I'll make the sysctl settings permanent. Then I'll bug Bmagic about how to add them to the docker/ansible setup. |
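A sketch of making them permanent with the values tried above (150 and 200 were both tried for the cache pressure; 150 shown, and the file name is arbitrary):

    printf '%s\n' 'vm.overcommit_memory = 0' 'vm.vfs_cache_pressure = 150' |
        sudo tee /etc/sysctl.d/90-evergreen-memory.conf
    sudo sysctl --system    # reload all sysctl configuration files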
16:28 |
* Dyrcona |
signs out. |
16:36 |
pinesol |
News from commits: LP2092227 Hold History Pink Box <https://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=102da36cb3e7ac0c19c0b16ee6c7e9763247507f> |
16:48 |
berick |
Bmagic: did it make line items? |
16:50 |
Bmagic |
berick: yes |
16:55 |
Bmagic |
I'm starting to think it's because not all of the bibs have an item on them. And the provider MARC attribute definitions are odd. Like, for example, the barcode is found in 267$i, shelving location in 266$l, fund code in 264$b. All of the attributes are in different fields |
16:59 |
Bmagic |
oh, lol, nevermind |
17:00 |
Bmagic |
those were DB ID numbers, not MARC tags. The attributes look OK actually. And the number of bibs in the file is the same as the number of 962 tags. So the file has a one-to-one item/bib relationship. |
17:01 |
Bmagic |
The PO interface loads up the order fine. Each line has an item (1). But, of course, all* of them need "Link to Catalog" |
17:03 |
|
mmorgan1 left #evergreen |
17:04 |
berick |
huh. been a while since i looked closely at that code. kinda thought the recs had to be queued first during import. |
17:06 |
|
mantis1 left #evergreen |
17:10 |
Bmagic |
I've loaded the same file via standard vandelay (no PO) and it loads up fine |
17:11 |
Bmagic |
something about getting loaded and* POified causes the bibs to vanish |
17:11 |
Bmagic |
no errors in the logs either |
17:16 |
Bmagic |
Chopping the MARC file down to (random number) 36 records and feeding that in, it loads the bibs fine and creates the PO |
17:16 |
Bmagic |
The full file is 891 records |
17:42 |
Bmagic |
Maybe I'll try half, and then half again, and so on. OK, yeah, the first half failed the same way the whole file did; I'll split that and see where I get |