Time | Nick | Message
07:09 |
|
collum joined #evergreen |
07:57 |
|
BDorsey joined #evergreen |
08:45 |
|
mmorgan joined #evergreen |
08:46 |
|
dguarrac joined #evergreen |
08:55 |
|
Dyrcona joined #evergreen |
09:01 |
|
mmorgan1 joined #evergreen |
09:10 |
Dyrcona |
Looks like the memory problem is more the OpenSRF drones than PostgreSQL. `top` says that postgres is using as much ram as a single storage drone.... |
09:16 |
Dyrcona |
Restarting services doesn't make much difference, either. top reports an idle storage drone using 331272KB of VIRT memory. I've seen them hit 345MB each when in use. |
09:16 |
* Dyrcona |
checks a couple of other systems for comparison. |
09:18 |
Dyrcona |
Memory use of a storage drone is comparable on another system running Redis (333340K virt) and Ejabberd (303464K virt). |
09:19 |
Dyrcona |
So, 330MB per process is no big deal, right? Well, not until you're running 70 to 100 of them... |
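For scale: 330MB across 70 to 100 drones is on the order of 23-33GB before anything else runs. A rough way to put a live number on it with ps, assuming the drones show up in the process list as "OpenSRF Drone [service]" (adjust the pattern to match your install):

    ps -eo rss=,args= | awk '/[O]penSRF Drone/ {rss += $1; n++}
        END {printf "%d drones, ~%d MB total RSS\n", n, rss/1024}'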
09:20 |
Dyrcona |
Before I restarted services, I saw search and acq drones approaching the same memory use, but they typically start out lower. |
09:20 |
Dyrcona |
I was thinking of cutting the number of requests that the drones on this server are allowed to process to see if that helps, but these numbers suggest to me that it won't make that much difference. |
09:21 |
Dyrcona |
As for redis crashing, I suspect, but do not know, that it is likely making an additional copy of the data as messages get passed around, thereby increasing the memory pressure. |
09:22 |
Dyrcona |
BTW, beam.smp running Ejabberd on the one system is using 4224472K virt. That's 4GB! |
09:23 |
Dyrcona |
Redis doesn't even show up in top -c sorted by memory consumption when it's idle. |
09:26 |
Dyrcona |
On the mostly unused test system Redis has VSZ of 74520K or 74MB. On the one where I just restarted it, the VSZ is 76460K. |
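For comparison, Redis can report its own view of memory, which is sometimes easier to reason about than VSZ; a quick check, assuming redis-cli can reach the server on its default port/socket:

    redis-cli INFO memory | grep -E '^used_memory_human|^used_memory_rss_human'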
09:28 |
Dyrcona |
Bmagic || berick : See my monologue above if you get this notification. If not, it's in the logs for future reference. |
09:29 |
Dyrcona |
So, I'm going to rung the fine generator and keep an eye on top sorted by memory consumption. |
09:32 |
Dyrcona |
s/rung/run/ but you get it. |
09:37 |
Dyrcona |
OK. Going to restart the VM, first. |
09:39 |
Dyrcona |
Huh. Had to 'kill' simple2zoom twice, once for each process. Guess they're not spawned in the same process group? |
09:41 |
Dyrcona |
Gotta remember to start websocketd. It's not set up with systemd. |
09:41 |
Dyrcona |
Think I'll do these tests on a couple of other VMs to see what difference, if any, there is in memory use between Redis and Ejabberd. |
09:42 |
berick |
Dyrcona: read it, but i'm unclear what the takeaway is |
09:42 |
Dyrcona |
Well, I'm not sure there's a takeaway, yet, except that storage drones use a lot of memory regardless. |
09:44 |
Dyrcona |
I'm in the process of ferreting out where the high memory use happens, and I'm logging it publicly rather than keeping private notes. It's a quiet day in IRC, so I don't think anyone will really mind much. If they do, no one has said anything yet. |
09:44 |
berick |
no, i appreciate it |
09:45 |
Dyrcona |
:) |
09:45 |
berick |
just unsure if there was a call to action |
09:45 |
berick |
in any event, i'm here if you find anything |
09:46 |
Dyrcona |
berick++ |
09:47 |
Dyrcona |
FWIW, apache processes start out using around 147MB VIRT before I even start anything else. |
09:48 |
Dyrcona |
snapd is using 1.69GB VIRT. Think that's mostly buf/cache. |
09:49 |
Dyrcona |
I think I was just sorting them by RES before which is the resident set size in RAM. |
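For reference, procps-ng top (the usual one on Ubuntu/Debian) can start already sorted on a memory column instead of toggling it interactively:

    top -c -o %MEM    # sort by resident memory share from the start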
09:51 |
Dyrcona |
Also, VSZ/VIRT is probably not what I really want to measure anyway. That includes memory-mapped files, which don't actually consume RAM. |
09:51 |
Dyrcona |
RES is what's actually used, but it's more complicated because I'm concerned about buffers/cache which top apparently doesn't report. |
09:54 |
Dyrcona |
And, shared memory with forked processes complicates things further. Those forked drones may not each actually be using all the RAM individually. |
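PSS (proportional set size) splits shared pages evenly among the processes that map them, so it gives a fairer per-drone figure than RES when the drones fork from a common parent. A minimal sketch, assuming a kernel new enough to have /proc/<pid>/smaps_rollup, permission to read the drones' /proc entries, and the same "OpenSRF Drone" process-title match as above:

    for pid in $(pgrep -f 'OpenSRF Drone'); do
        printf '%s: ' "$pid"
        awk '/^Pss:/ {print $2, "kB PSS"}' "/proc/$pid/smaps_rollup"
    done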
09:54 |
Dyrcona |
I need to write better tools for monitoring memory. Linux has gotten way too complicated over the past 30 years. |
09:54 |
Dyrcona |
In the 2.x kernel days, it was easier to know what you're talking about. |
09:55 |
Dyrcona |
Enough chatter. I have to restart services. |
10:05 |
Dyrcona |
Idle storage drones still have an RSS of 275MB. That's pretty big. |
10:05 |
eeevil |
Dyrcona: the large size of open-ils.storage is due to it generating ~16k methods at startup. it's always been north of 180MB, but newer perls seem greedier |
10:06 |
Dyrcona |
eeevil++ |
10:11 |
Dyrcona |
Running the fine generator adds a little to the storage drones' memory use. It also caused 2 additional drones to run. I have 4 storage drones going right now, and all get used at various times. |
10:12 |
Dyrcona |
RSS fluctuates between 279MB and 290MB per drone. |
10:13 |
Dyrcona |
Redis RSS jumped from 12.2MB to 13.5MB. |
10:14 |
|
mantis1 joined #evergreen |
10:14 |
Dyrcona |
The storage drone's RSS doesn't drop when the fine generator ends. |
10:15 |
Dyrcona |
Perl-- |
10:16 |
Dyrcona |
RSS of redis dropped to 12.7MB after the fine generator completed. |
10:17 |
* Dyrcona |
tries the hold_targeter |
10:17 |
Dyrcona |
hold_targeter.pl --osrf-config /openils/conf/opensrf_core.xml --soft-retarget-interval='1 hour' # for the logs. |
10:18 |
Dyrcona |
Redis RSS is back up to 14.6MB and circ drones are hitting around 100MB. |
10:19 |
Dyrcona |
BTW, I'm mostly trying to chase down Bmagic's hypothesis that Redis is causing an increase in RAM usage. |
10:22 |
Dyrcona |
buffers/cache has jumped to 17Gi. |
10:23 |
Dyrcona |
And, 22Gi now. The hold targeter has an effect on something. |
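Those buffers/cache figures presumably come from free; watching it alongside the job is a cheap way to correlate the two (the interval is arbitrary):

    watch -n 5 free -h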
10:25 |
Dyrcona |
There's so much nonsense on stackoverflow. I can't use that site any more.... |
10:25 |
berick |
Dyrcona: redis-server is using 22G? |
10:25 |
berick |
oh, no, i misread that i think |
10:26 |
Dyrcona |
No, that's total buffers/cache. |
10:27 |
Dyrcona |
And, it's up to 24G now. I'm trying to figure out what's in there: Pg? drones? |
10:27 |
Dyrcona |
Redsi is only using 14.4MB RSS and 80.6MB VSZ. |
10:28 |
Dyrcona |
Redsi? :) |
10:30 |
Dyrcona |
I'm starting to think that redis-server is not the problem, but it may be more efficient than Ejabberd leading to more RAM use elsewhere. |
10:34 |
Dyrcona |
Monitoring by CPU usage, I see 7 or 8 Pg processes going, using about 30-44MB each. |
10:34 |
berick |
fwiw, my hold targeters also get bloated. i have open-ils.hold-targeter set to <max_requests>10</max_requests> on my utility server from way back |
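For reference, one way to see what a service's drones are currently allowed, assuming the stock /openils/conf/opensrf.xml layout with max_requests under each service's unix_config stanza:

    xmllint --xpath '//open-ils.hold-targeter/unix_config/max_requests/text()' \
        /openils/conf/opensrf.xml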
10:35 |
Dyrcona |
berick: I might try that, but only 2 hold targeter drones are running at around 100MB each. |
10:37 |
* Dyrcona |
goes to look up how to find out what's using buffer/cache from a reliable source. I see so many contradictory answers on stackoverflow, all accepted. |
10:40 |
Dyrcona |
rhel++ # The documentation anyway. |
10:41 |
Dyrcona |
And, it's out of date... rhel-- |
10:42 |
Dyrcona |
oh, never mind. It hasn't been set, the /proc file doesn't exist.... rhel++ :) |
10:45 |
Dyrcona |
All right maybe the documentation is out of date... or good for RHEL only.... When I try to set the kernel pagecache percentage, my system complains. |
10:48 |
Dyrcona |
OK. The setting I want is vm.vfs_cache_pressure in sysctl. |
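The experiment described here as commands; 150 (and the 200 tried later) are just the values under test, not recommendations. The default is 100, and higher values bias reclaim toward the dentry/inode caches:

    sysctl vm.vfs_cache_pressure               # show the current value
    sudo sysctl -w vm.vfs_cache_pressure=150   # prefer reclaiming dentries/inodes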
10:48 |
Bmagic |
Once I moved a certain group of action triggers over to a different utility server, neither the old nor the new utility server has failed to process the cron jobs. They've been running for 3+ weeks without restarting services. |
10:48 |
Dyrcona |
Bmagic: We're having issues with action triggers disabled. |
10:48 |
Dyrcona |
No cron jobs, just me running stuff manually. |
10:49 |
Dyrcona |
I'm going to bump vm.vfs_cache_pressure to 150 to see if that helps. |
10:49 |
Dyrcona |
Before I do that, I'm going to run the thing that I think crashed it last Friday. |
10:50 |
Bmagic |
So, I think I can conclude that the issue appears to show itself when enough OpenSRF-related things are running at the same time, or within a short period of time perhaps. Dyrcona: it's actually nice to know that you can reliably cause a problem. That's where I got stuck: even when I kicked off the job by hand, it wouldn't fail |
10:52 |
Dyrcona |
Well, I'm going to try again. It's our big dump of "everything" that goes to Aspen. I think that was running Friday when there was a crash, after I had run the fine generator and hold targeter. |
10:53 |
Dyrcona |
I suspect that I just need to tell the kernel to clear the cache buffers more often. |
10:56 |
Dyrcona |
Well, something crashed, but not redis. |
10:58 |
Dyrcona |
postgres |
10:58 |
Dyrcona |
Not the main one, just the child/children running the query. |
11:00 |
Dyrcona |
I'll try again with the vfs_cache_pressure at 150. |
11:00 |
|
sandbergja joined #evergreen |
11:01 |
Dyrcona |
Oh, right. We use marc_export for part of it. |
11:03 |
Dyrcona |
Nope. Same result. I'm pretty sure that I don't have to restart for this setting to have an effect. |
11:03 |
Dyrcona |
I'll increase the cache pressure to 200. I think that's the maximum. |
11:05 |
Dyrcona |
An "idle" postgresql process is using 20% of CPU.... |
11:06 |
Dyrcona |
Nope. Looks like I might have to manually clear the cache. |
11:07 |
jeff |
in what way did the postgres child crash? |
11:08 |
Dyrcona |
No memory for an allocation. |
11:08 |
Dyrcona |
Yes, I may have to cleanup the temp area on disk. |
11:10 |
Dyrcona |
I'll worry about that once I get the memory use sorted. |
11:13 |
Dyrcona |
sudo bash -c "echo 1 > /proc/sys/vm/drop_caches" # Works great! |
11:13 |
Dyrcona |
From 22Gi to 343Gi. I could clear the dentries and inodes, too, but meh. |
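For completeness, the other drop_caches values from the kernel documentation: 1 drops the pagecache, 2 drops reclaimable slab objects such as dentries and inodes, and 3 drops both. Running sync first writes out dirty pages so more of the cache is actually droppable:

    sync
    sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"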
11:13 |
Dyrcona |
Now, let's see if the process runs. |
11:24 |
Dyrcona |
Nope. Still crashed. |
11:26 |
Dyrcona |
It was using only 13G of buffers/cache |
11:26 |
jeff |
what does the log entry look like? have you shared it above already? I looked, but didn't see it in a quick scan. |
11:26 |
Dyrcona |
Dec 30 11:16:30 cwmars-dev kernel: __vm_enough_memory: pid: 5321, comm: postgres, not enough memory for the allocation |
11:27 |
jeff |
what does the postgresql log entry look like? |
11:29 |
Dyrcona |
2024-12-30 11:16:30.615 EST [5321] evergreenevergreen ERROR: out of memory |
11:29 |
Dyrcona |
2024-12-30 11:16:30.615 EST [5321] evergreenevergreen DETAIL: Failed on request of size 8192 in memory context "ErrorContext". |
11:29 |
Dyrcona |
It may not have actually crashed, but the process is gone. |
11:30 |
Dyrcona |
There are two additional lines about a cursor not existing and failing to fetch from the cursor, but I think the above are more relevant. |
11:35 |
Dyrcona |
Looks like Pg cleaned up any temporary files. |
11:40 |
Dyrcona |
At any rate, redis-server is not gobbling up lots of RAM. |
11:41 |
Dyrcona |
I blame Perl and I blame Postgres. |
11:43 |
Dyrcona |
Doing anything with the database cranks up the buffers/cache. |
11:44 |
jeff |
expected, since you're reading from disk and the kernel is caching that read data. |
11:45 |
Dyrcona |
Yes. |
11:47 |
Dyrcona |
My testing points the finger away from redis and more towards running everything on 1 machine. |
11:47 |
jeff |
how much physical memory and how much swap does this machine/vm/whatnot have? |
11:47 |
jeff |
can you share the output of this query somewhere? SELECT name, setting, unit, source FROM pg_settings WHERE name ~ 'mem'; |
11:48 |
Dyrcona |
It says 30G of memory, no swap. (I've considered adding swap, but haven't bothered because it would only delay the inevitable.) |
11:49 |
Dyrcona |
Everything is default except shared mem type = posix |
11:50 |
Dyrcona |
I cut shared_buffers from the default (128MB) down to 64MB last week. |
11:50 |
jeff |
that's incredibly tiny. that might be your issue right there. |
11:51 |
jeff |
but I'm not sure you'd hit out of memory issues, more likely "terrible performance". |
11:51 |
jeff |
but since the planner takes shared buffers into consideration, you might be getting a worst-possible plan that's contributing. |
11:51 |
Dyrcona |
Right, and performance did not change much. I think we just need a larger VM. |
11:52 |
Bmagic |
30GB may not be enough memory to simulate production, considering the DB size on disk > physical memory * 14 |
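A quick way to confirm that ratio, if useful; the database name and local peer auth as the postgres user are assumptions:

    sudo -u postgres psql -d evergreen -c \
        "SELECT pg_size_pretty(pg_database_size(current_database()));"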
11:52 |
jeff |
I thought that postgresql was a lot more verbose in these kinds of out of memory errors; the format of the message, claiming it's out of memory in the ErrorContext context, almost sounds like it didn't have enough memory to construct the more detailed out-of-memory error. |
11:52 |
Dyrcona |
Yeah, that's what I think happened. |
11:53 |
Dyrcona |
Bmagic: Yes. I don't remember what the previous server was set up as. |
11:53 |
Bmagic |
same |
11:53 |
Dyrcona |
We can worry about that next week. |
11:53 |
jeff |
any chance of getting the output of this, sanitized of any sensitive data if any? SELECT name, setting, unit, source FROM pg_settings WHERE source <> 'default'; |
11:53 |
Dyrcona |
I think I'll start most of the cron jobs and see what happens. |
11:55 |
Dyrcona |
None of it looks sensitive. Just a sec. |
11:56 |
jeff |
have you already tried with a shared_buffers value closer to the conventional starting point of 25% to 40% of physical memory? |
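A sketch of what that change would look like, assuming a target of roughly 25% of the 30G box; shared_buffers only takes effect after a restart, and the actual service name varies by distro and PostgreSQL version:

    sudo -u postgres psql -c "ALTER SYSTEM SET shared_buffers = '8GB';"
    sudo systemctl restart postgresql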
11:59 |
Dyrcona |
jeff: https://drive.google.com/file/d/1dJ5wsVWyEt0DL9hIwaMq34xaTrVeh73a/view?usp=sharing |
11:59 |
Dyrcona |
Well, I think that would cause the database to use more RAM and starve the other processes even faster. |
12:08 |
|
kmlussier joined #evergreen |
12:37 |
sandbergja |
Dyrcona: purely out of curiosity, what is `cat /proc/sys/vm/overcommit_memory`? |
12:38 |
Dyrcona |
sandbergja: 2 |
12:39 |
* Dyrcona |
has to look that setting up again to remember what it does exactly. I have tweaked it in the past on other servers. |
12:40 |
Dyrcona |
Maybe I should set it to 0 to allow heuristic overcommit? Pretty sure that I don't want 1. |
12:40 |
sandbergja |
I came across some redis documentation saying that redis prefers 1? Let me see if I can dig it up |
12:41 |
sandbergja |
This thing: https://redis.io/docs/latest/develop/get-started/faq/#background-saving-fails-with-a-fork-error-on-linux |
12:42 |
Dyrcona |
So, I'm considering adding a small swap file, not more than 8GB, and taking jeff's suggestion and changing shared_buffers in Pg (requires a Pg restart). |
12:43 |
Dyrcona |
sandbergja: That's interesting. I'll see if I'm getting that error. |
12:44 |
sandbergja |
I think that 0 or 1 would make the "not enough memory for the allocation" errors less likely, at the tradeoff of being more likely to hit the oom_killer? |
12:44 |
berick |
w/ our documented setup, redis disk saving should not be happening |
12:46 |
Dyrcona |
I got this "Dec 27 15:52:16 cwmars-dev systemd[1]: redis-server.service: Failed to fork: Cannot allocate memory" about the time of the crash on Friday. |
12:48 |
Dyrcona |
I can certainly try changing the overcommit memory setting. I know I've done it before (for Postgres?), but I don't remember exactly when or why, and I probably don't have good notes. (I should go back to keeping an admin notebook/journal.) |
12:49 |
Dyrcona |
Too bad -r doesn't work with zgrep. |
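zgrep indeed has no -r, but find can supply the recursion; the path and pattern here are only illustrative:

    find /var/log -name '*.gz' -exec zgrep -H 'Cannot allocate memory' {} +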
12:51 |
collum |
I got curious about the different settings - https://www.kernel.org/doc/Documentation/vm/overcommit-accounting |
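Condensed from that document: 0 is heuristic overcommit (the default), 1 always overcommits, and 2 refuses allocations once Committed_AS would exceed CommitLimit, which is swap plus overcommit_ratio percent of RAM. With no swap and the default ratio of 50, the commit limit on a 30G box is only about 15G, which would explain allocations failing while plenty of memory is still sitting in cache. Checking and switching it:

    cat /proc/sys/vm/overcommit_memory
    grep -E 'CommitLimit|Committed_AS' /proc/meminfo
    sudo sysctl -w vm.overcommit_memory=0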
12:51 |
Dyrcona |
collum++ I was just looking at that document. |
12:55 |
|
sandbergja joined #evergreen |
12:55 |
sandbergja |
berick++ thanks, gotcha! |
12:57 |
sandbergja |
haha I like this phrase from the documentation collum shared: "a seriously wild allocation" |
12:59 |
Dyrcona |
"A wild allocation appears." |
12:59 |
sandbergja |
haha |
12:59 |
berick |
i can't say 100% that disabling persistence means it won't ever fork, but if Redis is using ~15MB of ram and it can't fork, then you're just out of ram. |
13:01 |
Dyrcona |
Yes, that's what I think. I'm just out of RAM. Today it has been postgres. Friday it was redis. |
13:01 |
berick |
Dyrcona: "it was Redis" -- redis crashed or it ate the ram? |
13:02 |
Dyrcona |
Redis crashed. I don't think redis is eating ram. If it were, I think I would have seen it while running the scheduled jobs. I think it's Perl and Pg eating the ram. |
13:03 |
* berick |
nods |
13:07 |
Dyrcona |
right... add a swap file.... been a while... |
13:11 |
|
troy joined #evergreen |
13:13 |
Dyrcona |
So, 6GB of swap might be OK. |
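A minimal sketch of adding that swap file; the size and path are just the ones discussed, and fallocate is fine on ext4 but some filesystems want dd instead:

    sudo fallocate -l 6G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    # to persist across reboots, add to /etc/fstab:
    #   /swapfile none swap sw 0 0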
13:20 |
Dyrcona |
As much as I like running systems without swap, it's usually necessary to have some. |
13:23 |
Dyrcona |
I think I'll clear the cache again and set overcommit_memory to 0 for heuristic overcommit, then run some scheduled jobs, maybe even autorenew, and see what happens. |
13:44 |
Dyrcona |
I'm not the only one using the dev machine right now. Load hit 15 while I'm running the hold targeter. I know at least one person is looking at circulation settings. |
13:45 |
Dyrcona |
Memory looks good after setting overcommit_memory. |
13:45 |
Dyrcona |
Not using any swap yet, either. |
13:46 |
Dyrcona |
redis never seems to go much over 14.5MB. |
14:03 |
|
kmlussier1 joined #evergreen |
14:03 |
|
kmlussier1 left #evergreen |
14:10 |
Dyrcona |
I think things will be OK as long as I don't run the aspen stuff. |
15:11 |
Dyrcona |
berick: I see some pcrud drones using as much memory as storage, or more. I may have to limit the number of requests that they're allowed to process. I'll consider that depending on how things go over the next few days. |
15:12 |
Dyrcona |
I'm running our Daily-PD-2 granularity action trigger, which is essentially a combined autorenew and 2-day courtesy notice. RAM use is getting tight. |
15:14 |
Dyrcona |
No memory errors so far, though. |
15:24 |
berick |
hm, haven't seen that w/ pcrud. |
15:26 |
berick |
pcrud's using 1/4 of what storage uses pretty consistently here |
15:26 |
berick |
w/ 2000 max_requests for both |
15:32 |
Dyrcona |
Hmm. It's going to be difficult for me to confirm that right now; they're not showing up among the top memory users in top. |
15:32 |
Dyrcona |
I'll have to double check later. |
15:34 |
Dyrcona |
It's still chugging away on autorenewal with 280MB of free RAM and 0.8MB of swap used. |
15:34 |
Dyrcona |
Correction. It ticked up to 7.2MB of swap. |
15:35 |
Dyrcona |
And, autorenew just finished and things look good! |
15:35 |
Dyrcona |
I suspect the overcommit_memory setting helped more than anything else I've done. sandbergja++ |
15:41 |
Dyrcona |
berick: top -c output sorted by most to least memory usage. These are the top 10 processes for memory use on the machine right now: https://drive.google.com/file/d/1OBh7QyMsBaCR6auiiBfJtR-Ak0H5mzJW/view?usp=sharing |
15:46 |
berick |
interesting they're using more than storage, but the amounts used aren't crazy. |
15:46 |
Dyrcona |
For certain definitions of crazy. :) |
15:46 |
Dyrcona |
No, I mostly agree. |
15:47 |
Dyrcona |
Looks like they've actually been doing more stuff, based on the run time. |
15:49 |
Dyrcona |
I plan to set shared buffers back to the default in a bit and restart Pg. I'll run the same tests again tomorrow. If that's all good, I'll enable all but our Aspen scheduled jobs on Thursday. |
15:52 |
Dyrcona |
Looking at the overcommit documentation, adding the swap may have helped in ways other than just being an overflow space. It looks like the swap amount is used in the overcommit calculations for both the heuristic and the "don't overcommit" modes. |
15:54 |
Dyrcona |
I'm curious to look at that part of the kernel code, but not so curious that I'm going to do it right now. |
16:00 |
Dyrcona |
Without doing anything, buffers/cache dropped by half just now. I think the cache pressure setting is also helping. |
16:01 |
Dyrcona |
Guess I'll wait to restart. Looks like someone is doing something. I see search and actor and pcrud drones becoming active. |
16:14 |
Bmagic |
anyone know off-hand any tips for loading POs via MARC where only the PO and none of the bib records get inserted into biblio.record_entry, despite "import non-matching record" being checked? The resulting bib queue also has zero records in it. Nothing I can (re)import there either. |
16:25 |
Dyrcona |
Oof. Dunno about that one. |
16:27 |
Dyrcona |
I will say: if these settings work tomorrow, I'll make the sysctl settings permanent. Then I'll bug Bmagic about how to add them to the docker/ansible setup. |
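A sketch of making them permanent with the values tried above (150 and 200 were both tried for the cache pressure; 150 shown, and the file name is arbitrary):

    printf '%s\n' 'vm.overcommit_memory = 0' 'vm.vfs_cache_pressure = 150' |
        sudo tee /etc/sysctl.d/90-evergreen-memory.conf
    sudo sysctl --system    # reload all sysctl configuration files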
16:28 |
* Dyrcona |
signs out. |
16:36 |
pinesol |
News from commits: LP2092227 Hold History Pink Box <https://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=102da36cb3e7ac0c19c0b16ee6c7e9763247507f> |
16:48 |
berick |
Bmagic: did it make line items? |
16:50 |
Bmagic |
berick: yes |
16:55 |
Bmagic |
I'm starting to think it's because not all of the bibs have an item on them. And the provider MARC attribute definitions are odd. Like, for example, the barcode is found in 267$i, shelving location in 266$l, fund code in 264$b. All of the attributes are in different fields |
16:59 |
Bmagic |
oh, lol, nevermind |
17:00 |
Bmagic |
those were DB ID numbers, not MARC tags. The attributes look OK actually. And the number of bibs in the file is the same as the number of 962 tags. So the file has a one-to-one item/bib relationship. |
17:01 |
Bmagic |
The PO interface loads up the order fine. Each line has an item (1). But, of course, all* of them need "Link to Catalog" |
17:03 |
|
mmorgan1 left #evergreen |
17:04 |
berick |
huh. been a while since i looked closely at that code. kinda thought the recs had to be queued first during import. |
17:06 |
|
mantis1 left #evergreen |
17:10 |
Bmagic |
I've loaded the same file via standard vandelay (no PO) and it loads up fine |
17:11 |
Bmagic |
something about getting loaded and* POified causes the bibs to vanish |
17:11 |
Bmagic |
no errors in the logs either |
17:16 |
Bmagic |
Chopping the MARC file down to (random number) 36 records and feeding that in, it loads the bibs fine and creates the PO |
17:16 |
Bmagic |
The full file is 891 records |
17:42 |
Bmagic |
Maybe I'll try half, and then half again, and so on. OK, yeah, the first half failed the same way the whole file did; I'll split that and see where I get |