Evergreen ILS Website

IRC log for #evergreen, 2024-12-30


All times shown according to the server's local time.

Time Nick Message
07:09 collum joined #evergreen
07:57 BDorsey joined #evergreen
08:45 mmorgan joined #evergreen
08:46 dguarrac joined #evergreen
08:55 Dyrcona joined #evergreen
09:01 mmorgan1 joined #evergreen
09:10 Dyrcona Looks like the memory problem is more the OpenSRF drones than PostgreSQL. `top` says that postgres is using as much ram as a single storage drone....
09:16 Dyrcona Restarting services doesn't make much difference, either. top reports an idle storage drone using 331272KB of VIRT memory. I've seen them hit 345MB each when in use.
09:16 * Dyrcona checks a couple of other systems for comparison.
09:18 Dyrcona Memory use of a storage drone is comparable on another system running Redis (333340K virt) and Ejabberd (303464K virt).
09:19 Dyrcona So, 330MB per process is no big deal, right? Well, not until you're running 70 to 100 of them...
09:20 Dyrcona Before I restarted services, I saw search and acq drones approaching the same memory use, but they typically start out lower.
09:20 Dyrcona I was thinking of cutting the number of requests that the drones on this server are allowed to process to see if that helps, but these numbers suggest to me that it won't make that much difference.
09:21 Dyrcona As for redis crashing, I suspect, but do not know, that it is likely making an additional copy of the data as messages get passed around, thereby increasing the memory pressure.
09:22 Dyrcona BTW, beam.smp running Ejabberd on the one system is using 4224472K virt. That's 4GB!
09:23 Dyrcona Redis doesn't even show up when top -c is sorted by memory consumption when idle.
09:26 Dyrcona On the mostly unused test system Redis has VSZ of  74520K or 74MB. On the one where I just restarted it, the VSZ is 76460K.
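For the log, a minimal sketch of how those per-process numbers can be pulled without an interactive top session; the process name is just an example and column widths vary between procps versions:

    # Report VSZ and RSS (in kB) for a named process.
    ps -C redis-server -o pid,vsz,rss,comm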
09:28 Dyrcona Bmagic || berick : See my monologue above if you get this notification. If not, it's in the logs for future reference.
09:29 Dyrcona So, I'm going to rung the fine generator and keep an eye on top sorted by memory consumption.
09:32 Dyrcona s/rung/run/ but you get it.
09:37 Dyrcona OK. Going to restart the VM, first.
09:39 Dyrcona Huh. had to 'kill' simple2zoom twice. Once for each process. Guess they're not spawned in the same process group?
09:41 Dyrcona Gotta remember to start websocketd. It's not set up with systemd.
09:41 Dyrcona Think I'll do these tests on a couple of other vms to see what difference there is with memory use between Redis and Ejabberd if any.
09:42 berick Dyrcona: read it, but i'm unclear what the takeaway is
09:42 Dyrcona Well, I'm not sure there's a takeaway, yet, except that storage drones use a lot of memory regardless.
09:44 Dyrcona I'm in the process of ferreting out where the high memory use happens, and I'm logging it publicly rather than keeping private notes. It's a quiet day in IRC, so I don't think anyone will really mind much. If they do, no one has said anything yet.
09:44 berick no, i appreciate it
09:45 Dyrcona :)
09:45 berick just unsure if there was a call to action
09:45 berick in any event, i'm here if you find anything
09:46 Dyrcona berick++
09:47 Dyrcona FWIW, apache processes start out using around 147MB VIRT before I even start anything else.
09:48 Dyrcona snapd is using 1.69GB VIRT. Think that's mostly buf/cache.
09:49 Dyrcona I think I was just sorting them by RES before which is the resident set size in RAM.
09:51 Dyrcona Also, VSZ/VIRT is probably not what I really want to measure anyway. That includes memory-mapped files, which don't actually consume RAM.
09:51 Dyrcona RES is what's actually used, but it's more complicated because I'm concerned about buffers/cache which top apparently doesn't report.
09:54 Dyrcona And, shared memory with forked processes complicates things further. Those forked drones may not each actually be using all the RAM individually.
09:54 Dyrcona I need to write better tools for monitoring memory. Linux has gotten way too complicated over the past 30 years.
09:54 Dyrcona In the 2.x kernel days, it was easier to know what you're talking about.
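A minimal sketch of such a tool, assuming the drones can be matched on the command line with pgrep -f and that the kernel provides /proc/<pid>/smaps_rollup (4.14 or newer). PSS divides shared pages among all the processes mapping them, so forked drones are not double-counted the way summing RSS would:

    # Proportional set size (PSS), in kB, for each matching process.
    # Run as the opensrf user (or with sudo) so the files are readable;
    # the pattern also matches the listener, so refine it if needed.
    for pid in $(pgrep -f 'open-ils.storage'); do
        pss=$(awk '/^Pss:/ {s += $2} END {print s}' "/proc/$pid/smaps_rollup")
        echo "$pid ${pss} kB"
    done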
09:55 Dyrcona Enough chatter. I have to restart services.
10:05 Dyrcona Idle storage drones still have an RSS of 275MB. That's pretty big.
10:05 eeevil Dyrcona: the large size of open-ils.storage is due to it generating ~16k methods at startup. it's always been north of 180MB, but newer perls seem greedier
10:06 Dyrcona eeevil++
10:11 Dyrcona Running the fine generator adds a little to the storage drones' memory use. It also caused 2 additional drones to run. I have 4 storage drones going right now, and all get used at various times.
10:12 Dyrcona RSS fluctuates between 279MB and 290MB per drone.
10:13 Dyrcona Redis RSS jumped from 12.2MB to 13.5MB.
10:14 mantis1 joined #evergreen
10:14 Dyrcona The storage drone's RSS doesn't drop when the fine generator ends.
10:15 Dyrcona Perl--
10:16 Dyrcona RSS of redis dropped to 12.7MB after the fine generator completed.
10:17 * Dyrcona tries the hold_targeter
10:17 Dyrcona hold_targeter.pl --osrf-config /openils/conf/opensrf_core.xml --soft-retarget-interval='1 hour' # for the logs.
10:18 Dyrcona Redis RSS is back up to 14.6MB and circ drones are hitting around 100MB.
10:19 Dyrcona BTW, I'm mostly trying to chase down Bmagic's hypothesis that Redis is causing an increase in RAM usage.
10:22 Dyrcona buffers/cache has jumped to 17Gi.
10:23 Dyrcona And, 22Gi now. The hold targeter has an effect on something.
10:25 Dyrcona There's so much nonsense on stackoverflow. I can't use that site any more....
10:25 berick Dyrcona: redis-server is using 22G?
10:25 berick oh, no, i misread that i think
10:26 Dyrcona No, that's total buffers/cache.
10:27 Dyrcona And, it's up to 24G, now. I'm trying to figure what's in there, Pg? drones?
10:27 Dyrcona Redsi is only using 14.4MB RSS and 80.6MB VSZ.
10:28 Dyrcona Redsi? :)
10:30 Dyrcona I'm starting to think that redis-server is not the problem, but it may be more efficient than Ejabberd leading to more RAM use elsewhere.
10:34 Dyrcona Monitoring by CPU usage, I see 7 or 8 Pg processes going, using about 30-44MB each.
10:34 berick fwiw, my hold targeters also get bloated.  i have open-ils.hold-targeter set to <max_requests>10</max_requests> on my utility server from way back
10:35 Dyrcona berick: I might try that, but only 2 hold targeter drones are running at around 100MB each.
10:37 * Dyrcona goes to look up how to find out what's using buffer/cache from a reliable source. I see so many contradictory answers on stackoverflow, all accepted.
10:40 Dyrcona rhel++ # The documentation anyway.
10:41 Dyrcona And, it's out of date... rhel--
10:42 Dyrcona oh, never mind. It hasn't been set, the /proc file doesn't exist.... rhel++ :)
10:45 Dyrcona All right maybe the documentation is out of date... or good for RHEL only.... When I try to set the kernel pagecache percentage, my system complains.
10:48 Dyrcona OK. The setting I want is vm.vfs_cache_pressure in sysctl.
10:48 Bmagic Once I moved a certain group of action triggers over to a different utility server, neither the old or the new utility server has failed to process the cron jobs. They've been running for 3+ weeks without restarting services.
10:48 Dyrcona Bmagic: We're having these issues with action triggers disabled.
10:48 Dyrcona No cron jobs, just me running stuff manually.
10:49 Dyrcona I'm going to bump vm.vfs_cache_pressure to 150 to see if that helps.
10:49 Dyrcona before I do that. I'm going to run the thing that I think crashed it last Friday.
10:50 Bmagic So, I think I can conclude that the issue appears to show itself when enough OpenSRF-related things are running at the same time, or within a short period of time perhaps. Dyrcona: it's actually nice to know that you can reliably cause a problem. That's where I got stuck; even when I kicked off the job by hand, it wouldn't fail
10:52 Dyrcona Well, I'm going to try again. It's our big dump of "everything" that goes to Aspen. I think that was running Friday when there was a crash, after I had run the fine generator and hold targeter.
10:53 Dyrcona I suspect that I just need to tell the kernel to clear the cache buffers more often.
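A sketch of the knob being discussed, using the standard sysctl interface; the default for vm.vfs_cache_pressure is 100, and larger values make the kernel reclaim the dentry/inode caches more aggressively (150 is just the value mentioned above, not a recommendation):

    # Raise reclaim pressure on the dentry/inode caches at runtime.
    sudo sysctl -w vm.vfs_cache_pressure=150
    # Confirm the current value.
    sysctl vm.vfs_cache_pressure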
10:56 Dyrcona Well, something crashed, but not redis.
10:58 Dyrcona postgres
10:58 Dyrcona Not the main one, just the child/children running the query.
11:00 Dyrcona I'll try again with the vfs_cache_pressure at 150.
11:00 sandbergja joined #evergreen
11:01 Dyrcona Oh, right. We use marc_export for part of it.
11:03 Dyrcona Nope. Same result. I'm pretty sure that I don't have to restart for this setting to have an effect.
11:03 Dyrcona I'll increase the cache pressure to 200. I think that's the maximum.
11:05 Dyrcona An "idle" postgresql process is using 20% of CPU....
11:06 Dyrcona Nope. Looks like I might have to manually clear the cache.
11:07 jeff in what way did the postgres child crash?
11:08 Dyrcona No memory for an allocation.
11:08 Dyrcona Yes, I may have to cleanup the temp area on disk.
11:10 Dyrcona I'll worry about that once I get the memory use sorted.
11:13 Dyrcona sudo bash -c "echo 1 > /proc/sys/vm/drop_caches" # Works great!
11:13 Dyrcona From 22Gi to 343Mi. I could clear the dentries and inodes, too, but meh.
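For reference, the other drop_caches values mentioned work the same way: 2 drops the dentry and inode caches, 3 drops those plus the page cache. Only clean cache is dropped, so this is safe, but anything dropped has to be re-read from disk later:

    sync    # flush dirty pages first so more cache can be reclaimed
    sudo bash -c "echo 2 > /proc/sys/vm/drop_caches"   # dentries and inodes only
    sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"   # page cache plus dentries/inodes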
11:13 Dyrcona Now, let's see if the process runs.
11:24 Dyrcona Nope. Still crashed.
11:26 Dyrcona It was using only 13G of buffers/cache
11:26 jeff what does the log entry look like? have you shared it above already? I looked, but didn't see it in a quick scan.
11:26 Dyrcona Dec 30 11:16:30 cwmars-dev kernel: __vm_enough_memory: pid: 5321, comm: postgres, not enough memory for the allocation
11:27 jeff what does the postgresql log entry look like?
11:29 Dyrcona 2024-12-30 11:16:30.615 EST [5321] evergreen@evergreen ERROR:  out of memory
11:29 Dyrcona 2024-12-30 11:16:30.615 EST [5321] evergreen@evergreen DETAIL:  Failed on request of size 8192 in memory context "ErrorContext".
11:29 Dyrcona It may not have actually crashed, but the process is gone.
11:30 Dyrcona There are two additional lines about a cursor not existing and failing to fetch from the cursor, but I think the above are more relevant.
11:35 Dyrcona Looks like Pg cleaned up any temporary files.
11:40 Dyrcona At any rate, redis-server is not gobbling up lots of RAM.
11:41 Dyrcona I blame Perl and I blame Postgres.
11:43 Dyrcona Doing anything with the database cranks up the buffers/cache.
11:44 jeff expected, since you're reading from disk and the kernel is caching that read data.
11:45 Dyrcona Yes.
11:47 Dyrcona My testing points the finger away from redis and more towards running everything on 1 machine.
11:47 jeff how much physical memory and how much swap does this machine/vm/whatnot have?
11:47 jeff can you share the output of this query somewhere? SELECT name, setting, unit, source FROM pg_settings WHERE name ~ 'mem';
11:48 Dyrcona It says 30G of memory, no swap. (I've considered adding swap, but haven't bothered because it would only delay the inevitable.)
11:49 Dyrcona Everything is default except shared mem type = posix
11:50 Dyrcona I cut shared_buffers from the default (128MB) down to 64MB last week.
11:50 jeff that's incredibly tiny. that might be your issue right there.
11:51 jeff but I'm not sure you'd hit out of memory issues, more likely "terrible performance".
11:51 jeff but since the planner takes shared buffers into consideration, you might be getting a worst-possible plan that's contributing.
11:51 Dyrcona Right, and performance did not change much. I think we just need a larger VM.
11:52 Bmagic 30GB may not be enough memory to simulate production, considering the DB size on disk is more than 14x physical memory
11:52 jeff I thought that postgresql was a lot more verbose in these kinds of out of memory errors; the format of the message claiming it's out of memory in the ErrorContext context almost sounds like it didn't have enough memory to construct the more detailed out of memory error.
11:52 Dyrcona Yeah, that's what I think happened.
11:53 Dyrcona Bmagic: Yes. I don't remember what the previous server was set up as.
11:53 Bmagic same
11:53 Dyrcona We can worry about that next week.
11:53 jeff any chance of getting the output of this, sanitized of any sensitive data if any? SELECT name, setting, unit, source FROM pg_settings WHERE source <> 'default';
11:53 Dyrcona I think I'll start most of the cron jobs and see what happens.
11:55 Dyrcona None of it looks sensitive. Just a sec.
11:56 jeff have you already tried with a shared_buffers value closer to the conventional starting point of 25% to 40% of physical memory?
11:59 Dyrcona jeff: https://drive.google.com/file/d/1dJ5wsVWyEt0DL9hIwaMq34xaTrVeh73a/view?usp=sharing
11:59 Dyrcona Well, I think that would cause the database to use more RAM and starve the other processes even faster.
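Should the suggestion get tried later anyway, one hedged way to do it without guessing at config file paths is ALTER SYSTEM; the 8GB figure is simply about 25% of this 30G VM, and the change only takes effect after a PostgreSQL restart:

    # Raise shared_buffers toward the conventional 25% of RAM, then restart.
    sudo -u postgres psql -c "ALTER SYSTEM SET shared_buffers = '8GB';"
    sudo systemctl restart postgresql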
12:08 kmlussier joined #evergreen
12:37 sandbergja Dyrcona: purely out of curiosity, what is `cat /proc/sys/vm/overcommit_memory`?
12:38 Dyrcona sandbergja: 2
12:39 * Dyrcona has to look that setting up again to remember what it does exactly. I have tweaked it in the past on other servers.
12:40 Dyrcona Maybe I should set it to 0 to allow heuristic overcommit? Pretty sure that I don't want 1.
12:40 sandbergja I came across some redis documentation saying that redis prefers 1?  Let me see if I can dig it up
12:41 sandbergja This thing: https://redis.io/docs/latest/develop/get-started/faq/#background-saving-fails-with-a-fork-error-on-linux
12:42 Dyrcona So, I'm considering adding a small swap file, not more than 8GB, and taking jeff's suggestion and changing shared_buffers in Pg (requires a Pg restart).
12:43 Dyrcona sandbergja: That's interesting. I'll see if I'm getting that error.
12:44 sandbergja I think that 0 or 1 would make the "not enough memory for the allocation" errors less likely, at the tradeoff of being more likely to hit the oom_killer?
12:44 berick w/ our documented setup, redis disk saving should not be happening
12:46 Dyrcona I got this "Dec 27 15:52:16 cwmars-dev systemd[1]: redis-server.service: Failed to fork: Cannot allocate memory" about the time of the crash on Friday.
12:48 Dyrcona I can certainly try changing the overcommit memory setting. I know I've done it before (for Postgres?), but I don't remember exactly when or why, and I probably don't have good notes. (I should go back to keeping an admin notebook/journal.)
12:49 Dyrcona Too bad -r doesn't work with zgrep.
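A common workaround, since zgrep has no recursive option of its own; the path and pattern here are only placeholders:

    # Let find do the recursion and zgrep handle the compressed files.
    find /var/log -name '*.gz' -exec zgrep -H 'Cannot allocate memory' {} +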
12:51 collum I got curious about the different settings - https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
12:51 Dyrcona collum++ I was just looking at that document.
12:55 sandbergja joined #evergreen
12:55 sandbergja berick++ thanks, gotcha!
12:57 sandbergja haha I like this phrase from the documentation collum shared: "a seriously wild allocation"
12:59 Dyrcona "A wild allocation appears."
12:59 sandbergja haha
12:59 berick i can't say 100% that disabling persistence means it won't ever fork, but if Redis is using ~15MB of ram and it can't fork, then you're just out of ram.
13:01 Dyrcona Yes, that's what I think. I'm just out of RAM. Today it has been postgres. Friday it was redis.
13:01 berick Dyrcona: "it was Redis" -- redis crashed or it ate the ram?
13:02 Dyrcona Redis crashed. I don't think redis is eating ram. If it were, I think I would have seen it while running the scheduled jobs. I think it's Perl and Pg eating the ram.
13:03 * berick nods
13:07 Dyrcona right... add a swap file.... been a while...
13:11 troy joined #evergreen
13:13 Dyrcona So, 6GB of swap might be OK.
13:20 Dyrcona As much as I like running systems without swap, it's usually necessary to have some.
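A sketch of the usual way to add such a swap file; the size and path are illustrative, and fallocate may need to be replaced with dd on some filesystems:

    sudo fallocate -l 6G /swapfile   # or: sudo dd if=/dev/zero of=/swapfile bs=1M count=6144
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    # Optional: persist the swap file across reboots.
    echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab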
13:23 Dyrcona I think I'll clear the cache again and set overcommit_memory to 0 for heuristic overcommit, then run some scheduled jobs, maybe even autorenew, and see what happens.
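The runtime change being described, per the overcommit-accounting document linked above (0 = heuristic overcommit, 1 = always overcommit, 2 = strict accounting):

    # Switch to heuristic overcommit without a reboot.
    sudo sysctl -w vm.overcommit_memory=0
    cat /proc/sys/vm/overcommit_memory   # verify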
13:44 Dyrcona I'm not the only one using the dev machine right now. Load hit 15 while I'm running the hold targeter. I know at least one person is looking at circulation settings.
13:45 Dyrcona Memory looks good after setting overcommit_memory.
13:45 Dyrcona Not using any swap yet, either.
13:46 Dyrcona redis never seems to go much over 14.5MB.
14:03 kmlussier1 joined #evergreen
14:03 kmlussier1 left #evergreen
14:10 Dyrcona I think things will be OK as long as I don't run the aspen stuff.
15:11 Dyrcona berick: I see some pcrud drones using as much or more memory than storage. I may have to limit the number of requests that they're allowed to process. I'll consider that depending on how things go over the next few days.
15:12 Dyrcona I'm running our Daily-PD-2 granularity action trigger, which is essentially a combined autorenew and 2-day courtesy notice. RAM use is getting tight.
15:14 Dyrcona No memory errors so far, though.
15:24 berick hm, haven't seen that w/ pcrud.
15:26 berick pcrud's using 1/4 of what storage uses pretty consistently here
15:26 berick w/ 2000 max_requests for both
15:32 Dyrcona Hmm. It's going to be difficult for me to confirm that right now. They're not showing up in the top memory users in top right now.
15:32 Dyrcona I'll have to double check later.
15:34 Dyrcona It's still chugging away on autorenewal with 280MB of free RAM and 0.8MB of swap used.
15:34 Dyrcona Correction. It ticked up to 7.2MB of swap.
15:35 Dyrcona And, autorenew just finished and things look good!
15:35 Dyrcona I suspect the overcommit_memory setting helped more than anything else I've done. sandbergja++
15:41 Dyrcona berick: top -c output sorted by most to least memory usage. These are the top 10 processes for memory use on the machine right now: https://drive.google.com/file/d/1OBh7QyMsBaCR6auiiBfJtR-Ak0H5mzJW/view?usp=sharing
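A non-interactive way to capture a comparable snapshot, in case anyone wants to reproduce this; the columns are standard procps options:

    # Top ten processes by resident memory, plus the header line.
    ps -eo pid,rss,vsz,etime,comm --sort=-rss | head -n 11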
15:46 berick interesting they're using more than storage, but the amounts used aren't crazy.
15:46 Dyrcona For certain definitions of crazy. :)
15:46 Dyrcona No, I mostly agree.
15:47 Dyrcona Looks like they've been actually doing more stuff based on the run time.
15:49 Dyrcona I plan to set shared buffers back to the default in a bit and restart Pg. I'll run the same tests again tomorrow. If that's all good, I'll enable all but our Aspen scheduled jobs on Thursday.
15:52 Dyrcona Looking at the overcommit documentation, adding the swap may have helped in ways other than just being an overflow space. It looks like swap amount is used in the overcommit calculations for the heuristic and don't overcommit methods.
15:54 Dyrcona I'm curious to look at that part of the kernel code, but not so curious that I'm going to do it right now.
16:00 Dyrcona Without my doing anything, buffers/cache dropped by half just now. I think the cache pressure setting is also helping.
16:01 Dyrcona Guess I'll wait to restart. Looks like someone is doing something. I see search and actor and pcrud drones becoming active.
16:14 Bmagic anyone know off-hand any tips for loading POs via MARC resulting in only the PO and none of the bib records being inserted into biblio.record_entry? Despite having "import non-matching records" checked? The resulting bib queue also has zero records loaded in the queue. Nothing I can (re)import there either.
16:25 Dyrcona Oof. Dunno about that one.
16:27 Dyrcona I will say: if these settings work tomorrow, I'll make the sysctl settings permanent. Then I'll bug Bmagic about how to add them to the docker/ansible setup.
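A sketch of making the two sysctl changes from today persistent; the file name under /etc/sysctl.d is arbitrary, and the values are simply the ones tried above:

    printf 'vm.overcommit_memory = 0\nvm.vfs_cache_pressure = 150\n' \
        | sudo tee /etc/sysctl.d/90-evergreen-memory.conf
    sudo sysctl --system   # reload every sysctl configuration file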
16:28 * Dyrcona signs out.
16:36 pinesol News from commits: LP2092227 Hold History Pink Box <https://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=102da36cb3e7ac0c19c0b16ee6c7e9763247507f>
16:48 berick Bmagic: did it make line items?
16:50 Bmagic berick: yes
16:55 Bmagic I'm starting to think it's because not all of the bibs have an item on them. And the provider MARC attribute definitions are odd. Like, for example, the barcode is found in 267$i, shelving location in 266$l, fund code in 264$b. All of the attributes are in different fields
16:59 Bmagic oh, lol, nevermind
17:00 Bmagic those were DB ID numbers, not MARC tags. The attributes look OK actually. And the number of bibs in the file is the same as the number of 962 tags. So the file has a one-to-one item/bib relationship.
17:01 Bmagic The PO interface loads up the order fine. Each line has an item (1). But, of course, all* of them need "Link to Catalog"
17:03 mmorgan1 left #evergreen
17:04 berick huh.  been a while since i looked closely at that code.  kinda thought the recs had to be queued first during import.
17:06 mantis1 left #evergreen
17:10 Bmagic I've loaded the same file via standard vandelay (no PO) and it loads up fine
17:11 Bmagic something about getting loaded and* POified causes the bibs to vanish
17:11 Bmagic no errors in the logs either
17:16 Bmagic Chopping the MARC file down to (random number) 36 records and feeding that in, it loads the bibs fine and creates the PO
17:16 Bmagic The full file is 891 records
17:42 Bmagic Maybe I'll try half, and then half again, and so on. OK, yeah, the first half failed the same way the whole file did; I'll split that and see where I get
