Evergreen ILS Website

IRC log for #evergreen, 2024-12-20

| Channels | #evergreen index | Today | | Search | Google Search | Plain-Text | summary | Join Webchat

All times shown according to the server's local time.

Time Nick Message
07:20 kworstell-isl joined #evergreen
08:01 BDorsey joined #evergreen
09:01 dguarrac joined #evergreen
09:04 Dyrcona joined #evergreen
09:05 Dyrcona Hm. Had issues with a couple of overnight jobs on the dev system, but the router is still running.
09:06 Dyrcona No problems with Autorenew on the other test vm, either.
09:07 Dyrcona Got this in the email from the badge_score_generator: [auth] WRONGPASS invalid username-password pair,  at /usr/share/perl5/Redis.pm line 311.
09:08 Dyrcona Haven't seen that one before. All the services are running OK.
09:09 Dyrcona refresh carousels reported "Unable to bootstrap client for requests." When I saw that I thought it might have been because of a crash.
09:13 JBoyer Is opensrf's .srfsh.xml setup correctly for redis there? I think both of those are run from .srfsh scripts so if that's got the wrong password or router location those seems like the errors I would expect
09:13 Dyrcona Looks like something's up with services on the training/dev system, though. I'm getting an internal server error when I try to use the staff client.
09:13 Dyrcona JBoyer: I think so, I'll double check.
09:15 Dyrcona I don't like the way the opensrf_core.xml gets commented by the ansible/docker set up. It's confusing. "--><password>...</password><--" as an example
09:16 berick it's def a stop-gap
09:16 Dyrcona Yeah, the opensrf user password looks right.
09:17 Dyrcona I'm going to try the angular staff client and opac, and if they both give internal server errors, I'll restart services.
09:18 Dyrcona OPAC gives me an internal server error. eg2 just white screens.
09:19 Dyrcona Oh, and I did restart Apache after I first got the internal server errors.
09:20 berick Dyrcona: are both router instances running?
09:20 Dyrcona berick: --diagnostic said everything was running.
09:21 Dyrcona I just restarted services and apache, and I can login.
09:22 Dyrcona * router                   [1812] uptime=21:31:28    cputime=00:00:04
09:22 Dyrcona * router                   [1813] uptime=21:31:28    cputime=00:01:52
09:22 Dyrcona That's from before.
09:23 * berick nods
09:23 csharp_ routers be routin'
09:32 Dyrcona Nothing in the system logs for opensrf_router except for the segfault yesterday morning.
09:33 Dyrcona Redis errors in /openils/var/log/router-public.log
09:35 Dyrcona Is opensrf.settings supposed to be registered on the private domain?
09:35 berick yes
09:35 Dyrcona the private router log says it isn't, but that could be from before the restart.
09:38 Dyrcona Hm... It's tough to say, but I think that's from just before I restarted services.
09:39 Dyrcona Does anything log service start/stop/restart?
09:40 Dyrcona Ugh. It correspnds with the second of the restart.
09:40 Dyrcona I suppose I can restart again and see if it comes up again.
09:41 Dyrcona If I could stop making typos for just 5 minutes!
09:42 Dyrcona OK. Looks like the not registered errors are from before the previous restart. They don't repeat with this one.
09:43 Dyrcona router 2024-12-20 09:18:38 [ERR :1813:transport_connection.c:255:] REDIS Error [] BLPOP opensrf:router:router:private.localhost 5
09:43 Dyrcona router 2024-12-20 09:18:38 [ERR :1813:transport_connection.c:263:] Resting for a few seconds after bus failure...
09:44 Dyrcona That repeats over and over again in the private router log since about midnight.
09:44 Dyrcona Same thing in the public router log.
09:45 Dyrcona Modulo the process id and public versus private.
09:46 Dyrcona berick: Should I open a bug on Lp?
09:47 mmorgan joined #evergreen
09:47 berick it's not much to go on :\  any other nearby errors or in the redis logs?
09:48 berick good to verify redis didn't restart for some reason as well
09:51 Dyrcona Well, is the number after the file name, the line number? That might be something to start with.
09:52 berick yeah we know where BLPOP happens, but unfortunately that's not much more useful than "something happened while talking to redis"
09:53 Dyrcona All right. I'm looking. I go to journalctl first these days, but I haven't found much there. A bunch of apache stuff about missing files and ssl.
09:54 berick thanks for digging
09:54 csharp_ ok - the "5" is the timeout
09:54 csharp_ len = snprintf(command_buf, 256, "BLPOP %s %d", recipient, timeout);
09:55 Dyrcona Today's redis-server.log is 0 bytes. Yesterday's is compressed. I'll have a look at it.
09:56 Dyrcona 593:M 19 Dec 2024 16:41:51.794 # Guru Meditation: Redis aborting for OUT OF MEMORY. Allocating 32774 bytes! #server.c:5206
09:56 Dyrcona 593:M 19 Dec 2024 16:41:51.794 # (forcing SIGSEGV in order to print the stack trace)
09:57 berick interesting
09:57 Dyrcona There's a backtrace for Redis.
09:59 berick based on the VM, how much memory do you think it had to consume to reach that point?
09:59 Dyrcona Don't think much was going on then: fine generator and/or hold_targeter maybe. I don't know right now if anyone was using the server.
09:59 Dyrcona The server has about 30GB.
09:59 berick i guess the other question is what actually ate the memory
10:00 Dyrcona Free says 29Gi and no swap.
10:00 Dyrcona Right now 24Gi is available with 23Gi buffered/cached.
10:00 Dyrcona 4.2GI used and 2.0gi "free"
10:01 Dyrcona Looks like Redis restarted right after that, and that's where the log ends.
10:04 Dyrcona Someone was using acq about that time. This is 3.14, so the new Angular acq.
10:08 Dyrcona I've emailed one of our internal lists to find out if whoever was using acq had any issues and could report anything.
10:08 csharp_ @blame acq
10:08 pinesol csharp_: acq stole bradl's tux doll!
10:08 csharp_ gotta remove that blame
10:08 Dyrcona IIRC, csharp_ you have the tux doll in question?
10:09 csharp_ I do, in my office
10:09 csharp_ I'll share a pic sometime :-)
10:10 csharp_ also I'll share a pic of the unicorn on a noose that used to hang on the "previous ILS" $1M+ Sun server
10:10 Dyrcona I hesitate to check yesterday's osrfsys.log. It's 202MB compressed.
10:12 berick Dyrcona: since Bmagic was also having issues on a utility server, I'm leaning toward it being a utility process that's causing the issue (hold targeter, fine generator, etc.).
10:15 berick if a backend was sending replies faster than the caller could collect them, it would cause messages to pile up in Redis.  but I'm having trouble imagining where ~30G of such content would be coming from.
10:15 berick hm unless the caller just disconnected
10:16 Dyrcona OK. I suspect something like that, too.
10:16 berick but even then that's a lot
10:16 berick maybe over time though
10:16 Dyrcona No lock files from that time hanging around.
10:17 Dyrcona The badge score generator lock file is there from 4:30 am, though.
10:18 Dyrcona At 16:41, it would most likely be the fine generator.
10:18 berick Dyrcona: how many days had it been running OK?
10:19 Dyrcona I don't actually know. I was out of the office for a week, and didn't get back until yesterday.
10:21 Dyrcona I did recently switch the fine generator to run at 30 minutes past the hour because we were getting regular emails from the monitor about the number of processes running when it and the hold targeter started at the same time.
10:22 Dyrcona We're running password reset every minute, and the pending a/t running every half hour.
10:23 redavis joined #evergreen
10:23 Dyrcona We have a SQL sending data to Bywater for our test Aspen installation every 5 minutes. That could be using a lot of RAM.....
10:25 Dyrcona The person testing acq yesterday said that they did experience Evergreen stop working at that time. They also encountered Lp 2086786. I don't think it's related.
10:25 pinesol Launchpad bug 2086786 in Evergreen 3.13 "Acq: Multi-Branch Libraries Can't See All Their Branch Funds in Invoices" [High,Fix committed] https://launchpad.net/bugs/2086786
10:26 Dyrcona Someone else mentioned Evergreen "crashing" at that time.
10:29 berick Dyrcona: mind checking periodically to see if Redis memory usage is slowly climbing?  if that's the case, I have a script that will help
10:29 Dyrcona OK. I'll do that.
10:29 berick thanks
10:31 Dyrcona Right now, I have to figure out where reports should be and probably make a link. I'm also told reports aren't working. I think they're working and the output is going somewhere other than /openils/var/web/reporter.
10:35 Dyrcona And, that was easy.
10:38 Dyrcona Hm... it's 85MB of VM, rss is 13MB at the moment.
10:41 Dyrcona FWIW, I'm monitoring it this way: ps -o etime,time,rss,size,vsize 15997 [that's the actual PID]
10:45 Dyrcona I have not had it crash on the other test server, but I don't usually run cron, and it's almost never used.
10:46 Dyrcona It should be quiet next week. I'll try recompiling OpenSRF and Evergreen with debug options turned on. That might help, particularly if I can figure out how to reproduce a crash.
10:57 Dyrcona Think I'll do that on my other test vm, too, if I haven't already. I'll do that after I test the 3.12.10 tarball.
11:03 sandbergja joined #evergreen
11:19 Christineb joined #evergreen
12:24 kworstell-isl joined #evergreen
12:24 redavis Dyrcona++ mjdriscoll++
12:59 mmorgan Dyrcona++ mjdriscoll++
13:17 Dyrcona storage and hold_targeter services crashed at some point. I missed catching it live.
13:18 Dyrcona storage won't run again....
13:21 Dyrcona today's redis-server.log is still empty. I wonder if it is a case of redis needs to be restarted when the log is turned over?
13:22 berick no, restarts aren't necessary
13:23 Dyrcona ERR open-ils.storage Has PID file entry [61045], which matches no running open-ils.storage processes
13:23 Dyrcona That's after attempting to stop and then start storage again.
13:23 Dyrcona I think Redis or the router is hosed again.
13:24 Dyrcona diagnostic shows everything but storage and hold-targeter running.
13:25 Dyrcona storage is connecting to the database.
13:26 Dyrcona storage std_err log has not changed since yesterday, so whatever is happening is not getting logged.
13:27 Dyrcona Dec 20 13:25:34 cwmars-dev kernel: __vm_enough_memory: pid: 61502, comm: gce_workload_ce, not enough memory for the allocation
13:27 Dyrcona Free says I have plenty available.
13:29 Dyrcona Does redis spawn processes? I only see 1 when I pgrep -af redis
13:29 Dyrcona and it is still only using 85MB.
13:29 berick no, just the one process
13:30 Dyrcona OK. I just restarted redis-server, too.
13:32 Dyrcona Non-evergreen processes are saying that they cannot allocate memory, or were until a couple of minutes ago. See what I pasted above. Looks like an OpenSRF Listener could not allocate RAM at 11:30 am.
13:33 Dyrcona Dec 20 11:30:05 cwmars-dev kernel: __vm_enough_memory: pid: 50517, comm: OpenSRF Listene, not enough memory for the allocation
13:33 Dyrcona Dec 20 12:00:02 cwmars-dev kernel: __vm_enough_memory: pid: 50598, comm: OpenSRF Listene, not enough memory for the allocation
13:33 Dyrcona Cut off, so I don't know which is which.
13:36 Dyrcona Well, there's this: -rw-rw-r-- 1 opensrf opensrf    5 Dec 20 11:30 generate_fines-LOCK
13:37 berick can you tell from dmesg or similar what proc had the initial OOM issue?
13:38 Dyrcona [18852.962748] __vm_enough_memory: pid: 15109, comm: postgres, not enough memory
13:38 Dyrcona for the allocation
13:38 Dyrcona That's the earliest from dmesg. There's no timestamp, but should be since the reboot this morning.
13:39 Dyrcona I'm going to disable the incremental updates for Aspen.
13:53 * Dyrcona wonders if some kernel memory tuning would help, like changing cache/buffer settings?
14:11 Dyrcona And, it's happening again. Just got an email that no children are available for open-ils.trigger. Went to run osrf_control --digagnostic and that segfaulted.
14:11 Dyrcona free says 1.6Gi free, but nothing will run.
14:12 Dyrcona redis-serve pid changed.
14:13 Dyrcona 2.5 minutes ago it changed, and is not using much memory.
14:15 Dyrcona Can I blame google_guest_agent using 2.2GB of virtual memory? or websocketd using 1GB?
14:16 Dyrcona For resident memory, several storage drones are using 280MB or more.
14:16 Dyrcona with virt over 300
14:17 Dyrcona I thought postgres would be up there, too, but it's not.
14:18 Dyrcona My conclusion: this vm just doesn't have enough RAM.
14:18 csharp_ Dyrcona: that's what it sounds like to me
14:18 Dyrcona redis-server pid changed again.
14:19 Dyrcona I'm going to stop the cron jobs.
14:19 Dyrcona reboot it.
14:20 Dyrcona Clark died along the way. Probably when reporter-store died.
14:22 Dyrcona ran --stop-all and some drones are still hanging out.
14:22 Dyrcona What's that option? --kill-with-fire?
14:23 Dyrcona I bet these drones are hanging around from when the listeners crashed earlier.
14:31 Dyrcona Don't think I'll start SIPServer or simple2zoom, either.
14:38 Dyrcona Yeah, the top 4 users of virt memory are all related to GCP and docker.
14:38 Dyrcona Should this thing be running docker?
14:40 * Dyrcona has more reason to not like docker.
16:08 pinesol News from commits: Forward port 3.13.6 -> 3.13.7 ugrade script <https://git.evergreen-ils.org/?p=E​vergreen.git;a=commitdiff;h=e77e94​dfc25c5da2f32a072d9316334e350adbd4>
16:38 pinesol News from commits: Forward-port 3.12.9-3.12.10 upgrade script <https://git.evergreen-ils.org/?p=E​vergreen.git;a=commitdiff;h=467133​d1de7778f9ed46ed05d70d707a75c78fac>
17:11 mmorgan left #evergreen

| Channels | #evergreen index | Today | | Search | Google Search | Plain-Text | summary | Join Webchat