07:20 kworstell-isl joined #evergreen
08:01 BDorsey joined #evergreen
09:01 dguarrac joined #evergreen
09:04 Dyrcona joined #evergreen
09:05 <Dyrcona> Hm. Had issues with a couple of overnight jobs on the dev system, but the router is still running.
09:06 <Dyrcona> No problems with Autorenew on the other test vm, either.
09:07 <Dyrcona> Got this in the email from the badge_score_generator: [auth] WRONGPASS invalid username-password pair, at /usr/share/perl5/Redis.pm line 311.
09:08 <Dyrcona> Haven't seen that one before. All the services are running OK.
09:09 <Dyrcona> refresh carousels reported "Unable to bootstrap client for requests." When I saw that I thought it might have been because of a crash.
09:13 <JBoyer> Is opensrf's .srfsh.xml set up correctly for redis there? I think both of those are run from .srfsh scripts, so if that's got the wrong password or router location, those seem like the errors I would expect
09:13 <Dyrcona> Looks like something's up with services on the training/dev system, though. I'm getting an internal server error when I try to use the staff client.
09:13 <Dyrcona> JBoyer: I think so, I'll double check.
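[Note: the bus credentials can also be tested directly, outside of srfsh. A minimal sketch, assuming the username and password copied from .srfsh.xml / opensrf_core.xml -- the "opensrf" account name below is only a placeholder:
    redis-cli --user opensrf --pass 'password-from-config' PING
A PONG reply means the credentials work; a WRONGPASS reply would match the badge_score_generator error above.]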
09:15 <Dyrcona> I don't like the way the opensrf_core.xml gets commented by the ansible/docker set up. It's confusing. "--><password>...</password><--" as an example
09:16 <berick> it's def a stop-gap
09:16 <Dyrcona> Yeah, the opensrf user password looks right.
09:17 <Dyrcona> I'm going to try the angular staff client and opac, and if they both give internal server errors, I'll restart services.
09:18 <Dyrcona> OPAC gives me an internal server error. eg2 just white screens.
09:19 <Dyrcona> Oh, and I did restart Apache after I first got the internal server errors.
09:20 <berick> Dyrcona: are both router instances running?
09:20 <Dyrcona> berick: --diagnostic said everything was running.
09:21 <Dyrcona> I just restarted services and apache, and I can login.
09:22 <Dyrcona> * router [1812] uptime=21:31:28 cputime=00:00:04
09:22 <Dyrcona> * router [1813] uptime=21:31:28 cputime=00:01:52
09:22 <Dyrcona> That's from before.
09:23 * berick nods
09:23 <csharp_> routers be routin'
09:32 <Dyrcona> Nothing in the system logs for opensrf_router except for the segfault yesterday morning.
09:33 <Dyrcona> Redis errors in /openils/var/log/router-public.log
09:35 <Dyrcona> Is opensrf.settings supposed to be registered on the private domain?
09:35 <berick> yes
09:35 <Dyrcona> The private router log says it isn't, but that could be from before the restart.
09:38 <Dyrcona> Hm... It's tough to say, but I think that's from just before I restarted services.
09:39 <Dyrcona> Does anything log service start/stop/restart?
09:40 <Dyrcona> Ugh. It corresponds with the second of the restart.
09:40 <Dyrcona> I suppose I can restart again and see if it comes up again.
09:41 <Dyrcona> If I could stop making typos for just 5 minutes!
09:42 <Dyrcona> OK. Looks like the "not registered" errors are from before the previous restart. They don't repeat with this one.
09:43 <Dyrcona> router 2024-12-20 09:18:38 [ERR :1813:transport_connection.c:255:] REDIS Error [] BLPOP opensrf:router:router:private.localhost 5
09:43 <Dyrcona> router 2024-12-20 09:18:38 [ERR :1813:transport_connection.c:263:] Resting for a few seconds after bus failure...
09:44 <Dyrcona> That repeats over and over again in the private router log since about midnight.
09:44 <Dyrcona> Same thing in the public router log.
09:45 <Dyrcona> Modulo the process id and public versus private.
09:46 <Dyrcona> berick: Should I open a bug on Lp?
09:47 mmorgan joined #evergreen
09:47 <berick> it's not much to go on :\ any other nearby errors or in the redis logs?
09:48 <berick> good to verify redis didn't restart for some reason as well
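[Note: a minimal way to check for a Redis restart, assuming the distribution's systemd unit is named redis-server (adjust if the unit name differs):
    systemctl status redis-server              # current PID and "Active: ... since" start time
    journalctl -u redis-server --since today   # any stops/starts or crashes logged today]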
09:51 <Dyrcona> Well, is the number after the file name the line number? That might be something to start with.
09:52 <berick> yeah we know where BLPOP happens, but unfortunately that's not much more useful than "something happened while talking to redis"
09:53 <Dyrcona> All right. I'm looking. I go to journalctl first these days, but I haven't found much there. A bunch of apache stuff about missing files and ssl.
09:54 <berick> thanks for digging
09:54 <csharp_> ok - the "5" is the timeout
09:54 <csharp_> len = snprintf(command_buf, 256, "BLPOP %s %d", recipient, timeout);
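[Note: BLPOP blocks for up to the given number of seconds and simply returns nil on timeout, so the 5-second wait itself is not the failure; it is the error reply that the router logs as a bus failure. The same call can be issued by hand against the queue named in the router error above (add credentials as needed):
    redis-cli BLPOP opensrf:router:router:private.localhost 5]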
09:55 <Dyrcona> Today's redis-server.log is 0 bytes. Yesterday's is compressed. I'll have a look at it.
09:56 <Dyrcona> 593:M 19 Dec 2024 16:41:51.794 # Guru Meditation: Redis aborting for OUT OF MEMORY. Allocating 32774 bytes! #server.c:5206
09:56 <Dyrcona> 593:M 19 Dec 2024 16:41:51.794 # (forcing SIGSEGV in order to print the stack trace)
09:57 <berick> interesting
09:57 <Dyrcona> There's a backtrace for Redis.
09:59 <berick> based on the VM, how much memory do you think it had to consume to reach that point?
09:59 <Dyrcona> Don't think much was going on then: fine generator and/or hold_targeter maybe. I don't know right now if anyone was using the server.
09:59 <Dyrcona> The server has about 30GB.
09:59 <berick> i guess the other question is what actually ate the memory
10:00 <Dyrcona> Free says 29Gi and no swap.
10:00 <Dyrcona> Right now 24Gi is available with 23Gi buffered/cached.
10:00 <Dyrcona> 4.2Gi used and 2.0Gi "free"
10:01 <Dyrcona> Looks like Redis restarted right after that, and that's where the log ends.
10:04 <Dyrcona> Someone was using acq about that time. This is 3.14, so the new Angular acq.
10:08 <Dyrcona> I've emailed one of our internal lists to find out if whoever was using acq had any issues and could report anything.
10:08 <csharp_> @blame acq
10:08 <pinesol> csharp_: acq stole bradl's tux doll!
10:08 <csharp_> gotta remove that blame
10:08 <Dyrcona> IIRC, csharp_ you have the tux doll in question?
10:09 <csharp_> I do, in my office
10:09 <csharp_> I'll share a pic sometime :-)
10:10 <csharp_> also I'll share a pic of the unicorn on a noose that used to hang on the "previous ILS" $1M+ Sun server
10:10 <Dyrcona> I hesitate to check yesterday's osrfsys.log. It's 202MB compressed.
10:12 <berick> Dyrcona: since Bmagic was also having issues on a utility server, I'm leaning toward it being a utility process that's causing the issue (hold targeter, fine generator, etc.).
10:15 <berick> if a backend was sending replies faster than the caller could collect them, it would cause messages to pile up in Redis. but I'm having trouble imagining where ~30G of such content would be coming from.
10:15 <berick> hm unless the caller just disconnected
10:16 <Dyrcona> OK. I suspect something like that, too.
10:16 <berick> but even then that's a lot
10:16 <berick> maybe over time though
10:16 <Dyrcona> No lock files from that time hanging around.
10:17 <Dyrcona> The badge score generator lock file is there from 4:30 am, though.
10:18 <Dyrcona> At 16:41, it would most likely be the fine generator.
10:18 <berick> Dyrcona: how many days had it been running OK?
10:19 <Dyrcona> I don't actually know. I was out of the office for a week, and didn't get back until yesterday.
10:21 <Dyrcona> I did recently switch the fine generator to run at 30 minutes past the hour because we were getting regular emails from the monitor about the number of processes running when it and the hold targeter started at the same time.
10:22 <Dyrcona> We're running password reset every minute and the pending A/T runner every half hour.
10:23 redavis joined #evergreen
10:23 <Dyrcona> We have a SQL job sending data to Bywater for our test Aspen installation every 5 minutes. That could be using a lot of RAM.....
10:25 <Dyrcona> The person testing acq yesterday said that they did see Evergreen stop working at that time. They also encountered Lp 2086786. I don't think it's related.
10:25 <pinesol> Launchpad bug 2086786 in Evergreen 3.13 "Acq: Multi-Branch Libraries Can't See All Their Branch Funds in Invoices" [High,Fix committed] https://launchpad.net/bugs/2086786
10:26 <Dyrcona> Someone else mentioned Evergreen "crashing" at that time.
10:29 <berick> Dyrcona: mind checking periodically to see if Redis memory usage is slowly climbing? if that's the case, I have a script that will help
10:29 <Dyrcona> OK. I'll do that.
10:29 <berick> thanks
10:31 <Dyrcona> Right now, I have to figure out where reports should be and probably make a link. I'm also told reports aren't working. I think they're working and the output is going somewhere other than /openils/var/web/reporter.
10:35 <Dyrcona> And, that was easy.
10:38 <Dyrcona> Hm... it's 85MB of VM, rss is 13MB at the moment.
10:41 <Dyrcona> FWIW, I'm monitoring it this way: ps -o etime,time,rss,size,vsize 15997 [that's the actual PID]
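[Note: ps gives the OS view of the redis-server process; Redis's own accounting can be watched alongside it with INFO memory. A minimal sketch -- only the 60-second interval is arbitrary (add credentials as needed):
    watch -n 60 "redis-cli INFO memory | grep -E 'used_memory_human|used_memory_peak_human'"]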
10:45 <Dyrcona> I have not had it crash on the other test server, but I don't usually run cron, and it's almost never used.
10:46 <Dyrcona> It should be quiet next week. I'll try recompiling OpenSRF and Evergreen with debug options turned on. That might help, particularly if I can figure out how to reproduce a crash.
10:57 <Dyrcona> Think I'll do that on my other test vm, too, if I haven't already. I'll do that after I test the 3.12.10 tarball.
11:03 sandbergja joined #evergreen
11:19 Christineb joined #evergreen
12:24 kworstell-isl joined #evergreen
12:24 <redavis> Dyrcona++ mjdriscoll++
12:59 <mmorgan> Dyrcona++ mjdriscoll++
13:17 <Dyrcona> storage and hold_targeter services crashed at some point. I missed catching it live.
13:18 <Dyrcona> storage won't run again....
13:21 <Dyrcona> today's redis-server.log is still empty. I wonder if it's a case where Redis needs to be restarted when the log is rotated?
13:22 <berick> no, restarts aren't necessary
13:23 <Dyrcona> ERR open-ils.storage Has PID file entry [61045], which matches no running open-ils.storage processes
13:23 <Dyrcona> That's after attempting to stop and then start storage again.
13:23 <Dyrcona> I think Redis or the router is hosed again.
13:24 <Dyrcona> diagnostic shows everything but storage and hold-targeter running.
13:25 <Dyrcona> storage is connecting to the database.
13:26 <Dyrcona> storage std_err log has not changed since yesterday, so whatever is happening is not getting logged.
13:27 <Dyrcona> Dec 20 13:25:34 cwmars-dev kernel: __vm_enough_memory: pid: 61502, comm: gce_workload_ce, not enough memory for the allocation
13:27 <Dyrcona> Free says I have plenty available.
13:29 <Dyrcona> Does redis spawn processes? I only see 1 when I pgrep -af redis
13:29 <Dyrcona> and it is still only using 85MB.
13:29 <berick> no, just the one process
13:30 <Dyrcona> OK. I just restarted redis-server, too.
13:32 <Dyrcona> Non-evergreen processes are saying that they cannot allocate memory, or were until a couple of minutes ago. See what I pasted above. Looks like an OpenSRF Listener could not allocate RAM at 11:30 am.
13:33 <Dyrcona> Dec 20 11:30:05 cwmars-dev kernel: __vm_enough_memory: pid: 50517, comm: OpenSRF Listene, not enough memory for the allocation
13:33 <Dyrcona> Dec 20 12:00:02 cwmars-dev kernel: __vm_enough_memory: pid: 50598, comm: OpenSRF Listene, not enough memory for the allocation
13:33 <Dyrcona> Cut off, so I don't know which is which.
13:36 <Dyrcona> Well, there's this: -rw-rw-r-- 1 opensrf opensrf 5 Dec 20 11:30 generate_fines-LOCK
13:37 <berick> can you tell from dmesg or similar what proc had the initial OOM issue?
13:38 <Dyrcona> [18852.962748] __vm_enough_memory: pid: 15109, comm: postgres, not enough memory for the allocation
13:38 <Dyrcona> That's the earliest from dmesg. There's no timestamp, but should be since the reboot this morning.
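[Note: dmesg can print wall-clock timestamps, which would line these entries up against the cron schedule:
    dmesg -T | grep -i vm_enough_memory
The -T flag converts the seconds-since-boot prefix into a human-readable date.]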
13:39 <Dyrcona> I'm going to disable the incremental updates for Aspen.
13:53 * Dyrcona wonders if some kernel memory tuning would help, like changing cache/buffer settings?
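[Note: the __vm_enough_memory messages above are printed when an allocation fails the kernel's overcommit accounting check, so the overcommit policy is worth confirming before tuning cache/buffer behavior. A minimal sketch:
    sysctl vm.overcommit_memory vm.overcommit_ratio
    grep -i commit /proc/meminfo    # CommitLimit vs Committed_AS
With vm.overcommit_memory=2 the kernel rejects allocations once Committed_AS reaches CommitLimit, even while free/available memory still looks healthy.]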
14:11 <Dyrcona> And, it's happening again. Just got an email that no children are available for open-ils.trigger. Went to run osrf_control --diagnostic and that segfaulted.
14:11 <Dyrcona> free says 1.6Gi free, but nothing will run.
14:12 <Dyrcona> redis-server pid changed.
14:13 <Dyrcona> 2.5 minutes ago it changed, and is not using much memory.
14:15 <Dyrcona> Can I blame google_guest_agent using 2.2GB of virtual memory? or websocketd using 1GB?
14:16 <Dyrcona> For resident memory, several storage drones are using 280MB or more.
14:16 <Dyrcona> with virt over 300
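[Note: a one-shot way to get the same ranking for every process on the box; resident set size (RSS) is the column that matters most when the question is what is consuming physical RAM:
    ps -eo pid,comm,rss,vsz --sort=-rss | head -20]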
14:17 <Dyrcona> I thought postgres would be up there, too, but it's not.
14:18 <Dyrcona> My conclusion: this vm just doesn't have enough RAM.
14:18 <csharp_> Dyrcona: that's what it sounds like to me
14:18 <Dyrcona> redis-server pid changed again.
14:19 <Dyrcona> I'm going to stop the cron jobs.
14:19 <Dyrcona> reboot it.
14:20 <Dyrcona> Clark died along the way. Probably when reporter-store died.
14:22 <Dyrcona> ran --stop-all and some drones are still hanging out.
14:22 <Dyrcona> What's that option? --kill-with-fire?
14:23 <Dyrcona> I bet these drones are hanging around from when the listeners crashed earlier.
14:31 <Dyrcona> Don't think I'll start SIPServer or simple2zoom, either.
14:38 <Dyrcona> Yeah, the top 4 users of virt memory are all related to GCP and docker.
14:38 <Dyrcona> Should this thing be running docker?
14:40 * Dyrcona has more reason to not like docker.
16:08 <pinesol> News from commits: Forward port 3.13.6 -> 3.13.7 ugrade script <https://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=e77e94dfc25c5da2f32a072d9316334e350adbd4>
16:38 <pinesol> News from commits: Forward-port 3.12.9-3.12.10 upgrade script <https://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=467133d1de7778f9ed46ed05d70d707a75c78fac>
17:11 mmorgan left #evergreen