| Time | Nick | Message |
| 00:03 | | JBoyer joined #evergreen |
| 08:35 | | mmorgan joined #evergreen |
| 09:20 | | Dyrcona joined #evergreen |
| 09:38 | Dyrcona | Interesting.... With no parallel for react and collect in open-ils.trigger, neither Ejabberd nor Redis had issues with the mark item lost events last night. I'll leave it running like this for at least 1 more day to see if that changes. |
| 09:40 | Dyrcona | Apparently, neither did the auto-renewals. I think there's a bug with parallel event processing. |
| 09:45 | csharp_ | the only times I've seen trouble with parallel event processing ended up being a RAM issue |
| 09:46 | csharp_ | (as in, insufficient RAM for what I was trying to do) |
| 09:46 | Dyrcona | csharp_: I've been able to reliably reproduce it on VMs with 16GB of RAM, and basically all that's going on is the mark item lost process. |
| 09:47 | csharp_ | looks like our A/T server has 24GB right now |
| 09:47 | Dyrcona | If it takes > 16GB of RAM to process 1,000 lost items, then something is seriously wrong with the memory management of our code. |
| 09:48 | csharp_ | we have 3 parallel procs configured |
| 09:48 | Dyrcona | That's what I had: 3 collect and 3 react. |
| 09:48 | csharp_ | same |
| 09:49 | csharp_ | four CPU cores: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz |
| 09:49 | csharp_ | that's the VM host's processor |
| 09:49 | csharp_ | (fwiw) |
| 09:52 | Dyrcona | csharp_: 16GB RAM, 16 CPUs for the vms where I'm doing my testing. |
| 09:53 | * csharp_ | nods |
| 09:53 | Dyrcona | Bmagic will have to say how the production docker image that does the mark item lost is configured, but I'm considering disabling parallel processing for now. |
| 09:53 | Dyrcona | But, seriously, I will repeat: something is very wrong if it takes more than 16GB of RAM to process 300 to 1,000 events in parallel. |
| 09:55 | Dyrcona | Yes, 16: <vcpu placement='static'>16</vcpu> |
| 09:55 | csharp_ | also I've seen it fail in the collection stage when there are too many fleshed items or whatever |
| 09:55 | csharp_ | 1 patron, 50 lost items or something |
| 09:55 | Dyrcona | I've seen that, but it was rare. |
| 09:55 | csharp_ | though that's usually circ notice related, especially 1 day overdue or something |
| 09:56 | Dyrcona | Usually it's some "magic" card that the libraries use for some purpose: book club, tracking lost things in some way because the ILS they had 4 ILSes ago didn't track lost items, stuff like that. |
| 09:56 | csharp_ | ah yes |
| 09:57 | csharp_ | "we don't need no stinking buckets! we have this handy card!" |
| 09:57 | Dyrcona | I hate when I see workflows from 30 years ago being used. |
| 09:57 | csharp_ | at my first public library job we had a fake card named MS. E PIECES |
| 09:58 | Dyrcona | I recall seeing the notebook with the steps for some process at a library, and they were complaining that Evergreen was broken. I told them, those steps didn't work in Horizon either. (This was at MVLC.) |
| 09:59 | Dyrcona | I also "like" notes saying things like "number in the field above is..." Really? You think that's gonna be accurate after an update, never mind a migration? |
| 10:00 | Dyrcona | Anyway, I came here to bash (heh) Perl and Evergreen, not staff..... :) |
| 10:00 | csharp_ | @decide bash or perl or evergreen |
| 10:00 | pinesol | csharp_: That's a tough one... |
| 10:02 | Dyrcona | Our production utility vm that still runs auto-renewals with parallel processing has 32GB of RAM and 8 CPUs. I guess CWMARS is just too big for our hardware. |
| 10:02 | Dyrcona | We used to have 2 vms configured that way for utility stuff. We still had problems, but not as frequently. |
| 10:03 | Dyrcona | Well, one of our production utility servers was actual hardware, IIRC. |
| 10:06 | Dyrcona | @decide lua or lisp |
| 10:06 | pinesol | Dyrcona: That's a tough one... |
| 10:06 | Dyrcona | pinesol: Why so indecisive today? |
| 10:06 | pinesol | Dyrcona: What do you mean? An African or European swallow? |
| 10:09 | Dyrcona | "It's not a question of 'ow 'e grips it." |
| 10:09 | Dyrcona | Anyway, what am I gonna do? Rewrite open-ils.trigger in Rust? |
| 10:34 | Dyrcona | csharp_: I can't find any evidence of the OOM Killer running on either vm where I've been testing this. |
| 10:36 | csharp_ | hmm |
| 10:40 | Bmagic | Dyrcona: your findings agree with my suspicions. The root issue is likely memory management, and when running the same Evergreen code over the top of Redis, it shows the issue more often. Probably* because ejabberd is slower and gives time for garbage collection? |
| 10:53 | Dyrcona | Yeah, but I'm not finding any OOM Killer messages in the logs nor with journalctl. I'm still looking. |
| 10:55 | Dyrcona | Anyway, I think I'm just going to disable the parallel settings for open-ils.trigger. Guess I'll open a ticket with our hosting vendor. :) |
| 11:13 | Bmagic | I don't see OOM either when the issue arises. A couple of theories: maybe ulimit, or Redis automatically starts dropping children when it starves. |
| 11:18 | Dyrcona | ulimit: unlimited |
| 11:18 | jmurray-isl | EGIN's utility server uses 32 CPUs and 32 GB RAM (though we probably don't exceed 16GB), with 8 collect and 8 react. We typically see a 75% CPU load only during morning notice processing, and it doesn't last long. |
| 11:19 | Dyrcona | We're just cursed. :) |
| 11:24 | Dyrcona | We could be blowing out the 8MB stack limit. Not sure how I'd find that, segmentation faults? |
| 11:27 | Dyrcona | Nope. No segfaults either. |
| 12:02 | | jihpringle joined #evergreen |
| 12:23 | | jihpringle joined #evergreen |
| 12:46 | | mantis1 joined #evergreen |
| 13:14 | | mantis1 joined #evergreen |
| 15:12 | | jihpringle joined #evergreen |
| 15:21 | | mantis1 left #evergreen |
| 17:08 | | mmorgan left #evergreen |
| 20:22 | | beardicus joined #evergreen |
| 22:28 | | beardicus4 joined #evergreen |
| 22:38 | | jeff joined #evergreen |
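
The 3 collect / 3 react settings that Dyrcona and csharp_ compare at 09:48 are the action/trigger parallel settings for open-ils.trigger in opensrf.xml. A minimal way to check them, assuming a default /openils install path and the <parallel>/<collect>/<react> element layout from the stock opensrf.xml.example (verify against your version):

```sh
# Show the open-ils.trigger parallel settings (path assumes a default install):
grep -A 4 '<parallel>' /openils/conf/opensrf.xml
# Expected shape, with the 3/3 values both sites report above:
#   <parallel>
#     <collect>3</collect>
#     <react>3</react>
#   </parallel>
```

Disabling parallel processing, as Dyrcona proposes at 10:55, generally amounts to removing or commenting out this block and restarting the open-ils.trigger service.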
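
Dyrcona's check for OOM Killer activity at 10:34 and 10:53 can be done with standard kernel-log queries; a quick sketch (the message wording varies by kernel version, hence the broad patterns):

```sh
# Search the journal's kernel messages and the kernel ring buffer for OOM-killer activity:
journalctl -k | grep -iE 'out of memory|oom-killer|killed process'
dmesg -T | grep -iE 'oom|killed process'
```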
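
For the ulimit and 8MB stack theories raised between 11:13 and 11:27, limits can be checked per shell and per running process; a sketch, where <pid> is a placeholder for the process ID of a running open-ils.trigger drone:

```sh
ulimit -a                            # all soft limits for the current shell
ulimit -s                            # stack size in KB; 8192 is the 8MB limit mentioned above
cat /proc/<pid>/limits               # effective limits of a specific running process
journalctl -k | grep -i segfault     # kernel-logged segmentation faults (none were found above)
```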