00:09  sandbergja joined #evergreen
01:00  sandbergja joined #evergreen
07:23  rjackson_isl_hom joined #evergreen
07:47  mantis1 joined #evergreen
07:56  rfrasur joined #evergreen
08:40  mmorgan joined #evergreen
08:44  mmorgan left #evergreen
08:51  Dyrcona joined #evergreen
08:55  yar joined #evergreen
08:55  rfrasur joined #evergreen
08:56  mmorgan joined #evergreen
09:00  mmorgan left #evergreen
09:05  jvwoolf joined #evergreen
09:07  alynn26 joined #evergreen
09:24  dbwells joined #evergreen
09:54  mmorgan joined #evergreen
09:54  <csharp> berick: I was able to test your patch over the weekend and it works as expected - because drone exhaustion was dire this morning, I've applied it to PINES production and am watching carefully
09:54  <csharp> so far so good though
09:55  <csharp> if we don't see any problems by noon or so, I'll sign off/commit
09:57  <berick> csharp++
10:00  Cocopuff2018 joined #evergreen
10:09  <JBoyer> Hi! If you haven't seen my Dev Meeting Poll on evergreen-dev or evergreen-general, it's here: https://forms.gle/ypTx2zqLbW7eVoj99
10:10  <JBoyer> Not a lot of responses yet, so if you haven't filled it out yet please do!
10:11  <berick> ohhh.. i saw "new developer" in the subject and thought that's what it was about.
10:16  <csharp> JBoyer++ # thanks for the poke
11:09  <Dyrcona> Funny how sometimes the more you work on code, the worse it gets.
11:18  * Dyrcona thinks it may be time to bite the bullet and refactor Circulate.pm.
11:34  <Bmagic> general production brick question: Anyone else see an occasional brick spike to really high CPU, caused by a dramatic increase of apache2 processes? In this example, the machines are using an nginx proxy (standard setup) - monitoring the number of nginx and apache processes, we see they are usually 9 and 18 respectively
11:35  <Bmagic> sometimes, though, I see the number of apache processes spike up over 200!
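For anyone wanting to reproduce the check Bmagic describes above, a quick way to watch the worker counts (assuming Debian/Ubuntu-style process names, nginx and apache2) is:

    # print nginx and apache2 process counts every 5 seconds
    watch -n 5 'echo -n "nginx: "; pgrep -c nginx; echo -n "apache2: "; pgrep -c apache2'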
11:35  <berick> Bmagic: could be related to the same drone exhaustion issues we've been battling, web clients just blasting requests
11:36  <Bmagic> yeah?
11:38  <Dyrcona> Bmagic: I see it when someone is spidering us, like EBSCO did last week. Apache maxed out on a brick or two.
11:38  <Dyrcona> Drone exhaustion doesn't seem to be a cause in our case, at least not to the same extent.
11:39  <Bmagic> I just witnessed a single machine go out of control and die. Then the rest of the machines followed suit. Drone exhaustion is defined by the number of allowed children getting maxed?
11:40  <Dyrcona> Bmagic: Yes.
11:40  rfrasur joined #evergreen
11:40  <berick> oh yeah, ebsco could certainly be it too
11:40  <Dyrcona> We don't usually hit max number of Apache processes unless someone is being a bad actor, like spamming searches or unapi requests.
11:40  <Bmagic> so, when the number of allowed drones maxes out, there should be some log messages like "no children" - in my case, I didn't see that
11:42  <Bmagic> It seems there should* be a way to mitigate that kind of thing
11:43  <Dyrcona> Bmagic: There are several. One is to configure connection limiting in a proxy. Oh hey! You have nginx in front of Apache.
11:43  <Bmagic> I'm listening
11:43  <Dyrcona> The rest is an exercise for the reader. :)
11:46  <Bmagic> max apache workers is set to 250, which is a holdover setting from when apache was up front
11:47  <Dyrcona> Ours is 150, IIRC.
11:47  <Dyrcona> You're likely to run out of RAM before you hit 250, unless you have heaps.
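A rough way to sanity-check the numbers in this exchange, assuming an Ubuntu-style layout under /etc/apache2 and the prefork MPM (the cap is MaxRequestWorkers, called MaxClients in older configs):

    # find the current worker cap
    grep -Rni 'maxrequestworkers\|maxclients' /etc/apache2/
    # estimate per-child memory to see where 250 workers would put you
    ps -C apache2 -o rss= | awk '{sum+=$1; n++} END {if (n) printf "%d procs, avg %.0f MB each\n", n, sum/n/1024}'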
11:48  <Bmagic> nginx makes everything play nicer - but sometimes (once or twice a week) apache processes go from less than 20 to over 200. you're suggesting a configuration on nginx to limit the number of connections? will that deny legit requests? Is it the number of connections from the same client?
11:48  <Bmagic> 32GB memory on each brick with swap - though, swap doesn't get touched oddly
11:48  <csharp> berick: still seeing higher-than-desired open-ils.actor drone counts, but definitely improved
11:48  <mmorgan> JBoyer: FYI RE: the form, I also heard the "new developer" confusion from a colleague
11:48  sandbergja joined #evergreen
11:49  <Dyrcona> It's usually configured by IP address, so it could throttle legit requests if you have a lot of sites with NAT.
11:49  * csharp starts a new Confusted Developers' group
11:49  <csharp> er... Confused, even
11:49  <csharp> appropriate typo, I guess
11:49  <berick> csharp: gotcha, good to know
11:49  <Dyrcona> Confuzzled.
11:49  <berick> heh
11:50  <csharp> berick: do you have a reliable way of identifying which client-side actions are at fault?
11:50  <berick> csharp: if you want to experiment, you could modify MAX_PARALLEL_REQUESTS in opensrf.js, make it lower
11:50  <csharp> I was counting messages per threadtrace as a start
11:51  <csharp> berick: does that require an opensrf restart to take effect?
11:51  <Dyrcona> Speaking of confused developers, my comments about Circulate.pm earlier were not a joke. I'm reviewing code that attempts to add a feature to circulation, and as the work progresses, the code gets less functional, not because the developer is bad, but because the Circulate.pm code is poorly organized.
11:52  <berick> csharp: hm, not really. i would probably look for bursts from a single IP address in the activity log. i imagine in most cases the threadtraces will be different
11:53  <berick> csharp: no
11:53  <berick> no restart needed
11:53  <Dyrcona> csharp: It should require an OpenSRF restart, but will likely require you to clear the cache in your browser, and cache may be why it's not having a huge impact.
11:53  <berick> exactly..
11:53  <Dyrcona> grr... s/should/shouldn't/
11:53  <csharp> berick: Dyrcona: good to know - I'll wait it out for now
11:53  <berick> i wouldn't be surprised if a lot of clients aren't using the new code yet
11:53  <Dyrcona> I'll check back with you in a year....
11:53  <csharp> I thought that might be the case
11:54  <berick> Dyrcona: agree circulate.pm has not aged well
11:54  <csharp> Dyrcona: because cache expiry is set super high?
11:54  <Dyrcona> javascript is access + 1 year, and was before the infamous commit went in.
11:55  <csharp> argh
11:55  <Dyrcona> Of course, browsers should send an If-Modified-Since: and the server should serve the JS again, but you never know.
11:55  <Dyrcona> We turned the main ExpiresActive off in our configuration recently.
11:56  <csharp> Dyrcona: +1 re: Circulate.pm
11:56  <berick> csharp: curl -I 'https://gapines.org/js/dojo/opensrf/opensrf.js'
11:56  <berick> Expires: Tue, 26 Jan 2021 10:56:01 GMT
11:57  <berick> so not a year at least
11:57  <csharp> whew
11:57  <csharp> well, I'll see how it looks tomorrow
11:57  * csharp hops in TARDIS to take a look
12:02  jihpringle joined #evergreen
12:04  <Dyrcona> csharp: Looks like you're set to 18 hours.
12:06  <Dyrcona> Nifty! We're apparently doing HTTP/2.
12:14  <jeffdavis> The super-long default cache expire times seem based on the use of cache-busting in the OPAC, not sure they make sense for JS in the Ang/AngJS client era?
12:15  <jeffdavis> We're using 18 hours here too.
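To repeat berick's check against your own bricks (example.org is a placeholder for your catalog hostname; the lifetimes themselves come from mod_expires directives such as ExpiresActive/ExpiresDefault/ExpiresByType in the Apache vhost config):

    # show only the caching-related response headers for the OpenSRF JS
    curl -sI 'https://example.org/js/dojo/opensrf/opensrf.js' | grep -iE '^(expires|cache-control|last-modified)'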
12:23  <Bmagic> apachetop -f /var/log/apache2/other_vhosts_access.log
12:30  <JBoyer> mmorgan++ berick++ csharp++
12:30  <JBoyer> I'll re-send with a proper subject. :/
12:36  <Bmagic> Dyrcona: do you impose limits? Anyone? Reading this article gives me some ideas: https://www.nginx.com/blog/rate-limiting-nginx/
12:40  <Dyrcona> Bmagic: No, we don't currently use rate limits. That article does look interesting.
12:41  <Dyrcona> Ideally, you would have 1 proxy in front of all your bricks that would handle the rate limits, but rate limiting per brick would still help.
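For reference, the approach in that nginx article boils down to a per-IP limit_req zone on the proxy. A minimal sketch with illustrative names and rates; keep whatever proxy_pass you already have:

    # in the http {} block
    limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;

    # in the server/location block that fronts Apache
    location / {
        limit_req zone=per_ip burst=20 nodelay;
        proxy_pass https://localhost:443;   # hypothetical upstream; match your existing config
    }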
12:42  <Bmagic> It seems that we may be hitting the same issue as csharp - bug 1912834?
12:42  <pinesol> Launchpad bug 1912834 in OpenSRF "Browser client should limit the number of parallel requests" [High,New] https://launchpad.net/bugs/1912834
12:42  <Dyrcona> Yeah, we all are from time to time.
12:43  <Dyrcona> Rate limiting should still be implemented to protect you from the people who drop books or cats on keyboards, and from other bad actors, deliberate or not.
12:44  <Bmagic> using that apachetop command, I can observe osrf-http-translator requests burst up over 10/second with totals over 300 in a short period. That may or may not be an issue though.
12:44  <JBoyer> Something that I've seen make a difference in the past is to be absolutely certain that there are reasonable timeouts in place on your proxy, especially for websocketd since it doesn't have a timeout of its own. That won't make this kind of thing stop entirely, but may help free up some drones.
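A hedged sketch of what JBoyer means on the nginx side; the location path and backend port are assumptions about a typical websocketd setup, but the proxy timeout directives are standard nginx:

    # location that proxies websocket traffic through to websocketd
    location /osrf-websocket-translator {
        proxy_pass https://localhost:7682;       # hypothetical backend; use your existing upstream
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 5m;   # idle websocket connections get dropped instead of pinning drones
        proxy_send_timeout 5m;
    }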
12:45  <Dyrcona> I've not seen a spike in OpenSRF requests push our Apache counts up to max. When that happens, it's something more persistent, like 300 requests for the same search terms or 30,000 unapi requests, etc.
12:46  <JBoyer> And yeah, nothing will help those kinds of things but rate limiting.
12:47  <Bmagic> example https://ibb.co/FxsLxqX
12:49  <Bmagic> that snapshot doesn't show some of the higher numbers, but anyways, that's what I'm looking at to try and figure out where to "fix" this
12:50  <Dyrcona> Rate limiting on unapi would be a start.
12:52  <Dyrcona> @decide make it better or make it worse
12:52  <pinesol> Dyrcona: go with make it worse
12:53  <Dyrcona> Yeah, pinesol, that is easier than making it better, at least for now. :)
13:41  mrisher joined #evergreen
13:42  mrisher joined #evergreen
13:46  jihpringle joined #evergreen
14:18  <Dyrcona> Hopefully, I didn't make it that much worse. :)
14:22  jihpringle joined #evergreen
14:24  Cocopuff2018 joined #evergreen
14:36  <Dyrcona> Anyone else been getting this lately:
14:36  <Dyrcona> warning: inexact rename detection was skipped due to too many files.
14:36  <Dyrcona> warning: you may want to set your merge.renamelimit variable to at least 1760 and retry the command.
14:36  <Dyrcona> I see it when I'm backporting from master to 3.5 or 3.2.
14:36  <berick> Dyrcona: yes, and beware if you bump the merge.renamelimit it will slow to a crawl :(
14:36  <berick> it did for me, anyway
14:37  <Dyrcona> Well mine is set to 1493 because that was suggested to me earlier by a git merge/cherry-pick.
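The fix git is hinting at is just a config bump, per-repo or global (berick's caveat above about rename detection slowing to a crawl still applies):

    git config merge.renamelimit 1760
    # diff.renamelimit covers the same warning from plain diff / log --follow
    git config diff.renamelimit 1760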
14:42  <Dyrcona> It's the documentation that's causing it, isn't it?
14:56  <Bmagic> Dyrcona: lol, I was about to report something very similar. ejabberd log: 2021-01-25 14:49:59.550 [error] <0.441.0>@ejabberd_listener:accept:311 (#Port<0.8999>) Failed TCP accept: too many open files
14:57  <Bmagic> kernel file limits are all the way to the max: cat /proc/sys/fs/file-max result 65535
14:58  <Bmagic> aha! but ejabberd user "ulimit -n" is still 1024
15:01  <Dyrcona> Bmagic: That is very different.
15:01  <Bmagic> :)
15:06  <Dyrcona> Bmagic try asking for more and see what happens.
15:06  <Bmagic> no doubt, just need to bake it into the brick build
15:06  <Bmagic> I ran into this before, solved it, and now it's back
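For baking the fix into a brick build, the per-user limit is what matters here, not fs.file-max. A sketch, assuming ejabberd runs as the ejabberd user; on systemd-managed installs the unit's LimitNOFILE= setting is what counts rather than limits.conf:

    # what the running process actually has
    cat /proc/$(pgrep -f ejabberd | head -1)/limits | grep -i 'open files'
    # raise it via PAM limits (takes effect after ejabberd is restarted in a fresh session)
    echo 'ejabberd soft nofile 65536' | sudo tee -a /etc/security/limits.conf
    echo 'ejabberd hard nofile 65536' | sudo tee -a /etc/security/limits.conf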
15:08  <Dyrcona> TBH, I've not had problems with limits on Linux. OpenBSD on the other hand.... They like to set low defaults.
15:09  <Dyrcona> Well, not had problems on Linux in quite a while.
15:10  <csharp> berick: so we're getting complaints from catalogers that buckets aren't working - could that be the new patch at work?
15:11  <csharp> I'm not seeing anything system-side that shows trouble
15:11  <berick> csharp: hard to say, it could be
15:11  <Dyrcona> Bmagic: Switch to FreeBSD: open files (-n) 116487
15:11  <Bmagic> oh sure
15:11  <mmorgan> JBoyer: Having trouble filling out the Developer Meeting form. For example, if someone has a recurring meeting the third Wednesday of each month, how would they say they are available the first or second Wednesday of the month?
15:12  <berick> csharp: if you have specifics i could try to verify
15:12  <csharp> berick: trying to get those
15:12  <Dyrcona> Probably because they have too many things in the bucket, and it's timing out processing 5 at a time.
15:13  <csharp> that makes sense
15:13  <Dyrcona> It does, and it could be wrong. :)
15:14  <csharp> I guess I'm going to need their console output to see the console.warn messages and find out if they're hitting the limit
15:14  <JBoyer> mmorgan, it is an imprecise instrument. :/ You could fill out everything that would work and say what's no good in the text box, or avoid combinations that aren't entirely open.
15:15  <JBoyer> I was trying to avoid having 20-30 different options to choose from but that may have been easier to use.
15:15  <csharp> "When I first scan a barcode in Item Status then go to that record I can view holdings, holds and MARC but if I then go back to one of those tabs nothing happens. It looks like it is trying to load the page but nothing ever happens."
15:15  <mmorgan> Ok, gotcha.
15:16  <csharp> "I am having to refresh the screen every time I add holding and also every time I print a label."
15:16  <csharp> "I can't get items to transfer when I do get Holdings View to open. This is transferring between libraries in the same holdings view. Refreshing the screen doesn't help."
15:16  <csharp> quotes from ACTUAL pines catalogers
15:16  <csharp> several "me too" messages along with
15:17  <csharp> I'll roll back the change for now, I think
15:17  <Dyrcona> Screenshots or it didn't happen.
15:18  <berick> csharp: EG 3.6?
15:18  <JBoyer> "We replaced these ACTUAL Pines catalogers' MARC records with Folgers Crystals. Let's see if they notice."
15:18  <Dyrcona> JBoyer++
15:18  <csharp> berick: yep
15:18  * JBoyer realizes he's only barely old enough for that joke to land
15:18  <mmorgan> JBoyer++ :)
15:18  <csharp> JBoyer++
15:19  <csharp> JBoyer: I would have never guessed that given the accuracy
15:19  * csharp fills it to the rim, with Brim
15:20  <JBoyer> I apparently have very little control over what I will lose in a minute vs. what I will remember decades after it could ever be useful.
15:20  <csharp> https://www.youtube.com/watch?v=aGY7maLpA1I
15:20  <mmorgan> JBoyer: You're not alone in that! Wish we could repurpose that ROM!
15:20  <berick> csharp: i was just able to reproduce
15:21  <csharp> oh?
15:21  <berick> csharp: it does seem likely the patch is at fault (sigh)
15:23  <Dyrcona> csharp++ # For boldly going where no Evergreen sysadmin has gone before.
15:24  <csharp> Dyrcona: all while staying in my living room office for nearly a year!
15:25  <csharp> berick: I rolled the patch out, but I can apply any fixes to a test server
15:25  <csharp> working without the patch means I have to babysit all day long
15:26  * berick nods
15:30  mantis1 left #evergreen
15:32  <mmorgan> So, poking at Curbside, I have hours open set as 2:00pm - 5:30pm, 4 hour appointment slots. As a patron, I'm offered appointments at 2:00am, 6:00am, 10:00am, 2:00pm. Anyone seen this problem?
15:34  jihpringle joined #evergreen
15:57  <mmorgan> Re: my curbside question, nevermind. Hours of operation showed PM in client, but stored AM in db :-/
15:57  sandbergja joined #evergreen
16:00  <Dyrcona> mmorgan: I'd double check the servers' timezone settings.
16:02  <mmorgan> Dyrcona: Thanks, will do that.
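For the timezone check Dyrcona suggests, both the OS setting and what PostgreSQL reports are worth comparing; the database user and name below are assumptions:

    # OS-level zone on the app and db servers (systemd hosts)
    timedatectl | grep 'Time zone'
    # what PostgreSQL will use for timestamptz conversions
    psql -U evergreen -d evergreen -c 'SHOW timezone;'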
16:30  <berick> csharp: fyi working/user/berick/lp1912834-max-parallel-net-v2
16:33  <csharp> berick: cool - I'll take a look in a bit
16:37  troy__ joined #evergreen
16:39  rfrasur joined #evergreen
17:19  mmorgan left #evergreen
17:56  Cocopuff2018 joined #evergreen
18:01  <pinesol> News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
18:47  sandbergja joined #evergreen
21:26  sandbergja joined #evergreen
21:52  JBoyer joined #evergreen
22:15  JBoyer joined #evergreen