IRC log for #evergreen, 2021-12-15

All times shown according to the server's local time.

Time	Nick	Message
06:01	pinesol	News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
07:42		collum joined #evergreen
08:07		rjackson_isl_hom joined #evergreen
08:15		mantis joined #evergreen
08:36		mmorgan joined #evergreen
09:14		rfrasur joined #evergreen
09:20		Dyrcona joined #evergreen
09:21	Dyrcona	Who else thinks the pivot option should be removed from the reporter?
09:26	csharp_	I think it doesn't offer a lot of value - Excel (or whatever) does it so much better
09:27	csharp_	most end users have no idea what it is, even if they wanted to use it
09:27	rhamby	I can agree with that.
09:27	csharp_	we've tried to train our libraries to see the Evergreen reporter as just a source of raw data and not be depended on for end-user-friendly output
09:27	Dyrcona	It also crashes on our reports server with only 32GB of RAM.
09:28	csharp_	oh - hmm
09:28	rhamby	If you're going to use a pivot table it's better to do it in the spreadsheet software.
09:28	csharp_	huh - our reporter currently has 16GB of RAM and we don't see that trouble - it's running on blazing fast blades at ITS though
09:29	Dyrcona	I'm looking at what it takes to rip it out. I'll open a Lp bug after I do that.
09:29	csharp_	storage is basically RAM in that case
09:29	csharp_	Dyrcona: I'm for it
09:29	csharp_	as long as it doesn't have tendrils leading into a full redesign of the reporter
09:29	Dyrcona	Someone tried to run two circ reports yesterday with the pivot table options, and both reports exhausted the RAM on the server and the reports were put down by the OOM killer.
09:30	Dyrcona	Well, a full redesign of the reporter would not be a terrible idea.
09:30	csharp_	very much agreed - just didn't want the removal of an annoying value-add as a driver for that redesign :-)
09:31	Dyrcona	Pretty much any time a report dies, two conditions have been met: 1) the reports server ran out of RAM and 2) the pivot options were used.
09:31	csharp_	how many concurrent reports, btw?
09:32	Dyrcona	We allow up to 6 concurrent reports, usually never have more than two or three, except after several hours down time for an upgrade or whatever.
09:32	Dyrcona	I'll check the schedule for yesterday.
09:32	csharp_	gotcha
09:32	csharp_	we do 6 at a time too
09:32	csharp_	at one point we were doing 12, but that was trouble
09:37	Dyrcona	We only had these two reports running at the time. Same report, probably. Looks like user scheduled it again at 7:00 pm when the one started at 5:35 pm hadn't finished, yet.
09:38	Dyrcona	Started getting memory warning emails around 6:30 PM, but I wasn't paying attention to work email at that time or after.
09:39	Dyrcona	Oh, well. I have a meeting in 20 minutes. I should at least look at the agenda.
09:41		jvwoolf joined #evergreen
09:48		Keith-isl joined #evergreen
09:58		bshum joined #evergreen
10:35	miker	fwiw, removing pivot should be a matter of hiding some html elements. making it smarter (at UI reimplementation time) and showing it as "compare [aggregate column] across [non-agg column]", and only when it's reasonable, would be possible, too
11:19	csharp_	following up on yesterday's ejabberd findings - I can't tie the occurrences I'm looking at with a specific OpenSRF call yet
11:19	csharp_	the message in the ejabberd log is 2021-12-15 10:20:42.390 [info] <0.31885.41>@ejabberd_c2s:process_terminated:271 (tcp|<0.31885.41>) Closing c2s session for opensrfprivate.brick03-head.gapines.org/open-ils.actor_listener_brick03-head.gapines.org_32971: Connection failed: timeout
11:20	csharp_	when I've turned up ejabberd logging to debug it's apparently too much data and I end up with a truncated log where logging just stops midstream
11:23	Dyrcona	csharp_: I only ever use debug logging for brief periods, and I typically truncate the log before.
11:24	Dyrcona	It's not so useful in production. Better in a controlled environment where yours are the only requests happening.
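
The ejabberd line csharp_ quotes at 11:19 has a regular enough shape that the sporadic timeouts could be tallied per service from the normal info-level log, without turning on debug logging. What follows is a minimal Python sketch, not a tested tool. It assumes every relevant line looks like the single quoted example, and that the part of the session name after the "/" follows a `<service>_listener_<host>_<pid>` pattern, which is inferred from that one line only. The names `parse_close` and `count_timeouts` are hypothetical.

```python
import re
from collections import Counter

# Shape inferred from the single log line quoted in the channel.
CLOSE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"\[(?P<level>\w+)\] .*?Closing c2s session for "
    r"(?P<jid>\S+): (?P<reason>.+)$"
)


def parse_close(line):
    """Return timestamp, service and reason for a 'Closing c2s session' line, else None."""
    match = CLOSE_RE.match(line.strip())
    if match is None:
        return None
    jid = match.group("jid")
    # Everything after the first "/" is treated as the session's resource part.
    resource = jid.split("/", 1)[1] if "/" in jid else ""
    # Assumption: the resource looks like <service>_listener_<host>_<pid>.
    if "_listener_" in resource:
        service = resource.split("_listener_", 1)[0]
    else:
        service = resource
    return {
        "ts": match.group("ts"),
        "service": service,
        "reason": match.group("reason"),
    }


def count_timeouts(lines):
    """Count closed sessions whose reason mentions a timeout, grouped by service."""
    counts = Counter()
    for line in lines:
        record = parse_close(line)
        if record is not None and "timeout" in record["reason"]:
            counts[record["service"]] += 1
    return counts
```

Feeding it the lines of the ejabberd log (for example, an open file object) would show whether the timeouts cluster on one service or are spread across all of them.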
11:25		jihpringle joined #evergreen
11:25	Dyrcona	miker: The widgets look like their hidden to start with, and there is code to unhide them. I didn't get very far before my 10am meeting. I'm inclined to just yank all the pivot code. Let them use Excel (or LibreOffice)!
11:26	Dyrcona	there they're their, choose wisely. :)
11:29	* Dyrcona	sidles off to get lunch.
11:29	csharp_	Dyrcona: yeah, that's the problem - this happens only sporadically and I have no idea what's causing it
11:29	csharp_	so it's all or nothing :-/
11:29	* csharp_	digs around in the ejabberd source code for clues to where timeout is defined/handled
11:36	miker	csharp_: did you recently upgrade your perl?
11:36	csharp_	yes - from 5.22? to 5.26
11:36	csharp_	(16.04 to 18.04 ubuntu)
11:36	csharp_	currently on v5.26.1
11:38	csharp_	yes, 16.04 was on 5.22
11:40	miker	csharp_: can you scan your logs for instances of "server: died with error" ? of particular interest are "Use of freed value" and "Can't kill a non-numeric process ID"
11:40	miker	re https://bugs.launchpad.net/opensrf/+bug/1953047 and https://bugs.launchpad.net/opensrf/+bug/1953044
11:40	pinesol	Launchpad bug 1953047 in OpenSRF "Perl services can crash with a "Can't kill a non-numeric process ID" error" [Medium,New]
11:40	pinesol	Launchpad bug 1953044 in OpenSRF "Perl services can crash with a "Use of freed value in iteration" error" [Medium,Confirmed]
11:41	miker	in a high-drone-turnover environment, '044 can happen when a drone exits after max-requests at an inopportune time
11:41	csharp_	I see the first two, but not 'Use of freed value'
11:42	csharp_	I would gladly update OpenSRF to fix this though
11:44	miker	fwiw, perl 5.24 seems to be the boundary version where those two fixes are relevant.
11:44	csharp_	ok, then I will apply them, by god
11:45	csharp_	since they're perl, I can just hot patch them and rollback if necessary
11:46	miker	and they're both restricted to 1 file, so, easy peasy
11:46	csharp_	yep
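
The log scan miker asks for at 11:40 could be sketched roughly as below. This is a minimal Python illustration that relies on plain substring matching, because the exact OpenSRF log line format is not shown in the channel. The marker and the two error strings are the ones quoted by miker; the function names `classify` and `scan` and the "other" bucket are hypothetical.

```python
from collections import Counter

# Marker and error strings as quoted in the channel.
MARKER = "server: died with error"
KNOWN_ERRORS = {
    "Use of freed value": "bug 1953044",
    "Can't kill a non-numeric process ID": "bug 1953047",
}


def classify(line):
    """Return the matching bug label, 'other', or None if the line is not a crash line."""
    if MARKER not in line:
        return None
    for needle, bug in KNOWN_ERRORS.items():
        if needle in line:
            return bug
    return "other"


def scan(lines):
    """Tally crash lines by the bug they appear to correspond to."""
    labels = (classify(line) for line in lines)
    return Counter(label for label in labels if label is not None)
```

Calling `scan` on an open log file would give a count per bug, plus an "other" count for crash lines that match neither message.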
12:23	mmorgan	Has anyone gathered some experience with 3.7+ in a production database with saving/loading bib records? We're trying to get an idea of the effect of the symspell dictionaries for did you mean.
12:26	csharp_	mmorgan: we're going to 3.8 next month - running it on a training/testing server that is production-ish, datawise
12:26	csharp_	what should we be looking out for?
12:26	jeff	We're not relying on "did you mean" and did not populate the related tables as part of our upgrade. I've noticed a deadlock or two which have made me want to look into disabling the relevant triggers until I have time to look into it more.
12:28	jihpringle	mmorgan: we're running 3.7.0 and do not have "did you mean" turned on. In testing with it on we couldn't save any MARC records
12:29	jihpringle	we haven't tested with any of the post 3.7.0 "did you mean" fixes yet
12:29	mmorgan	csharp_: Our big concern is loading records with 856 links. We do a lot of that, but general saving and loading of bib records via vandelay is our concern, too.
12:31	mmorgan	We've just begun testing with production data and so far are seeing the records with 856 links taking MUCH longer, also getting general.unknown error for some records that have failed to load.
12:31	mmorgan	jeff: Not sure what you mean by a 'deadlock' ?
12:32	mmorgan	jihpringle: We've loaded some of the post 3.7.0 fixes.
12:32	berick	jeff: if you decide to disable, mind sharing what all you disable?
12:33	mmorgan	We're very early in testing, and were just wondering if others were ahead of us and had experiences regarding saving/loading records.
12:38	mmorgan	jihpringle: Have you done anything other than turning off "did you mean"? My thinking is that even with it off, the symspell dictionaries in the database would still be updated without disabling triggers like jeff mentions.
12:38	csharp_	mmorgan: a deadlock is a postgresql level problem where two processes are waiting on each other to release the lock on a particular tuple ("row")
12:39	mmorgan	csharp_: Ah. Ok, thanks. I think I've seen log entries complaining about tuples.
12:39	jeff	mmorgan: the postgresql logs will log a deadlock when the database detects that process X and process Y (and maybe process Z) are all waiting on things that each other are waiting on. It's one of the reasons that pingest doesn't do certain bib-related ingest things in parallel.
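
The condition csharp_ and jeff describe, processes waiting on each other in a circle, can be pictured as a cycle in a "who waits for whom" graph. The Python sketch below is a toy illustration of that idea only and is not PostgreSQL's actual deadlock detector. It assumes each process waits on at most one other process; `find_deadlock` is a hypothetical name.

```python
def find_deadlock(waits_for):
    """Return the processes forming a wait cycle, or None if there is none.

    waits_for maps a process to the single process it is waiting on.
    Processes that are not waiting on anything are simply absent as keys.
    """
    for start in waits_for:
        seen = []
        current = start
        # Follow the chain of waits until it ends or revisits a process.
        while current in waits_for and current not in seen:
            seen.append(current)
            current = waits_for[current]
        if current in seen:
            # The chain looped back: everything from that point on is the cycle.
            return seen[seen.index(current):]
    return None
```

With `{"X": "Y", "Y": "X"}` the function reports X and Y as deadlocked, while `{"X": "Y"}` alone is just ordinary waiting and yields `None`.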
12:39		collum joined #evergreen
12:40	jeff	looks like we had one weird one on actor.usr_setting (which I don't have enough information to reproduce at this point), and the only other one was:
12:40	jeff	Process 456530: INSERT INTO biblio.record_entry...
12:40	jeff	Process 456523: SELECT asset.merge_record_assets("bre"...
12:41	jeff	that one's close enough that I'd look into symspell triggers being a contributing factor, but it might be something else also.
12:41	csharp_	fwiw, when addressing bug 1931737 we disabled all the maintain_symspell_entries_tgr triggers on the metabib.*_entry tables
12:41	pinesol	Launchpad bug 1931737 in Evergreen "Did you mean breaks parallel reingest" [Undecided,Confirmed] https://launchpad.net/bugs/1931737
12:41	jeff	If I improperly maligned symspell triggers in my deadlock mention earlier, I'm sorry. :-)
12:42	csharp_	@blame symspell triggers
12:42	pinesol	csharp_: symspell triggers caused the white screen of death!
12:42	jeff	berick: I hadn't looked into it yet, but the triggers mentioned in the bug csharp_ just linked are the ones I had first in mind to look at disabling. It doesn't currently look like we're having enough issues to warrant me looking at it immediately, though.
12:42	jeff	now pinesol is improperly maligning the triggers!
12:43	csharp_	@blame pinesol for blaming the triggers
12:43	pinesol	csharp_: itself wants the TRUTH?! itself CAN'T HANDLE THE TRUTH!! for blaming the triggers
12:43	jeff	is "Improperly maligning triggers..." the new "Reticulating splines..."?
12:43	csharp_	@ana Improperly maligning triggers
12:43	pinesol	csharp_: Gleamingly gripping terrorism
12:44	jeff	Ah, that's the problem right there: your triggers are out of malignment!
12:45	csharp_	@quote add <jeff> Ah, that's the problem right there: your triggers are out of malignment!
12:45	pinesol	csharp_: The operation succeeded. Quote #219 added.
12:46	csharp_	when it comes down to it, aren't we all just "waiting for more data from parent"?
12:48	* csharp_	is referring to the server error mentioned yesterday http://irc.evergreen-ils.org/evergreen/2021-12-14#i_497049
13:00	mmorgan	testing a file of 1000 records with 856 links on our 3.7 test system is not looking good. 6% progress after an hour, and most failed.
13:03	JBoyer	There are a couple dym-related patches that you may not have depending on your 3.7.X version. Since they're all just function updates they can be applied anytime and absolutely should be if you don't have them.
13:03	JBoyer	Sadly, I don't actually have a list handy to refer to...
13:05	jeff	...and Launchpad search returns 101 open tickets on a search for "did you mean"... :-P
13:05	jeff	bug 1931626
13:05	pinesol	Launchpad bug 1931626 in Evergreen "Did you mean: search suggestions exist for deleted records and can result in no hits" [Medium,Confirmed] https://launchpad.net/bugs/1931626
13:05	jeff	bug 1931625
13:05	pinesol	Launchpad bug 1931625 in Evergreen "Did you mean: diacritics cause erroneous search suggestions, resulting in no hits" [Undecided,New] https://launchpad.net/bugs/1931625
13:06	JBoyer	The big ones are lp 1931162 (3.7.2) and lp 1947173 (3.7.3)
13:06	pinesol	Launchpad bug 1931162 in Evergreen "Did You Mean optimization fails for some data sets" [High,Fix released] https://launchpad.net/bugs/1931162
13:06	pinesol	Launchpad bug 1947173 in Evergreen 3.7 "Did You Mean Symspell dictionary updates can significant slow record ingest" [High,Fix committed] https://launchpad.net/bugs/1947173
13:07	jeff	bug 1947173
13:07	JBoyer	sometimes Launchpad can search if you only want to look at fix committed and fix released. :D
13:07	jeff	looks like bug 1931162 made it into 3.7.2
13:07	pinesol	Launchpad bug 1931162 in Evergreen "Did You Mean optimization fails for some data sets" [High,Fix released] https://launchpad.net/bugs/1931162
13:07	mmorgan	1947173 we don't have in place, so that's one place to start
13:08	JBoyer	Yeah, the () was the version they were released in. I couldn't remember if they both made it out yet.
13:08	mmorgan	We should already have 1931162, but will double check
13:11	* mmorgan	confirms we do have 1931162
13:14	JBoyer	taking another look at the search results it looks like the released patches are the only ones I was really worried about, but if you don't have both you really, really want to get both asap.
13:15	JBoyer	The two jeff posted aren't fixed yet but that appears to be it for now.
13:16	mmorgan	So we can apply 1947173, and next try disabling the maintain_symspell_entries_tgr triggers to see if those steps affect the behavior we're seeing. That should shed more light.
13:16	mmorgan	jeff++
13:16	mmorgan	JBoyer++
13:16	mmorgan	csharp_++
13:16	mmorgan	jihpringle++
13:16	mmorgan	If anyone discovers anything more, I'd be interested!
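
The trigger-disabling step mmorgan plans at 13:16 could be prepared with a small helper that prints the statements for review before anything is run. This is a hedged Python sketch. The trigger name is the one csharp_ gives at 12:41, and `ALTER TABLE ... DISABLE TRIGGER` is standard PostgreSQL syntax, but the list of tables is an assumption that must be checked against your own database. `trigger_statements` is a hypothetical name, and metabib.keyword_field_entry is used only as an example of a metabib.*_entry table.

```python
import re

# Trigger name as given in the channel.
TRIGGER = "maintain_symspell_entries_tgr"

# Accept only simple schema.table names so nothing unexpected ends up in the SQL.
_QUALIFIED_NAME = re.compile(r"^[a-z_][a-z0-9_]*\.[a-z_][a-z0-9_]*$")


def trigger_statements(tables, enable=False):
    """Build ALTER TABLE statements that disable (or re-enable) the trigger."""
    action = "ENABLE" if enable else "DISABLE"
    statements = []
    for table in tables:
        if not _QUALIFIED_NAME.match(table):
            raise ValueError(f"unexpected table name: {table!r}")
        statements.append(f"ALTER TABLE {table} {action} TRIGGER {TRIGGER};")
    return statements
```

Generating the matching `enable=True` statements at the same time keeps the rollback next to the change.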
13:19	JBoyer	Fun fact about 1931162: it was initially found when accidentally searching for the word "metarecord" but it turns out that pretty much every ISBN search will trigger it, leaving you with Pg processes spinning for ages as they gather ISBNs in your system to recommend...
13:24	jeff	Well that's interesting... Chrome tells me that my https://host.foo.example.org/eg/opac/home and https://host.foo.example.org/eg/staff/home connections are secure and have valid certs, but if I dig down into [lock icon] -> Connection is secure -> Certificate is valid, for the /eg/staff/home tab it shows an expired certificate.
13:25	jeff	The certificate has not changed in the lifetime of my browser session.
13:25	jeff	The hostname in question points at a single IP fronted by a single nginx instance pointing at a single backend server.
13:26	JBoyer	I've seen that before also, but not tracked it down. How's the cert that apache's using look?
13:26	jeff	If I had to guess, I'd say that the expired certificate in question might be on the backend host... I'm just wondering how it's making its way to the browser, yet in a non-fatal way.
13:27	jeff	Hrm. Also need to determine if the paths in question are both going to the same backend. I think they are, but I should check. Again though, same question: why is my browser even getting a whiff of the backend cert?
13:30	jeff	co-worker reproduced, then did a shift-reload on the staff url, and the expired cert was no longer there.
13:32	JBoyer	Fun caching interactions, perhaps. I think it can take a shift-reload to force certs to be redownloaded
13:32	jeff	nope, backend apache instance is using a self-signed cert.
13:33	jeff	the cert in question would have been replaced in... February.
13:33	jeff	which predates this laptop (but not the account that Chrome is currently using for sync, so...)
14:44	jeff	Interesting. Record doesn't appear in keyword search but does show in series search for same terms. metabib.keyword_field_entry appears to have the keywords in the search, which is: who would win
15:03	jeff	somewhere along the process of looking at this, I found a special number of metabib.keyword_field_entry rows that matched one of my debug queries: 404.
15:03	jeff	:-P
15:05	jeff	also, I'm not sure I noticed before today that there's a reference in PostgreSQL documentation to The Sudbury Neutrino Detector.
15:07	mmorgan	:)
15:17	* mmorgan	is having another Weird Wednesday issue. Has anyone else had issues emailing bib records from the opac?
15:17	mmorgan	I get the preview, but when I click Email Now, no email, ever.
15:19	mmorgan	I can see the preview action trigger event in the database, that is complete, but that preview trigger is the only one I see.
15:26	mmorgan	Can anyone else confirm that they can successfully email bib records from their opac?
15:38		jihpringle100 joined #evergreen
15:54	collum	mmorgan: I can confirm. I tried several times in our opac and did not receive an email. The preview event is in the database.
15:56	mmorgan	collum++
15:56	mmorgan	Thanks, I will open a LP bug.
15:56	* mmorgan	was hoping it was just me :-(
16:36	csharp_	@decide everyone or just me
16:36	pinesol	csharp_: go with just me
16:36		jvwoolf left #evergreen
16:38	berick	almost sounds like a Prince song
16:39	csharp_	and log4j sounds like a Prince logging utility :-)
16:39	csharp_	Nothing Compares 2 Log4J
16:49	JBoyer	I quite enjoyed this log4j take: https://twitter.com/leanrum/status/1470954707120181253
17:01		mmorgan left #evergreen
18:00		jihpringle joined #evergreen
18:00	pinesol	News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
19:01		eglogbot joined #evergreen
19:01		Topic for #evergreen is now Welcome to #evergreen (https://evergreen-ils.org). This channel is publicly logged.
19:32		jihpringle joined #evergreen