Time | Nick | Message
07:29 | | collum joined #evergreen
08:31 | | kworstell-isl joined #evergreen
08:36 | | mmorgan joined #evergreen
08:50 | | Dyrcona joined #evergreen
08:57 | | rfrasur joined #evergreen
09:15 | | Stompro joined #evergreen
09:39 | | ejk joined #evergreen
09:39 | | jeffdavis_ joined #evergreen
09:46 | | dguarrac joined #evergreen
10:25 | | Christineb joined #evergreen
10:54 | Dyrcona | Hm.. I wonder how workable it would be to replace ISO-8859-1 copyright symbols with UTF-8 ones using a regex.... It seems like it would be simple, but it's the kind of thing that can lead to problems. I have a file of records that I can play with....
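(A minimal sketch of that naive byte-level substitution, assuming the record is handled as a raw, undecoded byte string; this is not code from the channel, and the guard problem discussed below is exactly what it ignores:)

    use strict;
    use warnings;

    # Naive idea: rewrite stray Latin-1 bytes as their UTF-8 encodings.
    # This is only safe if the byte is not already part of a valid UTF-8
    # sequence -- the pitfall discussed later in the conversation.
    sub naive_fix_bytes {
        my ($bytes) = @_;
        $bytes =~ s/\xA9/\xC2\xA9/g;    # copyright:  A9 -> C2 A9
        $bytes =~ s/\xAE/\xC2\xAE/g;    # registered: AE -> C2 AE
        return $bytes;
    }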
10:58 | Dyrcona | Looks like it happens with registered trademark symbol, too. (Gotta love vendor-supplied MARC records.)
11:06 | Bmagic | Love em indeed
11:16 | Dyrcona | Bmagic: Does your load process handle things like that? I have a --strict option on one of my load programs that rejects records with bad characters; well, any warnings are treated as errors, really.
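(For illustration only -- not taken from that load program -- one common way to get the "treat any warning as an error" behaviour in Perl is to promote warnings to exceptions while reading each record:)

    my $record = eval {
        local $SIG{__WARN__} = sub { die @_ };   # warnings become exceptions
        $batch->next();                          # $batch assumed to be a MARC::Batch
    };
    if ($@) {
        print STDERR "Rejecting record: $@";     # skip it instead of loading it
    }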
11:16 | Dyrcona | I'm considering adding code to fix copyright and registered trademark symbols since they seem to be a thing with this one vendor in particular.
11:18 | Bmagic | I wrote some code that tries to parse the records one way and then again another way and chooses the best automatically, using some evals... It's an internal project where we reach out to various vendor websites using Selenium, download weekly/monthly records, and import them into the catalog.
11:20 | Bmagic | no, I don't think I have something that specifically handles character conversion from one character map to another.
11:21 | Stompro | mmorgan, Content Cafe seems to still be having issues. Our cataloger reports that many new items don't have cover art in Evergreen, but do in Titlesource360. But in general I'm seeing the majority of cover art show up.
11:21 | Bmagic | Dyrcona: but FWIW: https://github.com/mcoia/sierra_marc_tools/blob/master/auto_rec_load/dataHandler.pm around line 800, if it dies, it will failover to readMARCFileRaw
11:22 | Dyrcona | Bmagic: Thanks. I've had a glance at that code before. My issue isn't reading the records. We get these warnings when loading them: utf8 "\xA9" does not map to Unicode at /usr/lib/x86_64-linux-gnu/perl/5.26/Encode.pm line 212, <GEN1> chunk 300.
11:23 | Dyrcona | They will load in the database, if I let them go in.
11:24 | Dyrcona | The warnings don't occur while preprocessing the records using MARC::Record to modify the 856 tags.
11:24 | Bmagic | writing something to transcode one character to another seems doable. But I've not done it. Sorry :(
11:25 | Dyrcona | Yeah, it's actually transcoding to 2 characters \xA9 -> \xC2\xA9.
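(That byte mapping is exactly what re-encoding Latin-1 as UTF-8 produces; a quick sketch with the core Encode module on a made-up sample string -- the catch, as discussed below, is that the vendor records are mostly valid UTF-8 already, not pure Latin-1:)

    use Encode qw(decode encode);

    # Re-encoding a Latin-1 byte string as UTF-8 turns 0xA9 into 0xC2 0xA9.
    my $latin1_bytes = "\xA9 2024";
    my $utf8_bytes   = encode('UTF-8', decode('ISO-8859-1', $latin1_bytes));
    printf "%vX\n", $utf8_bytes;    # C2.A9.20.32.30.32.34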
11:25 | Bmagic | it sounds like you'll end up having to read the file character by character instead of letting MARC::Record do it?
11:26 | mmorgan | Stompro: A quick look through our catalog doesn't show a lack of images.
11:26 | Dyrcona | Well, I was thinking of looping through the datafields and checking the content of every subfield, but the trick is limiting it to invalid UTF-8 character sequences.
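(A rough sketch of that loop, assuming the records are already parsed with MARC::Record; fix_subfield_text() is a hypothetical helper that does the actual byte repair, which is the hard part being discussed:)

    use MARC::Record;
    use MARC::Field;

    sub fix_record {
        my ($record) = @_;
        for my $field ($record->fields()) {
            next if $field->is_control_field();
            # subfields() returns a list of [code, value] pairs
            my @fixed = map { ($_->[0], fix_subfield_text($_->[1])) }
                        $field->subfields();
            my $new = MARC::Field->new(
                $field->tag(), $field->indicator(1), $field->indicator(2), @fixed
            );
            $field->replace_with($new);
        }
        return $record;
    }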
11:27 | Bmagic | seems a little tricky since the destination 2 characters include the converted-from character. It'd require some logic (regex) to be sure the \xA9 isn't already preceded by \xC2
11:27 | Dyrcona | Well, that, and any other legal UTF-8 sequence containing the target to be changed has to be skipped.
11:28 | Bmagic | the plot thickens
11:28 | Dyrcona | If it was just \xC2, that would actually be easy with a regex.
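(A sketch of the "skip anything that's already a valid multi-byte sequence" idea the next few lines circle around, assuming the data is handled as raw bytes; the lead-byte ranges are only approximate -- they don't exclude surrogates or every overlong form:)

    # Match well-formed multi-byte UTF-8 sequences first and keep them; only
    # rewrite 0xA9/0xAE bytes that are left standing on their own.
    my $utf8_seq = qr/
        [\xC2-\xDF][\x80-\xBF]            # 2-byte sequences
      | \xE0[\xA0-\xBF][\x80-\xBF]        # 3-byte sequences
      | [\xE1-\xEF][\x80-\xBF]{2}
      | \xF0[\x90-\xBF][\x80-\xBF]{2}     # 4-byte sequences
      | [\xF1-\xF4][\x80-\xBF]{3}
    /x;

    sub fix_stray_latin1 {
        my ($bytes) = @_;
        $bytes =~ s/($utf8_seq)|([\xA9\xAE])/
            defined $1 ? $1 : "\xC2" . $2
        /ge;
        return $bytes;
    }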
11:28 | Dyrcona | jwz is right about regular expressions. :)
11:29 | Bmagic | it sounds like you'll have to cross-reference all* of the legal preceding UTF-8 characters, and weed/convert out the ones that don't appear in the cross reference?
11:30 | Stompro | mmorgan, it seems like they are not loading new images into the http://contentcafe2.btol.com system right now, so our cataloger just noticed that there's no cover art for items that just arrived from B&T.
11:30 | Bmagic | or maybe you're simply looking for a \xA9 that is preceded by a space character (seems like you could miss some)
11:31 | Stompro | mmorgan, There was also a fixed cover art image from last week that is still showing up wrong in http://contentcafe2.btol.com.
11:37 | Dyrcona | Bmagic: My thoughts are heading back in the direction of "it's not worth the effort."
11:38 | Bmagic | Dyrcona: the results would be broken copyright symbols sometimes?
11:38 | Bmagic | or wing dings in the marc
11:39 | mmorgan | Stompro: I'm not able to isolate new B&T bibs, but will keep my eyes open. Thanks for the heads up!
11:42 | Dyrcona | Bmagic: We use a program very much like this one https://pastebin.com/g4RGDJLr to load electronic resource records into Evergreen. We use the --strict option to prevent really bad records from getting in. It also stops those with minor issues like the one we're discussing.
11:43 | Dyrcona | We also run a preprocess program on the files to fix the 856 tags before loading so that cataloging staff don't have to do it manually. I'm considering checking for certain "bad" characters in there and fixing them.
11:46 | Bmagic | I see
11:51 | Bmagic | I'm doing a presentation this year on bug 1947898
11:51 | pinesol | Launchpad bug 1947898 in Evergreen "Enhanced MARC importer script electronic_marc_import.pl" [Wishlist,Confirmed] https://launchpad.net/bugs/1947898
11:53 | pinesol | News from commits: Docs: Update describing_your_organization.adoc <https://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=e744b5f388aafdb095e5c30b1ed57bb6ae7f1eda>
11:54 | Dyrcona | Yeah, I was going to review that at some point, but things keep coming up.
11:55 | Bmagic | I saw you were assigned last year. It's all good, something this complicated takes time to review. Time that we don't have!
12:08 | | jihpringle joined #evergreen
12:25 | | kworstell-isl joined #evergreen
12:27 | Dyrcona | Interesting... I dumped the rejected records with the copyright and registered trademark symbols to XML, and they have other issues.
12:43 | Dyrcona | Hmm. Our process may be mangling the records further....
12:45 | Dyrcona | Comparing the input record with the rejected record, it looks like the "bad" copyright symbol is causing problems with reading the rest of the record. At least when I convert the two to marcxml with yaz-marcdump.
14:01 | Dyrcona | Turns out to be more bad encoding handling in my code. Dunno why I've had so much trouble with something so "simple."
14:02 | Dyrcona | "I said it was simple. I never said it was easy."
14:03 | Dyrcona | I hope I don't end up having to add special encoding tricks per vendor. I seem to recall having it this way in the beginning, and it worked for all but 1 vendor who claims to send UTF-8 records.
15:28 | Bmagic | Dyrcona: I feel your pain
16:19 | | mmorgan1 joined #evergreen
16:25 | | mmorgan joined #evergreen
16:28 | Dyrcona | Bmagic: I think it is fixed now, and in the end, the fix was obvious: use the ':utf8' I/O layer on input and output with UTF-8 files.
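(For reference, a sketch of asking for that layer explicitly -- file and handle names here are hypothetical, and the actual scripts may do it differently; ':encoding(UTF-8)' is the stricter variant that validates bytes and produces the "does not map to Unicode" warnings seen earlier:)

    open(my $in,  '<:encoding(UTF-8)', $infile)  or die "open $infile: $!";
    open(my $out, '>:utf8',            $outfile) or die "open $outfile: $!";

    # or, on a handle that is already open:
    binmode($fh, ':utf8');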
16:37 | | mmorgan1 joined #evergreen
17:07 | jeffdavis | SIPServer is returning non-ASCII characters even though the encoding is explicitly set to "ascii" in SIP config - very puzzling
17:11 | | mmorgan1 left #evergreen