Evergreen ILS Website

IRC log for #evergreen, 2023-03-13


All times shown according to the server's local time.

Time Nick Message
07:29 collum joined #evergreen
08:31 kworstell-isl joined #evergreen
08:36 mmorgan joined #evergreen
08:50 Dyrcona joined #evergreen
08:57 rfrasur joined #evergreen
09:15 Stompro joined #evergreen
09:39 ejk joined #evergreen
09:39 jeffdavis_ joined #evergreen
09:46 dguarrac joined #evergreen
10:25 Christineb joined #evergreen
10:54 Dyrcona Hm.. I wonder how workable it would be to replace ISO-8859-1 copyright symbols with UTF-8 ones using a regex.... It seems like it would be simple, but it's the kind of thing that can lead to problems. I have a file of records that I can play with....
10:58 Dyrcona Looks like it happens with registered trademark symbol, too. (Gotta love vendor-supplied MARC records.)
11:06 Bmagic Love em indeed
11:16 Dyrcona Bmagic: Does your load process handle things like that? I have a --strict option on one of my load programs that rejects records with bad characters, well any warnings are treated as errors, really.
11:16 Dyrcona I'm considering adding code to fix copyright and registered trademark symbols since they seem to be a thing with this one vendor in particular.
11:18 Bmagic I wrote some code that would try to parse the records one way and then again another way and choose the best automatically, using some evals... An internal project where we reach out to various vendor websites using selenium and download weekly/monthly records and import into the catalog
11:20 Bmagic no, I don't think I have something that specifically handles character conversion from one character map to another.
11:21 Stompro mmorgan, Content Cafe seems to still be having issues.  Our cataloger reports that many new items don't have cover art in Evergreen, but do in Titlesource360.  But in general I'm seeing the majority of cover art show up.
11:21 Bmagic Dyrcona: but FWIW: https://github.com/mcoia/sierra_marc_tools/blob/master/auto_rec_load/dataHandler.pm around line 800, if it dies, it will failover to readMARCFileRaw
11:22 Dyrcona Bmagic: Thanks. I've had a glance at that code before. My issue isn't reading the records. We get these warnings when loading them: utf8 "\xA9" does not map to Unicode at /usr/lib/x86_64-linux-gnu/perl/5.26/Encode.pm line 212, <GEN1> chunk 300.
11:23 Dyrcona They will load in the database, if I let them go in.
11:24 Dyrcona The warnings don't occur while preprocessing the records using MARC::Record to modify the 856 tags.
11:24 Bmagic writing something to transcode one character to another seems doable. But I've not done it. Sorry :(
11:25 Dyrcona Yeah, it's actually transcoding to 2 characters \xA9 -> \xC2\xA9.
11:25 Bmagic it sounds like you'll end up having to read the file character by character instead of letting MARC::Record do it?
11:26 mmorgan Stompro: A quick look through our catalog doesn't show a lack of images.
11:26 Dyrcona Well, I was thinking of looping through the datafields and checking the content of every subfield, but the trick is limiting it to invalid UTF-8 character sequences.
11:27 Bmagic seems a little tricky since the destination 2-characters includes the converted-from character. It'd require some logic (regex) to be sure that it doesn't have \xC2
11:27 Dyrcona Well that and any other legal UTF-8 sequence containing the target to be changed has to be skipped.
11:28 Bmagic the plot thickens
11:28 Dyrcona If it was just \xC2, that would actually be easy with a regex.
11:28 Dyrcona jwz is right about regular expressions. :)
11:29 Bmagic it sounds like you'll have to cross-reference all* of the legal preceding UTF8 characters, and weed/convert out the ones that don't appear in the cross reference?
11:30 Stompro mmorgan, it seems like they are not loading new images into the http://contentcafe2.btol.com system right now, so our cataloger just noticed seeing no cover art for items that just arrived from B&T.
11:30 Bmagic or maybe you're simply looking for a \xA9 that is preceded by a space character (seems like you could miss some)
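
The problem discussed above (turn a stray ISO-8859-1 byte such as \xA9 into its UTF-8 form \xC2\xA9 without touching bytes that already belong to a legal UTF-8 sequence) can be sketched without a hand-written regex. The following is a minimal Python sketch, not code from anyone in the channel: it lets the UTF-8 decoder find the invalid bytes and decodes only those as ISO-8859-1. The handler name `latin1_fallback` and the function `repair` are hypothetical names chosen for illustration.

```python
import codecs


def latin1_fallback(err):
    """Error handler: decode only the undecodable bytes as ISO-8859-1."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Resume decoding right after the bytes that were rejected.
    return bad.decode("latin-1"), err.end


# Register the handler under a hypothetical name so decode() can use it.
codecs.register_error("latin1_fallback", latin1_fallback)


def repair(raw: bytes) -> bytes:
    """Return UTF-8 bytes, converting stray ISO-8859-1 bytes along the way."""
    return raw.decode("utf-8", errors="latin1_fallback").encode("utf-8")


# A bare \xA9 becomes \xC2\xA9; an existing \xC3\xA9 ("é") is left alone
# even though it contains the byte \xA9.
fixed = repair(b"caf\xc3\xa9 \xa9 2023 \xae")
```

Two caveats, stated as assumptions rather than facts about the vendor's files: an ISO-8859-1 byte pair that happens to form valid UTF-8 cannot be detected this way, and because MARC record lengths and directory offsets are byte counts, a repair like this is probably safer applied to subfield content before the record is re-serialized than to the raw file.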
11:31 Stompro mmorgan, There was also a fixed cover art image from last week that is still showing up wrong in http://contentcafe2.btol.com.
11:37 Dyrcona Bmagic: My thoughts are heading back into the direction of "it's not worth the effort."
11:38 Bmagic Dyrcona: the results would be broken copyright symbols sometimes?
11:38 Bmagic or wing dings in the marc
11:39 mmorgan Stompro: I'm not able to isolate new B&T bibs, but will keep my eyes open. Thanks for the heads up!
11:42 Dyrcona Bmagic: We use a program very much like this one https://pastebin.com/g4RGDJLr to load electronic resource records into Evergreen. We use the --strict option to prevent really bad records from getting in. It also stops those with minor issues like the one we're discussing.
11:43 Dyrcona We also run a preprocess program on the files to fix the 856 tags before loading so that cataloging staff don't have to do it manually. I'm considering checking for certain "bad" characters in there and fixing them.
11:46 Bmagic I see
11:51 Bmagic I'm doing a presentation this year on bug 1947898
11:51 pinesol Launchpad bug 1947898 in Evergreen "Enhanced MARC importer script electronic_marc_import.pl" [Wishlist,Confirmed] https://launchpad.net/bugs/1947898
11:53 pinesol News from commits: Docs: Update describing_your_organization.adoc <https://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=e744b5f388aafdb095e5c30b1ed57bb6ae7f1eda>
11:54 Dyrcona Yeah, I was going to review that at some point, but things keep coming up.
11:55 Bmagic I saw you were assigned last year. It's all good, something this complicated takes time to review. Time that we don't have!
12:08 jihpringle joined #evergreen
12:25 kworstell-isl joined #evergreen
12:27 Dyrcona Interesting... I dumped the rejected records with the copyright and registered trademark symbols to XML, and they have other issues.
12:43 Dyrcona Hmm. Our process may be mangling the records further....
12:45 Dyrcona Comparing the input record with the rejected record, it looks like the "bad" copyright symbol is causing problems with reading the rest of the record. At least when I convert the two to marcxml with yaz-marcdump.
14:01 Dyrcona Turns out to be more bad encoding handling in my code. Dunno why I've had so much trouble with something so "simple."
14:02 Dyrcona "I said it was simple. I never said it was easy."
14:03 Dyrcona I hope I don't end up having to add special encoding tricks per vendor. I seem to recall having it this way in the beginning, and it worked for all but 1 vendor who claims to send UTF-8 records.
15:28 Bmagic Dyrcona: I feel your pain
16:19 mmorgan1 joined #evergreen
16:25 mmorgan joined #evergreen
16:28 Dyrcona Bmagic: I think it is fixed now, and in the end, the fix was obvious: use the ':utf8' binary layer on input and output with UTF-8 files.
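
The log does not show the actual Perl change, so the following is only a Python analog of the idea in the message above: declare UTF-8 on both the input and the output stream instead of letting either side default to another encoding. The sample text and variable names are made up for illustration.

```python
import io

text = "© 2023 Café ®"
utf8_bytes = text.encode("utf-8")

# Wrong: read the UTF-8 bytes as ISO-8859-1, then write them out as UTF-8.
# Every multi-byte character is encoded a second time and gets mangled.
mangled = utf8_bytes.decode("latin-1").encode("utf-8")

# Right: declare UTF-8 explicitly on both the reader and the writer.
reader = io.TextIOWrapper(io.BytesIO(utf8_bytes), encoding="utf-8")
out_buf = io.BytesIO()
writer = io.TextIOWrapper(out_buf, encoding="utf-8")
writer.write(reader.read())
writer.flush()
round_tripped = out_buf.getvalue()
```

With matching encodings on both ends the bytes survive unchanged, while the mismatched path produces the familiar doubled characters in front of each symbol.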
16:37 mmorgan1 joined #evergreen
17:07 jeffdavis SIPServer is returning non-ASCII characters even though the encoding is explicitly set to "ascii" in SIP config - very puzzling
17:11 mmorgan1 left #evergreen
