Time | Nick | Message
07:29 | | collum joined #evergreen
08:31 | | kworstell-isl joined #evergreen
08:36 | | mmorgan joined #evergreen
08:50 | | Dyrcona joined #evergreen
08:57 | | rfrasur joined #evergreen
09:15 | | Stompro joined #evergreen
09:39 | | ejk joined #evergreen
09:39 | | jeffdavis_ joined #evergreen
09:46 | | dguarrac joined #evergreen
10:25 | | Christineb joined #evergreen
10:54 | Dyrcona | Hm.. I wonder how workable it would be to replace ISO-8859-1 copyright symbols with UTF-8 ones using a regex.... It seems like it would be simple, but it's the kind of thing that can lead to problems. I have a file of records that I can play with....
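(A minimal sketch of that naive byte-level substitution, assuming the record is handled as a raw, undecoded byte string; this is not code from the channel, and the guard problem discussed below is exactly what it ignores:)

    use strict;
    use warnings;

    # Naive idea: rewrite stray Latin-1 bytes as their UTF-8 encodings.
    # This is only safe if the byte is not already part of a valid UTF-8
    # sequence -- the pitfall discussed later in the conversation.
    sub naive_fix_bytes {
        my ($bytes) = @_;
        $bytes =~ s/\xA9/\xC2\xA9/g;    # copyright:  A9 -> C2 A9
        $bytes =~ s/\xAE/\xC2\xAE/g;    # registered: AE -> C2 AE
        return $bytes;
    }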
10:58 | Dyrcona | Looks like it happens with registered trademark symbol, too. (Gotta love vendor-supplied MARC records.)
11:06 | Bmagic | Love em indeed
11:16 | Dyrcona | Bmagic: Does your load process handle things like that? I have a --strict option on one of my load programs that rejects records with bad characters; well, any warnings are treated as errors, really.
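(For illustration only -- not taken from that load program -- one common way to get the "treat any warning as an error" behaviour in Perl is to promote warnings to exceptions while reading each record:)

    my $record = eval {
        local $SIG{__WARN__} = sub { die @_ };   # warnings become exceptions
        $batch->next();                          # $batch assumed to be a MARC::Batch
    };
    if ($@) {
        print STDERR "Rejecting record: $@";     # skip it instead of loading it
    }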
11:16 | Dyrcona | I'm considering adding code to fix copyright and registered trademark symbols since they seem to be a thing with this one vendor in particular.
11:18 | Bmagic | I wrote some code that tries to parse the records one way and then again another way and chooses the best automatically, using some evals... It's an internal project where we reach out to various vendor websites using Selenium, download weekly/monthly records, and import them into the catalog.
11:20 | Bmagic | no, I don't think I have something that specifically handles character conversion from one character map to another.
11:21 | Stompro | mmorgan, Content Cafe seems to still be having issues. Our cataloger reports that many new items don't have cover art in Evergreen, but do in Titlesource360. But in general I'm seeing the majority of cover art show up.
11:21 | Bmagic | Dyrcona: but FWIW: https://github.com/mcoia/sierra_marc_tools/blob/master/auto_rec_load/dataHandler.pm around line 800, if it dies, it will failover to readMARCFileRaw
11:22 | Dyrcona | Bmagic: Thanks. I've had a glance at that code before. My issue isn't reading the records. We get these warnings when loading them: utf8 "\xA9" does not map to Unicode at /usr/lib/x86_64-linux-gnu/perl/5.26/Encode.pm line 212, <GEN1> chunk 300.
11:23 | Dyrcona | They will load in the database, if I let them go in.
11:24 | Dyrcona | The warnings don't occur while preprocessing the records using MARC::Record to modify the 856 tags.
11:24 | Bmagic | writing something to transcode one character to another seems doable. But I've not done it. Sorry :(
11:25 | Dyrcona | Yeah, it's actually transcoding to 2 characters \xA9 -> \xC2\xA9.
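(That byte mapping is exactly what re-encoding Latin-1 as UTF-8 produces; a quick sketch with the core Encode module on a made-up sample string -- the catch, as discussed below, is that the vendor records are mostly valid UTF-8 already, not pure Latin-1:)

    use Encode qw(decode encode);

    # Re-encoding a Latin-1 byte string as UTF-8 turns 0xA9 into 0xC2 0xA9.
    my $latin1_bytes = "\xA9 2024";
    my $utf8_bytes   = encode('UTF-8', decode('ISO-8859-1', $latin1_bytes));
    printf "%vX\n", $utf8_bytes;    # C2.A9.20.32.30.32.34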
11:25 | Bmagic | it sounds like you'll end up having to read the file character by character instead of letting MARC::Record do it?
11:26 | mmorgan | Stompro: A quick look through our catalog doesn't show a lack of images.
11:26 | Dyrcona | Well, I was thinking of looping through the datafields and checking the content of every subfield, but the trick is limiting it to invalid UTF-8 character sequences.
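(A rough sketch of that loop, assuming the records are already parsed with MARC::Record; fix_subfield_text() is a hypothetical helper that does the actual byte repair, which is the hard part being discussed:)

    use MARC::Record;
    use MARC::Field;

    sub fix_record {
        my ($record) = @_;
        for my $field ($record->fields()) {
            next if $field->is_control_field();
            # subfields() returns a list of [code, value] pairs
            my @fixed = map { ($_->[0], fix_subfield_text($_->[1])) }
                        $field->subfields();
            my $new = MARC::Field->new(
                $field->tag(), $field->indicator(1), $field->indicator(2), @fixed
            );
            $field->replace_with($new);
        }
        return $record;
    }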
11:27 | Bmagic | seems a little tricky since the destination 2 characters include the converted-from character. It'd require some logic (regex) to be sure the \xA9 isn't already preceded by \xC2
11:27 | Dyrcona | Well, that, and any other legal UTF-8 sequence containing the target to be changed has to be skipped.
11:28 | Bmagic | the plot thickens
11:28 | Dyrcona | If it was just \xC2, that would actually be easy with a regex.
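(A sketch of the "skip anything that's already a valid multi-byte sequence" idea the next few lines circle around, assuming the data is handled as raw bytes; the lead-byte ranges are only approximate -- they don't exclude surrogates or every overlong form:)

    # Match well-formed multi-byte UTF-8 sequences first and keep them; only
    # rewrite 0xA9/0xAE bytes that are left standing on their own.
    my $utf8_seq = qr/
        [\xC2-\xDF][\x80-\xBF]            # 2-byte sequences
      | \xE0[\xA0-\xBF][\x80-\xBF]        # 3-byte sequences
      | [\xE1-\xEF][\x80-\xBF]{2}
      | \xF0[\x90-\xBF][\x80-\xBF]{2}     # 4-byte sequences
      | [\xF1-\xF4][\x80-\xBF]{3}
    /x;

    sub fix_stray_latin1 {
        my ($bytes) = @_;
        $bytes =~ s/($utf8_seq)|([\xA9\xAE])/
            defined $1 ? $1 : "\xC2" . $2
        /ge;
        return $bytes;
    }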
11:28 | Dyrcona | jwz is right about regular expressions. :)
11:29 | Bmagic | it sounds like you'll have to cross-reference all* of the legal preceding UTF-8 characters, and weed/convert out the ones that don't appear in the cross reference?
11:30 | Stompro | mmorgan, it seems like they are not loading new images into the http://contentcafe2.btol.com system right now, so our cataloger just noticed that there's no cover art for items that just arrived from B&T.
11:30 | Bmagic | or maybe you're simply looking for a \xA9 that is preceded by a space character (seems like you could miss some)
11:31 | Stompro | mmorgan, There was also a fixed cover art image from last week that is still showing up wrong in http://contentcafe2.btol.com.
11:37 | Dyrcona | Bmagic: My thoughts are heading back in the direction of "it's not worth the effort."
11:38 | Bmagic | Dyrcona: the results would be broken copyright symbols sometimes?
11:38 | Bmagic | or wing dings in the marc
11:39 | mmorgan | Stompro: I'm not able to isolate new B&T bibs, but will keep my eyes open. Thanks for the heads up!
11:42 | Dyrcona | Bmagic: We use a program very much like this one https://pastebin.com/g4RGDJLr to load electronic resource records into Evergreen. We use the --strict option to prevent really bad records from getting in. It also stops those with minor issues like the one we're discussing.
11:43 | Dyrcona | We also run a preprocess program on the files to fix the 856 tags before loading so that cataloging staff don't have to do it manually. I'm considering checking for certain "bad" characters in there and fixing them.
11:46 | Bmagic | I see
11:51 | Bmagic | I'm doing a presentation this year on bug 1947898
11:51 | pinesol | Launchpad bug 1947898 in Evergreen "Enhanced MARC importer script electronic_marc_import.pl" [Wishlist,Confirmed] https://launchpad.net/bugs/1947898
11:53 | pinesol | News from commits: Docs: Update describing_your_organization.adoc <https://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=e744b5f388aafdb095e5c30b1ed57bb6ae7f1eda>
11:54 | Dyrcona | Yeah, I was going to review that at some point, but things keep coming up.
11:55 | Bmagic | I saw you were assigned last year. It's all good, something this complicated takes time to review. Time that we don't have!
12:08 | | jihpringle joined #evergreen
12:25 | | kworstell-isl joined #evergreen
12:27 | Dyrcona | Interesting... I dumped the rejected records with the copyright and registered trademark symbols to XML, and they have other issues.
12:43 | Dyrcona | Hmm. Our process may be mangling the records further....
12:45 | Dyrcona | Comparing the input record with the rejected record, it looks like the "bad" copyright symbol is causing problems with reading the rest of the record. At least when I convert the two to marcxml with yaz-marcdump.
14:01 | Dyrcona | Turns out to be more bad encoding handling in my code. Dunno why I've had so much trouble with something so "simple."
14:02 | Dyrcona | "I said it was simple. I never said it was easy."
14:03 | Dyrcona | I hope I don't end up having to add special encoding tricks per vendor. I seem to recall having it this way in the beginning, and it worked for all but 1 vendor who claims to send UTF-8 records.
15:28 | Bmagic | Dyrcona: I feel your pain
16:19 | | mmorgan1 joined #evergreen
16:25 | | mmorgan joined #evergreen
16:28 | Dyrcona | Bmagic: I think it is fixed now, and in the end, the fix was obvious: use the ':utf8' I/O layer on input and output with UTF-8 files.
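(For reference, a sketch of asking for that layer explicitly -- file and handle names here are hypothetical, and the actual scripts may do it differently; ':encoding(UTF-8)' is the stricter variant that validates bytes and produces the "does not map to Unicode" warnings seen earlier:)

    open(my $in,  '<:encoding(UTF-8)', $infile)  or die "open $infile: $!";
    open(my $out, '>:utf8',            $outfile) or die "open $outfile: $!";

    # or, on a handle that is already open:
    binmode($fh, ':utf8');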
16:37 | | mmorgan1 joined #evergreen
17:07 | jeffdavis | SIPServer is returning non-ASCII characters even though the encoding is explicitly set to "ascii" in SIP config - very puzzling
17:11 | | mmorgan1 left #evergreen