Evergreen ILS Website

IRC log for #evergreen, 2021-06-02


All times shown according to the server's local time.

Time Nick Message
00:06 JBoyer joined #evergreen
00:07 sandbergja joined #evergreen
01:09 sandbergja joined #evergreen
01:25 sandbergja joined #evergreen
06:00 pinesol News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
06:45 rlefaive joined #evergreen
07:13 rjackson_isl_hom joined #evergreen
07:16 mantis2 joined #evergreen
08:15 rlefaive joined #evergreen
08:27 Dyrcona joined #evergreen
08:30 Stompro joined #evergreen
08:40 rfrasur joined #evergreen
08:46 mmorgan joined #evergreen
08:47 alynn26 joined #evergreen
09:24 jvwoolf joined #evergreen
10:03 rlefaive joined #evergreen
10:35 sandbergja joined #evergreen
10:52 Keith-isl joined #evergreen
10:58 pinesol [evergreen|Jane Sandberg] LP1922120: Add to carousel action in angular catalog - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=ec252ff>
10:59 jvwoolf joined #evergreen
11:12 pinesol [evergreen|Terran McCanna] LP1908619 Adjustments to Staff Search Preferences Page - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=20bd496>
11:12 pinesol [evergreen|Galen Charlton] LP#1908619: add a release notes sentence - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=c8072fb>
11:23 * gmcharlt claims 1264
11:28 pinesol [evergreen|Mike Rylander] LP#1778955: Remove our custom version of array_remove(anyarray,anyelement) - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=460d0c0>
11:28 pinesol [evergreen|Jane Sandberg] LP#1778955: fixing upgrade script, removing duplicate function definition - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=8c79f68>
11:28 pinesol [evergreen|Galen Charlton] LP#1778955: stamp schema update - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=d4f5a2e>
12:17 jihpringle joined #evergreen
12:21 Bmagic Dyrcona: I have a foggy memory of you talking about having some perl code that "figures out" the character encoding in a marc file?
12:32 rhamby Bmagic: are you looking to sort out all possible encodings (not possible) or just figure out if marc8 is hiding in a unicode declared file or the like?
12:34 rhamby there are too many ways for multibyte encodings to be processed to reliably discover them all, heck one sequence could be valid in multiple encoding schemes
12:34 rhamby but if you're looking at the most common like marc8 versus unicode or utf8 you can usually tease them apart
12:35 Dyrcona The python chardet module does a pretty good job.
12:36 rhamby if you're using perl, the byte that precedes a marc8 combining character evaluates via ord as 225, so that's a pretty good indicator the byte following is marc8
12:36 rhamby I usually scan for that and have it print them to visually eyeball and spot problems pretty fast
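[Editor's note: rhamby's rule of thumb above can be sketched in Python. The specific marker byte (225, i.e. 0xE1, an ANSEL combining diacritic) is taken from the conversation; this is a rough heuristic, not a complete encoding detector.]

```python
def looks_like_marc8(raw, marker=225):
    """Heuristic from the discussion above: byte 225 (0xE1) precedes a
    MARC-8/ANSEL combining character, but a bare 0xE1 followed by an
    ASCII letter is not valid UTF-8.  So finding the marker byte in
    data that fails to decode as UTF-8 is a decent hint it's MARC-8."""
    if marker not in raw:
        return False
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True   # marker byte present and not valid UTF-8
    return False      # decodes cleanly; probably genuine UTF-8
```

As rhamby notes below, running this byte-by-byte over a whole file is slow; scanning just a few high-value fields (titles, authors) is usually enough to decide whether to dig deeper.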
12:37 Dyrcona Bmagic: I did a quick perusal of my scripts and I don't see anything like what you mentioned.
12:37 collum joined #evergreen
12:38 Dyrcona The real problem is when 1 MARC file contains text in different encodings, or worse Windows-1252 with smart quotes.
12:38 rhamby yeah, that's why my solution does a decent job of finding the issues
12:39 rhamby note that breaking strings into arrays and scanning with ord can be really really slow on big files, but I cheat and usually just do titles and authors, and that's generally a good indicator of whether I need to dig deeper
12:39 rhamby and yeah, the declaration of whether the file is marc8 vs unicode is usually more a hopeful statement than anything factual
12:39 Dyrcona I've used python chardet with non-marc data. It could be used on a field by field basis with pymarc.
12:40 Dyrcona Unicode is often spelt ISO8859-X (where X is a number). :)
12:42 Dyrcona That should be misspelt, not spelt. :)
12:48 Dyrcona I do have a script that spits out the record warnings and the encoding as understood by MARC::Record.
12:49 collum joined #evergreen
12:50 Dyrcona Bmagic: https://pastebin.com/Pef0KLeL
13:00 sandbergja joined #evergreen
13:10 nfBurton joined #evergreen
13:17 jvwoolf joined #evergreen
14:17 jvwoolf joined #evergreen
14:26 csharp joined #evergreen
14:59 jihpringle joined #evergreen
15:00 Bmagic Dyrcona++
15:00 Bmagic rhamby++
15:06 Keith_isl joined #evergreen
15:06 RFrasur_ joined #evergreen
15:42 Bmagic Drycona rhamby: I'm reading a mrc file in like this $file = MARC::File::USMARC->in($file). Then loop the records: while ( my $marc = $file->next() ). And for each marc record, I convert it to XML: $thisXML =  $marc->as_xml(); followed by a dozen regex replacers.
15:45 Bmagic pretty sure it's fine as long as the file is encoded with <not really sure> utf8 maybe. So, what I think* I need to do, is read the file record by record like you have there in your script Dyrcona. Attempt to load the data into MARC::Record, catch the errors, and automatically load the file again but with the "right" character encoding. Or maybe just loop through the records manually to begin with and ditch MARC::File?
15:47 Bmagic correction "$file = MARC::File::USMARC->in($filename)"
15:48 Dyrcona Are you having any particular problems other than encoding?
15:49 Bmagic I'm thinking of potential issues with the way this script reads the records to make it more "compatible" for the masses
15:50 Dyrcona My script reads the records that way because (a) that's how MARC::File::USMARC->in() does it, but also (b) when you have "smart quotes" pasted into a field, you actually need to split records on \x1E\x1D because \x1D is in the smart quote sequence.
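[Editor's note: the two-byte split Dyrcona describes can be sketched like this in Python. This is a sketch of the technique, not MARC::File's actual implementation.]

```python
FIELD_TERM = b"\x1e"   # ISO 2709 field terminator
RECORD_TERM = b"\x1d"  # ISO 2709 record terminator

def split_records(raw):
    """Split a raw USMARC blob into individual records.  Splitting on
    the two-byte sequence \\x1e\\x1d (field terminator immediately
    followed by record terminator) instead of \\x1d alone avoids false
    splits when a stray \\x1d byte lands inside field data, e.g. from
    pasted smart quotes."""
    records = []
    for chunk in raw.split(FIELD_TERM + RECORD_TERM):
        if chunk:  # skip the empty tail after the final terminator
            records.append(chunk + FIELD_TERM + RECORD_TERM)
    return records
```

A naive split on \x1d alone would cut the second record below in half; splitting on \x1e\x1d keeps it intact.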
15:51 Dyrcona If you're having encoding issues with some records, I'd suggest trying pymarc and chardet to go over each field. You can then convert the data field by field if necessary.
15:51 Bmagic wow, I guess the question is: should I write this to support "smart quotes"
15:52 Bmagic maybe the contribution needs to land in MARC::File instead of my script?
15:52 Dyrcona MARC::File may already work with smart quotes. I know that tsbere opened a ticket on CPAN about it. I don't know if his patch ever went in.
15:53 Dyrcona gmcharlt should know, as I think he is one of the maintainers of MARC::File.
15:55 gmcharlt I should check the patch queue, but yeah, at the moment smart quotes would break MARC::File::USMARC's expectations
15:56 Dyrcona I was just looking at rt.cpan.org.
15:56 Dyrcona I couldn't find the bug report.
15:58 * Dyrcona is mildly surprised that his CPAN id still works. :)
15:58 Dyrcona Found the issue.
15:59 sandbergja joined #evergreen
16:00 mantis2 left #evergreen
16:01 Dyrcona Well, it's not reported by tsbere, but here it is: https://rt.cpan.org/Ticket/Display.html?id=70169
16:01 Dyrcona Looks like rt.cpan.org is being shut down.
16:01 gmcharlt I'll take (another) look
16:03 Dyrcona gmcharlt: If you want a patch, I could probably provide. I recall tsbere writing one for this. Maybe it was for MARC::Batch?
16:03 Bmagic gmcharlt's comments from 2011 on that ticket are great: "not that encouraging such sloppy MARC records is a good idea. :)"
16:03 gmcharlt Dyrcona: sure, happy to take a patch from you
16:05 Bmagic just to be clear: I don't need to do anything special when I pass a UTF8 or a MARC8 or a MARC21 file into MARC::File::USMARC ?
16:05 Dyrcona Ha! I thought rt.cpan.org would be closed by now, but the latest bug on MARC::File is 20 minutes old, and it's spam.
16:06 Bmagic MARC::File::USMARC does all the work for me? Figuring out which character set to use and whatnot?
16:06 Dyrcona Bmagic: Usually, yes.
16:06 Dyrcona If the encoding is set correctly in the file.
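[Editor's note: the declaration Dyrcona refers to lives in leader position 09 of each MARC 21 record ('a' means UCS/Unicode, a blank means MARC-8). A minimal Python check, assuming a well-formed 24-byte leader:]

```python
def declared_encoding(record):
    """Report the character coding declared in MARC 21 leader
    position 09: b'a' means UCS/Unicode (UTF-8 in practice), a blank
    means MARC-8.  As the discussion notes, this declaration is only
    as trustworthy as whoever produced the file."""
    leader = record[:24]
    if len(leader) < 24:
        raise ValueError("record shorter than a 24-byte leader")
    return "utf-8" if leader[9:10] == b"a" else "marc-8"
```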
16:07 Dyrcona What are you actually trying to do? Load records from a migration/new library, a vendor?
16:15 Dyrcona gmcharlt: Is there a git repository for MARC::Record & company?
16:16 gmcharlt Dyrcona: yeah: https://github.com/perl4lib/marc-perl
16:17 Dyrcona Cool. I'll make an issue and pull request there.
16:18 Bmagic Dyrcona: This sprang from the electronic_bib_import.pl work I'm doing
16:19 Bmagic answer: probably not migration, but yes on the vendor
16:21 Dyrcona Bmagic: USMARC records should be in MARC8 unless the leader says UTF-8. Trouble is, I've seen just about anything in actual MARC records, and it is difficult to tell at run time.
16:24 Dyrcona For the logs and anyone else following along at home: It turns out that tsbere made a pull request on github for the issue, but his code breaks the tests. I'll take that up and see if I can fix it so it doesn't break the tests.
16:27 Dyrcona Also, for Bmagic, and those following along, generally, the only way to detect the encoding of a MARC record that says it is MARC8 is to assume it is MARC8, convert it to UTF-8 for Evergreen, and catch any errors. That's another reason why I often read the records individually and convert them to MARC::Record inside an eval, so that one or two bad records don't spoil the whole batch.
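[Editor's note: Dyrcona's eval-per-record pattern translates to Python roughly as follows. `parse` is a stand-in for whatever per-record converter is in use (MARC::Record->new_from_usmarc in the Perl original); the helper itself is a sketch.]

```python
def convert_batch(raw_records, parse):
    """Run a per-record converter inside try/except (the Python
    analogue of wrapping the conversion in a Perl eval) so one or two
    bad records don't spoil the whole batch.  Returns the converted
    records plus (index, exception) pairs for the failures."""
    good, bad = [], []
    for i, raw in enumerate(raw_records):
        try:
            good.append(parse(raw))
        except Exception as exc:
            bad.append((i, exc))
    return good, bad
```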
16:30 jihpringle joined #evergreen
16:32 jeff indication of marc8 vs utf8 is at the record level, not the file level, right?
16:33 Dyrcona jeff: Yes.
16:33 Dyrcona Files are usually all one or the other, but that isn't guaranteed.
16:34 jeff iirc, there's no file-level metadata at all -- at least, not *in* the file.
16:34 jeff filenames are weak but sometimes useful clues (or lies)
16:34 Dyrcona Right.
16:34 jeff and yeah, while I think it's a Bad Idea to have a file containing records of different encodings, I've seen it.
16:34 Dyrcona Also, I may have a simpler patch than what tsbere did.
16:35 Dyrcona I've seen 1 record have different encodings in it -- cataloging by copy and paste....
16:38 Dyrcona My favorite was the one that prompted tsbere to open this issue, IIRC, it was a file of records that were "flagged" MARC8, but were actually Windows 1252.
16:41 Dyrcona Ah, yeah. I misremembered the actual problem.....
16:42 jeff we used to get OverDrive records in an "Excel" file that was actually (I think) Office Open XML or possibly just HTML with additional elements. They were Windows-1252 but with a UTF-8 declaration somewhere, I think.
16:43 Dyrcona The problem was indeed smart quotes, but the sequence is 0x201E and 0x201D, so they still looked like end of field/end of record indicators.
16:44 Dyrcona Well, UTF-8 isn't too far off as both it and Windows 1252 are modified supersets of ISO8859-1, but that's still wrong.
16:46 Dyrcona I guess that's why most of the records came out as ISO8859-1 when we ran them through some code to guess the encoding.
16:48 jeff we would attempt to parse the data with the claimed encoding, then fall back to Windows-1252 if that failed. I can't recall our exact method of detecting "failure". Probably out-of-range characters.
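[Editor's note: jeff's fallback strategy can be sketched as a decode cascade. This is a hypothetical helper, not jeff's actual code; the third-party chardet module mentioned earlier would be a more thorough alternative.]

```python
def decode_with_fallback(raw, declared="utf-8"):
    """Try the declared encoding first; if that raises (the
    'out-of-range characters' failure mode described above), fall back
    to Windows-1252, and finally to Latin-1, which never fails because
    every byte value maps to a character.  Returns the decoded text
    and the encoding that worked."""
    for encoding in (declared, "cp1252", "latin-1"):
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue
    raise AssertionError("unreachable: latin-1 accepts any bytes")
```

For example, a Windows-1252 smart apostrophe (byte 0x92) is an invalid UTF-8 sequence, so the cascade falls through to cp1252.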
16:55 Dyrcona Well, the Windows 1252 smart quotes cause your record parsing to break. You get a short record followed by a junk record.
17:00 sandbergja joined #evergreen
17:03 Dyrcona Well, I'll take this up tomorrow. G'night all!
17:10 mmorgan left #evergreen
17:49 mantis2 joined #evergreen
17:49 mantis2 left #evergreen
18:00 pinesol News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
18:55 StomproJ joined #evergreen
19:39 mantis2 joined #evergreen
19:39 mantis2 left #evergreen
21:13 Stompro joined #evergreen
