#evergreen IRC log (Time / Nick / Message)
00:06 JBoyer joined #evergreen
00:07 sandbergja joined #evergreen
01:09 sandbergja joined #evergreen
01:25 sandbergja joined #evergreen
06:00 <pinesol> News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
06:45 rlefaive joined #evergreen
07:13 rjackson_isl_hom joined #evergreen
07:16 mantis2 joined #evergreen
08:15 rlefaive joined #evergreen
08:27 Dyrcona joined #evergreen
08:30 Stompro joined #evergreen
08:40 rfrasur joined #evergreen
08:46 mmorgan joined #evergreen
08:47 alynn26 joined #evergreen
09:24 jvwoolf joined #evergreen
10:03 rlefaive joined #evergreen
10:35 sandbergja joined #evergreen
10:52 Keith-isl joined #evergreen
10:58 <pinesol> [evergreen|Jane Sandberg] LP1922120: Add to carousel action in angular catalog - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=ec252ff>
10:59 jvwoolf joined #evergreen
11:12 <pinesol> [evergreen|Terran McCanna] LP1908619 Adjustments to Staff Search Preferences Page - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=20bd496>
11:12 <pinesol> [evergreen|Galen Charlton] LP#1908619: add a release notes sentence - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=c8072fb>
11:23 * gmcharlt claims 1264
11:28 <pinesol> [evergreen|Mike Rylander] LP#1778955: Remove our custom version of array_remove(anyarray,anyelement) - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=460d0c0>
11:28 <pinesol> [evergreen|Jane Sandberg] LP#1778955: fixing upgrade script, removing duplicate function definition - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=8c79f68>
11:28 <pinesol> [evergreen|Galen Charlton] LP#1778955: stamp schema update - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=d4f5a2e>
12:17 jihpringle joined #evergreen
12:21 <Bmagic> Dyrcona: I have a foggy memory of you talking about having some Perl code that "figures out" the character encoding in a MARC file?
12:32 <rhamby> Bmagic: are you looking to sort out all possible encodings (not possible), or just figure out if MARC-8 is hiding in a file declared as Unicode, or the like?
12:34 <rhamby> there are too many ways for multibyte encodings to be processed to reliably discover them all; heck, one sequence could be valid in multiple encoding schemes
12:34 <rhamby> but if you're looking at the most common, like MARC-8 versus Unicode/UTF-8, you can usually tease them apart
12:35 <Dyrcona> The Python chardet module does a pretty good job.
12:36 <rhamby> if you're using Perl, the combining character that MARC-8 puts before the base letter evaluates to 225 with ord(), so that's a pretty good indicator that the following byte is MARC-8
12:36 <rhamby> I usually scan for that and print the matches so I can eyeball them and spot problems pretty fast
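A rough Python sketch of the scan rhamby describes above (an illustration of the idea, not rhamby's actual Perl; the function name and return labels are made up): a strict UTF-8 decode either succeeds or it doesn't, and leftover bytes in the 0xE0-0xFE range, where MARC-8 (ANSEL) keeps its combining diacritics, are a hint that MARC-8 is present.

```python
def guess_field_encoding(raw: bytes) -> str:
    """Crude heuristic: valid UTF-8 decodes cleanly; otherwise, bytes in
    the ANSEL combining-diacritic range (0xE0-0xFE) suggest MARC-8."""
    try:
        raw.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        # In MARC-8, combining diacritics come *before* the base letter;
        # 0xE1 (ord 225) is the combining grave accent.
        if any(0xE0 <= b <= 0xFE for b in raw):
            return "probably-marc8"
        return "unknown"

# MARC-8 spelling of "voilà": the 0xE1 combining byte precedes the 'a'.
print(guess_field_encoding(b"voil\xe1a"))              # probably-marc8
print(guess_field_encoding("voil\u00e0".encode("utf-8")))  # utf-8
```

As rhamby notes, a scan like this over every field is slow on big files; titles and authors alone are usually enough to flag a problem batch.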
12:37 <Dyrcona> Bmagic: I did a quick perusal of my scripts, and I don't see anything like what you mentioned.
12:37 collum joined #evergreen
12:38 <Dyrcona> The real problem is when one MARC file contains text in different encodings, or worse, Windows-1252 with smart quotes.
12:38 <rhamby> yeah, that's why my solution does a decent job of finding the issues
12:39 <rhamby> note that breaking strings into arrays and scanning with ord() can be really, really slow on big files, but I cheat and usually scan just titles and authors, which is usually a good indicator of whether I need to dig deeper
12:39 <rhamby> and yeah, the declaration of whether the file is MARC-8 vs. Unicode is usually more a hopeful statement than anything factual
12:39 <Dyrcona> I've used Python chardet with non-MARC data. It could be used on a field-by-field basis with pymarc.
12:40 <Dyrcona> Unicode is often spelt ISO8859-X (where X is a number). :)
12:42 <Dyrcona> That should be misspelt, not spelt. :)
12:48 <Dyrcona> I do have a script that spits out the record warnings and the encoding as understood by MARC::Record.
12:49 collum joined #evergreen
12:50 <Dyrcona> Bmagic: https://pastebin.com/Pef0KLeL
13:00 sandbergja joined #evergreen
13:10 nfBurton joined #evergreen
13:17 jvwoolf joined #evergreen
14:17 jvwoolf joined #evergreen
14:26 csharp joined #evergreen
14:59 jihpringle joined #evergreen
15:00 <Bmagic> Dyrcona++
15:00 <Bmagic> rhamby++
15:06 Keith_isl joined #evergreen
15:06 RFrasur_ joined #evergreen
15:42 <Bmagic> Dyrcona, rhamby: I'm reading an MRC file in like this: $file = MARC::File::USMARC->in($file). Then I loop over the records: while (my $marc = $file->next()). And for each MARC record, I convert it to XML ($thisXML = $marc->as_xml()), followed by a dozen regex replacements.
15:45 <Bmagic> pretty sure it's fine as long as the file is encoded in <not really sure> UTF-8, maybe. So what I *think* I need to do is read the file record by record, like your script does, Dyrcona: attempt to load the data into MARC::Record, catch the errors, and automatically load the file again with the "right" character encoding. Or maybe just loop through the records manually to begin with and ditch MARC::File?
15:47 <Bmagic> correction: "$file = MARC::File::USMARC->in($filename)"
15:48 <Dyrcona> Are you having any particular problems other than encoding?
15:49 <Bmagic> I'm thinking of potential issues with the way this script reads the records, to make it more "compatible" for the masses
15:50 <Dyrcona> My script reads the records that way because that is (a) how MARC::File::USMARC->in() does it, but also (b) when you have "smart quotes" pasted into a field, you actually need to split records on \x1E\x1D, because \x1D is in the smart-quote byte sequence.
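Dyrcona's splitting trick can be sketched in Python (a stdlib illustration of the idea, not the pastebin script; `split_records` is a hypothetical name): ISO 2709 records end with a field terminator (0x1E) immediately followed by the record terminator (0x1D), so splitting on the two-byte pair survives a stray 0x1D inside field data.

```python
def split_records(data: bytes) -> list[bytes]:
    """Split a raw USMARC byte stream into records on the two-byte
    end-of-field + end-of-record pair, not on 0x1D alone."""
    chunks = data.split(b"\x1e\x1d")
    # drop the empty tail after the final terminator, restore terminators
    return [c + b"\x1e\x1d" for c in chunks if c]

# A stray 0x1D pasted into a field does not split the record:
raw = b"rec1 has \x1d inside\x1e\x1drec2\x1e\x1d"
print(len(split_records(raw)))  # 2
```

Splitting on `b"\x1d"` alone would cut the first record in two at the stray byte, producing exactly the short-record-plus-junk-record symptom described later in this log.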
15:51 <Dyrcona> If you're having encoding issues with some records, I'd suggest trying pymarc and chardet to go over each field. You can then convert the data field by field if necessary.
15:51 <Bmagic> wow, I guess the question is: should I write this to support "smart quotes"?
15:52 <Bmagic> maybe the contribution needs to land in MARC::File instead of my script?
15:52 <Dyrcona> MARC::File may already work with smart quotes. I know that tsbere opened a ticket on CPAN about it. I don't know if his patch ever went in.
15:53 <Dyrcona> gmcharlt should know, as I think he is one of the maintainers of MARC::File.
15:55 <gmcharlt> I should check the patch queue, but yeah, at the moment smart quotes would break MARC::File::USMARC's expectations
15:56 <Dyrcona> I was just looking at rt.cpan.org.
15:56 <Dyrcona> I couldn't find the bug report.
15:58 * Dyrcona is mildly surprised that his CPAN id still works. :)
15:58 <Dyrcona> Found the issue.
15:59 sandbergja joined #evergreen
16:00 mantis2 left #evergreen
16:01 <Dyrcona> Well, it's not reported by tsbere, but here it is: https://rt.cpan.org/Ticket/Display.html?id=70169
16:01 <Dyrcona> Looks like rt.cpan.org is being shut down.
16:01 <gmcharlt> I'll take (another) look
16:03 <Dyrcona> gmcharlt: If you want a patch, I could probably provide one. I recall tsbere writing one for this. Maybe it was for MARC::Batch?
16:03 <Bmagic> gmcharlt's comments from 2011 on that ticket are great: "not that encouraging such sloppy MARC records is a good idea. :)"
16:03 <gmcharlt> Dyrcona: sure, happy to take a patch from you
16:05 <Bmagic> just to be clear: I don't need to do anything special when I pass a UTF-8 or a MARC-8 or a MARC21 file into MARC::File::USMARC?
16:05 <Dyrcona> Ha! I thought rt.cpan.org would be closed by now, but the latest bug on MARC::File is 20 minutes old, and it's spam.
16:06 <Bmagic> MARC::File::USMARC does all the work for me? Figuring out which character set to use and whatnot?
16:06 <Dyrcona> Bmagic: Usually, yes.
16:06 <Dyrcona> If the encoding is set correctly in the file.
16:07 <Dyrcona> What are you actually trying to do? Load records from a migration/new library, or a vendor?
16:15 <Dyrcona> gmcharlt: Is there a git repository for MARC::Record & company?
16:16 <gmcharlt> Dyrcona: yeah: https://github.com/perl4lib/marc-perl
16:17 <Dyrcona> Cool. I'll make an issue and a pull request there.
16:18 <Bmagic> Dyrcona: This sprang from the electronic_bib_import.pl work I'm doing
16:19 <Bmagic> answer: probably not migration, but yes on the vendor
16:21 <Dyrcona> Bmagic: USMARC records should be in MARC-8 unless the leader says UTF-8. Trouble is, I've seen just about anything in actual MARC records, and it is difficult to tell at run time.
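The leader check Dyrcona refers to is tiny: per MARC 21, leader position 09 ("character coding scheme") is `a` for UCS/Unicode and blank for MARC-8. A Python sketch (hypothetical helper name; the sample leaders below are made up):

```python
def declared_encoding(record: bytes) -> str:
    """Report what the MARC leader *claims* the encoding is.
    Leader byte 9: b'a' = UCS/Unicode (UTF-8 in practice), blank = MARC-8."""
    return "UTF-8" if record[9:10] == b"a" else "MARC-8"

print(declared_encoding(b"00714cam a2200205 a 4500"))  # UTF-8
print(declared_encoding(b"00714cam  2200205 a 4500"))  # MARC-8
```

As the discussion makes clear, this is only what the record claims; vendor files routinely lie, so the claim is a starting point, not ground truth.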
16:24 <Dyrcona> For the logs and anyone else following along at home: it turns out that tsbere made a pull request on GitHub for the issue, but his code breaks the tests. I'll take that up and see if I can fix it so it doesn't break them.
16:27 <Dyrcona> Also, for Bmagic and those following along: generally, the only way to detect the encoding of a MARC record that says it is MARC-8 is to assume it is MARC-8, convert it to UTF-8 for Evergreen, and catch any errors. That's another reason why I often read the records individually and convert them to MARC::Record inside an eval, so that one or two bad records don't spoil the whole batch.
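The eval-per-record pattern translates to a try/except sketch like this (a Python stand-in for the Perl being discussed; `parse_record` is a dummy placeholder, not MARC::Record or pymarc):

```python
def parse_record(raw: bytes) -> str:
    # Stand-in for "convert to MARC::Record + MARC-8 -> UTF-8 conversion";
    # raises on bad data, as the real conversion would.
    return raw.decode("utf-8")

def load_batch(records: list[bytes]):
    """Parse each record inside its own try/except (the eval equivalent),
    so one or two bad records don't spoil the whole batch."""
    good, bad = [], []
    for i, raw in enumerate(records):
        try:
            good.append(parse_record(raw))
        except Exception as exc:
            bad.append((i, exc))  # record the failure and keep going
    return good, bad

good, bad = load_batch([b"fine record", b"broken \xe1 byte", b"also fine"])
print(len(good), len(bad))  # 2 1
```

The bad list can then be re-tried under a different assumed encoding, which is exactly the "load the file again with the right encoding" idea Bmagic floated earlier.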
16:30 jihpringle joined #evergreen
16:32 <jeff> indication of MARC-8 vs. UTF-8 is at the record level, not the file level, right?
16:33 <Dyrcona> jeff: Yes.
16:33 <Dyrcona> Files are usually all one or the other, but that isn't guaranteed.
16:34 <jeff> iirc, there's no file-level metadata at all -- at least, not *in* the file.
16:34 <jeff> filenames are weak but sometimes useful clues (or lies)
16:34 <Dyrcona> Right.
16:34 <jeff> and yeah, while I think it's a Bad Idea to have a file containing records of different encodings, I've seen it.
16:34 <Dyrcona> Also, I may have a simpler patch than what tsbere did.
16:35 <Dyrcona> I've seen one record with different encodings in it -- cataloging by copy and paste....
16:38 <Dyrcona> My favorite was the one that prompted tsbere to open this issue. IIRC, it was a file of records that were "flagged" MARC-8 but were actually Windows-1252.
16:41 <Dyrcona> Ah, yeah. I misremembered the actual problem.....
16:42 <jeff> we used to get OverDrive records in an "Excel" file that was actually (I think) Office Open XML, or possibly just HTML with additional elements. They were Windows-1252 but with a UTF-8 declaration somewhere, I think.
16:43 <Dyrcona> The problem was indeed smart quotes, but the sequence is 0x201E and 0x201D, so they still looked like end-of-field/end-of-record indicators.
16:44 <Dyrcona> Well, UTF-8 isn't too far off, as both it and Windows-1252 are modified supersets of ISO8859-1, but that's still wrong.
16:46 <Dyrcona> I guess that's why most of the records came out as ISO8859-1 when we ran them through some code to guess the encoding.
16:48 <jeff> we would attempt to parse the data with the claimed encoding, then fall back to Windows-1252 if that failed. I can't recall our exact method of detecting "failure" -- probably out-of-range characters.
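jeff's fallback approach sketched in Python (a hypothetical helper, not jeff's actual code; the claimed encoding defaults to UTF-8 just for illustration):

```python
def decode_with_fallback(data: bytes, claimed: str = "utf-8") -> str:
    """Try the claimed encoding strictly; if it fails (out-of-range
    bytes), fall back to Windows-1252."""
    try:
        return data.decode(claimed, errors="strict")
    except UnicodeDecodeError:
        # cp1252 maps nearly every byte, so the fallback rarely fails.
        return data.decode("cp1252")

# 0x93/0x94 are the Windows-1252 curly quotes that break a UTF-8 decode:
print(decode_with_fallback(b"\x93smart\x94"))  # curly-quoted "smart"
```

This detects "failure" via the strict decode's exception, which matches the "out-of-range characters" guess in the log; a handful of bytes (e.g. 0x81, 0x8D) are undefined even in cp1252, so the fallback itself can still raise.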
16:55 <Dyrcona> Well, the Windows-1252 smart quotes cause your record parsing to break. You get a short record followed by a junk record.
17:00 sandbergja joined #evergreen
17:03 <Dyrcona> Well, I'll take this up tomorrow. G'night all!
17:10 mmorgan left #evergreen
17:49 mantis2 joined #evergreen
17:49 mantis2 left #evergreen
18:00 <pinesol> News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
18:55 StomproJ joined #evergreen
19:39 mantis2 joined #evergreen
19:39 mantis2 left #evergreen
21:13 Stompro joined #evergreen