Evergreen ILS Website

IRC log for #evergreen, 2021-06-02


All times shown according to the server's local time.

Time Nick Message
00:06 JBoyer joined #evergreen
00:07 sandbergja joined #evergreen
01:09 sandbergja joined #evergreen
01:25 sandbergja joined #evergreen
06:00 pinesol News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
06:45 rlefaive joined #evergreen
07:13 rjackson_isl_hom joined #evergreen
07:16 mantis2 joined #evergreen
08:15 rlefaive joined #evergreen
08:27 Dyrcona joined #evergreen
08:30 Stompro joined #evergreen
08:40 rfrasur joined #evergreen
08:46 mmorgan joined #evergreen
08:47 alynn26 joined #evergreen
09:24 jvwoolf joined #evergreen
10:03 rlefaive joined #evergreen
10:35 sandbergja joined #evergreen
10:52 Keith-isl joined #evergreen
10:58 pinesol [evergreen|Jane Sandberg] LP1922120: Add to carousel action in angular catalog - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=ec252ff>
10:59 jvwoolf joined #evergreen
11:12 pinesol [evergreen|Terran McCanna] LP1908619 Adjustments to Staff Search Preferences Page - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=20bd496>
11:12 pinesol [evergreen|Galen Charlton] LP#1908619: add a release notes sentence - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=c8072fb>
11:23 * gmcharlt claims 1264
11:28 pinesol [evergreen|Mike Rylander] LP#1778955: Remove our custom version of array_remove(anyarray,anyelement) - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=460d0c0>
11:28 pinesol [evergreen|Jane Sandberg] LP#1778955: fixing upgrade script, removing duplicate function definition - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=8c79f68>
11:28 pinesol [evergreen|Galen Charlton] LP#1778955: stamp schema update - <http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=d4f5a2e>
12:17 jihpringle joined #evergreen
12:21 Bmagic Dyrcona: I have a foggy memory of you talking about having some perl code that "figures out" the character encoding in a marc file?
12:32 rhamby Bmagic: are you looking to sort out all possible encodings (not possible) or just figure out if marc8 is hiding in a unicode declared file or the like?
12:34 rhamby there are too many ways for multibyte encodings to be processed to reliably discover them all, heck one sequence could be valid in multiple encoding schemes
12:34 rhamby but if you're looking at the most common like marc8 versus unicode or utf8 you can usually tease them apart
12:35 Dyrcona The python chardet module does a pretty good job.
12:36 rhamby if you're using perl, the byte that precedes a marc8 combining character evaluates via ord as 225, so that's a pretty good indicator the byte following is marc8
12:36 rhamby I usually scan for that and have it print them to visually eyeball and spot problems pretty fast
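[Editor's note: rhamby's rule of thumb above can be sketched in Python. The specific marker byte (225, i.e. 0xE1, an ANSEL combining diacritic) is taken from the conversation; this is a rough heuristic, not a complete encoding detector.]

```python
def looks_like_marc8(raw, marker=225):
    """Heuristic from the discussion above: byte 225 (0xE1) precedes a
    MARC-8/ANSEL combining character, but a bare 0xE1 followed by an
    ASCII letter is not valid UTF-8.  So finding the marker byte in
    data that fails to decode as UTF-8 is a decent hint it's MARC-8."""
    if marker not in raw:
        return False
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True   # marker byte present and not valid UTF-8
    return False      # decodes cleanly; probably genuine UTF-8
```

As rhamby notes below, running this byte-by-byte over a whole file is slow; scanning just a few high-value fields (titles, authors) is usually enough to decide whether to dig deeper.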
12:37 Dyrcona Bmagic: I did a quick perusal of my scripts and I don't see anything like what you mentioned.
12:37 collum joined #evergreen
12:38 Dyrcona The real problem is when 1 MARC file contains text in different encodings, or worse Windows-1252 with smart quotes.
12:38 rhamby yeah, that's why my solution does a decent job of finding the issues
12:39 rhamby note that breaking strings into arrays and scanning with ord can be really really slow on big files, but I cheat and usually just do titles and authors, and that's generally a good indicator of whether I need to dig deeper
12:39 rhamby and yeah, the declaration of whether the file is marc8 vs unicode is usually more a hopeful statement than anything factual
12:39 Dyrcona I've used python chardet with non-marc data. It could be used on a field by field basis with pymarc.
12:40 Dyrcona Unicode is often spelt ISO8859-X (where X is a number). :)
12:42 Dyrcona That should be misspelt, not spelt. :)
12:48 Dyrcona I do have a script that spits out the record warnings and the encoding as understood by MARC::Record.
12:49 collum joined #evergreen
12:50 Dyrcona Bmagic: https://pastebin.com/Pef0KLeL
13:00 sandbergja joined #evergreen
13:10 nfBurton joined #evergreen
13:17 jvwoolf joined #evergreen
14:17 jvwoolf joined #evergreen
14:26 csharp joined #evergreen
14:59 jihpringle joined #evergreen
15:00 Bmagic Dyrcona++
15:00 Bmagic rhamby++
15:06 Keith_isl joined #evergreen
15:06 RFrasur_ joined #evergreen
15:42 Bmagic Drycona rhamby: I'm reading a mrc file in like this $file = MARC::File::USMARC->in($file). Then loop the records: while ( my $marc = $file->next() ). And for each marc record, I convert it to XML: $thisXML =  $marc->as_xml(); followed by a dozen regex replacers.
15:45 Bmagic pretty sure it's fine as long as the file is encoded with <not really sure> utf8 maybe. So, what I think* I need to do, is read the file record by record like you have there in your script Dyrcona. Attempt to load the data into MARC::Record, catch the errors, and automatically load the file again but with the "right" character encoding. Or maybe just loop through the records manually to begin with and ditch MARC::File?
15:47 Bmagic correction "$file = MARC::File::USMARC->in($filename)"
15:48 Dyrcona Are you having any particular problems other than encoding?
15:49 Bmagic I'm thinking of potential issues with the way this script reads the records to make it more "compatible" for the masses
15:50 Dyrcona My script reads the records that way because (a) that's how MARC::File::USMARC->in() does it, but also (b) when you have "smart quotes" pasted into a field, you actually need to split records on \x1E\x1D because \x1D is in the smart quote sequence.
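[Editor's note: the two-byte split Dyrcona describes can be sketched like this in Python. This is a sketch of the technique, not MARC::File's actual implementation.]

```python
FIELD_TERM = b"\x1e"   # ISO 2709 field terminator
RECORD_TERM = b"\x1d"  # ISO 2709 record terminator

def split_records(raw):
    """Split a raw USMARC blob into individual records.  Splitting on
    the two-byte sequence \\x1e\\x1d (field terminator immediately
    followed by record terminator) instead of \\x1d alone avoids false
    splits when a stray \\x1d byte lands inside field data, e.g. from
    pasted smart quotes."""
    records = []
    for chunk in raw.split(FIELD_TERM + RECORD_TERM):
        if chunk:  # skip the empty tail after the final terminator
            records.append(chunk + FIELD_TERM + RECORD_TERM)
    return records
```

A naive split on \x1d alone would cut the second record below in half; splitting on \x1e\x1d keeps it intact.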
15:51 Dyrcona If you're having encoding issues with some records, I'd suggest trying pymarc and chardet to go over each field. You can then convert the data field by field if necessary.
15:51 Bmagic wow, I guess the question is: should I write this to support "smart quotes"
15:52 Bmagic maybe the contribution needs to land in MARC::File instead of my script?
15:52 Dyrcona MARC::File may already work with smart quotes. I know that tsbere opened a ticket on CPAN about it. I don't know if his patch ever went in.
15:53 Dyrcona gmcharlt should know, as I think he is one of the maintainers of MARC::File.
15:55 gmcharlt I should check the patch queue, but yeah, at the moment smart quotes would break MARC::File::USMARC's expectations
15:56 Dyrcona I was just looking at rt.cpan.org.
15:56 Dyrcona I couldn't find the bug report.
15:58 * Dyrcona is mildly surprised that his CPAN id still works. :)
15:58 Dyrcona Found the issue.
15:59 sandbergja joined #evergreen
16:00 mantis2 left #evergreen
16:01 Dyrcona Well, it's not reported by tsbere, but here it is: https://rt.cpan.org/Ticket/Display.html?id=70169
16:01 Dyrcona Looks like rt.cpan.org is being shut down.
16:01 gmcharlt I'll take (another) look
16:03 Dyrcona gmcharlt: If you want a patch, I could probably provide. I recall tsbere writing one for this. Maybe it was for MARC::Batch?
16:03 Bmagic gmcharlt's comments from 2011 on that ticket are great: "not that encouraging such sloppy MARC records is a good idea. :)"
16:03 gmcharlt Dyrcona: sure, happy to take a patch from you
16:05 Bmagic just to be clear: I don't need to do anything special when I pass a UTF8 or a MARC8 or a MARC21 file into MARC::File::USMARC ?
16:05 Dyrcona Ha! I thought rt.cpan.org would be closed by now, but the latest bug on MARC::File is 20 minutes old, and it's spam.
16:06 Bmagic MARC::File::USMARC does all the work for me? Figuring out which character set to use and whatnot?
16:06 Dyrcona Bmagic: Usually, yes.
16:06 Dyrcona If the encoding is set correctly in the file.
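[Editor's note: the declaration Dyrcona refers to lives in leader position 09 of each MARC 21 record ('a' means UCS/Unicode, a blank means MARC-8). A minimal Python check, assuming a well-formed 24-byte leader:]

```python
def declared_encoding(record):
    """Report the character coding declared in MARC 21 leader
    position 09: b'a' means UCS/Unicode (UTF-8 in practice), a blank
    means MARC-8.  As the discussion notes, this declaration is only
    as trustworthy as whoever produced the file."""
    leader = record[:24]
    if len(leader) < 24:
        raise ValueError("record shorter than a 24-byte leader")
    return "utf-8" if leader[9:10] == b"a" else "marc-8"
```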
16:07 Dyrcona What are you actually trying to do? Load records from a migration/new library, a vendor?
16:15 Dyrcona gmcharlt: Is there a git repository for MARC::Record & company?
16:16 gmcharlt Dyrcona: yeah: https://github.com/perl4lib/marc-perl
16:17 Dyrcona Cool. I'll make an issue and pull request there.
16:18 Bmagic Dyrcona: This sprang from the electronic_bib_import.pl work I'm doing
16:19 Bmagic answer: probably not migration, but yes on the vendor
16:21 Dyrcona Bmagic: USMARC records should be in MARC8 unless the leader says UTF-8. Trouble is, I've seen just about anything in actual MARC records, and it is difficult to tell at run time.
16:24 Dyrcona For the logs and anyone else following along at home: It turns out that tsbere made a pull request on github for the issue, but his code breaks the tests. I'll take that up and see if I can fix it so it doesn't break the tests.
16:27 Dyrcona Also, for Bmagic, and those following along, generally, the only way to detect the encoding of a MARC record that says it is MARC8 is to assume it is MARC8, convert it to UTF-8 for Evergreen, and catch any errors. That's another reason why I often read the records individually and convert them to MARC::Record inside an eval, so that one or two bad records don't spoil the whole batch.
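[Editor's note: Dyrcona's eval-per-record pattern translates to Python roughly as follows. `parse` is a stand-in for whatever per-record converter is in use (MARC::Record->new_from_usmarc in the Perl original); the helper itself is a sketch.]

```python
def convert_batch(raw_records, parse):
    """Run a per-record converter inside try/except (the Python
    analogue of wrapping the conversion in a Perl eval) so one or two
    bad records don't spoil the whole batch.  Returns the converted
    records plus (index, exception) pairs for the failures."""
    good, bad = [], []
    for i, raw in enumerate(raw_records):
        try:
            good.append(parse(raw))
        except Exception as exc:
            bad.append((i, exc))
    return good, bad
```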
16:30 jihpringle joined #evergreen
16:32 jeff indication of marc8 vs utf8 is at the record level, not the file level, right?
16:33 Dyrcona jeff: Yes.
16:33 Dyrcona Files are usually all one or the other, but that isn't guaranteed.
16:34 jeff iirc, there's no file-level metadata at all -- at least, not *in* the file.
16:34 jeff filenames are weak but sometimes useful clues (or lies)
16:34 Dyrcona Right.
16:34 jeff and yeah, while I think it's a Bad Idea to have a file containing records of different encodings, I've seen it.
16:34 Dyrcona Also, I may have a simpler patch than what tsbere did.
16:35 Dyrcona I've seen 1 record have different encodings in it -- cataloging by copy and paste....
16:38 Dyrcona My favorite was the one that prompted tsbere to open this issue, IIRC, it was a file of records that were "flagged" MARC8, but were actually Windows 1252.
16:41 Dyrcona Ah, yeah. I misremembered the actual problem.....
16:42 jeff we used to get OverDrive records in an "Excel" file that was actually (I think) Office Open XML or possibly just HTML with additional elements. They were Windows-1252 but with a UTF-8 declaration somewhere, I think.
16:43 Dyrcona The problem was indeed smart quotes, but the sequence is 0x201E and 0x201D, so they still looked like end of field/end of record indicators.
16:44 Dyrcona Well, UTF-8 isn't too far off as both it and Windows 1252 are modified supersets of ISO8859-1, but that's still wrong.
16:46 Dyrcona I guess that's why most of the records came out as ISO8859-1 when we ran them through some code to guess the encoding.
16:48 jeff we would attempt to parse the data with the claimed encoding, then fall back to Windows-1252 if that failed. I can't recall our exact method of detecting "failure". Probably out-of-range characters.
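[Editor's note: jeff's fallback strategy can be sketched as a decode cascade. This is a hypothetical helper, not jeff's actual code; the third-party chardet module mentioned earlier would be a more thorough alternative.]

```python
def decode_with_fallback(raw, declared="utf-8"):
    """Try the declared encoding first; if that raises (the
    'out-of-range characters' failure mode described above), fall back
    to Windows-1252, and finally to Latin-1, which never fails because
    every byte value maps to a character.  Returns the decoded text
    and the encoding that worked."""
    for encoding in (declared, "cp1252", "latin-1"):
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue
    raise AssertionError("unreachable: latin-1 accepts any bytes")
```

For example, a Windows-1252 smart apostrophe (byte 0x92) is an invalid UTF-8 sequence, so the cascade falls through to cp1252.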
16:55 Dyrcona Well, the Windows 1252 smart quotes cause your record parsing to break. You get a short record followed by a junk record.
17:00 sandbergja joined #evergreen
17:03 Dyrcona Well, I'll take this up tomorrow. G'night all!
17:10 mmorgan left #evergreen
17:49 mantis2 joined #evergreen
17:49 mantis2 left #evergreen
18:00 pinesol News from qatests: Testing Success <http://testing.evergreen-ils.org/~live>
18:55 StomproJ joined #evergreen
19:39 mantis2 joined #evergreen
19:39 mantis2 left #evergreen
21:13 Stompro joined #evergreen
