IRC log for #evergreen, 2022-08-05

All times shown according to the server's local time.

Time	Nick	Message
06:17		collum joined #evergreen
07:10		rjackson_isl_hom joined #evergreen
07:50		BDorsey joined #evergreen
08:16		mantis1 joined #evergreen
08:42		mmorgan joined #evergreen
09:39		Dyrcona joined #evergreen
09:40	Dyrcona	So, it looks like a batch of records from Overdrive are in whatever character set, individually: utf8 "\xC2" does not map to Unicode at /usr/lib/x86_64-linux-gnu/perl/5.26/Encode.pm line 212
09:48	Dyrcona	Hmm.. If I'm reading the table correctly, that should be a "sound recording copyright," or a P with a circle around it in MARC-8.
09:48	Dyrcona	In UTF-8
09:49	Dyrcona	In UTF-8 \xC2 is often the first of a pair, \xC2A9 is the copyright symbol, for example.
09:50	Dyrcona	So, I guess some of the records are UTF-8 and some are MARC-8, or even have a mix of characters in different character sets. I guess the question is, do I load the "bad" records?
09:52	Dyrcona	Sound recording copyright is \xE28497 in UTF-8.
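The byte values quoted above can be checked directly. This is a small Python sketch rather than Perl; the helper name `utf8_hex` is made up for illustration. It also shows why a lone 0xC2 followed by a space is rejected by a strict UTF-8 decoder.

```python
# Code points mentioned in the log and their UTF-8 encodings.
COPYRIGHT = "\u00a9"        # COPYRIGHT SIGN
SOUND_COPYRIGHT = "\u2117"  # SOUND RECORDING COPYRIGHT


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of ch as upper-case hex (hypothetical helper)."""
    return ch.encode("utf-8").hex().upper()


print(utf8_hex(COPYRIGHT))        # C2A9
print(utf8_hex(SOUND_COPYRIGHT))  # E28497

# 0xC2 is a lead byte: it must be followed by a byte in 0x80-0xBF.
# Followed by a space (0x20) it is an incomplete sequence.
try:
    b"\xc2 ".decode("utf-8")
    lone_c2_ok = True
except UnicodeDecodeError:
    lone_c2_ok = False

print(lone_c2_ok)  # False
```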
09:53	Dyrcona	I suppose I could try loading the records as MARC-8 and see which leads to fewer issues. I seem to recall the whole thing blowing up when I didn't force the encoding of the MARC to UTF-8, though.
10:01	Dyrcona	Typing in a web form and my palm hit the track pad on my laptop thereby selecting all of the text in the field so that the next character that I typed replaced it all. Ctrl-z didn't bring it back....
10:01	Dyrcona	It's going to be that kind of Friday....
10:05	* Dyrcona	searches for a Perl module like chardet for Python.
10:07	Dyrcona	Well, there is this: https://github.com/boytm/chardet
10:09	Dyrcona	Mozilla has libchardet written in C++.
10:09	Dyrcona	So, maybe, I should try PyMarc and chardet for this project.
10:13	mmorgan	Ctrl-z must be taking a vacation day :-(
10:14	Dyrcona	I suppose.
10:16	Dyrcona	I'm going to try and solve my record issue by rewriting the prep script in Python using PyMarc and chardet to autodetect the character set of each record. I wonder what chardet will say about MARC-8 records? Maybe, it will call them ISO 2022?
10:23	Dyrcona	Maybe I can keep most of the Perl code and just throw the XML at a character set detection program?
10:30	Dyrcona	Running chardet on the input files says, "utf-8 with confidence 0.99."
10:30	* Dyrcona	sighs. Guess I'll just jam the bad records in, like my current test is doing.
10:49		jvwoolf joined #evergreen
11:08		jihpringle joined #evergreen
11:29	Dyrcona	I wonder if $raw =~ s/\xC2/.../g; works on UTF-8 files that might have \xC2A9 and other multibyte sequences....
11:32	Dyrcona	I suppose that I could test it.
11:53	Dyrcona	Weird, 0xC2A9 doesn't give me the copyright symbol. Let me try it reversed for little endian.
11:54	Dyrcona	Funny, too, how 0xC2 only prints "correctly" when other unicode sequences are in the string. Otherwise it prints as an unknown character glyph.
12:26	Dyrcona	Unicode in Perl is a pain.
12:26	Dyrcona	I is one of those Fridays....
12:26	Dyrcona	s/^I/It/
12:28	Dyrcona	I can't get the copyright symbol to print using hex codes.
12:30	Dyrcona	I get a Chinese glyph instead. Maybe I have to decode or encode it?
12:30	Dyrcona	I did specify use utf8; but that only works for the source code....
12:35	Dyrcona	Well, decode('utf-8', $string) produces nothing. Switching to encode gives me what I get without it.
12:36	jeff	Dyrcona: your ``only prints "correctly" when other unicode sequences are in the string'' strongly brought to mind the section of perlunicode called ``The "Unicode Bug"'', and in refreshing my memory I'm amused to see that it uses "\xC2" in its example. :-)
12:37	jeff	earlier in the document, 0xC0 and 0xC1 are noted as being suitable choices as sentinel bytes, because "they never appear in well-formed UTF-8".
12:38	Dyrcona	Indeed it is.
12:41	Dyrcona	So, 0xC2A9 is the copyright symbol in UTF-8. However, I always get a Chinese glyph when it prints as something other than the unknown character glyph. The 0xC2 either prints as the unknown character glyph or it prints as capital A with the little circle on top. (I don't feel like looking up the actual name of the character.)
12:54	Dyrcona	Bug 1983725 sounds like one that I already filed....
12:54	pinesol	Launchpad bug 1983725 in Evergreen "Ampersands in subject headings make for bad links" [Undecided,New] https://launchpad.net/bugs/1983725
12:56	Dyrcona	Not quite: Bug 1021427
12:56	pinesol	Launchpad bug 1021427 in Evergreen "Ampersand in Call Number causes not well-formed error" [Undecided,Confirmed] https://launchpad.net/bugs/1021427
12:57	Dyrcona	Anyway, my LANG=en_US.UTF-8.... Back to reading perlunicode
12:58	jeff	Dyrcona: print "\xC2\xA9" seems to output the copyright symbol in question by default for me when ssh'd in to a remote Linux host running Perl v5.32.1 -- stock system Perl with Debian bullseye. LANG=en_US.UTF-8
12:58	jeff	https://gist.github.com/jeff/b87a1fe7591b1f0414edd17603021cca
12:58	jeff	Am I doing it differently from you?
12:59	jeff	(that's without setting binmode(STDOUT, ":utf8");)
13:01	Dyrcona	Well, I was just gonna say that setting binmode(STDOUT, ":utf8") produces different results...
13:01	jeff	with the binmode call, I get Â© (I think because the raw bytes being printed are already utf8, and it's trying to double encode them)
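The double-encoding guess can be checked at the byte level. This is a rough Python sketch, not Perl. It assumes the output layer treats each byte of the string as its own Latin-1 character and then encodes that to UTF-8.

```python
raw = b"\xc2\xa9"  # the UTF-8 bytes for the copyright sign

# Assumption: each byte is taken as one Latin-1 character (U+00C2, U+00A9),
# and the result is then encoded to UTF-8 a second time.
double_encoded = raw.decode("latin-1").encode("utf-8")

print(double_encoded.hex())  # c382c2a9

# Read back as UTF-8, those four bytes are two characters:
# A with circumflex followed by the copyright sign.
as_text = double_encoded.decode("utf-8")
print(len(as_text))  # 2
```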
13:01	jeff	same/similar if i use: use open qw/:std :utf8/;
13:02	Dyrcona	Yeahp.
13:02	Dyrcona	I was doing this: my $string = sprintf("%c stome stuff %c", 0xC2, 0xC2A9); then printing $string.
13:04	Dyrcona	OK. Hardcoding the string "\xC2 stome stuff \xC2\xA9" works!
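The difference between the two attempts is that `%c` with 0xC2A9 asks for the single character at code point U+C2A9, while "\xC2\xA9" is two bytes that together encode U+00A9. Here is a minimal Python sketch of the same distinction, using `chr` as a stand-in for Perl's `sprintf("%c", ...)`:

```python
import unicodedata

as_code_point = chr(0xC2A9)     # one character at code point U+C2A9
as_bytes = bytes([0xC2, 0xA9])  # two bytes: the UTF-8 encoding of U+00A9

# U+C2A9 lies in the Hangul Syllables block, which would explain the CJK-looking glyph.
print(unicodedata.name(as_code_point))

# Decoding the two bytes as UTF-8 gives the intended character.
print(unicodedata.name(as_bytes.decode("utf-8")))  # COPYRIGHT SIGN
```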
13:04	Dyrcona	jeff++
13:04	Dyrcona	I'm going to add the phonograph copyright symbol as well.
13:05	Dyrcona	It looks like my regex substitutions will have the desired effect.
13:08	Dyrcona	No, I only got "lucky" by not using the global qualifier on the first substitution.
13:08	Dyrcona	Existing copyright symbols get garbled with a naive regex replace.
13:14	csharp_	@ana Existing copyright symbols get garbled with a naive regex replace.
13:14	pinesol	csharp_: Brightly empty toxic sogginess
13:17	Dyrcona	csharp_++
13:17	Dyrcona	"Brightly empty toxic sogginess" seems like an apt description.
13:18	Dyrcona	$string =~ s/\xC2/\xE2\x84\x97/gu; ends up garbling copyright symbols.
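Why the global replace garbles existing copyright symbols: every 0xC2 byte is replaced, including the one that is the lead byte of a valid \xC2\xA9 pair, which leaves an orphaned 0xA9 behind. Below is a Python sketch of the same substitution on bytes; the sample data is invented for illustration.

```python
import re

# Hypothetical sample: a stray 0xC2 followed by a space, then a valid copyright sign.
sample = b"\xc2 2022 Example \xc2\xa9 2022"

# Naive replacement of every 0xC2 with the UTF-8 bytes for U+2117.
naive = re.sub(rb"\xc2", b"\xe2\x84\x97", sample)

# The valid pair C2 A9 has become E2 84 97 A9: a stray continuation byte remains.
try:
    naive.decode("utf-8")
    naive_is_valid = True
except UnicodeDecodeError:
    naive_is_valid = False

print(naive_is_valid)  # False
```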
13:18	mmorgan	Profound indeed! But pinesol missed a few letters.
13:19	Dyrcona	I've also added use utf8; and use feature 'unicode_strings'; both of which are supposed to make regexes follow "Unicode rules."
13:20	Dyrcona	As does the 'u' modifier after the g.
13:21	Dyrcona	jwz was almost right. If you use regexes with unicode, now you've got 3 problems. :)
13:23	Dyrcona	I may just write something in C++ to see if it has this problem.
13:23	Dyrcona	I suspect it might.
13:26	Dyrcona	I wouldn't spend so much time on this, except that the records with \xC2 and \xC3 don't load.
13:31	Dyrcona	$string =~ s/\xC2(?= )/\xE2\x84\x97/ug; works for my sample, but seems rickety.
13:33	Dyrcona	It also strips the following space. I thought there was one of those that didn't remove the matching characters....
13:37	Dyrcona	Looks like I may have to scan each record char by char..... :(
13:39		rfrasur joined #evergreen
13:55	Dyrcona	Hmm.. I may have yet a regex option. It looks like \xC2 and \xC3 can only be followed by values from \x80 to \xBF as valid UTF-8 characters, so if I match on them followed by any other value, it should work.
14:00	Dyrcona	And, that zero-width assertion that I used earlier was working to not "swallow" the space. The phono copyright symbol just takes up extra width in my terminal's font.
14:02	Dyrcona	We have a winner! $string =~ s/\xC2(?![\x80-\xBF])/\xE2\x84\x97/gu; and $string =~ s/\xC3(?![\x80-\xBF])/\xC2\xA9/gu;
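The negative lookahead only matches a lead byte that is not followed by a continuation byte (0x80 to 0xBF), so valid two-byte sequences are left alone and nothing after the match is consumed. This is a Python rendering of the two substitutions, working on bytes; the function name `repair` and the sample record text are made up for illustration.

```python
import re

# A stray lead byte is one NOT followed by a continuation byte (0x80-0xBF).
STRAY_C2 = re.compile(rb"\xc2(?![\x80-\xbf])")
STRAY_C3 = re.compile(rb"\xc3(?![\x80-\xbf])")


def repair(raw: bytes) -> bytes:
    """Replace stray 0xC2 with U+2117 and stray 0xC3 with U+00A9, as UTF-8 bytes."""
    raw = STRAY_C2.sub(b"\xe2\x84\x97", raw)
    return STRAY_C3.sub(b"\xc2\xa9", raw)


# Hypothetical sample: two stray bytes plus valid C3 A9 (e-acute) and C2 A9 pairs.
sample = b"\xc2 2021 Label, \xc3 2022 Publisher, caf\xc3\xa9 \xc2\xa9"
fixed = repair(sample)

# The result decodes cleanly; the valid pairs are untouched.
text = fixed.decode("utf-8")
print(text.encode("unicode_escape").decode("ascii"))
```

One caveat of this approach: a stray 0xC2 or 0xC3 that happens to be followed by a byte in 0x80 to 0xBF would still look like a valid pair and be skipped.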
14:11	Dyrcona	It seems like that took too long to figure out. :)
14:21	Dyrcona	Hm... Next Q: Is it possible to call update_leader on a MARC::Record...
14:25	Dyrcona	In my specific case, I suppose it won't matter. The program will make changes to several tags anyway before outputting the record, so the length should get updated.
14:26	Dyrcona	There is a bug related to that, and it probably affects these multibyte characters, too.
14:30		rfrasur joined #evergreen
14:37	Dyrcona	Ugh. When I modify my prep program with the substitution code, I get a ton of the "does not map to Unicode" errors that started this whole investigation.
14:39	Dyrcona	Oof.... Helps to run it on the correct files in the correct directory....
14:41	Dyrcona	OK. I'm suspicious that my output from this run is the same size as from the previous run. Seems like it should be larger.
14:45	Dyrcona	So the substitution doesn't work on raw MARC data, apparently.
14:47	Dyrcona	diff says the file I just generated with the modified prep script is the same as the old one.
14:47	Dyrcona	Yes, I'm sure I used the new script.....
14:47	Dyrcona	Nice day for ducks. Looks like we're about to get a thunderstorm.
15:10	Dyrcona	Multiline match doesn't help...
15:18	Dyrcona	Single line mode doesn't make a difference either. So, maybe the non-UTF-8 characters are coming from MARC::Record?
15:19	Dyrcona	Bleh. marc--
15:21	Dyrcona	perl-- while i'm at it.
15:31	Dyrcona	@monologue
15:31	pinesol	Dyrcona: Your current monologue is at least 28 lines long.
15:42	mmorgan	Dyrcona: Did you get thunderstorms?
15:43	* mmorgan	sees grey clouds to the west
16:47	Dyrcona	mmorgan: Yes, got a brief thunderstorm. Sorry I stepped away right about the time you asked.
16:47	mmorgan	Dyrcona: no worries, we just got one, too. Grateful for the moisture!
17:02		mmorgan left #evergreen
17:33		jvwoolf left #evergreen