Evergreen ILS Website

IRC log for #evergreen, 2022-08-05


All times shown according to the server's local time.

Time Nick Message
06:17 collum joined #evergreen
07:10 rjackson_isl_hom joined #evergreen
07:50 BDorsey joined #evergreen
08:16 mantis1 joined #evergreen
08:42 mmorgan joined #evergreen
09:39 Dyrcona joined #evergreen
09:40 Dyrcona So, it looks like a batch of records from Overdrive are in whatever character set, individually: utf8 "\xC2" does not map to Unicode at /usr/lib/x86_64-linux-gnu/perl/5.26/Encode.pm line 212
09:48 Dyrcona Hmm.. If I'm reading the table correctly, that should be a "sound recording copyright," or a P with a circle around it in MARC-8.
09:48 Dyrcona In UTF-8
09:49 Dyrcona In UTF-8 \xC2 is often the first of a pair, \xC2A9 is the copyright symbol, for example.
09:50 Dyrcona So, I guess some of the records are UTF-8 and some are MARC-8, or even have a mix of characters in different character sets. I guess the question is, do I load the "bad" records?
09:52 Dyrcona Sound recording copyright is \xE28497 in UTF-8.
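The byte values quoted so far can be checked directly. The following is a minimal Python sketch (Python because the discussion below considers PyMarc and chardet; the variable names are illustrative only). It shows the UTF-8 encodings of both symbols and confirms that a lone 0xC2 followed by a space is not valid UTF-8.

```python
# UTF-8 encodings of the two symbols under discussion.
copyright_sign = "\u00a9"    # COPYRIGHT SIGN
sound_recording = "\u2117"   # SOUND RECORDING COPYRIGHT

print(copyright_sign.encode("utf-8").hex())   # c2a9
print(sound_recording.encode("utf-8").hex())  # e28497

# In UTF-8, 0xC2 is a lead byte and must be followed by a continuation
# byte (0x80-0xBF). Followed by a space, it fails a strict decode.
try:
    b"\xc2 2020".decode("utf-8")
    lone_c2_is_valid = True
except UnicodeDecodeError:
    lone_c2_is_valid = False

print(lone_c2_is_valid)  # False
```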
09:53 Dyrcona I suppose I could try loading the records as MARC-8 and see which leads to fewer issues. I seem to recall the whole thing blowing up when I didn't force the encoding of the MARC to UTF-8, though.
10:01 Dyrcona Typing in a web form and my palm hit the track pad on my laptop thereby selecting all of the text in the field so that the next character that I typed replaced it all. Ctrl-z didn't bring it back....
10:01 Dyrcona It's going to be that kind of Friday....
10:05 * Dyrcona searches for a Perl module like chardet for Python.
10:07 Dyrcona Well, there is this: https://github.com/boytm/chardet
10:09 Dyrcona Mozilla has libchardet written in C++.
10:09 Dyrcona So, maybe, I should try PyMarc and chardet for this project.
10:13 mmorgan Ctrl-z must be taking a vacation day :-(
10:14 Dyrcona I suppose.
10:16 Dyrcona I'm going to try and solve my record issue by rewriting the prep script in Python using PyMarc and chardet to autodetect the character set of each record. I wonder what chardet will say about MARC-8 records? Maybe, it will call them ISO 2022?
10:23 Dyrcona Maybe I can keep most of the Perl code and just throw the XML at a character set detection program?
10:30 Dyrcona Running chardet on the input files says, "utf-8 with confidence 0.99."
10:30 * Dyrcona sighs. Guess I'll just jam the bad records in, like my current test is doing.
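A whole-file detector can report UTF-8 with high confidence even when a few records contain stray bytes, because the guess is statistical. One possible alternative, sketched here as an assumption and not as what the prep script does, is a strict UTF-8 decode of each record using only the Python standard library. The function name and the sample data are hypothetical; the only MARC-specific fact relied on is that records end with the 0x1D record terminator.

```python
from typing import List

RECORD_TERMINATOR = b"\x1d"  # MARC 21 (ISO 2709) end-of-record byte


def find_non_utf8_records(raw: bytes) -> List[int]:
    """Return 0-based positions of records that fail a strict UTF-8 decode."""
    bad = []
    records = [r for r in raw.split(RECORD_TERMINATOR) if r]
    for index, record in enumerate(records):
        try:
            record.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            bad.append(index)
    return bad


# Made-up stand-ins for records, not real MARC data: the second one
# contains a lone 0xC2 followed by a space.
sample = b"good \xc2\xa9 2019\x1d" + b"bad \xc2 2020\x1d" + b"plain ascii\x1d"
print(find_non_utf8_records(sample))  # [1]
```

This only flags records that are not well-formed UTF-8. It does not identify which character set the stray bytes came from.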
10:49 jvwoolf joined #evergreen
11:08 jihpringle joined #evergreen
11:29 Dyrcona I wonder if $raw =~ s/\xC2/.../g; works on UTF-8 files that might have \xC2A9 and other multibyte sequences....
11:32 Dyrcona I suppose that I could test it.
11:53 Dyrcona Weird, 0xC2A doesn't give me the copyright symbol. Let me try it reversed for little endian.
11:54 Dyrcona Funny, too, how 0xC2 only prints "correctly" when other unicode sequences are in the string. Otherwise it prints as an unknown character glyph.
12:26 Dyrcona Unicode in Perl is a pain.
12:26 Dyrcona I is one of those Fridays....
12:26 Dyrcona s/^I/It/
12:28 Dyrcona I can't get the copyright symbol to print using hex codes.
12:30 Dyrcona I get a Chinese glyph instead. Maybe I have to decode or encode it?
12:30 Dyrcona I did specify use utf8; but that only works for the source code....
12:35 Dyrcona Well, decode('utf-8', $string) produces nothing. Switching to encode gives me what I get without it.
12:36 jeff Dyrcona: your ``only prints "correctly" when other unicode sequences are in the string'' strongly brought to mind the section of perlunicode called ``The "Unicode Bug"'', and in refreshing my memory I'm amused to see that it uses "\xC2" in its example. :-)
12:37 jeff earlier in the document, 0xC1 and 0xC2 are noted as being suitable choices as sentinel bytes, because "they never appear in well-formed UTF-8".
12:38 Dyrcona Indeed it is.
12:41 Dyrcona So, 0xC2A9 is the copyright symbol in UTF-8. However, I always get a Chinese glyph when it prints as something other than the unknown character glyph. The 0xC2 either prints as the unknown character glyph or it prints as capital A with the little circle on top. (I don't feel like looking up the actual name of the character.)
12:54 Dyrcona Bug 1983725 sounds like one that I already filed....
12:54 pinesol Launchpad bug 1983725 in Evergreen "Ampersands in subject headings make for bad links" [Undecided,New] https://launchpad.net/bugs/1983725
12:56 Dyrcona Not quite: Bug 1021427
12:56 pinesol Launchpad bug 1021427 in Evergreen "Ampersand in Call Number causes not well-formed error" [Undecided,Confirmed] https://launchpad.net/bugs/1021427
12:57 Dyrcona Anyway, my LANG=en_US.UTF-8.... Back to reading perlunicode
12:58 jeff Dyrcona: print "\xC2\xA9" seems to output the copyright symbol in question by default for me when ssh'd in to a remote Linux host running Perl v5.32.1 -- stock system Perl with Debian bullseye. LANG=en_US.UTF-8
12:58 jeff https://gist.github.com/jeff/b87a1fe7591b1f0414edd17603021cca
12:58 jeff Am I doing it differently from you?
12:59 jeff (that's without setting binmode(STDOUT, ":utf8");)
13:01 Dyrcona Well, I was just gonna say that setting binmode(STDOUT, ":utf8") produces different results...
13:01 jeff with the binmode call, I get Â© (I think because the raw bytes being printed are already utf8, and it's trying to double encode them)
13:01 jeff same/similar if i use: use open qw/:std :utf8/;
13:02 Dyrcona Yeahp.
13:02 Dyrcona I was doing this: my $string = sprintf("%c stome stuff %c", 0xC2, 0xC2A9); then printing $string.
13:04 Dyrcona OK. Hardcoding the string "\xC2 stome stuff \xC2\xA9" works!
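The difference between the two attempts comes down to code points versus bytes. Perl's sprintf("%c", 0xC2A9) asks for the single character numbered 0xC2A9, whereas "\xC2\xA9" is two bytes. The Python sketch below uses chr() as a rough analogue of %c (an analogy, not a claim that the two languages handle strings identically) and also reproduces the double-encoding effect seen with binmode.

```python
import unicodedata

# Read as ONE code point, 0xC2A9 is U+C2A9, which falls in the
# Hangul Syllables block. It is not the copyright sign.
as_code_point = chr(0xC2A9)
print(unicodedata.name(as_code_point).startswith("HANGUL SYLLABLE"))  # True

# Read as TWO bytes and decoded as UTF-8, the same digits give U+00A9.
as_bytes = bytes([0xC2, 0xA9]).decode("utf-8")
print(as_bytes == "\u00a9")  # True

# Double encoding: treat the already-encoded bytes as Latin-1 characters
# and encode them to UTF-8 a second time.
double_encoded = "\u00a9".encode("utf-8").decode("latin-1").encode("utf-8")
print(double_encoded.hex())  # c382c2a9
```

The doubled bytes display as a capital A with a circumflex followed by the copyright sign, which matches the output described for the binmode(STDOUT, ":utf8") case.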
13:04 Dyrcona jeff++
13:04 Dyrcona I'm going to add the phonograph copyright symbol as well.
13:05 Dyrcona It looks like my regex substitutions will have the desired effect.
13:08 Dyrcona No, I only got "lucky" by not using the global qualifier on the first substitution.
13:08 Dyrcona Existing copyright symbols get garbled with a naive regex replace.
13:14 csharp_ @ana Existing copyright symbols get garbled with a naive regex replace.
13:14 pinesol csharp_: Brightly empty toxic sogginess
13:17 Dyrcona csharp_++
13:17 Dyrcona "Brightly empty toxic sogginess" seems like an apt description.
13:18 Dyrcona $string =~ s/\xC2/\xE2\x84\x97/gu; ends up garbling copyright symbols.
13:18 mmorgan Profound indeed! But pinesol missed a few letters.
13:19 Dyrcona I've also added use utf8; and use feature 'unicode_strings'; both of which are supposed to make regexes follow "Unicode rules."
13:20 Dyrcona As does the 'u' modifier after the g.
13:21 Dyrcona jwz was almost right. If you use regexes with unicode, now you've got 3 problems. :)
13:23 Dyrcona I may just write something in C++ to see if it has this problem.
13:23 Dyrcona I suspect it might.
13:26 Dyrcona I wouldn't spend so much time on this, except that the records with \xC2 and \xC3 don't load.
13:31 Dyrcona $string =~ s/\xC2(?= )/\xE2\x84\x97/ug; works for my sample, but seems rickety.
13:33 Dyrcona It also strips the following space. I thought there was one of those that didn't remove the matching characters....
13:37 Dyrcona Looks like I may have to scan each record char by char..... :(
13:39 rfrasur joined #evergreen
13:55 Dyrcona Hmm.. I may have yet a regex option. It looks like \xC2 and \xC3 can only be followed by values from \x80 to \xBF as valid UTF-8 characters, so if I match on them followed by any other value, it should work.
14:00 Dyrcona And, that zero-width assertion that I used earlier was working to not "swallow" the space. The phono copyright symbol just takes up extra width in my terminal's font.
14:02 Dyrcona We have a winner! $string =~ s/\xC2(?![\x80-\xBF])/\xE2\x84\x97/gu; and $string =~ s/\xC3(?![\x80-\xBF])/\xC2\xA9/gu;
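The same idea can be sketched on byte strings in Python. This is an illustrative translation of the two substitutions, not the prep script itself. The function name and sample data are made up. The negative lookahead is zero-width, so the byte after the stray 0xC2 or 0xC3 is left in place.

```python
import re

# A lead byte 0xC2 or 0xC3 that is NOT followed by a UTF-8 continuation
# byte (0x80-0xBF) is treated as a stray MARC-8 character.
LONE_C2 = re.compile(rb"\xC2(?![\x80-\xBF])")
LONE_C3 = re.compile(rb"\xC3(?![\x80-\xBF])")


def repair(data: bytes) -> bytes:
    """Replace stray 0xC2 with the UTF-8 sound recording copyright sign
    and stray 0xC3 with the UTF-8 copyright sign."""
    data = LONE_C2.sub(b"\xe2\x84\x97", data)
    return LONE_C3.sub(b"\xc2\xa9", data)


# Made-up sample: a stray 0xC2, a valid UTF-8 copyright sign, a stray 0xC3.
sample = b"\xc2 2020 Label, \xc2\xa9 2019, \xc3 2018"
fixed = repair(sample)
print(fixed.decode("utf-8"))

# The naive replacement rewrites the 0xC2 inside the valid copyright
# sign as well, leaving an orphaned 0xA9 continuation byte behind.
naive = sample.replace(b"\xc2", b"\xe2\x84\x97")
print(b"\xe2\x84\x97\xa9" in naive)  # True
```

One limitation of this approach: a stray 0xC2 or 0xC3 that happens to be followed by a byte in the 0x80-0xBF range is indistinguishable from valid UTF-8 and is left alone.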
14:11 Dyrcona It seems like that took too long to figure out. :)
14:21 Dyrcona Hm... Next Q: Is it possible to call update_leader on a MARC::Record...
14:25 Dyrcona In my specific case, I suppose it won't matter. The program will make changes to several tags anyway before outputting the record, so the length should get updated.
14:26 Dyrcona There is a bug related to that, and it probably affects these multibyte characters, too.
14:30 rfrasur joined #evergreen
14:37 Dyrcona Ugh. When I modify my prep program with the substitution code, I get a ton of the "does not map to Unicode" errors that started this whole investigation.
14:39 Dyrcona Oof.... Helps to run it on the correct files in the correct directory....
14:41 Dyrcona OK. I'm suspicious that my output from this run is the same size as from the previous run. Seems like it should be larger.
14:45 Dyrcona So the substitution doesn't work on raw MARC data, apparently.
14:47 Dyrcona diff says the file I just generated with the modified prep script is the same as the old one.
14:47 Dyrcona Yes, I'm sure I used the new script.....
14:47 Dyrcona Nice day for ducks. Looks like we're about to get a thunderstorm.
15:10 Dyrcona Multiline match doesn't help...
15:18 Dyrcona Single line mode doesn't make a difference either. So, maybe the non-UTF-8 characters are coming from MARC::Record?
15:19 Dyrcona Bleh. marc--
15:21 Dyrcona perl-- while i'm at it.
15:31 Dyrcona @monologue
15:31 pinesol Dyrcona: Your current monologue is at least 28 lines long.
15:42 mmorgan Dyrcona: Did you get thunderstorms?
15:43 * mmorgan sees grey clouds to the west
16:47 Dyrcona mmorgan: Yes, got a brief thunderstorm. Sorry I stepped away right about the time you asked.
16:47 mmorgan Dyrcona: no worries, we just got one, too. Grateful for the moisture!
17:02 mmorgan left #evergreen
17:33 jvwoolf left #evergreen
