Time |
Nick |
Message |
06:17 |
|
collum joined #evergreen |
07:10 |
|
rjackson_isl_hom joined #evergreen |
07:50 |
|
BDorsey joined #evergreen |
08:16 |
|
mantis1 joined #evergreen |
08:42 |
|
mmorgan joined #evergreen |
09:39 |
|
Dyrcona joined #evergreen |
09:40 |
Dyrcona |
So, it looks like a batch of records from Overdrive are in whatever character set, individually: utf8 "\xC2" does not map to Unicode at /usr/lib/x86_64-linux-gnu/perl/5.26/Encode.pm line 212 |
09:48 |
Dyrcona |
Hmm.. If I'm reading the table correctly, that should be a "sound recording copyright," or a P with a circle around it in MARC-8. |
09:48 |
Dyrcona |
In UTF-8 |
09:49 |
Dyrcona |
In UTF-8 \xC2 is often the first of a pair, \xC2A9 is the copyright symbol, for example. |
09:50 |
Dyrcona |
So, I guess some of the records are UTF-8 and some are MARC-8, or even have a mix of characters in different character sets. I guess the question is, do I load the "bad" records? |
09:52 |
Dyrcona |
Sound recording copyright is \xE28497 in UTF-8. |
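The byte sequences quoted above can be checked directly. This is a minimal sketch in Python (not the Perl of the prep script), using only the standard library; the variable names are illustrative.

```python
# UTF-8 encodings of the two symbols discussed above.
copyright_sign = "\u00a9"     # COPYRIGHT SIGN
sound_recording = "\u2117"    # SOUND RECORDING COPYRIGHT

print(copyright_sign.encode("utf-8"))    # b'\xc2\xa9'
print(sound_recording.encode("utf-8"))   # b'\xe2\x84\x97'

# In UTF-8, 0xC2 is only the lead byte of a two-byte sequence, so a
# 0xC2 followed by a space cannot be decoded.
try:
    b"\xc2 ".decode("utf-8")
    lone_c2_is_valid = True
except UnicodeDecodeError:
    lone_c2_is_valid = False

print(lone_c2_is_valid)   # False
```

A strict UTF-8 decoder rejects the lone byte in the same way, which matches the "does not map to Unicode" error from Encode.pm.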
09:53 |
Dyrcona |
I suppose I could try loading the records as MARC-8 and see which leads to fewer issues. I seem to recall the whole thing blowing up when I didn't force the encoding of the MARC to UTF-8, though. |
10:01 |
Dyrcona |
Typing in a web form and my palm hit the track pad on my laptop thereby selecting all of the text in the field so that the next character that I typed replaced it all. Ctrl-z didn't bring it back.... |
10:01 |
Dyrcona |
It's going to be that kind of Friday.... |
10:05 |
* Dyrcona |
searches for a Perl module like chardet for Python. |
10:07 |
Dyrcona |
Well, there is this: https://github.com/boytm/chardet |
10:09 |
Dyrcona |
Mozilla has libchardet written in C++. |
10:09 |
Dyrcona |
So, maybe, I should try PyMarc and chardet for this project. |
10:13 |
mmorgan |
Ctrl-z must be taking a vacation day :-( |
10:14 |
Dyrcona |
I suppose. |
10:16 |
Dyrcona |
I'm going to try and solve my record issue by rewriting the prep script in Python using PyMarc and chardet to autodetect the character set of each record. I wonder what chardet will say about MARC-8 records? Maybe, it will call them ISO 2022? |
10:23 |
Dyrcona |
Maybe I can keep most of the Perl code and just throw the XML at a character set detection program? |
10:30 |
Dyrcona |
Running chardet on the input files says, "utf-8 with confidence 0.99." |
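A whole-file guess of "utf-8" can hide a handful of bad records. One cheap alternative, sketched here in Python under the assumption that the records are already split into byte strings, is a strict UTF-8 decode per record. The helper name `find_non_utf8` and the sample data are hypothetical.

```python
def find_non_utf8(records):
    """Return (record index, byte offset) for each record that fails a strict UTF-8 decode."""
    bad = []
    for i, raw in enumerate(records):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            bad.append((i, err.start))
    return bad


sample = [
    "Caf\u00e9 \u00a9 2020".encode("utf-8"),   # valid UTF-8
    b"\xc2 2019 Some Label",                    # stray MARC-8-style 0xC2
]

print(find_non_utf8(sample))   # [(1, 0)]
```

This only says which records are not valid UTF-8. It does not identify what encoding the stray bytes came from.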
10:30 |
* Dyrcona |
sighs. Guess I'll just jam the bad records in, like my current test is doing. |
10:49 |
|
jvwoolf joined #evergreen |
11:08 |
|
jihpringle joined #evergreen |
11:29 |
Dyrcona |
I wonder if $raw =~ s/\xC2/.../g; works on UTF-8 files that might have \xC2A9 and other multibyte sequences.... |
11:32 |
Dyrcona |
I suppose that I could test it. |
11:53 |
Dyrcona |
Weird, 0xC2A9 doesn't give me the copyright symbol. Let me try it reversed for little endian. |
11:54 |
Dyrcona |
Funny, too, how 0xC2 only prints "correctly" when other unicode sequences are in the string. Otherwise it prints as an unknown character glyph. |
12:26 |
Dyrcona |
Unicode in Perl is a pain. |
12:26 |
Dyrcona |
I is one of those Fridays.... |
12:26 |
Dyrcona |
s/^I/It/ |
12:28 |
Dyrcona |
I can't get the copyright symbol to print using hex codes. |
12:30 |
Dyrcona |
I get a Chinese glyph instead. Maybe I have to decode or encode it? |
12:30 |
Dyrcona |
I did specify use utf8; but that only works for the source code.... |
12:35 |
Dyrcona |
Well, decode('utf-8', $string) produces nothing. Switching to encode gives me what I get without it. |
12:36 |
jeff |
Dyrcona: your ``only prints "correctly" when other unicode sequences are in the string'' strongly brought to mind the section of perlunicode called ``The "Unicode Bug"'', and in refreshing my memory I'm amused to see that it uses "\xC2" in its example. :-) |
12:37 |
jeff |
earlier in the document, 0xC0 and 0xC1 are noted as being suitable choices as sentinel bytes, because "they never appear in well-formed UTF-8". |
12:38 |
Dyrcona |
Indeed it is. |
12:41 |
Dyrcona |
So, 0xC2A9 is the copyright symbol in UTF-8. However, I always get a Chinese glyph when it prints as something other than the unknown character glyph. The 0xC2 either prints as the unknown character glyph or it prints as capital A with the little circle on top. (I don't feel like looking up the actual name of the character.) |
12:54 |
Dyrcona |
Bug 1983725 sounds like one that I already filed.... |
12:54 |
pinesol |
Launchpad bug 1983725 in Evergreen "Ampersands in subject headings make for bad links" [Undecided,New] https://launchpad.net/bugs/1983725 |
12:56 |
Dyrcona |
Not quite: Bug 1021427 |
12:56 |
pinesol |
Launchpad bug 1021427 in Evergreen "Ampersand in Call Number causes not well-formed error" [Undecided,Confirmed] https://launchpad.net/bugs/1021427 |
12:57 |
Dyrcona |
Anyway, my LANG=en_US.UTF-8.... Back to reading perlunicode |
12:58 |
jeff |
Dyrcona: print "\xC2\xA9" seems to output the copyright symbol in question by default for me when ssh'd in to a remote Linux host running Perl v5.32.1 -- stock system Perl with Debian bullseye. LANG=en_US.UTF-8 |
12:58 |
jeff |
https://gist.github.com/jeff/b87a1fe7591b1f0414edd17603021cca |
12:58 |
jeff |
Am I doing it differently from you? |
12:59 |
jeff |
(that's without setting binmode(STDOUT, ":utf8");) |
13:01 |
Dyrcona |
Well, I was just gonna say that setting binmode(STDOUT, ":utf8") produces different results... |
13:01 |
jeff |
with the binmode call, I get Â© (I think because the raw bytes being printed are already utf8, and it's trying to double encode them) |
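The double-encoding effect can be reproduced outside Perl. This is a rough Python analogue, not the Perl I/O layer itself: it treats each already-encoded byte as a separate character and then encodes again.

```python
raw = b"\xc2\xa9"                     # already the UTF-8 bytes for the copyright sign

# Treat each byte as its own character (Latin-1 maps bytes 1:1 to code points),
# then encode to UTF-8 a second time, as an encoding output layer would.
as_chars = raw.decode("latin-1")
double_encoded = as_chars.encode("utf-8")

print(double_encoded)                  # b'\xc3\x82\xc2\xa9'
print(double_encoded.decode("utf-8"))  # Â©
```

The extra capital A with a circumflex in front of the symbol is the usual sign that text was encoded twice.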
13:01 |
jeff |
same/similar if i use: use open qw/:std :utf8/; |
13:02 |
Dyrcona |
Yeahp. |
13:02 |
Dyrcona |
I was doing this: my $string = sprintf("%c stome stuff %c", 0xC2, 0xC2A9); then printing $string. |
13:04 |
Dyrcona |
OK. Hardcoding the string "\xC2 stome stuff \xC2\xA9" works! |
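A likely reason the sprintf version misbehaved: %c takes a code point, not a sequence of UTF-8 bytes, so 0xC2A9 means the character U+C2A9 rather than the bytes C2 A9. This Python sketch shows the same distinction; `chr` plays the role of %c here, which is an analogy rather than a statement about Perl internals.

```python
import unicodedata

as_code_point = chr(0xC2A9)                       # one character, U+C2A9
as_bytes = bytes([0xC2, 0xA9]).decode("utf-8")    # two bytes decoded as UTF-8

print(unicodedata.name(as_code_point))    # a Hangul syllable name
print(as_code_point.encode("utf-8"))      # b'\xec\x8a\xa9'
print(unicodedata.name(as_bytes))         # COPYRIGHT SIGN
```

U+C2A9 falls in the Hangul Syllables block, which would explain the unexpected CJK-looking glyph.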
13:04 |
Dyrcona |
jeff++ |
13:04 |
Dyrcona |
I'm going to add the phonograph copyright symbol as well. |
13:05 |
Dyrcona |
It looks like my regex substitutions will have the desired effect. |
13:08 |
Dyrcona |
No, I only got "lucky" by not using the global qualifier on the first substitution. |
13:08 |
Dyrcona |
Existing copyright symbols get garbled with a naive regex replace. |
13:14 |
csharp_ |
@ana Existing copyright symbols get garbled with a naive regex replace. |
13:14 |
pinesol |
csharp_: Brightly empty toxic sogginess |
13:17 |
Dyrcona |
csharp_++ |
13:17 |
Dyrcona |
"Brightly empty toxic sogginess" seems like an apt description. |
13:18 |
Dyrcona |
$string =~ s/\xC2/\xE2\x84\x97/gu; ends up garbling copyright symbols. |
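The garbling is easy to see on raw bytes. This is a Python sketch of the same unconditional substitution; the sample data is made up.

```python
import re

# One stray MARC-8-style 0xC2 and one valid UTF-8 copyright sign (C2 A9).
data = b"\xc2 2019 Label, \xc2\xa9 2019 Publisher"

# Replace every 0xC2, including the lead byte of the valid sequence.
naive = re.sub(rb"\xc2", b"\xe2\x84\x97", data)
print(naive)

try:
    naive.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

print(still_valid)   # False
```

The 0xA9 that used to follow its lead byte is left dangling, so the result is no longer valid UTF-8.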
13:18 |
mmorgan |
Profound indeed! But pinesol missed a few letters. |
13:19 |
Dyrcona |
I've also added use utf8; and use feature 'unicode_strings'; both of which are supposed to make regexes follow "Unicode rules." |
13:20 |
Dyrcona |
As does the 'u' modifier after the g. |
13:21 |
Dyrcona |
jwz was almost right. If you use regexes with unicode, now you've got 3 problems. :) |
13:23 |
Dyrcona |
I may just write something in C++ to see if it has this problem. |
13:23 |
Dyrcona |
I suspect it might. |
13:26 |
Dyrcona |
I wouldn't spend so much time on this, except that the records with \xC2 and \xC3 don't load. |
13:31 |
Dyrcona |
$string =~ s/\xC2(?= )/\xE2\x84\x97/ug; works for my sample, but seems rickety. |
13:33 |
Dyrcona |
It also strips the following space. I thought there was one of those that didn't remove the matching characters.... |
13:37 |
Dyrcona |
Looks like I may have to scan each record char by char..... :( |
13:39 |
|
rfrasur joined #evergreen |
13:55 |
Dyrcona |
Hmm.. I may yet have a regex option. It looks like \xC2 and \xC3 can only be followed by values from \x80 to \xBF as valid UTF-8 characters, so if I match on them followed by any other value, it should work. |
14:00 |
Dyrcona |
And, that zero-width assertion that I used earlier was working to not "swallow" the space. The phono copyright symbol just takes up extra width in my terminal's font. |
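A lookahead checks what follows without consuming it, so the space survives the substitution. This is a small Python sketch of the same pattern on made-up bytes.

```python
import re

data = b"\xc2 2019"

# (?= ) requires a following space but leaves it out of the match.
fixed = re.sub(rb"\xc2(?= )", b"\xe2\x84\x97", data)

print(fixed)   # b'\xe2\x84\x97 2019'
```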
14:02 |
Dyrcona |
We have a winner! $string =~ s/\xC2(?![\x80-\xBF])/\xE2\x84\x97/gu; and $string =~ s/\xC3(?![\x80-\xBF])/\xC2\xA9/gu; |
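The two substitutions translate directly to a bytes regex. This Python sketch uses a hypothetical helper, `fix_stray_marc8`, and made-up sample data; it mirrors the Perl patterns above rather than the actual prep script.

```python
import re

SOUND_RECORDING = "\u2117".encode("utf-8")   # b'\xe2\x84\x97'
COPYRIGHT = "\u00a9".encode("utf-8")         # b'\xc2\xa9'


def fix_stray_marc8(raw: bytes) -> bytes:
    """Replace 0xC2/0xC3 only when they are not followed by a UTF-8 continuation byte."""
    raw = re.sub(rb"\xc2(?![\x80-\xbf])", SOUND_RECORDING, raw)
    raw = re.sub(rb"\xc3(?![\x80-\xbf])", COPYRIGHT, raw)
    return raw


# Stray 0xC2 and 0xC3 next to a valid e-acute (C3 A9) and copyright sign (C2 A9).
sample = b"\xc2 2019 Label, \xc3 2019 Author, caf\xc3\xa9 \xc2\xa9"
print(fix_stray_marc8(sample).decode("utf-8"))
```

A stray 0xC2 or 0xC3 that happens to be followed by a byte in the \x80 to \xBF range would still slip through, so this is a heuristic rather than a full MARC-8 conversion.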
14:11 |
Dyrcona |
It seems like that took too long to figure out. :) |
14:21 |
Dyrcona |
Hm... Next Q: Is it possible to call update_leader on a MARC::Record... |
14:25 |
Dyrcona |
In my specific case, I suppose it won't matter. The program will make changes to several tags anyway before outputting the record, so the length should get updated. |
14:26 |
Dyrcona |
There is a bug related to that, and it probably affects these multibyte characters, too. |
14:30 |
|
rfrasur joined #evergreen |
14:37 |
Dyrcona |
Ugh. When I modify my prep program with the substitution code, I get a ton of the "does not map to Unicode" errors that started this whole investigation. |
14:39 |
Dyrcona |
Oof.... Helps to run it on the correct files in the correct directory.... |
14:41 |
Dyrcona |
OK. I'm suspicious that my output from this run is the same size as from the previous run. Seems like it should be larger. |
14:45 |
Dyrcona |
So the substitution doesn't work on raw MARC data, apparently. |
14:47 |
Dyrcona |
diff says the file I just generated with the modified prep script is the same as the old one. |
14:47 |
Dyrcona |
Yes, I'm sure I used the new script..... |
14:47 |
Dyrcona |
Nice day for ducks. Looks like we're about to get a thunderstorm. |
15:10 |
Dyrcona |
Multiline match doesn't help... |
15:18 |
Dyrcona |
Single line mode doesn't make a difference either. So, maybe the non-UTF-8 characters are coming from MARC::Record? |
15:19 |
Dyrcona |
Bleh. marc-- |
15:21 |
Dyrcona |
perl-- while i'm at it. |
15:31 |
Dyrcona |
@monologue |
15:31 |
pinesol |
Dyrcona: Your current monologue is at least 28 lines long. |
15:42 |
mmorgan |
Dyrcona: Did you get thunderstorms? |
15:43 |
* mmorgan |
sees grey clouds to the west |
16:47 |
Dyrcona |
mmorgan: Yes, got a brief thunderstorm. Sorry I stepped away right about the time you asked. |
16:47 |
mmorgan |
Dyrcona: no worries, we just got one, too. Grateful for the moisture! |
17:02 |
|
mmorgan left #evergreen |
17:33 |
|
jvwoolf left #evergreen |