11:18 |
BAMkubasa |
thanks! |
11:19 |
berick |
hm, our main metadata schema is MARC. MODS is used in some places for extracting specific bits of information. |
11:20 |
BAMkubasa |
ok |
11:20 |
* Dyrcona |
assumed MARC is one of the main data schemas, not metadata, but suppose we could argue about that.
11:22 |
Dyrcona |
But, we do use MODS for indexing and a good bit of display in the OPAC. It would be easier to add a new schema if you can represent it with XSLT. |
11:22 |
BAMkubasa |
so, xpath is a tool/language used to interrogate xml (or structured data?), and the xml would be the thing that would have the schema if I'm remembering how these things interact correctly? |
11:24 |
Dyrcona |
So, we mostly use XSLT to convert from MARC to MODS. Some index and display fields forgo the MODS transforms and use an XPath expression to extract the needed data from the MARC.
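(For anyone following along, a minimal sketch of that kind of XPath extraction, using XML::LibXML against MARCXML; the 245$a target is just an illustration, not one of Evergreen's configured expressions.)

    use XML::LibXML;

    my $doc = XML::LibXML->load_xml(string => $marcxml);
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(marc => 'http://www.loc.gov/MARC21/slim');
    for my $node ($xpc->findnodes(
        '//marc:datafield[@tag="245"]/marc:subfield[@code="a"]')) {
        print $node->textContent, "\n";   # title proper
    }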
11:25 |
Dyrcona |
It's easy to add new fields with XPath. For instance, if there is some RDA field you want to extract from MARC.
11:25 |
Bmagic |
BAMkubasa: I end up referencing config.xml_transform which is a straight copy from LoC (right?) |
11:26 |
Dyrcona |
I guess a more appropriate question is what is the student trying to do? Add Bibframe or something like that? |
11:26 |
BAMkubasa |
Don't know, they asked their local librarian, who asked me
11:26 |
Dyrcona |
Bmagic: Almost a straight copy. We've modified one or two of the transforms in the past. |
11:27 |
Bmagic |
I've resorted to editing those templates too. I'm not sure if I have a copy of an Evergreen database with tweaks anymore, though
11:27 |
Dyrcona |
Adding a new schema for bib records would be a big deal, i.e. a lot of work. If you have some other format that you could convert from MARC via XSLT, that would be easier. |
11:27 |
Bmagic |
I think it was to include more tags for the keyword index (before the feature was added to Evergreen, making that easier) |
11:28 |
Dyrcona |
Bmagic: By "we," i meant the Evergreen community/developers. Not all of the transforms are strictly stock from LoC. I also think we sometimes fall behind LoC changes to the canonical set. |
11:30 |
Bmagic |
right on |
10:03 |
Dyrcona |
Lowercase n with tilde is showing up in my console as the two single byte characters represented by its component bytes. Something isn't handling multibyte UTF8 properly somewhere. |
10:04 |
Dyrcona |
So, it's probably the load program that I had so much trouble getting Unicode to work with in the past few months.
10:09 |
|
jvwoolf joined #evergreen |
10:15 |
Dyrcona |
Definitely coming from MARC::Record->new_from_usmarc() in the load program, regardless of whether I set the file handle to UTF-8 or not.
10:22 |
Dyrcona |
MARC::Charset is up to date (1.35). |
10:28 |
Dyrcona |
MARC::Charset shouldn't be involved and doesn't look like it is. The error is not coming directly from MARC::File::USMARC::decode() either. |
10:29 |
Dyrcona |
I tried installing the latest Encode.pm and no difference. |
10:30 |
Dyrcona |
Oh, I may stand corrected: https://metacpan.org/module/MARC::File::USMARC/source#L172 |
10:31 |
Dyrcona |
Bingo! Source of my error, and it is Encode.pm or the edition of Unicode available to Perl on my system. |
10:32 |
Dyrcona |
https://metacpan.org/module/MARC::File::Encode/source#L35 |
10:39 |
Dyrcona |
Or, maybe it's just "The Unicode Bug..." :( |
10:45 |
Dyrcona |
Looks like I might be able to avoid this by converting the records to MARCXML first. |
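(A minimal sketch of that conversion with MARC::File::XML, assuming $raw holds one binary MARC record.)

    use MARC::Record;
    use MARC::File::XML (BinaryEncoding => 'UTF-8');

    my $record = MARC::Record->new_from_usmarc($raw);   # binary MARC in
    my $xml    = $record->as_xml_record();              # MARCXML string out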
10:53 |
Dyrcona |
Outside of a MARC context, I can't make decode crash on those characters. |
11:16 |
|
BDorsey joined #evergreen |
11:20 |
Dyrcona |
Well, it blows up elsewhere using MARC::File::XML: 2 :129: parser error : Input is not proper UTF-8, indicate encoding ! |
11:20 |
Dyrcona |
Bytes: 0xA9 0x22 0x20 0x69 |
11:21 |
|
jihpringle joined #evergreen |
11:25 |
Dyrcona |
@monologue |
11:25 |
pinesol |
Dyrcona: Your current monologue is at least 23 lines long. |
11:39 |
Dyrcona |
Ha! I want a goto for a valid reason. I want a label outside my main loop that I can branch to when there is an error. Otherwise, I don't want to interfere with the flow. |
11:42 |
Dyrcona |
Maybe I just need to change my loop to a do while. |
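(Perl's loop labels cover the goto use case; a minimal sketch of the branch-on-error shape, with a hypothetical handler body.)

    RECORD: while (my $raw = <$fh>) {
        my $record = eval { MARC::Record->new_from_usmarc($raw) };
        if ($@ or not defined $record) {
            warn "record $. failed: $@";
            next RECORD;   # bail out of this iteration only
        }
        # ... normal processing, undisturbed ...
    }
    # cleanup after the loop plays the role of the goto target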
11:46 |
Dyrcona |
Well, MARC::File::XML just makes it worse. More records get spit out that way. |
11:52 |
Dyrcona |
I think the increase in errors comes from the records with invalid lengths and indicators getting mangled when converted to XML. |
12:31 |
csharp_ |
Dyrcona: fwiw, I usually learn from your monologues :-) |
12:31 |
csharp_ |
Dyrcona++ |
12:36 |
mmorgan |
Dyrcona++ |
12:37 |
Dyrcona |
Thanks! |
12:41 |
Dyrcona |
I used yaz-marcdump to convert the binary MARC to XML, and that was after I had preprocessed the file from the vendor. So it could be that my preprocessor program is writing junk, but I can't find bad UTF-8 in it.
12:42 |
Dyrcona |
It just looks like when going through the MARC modules, Encode suddenly doesn't like otherwise valid \xC2 and \xC3 sequences.
12:56 |
Dyrcona |
Very interesting: If I use a program to split the binary file into records using \x1E\x1D as the input record separator and then run decode('UTF-8', $raw_record), I get no errors, so the issue is definitely coming from the MARC modules somehow.
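(A minimal sketch of that record-splitting test; the file name is hypothetical, and FB_CROAK makes decode() die on malformed UTF-8 instead of silently substituting.)

    use Encode qw(decode FB_CROAK);

    local $/ = "\x1E\x1D";   # end-of-field + end-of-record terminators
    open my $fh, '<:raw', 'vendor.mrc' or die $!;
    while (my $raw = <$fh>) {
        my $ok = eval { decode('UTF-8', $raw, FB_CROAK); 1 };
        warn "bad UTF-8 in record $.\n" unless $ok;
    }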
12:57 |
Dyrcona |
That's using the preprocessed file. I'll see what happens with the files directly from the vendor. |
12:58 |
Dyrcona |
Ditto... Zero errors. |
13:06 |
|
jihpringle joined #evergreen |
13:13 |
csharp_ |
Dyrcona: I know you've been working on this for days and have probably ruled this out, but I've seen stupid stuff where the \xC2 literal characters were themselves mis-encoded somehow
13:14 |
csharp_ |
as in literally "\XC2" where one of those was some unicode character that escaped notice |
13:16 |
Dyrcona |
csharp_: I originally thought that is what the problem was, or rather a MARC-8 \xC2 that got into the UTF-8 data. \xC2 in MARC-8 is the P-in-a-circle sound recording copyright symbol, and \xC3 is the regular copyright symbol.
13:17 |
Dyrcona |
However, the input actually has valid UTF-8 sequences and using Encode::decode on the raw data on a record by record basis does not output any errors. The errors come when MARC::File::USMARC::decode() is run on a record. |
13:18 |
csharp_ |
ah |
13:19 |
Dyrcona |
Hmm... I have another idea..... |
13:23 |
Dyrcona |
I tried MARC::File::USMARC::decode() on the files from the vendor and there are no errors. When I run it on the preprocessed file, I get errors. So, my preprocessor must be doing something wrong, even though its output decodes as UTF-8 just like the raw input....
13:25 |
Dyrcona |
If I have to set the output stream to UTF-8, I'll be upset. I spent several days fiddling with that before and I swear that I got it right..... |
13:27 |
Dyrcona |
So, I'm already setting binmode on the output to :utf8. Maybe I should do :bytes or :raw? |
13:32 |
Dyrcona |
I guess :raw isn't a thing.... |
13:38 |
Dyrcona |
csharp_++ again for suggesting an encoding issue. Looks like my preprocessor was double encoding some characters. |
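(That's the usual shape of a double-encoding bug: a string that already holds UTF-8 bytes gets printed through a :utf8 layer and encoded a second time. A minimal sketch of the byte path, assuming $raw_bytes is already-encoded data; :raw is, for what it's worth, a documented PerlIO pseudo-layer.)

    # $raw_bytes already holds encoded UTF-8; write it through a byte
    # layer. An additional :utf8/:encoding layer would encode it twice.
    open my $out, '>:raw', 'out.mrc' or die $!;
    print {$out} $raw_bytes;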
14:03 |
csharp_ |
Dyrcona: oh wow |
14:11 |
Dyrcona |
This line in the perlunicode documentation is misleading: Use the ":encoding(...)" layer to read from and write to filehandles using the specified encoding. |
14:11 |
Dyrcona |
I suspect it only applies if you're not manually decoding the data, which the MARC code does. |
14:32 |
jeffdavis |
bug 1979345 adds a new permission to govern the hold pull list; is that an OK change to include in a point release, assuming there's a release note? |
14:32 |
pinesol |
Launchpad bug 1979345 in Evergreen "Angular Holds Pull List Doesn't Scope" [Medium,Confirmed] https://launchpad.net/bugs/1979345 |
14:34 |
csharp_ |
I would probably wait until the next release |
14:46 |
jeffdavis |
We're going live with the new perm when we upgrade to 3.9 this weekend, I'll reconcile myself to having to renumber our permissions at some point. :) |
14:48 |
|
rfrasur joined #evergreen |
14:59 |
Dyrcona |
We've had to do things like that after upgrades before because we've backported things from future releases. |
15:00 |
Dyrcona |
On the subject of MARC and encoding, just to complicate things: if you're pulling records from Evergreen via DBI as MARCXML and then converting them to USMARC to write to a file, you have to set the output stream to UTF-8 encoding, or you get errors reading the output file.
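(A minimal sketch of that export path, assuming DBD::Pg hands back the MARCXML as a decoded character string, hence the encoding layer on the way out; the variable and file names are hypothetical.)

    use MARC::Record;
    use MARC::File::XML (BinaryEncoding => 'UTF-8');

    my $record = MARC::Record->new_from_xml($marcxml, 'UTF-8', 'USMARC');
    open my $out, '>:encoding(UTF-8)', 'export.mrc' or die $!;
    print {$out} $record->as_usmarc();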
15:02 |
Dyrcona |
It's always fun to relearn things like this every 3 or so years. :) |
15:10 |
Dyrcona |
jeffdavis: One thing that I often do is wait for the code to make it into master, then I cherry-pick the commits into my local branch so I have the correct id numbers and db upgrade codes. This makes generating the db upgrade script for future upgrades easier. |
15:11 |
jeffdavis |
hm, and I guess there's nothing preventing that commit from going into master even if it doesn't get backported to 3.8/3.9 |
08:42 |
|
mmorgan joined #evergreen |
09:39 |
|
Dyrcona joined #evergreen |
09:40 |
Dyrcona |
So, it looks like a batch of records from Overdrive are in whatever character set, individually: utf8 "\xC2" does not map to Unicode at /usr/lib/x86_64-linux-gnu/perl/5.26/Encode.pm line 212 |
09:48 |
Dyrcona |
Hmm.. If I'm reading the table correctly, that should be a "sound recording copyright," or a P with a circle around it in MARC-8. |
09:48 |
Dyrcona |
In UTF-8 |
09:49 |
Dyrcona |
In UTF-8, \xC2 is often the first of a pair; \xC2\xA9 is the copyright symbol, for example.
09:50 |
Dyrcona |
So, I guess some of the records are UTF-8 and some are MARC-8, or even have a mix of characters in different character sets. I guess the question is, do I load the "bad" records? |
09:52 |
Dyrcona |
Sound recording copyright is \xE2\x84\x97 in UTF-8.
09:53 |
Dyrcona |
I suppose I could try loading the records as MARC-8 and see which leads to fewer issues. I seem to recall the whole thing blowing up when I didn't force the encoding of the MARC to UTF-8, though. |
10:01 |
Dyrcona |
Typing in a web form and my palm hit the track pad on my laptop thereby selecting all of the text in the field so that the next character that I typed replaced it all. Ctrl-z didn't bring it back.... |
10:01 |
Dyrcona |
It's going to be that kind of Friday.... |
10:05 |
* Dyrcona |
searches for a Perl module like chardet for Python. |
10:09 |
Dyrcona |
So, maybe, I should try PyMarc and chardet for this project. |
10:13 |
mmorgan |
Ctrl-z must be taking a vacation day :-( |
10:14 |
Dyrcona |
I suppose. |
10:16 |
Dyrcona |
I'm going to try and solve my record issue by rewriting the prep script in Python using PyMarc and chardet to autodetect the character set of each record. I wonder what chardet will say about MARC-8 records? Maybe, it will call them ISO 2022? |
10:23 |
Dyrcona |
Maybe I can keep most of the Perl code and just throw the XML at a character set detection program? |
10:30 |
Dyrcona |
Running chardet on the input files says, "utf-8 with confidence 0.99." |
10:30 |
* Dyrcona |
sighs. Guess I'll just jam the bad records in, like my current test is doing. |
14:00 |
Dyrcona |
And, that zero-width assertion that I used earlier was working to not "swallow" the space. The phono copyright symbol just takes up extra width in my terminal's font. |
14:02 |
Dyrcona |
We have a winner! $string =~ s/\xC2(?![\x80-\xBF])/\xE2\x84\x97/gu; and $string =~ s/\xC3(?![\x80-\xBF])/\xC2\xA9/gu; |
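(Those two substitutions assume $string holds raw bytes rather than decoded characters, which, per the discussion that follows, is exactly the distinction that bites. In byte terms:)

    # MARC-8 \xC2 (phono copyright) -> U+2117, UTF-8 bytes \xE2\x84\x97
    $string =~ s/\xC2(?![\x80-\xBF])/\xE2\x84\x97/gu;
    # MARC-8 \xC3 (copyright)       -> U+00A9, UTF-8 bytes \xC2\xA9
    $string =~ s/\xC3(?![\x80-\xBF])/\xC2\xA9/gu;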
14:11 |
Dyrcona |
It seems like that took too long to figure out. :) |
14:21 |
Dyrcona |
Hm... Next Q: Is it possible to call update_leader on a MARC::Record... |
14:25 |
Dyrcona |
In my specific case, I suppose it won't matter. The program will make changes to several tags anyway before outputting the record, so the length should get updated. |
14:26 |
Dyrcona |
There is a bug related to that, and it probably affects these multibyte characters, too. |
14:30 |
|
rfrasur joined #evergreen |
14:37 |
Dyrcona |
Ugh. When I modify my prep program with the substitution code, I get a ton of the "does not map to Unicode" errors that started this whole investigation. |
14:39 |
Dyrcona |
Oof.... Helps to run it on the correct files in the correct directory.... |
14:41 |
Dyrcona |
OK. I'm suspicious that my output from this run is the same size as from the previous run. Seems like it should be larger. |
14:45 |
Dyrcona |
So the substitution doesn't work on raw MARC data, apparently. |
14:47 |
Dyrcona |
diff says the file I just generated with the modified prep script is the same as the old one. |
14:47 |
Dyrcona |
Yes, I'm sure I used the new script..... |
14:47 |
Dyrcona |
Nice day for ducks. Looks like we're about to get a thunderstorm. |
15:10 |
Dyrcona |
Multiline match doesn't help... |
15:18 |
Dyrcona |
Single line mode doesn't make a difference either. So, maybe the non-UTF-8 characters are coming from MARC::Record? |
15:19 |
Dyrcona |
Bleh. marc-- |
15:21 |
Dyrcona |
perl-- while I'm at it.
15:31 |
Dyrcona |
@monologue |
15:31 |
pinesol |
Dyrcona: Your current monologue is at least 28 lines long. |
09:52 |
Dyrcona |
derekz: Adding a patron group for this library is the better solution. It seems to me that your circulation rules are too specific if you have to do all that work. The hold and circ rules cascade much like CSS, so simpler is better. |
09:53 |
Dyrcona |
Philosophy: IMNSHO, it's better to have a bunch of generic rules that apply to the majority of cases at the largest number of org units, then you make specific exceptions from there.
09:54 |
Dyrcona |
Avoid using permission groups if possible. |
09:55 |
Dyrcona |
Unrelated: "Vendor MARC records" seems to be a synonym for "garbage." |
09:57 |
derekz |
Dunking on "Vendor MARC records" is a perfect distraction and apropos for March |
09:58 |
Dyrcona |
Well, I just got a batch to load that produces character warnings regardless of whether I process them as UTF-8 or MARC-8, so they're likely in some other character set.
10:01 |
Dyrcona |
derekz: Writing a script as a QND solution is ... viable(?). In the long run, you'll want to eventually do the work with the circ and hold matrices. Unfortunately, there's nothing more permanent than a temporary solution. |
10:01 |
stephengwills |
perfect segue. I have a new school library that just started importing vendor records and, during these same few days, postgres just started getting goosed by oom-killer. Any chance that's not a coincidence? Is vandelay known to consume lots of postgres memory?
10:02 |
Dyrcona |
stephengwills: I don't use vandelay much. I'm just throwing garba... ahem.. MARC at the database via DBI. |
10:03 |
stephengwills |
I have a time/funding issue and am too paranoid to allow anyone else to touch the database directly. ;) |
10:03 |
Dyrcona |
Why do I suspect that this batch of records is a mix of UTF-8 and some Windows code page or 3? |
10:05 |
stephengwills |
they started using vandelay instead of having to wait for me. which, in theory, isn’t unreasonable. |
10:40 |
Dyrcona |
Thank you, chardet! Windows-1254 with confidence 0.51855302355 |
10:40 |
Dyrcona |
Now, can yaz-iconv convert that to UTF-8? |
10:42 |
Dyrcona |
Argh! Something's wrong with my processor... I get "utf-8 with confidence 0.99" on the original. I swear I worked that all out a couple of weeks ago!
10:43 |
Dyrcona |
marc-- |
10:43 |
Dyrcona |
@blame MARC |
10:43 |
pinesol |
Dyrcona: It's all MARC's fault! |
10:59 |
Dyrcona |
The real issue isn't MARC so much. It's how Perl handles character sets and "binary" data. |
11:00 |
Dyrcona |
My other prep program doesn't seem to have this problem. |
11:01 |
Dyrcona |
Also, it would help if vendors would set the leader properly. |
11:02 |
Dyrcona |
Maybe it's time for lunch? |
11:12 |
Dyrcona |
Or even make the preprocessor part of the loader. |
11:15 |
Dyrcona |
Or, maybe just switch to Bmagic's loader that I've been meaning to test.
11:18 |
Dyrcona |
This is still too "hands on" for my taste. |
11:28 |
Dyrcona |
@marc 022 |
11:28 |
pinesol |
Dyrcona: The ISSN, a unique identification number assigned to a continuing resource. (Repeatable) [a,y,z,2,6,8] |
11:28 |
Dyrcona |
@marc 028 |
11:28 |
pinesol |
Dyrcona: The formatted number used for sound recordings, printed music, and videorecordings. Publisher's numbers that are given in an unformatted form are recorded in field 500 (General Note). A print constant identifying the kind of publisher number may be generated based on the value in the first indicator position. (Repeatable) [a,b,6,8] |
11:28 |
Dyrcona |
@marc 024 |
11:28 |
pinesol |
Dyrcona: A standard number or code published on an item which cannot be accommodated in another field (e.g., field 020 (International Standard Book Number), 022 (International Standard Serial Number) , and 027 (Standard Technical Report Number)). The type of standard number or code is identified in the first indicator position or in subfield $2 (Source of number or code). (Repeatable) [a,c,d,z,2,6,8] |
11:29 |
Dyrcona |
Anyone know where UPCs usually show up? |
11:29 |
Dyrcona |
@marc 026 |
11:29 |
pinesol |
Dyrcona: Used to assist in the identification of antiquarian books by recording information comprising groups of characters taken from specified positions on specified pages of the book, in accordance with the principles laid down in various published guidelines. (Repeatable) [] |
11:29 |
Dyrcona |
@marc 025 |
11:29 |
pinesol |
Dyrcona: A number assigned by the Library of Congress to an item that was acquired through one of its overseas acquisition programs. (Repeatable) [] |
11:39 |
Dyrcona |
Looks like 024. |
11:46 |
rhamby |
yeah, should be 024s, though the indicator for it is rarely set in my experience
09:06 |
|
Keith_isl joined #evergreen |
09:30 |
|
jvwoolf joined #evergreen |
10:15 |
|
terranm joined #evergreen |
10:29 |
Dyrcona |
If I want to find a record in the database that has a MARC tag with two particular subfield values, there doesn't seem to be a quick way to do it short of a double join on metabib.real_full_rec, and even then I'm not guaranteed that they're in the same tag. Am I missing something?
10:36 |
Dyrcona |
I could probably add a "search" index using xpath or something, but I always get lost in a maze of twisty config and metabib tables when I do that. |
10:43 |
|
rjackson_isl_hom joined #evergreen |
10:45 |
Dyrcona |
Thought I had an example of that in my old code, but I can't find it. I know that I created such things in the database in the past. |
10:54 |
Dyrcona |
berick: Yeah, but I'm trying to avoid a join like that. |
10:55 |
Dyrcona |
I'd probably have to join mravl and crad anyway. |
10:57 |
Dyrcona |
I don't even need a join, yet. There won't be any other records that would match this tag and a particular subfield value, but I can't guarantee that will always be the case (though it very likely will be). |
10:58 |
Dyrcona |
Plus, the field will have dates in it, and I should also use those dates in my query.... This is MARC tag 583 for anyone who is curious. |
11:02 |
Dyrcona |
I'm trying to come up with a query to get bre ids to pipe into marc_export. If I end up having to deal with dates, then I may just have to write a custom export to pick the marc apart in Perl. |
11:03 |
Dyrcona |
I don't even need the query today. I'm trying to figure out something that might work, so I can estimate how long it will take to implement. |
11:04 |
Dyrcona |
@marc 583 |
11:04 |
pinesol |
Dyrcona: Contains information about processing, reference, and preservation actions. (Repeatable) [a,b,c,d,e,f,h,i,j,k,l,n,o,u,x,z,2,3,5,6,8] |
11:08 |
Dyrcona |
I basically want to be able to find records where a single 583 has $f = some value, $5 = some other value, $c < today's date, and $d > today's date (if $d is even there)
11:09 |
Dyrcona |
Oh, and a bunch of other criteria, like a certain member library has a non-deleted asset.copy entry with a specific circ modifier... |
12:43 |
Dyrcona |
It could be an entirely different character set, too, but I got pretty much the same error as you. |
12:43 |
Bmagic |
right, we've all had this pain. I've posted here over the years about it. This project is slightly different, and my understanding of the issue has matured over the years |
12:44 |
Bmagic |
I've got the file open in MARCEdit. I'm seeing lots of   and {acute} |
12:45 |
Dyrcona |
So, I'd recommend setting the leader to UTF-8 and seeing what happens. You can do that with $marc->encoding('UTF-8'); |
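(encoding() is the stock MARC::Record accessor for leader position 09; a minimal sketch.)

    my $record = MARC::Record->new_from_usmarc($raw);
    $record->encoding('UTF-8');   # sets leader/09 to 'a'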
12:46 |
Bmagic |
right, let's see |
12:48 |
Bmagic |
that did the trick. I would like to try it one way, catch an error, then handle it the other way. Can I "get the message" that MARC::Record bombed? I'm just seeing a screen dump, but my program goes on |
12:50 |
Dyrcona |
Bmagic: You're getting a warning, I think from MARC::Charset? You can do some stuff with %SIG in Perl to make the warnings fatal and then use an eval block to trap them.
12:52 |
Bmagic |
ah! I'll look that up, thanks |
12:53 |
Dyrcona |
Here's an example using eval {...}; if ($@) { ... } to log errors: https://pastebin.com/g4RGDJLr |
12:53 |
Bmagic |
Dyrcona++ |
12:57 |
Dyrcona |
In that example any fatal errors that happen inside the eval {} get handled by the code in the if ($@) {}. |
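(The pattern, sketched: promote warnings to fatals inside a narrow scope so the eval can trap them. The logging filehandle is a stand-in.)

    my $record;
    {
        local $SIG{__WARN__} = sub { die @_ };   # warnings become fatal here only
        $record = eval { MARC::Record->new_from_usmarc($raw) };
    }
    if ($@) {
        print {$log} "record $.: $@";   # trapped warning or error, logged
    }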
12:57 |
Bmagic |
Also: I see that you're reading the file raw with a separator "\x1E\x1D". I suppose that's the best way? You find that reading the file yourself (instead of MARC::Batch) is better? |
12:57 |
Dyrcona |
eval does two different things depending on how you use it. |
12:58 |
Dyrcona |
Bmagic: MARC::Batch usually works. I do tend to read the file manually because I've had issues in the past with records containing "smart" quotes. One of them looks like the end-of-record character.
12:58 |
Bmagic |
I'll see if this error (now that I've narrowed it down to a particular record) will cause "die" or not |
12:59 |
Bmagic |
your example forces a die when @warnings |
12:59 |
|
jihpringle joined #evergreen |
13:00 |
Dyrcona |
Right. But not all warnings, just warnings from MARC::Record. |
13:00 |
Dyrcona |
I'm not sure the MARC::Charset warning shows up there, but you can try it. |
13:00 |
Bmagic |
This program I'm writing needs to be pretty hardy. Handling millions of records every year. So, I think I'll go with the manual reading of the files with raw, like what you've got there |
13:02 |
Dyrcona |
Suit yourself. MARC::Batch and MARC::File go to great lengths to read in otherwise garbage records, but they blow up sometimes on otherwise well-formed MARC. |
13:05 |
Bmagic |
haha, so, it's more* hardy to use MARC::File/Batch.... Maybe I go with MARC::File.... and if it breaks, catch it and go RAW manual. off to find a file that blows MARC::File |
13:06 |
Dyrcona |
The problem I'm fixing by reading the records the way that I do comes from copy/paste cataloging. |
13:06 |
Bmagic |
yeah, I bet I can copy/paste a smart quote into a record and produce this issue |
13:10 |
Bmagic |
Dyrcona++ |
13:12 |
Dyrcona |
I thought tsbere submitted a patch for that one, but it didn't work. Then again, I also think tsbere opened a separate bug. I don't know where that bug has gone. |
13:13 |
Dyrcona |
I spent a little time working on tsbere's patch, but kind of gave up. |
13:15 |
Dyrcona |
Oh, right. tsbere made a PR on github: https://github.com/perl4lib/marc-perl/pull/4 |
13:16 |
Dyrcona |
Too many bug trackers.... |
13:18 |
|
JBoyer joined #evergreen |
13:18 |
Bmagic |
It would be nice if MARC::File would just handle it |
10:24 |
Dyrcona |
I was trying to find an Emacs command or function to make a file: URL from a path. Doesn't look like such a thing exists, but after 15 minutes of searching the function help and online, I decided, "I could have implemented it by now."
10:29 |
Dyrcona |
I still haven't implemented it, but it might be a handy thing to have. |
10:35 |
Dyrcona |
Of course, there's a Jabber/XMPP library for Emacs..... :) |
10:44 |
Dyrcona |
MARC::Charset apparently does not like EM DASH: \xE2\x80\x94. |
10:50 |
Dyrcona |
The message suggests that triplets beginning with \xE2\x80 are the problem: no mapping found for [0x80] at position 24 |
10:50 |
Dyrcona |
gmcharlt ^^ Should I file a bug on MARC::Charset? |
10:55 |
Dyrcona |
FWIW: I'm using this program to load some binary MARC records: https://pastebin.com/g4RGDJLr |
11:01 |
miker |
Dyrcona: doesn't like going from MARC8 to UTF-8? |
11:06 |
Dyrcona |
The file is UTF-8. |
11:08 |
miker |
hrm... that's strange ... could the records be claiming MARC8? IIRC we do look at the leader, but I think there's a way to force the issue |
11:20 |
Dyrcona |
Now, my dump program is complaining about wide character in print. It didn't before I set the encoding.... |
11:29 |
miker |
are you calling MARC::Charset->assume_unicode(1); before processing the records? (that's the "force the issue" option) |
11:34 |
Dyrcona |
miker: No. You can see the program that I'm using to load the records. I haven't shared the preprocessor, yet.
11:35 |
Dyrcona |
I was going to ask if anyone in here has used pymarc much. (I've looked at it.) Python usually has better charset handling than Perl does, and I wonder if anyone has used chardet to detect the charsets used by MARC records.
11:35 |
Dyrcona |
I've found all kinds of crap in MARC from 3rd parties. |
11:36 |
miker |
right on. fwiw, many of our scripts and db functions use both assume_unicode(1) and ignore_errors(1). see the top part of Open-ILS/src/extras/import/marc_add_ids for instance |
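(The two knobs miker mentions, roughly as they appear near the top of marc_add_ids; both are real MARC::Charset settings.)

    use MARC::Charset;

    MARC::Charset->assume_unicode(1);   # treat data as UTF-8 even if leader/09 says MARC-8
    MARC::Charset->ignore_errors(1);    # drop unmappable bytes instead of dying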
11:38 |
Dyrcona |
Well, I've set the 09 in the leader to 'a' for this batch. When I open the file in Emacs it looks like UTF-8. |
11:39 |
Dyrcona |
I'm running it again in another db with a fresh copy of production. I reloaded them after yesterday's tests. |
11:44 |
Dyrcona |
Looks like you're only expected to use it on output. |
11:45 |
Dyrcona |
The documentation needs to be updated or PyMarc does: "When I can require python 2.3, this will go away." |
11:47 |
Dyrcona |
I might play with this some time, but for now I'm sticking to Perl. |
11:50 |
Dyrcona |
It's funny that my preprocessor, using MARC::Record, has no problem with these records, but I guess I'm not asking it to do any charset conversion. Well, it has no problem since I fixed the double encoding bug. :) |
11:58 |
Dyrcona |
IIRC, I think I had to set the charset in these records yesterday, but I forgot when I ran the preprocessor after making changes to it this morning. |
11:59 |
Dyrcona |
Yeah, it has passed the author with the different characters for the first letter of the last name with no warnings or errors. |
12:03 |
|
jihpringle joined #evergreen |
14:33 |
pinesol |
csharp_: go with explicit |
14:35 |
Dyrcona |
I only have those git add problems when I add things to git on the servers. :) |
15:11 |
gmcharlt |
Dyrcona: re MARC::Charset, yes |
15:13 |
Dyrcona |
gmcharlt: I'm not sure it's a bug in MARC::Charset, now. It's bad data combined with user error/forgetfulness. |
15:13 |
gmcharlt |
Dyrcona: ah, OK |
15:13 |
Dyrcona |
I thought the records were specifying UTF-8, but they weren't, even though there were in fact encoded in UTF-8. |
15:14 |
Dyrcona |
s/there/they/ # It's getting late and the fingers are tired... :) |
09:24 |
* Dyrcona |
mumbles "smart quotes...." |
09:27 |
Dyrcona |
Also, just plain junk in these records. |
09:33 |
Dyrcona |
I wonder if we're using an outdated Unicode standard?--That "we" is meant to be vague, i.e. not necessarily Evergreen. |
09:39 |
Dyrcona |
So, it looks like the preprocessing script does something to the records. Maybe I need to tell MARC::Record not to mangle the characters, somehow?
09:56 |
Dyrcona |
Or, maybe it doesn't.... Comparing dumps of the processed versus raw records, the relevant bits of the busted records look the same. |
10:08 |
Dyrcona |
I wonder if converting the records to marcxml will make a difference? |
10:10 |
Dyrcona |
So, do I convert them with yaz-marcdump or with MARC::File::XML?
10:15 |
Dyrcona |
Right, so when I use yaz-marcdump to convert the input records to marcxml, my editor shows the characters correctly. For some reason, I think my editor is treating the dumps as latin-1, even when I tell it they are UTF-8. What I suspect is MARC::Record and friends are mangling the characters because I'm working with binary MARC. |
10:16 |
Dyrcona |
I have little proof, other than what I see in the files through my editor, and that the records get mangled by my preprocessor Perl program. |
10:17 |
Dyrcona |
I will adapt my preprocessor and loader to work with marcxml and see what happens. |
10:18 |
Dyrcona |
BTW, the input records say they are UTF-8 in the leader. |
10:18 |
* Dyrcona |
quacks. |
10:19 |
Dyrcona |
Hmm... Should I use MARC::File::XML on these records, or should I use LibXML? I can do what I want with either..... |
10:23 |
* Dyrcona |
should write a MARC mode for Emacs. It couldn't be that hard... :) |
10:26 |
Dyrcona |
@monologue |
10:26 |
pinesol |
Dyrcona: Your current monologue is at least 15 lines long. |
10:28 |
Dyrcona |
So, yeah, the MARC::Record code that I'm using is mangling the characters. |
10:30 |
Dyrcona |
When I tell Emacs to use UTF-8 with one of the files, I get this: "...encountered characters it couldn’t encode..." followed by a list of characters that won't paste into my IRC client. |
10:30 |
Dyrcona |
The error message is much more detailed. |
10:31 |
Dyrcona |
I open the original MARC file and the mangled characters show up correctly in Emacs. |
10:31 |
Dyrcona |
Proof! |
10:55 |
|
rjackson_isl_hom joined #evergreen |
11:01 |
Dyrcona |
I wonder if the problem is how I'm reading the binary files. I open them via IO::File with the record separator set to \x1e\x1d because records with smart quotes would break MARC::Batch or MARC::File::USMARC. After I get the raw MARC, I feed that to MARC::Record. I suspect that is where the breakage occurs.
11:01 |
Dyrcona |
Could be that I need to decode the data before passing it to MARC::Record? |
11:01 |
|
rjackson_isl_hom joined #evergreen |
11:01 |
Dyrcona |
Or would that be encode? |
11:03 |
Dyrcona |
Well, I can try it and see. |
09:34 |
JBoyer |
Dyrcona, something else to consider from that message is whether or not you have any local MODS or other xslt transforms. |
09:37 |
Dyrcona |
JBoyer: I think we do, so I'll check that. Thanks for the suggestion. |
09:47 |
jeff |
chopPunctuation, chopPunctuation, chopPunctuation... heh. |
09:49 |
Dyrcona |
Yeah, I haven't looked, but I suspect a busted field in the MARC. Maybe I should dump it now before someone changes it.
09:54 |
jvwoolf |
Dyrcona: Before I forget again, I wanted to say that we tested the patch in 1482757 and it worked fine. We've got it running in production now. |
09:54 |
jvwoolf |
Let's see if I can get that to link correctly - lp1482757 |
09:54 |
Dyrcona |
Lp 1482757 |
09:55 |
* JBoyer |
shakes fist at anything case-sensitive that's not a password |
09:55 |
Dyrcona |
jvwoolf: It works fine for me, too. It just doesn't speed things up in a noticeable way. Also, this process is still really slow on Pg 12+. |
09:56 |
jvwoolf |
Dyrcona: It sped up importing eresources pretty significantly for us |
09:56 |
Dyrcona |
So, just dumping the MARC to the screen I think I see the problem. There's a field that ends with a lot of blank spaces. |
09:56 |
Dyrcona |
Well, subfield.... |
09:57 |
jvwoolf |
Also, we removed the 30 million deleted URI call numbers and our call number reports work again. We also haven't had any drone timeouts since then, but that could be a coincidence.
09:57 |
Dyrcona |
jvwoolf: I guess, but I was looking for improvement on later Pg versions, which that patch doesn't do. If you want to sign off, feel free. |
11:16 |
Dyrcona |
So, that would be 444 templates? |
11:16 |
Dyrcona |
jeff: A qualified yes on sharing it. A definite yes on I still have the record. |
11:17 |
Dyrcona |
I should ask before I share it for reasons. |
11:21 |
Dyrcona |
Apparently, it's an item that no one else is likely to have. It's MARC type a (a book?) about the construction of one of our libraries. Looks to be a gift from the construction company. |
11:28 |
Dyrcona |
Of course the spaces don't show up in the staff client. |
11:29 |
Dyrcona |
The 300 field also looks a little messed up in the editor. Subfield a is just ";" |
11:36 |
Dyrcona |
Page count is missing. |
11:59 |
Dyrcona |
Just to confirm: It blows up on any of the mods transforms. (mods3 gives a missing-file error. I should look into that, but it doesn't look like we use mods3.) marc21expand880 works, but it doesn't strip out the extra spaces.
12:00 |
Dyrcona |
jeff: Yeah. that's what it looks like. |
12:00 |
Dyrcona |
It's obvious in the marcxml. Not so obvious elsewhere. |
12:01 |
Dyrcona |
I should probably use MARC::Record to fix it so that the length is updated correctly, but I should be able to just subtract 222 from the current value. |
12:03 |
jeff |
I have mixed feelings about trying to maintain size in a marcxml record. :-) |
12:04 |
jeff |
we mostly (and should always) ensure that it's updated/correct on conversion from marcxml to binary MARC format. |
12:04 |
Dyrcona |
Some of our vendors have strong opinions about it. |
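(jeff's point, sketched: round-tripping through as_usmarc() rebuilds the leader, so there's no need to patch the record length by hand. Variable names hypothetical.)

    use MARC::Record;
    use MARC::File::XML (BinaryEncoding => 'UTF-8');

    my $record = MARC::Record->new_from_xml($marcxml, 'UTF-8', 'USMARC');
    my $binary = $record->as_usmarc();   # record length and offsets recomputed here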
08:36 |
|
mantis1 joined #evergreen |
08:38 |
|
mmorgan joined #evergreen |
09:12 |
|
Dyrcona joined #evergreen |
09:19 |
Dyrcona |
If I get this message while loading MARC records "no mapping found for [0x80]": a) is that a warning or an error (i.e. does the record load anyway) and b) anyone had to fix these before and got any tips that might save me an hour or so of fumbling around? |
09:20 |
Dyrcona |
The records in the file claim to be UTF-8, but we all know how that works out in reality.
09:26 |
Dyrcona |
Well, I can say that those messages are not making it to MARC::Record->warnings. Because I would log those to my log file. |
09:28 |
Dyrcona |
Grr. Because I botched the shortname in 856$9, they're not showing up in my custom view.... |
09:29 |
Dyrcona |
That means I can't get a quick count. |
09:30 |
Dyrcona |
I can also say that the messages didn't trigger my error handler in the eval because there's no error log. |
09:31 |
pinesol |
Dyrcona: Vendor records is probably integrated with systemd |
09:35 |
|
jvwoolf joined #evergreen |
09:39 |
* Dyrcona |
reloads the database to give it another go. |
09:49 |
Dyrcona |
Think I'll redirect stderr to a file. I'm not sure where these messages are coming from, either when I create the MARC::Record from the raw marc data, or when doing the insert. |
09:49 |
Dyrcona |
Most likely the former. |
09:58 |
JBoyer |
Bmagic_, I'm told the cert for the MOBIUS bugsquash server could use a refresh sometime. |
10:01 |
Bmagic_ |
oh! will do |
10:41 |
csharp_ |
jvwoolf: re: slow reports, have you done any query analysis to see what it's doing? (e.g. EXPLAIN/EXPLAIN ANALYZE) |
10:41 |
csharp_ |
note that EXPLAIN ANALYZE actually runs the query, so if it's timing out, that may not work |
10:55 |
|
rfrasur joined #evergreen |
10:59 |
Dyrcona |
So, back to my MARC::Charset thing from earlier. I basically copied the example from the MARC::Lint manpage and ran that on my MARC input file, and it doesn't report any character set issues, though it does complain about punctuation and the subfield 9 in the 856. |
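(The MARC::Lint pass Dyrcona describes, sketched from its manpage; the file name is hypothetical.)

    use MARC::Batch;
    use MARC::Lint;

    my $lint  = MARC::Lint->new;
    my $batch = MARC::Batch->new('USMARC', 'vendor.mrc');
    while (my $record = $batch->next) {
        $lint->check_record($record);
        print $record->title, ": $_\n" for $lint->warnings;
    }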
11:06 |
Dyrcona |
Oh, interesting. It looks like MARC::File::USMARC->next returns raw MARC and doesn't create a MARC::Record object. |
11:09 |
Dyrcona |
Oh, never mind. It does. I was looking at _next().
11:09 |
csharp_ |
Dyrcona: I've seen that kind of thing when processing records with garbage characters too |
11:10 |
csharp_ |
pretty sure we were able to find them in the text and replace/remove them
11:13 |
Dyrcona |
So, I don't see the bad character message from MARC::File::USMARC because it builds the MARC::Record differently. It adds the fields as it finds them and does some checking of its own. My program reads the raw MARC from the file and does MARC::Record->new_from_usmarc().
11:14 |
Dyrcona |
Some of these look like "smart" quotes. Probably Windows-1252, again. :( |
11:14 |
csharp_ |
yeppers |
11:14 |
Dyrcona |
I wish people would learn that you can't just copy and paste into a MARC record. |
11:14 |
csharp_ |
edit in Word, paste into MARC |
11:15 |
csharp_ |
WYSIWY(don't actually)G |
11:15 |
Dyrcona |
:) |
12:51 |
Dyrcona |
At Pg 11, most things start getting faster, but others seem to get hit. |
12:52 |
Dyrcona |
We do a dump and restore at least whenever we get new hardware. |
12:52 |
Dyrcona |
I do them weekly to keep some test databases up to date. Anyway, getting off topic. |
12:54 |
Dyrcona |
Going back to my record load/character issues, the message isn't coming from MARC::Record, either. I've got a program to dump the 856s that uses the same code to read the records and it doesn't peep. |
12:56 |
Dyrcona |
I wonder if I can just convert the raw MARC data or if I need to do it field by field.... |
13:04 |
Dyrcona |
Now, it's just looking like some garbage and not necessarily Windows 1252... Probably a mix.... :( |
13:07 |
Dyrcona |
I think I know the source of the messages though. It looks like they're possibly coming from NFD() when the MARC is cleaned to go in the database. I can check that hypothesis real quick. |
13:09 |
Dyrcona |
And, no. That's not it, either. |
13:11 |
Dyrcona |
g0=ASCII_DEFAULT g1=EXTENDED_LATIN at /usr/local/share/perl/5.26.1/MARC/Charset.pm line 308, <GEN1> chunk 1752. |
13:18 |
Dyrcona |
Still don't know where that's coming from. I'm not using MARC::Charset, and doesn't look like the modules that I use do either. It must be the database throwing that at me.... |
13:57 |
JBoyer |
Dyrcona, MARC::XML and friends are used in the ingest trigger functions, so depending on where you're actually seeing that message output, the database is a likely source.
13:59 |
Dyrcona |
JBoyer: Yeah. It doesn't seem to be coming from anything running in my Perl code outside of the database, but I didn't think warnings from the database would just show up in my output. |
14:00 |
Dyrcona |
Oh, never mind. They will because of the DBI options. |
15:40 |
Dyrcona |
Aight, so it is a mangled UTF-8 "smart quote." The sequence should probably be 3 bytes: \xe2\x80\x9d. |
15:41 |
Dyrcona |
Ah, this character appears before it: â |
15:43 |
Dyrcona |
Which is \xe2..... |
15:43 |
Dyrcona |
MARC::Charset is apparently not dealing with it correctly, or I need to update my Unicode support.... |
15:54 |
Dyrcona |
Ugh.... I somehow killed the program.... Not sure what I did. |
16:00 |
Dyrcona |
gmcharlt: It looks like MARC::Charset doesn't handle \xe2\x80\x9d correctly.
16:02 |
Dyrcona |
I wonder what happens if I replace those in Perl with a "? |
16:11 |
Dyrcona |
Also, if the file is UTF-8, and that is a valid UTF-8 sequence, what's MARC::Charset got to do with it? |
16:12 |
Dyrcona |
I'ma just leave it alone and reload the file again tomorrow after a db reload. |
16:13 |
mmorgan |
Going home and coming back again is always a good approach :) |
16:14 |
Dyrcona |
Unfortunately, I am at home. |
16:14 |
Dyrcona |
And, I'll be tomorrow, too. :) |
16:15 |
mmorgan |
There's always turning it off and on again! |
16:15 |
Dyrcona |
I don't think this should be my problem. Things actually look good for a change with the data. I was blaming the vendor, but it looks like MARC::Charset and/or Evergreen's use of it is at fault. Unless this one record isn't set to UTF-8 or something when it should be. |
16:16 |
Dyrcona |
Also, I have little context for the warnings. I'd have to search the MARC for the offending text/codes. |
16:16 |
Dyrcona |
No record number, which tells me this isn't coming from my code because all errors and warnings are logged with the record number from the file. |
16:23 |
|
abowling joined #evergreen |
16:24 |
abowling |
after a 3.7 update, links in 856 are no longer appearing in the opac. the library has custom templates, but i diffed the relevant ones and it seems nothing has changed. any ideas on what i might be missing?