Evergreen ILS Website

Results

Results for 2022-08-08

10:03 Dyrcona Lowercase n with tilde is showing up in my console as the two single byte characters represented by its component bytes. Something isn't handling multibyte UTF8 properly somewhere.
10:04 Dyrcona So, It's probably the load program that I had so much trouble getting Unicode to work with in the past few months.
10:09 jvwoolf joined #evergreen
10:15 Dyrcona Definitely coming from MARC::Record->new_from_usmarc() in the load program, regardless of whether I set the file handle to UTF-8.
10:22 Dyrcona MARC::Charset is up to date (1.35).
10:28 Dyrcona MARC::Charset shouldn't be involved and doesn't look like it is. The error is not coming directly from MARC::File::USMARC::decode() either.
10:29 Dyrcona I tried installing the latest Encode.pm and no difference.
10:30 Dyrcona Oh, I may stand corrected: https://metacpan.org/module/MARC::File::USMARC/source#L172
10:31 Dyrcona Bingo! Source of my error, and it is Encode.pm or the edition of Unicode available to Perl on my system.
10:32 Dyrcona https://metacpan.org/module/MARC::File::Encode/source#L35
10:39 Dyrcona Or, maybe it's just "The Unicode Bug..." :(
10:45 Dyrcona Looks like I might be able to avoid this by converting the records to MARCXML first.
10:53 Dyrcona Outside of a MARC context, I can't make decode crash on those characters.
11:16 BDorsey joined #evergreen
11:20 Dyrcona Well, it blows up elsewhere using MARC::File::XML: 2 :129: parser error : Input is not proper UTF-8, indicate encoding !
11:20 Dyrcona Bytes: 0xA9 0x22 0x20 0x69
11:21 jihpringle joined #evergreen
11:25 Dyrcona @monologue
11:25 pinesol Dyrcona: Your current monologue is at least 23 lines long.
11:39 Dyrcona Ha! I want a goto for a valid reason. I want a label outside my main loop that I can branch to when there is an error. Otherwise, I don't want to interfere with the flow.
11:42 Dyrcona Maybe I just need to change my loop to a do while.
11:46 Dyrcona Well, MARC::File::XML just makes it worse. More records get spit out that way.
11:52 Dyrcona I think the increase in errors comes from the records with invalid lengths and indicators getting mangled when converted to XML.
12:31 csharp_ Dyrcona: fwiw, I usually learn from your monologues :-)
12:31 csharp_ Dyrcona++
12:36 mmorgan Dyrcona++
12:37 Dyrcona Thanks!
12:41 Dyrcona I used yaz-marcdump to convert the binary MARC to XML, and that was after I had preprocessed the file from the vendor. So it could be that my preprocessor program is writing junk, but I can't find bad UTF-8 in it.
12:42 Dyrcona It just looks like when going through the MARC modules, Encode suddenly doesn't like otherwise valid \xC2 and \xC3 sequences.
12:56 Dyrcona Very interesting: If I use a program to split the binary file into records using \x1E\x1D as the input record separator and then run decode('UTF-8', $raw_record), I get no errors, so the issue is definitely coming from the MARC modules somehow.
12:57 Dyrcona That's using the preprocessed file. I'll see what happens with the files directly from the vendor.
12:58 Dyrcona Ditto... Zero errors.
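The record-by-record check described above can be sketched in Python rather than the Perl used in the chat. This is a minimal illustration, assuming the whole file fits in memory; the function names are hypothetical and not taken from the actual prep script.

```python
RECORD_END = b"\x1e\x1d"  # field terminator followed by record terminator


def split_records(data: bytes) -> list[bytes]:
    """Split a binary MARC file into raw records, keeping the terminators."""
    parts = data.split(RECORD_END)
    # Whatever follows the final terminator is empty (or trailing junk).
    return [part + RECORD_END for part in parts[:-1]]


def undecodable_records(data: bytes) -> list[tuple[int, str]]:
    """Return (record index, error text) for each record that is not valid UTF-8."""
    problems = []
    for index, raw in enumerate(split_records(data)):
        try:
            raw.decode("utf-8", errors="strict")
        except UnicodeDecodeError as err:
            problems.append((index, str(err)))
    return problems
```

An empty result only shows that the raw bytes are valid UTF-8. As the chat goes on to find, data can pass this check and still be wrong, for example when it has been encoded twice.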
13:06 jihpringle joined #evergreen
13:13 csharp_ Dyrcona: I know you've been working on this for days and have probably ruled this out, but I've seen stupid stuff where the \XC2 literal characters were themselves mis-encoded somehow
13:14 csharp_ as in literally "\XC2" where one of those was some unicode character that escaped notice
13:16 Dyrcona csharp_: I originally thought that is what the problem was, or rather a MARC-8 \xC2 that got into the UTF-8 data. \xC2 in MARC-8 is the P with a circle sound copyright symbol, and \xC3 is the regular copyright symbol.
13:17 Dyrcona However, the input actually has valid UTF-8 sequences and using Encode::decode on the raw data on a record by record basis does not output any errors. The errors come when MARC::File::USMARC::decode() is run on a record.
13:18 csharp_ ah
13:19 Dyrcona Hmm... I have another idea.....
13:23 Dyrcona I tried MARC::File::USMARC::decode() on the files from the vendor and there are no errors. When I run it on the preprocessed file, I get errors. So, my preprocessor must be doing something wrong, even though its output decodes as UTF-8 just like the raw input....
13:25 Dyrcona If I have to set the output stream to UTF-8, I'll be upset. I spent several days fiddling with that before and I swear that I got it right.....
13:27 Dyrcona So, I'm already setting binmode on the output to :utf8. Maybe I should do :bytes or :raw?
13:32 Dyrcona I guess :raw isn't a thing....
13:38 Dyrcona csharp_++ again for suggesting an encoding issue. Looks like my preprocessor was double encoding some characters.
14:03 csharp_ Dyrcona: oh wow
14:11 Dyrcona This line in the perlunicode documentation is misleading: Use the ":encoding(...)" layer  to read from and write to filehandles using the specified encoding.
14:11 Dyrcona I suspect it only applies if you're not manually decoding the data, which the MARC code does.
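Double encoding of the kind mentioned above can be reproduced in a few lines of Python. The chat does not say how the preprocessor actually did it, so this sketch assumes one common mechanism: UTF-8 bytes are treated as Latin-1 characters and then encoded to UTF-8 a second time. Both function names are hypothetical.

```python
def double_encode(text: str) -> bytes:
    """Reproduce the bug: UTF-8 bytes misread as Latin-1, then encoded again."""
    utf8_bytes = text.encode("utf-8")
    misread = utf8_bytes.decode("latin-1")  # each byte becomes one character
    return misread.encode("utf-8")          # every high byte now takes two bytes


def undo_double_encoding(data: bytes) -> bytes:
    """Reverse one round of double encoding; return the input if that is not possible."""
    try:
        candidate = data.decode("utf-8").encode("latin-1")
        candidate.decode("utf-8")  # the repaired bytes must themselves be valid UTF-8
    except UnicodeError:
        return data
    return candidate
```

Under this assumption a lowercase n with tilde (\xC3\xB1) becomes \xC3\x83\xC2\xB1, which is still valid UTF-8 and so passes a plain decode check. The repair is a heuristic: text that legitimately contains sequences such as "Ã±" would be altered, so it is safer to fix the step that writes the data.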
14:32 jeffdavis bug 1979345 adds a new permission to govern the hold pull list; is that an OK change to include in a point release, assuming there's a release note?
14:32 pinesol Launchpad bug 1979345 in Evergreen "Angular Holds Pull List Doesn't Scope" [Medium,Confirmed] https://launchpad.net/bugs/1979345
14:34 csharp_ I would probably wait until the next release
14:46 jeffdavis We're going live with the new perm when we upgrade to 3.9 this weekend, I'll reconcile myself to having to renumber our permissions at some point. :)
14:48 rfrasur joined #evergreen
14:59 Dyrcona We've had to do things like that after upgrades before because we've backported things from future releases.
15:00 Dyrcona On the subject of MARC and encoding, just to complicate things, if you're pulling records from Evergreen via DBI as MARCXML and then converting them to USMARC to write to a file, you have to set the output stream to utf8 encoding, or you get errors reading the output file.
15:02 Dyrcona It's always fun to relearn things like this every 3 or so years. :)
15:10 Dyrcona jeffdavis: One thing that I often do is wait for the code to make it into master, then I cherry-pick the commits into my local branch so I have the correct id numbers and db upgrade codes. This makes generating the db upgrade script for future upgrades easier.
15:11 jeffdavis hm, and I guess there's nothing preventing that commit from going into master even if it doesn't get backported to 3.8/3.9

Results for 2022-08-05

08:42 mmorgan joined #evergreen
09:39 Dyrcona joined #evergreen
09:40 Dyrcona So, it looks like a batch of records from Overdrive are in whatever character set, individually: utf8 "\xC2" does not map to Unicode at /usr/lib/x86_64-linux-gnu/perl/5.26/Encode.pm line 212
09:48 Dyrcona Hmm.. If I'm reading the table correctly, that should be a "sound recording copyright," or a P with a circle around it in MARC-8.
09:48 Dyrcona In UTF-8
09:49 Dyrcona In UTF-8 \xC2 is often the first of a pair, \xC2A9 is the copyright symbol, for example.
09:50 Dyrcona So, I guess some of the records are UTF-8 and some are MARC-8, or even have a mix of characters in different character sets. I guess the question is, do I load the "bad" records?
09:52 Dyrcona Sound recording copyright is \xE28497 in UTF-8.
09:53 Dyrcona I suppose I could try loading the records as MARC-8 and see which leads to fewer issues. I seem to recall the whole thing blowing up when I didn't force the encoding of the MARC to UTF-8, though.
10:01 Dyrcona Typing in a web form and my palm hit the track pad on my laptop thereby selecting all of the text in the field so that the next character that I typed replaced it all. Ctrl-z didn't bring it back....
10:01 Dyrcona It's going to be that kind of Friday....
10:05 * Dyrcona searches for a Perl module like chardet for Python.
10:09 Dyrcona So, maybe, I should try PyMarc and chardet for this project.
10:13 mmorgan Ctrl-z must be taking a vacation day :-(
10:14 Dyrcona I suppose.
10:16 Dyrcona I'm going to try and solve my record issue by rewriting the prep script in Python using PyMarc and chardet to autodetect the character set of each record. I wonder what chardet will say about MARC-8 records? Maybe, it will call them ISO 2022?
10:23 Dyrcona Maybe I can keep most of the Perl code and just throw the XML at a character set detection program?
10:30 Dyrcona Running chardet on the input files says, "utf-8 with confidence 0.99."
10:30 * Dyrcona sighs. Guess I'll just jam the bad records in, like my current test is doing.
14:00 Dyrcona And, that zero-width assertion that I used earlier was working to not "swallow" the space. The phono copyright symbol just takes up extra width in my terminal's font.
14:02 Dyrcona We have a winner! $string =~ s/\xC2(?![\x80-\xBF])/\xE2\x84\x97/gu; and $string =~ s/\xC3(?![\x80-\xBF])/\xC2\xA9/gu;
14:11 Dyrcona It seems like that took too long to figure out. :)
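For comparison, here is the same pair of substitutions written in Python and applied to bytes. It is a sketch, not the code from the prep script. It assumes that a \xC2 or \xC3 byte not followed by a UTF-8 continuation byte is a stray MARC-8 copyright character, which is the reading given in the chat; the function name is hypothetical.

```python
import re

# A lead byte with no continuation byte (\x80-\xBF) after it is not valid UTF-8.
LONE_C2 = re.compile(rb"\xC2(?![\x80-\xBF])")
LONE_C3 = re.compile(rb"\xC3(?![\x80-\xBF])")


def fix_marc8_copyright_bytes(data: bytes) -> bytes:
    """Replace stray MARC-8 copyright bytes with their UTF-8 equivalents."""
    # MARC-8 \xC2 is the sound recording copyright sign, U+2117.
    data = LONE_C2.sub(b"\xE2\x84\x97", data)
    # MARC-8 \xC3 is the copyright sign, U+00A9.
    return LONE_C3.sub(b"\xC2\xA9", data)
```

Valid two-byte sequences such as \xC3\xB1 are left alone because of the negative lookahead. Each replacement changes the byte length, so the leader and directory of a binary record would have to be regenerated afterwards.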
14:21 Dyrcona Hm... Next Q: Is it possible to call update_leader on a MARC::Record...
14:25 Dyrcona In my specific case, I suppose it won't matter. The program will make changes to several tags anyway before outputting the record, so the length should get updated.
14:26 Dyrcona There is a bug related to that, and it probably affects these multibyte characters, too.
14:30 rfrasur joined #evergreen
14:37 Dyrcona Ugh. When I modify my prep program with the substitution code, I get a ton of the "does not map to Unicode" errors that started this whole investigation.
14:39 Dyrcona Oof.... Helps to run it on the correct files in the correct directory....
14:41 Dyrcona OK. I'm suspicious that my output from this run is the same size as from the previous run. Seems like it should be larger.
14:45 Dyrcona So the substitution doesn't work on raw MARC data, apparently.
14:47 Dyrcona diff says the file I just generated with the modified prep script is the same as the old one.
14:47 Dyrcona Yes, I'm sure I used the new script.....
14:47 Dyrcona Nice day for ducks. Looks like we're about to get a thunderstorm.
15:10 Dyrcona Multiline match doesn't help...
15:18 Dyrcona Single line mode doesn't make a difference either. So, maybe the non-UTF-8 characters are coming from MARC::Record?
15:19 Dyrcona Bleh. marc--
15:21 Dyrcona perl-- while i'm at it.
15:31 Dyrcona @monologue
15:31 pinesol Dyrcona: Your current monologue is at least 28 lines long.

Results for 2022-07-11

10:47 pinesol News from commits: LP#1981095: fix deletion of item tags in Angular item attributes editor <https://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=cc9c27b8647b39c44a1270b74b65bbe1a044739a>
12:34 collum joined #evergreen
12:46 collum joined #evergreen
13:03 Dyrcona I'm looking into a cleanup of Overdrive URIs, and I've found over 35,000 asset.uri entries that don't correspond to any 856 in the biblio.record_entry.marc for the record on the call number. Many of the call numbers are deleted, but some aren't. I have the fixes for dangling asset.uris in the database.
13:05 Dyrcona They have uri_call_number_maps, but I'm thinking of deleting these just the same.
13:07 Dyrcona "Curiouser and curiouser," said Alice.
13:08 Dyrcona My join criteria is wrong.... :(

Results for 2022-04-05

12:48 pinesol Dyrcona: Band 'Integer Overflow' added to list
12:53 Dyrcona @tag 000
12:53 pinesol Dyrcona: Must be because I had the flu for Christmas.
12:53 Dyrcona @marc 000
12:53 pinesol Dyrcona: unknown tag 000
12:53 Dyrcona That's what I thought...
12:56 Dyrcona Oof. Looks like indicators are bonzo on that record with the 000 tag as well. The "catalog" must be broken.
12:57 Dyrcona Heh: 520 C $a limb aboard....
13:00 Dyrcona @marc lea
13:00 pinesol Dyrcona: unknown tag lea
13:01 Dyrcona Yeah, just for giggles. This one looks like the leader was "copied" to a tag "lea" with indicators e & r, followed by the leader.
13:02 JBoyer "The best part is that you can take a plain text marc record from one system and paste it into another!"

Results for 2022-03-23

09:52 Dyrcona derekz: Adding a patron group for this library is the better solution. It seems to me that your circulation rules are too specific if you have to do all that work. The hold and circ rules cascade much like CSS, so simpler is better.
09:53 Dyrcona Philosophy: IMNSHO, it's better to have a bunch of generic rules that apply to the majority of cases at the largest number of org units, then you make specific exceptions from there.
09:54 Dyrcona Avoid using permission groups if possible.
09:55 Dyrcona Unrelated: "Vendor MARC records" seems to be a synonym for "garbage."
09:57 derekz Dunking on "Vendor MARC records" is a perfect distraction and apropos for March
09:58 Dyrcona Well, I just got a batch to load that produce character warnings regardless if I process them as UTF-8 or MARC-8, so they're likely in some other character set.
10:01 Dyrcona derekz: Writing a script as a QND solution is ... viable(?). In the long run, you'll want to eventually do the work with the circ and hold matrices. Unfortunately, there's nothing more permanent than a temporary solution.
10:01 stephengwills perfect segue. I have a new school library that just started importing vendor records and, during these same few days, postgres just started getting goosed by oom-killer. Any chance that's not a coincidence? is vandelay known to consume lots of postgres memory?
10:02 Dyrcona stephengwills: I don't use vandelay much. I'm just throwing garba... ahem.. MARC at the database via DBI.
10:03 stephengwills I have a time/funding issue and am too paranoid to allow anyone else to touch the database directly. ;)
10:03 Dyrcona Why do I suspect that this batch of records is a mix of UTF-8 and some Windows code page or 3?
10:05 stephengwills they started using vandelay instead of having to wait for me.  which, in theory, isn’t unreasonable.
10:40 Dyrcona Thank you, chardet! Windows-1254 with confidence 0.51855302355
10:40 Dyrcona Now, can yaz-iconv convert that to UTF-8?
10:42 Dyrcona Argh! Something's wrong with my processor... I get "utf-8 with confidence 0.99" on the original. I swear I worked that all out a couple of weeks ago!
10:43 Dyrcona marc--
10:43 Dyrcona @blame MARC
10:43 pinesol Dyrcona: It's all MARC's fault!
10:59 Dyrcona The real issue isn't MARC so much. It's how Perl handles character sets and "binary" data.
11:00 Dyrcona My other prep program doesn't seem to have this problem.
11:01 Dyrcona Also, it would help if vendors would set the leader properly.
11:02 Dyrcona Maybe it's time for lunch?
11:12 Dyrcona Or even make the preprocessor part of the loader.
11:15 Dyrcona Or, maybe just switch to Bmagic's loader that i've been meaning to test.
11:18 Dyrcona This is still too "hands on" for my taste.
11:28 Dyrcona @marc 022
11:28 pinesol Dyrcona: The ISSN, a unique identification number assigned to a continuing resource. (Repeatable) [a,y,z,2,6,8]
11:28 Dyrcona @marc 028
11:28 pinesol Dyrcona: The formatted number used for sound recordings, printed music, and videorecordings. Publisher's numbers that are given in an unformatted form are recorded in field 500 (General Note). A print constant identifying the kind of publisher number may be generated based on the value in the first indicator position. (Repeatable) [a,b,6,8]
11:28 Dyrcona @marc 024
11:28 pinesol Dyrcona: A standard number or code published on an item which cannot be accommodated in another field (e.g., field 020 (International Standard Book Number), 022 (International Standard Serial Number) , and 027 (Standard Technical Report Number)). The type of standard number or code is identified in the first indicator position or in subfield $2 (Source of number or code). (Repeatable) [a,c,d,z,2,6,8]
11:29 Dyrcona Anyone know where UPCs usually show up?
11:29 Dyrcona @marc 026
11:29 pinesol Dyrcona: Used to assist in the identification of antiquarian books by recording information comprising groups of characters taken from specified positions on specified pages of the book, in accordance with the principles laid down in various published guidelines. (Repeatable) []
11:29 Dyrcona @marc 025
11:29 pinesol Dyrcona: A number assigned by the Library of Congress to an item that was acquired through one of its overseas acquisition programs. (Repeatable) []
11:39 Dyrcona Looks like 024.
11:46 rhamby yeah, should be 024s though the indicator for it is rarely set in my experience

Results for 2022-03-18

09:06 Keith_isl joined #evergreen
09:30 jvwoolf joined #evergreen
10:15 terranm joined #evergreen
10:29 Dyrcona If I want to find a record in the database that has a MARC tag with two particular subfield values, there doesn't seem to be a quick way to do that without a double join on metabib.real_full_rec, and even then, I'm not guaranteed that they're in the same tag. Am I missing something?
10:36 Dyrcona I could probably add a "search" index using xpath or something, but I always get lost in a maze of twisty config and metabib tables when I do that.
10:43 rjackson_isl_hom joined #evergreen
10:45 Dyrcona Thought I had an example of that in my old code, but I can't find it. I know that I created such things in the database in the past.
10:54 Dyrcona berick: Yeah, but I'm trying to avoid a join like that.
10:55 Dyrcona I'd probably have to join mravl and crad anyway.
10:57 Dyrcona I don't even need a join, yet. There won't be any other records that would match this tag and a particular subfield value, but I can't guarantee that will always be the case (though it very likely will be).
10:58 Dyrcona Plus, the field will have dates in it, and I should also use those dates in my query.... This is MARC tag 583 for anyone who is curious.
11:02 Dyrcona I'm trying to come up with a query to get bre ids to pipe into marc_export. If I end up having to deal with dates, then I may just have to write a custom export to pick the marc apart in Perl.
11:03 Dyrcona I don't even need the query today. I'm trying to figure out something that might work, so I can estimate how long it will take to implement.
11:04 Dyrcona @marc 583
11:04 pinesol Dyrcona: Contains information about processing, reference, and preservation actions. (Repeatable) [a,b,c,d,e,f,h,i,j,k,l,n,o,u,x,z,2,3,5,6,8]
11:08 Dyrcona I basically want to be able to find records where a single 583 has $f = some value $5 = someother value $c < today's date and $d > today's date (if $d is even there)
11:09 Dyrcona Oh, and a bunch of other criteria, like a certain member library has a non-deleted asset.copy entry with a specific circ modifier...
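The "pick the marc apart" option for the 583 test can be sketched in Python against a record's MARCXML. This is an illustration of the same-tag requirement only, not a replacement for the database query. It assumes $c and $d hold dates as YYYYMMDD strings, which the chat does not state, and the function and parameter names are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def has_active_583(marcxml: str, f_value: str, five_value: str, today: date) -> bool:
    """True if one single 583 field satisfies every condition at once."""
    stamp = today.strftime("%Y%m%d")  # assumed date format for $c and $d
    root = ET.fromstring(marcxml)
    for field in root.iterfind(".//marc:datafield[@tag='583']", MARC_NS):
        subfields = {}
        for sub in field.iterfind("marc:subfield", MARC_NS):
            subfields.setdefault(sub.get("code"), []).append(sub.text or "")
        if f_value not in subfields.get("f", []):
            continue
        if five_value not in subfields.get("5", []):
            continue
        # $c must be present and earlier than today.
        if not any(start < stamp for start in subfields.get("c", [])):
            continue
        # $d is optional, but when present it must be later than today.
        ends = subfields.get("d", [])
        if ends and not any(end > stamp for end in ends):
            continue
        return True
    return False
```

Because each 583 is checked on its own, a record where $f matches in one 583 and $5 matches in another is rejected, which is the guarantee that a join on individual subfield rows does not give.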

Results for 2022-03-14

12:43 Dyrcona It could be an entirely different character set, too, but I got pretty much the same error as you.
12:43 Bmagic right, we've all had this pain. I've posted here over the years about it. This project is slightly different, and my understanding of the issue has matured over the years
12:44 Bmagic I've got the file open in MARCEdit. I'm seeing lots of &#xA0 and {acute}
12:45 Dyrcona So, I'd recommend setting the leader to UTF-8 and seeing what happens. You can do that with $marc->encoding('UTF-8');
12:46 Bmagic right, let's see
12:48 Bmagic that did the trick. I would like to try it one way, catch an error, then handle it the other way. Can I "get the message" that MARC::Record bombed? I'm just seeing a screen dump, but my program goes on
12:50 Dyrcona Bmagic: You're getting a warning, I think from MARC::Charset? You can do some stuff with signal in Perl to make the warnings fatal and then use an eval block to trap them.
12:52 Bmagic ah! I'll look that up, thanks
12:53 Dyrcona Here's an example using eval {...}; if ($@) { ... } to log errors: https://pastebin.com/g4RGDJLr
12:53 Bmagic Dyrcona++
12:57 Dyrcona In that example any fatal errors that happen inside the eval {} get handled by the code in the if ($@) {}.
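The same pattern, promoting warnings to fatal errors and trapping them per record, has a close analogue in Python's warnings module. The sketch below is not a translation of the pastebin script. parse_record is a hypothetical stand-in for a MARC parser that warns on bad input, and load_records is likewise an illustrative name.

```python
import warnings


def parse_record(raw: bytes) -> str:
    """Hypothetical stand-in for a parser that warns instead of failing."""
    text = raw.decode("utf-8", errors="replace")
    if "\ufffd" in text:
        warnings.warn("record contains bytes that are not valid UTF-8")
    return text


def load_records(raw_records):
    """Parse each record, logging any warning as an error for that record."""
    loaded, errors = [], []
    for index, raw in enumerate(raw_records):
        with warnings.catch_warnings():
            # Inside this block every warning is raised as an exception.
            warnings.simplefilter("error")
            try:
                loaded.append(parse_record(raw))
            except Warning as problem:
                errors.append((index, str(problem)))
    return loaded, errors
```

As in the Perl version, only warnings issued through the language's warning mechanism are caught. Messages that a library prints directly to the terminal would slip past.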
12:57 Bmagic Also: I see that you're reading the file raw with a separator "\x1E\x1D". I suppose that's the best way? You find that reading the file yourself (instead of MARC::Batch) is better?
12:57 Dyrcona eval does two different things depending on how you use it.
12:58 Dyrcona Bmagic: MARC::Batch usually works. I do tend to read the file manually because I've had issues in the past with records containing "smart" quotes. One of them looks like the end of record character.
12:58 Bmagic I'll see if this error (now that I've narrowed it down to a particular record) will cause "die" or not
12:59 Bmagic your example forces a die when @warnings
12:59 jihpringle joined #evergreen
13:00 Dyrcona Right. But not all warnings, just warnings from MARC::Record.
13:00 Dyrcona I'm not sure the MARC::Charset warning shows up there, but you can try it.
13:00 Bmagic This program I'm writing needs to be pretty hardy. Handling millions of records every year. So, I think I'll go with the manual reading of the files with raw, like what you've got there
13:02 Dyrcona Suit yourself. MARC::Batch and MARC::File go to great lengths to read in otherwise garbage records, but they blow up sometimes on otherwise well-formed MARC.
13:05 Bmagic haha, so, it's more* hardy to use MARC::File/Batch.... Maybe I go with MARC::File.... and if it breaks, catch it and go RAW manual. off to find a file that blows MARC::File
13:06 Dyrcona The problem I'm fixing by reading the records the way that I do comes from copy/paste cataloging.
13:06 Bmagic yeah, I bet I can copy/paste a smart quote into a record and produce this issue
13:10 Bmagic Dyrcona++
13:12 Dyrcona I thought tsbere submitted a patch for that one, but it didn't work. Then again, I also think tsbere opened a separate bug. I don't know where that bug has gone.
13:13 Dyrcona I spent a little time working on tsbere's patch, but kind of gave up.
13:15 Dyrcona Oh, right. tsbere made a PR on github: https://github.com/perl4lib/marc-perl/pull/4
13:16 Dyrcona Too many bug trackers....
13:18 JBoyer joined #evergreen
13:18 Bmagic It would be nice if MARC::File would just handle it

Results for 2022-03-04

09:55 tsadok_ joined #evergreen
10:01 stephengwills joined #evergreen
10:01 stephengwills left #evergreen
10:14 Dyrcona Why is working with MARC so difficult? Now, I'm trying to dump some records from the database to a binary MARC file in UTF-8, and of course, it's mangling the "fancy" characters. You'd think I'd have this down pat by now.
10:16 Dyrcona Now, if I can just keep my fat palms off of the touchpad.... (I keep "clicking" inadvertently in another window.)
10:17 * csharp_ removes offensive trolling from the IRC logs
10:18 Dyrcona OK, when pulling records from the DB, use (BinaryEncoding => 'utf8') on the use MARC::File::XML line, and if using IO::File to write the output do $fh->binmode(':utf8');
10:19 Dyrcona csharp_: I'm half curious to see the trolling because I missed it, but I hope you're not referring to my monologues. :)
10:21 csharp_ oh, I deleted those too :-)
10:21 csharp_ I KEED I KEED

Results for 2022-03-02

10:24 Dyrcona I was trying to find an Emacs command or function to make a file: URL from a path. Doesn't look like such a thing exists, but after 15 minutes of searching the function help and online, I decided, "I could have implemented it by now."
10:29 Dyrcona I still haven't implemented it, but it might be a handy thing to have.
10:35 Dyrcona Of course, there's a Jabber/XMPP library for Emacs..... :)
10:44 Dyrcona MARC::Charset apparently does not like EM DASH: \xE2\x80\x94.
10:50 Dyrcona The message suggests that triplets beginning with \xE2\x80 are the problem: no mapping found for [0x80] at position 24
10:50 Dyrcona gmcharlt ^^ Should I file a bug on MARC::Charset?
10:55 Dyrcona FWIW: I'm using this program to load some binary MARC records: https://pastebin.com/g4RGDJLr
11:01 miker Dyrcona: doesn't like going from MARC8 to UTF-8?
11:06 Dyrcona The file is UTF-8.
11:08 miker hrm... that's strange ... could the records be claiming MARC8? IIRC we do look at the leader, but I think there's a way to force the issue
11:20 Dyrcona Now, my dump program is complaining about wide character in print. It didn't before I set the encoding....
11:29 miker are you calling MARC::Charset->assume_unicode(1); before processing the records? (that's the "force the issue" option)
11:34 Dyrcona miker: No. You can see the program that I'm using to load the records. I haven't shared the preprocessor, yet.
11:35 Dyrcona I was going to ask if anyone in here has used pymarc much. (I've looked at it.) Python usually has better charset handling than does Perl, and I wonder if anyone has used chardet to detect the charsets used by MARC records.
11:35 Dyrcona I've found all kinds of crap in MARC from 3rd parties.
11:36 miker right on. fwiw, many of our scripts and db functions use both assume_unicode(1) and ignore_errors(1). see the top part of Open-ILS/src/extras/import/marc_add_ids for instance
11:38 Dyrcona Well, I've set the 09 in the leader to 'a' for this batch. When I open the file in Emacs it looks like UTF-8.
11:39 Dyrcona I'm running it again in another db with a fresh copy of production. I reloaded them after yesterday's tests.
11:44 Dyrcona Looks like you're only expected to use it on output.
11:45 Dyrcona The documentation needs to be updated or PyMarc does: "When I can require python 2.3, this will go away."
11:47 Dyrcona I might play with this some time, but for now I'm sticking to Perl.
11:50 Dyrcona It's funny that my preprocessor, using MARC::Record, has no problem with these records, but I guess I'm not asking it to do any charset conversion. Well, it has no problem since I fixed the double encoding bug. :)
11:58 Dyrcona IIRC, I think I had to set the charset in these records yesterday, but I forgot when I ran the preprocessor after making changes to it this morning.
11:59 Dyrcona Yeah, it has passed the author with the different characters for the first letter of the last name with no warnings or errors.
12:03 jihpringle joined #evergreen
14:33 pinesol csharp_: go with explicit
14:35 Dyrcona I only have those git add problems when I add things to git on the servers. :)
15:11 gmcharlt Dyrcona: re MARC::Charset, yes
15:13 Dyrcona gmcharlt: I'm not sure it's a bug in MARC::Charset, now. It's bad data combined with user error/forgetfulness.
15:13 gmcharlt Dyrcona: ah, OK
15:13 Dyrcona I thought the records were specifying UTF-8, but they weren't, even though there were in fact encoded in UTF-8.
15:14 Dyrcona s/there/they/ # It's getting late and the fingers are tired... :)
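The mismatch described above, UTF-8 data in records whose leader does not say so, can be detected before loading. In MARC 21, leader position 09 is 'a' for UCS/Unicode and blank for MARC-8. The Python sketch below checks that byte against what the data actually looks like; the function names are hypothetical, and the test is a heuristic rather than proof of the encoding.

```python
def claims_unicode(raw_record: bytes) -> bool:
    """True when leader position 09 is 'a' (UCS/Unicode); blank means MARC-8."""
    return raw_record[9:10] == b"a"


def is_utf8_with_non_ascii(raw_record: bytes) -> bool:
    """True when the bytes decode as UTF-8 and include non-ASCII characters."""
    try:
        text = raw_record.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return not text.isascii()


def mislabeled_as_marc8(raw_record: bytes) -> bool:
    """True when the data looks like UTF-8 but the leader does not claim it."""
    return is_utf8_with_non_ascii(raw_record) and not claims_unicode(raw_record)


def set_unicode_flag(raw_record: bytes) -> bytes:
    """Return a copy of the record with leader position 09 set to 'a'."""
    return raw_record[:9] + b"a" + raw_record[10:]
```

Setting the flag does not change the record length, so the leader and directory stay consistent. In Perl, the chat's suggestion of $marc->encoding('UTF-8') sets the same leader position.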

Results for 2022-03-01

09:24 * Dyrcona mumbles "smart quotes...."
09:27 Dyrcona Also, just plain junk in these records.
09:33 Dyrcona I wonder if we're using an outdated Unicode standard?--That "we" is meant to be vague, i.e. not necessarily Evergreen.
09:39 Dyrcona So, it looks like the preprocessing script does something to the records. Maybe I need to tell MARC::Record not to mangle the characters, somehow?
09:56 Dyrcona Or, maybe it doesn't.... Comparing dumps of the processed versus raw records, the relevant bits of the busted records look the same.
10:08 Dyrcona I wonder if converting the records to marcxml will make a difference?
10:10 Dyrcona So, do I convert them with yaz-marcdump or with MARC::File::XML?
10:15 Dyrcona Right, so when I use yaz-marcdump to convert the input records to marcxml, my editor shows the characters correctly. For some reason, I think my editor is treating the dumps as latin-1, even when I tell it they are UTF-8. What I suspect is MARC::Record and friends are mangling the characters because I'm working with binary MARC.
10:16 Dyrcona I have little proof, other than what I see in the files through my editor, and that the records get mangled by my preprocessor Perl program.
10:17 Dyrcona I will adapt my preprocessor and loader to work with marcxml and see what happens.
10:18 Dyrcona BTW, the input records say they are UTF-8 in the leader.
10:18 * Dyrcona quacks.
10:19 Dyrcona Hmm... Should I use MARC::File::XML on these records, or should I use LibXML? I can do what I want with either.....
10:23 * Dyrcona should write a MARC mode for Emacs. It couldn't be that hard... :)
10:26 Dyrcona @monologue
10:26 pinesol Dyrcona: Your current monologue is at least 15 lines long.
10:28 Dyrcona So, yeah, the MARC::Record code that I'm using is mangling the characters.
10:30 Dyrcona When I tell Emacs to use UTF-8 with one of the files, I get this: "...encountered characters it couldn’t encode..." followed by a list of characters that won't paste into my IRC client.
10:30 Dyrcona The error message is much more detailed.
10:31 Dyrcona I open the original MARC file and the mangled characters show up correctly in Emacs.
10:31 Dyrcona Proof!
10:55 rjackson_isl_hom joined #evergreen
11:01 Dyrcona I wonder if the problem is how I'm reading the binary files. I  open it via IO::File with the record separator set to \x1e\x1d because records with smart quotes would break with MARC::Batch or MARC::File::USMARC. After I get the raw MARC, I feed that to MARC::Record. I suspect that is where the breakage occurs.
11:01 Dyrcona Could be that I need to decode the data before passing it to MARC::Record?
11:01 rjackson_isl_hom joined #evergreen
11:01 Dyrcona Or would that be encode?
11:03 Dyrcona Well, I can try it and see.

Results for 2022-02-28

13:55 jeffdavis I think PINES has the fixes for those bugs so I am assuming this is a 3.8-specific issue, probably with the new holdings editor?
13:59 jeffdavis well not necessarily the holdings editor, forget that bit
14:59 collum joined #evergreen
15:01 Dyrcona Anybody ever seen this one before: Use of uninitialized value $code_wanted in string eq at /usr/share/perl5/MARC/Field.pm line 314, <GEN1> chunk 1.
15:05 Dyrcona Ok. Figured it out: Can't call method "subfield" on an undefined value at /home/opensrf/scripts/prep-od-advantage line 76, <GEN1> chunk 9.
15:06 Dyrcona I tried getting something out of the record that doesn't exist, and didn't properly check for its existence before trying to use it.
15:52 jeffdavis I spoke too soon! no children available for open-ils.actor (1442 warnings)

Results for 2022-02-25

09:34 JBoyer Dyrcona, something else to consider from that message is whether or not you have any local MODS or other xslt transforms.
09:37 Dyrcona JBoyer: I think we do, so I'll check that. Thanks for the suggestion.
09:47 jeff chopPunctuation, chopPunctuation, chopPunctuation... heh.
09:49 Dyrcona Yeah, I haven't looked but I suspect a busted field in the MARC. Maybe i should dump it now before someone changes it.
09:54 jvwoolf Dyrcona: Before I forget again, I wanted to say that we tested the patch in 1482757 and it worked fine. We've got it running in production now.
09:54 jvwoolf Let's see if I can get that to link correctly - lp1482757
09:54 Dyrcona Lp 1482757
09:55 * JBoyer shakes fist at anything case-sensitive that's not a password
09:55 Dyrcona jvwoolf: It works fine for me, too. It just doesn't speed things up in a noticeable way. Also, this process is still really slow on Pg 12+.
09:56 jvwoolf Dyrcona: It sped up importing eresources pretty significantly for us
09:56 Dyrcona So, just dumping the MARC to the screen I think I see the problem. There's a field that ends with a lot of blank spaces.
09:56 Dyrcona Well, subfield....
09:57 jvwoolf Also, we removed the 30 million deleted URI call numbers and our call number reports work again. We also haven't had any drone timeouts since then, but that could be a coincidence.
09:57 Dyrcona jvwoolf: I guess, but I was looking for improvement on later Pg versions, which that patch doesn't do. If you want to sign off, feel free.
11:16 Dyrcona So, that would be 444 templates?
11:16 Dyrcona jeff: A qualified yes on sharing it. A definite yes on I still have the record.
11:17 Dyrcona I should ask before I share it for reasons.
11:21 Dyrcona Apparently, it's an item that no one else is likely to have.  It's MARC type a (a book?) about the construction of one of our libraries. Looks to be a gift from the construction company.
11:28 Dyrcona Of course the spaces don't show up in the staff client.
11:29 Dyrcona The 300 field also looks a little messed up in the editor. Subfield a is just ";"
11:36 Dyrcona Page count is missing.
11:59 Dyrcona Just to confirm: It blows up on any of the mods transforms. (mods3 gives a missing file error. I should look into that, but doesn't look like we use mods3.) marc21expand880 works, but it doesn't strip out the extra spaces.
12:00 Dyrcona jeff: Yeah. that's what it looks like.
12:00 Dyrcona It's obvious in the marcxml. Not so obvious elsewhere.
12:01 Dyrcona I should probably use MARC::Record to fix it so that the length is updated correctly, but I should be able to just subtract 222 from the current value.
12:03 jeff I have mixed feelings about trying to maintain size in a marcxml record. :-)
12:04 jeff we mostly (and should always) ensure that it's updated/correct on conversion from marcxml to binary MARC format.
12:04 Dyrcona Some of our vendors have strong opinions about it.
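
[Editor's sketch] The "subtract 222 from the current value" idea above refers to the record length stored in leader positions 00-04 as five zero-padded digits. A minimal Python sketch of that arithmetic follows. The function name and the sample leader are hypothetical and are not taken from the record under discussion.

```python
def adjust_leader_length(leader: str, removed_bytes: int) -> str:
    """Return a copy of a 24-character MARC leader whose record length
    (positions 00-04) has been reduced by removed_bytes."""
    if len(leader) != 24:
        raise ValueError("a MARC leader is exactly 24 characters")
    new_length = int(leader[0:5]) - removed_bytes
    # The length field is five digits wide, so 99999 is the largest value it can hold.
    if not 0 < new_length <= 99999:
        raise ValueError(f"record length {new_length} does not fit in 5 digits")
    return f"{new_length:05d}{leader[5:]}"


# Hypothetical leader, for illustration only.
example_leader = "01234nam a2200289 a 4500"
print(adjust_leader_length(example_leader, 222))
```

This only covers the leader as it appears in MARCXML. In binary MARC the directory entries also carry field lengths and offsets, which is one reason to let a MARC library rewrite the record rather than patch the number by hand.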

Results for 2022-02-15

08:36 mantis1 joined #evergreen
08:38 mmorgan joined #evergreen
09:12 Dyrcona joined #evergreen
09:19 Dyrcona If I get this message while loading MARC records "no mapping found for [0x80]": a) is that a warning or an error (i.e. does the record load anyway) and b) anyone had to fix these before and got any tips that might save me an hour or so of fumbling around?
09:20 Dyrcona The records in the file claim to be UTF-8, but we all know how that works out in reality.
09:26 Dyrcona Well, I can say that those messages are not making it to MARC::Record->warnings. Because I would log those to my log file.
09:28 Dyrcona Grr. Because I botched the shortname in 856$9, they're not showing up in my custom view....
09:29 Dyrcona That means I can't get a quick count.
09:30 Dyrcona I can also say that the messages didn't trigger my error handler in the eval because there's no error log.
09:31 pinesol Dyrcona: Vendor records is probably integrated with systemd
09:35 jvwoolf joined #evergreen
09:39 * Dyrcona reloads the database to give it another go.
09:49 Dyrcona Think I'll redirect stderr to a file. I'm not sure where these messages are coming from, either when I create the MARC::Record from the raw marc data, or when doing the insert.
09:49 Dyrcona Most likely the former.
09:58 JBoyer Bmagic_, I'm told the cert for the MOBIUS bugsquash server could use a refresh sometime.
10:01 Bmagic_ oh! will do
10:41 csharp_ jvwoolf: re: slow reports, have you done any query analysis to see what it's doing? (e.g. EXPLAIN/EXPLAIN ANALYZE)
10:41 csharp_ note that EXPLAIN ANALYZE actually runs the query, so if it's timing out, that may not work
10:55 rfrasur joined #evergreen
10:59 Dyrcona So, back to my MARC::Charset thing from earlier. I basically copied the example from the MARC::Lint manpage and ran that on my MARC input file, and it doesn't report any character set issues, though it does complain about punctuation and the subfield 9 in the 856.
11:06 Dyrcona Oh, interesting. It looks like MARC::File::USMARC->next returns raw MARC and doesn't create a MARC::Record object.
11:09 Dyrcona Oh, never mind. It does. I was looking at _next().
11:09 csharp_ Dyrcona: I've seen that kind of thing when processing records with garbage characters too
11:10 csharp_ pretty sure we were able to find them in the text and replace/remove them
11:11 Dyrcona If it looks like single characters, I may just use sed or something like that on the file.
11:12 csharp_ yeah - pretty sure we had a cataloger load the file in MARCEdit at some point and do a simple find/replace
11:12 csharp_ that was back before I was more comfortable with regexes
11:13 Dyrcona So, I don't see the bad character message from MARC::File::USMARC because it builds the MARC::Record differently. It adds the fields as it finds them, does some checking of its own. My program reads the raw MARC from the file and does MARC::Record->new_from_usmarc().
11:14 Dyrcona Some of these look like "smart" quotes. Probably Windows-1252, again. :(
11:14 csharp_ yeppers
11:14 Dyrcona I wish people would learn that you can't just copy and paste into a MARC record.
11:14 csharp_ edit in Word, paste into MARC
11:15 csharp_ WYSIWY(don't actually)G
11:15 Dyrcona :)
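
[Editor's sketch] One way to locate stray Windows-1252 bytes in a file that claims to be UTF-8 is to attempt a strict decode and record where it fails. This is a minimal Python sketch, not the Perl load program discussed here. The function name and the sample bytes are assumptions made for illustration.

```python
def find_invalid_utf8(data: bytes):
    """Return a list of (offset, byte_value) pairs marking bytes that
    prevent data from decoding as strict UTF-8."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            bad = pos + err.start  # err.start is relative to the slice
            problems.append((bad, data[bad]))
            pos = bad + 1  # skip the offending byte and keep scanning
    return problems


# Hypothetical sample: Windows-1252 "smart" quotes pasted into ASCII text.
sample = b"say \x93hi\x94"
for offset, value in find_invalid_utf8(sample):
    print(f"offset {offset}: 0x{value:02X}")
print(sample.decode("cp1252"))
```

The scan re-decodes the tail after every bad byte, so it is meant for spot checks on individual records rather than for very large files.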
12:51 Dyrcona At Pg 11, most things start getting faster, but others seem to get hit.
12:52 Dyrcona We do a dump and restore at least whenever we get new hardware.
12:52 Dyrcona I do them weekly to keep some test databases up to date. Anyway, getting off topic.
12:54 Dyrcona Going back to my record load/character issues, the message isn't coming from MARC::Record, either. I've got a program to dump the 856s that uses the same code to read the records and it doesn't peep.
12:56 Dyrcona I wonder if I can just convert the raw MARC data or if I need to do it field by field....
13:04 Dyrcona Now, it's just looking like some garbage and not necessarily Windows 1252... Probably a mix.... :(
13:07 Dyrcona I think I know the source of the messages though. It looks like they're possibly coming from NFD() when the MARC is cleaned to go in the database. I can check that hypothesis real quick.
13:09 Dyrcona And, no. That's not it, either.
13:11 Dyrcona g0=ASCII_DEFAULT g1=EXTENDED_LATIN at /usr/local/share/perl/5.26.1/MARC/Charset.pm line 308, <GEN1> chunk 1752.
13:18 Dyrcona Still don't know where that's coming from. I'm not using MARC::Charset, and doesn't look like the modules that I use do either. It must be the database throwing that at me....
13:57 JBoyer Dyrcona, MARC::XML and friends are used in the ingest trigger functions, so depending on where you're actually seeing that message output the database is a likely source.
13:59 Dyrcona JBoyer: Yeah. It doesn't seem to be coming from anything running in my Perl code outside of the database, but I didn't think warnings from the database would just show up in my output.
14:00 Dyrcona Oh, never mind. They will because of the DBI options.
15:40 Dyrcona Aight, so it is a mangled UTF-8 "smart quote." The sequence should probably be 3 bytes: \xe2\x80\x9d.
15:41 Dyrcona Ah, this character appears before it: â
15:43 Dyrcona Which is \xe2.....
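
[Editor's sketch] The three bytes \xe2\x80\x9d are the UTF-8 encoding of U+201D (right double quotation mark). If something reads those bytes one at a time as Latin-1, the first byte shows up as "â" and the other two become invisible control characters. The Python sketch below illustrates that general effect only. It makes no claim about what MARC::Charset does internally, and the helper names are hypothetical.

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode the bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_latin1_misread(text: str) -> str:
    """Undo misread_as_latin1 by reversing the two steps."""
    return text.encode("latin-1").decode("utf-8")


quote = "\u201d"  # RIGHT DOUBLE QUOTATION MARK
print(quote.encode("utf-8"))           # the three bytes quoted in the log
mangled = misread_as_latin1(quote)
print(repr(mangled))                   # starts with the "â" seen in the output
print(repair_latin1_misread(mangled) == quote)
```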
15:43 Dyrcona MARC::Charset is apparently not dealing with it correctly, or I need to update my Unicode support....
15:54 Dyrcona Ugh.... I somehow killed the program.... Not sure what I did.
16:00 Dyrcona gmcharlt: It looks like MARC::Charset doesn't handle \xe2\x80\x9d correctly.
16:02 Dyrcona I wonder what happens if I replace those in Perl with a "?
16:11 Dyrcona Also, if the file is UTF-8, and that is a valid UTF-8 sequence, what's MARC::Charset got to do with it?
16:12 Dyrcona I'ma just leave it alone and reload the file again tomorrow after a db reload.
16:13 mmorgan Going home and coming back again is always a good approach :)
16:14 Dyrcona Unfortunately, I am at home.
16:14 Dyrcona And, I'll be tomorrow, too. :)
16:15 mmorgan There's always turning it off and on again!
16:15 Dyrcona I don't think this should be my problem. Things actually look good for a change with the data. I was blaming the vendor, but it looks like MARC::Charset and/or Evergreen's use of it is at fault. Unless this one record isn't set to UTF-8 or something when it should be.
16:16 Dyrcona Also, I have little context for the warnings. I'd have to search the MARC for the offending text/codes.
16:16 Dyrcona No record number, which tells me this isn't coming from my code because all errors and warnings are logged with the record number from the file.
16:23 abowling joined #evergreen
16:24 abowling after a 3.7 update, links in 856 are no longer appearing in the opac. the library has custom templates, but i diffed the relevant ones and it seems nothing has changed. any ideas on what i might be missing?

Results for 2022-02-07

09:09 terranm joined #evergreen
09:14 jvwoolf joined #evergreen
09:45 Keith-isl joined #evergreen
11:16 Dyrcona So, back to that MARC subfield lookup issue I had on Friday. If I use naco_normalize() on my input string and naco_normalize() on the subfield text, then I get the matches that I expected. Only the second naco_normalize is necessary, but with both I don't have to make my input all lowercase, etc.
11:39 Dyrcona My program now does what I expected, it normalizes 347$b values so that they are all the same for a given material type.
11:41 Dyrcona All of our variations on "4K Ultra HD Blu-ray" (usually missing a word here or there) look the same.
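
[Editor's sketch] The matching approach described above normalizes both sides before comparing, so the stored subfield text is never changed by the comparison itself. The Python function below is a deliberately simplified stand-in and is NOT Evergreen's naco_normalize(). It strips combining marks, lowercases, turns punctuation into spaces, and collapses whitespace, which is enough to show the idea.

```python
import re
import unicodedata


def simple_normalize(text: str) -> str:
    """Simplified, hypothetical normalizer used only for comparisons."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    lowered = no_marks.lower()
    no_punct = re.sub(r"[^\w\s]", " ", lowered)  # punctuation becomes a space
    return " ".join(no_punct.split())            # collapse runs of whitespace


def subfield_matches(search_value: str, subfield_text: str) -> bool:
    """Compare normalized forms; the original strings are left untouched."""
    return simple_normalize(search_value) == simple_normalize(subfield_text)


print(subfield_matches("4K Ultra HD Blu-ray", " 4k ultra HD Blu-Ray. "))
```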
12:26 jihpringle joined #evergreen

Results for 2022-02-04

16:29 jeff if the normalization is causing problems only with your change attempt, and the problem is that the normalization makes it difficult to find the relevant records, could you change your approach to use mrfr (metabib.real_full_rec) to find *possibly* relevant records, then parse their marcxml to determine the non-normalized value in 347$b?
16:29 jeff (you probably already have that idea or better)
16:30 Dyrcona jeff: It's pretty simple. I'm searching mrfr using the value string converted to a tsvector to find the records that I want to update.
16:31 Dyrcona I'm using that string, preconversion to look for matching 347$b in the MARC::Record. It's the $subfield eq $str that it is failing, basically.
16:32 Dyrcona If I normalize $subfield, then it will match, but then I'll be stuffing a different, normalized value into the MARC and our catalogers might not like that.
16:33 Dyrcona Turns out, too, that it's only 49 records, so I may just ask the cataloging center to fix them.
16:34 Dyrcona Now, I want to make t-shirts and wrist bands with "WWJSD" on them: What would Jon Skeet do?
16:36 Dyrcona I'm tempted to stuff this code into my private scripts repo anyway.

Results for 2022-01-11

09:23 Dyrcona So, I found out that a full ingest is pretty much required going from 3.5 to 3.7. Titles and authors were not showing in the bootstrap OPAC on our training server until I ran pingest over the weekend.
09:36 miker Dyrcona: that's ... strange. missing data in metabib.display_entry, somehow?
09:38 jvwoolf joined #evergreen
09:50 Dyrcona miker: Yes. The TTOPAC pulls titles and authors from the MARC, IIRC. BooPAC uses display fields.
09:53 mmorgan JBoyer++
10:50 mantis We're on 3.6.5 using Angular for the staff but still using TPAC for the OPAC.  When accessing the Patron View button on conjoined items, we get a server error.  This however works when Boopac is enabled on our other test servers.  Is this a known bug?
11:06 Dyrcona mantis: IDK, but if it is not on Lp, then it's probably not widely known.

Results for 2021-10-13

11:03 rhamby right
11:11 Dyrcona Well, I like automating things because automated mistakes usually lend themselves to automated fixes.
11:12 Dyrcona csharp_: I work with a project that includes a submodule. If you have questions, let me know. I might be able to help.
11:13 Dyrcona So, I'm looking at MARC export, and I think it's a bug that it exports holdings in 852. I think it's supposed to be 952.
11:31 Dyrcona Eh, maybe that isn't a bug. I should have looked at the full description again. ;)
11:33 Dyrcona Hmm.. The query that I'm working on is going to be more complicated than I first thought...
11:34 Dyrcona Or, maybe not. I probably don't have to include locations for deleted copies.
11:35 Dyrcona Thanks, rubber ducky!
11:44 mmorgan __(')<
12:02 Dyrcona Wonder if I messed up, or if there are really that many copies: Record length of 2216371 is larger than the MARC spec allows (99999 bytes). at /usr/share/perl5/MARC/File/USMARC.pm line 314.
12:03 Dyrcona mmorgan++
12:03 Dyrcona Guess I'll convert to XML and have a look after it finishes.
12:05 Dyrcona Heh. I messed up my query....
12:50 Dyrcona Always fun when you get the submodule out of sync with the main code. :)
12:52 Dyrcona Here's the project I'm talking about: https://github.com/Dyrcona/openfortigui
12:54 Dyrcona Ugh. Looks like this code might be too slow to be useful.
12:57 Dyrcona When it was just straight up dumping MARC, it took about 10 to 15 minutes to dump 48,000+ records. It has been running for about 45 minutes now and only dumped 1,296 records with holdings. I should probably modify the main query to return the marc and copy info.
13:05 Dyrcona Wonder if I can array_agg over an array_agg?
13:22 Bmagic Dyrcona: I've got some perl that dumps records in parallel, I've seen it hit 300 records/second
13:24 Bmagic IIRC, 8 threads. Mind you, it's not using the perl "threads" module because Encode.pm. Instead it launches a system command and monitors a mutual file on the fs
13:48 Dyrcona I don't bother with threads in Perl 5. I use fork.
13:49 Dyrcona Anyway, I think I've got a solution. Rather than run this time consuming query once for each record, I'll do it once for all records and make a hash table of the information per record id.
13:49 Dyrcona If I get the options right, I can probably have selectall_arrayref make the data structure for me.
13:53 Dyrcona My program basically works like this: Get a list of bre.id using one of 3 queries. After that loop through the array of ids and grab the marc for each one. If this is a batch of deletes, set the leader 05 to d and write to the binary output file. Otherwise, delete the 852 tags in the marc, look up the copy location and org unit name for each copy and add an 852 to the marc for each.
13:53 Dyrcona Then, write it to the output file.
13:54 Dyrcona It got really slow when I added the copy location/org_unit query.
14:02 Bmagic I had some of the same challenges
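
[Editor's sketch] The speed-up proposed above is to run the copy location/org unit query once for the whole batch and index the rows by record id, instead of querying once per record. A minimal Python sketch of that grouping step follows. The row layout (record id, org unit name, copy location) and the sample values are assumptions for illustration, not the actual query output.

```python
from collections import defaultdict


def group_copies_by_record(rows):
    """Build {record_id: [(org_unit_name, copy_location), ...]} from an
    iterable of (record_id, org_unit_name, copy_location) tuples."""
    by_record = defaultdict(list)
    for record_id, org_unit_name, copy_location in rows:
        by_record[record_id].append((org_unit_name, copy_location))
    return dict(by_record)


# Hypothetical rows, as if returned by a single query over all records.
rows = [
    (101, "Branch A", "Stacks"),
    (101, "Branch B", "New Books"),
    (205, "Branch A", "Reference"),
]
holdings = group_copies_by_record(rows)
for record_id in (101, 205, 999):
    # Records without copies simply get an empty list.
    print(record_id, holdings.get(record_id, []))
```

With the lookup table built up front, the per-record loop only does dictionary lookups, so the database is hit once for copy data regardless of how many records are exported.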

Results for 2021-09-17

13:26 Bmagic small ones tend to load completely. But this interface is still Dojo, so I don't imagine anyone is interested until it's Angular
13:41 pinesol [opensrf|kenstir] Fix LP#1883169 by using growing_buffer - <http://git.evergreen-ils.org/?p=OpenSRF.git;a=commit;h=a3368f9>
13:42 jvwoolf Interestingly, we had one with 6 items fail to load
13:44 Dyrcona jvwoolf: How big is the MARC associated with those?
13:45 Dyrcona I'm pretty sure there is some MARC pulled over as well, though I might be thinking of something else.
13:45 Dyrcona Oops!
13:45 jvwoolf joined #evergreen
