10:51 |
berick |
Bmagic: https://demo.evergreencatalog.com/osrf-gateway-v1?service=open-ils.search&method=open-ils.search.biblio.marc.staff&param={%22org_unit%22:1,%22limit%22:10,%22offset%22:0,%22searches%22:[{%22term%22:%22ocolc%22,%22restrict%22:[{%22subfield%22:%22a%22,%22tag%22:%22035%22}]}]} |
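For anyone scripting that search, the gateway query string can be built instead of hand-escaped. A minimal Python sketch, assuming exactly the host, service, method, and parameters visible in the URL above (`gateway_url` is my helper name, not an Evergreen API):

```python
import json
from urllib.parse import urlencode, parse_qs, urlparse

def gateway_url(base, service, method, *params):
    # The OpenSRF HTTP gateway takes repeated "param" arguments,
    # each one a JSON-encoded value.
    query = [("service", service), ("method", method)]
    query += [("param", json.dumps(p, separators=(",", ":"))) for p in params]
    return base + "?" + urlencode(query)

search = {
    "org_unit": 1, "limit": 10, "offset": 0,
    "searches": [{"term": "ocolc",
                  "restrict": [{"subfield": "a", "tag": "035"}]}],
}
url = gateway_url("https://demo.evergreencatalog.com/osrf-gateway-v1",
                  "open-ils.search",
                  "open-ils.search.biblio.marc.staff",
                  search)
```

Removing ".staff" from the method name switches to the public-scoped search.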
10:51 |
berick |
well, remove ".staff" as needed |
10:52 |
Bmagic |
ty! I'll work with that and see if I can get it to go my way |
11:05 |
Dyrcona |
Bmagic: Are you looking for MARC-style data for a record? |
11:05 |
Bmagic |
xml is fine |
11:06 |
Bmagic |
this API returns JSON with the bib IDs; a separate call is then required to get the details of each bib |
11:08 |
Bmagic |
a wrinkle in my conundrum is that the bibs I'm trying to turn up are electronic bibs (no items). It seems that the public (non-staff) API call won't turn those up no matter which scope I use. |
11:10 |
Bmagic |
there is a setting for marc tag search in the OPAC? |
11:10 |
Dyrcona |
Well, there are settings for whether or not, and how, records without items (most electronic records) show up in the OPAC. |
11:10 |
Bmagic |
What I'm finding is that Evergreen won't give me results for electronic scoped bibs when searching marc tags in the OPAC |
11:11 |
Dyrcona |
There might be a bug. I don't use MARC tag search. |
11:11 |
Dyrcona |
There are settings to control if scoped records show up or not depending on the scope. |
11:11 |
Dyrcona |
It might be those settings. |
11:12 |
* Dyrcona |
is doing like 3 things at once right now, so if I missed that you've scoped the search, then I apologize. :) |
15:09 |
Dyrcona |
Pg 15 mostly. |
15:09 |
Dyrcona |
Are you having issues? |
15:10 |
Bmagic |
oh good. I have done the same recently. It worked (pg 15 for me too). Though, there were many records it threw some console errors about. It still resulted in a marc file. And that file contained errors according to MARCEdit, which it happily stripped out for me using the validator tool. |
15:11 |
Dyrcona |
What console errors? I'm not sure that has to do with the Pg version so much. It's more likely down to character set issues in the MARC. |
15:11 |
Bmagic |
the DB is "C" and UTF-8, same as it was on PG 10. I'm troubleshooting an export for a VuFind instance. VuFind doesn't like the export (all of a sudden). One change we made was upgrading from PG 10 to 15, so I'm just trying to rule that out as a possible issue. I think it's just plain bad records that were introduced recently, and the PG version is a red herring |
15:12 |
Bmagic |
yep, character set issues. Which, we're no strangers to. But the underlying DB version could play a role. |
15:13 |
Dyrcona |
Did you upgrade Ubuntu, too? There was an Ubuntu upgrade that required reindexing the database or something because the Unicode library version changed. |
15:30 |
Dyrcona |
I might as well start one of them now. |
15:31 |
Dyrcona |
I should also make sure that they're using the same marc_export. |
15:32 |
|
jvwoolf joined #evergreen |
15:35 |
Dyrcona |
Bmagic: you are dumping binary MARC with encoding UTF-8? |
15:36 |
Dyrcona |
I've got one that, for some reason, dumps XML then uses yaz-marcdump to convert it to binary MARC. |
15:37 |
Dyrcona |
XML would be easier to compare. |
15:37 |
Dyrcona |
...Even if the file is bloated. |
15:38 |
* Dyrcona |
started a binary dump already and decides to let it go. |
15:30 |
Dyrcona |
"They say these days are made of rust..." Or should that be Rust? Eh, berick? |
15:33 |
Stompro |
I'm looking at how Aspen categorizes things as fiction/non-fiction... and it categorizes poetry as Fiction by default? Which seems wrong for how Libraries usually categorize things. |
15:33 |
Dyrcona |
Command line programming with PHP: I don't recommend it unless you need to for some crazy reason... like, oh, testing the Aspen Evergreen driver without installing Aspen. |
15:35 |
Dyrcona |
Stompro: Most of that comes from the MARC, I wager. Not sure if poetry can also say "fiction" in the coded values/wherever that lives, but I've seen lots of crazy stuff in MARC records over the years...decades. |
15:36 |
Dyrcona |
Come to think of it, I don't recommend web programming with PHP, either. |
15:36 |
Dyrcona |
:) |
15:37 |
Dyrcona |
Poetry should probably be its own category though. |
15:57 |
kmlussier |
I'm also not surprised jweston has published a book in every Dewey classification. She's very talented. |
16:03 |
sleary |
I have THOUGHTS about the fact that MARC has a field for festschrift but not fiction/nonfiction. |
16:03 |
* berick |
had to google |
16:04 |
Dyrcona |
I have many thoughts about MARC....most of them... not good. |
16:04 |
JBoyer |
I lack the energy to act on it but Dyrcona and berick's conversation above has me thinking about a Malware Radio parody of Wall of Voodoo's Mexican Radio, which I enjoyed quite a bit back in the day. |
16:04 |
berick |
yeah... |
16:04 |
Dyrcona |
"I'm a Mexican woa-oah radio." |
10:36 |
Dyrcona |
Actually, it's probably simple rec and some of the other triggers, too. I'm not sure I want to disable simple rec in a transaction. I'll have to investigate the triggers a bit more. |
10:40 |
pinesol |
News from commits: LP1850473 (follow-up): Update DOM selector in nightwatch test <https://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=bddc372d27e4ac9298ef6d265534c3b14529b2a2> |
10:41 |
Dyrcona |
No, I don't want to disable any triggers. It would lead to locking and possible dead locks. Setting session replication role is overkill since it disables all triggers. |
10:45 |
Dyrcona |
Looks like the ingest triggers are getting fired, too, but they.... Oh wait. The marc changes.... |
10:53 |
Dyrcona |
Trying different batch sizes, with a limit on a subquery: it did 10 in 5 seconds, 100 in 26 seconds, and 1000 took 4 minutes 23 seconds and used a lot more memory. |
10:53 |
* Dyrcona |
tries 500. |
10:56 |
Dyrcona |
1m40.676s... |
14:01 |
csharp_ |
reingest began ~11:30 a.m. yesterday - a little over half done per the batch count |
14:07 |
Dyrcona |
csharp_: My case was nearly catastrophic. Cascading deadlocks stemming from a process trying to delete a staff account with thousands of owned records. It was an acquisitions account. The person doing the delete tried it at least 3 more times after the first timed out in the client. |
14:10 |
Dyrcona |
This sort of thing used to happen so much in Horizon with Sybase, that I made a little GUI in Java to show me the deadlocked processes. From that I could identify the most likely culprit process. I could then click its row in the table, type Ctrl-k and that would send the Sybase equivalent of pg_cancel_backend. Every now and then, I consider adapting that to PostgreSQL and Evergreen. |
14:13 |
Dyrcona |
Looks like someone was also loading MARC order records at the same time. I may have clobbered their backend process, not sure. I tried to only cancel the delete_usr queries. |
14:23 |
Dyrcona |
Speaking of long-running processes, my update of tcn_source on 530,105 bibs where it is '' has been running for 1 day, 5 hours, and 27 minutes at this point. |
14:25 |
Dyrcona |
Fortunately, that is on a test system, or I'd have probably had to cancel it for deadlocks. |
14:29 |
|
smayo joined #evergreen |
09:27 |
Bmagic |
kmlussier++ |
09:44 |
Dyrcona |
So, I think I've found a "solution" to my MARC4J streams problem. I'm going to have two programs. The export program will write out the marcxml from Evergreen to a file and also write a file to map the database id with the record's position in the marcxml file. |
09:44 |
Dyrcona |
The second program will read both files, parsing the marcxml with MARC4J and using the other file as a key to map the current record with the database id. |
09:45 |
Dyrcona |
The first program doesn't have to be written in Java, and I suppose I could output a binary MARC file to save some space. MARCXML just seemed simpler, since I could just print the marc field directly. |
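The two-file plan described above (record data in one file, a database-id-to-position map in the other) can be sketched like this. A toy Python version with invented record content, one record per line, and file names of my choosing:

```python
import os, tempfile

# Fake stand-ins for bib IDs and their MARCXML.
records = {101: "<record>first</record>",
           102: "<record>second</record>"}

tmpdir = tempfile.mkdtemp()
data_path = os.path.join(tmpdir, "records.xml")
map_path = os.path.join(tmpdir, "records.map")

# Program 1: write each record, noting the byte offset where it starts.
with open(data_path, "wb") as data, open(map_path, "w") as idx:
    for db_id, xml in records.items():
        idx.write("%d\t%d\n" % (db_id, data.tell()))
        data.write(xml.encode("utf-8") + b"\n")

# Program 2: the map file ties each record back to its database id
# (and allows seeking straight to any one record, as here).
def read_record(db_id):
    with open(map_path) as idx:
        offsets = {int(i): int(o) for i, o in
                   (line.split("\t") for line in idx)}
    with open(data_path, "rb") as data:
        data.seek(offsets[db_id])
        return data.readline().decode("utf-8").rstrip("\n")
```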
10:03 |
sleary |
kmlussier does pinesol know about matcha frapps? |
10:03 |
Dyrcona |
my $file_content = do{local(@ARGV,$/)=$filename;<>}; # That's a gnarly trick. |
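The trick works because localizing @ARGV to the file name and clearing $/ (the input record separator) makes the diamond operator slurp the entire file in one read. The Python equivalent, for comparison:

```python
import os, tempfile
from pathlib import Path

# Make a small file so the slurp has something to read.
fd, filename = tempfile.mkstemp()
os.write(fd, b"line one\nline two\n")
os.close(fd)

# Whole file as one string, like the localized-@ARGV/$/ trick in Perl.
file_content = Path(filename).read_text()
```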
10:05 |
kmlussier |
sleary: It certainly can be added as a dessert. :) |
15:52 |
Dyrcona |
Hmm... The permissive stream reader includes binary in the exception output; looks like field separators or other low ASCII control characters. |
15:54 |
Dyrcona |
Maybe this has not been worth the effort. |
16:19 |
|
jihpringle joined #evergreen |
16:44 |
Dyrcona |
I wonder, too, if sometimes the errors cascade. That is, an error reading a previous record ends up messing up the read of the next few records, even if they might be correct. There are things that will do that with MARC::Record. |
16:51 |
|
jihpringle joined #evergreen |
17:05 |
|
mmorgan1 left #evergreen |
18:21 |
|
sandbergja joined #evergreen |
11:21 |
jeff |
oh, never mind. I think that's exactly what you said you were doing, and I misread. |
11:24 |
Dyrcona |
Yeah. We're sending records to someone parsing them with MARC4J. I'm trying to implement a program to find records that MARC4J doesn't like and then output a spreadsheet of the errors for our catalogers. |
11:24 |
Dyrcona |
I may go back to java.nio.Pipe. I had that sort of working, but when I added InputStream to the mix, the program would hang. |
11:26 |
Dyrcona |
I know I'm getting I/O deadlock, and I'm using classes that are recommended for use with different threads in a single thread. Maybe if I spin off a thread for the MARC reader, but then I need to also get the database id in that thread somehow..... |
11:26 |
Dyrcona |
There's too much computer science in Java. It's definitely not a hacker's language. |
11:27 |
|
kmlussier joined #evergreen |
11:28 |
Dyrcona |
I am flushing the OutputStream before trying to read from the other end of the pipe.... Pipes in C are so much easier. Well, maybe I'm more familiar with pipes in C. :) |
12:12 |
Dyrcona |
That's at 11:24 EST. :) |
12:13 |
Dyrcona |
I've already done this for a set of records that evergreen-universe-rs doesn't like. |
12:13 |
berick |
oh, gotcha |
12:15 |
Dyrcona |
I think I'll put this down for now and work on a program to convert some marcxml to binary marc to see if CPAN's RT will let me upload that. I get a 403 when I try to upload the marcxml examples from the Rust test. |
12:15 |
Dyrcona |
"That should only take half an hour," he said knowing it was very likely to be a lie. |
12:16 |
Dyrcona |
Also, mexican_coca-cola++ It tastes so much better with cane sugar than with HFCS. |
12:22 |
Dyrcona |
What? libmarc-perl does not install MARC::File and friends? I thought that it did. |
12:23 |
* Dyrcona |
grumbles about CPAN.....and that half hour will be spent just installing the tools. |
12:33 |
Dyrcona |
marcdump [options] file(s) That's useful.... |
12:34 |
Dyrcona |
And, I have to write my own. marcdump doesn't work on XML. |
12:35 |
Dyrcona |
I'm just full of complaints today, aren't I? |
12:44 |
Bmagic |
you? never! |
12:44 |
Dyrcona |
I'm installing MARC::File::XML with cpan set to local::lib, and there sure are a lot of prerequisites. |
12:50 |
Dyrcona |
Failed 11/11 test programs. 3/5 subtests failed. |
12:50 |
Dyrcona |
Right. I'll just run it on a server where this is already installed. |
12:51 |
Dyrcona |
And, I'll wipe out the stuff that CPAN installed locally. |
13:08 |
Dyrcona |
Looks like I may have to reboot. I just swapped monitors and the laptop doesn't see the new one. |
13:14 |
|
Dyrcona joined #evergreen |
13:16 |
Dyrcona |
hey! That's funny. MARC::Batch catches some of these errors: Leader must be 24 bytes long |
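That leader check is about the simplest structural validation a MARC record can fail, since a MARC 21 leader is exactly 24 bytes. A sketch of the same kind of check (the sample leaders are invented):

```python
LEADER_LENGTH = 24  # a MARC 21 leader is exactly 24 bytes

def leader_problems(leader):
    """Return a list of complaints about a MARC leader string."""
    problems = []
    if len(leader) != LEADER_LENGTH:
        problems.append("Leader must be 24 bytes long")
    elif not leader[:5].isdigit():
        # Bytes 0-4 carry the total record length, zero-padded.
        problems.append("Record length (leader bytes 0-4) is not numeric")
    return problems

ok = leader_problems("00714cam a2200205 a 4500")   # plausible leader
bad = leader_problems("00714cam")                  # truncated leader
```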
13:19 |
Stompro |
Dyrcona, have you looked at MARC::Lint already? |
13:33 |
Dyrcona |
Stompro: Never heard of it. |
13:33 |
Dyrcona |
Apparently, all of the software in the world has chosen this week to hate me: https://rt.cpan.org/Ticket/Display.html?id=150348&results=a5d68555ff4b4354e65ce6ec51f76634 # Read to the bottom... |
13:37 |
Dyrcona |
gmcharlt: RT on CPAN is apparently broken for uploads at the moment. I've tried 3 times to add a file of records to that ticket above. |
13:39 |
Dyrcona |
Stompro++ I'll give MARC::Lint a whirl. |
13:57 |
Dyrcona |
Stompro: It looks like MARC::Lint may help. I'm running a test program already. |
13:59 |
Stompro |
I wonder if it will be too verbose, or if you can pick out the bigger issues. I'm curious how it performs also? |
13:59 |
Dyrcona |
And, maybe not so much: is_valid_checksum: Didn't get object! at /usr/share/perl5/Business/ISBN.pm line 481, <DATA> line 244. |
13:59 |
Dyrcona |
Well, it gets totally clobbered by our data after bib id 233519. |
14:05 |
Dyrcona |
Dunno. That could be what it exploded on. I'm trying again with an eval BLOCK. |
14:06 |
Dyrcona |
If it gets all the way through I'll use CSV, and output the warnings to a csv. I might output the errors to a separate one. |
14:08 |
Dyrcona |
My catalogers will be sorry that they ever asked for this. :) |
14:10 |
Dyrcona |
MARC::Lint seems to find something in nearly every record. |
14:39 |
|
terranm joined #evergreen |
14:54 |
Dyrcona |
hmm... What's the limit of rows in Excel, 32,000? I may have to split this up. |
14:55 |
|
kmlussier1 joined #evergreen |
15:03 |
* Dyrcona |
tries hot swapping monitors again. If I disappear, I had to reboot. |
15:10 |
Dyrcona |
Well, looks like I have to reboot. |
15:16 |
|
Dyrcona joined #evergreen |
15:28 |
Dyrcona |
Looks like MARC::Lint uses its own eval, and the errors that it passes up to the client program are not very useful for a cataloger: "Can't locate object method "checksum" via package "0316110620" (perhaps you forgot to load "0316110620"?) at /usr/share/perl5/Business/ISBN.pm line 484, <DATA> line 244." |
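For reference, the ISBN-10 checksum that Business::ISBN was being asked about is simple to compute directly. A sketch, using the ISBN from the error message above (which, checked by hand, is actually valid):

```python
def isbn10_checksum_ok(isbn):
    """ISBN-10 check: weighted digit sum (weights 10 down to 1) must be
    divisible by 11; a final 'X' stands for the value 10."""
    digits = isbn.replace("-", "")
    if len(digits) != 10:
        return False
    total = 0
    for weight, ch in zip(range(10, 0, -1), digits):
        if ch == "X" and weight == 1:
            value = 10
        elif ch.isdigit():
            value = int(ch)
        else:
            return False
        total += weight * value
    return total % 11 == 0

valid = isbn10_checksum_ok("0316110620")
```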
15:32 |
Dyrcona |
Stompro: Do you know about Tk::MARC::Editor and MARC::ErrorChecks? |
15:41 |
|
dluch joined #evergreen |
15:45 |
|
jihpringle joined #evergreen |
15:49 |
Stompro |
Dyrcona, nope, I haven't looked at them before. |
15:50 |
Dyrcona |
I had a quick look at MARC::Errorchecks and it seems more cumbersome and nitpicky than MARC::Lint. |
16:06 |
|
pinesol joined #evergreen |
17:08 |
|
mmorgan left #evergreen |
18:26 |
|
briank joined #evergreen |
09:00 |
|
sleary joined #evergreen |
09:00 |
|
smayo joined #evergreen |
09:02 |
|
mmorgan1 joined #evergreen |
09:10 |
Dyrcona |
Stompro: It is 1,736,893 bibs. I can't really count the items because the main file is MARC binary. |
09:11 |
Dyrcona |
BTW: I think the Perl MARC::Record is too permissive. I'm going to do some research on it and probably open a ticket in CPAN's RT. |
09:12 |
Dyrcona |
We have a record with an empty 008 as an example. Evergreen deals with it just fine, but other systems mangle it. |
09:24 |
Dyrcona |
The simple things seem to be a problem for me this morning, like opening files in Perl.... I know what the problem is... No music is playing! |
09:25 |
|
mantis1 joined #evergreen |
14:00 |
terranm |
Weekly 3.12 code review if anyone wants to join - https://www.google.com/url?q=https://princeton.zoom.us/my/sandbergja |
14:13 |
|
ejk_ joined #evergreen |
15:21 |
jeff |
terranm++ for sharing the link here |
15:36 |
Dyrcona |
So, I'm thinking of implementing a third MARC exporter for Evergreen. This one would be written in Java using MARC4J to catch things that Aspen won't like. It's not so much meant for everyday use. |
15:54 |
Dyrcona |
Looking through the code I wrote for the MVLC migration to Evergreen from Horizon has given me some ideas. I had a SAX parser in there to check the MARCXML before trying to load a record. It would delete "empty" subfields and control fields. I could just yank the MARCXML from the database and check for busted elements as a start. |
16:07 |
csharp_ |
jeffdavis++ # PG 14 |
16:59 |
|
mantis1 left #evergreen |
10:08 |
Dyrcona |
I'm not entirely sure how vis_attr_vector is used, but I'm sure it's most important for records with URIs. |
10:13 |
sleary |
sandbergja: thank you for changing that route :) |
10:14 |
Bmagic |
update config.internal_flag set enabled='f' where name~'ingest.reingest.force_on_same_marc' |
10:15 |
Dyrcona |
Bmagic: Yeah. You may or may not want that flag enabled all the time. It depends on how much/how often your MARC actually changes. |
10:16 |
Dyrcona |
You might be surprised to see how often MARC gets updated when it hasn't changed. |
10:16 |
Bmagic |
I think that was the trick, still playing with it |
10:18 |
Bmagic |
confirmed, that was it |
10:18 |
Bmagic |
I could have swore I checked that before I started posting here |
10:56 |
Dyrcona |
Awesome sauce! The "obvious" approach works. |
10:57 |
Dyrcona |
`psql -v outputdir=output -f script`, then in the script: `\o :outputdir/outfile.dat`. |
10:58 |
Dyrcona |
Think I'll shorten it to 'outdir' for the actual thing, though. |
11:24 |
Dyrcona |
Bmagic (and berick for that matter): I'm going to run the Rust MARC exporter either later today or tonight to capture the error output. It seems to find more "bad" records than the Perl code. Just thought I'd give you a heads-up. I don't expect the load to spike on the utility server, but you never know. |
11:25 |
Bmagic |
Dyrcona++ # you go on with your bad self |
11:25 |
berick |
Dyrcona: cool, be curious to see what you find |
11:26 |
Dyrcona |
Actually, I'll schedule it for 9:00 PM since we don't seem to have any database updates requested. I'll write something to parse the error output afterward. (It will be good practice.) |
11:26 |
berick |
also what Bmagic said |
11:26 |
Dyrcona |
:) |
11:27 |
Dyrcona |
I'm going to extract all of the records, and I won't bother with holdings. |
11:28 |
Dyrcona |
berick: Can the eg-marc-export do authorities, too? If not, that would be a useful feature. |
11:28 |
berick |
Dyrcona: not yet |
11:28 |
Dyrcona |
I'll work on a pull request, then. ;) |
11:28 |
berick |
awesome |
10:19 |
Stompro |
I figured, my perl array skills need work. :-) |
10:20 |
Dyrcona |
Maybe my suggestion requires more rearrangement of the code, though. Having a firsttag flag might fit better with the current code organization. |
10:20 |
Dyrcona |
I wonder if the first one even needs to be grouped? |
10:20 |
Dyrcona |
I'm going to look at MARC::Record again. |
10:21 |
Dyrcona |
Stompro++ # For the notes in the snippets. |
10:21 |
Stompro |
In my test data, the 901 tag would be placed before the 852 without using the insert_grouped_field for the first. |
10:23 |
Stompro |
I don't think MARC::Record re-orders the fields. |
12:38 |
Dyrcona |
Heh. Almost 1 minute longer..... |
12:50 |
|
collum joined #evergreen |
13:02 |
Dyrcona |
I am testing this now: time marc_export --all -e UTF-8 --items > all.mrc 2>all.err |
13:14 |
Dyrcona |
The Rust marc export does batching of the queries by adding limit and offset. I wonder if we should do the same? I've noticed that the CPU usage goes up over time, which implies that something is struggling through the records. The memory use stays mostly constant once all of the records are retrieved from the database. |
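On the batching question: LIMIT/OFFSET paging degrades as the offset grows, because the server still has to walk past all the skipped rows, while keyset pagination (WHERE id > last-seen, ORDER BY id) keeps each batch cheap. A sketch against an in-memory SQLite table; the table name is illustrative, not Evergreen's schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record_entry (id INTEGER PRIMARY KEY, marc TEXT)")
conn.executemany("INSERT INTO record_entry VALUES (?, ?)",
                 [(i, "<marc %d>" % i) for i in range(1, 26)])

def batches(conn, size):
    """Yield rows in id order, `size` at a time, without OFFSET:
    each query resumes from the last id seen."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, marc FROM record_entry"
            " WHERE id > ? ORDER BY id LIMIT ?", (last_id, size)).fetchall()
        if not rows:
            break
        yield rows
        last_id = rows[-1][0]

batch_sizes = [len(b) for b in batches(conn, 10)]
```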
13:20 |
Stompro |
Dyrcona, if you use gnu time, it gets you max memory usage also. /usr/bin/time -v... so you don't have to check that separately. |
13:25 |
Stompro |
Dyrcona, I'm surprised the execution time increased for you... hmm. |
13:28 |
Dyrcona |
Things are always weird here. |
16:13 |
berick |
Stompro++ eeevil++ # looks like cursors are an option -- will give it a poke |
16:17 |
Dyrcona |
I'll give cursors a poke, too. |
16:17 |
Dyrcona |
I think I commented about "cursors and Sybase" and my early experience with them at The Jockey Club and with Horizon last week. |
16:19 |
Dyrcona |
BTW, my maximum memory usage is 9GB. I think that's my biggest issue with MARC export. |
16:19 |
eeevil |
rewindable and writable cursors, and with-hold cursors, are not as fast as not-those-types in PG, but we don't need those, generally. |
16:19 |
Stompro |
With --items, 1.3G vs 256M for max resident memory, 596s vs 473s run time. (That also compares the 852 insert changes.) |
09:01 |
mantis1 |
This season has been terrible |
09:02 |
mantis1 |
I hope you'll be ok for Hackaway! |
09:06 |
|
Dyrcona joined #evergreen |
09:08 |
Dyrcona |
berick: I did `cargo build --release --package evergreen` then copied eg-marc-export to /openils/bin/. I missed the password on one of the two lines for eg-marc-export in my script, so I don't know if it is faster, but the binary is certainly smaller without the debugging symbols, etc. |
09:14 |
|
redavis joined #evergreen |
09:18 |
|
terranm joined #evergreen |
09:18 |
Dyrcona |
FWIW, I haven't used --release on my test system. I did that for the production server. |
10:10 |
csharp_ |
sounds good too |
10:21 |
Dyrcona |
Hmm. One of our marc_exports is still running since Tuesday night. |
10:21 |
Dyrcona |
I wonder if Perl has some kind of issue on virtual machines? |
10:22 |
Dyrcona |
Well, I can always replace it with eg-marc-export. |
10:32 |
berick |
Dyrcona: fwiw, the --release build chopped off 1/3 of the runtime for my 150k record+items export. |
10:32 |
berick |
depends on the data, i'm sure, though |
10:34 |
* JBoyer |
wonders what Kuma's Korner does for birthdays? |
14:30 |
Dyrcona |
berick: Speaking of Rust... I think you might have introduced a bug with a recent commit when you moved where OFFSET gets added. I got the following when using --query-file: |
14:31 |
JBoyer |
Ah, still catching up here and there. Something is still bonkers with that extract though given what we're seeing here. |
14:32 |
Dyrcona |
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: Db, cause: Some(DbError { severity: "ERROR", parsed_severity: Some(Error), code: SqlState(E42601), message: "syntax error at or near \"OFFSET\"", detail: None, hint: None, position: Some(Original(630)), where_: None, schema: None, table: None, column: None, datatype: None, constraint: None, file: Some("scan.l"), line: Some(1123), routine: Some("scanner_yyerror") }) }', evergreen/src/bin/marc-export.rs:398:56 |
14:33 |
Dyrcona |
JBoyer: --items has always been slow for us, but it's worse than ever, and it really looks like it is Perl. |
14:34 |
berick |
Dyrcona: k. mind sharing your query file? |
14:35 |
Dyrcona |
berick: I'm modifying one of them. Let me try another run. |
14:35 |
berick |
Dyrcona: i think i see the issue.. |
14:37 |
Dyrcona |
One of the queries returns 3 columns: bre.id, bre.marc, and count(acp.id). It didn't look like the 3rd column would be an issue. |
14:38 |
Dyrcona |
JBoyer: I would not be surprised if there is something wrong with the Perl versions I'm using or something, but I don't feel like I have time to deal with that. I'm under pressure to get them records last week. :) |
14:40 |
berick |
the chunked processing requires ordering/limiting/offsetting, which adds additional constraints to the format of the query file. for now, could just read the query file as-is and avoid any paging. |
14:41 |
JBoyer |
True, finding the Right Fix when you're under a Right Now deadline is a lot like being technically correct but completely unhelpful. :) I just think that *after* you can get the initial export done and transported, replacing the exporter won't necessarily be the ideal fix. (Though, for later, I'm also curious what OS the super slow exporter is running on) |
14:51 |
|
jihpringle joined #evergreen |
14:52 |
berick |
Dyrcona: pushed a patch to avoid modification of the --query-file sql |
14:52 |
berick |
re: id, any chance you have multiple columns resolving to the name "id"? |
14:53 |
Dyrcona |
berick: Cool. I might.... I'm going to modify the query that I think is blowing up to use a CTE, and then grab bre.id and bre.marc. |
14:53 |
Dyrcona |
Currently, it's actually returning acn.record, bre.marc, and the count on acp.id. |
14:56 |
Dyrcona |
Dude..... I just noticed the file ends with two semicolons.....I'll bet that's it. |
14:56 |
Dyrcona |
Still, I think I'll do the CTE. |
14:57 |
berick |
my test file contains: SELECT id, marc FROM biblio.record_entry WHERE NOT deleted (no semicolons needed) |
14:59 |
Dyrcona |
Yeah, ; is a habit from writing stuff for psql. |
15:07 |
pinesol |
News from commits: LP2035287: Update selectors in e2e tests <https://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=f562b3ac30a3753d63d565c2d7be4d3a7121a2fb> |
15:11 |
Dyrcona |
Now, I'm cooking with gas. I got the "large" bibs file (85 records with 87,296 items) dumped to XML in under a minute. That took almost 2 hours with the Perl program the other day. |
15:18 |
Dyrcona |
Using query-file to feed the eg-marc-export, it is using a lot more RAM than before, about the same as the Perl export was using. It still uses less CPU. We'll see if that changes over time. |
15:28 |
jeff |
you have 87,296 items that are spread across only 85 bibs? |
15:29 |
Dyrcona |
Yes, we do. |
15:29 |
jeff |
color me intrigued. |
13:20 |
jeffdavis |
yes, fairly frequently |
13:28 |
pinesol |
News from commits: LP#2007603: restore functioning of default search tab preference <https://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=adf3bd07e0e4558e9d48f9cdf5081e956eef7866> |
13:55 |
|
jihpringle joined #evergreen |
13:59 |
Dyrcona |
berick: eg-marc-export appears to be way more efficient than marc_export. However, the XML it produces isn't pretty printed, so I'll have to split records with sed or something to see what I've got when it is done. |
14:06 |
berick |
Dyrcona: ah, i've been piping to xmllint --format |
14:06 |
berick |
but could easily add a --pretty-print option |
14:12 |
berick |
pushed --pretty-print-xml option. added some other options since yesterday as well. |
14:13 |
mantis1 |
Is recall and force holds a library setting? |
14:18 |
berick |
mantis1: when placing an item-level hold in the staff catalog, the options should be available. |
14:18 |
berick |
they do have their own permissions |
14:18 |
Dyrcona |
berick: I could have just dumped a binary marc file. I've got a thing to count records for that. |
14:19 |
Dyrcona |
I pasted your example from yesterday and changed the filename. |
14:20 |
jihpringle |
mantis1: depending on what you're doing there are three library settings that control how recalls work |
14:21 |
mantis1 |
berick: it might be the permissions then that I need to assign |
14:21 |
mantis1 |
berick++ |
14:22 |
mmorgan |
mantis1: I think those hold options are only available under Request Item in Item Status. |
14:22 |
berick |
mmorgan: they're in the Ang staff catalog |
14:22 |
Dyrcona |
berick is there a way to just build eg-marc-export? |
14:23 |
mmorgan |
Oh! I should have known that! |
14:23 |
berick |
Dyrcona: sorta, not really. you can build --package evergreen, but it of course builds its opensrf dependencies |
14:24 |
Dyrcona |
OK. Thanks! |
14:58 |
berick |
looking at some other options too |
15:01 |
Dyrcona |
I should have time'd this run... I'll do that next time. |
15:02 |
|
mantis1 left #evergreen |
15:24 |
Dyrcona |
berick: It doesn't look like eg-marc-export can be fed a list of ids in a pipe. |
15:24 |
Dyrcona |
That was more a question, really. |
15:28 |
Dyrcona |
It finds lots of errors, too. |
15:32 |
berick |
Dyrcona: no, that's not yet supported |
15:34 |
Dyrcona |
I'm going to start over with some different options and time it. |
15:34 |
berick |
k. --query-file may need some work too. it wants 'id' and 'marc' columns in the query, which may vary from the marc_export version. |
15:35 |
* berick |
looks at --pipe |
15:35 |
Dyrcona |
Yeah, but I could use a modified version of the SQL to get the list of ids, just add the marc column. |
15:35 |
berick |
yeah |
15:35 |
Dyrcona |
marc_export just takes ids on stdin. |
15:36 |
Dyrcona |
Think I'll just dump all records with --items to a binary file, time it, and see what I get and how long it takes. I'll dump stderr to a file to see if I can fix some of these records. I see a bit of "bad subfield code." |
11:02 |
Dyrcona |
We will also likely never use it in production.... |
11:05 |
Dyrcona |
It might need more patches than just that one.... I'll leave it for now. |
11:14 |
|
kmlussier joined #evergreen |
11:18 |
Dyrcona |
So, going back to yesterday's conversation about MARC export, I wonder if that commit really was the problem. I reverted that one and two others, then started a new export. It has been running for almost 21 hours and only exported about 340,000 records. I estimate it should export about 1.7 million. |
11:20 |
Dyrcona |
At that rate, it will still take roughly 5 days to export them all. This is on a test system, but it's an old production database server and it's "configured." The hardware is no slouch. I guess I will have to dump queries and run them through EXPLAIN. |
11:28 |
Dyrcona |
Y'know what. I think I'll stop this export, back out the entire feature and go again. |
11:29 |
jeff |
if it's similar behavior as yesterday and most of the resource usage appears to be marc_export using CPU, I'd suspect inefficiency in the MARC record manipulation or in dealing with the relatively large amount of data in memory from the use of fetchall_ on such a large dataset. |
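To illustrate jeff's point: a fetchall-style call materializes every row before processing starts, which is where the memory goes on a multi-million-record export. Fetching in fixed-size chunks keeps the footprint flat. A Python DB-API sketch (Perl DBI's fetchall_arrayref with a row limit behaves similarly); the table and data are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bre (id INTEGER PRIMARY KEY, marc TEXT)")
conn.executemany("INSERT INTO bre VALUES (?, ?)",
                 [(i, "record %d" % i) for i in range(1, 1001)])

cur = conn.execute("SELECT id, marc FROM bre ORDER BY id")
seen = 0
while True:
    # Only this many rows are held in client memory at once, instead of
    # the entire result set a fetchall() would materialize.
    chunk = cur.fetchmany(100)
    if not chunk:
        break
    seen += len(chunk)
```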
10:27 |
Dyrcona |
`ps -o etime 24586` said 4-19:00:01 just a few seconds ago. |
10:28 |
Dyrcona |
I'm running it with time, but was curious how long it has been going so far. |
10:29 |
Dyrcona |
Adding --items seems to really slow it down on this setup. |
10:33 |
Dyrcona |
I should try it with a binary MARC file to see if that makes a difference. I wonder if writing the output locally is a problem, though I doubt it. |
10:36 |
Dyrcona |
The db server does not appear to be under any strain. |
10:37 |
Dyrcona |
Load is 0.08 and plenty of free RAM, which could mean it's not cached, but with NVMe, who needs cache? ;) |
10:44 |
|
sandbergja joined #evergreen |
10:44 |
Dyrcona |
Makes me wonder if we're missing an index, or if adding a new index might help. It would be nice if there was an easy way to dump the SQL from Perl DBI... Maybe there is. I should check. |
10:48 |
Dyrcona |
I suppose I could hack a copy of marc_export to dump the SQL instead of executing it. |
10:50 |
|
briank joined #evergreen |
10:50 |
Dyrcona |
I'd like to run it through explain. It's probably the queries to grab item information to add to the MARC, so I'll have to dump an example of that, too. |
10:52 |
Dyrcona |
Guess I will be looking into it later.... *sigh* |
10:52 |
jeff |
or tweak log_min_duration_statement just long enough to capture some sample queries. depends on how otherwise loaded your db server is, if this is prod. |
10:56 |
Dyrcona |
This is a test system that hosts multiple databases, but this is the only instance currently doing anything. |
11:03 |
Dyrcona |
It's not running on the same server as the DB either. |
11:06 |
Dyrcona |
I'll have to do some investigation to see where the problem lies. Maybe I can get some improvements for everyone out of this. |
11:20 |
Dyrcona |
jeff++ # I may just crank the logging up for a test run later. I suspect this one will finish sometime later today, but I also thought that it would have done by yesterday to start with. |
11:24 |
Dyrcona |
FWIW, I'm dumping XML because it's "easier" to work with than binary MARC, but when a file is about 8GB in size, the format doesn't really matter any longer, does it? :) |
11:25 |
|
collum joined #evergreen |
11:33 |
|
kmlussier joined #evergreen |
11:39 |
|
jihpringle joined #evergreen |
13:41 |
Dyrcona |
jeff: I think some of the patches that I am testing are responsible for the slow down, particularly the one for the above Lp bug. |
13:45 |
Dyrcona |
I think I'll revert a couple of commits before I say much more. |
14:21 |
Dyrcona |
Hmm... Looks like I have somewhere in the vicinity of 400,000 records left to export. I think I'll stop this one and try again with the suspected commits reverted. |
14:25 |
Dyrcona |
Think I'll export to a binary MARC file this time. At least the file will be smaller. |
14:43 |
|
mdriscoll joined #evergreen |
14:50 |
|
shulabear joined #evergreen |
14:50 |
|
Stompro joined #evergreen |
09:50 |
|
kworstell-isl joined #evergreen |
10:12 |
|
Christineb joined #evergreen |
12:07 |
|
jihpringle joined #evergreen |
12:10 |
Dyrcona |
Binary MARC records with HTML entities in them.... I guess.... Whatever..... |
12:27 |
berick |
@decide binary-marc-with-html OR html-with-binary-marc |
12:27 |
pinesol |
berick: That's a tough one... |
12:57 |
jeff |
&2DzfVQ- is the IMAP4 modified UTF-7 (mUTF-7) encoding for U+1F355, aka the "pizza" emoji: 🍕 |
12:59 |
* jeffdavis |
backs away slowly |
13:02 |
Dyrcona |
Does that pizza emoji have pineapple on it? |
13:04 |
Dyrcona |
So, chardet3 is my new friend. It tells me UTF-8 encoded MARC files are UTF-8 with 0.99 confidence, and MARC-8 encoded MARC files are ISO-8859-1 with 0.70 to 0.75 confidence. |
13:04 |
Dyrcona |
chardet3 does not know about MARC-8. |
13:05 |
* Dyrcona |
looks for a similar module to Python's chardet in Perl. |
13:07 |
Dyrcona |
libencode-detect-perl is packaged for Ubuntu/Debian. |
13:12 |
Dyrcona |
It turns out that libraries can choose UTF-8 when downloading records from Overdrive. If they don't then the records are apparently MARC-8. I don't want to have to figure that out manually, so I'm going to make my record load program do that for me. |
13:15 |
Dyrcona |
And Encode::Detect won't do what I want, since it decodes the text using the detect encoding. That will break MARC-8. |
13:15 |
Dyrcona |
I feel like I've had this monologue before.... |
13:26 |
Dyrcona |
Aha! It detects ascii if there are no "fancy" characters in the MARC-8 files. |
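[Editor's note: a minimal Python sketch of the detection logic discussed above, not the Perl tooling Dyrcona is writing. The facts it relies on: Leader/09 is 'a' for UCS/Unicode and blank for MARC-8 per the MARC 21 spec, leaders from vendors can lie, and a pure-ASCII record is byte-identical in both encodings. The function name is hypothetical.]

```python
def guess_marc_encoding(record: bytes) -> str:
    """Guess the character encoding of one binary MARC record.

    Leader/09 is 'a' for UCS/Unicode and blank for MARC-8, but
    vendor-supplied leaders are unreliable, so verify the claim
    against the actual bytes.
    """
    # Pure ASCII is valid in both encodings, so either label works.
    if all(b < 0x80 for b in record):
        return "ascii"
    if len(record) > 9 and record[9:10] == b"a":
        # Leader claims Unicode; confirm it actually decodes.
        try:
            record.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            pass  # leader lied; fall through
    return "marc-8"
```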
13:35 |
|
rfrasur joined #evergreen |
14:26 |
Dyrcona |
1 file changed, 39 insertions(+), 7 deletions(-) # Hopefully that does it! |
14:43 |
scottangel |
I'm looking at the 'strict barcodes' checkbox on the patron checkout page. Looks like it's bound to a variable called $scope.strict_barcode. The problem I'm facing is the function ng-change="onStrictBarcodeChange()" doesn't flip the boolean. Am I missing something? shouldn't there be something like $scope.strict_barcode = !$scope.strict_barcode; From what I can tell is this setting is meant to be saved w/ egCore.hatch.setItem() but |
14:52 |
Dyrcona |
Good question. I don't know. |
15:13 |
Dyrcona |
Meh... I should spell check commit messages before pushing.... |
16:01 |
|
BDorsey joined #evergreen |
16:37 |
Dyrcona |
Whee! MARC-8 encoded record, says it's UTF-8 in the leader but \xE1\x65 is MARC-8 for \xC3\xA8 in UTF-8. |
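[Editor's note: a small Python illustration of the byte sequences above. MARC-8 puts the combining grave accent (0xE1) before the base letter, while UTF-8 encodes the precomposed letter as a two-byte sequence; the MARC-8 bytes are also invalid UTF-8, which is how a record whose leader claims UTF-8 gets caught out.]

```python
marc8_e_grave = b"\xe1\x65"   # combining grave accent + 'e' in MARC-8
utf8_e_grave = b"\xc3\xa8"    # U+00E8 LATIN SMALL LETTER E WITH GRAVE

# The UTF-8 pair decodes to the expected character.
assert utf8_e_grave.decode("utf-8") == "\u00e8"

# The MARC-8 pair is not valid UTF-8: \xe1 opens a three-byte
# sequence, but \x65 ('e') is not a continuation byte.
try:
    marc8_e_grave.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
assert not valid_utf8
```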
16:40 |
Dyrcona |
NB: I haven't done anything to the file other than inspect with my editor. |
16:40 |
Dyrcona |
Grr..... Omitted a word there. |
16:45 |
Dyrcona |
Also....Editing text in browser text boxes stinks.... |
09:46 |
|
dguarrac joined #evergreen |
10:25 |
|
Christineb joined #evergreen |
10:54 |
Dyrcona |
Hm.. I wonder how workable it would be to replace ISO-8859-1 copyright symbols with UTF-8 ones using a regex.... It seems like it would be simple, but it's the kind of thing that can lead to problems. I have a file of records that I can play with.... |
10:58 |
Dyrcona |
Looks like it happens with registered trademark symbol, too. (Gotta love vendor-supplied MARC records.) |
11:06 |
Bmagic |
Love em indeed |
11:16 |
Dyrcona |
Bmagic: Does your load process handle things like that? I have a --strict option on one of my load programs that rejects records with bad characters, well any warnings are treated as errors, really. |
11:16 |
Dyrcona |
I'm considering adding code to fix copyright and registered trademark symbols since they seem to be a thing with this one vendor in particular. |
11:21 |
Bmagic |
Dyrcona: but FWIW: https://github.com/mcoia/sierra_marc_tools/blob/master/auto_rec_load/dataHandler.pm around line 800, if it dies, it will failover to readMARCFileRaw |
11:22 |
Dyrcona |
Bmagic: Thanks. I've had a glance at that code before. My issue isn't reading the records. We get these warnings when loading them: utf8 "\xA9" does not map to Unicode at /usr/lib/x86_64-linux-gnu/perl/5.26/Encode.pm line 212, <GEN1> chunk 300. |
11:23 |
Dyrcona |
They will load in the database, if I let them go in. |
11:24 |
Dyrcona |
The warnings don't occur while preprocessing the records using MARC::Record to modify the 856 tags. |
11:24 |
Bmagic |
writing something to transcode one character to another seems doable. But I've not done it. Sorry :( |
11:25 |
Dyrcona |
Yeah, it's actually transcoding to 2 characters \xA9 -> \xC2\xA9. |
11:25 |
Bmagic |
it sounds like you'll end up having to read the file character by character instead of letting MARC::Record do it? |
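[Editor's note: one way to do the transcoding discussed above, sketched in Python rather than the Perl involved; the function name is hypothetical. If a field already decodes as UTF-8 it is left alone; otherwise it is treated as Latin-1 and re-encoded, which maps the bare \xA9 to the UTF-8 pair \xC2\xA9 (and \xAE to \xC2\xAE for the registered trademark sign). Note this would mangle genuine MARC-8 data, so it only applies once a record is known to be intended as UTF-8.]

```python
def fix_field_encoding(data: bytes) -> bytes:
    """Re-encode field data containing stray Latin-1 bytes.

    Valid UTF-8 passes through untouched; anything else is
    reinterpreted as Latin-1 and re-encoded as UTF-8, so
    \xA9 -> \xC2\xA9 and \xAE -> \xC2\xAE.
    """
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        return data.decode("latin-1").encode("utf-8")
```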
13:02 |
miker |
jeff: re LP 1829295, without wading into the bug itself, a big +1 to a YAOUS for respecting closed dates (and, you can just delete the row from config.org_setting_type, probably through the UI as the admin user!) |
13:02 |
pinesol |
Launchpad bug 1829295 in Evergreen "Shelf expire date doesn't respect closed dates" [Wishlist,Confirmed] https://launchpad.net/bugs/1829295 |
13:15 |
jeff |
miker: thanks for the feedback! i may have only half-followed you, though. which org unit setting type are you referring to? |
13:55 |
Dyrcona |
"Smart" quotes in MARC.....That seems to be what's causing problems with record sizes. |
13:55 |
|
mantis1 joined #evergreen |
13:56 |
rhamby |
smart_quotes-- |
13:56 |
sleary |
ugh |
14:03 |
Dyrcona |
sleary++ |
14:03 |
Dyrcona |
I'm leaning towards Windows and "copy and paste" cataloging. I was just converting from octal to see what the value is to look it up in UTF-8. |
14:05 |
sleary |
Quotes and apostrophes copied from Word in Windows used to truncate content in WordPress constantly. Good times. |
14:06 |
Dyrcona |
They truncate MARC records, too, because one of the characters in the sequence is the MARC End of Record character. I specifically use code to look for End of Field followed by End of Record to avoid this. |
14:07 |
Dyrcona |
Looks like our Perl MARC code doesn't calculate a proper record length, but those characters shouldn't be in a MARC record in the first place. |
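[Editor's note: a Python sketch of the two points above. A smart quote that ends up UTF-16 encoded contains the MARC End of Record byte (0x1D), which is why a naive split truncates records; requiring End of Field (0x1E) immediately before End of Record avoids the false positive. The leader check uses the MARC 21 rule that Leader/00-04 holds the record length as five ASCII digits. Function names are hypothetical.]

```python
END_OF_FIELD = b"\x1e"
END_OF_RECORD = b"\x1d"

# UTF-16BE for U+201D (right curly quote) contains the End of
# Record byte, so stray UTF-16 text can truncate a record:
assert "\u201d".encode("utf-16-be") == b"\x20\x1d"

def split_records(data: bytes) -> list:
    """Split a file of binary MARC records, tolerating stray \x1d
    bytes inside field data by splitting only on End of Field
    followed immediately by End of Record."""
    chunks = data.split(END_OF_FIELD + END_OF_RECORD)
    # Reattach the terminator pair to each non-empty record.
    return [c + END_OF_FIELD + END_OF_RECORD for c in chunks if c]

def leader_length_ok(record: bytes) -> bool:
    """Leader/00-04 must equal the record's actual byte length;
    a mangled record usually fails this check."""
    try:
        return int(record[0:5]) == len(record)
    except ValueError:
        return False
```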
14:07 |
rhamby |
utf16-- |
14:08 |
Dyrcona |
Heh. If only everything was big endian UTF-32....drive manufacturers would be happy.... :) |
14:16 |
mmorgan |
@quote get 232 |
14:16 |
pinesol |
mmorgan: Quote #232: "<mmorgan> Smart quotes are kinda like smart TVs in that neither are all that smart" (added by Dyrcona at 04:47 PM, October 19, 2022) |
14:17 |
Dyrcona |
mmorgan++ |
14:18 |
Dyrcona |
There's a bug in the CPAN RT for MARC::Batch or MARC::Record (maybe on Github, too), and I don't think the solution actually works. |
14:23 |
Dyrcona |
I thought tsbere had a proposed solution to this one: https://rt.cpan.org/Public/Bug/Display.html?id=70169 |
14:28 |
|
jihpringle joined #evergreen |
14:32 |
Dyrcona |
I have been told that this can happen if people copy and paste from Amazon when cataloging. |
15:19 |
Dyrcona |
Yeahp. GNU Emacs also says the file of bad records is UTF-16 when I open it. |
15:20 |
* Dyrcona |
wonders if pinesol has any dry kona in the coffee database. |
15:24 |
|
sleary joined #evergreen |
15:26 |
Dyrcona |
It seems odd to me that a program using MARC::Record->new_from_usmarc can read these records, modify them, and write them out without issue, but another program using the same Perl module blows up. I suspect the writing out leads to a bad length in the LDR. |
15:32 |
Dyrcona |
oh! |
15:33 |
Dyrcona |
The original records display just fine in GNU Emacs..... They get mangled going through Perl. |
15:34 |
Dyrcona |
I see the curly apostrophes and quotes, and Emacs says the coding system is multi-byte UTF-8. Something has a double encoding problem with these characters. |
15:35 |
Dyrcona |
I wonder if I'll even be able to fix this in a reasonable manner? |
15:36 |
Dyrcona |
Perl's Unicode support is so broken.... |
15:42 |
Dyrcona |
See... this is what I dislike about Unicode in Perl (particularly with MARC): one time I encode/decode the records and it works. Next time, the records come out garbled. Mebbe I should reread the Unicode FAQ and double check the MARC code to know what's really going on here. |
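[Editor's note: a Python illustration of the most common version of the round-trip failure described above; this is one plausible failure mode, not a diagnosis of the Perl code in question. Already-correct UTF-8 bytes get decoded as Latin-1 and re-encoded as UTF-8, doubling every multi-byte character into mojibake.]

```python
curly = "\u2019"                     # right single quotation mark
once = curly.encode("utf-8")         # correct encoding: 3 bytes
twice = once.decode("latin-1").encode("utf-8")  # double-encoded

assert once == b"\xe2\x80\x99"
assert twice == b"\xc3\xa2\xc2\x80\xc2\x99"     # 3 bytes became 6

# When no bytes were lost, recovery is the inverse round trip:
assert twice.decode("utf-8").encode("latin-1") == once
```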
16:33 |
pinesol |
News from commits: Docs: global flags docs fixes <https://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=3f7b48566d3f34e07c5b7ba5b27ed23d97abd4b4> |
16:44 |
|
jvwoolf left #evergreen |
17:00 |
|
mmorgan left #evergreen |