
MARC::Record / MARC::File::XML bug when fields contain newlines?

From: arvinporthog@lycos.com
Date: January 11, 2012 13:35
Subject: MARC::Record / MARC::File::XML bug when fields contain newlines?
Message ID: 1066190041.208800.1326317744146.JavaMail.mail@webmail17

I've been converting MARC XML records into USMARC and recently hit a slew of bad records which MarcEdit reported as having invalid leaders. After a few days of puzzling over this and blaming it all on Unicode, I noticed that every affected record contained newlines (0D 0A) in its datafields. As far as I know newlines aren't illegal in USMARC, but when I replaced them with spaces, sure enough, the problem disappeared.

Test record:

<?xml version="1.0" encoding="utf-8"?>
<collection xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <leader>06965nam a2202005 u 4500</leader>
    <datafield tag="245" ind1="0" ind2="0">
      <subfield code="a">Theoretical and Technological 
Aspects of Crystal Growth</subfield>
    </datafield>
  </record>
</collection>

(If your mail viewer mangles lines: there is a hard return (0D 0A) after the word "Technological" in the 245.)
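
For anyone who wants to check their own files for the same condition, here is a rough sketch using only core Perl. The helper name subfields_with_newlines is hypothetical, the second subfield in the sample is made up for contrast, and the regex scan is a shortcut for a quick check, not a substitute for a real XML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of every <subfield> element whose
# content contains a CR or LF. Regex scanning is only a quick check.
sub subfields_with_newlines {
    my ($xml) = @_;
    my @hits;
    while ( $xml =~ m{<subfield\b[^>]*>(.*?)</subfield>}sg ) {
        my $text = $1;
        push @hits, $text if $text =~ /[\r\n]/;
    }
    return @hits;
}

# Sample modelled on the test record; subfield "c" is invented for contrast.
my $xml = join '',
    qq{<record>},
    qq{<datafield tag="245" ind1="0" ind2="0">},
    qq{<subfield code="a">Theoretical and Technological \r\n},
    qq{Aspects of Crystal Growth</subfield>},
    qq{<subfield code="c">no line break here</subfield>},
    qq{</datafield></record>};

for my $text ( subfields_with_newlines($xml) ) {
    # Make the control characters visible before printing.
    ( my $visible = $text ) =~ s/\r/\\r/g;
    $visible =~ s/\n/\\n/g;
    print "embedded newline: $visible\n";
}
```

To scan a real file, the contents would be read into a string first and passed to the same helper.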

Here is my test program which illustrates the problem:

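use strict;
use warnings;
use MARC::Batch;
use MARC::File::XML (BinaryEncoding => 'utf8', RecordFormat => 'UNIMARC');

# Output file for the converted record
open(my $out, '>', 'test_out.marc')
    or die "Couldn't open test_out.marc for writing: $!\n";
binmode($out, ':utf8');

# Read the first record from the MARCXML file and write it out as USMARC
my $batch  = MARC::Batch->new('XML', 'test.xml');
my $record = $batch->next;
print {$out} $record->as_usmarc;
close($out) or die "Couldn't close test_out.marc: $!\n";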
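
One thing that may be worth ruling out, offered as an assumption and not as a diagnosis: the first five characters of a MARC leader hold the record length in bytes, and a text-mode filehandle on Windows normally carries a :crlf layer that rewrites each \n as \r\n on output. If that layer is still active after the :utf8 binmode call in the test program, every embedded newline would add one byte that the leader does not account for. The sketch below uses only core Perl and a made-up stand-in string, not a real record, to show the size mismatch.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for serialized record data with one embedded newline.
# This is NOT a real MARC record; only the byte counts matter here.
my $data = "Theoretical and Technological \nAspects of Crystal Growth";

# Length as counted before writing (what a leader would record).
my $counted = length $data;

# Temporary file, removed automatically at exit.
my ( $tmp_fh, $path ) = tempfile( UNLINK => 1 );
close $tmp_fh;

# Write through an explicit :crlf layer, as a text-mode handle on Windows would.
open my $out, '>:crlf', $path or die "Cannot write $path: $!";
print {$out} $data;
close $out or die "Cannot close $path: $!";
my $crlf_size = -s $path;

# Write the same data again with no translation layer, for comparison.
open $out, '>:raw', $path or die "Cannot write $path: $!";
print {$out} $data;
close $out or die "Cannot close $path: $!";
my $raw_size = -s $path;

printf "counted=%d raw=%d crlf=%d\n", $counted, $raw_size, $crlf_size;
```

If the output file on the affected system is larger than the sum of the leader lengths by one byte per embedded newline, opening the output with a :raw layer before printing would be the thing to try. Whether that is what is happening here is untested.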

As I said, I don't think newlines are illegal in USMARC, so I rather suspect the problem is somewhere in MARC::Record. I took the easier route, though, and replaced them with spaces in the characters handler of MARC::File::SAX, which solves the problem:

sub characters {
    my ( $self, $chars ) = @_;
    if (
        ( exists $self->{ subcode } && $self->{ subcode } ne '')
        || ( $self->{ tag } && ( $self->{ tag } eq 'LDR' || $self->{ tag } < 10 ))
    ) { 
        $self->{ chars } .= $chars->{ Data };
        
        ## Added by me, 1/11/2011
        $self->{ chars } =~ s/\n/ /g;
        $self->{ chars } =~ s/ {2,}/ /g;
    } 
}
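
The same substitution can be tried on plain strings before touching the module. This is a minimal sketch, assuming a hypothetical helper name collapse_newlines; unlike the patch above, it also treats CR and CRLF as line breaks, which is an addition for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the two substitutions in the patch,
# extended to treat CRLF and a lone CR the same as LF.
sub collapse_newlines {
    my ($text) = @_;
    $text =~ s/\r\n?|\n/ /g;    # CRLF, lone CR, or LF becomes one space
    $text =~ s/ {2,}/ /g;       # squeeze runs of spaces down to one
    return $text;
}

my $title = "Theoretical and Technological \r\nAspects of Crystal Growth";
print collapse_newlines($title), "\n";
```

Squeezing runs of spaces is only safe for variable-length subfield data. The leader and fixed-length control fields such as the 008 depend on exact character positions, so a substitution like this should not be applied to them.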

So is this a bug that can be officially fixed or am I overlooking something?

ActiveState perl 5.10, MARC::Record v.2.0.3, MARC::File::XML v. 0.93

Arvin
