I've been converting MARC XML records into USMARC and recently hit a batch of bad records that MarcEdit reported as having invalid leaders. After a few days of puzzling over this and blaming it all on Unicode, I noticed that every one of them contained newlines (0D 0A) in its datafields. As far as I know, newlines aren't illegal in USMARC, but when I replaced them with spaces, the problem disappeared.
Test record:
<?xml version="1.0" encoding="utf-8"?>
<collection xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://www.loc.gov/MARC21/slim">
<record>
<leader>06965nam a2202005 u 4500</leader>
<datafield tag="245" ind1="0" ind2="0">
<subfield code="a">Theoretical and Technological
Aspects of Crystal Growth</subfield>
</datafield>
</record>
</collection>
(If your mail viewer mangles lines: there's a hard return (0D 0A) after the word "Technological" in the 245.)
Here is my test program, which illustrates the problem:
use MARC::Batch;
use MARC::File::XML (BinaryEncoding => 'utf8', RecordFormat => 'UNIMARC');
use strict 'vars';
open (MARCOUT, ">test_out.marc") or die "Couldn't open test_out.marc for writing: $!\n";
binmode(MARCOUT, ':utf8');
my $batch = new MARC::Batch ('XML', 'test.xml');
my $record = $batch->next;
print MARCOUT $record->as_usmarc;
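For anyone who wants to confirm what MarcEdit is objecting to, one thing that can be checked directly is whether the record length declared in leader positions 00-04 agrees with the number of bytes that actually ended up in the file. The following is only a rough sketch: `leader_length_matches` is a made-up helper name, not part of MARC::Record, and it assumes the output file holds exactly one record, as in the test above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MARC::Record): returns 1 if the record
# length declared in leader positions 00-04 equals the real byte length.
sub leader_length_matches {
    my ($bytes) = @_;
    return 0 if length($bytes) < 24;              # shorter than a leader
    my $declared = substr($bytes, 0, 5);
    return 0 unless $declared =~ /\A[0-9]{5}\z/;  # must be five digits
    return $declared == length($bytes) ? 1 : 0;
}

# Usage: perl check_leader.pl test_out.marc
# Assumes the file holds exactly one record.
if (@ARGV) {
    # Read raw bytes so that length() counts bytes, not characters.
    open(my $in, '<:raw', $ARGV[0]) or die "Couldn't open $ARGV[0]: $!\n";
    my $bytes = do { local $/; <$in> };
    close($in);
    printf "declared %s, actual %d: %s\n",
        substr($bytes, 0, 5), length($bytes),
        leader_length_matches($bytes) ? 'match' : 'MISMATCH';
}
```

A mismatch here would at least show that the bytes on disk differ from what the leader was computed for, which narrows down where to look.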
As I said, I don't think newlines are illegal in USMARC, so I rather suspect the problem is somewhere in MARC::Record. I took the easier route, though, and replaced them with spaces in MARC::File::SAX, which solves the problem:
sub characters {
    my ( $self, $chars ) = @_;
    if (
        ( exists $self->{ subcode } && $self->{ subcode } ne '')
        || ( $self->{ tag } && ( $self->{ tag } eq 'LDR' || $self->{ tag } < 10 ))
    ) {
        $self->{ chars } .= $chars->{ Data };
        ## Added by me, 1/11/2011
        $self->{ chars } =~ s/\n/ /g;
        $self->{ chars } =~ s/ {2,}/ /g;
    }
}
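Pulled out of the SAX handler, the two added substitutions amount to the following. This is a standalone sketch for illustration only: `collapse_newlines` is a name invented here and is not part of MARC::File::SAX. It matches only `\n`, on the assumption that the XML parser has already normalized CRLF to a single LF before handing the text to the handler, as the XML specification requires.

```perl
use strict;
use warnings;

# Hypothetical standalone version of the two substitutions added above.
sub collapse_newlines {
    my ($text) = @_;
    $text =~ s/\n/ /g;      # turn each newline into a space
    $text =~ s/ {2,}/ /g;   # squeeze runs of spaces down to one
    return $text;
}

print collapse_newlines("Theoretical and Technological\nAspects of Crystal Growth"), "\n";
# prints: Theoretical and Technological Aspects of Crystal Growth
```

Note that the second substitution also collapses runs of spaces that were in the data before any newline was replaced, so the original spacing inside a field is not fully preserved.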
So is this a bug that can be officially fixed, or am I overlooking something?
ActiveState Perl 5.10, MARC::Record v. 2.0.3, MARC::File::XML v. 0.93
Arvin