Front page | perl.perl6.users |
Postings from April 2020
Re: readchars, seek back, and readchars again
From: Brad Gilbert
Date: April 24, 2020 19:24
Subject: Re: readchars, seek back, and readchars again
Message ID: CAD2L-T1z=iF0-XQ7dKNcHMoFbspTB_Zm+nL0ZYxwc_Qp90cJow@mail.gmail.com
In UTF-8, a codepoint is encoded in 1 to 4 bytes.
UTF-8 was designed so that 7-bit ASCII is a subset of it.
Any 8-bit byte that has its most significant bit set cannot be ASCII,
so a multi-byte codepoint has the most significant bit set in every one of
its bytes.
The first byte tells you how many bytes follow it.
That is how a single codepoint is stored.
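To make the "first byte tells you the length" point concrete, here is a minimal sketch. It is written in Python rather than Raku purely for illustration, and the helper name utf8_sequence_length is made up for this example. It only inspects the lead byte; it does not validate the continuation bytes.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the total length in bytes of a UTF-8 sequence, given its first byte.

    Hypothetical helper for illustration only.
    """
    if (lead_byte & 0b1000_0000) == 0b0000_0000:  # 0xxxxxxx: 7-bit ASCII
        return 1
    if (lead_byte & 0b1110_0000) == 0b1100_0000:  # 110xxxxx: 2-byte sequence
        return 2
    if (lead_byte & 0b1111_0000) == 0b1110_0000:  # 1110xxxx: 3-byte sequence
        return 3
    if (lead_byte & 0b1111_1000) == 0b1111_0000:  # 11110xxx: 4-byte sequence
        return 4
    # 10xxxxxx is a continuation byte; anything else is not a valid lead byte.
    raise ValueError(f"not a UTF-8 lead byte: {lead_byte:#04x}")


# Check the helper against Python's own encoder for 1-, 2-, 3- and 4-byte codepoints.
for ch in ["A", "\u00e9", "\u1200", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    assert utf8_sequence_length(encoded[0]) == len(encoded)
    # Every byte of a multi-byte sequence has its most significant bit set.
    if len(encoded) > 1:
        assert all(b & 0x80 for b in encoded)
```

The wide characters in the quoted test script (U+1200 and friends) all fall in the 3-byte case.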
A character can be made of several codepoints. For example,
"\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
is the single character "é".
So before Rakudo can hand you a character, it has to read the next codepoint
to make sure that it isn't a combining codepoint.
It is probably compensating for that read-ahead so the reads look right when
reading ASCII, but failing to do that for wider codepoints.
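The codepoint-versus-character distinction above can be checked with a small sketch. It uses Python's standard unicodedata module, for illustration only. Note that Python's len() counts codepoints, so the two-codepoint form has length 2 here, even though it displays as one character.

```python
import unicodedata

# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT: two codepoints.
decomposed = "e\u0301"
# NFC normalization composes the pair into the single codepoint U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))                  # 2 codepoints
print(len(composed))                    # 1 codepoint
print(len(decomposed.encode("utf-8")))  # 3 bytes (1 for "e", 2 for the accent)
print(len(composed.encode("utf-8")))    # 2 bytes

# A nonzero canonical combining class marks a codepoint that attaches to the
# one before it. A reader cannot know a character is complete until it has
# looked at the next codepoint and found that it is not one of these.
print(unicodedata.combining("\u0301"))  # 230
print(unicodedata.combining("e"))       # 0
```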
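The read-ahead idea can be modelled with a toy reader. This is a sketch in Python under stated assumptions, not Rakudo's actual decoder: ToyCharReader and its methods are hypothetical names, the reader always consumes one codepoint past the character it returns, and it treats only codepoints with a nonzero combining class as combiners. Under those assumptions the underlying byte position ends up one codepoint further along than the caller expects.

```python
import io
import unicodedata


def utf8_sequence_length(lead: int) -> int:
    """Byte length of a UTF-8 sequence from its lead byte (hypothetical helper)."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


class ToyCharReader:
    """Toy model: reads characters, always looking one codepoint ahead."""

    def __init__(self, data: bytes):
        self.stream = io.BytesIO(data)
        self.pending = ""  # codepoint already consumed from the stream, not yet returned

    def _next_codepoint(self) -> str:
        if self.pending:
            cp, self.pending = self.pending, ""
            return cp
        lead = self.stream.read(1)
        if not lead:
            return ""  # end of data
        rest = self.stream.read(utf8_sequence_length(lead[0]) - 1)
        return (lead + rest).decode("utf-8")

    def readchars(self, count: int) -> str:
        out = []
        for _ in range(count):
            char = self._next_codepoint()
            if not char:
                break
            nxt = self._next_codepoint()
            while nxt and unicodedata.combining(nxt):
                char += nxt  # combining codepoint belongs to the current character
                nxt = self._next_codepoint()
            self.pending = nxt  # the look-ahead has already moved the byte position
            out.append(char)
        return "".join(out)

    def seek_back(self, nbytes: int) -> None:
        self.pending = ""  # a seek discards the buffered look-ahead
        self.stream.seek(-nbytes, io.SEEK_CUR)


wide = "\u1200\u2D80\u4DFC\uAAAA\u2CA4\u2C8E".encode("utf-8")  # six 3-byte characters

reader = ToyCharReader(wide)
reader.readchars(2)                   # skip two characters
third = reader.readchars(1)           # third character, bytes 6..8
position = reader.stream.tell()       # 12, not 9: the fourth character was read ahead
reader.seek_back(3)                   # back by the character's width only
after_plain_seek = reader.readchars(1)

reader = ToyCharReader(wide)
reader.readchars(2)
reader.readchars(1)
reader.seek_back(3 + 3)               # width plus a nudge of 3
after_nudged_seek = reader.readchars(1)

print(position, after_plain_seek == third, after_nudged_seek == third)
```

In this model the plain seek lands on the fourth character and the nudged seek lands on the third, which matches the nudge of 3 reported for the wide characters. The model does not reproduce the ASCII case (it would need a nudge of 1 there, not 0), which fits the guess that Rakudo handles the read-ahead differently for ASCII.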
On Fri, Apr 24, 2020 at 1:34 PM Joseph Brenner <doomvox@gmail.com> wrote:
> I thought that doing a readchars on a filehandle, seeking backwards
> the width of the char in bytes and then doing another read
> would always get the same character. That works for ascii-range
> characters (1-byte in utf-8 encoding) but not multi-byte "wide"
> characters (commonly 3-bytes in utf-8).
>
> The question then, is why do I need a $nudge of 3 for wide chars, but
> not ascii-range ones?
>
> use v6;
> use Test;
>
> my $tmpdir = IO::Spec::Unix.tmpdir;
> my $file = "$tmpdir/scratch_file.txt";
> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]"; # ሀⶀ䷼ꪪⲤⲎ
> my $ascii_str = "ABCDEFGHI";
>
> subtest {
>     my $nudge = 3;
>     test_read_and_read_again($unichar_str, $file, $nudge);
> }, "Wide unicode chars: $unichar_str";
>
> subtest {
>     my $nudge = 0;
>     test_read_and_read_again($ascii_str, $file, $nudge);
> }, "Ascii-range chars: $ascii_str";
>
> # write given string to file, then read the third character twice and check
> sub test_read_and_read_again($str, $file, $nudge = 0) {
>     spurt $file, $str;
>     my $fh = $file.IO.open;
>     $fh.readchars(2); # skip a few
>     my $chr_1 = $fh.readchars(1);
>     my $width = $chr_1.encode('UTF-8').bytes; # for our purposes, always 1 or 3
>     my $step_back = $width + $nudge;
>     $fh.seek: -$step_back, SeekFromCurrent;
>     my $chr_2 = $fh.readchars(1);
>     is( $chr_1, $chr_2,
>         "read, seek back, and read again gets same char with nudge of $nudge" );
> }
>