Re: readchars, seek back, and readchars again

From: Joseph Brenner
Date: April 24, 2020 20:03
Subject: Re: readchars, seek back, and readchars again
Message ID: CAFfgvXXMjh8vSfF5XMXQnx2ceDSRrU4DYMSXrW0-szHsaTthxg@mail.gmail.com

Thanks, yes, I understand Unicode and UTF-8 reasonably well.

> So Rakudo has to read the next codepoint to make sure that it isn't a combining codepoint.

> It is probably faking up the reads to look right when reading ASCII, but failing to do that for wider codepoints.

I think it'd be the other way around... the idea here would be that
it's doing an extra readchars behind the scenes just in case there are
combining chars involved. So you're figuring there's some confusion
between the actual point in the file that's being read and the
abstraction that readchars is supplying?
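
As a rough illustration of the three different "lengths" in play here
(graphemes, codepoints, and bytes), the following sketch may help. The
sample strings and variable names are assumptions chosen for
illustration, not taken from the thread, and the comment about what
readchars counts reflects my reading of the docs rather than a tested
claim.

```raku
# Assumed sample strings; any text would do.
my $ascii    = "A";
my $wide     = "\x[1200]";                      # one of the wide chars from the test
my $combined = "e\c[COMBINING ACUTE ACCENT]";   # base letter plus combining mark

for $ascii, $wide, $combined -> $s {
    say join "\t",
        $s,
        "chars: "     ~ $s.chars,                  # graphemes (what readchars is documented to count)
        "NFD codes: " ~ $s.NFD.codes,              # codepoints after decomposition
        "bytes: "     ~ $s.encode('UTF-8').bytes;  # width of the UTF-8 encoding
}
```

The wide character is a single char that takes 3 bytes, while the
e-plus-accent pair is a single char that decomposes into two
codepoints. A reader cannot know a char is finished until it has seen
that the next codepoint is not a combining one.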


On 4/24/20, Brad Gilbert <b2gills@gmail.com> wrote:
> In UTF-8, characters can be 1 to 4 bytes long.
>
> UTF-8 was designed so that 7-bit ASCII is a subset of it.
>
> Any 8-bit byte that has its most significant bit set cannot be ASCII.
> So multi-byte codepoints have the most significant bit set for all of the
> bytes.
> The first byte can tell you the number of bytes that follow it.
>
> That is how a single codepoint is stored.
>
> A character can be made of several codepoints.
>
>     "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
>     "é"
>
> So Rakudo has to read the next codepoint to make sure that it isn't a
> combining codepoint.
>
> It is probably faking up the reads to look right when reading ASCII, but
> failing to do that for wider codepoints.
>
> On Fri, Apr 24, 2020 at 1:34 PM Joseph Brenner <doomvox@gmail.com> wrote:
>
>> I thought that doing a readchars on a filehandle, seeking backwards
>> the width of the char in bytes and then doing another read
>> would always get the same character.  That works for ascii-range
>> characters (1-byte in utf-8 encoding) but not multi-byte "wide"
>> characters (commonly 3-bytes in utf-8).
>>
>> The question then, is why do I need a $nudge of 3 for wide chars, but
>> not ascii-range ones?
>>
>> use v6;
>> use Test;
>>
>> my $tmpdir = IO::Spec::Unix.tmpdir;
>> my $file = "$tmpdir/scratch_file.txt";
>> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";  # ሀⶀ䷼ꪪⲤⲎ
>> my $ascii_str =   "ABCDEFGHI";
>>
>> subtest {
>>     my $nudge = 3;
>>     test_read_and_read_again($unichar_str, $file, $nudge);
>> }, "Wide unicode chars: $unichar_str";
>>
>> subtest {
>>     my $nudge = 0;
>>     test_read_and_read_again($ascii_str, $file, $nudge);
>> }, "Ascii-range chars: $ascii_str";
>>
>> # write given string to file, then read the third character twice and check
>> sub test_read_and_read_again($str, $file, $nudge = 0) {
>>     spurt $file, $str;
>>     my $fh = $file.IO.open;
>>     $fh.readchars(2);  # skip a few
>>     my $chr_1 =      $fh.readchars(1);
>>     my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
>>     my $step_back = $width + $nudge;
>>     $fh.seek: -$step_back, SeekFromCurrent;
>>     my $chr_2 =      $fh.readchars(1);
>>     is( $chr_1, $chr_2,
>>         "read, seek back, and read again gets same char with nudge of $nudge" );
>> }
>>
>
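
One way to sidestep the question of what the decoder has buffered is to
do the read-seek-read dance at the byte level. The sketch below opens
the file in binary mode and decodes one codepoint at a time, using the
lead byte to find its width as described above. The helper names
utf8-width and read-codepoint and the scratch file name are
hypothetical, and this is an alternative approach rather than an
explanation of what readchars does internally. It reads codepoints,
not graphemes, so a combining mark would come back as a separate item.

```raku
# Hypothetical helper: UTF-8 sequence length implied by a lead byte.
sub utf8-width(Int $lead --> Int) {
    return 1 if ($lead +& 0x80) == 0x00;   # 0xxxxxxx
    return 2 if ($lead +& 0xE0) == 0xC0;   # 110xxxxx
    return 3 if ($lead +& 0xF0) == 0xE0;   # 1110xxxx
    return 4 if ($lead +& 0xF8) == 0xF0;   # 11110xxx
    die "not a UTF-8 lead byte: $lead";
}

# Hypothetical helper: read one codepoint from a handle opened with :bin.
sub read-codepoint(IO::Handle:D $fh --> Str) {
    my $width = utf8-width($fh.read(1)[0]);   # peek at the lead byte
    $fh.seek(-1, SeekFromCurrent);            # step back over it
    return $fh.read($width).decode('utf-8');  # read the whole sequence
}

my $file = $*TMPDIR.add('scratch_bin.txt');   # assumed scratch path
spurt $file, "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";

my $fh = $file.open(:bin);                    # binary mode: no decoder in the way
read-codepoint($fh) for ^2;                   # skip the first two characters
my $chr_1 = read-codepoint($fh);
$fh.seek(-$chr_1.encode('UTF-8').bytes, SeekFromCurrent);   # no nudge
my $chr_2 = read-codepoint($fh);
$fh.close;

say $chr_1 eq $chr_2 ?? "same char: $chr_1" !! "mismatch: $chr_1 vs $chr_2";
```

Because a binary-mode handle returns exactly the bytes asked for, the
seek offset here is just the encoded width of the character.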
