develooper Front page | perl.perl6.users | Postings from April 2020

Re: readchars, seek back, and readchars again

From: Joseph Brenner
Date: April 26, 2020 21:02
Subject: Re: readchars, seek back, and readchars again
Message ID: CAFfgvXV975vLvFY19wM_Y_AK9=runKzigMy0tgmvAmoHZS-mNA@mail.gmail.com

I decided to open an issue for this one.  Even if there's no practical
fix for the behavior of readchars, I'd think this odd meaning of the
"current" point in the file would need to be better documented:

  https://github.com/rakudo/rakudo/issues/3646

I simplified the test I've been using:

use v6;
use Test;

my $tmpdir = IO::Spec::Unix.tmpdir;

# ሀⶀ䷼ꪪⲤⲎ
my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";
my $unichar_file = "$tmpdir/six_unicode_chars.txt";
spurt $unichar_file, $unichar_str;

my $ascii_str =   "ABCDEFGHI";
my $ascii_file = "$tmpdir/nine_ascii_chars.txt";
spurt $ascii_file, $ascii_str;


{
    my $fh    = $ascii_file.IO.open;
    my $loc1  = $fh.tell;
    my $char_count = 3;
    my $str = readchars_no_advance($fh, $char_count);
    my $loc2  = $fh.tell;
    is( $loc1, $loc2,
        "Testing that readchars file position works as expected for ascii-range chars" );
}


{
    my $fh    = $unichar_file.IO.open;
    my $loc1  = $fh.tell;
    my $char_count = 3;
    my $str = readchars_no_advance($fh, $char_count);
    my $loc2  = $fh.tell;
    is( $loc1, $loc2,
        "Testing that readchars file position works as expected for unichars beyond ascii-range" );
}

# After a readchars call, this tries to return to the original position in the file
sub readchars_no_advance ($fh, $char_count) {
    my $str   = $fh.readchars($char_count);
    my $width = $str.encode('UTF-8').bytes;
    $fh.seek: -$width, SeekFromCurrent;
    return $str;
}
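
A variant that avoids the byte arithmetic, offered only as a minimal sketch: record the position with tell before reading, then seek back to that absolute offset with SeekFromBeginning. The sub name readchars_restore_position and the demo file name are hypothetical, and the sketch assumes that an absolute seek discards whatever the handle's decoder has buffered; I have not confirmed that this holds on every Rakudo version.

```raku
use v6;

# Hypothetical variant of readchars_no_advance: remember the absolute
# position before the read instead of computing a relative step back.
sub readchars_restore_position ($fh, $char_count) {
    my $start = $fh.tell;                  # byte offset before reading
    my $str   = $fh.readchars($char_count);
    $fh.seek: $start, SeekFromBeginning;   # assumption: this also resets decoder state
    return $str;
}

# Demo file name is made up for this example.
my $file = $*TMPDIR.add("restore_position_demo.txt");
spurt $file, "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]";

my $fh     = $file.open;
my $first  = readchars_restore_position($fh, 3);
my $second = readchars_restore_position($fh, 3);
say $first eq $second;   # True only if the position really was restored

$fh.close;
unlink $file;
```

The design choice here is to depend on a position taken before any characters were decoded, rather than on the post-read "current" point that this thread is about.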




On 4/24/20, Brad Gilbert <b2gills@gmail.com> wrote:
> In UTF-8, characters can be 1 to 4 bytes long.
>
> UTF-8 was designed so that 7-bit ASCII is a subset of it.
>
> Any 8-bit byte that has its most significant bit set cannot be ASCII.
> So multi-byte codepoints have the most significant bit set for all of the
> bytes.
> The first byte can tell you the number of bytes that follow it.
>
> That is how a single codepoint is stored.
>
> A character can be made of several codepoints.
>
>     "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
>     "é"
>
> So Rakudo has to read the next codepoint to make sure that it isn't a
> combining codepoint.
>
> It is probably faking up the reads to look right when reading ASCII, but
> failing to do that for wider codepoints.
>
> On Fri, Apr 24, 2020 at 1:34 PM Joseph Brenner <doomvox@gmail.com> wrote:
>
>> I thought that doing a readchars on a filehandle, seeking backwards
>> the width of the char in bytes and then doing another read
>> would always get the same character.  That works for ascii-range
>> characters (1-byte in utf-8 encoding) but not multi-byte "wide"
>> characters (commonly 3-bytes in utf-8).
>>
>> The question then, is why do I need a $nudge of 3 for wide chars, but
>> not ascii-range ones?
>>
>> use v6;
>> use Test;
>>
>> my $tmpdir = IO::Spec::Unix.tmpdir;
>> my $file = "$tmpdir/scratch_file.txt";
>> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";  # ሀⶀ䷼ꪪⲤⲎ
>> my $ascii_str =   "ABCDEFGHI";
>>
>> subtest {
>>     my $nudge = 3;
>>     test_read_and_read_again($unichar_str, $file, $nudge);
>> }, "Wide unicode chars: $unichar_str";
>>
>> subtest {
>>     my $nudge = 0;
>>     test_read_and_read_again($ascii_str, $file, $nudge);
>> }, "Ascii-range chars: $ascii_str";
>>
>> # write given string to file, then read the third character twice and check
>> sub test_read_and_read_again($str, $file, $nudge = 0) {
>>     spurt $file, $str;
>>     my $fh = $file.IO.open;
>>     $fh.readchars(2);  # skip a few
>>     my $chr_1 =      $fh.readchars(1);
>>     my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
>>     my $step_back = $width + $nudge;
>>     $fh.seek: -$step_back, SeekFromCurrent;
>>     my $chr_2 =      $fh.readchars(1);
>>     is( $chr_1, $chr_2,
>>         "read, seek back, and read again gets same char with nudge of $nudge" );
>> }
>>
>
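
To make the points quoted above about UTF-8 widths and combining codepoints concrete, here is a small sketch. The sample codepoints are arbitrary picks, not taken from the test files, and the variable name $e-acute is just for illustration. It prints the encoded bytes of a few single codepoints in binary, so the high bit on every byte of a multi-byte sequence is visible, and then compares the character, codepoint, and byte counts of the "e plus combining acute" string.

```raku
use v6;

# Byte width of single codepoints in UTF-8, with each byte shown in binary.
# Every byte of a multi-byte sequence has its most significant bit set.
for "A", "\x[E9]", "\x[1200]", "\x[1F600]" -> $ch {
    my $buf = $ch.encode('UTF-8');
    say sprintf('U+%04X', $ch.ord), " -> ", $buf.bytes, " byte(s): ",
        $buf.list.map({ sprintf('%08b', $_) }).join(' ');
}

# One character built from two codepoints, as in the quoted example.
my $e-acute = "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]";
say $e-acute.chars;                   # characters (graphemes) as Raku counts them
say $e-acute.NFD.elems;               # codepoints after canonical decomposition
say $e-acute.encode('UTF-8').bytes;   # bytes in the encoded form
```

Because a following codepoint may combine with the one just read, a reader that returns whole characters has to look at least one codepoint ahead before it can hand a character back, which is the lookahead described in the quoted reply.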
