[perl #40432] LWP and Unicode

From: Dale Gerdemann
Date: September 29, 2006 13:48
Subject: [perl #40432] LWP and Unicode
Message ID: rt-3.5.HEAD-31257-1159528014-1658.40432-75-0@perl.org

# New Ticket Created by  Dale Gerdemann 
# Please include the string:  [perl #40432]
# in the subject line of all future correspondence about this issue. 
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=40432 >



This is a bug report for perl from dg@tomita.sfs.uni-tuebingen.de,
generated with the help of perlbug 1.35 running under perl v5.8.8.


-----------------------------------------------------------------
[Please enter your report here]

The following appears to be a bug, but maybe I missed some docs.

With LWP, it doesn't appear to be possible to use a URL containing
Unicode (encoded or otherwise). Consider the four possibilities:

1. With 'use utf8', Unicode in URL
2. With 'use utf8', Percent encoded Unicode in URL
3. Without 'use utf8', Unicode in URL
4. Without 'use utf8', Percent encoded Unicode in URL

Using warnings and diagnostics, I get the following results for:

my $browser = LWP::UserAgent->new;
my $response = $browser->get($url);

1. Program exits with the following message:


Use of uninitialized value in substitution iterator at
        /afs/sfs/lehre/dg/myperl/lib/URI.pm line 76 (#1)
    (W uninitialized) An undefined value was used as if it were already
    defined.  It was interpreted as a "" or a 0, but maybe it was a mistake.
    To suppress this warning assign a defined value to your variables.

    To help you figure out what was undefined, perl tells you what operation
    you used the undefined value in.  Note, however, that perl optimizes your
    program and the operation displayed in the warning may not necessarily
    appear literally in your program.  For example, "that $foo" is
    usually optimized into "that " . $foo, and the warning will refer to
    the concatenation (.) operator, even though there is no . in your
    program.

2. Issues the following warning, but otherwise sort of succeeds:

Parsing of undecoded UTF-8 will give garbage when decoding entities at /afs/sfs/lehre/dg/myperl/lib/LWP/Protocol.pm line 114.

The returned content is, however, a byte sequence.

3. This works well, except for the same warning message as in the
second test. The content is a character sequence.

4. Works exactly as the third test. I think this means that the third
and fourth tests are really the same. The URL in the third test
appears to have Unicode, but it's really just a byte sequence that
gets percent encoded somewhere along in the process.

But why should either test 3 or 4 produce a warning concerning UTF-8?
There is no UTF-8 in these tests. I don't have 'use utf8' or 'use
encoding' or even 'perl -CWhatever'.
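
To illustrate the byte/character distinction behind tests 3 and 4, here is a
minimal sketch using only core modules. The variable names are made up for
the example; the two bytes are the UTF-8 encoding of the first letter of the
URL path (U+0423), written as escapes so the snippet does not depend on the
source file encoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# What perl sees for a Cyrillic letter in source code without 'use utf8':
# two separate bytes, not one character.
my $bytes = "\xD0\xA3";

# Decoding turns the same data into a one-character string.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length %d, utf8 flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
printf "chars: length %d, utf8 flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
```

This only shows how the two kinds of string differ; it does not explain
where the LWP warning itself comes from.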

Here are the two URL strings that I used:


http://bg.wiktionary.org/wiki/Уикиречник:Български/Типове_думи/Глаголи

http://bg.wiktionary.org/wiki/%D0%A3%D0%B8%D0%BA%D0%B8%D1%80%D0%B5%D1%87%D0%BD%D0%B8%D0%BA:%D0%91%D1%8A%D0%BB%D0%B3%D0%B0%D1%80%D1%81%D0%BA%D0%B8/%D0%A2%D0%B8%D0%BF%D0%BE%D0%B2%D0%B5_%D0%B4%D1%83%D0%BC%D0%B8/%D0%93%D0%BB%D0%B0%D0%B3%D0%BE%D0%BB%D0%B8

The second URL is just the percent encoding of the utf-8 encoding of
the first. [you can maybe see why the average user would like to use
the first "URL" and have percent-encoding in a hidden layer]
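
A rough sketch of what such a hidden layer might look like, using only the
core Encode module. The helper name percent_encode_utf8 is hypothetical (it
is not part of LWP or URI), and the set of characters left unescaped is an
assumption chosen so that the result matches the second URL above.

```perl
use strict;
use warnings;
use utf8;                 # the string literal below contains Cyrillic characters
use Encode qw(encode);

# Hypothetical helper: encode the text as UTF-8, then percent-encode every
# byte that is not an ASCII letter, digit, or one of - . _ ~ : /
sub percent_encode_utf8 {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    $bytes =~ s/([^A-Za-z0-9\-._~:\/])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

my $path = percent_encode_utf8('/wiki/Уикиречник:Български/Типове_думи/Глаголи');
my $url  = 'http://bg.wiktionary.org' . $path;
print "$url\n";
```

The resulting $url is plain ASCII, so it could be passed to $browser->get
in the same way as the percent-encoded URL in the second and fourth tests.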


Finally, change the URL to http://www.cnn.com and all problems go
away. It doesn't matter whether or not you have 'use utf8'.


[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=library
    severity=medium
---
Site configuration information for perl v5.8.8:

Configured by dg at Wed Sep  6 16:13:32 CEST 2006.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=linux, osvers=2.6.8-2-686-smp, archname=i686-linux
    uname='linux tomita 2.6.8-2-686-smp #1 smp tue aug 16 12:08:30 utc 2005 i686 gnulinux '
    config_args=''
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='3.3.5 (Debian 1:3.3.5-13)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.3.2.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.3.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    

---
@INC for perl v5.8.8:
    /afs/sfs/lehre/dg/myperl/lib/i686-linux
    /afs/sfs/lehre/dg/myperl/lib
    /afs/sfs/lehre/dg/perl-5.8.8/lib/5.8.8/i686-linux
    /afs/sfs/lehre/dg/perl-5.8.8/lib/5.8.8
    /afs/sfs/lehre/dg/perl-5.8.8/lib/site_perl/5.8.8/i686-linux
    /afs/sfs/lehre/dg/perl-5.8.8/lib/site_perl/5.8.8
    /afs/sfs/lehre/dg/perl-5.8.8/lib/site_perl
    .

---
Environment for perl v5.8.8:
    HOME=/home/gerdemann
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/afs/sfs/lehre/dg/perl-5.8.8/bin:/afs/sfs/lehre/dg/perl-5.8.8/scripts:/home/gerdemann/source/MoMo:/home/milca/a4/bin:/afs/sfs/lehre/dg/fsm-4.0/bin:/usr/ucb:/usr/bin:/bin:/afs/sfs/i386_linux24/sicstus312/bin/sicstus:/afs/sfs/i386_linux24/OOo110/OpenOffice.org1.1.0/program:/usr/local/bin:/usr/local/tex/bin://usr/X11R6/bin:/home/sfb/cl_systems/daVinci_V2.0:/home/gerdemann/Office51/bin:/afs/sfs/lehre/dg/bin:/home/gerdemann/scripts:/home/gerdemann/mg/bin:/home/gerdemann/bin:/afs/sfs/lehre/dg/xerox:.
    PERL5LIB=/afs/sfs/lehre/dg/myperl/lib
    PERL_BADLANG (unset)
    SHELL=/bin/bash



