On Friday 17 February 2017 15:36:09 H.Merijn Brand wrote:
> On Fri, 17 Feb 2017 11:29:16 +0100, pali@cpan.org wrote:
>
> > Hi!
> >
> > In more perl modules and perl code I see incorrect usage of
> > utf8::is_utf8(). Most common incorrect pattern is found in modules is:
> >
> >   use utf8;
> >
> >   my $value = func();
> >   if (utf8::is_utf8($value)) {
> >       utf8::encode($value);
> >   }
> >
> > As utf8::is_utf8() does not tell if value is already encoded in utf8
> > (and in perl it is not possible to detect it) such code is wrong. In
> > case func() returns string which is internally stored as Latin1 nothing
> > happen. But when is internally stored as UTF8 then string is converted
> > to UTF-8 octets. Which means such code pattern encode string to UTF-8
> > octets based on internal perl flag which does not make any sense for in
> > such condition.
> >
> > Maybe corrected pattern could be (probably under eval to handle errors):
> >
> >   my $value = func();
> >   if (utf8::is_utf8($value)) {
> >       utf8::downgrade($value);
> >   }
> >
> > Which at least does not modify content of $value. Operator 'eq' on
> > $value is same despite if condition was true or false.
> >
> > As first pattern in more common I would propose to rename function
> > utf8::is_utf8() to some better name, e.g. utf8::is_upgraded() which does
> > not say anything about UTF-8 encoding.
>
> That *will* break a lot of code that uses the function as it is
> supposed to be used.

The only reason to use the utf8::is_utf8() function is probably when you
need to deal with a broken XS module. In pure Perl code the internal
storage of a string is irrelevant and fully invisible: the strings
"\x{A0}" and "\N{U+A0}" are the same. But utf8::is_utf8() is used in
other places too, which means that such code is, with high probability,
not correct.

> You cannot see from the parser if the code that uses this function is
> using it right or wrong.

Yes, I know. I did not mean to introduce some "heuristic" which would
"rewrite" one code pattern into another.
Such a thing will never work...

> > And ideally deprecate utf8::is_utf8() function or at least start
> > throwing warning when is used as most usage of utf8::is_utf8() is
> > incorrect.
> >
> > What do you think about it?
>
> Don't rename this function, even if the purpose is debatable

Maybe a better description of what I mean: introduce a new function
utf8::is_upgraded() which would be a copy of utf8::is_utf8(), and then
start a discussion about either deprecating utf8::is_utf8() or throwing
a warning when utf8::is_utf8() is used. That should not break any
existing code, and new people who do know what utf8::is_utf8() is doing
will stop using it in new code.

The idea is that people will stop using the utf8::is_utf8() function,
which has a really bad name.
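
To make the argument concrete, here is a minimal sketch (not part of the original message) of how the two patterns discussed above behave on two strings that compare equal but are stored differently. The helper names encode_if_flagged() and downgrade_if_flagged() are hypothetical, chosen only for this illustration. utf8::upgrade() stands in for whatever func() might do internally, and an ASCII platform is assumed for the byte values.

```perl
use strict;
use warnings;

# Two copies of the same one-character string (U+00A0).
my $native   = "\x{A0}";
my $upgraded = "\x{A0}";
utf8::upgrade($upgraded);    # changes only the internal storage

# Same content, different internal flag.
printf "equal: %s\n", ($native eq $upgraded ? "yes" : "no");
printf "is_utf8: native=%d upgraded=%d\n",
    (utf8::is_utf8($native) ? 1 : 0), (utf8::is_utf8($upgraded) ? 1 : 0);

# Hypothetical helper: the pattern criticised in the thread.
# Whether the value gets encoded depends on the internal flag alone.
sub encode_if_flagged {
    my ($value) = @_;
    utf8::encode($value) if utf8::is_utf8($value);
    return $value;
}

# Hypothetical helper: the proposed alternative.
# It changes the storage but never the content of the string.
sub downgrade_if_flagged {
    my ($value) = @_;
    # Second argument true: return false instead of dying on characters above 0xFF.
    utf8::downgrade($value, 1) if utf8::is_utf8($value);
    return $value;
}

# Equal inputs produce different outputs with the first pattern.
my $bytes_native   = encode_if_flagged($native);
my $bytes_upgraded = encode_if_flagged($upgraded);
printf "encode_if_flagged lengths: %d vs %d\n",
    length($bytes_native), length($bytes_upgraded);

# Equal inputs stay equal with the second pattern.
my $same = downgrade_if_flagged($native) eq downgrade_if_flagged($upgraded);
printf "downgrade_if_flagged keeps equality: %s\n", ($same ? "yes" : "no");
```

With the first helper, two strings for which 'eq' is true end up as one octet and two octets respectively, which is the inconsistency described above. The second helper leaves 'eq' unaffected whether or not the condition was true.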