Perl's default string encoding and representation

Question

In the following:

my $string = "Can you \x{FB01}nd my r\x{E9}sum\x{E9}?\n";

The \x{FB01} and \x{E9} are code points. And code points are encoded via an encoding scheme to a series of octets.
So the character ﬁ, which has the code point \x{FB01}, is part of $string. But how does this work? Are all the characters in this sentence (including the ASCII ones) encoded via UTF-8?
If so, why do I get the following behavior?

use Encode;

my $str = "Some arbitrary string\n";

if (Encode::is_utf8($str)) {
    print "YES str IS UTF8!\n";
}
else {
    print "NO str IT IS NOT UTF8\n";
}

This prints "NO str IT IS NOT UTF8\n"
Additionally Encode::is_utf8($string) returns true.
In what way are $string and $str different and one is considered UTF-8 and the other not?
And in any case what is the encoding of $str? ASCII? Is this the default for Perl?

Perl doesn’t keep things in an encoding. Its strings are always decoded. Only undecoded strings might be in some encoding. — tchrist, Jun 20 '13 at 21:40

ikegami · score 8 · Answer 1 · 2018-10-12T10:46:23.077

In C, a string is a collection of octets, but Perl has two string storage formats:

String of 8-bit values.
String of 72-bit values. (In practice, limited to 32-bit or 64-bit.)

As such, you don't need to encode code points to store them in a string.

my $s = "\x{2660}\x{2661}";
say length $s;                            # 2
say sprintf '%X', ord substr($s, 0, 1);   # 2660
say sprintf '%X', ord substr($s, 1, 1);   # 2661

(Internally, an extension of UTF-8 called "utf8" is used to store the strings of 72-bit chars. That's not something you should ever have to know except to realize the performance implications, but there are bugs that expose this fact.)

Encode's is_utf8 reports which type of string a scalar contains. It's a function that serves absolutely no use except to debug the bugs I previously mentioned.

An 8-bit string can store the value of "abc" (or the string in the OP's $str), so Perl used the more efficient 8-bit (UTF8=0) string format.
An 8-bit string can't store the value of "\x{2660}\x{2661}" (or the string in the OP's $string), so Perl used the 72-bit (UTF8=1) string format.
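
As a minimal sketch of the two cases above, the following checks the flag on both kinds of string. The variable names are illustrative only, and the flag is printed purely for inspection; ordinary code should not branch on it.

```perl
use strict;
use warnings;
use Encode ();

my $narrow = "abc";                  # every character fits in 8 bits
my $wide   = "\x{2660}\x{2661}";     # both characters are above 255

# Encode::is_utf8 reports the storage format, not the "encoding" of the value.
my $narrow_flag = Encode::is_utf8($narrow) ? 1 : 0;
my $wide_flag   = Encode::is_utf8($wide)   ? 1 : 0;

print "narrow: UTF8=$narrow_flag\n";   # narrow: UTF8=0
print "wide:   UTF8=$wide_flag\n";     # wide:   UTF8=1
```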

Zero is zero whether it's stored in a floating point number, a signed integer or an unsigned integer. Similarly, the storage format of strings conveys no information about the value of the string.

You can store code points in an 8-bit string (if they're small enough) just as easily as a 72-bit string.
You can store bytes in a 72-bit string just as easily as an 8-bit string.
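
To illustrate that the storage format says nothing about the value, here is a small sketch that keeps the same six characters in both formats and compares them. The variable names are made up for the example, and the explicit utf8::downgrade is only there to make the starting format unambiguous.

```perl
use strict;
use warnings;

my $downgraded = "r\x{E9}sum\x{E9}";   # all code points are <= 255
utf8::downgrade($downgraded);           # force the 8-bit format (UTF8=0)

my $upgraded = $downgraded;
utf8::upgrade($upgraded);               # same value, wide format (UTF8=1)

my $same_value  = $downgraded eq $upgraded ? 1 : 0;
my $same_length = length($downgraded) == length($upgraded) ? 1 : 0;
my $flags = join ',', map { utf8::is_utf8($_) ? 1 : 0 } $downgraded, $upgraded;

print "equal=$same_value same_length=$same_length flags=$flags\n";
# equal=1 same_length=1 flags=0,1
```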

In fact, Perl will switch between the two formats at will. For example, if you concatenate $string with $str, you'll get a string in the 72-bit format.
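
A short sketch of that concatenation, reusing the two strings from the question. The flag is checked only to make the format switch visible; the length check shows that no characters are gained or lost.

```perl
use strict;
use warnings;
use Encode ();

my $str    = "Some arbitrary string\n";                      # 8-bit format
my $string = "Can you \x{FB01}nd my r\x{E9}sum\x{E9}?\n";    # wide format

my $flag_str = Encode::is_utf8($str) ? 1 : 0;

my $joined      = $str . $string;
my $flag_joined = Encode::is_utf8($joined) ? 1 : 0;
my $length_ok   = length($joined) == length($str) + length($string) ? 1 : 0;

print "str: UTF8=$flag_str, joined: UTF8=$flag_joined, length preserved=$length_ok\n";
# str: UTF8=0, joined: UTF8=1, length preserved=1
```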

You can alter the storage format of a string with the builtins utf8::downgrade and utf8::upgrade, should you ever need to work around a bug.

utf8::downgrade($s);  # Switch to strings of  8-bit values (UTF8=0).
utf8::upgrade($s);    # Switch to strings of 72-bit values (UTF8=1).

You can see the effect using Devel::Peek.

>perl -MDevel::Peek -e"$s=chr(0x80); utf8::downgrade($s); Dump($s);"
SV = PV(0x7b8a74) at 0x4a84c4
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x7bab9c "\200"\0
  CUR = 1
  LEN = 12

>perl -MDevel::Peek -e"$s=chr(0x80); utf8::upgrade($s); Dump($s);"
SV = PV(0x558a6c) at 0x1cc843c
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x55ab94 "\302\200"\0 [UTF8 "\x{80}"]
  CUR = 2
  LEN = 12

edited Oct 12 '18 at 10:46

answered Jun 20 '13 at 20:24

ikegami


`substr($s, 0, 1)` refers to the first character of the string? So essentially the first character can have a value `>255` which means that it is not stored in one byte? Am I starting to understand this? – Cratylus Jun 20 '13 at 20:42
Yes, first char. Yes, it can have a value greater than 255. It *may* be stored using more than one byte depending on the storage format used and the value of the character. – ikegami Jun 20 '13 at 20:43
I've added a snippet that shows 0x80 stored as one or two bytes at the bottom of my answer. – ikegami Jun 20 '13 at 20:45
What do you mean by *may*? How can we store a value greater than `255` in a single byte? – Cratylus Jun 20 '13 at 20:47
If it's >255, it will take more than a single byte. If it's in 128..255, it will take one or two bytes. If it's 0..127, it will always take one. – ikegami Jun 20 '13 at 20:49
`String of 72-bit values. (In practice, limited to 32-bit or 64-bit.)` so this means that each character in the string can be 4 or 8 bytes right? – Cratylus Jun 20 '13 at 20:52
No, each character is stored using a variable number of bytes: 1,2,3,4,5,6,7 or 13 – ikegami Jun 20 '13 at 20:54
Ok, but then in this from perldoc: `Encodes the scalar value STRING from Perl's internal form into ENCODING and returns a sequence of octets.` Internal form is the 8-bit or 72-bit values that you specify I assume? And this form is converted to 8-bit values? – Cratylus Jun 20 '13 at 20:57
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/32109/discussion-between-ikegami-and-cratylus) – ikegami Jun 20 '13 at 20:58

I feel that thinking of Perl as keeping strings in “some default encoding” internally rather than “in some internal representation” is just going to confuse more people than it helps. It is better to think of strings as being sequences of logical code points. I believe that knowing the exact memory layout of this logical string is helpful to not one person in ten thousand, and harmful to most of the rest. – tchrist Jun 20 '13 at 21:39
_"For example, if you concatenate $string with $str, you'll get a string in the 72-bit format."_ ... except sometimes it doesn't work correctly, e.g. if you concatenate strings from two different sources, only one of which uses utf8. – Jakub Narębski Jun 24 '13 at 14:49
@Jakub Narębski, Not so, it'll work perfectly fine. It might not make sense to concatenate the two values, but that's something entirely different. – ikegami Jun 24 '13 at 16:48
@ikegami: I meant something like this: `perl -MDevel::Peek -E 'my $a = "ż"; my $b = "\x{17c}"; Dump $a; Dump $b; Dump "$b - $a"'` -- note the lack of `use utf8;` (which is admittedly a mistake). – Jakub Narębski Jun 26 '13 at 07:02
@JakubNarębski That output is expected. The `"ż"` is a bytestring (no utf8-flag set) of *two* octets/codepoints, as my console uses UTF-8 encoding. The `"\x{17c}"` has *one* codepoint; the utf8-flag is set. It is irrelevant that both strings are stored with the same sequence of octets; those are internals, and the *interpretation* of these sequences differs vastly. Upon concatenation, you end up with `2+3+1=6` codepoints, as is to be expected. Nothing to see here. Always de- and encode data on system boundaries, so stuff like this doesn't happen. – amon Jun 26 '13 at 07:59
@Jakub Narębski, `my $a = "ż"` is impossible without `use utf8;`. There's no `ż` in iso-8859-1, and you said the program's encoding is iso-8859-1 without `use utf8;`. – ikegami Jun 26 '13 at 10:34
@amon, @ikegami: The problem is that if you try to decode _after concatenation_ it doesn't work. The real problem is lack of `use utf8;`. – Jakub Narębski Jun 26 '13 at 11:11
@Jakub Narębski, Decoding before concatenation doesn't work either. Again, no problem with concatenation. – ikegami Jun 26 '13 at 11:12
@ikegami: decoding before concatenation does work: `perl -MDevel::Peek -E 'my $a = "ż"; utf8::decode($a); my $b = "\x{17c}"; Dump $a; Dump $b; Dump "$b - $a"'`; note **`utf8::decode($a);`** – Jakub Narębski Jun 26 '13 at 13:42
What I don't like about Perl in abovementioned situation is that it doesn't warn me when concatenating "bytes" and "characters", turning them into "characters"... somewhat. – Jakub Narębski Jun 26 '13 at 13:45
@Jakub Narębski, A character is an element of a string, so you're concatenating characters with characters. Perl has no way to know that some of those characters are UTF-8 encoded code points and the others are unencoded code points, which is why it doesn't warn you. You'd have to provide more information to the interpreter somehow (e.g. a type system, some kind of annotation, etc). – ikegami Jun 26 '13 at 22:57
Perl 6 has Buf (binary data) and Str (characters, or rather graphemes), and you cannot concatenate Buf with Str. Perl 5 lacks this information; it is just string, with UTF8 flag denoting representation. – Jakub Narębski Jun 27 '13 at 12:34
@Jakub Narębski, I know. "You'd have to provide more information to the interpreter somehow (e.g. a type system)" – ikegami Jun 28 '13 at 00:20
@Jakub Narębski, I didn't notice your second comment with code. Note **the lack of utf8::decode($b)** Like I said, "decoding before concatenation doesn't work either. Again, no problem with concatenation." – ikegami Jun 28 '13 at 00:21
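
The exchange above can be reproduced without depending on the source file's encoding. This is a rough sketch in which the undecoded input is written out as explicit bytes (an assumption standing in for the `"ż"` literal typed without `use utf8;`); it shows that concatenation itself behaves consistently and that the extra character comes from mixing decoded and undecoded data.

```perl
use strict;
use warnings;

my $chars  = "\x{17C}";     # one character: U+017C
my $octets = "\xC5\xBC";    # two characters: the UTF-8 bytes of U+017C, not yet decoded

my $mixed        = "$chars - $octets";
my $mixed_length = length $mixed;          # 1 + 3 + 2

# Decode at the boundary, then concatenate.
utf8::decode($octets) or die "input was not well-formed UTF-8";
my $clean        = "$chars - $octets";
my $clean_length = length $clean;          # 1 + 3 + 1
my $equal        = $chars eq $octets ? 1 : 0;

print "mixed=$mixed_length clean=$clean_length equal=$equal\n";
# mixed=6 clean=5 equal=1
```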

score 5 · Answer 2 · answered Jun 20 '13 at 20:42

The \x{FB01} and \x{E9} are code points.

Not quite: the numeric values inside the braces are code points. The whole \x expression is just a notation for a character. There are several notations for characters, most of them starting with a backslash, but the most common one is the plain literal character. You might as well write:

use utf8;
my $string = "Can you ﬁnd my résumé?\n";
#                     ↑       ↑   ↑

And code points are encoded via an encoding scheme to a series of octets.

True, but so far your string is a string of characters, not a buffer of octets.

But how does this work?

Strings consist of characters. That's just Perl's model. You as a programmer are supposed to deal with it at this level.

Of course, the computer can't, and the internal data structure must have some form of internal encoding. Far too much confusion ensues because "Perl can't keep a secret": the details leak out occasionally.

Are all the characters in this sentence (including the ASCII ones) encoded via UTF-8?

No, the internal encoding is lax UTF8 (no dash). It does not have some of the restrictions that UTF-8 (a.k.a. UTF-8-strict) has.

UTF-8 goes up to 0x10_ffff, UTF8 goes up to 0xffff_ffff_ffff_ffff on my 64-bit system. Codepoints greater than 0xffff_ffff will emit a non-portability warning, though.
In UTF-8 certain codepoints are non-characters or illegal characters. In UTF8, anything goes.
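
As a hedged sketch of that difference, the Encode module exposes both variants as separate encodings, `utf8` (lax) and `UTF-8` (strict). The example below asks each one to encode a code point just past the Unicode limit, with croaking enabled so a rejection is visible. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# The two spellings resolve to different encoding objects.
my $strict_name = find_encoding('UTF-8')->name;   # utf-8-strict
my $lax_name    = find_encoding('utf8')->name;    # utf8

my $too_big = chr(0x110000);   # first code point beyond Unicode's 0x10FFFF
my $check   = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# eval returns 1 if encoding succeeded, undef if encode croaked.
my $lax_ok    = eval { encode('utf8',  $too_big, $check); 1 } ? 1 : 0;
my $strict_ok = eval { encode('UTF-8', $too_big, $check); 1 } ? 1 : 0;

print "$lax_name accepts it: $lax_ok\n";
print "$strict_name accepts it: $strict_ok\n";
```

With a stock Encode, the lax encoding is expected to accept the value and the strict one to refuse it.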

Encode::is_utf8

… is an internals function, and is clearly marked as such. You as a programmer are not supposed to peek. But since you want to peek, no one can stop you. Devel::Peek::Dump is a better tool for getting at the internals.

Read http://p3rl.org/UNI for an introduction to the topic of encoding in Perl.

@daxim:`True, but so far your string is a string of characters, not a buffer of octets.` What does this mean? How is a buffer of octets declared in perl? — Cratylus, Jun 20 '13 at 20:49
I left out that sometimes the internal encoding is not UTF8; [you](http://stackoverflow.com/u/589924) have covered it nicely. — daxim, Jun 20 '13 at 20:49
Cratylus, you create octets by **encoding** them from a character string. There are several ways to do so, both explicit and implicit. Read through http://p3rl.org/UNI to learn all the ways, and when to prefer which. - The other way to get [octets](http://p3rl.org/Encode#octet) is to read them **raw** from a disk file, standard I/O stream, database, command-line argument, environment variable, socket etc., that is to say to skip the usual **decoding** step. — daxim, Jun 20 '13 at 20:52
By octets you mean 8-bit bytes/values? While the equivalent decoded format is 8 or 72 bit values? — Cratylus, Jun 20 '13 at 20:58
I already linked to the definition of octet in my previous comment attached to this answer. — daxim, Jun 20 '13 at 20:59

score 3 · Answer 3 · answered Jun 20 '13 at 20:33

is_utf8 is a badly named function that doesn't mean what you think it means, and it has nothing to do with your question. The answer to your question is that $string doesn't have an encoding, because it's not encoded. When you call Encode::encode with some encoding, the result will be a string that is encoded and has a known encoding.
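
A minimal sketch of that step, using the string from the question: the character string is encoded to octets with an explicitly chosen encoding and then decoded back. Choosing UTF-8 here is an assumption made for the example; any encoding that can represent the characters would do.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $string = "Can you \x{FB01}nd my r\x{E9}sum\x{E9}?\n";   # characters, no encoding

my $octets = encode('UTF-8', $string);   # characters -> bytes in a known encoding
my $back   = decode('UTF-8', $octets);   # bytes -> characters again

my $chars_len  = length $string;         # counts characters
my $octets_len = length $octets;         # counts bytes
my $round_trip = $back eq $string ? 1 : 0;

print "characters=$chars_len octets=$octets_len round trip=$round_trip\n";
# characters=23 octets=27 round trip=1
```

The octet count is larger because U+FB01 takes three bytes in UTF-8 and each U+00E9 takes two.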

hobbs


This `Encode::is_utf8($string, 1)` also returns `true` and according to `perldoc`: `If CHECK is true, also checks whether STRING contains well-formed UTF-8`. BTW I have a big headache with perldoc... – Cratylus Jun 20 '13 at 20:39
