C++: Is there a standard definition for end-of-line in a multi-line string constant?

Question

If I have a multi-line C++11 string constant such as

R"""line 1
line 2
line3"""

Is it defined what character(s) the line terminator/separator consist of?

It consists of whatever it is in the source file. It's a literal: the content is literally what you typed. For a server with well-defined EOL requirements such as HTTP that's not sufficient: you should use `\r\n` for HTTP, mail, etc. — user207421, Oct 05 '16 at 23:59
Is this only the text, or is this a string literal? Shouldn't there be some parentheses to make the raw string literal work? Something starting with `R"(""line 1` perhaps? — wally, Oct 06 '16 at 00:09
@Cheersandhth.-Alf - Not all visitors to the site have 128k rep or are experts in the field. As long as a question is on-topic and not a duplicate, it is acceptable on SO regardless of the rep of the asker. — owacoder, Oct 06 '16 at 00:29
Mechanically, I believe it's `\n`, which represents `0x0A`, in the program code itself (just like newlines in other string constants), but different environments will translate it to their native newlines (such as CRLF on Windows). — Justin Time - Reinstate Monica, Oct 06 '16 at 01:07
There is (or at least was) actual implementation divergence on this question. It is hardly "completely intro novice" when people who read standards and write compilers for a living don't agree with each other. — T.C., Oct 06 '16 at 01:15
@Cheersandhth.-Alf: Is my 146k rep high enough for me to disagree with you? — Keith Thompson, Oct 06 '16 at 01:17
@Cheersandhth.-Alf: Specifically, I disagree that this is a novice-level question, given that the standard itself is inconsistent. — Keith Thompson, Oct 06 '16 at 01:28
@KeithThompson: The standard isn't inconsistent. It's just vaguely worded. Presumably nobody thought it worth wasting time on making this more precise, or for that matter resolving the DR, **because** it's basic, trivial novice level stuff, stuff that everybody knows and nobody at the level of those who implement compilers (main readership) would disagree about. — Cheers and hth. - Alf, Oct 06 '16 at 01:32
@Cheersandhth.-Alf: I disagree. The standard explicitly says that any transformations performed in phases 1 and 2 are reverted. Those transformations explicitly include the introduction of new-line characters. The normative wording is unambiguous, and if you overlook end-of-line indicators that are not character sequences it even makes sense (you might *want* CR-LF pairs on Windows). And somebody thought it was worthwhile to submit a DR, which is unresolved after 3+ years. I'm not saying you're wrong, I'm merely suggesting that your conclusion isn't as obvious as you seem to think it is. — Keith Thompson, Oct 06 '16 at 01:38
@KeithThompson: As I see it you're just reading the word "transformations" wrong. With such bad interpretation one ends up thinking something is "explicitly" stated, when it's really just a consequence of the bad interpretation. It can be recognized as bad since it leads to inconsistency. A good interpretation is instead that the immediately following list of relevant transformations, is what the word "transformations" refers to here. And these transformations are all about making backslash effectively un-processed, which is what raw literals are all about. /That/ makes sense. — Cheers and hth. - Alf, Oct 06 '16 at 01:46
@Cheersandhth.-Alf: I think you're probably right. But I still think the wording is sloppy and subject to reasonable misunderstanding. It says "**any** transformations" (emphasis added) are reverted. The parenthesized clause after that "(trigraphs, universal-character-names, and line splicing)" seems to be intended to define what "transformations" are referred to, but it's not sufficiently clear that that's meant to be an exhaustive list. It's perfectly reasonable to think of the mapping of physical source characters to the basic source character set as a "transformation". — Keith Thompson, Oct 06 '16 at 02:01
Has anybody checked how this is implemented in real-world compilers, preferably on different OSes? — Mr Lister, Oct 06 '16 at 06:39
I downvoted this question as not a real question, considering the OP's high rep score and alleged background as a very competent person, while the question is a novice question about really basic stuff. The original comment was removed by the SO mods. In the commentary above it's argued that the question isn't really basic because at least one compiler has had a bug in this area. To me that's nonsense. — Cheers and hth. - Alf, Oct 06 '16 at 22:20

score 28 · Accepted Answer · edited May 23 '17 at 12:33

The intent is that a newline in a raw string literal maps to a single '\n' character. This intent is not expressed as clearly as it should be, which has led to some confusion.

Citations are to the 2011 ISO C++ standard.

First, here's the evidence that it maps to a single '\n' character.

A note in section 2.14.5 [lex.string] paragraph 4 says:

[ Note: A source-file new-line in a raw string literal results in a new-line in the resulting execution string-literal. Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:

    const char *p = R"(a\
    b
    c)";
    assert(std::strcmp(p, "a\\\nb\nc") == 0);

— end note ]

This clearly states that a newline is mapped to a single '\n' character. It also matches the observed behavior of g++ 6.2.0 and clang++ 3.8.1 (tests done on a Linux system using source files with Unix-style and Windows-style line endings).
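The note's example can be turned into a small self-checking snippet. This is only a sketch: the names `kRaw`, `contains_cr`, and `check_note_example` are made up for illustration, and the result reflects whatever your compiler does with the line endings of the file you save it in, which is exactly the behavior under discussion.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Raw string literal spanning three source lines.
// The continuation lines must start at column 0, as the note requires.
const char* const kRaw = R"(a\
b
c)";

// Hypothetical helper: true if the text contains a carriage return.
bool contains_cr(const std::string& s) {
    return s.find('\r') != std::string::npos;
}

// Repeats the assertion from the standard's note and additionally checks
// that no '\r' ended up in the literal.
void check_note_example() {
    assert(std::strcmp(kRaw, "a\\\nb\nc") == 0);
    assert(!contains_cr(kRaw));
}
```

To probe a particular compiler, save the file once with LF and once with CRLF line endings and call `check_note_example()` from `main` in both builds.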

Given the clearly stated intent in the note and the behavior of two popular compilers, I'd say it's safe to rely on this -- though it would be interesting to see how other compilers actually handle this.

However, a literal reading of the normative wording of the standard could easily lead to a different conclusion, or at least to some uncertainty.

Section 2.5 [lex.pptoken] paragraph 3 says (emphasis added):

Between the initial and final double quote characters of the raw string, any transformations performed in phases 1 and 2 (trigraphs, universal-character-names, and line splicing) are reverted; this reversion shall apply before any d-char, r-char, or delimiting parenthesis is identified.

The phases of translation are specified in 2.2 [lex.phases]. In phase 1:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary.

If we assume that the mapping of physical source file characters to the basic character set and the introduction of new-line characters are "transformations", we might reasonably conclude that, for example, a newline in the middle of a raw string literal in a Windows-format source file should be equivalent to a \r\n sequence. (I can imagine that being useful for Windows-specific code.)

(This interpretation does lead to problems with systems where the end-of-line indicator is not a sequence of characters, for example where each line is a fixed-width record. Such systems are rare these days.)

As "Cheers and hth. - Alf"'s answer points out, there is an open Defect Report for this issue. It was submitted in 2013 and has not yet been resolved.

Personally, I think the root of the confusion is the word "any" (emphasis added as before):

Between the initial and final double quote characters of the raw string, any transformations performed in phases 1 and 2 (trigraphs, universal-character-names, and line splicing) are reverted; this reversion shall apply before any d-char, r-char, or delimiting parenthesis is identified.

Surely the mapping of physical source file characters to the basic source character set can reasonably be thought of as a transformation. The parenthesized clause "(trigraphs, universal-character-names, and line splicing)" seems to be intended to specify which transformations are to be reverted, but that either attempts to change the meaning of the word "transformations" (which the standard does not formally define) or contradicts the use of the word "any".

I suggest that changing the word "any" to "certain" would express the apparent intent much more clearly:

Between the initial and final double quote characters of the raw string, certain transformations performed in phases 1 and 2 (trigraphs, universal-character-names, and line splicing) are reverted; this reversion shall apply before any d-char, r-char, or delimiting parenthesis is identified.

This wording would make it much clearer that "trigraphs, universal-character-names, and line splicing" are the only transformations that are to be reverted. (Not everything done in translation phases 1 and 2 is reverted, just those specific listed transformations.)

Note that the for-argument's-sake interpretation fails outright for file systems where lines are records, that is, where there is no data denoting newline. It's been a very long time since I used such a system, but as I recall, MPE/IV on the HP 3000 was like that. I vaguely recall reading that VAX was also like that, but although I used VAX machines as a student I can't recall, sorry. But, main point, the C++ rules can't do without a transformation of newlines to `\n`, because otherwise they would not work on those systems. — Cheers and hth. - Alf, Oct 06 '16 at 01:20
Also note that there is an open core language issue about this, referenced in my answer. — Cheers and hth. - Alf, Oct 06 '16 at 01:23
You wrote above: "This clearly states that a newline is mapped to a single '\n' character." I agree with your statement with one single exception: the statement in the standard is confusing, for it uses the term [new-line](http://eel.is/c++draft/cpp.pre#nt:new-line) as if it would introduce a new-line character, and not the escape sequence `\n`. Could you comment on this? — Belloc, Mar 29 '20 at 18:40
@Belloc - I'm not sure I understand your comment. A line break in a raw string literal results in a newline character. It doesn't create a `\n` escape sequence, but it is equivalent to a `\n` escape sequence. — Keith Thompson, Mar 29 '20 at 21:36
OK, I think I can understand what you're saying. But where exactly in the standard does it say that a new-line character maps to a single `\n` character as you wrote above? I'm sure you are aware that [\[lex.string\]/3](http://eel.is/c++draft/lex.string#3) is just a Note. — Belloc, Mar 30 '20 at 12:47
@Belloc The whole point of the cited defect report is that the standard *doesn't* state that clearly in normative text. The footnote implies that it was the intent. (I haven't checked whether later editions of the standard resolve this.) — Keith Thompson, Mar 30 '20 at 20:43
After a detailed search in the Standard I found something ([table 9](http://eel.is/c++draft/lex.ccon#tab:lex.ccon.esc)) that I think gives normative status to the assertion that a new-line is mapped to a single `\n` character. Thanks for your comments. — Belloc, Mar 31 '20 at 10:19
@Belloc: That table says that `\n` *in a character literal* represents a newline character. It says nothing about raw string literals. — Keith Thompson, Mar 31 '20 at 17:40

score 16 · Answer 2 · answered Oct 06 '16 at 00:11

The standard seems to indicate that:

R"""line 1
line 2
line3"""

is equivalent to:

"line 1\nline 2\nline3"

From 2.14.5 String literals of the C++11 standard:

4 [ Note: A source-file new-line in a raw string literal results in a new-line in the resulting execution string literal. Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:
const char *p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);
—end note ]

5 [ Example: The raw string
R"a(
)\
a"
)a"
is equivalent to "\n)\\\na\"\n".
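The quoted example can be checked mechanically. The following is a sketch rather than anything taken from the standard: `kDelimited`, `kEscaped`, and `check_delimited_example` are illustrative names, and the check assumes the compiler maps each source line break to a single `'\n'`.

```cpp
#include <cassert>
#include <string>

// The raw string from the quoted example, using "a" as the delimiter.
// Only the sequence )a" terminates the literal, so the earlier )\ and a"
// are ordinary content.
const std::string kDelimited = R"a(
)\
a"
)a";

// The ordinary string literal the example says it is equivalent to.
const std::string kEscaped = "\n)\\\na\"\n";

void check_delimited_example() {
    assert(kDelimited == kEscaped);
    // Characters: '\n' ')' '\\' '\n' 'a' '"' '\n'
    assert(kDelimited.size() == 7);
}
```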

R Sahu


Unclear. That only says that a newline maps to a newline. It doesn't say what happens if the source file contained `\r` for example. – user207421 Oct 06 '16 at 00:15
@EJP: Look up "phases of translation" in the standard. Apparently you think source file characters make it directly into literals. They don't. – Cheers and hth. - Alf Oct 06 '16 at 00:19
@Cheersandhth.-Alf There's no 'apparently' about it. That's exactly what I said in my comment below the question. I may well be wrong about that, but no convincing evidence has appeared anywhere here yet. This answer comes closest so far. At least he cites and quotes something. – user207421 Oct 06 '16 at 00:22
@Cheersandhth.-Alf in 2.5 it also says "Between the initial and final double quote characters of the raw string, any transformations performed in phases 1 and 2 (trigraphs, universal-character-names, and line splicing) are reverted" – harmic Oct 06 '16 at 00:22
@harmic: Yes, that's due to raw literals being tacked on to the language very late. It's a kludge to make \\ effectively not processed within a raw string literal. But with an unfortunate phrasing. One can analyze it by requiring that it makes sense. It makes sense if the list in the parenthesis is read as an exhaustive enumeration of possible "transformations". It ceases to make sense if "transformation" is interpreted as also including character set and end-of-line conversions, for then the same source code in different encodings could produce different programs. – Cheers and hth. - Alf Oct 06 '16 at 00:33
Also, examples, and notes are [non-normative](http://stackoverflow.com/questions/21364398/are-notes-and-examples-in-the-core-language-specification-of-the-c-standard-no) which to me leaves a little ambiguity – harmic Oct 06 '16 at 00:36

Cheers and hth. - Alf · Answer 3 · 2016-10-07T05:05:34.960

Note: the question has changed substantially since the answers were posted. Only half of it remains, namely the pure C++ aspect. The network focus in this answer addresses the original question's “sending a multi-line string to a server with well-defined end-of-line requirements”. I do not chase question evolution in general.

Internally in the program, the C++ standard for newline is \n. This is used also for newline in a raw literal. There is no special convention for raw literals.

Usually \n maps to ASCII linefeed, which is the value 10.

I'm not sure what it maps to in EBCDIC, but you can check that if needed.

On the wire, however, it's my impression that most protocols use ASCII carriage return plus linefeed, i.e. 13 followed by 10. This is sometimes called CRLF, after the ASCII abbreviations CR for carriage return and LF for linefeed. When the C++ escapes are mapped to ASCII this is simply \r\n in C++.

You need to abide by the requirements of the protocol you're using.
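As a rough illustration of doing that conversion in application code, here is a minimal sketch. `to_crlf` is a hypothetical helper, not a standard library function, and it assumes the internal text uses `\n` (possibly mixed with existing `\r\n` pairs, which it leaves alone).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns a copy of `text` in which every line break
// is "\r\n". A '\n' that already follows a '\r' is left untouched, so text
// that is already in CRLF form is not doubled up.
std::string to_crlf(const std::string& text) {
    std::string out;
    out.reserve(text.size() + 16);
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\n' && (i == 0 || text[i - 1] != '\r')) {
            out += '\r';
        }
        out += text[i];
    }
    return out;
}
```

With something like this, a request can be written as a readable raw string literal and converted just before it is sent, instead of relying on how the source file happens to be saved.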

For ordinary file/stream i/o the C++ standard library takes care of mapping the internal \n to whatever convention the host environment uses. This is called text mode, as opposed to binary mode where no mapping is performed.
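The difference between the two modes can be observed by writing the same string both ways and counting the bytes that reach the disk. This is a sketch under assumptions: `bytes_on_disk` is a made-up helper, and the file name passed to it is whatever scratch path you choose.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical helper: writes `text` to `path` using the given open mode,
// then reports how many bytes ended up in the file and deletes it.
std::size_t bytes_on_disk(const std::string& path, const std::string& text,
                          std::ios::openmode mode) {
    {
        std::ofstream out(path, std::ios::out | std::ios::trunc | mode);
        out << text;
    }  // leaving the scope closes the stream and flushes the data

    // Re-open in binary mode, positioned at the end, to read the raw size.
    std::ifstream in(path, std::ios::in | std::ios::binary | std::ios::ate);
    const std::size_t size = in ? static_cast<std::size_t>(in.tellg()) : 0;
    in.close();
    std::remove(path.c_str());
    return size;
}
```

Calling it with `std::ios::binary` for `"a\nb"` should give 3 bytes on any platform, since no mapping is performed. Calling it with `std::ios::out` alone (text mode) gives a platform-dependent result: typically 3 on Unix-like systems and 4 on Windows, where `\n` is written as CRLF.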

For network i/o, which is not covered by the standard library, the application code must do this itself, either directly or via some library functions.

There is an active issue about this, core language defect report #1655 “Line endings in raw string literals”, submitted by Mike Miller 2013-04-26, where he asks,

“is it intended that, for example, a CRLF in the source of a raw string literal is to be represented as a newline character or as the original characters?”

Since line ending values differ depending on the encoding of the original file, and considering that in some file systems there is not an encoding of line endings, but instead lines as records, it's clear that the intention is not to represent the file contents as-is – since that's impossible to do in all cases. But as far as I can see this DR is not yet resolved.

This does not (directly) answer the question, which is about raw string literals. — Tavian Barnes, Oct 06 '16 at 00:05
@TavianBarnes: The C++ standard for newline is `\n`. I fail to see how you failed to read that. — Cheers and hth. - Alf, Oct 06 '16 at 00:05
@Alf The C++ standard provides that `\n` maps to line feed. If there is something in the standard that answers this question, or that 'maps newline to `\n`', which isn't necessarily the same thing, you should cite and quote it. — user207421, Oct 06 '16 at 00:07
@Cheersandhth.-Alf If your point is that `R""" """` (there should be a newline between the `"""`s but of course it won't render in comments) and `"\n"` are equivalent, you should say that explicitly (and preferably give evidence). — Tavian Barnes, Oct 06 '16 at 00:09
@EJP: The standard doesn't say that `'\n'` maps to line feed. It doesn't even mention the line feed character. It says that `'\n'` maps to new-line. — Keith Thompson, Oct 06 '16 at 00:36
Friends of Mark: it's **stupid to downvote in retaliation**. I cannot express how freaking dumb that is. He certainly won't get any more help from me, like the help that produced the answer he chose as solution (it was opposite originally). — Cheers and hth. - Alf, Oct 08 '16 at 16:03
