How to correctly parse incoming HTTP requests

Question

I've created a C++ application using Winsock, which has a small HTTP server implemented (it handles just the few features I need). It is used to communicate with the outside world via HTTP requests. It works, but sometimes the requests are not handled correctly because the parsing fails. I'm quite sure the requests are correctly formed, since they are sent by major web browsers like Firefox/Chrome or by Perl/C# (which have HTTP modules/DLLs).

After some debugging I found out that the problem is in fact in receiving the message. When the message arrives in more than one part (it is not read in a single recv() call), the parsing sometimes fails. I have tried numerous ways to resolve this, but nothing seems to be reliable enough.

What I do now is read data until I find the "\r\n\r\n" sequence, which indicates the end of the header. If WSAGetLastError() reports something other than 10035 (WSAEWOULDBLOCK) before such a sequence is found, meaning the connection was closed or failed, I discard the message. Once I know I have the whole header, I parse it and look for information about the body length. However, I'm not sure whether this information is mandatory (I think not), and what I should do if it is missing: does that mean there will be no body? Another problem is that I do not know whether I should look for a "\r\n\r\n" after the body (if its length is greater than zero).

Does anybody know how to reliably parse an HTTP message?

Note: I know there are implementations of HTTP servers out there. I want my own for various reasons. And yes, reinventing the wheel is bad, I know that too.

Unless you're doing this for fun, look at the http-parser link Jack has provided below. It looks brilliant, and doesn't presume to hijack your socket/whatever. – Matt Joiner Sep 13 '10 at 13:22
@Matt Joiner: I looked at it and it indeed looks very good. But I really need to write my own, which supports just a fraction of all the HTTP features and at the same time knows about a few special commands. If I were in need of a full HTTP server I would definitely not write my own. – PeterK Sep 13 '10 at 13:46
Keep in mind the code provided is __tiny__ and pushes no requirements on you. You can halt, ignore, and wrap it in any way you please by customizing the few callbacks it provides. I sympathise with the desire to do things yourself, but this will save you hours of debugging and bugs due to unforeseen input later on. – Matt Joiner Sep 13 '10 at 13:49

score 8 · Answer 1 · edited Nov 06 '12 at 15:41

If you're set on writing your own parser, I'd take the Zed Shaw approach: use the Ragel state machine compiler and build your parser based on that. Ragel can handle input arriving in chunks, if you're careful.

Honestly, though, I'd just use something like this (http-parser).

Your go-to resource should be RFC 2616, which describes HTTP 1.1 and gives you what you need to construct a parser. Good luck!

edited Nov 06 '12 at 15:41

Homme Zwaagstra

answered Sep 13 '10 at 07:28

Jack Kelly

+1 for the http-parser and definitive links. That source would generate ***FAST*** code, I'm really impressed. That's badass. – Matt Joiner Sep 13 '10 at 13:20
Talking about Ragel, you can take a look at HttpMachine (https://github.com/bvanderveen/httpmachine/tree/master/src/HttpMachine/rl). Although it is written in C#, the state machine is compiled with Ragel, and I think it should be easily adaptable to C++. Moreover, two of the three .rl (Ragel source) files are not tied to C# but are general, so a lot of the work is already done. – gsscoder Jan 19 '13 at 16:24

score 3 · Accepted Answer · answered Sep 13 '10 at 07:21

You could try looking at their code to see how they handle an HTTP message.

Or you could look at the spec: there are message length fields you should use. Only buggy browsers send additional CRLFs at the end, apparently.
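To make the "message length fields" concrete, here is a minimal C++ sketch of how a request's body framing could be decided once the header block has been parsed. It follows the HTTP/1.1 message-length rules as I read them (RFC 2616 section 4.4 and the httpbis revision): Transfer-Encoding takes precedence over Content-Length, and a request carrying neither has no body. The names BodyKind, BodyFraming and request_body_framing are made up for illustration, and the sketch assumes header names were already lower-cased and values trimmed.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

enum class BodyKind { None, Fixed, Chunked, Invalid };

struct BodyFraming {
    BodyKind kind;
    std::size_t length;  // only meaningful when kind == Fixed
};

// Hypothetical helper, not part of any library. Assumes header names are
// lower-cased and values are trimmed of surrounding whitespace.
BodyFraming request_body_framing(const std::map<std::string, std::string>& headers) {
    auto te = headers.find("transfer-encoding");
    if (te != headers.end()) {
        // Transfer-Encoding wins over Content-Length; "chunked" must be
        // the last coding listed for the length to be determinable.
        std::string v = te->second;
        std::transform(v.begin(), v.end(), v.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        const std::string chunked = "chunked";
        if (v.size() >= chunked.size()) {
            std::size_t start = v.size() - chunked.size();
            bool at_boundary = start == 0 || v[start - 1] == ',' || v[start - 1] == ' ';
            if (at_boundary && v.compare(start, chunked.size(), chunked) == 0)
                return {BodyKind::Chunked, 0};
        }
        return {BodyKind::Invalid, 0};  // a server would typically answer 400
    }

    auto cl = headers.find("content-length");
    if (cl != headers.end()) {
        const std::string& v = cl->second;
        if (v.empty()) return {BodyKind::Invalid, 0};
        std::size_t n = 0;
        for (unsigned char c : v) {
            if (!std::isdigit(c)) return {BodyKind::Invalid, 0};
            std::size_t d = static_cast<std::size_t>(c - '0');
            if (n > (SIZE_MAX - d) / 10) return {BodyKind::Invalid, 0};  // overflow
            n = n * 10 + d;
        }
        return {BodyKind::Fixed, n};
    }

    // Neither header present: for a request, that means no body.
    return {BodyKind::None, 0};
}
```

Note that the decision never looks at the method name, and there is no terminating "\r\n\r\n" to search for after the body: a Fixed body is exactly `length` bytes long.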

answered Sep 13 '10 at 07:21

gbjbaanb

The HTTPbis WG has clarified message parsing; see http://greenbytes.de/tech/webdav/draft-ietf-httpbis-p1-messaging-11.html#message.body for the current draft text. – Julian Reschke Sep 13 '10 at 08:42
This looks good, thanks. If that helps i will gladly accept your answer. – PeterK Sep 13 '10 at 09:40

score 0 · Answer 3 · answered Sep 13 '10 at 08:31

In any case, an HTTP request has "\r\n\r\n" at the end of the request headers and before the request data (if any), even if the request is just "GET / HTTP/1.0\r\n\r\n".

If the method is "POST", you should read as many bytes after "\r\n\r\n" as are specified in the Content-Length field.

So pseudocode is:

read_until(buf, "\r\n\r\n");
if(buf.starts_with("POST"))
{
   contentLength = regex("^Content-Length: (\d+)$").find(buf)[1];
   read_all(buf, contentLength);
}

There will be a "\r\n\r\n" after the content only if the content itself includes it. The content may be binary data and has no terminating sequence, so the only way to get its size is to use the Content-Length field.
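Since the original problem was requests arriving across several recv() calls, here is a minimal C++ sketch of the buffering side: keep appending whatever recv() returns to one buffer, look for "\r\n\r\n" across the whole buffer (it can be split between two reads), then wait until Content-Length bytes of body have arrived. RequestAssembler is a hypothetical name, not an existing API. The sketch applies Content-Length regardless of the method, treats a missing Content-Length as "no body", and ignores chunked bodies and numeric overflow.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: accumulates bytes from successive recv() calls.
class RequestAssembler {
public:
    // Append newly received bytes. Returns true once the full header block
    // and the announced number of body bytes are in the buffer.
    bool feed(const char* data, std::size_t len) {
        std::size_t old_size = buf_.size();
        buf_.append(data, len);
        if (header_end_ == std::string::npos) {
            // Back up 3 bytes so a terminator split across reads is found.
            std::size_t from = old_size >= 3 ? old_size - 3 : 0;
            std::size_t pos = buf_.find("\r\n\r\n", from);
            if (pos == std::string::npos) return false;  // header incomplete
            header_end_ = pos + 4;
            body_len_ = content_length(buf_.substr(0, header_end_));
        }
        return buf_.size() - header_end_ >= body_len_;
    }
    bool feed(const std::string& s) { return feed(s.data(), s.size()); }

    // Only call these after feed() has returned true.
    std::string header() const { return buf_.substr(0, header_end_); }
    std::string body() const { return buf_.substr(header_end_, body_len_); }

private:
    // Case-insensitive lookup of Content-Length; returns 0 when absent.
    static std::size_t content_length(std::string head) {
        for (char& c : head)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        const std::string key = "\r\ncontent-length:";
        std::size_t pos = head.find(key);
        if (pos == std::string::npos) return 0;
        pos += key.size();
        while (pos < head.size() && (head[pos] == ' ' || head[pos] == '\t')) ++pos;
        std::size_t n = 0;
        while (pos < head.size() && std::isdigit(static_cast<unsigned char>(head[pos]))) {
            n = n * 10 + static_cast<std::size_t>(head[pos] - '0');
            ++pos;
        }
        return n;
    }

    std::string buf_;
    std::size_t header_end_ = std::string::npos;
    std::size_t body_len_ = 0;
};
```

In a receive loop you would call feed() with each chunk recv() hands back and only start parsing once it returns true. Any bytes beyond the body belong to the next request on the same connection.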

answered Sep 13 '10 at 08:31

Abyx

No, it does not depend on the method name. See http://greenbytes.de/tech/webdav/draft-ietf-httpbis-p1-messaging-11.html#message.body for details. – Julian Reschke Sep 13 '10 at 08:41
Also, keep in mind that HTTP 1.1 requests do not need to use a `Content-Length` header, either. They can use `Transfer-Encoding: chunked` instead, in which case the message length is encoded inside the message data itself. – Remy Lebeau Sep 13 '10 at 20:05
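For the `Transfer-Encoding: chunked` case mentioned in the comment above, a rough C++ sketch of decoding the body could look like the following. Each chunk is a hexadecimal size line, CRLF, that many bytes, CRLF; a zero-size chunk ends the body, optionally followed by trailer lines and a final empty line. ChunkResult and decode_chunked are names invented for this example. The sketch re-scans the buffer from the start on every call, skips chunk extensions after ';', and does not guard against overflow from absurdly long size lines.

```cpp
#include <cstddef>
#include <string>

enum class ChunkResult { Complete, NeedMore, Malformed };

// Returns the value of a hex digit, or -1 if c is not one.
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: `raw` holds everything received after the header
// block; on Complete, `body` holds the decoded payload.
ChunkResult decode_chunked(const std::string& raw, std::string& body) {
    body.clear();
    std::size_t pos = 0;
    for (;;) {
        std::size_t eol = raw.find("\r\n", pos);
        if (eol == std::string::npos) return ChunkResult::NeedMore;

        // Parse the hexadecimal chunk size at the start of the line.
        std::size_t size = 0;
        std::size_t i = pos;
        for (; i < eol; ++i) {
            int d = hex_value(raw[i]);
            if (d < 0) break;
            size = size * 16 + static_cast<std::size_t>(d);
        }
        if (i == pos) return ChunkResult::Malformed;                   // no digits
        if (i < eol && raw[i] != ';') return ChunkResult::Malformed;   // junk after size
        pos = eol + 2;

        if (size == 0) {
            // Last chunk: skip optional trailer lines up to the empty line.
            for (;;) {
                std::size_t end = raw.find("\r\n", pos);
                if (end == std::string::npos) return ChunkResult::NeedMore;
                if (end == pos) return ChunkResult::Complete;
                pos = end + 2;
            }
        }

        // Need `size` data bytes plus the CRLF that closes the chunk.
        std::size_t avail = raw.size() - pos;
        if (avail < 2 || avail - 2 < size) return ChunkResult::NeedMore;
        if (raw.compare(pos + size, 2, "\r\n") != 0) return ChunkResult::Malformed;
        body.append(raw, pos, size);
        pos += size + 2;
    }
}
```

A NeedMore result simply means "call recv() again and retry with the larger buffer"; a production parser would keep its position between calls instead of starting over.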

score -1 · Answer 4 · answered Sep 13 '10 at 07:36

HTTP GET/HEAD requests have no body, and a POST request can have no body too. You have to check whether it's a GET/HEAD; if it is, then no content (body/message) was sent. If it was a POST, do as the specs say about parsing a message of known/unknown length, as @gbjbaanb said.

answered Sep 13 '10 at 07:36

aularon

GET and HEAD requests *can* have a body. So no, you don't check the method name. – Julian Reschke Sep 13 '10 at 08:39
@Julian, it's not exactly specified in the HTTP specification whether you can include a body in GET/HEAD requests. I tested it locally and it works with Apache, but I've never seen that before in a real-world implementation. I'm reading http://stackoverflow.com/questions/978061/ and http://stackoverflow.com/questions/1266596/ now, thanks for pointing that out. – aularon Sep 13 '10 at 10:14
Whether something is used in practice and whether it's allowed are separate questions. What's important is that request parsing is the same for all methods (unlike response parsing, where HEAD is special). See also http://trac.tools.ietf.org/wg/httpbis/trac/ticket/19 -- that's why we're revising RFC 2616, after all. – Julian Reschke Sep 13 '10 at 15:34
