Decompressing a gzipped payload of a packet with Python

Question

I am currently working on a program that takes a .pcap file and separates the packets by IP address using the scapy package. I want to decompress the payloads that are gzip-compressed, using Python's gzip module. I can tell that a payload is gzipped because it contains

Content-Encoding: gzip

I am trying to use

import gzip
import StringIO
fileStream = StringIO.StringIO(payload)
gzipper = gzip.GzipFile(fileobj=fileStream)
data = gzipper.read()

to decompress the payload, where

payload = str(pkt[TCP].payload)

When I try to do this I get the error

IOError: Not a gzipped file

When I print the first payload I get

HTTP/1.1 200 OK
Cache-Control: private, max-age=0
Content-Type: text/html; charset=utf-8
P3P: CP="NON UNI COM NAV STA LOC CURa DEVa PSAa PSDa OUR IND"
Vary: Accept-Encoding
Content-Encoding: gzip
Date: Sat, 30 Mar 2013 19:23:33 GMT
Content-Length: 15534
Connection: keep-alive
Set-Cookie: _FS=NU=1; domain=.bing.com; path=/
Set-Cookie: _SS=SID=F2652FD33DC443498CE043186458C3FC&C=20.0; domain=.bing.com; path=/
Set-Cookie: MUID=2961778241736E4F314E732240626EBE; expires=Mon, 30-Mar-2015 19:23:33 GMT; domain=.bing.com; path=/
Set-Cookie: MUIDB=2961778241736E4F314E732240626EBE; expires=Mon, 30-Mar-2015 19:23:33 GMT; path=/
Set-Cookie: OrigMUID=2961778241736E4F314E732240626EBE%2c532012b954b64747ae9b83e7ede66522; expires=Mon, 30-Mar-2015 19:23:33 GMT; domain=.bing.com; path=/
Set-Cookie: SRCHD=D=2758763&MS=2758763&AF=NOFORM; expires=Mon, 30-Mar-2015 19:23:33 GMT; domain=.bing.com; path=/
Set-Cookie: SRCHUID=V=2&GUID=02F43275DC7F435BB3DF3FD32E181F4D; expires=Mon, 30-Mar-2015 19:23:33 GMT; path=/
Set-Cookie: SRCHUSR=AUTOREDIR=0&GEOVAR=&DOB=20130330; expires=Mon, 30-Mar-2015 19:23:33 GMT; domain=.bing.com; path=/

?}k{?H????+0?#!?,_???$?:?7vf?w?Hb???ƊG???9???/9U?\$;3{9g?ycAӗ???????W{?o?~?FZ?e ]>??<??n????׻?????????d?t??a?3?
?2?p??eBI?e??????ܒ?P??-?Q?-L?????ǼR?³?ׯ??%'
?2Kf?7???c?Y?I?1+c??,ae]?????<{?=ƞ,?^?J?ď???y??6O?_?z????_?ޞ~?_?????Bo%]???_?????W=?

For additional information, this is a packet that was isolated because it contained Content-Encoding: gzip from a sample .pcap file provided by a project.

I may be wrong about this, but I suspect `gzip.GzipFile` wants to deal with a *file*, as suggested by both the name of the class/function and the documentation (for 2.7.x, anyway). For compressing/decompressing *buffers*, perhaps the `zlib` module (in particular the `compress` and `decompress` functions) might be more appropriate... — twalberg, May 19 '15 at 19:35
@twalberg, no, a `StringIO` will do just fine. The OP's problem is that he doesn't separate the compressed message body from the headers, but instead tries to decompress the full message. — Lukas Graf, May 19 '15 at 19:37
@LukasGraf That was my second guess, but the question wasn't really clear on whether anything was being done to remove headers, etc... — twalberg, May 19 '15 at 19:39

Lukas Graf · Accepted Answer · 2015-05-19T20:20:59.933


In order to decode a gzipped HTTP response, you only need to decode the response body, not the headers.

The payload in your case is the entire TCP payload, i.e. the entire HTTP message including headers and body.
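
To see why `GzipFile` rejects the full payload, here is a minimal sketch in Python 3 (an assumption; the question uses Python 2). A gzip stream starts with the two magic bytes `0x1f 0x8b`, while the TCP payload starts with the text `HTTP/1.1`. The `payload` below is a hypothetical stand-in built in memory, not the packet from the question.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_like_gzip(data):
    """Return True if data starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# Hypothetical stand-in for a captured TCP payload:
# status line, one header, blank line, gzipped body.
body = gzip.compress(b"<!DOCTYPE html>")
payload = b"HTTP/1.1 200 OK\r\nContent-Encoding: gzip\r\n\r\n" + body

print(looks_like_gzip(payload))  # payload begins with b"HT", not the magic bytes
print(looks_like_gzip(body))     # the body alone is a gzip stream

try:
    gzip.GzipFile(fileobj=io.BytesIO(payload)).read()
except OSError as exc:
    # Python 3 reports the same problem as the IOError in the question
    print("decompression failed:", exc)
```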

HTTP messages (requests and responses) use the generic message format of RFC 822, the same format that e-mail messages (RFC 2822) are based on.

The structure of an 822 message is very simple:

Zero or more header lines (key/value pairs separated by :), each terminated by CRLF
An empty line (just CRLF: carriage return and line feed, so '\r\n')
The message body
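
If you did want to parse the message by hand, the structure above could be split roughly like this. This is only a sketch in Python 3 (an assumption, since the question uses Python 2); `split_http_message` is a hypothetical helper name, it assumes the whole message is already in one bytes object, and it ignores folded (multi-line) headers.

```python
def split_http_message(raw):
    """Split raw HTTP message bytes into (start_line, headers, body).

    headers is a list of (name, value) tuples so that repeated
    headers such as Set-Cookie are preserved.
    """
    # The first empty line (CRLF CRLF) separates headers from body.
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line found; headers are incomplete")

    lines = head.split(b"\r\n")
    start_line = lines[0].decode("iso-8859-1")  # e.g. the status line

    headers = []
    for line in lines[1:]:
        name, _, value = line.partition(b":")
        headers.append((name.decode("iso-8859-1").strip(),
                        value.decode("iso-8859-1").strip()))
    return start_line, headers, body


# Hypothetical sample message, not the packet from the question.
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Encoding: gzip\r\n"
       b"Content-Length: 5\r\n"
       b"\r\n"
       b"hello")
start_line, headers, body = split_http_message(raw)
```

Only the first blank line is used as the separator, so a body that happens to contain `\r\n\r\n` stays intact.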

You could now parse this message yourself to isolate the body, but I would recommend using the tools Python already provides. The httplib module (Python 2.x) includes the HTTPMessage class, which httplib uses internally to parse HTTP responses. It's not meant to be used directly, but in this case I would probably still use it: it will handle some HTTP-specific details for you.

Here's how you can use it to separate the body from the headers:

>>> import StringIO
>>> from httplib import HTTPMessage
>>>
>>> # Open in binary mode so the compressed body is read unmodified
>>> f = open('gzipped_response.payload', 'rb')
>>>
>>> # Or, if you already have the payload in memory as a string:
... # f = StringIO.StringIO(payload)
...
>>> status_line = f.readline()
>>> msg = HTTPMessage(f, 0)
>>> body = msg.fp.read()

The HTTPMessage class works much like rfc822.Message does:

First, you need to read (or discard) the status line (HTTP/1.1 200 OK), because that's not part of the RFC 822 message, and is not a header.
Then you instantiate HTTPMessage with a handle to an open file and the seekable argument set to 0. The file pointer is stored as msg.fp.
Upon instantiation it calls msg.readheaders(), which reads all header lines until it encounters an empty line (CRLF).
At that point, msg.fp has been advanced to the point where the headers end and the body starts. You can therefore call msg.fp.read() to read the rest of the message - the body.
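
For Python 3, where httplib became http.client and HTTPMessage no longer takes a file object, the same steps might look like the sketch below. It relies on `http.client.parse_headers`, a helper that exists in the standard library but, as far as I know, is not part of the documented public API, so treat it as an assumption. The `raw` bytes are a hypothetical sample.

```python
import io
from http.client import parse_headers

# Hypothetical sample response, not the packet from the question.
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Encoding: gzip\r\n"
       b"Content-Length: 5\r\n"
       b"\r\n"
       b"hello")

f = io.BytesIO(raw)           # payloads are bytes in Python 3
status_line = f.readline()    # read (or discard) the status line first
msg = parse_headers(f)        # reads header lines up to the empty line
body = f.read()               # f is now positioned at the start of the body

print(msg["Content-Encoding"])
```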

After that, your code for decompressing the gzipped body just works:

>>> import gzip
>>> import StringIO
>>> body_stream = StringIO.StringIO(body)
>>> gzipper = gzip.GzipFile(fileobj=body_stream)
>>> data = gzipper.read()
>>>
>>> print data[:25]
<!DOCTYPE html>
<html>
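
A rough Python 3 equivalent of the decompression step, as a sketch: `io.BytesIO` replaces `StringIO.StringIO`, and `zlib.decompress` with `wbits=16 + zlib.MAX_WBITS` is an alternative for in-memory buffers that also understands the gzip header. The `body` here is a hypothetical stand-in compressed in memory, not the response from the question.

```python
import gzip
import io
import zlib

# Hypothetical stand-in for an extracted response body.
body = gzip.compress(b"<!DOCTYPE html>\n<html>")

# Same approach as above, using a bytes buffer.
data = gzip.GzipFile(fileobj=io.BytesIO(body)).read()

# Alternative without a file object: 16 + MAX_WBITS tells zlib
# to expect a gzip header and trailer.
data_zlib = zlib.decompress(body, 16 + zlib.MAX_WBITS)

print(data[:15])
```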

edited May 19 '15 at 20:20

answered May 19 '15 at 19:41


I am now encountering this error, implementing your suggested code: `line = self.fp.readline(_MAXLINE + 1)` `AttributeError: 'str' object has no attribute 'readline'` – Delta May 19 '15 at 20:06
It seems like you're instantiating `HTTPMessage` with a string directly instead of a `StringIO`. – Lukas Graf May 19 '15 at 20:09
@Delta - also note that I slightly updated the code. You need to discard the first line (the status line) by calling `payload.readline()` once, and then you don't need to call `msg.readheaders()` yourself. – Lukas Graf May 19 '15 at 20:10
@Delta hang on, I'll rewrite my code a bit so it matches the variable names from your question. – Lukas Graf May 19 '15 at 20:14
@Delta updated my code. `payload` in my code should now be exactly what I assume it is in your question, and using `f = StringIO.StringIO(payload)` you should be able to just copy & paste the rest. – Lukas Graf May 19 '15 at 20:23
This is making a lot more sense now. I am getting `IOError: CRC check failed 0xdddddebd != 0xc5fd705fL` when I try to run my code now, coming from the gzipper.read() call. – Delta May 19 '15 at 20:37
Hmm. I tested all my code with a simple pcap dump from a request to stackoverflow.com, and it worked. Are you sure the body is `gzip` encoded and not `deflate`? Could you maybe upload your sample pcap file somewhere, or post a link, if it's part of a public project? – Lukas Graf May 19 '15 at 20:45
([This answer](http://stackoverflow.com/a/9856879/1599111) explains why there is some confusion around `gzip` vs `deflate`, and that browsers often have fallback logic that can deal with incorrectly declared content encodings). – Lukas Graf May 19 '15 at 20:49
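
A sketch of the kind of fallback logic mentioned here, in Python 3 (an assumption): try the body as gzip, then as a zlib stream (what `deflate` formally means), then as raw deflate (what some servers actually send). `decode_body` is a hypothetical helper name, not something from the answer.

```python
import zlib

# wbits values understood by zlib.decompress:
#   16 + MAX_WBITS -> gzip header and trailer
#   MAX_WBITS      -> zlib header and trailer
#   -MAX_WBITS     -> raw deflate, no header at all
CANDIDATE_WBITS = (16 + zlib.MAX_WBITS, zlib.MAX_WBITS, -zlib.MAX_WBITS)


def decode_body(body):
    """Try gzip, zlib and raw deflate in turn; return the decompressed bytes."""
    for wbits in CANDIDATE_WBITS:
        try:
            return zlib.decompress(body, wbits)
        except zlib.error:
            continue  # wrong container format, try the next one
    raise ValueError("body is not gzip, zlib or raw deflate data")
```

Note that this does not help with a CRC failure caused by a truncated or mixed-up body; it only covers a wrongly declared encoding.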
I am using the pcap file from [this project](http://nifty.stanford.edu/2015/matthews-raymond-packet-sniffing/), but I am expanding from the base parameters of the project to try and decompress anything that is compressed. The payload that it starts out with is the one I pasted in my OP. – Delta May 19 '15 at 21:00
Hmm, I was successfully able to decode this response body as gzip with my code after isolating that single HTTP response (with wireshark). Are you sure you correctly separated the HTTP requests and responses in that single TCP stream? Because in that TCP stream there are *several* requests and responses. – Lukas Graf May 19 '15 at 21:33
What's the length of your `body` after you extract it with the code from my answer? It should be exactly 15534 bytes (which matches the `Content-Length` header). If it's more, then your `payload` contains more than a single HTTP response. – Lukas Graf May 19 '15 at 21:36
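
One way to run that length check, sketched in Python 3 (an assumption). `split_first_response` is a hypothetical helper; it uses `http.client.parse_headers`, which exists in the standard library but is not a documented public function as far as I know, and it does not handle chunked transfer encoding.

```python
import io
from http.client import parse_headers


def split_first_response(payload):
    """Return (body, leftover, declared_length) for the first response in payload."""
    f = io.BytesIO(payload)
    f.readline()                    # discard the status line
    msg = parse_headers(f)          # consumes the headers and the empty line
    value = msg["Content-Length"]
    if value is None:
        raise ValueError("no Content-Length header (chunked bodies not handled)")
    declared = int(value)
    rest = f.read()
    # Anything beyond the declared length belongs to a later message.
    return rest[:declared], rest[declared:], declared


# Hypothetical payload containing two responses back to back.
payload = (b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello"
           b"HTTP/1.1 304 Not Modified\r\n\r\n")
body, leftover, declared = split_first_response(payload)
print(len(body) == declared, len(leftover))
```

If `len(body)` comes out smaller than `declared`, the response continues in later TCP segments; if `leftover` is non-empty, the payload holds more than one message.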
I just read the `sample.py` from that project - is that basically all you're doing? Because that wouldn't be enough by a long shot - in order to decompress gzipped **HTTP** responses, you first would need to reassemble all the TCP segments that belong to the same stream, and then separate the HTTP messages in that stream. And frankly, that's a totally different and much more complex question. – Lukas Graf May 19 '15 at 21:42
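
To give an idea of what reassembly involves, here is a heavily simplified sketch in Python 3 (an assumption). `reassemble` is a hypothetical helper that takes `(seq, data)` pairs for one direction of one TCP connection; with scapy these could presumably be built from `pkt[TCP].seq` and `bytes(pkt[TCP].payload)`. It ignores sequence number wraparound and simply stops at the first gap, so it is nowhere near a full TCP implementation.

```python
def reassemble(segments):
    """Concatenate TCP segment payloads in sequence-number order.

    segments: iterable of (seq, data) tuples from ONE direction of ONE
    connection. Retransmitted and overlapping bytes are dropped; the
    result ends at the first gap in the sequence numbers.
    """
    stream = bytearray()
    next_seq = None  # sequence number of the next byte we expect
    for seq, data in sorted(segments, key=lambda s: s[0]):
        if not data:
            continue                # pure ACKs carry no payload
        if next_seq is None:
            next_seq = seq          # the first data segment starts the stream
        if seq > next_seq:
            break                   # gap: a segment is missing from the capture
        overlap = next_seq - seq    # bytes we have already seen
        if overlap < len(data):
            stream += data[overlap:]
            next_seq = seq + len(data)
    return bytes(stream)


# Hypothetical segments: out of order, one retransmission, one overlap.
segments = [(1005, b"world"), (1000, b"hello"),
            (1000, b"hello"), (1003, b"lowor")]
print(reassemble(segments))
```

The reassembled stream would then still need to be split into individual HTTP messages before any body can be decompressed.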
Well it seems like I have some thinking to do. Thanks for all the help! – Delta May 20 '15 at 15:22
You're welcome. I happened to find this thread that has some really good answers about reassembling a TCP stream, maybe this could get you started: [TCP payload assembly](http://comments.gmane.org/gmane.comp.security.scapy.general/2997) – Lukas Graf May 20 '15 at 15:29
