2

I receive some data as a string. I need to write the data to a file, but the problem is that sometimes the data is compressed/zipped and sometimes it's just plain text. I need to determine the content type so I know whether to write it to a .txt file or a .tgz file. Any ideas on how to accomplish this? Can I use the MIME type somehow even though my data is a string, not a file?

Thanks.

kkeogh
  • This is similar to the question http://stackoverflow.com/questions/43580/how-to-find-the-mime-type-of-a-file-in-python . See the answer there linking to python-magic at https://github.com/ahupp/python-magic . – Andrew Dalke Jan 21 '11 at 22:55

4 Answers

1

If the file is downloaded from a web server, you should have a Content-Type header to look at; however, you are at the mercy of the web server as to whether it truly describes the type of the file.

Alternatively, you can use a heuristic to guess the file type. This can often be done by looking at the first few bytes of the file.

John La Rooy
1

Both gzip and zip put distinctive headers before the compressed data, and those byte sequences are unlikely to appear at the start of a human-readable string. If the choice is only between these formats, you can make a faster check than mimetypes would provide.
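A minimal sketch of such a header check, assuming the data is held as bytes (a Python 3 `bytes` object, or a Python 2 `str`). The helper name `sniff_compression` is made up for illustration. The signatures used are the two gzip ID bytes `1F 8B` and the zip signatures `PK 03 04` (local file header) and `PK 05 06` (empty archive).

```python
import gzip

# Signatures found at the very start of each format.
GZIP_MAGIC = b"\x1f\x8b"          # gzip: the two ID bytes
ZIP_MAGICS = (b"PK\x03\x04",      # zip: local file header
              b"PK\x05\x06")      # zip: empty archive


def sniff_compression(data):
    """Return 'gzip', 'zip', or None for a bytes payload (hypothetical helper)."""
    if data.startswith(GZIP_MAGIC):
        return "gzip"
    if data.startswith(ZIP_MAGICS):   # startswith accepts a tuple of prefixes
        return "zip"
    return None


print(sniff_compression(gzip.compress(b"Hello World.")))  # gzip
print(sniff_compression(b"Hello World. I'm plaintext."))  # None
```

This only looks at the leading bytes, so it is a guess rather than a validation: data that merely starts with one of these prefixes would be misclassified.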

9000
1

As some answers already suggested, you could peek into the first bytes of the file:

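#!/usr/bin/env python3

# $ cat hello.txt
# Hello World. I'm plaintext.

# $ cat hello.txt | gzip > hello.txt.gz

from struct import unpack

# 1F 8B = gzip magic number, 08 = deflate method, 00 = no header flags set.
# The flags byte can be non-zero (e.g. when gzip stores the original
# file name), so a looser check would compare only the first two bytes.
magic = (b'\x1f', b'\x8b', b'\x08', b'\x00')

for filename in ['hello.txt', 'hello.txt.gz']:
    with open(filename, 'rb') as handle:
        # unpack() needs exactly four bytes; both sample files are long enough
        s = unpack('cccc', handle.read(4))
        if s == magic:
            print(filename, 'seems gzipped')
        else:
            print(filename, 'seems not gzipped')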

# =>
# hello.txt seems not gzipped
# hello.txt.gz seems gzipped
miku
  • Since I started with a string, I didn't need to unpack anything, I just used str.startswith() to check the first four bytes to see if it matched the magic number you provided. Seems to work great. Thanks! – kkeogh Jan 21 '11 at 20:34
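The startswith() approach from the comment above could look roughly like this for data that is already in memory. This is a sketch under assumptions: `write_payload`, `directory`, and `stem` are illustrative names, the data is assumed to be bytes, and the .tgz extension follows the question even though only the gzip header is checked, not whether a tar archive is inside.

```python
import gzip
import os
import tempfile


def write_payload(data, directory, stem="payload"):
    """Write bytes to <stem>.tgz if they look gzipped, else to <stem>.txt."""
    # Only the two gzip ID bytes are compared; the bytes after them
    # (compression method, flags) can differ between gzip writers.
    extension = ".tgz" if data.startswith(b"\x1f\x8b") else ".txt"
    path = os.path.join(directory, stem + extension)
    with open(path, "wb") as handle:  # binary mode leaves the bytes untouched
        handle.write(data)
    return path


with tempfile.TemporaryDirectory() as tmp:
    print(os.path.basename(write_payload(gzip.compress(b"hello"), tmp)))  # payload.tgz
    print(os.path.basename(write_payload(b"hello", tmp)))                 # payload.txt
```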
0

You can try the mimetypes module: http://docs.python.org/library/mimetypes.html.

Here's something to play with:

import mimetypes
print(mimetypes.guess_type(filename))
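One caveat worth seeing in action: guess_type() works from the file name or URL, not from the contents, so it cannot classify a string that has no name yet. A small sketch with made-up file names:

```python
import mimetypes

# guess_type() returns a (type, encoding) pair derived from the name alone.
for name in ("hello.txt", "hello.txt.gz", "data.tgz"):
    mime_type, encoding = mimetypes.guess_type(name)
    print(name, mime_type, encoding)
```

Names ending in .gz or .tgz report the encoding 'gzip', while hello.txt reports no encoding, regardless of what the bytes in such a file would actually be.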

Good luck!

Blender