Python 2.7 pyLZMA works, Python 3.4 LZMA module does not

Question

import sys
import os
import zlib

try:
    import pylzma as lzma
except ImportError:
    import lzma

from io import StringIO
import struct

#-----------------------------------------------------------------------------------------------------------------------

def read_ui8(c):
    return struct.unpack('<B', c)[0]
def read_ui16(c):
    return struct.unpack('<H', c)[0]
def read_ui32(c):
    return struct.unpack('<I', c)[0]

def parse(input):
    """Parses the header information from an SWF file."""
    if hasattr(input, 'read'):
        input.seek(0)
    else:
        input = open(input, 'rb')

    header = { }

    # Read the 3-byte signature field
    header['signature'] = signature = b''.join(struct.unpack('<3c', input.read(3))).decode()

    # Version
    header['version'] = read_ui8(input.read(1))

    # File size (stored as a 32-bit integer)
    header['size'] = read_ui32(input.read(4))

    # Payload

    if header['signature'] == 'FWS':
        print("The opened file doesn't appear to be compressed")
        buffer = input.read(header['size'])
    elif header['signature'] == 'CWS':
        print("The opened file appears to be compressed with Zlib")
        buffer = zlib.decompress(input.read(header['size']))
    elif header['signature'] == 'ZWS':
        print("The opened file appears to be compressed with Lzma")
        # ZWS(LZMA)
        # | 4 bytes       | 4 bytes    | 4 bytes       | 5 bytes    | n bytes    | 6 bytes         |
        # | 'ZWS'+version | scriptLen  | compressedLen | LZMA props | LZMA data  | LZMA end marker |
        size = read_ui32(input.read(4))
        buffer = lzma.decompress(input.read())

    # Containing rectangle (struct RECT)

    # The number of bits used to store the each of the RECT values are
    # stored in first five bits of the first byte.

    nbits = read_ui8(buffer[0]) >> 3

    current_byte, buffer = read_ui8(buffer[0]), buffer[1:]
    bit_cursor = 5

    for item in 'xmin', 'xmax', 'ymin', 'ymax':
        value = 0
        for value_bit in range(nbits-1, -1, -1): # == reversed(range(nbits))
            if (current_byte << bit_cursor) & 0x80:
                value |= 1 << value_bit
            # Advance the bit cursor to the next bit
            bit_cursor += 1

            if bit_cursor > 7:
                # We've exhausted the current byte, consume the next one
                # from the buffer.
                current_byte, buffer = read_ui8(buffer[0]), buffer[1:]
                bit_cursor = 0

        # Convert value from TWIPS to a pixel value
        header[item] = value / 20

    header['width'] = header['xmax'] - header['xmin']
    header['height'] = header['ymax'] - header['ymin']

    header['frames'] = read_ui16(buffer[0:2])
    header['fps'] = read_ui16(buffer[2:4])

    input.close()
    return header

header = parse(sys.argv[1]);

print('SWF header')
print('----------')
print('Version:      %s' % header['version'])
print('Signature:    %s' % header['signature'])
print('Dimensions:   %s x %s' % (header['width'], header['height']))
print('Bounding box: (%s, %s, %s, %s)' % (header['xmin'], header['xmax'], header['ymin'], header['ymax']))
print('Frames:       %s' % header['frames'])
print('FPS:          %s' % header['fps'])

I was under the impression that Python 3.4's built-in lzma module works the same way as the pylzma module for Python 2.7. The code I've provided runs on both 2.7 and 3.4, but when it is run on 3.4 (which has no pylzma, so it falls back to the built-in lzma) I get the following error:

_lzma.LZMAError: Input format not supported by decoder

Why does pylzma work but Python 3.4's lzma doesn't?

I'm very surprised a similar question hasn't come up elsewhere (as far as I can tell). Could someone please shed some light? — hedgehog90, Sep 26 '15 at 11:10
I'm using python 3.5, and the error I get is `TypeError: a bytes-like object is required, not 'int'` at `line 16` for `struct.unpack(' — Marcus, Mar 18 '16 at 21:03

score 5 · Answer 1 · edited May 23 '17 at 12:10

While I do not have an answer to why the two modules work differently, I do have a solution.

I was unable to get the one-shot lzma.decompress to work, since I do not know the LZMA/XZ/SWF specs well enough; however, I did get lzma.LZMADecompressor to work. For completeness, I believe SWF LZMA uses this header format (not 100% confirmed):

Bytes  Length  Type  Endianness  Description
 0- 2  3       UI8   -           SWF Signature: ZWS
 3     1       UI8   -           SWF Version
 4- 7  4       UI32  LE          SWF FileLength aka File Size

 8-11  4       UI32  LE          SWF? Compressed Size (File Size - 17)

12     1       -     -           LZMA Decoder Properties
13-16  4       UI32  LE          LZMA Dictionary Size
17-    -       -     -           LZMA Compressed Data (including rest of SWF header)
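As a rough illustration of the layout in the table above, the first 17 bytes can be unpacked with struct. This is only a sketch: the field layout is the unconfirmed one from the table, and the function name parse_zws_prefix and the dictionary keys are made up for this example.

```python
import struct

def parse_zws_prefix(data):
    """Unpack the first 17 bytes of a ZWS file (layout as assumed in the table)."""
    if len(data) < 17:
        raise ValueError('need at least 17 bytes, got %d' % len(data))
    # Bytes 0-11: signature, version, file length, compressed size (little-endian)
    signature, version, file_length, compressed_size = struct.unpack_from('<3sBII', data, 0)
    # Bytes 12-16: LZMA decoder properties byte and dictionary size
    props, dict_size = struct.unpack_from('<BI', data, 12)
    return {
        'signature': signature.decode('ascii'),
        'version': version,
        'file_length': file_length,
        'compressed_size': compressed_size,
        'lzma_props': props,
        'lzma_dict_size': dict_size,
    }
```

Everything from byte 17 onwards is compressed data, so the rest of the SWF header (RECT, frame fields) only becomes readable after decompression.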

However, the LZMA file format spec says it should be:

Bytes  Length  Type  Endianness  Description
 0     1       -     -           LZMA Decoder Properties
 1- 4  4       UI32  LE          LZMA Dictionary Size
 5-12  8       UI64  LE          LZMA Uncompressed Size
13-    -       -     -           LZMA Compressed Data
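For comparison, the 13-byte header from the LZMA file format table can be decoded in the same way. A sketch, assuming the usual packing of the properties byte as (pb * 5 + lp) * 9 + lc and an all-ones size field meaning "unknown"; parse_lzma_alone_header and UNKNOWN_SIZE are names chosen for this example only.

```python
import struct

UNKNOWN_SIZE = 0xFFFFFFFFFFFFFFFF  # all 64 bits set, i.e. 2**64 - 1

def parse_lzma_alone_header(header):
    """Decode the 13-byte header of a legacy .lzma stream."""
    if len(header) < 13:
        raise ValueError('need at least 13 bytes, got %d' % len(header))
    # 1 byte properties, 4 bytes dictionary size, 8 bytes uncompressed size
    props, dict_size, size = struct.unpack_from('<BIQ', header, 0)
    return {
        'lc': props % 9,          # literal context bits
        'lp': (props // 9) % 5,   # literal position bits
        'pb': props // 45,        # position bits
        'dict_size': dict_size,
        # None stands for "size unknown, rely on the end marker"
        'uncompressed_size': None if size == UNKNOWN_SIZE else size,
    }
```

The SWF layout has the first five of these bytes but not the 8-byte size field, which is the difference between the two tables.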

I was never able to work out what Uncompressed Size should be (if it can even be defined for this format). pylzma does not seem to care about it, while Python 3.3 lzma does. However, an explicitly unknown size seems to work; it is specified as a UI64 with all bits set (2^64 - 1), e.g. 8*b'\xff' or 8*'\xff'. So, by shuffling the headers around a bit, instead of using:

buffer = lzma.decompress(input.read())

Try:

d = lzma.LZMADecompressor(format=lzma.FORMAT_ALONE)
buffer = d.decompress(input.read(5) + 8*b'\xff' + input.read())
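To check the header shuffle without an actual SWF file, the SWF-style layout can be imitated by compressing some bytes in the legacy .lzma format and cutting out the 8-byte size field. This is a sketch, assuming the SWF payload is otherwise a normal LZMA stream with an end marker; decompress_swf_lzma and the sample data are hypothetical names for this example.

```python
import lzma

def decompress_swf_lzma(body):
    """Decompress SWF-style LZMA data.

    `body` must start at the 5 LZMA property bytes, i.e. after the
    8-byte SWF header and the 4-byte compressed-size field.
    """
    # Re-insert the 8-byte "uncompressed size" field that the .lzma
    # format expects, filled with 0xFF to mean "unknown size".
    alone_stream = body[:5] + 8 * b'\xff' + body[5:]
    d = lzma.LZMADecompressor(format=lzma.FORMAT_ALONE)
    return d.decompress(alone_stream)

# Imitate the SWF layout: compress in .lzma format, then drop bytes 5-12.
payload = b'fake SWF body ' * 50
alone = lzma.compress(payload, format=lzma.FORMAT_ALONE)
swf_style = alone[:5] + alone[13:]

restored = decompress_swf_lzma(swf_style)
```

If restored equals payload, the re-inserted size field is all the built-in module needed to accept the stream.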

Note: I had no local Python 3 interpreter available, so I only tested it online with a slightly modified read procedure; it might not work out of the box.

Edit: Confirmed to work in Python 3, although some things needed to be changed, such as the unpack problem Marcus mentioned (easily solved by using buffer[0:1] instead of buffer[0]). It's not really necessary to read the whole file either; a small chunk, say 256 bytes, should be enough to read the whole SWF header. The frames field is a bit quirky too, though I believe all you have to do is some bit shifting, i.e.:

header['frames'] = read_ui16(buffer[0:2]) >> 8
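The buffer[0:1] change matters because indexing a bytes object gives an int in Python 3, while slicing still gives bytes, which is what struct.unpack needs. A small sketch using the question's read_ui8 helper; the sample bytes are made up and do not come from a real SWF file.

```python
import struct

def read_ui8(c):
    return struct.unpack('<B', c)[0]

buffer = b'\x78\x00\x05\x5f'   # made-up bytes, not from a real SWF

# buffer[0] is an int on Python 3 (and a 1-char str on Python 2),
# so struct.unpack rejects it on Python 3. A slice is bytes on both.
first_byte = buffer[0:1]
nbits = read_ui8(first_byte) >> 3   # top five bits of the first byte
```

The same slicing applies to every place in the question's code that passes buffer[0] to read_ui8.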

SWF file format spec

LZMA file format spec
