blown up .sav file size using haven::write_sav()

Question

I am writing SPSS .sav files from R using the package haven, which works very well for me in general. However, I have noticed that the .sav files written to disk by write_sav() seem to be much bigger than necessary. Whenever I open a .sav file written by write_sav() in SPSS and save it again, the file size is reduced by a factor of up to ~10!

This matters to me because I am writing rather big data sets to SPSS for others, and sometimes SPSS refuses to open a very big file. Maybe this problem would not arise if write_sav() stored the data more efficiently, in a "real" native SPSS way?

Has anyone run into this issue and can comment on it? Note that an SPSS installation is needed to reproduce it.

score 0 · Answer 1 · answered Dec 11 '17 at 12:00

It's not clear from the haven write_sav() documentation, but it sounds like it is saving uncompressed .sav files. The default for (most) SPSS installations is to save compressed files. SPSS has an extra compression option, 'zCompressed', which produces even smaller files, but these generally can't be opened outside of SPSS.

You can experiment with this in SPSS syntax like so:

Save outfile = 'Uncompressed file.sav'
    /UnCompressed.
Save outfile = 'Compressed file.sav'
    /Compressed.
Save outfile = 'ZCompressed file.zsav'
    /ZCompressed.

Note the .zsav file extension isn't necessary (could be .sav) but it's considered best practice to use this to make it clear where compatibility might be an issue.

See https://www.ibm.com/support/knowledgecenter/en/SSLVMB_21.0.0/com.ibm.spss.statistics.help/syn_save_compressed_uncompressed.htm for more info.

score 0 · Answer 2 · answered Nov 13 '19 at 20:39

What form does your actual data take? Is it Codepage or Unicode, and what is haven doing? Since SPSS 16.0 and the introduction of the UNICODE setting, string field widths are tripled when converting from Codepage to Unicode. This is a pain best suffered only once. Get your data to Unicode and then stay there.

See https://www.ibm.com/support/knowledgecenter/SSLVMB_26.0.0/statistics_reference_project_ddita/spss/base/syn_set_unicode.html for more information.
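To see why widths can grow when moving to Unicode, here is a small base R sketch that compares character counts with UTF-8 byte counts. The example strings are made up, and linking the tripling to the "up to three bytes per character" property of UTF-8 is my assumption, not something taken from the IBM page.

```r
# Hypothetical example strings: plain ASCII, one accented letter, two CJK characters
x <- enc2utf8(c("Zurich", "Z\u00fcrich", "\u6771\u4eac"))

widths <- data.frame(
  chars = nchar(x, type = "chars"),  # characters as a reader counts them
  bytes = nchar(x, type = "bytes")   # bytes needed to store them in UTF-8
)
print(widths)
```

ASCII text needs one byte per character, while the CJK characters need three each, so a width defined in bytes has to allow for the worst case.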

score 0 · Answer 3 · answered Nov 13 '19 at 23:19

If the output size is a problem, you could have a look at my package readspss. Using compression and zsav, you should be able to get the best available compression. Compression in sav files depends on how the file is written. SPSS has different compression methods for storing numeric information. Numerics can be stored only as doubles (no compression) or as a mix of doubles and int8_t (compression 1). Zsav uses zlib to compress whatever the initial input was (compression 2). Eight int8_t values take the same space as one double, hence the difference in file size.
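The size arithmetic can be sketched in base R. This is only an illustration of the byte counts, not the actual .sav layout, and the vector x is hypothetical data; memCompress(type = "gzip") stands in for the zlib step.

```r
# Hypothetical data: 1000 small integer codes, as is typical for survey items
x <- rep(1:5, times = 200)

as_doubles <- writeBin(as.double(x), raw())              # 8 bytes per value
as_bytes   <- writeBin(as.integer(x), raw(), size = 1L)  # 1 byte per value
deflated   <- memCompress(as_doubles, type = "gzip")     # zlib applied to the doubles

sizes <- c(
  doubles  = length(as_doubles),
  bytes    = length(as_bytes),
  deflated = length(deflated)
)
print(sizes)
```

The first two entries differ by exactly a factor of eight, which is in the same range as the reduction reported in the question.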

score 0 · Answer 4 · answered Sep 06 '20 at 23:15

There are three variants of the SPSS (.sav) file format:

1. Uncompressed (.sav). This is haven's default output, but it is rarely used in my experience.
2. Compressed (.sav). This is what most people use, and it has been the default save format for SPSS for many, many years.
3. Zcompressed (.zsav, but sometimes .sav). Added to SPSS a few years ago, but it doesn't seem to be used much. You can get this from haven by adding compress=TRUE to write_sav().

I have submitted a pull request to make the compressed (2) format the default.
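A minimal sketch for comparing the two variants haven can write with a logical compress argument. The data frame is made up for illustration and the files go to temporary paths. As far as I know, newer haven releases also accept string values such as "byte" for compress, so check ?write_sav for your installed version.

```r
library(haven)

# Hypothetical example data: small integer codes, which compress well
set.seed(1)
df <- data.frame(
  id    = 1:50000,
  item1 = sample(1:5, 50000, replace = TRUE),
  item2 = sample(1:5, 50000, replace = TRUE)
)

plain <- tempfile(fileext = ".sav")
zsav  <- tempfile(fileext = ".zsav")

write_sav(df, plain, compress = FALSE)  # uncompressed
write_sav(df, zsav,  compress = TRUE)   # zsav (zlib) compression

sizes <- c(uncompressed = file.size(plain), zsav = file.size(zsav))
print(sizes)
```

If the recipients' software cannot read .zsav files, the compressed (2) variant is the safer choice.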
