
Background

One of our SQL Server 2012 databases is getting fairly large, at least compared to our other databases. While running some queries, I noticed that we are currently storing large amounts of XML/HTML data in one of the columns, and that this column is the main contributor to the size of the database.

It occurred to me that the data in this column could be compressed in C# prior to storage. I found some compression methods and was able to verify that they work correctly.

In my testing, compressing a 20,000-character HTML string in C# took 1-3 milliseconds, and decompressing it took a similar amount of time.
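The question's C# code isn't shown, but the compress-before-store round trip looks like this. The sketch below uses Python's `gzip` module purely for illustration (C#'s `GZipStream` behaves analogously); the HTML payload is a made-up stand-in for the real column data:

```python
import gzip

# Hypothetical ~20,000-character HTML payload, standing in for the
# real column data from the question.
html = "<div class='row'><span>example cell</span></div>" * 420

raw = html.encode("utf-8")            # encode the text to bytes first
packed = gzip.compress(raw)           # these bytes are what you'd store
restored = gzip.decompress(packed).decode("utf-8")

assert restored == html               # lossless round trip
print(len(raw), len(packed))          # repetitive markup compresses very well
```

Because HTML is highly repetitive, the compressed payload is typically a small fraction of the original size, which is where the transfer and storage savings come from.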

So, my thought is that compressing and decompressing the xml/html strings in C# would have the following advantages:

  1. Reduced amount of data to transfer TO the database
  2. Reduced amount of storage required IN the database.
  3. Reduced amount of data to transfer FROM the database.

All this would be offset only by a few milliseconds of compression and decompression time on the web servers.

The Question:

Are there any pitfalls to employing this strategy of which I need to be aware?

Narnian
    I don't mind the downvote, but would appreciate a comment as to where I should go for an answer to this. – Narnian Sep 23 '13 at 14:03
    Other than the additional level of complexity and the inability to read the text using other tools (like SSMS). But if those issues aren't a problem for you, then I can't see any reason not to. – Pete Sep 23 '13 at 14:03
    Depending on the version of SQL Server, you could use its native compression. This would take the workload off the web servers and put it on the SQL Server - whether that is useful depends on the workload of each machine. – Hannah Vernon Sep 23 '13 at 14:03
    Just make certain you are using a lossless compression method. (Sorry if that insults your intelligence.) If you tested the ones in your link, in both directions, then they're probably OK. – dmm Sep 23 '13 at 14:11
  • @MaxVernon Thanks. I will look into that. We're using SQL Server 2012, so this is, in fact, available. – Narnian Sep 23 '13 at 14:11
  • @user2654818 Yes, I tested both ways to make sure nothing was lost. Thank you for mentioning that, though. No insult taken... it's better to ask just to make sure. – Narnian Sep 23 '13 at 14:11
  • @user2654818 - I'm interested to know more about this non-lossless text compression. Never heard of it... where is it used? Or did you mean a hash function? – Hogan Sep 23 '13 at 19:38

2 Answers


Since you are using SQL Server 2012, you could enable row-level or page-level compression on the table in question. This provides transparent compression, letting you focus on the needs of the client instead of the needs of the storage system.

Microsoft has an excellent document on compression at http://technet.microsoft.com/en-us/library/cc280449.aspx

ALTER TABLE <table_name>
REBUILD
WITH (DATA_COMPRESSION = PAGE);  -- or DATA_COMPRESSION = ROW

A fairly serious bonus of compression is reduced memory usage on the SQL Server. Compressing data pages in the database also results in the pages remaining compressed in memory.

Glenn Berry has an excellent article on estimating the benefits of compression at http://www.sqlskills.com/blogs/glenn/estimating-data-compression-savings-in-sql-server-2012/

Hannah Vernon

This will work, yes. The code linked has two problems:

  1. It uses Unicode (UTF-16) encoding, which is wasteful for common character sets. Use UTF-8 instead.
  2. The BCL compressor is known to be very poor. Experiment with a better library such as SharpZipLib. If you want to spend a lot of CPU to go very small, use LZMA (free C# code is available); it is state-of-the-art (mainstream) compression.
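Point 1 is easy to quantify. In .NET, `Encoding.Unicode` is UTF-16, which spends two bytes on every ASCII character; the Python sketch below (using a made-up markup string) shows the pre-compression size difference:

```python
# Hypothetical ASCII-only markup, standing in for typical HTML content.
text = "<p>mostly-ASCII markup</p>" * 100

utf16 = text.encode("utf-16-le")  # what .NET's Encoding.Unicode produces
utf8 = text.encode("utf-8")       # one byte per ASCII character

# For ASCII-dominated text, UTF-16 is exactly twice the size before
# compression even begins.
print(len(utf16), len(utf8))
```

The compressor will squeeze out much of that redundancy again, but starting from half the bytes still helps both CPU time and the final compressed size.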

Note that SQL Server's data compression feature does not affect blobs, so it is not an alternative here. AFAIK it also does not compress redundancies within a single string; it only compresses common prefixes across strings. I'm not sure about that, though.

If you were to compress multiple blobs at once, you would get much better compression. That would rule out a row-by-row approach, though.
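The multi-blob point can be demonstrated with a small experiment. The sketch below (Python's `zlib`, with invented HTML fragments that share boilerplate the way rows in such a column typically do) compares compressing rows individually against compressing them as one batch:

```python
import zlib

# Ten hypothetical HTML fragments that share most of their boilerplate.
rows = [
    f"<html><body><table id='t{i}'><tr><td>value {i}</td></tr></table></body></html>"
    for i in range(10)
]

# Row-by-row: each blob pays its own header and rebuilds its own dictionary.
individually = sum(len(zlib.compress(r.encode("utf-8"))) for r in rows)

# Batched: cross-row redundancy is compressed away once.
together = len(zlib.compress("".join(rows).encode("utf-8")))

print(individually, together)
```

The batched total comes out much smaller, because the shared markup is encoded once instead of ten times - but, as the answer notes, you then lose the ability to read or update a single row without unpacking the whole batch.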

usr
  • The OP is only asking about HTML/text data which certainly can be compressed in SQL Server. – Hannah Vernon Sep 23 '13 at 14:23
  • @MaxVernon it can be compressed by the data compression feature? SQL Server data compression does not use a general purpose algorithm like LZ. It is very special purpose. Here are some details: http://technet.microsoft.com/en-us/library/cc280464.aspx http://sqlblog.com/blogs/hugo_kornelis/archive/2013/01/31/why-does-sql-server-not-compress-data-on-lob-pages.aspx The latter also explains why page compression could not work for blobs. I guess it is worth a test. – usr Sep 23 '13 at 14:24
  • I suggest you run the sp_estimate_data_compression_savings stored proc to estimate the savings for any given table. – Hannah Vernon Sep 23 '13 at 14:31
  • 1
    Thanks for the tip on the Unicode compression vs UTF8! – Narnian Sep 23 '13 at 14:51