Big Data – Mount Vernon Data Systems

Data compression is a technology that’s been around since SQL Server 2008. The idea of data compression is that you can selectively choose tables, indexes, or partitions within a database and shrink the amount of space used to store the data.

Now, before you get too excited about this scalability and performance feature, at the time of this writing data compression is limited to the Enterprise Edition of all versions of SQL Server which are currently on the market. The Developer Edition is a single-user SQL Server Enterprise, and it, too, supports data compression.

If you’re still reading, then you’re probably using SQL Server Enterprise, and if you need to squeeze a bit more space out of your disk farm, then you might want to consider data compression.

With the advent of SQL Server 2014, we now speak of data in storage as either “rowstore” or “columnstore”. If you’re managing columnstore tables and indexes then that’s already using columnstore compression, which you can’t configure. The tips and reasoning in this posting reference only rowstore (or regular, as you’ve always know it) data storage.

There are plenty of “how to” posts on SQL Server data compression, but teasing out the “should I or shouldn’t I” is a little trickier. Some of the better blog posts and article have the “should I or shouldn’t I” points interspersed with descriptions of how the technology works. Other posts just show you how to compress data, and don’t really take into consideration the pluses and minuses of data compression.

First, data compression takes resources, lots of resources – CPU and memory, specifically – so if your SQL Server is already CPU-bound or memory-constrained, forget this idea. Go get more memory, faster CPUs, or tune the queries, but don’t add more load to the SQL Server, or your DBA career will be history.

Assuming that you’ve got unused CPU (in the 10-20% range) and available memory capacity, then you’ll want to consider the type of data compression to use – row or page.

Row compression compresses data records, effectively squeezing more records into a single page, so that with each physical I/O you’re reading more records into memory than if the data were uncompressed. That’s highly efficient, since the single most expensive operation in a conventional computer is physical I/O (excluding SSDs and flash drives).

So how does it work?

Glad you asked. Every data row in storage has record headers which contain metadata – information about each field in the record, its length and offset, etc. – this is called row overhead. Sometimes the row or metadata overhead takes more space than the data. Row compression reduces the amount of space required to store the metadata overhead (Microsoft isn’t telling us how it does this), plus it changes the data types in storage. Say, for instance, you have a table with several fixed-length columns, both character and numeric. Row compression will change those data types from fixed to variable-length. Char, integer, decimal, float, datetime, and money data types are all converted to their variable-length equivalents, sometimes resulting in massive space savings. The details for each data type and how they’re converted are covered at https://msdn.microsoft.com/en-us/library/cc280576.aspx.

Page compression enhances row compression by adding prefix compression and then dictionary compression to row compression…so, in order, a page compression operation first compresses the row (row compression, converting fixed-length data types to variable-length and minimizing the metadata overhead); then it determines a valid prefix for data on each page and compresses using that algorithm (prefix compression); then it looks for repeating values on each page, and compresses using that algorithm(dictionary compression).

To get a better idea of how page compression works, a picture is literally worth a thousand words. I’ve borrowed these illustrations from the MSDN library, https://msdn.microsoft.com/en-us/library/cc280464.aspx. Thank you, MSDN.

An uncompressed page of data.

This first figure represents a page of data in storage. Prefix and dictionary compression work page-by-page, determining the best compression “operator” for each page, storing these operators in something called a “CI” structure that is the last entry in each page header.

Notice that the data in this example page contains commonalities – many of the data values start with the same characters (prefix). This might be a common condition with employee ID values, for instance, especially if the table were clustered by these ID values.

A page of data with prefix (step 1 of page) compression applied.

The next figure shows what the data values look like in storage after SQL Server has gone through and done prefix compression. The prefix operators are stored at the end of the page header. The first column, first row, can be decompressed by following the pointer (4) and adding the first four characters of the associated prefix operator to the remaining character in the column, thereby rebuilding the original data value (aaabb). Of course, the pointers are going to be much more complex than this illustration – but you get the idea. The storage savings are pretty phenomenal, close to 25% by my simplistic calculations for this random data set. If you allow a single byte for each character in the data row, and another character for each prefix value in the CI section of the page header, the uncompressed data = 45 bytes, the compressed data = 34 bytes, a 25% reduction.

A page of data after dictionary (step 2 of page) compression has been applied.

Following prefix compression, SQL Server then applies dictionary compression, as it looks for repeating values anywhere on the page and replaces them with a dictionary compression operator and pointers, storing these in the CI structure of the page header.

If your eyes haven’t crossed by now, you’ll notice that the compressed byte count has dropped to 31, or maybe 29, depending on how those zeros in the top row are handled. Nevertheless, the storage reduction is somewhere between 35% and 40%, which could be reason enough to try implementing page data compression.

Bottom line: the less randomized the data set, the more consistent the prefixes, and the more common strings in each data value, then the greater the storage savings that you can expect.

When does data get uncompressed?

While the answer to this question doesn’t strictly fit with the reason I’m writing this post, I’m glad you asked, because it shows you’re curious, and that’s good.

Data does not get uncompressed in memory – there is no “clear text” version of the data that some creepy hacker trolling memory could get hold of. Data is compressed on disk and it stays compressed while in memory until one of two things happen: data is read as part of a filter, sort, join or query response, or the data is updated – and then it’s not the entire page that’s uncompressed, just the data fields which are being read or updated.

We’ve talked about how uncompressing data consumes CPU, but those extra CPU cycles may be compensated for by the reduction in both physical and logical I/O. Because more rows of compressed data fit on a single page, more data is read into memory with each physical I/O, resulting in fewer I/Os overall. More data can remain cached in the buffer pool, and smaller volumes of data are written back to the disk. If the data is already in memory, fewer logical I/Os result in less work for the CPUs – possibly compensating for the cost of the CPUs compressing and uncompressing the data. The best savings in physical and logical I/O is during table and index scans. Using execution plans to determine the predominant query resolution operation (scans v. lookups) will help you determine best candidates for compression.

What’s a good candidate?

Now that you understand how data compression works – what’s happening internally, even at a very high level, you can better appreciate the CPU and memory overhead that’s required to decompress the data so that it can be delivered back to the calling applications in some understandable form. So, which situations are best suited for data compression, and which are not?

The ideal data set for data compression – other than having the characteristics I’ve just cited above – is data that is write once, read many. The best candidates are large tables and indexes where the scan-to-update operation is roughly a 3:1 or 4:1 ratio, or higher. If operations on a large table are 75% (or more) scans and 20% (or fewer) updates, then this may be a very good candidate for data compression. Data warehouses, and fact tables, especially, may be excellent candidates for data compression.

This is data compression, so patterns in the data matter. If you have a table with numeric/decimal or fixed-length character data types, where most values don’t need all the space allocated by the data type definition – these columns will row-compress very nicely. Nullable columns where most of the rows are NULL, and columns where the data values repeat (state codes, for instance), are also going to compress well.

What doesn’t compress so well? Filestream data; text and image data types (which, incidentally, are being deprecated with SQL Server 2014) – anything that is stored out of row (data in overflow areas is not compressed); repeating strings within the data value, but without repeating prefixes; data with high variability and little repeatability; variable-length data types.

What can you compress? Anything in the following list that meets the above criteria…

A table with no organization (“heap” structure);
A table organized as a clustered index;
A nonclustered index;
An indexed view (remember, these do persist on disk);
Partitioned tables and indexes (each partition can be configured differently);

Where are the gotchas?

Data compression is done one time, to whichever data sets (tables, indexes, partitions) on whichever SQL Server instances (production, development, or test) you choose. So what happens if you create a new index on a compressed table – do you think that the new index “inherits” the data compression?

Nope.

That’s right, now you’ve got a table that’s compressed but an index on that table that’s not. It might take you a little while to figure out why the index you just created isn’t giving you the performance boost you were expecting. Well, you say, I can just check with SQL Server Management Studio and to see what’s going on.

Table person.contact has been page compressed but its non-clustered indexes have not – can you tell?

Not easily you can’t.

In the figure to the right, the table Person.Contact is page-compressed; the clustered index PK_Contact_ContactID is also (well, yes, since the data is stored at the leaf level of the clustered index), but the other indexes are not.

It would be great if compressed objects would turn blue or something like the compressed files in Windows Explorer, but that doesn’t happen.

So here’s how you find out what’s compressed, and what’s not. When you execute this code…

SELECT t.name AS [Table], i.name AS [Index], i.type_desc AS IndexType, p.partition_number AS [Partition], p.data_compression_desc AS [Compression]

FROM sys.partitions p

INNER JOIN sys.tables t ON t.object_id = p.object_id

INNER JOIN sys.indexes i ON i.object_id = p.object_id AND i.index_id = p.index_id

ORDER BY t.name, i.type_desc, i.name

…then you get a list of tables, indexes and partitions, and their compression status. It’s much, much easier than clicking through SQL Server Management Studio, checking Storage>Manage Compression for each object.

Back to the issue at hand, and that’s keeping everyone and everything synched. Having your dev and test environments compressed, but your production environment uncompressed is going to produce some interesting and unexpected results in production. The worst case is to have a scattering of objects in some environments compressed, but the same objects in other environments uncompressed. Better to have them all the same.

A recommendation – when embracing data compression, adopt a process something like:

Determine which objects (tables, indexes, partitions) need compression;
Compress in all environments (production, development, test);
Stage the compressions (don’t do them all at once) and compress during low-activity time periods;
Institute a change management process so that objects which are added after initial compression can also be compressed, if needed;
Automate as much of this process so that you’re generating reports periodically, then review those reports to ensure that your systems are in sync.

The next “gotcha” is the case where compressing the data actually causes storage to increase. As you saw at the beginning of this posting, data compression adds extra space to each page header – mostly this is the metadata that decodes the prefix and dictionary portions of page compression. If the data values are too varied, if there’s not a consistency in the first or “prefix” portion of the strings, if there are minimal repeating value strings within the data sets, then it’s possible that this metadata overhead, as it’s called, can be greater than the space saved. In this case, you’ll want to back out of page compression and apply just row compression. If that doesn’t alleviate the problem, uncompress the data completely, and start campaigning for SSD storage devices.

Combining compression

I’ll bet you’re asking yourself if you should bother with compression when you’re doing SQL Server backups now that you’ve got data compression turned on. The short answer is yes.

Backup compression is quite different from data compression; its algorithm is a lot like zipping a file. Data compression and backup compression co-exist nicely, and can result in additional storage space savings, so go ahead and indulge.

There are most certainly some of you who are using SQL Server data compression and you’re considering trading in your spindle-based hard disks for SSDs or flash drive arrays, now that the price of SSDs is dropping while the reliability is skyrocketing. The best advice that I can offer you, based on colleagues who are already doing this, is to keep the data compression in place. Not only are they stuffing each I/O with more records than if the data were uncompressed, they’re now circumventing the I/O bottleneck. And they’re doing the DBA happy dance.

This all sounds pretty cool, but as with all things SQL Server, there are tradeoffs. The greater the compression, the greater the load on CPU (and memory) as data is exchanged with the calling application. If the apps generate a lot of backing and forthing between the client and the SQL Server, with continuous data updating, this could be a very negative feature. If you’re already CPU-bound, if you can’t spare 10-15% CPU capacity, then don’t try this. If you’re already maxed out with memory, address that issue first, and then try data compression. But, if you have lots of unused CPU capacity, memory to spare, and you know that you’re I/O bound, data compression is something that may give your SQL Servers (and you) the relief needed.

Thanks for reading this blog, I hope this has helped answer questions that you might have had about when and when not to use SQL Server data compression. If you’d like to learn more about our products and services, give me a call today at +1 303.526.1839 or click over to www.MountVernonDataSystems.com. If you’re new at the database administration game, check out Coach-in-a-Can® at https://mountvernondatasystems.com/coach-in-a-can.html.

Michelle Poolet, The Coach

Friday, 13 March, 2015

SQL Server Data Compression – Right for You?

So how does it work?

When does data get uncompressed?

What’s a good candidate?

Where are the gotchas?

Combining compression

BIG DATA, TO MODEL OR NOT TO MODEL?