SQL Server Data Compression – Right for You?

Data compression is a technology that’s been around since SQL Server 2008. The idea of data compression is that you can selectively choose tables, indexes, or partitions within a database and shrink the amount of space used to store the data.

Now, before you get too excited about this scalability and performance feature, at the time of this writing data compression is limited to the Enterprise Edition of every version of SQL Server currently on the market. (The Developer Edition is essentially a single-user Enterprise Edition, so it, too, supports data compression.)

If you’re still reading, then you’re probably using SQL Server Enterprise, and if you need to squeeze a bit more space out of your disk farm, then you might want to consider data compression.

With the advent of SQL Server 2014, we now speak of data in storage as either “rowstore” or “columnstore”. If you’re managing columnstore tables and indexes, that data is already stored with columnstore compression, which you can’t configure. The tips and reasoning in this post apply only to rowstore (or regular, as you’ve always known it) data storage.

There are plenty of “how to” posts on SQL Server data compression, but teasing out the “should I or shouldn’t I” is a little trickier. Some of the better blog posts and articles have the “should I or shouldn’t I” points interspersed with descriptions of how the technology works. Other posts just show you how to compress data, without really weighing the pluses and minuses of data compression.

First, data compression takes resources, lots of resources – CPU and memory, specifically – so if your SQL Server is already CPU-bound or memory-constrained, forget this idea. Go get more memory, faster CPUs, or tune the queries, but don’t add more load to the SQL Server, or your DBA career will be history.

Assuming that you’ve got unused CPU (in the 10-20% range) and available memory capacity, then you’ll want to consider the type of data compression to use – row or page.
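Before committing to either, you can ask SQL Server to estimate the savings for a specific object with the sp_estimate_data_compression_savings procedure. A minimal sketch, using the AdventureWorks Person.Contact table (which appears again later in this post) as a stand-in for your own; run it once with 'ROW' and once with 'PAGE' to compare:

-- Estimate how much space row compression would save for one table.
-- Run it a second time with @data_compression = 'PAGE' to compare.
EXEC sp_estimate_data_compression_savings
     @schema_name      = 'Person',
     @object_name      = 'Contact',
     @index_id         = NULL,   -- NULL = all indexes on the table
     @partition_number = NULL,   -- NULL = all partitions
     @data_compression = 'ROW';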

Row compression compresses data records, effectively squeezing more records into a single page, so that with each physical I/O you’re reading more records into memory than if the data were uncompressed. That’s highly efficient, since the single most expensive operation in a conventional computer is physical I/O (excluding SSDs and flash drives).

So how does it work?

Glad you asked. Every data row in storage has a record header containing metadata – information about each field in the record, its length and offset, and so on – which is called row overhead. Sometimes this metadata overhead takes more space than the data itself. Row compression reduces the amount of space required to store the metadata overhead (Microsoft isn’t telling us exactly how it does this), plus it changes how the data types are stored. Say, for instance, you have a table with several fixed-length columns, both character and numeric. Row compression will store those data types in a variable-length format. Char, integer, decimal, float, datetime, and money data types are all converted to their variable-length equivalents, sometimes resulting in massive space savings. The details for each data type and how they’re converted are covered at https://msdn.microsoft.com/en-us/library/cc280576.aspx.
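Turning row compression on is a single rebuild operation. A minimal sketch, again using Person.Contact as a stand-in for your own table:

-- Rebuild the table (its heap or clustered index) with row compression.
ALTER TABLE Person.Contact REBUILD WITH (DATA_COMPRESSION = ROW);
-- Nonclustered indexes are compressed separately (more on that gotcha below).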

Page compression enhances row compression by adding prefix compression and then dictionary compression. In order, a page compression operation first compresses the rows (row compression, converting fixed-length data types to variable-length storage and minimizing the metadata overhead); then it determines a suitable prefix for the data in each column on the page and compresses using that (prefix compression); then it looks for repeating values anywhere on the page and compresses those as well (dictionary compression).
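Enabling page compression looks just the same; the rebuild simply names PAGE instead of ROW. A minimal sketch, with the same stand-in table:

-- Page-compress the table itself...
ALTER TABLE Person.Contact REBUILD WITH (DATA_COMPRESSION = PAGE);

-- ...and page-compress all of its indexes in one pass.
ALTER INDEX ALL ON Person.Contact REBUILD WITH (DATA_COMPRESSION = PAGE);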

To get a better idea of how page compression works, a picture is worth a thousand words. I’ve borrowed these illustrations from the MSDN library, https://msdn.microsoft.com/en-us/library/cc280464.aspx. Thank you, MSDN.

[Figure: An uncompressed page of data.]

This first figure represents a page of data in storage. Prefix and dictionary compression work page by page, determining the best compression “operators” for each page and storing them in something called a “CI” (compression information) structure that sits right after the page header.

Notice that the data in this example page contains commonalities – many of the data values start with the same characters (prefix). This might be a common condition with employee ID values, for instance, especially if the table were clustered by these ID values.

[Figure: A page of data with prefix compression (step 1 of page compression) applied.]

The next figure shows what the data values look like in storage after SQL Server has gone through and done prefix compression. The prefix values are stored in the CI structure just after the page header. The first column, first row, can be decompressed by following the pointer (4) and adding the first four characters of the associated prefix to the remaining character in the column, thereby rebuilding the original data value (aaabb). Of course, the pointers are going to be much more complex than this illustration – but you get the idea. The storage savings are pretty phenomenal, close to 25% by my simplistic calculations for this random data set. If you allow a single byte for each character in the data rows, and a byte for each character of the prefix values in the CI section, the uncompressed data = 45 bytes and the compressed data = 34 bytes, roughly a 25% reduction.

[Figure: A page of data after dictionary compression (step 2 of page compression) has been applied.]

Following prefix compression, SQL Server applies dictionary compression: it looks for repeating values anywhere on the page and replaces them with pointers into a dictionary, storing the dictionary entries in that same CI structure.

If your eyes haven’t crossed by now, you’ll notice that the compressed byte count has dropped to 31, or maybe 29, depending on how those zeros in the top row are handled. Nevertheless, the storage reduction is somewhere between 35% and 40%, which could be reason enough to try implementing page data compression.

 

Bottom line: the less randomized the data set, the more consistent the prefixes, and the more repeated strings within the data values, the greater the storage savings you can expect.

When does data get uncompressed?

While the answer to this question doesn’t strictly fit with the reason I’m writing this post, I’m glad you asked, because it shows you’re curious, and that’s good.

Data does not get uncompressed on its way into memory – there is no “clear text” version of the data that some creepy hacker trolling memory could get hold of. Data is compressed on disk and it stays compressed in memory until one of two things happens: the data is read as part of a filter, sort, join, or query response, or the data is updated – and even then it’s not the entire page that’s uncompressed, just the data fields that are being read or updated.

We’ve talked about how uncompressing data consumes CPU, but those extra CPU cycles may be compensated for by the reduction in both physical and logical I/O. Because more rows of compressed data fit on a single page, more data is read into memory with each physical I/O, resulting in fewer I/Os overall. More data can remain cached in the buffer pool, and smaller volumes of data are written back to the disk. If the data is already in memory, fewer logical I/Os mean less work for the CPUs – possibly compensating for the cost of compressing and uncompressing the data. The biggest savings in physical and logical I/O come from table and index scans, so using execution plans to determine the predominant query resolution operation (scans vs. lookups) will help you identify the best candidates for compression.

What’s a good candidate?

Now that you understand how data compression works – what’s happening internally, even at a very high level – you can better appreciate the CPU and memory overhead required to decompress the data so that it can be delivered back to the calling applications in an understandable form. So, which situations are best suited for data compression, and which are not?

The ideal data set for data compression – other than having the characteristics I’ve just cited above – is data that is written once and read many times. The best candidates are large tables and indexes where the scan-to-update ratio is roughly 3:1 or 4:1, or higher. If operations on a large table are 75% (or more) scans and 20% (or fewer) updates, then it may be a very good candidate for data compression. Data warehouses, and especially fact tables, may be excellent candidates for data compression.
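If you want to put rough numbers behind that ratio, SQL Server keeps cumulative counters for every index. A minimal sketch (the counters reset when the instance restarts, so treat the results as a sample of your workload, not gospel):

-- Scan activity versus write activity, per index, in the current database.
SELECT o.name AS [Table],
       i.name AS [Index],
       ios.range_scan_count,
       ios.leaf_insert_count + ios.leaf_update_count + ios.leaf_delete_count AS leaf_write_count
FROM sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL) AS ios
INNER JOIN sys.objects o ON o.object_id = ios.object_id
INNER JOIN sys.indexes i ON i.object_id = ios.object_id AND i.index_id = ios.index_id
WHERE o.is_ms_shipped = 0
ORDER BY ios.range_scan_count DESC;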

This is data compression, so patterns in the data matter. If you have a table with numeric/decimal or fixed-length character columns where most values don’t need all the space allocated by the data type definition, those columns will row-compress very nicely. Nullable columns where most of the rows are NULL, and columns where the data values repeat (state codes, for instance), are also going to compress well.

What doesn’t compress so well? Filestream data; text and image data types (which, incidentally, have been deprecated for several versions now); anything that is stored out of row (data in overflow areas is not compressed); data with repeating strings within the values but no repeating prefixes; data with high variability and little repeatability; and data already stored in variable-length data types.

What can you compress? Anything in the following list that meets the above criteria…

  • A table with no organization (“heap” structure);
  • A table organized as a clustered index;
  • A nonclustered index;
  • An indexed view (remember, these do persist on disk);
  • Partitioned tables and indexes (each partition can be configured differently; see the sketch just after this list).
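For that last bullet, the compression setting is applied partition by partition, which is handy when only the older, read-mostly partitions of a big fact table are good candidates. A minimal sketch, with a hypothetical partitioned table and partition numbers:

-- Page-compress only partition 3 of a (hypothetical) partitioned fact table.
ALTER TABLE dbo.FactSales REBUILD PARTITION = 3 WITH (DATA_COMPRESSION = PAGE);

-- Leave the current, write-heavy partition at row compression.
ALTER TABLE dbo.FactSales REBUILD PARTITION = 4 WITH (DATA_COMPRESSION = ROW);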

Where are the gotchas?

Data compression is applied one time, to whichever data sets (tables, indexes, partitions) you choose, on whichever SQL Server instances (production, development, or test) you choose. So what happens if you create a new index on a compressed table – do you think that the new index “inherits” the data compression?

Nope.

That’s right: now you’ve got a table that’s compressed but an index on that table that’s not. It might take you a little while to figure out why the index you just created isn’t giving you the performance boost you were expecting. Well, you say, I can just check with SQL Server Management Studio to see what’s going on.

[Figure: Table Person.Contact has been page-compressed, but its nonclustered indexes have not – can you tell?]

Not easily you can’t.

In the figure, the table Person.Contact is page-compressed; the clustered index PK_Contact_ContactID is also compressed (well, yes, since the data is stored at the leaf level of the clustered index), but the other indexes are not.

It would be great if compressed objects would turn blue or something like the compressed files in Windows Explorer, but that doesn’t happen.
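Until that happens, the practical safeguard is to state the compression setting explicitly whenever you create or rebuild an index on a compressed table. A minimal sketch, with a hypothetical index name and column:

-- New indexes default to no compression, so say what you want up front.
CREATE NONCLUSTERED INDEX IX_Contact_LastName
    ON Person.Contact (LastName)
    WITH (DATA_COMPRESSION = PAGE);

-- An existing, uncompressed index can be brought into line with a rebuild.
ALTER INDEX IX_Contact_LastName ON Person.Contact
    REBUILD WITH (DATA_COMPRESSION = PAGE);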

So here’s how you find out what’s compressed, and what’s not. When you execute this code…

SELECT t.name AS [Table],
       i.name AS [Index],
       i.type_desc AS IndexType,
       p.partition_number AS [Partition],
       p.data_compression_desc AS [Compression]
FROM sys.partitions p
INNER JOIN sys.tables t ON t.object_id = p.object_id
INNER JOIN sys.indexes i ON i.object_id = p.object_id AND i.index_id = p.index_id
ORDER BY t.name, i.type_desc, i.name;

…then you get a list of tables, indexes and partitions, and their compression status. It’s much, much easier than clicking through SQL Server Management Studio, checking Storage>Manage Compression for each object.

Back to the issue at hand, and that’s keeping everyone and everything synched. Having your dev and test environments compressed, but your production environment uncompressed is going to produce some interesting and unexpected results in production. The worst case is to have a scattering of objects in some environments compressed, but the same objects in other environments uncompressed. Better to have them all the same.

A recommendation – when embracing data compression, adopt a process something like:

  1. Determine which objects (tables, indexes, partitions) need compression;
  2. Compress in all environments (production, development, test);
  3. Stage the compressions (don’t do them all at once) and compress during low-activity time periods;
  4. Institute a change management process so that objects which are added after initial compression can also be compressed, if needed;
  5. Automate as much of this process as you can, so that you’re generating reports periodically; then review those reports to ensure that your systems are in sync (a starter query is sketched just after this list).
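For item 5, one report worth automating is a check for nonclustered indexes whose compression doesn’t match their base table. A starter sketch (it compares each nonclustered index against the table’s heap or clustered index; partitioned objects may show multiple rows and deserve a closer look):

-- Nonclustered indexes whose compression differs from their base table.
SELECT t.name AS [Table],
       i.name AS [Index],
       p.data_compression_desc  AS IndexCompression,
       bp.data_compression_desc AS TableCompression
FROM sys.tables t
INNER JOIN sys.indexes i     ON i.object_id = t.object_id AND i.index_id > 1
INNER JOIN sys.partitions p  ON p.object_id = i.object_id AND p.index_id = i.index_id
INNER JOIN sys.partitions bp ON bp.object_id = t.object_id AND bp.index_id IN (0, 1)
WHERE p.data_compression_desc <> bp.data_compression_desc
ORDER BY t.name, i.name;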

The next “gotcha” is the case where compressing the data actually causes storage to increase. As you saw earlier in this post, page compression adds extra space to each page – mostly the metadata that decodes the prefix and dictionary portions of the page. If the data values are too varied, if there’s no consistency in the first or “prefix” portion of the strings, and if there are few repeating value strings within the data set, then it’s possible for this metadata overhead, as it’s called, to be greater than the space saved. In this case, you’ll want to back out of page compression and apply just row compression. If that doesn’t alleviate the problem, uncompress the data completely, and start campaigning for SSD storage devices.
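Backing out is the same kind of rebuild, just pointed the other way; you can step down from page to row compression, or remove compression entirely. A minimal sketch, using the same stand-in table:

-- Step down from page to row compression...
ALTER TABLE Person.Contact REBUILD WITH (DATA_COMPRESSION = ROW);

-- ...or remove compression entirely.
ALTER TABLE Person.Contact REBUILD WITH (DATA_COMPRESSION = NONE);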

Combining compression

I’ll bet you’re asking yourself if you should bother with compression when you’re doing SQL Server backups now that you’ve got data compression turned on. The short answer is yes.

Backup compression is quite different from data compression; its algorithm is a lot like zipping a file. Data compression and backup compression co-exist nicely, and can result in additional storage space savings, so go ahead and indulge.
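If you want to see it in action, backup compression is just another WITH option on the BACKUP statement, or an instance-wide default. A minimal sketch, with a hypothetical database name and backup path:

-- Back up a (row- or page-compressed) database with backup compression on top.
BACKUP DATABASE AdventureWorks
    TO DISK = N'D:\Backups\AdventureWorks.bak'
    WITH COMPRESSION, INIT;

-- Or make compression the instance-wide default for all backups.
EXEC sp_configure 'backup compression default', 1;
RECONFIGURE;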

There are most certainly some of you who are using SQL Server data compression and you’re considering trading in your spindle-based hard disks for SSDs or flash drive arrays, now that the price of SSDs is dropping while the reliability is skyrocketing. The best advice that I can offer you, based on colleagues who are already doing this, is to keep the data compression in place. Not only are they stuffing each I/O with more records than if the data were uncompressed, they’re now circumventing the I/O bottleneck. And they’re doing the DBA happy dance.

This all sounds pretty cool, but as with all things SQL Server, there are tradeoffs. The greater the compression, the greater the load on CPU (and memory) as data is exchanged with the calling application. If the apps generate a lot of back-and-forth between the client and the SQL Server, with continuous data updating, compression could work against you. If you’re already CPU-bound, if you can’t spare 10-15% CPU capacity, then don’t try this. If you’re already maxed out on memory, address that issue first, and then try data compression. But if you have lots of unused CPU capacity, memory to spare, and you know that you’re I/O-bound, data compression may give your SQL Servers (and you) the relief needed.

Thanks for reading this blog. I hope it has helped answer questions you might have had about when and when not to use SQL Server data compression. If you’d like to learn more about our products and services, give me a call today at +1 303.526.1839 or click over to www.MountVernonDataSystems.com. If you’re new at the database administration game, check out Coach-in-a-Can® at https://mountvernondatasystems.com/coach-in-a-can.html.

Michelle Poolet, The Coach

Friday, 13 March, 2015


WHAT’S EATING UP MY HARD DISK?

I have a bone to pick with Microsoft. I’m not a Windows Systems Admin, and I don’t think I should have to be in order to use a Windows operating system. I’ll tell my story, and you decide if sysadmin certification is a pre-req to using – in this case – Windows 7.

Lately I’ve had runaway disk consumption on my Windows 7 laptop. As a SQL Server DBA, I use this little laptop for all my client apps — SQL Server Management Studio, Microsoft Office, personal file storage, etc. It’s not a big system; the local hard disk is only 278GB.

Within the last month I’ve noticed that the free space on this hard disk has been shrinking at an alarming rate, and when the space available “red light” came on, I got really concerned. Granted, I had been saving files to this disk left and right, without considering space requirements — but hey, 278GB can hold a lot of files, right? Apparently not?

Subsequently, I spent many hours moving my files to network storage, and still I was only able to recover less than 30GB of local storage!

WHAT WAS EATING UP MY HARD DISK?

I searched that hard disk for missing space, but I could find nothing. No user folder contained more than a few hundred MBs worth of files.  ARGH!

This is the point in this story when I say that I’m glad I’ve been around computers for as long as I have. Back in the dirty dark days of DOS I had a little utility program that would list folders and their sizes, largest to smallest. Hard disks in those days were measured in a few tens of MB, so file control was critical.

Was there such a utility in today’s market?  OH YES! I found TreeSize (www.jam-software.com/treesize_free/), downloaded, installed, and almost immediately found the culprit — which, incidentally, I could never have done using just Microsoft tools.

Have you heard of this thing called System Volume Information?

If you’re a sysadmin, of course you have! But as a DBA, why would I care? My sysadmin will take care of sysvol stuff, right?

Sure.

[Figure: What you see when you try to look into the System Volume Information folder.]

I had noticed this sysvol entry in the Windows Explorer tree, and had attempted to check it out, thinking maybe there was a bunch of system stuff that was sucking up the space on disk. Take a look at what you see when you try to check the folder and see what it contains — not a very friendly greeting.

OK, I really didn’t care what it contained at this point, I just wanted to know how large the folder contents were…so, check Properties…

[Figure: The System Volume Information folder’s Properties dialog.]

Zero bytes. This must be one of those folders that Windows uses for temporary files or something. Except, TreeSize told me that this folder contained about 233 GB worth of files…HUH?

WHY?

Research time!

It didn’t take long to determine what was going on…lots and lots of entries regarding sysvol info, stretching back to the days of Windows XP:

Clean out system volume information folder – Microsoft  https://social.technet.microsoft.com/Forums/en-US/92775152-7330-4da4-8af9-10a04e0bc783/clean-out-system-volume-information-folder

How to reduce the size of the System Volume information (my favorite for concise instructions), http://www.tomshardware.com/answers/id-1813658/reduce-size-system-volume-information-folder.html

How to reduce the size of the System Volume information, http://www.sevenforums.com/performance-maintenance/222588-how-delete-system-volume-information-folder-win-7-a.html

…and even Wikipedia gets into the act…  https://en.wikipedia.org/wiki/System_Restore

BUT — I take exception to this wiki entry. Under Disk Space Consumption, it states:

“The amount of disk space System Restore consumes can be configured. Starting with Windows XP, the disk space allotted is configurable per volume and the data stores are also stored per volume. File[s] are stored using NTFS compression and a Disk Cleanup handler allows deleting all but the most recent Restore Point to free up disk space. System Restore can be disabled completely to regain disk space. It automatically disables itself if the disk free space is too low for it to operate.”

Three points:

1. The contention that files are stored using NTFS compression — I believe this is not true. You can attempt to apply compression to this folder, but it does nothing — there’s no indication that compression is being applied, and it doesn’t even throw an error message. In addition, the display in Windows Explorer is not that of a compressed file.

2. Restore Points are not being managed, and disk space is not being freed up. It’s my understanding that the default configuration allots a maximum of 50% of disk space to sysvol, but in my experience sysvol operations have taken well beyond that. For example, on my 278GB hard disk containing 35GB of user data files, after those data files were moved to an external drive the disk still showed 233GB of space used.

3. The contention that “It automatically disables itself if the disk free space is too low for it to operate”  is not true, IMNSHO. My hard disk space was in the red zone, yet GBs were still being consumed — even while user files were being moved off the local hard drive onto the external device.

I’m not sure why this problem surfaced now — perhaps it’s a combination of enough time elapsing and my doing more work, so that the space consumed by sysvol info has finally become an issue; who knows? The point is, this is a situation that could trip up a lot of DBAs and ordinary users. Back in the old days, we could take ownership of the sysvol info folder and administer it directly — but starting with Windows 7 that’s no longer possible. Now you have to jump through the appropriate hoops to manage its size (see the links above for instructions), and how to do this is not obvious. Actually, without TreeSize to tip me off to the sysvol info problem, I would have missed it entirely!

One word of caution: I did reset the sysvol max disk space allocation according to the directions, but when I went back in to check (trust but verify?), the settings had returned to the 50% default. Great.

I can turn off sysvol info entirely — that is an option — but then my system would lose its auto-recovery features, and since there have been a couple of auto-recoveries in this laptop’s lifetime, I really don’t want to do that. I just want to use this laptop as it was created…as a shared resource.

So, you tell me, should I or any other non-system admin user need to know about this stuff?  Really?

Thank you for listening to this story. I hope you don’t have this problem with your hard disk, but if you do, I hope this has helped. If you’d like to learn more about our products and services, give me a call today at +1 303.526.1839 or click over to www.MountVernonDataSystems.com. If you’re new at the database administration game, check out Coach-in-a-Can® at https://mountvernondatasystems.com/coach-in-a-can.html

The Coach