Tempdb: Checksum or Torn Page Detection or None?

Tempdb performance tuning can be an art unto itself. We probably all know the standard stuff:

(A reminder: SQL Server has only one tempdb database, and it's used for a wide variety of internal operations.)

  • put tempdb on separate disks if you’re still working with direct-attached storage — on its own spindle if at all possible, and on the fastest disk(s) you have. If you can, put the tempdb data file on one disk and the tempdb log file on a second disk.
  • if you have the ability to choose the RAID level for tempdb, RAID 1+0 is best (there’s only one tempdb per SQL Server, so if you lose tempdb you’re SOL — hardware mirroring is one way of preventing unnecessary downtime). Otherwise, opt for RAID 0 (striping).
  • if your databases make heavy use of the objects that live in tempdb (temporary tables – local and global – table variables, cursors, sorts inside queries, index rebuild operations, online indexing, table-valued functions), it’s a good idea to create one tempdb data file for every 4 CPUs, up to 8 tempdb data files. Then observe performance and use of the tempdb files, and increase the number of data files if necessary. A new tempdb data file is put to use as soon as you add it; it’s relocating tempdb files to a new drive that requires a SQL Server restart to take effect.
  • never configure tempdb with the default initial size and auto-growth settings inherited from the model database, especially on older versions of SQL Server as they came out of the box. Pre-size the tempdb files when you’re installing SQL Server — all the same size for optimal load distribution (see the sketch just after this list).
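
Here’s a minimal T-SQL sketch of that pre-sizing advice. The logical names tempdev and templog are the installation defaults, but the sizes, growth increments, extra file names, and the file path are all hypothetical — adjust them to your own hardware and workload:

-- Pre-size the existing tempdb data and log files (example sizes only)
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, SIZE = 4096MB, FILEGROWTH = 512MB);
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, SIZE = 1024MB, FILEGROWTH = 256MB);

-- Add extra, equally sized data files (hypothetical names and path)
ALTER DATABASE tempdb ADD FILE (NAME = tempdev2, FILENAME = 'T:\TempDB\tempdev2.ndf', SIZE = 4096MB, FILEGROWTH = 512MB);
ALTER DATABASE tempdb ADD FILE (NAME = tempdev3, FILENAME = 'T:\TempDB\tempdev3.ndf', SIZE = 4096MB, FILEGROWTH = 512MB);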

Recently I read a posting about the Page Verify option in the database properties, claiming that tempdb doesn’t allow any option other than CHECKSUM — which is not true for SQL Server 2008 R2 and 2012. This got me thinking…why would you want to use an option other than CHECKSUM?

TORN PAGE DETECTION is reputedly not as robust as CHECKSUM in terms of data integrity, but might be quicker…is it really?

NONE would be the quickest option if you were focused entirely on raw performance and really didn’t care about the odd bit of undetected page corruption.

I honestly don’t know — I’ve always gone with safety first and opted for CHECKSUM on all databases, user and system.  Am I missing something by being too conservative?
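
If you want to see where your own servers stand, here’s a quick sketch — the database name in the ALTER statement is just a placeholder:

-- Check the current page verify setting for every database
SELECT name, page_verify_option_desc
FROM sys.databases
ORDER BY name;

-- Switch a user database to CHECKSUM (YourDatabase is a placeholder name)
ALTER DATABASE YourDatabase SET PAGE_VERIFY CHECKSUM;
-- The other accepted values are TORN_PAGE_DETECTION and NONE

Keep in mind that changing the setting only affects pages as they are subsequently written; existing pages aren’t re-verified retroactively.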

Have you ever thought about this before? What do you figure would be the conditions under which you’d want to use one property v. another? And obviously, this is NOT something you want to try on ANY production system, not without volumes of testing! Let me know what you think…

To link directly to this post in future:

https://www.mountvernondatasystems.com/blog/tempdb-checksum-or-torn-page-detection-or-none


Why Are We All Not Great SQL Server DBAs?

We know we’re good; we know we can be better. But too many times, stuff gets in our way. So what is it that’s preventing us from doing not just a good job but a great one? What’s holding us back from becoming great DBAs?

– Is there not enough time in the day to address all the issues that you have on your to-do list?

– Are you not feeling comfortable with the challenges presented to you? I know, this is a hard one to admit, but it hits us — each and every one of us — at some point in our careers!

– Do you feel that there’s a lack of communication with your peer group? (Who is your peer group?)

– Do you feel that you are experiencing a lack of communication and understanding with your management?

– Do you feel like you have a lack of control over your working environment? How would you like to change that?

– Do you have a lack of tools which might – could – would help you get the work done right – better – best?

– Do you feel like you lack certain skills, or that the technology is moving faster than your ability to assimilate the information? I know I feel that way, every time Microsoft releases a new version of SQL Server!

– Do you feel like you’re lacking skilled staffers, and in which quarters?

– Do you feel like you’re lacking adequate support from your peers and/or your management? What would you like to see happen that might fix that?

– And lastly, fill in the blank: Do you feel that ____________________________.

No, this is not a discussion about techniques and tips, but it will help expose some of the common issues which are prevalent among SQL Server DBAs. In doing so, maybe we can get to some of the root causes of why we’re not all great DBAs!

The 833, What Is It?

Here’s a story that I’d like to share, about the infamous SQL Server 833. A while ago I was on a performance-tuning call at a client site; they’d been experiencing event 833 from some of their production SQL Servers, and they wanted to know if this was a SQL Server problem or a problem with the disk farm.

To back up just a little, an event 833 (or error 833, as it’s sometimes called) is a signal from SQL Server that an I/O request it issued has taken an unusually long time to complete. An 833 message reads like “SQL Server has encountered 49 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file […file full path…] in database […database name…]…”.

This message is often interpreted as a disk I/O issue, and in many cases the 833 can be resolved by optimizing disk operations or simply replacing older, slower equipment with newer, faster disks. But this one case didn’t resolve quite so easily. You see, the disk farm attached to the SQL Servers in question was composed of flash drives — a flash SAN.
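
When you’re chasing 833s, one place to look is the cumulative per-file I/O statistics that SQL Server keeps; the sketch below computes average read and write latency per database file (how you interpret the numbers is, of course, workload-dependent):

-- Average I/O latency per database file, from the virtual file stats DMV
-- Note: these counters are cumulative since the instance last started
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
       vfs.num_of_writes,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.file_id
ORDER BY avg_read_ms DESC;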

Let’s complicate this situation:

The production SQL Servers that were getting the 833s are on the far right of the system diagram (a SAN storage environment). The 833s were coming from different database data files and log files, sometimes during heavy transaction activity and other times during very light transaction activity…so, in other words, the 833s were sporadic and unpredictable.

Given all the “upstream” data stores, processes, and source data implications (no one on staff seemed to really understand how the production system worked), it’s not surprising that there would be some sort of issue in the transaction stream. The stream starts at point-of-sale terminals and data aggregators on the far left and proceeds through load balancers and “soft routers” (this is a distributed environment) and on to error-handling routines before finally arriving at a place where the SQL Servers can begin to do their job.  The good news is that this system is able to handle many millions of transactions a day with only the occasional 833 hiccup. The 833 “wait” always resolves itself within a minute or two, but by then, the upstream congestion is significant.

The client set up a test environment with the intent of replicating the 833 situation to determine root cause. There was only one problem with the test setup…it bore little if any resemblance to production. And, while I was on-site, we were never allowed to use advanced analysis tools on the production systems.

Spoiler alert: the exact root cause of the 833s was never determined…

So now the challenge to you, dear reader: given what you know so far, what would you surmise might be the cause of the 833s being thrown by the SQL Servers? Remember, that SAN is full of flash drives…


Scheduled Job Failed to Run…How to Troubleshoot SQL Server Error Messages

Troubleshooting error messages can be a little tricky, especially if you’re new to the task. Here’s a good example of a job that failed without a whole lot of information as to why it failed. This was taken from the Windows Event Log/Application Log:

Log Name:      Application
Source:        SQLISPackage
Date:          7/12/2014 10:45:45 PM
Event ID:      12291
Task Category: None
Level:         Error
Keywords:      Classic
User:
Computer:
Description:   Package “Dynamics DB Integrity Check” failed.

Step 1: this error message points to a job (source: Application/SQLISPackage) called Dynamics DB Integrity Check.Subplan, so it’s logical to start with the scheduled jobs in SQL Server Management Studio. Sure enough, there’s a scheduled job with that name in the job list, and it’s an active job.
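
If you prefer a query window to Object Explorer, a sketch like this — using the job name from the error message — lists matching Agent jobs and whether they’re enabled:

-- Look for Agent jobs whose name matches the failed package
SELECT name, enabled, date_created, date_modified
FROM msdb.dbo.sysjobs
WHERE name LIKE N'Dynamics DB Integrity Check%'
ORDER BY name;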

SchedJobErrorMsg_01

(The jobs with the tiny red downward-pointing arrows as part of the icon have been disabled.)

Step 2: open up the job history for the job and see if there’s any additional information:

SchedJobErrorMsg_02

As it turns out, there isn’t — not in the main entry for the job history, and not in the subplan step.
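
The same history can be pulled with a query, which occasionally surfaces message text that the Job History dialog truncates; this is a sketch against the standard msdb tables:

-- Most recent history rows for the job, newest first
SELECT h.run_date, h.run_time, h.step_id, h.step_name, h.run_status, h.message
FROM msdb.dbo.sysjobhistory AS h
JOIN msdb.dbo.sysjobs AS j ON j.job_id = h.job_id
WHERE j.name LIKE N'Dynamics DB Integrity Check%'
ORDER BY h.run_date DESC, h.run_time DESC;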

SchedJobErrorMsg_03

The clues are in the error message’s Source (SQLISPackage) and its Description, which starts with “Package …”. The job name is a clue, too – a two-part name like this usually indicates that the job was created by a Maintenance Plan.

Step 3: Confirm these suspicions by opening the job itself and viewing the content (you’ll have to Edit in order to see the internal configuration of the job, but don’t make any changes – and always cancel out when you’re done, never save *).
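
As an alternative that avoids opening the job in edit mode at all, the step definitions can be read straight out of msdb — for a maintenance-plan job, the subsystem and command columns should show that the step runs an SSIS package:

-- Read the job's step definitions without opening the job designer
SELECT j.name AS job_name, s.step_id, s.step_name, s.subsystem, s.command
FROM msdb.dbo.sysjobs AS j
JOIN msdb.dbo.sysjobsteps AS s ON s.job_id = j.job_id
WHERE j.name LIKE N'Dynamics DB Integrity Check%'
ORDER BY s.step_id;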

SchedJobErrorMsg_04

Maintenance Plans are SQL Server Integration Services packages. When you save a plan you’ve just created, SQL Server creates and enables the scheduled jobs that run the various tasks comprising the plan.

Step 4: go to the Maintenance Plans and check the plan named in the job step…but look!  There IS NO MAINTENANCE PLAN!

SchedJobErrorMsg_05

If there were maintenance plans for this SQL Server, they would be listed under Legacy > Database Maintenance Plans (this is a SQL Server 2005 instance). Obviously, there are none. Also obviously, at one time there was a maintenance plan.

Reason for the Error Message: the scheduled job is calling a non-existent maintenance plan.

Lesson Learned: when a Maintenance Plan is deleted from a SQL Server, the scheduled jobs that ran the pieces of the Plan are not also deleted. You have to remove (or disable) them separately.
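
One way to round up candidates for this kind of cleanup is to list the Agent jobs sitting in the built-in Database Maintenance category — a rough sketch only, since it finds candidates but can’t tell you whether the plan behind each job still exists:

-- List Agent jobs in the built-in Database Maintenance category
SELECT j.name AS job_name, j.enabled
FROM msdb.dbo.sysjobs AS j
JOIN msdb.dbo.syscategories AS c ON c.category_id = j.category_id
WHERE c.name = N'Database Maintenance'
ORDER BY j.name;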

Step 5: remove or disable the scheduled job. For the sake of caution, we’re going to disable the job. Highlight the job name, right-click, and choose “Disable”. The result looks like this:
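
If you’d rather script the change, the standard msdb procedure does the same thing; adjust the job name so it matches exactly what you see in your job list:

-- Disable the orphaned job rather than deleting it
EXEC msdb.dbo.sp_update_job
     @job_name = N'Dynamics DB Integrity Check.Subplan',
     @enabled  = 0;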

SchedJobErrorMsg_06

And those are the steps to troubleshooting an error message about a job failing to run.

* Save v. cancel: when investigating settings or parameters for live jobs or other database properties, it’s always safest to cancel out rather than save when you’re done observing. A “save” could cause changes if you inadvertently modify something in the settings, whereas a “cancel” will simply leave the settings as they were. “Save” when you intend to make changes; “cancel” when you don’t.