Microsoft SQL Server 2000 Reliability: Dressed to the Nines

By Scott Mauvais

Everyone is talking about nines these days. "My app's got more nines than your app, nah, nah, nah." Until recently, this schoolyard taunt was the exclusive domain of the socially challenged UNIX bigots who (mistakenly) believed the only way to have a truly reliable system was to run on some flavor of UNIX. We Microsoft NT® true believers have known all along that Microsoft® Windows® could run circles around any other OS in terms of reliability. After all, Barnes & Noble had the highest uptime (99.98%) of any e-commerce site during the 1999 holiday season, and they were running a beta of Microsoft Windows 2000! In fact, that 0.02% downtime resulted from their scheduled upgrade to the RTM version.

BN.com is not alone. Lots of people are designing and building Microsoft Windows-based apps with previously unimagined reliability. Starbucks, GMAC, and FreeMarkets.com all have case studies demonstrating five nines. So, how are these companies able to achieve these levels of reliability? Well, that is exactly what this article is about. This month I am going to show you how to use standard Microsoft Windows technologies to make your Microsoft SQL Server™ app just like Denny's: "always open" and ready for business.
Common Causes of Downtime

Before diving into features, I want to review why systems fail. There are two types of downtime: planned and unplanned. Because unscheduled downtime is the most problematic, I will focus my discussion there. In my future article on process, I will touch on some strategies that also help to minimize planned downtime. Let's examine the most common causes of unscheduled downtime and some solution areas to investigate. They fall into three broad categories:

- Application
- Platform
- Operator
Now that we understand what causes downtime, let's see what Microsoft technologies we can use to help prevent it.

It All Starts with Microsoft Windows 2000

No matter what application you are using or building, Microsoft Windows 2000 is key to the reliability of your app. With Windows 2000, Microsoft introduced features such as driver signing and file protection. Driver signing allows you to guarantee that only tested and certified drivers are installed on your system. Windows File Protection ensures that critical system files are not accidentally deleted (or maliciously modified, for that matter) or overwritten by a poorly written install program. While there are many other features that directly contribute to greater uptime, a more detailed discussion of them is outside the scope of an article for developers. For specifics on the advances in Microsoft Windows 2000 and a discussion of best practices, see the white paper "Increasing System Reliability and Availability with Windows 2000," available on TechNet's Reliability Web site.

Because Microsoft SQL Server ships only on the Microsoft Windows platform, it is able to take direct advantage of all the reliability features in the underlying operating system and then extend them with database-specific enhancements. The problem developers often face is deciding which feature to use when building solutions. Should you use Microsoft SQL Server 2000's multi-instance support, Microsoft Cluster Service, log shipping, or some combination? As you might suspect, a combination of approaches is usually best. To help you understand what combination is best for your circumstances, I'll first walk you through each technology and then present some broad recommendations.

Multi-Instance

Multi-instance support is a new feature in Microsoft SQL Server 2000 that lets you concurrently run multiple copies of SQL Server on the same machine. It is similar to the version switcher that Microsoft introduced as part of SQL Server 7.0, which allowed you to switch back and forth between SQL Server 6.x and SQL Server 7.0. The mechanics of multiple instances are pretty well documented, so I won't cover them here. Rather, I will focus on how you can use multi-instance support to enhance reliability. If you are unfamiliar with the feature, see the SQL Server Books Online topic entitled "Working with Instances and Versions of SQL Server"; you can also find it in the online SQL Server documentation on MSDN®.

Multi-instance support increases reliability in two ways. First, it allows you to completely isolate applications from one another; second, it makes server consolidation easy.

Isolating Applications

As far as each instance is concerned, it is the only copy of SQL Server running on the system. Each instance gets its own registry entries, binaries, memory space, and system databases. You can then use this isolation to install different applications on their own instance. If one application misbehaves or a user issues a runaway query, the impact is isolated to that single instance, and the apps running on other instances remain unaffected.

But couldn't a badly compromised instance still affect the others, since they share the same hardware? Not necessarily, because you can configure an instance to use only a specific set of resources. Obviously, you can place an instance's database devices on dedicated partitions or physical disks so that you don't have to worry about one instance "auto-growing" into another. You can also limit the maximum memory an instance can use with the max server memory configuration setting, and you can limit the instance to a specific set of processors by setting the affinity mask. The code below demonstrates how to do this.
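Here is a minimal sketch of those two settings using sp_configure, run against the instance you want to constrain. The 128-MB memory cap and the CPU bitmask of 3 (processors 0 and 1) are purely illustrative values, not recommendations:

    -- Both settings are advanced options, so expose them first.
    EXEC sp_configure 'show advanced options', 1
    RECONFIGURE

    -- Cap this instance at 128 MB of memory (illustrative value).
    EXEC sp_configure 'max server memory (MB)', 128
    RECONFIGURE

    -- Bind this instance to CPUs 0 and 1 (bitmask 3, illustrative value).
    EXEC sp_configure 'affinity mask', 3
    RECONFIGURE

Because every instance keeps its own configuration, you simply repeat this against each instance with whatever limits make sense for the application it hosts. The memory cap takes effect immediately; the affinity mask change takes effect after the instance is restarted.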
Certainly, this does not completely eliminate the shared aspects of the system (the various instances share the same I/O channels, the same instance of the operating system, the same system bus, and so on), but it does isolate the most common sources of contention: memory, CPU, and hard disk.

A really cool scenario that makes great use of multiple instances is an ASP that provides a hosted SQL Server. For, say, $79 per month, you would get a Basic account that entitles you to your own database but requires you to share the server's resources with all the ASP's other Basic users. For $129 you can upgrade to a Premium account where you get your "very own dedicated server" with 128 MB RAM and a single processor. For $199 you can have an Imperial account that entitles you to a dual-processor machine with 256 MB RAM and a small island kingdom. From the ASP's perspective, it could easily configure these "dedicated servers" to simply be separate instances, each with the appropriate affinity mask and max server memory settings. From the user's perspective, SQL Server behaves as if it were running on a dedicated machine. Granted, the OS resources are shared, but for the most common scenario of Web developers needing access to a database for dynamic content and state management, this shouldn't matter at all.

Let's now turn to the second benefit of multiple instances: server consolidation.

Server Consolidation

Before Microsoft SQL Server 2000, server consolidation projects were often difficult undertakings. In some cases, they were downright impossible because of conflicting collations. In other cases, security needs or tuning parameters were at odds. Now, with Microsoft SQL Server 2000, each instance is completely separate and can have whatever settings you want. This makes it simple to decommission individual servers and consolidate them on a single, mammoth server.

How, you might ask, does server consolidation contribute to reliability? Simple: by placing all your applications on a single box, you can more easily justify the capital expenditure for highly reliable hardware such as the 32-processor, 64 GB RAM-capable Unisys ES7000. These beefy machines not only offer fault-tolerant components, but also come with a minimum uptime guarantee of 99.9% as part of Microsoft's Datacenter Server program.

Before you redesign your applications to take advantage of multi-instancing, be sure to read the "Dynamically Managing Memory Between Multiple Instances" topic in the SQL Server Books Online or on MSDN. For some types of applications and some consolidation scenarios, you will need to carefully examine the performance implications before undertaking a server consolidation project.
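One practical detail once formerly separate servers live on a single machine as named instances: clients connect to each instance by name (MACHINE\INSTANCE rather than just MACHINE), and it is easy to confirm which instance a given connection landed on. A quick sketch; the instance names are of course whatever you chose at install time:

    -- Confirm which instance this connection is using.
    SELECT @@SERVERNAME AS ServerAndInstance,              -- MACHINE\INSTANCE for a named instance
           SERVERPROPERTY('InstanceName') AS InstanceName  -- NULL for the default instance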
Now, if you were thinking that consolidation would actually reduce reliability because you were introducing a potential single point of failure (this mammoth server), then you were on the right track. But there is a solution to this: clustering.

Microsoft Cluster Service

Microsoft Cluster Server (MSCS) provides fail-over clustering for applications by ensuring that clients can still access an application even after the server that was originally hosting it has failed. Should one of the nodes of a cluster fail, MSCS will migrate its applications and their workloads to a surviving node. From the client's perspective, this process is nearly seamless because MSCS presents the appearance of a single system image, even though up to four separate servers could be servicing the applications. As with multi-instance support, Microsoft's clustering technologies are very well documented elsewhere, so I won't go into detail here. If you want to learn more about MSCS, see the Windows Clustering homepage. You might also want to check out course 958, Supporting Microsoft Cluster Server, from the Microsoft Official Curriculum.

I do, however, want to return to what is meant by "nearly" seamless. As far as the client app is concerned, nothing has changed: the server name, the IP address, and all the configuration settings are the same, with the possible exception of hardware-specific ones if the machines were not identical. More important, because MSCS uses a shared disk, the data is in the same consistent state that it was in before the fail-over. The only thing that has changed is that it appears as though the machine rebooted, but very quickly. Typically, the fail-over from one node to another takes less than 30 seconds.

This is where the word "nearly" comes in. Because a fail-over is essentially a reboot, all open connections are dropped and any in-flight transactions are rolled back. Furthermore, while SQL Server may be up and running on the new node in 30 seconds or so, depending on when the last checkpoint occurred, the recovery interval, and the workload it was processing, it may need to roll forward several transactions. Worse still, if SQL Server was processing any long-running transactions when a node failed, it will also need to roll back these transactions before accepting new connections to that database. So, in worst-case scenarios, it may take significantly longer than 30 seconds to regain access to your database. Long-running transactions, however, are not the norm, and the default recovery interval is one minute, so in most cases your application should be back up and running in 60 seconds or so.

The problem with MSCS is how people have deployed it and what they expect from it. The crux of the issue is that the nodes of the cluster must be in close proximity to one another, and this rules out the disaster-recovery scenario that many people hope for. Most implementations require all the nodes to be on the same campus and, more often than not, in the same building. Some storage vendors are pushing the envelope to allow distances in the tens of kilometers, but these solutions require special hardware and long, very high-speed links that are very, very expensive.

Besides, I'm not so sure what a few dozen kilometers buys you. Sure, it protects you from a localized disaster such as a fire or a building collapse, but most disasters I worry about (floods, hurricanes, and yes, I'm from San Francisco, so earthquakes) are more regional in nature. Even with a catastrophic disaster, you can probably count on one of your four buildings still being at least semi-operational, so your application theoretically could fail over into the one surviving data center. But what about power and network links? (An ironic side note: as I write this, I have been selected as one of California's rolling blackout victims; thank goodness for laptop batteries.) How is your operations staff going to get to the building to manage the application?

Therefore, you should be looking at MSCS as a solution for hardware and software faults. If you are looking for disaster preparedness and recovery, you are better off with log shipping, which will allow you to move your data clear across the world if necessary.
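As an aside, the recovery interval that figures into those fail-over timings is an ordinary server configuration option, so you can inspect or adjust it per instance. A minimal sketch; the one-minute value is illustrative:

    -- Recovery interval is an advanced option, measured in minutes.
    EXEC sp_configure 'show advanced options', 1
    RECONFIGURE

    -- Show the current value (the default of 0 lets SQL Server manage it,
    -- targeting roughly a minute of recovery work per database).
    EXEC sp_configure 'recovery interval'

    -- Explicitly target about one minute of recovery work (illustrative).
    EXEC sp_configure 'recovery interval', 1
    RECONFIGURE

Raising the interval reduces checkpoint overhead at the cost of longer recovery after a fail-over, so for availability work you generally want to keep it low.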
Log Shipping

Log shipping made its debut as an add-on for Microsoft SQL Server 7.0 that shipped as part of the BackOffice® Resource Kit 4.5. Because it was only partially supported, many people did not implement it until Microsoft rolled it into the product as part of the Enterprise Edition of SQL Server 2000.

The concept behind log shipping is really quite simple. It allows the administrator to use the Database Maintenance Plan Wizard to set up a schedule in which the transaction logs are dumped on one server, shipped over to a second server, and restored there. By default, the log shipper kicks in every five minutes, but you can dial that down to one minute if you need to. Conversely, if you want to conserve bandwidth, you can configure the log shipper to send the logs only once per day. Actually, because the parameter is an int, you could increase the interval all the way up to 2,147,483,647 minutes, but I don't think that would give you the type of fault tolerance you were looking for.

Back in the SQL Server 4.x and 6.x days, many of us wrote applications similar to log shipping so that we could off-load the DBCC checks we had to run to make sure we had a valid database before we backed it up. Once the remote backup database passed the DBCC checks, we would back up the original transaction log to tape. At the time, this eliminated one of the biggest hurdles to uptime. Ever since the release of Microsoft SQL Server 7.0, we have been absolved of our duty to run DBCC before each backup. Even so, log shipping provides an essential tool in our high-reliability belt: it solves the geographic distribution requirement for true disaster recovery scenarios.

One administrative hassle of using log shipping rather than MSCS is that to transition a standard three-tier, WinDNA app over to a log-shipped backup server, the middle tier must connect to a different server. Remember, this was not the case with MSCS because it presented a single system image. However, with some creative development I will cover in a future article, this really is not such a burden. The end result is that I can use log shipping to create warm backup servers that are nearly identical to my production server. There's that "nearly" word again.

As you might expect, log shipping comes with some caveats. First off, there is that delay of up to one minute. Many people mistakenly think this means that you would lose any data committed during this minute but not shipped to the remote server. Not at all. The problem is not that the data is lost (it is still on the drive array next to the original server); it is just that your servers are out of synch. Of course, if you lose the drive array, you really have lost that one minute of transactions, but that would happen whenever you lose your physical storage. Even in this scenario, log shipping adds value because it will be far faster to get your application back up and running if you have a warm standby server than if you had to rebuild a server and load the data from tape.
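To make the mechanics concrete, here is a rough sketch of what a single shipping cycle boils down to if you were to do it by hand (the Database Maintenance Plan Wizard generates jobs that do roughly this on your schedule). The database name, paths, and copy step are hypothetical:

    -- On the primary server: back up the transaction log.
    BACKUP LOG Sales
        TO DISK = 'D:\LogShip\Sales_tlog.trn'
        WITH INIT   -- overwrite the previous backup file

    -- (A scheduled job then copies Sales_tlog.trn to the standby server.)

    -- On the standby server: apply the log, leaving the database readable
    -- between restores. WITH NORECOVERY instead would keep the standby as a
    -- pure fail-over target that accepts further logs but allows no readers.
    RESTORE LOG Sales
        FROM DISK = 'E:\LogShip\Sales_tlog.trn'
        WITH STANDBY = 'E:\LogShip\Sales_undo.dat'

The STANDBY restore mode is what makes the warm backup server usable in the read-only role discussed next.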
So, what to do now that our servers are out of synch? This leaves you with a choice: you can either force your applications into read-only mode (you can't write to the log-shipped database because it is not consistent with the source database) or you can accept the loss of the data from that one minute. Well, if you care about reliability at all, clearly you can't tolerate the loss of data, so you are forced into read-only mode.

What good is a read-only database, you (rightfully) might ask? Before you give the standard response of "absolutely none," you need to think of the two ways you might use this. First there is the geographic dispersion case. This scenario only kicks in if all four nodes of your cluster are rendered inoperable. Even putting aside the special hardware that allows you to separate the nodes by many kilometers, clearly your data center has experienced some sort of localized calamity, and I would suggest that you have much bigger problems than whether your data is one minute out of synch or not. While disaster planning of this sort always seems to get lots of attention, I don't spend much time worrying about it. I'll design a system that ships the data off to a safe location. After that, the business-continuity planning folks pick up the ball and figure out what to do with their precious data on The Day After.

The other case, which I find much more interesting, is to use this log-shipped data to provide a bridge of temporary availability during the minute or two while MSCS fails the database over and SQL Server recovers the database. This is interesting because it provides some development challenges. Exactly how to go about designing an application that does this will be the topic of next month's article on development techniques for highly reliable SQL Server applications.

Conclusion

In this month's article, I started with a discussion of the most common causes of downtime. Next I talked about some of the features in Microsoft Windows 2000 that help ensure your servers stay up and running. Then I moved on to Microsoft SQL Server and discussed ways you can combine SQL Server features such as multiple instances and log shipping with the core clustering technologies of the Windows platform to build highly reliable systems. If there is one thing you remember from this article, it should be that you need to look at availability in terms of complete systems rather than just individual servers. With a well-designed architecture combining multiple instances of SQL Server, MSCS, and log shipping, you can have the necessary infrastructure to build highly reliable Microsoft SQL Server applications. Next month, I will look at some of the development strategies you can use to ensure your applications are "always open."

Microsoft Press provides in-depth documentation for these and all the other issues related to developing and administering SQL Server applications. The best place to start is with Ron Soukup and Kalen Delaney's new book, Inside Microsoft SQL Server 2000. If you plan on supporting any 7.0 applications, you should also get a copy of Inside Microsoft SQL Server 7.0. There are enough differences between the two products to make both texts essential for the serious SQL Server developer. For a quick overview of the new features in SQL Server 2000, take a look at this article in the SQL Server section of Microsoft.com: "What's New in SQL Server 2000". If you are after a comprehensive look at the new features along with lots of sample code, check out Carl Nolan's excellent MSDN article "SQL Server 2000: New Features Provide Unmatched Ease of Use and Scalability to Admins and Users". Microsoft Press also offers a number of other titles on SQL Server administration and development that are worth checking out.