Building and Managing Highly Available SQL Server Applications

By Scott Mauvais

In a previous article, I showed you how companies such as Starbucks, GMAC, and FreeMarkets.com use Microsoft® SQL Server™ 2000 to achieve 99.999% uptime. Such numbers are certainly commendable, yet attaining this sort of availability does not require extraordinary effort. It does, however, require following some well-known design principles. Building highly available applications requires three ingredients: technology, people, and process. Let's review what I mean by these terms.
This month, I will cover the other two ingredients of highly available applications: people and process. No matter how well you design your data tier, and no matter how stable your platform, the time will come when the database is unavailable. Causes of this unavailability range from scheduled downtime (to apply a service pack, for example) to a catastrophic failure that takes down one of your data centers.

I am not going to cover failures in the application logic itself. More often than not, such problems boil down to a single issue: sloppy code. Poor error handling, buffer overruns, and missing asserts all have known solutions, so I won't dwell on them here except to mention some excellent resources. Probably the best place to start is Steve McConnell's trilogy: Code Complete, Software Project Survival Guide, and Rapid Development. Another excellent book is Dynamics of Software Development by Jim McCarthy.

I am going to assume that your code is robust and works properly. After all, I always write perfect code, and I'm sure you do, too. It's other people's code we have to worry about. In this month's article, I will cover some techniques you can use to keep your applications up and running even if one of the core servers is unavailable (because of someone else's code, of course).

There are two key techniques for designing highly available applications. The first applies to the Web tier: the most important thing you can do is design a stateless application that can scale out through the use of cyberbricks. I discussed this topic at length a few months ago in my article on Planning High-Performance Web Sites, so I won't belabor the point. The second technique involves the aggressive use of error handling to intelligently detect and respond to errors caused by resources being unavailable. So that you can better understand what your error handler needs to do, let's first review the relative pros and cons of Microsoft Cluster Service (MSCS) and log shipping.
Clustering versus Log Shipping

With clustering, the backup node takes over the very same data the primary was using, so once the fail-over completes it has completely current, consistent data; the price is the fail-over period itself, during which the database is unavailable. Conversely, when you use log shipping, the backup server (or servers, if you are truly paranoid) is available immediately. The downside, however, is that there will always be a delay (a minimum of one minute) between when a new transaction is committed on the live server and when the backup server receives the shipped log. Thus, if your primary server fails and you need to connect to your target server, you are forced to access it in read-only mode. If you were to update the target server, you would introduce inconsistencies, because that server wouldn't have the most current data.

Besides being immediately available, log shipping has the huge benefit of supporting geographically diverse backup sites. Therefore, log shipping allows you to have access to your data (albeit read-only) even if your primary data center slips into the Pacific Ocean or is hit by rolling blackouts.

So which approach should you use? Clustering provides guaranteed consistency and a single image that is easy to manage, but is not always available. Log shipping provides immediate availability, but limits you to read-only access. When in doubt, I always opt for both. What happens if you combine these two approaches? For SQL Server-based applications, MSCS really provides the best availability solution, except for that pesky period during the actual fail-over when the database is unavailable. Nice, but not five nines; three at the most. But wait: what if you accessed a log-shipped server during this brief fail-over? You could get those two nines back.

Combining Clustering and Log Shipping

When your application hits a connection error, the error handler's first move should be simply to retry the operation; a transient network glitch or a fast MSCS fail-over may well have resolved itself by the second attempt. If the second attempt also fails, you probably have something more serious than a router hiccup, and it's time to hit the contingency plan. If the desired operation was a read, simply update your connection string to point to one of the read-only log-shipped servers. One common dilemma is how to determine the connection string for a log-shipped server, as this information changes as you update your network topology and disaster-recovery plans. The solution is easy: put the string in Active Directory. For more information on how developers can leverage Active Directory, see my article Leverage Windows 2000 Active Directory Services to Build Distributed Applications.

That takes care of the read operation, but what if you want to update the database? If the requested operation was a write, you can use Microsoft Message Queue (MSMQ), which is a standard component in Microsoft Windows® 2000 and is part of the Option Pack for Microsoft Windows NT® 4.0. MSMQ is a loosely coupled, reliable network communications service based on a store-and-forward messaging model. When you add a message to a queue, MSMQ guarantees that the message will be delivered once and only once, in the exact order in which it was added to the queue. Better yet, MSMQ is fully transactional, so you can use it with the Distributed Transaction Coordinator (DTC) just as you do when you access multiple databases. For more information on MSMQ, see the Microsoft Message Queue Server Developer Resources page.

With MSMQ, rather than connecting to the database via ADO and submitting your update, you create a new MSMQ message with the particulars of your desired database operation and push the message onto the queue. The message then sits in the queue until the database comes back up. Because MSMQ supports asynchronous communication, your application can continue with its logic as soon as the message is in the queue (which happens nearly instantaneously).
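To make the flow concrete, here is a minimal sketch of such an error handler. Everything in it is my own illustrative assumption rather than anything from the article: the server names, the `connect` stub standing in for an ADO connection attempt, and the in-process `queue.Queue` standing in for a transactional MSMQ queue.

```python
import queue

PRIMARY = "sql-primary"         # hypothetical clustered SQL Server instance
READ_REPLICA = "sql-logship-1"  # hypothetical read-only log-shipped server
write_queue = queue.Queue()     # stands in for a transactional MSMQ queue

def connect(server, servers_up):
    """Stub for an ADO connection attempt; raises if the server is down."""
    if server not in servers_up:
        raise ConnectionError(server)
    return server

def read(sql, servers_up):
    """Retry the primary once, then fall back to the read-only replica."""
    for _ in range(2):
        try:
            return ("primary", connect(PRIMARY, servers_up))
        except ConnectionError:
            pass
    return ("replica", connect(READ_REPLICA, servers_up))

def write(sql, servers_up):
    """Writes cannot go to a read-only replica; queue them instead."""
    try:
        connect(PRIMARY, servers_up)
        return "written"        # normally you would submit the update here
    except ConnectionError:
        write_queue.put(sql)    # applied later, once the primary recovers
        return "queued"
```

The point of the sketch is the asymmetry: reads can be redirected to a log-shipped server immediately, while writes must be parked in a queue until the primary comes back.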
If you had stayed with ADO, you would have had to wait in a loop until the database came back up, or return an error to the user.

The best part of this approach is that it works equally well whether the failure is caused by the database being down because of an MSCS fail-over or by the database simply being unreachable because of a network error. First I will show you a technique to deal with the fail-over scenario. Then I will show you how to extend this approach to address network problems.

The first step is to create a stored procedure that will read the messages off the queue and update the database with the queued transactions. Working with MSMQ directly from Transact-SQL can get a bit complex, so you probably want to write a COM component to do the actual queue handling. In this approach, your stored procedure uses the OLE Automation stored procedures (sp_OACreate and its companions) to instantiate that component, which then drains the queue and applies the updates.
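The drain logic itself, including a guard that waits for the target database to finish recovering, can be sketched as follows. The in-process `queue.Queue`, `apply_to_db`, and `is_online` are my own stand-ins for MSMQ, the database update, and an online-status check; none of these names come from the article.

```python
import itertools
import queue

def wait_until_online(is_online, max_polls=10):
    """Poll until the target database reports itself online."""
    for n in itertools.count(1):
        if is_online():
            return n
        if n >= max_polls:
            raise TimeoutError("database never came online")
        # in the real stored procedure you would pause between polls

def drain(q, apply_to_db, is_online=lambda: True):
    """Apply every queued update to the database, exactly once, in FIFO order."""
    wait_until_online(is_online)    # the target DB may still be recovering
    applied = 0
    while True:
        try:
            msg = q.get_nowait()
        except queue.Empty:
            return applied          # queue drained; all updates applied
        # with real MSMQ, the receive and the update would share one DTC
        # transaction, so a crash cannot lose or double-apply a message
        apply_to_db(msg)
        applied += 1
```

The comment about sharing a DTC transaction is the crucial design point: because MSMQ is transactional, the dequeue and the database write can commit or roll back together.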
Once your stored procedure is written, you need to tell Microsoft SQL Server 2000 to execute it automatically whenever the server starts up. First you need to enable startup procedures, because they are disabled by default. To do this, use sp_configure to turn on the 'scan for startup procs' server option. This setting does not take effect until you stop and restart the database engine. Now you're ready to configure your procedure so that it runs on startup. To do this, use the sp_procoption system stored procedure.

One caveat: startup procedures can only live in the master database. This means that, depending on how long it takes Microsoft SQL Server 2000 to recover the database you want to update, that database might not be online when the startup procedure executes. To prevent this from being a problem, you need to write some logic in your stored procedure that polls the server to see when the database is back online.

As a final step, you need to guard against the possibility that your error handler placed messages on the queue because of network problems rather than because of a fail-over event. If the outage was caused by network connectivity (or the lack thereof), the database will never start up, since it never stopped running, so your startup procedure will never execute. Getting around this problem is simple: create a job and use SQL Server Agent to execute it on a recurring basis. The job itself is very straightforward; it just calls your startup procedure. There you have it. By using this technique along with the new features
in Microsoft SQL Server 2000, you can isolate your application from almost
any sort of failure. What if the unthinkable happens and your application
does indeed go down? That's when the third ingredient, process, kicks in.

I emphasized the importance of process with a table I presented at the beginning of my earlier article, which listed the common causes of downtime. Here is a summary:

Table 1: Common Causes of Downtime
Fully 40% of all system downtime is the result of poor operations. Like the application development practices I just discussed, operations are entirely under your control. Furthermore, operational procedures are often relatively easy to address.

First let's define availability and how it differs from reliability. Many people use the two terms interchangeably, and for most purposes that is okay. When you get into operations, however, the difference becomes important. Reliability refers to how often the system fails: the fewer the failures, the more reliable the system. Typically, reliability is expressed in terms of mean time to failure (MTTF). Availability, on the other hand, measures the amount of time your system performs its functions as specified. Availability, of course, is the measure we are after. To calculate availability, start with MTTF and factor in how long it takes to recover from a failure, the mean time to recover (MTTR). The formula looks like this:

Availability = MTTF / (MTTF + MTTR)
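As a quick sanity check on the arithmetic, the availability calculation is easy to express in code. This is just a sketch of the MTTF/MTTR relationship, with minutes as the unit; the function name is mine.

```python
def availability(mttf_minutes, mttr_minutes):
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_minutes / (mttf_minutes + mttr_minutes)
```

Note that shrinking MTTR raises availability just as surely as stretching MTTF does, which is exactly why process belongs alongside technology and people.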
For example, if your application averages a failure every three months (every 131,400 minutes) and it takes you 60 minutes to recover, your projected availability would be 131,400 / (131,400 + 60) = 99.95%. The important thing to remember is that you can increase availability both by improving your system's reliability (technology and people) and by reducing the time it takes to recover (process).

The Microsoft Operations Framework and Related Resources

For information specific to Microsoft SQL Server 2000, the soon-to-be-released Microsoft SQL Server 2000 Resource Kit is perfect for anyone looking to manage a database server. Another excellent source of information is the High Availability SQL Server Platform Engineering white paper, which describes the steps Microsoft's Information Technology Group has taken to engineer high availability into its internal SQL Server applications.

Some of the best practices around operations are simple but often
overlooked. Recovery manuals, for example, can save hours in an emergency.
At minimum, a recovery manual should list information such as settings on
network cards and IP addresses, instructions for installing the
application, special registry settings, and tuning parameters. You should
also think about keeping an installation CD and floppies next to the
server. Basically, you should be prepared for a situation in which someone
who knows absolutely nothing about the server or the application needs to
rebuild the server from scratch.

Good operations also mean heading off failures before they cause an outage. For example, suppose you have a component that slowly leaks memory. Rather than waiting until the component consumes all available memory and brings down the entire server, you can gracefully shut the application down. In a Web farm, you can even do this with absolutely no downtime. When you want to reset the application, first drain off all the current users and take the server out of the load-balancing scheme. As soon as all the connections have dropped off, you can reset the system. Once the system comes back up, you can again add it to the load-balanced cluster. If you are using Microsoft Application Center 2000, you can even automate this entire process. To learn more about Microsoft Application Center 2000, see the Microsoft Application Center 2000 Resource Kit. You can also read my article Building and Managing Complex Web Applications with .NET Enterprise Servers, which covers Microsoft Application Center 2000 in some detail.

Conclusion

Microsoft Press provides in-depth documentation for these and all other
issues related to developing and administering SQL Server applications.
The best place to start is with Inside Microsoft SQL Server 2000. If you plan to support any SQL Server 7.0 applications, you should also get a copy of Inside Microsoft SQL Server 7.0. See the Table of Contents, sample chapter, and index to review the complete scope of coverage. There are enough differences between the two editions to make both texts essential for the serious SQL Server developer.

The Microsoft SQL Server 2000 Resource Kit is the perfect companion for any Microsoft SQL Server professional. It will help you plan, install, maintain, and optimize Microsoft SQL Server 2000. Besides being a great resource for tips and tricks, it includes a CD with several must-have tools for Microsoft SQL Server. See the Table of Contents, sample chapter, and index.

For a quick overview of the new features in Microsoft SQL Server 2000, take a look at this article in the SQL Server section of Microsoft.com: What's New in SQL Server 2000. If you are after a comprehensive look at the new features of Microsoft SQL Server 2000, along with lots of sample code, check out Carl Nolan's excellent MSDN article SQL Server 2000: New Features Provide Unmatched Ease of Use and Scalability to Admins and Users.

Other books every developer should have in her or his library include Steve McConnell's trilogy: Code Complete, Software Project Survival Guide, and Rapid Development.
Other titles from Microsoft Press you might want to check out include the following: Microsoft
SQL Server 2000 Administrator's Companion is a great book for the
administrator looking to support highly reliable SQL Server
installations. Inside
Microsoft Windows 2000, Third Edition, is the definitive guide to the
internals of Windows 2000. The Microsoft product team wrote this book with
full access to the source code, so you know you're getting the most
comprehensive technical information available.