Building and Managing Highly Available SQL Server Applications

By Scott Mauvais

In a previous article, I showed you how companies such as Starbucks, GMAC, and FreeMarkets.com use Microsoft® SQL Server™ 2000 to achieve 99.999% uptime. Such numbers are certainly commendable, yet attaining this sort of availability does not require extraordinary effort. It does, however, require following some well-known design principles. Building highly available applications requires three ingredients: technology, people, and process. Let's review what I mean by these terms.
This month, I will cover the other two ingredients of highly available applications: people and process. No matter how well you design your data tier, and no matter how stable your platform, the time will come when the database is unavailable. Causes of this unavailability range from scheduled downtime (to apply a service pack, for example) to a catastrophic failure that takes down one of your data centers.

I am not going to cover failures in the application logic itself. More often than not, such problems boil down to a single issue: sloppy code. Poor error handling, buffer overruns, and missing asserts all have known solutions, so I won't dwell on them here except to mention some excellent resources. Probably the best place to start is Steve McConnell's trilogy: Code Complete, Software Project Survival Guide, and Rapid Development. Another excellent book is Dynamics of Software Development by Jim McCarthy.

I am going to assume that your code is robust and works properly. After all, I always write perfect code, and I'm sure you do, too. It's other people's code we have to worry about. In this month's article, I will cover some techniques you can use to keep your applications up and running even if one of the core servers is unavailable (because of someone else's code, of course).

There are two key techniques for designing highly available applications. The first applies to the Web tier: the most important thing you can do is design a stateless application that can scale out through the use of cyberbricks. I discussed this topic at length a few months ago in my article on Planning High-Performance Web Sites, so I won't belabor the point. The second technique involves the aggressive use of error handling to intelligently detect and respond to errors caused by resources being unavailable. So that you can better understand what your error handler needs to do, let's first review the relative pros and cons of Microsoft Cluster Service (MSCS) and log shipping.
Clustering versus Log Shipping

With clustering, the backup node takes over the very same data the primary was using, so once the fail-over completes it has completely current, consistent data; the price is the fail-over period itself, during which the database is unavailable. Conversely, when you use log shipping, the backup server (or servers, if you are truly paranoid) is available immediately. The downside, however, is that there will always be a delay (a minimum of one minute) between when a new transaction is committed on the live server and when the backup server receives the shipped log. Thus, if your primary server fails and you need to connect to your target server, you are forced to access it in read-only mode. If you were to update the target server, you would introduce inconsistencies, because that server wouldn't have the most current data.

Besides being immediately available, log shipping has the huge benefit of supporting geographically diverse backup sites. Therefore, log shipping allows you to have access to your data (albeit read-only) even if your primary data center slips into the Pacific Ocean or is hit by rolling blackouts.

So which approach should you use? Clustering provides guaranteed consistency and a single image that is easy to manage, but is not always available. Log shipping provides immediate availability, but limits you to read-only access. When in doubt, I always opt for both. What happens if you combine these two approaches? For SQL Server-based applications, MSCS really provides the best availability solution, except for that pesky period during the actual fail-over when the database is unavailable. Nice, but not five nines; three at the most. But wait: what if you accessed a log-shipped server during this brief fail-over? You could get those two nines back.

Combining Clustering and Log Shipping

When your application hits a connection error, the error handler's first move should be simply to retry the operation; a transient network glitch or a fast MSCS fail-over may well have resolved itself by the second attempt. If the second attempt also fails, you probably have something more serious than a router hiccup, and it's time to hit the contingency plan. If the desired operation was a read, simply update your connection string to point to one of the read-only log-shipped servers. One common dilemma is how to determine the connection string for a log-shipped server, as this information changes as you update your network topology and disaster-recovery plans. The solution is easy: put the string in Active Directory. For more information on how developers can leverage Active Directory, see my article Leverage Windows 2000 Active Directory Services to Build Distributed Applications.

That takes care of the read operation, but what if you want to update the database? If the requested operation was a write, you can use Microsoft Message Queue (MSMQ), which is a standard component in Microsoft Windows® 2000 and is part of the Option Pack for Microsoft Windows NT® 4.0. MSMQ is a loosely coupled, reliable network communications service based on a store-and-forward messaging model. When you add a message to a queue, MSMQ guarantees that the message will be delivered once and only once, in the exact order in which it was added to the queue. Better yet, MSMQ is fully transactional, so you can use it with the Distributed Transaction Coordinator (DTC) just as you do when you access multiple databases. For more information on MSMQ, see the Microsoft Message Queue Server Developer Resources page.

With MSMQ, rather than connecting to the database via ADO and submitting your update, you create a new MSMQ message with the particulars of your desired database operation and push the message onto the queue. The message then sits in the queue until the database comes back up. Because MSMQ supports asynchronous communication, your application can continue with its logic as soon as the message is in the queue (which happens nearly instantaneously).
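To make the flow concrete, here is a minimal sketch of such an error handler. Everything in it is my own illustrative assumption rather than anything from the article: the server names, the `connect` stub standing in for an ADO connection attempt, and the in-process `queue.Queue` standing in for a transactional MSMQ queue.

```python
import queue

PRIMARY = "sql-primary"         # hypothetical clustered SQL Server instance
READ_REPLICA = "sql-logship-1"  # hypothetical read-only log-shipped server
write_queue = queue.Queue()     # stands in for a transactional MSMQ queue

def connect(server, servers_up):
    """Stub for an ADO connection attempt; raises if the server is down."""
    if server not in servers_up:
        raise ConnectionError(server)
    return server

def read(sql, servers_up):
    """Retry the primary once, then fall back to the read-only replica."""
    for _ in range(2):
        try:
            return ("primary", connect(PRIMARY, servers_up))
        except ConnectionError:
            pass
    return ("replica", connect(READ_REPLICA, servers_up))

def write(sql, servers_up):
    """Writes cannot go to a read-only replica; queue them instead."""
    try:
        connect(PRIMARY, servers_up)
        return "written"        # normally you would submit the update here
    except ConnectionError:
        write_queue.put(sql)    # applied later, once the primary recovers
        return "queued"
```

The point of the sketch is the asymmetry: reads can be redirected to a log-shipped server immediately, while writes must be parked in a queue until the primary comes back.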
If you had stayed with ADO, you would have had to wait in a loop until the database came back up, or return an error to the user.

The best part of this approach is that it works equally well whether the failure is caused by the database being down because of an MSCS fail-over or by the database simply being unreachable because of a network error. First I will show you a technique to deal with the fail-over scenario. Then I will show you how to extend this approach to address network problems.

The first step is to create a stored procedure that will read the messages off the queue and update the database with the queued transactions. Working with MSMQ directly from Transact-SQL can get a bit complex, so you probably want to write a COM component to do the actual queue handling. In this approach, your stored procedure uses the OLE Automation stored procedures (sp_OACreate and its companions) to instantiate that component, which then drains the queue and applies the updates.
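The drain logic itself, including a guard that waits for the target database to finish recovering, can be sketched as follows. The in-process `queue.Queue`, `apply_to_db`, and `is_online` are my own stand-ins for MSMQ, the database update, and an online-status check; none of these names come from the article.

```python
import itertools
import queue

def wait_until_online(is_online, max_polls=10):
    """Poll until the target database reports itself online."""
    for n in itertools.count(1):
        if is_online():
            return n
        if n >= max_polls:
            raise TimeoutError("database never came online")
        # in the real stored procedure you would pause between polls

def drain(q, apply_to_db, is_online=lambda: True):
    """Apply every queued update to the database, exactly once, in FIFO order."""
    wait_until_online(is_online)    # the target DB may still be recovering
    applied = 0
    while True:
        try:
            msg = q.get_nowait()
        except queue.Empty:
            return applied          # queue drained; all updates applied
        # with real MSMQ, the receive and the update would share one DTC
        # transaction, so a crash cannot lose or double-apply a message
        apply_to_db(msg)
        applied += 1
```

The comment about sharing a DTC transaction is the crucial design point: because MSMQ is transactional, the dequeue and the database write can commit or roll back together.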
Once your stored procedure is written, you need to tell Microsoft SQL Server 2000 to execute it automatically whenever the server starts up. First you need to enable startup procedures, because they are disabled by default. To do this, use sp_configure to turn on the 'scan for startup procs' server option. This setting does not take effect until you stop and restart the database engine. Now you're ready to configure your procedure so that it runs on startup. To do this, use the sp_procoption system stored procedure.

One caveat: startup procedures can only live in the master database. This means that, depending on how long it takes Microsoft SQL Server 2000 to recover the database you want to update, that database might not be online when the startup procedure executes. To prevent this from being a problem, you need to write some logic in your stored procedure that polls the server to see when the database is back online.

As a final step, you need to guard against the possibility that your error handler placed messages on the queue because of network problems rather than because of a fail-over event. If the outage was caused by network connectivity (or the lack thereof), the database will never start up, since it never stopped running, so your startup procedure will never execute. Getting around this problem is simple: create a job and use SQL Server Agent to execute it on a recurring basis. The job itself is very straightforward; it just calls your startup procedure. There you have it. By using this technique along with the new features
in Microsoft SQL Server 2000, you can isolate your application from almost
any sort of failure. What if the unthinkable happens and your application
does indeed go down? That's when the third ingredient, process, kicks in.

I emphasized the importance of process with a table I presented at the beginning of my earlier article, which listed the common causes of downtime. Here is a summary:

Table 1: Common Causes of Downtime
Fully 40% of all system downtime is the result of poor operations. Like the application development practices I just discussed, operations are entirely under your control. Furthermore, operational procedures are often relatively easy to address.

First let's define availability and how it differs from reliability. Many people use the two terms interchangeably, and for most purposes that is okay. When you get into operations, however, the difference becomes important. Reliability refers to how often the system fails: the fewer the failures, the more reliable the system. Typically, reliability is expressed in terms of mean time to failure (MTTF). Availability, on the other hand, measures the amount of time your system performs its functions as specified. Availability, of course, is the measure we are after. To calculate availability, start with MTTF and factor in how long it takes to recover from a failure, the mean time to recover (MTTR). The formula looks like this:

Availability = MTTF / (MTTF + MTTR)
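As a quick sanity check on the arithmetic, the availability calculation is easy to express in code. This is just a sketch of the MTTF/MTTR relationship, with minutes as the unit; the function name is mine.

```python
def availability(mttf_minutes, mttr_minutes):
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_minutes / (mttf_minutes + mttr_minutes)
```

Note that shrinking MTTR raises availability just as surely as stretching MTTF does, which is exactly why process belongs alongside technology and people.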
For example, if your application averages a failure every three months (every 131,400 minutes) and it takes you 60 minutes to recover, your projected availability would be 131,400 / (131,400 + 60) = 99.95%. The important thing to remember is that you can increase availability both by improving your system's reliability (technology and people) and by reducing the time it takes to recover (process).

The Microsoft Operations Framework and Related Resources

For information specific to Microsoft SQL Server 2000, the soon-to-be-released Microsoft SQL Server 2000 Resource Kit is perfect for anyone looking to manage a database server. Another excellent source of information is the High Availability SQL Server Platform Engineering white paper, which describes the steps Microsoft's Information Technology Group has taken to engineer high availability into its internal SQL Server applications.

Some of the best practices around operations are simple but often
overlooked. Recovery manuals, for example, can save hours in an emergency.
At minimum, a recovery manual should list information such as settings on
network cards and IP addresses, instructions for installing the
application, special registry settings, and tuning parameters. You should
also think about keeping an installation CD and floppies next to the
server. Basically, you should be prepared for a situation in which someone
who knows absolutely nothing about the server or the application needs to
rebuild the server from scratch.

Good operations also mean heading off failures before they cause an outage. For example, suppose you have a component that slowly leaks memory. Rather than waiting until the component consumes all available memory and brings down the entire server, you can gracefully shut the application down. In a Web farm, you can even do this with absolutely no downtime. When you want to reset the application, first drain off all the current users and take the server out of the load-balancing scheme. As soon as all the connections have dropped off, you can reset the system. Once the system comes back up, you can again add it to the load-balanced cluster. If you are using Microsoft Application Center 2000, you can even automate this entire process. To learn more about Microsoft Application Center 2000, see the Microsoft Application Center 2000 Resource Kit. You can also read my article Building and Managing Complex Web Applications with .NET Enterprise Servers, which covers Microsoft Application Center 2000 in some detail.

Conclusion

Microsoft Press provides in-depth documentation for these and all other
issues related to developing and administering SQL Server applications.
The best place to start is with Inside Microsoft SQL Server 2000. If you plan to support any SQL Server 7.0 applications, you should also get a copy of Inside Microsoft SQL Server 7.0. See the Table of Contents, sample chapter, and index to review the complete scope of coverage. There are enough differences between the two editions to make both texts essential for the serious SQL Server developer.

The Microsoft SQL Server 2000 Resource Kit is the perfect companion for any Microsoft SQL Server professional. It will help you plan, install, maintain, and optimize Microsoft SQL Server 2000. Besides being a great resource for tips and tricks, it includes a CD with several must-have tools for Microsoft SQL Server. See the Table of Contents, sample chapter, and index.

For a quick overview of the new features in Microsoft SQL Server 2000, take a look at this article in the SQL Server section of Microsoft.com: What's New in SQL Server 2000. If you are after a comprehensive look at the new features of Microsoft SQL Server 2000, along with lots of sample code, check out Carl Nolan's excellent MSDN article SQL Server 2000: New Features Provide Unmatched Ease of Use and Scalability to Admins and Users.

Other books every developer should have in her or his library include Steve McConnell's trilogy: Code Complete, Software Project Survival Guide, and Rapid Development.
Other titles from Microsoft Press you might want to check out include the following: Microsoft
SQL Server 2000 Administrator's Companion is a great book for the
administrator looking to support highly reliable SQL Server
installations. Inside
Microsoft Windows 2000, Third Edition, is the definitive guide to the
internals of Windows 2000. The Microsoft product team wrote this book with
full access to the source code, so you know you're getting the most
comprehensive technical information available.