SQL Server Solutions

June 2000

Using Northwind to Create a Huge Test Database

 
by Scott Mauvais

Last month, in the article "Understanding parallel queries in SQL Server 7.0", we used a sample database based on the standard Northwind database included with SQL Server—but it was 1000 times the size of the standard one. We needed a database this large to make sure the dataset was big enough for the query optimizer to consider parallelism when it executed a query.

In this article, we're going to show you the BigNW.sql script we used to create this large database, as shown in Figure A. This script creates a new database named BigNW based on the Northwind sample database. We'll walk you through the major sections of this script. Further, you can download this script from our ftp site at ftp://ftp.elementkjournals.com/sql/sql006.zip. By the end of the article, you'll not only have the means to create a database as large as your hard disks can handle, but you'll also have a script that you can modify to create large databases based on your own production data.

Figure A: Use the BigNW.sql script to create a large database for testing your server.

Why do you want a large database?

You might be asking yourself "Why would I want such a large database?" Once you have a large dataset built on real-life data, you can start testing all sorts of scenarios in your environment. Obvious scenarios include those listed in Table A.

Table A: Scenarios to test in your environment

Capacity planning: What will performance be like in 12 months? How much horsepower do I need?
Operations: When my database is 10 times bigger than it is now, can I still back it up with my current procedures? Can I still perform my routine maintenance in the time available during scheduled downtime?
Optimization: As it grows, will my database benefit from file groups? What about additional disk spindles or controllers? Should I change the FILLFACTOR?
Disaster recovery: Will I still be able to restore the database within my agreed-upon recovery window after 18 months of growth at our current rate? What if the rate doubles?
Contingency planning: What do I need to do if that new Web site really takes off?

You'll also find a large database useful in some unexpected ways. As a developer, you'll find performance tuning becomes much easier when you run suspect queries against a large dataset. It's often hard to spot bad queries when you run them against the small datasets typical on most developers' computers, because the performance hit may be only a couple of seconds. When we started testing against databases several times larger than those in production, our poorly written stored procedures jumped right out at us because they would often run for minutes. Whoops!

Several years back, when Microsoft Access first shipped, we created some Access Basic scripts to increase the size of its version of the Northwind database to test the limits of the Jet engine. We did this by reading the order entry tables, modifying the data just a bit, and inserting it (or appending it in Access-speak) back into the original tables. We then repeated this process until we'd created a database of the desired size.

During the technical beta for SQL Server 7.0, we compared the performance of some complex queries on the various beta releases. To make sure we had an easily reproducible test database, we decided to port our Access Basic scripts to the Transact-SQL script called BigNW.SQL. Then we used this script to create a huge SQL Server database and ran our tests against it.

When mapping out the tasks in the script, we broke it into four major sections, which we'll take a look at in the following sections:

  1. Set up the environment

  2. Create the database and populate it with the standard Northwind data

  3. Modify the tables

  4. Create sample data

Setting up the environment

The first part of the script sets up the environment by declaring and assigning values to some variables that hold database names, paths, and the like. Probably the most important variable here is @lHowBig, which determines the number of times the process will loop through the tables while creating sample data. We've set @lHowBig to a default value of 100, which means the resulting database will have 100 times the number of customers and orders as the original. Executing the script in the default mode will require 60 MB of disk space and will take around two minutes (depending on your hardware).
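To give you a feel for this section (the exact statements in the download may differ slightly), the declarations boil down to something like this:

DECLARE @sDBName sysname
DECLARE @lHowBig int

-- Database to create and the growth factor
SELECT @sDBName = 'BigNW'
SELECT @lHowBig = 100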

Managing the temporary tables generated by the script

Declaring the variables and assigning a value to them is pretty basic stuff. One thing we do want to draw your attention to, however, is how we'll handle temporary tables. Normally, when you create a table, if you start the table name off with a pound sign (#), you signify to SQL Server that you want it to be a temporary table. For example, the following SQL statement creates a temporary table named #BigNW_Vars:

CREATE TABLE #BigNW_Vars
(DBName sysname, HowBig int)

When SQL Server sees a statement like this, it creates a new table within the TempDB system database. This table is private to the user who created it and the server guarantees that the name won't collide with any other user's temporary tables—even if multiple users issue the same statement at the same time. That's the good part.

The downside is that SQL Server accomplishes this by obfuscating the table name, appending an arbitrary suffix to it. This makes it extremely difficult to use the standard IF EXISTS construct to drop the table at the beginning of the script in case it already exists. You might ask, "Why would a temporary table still exist? Shouldn't SQL Server automatically drop it when the user's session ends?"

When we were first developing this script, we occasionally introduced a bug (surprise!) that prevented the script from completing, so the clean-up code at the end that removed the temporary tables never executed. After we fixed the bug, we'd simply re-execute the script, only to have it error out when it tried to create the temporary tables at the beginning. Because we hadn't ended our previous session gracefully, SQL Server hadn't automatically dropped the tables, so we couldn't re-create them. Now that we've debugged the script, we still need the ability to drop these temporary tables because BigNW.SQL might still abend for reasons such as lack of disk space.

But before we can drop the tables, we first have to verify that they exist. As we mentioned earlier, we can't use the standard IF EXISTS approach shown below, because there's no row named exactly #BigNW_Vars in tempdb..sysobjects:

IF EXISTS (SELECT * FROM tempdb..sysobjects
    WHERE name = '#BigNW_Vars'
    AND type = 'U')
DROP TABLE #BigNW_Vars

To get around this problem, you can use the OBJECT_ID() function, which somehow knows you mean your own instance of the temporary table. This is a very handy feature in SQL Server 7.0 that doesn't seem to be documented anywhere! If the table exists, the function returns its object ID; if it doesn't, the function returns Null. So rather than the standard test for the existence of a row, we can test whether OBJECT_ID() returns Null. Here's the code snippet to use:

IF OBJECT_ID('TempDB..#BigNW_Vars')
IS NOT NULL
DROP TABLE #BigNW_Vars

When you use this code snippet, make sure you specify the database. Using OBJECT_ID('#BigNW_Vars') alone always returns Null because OBJECT_ID() looks only in the current database, even though the pound sign specifies that the table is temporary and thus must be stored in TempDB.
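If you want to see this behavior for yourself, run a quick throwaway test from any database other than TempDB (the #Demo table here is purely hypothetical):

CREATE TABLE #Demo (i int)
-- Looks in the current database, so this returns Null
SELECT OBJECT_ID('#Demo')
-- Qualifying the name with TempDB finds your private instance
SELECT OBJECT_ID('TempDB..#Demo')
DROP TABLE #Demo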

Creating temporary tables

We're now ready to start creating some temporary tables. Because you must execute this script in three Transact-SQL batches, you'll lose all of your local variables when they go out of scope as soon as the first batch ends with the GO command. To work around this, the first batch creates the ##BigNW_Vars table (the double pound sign makes it a global temporary table) to hold the values we'll need to access in later batches, using the following syntax:

CREATE TABLE ##BigNW_Vars
(DBName sysname, HowBig int)

INSERT INTO ##BigNW_Vars (DBName, HowBig)
VALUES (@sDBName, @lHowBig)

When you want to retrieve these variables in a subsequent batch, you simply select the values into that batch's local variables like we do below with the @sDBName variable:

SELECT @sDBName = DBName FROM ##BigNW_Vars
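If a later batch needs both values, the same pattern extends naturally; here's a minimal sketch using the script's naming conventions:

DECLARE @sDBName sysname
DECLARE @lHowBig int
-- Reload this batch's local variables from the global temp table
SELECT @sDBName = DBName, @lHowBig = HowBig FROM ##BigNW_Vars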

One final word about variables before we move on to creating the database: To make the code snippets easier to read, we're going to replace the variables with their default values. When you read through the code in the download file, you'll find statements like:

SELECT @sSQL = "DUMP DATABASE " + @sSourceDB
    + " TO " + @sBackupName + " WITH INIT"
EXEC (@sSQL)

But in this article, we're going to show this statement as follows:

DUMP DATABASE Northwind TO NorthwindToBigNW WITH INIT

Showing the statements in this format spares you from wading through all the dynamic SQL. Now that we've declared and populated our variables and dealt with the temporary tables, let's move on to creating the database.

Creating the BigNW database

Creating the BigNW database is a three-step process. First, we need to drop the database if it already exists. Second, we'll create an empty database called BigNW. Finally, we'll dump the original Northwind database and load it over our new, empty BigNW. Begin with the following statements:

USE TempDB

IF EXISTS (SELECT * FROM master..sysdatabases
    WHERE name = 'BigNW')
EXEC ("DROP DATABASE BigNW")

Before we attempt to drop the database, we must make sure we aren't currently logged into that very database (you can't drop a database that's in use, even if you're the only one using it). We switch to TempDB because it's a neutral database that's guaranteed to exist on every system. Next, we want to make sure the BigNW database does indeed exist before we actually drop it.

Once we know it exists, we drop it with the EXEC command. We need the EXEC syntax because DROP DATABASE must be the first command of a batch; otherwise, we'll get error number 226. It gets a bit more interesting because we're building these statements dynamically: we end up with an EXEC executing some dynamic SQL that contains another EXEC.

The remainder of the database section of the script creates the database, backs up the Northwind database, and restores it over our new database. Finally, it sets some database options, such as trunc. log on chkpt., to optimize performance for inserting a large amount of data. This process is either old hat to anyone who has worked with SQL Server for a while, or it's easy to learn by reading the code. Accordingly, we'll skip over it here and move on to the next section of the script file.
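For reference, the whole sequence boils down to something like the following sketch. The dump device name comes from the earlier snippet; the file path is an assumption, and the real script builds these statements dynamically:

-- Create a backup device and an empty target database (path assumed)
EXEC sp_addumpdevice 'disk', 'NorthwindToBigNW',
    'C:\MSSQL7\BACKUP\NorthwindToBigNW.bak'
CREATE DATABASE BigNW

-- Copy Northwind over the new database
DUMP DATABASE Northwind TO NorthwindToBigNW WITH INIT
LOAD DATABASE BigNW FROM NorthwindToBigNW WITH REPLACE

-- Keep the log from filling up during the big inserts
EXEC sp_dboption 'BigNW', 'trunc. log on chkpt.', 'true'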

Modifying the tables

In this section, we need to accomplish two things: dropping the constraints and indices for faster loading, and increasing the size of some of the primary key fields so they can hold more unique values. We must remove the constraints first because SQL Server will not allow us to modify the underlying fields while the constraints are still being enforced. Even if we didn't need to change the fields, we would still want to drop the constraints because we don't want to incur the overhead of enforcing them for each and every row we insert.

While we're at it, we also want to remove the indices. If we don't, SQL Server won't just write the data page for each row we add; it will have to update the various index pages too. It gets even worse when you consider that Northwind makes liberal use of clustered indices. Because SQL Server keeps the physical order of the rows the same as the clustered index order, we'll cause many page splits as we insert our new data if we leave the indices on the tables.
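To give you a flavor of what these drops look like (the two object names below follow the stock SQL Server 7.0 Northwind schema; the script drops many more):

-- Drop a foreign key constraint, then a nonclustered index
ALTER TABLE Orders DROP CONSTRAINT FK_Orders_Customers
DROP INDEX Customers.CompanyName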

When we first started developing this script, we didn't take the time to write the portion that removes the indices. Like most software projects, the schedule was very tight and this was clearly a feature that could wait until the second version. But as soon as we added the code to drop the indices, we saw a threefold increase in performance!

In this version of BigNW.SQL, we've hard-coded the names of the constraints and indices to match the table structure of the Northwind database that ships with SQL Server 7.0. If we were to create a more robust version, we'd probably read the schema to dynamically generate a list of objects to drop. Maybe in the next release. In the meantime, you can use this stored procedure to show the tables referencing a given table:

CREATE PROCEDURE procShowRefs
    @psTableName SYSNAME = '' AS
IF @psTableName = ''
BEGIN
    PRINT 'USAGE: procShowRefs @psTableName = Name'
    RETURN (1)
END
SELECT OBJECT_NAME(fkeyid) AS TableName,
    OBJECT_NAME(constid) AS ConstraintName
FROM sysreferences
WHERE rkeyid = OBJECT_ID(@psTableName)

To generate a list of indices for a specific table, search the online Help for INDEX_COL(). The sample code includes an excellent example showing just how to do this.
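If you'd rather not dig through Help just yet, a minimal sketch along the same lines reads the index names straight out of sysindexes (indid 0 is the table itself and 255 is text/image data, so we skip both; note that auto-created statistics also show up here):

SELECT name, indid
FROM sysindexes
WHERE id = OBJECT_ID('Customers')
AND indid BETWEEN 1 AND 254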

You still need to make one more change to the database schema to support generating the sample data, but we'll cover that after describing the process we're going to use to loop through the tables.

Creating the sample data

We have just one more major element of this process to discuss, and it's the most important one: just how are we going to expand the database by a factor of 100? It turns out to be fairly easy. We're simply going to copy all the rows out of the tables we want to grow and place them into temporary tables. Then we'll insert this data back into the original tables 100 times. While simple in theory, it does present some interesting challenges in practice.

First, we must prevent key violations by generating a new primary key for each row we insert. This isn't a problem for tables where the primary key is numeric (by numeric, we mean int, float, and the like, not just the numeric data type), because we can simply add one to the last value and write some error-checking statements to handle collisions. For identity-based primary keys it's even easier: SQL Server takes care of assigning a unique value to every row for us. The only thing we need to be careful of is that we don't add so many rows that we exceed the upper limit of the data type.

Creating new values for text-based primary keys presents a different problem. We have just such an issue with the Customers table. To solve it, the first time you insert the data from the temporary table, you'll append 1 to the value in the CustomerID column; the second time through, 2; and so on. Before you can do this, however, you need to increase the size of the CustomerID column to accommodate the extra characters.

When you install Northwind, the CustomerID column's data type is VARCHAR(5). Because we can't imagine increasing this database by a factor of more than one million, let's increase it to VARCHAR(11) and buy ourselves six extra characters. When you change the size of CustomerID in the Customers table, you must also change the size of all the columns that reference it, or you'll receive an error when you try to re-create the foreign key relationships. Take a look at the error message below; in this example, we neglected to change the size of the CustomerID column in the CustomerCustomerDemo table:

Server: Msg 1753, Level 16, State 1, Line 142
Column 'Customers.CustomerID' is not the same length as referencing column 'CustomerCustomerDemo.CustomerID' in foreign key 'FK_CustomerCustomerDemo_Customers'.

Did you know that there's even a table called CustomerCustomerDemo in the Northwind database? We didn't until the first time we ran this script and received this error.

Finally, we'll also add a new column called Cycle for housekeeping: it lets us keep track of the cycle in which we added a particular row. Make your final schema changes with the following SQL:

-- Increase size of all CustomerID columns
ALTER TABLE Customers ALTER COLUMN CustomerID
    VARCHAR(11) NOT NULL
ALTER TABLE CustomerCustomerDemo ALTER COLUMN
    CustomerID VARCHAR(11) NOT NULL
ALTER TABLE Orders ALTER COLUMN CustomerID
    VARCHAR(11) NOT NULL

-- Add Cycle column to each table
ALTER TABLE Customers ADD Cycle INT
ALTER TABLE Orders ADD Cycle INT
ALTER TABLE [Order Details] ADD Cycle INT

Inserting data

We're now ready to start inserting data. We'll start with the master tables, move on to the header tables, and conclude with the transactions. We do this to maintain the referential integrity. To keep this script down to a manageable size, we'll only include the order entry tables.

In Northwind, as in most other order-management databases, the customer table is the master. Let's step through the code we use to expand it:

-- Declare our variables
DECLARE @i INT
DECLARE @lHowBig INT

-- Retrieve the value for the size factor
SELECT @lHowBig = HowBig FROM ##BigNW_Vars

-- Init the counter to 0
SELECT @i = 0

-- Remove any pre-existing new rows
DELETE FROM Customers WHERE Cycle IS NOT NULL

-- Copy to the temp table we will use as source
-- and set the cycle counter to 0
SELECT * INTO #BigNW_Cust FROM Customers
UPDATE #BigNW_Cust SET Cycle = @i

After setting up the variables, the first thing to do is remove any rows you may have already added to the Customers table in an earlier execution of this script. This isn't necessary if you run the entire script from top to bottom, because you restore BigNW from a fresh backup of Northwind. We included it in case you want to run just a portion of the script without running the entire process.

Use your new Cycle column to identify the new rows you've added to the table. Whenever you insert a row, you'll populate the Cycle column with the current cycle number. Because we allowed Nulls when we added this column, we know rows with a Null value must be from the original Northwind database. Therefore, delete any row that has a non-Null value in the Cycle column to get back to the original state of the Customers table.

Now that you have the Customers table down to the 91 original rows, use SELECT * INTO to create an exact duplicate in a temporary table called #BigNW_Cust. Once you have the temporary table created, set the Cycle column to the current value of the counter variable.

Now let's move on to the looping section where the real work gets done:

WHILE @i < @lHowBig
BEGIN
    BEGIN TRAN
    SELECT @i = @i + 1

    -- Update PK so it is unique and insert
    UPDATE #BigNW_Cust SET CustomerID =
        LEFT(CustomerID, 5) +
        CAST(@i AS VARCHAR(5)), Cycle = @i
    INSERT Customers SELECT * FROM #BigNW_Cust

    PRINT "Customer passes complete: "
        + CAST(@i as varchar(5))
    COMMIT WORK
END

Start by testing the current value of the counter to make sure you aren't done yet. Assuming you aren't, increase the counter by one. Then, update all 91 rows in the temporary table, appending the value of @i (our counter) to the first five characters of the CustomerID column to ensure a unique primary key.

Next, insert the contents of the temporary table back into the main Customers table, print out some status information so you know something is happening, and restart the loop. When you're done, use the following query to test the results:

SELECT TOP 5 CustomerID, Cycle
FROM Customers
ORDER BY CustomerID

The output of this query is shown in Table B. Notice how the rows sort in alphabetical rather than numeric order.

Table B: The first rows of the Customers table after running BigNW.SQL

CustomerID   Cycle
ALFKI        NULL
ALFKI1       1
ALFKI10      10
ALFKI11      11
ALFKI12      12

You might be wondering why we use the LEFT function. We use it because we want only the original five characters of the CustomerID. Table C shows what would have happened had we just used the statement:

SET CustomerID = CustomerID + CAST(@i AS VARCHAR(5))

Table C: The result if we don't use LEFT(CustomerID, 5)

CustomerID     Cycle
ALFKI          NULL
ALFKI1         1
ALFKI12        2
ALFKI123       3
ALFKI1234567   7

As you can see, each loop just appended its value to the result of the previous one. By the seventh cycle, the value would be longer than the column's 11-character limit. If you run BigNW.sql with the default settings, at this point you'll have a Customers table with 9,191 rows (the original 91 plus 9,100 new ones).
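A quick sanity check, assuming the default @lHowBig of 100:

-- Should return 9191: the 91 originals plus 91 * 100 new rows
SELECT COUNT(*) FROM Customers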

With the master table populated, we're ready for Orders, the header table. Adding rows to this table follows a similar approach to the one we used with the Customers table, so we'll focus only on the differences. Also, we'll skip the first part, where we declare the variables and clear out the old rows, because it's the same. Listing A shows the relevant portion of the code.

Listing A: Use this query to copy all of the columns from the Northwind..Orders table into a temporary table.

-- Copy all columns except OrderID to the temp table
SELECT CustomerID, EmployeeID, OrderDate, RequiredDate,
    ShippedDate, ShipVia, Freight,
    ShipName, ShipAddress, ShipCity, ShipRegion,
    ShipPostalCode, ShipCountry, Cycle
INTO #BigNW_Ord FROM Orders
UPDATE #BigNW_Ord SET Cycle = @i

-- Done yet?
WHILE @i < @lHowBig
BEGIN
    SELECT @i = @i + 1
    UPDATE #BigNW_Ord SET CustomerID = LEFT(CustomerID, 5)
        + CAST(@i AS VARCHAR(5)), Cycle = @i
    INSERT Orders SELECT * FROM #BigNW_Ord
    PRINT "Orders passes complete: " + CAST(@i as varchar(5))
END

The first difference you'll notice is that we explicitly listed the columns rather than just using a SELECT * statement. We have to do this because OrderID is an Identity column, and we want SQL Server to automatically assign the next value. This takes care of ensuring we have unique primary keys. But what about foreign key columns? We need to change those, too. If we didn't, we'd generate 83,000 orders (100 cycles times the original 830) that referenced only the original 91 customers. While this would work from an integrity standpoint, such a skewed distribution wouldn't provide a realistic database for testing and capacity planning.

Getting around this is extremely easy: we simply reuse the code that generated the unique CustomerIDs. The first round of orders takes the original CustomerIDs and appends 1; the second, 2; and so on. The end result is exactly the same distribution of orders per customer as when we started.
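If you want to convince yourself the distribution held, a quick check (our own diagnostic, not part of the script) groups the orders by cycle:

-- Each cycle should contribute 830 orders with the same
-- per-customer distribution as the originals (Cycle is Null)
SELECT Cycle, COUNT(*) AS Orders,
    COUNT(DISTINCT CustomerID) AS Customers
FROM Orders
GROUP BY Cycle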

It's worth noting here that we didn't add any new rows to the Shippers or Employees tables. We contemplated it, but decided not to because we thought it was unlikely that a hundredfold increase in sales would necessitate a corresponding increase in staff and shipment methods. While such a growth rate would clearly require some additional resources, we didn't think the added complexity was worth the benefit when it came down to generating sample data.

On to our last table, Order Details. This one is a bit more complicated, so let's jump right into the code:

-- Make two copies of Order Details:
-- one, OrdDet, that we will update and
-- insert back into the original table.
-- We will use the other, OrdDetOrig,
-- to refresh OrdDet after each cycle
SELECT * INTO #BigNW_OrdDet
FROM [Order Details]
SELECT * INTO #BigNW_OrdDetOrig
FROM [Order Details]
UPDATE #BigNW_OrdDet SET Cycle = @i

We start off just as we did for the previous two tables, but here you need to make two copies. The reason for the second copy will become clear after we walk through the next section. Because the table doesn't have an Identity column, we can revert to the SELECT * syntax we used for the Customers table, as shown in Listing B.

Listing B: Use this script to populate the Order Details table in the BigNW database.

WHILE @i < @lHowBig
BEGIN
    SELECT @i = @i + 1

    -- Refresh OrdDet temp table with the original
    -- [Order Details] table. We need to do this
    -- because we need the original values in
    -- the OrderID column
    TRUNCATE TABLE #BigNW_OrdDet
    INSERT INTO #BigNW_OrdDet
        SELECT * FROM #BigNW_OrdDetOrig
        WHERE Cycle IS NULL

    -- 830 orders
    UPDATE #BigNW_OrdDet SET OrderID = (OrderID + (830 * @i)), Cycle = @i
    INSERT [Order Details] SELECT * FROM #BigNW_OrdDet
    PRINT "Order Details passes complete: " + CAST(@i as varchar(5))
END

The key to understanding this entire loop is this one line of code:

SET OrderID = (OrderID + (830 * @i))

The OrderID column has a foreign key reference to the Orders table. Therefore, we need some way to calculate values for [Order Details].OrderID so that there's a corresponding value in the Orders table for each row we add to the Order Details table.

Because Orders.OrderID is an identity column, we know the value in every row we added in the last section will be one greater than the previous one. We also know by executing

SELECT COUNT(*) FROM Orders

that there are 830 orders in the original Northwind. Although not relevant to the formula, knowing that the first OrderID is 10,248 helps the explanation. (We don't know what happened to the first 10,247, but you can verify that this is by design by reading the InstNwnd.sql script the SQL Server installer uses to build the database.)

Now that we know these three pieces of information (Orders.OrderID is an identity column, the first order is number 10,248, and there are 830 orders), how does this get us to SET OrderID = (OrderID + (830 * @i)), and how does it maintain our referential integrity?

Because we relied on Orders.OrderID being an identity column to generate our primary keys, we know the first order we created in the first cycle was 11,078: the first order number, 10,248, plus the 830 original orders. The first order of the second cycle was 11,908, or 10,248 + (830 * 2). Therefore, all we need to do to create OrderIDs in the Order Details table is take the original OrderID for each row and add 830 times the cycle counter to it. Not only does this ensure valid foreign keys, but it also maintains the data distribution.
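Here's the arithmetic worked out, along with a diagnostic query of our own (not part of the script) that confirms no detail row is left without a matching order:

-- Cycle 1: 10248 + (830 * 1) = 11078
-- Cycle 2: 10248 + (830 * 2) = 11908
-- Orphan check: should return no rows
SELECT od.OrderID
FROM [Order Details] od
LEFT JOIN Orders o ON od.OrderID = o.OrderID
WHERE o.OrderID IS NULL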

This brings us back to why we need two copies of the original Order Details table. After we update the temporary table with the new values for OrderID, there's no easy way to get back to the original values. Sure, we could reverse the formula, but it's far easier to overwrite the data with a fresh copy, and that's exactly what the second copy is for.

There you have it: the BigNW database is now 100 times larger, and you have a great platform for testing. The only thing left to do is re-create the constraints and indices and do a little cleanup. As with the initial creation, it's quite easy, so we'll leave it as an exercise for the reader.
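To get you started on that exercise, here's a minimal sketch for one foreign key and one index; the names follow the stock Northwind conventions, and the full set simply mirrors whatever you dropped earlier:

-- Re-create a foreign key and one of the original indices
ALTER TABLE Orders ADD CONSTRAINT FK_Orders_Customers
    FOREIGN KEY (CustomerID) REFERENCES Customers (CustomerID)
CREATE INDEX CustomerID ON Orders (CustomerID)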

Conclusion

In this article, we showed you how to expand the stock Northwind database into a platform where you can do capacity planning, test your operational procedures, and improve the performance of your existing apps.

Along the way, we covered many related topics and showed you several techniques for solving some of the more common headaches of T-SQL, including temporary tables and variable scope. Finally, we looked closely at methods of generating large amounts of test data while maintaining referential integrity in your database.


 
