|
![]() ![]()
April 2000Using heterogeneous queriesby Scott Mauvais Now that corporate IT managers are (finally) viewing SQL Server as a serious enterprise platform, as developers and DBAs, we're facing new challenges. Before the release of SQL Server 7.0, we fought just to get in the game. To do so, we focused all our energies just on proving we could build and maintain high-performance, scaleable applications in a 24 x 7 environment. Now that we're in the data center, we need to learn the rules of the game—and the first rule is to play nice with others. With some two million servers running countless different flavors of UNIX and another 350,000 IBM mainframe and mid-range computers, the challenge of integrating all this disparate data into a structure usable by mere mortals might appear insurmountable. Fortunately, interoperability has always been one of SQL Server's strengths—and the heterogeneous query support in version 7.0 makes it even easier. As you can see in Figure A, SQL Server is able to connect to many different data sources. You can now transparently access any OLE DB data source with the standard Transact-SQL you use with your local data. Figure A: SQL Server 7.0 supports many heterogeneous data
sources. Last month, in the article entitled, "Using distributed queries with SQL Server 7.0", we looked at a related technology, SQL Server's distributed query processor. The distributed query enables you to access data on remote SQL servers and to present users with a single result set as if all of the data came from the local server. In this article, we're going to expand upon the distributed queries we learned about last month by showing you how to use heterogeneous queries to integrate remote data from sources other than SQL Server. We'll start by showing you how to create linked servers with SQL Server Enterprise Manager to connect to several data sources regardless of format or location. From there, we're going to write queries that access both relational and non-relational data and join these result sets with native SQL Server data. At the end of the article, we're going to explore some performance considerations that apply equally to both heterogeneous and distributed queries. Understanding the SQL Server query architectureBefore we get into the details of setting up heterogeneous queries, let's review the architecture behind SQL Server 7.0's distributed query processor and its heterogeneous query support. A heterogeneous query enables developers to issue a single SQL statement against a single SQL server that integrates data from several different stores. You can do all of this with only one connection to a single server because SQL Server takes care of all the hard work for you.Imagine that your company stores its human resources data in Oracle and its sales data in SQL Server. Your manager has just asked to you to create a monthly report that compares each salesperson's salary with his sales volume. Oh, and she wants you to automatically email it to each sales rep's manager so you need to query the Active Directory also. Without heterogeneous queries not only would you have to learn PL/SQL for Oracle and LDAP for the Active Directory, but you would have to manage the connectivity and security for yourself. Moreover, once you received the query results back, you would have to write your own algorithm to merge this data and join it with your local SQL Server Sales data. With heterogeneous queries, you don't need to worry about any of this. Instead, you simply build a standard SQL query as you've always done. When you issue a standard query requesting data, the SQL Server Query Processor first parses it, then compiles and optimizes it, and finally executes it—returning the results to the client application. When processing a heterogeneous query, during the parse phase, SQL Server breaks down the SQL statement and separates out the portions that refer to non-local data. SQL Server then sends the requests that it can't fulfill off to OLE DB. OLE DB in turn passes each request off to the specific OLE DB provider responsible for a given data source. If, for example, your query referenced an employee's pay grade from PeopleSoft running on Oracle and his manager's email address from LDAP, OLE DB would send the compensation request off to the OLE DB Provider for Oracle and the directory request off to the OLE DB Provider for Microsoft Directory Services. Once the data source has processed the request and returned the data—called a rowset in OLE DB-speak—the process is reversed. The data store first sends the data back to the OLE DB provider. The provider returns the data to OLE DB, which in turn passes rowset back to SQL Server. When all the OLE DB providers involved in the query have returned all of their rowsets, SQL Server merges them all together and returns a single result set back to the client application. From the user's perspective—and more importantly from the developer's—all the data came from a single SQL Server. Because Microsoft built SQL Server on OLE DB, heterogeneous queries
aren't limited to traditional relational data sources. With SQL Server,
you can query directory services, file systems, mail stores—any data
source that has an OLE DB provider written for it. For example if you
wanted to locate all of the files in a given directory written by a
specific author, SQL Server would pass the request off to OLE DB, which
would then pass the request off to the OLE DB Provider for Indexing
Service. If you want to determine the OLE DB providers installed on any
SQL server, you can use the Accessing remote serversAs with distributed queries last month, the first step to using heterogeneous queries is telling SQL Server how to access the data. You can do this either at design time by creating and storing a linked server configuration or you can use theOpenRowSet() function and link to the data source dynamically
at run time.
Because each OLE DB provider is specific to a particular data source, the behavior of each provider is a bit different. The main benefit is that the provider—once configured—knows exactly how to locate the data source, establish a connection, authenticate the user, and request the desired data in the most efficient manner. The downside is that the information required to configure a linked server varies by OLE DB provider. As we learned last month, you can set up a linked SQL server with as little as a network address. On the other hand, if you want to use Index Server to query the file system, you'll need to provide the name of the catalog. To use Jet to access a file server-style database (such as one in Microsoft Access) or a flat file, you'll need the fully qualified path to the data source and the name of the ISAM driver you want to use. Defining a linked server in SQL Server Enterprise ManagerOnce you know the required configuration information, adding a linked server is very straightforward. You can use either SQL Server Enterprise Manager or stored procedures. We covered the related store procedure approach, especiallysp_AddLinkedServer , in last month's article so we
won't do so here. There are several options when using SQL Server
Enterprise Manager that we saved to show you this month, so let's look at
them now.
Begin by starting SQL Server Enterprise Manager and connecting to the desired server. Next, expand the Security Folder and then right-click on Linked Servers. From the shortcut menu, choose New Linked Server, as shown in Figure B. This takes you to the General tab of the Linked Server Properties dialog box, as shown Figure C. Let's begin by setting up a connection to an Oracle server so we can look at each of these input fields in detail. You should define the properties as follows: Figure B: You can use SQL Server Enterprise Manager to define a
Linked Server. Figure C: Use the Linked Server Properties dialog box to define
the properties of a new linked server.
Once you've configured the general information for the linked server, you next need to configure the security for the connection. The security options enable you to identify how your SQL server can log on to the remote server. To configure the security options, select the Security tab as shown in Figure D. Figure D: You must also configure the security for a remote data
source. You use this tab to define how you're going to map your local SQL Server users to users on the remote data source—the Oracle server in this example. As you can see, the screen has two sections; we're going to look at the lower section first. Here you define a mapping table to translate user accounts on your local SQL server to valid accounts on the other server (in this case, the Oracle server). In our example, we've mapped the SQL Server sa account to an Oracle DBA account called system. This means that when the local sa user executes a heterogeneous query that involves this remove server, the distributed query processor instructs Oracle to run its portion of the query under the system user's security context. We've also mapped the scott account because, while the user name exists on both servers, the passwords are different. Finally, we checked the Impersonate box for the user demo because the user name and password are the same on both servers. Checking this tells SQL Server to pass the user's credentials directly to the remote server. Now let's return to the top section of the Security tab. Use this section to define how you want to handle users you haven't defined in the mapping section. You can use one of the following four choices:
Let's discuss each of these four options in detail. No Security Context Will Be UsedWhen you use this option, SQL Server makes no attempt to pass any credentials when connecting to a remote data source. For most data stores, this has the result of denying access because most servers will reject anonymous connections. However, this option works just fine for some data sources, such as xBase, which don't support authentication.They Will Be ImpersonatedWhen you use this option, SQL Server first checks to see if the current user has an entry in the mapping table. If the he doesn't, it passes the user's local login and password to the remote server. We rarely use this option because it requires that you make the user's account and password the same on both your local SQL server as well as the remote server—and it's too much effort to keep both accounts in synch.Also, there's a risk of a user inadvertently gaining access because of a security oversight. If you want a user to access a remote system using his local credentials, you should proactively grant access. We also use this as an easy concession to security paranoid data center DBAs that are none-too-thrilled to have a toy database (SQL Server) accessing their real databases (Oracle). They Will Be MappedUse this option if you want any user who isn't listed in the lower portion of the tab to be mapped to the username and password you provide in the adjacent text boxes. This setting is quite similar to SQL Server 7.0's new application database role, only much more frightening. With an application role, SQL Server explicitly delegates security to the application where users' actions are (presumably) tightly controlled and closely matches the intended use of the application.Conversely, if you map all users, every valid SQL Server login has access to the remote data. Worse still, the only way to deny user access is to set up a bogus map with an invalid remote login/password combination. We do occasionally use this setting, however, when we're first setting up the permissions in SQL Server. This way, when we have views and stored procedures that access remote data, we can test out permissions and ownership trees on these local objects without having to constantly bother the other DBAs by asking them to set up accounts on their systems. They Are Not Allowed To AccessUse this setting if you want SQL Server to deny access to the remote data to any user who you haven't explicitly defined in the mapping table. We use this configuration most often for the same reasons we don't use the They Will Be Impersonated option.Using linked serversOnce you've configured security, click OK to save your changes. Now that we've set up a linked server, you might be asking yourself "Where is a linked server saved and how exactly do I use it?" You can look at a linked server as a virtual server.When you create a linked server, SQL Sever encapsulates all of the information the OLE DB provider needs when it connects to the remote data source—the network, security, and semantics of the data source, and saves this configuration information in its system tables. This way, you can set the link up once and all of your developers can use it without having to manually specify the connection information each time they want to access non-SQL Server data. Better yet, if you need to move the data to another server—or even a
different platform—all you need to do is update the associated linked
server. If you require more granular control—for example, if you don't
know where the specific data will be located or what provider you'll want
to use until runtime—you can call the Writing heterogeneous queriesOnce you've defined a linked server, you can refer to any remote resource just as you would if it were a native SQL Server table. When constructing a heterogeneous query, you reference objects on a linked server using a four-part syntax that should be familiar to anyone who has written cross-database queries:
For example, to retrieve the employee list from the demo database that ships with Oracle, we used the following query (because Oracle's system catalog is case sensitive, make sure you use all upper-case letters): LAST_NAME, FUNCTION FROM ORACLE_HR..DEMO.EMPLOYEE E, ORACLE_HR..DEMO.JOB J WHERE E.JOB_ID = J.JOB_ID ORDER BY LAST_NAME, FIRST_NAME, MIDDLE_INITIAL Not only can you read from heterogeneous data sources, but you can also issue updates by using the following query: SET MIDDLE_INITIAL = 'Q' WHERE EMPLOYEE_ID = 7876 You can also create views against the Oracle server that behave just like native SQL Server views. For example, you might start by executing the following SQL statement in the Northwind database to create a simple view:
SELECT * FROM ORACLE_HR..DEMO.JOB Next, you can use the view by using either the Finally, you can join Oracle data together with native SQL Server data
and present the user with a single result set. For example, we created the
Listing A: Joining an Oracle table to a SQL Server table SELECT EmployeeID = CASE E.EMPLOYEE_ID WHEN 7369 THEN 1 WHEN 7499 THEN 2 WHEN 7505 THEN 3 WHEN 7506 THEN 4 WHEN 7507 THEN 5 WHEN 7521 THEN 6 WHEN 7555 THEN 7 WHEN 7557 THEN 8 WHEN 7560 THEN 9 ELSE EMPLOYEE_ID END, FirstName = E.FIRST_NAME, LastName = E.LAST_NAME, Title = J.FUNCTION, TitleOfCourtesy = "Other", BirthDate = DATEADD(day, E.EMPLOYEE_ID, =>CAST('1/1/1940' AS DATETIME)), HireDate = HIRE_DATE, Address = '1600 Pennsylvania Avenue', City = 'Washington', Region = 'DC', PostalCode = '20500', Country = 'USA', HomePhone = '(202) 456-1414', Extension = '123', Notes = 'Total Compensation ' + =>CAST(SALARY + ISNULL(COMMISSION, =>0) AS VARCHAR(16)), ReportsTo = MANAGER_ID, PhotoPath = '' FROM ORACLE_HR..DEMO.EMPLOYEE E, ORACLE_HR..DEMO.JOB J WHERE E.JOB_ID = J.JOB_ID After creating Listing B: Query to easily calculate the sales per employee SELECT O.EmployeeID, E.FirstName, =>E.LastName, SUM(OD.UnitPrice * =>OD.Quantity) AS Sales FROM [Order Details] OD, Orders O, qryOracleEmployees E WHERE O.OrderID = OD.OrderID AND O.EmployeeID = E.EmployeeID GROUP BY O.EmployeeID, E.FirstName, E.LastName This discussion, along with last month's article on distributed queries, should give you a strong understanding of how to work with traditional relational data whether it's stored in SQL Server or some other relational database management system. In the next section, we're going to walk you through linking to non-relational data sources.
Working with non-relational data sourcesBeing able to access any number of different vendors' RDBMSs from a single SQL statement greatly increases our productivity and adds much needed flexibility to our designs. Structured data, however, represents a very small portion of corporate data. By far, the majority of corporate data resides in a myriad of unstructured data sources from email stores to file systems to Web servers. We're going to walk you through setting up linked servers for two common, non-relational stores to which most of you will have easy access: the Windows NT file system and the new Windows 2000 Active Directory.Querying the Windows NT file systemRather than having to develop a full-text query engine for the Windows NT file system, OLE DB allowed the SQL Server development team to build upon Index Server that ships with the Windows NT Option Pack. If you don't already have a copy of Index Server, you can download it free fromwww.microsoft.com/ntserver/nts/downloads/recommended/ To add your Web data store as a linked server, execute the following SQL statement: -- Set up linked server
Note that Web is the default name for the Index Server catalog of the
Default Web Site. If you've changed the name of the catalog, you'll need
to update the sp_DropLinkedSrvLogin @RmtSrvName = WebStore, @LocalLogin = null -- Map the sa account to my network account sp_AddLinkedSrvLogin @RmtSrvName = 'WebStore', @UseSelf = 'false', @LocalLogin = 'sa', @RmtUser = 'MAUVAIS\Administrator', @RmtPassword = 'Secret' Finally, we execute a query that lists all the files in on the Default Web Site that reference SQL Server and ranks them by relevance: SELECT * FROM OPENQUERY(WebStore, 'SELECT Rank, path, DocTitle, Size, HitCount FROM SCOPE('' "/" '') WHERE CONTAINS ('' "SQL Server" '' ) ') ORDER BY Rank Desc
Querying the active directoryLet's move on from the file system to the Active Directory. As companies begin to consolidate multiple directories into Windows 2000's Active Directory, it will be a boon to developers whose apps are directory-enabled and to consultants who can help make legacy applications directory-aware.Better yet, it will make our lives going forward much easier because we can stop worrying about writing separate interfaces for user maintenance and then synchronizing our user data with the various other stores of user information located elsewhere in the enterprise. Let's see what we can do to get a head start on this trend. We start off just as we have in the other examples by adding a linked server and configuring security as shown in Listing C. Listing C: Linked server and security sp_AddLinkedServer @Server = ADSI, @SrvProduct = 'Active Directory Services', @Provider = 'ADSDSOObject' sp_DropLinkedSrvLogin @RmtSrvName = ADSI, @LocalLogin = null --Map sa to NT admin sp_AddLinkedSrvLogin @RmtSrvName = 'ADSI', @UseSelf = 'false', @LocalLogin = 'sa', @RmtUser = 'MAUVAIS\Administrator', @RmtPassword = 'Secret' Now that you've defined the linked server, to quickly get a list of all the users in your directory, you could use this query: SELECT name, adspath, cn FROM OpenQuery(ADSI, 'SELECT adspath, name, cn FROM ''LDAP://DC=mauvais,DC=Com'' WHERE objectCategory=''person'' AND objectClass=''user'' ') You'll want to note the use of two single quotes to escape the quoting required by the LDAP's SQL dialect. Some documentation we've seen incorrectly uses a set of double quotes. If you accidentally use double quotes, ADODB will return generic error number 7321 with the description
You might come across this error fairly often as you first start
writing heterogeneous queries. It simply means you have a syntax error in
the query string portion of You might be temped to include While we're talking about potential problems, be careful not to leave
the security configured for Let's return back to our original example of joining the PeopleSoft data with the sales database and the Active Directory. Here's the query to accomplish that: M.FIRST_NAME AS MANAGER_FIRST, M.LAST_NAME AS MANAGER_LAST, S.Sales, AD.Mail FROM qryOracleEmployees E, ORACLE_HR..DEMO.EMPLOYEE M, qrySalesBySalesPerson S, OpenQuery(ADSI, ‘SELECT givenName, sn, mail FROM ''LDAP://DC=mauvais,DC=Com'' WHERE objectCategory=''person'' AND objectClass=''user'' ') AD WHERE E.ReportsTo = M.EMPLOYEE_ID AND S.EmployeeID = E.EmployeeID AND AD.sn = M.LAST_NAME AND AD.givenName = M.FIRST_NAME Now you have all the information you need to mail a report to the sales managers. Maximizing the performance of heterogeneous queriesFinally, we want to spend a bit of time discussing performance of the distributed query processor, because it will make or break your application—whether you're writing heterogeneous queries across multiple data sources or distributed queries with native SQL Server data. To do this, we'll examine the performance difference between two nearly identical distributed queries accessing the Northwind database on two different servers. This first query joins the local Categories table to the Products table on a remote server and limits the result set to only those products in category 7, which is the Produce category:P.ProductName FROM Categories C, SAP.northwind.dbo.Products P WHERE P.CategoryID = C.CategoryID and C.CategoryID = 7 This second query does exactly the same thing except it uses the
P.ProductName FROM Categories C, SAP.northwind.dbo.Products P WHERE P.CategoryID = C.CategoryID and C.CategoryName = 'Produce' If you were to run these two queries against a single server, they
would perform equally well. When you execute them as distributed queries,
however, the first query runs about three times faster. Why such a large
difference? All of the relevant columns in the join condition and the
To get an idea of how SQL Server optimizes the query, let's take a look
at the execution plans for the two queries. First let's look at the plan
for the faster query, Figure E: You can use the query plan to analyze the execution of
the query containing the WHERE CategoryID = 7 clause. When SQL Server executes the first query, the join condition and the
criteria in the In the second, slower query, Figure F: Use the query plan to analyze the execution of the
query containing the WHERE CategoryName = 'Produce' clause. It does this because the You might be wondering why the remote query processed 66 rows when
there are 77 rows in the
the five rows between 61 and 65 belong to As you've probably surmised, the key to insuring good performance with heterogeneous queries is to limit the data you pass between servers. Fortunately, SQL Server provides several tools to help us analyze what the query optimizer is doing. ConclusionIn this article, we showed you how to leverage OLE DB to write standard Transact-SQL queries that access data from heterogeneous sources throughout the enterprise. More importantly, we saw how to join together data from several sources and present it in a single result set where end users and developers can easily understand and manipulate it.We started by using Enterprise Manager to create an Oracle linked server. At the same time, we took a close look at the options you have for controlling access to remote data. From there, we moved on to non-relational data sources where corporations still keep the majority of their critical information. We demonstrated how to access the Active Directory and the unstructured information in the NT file system. Finally, we wrapped up by looking at the performance implications of working in a distributed environment. Now that you know how to work with heterogeneous queries, you can integrate mission critical data into your applications—and start to move your SQL Server apps into the data center.
![]() |