-by Jeffry Schwartz
Queries with multiple join clauses are particularly difficult to troubleshoot when they perform poorly. Join ordering is extremely important: if a join creates a huge interim result set at the beginning of query execution, SQL Server must carry that result set along until filtering or additional join clauses trim it down. If that does not happen until the end of query execution, the query will usually perform poorly. However, if the number of records from a large table can be limited early in the execution, or the join to the large table is added at the end of the query, the query will usually perform better. Therefore, when large tables are involved, the ordering and the types of joins can dramatically affect the performance of a query. When multiple large tables are involved, it is often quite difficult to determine which one caused the poor performance.
One way to influence SQL Server’s ordering of joins is to use Common Table Expressions (CTEs), which were introduced in SQL Server 2005. Using CTEs also allows a developer to break up a very complex query into its component parts, which can simplify both the logic and the assessment of exactly which joins dramatically affect performance. Because the query is broken up into component parts, development, maintenance, and readability can also be simplified greatly. A CTE is defined as a “temporary named result set, derived from a simple query and defined within the execution scope of a SELECT, INSERT, UPDATE, or DELETE statement.” “CTEs can be thought of as alternatives to derived tables (subquery), views, and inline user-defined functions.” Source: http://en.wikipedia.org/wiki/Common_table_expressions. The following query provides an illustration of a CTE in its most basic form:
with Top30000 as ( -- just check the first 30,000 entries
select top 30000 Duration
from <table> with (nolock)
where Duration is not null
)
select min(Duration) as [Min Duration]
from Top30000
Clearly, this query could be written in one statement and the query plan would undoubtedly be the same regardless of whether it used a CTE. One key thing to remember about CTEs is that they only exist within the scope of the query. For example, any subsequent T-SQL statements will not have access to the Top30000 entity, so it is NOT a temporary table. A slightly more sophisticated example is shown below:
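The scoping rule can be checked with a small self-contained sketch. The table variable @Trace and its sample rows are hypothetical stand-ins for the <table> placeholder in the query above, and the multi-row VALUES syntax assumes SQL Server 2008 or later.

```sql
-- Hypothetical table variable standing in for the <table> placeholder.
DECLARE @Trace TABLE (Duration bigint NULL);
INSERT INTO @Trace (Duration) VALUES (120), (NULL), (45), (300);

-- The CTE is visible only to the single statement that follows it.
WITH Top30000 AS (
    SELECT TOP (30000) Duration
    FROM @Trace
    WHERE Duration IS NOT NULL
)
SELECT MIN(Duration) AS [Min Duration]
FROM Top30000;

-- A second statement cannot see the CTE. Uncommenting the next line
-- fails with an "Invalid object name" error:
-- SELECT MAX(Duration) AS [Max Duration] FROM Top30000;
```

If the second result is needed as well, either compute both aggregates in the one statement that follows the CTE or store the rows in a temporary table first.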
with ProductReviewID as (
SELECT p.ProductID, p.Name, pr.ProductReviewID, p.ListPrice
FROM Production.Product p
LEFT OUTER JOIN Production.ProductReview pr ON p.ProductID = pr.ProductID
WHERE p.ProductID = 718
)
SELECT DISTINCT p.ProductID, p.Name, p.ListPrice, sd.UnitPrice AS [Selling Price]
FROM Sales.SalesOrderDetail AS sd
INNER JOIN ProductReviewID AS p ON sd.ProductID = p.ProductID
Again, this query could be constructed without using a CTE. In this particular case, I selected two sample join queries from BOL and combined them into a single query using a CTE. Although the filter is hard-wired in this example, it is illustrative of filtering the result set early in the query to reduce the number of records being used by the last part of the query. Note also that the last portion of the query must reference the CTE, here through a join, for the CTE to take part in the query. A subquery can also accomplish this.
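As a rough sketch of the subquery alternative, the same filtered join can be written as a derived table. This assumes the AdventureWorks sample database used in the query above; the alias prod is introduced here only to keep the inner and outer scopes easy to tell apart.

```sql
-- Same logic as the CTE version, with the filtered join moved into a
-- derived table (a subquery in the FROM clause).
SELECT DISTINCT p.ProductID, p.Name, p.ListPrice, sd.UnitPrice AS [Selling Price]
FROM Sales.SalesOrderDetail AS sd
INNER JOIN (
    SELECT prod.ProductID, prod.Name, pr.ProductReviewID, prod.ListPrice
    FROM Production.Product AS prod
    LEFT OUTER JOIN Production.ProductReview AS pr
        ON prod.ProductID = pr.ProductID
    WHERE prod.ProductID = 718        -- filter early, as in the CTE
) AS p ON sd.ProductID = p.ProductID;
```

The derived table keeps everything in one statement, but once several of them are nested the CTE form is usually easier to read and to test piece by piece.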
Most CTE examples illustrate recursion or utilize the following layout that is similar to the one that used AdventureWorks above:
CTE2 select from CTE1
Select from CTE2 joining other tables
However, this layout is also possible:
Select from CTE2 joining CTE1 joining other tables
As is this form:
CTE3, which selects from CTE2 joining CTE1
Select from CTE3 joining other tables
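The last layout might look something like the following sketch. The table variables @Orders, @OrderLines, and @Products, their columns, and the sample rows are all hypothetical and exist only to make the example runnable; the date type and multi-row VALUES syntax assume SQL Server 2008 or later.

```sql
-- Hypothetical tables, for illustration only.
DECLARE @Orders TABLE (OrderID int PRIMARY KEY, CustomerID int, OrderDate date);
DECLARE @OrderLines TABLE (OrderID int, ProductID int, Qty int);
DECLARE @Products TABLE (ProductID int PRIMARY KEY, Name varchar(50));

INSERT INTO @Orders VALUES (1, 10, '2015-01-05'), (2, 11, '2015-02-10'), (3, 10, '2015-03-15');
INSERT INTO @OrderLines VALUES (1, 100, 2), (2, 100, 1), (2, 101, 5), (3, 101, 4);
INSERT INTO @Products VALUES (100, 'Widget'), (101, 'Gadget');

WITH CTE1 AS (          -- filters one table early
    SELECT OrderID, CustomerID
    FROM @Orders
    WHERE OrderDate >= '2015-02-01'
),
CTE2 AS (               -- filters a second table, independent of CTE1
    SELECT OrderID, ProductID, Qty
    FROM @OrderLines
    WHERE Qty >= 2
),
CTE3 AS (               -- selects from CTE2 joining CTE1
    SELECT c1.CustomerID, c2.ProductID, c2.Qty
    FROM CTE2 AS c2
    INNER JOIN CTE1 AS c1 ON c1.OrderID = c2.OrderID
)
SELECT c3.CustomerID, p.Name, c3.Qty    -- select from CTE3 joining other tables
FROM CTE3 AS c3
INNER JOIN @Products AS p ON p.ProductID = c3.ProductID;
```

Each CTE can be tested on its own by temporarily replacing the final SELECT with a SELECT from that CTE, which is what makes this layout convenient for adding joins one at a time.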
One method that I have used extensively to develop efficient queries is to build complex queries gradually using the last layout, i.e., select from tables and gradually introduce joins while maintaining excellent performance. It is critical that representative data be used during this process, i.e., the table sizes used for testing should be representative of what the production database contains. For example, most of the large tables in the AdventureWorks database contain approximately 20,000 records. With table sizes this small, almost any query design works well. However, if the tables contain 200,000 or 2,000,000 records the behavior would be quite different. In these situations, joins can provide substantial filtering or actually expand the result set. The advantage of the gradual addition approach is that the developer can identify immediately those joins that significantly degrade performance and then attempt to utilize other joins until only the most troublesome joins remain. Although a very efficient form of the query can often be devised using only CTEs, sometimes, interim results must be stored in a temporary table, which is then joined with the large troublesome table.
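When a CTE-only form is not enough, the temporary-table fallback could be sketched as follows. All table names, columns, and rows here are hypothetical; #LargeSales merely stands in for the large, troublesome table, and a real case would involve far more rows.

```sql
-- Hypothetical tables, for illustration only.
CREATE TABLE #Customers (CustomerID int PRIMARY KEY, Region varchar(10));
CREATE TABLE #LargeSales (SaleID int PRIMARY KEY, CustomerID int, Amount money);

INSERT INTO #Customers VALUES (1, 'East'), (2, 'West'), (3, 'East');
INSERT INTO #LargeSales VALUES (1, 1, 10.00), (2, 2, 25.00), (3, 3, 40.00), (4, 1, 5.00);

-- Step 1: build the small, filtered interim result with a CTE and store it.
WITH EastCustomers AS (
    SELECT CustomerID
    FROM #Customers
    WHERE Region = 'East'
)
SELECT CustomerID
INTO #Interim
FROM EastCustomers;

-- Step 2: join the stored interim result to the large table last.
SELECT i.CustomerID, SUM(s.Amount) AS TotalAmount
FROM #Interim AS i
INNER JOIN #LargeSales AS s ON s.CustomerID = i.CustomerID
GROUP BY i.CustomerID;

DROP TABLE #Interim, #LargeSales, #Customers;
```

Because #Interim is materialized before the second statement runs, the join to the large table starts from a small row set that has already been computed. The trade-off is the extra write to tempdb.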
In summary, although using CTEs will not force SQL Server to perform joins in the order specified in the T-SQL code, it can often influence SQL Server to adhere more closely to a desired join ordering sequence, especially if filtering clauses are used early in the CTE sequence. Adding joins against large tables at the end of a query can also achieve a similar result. Regardless, using a gradual approach to building queries with CTEs enables developers to understand exactly what the effect of every join is and, therefore, to create the most efficient query possible.
This is the beginning of a series on CTEs and performance. Stay tuned for more CTE posts to come. I’ll show some examples of how to take an existing query with joins and convert it to use CTEs to improve its performance.