jump to navigation

Significance of order of columns in an index November 18, 2013

Posted by fofo in Sql Server, Sql Server 2005, SQL Server 2008, SQL Server 2012.
Tags:
add a comment

I have been delivering a certified course in MS SQL Server 2012 recently and I was asked a very interesting question “Does the order of the column in an index matter?”

Let’s give some basic definitions first and make some analogies.

A single column index is straightforward. Think of it as an index in the back of the book.
Let’s say you want to learn about “DBCC FREEPROCCACHE” . You look for that command in the index of the book. The index does not have the information. It has a pointer to the page where the command is described.You turn to that page and read about it. This is a good analogy for a single column,non-clustered index

In SQL Server you can also create an index that contains more than one column.This is known as the composite index. A good analogy for a composite index is the telephone book.
A telephone book lists everyone who has publicly available a telephone number in an area.It is organised not by one column but by two:last name and first name.To look up someone in the telephone book , we first navigate to the last name and then the first name. To find John Smith you first locate Smiths and then John. Composite indexes contain more than 1 column and can reference up to 16 columns from a single table or view.

Back to our question.I have answered this question with the demo below.

I have installed SQL Server 2012 Enterprise edition in my machine but you can use the SQL Server 2012/2008 Express edition as well.

I am connecting to my local instance through Windows Authentication and in a new query window I type (you can copy paste)

USE tempdb
GO

--create a sample table

CREATE TABLE Customers
(
Customer_ID INT NOT NULL IDENTITY(1,1),
Last_Name VARCHAR(20) NOT NULL,
First_Name VARCHAR(20) NOT NULL,
Email_Address VARCHAR(50) NULL

)
--create a clustered index on Customer_ID and
--a non-clustered composite index on the Last_Name and First_Name

CREATE CLUSTERED INDEX ix_Customer_ID ON Customers(Customer_ID)

CREATE NONCLUSTERED INDEX ix_Customer_Name ON Customers(Last_Name,First_Name)
--when we issue a query to SQL Server that retrieves data from the Customers table, the
--SQL Server query optimiser will consider the various retrieval methods at its disposal
--and select one it deems most appropriate

-- insert test row

INSERT INTO customerS VALUES('Smith','John','js@smith.com')

--use SQL Data Generator to create sample data 10000 rows

SELECT * FROM customers

-- DBCC DROPCLEANBUFFERS removes all buffers from the buffer pool
-- DBCC freeproccache removes all entries from the procedure cache

DBCC dropcleanbuffers
DBCC freeproccache

-- when we run this query and see the execution plan
-- we have an index seek on the nonclustered index
-- to locate the rows selected by the query
-- we also have a key lookup to find the values for the email
-- to retrieve the non-indexed columns
SELECT last_name ,
 first_name ,
 email_address
FROM customers
WHERE Last_Name = 'Smith'
 AND First_Name = 'John'

-- what happens in this case?
--where we have the WHERE statement in different order than the index order?
-- it will use the same execution plan

SELECT last_name ,
 first_name ,
 email_address
FROM customers
WHERE First_Name = 'John'
 AND Last_Name = 'Smith'

-- and what happens in this case?
--where we use only last name?
--the same plan is used

SELECT last_name ,
 first_name ,
 email_address
FROM customers
WHERE Last_Name = 'Smith'
-- what happens in this case?
-- when we use only first_name
--the index cannot be used
-- an index cannot be used to seek rows of data when the first column
-- of the index is not specified in the WHERE clause

INSERT INTO customers VALUES ('kantzelis','nikos','nikolaosk@hotmail.com')

SELECT last_name ,
 first_name ,
 email_address
FROM customers
WHERE First_Name = 'nikos'

 SET STATISTICS IO on

 DBCC dropcleanbuffers
DBCC freeproccache

 SELECT last_name ,
 first_name ,
 email_address
 FROM customers
 WHERE Last_Name = 'kantzelis'

-- now lets drop the index to see what happens

DROP INDEX ix_Customer_Name ON customers

--and rerun the query
 --we see a huge increase in logical reads without the index
 --we have in this case, in the absence of the index a clustered index scan
 --which means that each row of the table had to be read
 SELECT last_name ,
 first_name ,
 email_address
 FROM customers
 WHERE Last_Name = 'kantzelis'

Let  me explain what I am doing in this bit of code

  • I make the tempdb the current database
  • I create a new table called Customers
  • I create a clustered index on the table ix_Customer_ID on Customer_ID
  • I insert a test row to the table
  • Then I use a third party generator tool to create 10.000 records
  • Then I just use a simple statement to make sure all the rows were inserted
  • Then I remove all buffers from the buffer pool and all entries from the procedure cache
  • Then (I activate the actual execution plan) and execute the Select  query in line 43.
  • We have an index seek on the non clustered index and a key lookup to get the values for the email column.
  • Then I execute the Select statement in lines 54. The order of the columns in the WHERE clause are different than the index order.
  • Still the same execution plan is used
  • Then I execute the Select statement in lines 65.  We have just the last_name column in the  WHERE clause.The same plan is used
  • Then in line 76 I insert a new value in the table
  • Then I execute another Select statement (lines 78) where just the first_name column is in the WHERE clause.In this case the index cannot be used. An index cannot be used to seek rows of data when the first column of the index is not specified in the WHERE clause.
  • Then in line 84 I set statistics IO on so i can investigate the logical reads when having or not the index.
  • In line 89 I run a Select query again and make a note of the logical reads , which is 4 in my case.
  • Then I drop the index and rerun the query (line 103). This time the logical reads are 77. So you can see that the performance of the query without the index has been dramatically decreased.

Hope it helps!!!

Advertisements

The dangers of over indexing your SQL Server tables April 7, 2012

Posted by fofo in Sql Server, Sql Server 2005, SQL Server 2008.
Tags:
3 comments

I have been teaching SQL Server recently in my capacity as MCT and to be totally honest with you I enjoy it very much.The topic that kept popping up all the time was indexes and how to use indexes to achieve high performance in SQL Server .Indexes is a vast subject and well documented elsewhere on the web.In most cases the answer is “it depends” when someone asks me if a column in a particular table needs to be indexed. It is a very challenging and iterative process.Not only you need to create the right indexes but also you need to analyse them and maintain them.In most cases the customer “lies” about the database access patterns and the numbers of users that will be using the database .Because of that we might choose the wrong indexes because we have made the wrong assumptions about workloads.We must always bear in mind the type and frequency of the queries.Data modifications (INSERT,UPDATE,DELETE statements ) are impacted by indexes.The main points for someone to keep in mind when considering index creation is that he must know his data,know his workload and finally know how SQL Server works.

I will beginning a series of small posts looking into the various aspects of indexes hoping to shed some light on their use and misuse.A lot of times I find out that every time a DBA or developer detects a performance degradation tends to create indexes on the tables of the backend (SQL Server in this case).If there is a problem with performance it does not mean that index creation is always the answer. Many reasons could cause performance bottlenecks (e.g the hardware cannot withstand the workload,excessive blocking or little reuse of the execution plans)

In this post I will show you why it is a bad idea to over index your SQL Server tables.Over indexing consumes unnecessary disk space and the query optimiser may use the less efficient index thus the less efficient execution plan.

We will need a database to work with. I will use AdventureWorks2008R2. You can download it here .

We will need an instance of SQL Server running in our machine.You can download and install the free SQL Server Express edition from here.

I will create a new empty table in the AdventureWorks2008R2 .I will call it NewPerson and it will be based on the Person table and I will insert the same data that resides in the Person table to the NewPerson tables.There will be no indexes on the NewPerson table.I will use STATISTICS TIME ON to measure the time needed to insert the rows to the

Launch SQL Server Management Studio and log in the instance of SQL Server that AdventureWorks2008R2 is attached. In an new query window type


USE [AdventureWorks2008R2]
GO

SET NOCOUNT ON

SELECT *
INTO [Person].[NewPerson]
FROM [Person].[Person]
WHERE 3 = 4
GO

SET STATISTICS TIME ON
GO

INSERT INTO [AdventureWorks2008R2].[Person].[NewPerson]
([BusinessEntityID]
,[PersonType]
,[NameStyle]
,[Title]
,[FirstName]
,[MiddleName]
,[LastName]
,[Suffix]
,[EmailPromotion]
,[AdditionalContactInfo]
,[Demographics]
,[rowguid]
,[ModifiedDate])
SELECT    [BusinessEntityID]
,[PersonType]
,[NameStyle]
,[Title]
,[FirstName]
,[MiddleName]
,[LastName]
,[Suffix]
,[EmailPromotion]
,[AdditionalContactInfo]
,[Demographics]
,[rowguid]
,[ModifiedDate]
FROM [Person].[Person]
GO

SELECT * FROM Person.NewPerson

Have a look at the statistics output. In my case I got the results shown in the picture below

Now I will truncate the table and I will create a clustered and non clustered indexes on almost every column of the NewPerson table.Then I will insert the same data again and make a note of the statistics again.Type the following in the same query window.


TRUNCATE TABLE [Person].[NewPerson]
GO

ALTER TABLE [Person].[NewPerson]
ADD CONSTRAINT [PK_BusinessEntityID]
PRIMARY KEY CLUSTERED
([BusinessEntityID] ASC,
[LastName] ASC) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [IX_NewPersonType]
ON [Person].[NewPerson]
([PersonType] ASC) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [IX_NameStyle]
ON [Person].[NewPerson]
([NameStyle] ASC) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [IX_FirstName]
ON [Person].[NewPerson]
([FirstName] ASC) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [IX_MiddleName]
ON [Person].[NewPerson]
([MiddleName] ASC) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [IX_NSuffix]
ON [Person].[NewPerson]
([Suffix] ASC) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [IX_EmailPromotion]
ON [Person].[NewPerson]
([EmailPromotion] ASC) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [IX_rowguid]
ON [Person].[NewPerson]
([rowguid] ASC) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [IX_ModifiedDate]
ON [Person].[NewPerson]
([ModifiedDate] ASC) ON [PRIMARY]
GO

INSERT INTO [AdventureWorks2008R2].[Person].[NewPerson]
([BusinessEntityID]
,[PersonType]
,[NameStyle]
,[Title]
,[FirstName]
,[MiddleName]
,[LastName]
,[Suffix]
,[EmailPromotion]
,[AdditionalContactInfo]
,[Demographics]
,[rowguid]
,[ModifiedDate])
SELECT    [BusinessEntityID]
,[PersonType]
,[NameStyle]
,[Title]
,[FirstName]
,[MiddleName]
,[LastName]
,[Suffix]
,[EmailPromotion]
,[AdditionalContactInfo]
,[Demographics]
,[rowguid]
,[ModifiedDate]
FROM [Person].[Person]
GO

Have a look again at the statistics output. In my case I got the results shown in the picture below.
They are significantly greater (40% or higher).

In this case we see that over indexing does not help the performance of the system.We should drop the ones we do not need.   The NewPerson table has 19972 rows only. If it had 2-3 million rows the elapsed time in the second case it would be 10(1000%) times greater.

Hope it helps!!!!