
How a tiny little whitespace can make life difficult for your SQL Cluster


Remember that tiny little whitespace that we tend to ignore most of the time? Believe it or not, there are situations when you could pay heavily if you don’t pay attention to this itsy-bitsy little character. Let me explain how:

If you have a SQL Server instance, or multiple ones, on a cluster, and decide to have all of them running on the same static port (on different IPs, of course), then you might be surprised to see some of the services failing to come online after the change. The reason? Read on.

When we change the port from SQL Server Configuration Manager (SQL Server Network Configuration -> Protocols for InstanceName -> TCP/IP -> Properties), typically we just remove the value for TCP Dynamic Ports under IPAll, and enter the static port number in the TCP Port field. A value of 0 in the TCP Dynamic Ports field indicates that dynamic ports are to be used. By default, a SQL installation uses dynamic ports, and except in the case of a default instance, the static port field is empty.

Coming back to the topic, say, after we change the port settings to reflect the static port number, we restart the service and it fails to come online. Check the errorlog, and you might see something like this:

2012-05-17 13:08:29.34 Server      Error: 17182, Severity: 16, State: 1.
2012-05-17 13:08:29.34 Server      TDSSNIClient initialization failed with error 0xd, status code 0x10. Reason: Unable to retrieve registry settings from TCP/IP protocol's 'IPAll' configuration key. The data is invalid.

2012-05-17 13:08:29.35 Server      Error: 17182, Severity: 16, State: 1.
2012-05-17 13:08:29.35 Server      TDSSNIClient initialization failed with error 0xd, status code 0x1. Reason: Initialization failed with an infrastructure error. Check for previous errors. The data is invalid.

So, the error says the data in the IPAll configuration key is invalid. Where exactly is this key anyway? The TCP protocol settings, including the IPAll subkey, are located in:

HKEY_LOCAL_MACHINE\Software\Microsoft\Microsoft SQL Server\<InstanceName>\MSSQLServer\SuperSocketNetLib\

Under the IPAll subkey, you will find the same two “TCP Dynamic Ports” and “TCP Port” entries. Check the value for TCP Dynamic Ports. Do you see a whitespace there? If so, that is most likely the reason for the service startup failure. Removing the whitespace should fix the issue, and the service should come online just fine. This is equivalent to changing it from SQL Server Configuration Manager, and the registry should only be edited directly when you cannot access SQL Server Configuration Manager for some reason.
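If you prefer to inspect the value without opening regedit, the xp_regread extended procedure can read it from T-SQL. Treat the sketch below as assumption-laden: xp_regread is undocumented and unsupported, the registry value name (TcpDynamicPorts) and the instance folder placeholder (<InstanceFolder>, e.g. MSSQL10.<InstanceName>) need to be adjusted to your build, and since the affected instance may not be able to start, you would run this from another instance on the same box.

    DECLARE @val NVARCHAR(32);
    EXEC master.dbo.xp_regread
         @rootkey    = N'HKEY_LOCAL_MACHINE',
         @key        = N'SOFTWARE\Microsoft\Microsoft SQL Server\<InstanceFolder>\MSSQLServer\SuperSocketNetLib\Tcp\IPAll',
         @value_name = N'TcpDynamicPorts',
         @value      = @val OUTPUT;

    -- LEN() ignores trailing spaces, DATALENGTH() does not (2 bytes per character for NVARCHAR),
    -- so a stray trailing space shows up as DATALENGTH > LEN * 2.
    SELECT @val             AS tcp_dynamic_ports,
           LEN(@val)        AS visible_length,
           DATALENGTH(@val) AS bytes_stored;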

Hope this helps.


Backup database results in error “Could not clear 'DIFFERENTIAL' bitmap in database”


I recently ran into yet another issue, where the error message had absolutely no relation to the final solution. When trying to back up a database, we were getting the following error:

Msg 18273, Level 16, State 1, Line 1
Could not clear 'DIFFERENTIAL' bitmap in database 'RS_newTempDB' because of error 9002. As a result, the differential or bulk-logged bitmap overstates the amount of change that will occur with the next differential or log backup. This discrepancy might slow down later differential or log backup operations and cause the backup sets to be larger than necessary. Typically, the cause of this error is insufficient resources. Investigate the failure and resolve the cause. If the error occurred on a data backup, consider taking a data backup to create a new base for future differential backups.

When checking the database properties, I noticed that the log file for the DB was just 504 KB in size, and its autogrowth was set to 1 percent. Error 9002 means the transaction log is full, and with a log that tiny and a 1 percent growth increment, it simply couldn’t grow fast enough. Since I had also seen issues in the past with keeping log file autogrowth low (the famous VLFs issue, which impacts startup and recovery of the DB), I suggested that we increase it. We set the autogrowth to something like 100 MB, and voila, the backup completed successfully.
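For reference, here is a minimal sketch of that change done in T-SQL. The logical log file name (RS_newTempDB_log) is a placeholder; look up the real one with sp_helpfile in the database first.

    USE master;
    GO
    -- Switch the log file from 1% growth to a fixed 100 MB increment.
    ALTER DATABASE [RS_newTempDB]
        MODIFY FILE (NAME = N'RS_newTempDB_log', FILEGROWTH = 100MB);
    GO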

Hope this helps someone.

The most interesting issue in DB Mirroring you will ever see


I recently worked on a very interesting “issue” in DB mirroring, relevant to a very specific scenario. Read on to find out more.

Basically, we have a setup which looks something like this:

Initial setup with machines A, B and C
A principal
B mirror
C witness


Take down the principal A (network disconnect or stop SQL)
A down
B online principal
C witness

Failover happens cleanly. While A is down we do our auto repair (remove witness, add a new mirror on D, establish a witness on E)
A down
B online principal
D mirror
E witness

Now when we bring A back up (reconnect network or start SQL)
A (principal in recovery)
B online principal
D mirror
E witness

At this point A correctly stays in recovery because it doesn’t have quorum. Now if you restart SQL on A
A online principal
B online principal
D mirror
E witness

So we end up with 2 copies of the database online, which can be undesirable in certain situations.

Looking at the errorlog, when the service is restarted for the first time, we saw these messages:

2012-06-26 17:10:23.80 spid21s     Error: 1438, Severity: 16, State: 1.

2012-06-26 17:10:23.80 spid21s     The server instance Partner rejected configure request; read its error log file for more information. The reason 1460, and state 1, can be of use for diagnostics by Microsoft. This is a transient error hence retrying the request is likely to succeed. Correct the cause if any and retry.

2012-06-26 17:10:26.49 spid19s     Bypassing recovery for database 'XYZ' because it is marked as an inaccessible database mirroring database. A problem exists with the mirroring session. The session either lacks a quorum or the communications links are broken because of problems with links, endpoint configuration, or permissions (for the server account or security certificate). To gain access to the database, figure out what has changed in the session configuration and undo the change.

Attempting to access the database gives:

Msg 955, Level 14, State 1, Line 2

Database XYZ is enabled for Database Mirroring, but the database lacks quorum: the database cannot be opened.  Check the partner and witness connections if configured.

However, when you restart the service for the second time, you see:

2012-06-26 17:32:32.51 spid7s      Recovery is writing a checkpoint in database 'XYZ' (5). This is an informational message only. No user action is required.

After some research and a lot of discussions, we were able to nail it down to the following steps:

When A comes back up (first startup), it looks for the partner B and the witness C. It is able to reach the witness C (say, on port 5033). The witness C sends back a message (actually a DBM_NOT_SHIPPING error) indicating it is not part of mirroring session #1 anymore.

So the old principal A removes the witness C from its local configuration. After the next restart, it again attempts to contact the mirror B (but not the witness C, because that has already been removed from the configuration on A, remember). The mirror B replies that it is already part of a different mirroring session, session #2. So the principal A removes mirror B from its configuration as well.

At this point A is a restarting primary with no witness configured, so it has the implied quorum vote of a primary and is able to restart and come online. This is the same situation as if a drop-witness command had been executed and had acted on the mirror and witness without the acknowledgement reaching the primary before a restart (the command is accepted remotely, so the restarted node syncs with the latest configuration on restart).

In the normal case where a session is dropped while the primary is down the old mirror will return DBM_NOT_SHIPPING which will cause the old primary to drop mirroring locally and stay online.

The mirror in this case has been configured into a different DBM session, so it returns DBM_ALREADYSHIPPING_REMOTE, which does not cause the session to drop, but the DB (on A) comes online as a principal with no witness and the mirror not connected. Running ALTER DATABASE ... SET PARTNER OFF will put it into the same state as the normal case.

As you probably surmised already, this behaviour is by design. But how to avoid this? One of my esteemed colleagues was able to come up with the following workaround:

When you remove the mirroring session #1, and establish mirroring session with B as the Primary, D as mirror and E as witness, you need to make sure that the old Witness C is not using the same endpoint(5033) anymore. This, in turn, will ensure that the old Principal A is unable to talk to any of the remnants of the mirror session # 1. As a result, any attempts by A to communicate to Witness C will lead to a timeout. Thus, the old Principal A will remain in a “RECOVERING” state since Quorum is not established yet. The only negative impact of this approach is that you cannot share the same Witness server/endpoint for multiple mirroring sessions.
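If you want to see what the old principal currently believes about the session (its partner, witness and state), the mirroring catalog view is a quick check. A minimal sketch:

    SELECT DB_NAME(database_id)        AS database_name,
           mirroring_role_desc,
           mirroring_state_desc,
           mirroring_partner_name,
           mirroring_witness_name,
           mirroring_witness_state_desc
    FROM sys.database_mirroring
    WHERE mirroring_guid IS NOT NULL;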

Now apart from this, there are few other things you need to account for:

  • After a new mirror session is setup, you need to drain the information about the old principal A from all existing application connections and provide them with the new principal and its partner information. For this a disconnect and reconnect is required.
  • After the old principal A comes back up, you need to use the following commands to remove the remnants of mirroring session #1 from it and keep it out of application use:

            alter database XYZ set partner off
            go

            alter database XYZ set single_user with rollback immediate
            go

            alter database XYZ set offline
            go

Not a very common scenario, but an interesting one nonetheless. What say?

An in-depth look at Ghost Records in SQL Server


Ghost records are something that are a bit of an enigma for most folks working with SQL Server, and not just because of the name. Today, I’ll seek to explain the concept, as well as identify some troubleshooting techniques.

The main reason behind introducing the concept of Ghost records was to enhance performance. In the leaf level of an index, when rows are deleted, they're marked as ghost records. This means that the row stays on the page but a bit is changed in the row header to indicate that the row is really a ghost. The page header also reflects the number of ghost records on a page. What this means, in effect, is that the DML operation which fired the delete will return to the user much faster, because it does not have to wait for the records to be deleted physically. Rather, they’re just marked as “ghosted”.

Ghost records are present only in the index leaf nodes. If ghost records weren’t used, the entire key range surrounding a deleted key would have to be locked. Here’s an example I picked up from somewhere:
Suppose you have a unique index on an integer and the index contains the values 1, 30, and 100. If you delete 30, SQL Server will need to lock (and prevent inserts into) the entire range between 1 and 100. With ghosted records, the 30 is still visible to be used as an endpoint of a key-range lock so that during the delete transaction, SQL Server can allow inserts for any value other than 30 to proceed.

SQL Server provides a special housekeeping thread that periodically checks B-trees for ghosted records and asynchronously removes them from the leaf level of the index. This same thread carries out the automatic shrinking of databases if you have that option set. The presence of a ghost record is registered in:

  • The record itself
  • The Page on which the record has been ghosted
  • The PFS for that page (for details on PFS, see Paul Randal’s blog here)
  • The DBTABLE structure for the corresponding database. You can view the DBTABLE structure by using the DBCC DBTABLE command, making sure you have TF 3604 turned on (a quick example follows this list).
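A minimal sketch of that DBTABLE dump, with the usual caveat that DBCC DBTABLE is undocumented and its output format can change between builds:

    DBCC TRACEON (3604);        -- route DBCC output to the client session
    GO
    DBCC DBTABLE ('<dbname>');  -- dumps the DBTABLE structure for the database
    GO
    DBCC TRACEOFF (3604);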

The ghost records can be cleaned up in 3 ways:

  • If a record of the same key value as the deleted record is inserted
  • If the page needs to be split, the ghost records will be handled
  • The Ghost cleanup task (scheduled to run once every 5 seconds)

The Ghost cleanup process divides the “ghost pages” into 2 categories:

  • Hot Pages (frequently visited by scanning processes)
  • Cold Pages

The Ghost cleanup thread is able to retrieve the list of Cold pages from the DBTABLE for that database, or the PFS Page for that interval. The cleanup task cleans up a maximum of 10 ghost pages at a time. Also, while searching for the ghost pages, if it covers 10 PFS Pages, it yields.

As far as hot ghost pages are concerned, the ghost cleanup strives to keep the number of such pages below a specified limit. Also, if the thread cleans up 10 hot ghost pages, it yields. However, if the number of hot ghost pages is above the specified (hard-coded) limit, the task runs non-stop till the count comes down below the threshold value.

If there is no CPU usage on the system, the Ghost cleanup task runs till there are no more ghost pages to clean up.

Troubleshooting

So now we get to the interesting part. If your system has some huge delete operations, and you feel the space is not being freed up at all or even not at the rate it should be, you might want to check if there are ghost records in that database. I’ll try to break down the troubleshooting into some logical steps here:

  1. Run the following command:
    Select * from sys.dm_db_index_physical_stats(db_id('<dbname>'), <ObjectID>, NULL, NULL, 'DETAILED')
    P.S. The object ID can be looked up from sys.objects by filtering on the name column.

  2. Check the Ghost_Record_Count and Version_Ghost_Record_Count columns (version ghost record count will be populated when you’re using snapshot isolation on the database). If this is high (several million in some cases), then you’ve most probably got a ghost record cleanup issue. If this is SQL Server 2008/2008 R2, then make sure you have applied the patch mentioned in the kb http://support.microsoft.com/kb/2622823

  3. Try running the following command:
    EXEC sp_clean_db_free_space @dbname = N'<dbname>'

  4. If the ghost record count from step 1 is the same (or similar) after running this command, then we might need to dig in a bit deeper.
    Warning: Some of the troubleshooting steps mentioned from hereon are unpublished and might be unsupported by Microsoft. Proceed at your own risk.

  5. Enable Trace Flag 662 (prints detailed information about the work done by the ghost cleanup task when it runs next), and 3605 (directs the output of TF 662 to the SQL errorlog). Please do this during off hours.

  6. Wait for a few minutes, then examine the errorlog. First, check whether the database is being touched at all. If so, it’s quite possible that the ghost cleanup task is doing its job and will catch up in a bit. Another thing to watch out for is whether the same page is being cleaned up multiple times; if so, note the page number and file id. Please ensure you disable TF 662 after this step (it creates a lot of noise in the errorlog, so use it for as little time as possible).

  7. Next, run the following command on the page to view its contents
    DBCC PAGE ('<DBName>', <file id>, <Page no.>, 3)

  8. This will give you the contents of the page. See if you can spot a field called m_ghostRecCnt in the output. If it has a non-zero value, that means the page has ghost records. Also, look for the PFS page for that page; it will look something like PFS (1:1). You can also try dumping the PFS page to see if this page has a ‘Has Ghost’ flag against it. For more details on DBCC PAGE, check out Paul Randal’s post here.

 

Another thing that deserves mention is the special role of the PAGLOCK hint w.r.t ghost records:

  • Running a select statement with the PAGLOCK hint against a table will ensure that all the ghost records in that table are queued for cleanup by the ghost cleanup task.
  • Accommodating the PAGLOCK hint in your delete statement will ensure that the records are deleted there and then, and are not left behind for the ghost cleanup task to take care of later. By default, indexes allow page locks (you can check the ALLOW_PAGE_LOCKS option by scripting out the index), but a delete might not actually acquire a page lock every time. This is where the PAGLOCK query hint comes in: it makes your query wait for the page lock, so it can clean up the records physically before returning. However, it’s not advisable to use the PAGLOCK hint in your delete statements all the time, as the performance trade-off also needs to be taken into consideration (deferring the physical delete is the very purpose for which the ghost cleanup task was introduced, remember?). Resort to it only in situations where you are facing a definite issue with ghost record cleanup and have a dire need to prevent further ghost records from being created. A small example follows this list.
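A minimal sketch of both uses, with a hypothetical table name and filter:

    -- Touching the table with PAGLOCK queues its ghosted pages for the cleanup task.
    SELECT COUNT(*) FROM dbo.BigTable WITH (PAGLOCK);

    -- Forcing page locks on the delete removes the rows physically instead of
    -- leaving them behind as ghost records.
    DELETE FROM dbo.BigTable WITH (PAGLOCK)
    WHERE OrderDate < '20120101';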

These steps might or might not solve your problem, but what they will do is give you an insight into how the SQL Server Database Engine works w.r.t Ghost records and their cleanup. One of the most common (and quickest) resolutions for a ghost records issue is to restart SQL Server.

Once again, this post does not come with any guarantees, and the contents are in no way endorsed by Microsoft or any other corporation or individual.

Hope this helps you understand the concept of Ghost Records somewhat. You’re more than welcome to share your experiences/opinions/knowledge in the comments section, and I shall be delighted to include them in the contents of the post if suitable.

An interesting issue with SQL Replication and a rogue system spid


I recently came across this interesting issue with SQL Replication. We were trying to create a new publication, and the new publication wizard would just hang. Upon doing some investigation, we found that we were hitting the connect article mentioned here. However, the connect article mentions that the bug is closed as “won’t fix”, so we had to somehow find a way out of the situation. Let me first describe how we narrowed down into the issue:

  • First, check sysprocesses to see which spid is blocking the new publication wizard (or whatever replication operation you’re trying to perform); a small query sketch for this check appears after this list.
  • If you see a system spid, such as 5 or 7 (all spids less than 50 are system spids, as a general rule), then do a DBCC Opentran and see if the same spid shows up.
  • If you see something like this in the DBCC Opentran output, then you’re likely hitting the same problem:

    Transaction information for database 'master'.

    Oldest active transaction:
        SPID (server process ID): 5
        UID (user ID)            : -1
        Name                     : user_transaction
        LSN                      : (3286:3576:1)
        Start time               : Aug  2 2012  8:04:46:603AM
        SID                      : 0x01
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

  • Another thing you might want to check is the locks held by that spid. I checked them using the sp_lock <spid> command, and found this (notice the last one):

    spid  dbid  ObjId       IndId  Type  Resource        Mode   Status
    5     5     0           0      DB                    S      GRANT
    5     10    0           0      DB                    S      GRANT
    5     1     60          0      TAB                   IX     GRANT
    5     5     1663344990  0      TAB                   Sch-M  GRANT
    ...   (the same Sch-M TAB lock repeated for another 15 rows)
    5     1     60          1      KEY   (fa00cace1004)  X      GRANT

  • Next, check the SQL Server errorlog, and see if you can spot any messages pointing towards “script upgrade”. An example would be:

    2012-08-02 08:04:06.500 Logon        Error: 18401, Severity: 14, State: 1.
    2012-08-02 08:04:06.500 Logon        Login failed for user 'maverick'. Reason: Server is in script upgrade mode. Only administrator can connect at this time. [CLIENT: 150.232.101.86]

  • Also, see if you can spot messages related to upgrading replication in the errorlog. In my case I found quite a few:

    2012-08-02 08:04:11.780 spid5s       Database 'master' is upgrading script 'repl_upgrade.sql' from level 167774691 to level 167777660.
    2012-08-02 08:04:13.010 spid5s       Upgrading distribution settings and system objects in database distribution.
    2012-08-02 08:04:17.590 spid5s       Upgrading publication settings and system objects in database [Cash].
    2012-08-02 08:04:18.270 spid5s       Upgrading publication settings and system objects in database [Sellers].
    2012-08-02 08:04:18.620 spid5s       Upgrading publication settings and system objects in database [Revenue].

  • What this tells us is that there was a patch applied at some point, and it failed while upgrading replication. Now, every time SQL Server starts up, it tries to upgrade the replication. Let’s see if we can find an upgrade failure message as well. For example, you may find something that looks like this:
    2012-08-02 08:04:46.470 spid5s       Upgrading subscription settings and system objects in database [XYZ].
    2012-08-02 08:04:46.600 spid5s       Index cannot be created on object 'MSreplication_subscriptions' because the object is not a user table or view.
    2012-08-02 08:04:46.600 spid5s       Error executing sp_vupgrade_replication.
    2012-08-02 08:04:46.600 spid5s       Saving upgrade script status to 'SOFTWARE\Microsoft\MSSQLServer\Replication\Setup'.
    2012-08-02 08:04:46.600 spid5s       Saved upgrade script status successfully.
    2012-08-02 08:04:46.600 spid5s       Recovery is complete. This is an informational message only. No user action is required.
  • Also notice the spid in the aforementioned failure messages. See it? So, because the replication upgrade fails, this system spid holds the lock on some resource, and as a result, we’re unable to perform any replication related activities.
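As a quick way to run the first two checks above from one query window, here is a minimal sketch (sysprocesses and DBCC OPENTRAN are shown because they match the tools used in this post; newer DMVs such as sys.dm_exec_requests would work just as well):

    -- Who is blocked, and by whom? (the blocking spid appears in the blocked column)
    SELECT spid, blocked, lastwaittype, waitresource, dbid, cmd, status
    FROM master..sysprocesses
    WHERE blocked <> 0;
    GO
    -- Does a system spid hold the oldest active transaction in master?
    DBCC OPENTRAN ('master');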

Troubleshooting

So how do we troubleshoot this? Let me list out the steps:

  • Let’s first focus on the exact error we see in the errorlog, which seems to be the reason behind the replication upgrade failing:

    2012-08-02 08:04:46.470 spid5s Upgrading subscription settings and system objects in database [XYZ].
    2012-08-02 08:04:46.600 spid5s Index cannot be created on object 'MSreplication_subscriptions' because the object is not a user table or view.
  • We can clearly see that it has an issue with the MSreplication_subscriptions object in the XYZ database. I checked on the object using sp_help, and found that it was a synonym.
  • Next, we dropped the offending synonym, and scripted out the MSReplication_Subscriptions object from one of the other databases that had replication enabled. We ran this script in the XYZ database to create the object.
  • As a test, we ran the sp_vupgrade_replication stored procedure explicitly from SSMS, and it completed fine.
  • Next, we restarted SQL, and saw that the script upgrade had completed successfully this time. Subsequent restarts did not result in SQL Server going into script upgrade mode. This meant that the system spid was no longer holding the lock, and we could now perform replication related activities successfully.

Hope this helps. Comments/feedback are welcome.

When DBMail started complaining about the servername being NULL


I recently came across an issue, where, for some reason, DBMail was not working. To be more specific, we were unable to create a profile for DBMail, let alone send emails. When trying to add the profile to the account, we were getting this error:

TITLE: Configuring...
------------------------------
Unable to create new account test for SMTP server Microsoft.SqlServer.Management.SqlManagerUI.SQLiMailServer.
------------------------------
ADDITIONAL INFORMATION:
Create failed for MailAccount 'test'.  (Microsoft.SqlServer.Smo)
For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft+SQL+Server&ProdVer=10.50.2500.0+((KJ_PCU_Main).110617-0038+)&EvtSrc=Microsoft.SqlServer.Management.Smo.ExceptionTemplates.FailedOperationExceptionText&EvtID=Create+MailAccount&LinkId=20476
------------------------------
An exception occurred while executing a Transact-SQL statement or batch. (Microsoft.SqlServer.ConnectionInfo)
------------------------------
Cannot insert the value NULL into column 'servername', table 'msdb.dbo.sysmail_server'; column does not allow nulls. INSERT fails.
The statement has been terminated. (Microsoft SQL Server, Error: 515)
For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft+SQL+Server&ProdVer=10.50.2500&EvtSrc=MSSQLServer&EvtID=515&LinkId=20476
------------------------------
BUTTONS:
OK
------------------------------

Now, looking at the error message, it’s clear that we’re somehow passing NULL for the servername field when creating the profile. I tried creating a profile using T-SQL (using the steps mentioned here), and that worked just fine. I could also see the row in the msdb.dbo.sysmail_server table.

So there was definitely an issue with how the servername value was being captured/passed. I captured a profiler trace, and found the following rows to be of interest:

SP:StmtCompleted        SELECT @mailserver_name=@@SERVERNAME   
   --create a credential in the credential store if a password needs to be stored   
       Microsoft SQL Server Management Studio

SP:StmtStarting             IF(@username IS NOT NULL)
       Microsoft SQL Server Management Studio

SP:StmtCompleted       IF(@username IS NOT NULL)
       Microsoft SQL Server Management Studio

SP:StmtStarting           INSERT INTO msdb.dbo.sysmail_server (account_id,servertype, servername, port, username, credential_id, use_default_credentials, enable_ssl)
   VALUES (@account_id, @mailserver_type, @mailserver_name, @port, @username, @credential_id, @use_default_credentials, @enable_ssl)   
       Microsoft SQL Server Management Studio

Exception                     Error: 515, Severity: 16, State: 2
User Error Message     Cannot insert the value NULL into column 'servername', table 'msdb.dbo.sysmail_server'; column does not allow nulls. INSERT fails.
User Error Message     The statement has been terminated.

Accordingly, I tried running SELECT @@SERVERNAME explicitly on the server, and lo and behold, that was NULL too! However, SELECT SERVERPROPERTY('servername') did return the server name. Unfortunately, DBMail uses @@SERVERNAME, not SERVERPROPERTY('servername'), as we can clearly see in the profiler trace, so this was definitely where the issue was originating from. I then queried the sys.sysservers compatibility view, and I couldn’t see a record with srvid 0 (the details of the local server are always stored there with srvid 0). Next, we ran the following commands to fix the situation:

sp_dropserver '<localservername>'

sp_addserver '<localservername>', @local = 'LOCAL'
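A quick way to verify the before/after state in one go, using nothing beyond the views already mentioned above:

    SELECT @@SERVERNAME                  AS servername_variable,   -- what DBMail uses
           SERVERPROPERTY('ServerName')  AS serverproperty_value;  -- should match the machine/instance name
    -- The local server should show up with srvid = 0 after the fix (and a restart):
    SELECT srvid, srvname, datasource
    FROM master.dbo.sysservers
    WHERE srvid = 0;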

After this, we restarted SQL Server, and DBMail worked like a charm (after we had cleaned up the mess we had created earlier, of course). Hope this helps.

 

Migrating TFS from SQL Server Enterprise to Standard can cause problems due to compression


When migrating a Team Foundation Server from SQL Server Enterprise to Standard, you might run into this error:

Restore Failed For Server '<Servername>'. (Microsoft.SqlServer.SmoExtended)
Additional information:
An exception occurred while executing a Transact-SQL statement or batch.
(Microsoft.SqlServer.ConnectionInfo)
Database '<TFS Database name>' cannot be started in this edition of SQL Server because part or all of object 'tbl_Branch' is enabled with data compression or vardecimal storage format. Data compression and vardecimal storage format are only supported on SQL Server Enterprise Edition.

Database '<TFS Database name>' cannot be started because some of the database functionality is not available in the current edition of SQL Server. (Microsoft SQL Server, Error: 909)

The error message seems obvious enough, but the question is, how exactly do you proceed? For example, you need to find out which objects have compression enabled on them (yes, TFS enables compression on some objects in its databases), and how to get rid of it so the migration can proceed. Here are the steps:

    1. Run the following query in each TFS database to determine whether there are objects which have compression enabled:

       select so.name, so.type, so.type_desc, sp.data_compression, sp.data_compression_desc
       from sys.partitions sp
       inner join sys.objects so on (so.object_id = sp.object_id)
       where sp.data_compression != 0

    2. If there are objects listed in the output of the query, the next step is to disable the compression on those objects and their indexes. I actually ended up writing a small script for this (see the attachment “Disable Compression on TFS DB’s.sql”; a rough sketch of the idea also follows this list). As always, this script does not come with any guarantees. Please test it thoroughly before running it on your production environment. You will need to run this script in the context of each of the TFS databases.
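    Since the attachment may not be available here, below is a minimal sketch of the idea: generate (but do not blindly execute) ALTER ... REBUILD statements that turn data compression off. It only covers the row/page compression reported by sys.partitions, not the vardecimal storage format, and the generated statements should be reviewed before running them in each TFS database.

       SELECT CASE WHEN i.index_id IN (0, 1)
                   THEN 'ALTER TABLE ' + QUOTENAME(s.name) + '.' + QUOTENAME(o.name)
                        + ' REBUILD WITH (DATA_COMPRESSION = NONE);'
                   ELSE 'ALTER INDEX ' + QUOTENAME(i.name) + ' ON ' + QUOTENAME(s.name) + '.' + QUOTENAME(o.name)
                        + ' REBUILD WITH (DATA_COMPRESSION = NONE);'
              END AS rebuild_stmt
       FROM sys.partitions p
       INNER JOIN sys.objects o ON o.object_id = p.object_id
       INNER JOIN sys.schemas s ON s.schema_id = o.schema_id
       INNER JOIN sys.indexes i ON i.object_id = p.object_id AND i.index_id = p.index_id
       WHERE p.data_compression <> 0
       GROUP BY s.name, o.name, i.name, i.index_id;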

    After this, you should be good to proceed with the migration. If you face any issues when trying to disable the compression, please do not hesitate to call Microsoft for support.

    Hope this helps. Do let me know if you have any feedback, suggestions or comments. Thanks.

    SQL Server Cluster Failover Root Cause Analysis–the what, where and how


    I know many of you get into situations where SQL Server fails over from one node of a cluster to the other, and you’re hard-pressed to find out why. In this post, I shall seek to answer quite a few questions about how to go about conducting a post-mortem analysis for a SQL Server cluster failover, aka a Cluster Failover RCA.

    First up, since this is a post mortem analysis, we need all the logs we can get. Start by collecting the following:

    • SQL Server Errorlogs
    • The “Application” and “System” event logs, saved in txt or csv format (eases analysis)
    • The cluster log (see here and here for details on how to enable/collect cluster logs for Windows 2003 and 2008 respectively)

    Now that we have all the logs in place, then comes the analysis part. I’ve tried to list down the steps and most common scenarios here:

    1. Start with the SQL Errorlog. The Errorlog files in the SQL Server log folder can be viewed using notepad, textpad or any other text editor. The current file will be named Errorlog, the one last used Errorlog.1, and so on.  See if the SQL Server was shut down normally. For example, the following stack denotes a normal shutdown for SQL:

      2012-09-04 00:32:54.32 spid14s     Service Broker manager has shut down.
      2012-09-04 00:33:02.48 spid6s      SQL Server is terminating in response to a 'stop' request from Service Control Manager. This is an informational message only. No user action is required.
      2012-09-04 00:33:02.50 spid6s      SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.
    2. You might see a lot of situations where SQL Server failed over due to a system shutdown, i.e. the node itself rebooted. In that case, the stack at the bottom of the SQL Errorlog will look something like this:

       2012-07-13 06:39:45.22 Server      SQL Server is terminating because of a system shutdown. This is an informational message only. No user action is required.
       2012-07-13 06:39:48.04 spid14s     The Database Mirroring protocol transport has stopped listening for connections.
       2012-07-13 06:39:48.43 spid14s     Service Broker manager has shut down.
       2012-07-13 06:39:55.39 spid7s      SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.
       2012-07-13 06:39:55.43 Server      The SQL Server Network Interface library could not deregister the Service Principal Name (SPN) for the SQL Server service. Error: 0x6d3, state: 4. Administrator should deregister this SPN manually to avoid client authentication errors.

       You can also use the systeminfo command from a command prompt to check when the node was last rebooted (look for “System Boot Time”), and see if this matches the time of the failover; a T-SQL alternative appears after this list. If so, then you need to investigate why the node rebooted, because SQL was just a victim in this case.

    3. Next come the event logs. Look for peculiar signs in the application and system event logs that could have caused the failover. For example, one strange scenario that I came across was when the disks hosting tempdb became inaccessible for some reason. In that case, I saw the following in the event logs:

       Information 7/29/2012 12:44:07 AM MSSQLSERVER 680 Server Error [8, 23, 2] occurred while attempting to drop allocation unit ID 423137010909184 belonging to worktable with partition ID 423137010909184.

       Error 7/29/2012 12:44:07 AM MSSQLSERVER 823 Server The operating system returned error 2(The system cannot find the file specified.) to SQL Server during a read at offset 0x000001b6d70000 in file 'H:\MSSQL\Data\tempdata4.ndf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

       And then some time later, we see SQL shutting down in reaction to this:

       Error 7/29/2012 12:44:17 AM MSSQLSERVER 3449 Server SQL Server must shut down in order to recover a database (database ID 2). The database is either a user database that could not be shut down or a system database. Restart SQL Server. If the database fails to recover after another startup, repair or restore the database.

       Error 7/29/2012 12:44:17 AM MSSQLSERVER 3314 Server During undoing of a logged operation in database 'tempdb', an error occurred at log record ID (12411:7236:933). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.

       Error 7/29/2012 12:44:17 AM MSSQLSERVER 9001 Server The log for database 'tempdb' is not available. Check the event log for related error messages. Resolve any errors and restart the database.

       Another error that clearly points toward the disks being a culprit is this:

       Error 7/29/2012 12:44:15 AM MSSQLSERVER 823 Server The operating system returned error 21(The device is not ready.) to SQL Server during a read at offset 0x00000000196000 in file 'S:\MSSQL\Data\tempdb.mdf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

       The next logical step of course would be to check why the disks became unavailable/inaccessible. I would strongly recommend having your disks checked for consistency, speed and stability by your vendor.

    4. If you don’t have any clue from these past steps, try taking a look at the cluster log as well. Please do note that the Windows cluster logs are always recorded in the GMT/UTC time zone, so you’ll need to make the necessary calculations to determine what time to focus on in the cluster log. See if you can find anything which could have caused the cluster group to fail, such as the network being unavailable, failure of the IP/Network name, etc.
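    For the boot-time check mentioned in step 2, the same information can also be pulled from T-SQL once the instance is back up. A small sketch (the sqlserver_start_time column of sys.dm_os_sys_info requires SQL Server 2008 or later):

       SELECT sqlserver_start_time,
              DATEADD(SECOND, -CONVERT(int, ms_ticks / 1000), GETDATE()) AS approx_os_boot_time
       FROM sys.dm_os_sys_info;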

    There is no exhaustive guide to finding the root cause for a Cluster Failover, mainly because it is an approach thing. I do, however, want to talk about a few cluster concepts here, which might help you understand the messages from the various logs better.

    checkQueryProcessorAlive: Also known as the IsAlive check in SQL Server, this executes “SELECT @@servername” against the SQL Server instance. It runs the query every 60 seconds, and in between checks every 5 seconds whether the service is alive by calling sqsrvresCheckServiceAlive. Both of these intervals (60 seconds and 5 seconds) are the defaults and can be changed from the properties of the SQL Server resource in Failover Cluster Manager/Cluster Administrator. I understand that in SQL 2012 some more comprehensive checks, such as running sp_server_diagnostics, were added as part of this check to ensure that SQL is in good health.

    sqsrvresCheckServiceAlive: Also known as the LooksAlive check in SQL Server, this checks the status of the SQL Server service and returns “Service is dead” if the status is not one of the following:

    • SERVICE_RUNNING
    • SERVICE_START_PENDING
    • SERVICE_PAUSED
    • SERVICE_PAUSE_PENDING

    So if you see messages related to one of these checks failing in either the event logs or the cluster logs, you know that SQL Server was not exactly “available” at that time, which caused the failover. The next step, of course would be to investigate why SQL Server was not available at that time. It can be due to a resource bottleneck such as high CPU or memory consumption, SQL Server hung/stalled, etc.

    The base idea here, as with any post-mortem analysis, is to construct a logical series of events leading up to the failover, based on the data. If we can do that, then we have at least a clear indication on what caused the failover, and more importantly, how to avoid such a situation in the future.

    If you’re still unable to determine anything about the cause of the failover, I would strongly recommend contacting Microsoft CSS to review the data once and see if they’re able to spot anything.

    Hope this helps. As always, comments, feedback and suggestions are welcome.



    When using SSL, SQL Failover Cluster Instance fails to start with error 17182


    I recently worked on an interesting issue with a SQL Server Failover Cluster Instance (FCI). We were trying to use an SSL certificate on the instance, and we followed these steps:

    1. Made sure the certificate was requested according to the requirements defined here.
    2. Loaded the certificate into the Personal store of the computer account across all the nodes
    3. Copied the thumbprint of the certificate, eliminated the spaces, and pasted it into the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.CLUSTEST\MSSQLServer\Certificate registry value. Please note that this was a SQL 2008 instance named "CLUSTEST".

     

    However, when we restarted SQL Server after performing these changes, it failed. In the errorlog, we saw these messages:

    2013-07-21 14:06:11.54 spid19s     Error: 17182, Severity: 16, State: 1.

    2013-07-21 14:06:11.54 spid19s     TDSSNIClient initialization failed with error 0xd, status code 0x38. Reason: An error occurred while obtaining or using the certificate for SSL. Check settings in Configuration Manager. The data is invalid.

    2013-07-21 14:06:11.54 spid19s     Error: 17182, Severity: 16, State: 1.

    2013-07-21 14:06:11.54 spid19s     TDSSNIClient initialization failed with error 0xd, status code 0x1. Reason: Initialization failed with an infrastructure error. Check for previous errors. The data is invalid.

    2013-07-21 14:06:11.54 spid19s     Error: 17826, Severity: 18, State: 3.

    2013-07-21 14:06:11.54 spid19s     Could not start the network library because of an internal error in the network library. To determine the cause, review the errors immediately preceding this one in the error log.

    2013-07-21 14:06:11.54 spid19s     Error: 17120, Severity: 16, State: 1.

    2013-07-21 14:06:11.54 spid19s     SQL Server could not spawn FRunCommunicationsManager thread. Check the SQL Server error log and the Windows event logs for information about possible related problems.

    I checked and made sure the certificate was okay, and that it was loaded properly. Then, I noticed something interesting. After copying the thumbprint to a text file, I got a Unicode to ANSI conversion warning when I tried to save the file in txt format:

    [Screenshot: Notepad's "Unicode to ANSI" conversion warning]

     

    This is expected, since the default format for notepad is indeed ANSI. I went ahead and clicked OK. When we reopened the file, we saw a "?" at the beginning, which basically meant that there was a Unicode character at the beginning of the string. We followed these steps to resolve the issue:

    1. Eliminated the Unicode character from the thumbprint
    2. Converted all the alphabetical characters in the thumbprint to Caps.
    3. Eliminated the spaces from the thumbprint
    4. Saved this thumbprint to the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.CLUSTEST\MSSQLServer\Certificate key.

     

    The instance came online just fine this time.
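    If you want to confirm that connections are actually being encrypted once the instance is back up, a quick check against a standard DMV is:

       SELECT session_id, encrypt_option, auth_scheme, client_net_address
       FROM sys.dm_exec_connections;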

     

    Hope this helps.


    Something to watch out for when using IS_MEMBER() in TSQL


    I recently worked on an interesting issue with my good friend Igor (@sqlsantos), where we were facing performance issues with a piece of code that used the IS_MEMBER() function. Basically, the IS_MEMBER function is used to find out whether the current user (the Windows/SQL login used for the current session) is a member of the specified Windows group or SQL Server database role.
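    For reference, a tiny example of the function itself (the Windows group name here is hypothetical):

       -- 1 = member, 0 = not a member, NULL = the group/role is not valid or cannot be resolved.
       SELECT IS_MEMBER('db_owner')               AS member_of_db_owner,
              IS_MEMBER('CONTOSO\SalesAmericas')  AS member_of_sales_group;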


    In the specified code, the IS_MEMBER function was being used to determine the windows group membership of the windows login. The windows groups were segregated according to geographical areas, and based on the user's group membership, the result set was filtered to show rows for only those geographical areas for which the user was a member of the corresponding groups in Active Directory.

    Here's an example of a piece of code where we perform this check:

    With SalesOrgCTE AS
    (
        SELECT MinSalesOrg, MaxSalesOrg
        FROM RAuthorized WITH (NOLOCK)
        WHERE IS_MEMBER([Group]) = 1
    )


    The problem was that the complete procedure where we were using IS_MEMBER was taking several minutes to complete, for a table where the max result set cardinality was in the range of 18000-20000. We noticed the following wait types while the procedure was executing:

    PREEMPTIVE_OS_AUTHORIZATIONOPS

    PREEMPTIVE_OS_LOOKUPACCOUNTSID


    I did some research on these waits, and found that since both of them relate to communication with and validation against Active Directory, they lie outside of SQL Server, and there are no configuration changes we can make to reduce or eliminate them.


    Next, we studied the code, broke it down and tested the performance of the various sections that used the IS_MEMBER function, and found that the main section responsible for the execution time was the "WHERE" condition where we were using the result set of the code mentioned above. This is what the "WHERE" clause looked like:

    (SELECT COUNT(*)
     FROM SalesOrgCTE
     WHERE SORGNBR BETWEEN MinSalesOrg AND MaxSalesOrg) > 0


    Notice that in this code, we've asked SQL to check the value of SORGNBR for each row, and if it's between MinSalesOrg and MaxSalesOrg, add it to the rowcount. Due to this design, a trip to AD was needed to validate each row, which adds up to quite a long time for an 18000-20000 row result set, and that was what made the procedure slow.


    We did some more research with different approaches for the where clause, and the combined efforts of myself, Igor and his team resulted in the following where clause whose performance was acceptable:

    WHERE SORGNBR IN
    (
        SELECT MinSalesOrg FROM SalesOrgCTE
    )
    AND SORGNBR IN
    (
        SELECT MaxSalesOrg FROM SalesOrgCTE
    )


    If you look carefully, you'll notice that in this code snippet, we'll need to communicate with AD only twice, thereby improving the performance of the procedure as a whole.


    Summing up: The importance of writing good code cannot be over-emphasized. It's good coding practices like this that lead to performance gains most of the time.


    Hope this helps.


    An interesting issue with Peer to Peer Replication


    I recently ran into an interesting issue when setting up Peer 2 Peer Replication across 3 instances.

    The primary instance was SM-UTSQL, where we configured a Peer-to-Peer publication named "PUBLISH1" on the database "DDS_TRANS". Next, we proceeded to configure the Peer-to-Peer topology (right-click on the publication, click on "Configure Peer-To-Peer Topology"), and added the other 2 instances to the topology. After this, we clicked on the primary node and selected "Connect to all displayed nodes":

     

    [Screenshot: Configure Peer-To-Peer Topology wizard showing the three nodes connected]

    We then went ahead through the UI and configured the replication. However, when we checked in Object Explorer, we saw that on the primary instance (SM-UTSQL), under Replication -> PUBLISH1, both the peer nodes SO-UTSQL and ST-UTSQL appeared as subscribers, but on SO-UTSQL and ST-UTSQL we could see only SM-UTSQL as a subscriber for the publication, i.e. SO-UTSQL and ST-UTSQL did not recognize each other as subscribers.

    We tried to add the missing subscriber through the new subscriptions wizard, but got the following error:

    TITLE: New Subscription Wizard

    ------------------------------

    SQL Server could not create a subscription for Subscriber 'ST-UTSQL'.

    ------------------------------

    ADDITIONAL INFORMATION:

    An exception occurred while executing a Transact-SQL statement or batch. (Microsoft.SqlServer.ConnectionInfo)

    ------------------------------

    Peer-to-peer publications only support a '@sync_type' parameter value of 'replication support only', 'initialize with backup' or 'initialize from lsn'.

    The subscription could not be found.

    Changed database context to 'DDS_TRANS'. (Microsoft SQL Server, Error: 21679)

    For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft+SQL+Server&ProdVer=10.50.4000&EvtSrc=MSSQLServer&EvtID=21679&LinkId=20476

    ------------------------------

    BUTTONS:

    OK

    ------------------------------

     

    The resolution? Here are the steps:

    1. Navigate to the replication tab in object explorer on the Primary instance (where you can see both subscriptions under the publication, SM-UTSQL in our case).
    2. Right click on the publication and select generate scripts, and select the "To Create or enable the components" radio button.
    3. In the resulting script, navigate to the bottom. Here, you will see 2 sets of "sp_addsubscription" and "sp_addpushsubscription_agent" calls:
      -- Adding the transactional subscriptions

      use [DDS_TRANS]

      exec sp_addsubscription @publication = N'PUBLISH1', @subscriber = N'SO-UTSQL', @destination_db = N'DDS_TRANS', @subscription_type = N'Push', @sync_type = N'replication support only', @article = N'all', @update_mode = N'read only', @subscriber_type = 0

      exec sp_addpushsubscription_agent @publication = N'PUBLISH1', @subscriber = N'SO-UTSQL', @subscriber_db = N'DDS_TRANS', @job_login = N'dds\dtsql.admin', @job_password = null, @subscriber_security_mode = 1, @frequency_type = 64, @frequency_interval = 1, @frequency_relative_interval = 1, @frequency_recurrence_factor = 0, @frequency_subday = 4, @frequency_subday_interval = 5, @active_start_time_of_day = 0, @active_end_time_of_day = 235959, @active_start_date = 0, @active_end_date = 0, @dts_package_location = N'Distributor'

      GO

       

      use [DDS_TRANS]

      exec sp_addsubscription @publication = N'PUBLISH1', @subscriber =N'ST-UTSQL', @destination_db = N'DDS_TRANS', @subscription_type = N'Push', @sync_type = N'replication support only', @article = N'all', @update_mode = N'read only', @subscriber_type = 0

      exec sp_addpushsubscription_agent @publication = N'PUBLISH1', @subscriber = N'ST-UTSQL', @subscriber_db = N'DDS_TRANS', @job_login = N'dds\dtsql.admin', @job_password = null, @subscriber_security_mode = 1, @frequency_type = 64, @frequency_interval = 1, @frequency_relative_interval = 1, @frequency_recurrence_factor = 0, @frequency_subday = 4, @frequency_subday_interval = 5, @active_start_time_of_day = 0, @active_end_time_of_day = 235959, @active_start_date = 0, @active_end_date = 0, @dts_package_location = N'Distributor'

      GO

       

    4. Copy these commands over, provide the value for the @job_password parameter (the password for the login used to configure replication, reflected in the @job_login parameter), and run the appropriate set on the 2 subscribers. For example, we ran the first set of commands (@subscriber = N'SO-UTSQL') on the ST-UTSQL instance, and the second set (@subscriber = N'ST-UTSQL') on the SO-UTSQL instance.

    And voila, the subscriptions were created and syncing.
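    To double-check from each node that both peers now show up as subscribers, one option is the standard replication helper procedure, run in the publication database (a sketch):

       USE [DDS_TRANS];
       GO
       EXEC sp_helpsubscription @publication = N'PUBLISH1';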

     

    Hope this helped you. Comments and feedback are welcome.


    An in-depth look at SQL Server Memory–Part 3


    In part 1 and part 2 of the series, we talked about the memory architecture and the Procedure Cache respectively. In this third and final instalment of the SQL Server Memory series, I will look to focus on troubleshooting SQL Server Memory pressure issues.

     

    Before we start on the troubleshooting part though, we need to determine the type of memory pressure that we’re seeing here. I’ve tried to list those down here:


    1.     External Physical Memory pressure – Overall RAM pressure on the server. We need to find the largest consumers of memory (might be SQL), and try to reduce their consumption. It might also be that the system is provided with RAM inadequate for the workload it’s running.

    2.     Internal Physical Memory pressure – Memory Pressure on specific components of SQL Server. Can be a result of External Physical Memory pressure, or of one of the components hogging too much memory.

    3.     Internal Virtual Memory pressure – VAS pressure on SQL server. Mostly seen only on 32 bit (X86) systems these days (X64 has 8 TB of VAS, whereas X86 only had 4 GB. Refer to Part 1 for details).

    4.     External Virtual Memory pressure – Page file pressure on the OS. SQL Server does not recognize or respond to this kind of pressure.

     

    Troubleshooting

    Now for getting our hands dirty. When you suspect memory pressure on a server, I would recommend checking the following things, in order:

     


    1.     Log in to the server, and take a look at the performance tab of the Task Manager. Do you see the overall memory usage on the server getting perilously close to the total RAM installed on the box? If so, it’s probable that we’re seeing External Physical Memory pressure.

    2.     Next, look at the Processes tab, and see which of the processes is using the maximum amount of RAM. Again, for SQL, the true usage might not be reflected in the working set if LPIM is enabled (i.e. SQL is using the AWE APIs to allocate memory). To check SQL's total memory consumption, you can run the following query from inside SQL (valid from SQL 2008 onwards):

    select physical_memory_in_use_kb/(1024)          as sql_physical_mem_in_use_mb,
           locked_page_allocations_kb/(1024)         as awe_memory_mb,
           total_virtual_address_space_kb/(1024)     as max_vas_mb,
           virtual_address_space_committed_kb/(1024) as sql_committed_mb,
           memory_utilization_percentage             as working_set_percentage,
           virtual_address_space_available_kb/(1024) as vas_available_mb,
           process_physical_memory_low               as is_there_external_pressure,
           process_virtual_memory_low                as is_there_vas_pressure
    from sys.dm_os_process_memory
    go

    For SQL installations prior to 2008 (valid for 2008 and 2008 R2 as well), you can run DBCC Memorystatus, and take the total of VM Committed and AWE Allocated from the memory manager section to get a rough idea of the amount of memory being used by SQL Server.
     


    3.     Next, compare this with the total amount of RAM installed on the server. If SQL seems to be taking most of the memory, or at least, much more than it should, then we need to focus our attentions on SQL Server. The exact specifics will vary according to the environment, and factors such as whether it is a dedicated SQL server box, number of instances of SQL Server running on the server, etc. In case you have multiple instances of SQL Server, it will be best to start with the instance consuming the maximum amount of memory (or the maximum deviation from “what it should be consuming”), tune it and then move on to the next one. 
     


    4.     One of the first things to check should be the value of the "max server memory" setting for SQL Server. You can check this by turning on the 'show advanced options' setting of sp_configure, or by right-clicking on the instance in Object Explorer in SSMS, selecting Properties, and navigating to the Memory tab. If the value is 2147483647, the setting has been left at its default and has not been changed since the instance was installed. It's absolutely vital to set max server memory to an optimal value. A general rule of thumb for a starting value is:
    Total server memory - (memory for other applications/instances + OS memory)
    The recommendation for the OS memory value is around 3-4 GB on 64-bit systems, and 1-2 GB on 32-bit systems. Please note that this is only a recommendation for the starting value. You need to fine-tune it based on observations w.r.t. the performance of both SQL and any other applications on the server.
     


    5.     Once you’ve determined that the max server memory is set properly, the next step is to find out which component within SQL is consuming the most memory. The best place to start is, quite obviously, the good old “DBCC Memorystatus” command, unless you’re using NUMA, in which case, it will be best to use perfmon counters to track page allocations across NUMA nodes, as outlined here.
    I will try to break down most of the major components in the DBCC Memorystatus output here (I would recommend reading KB 907877 as a primer before this):

    I.   First up is the memory manager section. As discussed earlier, this section contains details about the overall memory consumption of SQL Server. An example:


    Memory Manager                               KB
    ----------------------------------------  -----------
    VM Reserved                                  4059416
    VM Committed                                   43040
    Locked Pages Allocated                         41600
    Reserved Memory                                 1024
    Reserved Memory In Use                             0

     


                      II. Next, we have the memory nodes, starting with 0. As I mentioned, because there is a known issue with the way dbcc memorystatus displays the distribution of allocations across memory nodes, it is best to study the distribution through the SQL Server performance counters. Here’s a sample query:
     

    select * from sys.dm_os_performance_counters

          where object_name like '%Buffer Node%'
     


                    III. Next, we have the clerks. I’ve tried to outline the not so obvious ones in this table, along with their uses:        

     

    Clerk Name                                Used for
    MEMORYCLERK_SQLUTILITIES                  Database mirroring, backups, etc.
    MEMORYCLERK_SQLXP                         Extended Stored Procedures (loaded into SQL Server)
    MEMORYCLERK_XE, MEMORYCLERK_XE_BUFFER     Extended Events


    If you see any of the clerks hogging memory, then you need to focus on that, and try and narrow down the possible causes.
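
    If you prefer a consolidated view outside of DBCC MEMORYSTATUS, a sketch like the following sums up memory per clerk. It uses the single_pages_kb and multi_pages_kb columns that exist up to SQL 2008 R2 (they were merged into a single pages_kb column in SQL 2012):

    select type, name,
           (sum(single_pages_kb) + sum(multi_pages_kb))/1024 as total_memory_mb
    from sys.dm_os_memory_clerks
    group by type, name
    order by total_memory_mb desc
    go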

     

    Another thing to watch out for is high values for the multipage allocator. If you see any clerk with extremely high values for multipage allocator, it means that the non-Bpool area is growing due to one of the following:


                                           i.    CLR code: Check the errorlog for appdomain messages.

                                          ii.    COM objects: Check the errorlog for sp_oacreate.

                                         iii.    Linked servers: Can be checked using Object Explorer in SSMS.

                                          iv.    Extended stored procedures: Check the errorlog for "loading extended stored procedure" messages. Alternatively, you can query the sys.extended_procedures view as well.

                                           v.    Third-party DLLs: Third-party DLLs loaded into the SQL Server process space. Run the following query to check:
            select * from sys.dm_os_loaded_modules where company <> 'Microsoft Corporation'
     


    Here’s a query to check for the biggest multipage consumers:

    select type, name, sum(multi_pages_kb)/1024 as multi_pages_mb

    from sys.dm_os_memory_clerks

    where multi_pages_kb > 0

    group by type, name

    order by multi_pages_mb desc

     


    Yet another symptom to watch out for is a high ratio of stolen pages from the Buffer Pool. You can check this in the ‘Buffer Pool’ section of the MEMORYSTATUS output. A sample:


    Buffer Pool                                      Value

    —————————————- ———–

    Committed                                          4448

    Target                                                25600

    Database                                             2075

    Dirty                                                        50

    In IO                                                          0

    Latched                                                     0

    Free                                                       791

    Stolen                                                  1582

    Reserved                                                   0

    Visible                                                25600

    Stolen Potential                                 22738

    Limiting Factor                                        17

    Last OOM Factor                                       0

    Last OS Error                                             0

    Page Life Expectancy                         87529



    What this means is that Buffer Pool pages are being utilized for “other” uses, and not for holding data and index pages in the BPool. This can lead to performance issues and a crunch on the Bpool, thereby slowing down overall query performance (please refer to part 1 for consumers that “Steal” pages from the BPool). You can use the following query to check for the highest “Steal” consumers:


    select type, name, sum((single_pages_kb*1024)/8192) as stolen_pages

    from sys.dm_os_memory_clerks

    where single_pages_kb > 0

    group by type, name

    order by stolen_pages desc

     


                   IV.   Next, we have the stores, namely the Cachestore, Userstore and Objectstore. Please refer to part 1 for how and by which components these clerks are used. You can use the following queries to check for the biggest Cachestores, Userstores and Objectstores respectively:
     


    select name, type, (SUM(single_pages_kb)+SUM(multi_pages_kb))/1024

    as store_size_mb

    from sys.dm_os_memory_cache_counters

    where type like 'CACHESTORE%'

    group by name, type

    order by store_size_mb desc

    go

     

    select name, type, (SUM(single_pages_kb)+SUM(multi_pages_kb))/1024

    as store_size_mb

    from sys.dm_os_memory_cache_counters

    where type like 'USERSTORE%'

    group by name, type

    order by store_size_mb desc

    go

     

    select name, type, (SUM(single_pages_kb)+SUM(multi_pages_kb))/1024

    as store_size_mb

    from sys.dm_os_memory_clerks

    where type like 'OBJECTSTORE%'

    group by name, type

    order by store_size_mb desc

    go

     


                     V.   Next, we have the gateways. The concept of gateways was introduced to throttle the use of query compilation memory. In plain English, this means that we do not want to allow too many queries with a high requirement for compilation memory to be running at the same time, as this would lead to consequences like internal memory pressure (i.e. one of the components of the buffer pool growing and creating pressure on other components).

    The concept basically works like this: when a query starts execution, it starts with a small amount of memory. As its consumption grows, it crosses the threshold for the small gateway, and must wait to acquire it. A gateway is basically implemented as a semaphore, which means that it allows up to a certain number of threads to acquire it, and makes threads beyond the limit wait. As the memory consumption of the query grows further, it must acquire the medium and big gateways before being allowed to continue execution. The exact thresholds depend on factors like total memory on the server, the SQL Server max server memory setting, memory architecture (x86 or x64), load on the server, etc.

    The number of queries allowed at each of the gateways is described in the following table:
                 

        

    Gateway    Dynamic/Static    Config Value
    Small      Dynamic           Default is (no. of CPUs SQL sees * 4)
    Medium     Static            Number of CPUs SQL sees
    Large      Static            1 per instance

     


    So if you see a large number of queries waiting on the large gateway, you need to investigate why there are so many queries requiring large amounts of compilation memory, and try to tune those queries. Such queries will show up with RESOURCE_SEMAPHORE_QUERY_COMPILE or RESOURCE_SEMAPHORE wait types in sysprocesses, sys.dm_exec_requests, etc.
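
    A quick way to spot such queries at any point in time (a sketch; column list kept minimal) is to look at the current requests waiting on these wait types:

    select session_id, wait_type, wait_time as wait_time_ms, granted_query_memory
    from sys.dm_exec_requests
    where wait_type in ('RESOURCE_SEMAPHORE', 'RESOURCE_SEMAPHORE_QUERY_COMPILE')
    order by wait_time desc
    go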

     


    I am listing down some DMV’s that might come in handy for SQL Server Memory Troubleshooting:

    Sysprocesses

    Sys.dm_exec_requests

    Sys.dm_os_process_memory: Usage above.

    Sys.dm_os_sys_memory: Will give you the overall memory picture for the server

    Sys.dm_os_sys_info: Can be used to check OS level information like hyperthread ratio, CPU Ticks, OS Quantum, etc.

    Sys.dm_os_virtual_address_dump: Used to check for VAS usage (reservations). The following query will give you VAS usage in descending order of reservations:

     


    with vasummary(Size,reserved,free) as (select size = vadump.size,

    reserved = SUM(case(convert(int, vadump.base) ^ 0)  when 0 then 0 else 1 end),

    free = SUM(case(convert(int, vadump.base) ^ 0x0) when 0 then 1 else 0 end)

    from

    (select CONVERT(varbinary, sum(region_size_in_bytes)) as size,

    region_allocation_base_address as base

    from sys.dm_os_virtual_address_dump

    where region_allocation_base_address<> 0x0

    group by region_allocation_base_address

    UNION(

    select CONVERT(varbinary, region_size_in_bytes),

    region_allocation_base_address

    from sys.dm_os_virtual_address_dump

    where region_allocation_base_address = 0x0)

    )

    as vadump

    group by size)

    select * from vasummary order by reserved desc

    go

     


    Sys.dm_os_memory_clerks (Usage above)

    Sys.dm_os_memory_nodes: Just a select * would suffice. This DMV has one row for each memory node.

    Sys.dm_os_memory_cache_counters: Used above to find the size of the cachestores. Another sample query would be

    select (single_pages_kb+multi_pages_kb) as memusage,* from Sys.dm_os_memory_cache_counters order by memusage desc

     

    Once you have narrowed down the primary consumer and the specific component causing the memory bottleneck, the resolution steps should be fairly simple. For example, if you see some poorly written code, you can hound the developers to tune it. For other processes hogging memory at the OS level, you will need to investigate them. For high consumption by a particular clerk, check the corresponding components. For example, in the case of high usage by the SQLUTILITIES clerk, one of the first things you need to check is whether there is any mirroring set up on the instance, and whether it's working properly.
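
    As an illustration of that last check, a quick sketch for reviewing mirroring state on the instance could be:

    select DB_NAME(database_id) as database_name,
           mirroring_state_desc, mirroring_role_desc, mirroring_safety_level_desc
    from sys.database_mirroring
    where mirroring_guid is not null
    go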

    Another thing I would strongly recommend would be to watch out for memory related KB articles, and make sure you have the relevant fixes applied.

    Hope this helps. Any feedback, questions or comments are welcome.

     

    Why the service account format matters for upgrades

    $
    0
    0

     

    I’ve seen this issue a few times in the past few months, so decided to blog about this. When upgrading from SQL 2005 to SQL 2008/SQL 2008 R2 (or even from SQL 2008 to SQL 2008 R2), you might face an error with the in-place upgrade.

    Open the setup logs folder (located at C:\Program Files\Microsoft SQL Server\<version - 100 for 2008 and 2008 R2>\Setup Bootstrap\Log by default), and look for a folder with the date-time of the upgrade attempt. Inside this folder, look for a file named "Detail.txt".

    Looking inside the detail.txt file, check for the following stack:

    2013-01-21 11:16:42 Slp: Sco: Attempting to check if container ‘WinNT://Harsh2k8,computer’ of user account exists

    2013-01-21 11:16:42 Slp: Sco: User srv_sql@contoso.test wasn’t located

    2013-01-21 11:16:42 Slp: Sco: User srv_sql@contoso.test doesn’t exist

    2013-01-21 11:16:42 SQLBrowser: SQL Server Browser Install for feature ‘SQL_Browser_Redist_SqlBrowser_Cpu32’ generated exception, and will invoke retry option.  The exception: Microsoft.SqlServer.Configuration.Sco.ScoException: The specified user ‘srv_sql@contoso.test’ does not exist.

       at Microsoft.SqlServer.Configuration.Sco.UserGroup.AddUser(String userName)

       at Microsoft.SqlServer.Configuration.SqlBrowser.SqlBrowserPrivateConfig.AddAccountToGroup(SqlBrowserPublicConfig publicConfigSqlBrowser)

       at Microsoft.SqlServer.Configuration.SqlBrowser.SqlBrowserPrivateConfig.UpdateAccountIfNeeded(SqlBrowserPublicConfig publicConfigSqlBrowser)

       at Microsoft.SqlServer.Configuration.SqlBrowser.SqlBrowserPrivateConfig.ConfigUserProperties(SqlBrowserPublicConfig publicConfigSqlBrowser)

       at Microsoft.SqlServer.Configuration.SqlBrowser.SqlBrowserPrivateConfig.ExecConfigNonRC(SqlBrowserPublicConfig publicConfigSqlBrowser)

       at Microsoft.SqlServer.Configuration.SqlBrowser.SqlBrowserPrivateConfig.SelectAndExecTiming(ConfigActionTiming timing, Dictionary`2 actionData, PublicConfigurationBase spcbPublicConfig)

       at Microsoft.SqlServer.Configuration.SqlBrowser.SqlBrowserPrivateConfigBase.ExecWithRetry(ConfigActionTiming timing, Dictionary`2 actionData, PublicConfigurationBase spcbPublicConfig).

    2013-01-21 11:16:42 SQLBrowser: The last attempted operation: Adding account ‘srv_sql@contoso.test’ to the SQL Server Browser service group ‘SQLServer2005SQLBrowserUser$Harsh2k8’..

     

    The key thing here is the message "Attempting to check if container WinNT://Harsh2k8, computer of user account exists". If you see this message, go to SQL Server Configuration Manager, right-click the offending service mentioned in Detail.txt, open the Properties window and navigate to the "Log On" tab. Check the format of the service account here. It should be in the format domain\username; if it is in the username@domain format instead (as in the log above), change it, and type in the password. After this, restart the SQL Server service to make sure the changes have taken effect. 

    Try the setup again, and it should work this time.

     

    Hope this helps.

     

    An interesting issue with SQL Server Script upgrade mode

    $
    0
    0

    Here’s another common issue that I’ve seen quite a few people run into of late.

    When you run a patch against SQL Server, the patch installs successfully, but on restart, SQL goes into “script upgrade mode” and you’re unable to connect to it. Upon looking at the errorlog, you see something like this:

     

    2012-08-23 03:43:38.29 spid7s      Error: 5133, Severity: 16, State: 1.

    2012-08-23 03:43:38.29 spid7s      Directory lookup for the file "D:\SQLData\temp_MS_AgentSigningCertificate_database.mdf" failed with the operating system error 2(The system cannot find the file specified.).

    2012-08-23 03:43:38.29 spid7s      Error: 1802, Severity: 16, State: 1.

    2012-08-23 03:43:38.29 spid7s      CREATE DATABASE failed. Some file names listed could not be created. Check related errors.

    2012-08-23 03:43:38.31 spid7s      Error: 912, Severity: 21, State: 2.

    2012-08-23 03:43:38.31 spid7s      Script level upgrade for database ‘master’ failed because upgrade step ‘sqlagent100_msdb_upgrade.sql’ encountered error 598, state 1, severity 25. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the ‘master’ database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion.

    2012-08-23 03:43:38.31 spid7s      Error: 3417, Severity: 21, State: 3.

    2012-08-23 03:43:38.31 spid7s      Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.

     

    Script upgrade means that when SQL Server is restarted for the first time after the application of the patch, the upgrade scripts are run against each system database (to upgrade the system tables, views, etc.). During this process, SQL Server attempts to create this mdf file in the default data location, and if the path is not available, we get this error. Most of the time, it's a result of the data having been moved to a different folder, and the original default data path no longer being available.

    The default data path can be checked from the following registry key (for a default SQL 2008 instance):

    HKEY_LOCAL_MACHINE\Software\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQLServer

    The MSSQLServer key will have a string value named "DefaultData". If you see a location here that's no longer available, please change it to the current data location (alternatively, you can also "recreate" the default data path mentioned in the string value).

    If you do not see the value, please check the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.<instance name>\Setup key, and see if you can spot the SQLDataRoot value there. Check whether it contains the stale path mentioned above, and if so, update it to the current path.
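
    If you prefer to check these values from T-SQL on a running instance, the undocumented xp_instance_regread procedure (which resolves the instance-specific portion of the key for you) can read them; a sketch:

    DECLARE @DefaultData nvarchar(512);
    EXEC master.dbo.xp_instance_regread
         N'HKEY_LOCAL_MACHINE',
         N'Software\Microsoft\MSSQLServer\MSSQLServer',
         N'DefaultData',
         @DefaultData OUTPUT;
    SELECT @DefaultData AS DefaultDataPath;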


    If the path is correct and this is a clustered instance, then one of the following conditions likely holds true:


    1. The relevant drive is not added as a resource to the SQL Server group in Failover cluster manager.

    2. The SQL Server resource does not have a dependency on the specified drive.


    After this, restart SQL Server and the script upgrade should complete successfully this time. Hope this helps.





    How To: Troubleshooting SQL Server I/O bottlenecks

    $
    0
    0

    One of the most common reasons for server performance issues with respect to SQL Server is the presence of an I/O bottleneck on the system. When I say I/O bottleneck, it can mean issues like slow disks, other processes hogging I/O, outdated drivers, etc. In this blog, I will seek to outline the approach for identifying and troubleshooting I/O bottlenecks on SQL Server.

     

    The Symptoms

     

    The following are the most common symptoms of an I/O bottleneck on the SQL Server machine:

    • You see a lot of threads waiting on one or more of the following waits:
      • PAGEIOLATCH_*
      • WRITELOG
      • TRACEWRITE
      • SQLTRACE_FILE_WRITE_IO_COMPLETION
      • ASYNC_IO_COMPLETION
      • IO_COMPLETION
      • LOGBUFFER
         
    • You see the famous “I/O taking longer than 15 seconds” messages in the SQL Server errorlogs: 
      2012-11-11 00:21:25.26 spid1 SQL Server has encountered 192 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [E:\SEDATA\stressdb5.ndf] in database [stressdb] (7). The OS file handle is 0x00000000000074D4. The offset of the latest long IO is: 0x00000000022000.
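
    Before setting up any data collection, you can also get a quick sense of how much of the instance's cumulative wait time is I/O-related by querying sys.dm_os_wait_stats (a sketch; the counters are cumulative since the last restart or wait-stats clear):

    select wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
    from sys.dm_os_wait_stats
    where wait_type like 'PAGEIOLATCH%'
       or wait_type in ('WRITELOG', 'IO_COMPLETION', 'ASYNC_IO_COMPLETION', 'LOGBUFFER')
    order by wait_time_ms desc
    go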

     

    Troubleshooting

     

    Data Collection:

     

    If you see the symptoms outlined above quite frequently on your SQL Server installation, then it will be safe to draw the conclusion that your instance is suffering from a disk subsystem or I/O bottleneck. Let’s look at the data collection and troubleshooting approach pertaining to the same:

    1. Enable a custom Performance Monitor collector to capture all disk related counters. Just go to start->run, type perfmon, and hit ok. Next, go to Data Collector sets->User Defined, right click on User Defined, and click New-> Data Collector set.
      Note: The best thing about perfmon (apart from the fact that it is built into Windows) is that it's a very lightweight diagnostic, and has negligible performance overhead/impact.

    2. Give the data collector set a name, and select Create manually. Under type of data, select the “Create data logs” option, and check the Performance Counter checkbox under it.

    3. Next, click on Add performance counters, select the "LogicalDisk", "Process" and "PhysicalDisk" groups, and select "All instances" for each of them before adding them.

    4. After you have added the counters, you can also modify the sample interval. You might want to do this if you see spikes lasting less than 15 seconds, which is the default sample interval. I sometimes use an interval of 5 seconds when I want to closely monitor an environment .

    5. Click on Finish and you will now see the new Data Collector set created under User Defined.

    6. Next, right click on the Data Collector set you just created, and click start.

     

    I normally recommend that my clients run the perfmon collector set for at least one business day, so that it captures the load exerted by at least one standard business cycle.

     

     

    Analysis:

     

    Now that we have the data, we can start the analysis. After stopping the collector set, you can open the .blg file generated (the path is displayed under the Output column, on the right-hand side in perfmon) using perfmon (a simple double-click works, as the file type is associated with perfmon by default). Once open, it should have automatically loaded all the counters. Analysing with all the counters can be a bit cumbersome, so I would suggest that you first delete all the counters and then add specific counters one by one.

    I will list out the important counters here, along with their expected values: 

    1. Process->IO Data Bytes/sec: This counter represents the average amount of IO Data bytes/sec spawned by each process. In conjunction with IO Other Bytes/sec, this counter can be used to determine the average IO per second as well as the total amount of IO spawned by each process during the capture. Check for the largest I/O consumers, and see if SQL is being starved of I/O due to some other process spawning a large amount of I/O on the system.
       
    2. Process-> IO Other Bytes/sec: This counter represents the non-data I/O spawned by each process during the capture. Usually, the amount of non-data I/O is very low compared to data I/O. Use the total of IO Data Bytes/sec and IO Other Bytes/sec to determine the total amount of I/O spawned by each process during the capture, and again check whether SQL Server is being starved of I/O by some other process spawning a large amount of I/O on the system.
       
    3. Physical Disk/Logical Disk->Avg. Disk Sec/Read: This counter signifies the average time, in seconds, that it takes for a read I/O request to be serviced for each physical/logical disk. An average of less than 10 ms (a counter value of 0.010) is good, and between 10-15 ms (0.010-0.015) is acceptable, but anything beyond 15 ms (0.015) is a cause for concern.
       
    4. Physical Disk/Logical Disk->Avg. Disk Sec/Write: This counter signifies the average time, in seconds, that it takes for a write I/O request to be serviced for each physical/logical disk. An average of less than 10 ms (a counter value of 0.010) is good, and between 10-15 ms (0.010-0.015) is acceptable, but anything beyond 15 ms (0.015) is a cause for concern.  

    5. Physical Disk/Logical Disk->Disk Bytes/Sec: This counter represents, in bytes, the throughput of your I/O subsystem for each physical/logical disk. Look for the max value for each disk, and divide it by 1024 twice to get the max throughput in MB for the disk. SAN’s generally start from 200-250 MB/s these days. If you see that the throughput is lower than the specifications for the disk, it’s not necessarily a cause for concern. Check this counter in conjunction with the Avg Disk Sec/Read or Avg Disk Sec/Write counters (depending on the wait/symptom you see in SQL), and see the latency at the time of the maximum throughput. If the latency is green, then it just means that SQL spawned I/O that was less the disk throughput capacity, and was easily handled by the disk.

    6. Physical Disk/Logical Disk->Avg. Disk Queue Length: This counter represents the average number of I/O’s pending in the I/O queue for each physical/logical disk. Generally, if the average is greater than 2 per spindle, it’s a cause for concern. Please note that I mentioned the acceptable threshold as 2 per spindle. Most SAN’s these days have multiple spindles. So, for example, if your SAN has 4 spindles, the acceptable threshold for Avg Disk Queue Length would be 8.
      Check the other counters to confirm.

    7. Physical Disk/Logical Disk->Split IO/Sec: This counter indicates the I/O’s for which the Operating System had to make more than one command call, grouped by physical/logical disk. This happens if the IO request touches data on non-contiguous file segments. It’s a good indicator of file/volume fragmentation.

    8. Physical Disk/Logical Disk->%Disk Time: This counter is a general mark of how busy the physical/logical disk is. Actually, it is nothing more than the “Avg. Disk Queue Length” counter multiplied by 100. It is the same value displayed in a different scale. This is the reason you can see the %Disk Time going greater than 100, as explained in the KB http://support.microsoft.com/kb/310067. It basically means that the Avg. Disk Queue Length was greater than 1 during that time. If you’ve captured the perfmon for a long period (a few hours or a complete business day), and you see the %Disk Time to be greater than 80%, it’s generally indicative of a disk bottleneck, and you should take a closer look at the other counters to arrive at a logical conclusion.
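
    You can also cross-check what perfmon shows against SQL Server's own per-file latency statistics; here is a sketch using sys.dm_io_virtual_file_stats (the numbers are cumulative since the last restart):

    select DB_NAME(vfs.database_id) as database_name,
           mf.physical_name,
           vfs.num_of_reads, vfs.num_of_writes,
           case when vfs.num_of_reads = 0 then 0
                else vfs.io_stall_read_ms / vfs.num_of_reads end as avg_read_latency_ms,
           case when vfs.num_of_writes = 0 then 0
                else vfs.io_stall_write_ms / vfs.num_of_writes end as avg_write_latency_ms
    from sys.dm_io_virtual_file_stats(NULL, NULL) as vfs
    join sys.master_files as mf
      on vfs.database_id = mf.database_id and vfs.file_id = mf.file_id
    order by avg_read_latency_ms desc
    go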
       

    It’s important to keep 2 things in mind. One, make sure your data capture is not skewed or biased in any way (for example, do not run a capture at the time of a monthly data load or something). Second, make sure you correlate the numbers reflected across the various counters to arrive at the overall picture of how your disks are doing.

     

    Most of the time, I see that people are surprised when they are told that there are I/O issues on the system. Their typical response is “But, it’s been working just fine for x years, how can it create a bottleneck now?”. The answer lies within the question itself. When the server was initially configured, the disk resources were sufficient for the load on the server. However, with time, it’s inevitable that the business grows as a whole, and so do the number of transactions, as well as the overall load. As a result, there comes a day when the load breaches that threshold, and the disk resources on the server are no longer sufficient to handle it. If you come to office one fine day, see high latency on the disks during normal working hours, and are sure that

    • No special/additional workloads are running on SQL
    • No other process on the server is spawning excessive I/O,
    • Nothing changed on the server in the past 24 hours (like a software installation, patching, reboot, etc.)
    • All the BIOS and disk drivers on the server are up to date,


    Then it's highly likely that the load on your server has breached this threshold, and you should think about asking your disk vendor(s) for a disk upgrade (after having them check the existing system once for latency and throughput, of course). Another potential root cause for high latency is out-of-date disk drivers and/or BIOS. I would strongly recommend checking periodically for updates to all the drivers on the machine, as well as the BIOS.

     

    Hope this helps. As always, comments, feedbacks and suggestions are welcome.

     

     

    SQL Server patch fails with "Could not find any resources appropriate for the specified culture or the neutral culture"

    $
    0
    0

    I recently worked on a number of issues where SQL Server Service Pack/patch installation would fail, and we would see this error in the relevant Detail.txt (located in C:\Program Files\Microsoft SQL Server\100\Setup Bootstrap\Log\<Date time of the installation attempt> for SQL 2008/2008 R2):
     

    2013-04-07 20:14:07 Slp: Package sql_bids_Cpu64: – The path of cached MSI package is: C:\Windows\Installer\5c23b5e.msi . The RTM product version is: 10.50.1600.1

    2013-04-07 20:14:07 Slp: Error: Action “Microsoft.SqlServer.Configuration.SetupExtension.InitializeUIDataAction” threw an exception during execution.

    2013-04-07 20:14:13 Slp: Received request to add the following file to Watson reporting: C:\Users\kalerahul\AppData\Local\Temp\2\tmpCC09.tmp

    2013-04-07 20:14:13 Slp: The following is an exception stack listing the exceptions in outermost to innermost order

    2013-04-07 20:14:13 Slp: Inner exceptions are being indented

    2013-04-07 20:14:13 Slp:

    2013-04-07 20:14:13 Slp: Exception type: System.Resources.MissingManifestResourceException

    2013-04-07 20:14:13 Slp:     Message:

    2013-04-07 20:14:13 Slp:         Could not find any resources appropriate for the specified culture or the neutral culture.  Make sure “Errors.resources” was correctly embedded or linked into assembly “Microsoft.SqlServer.Discovery” at compile time, or that all the satellite assemblies required are loadable and fully signed.

    2013-04-07 20:14:13 Slp:     Stack:

    2013-04-07 20:14:13 Slp:         at System.Resources.ResourceManager.InternalGetResourceSet(CultureInfo culture, Boolean createIfNotExists, Boolean tryParents)

    2013-04-07 20:14:13 Slp:         at System.Resources.ResourceManager.GetObject(String name, CultureInfo culture, Boolean wrapUnmanagedMemStream)

    2013-04-07 20:14:13 Slp:         at Microsoft.SqlServer.Discovery.MsiException.GetErrorMessage(Int32 errorNumber, CultureInfo culture)

    2013-04-07 20:14:13 Slp:         at Microsoft.SqlServer.Discovery.MsiException.GetErrorMessage(MsiRecord errorRecord, CultureInfo culture)

    2013-04-07 20:14:13 Slp:         at Microsoft.SqlServer.Discovery.MsiException.get_Message()

    2013-04-07 20:14:13 Slp:         at System.Exception.ToString()

    2013-04-07 20:14:13 Slp:         at System.Exception.ToString()

    2013-04-07 20:14:13 Slp:         at System.Exception.ToString()

    2013-04-07 20:14:13 Slp:         at Microsoft.SqlServer.Setup.Chainer.Workflow.ActionEngine.RunActionQueue()

    2013-04-07 20:14:13 Slp:         at Microsoft.SqlServer.Setup.Chainer.Workflow.Workflow.RunWorkflow(WorkflowObject workflowObject, HandleInternalException exceptionHandler)

    2013-04-07 20:14:13 Slp:         at Microsoft.SqlServer.Chainer.Setup.Setup.RunRequestedWorkflow()

    2013-04-07 20:14:13 Slp:         at Microsoft.SqlServer.Chainer.Setup.Setup.Run()

    2013-04-07 20:14:13 Slp:         at Microsoft.SqlServer.Chainer.Setup.Setup.Start()

    2013-04-07 20:14:13 Slp:         at Microsoft.SqlServer.Chainer.Setup.Setup.Main()

     

    Now that’s a weird and hard to understand error, isn’t it? However, look closely at what the setup is trying to do, and you will see that it’s trying to access the following file from the installer cache:
     C:\Windows\Installer\5c23b5e.msi
     

    Open the installer cache and try to install the msi manually. If it succeeds, try running the patch setup again and it should proceed beyond the error this time. If the msi setup fails, then you will need to troubleshoot that first, before the patch setup proceeds further. This behaviour is expected, in that the service pack setup will try to access the msi’s (Microsoft Installer files, installed with the base installation of SQL) and msp’s (Microsoft Patch files, installed by Service packs, CU’s and hotfixes) of each of the installed components of SQL Server. If it’s unable to access/run any of these, the Service pack setup will fail.

     

    Hope this helps.

     

    SQL 2012 Availability Group does not come up on one instance

    $
    0
    0

    I recently came across this interesting issue with SQL 2012 AlwaysOn Availability Groups, wherein, after the network and IP were changed, the AG would not come up on one of the instances.

     

    We checked the errorlogs on the server, and found the following entries for the failovers that had succeeded on the healthy instance:

    <

    2013-01-30 13:03:21.09 spid1347 The state of the local availability replica in availability group ‘SQL2012CLUS02’ has changed from ‘NOT_AVAILABLE’ to ‘RESOLVING_NORMAL’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

    2013-01-30 13:03:21.66 spid135 AlwaysOn: The local replica of availability group ‘SQL2012CLUS02’ is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

    2013-01-30 13:03:21.69 spid135 The state of the local availability replica in availability group ‘SQL2012CLUS02’ has changed from ‘RESOLVING_NORMAL’ to ‘PRIMARY_PENDING’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

    2013-01-30 13:03:21.70 Server The Service Broker endpoint is in disabled or stopped state.

    2013-01-30 13:03:21.72 Server The Service Broker endpoint is in disabled or stopped state.

    2013-01-30 13:03:21.82 Server The state of the local availability replica in availability group ‘SQL2012CLUS02’ has changed from ‘PRIMARY_PENDING’ to ‘PRIMARY_NORMAL’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

    2013-01-30 13:03:21.82 Server The Service Broker endpoint is in disabled or stopped state.

    2013-01-30 13:04:15.72 spid50s A connection for availability group ‘SQL2012CLUS02’ from availability replica ‘CTGDNAV’ with id [E4A205DD-0098-481F-94F4-B15ABDC3BAD1] to ‘CDGI-SQLPROD-02’ with id [58F02E10-68CB-4EB2-B517-60306BCC0E72] has been successfully established. This is an informational message only. No user action is required.

    >

    And

    <

    2013-01-30 13:04:36.51 spid754 AlwaysOn: The local replica of availability group ‘SQL2012CLUS02’ is preparing to transition to the resolving role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

    2013-01-30 13:04:36.51 spid754 The state of the local availability replica in availability group ‘SQL2012CLUS02’ has changed from ‘PRIMARY_NORMAL’ to ‘RESOLVING_NORMAL’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

    2013-01-30 13:04:36.52 Server The Service Broker endpoint is in disabled or stopped state.

    2013-01-30 13:04:38.62 spid305 AlwaysOn: The local replica of availability group ‘SQL2012CLUS02’ is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

    2013-01-30 13:04:38.97 spid305 The state of the local availability replica in availability group ‘SQL2012CLUS02’ has changed from ‘RESOLVING_NORMAL’ to ‘PRIMARY_PENDING’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

    2013-01-30 13:04:38.97 Server The Service Broker endpoint is in disabled or stopped state.

    2013-01-30 13:04:39.13 Server The state of the local availability replica in availability group ‘SQL2012CLUS02’ has changed from ‘PRIMARY_PENDING’ to ‘PRIMARY_NORMAL’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

    2013-01-30 13:04:39.14 Server The Service Broker endpoint is in disabled or stopped state.

    2013-01-30 13:04:46.01 spid30s A connection for availability group ‘SQL2012CLUS02’ from availability replica ‘CTGDNAV’ with id [E4A205DD-0098-481F-94F4-B15ABDC3BAD1] to ‘CDGI-SQLPROD-02’ with id [58F02E10-68CB-4EB2-B517-60306BCC0E72] has been successfully established. This is an informational message only. No user action is required.

    >

    We then checked the critical events for the AG in Failover Cluster Manager; this was all I could find:

    Cluster resource ‘SQL2012CLUS02’ in clustered service or application ‘SQL2012CLUS02’ failed.

     

    I then collected the errorlog for the CDGI-SQLPROD-02 instance, and found this:

    2013-01-30 13:04:14.890 spid1535 The state of the local availability replica in availability group ‘SQL2012CLUS02’ has changed from ‘NOT_AVAILABLE’ to ‘RESOLVING_NORMAL’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more

    2013-01-30 13:04:15.580 spid1536s The state of the local availability replica in availability group ‘SQL2012CLUS02’ has changed from ‘RESOLVING_NORMAL’ to ‘SECONDARY_NORMAL’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For m

    2013-01-30 13:04:15.890 spid43s A connection for availability group ‘SQL2012CLUS02’ from availability replica ‘CDGI-SQLPROD-02’ with id [58F02E10-68CB-4EB2-B517-60306BCC0E72] to ‘CTGDNAV’ with id [E4A205DD-0098-481F-94F4-B15ABDC3BAD1] has been successfully established. This is an infor

    2013-01-30 13:04:36.370 spid1540 The state of the local availability replica in availability group ‘SQL2012CLUS02’ has changed from ‘SECONDARY_NORMAL’ to ‘RESOLVING_PENDING_FAILOVER’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster er

    2013-01-30 13:04:38.080 Logon Error: 18456, Severity: 14, State: 5.

    2013-01-30 13:04:38.080 Logon Login failed for user ‘CLT-SQLPROD-02$’. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>]

     

    We can clearly see that the Login failure seems to be responsible for the failed failover of the AG. I tried adding the login manually, and restarted the instance, but the failover still failed. I then checked the event logs on CDGI-SQLPROD-02, found just this:

    Log Name: Application

    Source: MSSQLSERVER

    Date: 1/30/2013 4:54:43 PM

    Event ID: 35206

    Task Category: Server

    Level: Information

    Keywords: Classic

    User: N/A

    Computer: CDGI-SQLPROD-02.CLT.com

    Description:

    A connection timeout has occurred on a previously established connection to availability replica ‘CTGDNAV’ with id [E4A205DD-0098-481F-94F4-B15ABDC3BAD1]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.

    And then this:

    Log Name: Application

    Source: MSSQLSERVER

    Date: 1/30/2013 4:54:33 PM

    Event ID: 18456

    Task Category: Logon

    Level: Information

    Keywords: Classic,Audit Failure

    User: SYSTEM

    Computer: CDGI-SQLPROD-02.CLT.com

    Description:

    Login failed for user ‘CLT-SQLPROD-02$’. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>]

    The interesting thing here is that the connection attempt seems to be coming from the OS on the same box. I then captured a Profiler trace to confirm, and found the login error event in it:

    ErrorLog 2013-01-31 11:21:30.55 Logon Error: 18456, Severity: 14, State: 5.

    2013-01-31 11:21:30.55 Logon Login failed for user ‘CLT-SQLPROD-02$’. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>]

    Microsoft® Windows® Operating System CDGI-SQLPROD-02$ 3828 86 2013-01-31 11:21:30.550 1 master 18456 26102 CDGI-SQLPROD-02 CLT0 CDGI-SQLPROD-02 CLTCDGI-SQLPROD-02$ 5 14

    EventLog Login failed for user ‘CLTCDGI-SQLPROD-02$’. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>] Microsoft® Windows® Operating System CDGI-SQLPROD-02$ 3828 86 2013-01-31 11:21:30.550 0X184800000E0000001000000043004400470049002D00530051004C00500052004F0044002D00300032000000070000006D00610073007400650072000000 1 master 18456 26103 CDGI-SQLPROD-02 CLT0 CDGI-SQLPROD-02 CLTCDGI-SQLPROD-02$ 5 14

    Audit Login Failed Login failed for user ‘CLTCDGI-SQLPROD-02$’. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>] Microsoft® Windows® Operating System CDGI-SQLPROD-02$ CLTCDGI-SQLPROD-02$ 3828 86 2013-01-31 11:21:30.550 1 master 18456 26104 1 – Nonpooled CDGI-SQLPROD-02 CLT 0 CDGI-SQLPROD-02 CLTCDGI-SQLPROD-02$ 5 0 1 – Non-DAC

    User Error Message Login failed for user ‘CLTCDGI-SQLPROD-02$’. Microsoft® Windows® Operating System CDGI-SQLPROD-02$ 3828 86 2013-01-31 11:21:30.550 1 master 18456 26105 CDGI-SQLPROD-02 CLT 0 CDGI-SQLPROD-02 CLTCDGI-SQLPROD-02$ 1 1 0 14

    The profiler trace confirms our hunch. We then proceeded to run the following commands to add local system as a sysadmin on the problem instance:

     

    USE [master]

    GO

     

    /****** Object:  Login [NT AUTHORITY\SYSTEM]    Script Date: 01-02-2013 03:31:56 ******/

    CREATE LOGIN [NT AUTHORITY\SYSTEM] FROM WINDOWS WITH DEFAULT_DATABASE=[master],

    DEFAULT_LANGUAGE=[us_english]

    GO

     

    GO

    ALTER SERVER ROLE [sysadmin] ADD MEMBER [NT AUTHORITY\SYSTEM]

    GO
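
    Before retrying the failover, you can quickly verify that the login is in place and has the sysadmin role (a sketch; the check returns 1 if the membership is in effect):

    SELECT IS_SRVROLEMEMBER('sysadmin', 'NT AUTHORITY\SYSTEM') AS is_sysadmin;
    GO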

     

    After this, the failover worked perfectly fine.

     

    Hope this helps.

    When using SSL, SQL Failover Cluster Instance fails to start with error 17182

    $
    0
    0

    I recently worked on an interesting issue with a SQL Server Failover Cluster Instance (FCI). We were trying to use an SSL certificate on the instance, and we followed these steps:

    1. Made sure the certificate was requested according to the requirements defined here.
    2. Loaded the certificate into the Personal store of the computer account across all the nodes
    3. Copied the thumbprint of the certificate, eliminated the spaces, and pasted it into the value field HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.CLUSTEST\MSSQLServer\Certificate key. Please note that this was a SQL 2008 instance named "CLUSTEST"

     

    However, when we restarted SQL Server after performing these changes, it failed. In the errorlog, we saw these messages:

    2013-07-21 14:06:11.54 spid19s     Error: 17182, Severity: 16, State: 1.

    2013-07-21 14:06:11.54 spid19s     TDSSNIClient initialization failed with error 0xd, status code 0x38. Reason: An error occurred while obtaining or using the certificate for SSL. Check settings in Configuration Manager. The data is invalid.

    2013-07-21 14:06:11.54 spid19s     Error: 17182, Severity: 16, State: 1.

    2013-07-21 14:06:11.54 spid19s     TDSSNIClient initialization failed with error 0xd, status code 0x1. Reason: Initialization failed with an infrastructure error. Check for previous errors. The data is invalid.

    2013-07-21 14:06:11.54 spid19s     Error: 17826, Severity: 18, State: 3.

    2013-07-21 14:06:11.54 spid19s     Could not start the network library because of an internal error in the network library. To determine the cause, review the errors immediately preceding this one in the error log.

    2013-07-21 14:06:11.54 spid19s     Error: 17120, Severity: 16, State: 1.

    2013-07-21 14:06:11.54 spid19s     SQL Server could not spawn FRunCommunicationsManager thread. Check the SQL Server error log and the Windows event logs for information about possible related problems.

    I checked and made sure the certificate was okay, and that it was loaded properly. Then, I noticed something interesting. After copying the thumbprint to a text file, I got a Unicode to ANSI conversion warning when I tried to save the file in txt format:

    [Screenshot: Notepad warning about converting Unicode content to ANSI]

     

    This is expected, since the default format for notepad is indeed ANSI. I went ahead and clicked OK. When we reopened the file, we saw a "?" at the beginning, which basically meant that there was a Unicode character at the beginning of the string. We followed these steps to resolve the issue:

    1. Eliminated the Unicode character from the thumbprint
    2. Converted all the alphabetical characters in the thumbprint to uppercase.
    3. Eliminated the spaces from the thumbprint
    4. Saved this thumbprint to the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.CLUSTEST\MSSQLServer\Certificate key.

     

    The instance came online just fine this time.
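
    If you also want to confirm that connections are actually being encrypted with the certificate, one simple check (a sketch) is to look at your own connection after connecting with encryption requested:

    select session_id, encrypt_option, auth_scheme
    from sys.dm_exec_connections
    where session_id = @@SPID
    go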

     

    Hope this helps.


    Something to watch out for when using IS_MEMBER() in TSQL

    $
    0
    0

    I recently worked on an interesting issue with my good friend Igor (@sqlsantos), where we were facing performance issues with a piece of code that used the IS_MEMBER() function. Basically, the IS_MEMBER function is used to find out whether the current user (the Windows/SQL login used for the current session) is a member of the specified Windows group or SQL Server database role.


    In the code in question, the IS_MEMBER function was being used to determine the Windows group membership of the Windows login. The Windows groups were segregated according to geographical areas, and based on the user's group membership, the result set was filtered to show rows for only those geographical areas for which the user was a member of the corresponding groups in Active Directory.
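
    For reference, here is a minimal, standalone illustration of IS_MEMBER (the group name is hypothetical; the function returns 1 for a member, 0 for a non-member, and NULL if the group or role cannot be found):

    SELECT IS_MEMBER('CONTOSO\EMEA_Sales') AS is_in_emea_sales_group,
           IS_MEMBER('db_owner')           AS is_db_owner;
    GO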

    Here’s an example of a piece of code where we perform this check:

    With SalesOrgCTE AS(

    SELECT

      MinSalesOrg, MaxSalesOrg

    FROM

      RAuthorized WITH (NOLOCK)

    WHERE

      IS_MEMBER([Group]) = 1

    )


    The problem was that the complete procedure where we were using IS_MEMBER was taking several minutes to complete, for a table where the max result set cardinality was in the range of 18000-20000. We noticed the following wait types while the procedure was executing:

    PREEMPTIVE_OS_AUTHORIZATIONOPS

    PREEMPTIVE_OS_LOOKUPACCOUNTSID


    I did some research on these waits, and found that since both of them relate to communication with and validation against Active Directory, they lie outside of SQL Server, and there are no changes we can make from a configuration standpoint to help reduce or eliminate them.


    Next, we studied the code, broke it down and tested the performance of the various sections that used the IS_MEMBER function, and found that the main section responsible for the execution time was the “WHERE” condition where we were using the result set of the code mentioned above. This is what the “WHERE” clause looked like:


    (SELECT

      COUNT(*)

    FROM

      SalesOrgCTE

    WHERE

      SORGNBR BETWEEN MinSalesOrg AND MaxSalesOrg) > 0


    Notice that in this code, we've asked SQL to check the value of SORGNBR for each row, and if it's between MinSalesOrg and MaxSalesOrg, to add it to the row count. Due to this design, SQL had to make a trip to AD to validate each row, which took quite a long time for an 18,000-20,000 row result set and was responsible for the slow performance of the procedure.


    We did some more research with different approaches for the WHERE clause, and the combined efforts of Igor, his team and myself resulted in the following WHERE clause, whose performance was acceptable:

    WHERE

    SORGNBR IN

    (

           SELECT MinSalesOrg FROM SalesOrgCTE

    )

    AND SORGNBR IN

    (

           SELECT MaxSalesOrg FROM SalesOrgCTE

    )


    If you look carefully, you’ll notice that in this code snippet, we’ll need to communicate with AD only twice, thereby improving the performance of the procedure as a whole.


    Summing up: The importance of writing good code cannot be over-emphasized. It’s good coding practices like this that lead to performance gains most of the time.


    Hope this helps.

