Thursday, September 22, 2016

Corrupt Microsoft SQL Database Log in AlwaysOn High Availability Group (AAG)

We recently ran into an issue in one of our environments where Microsoft SQL Server experienced corruption in the database log. This issue is usually discovered when you attempt to create a new backup and it fails with the message "BACKUP detected corruption in the database log".


Resolving this issue is normally fairly easy (switch the database from the Full Recovery Model to Simple and then back again), but it gets a bit more complex when your database is replicated via an AlwaysOn High Availability Group. Here are the steps to fix it (assuming no other databases are in the AAG).

1. Remove Secondary Replica - First we need to stop replication to the secondary replica. To do this, connect to the primary node in the cluster and right click on the SECONDARY replica. Then select "Remove from Availability Group" and follow the wizard.


2. Remove Database from AAG - Next we need to remove the database from the AAG by right clicking on it under the Availability Databases folder and selecting "Remove Database from Availability Group".

At this point you should have your primary node as the only member of the AAG with no databases associated. Next, delete the database from the SECONDARY node. Your secondary server should now have no replicas, no availability databases, and no copy of the database.

3. Next we need to change the remaining copy of the database on our primary node from the Full to the Simple Recovery Model by right clicking on the database and selecting Properties > Options.

4. Next we need to do a full backup of the database.
5. Repeat the steps in #3, but this time change it from Simple back to the original Full Recovery Model.
6. Backup the database again.
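If you prefer scripting to the SSMS dialogs, steps 3-6 boil down to four T-SQL statements. Here is a hedged sketch in Python (in keeping with the pymssql example later on this blog) that only builds the statements; the database name and backup path are placeholders, and you would execute the statements over your own connection:

```python
def recovery_flip_statements(db, backup_path):
    """Build the T-SQL for steps 3-6: flip to the Simple Recovery
    Model, take a full backup, flip back to Full, and back up again."""
    return [
        "ALTER DATABASE [{0}] SET RECOVERY SIMPLE".format(db),
        "BACKUP DATABASE [{0}] TO DISK = N'{1}'".format(db, backup_path),
        "ALTER DATABASE [{0}] SET RECOVERY FULL".format(db),
        "BACKUP DATABASE [{0}] TO DISK = N'{1}'".format(db, backup_path),
    ]

# hypothetical database and share, just to show the generated T-SQL
for stmt in recovery_flip_statements('MyDatabase', r'\\backups\MyDatabase.bak'):
    print(stmt)
```

Running each statement in order against the primary (with sysadmin rights) has the same effect as the point-and-click steps above.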

Now we are ready to re-add the secondary replica.

7. On the primary server right click on the Availability Replicas folder and select "Add Replica..."
Next you will need to select the "Add Replica" button, and you will be prompted to connect to your secondary server.

After this you will want to configure your replica. In our case we have chosen to make the secondary copy of the database readable and to enable automatic failover.

In the next screen you will need to configure your sync preferences. We are using a Full sync, which requires a file share accessible by both SQL Servers. SQL Server will take a backup of the database, place it on the share, and the secondary node will restore the database from this initial backup.

Follow the wizard and verify that all validation checks pass.

After this you can track the progress of the backup/restore/sync.

With that you should have a working AlwaysOn Availability Group again!

Friday, September 16, 2016

FreeTDS and Microsoft SQL Server Windows Authentication - Part 1

I've been trying to get the Zenoss SQL Transaction ZenPack working so that we can use Zenoss to run SQL queries for specific monitoring purposes, and I ran into a few things that might be worth sharing.

Using tsql for troubleshooting

Zenoss, among many other tools, uses pymssql to connect to your SQL Servers, and pymssql uses FreeTDS behind the scenes. If you can't get pymssql to work, you can go a layer deeper to see if you can find the issue. In my case I have the following configuration:

Fedora Server 23
freetds-0.95.81-1
pymssql-2.1.3

First off, FreeTDS uses a config file at /etc/freetds.conf that has a [global] section and examples for configuring individual servers. This is important because you need TDS version 7.0+ for Windows Authentication to work.

If we try to connect using the diagnostic tool tsql (not to be confused with the language T-SQL) without changing the default TDS version or adding a server record in the config file, our attempts will fail.

To fix this you can either:
Change the global value for "tds version" to 7.0 or higher (a good idea if you only have MSSQL servers):

Or you can add a server record for each Microsoft SQL Server and leave the global version below 7.


The catch to the second method is that when you connect you have to use the name as it appears in the config file (in this case us01-0-srs1); you cannot use the FQDN, or the connection will fall back to the global setting and fail. This method also creates overhead in maintaining the list of MSSQL servers in freetds.conf.
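For reference, the two approaches look roughly like this in /etc/freetds.conf (us01-0-srs1 is the server name from the example above; the host FQDN is a placeholder and 1433 is the default MSSQL port):

```ini
[global]
        # Option 1: raise the default so every connection speaks TDS 7.x
        tds version = 7.1

# Option 2: a per-server record; connect with "tsql -S us01-0-srs1 ..."
[us01-0-srs1]
        host = us01-0-srs1.domain.com
        port = 1433
        tds version = 7.1
```

With either in place, something like `tsql -S us01-0-srs1 -U 'DOMAIN\username' -P '...'` should get you a prompt; with option 2, remember that the -S value must match the bracketed name, not the FQDN.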


Either way, at this point tsql should be able to query your MSSQL Servers using Windows Authentication.


Getting started with pymssql
To make sure that pymssql is working, I threw together a quick bit of Python that lets you connect using Windows Authentication.


It's basically a simplified version of the example on the pymssql web page, but it will prove whether pymssql and MSSQL Windows Authentication are working or not.

-------------BEGIN Code
import pymssql

print('Connecting to SQL')
conn = pymssql.connect(server='server.domain.com', user='DOMAIN\\username', password='Super Secret P@ssW0rds', database='master')

print('Creating cursor')
cursor = conn.cursor()

print('Executing query')
cursor.execute("""
SELECT MAX(req.total_elapsed_time) AS [total_time_ms]
FROM sys.dm_exec_requests AS req
WHERE req.sql_handle IS NOT NULL
""")

print('Fetching results')
row = cursor.fetchone()
while row:
    print(row[0])
    row = cursor.fetchone()

print('Closing connection')
conn.close()
-------------END Code 

After filling in the details for your MSSQL Server you can simply run it and get the results.


Part 2 will cover the Zenoss specific aspects of this...

Friday, August 26, 2016

Zenoss and ServiceNow Integration - Custom Fields and Values

Our Zenoss instance is integrated with ServiceNow so that our support organization can open an incident with the appropriate event details at the click of a button from the Zenoss Events Console. The workflow for this looks something like the below flowchart that I just threw together.

The problem, however, is that our Zenoss instance was not following through on the last step after incident resolution and closing out the associated Zenoss event. Because of this we were missing alerts on recurring issues, since the event stayed in an acknowledged state. By default the Zenoss Incident Management ZenPack looks at the incident_state field for the values 6 and 7 to indicate a closed incident. However, our ServiceNow instance instead uses the underlying state field, inherited from the task table that the Incident table is built on top of.
You can find out which field you are using by right clicking on the State label and either checking the "Show" entry or clicking on "Configure Label", which will show you the associated table.


Next we need to find out the appropriate values associated with the state so that we can update Zenoss. Open the Task table under "System Definition - Tables". 


Then open the state column. (You can do this by clicking on the information button).


Next you will want to filter the results down to the Incident table, and you will be able to find the integer values for your states.


In this case I want an incident with a state value greater than 3 to be considered "closed" from a Zenoss point of view, with monitoring re-enabled by moving the Zenoss event from Acknowledged to Closed.

Now, to make the change on our Zenoss server we need to create a snapshot of the Zope container, make the changes to the IncidentManagement ZenPack configuration, and commit the snapshot so that the changes persist when the zenincidentpoll container is restarted.

From my Control Center I'm going to run the below command to start:
serviced service shell -i -s update_closed_sn zope

After that I can modify the appropriate file changing the values to match what I've discovered in the previous steps:

vi /opt/zenoss/ZenPacks/ZenPacks.zenoss.IncidentManagement-2.3.16-py2.7.egg/ZenPacks/zenoss/IncidentManagement/servicenow/action.py
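I won't reproduce the ZenPack's source here, but the edit amounts to something like the following sketch (the function and field names are illustrative, not the actual action.py code): instead of testing incident_state for the values 6 and 7, test the inherited state field against your threshold.

```python
# From the Task table's state column, discovered in the steps above:
# anything greater than 3 should count as closed.
CLOSED_STATE_THRESHOLD = 3

def incident_is_closed(incident):
    """Illustrative check: treat the incident as closed when the
    inherited 'state' field exceeds the threshold, rather than
    checking incident_state for the values 6 or 7."""
    return int(incident.get('state', 0)) > CLOSED_STATE_THRESHOLD
```

So a Resolved incident (state 6) would close the associated Zenoss event, while an In Progress one (state 2) would leave it acknowledged.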



After saving the file and exiting the Zope container using "exit" we now need to commit the new image using:
serviced snapshot commit update_closed_sn

After committing the snapshot, restart your zenincidentpoll container from the Zenoss Control Center UI. Your changes will then be live: closing an Incident in ServiceNow should automatically close the associated Zenoss event, as seen in the below event notes.


Hopefully that helps!



Monday, July 25, 2016

vCloud Director Logging

I was recently asked how to go about configuring the Log Insight Agent with VMware vCloud Director and thought that I would take the time to document it here for anyone else who is interested.

Logging in vCD is normally handled by log4j and configured via $VCLOUD_HOME/etc/log4j.properties, with the official KB located here. You should use either log4j OR the Log Insight Agent, but not both, or you will have event duplication.

Log4j Configuration
First a quick overview of the log4j configuration.
1. Open $VCLOUD_HOME/etc/log4j.properties
2. Append "vcloud.system.syslog" to the rootLogger, making sure not to forget the comma before it
3. At the bottom of the file, append the six lines outlined in the KB, making sure to change the target FQDN.
4. Unfortunately, with vCD 5.x you also have to restart the vmware-vcd service for the changes to take effect. Hint: if you don't want to restart the service and take an outage, keep reading and use the Log Insight Agent instead :)
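The six appended lines define a syslog appender. From memory they are of roughly this shape; copy the exact lines (including VMware's own layout class) from the KB rather than from here, as the layout and facility below are placeholders:

```properties
log4j.appender.vcloud.system.syslog=org.apache.log4j.net.SyslogAppender
log4j.appender.vcloud.system.syslog.syslogHost=syslog.example.com
log4j.appender.vcloud.system.syslog.facility=LOCAL1
log4j.appender.vcloud.system.syslog.layout=org.apache.log4j.PatternLayout
log4j.appender.vcloud.system.syslog.layout.ConversionPattern=%d | %-5p | %m%n
log4j.appender.vcloud.system.syslog.threshold=INFO
```

The appender name after log4j.appender. must match the "vcloud.system.syslog" entry you added to the rootLogger in step 2.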

Log Insight Agent
vCloud Director supports RHEL and CentOS, so you only need to worry about the RPM install of the Log Insight Agent. First, though, we need to do some prep work on the Log Insight server.

1. Install the vCD Content Pack - On the Log Insight server that you will be pointing your LI Agent at, you will need to have the vCD Content Pack installed so the Agent Group is available. This is easily done via the Marketplace.

2. Create your Agent Group - From the Administration window select Agents and then highlight the vCloud Director Cell Servers pre-defined Agent Group.
Next scroll to the bottom of the page and select Copy Template

3. Next you will need to define a filter that limits this collection to only vCD cells. My test example here is very basic, limiting it to hosts with a certain hostname prefix.
You can see in the bottom section of the agent group the actual files that will be collected by the agent.
By default the agent only collects info-level logs, but you can easily switch to debug-level logs if you desire. Feel free to check out my very basic sizing calculator on GitHub if you are curious about the impact of the additional logs. For now, just hit Save Agent Group to continue.

4. Now you are ready for the actual agent installation! You will need to copy the RPM to each vCD cell's /tmp directory. The LI Agent will need to be installed and configured on every vCD server.
Note: at some point after this step you will need to decide when to remove the log4j configuration and when to enable the agent. I would personally recommend disabling log4j before installing the agent; short term you won't lose any events, since the LI Agent will go through all the log files on the server and forward them on.

5. Install the agent via RPM

6. If you downloaded the agent from the Log Insight server it is supposed to forward to, then you don't need to modify the liagent.ini file; but if you downloaded it from my.vmware.com or another Log Insight server, you will need to update the target hostname.
If you want to be secure you can enable SSL, and your /etc/liagent.ini file will look more like the below.
Don't forget that you'll need certificates for SSL, so follow the full official documentation available here.
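For reference, the [server] section of /etc/liagent.ini with SSL enabled looks roughly like this (the hostname is a placeholder; 9543 is the agent's SSL port, versus 9000 for plain cfapi):

```ini
[server]
hostname=loginsight.domain.com
proto=cfapi
port=9543
ssl=yes
```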
 
At this point you should see that your agents are alive and sending data to your Log Insight server.

Friday, July 8, 2016

Early Boot Windows Debugging - Part 2 - Kernel Debugging over Serial

This post is a continuation of Part 1; I think I shall call it "Help, my ntbtlog.txt isn't being written to disk and I'm flying blind"

Ok, now I need more data because I'm not getting anywhere. Fortunately, Windows still has the option to log kernel debugging over serial, a feature I wasn't aware existed until today. That brings up the big question: how do I make that work on a VM or a physical device without a serial port?

First you need to enable virtual printers in VMware Workstation under Edit > Preferences. Without this enabled, Workstation can't attach to named pipes.
Next we need to add a virtual serial port to our VM and tell it to output to a named pipe.
Next, accept or change the named pipe (only replace the "com_1" part if you change it) and set "This end is the server" and "The other end is an application". This means that your VM is the server and you are going to attach an application to the named pipe.
With that out of the way, you need to install the Windows Debugging Tools, which are included in the Windows SDK (the link for Windows 10 is here). After installing the debugging toolset we need to launch a new kernel debug session.
Go to File > Kernel Debug in WinDbg.
Next select the COM tab and fill it out with the below settings, replacing the name of the port with your named pipe.
Hit OK and you should see your debugger start and say it's "Waiting to reconnect..."
Even if you boot the VM at this point you won't get any information. First we need to boot to the Windows Repair wizard, go to Troubleshoot > Command Prompt, and enable debugging using bcdedit.

Commands:
bcdedit /bootdebug {bootmgr} on    (Windows Boot Manager)
bcdedit /bootdebug on              (boot loader)
bcdedit /debug on                  (OS kernel debugger)

At this point you can reboot. In theory this should be all you need for debugging, but I've noticed that the information is still lacking.

Instead, have it boot explicitly into debug mode.

Now your debug output should have much more valuable information, this time pointing to "IOINIT: Built-in driver \Driver\sacdrv failed to initialize with status - 0xc0000037".


Congratulations, you can now see what is actually going on in your OS and where the root of the issue is with much more clarity.

Early Boot Windows Debugging - Part 1 - Basics

I have a Windows Server 2012 VM that will not boot past the Windows splash screen, instead throwing a BSOD with the error "SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (NETIO.SYS)". It's been a long while since I've troubleshot Windows (I primarily use CentOS), but here's what I've found. I don't have the solution yet, but I'm recording some tidbits I found so I will have them later.

First a bit of preamble:

1. Advanced Boot Options - When you select "Enable Boot Logging", Windows is supposed to write a log file named ntbtlog.txt. In this particular case that never happens, presumably because the crash occurs before the appropriate driver is loaded to write log files. With 2012 this is conjecture, since the latest Microsoft documentation I can find applies to Server 2000. Regardless of the reason, the log isn't captured in this instance.
2. This VM was originally running on ESXi, but I have exported an OVF to my local VMware Workstation for troubleshooting.
3. In the below operations I will be referencing "d:\", which is actually the c:\ of the server; it is exposed to the rescue command prompt as d:\ on my system.

Step 1: Boot to the command prompt from the troubleshooting menu in the Automatic Repair wizard.
Step 2: Run chkdsk to verify the filesystem is in working order. My scan came back with required repairs, which it corrected; subsequent scans come back clean.

Command: chkdsk d: /f
Step 3: Run sfc to verify that Windows is OK. This returns that everything is fine.

Command: sfc /offbootdir=d:\ /offwindir=d:\Windows /scannow
Step 4: Just for grins I also ran DISM (Deployment Image Servicing and Management) to check the integrity. It will throw a warning if you don't give it a scratch directory, so I just created a temporary one on the drive. This also returns no corruption.

Command: dism /image:d:\ /cleanup-image /scan-health /scratchdir:d:\temp
So far, so good... except it still won't boot. I have an existing "twin" of this machine that should match it in most regards, so just to be certain I also run a manual hash check on netio.sys and sacdrv.sys (more on that file later). The syntax for that is:

certutil.exe -hashfile drivers\netio.sys md5 (or sha1)

According to some googling, the number one cause of netio.sys BSODs is driver conflicts, so I start down that road next. An export of all the drivers on the two systems shows that they are absolutely identical. Because that doesn't help me, I start yanking out drivers to see if it makes a difference.

To get a list of non-Microsoft drivers I again use DISM and find that there are fortunately only 8 to worry about.

Command: dism /image:d:\ /scratchdir:d:\temp /get-drivers

I'm going to start removing drivers to see if that makes any difference. Using DISM again, I start with the vmxnet3 driver, since it makes the most sense given a netio.sys error.

Command: dism /image:d:\ /scratchdir:d:\temp /remove-driver /driver:oem4.inf

After a reboot, no change. In 1 of my tests I also then proceed to remove the 7 remaining drivers, that also did nothing. Time to get more information.... Queue next post....