Friday, August 26, 2016

Zenoss and ServiceNow Integration - Custom Fields and Values

Our Zenoss instance is integrated with ServiceNow so that our support organization can open an incident with the appropriate event details at the click of a button from the Zenoss Events Console. The workflow for this looks something like the below flowchart that I just threw together.

The problem however is that our Zenoss instance was not following through in the last step after incident resolution and closing out the associated Zenoss Event. Because of this we were missing alerts on re-occurring issues since the event was in an acknowledged state. By default the Zenoss Incident Management ZenPack looks at the incident_state field for values 6 and 7 to indicate a closed event. However, our ServiceNow instance uses the underlying state field that is inherited from the task table that the Incidents table is built on top of instead of incident_state.
You can find out what field you are using by right clicking on the State label and either seeing the "Show - ''" or clicking on "Configure Label" which will show you the associated table


Next we need to find out the appropriate values associated with the state so that we can update Zenoss. Open the Task table under "System Definition - Tables". 


Then open the state column. (You can do this by clicking on the information button).


Next you will want to filter the results down to the Incident table and you will be able to find the integer values for your state.


In this case I want an incident with a state value greater than 3 to be considered from a Zenoss point of view to be "closed" and monitoring to be re-enabled by moving the Zenoss event from an "Acknowledged" state to "Closed".

Now, to make the change on our Zenoss server we need to create a snapshot of the Zope container, make the changes to the IncidentManagement ZenPack configuration and commit the snapshot so that the changes are persistent when the zenincidentpoll container is restarted.

From my Control Center I'm going to run the below command to start:
serviced service shell -i -s update_closed_sn zope

After that I can modify the appropriate file changing the values to match what I've discovered in the previous steps:

vi /opt/zenoss/ZenPacks/ZenPacks.zenoss.IncidentManagement-2.3.16-py2.7.egg/ZenPacks/zenoss/IncidentManagement/servicenow/action.py



After saving the file and exiting the Zope container using "exit" we now need to commit the new image using:
serviced snapshot commit update_closed_sn

After committing the snapshot you need to restart your zenincidentpoll container from the Zenoss Control Center UI and then your changes will be live and you should be able to close an Incident in ServiceNow and have Zenoss automatically close the associated Zenoss event as seen in the below event notes.


Hopefully that helps!


.

Monday, July 25, 2016

vCloud Director Logging

I was recently asked how to go about configuring the Log Insight Agent with VMware vCloud Director and thought that I would take the time to document it here for anyone else who is interested.

Logging in vCD is normally handled by log4j and configured by $VCLOUD_HOME/etc/log4j.properties.with the official KB located here. You should either use log4j OR the Log Insight Agent, but not both or you will have event duplication.

Log4j Configuration
First a quick overview of the log4j configuration.
1. Open $VCLOUD_HOME/etc/log4j.properties
2. Append "vcloud.system.syslog" to the rootLogger and make sure to not forget the comma before it
 3. At the bottom of the file go ahead and append the below 6 lines outlined in the KB making sure to change your target FQDN.
4. Unfortunately with vCD 5.x you also have to restart the vmware-vcd service for the changes to take effect. Hint: if you don't want to restart the services and take an outage you can continue reading and use the Log Insight Agent instead :)

Log Insight Agent
vCloud Director supports RHEL and CentOS so you only need to worry about the RPM install of the Log Insight Agent. First though, we need to do some prep work on the Log Insight Server.

1. Install the vCD Content Pack - On the Log Insight Server that you will be pointing your LI Agent at you will need to have the vCD Content Pack installed so the Agent Group is available. This is easily done via the Marketplace

2. Create your Agent Group - From the Administration window select Agents and then highlight the vCloud Director Cell Servers pre-defined Agent Group.
Next scroll to the bottom of the page and select Copy Template

3. Next you will need to define a filter that limits this collection to only vCD Cells. My test example here is very basic and limiting to hosts with a certain hostname prefix.
You can see in the bottom section of the agent group the actual files that will be collected by the agent.
By default the agent only collects info level logs but you can easily switch that to debug level logs if you desire. Feel free to check out my very basic sizing calculator on Github if you are curious of the impact of the additional logs. For now, just hit Save Agent Group to continue.

4. Now you are ready for the actual agent installation! You will need to copy the RPM to your vCD cells /tmp directory. The LI Agent will need to be installed and configured on every vCD Server.
Note: At some point after this step you will need to decide when to remove the log4j configuration and when to enable the Agent. I would personally recommend disabling log4j before installing the agent as short term you won't lose any events since the LI Agent will go through all the log files on the server and forward them on.

5. Install the agent via RPM

6.  If you downloaded the agent from the Log Insight server it is supposed to be forwarding to then you don't need to modify the liagent.ini file but if you downloaded it from my.vmware.com or another Log Insight Server you will need to update the target hostname.
If you want to be secure you can enable ssl and your /etc/liagent.ini file will look more like the below
Don't forget that you'll need certificates for SSL so follow the full official documentation available here 
 
At this point you should see that your agents are alive and sending data to your Log Insight Server




 

Friday, July 8, 2016

Early Boot Windows Debugging - Part 2 - Kernel Debugging over Serial

This post is a continuation of Part 1; I think I shall call it "Help, my ntbtlog.txt isn't being written to disk and I'm flying blind"

Ok, now I need more data because I'm not getting anywhere. Fortunately Windows still has the option to log kernel debugging over serial. A feature I wasn't aware existed util today. That brings up the big question: how do I make that work on a VM and a physical device without a serial port?

First you need to enable virtual printers in VMware Workstation under Edit > Preferences. Without this enabled Workstation can't attach to named pipes.
Next we need to add a virtual serial port to our VM and tell it to output to named pipe
Next accept or change the named pipe (only replace the part "com_1" if you change it) and set it so that "This end is the server" and "The other end is an application".  This means that your VM is the server and you are going to attach an application to the named pipe.
With that out of the way you need to install the Windows Debugging Tools which are included in the Windows SDK. Link for Windows 10 is here. After installing the debugging toolset we need to launch a new kernel debug session.
Go File > Kernel Debug in WinDbg
Next select the COM tab and fill it out with the below settings but replacing the name of the port with your named pipe.
Hit Ok and you should see your debugger start and say it's "Waiting to reconnect..."
Even if you boot the VM at this point you won't get any information first we need to boot to the Windows Repair wizard, go to Troubleshoot > Command Prompt and enable debugging using bcdedit.

Commands: 
bcdedit /bootdebug {bootmgr} on (Windows Boot Manager)
bcdedit /bootdebug on (boot loader)
bcdedit /debug on (OS Kernel debugger)

At this point you can now reboot. In theory this should be all that you need for debugging but I've noticed that the information is still lacking.

Instead have it boot explicitly to debug mode

Now your debug should have much more valuable information, this time pointing to "IOINIT: Built-in driver \Driver\sacdrv failed to initialize with status - 0xc0000037"


Congratulations, you can now see what is actually going on in your OS and where the root of the issue is at more more clarity.









Early Boot Windows Debugging - Part 1 - Basics

I have a Windows Server 2012 VM that will not boot past the Windows splash screen but throws a BSOD with the error "SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (NETIO.SYS). It's been a long while since working on troubleshooting Windows (I primarily use CentOS) but here's what I've found. I don't have the solution yet but I'm recording some tidbits that I found so I will have them later.

First a bit of preamble:

1. Advanced Boot Options - When you select "Enable Boot Logging" this is supposed to write a log file named ntbtlog.txt. However, in this particular case that never happens. This is presumably because it is before the appropriate driver is loaded to write log files. However with 2012 this is conjecture since the latest Microsoft documentation that I can find applies to Server 2000. Regardless of reason, it isn't captured in this instance.
2. This VM was originally running on ESXi but I have exported and OVF to my local VMware Workstation for my troubleshooting.
3. In the below operations I will be referencing "d:\" which is actually the c:\ of the server. It is referenced from the rescue command prompt as d:\ on my system.

Step 1: Boot to the command prompt from the troubleshooting menu in the Automatic Repair wizard
Step 2: Run a chkdsk to verify the filesystem is in working order. My scan came back with required repairs which it corrected. Subsequent scans come back clean.

Command: chkdsk d: /f
Step 2: Run sfc to verify that Windows is ok. This returns that everything is ok

Command: sfc /offbootdir=d:\ /offwindir=d:\Windows /scannow
Step 3 - Just for grins I also ran DISM (Deployment Image Servicing and Management) to check the integrity. It will throw a warning if you don't give it a scratch directory so I just created a temporary one on my drive. This also returns no found corruption.

Command: dism /image=d:\ /cleanup-image /scan-health /scratchdir=d:\temp
So far, so good... except it still won't boot up. I have an existing "twin" of this machine that should match it in most regards so just to be super certain I also run a manual hashing check on netio.sys and sacdrv.sys (more on that file later). The syntax for that is:

certutil.exe -hashfile drivers\netio.sys md5 (or sha1)

The number 1 cause of netio.sys BSOD are driver conflicts according to googling so I start down that road next. An export of all the drivers between the 2 systems shows that they are absolutely identical. Because that doesn't help me I start yanking out drivers to see if it will make a difference.

To get a list of non-Microsoft drivers I again use DISM and find that there are fortunately only 8 to worry about.

Command: dism /image:d:\ /scratchdir:d:\temp /get-drivers

I'm going to start removing drivers to see if that makes any difference. Again, using DISM I start by removing the vmxnet3 driver since it makes the most sense considering a netio.sys error.

Command: dism /image:d:\ /scratchdir:d:\tetmp /remove-driver:oem4.inf

After a reboot, no change. In 1 of my tests I also then proceed to remove the 7 remaining drivers, that also did nothing. Time to get more information.... Queue next post.... 









Friday, July 1, 2016

Log Insight Configuration API Audit and Standalone Remediation Tool - Updated!

For those of you who are interested I have updated the API based audit and remediation tool with a couple new features. After all, what is the use of automation if it isn't user friendly?

1. Better error handling of remediation errors: In the past you would just get a message to the effect of "Something went wrong" but now the tool will pass the HTTP status code and Error Details from the Log Insight Server's response to your remediation request. In the below example you can see this in action.

2. Now includes a wizard to help build a simplified JSON configuration file! Now, without having to create a single bit of JSON you can quickly get value from the tool. The wizard is simplified because let's be honest, if you want the wizard you don't want to answer 250 questions. Because of this some things are assumed/disabled. If you want them then you can simply add it to the code or use the template in the included docs (use the -d switch).

I hope that this helps you get started in seeing the value of using Configuration APIs to manage your Log Insight Servers!

Wednesday, May 18, 2016

Log Insight Configuration APIs

For those of you who have followed my blog you will know that I deal with Log Insight quite a bit in our production environments. Because of this I was excited that in the latest release of Log Insight 3.3 there are several new Configuration API's released under Tech Preview status. That said, the documentation around these APIs is very difficult to nail down. The exciting part is that I've just uploaded a new and unofficial standalone audit and remediation tool to my github repo! As always this code is my personal code and not supported or officially recognized by VMware.

Here's how it works: 
The tool reads the desired state of your Log Insight Server from a JSON file that you define. It can use that file to then connect to the Log Insight Server and audit it to see if it matches your desired state. If you wish you can throw in the -r switch and the script will make the Log Insight Server match your desired state.

Let's see it in action:
First up, let's pull up the embedded documentation by running the script with the -d switch to see what the JSON file needs to look like. I've taken pains to try and include complex examples so that you won't be left in the dark on anything.
  
After creating a new JSON file with our desired state it's time to run the tool in audit only mode by just specifying the -f flag and the name of our JSON file. The results that come back are that we have several areas that need remediation (email, event forwarders) and 1 (content packs) that cannot be remediated yet (hopefully in a later version).
 
That's all good but we want the tool to fix those issues so we append the -r flag
If you run the tool again the output comes back as all objects matching desired state but the nice thing is that you don't need to run it again. Once the remediation HTTP POST is sent to the server the tool will automatically go back and query the server for the configuration to verify that your changes have been implemented and the server is now set correctly. It will then show you success in the message immediately following the remediation step. 

The portions of Log Insight that the tool has the ability to configure are:
License Key
NTP Configuration
SMTP Configuration
Event Forwarder Configuration
Active Directory Configuration
RBAC Configuration
Content Packs (audit only right now)

Stay tuned as I plan on updating the tool over time as more APIs are released and as my python knowledge increases. In the meantime happy auditing and automatic remediation!


 

Tuesday, March 8, 2016

SELinux Hangs on Relabel Attempt - RHEL 7

I've recently run into an issue where during a recovery scenario it was necessary to relabel SELinux contexts on RHEL7.1. This is normally done by either placing an empty file at the root of the file system "touch /.autorelabel" or using the "fixfiles onboot" command; both of which I tried in this case. However in this case upon reboot the machine just hung at "Reached Target initrd Default Target" with no sign of even attempting to relabel the filesystem. Doing a bit of troubleshooting isolated the issue to SELinux as adding the "enforcing=0" parameter to GRUB allowed the machine to boot without issue. I tried quite a few different things including setting SELinux to "Permissive" in /etc/selinux/config and then back again, as well as a failed attempt to use "fixfiles restore /" and "restorecon -Rv /" which I'm assuming failed because SELinux was in permissive mode. Until today I've never seen a machine that won't respect the "touch /.autorelabel" nuclear option.

Ok, here's the fix that I found, odd as it is:
1. Modify GRUB to include "enforcing=0" to allow the OS to boot this first time without SELinux
2. Once inside the OS make sure that /etc/selinux/config is set to enforcing
3. Change the default runlevel from graphical to multi-user (think runlevel 3) with "systemctl set-default multi-user.target"
4. Reboot without modifying GRUB so that selinux is properly enabled on this boot

On the next reboot oddly enough the system recognized that a relabel had been ordered and proceeded as it should have the whole time. After another reboot and setting the default target back to graphical "systemctl set-default graphical.target" and another reboot as a sanity check it's working as expected again. Very odd problem and I must admit a very odd solution....