Avoid Unwanted Problems: 2012

Wednesday, 21 March 2012

Recipe: Monitor Windows Task Scheduler status with Nagios

Install and configure the NRPE plugin (check_nrpe) on Nagios and the NSCA/NRPE agent on the Windows server (I leave this as an exercise for the reader, given the abundance of documentation elsewhere).

On the Windows machine, prepare a .cmd batch file that roughly resembles this:

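@echo off
rem check_tasks.cmd: report the state of a Windows scheduled task to Nagios.
if "%~1"=="" goto usage
if "%~2"=="" goto usage
set "TF=%TEMP%\check_tasks.tmp"
schtasks /query /v /FO CSV > "%TF%"
if "%~1"=="running" goto running
if "%~1"=="status" goto status
goto usage

:running
rem Default to CRITICAL; overwritten only if the task is found running.
rem "In esecuzione" is the Status text on an Italian Windows ("Running" on English systems).
rem Token positions (2,5) depend on the Windows version and locale; check your CSV output.
set R="%~2 not running!"
set RC=2
for /F "usebackq delims=, tokens=2,5" %%a in ("%TF%") do if %%a-%%b=="%~2"-"In esecuzione" (
  set R="%%a %%b"
  set RC=0
  goto result
)
goto result

:status
rem Default to CRITICAL; "--" means the task was not found.
rem The trailing - keeps the comparison valid when a field is empty.
set R=--
set RC=2
for /F "usebackq skip=1 delims=, tokens=2,9" %%a in ("%TF%") do if %%a-=="%~2"- (
  set R=%%b
  if %%b-=="0"- set RC=0
  goto result
)
goto result

:result
rem Strip the CSV quotes before printing the plugin output line.
echo %R:"=%
goto end

:usage
echo Lists running or failed tasks
echo Usage: "%0 running TASKNAME -- check if TASKNAME is running (CRITICAL if not)"
echo Usage: "%0 status TASKNAME  -- show the last result code of TASKNAME (CRITICAL if not zero)"
set RC=3

:end
exit /B %RC%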
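
The batch file above splits each CSV row on commas, so its token positions shift whenever a quoted field (a date, for instance) contains a comma. As a rough illustration only, and not part of the original recipe, the same "status" logic can be sketched in Python with a real CSV parser. The column headers "TaskName" and "Last Result" are assumptions (they are the English names and are localized on other systems), and the sample rows are made up.

```python
import csv
import io

# Nagios plugin exit codes used by the recipe.
OK, CRITICAL = 0, 2


def check_status(csv_text, task_name, name_col="TaskName", result_col="Last Result"):
    """Return (exit_code, message) for the last result of the named task.

    name_col and result_col are assumed header names; adjust them to
    whatever `schtasks /query /v /FO CSV` prints on your system.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if row.get(name_col) == task_name:
            result = row.get(result_col, "")
            return (OK if result == "0" else CRITICAL), result
    # Task not found: same "--" / CRITICAL convention as the batch file.
    return CRITICAL, "--"


# Made-up sample; note the comma inside the quoted date field.
sample = (
    '"HostName","TaskName","Next Run Time","Status","Last Result"\n'
    '"SRV1","Backup","10:00:00, 3/22/2012","Ready","0"\n'
    '"SRV1","Export","11:00:00, 3/22/2012","Ready","1"\n'
)

print(check_status(sample, "Backup"))  # (0, '0')
print(check_status(sample, "Export"))  # (2, '1')
```

Looking columns up by header name rather than by position is the design point here: the check keeps working when the column order or the date format changes.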

Configure the checks in NSC.ini (we use NRPE):

check_schtasks_status=check_tasks.cmd status $ARG1$
check_schtasks_running=check_tasks.cmd running $ARG1$

In Nagios, configure your standard NRPE/NSCA check:

define command {
  command_name    check_nrpe_generic
  command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -n -u -c $ARG1$ -a $ARG2$
}

define service {
  host_name           yourhost
  service_description Yourtask status
  check_command       check_nrpe_generic!check_schtasks_status!YourScheduleName
}

define service {
  host_name           yourhost
  service_description Yourtask is running
  check_command       check_nrpe_generic!check_schtasks_running!YourScheduleName
}
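
To see what Nagios will actually run for these services, it can help to expand the macros by hand. The snippet below is only a simplified sketch of that substitution (real Nagios macro processing does more); the $USER1$ path and the host address are placeholder assumptions.

```python
def expand_check_command(check_command, command_line, macros):
    """Expand a check_command string into the command line it stands for.

    Simplified model: fields after the command name, separated by "!",
    become $ARG1$, $ARG2$, ...; other macros come from the `macros` dict.
    """
    _name, *args = check_command.split("!")
    expanded = command_line
    for i, arg in enumerate(args, start=1):
        expanded = expanded.replace(f"$ARG{i}$", arg)
    for macro, value in macros.items():
        expanded = expanded.replace(f"${macro}$", value)
    return expanded


line = expand_check_command(
    "check_nrpe_generic!check_schtasks_status!YourScheduleName",
    "$USER1$/check_nrpe -H $HOSTADDRESS$ -n -u -c $ARG1$ -a $ARG2$",
    # Placeholder values, not taken from the original setup.
    {"USER1": "/usr/local/nagios/libexec", "HOSTADDRESS": "192.0.2.10"},
)
print(line)
# /usr/local/nagios/libexec/check_nrpe -H 192.0.2.10 -n -u -c check_schtasks_status -a YourScheduleName
```

The task name travels as the single argument after -a, which is what the agent substitutes for $ARG1$ in NSC.ini.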

Monday, 19 March 2012

Nagios + HelpDesk + Twitter = Synergy

This is a story about how a tech team can exploit social media to coordinate its efforts and communicate them to customers.

What we do
We are a 6-person team supporting and developing the IT infrastructure ("all that lies behind" is our motto, meaning behind the application level) for a publishing and retail group in Italy. Every company in the group has its own IT department, which takes care of the application layer and all business-related matters; we provide tools and services for smooth IT operations. In your language: design and operation of network, servers and services. This covers administration of the Windows domain, AIX, iSeries, Linux and SAP; deployment and provisioning of virtual servers (VMware ESXi) and VDI (VMware View 4.5), SAN (on Hitachi VSP), NAS (on EMC Celerra), and network services on top of a vendor MPLS WAN (mostly on Cisco hardware; the technologies employed are WAAS, WLC/WCS, ACS, ASA/AnyConnect). Of course we need to monitor all of it efficiently, to react fast, to have a clear picture of what is going on and to show that picture to the customers.

How we do it
The basic flow can be laid out as follows: Nagios checks things and alerts technicians with an email notification, which is pretty basic Nagios stuff. The magic tweaks start here: the email notification contains a URL that leads directly to the Nagios page of the service or host. The Nagios CGI is served over SSL, so the technician's browser presents her SSL certificate automatically and she doesn't have to type any credentials. When an alarm is acknowledged, a specially crafted email is sent to the helpdesk system, which parses it, creates a support request and assigns it to the technician who ACKed the alarm. Another copy of the ACK, including the technician's comments, goes to a Twitter account our customers can subscribe to, so that they know there is a problem, that someone is taking care of it (and who), and what is being done to solve it.
New tickets (opened via a Nagios ACK or by any other means) are tweeted to another account, followed only by members of the IT support team, and include a (shortened) link to the request on the helpdesk system.
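
Two pieces of this flow are easy to picture in code: the link placed in the notification email and the text of the ACK tweet. What follows is a hedged sketch, not our actual scripts. The base URL, the function names and the tweet wording are all hypothetical; the only thing taken from Nagios itself is the extinfo.cgi page, where type=2 selects a service.

```python
from urllib.parse import urlencode


def nagios_service_url(base_url, host, service):
    """Build a link to the Nagios extended-info page of one service."""
    query = urlencode({"type": 2, "host": host, "service": service})
    return f"{base_url}/cgi-bin/extinfo.cgi?{query}"


def ack_tweet(host, service, technician, comment, limit=140):
    """Format an acknowledgement as one tweet, truncated to `limit` characters.

    The message layout is an illustrative assumption.
    """
    text = f"ACK {host}/{service} by {technician}: {comment}"
    return text if len(text) <= limit else text[: limit - 3] + "..."


# Hypothetical base URL and names, for illustration only.
print(nagios_service_url("https://nagios.example.com/nagios", "yourhost", "Yourtask status"))
# https://nagios.example.com/nagios/cgi-bin/extinfo.cgi?type=2&host=yourhost&service=Yourtask+status
print(ack_tweet("yourhost", "Yourtask status", "anna", "restarting the job"))
# ACK yourhost/Yourtask status by anna: restarting the job
```

Truncating inside the formatter keeps long technician comments from being rejected by the 140-character limit.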

Turnaround
It's too soon to assess the benefits of this system; at the moment it is being heavily used, works flawlessly, and has increased the fill rate of the most undisciplined member of the team (yours truly, by the way).
I will keep this post updated.