Effective Interface Monitoring with iSPI for Metrics

While iSPI for Metrics is useful for graphing, arguably the more critical use for it is in generating alerts when things are not working as expected.

I found today that Juniper does not include “FCS Lan Errors” in the ifInErrors counters which made finding an issue take a bit longer than usual. This is what I’m talking about:

Interface: ge-1/1/0, Enabled, Link is Up
Encapsulation: Ethernet, Speed: 1000mbps
Traffic statistics:                                           Current delta
  Input bytes:            42120054608900 (6704304 bps)           [16457880]
  Output bytes:           33903229061248 (11640616 bps)          [38429868]
  Input packets:             60739200100 (3676 pps)                 [82119]
  Output packets:            51621757514 (5316 pps)                [125843]
Error statistics:
  Input errors:                        0                                [0]
  Input drops:                         0                                [0]
  Input framing errors:          1435325                            [12231]
  Policed discards:                    0                                [0]
  L3 incompletes:                      0                                [0]

For effective monitoring of critical WAN interfaces, you can create an interface group using for example, a filter by interface name (or ifAlias – description if you have a naming standard), which then is set up as a monitoring policy with the thresholds you want. Alternatively you can use a Node group based policy bearing in mind you’ll be alerted for all interfaces on those nodes in the group.

Example:
thresh

You want to enable Interface Fault Polling and Interface Performance Polling to make this work, and may also want to consider lowering the default 5 minute poll period for critical infrastructure.

I tend to set the trigger count at 1, and the rearm count higher (at least 2) so that the incident doesn’t just disappear before someone’s had a chance to have a look at it when the error condition clears.

As a side note, you’ll also want to set generic thresholds… Eg: For “Routers” at least the following:

router-thresh