Juniper MX480 PSU Gripes

This has happened three times in a row to me. It seems that Juniper MX480 PSUs are not very graceful when they blow. I now think it’s unwise to ask anyone to try “re-seating the PSU and power cable” once a PSU has failed because this has a tendency to make the PDU trip, knocking out everything on that power strip. Ouch! Better to just RMA the thing.

Not so much of a problem when all kit in the rack has dual PSUs, but it’s worth bearing in mind in case there is any single-PSU kit in there that isn’t on a static transfer switch! Luckily, the environment I deal with doesn’t involve the latter. :)

Faulty PSU shows as:

PEM 0:
  State:     Present
  AC input:  Out of range (1 feed expected, 1 feed connected)
  Capacity:  2050 W (maximum 2050 W)
  DC output: 0 W (zone 0, 0 A at 0 V, 0% of capacity)

Replacing a failed Nexus 5K and some bugs.

Being given the task of replacing a failed Nexus 5596UP (no console output, powers up with fans but no lights except amber on the mgmt. module at the back), I quickly ran into some annoying problems trying to configure the FEX uplinks before actually racking it and plugging it in. I wanted to get as much config done beforehand as possible to minimize any interruptions – I was also a bit nervous as this unit was in the VPC primary role before it failed.

The 5Ks are running in Active/Active mode similar to this diagram:

[Diagram: active-active]

Fex definitions, port channels and other config went in fine, eg:

fex 102
  pinning max-links 1
  description "-=FEX-102=-"
  type N2248T

interface port-channel102
  description -= vPC 102 to FEX-102 / SW01 e1/1,e1/2 & SW02 e1/1,e1/2 =-
  switchport mode fex-fabric
  fex associate 102
  spanning-tree port-priority 128
  spanning-tree port type network
  vpc 102

But when trying to add the channel-group to the FEX ports (e1/1 and e1/2), it failed:

interface Ethernet1/1
  channel-group 102
command failed: port not compatible [port allowed VLAN list]

Removing the port channel let me add the channel-group to one port only. The second member then failed with the same error, because the port channel interface is re-created automatically as soon as the first member is added.

The only way to get around this was to remove the port channel and use a range command:

no interface port-channel102

interface Ethernet1/1-2
 channel-group 102

Then re-add the Port channel 102 config. Obviously you’re stuffed on this version if you split the uplinks across different ASICs.
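
Put together, the workaround looks something like the sequence below. This is only a sketch stitched together from the snippets above; lines starting with ! are comments, and the final show command is a suggested sanity check rather than part of the original fix.

! remove the existing port channel
no interface port-channel102

! add both members in one go; doing them one at a time hits the error
interface Ethernet1/1-2
  channel-group 102

! re-apply the port channel config removed in the first step
interface port-channel102
  description -= vPC 102 to FEX-102 / SW01 e1/1,e1/2 & SW02 e1/1,e1/2 =-
  switchport mode fex-fabric
  fex associate 102
  spanning-tree port-priority 128
  spanning-tree port type network
  vpc 102

! confirm both members are listed in the bundle
show port-channel summary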

This seems to be down to a buggy version of NX-OS: version 5.1(3)N1(1a), which is quite old and already has some other bug warnings against it! Not much choice in this case given the replacement has to run the same config as the peer.

Also, while the FEX is offline, its host ports can only be configured if the slot has been pre-provisioned.

Eg:

slot 102
 provision model N2K-C2248T
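
Once the slot is provisioned, the FEX host interfaces should accept config even though the FEX itself is not attached. A minimal sketch, assuming a hypothetical access port in VLAN 10 (the port number, description and VLAN are made up for illustration). The failed-config command is from memory of Cisco’s pre-provisioning docs, so check that it exists on your release.

slot 102
  provision model N2K-C2248T

interface Ethernet102/1/1
  description example host port (hypothetical)
  switchport access vlan 10

! review what was accepted for the pre-provisioned port
show running-config interface Ethernet102/1/1

! once the FEX is online, list any pre-provisioned commands that failed to apply
show provision failed-config 102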

Cisco links for replacement (some important steps such as role priority):
Replacement Documentation
Pre-provisioning

Full Replacement Procedure
We aren’t running config sync so role priority wasn’t a problem – I didn’t need to change the priority on the replacement to a higher value. If using config sync, I’d follow Cisco’s guidance. In summary, here are the steps I took (may be a little paranoid but resulted in no outage when the replacement was brought online):

Preparation:

1) Label all cables carefully then disconnect and unrack faulty unit
2) Rack replacement – do NOT connect any cables apart from console
3) Set a management address on the mgmt0 port (I used a different IP to be safe) and set a default route in the management vrf if your FTP server is in a different subnet. eg:

vrf context management
  ip route 0.0.0.0/0 10.243.0.254

4) Connect management cable
5) Upgrade/downgrade code to same as the peer – FTP copy example below with images in /nos-images on FTP server.

copy ftp://user:pass@10.0.0.1//nos-images/[kickstart-filename].bin bootflash:
copy ftp://user:pass@10.0.0.1//nos-images/[nxos-system-filename].bin bootflash:

install all kickstart bootflash:[kickstart-filename].bin system bootflash:[nxos-system-filename].bin

6) Shut down the mgmt0 port
7) Pre-provision FEXes, otherwise you won’t be able to paste the config for the FEX interfaces.
Be sure to specify the correct model: N2K-C2248T is correct for N2K-C2248TP-1GE.

slot 102
 provision model N2K-C2248T
slot 103
 provision model N2K-C2248T
…etc

Some config may look like it’s in a different order to the peer when pre-provisioned but it will sort itself out when the FEXes are online.
8) Apply the correct config via console (see note above regarding channel-group bug) – double check any long lines such as spanning-tree vlan config. Remove the additional management VRF default route from step 3 if it’s different to the config you put in.
9) Shut down all ports (including mgmt) – use range commands, e.g. int e1/1-48
10) Connect all cables to ports
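
For reference, steps 3, 6 and 9 come down to something like the following. Treat it as a sketch: the 10.243.0.50/24 address is a made-up temporary one, and the 1/1-48 range assumes the 48 fixed ports of a 5596UP with no expansion modules fitted.

! step 3: temporary management address (hypothetical IP)
interface mgmt0
  ip address 10.243.0.50/24
  no shutdown

! after step 5: confirm the running code matches the peer
show version

! step 6: close the management port again
interface mgmt0
  shutdown

! step 9: shut everything before cabling up
interface Ethernet1/1-48
  shutdown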

NB: Cisco say to change the vpc role priority to a higher value on the replacement but this was not necessary as we’re not using config sync. Also, the VPC peer role does not pre-empt, so if replacing primary, for example, the secondary will stay in “secondary, operational primary” state.
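
If you do follow Cisco’s guidance on role priority, the change is made under the vPC domain. A sketch, assuming a hypothetical domain ID of 10; a numerically higher value makes the switch less preferred as primary, and the default is 32667.

! domain ID 10 is a placeholder; use the one already configured on the peer
vpc domain 10
  role priority 65535

! check which role each peer ended up with
show vpc role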

Re-introduction:

Monitor console messages for anything strange from the Dist layer with term mon. You should see VPCs come up OK. I also set several pings running to hosts in the row that are connected to the now single-homed FEXes.
1) Open up mgmt port from console which will now have the correct IP. Check you can ping peer unit.
2) Open up all infrastructure links to core/dist layer and the peer unit
3) Open FEX links (I did one pair of links at a time to catch any cabling issues – paranoid!)
4) Test!
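
The re-introduction steps might look roughly like this on the console. The peer-link and uplink port numbers (1/45-48) are invented for the example; only the FEX 102 uplinks on e1/1-2 come from the config earlier in the post.

! show log messages on this session
terminal monitor

! step 1: management port
interface mgmt0
  no shutdown

! step 2: peer-link and uplinks (hypothetical port numbers)
interface Ethernet1/45-48
  no shutdown

! step 3: FEX uplinks, one pair at a time
interface Ethernet1/1-2
  no shutdown

! step 4: check the FEX comes online and the vPC is consistent
show fex
show vpc consistency-parameters vpc 102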

Useful commands:

show vpc role
show fex
show vpc consistency-parameters vpc xxx
show vpc peer-keepalive
show port-channel database int poXXX