The Juniper KB describes the RMA steps for replacing a failed Juniper device, but some of those steps are not clear enough. I have added more configuration steps in this post for future reference:

A fair amount of preparation is needed before you can add the RMA device to your chassis cluster.

Step 1: Upgrade JunOS Remotely
The RMA device is usually delivered straight to the production site for the replacement, so you will have to upgrade JunOS remotely first.


login: root
root>
--- JUNOS 10.0R1.8 built 2009-11-03 10:06:39 UTC
root>

root> show version
Model: srx240-hm
JUNOS Software Release [10.0R1.8]

root> configure 
Entering configuration mode

[edit]
root# delete 
This will delete the entire configuration
Delete everything under this level? [yes,no] (no) yes

[edit]
root# set system root-authentication plain-text-password
New password:
Retype new password:

[edit]
root# commit and-quit
commit complete
Exiting configuration mode

root> set chassis cluster cluster-id 4 node 0 reboot 
Successfully enabled chassis cluster. Going to reboot now

Next, add some basic configuration on the fxp0.0 interface and a default static route. The SSH service also needs to be enabled.

root> show configuration 
## Last commit: 2016-11-29 03:37:32 UTC by root
version 10.0R1.8;
system {
    root-authentication {
        encrypted-password "$1$2eav5HPL$01SUB9SOzDJl007hXhNVj0"; ## SECRET-DATA
    }
    services {
        ssh;
    }
}
interfaces {
    fxp0 {
        unit 0 {
            family inet {
                address 10.9.1.11/24;
            }
        }
    }
}
routing-options {
    static {
        route 0.0.0.0/0 next-hop 10.9.1.1;
    }
}
{primary:node0}
root> request system software add /var/tmp/junos-srxsme-12.1X46-D55.3-domestic.tgz reboot
NOTICE: Validating configuration against junos-srxsme-12.1X46-D55.3-domestic.tgz.
NOTICE: Use the 'no-validate' option to skip this if desired.
Formatting alternate root (/dev/da0s2a)...
/dev/da0s2a: 298.0MB (610284 sectors) block size 16384, fragment size 2048
using 4 cylinder groups of 74.50MB, 4768 blks, 9600 inodes.
super-block backups (for fsck -b #) at:
32, 152608, 305184, 457760
** /dev/altroot
FILE SYSTEM CLEAN; SKIPPING CHECKS
clean, 150096 free (24 frags, 18759 blocks, 0.0% fragmentation)
Checking compatibility with configuration
Initializing...
Verified manifest signed by PackageProduction_10_0_0
Verified junos-10.0R1.8-domestic signed by PackageProduction_10_0_0
Using junos-12.1X46-D55.3-domestic from /altroot/cf/packages/install-tmp/junos-12.1X46-D55.3-domestic
Copying package ...
veriexec: cannot validate /cf/var/validate/chroot/junos/pkg/manifest.certs: unhandled critical extension: /C=US/ST=CA/L=Sunnyvale/O=Juniper Networks/OU=Juniper CA/CN=PackageProductionRSA_2016/[email protected]
chroot: /usr/bin/hwdb_xml_parser: Authentication error
Unable to regenerate Hardware Database, skipping hardware database checks at install time
chroot: tar: Authentication error
Validating against /config/juniper.conf.gz
cp: /cf/var/validate/chroot/var/etc/resolv.conf and /etc/resolv.conf are identical (not copied).
cp: /cf/var/validate/chroot/var/etc/hosts and /etc/hosts are identical (not copied).
chroot: /usr/sbin/mgd: Authentication error
Validation failed
WARNING: Current configuration not compatible with /altroot/cf/packages/install-tmp/junos-12.1X46-D55.3-domestic

{primary:node0}
root> request system software add /var/tmp/junos-srxsme-12.1X46-D55.3-domestic.tgz reboot no-validate
Formatting alternate root (/dev/da0s2a)...
/dev/da0s2a: 298.0MB (610284 sectors) block size 16384, fragment size 2048
using 4 cylinder groups of 74.50MB, 4768 blks, 9600 inodes.
super-block backups (for fsck -b #) at:
32, 152608, 305184, 457760
** /dev/altroot
FILE SYSTEM CLEAN; SKIPPING CHECKS
clean, 150096 free (24 frags, 18759 blocks, 0.0% fragmentation)
Installing package '/altroot/cf/packages/install-tmp/junos-12.1X46-D55.3-domestic' ...
verify-sig: cannot validate ./certs.pem
unhandled critical extension: /C=US/ST=CA/L=Sunnyvale/O=Juniper Networks/OU=Juniper CA/CN=PackageProductionRSA_2016/[email protected]

Installation failed for package '/altroot/cf/packages/install-tmp/junos-12.1X46-D55.3-domestic'

One reason the installation fails is that the device clock is set to a date earlier than the date on which the jloader was built, so the certificate for the file is not yet valid. Setting the clock to the current date fixes this:

root> set date 201611281600.00    
node0:
--------------------------------------------------------------------------
Mon Nov 28 16:00:00 UTC 2016
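
If you script the clock fix, the timestamp has to be in the same YYYYMMDDhhmm.ss form as the set date command above. Here is a minimal Python sketch, assuming you build the command string on a management host and send it to the device yourself. The helper name junos_set_date is made up for illustration.

```python
from datetime import datetime


def junos_set_date(moment):
    """Build the operational-mode 'set date' command for the given datetime."""
    # Same layout as the command used in this post: YYYYMMDDhhmm.ss
    return "set date " + moment.strftime("%Y%m%d%H%M.%S")


# The value used in this post.
print(junos_set_date(datetime(2016, 11, 28, 16, 0, 0)))
# prints: set date 201611281600.00
```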


Another reason is that you may have to upgrade to an intermediate version first before you can upgrade to a recent release. For example, upgrade from JunOS 10 to 12.1X44 first, and then upgrade to 12.1X46.
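
To keep track of the hops, you could order the release strings programmatically. The following is only a sketch: the function names are hypothetical, the ordering rule is a simplification that ignores the release type letter, and the list of intermediate releases is something you have to take from Juniper's release notes yourself.

```python
import re

# Matches release strings such as "10.0R1.8", "12.1X44" or "12.1X46-D55.3".
_RELEASE = re.compile(r"^(\d+)\.(\d+)([A-Z])(\d+)(?:-D(\d+))?(?:\.(\d+))?$")


def release_key(release):
    """Turn a release string into a tuple that sorts oldest to newest."""
    m = _RELEASE.match(release)
    if m is None:
        raise ValueError(f"unrecognised release string: {release!r}")
    major, minor, _kind, number, drop, build = m.groups()
    # Simplification: the type letter (R, X, ...) is not used for ordering.
    return (int(major), int(minor), int(number), int(drop or 0), int(build or 0))


def plan_upgrade(current, target, intermediates):
    """Return the releases to install, in order, ending with the target."""
    low, high = release_key(current), release_key(target)
    hops = [r for r in intermediates if low < release_key(r) < high]
    return sorted(hops, key=release_key) + [target]


# The path described in this post: 10.0 -> 12.1X44 -> 12.1X46
print(plan_upgrade("10.0R1.8", "12.1X46-D55.3", ["12.1X44"]))
# prints: ['12.1X44', '12.1X46-D55.3']
```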

Step 2: Follow the Juniper KB’s Instructions

Note: The KB does not cover the IDP signature database when the IDP feature is enabled on your system. You will have to deactivate security idp before going to step 6.

  [KB21134]

Perform the following procedure:

  1. Check the following parameters prior to deploying an RMA device in a Chassis Cluster environment:

    Make sure that the following parameters on the new RMA device are the same as the active node of the Chassis Cluster.

    • Check the hardware on the active cluster node and ensure that the device, which is being placed in the cluster, has the same hardware setup and all FPCs are present in the same slot and active. The command to check this is show chassis hardware.
    • Check the Junos version on the active node of the cluster and upgrade or downgrade Junos on the new device so that they match (for more information, refer to KB16652 – SRX Getting Started – Junos Software Installation/Upgrade).
    • Save the configuration in a file on the working node and upload the file to the new device in the /var/tmp directory.
    • Note: you can use a FAT-formatted USB key to transfer the file to the new SRX.
    • Command: mount -t msdos /dev/da0s1 /mnt
  2. Console to the isolated RMA device (make sure that no cables are connected, other than console cable) and perform the following procedure:    

    1. Get into the configuration mode.
    2. Execute the # delete command.
    3. Configure the root password:

      # set system root-authentication plain-text-password

    4. Then commit:

      # commit

  3. Configure Chassis Clustering on the isolated RMA device. Use the following command to enable the chassis cluster (you can execute the show chassis cluster status command on the working node to identify the cluster-id):

    > set chassis cluster cluster-id <id> node <No.>

     <No.> will be 1 or 0, depending on which node is being replaced.

  4. Reboot the new node. The node will come online with the cluster being enabled:
    > request system reboot
  5. Enter the configuration mode and load the configuration from the file, which was copied in the /var/tmp directory in step 1. Use the  following command to load the configuration:
    # load override /var/tmp/<filename>

    Note: if the IDP feature is enabled, you will have to deactivate it first with the command: deactivate security idp

  6. When the configuration is completely loaded, commit the configuration:

    # commit and-quit

  7. Halt the new node:

    > request system halt

  8. Now connect the fabric and control ports (make sure that none of the revenue port cables are connected) and reboot the node.
  9. Check the status of the FPC PIC by executing the show chassis fpc pic-status command. In the output, all of the FPCs and PICs should be online.
  10. When the new node comes online, it should join the cluster as the secondary. You can check the status by executing the show chassis cluster status command. In the output, the priority of RG0 should be the configured value and the priority of the other RGs should be 0 if interface monitoring has been configured.
  11. In the output that is generated in step 10, if the new node is shown as the primary, then contact Juniper support for assistance.
  12. If the output that is generated in step 10 shows the primary and secondary for all RGs, then connect all the revenue port cables and again check the chassis cluster status via the show chassis cluster status command. In this output, you should see the configured values for all of the RGs.
  13. If you can access the internet from the new node, then update the license on the new node or download the license and load it. If you are downloading the license on the PC, then save it in a file and upload it to the new node in the /var/tmp directory:

    > request system license update      (if you can access the internet from the new node)
    > request system license add /var/tmp/<filename>      (if adding the license from a file)
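
Step 1 of the procedure asks for the same Junos version on both nodes. A rough way to check this is to compare the show version output of the two devices. The sketch below only handles the output format shown earlier in this post (a "Model:" line and a "JUNOS Software Release [...]" line), which newer releases may print differently. The function names are hypothetical. Comparing the model string is my own extra check, and matching models do not prove that the FPC layout is the same, so show chassis hardware is still needed.

```python
import re


def parse_show_version(text):
    """Extract (model, release) from 'show version' output, or None if missing."""
    model = re.search(r"^Model:\s*(\S+)", text, re.MULTILINE)
    release = re.search(r"^JUNOS Software Release \[([^\]]+)\]", text, re.MULTILINE)
    return (
        model.group(1) if model else None,
        release.group(1) if release else None,
    )


def nodes_match(active_output, rma_output):
    """True only when both outputs parse fully and report the same model and release."""
    active = parse_show_version(active_output)
    rma = parse_show_version(rma_output)
    return None not in active and active == rma


# Output captured from the RMA device earlier in this post.
rma = "Model: srx240-hm\nJUNOS Software Release [10.0R1.8]\n"
print(parse_show_version(rma))
# prints: ('srx240-hm', '10.0R1.8')
```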

Step 3: Troubleshooting Issues

3.1 Nodes of a cluster go into Primary/Lost or Primary/Primary state
The control link and fabric link send packets but do not receive anything.
Changing the fabric ports on the SRX did not help, and neither did swapping the cable.

According to KB23929, the cause is the following:

“With codes prior to 10.4, by default, the control port tagging was enabled and it used the 4094 VLAN. For 10.4 and later codes, by default, it is disabled.

So, the upgrade/downgrade makes one node of the control port as tagged and the other node as untagged; so this causes control packets to drop, which in turn causes the Split Brain condition.”

SOLUTION:

To avoid the split brain condition, set both sides of the control link to either tagged or untagged (enable or disable) by using the following command via the CLI:

root> set chassis cluster control-link-vlan enable/disable
warning: A reboot is required for control-link-vlan to be disabled

{primary:node1}
test@fw1-2> request system reboot 
Reboot the system ? [yes,no] (no) yes

{primary:node1}
test@fw1-2> show chassis cluster information detail
node0:
--------------------------------------------------------------------------
Redundancy mode:
Configured mode: active-active
Operational mode: active-active
Cluster configuration:
Heartbeat interval: 1000 ms
Heartbeat threshold: 3
Control link recovery: Enabled
Fabric link down timeout: 66 sec
Node health information:
Local node health: Healthy
Remote node health: Healthy

Redundancy group: 0, Threshold: 255, Monitoring failures: none
Events:
Dec 7 13:57:43.435 : hold->secondary, reason: Hold timer expired
Dec 7 15:48:17.158 : secondary->primary, reason: Control & Fabric links down
Dec 7 15:48:34.749 : primary->secondary-hold, reason: Preempt/yield(10/100)
Dec 7 15:53:34.754 : secondary-hold->secondary, reason: Ready to become secondary
Dec 7 17:53:56.761 : secondary->primary, reason: Control & Fabric links down
Dec 7 17:53:59.428 : primary->secondary-hold, reason: Preempt/yield(10/100)
Dec 7 17:58:59.433 : secondary-hold->secondary, reason: Ready to become secondary

Redundancy group: 1, Threshold: 255, Monitoring failures: none
Events:
Dec 7 13:57:43.512 : hold->secondary, reason: Hold timer expired
Dec 7 15:48:17.134 : secondary->ineligible, reason: Fabric link down
Dec 7 15:48:17.863 : ineligible->primary, reason: Control & Fabric links down
Dec 7 15:48:34.753 : primary->secondary-hold, reason: Monitor failed: IF
Dec 7 15:48:35.762 : secondary-hold->secondary, reason: Ready to become secondary
Dec 7 15:51:00.571 : secondary->ineligible, reason: Fabric link down
Dec 7 17:53:41.929 : ineligible->secondary, reason: fabric link UP
Dec 7 17:53:56.830 : secondary->primary, reason: Control & Fabric links down
Dec 7 17:53:59.431 : primary->secondary-hold, reason: Monitor failed: CS
Dec 7 17:54:00.434 : secondary-hold->secondary, reason: Ready to become secondary
Control link statistics:
Control link 0:
Heartbeat packets sent: 19997
Heartbeat packets received: 19949
Heartbeat packet errors: 0
Duplicate heartbeat packets received: 0
Control recovery packet count: 0
Sequence number of last heartbeat packet sent: 20024
Sequence number of last heartbeat packet received: 20501
Fabric link statistics:
Child link 0
Probes sent: 11579
Probes received: 11575
Child link 1
Probes sent: 0
Probes received: 0
Switch fabric link statistics:
Probe state : DOWN
Probes sent: 0
Probes received: 0
Probe recv errors: 0
Probe send errors: 0
Probe recv dropped: 0
Sequence number of last probe sent: 0
Sequence number of last probe received: 0

Chassis cluster LED information:
Current LED color: Green
Last LED change reason: No failures
Control port tagging:
Disabled
............omitted......

node1:
--------------------------------------------------------------------------
Redundancy mode:
Configured mode: active-active
Operational mode: active-active
Cluster configuration:
Heartbeat interval: 1000 ms
Heartbeat threshold: 3
Control link recovery: Enabled
Fabric link down timeout: 66 sec
Node health information:
Local node health: Healthy
Remote node health: Healthy

Redundancy group: 0, Threshold: 255, Monitoring failures: none
Events:
Dec 7 13:49:59.220 : hold->secondary, reason: Hold timer expired
Dec 7 13:53:47.517 : secondary->primary, reason: Remote node reboot

Redundancy group: 1, Threshold: 255, Monitoring failures: none
Events:
Dec 7 13:49:59.267 : hold->secondary, reason: Hold timer expired
Dec 7 13:51:05.382 : secondary->primary, reason: Remote yield (100/0)
Control link statistics:
Control link 0:
Heartbeat packets sent: 20475
Heartbeat packets received: 20172
Heartbeat packet errors: 0
Duplicate heartbeat packets received: 0
Control recovery packet count: 0
Sequence number of last heartbeat packet sent: 20502
Sequence number of last heartbeat packet received: 20025
Fabric link statistics:
Child link 0
Probes sent: 11740
Probes received: 11585
Child link 1
Probes sent: 0
Probes received: 0
Switch fabric link statistics:
Probe state : DOWN
Probes sent: 0
Probes received: 0
Probe recv errors: 0
Probe send errors: 0
Probe recv dropped: 0
Sequence number of last probe sent: 0
Sequence number of last probe received: 0

Chassis cluster LED information:
Current LED color: Green
Last LED change reason: No failures
Control port tagging:
Disabled
............omitted......
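
Because the split brain comes from one node tagging the control port and the other not, the "Control port tagging:" value is the part of this output worth comparing between nodes. Below is a small Python sketch that pulls that value out per node. It assumes the layout shown above (a "nodeN:" header and the state on the line after "Control port tagging:"), and the function names are hypothetical.

```python
import re


def control_port_tagging(output):
    """Map each node name to the line that follows 'Control port tagging:'."""
    states = {}
    node = None
    expect_state = False
    for raw in output.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"(node\d+):", line)
        if header:
            node = header.group(1)
            expect_state = False
        elif line == "Control port tagging:":
            expect_state = True
        elif expect_state and line:
            if node is not None:
                states[node] = line
            expect_state = False
    return states


def tagging_consistent(states):
    """True when at least two nodes were found and all report the same state."""
    return len(states) >= 2 and len(set(states.values())) == 1


sample = """node0:
Control port tagging:
Disabled

node1:
Control port tagging:
Disabled
"""
print(control_port_tagging(sample))
# prints: {'node0': 'Disabled', 'node1': 'Disabled'}
```

If tagging_consistent returns False, the two nodes disagree on control port tagging, which is the condition KB23929 describes.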

By Jon
