During a regular review of firewall memory and CPU usage, I found that some of our Check Point UTM-1 272 gateways running R77.10 were using a lot of memory, and SSH/SNMP access was sometimes slow. With the top command (press M to sort by %MEM, P to sort by %CPU), I was able to sort processes by memory/CPU usage and see which one was hogging the resources.
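If you want to do the same sorting offline, for example on output captured with `top -b -n 1` (batch mode, one iteration), a minimal Python sketch could look like this. It is an illustration only: the function name `top_by_mem` is hypothetical, it reads column positions from the header line (so the customized layout used in this post, with %MEM first, also works), and it assumes COMMAND is the last column. The sample rows are copied from the top output below.

```python
SAMPLE = """\
%MEM PID USER PR NI VIRT RES SHR S %CPU TIME+ COMMAND
6.9 6938 admin 19 0 122m 64m 3836 S 0.0 19:09.09 DAService
42.5 3952 admin 15 0 400m 397m 2332 S 0.3 119:05.63 monitord
5.0 4226 admin 15 0 263m 47m 11m S 0.0 59:13.25 cpd
"""


def top_by_mem(top_output, limit=5):
    """Return (command, pid, %MEM) tuples from top output, highest %MEM first.

    Column positions come from the header line (the one containing PID and
    COMMAND), so a customized field order still works. COMMAND is assumed
    to be the last column.
    """
    idx = None
    rows = []
    for line in top_output.splitlines():
        fields = line.split()
        if "PID" in fields and "COMMAND" in fields:
            idx = {name: i for i, name in enumerate(fields)}
            continue
        # Skip summary lines before the header, blank lines and malformed rows.
        if idx is None or len(fields) < len(idx) or not fields[idx["PID"]].isdigit():
            continue
        command = " ".join(fields[idx["COMMAND"]:])
        rows.append((command, int(fields[idx["PID"]]), float(fields[idx["%MEM"]])))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    for command, pid, mem in top_by_mem(SAMPLE):
        print(f"{mem:5.1f}%  {pid:>6}  {command}")
```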
The culprit turned out to be the monitord service. Monitord uses the device's sensors to monitor the hardware and saves the data into a local DB file (/var/log/db). Before R76, it kept one year of data in the DB; from R76 onwards, it keeps only 3 months of history to save device resources when processing the data. In my case, the DB file was more than 350 MB, which caused monitord to consume a lot of memory processing it. Although we are running R77.10, it seems that upgrading to R77.10 (rather than doing a fresh installation) does not reset the existing DB file.
There is a workaround provided in sk93587. Here are all the steps I recorded to fix this.
1. Before applying the workaround, monitord was using 42.5% of memory. This is the top output sorted by %CPU (the default):
top - 10:56:37 up 10 days, 1:08, 1 user, load average: 0.00, 0.06, 0.43
Tasks: 83 total, 3 running, 80 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.2%us, 1.1%sy, 0.0%ni, 97.3%id, 0.2%wa, 0.1%hi, 0.1%si, 0.0%st
Mem: 957272k total, 947392k used, 9880k free, 2772k buffers
Swap: 2096472k total, 43292k used, 2053180k free, 209280k cached
%MEM PID USER PR NI VIRT RES SHR S %CPU TIME+ COMMAND
5.0 4226 admin 15 0 263m 47m 11m S 0.4 59:12.98 cpd
0.1 2782 admin 15 0 2172 1084 836 R 0.2 0:00.05 top
0.8 3988 admin 15 0 24344 7956 5780 S 0.2 22:38.83 snmpd
1.4 3947 admin 16 0 33796 13m 7964 S 0.1 2947:10 confd
42.5 3952 admin 15 0 400m 397m 2332 S 0.1 119:05.53 monitord
0.1 3545 admin 18 0 1708 688 584 S 0.1 2:38.13 syslogd
0.1 1 admin 15 0 2040 580 548 S 0.0 0:01.47 init
0.0 2 admin RT -5 0 0 0 S 0.0 0:00.00 migration/0
0.0 3 admin 15 0 0 0 0 S 0.0 0:00.67 ksoftirqd/0
0.0 4 admin RT -5 0 0 0 S 0.0 0:00.00 watchdog/0
0.0 5 admin 10 -5 0 0 0 S 0.0 0:01.56 events/0

Next is the top output sorted by %MEM:
top - 10:58:15 up 10 days, 1:10, 1 user, load average: 0.00, 0.04, 0.38
Tasks: 83 total, 3 running, 80 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.3%us, 0.3%sy, 0.0%ni, 99.0%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 957272k total, 947972k used, 9300k free, 3036k buffers
Swap: 2096472k total, 43292k used, 2053180k free, 209708k cached
%MEM PID USER PR NI VIRT RES SHR S %CPU TIME+ COMMAND
42.5 3952 admin 15 0 400m 397m 2332 S 0.3 119:05.63 monitord
6.9 6938 admin 19 0 122m 64m 3836 S 0.0 19:09.09 DAService
5.0 4226 admin 15 0 263m 47m 11m S 0.0 59:13.25 cpd
2.0 4386 admin 15 0 284m 18m 10m S 0.0 1:23.18 fw_full
1.5 3948 admin 15 0 38032 13m 1704 S 0.0 70:42.63 searchd
1.4 3947 admin 15 0 33796 13m 7964 S 0.0 2947:10 confd
1.4 6779 admin 15 0 163m 13m 7252 S 0.0 0:03.49 rtmd
0.8 3988 admin 15 0 24344 7956 5780 S 0.0 22:39.07 snmpd
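As a sanity check on these figures: %MEM is the resident size (RES) as a share of physical memory, so 42.5% of 957272k works out to about 397 MiB, which matches monitord's RES of 397m. Also, the "used" figure on the Mem: line counts buffers and page cache (the page cache appears as "cached" on the Swap: line in this version of top), so the memory really held by processes is roughly used minus buffers minus cached: 735228k here, about 77% of RAM. The Python sketch below, with hypothetical helper names, just does that arithmetic using the numbers above.

```python
def res_mib_from_percent(total_kib, mem_percent):
    """Approximate resident size in MiB from top's %MEM and total RAM in KiB."""
    return total_kib * mem_percent / 100 / 1024


def process_used_kib(used_kib, buffers_kib, cached_kib):
    """'used' from the Mem: line minus buffers and page cache, in KiB."""
    return used_kib - buffers_kib - cached_kib


TOTAL_KIB = 957272  # "Mem: 957272k total" in the output above

print(round(res_mib_from_percent(TOTAL_KIB, 42.5)))  # 397, matching monitord's RES of 397m
print(process_used_kib(947972, 3036, 209708))         # 735228
```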
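2. Rebuild the monitord DB. Stop monitord first (tellpm process:monitord without the trailing t stops the process; with t, as at the end of this step, it starts it again):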
[Expert@CP-DMZ-1:0]# tellpm process:monitord
[Expert@CP-DMZ-1:0]#
Message from syslogd@ at Wed Aug 26 10:59:39 2015 …
CP-DMZ-1 monitord[3952]: monitord got killed
[Expert@CP-DMZ-1:0]# top    (output sorted by %MEM)
top - 11:00:09 up 10 days, 1:12, 1 user, load average: 0.00, 0.02, 0.33
Tasks: 82 total, 2 running, 80 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.3%us, 1.7%sy, 0.0%ni, 95.7%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 957272k total, 542928k used, 414344k free, 3620k buffers
Swap: 2096472k total, 42700k used, 2053772k free, 208824k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6938 admin 19 0 122m 64m 3836 S 0.0 6.9 19:09.09 DAService
4226 admin 15 0 263m 47m 11m S 1.0 5.0 59:13.62 cpd
4386 admin 15 0 284m 18m 10m S 0.0 2.0 1:23.18 fw_full
3948 admin 15 0 38032 13m 1704 S 0.0 1.5 70:42.63 searchd
3947 admin 15 0 33796 13m 7968 S 0.0 1.4 2947:10 confd
6779 admin 15 0 163m 13m 7252 S 0.0 1.4 0:03.49 rtmd
3930 admin 15 0 25300 7996 6340 S 0.0 0.8 0:00.41 pm
3988 admin 15 0 24344 7956 5780 S 0.3 0.8 22:39.35 snmpd
4339 admin 15 0 149m 7352 5748 S 0.0 0.8 0:00.51 cphamcset
4367 admin 15 0 32944 7224 6472 S 0.0 0.8 1:09.32 routed
4374 admin 16 0 33044 7168 6976 S 0.0 0.7 0:13.16 routed
3951 admin 18 0 99768 7024 6620 S 0.0 0.7 0:06.79 rconfd
3983 admin 17 0 25272 6816 6136 S 0.0 0.7 0:00.34 cloningd
2228 admin 15 0 21000 5972 3324 S 0.0 0.6 0:00.52 clish
4240 admin 15 0 150m 5732 5592 S 0.0 0.6 0:00.75 mpdaemon
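Stopping monitord alone accounts for the drop: "used" fell from 947972k to 542928k and "free" rose by exactly the same amount, 405044k (about 396 MiB), close to the 397m RES shown for monitord earlier. A small Python sketch that parses the two Mem: lines (in the procps top format shown above) and computes the difference; the helper name `parse_mem_line` is hypothetical.

```python
import re


def parse_mem_line(line):
    """Parse a top 'Mem:' line such as
    'Mem: 957272k total, 947972k used, 9300k free, 3036k buffers'
    into {'total': 957272, 'used': 947972, 'free': 9300, 'buffers': 3036} (KiB).
    """
    return {name: int(value) for value, name in re.findall(r"(\d+)k (\w+)", line)}


# Mem: lines before and after stopping monitord, copied from the output above.
before = parse_mem_line("Mem: 957272k total, 947972k used, 9300k free, 3036k buffers")
after = parse_mem_line("Mem: 957272k total, 542928k used, 414344k free, 3620k buffers")

freed = before["used"] - after["used"]
print(f"freed {freed}k (about {freed / 1024:.1f} MiB)")  # freed 405044k (about 395.6 MiB)
```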
[Expert@CP-DMZ-1:0]# cd /var/log
[Expert@CP-DMZ-1:0]# ls -l db
-rw-r--r-- 1 admin root 356237312 Aug 26 10:45 db
[Expert@CP-DMZ-1:0]# cp /var/log/db /var/log/db_ORIGINAL
[Expert@CP-DMZ-1:0]# sqlite3 /var/log/db
SQLite version 3.6.20
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> VACUUM;
sqlite> .exit
[Expert@CP-DMZ-1:0]# tellpm process:monitord t
[Expert@CP-DMZ-1:0]#
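Why does VACUUM help so much? In SQLite, deleting rows only moves their pages to a free list for later reuse; the file does not shrink until VACUUM rebuilds it (unless auto_vacuum is enabled). A plausible explanation for the 350 MB file, then, is that a lot of older data was deleted over time without the file ever being compacted; that is an inference from the result, not something sk93587 states. The sketch below reproduces the effect with Python's built-in sqlite3 module on a throwaway file; the `samples` table and its columns are hypothetical, not monitord's schema.

```python
import os
import sqlite3
import tempfile


def demo_vacuum(rows=20000):
    """Return DB file sizes in bytes: after insert, after DELETE, after VACUUM."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        # Autocommit mode, so the VACUUM below does not run inside a transaction.
        conn = sqlite3.connect(path, isolation_level=None)
        conn.execute("PRAGMA auto_vacuum = NONE")  # SQLite's usual default, made explicit
        conn.execute("CREATE TABLE samples (ts INTEGER, value TEXT)")  # hypothetical schema
        conn.execute("BEGIN")
        conn.executemany(
            "INSERT INTO samples VALUES (?, ?)",
            ((i, "x" * 100) for i in range(rows)),
        )
        conn.execute("COMMIT")
        size_full = os.path.getsize(path)

        # Delete the oldest 90% of rows, like pruning to a retention window.
        conn.execute("DELETE FROM samples WHERE ts < ?", (rows * 9 // 10,))
        size_deleted = os.path.getsize(path)  # freed pages stay inside the file

        conn.execute("VACUUM")  # rebuilds the file without the free pages
        size_vacuumed = os.path.getsize(path)
        conn.close()
    return size_full, size_deleted, size_vacuumed


if __name__ == "__main__":
    print(demo_vacuum())
```

The exact byte counts depend on the SQLite build and page size; the point is that the middle value does not drop while the last one does.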
3. Check memory usage after applying the workaround
The memory usage of monitord dropped to only 4.9% (RES 45m), down from the 42.5% (RES 397m) we saw in step 1. Note that monitord has a new PID (3088) because it was restarted.
top - 11:15:24 up 10 days, 1:27, 1 user, load average: 0.00, 0.05, 0.18
Tasks: 83 total, 2 running, 81 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.7%us, 0.3%sy, 0.0%ni, 98.3%id, 0.0%wa, 0.3%hi, 0.3%si, 0.0%st
Mem: 957272k total, 446428k used, 510844k free, 4808k buffers
Swap: 2096472k total, 42696k used, 2053776k free, 67228k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6938 admin 17 0 122m 64m 3836 S 0.0 6.9 19:09.09 DAService
4226 admin 15 0 263m 47m 11m S 0.0 5.0 59:16.10 cpd
3088 admin 15 0 49684 45m 2320 S 0.0 4.9 0:01.55 monitord
4386 admin 15 0 284m 18m 10m S 0.0 2.0 1:23.23 fw_full
3948 admin 15 0 38032 13m 1704 S 0.0 1.5 70:42.63 searchd
3947 admin 15 0 33796 13m 7968 S 0.0 1.4 2947:10 confd
6779 admin 15 0 163m 13m 7252 S 0.0 1.4 0:03.49 rtmd
3930 admin 16 0 25300 8012 6340 S 0.0 0.8 0:00.41 pm
3988 admin 15 0 24344 7956 5780 S 0.0 0.8 22:41.56 snmpd
4339 admin 15 0 149m 7352 5748 S 0.0 0.8 0:00.51 cphamcset
4367 admin 15 0 32944 7224 6472 S 0.0 0.8 1:09.33 routed
4374 admin 15 0 33044 7168 6976 S 0.0 0.7 0:13.19 routed
3951 admin 18 0 99768 7024 6620 S 0.0 0.7 0:06.79 rconfd
3983 admin 17 0 25272 6816 6136 S 0.0 0.7 0:00.34 cloningd
2228 admin 15 0 21000 5972 3324 S 0.0 0.6 0:00.52 clish
4240 admin 15 0 150m 5732 5592 S 0.0 0.6 0:00.75 mpdaemon
4787 admin 18 0 20936 5512 5508 S 0.0 0.6 0:00.28 cpviewd
4347 nobody 17 0 18748 5108 5104 S 0.0 0.5 0:00.21 ci_http_server
And the DB size was reduced from more than 350 MB to less than 40 MB:
[Expert@CP-DMZ-1:0]# ls -l db
-rw-r--r-- 1 admin root 37168128 Aug 26 11:32 db
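To gauge beforehand how much a VACUUM is likely to reclaim, SQLite reports the page size and the number of free pages through PRAGMA page_size and PRAGMA freelist_count; their product approximates the reclaimable space (VACUUM also repacks partially filled pages, so the real saving can be larger). A minimal Python sketch, assuming you run it against a copy of the file, such as the db_ORIGINAL backup made above, rather than the live DB; the local file name in the code is hypothetical. It also reproduces the reduction seen here: 356237312 to 37168128 bytes is roughly 89.6% smaller.

```python
import os
import sqlite3


def reclaimable_bytes(conn):
    """Estimate space VACUUM would free: pages on the free list times page size.

    Partially filled pages are also repacked by VACUUM, so the real saving
    can be larger than this estimate.
    """
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
    return page_size * free_pages


def percent_smaller(before, after):
    """Relative size reduction, in percent."""
    return 100 * (before - after) / before


if __name__ == "__main__":
    # Sizes reported by ls -l before and after the VACUUM above.
    print(f"{percent_smaller(356237312, 37168128):.1f}% smaller")  # 89.6% smaller

    path = "db_ORIGINAL"  # hypothetical local copy of /var/log/db
    if os.path.exists(path):
        conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)  # open read-only
        print(reclaimable_bytes(conn), "bytes on the free list")
        conn.close()
```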
If you want an even smaller memory footprint: since simply deleting /var/log/db apparently did not cause it to be re-created, I modified the procedure in sk93587 to empty the data tables before running VACUUM:
[Expert@HostName]# tellpm process:monitord
[Expert@HostName]# cp /var/log/db /var/log/db_archive_
[Expert@HostName]# sqlite3 /var/log/db
sqlite> DELETE FROM cpu_stat;
sqlite> DELETE FROM hwmonitor;
sqlite> DELETE FROM mem_stat;
sqlite> DELETE FROM net_stat;
sqlite> VACUUM;
sqlite> .exit
[Expert@HostName]# tellpm process:monitord t
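For testing on a lab copy, the same sequence (back up the file, empty the four data tables, VACUUM) can be scripted with Python's sqlite3 module. This is a sketch under assumptions, not part of sk93587: the table names come from the procedure above, but the single-column schema in the demo and the backup file name are hypothetical, Python may not be available on the gateway itself, and monitord must already be stopped (tellpm process:monitord) before anything touches the live /var/log/db.

```python
import os
import shutil
import sqlite3
import tempfile

# Data tables emptied by the modified procedure above.
TABLES = ("cpu_stat", "hwmonitor", "mem_stat", "net_stat")


def purge_and_vacuum(db_path, backup_path):
    """Copy the DB to backup_path, delete every row in TABLES, then VACUUM.

    monitord must be stopped first; this sketch does not check for that.
    """
    shutil.copy2(db_path, backup_path)  # equivalent of: cp /var/log/db <backup>
    conn = sqlite3.connect(db_path, isolation_level=None)  # autocommit mode
    try:
        for table in TABLES:
            # Names come from the fixed TABLES tuple, never from user input.
            conn.execute(f"DELETE FROM {table}")
        conn.execute("VACUUM")  # not allowed inside an open transaction
    finally:
        conn.close()


def demo_purge():
    """Run purge_and_vacuum on a throwaway DB with a hypothetical schema."""
    with tempfile.TemporaryDirectory() as tmp:
        db = os.path.join(tmp, "db")
        conn = sqlite3.connect(db)
        for table in TABLES:
            conn.execute(f"CREATE TABLE {table} (sample TEXT)")  # hypothetical column
            conn.executemany(f"INSERT INTO {table} VALUES (?)", [("x" * 200,)] * 2000)
        conn.commit()
        conn.close()

        before = os.path.getsize(db)
        purge_and_vacuum(db, db + "_archive")
        after = os.path.getsize(db)

        conn = sqlite3.connect(db)
        remaining = sum(
            conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] for t in TABLES
        )
        conn.close()
        return {"before": before, "after": after, "remaining": remaining,
                "backup": os.path.getsize(db + "_archive")}


if __name__ == "__main__":
    print(demo_purge())
```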