Mason issues (12 June 2017)
Closed, Resolved · Public

Description

I noticed Mason had a significant clock skew (several hours). I connected via ssh and saw that ntpd wasn't running (systemctl status ntp.service reported Active: active (exited) since Don 2017-05-11 21:24:07 CEST; 1 months 0 days ago). Restarting the service fixed the clock.
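For reference, "active (exited)" means systemd still counts the unit as active even though its process is gone, so a plain "is it active?" check would not have caught this. A minimal sketch of a stricter check follows, assuming Python 3.7+ and that `systemctl show` is available on the host. The helper names (`parse_show_output`, `is_daemon_running`, `unit_state`) are made up for illustration.

```python
import subprocess


def parse_show_output(text):
    """Turn the KEY=VALUE lines printed by `systemctl show` into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that are not KEY=VALUE
            props[key.strip()] = value.strip()
    return props


def is_daemon_running(props):
    """True only if the unit is active AND its process is still running.

    A unit shown as "active (exited)" has ActiveState=active but
    SubState=exited, so it is rejected here.
    """
    return (props.get("ActiveState") == "active"
            and props.get("SubState") == "running")


def unit_state(unit="ntp.service"):
    """Query systemd for the two state properties of a unit."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=ActiveState,SubState"],
        capture_output=True, text=True, check=True,
    )
    return parse_show_output(result.stdout)
```

On the affected host, `is_daemon_running(unit_state("ntp.service"))` would presumably have returned False before the restart.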

Then I noticed other weird things:

  • Logs contained a few instances of Unit systemd-journald.service entered failed state
  • systemctl status reported State: degraded (not sure if the failed unit referred to journald, ntpd, or something else entirely)
  • Load was high, and top sometimes reported very high (99%) wa (I/O wait) CPU usage, even though iotop showed no significant I/O
  • dmesg showed a few kernel backtraces involving ext4 (old, but from this boot).
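One way to dig into high load and high wa with little visible I/O is to look for tasks stuck in uninterruptible sleep (state D), since on Linux those count toward the load average. The following is only a sketch of that idea, not what was run on Mason. The function names are hypothetical, and it assumes a Linux /proc filesystem.

```python
import os


def parse_stat_state(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # parentheses, so split on the first "(" and the last ")".
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    pid = int(stat_line[:open_idx].strip())
    comm = stat_line[open_idx + 1:close_idx]
    state = stat_line[close_idx + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return found  # no /proc here (not Linux, or not mounted)
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_state(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or unreadable entry
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

If the same few processes keep showing up in state D, the filesystem or storage layer they are blocked on is a likely suspect, which would be consistent with the ext4 backtraces in dmesg.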

So I rebooted it. It took several minutes to respond to ssh again.

However, load still seems high for the little activity it has, to the point of xinetd refusing connections (refused connect from ::ffff:217.13.79.74 due to excessive load). I saw load go up to 10.

I saw reports of problems pulling from git (tosky with scripty, acheronuk with the kubuntu CI), so I took Mason out of the anongit rotation.

I installed telegraf to report stats to overwatch (which also took a very long time to install). I haven't added graphs on Grafana yet.

nalvarez created this task. Jun 12 2017, 5:25 AM
Restricted Application added a subscriber: sysadmin. Jun 12 2017, 5:25 AM

Augh, Mason can't report metrics to Overwatch because it has an outgoing firewall blocking port 8086. I stopped and disabled telegraf.service for now.
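A quick way to confirm this kind of outbound block from the host itself is to attempt a TCP connection with a short timeout. The sketch below assumes nothing beyond the Python standard library. `can_connect` is a made-up helper, and the hostname in the usage note is a placeholder, not the real Overwatch address.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time.

    A firewall that silently drops packets shows up as a timeout, one
    that rejects them shows up as a refused connection. Both are
    subclasses of OSError, so both return False here.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_connect("overwatch.example.org", 8086)` returning False while other ports on the same host connect fine would point at the firewall rather than at telegraf.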

bcooksley added a subscriber: bcooksley.

I've emailed Stepping Stone support regarding this.

bcooksley closed this task as Resolved. Jun 13 2017, 8:51 AM
bcooksley claimed this task.

This appears to have been fixed as far as I can tell now.
I've restored Mason to rotation.