Fri Jan 24 07:00:53 CET 2014
NTP: Working around bad hardware since 1842
After my recent troubles with NTP and excessive time drift things have settled down.
For reasons unknown to me the time drift on the problem server changed from -330ppm to +2.3ppm. I'm not quite sure how to interpret that.
Comparing some other machines I have access to:
Old P4:        -292.238
Dell R510:       12.428
Another R510:    13.232
Random amd64:   -28.438
Another amd64:  -23.323
A newish Xeon:   -7.296

So the general trend seems to be older = more time drift. And machines with "same" hardware appear to have similar drift factors.
The stability of <30ppm means a drift of up to about 2.6sec/day. Without extra correction that's tolerable (a bit over a minute a month), but still unsatisfactory.
The old P4 drifting at 300ppm means it'll be getting close to 3 minutes a week away from "real time" - that's enough to cause problems if you rely on it.
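For anyone who wants to redo the arithmetic, here is a small sketch of the ppm-to-seconds conversion. It assumes the drift stays constant over the whole interval (which real hardware won't do), and the function name is made up for this example.

```python
SECONDS_PER_DAY = 86_400


def drift_seconds(ppm: float, interval_s: float) -> float:
    """Seconds gained (positive) or lost (negative) over interval_s,
    assuming a constant frequency error of `ppm` parts per million."""
    return ppm * 1e-6 * interval_s


# Drift values taken from the comparison above.
machines = [
    ("Old P4", -292.238),
    ("Dell R510", 12.428),
    ("A newish Xeon", -7.296),
]

for name, ppm in machines:
    per_day = drift_seconds(ppm, SECONDS_PER_DAY)
    per_week = drift_seconds(ppm, 7 * SECONDS_PER_DAY)
    print(f"{name}: {per_day:+.2f} s/day, {per_week:+.1f} s/week")
```

A round 300ppm works out to about 181 seconds per week, which is where the "close to 3 minutes a week" comes from.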
I think the lesson in this is "manufacturers use the cheapest they can get away with", so every computer should have a time correction mechanism (NTP, DCF-77, GPS - doesn't matter as long as you correct it). And there's a reasonable assumption that environmental factors (heat, hardware aging, change in the provided voltage, ...) will randomly change the time drift.
And I thought timekeeping was a problem solved two centuries ago ...
Fri Jan 17 04:24:40 CET 2014
Unexpected fun with NTP
This morning I had to fix an unexpected dovecot "failure" by restarting it. Apparently it only tolerates time jumps of less than seven seconds.
The trigger of this oopsie is NTP:
Jan 16 23:52:53 stupidserver ntpd[27668]: synchronized to 202.112.10.36, stratum 3
Jan 16 23:52:45 stupidserver ntpd[27668]: time reset -7.732856 s

Riiight. That's not nice, but why does it jump around so much? Looks like the time behaviour worsened over the last days:
Jan 15 19:34:18 stupidserver ntpd[27668]: no servers reachable
Jan 15 19:59:56 stupidserver ntpd[27668]: synchronized to 202.112.10.36, stratum 2
Jan 15 20:06:22 stupidserver ntpd[27668]: time reset +0.533773 s
...
Jan 16 11:47:33 stupidserver ntpd[27668]: synchronized to 202.112.10.36, stratum 2
Jan 16 11:47:30 stupidserver ntpd[27668]: time reset -2.966137 s
...
Jan 16 18:14:28 stupidserver ntpd[27668]: synchronized to 202.112.10.36, stratum 2
Jan 16 18:15:27 stupidserver ntpd[27668]: time reset -4.223295 s
...
Jan 16 23:52:53 stupidserver ntpd[27668]: synchronized to 202.112.10.36, stratum 3
Jan 16 23:52:45 stupidserver ntpd[27668]: time reset -7.732856 s

That's an offset of more than 1sec/h, and that's with ntpd correcting at around 330 PPM. The docs say: "The capture range of the loop is 500 PPM at an interval of 64s decreasing by a factor of two for each doubling of interval." (PPM = parts-per-million)
In other words, if the drift is above 500 PPM ntpd may force a clock reset because it can't slew the clock fast enough to keep up. And it looks like this situation was caused by either a failing mainboard RTC or a screwed up ntp server (since it always synced to the same one).
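To put numbers on that quote, here is a minimal sketch of my reading of it: 500 PPM at a 64s interval, halved for every doubling. This is only the sentence from the docs turned into arithmetic, not how ntpd actually implements its loop, and the function names are hypothetical.

```python
def capture_range_ppm(poll_interval_s: float) -> float:
    """Capture range per the quoted docs: 500 PPM at 64 s, halved for
    each doubling of the interval. Only meant for intervals >= 64 s;
    the quote says nothing about shorter ones."""
    if poll_interval_s < 64:
        raise ValueError("the quoted rule starts at a 64 s interval")
    return 500.0 * 64.0 / poll_interval_s


def within_capture_range(drift_ppm: float, poll_interval_s: float) -> bool:
    """True if a clock drifting by drift_ppm could still be followed
    without a step, according to this simplified rule."""
    return abs(drift_ppm) <= capture_range_ppm(poll_interval_s)


for interval in (64, 128, 256, 512, 1024):
    print(f"{interval:5d} s: {capture_range_ppm(interval):7.2f} PPM")
```

By this rule a clock drifting at around 330 PPM is fine at a 64s interval but already out of range at 128s (250 PPM), which would fit the repeated time resets.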
I've tried two things to avoid this time skipping:
1) Change the ntp servers used to something more "local" - the global pool.ntp.org may not be as reliable as servers geographically close to you
2) Remove the drift file to force the system to re-learn
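For step 2 it can be handy to note the old value before throwing it away. The drift file ntpd writes holds the learned frequency offset in PPM as a plain number. A rough sketch follows. The helper names are made up, the file location depends on the `driftfile` setting in your ntp.conf, and ntpd should presumably be stopped first so it doesn't just write the file back.

```python
from pathlib import Path
from typing import Optional


def read_drift_ppm(drift_file: Path) -> float:
    """Return the frequency offset (PPM) stored in an ntpd drift file.
    Only the first whitespace-separated token is parsed."""
    return float(drift_file.read_text().split()[0])


def forget_drift(drift_file: Path) -> Optional[float]:
    """Delete the drift file so ntpd has to re-learn the drift.
    Returns the old value, or None if there was no file."""
    if not drift_file.exists():
        return None
    old_ppm = read_drift_ppm(drift_file)
    drift_file.unlink()
    return old_ppm


# Example call; the path is an assumption, check your ntp.conf:
#   print(forget_drift(Path("/var/lib/ntp/ntp.drift")))
```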
The results, at first glance, look promising:
Jan 17 10:48:37 stupidserver ntpd[3059]: kernel time sync status 0040
Jan 17 10:52:55 stupidserver ntpd[3059]: synchronized to 202.120.2.101, stratum 3
Jan 17 10:52:50 stupidserver ntpd[3059]: time reset -5.023639 s
Jan 17 10:57:54 stupidserver ntpd[3059]: synchronized to 202.120.2.101, stratum 3
Jan 17 11:01:08 stupidserver ntpd[3059]: synchronized to 202.73.36.32, stratum 1
Jan 17 11:05:34 stupidserver ntpd[3059]: kernel time sync enabled 0001

So after an initial 5-second skip it managed to sync twice without abnormal drift. Let's hope that it's going to stay sane ...
Thu Jan 16 07:36:03 CET 2014
EAPI usage in tree
Total number of ebuilds: 37807
EAPI 0:  5959  15.78%
EAPI 1:   370   0.98%
EAPI 2:  3335   8.82%
EAPI 3:  3005   7.95%
EAPI 4: 12385  32.76%
EAPI 5: 12746  33.72%

That looks quite good: EAPI5 has grown very well, EAPI1 is almost gone.
EAPI0 is still needlessly common, and EAPI 2+3 should be deprecated.
Update: Now running as a cronjerb, Output here, History here
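A rough sketch of how such a count could be produced. This is not the script behind the numbers above, just one way to do it. It assumes the usual category/package/*.ebuild layout, that EAPI is set on a line of its own starting with `EAPI=`, and that an ebuild without such a line counts as EAPI 0. All function names are made up.

```python
import re
from collections import Counter
from pathlib import Path

# Matches EAPI=5, EAPI="5" or EAPI='5' at the start of a line.
EAPI_RE = re.compile(r'^EAPI=["\']?([^"\'\s#]+)', re.MULTILINE)


def eapi_of(ebuild_text: str) -> str:
    """Return the declared EAPI, or "0" if the ebuild sets none."""
    match = EAPI_RE.search(ebuild_text)
    return match.group(1) if match else "0"


def count_eapis(tree: Path) -> Counter:
    """Count ebuilds per EAPI below a category/package/*.ebuild tree."""
    counts = Counter()
    for ebuild in tree.glob("*/*/*.ebuild"):
        counts[eapi_of(ebuild.read_text(errors="replace"))] += 1
    return counts


def report(counts: Counter) -> str:
    """Format the counts as a total line plus one line per EAPI."""
    total = sum(counts.values())
    lines = [f"Total number of ebuilds: {total}"]
    for eapi in sorted(counts):
        share = 100 * counts[eapi] / total
        lines.append(f"EAPI {eapi}: {counts[eapi]} {share:.2f}%")
    return "\n".join(lines)


# Example call; the tree location is an assumption:
#   print(report(count_eapis(Path("/usr/portage"))))
```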