Tue Oct 29 13:39:32 CET 2013

Systemd propaganda: It's a crap!

Today I was pointed at a posting by Lennart Poettering, triggered by a Debian mailing list discussion, about ... uhm ... I'm not quite sure yet.
So let me try to dissect the rather amusing post ...
 The kernel folks want userspace to have a single arbitrator component for cgroups, and on systemd systems that is now systemd, and you cannot isolate this bit out of systemd.
I'm not quite sure how to parse that, but maybe it means "we're lazy"
 I am pretty sure the Ubuntu guys don't even remotely understand the complexities of this.
Nice indirect insult, but ... cgroups are *not* complex. I'm quite sure I understood the general concept in the afternoon it took me to write the initial OpenRC support ...
 Control groups of course are at the center of what a modern server needs to do.
... what. No. A server needs to serve (and protect? no wait, that's the wrong motto). CGroups are just one more technology that most sysadmins will ignore. There's nothing magic to it, it's about as difficult to use as ulimit is.
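To illustrate - a minimal sketch, assuming the usual 2013-era cgroup v1 hierarchy mounted under /sys/fs/cgroup (the group name is made up):
mkdir /sys/fs/cgroup/memory/mygroup
echo $((64*1024*1024)) > /sys/fs/cgroup/memory/mygroup/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/mygroup/tasks
Three lines of shell and this shell (plus its children) is capped at 64MB of memory. That's the level of "complexity" we're talking about.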
 The API systemd exposes is very systemd-specific
Not sure if tautology is true, or lack of design is lacking. Maybe tautology is true?
 Of course, at that time a major part of the Linux ecosystem will already use the systemd APIs...
I remember this thingy ... embrace, extend ... profit?
Maybe it makes sense to create generic APIs and use them everywhere instead of domain-specific hackery that is unportable just because it is misdesigned.
 Basically the D-Bus daemon has been subsumed by systemd itself (and the kernel),
 and the logic cannot be reproduced without it. In fact, the logic even spills
 into the various daemons as they will start shipping systemd .busname and
 .service unit files rather than the old bus activation files if they want to be bus activatable.
So, uhm, where do I even start? The idea is to tie everything so tightly into systemd that alternatives will just be impossible, because of undocumented, unportable APIs that are leaky abstractions which make no sense without the thing they fail to abstract?
And why on earth does every daemon now need to be patched? They worked well for the last few decades without such specific hackery.
 Then, one of the most complex bits of the whole thing, which is the
 remarshalling service that translates old dbus1 messages to kdbus
 GVariant marshalling and back is a systemd socket
 service, so no chance to rip this out of systemd.
Ok, that's a non sequitur to start with; one does wonder why kdbus uses a different message format than dbus if it's supposed to be a replacement. And just because something is part of systemd and undocumented does not mean in any way that it cannot be cut out and made to work standalone, or reimplemented from scratch as a standalone tool. The ignorance (or incompetence?) in that statement reminds me of Windows users who denigrated Linux for not having a shiny GUI ...
 logind of course is one of the components of systemd
 where we explicitly documented that it is not a
 component you can rip out of systemd.
Just because you write a document doesn't make it true. But this is a great attempt to take over the discussion by claiming that it's alternativlos (German for "without alternative") - err, we want it and you shut up now - as we see in politics.
As logind is not separate software there's no design document or proper documentation of what it does, just an API that's tied to systemd so tightly that it's exquisitely painful to try to understand. The only thing stopping me from reimplementing that functionality is that I have no idea what it's supposed to *do* ... and I don't enjoy reading through all of systemd to find out. But from what I've seen so far there's nothing special in it.
 just a few months after Canonical did this, things are broken again, and this was to be expected: logind now uses the new cgroups userspace APIs
"When I break things they are broken" ... the only reason it doesn't work is ... no documentation and changes that break it. Circular reasoning declaring itself the truth.
 the Linux userspace plumbing layer is nowadays developed to a big part in the systemd source tree.
"We don't even pretend to understand the Unix philosophy anymore" - the systemd repository is more a junkyard than anything else, it should be properly split (submodules!) so that the independent components are independent. There's NO reason for udev, or hwdb, or any of the other dozen unrelated things to use the same repo.
It also tells me that there's a lack of oversight and management going on because things are just randomly smushed together for no logical reason.
 Note that logind, kdbus or the cgroup stuff is new technology,
 we didn't break anything by simply writing it. Hence we are not
 regressing, we are just adding new components that we
 believe are highly interesting to people
CGroups are "old"; only forcing everyone to use them in a specific way is new. It goes against the systemd design documents, which are now considered wrong after being The Only Truth for a few years. It gets hard to figure out what the soup of the day is if you keep changing the menu so often ...
And maybe if someone documented what logind is supposed to do (and not just an API dump that doesn't tell you anything - doxygen doesn't write documentation!) we could just properly implement it everywhere, so that people are not forced to break their machines just to do ... whatever logind does. Spiderpig, Spiderpig, does whatever Spiderpig does!
 For us having a simple design and a simple code base is a
 lot more important than trying to accommodate for distros
 that want to combine everything with everything else.
... simple? I do not think this word means what you think it means! (Not with 200k LoC for something that used to be done in 35k)
 I just hope you guys do it knowing what's at stake here.
Yes, that's why some of us are so antagonistic. Propaganda much?

So, to summarize: No one else can implement what systemd does, and thus you must use it. It is so brilliant that you shouldn't even try!
Just reading this propaganda is making me unhappy, with that level of dishonesty and misdirection I don't see how we can have a nice discussion. Most of the arguments are either circular ("No one can write logind without writing logind") or false ("Cgroups are at the center of what a server does").
At the same time everyone who disagrees is a luddite ... or illiterate ... or whatever. Anyway, YOUR UGLY, so I win discussion! or something something.

Oh well. Who would expect a rational discussion when you can just cause infinite flamewars, wasting all the time that could have been used for improving things ... but that would be boring.

Posted by Patrick | Permalink

Mon Oct 14 11:41:21 CEST 2013

Fixing Sloccount

+  "sh" => "sh", "bash" => "sh", "ebuild" => "sh",
-  "sh" => "sh", "bash" => "sh",
 
+ if ( ($command =~ m/^(bash|ksh|zsh|pdksh|sh|runscript)[0-9\.]*(\.exe)?$/i) ||
- if ( ($command =~ m/^(bash|ksh|zsh|pdksh|sh)[0-9\.]*(\.exe)?$/i) ||
These two tiny changes make sloccount recognize ebuilds and gentoo init scripts (which use /sbin/runscript as their interpreter).
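With that in place it's just a matter of pointing sloccount at a checkout (path assumed):
sloccount /usr/portage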

The results are quite nice on gentoo-x86:
Totals grouped by language (dominant language first):
sh:         1460091 (99.24%)
ansic:         3366 (0.23%)
perl:          2339 (0.16%)
python:        1800 (0.12%)
lisp:          1593 (0.11%)
java:           795 (0.05%)
awk:            529 (0.04%)
asm:            304 (0.02%)
haskell:        280 (0.02%)
ruby:            99 (0.01%)
php:             40 (0.00%)
sed:             21 (0.00%)
csh:              9 (0.00%)
objc:             5 (0.00%)
tcl:              2 (0.00%)

Total Physical Source Lines of Code (SLOC)                = 1,471,273
Development Effort Estimate, Person-Years (Person-Months) = 423.75 (5,084.98)
 (Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))
Schedule Estimate, Years (Months)                         = 5.34 (64.02)
 (Basic COCOMO model, Months = 2.5 * (person-months**0.38))
Estimated Average Number of Developers (Effort/Schedule)  = 79.43
Total Estimated Cost to Develop                           = $ 57,242,626
 (average salary = $56,286/year, overhead = 2.40).
SLOCCount, Copyright (C) 2001-2004 David A. Wheeler
SLOCCount is Open Source Software/Free Software, licensed under the GNU GPL.
SLOCCount comes with ABSOLUTELY NO WARRANTY, and you are welcome to
redistribute it under certain conditions as specified by the GNU GPL license;
see the documentation for details.
Please credit this data as "generated using David A. Wheeler's 'SLOCCount'."

Posted by Patrick | Permalink

Sat Oct 12 08:19:14 CEST 2013

TCC is self-hosting again!

The last few times I tried it tcc was mildly broken - e.g. the makefile calling gcc, and other silliness.
With the release of 0.26 that improved, and the current git head is even better. With a little bit of hacking I ended up with this:
tcc-0.9.26 $ time sh -c './configure --cc="tcc -I/usr/lib/tcc/include" ; make -j'
Binary  directory   /usr/local/bin
TinyCC directory    /usr/local/lib/tcc
Library directory   /usr/local/lib
Include directory   /usr/local/include
Manual directory    /usr/local/share/man
Info directory      /usr/local/share/info
Doc directory       /usr/local/share/doc/tcc
Target root prefix  
Source path      /var/tmp/tcc/tcc-0.9.26
C compiler       tcc -I/usr/lib/tcc/include
Target OS        Linux
CPU              x86-64
Big Endian       no
gprof enabled    no
cross compilers  no
use libgcc       no
Creating config.mak and config.h
config.h is unchanged
tcc -I/usr/lib/tcc/include -o tcc.o -c tcc.c -DCONFIG_LDDIR="\"lib64\"" -DTCC_TARGET_X86_64 -I.  -Wall -g -O2
tcc -I/usr/lib/tcc/include -o libtcc.o -c libtcc.c -DCONFIG_LDDIR="\"lib64\"" -DTCC_TARGET_X86_64 -I.  -Wall -g -O2
tcc -I/usr/lib/tcc/include -o tccpp.o -c tccpp.c -DCONFIG_LDDIR="\"lib64\"" -DTCC_TARGET_X86_64 -I.  -Wall -g -O2
tcc -I/usr/lib/tcc/include -o tccgen.o -c tccgen.c -DCONFIG_LDDIR="\"lib64\"" -DTCC_TARGET_X86_64 -I.  -Wall -g -O2
tcc -I/usr/lib/tcc/include -o tccelf.o -c tccelf.c -DCONFIG_LDDIR="\"lib64\"" -DTCC_TARGET_X86_64 -I.  -Wall -g -O2
tcc -I/usr/lib/tcc/include -o tccasm.o -c tccasm.c -DCONFIG_LDDIR="\"lib64\"" -DTCC_TARGET_X86_64 -I.  -Wall -g -O2
tcc -I/usr/lib/tcc/include -o tccrun.o -c tccrun.c -DCONFIG_LDDIR="\"lib64\"" -DTCC_TARGET_X86_64 -I.  -Wall -g -O2
tcc -I/usr/lib/tcc/include -o x86_64-gen.o -c x86_64-gen.c -DCONFIG_LDDIR="\"lib64\"" -DTCC_TARGET_X86_64 -I.  -Wall -g -O2
tcc -I/usr/lib/tcc/include -o i386-asm.o -c i386-asm.c -DCONFIG_LDDIR="\"lib64\"" -DTCC_TARGET_X86_64 -I.  -Wall -g -O2
make -C lib native
make[1]: Entering directory `/var/tmp/tcc/tcc-0.9.26/lib'
mkdir -p x86_64
tcc -I/usr/lib/tcc/include -c libtcc1.c -o x86_64/libtcc1.o -I..  -Wall -g -O2 -DTCC_TARGET_X86_64
tcc -I/usr/lib/tcc/include -c alloca86_64.S -o x86_64/alloca86_64.o -I..  -Wall -g -O2 -DTCC_TARGET_X86_64
ar rcs ../libtcc1.a x86_64/libtcc1.o x86_64/alloca86_64.o
make[1]: Leaving directory `/var/tmp/tcc/tcc-0.9.26/lib'
ar rcs libtcc.a libtcc.o tccpp.o tccgen.o tccelf.o tccasm.o tccrun.o x86_64-gen.o i386-asm.o
tcc -I/usr/lib/tcc/include -o tcc tcc.o libtcc.a -lm -ldl -I.  -Wall -g -O2  

real    0m0.220s
user    0m0.437s
sys     0m0.040s

So there, not even half a second to build a compiler ... this could be quite useful if tcc manages to compile other things too.
(Note: --cc="tcc -I/usr/lib/tcc/include" is a naughty hack as it includes the include path, otherwise it fails. I'm not yet sure about the best way to fix that) (edit: Figured it out - default path was set wrongly for amd64, fixed in current ebuilds)

Posted by Patrick | Permalink

Sat Oct 12 05:54:04 CEST 2013

Why sleep is not a synchronization primitive

Reading through code I've found some hilarious uses of sleep() over the years. Usually people wish to avoid some kind of race condition, so they just try to sit it out.

First problem: You never know how fast things move.
Example: doing a sleep() during shutdown to let the storage devices flush their caches
Little guesstimate:
A RAID controller has 256MB of cache that is used as write-cache.
Worst case that's ~64000 pending writes at 4kB each.
If they are completely independent random writes, each one takes around 10msec
(assuming slow SATA disks here)

Worst case the controller would need close to 10 minutes to write out all that data.
So on average sleeping 10 seconds is "safe" ... but how can you know?
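In shell arithmetic, for those following along at home:
echo $(( 256 * 1024 / 4 ))      # => 65536, call it ~64000 4kB writes
echo $(( 64000 * 10 / 1000 ))   # => 640 seconds, a bit over 10 minutes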
For a SATA disk itself the use of write-cache leads to similarly insane results: a 32MB on-disk write cache could take a minute to flush properly.

Solution: Use a "clean" blocking method like sync() (and pray that your hardware doesn't lie).
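Something like this at least blocks until the work is done instead of guessing at a pause (hdparm -F to flush the on-drive cache is an assumption - RAID controllers want their own tools):
sync                  # push all dirty pages out to the devices
hdparm -F /dev/sda    # ask the disk itself to flush its on-board write cache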

Second problem: Race conditions are timing-sensitive
The typical pattern: script A does a sleep(1) so that "the other script" has time to generate a file first. So, hmm, most of the time the sleep(1) guarantees that the other script has generated the file. Buuuut ... what if the storage is very busy and the other script needs 3 seconds to run now?
So that's going to work on your local test machine by accident, but on a faster or slower server it might deviate in "unexpected" ways. Better to check properly (stat, inotify, etc. - we have the proper tools for that). Never rely on wall-clock time for such things, because you have no control over the scheduling of tasks and the system load!
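A sketch with inotify-tools (file name and timeout made up):
while [ ! -e /tmp/output.dat ]; do
    inotifywait -qq -t 10 -e create /tmp || exit 1  # wake up when something is created, give up after 10s
done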

Third problem: Waiting doesn't fix problems
For example ... you have two threads that both acquire two locks, and sleep(1) before retrying whenever a lock is already taken. Both threads get woken up at the same time ... and sleep(1) ... and get woken at the same time again.
The underlying race condition in lock acquisition cannot be fixed with sleep()!

The real fix is to learn about locking protocols and have the threads release all locks when any lock acquisition fails (plus some other bookkeeping, but the important part for avoiding dead- and livelocks is that if you fail in any way you just "release everything" and start from the beginning)
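The same idea sketched in shell (bash plus util-linux flock assumed; lock files made up):
while :; do
    exec 8>/tmp/lock.a 9>/tmp/lock.b
    flock -n 8 && flock -n 9 && break   # got both locks - go ahead
    exec 8>&- 9>&-                      # any failure: release *everything* ...
    sleep 0.$(( RANDOM % 9 + 1 ))       # ... and back off a random amount, not a fixed second
done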

So, if you think that a sleep() is the proper thing to do in any code - please think about whether it actually makes sense. Out of the 7.5 seconds my KVM start-stop cycle took initially, 4 seconds were sleep() calls that did not do anything useful, and removing them just made things faster at no cost.

Posted by Patrick | Permalink

Fri Oct 11 06:21:24 CEST 2013

Timing boot, the hard way

Now that I have boot Seriously Fast it gets more difficult to see with the naked eye where delays happen. Using script/scriptreplay is quite nice for that as you can slow down the output - "script -ttiming", "scriptreplay -ttiming -d 0.1" and it scrolls along at 1/10th the original speed. Neat trick, but still too subjective.
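Spelled out:
script -ttiming                 # record the session, timing data goes to ./timing
scriptreplay -ttiming -d 0.1    # replay the typescript at a tenth of the original speed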

Bootchart / bootchartd is nice, but doesn't really scale down well. For sub-second boot times the output doesn't tell you that much. But, still, nice tool!

So, hrm, we get output from OpenRC ... there's /proc/uptime containing the time since boot ... let's staple things together with a few screws, a hammer and a skunk:
@@ -335,9 +337,23 @@ write_prefix(const char *buffer, size_t bytes, bool *prefixed)
                }
 
                if (!*prefixed) {
+                       /* read seconds-since-boot from /proc/uptime */
+                       int uptime_fd;
+                       char u_buf[1024];
+                       char *uptime = NULL;
+                       ssize_t u_len;
+                       uptime_fd = open("/proc/uptime", 0 /* O_RDONLY */);
+                       if (uptime_fd >= 0) {
+                               u_len = read(uptime_fd, u_buf, sizeof(u_buf) - 1);
+                               u_buf[u_len > 0 ? u_len : 0] = '\0'; /* terminate before strtok */
+                               close(uptime_fd);
+                               uptime = strtok(u_buf, " "); /* first field = uptime in seconds */
+                       }
+
                        ret += write(fd, ec, strlen(ec));
                        ret += write(fd, prefix, strlen(prefix));
                        ret += write(fd, ec_normal, strlen(ec_normal));
+                       if (uptime) ret += write(fd, uptime, strlen(uptime));
                        ret += write(fd, "|", 1);
                        *prefixed = true;
                }
Oh dear, I should really improve my C hackery. But, after all, this works for what I want it to do, and other people can make it nice and shiny. What it does, eh, takes uptime and ... let me show you:
udev            0.26| * Starting udev ...
udev            0.27| [ ok ]
udev            0.28| * Generating a rule to create a /dev/root symlink ...
udev            0.28| [ ok ]
udev            0.29| * Populating /dev with existing devices through uevents ...
udev            0.32| [ ok ]
udev            0.33| * Waiting for uevents to be processed ...
udev            0.34| [ ok ]
hostname        0.40| * Setting hostname to localhost ...
sysctl          0.40| * Configuring kernel parameters ...
loopback        0.40| * Bringing up network interface lo ...
hostname        0.41| [ ok ]
loopback        0.41|ip: either "to" is duplicate, or "scope" is garbage
sysctl          0.41| [ ok ]
loopback        0.42| [ ok ]
fsck            0.42| * Checking local filesystems  ...
fsck            0.42|/sbin/fsck.xfs: XFS file system.
fsck            0.43| [ ok ]
root            0.45| * Remounting root filesystem read/write ...
root            0.46| [ ok ]
So now we get timing output with 0.01sec granularity. That should be enough to figure out how long things take.

To measure something more relevant, this time I'm using a read-write filesystem - because of mount/fsck times it's XFS. And I'm starting syslog-ng, nginx and postgresql as a first approximation of a "working server". Of course they don't *do* anything, but it's a nice experiment.

Full boot log can be found here
With autologin in inittab (just for testing, y'know?) the full boot takes under 2 seconds, and halt about 400msec.

There's a naughty bug with busybox ip misbehaving (see the loopback error above), so networking is most likely not working - but I haven't figured out what exactly goes wrong there yet.

Posted by Patrick | Permalink

Wed Oct 9 09:33:21 CEST 2013

Faster Boot Again Now

[    0.338524] reboot: Power down
Now this is getting silly ...

This is faster mostly because net.lo doesn't start, and all non-essential init scripts have been removed. The overhead of actually parsing them (or just the slowdown because there are more items in a directory?) becomes noticeable at this scale.

So now what is actually still happening?
The kernel boots
Rootfs is mounted
Misc filesystems (proc sys dev run ...) get mounted
Udev/mdev is started

At this point you could log in, about 250msec after power-on. (Or about one second after kvm start - it takes a noticeable amount of time to initialize itself)

One could claim this is no longer useful, but this was mostly about *booting* fast, not doing anything else.

Posted by Patrick | Permalink

Wed Oct 9 08:32:52 CEST 2013

Booting Seriously Fast Now

Booting a kvm instance into a shutdown:
[    0.653860] reboot: Power down
So there. I broke the 2.5 second limit at last - by an insane margin. What happened?

I noticed that startup happened Seriously Fast now, but shutdown seemed to be eternally slooow. Very strange.
So I started reading through sysvinit to figure out what happens to stop/reboot. Interesting reading:
/usr/include/sys/reboot.h:
/* Stop system and switch power off if possible.  */
#define RB_POWER_OFF    0x4321fedc
Magic values handed to a syscall to tell the kernel "Off! Now!"
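(Side note: busybox can hand the kernel that value directly, skipping init and all the scripts - handy for experiments like this:)
poweroff -f     # busybox: issue reboot(RB_POWER_OFF) right here, right now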

Digging more through the sysvinit source I noticed some ... err ... wat ... eh?
src/halt.c ~line 266:

        if (do_sync) {
                sync();
                sleep(2);
        }
So, sometimes, like, we just, uhm, SLEEP FOR TWO SECONDS
(Remember that at this point we're taking 2.5 seconds for the whole boot cycle!)

And looking at what turns do_sync off (option -n, "don't sync before halt") we can just try adding that in /etc/inittab:
-l0s:0:wait:/sbin/halt -dhp
+l0s:0:wait:/sbin/halt -dhpn
And there it is, the under-a-second start-and-stop cycle.
And just for the record: this went from 7.5 seconds stock to 0.7 seconds by removing 3 instances of sleep() that caused 4 seconds of delay, and about 2 seconds of time "wasted" by bash being slow (instead of busybox sh)

Looks like no one did much profiling there in the last decade :\

Posted by Patrick | Permalink

Sun Oct 6 11:58:40 CEST 2013

Booting fast(er)

Today I played around with an idea that had been bugging me for a while: How fast can I boot (and halt) a VM in KVM?

So, for this experiment I'll pin my CPU at the lowest speed (it would be boring otherwise) - 1.4GHz.
The goal is to boot and halt a Gentoo/amd64 VM in kvm in a reasonably short time.

The root filesystem is on squashfs - first test with a 1GB ext4 filesystem vs. squashfs showed a 5 second difference thanks to fsck+mount. Blergh ;)
I just unpacked a stage3, configured a few things (want to login for debugging, so set a password, etc.) and built a squashfs like this:
mksquashfs stage3/* ./kvm-squashfs -comp lzo
The kernel I just stripped down to a minimum - use virtio drivers, prune all others, and iteratively throw away what doesn't look useful.
Setup is 15264 bytes (padded to 15360 bytes).
System is 3004 kB
CRC 7315b3d8
Kernel: arch/x86/boot/bzImage is ready  (#8)
It's still quite big ... and that's after a few iterations of pruning.

So, the first attempt clocked in at ~7.5sec:
qemu-kvm -nographic  -kernel kernel  -boot c -drive file=./kvm-squashfs,if=virtio -append "quiet root=/dev/vda console=ttyS0 init=/sbin/halt"
[    7.409846] reboot: Power down
That's not bad ... but not good either. So, obviously bash is a slow bloated thingy, let's use busybox's sh:
[    5.709225] reboot: Power down
Eeeek. That's surprising!
Another surprise: booting with "-smp 4" costs around 0.3 seconds in kernel time.
And using parallel boot ... is 0.2 seconds slower?! Eh.
The difference between mdev and udev is within the margin of error, with a very small bias towards mdev being faster.
Adding memory ( -m 1024 ) gives a little speedup of ~0.2 seconds.
Boot parameter "quiet" saves ~0.1 seconds.

There are lots of silly services ... so let's prune a bit.
# ls /etc/runlevels/*

etc/runlevels/boot:
bootmisc  hostname  localmount  mtab  net.lo  procfs  sysctl  tmpfiles.setup

etc/runlevels/default:
local  netmount

etc/runlevels/shutdown:
killprocs

etc/runlevels/sysinit:
dmesg  mdev  sysfs
Now lots of unneeded stuff is pruned away (keymaps? There's no keyboard, ya muppet! root? We can't remount / as rw because of squashfs. urandom? Can't save/restore the random seed anyway. etc. etc.) Result? We're now barely below 5 seconds:
[    4.966840] reboot: Power down
But I notice ... well ... killprocs at the end takes "an eternity". Guess what: it contains "sleep 1" ... twice ...

That horror just costs us time! So let's cut it out.
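The crude way, assuming you don't mind editing the init script in place:
sed -i '/sleep 1/d' /etc/init.d/killprocs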
[    2.798762] reboot: Power down
So there. Under three seconds for a whole start-stop cycle!

Posted by Patrick | Permalink