Sat Oct 22 18:11:25 CEST 2011
Booting Gentoo - from init to console
I've spent some time with OpenRC and sysvinit trying to understand a few things (for example how to integrate
CGroups support), and along the way I've learned a few things about the boot process that are not that well
documented.
So why not document it for posterity ...
In the beginning, through some magic, the kernel is booted. How that happens is another story and not our concern at the moment. At some point the kernel has initialized, figured out where the rootfs is (for example through the root= kernel parameter), mounted it ... and now what?
At this point we need to start userspace process #1, traditionally known as "init". The kernel has a hardcoded default and will normally try /sbin/init, but it's easy enough to override that with the init= kernel parameter if you want to have something else run (like /bin/bash to get just a rescue shell).
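For illustration, the relevant bits tacked onto the kernel line in the bootloader could look like this (the root device here is just an example):

root=/dev/sda2 init=/bin/bash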
init comes from the sysvinit package and is small enough to be read in an afternoon. There are some surprisingly elegant bits in it, but it's still just doing one job well - starting the rest of the userland processes. It takes its instructions from /etc/inittab, which just lists the different runlevels and what to start. In the case of Gentoo this is a bit unusual as it mostly just calls "rc", which is part of the OpenRC package, with the runlevel name as parameter (for example "rc single"). The runlevel is "default" by default. We have sane defaults!
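The relevant part of a stock Gentoo /etc/inittab looks roughly like this (quoted from memory, so treat it as illustrative rather than authoritative):

si::sysinit:/sbin/rc sysinit
rc::bootwait:/sbin/rc boot
l3:3:wait:/sbin/rc default
l6:6:wait:/sbin/rc reboot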
Init is also triggered to change runlevels; this is usually done through the "init" or "telinit" commands.
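With an inittab like the excerpt above, switching runlevels from a root shell is just:

telinit 3    # go (back) to runlevel 3, i.e. "rc default"
init 6       # runlevel 6, i.e. "rc reboot"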
Now OpenRC needs to figure out what to start. Before the "default" runlevel is started we need to start the "sysinit" and "boot" runlevels (for details have a look at /etc/runlevels). This starts a few things like udev and mounts local filesystems, then starts all the daemons you requested. The bookkeeping for that (what has started, is starting, has failed etc.) can be found in /lib/rc/init.d/ - just another simple directory with self-explanatory filenames. Running "rc" without arguments will just try to get us back to the current runlevel defaults - start what is stopped, and stop everything not defined in /etc/runlevels. Running "rc" from cron is a nice way to keep things like sshd running even through accidents like "killall sshd" :)
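As a sketch, a system crontab entry for that trick could look like this (the interval is arbitrary):

*/10 * * * *	root	/sbin/rc >/dev/null 2>&1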
How OpenRC figures out the dependencies is quite "magic", but if you trace it you find runscript (an executable) running runscript.sh with a parameter like "depend", which sources the init scripts and just outputs the value of the DEPEND line. (Read /lib/rc/sh/runscript.sh to get an idea, or if you get bored read the source of runscript). And that information is cached in /lib/rc/init.d/deptree to avoid having to re-source the init scripts as this is a "slow" process (maybe 50msec per init script, but if you have 100 scripts that's still 5 seconds you lose just parsing the init scripts instead of starting stuff)
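For reference, the dependencies themselves normally live in a depend() block inside the init script - here's a hypothetical one (the service names are made up):

depend() {
	need net localmount
	use logger
	after sshd
}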
So OpenRC starts all the things from /etc/runlevels and is now done; it returns control to init, which now notices that it has a few lines like this in its config (/etc/inittab):
# TERMINALS
c1:12345:respawn:/sbin/agetty -c 38400 tty1 linux

So what it does now is very simple - it runs agetty, which configures the (pseudo-)terminals (tty1 here) and starts a login program (in this case /bin/login, the default). This asks us for username and password (another interesting story for a different time), and when this is done runs the login shell specified for that user.
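If you want to watch that chain on a live system, plain ps can show what is currently attached to a terminal (tty1 here, pick yours):

ps -o pid,ppid,comm -t tty1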
And here we are, booted up and ready to serve our human overlords ;)
Wed Oct 19 15:11:02 CEST 2011
OpenRC, agetty and terminal blanking
Some people might have noticed a naughty interaction between one of the recent sys-apps/util-linux updates and OpenRC.
The symptoms are described in bug 381401 - what you usually notice is that after boot the console blanks / resets and all you see is the boring login prompt.
The cause is a small change in agetty defaults, the manpage now mentions:
-c, --noreset
       Don't reset terminal cflags (control modes). See termios(3) for more details.

So, if you are bothered by this change you need to edit /etc/inittab, which defines where and how the agetty processes are started. Change:
c1:12345:respawn:/sbin/agetty 38400 tty1 linux

to

c1:12345:respawn:/sbin/agetty -c 38400 tty1 linux

And from now on you should have the old behaviour back on reboot. And maybe we can convince Vapier that this would be a sane default so we don't have to change it on every system.
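You don't strictly need a reboot for init to pick up the edit - telling it to re-read its config is enough, though the new flags only matter the next time each agetty is (re)started:

telinit q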
Until then you should still be able to see the boot messages in /var/log/rc.log - by default /etc/rc.conf has rc_logger="YES" set, so that should "just work"(tm)
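For completeness, the relevant knobs in /etc/rc.conf are just these two (the path shown is, as far as I know, the default anyway):

rc_logger="YES"
rc_log_path="/var/log/rc.log"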
Sun Oct 16 17:19:36 CEST 2011
CGroups support for OpenRC
During a train ride I spent some time implementing a prototype of CGroups support for OpenRC - the big awesome feature of SysTemD that people seem to desire the most.
Here's the most amusing bit:
$ git diff 3ad849c5d6a24ef66152004eb3149d2cff973b1c..082c04e0a1c31115417af9fd348ae83ee8ecc397 --stat
 sh/init.sh.Linux.in |   10 +++++++++
 sh/runscript.sh.in  |   54 +++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 62 insertions(+), 2 deletions(-)

So I've managed to stay below my initial estimate of "under 100 lines of shell" - I don't quite get what the big deal is. Of course the initial patch is a bit rough, there are some things that just don't feel clean to me - but it works.
To explain the patch a bit - sh/init.sh.Linux.in:
+# Setup Kernel Support for CGroups
+if grep -qs cgroup /proc/filesystems; then
+	ebegin "Mounting CGroup filesystem"
+	mkdir -p /dev/cgroup
+	mount -n -t cgroup -o nodev,noexec,nosuid \
+		OpenRC-CGroup /dev/cgroup
+	eend $?
+fi

We need to mount a CGroup pseudofilesystem somewhere, and I figured we'd better do it as early as possible (although the procfs and sysfs init scripts aren't wrong per se). So we mount a cgroupfs named "OpenRC-CGroup" - that's just a nice marker so you can see who did it. So much for setup.
Now when a service starts we just need to take the parent process (which is conveniently the runscript.sh process that runs the init script) and push it into its own group. Although, maybe, it makes sense to let some init scripts be in the same group, right?
+# Attach to CGroup - dir existing is enough for us
+if [ -d /dev/cgroup ]; then
+	# use the svcname unless overriden in conf.d
+	SVC_CGROUP=${CGROUP:-$RC_SVCNAME}
+	mkdir -p /dev/cgroup/${SVC_CGROUP}
+	# now attach self to cgroup - any children of this process will inherit this
+	echo $$ > /dev/cgroup/${SVC_CGROUP}/tasks
+	# TODO: set res limits from conf.d
+fi

Look, ma, no hands!
We allow users to set CGROUP in conf.d files, but if it's not set we just use RC_SVCNAME, which is usually the name of the init script. Then we create a subdirectory (not caring if it already exists) and throw ourselves into it.
Tadaah!
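So a hypothetical conf.d override is all it takes to make several services share one group - the file name and group name here are made up:

# /etc/conf.d/apache2
CGROUP="webstack"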
There's one little downside, we create directories but don't remove them. So let's add a little cleanup at the end of runscript.sh:
+# CGroup cleanup
+if [ -d /dev/cgroup ]; then
+	# use the svcname unless overriden in conf.d
+	SVC_CGROUP=${CGROUP:-$RC_SVCNAME}
+	# reattach ourself to root cgroup
+	echo $$ > /dev/cgroup/tasks
+	# remove cgroup if empty, will fail if any task attached
+	rmdir /dev/cgroup/${SVC_CGROUP} 2>/dev/null
+fi
+# need to exit cleanly
+exit 0

The exit 0 is a bit naughty, but I think it's the right way to terminate. Otherwise the rmdir might give us a bad return value which makes it look like things fail even when they don't ...
There's a neat trick in it - we are still in that cgroup, so we take ourselves and move us back to the root cgroup - otherwise the cgroup will never be empty. Duh! :)
And here's a more, let's say, controversial part. It's exquisitely rude, so I guess it will have to be adapted for public use - but it works too well. See, there's this leeeetle problem that OpenRC tends to not kill processes all that well. So if, for example, the apache init script doesn't manage to kill all indians you may have a problem - restarting might fail because there's still the last mohican protecting port 80.
Since we know that the processes end up in a specific cgroup we can just take all of them, put them against the wall and shoot them dead. This has some potential side-effects, for example if there are multiple init scripts using one cgroup, or if there's a service where a process is expected to survive (although that's funkay). Anyway ...
+# Kill everything in the CGroup with maximum prejudice until it is dead
+# This is very naughty if multiple init scripts were to share a CGroup ...
+terminate()
+{
+	if ! [ -d /dev/cgroup ]; then
+		return 0
+	else
+		SVC_CGROUP=${CGROUP:-$RC_SVCNAME}
+		# we want to survive and thus must not be killed dead
+		echo $$ > /dev/cgroup/tasks
+		for signal in TERM KILL; do
+			for i in `cat /dev/cgroup/${SVC_CGROUP}/tasks`; do
+				kill -s $signal $i
+			done
+		done
+	fi
+	# if anyone survived here we could try task_freezer; SIGTERM; unfreeze
+	# to be sure everyone really got the message that they are dead
+}

This works well even for uncooperative init scripts, but might not always be what you want. But at least it kills everyone dead :)
And there are some funky resource limits possible - I'm still trying to figure out how to make use of them properly, but ...
+	# cpuset support try 1
+	if [ -v CGROUP_CPUS ] && [ -e /dev/cgroup/${SVC_CGROUP}/cpuset.cpus ]; then
+		echo $CGROUP_CPUS > /dev/cgroup/${SVC_CGROUP}/cpuset.cpus
+	fi

... this nails everything from one init script to a cpuset, so you can hammer apache onto CPUs 1-7 and leave CPU 0 free for "everything else", if you want. It also allows memory limits and a few other advanced gadgets, but I haven't played with that yet.
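Paired with that snippet, the conf.d side of it would just be something like this (again a made-up example file):

# /etc/conf.d/apache2
CGROUP_CPUS="1-7"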
And so in one afternoon we managed to grow a pretty decent bonus feature ..