Sun Sep 21 13:59:16 CEST 2014

bcache

My "sacrificial box", a machine reserved for any experimentation that can break stuff, has had annoyingly slow IO for a while now. I've had 3 old SATA harddisks (250GB) in a RAID5 (because I don't trust them to survive), and recently I got a cheap 64GB SSD that has become the new rootfs initially.

The performance difference between the SATA disks and the SSD is quite amazing, and the difference to a proper SSD is amazing again. Just for fun: the 3-disk RAID5 writes random data at about 1.5MB/s, the crap SSD manages ~60MB/s, and a proper SSD (e.g. Intel) easily hits over 200MB/s. So while this is not great hardware it's excellent for demonstrating performance hacks.

Recent-ish kernels finally have bcache included, so I decided to see if I can make use of it. Since creating new bcache devices is destructive I copied all data away, reformated the relevant partitions and then set up bcache. So the SSD is now 20GB rootfs, 40GB cache. The raid5 stays as it is, but gets reformated with bcache.
In code:
wipefs /dev/md0 # remove old headers to unconfuse bcache
make-bcache -C /dev/sda2 -B /dev/md0 --writeback --cache_replacement_policy=lru
mkfs.xfs /dev/bcache0 # no longer using md0 directly!
Now performance is still quite meh, what's the problem? Oh ... we need to attach the SSD cache device to the backing device!
ls /sys/fs/bcache/
45088921-4709-4d30-a54d-d5a963edf018  register  register_quiet
That's the UUID we need, so:
echo 45088921-4709-4d30-a54d-d5a963edf018 > /sys/block/bcache0/bcache/attach
and dmesg says:
[  549.076506] bcache: bch_cached_dev_attach() Caching md0 as bcache0 on set 45088921-4709-4d30-a54d-d5a963edf018
Tadaah!

So what about performance? Well ... without any proper benchmarks, just copying the data back I see very different behaviour. iotop shows writes happening at ~40MB/s, but as the network isn't that fast (100Mbit switch) it's only writing every ~5sec for a second.
Unpacking chromium is now CPU-limited and doesn't cause a minute-long IO storm. Responsivity while copying data is quite excellent.

The write speed for random IO is a lot higher, reaching maybe 2/3rds of the SSD natively, but I have 1TB storage with that speed now - for a $25 update that's quite amazing.

Another interesting thing is that bcache is chunking up IO, so the harddisks are no longer making an angry purring noise with random IO, instead it's a strange chirping as they only write a few larger chunks every second. It even reduces the noise level?! Neato.

First impression: This is definitely worth setting up for new machines that require good IO performance, the only downside for me is that you need more hardware and thus a slightly bigger budget. But the speedup is "very large" even with a cheap-crap SSD that doesn't even go that fast ...

Edit: ioping, for comparison:
native sata disks:
32 requests completed in 32.8 s, 34 iops, 136.5 KiB/s
min/avg/max/mdev = 194 us / 29.3 ms / 225.6 ms / 46.4 ms

bcache-enhanced, while writing quite a bit of data:
36 requests completed in 35.9 s, 488 iops, 1.9 MiB/s
min/avg/max/mdev = 193 us / 2.0 ms / 4.4 ms / 1.2 ms


Definitely awesome!

Posted by Patrick | Permalink

Fri Sep 5 08:41:43 CEST 2014

32bit Madness

This week I ran into a funny issue doing backups with rsync:
rsnapshot/weekly.3/server/storage/lost/of/subdirectories/some-stupid.file => rsnapshot/daily.0/server/storage/lost/of/subdirectories/some-stupid.file
ERROR: out of memory in make_file [generator]
rsync error: error allocating core memory buffers (code 22) at util.c(117) [generator=3.0.9]
rsync error: received SIGUSR1 (code 19) at main.c(1298) [receiver=3.0.9]
rsync: connection unexpectedly closed (2168136360 bytes received so far) [sender]
rsync error: error allocating core memory buffers (code 22) at io.c(605) [sender=3.0.9]
Oopsiedaisy, rsync ran out of memory. But ... this machine has 8GB RAM, plus 32GB Swap ?!
So I re-ran this and started observing, and BAM, it fails again. With ~4GB RAM free.

4GB you say, eh? That smells of ... 2^32 ...
For doing the copying I was using sysrescuecd, and then it became obvious to me: All binaries are of course 32bit!

So now I'm doing a horrible hack of "linux64 chroot /mnt/server" so that I have a 64bit environment that does not run out of space randomly. Plus 3 new bugs for the Gentoo livecd, which fails to appreciate USB and other things.
Who would have thought that a 16TB partition can make rsync stumble over address space limits ...

Posted by Patrick | Permalink

Wed Sep 3 08:25:27 CEST 2014

AMD HSA

With the release of the "Kaveri" APUs AMD has released some quite intriguing technology. The idea of the "APU" is a blend of CPU and GPU, what AMD calls "HSA" - Heterogenous System Architecture.
What does this mean for us? In theory, once software catches up, it'll be a lot easier to use GPU-acceleration (e.g. OpenCL) within normal applications.

One big advantage seems to be that CPU and GPU share the system memory, so with the right drivers you should be able to do zero-copy GPU processing. No more host-to-GPU copy and other waste of time.

So far there hasn't been any driver support to take advantage of that. Here's the good news: As of a week or two ago there is driver support. Still very alpha, but ... at last, drivers!

On the kernel side there's the kfd driver, which piggybacks on radeon. It's available in a slightly very patched kernel from AMD. During bootup it looks like this:
[    1.651992] [drm] radeon kernel modesetting enabled.
[    1.657248] kfd kfd: Initialized module
[    1.657254] Found CRAT image with size=1440
[    1.657257] Parsing CRAT table with 1 nodes
[    1.657258] Found CU entry in CRAT table with proximity_domain=0 caps=0
[    1.657260] CU CPU: cores=4 id_base=16
[    1.657261] Found CU entry in CRAT table with proximity_domain=0 caps=0
[    1.657262] CU GPU: simds=32 id_base=-2147483648
[    1.657263] Found memory entry in CRAT table with proximity_domain=0
[    1.657264] Found memory entry in CRAT table with proximity_domain=0
[    1.657265] Found memory entry in CRAT table with proximity_domain=0
[    1.657266] Found memory entry in CRAT table with proximity_domain=0
[    1.657267] Found cache entry in CRAT table with processor_id=16
[    1.657268] Found cache entry in CRAT table with processor_id=16
[    1.657269] Found cache entry in CRAT table with processor_id=16
[    1.657270] Found cache entry in CRAT table with processor_id=17
[    1.657271] Found cache entry in CRAT table with processor_id=18
[    1.657272] Found cache entry in CRAT table with processor_id=18
[    1.657273] Found cache entry in CRAT table with processor_id=18
[    1.657274] Found cache entry in CRAT table with processor_id=19
[    1.657274] Found TLB entry in CRAT table (not processing)
[    1.657275] Found TLB entry in CRAT table (not processing)
[    1.657276] Found TLB entry in CRAT table (not processing)
[    1.657276] Found TLB entry in CRAT table (not processing)
[    1.657277] Found TLB entry in CRAT table (not processing)
[    1.657278] Found TLB entry in CRAT table (not processing)
[    1.657278] Found TLB entry in CRAT table (not processing)
[    1.657279] Found TLB entry in CRAT table (not processing)
[    1.657279] Found TLB entry in CRAT table (not processing)
[    1.657280] Found TLB entry in CRAT table (not processing)
[    1.657286] Creating topology SYSFS entries
[    1.657316] Finished initializing topology ret=0
[    1.663173] [drm] initializing kernel modesetting (KAVERI 0x1002:0x1313 0x1002:0x0123).
[    1.663204] [drm] register mmio base: 0xFEB00000
[    1.663206] [drm] register mmio size: 262144
[    1.663210] [drm] doorbell mmio base: 0xD0000000
[    1.663211] [drm] doorbell mmio size: 8388608
[    1.663280] ATOM BIOS: 113
[    1.663357] radeon 0000:00:01.0: VRAM: 1024M 0x0000000000000000 - 0x000000003FFFFFFF (1024M used)
[    1.663359] radeon 0000:00:01.0: GTT: 1024M 0x0000000040000000 - 0x000000007FFFFFFF
[    1.663360] [drm] Detected VRAM RAM=1024M, BAR=256M
[    1.663361] [drm] RAM width 128bits DDR
[    1.663471] [TTM] Zone  kernel: Available graphics memory: 7671900 kiB
[    1.663472] [TTM] Zone   dma32: Available graphics memory: 2097152 kiB
[    1.663473] [TTM] Initializing pool allocator
[    1.663477] [TTM] Initializing DMA pool allocator
[    1.663496] [drm] radeon: 1024M of VRAM memory ready
[    1.663497] [drm] radeon: 1024M of GTT memory ready.
[    1.663516] [drm] Loading KAVERI Microcode
[    1.667303] [drm] Internal thermal controller without fan control
[    1.668401] [drm] radeon: dpm initialized
[    1.669403] [drm] GART: num cpu pages 262144, num gpu pages 262144
[    1.685757] [drm] PCIE GART of 1024M enabled (table at 0x0000000000277000).
[    1.685894] radeon 0000:00:01.0: WB enabled
[    1.685905] radeon 0000:00:01.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff880429c5bc00
[    1.685908] radeon 0000:00:01.0: fence driver on ring 1 use gpu addr 0x0000000040000c04 and cpu addr 0xffff880429c5bc04
[    1.685910] radeon 0000:00:01.0: fence driver on ring 2 use gpu addr 0x0000000040000c08 and cpu addr 0xffff880429c5bc08
[    1.685912] radeon 0000:00:01.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffff880429c5bc0c
[    1.685914] radeon 0000:00:01.0: fence driver on ring 4 use gpu addr 0x0000000040000c10 and cpu addr 0xffff880429c5bc10
[    1.686373] radeon 0000:00:01.0: fence driver on ring 5 use gpu addr 0x0000000000076c98 and cpu addr 0xffffc90012236c98
[    1.686375] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    1.686376] [drm] Driver supports precise vblank timestamp query.
[    1.686406] radeon 0000:00:01.0: irq 83 for MSI/MSI-X
[    1.686418] radeon 0000:00:01.0: radeon: using MSI.
[    1.686441] [drm] radeon: irq initialized.
[    1.689611] [drm] ring test on 0 succeeded in 3 usecs
[    1.689699] [drm] ring test on 1 succeeded in 2 usecs
[    1.689712] [drm] ring test on 2 succeeded in 2 usecs
[    1.689849] [drm] ring test on 3 succeeded in 2 usecs
[    1.689856] [drm] ring test on 4 succeeded in 2 usecs
[    1.711523] tsc: Refined TSC clocksource calibration: 3393.828 MHz
[    1.746010] [drm] ring test on 5 succeeded in 1 usecs
[    1.766115] [drm] UVD initialized successfully.
[    1.767829] [drm] ib test on ring 0 succeeded in 0 usecs
[    2.268252] [drm] ib test on ring 1 succeeded in 0 usecs
[    2.712891] Switched to clocksource tsc
[    2.768698] [drm] ib test on ring 2 succeeded in 0 usecs
[    2.768819] [drm] ib test on ring 3 succeeded in 0 usecs
[    2.768870] [drm] ib test on ring 4 succeeded in 0 usecs
[    2.791599] [drm] ib test on ring 5 succeeded
[    2.812675] [drm] Radeon Display Connectors
[    2.812677] [drm] Connector 0:
[    2.812679] [drm]   DVI-D-1
[    2.812680] [drm]   HPD3
[    2.812682] [drm]   DDC: 0x6550 0x6550 0x6554 0x6554 0x6558 0x6558 0x655c 0x655c
[    2.812683] [drm]   Encoders:
[    2.812684] [drm]     DFP2: INTERNAL_UNIPHY2
[    2.812685] [drm] Connector 1:
[    2.812686] [drm]   HDMI-A-1
[    2.812687] [drm]   HPD1
[    2.812688] [drm]   DDC: 0x6530 0x6530 0x6534 0x6534 0x6538 0x6538 0x653c 0x653c
[    2.812689] [drm]   Encoders:
[    2.812690] [drm]     DFP1: INTERNAL_UNIPHY
[    2.812691] [drm] Connector 2:
[    2.812692] [drm]   VGA-1
[    2.812693] [drm]   HPD2
[    2.812695] [drm]   DDC: 0x6540 0x6540 0x6544 0x6544 0x6548 0x6548 0x654c 0x654c
[    2.812695] [drm]   Encoders:
[    2.812696] [drm]     CRT1: INTERNAL_UNIPHY3
[    2.812697] [drm]     CRT1: NUTMEG
[    2.924144] [drm] fb mappable at 0xC1488000
[    2.924147] [drm] vram apper at 0xC0000000
[    2.924149] [drm] size 9216000
[    2.924150] [drm] fb depth is 24
[    2.924151] [drm]    pitch is 7680
[    2.924428] fbcon: radeondrmfb (fb0) is primary device
[    2.994293] Console: switching to colour frame buffer device 240x75
[    2.999979] radeon 0000:00:01.0: fb0: radeondrmfb frame buffer device
[    2.999981] radeon 0000:00:01.0: registered panic notifier
[    3.008270] ACPI Error: [\_SB_.ALIB] Namespace lookup failure, AE_NOT_FOUND (20131218/psargs-359)
[    3.008275] ACPI Error: Method parse/execution failed [\_SB_.PCI0.VGA_.ATC0] (Node ffff88042f04f028), AE_NOT_FOUND (20131218/psparse-536)
[    3.008282] ACPI Error: Method parse/execution failed [\_SB_.PCI0.VGA_.ATCS] (Node ffff88042f04f000), AE_NOT_FOUND (20131218/psparse-536)
[    3.509149] kfd: kernel_queue sync_with_hw timeout expired 500
[    3.509151] kfd: wptr: 8 rptr: 0
[    3.509243] kfd kfd: added device (1002:1313)
[    3.509248] [drm] Initialized radeon 2.37.0 20080528 for 0000:00:01.0 on minor 0
It is recommended to add udev rules:
# cat /etc/udev/rules.d/kfd.rules 
KERNEL=="kfd", MODE="0666"
(this might not be the best way to do it, but we're just here to test if things work at all ...)

AMD has provided a small shell script to test if things work:
# ./kfd_check_installation.sh 

Kaveri detected:............................Yes
Kaveri type supported:......................Yes
Radeon module is loaded:....................Yes
KFD module is loaded:.......................Yes
AMD IOMMU V2 module is loaded:..............Yes
KFD device exists:..........................Yes
KFD device has correct permissions:.........Yes
Valid GPU ID is detected:...................Yes

Can run HSA.................................YES
So that's a good start. Then you need some support libs ... which I've ebuildized in the most horrible ways
These ebuilds can be found here

Since there's at least one binary file with undeclared license and some other inconsistencies I cannot recommend installing these packages right now.
And of course I hope that AMD will release the sourcecode of these libraries ...

There's an example "vector_copy" program included, it mostly works, but appears to go into an infinite loop. Outout looks like this:
# ./vector_copy 
Initializing the hsa runtime succeeded.
Calling hsa_iterate_agents succeeded.
Checking if the GPU device is non-zero succeeded.
Querying the device name succeeded.
The device name is Spectre.
Querying the device maximum queue size succeeded.
The maximum queue size is 131072.
Creating the queue succeeded.
Creating the brig module from vector_copy.brig succeeded.
Creating the hsa program succeeded.
Adding the brig module to the program succeeded.
Finding the symbol offset for the kernel succeeded.
Finalizing the program succeeded.
Querying the kernel descriptor address succeeded.
Creating a HSA signal succeeded.
Registering argument memory for input parameter succeeded.
Registering argument memory for output parameter succeeded.
Finding a kernarg memory region succeeded.
Allocating kernel argument memory buffer succeeded.
Registering the argument buffer succeeded.
Dispatching the kernel succeeded.
^C
Big thanks to AMD for giving us geeks some new toys to work with, and I hope it becomes a reliable and efficient platform to do some epic numbercrunching :)

Posted by Patrick | Permalink