Sep 27 (Tue), 2005, 13:10

python weirdness

A while back I wrote a tokenizing generator func for some core portage rewrite string processing; nothing incredibly fancy, it just chunks up strings depending on the splitters passed in. This serves as the basis for chunking up and processing depset syntax, for example: "dev-util/diffball bsdiff? ( dev-util/bsdiff )".

Now, I thought the func was fairly tight, speedy enough. Simple little sucker-

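def iter_tokens(s, splitter=" "):
    """lazily yield the chunks of s that sit between splitter characters"""
    pos = 0
    l = len(s)
    while pos < l:
        # skip over any splitter characters before the next token
        if s[pos] in splitter:
            pos += 1
            continue
        # scan forward to the end of the current token
        next_pos = pos + 1
        while next_pos < l and s[next_pos] not in splitter:
            next_pos += 1
        yield s[pos:next_pos]
        # next_pos is either a splitter or the end of s, so step past it
        pos = next_pos + 1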
python -m timeit -s 'x="mamma said knock you out\n mama \t knocked\t you out\n";x*=10000;' 'list(iter_tokens(x, " \t\n"))'
Using that func, it's roughly 365ms per run. Granted, there is room for improvement, but at first glance the only tricky spot is the linear search of splitter; using a set there is actually a bit slower, due to the overhead of creating said set.

What massively gets my goat is that it's actually pretty damn slow in most real world usage compared to a seemingly primitive, and butt ugly (imo) approach.

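from itertools import ifilter
def iter_tokens(s, splitter=" "):
    l = len(splitter)
    if l > 1:
        # pure whitespace splitting: let str.split() do all of the work
        if l == 3 and " " in splitter and "\t" in splitter and "\n" in splitter:
            return iter(s.split())
        # collapse every splitter down to the last one, then split once
        for x in splitter[:-1]:
            s = s.replace(x, splitter[-1])
    # ifilter(None, ...) drops the empty strings left by adjacent splitters
    return ifilter(None, s.split(splitter[-1]))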
python -m timeit -s 'x="mamma said knock you out\n mama \t knocked\t you out\n";x*=10000;' 'list(iter_tokens(x, " \t\n"))';
Is faster. Much faster. Clocks in at 38.7ms. Without the check for " \t\n", it clocks in at 61ms.
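
For anyone wanting to check that the two approaches really produce the same tokens, here is a rough sketch. It assumes Python 3, where `itertools.ifilter` no longer exists and the built-in `filter` is lazy, so `filter` stands in for it. The names `tokens_loop` and `tokens_replace` are made up here so both versions can live side by side; no timings are claimed for this variant.

```python
def tokens_loop(s, splitter=" "):
    """Character-by-character generator, same logic as the first version."""
    pos = 0
    l = len(s)
    while pos < l:
        if s[pos] in splitter:
            pos += 1
            continue
        next_pos = pos + 1
        while next_pos < l and s[next_pos] not in splitter:
            next_pos += 1
        yield s[pos:next_pos]
        pos = next_pos + 1


def tokens_replace(s, splitter=" "):
    """Replace-then-split version; built-in filter replaces ifilter (assumption: Python 3)."""
    l = len(splitter)
    if l > 1:
        if l == 3 and " " in splitter and "\t" in splitter and "\n" in splitter:
            return iter(s.split())
        for x in splitter[:-1]:
            s = s.replace(x, splitter[-1])
    return filter(None, s.split(splitter[-1]))


sample = "dev-util/diffball bsdiff? ( dev-util/bsdiff )"
same = list(tokens_loop(sample)) == list(tokens_replace(sample))
```

One caveat on the whitespace shortcut: `s.split()` with no arguments also splits on other whitespace such as "\r", so it only matches the loop version when the input contains no whitespace beyond space, tab and newline.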

Even with a single splitter, the replace hack is still faster (although the difference between the two is minor enough). So... that's weird, and bugged the hell out of me last night :)

Zac Medico's comments about each yield instantiating and returning another string instance are probably fairly on target. Either way, it's not intuitive to me :)

Final comment on it: the downside to the faster approach is that you have to do the processing up front, rather than JIT as the generator does; in the case of the code that uses this, it's not an issue though. Haven't dug into the underlying python source to figure out why there's such a difference, so if someone knows, kindly tell me so I can spend my time doing something else ;)
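
If lazy processing did matter, one possible middle ground would be to keep the generator but hand the scanning to the regex engine via `re.finditer`. This is a different technique from either version above, it has not been benchmarked here, and `iter_tokens_re` is a hypothetical name. It also assumes a non-empty splitter, since an empty one would produce an invalid pattern.

```python
import re


def iter_tokens_re(s, splitter=" "):
    """Lazily yield runs of characters that are not in splitter."""
    # re.escape keeps characters like ']' or '^' literal inside the class
    pattern = re.compile("[^%s]+" % re.escape(splitter))
    for match in pattern.finditer(s):
        yield match.group()


tokens = iter_tokens_re("dev-util/diffball bsdiff? ( dev-util/bsdiff )")
first = next(tokens)  # only the first token has been extracted at this point
```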

Update: Tweaked the replace func and updated its runtime, since it was brain dead from experimentation at 3am; the saner/simpler version of the replace loop is courtesy of Andy Dustman. The check for " \t\n" is a quicky addition from me, mainly since that is even faster.
Note also that I don't have an issue with the faster approach; I'm just rather amazed at the major difference in runtime between the two approaches.


Posted by Brian Harring | Permalink | Categories: General Gentoo, python

Sep 24 (Sat), 2005, 21:33

Upcoming rsync cache changes

Committed a variation of a patch I posted in this thread to stable earlier today. Covers two things-

Detection of $PORTDIR/metadata/cache format- currently portage stable uses an ordered list of (implicit) key -> value; this makes it essentially impossible to ever remove a key, and makes addition of keys have a hard limit. Bad. So... the new format is an old format I hacked out a year back, flat_hash, (explicit) key -> value unordered. Nothing hugely fancy, but does allow us to jam stuff in without issue.

Increased flexibility requires us to version the cache entry in some way, so that we know if entries are incompatible with the version of portage reading it. Additionally, we should have been versioning the expected ebuild env (how it will be called, what funcs are available, etc) long ago. EAPI is that; additions/extensions to the ebuild spec result in a new EAPI standard, for example, src_configure addition is part of what EAPI=1 is. With EAPI in the cache, we can know whether or not the local portage version is capable of properly handling that cache entry. A higher EAPI (later portage release) may add new metadata; any portage version that doesn't support that EAPI must in some way mark the entry as "I know of it, but I can't use this ebuild".

So... in that jumble, essentially the rsync metadata/cache auto-detection allows us to move over to a more flexible format without causing cache horkages every time we change stuff (as has happened often enough), and EAPI allows us to version those entries, so that EAPI aware portage versions can protect themselves from doing something stupid.
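
To make the two pieces concrete, a toy sketch follows. It is not portage's actual code: the key list is a shortened, hypothetical stand-in, the `KEY=value` line syntax is assumed for flat_hash, and treating a missing EAPI as "0" is likewise an assumption made for the example.

```python
# Hypothetical, shortened key order for the old positional format.
ORDERED_KEYS = ("DEPEND", "RDEPEND", "SLOT")

# EAPI levels this imaginary reader understands.
SUPPORTED_EAPIS = {"0"}


def parse_ordered(text):
    """Old style: line N is implicitly the value for ORDERED_KEYS[N]."""
    return dict(zip(ORDERED_KEYS, text.split("\n")))


def parse_flat_hash(text):
    """New style: every line names its own key, so order is irrelevant."""
    entry = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            entry[key] = value
    return entry


def usable(entry):
    """Entries with an unknown EAPI are known about, but must not be used."""
    # assumption: an entry without an EAPI key counts as EAPI 0
    return entry.get("EAPI", "0") in SUPPORTED_EAPIS


old = parse_ordered("dev-util/bsdiff\n\n0")
new = parse_flat_hash("SLOT=0\nEAPI=1\nDEPEND=dev-util/bsdiff")
```

In the positional form, dropping or reordering a key silently shifts every later value; in the explicit form, an unknown key is just an extra dictionary entry, and `usable(new)` is False because this reader only knows EAPI 0.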

That and a lot of emerge --metadata cleanups got stuck in, hopefully killing off any remaining failures during cache transfer ;)

Also dropped root requirement for emerge --metadata. That always bugged the heck out of me, since it wasn't needed...


Posted by Brian Harring | Permalink | Categories: General Gentoo, Portage news

Sep 19 (Mon), 2005, 11:55

mailbox archiving

Finally got around to writing a quicky filter to yank old msgs from maildir, and slap them into mbox. Nothing massively fancy, just a quicky script since I prefer my archives in mbox, so if anyone is interested it's available here.

Does the trick for my needs, will miss a few weird headers, but doesn't lose msgs via it (or shouldn't) ;)
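
For the curious, the general idea can be sketched with Python's standard `mailbox` module. This is not the script mentioned above, just an illustration under stated assumptions: the function name `archive_old`, the 90 day default, and the use of the maildir file's mtime (what `get_date()` returns) as the message age are all choices made for the example.

```python
import mailbox
import time


def archive_old(maildir_path, mbox_path, max_age_days=90):
    """Move messages older than max_age_days from a maildir into an mbox."""
    cutoff = time.time() - max_age_days * 86400
    src = mailbox.Maildir(maildir_path, create=False)
    dst = mailbox.mbox(mbox_path)
    moved = 0
    dst.lock()
    try:
        for key in list(src.keys()):
            msg = src[key]
            # get_date() is the delivery time taken from the file's mtime
            if msg.get_date() < cutoff:
                # convert so flags and date carry over to the mbox copy
                dst.add(mailbox.mboxMessage(msg))
                # make sure the copy is on disk before deleting the original
                dst.flush()
                src.remove(key)
                moved += 1
    finally:
        dst.close()  # flushes, releases the lock, closes the file
        src.close()
    return moved
```

Flushing before each removal is slower than batching, but it keeps to the "doesn't lose msgs" goal: a message is only deleted from the maildir once it has been written to the mbox.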


Posted by Brian Harring | Permalink