Oct 23 (Sun), 2005, 03:53

diffball releases

Hola kiddies

Received a request earlier, which wound up with diffball 0.7 being released, and a large amount of lib work being done so that people can get access to a rather simple high level api.

At this point, the lib exposes the exact same set of runs that compromises the differ binaries algo, and exposes a reconstruct func that, again, pretty much comprises the patcher binaries internal reconstruction calls. An example of using libdiffball for doing delta compression between two files in memory, dumping the results to another array (note that the code is being anal about dealloc, hence the structure)-

#include 
#include 
#include 

// returns 0 on success, non zero on failure
int diff_it(char *src_array, int src_len, char *ver_array, int ver_len, 
  char **out_array, int *out_len)
{
  cfile src_cfh, ver_cfh, out_cfh;
  int ret = -1;

  if((ret = copen_mem(&src_cfh, src_array, src_len, NO_COMPRESSOR, CFILE_RONLY)) 
    == 0) {

    if((ret = copen_mem(&ver_cfh, ver_array, ver_len, NO_COMPRESSOR, CFILE_RONLY)) 
      == 0) {

      if((ret = copen_mem(&out_cfh, NULL, 0, NO_COMPRESSOR, CFILE_WONLY)) == 0) {
      
        if((ret = simple_difference(&src_cfh, &ver_cfh, &out_cfh, 0, 0, 0, 0)) 
          == 0) {
          *out_array = out_cfh.data.buff;
    	  *out_len = out_cfh.data.write_end;
        } else {
          free(out_cfh.data.buff);
        }
        cclose(&out_cfh);
      }
      cclose(&ver_cfh);
    }
    cclose(&src_cfh);
  }
  return ret;
}

So what exactly is that doing? Libdiffball uses internally the io lib cfile that I created for it, that's intended to transparently wrap access to mem/compressed files/raw files behind a common structure and set of functions. Pretty much have the usual open/read/write/lseek/close/tell, just prefixed with a c; internally, if it's a bzip2 stream, it does the decompression on it's own, and feeds data to users.

To head off concerns about buffers of buffers, preferred access to cfile structs is via a page api, that exposes the underlying buffer, so as to cut down on unneccessarily cread memcpy's (for example).

The copen_mem above is configuring the cfile struct so that it's working directly from the passed in array- for CFILE_RONLY, obviously, doesn't modify the buffer. :) I'm opening the out_cfh with a null buffer, so that it allocs it's own.

From there... the simple_difference call is pretty straightforward. The trailing zero args are just telling it to use defaults for a couple of tunables (in this case, I have no interest in fooling with them).

What's all the nonsense involving pulling a buffer? Well, that's because I haven't yet thought up a good, *clean* api for raiding the buffer + size out; till that's in place, management of data.buff (namely free'ing), is left to the caller- this is also why I choose this example, since would welcome suggestions for it. Remaining chunk of the code is pretty much just free'ing stuff on the way out, error or otherwise, then returning either the error or zero.

That's the beast. Might seem a bit complex, but aside from the CFILE_WONLY + free design decision, it's about as simple as I can make it. The api reconstruct equivalent, 'simple_reconstruct' is roughly on par for straightforwardness-

int simple_reconstruct(cfile *src_cfh, cfile *patch_cfh[], unsigned char patch_count, 
  cfile *out_cfh, unsigned int force_patch_id, unsigned int max_buff_size);
Again, pretty much just having the cfile handles passed in, a couple of options that most people will leave set to zero (use internal sane defaults). Nifty thing to note is that patch_cfh is an array of cfile ptrs- multiple patches applied in a single run (hard max of 256 atm), so via those funcs above you've got transparent decompression of patches/sources, multiple patches in a single run, and at least read capabilities on 6 different patch formats- xdelta, fdtu, bsdiff, and gdiff being the main external formats others may know.

So, enough whoring of that. Needed an easy way to test the new code from above, so I also wrote out some quick python bindings via pyrex- pydiffball-0.1 being the result. Exposed functionality is cfile creation (both memory and usual file based), simple_difference, and simple_reconstruct. The python module is quite a bit more friendly, since it'll generate the cfile instances on it's own for all api calls, assuming it's an on disk path passed in when the arg is a basestring derivative.

Hopefully people find some use in it; personally, the python bindings will at least make my life a helluva lot easier for some extra code I occasionaly work with.


Posted by Brian Harring | Permalink | Categories: delta compression, python

Sep 27 (Tue), 2005, 13:10

python weirdness

A while back I wrote a tokenizing generator func for some core portage rewrite string processing; nothing incredibly fancy, just chunks up strings dependant on splitters past in. This serves as the basis for chunking up and processing depset syntax. Example being: "dev-util/diffball bsdiff? ( dev-util/bsdiff )".

Now, I thought the func was fairly tight, speedy enough. Simple little sucker-

def iter_tokens(s, splitter=" "):
    """iterable yielding of splitting of a string"""
    pos = 0
    l = len(s)
    while pos < l:
        if s[pos] in splitter:
            pos += 1
            continue
        next_pos = pos + 1
        while next_pos < l and s[next_pos] not in splitter:
            next_pos+=1
        yield s[pos:next_pos]
        pos = next_pos + 1
python -m timeit -s 'x="mamma said knock you out\n mama \t knocked\t you out\n";x*=10000;' 'list(iter_tokens(x, " \t\n"))'
Using that func, is 365ms per run (roughly). Now granted, there is room for improvement, but at first glance, the only tricky spot is linear search of splitter- using a set there actually is a bit slower, due to overhead of creating said set.

What massively gets my goat is that it's actually pretty damn slow in most real world usage compared to a seemingly primitive, and butt ugly (imo) aproach.

from itertools import ifilter
def iter_tokens(s, splitter=" "):
    l = len(splitter)
    if l > 1:
        if l == 3 and " " in splitter and "\t" in splitter and "\n" in splitter:
            return iter(s.split())
        for x in splitter[:-1]:
            s = s.replace(x, splitter[-1])
    return ifilter(None, s.split(splitter[-1]))
python -m timeit -s 'x="mamma said knock you out\n mama \t knocked\t you out\n";x*=10000;' 'list(iter_tokens(x, " \t\n"))';
Is faster. Much faster. Clocks in at 38.7ms. Without the check for " \t\n", it clocks in at 61ms.

If it were a single split, still the replace hack is faster (although the difference between the two is minor enough). So... that's weird, and bugged the hell out of me last night :)

Zac Medico's comments about the yield instantiating and returning another string instance probably are fairly on par. Either way, it's not intuitive to me :)

Final comment on it, downside to the faster approach is that you have to do the processing up front, rather then JIT as the generator does- in the case of the code that uses this, it's not an issue though. Haven't dug into the underlying python source to figure out why there's such a difference, so if someone knows kindly tell me so I spend my time doing something else ;)

Update: Tweaked the replace func and updated it's runtime since it was brain dead from experimentation at 3am, saner/simpler version of the replace loop is courtesy Andy Dustman for the replace cleanup. The check for " \t\n" is a quicky addition from me, mainly since that is even faster.
Note also that the faster approach I don't have issue with, I'm just rather amazed at the major difference in runtime for the two approaches.


Posted by Brian Harring | Permalink | Categories: General Gentoo, python