2014/05/14

lifesaver for batch processing

It's been a while since the last post, I've been quite occupied with KBOData, more information on that will follow soon (here and on http://practical-linkeddata.blogspot.com). Today a short tip.

Batch processing. At some point in my IT career it occupied all of my time. Batch processing on mainframe (using PL/I and IDMS) to be exact. The performance we got back then is unmatched by anything I've seen since, you just can't beat the big iron as far as batch processing goes.

Standard ROC processing isn't optimized for batch processing. Look at it this way ... say you request the resource that is your batch process then out-of-the-box NetKernel keeps track of every dependency that makes up that resource. In a batch process this can pretty quickly turn nasty on memory usage. And think about it, rarely do you want the result of a batch process to be cached.

It is possible to do very efficient batch processing with ROC though. You can fan out requests asynchronously for example. More on that another time. For now, here's the lifesaver I got from Tony Butterfield yesterday. Not only did it shorten execution time massively, it also kept memory usage down (to next to nothing) :
<request>
  <identifier>active:csv2nt+filein@data:text/plain,file:/var/tmp/input/address.csv+fileout@data:text/plain,file:/var/tmp/output/address.ttl+template@data:text/plain,address</identifier>
  <header name="forget-dependencies">
    <literal type="boolean">true</literal>
  </header>
</request>


What this exact request does is not so important (it converts a huge csv file into rdf syntax using a freemarker template on each line), what is important is the header. For that header makes the difference between "batch sucks on NetKernel" and "batch roc(k)s on NetKernel". Thanks Tony !