Sunday, August 1, 2010

Unarchiving Heritrix' archives - arc.gz

Those who have used Heritrix for web crawling already know the 'arc' format. For everyone else, Heritrix is a web crawler (a million-dollar guess :P) that archives the pages it has crawled in ARC format.
I wanted to use it to collect the corpus for training my classifier. PhoneyC would have been an option too, but I had to collect as many samples as possible, so I went with Heritrix.
Configuring it is pretty easy, as it has nice documentation. To extract the crawled pages, though, I went looking for a ready-made script or tool (I am too foolish and scared of errors while coding; in short, I am a noob!), and I wasted quite a lot of time googling. Sometimes laziness is a boon. I wish I had been too lazy to google!
Finally, I mailed Peter Likarish, a PhD student at the University of Iowa, who had previous experience with Heritrix and with obfuscated JS classification too. He pointed out that ARCs are flat files, so it's pretty easy to extract pages from them. Some understanding of sgmllib.py helped as well. Using a handful of regular expressions and some loops, ta-da!! I got the code up and running!
For interested readers, the code is here.
I have tried it on a few ARCs and it seems to work fine. If someone tries it and runs into a bug, I apologise for the inconvenience. Please let me know about any problems, bugs, or errors. I would be grateful.
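
For readers who just want the gist of the "flat file" idea, here is a minimal sketch. It is not the script linked above, and it reads the length field instead of using regular expressions. It assumes ARC version 1 records: one header line of space-separated fields (URL, IP address, date, content type, length), followed by exactly that many bytes of content, which for HTTP records starts with the response headers. The function names are made up for this illustration, and URLs containing raw spaces are not handled.

```python
import gzip
import io


def iter_arc_records(stream):
    """Yield (url, content_type, body) for each record in an uncompressed ARC stream.

    Assumes ARC v1 headers, where the last two space-separated fields are
    the content type and the record length in bytes.
    """
    while True:
        header = stream.readline()
        if not header:
            break  # end of stream
        header = header.strip()
        if not header:
            continue  # blank line separating two records
        fields = header.split(b" ")
        if len(fields) < 5:
            raise ValueError("unexpected ARC header: %r" % header)
        url = fields[0].decode("latin-1")
        content_type = fields[-2].decode("latin-1")
        length = int(fields[-1])
        body = stream.read(length)  # raw record content
        yield url, content_type, body


def split_http_response(body):
    """Split a record body into (HTTP headers, page); headers are empty if absent."""
    head, sep, page = body.partition(b"\r\n\r\n")
    return (head, page) if sep else (b"", body)


def read_arc_gz(path):
    """Iterate over the records of an .arc.gz file.

    gzip.open reads concatenated gzip members transparently, so the
    compressed file can be treated as one flat stream.
    """
    with gzip.open(path, "rb") as stream:
        for record in iter_arc_records(stream):
            yield record
```

To use it, loop over read_arc_gz("crawl.arc.gz") (a placeholder filename), skip the first record whose URL starts with "filedesc://", and pass each remaining body through split_http_response to get the page itself.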

So, start crawling!!!
