random ideas: August 2010

Friday, August 27, 2010

Posting from vim using python to blogger

All thanks to Dennis for his cool Python code and his post, http://djcraven5.blogspot.com/2006/10/success-posting-to-blogger-beta-using.html, which makes it possible to publish posts to Blogger straight from vim!!!
Check it out, it is simply fun!!

Wednesday, August 11, 2010

Training the classifier or handling vmware errors!!

Training the classifier doesn't seem to be as much fun as I thought it would be.
Reasons I thought it would be fun:

I had found some new malicious web pages through simple Google searches!!
The lists for training the classifier, for both the mal and safe classes, were prepared.
The only task left was to pickle the features dictionary.

Reasons it became a PITA:

A VMware lack-of-memory error once 12 URLs had been scanned.
The same error then kept coming back at every URL, even after restarting the host machine and allocating more RAM to VMware.
Finally, it overwrote the pickle file holding what it had already learned, that is, the features of 15 URLs..
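
One way to keep a crash from clobbering the pickle is to checkpoint after every URL and to write through a temporary file, so the old pickle is only replaced once the new one is complete. This is just a minimal sketch, not what my scanner actually does: the names load_features, save_features and scan_urls are made up for illustration, and extract stands in for whatever function turns a URL into a feature dictionary.

```python
import os
import pickle
import tempfile


def load_features(path):
    """Return the previously saved features dict, or an empty one."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as fh:
        return pickle.load(fh)


def save_features(features, path):
    """Pickle to a temp file first, then swap it into place in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            pickle.dump(features, fh, protocol=pickle.HIGHEST_PROTOCOL)
        # os.replace overwrites the target atomically on the same filesystem,
        # so a crash leaves either the old file or the new one, never half.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def scan_urls(urls, path, extract):
    """Scan URLs one by one, resuming from whatever was saved earlier."""
    features = load_features(path)
    for url in urls:
        if url in features:
            continue  # already learned in an earlier run
        features[url] = extract(url)  # hypothetical feature extractor
        save_features(features, path)  # checkpoint after every URL
    return features
```

With something like this, a run that dies on URL 13 can be restarted with the same list and it picks up at URL 13 instead of starting from scratch.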

I have googled for this VMware error, but haven't found any suitable solution.
I thought operating systems had learned to handle batch jobs very early in their lives. This anomalous behaviour is beyond my understanding!!
Anybody with any relevant suggestions??

Sunday, August 1, 2010

Unarchiving Heritrix' archives - arc.gz

Those who have used Heritrix for web crawling are aware of the 'arc' format. For others, Heritrix is a web crawler (a million $ guess :P) which archives the pages it has crawled in arc format.
I thought of using it to collect the corpus for training my classifier. PhoneyC would have been an option, but I had to collect as many samples as possible, so I planned on using Heritrix.
Configuring it is pretty easy as it has nice documentation. To extract the crawled pages I went searching for some script or tool (I am too foolish and scared of errors while coding, in short I am a noob!), so I wasted quite a lot of time googling.. Sometimes laziness is a boon, I wish I had been too lazy to google!
Finally, I mailed Peter Likarish, a PhD student at the University of Iowa, who had previous experience with Heritrix and with obfuscated JS classification too, and he pointed out that arcs are flat files and that it's pretty easy to extract pages from them. Some understanding of sgmllib.py helped me as well. Using a handful of regular expressions and some loops, ta-da!! I got the code up and running!
For interested readers, the code is here.
I have tried it on some arcs and it seems to work fine. In case someone tries to use it and runs into a bug, I apologise for the inconvenience. Please let me know about any problems, bugs, or errors. I would be grateful..
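
To give an idea of what "flat file" means here, below is a rough sketch of walking through the records of an arc. It is not the linked code, only an illustration, and it assumes the version 1 layout: each record starts with a one-line header of space-separated fields (URL, IP address, archive date, content type, length) followed by that many bytes of content and a blank line. The names iter_arc_records and split_http are made up for this example.

```python
import gzip


def iter_arc_records(path):
    """Yield (url, content_type, body) for every record in an arc file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as fh:
        while True:
            header = fh.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # blank line separating two records
            fields = header.decode("latin-1").split()
            if len(fields) < 5:
                raise ValueError("unexpected arc header: %r" % header)
            # Assumed v1 header: url ip date content-type length
            url, content_type, length = fields[0], fields[3], int(fields[-1])
            body = fh.read(length)
            yield url, content_type, body


def split_http(body):
    """Split a stored HTTP response into (raw headers, page content)."""
    headers, _, page = body.partition(b"\r\n\r\n")
    return headers, page
```

The first record of a file normally describes the arc itself (its URL starts with filedesc://), so skip it when you only want the crawled pages. For http records the body still carries the response headers, which is what split_http peels off.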

So, start crawling!!!