I have extracted 9 features that many people have used in their work. They were computed on a very modest corpus, not yet very broad, of 15 benign and 10 malicious JS samples.
The findings, plotted as graphs of feature value against file, can be found here. The graphs show the features of malicious scripts in red and those of benign scripts in blue.
1. average characters per line
2. average eval() argument length
3. string definition to string use ratio
4. # unicode characters
5. # lines in the script
6. % human readable characters
7. % white space in the script
8. # words in the script
9. dynamic execution calls
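To make the list above concrete, here is a minimal Python sketch of how most of these features could be computed with plain string operations and regular expressions. It is not the extractor from the svn branch. The function name `extract_features`, the dictionary keys, and the patterns are hypothetical, and several definitions are my assumptions: "human readable" is taken to mean printable ASCII plus whitespace, "unicode characters" is taken to mean `\uXXXX` or `%uXXXX` escape sequences, and "dynamic execution calls" is taken to mean calls to `eval`, `setTimeout`, `setInterval` and `Function`. The string definition to string use ratio is left out because it needs more than a regex to do properly.

```python
import re
import string

# Assumption: "human readable" = ASCII letters, digits, punctuation, whitespace.
READABLE = set(string.ascii_letters + string.digits +
               string.punctuation + string.whitespace)

# Naive pattern: captures an eval() argument that contains no parentheses.
EVAL_CALL = re.compile(r"\beval\s*\(([^()]*)\)")
# Assumption: these four calls are what counts as "dynamic execution".
DYNAMIC_CALL = re.compile(r"\b(?:eval|setTimeout|setInterval|Function)\s*\(")
# Assumption: "unicode characters" = \uXXXX or %uXXXX escape sequences.
UNICODE_ESCAPE = re.compile(r"(?:\\u|%u)[0-9a-fA-F]{4}")


def extract_features(source):
    """Return a dict of simple lexical features for one JS source string."""
    lines = source.splitlines() or [""]
    total = len(source) or 1  # avoid division by zero on empty input
    eval_args = EVAL_CALL.findall(source)
    return {
        "avg_chars_per_line": sum(len(l) for l in lines) / len(lines),
        "avg_eval_arg_length": (sum(len(a) for a in eval_args) / len(eval_args)
                                if eval_args else 0.0),
        "unicode_escapes": len(UNICODE_ESCAPE.findall(source)),
        "num_lines": len(lines),
        "pct_human_readable": 100.0 * sum(c in READABLE for c in source) / total,
        "pct_whitespace": 100.0 * sum(c.isspace() for c in source) / total,
        "num_words": len(source.split()),
        "dynamic_exec_calls": len(DYNAMIC_CALL.findall(source)),
    }


if __name__ == "__main__":
    sample = 'var a = 1;\neval("x+1");\n'
    for name, value in sorted(extract_features(sample).items()):
        print(name, value)
```

A regex like `EVAL_CALL` misses nested calls such as `eval(unescape(...))`, which is exactly the kind of assumption that would have to go before this could feed a real classifier.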
The results aren't very encouraging for all the features, but some of them, like the string definition to use ratio, % human readable characters, and % white space, offer some hope. The feature extraction needs to be improved so that it relies on fewer assumptions before it can serve as a proper extractor for the classifier.
The code for the feature extractor and the classifier may be accessed in my svn branch of phoneyc under njain-anomalydetection.
I would sincerely appreciate comments and reviews on the current feature extraction and classification work.
P.S. Truly speaking, this is my first attempt at regex, pickling, or, in short, programming for that matter. All thanks to my mentor for his able guidance and constant motivation.