Mining software licenses with cvsanaly and ohcount

During the last three weeks I’ve been diving into cvsanaly to refresh my Python skills. My first contributions were a couple of easy fixes, but now I’m finishing the integration of the ohcount tool, which detects the license used in source code files (see my previous entries about ohcount).

This afternoon, with 35ºC outside, I’m staying very close to the air conditioning while testing and cleaning up the code before submitting the patch to my colleague carlosgc. With this new extension we get a table that relates files, revisions and licenses. See the picture below.

Ohcount is a very interesting tool: we even realized we had incorrect headers in 31 source files of cvsanaly. The new extension allows us to detect these changes. For instance, the image below shows the different licenses over time for one of the cvsanaly files. As you can see, the file had two licenses (gpl and lgpl) before revision 609. That happened due to an incorrect header which mixed gpl and lgpl text together.
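To give an idea of how a table relating files, revisions and licenses can expose a mixed header like that one, here is a minimal sketch in Python. It does not use the real cvsanaly schema: the table name file_licenses, its columns, the file names and the sample rows are all invented for illustration, and the data lives in an in-memory SQLite database.

```python
import sqlite3
from collections import defaultdict


def build_demo_db():
    """Create an in-memory database with a HYPOTHETICAL file_licenses table.

    The schema and the rows are invented for this example; they are not
    the tables that the cvsanaly extension actually creates.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_licenses ("
        "file_path TEXT, revision INTEGER, license TEXT)"
    )
    rows = [
        ("example.py", 608, "gpl"),   # made-up rows: two licenses detected
        ("example.py", 608, "lgpl"),  # in the same revision of one file
        ("example.py", 609, "gpl"),   # a single license after the header fix
        ("other.py", 608, "gpl"),
    ]
    conn.executemany("INSERT INTO file_licenses VALUES (?, ?, ?)", rows)
    return conn


def mixed_license_revisions(conn):
    """Return (file_path, revision, licenses) for every file revision
    in which more than one distinct license was recorded."""
    found = defaultdict(set)
    query = "SELECT file_path, revision, license FROM file_licenses"
    for path, revision, license_name in conn.execute(query):
        found[(path, revision)].add(license_name)
    return sorted(
        (path, revision, sorted(licenses))
        for (path, revision), licenses in found.items()
        if len(licenses) > 1
    )


if __name__ == "__main__":
    for path, revision, licenses in mixed_license_revisions(build_demo_db()):
        print(path, revision, ", ".join(licenses))
```

Grouping by file and revision in Python keeps the result order deterministic. The same idea could be written as a single SQL query with GROUP BY and HAVING if the real table allows it.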

So, our plan is to integrate ohcount to study the licenses used in fresh code and to start looking for significant facts over time. I hope the code will be committed to git:// by the end of next week. In any case, drop me a mail if you are interested in it and I’ll let you know.


Written by sanacl

August 27, 2010 at 3:53 pm

