Libre Software People's Front

don't confuse it with People's Front of Open Source

Posts Tagged ‘cvsanaly’

How to get quantitative data from the Android source code (II)

with 2 comments

(Have a look at the previous post if you haven’t already.)

I recommend running the downloads inside screen, since they can take a couple of hours on a slow connection. Use a log file to check that everything was downloaded properly, and the mail command to notify you when the downloads finish.

../ > ../log_git_clone.txt 2>&1; mail -s "git clone fin" < ../log_git_clone.txt 
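The same log-then-notify idea can be wrapped in a small helper. This is only a minimal sketch: the function name run_logged is hypothetical, and it adds one thing the command above does not do, which is to write the exit status at the end of the log so that the mailed log says whether the job worked.

```shell
#!/bin/sh
# run_logged LOGFILE COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND with stdout and stderr sent to LOGFILE,
# appends the exit status as the last line, and returns that status.
run_logged() {
    log=$1
    shift
    status=0
    "$@" > "$log" 2>&1 || status=$?
    echo "exit status: $status" >> "$log"
    return "$status"
}
```

Inside a named screen session (started with screen -S clones, detached with Ctrl-a d and reattached with screen -r clones) it could be called as run_logged ../log_git_clone.txt sh ./clone_all.sh, where clone_all.sh stands in for whatever the download script is called on your machine.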

After using git clone to fetch all the git repositories used by Android, we can start analyzing the code with cvsanaly. Again, we will use a log file.

for i in $list
do
echo "------ ANALYSING $i" >> ../log-cvsanaly.txt
~/repos/cvsanaly/cvsanaly2 -u **** -p **** -d cvsanaly_android_lcanas $i >> ../log-cvsanaly.txt 2>&1
done
mail -s "cvsanaly finished" < ../log-cvsanaly.txt
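With more than a hundred repositories it helps to know which ones failed without reading the whole log. The following is a rough sketch of the same loop with failure tracking. The function name analyse_repos and the ANALYSER variable are assumptions made for illustration; ANALYSER is expected to hold the full cvsanaly2 command line you already use, without the repository argument.

```shell
#!/bin/sh
# analyse_repos LOGFILE REPO...
# Hypothetical helper: runs $ANALYSER on each repository, appends all output
# to LOGFILE, marks failures in the log and prints the number of failures.
analyse_repos() {
    log=$1
    shift
    failed=0
    for repo in "$@"; do
        echo "------ ANALYSING $repo" >> "$log"
        # $ANALYSER is left unquoted on purpose so it can carry options.
        if ! $ANALYSER "$repo" >> "$log" 2>&1; then
            echo "------ FAILED $repo" >> "$log"
            failed=$((failed + 1))
        fi
    done
    echo "$failed"
}
```

Afterwards, grep FAILED on the log file lists the repositories that need a second look.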

At this point we have a single MySQL database with all the information from the 167 Android repositories. The next step is to use this information to answer some questions. In this introductory study we are going to examine the activity of the project over time (in terms of commits), split between Google staff and everyone else. We will assume that Google employees use a user id containing @google or @android; that is how we will divide the committers into two groups.
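The two-group rule can be tried out on plain text before touching the database. The sketch below is not the cvsanaly query: it assumes input lines of the form "YYYY-MM email", and the function name count_by_group is made up for this example. For a single repository such lines could come from git log --date=format:%Y-%m --format='%ad %ae', which uses the author address and may differ from what cvsanaly stores.

```shell
#!/bin/sh
# count_by_group: reads "YYYY-MM email" lines on stdin and prints one line
# per month: month, commits from @google/@android addresses, other commits.
# Months come out in first-seen order, so sort the input beforehand.
count_by_group() {
    awk '
    {
        month = $1
        email = tolower($2)
        if (!(month in seen)) { seen[month] = 1; order[++n] = month }
        if (email ~ /@google/ || email ~ /@android/) g[month]++
        else r[month]++
    }
    END {
        for (i = 1; i <= n; i++) {
            m = order[i]
            printf "%s %d %d\n", m, g[m] + 0, r[m] + 0
        }
    }'
}
```

Note that a line with an empty address lands in the second group, so this sketch counts such commits as non-Google contributions.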

The first R commands below create the connection with the mysql database and obtain the variables comm and googlers which contain the number of commits per month and domain.

> library(RMySQL)
Loading required package: DBI
> con <- dbConnect( MySQL(), user="***", password="***", dbname="cvsanaly_android23_lcanas" )
> comm <- dbGetQuery(con, "select count( as comm from scmlog join people on ( 
where date >= '2008-10-21 00:00:00' and not like '' and not like '' 
group by date_format(, '%Y %m') order by date_format(, '%Y %m') asc;")

> googlers <- dbGetQuery(con, "select count( as googlers from scmlog join people on ( 
where date >= '2008-10-21 00:00:00' and ( like '' or like '' )
group by date_format(, '%Y %m') order by date_format(, '%Y %m') asc;")

We join the information from Google employees and the rest of the contributors. We also need the list of months, which will serve as the x axis of the chart we will generate.

> mymatrix2<-cbind(googlers,comm)

> months <- dbGetQuery(con, "select date_format(, '%m/%y') as month from scmlog join people 
on ( where date >= '2008-10-21 00:00:00' and not like '' and not like '' group by date_format(, '%Y %m') order by date_format(, '%Y %m') asc;")

The last step is to generate the chart and save it to a file.

> barplot(t(mymatrix2),names.arg=t(months),ylab="commits",legend.text=c("Google employees","Rest"),col=c("dark green","grey"))

> savePlot(filename="android-commits-domains.png", type="png")

Voilà, based on the software history of the Android project we have generated a view of the activity around the code in terms of commits over time.

This basic process should be improved to obtain more accurate results. For instance, some Google employees committed code using an empty mail address, so the contribution from non-Google contributors looks bigger than it is. It will also be necessary to analyze the Linux kernel together with the rest of the Android code to get a wider view of the effort invested by the Android community. Many different questions can shed light on how the different communities work; in the last two posts we’ve seen one method to start a quantitative study aimed at answering some of them.

Written by sanacl

December 31, 2010 at 2:07 am

How to get quantitative data from the Android source code (I)

with one comment

One of my targets for 2011 is to make the process of obtaining quantitative data from open source projects as easy as possible. We have developed several tools for that purpose, but they still need a lot of love to be really user-friendly and stable. In the following two posts I’ll show you how to get basic data from FLOSS projects using the source code repository. In this example we will study the code provided by Android, using cvsanaly to get data from the repositories and R to create a couple of charts.

The Google developers created a tool called repo to deal with the different git repos that they are using in Android. I don’t like to install tools that I won’t use so I’ll bypass it with a couple of bash commands.

The repo command uses the git:// as its starting point, so after cloning this repository you’ll see that it contains an XML file called default.xml with the following content:

  <project path="system/bluetooth" name="platform/system/bluetooth" />
  <project path="system/core" name="platform/system/core" />
  <project path="system/extras" name="platform/system/extras" />
  <project path="system/netd" name="platform/system/netd" />
  <project path="system/vold" name="platform/system/vold" />
  <project path="system/wlan/ti" name="platform/system/wlan/ti" />

The XML code above shows only some of the 159 references to git repositories. Without the repo command created by Google, developers would have to download them one by one or with a script. We will use awk and a simple bash script to extract them from the XML file and download them in one go.

$ list=`cat default.xml |awk -F '"' '{print $4}'|grep -v '^$'|grep -v "UTF-8"|grep -v "Makefile$"`
$ for i in $list
do
j=`echo $i|sed 's:/:_:g'`
echo git clone git://$i $j >>
done
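As an aside, the names that the awk pipeline above should yield can be checked against a stricter extraction. This is a sketch under assumptions: it swaps awk for sed, reads only the name attribute of project elements, and the function names list_projects and flat_name are invented here.

```shell
#!/bin/sh
# list_projects MANIFEST
# Prints the name attribute of every <project> element, one per line.
# Other elements, and the XML declaration, are ignored.
list_projects() {
    sed -n 's/.*<project[^>]* name="\([^"]*\)".*/\1/p' "$1"
}

# flat_name NAME
# Turns a project name into a directory name: slashes become underscores.
flat_name() {
    printf '%s\n' "$1" | tr '/' '_'
}
```

Running list_projects default.xml | wc -l gives a quick count to compare with the number of repositories you expect, and flat_name platform/system/core prints platform_system_core, the same mapping as the sed call in the loop.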

Now, just edit the file and add the following lines at the beginning, and we have a script to download the Android repositories. Don’t forget to give it execute permission.

echo "getting android repos"

Easy, isn’t it? The next step is to run the script to download the 159 git repositories and, in the meantime, install cvsanaly. It has to be installed from source, but do not panic, it is straightforward.

At this point you are ready to start playing with the raw data extracted from all the git repositories in a single relational database. Stay tuned, the second chapter is coming soon.

UPDATE: the new release, Android 2.3, which was published a couple of days ago, uses 167 git repositories.

Read the second part

Written by sanacl

December 17, 2010 at 8:52 am

Mining software licenses with cvsanaly and ohcount


During the last three weeks I’ve been diving into cvsanaly to refresh my Python skills. My first contributions have been a couple of easy fixes, but now I’m finishing the integration of the ohcount tool, which detects the license used in source code files (see my previous entries about ohcount).

This afternoon with 35ºC outside I’m very close to the air conditioning while testing and cleaning up the code before submitting the patch to my colleague carlosgc. With this new extension we get a table which relates files, revisions and licenses. See the picture below.

Ohcount is a very interesting tool; thanks to it we even realized we had incorrect headers in 31 source files of cvsanaly. The new extension allows us to detect these changes. For instance, the image below shows the different licenses over time for one of the cvsanaly files. As you can see, the file had two licenses (gpl and lgpl) before revision 609. That happened because of an incorrect header which mixed gpl and lgpl text together.

So, our plan is to integrate ohcount to study the licenses used in the fresh code and to start looking for significant facts over time. I hope the code will be committed to git:// by the end of next week. In any case, drop me a mail if you are interested in it and I’ll let you know.

Written by sanacl

August 27, 2010 at 3:53 pm