Libre Software People's Front

don't confuse it with People's Front of Open Source

Posts Tagged ‘data mining software

Bicho 0.9 is comming soon!

leave a comment »

During last months we’ve been working to improve Bicho, one of our data mining tools. Bicho gets information from remote bug/issue tracking systems and store them in a relational database.

How Bicho works

Bicho

The next release of Bicho 0.9 will also include incremental support, which is something we’ve missed for flossmetrics and for standalone studies with a huge amount of bugs. We also expect that more backends will be created easily with the improved backend model created by Santi Dueñas. So far we support JIRA, Bugzilla and Sourceforge. For the first two ones we parse HTML + XML, for sourceforge all we have is HTML so we are more dependent from the layout (to minimize that problem we use BeautifulSoup). We plan to include at least backends for FusionForge and Mantis (which is partially written) during this year.

Bicho is being used currently in the ALERT project (still in the first months) where all the information offered by the bug/issue reports will be related to the information available in the source code repositories (using CVSAnaly) through semantic analysis. That relationship will allow us to help developers through recommendations and other more pro-active use cases. One of my favorites is to recommend a developer to fix a bug through the analysis of the stacktraces posted in a bug. In libre software projects all the information is available in the internet, the main problem (not a trival one) is that it is available in very different resources. Using bicho against the bts/its we can get the part of the code (function name, class and file) that probably contains the error and the version of the application. That information can be related to the one got from the source code repository with cvsanaly, in this case we would need to find out who is the developer that edit that part of the code more often. This and other uses cases are being defined in the ALERT project.

If you want to stay tunned to Bicho have a look at the project page at http://projects.libresoft.es/projects/bicho/wiki or the mailing list libresoft-tools-devel _at__ lists.morfeo-project.org

[This entry is part of the work I do in LibreSoft and it is also available in my blog at libresoft.es]

Written by sanacl

June 9, 2011 at 3:00 pm

Study of the Android development activity and its authors

leave a comment »

Libre software is changing the way applications are built by companies, while the traditional software development model does not pay attention to external contributions, libre software products developed by companies benefit from them. These external contributions are promoted creating communities around the project and will help the company to create a superior product with a lower cost than possible for traditional competitors. The company in exchange offers the product free to use under a libre software license.

Android is one of these products, it was created by Google a couple of years ago and it follows a single vendor strategy. As Dirk Riehle introduced some time ago it is a kind of a economic paradox that a company can earn money making its product available for free as open source. But companies are not NGOs, they don’t give away money without expecting something in return, so where is the trick?

As a libre software project Android did not start from scratch, it uses software that would be unavailable for non-libre projects. Besides that, it has a community of external stakeholders that improve and test the latest version published, help to create new features and fix errors. It is true that Android is not a project driven by a community but driven by a single vendor, and Google does it in a very restricted way. For instance external developers have to sign a Grant of Copyright License and they do not even have a roadmap, Google publish the code after every release so there are big intervals of time where external developers do not have access to the latest code. Even with these barriers there are a significant part of the code that is being provided from external people, it is done directly for the project or reused from common dependencies (GIT provides ways to reuse changes done to remote repositories).

Commits by domain per month (proportional)

Commits by domain per month (proportional)

Commits by domain per month (total)

Commits by domain per month (total)

The figures above reflect the monthly number of commits done by people split up in two, in green colour commits from mail domains google.com or android.com, the study assumes that these persons are Google employees. On the other hand in grey colour the rest of commits done by other mail domains, these ones belong to different companies or volunteers.

According to the first figure (on the left), which shows the proportion of commits, during the first months that were very active (March and April 2009) the number of commits from external contributors was similar to the commits done by Google staff. The number of external commits is also big in October 2009, when the total amount of commits reached its maximum. Since April 2009 the monthly activity of the external contributors seems to be between 10% and 15%.

The figure on the left provides a interesting view of the total activity per month, two very interesting facts here: the highest peak of development was reached during late 2009 (more than 8K commits per month during two months). The second is the activity during the last months, as it was mentioned before the Google staff work in private repositories so until they publish the next version of Android, we won’t see another peak of development (take into account that commits in GIT will modify the history when the code is published, thus the last months in the timeline will be overwritten during the next release)

Commits by domain

Commits by domain

More than 10% of the commits used by Google in Android were committed using mail domains different to google.com or android.com. At this point the question is: who did it?

(Since October 2008)

# Commits Domain
69297 google.com
22786 android.com
8815 (NULL)
1000 gmail.com
762 nokia.com
576 motorola.com
485 myriadgroup.com
470 sekiwake.mtv.corp.google.com
422 holtmann.org
335 src.gnome.org
298 openbossa.org
243 sonyericsson.com
152 intel.com

Having a look at the name of the domains, it is very surprising that Nokia is one of the most active contributors. This is a real paradox, the company that states that Android is its main competition helps it!. One of the effects of using libre software licenses for your work is that even your competition can use your code, currently there are Nokia commits in the following repositories:

  • git://android.git.kernel.org/platform/external/dbus
  • git://android.git.kernel.org/platform/external/bluetooth/bluez

This study is a ongoing process that should become a scientific paper, if you have feedback please let me know.

CVSAnalY was used to get data from 171 GIT repositories (the Linux kernel was not included). Our tool allow us to store the metadata of all the repositories in one SQL database, which helped a lot. The study assumes that people working for Google use a domain @google.com or @android.com.

References:

[This entry is part of the work I do in LibreSoft and it is also available in my blog at libresoft.es]

Written by sanacl

April 16, 2011 at 5:29 pm

Mining software licenses with cvsanaly and ohcount

leave a comment »

During the last three weeks I’ve been diving into cvsanaly to refresh my python skills. My first contributions have been a couple of easy fixes but now I’m finishing the integration of the ohcount tool which detects the license used in source code files ( see my previous entries about ohcount ).

This afternoon with 35ºC outside I’m very close to the air conditioning while testing and cleaning up the code before submitting the patch to my colleague carlosgc. With this new extension we get a table which relates files, revisions and licenses. See the picture below.

Ohcount is a very interesting tool, we even realized we had incorrect headers in 31 source files of cvsanaly. The new extension allow us to detect these changes. For instance the image below reflects the different licenses over time on one of the cvsanaly files, as you can see the file had two licenses (gpl and lpgl) before revision 609. That happened due to a incorrect header which mixed gpl and lgpl text together.

So, our plan is to integrate ohcount to study the licenses used in the fresh code and start studying if there are significant facts over time. I hope the code will be committed to git://git.libresoft.es/git/cvsanaly by the end of next week, in any case drop me a mail if you are interested on it and I’ll let you know.

Written by sanacl

August 27, 2010 at 3:53 pm