With Next Generation Sequencing (NGS) data coming of age and being routinely used, evolutionary biology is transforming into a data-driven science. As a consequence, researchers have to rely on a growing number of increasingly complex software tools. All widely used tools in our field have grown considerably, in terms of both the number of features and lines of code. In addition, analysis pipelines now include substantially more components than 5-10 years ago. A topic that has received little attention in this context is the code quality of widely used tools. Unfortunately, the majority of users tend to blindly trust software and the results it produces. We therefore assessed the code quality of 15 highly cited tools (e.g., MrBayes, MAFFT, SweepFinder) from the broader area of evolutionary biology that are used in current data analysis pipelines. We also discuss little-known problems associated with floating point arithmetic for representing real numbers on computer systems. Since the software quality of the tools we analyzed is rather mediocre, we provide a list of best practices for improving the quality of existing tools, and also list techniques that can be deployed for developing reliable, high-quality scientific software from scratch. Finally, we discuss journal and science policy as well as funding issues that need to be addressed to improve software quality and to ensure support for developing new and maintaining existing software. Our intention is to raise the awareness of the community regarding software quality issues and to emphasize the substantial lack of funding for scientific software development.
Thanks for posting this. We have recently uploaded another rather critical paper that deals with global pairwise sequence alignment: http://biorxiv.org/content/early/2015/11/12/031500 It might be of interest to those who regularly teach bioinformatics courses.
A timely article. It prompted me to try FindBugs on my Java code. I was pleasantly surprised at how easy it was to install and use. It didn’t find anything ‘scary’, but it did flag several things that could easily be made more robust. It made me feel a bit silly for not getting round to doing it earlier.
I noticed you use gcc -Wall. -Wall does not turn on all warnings as its name suggests. I usually use gcc -Wall -Wextra -pedantic.
Great that this was useful to you. Regarding gcc, you are right, and that is what we use for RAxML etc. However, for new codes that we develop from scratch we use the clang compiler, which yields substantially more warnings and finds type issues that gcc doesn’t. So for C/C++, clang is definitely the compiler to use for development in the future.
All the best,