Version control keeps a complete history of your work on a given project. It facilitates collaboration – everyone can work freely on any part of the project without overwriting others’ changes. You can move between past versions and roll back when needed. You can also review your project’s history through commit messages describing each change and see exactly what changed in the content. You can see who made the changes and when they happened.
Version control is a powerful tool and a fundamental practice in software development. When coupled with a code hosting service, it also makes it easy to accept contributions from outside collaborators. Version control benefits both individuals and teams and should be adopted in almost all projects.
- Introduction to version control from MolSSI’s Best Practices Workshop
- Software Carpentry Version Control with Git
- GitHub 15 Minutes to Learn Git
- Git Commit Best Practices
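The basic cycle these tutorials cover – create a repository, stage a change, commit it with a descriptive message, and review the history – can be sketched as follows (the directory, file name, and author identity here are invented for illustration):

```shell
# Create a fresh repository in a temporary directory
repo=$(mktemp -d)
cd "$repo"
git init -q

# Stage and commit a first file with a descriptive message
echo "print('hello')" > analysis.py
git add analysis.py
git -c user.name="Jane Doe" -c user.email="jane@example.com" \
    commit -q -m "Add initial analysis script"

# Review the project history
git log --oneline
```

In a real project, each logical change gets its own commit, so the log reads as a narrative of the project’s development.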
In many cases, you will want to share your code with others. When working on a software project, there is a good chance that more than one person is contributing to the same codebase, and sharing code becomes critical for that collaboration. For open source software, sharing code also gives the public access to the code for reviewing, testing, and contributing to it.
At the same time, sharing code alongside published papers helps others understand it, increasing reproducibility, reusability, and extensibility. It also enables others to cite the software and credit its authors when they use it.
Tutorials for sharing code with GitHub
Software should be tested regularly throughout the development cycle to ensure correct operation. Thorough testing is often treated as an afterthought, but for larger projects it is essential for ensuring that changes in one part of the code do not negatively affect other parts.
Two main types of testing are strongly encouraged:
- Regression tests – given a known input, does the software consistently return the same, correct values?
- Unit tests – Similar to general testing, except testing is done on much smaller units (such as single functions or classes). This is helpful for catching errors in uncommonly-used parts of the code which may be skipped in general testing. Unit tests can be added as new features are added, resulting in better code coverage.
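As a minimal sketch of a unit test, the following exercises one small function in isolation against known values (the function and numbers here are invented for illustration):

```python
# A small function under test (hypothetical example)
def celsius_to_kelvin(temp_c: float) -> float:
    """Convert a temperature from Celsius to Kelvin."""
    return temp_c + 273.15


# A unit test checks that one function returns known correct values.
def test_celsius_to_kelvin():
    assert celsius_to_kelvin(0.0) == 273.15
    assert celsius_to_kelvin(-273.15) == 0.0


test_celsius_to_kelvin()
print("unit tests passed")
```

A test framework such as pytest can discover and run functions named `test_*` automatically, so tests like this accumulate alongside new features.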
Continuous integration (CI) automatically builds your code, runs tests on a variety of different platforms, and deploys all manner of builds and documentation as desired. Typically this is run when new code is proposed (e.g. through GitHub Pull Requests) or committed to the repository. CI is useful for catching bugs before they reach your end users, and for testing on platforms that are not available to every developer.
CI can be broken down into several stages; most CI pipelines should at least build the code and then run unit tests.
- Build stage – takes the source code and performs any compilation and dependency resolution/installation needed for the next stage. Compiled languages like C++ and Rust require this step to turn the source code into executables. Interpreted languages like Python or R usually do not need an explicit compilation step, but still typically need to install dependencies.
- Unit test stage – runs a series of tests to ensure the code works as expected, free of syntactical or logical errors. Most, if not all, codes should have these.
- Regression test stage – codes where accuracy matters (especially in the scientific field) should also compare results against known computed values. Regression tests can take significantly longer than unit tests and may need to be relegated to infrequent CI runs, or handled through a separate means.
- Deploy stage – takes compiled and verified code and pushes it to the appropriate branch or service to make it available. Deployment can also include things such as documentation pages, APIs, and experimental/nightly builds.
GitHub itself now provides a CI service for its repositories called “GitHub Actions” which can be configured to run with most repos. However, there are also many other CI services, most of which have webhooks for integration with GitHub. There are also CI services for non-GitHub based code repositories.
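As a hedged sketch of what such a configuration looks like, a minimal GitHub Actions workflow for a Python project (stored at `.github/workflows/ci.yml`; the package layout and Python version are placeholders) might be:

```yaml
name: CI
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      # Build stage: install the package and its dependencies
      - run: pip install .
      # Unit test stage
      - run: pip install pytest && pytest
```

Real workflows often add a matrix of operating systems and language versions, plus a separate deploy job.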
Examples of CI Software/Services
- Web Based Services
- Self Hosted Options
Code that lives beyond its initial development will be read far more often than it is written as the project is maintained and new features are added. Establishing and following a standard style in your projects will increase readability, make maintenance easier, and can reduce onboarding time for new developers.
While code style can be personal, languages usually have at least a few dominant coding styles which are familiar to most programmers in that language. When programming in Python, the most commonly followed style is some variation of PEP 8. In Python, you might also consider adopting type hinting for large projects. Documentation embedded in the code through documentation strings or comments is a crucial aspect of code style you should also establish for your projects.
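For instance, a short function written in a PEP 8-compliant style with type hints and a brief docstring (the function itself is an invented example):

```python
def mean_distance(distances: list[float]) -> float:
    """Return the arithmetic mean of a list of distances."""
    if not distances:
        raise ValueError("distances must not be empty")
    return sum(distances) / len(distances)


print(mean_distance([1.0, 2.0, 3.0]))  # 2.0
```

The type hints document the expected inputs and outputs for both human readers and static checkers, while the docstring carries the embedded documentation mentioned above.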
Automatic formatting tools can enforce a particular coding style and are often configurable for each project.
Examples of coding style guides:
The importance of documentation in an organization is often determined by multiple factors, including the adopted software development practices (waterfall, agile, etc.) and the size of the software being documented. Regardless, documentation is a showcase for the software: it reflects the project’s health, and regular updates mirror the liveliness of its ecosystem.
Ideally, the documentation not only offers brief and informative guidelines to help busy users achieve their goals rapidly but also detailed user/developer manuals to provide deeper insights into the software infrastructure. The former type of documentation is often titled Getting Started, Quick Guide, 10 Minutes to …, etc. The 10 minutes to Pandas and 10 minutes to Dask are great examples of such documents. The documentation can also be complemented by short video tutorials or brief blog posts with practical examples to further help the users.
The developer documentation, on the other hand, involves several more detailed components:
- Build requirements and dependencies
- How to compile/build/test/install
- How to use the software
- More detailed practical examples
In addition, the developer guides should also delineate the application programming interface (API), which paves the way for developer community support and collaboration to further implement and maintain pieces of the software infrastructure. The API reference often documents various internal files, function and class signatures, as well as the reasoning behind the naming conventions and certain adopted designs. Mature scientific and engineering libraries such as the oneAPI Math Kernel Library (oneMKL) and oneAPI Deep Neural Network (oneDNN) library from Intel provide great examples of this class of documentation.
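As an illustration of API-reference documentation embedded in code, here is a function with a NumPy-style docstring, which tools such as Sphinx (with the napoleon extension) can render into a reference page (the function and its names are hypothetical):

```python
import math


def bond_length(a: tuple[float, float, float],
                b: tuple[float, float, float]) -> float:
    """Compute the Euclidean distance between two atomic positions.

    Parameters
    ----------
    a, b : tuple of float
        Cartesian coordinates (x, y, z) of each atom.

    Returns
    -------
    float
        The distance between ``a`` and ``b``.
    """
    return math.dist(a, b)


print(bond_length((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))  # 5.0
```

Keeping signatures and docstrings together in the source makes it easier for the API reference to stay in sync with the code.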
The documentation should be kept up to date with changes in the code, which is not an easy task, especially for large and fast-moving code bases paired with agile software engineering practices. However, slightly out-of-date documentation is generally preferable to no documentation. It is recommended that the examples provided within the documentation be compiled and tested regularly in order to maintain the quality and usefulness of the documentation over time.
Popular documentation packages:
- C/C++/Fortran – Doxygen
- Python – Sphinx and Read The Docs
- Fortran – Ford and Doxygen
- Julia – Documenter.jl
Examples of good documentation:
Generally, at least part of most software must be compiled, and doing this in a clean (and possibly cross-platform) way is not trivial.
However, having a somewhat standard build system makes uptake by new users and developers much easier and makes it more likely that the code will be maintained in the future. Therefore, use of common build systems is encouraged. For most compiled C/C++/Fortran code encountered in computational chemistry, CMake is recommended.
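As a minimal sketch (the project and file names are placeholders), a `CMakeLists.txt` for a small C++ code might look like:

```cmake
cmake_minimum_required(VERSION 3.15)
project(mycode VERSION 0.1 LANGUAGES CXX)

# Build an executable from the listed source files
add_executable(mycode src/main.cpp)
target_compile_features(mycode PRIVATE cxx_std_17)
```

With this in place, anyone can build the project with the standard `cmake -B build && cmake --build build` invocation, which is exactly the familiarity benefit described above.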
Software quality depends on many factors, such as functionality, usability, performance, reliability, portability, interoperability, scalability, and reusability (see the full description here).
There are many aspects that contribute to a good design and to the quality of your software. An important one is to follow established best practices and give thought to the design of your software. Luckily, many experienced programmers have developed such best practices over a substantial period of time, and they can help inexperienced developers learn software design more easily and quickly.
The first thing you can learn that will immediately improve the quality of your software is to follow the SOLID Principles of Software Design. Following these 5 principles will result in more understandable, flexible, and maintainable code. You can read more here:
- Dev IQ: The SOLID principles of Object Oriented Design
- The Team Coder: SOLID Principles of Software Design by Examples
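To give one concrete instance, the first of the five, the Single Responsibility Principle, says each class should have exactly one reason to change. A hedged Python sketch (the classes and data are invented for illustration):

```python
# Instead of one class that parses data, analyzes it, and writes
# reports, each responsibility lives in its own focused class.

class TrajectoryParser:
    """Responsible only for reading raw trajectory data."""

    def parse(self, text: str) -> list[float]:
        return [float(x) for x in text.split()]


class EnergyAnalyzer:
    """Responsible only for computing a summary statistic."""

    def average(self, energies: list[float]) -> float:
        return sum(energies) / len(energies)


parser = TrajectoryParser()
analyzer = EnergyAnalyzer()
print(analyzer.average(parser.parse("1.0 2.0 3.0")))  # 2.0
```

If the file format changes, only `TrajectoryParser` changes; if the statistics change, only `EnergyAnalyzer` does – each class has a single reason to change.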
Design Patterns are well-thought-out, reusable solutions to common recurring problems that developers face during software development. They also provide a common vocabulary among experienced developers. Design Patterns are general and can be applied in any programming language. The following are some references to get you started.
- Python Design Patterns: For Sleek And Fashionable Code
- Design Pattern in Python (Github examples)
- A General tutorial on Design Patterns
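As one concrete example, the Strategy pattern lets a caller swap an algorithm at runtime behind a common interface. A hedged Python sketch (the integration schemes and class names are invented for illustration):

```python
from typing import Callable

# Each "strategy" is an interchangeable numerical integration scheme
# sharing the same signature.
def rectangle_rule(f: Callable[[float], float],
                   a: float, b: float, n: int) -> float:
    h = (b - a) / n
    return sum(f(a + i * h) for i in range(n)) * h


def trapezoid_rule(f: Callable[[float], float],
                   a: float, b: float, n: int) -> float:
    h = (b - a) / n
    return sum((f(a + i * h) + f(a + (i + 1) * h)) / 2
               for i in range(n)) * h


class Integrator:
    """Holds a strategy and applies it, unaware of its details."""

    def __init__(self, strategy):
        self.strategy = strategy

    def integrate(self, f, a, b, n=1000):
        return self.strategy(f, a, b, n)


# The caller picks a strategy without changing Integrator itself.
result = Integrator(trapezoid_rule).integrate(lambda x: x, 0.0, 1.0)
print(round(result, 6))  # 0.5
```

Swapping in `rectangle_rule` requires no change to `Integrator` or its callers – that decoupling is the point of the pattern.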
Object Oriented Programming (OOP):
Object Oriented Programming (OOP) is a method of structuring functions and data into objects that can help organize software projects. It has a number of advantages, including improved reusability and maintainability. Using OOP is highly encouraged in large-scale projects.
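A brief sketch of the idea in Python (the classes are invented for illustration): data and the functions that operate on it live together in an object, and subclasses reuse the parent’s behavior rather than duplicating it:

```python
class Particle:
    """Bundles data (mass, position) with behavior that uses it."""

    def __init__(self, mass: float, position: float):
        self.mass = mass
        self.position = position

    def momentum(self, velocity: float) -> float:
        return self.mass * velocity


class ChargedParticle(Particle):
    """Reuses Particle's data and methods, adding only what is new."""

    def __init__(self, mass: float, position: float, charge: float):
        super().__init__(mass, position)
        self.charge = charge


p = ChargedParticle(mass=2.0, position=0.0, charge=-1.0)
print(p.momentum(3.0))  # 6.0
```

Here `ChargedParticle` inherits `momentum` unchanged, which is the kind of reuse and maintainability benefit described above.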
Containerization is a tool that allows you to launch and work with many different environments (known as “containers”) on a single computer, each of which might be running a different operating system, a different set of installed libraries, different environment variables, etc. Each container operates in isolation from the others; effectively, you can think of each container as a totally different computer, which just happens to be sharing the same hardware as the other containers. Although similar in some ways to Virtual Machines, containerization is based on fundamentally different technology, and is generally much easier to set up and has a dramatically smaller performance cost. For most purposes, the performance cost of using containers is negligible.
This opens up all sorts of possibilities. Do you own a Windows machine, but want to test a code on Linux? No problem – just launch a Linux container on your Windows machine, and test away! Are you having trouble reproducing a bug someone else has encountered, and suspect the problem might be dependent on some detail of the runtime environment? There’s no need to mess with (and potentially break) your own environment in pursuit of the bug – just try some tests in a few containers, leaving your own environment unchanged. In fact, because of the clean isolation and reproducibility of environments that is provided by containerization, anything you can do in a container should probably be done in a container.
Moreover, you can easily build and deploy containers. For example, you could build a code you are developing (along with all of its dependencies) within a container, and then deploy the container. Because the container contains your compiled software and everything needed to run it, including the operating system, the end user doesn’t need to install anything on their system. All they need to do is launch your container and start running calculations.
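As a hedged sketch of this deployment pattern (the base image, paths, and module name are placeholders), a Dockerfile for a Python code might look like:

```dockerfile
# Start from a base image that provides the OS and Python
FROM python:3.12-slim

# Copy the project in and install it with its dependencies
COPY . /app
RUN pip install /app

# Default command the end user runs via `docker run`
CMD ["python", "-m", "mycode"]
```

Building and publishing this image packages the operating system, the dependencies, and the compiled software together, so the end user only needs the container runtime.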
By far the most commonly used containerization tool is Docker, which is what we recommend getting started with. It is important to note that using Docker effectively requires root access. HPC centers are never going to give you root access to their expensive machines, which precludes Docker from use in an HPC context. Fortunately, there are several containerization alternatives that have been developed specifically for use on HPC machines, with the most prominent being Apptainer. If you are interested in using containerization on an HPC machine, ask the organization that operates the machine about which containerization solution(s) they recommend.
Recommended Software (not usable with HPC):
Recommended Hosting Service:
Alternatives for HPC: