3. Molecular Simulation Design Framework (MoSDeF)

As shown in Figure 1, performing a molecular simulation, whether MD or MC, requires multiple steps: building an initial configuration of the system, selecting and applying a force field, generating a syntactically correct input file (or files) for a target simulation engine, equilibrating the system (to relax it from its initial configuration, e.g., a crystal, to a configuration characteristic of equilibrium, e.g., a liquid), performing a production run to generate a trajectory, and analyzing the trajectory (e.g., averaging over the trajectory to compute thermodynamic and/or structural properties, visualizing the system, etc.). Reliability and statistics are often improved by running multiple independent trajectories using the same workflow. Accomplishing these steps in a way that is both accurate and reproducible can be a significant challenge. For example, the application of a force field is a frequent source of error in simulations; for a system composed of moderately complex molecules (such as an ionic liquid), the force field can have a hundred or more parameters that must be provided, offering multiple opportunities for errors (e.g., use of incompatible units, use of parameter values from a publication containing a typographical error, incorrect application of parameters due to logic errors or to an ambiguous definition of parameter usage, etc.). While the use of a community-developed, open-source simulation engine may help to reduce the likelihood of fundamental errors in the algorithms underlying the simulations, such codes cannot necessarily prevent users from providing parameters that are inconsistent with the intended usage.
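One of the error sources named above, the use of incompatible units, can in principle be caught programmatically if every parameter carries its unit explicitly. The following is a minimal sketch of that idea and is not part of MoSDeF; the `Quantity` class, the `lj_parameters` function, the conversion tables, and the numerical values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Conversion factors to a common internal unit system (kJ/mol and nm).
ENERGY_TO_KJ_PER_MOL = {"kJ/mol": 1.0, "kcal/mol": 4.184}
LENGTH_TO_NM = {"nm": 1.0, "angstrom": 0.1}


@dataclass(frozen=True)
class Quantity:
    """A numeric value tagged with its unit (hypothetical helper)."""
    value: float
    unit: str

    def to_internal(self, table):
        # Fail loudly on an unknown or mismatched unit instead of guessing.
        if self.unit not in table:
            raise ValueError(f"unsupported unit {self.unit!r}")
        return self.value * table[self.unit]


def lj_parameters(epsilon, sigma):
    """Return Lennard-Jones (epsilon, sigma) converted to kJ/mol and nm."""
    return (epsilon.to_internal(ENERGY_TO_KJ_PER_MOL),
            sigma.to_internal(LENGTH_TO_NM))


# Illustrative values only, not taken from any published force field.
eps, sig = lj_parameters(Quantity(0.1, "kcal/mol"), Quantity(3.5, "angstrom"))
```

With this arrangement, accidentally passing a length where an energy is expected raises a `ValueError` at setup time rather than silently producing a wrong simulation input.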
Typically, many of these steps are performed within a given research group by a single graduate student, often making use of ad hoc, in-house software, even if open-source simulation engines are used. This approach has several shortcomings that can make simulations more prone to error, limit extensibility, and hamper reproducibility. For example, the various tools used to accomplish these steps may be only loosely coupled and may require the user to manipulate, edit, and/or modify the tools and/or data. This manipulation may introduce errors and make it difficult to reproducibly capture the exact procedures employed. The need for human manipulation may also limit the ability to use such workflows in applications that require automation, such as parameter screening studies, or within the context of larger workflows (e.g., to predict phase equilibrium within a process simulator). The use of in-house software itself, which is typically not open-source or freely available, creates numerous roadblocks as well. Someone wishing to reproduce a simulation would be required to write their own software to accomplish the same tasks. Developing such software may be time consuming, and publications often do not provide sufficient detail regarding the procedures used to initialize and parameterize simulations. Furthermore, without access to the original source code, it is not possible to ascertain the quality of the software; that is, to know whether it has undergone sufficient validation or whether there are errors and bugs that ultimately impact the accuracy of the reported results.
The Molecular Simulation Design Framework (MoSDeF)32 is designed to address these issues of automation/efficiency, accuracy, and reproducibility in molecular simulation. MoSDeF is an open-source Python library built upon the scientific Python software stack with three major components. Two of these are mBuild (for constructing initial configurations of systems) and foyer (for applying force fields). The third component, GMSO (General Molecular Simulation Object), is currently under development and is designed to be a general, flexible way of encapsulating the information required to define a simulation topology in a simulation-engine-agnostic manner. All of the capabilities of MoSDeF are scriptable, thus making the tools inherently reproducible, as well as suitable for automated calculations (e.g., screening). MoSDeF is implemented as a set of composable/modular tools, where each “subpackage” (i.e., module) is designed such that it can be used within MoSDeF or as a standalone package, allowing MoSDeF to more easily integrate with other community efforts. Compared with a monolithic approach, this design also makes the framework easier to modify, test, and extend, and less prone to bugs. Performing a simulation using MoSDeF, combined with dissemination of simulation scripts on a service such as GitHub, enables a molecular simulation to be published as a TRUE (transparent, reproducible, usable by others, and extensible) simulation33.
MoSDeF has its origins in a decade of National Science Foundation (NSF)-supported collaborative research at Vanderbilt University involving researchers from chemical engineering and computer science34–36, the latter affiliated with the Institute for Software Integrated Systems (ISIS)37. ISIS is a leading academic software engineering research center and is the originator of the concept of model-integrated computing (MIC)38. MIC is a systems engineering approach that focuses on the creation of domain-specific modeling languages to capture the essential features of the individual components of a given process, at the level of abstraction that is appropriate for the end users. Because of this abstraction, processes are described at a meta level that allows tasks to be coupled together to execute scientific or engineering workflows. MIC has been deployed in applications as diverse as managing auto assembly lines and processing health records. MIC design principles, domain-specific modeling languages, and the general philosophy of abstraction have shaped the development of MoSDeF. In particular, MoSDeF attempts to be simulation-engine-agnostic, treating the concept of a molecular simulation at a meta level, above the specifics of the simulation engines. The tools within MoSDeF are designed to fully describe a system; writers then use this information to instantiate syntactically correct input files for specific engines. MoSDeF was initially developed to support several commonly used open-source MD codes (LAMMPS39, GROMACS40, and HOOMD-blue41) and has since grown to support open-source MC simulation engines, namely Cassandra16 and GOMC18. In the Supplementary Information, we provide details on how to install MoSDeF through various hosting systems (Anaconda, Docker, from source using GitHub, etc.) on Apple OSX, Linux, and Windows platforms. Below we describe each of the three key components.
Source code, tutorials, documentation, and related publications can be accessed from mosdef.org and/or github.com/mosdef-hub/.
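The separation described above, between an engine-agnostic description of a system and the writers that turn it into engine-specific input, can be illustrated with a short sketch. This is not MoSDeF code: the `System` class, the `WRITERS` registry, and the `writer` and `render` functions are hypothetical names, and the simple XYZ coordinate format stands in for a real simulation engine's input file.

```python
from dataclasses import dataclass, field


@dataclass
class System:
    """Engine-agnostic description of a system (hypothetical, simplified)."""
    name: str
    # Each particle is (element, x, y, z), with coordinates in nm.
    particles: list = field(default_factory=list)


# Registry mapping a target format name to the function that writes it.
WRITERS = {}


def writer(target):
    """Decorator that registers a writer function for one target format."""
    def register(func):
        WRITERS[target] = func
        return func
    return register


@writer("xyz")
def write_xyz(system):
    # XYZ layout: particle count, comment line, one line per particle,
    # with coordinates converted from nm to angstrom.
    lines = [str(len(system.particles)), system.name]
    for element, x, y, z in system.particles:
        lines.append(f"{element} {x * 10:.3f} {y * 10:.3f} {z * 10:.3f}")
    return "\n".join(lines) + "\n"


def render(system, target):
    """Produce input text for the requested target from one description."""
    if target not in WRITERS:
        raise KeyError(f"no writer registered for {target!r}")
    return WRITERS[target](system)


# Illustrative two-particle system; coordinates are made up.
demo = System("two carbons", [("C", 0.0, 0.0, 0.0), ("C", 0.154, 0.0, 0.0)])
xyz_text = render(demo, "xyz")
```

Under this arrangement, supporting an additional engine means registering one more writer function; the system description and the scripts that build it stay unchanged.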

3.1. mBuild