Erika McCarthy1, Şölen Ekesan1, and Darrin M. York1
1Laboratory for Biomolecular Simulation Research, Institute for Quantitative Biomedicine and Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, NJ 08854, USA
The necessary inputs and subsequent outputs for parts 1.1 through 1.6 are available in:
/expanse/projects/qstore/amber_ws/tutorials/QMMM_DFTB3_MTR1/1D/inputs
and
/expanse/projects/qstore/amber_ws/tutorials/QMMM_DFTB3_MTR1/1D/ouputs.
The inputs should be copied to your working directory. If you unable to produce an output for a given step, copy it from the output directory in order to proceed. There will be times when a calculation has been performed for you and you will do analysis. In these cases, where indicated, you will copy the outputs to your working directory and analyze them.
To use AMBER programs, you must:
module use /expanse/projects/qstore/amber_ws/packages/modulefiles/
module load workshop/software
You will also periodically need to use xmgrace to plot your results. This module conflicts with AMBER on Expanse, so it is recommended to open another Expanse window (be sure to use ssh -Y to enable the gui) and type:
module load cpu/0.15.4 intel/19.1.1.217 grace/5.1.25
to do the plotting tasks and avoid loading and unloading modules.
The motivation for using umbrella sampling is that, without restraints, a system undergoing dynamics will very rarely cross a reaction barrier. Even if it does, it will be very sparsely sampled and take much longer than we can afford to sample. Therefore, we apply a series of harmonic biasing potentials, also referred to as an umbrella potential, to force the system over a free energy barrier in the space of the selected reaction coordinate. In this way, we can thoroughly sample the breaking and formation of bonds, or other processes like a conformational change. In order to construct a potential of mean force, we must track the value of the reaction coordinate throughout our simulations so that we can later remove the impact of the bias and compute what the distribution of sampling would look like had we not applied the restraint. This tutorial will show you how to perform and analyze QM/MM simulations of a methyl transfer reaction in the MTR1 ribozyme using 1 reaction coordinate.
The active site is shown in Figure 1, and includes the methyl donor ligand (O6 methylated guanine), the nucleophillic target A63, and C10 protonated at the N3 position.
First, we must decide which atoms will be treated quantum mechanically (aka the QM region). The rest of the atoms will be treated with the molecular mechanical force field. In this case, we are simulating methyl transfer from O6 methylated guanine (O6mG) to the N1 of A63, so we want to include both of these residues in the QM region. The positively charged C10 residue is directly hydrogen bonding to the leaving group, so we will include it as well. In part 2 this residue will also be involved in the reaction, so we will proactively include it for part 1. For A63 and C10, we will only include the nucleobases since the backbones are not directly involved in the reaction. Additionally, U45 is important for holding the ligand in the active site, but since it is not directly involved in the reaction it will remain in the MM region. Based on this selection, the total charge of the QM region is +1 because C10 is in a non-standard protonation state.
Based on this QM region selection, one would need to make some modifications to the parameter file such as removing connection atom charges and redistributing MM charges so that the QM region within each residue is an integer. One would also need to perform hydrogen mass repartitioning in order to run QM/MM simulations stably with a 2 fs time step. These modifications are outside the scope of this exercise and have already been performed to produce qmmm.parm7 that is provided to you in the template directory.
The inputs for this section are located in 1D/inputs. They should be copied to your working directory.
[user@cluster] mkdir 1D
[user@cluster] cd 1D
[user@cluster] cp -r /expanse/projects/qstore/amber_ws/tutorials/QMMM_DFTB3_MTR1/1D/inputs/* ./
You have been provided a parameter file (template/qmmm.parm7), and a structure file (start/reactant.rst7) that was previously equilibrated and represent the reactant state of the ribozyme. Additionally, you have been provided a template mdin file (template/short.mdin). This mdin file is valid for QM/MM simulations in general, and is not specific to umbrella sampling.
title
&cntrl
! IO =======================================
irest = 0 ! 0 = start, 1 = restart
ntx = 1 ! 1 = start, 5 = restart
ntxo = 1 ! read/write rst as formatted file
iwrap = 1 ! wrap crds to unit cell
ioutfm = 1 ! write mdcrd as netcdf
imin = 0
ntmin = 1
ntpr = 20
ntwr = 20
ntwx = 0
ntwf = 0
! DYNAMICS =================================
nstlim = 250
dt = 0.002 ! ps/step
ntb = 1 ! 1=NVT periodic, 2=NPT periodic, 0=no box
! TEMPERATURE ==============================
temp0 = 298 ! target temp
gamma_ln = 5.0 ! Langevin collision freq
ntt = 3 ! thermostat (3=Langevin)
! PRESSURE ================================
ntp = 0 ! 0=no scaling, 1=isotropic, 2=anisotropic
! SHAKE ====================================
ntc = 2 ! 1=no shake, 2=HX constrained, 3=all constrained
noshakemask = ":69|@272-282,290-291,1975-1988" ! do not shake these
ntf = 1 ! 1=cpt all bond E, 2=ignore HX bond E, 3=ignore all bond E
! MISC =====================================
cut = 10.0
ifqnt = 1
ig = -1
nmropt = 1
/
&ewald
dsum_tol = 1.e-6
/
&qmmm
qm_theory = 'DFTB3'
qmmask = ':69|@272-282,290-291,1975-1988'
qmcharge = 1
spin = 1
qmshake = 0
qm_ewald = 1
qmmm_switch = 1
scfconv = 1.e-10
verbosity = 0
tight_p_conv = 1
diag_routine = 0
pseudo_diag = 1
dftb_maxiter = 100
/
&wt type = 'DUMPFREQ', istep1 = 10, /
&wt type='END' /
DISANG=img.disang
DUMPAVE=img.dumpave
LISTOUT=POUT
LISTIN=POUT
As highlighted in cyan, we will be using a 2 fs timestep, but this can only be done safely when hydrogen mass repartitioning has been applied to the parameter file. The initial simulations used to generate the windows for umbrella sampling will only be performed for 0.5 ps.
As highlighted in yellow, we will be using the DFTB3 semi-empirical Hamiltonian, and the total charge of the QM region is +1 due to the protonation of cytosine. When determining the total charge of a nucleic acid QM region, be sure to consider any negatively charged phosphates that may be included. The QM region is defined by the mask ':69|@272-282,290-291,1975-1988' which includes the entire ligand (:69), the nucleobase of C10 (@272-282,290-291) and the nucleobase of A63 (@1975-1988).
The noshakemask (pink ) contains the same mask as the QM region, thus we will not shake QM atoms as these bonds involving hydrogen may fluctuate outside of equilibrium MM values when bonds are made/broken. SHAKE constraints are inherently dependent on the MM force field, thus MM constraints should not be combined with our QM description.
Since we want to apply nmr restraints, nmropt (green) is set to 1, and we will output restraint values to the dumpave file every 10 steps. The green highlighted img.disang and img.dumpave are place holders that will be replaced to indicate the appropriate umbrella window number in order to apply restraints.
Let's review how nmr restraints work in AMBER. When nmropt=1 is set in the mdin file, restraints will be read from the specified DISANG file. These are different than coordinate restraints set when using ntr=1, which effectively freezes the atoms in place. Using nmr restraints allows the atom position to fluctuate subject to a specified biasing potential relating it to another atom or atoms. The values of the restrained properties for a given step are output to the specified DUMPAVE file with a frequency of istep1.
Next we need to select the reaction coordinates. We choose the reaction coordinate to be the difference in distance between the atoms where the bond will break, and between the atoms where the bond will form. This ensures the reaction coordinate value to increase going from reactant to product making it more intuitive. In MTR1, describing methyl transfer between O6mG:O6 and A63:N1 will yield the reaction coordinate (RC): RC = |RO6mG:O6 - RCm| - | RA63:N1 - RCm |. We will evaluate the reaction going from the reactant with RC at -2.5 Å, to product with RC at 2.5 Å. You have been provided template/img.disang containing the restraint definitions.
#methylation: |GN-C5|-|A62-C5|
&rst iat= 2200, 2189, 2200, 1984,
r1=-999, r2=RC1, r3=RC1, r4=999, rk2=100, rk3=100, rstwt=1.0,-1.0 /
#Max distances for H-bonds
## 1 ##
&rst iat= 1403, 2207,
r1=-999, r2=-999, r3=3.00, r4=999,rk2=0, rk3=100 /
## 2 ##
&rst iat= 1405, 2192,
r1=-999, r2=-999, r3=2.20, r4=999,rk2=0, rk3=100 /
## 3 ##
&rst iat= 1407, 2206,
r1=-999, r2=-999, r3=2.00, r4=999,rk2=0, rk3=100 /
## 4 ##
&rst iat= 2205, 282,
r1=-999, r2=-999, r3=2.20, r4=999,rk2=0, rk3=100 /
## 5 ##
&rst iat= 280, 2189,
r1=-999, r2=-999, r3=2.20, r4=999,rk2=0, rk3=100 /
## 6 ##
&rst iat= 2190, 1982,
r2=-999, r2=-999, r3=3.00, r4=999,rk2=0, rk3=100 /
## 7 ##
&rst iat= 291, 2193,
r1=-999, r2=-999, r3=2.20, r4=999,rk2=0, rk3=100 /
For now, we need to understand more about the restraints used in this example. The first restraint (highlighted yellow) defines the reaction coordinate. The line "&rst iat= 2200, 2189, 2200, 1984," indicates that we are restraining the difference in distance between the methyl carbon (2200) to ligand oxygen (2189) and the methyl carbon (2200) to the N1 of adenine (1984). These distances are shown in Figure 1. The presence of "rstwt=1.0,-1.0" indicates this should be a difference of distances. In theory, the two distances could be seperate reaction coordinates, but combining them reduces the dimensionality while still describing the same event. The values of r1, r2, r3, and r4 define the shape of the biasing potential. The r2 and r3 values define the flat region of the potential. If r2 is less than r3, then there is no penalty for being within those bounds. When applying an umbrella potential, we set r2=r3 to sample directly on the desired value. The img.disang file is a template, where RC1 will ultimately be replaced for each umbrella window. If more reaction coordinates were present, the placeholder RC2 and so on would be used. The r1 and r4 values define the bounds of the linear response region, ie the penalty will progressively increase as the restrained property approaches r1 and r4. By setting r1=-999 and r4=999 we have effectively created a parabola with infinite bounds. The values rk2 and rk3 are the weights of the restraints. These effectively define the steepness of the parabola, thus controlling the penalty for going outside the r2 to r3 region. For umbrella sampling we want to set this to a high number to force the sampling of high energy regions. The higher the free energy at an umbrella center position, the more the observed value (reported in the dumpave file) will attempt to move away from this center and fight against the biasing potential. It is this concept that helps us construct a PMF.
The rest of the restraints do not define reaction coordinates, and they will be held constant in all of the umbrella windows. These are distance restraints on the active site hydrogen bonds to the ligand to ensure it stays in the pocket. As highlighted in green, these are different from the harmonic potential used for the reaction coordinate because r2=-999 and rk2=0. This defines a half-harmonic biasing potential where any distance less than 3.0 feels no penalty. If the values were reversed (ie. r3=999 and rk2=100), then the two atoms would be pushed away from each other. Now that we have a comprehensive understand of the nmr restraints, we will focus on setting up an umbrella sampling simulation.
Now we will generate the equil directory based on the inputs provided. This will be where we sequentially generate the umbrella windows starting from the reactant structure. It is important to slowly increment the reaction coordinate starting from a neighboring structure because each time you change the restraint, you are effectively changing the force field and risk large spikes in energy that could cause unintended bonds to break. We will use the ndfes program to write the inputs for this step.
You have been provided a reactant and product.disang file where the reaction coordinate is set to -2.5 and 2.5 respectively. These will define the end points, and a total of 32 umbrella windows will be generated to linearly interpolate through them.
[user@cluster] ndfes-path-prepguess.py --disang ../template/img.disang --mdin ../template/short.mdin --min reactant.disang --min product.disang --nsim 32 --odir ../equil --pad 2 --dry-run
The --min flags indicate the minima we want to use as control points (reactant and product). The option --disang indicates the template disang file with a placeholder, --mdin indicates the template mdin file with a placeholder, --nsim determines the number of umbrella windows, and --pad means the file names will be padded with a leading zero. The --dry-run flag will just print the output without creating the simulation directory called equil. The output should look like this:
1 -2.500 ../equil/img01.disang
2 -2.339 ../equil/img02.disang
3 -2.177 ../equil/img03.disang
4 -2.016 ../equil/img04.disang
5 -1.855 ../equil/img05.disang
6 -1.694 ../equil/img06.disang
7 -1.532 ../equil/img07.disang
8 -1.371 ../equil/img08.disang
9 -1.210 ../equil/img09.disang
10 -1.048 ../equil/img10.disang
11 -0.887 ../equil/img11.disang
12 -0.726 ../equil/img12.disang
13 -0.565 ../equil/img13.disang
14 -0.403 ../equil/img14.disang
15 -0.242 ../equil/img15.disang
16 -0.081 ../equil/img16.disang
17 0.081 ../equil/img17.disang
18 0.242 ../equil/img18.disang
19 0.403 ../equil/img19.disang
20 0.565 ../equil/img20.disang
21 0.726 ../equil/img21.disang
22 0.887 ../equil/img22.disang
23 1.048 ../equil/img23.disang
24 1.210 ../equil/img24.disang
25 1.371 ../equil/img25.disang
26 1.532 ../equil/img26.disang
27 1.694 ../equil/img27.disang
28 1.855 ../equil/img28.disang
29 2.016 ../equil/img29.disang
30 2.177 ../equil/img30.disang
31 2.339 ../equil/img31.disang
32 2.500 ../equil/img32.disang
dmax 0.161
dmax is the spacing between umbrella windows. In general, if the spacing is too larger, consider adding more windows. For our purposes this is sufficiently small.
[user@cluster] ndfes-path-prepguess.py --disang ../template/img.disang --mdin ../template/short.mdin --min reactant.disang --min product.disang --nsim 32 --odir ../equil --pad 2
[user@cluster] cd ../equil
[user@cluster] ls
eq01fwd.sh img04.disang img07.mdin img11.disang img14.mdin img18.disang img21.mdin img25.disang img28.mdin img32.disang
img01.disang img04.mdin img08.disang img11.mdin img15.disang img18.mdin img22.disang img25.mdin img29.disang img32.mdin
img01.mdin img05.disang img08.mdin img12.disang img15.mdin img19.disang img22.mdin img26.disang img29.mdin init01.rst7
img02.disang img05.mdin img09.disang img12.mdin img16.disang img19.mdin img23.disang img26.mdin img30.disang
img02.mdin img06.disang img09.mdin img13.disang img16.mdin img20.disang img23.mdin img27.disang img30.mdin
img03.disang img06.mdin img10.disang img13.mdin img17.disang img20.mdin img24.disang img27.mdin img31.disang
img03.mdin img07.disang img10.mdin img14.disang img17.mdin img21.disang img24.mdin img28.disang img31.mdin
You will see that the eq01fwd.sh script was generated for you in the ouputs directory.
#!/bin/bash
set -e
set -u
#
# You can create a slurm script and run these
# commands in the following way:
#
# export LAUNCH="srun sander.MPI"
# export PARM="path/to/parm7"
# bash eq01fwd.sh
#
if [ "${LAUNCH}" == "" ]; then
echo 'bash variable LAUNCH is undefined. Defaulting to: export LAUNCH="sander"'
export LAUNCH="sander"
fi
if [ "${PARM}" == "" ]; then
echo 'bash variable PARM is undefined. Please: export PARM="/path/to/parm7"'
exit 1
else
if [ ! -e "${PARM}" ]; then
echo "File not found: ${PARM}"
exit 1
fi
fi
if [ ! -e "init01.rst7" ]; then
echo "File not found: init01.rst7"
exit 1
fi
${LAUNCH} -O -p ${PARM} -i img01.mdin -o img01.mdout -c init01.rst7 -r img01.rst7 -x img01.nc -inf img01.mdinfo
${LAUNCH} -O -p ${PARM} -i img02.mdin -o img02.mdout -c img01.rst7 -r img02.rst7 -x img02.nc -inf img02.mdinfo
${LAUNCH} -O -p ${PARM} -i img03.mdin -o img03.mdout -c img02.rst7 -r img03.rst7 -x img03.nc -inf img03.mdinfo
${LAUNCH} -O -p ${PARM} -i img04.mdin -o img04.mdout -c img03.rst7 -r img04.rst7 -x img04.nc -inf img04.mdinfo
${LAUNCH} -O -p ${PARM} -i img05.mdin -o img05.mdout -c img04.rst7 -r img05.rst7 -x img05.nc -inf img05.mdinfo
${LAUNCH} -O -p ${PARM} -i img06.mdin -o img06.mdout -c img05.rst7 -r img06.rst7 -x img06.nc -inf img06.mdinfo
${LAUNCH} -O -p ${PARM} -i img07.mdin -o img07.mdout -c img06.rst7 -r img07.rst7 -x img07.nc -inf img07.mdinfo
${LAUNCH} -O -p ${PARM} -i img08.mdin -o img08.mdout -c img07.rst7 -r img08.rst7 -x img08.nc -inf img08.mdinfo
${LAUNCH} -O -p ${PARM} -i img09.mdin -o img09.mdout -c img08.rst7 -r img09.rst7 -x img09.nc -inf img09.mdinfo
${LAUNCH} -O -p ${PARM} -i img10.mdin -o img10.mdout -c img09.rst7 -r img10.rst7 -x img10.nc -inf img10.mdinfo
${LAUNCH} -O -p ${PARM} -i img11.mdin -o img11.mdout -c img10.rst7 -r img11.rst7 -x img11.nc -inf img11.mdinfo
${LAUNCH} -O -p ${PARM} -i img12.mdin -o img12.mdout -c img11.rst7 -r img12.rst7 -x img12.nc -inf img12.mdinfo
${LAUNCH} -O -p ${PARM} -i img13.mdin -o img13.mdout -c img12.rst7 -r img13.rst7 -x img13.nc -inf img13.mdinfo
${LAUNCH} -O -p ${PARM} -i img14.mdin -o img14.mdout -c img13.rst7 -r img14.rst7 -x img14.nc -inf img14.mdinfo
${LAUNCH} -O -p ${PARM} -i img15.mdin -o img15.mdout -c img14.rst7 -r img15.rst7 -x img15.nc -inf img15.mdinfo
${LAUNCH} -O -p ${PARM} -i img16.mdin -o img16.mdout -c img15.rst7 -r img16.rst7 -x img16.nc -inf img16.mdinfo
${LAUNCH} -O -p ${PARM} -i img17.mdin -o img17.mdout -c img16.rst7 -r img17.rst7 -x img17.nc -inf img17.mdinfo
${LAUNCH} -O -p ${PARM} -i img18.mdin -o img18.mdout -c img17.rst7 -r img18.rst7 -x img18.nc -inf img18.mdinfo
${LAUNCH} -O -p ${PARM} -i img19.mdin -o img19.mdout -c img18.rst7 -r img19.rst7 -x img19.nc -inf img19.mdinfo
${LAUNCH} -O -p ${PARM} -i img20.mdin -o img20.mdout -c img19.rst7 -r img20.rst7 -x img20.nc -inf img20.mdinfo
${LAUNCH} -O -p ${PARM} -i img21.mdin -o img21.mdout -c img20.rst7 -r img21.rst7 -x img21.nc -inf img21.mdinfo
${LAUNCH} -O -p ${PARM} -i img22.mdin -o img22.mdout -c img21.rst7 -r img22.rst7 -x img22.nc -inf img22.mdinfo
${LAUNCH} -O -p ${PARM} -i img23.mdin -o img23.mdout -c img22.rst7 -r img23.rst7 -x img23.nc -inf img23.mdinfo
${LAUNCH} -O -p ${PARM} -i img24.mdin -o img24.mdout -c img23.rst7 -r img24.rst7 -x img24.nc -inf img24.mdinfo
${LAUNCH} -O -p ${PARM} -i img25.mdin -o img25.mdout -c img24.rst7 -r img25.rst7 -x img25.nc -inf img25.mdinfo
${LAUNCH} -O -p ${PARM} -i img26.mdin -o img26.mdout -c img25.rst7 -r img26.rst7 -x img26.nc -inf img26.mdinfo
${LAUNCH} -O -p ${PARM} -i img27.mdin -o img27.mdout -c img26.rst7 -r img27.rst7 -x img27.nc -inf img27.mdinfo
${LAUNCH} -O -p ${PARM} -i img28.mdin -o img28.mdout -c img27.rst7 -r img28.rst7 -x img28.nc -inf img28.mdinfo
${LAUNCH} -O -p ${PARM} -i img29.mdin -o img29.mdout -c img28.rst7 -r img29.rst7 -x img29.nc -inf img29.mdinfo
${LAUNCH} -O -p ${PARM} -i img30.mdin -o img30.mdout -c img29.rst7 -r img30.rst7 -x img30.nc -inf img30.mdinfo
${LAUNCH} -O -p ${PARM} -i img31.mdin -o img31.mdout -c img30.rst7 -r img31.rst7 -x img31.nc -inf img31.mdinfo
${LAUNCH} -O -p ${PARM} -i img32.mdin -o img32.mdout -c img31.rst7 -r img32.rst7 -x img32.nc -inf img32.mdinfo
This script would start from img01 and iteratively generate the umbrella windows using the output of the previous window as the starting structure. It is important to increment the starting structure slowly to avoid errors. The serial nature of this process makes it too slow to compute during the course of this workshop, so these simulations have been conducted for you.
After generating the windows, it it best practice to exclude the first ~2 ps of production sampling from analysis to allow for proper equilibration time. The equilibration region can be checked using ndfes-CheckEquil.py. However, under the time constraints of the exercise, you will not be able to perform long enough sampling to gather equilibrium statistics, so you will only perform enough sampling to generate a PMF.
Now we will perform umbrella sampling on all of the umbrella windows in parallel. For the sake of of this exercise, the windows have been divided into 8 groups of 4 in directories called t1 through t8. Normally, one would run all of the windows together, but this will allow each participant to perform 3 ps of sampling in ~8 minutes. Trial directories should be chosen such that each trial is simulated by roughly an even number of people. This tutorial will focus on t1 as an example, but the instructions are the same for the other sets just replacing the directory name.
[user@cluster] ls t1
analyze_1-4.sh img01.mdin img02.mdin img03.mdin img04.mdin init02.rst7 init04.rst7 template.groupfile
img01.disang img02.disang img03.disang img04.disang init01.rst7 init03.rst7 run_1-4.slurm
The init{01..04} restart files are copies of the final img{01..04} restart files generated in the equil directory. In the mdin files, nstlim has been increased to 1500 so that with a 2 fs timestep we will perform 3 ps of sampling. It is typically desirable to run longer simulations when more time is available; however, this will be enough sampling to generate and analyze a PMF as a demonstration.
You have also been provided a slurm script called run_1-4.slurm.
#!/bin/bash
#SBATCH --job-name="sim.1-4"
#SBATCH --output="%a.slurmout"
#SBATCH --error="%a.slurmerr"
#SBATCH --partition=shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --mem=64GB
#SBATCH --cpus-per-task=1
#SBATCH --requeue
#SBATCH --export=ALL
#SBATCH -t 0-00:30:00
#SBATCH --account=gue998
#SBATCH --reservation=amber24
module load workshop/amber24/default
export LAUNCH="srun --mpi=pmi2 -K1 -N1 -n32 -c1 --exclusive sander.MPI"
#export PARM="../template/qmmm.parm7"
set -e
set -u
$LAUNCH -ng 4 -groupfile template.groupfile
wait
sleep 1
Each windows will be allotted 8 cores. A single compute node on expanse has 128 cores, which is conveniently divisible by 32, so a full run with 32 windows running in parallel could scale to an entire compute node. Now that the jobs are being run in parallel, we will provide a groupfile as input to sander.MPI.
-O -p ../template/qmmm.parm7 -c init01.rst7 -i img01.mdin -o img01.mdout -r img01.rst7 -x img01.nc -inf img01.mdinfo
-O -p ../template/qmmm.parm7 -c init02.rst7 -i img02.mdin -o img02.mdout -r img02.rst7 -x img02.nc -inf img02.mdinfo
-O -p ../template/qmmm.parm7 -c init03.rst7 -i img03.mdin -o img03.mdout -r img03.rst7 -x img03.nc -inf img03.mdinfo
-O -p ../template/qmmm.parm7 -c init04.rst7 -i img04.mdin -o img04.mdout -r img04.rst7 -x img04.nc -inf img04.mdinfo
This is similar to executing multiple backgrounded (&) sander.MPI processes, but it also gives us the capability to run replica exchange if desired. Since you will only be running a portion of the umbrella windows, HREMD will not be performed.
[user@cluster] cd t1
[user@cluster] sbatch run_1-4.slurm
#!/bin/bash
mkdir analysis
mkdir analysis/dumpaves
cd analysis
for i in $(seq 1 4); do
base=$(printf "img%02i" ${i})
ndfes-PrepareAmberData.py -d ../${base}.disang -i ../${base}.dumpave -o dumpaves/${base}.dumpave -r 1 >> metafile
done
ndfes_omp --mbar -w 0.15 --nboot 0 -c metafile.chk metafile
ndfes-path_omp --chk metafile.chk --ipath metafile --npathpts 4 --nsplpts 20 --opath path
This script will create an analysis directory and use the ndfes program to generate the potential of mean force for the segment of the reaction space spanned by your 4 umbrella windows. Each simulation generates a dumpave file that tracks the values of the restrained properties, including the reaction coordinate. For each dumpave file, the program ndfes-PrepareAmberData.py will extract just the data related to the reaction coordinate and output this to a new dumpave file in analysis/dumpaves. It will also generate a file called metafile that will indicate the index of the Hamiltonian (default is zero), the temperature (default is 298), the path to the dumpave file, the umbrella center, and the force constant.
Next, the program ndfes_omp is used to solve the MBAR equations with a bin width equal to 0.15. Setting the --nboot flag equal to zero means that we will perform no bootstrap resamples. If you wish to put error bars on your data points, nboot can be set to a nonzero number like 50, but this analysis will take longer with more resamples. The output of the program is an xml file called metafile.chk. This contains the value of the free energy in each sampled bin.
Finally, we use the ndfes-path_omp program to print the PMF information to a file called path. In this example, we are merely asking it to print the free energy along a path that is defined by the 4 control points in the metafile (ie. where we placed the umbrella windows). To create a smooth curve, we will use 20 spline points to represent our path. When using interpolation, the default setting is to use radial basis functions (rbf). Additional options will be discussed in upcoming exercises with more complex reaction paths.
[user@cluster] bash analyze_1-4.sh
[user@cluster] ls analysis
dumpaves metafile metafile.chk path path.rbf.0.dat path.xml
[user@cluster] module load cpu/0.15.4 intel/19.1.1.217 grace/5.1.25
[user@cluster] cd analysis
[user@cluster] xmgrace -block path.rbf.0.dat -bxy 3:4
The y-axis is the free energy in kcal/mol, and the x-axis is the reaction coordinate value. Without the context of the other umbrella windows, this does not give us much information. Now we will look at the PMF as a whole.
In this section we will combine all of the umbrella sampling trials to construct a combined PMF. You have been provided the output of all of the trials that must be copied to the working directory.
[user@cluster] cp -r /expanse/projects/qstore/amber_ws/tutorials/QMMM_DFTB3_MTR1/1D/ouputs/t{2..8} ./
[user@cluster] ls
combined_analysis t1 t2 t3 t4 t5 t6 t7 t8
The structures files have been removed from the output directories for the sake of file size, but these would have been generated by the simulations and are not required for analysis. We want to combine the metafiles from all of the trial directories.
[user@cluster] cd combined_analysis
[user@cluster] ndfes-CombineMetafiles.py -o metafile.all ../t{1..8}/analysis/metafile
[user@cluster] ndfes_omp --mbar -w 0.15 --nboot 0 -c metafile.all.chk metafile.all
This will generate metafile.all.chk. Finally, we will generate the path file for the entire PMF.
[user@cluster] ndfes-path_omp --chk metafile.all.chk --ipath metafile.all --npathpts 32 --nsplpts 400 --opath path_all
The ndfes commands are also given in combined_analysis.sh, but for the sake of understanding the role of each program they have been run one at a time.
[user@cluster] xmgrace -block path_all.rbf.0.dat -bxy 3:4
Now we can see the full PMF. From this we see that the free energy barrier for this reaction is ~30 kcal/mol, which is quite high. In 1 dimension it is very clear what the path is from reactants to products, and we can plot the free energy as a function of a single coordinate. However, if more reaction coordinates are involved, the path from reactant to product is less clear. In the following sections you will see how the use of more reaction coordinates can better model this reaction.
Only complete this section if you have remaining time in the session and are on pace to finish the required materials.
In this section you will demonstrate the importance of the spacing between umbrella windows. You have been provided the directory test_spacing containing metafile.11.
0 298.00 ../t1/analysis/dumpaves/img01.dumpave -2.500000 1.00000000000000e+02
0 298.00 ../t1/analysis/dumpaves/img04.dumpave -2.016129 1.00000000000000e+02
0 298.00 ../t2/analysis/dumpaves/img07.dumpave -1.532258 1.00000000000000e+02
0 298.00 ../t3/analysis/dumpaves/img10.dumpave -1.048387 1.00000000000000e+02
0 298.00 ../t4/analysis/dumpaves/img13.dumpave -0.564516 1.00000000000000e+02
0 298.00 ../t4/analysis/dumpaves/img16.dumpave -0.080645 1.00000000000000e+02
0 298.00 ../t5/analysis/dumpaves/img19.dumpave 0.403226 1.00000000000000e+02
0 298.00 ../t6/analysis/dumpaves/img22.dumpave 0.887097 1.00000000000000e+02
0 298.00 ../t7/analysis/dumpaves/img25.dumpave 1.370968 1.00000000000000e+02
0 298.00 ../t7/analysis/dumpaves/img28.dumpave 1.854839 1.00000000000000e+02
0 298.00 ../t8/analysis/dumpaves/img31.dumpave 2.338710 1.00000000000000e+02
This is is the same as metafile.all used previously for combined analysis except it only contains 11 of the umbrella windows. Make note of the spacing between the reaction coordinate values in the fourth column of metafile.11 compared to in metafile.all. They are now spaced by ~0.5 . First we will attempt to use this metafile in the same way we did previously.
[user@cluster] ndfes_omp --mbar -w 0.15 --nboot 0 -c metafile.11.0.15.chk metafile.11
[user@cluster] ndfes-path_omp --chk metafile.11.0.15.chk --ipath metafile.11 --npathpts 11 --nsplpts 400 --opath path_11_spl_0.15
[user@cluster] xmgrace -block path_11_spl_0.15.rbf.0.dat -bxy 3:4
The x-axis is the reaction coordinate value and the y-axis is the free energy in kcal/mol. Something has gone horribly wrong and the free energy has shot up to 5000 in 3 places. When we ran the ndfes_omp command to create the chk file, we used -w 0.15, meaning the bin width when solving the MBAR/UWHAM equations was 0.15 . However, our windows are now spaced ~0.5 apart. In our disang file we set the force constants to be 100 kcal/mol 2. Given our harmonic biasing potential, the distribution of observed reaction coordinates should be approximately normally distributed with a standard deviation of σ=(1/2βK)1/2 where β is the one over the Boltzmann constant times temperature and K is the force constant set in the disang file. Setting rk2 and rk3=100 gives an expected standard deviation in the reaction coordinate of ~0.054 . In a normal distribution, about 99.7% of the samples will be within 3 standard deviations of the mean, which here is ~0.16 . If the windows are ~0.5 away from each other and the sampling spans ~0.16 around the mean, it is highly likely that there will be bins between windows that contain zero samples.
When a bin contains no samples, ndfes reports 5000 for the energy, which effectively indicates it is infinitely high because a calculation could not be made. When we ran ndfes-path_omp we set --nsplpts to 400 in an attempt to create a smooth spline interpolation through the data, but this meant interpolating through empty bins. Now let's try running the command again without interpolating. This will just return the free energy at the 11 path points.
[user@cluster] ndfes-path_omp --chk metafile.11.0.15.chk --ipath metafile.11 --npathpts 11 --opath path_11_0.5
[user@cluster] xmgrace -block path_11_0.5.rbf.0.dat -bxy 3:4
Now the energy is more reasonable, but the PMF is disjointed. Perhaps is we increase the bin width, there will be no empty bins and we will be able to smoothly interpolate the PMF.
[user@cluster] ndfes_omp --mbar -w 0.3 --nboot 0 -c metafile.11.0.3.chk metafile.11
[user@cluster] ndfes-path_omp --chk metafile.11.0.3.chk --ipath metafile.11 --npathpts 11 --nsplpts 400 --opath path_11_spl_0.3
[user@cluster] xmgrace -block path_11_spl_0.3.rbf.0.dat -bxy 3:4
Recall that with 32 windows, the PMF looked like this:
As you can see, we were able to make an interpolation, but the PMF is very noisy. All of the bins contain samples, but for many it is not enough to reliably interpolate. In addition, increasing the bin size creates a more course gained view of the free energy landscape. In the end, we simply need more sampling, which is achieved by using more umbrella windows.
The necessary inputs and subsequent outputs for parts 2.1 through 2.3 are available in:
/expanse/projects/qstore/amber_ws/tutorials/QMMM_DFTB3_MTR1/2D
There are 3 sub-directories we will look at: linear, PT, and MT. Each contains input and output. The inputs should be copied to your working directory. If you are unable to produce an output for a given step, copy it from the output directory in order to proceed. There will be times when a calculation has been performed for you and you will do analysis. In these cases, where indicated, you will copy the outputs to your working directory and analyze them.
To use AMBER programs, you must:
module use /expanse/projects/qstore/amber_ws/packages/modulefiles/
module load workshop/software
You will also periodically need to use xmgrace to plot your results. This module conflicts with AMBER on Expanse, so it is recommended to open another Expanse window (be sure to use ssh -Y to enable the gui) and type:
module load cpu/0.15.4 intel/19.1.1.217 grace/5.1.25
to do the plotting tasks and avoid loading and unloading modules.
Given that the barrier to methyl transfer was so high, we will now include an additional coordinate for proton transfer. This evaluates the potential impact of general acid catalysis on the mechanism and free energy profile. However, with this new degree of freedom, it is unclear what the minimum free energy path is to get from reactants to products, so we need a method to navigate the multidimensional free energy landscape.
In this section you will analyze data from a series of surface accelerated string method (SASM) simulations that sample the reaction space with two reaction coordinates: RC1 = |RC10:N3-H| - |RmG:N1 - RH| and RC2 = |RO6mG:O6 - RCm| - | RA63:N1 - RCm |. The process of generating the inputs and equilibrating the windows for two dimensions would be analogous to the 1D example, except another reaction coordinate is specified in the img.disang file, and adequate window spacing must be maintained in both dimensions. The set-up for these simulations has been done for you, and you will focus on analysis of the free energy surface.
The inputs for the sting simulations and resulting outputs are located in the 2D directory.The tutorial is structured as follows:
The surface accelerated string method is an iterative procedure starting from an initial guess of the path that seeks to convergence upon the minimum free energy path. This method converges rapidly by running a set of umbrella sampling simulations on a given path, and after each iteration optimizing the path on the collective free energy surface from all of the sampling iterations performed up to that point. Every other string iteration, it performs "exploratory" simulations where the umbrella centers are placed just beyond the point on the minimum free energy path to thoroughly sample the vicinity of the path to avoid falling into local minima and create a rigorous free energy surface.
Under the time constraints of this exercise, the simulation data has been provided for you, and you will perform analysis. First, let's take a look at how the data was generated. There are 3 simulation directories provided called PT, MT, and linear. These correspond to 3 initial guesses of the path from which the pathfinding was initiated. PT is a stepwise pathway where the proton is transferred followed by the methyl transfer in a perfectly stepwise fashion. MT follows the opposite pathway where the methyl transfer occurs first and the proton transfer occurs second. The linear pathway is a perfectly concerted mechanism where the path is a straight line from reactants to products. At the offset, it is unclear which pathway is most energetically favorable. The tutorial will use the linear path as the example.
[user@cluster] mkdir linear
[user@cluster] cd linear
[user@cluster] cp -r /expanse/projects/qstore/amber_ws/tutorials/QMMM_DFTB3_MTR1/2D/linear/input/* ./
In the simulation directory, there is a script called SASM.slurm that was used to generate the data provided to you. The script was set up to run 51 iterations of the string method, each simulation for 4 ps. We will briefly take a look at key lines in the file that are fundamental to using the SASM. First,
MAKEANA="ndfes-path-analyzesims.py --curit=${cit} -d ${DISANG} ${NEQIT} ${TEMP} --maxeq=0.25 --skipg"
Once you have run a set of simulations, this command acts a wrapper for ndfes-CheckEquil.py, ndfes-PrepareAmberData.py, and ndfes-CombineMetafiles.py. It will determine the equilibration region of the simulations up to a fraction of maxeq, and analyze the dumpave files in that region. The resulting metafile will contain data from the current string and previous string iterations, excluding NEQIT iterations. Next, the combined metafile is used to generate the chk file:
MAKECHK="ndfes_omp --mbar -w 0.15 --nboot 0 -c metafile.all.chk metafile.all"
This command remains the same from the 1D example we did previously. Finally, the next string iteration directory is generated using the following command:
MAKESIMS="ndfes-path_omp --curit=${cit} -d ${DISANG} -i ${MDIN} ${NEQIT} ${TEMP} --sasm --maxit=300 --minsize=10 --akima --npathpts=100 --wavg=4 --wavg-niter=1"
This ndfes command optimizes the path on the existing free energy surface in order to determine where the next set of umbrella centers should be placed. The optimization is done by performing up to maxit so called "synthetic iterations" of the string method using npathpts. This optimization essentially decouples the minimum free energy path from the simulated umbrella centers so that the path can be more rigorously represented, and the simulations are performed to improve the quality of sampling. The minsize parameter indicates that a bin should have at least 10 samples to be included in the analysis. The parameter wavg is the order of the cardinal B-spline used to approximate the free energy surface as a weighted average of nearby bin values. The wavg-niter parameter applies an iterative correction to the wavg histogram values such that the interpolated values match the values at the bin centers. Increasing this value will decrease artifacts from averaging, but it may lead to noisier data. These values can be adjusted based on the system and desired smoothness of the free energy surface. Based on the optimized path, the MAKESIMS command will determine the next set of umbrella centers, and it will set up the next run directory based on the template input file and disang file provided.
Now that you understand how the data was generated, you will see how the string has progressed and the resulting free energy surface. The output of the string simulations is located in the outputs directory for the respective initial guesses. You will perform the analysis on the outputs of the string, so these should be copied to your working directory.
[user@cluster] cp -r /expanse/projects/qstore/amber_ws/tutorials/QMMM_DFTB3_MTR1/2D/linear/output ./
[user@cluster] cd output
The outputs contain iterations zero (init) through 20, and each iteration contains a directory called analysis. For the init, it002, it005, it010, and it020 directories, the analysis directory has been moved to analysis-out so that you can generate analysis on your own without writing over files, and if you encounter issues, the intended results are still available for you to look at. For the sake of file size, structure files have been removed, but would have been generated by the simulations. For each of these directories you will run the following series of commands. In each case, replace the --curit flag and change the cd command to the directory you are analyzing. When running Example2d.py, change the title to match the iteration, and use the flag --inset-bright instead of --inset-tleft if you are analyzing the MT initial guess of the path.
[user@cluster] ndfes-path-analyzesims.py --curit=0 -d template/img.disang --neqit=4 --temp=298 --maxeq=0.25 --skipg
[user@cluster] cd init/analysis
[user@cluster] ndfes_omp --mbar -w 0.15 --nboot 0 -c metafile.all.chk metafile.all
[user@cluster] ndfes-path_omp --ipath metafile.current --chk metafile.all.chk --neqit=4 --temp 298 --maxit=300 --minsize=10 --akima --npathpts=100 --wavg=4 --wavg-niter=1 --opath path
[user@cluster] python ../../Example2d.py --ipath path.wavg.0.dat --title 'init' --inset-tleft --zerobypath0 --wavg=4 --wavg-niter=1 --minene=-25 --maxene=25 --minsize=10 metafile.all.chk
The ndfes-path-analyzesims.py and ndfes_omp commands are the same as in the SASM run script. For each window, the ndfes-path-analyzesims.py script will print some information about the equilibration of the window. More information can be found by running:
ndfes-CheckEquil.py -h
Ninp: The number of input samples
Nout: The number of output samples
Teq: The percentage of samples excluded as equilibration
i0: The 1-based index of the first frame to write (the "start" value
when using cpptraj)
s: The stride through the data (the "offset" value when using cpptraj)
g: The statistical inefficiency of the correlated samples
Wf: The mean value of the bias potential from the first half of
statistically independent samples after excluding the first
i0-1 samples as equilibration
dWf: The standard error of Wf
Wl: The mean value of the bias potential from the last half of
statistically independent samples after excluding the first
i0-1 samples as equilibration
dWl: The standard error of Wl
Here we are modifying the ndfes-path_omp command to simply print the minimum free energy path without creating a new simulation directory. Finally, the Example2d.py script will create a figure of the free energy surface with the current estimate of the minimum free energy path and free energy profile. The Example2d.py script as well as a 3d version are available with the ndfes package in the examples directory if you wish to make figures like this for your own projects in the future. If you are unable to obtain the desired output, the files are available in a directory called analysis-out in each of the unanalyzed iteration directories.
For the linear guess, the progression looks something like this:
Now let's compare the path from the different starting guessed. Here is an example of the result from it020 of the strings initiated from the 3 different pathways:
We see that both the linear and stepwise initial guess where proton transfer (RC1) precedes methyl transfer (RC2) both converge to the minimum free energy path, but the stepwise initial guess where methyl transfer precedes proton transfer remains in that local minimum. This highlights that while the SASM is robust, like all string methods one should make educated, and potentially mulitple, initial guesses.
Only complete sections 2.3 and 2.4 if you have completed sections 1.1-1.5, 2.1, and 2.2 and have remaining time in the session.
Finally, let's take a look at the convergence of the string over all of the iterations of string data provided in the outputs for the pathway you have chosen (linear, PT, or MT). The Example2d.py script has a feature for creating movies of the string and free energy surface progression. Each string directory contains a script called Make2dMovie.sh that runs the Example2d.py script in order to create a series of png files that are combined into a movie using ffmpeg. This exercise was performed on 50 string iterations to confirm convergence of the path. A run of this script has already been performed for you, and the output was written to the fullmovie directory:
[user@cluster] ls fullmovie
movie.sims.mp4 sims.00006.png sims.00012.png sims.00018.png sims.00024.png sims.00030.png sims.00036.png sims.00042.png sims.00048.png
sims.00001.png sims.00007.png sims.00013.png sims.00019.png sims.00025.png sims.00031.png sims.00037.png sims.00043.png sims.00049.png
sims.00002.png sims.00008.png sims.00014.png sims.00020.png sims.00026.png sims.00032.png sims.00038.png sims.00044.png sims.00050.png
sims.00003.png sims.00009.png sims.00015.png sims.00021.png sims.00027.png sims.00033.png sims.00039.png sims.00045.png sims.00051.png
sims.00004.png sims.00010.png sims.00016.png sims.00022.png sims.00028.png sims.00034.png sims.00040.png sims.00046.png sims.00052.png
sims.00005.png sims.00011.png sims.00017.png sims.00023.png sims.00029.png sims.00035.png sims.00041.png sims.00047.png sims.00053.png
[user@local] google-chrome fullmovie/movie.sims.mp4
Here is the movie depicting the string progression from the linear initial guess:
In the movie, you will see small circles along the path that represent the control points defining the path. The x's indicate where the umbrella centers have been placed for simulations. You will notice that there is alternation between simulations (x's) being placed explicitly on the path for refinement and off the path for exploration. This allows for rapid convergence as well as rigorous sampling on both sides of the path. Since we used the neqit=4 flag when running ndfes, string iterations 0-4 are eventually phased out once at least 5 previous iterations of sampling available to analyze. The reasoning is that the initial guess is very poor (the string evolves dramatically), so the short amount of sampling of these early iterations likely do not well-represent an equilibrium ensemble. Feel free to look at the movies generated for the other 2 paths to compare.
Visually, there is very little evolution of the minimum free energy path past iteration 10 or so. When the simulation data was generated, the MAKESIMS command in SASM.slurm outputs two convergence tests to analysis.CIT.log, where CIT is the current iteration. The first test is called the slope test that checks whether the slope of the reaction coordinates over at least 5 iterations is less than the threshold value. The RMSD test checks whether the RMSD of the overall path over at least 5 iterations in less the the threshold value. These are thresholds that have been established based on various test cases, but one should still visually inspect the path.
[user@cluster] grep -r "Slopes appear to be converged" *log | head -n5
analysis.008.log:Slopes appear to be converged
analysis.009.log:Slopes appear to be converged
analysis.010.log:Slopes appear to be converged
analysis.011.log:Slopes appear to be converged
analysis.012.log:Slopes appear to be converged
[user@cluster] grep -r "Path is effectively CONVERGED" *log
analysis.010.log:Min slope in RMSD 1.687e-03 Path is effectively CONVERGED
analysis.011.log:Min slope in RMSD 6.127e-04 Path is effectively CONVERGED
analysis.012.log:Min slope in RMSD 2.864e-04 Path is effectively CONVERGED
analysis.013.log:Min slope in RMSD 2.337e-04 Path is effectively CONVERGED
analysis.014.log:Min slope in RMSD -9.001e-05 Path is effectively CONVERGED
Based on this criteria, it appears that both tests were passed by iteration 10 for the linear initial guess. This is consistent with our visual inspection of the movie. Based on the string simulations performed in this tutorial, it is clear that the minimum free energy path for the MTR1 reaction involves a stepwise mechanism where a general acid protonates the bound ligand to initiate methyl transfer to the target adenine. This free energy barrier (~20 kcal/mol) is about 10 kcal/mol lower than the 1 dimensional case that we computed previously for just methyl transfer, and the 2 dimensional path where methyl transfer precedes proton transfer.
The input and subsequent outputs for part 2.4 are located in /expanse/projects/qstore/amber_ws/tutorials/QMMM_DFTB3_MTR1/2D/compare_PMF/inputs and /expanse/projects/qstore/amber_ws/tutorials/QMMM_DFTB3_MTR1/2D/compare_PMF/outputs. The inputs should be copied to your working directory in a new folder called compare_PMF.
[user@cluster] mkdir compare_PMF
[user@cluster] cd compare_PMF
[user@cluster] cp -r /expanse/projects/qstore/amber_ws/tutorials/QMMM_DFTB3_MTR1/2D/compare_PMF/inputs/* ./
In this tutorial you have used an approximate semi-empirical Hamiltonian, so now let's get a sense of how well it performed by comparing the PMF to the ab initio, PBE0 level PMF. This PMF was also generated using the string method. You have been provided a file called DFTB3.dat and PBE0.dat in the compare_PMF directory. DFTB3.dat is path file that we previously obtained when analyzing the DFTB3 strings that converged to the minimum free energy path.
xmgrace -block DFTB3.dat -bxy 2:5 -block PBE0.dat -bxy 2:5
Not only are the rate determining free energy barriers different, but the mechanistic interpretation is different when the higher level of theory is used. At the DFTB3 level (black) there is a stable intermediate state in which the proton transfer has occurred. This would suggest a proton would reside on the ligand a majority of the time. This is contrary to experimental activity-pH profiles of this type of base pair. At the PBE0 level (red), the intermediate is slightly higher in energy than the reactant state, suggesting the proton would likely be in rapid equilibrium between the cytosine residue and the ligand. This has implications for how one would interpret experimental observables such as pH dependence and highlights the importance of balancing the speed of the simulations with the accuracy of Hamiltonian.