Principal Component Analysis (PCA)
Introduction
Principal component analysis (PCA) allows us to identify major changes in conformation and movement after MD simulation.
This guide assumes you have all files in the same directory. If not, be sure to reference each file from their respective directories – though it will be much easier later if all files are in the same directory.
Small systems will be able to do the initial set-up, gmx covar
, and gmx anaeig
within the same interactive session. If you have very large/complex systems you wish to compare, you may need to submit a slurm job to run these commands.
Interactive sessions allow us to use an allocation of ARC to actively submit commands, rather than queueing a job and waiting.
To begin an interactive session of 2 hours:
Then import the necessary modules (copy and paste individual lines to the command line):
module load apps site/infer/easybuild/setup
module load CMake/3.15.3-GCCcore-8.3.0
module load CUDA/10.1.243-GCC-8.3.0
export MODULEPATH=$MODULEPATH:/projects/bevanlab/software/infer/modules/modules/infer-skylake/all
module load gromacs-v100/2020.4
Initial set-up
You should have the following files from your MD simulations:
- Final trajectory files (.xtc) for all replicates
- Topology file (.tpr) from any replicate
Concatenate trajectory files
All trajectory (.xtc) replicates must be concatenated, or joined together, in order to analyze the average motion of your system.
Note
You MUST use the -cat
option, otherwise duplicate timeframes will be overwritten by the last XTC file used.
Get backbone-only trajectory (.xtc)
Since covariance analysis is performed on the protein backbone, you will need to generate an xtc trajectory file containing the protein backbone only.
To generate a backbone-only XTC file:
Select backbone when prompted.
Get backbone-only topology file (.tpr)
The topology (.tpr) and trajectory (.xtc) files MUST match in atom count, otherwise you will get a fatal error when trying to run gmx covar
.
To guarantee a match, create a backbone-only TPR file:
Select backbone when prompted. It does not matter which TPR file is used (whether replicate 1 or 3), as gmx covar
will only be referencing the structural information.
Covariance analysis
You need the new concatenated backbone-only XTC and TPR files generated in the initial set-up to perform covariance analysis. Because this step generates many files, it is advisable to make a new directory (e.g., covar/
) to store your outputs.
To perform covariance analysis:
gmx covar -f all_cat_backbone.xtc -s backbone.tpr -o ./covar/eigenval.xvg -ascii ./covar/ascii.dat -av ./covar/ave.pdb -v ./covar/eigenvec.trr
Select backbone when prompted.
This produces the following outputs:
-o
: eigenvalues of the covariance matrix-ascii
: the whole covariance matrix, in text form-av
: the average structure-v
: the full precision trajectory
Repeat this for all systems. Be sure to use a clear naming convention, as this data will be projected onto each other in the next step.
Analysis of eigenvectors
In order to analyze changes in motion during simulation, the outputs from the previous step must be processed using gmx anaeig
.
The following files are needed for every system you plan to compare:
backbone.xtc
eigenval.xvg
ave.pdb
eigenvect.trr
To perform eigenvector analysis:
# Project MUT onto WT
gmx anaeig -s wt_ave.pdb -f mut_backbone.xtc -v wt_eigenvec.trr -eig wt_eigenval.xvg -proj mut_wt_proj-ev.xvg -extr mut_wt_ev.pdb -2d mut_wt_eigtraj.xvg
# Project WT onto MUT
gmx anaeig -s mut_ave.pdb -f wt_backbone.xtc -v mut_eigenvec.trr -eig mut_eigenval.xvg -proj wt_mut_proj-ev.xvg -extr wt_mut_ev.pdb -2d wt_mut_eigtraj.xvg
Select backbone, protein, or system when prompted (these are all the same).
This produces the following outputs:
-proj
: the projections of trajectories on the first and last eigenvectors-extr
: the two extreme projections along the average structure-2d
: 2D projection of trajectories on the eigenvectors
Data visualization and analysis
Graphing 2D projections
You will need the mdpca.py
file to create PCA graphs from the 2D projection data. A version of this file is available here. You will also need a way to run Python 3 scripts.
The following files are needed for every system you plan to compare:
eigtraj.xvg
All relevant files must be in the same directory as the script OR use a path to the file.
Using mdpca.py
Open the script in a code editor and configure the following:
- Under
#2D INPUT FILES
, change file names to match the twoeigtraj.xvg
files. - Under
#GRAPH COLORS
, change the hex codes to adjust graph colors. - Under
#TITLE AND FILE PARAMETERS
, change the graph title and output file name.- The graph title can be styled using Mathtext or LaTeX.
- By default, graphs are saved as PNG files.
Save and run the script. The output will be two side-by-side projection graphs with a single title:
The degree of overlap between the two datasets represents how structurally similar the systems are during simulation. Overlap in the data suggests these simulations take on more similar structures, or sample similar conformational space. Conversely, regions without overlap indicate differences in conformation.
Visualizing extreme structures with PyMOL
The conformational changes indicated by the 2D graphs can be visualized by looking at the extreme structures that were generated by -extr
of gmx anaeig
. Opening one of these PDB files in PyMOL gives us a protein structure with two frames. Each frame represents the first and last eigenvectors, which are depicted in the 2D graph as the X- and Y-axes, respectively.
If your 2D graphs show any regions without overlap, these two frames should be different when you toggle between them. We can add arrow indicators to show the degree and direction of these differences using modevectors
.
To do this, you will need the modevectors.py
script. You can additionally use the optional modesplit.py
script, which automates part of this process with a little extra set-up. Place all scripts in the same directory as the PDB files you are working with.
Using modevectors.py
The modevectors.py
script requires that the two frames be split into two PyMOL objects. This can be done through the graphical user interface (GUI) or the PyMOL command line.
To do this in the PyMOL GUI, find the object buttons at the right of the program. Navigate to either your protein or the "all" object and click the A icon.
Navigate to the "state" option, then select "split".
This will create 2 new objects, "protein_0001" and "protein_0002", which correspond to the two frames.
To do this in the PyMOL command line, type split_states all
.
You can now use modevectors.py
! Run each of the following lines individually in the PyMOL command line, replacing the object names as appropriate:
skip
indicates how many vectors (arrows) are skipped during rendering. Adjust this if the final render is too visually cluttered.
Using modesplit.py
Repeatedly running Modevectors can be tedious if you are analyzing multiple systems with the same protein, or multi-subunit systems where you want to analyze each subunit individually. The modesplit.py
script is an additional (optional) script that can split subunits and automate the above process. It requires some set-up, but afterwards the script can be run as-is for all relevant PDB files.
Open modesplit.py
in a code editor. Define all protein subunits in the #CHAINS
dictionary. The dictionary can be modified for any number of subunits, but at least one entry is required to run.
Note that the start_index
and end_index
ask for the index of the first and last atoms of the subunit, not residues. This is to avoid duplicates with homo-oligomers, which will have the same residue numbers designated across identical subunits.
To find the index of an atom in PyMOL:
- Select the first residue of the subunit. In the command line, type
index %sele
and hit enter. PyMOL will return three lines of text. Note the number at the end of the first line of the output. - Click off the protein to deselect the residue.
- Repeat step 1, instead selecting the last residue of the subunit. Note the number at the end of the last line of the output.
- Repeat for all subunits.

Update the indices, then save the file. Load it into PyMOL using run modesplit.py
in the command line.
Note
This script will also automatically run modevectors.py
. You can adjust the parameters for Modevectors in line 47.
Analyzing motion with DIRECT-ID
DIRECT-ID is a program that identifies residues with the greatest variance in movement between two systems based on the covariance matrices produced by gmx covar
.
The following files are needed to perform DIRECT-ID analysis:
ascii.dat
for both systems of interestave.pdb
for either system
DIRECT-ID should already be set-up on ARC and can be accessed using the following command:
Depending on where you are trying to run the analysis, you may need to run the command using a path to the DIRECT-ID program.
Regardless of how the program is run, two outputs will be generated: a text file containing a list of atoms with the most variance in movement, and a command line output listing the residues.

In most cases, the terminal output is the most relevant. Record or screenshot this output, as it does not get saved in a file. The residues listed should match those with the greatest motion in the Modevectors analysis.