Protein Homology Modelling : Protein Homology Modelling
Learning Objectives : Learning Objectives After this lesson you should be able to:
Explain the individual steps involved in calculating a protein homology model.
Identify suitable templates for modelling.
Outline the principles behind ab initio protein structure prediction.
Describe the differences between homology modelling and ab initio structure prediction.
Describe the major pitfalls in protein modelling.
Outline : Outline Protein homology modelling
Individual steps
Caveats
Pitfalls
Ab initio protein structure prediction
Threading
True ab initio methods
Slide4 : Why Do We Need Homology Modelling? Ab Initio protein folding (“random” sampling):
100 aa, 3 conf./residue gives approximately 10^48 different overall conformations!
Random sampling is NOT feasible, even if conformations can be sampled at picosecond (10^-12 s) rates.
Levinthal’s paradox
Levinthal argued that the number of possible conformations of a protein chain grows exponentially with the number (N) of amino acids, so the time needed for an exhaustive search of conformation space also grows exponentially with N and is practically impossible on any biologically relevant timescale.
Do homology modelling instead.
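The arithmetic behind this estimate can be checked with a short sketch. It uses the numbers quoted on the slide (100 residues, 3 conformations per residue, one conformation sampled per picosecond); the function name and the choice to report the time in years are assumptions made for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def levinthal_estimate(n_residues=100, conf_per_residue=3, samples_per_second=1e12):
    """Return (number of conformations, years needed to sample all of them)."""
    n_conformations = conf_per_residue ** n_residues
    years = n_conformations / samples_per_second / SECONDS_PER_YEAR
    return n_conformations, years


n_conformations, years = levinthal_estimate()
print(f"{n_conformations:.2e} conformations, {years:.2e} years to enumerate")
```

Even at picosecond rates the exhaustive search takes many orders of magnitude longer than any biologically relevant time.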
How Is It Possible? : How Is It Possible? The structure of a protein is uniquely determined by its amino acid sequence (but sequence is sometimes not enough):
prions
pH, ions, cofactors, chaperones
Structure is conserved much longer than sequence in evolution.
Structure > Function >> Sequence
How Often Can We Do It? : How Often Can We Do It? There are currently ~47000 structures in the PDB (but only ~4000 if you count only those that are no more than 30% identical to each other and have a resolution better than 3.0 Å).
An estimated 25% of all sequences can be modeled and structural information can be obtained for ~50%.
Worldwide Structural Genomics : Worldwide Structural Genomics Complete genomes
Signaling proteins
Disease-causing organisms
Model organisms
Membrane proteins
Protein-ligand interactions
Structural Genomics in North America : Structural Genomics in North America 10-year, $600 million project initiated in 2000, funded largely by NIH.
Aim: structural information on 10,000 unique proteins (now revised to 4,000-6,000); 1,000 have been determined so far.
Improve current techniques to reduce time (from months to days) and cost (from $100,000 to $20,000 per structure).
9 research centers currently funded (2005); targets are from model and disease-causing organisms (with a separate project on TB proteins).
Homology Modeling for Structural Genomics : Homology Modeling for Structural Genomics Roberto Sánchez et al. Nature Structural Biology 7, 986 - 990 (2000)
How Well Can We Do It? : How Well Can We Do It? Sali, A. & Kuriyan, J. Trends Biochem. Sci. 24, M20–M24 (1999)
How Is It Done? : How Is It Done? Identify template(s) – initial alignment
Improve alignment
Backbone generation
Loop modelling
Side chains
Refinement
Validation
Template Identification : Template Identification Search with sequence
Blast
Psi-Blast
Fold recognition methods
Use biological information
Functional annotation in databases
Active site/motifs
Alignment : Alignment
Slide14 : 1 2 3 4 5 6 7 8 9 10 11 12 13 14
PHE ASP ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL CYS
PHE ASN VAL CYS ARG THR PRO --- --- --- GLU ALA ILE CYS
PHE ASN VAL CYS ARG --- --- --- THR PRO GLU ALA ILE CYS
Improving the Alignment : Improving the Alignment
1 2 3 4 5 6 7 8 9 10 11 12 13 14
PHE ASP ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL CYS
PHE ASN VAL CYS ARG THR PRO --- --- --- GLU ALA ILE CYS
PHE ASN VAL CYS ARG --- --- --- THR PRO GLU ALA ILE CYS
From "Professional Gambling" by Gert Vriend:
http://www.cmbi.kun.nl/gv/articles/text/gambling.html
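The two gap placements can be compared by sequence identity with a small sketch. It assumes that the first residue row is the template and that the other two rows are alternative alignments of the same model sequence; the function name and the gap symbol handling are illustrative.

```python
TEMPLATE = "PHE ASP ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL CYS".split()
ALIGNMENT_1 = "PHE ASN VAL CYS ARG THR PRO --- --- --- GLU ALA ILE CYS".split()
ALIGNMENT_2 = "PHE ASN VAL CYS ARG --- --- --- THR PRO GLU ALA ILE CYS".split()


def identity(template, aligned, gap="---"):
    """Return (identical positions, aligned positions without gaps)."""
    pairs = [(t, a) for t, a in zip(template, aligned) if t != gap and a != gap]
    identical = sum(1 for t, a in pairs if t == a)
    return identical, len(pairs)


for name, alignment in (("alignment 1", ALIGNMENT_1), ("alignment 2", ALIGNMENT_2)):
    identical, aligned = identity(TEMPLATE, alignment)
    print(f"{name}: {identical}/{aligned} identical")
```

The two scores differ by a single position, so sequence identity alone says little about which gap placement makes structural sense; inspecting the template structure is the safer way to decide.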
Template Quality : Template Quality Selecting the best template is crucial!
The best template may not be the one with the highest % id (best p-value…)
Template 1: 93% id, 3.5 Å resolution
Template 2: 90% id, 1.5 Å resolution
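One way to make this trade-off explicit is to filter on resolution first and rank by identity second. The sketch below is a hypothetical heuristic, not a published rule: the 3.0 Å cutoff and the names `Template` and `pick_template` are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Template:
    name: str
    percent_identity: float
    resolution_angstrom: float


def pick_template(templates, max_resolution=3.0):
    """Prefer templates at or below the resolution cutoff, then highest identity."""
    usable = [t for t in templates if t.resolution_angstrom <= max_resolution]
    candidates = usable or list(templates)  # fall back if nothing passes the cutoff
    return max(candidates, key=lambda t: t.percent_identity)


templates = [Template("Template 1", 93, 3.5), Template("Template 2", 90, 1.5)]
print(pick_template(templates).name)
```

With the two templates from the slide, the 3% loss in identity is accepted in exchange for much better experimental data.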
Slide18 : The Importance of Resolution (figure panels at 4 Å, 3 Å, 2 Å and 1 Å resolution)
Evaluation of NMR Structures : Evaluation of NMR Structures Which regions of the structure are best defined? Look at the PDB ensemble to see where the models agree.
1RJH
Nielbo et al, Biochemistry, 2003
Slide20 : Ramachandran Plot Allowed backbone torsion angles in proteins (figure label: amino acid residue)
Template Quality – Ramachandran Plot : Template Quality – Ramachandran Plot X-ray structure – good data.
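The angles on a Ramachandran plot are dihedral angles over four consecutive backbone atoms: phi uses C(i-1), N, CA, C and psi uses N, CA, C, N(i+1). The sketch below computes a dihedral from four coordinates in plain Python; the helper names are illustrative, and reading the atoms from a PDB file is left out.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees around the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


# Toy coordinates, not taken from a real structure.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Computing phi and psi for every residue of a candidate template and checking how many fall outside the allowed regions is a quick quality filter.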
Backbone Generation : Backbone Generation
Generate the backbone coordinates from the template for the aligned regions.
Several programs can do this; most of the groups at CASP6 used Modeller:
http://salilab.org/modeller/modeller.html
Loop Modelling : Loop Modelling Knowledge based:
Searches PDB for fragments that match the sequence to be modelled (Levitt, Holm, Baker etc.).
Energy based:
Uses an energy function to evaluate the quality of the loop and minimizes this function by Monte Carlo (sampling) or molecular dynamics (MD) techniques.
Combination
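The energy-based approach can be illustrated with a minimal Metropolis Monte Carlo sketch. It minimizes a toy one-dimensional energy function rather than a real loop energy; the function names, step size, temperature and the quadratic energy are all assumptions made for illustration.

```python
import math
import random


def metropolis_minimize(energy, x0, step=0.5, temperature=1.0, n_steps=2000, seed=0):
    """Random-walk Metropolis search; returns the lowest-energy state visited."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step, step)
        e_new = energy(x_new)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / temperature):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


def toy_energy(x):
    """Hypothetical energy with a single minimum at x = 2."""
    return (x - 2.0) ** 2


best_x, best_e = metropolis_minimize(toy_energy, x0=10.0)
print(best_x, best_e)
```

In a real loop builder the state would be a set of backbone torsion angles and the energy a force field or knowledge-based score; the occasional acceptance of uphill moves is what lets the search escape local minima.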
Loops – the Rosetta Method : Loops – the Rosetta Method
Find fragments (10 per amino acid) with the same sequence and secondary structure profile as the query sequence.
Combine them using a Monte Carlo scheme to build the loop.
David Baker et al.
Side Chains : Side Chains
If the sequence identity is high, the networks of side chain contacts may be conserved, and keeping the side chain rotamers from the template may be better than predicting new ones.
Predicting Side Chain Conformations : Predicting Side Chain Conformations Side chain rotamers are dependent on backbone conformation.
Most successful method in CASP6 was SCWRL by Dunbrack et al.:
Graph-theory knowledge based method to solve the combinatorial problem of side chain modelling.
http://dunbrack.fccc.edu/SCWRL3.php
Side Chains - Accuracy : Side Chains - Accuracy Prediction accuracy is high for buried residues, but much lower for surface residues
Experimental reasons: side chains at the surface are more flexible.
Theoretical reasons: much easier to handle hydrophobic packing in the core than the electrostatic interactions, including H-bonds to waters.
Refinement : Refinement Energy minimization
Molecular dynamics
Big errors like atom clashes can be removed, but force fields are not perfect and small errors will also be introduced – keep minimization to a minimum or matters will only get worse.
Error Recovery : Error Recovery If errors are introduced in the model, they normally can NOT be recovered at a later step
The alignment cannot make up for a bad choice of template.
Loop modeling cannot make up for a poor alignment.
If errors are discovered, the step where they were introduced should be redone.
Validation : Validation Most programs will get the bond lengths and angles right.
The Ramachandran plot of the model usually looks pretty much like the Ramachandran plot of the template (so select a high quality template).
Inside/outside distributions of polar and apolar residues can be useful.
Biological/biochemical data
Active site residues
Modification sites
Interaction sites
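The inside/outside check can be sketched as a comparison of the apolar fraction among buried and exposed residues. Everything specific here is an assumption for illustration: the set of residues counted as apolar, the 0.2 relative-accessibility cutoff, and the toy input list (real values would come from a solvent accessibility program).

```python
APOLAR = {"ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO"}


def apolar_fractions(residues, buried_cutoff=0.2):
    """residues: list of (three-letter name, relative solvent accessibility 0..1).

    Returns (apolar fraction among buried, apolar fraction among exposed).
    """
    buried = [name for name, rsa in residues if rsa < buried_cutoff]
    exposed = [name for name, rsa in residues if rsa >= buried_cutoff]

    def fraction(names):
        return sum(n in APOLAR for n in names) / len(names) if names else 0.0

    return fraction(buried), fraction(exposed)


# Hypothetical accessibilities, not measured data.
toy_model = [("LEU", 0.05), ("VAL", 0.10), ("PHE", 0.02), ("SER", 0.15),
             ("LYS", 0.80), ("GLU", 0.70), ("ASP", 0.60), ("ALA", 0.50)]
buried_fraction, exposed_fraction = apolar_fractions(toy_model)
print(buried_fraction, exposed_fraction)
```

A model whose core is no more apolar than its surface deserves a second look at the alignment.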
Validation – ProQ Server : Validation – ProQ Server ProQ is a neural-network-based predictor that uses a number of structural features to predict the quality of a protein model.
ProQ is optimized to find correct models, in contrast to other methods, which are optimized to find native structures. Arne Elofsson's group: http://www.sbc.su.se/~bjorn/ProQ/
Structure Validation : Structure Validation ProCheck
http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html
WhatIf server
http://swift.cmbi.kun.nl/WIWWWI/
Homology Modelling Servers : Homology Modelling Servers
Eva-CM performs continuous and automated analysis of comparative protein structure modeling servers
A current list of the best performing servers can be found at:
http://cubic.bioc.columbia.edu/eva/doc/intro_cm.html
Summary – Homology Modelling : Summary – Homology Modelling
Successful homology modelling depends on the following:
Template quality
Alignment (add biological information)
Modelling program/procedure (use more than one)
Always validate your final model!
Fold Recognition and Ab Initio Protein Structure Prediction : Fold Recognition and Ab Initio Protein Structure Prediction
Outline : Outline Threading and pair potentials
Ab initio methods
Human intervention (what kind of knowledge can be used for alignment and selection of templates?)
Meta-servers (the principle, 3d jury)
Summary of take-home messages
Threading and Pair Potentials : Threading and Pair Potentials Compares a given sequence against known structures (folds), using potentials that describe tendencies observed in known protein structures.
Example: Pair potentials
How normal is it to observe a pair of an alanine and a valine separated by 20 residues in the sequence and 3Å in space? (X)
How normal is it to observe any pair of residues separated by 20 residues and 3Å in space? (Y)
Potential: log (X/Y)
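The log-odds score can be written out as a short sketch, reading X and Y as observed frequencies in a structure database. The counts below are made up for illustration and the function name is an assumption. The sign follows the slide (positive means the specific pair is seen more often than an average pair); many published potentials use the negative of this quantity so that favourable contacts have low energy.

```python
import math


def pair_potential(n_specific_close, n_specific_total, n_any_close, n_any_total):
    """log(X / Y), where X and Y are the fractions of pairs found at the given distance.

    X: specific residue pair (e.g. Ala-Val) at this sequence separation.
    Y: any residue pair at this sequence separation.
    """
    x = n_specific_close / n_specific_total
    y = n_any_close / n_any_total
    return math.log(x / y)


# Hypothetical counts: 30 of 1000 Ala-Val pairs are close, versus 10 of 1000 pairs overall.
score = pair_potential(30, 1000, 10, 1000)
print(score)
```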
Potentials of Mean Force : Potentials of Mean Force Alignment score from structural fitness (pair potential)
How well does K fit environment at P6?
If P8 is acidic then fine, if P8 is basic then poor
(figure: sequence .. A T N L Y K E T L .. placed on structure positions 1-10; deletions marked)
Threading Methods Today : Threading Methods Today Problem: No protein is average
Interactions in proteins cannot be described by pairs of amino acids alone
The information in the potentials is partly captured with sequence profiles
Today mostly used in HYBRID approaches in combination with profile-profile based methods
Potentials can be used to score models based on different templates or alignments
Ab Initio Methods : Ab Initio Methods
Aim is to find the fold of native protein by simulating the biological process of protein folding.
A VERY DIFFICULT task because a protein chain can in principle adopt an astronomical number of different conformations.
Use it only when no detectable homologues are available.
Methods can also be useful for fold recognition in cases of extremely low homology (e.g. convergent evolution).
Fragment-based Ab Initio Modelling : Fragment-based Ab Initio Modelling Rosetta method of the Baker group:
Submit sequence to a number of secondary structure predictors.
Compare fragments of 3 and 9 residues to a library from known structures.
Link fragments together.
Use energy minimization techniques (Monte Carlo optimization) to calculate tertiary structure.
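The fragment step starts by cutting the query into overlapping windows of 3 and 9 residues, each of which is then matched against the fragment library. The sketch below shows only the windowing; the function name and the example sequence are illustrative, and the library lookup and assembly are not implemented.

```python
def fragment_windows(sequence, length):
    """Return (start index, fragment) for every overlapping window of the given length."""
    return [(i, sequence[i:i + length]) for i in range(len(sequence) - length + 1)]


query = "ATNLYKETL"  # toy nine-residue sequence
three_mers = fragment_windows(query, 3)
nine_mers = fragment_windows(query, 9)
print(three_mers)
print(nine_mers)
```

Each window position later receives a list of candidate library fragments, and the Monte Carlo search swaps candidates in and out while scoring the resulting chain.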
Potentials for Finding Good Models : Potentials for Finding Good Models
Use of energy potentials for scoring and computing models.
Potentials should make models more “native-like”.
These can be based on contact potentials, solvation potentials, Van der Waals repulsion and attractive forces, hydrogen bond potentials.
Globularity/radius of gyration (ab initio).
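The radius of gyration is simple to compute from Cα coordinates and gives a quick measure of compactness. The sketch below treats all atoms as having equal mass, which is an assumption; the coordinates are toy values.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid (equal masses)."""
    n = len(coords)
    centroid = tuple(sum(c[i] for c in coords) / n for i in range(3))
    mean_square = sum(math.dist(c, centroid) ** 2 for c in coords) / n
    return math.sqrt(mean_square)


# Four points spaced 3.8 units apart on a line versus the corners of a square.
extended = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
compact = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
print(radius_of_gyration(extended), radius_of_gyration(compact))
```

For the same number of residues, a lower value indicates a more globular model, which is why the term helps steer ab initio sampling away from extended chains.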
Problems with Empirical Potentials : Problems with Empirical Potentials (figure labels: fragments with correct local structure; nature's potential; empirical potential)
Human Intervention : Human Intervention The best methods use maximum knowledge of query proteins.
Specialists can help to find a correct template and correct alignments.
Knowledge of function
Cysteines forming disulfide bridges or binding e.g. zinc ions
Proteolytic cleavage sites
Other metal binding residues
Antibody epitopes or escape mutants
Ligand binding
Results from CD or fluorescence experiments
Knowledge of secondary structure
Meta-Servers : Meta-Servers Democratic modeling
The highest-scoring hit is often wrong.
Many prediction methods have the correct fold among the top 10-20 hits.
If many different prediction methods all have some fold among the top hits, this fold is probably correct.
Example of a Meta-Server : Example of a Meta-Server 3DJury http://bioinfo.pl/meta/
Inspired by Ab initio modeling methods
Average of frequently obtained low energy structures is often closer to the native structure than the lowest energy structure
Find the most abundant high-scoring model in a list of predictions from several predictors
Use output from a set of servers
Superimpose all pairs of structures
Similarity score based on # of Cα pairs within 3.5Å
Similar methods developed by A. Elofsson (Pcons) and D. Fischer (3D shotgun).
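The consensus idea can be sketched as follows. This is a simplified illustration, not the 3DJury implementation: it assumes that all models have the same length, that residues correspond one to one, and that the models are already superimposed in a common frame, so the superposition step itself is skipped. The function names and toy coordinates are hypothetical.

```python
import math


def similarity(model_a, model_b, cutoff=3.5):
    """Number of corresponding C-alpha pairs within the distance cutoff."""
    return sum(1 for a, b in zip(model_a, model_b) if math.dist(a, b) <= cutoff)


def consensus_model(models):
    """Return (index of the model most similar to all others, per-model scores)."""
    scores = [
        sum(similarity(model, other) for j, other in enumerate(models) if j != i)
        for i, model in enumerate(models)
    ]
    best = max(range(len(models)), key=scores.__getitem__)
    return best, scores


def shifted(dy):
    """Toy three-residue trace displaced by dy along the y axis."""
    return [(0.0, dy, 0.0), (3.8, dy, 0.0), (7.6, dy, 0.0)]


models = [shifted(0.0), shifted(2.0), shifted(-2.0), shifted(10.0)]
best, scores = consensus_model(models)
print(best, scores)
```

The model in the middle of the cluster wins because it agrees with the most other predictions, while the outlier scores zero regardless of how well its own server rated it.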
3DJury : 3DJury Because it is a meta-server it can be slow.
If queue is too long some servers are skipped.
Output is only Cα coordinates.
What to do with the rest of the structure?
Use e.g. the MaxSprout server to build side chains and backbone atoms.
http://www.ebi.ac.uk/maxsprout/
Summary – Ab Initio Methods : Summary – Ab Initio Methods
Hybrid methods using both threading methods and profile-profile alignments are the best.
Use ab initio methods only if necessary, and be aware that the quality of the models is low!
Try to use as much knowledge as possible for alignment and template selections in difficult cases.
Use meta-servers when you can.