EmMod

Module 2 · EmMod

Structural & Sequence Alignment

Notes

Core Problem

Structural alignment — from two point sets A and B, find the optimal matched subsets A(P) and B(Q), then find the rigid body transformation G that minimizes the distance metric D between them.

Three steps:

  1. Finding correspondences — exponential (NP-hard)
  2. Finding best rigid body transformation — O(n) (linear)
  3. Calculating RMSD
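
Step 3 can be sketched as follows, assuming the two coordinate lists are already matched one-to-one and superimposed. The function name `rmsd` and the toy coordinates are illustrative and not taken from the notes.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty, equally long coordinate lists")
    # Squared distance for every matched pair, then mean, then square root.
    squared = [math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))

# Toy example: three matched atoms, one of them displaced by 3 Å.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 3.0, 0.0)]
print(round(rmsd(a, b), 3))
```

Because the deviations are squared before averaging, a single badly placed atom raises the RMSD more than it would raise a plain average distance.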

Key Definitions

RMSD — Root Mean Square Deviation. The square root of the mean squared distance between matched atoms after superimposition. Lower = better alignment.

Rigid body transformation — translation (vector, 3 elements) + rotation (3×3 matrix, 3 angles). Describes any movement in 3D space without deformation.
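
A minimal sketch of a rigid body transformation, x' = R x + t, using a rotation about the z axis only. The helper names `rotation_z` and `transform` and the sample points are assumptions made for illustration.

```python
import math

def rotation_z(angle_deg):
    """3x3 rotation matrix for a rotation about the z axis."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def transform(point, rotation, translation):
    """Apply x' = R x + t to one (x, y, z) point."""
    return tuple(
        sum(rotation[i][k] * point[k] for k in range(3)) + translation[i]
        for i in range(3)
    )

R = rotation_z(90)
t = (1.0, 2.0, 3.0)
p, q = (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)
p2, q2 = transform(p, R, t), transform(q, R, t)

# A rigid body move changes positions but not inter-point distances.
print(math.dist(p, q), math.dist(p2, q2))
```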

Structure superimposition — correspondence set is known beforehand. Just optimize position. O(n).

Structure alignment — correspondence set must be found. NP-hard + O(n). Much harder.

Sequence identity — fraction of positions with identical amino acids in an alignment.

Twilight zone — ~25–30% sequence identity. Alignment becomes unreliable; structural inference is uncertain.

Safe zone — >30% identity. Structure can be inferred from sequence with reasonable confidence.

Hamming distance — simple identity-based scoring: 1 if same, 0 if different (strictly, the Hamming distance is the count of differing positions). Ignores amino acid properties.

BLOSUM62 — substitution matrix. Scores based on observed substitution frequencies. Accounts for chemical/physical similarity. Symmetrical. Positive = likely substitution, negative = unlikely.

Gap penalty — negative score for insertions/deletions in alignment. Can differ for opening vs. extension.
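
One common way to separate opening from extension is an affine gap penalty. The sketch below assumes the convention "opening cost for the first gapped position, extension cost for each further one"; other conventions exist, and the values -10 and -1 are placeholders rather than values from the notes.

```python
def affine_gap_score(length, gap_open=-10, gap_extend=-1):
    """Score of one gap spanning `length` positions under an affine scheme."""
    if length <= 0:
        return 0
    # Pay the opening cost once, then the (cheaper) extension cost for the rest.
    return gap_open + gap_extend * (length - 1)

one_long_gap = affine_gap_score(5)
five_short_gaps = 5 * affine_gap_score(1)
print(one_long_gap, five_short_gaps)
```

With these placeholder values a single five-residue gap is penalized far less than five separate one-residue gaps, which is the usual motivation for distinguishing the two costs.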

Global alignment (Needleman-Wunsch) — aligns full length of both sequences.

Local alignment (Smith-Waterman) — finds most similar subsequences.
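
The difference between the two can be seen in a small score-only sketch. It assumes a simple match/mismatch score and a linear gap penalty instead of BLOSUM62 and affine gaps, and the function name `alignment_score` is hypothetical.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score: Needleman-Wunsch (global) or Smith-Waterman (local)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            value = max(H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if local:
                # Local alignment: a path may restart anywhere at score 0.
                value = max(value, 0)
            H[i][j] = value
            best = max(best, value)
    # Global: score of the full-length alignment. Local: best cell anywhere.
    return best if local else H[-1][-1]

long_seq, short_seq = "XXXGATTACAYYY", "GATTACA"
print(alignment_score(long_seq, short_seq, local=False))
print(alignment_score(long_seq, short_seq, local=True))
```

The global score is dragged down by the six unmatched flanking residues, while the local score only reflects the shared `GATTACA` core.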

Homology — shared evolutionary origin. Cannot be quantified (unlike similarity).

Remote homologs — very low sequence identity (<30%) but significant structural similarity. Example: myoglobin vs hemoglobin (~1.8 Å RMSD despite <30% identity).

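Gene Ontology (GO) — hierarchical classification of protein function along three aspects: molecular function, biological process, and cellular component (subcellular localization). Structured as a directed acyclic graph (DAG): a term can have several parents, but there are no loops.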

Comparative modeling (homology modeling) — predict 3D structure of target using known template. Quality scales with sequence identity.

Threading (fold recognition) — match target sequence against library of known 3D folds. Useful at low sequence identity.

Ab initio (de novo) — predict structure from scratch using physical principles. No template. Limited to small proteins.


Sequence → Structure → Function Pipeline

Sequence identity | Structural inference | Notes
>30%              | Safe                 | Homology modeling reliable
25–30%            | Twilight zone        | Unreliable, needs caution
<25%              | Unsafe               | Alignment may fail entirely

Threshold shifts depending on property being inferred:

  • Structure prediction: ~30%
  • Subcellular localization: higher threshold needed
  • Function annotation: varies — false positives possible even at 70%
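
The structure-inference cutoffs above can be turned into a small sketch. It assumes identity is computed over aligned columns where neither sequence has a gap (other denominators are also in use), and the names `sequence_identity` and `zone` as well as the toy alignment are illustrative.

```python
def sequence_identity(aligned_a, aligned_b, gap="-"):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def zone(identity):
    """Map an identity fraction to the zones used for structural inference."""
    if identity > 0.30:
        return "safe"
    if identity >= 0.25:
        return "twilight"
    return "unsafe"

identity = sequence_identity("MKT-AYIAK", "MKTWAYLA-")
print(round(identity, 3), zone(identity))
```

As the list above notes, these cutoffs apply to structure; inferring localization or function would call for stricter ones.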

CE Algorithm (Combinatorial Extension)

Key idea — fragment-based structural alignment.

  1. Break proteins into octameric fragments (8 residues each)
  2. Compare all fragment pairs → AFPs (Aligned Fragment Pairs) with low RMSD
  3. Assemble AFPs into global alignment path using dynamic programming
  4. Evaluate statistical significance via Z-score
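
Steps 1 and 2 can be sketched roughly as below. As a stand-in for the fragment RMSD, this sketch compares the intra-fragment Cα–Cα distances of two fragments, which avoids superimposing them; the 1.0 Å cutoff, the helper names, and the idealised helix and strand traces are all assumptions. Steps 3 and 4 are not covered.

```python
import math

def fragments(coords, size=8):
    """All overlapping windows of `size` consecutive residues."""
    return [coords[i:i + size] for i in range(len(coords) - size + 1)]

def internal_distances(fragment):
    """Distances between every pair of residues inside one fragment."""
    n = len(fragment)
    return [math.dist(fragment[i], fragment[j]) for i in range(n) for j in range(i + 1, n)]

def find_afps(coords_a, coords_b, size=8, cutoff=1.0):
    """Return (i, j) start indices of fragment pairs with similar internal geometry."""
    dists_a = [internal_distances(f) for f in fragments(coords_a, size)]
    dists_b = [internal_distances(f) for f in fragments(coords_b, size)]
    afps = []
    for i, da in enumerate(dists_a):
        for j, db in enumerate(dists_b):
            mean_diff = sum(abs(x - y) for x, y in zip(da, db)) / len(da)
            if mean_diff <= cutoff:
                afps.append((i, j))
    return afps

# Idealised C-alpha traces (assumed geometry): a helix and a fully extended chain.
helix = [
    (2.3 * math.cos(math.radians(100 * k)), 2.3 * math.sin(math.radians(100 * k)), 1.5 * k)
    for k in range(12)
]
strand = [(3.8 * k, 0.0, 0.0) for k in range(12)]

print(len(find_afps(helix, helix)), len(find_afps(helix, strand)))
```

Comparing all fragment pairs is quadratic in chain length, which is the cost the later assembly step has to work with.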

Gap rules for consecutive AFPs:

  • Rule 1: no gaps
  • Rule 2: gap in protein A only
  • Rule 3: gap in protein B only
  • Never simultaneous gaps in both
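
The gap rules can be expressed as a check on two consecutive AFPs. This sketch assumes each AFP is stored as the start index of its fragment in A and in B, and it reads "gap in protein A" as residues of A being skipped between the two fragments; under the opposite reading, rules 2 and 3 simply swap. The function name `afp_transition` is hypothetical.

```python
def afp_transition(prev, nxt, size=8):
    """Classify the step from AFP `prev` to AFP `nxt`, each given as (start_a, start_b)."""
    skipped_a = nxt[0] - (prev[0] + size)  # residues of A between the fragments
    skipped_b = nxt[1] - (prev[1] + size)  # residues of B between the fragments
    if skipped_a < 0 or skipped_b < 0:
        return None  # fragments overlap or run backwards
    if skipped_a == 0 and skipped_b == 0:
        return "rule 1: no gap"
    if skipped_a > 0 and skipped_b == 0:
        return "rule 2: gap in A only"
    if skipped_a == 0 and skipped_b > 0:
        return "rule 3: gap in B only"
    return None  # simultaneous gaps in both proteins are not allowed

print(afp_transition((0, 0), (8, 8)))
print(afp_transition((0, 0), (11, 8)))
print(afp_transition((0, 0), (8, 13)))
print(afp_transition((0, 0), (11, 13)))
```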

Complexity reduction: exponential → quadratic via dynamic programming.

Z-score > 3.5 → ~1 in 1000 chance the alignment is random. High confidence.
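
The generic definition of a Z-score is sketched below: how many standard deviations an observed score lies above the mean of a background of scores from unrelated structures. CE derives its background from its own reference comparisons; the list `background` here is made-up toy data and the function name is an assumption.

```python
import statistics

def z_score(observed, random_scores):
    """Standard deviations by which `observed` exceeds the background mean."""
    mean = statistics.mean(random_scores)
    spread = statistics.stdev(random_scores)
    return (observed - mean) / spread

# Toy background scores (invented for illustration only).
background = [10.0, 12.0, 11.0, 9.0, 13.0, 10.0, 12.0, 11.0]

for observed in (12.0, 17.0):
    z = z_score(observed, background)
    print(observed, round(z, 2), "significant" if z > 3.5 else "not significant")
```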


Structure Representation Methods

Method                | Invariant to rotation? | Size issue        | Notes
C-alpha coordinates   | No                     | -                 | Simple, common
Torsion angles (φ, ψ) | Yes                    | -                 | Local features only, misses global topology
Distance matrix       | Yes                    | n(n-1)/2 elements | Chirality problem; hard to match different sizes

Important Distinctions

  • Same vs Similar — identical copies vs shared structural motifs
  • Different vs Dissimilar — distinct functions vs structurally unrelated
  • Superimposition vs Alignment — known correspondence vs must find it

Cytochrome C Case Study Summary

Comparison                     | Identity | RMSD   | Zone
Human vs Horse                 | 88%      | 0.35 Å | Safe
Human vs Rhodobacter c2        | ~29%     | ~2 Å   | Twilight
Human vs Rhodopseudomonas c2   | 29%      | 1.3 Å  | Twilight
Human vs Arabidopsis c6A       | 13%      | 3.5 Å  | Unsafe

Limitations of Structural Alignment

  1. Topology — conserved secondary structure ≠ conserved topology. CE struggles with non-topological cases.
  2. Fragment length — 8 residues is empirical, not always optimal.
  3. Domains — multidomain proteins need per-domain alignment; global RMSD misleading.

Oral Questions (13 questions)

Superimposition — the correspondence between atoms is already known. You just find the best rotation/translation to minimize distance. Complexity is O(n).

Alignment — the correspondence set must be found as part of the problem. This is NP-hard. Once correspondences are found, the transformation is still O(n).

Key point: alignment = find correspondences + superimpose. Superimposition = just superimpose.
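
For the superimposition step, one widely used method is the Kabsch algorithm; the notes do not name a specific method, so treat this as one possible choice. The sketch assumes NumPy is available and that row i of `P` corresponds to row i of `Q`. The covariance accumulation is linear in the number of atoms, followed by an SVD of a fixed 3×3 matrix.

```python
import numpy as np

def kabsch_superimpose(P, Q):
    """Return (R, t, rmsd) such that P @ R.T + t best matches Q (rows are matched atoms)."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    p_centroid, q_centroid = P.mean(axis=0), Q.mean(axis=0)
    P0, Q0 = P - p_centroid, Q - q_centroid
    H = P0.T @ Q0                      # 3x3 covariance matrix, O(n) to build
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (a reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_centroid - R @ p_centroid
    fitted = P @ R.T + t
    rmsd = float(np.sqrt(((fitted - Q) ** 2).sum(axis=1).mean()))
    return R, t, rmsd

# Toy check: rotate and shift a point set, then recover the transformation.
theta = np.radians(30.0)
R_true = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
t_true = np.array([5.0, -2.0, 1.0])
P = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=float)
Q = P @ R_true.T + t_true

R_fit, t_fit, rmsd_fit = kabsch_superimpose(P, Q)
print(np.round(R_fit, 3), np.round(t_fit, 3), rmsd_fit < 1e-8)
```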

The twilight zone is the range of sequence identity (~25–30%) where alignment-based inference becomes unreliable.

Above ~30%: structural similarity can be inferred safely from sequence. Below ~25%: alignments are likely to fail — critical residues may be misaligned.

Important: the threshold is not fixed — it depends on what you're inferring. Subcellular localization requires higher identity than structure prediction. Even at 70% identity, functional predictions can produce false positives.

Hamming distance treats all amino acids as equal — a score of 1 if identical, 0 otherwise.

This ignores the chemical and physical properties of amino acids. In reality, some substitutions are conservative (e.g., Leu → Ile) and others are radical (e.g., Gly → Trp).

BLOSUM62 solves this by assigning scores based on observed substitution frequencies in aligned proteins — similar residues get positive scores, dissimilar ones get negative.
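
The contrast can be seen on the two substitutions mentioned above. The dictionary holds only a handful of BLOSUM62 entries needed for this example; a real analysis would load the full 20×20 matrix. The names `BLOSUM62_SUBSET`, `identity_score`, and `blosum_score` are illustrative.

```python
# A few BLOSUM62 entries only (not the full matrix).
BLOSUM62_SUBSET = {
    ("L", "L"): 4, ("I", "I"): 4, ("G", "G"): 6, ("W", "W"): 11,
    ("L", "I"): 2, ("G", "W"): -2,
}

def identity_score(a, b):
    """Identity-based scoring: 1 if identical, 0 otherwise."""
    return 1 if a == b else 0

def blosum_score(a, b):
    """Look up a pair in the subset; the matrix is symmetric, so try both orders."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]

for a, b in [("L", "I"), ("G", "W")]:
    print(a, b, identity_score(a, b), blosum_score(a, b))
```

Identity scoring gives both substitutions the same score, while the substitution matrix rewards the conservative Leu → Ile change and penalizes the radical Gly → Trp change.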

Because the size and identity of the optimal subset are both unknown. You must consider all possible subsets of all possible sizes from both structures.

The number of combinations grows exponentially with the number of atoms. This makes exhaustive search computationally infeasible for real proteins.

Solutions like CE transform this into a tractable problem using heuristics and dynamic programming, reducing complexity to quadratic.
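
To get a feel for the growth, the sketch below counts candidate correspondences under one simplifying assumption: matches are one-to-one and preserve chain order. For k matched pairs there are C(n, k) · C(m, k) choices, summed over every possible k (including the empty correspondence). Without the ordering constraint the count would be larger still. The function name is hypothetical.

```python
from math import comb

def count_correspondences(n, m):
    """Number of order-preserving one-to-one correspondences between chains of n and m residues."""
    return sum(comb(n, k) * comb(m, k) for k in range(min(n, m) + 1))

for size in (5, 10, 20, 50):
    print(size, count_correspondences(size, size))
```

For two chains of equal length n this sum equals C(2n, n), which grows roughly like 4^n, so even small proteins are far beyond exhaustive search.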

CE (Combinatorial Extension) uses a fragment-based approach:

  1. Splits proteins into overlapping octameric fragments (8 residues)
  2. Computes RMSD for all fragment pairs → identifies AFPs (Aligned Fragment Pairs)
  3. Assembles AFPs into a global alignment path using dynamic programming
  4. Allows gaps only in one sequence at a time (3 rules)
  5. Evaluates significance via Z-score (threshold 3.5 → ~1 in 1000 false positive rate)

Key insight: dynamic programming reduces exponential complexity to quadratic.

Remote homologs are proteins with very low sequence identity (<30%) but significant structural similarity (low RMSD). Example: myoglobin and hemoglobin (~1.8 Å RMSD, <30% identity).

They are difficult to detect because:

  • Standard sequence alignment methods rely on sequence similarity to find correspondences
  • Below 30% identity, alignments are unreliable
  • The evolutionary signal in sequence has largely been erased

Detection typically requires structure-based methods or deep learning approaches.

1. Comparative modeling (homology modeling) — used when a closely related template exists (>30% sequence identity). Aligns target to template and transfers structural information. Quality improves with identity.

2. Threading (fold recognition) — used when sequence identity is low but structural fold may be shared. Matches target sequence against a library of known 3D folds.

3. Ab initio / de novo — used when no template exists. Predicts structure from physical principles (energy minimization). Limited to small proteins due to computational cost.

High sequence identity does not guarantee the same function. A protein can undergo mutations at functionally critical sites — catalytic residues, binding sites, regulatory domains — even while the overall sequence is largely conserved.

Example: one protein with catalytic activity, another with regulatory function, both at ~70% identity.

This is why the distinction between different (distinct functions) and dissimilar (structurally unrelated) is important. High sequence similarity → similar structure likely, but function can still diverge.

A distance matrix records the pairwise distances between all C-alpha atoms in a protein. It contains n(n-1)/2 elements for n atoms.

Advantages:

  • Invariant to rotation and translation
  • Straightforward to compare structures

Disadvantages:

  • Computationally expensive (O(n²) elements)
  • Chirality is not captured
  • Comparing matrices of different sizes requires specialized algorithms
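
Both the invariance and the chirality problem can be checked on a toy example. The four coordinates below are invented, and the helper names are assumptions.

```python
import math

def distance_matrix(coords):
    """Full symmetric matrix of pairwise distances between points."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def max_abs_difference(m1, m2):
    """Largest element-wise difference between two equally sized matrices."""
    return max(abs(a - b) for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))

coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0), (4.0, 5.0, 3.5)]
rotated = [(-y, x, z) for x, y, z in coords]    # 90 degree rotation about z
mirrored = [(x, y, -z) for x, y, z in coords]   # mirror image, opposite handedness

original = distance_matrix(coords)
print(max_abs_difference(original, distance_matrix(rotated)))
print(max_abs_difference(original, distance_matrix(mirrored)))
```

Both differences are zero up to rounding: the rotation leaves the matrix unchanged, which is the advantage, but so does the mirror image, which is why handedness cannot be read from distances alone.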

A contact map is a binary simplification of the distance matrix. Instead of storing the exact distance between every pair of C-alpha atoms, you apply a threshold (typically 8 Å) and store:

  • 1 if the two residues are in contact (distance ≤ threshold)
  • 0 if they are not

So the distance matrix is continuous; the contact map is discrete.

Advantages over full distance matrix:

  • Much cheaper to compute and store
  • Still invariant to rotation and translation
  • Captures the essential topology of the fold

Limitation: you lose the exact distance information — two pairs both scored 1 could have very different actual distances.
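
A minimal sketch of building a contact map from C-alpha coordinates, with the 8 Å threshold as the default. The four toy coordinates are invented so that two pairs at very different distances (3.8 Å and 7.9 Å) both end up as 1.

```python
import math

def contact_map(coords, threshold=8.0):
    """Binary matrix: 1 where two points lie within `threshold`, else 0."""
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) <= threshold else 0 for j in range(n)]
        for i in range(n)
    ]

coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.9, 0.0, 0.0), (20.0, 0.0, 0.0)]
cmap = contact_map(coords)
for row in cmap:
    print(row)
```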

Contact maps encode the topology of a protein fold in a rotation/translation-invariant way.

Two proteins with similar folds will have similar contact map patterns — the same pairs of residues tend to be in contact, reflecting conserved secondary and tertiary structure.

Practical uses:

  • Structure comparison without needing to superimpose — useful for remote homologs
  • Structure prediction — predicting which residues will be in contact from sequence is easier than predicting full 3D coordinates; contacts constrain the fold
  • Fold recognition — matching a predicted contact map against a database of known maps

In AlphaFold-like methods, predicting inter-residue distances (a generalization of contact maps) was a key intermediate step toward accurate 3D prediction.

The most common threshold is 8 Å between C-alpha atoms, though 6 Å (all atoms) or 12 Å are also used depending on the application.

The threshold matters because:

  • Too small → very sparse map, misses long-range interactions that define the fold
  • Too large → too dense, loses discriminative power — almost every residue is in contact

The 8 Å cutoff is empirically chosen to capture meaningful secondary structure contacts (alpha helices show diagonal bands, beta sheets show off-diagonal stripes) while keeping the map sparse enough to be informative.
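
The effect of the threshold can be illustrated on a toy "globule": 27 points on a 3×3×3 lattice with 3.8 Å spacing. This is an invented stand-in for a compact protein, not real coordinates, and the function name `contact_density` is an assumption.

```python
import math
from itertools import combinations, product

def contact_density(coords, threshold):
    """Fraction of distinct point pairs that count as contacts at this threshold."""
    pairs = list(combinations(coords, 2))
    in_contact = sum(1 for p, q in pairs if math.dist(p, q) <= threshold)
    return in_contact / len(pairs)

# Toy compact point cloud: 3 x 3 x 3 lattice, 3.8 Å between neighbours.
globule = [(3.8 * x, 3.8 * y, 3.8 * z) for x, y, z in product(range(3), repeat=3)]

for threshold in (4.0, 8.0, 20.0):
    print(threshold, round(contact_density(globule, threshold), 3))
```

At 20 Å every pair in this small cloud is in contact, so the map no longer distinguishes anything; at 4 Å only immediate neighbours remain.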

Contact maps have characteristic visual signatures for secondary structure elements:

Alpha helix → a band running parallel to the main diagonal, offset by 3–4 positions. This is because residues i and i+4 are in contact in a helix (one full turn = 3.6 residues).

Beta sheet (parallel) → bands running parallel to the diagonal but far from it, reflecting long-range contacts between strands.

Beta sheet (antiparallel) → bands running perpendicular to the diagonal (i.e., anti-diagonal stripes), because residues pair in opposite sequence directions.

Loop regions → scattered, irregular contacts.

These patterns make contact maps interpretable even without viewing the 3D structure.
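
The helix signature can be reproduced with an idealised C-alpha trace. The geometry used here (radius 2.3 Å, 100° and 1.5 Å rise per residue, i.e. 3.6 residues per turn) is a textbook approximation, and the function name `contact_offsets` is hypothetical.

```python
import math

def contact_offsets(coords, threshold=8.0):
    """Sorted sequence separations j - i (> 0) at which at least one contact occurs."""
    n = len(coords)
    return sorted({
        j - i
        for i in range(n)
        for j in range(i + 1, n)
        if math.dist(coords[i], coords[j]) <= threshold
    })

# Idealised alpha helix: 20 residues on a regular spiral.
helix = [
    (2.3 * math.cos(math.radians(100 * k)), 2.3 * math.sin(math.radians(100 * k)), 1.5 * k)
    for k in range(20)
]

print(contact_offsets(helix))
```

Contacts occur only at separations of up to four residues, so the map of an isolated helix is a narrow band hugging the main diagonal.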

MCQ Practice (14 questions)

Q1. What is the computational complexity of finding the best rigid body transformation once correspondences are known?

Q2. Which of the following best describes the 'twilight zone'?

Q3. In the CE algorithm, what is an AFP?

Q4. What does BLOSUM62 encode that Hamming distance does not?

Q5. A Z-score above 3.5 in CE alignment means:

Q6. Which method is most appropriate when no structural template is available?

Q7. What is the key advantage of the distance matrix representation for structural alignment?

Q8. Myoglobin and hemoglobin are an example of:

Q9. In CE, gaps are handled by:

Q10. Which of these best explains why functional annotation has a higher twilight zone threshold than structure prediction?

Q11. A contact map differs from a distance matrix in that it:

Q12. Which secondary structure produces off-diagonal perpendicular stripes in a contact map?

Q13. What is the main advantage of using contact maps for structure comparison over superimposition-based methods?

Q14. If you set the contact map threshold too high (e.g., 20 Å), what happens?
