LAB1 — Laboratory of Bioinformatics 1

Core Problem

Structural alignment — given two point sets A and B, find equally sized subsets A(P) and B(Q) of corresponding points, then find the rigid body transformation G that minimizes the distance metric D between the two subsets after superimposition.

Three steps:

Finding correspondences — exponential (NP-hard)
Finding best rigid body transformation — O(n) (linear)
Calculating RMSD

Key Definitions

RMSD — Root Mean Square Deviation. Measures average distance between matched atoms after superimposition. Lower = better alignment.
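For N matched atom pairs with coordinates a_i and b_i, this is RMSD = sqrt((1/N) Σ |a_i − b_i|²). A minimal Python sketch, assuming the two coordinate lists are already matched pair by pair and already superimposed (the function name `rmsd` is illustrative):

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of matched (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum of squared distances between matched atoms, averaged, then square-rooted.
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))
```

The value is in the same unit as the coordinates (Å for PDB files).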

Rigid body transformation — translation (vector, 3 elements) + rotation (3×3 matrix, determined by 3 angles), 6 degrees of freedom in total. Describes any movement in 3D space without deformation.

Structure superimposition — correspondence set is known beforehand. Just optimize position. O(n).
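One common way to do this optimization is the Kabsch algorithm, named here as an assumption since the notes do not say which method is meant. It is linear in the number of points: building the 3×3 covariance matrix is O(n) and the SVD of a 3×3 matrix is constant time. A sketch with NumPy, assuming rows of P and Q are already matched:

```python
import numpy as np

def kabsch_superimpose(P, Q):
    """Superimpose P onto Q (both N x 3, rows already matched).

    Returns rotation R, translation t, and the RMSD after applying
    P @ R.T + t.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)   # centroids
    P0, Q0 = P - cP, Q - cQ                   # centered coordinates
    H = P0.T @ Q0                             # 3 x 3 covariance matrix, O(n)
    U, S, Vt = np.linalg.svd(H)
    # Force a proper rotation (determinant +1), never a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    P_fit = P @ R.T + t
    rmsd = float(np.sqrt(((P_fit - Q) ** 2).sum(axis=1).mean()))
    return R, t, rmsd
```

R and t together are the rigid body transformation (rotation matrix plus translation vector).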

Structure alignment — correspondence set must be found first (NP-hard), then superimposed (O(n)). Much harder.

Sequence identity — fraction of positions with identical amino acids in an alignment.
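A minimal sketch of the calculation, assuming two rows of an existing alignment with `-` as the gap character. The denominator convention is an assumption (columns where both rows have a residue); other conventions, such as dividing by the shorter sequence length, give different values.

```python
def sequence_identity(aln_a, aln_b, gap="-"):
    """Fraction of identical residues over columns where both rows have a residue."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns without a gap in either row.
    pairs = [(a, b) for a, b in zip(aln_a, aln_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)
```

For example, `sequence_identity("ACD-F", "ACE-F")` compares four columns and finds three identical, giving 0.75.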

Twilight zone — ~25–30% sequence identity. Alignment becomes unreliable; structural inference is uncertain.

Safe zone — >30% identity. Structure can be inferred from sequence with reasonable confidence.

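Hamming distance — simplest scoring: compare two equal-length sequences position by position and count mismatches (0 if same, 1 if different); the complementary identity score counts matches. Ignores amino acid properties.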
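A short sketch of this position-by-position scoring for equal-length sequences. In the usual sense the Hamming distance counts mismatches, and the identity score counts matches; both function names are illustrative.

```python
def hamming_distance(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))

def identity_score(seq_a, seq_b):
    """Number of positions with identical residues (length minus Hamming distance)."""
    return len(seq_a) - hamming_distance(seq_a, seq_b)
```

A conservative change such as L to I costs exactly as much as L to D here, which is the weakness a substitution matrix addresses.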

BLOSUM62 — substitution matrix. Scores based on observed substitution frequencies. Accounts for chemical/physical similarity. Symmetrical. Positive = likely substitution, negative = unlikely.

Gap penalty — negative score for insertions/deletions in alignment. Can differ for opening vs. extension.
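With separate opening and extension costs (an affine scheme), a gap of length k scores open + (k − 1) × extend. A minimal sketch; the default values are illustrative assumptions, not taken from these notes:

```python
def affine_gap_score(length, gap_open=-10.0, gap_extend=-1.0):
    """Score of one gap of `length` residues under an affine penalty."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0
    # The first gapped position pays the opening cost, the rest pay extension.
    return gap_open + (length - 1) * gap_extend
```

With these values one gap of length 3 scores -12, while three separate gaps of length 1 score -30, so a single long indel is preferred over many short ones.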

Global alignment (Needleman-Wunsch) — aligns full length of both sequences.
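A compact sketch of the Needleman-Wunsch recurrence and traceback, assuming a simple match/mismatch score and a linear gap penalty. All three values are illustrative; a substitution matrix such as BLOSUM62 would replace the match/mismatch test.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    # Traceback from the bottom-right corner to the origin.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Both the fill step and the memory use are O(n × m) for sequences of length n and m.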

Local alignment (Smith-Waterman) — finds most similar subsequences.
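A sketch of Smith-Waterman under the same assumptions (illustrative match/mismatch scores, linear gap penalty). The differences from the global case are the floor at 0 in every cell and a traceback that starts at the highest-scoring cell and stops at the first 0.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Scores never drop below 0, so an alignment can restart anywhere.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Traceback from the best cell until a cell with score 0 is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

For `smith_waterman("TTTACGTAAA", "CCCACGTGGG")` only the shared core `ACGT` is reported; the unrelated flanks are ignored.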

Homology — shared evolutionary origin. Qualitative (two proteins are homologous or they are not), so it cannot be quantified (unlike similarity).

Remote homologs — very low sequence identity (<30%) but significant structural similarity. Example: myoglobin vs hemoglobin (~1.8 Å RMSD despite <30% identity).

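Gene Ontology (GO) — hierarchical classification of gene product function in three aspects: molecular function, biological process, cellular component (subcellular localization). A term can have several parents, so the structure is a DAG (directed acyclic graph, no loops), not a tree.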
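The DAG property matters when annotations are propagated upward: a term inherits through every parent it has. A small sketch on a toy graph; the term names in `TOY_PARENTS` are placeholders, not real GO identifiers.

```python
# Toy child -> parents mapping (hypothetical terms; "term_d" has two parents).
TOY_PARENTS = {
    "term_d": ["term_b", "term_c"],
    "term_b": ["term_a"],
    "term_c": ["term_a"],
    "term_a": [],
}

def ancestors(term, parents):
    """All terms reachable by following parent links upward from `term`."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:   # each ancestor is visited only once
                seen.add(parent)
                stack.append(parent)
    return seen
```

Because the graph has no loops, the traversal always terminates at the root terms.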

Comparative modeling (homology modeling) — predict 3D structure of target using known template. Quality scales with sequence identity.

Threading (fold recognition) — match target sequence against library of known 3D folds. Useful at low sequence identity.

Ab initio (de novo) — predict structure from scratch using physical principles. No template. Limited to small proteins.

Sequence → Structure → Function Pipeline

Sequence identity	Structural inference	Notes
>30%	Safe	Homology modeling reliable
25–30%	Twilight zone	Unreliable, needs caution
<25%	Unsafe	Alignment may fail entirely

Threshold shifts depending on property being inferred:

Structure prediction: ~30%
Subcellular localization: higher threshold needed
Function annotation: varies — false positives are possible even at 70% sequence identity
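The structural-inference thresholds from the table above can be written as a small helper. This is a sketch only: the function name is illustrative, the boundary at exactly 30% is assigned to the twilight zone because the table lists ">30%" as safe, and the cutoffs apply to structure, not to localization or function.

```python
def identity_zone(identity_percent):
    """Classify a sequence identity (in percent) for structural inference."""
    if not 0 <= identity_percent <= 100:
        raise ValueError("identity must be between 0 and 100")
    if identity_percent > 30:
        return "safe"
    if identity_percent >= 25:
        return "twilight"
    return "unsafe"
```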

CE Algorithm (Combinatorial Extension)

Key idea — fragment-based structural alignment.

Break proteins into octameric fragments (8 residues each)
Compare all fragment pairs → AFPs (Aligned Fragment Pairs) with low RMSD
Assemble AFPs into global alignment path using dynamic programming
Evaluate statistical significance via Z-score
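The first two steps can be sketched as follows. To stay free of any superposition step, this sketch scores a fragment pair by the mean absolute difference between their internal Cα distance sets, a rotation-invariant stand-in for fragment RMSD. The function names and the `cutoff` value are assumptions for illustration.

```python
import math
from itertools import combinations

def fragment_distances(coords, start, size):
    """All pairwise distances inside the fragment coords[start:start + size]."""
    frag = coords[start:start + size]
    return [math.dist(p, q) for p, q in combinations(frag, 2)]

def find_afps(coords_a, coords_b, size=8, cutoff=1.0):
    """Return (i, j, score) for fragment pairs with similar internal geometry.

    i and j are fragment start indices in A and B; a lower score means
    more similar fragments.
    """
    frags_a = [fragment_distances(coords_a, i, size)
               for i in range(len(coords_a) - size + 1)]
    frags_b = [fragment_distances(coords_b, j, size)
               for j in range(len(coords_b) - size + 1)]
    afps = []
    for i, da in enumerate(frags_a):          # all-against-all comparison
        for j, db in enumerate(frags_b):
            score = sum(abs(x - y) for x, y in zip(da, db)) / len(da)
            if score <= cutoff:
                afps.append((i, j, score))
    return afps
```

The all-against-all comparison is quadratic in protein length; the expensive part of the method is assembling the resulting AFPs into one path.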

Gap rules for consecutive AFPs:

Rule 1: no gaps
Rule 2: gap in protein A only
Rule 3: gap in protein B only
Never simultaneous gaps in both
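A sketch of these rules as a check on two consecutive AFPs, each given as a pair of start indices (start in A, start in B). It assumes that "gap in protein A" means unaligned residues skipped in A between the two fragments; the function name is illustrative.

```python
def gap_rule(afp_prev, afp_next, size=8):
    """Return 1, 2, or 3 for the rule that applies, or None if none does."""
    (a1, b1), (a2, b2) = afp_prev, afp_next
    gap_a = a2 - (a1 + size)   # residues skipped in protein A
    gap_b = b2 - (b1 + size)   # residues skipped in protein B
    if gap_a < 0 or gap_b < 0:
        return None            # fragments overlap or are out of order
    if gap_a == 0 and gap_b == 0:
        return 1               # no gaps
    if gap_a > 0 and gap_b == 0:
        return 2               # gap in protein A only
    if gap_a == 0 and gap_b > 0:
        return 3               # gap in protein B only
    return None                # gaps in both proteins are not allowed
```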

Complexity reduction: exponential → quadratic via dynamic programming.

Z-score > 3.5 → an alignment this good arises by chance only about 1 time in 1000. High confidence.
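A Z-score expresses how many standard deviations an observed score lies above the mean of a background distribution. A minimal sketch, assuming a higher score is better and that `background` holds scores from alignments of unrelated structures (how that background is built is not specified in these notes):

```python
import statistics

def z_score(observed, background):
    """Standard deviations by which `observed` exceeds the background mean."""
    mu = statistics.mean(background)
    sigma = statistics.stdev(background)   # sample standard deviation
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (observed - mu) / sigma
```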

Structure Representation Methods

Method	Invariant to rotation?	Size issue	Notes
C-alpha coordinates	✗	—	Simple, common
Torsion angles (φ, ψ)	✓	—	Local features only, misses global topology
Distance matrix	✓	n(n-1)/2 elements	Chirality problem; hard to match different sizes
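The invariance and the chirality problem of the distance matrix can be checked directly. In this sketch (all helper names are illustrative) a rotated copy and a mirror image both give the same matrix as the original, even though the mirror image cannot be reached by any rigid body transformation.

```python
import math

def distance_matrix(coords):
    """Full symmetric matrix of pairwise distances between points."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def rotate_z(coords, angle):
    """Rotate points about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]

def mirror_x(coords):
    """Reflect points through the yz plane (flips chirality)."""
    return [(-x, y, z) for x, y, z in coords]

def same_matrix(m1, m2, tol=1e-9):
    """True if two equally sized matrices agree within `tol`."""
    return all(abs(a - b) <= tol
               for r1, r2 in zip(m1, m2)
               for a, b in zip(r1, r2))
```

Only n(n-1)/2 of the n² entries are independent, because the matrix is symmetric with a zero diagonal.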

Important Distinctions

Same vs Similar — identical copies vs shared structural motifs
Different vs Dissimilar — distinct functions vs structurally unrelated
Superimposition vs Alignment — known correspondence vs must find it

Cytochrome C Case Study Summary

Comparison	Identity	RMSD	Zone
Human vs Horse	88%	0.35 Å	Safe
Human vs Rhodobacter c2	~29%	~2 Å	Twilight
Human vs Rhodopseudomonas c2	29%	1.3 Å	Twilight
Human vs Arabidopsis c6A	13%	3.5 Å	Unsafe

Limitations of Structural Alignment

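Topology — conserved secondary structure ≠ conserved topology. CE builds its alignment path in sequence order, so it struggles with non-topological cases (same structural elements, different order or connectivity).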
Fragment length — 8 residues is empirical, not always optimal.
Domains — multidomain proteins need per-domain alignment; a single global RMSD is misleading when domains are oriented differently.