Module 1 · Rita Casadio
Bioinformatics — interdisciplinary field combining biology, computer science, mathematics, and statistics to analyze and interpret biological data, particularly sequence data.
Computational biology — broader term; includes modeling of biological systems. Bioinformatics is a subset focused on sequence/structure data.
Primary database — stores raw experimental data as submitted (e.g. GenBank, PDB, UniProtKB/TrEMBL).
Secondary database — derived, curated, annotated entries (e.g. UniProtKB/SwissProt, PFAM). Higher reliability, smaller size.
Sequence annotation — process of assigning biological meaning to a sequence: identifying genes, regulatory regions, functional domains.
| Problem | Example Tool |
|---|---|
| Sequence similarity | BLAST |
| Multiple alignment | ClustalW, MUSCLE |
| Structure prediction | Rosetta, AlphaFold |
| Gene finding | GenScan |
| Phylogenetics | RAxML, IQ-TREE |
Next-Generation Sequencing (NGS) has reduced the cost of sequencing by ~6 orders of magnitude since 2001. Sequencing throughput has grown faster than Moore's law, so storage and analysis, not sequencing itself, are now the bottleneck.
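To make the "orders of magnitude" and "faster than Moore's law" claims concrete, here is a minimal Python sketch of the arithmetic. The cost figures and the 20-year window are illustrative placeholders chosen to give a round result, not measured data, and the two-year halving period is the usual rule-of-thumb reading of Moore's law.

```python
import math


def orders_of_magnitude(start_cost: float, end_cost: float) -> float:
    """Return log10 of the fold reduction from start_cost to end_cost."""
    return math.log10(start_cost / end_cost)


def moore_fold_reduction(years: float, halving_period: float = 2.0) -> float:
    """Fold reduction expected if cost halves once every halving_period years."""
    return 2 ** (years / halving_period)


# Illustrative placeholder figures (assumptions, not measured data).
start_cost = 100_000_000.0  # assumed cost per genome at the start of the window
end_cost = 100.0            # assumed cost per genome at the end of the window
years = 20                  # assumed length of the window

sequencing_orders = orders_of_magnitude(start_cost, end_cost)
moore_orders = math.log10(moore_fold_reduction(years))

print(f"sequencing: {sequencing_orders:.1f} orders of magnitude")
print(f"Moore's law over {years} years: {moore_orders:.1f} orders of magnitude")
```

Under these assumptions, a Moore's-law pace accounts for only about three orders of magnitude (a 1024-fold reduction) over 20 years, which is the gap the note refers to.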
Primary databases store raw experimental data as submitted by researchers — no curation. Examples: GenBank (nucleotide sequences), PDB (protein structures), UniProtKB/TrEMBL.
Secondary databases are derived from primary data through curation, integration, and annotation. They are smaller but more reliable. Examples: UniProtKB/SwissProt, PFAM, SCOP.
Key point: SwissProt entries are manually reviewed; TrEMBL entries are computationally annotated only.
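The reviewed/unreviewed split is visible directly in UniProtKB FASTA headers: Swiss-Prot entries start with `sp|` and TrEMBL entries with `tr|`. The sketch below, in Python, classifies a header on that prefix alone. The function name is hypothetical, and the second example header uses a made-up accession for illustration only.

```python
def uniprot_review_status(fasta_header: str) -> str:
    """Classify a UniProtKB FASTA header by its database prefix (sp or tr)."""
    line = fasta_header.lstrip(">").strip()
    db = line.split("|", 1)[0]
    if db == "sp":
        return "Swiss-Prot (reviewed)"
    if db == "tr":
        return "TrEMBL (unreviewed)"
    return "unknown"


# Real Swiss-Prot entry: human hemoglobin subunit alpha.
reviewed = ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2"
# Hypothetical TrEMBL-style header; the accession is invented for illustration.
unreviewed = ">tr|A0A000XXX0|A0A000XXX0_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606"

print(uniprot_review_status(reviewed))
print(uniprot_review_status(unreviewed))
```

When building a reference set for annotation transfer, filtering on this prefix is a simple way to keep only manually reviewed entries.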
The exponential growth of biological sequence data, particularly after the Human Genome Project (draft sequence published in 2001) and the NGS revolution, created a bottleneck: data production far exceeded the capacity for manual analysis.
Bioinformatics emerged to automate sequence analysis, structure prediction, and functional annotation at scale. The core insight: sequence encodes structure, structure encodes function — if we can read one, we can infer the others computationally.
Sequence annotation is the process of assigning biological meaning to a raw sequence: identifying genes, regulatory elements, domains, and functional regions.
It is non-trivial because:
- Not all DNA codes for proteins (only ~1.5% in humans)
- Alternative splicing means one gene → many proteins
- Regulatory signals are short, degenerate, and context-dependent
- Functional inference relies on homology, which assumes conserved function, and that assumption is not always valid
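A naive gene finder shows why these points matter. The Python sketch below scans the three forward reading frames for stretches that begin with ATG and end at the first in-frame stop codon. It is a deliberately simplified illustration, not how tools such as GenScan work: it ignores the reverse strand, introns and splicing, and regulatory context, and the function name and `min_codons` threshold are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int, int]]:
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    start is inclusive, end is exclusive and includes the stop codon.
    min_codons counts the start and stop codons.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


dna = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(dna):
    print(frame, start, end, dna[start:end])
```

On real eukaryotic DNA a scan like this reports many spurious hits in non-coding sequence and misses genes whose coding sequence is split across exons, which is why practical gene finders rely on probabilistic models and additional evidence.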
Q1. Which of the following is a secondary database?
Q2. The central pipeline in bioinformatics is best described as:
Q3. Next Generation Sequencing reduced sequencing costs by approximately:
Q4. What distinguishes bioinformatics from computational biology?