Notes

The Role of Bioinformatics

Key Definitions

Bioinformatics — interdisciplinary field combining biology, computer science, mathematics, and statistics to analyze and interpret biological data, particularly sequence data.

Computational biology — broader term; includes modeling of biological systems. Bioinformatics is a subset focused on sequence/structure data.

Primary database — stores raw experimental data as submitted (e.g. GenBank, PDB, UniProtKB/TrEMBL).

Secondary database — derived, curated, annotated entries (e.g. UniProtKB/SwissProt, PFAM). Higher reliability, smaller size.

Sequence annotation — process of assigning biological meaning to a sequence: identifying genes, regulatory regions, functional domains.

Core Problems Bioinformatics Addresses

Problem	Example Tool
Sequence similarity	BLAST
Multiple alignment	ClustalW, MUSCLE
Structure prediction	Rosetta, AlphaFold
Gene finding	GenScan
Phylogenetics	RAxML, IQ-TREE
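
To make the "sequence similarity" row concrete, here is a minimal sketch of a global alignment score computed by dynamic programming (the Needleman-Wunsch recurrence). It is not how BLAST works internally, since BLAST uses a faster heuristic local search. The scoring values (match +1, mismatch -1, gap -1) and the function name are illustrative assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of strings a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # against b[:j]; the first row aligns an empty prefix against gaps only.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


# Example: "ACGT" vs "AGT" aligns as A-C-G-T / A---G-T with one gap.
score = global_alignment_score("ACGT", "AGT")
print(score)
```

A higher score means the two sequences can be lined up with fewer mismatches and gaps; real tools use substitution matrices and affine gap penalties instead of these flat values.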

Why Sequences?

DNA sequence encodes the heritable information that specifies RNAs and proteins
Cheaper to sequence than to determine structure experimentally
Sequence → infer structure → infer function (the central pipeline)

Data Explosion

Next Generation Sequencing (NGS) reduced the cost of sequencing by ~6 orders of magnitude since 2001. Data production outpaces Moore's law — storage and analysis are now the bottleneck, not sequencing itself.
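
A quick arithmetic sketch of the "outpaces Moore's law" claim. It assumes Moore's law means a doubling every 2 years and uses a 20-year window; both are assumptions chosen for illustration, and the 10^6 factor is simply the "~6 orders of magnitude" figure quoted above.

```python
import math


def moore_factor(years, doubling_period_years=2.0):
    """Improvement factor if a quantity doubles every doubling_period_years."""
    return 2 ** (years / doubling_period_years)


def orders_of_magnitude(factor):
    """Number of powers of ten contained in an improvement factor."""
    return math.log10(factor)


years = 20                      # assumed window, roughly 2001 onward
moore = moore_factor(years)     # 2**10 = 1024, about 3 orders of magnitude
sequencing = 10 ** 6            # the ~6 orders of magnitude quoted in the notes

print(f"Moore's law over {years} years: ~{orders_of_magnitude(moore):.1f} orders")
print(f"Sequencing cost reduction: ~{orders_of_magnitude(sequencing):.1f} orders")
```

Under these assumptions the sequencing cost fell about a thousand times faster than a Moore's-law trend would predict, which is why compute and storage became the limiting resource.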

Oral Questions

Primary databases store raw experimental data as submitted by researchers — no curation. Examples: GenBank (nucleotide sequences), PDB (protein structures), UniProtKB/TrEMBL.

Secondary databases are derived from primary data through curation, integration, and annotation. They are smaller but more reliable. Examples: UniProtKB/SwissProt, PFAM, SCOP.

Key point: SwissProt entries are manually reviewed; TrEMBL entries are computationally annotated only.
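
The reviewed/unreviewed split is visible in UniProtKB FASTA headers, where the database field is `sp` for Swiss-Prot and `tr` for TrEMBL. A minimal sketch of reading that field follows; the function name is hypothetical, and the TrEMBL header in the example uses a made-up placeholder accession.

```python
def uniprot_section(header):
    """Return 'Swiss-Prot', 'TrEMBL', or None for a UniProtKB FASTA header line."""
    if not header.startswith(">"):
        return None
    # UniProtKB headers look like: >db|Accession|EntryName description ...
    db = header[1:].split("|", 1)[0]
    return {"sp": "Swiss-Prot", "tr": "TrEMBL"}.get(db)


reviewed = ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2"
# Placeholder accession below, for illustration only (not a real entry).
unreviewed = ">tr|X0X0X0|X0X0X0_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606"

print(uniprot_section(reviewed))
print(uniprot_section(unreviewed))
```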

The exponential growth of biological sequence data, particularly after the Human Genome Project published its draft sequence (2001) and the NGS revolution, created a bottleneck: data production far exceeded the capacity for manual analysis.

Bioinformatics emerged to automate sequence analysis, structure prediction, and functional annotation at scale. The core insight: sequence encodes structure, structure encodes function — if we can read one, we can infer the others computationally.

Sequence annotation is the process of assigning biological meaning to a raw sequence: identifying genes, regulatory elements, domains, and functional regions.

It is non-trivial because:
- Not all DNA codes for proteins (only ~1.5% in humans)
- Alternative splicing means one gene → many proteins
- Regulatory signals are short, degenerate, and context-dependent
- Functional inference relies on homology, which assumes conserved function; that assumption is not always valid
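
To see why "identifying genes" is harder than it sounds, consider a deliberately naive open reading frame (ORF) scan: look for ATG, then read codons until a stop codon. This is only a sketch under simplifying assumptions (forward strand only, no introns, no splicing, no regulatory signals), which are exactly the complications listed above; the function name and the `min_codons` threshold are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs (0-based, end exclusive) of forward ORFs.

    An ORF here is ATG ... stop in one reading frame; min_codons counts
    both the start codon and the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the ORF and keep scanning this frame
    return orfs


seq = "CCATGAAATAGCC"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])
```

On a real eukaryotic genome a scan like this would report many spurious ORFs and miss genes split across exons, which is why gene finders combine statistical models with homology evidence.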

MCQ Practice

Q1. Which of the following is a secondary database?

Q2. The central pipeline in bioinformatics is best described as:

Q3. Next Generation Sequencing reduced sequencing costs by approximately:

Q4. What distinguishes bioinformatics from computational biology?
