1 Chapter 17: Completed genomes: bacteria and archaea. Jonathan Pevsner, Ph.D. Bioinformatics and Functional Genomics (Wiley-Liss, 3rd edition, 2015). You may use this PowerPoint for teaching purposes.
2 Outline: Introduction; Classification of bacteria and archaea; The human microbiome; Analysis of bacterial and archaeal genomes; Nucleotide composition; Finding genes; Gene annotation; Lateral gene transfer; Comparison of bacterial genomes; Perspective
3 Learning objectives After studying this chapter, you should be able to: ■ define bacteria and archaea; ■ explain the bases of their classification; ■ describe the genomes of Escherichia coli and other bacteria; ■ describe bioinformatics approaches to identifying and characterizing bacterial and archaeal genes and proteins; and ■ compare bacterial genomes.
4 Bacteria and archaea: genome analysis. Bacteria and archaea constitute two of the three main branches of life. Together they are the prokaryotes (although some discourage use of that term because it does not correspond to a satisfactory evolutionary model). Bacteria and archaea lack a membrane-bound nucleus, extensive intracellular organelles, and a cytoskeleton, all of which are features common to eukaryotes. The word microbe refers to microorganisms that cause disease. These include bacteria, archaea, and a variety of eukaryotes (e.g. fungi and protozoa) that we discuss later.
6 Bacteria and archaea: genome analysis. We can classify bacteria and archaea (prokaryotes) based on six criteria: [1] morphology; [2] genome size; [3] lifestyle; [4] relevance to human disease; [5] molecular phylogeny (rRNA); [6] molecular phylogeny (other molecules).
7 Bacterial and archaeal classification: genome size. Bacterial and archaeal genomes vary over a 25-fold range from ~0.5 megabases (Mb) to ~13 Mb. Bacteria: typically ~0.5 Mb to 13.2 Mb. Smallest: Candidatus Carsonella ruddii PV (0.16 Mb). Largest: Solibacter usitatus Ellin6076 (10 Mb). Archaea: ~0.5 Mb to ~6 Mb. Smallest: Nanoarchaeum equitans Kin4-M (0.49 Mb). Largest: Methanosarcina acetivorans C2A (5.75 Mb).
8 Bacterial and archaeal classification: genome size. Genome size comparisons: A Nanoarchaeum equitans 490,885 bp 582 genes; B Mycoplasma genitalium 580,070 bp 506 genes; V Mimivirus Mb ~1200 genes; B Streptomyces coelicolor 8.7 Mb genes; B Myxococcus xanthus Mb 7388 genes; E Schizosaccharomyces pombe 13 Mb genes. Key: V=virus (Chapter 16); A=archaeon; B=bacterium; E=eukaryote (Chapters 18-19).
9 Classification of bacteria. B&FG 3e Table 17.1 Page 799. Bacteria are a kingdom, followed by “intermediate ranks.”
10 Classification of archaea. B&FG 3e Table 17.2 Page 800. Archaea are a kingdom, followed by “intermediate ranks.”
11 Bacterial classification: morphology. The Gram stain is absorbed by about half of all bacteria. (It reflects the protein and peptidoglycan composition of the cell wall.) Most bacteria can be classified in the following groups:
Type: Examples
Gram-positive cocci†: Staphylococcus aureus
Gram-positive rods: Bacillus anthracis (anthrax)
Gram-negative cocci: Neisseria
Gram-negative rods: Escherichia coli, Vibrio cholerae
Other: Mycobacterium leprae (leprosy), Borrelia burgdorferi (Lyme), Chlamydia trachomatis, Mycoplasma pneumoniae
† having a spherical shape
12 Major categories of bacteria based on morphological criteria. B&FG 3e Table 17.3 Page 800. Disease is indicated in parentheses.
13 Range of genome sizes in bacteria and archaea. B&FG 3e Table 17.4 Page 801
14 Genome size of selected bacteria and archaea having relatively large or small genomes. B&FG 3e Table 17.5 Page 803. (A): archaeal; (B): eubacterial.
15 Number of predicted protein-encoding genes versus genome size for 246 complete published genomes. B&FG 3e Fig. 17.2 Page 804
16 * * Bacterial and archaeal classification: lifestyles. We may distinguish six prokaryotic lifestyles: [1] extracellular (e.g. E. coli); [2] facultatively intracellular (e.g. Mycobacterium tuberculosis); [3] extremophilic (e.g. M. jannaschii); [4] epicellular bacteria (e.g. Mycoplasma pneumoniae); [5] obligate intracellular and symbiotic (e.g. B. aphidicola); [6] obligate intracellular and parasitic (e.g. Rickettsia). * * These tend to have an extreme reduction in genome size.
17 Bacterial classification: disease relevance. Vaccine-preventable bacterial diseases:
Anthrax: Bacillus anthracis
Diarrheal disease (cholera): Vibrio cholerae
Diphtheria: Corynebacterium diphtheriae
Lyme disease: Borrelia burgdorferi
Meningitis: Haemophilus influenzae type B, Streptococcus pneumoniae, Neisseria meningitidis
Pertussis: Bordetella pertussis
Tetanus: Clostridium tetani
Tuberculosis: Mycobacterium tuberculosis
Typhoid: Salmonella typhi
18 Vaccine-preventable bacterial diseases. B&FG 3e Table 17.7 Page 808
19 Bacterial classification: rRNA phylogeny. 16S ribosomal RNA (rRNA) based trees by Woese and colleagues showed distinct superkingdoms of bacteria and archaea. The following figure (adapted from Casjens, 1998) summarizes bacterial chromosome size and geometry. 23 major named bacterial phyla are shown. Geometry (circular or linear chromosomes) and genome sizes (in kb) are indicated. Branch lengths are not proportional to evolutionary distance. Note that four phyla have been sampled most extensively: Proteobacteria, Firmicutes, Actinobacteria, and Bacteroidetes. These account for >90% of known bacteria.
20 Bacterial chromosome size and geometry. B&FG 3e Fig. 17.1 Page 802
21 Estimates of the phylogenetic diversity SSU rRNA genes. B&FG 3e Fig. 17.3 Page 810. Axis label: Number of organisms.
22 Archaeal classification: phylogeny. Amongst the archaea, the two major divisions are [1] Euryarchaeota (e.g. Methanococcus jannaschii, sequenced in 1996 and renamed Methanocaldococcus jannaschii) and [2] Crenarchaeota (e.g. Aeropyrum pernix, a strictly aerobic hyperthermophilic archaeon that is highly motile, lives in volcanic hydrothermal areas, and thrives at 90-95°C).
24 The human microbiome There may be ten times more bacterial cells than human cells in our bodies. These bacteria, as well as some archaea, viruses, and eukaryotes, collectively may contain greater than two orders of magnitude more genes than are encoded by our human genome. This collection of foreign genomes in our bodies is referred to as the human microbiome. Most are commensal, coexisting and helping to digest food and facilitate our metabolism; some are pathogenic. Together they weigh about 1.5 kg in a typical human gut. B&FG 3e Page 811
25 The human microbiome: conclusions of the Human Microbiome Project (HMP) and the Metagenomics of the Human Intestinal Tract (MetaHIT) project. [1] There are extraordinary bioinformatics challenges associated with these types of projects. [2] Most of the microbiome is bacterial (other eukaryotes 0.5%, archaea 0.8%, viruses up to 5.8%). [3] There is no single reference microbiome because there is such enormous diversity of species within each individual and between individuals. [4] Each body region does have characteristic bacterial species within each individual, and these often occur in common between individuals. [5] Most metabolic pathways are evenly distributed and evenly prevalent across body regions and between individuals (see next slide!). B&FG 3e Page 813
26 Characterization of bacterial taxa in human microbiomes, as reported by the HMP. B&FG 3e Fig. 17.5 Page 813. [Figure: panels show phyla and metabolic pathways for each body site: anterior nares, RC, buccal mucosa, supragingival plaque, tongue dorsum, stool, posterior fornix.]
28 Phylogenetic relationships of E. coli strains. As we focus on E. coli we begin with a phylogenetic perspective. B&FG 3e Fig. 17.6 Page 815
29 The Integrated Microbial Genomes (IMG) website offers data on bacterial genomes such as E. coli K-12 MG1655. B&FG 3e Fig. 17.7 Page 816
30 The UCSC Genome Browser offers an E. coli hub. B&FG 3e Fig. 17.8 Page 817
32 Bacteria and archaea: nucleotide composition. The guanine plus cytosine (GC) content in bacteria ranges from ~20% to 75% (in archaea from ~28% to 66%). GC content often correlates with bacterial phylum. We will see in a later lecture that eukaryotic genomes have GC contents that often have a restricted range from ~35-50% (about 40%-45% in vertebrates). What is the consequence of extreme GC content on protein composition?
33 GC content for ∼15,000 bacterial and archaeal genomes. B&FG 3e Fig. 17.9 Page 818
34 [Figure labels: GC content ranges of 67-74%, ~25-31%, ~40%, ~50%, 40-60%, ~23-35%, and ~30-33%.]
35 Use the seqinr R package to analyze nucleotide composition. This shows the GC content of E. coli is 50.79%. B&FG 3e Page 818
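The slide's R session is not reproduced in this transcript. As a rough Python analogue (not the seqinr code itself), the sketch below computes GC content as the fraction of unambiguous bases that are G or C. The function name `gc_content` and the toy sequence are illustrative assumptions; applying it to the E. coli genome would require reading the downloaded FASTA file first.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    # Ambiguity codes such as N are ignored in both numerator and denominator.
    return gc / total


# Toy sequence for illustration only (not E. coli data).
example = "ATGCGCGATTAA"
print(f"GC content: {gc_content(example):.2%}")
```

Ignoring ambiguous bases is one possible convention; other tools may count them in the denominator, which would give slightly different values.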
36 GC content of E. coli strain K-12: seqinr R package. You can calculate GC content across a series of bins and plot them. B&FG 3e Page 819
37 GC content of E. coli strain K-12. The sequence of an E. coli strain was downloaded and read into seqinr, a for loop was used to calculate GC content in windows of 20,000 base pairs, and the data were plotted. B&FG 3e Fig Page 819
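The windowed calculation described above can be sketched in Python as follows. This is an illustrative stand-in for the R for loop, not a reproduction of it: the function name `windowed_gc` is hypothetical, windows are non-overlapping, a trailing partial window is dropped, and the denominator is the window size (so ambiguous bases count as non-GC).

```python
def windowed_gc(seq: str, window: int = 20000):
    """Return (window_start, gc_fraction) pairs for non-overlapping windows."""
    if window <= 0:
        raise ValueError("window must be positive")
    seq = seq.upper()
    results = []
    # Step through the sequence one full window at a time.
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, gc / window))
    return results


# Toy example with a 4-base window; a real genome would use window=20000.
for start, fraction in windowed_gc("GGGGAAAACCGT", window=4):
    print(start, fraction)
```

The resulting pairs can be passed to any plotting library to draw GC content against genome position.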
38 C. carsonella: low GC content (16%) and tiny genome. C. carsonella “may have achieved organelle-like status.”
39 Example of a C. carsonella protein. Look for residues such as asparagine (N) encoded by AT-rich codons.
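One way to look for this bias in a protein sequence is to tally residues whose codons are AT-rich versus GC-rich. The sketch below assumes the grouping sometimes used for this purpose, F, Y, M, I, N, K (AT-rich codons) versus G, A, R, P (GC-rich codons); the function name and the toy peptide are hypothetical, and no C. carsonella sequence is included here.

```python
from collections import Counter

# Assumed residue groupings: amino acids with AT-rich versus GC-rich codons.
AT_RICH_RESIDUES = set("FYMINK")
GC_RICH_RESIDUES = set("GARP")


def residue_bias(protein: str):
    """Return (AT-rich fraction, GC-rich fraction) of the letters in a protein."""
    counts = Counter(ch for ch in protein.upper() if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty protein sequence")
    at_rich = sum(counts[r] for r in AT_RICH_RESIDUES)
    gc_rich = sum(counts[r] for r in GC_RICH_RESIDUES)
    return at_rich / total, gc_rich / total


# Toy peptide for illustration only.
print(residue_bias("MNNKKIFYGA"))
```

A protein from a very AT-rich genome would be expected to show a high first value and a low second value, although the exact thresholds depend on the protein family.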
40 Example of a C. carsonella contig (note AT richness). Candidatus Carsonella ruddii PV, 159,662 nt, NC_008512
41 Bacteria and archaea: nucleotide composition. There are two main theories to account for the variation in GC content in prokaryotes (Li and Graur, 1991): ► Selectionist hypothesis. GC content is an adaptation to environmental conditions. GC-rich codons (encoding ala, arg) are more stable in hot environments; AT-rich codons (encoding ser, lys) are thermally unstable. TT dimers are sensitive to radiation, so soil- and air-exposed prokaryotes may have a higher GC content. ► Mutationist hypothesis. GC content is determined by biases in the mutation patterns.
43 Bacteria and archaea: finding genes. Genome annotation involves the identification of features such as protein-coding genes, noncoding genes, or regulatory elements. For the annotation of genes, four main features of genomic DNA are useful. In particular, genes must be distinguished from randomly occurring open reading frames. [1] Open reading frame length. An ORF begins with a start codon (ATG or sometimes GTG or TTG in bacteria) and ends with a stop codon (TAA, TAG, TGA). [2] Consensus for ribosome binding (Shine-Dalgarno). [3] Pattern of codon usage. [4] Homology of putative gene to other genes.
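Feature [1] above can be illustrated with a minimal ORF scanner. This is a sketch under simplifying assumptions, not a gene finder: it reports, for each frame, the stretch from the first start codon after a stop to the next in-frame stop, applies only a length cutoff, and ignores ribosome binding sites, codon usage, and homology. All function names are hypothetical.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs_one_strand(seq: str, min_length: int = 99):
    """Return (start, end, frame) tuples; 0-based, end exclusive, stop codon included."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # most upstream start codon since the last stop
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_length:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


def find_orfs(seq: str, min_length: int = 99):
    """Scan both strands; coordinates are relative to the strand that was scanned."""
    forward = [("+",) + orf for orf in find_orfs_one_strand(seq, min_length)]
    reverse = [("-",) + orf
               for orf in find_orfs_one_strand(reverse_complement(seq), min_length)]
    return forward + reverse


# Toy sequence containing one short ORF (ATG AAA TTT TAG).
print(find_orfs("CCATGAAATTTTAGCC", min_length=9))
```

Many of the ORFs found this way in a real genome occur by chance, which is why the remaining three features are needed to separate genes from random ORFs.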
44 Programs for gene finding in bacterial and archaeal genomes. B&FG 3e Table 17.8 Page 820
45 Glimmer for prokaryotic gene finding. Glimmer (Gene Locator and Interpolated Markov Modeler) uses interpolated Markov models (IMMs) to identify coding regions and distinguish them from noncoding DNA. The Glimmer home page is: Glimmer involves two steps: [1] Training the algorithm for a particular organism. This involves first identifying all ORFs, and sometimes also involves blast searching them against other organisms. [2] Running the trained algorithm against the genome sequence.
46 Glimmer for prokaryotic gene finding. Glimmer sequentially scans nucleotide sequences for particular kmers (e.g. the 5mer ATGGC) and estimates the probability of that pattern occurring in a real gene. The statistical model of a gene is then used to analyze the complete set of unknown genomic DNA. The ORFs that are analyzed by Glimmer must exceed some minimum length (e.g. 99 base pairs). Glimmer uses a hidden Markov model (HMM) approach. HMMs are statistical models of the patterns of nucleotides comprising a gene. The HMM includes observed states (e.g. nucleotide sequence including a start or stop codon) and hidden states (genes in DNA).
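The idea of estimating how likely a k-mer pattern is in a real gene can be illustrated with a much simpler model than the one Glimmer uses. The sketch below trains a fixed-order Markov chain on example coding sequences and scores a candidate by its log-odds against a uniform background. It is a deliberately simplified stand-in: Glimmer's interpolated models combine several context lengths, and none of the names below come from Glimmer.

```python
import math
from collections import defaultdict


def train_markov(seqs, order: int = 2, pseudocount: float = 1.0):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.upper()
        for i in range(order, len(s)):
            context, base = s[i - order:i], s[i]
            if set(context + base) <= set("ACGT"):
                counts[context][base] += 1
    model = {}
    for context, nxt in counts.items():
        total = sum(nxt.values()) + 4 * pseudocount
        model[context] = {b: (nxt.get(b, 0.0) + pseudocount) / total for b in "ACGT"}
    return model


def log_odds(seq: str, model, order: int = 2) -> float:
    """Sum log2(P(base | context) / 0.25); unseen contexts contribute zero."""
    s = seq.upper()
    score = 0.0
    for i in range(order, len(s)):
        probs = model.get(s[i - order:i])
        if probs and s[i] in probs:
            score += math.log2(probs[s[i]] / 0.25)
    return score


# Toy training set; a real run would train on long ORFs from the genome.
model = train_markov(["ATGATGATGATG"], order=2)
print(log_odds("ATGATG", model), log_odds("ATCATC", model))
```

A positive score means the candidate resembles the training sequences more than random DNA with equal base frequencies; the uniform background is itself an assumption that a real tool would replace with a noncoding model.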
47 Identifying E. coli genes using the web-based GLIMMER3 program at NCBI. Starting from the accession number of an E. coli strain (NC_ ), the “send to” option is selected to download a text file with the nucleotide sequence in FASTA format. B&FG 3e Fig Page 821
48 Identifying E. coli genes using the web-based GLIMMER3 program at NCBI. The first ten open reading frame predictions (of 4482 total) are shown. B&FG 3e Fig Page 821
49 Identifying E. coli genes using GLIMMER3 (command-line). We can download, unpack, and compile the GLIMMER program. B&FG 3e Page 822
50 Identifying E. coli genes using GLIMMER3 (command-line). Copy the executable into a directory listed in the PATH variable. Obtain the DNA sequence of a genome of interest, e.g. E. coli. B&FG 3e Page 822
51 Identifying E. coli genes using GLIMMER3 (command-line). Use grep and word count (wc) to count the number of entries in the file. Use head to look at the first portion of the file. B&FG 3e Page 822
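The same two checks (counting entries and looking at the first lines) can be sketched in Python. This is only an analogue of the grep, wc, and head commands, which are not shown in this transcript; the helper names and the in-memory toy file are assumptions for illustration.

```python
import io


def count_fasta_records(handle) -> int:
    """Count header lines (those starting with '>') in an open FASTA handle."""
    return sum(1 for line in handle if line.startswith(">"))


def head(handle, n: int = 5):
    """Return the first n lines with trailing newlines removed."""
    lines = []
    for line in handle:
        if len(lines) == n:
            break
        lines.append(line.rstrip("\n"))
    return lines


# In-memory toy FASTA; with a real file, pass open("genome.fasta") instead.
demo = ">seq1 toy record\nACGTACGT\nGGCC\n>seq2 toy record\nTTAA\n"
print(count_fasta_records(io.StringIO(demo)))
print(head(io.StringIO(demo), 2))
```

For a single complete bacterial chromosome the count is typically one, so a larger number is a hint that the file holds plasmids or multiple contigs.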
52 Identifying E. coli genes using GLIMMER3 (command-line). Build an interpolated context model (ICM). View a text version of the output which includes contextual patterns and codon predictions. B&FG 3e Page 823
53 Identifying E. coli genes using GLIMMER3 (command-line). …continuing to the bottom of the file. B&FG 3e Page 823
54 Identifying E. coli genes using GLIMMER3 (command-line). Now run GLIMMER3. B&FG 3e Page 823
55 Identifying E. coli genes using GLIMMER3 (command-line). The output includes a table with open reading frames. B&FG 3e Page 824
56 Identifying E. coli genes using GLIMMER3 (command-line). This file contains the final gene predictions. B&FG 3e Page 824
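Once the predictions file exists, it can be loaded for downstream analysis. The sketch below assumes a layout of one FASTA-style header line followed by whitespace-separated columns (identifier, start, end, reading frame, score); that layout is an assumption that should be checked against the actual output, and the example rows are invented rather than real E. coli predictions.

```python
import io
from typing import NamedTuple


class Prediction(NamedTuple):
    orf_id: str
    start: int
    end: int
    frame: int
    score: float


def parse_predictions(lines):
    """Parse prediction rows, skipping blank lines and '>' header lines."""
    predictions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        fields = line.split()
        if len(fields) < 5:
            continue  # not a prediction row under the assumed layout
        orf_id, start, end, frame, score = fields[:5]
        predictions.append(
            Prediction(orf_id, int(start), int(end), int(frame), float(score))
        )
    return predictions


# Invented example rows in the assumed layout.
demo = ">example_sequence\norf00001 100 400 +1 8.50\norf00002 900 500 -2 3.25\n"
for p in parse_predictions(io.StringIO(demo)):
    print(p.orf_id, abs(p.end - p.start) + 1, p.frame)
```

A negative frame is taken here to mean the reverse strand, in which case the start coordinate is larger than the end coordinate; genes that wrap around the origin of a circular genome would need extra handling.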
58 Gene annotation Gene annotation is used to assign functions to genes and, in some cases, to reconstruct metabolic pathways or other higher levels of gene function. Gene annotation pipelines seek to maximize accuracy, consistency, and completeness. An example of the functional groups assigned to E. coli genes by the EcoCyc database B&FG 3e Page 825
59 The EcoCyc database includes a cellular overview of E. coli. B&FG 3e Fig Page 826
60 Automated annotation of bacterial and archaeal genomes by RAST. B&FG 3e Fig Page 827
62 Bacteria and archaea: lateral gene transfer. Lateral gene transfer (LGT), also called horizontal gene transfer (HGT), is a phenomenon in which a genome acquires a gene from another organism directly, but not by descent. The gene transfer is unidirectional (rather than involving a reciprocal exchange of DNA).
63 Lateral gene transfer: significance. LGT may represent a major, “alternative” form of non-vertical evolution. It is a process that offers organisms the capacity to adopt novel functions. LGT is significant as a possible source of error in phylogenetic analyses. LGT may be incorrectly ascribed when other mechanisms operate such as selection, variable evolutionary rates, and biased sampling (see JA Eisen [2000] Curr. Op. Genet. Devel. 10:606).
64 Lateral gene transfer occurs in stages. [1] Four species evolved from a common ancestor. [2] A gene transfers from species 4 to 3. The gene is [3] fixed in some individual genomes, [4] maintained under strong selection, and [5] spread through the population. [6] The laterally transferred gene continues to evolve. B&FG 3e Fig Page 828
65 Lateral gene transfer of a gene encoding a sarcosine dimethylglycine methyltransferase from cyanobacteria to the eukaryote G. sulphuraria B&FG 3e Fig Page 830
66 Lateral gene transfer: examples. There are many examples of LGT, both among bacterial genomes and between distantly related organisms. ► LGT has occurred in the parasitic amoeba Entamoeba histolytica, which may have received metabolic genes from bacterial co-habitants of the human gastrointestinal tract (see Loftus B et al. (2005) Nature, Feb. 24). ► Proteorhodopsin has been transferred between marine planktonic bacteria and archaea. In the upper water column of the ocean, archaea of the order Thermoplasmatales have proteorhodopsins that were otherwise thought to be present only in proteobacteria or other bacteria (Frigaard N-U et al. (2006) Nature 439:847).
68 How can whole genomes be compared? -- Molecular phylogeny. -- You can BLAST (e.g. DELTA-BLAST) all the DNA and/or protein in one genome against another. -- PipMaker, MUMmer, and other programs align large stretches of genomic DNA from multiple species.
69 Bacterial and archaeal species for which genomes of at least two closely related strains have been determined B&FG 3e Table 17.9 Page 831
70 Aligning genomes: MUMmer. MUMmer is a tool for DNA alignments of complete genomes (or of chromosomes). The algorithm uses a suffix tree approach to identify all exact matches of nucleotide subsequences that are at least some minimum length (e.g. 20 or 150 base pairs). In this way maximal unique matching subsequences (MUMs) are identified.
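To make the definition of a MUM concrete, the sketch below finds them by brute force on short strings. It does not use a suffix tree and would be far too slow for genomes; it only shows what counts as a maximal unique match: an exact match that cannot be extended in either direction and that occurs exactly once in each sequence. Only the forward strand is considered, and the function names are hypothetical.

```python
def count_overlapping(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    count, pos = 0, text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def find_mums(a: str, b: str, min_len: int = 3):
    """Return (pos_in_a, pos_in_b, length) for each maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue  # extendable to the left, so not maximal from here
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1  # extend to the right as far as possible
            if length < min_len:
                continue
            match = a[i:i + length]
            if count_overlapping(a, match) == 1 and count_overlapping(b, match) == 1:
                mums.append((i, j, length))
    return mums


# Toy sequences sharing the unique 5-mer ACGTA.
print(find_mums("TTACGTAGG", "CCACGTACC", min_len=4))
```

A suffix tree finds the same matches in time roughly linear in the sequence lengths, which is what makes the approach practical for whole genomes.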
71 MUMmer pairwise genome alignment: visualizing shared regions, inversions, translocations. Eisen JA et al. (2000) Genome Biology 1(6)
72 MUMmer pairwise genome alignment: comparisons within V. cholerae. Eisen JA et al. (2000) Genome Biology 1(6)
73 MUMmer within-genome alignment (S. pyogenes). Eisen JA et al. (2000) Genome Biology 1(6)
74 MUMmer compares two microbial genomes on a dotplot. B&FG 3e Fig Page 834. We showed how to use MUMmer on the command line in Chapter 16. This is from a web-based version.
75 Aligning genomes: MUMmer. When running MUMmer there are three options: MUMmer, NUCmer, and PROmer.
76 NUCmer (NUCleotide MUMmer) is the most user-friendly alignment script for standard DNA sequence alignment. It is a robust pipeline that allows multiple reference and multiple query sequences to be aligned in a many vs. many fashion. For instance, a very common use for nucmer is to determine the position and orientation of a set of sequence contigs in relation to a finished sequence; however, it can be just as effective in comparing two finished sequences to one another. Like all of the other alignment scripts, it is a three-step process: maximal exact matching, match clustering, and alignment extension. It begins by using mummer to find all of the maximal unique matches of a given length between the two input sequences. Following the matching phase, individual matches are clustered into closely grouped sets with mgaps. Finally, the non-exact sequence between matches is aligned via a modified Smith-Waterman algorithm, and the clusters themselves are extended outwards in order to increase the overall coverage of the alignments. nucmer uses the mgaps clustering routine, which allows for rearrangements, duplications, and inversions; as a consequence, nucmer is best suited for large-scale global alignments, as is shown in the following plot.
77 [Dot plot: Helicobacter pylori J99 versus Helicobacter pylori 26695]
78 Aligning genomes: PROmer. PROmer (PROtein MUMmer) is a close relative of the NUCmer script. It follows the exact same steps as NUCmer and even uses most of the same programs in its pipeline, with one exception: all matching and alignment routines are performed on the six-frame amino acid translation of the DNA input sequence. This provides promer with a much higher sensitivity than nucmer because protein sequences tend to diverge much more slowly than their underlying DNA sequence. Therefore, on the same input sequences, promer may find many conserved regions that nucmer will not, simply because the DNA sequence is not as highly conserved as the amino acid translation.
79 All of this is performed behind the scenes, as the input is still the raw DNA sequence and output coordinates are still reported in reference to the DNA, so the two programs (nucmer and promer) exhibit little difference in their interfaces and usability. Because of its greatly increased sensitivity, it is usually best to use promer on those sequences that cannot be adequately compared by nucmer, because if run on very similar sequences the promer output can be quite voluminous. This is because promer makes no effort to distinguish between proteins and junk amino acid translations. A single highly conserved gene may therefore have up to six alignments in promer output, one for each of the six amino acid reading frames, when only the correct reading frame would be sufficient. This makes promer ideally suited for highly divergent sequences that show little DNA sequence conservation, as is shown in the following two plots.
80 These dot plots represent two comparisons of Streptococcus pyogenes (x-axis) and Streptococcus mutans (y-axis), with forward matches colored red and reverse matches colored green. The graph generated with nucmer output is on the left, while the graph generated with promer output is on the right (both run with default parameters). It is clearly visible that promer has aligned the two genomes with a much greater sensitivity, thus demonstrating the effectiveness of comparing two divergent genomes on the amino acid level.
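The six-frame translation that PROmer works on can be sketched in a few lines of Python. This is an illustration of the concept rather than PROmer's code: the codon table is the standard genetic code (whose amino acid assignments are also those usually applied to bacteria and archaea), incomplete trailing codons are dropped, codons containing ambiguous bases become X, and all names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq: str):
    """Return a dict mapping '+1'..'+3' and '-1'..'-3' to translations."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


# Toy sequence: ATG AAA TAG translates to M K stop in frame +1.
for name, protein in six_frame_translation("ATGAAATAG").items():
    print(name, protein)
```

Because every frame is translated regardless of whether it encodes a real protein, a conserved gene can produce matches in several frames, which is the source of the voluminous output described above.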
82 Perspective Sequencing of thousands of bacterial and archaeal genomes has the following benefits: We obtain a comprehensive survey of genes and regulatory elements. Comparative genomics informs us about function. We begin to uncover the principles of genome organization, and can compare pathogenic versus nonpathogenic strains. We gain insights into the evolution of both genes and species. We can appreciate lateral gene transfer as one of the driving forces of microbial evolution. We can study gene duplication and gene loss. Complete genome sequences offer a starting point for biological investigations.