Recovering evolutionary history

Author: Dora Hines

1 Recovering evolutionary history: Björn Nystedt, Bioinformatics scientist, SciLifeLab

2 SciLifeLab: Strategic money from the government. Nodes in Stockholm (KI, SU, KTH) and Uppsala (UU). “THE VISION is to become one of the leading centers in the world for high-throughput bioscience with focus on genomes, protein profiling and bioinformatics with relevance for human diseases”. 10 platforms, among them Genomics, Bioinformatics, Clinical diagnostics. Capacity: approaching 300 human genomes/week (30X); approaching 10,000 single-cell transcriptomes/day.

3 Fig 1. Growth of DNA sequencing. Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical? PLoS Biol 13(7): e doi: /journal.pbio

4 Ernst Haeckel ( )

5 Woese and 16S rRNA. Woese CR, Kandler O, Wheelis ML. (1990) Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci U S A 87:4576-9

6 Re-writing the tree of life: Eukaryotes nested within archaea? Spang et al. (2015) Nature 521:173–179

7 Woese revisited Eukaryota

8 The Eukaryotic tree of life: The Eukaryotic Tree of Life from a Global Phylogenomic Perspective. Cold Spring Harb Perspect Biol :a016147

9 Understanding change. Figure: Amniote phylogeny based on protein synonymous sites showing major features of amniote evolution. J Alföldi et al. Nature (2011). “Nothing in biology makes sense except in the light of evolution” (Dobzhansky 1973)

10 What is a phylogeny? A phylogeny is a pattern of event histories shared between biological replicators. In practice, by far the most commonly used replicators are genes (or species, in some sense). A phylogeny is typically modeled as a tree, representing the historical events linking the replicators together.

11 What are we trying to do? True history. [Figure: sequences ATCGTGT, ATCCTGT, ATAGTGT, ATCGTGT, ATAGTGT, ATCGTGA, ATCGTGT, ATAGTGT placed along a tree over time, leading to Species A, Species B, Species C]

12 What are we trying to do? Observed data. [Figure: sequences ATCGTGA, ATCGTGT, ATAGTGT at the tips for Species A, Species B, Species C; time axis]

13 What are we trying to do? Inferred history (“find most probable history given the observations”). Note! Normally we do not explicitly infer the ancestral states. [Figure: sequences ATCGTGA, ATCGTGT, ATAGTGT at the tips for Species A, Species B, Species C; relative time axis]

14 Terminology

15 Topology and branch lengths: A tree has information in its topology (the order of branch splits) and in its branch lengths. Typically, the topology reflects relatedness (more on this later), and the branch lengths represent the amount of change. Short branch: little change per time unit / slow evolution. Long branch: a lot of change per time unit / fast evolution. [Figure: phylogram of A, B, C, D]

16 Cladogram. Phylogram: a phylogenetic tree with branch lengths relative to the amount of change. Cladogram: a phylogenetic tree with uninformative branch lengths. [Figure: a phylogram and a cladogram of the same tree, with taxa A, B, C, D]

17 Collect and organize your data

18 Collect your data (homologs): Phylogenetic inference can only be performed on characters that are related by descent, i.e. that share a common history (within a reasonable timeframe). Sequences related by descent are called ‘homologs’.

19 Organize your data (alignment)

20 Alignment consequences: Multiple alignment => much stronger assumption about homology. Not just the genes are assumed to be homologous (in some fluffy sense), but each column in the alignment is assumed to contain only homologous characters. In effect, we assume each column to carry a signal of the same underlying gene tree. [Figure, labelled “Ancestral state”: ATCCCTTCTATTTGA, ATCCGCTCTATATGG, ATCCGCTCTATATGA, ATGCCTA-TCCTAGA, ATGTCTA-TCCTTGA] Look at your alignment! If the alignment does not make sense, your tree won’t make sense either.

21 Evolutionary models

22 Naïve distances (p-distance): Observed differences (p-distance) versus actual changes. [Figure: sequences ATCGTGTG, ATCCTGTG, ATCATGTG, ATCGTGTG, ATCATGTG along a time axis] Actual changes: 3. Observed differences: 1. This effect is due to multiple substitutions at the same site!
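The p-distance is simply the fraction of aligned sites at which two sequences differ. A minimal Python sketch follows; the function name `p_distance` and the decision to skip gap columns are assumptions made for illustration, and the two sequences are taken from the slide on the assumption that they are the two present-day tips.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: columns with a gap in either sequence are skipped.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


# Two sequences from the slide: they differ at one of eight sites,
# even though three substitutions actually happened along the way.
observed = p_distance("ATCGTGTG", "ATCATGTG")
```

The value counts only what is visible today, which is why it underestimates the true amount of change.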

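23 Jukes-Cantor model. Example: the Jukes-Cantor model, with the same substitution rate α between every pair of the bases A, G, T, C. It relates the observed differences (p-distance) to the actual changes through the transition probabilities P(t) = e^{Qt}:
$$P_{ij}(t) = \begin{cases} \tfrac{1}{4} + \tfrac{3}{4}\, e^{-4\alpha t}, & \text{if } i = j \\ \tfrac{1}{4} - \tfrac{1}{4}\, e^{-4\alpha t}, & \text{if } i \neq j \end{cases}$$
Jukes-Cantor is the simplest model in a class of models called time-reversible models for DNA (1 parameter)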
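As a sketch of how the Jukes-Cantor model corrects a p-distance, the snippet below uses the standard closed form of the model (probability 1/4 + 3/4·e^(-4αt) of ending in the same base, 1/4 - 1/4·e^(-4αt) for each specific different base). The function names are illustrative assumptions, and the parameterisation with α as the rate of each specific substitution is assumed from the "αt" notation used on later slides.

```python
import math


def jc_prob(same: bool, alpha_t: float) -> float:
    """Jukes-Cantor probability of ending in the same base, or in one specific other base."""
    decay = math.exp(-4.0 * alpha_t)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay


def jc_distance(p: float) -> float:
    """Expected substitutions per site (3 * alpha * t) for an observed p-distance."""
    if not 0.0 <= p < 0.75:
        raise ValueError("the correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# An observed p-distance of 1/8 corresponds to somewhat more actual change.
corrected = jc_distance(0.125)
```

The corrected distance is always at least as large as the p-distance, and the gap grows as sequences diverge and multiple hits become common.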

24 Transitions and transversions: So, equal substitution rates for different nucleotide substitution types seem to be a bad assumption! ‘GTR’ is the most complex time-reversible model (6 parameters)

25 More complex models
# parameters                               nt    aa
Base/aa frequencies                        3     19
Substitution rates *
Rate heterogeneity among sites (gamma)     1     1
Fraction of invariant sites                (1)   (1)
Tree topology                              (1)   (1)
* Empirical values, not estimated for each dataset
Special cases: non-reversible substitution rates; rate heterogeneity among branches; time-constrained trees (fossil data); clock-like trees

26 Methods (using one of the previous models)

27 Algorithmic: Neighbor-Joining (NJ). Use pairwise distances (based on your model of choice). Use an iterative approach to build one tree (there is no such thing as a “second-best tree”). + Very fast. + Surprisingly accurate. - Fails in complicated cases. - No possibility to compare to alternative trees.
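To make the iterative approach concrete, here is a compact sketch of neighbor joining. The slide does not spell out the algorithm, so the Q criterion and branch-length formulas below are the textbook formulation, and the function name, the data layout, and the small distance matrix are all assumptions made for illustration.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Join nodes pairwise until two remain.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns (joins, final_edge): joins is a list of (a, b, len_a, len_b),
    final_edge is (node_1, node_2, length) for the last two nodes.
    """
    nodes = list(names)
    d = dict(dist)  # work on a copy
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence: summed distance from each node to all others.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Pick the pair that minimises the Q criterion.
        a, b = min(
            combinations(nodes, 2),
            key=lambda pair: (n - 2) * d[frozenset(pair)] - r[pair[0]] - r[pair[1]],
        )
        d_ab = d[frozenset((a, b))]
        len_a = d_ab / 2 + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d_ab - len_a
        new = f"({a},{b})"
        # Distances from the new internal node to every remaining node.
        for k in nodes:
            if k not in (a, b):
                d[frozenset((new, k))] = (
                    d[frozenset((a, k))] + d[frozenset((b, k))] - d_ab
                ) / 2
        nodes = [k for k in nodes if k not in (a, b)] + [new]
        joins.append((a, b, len_a, len_b))
    first, second = nodes
    return joins, (first, second, d[frozenset((first, second))])


# Hypothetical distances that fit a tree exactly: A and B sit on branches of
# length 1 and 2, C and D on branches of 3 and 4, with an inner branch of 5.
pairs = {("A", "B"): 3, ("A", "C"): 9, ("A", "D"): 10,
         ("B", "C"): 10, ("B", "D"): 11, ("C", "D"): 7}
dist = {frozenset(k): v for k, v in pairs.items()}
joins, final_edge = neighbor_joining(["A", "B", "C", "D"], dist)
```

Note that the procedure returns exactly one tree and no score, which is the limitation listed on the slide: there is nothing to compare alternative trees against.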

28 Methods based on criteria: Calculate a score for all (or at least many) possible trees, and pick the one with the best score.

29 Maximum likelihood criterion: Given two trees, the one with the higher likelihood, i.e. the one with the higher conditional probability of observing the data, is the better. Site likelihood is the conditional probability of the data at one site (one character) given the assumed model of evolution and the parameters of the model. Data set likelihood is the product of the site likelihoods (character independence). Likelihood values under different models are comparable, thus giving us a way to test the adequacy of the model. The model consists of: a substitution model, e.g. Jukes-Cantor; a tree with branch lengths.

30 Likelihood of a one-branch tree: Taxon1 AC and Taxon2 CC, joined by a single branch of length αt. For Jukes-Cantor! With L1 and L2 the likelihoods of site 1 and site 2: Ltot = L1·L2, or log Ltot = log L1 + log L2

31 A one-branch tree: 30 nucleotides from ψη-globin genes of two primates on a one-edge tree (* marks a difference). Gorilla GAAGTCCTTGAGAAATAAACTGCACACTGG, Orangutan GGACTCCTTGAGAAATAAACTGCACACTGG. There are two differences and 28 similarities. [Figure: lnL plotted against αt] αt= lnL= Possible (and quick) to optimize parameters for a given tree.
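The one-edge example can be worked through numerically. The sketch below assumes the Jukes-Cantor model with equilibrium base frequencies of 1/4 and uses the closed-form optimum αt = -1/4·ln(1 - 4p/3), which is the standard result for this model rather than something stated on the slide; all function and variable names are illustrative.

```python
import math

GORILLA = "GAAGTCCTTGAGAAATAAACTGCACACTGG"
ORANGUTAN = "GGACTCCTTGAGAAATAAACTGCACACTGG"


def site_likelihood(x: str, y: str, alpha_t: float) -> float:
    """Jukes-Cantor likelihood of seeing bases x and y at the two ends of one edge."""
    decay = math.exp(-4.0 * alpha_t)
    p_end = 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay
    return 0.25 * p_end  # 1/4 is the probability of the starting base


def log_likelihood(seq_a: str, seq_b: str, alpha_t: float) -> float:
    """Sum of log site likelihoods (sites are treated as independent)."""
    return sum(math.log(site_likelihood(x, y, alpha_t)) for x, y in zip(seq_a, seq_b))


n_diff = sum(x != y for x, y in zip(GORILLA, ORANGUTAN))
p = n_diff / len(GORILLA)
# Closed-form maximum-likelihood estimate of alpha*t under Jukes-Cantor.
alpha_t_hat = -0.25 * math.log(1.0 - 4.0 * p / 3.0)
best_lnl = log_likelihood(GORILLA, ORANGUTAN, alpha_t_hat)
```

Evaluating `log_likelihood` at a few values of αt on either side of `alpha_t_hat` gives lower values, which is the single-peaked curve one would expect for this one-parameter problem.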

32 Likelihood of a 4-taxon tree: Bases at internal nodes are unknown (so sum over all possible states!). [Figure: tips A, C, A, T; internal nodes u and v; edges e1 to e5]
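Summing over the unknown internal states can be written out directly for four taxa. This sketch assumes the Jukes-Cantor model and one particular reading of the flattened figure: tips A (edge e1) and A (edge e2) attach to node u, tips C (e3) and T (e4) attach to node v, and e5 joins u and v. That layout, the edge lengths, and the function names are assumptions.

```python
import math
from itertools import product

BASES = "ACGT"


def jc_prob(x: str, y: str, alpha_t: float) -> float:
    """Jukes-Cantor probability of base x becoming base y along an edge."""
    decay = math.exp(-4.0 * alpha_t)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def site_likelihood(tips, edges) -> float:
    """Likelihood of one site, summed over all 16 state pairs at nodes u and v.

    tips  = (t1, t2, t3, t4): t1 and t2 hang off u, t3 and t4 hang off v.
    edges = (e1, e2, e3, e4, e5): e5 is the internal edge between u and v.
    """
    t1, t2, t3, t4 = tips
    e1, e2, e3, e4, e5 = edges
    total = 0.0
    for u, v in product(BASES, repeat=2):
        total += (
            0.25  # probability of state u at the (arbitrary) starting node
            * jc_prob(u, t1, e1) * jc_prob(u, t2, e2)
            * jc_prob(u, v, e5)
            * jc_prob(v, t3, e3) * jc_prob(v, t4, e4)
        )
    return total


# Hypothetical edge lengths, all equal to 0.1.
example = site_likelihood(("A", "A", "C", "T"), (0.1, 0.1, 0.1, 0.1, 0.1))
```

For larger trees this brute-force sum grows exponentially with the number of internal nodes, so real programs reuse partial sums instead of enumerating every combination.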

33 Comparing trees Calculating the likelihood for a given tree is (pretty) fast. So, all we need to do now is to compute the Likelihood for all possible trees, and pick the best one! Easy, but..

34 ..it’s a forest out there! Search strategies. Number of (rooted) trees for n terminals is (2n-3)·(2n-5)·(2n-7)·…·3·1. 3 taxa -> 3 trees; 4 taxa -> 15 trees; 10 taxa -> 34,459,425 trees; 25 taxa -> 1.19·10^30 trees; 52 taxa -> 2.75·10^80 trees. Finding the optimal tree is an NP-complete or NP-hard problem. Search strategies: Exact: will find the best tree (according to the criterion); exhaustive search is feasible up to ca. 10 taxa. Heuristic: limits the search to a “reasonable” set of trees; may not find the optimal tree.
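The product formula for the number of rooted trees is easy to evaluate, which makes the explosion tangible. A small sketch (the function name is an assumption):

```python
def rooted_tree_count(n_taxa: int) -> int:
    """Number of rooted binary trees: (2n-3) * (2n-5) * ... * 3 * 1."""
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    count = 1
    for factor in range(3, 2 * n_taxa - 2, 2):  # odd factors 3, 5, ..., 2n-3
        count *= factor
    return count


counts = {n: rooted_tree_count(n) for n in (3, 4, 10, 25, 52)}
```

Python integers have arbitrary precision, so the exact counts for 25 and 52 taxa can be printed in full or formatted in scientific notation.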

35 Star decomposition [Figure: a series of trees over taxa A, B, C, D, E]

36 Stepwise addition [Figure: candidate trees built from taxa A, B, C, D, with scores 837, 831, 783, 914, 921, 915, 916, 905]

37 Branch swapping [Figure: SPR and TBR rearrangements on a tree with taxa A to I]

38 Trapped in local optimum?

39 Reliability (start of lecture 2)

40 4. Evaluate reliability (bootstrap): Idea: A reliable phylogeny can also be inferred from a subset of the data. Therefore: try estimating the phylogeny from parts of the data. Which subtrees are persistent? Definition: A replicate of a multiple alignment is obtained by sampling columns with replacement. Alignment: ACGTACG, AC--ACC, ACG-AGG, GTGTAAG

41 4. Evaluate reliability (bootstrap), continued. Replicate after 1 sampled column: G, -, G, G

42 4. Evaluate reliability (bootstrap), continued. Replicate after 2 sampled columns: GC, -C, GG, GA

43 4. Evaluate reliability (bootstrap), continued. Replicate after 3 sampled columns: GCA, -CA, GGA, GAA

44 4. Evaluate reliability (bootstrap), continued. Replicate after 4 sampled columns: GCAG, -CA-, GGAG, GAAG

45 4. Evaluate reliability (bootstrap), continued. Replicate after 5 sampled columns: GCAGA, -CA-A, GGAGA, GAAGG

46 4. Evaluate reliability (bootstrap), continued. Replicate after 6 sampled columns: GCAGAT, -CA-A-, GGAGA-, GAAGGT

47 4. Evaluate reliability (bootstrap), continued. Replicate after 7 sampled columns: GCAGATA, -CA-A-A, GGAGA-A, GAAGGTA

48 4. Evaluate reliability (bootstrap), continued. 1st pseudo-replicate dataset: GCAGATA, -CA-A-A, GGAGA-A, GAAGGTA (columns sampled with replacement from the alignment ACGTACG, AC--ACC, ACG-AGG, GTGTAAG)
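Column sampling with replacement takes only a few lines. In this sketch the function name and the optional `columns` argument are assumptions; the argument exists so that a specific draw can be replayed. The draw used below (columns 3, 6, 5, 3, 1, 4, 5, counted from 1) is the one that reproduces the pseudo-replicate shown on the slides.

```python
import random


def bootstrap_replicate(alignment, rng=None, columns=None):
    """Resample the columns of an alignment with replacement.

    alignment: list of equal-length aligned sequences.
    columns:   optional explicit list of 0-based column indices to replay a draw.
    """
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all sequences must have the same length")
    if columns is None:
        rng = rng or random.Random()
        columns = [rng.randrange(n_cols) for _ in range(n_cols)]
    # The same column indices are applied to every sequence.
    return ["".join(row[c] for c in columns) for row in alignment]


alignment = ["ACGTACG", "AC--ACC", "ACG-AGG", "GTGTAAG"]
slide_draw = [2, 5, 4, 2, 0, 3, 4]  # 0-based indices of the sampled columns
replicate = bootstrap_replicate(alignment, columns=slide_draw)
random_replicate = bootstrap_replicate(alignment, rng=random.Random(1))
```

Because whole columns are drawn, each pseudo-replicate keeps the homology statement of the original alignment while some sites appear several times and others not at all.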

49 Bootstrap. Original data set with n characters (taxa Aus, Beus, Ceus, Deus). Original analysis, e.g. MP, ML, NJ. Draw n characters randomly with replacement. Repeat m times, giving m pseudo-replicates, each with n characters. Repeat the original analysis on each of the pseudo-replicate data sets. Evaluate the results from the m analyses [figure: one branch labelled 75%]. Bootstrap proportions between 0.5 and 1 can be interpreted as a measure of confidence or support. Rule of thumb: 80% support is “good”. Bootstrap support values are not probabilities! Values below 0.5 are nonsense.
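Evaluating the m analyses amounts to counting how often each group of the original tree reappears. In the sketch below a tree is represented simply as a set of clades, each clade a frozenset of taxon names; that representation, the function name, and the four replicate trees are hypothetical and chosen only so the result matches the 75% shown on the slide.

```python
from collections import Counter


def clade_support(reference_clades, replicate_trees):
    """Fraction of replicate trees that contain each clade of the original tree."""
    counts = Counter()
    for tree in replicate_trees:
        for clade in reference_clades:
            if clade in tree:
                counts[clade] += 1
    return {clade: counts[clade] / len(replicate_trees) for clade in reference_clades}


ab = frozenset({"Aus", "Beus"})
cd = frozenset({"Ceus", "Deus"})
ac = frozenset({"Aus", "Ceus"})
bd = frozenset({"Beus", "Deus"})

# Hypothetical results of m = 4 pseudo-replicate analyses.
replicate_trees = [{ab, cd}, {ab, cd}, {ab, cd}, {ac, bd}]
support = clade_support([ab], replicate_trees)
```

In practice m is in the hundreds or thousands, and the trees come from rerunning the original analysis on each pseudo-replicate.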

50 Precision and accuracy: High precision, low accuracy => high bootstrap value. Low precision, high accuracy => low bootstrap value.

51 Summary; phylogeny for dummies: 1. Collect your data 2. Align your seqs (look at your alignment!) 3. Infer your tree (model/method of choice) 4. Estimate reliability

52 Rooting and relatedness

53 Who is more related to whom? The result of the phylogenetic inference is an unrooted tree! Observations (A, B, C, D) -> NJ/ML/.. -> unrooted phylogenetic tree. Many possible rooted trees (one possibility per branch!), of which only one is correct. [Figure: an unrooted tree of A, B, C, D with a question mark on each branch, and two of the possible rooted trees drawn against a time axis]

54 Relatedness. “A and B are more closely related than A and C” = The most recent ancestor of A and B is younger (in time!) than the most recent ancestor of A and C. RELATEDNESS CAN ONLY BE ASSESSED IN A ROOTED TREE. DO NOT CONFUSE RELATEDNESS WITH SIMILARITY! [Figure: two rooted trees with identical unrooted topologies; for one the above statement is TRUE, for the other it is FALSE]
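The definition of relatedness can be checked mechanically on a rooted tree. In this sketch a rooted tree is a nested tuple of taxon names, and "younger ancestor" is tested as "the ancestor of A and B lies strictly below the ancestor of A and C". The two example trees are an assumed reading of the figure: both share the unrooted split AB|CD but are rooted on different branches.

```python
def leaves(tree):
    """Leaf names under a node; a node is a taxon name (str) or a tuple of child nodes."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def mrca_clade(tree, a, b):
    """Leaf set below the most recent common ancestor of a and b."""
    if not isinstance(tree, str):
        for child in tree:
            below = leaves(child)
            if a in below and b in below:
                return mrca_clade(child, a, b)  # the ancestor lies deeper down
    return leaves(tree)


def more_closely_related(tree, a, b, c):
    """True if the ancestor of (a, b) is a strict descendant of the ancestor of (a, c)."""
    return mrca_clade(tree, a, b) < mrca_clade(tree, a, c)


# Assumed rootings of the same unrooted topology AB|CD.
rooted_on_inner_branch = (("A", "B"), ("C", "D"))
rooted_on_branch_to_a = ("A", ("B", ("C", "D")))

statement_1 = more_closely_related(rooted_on_inner_branch, "A", "B", "C")
statement_2 = more_closely_related(rooted_on_branch_to_a, "A", "B", "C")
```

The same statement comes out true for one rooting and false for the other, although the unrooted tree, and hence the pairwise similarity of the sequences, is identical.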

55 Rooting. If we have a little prior(!) knowledge about the relatedness of our species, we can use this to root the tree. In particular, we want to include in our dataset a species that we know diverged earliest in the set. Example: in the unrooted phylogenetic tree of A, B, C, D, we know (somehow!) that B diverged earliest among the species in this dataset. Then there is only one possible rooted tree (B is more distant to all the others than any of the others are among themselves). A, C, D form the “ingroup”; B is the “outgroup”.
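Outgroup rooting can be sketched as placing the root on the branch that leads to the outgroup. The adjacency-list representation, the internal node names `u` and `v`, and the assumed unrooted topology (A and B on one side of the inner branch, C and D on the other) are illustrative assumptions; the function name is hypothetical.

```python
def root_with_outgroup(adjacency, outgroup):
    """Root an unrooted tree on the branch leading to the outgroup.

    adjacency: dict mapping each node to its neighbours (tips have exactly one).
    Returns a nested tuple (ingroup_subtree, outgroup).
    """
    def subtree(node, parent):
        # Everything reachable from node without walking back through parent.
        children = [nb for nb in adjacency[node] if nb != parent]
        if not children:
            return node  # a tip
        return tuple(subtree(child, node) for child in children)

    (attachment,) = adjacency[outgroup]  # a tip has a single neighbour
    return (subtree(attachment, outgroup), outgroup)


# Assumed unrooted topology: A and B attach to u, C and D attach to v.
unrooted = {
    "A": ["u"], "B": ["u"], "C": ["v"], "D": ["v"],
    "u": ["A", "B", "v"], "v": ["C", "D", "u"],
}
rooted = root_with_outgroup(unrooted, "B")
```

Choosing a different tip as outgroup yields a different rooted tree from the same unrooted topology, which is why the prior knowledge about the outgroup matters.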

56 Orthologs and paralogs Gene trees and species trees

57 Orthologs and paralogs: 2 types of homologs: orthologs and paralogs

58 Resulting gene tree

59 Oops.. A gene tree does not always represent the species tree!

60 Hard enough? Bah!

61 Is there a tree? (HGT) Horizontal gene transfer (HGT) obscures the species tree (but does not make it obsolete) and degrades the relation between phylogenetic history and phenotype. “…Eukaryotes are largely unaffected by the HGT except in their earliest evolutionary period.” Most likely bullshit, but might be true for a few multicellular branches of the Eukaryotic domain.

62 Do genes behave? Not only genes can undergo lateral transfer; any piece of DNA can move, thus violating the assumptions in our evolutionary models. Gene conversion (overwrite), recombination (exchange), insertion, deletion.