Analysis of protein-coding genetic variation in 60,706 humans

Author: Lizbeth Rice

1 Analysis of protein-coding genetic variation in 60,706 humans. Exome Aggregation Consortium (ExAC): Lek M*, Karczewski KJ*, Minikel EV*, Samocha KE*, Banks E, Fennell T, O'Donnell-Luria AH, Ware JS, Hill AJ, Cummings BB, Tukiainen T, Birnbaum DP, Kosmicki JA, Duncan L, Estrada K, Zhao F, Zou J, Pierce-Hoffman E, Cooper DN, DePristo M, Do R, Flannick J, Fromer M, Gauthier L, Goldstein J, Gupta N, Howrigan D, Kiezun A, Kurki MI, Moonshine AL, Natarajan P, Orozco L, Peloso GM, Poplin R, Rivas MA, Ruano-Rubio V, Ruderfer DM, Shakir K, Stenson PD, Stevens C, Thomas BP, Tiao G, Tusie-Luna MT, Weisburd B, Won H-H, Yu D, Altshuler DM, Ardissino D, Boehnke M, Danesh J, Elosua R, Florez JC, Gabriel SB, Getz G, Hultman CM, Kathiresan S, Laakso M, McCarroll S, McCarthy MI, McGovern D, McPherson R, Neale BM, Palotie A, Purcell SM, Saleheen D, Scharf J, Sklar P, Sullivan PF, Tuomilehto J, Watkins HC, Wilson JG, Daly MJ, MacArthur DG†. MIT, MGH, Harvard, Boston Children’s, Brigham and Women’s Hospital, Univ Sydney Australia, Imperial College London, Univ Washington, Cardiff, Mount Sinai, Mexico City, Parma Italy, Univ Michigan, Cambridge UK, Barcelona Spain, Stockholm Sweden, Kuopio Finland, Oxford UK, Cedars-Sinai, Ottawa Canada, U Penn, Karachi Pakistan, UNC, Helsinki Finland, Univ Mississippi

2 Exome Aggregation Consortium (ExAC): The Exome Aggregation Consortium (ExAC) is a coalition of investigators seeking to aggregate and harmonize exome sequencing data from a wide variety of large-scale sequencing projects, and to make summary data available for the wider scientific community. The data set includes 60,706 exome sequences from various disease-specific and population genetic studies.

3 Purpose of Exome Aggregation Consortium (ExAC): Filling in the knowledge gaps. High-throughput DNA sequencing; human variation; medical exomes/genomes. Difficult to put into practice: inconsistent processing; complicates variant-calling pipelines used by different groups. Current publicly available databases: NHLBI Exome Sequencing Project (6,503 exomes); 1000 Genomes (2,504 genomes).

4 Purpose of Exome Aggregation Consortium (ExAC): Importance of a large database of human genetic variation: human evolution; human biology/gene function; filtering of variants; large numbers; ancestral diversity; clinical interpretation of variants in patients with disease.

8 Exome Aggregation Consortium (ExAC): So what have we learned? Joint variant calling and analysis of high-quality variant calls across 60,706 human exomes. Apply this dataset to: resolution of very low-frequency variants; identification of widespread mutational recurrence; inference of gene-level constraint against truncating variation; clinical interpretation of variation in Mendelian disease genes; discovery of human “knockout” variants in protein-coding regions.

9 Materials and Methods: 1 petabyte (10^15 bytes, or 1,000 terabytes) of raw sequencing data (FASTQ files) from 91,796 individual exomes. Single informatic pipeline. Alignment to hg19. East Asia, South Asia, Africa, European (no phenotype); parent-offspring trios (looking for mutations in schizophrenic probands); controls from Type 2 Diabetes study; case-control study for identifying genes for IBD; genetics of early-onset heart attack; genes in heart, lung and blood disorders; genetics of mental disorders/psychiatric screens; case-control Type 2 Diabetes in Mexico, Latin American; Finnish samples; Swedish case-control study (schizophrenia and bipolar); multi-ethnic case-control study for type 2 diabetes; parent-offspring trios (schizophrenia); blood samples from subjects with cancer.

10 Materials and Methods: Variants called using Genome Analysis Toolkit (GATK) HaplotypeCaller. Exome mean coverage ~65X. Internal and external validation data used to calibrate filters and evaluate the quality of filtered variants. Adjusted to increase the number of singletons that pass the filter. 50.49% transmission rate in 490 trios; 29 whole genome sequences (false discovery); data from 10,650 SNP arrays; 699 de novo validated (99.9%). Single nucleotide variants (SNVs): 99.8% sensitivity, 0.13% false discovery rate. Insertions and deletions (indels): 95.1% sensitivity, 1.95% false discovery rate.
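The sensitivity and false discovery rate figures on this slide follow the usual definitions from counts of true positives, false positives and false negatives against a validation set. A minimal sketch, using hypothetical counts chosen only to illustrate the arithmetic (they are not taken from the ExAC validation data):

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real variants that the call set recovers."""
    return true_pos / (true_pos + false_neg)


def false_discovery_rate(true_pos: int, false_pos: int) -> float:
    """Fraction of called variants that are not real."""
    return false_pos / (true_pos + false_pos)


# Hypothetical validation counts, for illustration only.
tp, fp, fn = 9980, 13, 20
print(f"sensitivity = {sensitivity(tp, fn):.1%}")
print(f"FDR         = {false_discovery_rate(tp, fp):.2%}")
```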

11 Final Data Set: Ancestry: principal component analysis (PCA) on 5,400 common SNVs; PCA 1-4 (emphasize variation, bring out strong patterns); geographic ancestry within each sample; done again to separate Finnish from non-Finnish Europeans. Population clusters: South American and Mexico = Latino; North America and Europe = European. 60,706 unrelated adults: high-quality sequencing data; without severe pediatric disease. GATK used to calculate: # variants; transition/transversion ratio (TiTv ~ exomes); alternate allele het/hom ratio; insertion/deletion ratio. Outliers within each population were removed.
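One of the per-sample quality metrics listed above is the transition/transversion (Ti/Tv) ratio. The slide says GATK was used to compute it; the sketch below is not GATK, just a plain illustration of what the ratio measures, with a made-up list of SNVs as input.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """A transition swaps purine<->purine or pyrimidine<->pyrimidine."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_ratio(snvs: list[tuple[str, str]]) -> float:
    """Ratio of transitions to transversions for (ref, alt) SNV pairs."""
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = len(snvs) - transitions
    if transversions == 0:
        raise ValueError("no transversions; ratio is undefined")
    return transitions / transversions


# Made-up SNVs for one sample: three transitions, one transversion.
sample_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
print(titv_ratio(sample_snvs))
```

A sample whose ratio sits far from the rest of its population cluster would be a candidate outlier under the removal step described on the slide.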

13 Supplementary Table 3: ExAC Samples summarized by population and gender.

14 Results: Variants: 10,195,872 candidate sequence variants. After stringent depth and site/genotype quality filters: 7,404,909 high-quality (HQ) variants. Filters: pass by VQSR; 80% of samples with depth >=10 and genotype quality >=20; at least 1 sample with alternate allele with same depth; not located in region with highest level of multi-allelic variation. Including 317,381 indels. 1 variant for every 8 bp. Majority are low frequency: 99% have frequency <1%; 54% were seen in only 1 individual; 72% were absent from ESP and 1000G. Density and type of variation not uniform: observed 7.5% of possible synonymous variants; observed 62.8% of possible CpG transitions (C to T variants); observed 3.1% of possible transversions; observed 9% of other possible transitions. Observed fewer missense and nonsense variants (not surprising).
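The site-level filters on this slide can be read as a simple predicate over the genotypes at one site. The sketch below is an assumption-laden reading of the slide, not the consortium's pipeline: the `Genotype` record and function name are hypothetical, and the multi-allelic-region exclusion is left out.

```python
from dataclasses import dataclass


@dataclass
class Genotype:
    depth: int      # read depth (DP) at the site for one sample
    gq: int         # genotype quality (GQ)
    has_alt: bool   # sample carries at least one alternate allele


def is_high_quality_site(vqsr_pass: bool, genotypes: list[Genotype],
                         min_depth: int = 10, min_gq: int = 20,
                         min_fraction: float = 0.8) -> bool:
    """Apply the slide's filters: VQSR pass, >=80% of samples well covered,
    and at least one well-covered sample carrying the alternate allele."""
    if not vqsr_pass or not genotypes:
        return False
    well_covered = [g for g in genotypes
                    if g.depth >= min_depth and g.gq >= min_gq]
    if len(well_covered) / len(genotypes) < min_fraction:
        return False
    return any(g.has_alt for g in well_covered)


# Toy site: 10 samples, 9 well covered, one well-covered alt carrier.
site = [Genotype(30, 99, True)] + [Genotype(25, 60, False)] * 8 + [Genotype(5, 10, False)]
print(is_high_quality_site(True, site))
```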

15 Results: Variants: Indels. 95% have length -6 to +6. Shorter deletions most common. Frameshifts found in smaller numbers and more likely to be found once.

16 Results: Patterns of variation: De novo. 7.9% of variant sites are multiallelic (expected 8.3%); 0.48% in 1000G; 0.43% in ESP. Recurrence: synonymous variants 43% de novo (independent event, separate origin); CpG transition variants 87% de novo; reaching saturation (20,000 individuals). Single observation rates and site mutability: low predicted mutability, 60% singleton rate; high mutability (CpG), 20% singleton rate. 16% of CpG synonymous changes are found in at least 20 copies. CpG variants are more likely to be found in more than one population. Doubletons: mutation rate is positively correlated with likelihood of being observed in 2 individuals of different populations.

17 Extended Data Figure 4. The impact of recurrence across different mutation and functional classes. Relatively stable but changes drastically at larger sample sizes. Doubletons: transversions/non-CpG transitions more likely to be found in similar ethnicities. Stop loss higher than nonsense. CpG transitions more likely to have multiple origins to account for higher frequency. Wide variety of functional classes of variants seen once. No stop-lost CpG transitions.

18 Results: Multinucleotide Polymorphisms (MNPs): Number of MNPs per impact on the variant interpretation. Used read-based phasing pipeline. Multiple substitutions in the same codon in at least one individual. 5,945 MNPs; 23 per sample. In 647, a protein-truncating variant (PTV) is eliminated by an adjacent SNP. In 131, underlying synonymous or missense variants result in a PTV. 10 disease-causing. Missed in other variant calling and annotation pipelines.
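To see why phasing changes interpretation, consider two SNVs in the same codon annotated separately versus together. The sketch below uses the standard genetic code and a hypothetical reference codon; the function names are illustrative and not taken from any annotation pipeline.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def apply_snvs(codon: str, snvs: list[tuple[int, str]]) -> str:
    """Apply (position 0-2, alternate base) substitutions to a codon."""
    bases = list(codon)
    for pos, alt in snvs:
        bases[pos] = alt
    return "".join(bases)


def consequence(ref_codon: str, alt_codon: str) -> str:
    """Classify the coding consequence of changing ref_codon to alt_codon."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "protein-truncating"
    if ref_aa == "*":
        return "stop-lost"
    return "missense"


ref = "CAG"                     # hypothetical reference codon (Gln)
snv1, snv2 = (0, "T"), (1, "G")
print(consequence(ref, apply_snvs(ref, [snv1])))        # TAG, annotated alone
print(consequence(ref, apply_snvs(ref, [snv2])))        # CGG, annotated alone
print(consequence(ref, apply_snvs(ref, [snv1, snv2])))  # TGG, on the same haplotype
```

Annotated alone, the first SNV looks like a stop codon; on the same haplotype as the second SNV the codon encodes tryptophan, which corresponds to the "PTV eliminated by an adjacent SNP" case on the slide.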

19 Results: Multinucleotide Polymorphisms (MNPs): Used read-based phasing pipeline. Distribution of the number of MNPs per sample where phasing changes interpretation. If composed of both rare and common variants, considered rare. >1%; <1%

20 Results: Deleterious Variants: Expected to have lower frequencies (natural selection). Frequencies are skewed by mutation rate. Mutation rate not uniformly distributed across the functional classes: CpG can never result in stop loss. Corrected for mutation rate: mutability-adjusted proportion singleton (MAPS) metric. Strong selection against predicted PTVs and missense variants.
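The idea behind a mutability-adjusted proportion of singletons can be sketched as follows: fit the singleton proportion of synonymous variants as a function of mutability, use that fit to predict how many singletons a functional class should have given its mix of mutational contexts, and report the excess. This is a simplified illustration under that assumption, with invented numbers; it is not the published implementation.

```python
# Each record: (mutability, number of variants, number of singletons)
# for one mutational context. All values below are invented.
Record = tuple[float, int, int]


def fit_line(xs: list[float], ys: list[float], ws: list[float]) -> tuple[float, float]:
    """Weighted least-squares fit of y = intercept + slope * x."""
    w_sum = sum(ws)
    x_mean = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_mean = sum(w * y for w, y in zip(ws, ys)) / w_sum
    sxx = sum(w * (x - x_mean) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def maps(category: list[Record], synonymous: list[Record]) -> float:
    """Excess singleton proportion of a category over the synonymous baseline."""
    slope, intercept = fit_line(
        [m for m, _, _ in synonymous],
        [s / n for _, n, s in synonymous],
        [float(n) for _, n, _ in synonymous],
    )
    observed = sum(s for _, _, s in category)
    expected = sum(n * (intercept + slope * m) for m, n, _ in category)
    total = sum(n for _, n, _ in category)
    return (observed - expected) / total


synonymous = [(0.1, 1000, 600), (0.5, 1000, 400), (0.9, 1000, 200)]
truncating = [(0.1, 100, 80), (0.9, 100, 40)]
print(round(maps(truncating, synonymous), 3))
```

A value near zero means the class behaves like synonymous variation; a positive value means more singletons than mutability alone predicts, which is the signature of selection the slide describes.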

21 Results: Selection against variant categories in single genes: Examine proportion of variation that is missing compared to expectations under random mutation. Loss-of-function tolerant: pLI<0.1, n=10,374. Loss-of-function intolerant (pLI): pLI>0.9, n=3,230. Positively correlated with number of binding partners for gene product. Highest are involved in core biological processes: spliceosome; ribosome; proteasome components. Haploinsufficient disease genes; 79% not yet assigned a disease phenotype. Lowest: olfactory receptors. High constraint: higher expression levels; broader tissue expression. 18,225 genes. ClinVar.
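The pLI cut-offs on this slide partition genes into three groups. A minimal sketch of that binning, with placeholder gene names and scores; how a score exactly at 0.9 or 0.1 is treated is an assumption here, since the slide only gives strict inequalities.

```python
def classify_pli(pli: float) -> str:
    """Bin a gene by its pLI score using the slide's 0.9 / 0.1 thresholds."""
    if not 0.0 <= pli <= 1.0:
        raise ValueError("pLI is a probability and must lie in [0, 1]")
    if pli >= 0.9:       # boundary handling is an assumption
        return "LoF-intolerant"
    if pli <= 0.1:       # boundary handling is an assumption
        return "LoF-tolerant"
    return "intermediate"


# Placeholder genes and scores, for illustration only.
scores = {"GENE_A": 0.99, "GENE_B": 0.45, "GENE_C": 0.02}
for gene, pli in scores.items():
    print(gene, classify_pli(pli))
```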

22 Extended Data Figure 6. Distribution of synonymous, missense, and protein-truncating Z scores for gene sets. Fragile X mental retardation protein (FMRP) is a polyribosome-associated RNA-binding protein that regulates the synthesis of a set of plasticity-related proteins by stalling ribosomal translocation on target mRNAs.

23 Results: Selection against variant categories in single genes: Loss-of-function intolerant (pLI) genes. Depleted for eQTLs: large changes in expression would be detrimental. Enriched within genome-wide significant trait-associated loci: small changes in expression would cause disease. Dosage-sensitive.

24 Results: Variant Interpretation in Mendelian Disease: Filtering using ExAC as reference dataset versus ESP. Remove variants at >0.1% allele frequency (dominant). Using ExAC reduced candidate variants by 7-fold. Most powerful when the highest allele frequency in any one population (popmax) was used rather than the average global allele frequency. ExAC has greater power to remove more variants (154 on average vs. 1090). 69% of ESP European singletons are not seen a second time in ExAC: danger of filtering on very low allele counts.
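The difference between filtering on global allele frequency and on popmax can be shown with a small example. The sketch below assumes per-population allele counts (AC) and allele numbers (AN) are available for a variant; the population labels and counts are invented for illustration.

```python
# {population: (allele count, allele number)} for one variant
PopCounts = dict[str, tuple[int, int]]


def global_af(counts: PopCounts) -> float:
    """Allele frequency pooled across all populations."""
    total_ac = sum(ac for ac, _ in counts.values())
    total_an = sum(an for _, an in counts.values())
    return total_ac / total_an if total_an else 0.0


def popmax_af(counts: PopCounts) -> float:
    """Highest allele frequency observed in any single population."""
    return max((ac / an for ac, an in counts.values() if an > 0), default=0.0)


def passes_dominant_filter(counts: PopCounts, max_af: float = 0.001) -> bool:
    """Keep a candidate only if its popmax frequency is at most 0.1%."""
    return popmax_af(counts) <= max_af


# Invented variant: rare overall but relatively common in one population.
variant = {"AFR": (2, 10_000), "FIN": (40, 6_000), "NFE": (3, 60_000)}
print(f"global AF = {global_af(variant):.5f}")
print(f"popmax AF = {popmax_af(variant):.5f}")
print("kept as candidate:", passes_dominant_filter(variant))
```

Here the pooled frequency is below 0.1%, so a global filter would keep the variant, while popmax removes it because one population carries it at well above the threshold.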

25 Supplementary Table 13: Number of missense and protein-truncating variants/individual

26 Results: Variant Interpretation in Mendelian Disease: Mendelian disease-causing variants. Average ExAC sequence harbors ~53 disease-causing variants; ~41 are high-quality genotypes with allele frequency >1% in at least one population. Not due to genotyping error but misclassification. Curated 192 variants with allele freq >1% in Latino: 9 had sufficient data to support disease association (mild or incompletely penetrant); 163 were reclassified as benign or likely benign (ACMG guidelines); 18 had + functional studies. CIRH1A p.R565W variant: AR cirrhotic liver failure during childhood; four homozygotes; no history of liver disease; liver function normal in 2; not fully penetrant. Reference population: >1% global (75) or Latin American or South Asia (117).

27 Results: Variant Interpretation in Mendelian Disease: Mendelian disease-causing variants. Average ExAC exome contains 0.89 Mendelian variants in well-characterized dominant disease genes at <1% population allele frequency; 0.20 at <0.1%. False reports of pathogenicity and incomplete penetrance. Just because it’s rare does not make it disease causing. Supplementary Table 14: ExAC frequencies of reportedly pathogenic variants

28 Results: Rare Protein-truncating variants (PTVs): 179,774 PTVs out of 7,404,909 HQ variants; 121,309 were only found once; 58,435 occurred in more than one individual; 33,625 occurred in only one population. Single individual: 85 heterozygous PTVs; 35 homozygous PTVs; ~2 unique to him/her; 0.14 are found in pLI genes. Populations: PTVs differ across human populations. Finland: increase in 1-5% frequency PTVs. Africa: increase in common (>0.1%) PTVs. <0.1% popmax *

29 Results: Rare Protein-truncating variants: Supplementary Table 18: Number of protein-truncating variants per individual by population

30 Extended Data Figure 10. Number of protein-truncating variants in constrained genes per individual by allele frequency bin. Equivalent to Figure 5b limited to constrained (pLI ≥ 0.9) genes.

31 Discussion: Most comprehensive catalogue of human protein-coding variation: 60,706 individuals of diverse geographic ancestry; identification of very low frequency variants. Resource for clinical interpretation of genetic variants in disease patients; publicly available. Identification of 2,557 loss-of-function intolerant genes for which disease phenotypes have not been described: severe haploinsufficient disease; heterozygous inactivation results in embryonic lethality. Unexpected tolerance to functional variation. Inflated number of pathogenic variants in databases.

32 Discussion: Populations and PTVs. Understanding function through human “knockouts”. Africans have more PTVs: 140/person; AF above 1%. Finnish: lack PTVs with AF <0.1%; peak at 1-5% AF. South Asian (Pakistani): homozygous PTVs (consanguineous cohorts).

33 Limitations: Still limited in power at lower allele frequencies. Samples from disease-associated databases: not a random sample; biased for certain disease/risk alleles (diabetes, mental disorders). No phenotypes available (lack of full consent): late-onset disorders. Some populations missing: Middle East. Some exons lack coverage: confounded by capture technology; confounded with cohort and population. Only one source of variation: variants outside of gene coding regions. Saturation of other classes of variation is possible with increased sample size.

34 What’s Next? More samples: >90,000 exomes in next release. Moving to genomes: test run complete on 3,600 whole genomes; more to come. Sharing ages: releasing the distribution of carrier ages for each variant. User-friendly tools: framework to allow users to analyze patients vs ExAC. https://www.clinicalgenome.org/site/assets/files/2753/macarthur_exac.pdf