1 V11 Multi-variate analysisProgram for today: Staphylococcus aureus Africa project – analysis for confounding variables Overview multivariate analysis for omics projects Case study: gene-regulatory network for breast cancer Case study: single cell methylation and expression data V11 Processing of Biological Data
2 Review: S. aureus in Germany vs. Africa: StaphNet6 study sites each collected 100 isolates of healthy volunteers and 100 of blood culture or clinical infection sites. Aim microbiological and molecular characterization of African S. aureus isolates by DNA microarray analysis including clonal complex analysis supplemented by Whole Genome Sequencing Lambaréné Münster Freiburg Homburg Ifakara Manhiça Lambaréné Münster Freiburg Homburg Ifakara Manhiça Bouaké Bouaké Bouaké Kinshasa Kinshasa Kinshasa
3 Distribution of clonal complexesSome clonal complexes more prevalent in Africa, others predominant in Germany.
4 Activitity of individual probes for CCs
5 PCA of MA hybridization dataPrinciple component analysis of 1200 strains Input data: binary matrix of MA data; dimension 1200 x 334 probes PCA identifies local clusters that are characteristic for particular clonal complexes PCA of MA hybridization data
6 Staphylococcus aureus data from Africa project (V1)Age distribution is heavily skewed: many small kids / babies in Africa – few seniors in Africa very few small kids / babies in Germany – many seniors in Germany Site # of cases below 1 year # of cases 1 to 5 years # of cases years # of cases 26 – 65 years # of cases above 66 years Africa + Germany (clinical) 88 109 90 225 Africa + Germany (commensal) 19 34 363 175 9 Africa (clinical) 86 106 53 54 1 Africa (commensal) 17 156 89 4 Germany (clinical) 2 3 37 171 87 Germany (commensal) 207 5 V11 Processing of Biological Data
7 Analyze whether age is a confounding variableTo test whether age is a confounding variable, one can compare the results from simple linear regression with those from multiple linear regression. The principle difference between these two types of regression models is the number of explanatory variables: (1) the simple linear regression (SLR) model uses only one dependent variable y and one explanatory variable x: y = a + b ∙ x In our case, y stands for the binary output from the Alere-chip experiment for a particular gene. y therefore has values of 0 or 1. With the binary variable x we could encode the sites Africa/Germany. a and b are weights estimated by the model. Generally SLR tries to find such weights (values for a and b) so that the difference between the estimated y and actual y will be the smallest. V11 Processing of Biological Data
8 Analyze whether age is a confounding variable(2) the multiple linear regression model also has one dependent variable y but more than one explanatory variables y = a + b1 ∙ x1 + b2 ∙ x2 + …+ bn ∙ xn As above, y will be the Alere-chip entry for a gene with value 0 or 1. The site, clin/com and age categories will be used as explanatory variables . V11 Processing of Biological Data
9 Steps of testing age categories for confounding(1) Estimate a linear regression model for the dependent variable and one or more explanatory variables. (2) Repeat step 1 with age categories added as further explanatory variable. (3) Compare the weights obtained in steps 1 and 2. As a rule of thumb, if the weight (-s) (regression coefficient(-s)) from step 1 changes by more than 10%, then the variable (here: age) may be considered as a confounder. By following these steps, one can test for every significant finding (for example, gene association) whether age is a confounder. Reasons for this could be e.g. a significant imbalance in the distribution of age among samples. V11 Processing of Biological Data
10 Processing of Biological DataCase study Case study: test whether age categories are a confounding variable for the 2 genes lukS.PV and sdrC..total Previously, we found that these 2 genes have different frequencies in African vs German sites as well as in clinical vs commensal samples. Therefore we will now test age as a confounder in the association of those genes with the Africa/Germany and clinical/commensal categories. Africa was encoded as 1 and Germany as 0. Clinical samples were encoded as 1 and commensal with 0. Age categories were encoded from 1 to 5. V11 Processing of Biological Data
11 Multiple linear regression model for the lukS.PV geneThe Alere result for this gene for different samples is the dependent variable and the site affiliation, clin/com values are explanatory variables. The table lists the dependent (lukS.PV) and explanatory (Africa_value, clin_com_value) variables for 10 samples out of 1200 samples. Since all these samples are from a German site, the Africa_value = 0. Also, all samples are clinical (clin_com_value = 1). # samples lukS.PV Africa_value clin_com_value 1 FR-B 2 FR-B 3 FR-B 4 FR-B 5 FR-B 6 FR-B 7 FR-B 8 FR-B 9 FR-B 10 FR-B V11 Processing of Biological Data
12 Processing of Biological DatalukS.PV Application of linear regression determines optimal weights w1, w2, w3 so that we get for every sample lukS.PV = w1 + w2 ∙Africa.value + w3 ∙clin com value . For the first sample FR-B001 the formula would be 0 = w1 + w2 ∙ 0 + w3 ∙ 1 . Results from multiple linear regression (coefficients marked in bold): Estimate Std. Error t value Pr(>|t|) (Intercept) e-05 *** Africa_value <2e-16 *** clin_com_value <2e-16 *** In other words, the following model is estimated: lukS.PV = ∙ Africa_value ∙ clin_com_value V11 Processing of Biological Data
13 Processing of Biological DatalukS.PV We then added a further variable “age category” with weight w4 to the model. lukS.PV = w1 + w2 ∙ Africa.value + w3 ∙ clin com value + w4 ∙ age Estimate Std. Error t value Pr(>|t|) (Intercept) Africa_value < 2e-16 *** clin_com_value < 2e-16 *** age ** Residual standard error: on 1196 degrees of freedom Multiple R-squared: , Adjusted R-squared: F-statistic: on 3 and 1196 DF, p-value: < 2.2e-16 lukS.PV = ∙ Africa value ∙ clin com value ∙ age V11 Processing of Biological Data
14 Processing of Biological DatalukS.PV This result shows (a) that the age category has a very small impact (its own weight is close to 0) and (b) the two other weights (for the site and clin/com) did not change much. E.g. the weight of the Africa_values changed in relative terms by : The weight of clin_com_value changed by only 0.15%. Both values are smaller than 10% (rule of thumb). Conclusion: There is no statistical evidence that age acts as a confounding variable. V11 Processing of Biological Data
15 Same analysis for gene sdrC_totalBefore adding age categories: Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) < 2e-16 *** Africa_value < 2e-16 *** clin_com_value e-05 *** Residual standard error: on 1197 degrees of freedom Multiple R-squared: , Adjusted R-squared: F-statistic: on 2 and 1197 DoF, p-value: < 2.2e-16 After adding age categories: Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) < 2e-16 *** Africa_value e-12 *** clin_com_value e-05 *** age-category Residual standard error: on 1196 degrees of freedom Multiple R-squared: , Adjusted R-squared: F-statistic: on 3 and 1196 DF, p-value: < 2.2e-16 Weight of Africa_value changed by 9.87%, weight of clin_com_value changed by 0.17% V11 Processing of Biological Data
16 Processing of Biological DataConclusion There is no evidence from our preliminary analysis for the genes lukS.PV and sdrC..total that age acts as a confounder in the associations of genes with invasiveness and site affiliation. We wrote in our manuscript: “The discrepancy in population age between the German and African cohort potentially biases the ‘true’ distribution of clones and genes between isolates from the different geographic regions … [but] application of a multiple linear regression model for the detection rate of Panton-Valentine leucocidin genes failed to provide evidence that age acts as a confounding variable” Ruffing et al. Sci. Rep. 7, 154 (2017) V11 Processing of Biological Data
17 Diabetes/HIV as confounding variablesNext, we tested using Fisher's exact test whether (a) diabetes and HIV have similar frequencies in the total groups of African and German samples and (b) whether diabetes and HIV have similar frequencies in selected groups of African and German individuals carrying particular clonal complexes. The Fisher test considers the distribution provided in a 2 x 2 table. Africa Germany Row Total HIV+ a b a + b HIV- c d c + d Column Total a + c b + d a + b + c+ d = n The formula for the (exact) p-value calculation is : Explanation: these are the number of possible combinatoric combinations for these fields. V11 Processing of Biological Data
18 Analysis of HIV co-infectionFirst, we will test the null hypothesis that “HIV is equally distributed in African and German samples”. (a) For all African samples and all German samples we obtain the following dependencies of HIV carriers (HIV+) and of individuals without approved HIV status (you may say non-carriers) (HIV-): The p-value obtained for this table can be interpreted as the sum of evidence provided by the observed data—or any more extreme table—for the null hypothesis that “there is no difference in the proportions of HIV carriers among the African and German individuals tested in our study”. The smaller the value of p, the greater the evidence for rejecting the null hypothesis. Africa Germany HIV+ 41 HIV- 315 586 V11 Processing of Biological Data
19 Analysis of HIV co-infectionFor the data shown above, Thus, there is very strong evidence from the observed frequencies that African and German individuals are not equally likely to be HIV carriers. V11 Processing of Biological Data
20 Analysis of diabetes co-infectionSimilarly, we can obtain Fisher's exact p-value for the distribution of Diabetes among African and German samples. Africa Germany diab diab p-value = e-14 Also, here, the null hypothesis of a similar distribution is strongly rejected suggesting the prevalence of Diabetes in individuals from Germany compared to individuals from Africa. Of course, we can trace this imbalance back to the difference in age categories of the two groups. V11 Processing of Biological Data
21 HIV/diabetes in individuals with selected CCsNext, we tested the distribution of HIV/Diabetes in individuals carrying S. aureus from selected clonal complexes (CC15, CC45, CC121, CC30 which showed significant imbalance in german/african samples). These are the results (tables + p-values from Fisher's exact test) RF_HIV CC15 Africa Germany hiv hiv p-value (after correction for false discovery rate (FDR)) CC45 Africa Germany hiv hiv p-value (FDR-corrected) V11 Processing of Biological Data
22 HIV/diabetes in individuals with selected CCsCC121 Africa Germany hiv hiv p-value (FDR-corrected) CC30 Africa Germany hiv hiv p-value (FDR-corrected) V11 Processing of Biological Data
23 HIV/diabetes in individuals with selected CCsRF_CCSI_Diab_mel CC15 Africa Germany diab diab p-value (FDR-corrected) CC45 Africa Germany diab diab p-value (FDR-corrected) CC121 Africa Germany diab p-value (FDR-corrected) CC30 Africa Germany diab diab p-value (FDR-corrected) V11 Processing of Biological Data
24 Processing of Biological DataInterpretation In most cases, there is no evidence based on our data to reject the null hypothesis of assuming a similar distribution of HIV and diabetes carriers among African and German samples belonging to particular clonal complexes. The only exceptions to this are CC45 (diabetes – p=0.008/q=0.03) and CC121 (HIV – borderline p=0.013/q=0.05). Therefore, we concluded “we observed statistically significant imbalances in the frequencies of all these clonal complexes XXX, YYY ... between African and Germany. We tested based on Fisher's exact test that these imbalances were not due to an imbalance of HIV and diabetes carriers in both groups. The only exceptions to this are CC45 (diabetes) and CC121 (HIV) where such associations cannot be ruled out.” V11 Processing of Biological Data
25 Processing of Biological DataInterpretation On the other hand, the FDR-corrected p-values for CC45 and CC121 are borderline (0.03 and 0.05). Therefore, there only exists weak statistical evidence for a significant association between CC45 and diabetes or between CC121 and HIV. V11 Processing of Biological Data
26 integration of multi-omics dataOverview of methods for multivariate analysis: https://bmcbioinformatics.biomedcentral.com/articles/ /s Different types of high-throughput technologies allow us to collect information on the molecular components of biological systems - e.g. nucleotide sequencing, - DNA-chips measuring gene expression and - protein mass spectrometry measuring protein abundances). Therefore, in order to draw a more comprehensive view of biological processes, experimental data made on different layers have to be integrated and analyzed. The development of methods for the integrative analysis of multi-layer datasets is one of the most relevant problems computational scientists are addressing nowadays. Bersanelli et al. BMC Bioinformatics (2016) 17(Suppl 2):S15 V11 Processing of Biological Data
27 Graph-based integration of multi-omics dataSome group of approaches use graphs to model the interactions among variables. These approaches, designated as “network-based” (NB), take into account currently known (e.g. protein-protein interactions) or predicted (e.g. from correlation analysis) relationships between biological variables. Then, graph measures (e.g. degree, connectivity, centrality) and graph algorithms (e.g. sub-network identification) are used to identify valuable biological information. Importantly, networks are used in the modeling of the cell’s intricate wiring diagram and suggest possible mechanisms of action at the basis of healthy and pathological phenotypes Bersanelli et al. BMC Bioinformatics (2016) 17(Suppl 2):S15 V11 Processing of Biological Data
28 Bayesian integration of multi-omics dataThe second criterion is whether the approach is Bayesian (BY). These approaches use a statistical model in which, starting from an a priori reasonable assumption about the data probability distribution (parametric or non-parametric) it is possible to compute the updated posterior probability distribution making use of the Bayes’ rule. In the network-based area, Bayesian networks are another promising framework for the analysis multi-omics data. 4 classes of methods: - network-free non-Bayesian (NF-NBY), - network-free Bayesian (NF-BY), - network-based non-Bayesian (NB-NBY) and - network-based Bayesian (NB-BY) methods Bersanelli et al. BMC Bioinformatics (2016) 17(Suppl 2):S15 V11 Processing of Biological Data
29 Overview of multi-omics methodsGrey: network-free, non-Bayesian methods; yellow: network-free, Bayesian methods; blue: network-based, non-Bayesian methods; green: network-based Bayesian methods Abbreviations: GEN = genome, CC = ChIP-chip, CN = copy number variations, DM = DNA methylation, DS = DNA sequence, Hi-C = genome-wide data of chromosomal interactions, LOH = loss of heterozigosity, GT = genotype, GE = gene expression, PE = protein expression Bersanelli et al. BMC Bioinformatics (2016) 17(Suppl 2):S15 V11 Processing of Biological Data
30 Processing of Biological DataExisting tools Method Multi-omics approach Implementation Camelot Bivariate predictive regression model NA CNAmet Multi-omics gene-wise scores R FALDA FA + LDA of a joint matrix Integromics Regularized CCA, sparse PLS iPAC Sequential MCD MCIA Multiple co-inertia analysis sMBPLS Sparse Multi-Block PLS regression Matlab Coalesce Multi-omics probabilities C ++ iCluster Joint Gaussian latent variable models MDI DMA mixture models PSDF Hierarchical DMA mixture models TMD Kernel Fusion Integration of omics-specific kernels Endeavour Integration of omics-specific ranks with order statistics Webserver MOO Sub-network extraction on MWG Multiplex Joint analysis of multi-layered networks NuChart Analysis of a MWG SNF Similarity network fusion Matlab, R SteinerNet stSVM MWG Paradigm Multi-omics bayesian factor graphs Conexic Java legend MWG = multi-weighted graph; FA = factor analysis; LDA = linear discriminant analysis; CCA = canonical correlation analysis; PLS = partial least squares; DMA = Dirichelet multinomial allocation Bersanelli et al. BMC Bioinformatics (2016) 17(Suppl 2):S15 V11 Processing of Biological Data
31 Multi-omics analysis of breast cancer networkTCGA data on breast cancer Hamed et al. BMC Genomics 16 (Suppl5):S2 (2015)
32 Breast cancer network from TCGA dataca differentially expressed genes. Hierarchical clustering of co-expression network: 10 modules with genes. Regulatory info from databases Jaspar, Tred, MSigDB. Shown are 3 modules. Squares are known drug targets. Hamed et al. BMC Genomics 16 (Suppl5):S2 (2015)
33 Drug Targets in breast cancer networkSome key genes are protein targets of known anti-cancer drugs, → relevance of key genes is validated Hamed et al. Nucl Acids Res 43: W283-W288 (2015)
34 Case study: single-cell analytics In the presence of serum, mouse embryonic stem cells (ESCs) constitute a metastable population with stochastic switching between transcriptional states. This transcriptional heterogeneity has been linked to the differentiation potential of ESCs. E.g. NANOGlo cells have an increased propensity to differentiate and elevated expression of differentiation markers compared with NANOGhi cells. Sorted populations of cells show different levels of DNA methylation between transcriptional states, such as gains in DNA methylation in NANOGlo and REX1/ZFP42lo cells compared with, respectively, NANOGhi and REX1hi cells. To investigate the link between epigenetic and transcriptional heterogeneity in ESCs, Reik et al. performed scM&T-seq on 76 individual serum ESCs. V11 Processing of Biological Data
35 Processing of Biological DatascMT & T-seq protocol Single cells are collected and lysed. Then poly-A RNA is captured on magnetic beads and physically separated from DNA. Angermüller et al. Nature Methods (2016) 13, 229 V11 Processing of Biological Data
36 Processing of Biological DatascMT & T-seq protocol Amplified cDNA is generated from mRNA on beads. DNA is bisulfite converted and Illumina sequencing libraries are prepared from both components in parallel. Angermüller et al. Nature Methods (2016) 13, 229 V11 Processing of Biological Data
37 Processing of Biological DataHow good is the protocol: check against single cell bisulfite sequencing (scBS-seq) Methylome coverage in scM&T-seq libraries was lower than that in scBS-seq libraries. However, genome-wide CpG coverage at matched sequencing depth (c) and coverage of different regions (d) was consistent across protocols. Angermüller et al. Nature Methods (2016) 13, 229 V11 Processing of Biological Data
38 Clustering based on DNA methylation dataShown is a hierarchical clustering analysis of gene-body methylation for the 300 most variable genes in terms of DNA methylation. Angermüller et al. Nature Methods (2016) 13, 229 V11 Processing of Biological Data
39 Clustering based on expression dataShown is a hierarchical clustering analysis of gene expression for the 300 most variable genes (on the basis of DNA-methylation variance) → Both data yield distinct clustering of cells. This suggests that global methylome and transcriptome profiles reveal complementary, but distinct, aspects of cell state. This is also consistent with previous observations that the transcriptome and methylome are partially uncoupled in serum ESCs. Angermüller et al. Nature Methods (2016) 13, 229 V11 Processing of Biological Data
40 Associations of expression and DNA-methylation variation1,493 associations were found between the expression of individual genes and DNA-methylation variation in several genomic contexts (FDR) < 10%). There exist both positive and negative associations, highlighting the complexity of interactions between the methylome and the transcriptome. Also distal regulatory elements including low-methylation regions (LMRs) had a fair balance of positive and negative associations. log10(q-value) Pearson correlation Angermüller et al. Nature Methods (2016) 13, 229 V11 Processing of Biological Data
41 Associations of expression and DNA-methylation variationNegative correlations between DNA methylation and gene expression were predominant for non-CGI promoters. Angermüller et al. Nature Methods (2016) 13, 229 V11 Processing of Biological Data
42 Zoomed-in analysis for EsrrbCorrelation of methyla-tion and expression (black), variance (blue) Solid black curve : weighted mean methylation rate across all cells Estimated methylation rate of 3-kb windows for each cell Dot size : CpG coverage, Dot colors correspond to single cells. Esrrb is a known hub gene in pluripotency networks. Its expression negatively correlates with the methylation of several LMR and p300 sites overlapping ‘superenhancers’ in the genomic neighborhood. Angermüller et al. Nature Methods (2016) 13, 229 V11 Processing of Biological Data
43 Zoomed-in analysis for EsrrbTwo scatter plots at the top right: association between DNA methylation at a p300 region (outlined in yellow) and at an LMR (outlined in blue) and Esrrb expression. Angermüller et al. Nature Methods (2016) 13, 229 V11 Processing of Biological Data
44 Correlations of DNA methylation and expressionGene-specific association analysis of correlations between DNA methylation in different genomic contexts and gene expression in individual cells. Shown are methylation-expression correlations for all variable genes in single cells, for each annotation, with the correlation obtained from matched RNA-seq and BS-seq of a bulk cell population superimposed (orange circles). Prom = promoter. Angermüller et al. Nature Methods (2016) 13, 229 V11 Processing of Biological Data
45 Processing of Biological DataSummary Multi-variate vs. single-variate analysis reveals possible confounding effects Multi-omics methods: graph-based and/or Bayesian methods for data integration Single cell analysis showed: non–CGI promoter methylation and transcription are negatively associated in single cells / both positive and negative associations at distal regulatory regions. expression levels of many pluripotency factors, such as Esrrb are negatively associated with DNA methylation → an important mechanistic component of fluctuating pluripotency in serum ESCs is epigenetic heterogeneity the strength of the connection between the methylome and the transcriptome can vary from cell to cell Q: is our understanding / data generation ready for multi-omics analysis? V11 Processing of Biological Data