Broad Institute of Harvard and MIT

Author: Jennifer Walker

1 Michael Reich, Broad Institute of Harvard and MIT, July 14, 2012

2 Need: Insights through integrative studies

Success stories from integrative genomics studies:

Gene mutation causing Leigh Syndrome French Canadian Type (LSFC) and 8 other mitochondrial diseases
Integrate: candidate genomic region, mitochondrial proteomic data, and cancer expression compendium
Authors: Mootha et al. 2003, Calvo et al. 2006

Discovery of 3 new genes involved in Glioblastoma Multiforme (NF1, ERBB2, PIK3R1); confirmation of TP53, PTEN, EGFR, RB1, PIK3CA
Integrate: DNA sequence, copy number, methylation aberrations, and expression profiles in 206 glioblastomas
Authors: TCGA Research Network 2008

Subtle repression of oxphos genes in diabetic muscle: role for mitochondrial dysregulation in diabetes pathology; new computational approach, Gene Set Enrichment Analysis
Integrate: gene sets/pathways and processes with expression profiles
Authors: Mootha et al. 2003, Subramanian & Tamayo et al. 2005

~3000 novel, large non-coding RNAs with functions in development, the immune response, and cancer
Integrate: genome sequences from 21 mammals, epigenomic maps, and expression profiles
Authors: Guttman, Rinn et al. 2009

Characterization of disease subtypes and improved risk stratification for medulloblastoma patients
Integrate: copy number, expression, and clinical data for 96 medulloblastoma patients
Authors: Tamayo et al., Cho et al. 2011

IKBKE as a new breast cancer oncogene
Integrate: RNAi screens, transformation of activated kinases, and copy number from SNP arrays of cell lines
Authors: Boehm et al. 2007

What’s preventing these analyses from becoming more widespread? Integrative analysis of multiple data types remains an enormous challenge. The key reason is the growing gap between the need to use a variety of different analysis and visualization tools and the difficulty of getting tools from different sources to work together.

3 Translational Research Example

[Diagram: the experiment mapped onto tool and data space across GenePattern, Cytoscape, IGV/UCSC, Genomica, and CMAP. Step labels: differentially expressed genes; GSEA test enrichment; show network; expand +1 (include neighbors); show chromosome; add transcription factor track from UCSC; atcgcgtttattcgataagg / atcgcgttttttcgataagg, looks close to p53 site; test for similarity of p53 and gene location; load compendium; show module map; extract module; learn p53 site/score on promoter; pathway activation (added to GenePattern). Data and conclusion labels: expression; network; alterations; compendium; idea; conclusion; arrests G2/M. File formats exchanged: ODF, GMT, HTML, SIF, GCT, GXP, NA, gene list, GXC, GRP, GFF, GXA.]

I’m going to think up an experiment, and as I think, you’re going to see my thoughts mapped into tool and data space.

A biologist wishes to find the molecular basis for the prognostic differences between breast cancers of 2 different grades. She profiles mRNA levels of a collection of breast cancer samples from the two grades.

She identifies differentially expressed genes (CMS in GP).
She probes the molecular processes that gave rise to this signature (GSEA in GP).
Finding that the p53 pathway is overrepresented, she explores further by finding other potentially implicated genes (export list to Cytoscape, map genes onto network, extract all neighboring genes from network).
She views where the expanded set of genes lies on the genome (export to IGV).
She determines the relationship of these genes to recurrent chromosomal aberrations in the same type of breast cancer, as well as to a catalog of conserved TF binding sites from an analysis of 21 mammalian genomes (UCSC browser -> IGV).
She explores where aberrations and genes in the set intersect and notices that a p53 site occurs frequently in these regions.
She determines whether the relationship among the 3 features (genes, aberrations, binding sites) is significant (Genomica).
She also determines that many genes are co-regulated across a compendium of 1975 expression profiles from 23 cancers (GP => GenomicaModuleMap).
She now wants to see whether this signature has any relationship to the expression profile of perturbation by a drug compound, and finds one with drugs that arrest cells at G2/M (export from Genomica, format for CMAP, load into CMAP).
Knowing that this cell cycle checkpoint is controlled by p53, she evaluates whether a significant number of genes in the signature have a p53 binding site in their promoter. This requires obtaining all human promoter sequences, formatting the p53 PSSM file, running the command-line tool to score the known PSSM for the p53 binding site, and testing the signature genes for enrichment for the site using Genomica.
Evidence points to p53 controlling different grades of cancer. She confirms this by looking for p53 activation within the original cancer samples’ expression profiles (requires use of a new pathway activation algorithm in GP).
Inspecting these results, she realizes that some of the lower-grade patients have activation of the signature. This leads her to hypothesize that treatment with the G2/M inhibitor might offer an alternative therapy for these patients.

Who would dare to do this now? Each tool has its own data input and output formats. The problems involved in merely converting the data would likely make this type of research a non-starter.
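The promoter-scoring step in the notes above (scoring a known PSSM against promoter sequences) can be sketched roughly as follows. This is only an illustration under stated assumptions: the 4-position matrix holds invented values and is not the p53 matrix, and `score_window` and `best_hit` are hypothetical helper names, not functions of any tool named on the slide.

```python
from typing import Dict, List, Tuple

# Toy position-specific scoring matrix: one dict of scores per motif position.
# The values are invented for illustration; this is NOT the real p53 matrix.
TOY_PSSM: List[Dict[str, float]] = [
    {"a": 1.0, "c": -1.0, "g": -1.0, "t": -1.0},
    {"a": -1.0, "c": -1.0, "g": -1.0, "t": 1.0},
    {"a": -1.0, "c": 1.0, "g": -1.0, "t": -1.0},
    {"a": -1.0, "c": -1.0, "g": 1.0, "t": -1.0},
]


def score_window(window: str, pssm: List[Dict[str, float]]) -> float:
    """Sum the per-position scores for one window; unknown bases score 0."""
    return sum(column.get(base, 0.0) for column, base in zip(pssm, window.lower()))


def best_hit(sequence: str, pssm: List[Dict[str, float]]) -> Tuple[int, float]:
    """Slide the matrix along the sequence and return (position, score) of the best window."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    hits = [
        (score_window(sequence[i:i + width], pssm), i)
        for i in range(len(sequence) - width + 1)
    ]
    # Highest score wins; ties go to the leftmost position.
    score, position = max(hits, key=lambda hit: (hit[0], -hit[1]))
    return position, score


if __name__ == "__main__":
    # Sequence fragment taken from the slide's diagram labels.
    print(best_hit("atcgcgtttattcgataagg", TOY_PSSM))
```

In the workflow described in the talk, a score like this would be computed for every promoter, and the signature genes would then be tested for enrichment of high-scoring sites in a separate tool.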

4 12 steps, 6 tools, 7 transitions

[Diagram: the analysis steps and conclusions from the previous slide laid out across GenePattern, Cytoscape, IGV, Genomica, CMAP, and the UCSC Browser. Legend: analysis step; analysis conclusion; transition within a tool; transition across tools.]

The key observation here is that if you can focus on making the transitions between tools easier, you can greatly accelerate the way you can use those tools in concert.

5 The Challenges

Flood of high-throughput biological data: genomic sequence, global mRNA expression profiles, copy number and LOH, epigenetic data, protein level and modification status, metabolite profiles
Proliferation of tools: databases, visualization, and analysis
Difficulty of getting tools to work together: access, analyze, visualize each data type separately

The key challenge is the growing gap between the need to use a variety of different analysis and visualization tools and the difficulty of getting tools from different sources to work together. 7-10K bioinformatics tools. Broad alone lists ~60 tools on its external website, which doesn’t include internal-use tools, LIMS, or data processing/management pipelines. 5K public databases.

6 The Need

A lightweight “connection layer” for a wide variety of integrative genomic analyses
Support for all types of resource: Web-based, desktop, etc.
Automatic conversion of data formats between tools
Easy access to data from any location
Any tool that joins is automatically connected to the community of tools
Ease of entry into the environment

We know that there are many pitfalls when you’re trying to set up an environment that will allow diverse tools to send data to each other (searching for tools, harmonizing data formats, transferring large data files, etc.); we could do a whole talk on each one of these. But are there areas with successes that we can emulate?

7 Cloud-based storage; API connectivity layer

Data storage in the cloud is being used in a huge number of applications for its cost, scalability, and accessibility. Web-based APIs have been used to connect disparate Web resources in a very easy way. These were the principles we used when we developed GenomeSpace.

8 Online community to share diverse computational tools

6 Seed Tools: Cytoscape, Galaxy, GenePattern, Genomica, IGV, UCSC Browser
3 Driving Biological Projects: lincRNAs, cancer stem cells, patient stratification
[Map labels: current participating institutions; new tools; new biological projects]

GenomeSpace is a project to build an online community to share, find, and interoperate diverse computational tools. Tools retain their identity and use as stand-alone software, and GenomeSpace maintains their native look and feel. Our goal is to bring the ever-changing wealth of genomic analysis methods, and whatever data is required, to the fingertips of any biologist.

Seeded with 6 popular genomics tools representing diverse architectures (Cytoscape, Galaxy, GenePattern, Genomica, IGV, UCSC Browser)
Support interoperability through frictionless data transfer, with reproducibility, analytic workflows, and comprehensive documentation
Development driven by 3 driving biological projects (cancer, lincRNAs, stem cell circuits, and patient stratification)
Live in the cloud
Next phase: starting to engage new tools and new biomedical projects

9 GenomeSpace Principles

Aimed towards non-programming users
Support interoperability through automatic cross-tool data transfer
Requires minimal changes to tools

10 GenomeSpace Components

[Diagram: GS-enabled tools (GenePattern, Galaxy, Cytoscape, Integrative Genomics Viewer, geWorkbench) connect to the GenomeSpace Server, which comprises (1) Authentication and Authorization, (2) Data Manager, and (3) Analysis and Tool Manager, and to GenomeSpace project data and external data sources and tools.]

So what IS GenomeSpace:

Server component
 - Authentication layer to manage credentials and provide single sign-on
 - Analysis/tools manager: manages tools, their parameters, and the analyses they provide
 - Data manager: manages data repositories and transparent file format conversions
These are designed to run in the cloud.

API component
 - Client developers’ kit
 - RESTful interface
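One way to picture "transparent file format conversions" is as a registry of pairwise converters that can be chained. The sketch below is an assumption for illustration only: the talk does not say whether the Data Manager chains converters, the registered pairs are hypothetical, and `find_conversion_path` is an invented name rather than a GenomeSpace API.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple


def find_conversion_path(
    converters: Set[Tuple[str, str]], source: str, target: str
) -> Optional[List[str]]:
    """Breadth-first search for the shortest chain of formats from source to target."""
    if source == target:
        return [source]
    graph: Dict[str, List[str]] = {}
    for src, dst in sorted(converters):
        graph.setdefault(src, []).append(dst)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        for nxt in graph.get(path[-1], []):
            if nxt in seen:
                continue
            if nxt == target:
                return path + [nxt]
            seen.add(nxt)
            queue.append(path + [nxt])
    return None  # no chain of registered converters reaches the target


# Hypothetical registry: each pair means "a converter from the first format
# to the second exists". Format names are borrowed from the slides; the pairs are invented.
HYPOTHETICAL_CONVERTERS = {
    ("gxp", "gct"),
    ("gct", "gxp"),
    ("gct", "odf"),
    ("odf", "grp"),
}

if __name__ == "__main__":
    print(find_conversion_path(HYPOTHETICAL_CONVERTERS, "gxp", "grp"))
```

Under this design, a contributed converter only has to cover one pair of formats to make every format reachable from either end available to the other.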

11 Seed Tools

Cytoscape, Galaxy, GenePattern, Genomica, IGV, UCSC Genome Browser

Seed tools span a wide variety of genomic analysis areas:
Network analysis and visualization
Sequence analysis
Functional genomics analysis
Module map analysis
Integrated genomics and sequence visualization

12 New Tools

Recently added and in development:
InSilicoDB (University of Brussels)
Cistrome (Dana-Farber Cancer Institute)
geWorkbench (Columbia University)
Reactome (Ontario Institute for Cancer Research)
ArrayExpress (EBI)

13 Using GenomeSpace

[Screenshot labels: Tools and Data Sources; Actions; Cloud-based filesystem]

14 GenomeSpace Actions

15 GenomeSpace Tool Enablement: IGV

16 GenomeSpace Tool Enablement: GenePattern

17 GenomeSpace Tool Enablement: GenePattern

View GenomeSpace files from within GenePattern
Use a file from GenomeSpace in a GenePattern analysis
Go to the GenomeSpace UI

18 GenomeSpace Data Source Integration: InSilico DB

This is a way to get GEO datasets into any tool that will accept gene expression files. You also don’t need to convert from MAGE-TAB. We are also working with the developers of ArrayExpress to GenomeSpace-enable their repository.

19 Other collaborating projects

Taverna/MyExperiment (University of Manchester)
National Center for Biomedical Ontology (Stanford University)

20 DBP3: Studying the regulatory control of human hematopoiesis

Let’s show you how it works. One of the scientific collaborators, Aviv Regev, recently published work on the regulatory control of human hematopoiesis. For this GenomeSpace demonstration, we will show a simplified version of part of this work.

21 DBP3: Studying the regulatory control of human hematopoiesis – Overview

[Diagram: the analytic workflow in 5 parts. Part 1: Data pre-processing and quality control. Part 2: Basic analysis. Part 3: Studying the transcriptional program. Part 4: cis-regulatory site analysis. Part 5: Finding new transcription factor regulators. Legend: analysis section; analysis step (# steps), colored by tool (Genomica, GenePattern, IGV, Cytoscape, Galaxy) or grey for a manual step; analysis conclusion; optional choices; currently integrated; not yet integrated.]

This is an overview of the analytic workflow of the work done by Aviv and colleagues. This is months of research condensed to one slide. The analysis can be divided into 5 major parts. The first part comprises data preprocessing and quality control, the second part is a basic analysis, the third part comprises studying the transcriptional program, part four is a cis-regulatory analysis, and part five aims to find new transcription factor regulators.

Each circle represents work done in one of the seed tools, which you can see here, colored by the identity of the tool. Circles colored in grey represent manual steps. They are manual because the functionality doesn't exist in any of the tools. During this demo, we will mention when we have to do such a manual step. The number inside each circle represents the number of steps that are performed within the tool. The arrows represent the different paths that can be followed. At certain points, indicated by the blue boxes, there are alternative optional paths that can be followed.

We will now zoom in on part 5 in this demo, which is on finding new transcription factor regulators.

22 Part 5: Finding new transcription factor regulators

[Diagram: the part 5 workflow, entered from part 4, with Steps 1 to 4.]

What we’re actually doing in this demo is the following. We start from a very broad set of transcription factors. We then find those transcription factors that are significantly differentially expressed in a particular lineage as compared to the other lineages. These we’ll call our lineage-dependent transcription factors. We then infer regulatory programs using Module Networks from Genomica. Regulatory programs consist of modules of coexpressed genes that themselves are coregulated. These regulators we then visualize and validate using previously published GWAS SNPs and linkage regions.

We’re now going to walk you through the steps in this demo. We first start in Genomica, and there we load the full expression data, containing more than 200 samples and about 8000 genes, together with a gene set that contains the GO transcription factors. The blue squares next to the genes indicate which genes are GO transcription factors. In Genomica we can then save the expression data from only the GO transcription factors into a text file.

In GenePattern we want to refine the GO transcription factors to a set of lineage-specific transcription factors. We do this using the ComparativeMarkerSelection and ExtractComparativeMarkerResults modules in GenePattern. So we basically use a t-test to assess differential expression of transcription factors in each lineage versus the rest of the lineages, and we can use criteria such as an FDR < 0.05 to select significantly differentially expressed transcription factors. As an example, we continue with the transcription factors that are specific to the hematopoietic stem cell (HSC) lineage.

Back in Genomica, we run Module Networks on the full expression data, while also loading the lineage-specific transcription factors we just generated in GenePattern. As running Module Networks usually takes a while, we now immediately load our previously saved results. In this case, we have 80 modules of coexpressed genes for which we can generate a list of the potential regulators. This list allows us to browse very easily through the results. Here you see a module that has SMAD4 as its top regulator. SMAD4 divides the sample space into 2 partitions, while other regulators further subdivide the samples.

As a last step, we want to visualize and validate our list of potential regulators. To do that, we follow 2 approaches in parallel. Both approaches require us to manually create a .bed annotation track that contains the coordinates of the regulators. To validate our regulators, we want to find overlaps between our regulators and previously published GWAS SNPs and linkage regions. We download these SNPs and linkage regions, and we manually create .bed annotation tracks for both of them.

The first approach is to visualize our regulators in IGV. This requires us to upload the 3 .bed annotation tracks in IGV. In parallel, we submit the same 3 .bed annotation tracks to Galaxy, where we can do overlap analysis between regulators, SNPs, and linkage regions in several steps. From this analysis, we get an Excel file that displays the overlaps in table format. Interesting also is that this table contains the PubMed IDs of the papers in which a particular GWAS SNP has been published in relation to a particular disease.
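The overlap analysis between the .bed tracks, which the walkthrough runs in Galaxy, can be sketched in a few lines. This is a minimal illustration, not the Galaxy workflow itself: `parse_bed` and `find_overlaps` are hypothetical helpers, and the track contents are made-up placeholder coordinates rather than real regulator or SNP positions. It relies on the BED convention that starts are 0-based and ends are exclusive.

```python
from typing import List, NamedTuple, Tuple


class BedInterval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive (BED convention)
    name: str


def parse_bed(text: str) -> List[BedInterval]:
    """Parse the first four BED columns, skipping blank, comment, and header lines."""
    intervals = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()  # assumes names contain no whitespace
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        intervals.append(BedInterval(chrom, start, end, name))
    return intervals


def find_overlaps(a: List[BedInterval], b: List[BedInterval]) -> List[Tuple[str, str]]:
    """Return (name_a, name_b) for every pair sharing at least one base."""
    return [
        (x.name, y.name)
        for x in a
        for y in b
        if x.chrom == y.chrom and x.start < y.end and y.start < x.end
    ]


# Made-up placeholder tracks, for illustration only.
REGULATORS = "track name=regulators\nchr1\t100\t200\tregulator_A\nchr2\t500\t900\tregulator_B\n"
SNPS = "chr1\t150\t151\tsnp_1\nchr1\t200\t201\tsnp_2\nchr2\t899\t900\tsnp_3\n"

if __name__ == "__main__":
    for pair in find_overlaps(parse_bed(REGULATORS), parse_bed(SNPS)):
        print(pair)
```

The pairwise comparison is quadratic and only suitable for short lists like a few dozen regulators; sorted or indexed interval structures would be the usual choice for genome-scale tracks.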

23 Step 1: Create transcription factor dataset in Genomica and save to GenomeSpace

To launch desktop-based tools, we first get the prompt that we’re downloading the application in JNLP format.

24 Step 2: Send transcription factor datasets into GenePattern

25 Step 2: Perform differential expression analysis in GenePattern

NOTE: The file sent to GenePattern is still a Genomica-formatted file. We specify in the URL what format we want to convert the file to.
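The selection criterion used in this step (keep transcription factors with FDR < 0.05) can be illustrated with the Benjamini-Hochberg procedure. This is a sketch under assumptions: the talk does not say which FDR method the GenePattern modules apply, the p-values below are invented, and `benjamini_hochberg` and `select_significant` are hypothetical helper names.

```python
from typing import Dict, List


def benjamini_hochberg(pvalues: Dict[str, float]) -> Dict[str, float]:
    """Return Benjamini-Hochberg adjusted p-values keyed by feature name."""
    m = len(pvalues)
    ranked = sorted(pvalues.items(), key=lambda item: item[1])
    adjusted: Dict[str, float] = {}
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        name, p = ranked[rank - 1]
        running_min = min(running_min, p * m / rank)
        adjusted[name] = running_min
    return adjusted


def select_significant(pvalues: Dict[str, float], fdr_cutoff: float = 0.05) -> List[str]:
    """Names whose adjusted p-value falls below the cutoff, sorted alphabetically."""
    adjusted = benjamini_hochberg(pvalues)
    return sorted(name for name, q in adjusted.items() if q < fdr_cutoff)


if __name__ == "__main__":
    # Invented p-values for four placeholder transcription factors.
    example = {"TF_A": 0.001, "TF_B": 0.01, "TF_C": 0.03, "TF_D": 0.5}
    print(select_significant(example))
```

In the demo, the input p-values would come from the per-lineage t-tests, one lineage against the rest.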

26 Step 3: Send differentially expressed genes to Genomica

We send this file directly from the GenePattern interface to Genomica, now converting it back to a Genomica-formatted file.
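Both this transfer and the one in step 2 rely on naming the desired format in the URL used to fetch the file. A rough sketch of building such a URL is shown here; the host, path, file name, format labels, and the `dataformat` query parameter are all assumptions made for illustration, not details given in the talk, and `with_target_format` is a hypothetical helper.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_target_format(file_url: str, target_format: str, param: str = "dataformat") -> str:
    """Return file_url with a query parameter naming the format to convert to.

    The parameter name is a placeholder; any existing value for it is replaced.
    """
    parts = urlsplit(file_url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, target_format))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment)
    )


if __name__ == "__main__":
    # Hypothetical data-manager URL for a file saved from Genomica.
    print(with_target_format("https://dm.example.org/files/tf_expression.tab", "gct"))
```

The receiving tool keeps a single link to the stored file, and the conversion happens on the server side when that link is resolved, which is what lets tools exchange files without knowing each other's formats.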

27 Step 3: Perform module network analysis in Genomica

We now see Genomica open with the data file, and we perform a module map analysis on this data to determine regulators of module networks. Since we’ll need some other data for the next step, we save the results to GenomeSpace.

28 Step 4: Visualize regulators with known SNPs and linkage regions

From the GenomeSpace UI, we’re going to send these files now to IGV.

29 Step 4: Visualize regulators with known SNPs and linkage regions

We now see IGV open with our tracks. What’s under the covers…

30 Deployment Architecture and APIs

[Diagram labels: Amazon; GS clients (GS UI, IGV, Genomica, GenePattern, Galaxy, Cytoscape, UCSC); CDK; REST; clustered Identity Service (OpenID); clustered Analysis Task Manager (ATM); Data Manager (DM); SimpleDB; provenance; S3; file transfers; external data sources (e.g., ArrayExpress).]

API component
 - Client developers’ kit
 - RESTful interface
 - Provides connections to other GenomeSpace-enabled tools as well as the ATM and DM

31 Join the GenomeSpace community

Researchers with biological projects
Developers: add your tools; contribute format converters; build new infrastructure
Data portals and repositories: link your resources

The beta release is available for use!

A few brave men and women (explorers)

32 Acknowledgements

Ted Liefeld, Helga Thorvaldsdottir, Jim Robinson, Marco Ocana, Eliot Polk
Jill Mesirov, PI

GenomeSpace Collaborators
Cytoscape: Trey Ideker Lab, UCSD
Galaxy: Anton Nekrutenko Lab, Penn State University
Genomica: Eran Segal Lab, Weizmann Institute
UCSC Browser Team
GenePattern Team
IGV Team

Driving Biological Projects
Howard Chang Lab, Stanford University
Aviv Regev Lab, Broad Institute

Funding