Our paper on “Integrative approaches for large-scale transcriptome-wide association studies” is now out in Nature Genetics, and I’m very proud of all the work that the many people involved put into it. Though the paper focuses on gene expression, I believe it addresses a broader challenge: integrating complex data with disease GWAS. Our goal was to understand and quantify the transcriptional component of disease: which genes harbor mutations that affect expression, which in turn affects phenotype. Ideally we would do this using a large cohort with measured genetics/SNPs, phenotype (e.g., BMI), and rich expression data (e.g., gene expression painstakingly collected from relevant adipose tissue). With these pieces in hand, we could associate expression with BMI to find potentially causal genes, or look at fancier models involving genetic correlation/mediation to isolate the shared genetics. Unfortunately, what we typically have is SNPs and expression measured in a small study and, separately, SNPs and phenotype measured in a large GWAS for which only summary data are available. Our paper proposes a solution based on two key insights:
- We can leverage the fact that SNPs are observed in both studies and predict expression from one study into the other. In practice, if we restrict to the cis locus this prediction is very accurate, and we can essentially work with the predicted expression as if we had measured it directly. This is in sharp contrast to genome-wide disease prediction, where hundreds of thousands of samples are often still not enough to get traction.
- In the special case where this is a linear predictor (that is, a sum of SNPs multiplied by weights), we can additionally use the fact that the relationship between SNPs (LD) is well estimated in reference panels. This lets us infer what the predicted expression-phenotype association would be using only the separately estimated relationships between (a) expression-SNP, (b) SNP-SNP/LD, and (c) SNP-phenotype. This sounds intuitive, but it’s a powerful concept that allows one to do many very useful things with just summary data and LD (including estimating heritability, performing conditional analysis, and imputing untyped variants).
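As a rough illustration of the second insight, here is a minimal sketch of how the three ingredients can be combined into a single gene-level Z-score, Z = w'z / sqrt(w' Σ w). It assumes the expression weights `w` apply to standardized genotypes, the GWAS Z-scores `z` are for the same SNPs in the same order and allele orientation, and `ld` is the SNP-SNP correlation matrix from a reference panel. The function name and the toy numbers are hypothetical and are not taken from the paper or its software.

```python
import numpy as np

def twas_z(weights, gwas_z, ld):
    """Gene-level Z-score from summary data (illustrative sketch).

    weights : cis-SNP weights of a linear expression predictor
              (assumed to apply to standardized genotypes)
    gwas_z  : GWAS Z-scores for the same SNPs, same order and alleles
    ld      : SNP-SNP correlation matrix from a reference panel
    """
    w = np.asarray(weights, dtype=float)
    z = np.asarray(gwas_z, dtype=float)
    sigma = np.asarray(ld, dtype=float)

    if sigma.shape != (w.size, w.size) or z.shape != w.shape:
        raise ValueError("weights, gwas_z and ld must describe the same SNPs")

    # Variance of the weighted sum of Z-scores under the null, given LD.
    var = w @ sigma @ w
    if var <= 0:
        raise ValueError("predictor has zero variance (all weights zero?)")

    # Weighted sum of SNP-phenotype Z-scores, scaled to unit variance.
    return (w @ z) / np.sqrt(var)

# Toy example with three cis SNPs (made-up numbers).
toy_weights = [0.5, 0.0, -0.2]
toy_gwas_z = [3.0, 2.5, -1.0]
toy_ld = [[1.0, 0.8, 0.0],
          [0.8, 1.0, 0.0],
          [0.0, 0.0, 1.0]]

print(twas_z(toy_weights, toy_gwas_z, toy_ld))
```

Note that no individual-level genotypes, expression, or phenotypes enter the calculation; a real analysis would also need to handle allele flipping, missing SNPs, and regularization of a noisy LD estimate, all of which this sketch leaves out.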
We call this a TWAS (transcriptome-wide association study), emphasizing the fact that it identifies expression-disease associations at the scale of the largest GWAS. This approach is conceptually very appealing because the model always aggregates the cis effects into a single unit corresponding to the gene. So whether we are fine-mapping known regions associated with disease or looking for novel associations, we always identify likely genes. This is quite different from a GWAS, which picks out a set of SNPs (where the mechanism is often ambiguous), or from eQTL-based analyses, which require ad hoc decisions on which eQTLs to select and how to overlap them with the disease. Of course, we also show that the method is substantially more powerful than other approaches when the model assumptions are met, so there’s practical relevance here: you find new associations, and those new associations have biological meaning. There’s a lot more going on in the paper that I hope readers find interesting, but the big takeaway is that we can use computational tools to get at the biological information we wish we had, even when it’s scattered across different, seemingly isolated datasets.