Welcome to this tutorial where we perform finemapping on a small GWAS dataset that is included with this package. Try to reproduce this analysis to ensure everything is installed properly.

We begin by attaching our bigsnpr object. This is our reference panel of genotypes which we use to compute LD between SNPs. If you are in the He lab, you can just use the following chunk. Otherwise, you will need to obtain your own genotypes in PLINK format (bed/bim/fam) and use bigsnpr::readBed() to create the .rds file below.

bigSNP <- bigsnpr::snp_attach(rdsfile = '/project2/xinhe/1kg/bigsnpr/EUR_variable_1kg.rds')

Here we show a quick run-through of finemappeR using a test summary statistics.

First we run RunCleaner(), which takes our summary statistics in either tab or comma delimited. It also takes a character vector of the 8 column names needed. The columns needed have to be given in the following order:

  • chromosome (just the number, no “chr”, hg19!)
  • position (base pair position, hg19!)
  • beta (if you have Odds Ratio, you will need to transform it to log(Odds Ratio))
  • standard error (SE)
  • reference allele (A,C,T,G)
  • association/effect allele (A,C,T,G)
  • some kind of SNP ID (rsID, or chr:position:a1:a2)
  • p-value
library(finemappeR)

fpath <- system.file("extdata", "test_sumstats.txt.gz", package="finemappeR")
gwas <- RunCleaner(sumstats = fpath,
                   ColsToKeep = c('chr','position_b37','bcac_onco2_beta','bcac_onco2_se','a0','a1','phase3_1kg_id','bcac_onco2_P1df_Wald'),
                   bigSNP = bigSNP)
## [1] "Loading summary statistics..."
## Observations: 9,999
## Variables: 14
## chr [5]: var_name, phase3_1kg_id, a0, a1, rsIDs
## dbl [9]: chr, position_b37, bcac_onco2_r2, bcac_onco2_eaf_controls, bcac_onco2_beta...
## 
## Call `spec()` for a copy-pastable column specification
## Specify the column types with `col_types` to quiet this message
## [1] "Cleaning summary statistics.."
## [1] "Matching to reference panel..."
## 8,961 variants to be matched.
## 1,255 ambiguous SNPs have been removed.
## 7,600 variants have been matched; 0 were flipped and 6,374 were reversed.
## [1] "Assining SNPs to LD blocks..."
## [1] "Complete."

Check that the output is cleaned and has appropriate columns:

gwas[,c('chr','pos','beta','se','snp','pval','zscore')]
## # A tibble: 7,600 x 7
##      chr   pos     beta     se snp                       pval   zscore
##    <int> <int>    <dbl>  <dbl> <chr>                    <dbl>    <dbl>
##  1     1 13110 -0.0207  0.0323 1:13110:G:A           0.522    -0.639  
##  2     1 13116 -0.0126  0.0190 rs201725126:13116:T:G 0.508    -0.662  
##  3     1 13118 -0.0126  0.0190 rs200579949:13118:A:G 0.508    -0.662  
##  4     1 13550  0.132   0.112  1:13550:G:A           0.241     1.17   
##  5     1 14604  0.0222  0.0186 1:14604:A:G           0.233     1.19   
##  6     1 14930  0.00247 0.0141 rs75454623:14930:A:G  0.860     0.176  
##  7     1 14933 -0.0199  0.0388 rs199856693:14933:G:A 0.607    -0.514  
##  8     1 15211  0.00002 0.0162 rs78601809:15211:T:G  0.999     0.00123
##  9     1 15644  0.364   0.104  1:15644:G:A           0.000484  3.49   
## 10     1 15774  0.265   0.0935 rs374029747:15774:G:A 0.00463   2.83   
## # ... with 7,590 more rows

Next, we must perform enrichment analysis with TORUS, which requires annotation files in .bed format. Use the PrepareTorusFiles() function, which takes the previous output, and a directory with bed files. Your bed file should contain only 3 columns (chr, start, end) and chromosomes should be just the number (no “chr”). Currently only hg19/b37 coordinates are accepted. You will get wrong results if you use hg38/other.

bed_annotations = system.file("extdata", "test_bed_dir/", package="finemappeR")
torus.files <- PrepareTorusFiles(gwas, bed_annotations = bed_annotations)
## [1] "Annotating Snps.."
## [1] "Writing files to temporary location.."
## [1] "Done."

Now that the appropriate files have been generated (they are in .temp/), lets run TORUS. RunTorus() returns two things: enrichment level of each annotation in log2 units, and the posterior inclusion probability (PIP) of each SNP. These PIPs will be used as priors for finemapping.

torus.result <- RunTorus(torus_annot_file = torus.files[[1]], torus_zscore_file = torus.files[[2]])
## Running /tmp/Rtmp8OB23t/temp_libpathc39132959580/finemappeR/torus -d \
##   /tmp/RtmpV78IY8/filec81933ed5a96.txt.gz -annot \
##   /tmp/RtmpV78IY8/filec819b0e664c.txt.gz --load_zval -dump_prior prior
## Read in 2 loci, 7600 locus-SNP pairs ... 
## Loading annotations ... 
## Initializing ... 
## Starting EM ... 
##   Iter          loglik          Intercept    testbed.bed.1   testbed2.bed.1  
##    1            -5.143           -9.147      0.404      0.681  
##    2            -0.647          -10.368      0.809      1.100  
##    3            -0.252          -11.365      1.213      1.469  
##    4            -0.128          -12.268      1.618      1.818  
##    5            -0.072          -13.122      2.022      2.156  
## Output priors to directory prior...
## 
##                 Intercept    -13.122       -13.988    -12.256
##             testbed.bed.1      0.002        -1.091      1.094
##            testbed2.bed.1      0.002        -1.076      1.080

TORUS also gives us the uncertainty associated with whether each chunk contains a causal variant or not. We run RunTorusFDR() to get the probability of each chunk containing a causal variant. We can filter our chunks with low probability, thus saving us on computational time!

torus.fdr <- RunTorusFDR()
## Running /tmp/Rtmp8OB23t/temp_libpathc39132959580/finemappeR/torus -d \
##   .temp/torus_zscores.txt.gz -annot .temp/torus_annotations.txt.gz \
##   --load_zval -qtl
## Read in 2 loci, 7600 locus-SNP pairs ... 
## Loading annotations ... 
## Initializing ... 
## Starting EM ... 
##   Iter          loglik          Intercept    testbed.bed.1   testbed2.bed.1  
##    1            -5.143           -9.147      0.404      0.681  
##    2            -0.647          -10.368      0.809      1.100  
##    3            -0.252          -11.365      1.213      1.469  
##    4            -0.128          -12.268      1.618      1.818  
##    5            -0.072          -13.122      2.022      2.156  
## 
## 
##     Total Loci: 2   Rejections: 0
##     1                     1    9.510e-01    0
##     2                     2    9.714e-01    0

Lets add the TORUS results back onto our cleaned summary statistics. PrepareSusieData() takes a parameter called fdr_thresh which is the FDR associated with each chunk. This value is 10% by default, and chunks with FDR < 10% are removed. I set it to 1 here just to keep all chunks but you will want to lower this or just use default.

susie.df <- PrepareSusieData(sumstats = gwas, torus_pip = torus.result$snp_pip, torus_fdr = torus.fdr, fdr_thresh = 1)

We see we have a new column called torus_pip

susie.df[,c('chr','pos','beta','se','snp','pval','zscore','torus_pip')]
## # A tibble: 7,600 x 8
##      chr   pos     beta     se snp                  pval   zscore torus_pip
##    <int> <int>    <dbl>  <dbl> <chr>               <dbl>    <dbl>     <dbl>
##  1     1 13110 -0.0207  0.0323 1:13110:G:A      0.522    -0.639     2.00e-6
##  2     1 13116 -0.0126  0.0190 rs201725126:131~ 0.508    -0.662     1.51e-5
##  3     1 13118 -0.0126  0.0190 rs200579949:131~ 0.508    -0.662     1.51e-5
##  4     1 13550  0.132   0.112  1:13550:G:A      0.241     1.17      1.51e-5
##  5     1 14604  0.0222  0.0186 1:14604:A:G      0.233     1.19      1.51e-5
##  6     1 14930  0.00247 0.0141 rs75454623:1493~ 0.860     0.176     1.51e-5
##  7     1 14933 -0.0199  0.0388 rs199856693:149~ 0.607    -0.514     1.51e-5
##  8     1 15211  0.00002 0.0162 rs78601809:1521~ 0.999     0.00123   1.51e-5
##  9     1 15644  0.364   0.104  1:15644:G:A      0.000484  3.49      1.51e-5
## 10     1 15774  0.265   0.0935 rs374029747:157~ 0.00463   2.83      1.51e-5
## # ... with 7,590 more rows

With this data frame, we can perform finemapping. We use RunFinemapping() to get out results. This will be quite slow if chunks contain many SNPs( O(n^2) where n is # of SNPs). I am working on ways to parallize this step (maybe you can help!)

susie_finemap_L1 <- RunFinemapping(sumstats = susie.df, bigSNP = bigSNP)
## [1] "Finemapping chunks.. 1 of 2"
## [1] "Finemapping chunks.. 2 of 2"

susie_finemap_L1 is a list of SuSiE results, one for each chunk/LD block. Usually we are just interested in the SuSiE PIP, which gives the probability of a SNP being causal. We can annotate our cleaned summary statistics with this information using merge_susie_sumstats()

gwas_finemapped <- merge_susie_sumstats(susie_results = susie_finemap_L1, sumstats = susie.df)

Lets look at the final cleaned, and finemapped summary statistics. We see we have a new column called susie_pip which is the probability of being causal. Note that we ran SuSiE with L = 1 here, meaning we assumed there is at most 1 causal variant per SNP.

gwas_finemapped[,c('chr','pos','beta','se','snp','pval','zscore','torus_pip','susie_pip')]
## # A tibble: 7,600 x 9
##      chr   pos     beta     se snp        pval   zscore torus_pip susie_pip
##    <int> <int>    <dbl>  <dbl> <chr>     <dbl>    <dbl>     <dbl>     <dbl>
##  1     1 13110 -0.0207  0.0323 1:1311~ 5.22e-1 -0.639     2.00e-6         0
##  2     1 13116 -0.0126  0.0190 rs2017~ 5.08e-1 -0.662     1.51e-5         0
##  3     1 13118 -0.0126  0.0190 rs2005~ 5.08e-1 -0.662     1.51e-5         0
##  4     1 13550  0.132   0.112  1:1355~ 2.41e-1  1.17      1.51e-5         0
##  5     1 14604  0.0222  0.0186 1:1460~ 2.33e-1  1.19      1.51e-5         0
##  6     1 14930  0.00247 0.0141 rs7545~ 8.60e-1  0.176     1.51e-5         0
##  7     1 14933 -0.0199  0.0388 rs1998~ 6.07e-1 -0.514     1.51e-5         0
##  8     1 15211  0.00002 0.0162 rs7860~ 9.99e-1  0.00123   1.51e-5         0
##  9     1 15644  0.364   0.104  1:1564~ 4.84e-4  3.49      1.51e-5         0
## 10     1 15774  0.265   0.0935 rs3740~ 4.63e-3  2.83      1.51e-5         0
## # ... with 7,590 more rows