Computation codes for article "A whole-genome reference panel of 14,393 individuals for East Asian populations accelerates discovery of rare functional variants"
"A whole-genome reference panel of 14,393 individuals for East Asian populations accelerates discovery of rare functional variants"
currently in submission.
ABSTRACT
Underrepresentation of non-European populations hinders growth of global precision medicine. Resources such as imputation reference panels that match the study population are necessary to find low-frequency variants with significant effects. We created a reference panel consisting of 14,393 whole-genome sequences including more than 11,000 Asian individuals. Genome-wide association studies were conducted using the new reference panel and a population-specific genotype array of 72K for 8 phenotypes. The newly constructed panel yields improved imputation accuracy of rare and low-frequency variants within East Asian populations compared with the largest reference panel (TOPMed). Thirty-eight novel associations were found, and more than half of the variants were East Asian specific. We discovered genes with rare protein-altering variants, including LTBP1 for height and GPR75 for body-mass-index, as well as putative regulatory mechanisms for rare noncoding variants with cell-type-specific effects. We suggest this data set will add to the potential value of Asian precision medicine.
Below we provide the computational codes used in this analysis.
You can get more information for each step by visiting the corresponding tool's webpage.
For additional information, please contact gmi{four}184{one}8@gmail.com
Eagle, Beagle and SHAPEIT: Pre-phasing
1) Eagle v2.4.1
2) Beagle v5.0
3) SHAPEIT v2 (r904)
Minimac4 (v1.0.2): Imputation
plink2: Association analysis
https://www.cog-genomics.org/plink/2.0/
LDSC: used to calculate heritability and genomic inflation lambda
https://github.com/bulik/ldsc/wiki/LD-Score-Estimation-Tutorial
https://github.com/bulik/ldsc/wiki/Heritability-and-Genetic-Correlation
gcta-cojo: independent analysis
https://yanglab.westlake.edu.cn/software/gcta/#COJO
magma: gene based analysis
https://ctg.cncr.nl/software/magma
HyPrColoc: colocalization analysis
https://github.com/jrs95/hyprcoloc
mashR: calculate pairwise sharing
https://stephenslab.github.io/mashr/articles/intro_mash.html
https://stephenslab.github.io/mashr/articles/intro_mash_dd.html
SuSiE: finemapping- polyfun implemented
https://github.com/omerwe/polyfun/wiki/1.-Computing-prior-causal-probabilities-with-PolyFun#1-create-a-munged-summary-statistics-file-in-a-polyfun-friendly-parquet-format-1
https://github.com/omerwe/polyfun/wiki/3.-Functionally-informed-fine-mapping-with-finemapper
fgwas: enrichment analysis
https://github.com/joepickrell/fgwas/blob/master/man/fgwas_manual.pdf
Epigenetic annotation was performed with an in-house script to annotate each variant.
we provide each link for the database we used.
ABC-model
http://yed.ucsd.edu:8787/ABC_scores/
ATAC-peak
http://yed.ucsd.edu:8787/Peaks/
epimap summary
http://compbio.mit.edu/epimap/
https://www.meuleman.org/research/dhsindex/
https://personal.broadinstitute.org/cboix/epimap/links/pergroup/
ENCODE, SCREEN
https://screen.encodeproject.org/
GTEx
https://gtexportal.org/home/datasets
deltaSVM
https://www.beerlab.org/deltasvm/
JASPER
https://jaspar.genereg.net/downloads/
https://genome.ucsc.edu/cgi-bin/hgTables
currently in submission.
ABSTRACT
Underrepresentation of non-European populations hinders growth of global precision medicine. Resources such as imputation reference panels that match the study population are necessary to find low-frequency variants with significant effects. We created a reference panel consisting of 14,393 whole-genome sequences including more than 11,000 Asian individuals. Genome-wide association studies were conducted using the new reference panel and a population-specific genotype array of 72K for 8 phenotypes. The newly constructed panel yields improved imputation accuracy of rare and low-frequency variants within East Asian populations compared with the largest reference panel (TOPMed). Thirty-eight novel associations were found, and more than half of the variants were East Asian specific. We discovered genes with rare protein-altering variants, including LTBP1 for height and GPR75 for body-mass-index, as well as putative regulatory mechanisms for rare noncoding variants with cell-type-specific effects. We suggest this data set will add to the potential value of Asian precision medicine.
Below we provide the computational codes used in this analysis.
You can get more information for each step by visiting the corresponding tool's webpage.
For additional information, please contact gmi{four}184{one}8@gmail.com
Eagle, Beagle and SHAPEIT: Pre-phasing
1) Eagle v2.4.1
Eagle --vcf infile --geneticMapFile eagle.map.file --outPrefix output --allowRefAltSwap --vcfOutFormat z --outputUnphased
2) Beagle v5.0
Beagle gt=infile map=plink.map out=output
3) SHAPEIT v2 (r904)
Shapeit2 --input-vcf infile -M genetic_map_hg19 -O output --force
Shapeit2 -convert --input-haps haps --output-vcf output.vcf
Minimac4 (v1.0.2): Imputation
Minimac4 --minRatio 0.00001 --allTypedSites --ignoreDuplicates --meta --refHaps panel.m3vcf.gz --haps input.vcf.gz --prefix output
plink2: Association analysis
https://www.cog-genomics.org/plink/2.0/
plink2 --assoc --bfile plink1.bfile --out GWAS --pheno phenotype.txt
LDSC: used to calculate heritability and genomic inflation lambda
https://github.com/bulik/ldsc/wiki/LD-Score-Estimation-Tutorial
https://github.com/bulik/ldsc/wiki/Heritability-and-Genetic-Correlation
# create LD score
python ldsc.py --bfile --l2 --ld-wind-cm 1 --out chrom1-22
# gwas summary statistics
python munge_sumstats.py --sumstats GWAS.summary.tsv --out GWAS --N 72288 --a1-inc --info-min 0.3 --maf-min 0.0005
# compute h2
python ldsc.py --h2 GWAS.sumstats.gz --ref-ld-chr KNIH.NARD --w-ld-chr KNIH.NARD --out h2.GWAS
gcta-cojo: independent analysis
https://yanglab.westlake.edu.cn/software/gcta/#COJO
gcta64 --bfile plink.bfile --chr chrom1-22 --cojo-file cojo.input.tsv --cojo-slct --cojo-p 5e-8 --out cojo.out
magma: gene based analysis
https://ctg.cncr.nl/software/magma
# create annotation
magma --annotate --snp-loc plink.coding.bfile.bim --gene-loc gene.loc --out genes.annot
# run gene based analysis
magma --bfile plink.coding.bfile --gene-annot genes.annot --pheno pheno.tsv --gene-model multi --genes-only --burden 0.0005 --gene-settings snp-max-maf=0.05 --out magma.gene
HyPrColoc: colocalization analysis
https://github.com/jrs95/hyprcoloc
#binary.outcomes option: 1 for DM and HTN otherwise 0 for other phenotypes
hyprcoloc(effect.est = betas, effect.se = ses, binary.outcomes = c(0,0,1,0,0,1,0,0), trait.names=traits, snp.id=snp.id)
mashR: calculate pairwise sharing
https://stephenslab.github.io/mashr/articles/intro_mash.html
https://stephenslab.github.io/mashr/articles/intro_mash_dd.html
m.1by1 = mash_1by1(data)
strong = get_significant_results(m.1by1,0.05)
U.pca = cov_pca(data,5,subset=strong)
U.ed = cov_ed(data, U.pca, subset=strong)
m.ed = mash(data, U.ed)
med.loglik=get_loglik(m.ed)
med.sharing=get_pairwise_sharing(m.ed,factor=0.5)
SuSiE: finemapping- polyfun implemented
https://github.com/omerwe/polyfun/wiki/1.-Computing-prior-causal-probabilities-with-PolyFun#1-create-a-munged-summary-statistics-file-in-a-polyfun-friendly-parquet-format-1
https://github.com/omerwe/polyfun/wiki/3.-Functionally-informed-fine-mapping-with-finemapper
# create summary statistics
python munge_polyfun_sumstats.py --sumstats GWAS.summary.tsv --out GWAS.sumstats.parquet --n 72298 --min-info 0.3 --min-maf 0.0005 --keep-hla
# SuSiE fine-mapping per loci of interest
python finemapper.py --geno plink2.bfile --sumstats GWAS.sumstats.parquet --n 72298 --chr chr1-22 --start start_bp --end end_bp --method susie --max-num-causal 10 --cache-dir cache --out finemap.tsv --non-funct --allow-missing
fgwas: enrichment analysis
https://github.com/joepickrell/fgwas/blob/master/man/fgwas_manual.pdf
fgwas -i input -o output -w ATAC.bed -fine
Epigenetic annotation was performed with an in-house script to annotate each variant.
we provide each link for the database we used.
ABC-model
http://yed.ucsd.edu:8787/ABC_scores/
ATAC-peak
http://yed.ucsd.edu:8787/Peaks/
epimap summary
http://compbio.mit.edu/epimap/
https://www.meuleman.org/research/dhsindex/
https://personal.broadinstitute.org/cboix/epimap/links/pergroup/
ENCODE, SCREEN
https://screen.encodeproject.org/
GTEx
https://gtexportal.org/home/datasets
deltaSVM
https://www.beerlab.org/deltasvm/
JASPER
https://jaspar.genereg.net/downloads/
https://genome.ucsc.edu/cgi-bin/hgTables