Computation codes for article "A whole-genome reference panel of 14,393 individuals for East Asian populations accelerates discovery of rare functional variants"

2023-03-23l 조회수 1058
"A whole-genome reference panel of 14,393 individuals for East Asian populations accelerates discovery of rare functional variants"
currently in submission.

ABSTRACT
Underrepresentation of non-European populations hinders growth of global precision medicine. Resources such as imputation reference panels that match the study population are necessary to find low-frequency variants with significant effects. We created a reference panel consisting of 14,393 whole-genome sequences including more than 11,000 Asian individuals. Genome-wide association studies were conducted using the new reference panel and a population-specific genotype array of 72K for 8 phenotypes. The newly constructed panel yields improved imputation accuracy of rare and low-frequency variants within East Asian populations compared with the largest reference panel (TOPMed). Thirty-eight novel associations were found, and more than half of the variants were East Asian specific. We discovered genes with rare protein-altering variants, including LTBP1 for height and GPR75 for body-mass-index, as well as putative regulatory mechanisms for rare noncoding variants with cell-type-specific effects. We suggest this data set will add to the potential value of Asian precision medicine.

Below we provide the computational codes used in this analysis.
You can get more information for each step by visiting the corresponding tool's webpage.

For additional information, please contact gmi{four}184{one}8@gmail.com

Eagle, Beagle and SHAPEIT: Pre-phasing
1) Eagle v2.4.1
Eagle --vcf infile --geneticMapFile eagle.map.file --outPrefix output --allowRefAltSwap --vcfOutFormat z --outputUnphased

2) Beagle v5.0
Beagle gt=infile map=plink.map out=output

3) SHAPEIT v2 (r904)
Shapeit2 --input-vcf infile -M genetic_map_hg19 -O output --force
Shapeit2 -convert --input-haps haps --output-vcf output.vcf

Minimac4 (v1.0.2): Imputation
Minimac4 --minRatio 0.00001 --allTypedSites --ignoreDuplicates --meta --refHaps panel.m3vcf.gz --haps input.vcf.gz --prefix output

plink2: Association analysis
https://www.cog-genomics.org/plink/2.0/
plink2   --assoc  --bfile plink1.bfile  --out GWAS  --pheno phenotype.txt

LDSC: used to calculate heritability and genomic inflation lambda

https://github.com/bulik/ldsc/wiki/LD-Score-Estimation-Tutorial
https://github.com/bulik/ldsc/wiki/Heritability-and-Genetic-Correlation
# create LD score
python ldsc.py --bfile --l2 --ld-wind-cm 1 --out chrom1-22
# gwas summary statistics
python munge_sumstats.py --sumstats GWAS.summary.tsv --out GWAS --N 72288 --a1-inc --info-min 0.3 --maf-min 0.0005
# compute h2
python ldsc.py --h2 GWAS.sumstats.gz --ref-ld-chr KNIH.NARD --w-ld-chr KNIH.NARD --out h2.GWAS

gcta-cojo: independent analysis
https://yanglab.westlake.edu.cn/software/gcta/#COJO
gcta64 --bfile plink.bfile --chr chrom1-22 --cojo-file cojo.input.tsv --cojo-slct --cojo-p 5e-8 --out cojo.out

magma: gene based analysis
https://ctg.cncr.nl/software/magma
# create annotation
magma --annotate --snp-loc plink.coding.bfile.bim --gene-loc gene.loc --out genes.annot
# run gene based analysis
magma --bfile plink.coding.bfile --gene-annot genes.annot --pheno pheno.tsv --gene-model multi --genes-only --burden 0.0005 --gene-settings snp-max-maf=0.05 --out magma.gene

HyPrColoc: colocalization analysis
https://github.com/jrs95/hyprcoloc
#binary.outcomes option: 1 for DM and HTN otherwise 0 for other phenotypes
hyprcoloc(effect.est = betas, effect.se =  ses, binary.outcomes =  c(0,0,1,0,0,1,0,0), trait.names=traits, snp.id=snp.id)

mashR: calculate pairwise sharing
https://stephenslab.github.io/mashr/articles/intro_mash.html
https://stephenslab.github.io/mashr/articles/intro_mash_dd.html
m.1by1 = mash_1by1(data)
strong = get_significant_results(m.1by1,0.05)
U.pca = cov_pca(data,5,subset=strong)
U.ed = cov_ed(data, U.pca, subset=strong)
m.ed = mash(data, U.ed)
med.loglik=get_loglik(m.ed)
med.sharing=get_pairwise_sharing(m.ed,factor=0.5)

SuSiE: finemapping- polyfun implemented
https://github.com/omerwe/polyfun/wiki/1.-Computing-prior-causal-probabilities-with-PolyFun#1-create-a-munged-summary-statistics-file-in-a-polyfun-friendly-parquet-format-1
https://github.com/omerwe/polyfun/wiki/3.-Functionally-informed-fine-mapping-with-finemapper
# create summary statistics
python munge_polyfun_sumstats.py --sumstats GWAS.summary.tsv --out  GWAS.sumstats.parquet --n 72298 --min-info 0.3 --min-maf 0.0005 --keep-hla
# SuSiE fine-mapping per loci of interest
python finemapper.py --geno plink2.bfile --sumstats GWAS.sumstats.parquet --n 72298 --chr chr1-22 --start start_bp --end end_bp --method susie --max-num-causal 10 --cache-dir cache --out finemap.tsv --non-funct --allow-missing

fgwas: enrichment analysis
https://github.com/joepickrell/fgwas/blob/master/man/fgwas_manual.pdf
fgwas -i input -o output -w ATAC.bed -fine

Epigenetic annotation was performed with an in-house script to annotate each variant.
we provide each link for the database we used.
ABC-model
http://yed.ucsd.edu:8787/ABC_scores/

ATAC-peak
http://yed.ucsd.edu:8787/Peaks/

epimap summary
http://compbio.mit.edu/epimap/
https://www.meuleman.org/research/dhsindex/
https://personal.broadinstitute.org/cboix/epimap/links/pergroup/
ENCODE, SCREEN
https://screen.encodeproject.org/

GTEx
https://gtexportal.org/home/datasets

deltaSVM
https://www.beerlab.org/deltasvm/

JASPER
https://jaspar.genereg.net/downloads/

https://genome.ucsc.edu/cgi-bin/hgTables