Last updated: 2025-05-16

Knit directory: Logica_Analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility, it’s best to always run the code in an empty environment.

Seed: set.seed(20250516)

The command set.seed(20250516) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: f75c581

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version f75c581. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Unstaged changes:
    Modified:   analysis/_site.yml

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/QC_Pipeline.Rmd) and HTML (docs/QC_Pipeline.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
html	40d92e6	borangao	2025-05-16	Build site.
Rmd	07263f2	borangao	2025-05-16	Publish the initial files for myproject

Data Preparation

Note: Following this guideline helps maintain the integrity and accuracy of the local genetic correlation analysis across ancestries.

Step-by-Step Guideline

For illustration, we use LDL GWAS summary statistics from the UK Biobank (UKBB) and from an Asian GWAS meta-analysis of Biobank Japan, Korean Biobank, and Taiwan Biobank. We provide a series of R scripts for data preparation in the inst folder of this repository. The scripts are designed to be run sequentially, and we recommend executing them with the Rscript command.

1. GWAS and Reference Panel Quality Control and SNP Alignment

This script performs quality control (QC) on GWAS summary statistics and genotype reference panel data for two specified ancestries. It aligns SNPs between GWAS datasets and corresponding reference panels, filters for common SNPs, and prepares genotype files suitable for downstream analyses.
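The allele alignment mentioned above can be sketched in base R as follows. This is an illustration only, not the code in Step1_GWAS_Reference_Align.R: the helper name align_z and the toy data frames are hypothetical, and the sketch assumes the SNP, ALLELE0, ALLELE1, and Z columns described under Required Input Formats. SNPs whose alleles are swapped relative to the reference have the sign of their Z-score flipped; SNPs whose alleles match in neither orientation are dropped.

```r
# Align GWAS alleles to a reference table (hypothetical helper, base R only).
align_z <- function(gwas, ref) {
  m <- merge(gwas, ref, by = "SNP", suffixes = c("", "_ref"))
  same    <- m$ALLELE0 == m$ALLELE0_ref & m$ALLELE1 == m$ALLELE1_ref
  swapped <- m$ALLELE0 == m$ALLELE1_ref & m$ALLELE1 == m$ALLELE0_ref
  # Swapped alleles: flip the effect direction and adopt the reference coding.
  m$Z[swapped] <- -m$Z[swapped]
  m$ALLELE0[swapped] <- m$ALLELE0_ref[swapped]
  m$ALLELE1[swapped] <- m$ALLELE1_ref[swapped]
  m <- m[same | swapped, c("SNP", "ALLELE0", "ALLELE1", "Z")]
  rownames(m) <- NULL
  m
}

# Toy inputs (made up for illustration).
gwas <- data.frame(SNP = c("rs1", "rs2", "rs3"),
                   ALLELE0 = c("A", "G", "A"),
                   ALLELE1 = c("G", "A", "C"),
                   Z = c(1.5, 2.0, -0.7),
                   stringsAsFactors = FALSE)
ref <- data.frame(SNP = c("rs1", "rs2", "rs3"),
                  ALLELE0 = c("A", "A", "A"),
                  ALLELE1 = c("G", "G", "T"),
                  stringsAsFactors = FALSE)

aligned <- align_z(gwas, ref)
print(aligned)
```

In this toy input, rs1 already matches, rs2 is swapped and gets a flipped Z-score, and rs3 is removed because its alleles disagree with the reference.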

Required Input Formats

1. Reference Genotype Files:

- PLINK binary files (.bed, .bim, .fam) organized by chromosome
- All files placed in a single directory per ancestry
- Example filename format: EUR_chr_1.bed, EUR_chr_1.bim, EUR_chr_1.fam
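Before running the script, it can help to confirm that every per-chromosome file follows this naming pattern. The sketch below is a hypothetical helper (plink_paths is not part of the pipeline), and it assumes chromosomes 1 to 22, which the source does not state explicitly.

```r
# Build the expected PLINK file paths for one ancestry (hypothetical helper).
# Assumption: autosomes 1-22, named <prefix>_<chr>.<ext>.
plink_paths <- function(input_dir, input_prefix, chromosomes = 1:22) {
  grid <- expand.grid(ext = c("bed", "bim", "fam"), chr = chromosomes,
                      stringsAsFactors = FALSE)
  file.path(input_dir, paste0(input_prefix, "_", grid$chr, ".", grid$ext))
}

paths <- plink_paths("path/to/ancestry1", "EUR_chr")
missing_files <- paths[!file.exists(paths)]   # files that still need to be supplied
head(paths, 3)
```

An empty missing_files vector means all expected files are present in the directory.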

2. GWAS Summary Statistics Files:

Required columns:
- CHROM: Chromosome
- POS: Base-pair position
- SNP: SNP identifier
- ALLELE0: Reference allele
- ALLELE1: Alternative allele
- Z: Z-score
- N: Sample size
Optional columns:
- BETA: Effect size
- SE: Standard error
- A1FREQUENCY: Allele frequency
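A quick check of the column layout can catch formatting problems early. The following is a minimal base R sketch with a hypothetical helper, check_gwas_columns. Deriving Z as BETA / SE is the standard definition of a Z-score; whether the pipeline script does this itself is not stated, so the sketch treats it as a manual preparation step.

```r
required_cols <- c("CHROM", "POS", "SNP", "ALLELE0", "ALLELE1", "Z", "N")
optional_cols <- c("BETA", "SE", "A1FREQUENCY")

# Report which required and optional columns are absent (hypothetical helper).
check_gwas_columns <- function(df) {
  list(missing_required = setdiff(required_cols, names(df)),
       missing_optional = setdiff(optional_cols, names(df)))
}

# Toy one-row table that lacks Z and N (values made up for illustration).
gwas <- data.frame(CHROM = 1, POS = 12345, SNP = "rs1",
                   ALLELE0 = "A", ALLELE1 = "G",
                   BETA = 0.02, SE = 0.01,
                   stringsAsFactors = FALSE)
res <- check_gwas_columns(gwas)
print(res$missing_required)

# If Z is absent but BETA and SE are present, Z can be computed as BETA / SE.
if ("Z" %in% res$missing_required && all(c("BETA", "SE") %in% names(gwas))) {
  gwas$Z <- gwas$BETA / gwas$SE
}
```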

Input Parameter Descriptions:

- input_dir_1: Directory containing PLINK files for ancestry 1.
- input_prefix_1: Filename prefix for ancestry 1 (e.g., EUR_chr).
- ancestry_1: Label for ancestry 1 (e.g., EUR).
- gwas_1: Path to GWAS summary statistics file for ancestry 1.
- input_dir_2: Directory containing PLINK files for ancestry 2.
- input_prefix_2: Filename prefix for ancestry 2 (e.g., EAS_chr).
- ancestry_2: Label for ancestry 2 (e.g., EAS).
- gwas_2: Path to GWAS summary statistics file for ancestry 2.
- trait: Name of the trait analyzed (e.g., LDL).
- output_dir: Directory for output files.
- plink_path: Full path to PLINK 2 executable.
- skip_geno_qc: Boolean flag to skip genotype QC steps (default FALSE).

Main Functionalities:

- QC on GWAS summary statistics (filters ambiguous SNPs, missing data, allele frequency, and MHC region).
- QC on reference genotype data using PLINK (filters MAF, genotype rate, and HWE).
- Alignment of SNPs across two ancestries and GWAS summary statistics.
- Generates standardized outputs:
  - QC’d genotype files (PLINK binary format by chromosome)
  - Aligned GWAS summary statistics
  - SNP lists for further analysis
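The GWAS-side filters listed above can be sketched in base R as follows. The helper names and all thresholds are assumptions made for illustration: the MHC boundaries (chromosome 6, 25 to 34 Mb) are a commonly used approximation, and the 0.01 minor allele frequency cutoff is a placeholder. The values actually used by Step1_GWAS_Reference_Align.R are not stated here.

```r
# Flag strand-ambiguous SNPs (A/T and C/G pairs).
is_ambiguous <- function(a0, a1) {
  paste0(a0, a1) %in% c("AT", "TA", "CG", "GC")
}

# Apply GWAS filters; thresholds are illustrative assumptions, not pipeline values.
qc_gwas <- function(df, mhc_chr = 6, mhc_start = 25e6, mhc_end = 34e6,
                    min_maf = 0.01) {
  req <- c("CHROM", "POS", "SNP", "ALLELE0", "ALLELE1", "Z", "N")
  keep <- complete.cases(df[, req])                         # drop missing data
  keep <- keep & !is_ambiguous(df$ALLELE0, df$ALLELE1)      # drop A/T, C/G SNPs
  in_mhc <- df$CHROM == mhc_chr & df$POS >= mhc_start & df$POS <= mhc_end
  keep <- keep & !in_mhc                                    # drop MHC region
  if ("A1FREQUENCY" %in% names(df)) {
    maf <- pmin(df$A1FREQUENCY, 1 - df$A1FREQUENCY)
    keep <- keep & !is.na(maf) & maf >= min_maf             # drop rare variants
  }
  df[keep, , drop = FALSE]
}

# Toy table (made up): only rs1 passes every filter.
gwas <- data.frame(
  CHROM = c(1, 1, 6, 2, 3),
  POS = c(1000, 2000, 30e6, 1200, 500),
  SNP = c("rs1", "rs2", "rs3", "rs4", "rs5"),
  ALLELE0 = c("A", "A", "A", "C", "G"),
  ALLELE1 = c("G", "T", "G", "T", "A"),
  Z = c(1.2, 0.4, -2.1, NA, 0.9),
  N = rep(1000, 5),
  A1FREQUENCY = c(0.30, 0.20, 0.40, 0.25, 0.001),
  stringsAsFactors = FALSE
)
clean <- qc_gwas(gwas)
print(clean$SNP)
```

In the toy table, rs2 is strand-ambiguous, rs3 lies in the assumed MHC window, rs4 has a missing Z-score, and rs5 falls below the frequency cutoff.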

Dependencies:

R libraries: optparse, data.table, dplyr
External software: PLINK 2

Example Usage:

Rscript Step1_GWAS_Reference_Align.R \
  --input_dir_1 path/to/ancestry1 \
  --input_prefix_1 EUR_chr \
  --ancestry_1 EUR \
  --gwas_1 EUR_GWAS.txt \
  --input_dir_2 path/to/ancestry2 \
  --input_prefix_2 EAS_chr \
  --ancestry_2 EAS \
  --gwas_2 EAS_GWAS.txt \
  --trait LDL \
  --output_dir path/to/output \
  --plink_path /path/to/plink2 \
  --skip_geno_qc FALSE

Practical Example:

system(paste(
  "Rscript Step1_GWAS_Reference_Align.R",
  "--input_dir_1 /net/fantasia/home/borang/MALGC/ukb_bbj_ref/EUR/",
  "--input_prefix_1 EUR_chr",
  "--ancestry_1 EUR",
  "--gwas_1 /net/fantasia/home/borang/MALGC/real_data/European/UKBB/UKBB_QC/LDL_common.txt",
  "--input_dir_2 /net/fantasia/home/borang/MALGC/ukb_bbj_ref/EAS/",
  "--input_prefix_2 EAS_chr",
  "--ancestry_2 EAS",
  "--gwas_2 /net/fantasia/home/borang/MALGC/real_data/Asian/Meta/LDL_common.txt",
  "--trait LDL",
  "--output_dir /net/fantasia/home/borang/MALGC/pipeline_example",
  "--plink_path /net/fantasia/home/borang/software/plink2",
  "--skip_geno_qc TRUE"
))
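When the same call is made for several traits, it may be easier to keep the options in a named list and assemble the argument vector programmatically. The sketch below is one way to do this in base R. The helper build_args is hypothetical, and the paths are the placeholders from the Example Usage section, not real files.

```r
# Turn a named list of options into a vector of command-line arguments
# (hypothetical helper; each option becomes "--name value").
build_args <- function(script, opts) {
  flags <- paste0("--", names(opts))
  values <- vapply(opts, as.character, character(1))
  c(script, as.vector(rbind(flags, values)))
}

opts <- list(
  input_dir_1 = "path/to/ancestry1", input_prefix_1 = "EUR_chr",
  ancestry_1 = "EUR", gwas_1 = "EUR_GWAS.txt",
  input_dir_2 = "path/to/ancestry2", input_prefix_2 = "EAS_chr",
  ancestry_2 = "EAS", gwas_2 = "EAS_GWAS.txt",
  trait = "LDL", output_dir = "path/to/output",
  plink_path = "/path/to/plink2", skip_geno_qc = FALSE
)
args <- build_args("Step1_GWAS_Reference_Align.R", opts)
cat(args, "\n")

# To launch the script (not executed here); shQuote() protects paths with spaces:
# system2("Rscript", shQuote(args))
```

Keeping the options in a list makes it straightforward to change only the trait or GWAS paths between runs.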

2. Split Region and Construct LD Matrix and Eigen-Decomposition for LDER

This script extracts SNPs within specified LD blocks from genotype data, computes LD matrices for each ancestry, performs eigen decomposition of the LD matrices, and generates inputs compatible with Linkage Disequilibrium Eigenvalue Regression (LDER).
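To make the two numerical steps concrete, the sketch below computes an LD matrix as the SNP-by-SNP correlation of a simulated genotype matrix and then takes its eigen decomposition. It uses only base R on simulated data and stands in for, rather than reproduces, what Step2_LD_Region.R does with snpStats. The sample size, number of SNPs, and allele frequency are arbitrary assumptions.

```r
set.seed(1)
n <- 200   # individuals (arbitrary)
p <- 5     # SNPs in the toy block (arbitrary)

# Simulated genotype dosages (0/1/2) standing in for one LD block.
G <- matrix(rbinom(n * p, size = 2, prob = 0.3), nrow = n, ncol = p)
# Induce LD between SNP 1 and SNP 2 by copying most genotypes.
idx <- sample(n, 0.9 * n)
G[idx, 2] <- G[idx, 1]

R <- cor(G)                          # p x p LD (correlation) matrix
eig <- eigen(R, symmetric = TRUE)    # eigenvalues returned in decreasing order

# Sanity check: the decomposition reconstructs the LD matrix.
R_hat <- eig$vectors %*% diag(eig$values) %*% t(eig$vectors)
recon_error <- max(abs(R - R_hat))
print(round(eig$values, 3))
```

Because R is a correlation matrix, its eigenvalues sum to the number of SNPs in the block, which is a convenient check on the output.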

Required Input Formats:

1. Genotype Files:

- PLINK binary files (.bed, .bim, .fam) organized by chromosome
- Files in a single directory named geno
- Filename example: EUR_chr_1_aligned.bed, EUR_chr_1_aligned.bim, EUR_chr_1_aligned.fam

2. GWAS Aligned Summary Statistics Files:

- Required columns: CHROM, POS, SNP, ALLELE0, ALLELE1, Z
- Organized by trait and ancestry
- Example filename: LDL_EUR_aligned.txt

3. LD Block File:

- BED file specifying LD blocks (chromosome, start, end positions)
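Selecting the SNPs that belong to one LD block amounts to a chromosome and position filter. The base R sketch below illustrates the idea with made-up blocks and SNPs. The helper snps_in_block is hypothetical, the numeric chromosome coding is an assumption (some BED files use labels such as chr1), and the half-open interval [start, end) is an assumed convention that should be checked against the block file in use.

```r
# Toy LD blocks and SNP positions (made up for illustration).
blocks <- data.frame(chr = c(1, 1), start = c(0, 1000), end = c(1000, 2500))
snps <- data.frame(SNP = c("rs1", "rs2", "rs3", "rs4"),
                   CHROM = c(1, 1, 1, 2),
                   POS = c(500, 1000, 1800, 1200),
                   stringsAsFactors = FALSE)

# Return the SNPs falling inside the block with the given row index.
snps_in_block <- function(snps, blocks, block_index) {
  b <- blocks[block_index, ]
  # Assumed convention: start <= POS < end, so adjacent blocks never share a SNP.
  snps[snps$CHROM == b$chr & snps$POS >= b$start & snps$POS < b$end, ]
}

block2 <- snps_in_block(snps, blocks, 2)
print(block2$SNP)
```

With a half-open interval, a SNP sitting exactly on a shared boundary (rs2 at position 1000) is assigned to the second block only.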

Dependencies:

R libraries: optparse, data.table, dplyr, snpStats

Example Usage:

system(paste(
  "Rscript ~/MALGC/MALGC_software/Data_Process/Step2_LD_Region.R",
  "--input_dir /net/fantasia/home/borang/MALGC/pipeline_example",
  "--output_dir /net/fantasia/home/borang/MALGC/pipeline_example",
  "--ancestry_1 EUR",
  "--ancestry_2 EAS",
  "--trait LDL",
  "--block_index 1",
  "--ld_block_file /net/fantasia/home/borang/MALGC/ld_blocks/grch37.eur.eas.loci.bed"
))

Parallel Computing

We provide a Step2_LD_Region.sh script for running this step in parallel.
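For readers who prefer to drive the parallel runs from R, one option is to generate one command per block and dispatch them with the parallel package. The sketch below is not the contents of Step2_LD_Region.sh. The helper step2_commands is hypothetical, the paths are placeholders, and it assumes that block_index runs from 1 to the number of rows in the LD block file.

```r
# Build one Step 2 command per LD block (hypothetical helper; placeholder paths).
step2_commands <- function(n_blocks, input_dir, output_dir, ld_block_file,
                           ancestry_1 = "EUR", ancestry_2 = "EAS",
                           trait = "LDL") {
  vapply(seq_len(n_blocks), function(i) {
    paste("Rscript Step2_LD_Region.R",
          "--input_dir", input_dir,
          "--output_dir", output_dir,
          "--ancestry_1", ancestry_1,
          "--ancestry_2", ancestry_2,
          "--trait", trait,
          "--block_index", i,
          "--ld_block_file", ld_block_file)
  }, character(1))
}

# Assumption: n_blocks equals the number of rows in the LD block BED file.
cmds <- step2_commands(3, "path/to/output", "path/to/output",
                       "path/to/ld_blocks.bed")
cat(cmds, sep = "\n")

# To run on a multi-core Linux or macOS machine (not executed here):
# parallel::mclapply(cmds, system, mc.cores = 4)
```

On a cluster, the same command vector could instead be written to a file and submitted as a job array.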

3. LDER Intercept Estimation

This script performs Linkage Disequilibrium Eigenvalue Regression (LDER) analysis using GWAS summary statistics for two specified ancestries. It aggregates LD block data, computes intercepts using the ‘lder’ function from the LDER R package, and saves intercept estimates for downstream genetic correlation analyses.

Required Input Formats:

Preprocessed LDER GWAS blocks:
- .RData files generated from eigen decomposition analyses
- Files stored as input_dir/[trait]/[trait]_Block/
- Filename example: LDL_LDER_block_1.RData
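When collecting the block files, note that alphabetical sorting places block 10 before block 2. The base R sketch below shows one way to list the files in numeric block order. It creates empty stand-in files in a temporary directory so that it runs anywhere; the directory layout mirrors the description above, and the block numbers are arbitrary.

```r
# Create a temporary directory mimicking input_dir/[trait]/[trait]_Block/.
block_dir <- file.path(tempdir(), "LDL", "LDL_Block")
dir.create(block_dir, recursive = TRUE, showWarnings = FALSE)
for (i in c(1, 2, 10)) {
  # Empty stand-in files; real files would hold the eigen decomposition output.
  file.create(file.path(block_dir, paste0("LDL_LDER_block_", i, ".RData")))
}

# List block files and order them by their numeric block index.
files <- list.files(block_dir, pattern = "^LDL_LDER_block_[0-9]+\\.RData$")
block_id <- as.integer(sub("^LDL_LDER_block_([0-9]+)\\.RData$", "\\1", files))
files <- files[order(block_id)]
print(files)
```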

Dependencies:

R libraries: optparse, data.table, dplyr, LDER
Installation: devtools::install_github('borangao/LDER')

Example Usage:

system(paste(
  "Rscript /net/fantasia/home/borang/MALGC/MALGC_software/Data_Process/Step3_LDER.R",
  "--input_dir /net/fantasia/home/borang/MALGC/pipeline_example",
  "--ancestry_1 EUR",
  "--n_ancestry_1 343621",
  "--ancestry_2 EAS",
  "--n_ancestry_2 237613",
  "--trait LDL"
))

Output:

Intercept estimates are saved as [trait]_intercept.RData in input_dir/[trait]/.
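Since the names of the objects stored in the .RData file are not documented here, a safe way to inspect it is to load it into a separate environment and list what it contains. The sketch below writes a stand-in file to a temporary directory so that it is runnable. The object name intercept_example and its values are placeholders, not real estimates or the pipeline's actual object names.

```r
# Stand-in file so the sketch runs; replace `path` with
# input_dir/[trait]/[trait]_intercept.RData in practice.
path <- file.path(tempdir(), "LDL_intercept.RData")
intercept_example <- c(EUR = 1, EAS = 1)   # placeholder values, not real estimates
save(intercept_example, file = path)

# Load into a fresh environment to avoid overwriting objects in the workspace.
env <- new.env()
loaded <- load(path, envir = env)   # load() returns the names of restored objects
print(loaded)
str(mget(loaded, envir = env))
```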

sessionInfo()

R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.7.1

loaded via a namespace (and not attached):
 [1] jsonlite_1.8.7    compiler_4.5.0    promises_1.3.0    Rcpp_1.0.14      
 [5] stringr_1.5.1     git2r_0.33.0      callr_3.7.6       later_1.4.2      
 [9] jquerylib_0.1.4   yaml_2.3.10       fastmap_1.2.0     R6_2.6.1         
[13] knitr_1.48        tibble_3.2.1      rprojroot_2.0.4   bslib_0.8.0      
[17] pillar_1.9.0      rlang_1.1.4       utf8_1.2.4        cachem_1.1.0     
[21] stringi_1.8.4     httpuv_1.6.15     xfun_0.47         getPass_0.2-4    
[25] fs_1.6.6          sass_0.4.9        cli_3.6.3         magrittr_2.0.3   
[29] formatR_1.14      ps_1.7.7          digest_0.6.37     processx_3.8.4   
[33] rstudioapi_0.16.0 lifecycle_1.0.4   vctrs_0.6.5       evaluate_0.24.0  
[37] glue_1.8.0        whisker_0.4.1     fansi_1.0.6       rmarkdown_2.28   
[41] httr_1.4.7        tools_4.5.0       pkgconfig_2.0.3   htmltools_0.5.8.1
