Installation and quickstart


This section covers the installation of MRBIGR and briefly introduces several typical usages to help users get started quickly. For details and advanced usage of MRBIGR, please see the Tutorial section.


1. Installation of MRBIGR

1.1 Download MRBIGR

git clone https://gitee.com/crazyhsu/MRBIGR.git
cd MRBIGR

1.2 Installation using Docker

docker build -t mrbigr_image .
curr_dir=$(pwd)
docker run -dit --name mrbigr_env -p 3838:3838 \
    -v $curr_dir/MRBIGR_data:/root/MRBIGR/MRBIGR_data \
    -v $curr_dir/MRBIGR_output/demo:/root/MRBIGR/MRBIGR_output/demo \
    -v $curr_dir/MRBIGR_log:/root/MRBIGR/MRBIGR_log \
    -e SHINY_INPUT_PATH=/root/MRBIGR/MRBIGR_data \
    -e SHINY_OUTPUT_PATH=/root/MRBIGR/MRBIGR_output/demo \
    -e SHINY_LOG_PATH=/root/MRBIGR/MRBIGR_log \
    -e SHINY_ROOT_PATH=/root/MRBIGR \
    mrbigr_image:latest

If no exception is thrown, open http://localhost:3838 in your browser and you will see the MRBIGR GUI.
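If the page does not load, you can check that the container started correctly with standard Docker commands (mrbigr_env is the container name used above):

docker ps --filter name=mrbigr_env    # the container should be listed with status Up
docker logs mrbigr_env                # inspect the container log for startup errors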

Note: If you encounter an error like the one below:

#0 building with "default" instance using docker driver

#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 1.10kB done
#1 DONE 0.2s

#2 [internal] load metadata for docker.io/crazyhsu/mrbigr_env:v1.2
#2 ERROR: failed to authorize: failed to fetch anonymous token: Get "https://auth.docker.io/token?scope=repository%3Acrazyhsu%2Fmrbigr_env%3Apull&service=registry.docker.io": dial tcp [2a03:2880:f130:83:face:b00c:0:25de]:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
------
 > [internal] load metadata for docker.io/crazyhsu/mrbigr_env:v1.2:
------
Dockerfile:1
--------------------
   1 | >>> FROM crazyhsu/mrbigr_env:v1.2
   2 |     
   3 |     ENV CONDA_DEFAULT_ENV=mrbigr
--------------------
ERROR: failed to solve: crazyhsu/mrbigr_env:v1.2: failed to authorize: failed to fetch anonymous token: Get "https://auth.docker.io/token?scope=repository%3Acrazyhsu%2Fmrbigr_env%3Apull&service=registry.docker.io": dial tcp [2a03:2880:f130:83:face:b00c:0:25de]:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

Please try the following commands:

cd MRBIGR
docker pull crazyhsu/mrbigr_env:v1.2  # pull the image directly, no authorization required
docker build -t mrbigr_image .
# or download the image archive from Zenodo and load it manually
wget https://zenodo.org/records/13955396/files/mrbigr_env_v1.2.tar
docker load -i mrbigr_env_v1.2.tar
docker build -t mrbigr_image .

Then run the docker container:

curr_dir=$(pwd)
docker run -dit --name mrbigr_env -p 3838:3838 \
    -v $curr_dir/MRBIGR_data:/root/MRBIGR/MRBIGR_data \
    -v $curr_dir/MRBIGR_output/demo:/root/MRBIGR/MRBIGR_output/demo \
    -v $curr_dir/MRBIGR_log:/root/MRBIGR/MRBIGR_log \
    -e SHINY_INPUT_PATH=/root/MRBIGR/MRBIGR_data \
    -e SHINY_OUTPUT_PATH=/root/MRBIGR/MRBIGR_output/demo \
    -e SHINY_LOG_PATH=/root/MRBIGR/MRBIGR_log \
    -e SHINY_ROOT_PATH=/root/MRBIGR \
    mrbigr_image:latest

1.3 Installation using conda

1.3.1 Create a conda environment

conda create -n mrbigr python=3.7 -y
conda activate mrbigr

1.3.2 Build and install MRBIGR

cd MRBIGR  
python setup.py build  
python setup.py install  

1.3.3 Install dependencies

pip install pyranges  
conda install -y -c conda-forge r-rcppeigen r-xml r-rsqlite r-europepmc r=3.6 rpy2 vcftools 
Rscript -e 'install.packages(c("data.table", "ggplot2", "ggsignif", "ggrepel","Matrix", "igraph", "network", "GGally", "sna","tidyr","ggraph","lme4","ggpubr","pheatmap","factoextra", "R.utils"), repos="https://cloud.r-project.org")' 
Rscript -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager", repos="https://cloud.r-project.org");BiocManager::install(c("AnnotationForge","clusterProfiler","ggtree"))'  
Rscript -e 'install.packages("bigsnpr", dependencies=TRUE, repos="https://cloud.r-project.org")'
Rscript -e 'install.packages("https://cran.r-project.org/src/contrib/Archive/FactoMineR/FactoMineR_1.42.tar.gz", repos=NULL)'

1.3.4 Set environment variables

echo "export PATH=`pwd`/utils:\$PATH" >> ~/.bashrc  
echo "export LD_LIBRARY_PATH=`pwd`/utils/libs:\$LD_LIBRARY_PATH" >> ~/.bashrc
source ~/.bashrc
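As a quick check (a sketch; it assumes the steps above completed without errors and that setup.py installed the MRBIGR.py entry script into the active conda environment), confirm that the script and key R packages are available:

command -v MRBIGR.py                                      # should print the path of the installed script
Rscript -e 'library(bigsnpr); library(clusterProfiler)'   # should load without errors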

1.3.5 Set up MRBIGR GUI app

Rscript -e "install.packages(c('shiny','bslib','shinyFiles','shinyalert'), repos='https://mirrors.tuna.tsinghua.edu.cn/CRAN/')"

Then launch the MRBIGR GUI, which listens on port 3838:

Rscript -e "shiny::runApp('MRBIGR-I/app.R', host = '0.0.0.0', port = 3838)"

Depends: R (>= 3.6), Python (>= 3.7)


2. Data preparation

Using MRBIGR for basic analysis requires only genotypic and phenotypic data. Users can perform advanced functional analyses and generate visualizations through the built-in functions and their combinations (see the Tutorial section for details). The example datasets used below come from MRBIGR_data, which contains all the data used in our paper "MRBIGR: a versatile toolbox for genetic causal inference in population-scale multi-omics data".

# download and unpack MRBIGR_data.tar.gz
wget https://zenodo.org/records/13955396/files/MRBIGR_data.20241019.tar.gz -O MRBIGR_data.tar.gz
tar -xvzf MRBIGR_data.tar.gz --transform 's,^MRBIGR_data.20241019,MRBIGR_data,'
rm MRBIGR_data.tar.gz
mkdir -p MRBIGR_output/demo
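After unpacking, the files referenced in the following sections should be in place; a quick listing confirms this:

ls MRBIGR_data | head          # example genotype, phenotype and annotation files
ls MRBIGR_data/chr_HAMP.*      # PLINK-bed genotypic data used in section 3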

2.1 Genotypic data

MRBIGR uses the PLINK-bed file format for genotypic data by default (see http://zzz.bwh.harvard.edu/plink/data.shtml#bed for details). The built-in geno module supports converting HapMap- or VCF-formatted genotype files to PLINK-bed format, as well as converting PLINK-bed format back to VCF format.

2.1.1 HapMap file format

HapMap is a commonly used format for storing sequence data, in which SNP information is stored in rows and taxa information is stored in columns. This format allows the SNP information (chromosome and position) and the genotypes of each taxon to be stored in a single file.

The first 11 columns describe attributes of the SNPs, and the remaining columns show the nucleotides observed at each SNP for each taxon. The first row contains the header labels, and each remaining row contains all the information for a single SNP. The first nine SNPs from the tutorial data (MRBIGR_data/chr_HAMP_female.hmp) are presented below.

rs# alleles chrom posi strand assembly center protLSID assayLSID panelLSID QCdode CIMBL32 CIMBL89 CIMBL7 CIMBL45 ZHENG58 ... CIMBL46
chr10.s_74401 C/T 10 74401 + We Are The Maize Re Seq CC TT CC TT CC ... TT
chr10.s_76702 A/G 10 76702 + We Are The Maize Re Seq GG AA GG AA GG ... AA
chr10.s_92823 C/T 10 92823 + We Are The Maize Re Seq CC CC CC CC CC ... CC
chr10.s_94339 T/A 10 94339 + We Are The Maize Re Seq AA TT AA TT AA ... TT
chr10.s_94579 C/G 10 94579 + We Are The Maize Re Seq CC GG CC GG CC ... GG
chr10.s_94901 G/A 10 94901 + We Are The Maize Re Seq AA AA AA AA AA ... AA
chr10.s_94930 T/G 10 94930 + We Are The Maize Re Seq GG TT GG TT GG ... TT
chr10.s_108603 C/T 10 108603 + We Are The Maize Re Seq TT CC TT CC TT ... CC
chr10.s_108610 C/T 10 108610 + We Are The Maize Re Seq TT CC TT CC TT ... CC
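As a quick sanity check of a HapMap file (a minimal sketch, assuming the file is tab-delimited as in the standard HapMap layout), count the SNPs and taxa with standard shell tools:

tail -n +2 MRBIGR_data/chr_HAMP_female.hmp | wc -l                             # number of SNPs (rows minus the header)
head -1 MRBIGR_data/chr_HAMP_female.hmp | awk -F'\t' '{print NF-11, "taxa"}'   # columns after the 11 attribute columns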

2.1.2 VCF file format

VCF (Variant Call Format) is a standard text file format for storing genomic variation information. It is widely used in genomic research, especially to describe differences between the genome sequence of an individual or a population and a reference genome. VCF files can describe multiple types of genetic variation, such as single nucleotide polymorphisms (SNPs), insertions/deletions (InDels), and structural variations.

The basic structure of a VCF file consists of two parts: header lines and data lines. The header lines start with ## and contain the file's metadata, such as the format version and the generating program, while the #CHROM line defines the column names. The data lines describe each variant site in detail, including chromosome, position, identifier, reference and alternative alleles, quality score, filtering status, additional information, and sample-specific data. The first five SNPs from the converted data (MRBIGR_output/demo/geno/chr_HAMP_female.vcf.vcf) are presented below.

##fileformat=VCFv4.2
##fileDate=20241004
##source=PLINKv1.90
##contig=<ID=10,length=150903390>
##contig=<ID=1,length=306984567>
##contig=<ID=2,length=209800297>
##contig=<ID=3,length=235647547>
##contig=<ID=4,length=246987745>
##contig=<ID=5,length=223818355>
##contig=<ID=6,length=173410493>
##contig=<ID=7,length=182367039>
##contig=<ID=8,length=181101714>
##contig=<ID=9,length=159764980>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  CIMBL32 CIMBL89 CIMBL7  CIMBL45 ZHENG58 CML415  CIMBL46 CIMBL70 CIMBL124        CIMBL68 CIMBL77 CML324  CML326  CIMBL69 CIMBL23 CIMBL71 
10      74401   chr10.s_74401   C       T       .       .       PR      GT      0/0     1/1     0/0     1/1     0/0     1/1     1/1     1/1     0/0     1/1     1/1     0/0     0/1     1/1     1/1     1/1     
10      76702   chr10.s_76702   G       A       .       .       PR      GT      0/0     1/1     0/0     1/1     0/0     1/1     1/1     1/1     0/0     1/1     1/1     0/0     0/1     1/1     1/1     1/1     
10      92823   chr10.s_92823   C       T       .       .       PR      GT      0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     
10      94339   chr10.s_94339   A       T       .       .       PR      GT      0/0     1/1     0/0     1/1     0/0     1/1     1/1     1/1     0/0     1/1     1/1     0/0     1/1     1/1     1/1     1/1     
10      94579   chr10.s_94579   C       G       .       .       PR      GT      0/0     1/1     0/0     1/1     0/0     1/1     1/1     1/1     0/0     1/1     1/1     0/0     1/1     1/1     1/1     1/1    
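Once a file has been converted (the path below follows the example above), a similar sanity check works for the VCF output; VCF data lines are tab-delimited with nine fixed columns before the sample columns:

grep -vc '^#' MRBIGR_output/demo/geno/chr_HAMP_female.vcf.vcf                                              # number of variant records
grep -m1 '^#CHROM' MRBIGR_output/demo/geno/chr_HAMP_female.vcf.vcf | awk -F'\t' '{print NF-9, "samples"}'  # number of samples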

2.2 Phenotypic data

A phenotypic file may include multiple phenotype columns. Taxa names should be in the first column, and the remaining columns should contain the observed phenotypes of each individual. Missing data should be indicated by either NaN or NA. MRBIGR supports many types of datasets as phenotypic files, as long as the data are provided in comma-separated values (CSV) format. The three most commonly used data types are presented below.

2.2.1 Trait file

The first nine observations in the tutorial data (MRBIGR_data/blup_traits_final.new.csv) are displayed as follows:

Trait Plantheight Earheight Earleafwidth Earleaflength Tasselmainaxislength Tasselbranchnumber Leafnumberaboveear Earlength Eardiameter Cobdiameter Kernernumberperrow 100grainweight cobweight Kernelwidth Silkingtime Pollenshed Headingdate
150 199.40686200000002 84.09240313 8.415551016 62.57602963 26.18701812 7.0083920310000005 5.446805873 13.16553466 3.814461974 2.3336648540000002 24.01368699 18.66442693 18.73223124 6.948005094 69.5634 67.3508 64.3586
177 167.7376926 62.32666383 8.336793644 75.84687159999999 26.86362604 8.25953511 5.364463679 13.24180052 3.664731428 2.3747896369999997 23.22662586 22.17517342 16.69770013 7.95607262 69.1351 67.5734 66.1437
238 195.1777397 65.90031644 9.38907964 78.97464118 27.64054122 9.749416145 5.498509112000001 13.84549223 3.7085016239999997 2.4834557630000003 23.399574100000002 21.94336045 17.4338478 8.18091486 74.5957 71.9141 69.8254
268 160.7956655 52.69387703 8.844888078 70.42957469 28.36394677 6.981059882 5.635426946 10.73827719 3.66728469 2.257745983 19.08355482 18.86634823 14.05614498 7.360464037000001 71.2765 68.90899999999999 65.6974
501 170.01783559999998 62.25478889 8.796211646 67.83068252 26.54872089 6.3387543729999996 5.772344781 10.33143474 4.086384316 2.546535413 22.25759352 25.83110804 17.26231825 8.435537529 71.2765 69.7994 66.2553
647 143.4828657 68.96951963 6.5188109789999995 65.55214977 25.94091108 17.84754152 4.351873538 6.8750427389999995 2.557710224 1.4466421669999998 16.79624763 11.789889599999999 4.497660526000001 5.1284208730000005 72.24015 70.9124 66.7015
812 164.7820379 57.30049720000001 8.435240359 71.97071207 27.00737492 7.468018377999999 4.7605753260000006 11.73433366 3.678227239 2.319895255 21.14170296 22.99261918 11.90201965 8.541258052 68.7068 66.4604 65.6974
1323 183.0386928 70.55827585 10.53762464 70.99257322 32.98852183 5.730474602 5.672768174 12.49229884 4.136537665 2.9266879930000003 21.17811141 22.25630796 24.18210638 8.20176116 75.4523 73.5836 71.9452
1462 211.1751917 88.14599481 9.699733717 84.89124385 34.74483117 11.73294927 6.547893356 17.13560948 4.340433828 2.470617348 32.87164817 25.7713247 22.36770487 8.295569512 70.9553 69.0203 68.0403
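To get a quick overview of a phenotype file (a minimal sketch; NA/NaN are the missing-value codes described above):

head -1 MRBIGR_data/blup_traits_final.new.csv | awk -F',' '{print NF-1, "traits"}'   # columns minus the taxa-name column
tail -n +2 MRBIGR_data/blup_traits_final.new.csv | wc -l                             # number of individuals
grep -o -w -E 'NA|NaN' MRBIGR_data/blup_traits_final.new.csv | wc -l                 # number of missing values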

2.2.2 Gene expression file

The first nine observations in the tutorial data (MRBIGR_data/AMP_kernel_transcriptome_v4_FPKM) are displayed as follows:

Zm00001d007718 Zm00001d007716 Zm00001d018986 Zm00001d048402 Zm00001d048403 Zm00001d048404 Zm00001d048407 Zm00001d048342 Zm00001d048393 Zm00001d048394
05W002 13.68390968 0.08887555 22.46513484 35.2460181 90.0115652 4.245998763 2.072609286 0.8325380590000001 9.159937637999999 67.86617791
05WN230 0.41449239299999996 0.043290454 24.22802224 38.30711955 77.83880474 2.5675474190000003 6.337722396 0.359176224 11.66285576 95.13785552
07KS4 1.514937182 0.094552666 20.89936082 30.22573463 89.46284399999999 2.86870803 4.861722385 1.344393716 4.40600588 96.44533135
1462 2.9976042960000004 0.069739101 17.11780085 19.00876018 65.92898232 2.765295001 5.0446673319999995 0.34219332 8.192603334 78.60389131
150 5.716710774 0.018280516 22.47700014 0.211730029 64.85073286 3.2508831189999996 2.072331077 0.41587338700000004 9.749669454 59.98792917
177 0.38862201799999996 0.083882895 23.34402367 58.74669505 72.21102064 4.7976829819999995 2.037686245 0.654807971 13.24835189 66.17025883
18-599 0.649410418 0.082275625 20.45144972 0.46588138700000004 78.06259285 3.971991299 5.0632273969999995 0.7951806290000001 13.01598039 74.97802633
238 9.543760621 0.095198605 20.834385100000002 14.73831069 71.00077488 2.627118969 3.4174614439999997 0.9766994040000001 1.9633184590000001 36.03050858
268 1.030472507 0.11706547 17.722316199999998 36.98853947 163.2058748 1.9805356109999999 15.65963041 1.206266576 9.333207445 101.6617829

2.2.3 Metabolite abundance file

The first nine observations in the tutorial data (MRBIGR_data/E3_log2.normalized_phe.csv) are displayed as follows:

ID n0004 n0006 n0012 n0014 n0016 n0019 n0020 n0025 n0029 n0033
150 23.85756808279454 21.52133253560975 19.429310701409214 17.88465348347412 23.728581650951327 23.323886129502455 23.144138063805507 20.50688186826299 21.385744989040916 17.91415735710555
177 24.060851668708853 22.93337099206989 18.037252010387334 18.91707191416155 26.173789983480603 22.863251815465887 22.931568749661043 19.489348200901247 22.200602015093704 18.037252010387334
238 23.697103421767665 22.049263944412104 19.015635556176196 19.50881803882359 24.13320250890116 22.41699565411851 23.9225235204567 19.36879843924739 22.117435431711307 18.779568681864543
268 24.31008026193824 22.356490926193132 19.89504417298552 19.930126596592963 24.296141071653768 23.39100031911556 23.101493731065926 19.71230028530486 22.815189581096966 17.919986412355946
647 24.516531130157624 22.012226573732015 20.2065768091335 20.477537926576417 25.08131572946815 24.20955339098966 24.141022012936638 20.396237881758427 21.164229940027102 19.23724964702257
1462 24.117435193904463 21.826871578478677 20.447584725205136 17.80499437272362 25.354474345565613 22.707672768009736 22.87829961658445 18.925789113705253 21.995071856772647 17.82997662112729
4019 24.492283584929798 22.767492845636664 21.008812251998197 18.7339719183688 24.598325218007993 22.699223579410997 23.142580918608278 19.28771463375869 21.938764429607698 17.95447465132481
5213 24.759387644740613 21.953048652078188 21.688592242664424 20.353802646644912 26.400803379275068 23.958368717176786 23.16269388084486 20.56383771559438 21.78755866516714 17.638216698650762
5237 25.15653497289918 22.359175010264977 19.336473870765303 18.4374635629781 25.967192501047197 22.846133284156966 23.165763445593072 19.784368059810987 21.540378285014388 18.86966914228456

3. Quick start of MRBIGR (command line based)

MRBIGR consists of seven analysis modules: the genotype-based analysis module geno, the phenotype-based analysis module pheno, the GWAS and SAL related analysis module gwas, three Mendelian randomization related modules (the Mendelian randomization analysis module mr, the MR-based network construction module net, and the gene ontology analysis module go), and the data visualization module plot. Each module can be invoked via a subcommand, and each module provides several functions which can be called with the corresponding parameters.

In this part, only a few commonly used functions of each module are shown as examples. For more detailed instructions or GUI operation, please refer to the Tutorial section.

3.1 Genotypic data process

If you have downloaded MRBIGR_data and unpacked it, you will see a set of PLINK-bed format genotypic data named chr_HAMP.bed/chr_HAMP.bim/chr_HAMP.fam.

First, quality control of the original genotypic data should be performed through the below command:

docker exec -it mrbigr_env MRBIGR.py geno -qc -g chr_HAMP -o geno_qc -od MRBIGR_output/demo -maf 0.05 -mis 0.2 -mind 0.2  

The subcommand geno invokes the genotype analysis module; parameter -qc calls the quality control function; -g is the prefix of the PLINK-bed format input genotypic data; -o is the output genotypic data prefix; -od is the output directory; -maf is the minimum minor allele frequency (MAF) for a SNP to be kept; -mis is the maximum proportion of missing values allowed for a SNP to be kept; -mind is the maximum proportion of missing values allowed for a sample to be kept. After this step, the QC-filtered genotypic data geno_qc.bed/geno_qc.bim/geno_qc.fam would be generated.
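To see how many SNPs and samples were removed by quality control (a minimal sketch using standard shell tools, run from the MRBIGR directory where the data and output folders are mounted; PLINK .bim files contain one line per SNP and .fam files one line per sample):

wc -l MRBIGR_data/chr_HAMP.bim MRBIGR_output/demo/geno/geno_qc.bim    # SNPs before/after QC
wc -l MRBIGR_data/chr_HAMP.fam MRBIGR_output/demo/geno/geno_qc.fam    # samples before/after QC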

Then, dimensionality reduction of the QC-filtered genotypic data can be performed using the command below:

docker exec -it mrbigr_env MRBIGR.py geno -clump -g MRBIGR_output/demo/geno/geno_qc -o geno -od MRBIGR_output/demo

The subcommand geno invokes the genotype analysis module; parameter -clump calls the genotype clumping function to keep only one representative SNP per region of LD; -g is the prefix of plink-bed format input genotypic data; -o is the output genotypic data prefix (suffix _clump will be added automatically for the output files); -od is the output directory. After this step, the clumped genotypic data geno_clump.bed/geno_clump.bim/geno_clump.fam would be generated.

Last, take the dimensionality-reduced genotypic data as input and perform principal component analysis using the command below:

docker exec -it mrbigr_env MRBIGR.py geno -pca -g MRBIGR_output/demo/geno/geno_clump -o geno -od MRBIGR_output/demo 

The subcommand geno invokes the genotype analysis module; parameter -pca calls the principal component analysis function; -g is the prefix of the plink-bed format input genotypic data; -o is the output genotypic data prefix (a suffix _pca will be added automatically for the output files); -od is the output directory. After this step, an output file named geno_pca.csv would be generated.
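To take a quick look at the resulting principal components (a minimal sketch; the file is written to the geno output directory mentioned in the note below):

head -3 MRBIGR_output/demo/geno/geno_pca.csv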

Note: all of the above output files are generated in the MRBIGR_output/demo/geno directory.

3.2 Phenotypic data process

For the example CSV format phenotype file named blup_traits_final.new.csv in MRBIGR_data, the quality control of the original phenotypic data should be performed through the below command:

docker exec -it mrbigr_env MRBIGR.py pheno -qc -p blup_traits_final.new.csv -o blup_traits_final_qc -od MRBIGR_output/demo -mis 0.5 -val 0.1 -rout zscore

The subcommand pheno invokes the phenotype analysis module; parameter -qc calls the quality control function; -p is the input phenotype matrix; -o is the prefix of the output file; -od is the output directory; -mis is the missing rate cutoff; -val is the small value cutoff; -rout specifies outlier removal of phenotypic values (the zscore method here). After this step, a QC-filtered phenotypic data file named blup_traits_final_qc.phe.csv would be generated.

Then, taking this file as input, perform missing phenotypic value imputation using the command below:

docker exec -it mrbigr_env MRBIGR.py pheno -imput -p MRBIGR_output/demo/pheno/blup_traits_final_qc.phe.csv -o blup_traits_final_qc_imput -od MRBIGR_output/demo

The subcommand pheno invokes the phenotype analysis module; parameter -imput calls the phenotype imputation function; -p is the input phenotype matrix; -o is the prefix of the output file; -od is the output directory. After this step, a phenotype file named blup_traits_final_qc_imput.phe.csv with no NA values would be generated.
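To confirm that imputation removed all missing values (a minimal sketch; NA/NaN are the missing-value codes mentioned in section 2.2), count them before and after:

grep -o -w -E 'NA|NaN' MRBIGR_output/demo/pheno/blup_traits_final_qc.phe.csv | wc -l          # before imputation
grep -o -w -E 'NA|NaN' MRBIGR_output/demo/pheno/blup_traits_final_qc_imput.phe.csv | wc -l    # should be 0 after imputation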

Last, take this file as input to perform phenotypic value normalization using the below command:

docker exec -it mrbigr_env MRBIGR.py pheno -scale -p MRBIGR_output/demo/pheno/blup_traits_final_qc_imput.phe.csv -o blup_traits_final_qc_imput_norm -od MRBIGR_output/demo -boxcox -minmax  

The subcommand pheno invokes the phenotype analysis module; parameter -scale calls the phenotype scaling/normalization/standardization/transformation function; -p is the input phenotype matrix; -o is the prefix of the output file; -od is the output directory; -boxcox and -minmax mean that both Box-Cox normalization and min-max scaling are performed on the input data. After this step, a phenotype file named blup_traits_final_qc_imput_norm.phe.csv with normalized and scaled phenotypic values would be generated.

In some cases, phenotypic data are collected from different environments or years and should be merged using an appropriate algorithm, such as mean values or BLUP, before further analysis. As an example, taking one trait measured in different environments as input, the command below merges all the columns using the BLUP algorithm:

docker exec -it mrbigr_env MRBIGR.py pheno -merge -p MRBIGR_data/blup_traits_final.new.csv -o blup_traits_final.blup -od MRBIGR_output/demo -mm blup

The subcommand pheno invokes the phenotype analysis module; parameter -merge calls the phenotype merge function; -p is the input phenotype matrix for a trait in CSV format; -o is the prefix of output file; -od is the output directory; -mm is the merge method. After this step, a phenotype file named blup_traits_final.blup.phe.csv with BLUP merged phenotypic values would be generated.

Note: all of the above output files are generated in the MRBIGR_output/demo/pheno directory.

3.3 GWAS and SAL detection

Taking the PLINK-bed format genotypic data and CSV format phenotypic data as inputs, GWAS can be performed using the command below:

docker exec -it mrbigr_env MRBIGR.py gwas -gwas -model gemma_mlm -thread 12 -g MRBIGR_data/chr_HAMP -p MRBIGR_data/AMP_kernel_transcriptome_v4 -od MRBIGR_output/demo  

The subcommand gwas invokes the GWAS and SAL analysis module; parameter -gwas calls the GWAS function; -model is the model to fit, with the linear mixed model (lmm) implemented in GEMMA used by default; -thread is the number of threads used to run the command; -g is the PLINK-bed format input genotypic data; -p is the CSV format phenotypic data; -od is the output directory. After this step, an output directory named MRBIGR_output/demo/gwas/gemma/lmm would be generated with the GWAS result files named Zm00001dxxxxx.assoc.txt in it.

Note: the exact output directory depends on the model you selected; with -model gemma_mlm the final output directory will be MRBIGR_output/demo/gwas/gemma/lmm.
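To verify the per-trait result files (a minimal sketch; the file naming Zm00001dxxxxx.assoc.txt follows the description above):

ls MRBIGR_output/demo/gwas/gemma/lmm/*.assoc.txt | wc -l                     # one result file per analysed trait
head -3 "$(ls MRBIGR_output/demo/gwas/gemma/lmm/*.assoc.txt | head -1)"      # preview one GWAS result file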

Then, SAL can be determined using the below command based on the GWAS results:

docker exec -it mrbigr_env MRBIGR.py gwas -qtl -g MRBIGR_data/chr_HAMP -thread 6 -i MRBIGR_output/demo/gwas/gemma/lmm -o qtl_output -od MRBIGR_output/demo -p1 1e-5 -p2 1e-3 -p2n 5  

The subcommand gwas invokes the GWAS and SAL analysis module; parameter -qtl calls the SAL detection function; -thread is the number of threads used to run the command; -i is the GWAS result directory; -o is the output file prefix; -od is the output directory; -p1 is the significance threshold for index SNPs used to determine SALs; -p2 is the secondary significance threshold for clumped SNPs used to assess the reliability of SALs; -p2n is the number of secondary significant SNPs required in a SAL. After this step, an output file named qtl_output.qtl_res.csv would be generated in MRBIGR_output/demo/gwas/qtl.
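To preview the detected SALs (a minimal sketch, assuming one SAL per row after the header):

head -3 MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv
tail -n +2 MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv | wc -l    # number of detected SALs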

If you want to generate Manhattan plots and Q-Q plots to visualize the GWAS results, the command below should be helpful:

docker exec -it mrbigr_env MRBIGR.py gwas -visual -thread 12 -i MRBIGR_output/demo/gwas/gemma/lmm -od MRBIGR_output/demo -multi -q MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv   

The subcommand gwas invokes the GWAS and SAL analysis module; parameter -visual calls the visualization function; -thread is the number of threads used to run the command; -i is the GWAS result directory; -od is the output directory for the generated plot files; parameters -multi and -q are optional: if -multi is set to generate a multi-trait Manhattan plot, -q, which specifies the SAL result file, also needs to be set.

3.4 Mendelian randomization analysis

Taking the PLINK-bed format genotypic data, CSV format exposure data, CSV format exposure SAL data, and CSV format outcome data as inputs, MR analysis can be performed using the command below:

docker exec -it mrbigr_env MRBIGR.py mr -g MRBIGR_data/chr_HAMP -exposure MRBIGR_data/AMP_kernel_transcriptome_v4 -qtl MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv -outcome MRBIGR_data/blup_traits_final.new.csv -mlm -thread 12 -o g2p -od MRBIGR_output/demo 

The subcommand mr invokes the MR analysis module; parameter -exposure is the CSV format exposure data; -outcome is the CSV format outcome data; -qtl is the CSV format exposure SAL data; -thread is the number of threads used to run the command; -g is the PLINK-bed format input genotypic data; -mlm indicates that the MR analysis is performed with a mixed linear model; -o is the prefix of the output file; -od is the output directory. After this step, an MR result file named g2p.MR.csv would be generated.
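To preview the MR results (a minimal sketch; the file is written to the mr output directory mentioned in the note at the end of this section):

head -3 MRBIGR_output/demo/mr/g2p.MR.csv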

MRBIGR can also take the CSV format population gene expression data and CSV format gene SAL data as inputs; pairwise gene-gene MR analysis can then be performed using the command below:

docker exec -it mrbigr_env MRBIGR.py mr -g MRBIGR_data/chr_HAMP -gene_exp MRBIGR_data/AMP_kernel_transcriptome_v4 -pairwise -mlm -qtl MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv -thread 12 -o gene_pairwise_mr_out -od MRBIGR_output/demo

The subcommand mr invokes the MR analysis module; parameter -gene_exp is the CSV format population gene expression data; -qtl is the CSV format gene SAL data; -thread is the number of threads used to run the command; -g is the PLINK-bed format input genotypic data; -mlm indicates that the MR analysis is performed with a mixed linear model; -o is the prefix of the output file; -od is the output directory. After this step, an MR result file named gene_pairwise_mr_out.MR.csv would be generated.

If you want to perform MR analysis between transcription factors and the genes they target, the commands below should be helpful:

mkdir MRBIGR_output/demo/tmp/
head -100 MRBIGR_data/gene.anno.tsv > MRBIGR_output/demo/tmp/test_head100_genefunc.txt
tail -100 MRBIGR_data/gene.anno.tsv > MRBIGR_output/demo/tmp/test_tail100_genefunc.txt

docker exec -it mrbigr_env MRBIGR.py mr -g MRBIGR_data/chr_HAMP -gene_exp MRBIGR_data/AMP_kernel_transcriptome_v4 -tf MRBIGR_output/demo/tmp/test_head100_genefunc.txt -target MRBIGR_output/demo/tmp/test_tail100_genefunc.txt -mlm -qtl MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv -thread 12 -o tf_test_mr_out -od MRBIGR_output/demo

The subcommand mr invokes the MR analysis module; parameter -gene_exp is the CSV format population gene expression data; -qtl is the CSV format transcription factor SAL data; -tf is the transcription factor annotation file; -target is the annotation file for genes targeted by transcription factors; -thread is the number of threads used to run the command; -g is the PLINK-bed format input genotypic data; -mlm indicates that the MR analysis is performed with a mixed linear model; -o is the prefix of the output file; -od is the output directory. After this step, an MR result file named tf_test_mr_out.MR.csv would be generated.

Note: all of the above output files are generated in the MRBIGR_output/demo/mr directory.

3.5 MR-based network construction

Taking the CSV format MR results as input, MR-based network analysis can be performed using the command below:

docker exec -it mrbigr_env MRBIGR.py net -mr MRBIGR_output/demo/mr/gene_pairwise_mr_out.MR.csv -plot -o net_out -od MRBIGR_output/demo

The subcommand net invokes the MR-based network analysis module; parameter -mr is the CSV format MR results; -plot indicates that a network plot is drawn for each identified network module; -o is the prefix of the output file; -od is the output directory. After this step, the network edge-list file net_out.edge_list, the ClusterONE result net_out.clusterone.result.csv, the network module plots net_out.module*.pdf, and the final network module file net_out.module.csv would be generated.

Note: all of the above output files are generated in the MRBIGR_output/demo/net directory.

3.6 Gene ontology analysis of network modules

Taking the CSV format network module data, the tabular format gene ontology annotation of each gene, and the tabular format gene annotation data as inputs, gene ontology enrichment analysis of each module can be performed using the commands below:

csvtk sep -f 3 -t MRBIGR_data/gene.anno.tsv -s ":" -n chr,start-end,strand | csvtk sep -f start-end -t -s "-" -n start,end | csvtk cut -f 1,5,8,9,7,2 -t | csvtk rename -t -f 1-6 -n "gene_id,chr,start,end,strand,annotation" -o MRBIGR_output/demo/tmp/gene_anno.txt

docker exec -it mrbigr_env MRBIGR.py go -gene_lists MRBIGR_output/demo/net/net_out.module.csv -go_anno MRBIGR_data/maize.genes2go.txt -gene_info MRBIGR_output/demo/tmp/gene_anno.txt -o go_out -od MRBIGR_output/demo

The subcommand go invokes the gene ontology enrichment analysis module; parameter -gene_lists is the CSV format network module data; -go_anno is the tabular format gene ontology annotation of each gene; -gene_info is the tabular format gene annotation data; -o is the prefix of the output file; -od is the output directory. After this step, the gene ontology enrichment result go_out.GO.csv and the visualized enrichment results go_out.BP.dotplot.pdf, go_out.MF.dotplot.pdf and go_out.CC.dotplot.pdf would be generated.

Note: all of the above output files are generated in the MRBIGR_output/demo/go directory.