Quickstart
This section is mainly intended for the installation of MRBIGR and the briefly introduction of several typical usage of MRBIGR to help users get started quickly. Details and advanced usage of MRBIGR, please see Tutorial section.
1. Installation of MRBIGR
1.1 Download MRBIGR
git clone https://gitee.com/crazyhsu/MRBIGR.git
1.2 Installation using docker (recommended)
cd MRBIGR
docker build -t mrbigr_image .
curr_dir=$(pwd)
docker run -dit --name mrbigr_env -p 3838:3838 \
-v $curr_dir/MRBIGR_data:/root/MRBIGR/MRBIGR_data \
-v $curr_dir/MRBIGR_output/demo:/root/MRBIGR/MRBIGR_output/demo \
-v $curr_dir/MRBIGR_log:/root/MRBIGR/MRBIGR_log \
-e SHINY_INPUT_PATH=/root/MRBIGR/MRBIGR_data \
-e SHINY_OUTPUT_PATH=/root/MRBIGR/MRBIGR_output/demo \
-e SHINY_LOG_PATH=/root/MRBIGR/MRBIGR_log \
-e SHINY_ROOT_PATH=/root/MRBIGR \
mrbigr_image:latest
If no exception is thrown out, please enter the http://localhost:3838 in the browser and then you will see the following GUI:

Note: If you encounter the issue like below:
#0 building with "default" instance using docker driver
#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 1.10kB done
#1 DONE 0.2s
#2 [internal] load metadata for docker.io/crazyhsu/mrbigr_env:v1.2
#2 ERROR: failed to authorize: failed to fetch anonymous token: Get "https://auth.docker.io/token?scope=repository%3Acrazyhsu%2Fmrbigr_env%3Apull&service=registry.docker.io": dial tcp [2a03:2880:f130:83:face:b00c:0:25de]:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
------
> [internal] load metadata for docker.io/crazyhsu/mrbigr_env:v1.2:
------
Dockerfile:1
--------------------
1 | >>> FROM crazyhsu/mrbigr_env:v1.2
2 |
3 | ENV CONDA_DEFAULT_ENV=mrbigr
--------------------
ERROR: failed to solve: crazyhsu/mrbigr_env:v1.2: failed to authorize: failed to fetch anonymous token: Get "https://auth.docker.io/token?scope=repository%3Acrazyhsu%2Fmrbigr_env%3Apull&service=registry.docker.io": dial tcp [2a03:2880:f130:83:face:b00c:0:25de]:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
Please try the following code:
cd MRBIGR
docker pull crazyhsu/mrbigr_env:v1.2 # direct pull the images without authorize
docker build -t mrbigr_image .
# or
wget https://zenodo.org/records/13955396/files/mrbigr_env_v1.2.tar
docker load -i mrbigr_env_v1.2.tar
docker build -t mrbigr_image .
Then run the docker container:
curr_dir=$(pwd)
docker run -dit --name mrbigr_env -p 3838:3838 \
-v $curr_dir/MRBIGR_data:/root/MRBIGR/MRBIGR_data \
-v $curr_dir/MRBIGR_output/demo:/root/MRBIGR/MRBIGR_output/demo \
-v $curr_dir/MRBIGR_log:/root/MRBIGR/MRBIGR_log \
-e SHINY_INPUT_PATH=/root/MRBIGR/MRBIGR_data \
-e SHINY_OUTPUT_PATH=/root/MRBIGR/MRBIGR_output/demo \
-e SHINY_LOG_PATH=/root/MRBIGR/MRBIGR_log \
-e SHINY_ROOT_PATH=/root/MRBIGR \
mrbigr_image:latest
1.3 Installation using conda
1.3.1 Build and install
conda create -n mrbigr python=3.7 -y
conda activate mrbigr
1.3.2 Install dependencies
cd MRBIGR
python setup.py build
python setup.py install
1.3.3. Install dependencies
pip install pyranges
conda install -y -c conda-forge r-rcppeigen r-xml r-rsqlite r-europepmc r=3.6 rpy2 vcftools
Rscript -e 'install.packages(c("data.table", "ggplot2", "ggsignif", "ggrepel","Matrix", "igraph", "network", "GGally", "sna","tidyr","ggraph","lme4","ggpubr","pheatmap","factoextra", "R.utils"), repos="https://cloud.r-project.org")'
Rscript -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager", repos="https://cloud.r-project.org");BiocManager::install(c("AnnotationForge","clusterProfiler","ggtree"))'
Rscript -e 'install.packages("bigsnpr", dependence=T, repos="https://cloud.r-project.org")'
Rscript -e 'install.packages("https://cran.r-project.org/src/contrib/Archive/FactoMineR/FactoMineR_1.42.tar.gz", repos=NULL)'
1.3.4 Set environment variables
echo "export PATH=`pwd`/utils:\$PATH" >> ~/.bashrc
echo "export LD_LIBRARY_PATH=`pwd`/utils/libs:\$LD_LIBRARY_PATH" >> ~/.bashrc
source ~/.bashrc
1.3.5 Set up MRBIGR GUI app
Rscript -e "install.packages(c('shiny','bslib','shinyFiles','shinyalert'), repos='https://mirrors.tuna.tsinghua.edu.cn/CRAN/')"
Rscript -e "shiny::runApp('MRBIGR-I/app.R', host = '0.0.0.0', port = 3838)"
Depends: R(>=3.6), python(>=3.7)
2. Data preparation
Using MRBIGR for basic analysis requires only genotypic and phenotypic data. Users can perform advanced functional analyses and generate visualizations through the use of built-in functions and their combinations (Details in the document). The example datasets used below are from MRBIGR_data, which contains all the data used for our paper "MRBIGR: a versatile toolbox for genetic causal inference in population-scale multi-omics data".
# download and unpack MRBIGR_data.tar.gz
wget https://zenodo.org/records/13955396/files/MRBIGR_data.20241019.tar.gz -O MRBIGR_data.tar.gz
tar -xvzf MRBIGR_data.tar.gz --transform 's,^MRBIGR_data.20241019,MRBIGR_data,'
rm MRBIGR_data.tar.gz
mkdir MRBIGR_output/demo
2.1 Genotypic data
MRBIGR uses PLINK-bed file format for Genotypic data by default (details see http://zzz.bwh.harvard.edu/plink/data.shtml#bed). The built-in geno module supports converting HapMap or VCF formatted genotype files to PLINK-bed format, as well as converting PLINK-bed format to VCF format.
2.1.1 HapMap file format
Hapmap is a commonly used format for storing sequence data where SNP information is stored in the rows and taxa information is stored in the columns. This format allows the SNP information (chromosome and position) and genotypes of each taxa to be stored in a single file.
The first 11 columns display attributes of the SNPs and the remaining columns show the nucleotides observed at each SNP for each taxa. The first row contains the header labels and each remaining row contains all the information for a single SNP. The first nine SNPs from the tutorial data (MRBIGR_data/chr_HAMP_female.hmp) are presented below.
| rs# | alleles | chrom | posi | strand | assembly | center | protLSID | assayLSID | panelLSID | QCdode | CIMBL32 | CIMBL89 | CIMBL7 | CIMBL45 | ZHENG58 | ... | CIMBL46 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chr10.s_74401 | C/T | 10 | 74401 | + | We | Are | The | Maize | Re | Seq | CC | TT | CC | TT | CC | ... | TT |
| chr10.s_76702 | A/G | 10 | 76702 | + | We | Are | The | Maize | Re | Seq | GG | AA | GG | AA | GG | ... | AA |
| chr10.s_92823 | C/T | 10 | 92823 | + | We | Are | The | Maize | Re | Seq | CC | CC | CC | CC | CC | ... | CC |
| chr10.s_94339 | T/A | 10 | 94339 | + | We | Are | The | Maize | Re | Seq | AA | TT | AA | TT | AA | ... | TT |
| chr10.s_94579 | C/G | 10 | 94579 | + | We | Are | The | Maize | Re | Seq | CC | GG | CC | GG | CC | ... | GG |
| chr10.s_94901 | G/A | 10 | 94901 | + | We | Are | The | Maize | Re | Seq | AA | AA | AA | AA | AA | ... | AA |
| chr10.s_94930 | T/G | 10 | 94930 | + | We | Are | The | Maize | Re | Seq | GG | TT | GG | TT | GG | ... | TT |
| chr10.s_108603 | C/T | 10 | 108603 | + | We | Are | The | Maize | Re | Seq | TT | CC | TT | CC | TT | ... | CC |
| chr10.s_108610 | C/T | 10 | 108610 | + | We | Are | The | Maize | Re | Seq | TT | CC | TT | CC | TT | ... | CC |
2.1.2 VCF file format
VCF (Variant Call Format) file is a standard text file format for storing genomic variation information. This format is widely used in genomic research, especially when it comes to the difference between the genome sequence of an individual or a population and the reference genome. VCF files can describe multiple types of genetic variations, such as single nucleotide polymorphisms (SNPs), insertions/deletions (InDels), structural variations, etc.
The basic structure of a VCF file consists of two parts: header information and data lines. The header information starts with ## and contains the file's metadata such as format version, generation program, etc., and defines the column name with the #CHROM line; the data part lists the information of each variant site in detail, including chromosome number, position, identifier, reference and alternative bases, quality score, filtering status, additional information and sample-specific data. The first five SNPs from the converted data (MRBIGR_output/demo/geno/chr_HAMP_female.vcf.vcf) are presented below.
##fileformat=VCFv4.2
##fileDate=20241004
##source=PLINKv1.90
##contig=<ID=10,length=150903390>
##contig=<ID=1,length=306984567>
##contig=<ID=2,length=209800297>
##contig=<ID=3,length=235647547>
##contig=<ID=4,length=246987745>
##contig=<ID=5,length=223818355>
##contig=<ID=6,length=173410493>
##contig=<ID=7,length=182367039>
##contig=<ID=8,length=181101714>
##contig=<ID=9,length=159764980>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CIMBL32 CIMBL89 CIMBL7 CIMBL45 ZHENG58 CML415 CIMBL46 CIMBL70 CIMBL124 CIMBL68 CIMBL77 CML324 CML326 CIMBL69 CIMBL23 CIMBL71
10 74401 chr10.s_74401 C T . . PR GT 0/0 1/1 0/0 1/1 0/0 1/1 1/1 1/1 0/0 1/1 1/1 0/0 0/1 1/1 1/1 1/1
10 76702 chr10.s_76702 G A . . PR GT 0/0 1/1 0/0 1/1 0/0 1/1 1/1 1/1 0/0 1/1 1/1 0/0 0/1 1/1 1/1 1/1
10 92823 chr10.s_92823 C T . . PR GT 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
10 94339 chr10.s_94339 A T . . PR GT 0/0 1/1 0/0 1/1 0/0 1/1 1/1 1/1 0/0 1/1 1/1 0/0 1/1 1/1 1/1 1/1
10 94579 chr10.s_94579 C G . . PR GT 0/0 1/1 0/0 1/1 0/0 1/1 1/1 1/1 0/0 1/1 1/1 0/0 1/1 1/1 1/1 1/1
2.2 Phenotypic data
Phenotypic file includes multiple phenotype columns. Taxa names should be in the first column of the phenotypic data file and the remaining columns should contain the observed phenotype from each individual. Missing data should be indicated by either NaN or NA. MRBIGR supports many types of datasets as phenotypic file, as long as the data is provided in comma-separated value format, i.e., CSV file. The three most commonly used data types are presented below.
2.2.1 Trait file
The first nine observations in the tutorial data (MRBIGR_data/blup_traits_final.new.csv) are displayed as follows:
| Trait | Plantheight | Earheight | Earleafwidth | Earleaflength | Tasselmainaxislength | Tasselbranchnumber | Leafnumberaboveear | Earlength | Eardiameter | Cobdiameter | Kernernumberperrow | 100grainweight | cobweight | Kernelwidth | Silkingtime | Pollenshed | Headingdate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 150 | 199.40686200000002 | 84.09240313 | 8.415551016 | 62.57602963 | 26.18701812 | 7.0083920310000005 | 5.446805873 | 13.16553466 | 3.814461974 | 2.3336648540000002 | 24.01368699 | 18.66442693 | 18.73223124 | 6.948005094 | 69.5634 | 67.3508 | 64.3586 |
| 177 | 167.7376926 | 62.32666383 | 8.336793644 | 75.84687159999999 | 26.86362604 | 8.25953511 | 5.364463679 | 13.24180052 | 3.664731428 | 2.3747896369999997 | 23.22662586 | 22.17517342 | 16.69770013 | 7.95607262 | 69.1351 | 67.5734 | 66.1437 |
| 238 | 195.1777397 | 65.90031644 | 9.38907964 | 78.97464118 | 27.64054122 | 9.749416145 | 5.498509112000001 | 13.84549223 | 3.7085016239999997 | 2.4834557630000003 | 23.399574100000002 | 21.94336045 | 17.4338478 | 8.18091486 | 74.5957 | 71.9141 | 69.8254 |
| 268 | 160.7956655 | 52.69387703 | 8.844888078 | 70.42957469 | 28.36394677 | 6.981059882 | 5.635426946 | 10.73827719 | 3.66728469 | 2.257745983 | 19.08355482 | 18.86634823 | 14.05614498 | 7.360464037000001 | 71.2765 | 68.90899999999999 | 65.6974 |
| 501 | 170.01783559999998 | 62.25478889 | 8.796211646 | 67.83068252 | 26.54872089 | 6.3387543729999996 | 5.772344781 | 10.33143474 | 4.086384316 | 2.546535413 | 22.25759352 | 25.83110804 | 17.26231825 | 8.435537529 | 71.2765 | 69.7994 | 66.2553 |
| 647 | 143.4828657 | 68.96951963 | 6.5188109789999995 | 65.55214977 | 25.94091108 | 17.84754152 | 4.351873538 | 6.8750427389999995 | 2.557710224 | 1.4466421669999998 | 16.79624763 | 11.789889599999999 | 4.497660526000001 | 5.1284208730000005 | 72.24015 | 70.9124 | 66.7015 |
| 812 | 164.7820379 | 57.30049720000001 | 8.435240359 | 71.97071207 | 27.00737492 | 7.468018377999999 | 4.7605753260000006 | 11.73433366 | 3.678227239 | 2.319895255 | 21.14170296 | 22.99261918 | 11.90201965 | 8.541258052 | 68.7068 | 66.4604 | 65.6974 |
| 1323 | 183.0386928 | 70.55827585 | 10.53762464 | 70.99257322 | 32.98852183 | 5.730474602 | 5.672768174 | 12.49229884 | 4.136537665 | 2.9266879930000003 | 21.17811141 | 22.25630796 | 24.18210638 | 8.20176116 | 75.4523 | 73.5836 | 71.9452 |
| 1462 | 211.1751917 | 88.14599481 | 9.699733717 | 84.89124385 | 34.74483117 | 11.73294927 | 6.547893356 | 17.13560948 | 4.340433828 | 2.470617348 | 32.87164817 | 25.7713247 | 22.36770487 | 8.295569512 | 70.9553 | 69.0203 | 68.0403 |
2.2.2 Gene expression file
The first nine observations in the tutorial data (MRBIGR_data/AMP_kernel_transcriptome_v4_FPKM) are displayed as follows:
| Zm00001d007718 | Zm00001d007716 | Zm00001d018986 | Zm00001d048402 | Zm00001d048403 | Zm00001d048404 | Zm00001d048407 | Zm00001d048342 | Zm00001d048393 | Zm00001d048394 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 05W002 | 13.68390968 | 0.08887555 | 22.46513484 | 35.2460181 | 90.0115652 | 4.245998763 | 2.072609286 | 0.8325380590000001 | 9.159937637999999 | 67.86617791 |
| 05WN230 | 0.41449239299999996 | 0.043290454 | 24.22802224 | 38.30711955 | 77.83880474 | 2.5675474190000003 | 6.337722396 | 0.359176224 | 11.66285576 | 95.13785552 |
| 07KS4 | 1.514937182 | 0.094552666 | 20.89936082 | 30.22573463 | 89.46284399999999 | 2.86870803 | 4.861722385 | 1.344393716 | 4.40600588 | 96.44533135 |
| 1462 | 2.9976042960000004 | 0.069739101 | 17.11780085 | 19.00876018 | 65.92898232 | 2.765295001 | 5.0446673319999995 | 0.34219332 | 8.192603334 | 78.60389131 |
| 150 | 5.716710774 | 0.018280516 | 22.47700014 | 0.211730029 | 64.85073286 | 3.2508831189999996 | 2.072331077 | 0.41587338700000004 | 9.749669454 | 59.98792917 |
| 177 | 0.38862201799999996 | 0.083882895 | 23.34402367 | 58.74669505 | 72.21102064 | 4.7976829819999995 | 2.037686245 | 0.654807971 | 13.24835189 | 66.17025883 |
| 18-599 | 0.649410418 | 0.082275625 | 20.45144972 | 0.46588138700000004 | 78.06259285 | 3.971991299 | 5.0632273969999995 | 0.7951806290000001 | 13.01598039 | 74.97802633 |
| 238 | 9.543760621 | 0.095198605 | 20.834385100000002 | 14.73831069 | 71.00077488 | 2.627118969 | 3.4174614439999997 | 0.9766994040000001 | 1.9633184590000001 | 36.03050858 |
| 268 | 1.030472507 | 0.11706547 | 17.722316199999998 | 36.98853947 | 163.2058748 | 1.9805356109999999 | 15.65963041 | 1.206266576 | 9.333207445 | 101.6617829 |
2.2.3 Metabolite abundance file
The first nine observations in the tutorial data (MRBIGR_data/E3_log2.normalized_phe.csv) are displayed as follows:
| ID | n0004 | n0006 | n0012 | n0014 | n0016 | n0019 | n0020 | n0025 | n0029 | n0033 |
|---|---|---|---|---|---|---|---|---|---|---|
| 150 | 23.85756808279454 | 21.52133253560975 | 19.429310701409214 | 17.88465348347412 | 23.728581650951327 | 23.323886129502455 | 23.144138063805507 | 20.50688186826299 | 21.385744989040916 | 17.91415735710555 |
| 177 | 24.060851668708853 | 22.93337099206989 | 18.037252010387334 | 18.91707191416155 | 26.173789983480603 | 22.863251815465887 | 22.931568749661043 | 19.489348200901247 | 22.200602015093704 | 18.037252010387334 |
| 238 | 23.697103421767665 | 22.049263944412104 | 19.015635556176196 | 19.50881803882359 | 24.13320250890116 | 22.41699565411851 | 23.9225235204567 | 19.36879843924739 | 22.117435431711307 | 18.779568681864543 |
| 268 | 24.31008026193824 | 22.356490926193132 | 19.89504417298552 | 19.930126596592963 | 24.296141071653768 | 23.39100031911556 | 23.101493731065926 | 19.71230028530486 | 22.815189581096966 | 17.919986412355946 |
| 647 | 24.516531130157624 | 22.012226573732015 | 20.2065768091335 | 20.477537926576417 | 25.08131572946815 | 24.20955339098966 | 24.141022012936638 | 20.396237881758427 | 21.164229940027102 | 19.23724964702257 |
| 1462 | 24.117435193904463 | 21.826871578478677 | 20.447584725205136 | 17.80499437272362 | 25.354474345565613 | 22.707672768009736 | 22.87829961658445 | 18.925789113705253 | 21.995071856772647 | 17.82997662112729 |
| 4019 | 24.492283584929798 | 22.767492845636664 | 21.008812251998197 | 18.7339719183688 | 24.598325218007993 | 22.699223579410997 | 23.142580918608278 | 19.28771463375869 | 21.938764429607698 | 17.95447465132481 |
| 5213 | 24.759387644740613 | 21.953048652078188 | 21.688592242664424 | 20.353802646644912 | 26.400803379275068 | 23.958368717176786 | 23.16269388084486 | 20.56383771559438 | 21.78755866516714 | 17.638216698650762 |
| 5237 | 25.15653497289918 | 22.359175010264977 | 19.336473870765303 | 18.4374635629781 | 25.967192501047197 | 22.846133284156966 | 23.165763445593072 | 19.784368059810987 | 21.540378285014388 | 18.86966914228456 |
3. Quick start of MRBIGR (command line based)
MRBIGR consists of seven analysis modules: genotype based analysis module geno, phenotype based analysis module pheno, GWAS and SAL related analysis module gwas, Mendelian randomization ralated analysis (three module:Mendelian randomization analysis module mr, MR-based network construction module net and gene ontology analysis module go) and data visulaization module plot. Each module can be invoked via a subcommand, and each module also provides several functions which can be called with corresponding parameters.
In this part, only several commonly used functions in each module are shown as examples. For more detailed instructions or GUI operation, please refer to the Tutorial section.
3.1 Genotypic data process
If you hava downloaded MRBIGR_data and unpacked it, you will see a set of plink-bed format genotypic data named chr_HAMP.bed/chr_HAMP.bim/chr_HAMP.fam.
First, quality control of the original genotypic data should be performed through the below command:
docker exec -it mrbigr_env MRBIGR.py geno -qc -g chr_HAMP -o geno_qc -od MRBIGR_output/demo -maf 0.05 -mis 0.2 -mind 0.2
The subcommand geno invokes the genotype analysis module; parameter -qc calls the quality control function; -g is the prefix of plink-bed format input genotypic data; -o is the output genotypic data prefix; -od is the output directory; -maf is the MAF for a SNP to be kept; -mis is the maximum proportion of missing values for a SNP to be kept; -mind is the maximum proportion of missing values for a sample to be kept. After this step, the QC-filtered genotypic data geno_qc.bed/geno_qc.bim/geno_qc.fam would be generated.
Then, dimensionality reduction of the original genotypic data could be performed using the below command:
docker exec -it mrbigr_env MRBIGR.py geno -clump -g MRBIGR_output/demo/geno/geno_qc -o geno -od MRBIGR_output/demo
The subcommand geno invokes the genotype analysis module; parameter -clump calls the genotype clumping function to keep only one representative SNP per region of LD; -g is the prefix of plink-bed format input genotypic data; -o is the output genotypic data prefix (suffix _clump will be added automatically for the output files); -od is the output directory. After this step, the clumped genotypic data geno_clump.bed/geno_clump.bim/geno_clump.fam would be generated.
Last, take the dimensional reduced genotypic data as input to perform principal component analysis using the below command:
docker exec -it mrbigr_env MRBIGR.py geno -pca -g MRBIGR_output/demo/geno/geno_clump -o geno -od MRBIGR_output/demo
The subcommand geno invokes the genotype analysis module; parameter -pca calls the principal component analysis function; -g is the prefix of the plink-bed format input genotypic data; -o is the output genotypic data prefix (a suffix _pca will be added automatically for the output files); -od is the output directory. After this step, an output file named geno_pca.csv would be generated.
Note: the all output will generated in MRBIGR_output/demo/geno directory.
3.2 Phenotypic data process
For the example CSV format phenotype file named blup_traits_final.new.csv in MRBIGR_data, the quality control of the original phenotypic data should be performed through the below command:
docker exec -it mrbigr_env MRBIGR.py pheno -qc -p blup_traits_final.new.csv -o blup_traits_final_qc -od MRBIGR_output/demo -mis 0.5 -val 0.1 -rout zscore
The subcommand pheno invokes the phenotype analysis module; parameter -qc calls the quality control function; -p is the input phenotype matrix; -o is the prefix of output file; -od is the output directory; -mis is the missing rate cutoff; -val is the small value cutoff; -rout means outlier removal of phenotypic values with the default method. After this step, a QC-filtered Phenotypic data named blup_traits_final_qc.phe.csv would be generated.
Then, take this file as input, perform missing phenotypic value imputation through the below command:
docker exec -it mrbigr_env MRBIGR.py pheno -imput -p MRBIGR_output/demo/pheno/blup_traits_final_qc.phe.csv -o blup_traits_final_qc_imput -od MRBIGR_output/demo
The subcommand pheno invokes the phenotype analysis module; parameter -imput calls the phenotype imputation function; -p is the input phenotype matrix; -o is the prefix of output file; -od is the output directory. After this step, a phenotype file named blup_traits_final_qc_imput.phe.csv with no NA value would be generated.
Last, take this file as input to perform phenotypic value normalization using the below command:
docker exec -it mrbigr_env MRBIGR.py pheno -scale -p MRBIGR_output/demo/pheno/blup_traits_final_qc_imput.phe.csv -o blup_traits_final_qc_imput_norm -od MRBIGR_output/demo -boxcox -minmax
The subcommand pheno invokes the phenotype analysis module; parameter -scale calls the phenotype scaling/normalization/standardization/transformation function; -p is the input phenotype matrix; -o is the prefix of output file; -od is the output directory; -boxcox and -minmax means perform both box-cox normalization and min-max scaling of the input data. After this step, a phenotype file named blup_traits_final_qc_imput_norm.phe.csv with normalized and scaled phenotypic values would be generated.
In some cases, phenotypic data are collected from different environment or years, which should be merged through appropriate algorithm such as mean-values or BLUP before further analysis. Here is an example, take one trait from different environment as input, the below command line will merge all the columns using BLUP algorithm:
docker exec -it mrbigr_env MRBIGR.py pheno -merge -p MRBIGR_data/blup_traits_final.new.csv -o blup_traits_final.blup -od MRBIGR_output/demo -mm blup
The subcommand pheno invokes the phenotype analysis module; parameter -merge calls the phenotype merge function; -p is the input phenotype matrix for a trait in CSV format; -o is the prefix of output file; -od is the output directory; -mm is the merge method. After this step, a phenotype file named blup_traits_final.blup.phe.csv with BLUP merged phenotypic values would be generated.
Note: the all output will generated in MRBIGR_output/demo/pheno directory.
3.3 GWAS and SAL detection
Take the plink-bed format genotypic data and CSV format phenotypic data as inputs, GWAS can be performed through the below command:
docker exec -it mrbigr_env MRBIGR.py gwas -gwas -model gemma_mlm -thread 12 -g MRBIGR_data/chr_HAMP -p MRBIGR_data/AMP_kernel_transcriptome_v4 -od MRBIGR_output/demo
The subcommand gwas invokes the GWAS and SAL analysis module; parameter -gwas calls the GWAS function; -model is the model to fit, with linear mixed model (lmm) in gemma by default; -thread is the thread number to run the command; -g is the plink-bed format input Genotypic data; -p is the CSV format phenotypic data; -od is the output directory. After this step, an output directory named MRBIGR_output/demo/gwas/gemma/lmm would be generated with the GWAS result files named Zm00001dxxxxx.assoc.txt in it.
Note: the exact output directory is depend on the model you selected, where -model gemma_mlm means the final output directory will be MRBIGR_output/demo/gwas/gemma/lmm.
Then, SAL can be determined using the below command based on the GWAS results:
docker exec -it mrbigr_env MRBIGR.py gwas -qtl -g MRBIGR_data/chr_HAMP -thread 6 -i MRBIGR_output/demo/gwas/gemma/lmm -o qtl_output -od MRBIGR_output/demo -p1 1e-5 -p2 1e-3 -p2n 5
The subcommand gwas invokes the GWAS and SAL analysis module; parameter -qtl calls the SAL detection function; -thread is the thread number to run the command; -i is the GWAS result directory; -o is the output file prefix; -od is the output directory; -p1 is the significance threshold for index SNPs used to determine SALs; -p2 is the secondary significance threshold for clumped SNPs used to determine the reliability of SALs; -p2n is secondary significance SNP number in a SAL. After this step, an output file named qtl_output.qtl_res.csv would be generated in MRBIGR_output/demo/gwas/qtl.
If you want to generate manhattan-plots and QQ-plots for visualization of the GWAS results, the below command should be helpful:
docker exec -it mrbigr_env MRBIGR.py gwas -visual -thread 12 -i MRBIGR_output/demo/gwas/gemma/lmm -od MRBIGR_output/demo -multi -q MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv
The subcommand gwas invokes the GWAS and SAL analysis module; parameter -visual calls the visualization function; -thread is the thread number to run the command; -i is the GWAS result directory; -od is the output directory for the generated plots files; parameters -multi and -q are optional, if -multi is set to generate a multi-traits manhattan-plot, -q which is the SAL result file also need to be set.
3.4 Mendelian randomization (MR) related analysis
Take the plink-bed format genotypic data, CSV format exposure data, CSV format exposure SAL data and CSV format outcome data as inputs, MR analysis can be performed through the below command:
docker exec -it mrbigr_env MRBIGR.py mr -g MRBIGR_data/chr_HAMP -exposure MRBIGR_data/AMP_kernel_transcriptome_v4 -qtl MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv -outcome MRBIGR_data/blup_traits_final.new.csv -mlm -thread 12 -o g2p -od MRBIGR_output/demo
The subcommand mr invokes the MR analysis module; parameter -exposure is the CSV format exposure data; -outcome is the CSV format outcome data; -qtl is the CSV format exposure SAL data; -thread is the thread number to run the command; -g is the plink-bed format input Genotypic data; -mlm represents perform MR analysis through mixed linear model; -o is the prefix of output file; -od is the output directory. After this step, a MR result file named g2p.MR.csv would be generated.
It also can take the CSV format population gene expression data and CSV format gene SAL data as inputs, pairwise genes MR analysis can be performed through the below command:
docker exec -it mrbigr_env MRBIGR.py mr -g MRBIGR_data/chr_HAMP -gene_exp MRBIGR_data/AMP_kernel_transcriptome_v4 -pairwise -mlm -qtl MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv -thread 12 -o gene_pairwise_mr_out -od MRBIGR_output/demo
The subcommand mr invokes the MR analysis module; parameter -gene_exp is the CSV format population gene expression data; -qtl is the CSV format gene SAL data; -thread is the thread number to run the command; -g is the plink-bed format input Genotypic data; -mlm represents perform MR analysis through mixed linear model; -o is the prefix of output file; -od is the output directory. After this step, a MR result file named gene_pairwise_mr_out.MR.csv would be generated.
If you want to perform MR analysis between transcription factor and genes it targets, the below command should be helpful:
mkdir MRBIGR_output/demo/tmp/
head -100 MRBIGR_data/gene.anno.tsv > MRBIGR_output/demo/tmp/test_head100_genefunc.txt
tail -100 MRBIGR_data/gene.anno.tsv > MRBIGR_output/demo/tmp/test_tail100_genefunc.txt
docker exec -it mrbigr_env MRBIGR.py mr -g MRBIGR_data/chr_HAMP -gene_exp MRBIGR_data/AMP_kernel_transcriptome_v4 -tf MRBIGR_output/demo/tmp/test_head100_genefunc.txt -target MRBIGR_output/demo/tmp/test_tail100_genefunc.txt -mlm -qtl MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv -threads 12 -o tf_test_mr_out -od MRBIGR_output/demo
The subcommand mr invokes the MR analysis module; parameter -gene_exp is the CSV format population gene expression data; -qtl is the CSV format transcription factor SAL data; -tf is the transcription factor annotation file; -target is the annotation files for genes targeted by transcription factors; -thread is the thread number to run the command; -g is the plink-bed format input Genotypic data; -mlm represents perform MR analysis through mixed linear model; -o is the prefix of output file; -od is the output directory. After this step, a MR result file named tf_test_mr_out.MR.csv would be generated.
Note: the all output will generated in MRBIGR_output/demo/mr directory.
3.5 MR-based network construction
Take the CSV format MR analysis data as input, MR based network analysis can be performed through the below command:
docker exec -it mrbigr_env MRBIGR.py net -mr MRBIGR_output/demo/mr/gene_pairwise_mr_out.MR.csv -plot -o net_out -od MRBIGR_output/demo
The subcommand net invokes the MR based network analysis module; parameter -mr is the CSV format MR analysis data; -plot represents plot network for each identified network module; -o is the prefix of output file; -od is the output directory. After this step, network edgelist file net_out.edge_list, ClusterONE software result net_out.clusterone.result.csv, network module plot net_out.module*.pdf and final network module file net_out.module.csv would be generated.
Note: the all output will generated in MRBIGR_output/demo/net directory.
3.6 Gene ontology analysis of network modules
Take the CSV format network module data, Tabular format gene ontology annotation of each gene data and Tabular format gene annotation data as inputs, gene ontology enrichment analysis of each module can be performed through the below command:
csvtk sep -f 3 -t MRBIGR_data/gene.anno.tsv -s ":" -n chr,start-end,strand | csvtk sep -f start-end -t -s "-" -n start,end | csvtk cut -f 1,5,8,9,7,2 -t | csvtk rename -t -f 1-6 -n "gene_id,chr,start,end,stard,annotation" -o MRBIGR_output/demo/tmp/gene_anno.txt
docker exec -it mrbigr_env MRBIGR.py go -gene_lists MRBIGR_output/demo/net/net_out.module.csv -go_anno MRBIGR_data/maize.genes2go.txt -gene_info MRBIGR_output/demo/tmp/gene_anno.txt -o go_out -od MRBIGR_output/demo
The subcommand go invokes the gene ontology enrichment analysis module; parameter -gene_lists is the CSV format network module data; -go_anno is the tabular format gene ontology annotation of each gene data; -gene_info is the tabular format gene annotation data; -o is the prefix of output file; -od is the output directory. After this step, gene ontology enrichment analysis result go_out.GO.csv, visualized results of functional enrichment results go_out.BP.dotplot.pdf, go_out.MF.dotplot.pdf and go_out.CC.dotplot.pdf would be generated.
Note: the all output will generated in MRBIGR_output/demo/go directory.