Tutorial

Introduction to the details and advanced usage of MRBIGR


In this part, the implementation of each module, function, and parameter in MRBIGR is introduced in detail, and several real-data results are shown to help users better understand the functions of MRBIGR. The example data come from MRBIGR_data and can be prepared by following the code.

The module overview of MRBIGR

$ docker exec -it mrbigr_env MRBIGR.py -h
usage: MRBIGR.py command [options]

MRBIGR: a versatile toolbox for genetic causal inference in population-scale
multi-omics data

optional arguments:
  -h, --help   show this help message and exit

command:

    geno       Genotype file based analyses
    pheno      Phenotype file based analyses
    gwas       GWAS and QTL related analyses
    mr         perform Mendelian Randomization analysis
    net        perform network analysis
    go         perform GO enrichment analysis
    plot       Visualize the results generated by MRBIGR and the data used in
               MRBIGR by various types of plots

1. Genotypic data based analysis

The genotypic data analysis module takes plink-bed format genotypic data as input and can be invoked through the subcommand geno, which includes 9 functions: -convert, -qc, -imput, -clump, -kinship, -pca, -tsne, -tree and -anno. -convert is used for genotypic data format conversion; -qc is used for quality control of genotypic data; -imput provides several simple methods for fast imputation of genotypic data; -clump is used to keep only one representative SNP per region of LD based on a clumping method, to reduce the dimensionality of the original genotypic data; -kinship is used to calculate the kinship matrix from genotypic data; -pca is used to perform principal component analysis on genotypic data; -tsne is used to perform t-SNE analysis on genotypic data; -tree is used to build an ML tree from genotypic data; -anno is used to annotate vcf-format genotypic data.

$ docker exec -it mrbigr_env MRBIGR.py geno -h
usage: MRBIGR.py geno [options]

optional arguments:
  -h, --help        show this help message and exit
  -g G              Genotype file of plink-bed format. Prefix of plink-bed
                    file is needed, that is, when the input files are
                    name.bed/name.bim/name.fam, -g should bed set to name. It
                    should be used together with these parameters:
                    –convert/-qc/-imput/-clump/-kinship/-pca/-tsne/-tree.
  -o [geno_output]  Prefix of the output files, the default value is
                    geno_output. It should be used together with these
                    parameters: –convert/-qc/-imput/-clump/-kinship/-pca/-tsne
                    /-tree/-anno.
  -od [output_dir]  Directory of the output files.
  -convert          Convert hapmap or vcf format genotype file to plink-bed
                    format files, or convert plink-bed format genotype files
                    to vcf format. It should be used together with the
                    parameters –g/-hapmap/-vcf and –o.
  -hapmap HAPMAP    The hapmap format genotype file to be converted to plink-
                    bed format. It should be used together with the parameter
                    –convert.
  -vcf VCF          The vcf format genotype file to be converted to plink-bed
                    format. It should be used together with the parameter
                    –convert.
  -qc               Quality control of the original genotype data. SNPs and
                    individuals pass the –maf, -mis and –mind cutoff would be
                    kept. It should be used together with the parameters
                    –g/-o/-maf/-mis/-mind.
  -maf [0.05]       Minimum Minor Allele Frequency (MAF) for a SNP to be kept,
                    default is 0.05. It should be used together with the
                    parameter –qc.
  -mis [0.1]        Maximum proportion of missing values for a SNP to be kept,
                    default is 0.1. It should be used together with the
                    parameter –qc.
  -mind [0.99]      Maximum proportion of missing values for a sample to be
                    kept, default is 0.99. It should be used together with the
                    parameter –qc.
  -imput            Perform fast genotype imputation via mode, mean, sampling
                    according to allele frequencies, or 0. It should be used
                    together with the parameters –g/-o/–method.
  -method [mode]    Genotype imputation method. Options: "mode" (most frequent
                    call), "random" (sampling according to allele
                    frequencies), "mean0" (rounded mean), "mean2" (rounded
                    mean to 2 decimal places). It should be used together with
                    the parameters –g/-o/–imput.
  -clump            Clumping analysis is used to keep only one representative
                    SNP per region of LD, to reduce the dimensionality of the
                    original genotype file. It should be used together with
                    the parameters –g/-o/-r2.
  -r2 [0.8]         r^2 value for the SNP clumping analysis, default is 0.8.
                    It should be used together with the parameters –clump.
  -kinship          Generate a kinship/relatedness matrix from plink-bed
                    format genotype data. It should be used together with the
                    parameters –g/-o.
  -pca              Perform principal component analysis for genotype
                    dimensionality reduction and population structure display.
                    It should be used together with the parameters –g/-o/-dim.
  -tsne             Perform t-SNE analysis to display the population
                    structure. It should be used together with the parameters
                    –g/-o/-dim.
  -dim [10|3]       Dimensions for the output of PCA or t-SNE analysis
                    results, default is 10 for PCA and 3 for t-SNE. It should
                    be used together with the parameters –pca/-tsne.
  -tree             To build a ML-tree from plink-bed format genotype data.
                    Clumping analysis or genotype pruning is recommended
                    before this step for large dataset. It should be used
                    together with the parameters –g/-o.
  -anno             To annotate the vcf-format genotype data using ANNOVAR. It
                    should be used together with the parameters -vcf
                    /-o/–db/-dbdir/-gtf/-fa.
  -db DB            The generated or input annotation database name. It should
                    be used together with the parameter -anno.
  -dbdir DBDIR      The generated/input annotation database directory name,
                    with *_refGene.txt and *_refGeneMrna.fa files in this
                    directory. It should be used together with the parameter
                    -anno.
  -gtf GTF          GTF format gene annotation file, used to generate the
                    annotation database. It should be used together with the
                    parameter -anno.
  -fa FA            Fasta format genomic reference sequence file, used to
                    generate the annotation database. It should be used
                    together with the parameter -anno.

1.1 Format conversion of genotypic data

MRBIGR supports the conversion of Hapmap/VCF format genotypic data to plink-bed format, as well as the conversion of plink-bed format to VCF format. This function can be called through the parameter -convert. The command line is as follows:

# Hapmap to plink-bed  
docker exec -it mrbigr_env MRBIGR.py geno -convert -hapmap MRBIGR_data/chr_HAMP_female.hmp -od MRBIGR_output/demo -o chr_HAMP_female.plink
# plink-bed to VCF  
docker exec -it mrbigr_env MRBIGR.py geno -convert -g MRBIGR_output/demo/geno/chr_HAMP_female.plink -od MRBIGR_output/demo -o chr_HAMP_female.vcf
# VCF to plink-bed  
docker exec -it mrbigr_env MRBIGR.py geno -convert -vcf MRBIGR_output/demo/geno/chr_HAMP_female.vcf.vcf -od MRBIGR_output/demo -o geno_vcf 

The subcommand geno invokes the genotypic data analysis module; parameter -convert calls the data format conversion function; -hapmap and -vcf are the input file names of the corresponding formats; -g is the prefix of the plink-bed format input data; -o is the prefix of the output data; -od is the output directory.

1.2 Quality control of genotypic data

The raw genotypic data may contain SNPs with low MAF (minor allele frequency) values or high missing rates, as well as samples with extremely high missing rates. Such data should be filtered before downstream analysis. The geno module provides a QC-based genotypic data filtering function which can be called through the parameter -qc. The command line is as follows:

docker exec -it mrbigr_env MRBIGR.py geno -qc -g MRBIGR_data/chr_HAMP -od MRBIGR_output/demo -o chr_HAMP_qc -maf 0.05 -mis 0.2 -mind 0.2

The subcommand geno invokes the genotype analysis module; parameter -qc calls the quality control function; -g is the plink-bed format input genotypic data; -o is the output genotypic data prefix; -od is the output directory; -maf is the minimum MAF for a SNP to be kept; -mis is the maximum proportion of missing values for a SNP to be kept; -mind is the maximum proportion of missing values for a sample to be kept. The parameters -maf, -mis and -mind have default values and are not mandatory.

1.3 Fast imputation of missing genotype values

MRBIGR provides several fast and simple methods to impute the missing values in the input genotypic data: mode (most frequent call), random (sampling according to allele frequencies), mean0 (rounded mean) and mean2 (mean rounded to 2 decimal places). Note that these methods are intended for fast imputation of test datasets; more accurate imputation of genotypic data should be performed with dedicated imputation tools such as Beagle. The command line is as follows:

docker exec -it mrbigr_env MRBIGR.py geno -imput -method mode -g MRBIGR_output/demo/geno/chr_HAMP_qc_qc -od MRBIGR_output/demo -o geno_impute

The subcommand geno invokes the genotype analysis module; parameter -imput calls the genotype imputation function; -method is the imputation method; -g is the plink-bed format input genotypic data; -o is the output genotypic data prefix (a suffix _imput will be added automatically for the output files); -od is the output directory.
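For reference, the sketch below illustrates with NumPy what the four -method options compute on a 0/1/2-coded genotype matrix in which missing calls are NaN; it is a simplified illustration, not MRBIGR's internal implementation.

import numpy as np

def impute_column(col, method="mode", rng=np.random.default_rng(0)):
    # col: genotypes of one SNP coded 0/1/2, missing values as np.nan
    obs = col[~np.isnan(col)]
    miss = np.isnan(col)
    if method == "mode":        # most frequent call
        vals, counts = np.unique(obs, return_counts=True)
        col[miss] = vals[np.argmax(counts)]
    elif method == "random":    # sample according to the alternative allele frequency
        p = obs.mean() / 2
        col[miss] = rng.binomial(2, p, size=miss.sum())
    elif method == "mean0":     # rounded mean
        col[miss] = np.round(obs.mean())
    elif method == "mean2":     # mean rounded to 2 decimal places
        col[miss] = np.round(obs.mean(), 2)
    return col

geno = np.array([[0, 2, np.nan], [1, np.nan, 2], [0, 2, 2], [np.nan, 2, 0]], dtype=float)
print(np.apply_along_axis(impute_column, 0, geno))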

1.4 Dimensionality reduction of genotypic data through SNP clumping

SNP clumping keeps only one representative SNP per region of LD, which can be used to reduce the dimensionality of the genotypic data. This method is implemented in the R package bigsnpr and is similar to the LD-based SNP pruning in PLINK, but clumping has the advantage that only the SNP with the highest MAF in an LD region is kept, so more genetic variation is retained. The command line is as follows:

docker exec -it mrbigr_env MRBIGR.py geno -clump -g MRBIGR_data/chr_HAMP -od MRBIGR_output/demo -o chr_HAMP_380k -r2 0.7  

The subcommand geno invokes the genotype analysis module; parameter -clump calls the genotype clumping function to keep only one representative SNP per region of LD; -g is the plink-bed format input genotypic data; -o is the output genotypic data prefix (a suffix _clump will be added automatically for the output files); -od is the output directory; -r2 is the r^2 threshold for clumping, with a default value of 0.8.
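Conceptually, clumping visits SNPs from highest to lowest MAF and keeps a SNP only if it is not in strong LD with an already-kept SNP within a given window. The simplified NumPy sketch below illustrates this idea on a toy single-chromosome matrix; it is not bigsnpr's implementation.

import numpy as np

def clump(geno, pos, window=250_000, r2_cut=0.8):
    # geno: samples x SNPs matrix coded 0/1/2; pos: SNP positions on one chromosome
    maf = np.minimum(geno.mean(axis=0) / 2, 1 - geno.mean(axis=0) / 2)
    order = np.argsort(-maf)                    # visit SNPs from highest to lowest MAF
    kept = []
    for j in order:
        in_ld = False
        for k in kept:
            if abs(pos[j] - pos[k]) <= window:
                r = np.corrcoef(geno[:, j], geno[:, k])[0, 1]
                if r ** 2 >= r2_cut:
                    in_ld = True
                    break
        if not in_ld:
            kept.append(j)
    return sorted(kept)

rng = np.random.default_rng(1)
geno = rng.integers(0, 3, size=(100, 20)).astype(float)
pos = np.sort(rng.integers(1, 1_000_000, size=20))
print(clump(geno, pos))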

1.5 Calculation of kinship matrix from genotypic data

Pairwise kinship coefficients measure the genetic similarity between individuals. MRBIGR can generate a kinship/relatedness matrix with the command below:

docker exec -it mrbigr_env MRBIGR.py geno -kinship -g MRBIGR_data/chr_HAMP -o chr_HAMP.kinship -od MRBIGR_output/demo

The subcommand geno invokes the genotype analysis module; parameter -kinship calls the kinship calculation function; -g is the plink-bed format input genotypic data; -o is the output file prefix; -od is the output directory. The output file kinship.cXX.txt can be found in the output directory.
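As background, a centered genomic relationship (kinship) matrix is typically computed as K = W W^T / m, where W is the column-centered genotype matrix and m the number of SNPs; the .cXX.txt file naming suggests GEMMA's centered relatedness matrix, which follows the same idea. A minimal NumPy sketch on toy data:

import numpy as np

rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(100, 500)).astype(float)   # samples x SNPs, coded 0/1/2

W = geno - geno.mean(axis=0)     # center each SNP column
K = W @ W.T / geno.shape[1]      # samples x samples kinship matrix
print(K.shape)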

1.6 Principal component analysis from genotypic data

Principal component analysis (PCA) is commonly applied to population structure inference and dimensionality reduction. Top PCs calculated from the genotypic data can reflect the population structure among the sampled individuals. MRBIGR can perform PCA with the command below:

docker exec -it mrbigr_env MRBIGR.py geno -pca -g MRBIGR_data/chr_HAMP -o chr_HAMP.geno -dim 10 -od MRBIGR_output/demo 

The subcommand geno invokes the genotype analysis module; parameter -pca calls the PCA function; -g is the plink-bed format input genotypic data; -o is the output file prefix; -od is the output directory; -dim is the number of PCs to output, with a default value of 10. Then, a CSV format output file named chr_HAMP.geno_pca.csv would be generated.

1.7 t-SNE analysis from genotypic data

t-SNE (t-distributed stochastic neighbor embedding) is a machine learning algorithm for dimensionality reduction. As a nonlinear method, t-SNE can capture structure that PCA misses and is well suited for visualization of population structure. MRBIGR first uses PCA to reduce the data to 50 dimensions, and then uses the t-SNE algorithm to further reduce the data to 2 or 3 dimensions. The command line is as follows:

docker exec -it mrbigr_env MRBIGR.py geno -tsne -g MRBIGR_data/chr_HAMP -o chr_HAMP.geno -dim 3 -od MRBIGR_output/demo

The subcommand geno invokes the genotype analysis module; parameter -tsne calls the t-SNE analysis function; -g is the plink-bed format input genotypic data; -o is the output file prefix; -od is the output directory; -dim is the output dimensionality, usually 2 or 3, with a default value of 3. Then, a CSV format output file named geno_tsne.csv would be generated.
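The PCA-then-t-SNE pipeline described above can be sketched with scikit-learn as follows, assuming the genotype matrix has already been loaded into a samples x SNPs NumPy array (MRBIGR itself reads the plink-bed files directly); the toy genotypes are random and only show the data flow:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(200, 1000)).astype(float)   # toy 0/1/2 genotypes

pcs = PCA(n_components=50).fit_transform(geno)               # first reduce to 50 PCs
emb = TSNE(n_components=3, init="pca", random_state=0).fit_transform(pcs)
print(emb.shape)                                             # (200, 3)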

1.8 Phylogenetic analysis from genotypic data

MRBIGR provides a one-step phylogenetic analysis function for genotypic data, which facilitates the visualization of the relatedness of individuals and the population structure. An nwk format maximum likelihood (ML) tree can be generated with the following command:

docker exec -it mrbigr_env MRBIGR.py geno -tree -g MRBIGR_output/demo/geno/chr_HAMP_380k_clump -o chr_HAMP_380k_clump.geno_tree -od MRBIGR_output/demo  

The subcommand geno invokes the genotype analysis module; parameter -tree calls the phylogenetic analysis function; -g is the plink-bed format input genotypic data; -o is the output file prefix; -od is the output directory. Then, an nwk format output file named geno_tree.tree.nwk and some other related files geno_tree.* would be generated.

1.9 Functional annotation of genotypic data

This function in MRBIGR relies on ANNOVAR, an efficient software tool that uses up-to-date information to functionally annotate genetic variants. Only vcf-format genotypic data is supported as input; plink-bed format genotypic data should first be converted to VCF format using the -convert function. If annotation is performed for the first time, an annotation database should be built by adding the parameters -gtf and -fa. The command line is as follows:

gunzip -k MRBIGR_data/Zea_mays.B73_RefGen_v4.50.gtf.gz
gunzip -k MRBIGR_data/Zea_mays.B73_RefGen_v4.dna.toplevel.fa.gz

docker exec -it mrbigr_env MRBIGR.py geno -anno -vcf MRBIGR_output/demo/geno/chr_HAMP_female.vcf.vcf -o chr_HAMP_female.vcf_anno -od MRBIGR_output/demo -db ZmB73 -dbdir MRBIGR_output/demo/geno/ref -gtf MRBIGR_data/Zea_mays.B73_RefGen_v4.50.gtf -fa MRBIGR_data/Zea_mays.B73_RefGen_v4.dna.toplevel.fa 

The subcommand geno invokes the genotype analysis module; parameter -anno calls the annotation function; -vcf is the vcf-format input genotypic data; -o is the output file prefix; -od is the output directory; -db is the output database name; -dbdir is the output database directory; -fa is the reference genome sequence file in FASTA format; -gtf is the reference gene annotation file in GTF format. MRBIGR then generates an annotation database and performs functional annotation of the genetic variants automatically. The output annotation result files include chr_HAMP_female.vcf_anno.ZmB73_multianno.vcf, chr_HAMP_female.vcf_anno.ZmB73_multianno.bed, and chr_HAMP_female.vcf_anno.ZmB73_largeEffect.bed. If the annotation database has been built before, the parameters -gtf and -fa are no longer needed, and the parameters -db and -dbdir are the input database name and database directory, respectively. In this case, the command line is as follows:

docker exec -it mrbigr_env MRBIGR.py geno -anno -vcf MRBIGR_output/demo/geno/chr_HAMP_female.vcf.vcf -o chr_HAMP_female.vcf_anno -od MRBIGR_output/demo -db ZmB73 -dbdir MRBIGR_output/demo/geno/ref

2. Phenotypic data based analysis

The phenotypic data analysis module takes as input a CSV format phenotype file in which the first column and the first row are the names of samples and traits, respectively. It can be invoked through the subcommand pheno, which includes 5 functions: -qc, -imput, -scale, -correct, and -merge. -qc is used for quality control of phenotypic data; -imput is used for imputation of missing values in phenotypic data; -scale is used for scaling, normalization, standardization or transformation of phenotypic data; -correct is used for population structure based phenotypic data correction; -merge is used to merge a trait measured in different environments or years using mean values, BLUP or BLUE.

$ docker exec -it mrbigr_env MRBIGR.py pheno -h
usage: MRBIGR.py pheno [options]

optional arguments:
  -h, --help         show this help message and exit
  -p P               Phenotype file of *.csv format for processing and
                     transformation, the first column and the first row are
                     the names of samples and traits, respectively. It should
                     be used together with the parameters
                     –qc/-imput/-scale/-correct/-merge.
  -o [pheno_output]  Prefix of the output files, default is pheno_output. It
                     should be used together with the parameters
                     –qc/-imput/-scale/-correct/-merge.
  -od [output_dir]   Directory of the output files.
  -qc                Perform phenotype data quality control and filtration,
                     this includes outlier data removal, missing rate
                     filtering and small value filtering, each of the
                     criterions is optional and can be skipped, while only the
                     data points pass the cutoff would be kept. It should be
                     used together with the parameters –p/-o/-rout/-mis/-val.
  -tr                Transposition of the input data
  -rout [zscore]     Outlier removal of phenotype data, the default method is
                     zscore. Options: zscore, boxplot. It should be used
                     together with the parameter –qc.
  -mis [0.5]         Filter traits with a missing rate larger than the cutoff,
                     default is 0.5. It should be used together with the
                     parameter –qc.
  -val [0.1]         Filter traits with an average value smaller than the
                     cutoff, default is 0.1. It should be used together with
                     the parameter –qc.
  -imput             Perform imputation for missing phenotype values. It
                     should be used together with the parameters
                     –p/-o/-method.
  -method [mean]     Phenotype values imputation method. Options: mean,
                     median, most_frequent. It should be used together with
                     the parameter -imput.
  -scale             Scaling/Normalization/Standardization/Transformation of
                     the phenotype data. We provide a variety of methods for
                     phenotype data scaling. It should be used together with
                     the parameters –g/-o and any one or more of the
                     parameters
                     –log2/-log10/-ln/-boxcox/-qqnorm/-minmax/-zscore/-robust.
  -log2              log2 scale of the phenotype data. It should be used
                     together with the parameter -scale.
  -log10             log10 scale of the phenotype data. It should be used
                     together with the parameter -scale.
  -ln                ln scale of the phenotype data. It should be used
                     together with the parameter -scale.
  -boxcox            Box-Cox transformation of the phenotype data. It should
                     be used together with the parameter -scale.
  -qqnorm            Normal quantile transformation of the phenotype data. It
                     should be used together with the parameter -scale.
  -minmax            Min-Max normalization of the phenotype data. It should be
                     used together with the parameter -scale.
  -zscore            Z-score Standardization of the phenotype data. It should
                     be used together with the parameter -scale.
  -robust            Scale phenotype values using statistics that are robust
                     to outliers. It should be used together with the
                     parameter -scale.
  -correct           Correct phenotype data caused by population structure
                     through a genotype-based pre-calculated PCA file. It
                     should be used together with the parameters –p/-o/-pca.
  -pca PCA           A pre-calculated PCA file used for population structure
                     correction of the phenotype data. It should be used
                     together with the parameter –correct.
  -merge             Merge the input phenotype values from different
                     environment using either mean values, BLUP or BLUE
                     methods. Only one trait is allowed in the input phenotype
                     file. It should be used together with the parameters
                     –p/-o/-mm.
  -mm [mean]         Merge method of phenotype values, default is mean.
                     Options: mean, blup, blue. It should be used together
                     with the parameter –merge.

2.1 Quality control of phenotypic data

Quality control of the original phenotypic data includes transposition of the original phenotype matrix if needed, removal of outlier data points based on the z-score or boxplot method, and filtering of traits with a high missing rate or low average values. In this function, each of the parameters -tr, -rout, -mis and -val is optional. A typical quality control command line looks like this:

docker exec -it mrbigr_env MRBIGR.py pheno -qc -p MRBIGR_data/blup_traits_final.new.csv -o blup_traits_final.qc -od MRBIGR_output/demo -rout zscore -mis 0.1 -val 0.1  

The subcommand pheno invokes the phenotype analysis module; parameter -qc calls the quality control function; -p is the input phenotype matrix; -o is the prefix of the output file; -od is the output directory; -mis is the missing rate cutoff with a default value of 0.5; -val is the small value cutoff with a default value of 0.1; -rout enables outlier removal of phenotypic values, with z-score as the default method. To perform quality control with customized parameters, the command lines could be as follows (a sketch of the two outlier-removal criteria is given after the commands):

# transposition of the original phenotype matrix  
docker exec -it mrbigr_env MRBIGR.py pheno -qc -tr -p MRBIGR_data/blup_traits_final.new.csv -o blup_traits_final.tr -od MRBIGR_output/demo  
# reset missing rate cutoff, skip small value filter and outlier removal  
docker exec -it mrbigr_env MRBIGR.py pheno -qc -p MRBIGR_data/blup_traits_final.new.csv -o blup_traits_final.qc -mis 0.1 -od MRBIGR_output/demo  
# skip small value filter and set outlier removal method to boxplot  
docker exec -it mrbigr_env MRBIGR.py pheno -qc -p MRBIGR_data/blup_traits_final.new.csv -o blup_traits_final.qc -rout boxplot -mis 0.1 -od MRBIGR_output/demo 
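To make the two -rout criteria concrete, the pandas sketch below marks outliers per trait and sets them to NaN; the |z| > 3 and 1.5 * IQR cutoffs are common conventions used here for illustration and may differ from MRBIGR's internal thresholds.

import numpy as np
import pandas as pd

def remove_outliers(df, method="zscore"):
    out = df.copy()
    for trait in out.columns:
        x = out[trait]
        if method == "zscore":                      # z-score rule: |z| > 3
            z = (x - x.mean()) / x.std()
            out.loc[z.abs() > 3, trait] = np.nan
        else:                                       # boxplot rule: outside 1.5 * IQR
            q1, q3 = x.quantile(0.25), x.quantile(0.75)
            iqr = q3 - q1
            out.loc[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr), trait] = np.nan
    return out

pheno = pd.DataFrame({"trait1": [1.0, 1.2, 0.9, 15.0, 1.1], "trait2": [3.0, 2.8, 3.1, 2.9, 30.0]})
print(remove_outliers(pheno, "boxplot"))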

2.2 Imputation of missing phenotype values

MRBIGR provides three phenotype imputation methods: mean, median and most_frequent. Missing phenotype values in the input file should be coded as NA, and the command line is as follows:

docker exec -it mrbigr_env MRBIGR.py pheno -imput -p MRBIGR_output/demo/pheno/blup_traits_final.qc.phe.csv -o blup_traits_final.qc.imput -od MRBIGR_output/demo -method most_frequent  

The subcommand pheno invokes the phenotype analysis module; parameter -imput calls the phenotype imputation function; -p is the input phenotype matrix; -o is the prefix of the output file; -od is the output directory; -method is the imputation method. After this step, a phenotype file named blup_traits_final.qc.imput.phe.csv with no missing values would be generated.

2.3 Scaling and normalization of phenotypic data

MRBIGR provides multiple commonly used phenotypic data scaling and normalization methods, such as logarithmization, z-score standardization, Box-Cox transformation and normal quantile normalization. These methods are applicable to different types of phenotypic data, such as agronomic traits, transcript expression data and metabolite intensity data. The command line is as follows:

docker exec -it mrbigr_env MRBIGR.py pheno -scale -p MRBIGR_output/demo/pheno/blup_traits_final.qc.imput.phe.csv -o blup_traits_final.qc.imput.norm -od MRBIGR_output/demo -boxcox -minmax

The subcommand pheno invokes the phenotype analysis module; parameter -scale calls the phenotype scaling/normalization/standardization/transformation function; -p is the input phenotype matrix; -o is the prefix of the output file; -boxcox and -minmax mean that both the Box-Cox and Min-Max methods are used to transform the data. The Box-Cox transformation is used to make each trait meet normality assumptions (a lambda is calculated per trait and used to transform it), while Min-Max scaling rescales the values to the range 0-1. Other optional methods are -log2, -log10, -ln, -qqnorm, -zscore and -robust.
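The combination of the two transformations can be sketched with SciPy and pandas as follows; this is an illustration only, and note that Box-Cox requires strictly positive values, which is why the toy trait is drawn from a lognormal distribution:

import numpy as np
import pandas as pd
from scipy import stats

def boxcox_minmax(df):
    out = pd.DataFrame(index=df.index)
    for trait in df.columns:
        transformed, lam = stats.boxcox(df[trait])   # per-trait lambda estimated by maximum likelihood
        out[trait] = (transformed - transformed.min()) / (transformed.max() - transformed.min())
    return out

pheno = pd.DataFrame({"trait1": np.random.default_rng(0).lognormal(0, 1, 100)})
print(boxcox_minmax(pheno).describe())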

2.4 Correction of phenotypic data based on population structure

Since population structure may lead to phenotype values being distributed at different levels across subpopulations, in some cases correcting phenotype values to the same level based on population structure may help downstream analysis. The -correct function in MRBIGR uses a PCA file, which can be generated from the genotypic data, to perform phenotypic data correction. The command line is as follows:

docker exec -it mrbigr_env MRBIGR.py pheno -correct -p MRBIGR_output/demo/pheno/blup_traits_final.qc.imput.phe.csv -o blup_traits_final.qc.imput.pca_correct -od MRBIGR_output/demo -pca MRBIGR_output/demo/geno/chr_HAMP.geno_pca.csv 

The subcommand pheno invokes the phenotype analysis module; parameter -correct calls the phenotype correction function; -p is the input phenotype matrix; -o is the output file prefix; -od is the output directory; -pca is the PCA result file generated from the genotypic data through the command geno -pca.
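A minimal sketch of this kind of correction, assuming it amounts to regressing each trait on the top genotype PCs and keeping the residuals plus the trait mean (MRBIGR's exact procedure may differ):

import numpy as np
import pandas as pd

def correct_by_pca(pheno, pcs):
    # pheno: samples x traits DataFrame; pcs: samples x PCs DataFrame with the same sample order
    X = np.column_stack([np.ones(len(pcs)), pcs.values])   # intercept + PCs
    corrected = pheno.copy()
    for trait in pheno.columns:
        y = pheno[trait].values
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        corrected[trait] = y - X[:, 1:] @ beta[1:]          # remove the PC effects, keep the mean level
    return corrected

rng = np.random.default_rng(0)
pcs = pd.DataFrame(rng.normal(size=(50, 3)), columns=["PC1", "PC2", "PC3"])
pheno = pd.DataFrame({"trait1": 2 + 0.5 * pcs["PC1"] + rng.normal(size=50)})
print(correct_by_pca(pheno, pcs).head())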

2.5 Merge the phenotype values from different environments

Phenotypic data from different environments or years need to be merged before downstream analysis. Commonly used merging algorithms are the mean value, BLUP (best linear unbiased prediction) and BLUE (best linear unbiased estimation). MRBIGR provides all three methods, which can be selected through the parameters -merge and -mm. Note that only one trait is accepted in an input file. The command line is as follows:

docker exec -it mrbigr_env MRBIGR.py pheno -merge -p MRBIGR_data/blup_traits_final.new.csv -o blup_traits_final.blup -od MRBIGR_output/demo -mm blup 

The subcommand pheno invokes the phenotype analysis module; parameter -merge calls the phenotype merge function; -p is the input phenotype matrix of one trait in CSV format; -o is the prefix of the output file; -od is the output directory; -mm is the merge method. After this step, a phenotype file named blup_traits_final.blup.phe.csv with the BLUP-merged phenotypic values would be generated.
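For intuition, BLUP merging can be sketched with a random-intercept mixed model in statsmodels, assuming the input is a wide table of one trait measured in several environments (columns) per line (rows) and taking the grand mean plus the predicted line effect as the merged value; MRBIGR's own model may include additional terms:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
line_effect = rng.normal(0, 1, size=(30, 1))                       # simulated genetic effect per line
wide = pd.DataFrame(10 + line_effect + rng.normal(0, 1, size=(30, 3)),
                    columns=["env1", "env2", "env3"],
                    index=[f"line{i}" for i in range(30)])

# reshape to long format: one observation per line x environment
long = wide.reset_index().rename(columns={"index": "line"}).melt(
    id_vars="line", var_name="env", value_name="y")

model = smf.mixedlm("y ~ 1", long, groups=long["line"]).fit()       # random intercept per line
blup = model.params["Intercept"] + pd.Series(
    {line: eff.iloc[0] for line, eff in model.random_effects.items()})
print(blup.head())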


3. GWAS and SAL related analysis

The GWAS and SAL related analysis module can be used for GWAS, SAL detection and annotation, as well as peak- or gene-based haplotype analysis. This module can be invoked through the subcommand gwas, which includes 6 functions: -gwas, -qtl, -anno, -visual, -peaktest, and -genetest. -gwas is used to perform GWAS; -qtl is used to detect SALs from GWAS results; -anno is used for annotation of SAL regions; -visual is used for the visualization of GWAS results; -peaktest is used for peak-based haplotype tests; -genetest is used for gene-based haplotype tests.

$ docker exec -it mrbigr_env MRBIGR.py gwas -h
usage: MRBIGR.py gwas [options]

optional arguments:
  -h, --help           show this help message and exit
  -g G                 Genotype file of plink-bed format. Prefix of plink-bed
                       file is needed, that is, when the input files are
                       name.bed/name.bim/name.fam, -g should bed set to name.
                       It should be used together with the parameters
                       -gwas/-peaktest.
  -p P                 Phenotype file of *.csv format after processing and
                       transformation, the first column and the first row are
                       the names of samples and traits, respectively. It
                       should be used together with the parameters
                       –gwas/-peaktest/-genetest.
  -o [gwas_out]        Prefix of the output files. It will be used together
                       with the parameters
                       -qtl/-anno/-visual/-peaktest/-genetest.
  -od [output_dir]     Directory of the output files.
  -thread [1]          Number of threads for GWAS and QTL analysis, default is
                       1. It should be used together with the parameters
                       –gwas/-qtl/-visual.
  -gwas                Perform GWAS using a linear mixed model (lmm) or a
                       linear model (lm) implemented in GEMMA. All the result
                       files will be generated in a newly-built output
                       directory located in the working directory by default.
                       It should be used together with the parameters
                       –g/-p/–model/-thread.
  -c_pca               Whether to specify PCA as covariate. Default dimension
                       is 3. Or you can specify the PCA covariate file by
                       using -pca_file.
  -pca_file pca_file   The PCA covariate file which should be used together
                       with the parameter –c_pca.
  -model [lmm]         Fit a linear mixed model (lmm) or a linear model (lm),
                       default is lmm. Options: lmm, lm. It should be used
                       together with the parameter –gwas.
  -qtl                 QTL detection from the GWAS results based on the PLINK-
                       clump method. It should be used together with the
                       parameter –i/-p1/-p2/-p2n/-thread.
  -i [output]          Parent directory of the GWAS output files *.assoc.txt,
                       default is output. It should be used together with the
                       parameter –qtl.
  -p1 [1e-7]           Significance threshold for index SNPs used to determine
                       QTLs, default is 1e-7. It should be used together with
                       the parameter –qtl.
  -p2 [1e-5]           Secondary significance threshold for clumped SNPs used
                       to determine the reliability of QTLs, default is 1e-5.
                       It should be used together with the parameter –qtl.
  -p2n [5]             Secondary significance SNP number in a QTL, default is
                       5. It should be used together with the parameter –qtl.
  -anno                Perform QTL annotation using a pre-formatted gene
                       annotation file. It should be used together with the
                       parameter –q/-a.
  -q Q                 QTL result file in *.csv format generated by the -qtl
                       subcommand. It should be used together with the
                       parameter –anno.
  -a A                 Pre-formatted gene annotation file in used for QTL
                       annotation. It should be used together with the
                       parameter –anno.
  -visual              Generate Manhattan-plots and QQ-plots for visualization
                       of the GWAS results. It should be used together with
                       the parameters –i/-multi/-thread.
  -multi               Generate multiple-trait Manhattan-plot, -q must be set.
                       It should be used together with the parameters
                       –visual/-q.
  -peaktest            Perform peak-based haplotype test using lead SNP in
                       each QTL. It should be used together with the
                       parameters –p/-g/-q/-o.
  -genetest            Perform gene-based haplotype test for each trait. It
                       should be used together with the parameters
                       -f1/-f2/-vcf/-p/-o.
  -f1 F1               QTL annotation file in *.csv format generated by the
                       subcommand gwas –anno. It should be used together with
                       the parameter –genetest.
  -f2 F2               Genotype annotation file in *.bed format generated by
                       the subcommand geno –anno. It should be used together
                       with the parameter –genetest.
  -vcf VCF             VCF format genotype file. It should be used together
                       with the parameter –genetest.
  -plot_fmt [jpg,pdf]  The format of manhattan and QQ plot.

3.1 Parallel GWAS

MRBIGR supports parallel GWAS for multiple traits. Taking the processed plink-bed format genotypic data and CSV format phenotypic data as inputs, GWAS can be performed with the command below:

docker exec -it mrbigr_env MRBIGR.py gwas -gwas -model gemma_mlm -thread 12 -g MRBIGR_data/chr_HAMP -p MRBIGR_data/AMP_kernel_transcriptome_v4 -od MRBIGR_output/demo

The subcommand gwas invokes the GWAS and SAL analysis module; parameter -gwas calls the GWAS function, which utilizes GEMMA (default), GAPIT or rMVP to perform GWAS; -model is the model to fit, with gemma_lm, gemma_mlm, rmvp_glm, rmvp_mlm, rmvp_farmcpu, gapit_glm, gapit_mlm, gapit_cmlm, gapit_mmlm, gapit_farmcpu, and gapit_blink as options; -thread is the number of threads to run the command; -g is the plink-bed format input genotypic data; -p is the CSV format phenotypic data; -od is the output directory. After this step, an output directory named MRBIGR_output/demo/gwas/gemma/lmm would be generated, with the GWAS result files named Zm00001dxxxxx.assoc.txt in it.

Note: the exact output directory depends on the selected model; for example, -model gemma_mlm means the final output directory will be MRBIGR_output/demo/gwas/gemma/lmm.

3.2 Parallel SAL detection based on GWAS results

MRBIGR uses the clumping method implemented in PLINK v1.9 to automatically detect and optimize SALs based on the original GWAS results. In detail, a stricter P-value threshold -p1 is set to identify significantly associated (index) SNPs; then, for each index SNP, if the other SNPs within a 500 kb distance have P-values smaller than the second, looser P-value threshold -p2, have r2 values greater than 0.2 with the index SNP, and the number of such SNPs surpasses the SNP number threshold -p2n, the region is regarded as a SAL; finally, all overlapping SALs are merged to generate the final SAL set, and the SNP with the smallest P-value in a SAL is defined as the lead SNP. The command line is as follows:

docker exec -it mrbigr_env MRBIGR.py gwas -qtl -g MRBIGR_data/chr_HAMP -thread 6 -i MRBIGR_output/demo/gwas/gemma/lmm -o qtl_output -od MRBIGR_output/demo -p1 1e-5 -p2 1e-3 -p2n 5 

The subcommand gwas invokes the GWAS and SAL analysis module; parameter -qtl calls the SAL detection function; -thread is the number of threads to run the command; -i is the GWAS result directory; -o is the output file prefix; -od is the output directory; -p1 is the significance threshold for index SNPs used to determine SALs; -p2 is the secondary significance threshold for clumped SNPs used to determine the reliability of SALs; -p2n is the number of secondary significant SNPs required in a SAL. After this step, an output file named qtl_output.qtl_res.csv would be generated.
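The decision rule can be sketched on a GEMMA *.assoc.txt table (columns chr, rs, ps, p_wald) as follows; the r2 >= 0.2 LD condition is omitted for brevity, since MRBIGR delegates it to PLINK clumping, so this is only a rough illustration:

import pandas as pd

def candidate_sals(assoc, p1=1e-7, p2=1e-5, p2n=5, window=500_000):
    sals = []
    for _, snp in assoc[assoc.p_wald < p1].iterrows():          # index SNPs
        near = assoc[(assoc.chr == snp.chr) &
                     (assoc.ps.sub(snp.ps).abs() <= window) &
                     (assoc.p_wald < p2)]
        if len(near) - 1 >= p2n:                                # enough supporting SNPs besides the index SNP
            sals.append({"chr": snp.chr, "lead_rs": snp.rs,
                         "start": int(near.ps.min()), "end": int(near.ps.max())})
    return pd.DataFrame(sals)                                   # overlapping regions still need to be merged

# assoc = pd.read_csv("Zm00001dxxxxx.assoc.txt", sep="\t")
# print(candidate_sals(assoc, p1=1e-5, p2=1e-3))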

3.3 Annotation of SAL regions

The SAL result file can be annotated using a four-column gene annotation file in TSV format, which looks like this:

geneid  aliased position        function
Zm00001d027230  Zm00001d027230  1:44289-49837:+ Zm00001d027230
Zm00001d027231  Zm00001d027231  1:50877-55716:- Zm00001d027231
Zm00001d027232  Zm00001d027232  1:92299-95134:- Zm00001d027232
Zm00001d027233  Zm00001d027233  1:111655-118312:-       Zm00001d027233

This file can be generated from a standard GTF format gene annotation file with the following command:

mkdir MRBIGR_output/demo/tmp/
grep -v '#' MRBIGR_data/Zea_mays.B73_RefGen_v4.50.gtf |awk '{if($3=="gene"){print $0}}' |sed 's/ /\t/g'|sed 's/"//g'|sed 's/;//g'|awk '{print $10"\t"$10"\t"$1":"$4"-"$5":"$7"\t"$10}'|sed '1igeneid\taliased\tposition\tfunction' > MRBIGR_output/demo/tmp/gene.anno.tsv

Then, taking this file as input, SAL annotation can be performed with the following command:

docker exec -it mrbigr_env MRBIGR.py gwas -anno -q MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv -a MRBIGR_output/demo/tmp/gene.anno.tsv -o anno_output -od MRBIGR_output/demo

The subcommand gwas invokes the GWAS and SAL analysis module; parameter -anno calls the SAL annotation function; -q is the input SAL result file to be annotated; -a is the gene annotation file; -o is the prefix of the output file; -od is the output directory. After this step, an output file named anno_output.qtl_anno.csv would be generated.

3.4 Visualization of GWAS results

Manhattan plots and QQ plots can be generated for visualization of the GWAS results. The command line is as follows:

docker exec -it mrbigr_env MRBIGR.py gwas -visual -thread 12 -i MRBIGR_output/demo/gwas/gemma/lmm -od MRBIGR_output/demo 

or

docker exec -it mrbigr_env MRBIGR.py gwas -visual -thread 12 -i MRBIGR_output/demo/gwas/gemma/lmm -od MRBIGR_output/demo -multi -q MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv

The subcommand gwas invokes the GWAS and SAL analysis module; parameter -visual calls the visualization function; -i is the GWAS result directory; -od is the output directory; -multi is optional, and a multi-trait Manhattan plot would be generated when it is set; -q is the input SAL result file, which only needs to be set together with -multi; -thread is the number of threads to run the command.

3.5 Statistical test of lead SNP

A simple approach to relate the phenotype distribution to the SAL haplotype is to use the genotype of the lead SNP to represent the haplotype of the SAL region; the distribution of phenotype values under the different lead-SNP genotypes can then be displayed, and a Student's t-test can be performed. This statistical test can be performed in MRBIGR through the command below:

docker exec -it mrbigr_env MRBIGR.py gwas -peaktest -o peaktest_output -od MRBIGR_output/demo -p MRBIGR_data/AMP_kernel_transcriptome_v4 -g MRBIGR_data/chr_HAMP -q MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv  

The subcommand gwas invokes the GWAS and SAL analysis module; parameter -peaktest calls the peak test function; -o is the prefix of the output file; -od is the output directory; -p is the input phenotypic data; -g is the input genotypic data; -q is the input SAL result file. Then, the genotype information of each sample and phenotype distribution plots would be generated in the output directory.
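The underlying comparison can be sketched as a two-sample t-test on phenotype values grouped by the lead-SNP genotype; the column names and values below are purely illustrative:

import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "genotype": ["AA", "AA", "TT", "TT", "AA", "TT", "AA", "TT"],
    "trait":    [5.1,  4.8,  7.2,  6.9,  5.3,  7.5,  4.9,  7.0],
})
g1 = df.loc[df.genotype == "AA", "trait"]
g2 = df.loc[df.genotype == "TT", "trait"]
t, p = stats.ttest_ind(g1, g2)
print(f"t = {t:.2f}, p = {p:.3g}")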

3.6 Statistical test of genes

Another approach to relate the phenotype distribution to the SAL haplotype is to test the phenotype distribution against the haplotype of each gene in the SAL, in order to discover potential causal genes and variants. For a SAL region, MRBIGR uses the large-effect variants in each gene to build gene haplotypes and performs a gene-based haplotype test. Welch's t-test is used for two-group haplotype comparisons and Tukey's test for multi-group comparisons, as described in Yano et al., 2016. The command line is as follows:

docker exec -it mrbigr_env MRBIGR.py gwas -genetest -f1 MRBIGR_output/demo/gwas/anno/anno_output.qtl_anno.csv -f2 MRBIGR_output/demo/geno/chr_HAMP_female.vcf_anno.ZmB73_multianno.bed -vcf MRBIGR_output/demo/geno/chr_HAMP_female.vcf.vcf -p MRBIGR_data/blup_traits_final.new.csv -od MRBIGR_output/demo

The subcommand gwas invokes the GWAS and SAL analysis module; parameter -genetest calls the gene test function; -f1 is the input SAL annotation file generated by the gwas -anno command; -f2 is the input genotype annotation file generated by the geno -anno command; -vcf is the VCF format genotype file; -p is the input phenotype file; -od is the output directory. Then, the gene haplotype information of each sample and phenotype distribution plots would be generated in the MRBIGR_output/demo/gwas/gene_haplotest directory.
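The two tests can be sketched with SciPy and statsmodels as follows, choosing Welch's t-test when only two haplotype groups are present and Tukey's HSD otherwise; the haplotype labels and trait values are toy data:

import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.DataFrame({
    "hap":   ["hap1"] * 6 + ["hap2"] * 6 + ["hap3"] * 6,
    "trait": [5.0, 5.2, 4.9, 5.1, 5.3, 4.8,
              6.8, 7.1, 6.9, 7.3, 7.0, 6.7,
              5.9, 6.1, 6.0, 5.8, 6.2, 6.0],
})

groups = [g["trait"].values for _, g in df.groupby("hap")]
if len(groups) == 2:
    print(stats.ttest_ind(*groups, equal_var=False))    # Welch's t-test for two haplotype groups
else:
    print(pairwise_tukeyhsd(df["trait"], df["hap"]))    # Tukey's HSD for more than two groups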


4. Mendelian randomization analysis of multi-omics data

The Mendelian randomization analysis module can be used to perform Mendelian randomization analysis between different omics data, and can be invoked through the subcommand mr, which includes 3 modes depending on the parameters. The first mode takes omics data through the -exposure and -outcome parameters and performs Mendelian randomization analysis between the -exposure data and the -outcome data. The second mode takes transcriptome expression data through the -gene_exp parameter and performs Mendelian randomization analysis between gene pairs when the -pairwise parameter is set. The third mode also takes transcriptome expression data through the -gene_exp parameter, while the -tf parameter provides the annotation of transcription factors and -target provides the annotation of transcription factor target genes; Mendelian randomization analysis is then performed between transcription factors and target genes. There are two calculation models for Mendelian randomization, a linear model and a mixed linear model, specified by the parameters -lm and -mlm, respectively.

$ docker exec -it mrbigr_env MRBIGR.py mr -h
usage: MRBIGR.py mr [options]

optional arguments:
  -h, --help           show this help message and exit
  -g                   Genotype file in plink bed format
  -exposure            Exposure phenotype file, such as mTrait phenotype, the
                       file format is csv format, the first column and the
                       first row are the names of inbred lines and phenotypes
                       respectively
  -outcome             Outcome phenotype file, such as pTrait phenotype, the
                       file format is csv format, the first column and the
                       first row are the names of inbred lines and phenotypes
                       respectively
  -tf                  Transcription factors annotation file
  -target              targeted genes annotation file
  -type [All]          Regulatory relationship between transcription factors
                       and targeted genes. direct, direct regulatory
                       relationship between transcription factors and targeted
                       genes. All, direct and indirect regulatory relationship
                       between transcription factors and targeted genes
  -pairwise            perform Mendelian randomization analysis between all
                       gene pairs
  -gene_exp            gene expression file, the file format is csv format,
                       the first column and the first row are the names of
                       inbred lines and phenotypes respectively
  -qtl                 Exposure phenotype QTL file, generated by subcommand
                       regiongwas
  -lm                  Perform Mendelian Randomization through linear model
  -mlm                 Perform Mendelian Randomization through mixed linear
                       model
  -pvalue [default:1]  The pvalue cutoff of Mendelian randomization analysis
                       result output,default 1
  -thread [1]          Number of threads for Mendelian Randomization analysis
  -o [mr_out]          The prefix of the output file, default mr_out
  -od [output_dir]     Directory of the output files.

4.1 Exposure/outcome Mode

The first mode uses the -exposure parameter to specify the exposure data required by the Mendelian randomization model, -outcome to provide the outcome data, and -qtl to provide the genetic variants associated with the exposure data. Mendelian randomization analysis can then be performed with the command below:

docker exec -it mrbigr_env MRBIGR.py mr -g MRBIGR_data/chr_HAMP -exposure MRBIGR_data/AMP_kernel_transcriptome_v4 -qtl MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv -outcome MRBIGR_data/blup_traits_final.new.csv -mlm -thread 12 -o g2p -od MRBIGR_output/demo 

The subcommand mr invokes the Mendelian randomization analysis module; parameter -exposure is the CSV format exposure data; -outcome is the CSV format outcome data; -qtl is the CSV format exposure SAL data; -thread is the number of threads to run the command; -g is the plink-bed format input genotypic data; -mlm indicates that Mendelian randomization analysis is performed through a mixed linear model; -o is the prefix of the output file; -od is the output directory. After this step, an MR result file named g2p.MR.csv would be generated.
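For intuition, a single-instrument Mendelian randomization estimate can be sketched as a Wald ratio, using the lead SNP of an exposure SAL as the instrument; MRBIGR's -lm/-mlm models are more involved (e.g., kinship correction under -mlm), so this is only a conceptual illustration on simulated data:

import numpy as np

rng = np.random.default_rng(0)
n = 500
snp = rng.binomial(2, 0.3, n).astype(float)          # instrument: lead SNP genotype (0/1/2)
exposure = 0.5 * snp + rng.normal(size=n)            # e.g. gene expression
outcome = 0.8 * exposure + rng.normal(size=n)        # e.g. agronomic trait

beta_exp = np.polyfit(snp, exposure, 1)[0]           # SNP -> exposure effect
beta_out = np.polyfit(snp, outcome, 1)[0]            # SNP -> outcome effect
print(f"Wald ratio (causal effect estimate): {beta_out / beta_exp:.2f}")   # close to the simulated 0.8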

4.2 Pairwise Mode

The second mode uses the -gene_exp parameter to specify the population gene expression data, -qtl to specify the genetic variants that affect gene expression, and -pairwise to indicate that Mendelian randomization analysis should be performed between all pairs of genes in the expression data. Pairwise gene Mendelian randomization analysis can be performed with the command below:

docker exec -it mrbigr_env MRBIGR.py mr -g MRBIGR_data/chr_HAMP -gene_exp MRBIGR_data/AMP_kernel_transcriptome_v4 -pairwise -mlm -qtl MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv -thread 12 -o gene_pairwise_mr_out -od MRBIGR_output/demo 

The subcommand mr invokes the Mendelian randomization analysis module; parameter -gene_exp is the CSV format population gene expression data; -qtl is the CSV format gene SAL data; -thread is the number of threads to run the command; -g is the plink-bed format input genotypic data; -mlm indicates that Mendelian randomization analysis is performed through a mixed linear model; -o is the prefix of the output file; -od is the output directory. After this step, an MR result file named gene_pairwise_mr_out.MR.csv would be generated.

4.3 TF Mode

The third mode uses the -gene_exp parameter to specify the population expression data, while the -tf and -target parameters specify the transcription factors and their target genes, respectively; Mendelian randomization analysis is then performed between the transcription factors and the target genes. The regulatory relationships between transcription factors and target genes can be divided into direct and indirect. In detail, when the genomic distance between a transcription factor and its target gene is < 500 kb, the relationship between them is defined as direct, while the regulatory relationships between the transcription factor and the remaining target genes are defined as indirect. The -type parameter specifies which regulatory relationships are used in the Mendelian randomization analysis. The command line is as follows:

head -100 MRBIGR_data/gene.anno.tsv > MRBIGR_output/demo/tmp/test_head100_genefunc.txt
tail -100 MRBIGR_data/gene.anno.tsv > MRBIGR_output/demo/tmp/test_tail100_genefunc.txt

docker exec -it mrbigr_env MRBIGR.py mr -g MRBIGR_data/chr_HAMP -gene_exp MRBIGR_data/AMP_kernel_transcriptome_v4 -tf MRBIGR_output/demo/tmp/test_head100_genefunc.txt -target MRBIGR_output/demo/tmp/test_tail100_genefunc.txt -mlm -qtl MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv -thread 12 -o tf_test_mr_out -od MRBIGR_output/demo  

The subcommand mr invokes the Mendelian randomization analysis module; parameter -gene_exp is the CSV format population gene expression data; -qtl is the CSV format transcription factor SAL data; -tf is the transcription factor annotation file; -target is the annotation file for genes targeted by transcription factors; -thread is the number of threads to run the command; -g is the plink-bed format input genotypic data; -mlm indicates that Mendelian randomization analysis is performed through a mixed linear model; -type is the regulatory relationship between transcription factors and target genes, with either direct (perform Mendelian randomization analysis between transcription factors and directly regulated target genes) or All (perform Mendelian randomization analysis between transcription factors and all target genes) as options; -o is the prefix of the output file; -od is the output directory. After this step, an MR result file named tf_test_mr_out.MR.csv would be generated.

Tips: The transcription factor annotation file and the annotation file for genes targeted by transcription factors should be four-column gene annotation files in TSV format, which look like this (take MRBIGR_data/gene.anno.tsv as an example):

geneid  aliased position        function
Zm00001d027230  Zm00001d027230  1:44289-49837:+ Zm00001d027230
Zm00001d027231  Zm00001d027231  1:50877-55716:- Zm00001d027231
Zm00001d027232  Zm00001d027232  1:92299-95134:- Zm00001d027232  

5. MR-based network construction and module identification

In MRBIGR, the weights of MR effects between gene pairs are used to construct an MR-based network, and the ClusterONE algorithm is used to identify modules in the constructed network. This module can be invoked through the subcommand net:

$ docker exec -it mrbigr_env MRBIGR.py net -h
usage: MRBIGR.py net [options]

optional arguments:
  -h, --help           show this help message and exit
  -mr                  pairwise Mendelian Randomization analysis result
  -module_size [5]     minimal size of genes in a module,default 5
  -pvalue [1e-3]       pvalue cutoff for MR-based network analysis,default
                       1e-3
  -plot                plot network of each module
  -plot_fmt [jpg,pdf]  The format of the output plots
  -o [net_out]         The prefix of the output file, default go_out
  -od [output_dir]     Directory of the output files.

The MR-based network analysis command line is as follows:

docker exec -it mrbigr_env MRBIGR.py net -mr MRBIGR_output/demo/mr/gene_pairwise_mr_out.MR.csv -plot -o net_out -od MRBIGR_output/demo 

The subcommand net invokes the Mendelian randomization based network analysis module; parameter -mr is the CSV format pairwise Mendelian randomization result; -plot indicates that a network plot is generated for each identified module; -o is the prefix of the output file; -od is the output directory. After this step, the network edge list file net_out.edge_list, the ClusterONE result net_out.clusterone.result.csv, the network module plots net_out.module*.pdf and the final network module file net_out.module.csv would be generated.
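As a rough illustration of how such a network is assembled, the sketch below builds a weighted graph from MR results with networkx; the column names (exposure, outcome, effect, pvalue) are assumptions about the MR.csv layout, and connected components merely stand in for the ClusterONE module detection that MRBIGR actually uses:

import pandas as pd
import networkx as nx

mr = pd.DataFrame({                      # toy stand-in for a pairwise MR result table
    "exposure": ["g1", "g1", "g2", "g4"],
    "outcome":  ["g2", "g3", "g3", "g5"],
    "effect":   [0.8, 0.5, 0.6, 0.9],
    "pvalue":   [1e-6, 1e-4, 1e-5, 1e-8],
})

edges = mr[mr.pvalue < 1e-3]                                        # -pvalue cutoff analogue
G = nx.Graph()
G.add_weighted_edges_from(edges[["exposure", "outcome", "effect"]].itertuples(index=False))
modules = [m for m in nx.connected_components(G) if len(m) >= 5]    # -module_size analogue
print(G.number_of_edges(), modules)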


6. Gene ontology analysis of network modules

Gene ontology analysis of the MR-based network modules helps to identify modules with biological significance. The enrichGO function in the R package clusterProfiler is used to perform GO enrichment analysis on each module based on gene ontology annotation, and rich graphics (e.g., dotplot, barplot, cnetplot, heatplot, emapplot, upsetplot) can be chosen for the visualization of the enrichment results. Gene ontology analysis can be invoked through the subcommand go:

$ docker exec -it mrbigr_env MRBIGR.py go -h
usage: MRBIGR.py go [options]

optional arguments:
  -h, --help            show this help message and exit
  -gene_lists           gene lists file for GO enrichment analysis
  -go_anno              GO annotation file for each gene used to construct GO
                        annotation database
  -gene_info            gene information file used to construct GO annotation
                        database
  -pvalue               adjusted pvalue cutoff on enrichment tests to report
  -plot_type [dotplot]  Visualization types of GO enrichment result, support
                        barplot, dotplot, Gene-Concept Network(cnetplot),
                        heatplot, Enrichment Map(emapplot) and upsetplot, one
                        or more plot types can be specified at the same time,
                        multiple types are separated by commas, such as
                        barplot,emapplot,heatplot
  -plot_fmt [jpg,pdf]   The format of the output plots
  -o [go_out]           The prefix of the output file, default go_out
  -od [output_dir]      Directory of the output files.

The analysis command line is as follows:

csvtk sep -f 3 -t MRBIGR_data/gene.anno.tsv -s ":" -n chr,start-end,strand | csvtk sep -f start-end -t -s "-" -n start,end | csvtk cut -f 1,5,8,9,7,2 -t | csvtk rename -t -f 1-6 -n "gene_id,chr,start,end,stard,annotation" -o MRBIGR_output/demo/tmp/gene_anno.txt

docker exec -it mrbigr_env MRBIGR.py go -gene_lists MRBIGR_output/demo/net/net_out.module.csv -go_anno MRBIGR_data/maize.genes2go.txt -gene_info MRBIGR_output/demo/tmp/gene_anno.txt -plot_type barplot,dotplot -o go_out -od MRBIGR_output/demo  

The subcommand go invokes the gene ontology enrichment analysis module; parameter -gene_lists is the CSV format network module data; -go_anno is the tabular format gene ontology annotation for each gene; -gene_info is the tabular format gene annotation data; -plot_type is the visualization type of the enrichment results; -o is the prefix of the output file; -od is the output directory. After this step, the gene ontology enrichment analysis result go_out.GO.csv and the visualized enrichment results go_out.BP.dotplot.pdf, go_out.MF.dotplot.pdf, go_out.CC.dotplot.pdf, go_out.BP.barplot.pdf, go_out.MF.barplot.pdf and go_out.CC.barplot.pdf would be generated.
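For reference, the core of a GO enrichment test is a hypergeometric (one-sided Fisher) test per GO term, as sketched below with SciPy on made-up counts; clusterProfiler additionally applies multiple-testing correction:

from scipy.stats import hypergeom

N = 30000      # annotated background genes
K = 150        # background genes annotated with the GO term
n = 40         # genes in the module
k = 6          # module genes annotated with the GO term

p = hypergeom.sf(k - 1, N, K, n)    # P(X >= k)
print(f"enrichment p-value = {p:.3g}")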

Tips: The gene ontology annotation of each gene should be provided as a two-column annotation file in TSV format, which looks like this (take MRBIGR_data/maize.genes2go.txt as an example):

Zm00001d018305  GO:0051649  
Zm00001d018305  GO:0003677  
Zm00001d003209  GO:0005886  
Zm00001d003209  GO:0044238  

The gene information of each gene should be provided as a six-column annotation file in TSV format, which looks like this (take MRBIGR_data/maize_genefunc.txt as an example):

gene_id chr start   end stard   annotation  
Zm00001d027230  1   44289   49837   +   Mitochondrial transcription termination factor family protein  
Zm00001d027231  1   50877   55716   -   OSJNBa0093O08.6 protein  
Zm00001d027232  1   92299   95134   -   Zm00001d027232  
Zm00001d027234  1   118683  119739  -   Zm00001d027234 

7. Data visualization

The data visualization module in MRBIGR provides a collection of plot functions for the visualization of genotype, phenotype, GWAS, Mendelian randomization, network, and GO analysis results. This module can be invoked through the subcommand plot. The parameter -i specifies the data file needed for plotting, and -plot_type specifies the plot type; the options include manhattan, qqplot, SNPdensity, hist, boxplot, scatter_ps, barplot, dotplot, forestplot, scatter_mr and tree. The plot module can be invoked through the plot plot_old subcommand:

$ docker exec -it mrbigr_env MRBIGR.py plot plot_old -h
usage: MRBIGR.py plot plot_old [options]

optional arguments:
  -h, --help           show this help message and exit
  -i                   The result file generated by MRBIGR or the data file
                       used in MRBIGR, the file format is csv format
  -plot_type           Visualization type of the result file generated by
                       MRBIGR or the data file used in MRBIGR,including tree,
                       scatter_mr, forestplot, manhattan, qqplot, SNPdensity,
                       hist, boxplot, scatter_ps, barplot, dotplot; support
                       tree plot for phylogenetic tree analysis, scatterplot
                       and forest plot for Mendelian Randomization analysis,
                       manhattan plot and QQ plot for GWAS result, SNP density
                       plot for GWAS result and SNP info file, boxplot and
                       histogram for phenotype data, barplot and dotplot for
                       GO enrich result, scatter plot for population
                       structure.
  -order               Visualizing the results of Mendelian randomization in
                       the order of trait in the file, uesed in forest plot
  -group               Group file for scatter plot, boxplot and tree plot
  -group_sep           Delimiter used in group file
  -drops               Strains used to drop
  -plot_fmt [jpg,pdf]  The format of the output plots
  -chrom               Specify the chromosome for manhattan plot. Default plot
                       all the chromosomes
  -start               Specify the start position for manhattan plot. Default:
                       0
  -end                 Specify the end position for manhattan plot. Default:
                       length of chromosome
  -hl_pos              Specify the position of SNPs to be highlighted for
                       manhattan plot. Use comma to separate multiple
                       positions.
  -hl_text             The texts to be highlighted for manhattan plot. Use
                       comma to separate multiple texts.
  -input_sep           The separator in GWAS result file. Default: tab
  -o [plot_out]        The prefix of the output file, default plot_out
  -od [output_dir]     Directory of the output files.

7.1 Genotypic data based plot

The genotypic data based plots include the SNP density plot (SNPdensity) and the population scatterplot (scatter_ps). The SNP density plot is used to demonstrate the distribution of SNPs across the genome. The command line is as follows:

csvtk cut -f 2,1,4 -t MRBIGR_data/chr_HAMP.bim -H | csvtk add-header -t -n "rs,chr,ps" -o MRBIGR_output/demo/tmp/snp_info.txt
docker exec -it mrbigr_env MRBIGR.py plot plot_old -i MRBIGR_output/demo/tmp/snp_info.txt -plot_type SNPdensity -o plot_out -od MRBIGR_output/demo  
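
The csvtk step above only reorders columns of the plink .bim file (whose columns are chromosome, SNP ID, genetic distance, position, allele 1 and allele 2) and adds a header. If csvtk is not available, an equivalent snp_info.txt could be written with a short Python sketch, assuming pandas:

import pandas as pd

# Read the whitespace-separated .bim file and keep SNP ID, chromosome and position
bim = pd.read_csv("MRBIGR_data/chr_HAMP.bim", sep=r"\s+", header=None,
                  names=["chr", "rs", "cm", "ps", "a1", "a2"])
bim[["rs", "chr", "ps"]].to_csv("MRBIGR_output/demo/tmp/snp_info.txt",
                                sep="\t", index=False)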

Tips: The SNP information file should be a three-column file in TSV format, which looks like this (take the generated MRBIGR_output/demo/tmp/snp_info.txt as an example):

rs  chr ps  
chr10.s_109149  10  109149  
chr10.s_109277  10  109277  
chr10.s_109475  10  109475  
chr10.s_109511  10  109511  

The first column represents the SNP id, the second column represents the chromosome of the SNP, and the third column represents the position of the SNP. The population scatterplot takes a PCA or t-SNE result file as input to visualize the population structure. The population information of each individual can be specified by the -group parameter, which is used to assign individuals to populations with different colors. The command line is as follows:

docker exec -it mrbigr_env MRBIGR.py plot plot_old -i MRBIGR_output/demo/geno/chr_HAMP.geno_pca.csv -plot_type scatter_ps -group MRBIGR_data/491.group_info.csv -o plot_out -od MRBIGR_output/demo

The population structure file should be a CSV file whose first column is the sample ID, followed by the principal component (or t-SNE) coordinates, and looks like this (take MRBIGR_output/demo/geno/chr_HAMP.geno_pca.csv as an example):

,PC1,PC2,PC3
CIMBL32,-500.22876280227405,291.4219232815866,-106.31967411588965
CIMBL89,-317.01045032787937,76.23101098461999,-5.709429188066682
CIMBL7,-438.9677175927099,147.47327899812078,17.419342974755814
CIMBL45,168.92927463582174,142.43977568787892,137.81050851770115
ZHENG58,687.5073243050226,-230.98495850470024,-262.2697439302226
CML415,-296.43476541023466,8.953155752331595,1.960958532295878
CIMBL46,-363.40800595652894,72.35692371618754,85.151195977471
CIMBL70,-416.08725961220046,162.22519090608853,30.936760965741744
CIMBL124,-493.08148241098274,158.7571953595056,21.30131838983761
...

The first column represents the sample ID, the second column the first principal component of each inbred line, and the third column the second principal component. The group file should be a two-column file in CSV format, which looks like this (take MRBIGR_data/491.group_info.csv as an example):

line,subpop
150,Mixed
177,Mixed
238,NSS
268,NSS
501,NSS
647,Mixed
812,SS
832,SS
1323,Mixed
...

The first column represents the sample identifier, and the second column represents the subpopulation of each sample.
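
A figure of the same kind can also be drawn outside MRBIGR when custom colours or labels are needed. The following is a minimal sketch, assuming pandas and matplotlib and the two files described above (the output file name is only illustrative):

import pandas as pd
import matplotlib.pyplot as plt

# PCA result: first (unnamed) column is the sample ID, followed by PC1, PC2, ...
pca = pd.read_csv("MRBIGR_output/demo/geno/chr_HAMP.geno_pca.csv", index_col=0)
# Group file: sample ID and subpopulation label
grp = pd.read_csv("MRBIGR_data/491.group_info.csv", index_col=0)
pca.index, grp.index = pca.index.astype(str), grp.index.astype(str)   # sample IDs must match

merged = pca.join(grp, how="inner")           # keep samples present in both files
fig, ax = plt.subplots(figsize=(5, 4))
for subpop, sub in merged.groupby("subpop"):
    ax.scatter(sub["PC1"], sub["PC2"], s=10, label=subpop)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend(frameon=False)
fig.savefig("MRBIGR_output/demo/pca_scatter_custom.pdf")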

7.2 Phenotypic data based plot

The phenotypic data based plots include the histogram (hist) and the boxplot (boxplot), which are used to demonstrate the distribution of phenotypic data. The boxplot can show not only the overall distribution of a phenotype but also its distribution in different groups when the -group parameter is specified. The command line is as follows:

csvtk cut -f 1-2 MRBIGR_data/blup_traits_final.new.csv -o MRBIGR_output/demo/tmp/example.phe.csv
docker exec -it mrbigr_env MRBIGR.py plot plot_old -i MRBIGR_output/demo/tmp/example.phe.csv -plot_type hist -o plot_out -od MRBIGR_output/demo
cut -f 1,12- MRBIGR_data/chr_HAMP_female.hmp| head -2 | csvtk transpose -d $'\t' -D ',' | csvtk rename -f 1 -n "line" -o MRBIGR_output/demo/tmp/example.snp_group.csv
docker exec -it mrbigr_env MRBIGR.py plot plot_old -i MRBIGR_output/demo/tmp/example.phe.csv -plot_type boxplot -group MRBIGR_output/demo/tmp/example.snp_group.csv -o plot_out -od MRBIGR_output/demo

The phenotype file should be a two-column file in CSV format, which looks like this (take MRBIGR_output/demo/tmp/example.phe.csv as an example):

Trait,Plantheight
150,199.40686200000002
177,167.7376926
238,195.1777397
268,160.7956655
501,170.01783559999998
647,143.4828657
812,164.7820379
1323,183.0386928
1462,211.1751917
...

The first column represents the sample identifier, and the second column represents the phenotypic value of each sample. The group file should be a two-column file in CSV format, in the same layout as MRBIGR_data/491.group_info.csv or like this (take MRBIGR_output/demo/tmp/example.snp_group.csv as an example):

line,chr10.s_74401
CIMBL32,CC
CIMBL89,TT
CIMBL7,CC
CIMBL45,TT
ZHENG58,CC
CML415,TT
CIMBL46,TT
CIMBL70,TT
CIMBL124,CC
...

The first column represents the sample ID, and the second column represents the group of each sample, which can be a haplotype class or a subpopulation assignment.
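
For reference, a grouped boxplot of this kind can also be reproduced directly with pandas and matplotlib. The sketch below merges the two files generated above on the sample ID; the output file name is only illustrative:

import pandas as pd
import matplotlib.pyplot as plt

phe = pd.read_csv("MRBIGR_output/demo/tmp/example.phe.csv", index_col=0)        # sample ID + trait
grp = pd.read_csv("MRBIGR_output/demo/tmp/example.snp_group.csv", index_col=0)  # sample ID + genotype
phe.index, grp.index = phe.index.astype(str), grp.index.astype(str)

merged = phe.join(grp, how="inner")
trait, group = merged.columns[0], merged.columns[1]

# One box per genotype class (e.g. CC vs TT)
groups = sorted(merged[group].dropna().unique())
data = [merged.loc[merged[group] == g, trait].dropna() for g in groups]
fig, ax = plt.subplots(figsize=(4, 4))
ax.boxplot(data, labels=groups)
ax.set_xlabel(group)
ax.set_ylabel(trait)
fig.savefig("MRBIGR_output/demo/boxplot_custom.pdf")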

7.3 GWAS based plot

The GWAS based plots include the manhattan plot (manhattan) and the QQ plot (qqplot), which are used to demonstrate GWAS results. The command line is as follows:

docker exec -it mrbigr_env MRBIGR.py plot plot_old -i MRBIGR_output/demo/gwas/gemma/lmm/Zm00001d007718.assoc.txt -plot_type manhattan -o plot_out -od MRBIGR_output
docker exec -it mrbigr_env MRBIGR.py plot plot_old -i MRBIGR_output/demo/gwas/gemma/lmm/Zm00001d007718.assoc.txt -plot_type qqplot -o plot_out -od MRBIGR_output

Tips: The GWAS result file should be a TSV file providing the SNP ID, chromosome, position and P-value of each SNP; a GEMMA association result file can be used directly and looks like this:

chr     rs      ps      n_miss  allele1 allele0 af      beta    se      l_remle p_wald
10      chr10.s_74401   74401   0       T       C       0.422   -2.967759e-02   1.526325e-01    2.711385e+00    8.459573e-01
10      chr10.s_76702   76702   0       A       G       0.390   -1.052776e-01   1.498642e-01    2.521573e+00    4.828912e-01
10      chr10.s_92823   92823   0       T       C       0.067   -1.291505e-01   2.722853e-01    2.824265e+00    6.355980e-01
10      chr10.s_94339   94339   0       T       A       0.517   -1.321889e-01   1.421446e-01    2.540068e+00    3.530991e-01
10      chr10.s_94579   94579   0       G       C       0.351   -1.398295e-01   1.473273e-01    2.495780e+00    3.432887e-01
10      chr10.s_94901   94901   0       G       A       0.141   -2.443615e-01   2.047943e-01    2.788237e+00    2.336815e-01
...

In this GEMMA output, the chr column is the chromosome, rs is the SNP ID, ps is the position, and p_wald is the GWAS P-value used for plotting.
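
When a GWAS result comes from a different tool, it can be reshaped to expose the same information. The sketch below, assuming pandas, simply pulls the SNP ID, chromosome, position and P-value out of the GEMMA association file (the output file name is only illustrative):

import pandas as pd

# GEMMA association output is tab separated with a header line
assoc = pd.read_csv("MRBIGR_output/demo/gwas/gemma/lmm/Zm00001d007718.assoc.txt", sep="\t")
gwas = assoc[["rs", "chr", "ps", "p_wald"]]
print(gwas.sort_values("p_wald").head())   # strongest associations first
gwas.to_csv("MRBIGR_output/demo/tmp/Zm00001d007718.gwas_min.txt", sep="\t", index=False)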

7.4 MR data based plot

The MR based plots include the forest plot (forestplot) and the scatter plot (scatter_mr). The forest plot displays the effect of each exposure on the outcome traits, and the order of the outcome traits in the plot can be specified through an order file, so that the effects of different exposures on the same outcomes can be compared. It is best suited to a small number of outcome traits. The command line is as follows:

csvtk cut -f 2 MRBIGR_output/demo/mr/g2p.MR.csv -U | sort -u | head -20 | csvtk grep -P - -f 2 MRBIGR_output/demo/mr/g2p.MR.csv -o MRBIGR_output/demo/tmp/example.g2p.MR.csv
head -1 MRBIGR_data/blup_traits_final.new.csv | csvtk transpose | sed '1d' > MRBIGR_output/demo/tmp/example_trait.order.txt
docker exec -it mrbigr_env MRBIGR.py plot plot_old -i MRBIGR_output/demo/tmp/example.g2p.MR.csv -plot_type forestplot -order MRBIGR_output/demo/tmp/example_trait.order.txt -o plot_out -od MRBIGR_output/demo

Tips: The MR file is generated by the mr subcommand. The order file is a one-column file that lists the outcome traits, which looks like this (take the traits in MRBIGR_data/blup_traits_final.new.csv as an example):

Plantheight
Earheight
Earleafwidth
Earleaflength
Tasselmainaxislength
Tasselbranchnumber
Leafnumberaboveear
Earlength
Eardiameter
Cobdiameter
Kernernumberperrow
100grainweight
cobweight
Kernelwidth
Silkingtime
Pollenshed
Headingdate 
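
Since the order file is plain text with one outcome trait per line, it can also be written from the phenotype table header without csvtk. A minimal sketch using only the Python standard library (file names follow the examples above):

import csv

# Read the header of the BLUP phenotype table and drop the first field (the sample ID column)
with open("MRBIGR_data/blup_traits_final.new.csv", newline="") as fh:
    header = next(csv.reader(fh))

with open("MRBIGR_output/demo/tmp/example_trait.order.txt", "w") as out:
    out.write("\n".join(header[1:]) + "\n")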

The scatter plot displays the Mendelian randomization P-value between each exposure and outcome, and the outcomes can be grouped through a group file. It is best suited to showing the impact of an exposure across a large number of outcome traits. The command line is as follows:

docker exec -it mrbigr_env MRBIGR.py plot plot_old -i MRBIGR_output/demo/mr/g2p.MR.csv -plot_type scatter_mr -group MRBIGR_data/genes.grp -o plot_out -od MRBIGR_output/demo

Tips: The group file of the outcome traits should be a two-column annotation file in CSV format, which looks like this (take MRBIGR_data/genes.grp as an example):

id,group
Zm00001d027230,1
Zm00001d027231,1
Zm00001d027232,1
Zm00001d027233,1
Zm00001d027234,1 
...

The first column represents the outcome trait, and the second column represents the category corresponding to the outcome trait, which can be the chromosome where the gene is located, the type of metabolite, etc.
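
A group file of this kind can be derived from the gene annotation table, for instance by using the chromosome of each gene as its category. A minimal sketch, assuming pandas, the six-column annotation file described earlier and an illustrative output name:

import pandas as pd

# Use the chromosome of each gene as its group; any other categorical label would work as well
gene_info = pd.read_csv("MRBIGR_data/maize_genefunc.txt", sep="\t")
grp = gene_info[["gene_id", "chr"]].rename(columns={"gene_id": "id", "chr": "group"})
grp.to_csv("MRBIGR_output/demo/tmp/genes_by_chr.grp", index=False)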

7.5 GO based plot

The GO based plots include the bar plot (barplot) and the dot plot (dotplot), which depict the enrichment scores (e.g. P-values) together with the gene count or gene ratio of each GO term. The command line is as follows:

docker exec -it mrbigr_env MRBIGR.py plot plot_old -i MRBIGR_output/demo/go/go_out.GO.csv -plot_type barplot -o plot_out -od MRBIGR_output/demo
docker exec -it mrbigr_env MRBIGR.py plot plot_old -i MRBIGR_output/demo/go/go_out.GO.csv -plot_type dotplot -o plot_out -od MRBIGR_output/demo

Tips: The GO enrichment result file should be a ten-column file in CSV format, which looks like this:

ONTOLOGY,ID,Description,GeneRatio,BgRatio,pvalue,p.adjust,qvalue,geneID,Count  
MF,GO:0003729,mRNA binding,5/82,371/37229,1.393e-3,0.033,0.026,Zm00001d044979/Zm00001d049442/Zm00001d005350,5 
...

The first column gives the ontology domain (BP, MF or CC); the second, the GO term ID; the third, the description of the term; the fourth, the ratio of genes in the gene list annotated with the term to the total number of genes in the gene list; the fifth, the ratio of genome genes annotated with the term to the total number of annotated genes in the genome; the sixth, the P-value of the GO enrichment analysis; the seventh and eighth, the corrected P-values (these may simply repeat the P-value when the result file is user-customized); the ninth, the IDs of the genes annotated with the term; and the tenth, the count of those genes.
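
When the GO enrichment result file is built outside MRBIGR, the pvalue column is typically a one-sided hypergeometric test on the counts behind GeneRatio and BgRatio. The sketch below, assuming scipy, reproduces one such P-value from the counts in the example row; the p.adjust and qvalue columns may simply repeat it for a user-customized file, as noted above:

from scipy.stats import hypergeom

# Counts for one GO term, taken from the example row above
M = 37229   # annotated genes in the genome (denominator of BgRatio)
K = 371     # genome genes annotated with this term (numerator of BgRatio)
n = 82      # genes in the gene list (denominator of GeneRatio)
k = 5       # gene-list genes annotated with this term (Count)

# P(X >= k) for X ~ Hypergeometric(M, K, n)
pvalue = hypergeom.sf(k - 1, M, K, n)
print(f"pvalue = {pvalue:.3e}")   # should be close to the 1.393e-3 shown above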