Tutorial
In this part, each module, function and parameter of MRBIGR is introduced in detail, and several real-data results are shown to help users better understand the functions of MRBIGR. The example data comes from MRBIGR_data, which can be prepared by following the code.
The module overview of MRBIGR
$ docker exec -it mrbigr_env MRBIGR.py -h
usage: MRBIGR.py command [options]
MRBIGR: a versatile toolbox for genetic causal inference in population-scale
multi-omics data
optional arguments:
-h, --help show this help message and exit
command:
geno Genotype file based analyses
pheno Phenotype file based analyses
gwas GWAS and QTL related analyses
mr perform Mendelian Randomization analysis
net perform network analysis
go perform GO enrichment analysis
plot Visualize the results generated by MRBIGR and the data used in
MRBIGR by various types of plots
1. Genotypic data based analysis
The genotypic data analysis module takes plink-bed format genotypic data as input and can be invoked through the subcommand geno, which includes 9 functions: -convert, -qc, -imput, -clump, -kinship, -pca, -tsne, -tree and -anno. -convert is used for genotypic data format conversion; -qc is used for quality control of genotypic data; -imput provides several simple methods for fast imputation of missing genotype values; -clump keeps only one representative SNP per LD region based on a clumping method, to reduce the dimensionality of the original genotypic data; -kinship calculates the kinship matrix from genotypic data; -pca performs principal component analysis on genotypic data; -tsne performs t-SNE analysis on genotypic data; -tree builds a maximum-likelihood (ML) tree from genotypic data; -anno annotates VCF-format genotypic data.
$ docker exec -it mrbigr_env MRBIGR.py geno -h
usage: MRBIGR.py geno [options]
optional arguments:
-h, --help show this help message and exit
-g G Genotype file of plink-bed format. Prefix of plink-bed
file is needed, that is, when the input files are
name.bed/name.bim/name.fam, -g should be set to name. It
should be used together with these parameters:
-convert/-qc/-imput/-clump/-kinship/-pca/-tsne/-tree.
-o [geno_output] Prefix of the output files, the default value is
geno_output. It should be used together with these
parameters: –convert/-qc/-imput/-clump/-kinship/-pca/-tsne
/-tree/-anno.
-od [output_dir] Directory of the output files.
-convert Convert hapmap or vcf format genotype file to plink-bed
format files, or convert plink-bed format genotype files
to vcf format. It should be used together with the
parameters –g/-hapmap/-vcf and –o.
-hapmap HAPMAP The hapmap format genotype file to be converted to plink-
bed format. It should be used together with the parameter
–convert.
-vcf VCF The vcf format genotype file to be converted to plink-bed
format. It should be used together with the parameter
–convert.
-qc Quality control of the original genotype data. SNPs and
individuals that pass the -maf, -mis and -mind cutoffs will
be kept. It should be used together with the parameters
-g/-o/-maf/-mis/-mind.
-maf [0.05] Minimum Minor Allele Frequency (MAF) for a SNP to be kept,
default is 0.05. It should be used together with the
parameter –qc.
-mis [0.1] Maximum proportion of missing values for a SNP to be kept,
default is 0.1. It should be used together with the
parameter –qc.
-mind [0.99] Maximum proportion of missing values for a sample to be
kept, default is 0.99. It should be used together with the
parameter –qc.
-imput Perform fast genotype imputation via mode, mean, sampling
according to allele frequencies, or 0. It should be used
together with the parameters –g/-o/–method.
-method [mode] Genotype imputation method. Options: "mode" (most frequent
call), "random" (sampling according to allele
frequencies), "mean0" (rounded mean), "mean2" (rounded
mean to 2 decimal places). It should be used together with
the parameters –g/-o/–imput.
-clump Clumping analysis is used to keep only one representative
SNP per region of LD, to reduce the dimensionality of the
original genotype file. It should be used together with
the parameters –g/-o/-r2.
-r2 [0.8] r^2 value for the SNP clumping analysis, default is 0.8.
It should be used together with the parameters –clump.
-kinship Generate a kinship/relatedness matrix from plink-bed
format genotype data. It should be used together with the
parameters –g/-o.
-pca Perform principal component analysis for genotype
dimensionality reduction and population structure display.
It should be used together with the parameters –g/-o/-dim.
-tsne Perform t-SNE analysis to display the population
structure. It should be used together with the parameters
–g/-o/-dim.
-dim [10|3] Dimensions for the output of PCA or t-SNE analysis
results, default is 10 for PCA and 3 for t-SNE. It should
be used together with the parameters –pca/-tsne.
-tree To build a ML-tree from plink-bed format genotype data.
Clumping analysis or genotype pruning is recommended
before this step for large dataset. It should be used
together with the parameters –g/-o.
-anno To annotate the vcf-format genotype data using ANNOVAR. It
should be used together with the parameters -vcf
/-o/–db/-dbdir/-gtf/-fa.
-db DB The generated or input annotation database name. It should
be used together with the parameter -anno.
-dbdir DBDIR The generated/input annotation database directory name,
with *_refGene.txt and *_refGeneMrna.fa files in this
directory. It should be used together with the parameter
-anno.
-gtf GTF GTF format gene annotation file, used to generate the
annotation database. It should be used together with the
parameter -anno.
-fa FA Fasta format genomic reference sequence file, used to
generate the annotation database. It should be used
together with the parameter -anno.
1.1 Format conversion of genotypic data
MRBIGR supports the conversion of Hapmap/VCF format genotypic data to plink-bed format, as well as the conversion of plink-bed format to VCF format. This function can be called through the parameter -convert
. The command line is as follows:
# Hapmap to plink-bed
docker exec -it mrbigr_env MRBIGR.py geno -convert -hapmap MRBIGR_data/chr_HAMP_female.hmp -od MRBIGR_output/demo -o chr_HAMP_female.plink
# plink-bed to VCF
docker exec -it mrbigr_env MRBIGR.py geno -convert -g MRBIGR_output/demo/geno/chr_HAMP_female.plink -od MRBIGR_output/demo -o chr_HAMP_female.vcf
# VCF to plink-bed
docker exec -it mrbigr_env MRBIGR.py geno -convert -vcf MRBIGR_output/demo/geno/chr_HAMP_female.vcf.vcf -od MRBIGR_output/demo -o geno_vcf
The subcommand geno invokes the genotypic data analysis module; the parameter -convert calls the data format conversion function; -hapmap and -vcf are the input file names of the corresponding formats; -g is the prefix of the plink-bed format input data; -o is the name of the output data; -od is the output directory.
1.2 Quality control of genotypic data
Raw genotypic data may contain SNPs with a low minor allele frequency (MAF) or a high missing rate, as well as samples with an extremely high missing rate. Such data should be filtered before downstream analysis. The geno module provides a QC-based genotypic data filtering function, which can be called through the parameter -qc. The command line is as follows:
docker exec -it mrbigr_env MRBIGR.py geno -qc -g MRBIGR_data/chr_HAMP -od MRBIGR_output/demo -o chr_HAMP_qc -maf 0.05 -mis 0.2 -mind 0.2
The subcommand geno invokes the genotype analysis module; parameter -qc calls the quality control function; -g is the plink-bed format input genotypic data; -o is the output genotypic data prefix; -od is the output directory; -maf is the minimum MAF for a SNP to be kept; -mis is the maximum proportion of missing values for a SNP to be kept; -mind is the maximum proportion of missing values for a sample to be kept. The parameters -maf, -mis and -mind have default values and are not mandatory.
1.3 Fast imputation of missing genotype values
MRBIGR provides several fast and simple methods to impute the missing values in the input genotypic data: random (sampling according to allele frequencies), mean0 (rounded mean), mean2 (mean rounded to 2 decimal places), and mode (most frequent call). Note that these methods are only intended for fast imputation of test datasets; more accurate imputation should be performed with dedicated tools such as Beagle. The command line is as follows:
docker exec -it mrbigr_env MRBIGR.py geno -imput -method mode -g MRBIGR_output/demo/geno/chr_HAMP_qc_qc -od MRBIGR_output/demo -o geno_impute
The subcommand geno invokes the genotype analysis module; parameter -imput calls the genotype imputation function; -method is the imputation method; -g is the plink-bed format input genotypic data; -o is the output genotypic data prefix (the suffix _imput will be appended automatically to the output files); -od is the output directory.
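For intuition, the sketch below (a minimal illustration, not MRBIGR's own implementation) shows how the four imputation strategies can be applied column-wise to a 0/1/2-coded genotype matrix with NaN marking missing calls; the coding scheme and function name are assumptions made for this example.

import numpy as np

def impute_genotypes(geno, method="mode"):
    """Column-wise imputation of a 0/1/2-coded genotype matrix (NaN = missing)."""
    geno = geno.astype(float).copy()
    for j in range(geno.shape[1]):
        col = geno[:, j]
        miss = np.isnan(col)
        if not miss.any():
            continue
        obs = col[~miss]
        if method == "mode":                    # most frequent call
            vals, counts = np.unique(obs, return_counts=True)
            fill = vals[np.argmax(counts)]
        elif method == "mean0":                 # rounded mean
            fill = np.round(obs.mean())
        elif method == "mean2":                 # mean rounded to 2 decimal places
            fill = np.round(obs.mean(), 2)
        else:                                   # "random": sample by allele frequency
            p = obs.mean() / 2.0                # alternate-allele frequency
            fill = np.random.binomial(2, p, size=miss.sum())
        col[miss] = fill
        geno[:, j] = col
    return geno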
1.4 Dimensionality reduction of genotypic data through SNP clumping
SNP clumping is used to keep only one representative SNP per LD region, which reduces the dimensionality of the genotypic data. The method is implemented in the R package bigsnpr and is similar to the LD-based SNP pruning method in PLINK; its advantage is that the SNP with the highest MAF is kept in each LD region, so more genetic variation is retained. The command line is as follows:
docker exec -it mrbigr_env MRBIGR.py geno -clump -g MRBIGR_data/chr_HAMP -od MRBIGR_output/demo -o chr_HAMP_380k -r2 0.7
The subcommand geno invokes the genotype analysis module; parameter -clump calls the genotype clumping function to keep only one representative SNP per LD region; -g is the plink-bed format input genotypic data; -o is the output genotypic data prefix (the suffix _clump will be appended automatically to the output files); -r2 is the r^2 threshold for clumping; -od is the output directory.
1.5 Calculation of kinship matrix from genotypic data
Pairwise kinship coefficients are calculated as a measure of genetic similarity between individuals. MRBIGR can generate a kinship/relatedness matrix with the command below:
docker exec -it mrbigr_env MRBIGR.py geno -kinship -g MRBIGR_data/chr_HAMP -o chr_HAMP.kinship -od MRBIGR_output/demo
The subcommand geno invokes the genotype analysis module; parameter -kinship calls the kinship calculation function; -g is the plink-bed format input genotypic data; -o is the output file prefix; -od is the output directory. The output file kinship.cXX.txt can then be found in the output directory.
1.6 Principal component analysis from genotypic data
Principal component analysis (PCA) is commonly applied to population structure inference and dimensionality reduction. The top PCs calculated from the genotypic data reflect the population structure among the sampled individuals. MRBIGR can perform PCA with the command below:
docker exec -it mrbigr_env MRBIGR.py geno -pca -g MRBIGR_data/chr_HAMP -o chr_HAMP.geno -dim 10 -od MRBIGR_output/demo
The subcommand geno invokes the genotype analysis module; parameter -pca calls the PCA function; -g is the plink-bed format input genotypic data; -o is the output file prefix; -od is the output directory; -dim is the number of PCs (dimensions) in the output, with a default value of 10. A CSV format output file named chr_HAMP.geno_pca.csv will then be generated.
1.7 t-SNE analysis from genotypic data
t-SNE (t-distributed Stochastic Neighbor Embedding) is a machine learning algorithm for dimensionality reduction. As a nonlinear method, t-SNE often separates population structure more clearly than PCA and is well suited for visualization. MRBIGR first uses PCA to reduce the data to 50 dimensions, and then uses the t-SNE algorithm to further reduce the data to 2 or 3 dimensions. The command line is as follows:
docker exec -it mrbigr_env MRBIGR.py geno -tsne -g MRBIGR_data/chr_HAMP -o chr_HAMP.geno -dim 3 -od MRBIGR_output/demo
The subcommand geno invokes the genotype analysis module; parameter -tsne calls the t-SNE analysis function; -g is the plink-bed format input genotypic data; -o is the output file prefix; -od is the output directory; -dim is the dimensionality of the output, usually 2 or 3, with a default value of 3. A CSV format output file named geno_tsne.csv will then be generated.
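The PCA-then-t-SNE strategy described above can be reproduced outside MRBIGR with scikit-learn; the sketch below is only an illustration on placeholder data (the 0/1/2 genotype coding and array shapes are assumptions).

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

geno = np.random.randint(0, 3, size=(200, 5000)).astype(float)   # placeholder genotype matrix

pcs = PCA(n_components=50).fit_transform(geno)                    # reduce to 50 PCs first
embedding = TSNE(n_components=3, init="pca").fit_transform(pcs)   # then t-SNE to 3 dimensions
print(embedding.shape)                                            # (200, 3)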
1.8 Phylogenetic analysis from genotypic data
MRBIGR provides a one-step phylogenetic analysis function for genotypic data, which facilitates the visualization of individual relatedness and population structure. A Newick (nwk) format maximum likelihood (ML) tree can be generated with the following command:
docker exec -it mrbigr_env MRBIGR.py geno -tree -g MRBIGR_output/demo/geno/chr_HAMP_380k_clump -o chr_HAMP_380k_clump.geno_tree -od MRBIGR_output/demo
The subcommand geno invokes the genotype analysis module; parameter -tree calls the phylogenetic analysis function; -g is the plink-bed format input genotypic data; -o is the output file prefix; -od is the output directory. A Newick format output file named geno_tree.tree.nwk and several other related files geno_tree.* will then be generated.
1.9 Functional annotation of genotypic data
This function relies on ANNOVAR, an efficient software tool that uses up-to-date information to functionally annotate genetic variants. Only VCF-format genotypic data is supported as input; plink-bed format genotypic data should first be converted to VCF format using the -convert function. If annotation is performed for the first time, an annotation database should be built by adding the parameters -gtf and -fa. The command line is as follows:
gunzip -k MRBIGR_data/Zea_mays.B73_RefGen_v4.50.gtf.gz
gunzip -k MRBIGR_data/Zea_mays.B73_RefGen_v4.dna.toplevel.fa.gz
docker exec -it mrbigr_env MRBIGR.py geno -anno -vcf MRBIGR_output/demo/geno/chr_HAMP_female.vcf.vcf -o chr_HAMP_female.vcf_anno -od MRBIGR_output/demo -db ZmB73 -dbdir MRBIGR_output/demo/geno/ref -gtf MRBIGR_data/Zea_mays.B73_RefGen_v4.50.gtf -fa MRBIGR_data/Zea_mays.B73_RefGen_v4.dna.toplevel.fa
The subcommand geno invokes the genotype analysis module; parameter -anno calls the annotation function; -vcf is the VCF-format input genotypic data; -o is the output file prefix; -od is the output directory; -db is the output database name; -dbdir is the output database directory; -fa is the reference genome sequence file in FASTA format; -gtf is the reference gene annotation file in GTF format. MRBIGR will then generate an annotation database and perform functional annotation of the genetic variants automatically. The output annotation result files include chr_HAMP_female.vcf_anno.ZmB73_multianno.vcf, chr_HAMP_female.vcf_anno.ZmB73_multianno.bed, and chr_HAMP_female.vcf_anno.ZmB73_largeEffect.bed. If the annotation database has been built before, the parameters -gtf and -fa are no longer needed, and the parameters -db and -dbdir are the input database name and database directory, respectively. In this case, the command line is as follows:
docker exec -it mrbigr_env MRBIGR.py geno -anno -vcf MRBIGR_output/demo/geno/chr_HAMP_female.vcf.vcf -o chr_HAMP_female.vcf_anno -od MRBIGR_output/demo -db ZmB73 -dbdir MRBIGR_output/demo/geno/ref
2. Phenotypic data based analysis
The phenotypic data analysis module takes as input a CSV format phenotype file in which the first column and the first row are the names of samples and traits, respectively. It can be invoked through the subcommand pheno, which includes 5 functions: -qc, -imput, -scale, -correct, and -merge. -qc is used for quality control of phenotypic data; -imput is used for imputation of missing values in phenotypic data; -scale is used for scaling, normalization, standardization or transformation of phenotypic data; -correct is used for population-structure-based correction of phenotypic data; -merge is used to merge a trait measured in different environments or years using mean values, BLUP or BLUE.
$ docker exec -it mrbigr_env MRBIGR.py pheno -h
usage: MRBIGR.py pheno [options]
optional arguments:
-h, --help show this help message and exit
-p P Phenotype file of *.csv format for processing and
transformation, the first column and the first row are
the names of samples and traits, respectively. It should
be used together with the parameters
–qc/-imput/-scale/-correct/-merge.
-o [pheno_output] Prefix of the output files, default is pheno_output. It
should be used together with the parameters
–qc/-imput/-scale/-correct/-merge.
-od [output_dir] Directory of the output files.
-qc Perform phenotype data quality control and filtration,
this includes outlier data removal, missing rate
filtering and small value filtering, each of the
criterions is optional and can be skipped, while only the
data points pass the cutoff would be kept. It should be
used together with the parameters –p/-o/-rout/-mis/-val.
-tr Transposition of the input data
-rout [zscore] Outlier removal of phenotype data, the default method is
zscore. Options: zscore, boxplot. It should be used
together with the parameter –qc.
-mis [0.5] Filter traits with a missing rate larger than the cutoff,
default is 0.5. It should be used together with the
parameter –qc.
-val [0.1] Filter traits with an average value smaller than the
cutoff, default is 0.1. It should be used together with
the parameter –qc.
-imput Perform imputation for missing phenotype values. It
should be used together with the parameters
–p/-o/-method.
-method [mean] Phenotype values imputation method. Options: mean,
median, most_frequent. It should be used together with
the parameter -imput.
-scale Scaling/Normalization/Standardization/Transformation of
the phenotype data. We provide a variety of methods for
phenotype data scaling. It should be used together with
the parameters –g/-o and any one or more of the
parameters
–log2/-log10/-ln/-boxcox/-qqnorm/-minmax/-zscore/-robust.
-log2 log2 scale of the phenotype data. It should be used
together with the parameter -scale.
-log10 log10 scale of the phenotype data. It should be used
together with the parameter -scale.
-ln ln scale of the phenotype data. It should be used
together with the parameter -scale.
-boxcox Box-Cox transformation of the phenotype data. It should
be used together with the parameter -scale.
-qqnorm Normal quantile transformation of the phenotype data. It
should be used together with the parameter -scale.
-minmax Min-Max normalization of the phenotype data. It should be
used together with the parameter -scale.
-zscore Z-score Standardization of the phenotype data. It should
be used together with the parameter -scale.
-robust Scale phenotype values using statistics that are robust
to outliers. It should be used together with the
parameter -scale.
-correct Correct phenotype data caused by population structure
through a genotype-based pre-calculated PCA file. It
should be used together with the parameters –p/-o/-pca.
-pca PCA A pre-calculated PCA file used for population structure
correction of the phenotype data. It should be used
together with the parameter –correct.
-merge Merge the input phenotype values from different
environment using either mean values, BLUP or BLUE
methods. Only one trait is allowed in the input phenotype
file. It should be used together with the parameters
–p/-o/-mm.
-mm [mean] Merge method of phenotype values, default is mean.
Options: mean, blup, blue. It should be used together
with the parameter –merge.
2.1 Quality control of phenotypic data
Quality control of the original phenotypic data includes transposition of the phenotype matrix if needed, removal of outlier data points based on the z-score or boxplot method, and filtering of traits with a high missing rate or low average values. Each of the parameters -tr, -rout, -mis and -val is optional. A typical quality-control command looks like this:
docker exec -it mrbigr_env MRBIGR.py pheno -qc -p MRBIGR_data/blup_traits_final.new.csv -o blup_traits_final.qc -od MRBIGR_output/demo -rout zscore -mis 0.1 -val 0.1
The subcommand pheno invokes the phenotype analysis module; parameter -qc calls the quality control function; -p is the input phenotype matrix; -o is the prefix of the output file; -mis is the missing rate cutoff with a default value of 0.5; -val is the small-value cutoff with a default value of 0.1; -rout enables outlier removal of phenotypic values, with z-score as the default method. To perform quality control with customized parameters, the command line could be as follows:
# transposition of the original phenotype matrix
docker exec -it mrbigr_env MRBIGR.py pheno -qc -tr -p MRBIGR_data/blup_traits_final.new.csv -o blup_traits_final.tr -od MRBIGR_output/demo
# reset missing rate cutoff, skip small value filter and outlier removal
docker exec -it mrbigr_env MRBIGR.py pheno -qc -p MRBIGR_data/blup_traits_final.new.csv -o blup_traits_final.qc -mis 0.1 -od MRBIGR_output/demo
# skip small value filter and set outlier removal method to boxplot
docker exec -it mrbigr_env MRBIGR.py pheno -qc -p MRBIGR_data/blup_traits_final.new.csv -o blup_traits_final.qc -rout boxplot -mis 0.1 -od MRBIGR_output/demo
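For reference, the z-score outlier-removal idea used by -rout can be sketched in a few lines of pandas (a minimal illustration, not MRBIGR's implementation; the 3.0 cutoff and function name are assumptions):

import pandas as pd

def zscore_outlier_removal(phe: pd.DataFrame, cutoff: float = 3.0) -> pd.DataFrame:
    """Set data points whose column-wise |z-score| exceeds the cutoff to NA."""
    z = (phe - phe.mean()) / phe.std()
    return phe.mask(z.abs() > cutoff)

# usage with a trait matrix whose first column is the sample ID:
# phe = pd.read_csv("MRBIGR_data/blup_traits_final.new.csv", index_col=0)
# phe_qc = zscore_outlier_removal(phe)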
2.2 Imputation of missing phenotype values
Three phenotype imputation methods are provided by MRBIGR: mean, median, and most_frequent. Missing phenotype values in the input file should be coded as NA. The command line is as follows:
docker exec -it mrbigr_env MRBIGR.py pheno -imput -p MRBIGR_output/demo/pheno/blup_traits_final.qc.phe.csv -o blup_traits_final.qc.imput -od MRBIGR_output/demo -method most_frequent
The subcommand pheno invokes the phenotype analysis module; parameter -imput calls the phenotype imputation function; -p is the input phenotype matrix; -o is the prefix of the output file; -method is the imputation method. After this step, a phenotype file named blup_traits_final.qc.imput.phe.csv with no missing values will be generated.
2.3 Scaling and normalization of phenotypic data
MRBIGR provides multiple commonly used phenotypic data scaling and normalization methods, such as logarithmic scaling, z-score standardization, Box-Cox transformation and normal quantile transformation. These scaling and normalization methods are applicable to different types of phenotypic data, such as agronomic traits, transcript expression data and metabolite intensity data. The command line is as follows:
docker exec -it mrbigr_env MRBIGR.py pheno -scale -p MRBIGR_output/demo/pheno/blup_traits_final.qc.imput.phe.csv -o blup_traits_final.qc.imput.norm -od MRBIGR_output/demo -boxcox -minmax
The subcommand pheno invokes the phenotype analysis module; parameter -scale calls the phenotype scaling/normalization/standardization/transformation function; -p is the input phenotype matrix; -o is the prefix of the output file; -boxcox and -minmax indicate that both the Box-Cox and Min-Max methods are applied. The Box-Cox transformation is used to make each trait meet normality assumptions (a lambda is estimated per trait and used to transform that trait), while Min-Max scaling rescales the values to the 0-1 range. Other optional methods are -log2, -log10, -ln, -qqnorm, -zscore and -robust.
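The Box-Cox plus Min-Max combination used above can be sketched with SciPy for a single trait (an illustration only, assuming strictly positive values; not MRBIGR's own code):

import pandas as pd
from scipy.stats import boxcox

def boxcox_minmax(x: pd.Series) -> pd.Series:
    """Box-Cox transform one positive-valued trait, then rescale it to 0-1."""
    transformed, lam = boxcox(x.dropna())          # lambda is estimated per trait by MLE
    y = pd.Series(transformed, index=x.dropna().index)
    return (y - y.min()) / (y.max() - y.min())     # Min-Max scaling to the 0-1 range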
2.4 Correction of phenotypic data based on population structure
Since population structure may cause phenotypic data to be distributed at different levels, in some cases correcting the phenotype values to the same level based on population structure may help downstream analysis. The -correct function in MRBIGR uses a PCA file, which can be generated from the genotypic data, to perform phenotypic data correction. The command line is as follows:
docker exec -it mrbigr_env MRBIGR.py pheno -correct -p MRBIGR_output/demo/pheno/blup_traits_final.qc.imput.phe.csv -o blup_traits_final.qc.imput.pca_correct -od MRBIGR_output/demo -pca MRBIGR_output/demo/geno/chr_HAMP.geno_pca.csv
The subcommand pheno invokes the phenotype analysis module; parameter -correct calls the phenotype correction function; -p is the input phenotype matrix; -o is the output file prefix; -pca is the PCA result file generated from the genotypic data through the command geno -pca.
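One common way to perform such a correction is to regress each trait on the top genotype PCs and keep the residuals; the sketch below illustrates this idea (an assumption for illustration, since MRBIGR's exact correction model is not shown here):

import numpy as np
import pandas as pd

def correct_by_pcs(phe: pd.DataFrame, pcs: pd.DataFrame) -> pd.DataFrame:
    """Regress each trait on the PCs and return residuals (PC effects removed)."""
    common = phe.index.intersection(pcs.index)
    X = np.column_stack([np.ones(len(common)), pcs.loc[common].values])
    corrected = {}
    for trait in phe.columns:
        y = phe.loc[common, trait].values
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        corrected[trait] = y - X[:, 1:] @ beta[1:]   # subtract PC effects, keep the intercept
    return pd.DataFrame(corrected, index=common)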
2.5 Merge the phenotype values from different environments
Phenotypic data from different environments or years need to be merged before downstream analysis. Commonly used merging algorithms are the mean value, BLUP (best linear unbiased prediction) and BLUE (best linear unbiased estimation). MRBIGR provides all three methods for this purpose, selected through the parameters -merge and -mm. Note that only one trait is accepted in an input file. The command line is as follows:
docker exec -it mrbigr_env MRBIGR.py pheno -merge -p MRBIGR_data/blup_traits_final.new.csv -o blup_traits_final.blup -od MRBIGR_output/demo -mm blup
The subcommand pheno invokes the phenotype analysis module; parameter -merge calls the phenotype merge function; -p is the input phenotype matrix for one trait in CSV format; -o is the prefix of the output file; -mm is the merge method. After this step, a phenotype file named blup_traits_final.blup.phe.csv with BLUP-merged phenotypic values will be generated.
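The BLUP option fits a mixed model with line as a random effect and reports the per-line BLUPs. A minimal sketch with statsmodels is shown below; the long-format input with columns line, env and value is an assumption made for this illustration and differs from MRBIGR's CSV layout.

import pandas as pd
import statsmodels.formula.api as smf

def blup_merge(df: pd.DataFrame) -> pd.Series:
    """df has columns line, env, value; returns one BLUP-merged value per line."""
    result = smf.mixedlm("value ~ env", df, groups=df["line"]).fit()
    blup = pd.Series({line: re["Group"] for line, re in result.random_effects.items()})
    return blup + result.params["Intercept"]       # add the overall intercept back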
3. GWAS and SAL related analysis
The GWAS and SAL related analysis module can be used for GWAS, SAL detection and annotation, as well as peak- or gene-based haplotype analysis. This module can be invoked through the subcommand gwas, which includes 6 functions: -gwas, -qtl, -anno, -visual, -peaktest, and -genetest. -gwas is used to perform GWAS; -qtl is used to detect SALs from GWAS results; -anno is used for annotation of SAL regions; -visual is used for visualization of GWAS results; -peaktest is used for the peak-based haplotype test; -genetest is used for the gene-based haplotype test.
$ docker exec -it mrbigr_env MRBIGR.py gwas -h
usage: MRBIGR.py gwas [options]
optional arguments:
-h, --help show this help message and exit
-g G Genotype file of plink-bed format. Prefix of plink-bed
file is needed, that is, when the input files are
name.bed/name.bim/name.fam, -g should be set to name.
It should be used together with the parameters
-gwas/-peaktest.
-p P Phenotype file of *.csv format after processing and
transformation, the first column and the first row are
the names of samples and traits, respectively. It
should be used together with the parameters
–gwas/-peaktest/-genetest.
-o [gwas_out] Prefix of the output files. It will be used together
with the parameters
-qtl/-anno/-visual/-peaktest/-genetest.
-od [output_dir] Directory of the output files.
-thread [1] Number of threads for GWAS and QTL analysis, default is
1. It should be used together with the parameters
–gwas/-qtl/-visual.
-gwas Perform GWAS using a linear mixed model (lmm) or a
linear model (lm) implemented in GEMMA. All the result
files will be generated in a newly-built output
directory located in the working directory by default.
It should be used together with the parameters
–g/-p/–model/-thread.
-c_pca Whether to specify PCA as covariate. Default dimension
is 3. Or you can specify the PCA covariate file by
using -pca_file.
-pca_file pca_file The PCA covariate file which should be used together
with the parameter –c_pca.
-model [lmm] Fit a linear mixed model (lmm) or a linear model (lm),
default is lmm. Options: lmm, lm. It should be used
together with the parameter –gwas.
-qtl QTL detection from the GWAS results based on the PLINK-
clump method. It should be used together with the
parameter –i/-p1/-p2/-p2n/-thread.
-i [output] Parent directory of the GWAS output files *.assoc.txt,
default is output. It should be used together with the
parameter –qtl.
-p1 [1e-7] Significance threshold for index SNPs used to determine
QTLs, default is 1e-7. It should be used together with
the parameter –qtl.
-p2 [1e-5] Secondary significance threshold for clumped SNPs used
to determine the reliability of QTLs, default is 1e-5.
It should be used together with the parameter –qtl.
-p2n [5] Secondary significance SNP number in a QTL, default is
5. It should be used together with the parameter –qtl.
-anno Perform QTL annotation using a pre-formatted gene
annotation file. It should be used together with the
parameter –q/-a.
-q Q QTL result file in *.csv format generated by the -qtl
subcommand. It should be used together with the
parameter –anno.
-a A Pre-formatted gene annotation file used for QTL
annotation. It should be used together with the
parameter -anno.
-visual Generate Manhattan-plots and QQ-plots for visualization
of the GWAS results. It should be used together with
the parameters –i/-multi/-thread.
-multi Generate multiple-trait Manhattan-plot, -q must be set.
It should be used together with the parameters
–visual/-q.
-peaktest Perform peak-based haplotype test using lead SNP in
each QTL. It should be used together with the
parameters –p/-g/-q/-o.
-genetest Perform gene-based haplotype test for each trait. It
should be used together with the parameters
-f1/-f2/-vcf/-p/-o.
-f1 F1 QTL annotation file in *.csv format generated by the
subcommand gwas –anno. It should be used together with
the parameter –genetest.
-f2 F2 Genotype annotation file in *.bed format generated by
the subcommand geno –anno. It should be used together
with the parameter –genetest.
-vcf VCF VCF format genotype file. It should be used together
with the parameter –genetest.
-plot_fmt [jpg,pdf] The format of manhattan and QQ plot.
3.1 Parallel GWAS
MRBIGR supports parallel GWAS for multiple traits. Taking the processed plink-bed format genotypic data and CSV format phenotypic data as inputs, GWAS can be performed through the command below:
docker exec -it mrbigr_env MRBIGR.py gwas -gwas -model gemma_mlm -thread 12 -g MRBIGR_data/chr_HAMP -p MRBIGR_data/AMP_kernel_transcriptome_v4 -od MRBIGR_output/demo
The subcommand gwas invokes the GWAS and SAL analysis module; parameter -gwas calls the GWAS function, which utilizes GEMMA (default), GAPIT or rMVP to perform GWAS; -model is the model to fit, with gemma_lm, gemma_mlm, rmvp_glm, rmvp_mlm, rmvp_farmcpu, gapit_glm, gapit_mlm, gapit_cmlm, gapit_mmlm, gapit_farmcpu, and gapit_blink as options; -thread is the number of threads used to run the command; -g is the plink-bed format input genotypic data; -p is the CSV format phenotypic data; -od is the output directory. After this step, an output directory named MRBIGR_output/demo/gwas/gemma/lmm will be generated, containing the GWAS result files named Zm00001dxxxxx.assoc.txt.
Note: the exact output directory depends on the selected model; -model gemma_mlm means the final output directory will be MRBIGR_output/demo/gwas/gemma/lmm.
3.2 Parallel SAL detection based on GWAS results
MRBIGR uses the clumping method implemented in PLINK v1.9 to automatically detect and optimize SALs based on the original GWAS results. In detail, a stricter P-value threshold -p1 is set to identify significantly associated index SNPs; then, for each index SNP, if the other SNPs within a 500 kb distance have P-values smaller than the second, looser P-value threshold -p2, have r^2 values greater than 0.2 with the index SNP, and the number of such SNPs surpasses the SNP number threshold -p2n, the region is regarded as a SAL; finally, all overlapping SALs are merged to generate the final SAL set, and the SNP with the smallest P-value in each SAL is defined as the lead SNP. The command line is as follows:
docker exec -it mrbigr_env MRBIGR.py gwas -qtl -g MRBIGR_data/chr_HAMP -thread 6 -i MRBIGR_output/demo/gwas/gemma/lmm -o qtl_output -od MRBIGR_output/demo -p1 1e-5 -p2 1e-3 -p2n 5
The subcommand gwas invokes the GWAS and SAL analysis module; parameter -qtl calls the SAL detection function; -thread is the number of threads used to run the command; -i is the GWAS result directory; -o is the output file prefix; -od is the output directory; -p1 is the significance threshold for index SNPs used to determine SALs; -p2 is the secondary significance threshold for clumped SNPs used to determine the reliability of SALs; -p2n is the number of secondary significant SNPs required in a SAL. After this step, an output file named qtl_output.qtl_res.csv will be generated.
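Conceptually, the SAL detection procedure described above boils down to the following sketch (an illustration only: the DataFrame columns chr, ps and p_wald and the helper r2(a, b) returning pairwise LD are assumptions, and this is not MRBIGR's code):

def detect_sals(gwas, r2, p1=1e-7, p2=1e-5, p2n=5, window=500_000):
    """gwas: DataFrame with columns chr, ps, p_wald; r2(a, b): LD between two SNP rows."""
    index_snps = gwas[gwas.p_wald < p1]
    sals = []
    for _, idx in index_snps.iterrows():
        near = gwas[(gwas.chr == idx.chr) & ((gwas.ps - idx.ps).abs() <= window)]
        linked = near.apply(lambda s: r2(idx, s) > 0.2, axis=1)
        support = near[(near.p_wald < p2) & linked]
        if len(support) >= p2n:
            sals.append((idx.chr, support.ps.min(), support.ps.max(), idx.ps))
    # overlapping SALs on the same chromosome are then merged, and the SNP with the
    # smallest P-value in each merged SAL is reported as the lead SNP
    return sals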
3.3 Annotation of SAL regions
The SAL result file can be annotated using a four-column gene annotation file in TSV format, which looks like this:
geneid aliased position function
Zm00001d027230 Zm00001d027230 1:44289-49837:+ Zm00001d027230
Zm00001d027231 Zm00001d027231 1:50877-55716:- Zm00001d027231
Zm00001d027232 Zm00001d027232 1:92299-95134:- Zm00001d027232
Zm00001d027233 Zm00001d027233 1:111655-118312:- Zm00001d027233
This file can be generated from a standard GTF format gene annotation file through the following command:
mkdir MRBIGR_output/demo/tmp/
grep -v '#' MRBIGR_data/Zea_mays.B73_RefGen_v4.50.gtf |awk '{if($3=="gene"){print $0}}' |sed 's/ /\t/g'|sed 's/"//g'|sed 's/;//g'|awk '{print $10"\t"$10"\t"$1":"$4"-"$5":"$7"\t"$10}'|sed '1igeneid\taliased\tposition\tfunction' > MRBIGR_output/demo/tmp/gene.anno.tsv
Then, taking this file as input, SAL annotation can be performed with the following command:
docker exec -it mrbigr_env MRBIGR.py gwas -anno -q MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv -a MRBIGR_output/demo/tmp/gene.anno.tsv -o anno_output -od MRBIGR_output/demo
The subcommand gwas invokes the GWAS and SAL analysis module; parameter -anno calls the SAL annotation function; -q is the input SAL result file to be annotated; -a is the gene annotation file; -o is the prefix of the output file; -od is the output directory. After this step, an output file named anno_output.qtl_anno.csv will be generated.
3.4 Visualization of GWAS results
Manhattan plots and QQ plots can be generated for visualization of the GWAS results. The command line is as follows:
docker exec -it mrbigr_env MRBIGR.py gwas -visual -thread 12 -i MRBIGR_output/demo/gwas/gemma/lmm -od MRBIGR_output/demo
or
docker exec -it mrbigr_env MRBIGR.py gwas -visual -thread 12 -i MRBIGR_output/demo/gwas/gemma/lmm -od MRBIGR_output/demo -multi -q MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv
The subcommand gwas invokes the GWAS and SAL analysis module; parameter -visual calls the visualization function; -i is the GWAS result directory; -od is the output directory; -multi is optional, and a multi-trait Manhattan plot will be generated when it is set; -q is the input SAL result file, which only needs to be set when -multi is set; -thread is the number of threads used to run the command.
3.5 Statistical test of lead SNP
A simple approach to relate the phenotype distribution to the SAL haplotype is to use the genotype of the lead SNP to represent the haplotype of the SAL region; the phenotype distribution under each genotype of the lead SNP can then be displayed, and a Student's t-test can be performed. This statistical test can be run in MRBIGR through the command below:
docker exec -it mrbigr_env MRBIGR.py gwas -peaktest -o peaktest_output -od MRBIGR_output/demo -p MRBIGR_data/AMP_kernel_transcriptome_v4 -g MRBIGR_data/chr_HAMP -q MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv
The subcommand gwas invokes the GWAS and SAL analysis module; parameter -peaktest calls the peak test function; -o is the prefix of the output file; -od is the output directory; -p is the input phenotypic data; -g is the input genotypic data; -q is the input SAL result file. The genotype information of each sample and a phenotype distribution plot will then be generated in the output directory.
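The underlying comparison can be sketched in a few lines (an illustration, not MRBIGR's code; the dictionary inputs and genotype labels are assumptions): phenotype values are grouped by the lead-SNP genotype and the two groups are compared with a t-test.

from scipy.stats import ttest_ind

def peak_test(geno: dict, phe: dict, g1: str, g2: str):
    """geno: sample -> genotype call (e.g. 'CC'/'TT'); phe: sample -> trait value."""
    x = [phe[s] for s, g in geno.items() if g == g1 and s in phe]
    y = [phe[s] for s, g in geno.items() if g == g2 and s in phe]
    return ttest_ind(x, y, equal_var=False)        # Welch's t-test between the two groups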
3.6 Statistical test of genes
Another approach to relate the phenotype distribution to the SAL haplotype is to test the phenotype distribution against the haplotype of each gene in the SAL, in order to discover potential causal genes and variants. For a SAL region, MRBIGR uses large-effect variants in each gene to build gene haplotypes and performs a gene-based haplotype test. Welch's t-test is used for two-group haplotype comparisons and Tukey's test for multi-group comparisons, as described in Yano et al., 2016. The command line is as follows:
docker exec -it mrbigr_env MRBIGR.py gwas -genetest -f1 MRBIGR_output/demo/gwas/anno/anno_output.qtl_anno.csv -f2 MRBIGR_output/demo/geno/chr_HAMP_female.vcf_anno.ZmB73_multianno.bed -vcf MRBIGR_output/demo/geno/chr_HAMP_female.vcf.vcf -p MRBIGR_data/blup_traits_final.new.csv -od MRBIGR_output/demo
The subcommand gwas invokes the GWAS and SAL analysis module; parameter -genetest calls the gene test function; -f1 is the input SAL annotation file generated by the gwas -anno command; -f2 is the input genotype annotation file generated by the geno -anno command; -p is the input phenotype file; -od is the output directory. The gene haplotype information of each sample and phenotype distribution plots will then be generated in the MRBIGR_output/demo/gwas/gene_haplotest directory.
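For the multi-group case, Tukey's test mentioned above can be sketched with statsmodels (a toy illustration with made-up values and haplotype labels, not MRBIGR's code):

from statsmodels.stats.multicomp import pairwise_tukeyhsd

values = [1.2, 1.5, 0.9, 2.1, 2.3, 2.0, 1.1, 1.0]          # phenotype values
haps   = ["H1", "H1", "H1", "H2", "H2", "H2", "H3", "H3"]   # gene haplotype of each sample
print(pairwise_tukeyhsd(values, haps))                      # pairwise differences with adjusted P-values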
4. Mendelian randomization analysis of multi-omics data
The Mendelian randomization analysis module can be used to perform Mendelian randomization (MR) analysis between different omics data and can be invoked through the subcommand mr, which supports 3 modes depending on the parameters. The first mode takes omics data through the -exposure and -outcome parameters and performs MR analysis between the -exposure data and the -outcome data. The second mode takes transcriptome expression data through the -gene_exp parameter and performs MR analysis between gene pairs when the -pairwise parameter is set. The third mode also takes transcriptome expression data through the -gene_exp parameter, with the -tf parameter providing the transcription factor annotation and -target providing the transcription factor target gene annotation, and then performs MR analysis between transcription factors and target genes. There are two calculation models for Mendelian randomization, a linear model and a mixed linear model, specified by the parameters -lm and -mlm, respectively.
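For intuition only, the core MR idea with a single genetic instrument can be written as a Wald ratio: the SNP-outcome effect divided by the SNP-exposure effect. The sketch below illustrates this principle and is not MRBIGR's lm/mlm estimator (the 0/1/2 genotype coding is an assumption):

import numpy as np

def wald_ratio(snp, exposure, outcome):
    """snp, exposure, outcome: 1-D arrays over the same samples."""
    b_exp = np.polyfit(snp, exposure, 1)[0]   # SNP -> exposure effect (slope)
    b_out = np.polyfit(snp, outcome, 1)[0]    # SNP -> outcome effect (slope)
    return b_out / b_exp                      # estimated causal effect of exposure on outcome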
$ docker exec -it mrbigr_env MRBIGR.py mr -h
usage: MRBIGR.py mr [options]
optional arguments:
-h, --help show this help message and exit
-g Genotype file in plink bed format
-exposure Exposure phenotype file, such as mTrait phenotype, the
file format is csv format, the first column and the
first row are the names of inbred lines and phenotypes
respectively
-outcome Outcome phenotype file, such as pTrait phenotype, the
file format is csv format, the first column and the
first row are the names of inbred lines and phenotypes
respectively
-tf Transcription factors annotation file
-target targeted genes annotation file
-type [All] Regulatory relationship between transcription factors
and targeted genes. direct, direct regulatory
relationship between transcription factors and targeted
genes. All, direct and indirect regulatory relationship
between transcription factors and targeted genes
-pairwise perform Mendelian randomization analysis between all
gene pairs
-gene_exp gene expression file, the file format is csv format,
the first column and the first row are the names of
inbred lines and phenotypes respectively
-qtl Exposure phenotype QTL file, generated by subcommand
regiongwas
-lm Perform Mendelian Randomization through linear model
-mlm Perform Mendelian Randomization through mixed linear
model
-pvalue [default:1] The pvalue cutoff of Mendelian randomization analysis
result output,default 1
-thread [1] Number of threads for Mendelian Randomization analysis
-o [mr_out] The prefix of the output file, default mr_out
-od [output_dir] Directory of the output files.
4.1 Exposure/outcome Mode
The first mode uses the -exposure parameter to specify the exposure data required by the Mendelian randomization model, -outcome provides the outcome data, and -qtl provides the genetic variants of the exposure data. Mendelian randomization analysis can then be performed through the command below:
docker exec -it mrbigr_env MRBIGR.py mr -g MRBIGR_data/chr_HAMP -exposure MRBIGR_data/AMP_kernel_transcriptome_v4 -qtl MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv -outcome MRBIGR_data/blup_traits_final.new.csv -mlm -thread 12 -o g2p -od MRBIGR_output/demo
The subcommand mr invokes the Mendelian randomization analysis module; parameter -exposure is the CSV format exposure data; -outcome is the CSV format outcome data; -qtl is the CSV format exposure SAL data; -thread is the number of threads used to run the command; -g is the plink-bed format input genotypic data; -mlm indicates that Mendelian randomization is performed with a mixed linear model; -o is the prefix of the output file; -od is the output directory. After this step, an MR result file named g2p.MR.csv will be generated.
4.2 Pairwise Mode
The second mode uses the -gene_exp parameter to specify the population gene expression data, -qtl specifies the genetic variants that affect gene expression, and -pairwise indicates that Mendelian randomization analysis is performed between all pairs of genes in the population gene expression data. Pairwise gene Mendelian randomization analysis can be performed through the command below:
docker exec -it mrbigr_env MRBIGR.py mr -g MRBIGR_data/chr_HAMP -gene_exp MRBIGR_data/AMP_kernel_transcriptome_v4 -pairwise -mlm -qtl MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv -thread 12 -o gene_pairwise_mr_out -od MRBIGR_output/demo
The subcommand mr invokes the Mendelian randomization analysis module; parameter -gene_exp is the CSV format population gene expression data; -qtl is the CSV format gene SAL data; -thread is the number of threads used to run the command; -g is the plink-bed format input genotypic data; -mlm indicates that Mendelian randomization is performed with a mixed linear model; -o is the prefix of the output file. After this step, an MR result file named gene_pairwise_mr_out.MR.csv will be generated.
4.3 TF Mode
The third mode uses the -gene_exp parameter to specify the population expression data, while the -tf and -target parameters specify the transcription factors and their target genes, respectively; Mendelian randomization analysis is then performed between the transcription factors and the target genes. The regulatory relationship between a transcription factor and a target gene can be either direct or indirect: when the genomic distance between a transcription factor and its target gene is < 500 kb, the relationship is defined as direct, while the relationships with the remaining target genes are defined as indirect. The -type parameter specifies which regulatory relationships are used in the Mendelian randomization analysis. The command line is as follows:
head -100 MRBIGR_data/gene.anno.tsv > MRBIGR_output/demo/tmp/test_head100_genefunc.txt
tail -100 MRBIGR_data/gene.anno.tsv > MRBIGR_output/demo/tmp/test_tail100_genefunc.txt
docker exec -it mrbigr_env MRBIGR.py mr -g MRBIGR_data/chr_HAMP -gene_exp MRBIGR_data/AMP_kernel_transcriptome_v4 -tf MRBIGR_output/demo/tmp/test_head100_genefunc.txt -target MRBIGR_output/demo/tmp/test_tail100_genefunc.txt -mlm -qtl MRBIGR_output/demo/gwas/qtl/qtl_output.qtl_res.csv -thread 12 -o tf_test_mr_out -od MRBIGR_output/demo
The subcommand mr invokes the Mendelian randomization analysis module; parameter -gene_exp is the CSV format population gene expression data; -qtl is the CSV format transcription factor SAL data; -tf is the transcription factor annotation file; -target is the annotation file for the genes targeted by transcription factors; -thread is the number of threads used to run the command; -g is the plink-bed format input genotypic data; -mlm indicates that Mendelian randomization is performed with a mixed linear model; -type is the regulatory relationship between transcription factors and target genes, with either direct (perform Mendelian randomization analysis between transcription factors and their directly regulated target genes) or all (perform Mendelian randomization analysis between transcription factors and all target genes) as options; -o is the prefix of the output file; -od is the output directory. After this step, an MR result file named tf_test_mr_out.MR.csv will be generated.
Tips: The transcription factor annotation file and the target gene annotation file are both four-column gene annotation files in TSV format, which look like this (taking MRBIGR_data/gene.anno.tsv as an example):
geneid aliased position function
Zm00001d027230 Zm00001d027230 1:44289-49837:+ Zm00001d027230
Zm00001d027231 Zm00001d027231 1:50877-55716:- Zm00001d027231
Zm00001d027232 Zm00001d027232 1:92299-95134:- Zm00001d027232
5. MR-based network construction and module identification
In MRBIGR, the weights of MR effects between gene pairs are used to construct the MR-based network, and the ClusterONE algorithm is used to identify modules in the constructed network. This analysis can be invoked through the subcommand net:
$ docker exec -it mrbigr_env MRBIGR.py net -h
usage: MRBIGR.py net [options]
optional arguments:
-h, --help show this help message and exit
-mr pairwise Mendelian Randomization analysis result
-module_size [5] minimal size of genes in a module,default 5
-pvalue [1e-3] pvalue cutoff for MR-based network analysis,default
1e-3
-plot plot network of each module
-plot_fmt [jpg,pdf] The format of the output plots
-o [net_out] The prefix of the output file, default go_out
-od [output_dir] Directory of the output files.
The MR-based network analysis command line is as follows:
docker exec -it mrbigr_env MRBIGR.py net -mr MRBIGR_output/demo/mr/gene_pairwise_mr_out.MR.csv -plot -o net_out -od MRBIGR_output/demo
The subcommand net invokes the Mendelian randomization based network analysis module; parameter -mr is the CSV format Mendelian randomization analysis result; -plot indicates that a network plot is generated for each identified module; -o is the prefix of the output file; -od is the output directory. After this step, the network edge list file net_out.edge_list, the ClusterONE result net_out.clusterone.result.csv, the network module plots net_out.module*.pdf and the final network module file net_out.module.csv will be generated.
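The way pairwise MR results become a weighted network before module detection can be sketched as follows (an illustration only; the column names exposure, outcome, effect and pvalue are assumptions about the MR result file, and this is not MRBIGR's code):

import pandas as pd
import networkx as nx

mr = pd.read_csv("MRBIGR_output/demo/mr/gene_pairwise_mr_out.MR.csv")
edges = mr[mr["pvalue"] < 1e-3]                        # same role as the -pvalue cutoff
G = nx.Graph()
for _, row in edges.iterrows():
    G.add_edge(row["exposure"], row["outcome"], weight=abs(row["effect"]))
nx.write_weighted_edgelist(G, "net_out.edge_list")     # edge list passed on to ClusterONE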
6. Gene ontology analysis of network modules
Gene ontology (GO) analysis of the MR-based network modules helps to identify modules with biological significance. The enrichGO function in the R package clusterProfiler is used to perform GO enrichment analysis on each module based on gene ontology annotation, and rich graphics (e.g. dotplot, barplot, cnetplot, heatplot, emapplot, upsetplot) can be chosen for visualization of the enrichment results. Gene ontology analysis can be invoked through the subcommand go:
$ docker exec -it mrbigr_env MRBIGR.py go -h
usage: MRBIGR.py go [options]
optional arguments:
-h, --help show this help message and exit
-gene_lists gene lists file for GO enrichment analysis
-go_anno GO annotation file for each gene used to construct GO
annotation database
-gene_info gene information file used to construct GO annotation
database
-pvalue adjusted pvalue cutoff on enrichment tests to report
-plot_type [dotplot] Visualization types of GO enrichment result, support
barplot, dotplot, Gene-Concept Network(cnetplot),
heatplot, Enrichment Map(emapplot) and upsetplot, one
or more plot types can be specified at the same time,
multiple types are separated by commas, such as
barplot,emapplot,heatplot
-plot_fmt [jpg,pdf] The format of the output plots
-o [go_out] The prefix of the output file, default go_out
-od [output_dir] Directory of the output files.
The analysis command line is as follows:
csvtk sep -f 3 -t MRBIGR_data/gene.anno.tsv -s ":" -n chr,start-end,strand | csvtk sep -f start-end -t -s "-" -n start,end | csvtk cut -f 1,5,8,9,7,2 -t | csvtk rename -t -f 1-6 -n "gene_id,chr,start,end,stard,annotation" -o MRBIGR_output/demo/tmp/gene_anno.txt
docker exec -it mrbigr_env MRBIGR.py go -gene_lists MRBIGR_output/demo/net/net_out.module.csv -go_anno MRBIGR_data/maize.genes2go.txt -gene_info MRBIGR_output/demo/tmp/gene_anno.txt -plot_type barplot,dotplot -o go_out -od MRBIGR_output/demo
The subcommand go invokes the gene ontology enrichment analysis module; parameter -gene_lists is the CSV format network module data; -go_anno is the tabular format gene ontology annotation for each gene; -gene_info is the tabular format gene annotation data; -plot_type is the visualization type of the enrichment results; -o is the prefix of the output file; -od is the output directory. After this step, the gene ontology enrichment result go_out.GO.csv and the visualized enrichment results go_out.BP.dotplot.pdf, go_out.MF.dotplot.pdf, go_out.CC.dotplot.pdf, go_out.BP.barplot.pdf, go_out.MF.barplot.pdf and go_out.CC.barplot.pdf will be generated.
Tips: The gene ontology annotation of each gene is provided as a two-column annotation file in TSV format, which looks like this (taking MRBIGR_data/maize.genes2go.txt as an example):
Zm00001d018305 GO:0051649
Zm00001d018305 GO:0003677
Zm00001d003209 GO:0005886
Zm00001d003209 GO:0044238
The gene annotation information is provided as a six-column annotation file in TSV format, which looks like this (taking MRBIGR_data/maize_genefunc.txt as an example):
gene_id chr start end stard annotation
Zm00001d027230 1 44289 49837 + Mitochondrial transcription termination factor family protein
Zm00001d027231 1 50877 55716 - OSJNBa0093O08.6 protein
Zm00001d027232 1 92299 95134 - Zm00001d027232
Zm00001d027234 1 118683 119739 - Zm00001d027234
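Conceptually, the enrichment test behind each GO term is an over-representation (hypergeometric) test; the sketch below shows this idea in Python, while MRBIGR itself relies on clusterProfiler's enrichGO in R (the function name and inputs here are assumptions for illustration):

from scipy.stats import hypergeom

def go_term_pvalue(module_genes, term_genes, genome_size):
    """P-value that the module contains at least the observed number of term genes."""
    k = len(set(module_genes) & set(term_genes))     # term genes found in the module
    return hypergeom.sf(k - 1, genome_size, len(term_genes), len(module_genes))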
7. Data visualization
The data visualization module in MRBIGR provides a collection of plot functions for the visualization of genotype, phenotype, GWAS, Mendelian randomization, network, and GO analysis results. This module can be invoked through the subcommand plot. The parameter -i specifies the data file needed for plotting, and -plot_type specifies the plot type; the options include manhattan, qqplot, SNPdensity, hist, boxplot, scatter_ps, barplot, dotplot, forestplot, and scatter_mr. The plot module can be invoked through the plot plot_old subcommand:
$ docker exec -it mrbigr_env MRBIGR.py plot plot_old -h
usage: MRBIGR.py plot plot_old [options]
optional arguments:
-h, --help show this help message and exit
-i The result file generated by MRBIGR or the data file
used in MRBIGR, the file format is csv format
-plot_type Visualization type of the result file generated by
MRBIGR or the data file used in MRBIGR,including tree,
scatter_mr, forestplot, manhattan, qqplot, SNPdensity,
hist, boxplot, scatter_ps, barplot, dotplot; support
tree plot for phylogenetic tree analysis, scatterplot
and forest plot for Mendelian Randomization analysis,
manhattan plot and QQ plot for GWAS result, SNP density
plot for GWAS result and SNP info file, boxplot and
histogram for phenotype data, barplot and dotplot for
GO enrich result, scatter plot for population
structure.
-order Visualizing the results of Mendelian randomization in
the order of traits in the file, used in forest plot
-group Group file for scatter plot, boxplot and tree plot
-group_sep Delimiter used in group file
-drops Strains used to drop
-plot_fmt [jpg,pdf] The format of the output plots
-chrom Specify the chromosome for manhattan plot. Default plot
all the chromosomes
-start Specify the start position for manhattan plot. Default:
0
-end Specify the end position for manhattan plot. Default:
length of chromosome
-hl_pos Specify the position of SNPs to be highlighted for
manhattan plot. Use comma to separate multiple
positions.
-hl_text The texts to be highlighted for manhattan plot. Use
comma to separate multiple positions.
-input_sep The separator in GWAS result file. Default: tab
-o [plot_out] The prefix of the output file, default plot_out
-od [output_dir] Directory of the output files.
7.1 Genotypic data based plot
The genotypic data based plots include the SNP density plot (SNPdensity) and the population scatter plot (scatter_ps). The SNP density plot demonstrates the distribution of SNPs across the genome. The command line is as follows:
csvtk cut -f 2,1,4 -t MRBIGR_data/chr_HAMP.bim -H | csvtk add-header -t -n "rs,chr,ps" -o MRBIGR_output/demo/tmp/snp_info.txt
docker exec -it mrbigr_env MRBIGR.py plot plot_old -i MRBIGR_output/demo/tmp/snp_info.txt -plot_type SNPdensity -o plot_out -od MRBIGR_output/demo
Tips: The SNP information file is a three-column file in TSV format, which looks like this (taking the generated MRBIGR_output/demo/tmp/snp_info.txt as an example):
rs chr ps
chr10.s_109149 10 109149
chr10.s_109277 10 109277
chr10.s_109475 10 109475
chr10.s_109511 10 109511
The first column represents the SNP id, the second column represents the chromosome of the SNP, and the third column represents the position of the SNP. The population scatterplot takes a PCA or t-SNE result file as input to visualize the population structure. The population information of each individual can be specified by the -group parameter, which is used to assign individuals to populations with different colors. The command line is as follows:
docker exec -it mrbigr_env MRBIGR.py plot plot_old -i MRBIGR_output/demo/geno/chr_HAMP.geno_pca.csv -plot_type scatter_ps -group MRBIGR_data/491.group_info.csv -o plot_out -od MRBIGR_output/demo
The population structure file is a CSV format file with the sample ID in the first column followed by the principal components, which looks like this (taking MRBIGR_output/demo/geno/chr_HAMP.geno_pca.csv as an example):
,PC1,PC2,PC3
CIMBL32,-500.22876280227405,291.4219232815866,-106.31967411588965
CIMBL89,-317.01045032787937,76.23101098461999,-5.709429188066682
CIMBL7,-438.9677175927099,147.47327899812078,17.419342974755814
CIMBL45,168.92927463582174,142.43977568787892,137.81050851770115
ZHENG58,687.5073243050226,-230.98495850470024,-262.2697439302226
CML415,-296.43476541023466,8.953155752331595,1.960958532295878
CIMBL46,-363.40800595652894,72.35692371618754,85.151195977471
CIMBL70,-416.08725961220046,162.22519090608853,30.936760965741744
CIMBL124,-493.08148241098274,158.7571953595056,21.30131838983761
...
The first column represents the sample ID, the second column represents the first principal component of each inbred line, and the third column represents the second principal component of each inbred line.
The group file is a two-column file in CSV format, which looks like this (taking MRBIGR_data/491.group_info.csv as an example):
line,subpop
150,Mixed
177,Mixed
238,NSS
268,NSS
501,NSS
647,Mixed
812,SS
832,SS
1323,Mixed
...
The first column represents the sample identifier, the second column represents the population information of each sample.
7.2 Phenotypic data based plot
The phenotypic data based plots include the histogram (hist) and the boxplot (boxplot), which are used to demonstrate the distribution of phenotypic data. The boxplot can show not only the overall distribution of a phenotype but also its distribution across different groups when the -group parameter is specified. The command line is as follows:
csvtk cut -f 1-2 MRBIGR_data/blup_traits_final.new.csv -o MRBIGR_output/demo/tmp/example.phe.csv
docker exec -it mrbigr_env MRBIGR.py plot plot_old -i MRBIGR_output/demo/tmp/example.phe.csv -plot_type hist -o plot_out -od MRBIGR_output/demo
cut -f 1,12- MRBIGR_data/chr_HAMP_female.hmp| head -2 | csvtk transpose -d $'\t' -D ',' | csvtk rename -f 1 -n "line" -o MRBIGR_output/demo/tmp/example.snp_group.csv
docker exec -it mrbigr_env MRBIGR.py plot plot_old -i MRBIGR_output/demo/tmp/example.phe.csv -plot_type boxplot -group MRBIGR_output/demo/tmp/example.snp_group.csv -o plot_out -od MRBIGR_output/demo
The phenotype file is a two-column file in CSV format, which looks like this (taking MRBIGR_output/demo/tmp/example.phe.csv as an example):
Trait,Plantheight
150,199.40686200000002
177,167.7376926
238,195.1777397
268,160.7956655
501,170.01783559999998
647,143.4828657
812,164.7820379
1323,183.0386928
1462,211.1751917
...
The first column represents the sample identifier, the second column represents the phenotypic data of each sample.
The group file is a two-column file in CSV format, which follows the format of MRBIGR_data/491.group_info.csv or looks like this (taking MRBIGR_output/demo/tmp/example.snp_group.csv as an example):
line,chr10.s_74401
CIMBL32,CC
CIMBL89,TT
CIMBL7,CC
CIMBL45,TT
ZHENG58,CC
CML415,TT
CIMBL46,TT
CIMBL70,TT
CIMBL124,CC
...
The first column represents the sample ID, and the second column represents the group of each sample, which can be a haplotype or population assignment.
7.3 GWAS based plot
The GWAS based plots include the Manhattan plot (manhattan) and the QQ plot (qqplot), which are used to demonstrate GWAS results. The command line is as follows:
docker exec -it mrbigr_env MRBIGR.py plot plot_old -i MRBIGR_output/demo/gwas/gemma/lmm/Zm00001d007718.assoc.txt -plot_type manhattan -o plot_out -od MRBIGR_output
docker exec -it mrbigr_env MRBIGR.py plot plot_old -i MRBIGR_output/demo/gwas/gemma/lmm/Zm00001d007718.assoc.txt -plot_type qqplot -o plot_out -od MRBIGR_output
Tips: The GWAS result file is a TSV format file (the GEMMA association output is shown here), which looks like this:
chr rs ps n_miss allele1 allele0 af beta se l_remle p_wald
10 chr10.s_74401 74401 0 T C 0.422 -2.967759e-02 1.526325e-01 2.711385e+00 8.459573e-01
10 chr10.s_76702 76702 0 A G 0.390 -1.052776e-01 1.498642e-01 2.521573e+00 4.828912e-01
10 chr10.s_92823 92823 0 T C 0.067 -1.291505e-01 2.722853e-01 2.824265e+00 6.355980e-01
10 chr10.s_94339 94339 0 T A 0.517 -1.321889e-01 1.421446e-01 2.540068e+00 3.530991e-01
10 chr10.s_94579 94579 0 G C 0.351 -1.398295e-01 1.473273e-01 2.495780e+00 3.432887e-01
10 chr10.s_94901 94901 0 G A 0.141 -2.443615e-01 2.047943e-01 2.788237e+00 2.336815e-01
...
The key columns are the chromosome of the SNP (chr), the SNP ID (rs), the SNP position (ps) and the GWAS P-value (p_wald).
7.4 MR data based plot
The MR based plots include the forest plot (forestplot) and the scatter plot (scatter_mr). The forest plot displays the effect between each exposure and the outcome traits; the order of the outcome traits in the forest plot can be specified through an order file, so that the effects of different exposures on the outcome traits can be compared. It is suited to comparing the influence of different exposures over a small number of outcome traits. The command line is as follows:
csvtk cut -f 2 MRBIGR_output/demo/mr/g2p.MR.csv -U | sort -u | head -20 | csvtk grep -P - -f 2 MRBIGR_output/demo/mr/g2p.MR.csv -o MRBIGR_output/demo/tmp/example.g2p.MR.csv
head -1 MRBIGR_data/blup_traits_final.new.csv | csvtk transpose | sed '1d' > MRBIGR_output/demo/tmp/example_trait.order.txt
docker exec -it mrbigr_env MRBIGR.py plot plot_old -i MRBIGR_output/demo/tmp/example.g2p.MR.csv -plot_type forestplot -order MRBIGR_output/demo/tmp/example_trait.order.txt -o plot_out -od MRBIGR_output/demo
Tips: The MR file is generated by the mr subcommand. The order file is a one-column file that lists the outcome traits, which looks like this (taking the traits in MRBIGR_data/blup_traits_final.new.csv as an example):
Plantheight
Earheight
Earleafwidth
Earleaflength
Tasselmainaxislength
Tasselbranchnumber
Leafnumberaboveear
Earlength
Eardiameter
Cobdiameter
Kernernumberperrow
100grainweight
cobweight
Kernelwidth
Silkingtime
Pollenshed
Headingdate
The scatter plot displays the Mendelian randomization P-value between each exposure and outcome, and the outcomes can be grouped through the group file. It is suited to showing the impact of an exposure across a large number of outcome traits. The command line is as follows:
docker exec -it mrbigr_env MRBIGR.py plot plot_old -i MRBIGR_output/demo/mr/g2p.MR.csv -plot_type scatter_mr -group MRBIGR_data/genes.grp -o plot_out -od MRBIGR_output/demo
Tips: The group file of outcome traits is a two-column annotation file in CSV format, which looks like this (taking MRBIGR_data/genes.grp as an example):
id,group
Zm00001d027230,1
Zm00001d027231,1
Zm00001d027232,1
Zm00001d027233,1
Zm00001d027234,1
...
The first column represents the outcome trait, and the second column represents the category corresponding to the outcome trait, which can be the chromosome where the gene is located, the type of metabolite, etc.
7.5 GO based plot
The GO based plots include the bar plot (barplot) and the dot plot (dotplot), which depict the enrichment scores (e.g. P-values) and the gene count or ratio. The command line is as follows:
docker exec -it mrbigr_env MRBIGR.py plot plot_old -i MRBIGR_output/demo/go/go_out.GO.csv -plot_type barplot -o plot_out -od MRBIGR_output/demo
docker exec -it mrbigr_env MRBIGR.py plot plot_old -i MRBIGR_output/demo/go/go_out.GO.csv -plot_type dotplot -o plot_out -od MRBIGR_output/demo
Tips: The GO enrichment result file is a ten-column file in CSV format, which looks like this:
ONTOLOGY,ID,Description,GeneRatio,BgRatio,pvalue,p.adjust,qvalue,geneID,Count
MF,GO:0003729,mRNA binding,5/82,371/37229,1.393e-3,0.033,0.026,Zm00001d044979/Zm00001d049442/Zm00001d005350,5
...
The first column represents the ontology domain (BP, MF or CC); the second column is the ontology term ID; the third column is the description of the term; the fourth column is the ratio of genes annotated with the term in the gene list to the total genes in the gene list; the fifth column is the ratio of genes annotated with the term in the whole genome to the total genes in the whole genome; the sixth column is the P-value of the GO enrichment analysis; the seventh and eighth columns are the corrected P-values (they can be identical when the GO enrichment result file is user-customized); the ninth column lists the IDs of the genes annotated with the term; and the tenth column is the count of genes annotated with the term.