MCPtaggR 1.0.0
MCPtaggR provides the functions to conduct mappable-collinear-polymorphic tag genotyping (MCPtagg). MCPtagg is a pipeline for next generation sequencing (NGS)-based genotyping technique to precisely detect SNPs and count reads on validated tags (short genome segments or regions) based on genome comparison information. Short NGS reads can not always be mapped to the proper positions on the reference genome if the reads were originally derived from a genome that have plenty polymorphisms and structural variations against the reference genome, e.g. mapping reads of a hybrid population that were derived from a cross between distant relatives. Polymorphic reads that were derived from a non-reference genome could result in mismapping and mapping bias. The mismapping and mappability bias lead biased observation of mapped reads on SNP markers where reads of either allele are preferentially mapped. To eliminate unexpected detection of such error-prone markers, MCPtagg count reads only mapped on MCP tags that are the genomic sequences/regions on which NGS reads obtained by RRS can be mapped uniquely and properly to call genotypes at SNPs. The list of MCP tags is obtained by whole genome sequence comparison between the genome sequences of the prarents of a given mapping/breeding population followed by sequencing read mappability assessment using simulated reads. The MCPtagg pipeline efficiently eliminates error-prone markers from your resultant genotype data obtained via a NGS-based genotyping technique, compared to the result produced by the conventional pipeline that maps reads on one reference genome. MCPtaggR currently supports NGS reads generated by reduced-representation sequence, e.g. genotyping-by-sequencing (GBS) and Restriction site associated DNA markers sequencing (RAD-seq), for a biparental population derived from a cross between inbred parents.
MCPtaggR
requires external tools: MUMmer and Subreads. Please install these
two tools on your system and make sure their executable files can be executed
from your R environment.
For installation of MUMmer, please visit MUMmer’s web site.
For installation of Subread, please see the vignette of Rsubread.
Briefly, once the executable files of the MUMmer suite were compiled in a
directory, these files are available from MCPtaggR simply by specifying the
directory path to the mummer_dir argument of the run_mummer() function. If you
got a permission error for the MUMmer functions, change the permission settings of
the compiled executable files to allow R to execute them. For Subreads, the
instructions available at the above link will make the Subread
functions executable by MCPtaggR via the functions of the Rsubread package.
You can install MCPtaggR
from the GitHub repository.
if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
devtools::install_github("tomoyukif/MCPtaggR", build_vignettes = FALSE)
library("MCPtaggR")
# Path to a directory containing FASTQ files
fq_dir <- system.file("extdata/fastq", package = "MCPtaggR")
# Path to a reference genome FASTA file
ref_fn <- system.file("extdata/ref/IRGSP-1.0_genome.masked.fa.gz",
package = "MCPtaggR")
# Path to a alternative genome FASTA file
alt_fn <- system.file("extdata/ref/ol_genome_chr.masked.fa.gz",
package = "MCPtaggR")
# Path to a directory containing MUMmer executable files
mummer_dir <- system.file("extdata/MUMmer3.23", package = "MCPtaggR")
# Path to an output directory. Set a path that you want to save outputs.
out_dir <- tempdir()
MUMmer executable files will be installed together with the MCPtaggR package.
The directory path for these files can be obtained by the code shown above.
However, these files were compiled on a Linux system with Ubuntu 22.04. There
is no guarantee that these are also executable on your system and, off course,
these won’t run on Windows and Mac. Please compile executable files on your
machine and set the mummer_dir argument of the run_mummer() function to the
directory where the executable files are stored.
out_fn <- "mummer_out"
# Decompress FASTA files is needed
R.utils::gunzip(ref_fn, overwrite = TRUE, remove = FALSE)
R.utils::gunzip(alt_fn, overwrite = TRUE, remove = FALSE)
# Set FASTA file names again with removing the .gz suffix.
ref_fn <- sub("\\.gz", "", ref_fn)
alt_fn <- sub("\\.gz", "", alt_fn)
mummer_fn <- run_mummer(ref_fn = ref_fn, alt_fn = alt_fn,
mummer_dir = mummer_dir,
out_dir = out_dir, out_fn = out_fn)
You can use pre-calculated mummer outputs from installation directory of MCPtaggR as demo data. These mummer outputs were obtained from the comparison between Oryza staiva cv. Nipponbare and O. longistaminata
mummer_fn <- c(system.file("extdata/mummer/ol2nb_mcptagg.snps.gz",
package = "MCPtaggR"),
system.file("extdata/mummer/ol2nb_mcptagg.coord.gz",
package = "MCPtaggR"))
Import and organize information of SNPs in collinear blocks.
snps <- mummer2SNPs(mummer_fn = mummer_fn)
The mummer_fn argument requires as input a vector of two strings: .snps and
.coord files geenrated by MUMmer. The run_mummer() function returns the vector
that can be supplied to the mummer2SNPs() function as input.
The output GDS file can be manipulated using the GBScleanR package that provides functions for data visualization, filtering, genotyping error correction, and allele dosage estimation.
library(GBScleanR)
Open the connection to the GDS file.
mcptagg <- system.file("extdata/genotype.gds", package = "MCPtaggR")
gds <- loadGDS(mcptagg)
Calculate and visualize summary statistics.
gds <- countGenotype(gds)
gds <- countRead(gds)
histGBSR(x = gds, stats = "missing")
histGBSR(x = gds, stats = "het")
histGBSR(x = gds, stats = "raf")
histGBSR(x = gds, stats = "dp")
Apply filtering on SNPs.
# Retain only one SNP if multiple SNPs in a 150 bp stretch.
gds <- thinMarker(object = gds, range = 150)
# Filter out SNPs at which all samples shows missing.
gds <- setMarFilter(object = gds, missing = 0.9999)
nmar(gds)
## [1] 614
Re-calculate and visualize summary statistics for the filtered data.
gds <- countGenotype(gds)
gds <- countRead(gds)
histGBSR(x = gds, stats = "missing")
histGBSR(x = gds, stats = "het")
histGBSR(x = gds, stats = "raf")
histGBSR(x = gds, stats = "dp")
Genotyping error correction and dosage estimation.
gds <- initScheme(object = gds, mating = cbind(c(1, 2)))
gds <- addScheme(object = gds, crosstype = "selfing")
gds <- estGeno(gds, het_parent = FALSE, iter = 4, n_threads = 10)
Visualize read ratio and estimated dosage
plotReadRatio(x = gds, ind = 1, alpha = 0.8)
Due to the dosage estimation conducted on the data with too few reads and few
samples, the estimated dosage is highly fluctuated and obviously contains error.
plotDosage(x = gds, ind = 1, alpha = 0.8)
The follwing plot shows the estimated dosage that generated by running estGen()
on 813 samples.
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Asia/Tokyo
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] GBScleanR_1.5.11 SeqArray_1.40.1 gdsfmt_1.36.1 MCPtaggR_1.0.0
## [5] BiocStyle_2.28.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 farver_2.1.1
## [3] dplyr_1.1.3 blob_1.2.4
## [5] Biostrings_2.68.1 bitops_1.0-7
## [7] fastmap_1.1.1 RCurl_1.98-1.12
## [9] GenomicAlignments_1.36.0 XML_3.99-0.14
## [11] digest_0.6.33 lifecycle_1.0.3
## [13] RSQLite_2.3.2 magrittr_2.0.3
## [15] compiler_4.3.1 rlang_1.1.1
## [17] sass_0.4.7 tools_4.3.1
## [19] utf8_1.2.4 yaml_2.3.7
## [21] SNPRelate_1.34.1 data.table_1.14.8
## [23] rtracklayer_1.60.1 knitr_1.45
## [25] labeling_0.4.3 S4Arrays_1.0.6
## [27] bit_4.0.5 DelayedArray_0.26.7
## [29] abind_1.4-5 BiocParallel_1.34.2
## [31] expm_0.999-7 withr_2.5.2
## [33] purrr_1.0.2 BiocGenerics_0.46.0
## [35] grid_4.3.1 stats4_4.3.1
## [37] fansi_1.0.5 colorspace_2.1-0
## [39] ggplot2_3.4.4 scales_1.2.1
## [41] Rsubread_2.14.2 SummarizedExperiment_1.30.2
## [43] cli_3.6.1 rmarkdown_2.25
## [45] crayon_1.5.2 generics_0.1.3
## [47] RcppParallel_5.1.7 rstudioapi_0.15.0
## [49] rjson_0.2.21 DBI_1.1.3
## [51] cachem_1.0.8 zlibbioc_1.46.0
## [53] parallel_4.3.1 BiocManager_1.30.22
## [55] XVector_0.40.0 restfulr_0.0.15
## [57] matrixStats_1.0.0 vctrs_0.6.4
## [59] Matrix_1.6-1.1 jsonlite_1.8.7
## [61] bookdown_0.36 IRanges_2.34.1
## [63] S4Vectors_0.38.2 bit64_4.0.5
## [65] magick_2.8.1 tidyr_1.3.0
## [67] jquerylib_0.1.4 glue_1.6.2
## [69] codetools_0.2-19 gtable_0.3.4
## [71] GenomeInfoDb_1.36.4 BiocIO_1.10.0
## [73] GenomicRanges_1.52.1 munsell_0.5.0
## [75] tibble_3.2.1 pillar_1.9.0
## [77] htmltools_0.5.6.1 GenomeInfoDbData_1.2.10
## [79] BSgenome_1.68.0 R6_2.5.1
## [81] evaluate_0.22 lattice_0.22-5
## [83] Biobase_2.60.0 highr_0.10
## [85] png_0.1-8 Rsamtools_2.16.0
## [87] memoise_2.0.1 DECIPHER_2.28.0
## [89] bslib_0.5.1 Rcpp_1.0.11
## [91] xfun_0.40 MatrixGenerics_1.12.3
## [93] pkgconfig_2.0.3