1 Introduction

MCPtaggR provides the functions to conduct mappable-collinear-polymorphic tag genotyping (MCPtagg). MCPtagg is a pipeline for next generation sequencing (NGS)-based genotyping technique to precisely detect SNPs and count reads on validated tags (short genome segments or regions) based on genome comparison information. Short NGS reads can not always be mapped to the proper positions on the reference genome if the reads were originally derived from a genome that have plenty polymorphisms and structural variations against the reference genome, e.g. mapping reads of a hybrid population that were derived from a cross between distant relatives. Polymorphic reads that were derived from a non-reference genome could result in mismapping and mapping bias. The mismapping and mappability bias lead biased observation of mapped reads on SNP markers where reads of either allele are preferentially mapped. To eliminate unexpected detection of such error-prone markers, MCPtagg count reads only mapped on MCP tags that are the genomic sequences/regions on which NGS reads obtained by RRS can be mapped uniquely and properly to call genotypes at SNPs. The list of MCP tags is obtained by whole genome sequence comparison between the genome sequences of the prarents of a given mapping/breeding population followed by sequencing read mappability assessment using simulated reads. The MCPtagg pipeline efficiently eliminates error-prone markers from your resultant genotype data obtained via a NGS-based genotyping technique, compared to the result produced by the conventional pipeline that maps reads on one reference genome. MCPtaggR currently supports NGS reads generated by reduced-representation sequence, e.g. genotyping-by-sequencing (GBS) and Restriction site associated DNA markers sequencing (RAD-seq), for a biparental population derived from a cross between inbred parents.

2 Prerequisites

MCPtaggR requires external tools: MUMmer and Subreads. Please install these two tools on your system and make sure their executable files can be executed from your R environment. For installation of MUMmer, please visit MUMmer’s web site. For installation of Subread, please see the vignette of Rsubread. Briefly, once the executable files of the MUMmer suite were compiled in a directory, these files are available from MCPtaggR simply by specifying the directory path to the mummer_dir argument of the run_mummer() function. If you got a permission error for the MUMmer functions, change the permission settings of the compiled executable files to allow R to execute them. For Subreads, the instructions available at the above link will make the Subread functions executable by MCPtaggR via the functions of the Rsubread package.

3 Installation

You can install MCPtaggR from the GitHub repository.

if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")
    
devtools::install_github("tomoyukif/MCPtaggR", build_vignettes = FALSE)

4 The basic pipeline

4.0.1 Load the package.

library("MCPtaggR")

4.1 Run MUMmer

4.1.1 Set paths to input files.

# Path to a directory containing FASTQ files
fq_dir <- system.file("extdata/fastq", package = "MCPtaggR")

# Path to a reference genome FASTA file
ref_fn <- system.file("extdata/ref/IRGSP-1.0_genome.masked.fa.gz", 
                      package = "MCPtaggR")

# Path to a alternative genome FASTA file
alt_fn <- system.file("extdata/ref/ol_genome_chr.masked.fa.gz", 
                      package = "MCPtaggR")

# Path to a directory containing MUMmer executable files
mummer_dir <- system.file("extdata/MUMmer3.23", package = "MCPtaggR")

# Path to an output directory. Set a path that you want to save outputs.
out_dir <- tempdir()

MUMmer executable files will be installed together with the MCPtaggR package. The directory path for these files can be obtained by the code shown above. However, these files were compiled on a Linux system with Ubuntu 22.04. There is no guarantee that these are also executable on your system and, off course, these won’t run on Windows and Mac. Please compile executable files on your machine and set the mummer_dir argument of the run_mummer() function to the directory where the executable files are stored.

4.1.2 Call the MUMmer suite from R

out_fn <- "mummer_out"

# Decompress FASTA files is needed
R.utils::gunzip(ref_fn, overwrite = TRUE, remove = FALSE)
R.utils::gunzip(alt_fn, overwrite = TRUE, remove = FALSE)

# Set FASTA file names again with removing the .gz suffix.
ref_fn <- sub("\\.gz", "", ref_fn)
alt_fn <- sub("\\.gz", "", alt_fn)
mummer_fn <- run_mummer(ref_fn = ref_fn, alt_fn = alt_fn,
                        mummer_dir = mummer_dir, 
                        out_dir = out_dir, out_fn = out_fn)

You can use pre-calculated mummer outputs from installation directory of MCPtaggR as demo data. These mummer outputs were obtained from the comparison between Oryza staiva cv. Nipponbare and O. longistaminata

mummer_fn <- c(system.file("extdata/mummer/ol2nb_mcptagg.snps.gz", 
                           package = "MCPtaggR"),
               system.file("extdata/mummer/ol2nb_mcptagg.coord.gz", 
                           package = "MCPtaggR"))

Import and organize information of SNPs in collinear blocks.

snps <- mummer2SNPs(mummer_fn = mummer_fn)

The mummer_fn argument requires as input a vector of two strings: .snps and .coord files geenrated by MUMmer. The run_mummer() function returns the vector that can be supplied to the mummer2SNPs() function as input.

4.2 Find MCP tags

4.2.1 In silico genome digestion

Execute in-silico digestion of the given genomes.

read_len <- 150
re_names <- c("PstI", "MspI")
re <- makeRE(re_names)
fragment_len = c(0, 1000)
out_fn <- "dgout"
dg <- digestGenome(snps = snps, ref_fn = ref_fn, alt_fn = alt_fn, 
                   out_dir = out_dir, read_len = read_len, 
                   re = re, fragment_len = fragment_len)

You should set the read_len argument to the maximum sequencing read length in your FASTQ files to be analyzed. The digestGenome function generates the MCP tags with the length of read_len bp at maximum, which will be used for genotype calling. The fragment_len argument requires a vector of two integers that is used to filter out the in-silico digested genome fragments to generate MCP tags. If in-silico digested genome fragments were shorter than fragment_len[1] or longer than fragment_len[1], these fragments will be discarded from the MCP tag generation step and will not be used for genotype calling. You also need to specify the re argument that specify the pattern of in-silico digestion. You can make an input for this argument by the makeRE() function as described below.

The function makeRE() outputs a data.frame that specify how restriction enzymes cut DNA sequences as shown below.

makeRE("KpnI")

##      site f  r
## 1 GGTAC/C 4 -1

The site column indicates the restriction pattern. In the case of KpnI, it recognizes GGTACC and cut the sequence between the 5th nucleotide and the 6th nucleotide. The f and r columns indicate how many bases away the remnant sequences start from in forward and reverse strands, respectively. For example, in the forward strand, the remnant sequence is GGTAC. Counting from the 1st nucleotide of the recognition sequence, the last nucleotide in the remnant sequence locates at 4-bp downstream. On the other hand, the remnant sequence in the reverse strand is GTACC. Counting from the 1st nucleotide of the recognition sequence, the last nucleotide in the remnant sequence locates at 1-bp upstream when you count nucleotides in the 5’ to 3’ direction in the reverse strand. Therefore, the data.frame generated by makeRE() represents the patterns of restriction remnant sequences. Currently, makeRE() only supports KpnI, PstI, MspI and ApeKI for the re_names argument. For other restriction enzymes, please specify recognition sequences to the re_sites argument, e.g. re_sites = c("G/AATTC", "A/AGCTT") for EcoRI and HindIII.

makeRE(re_sites = c("G/AATTC", "A/AGCTT"))

##      site f r
## 1 G/AATTC 0 3
## 2 A/AGCTT 3 0

4.2.2 Tag alignment

Align the simulated digested fragments to the given genomes.

ar <- alignTAG(dg = dg, out_dir = out_dir, 
               n_threads = 5, indexing = TRUE, reuse = TRUE)

The alignTAG() function take an output object from the digestGenome() function as a primary input. You can let the read alignment using the Rsubread package run on multiple threads by setting the n_threads argument. If you set indexing = TRUE (default), index files of the genomes will be generated. You can set indexing = FALSE and skip the time consuming indexing step in cases where you have already indexed the genomes using alignTAG() in the previous run. Similarly, you can skip the time consuming tag alignment step by setting reuse = FALSE. The settings indexing = FALSE and reuse = FALSE will be ignored when the index files and the tag alignment files are not available at the directory specified to out_dir and the indexing and tag alignment steps will be performed.

4.2.3 Evaluate tags

The findMCPtag() function summarizes the outputs of alignTAG() and digestGenome() and find MCP tag to be used for genotype calling.

mcptag <- findMCPtag(ar = ar, dg = dg)

4.3 Read mapping and genotype calling

4.3.1 Count reads on the MCPtags and call genotypes.

fq_list <- findFASTQ(in_dir = fq_dir)
mcptagg <- mcptagg(mcptag = mcptag, fq_list = fq_list, out_dir = out_dir, 
                   n_threads = 5, n_parallel = 3)

The output of findMCPtag() should be supplied to the first argument mcptag. A data.frame with two or three columns where paths to read1, paths to read2 (optional), and sample names are listed is required to be set to the fq_list argument. You can make the read alignment using the Rsubread package run on multiple threads by setting the n_threads argument. If you set an integer greater than 1 to n_parallel, n_parallel processes of n_threads threads (requiring n_parallel x n_threads threads in total) will be executed. Parallelization of multithreading requires a large amount of RAM. Thus, please set a small numbers to them first and check how much RAM are required.

5 Post genotype calling workflow

The output GDS file can be manipulated using the GBScleanR package that provides functions for data visualization, filtering, genotyping error correction, and allele dosage estimation.

library(GBScleanR)

Open the connection to the GDS file.

mcptagg <- system.file("extdata/genotype.gds", package = "MCPtaggR")
gds <- loadGDS(mcptagg)

Calculate and visualize summary statistics.

gds <- countGenotype(gds)
gds <- countRead(gds)
histGBSR(x = gds, stats = "missing")

histGBSR(x = gds, stats = "het")

histGBSR(x = gds, stats = "raf")

histGBSR(x = gds, stats = "dp")

Apply filtering on SNPs.

# Retain only one SNP if multiple SNPs in a 150 bp stretch.
gds <- thinMarker(object = gds, range = 150)
# Filter out SNPs at which all samples shows missing.
gds <- setMarFilter(object = gds, missing = 0.9999)
nmar(gds)

## [1] 614

Re-calculate and visualize summary statistics for the filtered data.

gds <- countGenotype(gds)
gds <- countRead(gds)
histGBSR(x = gds, stats = "missing")

histGBSR(x = gds, stats = "het")

histGBSR(x = gds, stats = "raf")

histGBSR(x = gds, stats = "dp")

Genotyping error correction and dosage estimation.

gds <- initScheme(object = gds, mating = cbind(c(1, 2)))
gds <- addScheme(object = gds, crosstype = "selfing")
gds <- estGeno(gds, het_parent = FALSE, iter = 4, n_threads = 10)

Visualize read ratio and estimated dosage

plotReadRatio(x = gds, ind = 1, alpha = 0.8)

Due to the dosage estimation conducted on the data with too few reads and few samples, the estimated dosage is highly fluctuated and obviously contains error.

plotDosage(x = gds, ind = 1, alpha = 0.8)

The follwing plot shows the estimated dosage that generated by running estGen() on 813 samples.

Session info

sessionInfo()

## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Asia/Tokyo
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] GBScleanR_1.5.11 SeqArray_1.40.1  gdsfmt_1.36.1    MCPtaggR_1.0.0  
## [5] BiocStyle_2.28.1
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0            farver_2.1.1               
##  [3] dplyr_1.1.3                 blob_1.2.4                 
##  [5] Biostrings_2.68.1           bitops_1.0-7               
##  [7] fastmap_1.1.1               RCurl_1.98-1.12            
##  [9] GenomicAlignments_1.36.0    XML_3.99-0.14              
## [11] digest_0.6.33               lifecycle_1.0.3            
## [13] RSQLite_2.3.2               magrittr_2.0.3             
## [15] compiler_4.3.1              rlang_1.1.1                
## [17] sass_0.4.7                  tools_4.3.1                
## [19] utf8_1.2.4                  yaml_2.3.7                 
## [21] SNPRelate_1.34.1            data.table_1.14.8          
## [23] rtracklayer_1.60.1          knitr_1.45                 
## [25] labeling_0.4.3              S4Arrays_1.0.6             
## [27] bit_4.0.5                   DelayedArray_0.26.7        
## [29] abind_1.4-5                 BiocParallel_1.34.2        
## [31] expm_0.999-7                withr_2.5.2                
## [33] purrr_1.0.2                 BiocGenerics_0.46.0        
## [35] grid_4.3.1                  stats4_4.3.1               
## [37] fansi_1.0.5                 colorspace_2.1-0           
## [39] ggplot2_3.4.4               scales_1.2.1               
## [41] Rsubread_2.14.2             SummarizedExperiment_1.30.2
## [43] cli_3.6.1                   rmarkdown_2.25             
## [45] crayon_1.5.2                generics_0.1.3             
## [47] RcppParallel_5.1.7          rstudioapi_0.15.0          
## [49] rjson_0.2.21                DBI_1.1.3                  
## [51] cachem_1.0.8                zlibbioc_1.46.0            
## [53] parallel_4.3.1              BiocManager_1.30.22        
## [55] XVector_0.40.0              restfulr_0.0.15            
## [57] matrixStats_1.0.0           vctrs_0.6.4                
## [59] Matrix_1.6-1.1              jsonlite_1.8.7             
## [61] bookdown_0.36               IRanges_2.34.1             
## [63] S4Vectors_0.38.2            bit64_4.0.5                
## [65] magick_2.8.1                tidyr_1.3.0                
## [67] jquerylib_0.1.4             glue_1.6.2                 
## [69] codetools_0.2-19            gtable_0.3.4               
## [71] GenomeInfoDb_1.36.4         BiocIO_1.10.0              
## [73] GenomicRanges_1.52.1        munsell_0.5.0              
## [75] tibble_3.2.1                pillar_1.9.0               
## [77] htmltools_0.5.6.1           GenomeInfoDbData_1.2.10    
## [79] BSgenome_1.68.0             R6_2.5.1                   
## [81] evaluate_0.22               lattice_0.22-5             
## [83] Biobase_2.60.0              highr_0.10                 
## [85] png_0.1-8                   Rsamtools_2.16.0           
## [87] memoise_2.0.1               DECIPHER_2.28.0            
## [89] bslib_0.5.1                 Rcpp_1.0.11                
## [91] xfun_0.40                   MatrixGenerics_1.12.3      
## [93] pkgconfig_2.0.3

Basic usage of MCPtaggR

November 8, 2023

Package