Normalize the microbial abundance data

It is critical to normalize the feature table to eliminate bias caused by differences in sequencing depth between samples. This function implements several widely used normalization methods for microbial compositional data.

For rarefying, reads in each sample are randomly removed until a predefined number is reached, so that all samples have the same library size. Rarefying is the standard normalization method in microbial ecology. Please note that the authors of phyloseq do not advocate rarefying as a normalization procedure, despite its recent popularity.
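The subsampling idea can be sketched in base R. This is a minimal illustration on a made-up count matrix, not the code path used by norm_rarefy() (which passes its arguments to phyloseq::rarefy_even_depth()); the helper rarefy_sample() is hypothetical, and it draws without replacement, whereas the signature above defaults to replace = TRUE.

```r
# Toy count matrix: rows are taxa, columns are samples (hypothetical data).
counts <- matrix(
  c(10, 0, 5, 85,
    40, 20, 30, 110,
    3, 7, 0, 40),
  nrow = 4,
  dimnames = list(paste0("OTU", 1:4), paste0("S", 1:3))
)

# Rarefy one sample: expand the counts into individual reads, draw `size`
# of them without replacement, and tabulate the draws per taxon.
rarefy_sample <- function(x, size) {
  reads <- rep(seq_along(x), times = x)
  kept <- sample(reads, size = size, replace = FALSE)
  tabulate(kept, nbins = length(x))
}

set.seed(42)
depth <- min(colSums(counts))  # smallest library size in the data
rarefied <- apply(counts, 2, rarefy_sample, size = depth)
rownames(rarefied) <- rownames(counts)

colSums(rarefied)  # every sample now has `depth` reads
```

The sample that already has the smallest library keeps all of its reads; the others lose reads at random, which is why the result depends on the random seed.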

TSS simply transforms the feature table into relative abundances by dividing each count by the total number of reads in its sample.
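As a rough illustration of the arithmetic only (the count matrix is made up, and this is not the package's own implementation), TSS amounts to a column-wise division:

```r
# Toy count matrix: rows are taxa, columns are samples (hypothetical data).
counts <- matrix(
  c(10, 30, 60,
    5, 5, 40),
  nrow = 3,
  dimnames = list(paste0("OTU", 1:3), c("S1", "S2"))
)

# Divide every count by the library size (column sum) of its sample.
tss <- sweep(counts, 2, colSums(counts), "/")

colSums(tss)  # each sample sums to 1
```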

CSS is based on the assumption that the count distributions in each sample are equivalent for low-abundance genes up to a certain threshold. CSS scales only the segment of each sample’s count distribution that is relatively invariant across samples.
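A simplified sketch of the idea follows. It assumes a fixed quantile p = 0.5 as the threshold, whereas CSS implementations typically estimate the threshold from the data; css_sketch() and the toy matrix are hypothetical, and only the sl argument mirrors the norm_css() signature.

```r
# Toy count matrix: the first three taxa are low-abundance, the last is not.
counts <- matrix(
  c(1, 2, 3, 100,
    2, 4, 6, 400),
  nrow = 4,
  dimnames = list(paste0("OTU", 1:4), c("S1", "S2"))
)

css_sketch <- function(counts, p = 0.5, sl = 1000) {
  # Per sample: sum the positive counts that lie at or below the p-th
  # quantile of that sample's positive counts.
  scale_sum <- apply(counts, 2, function(x) {
    positive <- x[x > 0]
    threshold <- stats::quantile(positive, probs = p)
    sum(positive[positive <= threshold])
  })
  # Scale each sample by its truncated cumulative sum, expressed per `sl`.
  sweep(counts, 2, scale_sum / sl, "/")
}

css <- css_sketch(counts)
css
```

In this toy case the two samples differ only in depth for the low-abundance taxa, so those rows agree after scaling while the dominant taxon keeps its difference.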

RLE assumes most features are not differential and uses the relative abundances to calculate the normalization factor.
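The median-of-ratios calculation behind RLE can be sketched as follows. This corresponds to the "ratio" style of estimation on a made-up matrix without zeros; handling of zero counts (as in "poscounts") is not shown, and the code is an illustration rather than what norm_rle() runs internally.

```r
# Toy count matrix without zeros (hypothetical data). S2 is S1 sequenced
# twice as deep, S3 half as deep, except for one differential taxon (OTU4).
counts <- matrix(
  c(10, 20, 40, 80,
    20, 40, 80, 160,
    5, 10, 20, 400),
  nrow = 4,
  dimnames = list(paste0("OTU", 1:4), paste0("S", 1:3))
)

# Pseudo-reference: log geometric mean of each taxon across samples.
log_geo_means <- rowMeans(log(counts))

# Size factor of a sample: median ratio of its counts to the pseudo-reference.
size_factors <- apply(counts, 2, function(x) {
  exp(stats::median(log(x) - log_geo_means))
})

size_factors
normalized <- sweep(counts, 2, size_factors, "/")
```

Because the median ignores the single differential taxon, the size factors recover the depth differences among the non-differential taxa.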

TMM calculates the normalization factor using a robust statistic, based on the assumption that most features are not differential and should, on average, be equal between the samples. The TMM scaling factor is calculated as the weighted mean of the log-ratios between each sample and a reference sample, after excluding the OTUs with the highest counts and the OTUs with the largest log-fold changes.
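The trimming and weighting can be sketched for a single sample against a reference. This is a simplified illustration under stated assumptions: tmm_factor() is a hypothetical helper, the data are made up, the weights are approximate inverse variances of the log-ratios, and the final rescaling of factors across samples that some implementations apply is omitted. The trim arguments mirror logratio_trim and sum_trim in the norm_tmm() signature.

```r
tmm_factor <- function(obs, ref, logratio_trim = 0.3, sum_trim = 0.05) {
  lib_obs <- sum(obs)
  lib_ref <- sum(ref)
  p_obs <- obs / lib_obs
  p_ref <- ref / lib_ref

  m <- log2(p_obs / p_ref)                # log-ratios ("M" values)
  a <- (log2(p_obs) + log2(p_ref)) / 2    # mean log abundances ("A" values)
  # Approximate variance of each log-ratio; its inverse is the weight.
  v <- (lib_obs - obs) / lib_obs / obs + (lib_ref - ref) / lib_ref / ref

  # Drop taxa with a zero count in either sample.
  ok <- is.finite(m) & is.finite(a)
  m <- m[ok]
  a <- a[ok]
  v <- v[ok]

  # Keep taxa whose M and A ranks both fall inside the trimmed range.
  n <- length(m)
  lo_m <- floor(n * logratio_trim) + 1
  hi_m <- n + 1 - lo_m
  lo_a <- floor(n * sum_trim) + 1
  hi_a <- n + 1 - lo_a
  keep <- rank(m) >= lo_m & rank(m) <= hi_m &
    rank(a) >= lo_a & rank(a) <= hi_a

  # Weighted mean of the retained log-ratios, back on the original scale.
  2^(sum(m[keep] / v[keep]) / sum(1 / v[keep]))
}

# Hypothetical data: `obs` equals `ref` except for one strongly enriched taxon.
ref <- c(5, 10, 20, 40, 80, 160, 320, 640, 1280, 2560)
obs <- ref
obs[10] <- 25600

f <- tmm_factor(obs, ref)
effective_lib <- sum(obs) * f  # library size after TMM adjustment
```

Here the enriched taxon is trimmed away, so the effective library size of obs matches that of the reference and the nine unchanged taxa are not pushed down artificially, as they would be under plain TSS.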

In CLR, the log-ratios are computed relative to the geometric mean of all features.
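A minimal sketch of the transform is shown below. The pseudocount of 1 used to avoid log(0) is an assumption made for this illustration; the text above does not say how zeros are handled, and the toy matrix is made up.

```r
# Toy count matrix with a zero (hypothetical data).
counts <- matrix(
  c(10, 30, 60,
    0, 5, 45),
  nrow = 3,
  dimnames = list(paste0("OTU", 1:3), c("S1", "S2"))
)

# Assumption: add a pseudocount of 1 so that the logarithm is defined.
log_counts <- log(counts + 1)

# Subtracting the per-sample mean of the logs is the same as taking the log
# of each value divided by the geometric mean of that sample.
clr <- sweep(log_counts, 2, colMeans(log_counts), "-")

colSums(clr)  # CLR values sum to zero within each sample (up to rounding)
```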

norm_cpm: This normalization method is from the original LEfSe algorithm, recommended when very low values are present (as shown in the LEfSe galaxy).
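Going by the description of "CPM" in the arguments (each sample's values sum to 1e+06), the arithmetic can be sketched as follows. The matrix is made up and the code is an illustration, not the body of norm_cpm().

```r
# Toy count matrix (hypothetical data).
counts <- matrix(
  c(10, 30, 60,
    5, 5, 40),
  nrow = 3,
  dimnames = list(paste0("OTU", 1:3), c("S1", "S2"))
)

# Rescale every sample so that its values sum to one million.
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6

colSums(cpm)  # 1e+06 for every sample
```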

# S4 method for phyloseq
normalize(object, method = "TSS", ...)

# S4 method for otu_table
normalize(object, method = "TSS", ...)

# S4 method for data.frame
normalize(object, method = "TSS", ...)

# S4 method for matrix
normalize(object, method = "TSS", ...)

norm_rarefy(
  object,
  size = min(sample_sums(object)),
  rng_seed = FALSE,
  replace = TRUE,
  trim_otus = TRUE,
  verbose = TRUE
)

norm_tss(object)

norm_css(object, sl = 1000)

norm_rle(
  object,
  locfunc = stats::median,
  type = c("poscounts", "ratio"),
  geo_means = NULL,
  control_genes = NULL
)

norm_tmm(
  object,
  ref_column = NULL,
  logratio_trim = 0.3,
  sum_trim = 0.05,
  do_weighting = TRUE,
  Acutoff = -1e+10
)

norm_clr(object)

norm_cpm(object)

Arguments

object

a phyloseq::phyloseq or phyloseq::otu_table

method

the method used to normalize the microbial abundance data. Options include:

"none": do not normalize.
"rarefy": randomly subsample counts to the smallest library size in the data set.
"TSS": total sum scaling, also referred to as "relative abundance"; the abundances are normalized by dividing by the corresponding sample library size.
"TMM": trimmed mean of m-values. First, a sample is chosen as reference. The scaling factor is then derived using a weighted trimmed mean over the differences of the log-transformed gene-count fold-change between the sample and the reference.
"RLE": relative log expression. RLE uses a pseudo-reference calculated as the geometric mean of the gene-specific abundances over all samples. The scaling factors are then calculated as the median of the ratios of gene counts between each sample and the reference.
"CSS": cumulative sum scaling, calculates scaling factors as the cumulative sum of gene abundances up to a data-derived threshold.
"CLR": centered log-ratio normalization.
"CPM": per-sample normalization of the sum of the values to 1e+06.

...

other arguments passed to the corresponding normalization methods.

size, rng_seed, replace, trim_otus, verbose

extra arguments passed to phyloseq::rarefy_even_depth().

sl

the value to scale by (default 1000).

locfunc

a function to compute a location for a sample. By default, the median is used.

type

method for estimation: either "ratio" or "poscounts" (recommended).

geo_means

default NULL, which means the geometric means of the counts are used. A vector of geometric means from another count matrix can be provided for a "frozen" size factor calculation.

control_genes

default NULL, which means all taxa are used for size factor estimation. Otherwise, a numeric or logical index vector specifying the taxa used for size factor estimation (e.g. core taxa).

ref_column

column to use as reference

logratio_trim

amount of trim to use on log-ratios

sum_trim

amount of trim to use on the combined absolute levels ("A" values)

do_weighting

whether to compute the weights or not

Acutoff

cutoff on "A" values to use before trimming

Value

an object of the same class as object.

Examples

data(caporaso)
normalize(caporaso, "TSS")
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 3426 taxa and 34 samples ]
#> sample_data() Sample Data:       [ 34 samples by 8 sample variables ]
#> tax_table()   Taxonomy Table:    [ 3426 taxa by 7 taxonomic ranks ]
#> phy_tree()    Phylogenetic Tree: [ 3426 tips and 3424 internal nodes ]
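Other methods follow the same calling pattern. The sketch below uses only the functions and arguments listed in the usage section above; it assumes that the package providing normalize() and the caporaso data is microbiomeMarker and that it is installed, so the calls are guarded and skipped otherwise. The seed value is arbitrary.

```r
# Assumption: these functions and the caporaso data come from microbiomeMarker.
has_pkg <- requireNamespace("microbiomeMarker", quietly = TRUE)

if (has_pkg) {
  library(microbiomeMarker)
  data(caporaso)

  # Select a method by name through the generic ...
  ps_cpm <- normalize(caporaso, method = "CPM")

  # ... or call a method-specific function directly.
  ps_tss <- norm_tss(caporaso)

  # Rarefying is random, so fix the seed for reproducible results.
  ps_rare <- norm_rarefy(caporaso, rng_seed = 2020, verbose = FALSE)
}
```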
