Identify biomarkers using supervised learning (SL) methods

Identify biomarkers using logistic regression, random forest, or support vector machine.

run_sl(
  ps,
  group,
  taxa_rank = "all",
  transform = c("identity", "log10", "log10p"),
  norm = "none",
  norm_para = list(),
  nfolds = 3,
  nrepeats = 3,
  sampling = NULL,
  tune_length = 5,
  top_n = 10,
  method = c("LR", "RF", "SVM"),
  ...
)

Arguments

ps

a phyloseq-class object.

group

character, the variable used to define the groups.

taxa_rank

character, the taxonomic rank at which to perform the differential analysis. Should be one of phyloseq::rank_names(ps); "all", which summarizes the taxa by the top taxonomic rank (summarize_taxa(ps, level = rank_names(ps)[1])); or "none", which performs the differential analysis on the original taxa (taxa_names(ps), e.g., OTUs or ASVs).

transform

character, the method used to transform the microbial abundances. See transform_abundances() for more details. The options are:

"identity", return the original data without any transformation (default).
"log10", the transformation is log10(object), and if the data contains zeros the transformation is log10(1 + object).
"log10p", the transformation is log10(1 + object).
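The three options can be illustrated with a minimal base R sketch. The helper transform_sketch() and the toy matrix are hypothetical and only mirror the descriptions above; they are not the package's transform_abundances() implementation.

```r
# Toy abundance matrix (taxa in rows, samples in columns); values are made up.
abund <- matrix(
  c(0, 9, 99, 999),
  nrow = 2,
  dimnames = list(c("taxon_a", "taxon_b"), c("s1", "s2"))
)

# Hypothetical helper mirroring the three documented options.
transform_sketch <- function(x, transform = c("identity", "log10", "log10p")) {
  transform <- match.arg(transform)
  switch(transform,
    identity = x,
    # "log10" falls back to log10(1 + x) when the data contain zeros
    log10 = if (any(x == 0)) log10(1 + x) else log10(x),
    log10p = log10(1 + x)
  )
}

transform_sketch(abund, "identity")
transform_sketch(abund, "log10")   # abund contains a zero, so log10(1 + x) is used
transform_sketch(abund, "log10p")
```

For data that contain zeros, "log10" and "log10p" give the same result in this sketch; they differ only when every value is positive.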

norm

the method used to normalize the microbial abundance data. See normalize() for more details. Options include:

"none": do not normalize.
"rarefy": randomly subsample the counts to the smallest library size in the data set.
"TSS": total sum scaling, also referred to as "relative abundance"; the abundances are normalized by dividing by the corresponding sample library size.
"TMM": trimmed mean of m-values. First, a sample is chosen as reference. The scaling factor is then derived using a weighted trimmed mean over the differences of the log-transformed gene-count fold-change between the sample and the reference.
"RLE": relative log expression. RLE uses a pseudo-reference calculated as the geometric mean of the gene-specific abundances over all samples. The scaling factors are then calculated as the median of the gene count ratios between the samples and the reference.
"CSS": cumulative sum scaling, calculates scaling factors as the cumulative sum of gene abundances up to a data-derived threshold.
"CLR": centered log-ratio normalization.
"CPM": per-sample normalization of the sum of the values to 1e+06.
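The simpler scaling methods can be sketched in base R as follows. This is an illustration of the arithmetic only, not the package's normalize() code: the count matrix is made up, and the pseudo-count of 1 used for CLR is an assumption to avoid taking the log of zero.

```r
# Made-up count matrix (taxa in rows, samples in columns)
counts <- matrix(
  c(10, 30, 60, 5, 5, 40),
  nrow = 3,
  dimnames = list(paste0("taxon_", 1:3), c("s1", "s2"))
)

lib_size <- colSums(counts)

# TSS: divide each sample (column) by its library size
tss <- sweep(counts, 2, lib_size, "/")

# CPM: the same scaling, but each sample sums to 1e6
cpm <- tss * 1e6

# CLR: log abundance minus the per-sample mean log abundance
# (pseudo-count of 1 is an assumption of this sketch)
clr <- apply(counts, 2, function(x) {
  lx <- log(x + 1)
  lx - mean(lx)
})

colSums(tss)  # each sample sums to 1
colSums(cpm)  # each sample sums to 1e6
```

After CLR, each sample's values are centered, so they sum to zero up to floating-point error.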

norm_para

named list, other arguments passed to the specific normalization method. Most users will not need to pass any additional arguments here.

nfolds

the number of folds (splits) in cross-validation (CV).

nrepeats

the number of complete sets of folds to compute.
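How nfolds and nrepeats combine can be sketched in base R. The helper make_folds() is hypothetical and assigns samples to folds at random; the resampling actually used by the package is delegated to caret and may differ (for example, by stratifying on the outcome).

```r
set.seed(1)
n_samples <- 12
nfolds <- 3
nrepeats <- 3

# One complete set of folds: shuffle balanced fold labels over the samples
make_folds <- function(n, k) sample(rep(seq_len(k), length.out = n))

# nrepeats independent sets of folds; each column is one repeat
fold_id <- replicate(nrepeats, make_folds(n_samples, nfolds))

# Number of train/validation splits evaluated per tuning-parameter candidate
n_resamples <- nfolds * nrepeats

# Samples held out in fold 1 of the first repeat
which(fold_id[, 1] == 1)
```

With the defaults nfolds = 3 and nrepeats = 3, each candidate setting is therefore evaluated on nine train/validation splits.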

sampling

a single character value describing the type of additional sampling that is conducted after resampling (usually to resolve class imbalances). Values are "none", "down", "up", "smote", or "rose". For more details see caret::trainControl().

tune_length

an integer denoting the amount of granularity in the tuning parameter grid. For more details see caret::train().

top_n

an integer denoting the number of top features to report as biomarkers according to the importance score.

method

supervised learning method, options are "LR" (logistic regression), "RF" (random forest), or "SVM" (support vector machine).

...

extra arguments passed to the classifier, e.g., importance for randomForest::randomForest().

Value

a microbiomeMarker object.

Details

Only two-group comparisons are supported in the current version. Markers are selected according to their importance scores. The hyperparameters are selected automatically by grid search within repeated K-fold cross-validation (nrepeats repetitions of nfolds folds). As a result, the identified biomarkers can be biased by model overfitting on small datasets (e.g., fewer than 100 samples).

The argument top_n sets the number of markers selected according to the importance score. There is no fixed rule for choosing top_n; in practice, it is useful to try several values and compare how well the resulting markers predict the test data.
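The selection step itself amounts to ranking features by importance and keeping the first top_n. A minimal base R sketch, with made-up importance scores and a hypothetical helper select_top(), shows how different values of top_n give nested marker sets that can then be compared:

```r
# Made-up importance scores for six features
importance <- c(
  g_A = 0.91, g_B = 0.12, g_C = 0.55,
  g_D = 0.78, g_E = 0.05, g_F = 0.33
)

# Hypothetical helper: names of the top_n features by decreasing importance
select_top <- function(scores, top_n) {
  ranked <- names(sort(scores, decreasing = TRUE))
  ranked[seq_len(min(top_n, length(scores)))]
}

# Candidate marker sets for two values of top_n
marker_sets <- lapply(c(2, 4), function(n) select_top(importance, n))
marker_sets
```

Each candidate set would then be assessed on held-out data; that evaluation is outside the scope of this sketch.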

Author

Yang Cao

Examples
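A possible call is sketched below. It assumes that the microbiomeMarker and phyloseq packages are installed, that the example dataset enterotypes_arumugam ships with microbiomeMarker, and that its sample data contain a two-level Gender variable and its taxonomy table a Phylum rank; these are assumptions rather than facts stated on this page. The small values of nfolds and nrepeats are chosen only to keep the run short.

```r
if (requireNamespace("microbiomeMarker", quietly = TRUE) &&
    requireNamespace("phyloseq", quietly = TRUE)) {
  library(microbiomeMarker)

  # Assumed example dataset; replace with your own phyloseq object
  data(enterotypes_arumugam)

  # Keep two phyla so that the example runs quickly
  ps_small <- phyloseq::subset_taxa(
    enterotypes_arumugam,
    Phylum %in% c("Firmicutes", "Bacteroidetes")
  )

  set.seed(2021)  # resampling is random, so fix the seed for reproducibility
  mm <- run_sl(
    ps_small,
    group = "Gender",      # assumed two-level grouping variable
    taxa_rank = "Genus",
    norm = "TSS",
    nfolds = 2,
    nrepeats = 1,
    top_n = 15,
    method = "LR"
  )
  print(mm)
}
```

Rerunning the call with a different top_n or method is a simple way to check how stable the selected markers are.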