Cluster sequences at a given identity threshold and find clusters that contain mixed taxonomic names.
Note: it is recommended to set a seed with set.seed() before calling this function so that the clustering is reproducible.
get_mixed_clusters(
x,
db,
rank = "order",
threshold = 0.97,
rngseed = FALSE,
confidence = 0.8,
return = "consensus",
k = 5,
quiet = FALSE,
...
)
A DNAbin list object whose names include NCBI taxonomic identification numbers.
A taxonomic database from get_ncbi_taxonomy or get_ott_lineage.
The taxonomic rank to check clusters at, accepts a character such as "order", or vector of characters such as c("species", "genus"). If "all", the clusters will be checked at all taxonomic ranks available.
numeric between 0 and 1 giving the OTU identity cutoff for clustering. Defaults to 0.97.
(Optional) A single integer value passed to set.seed, which is used to fix a seed for reproducible random number generation in the kmeans clustering. If set to FALSE (the default), the RNG seed is left untouched, and it is up to the user to call set.seed beforehand to achieve reproducible results.
The minimum confidence value for a mixed cluster to be flagged. For example, if confidence = 0.8 (the default value), a cluster will only be flagged if the taxonomy of a sequence within the cluster differs from at least four other independent sequences in its cluster.
nstart: How many random sets should be chosen for kmeans. It is recommended to set the value of nstart to at least 20. While this can increase computation time, it can improve clustering accuracy considerably.
What type of data should be returned. Options include:
Consensus: The consensus taxonomy for each cluster and the associated confidence level.
All: All taxa in mixed clusters and their sequence accession numbers.
Count: Counts of all taxa within each cluster.
integer giving the k-mer size used to generate the input matrix for k-means clustering.
logical indicating whether progress should be printed to the console.
further arguments to pass to kmer::otu.
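The confidence argument described above can be illustrated with a small base R sketch. This is only one plausible reading of the description, not the package's actual implementation: confidence is taken to be the share of sequences in a cluster that carry the most common taxon, and a cluster is flagged when it holds more than one taxon and that share reaches the threshold. The toy data and the helper names summarise_cluster and flag_mixed are hypothetical.

```r
# Hypothetical toy data: cluster assignments and genus labels for 12 sequences
clusters <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3)
taxa <- c("Apis", "Apis", "Apis", "Apis", "Bombus",
          "Musca", "Musca", "Drosophila", "Drosophila",
          "Aedes", "Aedes", "Aedes")

# Summarise one cluster: its most common taxon, the share of sequences
# carrying that taxon, and the number of distinct taxa present
summarise_cluster <- function(labels) {
  counts <- sort(table(labels), decreasing = TRUE)
  data.frame(
    consensus = names(counts)[1],
    confidence = as.numeric(counts[1]) / length(labels),
    n_taxa = length(counts),
    stringsAsFactors = FALSE
  )
}

by_cluster <- split(taxa, clusters)
summary_df <- do.call(rbind, lapply(by_cluster, summarise_cluster))
summary_df$cluster <- names(by_cluster)

# Assumed rule: flag clusters with more than one taxon whose consensus
# share is at least the confidence threshold
flag_mixed <- function(summary_df, confidence = 0.8) {
  summary_df[summary_df$n_taxa > 1 & summary_df$confidence >= confidence, ]
}

flagged <- flag_mixed(summary_df, confidence = 0.8)
print(flagged)
```

Under this reading, cluster 1 (four Apis, one Bombus) is flagged at confidence = 0.8, cluster 2 is too evenly split to yield a confident consensus, and cluster 3 is not mixed at all.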
if (FALSE) {
seqs <- ape::read.FASTA("test.fa.gz")
# NCBI taxonomy (db is assumed to be a taxonomic database created beforehand with get_ncbi_taxonomy)
mixed <- get_mixed_clusters(seqs, db, rank="species", threshold=0.99, confidence=0.8, quiet=FALSE)
# OTT taxonomy (db is assumed to be a taxonomic database created beforehand with get_ott_lineage)
seqs <- map_to_ott(
seqs, dir="ott3.2", from="ncbi",
resolve_synonyms=TRUE, filter_bads=TRUE, remove_na = TRUE, quiet=FALSE
)
mixed <- get_mixed_clusters(
seqs, db, rank="species",
threshold=0.99, confidence=0.6, quiet=FALSE
)
}
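The seed recommendation matters because k-means starts from randomly chosen centres. The sketch below uses stats::kmeans directly on a toy matrix, not get_mixed_clusters itself, to show that fixing the seed before each call gives identical cluster assignments. The matrix is hypothetical stand-in data for a k-mer count matrix, and nstart = 20 follows the recommendation given for nstart.

```r
# Hypothetical toy matrix standing in for a k-mer count matrix:
# two well-separated groups of 10 rows each
set.seed(1)
m <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 4),
  matrix(rnorm(40, mean = 5), ncol = 4)
)

# Fixing the same seed before each call makes the random starts,
# and therefore the resulting cluster assignments, identical
set.seed(42)
fit_a <- kmeans(m, centers = 2, nstart = 20)
set.seed(42)
fit_b <- kmeans(m, centers = 2, nstart = 20)

same_result <- identical(fit_a$cluster, fit_b$cluster)
print(same_result)
```

Without the set.seed calls, repeated runs may label or even partition the clusters differently, which is why a seed should be set either through rngseed or by calling set.seed beforehand.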