Title: | Clustering |
---|---|
Description: | A tool that implements the clustering algorithms from 'mothur' (Schloss PD et al. (2009) <doi:10.1128/AEM.01541-09>). 'clustur' makes use of the cluster() and make.shared() commands from 'mothur'. Our cluster() function implements five algorithms: 'OptiClust', 'furthest', 'nearest', 'average', and 'weighted'. 'OptiClust' is an optimized clustering method for Operational Taxonomic Units; see Westcott SL, Schloss PD (2017) <doi:10.1128/mspheredirect.00073-17> for details. The make.shared() command is always applied at the end of the clustering command. This functionality allows us to generate clustering and abundance data efficiently. |
Authors: | Gregory Johnson [aut] , Sarah Westcott [aut], Patrick Schloss [aut, cre, cph] |
Maintainer: | Patrick Schloss <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.2 |
Built: | 2024-11-26 20:29:41 UTC |
Source: | https://github.com/schlosslab/clustur |
Clusters entities represented in a distance matrix and count table using one of several algorithms, and outputs information about the composition and abundance of each cluster.
cluster( distance_object, cutoff, method = "opticlust", feature_column_name_to = "feature", bin_column_name_to = "bin", random_seed = 123 )
distance_object |
The distance object that was created using the 'read_dist()' function. |
cutoff |
The cutoff you want to cluster towards. |
method |
The method of clustering to be performed: opticlust (default), furthest, nearest, average, or weighted. |
feature_column_name_to |
Set the name of the column in the cluster dataframe that contains the sequence names. |
bin_column_name_to |
Set the name of the column in the cluster dataframe that contains the name of the group of sequence names. |
random_seed |
The random seed to use (default = 123). |
A list of 'data.frames' that contain abundance, and clustering results. If you used 'method = opticlust', it will also return clustering performance metrics.
cutoff <- 0.03
count_table <- read_count(example_path("amazon.full.count_table"))
distance_data <- read_dist(example_path("amazon_column.dist"), count_table, cutoff)
cluster_results <- cluster(distance_data, cutoff, method = "opticlust",
                           feature_column_name_to = "sequence",
                           bin_column_name_to = "omu")
cluster_results <- cluster(distance_data, cutoff, method = "furthest")
cluster_results <- cluster(distance_data, cutoff, method = "nearest")
cluster_results <- cluster(distance_data, cutoff, method = "average")
cluster_results <- cluster(distance_data, cutoff, method = "weighted")
Given a list of i indexes, j indexes, and distance values, this function creates a sparse distance matrix. All three vectors must have the same length.
create_sparse_matrix(i_index, j_index, distances)
i_index |
A list of i indexes, must be numeric |
j_index |
A list of j indexes, must be numeric |
distances |
A list of the distances at each pair of i and j indexes |
a 'dgTMatrix' from the 'Matrix' library.
i_values <- as.integer(1:100)
j_values <- as.integer(sample(1:100, 100, TRUE))
x_values <- as.numeric(runif(100, 0, 1))
s_matrix <- create_sparse_matrix(i_values, j_values, x_values)
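To make the triplet layout behind a 'dgTMatrix' concrete, here is a minimal sketch that builds one directly with the 'Matrix' package. It is an illustration under stated assumptions, not the implementation of create_sparse_matrix(): the toy vectors are hypothetical, and the use of sparseMatrix() with repr = "T" is a choice made for this sketch.

```r
library(Matrix)

# Hypothetical toy input: three parallel vectors of equal length.
i_index <- c(1L, 2L, 3L)           # row indexes
j_index <- c(2L, 3L, 1L)           # column indexes
distances <- c(0.01, 0.02, 0.05)   # distance stored at each (i, j) pair

# The three vectors must have the same length.
stopifnot(length(i_index) == length(j_index),
          length(j_index) == length(distances))

# Build a square matrix in triplet ("T") representation.
n <- max(i_index, j_index)
triplet <- sparseMatrix(i = i_index, j = j_index, x = distances,
                        dims = c(n, n), repr = "T")

class(triplet)   # inspect the resulting class
triplet[1, 2]    # look up a single stored distance
```

Entries that are not listed in the vectors are treated as zero rather than stored, which is what keeps the representation compact.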
This helper function generates file paths to the package's internal example data. Use it if you want to follow along with the examples or interact with the data.
example_path(file = NULL)
file |
The file name of the data; leave as NULL (default) to get full list of example files |
The path to the file as a 'character', or a vector of 'character' giving the example filenames if 'file = NULL'.
example_path("amazon_phylip.dist")
example_path()
'get_abundance()' returns the generated abundance 'data.frame' from the 'cluster()' function.
get_abundance(cluster_data)
cluster_data |
The output from the 'cluster()' function. |
a shared data.frame
cutoff <- 0.2
count_table <- read_count(example_path("amazon.full.count_table"))
distance_data <- read_dist(example_path("amazon_column.dist"), count_table, cutoff, FALSE)
df_clusters <- cluster(distance_data, cutoff, method = "opticlust")
shared <- get_abundance(df_clusters)
'get_bins()' returns a 'data.frame' of the generated clusters from the 'cluster()' function.
get_bins(cluster_data)
cluster_data |
The output from the 'cluster()' function. |
the created cluster 'data.frame'.
cutoff <- 0.2
count_table <- read_count(example_path("amazon.full.count_table"))
distance_data <- read_dist(example_path("amazon_column.dist"), count_table, cutoff, FALSE)
df_clusters <- cluster(distance_data, cutoff, method = "opticlust")
clusters <- get_bins(df_clusters)
This function returns the count table that was used to generate the distance object.
get_count_table(distance_object)
distance_object |
The output from the 'read_dist()' function. |
a count_table 'data.frame'.
cutoff <- 0.2
count_table <- read_count(example_path("amazon.full.count_table"))
distance_data <- read_dist(example_path("amazon_column.dist"), count_table, cutoff, FALSE)
count_table <- get_count_table(distance_data)
Returns the distance cutoff of the cluster object from the 'cluster()' function.
get_cutoff(cluster_data)
cluster_data |
The output from the 'cluster()' function. |
the cutoff value as a 'dbl'
cutoff <- 0.2
count_table <- read_count(example_path("amazon.full.count_table"))
distance_data <- read_dist(example_path("amazon_column.dist"), count_table, cutoff, FALSE)
df_clusters <- cluster(distance_data, cutoff, method = "opticlust")
cutoff <- get_cutoff(df_clusters)
This function will generate a 'data.frame' that contains the distances of all the indexes.
get_distance_df(distance_object)
distance_object |
The output from the 'read_dist()' function. |
a distance 'data.frame'.
cutoff <- 0.2
count_table <- read_count(example_path("amazon.full.count_table"))
distance_data <- read_dist(example_path("amazon_column.dist"), count_table, cutoff, FALSE)
# Extract the distances held by the distance object as a data.frame
distance_df <- get_distance_df(distance_data)
'get_metrics()' returns the generated metrics 'data.frame' from the 'cluster()' function.
get_metrics(cluster_data)
cluster_data |
The output from the 'cluster()' function. |
a list of metric data.frames
cutoff <- 0.2
count_table <- read_count(example_path("amazon.full.count_table"))
distance_data <- read_dist(example_path("amazon_column.dist"), count_table, cutoff, FALSE)
df_clusters <- cluster(distance_data, cutoff, method = "opticlust")
list_of_metrics <- get_metrics(df_clusters)
This function will read and return your count table. It can take in sparse and full count tables.
read_count(count_table_path)
count_table_path |
The file path of your count table. |
a count table 'data.frame'.
count_table <- read_count(example_path("amazon.full.count_table"))
Read in distances from a file that is formatted with three columns for the row, column, and distance of a sparse, square matrix or in a phylip-formatted distance matrix.
read_dist(distance_file, count_table, cutoff, is_similarity_matrix = FALSE)
distance_file |
Either a phylip or column distance file, or a sparse matrix. The function will detect the format for you. |
count_table |
A table of names and the given abundance per group. Can be in mothur's sparse or full format. The function will detect the format for you. |
cutoff |
The value you wish to use as a cutoff when clustering. |
is_similarity_matrix |
Set to TRUE if the input is a similarity matrix; leave as FALSE (default) if it is a distance matrix. |
A distance 'externalptr' object that contains all your distance information. Its contents can be accessed using 'get_distance_df()'.
i_values <- as.integer(1:100)
j_values <- as.integer(sample(1:100, 100, TRUE))
x_values <- as.numeric(runif(100, 0, 1))
s_matrix <- create_sparse_matrix(i_values, j_values, x_values)
sparse_count <- data.frame(
  Representative_Sequence = 1:100,
  total = rep(1, times = 100))
column_path <- example_path("amazon_column.dist")
phylip_path <- example_path("amazon_phylip.dist")
count_table <- read_count(example_path("amazon.full.count_table"))
data_column <- read_dist(column_path, count_table, 0.03)
data_phylip <- read_dist(phylip_path, count_table, 0.03)
data_sparse <- read_dist(s_matrix, sparse_count, 0.03)
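For readers unfamiliar with the three-column layout mentioned above, the following base R sketch writes and re-reads a tiny file in that shape. Everything in it is hypothetical: the sequence names, the distances, the space separator, and the column labels are assumptions made for illustration. The final filtering step only shows what a cutoff of 0.03 means for individual pairs; it is not a description of what read_dist() does internally.

```r
# Hypothetical toy data: one line per pair (row, column, distance).
column_dist <- data.frame(
  row = c("seq1", "seq1", "seq2"),
  column = c("seq2", "seq3", "seq3"),
  distance = c(0.01, 0.25, 0.02)
)

# Write the pairs to a temporary file with no header line.
toy_path <- tempfile(fileext = ".dist")
write.table(column_dist, toy_path, sep = " ", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Read the file back as a plain data.frame.
pairs <- read.table(toy_path,
                    col.names = c("row", "column", "distance"),
                    stringsAsFactors = FALSE)

# Illustrative only: the pairs whose distance is at or below the cutoff.
cutoff <- 0.03
kept <- pairs[pairs$distance <= cutoff, ]
kept
```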
'split_clusters_to_list()' will extract clusters from the cluster generated 'data.frame'. It will then turn those clusters into a list. This allows users to more easily visualize their data.
split_clusters_to_list(cluster)
cluster |
The output generated from the 'cluster()' function. |
a named 'list' of clusters.
cutoff <- 0.2
count_table <- read_count(example_path("amazon.full.count_table"))
distance_data <- read_dist(example_path("amazon_column.dist"), count_table, cutoff, FALSE)
cluster_results <- cluster(distance_data, cutoff, method = "opticlust")
cluster_list <- split_clusters_to_list(cluster_results)
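The idea of turning a cluster 'data.frame' into a named list can be sketched in base R with split(). This is not the implementation of split_clusters_to_list(). It assumes a hypothetical table with one row per sequence, using the default column names 'feature' and 'bin' from cluster(); the actual layout of the package's output may differ.

```r
# Hypothetical cluster table: one row per sequence, labelled with its bin.
bins <- data.frame(
  feature = c("seq1", "seq2", "seq3", "seq4"),
  bin = c("bin1", "bin1", "bin2", "bin3"),
  stringsAsFactors = FALSE
)

# Group the sequence names by bin; the list names are the bin labels.
bin_list <- split(bins$feature, bins$bin)

names(bin_list)     # the bin labels
lengths(bin_list)   # the number of sequences in each bin
```

A list keyed by bin makes it straightforward to look up the members of a single cluster, for example with bin_list[["bin1"]].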
If the count table is already valid nothing will change, otherwise it will add a new group to the count table file.
validate_count_table(count_table_df)
count_table_df |
The count table 'data.frame' object. |
Determines whether user supplied count table is valid
A validated count table 'data.frame'
count_table <- read.delim(example_path("amazon.full.count_table"))
count_table_valid <- validate_count_table(count_table)