---
title: "clustur"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{clustur}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
editor_options: 
  chunk_output_type: console
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Background
clustur was developed to be similar to [mothur's cluster
function](https://mothur.org/wiki/cluster/) that was written in C++. In order to
cluster your data, users need to provide their own sparse or phylip-formatted
distance matrix. They also need to provide a count table that either comes from
mothur or that they create in R. Once these objects are built users can call the
`cluster()` function. We currently support 5 methods: opticlust (default) and
furthest, nearest, weighted, and average neighbor.  The opticlust method is
`cluster()` and mothur's default. The speed of the methods implemented in
{clustur} and mothur are comparable; {clustur} may even be faster! Below we will
show you how to create your sparse matrix and count table. If you do not have a
count table, clustur can produce one from you, but it will assume the abundance
of each sequence is one and it will only cluster the sequences in the distance
matrix. The output of running `clustur()` includes what is typically provided in
a mothur-formatted shared file.


## Starting Up
For the official release from CRAN you can use the standard `install.packages()`
function:

```r
# install via cran
install.packages("clustur")
```

For the developmental version, you can use the `install_github()` function from
the {devtools} package

```r
# install via github
devtools::install_github("SchlossLab/clustur")
```

Because {clustur}'s functions make use of a random number generator, users are strongly encouraged to set the seed.

```{r setup}
library(clustur)
set.seed(19760620)
```

## Read count files
clustur will produce the same output using either a sparse (default) or full
[count table](https://mothur.org/wiki/count_file/)

```{r}
full_count_table <- read_count(example_path("amazon.full.count_table"))
sparse_count_table <- read_count(example_path("amazon.sparse.count_table"))
```


## Read distance matrix file

clustur will read both mothur's [column/sparse distance
matrix](https://mothur.org/wiki/column-formatted_distance_matrix/) and
[Phylip-formatted distance ](https://mothur.org/wiki/phylip-formatted_distance_matrix/) matrix formats. 

```{r}
column_distance <- read_dist(example_path("amazon_column.dist"), full_count_table, cutoff = 0.03)
```

or

```{r}
phylip_distance <- read_dist(example_path("amazon_phylip.dist"), full_count_table, cutoff = 0.03)
```

The return value of `distance_data` will be a memory address. If you want a data
frame version of the distances, you can use `get_distance_df(distance_data)`.

```{r}
get_distance_df(column_distance)
get_distance_df(phylip_distance)
```


## Clustering the data

The default method for clustering in `cluster` is "opticlust" 

```{r}
cutoff <- 0.03
cluster_data <- cluster(column_distance, cutoff)
```


### Selecting different clustering methods
```{r}

cluster_data <- cluster(column_distance, cutoff, method = "furthest")
cluster_data <- cluster(column_distance, cutoff, method = "nearest")
cluster_data <- cluster(column_distance, cutoff, method = "average")
cluster_data <- cluster(column_distance, cutoff, method = "weighted")
```


## Output data from clustering

All methods produce a list object with an indicator of the cutoff that was used
(`label`), as well as cluster composition (`cluster`) and shared (`abundance`) data frames.
The `clusters` data frame shows which OTU (Operation Taxonomic Unit) each sequence was assigned to. The `abundance` data frame
contains columns indicating the `OTU` and `sample` identifiers and the
abundance of each OTU in each sample. The OptiClust method also includes the `metrics` data
frame, which describe the optimization value for each iteration in the fitting
process; the data in `clusters` and `shared` are taken from the last iteration.
clustur provides getter functions, `get_label()`, `get_clusters()`,
`get_shared()`, and `get_metrics()`, which will be demonstrated below.

```{r R.options=list(max.print=15)}
clusters <- cluster(column_distance, cutoff, method = "opticlust")
get_cutoff(clusters)
get_bins(clusters)
get_abundance(clusters)
get_metrics(clusters)
```