Bonus exercise: Using Metaxa2 to investigate the taxonomic content

Now when we are familiar to using R, we can just as well use it to go through another type of output generated by the Metaxa2 software. Metaxa2 does classification of rRNA at different taxonomic levels, assigning each read to a taxonomic affiliation only if it is reliably able to (given conservation between taxa etc.) You can read more about Metaxa2 here:

Install Metaxa2

For this exercise to work, you need Metaxa2 installed. If you did not do this earlier, here are the installation instructions again. If you did install Metaxa2 at the beginning of the workshop, you can skip this step and move straight to the next heading!

The code for Metaxa2 is available from You can install Metaxa2 as follows:

# Create a src and a bin directory
mkdir -p ~/src
mkdir -p ~/bin

# Go to the source directory and download the Metaxa2 tarball
cd ~/src
tar -xzvf Metaxa2_2.0rc3.tar.gz
cd Metaxa2_2.0rc3

# Run the installation script

# Try to run Metaxa2 (this should bring up the main options for the software)
metaxa2 -h

If this did not work, you can try this manual approach:

cd ~/src/Metaxa2_2.0rc3
cp -r metaxa2* ~/bin/

# Then try to run Metaxa2 again
metaxa2 -h

If this brings up the help message, you are all set!

Generating family level taxonomic counts

If you have already run Metaxa2 to get the number of 16S rRNA sequences, you can use the output of those runs. Otherwise you need to run the following command on all the raw read data from all libraries:

metaxa2 -i <input file> -o <output file> --cpu 16 --align none

To get counts on the family level from the metaxa2 output, we will use another tool bundled with the Metaxa2 package; the Metaxa2 Taxonomic Travesal Tool (metaxa2_ttt). Take a look at its options by typing:

metaxa2_ttt -h

In this exercise we are interested in bacterial counts only , so we will use the -t b option. Since we are only interested in family abundance (we have too little data to get any good genus or species counts), we will only output the data for phyla, classes, orders and families, that is we will use the -m 5 option. As input files, you should use the files ending with ”.taxonomy.txt” that Metaxa2 produced as output. That should give you a command looking like this:

metaxa2_ttt -i <metaxa .taxonomy.txt file> -o <output file> -m 5 -t b

Run this command on the taxonomy.txt files from all input libraries. It should be really quick. If you type ls you will notice that metaxa2_ttt produced a bunch of .level_X.txt files. Those are the files we are going to work with next.

Visualizing family level taxonomic counts

To visualize the family level counts, we will once again use R. Fire it up again and load in the count tables from Metaxa2:

b1_fam = read.table("baltic1.level_5.txt", sep = "\t", row.names = 1)

Repeat this procedure for all four data set. If you saved your workspace, the merge_four function should still be available. You can try it out on the taxonomic counts:

all_fam = merge_four(b1_fam,b2_fam,swe_fam,ind_fam,c("Baltic 1","Baltic 2","Sweden", "India"))

Let’s load in the gplots library again, and make a heatmap of the raw data:

heatmap.2(all_fam, trace = "none", col = c("grey",colorpanel(255,"black","red","yellow")), margin = c(5,10), cexCol = 1, cexRow = 0.7)

As you will notice, we will need to do some tweaking to fit in the taxonomic data:

heatmap.2(all_fam, trace = "none", col = c("grey",colorpanel(255,"black","red","yellow")), margin = c(5,30), cexCol = 1, cexRow = 0.7)

Apply normalizations

As you might already have guessed, taxonomic count data suffers from the same biases from, for example, sequencing library size as other gene data. To account for that, we will apply a normalization procedure. Please note that the normalization methods 2 and 3 (number of 16S and number of total matches to database) would in this case be the same. In other words, the both yield the relative abundances of the taxa. We will therefore only look at two normalization procedures in this part of the lab.

First, we will normalize to the number of reads in each sequencing library. Find the note you have taken on the data set sizes. Then apply a command like this on the data:

b1_fam_norm1 = b1_fam / 118025 * 1000000

That will give you the 16S rRNA counts for the different families per million reads. Do the same thing for the other data sets.

Next, we will do the same for the other type of normalization, the division by the mapped number of reads/total number of 16S rRNA. This can, once more, be done by dividing the vector by its sum:

b1_fam_norm2 = b1_fam / sum(b1_fam)

Follow the above procedure for all the data sets, and store the final result from merge_four into a variable, for example called fam_norm2.

Comparing taxonomic distributions

Next we will compare the taxonomic composition of the four environments. Let’s start out by just using a barplot. To get the different taxa on the x-axis, we will transform the matrix with normalized counts using the t() command. But first we need to set the margins to fit the taxonomic names:

par(mar = c(25, 4, 4, 2))
barplot(t(fam_norm1), main = "Counts per million reads", las = 2, cex.names = 0.6, beside = TRUE)

We can then do the same for the relative abundances:

barplot(t(fam_norm2), main = "Relative abundance", las = 2, cex.names = 0.6, beside = TRUE)

To only look at families present in at least two samples, we can use the following command for filtering:

fam_norm1_filter = fam_norm1[rowSums(fam_norm1 > 0) >= 2,]
barplot(t(fam_norm1_filter), main = "Counts per million reads", las = 2, cex.names = 0.6, beside = TRUE)

Question: Which normalization method would be most suitable to use in this case? Why?

We can also look at the differences in taxonomic content using a heatmap. As before, we will use the squareroot as a variance stabilizing transform:

heatmap.2(sqrt(fam_norm1), trace = "none", col = c("grey",colorpanel(255,"black","red","yellow")), margin = c(30,10), cexCol = 1, cexRow = 0.7)

Finally, we can of course also use PCA on taxonomic abundances. We will turn back to the prcomp PCA command:

fam_norm1_pca = prcomp(sqrt(fam_norm1))

We can visualize the PCA using the biplot command:

biplot(fam_norm1_pca, cex = 0.5)

To see the proportion of variance explained by the different components, we can use the normal plot command:


Question: Can you think about any other type of problem with the data we are using now? This problem applies to both kinds of data, but should be particularly problematic with the taxonomic counts...