Normalization of count data from the metagenomic data sets

An important aspect of working with metagenomic data is to apply proper normalization to the retrieved counts. There are several ways to do this, and the method of choice depends partly on the research question investigated, and partly on more philosophical considerations. Let’s start with a bit of theory.

Why is normalization important?

Generally, sequencing data sets are not of the same size. In addition, different genes and genomes come in different sizes, which means that at equal coverage, the number of reads mapped to a certain gene or region will be directly dependent on the length of that region. Luckily, the latter issue is not a huge problem for Pfam families (although it exists), and we will not deal with it further today. We will, however, deal with the size of the sequencing libraries. To make reasonably fair comparisons between data sets, we need to normalize the gene counts to something. Let’s begin by checking how unequal the libraries are. You can do that by counting the number of sequences in the FASTA files, i.e. by counting the number of “>” characters in each file, using grep:

grep -c ">" <input file>
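If you want to count all libraries in one go, a small shell loop like the sketch below prints one count per file. The *.fasta file name pattern is only an assumption; adjust it to match your own files:

for f in *.fasta; do
    # Print the file name followed by the number of sequences it contains
    echo "$f: $(grep -c '>' "$f")"
done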

As you will see, there are quite substantial differences in the number of reads in each library. How do we account for that?

What normalization methods are possible?

The choice of normalization method will depend on what research question we want to ask. An easy way of removing the technical bias related to different sequencing effort in different libraries is to simply divide each gene count by the total library size. That yields the relative proportion of reads assigned to that gene. To make the number easier to interpret, we can multiply it by 1,000,000 to get the number of reads per million assigned to that gene or feature.

(counts of gene X / total number of reads) * 1000000
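If you want to sanity-check the arithmetic, the calculation can be done directly in the shell with awk. The numbers below are made up purely for illustration; plug in your own gene count and library size:

# Hypothetical values: 523 reads assigned to a Pfam family,
# in a library of 1,245,678 reads in total
awk -v count=523 -v libsize=1245678 'BEGIN { print count / libsize * 1000000 }'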

This is a quick way of normalizing, but it does not take the composition of the sample into account. Say that you are interested in studying bacterial gene content within, for example, different plant hosts. The interesting changes in bacterial composition might then be drowned out by genetic material from the host plant, which will have a huge impact on the apparent gene abundances of the bacteria, even if those abundances are actually the same. The same applies to complex microbial communities containing bacteria, single-celled eukaryotes and viruses. In such cases, it might be better to normalize to the number of bacteria in the sample (or eukaryotes, if that is what you want to study). One way of doing that is to count the number of reads mapping to the 16S rRNA gene in each sample. You can then divide each gene count by the number of 16S rRNA counts, to yield a genes-per-16S ratio.

(counts of gene X / counts of 16S rRNA gene)
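The same kind of awk one-liner works here; again, the numbers are made-up placeholders for a gene count and the 16S rRNA count from the same sample:

# Hypothetical values: 523 reads assigned to the gene, 1,489 bacterial 16S rRNA reads
awk -v count=523 -v ssu=1489 'BEGIN { print count / ssu }'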

There are a few problems with using the 16S rRNA gene in this way. The most prominent one is that the gene exists in a single copy in some bacteria, but in multiple (sometimes more than 10) copies in other species. That means that this number will not truly be a per-genome estimate. Other genetic markers, such as the rpoB gene, have been suggested for this purpose, but have not yet taken off.

Finally, we could imagine a scenario in which you are only interested in the relative proportions of the different annotated features in your sample. One can then instead divide each gene count by the total number of reads that mapped to anything in the database used. That gives relative proportions, and removes a lot of “noise”, but has the limitation that only the well-characterized part of the microbial community can be studied; the rest is ignored.

(counts of gene X / total number of mapped reads)
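And a corresponding sketch for this method, again with hypothetical numbers for the gene count and the total number of Pfam-mapped reads, printing both the proportion and the same value as a percentage:

# Hypothetical values: 523 reads assigned to the gene, 308,452 reads mapped to any Pfam family
awk -v count=523 -v mapped=308452 'BEGIN { print count / mapped, count / mapped * 100 "%" }'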

Trying out some normalization methods

We are now ready to try out these methods on our data. Let’s begin by generating the numbers we need for normalization, starting with the library sizes. As you remember, those numbers can be obtained using grep:

grep -c ">" <input file>

To get the number of 16S rRNA sequences, we will use Metaxa2. If you did not install it, you can “cheat” by getting the numbers from this file: /proj/g2014113/metagenomics/annotation/metaxa2_16S_rRNA_counts.txt. If you installed it previously, you can test it out using the following command:

metaxa2 -i <input file> -o <output file> --cpu 16 --align none

Metaxa2 will take a few minutes to run. You will then be able to get the number of bacterial 16S rRNA sequences from the file ending with .summary.txt.
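The exact layout of the summary file can vary between Metaxa2 versions, so it is a good idea to first have a look at it with less; if you then just want to fish out the line(s) mentioning bacteria, a case-insensitive grep will do (replace the placeholder with your actual output name):

less <output file>.summary.txt
grep -i "bacteria" <output file>.summary.txt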

Finally, we would like to get the number of reads mapping to any Pfam family in the database. To get that number, we can again use grep. This time, however, we will use it to remove the lines that we are not interested in and count the rest. This can be done as follows:

grep -c -v "^#" <hmmer output file>

That will remove all lines beginning with a # character and count all remaining lines. Write down all the numbers you have obtained during this exercise; we will use them in the next step!
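If you want to collect the mapped-read counts for all samples in one go, the same kind of loop as before can be used. The *.hmmout file name pattern is only an assumption; adjust it to whatever your hmmer output files are actually called:

for f in *.hmmout; do
    # File name followed by the number of non-comment lines, i.e. reads with a Pfam match
    echo "$f: $(grep -c -v '^#' "$f")"
done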