Mapping reads back to the assembly¶
There are many different mappers available to map your reads back to the assemblies. Usually they result in a SAM or BAM file (http://genome.sph.umich.edu/wiki/SAM). Those are formats that contain the alignment information, where BAM is the binary version of the plain text SAM format. In this tutorial we will be using bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml).
The SAM/BAM file can afterwards be processed with Picard (http://picard.sourceforge.net/) to remove duplicate reads. Those are likely to be reads that come from a PCR duplicate (http://www.biostars.org/p/15818/).
BEDTools (http://code.google.com/p/bedtools/) can then be used to retrieve coverage statistics.
There is a script available that does it all at once. Read it and try to understand what happens in each step:
less `which map-bowtie2-markduplicates.sh` map-bowtie2-markduplicates.sh -h
Bowtie2 has some nice documentation: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
Question: what does bowtie2-build do?
Picard’s documentation also exists! Two bioinformatics programs in a row with decent documentation! Take a moment to celebrate, then have a look here: http://sourceforge.net/apps/mediawiki/picard/index.php
Question: Why not just remove all identitical pairs instead of mapping them and then removing them?
Question: What is the difference between samtools rmdup and Picard MarkDuplicates?
Mapping reads with bowtie2¶
Take an assembly and try to map the reads back using bowtie2. Do this on an interactive node again, and remember to change the ‘out_21’ part to the actual output directory that you generated:
# Create a new directory and link files mkdir -p ~/asm-workshop/bowtie2 cd ~/asm-workshop/bowtie2 ln -s ../velvet/out_21/contigs.fa contigs.fa ln -s ../sickle/pair1.fastq pair1.fastq ln -s ../sickle/pair2.fastq pair2.fastq # Run the everything in one go script. map-bowtie2-markduplicates.sh -t 1 -c pair1.fastq pair2.fastq pair contigs.fa contigs map > map.log 2> map.err
Inspect the map.log output and see if all went well.
Question: What is the overall alignment rate of your reads that bowtie2 reports?
Add the answer to the doc.
Some general statistics from the SAM/BAM file¶
You can also determine mapping statistics directly from the bam file. Use for instance:
# Mapped reads only samtools view -c -F 4 map/contigs_pair-smds.bam # Unmapped reads only samtools view -c -f 4 map/contigs_pair-smds.bam
From: http://left.subtree.org/2012/04/13/counting-the-number-of-reads-in-a-bam-file/. The number is different from the number that bowtie2 reports, because these are the numbers after removing duplicates. The -smds part stands for running samtools sort, MarkDuplicates.jar and samtools sort again on the bam file. If all went well with the mapping there should also be a map/contigs_pair-smd.metrics file where you can see the percentage of duplication. Add that to the doc as well.
Coverage information from BEDTools¶
Look at the output from BEDTools:
The format is explained here http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html. The map-bowtie2-markduplicates.sh script also outputs the mean coverage per contig:
Question: What is the contig with the highest coverage? Hint: use sort -k