4. Assembling

At last! All that filtering and diginorming is done, and we can get down to the serious business of assembling. Huzzah!

Install some assemblers

First, let’s install Velvet:

cd /root
curl -O http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.10.tgz
tar xzf velvet_1.2.10.tgz
cd velvet_1.2.10
make MAXKMERLENGTH=51
cp velvet? /usr/local/bin

And also IDBA:

cd /root
curl -O http://hku-idba.googlecode.com/files/idba-1.1.1.tar.gz
tar xzf idba-1.1.1.tar.gz
cd idba-1.1.1
./configure && make

Splitting out PE and SE reads

Let’s assemble just one group, for now – group 5 should be nice and small. Generally assemblers will want interleaved reads to be distinguished from orphaned reads, so let’s split ‘em out:

mkdir group5
cd group5
python /usr/local/share/khmer/scripts/extract-paired-reads.py ../kak.group0005.fa.gz
python /usr/local/share/khmer/scripts/extract-paired-reads.py ../kak.group0005.fa.gz.sweep3
mv kak.group0005.fa.gz.sweep3.pe kak.group5.nodn.pe
mv kak.group0005.fa.gz.sweep3.se kak.group5.nodn.se

Using Velvet

I personally really like the Velvet assembler, since it yields pretty good results in a wide variety of situations. It’s also rather fast. The downside is that you have to specify a ‘k’ parameter, which gets annoying because it gives you different results (see presentation).

So, let’s just assemble across a lot of k’s.

for k in {21..43..2}
do
   velveth dn.$k $k -fasta -shortPaired kak.group0005.fa.gz.pe -short kak.group0005.fa.gz.se
   velvetg dn.$k -exp_cov auto
done

for k in {21..43..2}
do
   velveth nodn.$k $k -fasta -shortPaired kak.group5.nodn.pe -short kak.group5.nodn.se
   velvetg nodn.$k -exp_cov auto
done

Using IDBA

I’ve heard good things about IDBA. Let’s give it a try:

idba_ud --pre_correction -r kak.group0005.fa.gz.pe -o idba.dn.d

idba_ud --pre_correction -r kak.group5.nodn.pe -o idba.nodn.d/

Getting stats for the assemblies

To get some basic stats for the assemblies, run:

python /usr/local/share/khmer/sandbox/assemstats3.py 500 *.??/contigs.fa idba.*.d/scaffold.fa

This will yield something like:

N       sum     max     filename
38      671957  83467   dn.21/contigs.fa
32      668918  83568   dn.23/contigs.fa
35      668509  83401   dn.25/contigs.fa
31      671843  83817   dn.27/contigs.fa
32      669104  83721   dn.29/contigs.fa
32      672735  84066   dn.31/contigs.fa
32      673102  83774   dn.33/contigs.fa
31      674629  83912   dn.35/contigs.fa
31      677446  84200   dn.37/contigs.fa
33      681099  84554   dn.39/contigs.fa
35      685245  84852   dn.41/contigs.fa
40      686733  85276   dn.43/contigs.fa
41      649574  62719   nodn.21/contigs.fa
39      639388  62155   nodn.23/contigs.fa
49      646132  62145   nodn.25/contigs.fa
39      647100  83798   nodn.27/contigs.fa
38      650487  83750   nodn.29/contigs.fa
33      649863  83770   nodn.31/contigs.fa
31      636979  83822   nodn.33/contigs.fa
35      645536  83856   nodn.35/contigs.fa
36      647848  83800   nodn.37/contigs.fa
33      654660  83934   nodn.39/contigs.fa
36      645126  83897   nodn.41/contigs.fa
34      660289  83231   idba.dn.d/scaffold.fa
45      666147  41120   idba.nodn.d/scaffold.fa

Extracting sequences over a certain length

Let’s say that we want to work with the sequences in dn.43/contigs.fa, but we want to get rid of all the small sequences. How?

Easy:

/usr/local/share/khmer/sandbox/extract-long-sequences.py 500 dn.43/contigs.fa > group5-assembly.fa

Next: 6. Mapping and abundance quantitation.


LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github.
comments powered by Disqus