============= 4. Assembling ============= At last! All that filtering and diginorming is done, and we can get down to the serious business of assembling. Huzzah! Install some assemblers ======================= First, let's install Velvet:: cd /root curl -O http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.10.tgz tar xzf velvet_1.2.10.tgz cd velvet_1.2.10 make MAXKMERLENGTH=51 cp velvet? /usr/local/bin And also IDBA:: cd /root curl -O http://hku-idba.googlecode.com/files/idba-1.1.1.tar.gz tar xzf idba-1.1.1.tar.gz cd idba-1.1.1 ./configure && make Splitting out PE and SE reads ============================= Let's assemble just one group, for now -- group 5 should be nice and small. Generally assemblers will want interleaved reads to be distinguished from orphaned reads, so let's split 'em out:: mkdir group5 cd group5 python /usr/local/share/khmer/scripts/extract-paired-reads.py ../kak.group0005.fa.gz python /usr/local/share/khmer/scripts/extract-paired-reads.py ../kak.group0005.fa.gz.sweep3 mv kak.group0005.fa.gz.sweep3.pe kak.group5.nodn.pe mv kak.group0005.fa.gz.sweep3.se kak.group5.nodn.se Using Velvet ============ I personally really like the Velvet assembler, since it yields pretty good results in a wide variety of situations. It's also rather fast. The downside is that you have to specify a 'k' parameter, which gets annoying because it gives you different results (see presentation). So, let's just assemble across a lot of k's. :: for k in {21..43..2} do velveth dn.$k $k -fasta -shortPaired kak.group0005.fa.gz.pe -short kak.group0005.fa.gz.se velvetg dn.$k -exp_cov auto done for k in {21..43..2} do velveth nodn.$k $k -fasta -shortPaired kak.group5.nodn.pe -short kak.group5.nodn.se velvetg nodn.$k -exp_cov auto done Using IDBA ========== I've heard good things about IDBA. Let's give it a try:: idba_ud --pre_correction -r kak.group0005.fa.gz.pe -o idba.dn.d idba_ud --pre_correction -r kak.group5.nodn.pe -o idba.nodn.d/ Getting stats for the assemblies ================================ To get some basic stats for the assemblies, run:: python /usr/local/share/khmer/sandbox/assemstats3.py 500 *.??/contigs.fa idba.*.d/scaffold.fa This will yield something like:: N sum max filename 38 671957 83467 dn.21/contigs.fa 32 668918 83568 dn.23/contigs.fa 35 668509 83401 dn.25/contigs.fa 31 671843 83817 dn.27/contigs.fa 32 669104 83721 dn.29/contigs.fa 32 672735 84066 dn.31/contigs.fa 32 673102 83774 dn.33/contigs.fa 31 674629 83912 dn.35/contigs.fa 31 677446 84200 dn.37/contigs.fa 33 681099 84554 dn.39/contigs.fa 35 685245 84852 dn.41/contigs.fa 40 686733 85276 dn.43/contigs.fa 41 649574 62719 nodn.21/contigs.fa 39 639388 62155 nodn.23/contigs.fa 49 646132 62145 nodn.25/contigs.fa 39 647100 83798 nodn.27/contigs.fa 38 650487 83750 nodn.29/contigs.fa 33 649863 83770 nodn.31/contigs.fa 31 636979 83822 nodn.33/contigs.fa 35 645536 83856 nodn.35/contigs.fa 36 647848 83800 nodn.37/contigs.fa 33 654660 83934 nodn.39/contigs.fa 36 645126 83897 nodn.41/contigs.fa 34 660289 83231 idba.dn.d/scaffold.fa 45 666147 41120 idba.nodn.d/scaffold.fa Extracting sequences over a certain length ========================================== Let's say that we want to work with the sequences in dn.43/contigs.fa, but we want to get rid of all the small sequences. How? Easy:: /usr/local/share/khmer/sandbox/extract-long-sequences.py 500 dn.43/contigs.fa > group5-assembly.fa ---- Next: :doc:`5-mapping-and-quantitation`.