==========================================
6. Analyzing RNAseq expression with Salmon
==========================================

We will use `salmon <http://salmon.readthedocs.org/en/latest/>`__ to
quantify expression. `Salmon
<https://github.com/COMBINE-lab/salmon>`__ is a new breed of software
for quantifying RNAseq reads that is both really fast and takes
transcript length into consideration (`Patro et al. 2015
<http://biorxiv.org/content/early/2015/06/27/021592>`__).

Additional references:

https://github.com/ngs-docs/2015-nov-adv-rna/blob/master/salmon.rst

http://angus.readthedocs.io/en/2016/rob_quant/tut.html

https://2016-aug-nonmodel-rnaseq.readthedocs.io/en/latest/quantification.html

----

Be sure you have loaded the right Python packages
::

   source ~/venv/bin/activate

Salmon location
==========

I have already installed salmon for you. You may prepare your computer
to run salmon by altering the path::

  PATH=${PATH}:/work/eelpondworkshop/Salmon-0.8.2_linux_x86_64/bin

Test salmon by requesting citation information::

  salmon --cite

Run Salmon
==========

First, build an index for your new transcriptome:
::

   cd ${PROJECT}
   mkdir -p quant
   cd quant
   ln -s ${PROJECT}/assembly/trinity_out_dir/Trinity.fasta .
   salmon index --index nema --transcripts Trinity.fasta --type quasi

And also link in the QC reads (produced in :doc:`1-quality`):
::

   ln -s ../quality/*R1*.qc.fq.gz .
   ln -s ../quality/*R2*.qc.fq.gz .

Then, run the salmon command:
::

  for R1 in *R1*.qc.fq.gz
  do
    sample=$(basename $R1 extract.qc.fq.gz)
    echo sample is $sample, R1 is $R1
    R2=${R1/R1/R2}
    echo R2 is $R2
    salmon quant -i nema -p 2 -l IU -1 <(gunzip -c $R1) -2 <(gunzip -c $R2) -o ${sample}quant
  done

This will create a bunch of directories named something like
``0Hour_ATCACG_L002001.quant``, containing a bunch of files. Take a
look at what files there are:
::

    find 0Hour_ATCACG_L002_R1_001.quant -type f

You should see::

    0Hour_ATCACG_L002_R1_001.quant/quant.sf
    0Hour_ATCACG_L002_R1_001.quant/aux_info/observed_bias.gz
    0Hour_ATCACG_L002_R1_001.quant/aux_info/observed_bias_3p.gz
    0Hour_ATCACG_L002_R1_001.quant/aux_info/fld.gz
    0Hour_ATCACG_L002_R1_001.quant/aux_info/expected_bias.gz
    0Hour_ATCACG_L002_R1_001.quant/aux_info/meta_info.json
    0Hour_ATCACG_L002_R1_001.quant/cmd_info.json
    0Hour_ATCACG_L002_R1_001.quant/logs/salmon_quant.log
    0Hour_ATCACG_L002_R1_001.quant/libParams/flenDist.txt
    0Hour_ATCACG_L002_R1_001.quant/lib_format_counts.json

The two most interesting files are ``salmon_quant.log`` and
``quant.sf``. The latter contains the counts; the former contains the
log information from running things.

Working with the counts
-----------------------

The ``quant.sf`` files actually contain the relevant information about
expression -- take a look::

   head -20 0Hour_ATCACG_L002_R1_001.quant/quant.sf

The first column contains the transcript names, and the
fifth column is what edgeR etc will want - the "raw counts".
However, they're not in a convenient location / format for edgeR to use;
let's fix that.

Now, grab the script...

::

   curl -L -O https://raw.githubusercontent.com/dib-lab/eel-pond/master/gather-counts.py

and run it::

   python ./gather-counts.py

This will give you a bunch of .counts files, processed from the quant.sf files
and named for the directory they are in.