========================================== 6. Analyzing RNAseq expression with Salmon ========================================== We will use `salmon `__ to quantify expression. `Salmon `__ is a new breed of software for quantifying RNAseq reads that is both really fast and takes transcript length into consideration (`Patro et al. 2015 `__). Additional references: https://github.com/ngs-docs/2015-nov-adv-rna/blob/master/salmon.rst http://angus.readthedocs.io/en/2016/rob_quant/tut.html https://2016-aug-nonmodel-rnaseq.readthedocs.io/en/latest/quantification.html ---- Be sure you have loaded the right Python packages :: source ~/venv/bin/activate Salmon location ========== I have already installed salmon for you. You may prepare your computer to run salmon by altering the path:: PATH=${PATH}:/work/eelpondworkshop/Salmon-0.8.2_linux_x86_64/bin Test salmon by requesting citation information:: salmon --cite Run Salmon ========== First, build an index for your new transcriptome: :: cd ${PROJECT} mkdir -p quant cd quant ln -s ${PROJECT}/assembly/trinity_out_dir/Trinity.fasta . salmon index --index nema --transcripts Trinity.fasta --type quasi And also link in the QC reads (produced in :doc:`1-quality`): :: ln -s ../quality/*R1*.qc.fq.gz . ln -s ../quality/*R2*.qc.fq.gz . Then, run the salmon command: :: for R1 in *R1*.qc.fq.gz do sample=$(basename $R1 extract.qc.fq.gz) echo sample is $sample, R1 is $R1 R2=${R1/R1/R2} echo R2 is $R2 salmon quant -i nema -p 2 -l IU -1 <(gunzip -c $R1) -2 <(gunzip -c $R2) -o ${sample}quant done This will create a bunch of directories named something like ``0Hour_ATCACG_L002001.quant``, containing a bunch of files. Take a look at what files there are: :: find 0Hour_ATCACG_L002_R1_001.quant -type f You should see:: 0Hour_ATCACG_L002_R1_001.quant/quant.sf 0Hour_ATCACG_L002_R1_001.quant/aux_info/observed_bias.gz 0Hour_ATCACG_L002_R1_001.quant/aux_info/observed_bias_3p.gz 0Hour_ATCACG_L002_R1_001.quant/aux_info/fld.gz 0Hour_ATCACG_L002_R1_001.quant/aux_info/expected_bias.gz 0Hour_ATCACG_L002_R1_001.quant/aux_info/meta_info.json 0Hour_ATCACG_L002_R1_001.quant/cmd_info.json 0Hour_ATCACG_L002_R1_001.quant/logs/salmon_quant.log 0Hour_ATCACG_L002_R1_001.quant/libParams/flenDist.txt 0Hour_ATCACG_L002_R1_001.quant/lib_format_counts.json The two most interesting files are ``salmon_quant.log`` and ``quant.sf``. The latter contains the counts; the former contains the log information from running things. Working with the counts ----------------------- The ``quant.sf`` files actually contain the relevant information about expression -- take a look:: head -20 0Hour_ATCACG_L002_R1_001.quant/quant.sf The first column contains the transcript names, and the fifth column is what edgeR etc will want - the "raw counts". However, they're not in a convenient location / format for edgeR to use; let's fix that. Now, grab the script... :: curl -L -O https://raw.githubusercontent.com/dib-lab/eel-pond/master/gather-counts.py and run it:: python ./gather-counts.py This will give you a bunch of .counts files, processed from the quant.sf files and named for the directory they are in.