================================================
1. Quality Trimming and Filtering Your Sequences
================================================

Login to boqueron, and let's set up some software.

Install software
----------------

Install `khmer <http://khmer.readthedocs.org>`_. We can use
Python's pip command ::

   cd ${HOME}
   module load python2
   pip install --user virtualenv
   python2.7 -m virtualenv venv
   source ${HOME}/venv/bin/activate
   pip install khmer
 
The use of ``virtualenv`` allows us to install Python software without having
root access. If you come back to this protocol in a different terminal session
you will need to rerun ``source ${HOME}/venv/bin/activate``
again.

Find your data
--------------

Load in your own data (as in :doc:`0-download-and-save`)
(again, this is the data
from `Tulin et al., 2013 <http://www.evodevojournal.com/content/4/1/16>`__).

Check::

   ls ${WORK}/data

If you see all the files you think you should, good!  Otherwise, debug.

These are FASTQ files -- let's take a look at them::

  less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down, and type 'q' to exit 'less')

Link your data into a working directory
---------------------------------------

Rather than *copying* the files into the working directory, let's just
*link* them in -- this creates a reference so that UNIX knows where to
find them but doesn't need to actually move them around. :
::

   mkdir -p ${WORK}/quality
   cd ${WORK}/quality
   
   ln -fs ${WORK}/data/*.fastq.gz .

(The ``ln`` command does the linking.)

Now, do an ``ls`` to list the files.  If you see only one entry, ``*.fastq.gz``,
then the ln command above didn't work properly.  One possibility is that
your files aren't in /data; another is that their names don't end with
``.fastq.gz``.

Evaluate the quality of your files with FastQC
--------------------------------------------------------

We can use `FastQC <http://www.bioinformatics.babraham.ac.uk/projects/fastqc/>`__ to look at the quality of your sequences.
To use fastqc on boqueron, you can use a version installed by the
staff as a software module::

   module load fastqc
   fastqc *.fastq.gz

The output will be placed in the '${WORK}/quality' folder on boqueron.
You can transfer the html files to 
your local computer; look for the fastqc_report.html files, and
double click on them to load them into your browser.

Find the right Illumina adapters
--------------------------------

You'll need to know which Illumina sequencing adapters were used for
your library in order to trim them off. Below, we will use the TruSeq3-PE.fa
adapters::

   cd ${WORK}/quality
   wget https://sources.debian.net/data/main/t/trimmomatic/0.32+dfsg-2/adapters/TruSeq3-PE.fa

.. note::

   You'll need to make sure these are the right adapters for your
   data.  If they are the right adapters, you should see that some of
   the reads are trimmed; if they're not, you won't see anything
   get trimmed.

Adapter trim each pair of files
-------------------------------

(From this point on, you may want to be running things inside of
screen, so that you detach and log out while it's running; see
:doc:`../amazon/using-screen` for more information.)

If you're following along using the Nematostella data, you should have a
bunch of files that look like this (use 'ls' to show them)::

   24HourB_GCCAAT_L002_R1_001.fastq.gz
                       ^^

Each file with an R1 in its name should have a matching file with an R2 --
these are the paired ends.

See excellent paper on trimming from `MacManes 2014 <http://journal.frontiersin.org/article/10.3389/fgene.2014.00013/full>`__.

We will use the copy of Trimmomatic, read trimming tool (`homepage <http://www.usadellab.org/cms/?page=trimmomatic>`__).
Since Trimmomatic is a java program, we load a module
for java.

Run::

  # Load a newer java version
  module load jdk
  
  # Delete any old orphan reads
  rm -f orphans.qc.fq.gz
  
  for filename in *_R1_*.fastq.gz
  do
    # first, make the base by removing fastq.gz
    base=$(basename $filename .fastq.gz)
    echo $base
        
    # now, construct the R2 filename by replacing R1 with R2
    baseR2=${base/_R1_/_R2_}
    echo $baseR2
        
   # finally, run Trimmomatic
    java -jar /work/eelpondworkshop/Trimmomatic-0.36/trimmomatic-0.36.jar PE ${base}.fastq.gz ${baseR2}.fastq.gz \
      ${base}.qc.fq.gz s1_se \
      ${baseR2}.qc.fq.gz s2_se \
      ILLUMINACLIP:TruSeq3-PE.fa:2:40:15 \
      LEADING:2 TRAILING:2 \
      SLIDINGWINDOW:4:2 \
      MINLEN:25
        
    # save the orphans
    gzip -9c s1_se s2_se >> orphans.qc.fq.gz
    rm -f s1_se s2_se
  done
  
The paired sequences output by this set of commands will be in the files ending in '.qc.fq.gz', with any orphaned sequences all together
in 'orphans.qc.fq.gz'.

Make these trimmed reads read-only and keep them, as we will reuse them later.
::

  chmod u-w ${PROJECT}/quality/*.qc.fq.gz


Interleave the sequences
------------------------

Next, we need to take these R1 and R2 sequences and convert them into
interleaved form, for the next step.  To do this, we'll use scripts
from the `khmer package <http://khmer.readthedocs.org>`__, which we
installed above.

Now let's use a for loop again - you might notice this is only a minor
modification of the previous for loop...
::

   for filename in *_R1_*.qc.fq.gz
   do
        # first, make the base by removing .extract.fastq.gz
        base=$(basename $filename .qc.fq.gz)
        echo $base

        # now, construct the R2 filename by replacing R1 with R2
        baseR2=${base/_R1/_R2}
        echo $baseR2

        # construct the output filename
        output=${base/_R1/}.pe.qc.fq.gz

        (interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | \
            gzip > $output)
   done

The final product of this is now a set of files named
``*.pe.qc.fq.gz`` that are paired-end / interleaved and quality
filtered sequences, together with the file ``orphans.qc.fq.gz`` that
contains orphaned sequences.

Finishing up
------------

Make the end product files read-only
::

   chmod u-w *.pe.qc.fq.gz orphans.qc.fq.gz

to make sure you don't accidentally delete them.

Since you linked your original data files into the ``quality`` directory, you
can now do:
::

   rm *.fastq.gz

to remove them from this location; you don't need them for any future steps.


Things to think about
~~~~~~~~~~~~~~~~~~~~~

Note that the filenames, while ugly, are conveniently structured with the
history of what you've done.  This is a good idea.


OPTIONAL: Evaluate the quality of your files with FastQC again
--------------------------------------------------------------

We can once again use FastQC to look at the
quality of your newly-trimmed sequences::

   cd ${WORK}/quality
   fastqc *.qc.fq.gz


Saving the files
----------------

At this point, you should save these files, which will be used in two
ways: first, for assembly; and second, for mapping, to do quantitation
and ultimately comparative expression analysis.  You can save them by
transferring the files to your local computer.

Next stop: :doc:`2-diginorm`.