5. Building transcript families and annotating the sequences¶
Install khmer, screed, and BLAST. (See 1. Quality Trimming and Filtering Your Sequences and BLASTing your assembled data). I would suggest using an m1.large or m1.xlarge machine.
You’ll also need to install some eel-pond scripts:
git clone https://github.com/ctb/eel-pond.git /usr/local/share/eel-pond
Copy in your data¶
You need to get ahold of your assembled transcriptome (from e.g. 3. Running the Actual Assembly). Put it in /mnt.
For the purposes of your first run through, I suggest just grabbing my copy of the Nematostella assembly:
cd /mnt
curl -O https://s3.amazonaws.com/public.ged.msu.edu/trinity-nematostella-raw.fa.gz
Run khmer partitioning¶
Partitioning runs a de Bruijn graph-based clustering algorithm that will cluster your transcripts by transitive sequence overlap. That is, it will build transcript families :).
/usr/local/share/khmer/scripts/do-partition.py -x 1e9 -N 4 --threads 4 nema trinity-nematostella-raw.fa.gz
This should take about 15 minutes, and outputs a file ending in ‘.part’ that contains the partition assignments. Now, group and rename the sequences:
python /usr/local/share/eel-pond/rename-with-partitions.py nema trinity-nematostella-raw.fa.gz.part
mv trinity-nematostella-raw.fa.gz.part.renamed.fasta.gz trinity-nematostella.renamed.fa.gz
Looking at the renamed sequences¶
Let’s look at the renamed sequences:
gunzip -c trinity-nematostella.renamed.fa.gz | head
You’ll see that each sequence name looks like this:
>nema.id1.tr16001 1_of_1_in_tr16001 len=261 id=1 tr=16001
Some explanation:
- ‘nema’ is the prefix that you gave the rename script, above; modify accordingly for your own organism. It’s best to change it each time you do an assembly, just to keep things straight.
- ‘idN’ is the unique ID for this sequence; it will never be repeated in this
- file.
- ‘trN’ is the transcript family, which may contain one or more transcripts.
- ‘1_of_1_in_tr16001’ tells you that this transcript family has only one transcript in it (this one!) Other transcript families may (will) have more.
- ‘len’ is the sequence length.
Doing a preliminary annotation against mouse¶
Now let’s assign putative homology & orthology to these transcripts, by doing BLASTs & reciprocal best hit analysis. First, uncompress your transcripts file:
gunzip trinity-nematostella.renamed.fa.gz
Now, grab the latest mouse RefSeq:
curl -O ftp://ftp.ncbi.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.protein.faa.gz
gunzip mouse.protein.faa.gz
Format both as BLAST databases:
formatdb -i mouse.protein.faa -o T -p T
formatdb -i trinity-nematostella.renamed.fa -o T -p F
And, now, run BLAST in both directions. Note, this may take ~24 hours or longer; you probably want to run it in screen:
blastall -i trinity-nematostella.renamed.fa -d mouse.protein.faa -e 1e-3 -p blastx -o nema.x.mouse -a 8 -v 4 -b 4
blastall -i mouse.protein.faa -d trinity-nematostella.renamed.fa -e 1e-3 -p tblastn -o mouse.x.nema -a 8 -v 4 -b 4
Note
These BLASTs will take a long time, like 24-36 hours. If you want to work with canned BLASTs, do:
curl -O http://athyra.idyll.org/~t/nema.x.mouse.gz
curl -O http://athyra.idyll.org/~t/mouse.x.nema.gz
gunzip nema.x.mouse.gz
gunzip mouse.x.nema.gz
Assigning names to sequences¶
Now, calculate putative homology (best BLAST hit) and orthology (reciprocal best hits):
python /usr/local/share/eel-pond/make-uni-best-hits.py nema.x.mouse nema.x.mouse.homol
python /usr/local/share/eel-pond/make-reciprocal-best-hits.py nema.x.mouse mouse.x.nema nema.x.mouse.ortho
Prepare some of the mouse info:
python /usr/local/share/eel-pond/make-namedb.py mouse.protein.faa mouse.namedb
python -m screed.fadbm mouse.protein.faa
And, finally, annotate the sequences:
python /usr/local/share/eel-pond/annotate-seqs.py trinity-nematostella.renamed.fa nema.x.mouse.ortho nema.x.mouse.homol
This will produce a file ‘trinity-nematostella.renamed.fa.annot’, which will have sequences that look like this:
>nematostella.id1.tr115222 h=43% => suppressor of tumorigenicity 7 protein isoform 2 [Mus musculus] 1_of_7_in_tr115222 len=1635 id=1 tr=115222 1_of_7_in_tr115222 len=1635 id=1 tr=115222
I suggest renaming this file to ‘nematostella.fa’ and using it for BLASTs (see BLASTing your assembled data).
LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github.
comments powered by Disqus