0. Downloading and Saving Your Initial Data

We’re going to run transcriptome assembly on the UPR HPCf cluster called boqueron because that way (a) you don’t need to buy a big computer, and (b) I don’t have to figure out all the special details of your own computer system.

This does mean that the first thing you need to do is get yourself and your data over to boqueron. I tend to just store it there in the first place, because...

The basics

... It’s a pain to move your data around. Every system has its own special program for managing data. Contact the helpdesk if you need assistance moving your data on or off boqueron.

Logging in to boqueron

To log in remotely to boqueron you need an ssh client. An example is PuTTY, a secure version of the puttytel program you use for registering for courses at UPR.

The HPCf website has a brief primer on logging in to boqueron.

Using curl to copy files

You can also use curl to download files one at a time from Web or FTP sites. For example, to save a file from a website, you could use:

cd $WORK
curl -O http://path/to/file/on/website

Once you have the files, figure out their size using du -sh (e.g. after the above, du -sh $WORK will tell you how much data you have saved under /work).
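If you have several files to fetch, the one-at-a-time curl command above can be wrapped in a loop. The following is a minimal sketch and not part of the original protocol: the function name fetch_all and the idea of a URL list file are hypothetical, and it assumes curl and du are available where you run it.

```shell
# fetch_all LIST DEST: download every URL listed in LIST (one per line) into DEST.
# Both the function name and the list-file convention are illustrative assumptions.
fetch_all() {
    list="$1"
    dest="$2"
    mkdir -p "$dest"
    while IFS= read -r url; do
        # Skip blank lines and lines starting with '#'.
        case "$url" in ''|'#'*) continue ;; esac
        # -f: fail on server errors instead of saving an error page
        # -sS: hide the progress meter but still show errors
        # -L: follow redirects; -O: keep the remote file name
        (cd "$dest" && curl -fsSL -O "$url") || echo "failed: $url" >&2
    done < "$list"
    # Report how much data is now stored under DEST.
    du -sh "$dest"
}
```

For example, with one URL per line in a file called urls.txt (a made-up name), running fetch_all urls.txt $WORK/downloads would try each download in turn and then print the total size of that directory.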

Any files in the ‘/work’ directory may be lost if the filesystem crashes. However, files stored in the ‘/home’ directory are backed up and will remain available.

More information on the ‘/home’ and ‘/work’ filesystems is available at the HPCf website.
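Because ‘/work’ is not backed up, it can be worth copying anything you cannot easily regenerate over to ‘/home’. Here is a rough sketch of one way to do that; the function name save_to_home and the default backups directory are hypothetical, and it assumes you have enough space under ‘/home’, which this page does not say.

```shell
# save_to_home SRC [DEST]: copy a file or directory into DEST and check the copy.
# Pass SRC without a trailing slash. DEST defaults to $HOME/backups, which is
# an illustrative location, not one defined by the HPCf.
save_to_home() {
    src="$1"
    dest="${2:-$HOME/backups}"
    mkdir -p "$dest"
    # -R: copy directories recursively; -p: preserve timestamps and permissions.
    cp -Rp "$src" "$dest/" || return 1
    # diff -r prints nothing and succeeds when the copy matches the original.
    diff -r "$src" "$dest/$(basename "$src")" && echo "saved: $dest/$(basename "$src")"
}
```

You might call it as save_to_home $WORK/data once you have results you want to keep; if diff reports any differences, the copy should not be trusted.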

Some test data

To get started with multi-file analysis and assembly, I’ve provided some test mRNAseq data from embryonic stages of Nematostella vectensis; the source is this excellent paper by Tulin et al., “A quantitative reference transcriptome for Nematostella vectensis”. Download the archive into $WORK, then make a directory to hold the data and unpack it there:

cd $WORK
curl -O https://s3.amazonaws.com/public.ged.msu.edu/mrnaseq-subset.tar
mkdir data
cd data
tar xvf ../mrnaseq-subset.tar
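Before moving on, you may want to check that the archive unpacked completely. The sketch below is one way to do that by comparing file counts; the function name unpack_checked is hypothetical, and a matching count is only a rough sanity check, not proof that every file is intact.

```shell
# unpack_checked ARCHIVE DIR: extract ARCHIVE into DIR and compare file counts.
# The function name is an illustrative assumption, not part of the protocol.
unpack_checked() {
    archive="$1"
    dir="$2"
    mkdir -p "$dir"
    # tar tf lists the archive contents without extracting; entries ending
    # in '/' are directories, so leave them out of the count.
    expected=$(tar tf "$archive" | grep -vc '/$')
    tar xf "$archive" -C "$dir" || return 1
    # Count the regular files that now exist under DIR.
    found=$(find "$dir" -type f | wc -l | tr -d ' ')
    echo "archive lists $expected files; found $found under $dir"
    # Succeed only if at least as many files exist as the archive lists.
    [ "$found" -ge "$expected" ]
}
```

With the test data above, this could be run as unpack_checked $WORK/mrnaseq-subset.tar $WORK/data in place of the mkdir, cd and tar steps.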

Additional information

Throughout this protocol we will be using command-line interfaces. A short document explains the notation used here (see Commandline conventions). The command line we use is bash, a shell used on UNIX and Linux systems. As you work more in bioinformatics, you will eventually need to find a UNIX/Linux tutorial, and perhaps a bash scripting tutorial. The HPCf has a very brief Linux primer.


Next: 1. Quality Trimming and Filtering Your Sequences


LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github.