MAKER-P Genome Annotation using Atmosphere (Images Tutorial)

Rationale and background:

MAKER-P is a flexible and scalable genome annotation pipeline that automates the many steps necessary for the detection of protein-coding genes (Campbell et al. 2013). MAKER-P identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, and automatically synthesizes these data into gene annotations with evidence-based quality indices. MAKER-P was developed by the Yandell Lab. Its predecessor, MAKER, is described in several publications (Cantarel et al. 2008; Holt & Yandell 2011). Additional background is available in the MAKER Tutorial at GMOD, which is highly recommended reading. MAKER-P v2.31.9 is currently available as an Atmosphere image and is MPI-enabled for parallel processing.

This tutorial takes users through the following steps:

Launching the MAKER-P Atmosphere image
Moving data from the CyVerse Data Store (optional)
Running MAKER-P on an example small genome assembly

What kind of data do I need?

Mandatory requirements
1. Genome assembly (fasta file)
2. Organism type
  1. Eukaryotic (default, set as: organism_type=eukaryotic)
  2. Prokaryotic (set as: organism_type=prokaryotic)
Additional data that can be used to improve the annotation (Highly recommended)
1. RNA evidence (at least one of them is needed)
  1. Assembled mRNA-seq transcriptome (fasta file)
  2. Expressed sequence tags (ESTs) data (fasta file)
  3. Aligned EST or transcriptome GFF3 from your organism
  4. Aligned EST or transcriptome GFF3 from a closely related organism
2. Protein evidence
  1. protein sequence file in fasta format (i.e. from multiple organisms)
  2. protein gff (aligned protein homology evidence from an external GFF3 file)

What kind of resources will I need for my project?

Enough storage space on the MAKER-P Atmosphere instance for both input and output files
1. Creating and mounting an external volume to the running Atmosphere instance is recommended (see Part 3: Running MAKER-P on the volumes)
Enough AUs to finish your computation. It is recommended to run MAKER-P on a small subset of your data first to estimate the number of AUs needed to finish the whole annotation of your genome
If you think your current AUs are not enough to finish your annotation, consider using WQ-MAKER on Jetstream, which is faster than MAKER-P
Need more help? Click the Intercom button (bottom right)
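
The pilot-run advice above can be turned into a rough number. The sketch below assumes that run time scales linearly with assembly size and that one AU corresponds to one CPU-hour; both are assumptions to check against your own allocation, and the function name estimate_aus is only illustrative.

```shell
# Rough AU estimate from a pilot run.
# Assumptions: linear scaling with assembly size, and 1 AU = 1 CPU-hour.
# Usage: estimate_aus <pilot_seconds> <pilot_bp> <full_bp> <cpus>
estimate_aus() {
  awk -v secs="$1" -v pilot_bp="$2" -v full_bp="$3" -v cpus="$4" \
    'BEGIN { printf "%.1f\n", (secs / 3600) * (full_bp / pilot_bp) * cpus }'
}
```

For example, a pilot that took 3600 seconds on 1,000,000 bp with 4 CPUs could be extrapolated to a 10,000,000 bp assembly with: estimate_aus 3600 1000000 10000000 4. Treat the result as a lower bound, since repeat content and evidence size also affect run time.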

Part 1: Connect to an instance of an Atmosphere Image (virtual machine)

Step 1. Go to https://atmo.cyverse.org and log in with your CyVerse credentials.


Step 2. Create a project. Enter MAKER-P as the project name and "MAKER-P atmosphere tutorial" as the description.


Step 3. Click the project, select the image MAKER-P_2.31.9, and click Launch Instance. It takes ~10-15 minutes for the cloud instance to launch.


Note: Instances can be configured for different amounts of CPU, memory, and storage depending on user needs. This tutorial can be completed with the medium1 instance size (4 CPUs, 8 GB memory, 80 GB root).

Step 4. Once the VM is ready, click it to go to the next screen, where you can open the Web Shell


or ssh into the instance's IP address from a terminal as shown below.

$ ssh <username>@<ip.address>

Part 2: Set up a MAKER-P run using the Web Shell (Make sure you run all your analyses on volume. See below)

Step 1. Get oriented. You will find staged example data in the folder "maker_test_data/" within the folder "/opt". List its contents with the ls command:

$ ls /opt/maker_test_data/
maker_bopts.ctl  maker_exe.ctl  maker_opts.ctl  test_data  test_genome.maker.output

The maker_*.ctl files are a set of configuration files that can be used for this tutorial or generated in Step 5.
The subdirectory "test_data" includes input data files that you will use for this tutorial.
The subdirectory "test_genome.maker.output" contains output that you should expect to see when completing the tutorial. Have a look at "test_data" directory:

$ ls /opt/maker_test_data/test_data/
mRNA.fasta  msu-irgsp-proteins.fasta  O.sativa.hmm  Os-rRNA.fa  plant_repeats.fasta  test_genome_chr1.fasta  test_genome.fasta

The fasta files include a scaled-down genome (test_genome.fasta), which comprises the first 300 kb of each of the 12 chromosomes of rice, and a smaller scaled-down genome (test_genome_chr1.fasta), which comprises the first 300 kb of the first chromosome of rice
mRNA sequences from NCBI (mRNA.fasta)
publicly available annotated protein sequences of rice (MSU7.0 and IRGSP1.0) - msu-irgsp-proteins.fasta
collection of plant repeats (plant_repeats.fasta)
ribosomal RNA sequence of rice (Os-rRNA.fa)
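
Before using any of these fasta files as input, it can help to confirm that they contain the sequences you expect. The following is a minimal sketch, not part of MAKER-P; the function name fasta_lengths is hypothetical. It prints each sequence name with its length using awk.

```shell
# Print each sequence name and its length (tab-separated) for a FASTA file.
fasta_lengths() {
  awk '/^>/ { if (name != "") print name "\t" len
              name = substr($1, 2); len = 0; next }
       { len += length($0) }
       END { if (name != "") print name "\t" len }' "$1"
}
```

Running fasta_lengths /opt/maker_test_data/test_data/test_genome.fasta should list one line per chromosome fragment, which is a quick way to spot truncated or empty records.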

Executables for running MAKER-P are located in /opt/maker/bin:

$ ls /opt/maker/bin
cegma2zff       fasta_merge        iprscan2gff3    maker2jbrowse           maker_functional_gff  map_gff_ids
chado2gff3      fasta_tool         iprscan_wrap    maker2wap               maker_map_ids         mpi_evaluator
compare         genemark_gtf2gff3  maker           maker2zff               map2assembly          mpi_iprscan
cufflinks2gff3  gff3_merge         maker2chado     maker_functional        map_data_ids          tophat2gff3
evaluator       ipr_update_gff     maker2eval_gtf  maker_functional_fasta  map_fasta_ids         wq_maker

As the names suggest, the "/opt/maker/bin" directory includes many useful auxiliary scripts. For example cufflinks2gff3 will convert output from an RNA-seq analysis into a GFF3 file that can be used for input as evidence for MAKER-P. Both Cufflinks and cufflinks2gff3 are available as tools in the iPlant Discovery Environment (DE). Other auxiliary scripts now available in the DE include tophat2gff3, maker2jbrowse, and maker2zff. We recommend reading MAKER Tutorial at GMOD for more information about these.

Step 2. Set up a MAKER-P run. Create a working directory called "maker_test_data" in your home directory using the mkdir command and cd into that directory:

$ cd ~
$ mkdir maker_test_data

$ cd maker_test_data/

Step 3. Copy the directory "test_data" into the current directory using cp -r. Verify the contents using the ls command:

$ sudo cp -r /opt/maker_test_data/test_data .
$ ls
test_data

Step 4. Run the maker command with the -h flag to get a usage statement and list of options. (Ignore the first line of output, the Perl warning that begins with Argument "2.53_01" isn't numeric.)

$ maker -h
Argument "2.53_01" isn't numeric in numeric ge (>=) at /usr/local/lib/x86_64-linux-gnu/perl/5.22.1/forks.pm line 1570.
MAKER version 2.31.9
Usage:
maker [options] <maker_opts> <maker_bopts> <maker_exe>
Description:
MAKER is a program that produces gene annotations in GFF3 format using
evidence such as EST alignments and protein homology. MAKER can be used to
produce gene annotations for new genomes as well as update annotations
from existing genome databases.
The three input arguments are control files that specify how MAKER should
behave. All options for MAKER should be set in the control files, but a
few can also be set on the command line. Command line options provide a
convenient machanism to override commonly altered control file values.
MAKER will automatically search for the control files in the current
working directory if they are not specified on the command line.
Input files listed in the control options files must be in fasta format
unless otherwise specified. Please see MAKER documentation to learn more
about control file configuration. MAKER will automatically try and
locate the user control files in the current working directory if these
arguments are not supplied when initializing MAKER.
It is important to note that MAKER does not try and recalculated data that
it has already calculated. For example, if you run an analysis twice on
the same dataset you will notice that MAKER does not rerun any of the
BLAST analyses, but instead uses the blast analyses stored from the
previous run. To force MAKER to rerun all analyses, use the -f flag.
MAKER also supports parallelization via MPI on computer clusters. Just
launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support must be
configured during the MAKER installation process for this to work though
Options:
-genome|g <file> Overrides the genome file path in the control files
-RM_off|R Turns all repeat masking options off.
-datastore/ Forcably turn on/off MAKER's two deep directory
nodatastore structure for output. Always on by default.
-old_struct Use the old directory styles (MAKER 2.26 and lower)
-base <string> Set the base name MAKER uses to save output files.
MAKER uses the input genome file name by default.
-tries|t <integer> Run contigs up to the specified number of tries.
-cpus|c <integer> Tells how many cpus to use for BLAST analysis.
Note: this is for BLAST and not for MPI!
-force|f Forces MAKER to delete old files before running again.
This will require all blast analyses to be rerun.
-again|a recaculate all annotations and output files even if no
settings have changed. Does not delete old analyses.
-quiet|q Regular quiet. Only a handlful of status messages.
-qq Even more quiet. There are no status messages.
-dsindex Quickly generate datastore index file. Note that this
will not check if run settings have changed on contigs
-nolock Turn off file locks. May be usful on some file systems,
but can cause race conditions if running in parallel.
-TMP Specify temporary directory to use.
-CTL Generate empty control files in the current directory.
-OPTS Generates just the maker_opts.ctl file.
-BOPTS Generates just the maker_bopts.ctl file.
-EXE Generates just the maker_exe.ctl file.
-MWAS <option> Easy way to control mwas_server for web-based GUI
options: STOP
START
RESTART
-version Prints the MAKER version.
-help|? Prints this usage statement.

Step 5. Create control files that tell MAKER-P what to do. Three files are required:

maker_opts.ctl - Gives location of input files (genome and evidence) and sets options that affect MAKER-P behavior
maker_exe.ctl - Gives path information for the underlying executables.
maker_bopts.ctl - Sets parameters for filtering BLAST and Exonerate alignment results

To create these files run the maker command with the -CTL flag. Verify with ls:

$ maker -CTL
Argument "2.53_01" isn't numeric in numeric ge (>=) at /usr/local/lib/x86_64-linux-gnu/perl/5.22.1/forks.pm line 1570.
$ ls
maker_bopts.ctl  maker_exe.ctl  maker_opts.ctl  test_data

The "maker_exe.ctl" file is automatically generated with the correct paths to executables and does not need to be modified, except for the path to the SNAP executable. Run the following command to fix the snap path:

$ sed 's|snap=/usr/bin/snap|snap=/opt/snap|' maker_exe.ctl > temp && mv temp maker_exe.ctl
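
After editing, it is worth confirming that every path in "maker_exe.ctl" points at a real executable. The sketch below is a hypothetical helper (check_exe_paths is not a MAKER script) and assumes the file uses the name=path #comment layout shown in this tutorial.

```shell
# Report whether each path set in a maker_exe.ctl-style file is executable.
check_exe_paths() {
  grep -v '^#' "$1" | while IFS='=' read -r key rest; do
    path=${rest%%#*}                                   # drop the trailing comment
    path=$(printf '%s' "$path" | tr -d '[:space:]')    # trim whitespace
    if [ -z "$key" ] || [ -z "$path" ]; then continue; fi
    if [ -x "$path" ]; then echo "OK $key"; else echo "MISSING $key $path"; fi
  done
}
```

Calling check_exe_paths maker_exe.ctl prints one line per configured tool; options left blank are skipped, and any MISSING line indicates a path to correct before starting the run.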

The "maker_bopts.ctl" file is automatically generated with reasonable default parameters and also does not need to be modified unless you want to experiment with optimization of these parameters.
The automatically generated "maker_opts.ctl" file does need to be modified in order to specify the genome file and evidence files to be used as input. Several text editors are available, including emacs, nano, and gedit.

Note: If pressed for time, a pre-edited version of the "maker_opts.ctl" file is staged in /opt/maker_test_data/. Delete the current file and copy the staged version here (see below). Then skip to Step 6.
$ rm maker_opts.ctl
$ cp /opt/maker_test_data/maker_opts.ctl .

$ nano maker_opts.ctl

Here are the sections of the "maker_opts.ctl" file you need to edit.

Do not put any spaces between the option name, the equal sign, and the value
The input files can be in the same directory as the "maker_opts.ctl" file; if they are in other directories, make sure you give the correct relative path
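
The spacing rule above is easy to violate when editing by hand. A minimal check, assuming option names consist of letters, digits, and underscores (the function name lint_ctl_spaces is hypothetical), could look like this:

```shell
# List lines in a control file that have whitespace around the equal sign.
# Lines such as "est= #comment" (an empty value followed by a comment) are not flagged.
lint_ctl_spaces() {
  grep -nE '^[A-Za-z0-9_]+[[:space:]]+=|^[A-Za-z0-9_]+=[[:space:]]+[^#[:space:]]' "$1"
}
```

Running lint_ctl_spaces maker_opts.ctl prints the offending lines with their line numbers and prints nothing when the file is clean.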

This section pertains to specifying the genome assembly to be annotated and setting organism type:

#-----Genome (these are always required)
genome=./test_data/test_genome.fasta #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

The following section pertains to EST and other mRNA expression evidence. Here we are only using same species data, but one could specify data from a related species using the altest parameter. With RNA-seq data aligned to your genome by Cufflinks or Tophat, one could use maker auxiliary scripts (cufflinks2gff3 and tophat2gff3) to generate GFF3 files and specify these using the est_gff parameter:

#-----EST Evidence (for best results provide a file for at least one)
est=./test_data/mRNA.fasta #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format

The following section pertains to protein sequence evidence. Here we are using previously annotated protein sequences. Another option would be to use SwissProt or other database:

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=./test_data/msu-irgsp-proteins.fasta  #protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file

This next section pertains to repeat identification:

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org= #select a model organism for RepBase masking in RepeatMasker
rmlib=./test_data/plant_repeats.fasta #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

Various programs for ab initio gene prediction can be specified in the next section. Here we use SNAP with an HMM trained on rice, turn on tRNA annotation, and specify the path to the rRNA file that Snoscan uses to find snoRNAs:

#-----Gene Prediction
snaphmm=./test_data/O.sativa.hmm #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species= #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=1 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no
trna=1 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna=./test_data/Os-rRNA.fa #rRNA file to have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
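
The edits in the sections above can also be applied without opening an editor. The following is a sketch under stated assumptions: set_opt is a hypothetical helper, it relies on each option sitting at the start of a line in the name=value #comment form shown above, and the value must not contain the characters |, & or \, which are special to the sed expression.

```shell
# Set one option in a MAKER control file, keeping the trailing "#comment".
# Usage: set_opt <control_file> <option> <value>
set_opt() {
  sed "s|^$2=[^#]*|$2=$3 |" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}
```

For example, set_opt maker_opts.ctl genome ./test_data/test_genome.fasta followed by set_opt maker_opts.ctl est ./test_data/mRNA.fasta reproduces two of the manual edits. This mirrors the sed-to-temporary-file pattern used for the snap path earlier, which avoids depending on sed's in-place flag.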

Step 6. Run MAKER-P

Make sure you are in the "maker_test_data" directory and all of your files are in place. Perform these steps to check:

$ pwd
/home/upendra_35/maker_test_data
$ ls
maker_bopts.ctl  maker_exe.ctl  maker_opts.ctl test_data
$ ls test_data/
mRNA.fasta  msu-irgsp-proteins.fasta  O.sativa.hmm  Os-rRNA.fa  plant_repeats.fasta  test_genome_chr1.fasta  test_genome.fasta

Starting the MAKER-P run is as simple as entering the command maker. Because your maker control files are in the present directory, you do not have to explicitly specify these in the command; they will be found automatically.
MAKER-P automatically outputs thousands of lines of STDERR, reporting on its progress and producing warnings and errors if they arise. It is good practice to run the command in the background using the nohup command, which captures this output in the nohup.out file.
Another useful practice is to employ the unix time command to report statistics on how long the run took to complete. The output of time will appear at the end of the captured standard error file.
Putting all of this together we type the following command to start MAKER-P:

Recommended

If you have launched an Atmosphere instance with multiple CPUs, you can distribute MAKER-P across the processors using the mpiexec command. This example assumes that you have launched a "medium" instance with 4 CPUs. If you launched a bigger instance, change the value accordingly to take advantage of all the CPUs on the instance

$ nohup time mpiexec -n 4 maker &
nohup: ignoring input and appending output to 'nohup.out'
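
Rather than hard-coding the process count, the number of online CPUs can be read from the system. This sketch assumes getconf _NPROCESSORS_ONLN is supported on the instance (it is on typical Linux images). It only prints the resulting command so you can review it before running it yourself.

```shell
# Count the online CPUs and print the matching mpiexec command for review.
ncpu=$(getconf _NPROCESSORS_ONLN)
echo "nohup time mpiexec -n $ncpu maker &"
```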

MAKER-P should now be running. For this example, it usually takes about 12 minutes to complete (50x faster than a non-MPI run). Monitor progress and check for errors by examining the nohup.out file. You will know MAKER-P is finished when nohup.out announces "Maker is now finished!!!". Use the tail command to look at the last 10 lines of nohup.out:

$ tail nohup.out
Maker is now finished!!!
Start_time: 1502241909
End_time:   1502242602
Elapsed:    693
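
The completion check can be scripted. The sketch below assumes the log ends with the "Maker is now finished!!!" message and an integer "Elapsed:" line in seconds, as in the output above; maker_run_summary is a hypothetical helper name.

```shell
# Report whether a MAKER log shows completion and convert "Elapsed" to min/s.
maker_run_summary() {
  if grep -q 'Maker is now finished!!!' "$1"; then
    secs=$(awk '$1 == "Elapsed:" { s = $2 } END { print s + 0 }' "$1")
    echo "finished in $((secs / 60)) min $((secs % 60)) s"
  else
    echo "not finished"
  fi
}
```

Calling maker_run_summary nohup.out while the job is still running prints "not finished", so it can be rerun periodically.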

Step 7. Examine MAKER-P output.

Output data appears in a new directory called "test_genome.maker.output". Move to that directory and examine its contents:

$ cd test_genome.maker.output/
$ ls -lh
-rw-r--r--  1 upendra_35 iplant-everyone 1.4K Aug  8 22:40 maker_bopts.log
-rw-r--r--  1 upendra_35 iplant-everyone 1.4K Aug  8 22:40 maker_exe.log
-rw-r--r--  1 upendra_35 iplant-everyone 4.6K Aug  8 22:40 maker_opts.log
drwxr-xr-x  5 upendra_35 iplant-everyone 4.0K Aug  8 22:40 mpi_blastdb
-rw-r--r--  1 upendra_35 iplant-everyone    0 Aug  8 22:40 seen.dbm
drwxr-xr-x 14 upendra_35 iplant-everyone 4.0K Aug  8 23:04 test_genome_datastore
-rw-r--r--  1 upendra_35 iplant-everyone 8.0K Aug  8 22:40 test_genome.db
-rw-r--r--  1 upendra_35 iplant-everyone 1.2K Aug  8 23:06 test_genome_master_datastore_index.log

The maker_opts.log, maker_exe.log, and maker_bopts.log files are logs of the control files used for this run of MAKER.
The mpi_blastdb directory contains FASTA indexes and BLAST database files created from the input EST, protein, and repeat databases.
test_genome_master_datastore_index.log contains information on both the run status of individual contigs and information on where individual contig data is stored.
The test_genome_datastore directory contains a set of subfolders, each containing the final MAKER output for individual contigs from the genomic fasta file.

Check test_genome_master_datastore_index.log to see if there were any failures:

$ cat test_genome_master_datastore_index.log
Chr1    test_genome_datastore/41/30/Chr1/       STARTED
Chr1    test_genome_datastore/41/30/Chr1/       FINISHED
Chr10   test_genome_datastore/7C/72/Chr10/      STARTED
Chr10   test_genome_datastore/7C/72/Chr10/      FINISHED
Chr11   test_genome_datastore/1E/AA/Chr11/      STARTED
Chr11   test_genome_datastore/1E/AA/Chr11/      FINISHED
Chr12   test_genome_datastore/1B/FA/Chr12/      STARTED
Chr12   test_genome_datastore/1B/FA/Chr12/      FINISHED
Chr2    test_genome_datastore/E9/36/Chr2/       STARTED
Chr2    test_genome_datastore/E9/36/Chr2/       FINISHED
Chr3    test_genome_datastore/CC/EF/Chr3/       STARTED
Chr3    test_genome_datastore/CC/EF/Chr3/       FINISHED
Chr4    test_genome_datastore/A3/11/Chr4/       STARTED
Chr4    test_genome_datastore/A3/11/Chr4/       FINISHED
Chr5    test_genome_datastore/8A/9B/Chr5/       STARTED
Chr5    test_genome_datastore/8A/9B/Chr5/       FINISHED
Chr6    test_genome_datastore/13/44/Chr6/       STARTED
Chr6    test_genome_datastore/13/44/Chr6/       FINISHED
Chr7    test_genome_datastore/91/B7/Chr7/       STARTED
Chr7    test_genome_datastore/91/B7/Chr7/       FINISHED
Chr8    test_genome_datastore/9A/9E/Chr8/       STARTED
Chr8    test_genome_datastore/9A/9E/Chr8/       FINISHED
Chr9    test_genome_datastore/87/90/Chr9/       STARTED
Chr9    test_genome_datastore/87/90/Chr9/       FINISHED

All completed. Other possible status entries include:

FAILED - indicates a failed run on this contig, MAKER will retry these
RETRY - indicates that MAKER is retrying a contig that failed
SKIPPED_SMALL - indicates the contig was too short to annotate (minimum contig length is specified in maker_opts.ctl)
DIED_SKIPPED_PERMANENT - indicates a failed contig that MAKER will not attempt to retry (number of times to retry a contig is specified in maker_opts.ctl)
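
For genomes with many contigs, reading the index log by eye is impractical. The following sketch (unfinished_contigs is a hypothetical helper) keeps the most recent status for each contig and prints only those that did not end as FINISHED. It assumes the three-column layout shown above.

```shell
# Print contigs whose most recent status in the datastore index is not FINISHED.
unfinished_contigs() {
  awk 'NF >= 3 { last[$1] = $NF }
       END { for (c in last) if (last[c] != "FINISHED") print c "\t" last[c] }' "$1" | sort
}
```

An empty result from unfinished_contigs test_genome_master_datastore_index.log means every contig completed.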

The actual output data is stored under test_genome_datastore in a nested directory structure.

A typical set of outputs for a contig looks like this (only a few lines of the output are shown below):

$ ls test_genome_datastore/*/*/*
test_genome_datastore/13/44/Chr6:
Chr6.gff                                                Chr6.maker.proteins.fasta                 Chr6.maker.snoscan.transcripts.fasta  theVoid.Chr6
Chr6.maker.non_overlapping_ab_initio.proteins.fasta     Chr6.maker.snap_masked.proteins.fasta     Chr6.maker.transcripts.fasta
Chr6.maker.non_overlapping_ab_initio.transcripts.fasta  Chr6.maker.snap_masked.transcripts.fasta  run.log

The Chr6.gff file is in GFF3 format and contains the maker gene models and underlying evidence such as repeat regions, alignment data, and ab initio gene predictions, as well as fasta sequence. Having all of these data in one file is important to enable visualization of the called gene models and underlying evidence, especially using tools like Apollo which enable manual editing and curation of gene models.
The fasta files Chr6.maker.proteins.fasta and Chr6.maker.transcripts.fasta contain the protein and transcript sequences for the final MAKER gene calls.
The Chr6.maker.non_overlapping_ab_initio.proteins.fasta and Chr6.maker.non_overlapping_ab_initio.transcripts.fasta files contain ab initio models that do not overlap MAKER genes and were rejected for lack of support.
The Chr6.maker.snap_masked.proteins.fasta and Chr6.maker.snap_masked.transcripts.fasta files contain the initial SNAP-predicted models not further processed by MAKER.

The output directory theVoid.Chr6 contains raw output data from all of the pipeline steps. One useful file found here is the repeat-masked version of the contig, query.masked.fasta.
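
Because the per-contig files sit in hashed subdirectories, the index log is the easiest way to locate them. The sketch below (finished_protein_fastas is a hypothetical helper) builds the expected path of each finished contig's protein fasta file from the log, assuming the naming pattern shown in the listing above. A file may be absent for a contig without gene calls, so check that each path exists before using it.

```shell
# List the per-contig protein FASTA path for every FINISHED contig in the index.
# Paths are relative to the *.maker.output directory that holds the index log.
finished_protein_fastas() {
  awk 'NF >= 3 && $NF == "FINISHED" { print $2 $1 ".maker.proteins.fasta" }' "$1"
}
```

The bundled fasta_merge script listed in /opt/maker/bin is intended for combining these per-contig files; consult its usage statement for the options it supports.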

Step 8: Merge the GFF files to get a combined GFF3 file that can be used for all downstream analyses

$ gff3_merge -d test_genome_master_datastore_index.log

-d The location of the MAKER datastore index log file.

By default, the base name of the gff3_merge output is test_genome.all, but you can set an alternate base name for the output files using the "-o" option

The final output from gff3_merge is "test_genome.all.gff"

$ head test_genome.all.gff
##gff-version 3
Chr6    maker   gene    43764   46139   .   -   .   ID=maker-Chr6-snap-gene-0.3;Name=maker-Chr6-snap-gene-0.3
Chr6    maker   mRNA    43764   46139   .   -   .   ID=maker-Chr6-snap-gene-0.3-mRNA-1;Parent=maker-Chr6-snap-gene-0.3;Name=maker-Chr6-snap-gene-0.3-mRNA-1;_AED=0.12;_eAED=0.50;_QI=64|0|0|1|0|0.33|3|0|76
Chr6    maker   exon    43764   43846   .   -   .   ID=maker-Chr6-snap-gene-0.3-mRNA-1:exon:2;Parent=maker-Chr6-snap-gene-0.3-mRNA-1
Chr6    maker   exon    44833   44896   .   -   .   ID=maker-Chr6-snap-gene-0.3-mRNA-1:exon:1;Parent=maker-Chr6-snap-gene-0.3-mRNA-1
Chr6    maker   exon    45992   46139   .   -   .   ID=maker-Chr6-snap-gene-0.3-mRNA-1:exon:0;Parent=maker-Chr6-snap-gene-0.3-mRNA-1
Chr6    maker   five_prime_UTR  46076   46139   .   -   .   ID=maker-Chr6-snap-gene-0.3-mRNA-1:five_prime_utr;Parent=maker-Chr6-snap-gene-0.3-mRNA-1
Chr6    maker   CDS 45992   46075   .   -   0   ID=maker-Chr6-snap-gene-0.3-mRNA-1:cds;Parent=maker-Chr6-snap-gene-0.3-mRNA-1
Chr6    maker   CDS 44833   44896   .   -   0   ID=maker-Chr6-snap-gene-0.3-mRNA-1:cds;Parent=maker-Chr6-snap-gene-0.3-mRNA-1
Chr6    maker   CDS 43764   43846   .   -   2   ID=maker-Chr6-snap-gene-0.3-mRNA-1:cds;Parent=maker-Chr6-snap-gene-0.3-mRNA-1
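
A quick summary of the merged file helps confirm that the run produced gene models. The two helpers below are a sketch with hypothetical names. They assume a tab-delimited GFF3 file in which MAKER annotations have "maker" in column 2 and mRNA features carry an _AED attribute, as in the excerpt above. AED (Annotation Edit Distance) ranges from 0 to 1, with lower values indicating better agreement with the evidence.

```shell
# Count MAKER features by type, stopping at the embedded FASTA section if present.
count_maker_features() {
  awk -F'\t' '/^##FASTA/ { exit }
              !/^#/ && $2 == "maker" { n[$3]++ }
              END { for (t in n) print t "\t" n[t] }' "$1" | sort
}

# Print mRNA IDs whose _AED score is at or below a cutoff (default 0.5).
mrna_by_aed() {
  awk -F'\t' -v max="${2:-0.5}" '
    /^##FASTA/ { exit }
    $2 == "maker" && $3 == "mRNA" && match($9, /_AED=[0-9.]+/) {
      aed = substr($9, RSTART + 5, RLENGTH - 5) + 0
      id = $9; sub(/;.*/, "", id); sub(/^ID=/, "", id)
      if (aed <= max) print id "\t" aed
    }' "$1"
}
```

For example, count_maker_features test_genome.all.gff reports how many gene, mRNA, exon, and CDS features were called, and mrna_by_aed test_genome.all.gff 0.25 lists the better-supported transcripts.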

If you do not want to run ab initio gene predictions, you can run gff3_merge with the -n option, which does not print the fasta sequence in the footer. Otherwise, run gff3_merge as above and proceed to the ab initio gene prediction step.

Part 3: Running MAKER-P on the volumes (Recommended)

Since the disk space of the m1 medium instance (60 GB) may not be sufficient for most MAKER runs, it is recommended to run MAKER-P on a volume

Step 1: Create a volume

Click the "New" button in the project and select "Create Volume". Enter the volume name (MAKER-P-Vol), the volume size needed (200 GB), and the provider (Marana), then click "Create Volume"


Step 2: Attach the created volume to the MAKER-P instance


Step 3: Mount the volume to a specified drive.

Once you have logged in to your MAKER-P instance using the Web Shell or ssh, you must make sure the attached volume is mounted before you can access it

# Check whether the volume is mounted. As you can see, the volume (/dev/vdc) is mounted on /vol_c

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            3.9G     0  3.9G   0% /dev
tmpfs           799M  8.7M  790M   2% /run
/dev/vda1        20G  8.4G   11G  44% /
tmpfs           3.9G  1.1M  3.9G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/vdb         79G   56M   75G   1% /scratch
tmpfs           799M     0  799M   0% /run/user/14135
/dev/vdc        197G   60M  187G   1% /vol_c
 
# You can cd into the /vol_c and run all your analyses there from now on
 
$ cd /vol_c
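
Before copying data and starting a run on the volume, you may want to confirm how much space is free. A minimal sketch using the POSIX output format of df follows; avail_kb is a hypothetical helper name.

```shell
# Print available space (in KB) on the file system holding a path.
avail_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}
```

For example, avail_kb /vol_c reports the free space on the mounted volume, which can be compared against the size of your input files plus expected output.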

Moving data from CyVerse Datastore using iCommands (Optional)

iCommands is a collection of commands for Linux and Mac OS operating systems that are used in the iRODS system to interact with the CyVerse Data Store. Many commands are very similar to Unix utilities. For example, to list files and directories, in Linux you use ls, but in iCommands you use ils.

While iCommands are great for all transfers and for automating tasks via scripts, they are the best choice for large files (2-100 GB each) and for bulk file transfers (many small files). For a comparison of the different methods of uploading and downloading data items, see Downloading and Uploading Data.

iCommands can be used by CyVerse account users to download files that have been shared by other users and to upload files to the Data Store, as well as add metadata, change permissions, and more. Commonly used iCommands are listed below. Follow the instructions on Setting Up iCommands for how to download and configure iCommands for your operating system.

A CyVerse account is not required to download a public data file via iCommands. To see instructions just for public data download with iCommands, see the iCommands section on Downloading Data Files Without a User Account.

Before you begin, you may want to watch a CyVerse video about iCommands.

For configuring iCommands and for the commands used to move data in and out of the Data Store, refer to the Setting Up iCommands instructions.

You can use a script to back up data from an instance to the Data Store. Follow the instructions at https://pods.iplantcollaborative.org/wiki/display/atmman/Backing+Up+and+Restoring+Your+Data+to+the+Data+Store
