...
...
...
...
...
...
...
...
...
...
...
...
...
...
MAKER Genome Annotation and gene editing using Apollo
Rationale and background:
MAKER-P is a flexible and scalable genome annotation pipeline that automates the many steps necessary for the detection of protein-coding genes (Campbell et al. 2013). MAKER identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, and automatically synthesizes these data into gene annotations with evidence-based quality indices. MAKER was developed by the Yandell Lab and is described in several publications (Cantarel et al. 2008; Holt & Yandell 2011). Additional background is available in the MAKER Tutorial at GMOD, which is highly recommended reading.
Apollo is the first instantaneous, collaborative genomic annotation editor available on the web. With Web Apollo, researchers can use any of the common browsers (for example, Chrome or Firefox) to jointly analyze and precisely describe the features of a genome in real time, whether they are in the same room or working from opposite sides of the world. The task of manual curation is spread out among many hands and eyes, enabling the creation of virtual networks of researchers linked by a common interest in a particular organism or population.
This tutorial takes users through the following steps:
- Running MAKER on the Jetstream cloud
- Running downstream quality control tools on the predicted genes
- Running the Apollo gene editing tool to get highly curated gene annotations
Considerations
Sounds great, what do I need to get started?
- XSEDE account
- Later on, you can request a startup XSEDE allocation.
- Your data (or you can run example data)
What kind of data do I need?
- Mandatory requirements
  - Genome assembly (fasta file)
  - Organism type
    - Eukaryotic (default, set as: organism_type=eukaryotic)
    - Prokaryotic (set as: organism_type=prokaryotic)
- Additional data that can be used to improve the annotation (highly recommended)
  - RNA evidence (at least one of them is needed)
    - Assembled mRNA-seq transcriptome (fasta file)
    - Expressed sequence tag (EST) data (fasta file)
    - Aligned EST or transcriptome GFF3 from your organism
    - Aligned EST or transcriptome GFF3 from a closely related organism
  - Protein evidence
    - Protein sequence file in fasta format (e.g. from multiple organisms)
    - Protein GFF (aligned protein homology evidence from an external GFF3 file)
- For this particular tutorial we will use maize-specific test data.
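Before starting a run, it can help to confirm that the genome and evidence files listed above really are non-empty FASTA files. The following is a minimal sketch, not part of MAKER itself; the function names `count_fasta_records` and `check_fasta` and the file name in the usage comment are hypothetical.

```shell
# Count sequence records (header lines starting with '>') in a FASTA file.
# '|| true' keeps the exit status clean when grep finds no headers.
count_fasta_records() {
    grep -c '^>' "$1" || true
}

# Report whether a file looks like usable FASTA input:
# it must exist, be non-empty, and contain at least one header line.
check_fasta() {
    if [ ! -s "$1" ]; then
        echo "MISSING_OR_EMPTY $1"
        return 1
    fi
    fasta_records=$(count_fasta_records "$1")
    if [ "$fasta_records" -eq 0 ]; then
        echo "NO_RECORDS $1"
        return 1
    fi
    echo "OK $1 records=$fasta_records"
}

# Usage (hypothetical file name):
#   check_fasta genome.fasta
```

This only checks for header lines; it does not validate the sequence alphabet or line wrapping.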
What kind of resources will I need for my project?
- Enough storage space on the MAKER-P Jetstream instance for both input and output files
- Creating and mounting an external volume on the running MAKER-P instance is recommended
- Enough AUs to run your computation
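One rough way to check the storage point above is to compare the free space on the target volume with the size of the input data. The sketch below uses only standard `df`, `du`, and `awk`; the function names are hypothetical, and the multiplier in the usage comment is an arbitrary assumption, not a MAKER requirement.

```shell
# Free space, in KiB, on the filesystem that holds the given path.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Total size, in KiB, of a file or directory.
size_kb() {
    du -sk "$1" | awk '{ print $1 }'
}

# Succeed if the free space at path $2 is at least $3 times the size of $1.
has_room_for() {
    need_kb=$(( $(size_kb "$1") * $3 ))
    [ "$(free_kb "$2")" -ge "$need_kb" ]
}

# Usage (paths and the factor of 10 are assumptions for illustration):
#   has_room_for ./input_data /vol_b 10 || echo "Consider attaching a larger volume"
```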
Part 1: Connect to an instance of a MAKER Jetstream Image (virtual machine)
Step 1. Go to https://use.jetstream-cloud.org/application and log in with your XSEDE credentials.
...
We will use iCommands, a service from iRODS, to transfer evidence data from the CyVerse Data Commons repository. iCommands is a collection of commands for Linux and Mac OS operating systems that are used in the iRODS system to interact with the CyVerse Data Store. iCommands can be used to transfer large amounts of data from the CyVerse Data Store to the running Jetstream instance. A complete list of iCommands and their usage is available in the iCommands documentation.
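If you are unsure whether iCommands is installed on your instance, a quick check along the following lines can help. This is a minimal sketch; the helper name `missing_commands` is hypothetical, and it only tests whether the named programs are on your PATH.

```shell
# Print the name of every command in the argument list that is not on PATH.
# Prints nothing when all of them are available.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Usage: check the three iCommands programs used in this tutorial.
#   missing_commands iinit ils iget
```

An empty result means all three programs were found.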
The first time you use iCommands, you must initiate the connection to iRODS.
In a terminal window, enter iinit to initialize iCommands and your Data Store connection. For example, here's what you would do if your iRODS user name is cyverse-user:
Code Block
kap12@js-156-187:/vol_b/run_data$ iinit
One or more fields in your iRODS environment file (irods_environment.json) are missing; please enter them.
Enter the host name (DNS) of the server to connect to: data.cyverse.org
Enter the port number: 1247
Enter your irods user name: cyverse-user
Enter your irods zone: iplant
Those values will be added to your environment file (for use by other i-commands) if the login succeeds.
Enter your current iRODS password:
kap12@js-156-187:/vol_b/run_data$
Once iinit has finished, type ils to check that iCommands is working. You should see your home directory at /iplant/home/your_user_name
Download the evidence set required for annotation
Code Block
$ iget -PVr /iplant/home/shared/commons_repo/curated/MaizeCode_annotation_evidence_data_2017 .
$ mv MaizeCode_annotation_evidence_data_2017/* .
Part 3: Set up a MAKER run using the Terminal window
...
Code Block
test.all.gff - 'MAKER generated annotation file'
test.all.maker.augustus_masked.proteins.fasta
test.all.maker.augustus_masked.transcripts.fasta
test.all.maker.non_overlapping_ab_initio.proteins.fasta
test.all.maker.non_overlapping_ab_initio.transcripts.fasta
test.all.maker.proteins.fasta - 'MAKER generated proteins file'
test.all.maker.transcripts.fasta - 'MAKER generated transcripts file'
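A quick way to get a first impression of the output files listed above is to count the features of a given type in the GFF3 file. The sketch below assumes standard tab-separated GFF3 with the feature type in column 3; the function name `count_gff_features` is hypothetical.

```shell
# Count the features of a given type (column 3) in a GFF3 file.
# Comment and directive lines (starting with '#') are skipped.
count_gff_features() {
    awk -F'\t' -v type="$2" '
        $0 !~ /^#/ && $3 == type { n++ }
        END { print n + 0 }
    ' "$1"
}

# Usage with the file names from the listing above:
#   count_gff_features test.all.gff gene
#   count_gff_features test.all.gff mRNA
#   grep -c '^>' test.all.maker.proteins.fasta    # number of predicted proteins
```

The mRNA count and the number of records in the proteins file would normally be expected to agree; a mismatch is worth investigating.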
Part 4: Quality control of annotated genes
Once the MAKER run is finished, the next step is to filter out misannotated gene models and gene models with little supporting evidence. The section below describes how to filter out such gene models.
...