MAKER-3.0 with Apollo on JetStream

MAKER Genome Annotation and gene editing using Apollo

Rationale and background:

MAKER-P is a flexible and scalable genome annotation pipeline that automates the many steps necessary for the detection of protein coding genes (Campbell et al. 2013). MAKER identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, and automatically synthesizes these data into gene annotations having evidence-based quality indices. MAKER was developed by the Yandell Lab and is described in several publications (Cantarel et al. 2008; Holt & Yandell 2011). Additional background is available at the MAKER Tutorial at GMOD and is highly recommended reading.

Apollo is the first instantaneous, collaborative genomic annotation editor available on the web. With Web Apollo researchers can use any of the common browsers (for example, Chrome or Firefox) to jointly analyze and precisely describe the features of a genome in real time, whether they are in the same room or working from opposite sides of the world. The task of manual curation is spread out among many hands and eyes, enabling the creation of virtual research networks of researchers linked by a common interest in a particular organism or population.

This tutorial will take users through steps of:

  1. Running MAKER on Jetstream cloud
  2. Running downstream quality control tools on the predicted genes
  3. Running Apollo gene editing tool to get highly curated gene annotations

Considerations

Sounds great, what do I need to get started?

  1. XSEDE account
  2. Later on, you can request a startup XSEDE allocation
  3. Your data (or you can run example data)

What kind of data do I need?

  1. Mandatory requirements
    1. Genome assembly (fasta file)
    2. Organism type
      1. Eukaryotic (default, set as: organism_type=eukaryotic)
      2. Prokaryotic (set as: organism_type=prokaryotic)
  2. Additional data that can be used to improve the annotation (Highly recommended)
    1. RNA evidence (at least one of them is needed)
      1. Assembled mRNA-seq transcriptome (fasta file)
      2. Expressed sequence tags (ESTs) data (fasta file)
      3. Aligned EST or transcriptome GFF3 from your organism
      4. Aligned EST or transcriptome GFF3 from a closely related organism
    2. Protein evidence
      1. Protein sequence file in fasta format (i.e. from multiple organisms)
      2. Protein GFF (aligned protein homology evidence from an external GFF3 file)

  3. For this particular tutorial we will use maize specific test data.

What kind of resources will I need for my project?

  1. Enough storage space on the MAKER-P Jetstream instance for both input and output files
    1. Creating and mounting an external volume to the running MAKER-P instance would be recommended
  2. Enough AUs to run your computation

Part 1: Connect to an instance of a MAKER Jetstream Image (virtual machine)

Step 1. Go to https://use.jetstream-cloud.org/application and log in with your XSEDE credentials.

 

Step 2. Click "Create New Project" in the Project tab at the top, then enter the name of the project and a brief description.


Step 3. Launch an instance from the selected image and name it as MAKER-run

Once the project has been created, open it, click the "New" button, select the "MAKER-P_v3" image, and then click "Launch instance". In the next window (Basic Info):

  • name the instance as "MAKER-run" (don't worry if you forgot to name the instance at this point, as you can always modify the name of the instance later)
  • set base image version as "1.0" (default)
  • leave the project as it is or change to a different project if needed
  • select "Jetstream - Indiana University" or "Jetstream - TACC" as Provider and click "Continue". Your choice of provider will depend on the resources you have available (AUs) and the needs of your instance
  • select "m1.medium" as Instance size (this is the minimum size required by the MAKER-P image) and click "Continue".

Step 4. As the instance is launched behind the scenes, you will get an update as it goes through each step.


Status updates during instance launch include Build-requesting launch, Build-networking, Build-spawning, Active-networking, and Active-deploying. Depending on the usage load on Jetstream, it can take anywhere from 2-5 minutes for an instance to become active. You can force a status check by using the refresh button in the Instance launch page or the refresh button on your browser. Once the instance becomes active, a virtual machine with the IP address provided will become available for you to connect to. This virtual machine will have all the necessary components to run MAKER-P.

Step 5: Create a volume

The m1.medium instance size (60GB disk space) selected for the MASTER instance of MAKER-P may not be sufficient for most MAKER runs, so it is recommended to run MAKER on an attached volume

5.1 Click the "New" button in the project and select "Create Volume". Enter the name of the volume, volume size (GB) needed and the provider (TACC or Indiana) and finally click "Create Volume" 

Attach the created volume to the MASTER instance

5.2 Click on the MAKER-P_v3 instance now

Jetstream provides web-shell, a web-based terminal, for accessing your VM at the command line level once it has been deployed.

However, you might find that you wish to access your VM via SSH if you’ve provisioned it with a routable IP number. For SSH access, you can create (or copy) SSH public-keys for your non-Jetstream computer that will allow it to access Jetstream then deposit those keys in your Atmosphere settings. More instructions can be found here 

$ ssh <username>@<ip.address>

5.3 Mount the volume to a specified drive. 

Once you have logged in to your MASTER instance using webshell or ssh, change the ownership of the mount location as below
 
# Change the ownership and group permission on the mount location
$ sudo chown -hR $USER /vol_b
$ sudo chgrp -hR $USER /vol_b

# cd into /vol_b and then run MAKER from there
$ cd /vol_b

# create a directory named run_data
$ mkdir run_data
$ cd run_data
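
Before copying data onto the volume, it can help to confirm how much space is actually free. The following is a minimal sketch and not part of the MAKER image: the function names free_gb and has_free_gb are hypothetical, and the 100 GB threshold in the usage comment is an arbitrary placeholder.

```shell
# free_gb DIR: print the free space, in whole GB, on the filesystem holding DIR.
free_gb() {
    # df -P prints one line per filesystem; -k reports 1 KB blocks.
    # Column 4 of the data line is the available space.
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / (1024 * 1024) }'
}

# has_free_gb DIR MIN_GB: succeed if DIR has at least MIN_GB of free space.
has_free_gb() {
    [ "$(free_gb "$1")" -ge "$2" ]
}

# Example (the threshold is a placeholder, not a MAKER requirement):
#   has_free_gb /vol_b 100 || echo "less than 100 GB free on /vol_b" >&2
```

How much space a run needs depends on the genome and evidence sizes, so the threshold should be chosen per project.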

Step 6: Set up iCommands for data transfer

We will use iCommands, a command-line client from iRODS, to transfer the evidence data from the CyVerse Data Commons repository. iCommands is a collection of commands for Linux and Mac OS operating systems that are used in the iRODS system to interact with the CyVerse Data Store. iCommands can be used to transfer large amounts of data from the CyVerse Data Store to the running Jetstream instance. A complete list of iCommands and their usage is here

The first time you use iCommands, you must initiate the connection to iRODS.

 

  1. In a terminal window, enter iinit to initialize iCommands and your Data Store connection. For example, here's what you would do if your iRODS user name is cyverse-user:

    kap12@js-156-187:/vol_b/run_data$ iinit
    One or more fields in your iRODS environment file (irods_environment.json) are
    missing; please enter them.
    Enter the host name (DNS) of the server to connect to: data.cyverse.org
    Enter the port number: 1247
    Enter your irods user name: cyverse-user
    Enter your irods zone: iplant
    Those values will be added to your environment file (for use by
    other i-commands) if the login succeeds.
    
    Enter your current iRODS password:
    kap12@js-156-187:/vol_b/run_data$

     

     

  2. Once iinit has finished, type ils to check that iCommands is working. You should see your home directory at /iplant/home/your_user_name

  3. Download the evidence set required for annotation

    $ iget -PVr /iplant/home/shared/commons_repo/curated/MaizeCode_annotation_evidence_data_2017 .
    $ mv MaizeCode_annotation_evidence_data_2017/* .

 

Part 3: Set up a MAKER run using the Terminal window

 

Step 1. Get oriented. You will find your test data within your mounted volume "/vol_b/run_data". List its contents with the ls command:
$ ls /vol_b/run_data
genbank_ests_ATCG.fasta           jiao_w22_all.fasta                            martin_nature_seedling_transcriptome_longer_than_300bp.fa  wang_isoseq.fasta
genbank_ests.fasta                law_trinity_longer_than_300bp_cdhit_99.fasta  protein                                                    Wessler-Bennetzen_2.fasta
genbank_fl_cdnas_ATCG_only.fasta  MaizeCode_annotation_evidence_data_2017       Read_me.txt

The files listed below will be used as evidence datasets for running MAKER-P annotation on maize genomes


1) genbank_ests.fasta: These are maize ESTs downloaded from genbank and identified using this search command: (EST[Keyword]) AND maize[Organism]
2) genbank_ests_ATCG.fasta: These are full length cDNAs downloaded from genbank and identified using this search command: (FLI-CDNA[Keyword]) AND maize[Organism]
3) wang_isoseq.fasta: These are transcripts built from isoseq data. Published here:
    Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing.
    Wang B, Tseng E, Regulski M, Clark TA, Hon T, Jiao Y, Lu Z, Olson A, Stein JC, Ware D.
    Nat Commun. 2016 Jun 24;7:11708. doi: 10.1038/ncomms11708.
    PMID: 27339440
4) Protein files: All of these data sets were downloaded from the gramene ftp site using these commands
    Sorghum: wget ftp://ftp.gramene.org/pub/gramene/release46/data/fasta/sorghum_bicolor/pep/Sorghum_bicolor.Sorbi1.27.pep.all.fa.gz
    Rice: wget ftp://ftp.gramene.org/pub/gramene/release46/data/fasta/oryza_sativa/pep/Oryza_sativa.IRGSP-1.0.27.pep.all.fa.gz
    Arabidopsis: wget ftp://ftp.gramene.org/pub/gramene/release46/data/fasta/arabidopsis_thaliana/pep/Arabidopsis_thaliana.TAIR10.27.pep.all.fa.gz
    Setaria: wget ftp://ftp.gramene.org/pub/gramene/release46/data/fasta/setaria_italica/pep/Setaria_italica.JGIv2.0.27.pep.all.fa.gz
    Brachypodium: wget ftp://ftp.gramene.org/pub/gramene/release46/data/fasta/brachypodium_distachyon/pep/Brachypodium_distachyon.v1.0.27.pep.all.fa.gz
5) Wessler-Bennetzen_2.fasta: This is the repeat library generated for the original B73 annotation. The helitrons were removed to prevent overmasking.
6) martin_nature_seedling_transcriptome_longer_than_300bp.fa: These are the assembled transcripts from a very high depth seedling transcriptome published here
    A near complete snapshot of the Zea mays seedling transcriptome revealed from ultra-deep sequencing.
    Martin JA, Johnson NV, Gross SM, Schnable J, Meng X, Wang M, Coleman-Derr D, Lindquist E, Wei CL, Kaeppler S, Chen F, Wang Z.
    Sci Rep. 2014 Mar 31;4:4519. doi: 10.1038/srep04519.
    PMID: 24682209
7) law_trinity_longer_than_300bp_cdhit_99.fasta: These are the trinity assemblies from the 95 RNAseq experiments used for annotation in this paper
    Automated update, revision, and quality control of the maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene models and identifies new genes.
    Law M, Childs KL, Campbell MS, Stein JC, Olson AJ, Holt C, Panchy N, Lei J, Jiao D, Andorf CM, Lawrence CJ, Ware D, Shiu SH, Sun Y, Jiang N, Yandell M.
    Plant Physiol. 2015 Jan;167(1):25-39. doi: 10.1104/pp.114.245027. Epub 2014 Nov 10.
    PMID: 25384563
    Transcripts less than 300bp have been removed and cdhit was run on the remaining sequences with a similarity threshold of .99
8) jiao_w22_all.fasta: These are trinity assembled W22 transcripts from the following tissues
    ear
    embryo
    endosperm
    kernel
    leaf
    root
    shoot
    tassel

Executables for running MAKER are located in the bin and exe directories of the MAKER installation. On this image the bin directory is /usr/local/maker/bin:

$ ls /usr/local/maker/bin/
AED_cdf_generator.pl   cufflinks2gff3  genemark_gtf2gff3  ipr_update_gff          maker2eval_gtf  maker_functional_fasta  map_data_ids   quality_filter.pl
cegma2zff              evaluator       gff3_merge         look_at_transcripts.pl  maker2jbrowse   maker_functional_gff    map_fasta_ids  tophat2gff3
chado2gff3             fasta_merge     _Inline            maker                   maker2wap       maker_map_ids           map_gff_ids    train_augustus.pl
compare_gff3_to_chado  fasta_tool      iprscan2gff3       maker2chado             maker2zff       map2assembly            match2gene.pl  zff2genbank.pl

As the names suggest, the "/usr/local/maker/bin/" directory includes many useful auxiliary scripts. For example, cufflinks2gff3 will convert output from an RNA-seq analysis into a GFF3 file that can be used as input evidence for MAKER. RepeatMasker, augustus, blast, exonerate, and snap are programs that MAKER uses in its pipeline. We recommend reading the MAKER Tutorial at GMOD for more information about these.


Step 2. Run the maker command with the -h flag to get a usage statement and list of options:

$ maker -h
Argument "2.53_01" isn't numeric in numeric ge (>=) at /usr/local/lib/x86_64-linux-gnu/perl/5.22.1/forks.pm line 1570.
MAKER version 2.31.9
Usage:
     maker [options] <maker_opts> <maker_bopts> <maker_exe>
Description:
     MAKER is a program that produces gene annotations in GFF3 format using
     evidence such as EST alignments and protein homology. MAKER can be used to
     produce gene annotations for new genomes as well as update annotations
     from existing genome databases.
     The three input arguments are control files that specify how MAKER should
     behave. All options for MAKER should be set in the control files, but a
     few can also be set on the command line. Command line options provide a
     convenient machanism to override commonly altered control file values.
     MAKER will automatically search for the control files in the current
     working directory if they are not specified on the command line.
     Input files listed in the control options files must be in fasta format
     unless otherwise specified. Please see MAKER documentation to learn more
     about control file  configuration.  MAKER will automatically try and
     locate the user control files in the current working directory if these
     arguments are not supplied when initializing MAKER.
     It is important to note that MAKER does not try and recalculated data that
     it has already calculated.  For example, if you run an analysis twice on
     the same dataset you will notice that MAKER does not rerun any of the
     BLAST analyses, but instead uses the blast analyses stored from the
     previous run. To force MAKER to rerun all analyses, use the -f flag.
     MAKER also supports parallelization via MPI on computer clusters. Just
     launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support must be
     configured during the MAKER installation process for this to work though
Options:
     -genome|g <file>    Overrides the genome file path in the control files
     -RM_off|R           Turns all repeat masking options off.
     -datastore/         Forcably turn on/off MAKER's two deep directory
      nodatastore        structure for output.  Always on by default.
     -old_struct         Use the old directory styles (MAKER 2.26 and lower)
     -base    <string>   Set the base name MAKER uses to save output files.
                         MAKER uses the input genome file name by default.
     -tries|t <integer>  Run contigs up to the specified number of tries.
     -cpus|c  <integer>  Tells how many cpus to use for BLAST analysis.
                         Note: this is for BLAST and not for MPI!
     -force|f            Forces MAKER to delete old files before running again.
                         This will require all blast analyses to be rerun.
     -again|a            recaculate all annotations and output files even if no
                         settings have changed. Does not delete old analyses.
     -quiet|q            Regular quiet. Only a handlful of status messages.
     -qq                 Even more quiet. There are no status messages.
     -dsindex            Quickly generate datastore index file. Note that this
                         will not check if run settings have changed on contigs
     -nolock             Turn off file locks. May be usful on some file systems,
                         but can cause race conditions if running in parallel.
     -TMP                Specify temporary directory to use.
     -CTL                Generate empty control files in the current directory.
     -OPTS               Generates just the maker_opts.ctl file.
     -BOPTS              Generates just the maker_bopts.ctl file.
     -EXE                Generates just the maker_exe.ctl file.
     -MWAS    <option>   Easy way to control mwas_server for web-based GUI
                              options:  STOP
                                        START
                                        RESTART
     -version            Prints the MAKER version.
     -help|?             Prints this usage statement.

Step 3.  Create control files that tell MAKER what to do. Three files are required:

  • maker_opts.ctl - gives location of input files (genome and evidence) and sets options that affect MAKER behavior
  • maker_exe.ctl - gives path information for the underlying executables.
  • maker_bopts.ctl - sets parameters for filtering BLAST and Exonerate alignment results

To create these files run the maker command with the -CTL flag. Verify with ls:

$ maker -CTL
$ ls
maker_bopts.ctl  maker_exe.ctl  maker_opts.ctl  test_data
  • The "maker_exe.ctl" file is automatically generated with the correct paths to executables and does not need to be modified.
  • The "maker_bopts.ctl" file is automatically generated with reasonable default parameters and also does not need to be modified unless you want to experiment with optimization of these parameters.
  • The automatically generated "maker_opts.ctl" file needs to be modified in order to specify the genome file and evidence files to be used as input. You can use the text editor "vi" or "nano", both of which are already installed in the instance

 

Open maker_opts.ctl with vi tool

$ vi maker_opts.ctl

 

Here are the sections of the "maker_opts.ctl" file you need to edit. Add path information to files as shown. For more information about the control files, see The_MAKER_control_files_explained.

Do not put any spaces after the equal sign or inside the comma-separated file lists

If the input files are not in the same directory as "maker_opts.ctl", give the full or relative path to each file (the examples below use full paths)

This section pertains to specifying the genome assembly to be annotated and setting organism type:

#-----Genome (these are always required)
genome=/vol_b/run_data/test.fasta #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

The following section pertains to EST and other mRNA expression evidence.  Here we are only using maize  data, but one could specify data from a related species using the "altest" parameter. With RNA-seq data aligned to your genome by Cufflinks or Tophat one could use maker auxiliary scripts (cufflinks2gff3 and tophat2gff3) to generate GFF3 files and specify these using the est_gff parameter:

est=/vol_b/run_data/genbank_ests_ATCG.fasta:ATCG,/vol_b/run_data/genbank_fl_cdnas_ATCG_only.fasta:flc,/vol_b/run_data/jiao_w22_all.fasta:w22,/vol_b/run_data/law_trinity_longer_than_300bp_cdhit_99.fasta:lcdh,/vol_b/run_data/martin_nature_seedling_transcriptome_longer_than_300bp.fa:sdl,/vol_b/run_data/wang_isoseq_maize.fa:wang,/vol_b/run_data/Zea_mays.AGPv4.cdna.all.fa:grm #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format

The following section pertains to protein sequence evidence.  Here we are using previously annotated protein sequences.  Another option would be to use SwissProt or other database:

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=/vol_b/run_data/protein/Arabidopsis_thaliana.TAIR10.27.pep.all.fa:AT,/vol_b/run_data/protein/Brachypodium_distachyon.v1.0.27.pep.all.fa:BD,/vol_b/run_data/protein/Oryza_sativa.IRGSP-1.0.27.pep.all.fa:OS,/vol_b/run_data/protein/Setaria_italica.JGIv2.0.27.pep.all.fa:SI,/vol_b/run_data/protein/Sorghum_bicolor.Sorbi1.27.pep.all.fa,/vol_b/run_data/protein/Zea_mays.AGPv4.pep.all.fa:grm  #protein sequence file in fasta format (i.e. from mutiple organisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file

This next section pertains to repeat identification:

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org= #select a model organism for RepBase masking in RepeatMasker
rmlib=/vol_b/run_data/Wessler-Bennetzen_2.fasta #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/vol_b/run_data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

This next section pertains to settings for the gene predictors.

#-----Gene Prediction
snaphmm= #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species=maize5 #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff=/vol_b/run_data/B73_fg.gff3  #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
run_evm=0 #run EvidenceModeler, 1 = yes, 0 = no
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
trna=1 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
snoscan_meth= #-O-methylation site fileto have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
allow_overlap= #allowed gene overlap fraction (value from 0 to 1, blank for default)


Keep the rest of the settings as default.
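
Because a mistyped path in "maker_opts.ctl" only surfaces once MAKER starts, the edits can also be made and checked from the shell. The following is a minimal sketch, not a MAKER utility: set_ctl_option, ctl_paths and missing_ctl_files are hypothetical helper names, and the parsing assumes the "key=value #comment" layout with comma-separated "path:tag" entries shown above.

```shell
# set_ctl_option FILE KEY VALUE: rewrite the "KEY=" line of a control file,
# keeping its trailing "#comment". Fails if KEY is not present.
# Note: awk -v interprets backslash escapes, so VALUE should not contain "\".
set_ctl_option() (
    file=$1 key=$2 value=$3
    tmp=$(mktemp) || exit 1
    if awk -v key="$key" -v value="$value" '
        index($0, key "=") == 1 {
            hash = index($0, "#")
            comment = (hash > 0) ? (" " substr($0, hash)) : ""
            print key "=" value comment
            found = 1
            next
        }
        { print }
        END { exit (found ? 0 : 1) }
    ' "$file" > "$tmp"; then
        cat "$tmp" > "$file"
        status=0
    else
        status=1
    fi
    rm -f "$tmp"
    exit "$status"
)

# ctl_paths FILE KEY: print each path listed for KEY, one per line, without ":tag".
ctl_paths() {
    awk -v key="$2" '
        index($0, key "=") == 1 {
            value = substr($0, length(key) + 2)
            sub("[ \t]*#.*", "", value)          # drop the trailing comment
            n = split(value, items, ",")
            for (i = 1; i <= n; i++) {
                sub(":[^:/]*$", "", items[i])    # drop an optional ":tag"
                if (items[i] != "") print items[i]
            }
        }
    ' "$1"
}

# missing_ctl_files FILE KEY...: print "KEY<TAB>PATH" for every listed file that
# does not exist; succeed only if nothing is missing.
missing_ctl_files() (
    file=$1
    shift
    missing=$(
        for key in "$@"; do
            ctl_paths "$file" "$key" | while IFS= read -r path; do
                [ -e "$path" ] || printf '%s\t%s\n' "$key" "$path"
            done
        done
    )
    [ -z "$missing" ] && exit 0
    printf '%s\n' "$missing"
    exit 1
)

# Example:
#   set_ctl_option maker_opts.ctl genome /vol_b/run_data/test.fasta
#   missing_ctl_files maker_opts.ctl genome est protein rmlib repeat_protein pred_gff
```

Running the check before launching MAKER catches file names that differ from what was downloaded.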

Step 7.  Run MAKER-P

MAKER-P will be run using MPI. The command below starts 44 processes, which assumes an instance with 44 CPUs; set -n to the number of CPUs available on your instance.

$ nohup mpiexec -n 44 maker < /dev/null > maker.log 2>&1 &
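
The process count passed to mpiexec should not exceed the CPUs on the instance. As a minimal sketch, the helper below (maker_mpi_command is a hypothetical name) prints the same command line with -n taken from the machine itself; it only prints the command and does not start MAKER.

```shell
# maker_mpi_command [N]: print the mpiexec command line for N processes.
# N defaults to the CPU count reported by nproc, falling back to getconf.
maker_mpi_command() (
    n=${1:-$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)}
    printf 'nohup mpiexec -n %s maker < /dev/null > maker.log 2>&1 &\n' "$n"
)

# Example:
#   maker_mpi_command        # uses the detected CPU count
#   maker_mpi_command 6      # explicit process count
```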

You can track the status of the MAKER-P run by checking the contents of the maker.log file

$ tail -f maker.log
STATUS: Parsing control files...
STATUS: Processing and indexing input FASTA files...
STATUS: Setting up database for any GFF3 input...
A data structure will be created for you at:
/vol_b/run_data/test.maker.output/test_datastore
To access files for individual sequences use the datastore index:
/vol_b/run_data/test.maker.output/test_master_datastore_index.log
STATUS: Now running MAKER...

Once MAKER-P finishes, check the maker.log file again. You should see the following message

$ tail -f maker.log
adding statistics to annotations
Calculating annotation quality statistics
choosing best annotation set
Choosing best annotations
processing chunk output
processing contig output
Maker is now finished!!!
Start_time: 1509292679
End_time:   1509303084
Elapsed:    10405
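
The finish message does not say whether every contig was annotated; the datastore index records that per sequence. A minimal sketch, assuming the index is a tab-separated file with the sequence name in column 1, its output directory in column 2 and a status such as STARTED, FINISHED or FAILED in column 3; unfinished_contigs is a hypothetical helper name.

```shell
# unfinished_contigs INDEX: print "<name><TAB><status>" for every sequence
# whose last recorded status in the datastore index is not FINISHED.
unfinished_contigs() {
    awk -F '\t' '
        NF >= 3 { last[$1] = $3 }    # later lines overwrite earlier ones
        END {
            for (name in last)
                if (last[name] != "FINISHED") print name "\t" last[name]
        }
    ' "$1" | sort
}

# Example:
#   unfinished_contigs test.maker.output/test_master_datastore_index.log
```

Empty output means every sequence listed in the index finished.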

 

Step 8: Merge gff and fasta files generated from MAKER-P run

To merge gff's

$ gff3_merge -g -n -d test.maker.output/test_master_datastore_index.log

To merge fasta

$ fasta_merge -d test.maker.output/test_master_datastore_index.log

 

The above commands create consolidated annotation files as shown below:

test.all.gff - 'MAKER generated annotation file'
test.all.maker.augustus_masked.proteins.fasta
test.all.maker.augustus_masked.transcripts.fasta
test.all.maker.non_overlapping_ab_initio.proteins.fasta
test.all.maker.non_overlapping_ab_initio.transcripts.fasta
test.all.maker.proteins.fasta - 'MAKER generated proteins file'
test.all.maker.transcripts.fasta - 'MAKER generated transcripts file'

Part 4: Quality control of annotated genes

Once the MAKER run is finished, the next step is to filter out misannotated gene models and gene models with little supporting evidence. The section below describes some steps that help identify such gene models.

4.1 Gene and transcript counts

$ awk '{print $3}' test.all.gff | sort | uniq -c
  2207 CDS
   1233 exon
    322 five_prime_UTR
    184 gene
    361 mRNA
    481 three_prime_UTR
      2 tRNA

Make sure the MAKER-generated protein and transcript files contain the same number of sequences as the mRNA count in test.all.gff

$ grep "^>" test.all.maker.proteins.fasta | wc -l
361
$ grep "^>" test.all.maker.transcripts.fasta | wc -l
361
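
The comparison above can be scripted so it is easy to repeat after each run. A minimal sketch, assuming a standard nine-column GFF3 file; count_gff_features and counts_match are hypothetical helper names. Unlike the plain awk call, it skips comment lines and stops at an embedded ##FASTA section.

```shell
# count_gff_features TYPE GFF: count feature lines whose column 3 equals TYPE.
count_gff_features() {
    awk -F '\t' -v type="$1" '
        /^##FASTA/ { exit }      # stop at an embedded FASTA section
        /^#/ { next }            # skip comments and directives
        $3 == type { n++ }
        END { print n + 0 }
    ' "$2"
}

# counts_match GFF FASTA: succeed if the number of mRNA features in GFF
# equals the number of FASTA records in FASTA.
counts_match() {
    [ "$(count_gff_features mRNA "$1")" -eq "$(grep -c '^>' "$2")" ]
}

# Example:
#   counts_match test.all.gff test.all.maker.proteins.fasta || echo "counts differ" >&2
```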

 

4.2 Run InterProScan on MAKER annotated Proteins

First we create a directory called interproscan and copy the MAKER annotated protein file "test.all.maker.proteins.fasta" to it

$ mkdir interproscan
$ cp test.all.maker.proteins.fasta interproscan/
$ cd interproscan/

Next we divide the protein fasta file into 10 parts using the fasta-splitter.pl script. This allows us to run InterProScan in parallel on the parts instead of on a single large protein sequence file.

$ fasta-splitter.pl --n-parts 10 test.all.maker.proteins.fasta

Create a jobs list of interproscan commands to be submitted

$ mkdir tsv; cd tsv
$ for i in `ls ../*part*fasta`; do echo /usr/local/interproscan-5.24-63.0/interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i $i; done > interpro_jobs_to_split.txt

With this we can split the batch file "interpro_jobs_to_split.txt" into multiple files and run them in parallel.

$ split -l 1 interpro_jobs_to_split.txt batch.sh_
$ ls batch*|xargs -n 1 -P 6 bash

InterProScan outputs a tsv file with IPR domains for each part

Combine all the tsv files into a single tsv file

$ cat *tsv > merged_iprscan_output.tsv
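
A quick summary of the merged file shows how many proteins received at least one domain hit. A minimal sketch, assuming the InterProScan TSV layout in which column 1 is the protein ID and column 5 is the signature accession; proteins_with_hits and top_signatures are hypothetical helper names.

```shell
# proteins_with_hits TSV: number of distinct protein IDs (column 1) in the TSV.
proteins_with_hits() {
    cut -f 1 "$1" | sort -u | wc -l | tr -d ' '
}

# top_signatures TSV [N]: the N most frequent signature accessions (column 5),
# printed as "<count> <accession>"; N defaults to 10.
top_signatures() {
    cut -f 5 "$1" | sort | uniq -c | sort -k1,1nr -k2,2 | head -n "${2:-10}"
}

# Example:
#   proteins_with_hits merged_iprscan_output.tsv
#   top_signatures merged_iprscan_output.tsv 5
```

Comparing the first number with the protein count from section 4.1 gives the fraction of gene models with Pfam support.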

 

4.3 Run BLAST-P with MAKER annotated proteins against uniprot database

$ cd /vol_b/run_data
$ mkdir blastp; cd blastp
# uniprot_sprot.fasta (UniProt/Swiss-Prot proteins) must already be present in this directory
$ makeblastdb -in uniprot_sprot.fasta -dbtype prot
$ blastp -db uniprot_sprot.fasta -query ../test.all.maker.proteins.fasta -out maker2uni.blastp -evalue .000001 -outfmt 6 -num_alignments 1 -seg yes -soft_masking true -lcase_masking -max_hsps_per_subject 1
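
Proteins with no UniProt hit will not receive a functional description in the next step, so it is useful to list them. A minimal sketch, assuming BLAST tabular output (-outfmt 6) with the query ID in column 1 and FASTA headers whose first word is the sequence ID; queries_without_hits is a hypothetical helper name.

```shell
# queries_without_hits FASTA BLAST_TSV: print the IDs of FASTA records that
# never appear as a query (column 1) in the tabular BLAST output.
queries_without_hits() (
    tmp=$(mktemp -d) || exit 1
    # Sequence ID = header text up to the first whitespace, without ">".
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | sort -u > "$tmp/all"
    cut -f 1 "$2" | sort -u > "$tmp/hit"
    # comm -23 keeps lines that occur only in the first (sorted) file.
    comm -23 "$tmp/all" "$tmp/hit"
    rm -rf "$tmp"
)

# Example:
#   queries_without_hits test.all.maker.proteins.fasta blastp/maker2uni.blastp | wc -l
```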

 

4.4 Update the MAKER annotation with the functional annotation

First we will update the MAKER gff with InterProScan output. In this step you will update the gff3 file to contain the iprscan information on the mRNA line

$  cd /vol_b/run_data
$  ipr_update_gff test.all.gff interproscan/tsv/merged_iprscan_output.tsv > test.all.functional_ipr.gff

 

This procedure added a Dbxref tag to column nine of the gene and mRNA features that have Pfam domains identified by InterProScan in the GFF3 file. The value for this tag contains InterPro and Pfam ids as well as the Gene Ontology ids associated with the identified domains, and looks like this:

Dbxref=InterPro:IPR001300,Pfam:PF00648;Ontology_term=GO:0004198,GO:0005622, GO:0006508

We now update the MAKER gff and fasta files with functions identified from BLASTP against UniProt/SwissProt

$  maker_functional_gff blastp/uniprot_sprot.fasta blastp/maker2uni.blastp test.all.functional_ipr.gff > test.all.functional_ipr.uniprot.gff

This procedure added a functional tag identified by blastp to column nine of the gene and mRNA features in the GFF3 file. The value looks like this

Note=Similar to RNP1: Heterogeneous nuclear ribonucleoprotein 1 (Arabidopsis thaliana)

Similarly add this information to the MAKER fasta files (protein and transcript sequences)

$ maker_functional_fasta blastp/uniprot_sprot.fasta blastp/maker2uni.blastp test.all.maker.proteins.fasta > test.all.maker.proteins.uniprot.fasta
$ maker_functional_fasta blastp/uniprot_sprot.fasta blastp/maker2uni.blastp test.all.maker.transcripts.fasta > test.all.maker.transcripts.uniprot.fasta

 

4.5 Build shorter IDs/Names for MAKER genes and transcripts following the NCBI suggested naming format

$ cd /vol_b/run_data
$ maker_map_ids --prefix PYU1_ --justify 6 test.all.functional_ipr.uniprot.gff > genome.all.id.map

where

  • --prefix : The prefix to use for all IDs (default = 'MAKER_')
  • --justify : The unique integer portion of the ID will be right justified with '0's to this length (default = 8)

This will create a mapping file as below, and it can be used to rename the feature IDs

$ head genome.all.id.map
maker-scaffold10-augustus-gene-0.3	PYU1_000001
maker-scaffold10-augustus-gene-0.3-mRNA-3	PYU1_000001-RA
maker-scaffold10-augustus-gene-0.3-mRNA-2	PYU1_000001-RB
maker-scaffold10-augustus-gene-0.3-mRNA-1	PYU1_000001-RC
maker-scaffold10-augustus-gene-0.4	PYU1_000002
maker-scaffold10-augustus-gene-0.4-mRNA-1	PYU1_000002-RA
maker-scaffold10-augustus-gene-0.4-mRNA-2	PYU1_000002-RB
maker-scaffold10-augustus-gene-0.5	PYU1_000003
maker-scaffold10-augustus-gene-0.5-mRNA-5	PYU1_000003-RA
maker-scaffold10-augustus-gene-0.5-mRNA-6	PYU1_000003-RB

 

Alternate transcripts are assigned the suffixes -RA, -RB, -RC, etc.

You can explore more options with the maker_map_ids script to name feature ID's

 

We map the short IDs/Names from genome.all.id.map onto the MAKER GFF3 file test.all.functional_ipr.uniprot.gff; the old IDs/Names are moved to the Alias attribute

$ map_gff_ids genome.all.id.map test.all.functional_ipr.uniprot.gff

Similarly, map genome.all.id.map onto the MAKER protein and transcript sequences

$ map_fasta_ids genome.all.id.map test.all.maker.proteins.uniprot.fasta
$ map_fasta_ids genome.all.id.map test.all.maker.transcripts.uniprot.fasta