MAKER-3.0 with Apollo on JetStream
MAKER Genome Annotation and gene editing using Apollo
Rationale and background:
MAKER-P is a flexible and scalable genome annotation pipeline that automates the many steps necessary for the detection of protein coding genes (Campbell et al. 2013). MAKER identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, and automatically synthesizes these data into gene annotations having evidence-based quality indices. MAKER was developed by the Yandell Lab and is described in several publications (Cantarel et al. 2008; Holt & Yandell 2011). Additional background is available at the MAKER Tutorial at GMOD and is highly recommended reading.
Apollo is the first instantaneous, collaborative genomic annotation editor available on the web. With Web Apollo researchers can use any of the common browsers (for example, Chrome or Firefox) to jointly analyze and precisely describe the features of a genome in real time, whether they are in the same room or working from opposite sides of the world. The task of manual curation is spread out among many hands and eyes, enabling the creation of virtual research networks of researchers linked by a common interest in a particular organism or population.
This tutorial will take users through steps of:
- Running MAKER on Jetstream cloud
- Running downstream qaulity control tools on the predicted genes
- Running Apollo gene editing tool to get highly curated gene annotations
Considerations
Sounds great, what do I need to get started?
- XSEDE account
- Later on, they can request a startup XSEDE allocation.
- Your data (or you can run example data)
What kind of data do I need?
- Mandatory requirements
- Genome assembly (fasta file)
- Organism type
- Eukaryotic (default, set as: organism_type=eukaryotic)
- Prokaryotic (set as: organism_type=prokaryotic)
- Additional data that can be used to improve the annotation (Highly recommended)
- RNA evidence (at least one of them is needed)
- Assembled mRNA-seq transcriptome (fasta file)
- Expressed sequence tags (ESTs) data (fasta file)
- Aligned EST or transcriptome GFF3 from your organism
- Aligned EST or transcriptome GFF3 from a closely related organism
- Protein evidence
protein sequence file in fasta format (i.e. from multiple organisms)
protein gff (aligned protein homology evidence from an external GFF3 file)
- RNA evidence (at least one of them is needed)
- For this particular tutorial we will use maize specific test data.
What kind of resources will I need for my project?
- Enough storage space on the MAKER-P Jetstream instance for both input and output files
- Creating and mounting an external volume to the running MAKER-P instance would be recommended
- Enough AUs to run your computation
Part 1: Connect to an instance of an MAKER Jetstream Image (virtual machine)
Step 1. Go to https://use.jetstream-cloud.org/application and log in with your XSEDE credentials.
Step 2. Click on the "Create New Project" in the Project tab on the top and enter the name of the project and a brief description
Step 3. Launch an instance from the selected image and name it as MAKER-run
After the project has been created and entered inside it, click the "New" button, select "MAKER-P_v3" image and then click Launch instance. In the next window (Basic Info),
- name the instance as "MAKER-run" (don't worry if you forgot to name the instance at this point, as you can always modify the name of the instance later)
- set base image version as "1.0" (default)
- leave the project as it is or change to a different project if needed
- select "Jetstream - Indiana University or Jetstream - TACC" as Provider and click 'Continue'. Your choice of provider will depend on the resources you have available (AUs) and the needs of your instance
- select "m1.medium" as Instance size (this is the minimum size that is required by MAKER-P image) and click "Continue".
Step 4. As the instance is launched behind the scenes, you will get an update as it goes through each step.
Status updates of Instance launch include Build-requesting launch, Build-networking, Build-spawning, Active-networking, Active-deploying. Depending on the usage load on Jetstream, it can take anywhere from 2-5 mins for an instance to become active. You can force check updates by using the refresh button in the Instance launch page or the refresh button on your browser. Once the instance becomes active a virtual machine with the ip address provided will become available for you to connect to. This virtual machine will have all the necessary components to run MAKER-P.
Step 5: Create a volume
Since the m1 medium instance size (60GB disk space) selected for running MASTER instance of MAKER-P may not be sufficient for most of the MAKER runs, it is recommended to run it on volumes
5.1 Click the "New" button in the project and select "Create Volume". Enter the name of the volume, volume size (GB) needed and the provider (TACC or Indiana) and finally click "Create Volume"
Attach the created volume to the MASTER instance
5.2 Click on the MAKER-P_v3 instance now
Jetstream provides web-shell, a web-based terminal, for accessing your VM at the command line level once it has been deployed.
However, you might find that you wish to access your VM via SSH if you’ve provisioned it with a routable IP number. For SSH access, you can create (or copy) SSH public-keys for your non-Jetstream computer that will allow it to access Jetstream then deposit those keys in your Atmosphere settings. More instructions can be found here
$ ssh <username>@<ip.address>
5.3 Mount the volume to a specified drive.
Once you have logged in to your instance using webshell or ssh of your MASTER instance, you must change the directory permissions as below
# Change the ownership and group permission on the mount location $ sudo chown -hR $USER /vol_b $ sudo chgrp -hR $USER /vol_b # cd into the /vol_b and then run WQ-MAKER in there $ cd /vol_b # create a directory named run_data $ mkdir run_data $ cd run_data
Step 6 set up iCommands for data transfer
We will iCommands a service from iRODS for transfering evidence data from Cyverse data commons repositiry. iCommands is a collection of commands for Linux and Mac OS operating systems that are used in the iRODS system to interact with the CyVerse Data Store. iCommands can used to transfer large amounts from CyVerse data to the running JetStream instance. Complete list of iCommands and its usage is here
The first time you use iCommands, you must initiate the connection to iRODS.
In a terminal window, enter
iinit
to initialize iCommands and your Data Store connection. For example, here's what you would do if your iRODS user name is cyverse-user:kap12@js-156-187:/vol_b/run_data$ iinit One or more fields in your iRODS environment file (irods_environment.json) are missing; please enter them. Enter the host name (DNS) of the server to connect to: data.cyverse.org Enter the port number: 1247 Enter your irods user name: cyverse-user Enter your irods zone: iplant Those values will be added to your environment file (for use by other i-commands) if the login succeeds. Enter your current iRODS password: kap12@js-156-187:/vol_b/run_data$
Once
iinit
has been finished, typeils
to check that iCommands is working. You should see your home directory at /iplant/home/your_user_nameDownload the evidence set required for annotation
$ iget -PVr /iplant/home/shared/commons_repo/curated/MaizeCode_annotation_evidence_data_2017 . $ mv MaizeCode_annotation_evidence_data_2017/* .
Part 3: Set up a MAKER run using the Terminal window
$ ls /vol_b/run_data genbank_ests_ATCG.fasta jiao_w22_all.fasta martin_nature_seedling_transcriptome_longer_than_300bp.fa wang_isoseq.fasta genbank_ests.fasta law_trinity_longer_than_300bp_cdhit_99.fasta protein Wessler-Bennetzen_2.fasta genbank_fl_cdnas_ATCG_only.fasta MaizeCode_annotation_evidence_data_2017 Read_me.txt
The below list of files in the data folder will be used as evidence datasets in running MAKER-P annotaiton on maize genomes
1) genbank_ests.fasta: These are maize ESTs downloaded from genbank and identified using this search command: (EST[Keyword]) AND maize[Organism]
2) genbank_ests_ATCG.fasta: These are full length cDNAs downloaded from genbank and identified using this search command: (FLI-CDNA[Keyword]) AND maize[Organism]
3) wang_isoseq.fasta: hese are transcripts built from isoseq data. Published here:
Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing.
Wang B, Tseng E, Regulski M, Clark TA, Hon T, Jiao Y, Lu Z, Olson A, Stein JC, Ware D.
Nat Commun. 2016 Jun 24;7:11708. doi: 10.1038/ncomms11708.
PMID: 27339440
4) Protien files: All of these data sets were downloaded from the gramene ftp site using these commands
Sorghum: wget ftp://ftp.gramene.org/pub/gramene/release46/data/fasta/sorghum_bicolor/pep/Sorghum_bicolor.Sorbi1.27.pep.all.fa.gz
Rice: wget ftp://ftp.gramene.org/pub/gramene/release46/data/fasta/oryza_sativa/pep/Oryza_sativa.IRGSP-1.0.27.pep.all.fa.gz
Arabidopsis: wget ftp://ftp.gramene.org/pub/gramene/release46/data/fasta/arabidopsis_thaliana/pep/Arabidopsis_thaliana.TAIR10.27.pep.all.fa.gz
Setaria: wget ftp://ftp.gramene.org/pub/gramene/release46/data/fasta/setaria_italica/pep/Setaria_italica.JGIv2.0.27.pep.all.fa.gz
Brachypodium: wget ftp://ftp.gramene.org/pub/gramene/release46/data/fasta/brachypodium_distachyon/pep/Brachypodium_distachyon.v1.0.27.pep.all.fa.gz
5) Wessler-Bennetzen_2.fasta: This is the repeat library generated for the original B73 annotation. The helitrons were removed to prevent overmasking.
6) martin_nature_seedling_transcriptome_longer_than_300bp.fa: This is the assembnled transcripts from a very high depth seedling transcirptome published here
A near complete snapshot of the Zea mays seedling transcriptome revealed from ultra-deep sequencing.
Martin JA, Johnson NV, Gross SM, Schnable J, Meng X, Wang M, Coleman-Derr D, Lindquist E, Wei CL, Kaeppler S, Chen F, Wang Z.
Sci Rep. 2014 Mar 31;4:4519. doi: 10.1038/srep04519.
PMID: 24682209
7) law_trinity_longer_than_300bp_cdhit_99.fasta: These are the trinity assemblies from the 95 RNAseq experimetns used for annotation in this paper
Automated update, revision, and quality control of the maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene models and identifies new genes.
Law M, Childs KL, Campbell MS, Stein JC, Olson AJ, Holt C, Panchy N, Lei J, Jiao D, Andorf CM, Lawrence CJ, Ware D, Shiu SH, Sun Y, Jiang N, Yandell M.
Plant Physiol. 2015 Jan;167(1):25-39. doi: 10.1104/pp.114.245027. Epub 2014 Nov 10.
PMID: 25384563
Transcripts less than 300bp have been removed and cdhit was run on the remaining sequences with a similarity threshold of .99
8) jiao_w22_all.fasta: These are trinity assembled W22 transcirpts from the following tissues
ear
embryo
endosperm
kernel
leaf
root
shoot
tassel
Executables for running MAKER are located in /opt/maker/bin and /opt/maker/exe:
$ ls /usr/local/maker/bin/ AED_cdf_generator.pl cufflinks2gff3 genemark_gtf2gff3 ipr_update_gff maker2eval_gtf maker_functional_fasta map_data_ids quality_filter.pl cegma2zff evaluator gff3_merge look_at_transcripts.pl maker2jbrowse maker_functional_gff map_fasta_ids tophat2gff3 chado2gff3 fasta_merge _Inline maker maker2wap maker_map_ids map_gff_ids train_augustus.pl compare_gff3_to_chado fasta_tool iprscan2gff3 maker2chado maker2zff map2assembly match2gene.pl zff2genbank.pl
As the names suggest the "/usr/local/maker/bin/" directory includes many useful auxiliary scripts. For example cufflinks2gff3 will convert output from an RNA-seq analysis into a GFF3 file that can be used for input as evidence for WQ-MAKER. RepeatMasker, augustus, blast, exonerate, and snap are programs that MAKER uses in its pipeline. We recommend reading MAKER Tutorial at GMOD for more information about these.
Step 2. Run the maker command with the --help flag to get a usage statement and list of options:
$ maker -h Argument "2.53_01" isn't numeric in numeric ge (>=) at /usr/local/lib/x86_64-linux-gnu/perl/5.22.1/forks.pm line 1570. MAKER version 2.31.9 Usage: maker [options] <maker_opts> <maker_bopts> <maker_exe> Description: MAKER is a program that produces gene annotations in GFF3 format using evidence such as EST alignments and protein homology. MAKER can be used to produce gene annotations for new genomes as well as update annotations from existing genome databases. The three input arguments are control files that specify how MAKER should behave. All options for MAKER should be set in the control files, but a few can also be set on the command line. Command line options provide a convenient machanism to override commonly altered control file values. MAKER will automatically search for the control files in the current working directory if they are not specified on the command line. Input files listed in the control options files must be in fasta format unless otherwise specified. Please see MAKER documentation to learn more about control file configuration. MAKER will automatically try and locate the user control files in the current working directory if these arguments are not supplied when initializing MAKER. It is important to note that MAKER does not try and recalculated data that it has already calculated. For example, if you run an analysis twice on the same dataset you will notice that MAKER does not rerun any of the BLAST analyses, but instead uses the blast analyses stored from the previous run. To force MAKER to rerun all analyses, use the -f flag. MAKER also supports parallelization via MPI on computer clusters. Just launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support must be configured during the MAKER installation process for this to work though Options: -genome|g <file> Overrides the genome file path in the control files -RM_off|R Turns all repeat masking options off. -datastore/ Forcably turn on/off MAKER's two deep directory nodatastore structure for output. Always on by default. -old_struct Use the old directory styles (MAKER 2.26 and lower) -base <string> Set the base name MAKER uses to save output files. MAKER uses the input genome file name by default. -tries|t <integer> Run contigs up to the specified number of tries. -cpus|c <integer> Tells how many cpus to use for BLAST analysis. Note: this is for BLAST and not for MPI! -force|f Forces MAKER to delete old files before running again. This will require all blast analyses to be rerun. -again|a recaculate all annotations and output files even if no settings have changed. Does not delete old analyses. -quiet|q Regular quiet. Only a handlful of status messages. -qq Even more quiet. There are no status messages. -dsindex Quickly generate datastore index file. Note that this will not check if run settings have changed on contigs -nolock Turn off file locks. May be usful on some file systems, but can cause race conditions if running in parallel. -TMP Specify temporary directory to use. -CTL Generate empty control files in the current directory. -OPTS Generates just the maker_opts.ctl file. -BOPTS Generates just the maker_bopts.ctl file. -EXE Generates just the maker_exe.ctl file. -MWAS <option> Easy way to control mwas_server for web-based GUI options: STOP START RESTART -version Prints the MAKER version. -help|? Prints this usage statement.
Step 3. Create control files that tell MAKER what to do. Three files are required:
maker_opts.ctl
- gives location of input files (genome and evidence) and sets options that affect MAKER behaviormaker_exe.ctl
- gives path information for the underlying executables.maker_bopt.ctl
- sets parameters for filtering BLAST and Exonerate alignment results
To create these files run the maker command with the -CTL flag. Verify with ls:
$ maker -CTL $ ls maker_bopts.ctl maker_exe.ctl maker_opts.ctl test_data
- The "maker_exe.ctl" is automatically generated with the correct paths to executables and does not need to be modified.
- The "maker_bopt.ctl" is automatically generated with reasonable default parameters and also does not need to be modified unless you want to experiment with optimization of these parameters.
- The automatically generated "maker_opts.ctl" file needs to be modified in order to specify the genome file and evidence files to be used as input. You can use the text editor "vi" or "nano" that is already installed in the instance
Open maker_opts.ctl with vi tool
$ vi maker_opts.ctl
Here are the sections of the "maker_opts.ctl" file you need to edit. For more information about the this please check this The_MAKER_control_files_explained - Add path information to files as shown.
Do not allow any spaces after the equal sign or anywhere else
The files can be present in same the directory as the "maker_opts.ctl" or make sure you use the relative path if the files are present in other directories
This section pertains to specifying the genome assembly to be annotated and setting organism type:
#-----Genome (these are always required) genome=/vol_b/run_data/test.fasta #genome sequence (fasta file or fasta embeded in GFF3 file) organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
The following section pertains to EST and other mRNA expression evidence. Here we are only using maize data, but one could specify data from a related species using the "altest" parameter. With RNA-seq data aligned to your genome by Cufflinks or Tophat one could use maker auxiliary scripts (cufflinks2gff3 and tophat2gff3) to generate GFF3 files and specify these using the est_gff parameter:
est=/vol_b/run_data/genbank_ests_ATCG.fasta:ATCG,/vol_b/run_data/genbank_fl_cdnas_ATCG_only.fasta:flc,/vol_b/run_data/jiao_w22_all.fasta:w22,/vol_b/run_data/law_trinity_longer_than_300bp_cdhit_99.fasta:lcdh,/vol_b/run_data/martin_nature_seedling_transcriptome_longer_than_300bp.fa:sdl,/vol_b/run_data/wang_isoseq_maize.fa:wang,/vol_b/run_data/Zea_mays.AGPv4.cdna.all.fa:grm #set of ESTs or assembled mRNA-seq in fasta format altest= #EST/cDNA sequence file in fasta format from an alternate organism est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file altest_gff= #aligned ESTs from a closly relate species in GFF3 format
The following section pertains to protein sequence evidence. Here we are using previously annotated protein sequences. Another option would be to use SwissProt or other database:
#-----Protein Homology Evidence (for best results provide a file for at least one) protein=/vol_b/run_data/protein/Arabidopsis_thaliana.TAIR10.27.pep.all.fa:AT,/vol_b/run_data/protein/Brachypodium_distachyon.v1.0.27.pep.all.fa:BD,/vol_b/run_data/protein/Oryza_sativa.IRGSP-1.0.27.pep.all.fa:OS,/vol_b/run_data/protein/Setaria_italica.JGIv2.0.27.pep.all.fa:SI,/vol_b/run_data/protein/Sorghum_bicolor.Sorbi1.27.pep.all.fa:,/vol_b/run_data/protein/Zea_mays.AGPv4.pep.all.fa:grm #protein sequence file in fasta format (i.e. from mutiple organisms) protein_gff= #aligned protein homology evidence from an external GFF3 file
This next section pertains to repeat identification:
#-----Repeat Masking (leave values blank to skip repeat masking) model_org= #select a model organism for RepBase masking in RepeatMasker rmlib=/vol_b/run_data/Wessler-Bennetzen_2.fasta #provide an organism specific repeat library in fasta format for RepeatMasker repeat_protein=/vol_b/run_data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner rm_gff= #pre-identified repeat elements from an external GFF3 file prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
This next section pertains to setting for gene predictors.
#-----Gene Prediction snaphmm= #SNAP HMM file gmhmm= #GeneMark HMM file augustus_species=maize5 #Augustus gene prediction species model fgenesh_par_file= #FGENESH parameter file pred_gff=/vol_b/run_data/B73_fg.gff3 #ab-initio predictions from an external GFF3 file model_gff= #annotated gene models from an external GFF3 file (annotation pass-through) run_evm=0 #run EvidenceModeler, 1 = yes, 0 = no est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no trna=1 #find tRNAs with tRNAscan, 1 = yes, 0 = no snoscan_rrna= #rRNA file to have Snoscan find snoRNAs snoscan_meth= #-O-methylation site fileto have Snoscan find snoRNAs unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no allow_overlap= #allowed gene overlap fraction (value from 0 to 1, blank for default)
Keep the rest of the settings as default.
Step 7. Run MAKER-P
MAKER-P will be run using MPI for scaling with 44 CPU available on the instance.
$ nohup mpiexec -n 44 maker < /dev/null > maker.log 2>&1 &
You can track the status of the MAKER-P run by checking the contents of the maker.log file
$ tail -f maker.log STATUS: Parsing control files... STATUS: Processing and indexing input FASTA files... STATUS: Setting up database for any GFF3 input... A data structure will be created for you at: /vol_b/run_data/test.maker.output/test_datastore To access files for individual sequences use the datastore index: /vol_b/run_data/test.maker.output/test_master_datastore_index.log STATUS: Now running MAKER...
Once MAKER-P finishes check again the status of maker.log fil. You should see the following message
$ tail -f maker.log adding statistics to annotations Calculating annotation quality statistics choosing best annotation set Choosing best annotations processing chunk output processing contig output Maker is now finished!!! Start_time: 1509292679 End_time: 1509303084 Elapsed: 10405
Step 8: Merge gff and fasta files generated from MAKER-P run
To merge gff's
$ gff3_merge -g -n -d test.maker.output/test_master_datastore_index.log
To merge fasta
$ fasta_merge -d test.maker.output/test_master_datastore_index.log
The above comands creates consolidated annotation files as shown below:
test.all.gff- 'MAKER generated annotaiton file' test.all.maker.augustus_masked.proteins.fasta test.all.maker.augustus_masked.transcripts.fasta test.all.maker.non_overlapping_ab_initio.proteins.fasta 10:22 test.all.maker.non_overlapping_ab_initio.transcripts.fasta test.all.maker.proteins.fasta- 'MAKER generated proteins file' test.all.maker.transcripts.fasta- 'MAKER generated transcripts file'
Part 4: Quality control of annotated genes
Once the MAKER run is finsihed, the next step is to filter out missannotated and low evidence supporting gene models. Below section descirbes some details to filter out such gene models.
4.1 Gene and trancript Counts
$ awk '{print $3}' test.all.gff | sort | uniq -c 2207 CDS 1233 exon 322 five_prime_UTR 184 gene 361 mRNA 481 three_prime_UTR 2 tRNA
Make sure the MAKER generated protein and transcripts file same counts as the mRNA counts in test.all.all
$ grep "^>" test.all.maker.proteins.fasta | wc -l 361 $ grep "^>" test.all.maker.transcripts.fasta | wc -l 361
4.2 Run InterProScan on MAKER annotated Proteins
First we create a directory called Interproscan and cp the Maker anntoeted protein file "test.all.maker.proteins.fasta" to it
$ mkdir interproscan $ cp test.all.maker.proteins.fasta interproscan/ $ cd interproscan/
Next we divide the protein fasta into chunks of 10 parts using the fastq_plittl.pl script. This will allow us to run interproscan parallely on the chunks instead of a single large protien sequence file.
$ fasta-splitter.pl --n-parts 10 test.all.maker.proteins.fasta
Create a jobs list of interproscan commands to be submitted
$ mkdir tsv; cd tsv $ for i in `ls ../*part*fasta`; do echo /usr/local/interproscan-5.24-63.0/interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i $i; done > interpro_jobs_to_split.txt
With this we can the batch file "interpro_jobs_to_split.txt" into mulitple files and run them in parallel.
$ split -l 1 interpro_jobs_to_split.txt batch.sh_ $ ls batch*|xargs -n 1 -P 6 bash
InterProScan output a tsv file with IPR domains
Combine the all the tsv into a single tsv file
$ cat *tsv > merged_iprscan_output.tsv
4.3 Run BLAST-P with MAKER annotated proteins against uniprot database
$ mkdir blastp; cd blastp $ makeblastdb -in uniprot_sprot.fasta -dbtype prot $ blastp -db uniprot_sprot.fasta -query test.all.maker.proteins.fasta -out maker2uni.blastp -evalue .000001 -outfmt 6 -num_alignments 1 -seg yes -soft_masking true -lcase_masking -max_hsps_per_subject 1
4.4 Update the MAKER annotation with the functional annotation
First we will update the MAKER gff with InterProScan output. In this step you will update the gff3 file to contain the iprscan information on the mRNA line
$ cd /vol_b/run_data $ ipr_update_gff test.all.gff interproscan/tsv/merged_iprscan_output.tsv > test.all.functional_ipr.gff
This procedure added a Dbxref tag to column nine of the gene and mRNA features that have Pfam domains identified by InterProScan in the GFF3 file. The value for this tag contains InterPro and Pfam ids as well as the Gene Ontology ids associated with the identified doamins, and looks like this:
Dbxref=InterPro:IPR001300,Pfam:PF00648;Ontology_term=GO:0004198,GO:0005622, GO:0006508
We now update the MAKERgffandfastafiles with functions identified from BLASTP against UniProt/SwissProt
$ maker_functional_gff blastp/uniprot_sprot.fasta blastp/maker2uni.blastp test.all.functional_ipr.gff > test.all.functional_ipr.uniprot.gff
This procedureaddedafunctionaltagidentifiedbyblastptocolumnnineofthegeneand mRNA features in the GFF3 file. The value looks like this
Note=Similar to RNP1: Heterogeneous nuclear ribonucleoprotein 1 (Arabidopsis thaliana)
Similarlyaddthis information to MAKERfastafiles (protein and transcript sequences)
$ maker_functional_fasta blastp/uniprot_sprot.fasta blastp/maker2uni.blastp test.all.maker.proteins.fasta > test.all.maker.proteins.uniprot.fasta $ maker_functional_fasta blastp/uniprot_sprot.fasta blastp/maker2uni.blastp test.all.maker.transcripts.fasta > test.all.maker.transcripts.uniprot.fasta
4.5 Build shorter IDs/Names for MAKER genes and transcripts following the NCBI suggested naming format
$ cd /vol_b/run_data $ maker_map_ids --prefix PYU1_ --justify 6 test.all.functional_ipr.uniprot.gff > genome.all.id.map
where
- --prefix : The prefix to use for all IDs (default = 'MAKER_')
- --justify:Theuniqueintegerportionof the ID will be right justified with '0's to this length (default = 8)
Thiswillcreateamappoingfileasbelowand it can be used to rename the feature ID's
$ head genome.all.id.map
maker-scaffold10-augustus-gene-0.3 PYU1_000001 maker-scaffold10-augustus-gene-0.3-mRNA-3 PYU1_000001-RA maker-scaffold10-augustus-gene-0.3-mRNA-2 PYU1_000001-RB maker-scaffold10-augustus-gene-0.3-mRNA-1 PYU1_000001-RC maker-scaffold10-augustus-gene-0.4 PYU1_000002 maker-scaffold10-augustus-gene-0.4-mRNA-1 PYU1_000002-RA maker-scaffold10-augustus-gene-0.4-mRNA-2 PYU1_000002-RB maker-scaffold10-augustus-gene-0.5 PYU1_000003 maker-scaffold10-augustus-gene-0.5-mRNA-5 PYU1_000003-RA maker-scaffold10-augustus-gene-0.5-mRNA-6 PYU1_000003-RB
alternate transcripts are assigned the value -RA, -RB, -RC, .....etc
You can explore more options with the maker_map_ids script to name feature ID's
We map short IDs/Names from genome.all.id.map to MAKER GFF3 test.all.functional_ipr.uniprot.gff , old IDs/Names are mapped to to the Alias attribute
$ map_gff_ids genome.all.id.map test.all.functional_ipr.uniprot.gff
$ map_fasta_ids genome.all.id.map test.all.maker.proteins.uniprot.fasta $ map_fasta_ids genome.all.id.map test.all.maker.transcripts.uniprot.fasta