InterProScan-5.26-65.0 Jetstream Tutorial
What is InterProScan?
InterPro is an integrated database of predictive protein "signatures" used for the classification and automatic annotation of proteins and genomes. InterPro classifies sequences at superfamily, family and subfamily levels, predicting the occurrence of functional domains, repeats and important sites. InterPro adds in-depth annotation, including GO terms, to the protein signatures.
More: http://www.ebi.ac.uk/interpro/
Which databases are used in InterPro?
InterPro combines a number of databases (referred to as member databases) that use different methodologies and a varying degree of biological information on well-characterised proteins to derive protein signatures. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful integrated database and diagnostic tool (InterProScan).
The member databases use a number of approaches:
- ProDom: provider of sequence-clusters built from UniProtKB using PSI-BLAST.
- PROSITE patterns: provider of simple regular expressions.
- PROSITE and HAMAP profiles: providers of sequence matrices.
- PRINTS: provider of fingerprints, which are groups of aligned, un-weighted Position Specific Sequence Matrices (PSSMs).
- PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY: providers of hidden Markov models (HMMs).
Diagnostically, these resources have different areas of optimum application owing to their different underlying analysis methods. In terms of family coverage, the protein signature databases are similar in size but differ in content. While all of the methods share a common interest in protein sequence classification, some focus on divergent domains (e.g., Pfam), some focus on functional sites (e.g., PROSITE), and others focus on families, specialising in hierarchical definitions from superfamily down to subfamily levels in order to pinpoint specific functions (e.g., PRINTS). TIGRFAMs focuses on building HMMs for functionally equivalent proteins, while PIRSF always produces HMMs over the full length of a protein and applies protein length restrictions when gathering family members. HAMAP profiles are manually created by expert curators; they identify proteins that belong to well-conserved bacterial, archaeal and plastid-encoded protein families or subfamilies. PANTHER builds HMMs based on the divergence of function within families. SUPERFAMILY and Gene3D are based on structure, using the SCOP and CATH superfamilies, respectively, as a basis for building HMMs.
Considerations
Sounds great, what do I need to get started?
- XSEDE account
- XSEDE allocation (you can also request trial access)
- Your data (or you can run example data)
What kind of data do I need?
- Mandatory requirements: protein sequences file in fasta format only
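If you are unsure what a protein FASTA file looks like, here is a minimal sketch; the sequences and file name below are made up purely for illustration:

```shell
# Create a tiny protein FASTA file with two made-up sequences
cat > my_proteins.fasta <<'EOF'
>protein_1 hypothetical kinase
MSTNPKPQRKTKRNTNRRPQDVKFPGG
>protein_2 hypothetical transporter
MKVLAAGIVALLAVSAQA
EOF

# Each record starts with a '>' header line; count the sequences
grep -c '>' my_proteins.fasta   # → 2
```

Note that InterProScan expects plain amino-acid letters; trailing '*' stop characters from translated ORFs can cause a run to fail, so strip them before submitting.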
What kind of resources will I need for my project?
- Enough storage space on the Jetstream instance for both input and output files
- Creating and attaching an external volume to the running instance is recommended
https://iujetstream.atlassian.net/wiki/spaces/JWT/pages/32899113/Volumes
- Enough AUs to run your computation
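To check the storage point from the terminal, compare the free space on the instance (the "Avail" column) with the expected size of your input and output files; a quick sketch:

```shell
# Show human-readable free space on the filesystem holding your home directory
df -h "$HOME"
```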
Part 1: Connect to an instance of an InterProScan-5.26-65.0 Jetstream Image (virtual machine)
Step 1. Go to https://use.jetstream-cloud.org/application and log in with your XSEDE credentials.
Step 2. Click "Create New Project" in the Projects tab at the top, then enter a name and a brief description for the project.
Step 3. Launch an instance from the selected image
After creating the project, open it, click the "New" button, select the "InterProScan-5.26-65.0" image, and then click "Launch Instance". In the next window (Basic Info):
- Leave the name as the default
- Set the base image version to "1.0.0" (default)
- Leave the project as is, or change to a different project if needed
- Select "Jetstream - Indiana University" or "Jetstream - TACC" as the provider and click "Continue". Your choice of provider will depend on the resources (AUs) you have available and the needs of your instance
- Select "s1.large" as the instance size (this is the minimum size required by the InterProScan-5.26-65.0 image) and click "Continue".
Step 4. As the instance is launched behind the scenes, you will get an update as it goes through each step.
Status updates during instance launch include Build-requesting launch, Build-networking, Build-spawning, Active-networking and Active-deploying. Depending on the usage load on Jetstream, it can take anywhere from 2-5 minutes for an instance to become active. You can force-check updates using the refresh button on the Instance launch page or the refresh button in your browser. Once the instance becomes active, a virtual machine with the IP address shown will become available for you to connect to. This virtual machine will have all the necessary components to run InterProScan, along with test files for running an InterProScan demo.
Step 5. Access the instance either through web-shell or terminal
Jetstream provides web-shell, a web-based terminal, for accessing your VM at the command line level once it has been deployed.
However, you might find that you wish to access your VM via SSH if you have provisioned it with a routable IP address. For SSH access, create (or copy) SSH public keys on your non-Jetstream computer that will allow it to access Jetstream, then deposit those keys in your Atmosphere settings. More instructions can be found here
$ ssh <username>@<ip.address>
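A minimal sketch of the key setup, assuming a Linux or macOS local machine; the key file name jetstream_key is arbitrary:

```shell
# On your LOCAL machine: generate a key pair (skip if you already have one)
mkdir -p ~/.ssh
ssh-keygen -t rsa -b 4096 -f ~/.ssh/jetstream_key -N ''

# Print the public key, then paste it into your Atmosphere settings
cat ~/.ssh/jetstream_key.pub

# Afterwards, connect to the instance with that key (replace the placeholders):
# ssh -i ~/.ssh/jetstream_key <username>@<ip.address>
```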
Part 2: Set up an InterProScan-5.26-65.0 run using the Terminal window
Step 1. Get oriented. You will find staged example data in "/opt/interproscan-5.26-65.0/" within the instance. List its contents with the ls command:
$ ls /opt/interproscan-5.26-65.0/
bin                 interproscan.properties  readme.txt  test_all_appl.fasta      test_nt_seqs_convert_mode.xml   test_proteins.fasta      test_proteins_redundant.fasta
data                interproscan.sh          src         test_convert_mode.xml    test_nt_seqs.fasta              test_proteins.fasta.tsv  test_single_protein.fasta
interproscan-5.jar  lib                      temp        test_nt_redundant.fasta  test_proteins_convert_mode.xml  test_proteins_new.fasta  work
Step 2. Set up an example InterProScan-5.26-65.0 run. Create a working directory called "interproscan_test" on your home directory using the mkdir command and use cd to move into that directory:
$ cd ~
$ mkdir interproscan_test
$ cd interproscan_test
Step 3. Copy the example protein file into the current directory using cp command. Verify using the ls command.
$ cp /opt/interproscan-5.26-65.0/test_proteins.fasta .
Step 4. Run the InterProScan-5.26-65.0 command with the --help flag to get a usage statement and list of options:
$ interproscan.sh --help
17/10/2017 19:39:35:641 Welcome to InterProScan-5.26-65.0
usage: java -XX:+UseParallelGC -XX:ParallelGCThreads=2 -XX:+AggressiveOpts -XX:+UseFastAccessorMethods -Xms128M -Xmx2048M -jar interproscan-5.jar

Please give us your feedback by sending an email to interhelp@ebi.ac.uk

 -appl,--applications <ANALYSES>            Optional, comma separated list of analyses. If this option is not set, ALL analyses will be run.
 -b,--output-file-base <OUTPUT-FILE-BASE>   Optional, base output filename (relative or absolute path). Note that this option, the --output-dir (-d) option and the --outfile (-o) option are mutually exclusive. The appropriate file extension for the output format(s) will be appended automatically. By default the input file path/name will be used.
 -cpu,--cpu <CPU>                           Optional, number of cores for inteproscan.
 -d,--output-dir <OUTPUT-DIR>               Optional, output directory. Note that this option, the --outfile (-o) option and the --output-file-base (-b) option are mutually exclusive. The output filename(s) are the same as the input filename, with the appropriate file extension(s) for the output format(s) appended automatically.
 -dp,--disable-precalc                      Optional. Disables use of the precalculated match lookup service. All match calculations will be run locally.
 -dra,--disable-residue-annot               Optional, excludes sites from the XML, JSON output
 -f,--formats <OUTPUT-FORMATS>              Optional, case-insensitive, comma separated list of output formats. Supported formats are TSV, XML, JSON, GFF3, HTML and SVG. Default for protein sequences are TSV, XML and GFF3, or for nucleotide sequences GFF3 and XML.
 -goterms,--goterms                         Optional, switch on lookup of corresponding Gene Ontology annotation (IMPLIES -iprlookup option)
 -help,--help                               Optional, display help information
 -i,--input <INPUT-FILE-PATH>               Optional, path to fasta file that should be loaded on Master startup. Alternatively, in CONVERT mode, the InterProScan 5 XML file to convert.
 -iprlookup,--iprlookup                     Also include lookup of corresponding InterPro annotation in the TSV and GFF3 output formats.
 -ms,--minsize <MINIMUM-SIZE>               Optional, minimum nucleotide size of ORF to report. Will only be considered if n is specified as a sequence type. Please be aware of the fact that if you specify a too short value it might be that the analysis takes a very long time!
 -o,--outfile <EXPLICIT_OUTPUT_FILENAME>    Optional explicit output file name (relative or absolute path). Note that this option, the --output-dir (-d) option and the --output-file-base (-b) option are mutually exclusive. If this option is given, you MUST specify a single output format using the -f option. The output file name will not be modified. Note that specifying an output file name using this option OVERWRITES ANY EXISTING FILE.
 -pa,--pathways                             Optional, switch on lookup of corresponding Pathway annotation (IMPLIES -iprlookup option)
 -t,--seqtype <SEQUENCE-TYPE>               Optional, the type of the input sequences (dna/rna (n) or protein (p)). The default sequence type is protein.
 -T,--tempdir <TEMP-DIR>                    Optional, specify temporary file directory (relative or absolute path). The default location is temp/.
 -version,--version                         Optional, display version number
 -vtsv,--output-tsv-version                 Optional, includes a TSV version file along with any TSV output (when TSV output requested)

Available analyses:
    TIGRFAM (15.0) : TIGRFAMs are protein families based on Hidden Markov Models or HMMs
    SFLD (3) : SFLDs are protein families based on Hidden Markov Models or HMMs
    SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.
    Gene3D (4.1.0) : Structural assignment for whole genes and genomes using the CATH domain structure database
    Hamap (201701.18) : High-quality Automated and Manual Annotation of Microbial Proteomes
    Coils (2.2.1) : Prediction of Coiled Coil Regions in Proteins
    ProSiteProfiles (20.132) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
    SMART (7.1) : SMART allows the identification and analysis of domain architectures based on Hidden Markov Models or HMMs
    CDD (3.16) : Prediction of CDD domains in Proteins
    PRINTS (42.0) : A fingerprint is a group of conserved motifs used to characterise a protein family
    ProSitePatterns (20.132) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
    Pfam (31.0) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
    ProDom (2006.1) : ProDom is a comprehensive set of protein domain families automatically generated from the UniProt Knowledge Database.
    MobiDBLite (1.0) : Prediction of disordered domains Regions in Proteins
    PIRSF (3.02) : The PIRSF concept is being used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.

Deactivated analyses:
    PANTHER (12.0) : Analysis Panther is deactivated, because the resources expected at the following paths do not exist: data/panther/12.0/panther.hmm, data/panther/12.0/names.tab
    SignalP_GRAM_POSITIVE (4.1) : Analysis SignalP_GRAM_POSITIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
    SignalP_GRAM_NEGATIVE (4.1) : Analysis SignalP_GRAM_NEGATIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
    Phobius (1.01) : Analysis Phobius is deactivated, because the resources expected at the following paths do not exist: bin/phobius/1.01/phobius.pl
    TMHMM (2.0c) : Analysis TMHMM is deactivated, because the resources expected at the following paths do not exist: bin/tmhmm/2.0c/decodeanhmm, data/tmhmm/2.0c/TMHMM2.0c.model
    SignalP_EUK (4.1) : Analysis SignalP_EUK is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
Step 5: Do a simple InterProScan test run
$ interproscan.sh -appl PfamA -iprlookup -i test_proteins.fasta
17/10/2017 23:45:40:711 Welcome to InterProScan-5.26-65.0
17/10/2017 23:45:45:981 Running InterProScan v5 in STANDALONE mode... on Linux
17/10/2017 23:46:00:068 Loading file /home/upendra/interproscan_test/test_proteins.fasta
17/10/2017 23:46:00:071 Running the following analyses:
[Pfam-31.0]
Available matches will be retrieved from the pre-calculated match lookup service. Matches for any sequences that are not represented in the lookup service will be calculated locally.
17/10/2017 23:46:07:031 100% done: InterProScan analyses completed

$ ls
temp  test_proteins.fasta  test_proteins.fasta.gff3  test_proteins.fasta.tsv  test_proteins.fasta.xml
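The TSV file is the easiest output to inspect by eye. In InterProScan 5 TSV output, the leading columns hold the protein accession, sequence MD5, sequence length, analysis name, signature accession and description, and the match start and stop positions; a quick sketch using those column positions:

```shell
# Show protein ID, analysis, signature accession and match coordinates
# for the first few hits
cut -f1,4,5,7,8 test_proteins.fasta.tsv | head
```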
Step 6: InterProScan with Pfam and GO annotation
$ interproscan.sh -appl PfamA -iprlookup -goterms -i test_proteins.fasta
Step 7: InterProScan with Pfam, GO and pathway (including KEGG) annotation
$ interproscan.sh -appl PfamA -iprlookup -goterms -pa -cpu 10 -i test_proteins.fasta
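With -goterms enabled, GO identifiers are appended to the TSV rows. Because GO terms follow the fixed GO:NNNNNNN pattern, you can pull the unique terms out of the output without relying on a particular column position; a small sketch:

```shell
# List the unique GO identifiers assigned across all proteins
grep -o 'GO:[0-9]\{7\}' test_proteins.fasta.tsv | sort -u
```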
Moving data from CyVerse Datastore using iCommands
iCommands is a collection of commands for Linux and Mac OS operating systems that are used in the iRODS system to interact with the CyVerse Data Store. Many commands are very similar to Unix utilities. For example, to list files and directories, in Linux you use ls, but in iCommands you use ils.
While iCommands are great for all transfers and for automating tasks via scripts, they are the best choice for large files (2-100 GB each) and for bulk file transfers (many small files). For a comparison of the different methods of uploading and downloading data items, see Downloading and Uploading Data.
iCommands can be used by CyVerse account users to download files that have been shared by other users and to upload files to the Data Store, as well as add metadata, change permissions, and more. Commonly used iCommands are listed below. Follow the instructions on Setting Up iCommands for how to download and configure iCommands for your operating system.
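A few of the most commonly used iCommands are sketched here; the paths and file names are illustrative, and <username> is a placeholder for your CyVerse username:

```shell
iinit     # one-time setup: log in to the Data Store (prompts for host, port, zone and username)
ils       # list files and collections in your current iRODS directory
icd /iplant/home/<username>     # change your iRODS working directory
iput -P results.tsv             # upload a local file, showing progress
iget -P remote_file.fasta .     # download a file from the Data Store
imkdir analyses                 # create a new collection (directory)
```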
A CyVerse account is not required to download a public data file via iCommands. To see instructions just for public data download with iCommands, see the iCommands section on Downloading Data Files Without a User Account.
Before you begin, run the command below:
$ export IRODS_PLUGINS_HOME=/opt/icommands/plugins/
For configuring iCommands and the different commands that can be used to move data in and out of the Data Store, please refer to this link. You may also want to watch a CyVerse video about iCommands.