InterProScan-5.26-65.0 Jetstream Tutorial
What is InterProScan?
InterPro is an integrated database of predictive protein "signatures" used for the classification and automatic annotation of proteins and genomes. InterPro classifies sequences at superfamily, family and subfamily levels, predicting the occurrence of functional domains, repeats and important sites. InterPro adds in-depth annotation, including GO terms, to the protein signatures.
More: http://www.ebi.ac.uk/interpro/
Which databases are used in InterPro?
InterPro combines a number of databases (referred to as member databases) that use different methodologies and a varying degree of biological information on well-characterised proteins to derive protein signatures. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful integrated database and diagnostic tool (InterProScan).
The member databases use a number of approaches:
- ProDom: provider of sequence-clusters built from UniProtKB using PSI-BLAST.
- PROSITE patterns: provider of simple regular expressions.
- PROSITE and HAMAP profiles: provide sequence matrices.
- PRINTS: provider of fingerprints, which are groups of aligned, un-weighted Position Specific Sequence Matrices (PSSMs).
- PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY: providers of hidden Markov models (HMMs).
Diagnostically, these resources have different areas of optimum application owing to the different underlying analysis methods. In terms of family coverage, the protein signature databases are similar in size but differ in content. While all of the methods share a common interest in protein sequence classification, some focus on divergent domains (e.g., Pfam), some focus on functional sites (e.g., PROSITE), and others focus on families, specialising in hierarchical definitions from superfamily down to subfamily levels in order to pin-point specific functions (e.g., PRINTS). TIGRFAMs focuses on building HMMs for functionally equivalent proteins, while PIRSF always produces HMMs over the full length of a protein and applies protein length restrictions when gathering family members. HAMAP profiles are manually created by expert curators; they identify proteins that belong to well-conserved bacterial, archaeal and plastid-encoded protein families or subfamilies. PANTHER builds HMMs based on the divergence of function within families. SUPERFAMILY and Gene3D are based on structure, using the SCOP and CATH superfamilies, respectively, as a basis for building HMMs.
Considerations
Sounds great, what do I need to get started?
- XSEDE account
- XSEDE allocation (you can also request trial access)
- Your data (or you can run example data)
What kind of data do I need?
- Mandatory requirement: a protein sequence file, in FASTA format only
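The input must be amino-acid sequences in FASTA format. As a quick sanity check before a run, the sketch below shows what a valid protein FASTA file looks like and a one-liner to count its records (the file name and sequences are made-up examples, not the staged test data):

```shell
# Illustrative protein FASTA file: headers start with '>', sequences follow.
# File name and sequences here are made-up examples.
cat > my_proteins.fasta <<'EOF'
>protein_1 example sequence
MKLVINLLLVAAASA
>protein_2 another example
MSTNPKPQRKTKRNTNRRPQDVKFPGG
EOF

# Count the sequences: one '>' header per record
grep -c '>' my_proteins.fasta   # prints 2
```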
What kind of resources will I need for my project?
- Enough storage space on the Jetstream instance for both input and output files
- Creating and attaching an external volume to the running instance would be recommended
https://iujetstream.atlassian.net/wiki/spaces/JWT/pages/32899113/Volumes
- Enough AUs to run your computation
Part 1: Connect to an instance of an InterProScan-5.26-65.0 Jetstream Image (virtual machine)
Step 1. Go to https://use.jetstream-cloud.org/application and log in with your XSEDE credentials.
Step 2. Click "Create New Project" in the Projects tab at the top, then enter the name of the project and a brief description.
Step 3. Launch an instance from the selected image
After the project has been created, enter it, click the "New" button, select the "InterProScan-5.26-65.0" image, and then click "Launch Instance". In the next window (Basic Info):
- Leave the name as default
- set base image version as "1.0.0" (default)
- leave the project as it is or change to a different project if needed
- select "Jetstream - Indiana University" or "Jetstream - TACC" as the Provider and click "Continue". Your choice of provider will depend on the resources (AUs) you have available and the needs of your instance
- select "s1.large" as the Instance size (this is the minimum size required by the InterProScan-5.26-65.0 image) and click "Continue".
Step 4. As the instance is launched behind the scenes, you will get an update as it goes through each step.
Status updates during instance launch include Build-requesting launch, Build-networking, Build-spawning, Active-networking and Active-deploying. Depending on the usage load on Jetstream, it can take anywhere from 2-5 minutes for an instance to become active. You can force a status update using the refresh button on the Instance launch page or your browser's refresh button. Once the instance becomes active, a virtual machine with the IP address provided will become available for you to connect to. This virtual machine will have all the necessary components to run InterProScan, plus test files for running an InterProScan demo.
Step 5. Access the instance either through web-shell or terminal
Jetstream provides web-shell, a web-based terminal, for accessing your VM at the command line level once it has been deployed.
However, you may wish to access your VM via SSH if you have provisioned it with a routable IP address. For SSH access, create (or copy) SSH public keys on your non-Jetstream computer that will allow it to access Jetstream, then deposit those keys in your Atmosphere settings. More instructions can be found here
$ ssh <username>@<ip.address>
Part 2: Set up an InterProScan-5.26-65.0 run using the Terminal window
Step 1. Get oriented. You will find staged example data in "/opt/interproscan-5.26-65.0/" within the instance. List its contents with the ls command:
$ ls /opt/interproscan-5.26-65.0/
bin                 interproscan.properties  readme.txt  test_all_appl.fasta      test_nt_seqs_convert_mode.xml   test_proteins.fasta            test_proteins_redundant.fasta
data                interproscan.sh          src         test_convert_mode.xml    test_nt_seqs.fasta              test_proteins.fasta.tsv        test_single_protein.fasta
interproscan-5.jar  lib                      temp        test_nt_redundant.fasta  test_proteins_convert_mode.xml  test_proteins_new.fasta        work
Step 2. Set up an example InterProScan-5.26-65.0 run. Create a working directory called "interproscan_test" in your home directory using the mkdir command, then use cd to move into that directory:
$ cd ~
$ mkdir interproscan_test
$ cd interproscan_test
Step 3. Copy the example protein file into the current directory using the cp command. Verify using the ls command.
$ cp /opt/interproscan-5.26-65.0/test_proteins.fasta .
Step 4. Run the InterProScan-5.26-65.0 command with the --help flag to get a usage statement and list of options:
$ interproscan.sh --help
17/10/2017 19:39:35:641 Welcome to InterProScan-5.26-65.0
usage: java -XX:+UseParallelGC -XX:ParallelGCThreads=2 -XX:+AggressiveOpts
-XX:+UseFastAccessorMethods -Xms128M -Xmx2048M -jar
interproscan-5.jar
Please give us your feedback by sending an email to interhelp@ebi.ac.uk
-appl,--applications <ANALYSES> Optional, comma separated list of analyses.
If this option
is not set, ALL analyses will
be run.
-b,--output-file-base <OUTPUT-FILE-BASE> Optional, base output filename
(relative or absolute path).
Note that this option, the
--output-dir (-d) option and
the --outfile (-o) option are
mutually exclusive. The
appropriate file extension for
the output format(s) will be
appended automatically. By
default the input file
path/name will be used.
-cpu,--cpu <CPU> Optional, number of cores for
inteproscan.
-d,--output-dir <OUTPUT-DIR> Optional, output directory.
Note that this option, the
--outfile (-o) option and the
--output-file-base (-b) option
are mutually exclusive. The
output filename(s) are the
same as the input filename,
with the appropriate file
extension(s) for the output
format(s) appended
automatically .
-dp,--disable-precalc Optional. Disables use of the
precalculated match lookup
service. All match
calculations will be run
locally.
-dra,--disable-residue-annot Optional, excludes sites from
the XML, JSON output
-f,--formats <OUTPUT-FORMATS> Optional, case-insensitive,
comma separated list of output
formats. Supported formats are
TSV, XML, JSON, GFF3, HTML and
SVG. Default for protein
sequences are TSV, XML and
GFF3, or for nucleotide
sequences GFF3 and XML.
-goterms,--goterms Optional, switch on lookup of
corresponding Gene Ontology
annotation (IMPLIES -iprlookup
option)
-help,--help Optional, display help
information
-i,--input <INPUT-FILE-PATH> Optional, path to fasta file
that should be loaded on
Master startup. Alternatively,
in CONVERT mode, the
InterProScan 5 XML file to
convert.
-iprlookup,--iprlookup Also include lookup of
corresponding InterPro
annotation in the TSV and GFF3
output formats.
-ms,--minsize <MINIMUM-SIZE> Optional, minimum nucleotide
size of ORF to report. Will
only be considered if n is
specified as a sequence type.
Please be aware of the fact
that if you specify a too
short value it might be that
the analysis takes a very long
time!
-o,--outfile <EXPLICIT_OUTPUT_FILENAME> Optional explicit output file
name (relative or absolute
path). Note that this option,
the --output-dir (-d) option
and the --output-file-base
(-b) option are mutually
exclusive. If this option is
given, you MUST specify a
single output format using the
-f option. The output file
name will not be modified.
Note that specifying an output
file name using this option
OVERWRITES ANY EXISTING FILE.
-pa,--pathways Optional, switch on lookup of
corresponding Pathway
annotation (IMPLIES -iprlookup
option)
-t,--seqtype <SEQUENCE-TYPE> Optional, the type of the
input sequences (dna/rna (n)
or protein (p)). The default
sequence type is protein.
-T,--tempdir <TEMP-DIR> Optional, specify temporary
file directory (relative or
absolute path). The default
location is temp/.
-version,--version Optional, display version
number
-vtsv,--output-tsv-version Optional, includes a TSV
version file along with any
TSV output (when TSV output
requested)
Available analyses:
TIGRFAM (15.0) : TIGRFAMs are protein families based on Hidden Markov Models or HMMs
SFLD (3) : SFLDs are protein families based on Hidden Markov Models or HMMs
SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.
Gene3D (4.1.0) : Structural assignment for whole genes and genomes using the CATH domain structure database
Hamap (201701.18) : High-quality Automated and Manual Annotation of Microbial Proteomes
Coils (2.2.1) : Prediction of Coiled Coil Regions in Proteins
ProSiteProfiles (20.132) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
SMART (7.1) : SMART allows the identification and analysis of domain architectures based on Hidden Markov Models or HMMs
CDD (3.16) : Prediction of CDD domains in Proteins
PRINTS (42.0) : A fingerprint is a group of conserved motifs used to characterise a protein family
ProSitePatterns (20.132) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
Pfam (31.0) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
ProDom (2006.1) : ProDom is a comprehensive set of protein domain families automatically generated from the UniProt Knowledge Database.
MobiDBLite (1.0) : Prediction of disordered domains Regions in Proteins
PIRSF (3.02) : The PIRSF concept is being used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.
Deactivated analyses:
PANTHER (12.0) : Analysis Panther is deactivated, because the resources expected at the following paths do not exist: data/panther/12.0/panther.hmm, data/panther/12.0/names.tab
SignalP_GRAM_POSITIVE (4.1) : Analysis SignalP_GRAM_POSITIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
SignalP_GRAM_NEGATIVE (4.1) : Analysis SignalP_GRAM_NEGATIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
Phobius (1.01) : Analysis Phobius is deactivated, because the resources expected at the following paths do not exist: bin/phobius/1.01/phobius.pl
TMHMM (2.0c) : Analysis TMHMM is deactivated, because the resources expected at the following paths do not exist: bin/tmhmm/2.0c/decodeanhmm, data/tmhmm/2.0c/TMHMM2.0c.model
SignalP_EUK (4.1) : Analysis SignalP_EUK is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
Step 5: Do a simple InterProScan test run
$ interproscan.sh -appl PfamA -iprlookup -i test_proteins.fasta
17/10/2017 23:45:40:711 Welcome to InterProScan-5.26-65.0
17/10/2017 23:45:45:981 Running InterProScan v5 in STANDALONE mode... on Linux
17/10/2017 23:46:00:068 Loading file /home/upendra/interproscan_test/test_proteins.fasta
17/10/2017 23:46:00:071 Running the following analyses:
[Pfam-31.0]
Available matches will be retrieved from the pre-calculated match lookup service. Matches for any sequences that are not represented in the lookup service will be calculated locally.
17/10/2017 23:46:07:031 100% done: InterProScan analyses completed
$ ls
temp  test_proteins.fasta  test_proteins.fasta.gff3  test_proteins.fasta.tsv  test_proteins.fasta.xml
Step 6: InterProScan with Pfam and GO annotation
$ interproscan.sh -appl PfamA -iprlookup -goterms -i test_proteins.fasta
Step 7: InterProScan with Pfam, GO and pathway (including KEGG) annotation
$ interproscan.sh -appl PfamA -iprlookup -goterms -pa -cpu 10 -i test_proteins.fasta
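Once a run like this finishes, the TSV file is usually the easiest output to post-process. The sketch below pulls a few fields out of it; the column layout is assumed from the InterProScan 5 output-format documentation, and the sample line written by printf is mock data, not a real result:

```shell
# Assumed TSV columns (InterProScan 5): 1=protein accession, 2=sequence MD5,
# 3=length, 4=analysis, 5=signature accession, 6=signature description,
# 7=start, 8=stop, 9=score, 10=status, 11=date, 12=InterPro accession
# (with -iprlookup), 13=InterPro description, 14=GO terms (with -goterms).
# Mock data line for illustration only:
printf 'P51587\tmd5-checksum\t3418\tPfam\tPF09103\tBRCA2 OB domain 1\t2670\t2799\t7.9E-43\tT\t15-03-2013\tIPR015252\tBRCA2 OB domain 1\tGO:0005515\n' > sample.tsv

# Pull out the protein, signature, InterPro and GO columns
awk -F'\t' '{print $1, $5, $12, $14}' sample.tsv
# prints: P51587 PF09103 IPR015252 GO:0005515
```

With -pa, pathway annotations appear in a further column, so the same awk pattern extends to them.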
Moving data from the CyVerse Data Store using iCommands
iCommands is a collection of commands for Linux and Mac OS operating systems that are used in the iRODS system to interact with the CyVerse Data Store. Many commands are very similar to Unix utilities. For example, to list files and directories, in Linux you use ls, but in iCommands you use ils.
iCommands are suitable for all transfers and for automating tasks via scripts, and they are the best choice for large files (2-100 GB each) and for bulk transfers (many small files). For a comparison of the different methods of uploading and downloading data items, see Downloading and Uploading Data.
iCommands can be used by CyVerse account users to download files that have been shared by other users and to upload files to the Data Store, as well as add metadata, change permissions, and more. Commonly used iCommands are listed below. Follow the instructions on Setting Up iCommands for how to download and configure iCommands for your operating system.
A CyVerse account is not required to download a public data file via iCommands. To see instructions just for public data download with iCommands, see the iCommands section on Downloading Data Files Without a User Account.
Before you begin, run the command below:
$ export IRODS_PLUGINS_HOME=/opt/icommands/plugins/
For configuring iCommands and the different commands that can be used to move data in and out of the Data Store, please refer to this link. You may want to watch a CyVerse video about iCommands.
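A typical session might look like the sketch below. The transfer commands require a configured CyVerse account, so they are shown commented out, and the path under /iplant/home/ is a placeholder:

```shell
# Plugin path needed by the iCommands build on this image (the export above)
export IRODS_PLUGINS_HOME=/opt/icommands/plugins/

# Typical session (requires a CyVerse account; shown for reference only):
# iinit                                     # one-time setup: host, port, zone, username, password
# ils                                       # list your Data Store home collection
# iput -P test_proteins.fasta.tsv           # upload a results file, with progress
# iget -P /iplant/home/<user>/input.fasta . # download a file into the current directory
# iexit                                     # log out

echo "$IRODS_PLUGINS_HOME"   # prints /opt/icommands/plugins/
```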