InterProScan-5.26-65.0 Jetstream Tutorial

What is InterProScan?

InterPro is an integrated database of predictive protein "signatures" used for the classification and automatic annotation of proteins and genomes. InterPro classifies sequences at superfamily, family and subfamily levels, predicting the occurrence of functional domains, repeats and important sites. InterPro adds in-depth annotation, including GO terms, to the protein signatures.

More: http://www.ebi.ac.uk/interpro/

Reference: Hunter S, et al. InterPro: the integrative protein signature database. Nucleic Acids Res. 2009 Jan;37(Database issue):D211-5. Epub 2008 Oct 21.

Which databases are used in InterPro?

InterPro combines a number of databases (referred to as member databases) that use different methodologies and a varying degree of biological information on well-characterised proteins to derive protein signatures. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful integrated database and diagnostic tool (InterProScan).

The member databases use a number of approaches:

  • ProDom: provider of sequence-clusters built from UniProtKB using PSI-BLAST.
  • PROSITE patterns: provider of simple regular expressions.
  • PROSITE and HAMAP profiles: providers of sequence matrices.
  • PRINTS: provider of fingerprints, which are groups of aligned, un-weighted Position Specific Sequence Matrices (PSSMs).
  • PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY: providers of hidden Markov models (HMMs).

Diagnostically, these resources have different areas of optimum application owing to the different underlying analysis methods. In terms of family coverage, the protein signature databases are similar in size but differ in content. While all of the methods share a common interest in protein sequence classification, some focus on divergent domains (e.g., Pfam), some focus on functional sites (e.g., PROSITE), and others focus on families, specialising in hierarchical definitions from superfamily down to subfamily levels in order to pinpoint specific functions (e.g., PRINTS). TIGRFAMs focuses on building HMMs for functionally equivalent proteins, while PIRSF always produces HMMs over the full length of a protein and applies protein length restrictions when gathering family members. HAMAP profiles are manually created by expert curators; they identify proteins that belong to well-conserved bacterial, archaeal and plastid-encoded protein families or subfamilies. PANTHER builds HMMs based on the divergence of function within families. SUPERFAMILY and Gene3D are based on structure, using the SCOP and CATH superfamilies, respectively, as a basis for building HMMs.

Considerations

Sounds great, what do I need to get started?

  1. XSEDE account
  2. XSEDE allocation. You can also request trial access
  3. Your data (or you can run example data)

What kind of data do I need?

  1. Mandatory requirement: a protein sequence file in FASTA format only (see the example snippet below)
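A protein FASTA file is plain text: each record starts with a header line beginning with ">" followed by one or more lines of amino acid sequence. A minimal, made-up example (the identifiers and sequences here are purely illustrative):

>example_protein_1
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR
>example_protein_2
MSDNEELFQAVKIIDTKKAPTVLAQFGEWLKQHPEIA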

What kind of resources will I need for my project?

  1. Enough storage space on the Jetstream instance for both input and output files
    1. Creating and attaching an external volume to the running instance is recommended
      https://iujetstream.atlassian.net/wiki/spaces/JWT/pages/32899113/Volumes
  2. Enough AUs to run your computation

Part 1: Connect to an instance of an InterProScan-5.26-65.0 Jetstream Image (virtual machine)

Step 1. Go to https://use.jetstream-cloud.org/application and log in with your XSEDE credentials.

 

Step 2. Click "Create New Project" in the Projects tab at the top, then enter a name and a brief description for the project.


Step 3. Launch an instance from the selected image 

After the project has been created, enter it, click the "New" button, select the "InterProScan-5.26-65.0" image, and then click "Launch Instance". In the next window (Basic Info):

  • Leave the name as the default
  • Set the base image version to "1.0.0" (the default)
  • Leave the project as it is, or change to a different project if needed
  • Select "Jetstream - Indiana University" or "Jetstream - TACC" as the provider and click "Continue". Your choice of provider will depend on the resources you have available (AUs) and the needs of your instance
  • Select "s1.large" as the instance size (this is the minimum size required by the InterProScan-5.26-65.0 image) and click "Continue".

Step 4. As the instance launches behind the scenes, you will see status updates as it goes through each step.

Status updates during the instance launch include Build-requesting launch, Build-networking, Build-spawning, Active-networking and Active-deploying. Depending on the usage load on Jetstream, it can take anywhere from 2-5 minutes for an instance to become active. You can force a status check using the refresh button on the instance launch page or the refresh button in your browser. Once the instance becomes active, a virtual machine with the IP address shown will be available for you to connect to. This virtual machine will have all the necessary components to run InterProScan, as well as test files to run an InterProScan demo.

Step 5. Access the instance through either the web-shell or a terminal

Jetstream provides web-shell, a web-based terminal, for accessing your VM at the command line level once it has been deployed.

However, you may wish to access your VM via SSH if it has been provisioned with a routable IP address. For SSH access, create (or copy) an SSH public key on your non-Jetstream computer and deposit that key in your Atmosphere settings. More instructions can be found here.
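If you do not already have an SSH key pair, a minimal sketch for generating one on your local machine and printing the public key (which you then paste into your Atmosphere settings) is:

$ ssh-keygen -t rsa -b 4096    # accept the default file location when prompted
$ cat ~/.ssh/id_rsa.pub        # copy this public key into your Atmosphere settings

Once the key is registered, connect to the instance: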

$ ssh <username>@<ip.address>

 

Part 2: Set up an InterProScan-5.26-65.0 run using the terminal window

Step 1. Get oriented. You will find staged example data in "/opt/interproscan-5.26-65.0/" within the instance. List its contents with the ls command:

$ ls /opt/interproscan-5.26-65.0/
bin                 interproscan.properties  readme.txt  test_all_appl.fasta      test_nt_seqs_convert_mode.xml   test_proteins.fasta      test_proteins_redundant.fasta
data                interproscan.sh          src         test_convert_mode.xml    test_nt_seqs.fasta              test_proteins.fasta.tsv  test_single_protein.fasta
interproscan-5.jar  lib                      temp        test_nt_redundant.fasta  test_proteins_convert_mode.xml  test_proteins_new.fasta  work

Step 2. Set up an example InterProScan-5.26-65.0 run. Create a working directory called "interproscan_test" in your home directory using the mkdir command and use cd to move into that directory:

$ cd ~
$ mkdir interproscan_test
$ cd interproscan_test

Step 3. Copy the example protein file into the current directory using the cp command, then verify that it is there with the ls command (as shown below).

$ cp /opt/interproscan-5.26-65.0/test_proteins.fasta .
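The listing should now show the example file:

$ ls
test_proteins.fasta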

Step 4. Run the InterProScan-5.26-65.0 command with the --help flag to get a usage statement and list of options:

$ interproscan.sh --help

17/10/2017 19:39:35:641 Welcome to InterProScan-5.26-65.0

usage: java -XX:+UseParallelGC -XX:ParallelGCThreads=2 -XX:+AggressiveOpts
            -XX:+UseFastAccessorMethods -Xms128M -Xmx2048M -jar
            interproscan-5.jar

Please give us your feedback by sending an email to interhelp@ebi.ac.uk

 -appl,--applications <ANALYSES>            Optional, comma separated list
                                            of analyses. If this option
                                            is not set, ALL analyses will
                                            be run.
 -b,--output-file-base <OUTPUT-FILE-BASE>   Optional, base output filename
                                            (relative or absolute path).
                                            Note that this option, the
                                            --output-dir (-d) option and
                                            the --outfile (-o) option are
                                            mutually exclusive.  The
                                            appropriate file extension for
                                            the output format(s) will be
                                            appended automatically. By
                                            default the input file
                                            path/name will be used.
 -cpu,--cpu <CPU>                           Optional, number of cores for
                                            inteproscan.
 -d,--output-dir <OUTPUT-DIR>               Optional, output directory.
                                            Note that this option, the
                                            --outfile (-o) option and the
                                            --output-file-base (-b) option
                                            are mutually exclusive. The
                                            output filename(s) are the
                                            same as the input filename,
                                            with the appropriate file
                                            extension(s) for the output
                                            format(s) appended
                                            automatically .
 -dp,--disable-precalc                      Optional.  Disables use of the
                                            precalculated match lookup
                                            service.  All match
                                            calculations will be run
                                            locally.
 -dra,--disable-residue-annot               Optional, excludes sites from
                                            the XML, JSON output
 -f,--formats <OUTPUT-FORMATS>              Optional, case-insensitive,
                                            comma separated list of output
                                            formats. Supported formats are
                                            TSV, XML, JSON, GFF3, HTML and
                                            SVG. Default for protein
                                            sequences are TSV, XML and
                                            GFF3, or for nucleotide
                                            sequences GFF3 and XML.
 -goterms,--goterms                         Optional, switch on lookup of
                                            corresponding Gene Ontology
                                            annotation (IMPLIES -iprlookup
                                            option)
 -help,--help                               Optional, display help
                                            information
 -i,--input <INPUT-FILE-PATH>               Optional, path to fasta file
                                            that should be loaded on
                                            Master startup. Alternatively,
                                            in CONVERT mode, the
                                            InterProScan 5 XML file to
                                            convert.
 -iprlookup,--iprlookup                     Also include lookup of
                                            corresponding InterPro
                                            annotation in the TSV and GFF3
                                            output formats.
 -ms,--minsize <MINIMUM-SIZE>               Optional, minimum nucleotide
                                            size of ORF to report. Will
                                            only be considered if n is
                                            specified as a sequence type.
                                            Please be aware of the fact
                                            that if you specify a too
                                            short value it might be that
                                            the analysis takes a very long
                                            time!
 -o,--outfile <EXPLICIT_OUTPUT_FILENAME>    Optional explicit output file
                                            name (relative or absolute
                                            path).  Note that this option,
                                            the --output-dir (-d) option
                                            and the --output-file-base
                                            (-b) option are mutually
                                            exclusive. If this option is
                                            given, you MUST specify a
                                            single output format using the
                                            -f option.  The output file
                                            name will not be modified.
                                            Note that specifying an output
                                            file name using this option
                                            OVERWRITES ANY EXISTING FILE.
 -pa,--pathways                             Optional, switch on lookup of
                                            corresponding Pathway
                                            annotation (IMPLIES -iprlookup
                                            option)
 -t,--seqtype <SEQUENCE-TYPE>               Optional, the type of the
                                            input sequences (dna/rna (n)
                                            or protein (p)).  The default
                                            sequence type is protein.
 -T,--tempdir <TEMP-DIR>                    Optional, specify temporary
                                            file directory (relative or
                                            absolute path). The default
                                            location is temp/.
 -version,--version                         Optional, display version
                                            number
 -vtsv,--output-tsv-version                 Optional, includes a TSV
                                            version file along with any
                                            TSV output (when TSV output
                                            requested)
Available analyses:
                      TIGRFAM (15.0) : TIGRFAMs are protein families based on Hidden Markov Models or HMMs
                         SFLD (3) : SFLDs are protein families based on Hidden Markov Models or HMMs
                  SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.
                       Gene3D (4.1.0) : Structural assignment for whole genes and genomes using the CATH domain structure database
                        Hamap (201701.18) : High-quality Automated and Manual Annotation of Microbial Proteomes
                        Coils (2.2.1) : Prediction of Coiled Coil Regions in Proteins
              ProSiteProfiles (20.132) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
                        SMART (7.1) : SMART allows the identification and analysis of domain architectures based on Hidden Markov Models or HMMs
                          CDD (3.16) : Prediction of CDD domains in Proteins
                       PRINTS (42.0) : A fingerprint is a group of conserved motifs used to characterise a protein family
              ProSitePatterns (20.132) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
                         Pfam (31.0) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
                       ProDom (2006.1) : ProDom is a comprehensive set of protein domain families automatically generated from the UniProt Knowledge Database.
                   MobiDBLite (1.0) : Prediction of disordered domains Regions in Proteins
                        PIRSF (3.02) : The PIRSF concept is being used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.

Deactivated analyses:
                      PANTHER (12.0) : Analysis Panther is deactivated, because the resources expected at the following paths do not exist: data/panther/12.0/panther.hmm, data/panther/12.0/names.tab
        SignalP_GRAM_POSITIVE (4.1) : Analysis SignalP_GRAM_POSITIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
        SignalP_GRAM_NEGATIVE (4.1) : Analysis SignalP_GRAM_NEGATIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
                      Phobius (1.01) : Analysis Phobius is deactivated, because the resources expected at the following paths do not exist: bin/phobius/1.01/phobius.pl
                        TMHMM (2.0c) : Analysis TMHMM is deactivated, because the resources expected at the following paths do not exist: bin/tmhmm/2.0c/decodeanhmm, data/tmhmm/2.0c/TMHMM2.0c.model
                  SignalP_EUK (4.1) : Analysis SignalP_EUK is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp

Step 5: Do a simple InterProScan test run

$ interproscan.sh -appl PfamA -iprlookup -i test_proteins.fasta
17/10/2017 23:45:40:711 Welcome to InterProScan-5.26-65.0
17/10/2017 23:45:45:981 Running InterProScan v5 in STANDALONE mode... on Linux
17/10/2017 23:46:00:068 Loading file /home/upendra/interproscan_test/test_proteins.fasta
17/10/2017 23:46:00:071 Running the following analyses:
[Pfam-31.0]
Available matches will be retrieved from the pre-calculated match lookup service.
Matches for any sequences that are not represented in the lookup service will be calculated locally.
17/10/2017 23:46:07:031 100% done:  InterProScan analyses completed
 
$ ls
temp  test_proteins.fasta  test_proteins.fasta.gff3  test_proteins.fasta.tsv  test_proteins.fasta.xml 

Step 6: InterProScan with Pfam and GO annotation

$ interproscan.sh -appl PfamA -iprlookup -goterms -i test_proteins.fasta 
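The -goterms flag adds Gene Ontology identifiers to the output wherever the matched signatures map to InterPro entries with GO annotation. As a quick, illustrative check that GO terms were included, count the lines of the TSV output that contain a GO identifier:

$ grep -c "GO:" test_proteins.fasta.tsv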

Step 7: InterProScan with Pfam, GO and KEGG pathway annotation

$ interproscan.sh -appl PfamA -iprlookup -goterms -pa -cpu 10 -i test_proteins.fasta 
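Once the run completes, the tab-separated output can be inspected with standard Unix tools. As a sketch (the TSV columns include the protein accession, the analysis and the matched signature, plus InterPro, GO and pathway annotations when the corresponding lookups are enabled):

$ head -n 5 test_proteins.fasta.tsv                 # peek at the first few result rows
$ cut -f 1,4,5 test_proteins.fasta.tsv | sort -u    # unique protein / analysis / signature combinations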

Moving data to and from the CyVerse Data Store using iCommands

iCommands is a collection of commands for Linux and Mac OS operating systems that are used in the iRODS system to interact with the CyVerse Data Store. Many commands are very similar to Unix utilities. For example, to list files and directories, in Linux you use ls, but in iCommands you use ils.

While iCommands are great for all transfers and for automating tasks via scripts, they are the best choice for large files (2-100 GB each) and for bulk file transfers (many small files). For a comparison of the different methods of uploading and downloading data items, see Downloading and Uploading Data.

iCommands can be used by CyVerse account users to download files that have been shared by other users and to upload files to the Data Store, as well as to add metadata, change permissions, and more. Commonly used iCommands are illustrated in the example below. Follow the instructions on Setting Up iCommands to download and configure iCommands for your operating system.

A CyVerse account is not required to download a public data file via iCommands. To see instructions just for public data download with iCommands, see the iCommands section on Downloading Data Files Without a User Account.

Before you begin, run the command below.

export IRODS_PLUGINS_HOME=/opt/icommands/plugins/

For configuring iCommands and for the different commands that can be used to move data in and out of the Data Store, please refer to this link. You may also want to watch a CyVerse video about iCommands.
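As a sketch of a typical session (the username and file names below are placeholders), you would initialise the iRODS connection, pull an input file from your CyVerse home collection, and push results back after the run:

$ iinit                                                      # enter your CyVerse credentials when prompted
$ ils /iplant/home/<your_username>                           # list your home collection in the Data Store
$ iget /iplant/home/<your_username>/my_proteins.fasta .      # download an input FASTA to the instance
$ iput -r ~/interproscan_test /iplant/home/<your_username>/  # upload the results directory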
