What is InterProScan?

InterPro is an integrated database of predictive protein "signatures" used for the classification and automatic annotation of proteins and genomes. InterPro classifies sequences at superfamily, family and subfamily levels, predicting the occurrence of functional domains, repeats and important sites. InterPro adds in-depth annotation, including GO terms, to the protein signatures.

...

Reference: Hunter S, et al. InterPro: the integrative protein signature database. Nucleic Acids Res. 2009 Jan;37(Database issue):D211-5. Epub 2008 Oct 21.

Which databases are used in InterPro?

InterPro combines a number of databases (referred to as member databases) that use different methodologies and a varying degree of biological information on well-characterised proteins to derive protein signatures. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful integrated database and diagnostic tool (InterProScan).

...

Diagnostically, these resources have different areas of optimum application owing to the different underlying analysis methods. In terms of family coverage, the protein signature databases are similar in size but differ in content. While all of the methods share a common interest in protein sequence classification, some focus on divergent domains (e.g., Pfam), some focus on functional sites (e.g., PROSITE), and others focus on families, specialising in hierarchical definitions from superfamily down to subfamily levels in order to pinpoint specific functions (e.g., PRINTS). TIGRFAMs focuses on building HMMs for functionally equivalent proteins, while PIRSF always produces HMMs over the full length of a protein and applies protein length restrictions to gather family members. HAMAP profiles are manually created by expert curators; they identify proteins that belong to well-conserved bacterial, archaeal and plastid-encoded protein families or subfamilies. PANTHER builds HMMs based on the divergence of function within families. SUPERFAMILY and Gene3D are based on structure, using the SCOP and CATH superfamilies, respectively, as a basis for building HMMs.

Considerations

Sounds great, what do I need to get started?

  1. XSEDE account
  2. XSEDE allocation. You can also request trial access.
  3. Your data (or you can run example data)

What kind of data do I need?

  1. Mandatory requirement: a protein sequence file, in FASTA format only
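
Before starting a run, it can help to confirm that the input really is a protein FASTA file. The following is a minimal sketch, not part of the original tutorial: the file name example_proteins.fasta and the two short sequences are hypothetical and are created here only so the commands can be tried anywhere. It assumes sequence lines should contain letters only.

```shell
# Hypothetical example file, created here so the sketch is self-contained.
fasta="example_proteins.fasta"
printf '>seq1\nMKTAYIAKQR\n>seq2\nMSLLTEVETY\n' > "$fasta"

# Each FASTA record starts with a '>' header line, so counting
# those lines gives the number of sequences in the file.
n_seqs=$(grep -c '^>' "$fasta")
echo "sequences: $n_seqs"

# Count sequence lines that contain anything other than letters
# (assumption: digits, '*', or other symbols indicate a problem).
bad=$(grep -v '^>' "$fasta" | grep -c '[^A-Za-z]')
echo "suspicious sequence lines: $bad"
```

To check your own data, point the fasta variable at your file and skip the printf line.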

What kind of resources will I need for my project?

  1. Enough storage space on the Jetstream instance for both input and output files
    1. Creating and attaching an external volume to the running instance is recommended:
      https://iujetstream.atlassian.net/wiki/spaces/JWT/pages/32899113/Volumes
  2. Enough AUs to run your computation
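
One way to check the storage point above is to look at the free space in the directory where you plan to work. This is a rough sketch using standard Unix tools; the workdir value is a placeholder, and on a Jetstream instance you would likely set it to the mount point of your attached volume.

```shell
# Placeholder: replace with your working directory or volume mount point.
workdir="."

# Human-readable overview of the filesystem that holds the directory.
df -h "$workdir"

# Available space in kilobytes (POSIX output format, 4th column),
# useful for comparing against the size of your input files.
avail_kb=$(df -Pk "$workdir" | awk 'NR==2 {print $4}')
echo "available KB: $avail_kb"
```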

Part 1: Connect to an instance of an InterProScan-5.26-65.0 Jetstream Image (virtual machine)

Step 1. Go to https://use.jetstream-cloud.org/application and log in with your XSEDE credentials.

...

Code Block
$ ssh <username>@<ip.address>

 

Part 2: Set up an InterProScan-5.26-65.0 run using the Terminal window

Step 1. Get oriented. You will find staged example data in "/opt/interproscan-5.26-65.0/" within the instance.  List its contents with the ls command:

...

Code Block
$ interproscan.sh -appl PfamA -iprlookup -goterms -pa -cpu 10 -i test_proteins.fasta 
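
As a possible variation on the command above, the sketch below adds an explicit output format and output directory using InterProScan's -f and -d options. The directory name ips_results is hypothetical, and the choice of TSV output is an assumption made for illustration rather than something the tutorial prescribes. The command is only executed when interproscan.sh and the input file are actually present; otherwise it is just printed.

```shell
input="test_proteins.fasta"   # example input named in the command above
outdir="ips_results"          # hypothetical output directory
mkdir -p "$outdir"

# Same options as the tutorial command, plus:
#   -f tsv     write results as tab-separated text
#   -d outdir  place the result files in the chosen directory
cmd="interproscan.sh -appl PfamA -iprlookup -goterms -pa -cpu 10 -i $input -f tsv -d $outdir"
echo "$cmd"

# Run only if the tool and the input are available on this machine.
if command -v interproscan.sh >/dev/null 2>&1 && [ -f "$input" ]; then
    $cmd
else
    echo "interproscan.sh or $input not found; command shown only"
fi
```

Adjust the -cpu value to the number of cores on your instance.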

Moving data from CyVerse Datastore using iCommands

iCommands is a collection of commands for Linux and Mac OS operating systems that are used in the iRODS system to interact with the CyVerse Data Store. Many commands are very similar to Unix utilities. For example, to list files and directories, in Linux you use ls, but in iCommands you use ils.
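
To illustrate the parallel with Unix utilities, here is a small sketch that lists a Data Store directory and downloads one file. The remote path is hypothetical (replace your_username and the file name with your own), and the sketch assumes iCommands is installed and that you have already configured your connection with iinit. If iCommands is not found, the commands are skipped.

```shell
# Hypothetical Data Store path; replace with your own file.
remote="/iplant/home/your_username/proteins.fasta"

if command -v ils >/dev/null 2>&1; then
    # List the current Data Store collection, like ls on a local disk.
    ils
    # Download the file into the current local directory, showing progress.
    iget -P "$remote" .
    icommands_status="ran"
else
    echo "iCommands not installed; skipping ils and iget"
    icommands_status="skipped"
fi
```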

...