Random Jungle 2.0.0

Random Jungle is an implementation of Random Forests™, used to analyze high-dimensional data. In genetics, the algorithm can be used to analyze large genome-wide association (GWA) data sets. Per the authors, some of the most interesting features of the algorithm include variable selection, missing value imputation, classifier creation, generalization error estimation, and sample proximities between pairs of cases. Commonly used measurements that are implemented include Gini importance, permutation importance, conditional importance measures, and variable backward elimination. When multiple CPUs are available, Random Jungle uses multithreading and Message Passing Interface (MPI) parallelization.

Usage

Random Jungle has a large number of command line parameters; see Section 3.1 of the program documentation, RJ-manual-2.0.0_0.pdf.

Test Data

Info

Test data for this app appears directly in the Discovery Environment in the Data window under:

Community Data / iplantcollaborative / example_data / randomjungle / input

Results generated using the parameter file included with the relevant input files can be found at:

Community Data / iplantcollaborative / example_data / randomjungle / output

The command-line options used to generate the output files are recorded in run.log and run2.log, which are located among the other output files.

Parameters Used in App

Input File(s)

There is one input format, the RAW PLINK format (.raw). If your current file is in one of the more standard PLINK formats (.map/.ped, .tfam/.tped, or .bed/.bim/.fam), use PLINK Conversion to convert it to the RAW PLINK format. If your file is in a non-PLINK format, use Tassel3 Conversion or Tassel4 Conversion to obtain the PLINK .map/.ped format, and then use PLINK Conversion to get the RAW PLINK format.
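Before submitting a converted file, it can help to check that it has the expected shape. The following is a minimal sketch, not part of Random Jungle or the PLINK Conversion app. It assumes the usual PLINK .raw layout: a whitespace-separated header with the six leading columns FID, IID, PAT, MAT, SEX and PHENOTYPE, followed by one allele-count column per SNP, with NA marking a missing genotype. The function name summarize_raw and the two-SNP sample data are hypothetical and made up for illustration.

```python
import io

# Assumed PLINK .raw layout: six leading columns, then one column per SNP.
LEADING_COLUMNS = ["FID", "IID", "PAT", "MAT", "SEX", "PHENOTYPE"]


def summarize_raw(handle):
    """Return (number of samples, {SNP column name: count of NA genotypes})."""
    header = handle.readline().split()
    if header[:6] != LEADING_COLUMNS:
        raise ValueError("not a PLINK .raw header: %r" % header[:6])
    snp_names = header[6:]
    missing = {name: 0 for name in snp_names}
    n_samples = 0
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(header):
            raise ValueError(
                "row has %d fields, expected %d" % (len(fields), len(header))
            )
        n_samples += 1
        # Count genotypes coded as NA for each SNP column.
        for name, value in zip(snp_names, fields[6:]):
            if value == "NA":
                missing[name] += 1
    return n_samples, missing


# Hypothetical two-SNP example, made up for illustration only.
example = io.StringIO(
    "FID IID PAT MAT SEX PHENOTYPE rs1_A rs2_G\n"
    "fam1 ind1 0 0 1 2 0 1\n"
    "fam2 ind2 0 0 2 1 NA 2\n"
)
n_samples, missing = summarize_raw(example)
print(n_samples, missing)  # prints: 2 {'rs1_A': 1, 'rs2_G': 0}
```

For a real file, pass an open file object instead of the io.StringIO stand-in. A large NA count for a SNP is a hint that the missing-value options described under Input File Options may be relevant.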

  • Input File Options.
    • --missingcode=NUM
      Missing values should always be coded as NA or NUM in your data. The program uses NUM as an internal representation of a missing value. DEFAULT: -99.
    • --impute=NUM
      Impute missing values in the input data using the imputation algorithm of Random Forests (TM). The number of iterations is given by NUM. For more information, see http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm. Do not try to impute untyped SNPs (Schwarz et al. 2009, BMC Proceedings, 3, S65) if case-control status is missing; try a different program instead, such as IMPUTE, MACH, or PLINK. DEFAULT: switched off.
    • --impcts
      Impute missing values in continuous phenotype data.
    • -W
      Adjust class weights of unbalanced data sets automatically. DEFAULT: switched off.
    • --classweights=STRING
      Set the class weights. For example, "1.23;22;" gives the first class a weight of 1.23 and the second class a weight of 22. DEFAULT: switched off.
  • Model (Optional File).
    • -P
      Select file with model (forest).
  • General Parameters.
    • --probability
      Write probability predictions to file.
    • -w3
      Select option for training model.
    • --downsampling
      Draw samples randomly without replacement. DEFAULT: switched off.
  • Tree Type.
    • --treetype=ID
      Choose the base classifier by setting ID. There are several tree types, but only CART is fully supported. Explanatory or exposure variables are called input variables; the explained or response variable is called the output variable. To use CART as Random Forests (TM) does, choose one of the five possible values below:
      • ID=1
        CART, y (output variable): nominal, x (input variable(s)): numeric. Recommended when the input variables take few distinct values (e.g. GWA SNP data or integer data).
      • ID=2
        CART, y (output variable): nominal, x (input variable(s)): nominal.
      • ID=3
        CART regression trees, y (output variable): numeric, x (input variable(s)): numeric.
      • ID=4
        CART regression trees, y (output variable): numeric, x (input variable(s)): nominal.
      • ID=5
        Recommended when the input variables take many distinct values (e.g. many floating-point numbers).
        Like the original Breiman/Cutler/Friedman algorithm.
    • --ntree=SIZE
      SIZE is the number of trees in the jungle. If SIZE=0, the size is set automatically depending on mtry and variable size (experimental feature). DEFAULT is 500.
    • --mtry=SIZE
      SIZE of the randomly chosen variable sets. At each node-building step, the variable that yields the largest information gain is selected from the set. The larger SIZE is, the longer the computing time may be and the more similar the trees in the jungle will be. Very noisy data sets should be processed with a large SIZE. DEFAULT is the square root of the number of input variables.
    • --maxtreedepth=NUM
      This is a stopping criterion/tuning parameter. Tree growing stops when the tree exceeds a depth of NUM. DEFAULT: switched off.
  • Importance calculations.
    • --impmeasure=ID
      Variable selection: choose a method for estimating variable importance from the list below. The results are written to the file FILEPREFIXNAME.importance. Variable importance output cannot be turned off. DEFAULT is 1.
      • ID=1
        Intrinsic importance (i.e. Gini index).
      • ID=2
        Permutation importance by Breiman, Cutler (observed in Fortran code).
      • ID=3
        Permutation importance by Liaw, Wiener (in the R package randomForest).
      • ID=4
        Permutation importance, raw values, no normalization.
      • ID=5
        Permutation importance by Meng et al.
    • --nimpvar=STOPSIZE
      Only necessary if --impmeasure is 2, 3, 5, 6 or 7. STOPSIZE is the number of variables that should remain. The smaller STOPSIZE is, the more reliable the result might be, but the longer the computing time will be. DEFAULT is 100.
    • --condimp=NUM
      Perform conditional importance if option -i > 1. NUM is the cutoff for Pearson's correlation coefficient. The smaller NUM is, the larger the conditional importance permutation groups will be (more accurate, but slower). Requires 0 <= NUM <= 1; NUM < 0 switches the option off. DEFAULT: switched off.
  • Performance & Operation.
    • --memmode=ID
      Usage of heap memory (RAM), as listed below. DEFAULT is 0. For very compact data coding (e.g. for SNP analysis), give rjunglesparse a try.
      • ID=0
        Double precision floating point (big).
      • ID=1
        Single precision floating point (normal).
      • ID=2
        Char (small). A char normally fits in one byte, so each data cell value has to be an integer in [-127..127].
    • -z
      Seed of the random number generators.
    • --nthreads=NUM
      Use at most NUM threads (CPUs) for parallel processing. NUM is limited by the number of CPUs in the computer. DEFAULT: number of CPUs in the computer.
    • --targetpartitionsize=NUM
      This is a stopping criterion/tuning parameter. Tree growing stops when a partition falls below a size of NUM samples. DEFAULT: switched off.
  • Output files.
    • --outprefix=FILEPREFIXNAME
      FILEPREFIXNAME is the first part of the output file names (e.g. rjungle.importance, rjungle.prediction, ...). The default FILEPREFIXNAME is rjungle. For example, use -o my_analysis_no123.
    • -w2
      Save the model (forest) to a file named prefix.jungle.xml.
    • --sampleproximities
      Compute proximities between pairs of cases, which can be used for clustering, locating outliers, or (by scaling) giving interesting views of the data. They can also be used as the distance matrix for Multidimensional Scaling (MDS). The results are written to the file FILEPREFIXNAME.samproximity. DEFAULT: switched off.
    • --oobset
      Write the out-of-bag (OOB) set of the forest to the file *.oob (one row per tree). DEFAULT: switched off.
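
The options above are combined into a single command line when the app runs. The following is a minimal sketch of how a few of them could be assembled programmatically, using only flags listed on this page. The helper names (format_class_weights, default_mtry, build_options) are hypothetical, and rounding the square root down for mtry is an assumption, since the page only says "square root of number of input variables". The executable name and the input file option are deliberately left out because this page does not list them.

```python
import math


def format_class_weights(weights):
    """Format weights in the "1.23;22;" style shown for --classweights."""
    return "".join("%g;" % w for w in weights)


def default_mtry(n_input_variables):
    """Square root of the number of input variables.

    Rounding down (and never going below 1) is an assumption of this sketch.
    """
    return max(1, math.isqrt(n_input_variables))


def build_options(treetype, ntree=500, mtry=None, impmeasure=1,
                  outprefix="rjungle", class_weights=None):
    """Return a list of option strings; options left at None are omitted."""
    options = [
        "--treetype=%d" % treetype,
        "--ntree=%d" % ntree,            # documented default: 500
        "--impmeasure=%d" % impmeasure,  # documented default: 1
        "--outprefix=%s" % outprefix,    # documented default: rjungle
    ]
    if mtry is not None:
        options.append("--mtry=%d" % mtry)
    if class_weights is not None:
        options.append(
            "--classweights=%s" % format_class_weights(class_weights)
        )
    return options


# Example: classification on SNP data (treetype 1) with 10000 input variables.
options = build_options(1, mtry=default_mtry(10000), class_weights=[1.23, 22])
print(options)
```

If such a list is joined into a string and passed through a shell, the --classweights value needs quoting, because the semicolon is a command separator in most shells.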

Tool Source

Source: All versions
Version: 2.0.0
User Guide: RandomJungleManual.pdf and RandomJunglePaper.pdf
Requires: N/A
JSON:
