Please work through the documentation and add your comments at the bottom of this page, or email comments to support@cyverse.org. Thank you.
What is Medusa?
A draft genome scaffolder that uses multiple reference genomes in a graph-based approach.
Reference:
E Bosi, B Donati, M Galardini, S Brunetti, MF Sagot, P Lió, P Crescenzi, R Fani, and M Fondi. MeDuSa: a multi-draft based scaffolder. Bioinformatics (2015): btv171.
Input and Output
The following inputs are required:
Mandatory
- Target genome file: A draft genome in fasta format. This is the genome you are interested in scaffolding.
- Comparison drafts folder:
...
- An arbitrarily long list of auxiliaryDraft files: other draft genomes in fasta format. The more closely these organisms are related to the target, the better the results will be. These files are expected to be collected in a specific directory.
- Scripts folder: A sub-folder containing the Python scripts needed to run the program (medusa_scripts)
Optional
- Output fasta file:
...
- Name of the output file (Default is output.fa)
- Number of cleaning rounds:
...
- This option allows the user to run a given number of cleaning rounds and keep the best solution. Since the variability is small, 5 rounds are usually sufficient to find the best score (Default is 5)
- N50 stat of fasta file: This option allows the calculation of the N50 statistic on a FASTA file.
...
All other options are ignored if you choose this option.
- Sequence similarity based weighting scheme: This option enables a sequence similarity based weighting scheme. Using a different weighting scheme may lead to better results.
- Estimation of the distance between pairs of contigs based on the reference genome: This option estimates the distance between pairs of contigs based on the reference genome(s). In this case the scaffolded contigs are separated by a number of N characters equal to this estimate. The estimated distances are also saved in the "*_distanceTable" file. By default, the scaffolded contigs are separated by 100 Ns.
- gexf format of the contig network: The contig network and the path cover are provided in gexf format.
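The N50 statistic mentioned in the options can be illustrated with a short Python sketch. This is not the code Medusa runs; the function names `read_fasta_lengths` and `n50` are hypothetical, and the sketch uses the common definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total length).

```python
from typing import List


def read_fasta_lengths(text: str) -> List[int]:
    """Return the length of every sequence in FASTA-formatted text."""
    lengths: List[int] = []
    current = None  # length of the record being read; None before the first header
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header closes the previous record and opens a new one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: List[int]) -> int:
    """Return the N50 of a list of sequence lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```

For example, `n50(read_fasta_lengths(open("output.fa").read()))` would give the N50 of a scaffold file, assuming the file fits in memory.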
The following output files will be produced.
- targetGenome_SUMMARY: a text file containing information about your data, such as the number of scaffolds and the N50 value.
- targetGenomeScaffold.fasta: a fasta file with the sequences grouped in scaffolds. Contigs in the same scaffold are separated by 100 Ns by default, or by a variable number of Ns (the estimated distance between the contigs) if the option "-d" is used.
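How contigs in one scaffold are separated by runs of N can be sketched in Python as follows. This only illustrates the output layout described above and is not Medusa's implementation; `join_contigs`, `DEFAULT_GAP`, and the `gaps` parameter are hypothetical names, and contig orientation is ignored. Rejecting negative gap estimates is an assumption of this sketch.

```python
from typing import Optional, Sequence

DEFAULT_GAP = 100  # default number of Ns between contigs, as stated in the text


def join_contigs(contigs: Sequence[str],
                 gaps: Optional[Sequence[int]] = None) -> str:
    """Concatenate contigs into one scaffold, separated by runs of N.

    gaps[i] is the number of Ns placed between contigs[i] and contigs[i + 1].
    When gaps is None, every pair is separated by DEFAULT_GAP Ns.
    """
    if not contigs:
        return ""
    if gaps is None:
        gaps = [DEFAULT_GAP] * (len(contigs) - 1)
    if len(gaps) != len(contigs) - 1:
        raise ValueError("need exactly one gap per pair of adjacent contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gaps, contigs[1:]):
        if gap < 0:
            # Assumption of this sketch: negative estimates are not accepted.
            raise ValueError("gap sizes must be non-negative")
        parts.append("N" * gap)
        parts.append(contig)
    return "".join(parts)
```

Calling `join_contigs(["ACGT", "TT"])` yields the two contigs with 100 Ns between them, while passing `gaps=[250]` would mimic a scaffold built from a per-pair distance estimate.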
The following output files can optionally be produced.
- targetGenome_distanceTable: a tabular file with the estimation of the distance between successive contigs (bp).
- targetGenome_network.gexf: the contig network in gexf format.
- targetGenome_cover.gexf: the final path cover in gexf format.
Test Run
All files are located in the Community Data directory of the CyVerse Discovery Environment at the following path:
...
Leave the optional arguments as they are.
Outputs:
Two output files will be generated.
...
Step 2: Drag and drop the target files into the newly created HT analysis path list file
Step 3: Save the newly created HT analysis path list file as medusa_ht_path (you can name it whatever you want)
...
Step 4: Click on the medusa-1.6 app and enter "Medusa-1.6_analysis1_ht_path" as the Analysis Name
Mandatory Inputs:
- Use medusa_ht_path for Target genome file
- Use reference_genomes for Comparison drafts folder
- Use medusa_scripts for Medusa scripts folder. Note: The medusa_scripts folder is available at /iplant/home/shared/iplantcollaborative/example_data/medusa/medusa_script
Leave the optional arguments as they are.
Once the app has been launched and has finished running, you will find one analysis folder for each target file. Since three target files were used in the HT path list file in this case, you will find 3 analysis outputs.
Tool Source for App
...