Create assembly histograms in R
Based on: http://www.scott-fay.com/blog/making-a-histogram-of-transcript-lengths-in-r
Tutorial
I. Setup
This tutorial can be done in R or RStudio
To do this tutorial in Atmosphere:
- Log in to Atmosphere https://atmo.iplantcollaborative.org/ (ensure you have access - go to https://user.iplantcollaborative.org/dashboard/ - if Atmosphere is not listed under 'My Services' scroll down to Available Services and request access to Atmosphere).
- Click 'Launch New Instance'
- Under 'Select an Image' search for 'TSW Workshop Williams 1.0'
- If desired, name your image.
- If you will be using the sample data, you may leave this instance size as 'tiny 1(1 CPUs, 4GB memory, 30G disk), if working with your own data you may choose an appropriate size.
- Click Launch Instance
Launching an instance should take 15-20 minutes; launch times and instance sizes are subject to the capacity of Atmosphere at any given time.
Connect to your Atmosphere instance
- If you wish to use R studio and view a desktop - see the documentation here: https://pods.iplantcollaborative.org/wiki/display/atmman/Using+VNC+Viewer+to+Connect+to+an+Atmosphere+VM
- If you wish to use R via the shell see the documentation here https://pods.iplantcollaborative.org/wiki/display/atmman/Logging+In+to+an+Instance#LoggingIntoanInstance-LogginginviaSSH (note you will have to export graphics to view them)
To do this tutorial in a local installation or R or RStudio:
- Download the sample data here: /iplant/home/shared/iplant_training/ars_workshop/transcriptome_assm/sample_assemblies
- Ensure you have installed the following R packages (some instructions if you are new at this: http://math.usask.ca/~longhai/software/installrpkg.html)
- Biostrings
- ggplot2
1. Set Working Directory*
You will need to download the sample data into your Atmosphere instance:
/iplant/home/shared/iplant_training/ars_workshop/transcriptome_assm/sample_assemblies
> setwd('sample_assemblies')
2. Add the Bioconductor repository and install the BioStrings library
> source("http://bioconductor.org/biocLite.R") > biocLite("Biostrings")
3. Load the biostrings library
> library(Biostrings)
4. Read the sequences into a variable (e.g. k27) using the readDNAStringSet function
> k27 <- readDNAStringSet("27.SoapOutput.scafSeq")
5. Use the histogram function to plot histogram (log scale)
> hist(log10(width(k27)),breaks=50)
sample histogram for the k=27
Plot multiple assemblies
6. Load ggplot2 library
> library(ggplot2)
7. Load all of the sample assemblies
> k27 <- readDNAStringSet("27.SoapOutput.scafSeq") > k41 <- readDNAStringSet("41.SoapOutput.scafSeq") > k51 <- readDNAStringSet("51.SoapOutput.scafSeq") > k71 <- readDNAStringSet("71.SoapOutput.scafSeq")
8. Create a data frame for each of the sets of assemblies
> k27mer_assembly=data.frame(name="k=27",length=width(k27)) > k41mer_assembly=data.frame(name="k=41",length=width(k41)) > k51mer_assembly=data.frame(name="k=51",length=width(k51)) > k71mer_assembly=data.frame(name="k=71",length=width(k71))
9. transform these into a single frame
> assemblies<- rbind(k27mer_assembly,k41mer_assembly,k51mer_assembly,k71mer_assembly)
10. Create a density histogram
ggplot(assemblies, aes(length, fill=name)) + geom_density(alpha = 0.2) + scale_x_log10() + theme_bw(base_size = 32, base_family = "")
sample density plot for k=27,41,51,71