Mothur-1.40.0 on Atmosphere (Images Tutorial)
Rationale and background:
MiSeq SOP
This tutorial follows the basic Schloss Lab MiSeq Standard Operating Procedure (SOP) tutorial. Follow along there for more details about each command. For help with any command, use the mothur website or run the command with the help option, e.g. make.contigs(help). This tutorial uses only one sample and does not make use of 'groups'; see the SOP for details on running multiple samples simultaneously.
Logistics
Starting out, we need to first determine: what is our question? The Schloss lab is interested in understanding the effect of normal variation in the gut microbiome on host health. To that end, we collected fresh feces from mice on a daily basis for 365 days post weaning (we're accepting applications). During the first 150 days post weaning (dpw), nothing was done to our mice except allow them to eat, get fat, and be merry. We were curious whether the rapid change in weight observed during the first 10 dpw affected the stability of the microbiome compared to the microbiome observed between days 140 and 150. We will address this question in this tutorial using a combination of OTU, phylotype, and phylogenetic methods. To make this tutorial easier to execute, we are providing only part of the data - the full SOP provides the fastq files for one animal at 10 time points (5 early and 5 late), while the subset used here (MiSeq_SOP_sub) contains a single sample (F3D0). In addition to sequencing samples from mouse fecal material, a mock community composed of genomic DNA from 21 bacterial strains was also sequenced. The fecal samples show how to analyze microbial communities, and the mock community is used to measure the error rate and its effect on other analyses.
Part 1: Connect to an instance of an Atmosphere Image (virtual machine)
Step 1. Go to https://atmo.cyverse.org and log in with your CyVerse credentials.
Step 2. Go to the Projects tab and create a new project with the name "Mothur" and the description "Prediction-based taxonomic classification".
Step 3. Open the Mothur project and click Instance. Select the Mothur image and click Launch Instance. It will take ~10-15 minutes for the cloud instance to launch.
Note: Instances can be configured with different amounts of CPU, memory, and storage depending on user needs. This tutorial can be accomplished with a small instance size such as medium1 (4 CPUs, 8 GB memory, 80 GB root disk).
Step 4. Once the VM is ready, click it to go to the instance page, where you can open the Web Shell,
or SSH to the instance's IP address from a terminal:
$ ssh <username>@<ip.address>
Part 2: Set up a Mothur run using the Web Shell
Copy the subsampled tutorial data that ships with the image into your home directory and change into it:

$ cp -r /opt/MiSeq_SOP_sub .
$ cd MiSeq_SOP_sub
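Everything from here on is run from the mothur prompt. This image is assumed to have the mothur binary on the PATH (check with 'which mothur' if in doubt); start an interactive session from inside the data directory:

$ mothur

mothur can also be run non-interactively, which is handy on cloud instances; for example, mothur "#summary.seqs(fasta=stability.trim.contigs.fasta)" executes a command and exits.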
Reducing sequencing and PCR errors
The first thing we want to do is combine our two sets of reads for each sample and then to combine the data from all of the samples. This is done using the make.contigs command, which requires stability.files as input. This command will extract the sequence and quality score data from your fastq files, create the reverse complement of the reverse read and then join the reads into contigs. We have a very simple algorithm to do this. First, we align the pairs of sequences. Next, we look across the alignment and identify any positions where the two reads disagree. If one sequence has a base and the other has a gap, the quality score of the base must be over 25 to be considered real. If both sequences have a base at that position, then we require one of the bases to have a quality score 6 or more points better than the other. If it is less than 6 points better, then we set the consensus base to an N.
mothur > make.contigs(file=stability.files)

Making contigs...
1000
1000
1000
1000
1948
1948
1948
1949
Done.

It took 4 secs to assemble 7793 reads.

Group count:
F3D0	7793

Total of all groups is 7793

It took 4 secs to process 7793 sequences.

Output File Names:
stability.trim.contigs.fasta
stability.trim.contigs.qual
stability.scrap.contigs.fasta
stability.scrap.contigs.qual
stability.contigs.report
stability.contigs.groups
This outputs several files:
stability.trim.contigs.fasta # trimmed contigs in fasta format
stability.trim.contigs.qual # base by base quality calls on contigs
stability.contigs.report # combined sequence report
stability.scrap.contigs.fasta # contigs not passing quality controls
stability.scrap.contigs.qual # quality calls on bad contigs
stability.contigs.groups # group membership of each new contig
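As a quick sanity check, you can count the sequences in the trimmed contigs file without leaving mothur by shelling out with mothur's system() command (the count should match the 7793 reported above; if the quoting trips up your shell, run the grep from a separate terminal instead):

mothur > system(grep -c '>' stability.trim.contigs.fasta)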
Let's see what these sequences look like using the summary.seqs command:
mothur > summary.seqs(fasta=stability.trim.contigs.fasta)

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	249	249	0	3	1
2.5%-tile:	1	252	252	0	4	195
25%-tile:	1	252	252	0	4	1949
Median: 	1	252	252	0	4	3897
75%-tile:	1	253	253	0	5	5845
97.5%-tile:	1	253	253	5	6	7599
Maximum:	1	502	502	241	188	7793
Mean:   	1	252	252	0	4
# of Seqs:	7793
This tells us that we have 7793 sequences that for the most part vary between 249 and 253 bases. Interestingly, the longest read in the dataset is 502 bp. Be suspicious of this. Recall that the reads are supposed to be 251 bp each. This read clearly didn't assemble well (or at all). Also, note that at least 2.5% of our sequences had some ambiguous base calls. We'll take care of these issues in the next step when we run screen.seqs.
mothur > screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, maxambig=0, maxlength=275)
This implementation of the command will remove any sequences with ambiguous bases and anything longer than 275 bp. There's another way to run this using the output from summary.seqs:
mothur > screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, summary=stability.trim.contigs.summary, maxambig=0, maxlength=275)
This may be faster because the summary.seqs output file already has the number of ambiguous bases and the sequence length calculated for each sequence. Also, mothur is smart enough to remember settings such as the number of processors from earlier commands in your session (you can set it explicitly with the processors option) and will reuse them. To see what else mothur knows about you, run the following:
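mothur > get.current()

This lists the files (fasta, group, count table, etc.) and settings, such as the processor count, that mothur is currently tracking; it is the same get.current() step used in the upstream MiSeq SOP.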
Processing improved sequences
We anticipate that many of our sequences are duplicates of each other. Because it's computationally wasteful to align the same thing a bazillion times, we'll unique our sequences using the unique.seqs command:
mothur > unique.seqs(fasta=stability.trim.contigs.good.fasta)

1000	368
2000	609
3000	824
4000	1025
5000	1221
6000	1408
6638	1533
To streamline bookkeeping, we'll combine the name and group files into a single count table with the count.seqs command; its rows are the unique sequences and its columns are the counts in each group:

mothur > count.seqs(name=stability.trim.contigs.good.names, group=stability.contigs.good.groups)
This will generate a file called stability.trim.contigs.good.count_table. In subsequent commands, we'll supply it via the count option:
mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.fasta, count=stability.trim.contigs.good.count_table)

Using 4 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	250	250	0	3	1
2.5%-tile:	1	252	252	0	4	166
25%-tile:	1	252	252	0	4	1660
Median: 	1	252	252	0	4	3320
75%-tile:	1	253	253	0	5	4979
97.5%-tile:	1	253	253	0	6	6473
Maximum:	1	255	255	0	8	6638
Mean:   	1	252	252	0	4
# of unique seqs:	1533
total # of seqs:	6638
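If you're curious what the count table itself looks like, you can peek at its first few lines (in this mothur version the table starts with a Representative_Sequence column followed by the total and per-group counts):

mothur > system(head -3 stability.trim.contigs.good.count_table)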
Next we need to align our sequences to a reference alignment. To make the alignment faster and better, we'll first trim the silva.bacteria.fasta reference database to just the region our primers amplify (the V4 region, columns 11894 to 25319 of the SILVA alignment) using the pcr.seqs command; keepdots=F removes the leading and trailing dots:

mothur > pcr.seqs(fasta=silva.bacteria.fasta, start=11894, end=25319, keepdots=F)
Let's rename it to something more useful using the rename.file command:
mothur > rename.file(input=silva.bacteria.pcr.fasta, new=silva.v4.fasta)
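If you prefer, the same renaming can be done with a plain shell mv (run outside mothur or via system()); note that unlike rename.file, this won't update the file names mothur is tracking as current:

$ mv silva.bacteria.pcr.fasta silva.v4.fasta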
Let's take a look at what we've made:
mothur > summary.seqs(fasta=silva.v4.fasta)

Using 4 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	13424	270	0	3	1
2.5%-tile:	1	13425	292	0	4	374
25%-tile:	1	13425	293	0	4	3740
Median: 	1	13425	293	0	4	7479
75%-tile:	1	13425	293	0	5	11218
97.5%-tile:	1	13425	294	1	6	14583
Maximum:	3	13425	351	5	9	14956
Mean:   	1	13424	292	0	4
# of Seqs:	14956
Now we have a customized reference alignment to align our sequences to. The nice thing about this reference is that instead of being 50,000 columns wide, it is now 13,425 columns wide which will save our hard drive some space and should improve the overall alignment quality. We'll do the alignment with align.seqs:
mothur > align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=silva.v4.fasta)
This should be done in a matter of seconds, and we can run summary.seqs again:
mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table)

Using 4 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1967	11549	250	0	3	1
2.5%-tile:	1968	11550	252	0	4	166
25%-tile:	1968	11550	252	0	4	1660
Median: 	1968	11550	252	0	4	3320
75%-tile:	1968	11550	253	0	5	4979
97.5%-tile:	1968	11550	253	0	6	6473
Maximum:	1968	11552	255	0	8	6638
Mean:   	1967	11549	252	0	4
# of unique seqs:	1533
total # of seqs:	6638
So what does this mean? You'll see that the bulk of the sequences start at position 1968 and end at position 11550. Some sequences start at position 1967 and end at 11549. These deviants from the mode positions are likely due to an insertion or deletion at the terminal ends of the alignments. Sometimes you'll see sequences that start and end at the same position indicating a very poor alignment, which is generally due to non-specific amplification. To make sure that everything overlaps the same region we'll re-run screen.seqs to get sequences that start at or before position 1968 and end at or after position 11550. We'll also set the maximum homopolymer length to 8 since there's nothing in the database with a stretch of 9 or more of the same base in a row (this really could have been done in the first execution of screen.seqs above). Note that we need the count table so that we can update the table for the sequences we're removing and we're also using the summary file so we don't have to figure out again all the start and stop positions:
mothur > screen.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table, summary=stability.trim.contigs.good.unique.summary, start=1968, end=11550, maxhomop=8)

Using 4 processors.
383
383
383
384
It took 0 secs to screen 1533 sequences, removed 10.

/******************************************/
Running command: remove.seqs(accnos=stability.trim.contigs.good.unique.bad.accnos, count=stability.trim.contigs.good.count_table)
Removed 10 sequences from your count file.

Output File Names:
stability.trim.contigs.good.pick.count_table
/******************************************/

Output File Names:
stability.trim.contigs.good.unique.bad.accnos
stability.trim.contigs.good.unique.good.summary
stability.trim.contigs.good.unique.good.align
stability.trim.contigs.good.good.count_table
mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.good.align, count=stability.trim.contigs.good.good.count_table)

Using 4 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1967	11550	250	0	3	1
2.5%-tile:	1968	11550	252	0	4	166
25%-tile:	1968	11550	252	0	4	1658
Median: 	1968	11550	252	0	4	3315
75%-tile:	1968	11550	253	0	5	4972
97.5%-tile:	1968	11550	253	0	6	6463
Maximum:	1968	11552	255	0	8	6628
Mean:   	1967	11550	252	0	4
# of unique seqs:	1523
total # of seqs:	6628
Now that we know our sequences overlap the same alignment coordinates, we want to make sure they only overlap that region. So we'll filter the sequences to remove the overhangs at both ends. Since we've done paired-end sequencing, this shouldn't be much of an issue, but whatever. In addition, there are many columns in the alignment that only contain gap characters (i.e. "-"). These can be pulled out without losing any information. We'll do all this with filter.seqs:
mothur > filter.seqs(fasta=stability.trim.contigs.good.unique.good.align, vertical=T, trump=.)
At the end of running the command we get the following information:
Length of filtered alignment: 297
Number of columns removed: 13128
Length of the original alignment: 13425
Number of sequences used to construct filter: 1523
This means that our initial alignment was 13425 columns wide and that we were able to remove 13128 columns that were either terminal overhangs (trump=.) or contained nothing but gap characters (vertical=T). The final alignment is 297 columns long. Because we've perhaps created some redundancy across our sequences by trimming the ends, we can re-run unique.seqs:
mothur > unique.seqs(fasta=stability.trim.contigs.good.unique.good.filter.fasta, count=stability.trim.contigs.good.good.count_table)
The next step to further de-noise our sequences is to pre-cluster them with the pre.cluster command, allowing for up to 2 differences between sequences. This command splits the sequences by group, sorts them by abundance, and merges rare sequences into more abundant sequences that are within 2 nt of them:

mothur > pre.cluster(fasta=stability.trim.contigs.good.unique.good.filter.unique.fasta, count=stability.trim.contigs.good.unique.good.filter.count_table, diffs=2)
We now have 564 unique sequences. At this point we have removed as much sequencing error as we can and it is time to turn our attention to removing chimeras. We'll do this using the VSEARCH algorithm that is called within mothur using the chimera.vsearch command. Again, this command will split the data by sample and check for chimeras. Our preferred way of doing this is to use the abundant sequences as our reference. In addition, if a sequence is flagged as chimeric in one sample, the default (dereplicate=F) is to remove it from all samples. Our experience suggests that this is a bit aggressive since we've seen rare sequences get flagged as chimeric when they're the most abundant sequence in another sample. This is how we do it:
mothur > chimera.vsearch(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)

Using 4 processors.
Checking sequences from stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta ...
vsearch v2.3.4_linux_x86_64, 7.8GB RAM, 4 cores
https://github.com/torognes/vsearch

Reading file stability.trim.contigs.good.unique.good.filter.unique.precluster.temp 100%
142526 nt in 564 seqs, min 250, max 255, avg 253
Masking 100%
Sorting by abundance 100%
Counting unique k-mers 100%
Detecting chimeras 100%
Found 271 (48.0%) chimeras, 280 (49.6%) non-chimeras, and 13 (2.3%) borderline sequences in 564 unique sequences.
Taking abundance information into account, this corresponds to 427 (6.4%) chimeras, 6188 (93.4%) non-chimeras, and 13 (0.2%) borderline sequences in 6628 total sequences.

It took 1 secs to check 564 sequences from group F3D0.

Output File Names:
stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table
stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.chimeras
stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.accnos
Running chimera.vsearch with the count file will remove the chimeric sequences from the count file. But you still need to remove those sequences from the fasta file. We do this using remove.seqs:
mothur > remove.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, accnos=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.accnos)
Running summary.seqs we see what we're left with:
mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.count_table)

Using 4 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	297	250	0	3	1
2.5%-tile:	1	297	252	0	4	166
25%-tile:	1	297	252	0	4	1658
Median: 	1	297	252	0	4	3315
75%-tile:	1	297	253	0	5	4972
97.5%-tile:	1	297	253	0	6	6463
Maximum:	1	297	255	0	8	6628
Mean:   	1	297	252	0	4
# of unique seqs:	564
total # of seqs:	6628
Note that quality screening took us from 7793 raw contigs down to 6628 sequences, and the chimera check above flagged a further 427 of those (6.4%); this is a reasonable fraction of sequences to be flagged as chimeric. As a final quality control step, we need to see if there are any "undesirables" in our dataset. Sometimes when we pick a primer set it will amplify other things that make it to this point in the pipeline, such as 18S rRNA gene fragments or 16S rRNA from Archaea, chloroplasts, and mitochondria. There's also just the random stuff that we want to get rid of. Now you may say, "But wait, I want that stuff". Fine. But the primers we use are only supposed to amplify members of the Bacteria, and if they're hitting Eukaryota or Archaea, then it's a mistake. Also, realize that chloroplasts and mitochondria have no functional role in a microbial community. But I digress. Let's go ahead and classify those sequences using the Bayesian classifier with the classify.seqs command:
mothur > classify.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table, reference=trainset9_032012.pds.fasta, taxonomy=trainset9_032012.pds.tax, cutoff=80)
Now that everything is classified we want to remove our undesirables. We do this with the remove.lineage command:
mothur > remove.lineage(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)
Also of note is that "unknown" only pops up as a classification if the classifier cannot classify your sequence to one of the domains. If you run summary.seqs you'll see that we now have 289 unique sequences and a total of 6196 sequences, meaning only a handful of sequences (about 5) fell into these undesirable groups:
mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.count_table)

Using 4 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	297	250	0	3	1
2.5%-tile:	1	297	252	0	4	155
25%-tile:	1	297	252	0	4	1550
Median: 	1	297	252	0	4	3099
75%-tile:	1	297	253	0	5	4648
97.5%-tile:	1	297	253	0	6	6042
Maximum:	1	297	255	0	6	6196
Mean:   	1	297	252	0	4
# of unique seqs:	289
total # of seqs:	6196
To create an updated taxonomy summary file that reflects these removals, use the summary.tax command, which produces a pick.tax.summary file with the undesirables removed; a minimal call following the upstream SOP is shown below. At this point we have curated our data as far as possible and we're ready to see what our error rate is.
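Here, current tells mothur to use the taxonomy and count table it is currently tracking (the same shortcut the upstream MiSeq SOP uses):

mothur > summary.tax(taxonomy=current, count=current)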
Assessing error rates
Measuring the error rate of your sequences is something you can only do if you have co-sequenced a mock community, something we include for every 95 samples we sequence. You should too, because it will help you gauge your error rates, see how well your curation is going, and tell whether something is wrong with your sequencing setup. First we want to pull out the sequences belonging to the mock community using the get.groups command. Because the subset used in this tutorial contains only the F3D0 sample, we select that group below to demonstrate the commands; with the full SOP dataset you would use groups=Mock:
mothur > get.groups(count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.count_table, fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, groups=F3D0)

[NOTE]: The count file should contain only unique names, so mothur assumes your fasta, list and taxonomy files also contain only uniques.

Selected 289 sequences from your fasta file.
Selected 6196 sequences from your count file.
This tells us that we had 289 unique sequences and 6196 total sequences in our F3D0 sample. We can now cluster the sequences into OTUs to see how many spurious OTUs we have:
mothur > dist.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta, cutoff=0.03)
mothur > cluster(column=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.dist, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.pick.count_table)
mothur > make.shared(list=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.opti_mcc.list, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.pick.count_table, label=0.03)
mothur > rarefaction.single(shared=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.opti_mcc.shared)
This string of commands will produce a file called stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.opti_mcc.groups.rarefaction. Open it. You'll see that for 4061 sequences, we'd have 64 OTUs from the Mock community. This number of course includes some stealthy chimeras that escaped our detection methods. If we used 3000 sequences, we would have about 31 OTUs. In a perfect world with no chimeras and no sequencing errors, we'd have 20 OTUs. This is not a perfect world. But this is pretty darn good!
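One quick way to inspect the file from the shell (plain head; the file name is the one reported by rarefaction.single, so adjust if yours differs):

$ head stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.opti_mcc.groups.rarefaction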
Preparing for analysis
You can follow the rest of the tutorial from here - https://www.mothur.org/wiki/MiSeq_SOP