Mothur-1.40.0 on Atmosphere (Images Tutorial)

Rationale and background:

Mothur project seeks to develop a single piece of open-source, expandable software to fill the bioinformatics needs of the microbial ecology community.

Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. (2013): Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and Environmental Microbiology. 79(17):5112-20.

MiSeq SOP

This tutorial follows the basic Schloss Lab MiSeq Standard Operating Procedures Tutorial. Follow along there for more details about each command. For help with any command, use the mothur website or type the command with 'help' e.g. make.contigs(help). This tutorial only uses one sample and does not make use of 'groups', see the SOP for more details on running multiple samples simultaneously.

Logistics

Starting out we need to first determine, what is our question? The Schloss lab is interested in understanding the effect of normal variation in the gut microbiome on host health. To that end we collected fresh feces from mice on a daily basis for 365 days post weaning (we're accepting applications). During the first 150 days post weaning (dpw), nothing was done to our mice except allow them to eat, get fat, and be merry. We were curious whether the rapid change in weight observed during the first 10 dpw affected the stability microbiome compared to the microbiome observed between days 140 and 150. We will address this question in this tutorial using a combination of OTU, phylotype, and phylogenetic methods. To make this tutorial easier to execute, we are providing only part of the data - you are given the flow files for one animal at 10 time points (5 early and 5 late). In addition, to sequencing samples from mice fecal material, we resequenced a mock community composed of genomic DNA from 21 bacterial strains. We will use the 10 fecal samples to look at how to analyze microbial communities and the mock community to measure the error rate and its effect on other analyses.

Part 1: Connect to an instance of an Atmosphere Image (virtual machine)

Step 1. Go to https://atmo.cyverse.org and log in with your CyVerse credentials.

note: click images to enlarge

Step 2. Create a project tab and name the project name as Mothur and description as Prediction-based taxonomic classification

note: click images to enlarge

Step 3. Click the Mothur project and then click Instance. Select the image Mothur and click Launch Instance. It will take ~10-15 minutes for the cloud instance to be launched.

note: click images to enlarge

Note: Instances can be configured for different amounts of CPU, memory, and storage depending on user needs. This tutorial can be accomplished with the small instance size, medium1 (4 CPUs, 8 GB memory, 80 GB root)

Step 4. Once the VM is ready. Click the VM which will take onto next screen where you can open Web Shell

note: click images to enlarge

or ssh into the ip address on a terminal like below.

$ ssh <username>@<ip.address>

Part 2: Set up a Mothur run using the Web Shell

Get oriented.

You will find staged example data in folder "MiSeq_SOP_sub/" within the folder /opt". Copy that folder onto your home directory

$ cp -r /opt/MiSeq_SOP_sub .
$ cd MiSeq_SOP_sub

Reducing sequencing and PCR errors

The first thing we want to do is combine our two sets of reads for each sample and then to combine the data from all of the samples. This is done using the make.contigs command, which requires stability.files as input. This command will extract the sequence and quality score data from your fastq files, create the reverse complement of the reverse read and then join the reads into contigs. We have a very simple algorithm to do this. First, we align the pairs of sequences. Next, we look across the alignment and identify any positions where the two reads disagree. If one sequence has a base and the other has a gap, the quality score of the base must be over 25 to be considered real. If both sequences have a base at that position, then we require one of the bases to have a quality score 6 or more points better than the other. If it is less than 6 points better, then we set the consensus base to an N.

mothur > make.contigs(file=stability.files)
 
Making contigs...
1000
1000
1000
1000
1948
1948
1948
1949
Done.
It took 4 secs to assemble 7793 reads.

Group count: 
F3D0	7793
Total of all groups is 7793
It took 4 secs to process 7793 sequences.
Output File Names: 
stability.trim.contigs.fasta
stability.trim.contigs.qual
stability.scrap.contigs.fasta
stability.scrap.contigs.qual
stability.contigs.report
stability.contigs.groups

This command will also produce several files that you will need down the road: stability.trim.contigs.fasta and stability.contigs.groups. These contain the sequence data and group identity for each sequence. The stability.contigs.report file will tell you something about the contig assembly for each read.

this outputs several files

  stability.trim.contigs.fasta   # trimmed contigs in fasta format
  stability.trim.contigs.qual    # base by base quality calls on contigs
  stability.contigs.report       # combined sequence report
  stability.scrap.contigs.fasta  # contigs not passing quality controls
  stability.scrap.contigs.qual   # quality calls on bad contigs
  stability.contigs.groups       # group membership of each new contig

Let's see what these sequences look like using the

mothur > summary.seqs(fasta=stability.trim.contigs.fasta)

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	249	249	0	3	1
2.5%-tile:	1	252	252	0	4	195
25%-tile:	1	252	252	0	4	1949
Median: 	1	252	252	0	4	3897
75%-tile:	1	253	253	0	5	5845
97.5%-tile:	1	253	253	5	6	7599
Maximum:	1	502	502	241	188	7793
Mean:	1	252	252	0	4
# of Seqs:	7793

This tells us that we have 7793 sequences that for the most part vary between 249 and 253 bases. Interestingly, the longest read in the dataset is 502 bp. Be suspicious of this. Recall that the reads are supposed to be 251 bp each. This read clearly didn't assemble well (or at all). Also, note that at least 2.5% of our sequences had some ambiguous base calls. We'll take care of these issues in the next step when we run screen.seqs.

mothur > screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, maxambig=0, maxlength=275)

This implementation of the command will remove any sequences with ambiguous bases and anything longer than 275 bp. There's another way to run this using the output from summary.seqs:

mothur > screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, summary=stability.trim.contigs.summary, maxambig=0, maxlength=275)

This may be faster because the summary.seqs output file already has the number of ambiguous bases and sequence length calculated for your sequences. Also, mothur is smart enough to remember that we used 4 processors in make.contigs and so it will use that throughout your current session. To see what else mothur knows about you, run the following:

Processing improved sequences

We anticipate that many of our sequences are duplicates of each other. Because it's computationally wasteful to align the same thing a bazillion times, we'll unique our sequences using the unique.seqs command:

mothur > unique.seqs(fasta=stability.trim.contigs.good.fasta)
1000	368
2000	609
3000	824
4000	1025
5000	1221
6000	1408
6638	1533

If two sequences have the same identical sequence, then they're considered duplicates and will get merged. In the screen output there are two columns - the first is the number of sequences characterized and the second is the number of unique sequences remaining. So after running unique.seqs we have gone from 7793 to 6638 sequences. This will make our life much easier. Another thing to do to make our lives easier is to simplify the names and group files. If you look at the most recent versions of those files you'll see together they are 13 MB. This may not seem like much, but with a full MiSeq run those long sequence names can add up and make life tedious. So we'll run count.seqs to generate a table where the rows are the names of the unique sequences and the columns are the names of the groups. The table is then filled with the number of times each unique sequence shows up in each group.

mothur > count.seqs(name=stability.trim.contigs.good.names, group=stability.contigs.good.groups)

This will generate a file called stability.trim.contigs.good.count_table. In subsequent commands we'll use it by using the count option:

mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.fasta, count=stability.trim.contigs.good.count_table)

Using 4 processors.
		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	250	250	0	3	1
2.5%-tile:	1	252	252	0	4	166
25%-tile:	1	252	252	0	4	1660
Median: 	1	252	252	0	4	3320
75%-tile:	1	253	253	0	5	4979
97.5%-tile:	1	253	253	0	6	6473
Maximum:	1	255	255	0	8	6638
Mean:	1	252	252	0	4
# of unique seqs:	1533
total # of seqs:	6638

Cool, right? Now we need to align our sequences to the reference alignment. Again we can make our lives a bit easier by making a database customized to our region of interest using the pcr.seqscommand. To run this command you need to have the reference database (silva.bacteria.fasta) and know where in that alignment your sequences start and end. To remove the leading and trailing dots we will set keepdots to false. You could also run this command using your primers of interest.:

mothur > pcr.seqs(fasta=silva.bacteria.fasta, start=11894, end=25319, keepdots=F)

Let's rename it to something more useful using the rename.file command:

mothur > rename.file(input=silva.bacteria.pcr.fasta, new=silva.v4.fasta)

Let's take a look at what we've made:

mothur > summary.seqs(fasta=silva.v4.fasta)

Using 4 processors.
		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	13424	270	0	3	1
2.5%-tile:	1	13425	292	0	4	374
25%-tile:	1	13425	293	0	4	3740
Median: 	1	13425	293	0	4	7479
75%-tile:	1	13425	293	0	5	11218
97.5%-tile:	1	13425	294	1	6	14583
Maximum:	3	13425	351	5	9	14956
Mean:	1	13424	292	0	4
# of Seqs:	14956

Now we have a customized reference alignment to align our sequences to. The nice thing about this reference is that instead of being 50,000 columns wide, it is now 13,425 columns wide which will save our hard drive some space and should improve the overall alignment quality. We'll do the alignment with align.seqs:

mothur > align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=silva.v4.fasta)

This should be done in a manner of seconds and we can run summary.seqs again:

mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table)

Using 4 processors.
		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1967	11549	250	0	3	1
2.5%-tile:	1968	11550	252	0	4	166
25%-tile:	1968	11550	252	0	4	1660
Median: 	1968	11550	252	0	4	3320
75%-tile:	1968	11550	253	0	5	4979
97.5%-tile:	1968	11550	253	0	6	6473
Maximum:	1968	11552	255	0	8	6638
Mean:	1967	11549	252	0	4
# of unique seqs:	1533
total # of seqs:	6638

So what does this mean? You'll see that the bulk of the sequences start at position 1968 and end at position 11550. Some sequences start at position 1967 and end at 11549. These deviants from the mode positions are likely due to an insertion or deletion at the terminal ends of the alignments. Sometimes you'll see sequences that start and end at the same position indicating a very poor alignment, which is generally due to non-specific amplification. To make sure that everything overlaps the same region we'll re-run screen.seqs to get sequences that start at or before position 1968 and end at or after position 11550. We'll also set the maximum homopolymer length to 8 since there's nothing in the database with a stretch of 9 or more of the same base in a row (this really could have been done in the first execution of screen.seqs above). Note that we need the count table so that we can update the table for the sequences we're removing and we're also using the summary file so we don't have to figure out again all the start and stop positions:

mothur > screen.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table, summary=stability.trim.contigs.good.unique.summary, start=1968, end=11550, maxhomop=8)


Using 4 processors.
383
383
383
384
It took 0 secs to screen 1533 sequences, removed 10.
/******************************************/
Running command: remove.seqs(accnos=stability.trim.contigs.good.unique.bad.accnos, count=stability.trim.contigs.good.count_table)
Removed 10 sequences from your count file.
Output File Names: 
stability.trim.contigs.good.pick.count_table
/******************************************/
Output File Names: 
stability.trim.contigs.good.unique.bad.accnos
stability.trim.contigs.good.unique.good.summary
stability.trim.contigs.good.unique.good.align
stability.trim.contigs.good.good.count_table

mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.good.align, count=stability.trim.contigs.good.good.count_table)
Using 4 processors.
		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1967	11550	250	0	3	1
2.5%-tile:	1968	11550	252	0	4	166
25%-tile:	1968	11550	252	0	4	1658
Median: 	1968	11550	252	0	4	3315
75%-tile:	1968	11550	253	0	5	4972
97.5%-tile:	1968	11550	253	0	6	6463
Maximum:	1968	11552	255	0	8	6628
Mean:	1967	11550	252	0	4
# of unique seqs:	1523
total # of seqs:	6628

Now we know our sequences overlap the same alignment coordinates, we want to make sure they only overlap that region. So we'll filter the sequences to remove the overhangs at both ends. Since we've done paired-end sequencing, this shouldn't be much of an issue, but whatever. In addition, there are many columns in the alignment that only contain gap characters (i.e. "-"). These can be pulled out without losing any information. We'll do all this with filter.seqs:

mothur > filter.seqs(fasta=stability.trim.contigs.good.unique.good.align, vertical=T, trump=.)

At the end of running the command we get the following information:

Length of filtered alignment: 297
Number of columns removed: 13128
Length of the original alignment: 13425
Number of sequences used to construct filter: 1523

This means that our initial alignment was 13425 columns wide and that we were able to remove 13128 terminal gap characters using trump=. and vertical gap characters using vertical=T. The final alignment length is 297 columns. Because we've perhaps created some redundancy across our sequences by trimming the ends, we can re-run unique.seqs:

mothur > unique.seqs(fasta=stability.trim.contigs.good.unique.good.filter.fasta, count=stability.trim.contigs.good.good.count_table)

The next thing we want to do to further de-noise our sequences is to pre-cluster the sequences using the pre.cluster command allowing for up to 2 differences between sequences. This command will split the sequences by group and then sort them by abundance and go from most abundant to least and identify sequences that are within 2 nt of each other. If they are then they get merged. We generally favor allowing 1 difference for every 100 bp of sequence:

mothur > pre.cluster(fasta=stability.trim.contigs.good.unique.good.filter.unique.fasta, count=stability.trim.contigs.good.unique.good.filter.count_table, diffs=2)

We now have 564 unique sequences. At this point we have removed as much sequencing error as we can and it is time to turn our attention to removing chimeras. We'll do this using the VSEARCH algorithm that is called within mothur using the chimera.vsearch command. Again, this command will split the data by sample and check for chimeras. Our preferred way of doing this is to use the abundant sequences as our reference. In addition, if a sequence is flagged as chimeric in one sample, the default (dereplicate=F) is to remove it from all samples. Our experience suggests that this is a bit aggressive since we've seen rare sequences get flagged as chimeric when they're the most abundant sequence in another sample. This is how we do it:

mothur > chimera.vsearch(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)


Using 4 processors.
Checking sequences from stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta ...
vsearch v2.3.4_linux_x86_64, 7.8GB RAM, 4 cores
https://github.com/torognes/vsearch
Reading file stability.trim.contigs.good.unique.good.filter.unique.precluster.temp 100%  
142526 nt in 564 seqs, min 250, max 255, avg 253
Masking 100%  
Sorting by abundance 100%
Counting unique k-mers 100%  
Detecting chimeras 100%  
Found 271 (48.0%) chimeras, 280 (49.6%) non-chimeras,
and 13 (2.3%) borderline sequences in 564 unique sequences.
Taking abundance information into account, this corresponds to
427 (6.4%) chimeras, 6188 (93.4%) non-chimeras,
and 13 (0.2%) borderline sequences in 6628 total sequences.
It took 1 secs to check 564 sequences from group F3D0.


Output File Names: 
stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table
stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.chimeras
stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.accnos

Running chimera.vsearch with the count file will remove the chimeric sequences from the count file. But you still need to remove those sequences from the fasta file. We do this using remove.seqs:

mothur > remove.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, accnos=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.accnos)

Running summary.seqs we see what we're left with:

mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.count_table)


Using 4 processors.
		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	297	250	0	3	1
2.5%-tile:	1	297	252	0	4	166
25%-tile:	1	297	252	0	4	1658
Median: 	1	297	252	0	4	3315
75%-tile:	1	297	253	0	5	4972
97.5%-tile:	1	297	253	0	6	6463
Maximum:	1	297	255	0	8	6628
Mean:	1	297	252	0	4
# of unique seqs:	564
total # of seqs:	6628

Note that we went from 7793 to 6628 sequences; this is a reasonable number of sequences to be flagged as chimeric. As a final quality control step, we need to see if there are any "undesirables" in our dataset. Sometimes when we pick a primer set they will amplify other stuff that gets to this point in the pipeline such as 18S rRNA gene fragments or 16S rRNA from Archaea, chloroplasts, and mitochondria. There's also just the random stuff that we want to get rid of. Now you may say, "But wait I want that stuff". Fine. But, the primers we use, are only supposed to amplify members of the Bacteria and if they're hitting Eukaryota or Archaea, then its a mistake. Also, realize that chloroplasts and mitochondria have no functional role in a microbial community. But I digress. Let's go ahead and classify those sequences using the Bayesian classifier with the classify.seqs command:

mothur > classify.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table, reference=trainset9_032012.pds.fasta, taxonomy=trainset9_032012.pds.tax, cutoff=80)

Now that everything is classified we want to remove our undesirables. We do this with the remove.lineage command:

mothur > remove.lineage(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)

Also of note is that "unknown" only pops up as a classification if the classifier cannot classify your sequence to one of the domains. If you run summary.seqs you'll see that we now have 2281 unique sequences and a total of 118150 total sequences. This means about 350 of our sequences were in these various groups. Now, to create an updated taxonomy summary file that reflects these removals we use the summary.tax command:

mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.count_table)


Using 4 processors.


		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	297	250	0	3	1
2.5%-tile:	1	297	252	0	4	155
25%-tile:	1	297	252	0	4	1550
Median: 	1	297	252	0	4	3099
75%-tile:	1	297	253	0	5	4648
97.5%-tile:	1	297	253	0	6	6042
Maximum:	1	297	255	0	6	6196
Mean:	1	297	252	0	4
# of unique seqs:	289
total # of seqs:	6196

This creates a pick.tax.summary file with the undesirables removed. At this point we have curated our data as far as possible and we're ready to see what our error rate is.

Assessing error rates

Measuring the error rate of your sequences is something you can only do if you have co-sequenced a mock community. This is something we include for every 95 samples we sequence. You should too because it will help you gauge your error rates and allow you to see how well your curation is going and whether something is wrong with your sequencing set up. First we want to pull the sequences out that were from our "Mock" sample using the get.groups command:

mothur > get.groups(count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.count_table, fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, groups=F3D0)


[NOTE]: The count file should contain only unique names, so mothur assumes your fasta, list and taxonomy files also contain only uniques.


Selected 289 sequences from your fasta file.
Selected 6196 sequences from your count file.

This tells us that we had 289 unique sequences and a total of 6196 total sequences in our F3D0 sample. We can now cluster the sequences into OTUs to see how many spurious OTUs we have:

mothur > dist.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta, cutoff=0.03)
mothur > cluster(column=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.dist, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.pick.count_table)
mothur > make.shared(list=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.opti_mcc.list, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.pick.count_table, label=0.03)
mothur > rarefaction.single(shared=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.opti_mcc.shared)

This string of commands will produce a file for you called stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.an.unique_list.groups.rarefaction. Open it. You'll see that for 4061 sequences, we'd have 64 OTUs from the Mock community. This number of course includes some stealthy chimeras that escaped our detection methods. If we used 3000 sequences, we would have about 31 OTUs. In a perfect world with no chimeras and no sequencing errors, we'd have 20 OTUs. This is not a perfect world. But this is pretty darn good!

Preparing for analysis

You can follow the rest of the tutorial from here - https://www.mothur.org/wiki/MiSeq_SOP

1 Learning Materials

Mothur-1.40.0 on Atmosphere

Analytics

Mothur-1.40.0 on Atmosphere (Images Tutorial)

Rationale and background:

MiSeq SOP

Logistics

Part 1: Connect to an instance of an Atmosphere Image (virtual machine)

Part 2: Set up a Mothur run using the Web Shell

Reducing sequencing and PCR errors

Processing improved sequences

Assessing error rates

Preparing for analysis

Related content