Mothur-1.40.0 on Atmosphere (Images Tutorial)

Rationale and background:


The Mothur project seeks to develop a single piece of open-source, expandable software to fill the bioinformatics needs of the microbial ecology community.
Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. (2013): Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and Environmental Microbiology. 79(17):5112-20.

MiSeq SOP

This tutorial follows the basic Schloss Lab MiSeq Standard Operating Procedures Tutorial. Follow along there for more details about each command. For help with any command, use the mothur website or run the command with the help option, e.g. make.contigs(help). This tutorial uses only one sample and does not make use of 'groups'; see the SOP for more details on running multiple samples simultaneously.

 

Logistics

Starting out we need to first determine: what is our question? The Schloss lab is interested in understanding the effect of normal variation in the gut microbiome on host health. To that end we collected fresh feces from mice on a daily basis for 365 days post weaning (we're accepting applications). During the first 150 days post weaning (dpw), nothing was done to our mice except allow them to eat, get fat, and be merry. We were curious whether the rapid change in weight observed during the first 10 dpw affected the stability of the microbiome compared to the microbiome observed between days 140 and 150. We will address this question in this tutorial using a combination of OTU, phylotype, and phylogenetic methods. To make this tutorial easier to execute, we are providing only part of the data - you are given the flow files for one animal at 10 time points (5 early and 5 late). In addition to sequencing samples from mice fecal material, we resequenced a mock community composed of genomic DNA from 21 bacterial strains. We will use the 10 fecal samples to look at how to analyze microbial communities and the mock community to measure the error rate and its effect on other analyses.

In a manuscript submitted to Applied & Environmental Microbiology, we describe a set of primers that will allow you to sequence 1536 samples in parallel using only 80 primers (32+48) and obtain sequence reads that are at least as good as those generated by 454 sequencing using our 454 SOP. Please consult the supplementary methods of that manuscript for more information, along with our wet-lab SOP. All of the data from that study are available through our server.

Sequences come off the MiSeq as pairs of fastq files, with each pair representing the two sets of reads per sample. fastq files contain both the sequence data and the quality score data. If you aren't getting these files off the sequencer, then you likely have the software parameters set incorrectly.

For this tutorial you will need several sets of files. To speed up the tutorial we provide some of the downstream files that take a while to generate (e.g. the output of shhh.flows):

Checking Mothur version

Code Block
$ mothur --version

Linux 64Bit Version
Mothur version=1.40.0
Release Date=4/10/2018

...

Part 1: Connect to an instance of an Atmosphere Image (virtual machine)

Step 1. Go to https://atmo.cyverse.org and log in with your CyVerse credentials.

Image Added

note: click images to enlarge

Step 2. Open the Projects tab and create a new project, with Mothur as the project name and Prediction-based taxonomic classification as the description.

Image Added

Image Added

note: click images to enlarge

Step 3. Click the Mothur project and then click Instance. Select the image Mothur and click Launch Instance. It will take ~10-15 minutes for the cloud instance to be launched. 

Image Added

Image Added

Image Added 

note: click images to enlarge

Note: Instances can be configured for different amounts of CPU, memory, and storage depending on user needs. This tutorial can be accomplished with a small instance size such as medium1 (4 CPUs, 8 GB memory, 80 GB root).


Step 4. Once the VM is ready, click it to go to the next screen, where you can open the Web Shell,


Image Added
Image Added

note: click images to enlarge

or ssh to the instance's IP address from a terminal:

Code Block
$ ssh <username>@<ip.address>

 

Part 2: Set up a Mothur run using the Web Shell

Get oriented.
You will find staged example data in the folder "MiSeq_SOP_sub/" within "/opt". Copy that folder into your home directory:
Code Block
$ cp -r /opt/MiSeq_SOP_sub .
$ cd MiSeq_SOP_sub

...

Code Block
mothur > make.contigs(file=stability.files)
 
Making contigs...
1000
1000
1000
1000
1948
1948
1948
1949
Done.
It took 4 secs to assemble 7793 reads.

Group count: 
F3D0	7793
Total of all groups is 7793
It took 4 secs to process 7793 sequences.
Output File Names: 
stability.trim.contigs.fasta
stability.trim.contigs.qual
stability.scrap.contigs.fasta
stability.scrap.contigs.qual
stability.contigs.report
stability.contigs.groups
This command will also produce several files that you will need down the road: stability.trim.contigs.fasta and stability.contigs.groups. These contain the sequence data and group identity for each sequence. The stability.contigs.report file will tell you something about the contig assembly for each read.
Info

this outputs several files

  stability.trim.contigs.fasta   # trimmed contigs in fasta format
  stability.trim.contigs.qual    # base by base quality calls on contigs
  stability.contigs.report       # combined sequence report
  stability.scrap.contigs.fasta  # contigs not passing quality controls
  stability.scrap.contigs.qual   # quality calls on bad contigs
  stability.contigs.groups       # group membership of each new contig
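To see what make.contigs is doing conceptually, here is a minimal Python sketch (not mothur's code) of merging a forward read with the reverse complement of its mate: where the two reads disagree in the overlap, the base with the higher quality score wins. The fixed-length overlap passed in here is an assumption for illustration; mothur actually finds the overlap by pairwise alignment.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))

def make_contig(fwd, fwd_qual, rev, rev_qual, overlap):
    """Merge a forward read with the reverse-complemented reverse read.

    Assumes the last `overlap` bases of fwd pair with the first
    `overlap` bases of revcomp(rev); real make.contigs finds the
    overlap by alignment instead of taking it as a parameter.
    """
    rc = revcomp(rev)
    rc_qual = rev_qual[::-1]              # qualities flip with the sequence
    contig = list(fwd[:-overlap])
    quals = list(fwd_qual[:-overlap])
    for i in range(overlap):
        f_base = fwd[-overlap + i]
        f_q = fwd_qual[len(fwd) - overlap + i]
        r_base, r_q = rc[i], rc_qual[i]
        if f_base == r_base:
            contig.append(f_base)
            quals.append(max(f_q, r_q))
        elif f_q > r_q:                    # disagreement: higher quality wins
            contig.append(f_base)
            quals.append(f_q)
        elif r_q > f_q:
            contig.append(r_base)
            quals.append(r_q)
        else:                              # tie: call the base ambiguous
            contig.append("N")
            quals.append(f_q)
    contig.extend(rc[overlap:])
    quals.extend(rc_qual[overlap:])
    return "".join(contig), quals
```

On real data the overlap varies per read pair, which is part of what the stability.contigs.report file summarizes.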

Let's see what these sequences look like using the summary.seqs command:

Code Block
mothur > summary.seqs(fasta=stability.trim.contigs.fasta)

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	249	249	0	3	1
2.5%-tile:	1	252	252	0	4	195
25%-tile:	1	252	252	0	4	1949
Median: 	1	252	252	0	4	3897
75%-tile:	1	253	253	0	5	5845
97.5%-tile:	1	253	253	5	6	7599
Maximum:	1	502	502	241	188	7793
Mean:	1	252	252	0	4
# of Seqs:	7793

This tells us that we have 7793 sequences that for the most part vary between 249 and 253 bases. Interestingly, the longest read in the dataset is 502 bp. Be suspicious of this. Recall that the reads are supposed to be 251 bp each. This read clearly didn't assemble well (or at all). Also, note that at least 2.5% of our sequences had some ambiguous base calls. We'll take care of these issues in the next step when we run screen.seqs.
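The NBases, Ambigs, and Polymer columns of that table are simple per-sequence statistics. A short Python sketch of how they can be computed (illustrative only, not mothur's implementation):

```python
import re

def seq_stats(seq):
    """Per-sequence numbers behind summary.seqs: sequence length,
    count of ambiguous bases, and longest homopolymer run."""
    nbases = len(seq)
    ambigs = sum(1 for b in seq if b not in "ACGT")
    # find every run of a repeated character and take the longest
    polymer = max(len(m.group(0)) for m in re.finditer(r"(.)\1*", seq))
    return nbases, ambigs, polymer
```

summary.seqs then reports the distribution (min, percentiles, max) of these values over all reads.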

Code Block
mothur > screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, maxambig=0, maxlength=275)

 This implementation of the command will remove any sequences with ambiguous bases and anything longer than 275 bp. There's another way to run this using the output from summary.seqs:

 

Code Block
mothur > screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, summary=stability.trim.contigs.summary, maxambig=0, maxlength=275)

This may be faster because the summary.seqs output file already has the number of ambiguous bases and sequence length calculated for your sequences. Also, mothur is smart enough to remember that we used 4 processors in make.contigs, so it will use that setting throughout your current session. To see what else mothur knows about you, run the get.current() command.

 

Processing improved sequences

 

We anticipate that many of our sequences are duplicates of each other. Because it's computationally wasteful to align the same thing a bazillion times, we'll unique our sequences using the unique.seqs command:

 

Code Block
mothur > unique.seqs(fasta=stability.trim.contigs.good.fasta)
1000	368
2000	609
3000	824
4000	1025
5000	1221
6000	1408
6638	1533
If two sequences are identical, they're considered duplicates and will get merged. In the screen output there are two columns - the first is the number of sequences processed and the second is the number of unique sequences remaining. So after running unique.seqs we have gone from 6638 to 1533 sequences. This will make our life much easier.

Another thing to do to make our lives easier is to simplify the names and group files. If you look at the most recent versions of those files you'll see that together they are 13 MB. This may not seem like much, but with a full MiSeq run those long sequence names can add up and make life tedious. So we'll run count.seqs to generate a table where the rows are the names of the unique sequences and the columns are the names of the groups. The table is then filled with the number of times each unique sequence shows up in each group.
Code Block
mothur > count.seqs(name=stability.trim.contigs.good.names, group=stability.contigs.good.groups)

 This will generate a file called stability.trim.contigs.good.count_table. In subsequent commands we'll use it by using the count option:

 

Code Block
mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.fasta, count=stability.trim.contigs.good.count_table)

Using 4 processors.
		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	250	250	0	3	1
2.5%-tile:	1	252	252	0	4	166
25%-tile:	1	252	252	0	4	1660
Median: 	1	252	252	0	4	3320
75%-tile:	1	253	253	0	5	4979
97.5%-tile:	1	253	253	0	6	6473
Maximum:	1	255	255	0	8	6638
Mean:	1	252	252	0	4
# of unique seqs:	1533
total # of seqs:	6638
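The bookkeeping behind unique.seqs and count.seqs can be sketched in a few lines of Python (hypothetical helper name; not mothur's implementation):

```python
from collections import defaultdict

def dereplicate(named_seqs):
    """Collapse identical sequences (unique.seqs) and build a count
    table keyed by (representative name, group) (count.seqs).

    named_seqs: iterable of (name, group, sequence) tuples.
    Returns (unique, counts): unique maps each distinct sequence to
    the first read name seen for it; counts maps (representative,
    group) to the number of reads it stands for.
    """
    unique = {}
    counts = defaultdict(int)
    for name, group, seq in named_seqs:
        rep = unique.setdefault(seq, name)  # first name represents duplicates
        counts[(rep, group)] += 1
    return unique, counts
```

Downstream commands then only align and classify the representatives, while the count table preserves per-group abundances.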
Cool, right? Now we need to align our sequences to the reference alignment. Again we can make our lives a bit easier by making a database customized to our region of interest using the pcr.seqs command. To run this command you need to have the reference database (silva.bacteria.fasta) and know where in that alignment your sequences start and end. To remove the leading and trailing dots we will set keepdots to false. You could also run this command using your primers of interest:
Code Block
mothur > pcr.seqs(fasta=silva.bacteria.fasta, start=11894, end=25319, keepdots=F)

Let's rename it to something more useful using the rename.file command: 

 

Code Block
mothur > rename.file(input=silva.bacteria.pcr.fasta, new=silva.v4.fasta)

Let's take a look at what we've made: 

Code Block
mothur > summary.seqs(fasta=silva.v4.fasta)

Using 4 processors.
		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	13424	270	0	3	1
2.5%-tile:	1	13425	292	0	4	374
25%-tile:	1	13425	293	0	4	3740
Median: 	1	13425	293	0	4	7479
75%-tile:	1	13425	293	0	5	11218
97.5%-tile:	1	13425	294	1	6	14583
Maximum:	3	13425	351	5	9	14956
Mean:	1	13424	292	0	4
# of Seqs:	14956

 

 

Now we have a customized reference alignment to align our sequences to. The nice thing about this reference is that instead of being 50,000 columns wide, it is now 13,425 columns wide which will save our hard drive some space and should improve the overall alignment quality. We'll do the alignment with align.seqs:

 

Code Block
mothur > align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=silva.v4.fasta)

This should be done in a matter of seconds and we can run summary.seqs again: 

Code Block
mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table)

Using 4 processors.
		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1967	11549	250	0	3	1
2.5%-tile:	1968	11550	252	0	4	166
25%-tile:	1968	11550	252	0	4	1660
Median: 	1968	11550	252	0	4	3320
75%-tile:	1968	11550	253	0	5	4979
97.5%-tile:	1968	11550	253	0	6	6473
Maximum:	1968	11552	255	0	8	6638
Mean:	1967	11549	252	0	4
# of unique seqs:	1533
total # of seqs:	6638
 

So what does this mean? You'll see that the bulk of the sequences start at position 1968 and end at position 11550. Some sequences start at position 1967 and end at 11549. These deviants from the mode positions are likely due to an insertion or deletion at the terminal ends of the alignments. Sometimes you'll see sequences that start and end at the same position, indicating a very poor alignment, which is generally due to non-specific amplification. To make sure that everything overlaps the same region we'll re-run screen.seqs to get sequences that start at or before position 1968 and end at or after position 11550. We'll also set the maximum homopolymer length to 8 since there's nothing in the database with a stretch of 9 or more of the same base in a row (this really could have been done in the first execution of screen.seqs above). Note that we need the count table so that we can update it for the sequences we're removing, and we're also using the summary file so we don't have to figure out all the start and stop positions again:

Code Block
mothur > screen.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table, summary=stability.trim.contigs.good.unique.summary, start=1968, end=11550, maxhomop=8)

Using 4 processors.
383
383
383
384
It took 0 secs to screen 1533 sequences, removed 10.

/******************************************/
Running command: remove.seqs(accnos=stability.trim.contigs.good.unique.bad.accnos, count=stability.trim.contigs.good.count_table)
Removed 10 sequences from your count file.

Output File Names: 
stability.trim.contigs.good.pick.count_table
/******************************************/

Output File Names: 
stability.trim.contigs.good.unique.bad.accnos
stability.trim.contigs.good.unique.good.summary
stability.trim.contigs.good.unique.good.align
stability.trim.contigs.good.good.count_table

Code Block
mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.good.align, count=stability.trim.contigs.good.good.count_table)

Using 4 processors.
		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1967	11550	250	0	3	1
2.5%-tile:	1968	11550	252	0	4	166
25%-tile:	1968	11550	252	0	4	1658
Median: 	1968	11550	252	0	4	3315
75%-tile:	1968	11550	253	0	5	4972
97.5%-tile:	1968	11550	253	0	6	6463
Maximum:	1968	11552	255	0	8	6628
Mean:	1967	11550	252	0	4
# of unique seqs:	1523
total # of seqs:	6628

Now we know our sequences overlap the same alignment coordinates, we want to make sure they only overlap that region. So we'll filter the sequences to remove the overhangs at both ends. Since we've done paired-end sequencing, this shouldn't be much of an issue, but whatever. In addition, there are many columns in the alignment that only contain gap characters (i.e. "-"). These can be pulled out without losing any information. We'll do all this with filter.seqs:

 

Code Block
mothur > filter.seqs(fasta=stability.trim.contigs.good.unique.good.align, vertical=T, trump=.)

At the end of running the command we get the following information:

Code Block
Length of filtered alignment: 297
Number of columns removed: 13128
Length of the original alignment: 13425
Number of sequences used to construct filter: 1523

This means that our initial alignment was 13425 columns wide and that we were able to remove 13128 terminal gap characters using trump=. and vertical gap characters using vertical=T. The final alignment length is 297 columns. Because we've perhaps created some redundancy across our sequences by trimming the ends, we can re-run unique.seqs:

Code Block
mothur > unique.seqs(fasta=stability.trim.contigs.good.unique.good.filter.fasta, count=stability.trim.contigs.good.good.count_table)
The next thing we want to do to further de-noise our sequences is to pre-cluster them using the pre.cluster command, allowing for up to 2 differences between sequences. This command will split the sequences by group, sort them by abundance, and go from most abundant to least, identifying sequences that are within 2 nt of each other; if they are, they get merged. We generally favor allowing 1 difference for every 100 bp of sequence:

Code Block
mothur > pre.cluster(fasta=stability.trim.contigs.good.unique.good.filter.unique.fasta, count=stability.trim.contigs.good.unique.good.filter.count_table, diffs=2)
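The greedy, abundance-ordered merging that pre.cluster performs can be sketched in Python (illustrative only; the real command works per group on aligned sequences and tracks names as well as counts):

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def pre_cluster(counts, diffs=2):
    """Greedy abundance-sorted pre-clustering: a rare sequence within
    `diffs` mismatches of a more abundant one is folded into it.

    counts: dict mapping sequence -> abundance (one group).
    Returns the merged dict of representative -> abundance.
    """
    ordered = sorted(counts, key=counts.get, reverse=True)
    merged = {}
    for seq in ordered:
        for rep in merged:
            if len(rep) == len(seq) and hamming(rep, seq) <= diffs:
                merged[rep] += counts[seq]   # fold the rare variant in
                break
        else:
            merged[seq] = counts[seq]        # becomes its own representative
    return merged
```

Because sequences are visited from most to least abundant, likely-erroneous rare variants are absorbed by their probable parent rather than the other way around.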

We now have 564 unique sequences. At this point we have removed as much sequencing error as we can and it is time to turn our attention to removing chimeras. We'll do this using the VSEARCH algorithm that is called within mothur using the chimera.vsearch command. Again, this command will split the data by sample and check for chimeras. Our preferred way of doing this is to use the abundant sequences as our reference. In addition, if a sequence is flagged as chimeric in one sample, the default (dereplicate=F) is to remove it from all samples. Our experience suggests that this is a bit aggressive since we've seen rare sequences get flagged as chimeric when they're the most abundant sequence in another sample. This is how we do it:

Code Block
mothur > chimera.vsearch(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)

Using 4 processors.
Checking sequences from stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta ...
vsearch v2.3.4_linux_x86_64, 7.8GB RAM, 4 cores
https://github.com/torognes/vsearch
Reading file stability.trim.contigs.good.unique.good.filter.unique.precluster.temp 100%
142526 nt in 564 seqs, min 250, max 255, avg 253
Masking 100%
Sorting by abundance 100%
Counting unique k-mers 100%
Detecting chimeras 100%
Found 271 (48.0%) chimeras, 280 (49.6%) non-chimeras,
and 13 (2.3%) borderline sequences in 564 unique sequences.
Taking abundance information into account, this corresponds to
427 (6.4%) chimeras, 6188 (93.4%) non-chimeras,
and 13 (0.2%) borderline sequences in 6628 total sequences.
It took 1 secs to check 564 sequences from group F3D0.

Output File Names: 
stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table
stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.chimeras
stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.accnos

Running chimera.vsearch with the count file will remove the chimeric sequences from the count file, but you still need to remove those sequences from the fasta file. We do this using remove.seqs:

Code Block
mothur > remove.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, accnos=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.accnos)

Running summary.seqs we see what we're left with:

 

Code Block
mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.count_table)


Using 4 processors.
		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	297	250	0	3	1
2.5%-tile:	1	297	252	0	4	166
25%-tile:	1	297	252	0	4	1658
Median: 	1	297	252	0	4	3315
75%-tile:	1	297	253	0	5	4972
97.5%-tile:	1	297	253	0	6	6463
Maximum:	1	297	255	0	8	6628
Mean:	1	297	252	0	4
# of unique seqs:	564
total # of seqs:	6628

Note that we went from 7793 to 6628 sequences; this is a reasonable number of sequences to be flagged as chimeric. As a final quality control step, we need to see if there are any "undesirables" in our dataset. Sometimes when we pick a primer set it will amplify other things that get to this point in the pipeline, such as 18S rRNA gene fragments or 16S rRNA from Archaea, chloroplasts, and mitochondria. There's also just the random stuff that we want to get rid of. Now you may say, "But wait, I want that stuff". Fine. But the primers we use are only supposed to amplify members of the Bacteria, and if they're hitting Eukaryota or Archaea, then it's a mistake. Also, realize that chloroplasts and mitochondria have no functional role in a microbial community. But I digress. Let's go ahead and classify those sequences using the Bayesian classifier with the classify.seqs command:

 

Code Block
mothur > classify.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table, reference=trainset9_032012.pds.fasta, taxonomy=trainset9_032012.pds.tax, cutoff=80)

Now that everything is classified we want to remove our undesirables. We do this with the remove.lineage command:

 

Code Block
mothur > remove.lineage(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)
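Conceptually, remove.lineage just drops every sequence whose classification contains one of the unwanted taxa. A Python sketch (the lineage string format mirrors mothur's .taxonomy files; the function name is hypothetical):

```python
def remove_lineage(taxonomy, taxa_to_remove):
    """Drop sequences whose classification contains an unwanted taxon.

    taxonomy: dict mapping sequence name -> semicolon-separated lineage,
    e.g. "Bacteria(100);Cyanobacteria(98);Chloroplast(97);".
    Returns the list of sequence names to keep.
    """
    unwanted = set(taxa_to_remove)
    keep = []
    for name, lineage in taxonomy.items():
        # strip bootstrap confidence values like "(97)" from each rank
        ranks = {r.split("(")[0] for r in lineage.strip(";").split(";")}
        if not ranks & unwanted:
            keep.append(name)
    return keep
```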

Also of note is that "unknown" only pops up as a classification if the classifier cannot classify your sequence to one of the domains. If you run summary.seqs you'll see that we now have 2281 unique sequences and 118150 total sequences. This means about 350 of our sequences were in these various groups. Now, to create an updated taxonomy summary file that reflects these removals we use the summary.tax command:

Info

BUG ALERT: There is a bug in mothur (at least in v1.36.1): if you get an error saying you have duplicate sequence names in your count_table file, remove the last line, which has been duplicated. If you use pico to edit the file, you can type Ctrl-W and then Ctrl-V to go to the last line, then Ctrl-K to delete it and Ctrl-X to save the file.

Note: You can suspend mothur (or any other program) by typing Ctrl-Z. When you want to go back to it, type fg for 'foreground'. If you want to leave it in the background, type bg for 'background'. Don't leave things running in the background that are never going to finish; make sure to close out of them with Ctrl-C.

Code Block
mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.count_table)


Using 4 processors.


		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	297	250	0	3	1
2.5%-tile:	1	297	252	0	4	155
25%-tile:	1	297	252	0	4	1550
Median: 	1	297	252	0	4	3099
75%-tile:	1	297	253	0	5	4648
97.5%-tile:	1	297	253	0	6	6042
Maximum:	1	297	255	0	6	6196
Mean:	1	297	252	0	4
# of unique seqs:	289
total # of seqs:	6196

 This creates a pick.tax.summary file with the undesirables removed. At this point we have curated our data as far as possible and we're ready to see what our error rate is.

Assessing error rates

Measuring the error rate of your sequences is something you can only do if you have co-sequenced a mock community. This is something we include for every 95 samples we sequence. You should too, because it will help you gauge your error rates, see how well your curation is going, and tell whether something is wrong with your sequencing setup. First we want to pull out the sequences that were from our "Mock" sample using the get.groups command:

Code Block
mothur > get.groups(count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.count_table, fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, groups=F3D0)


[NOTE]: The count file should contain only unique names, so mothur assumes your fasta, list and taxonomy files also contain only uniques.


Selected 289 sequences from your fasta file.
Selected 6196 sequences from your count file.

This tells us that we had 289 unique sequences and 6196 total sequences in our F3D0 sample. We can now cluster the sequences into OTUs to see how many spurious OTUs we have:

Code Block
mothur > dist.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta, cutoff=0.03)
mothur > cluster(column=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.dist, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.pick.count_table)
mothur > make.shared(list=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.opti_mcc.list, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.pick.count_table, label=0.03)
mothur > rarefaction.single(shared=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.opti_mcc.shared)

This string of commands will produce a file for you called stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.an.unique_list.groups.rarefaction. Open it. You'll see that for 4061 sequences, we'd have 64 OTUs from the Mock community. This number of course includes some stealthy chimeras that escaped our detection methods. If we used 3000 sequences, we would have about 31 OTUs. In a perfect world with no chimeras and no sequencing errors, we'd have 20 OTUs. This is not a perfect world. But this is pretty darn good!
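Rarefaction itself is a simple resampling idea: subsample the community to a fixed depth many times and average the number of OTUs observed, which is what makes OTU counts comparable across sampling depths. A Python sketch (illustrative only; rarefaction.single works from the shared file):

```python
import random

def rarefy_otus(otu_assignments, depth, trials=100, seed=1):
    """Repeatedly subsample `depth` reads without replacement and
    count how many distinct OTUs are observed.

    otu_assignments: list with one OTU label per read.
    Returns the mean observed-OTU count across trials.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        sample = rng.sample(otu_assignments, depth)
        total += len(set(sample))
    return total / trials
```

Plotting this mean against depth gives the rarefaction curve you see in the .groups.rarefaction file.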

Preparing for analysis

You can follow the rest of the tutorial from here - https://www.mothur.org/wiki/MiSeq_SOP