...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
Mothur-1.40.0 on Atmosphere (Images Tutorial)
Rationale and background:
MiSeq SOP
This tutorial follows the basic Schloss Lab MiSeq Standard Operating Procedure (SOP) tutorial. Follow along there for more details about each command. For help with any command, use the mothur website or call the command with 'help', e.g. make.contigs(help). This tutorial uses only one sample and does not make use of 'groups'; see the SOP for details on running multiple samples simultaneously.
Logistics
Starting out we need to first determine: what is our question? The Schloss lab is interested in understanding the effect of normal variation in the gut microbiome on host health. To that end we collected fresh feces from mice on a daily basis for 365 days post weaning (we're accepting applications). During the first 150 days post weaning (dpw), nothing was done to our mice except allow them to eat, get fat, and be merry. We were curious whether the rapid change in weight observed during the first 10 dpw affected the stability of the microbiome compared to the microbiome observed between days 140 and 150. We will address this question in this tutorial using a combination of OTU, phylotype, and phylogenetic methods. To make this tutorial easier to execute, we are providing only part of the data: you are given the flow files for one animal at 10 time points (5 early and 5 late). In addition to sequencing samples from mouse fecal material, we resequenced a mock community composed of genomic DNA from 21 bacterial strains. We will use the 10 fecal samples to look at how to analyze microbial communities and the mock community to measure the error rate and its effect on other analyses.
In a manuscript submitted to Applied & Environmental Microbiology, we describe a set of primers that allow you to sequence 1536 samples in parallel using only 80 primers (32 + 48) and obtain sequence reads that are at least as good as those generated by 454 sequencing using our 454 SOP. Please consult the supplementary methods of that manuscript and our wet-lab SOP for more information. All of the data from that study are available through our server.
Sequences come off the MiSeq as pairs of fastq files, with each pair representing the two sets of reads per sample. fastq files contain both the sequence data and the quality score data. If you aren't getting these files off the sequencer, then you likely have the software parameters set incorrectly.
For this tutorial you will need several sets of files. To speed up the tutorial we provide some of the downstream files that take a while to generate (e.g. the output of shhh.flows):
Checking Mothur version
Code Block |
---|
$ mothur --version
Linux 64Bit Version
Mothur version=1.40.0
Release Date=4/10/2018 |
...
Part 1: Connect to an instance of an Atmosphere Image (virtual machine)
Step 1. Go to https://atmo.cyverse.org and log in with your CyVerse credentials.
note: click images to enlarge
Step 2. Go to the Projects tab and create a new project, with Mothur as the name and Prediction-based taxonomic classification as the description
note: click images to enlarge
Step 3. Click the Mothur project and then click Instance. Select the image Mothur and click Launch Instance. It will take ~10-15 minutes for the cloud instance to be launched.
note: click images to enlarge
Note: Instances can be configured for different amounts of CPU, memory, and storage depending on user needs. This tutorial can be accomplished with the medium1 instance size (4 CPUs, 8 GB memory, 80 GB root)
Step 4. Once the VM is ready, click it to go to the next screen, where you can open the Web Shell
note: click images to enlarge
or ssh to its IP address from a terminal, as below.
Code Block |
---|
$ ssh <username>@<ip.address> |
Part 2: Set up a Mothur run using the Web Shell
Code Block |
---|
$ cp -r /opt/MiSeq_SOP_sub .
$ cd MiSeq_SOP_sub |
...
Code Block |
---|
mothur > make.contigs(file=stability.files)
Making contigs...
1000 1000 1000 1000 1948 1948 1948 1949
Done.
It took 4 secs to assemble 7793 reads.
Group count:
F3D0	7793
Total of all groups is 7793
It took 4 secs to process 7793 sequences.
Output File Names:
stability.trim.contigs.fasta
stability.trim.contigs.qual
stability.scrap.contigs.fasta
stability.scrap.contigs.qual
stability.contigs.report
stability.contigs.groups |
Info |
---|
This outputs several files.
|
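Conceptually, what make.contigs just did for each read pair can be sketched in a few lines of Python: reverse-complement the reverse read, line it up against the forward read, and at each disagreeing position in the overlap keep the base with the higher quality score. This is a toy version (the fixed overlap length and all values are assumptions; the real command finds the overlap by alignment and also handles gaps):

```python
# Toy make.contigs: merge a forward read with the reverse complement of its
# mate over a known overlap, resolving mismatches by higher quality score.
def revcomp(seq):
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def merge_pair(fwd, fwd_qual, rev, rev_qual, overlap):
    rc, rc_qual = revcomp(rev), rev_qual[::-1]
    contig = list(fwd[:-overlap])                  # forward-only prefix
    for i in range(overlap):                       # overlapping region
        f_base, f_q = fwd[-overlap + i], fwd_qual[-overlap + i]
        r_base, r_q = rc[i], rc_qual[i]
        contig.append(f_base if f_q >= r_q else r_base)
    contig += list(rc[overlap:])                   # reverse-only suffix
    return "".join(contig)

# The last base of the forward read ('T', quality 10) disagrees with the
# reverse read ('A', quality 35), so the higher-quality base wins:
print(merge_pair("ACGTT", [40, 40, 40, 40, 10],
                 "GGTAC", [20, 20, 35, 35, 35], overlap=3))  # ACGTACC
```

The more the two reads overlap, the more positions get this double coverage, which is why longer overlaps reduce the error rate.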
Let's see what these sequences look like using the summary.seqs command:
Code Block |
---|
mothur > summary.seqs(fasta=stability.trim.contigs.fasta)

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	249	249	0	3	1
2.5%-tile:	1	252	252	0	4	195
25%-tile:	1	252	252	0	4	1949
Median: 	1	252	252	0	4	3897
75%-tile:	1	253	253	0	5	5845
97.5%-tile:	1	253	253	5	6	7599
Maximum:	1	502	502	241	188	7793
Mean:	1	252	252	0	4
# of Seqs:	7793 |
This tells us that we have 7793 sequences that for the most part vary between 249 and 253 bases. Interestingly, the longest read in the dataset is 502 bp. Be suspicious of this. Recall that the reads are supposed to be 251 bp each. This read clearly didn't assemble well (or at all). Also, note that at least 2.5% of our sequences had some ambiguous base calls. We'll take care of these issues in the next step when we run screen.seqs.
Code Block |
---|
mothur > screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, maxambig=0, maxlength=275) |
This implementation of the command will remove any sequences with ambiguous bases and anything longer than 275 bp. There's another way to run this using the output from summary.seqs:
Code Block |
---|
mothur > screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, summary=stability.trim.contigs.summary, maxambig=0, maxlength=275) |
This may be faster because the summary.seqs output file already has the number of ambiguous bases and sequence length calculated for your sequences. Also, mothur is smart enough to remember that we used 4 processors in make.contigs and so it will use that throughout your current session. To see what else mothur knows about you, run the following:
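The command being referred to is get.current(), which lists the files and parameter values mothur is currently tracking; you can then use current as a shortcut instead of typing out full file names:

Code Block |
---|
mothur > get.current() |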
Processing improved sequences
We anticipate that many of our sequences are duplicates of each other. Because it's computationally wasteful to align the same thing a bazillion times, we'll unique our sequences using the unique.seqs command:
Code Block |
---|
mothur > unique.seqs(fasta=stability.trim.contigs.good.fasta)
1000	368
2000	609
3000	824
4000	1025
5000	1221
6000	1408
6638	1533 |
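What unique.seqs just did can be sketched in Python: collapse identical sequences to one representative each, while remembering which reads collapsed together (that mapping is what the names file stores). A minimal sketch with made-up read names:

```python
# Sketch of unique.seqs: collapse identical sequences, keeping for each
# representative the names of every read that had that exact sequence.
def unique_seqs(seqs):
    rep_of = {}   # sequence -> representative read name
    fasta = {}    # representative name -> sequence (the .unique.fasta)
    names = {}    # representative name -> duplicate names (the .names file)
    for name, seq in seqs:
        if seq not in rep_of:
            rep_of[seq] = name
            fasta[name] = seq
            names[name] = []
        names[rep_of[seq]].append(name)
    return fasta, names

fasta, names = unique_seqs([("r1", "ACGT"), ("r2", "ACGT"), ("r3", "TTGG")])
print(len(fasta), names["r1"])  # 2 ['r1', 'r2']
```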
Code Block |
---|
mothur > count.seqs(name=stability.trim.contigs.good.names, group=stability.contigs.good.groups) |
This will generate a file called stability.trim.contigs.good.count_table. In subsequent commands we'll use it by using the count option:
Code Block |
---|
mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.fasta, count=stability.trim.contigs.good.count_table)
Using 4 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	250	250	0	3	1
2.5%-tile:	1	252	252	0	4	166
25%-tile:	1	252	252	0	4	1660
Median: 	1	252	252	0	4	3320
75%-tile:	1	253	253	0	5	4979
97.5%-tile:	1	253	253	0	6	6473
Maximum:	1	255	255	0	8	6638
Mean:	1	252	252	0	4
# of unique seqs:	1533
total # of seqs:	6638 |
We want a good reference database to align our sequences to, and the smaller the better. Starting from the full-length SILVA bacterial 16S alignment, pcr.seqs trims the reference down to our region of interest; here we use the start and end positions corresponding to the primers used by the Schloss lab (if you have your own primer sequences you can make your own database):
Code Block |
---|
mothur > pcr.seqs(fasta=silva.bacteria.fasta, start=11894, end=25319, keepdots=F) |
Let's rename it to something more useful using the rename.file command:
Code Block |
---|
mothur > rename.file(input=silva.bacteria.pcr.fasta, new=silva.v4.fasta) |
Let's take a look at what we've made:
Code Block |
---|
mothur > summary.seqs(fasta=silva.v4.fasta)
Using 4 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	13424	270	0	3	1
2.5%-tile:	1	13425	292	0	4	374
25%-tile:	1	13425	293	0	4	3740
Median: 	1	13425	293	0	4	7479
75%-tile:	1	13425	293	0	5	11218
97.5%-tile:	1	13425	294	1	6	14583
Maximum:	3	13425	351	5	9	14956
Mean:	1	13424	292	0	4
# of Seqs:	14956 |
Now we have a customized reference alignment to align our sequences to. The nice thing about this reference is that instead of being 50,000 columns wide, it is now 13,425 columns wide which will save our hard drive some space and should improve the overall alignment quality. We'll do the alignment with align.seqs:
Code Block |
---|
mothur > align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=silva.v4.fasta) |
This should be done in a matter of seconds, and we can run summary.seqs again:
Code Block |
---|
mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table)
Using 4 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1967	11549	250	0	3	1
2.5%-tile:	1968	11550	252	0	4	166
25%-tile:	1968	11550	252	0	4	1660
Median: 	1968	11550	252	0	4	3320
75%-tile:	1968	11550	253	0	5	4979
97.5%-tile:	1968	11550	253	0	6	6473
Maximum:	1968	11552	255	0	8	6638
Mean:	1967	11549	252	0	4
# of unique seqs:	1533
total # of seqs:	6638 |
So what does this mean? You'll see that the bulk of the sequences start at position 1968 and end at position 11550. Some sequences start at position 1967 and end at 11549; these deviants from the mode positions are likely due to an insertion or deletion at the terminal ends of the alignments. Sometimes you'll also see sequences that start and end at the same position, indicating a very poor alignment, which is generally due to non-specific amplification. To make sure that everything overlaps the same region, we'll re-run screen.seqs to get sequences that start at or before position 1968 and end at or after position 11550. We'll also set the maximum homopolymer length to 8, since there's nothing in the database with a stretch of 9 or more of the same base in a row (this really could have been done in the first execution of screen.seqs above). Note that we need the count table so that mothur can update it for the sequences we're removing, and we're also using the summary file so we don't have to recalculate all the start and stop positions:
Code Block |
---|
mothur > screen.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table, summary=stability.trim.contigs.good.unique.summary, start=1968, end=11550, maxhomop=8)
Using 4 processors.
383 383 383 384
It took 0 secs to screen 1533 sequences, removed 10.
/******************************************/
Running command: remove.seqs(accnos=stability.trim.contigs.good.unique.bad.accnos, count=stability.trim.contigs.good.count_table)
Removed 10 sequences from your count file.
Output File Names:
stability.trim.contigs.good.pick.count_table
/******************************************/
Output File Names:
stability.trim.contigs.good.unique.bad.accnos
stability.trim.contigs.good.unique.good.summary
stability.trim.contigs.good.unique.good.align
stability.trim.contigs.good.good.count_table |
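The screening criteria used here are easy to state as a predicate: keep an aligned sequence only if it starts at or before position 1968, ends at or after position 11550, and contains no homopolymer longer than 8 bases. A sketch in Python (coordinates are alignment columns; this ignores screen.seqs' other options):

```python
import re

# Keep a sequence only if it spans [start, end] in alignment coordinates
# and has no homopolymer run longer than maxhomop bases.
def passes_screen(aln_start, aln_end, seq, start=1968, end=11550, maxhomop=8):
    longest_run = max(len(m.group()) for m in re.finditer(r"(.)\1*", seq))
    return aln_start <= start and aln_end >= end and longest_run <= maxhomop

print(passes_screen(1968, 11550, "ACGTACGT"))       # True
print(passes_screen(1970, 11550, "ACGTACGT"))       # False: starts too late
print(passes_screen(1968, 11550, "A" * 9 + "CGT"))  # False: 9-base homopolymer
```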
After screening, we can check the alignment coordinates again:
Code Block |
---|
mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.good.align, count=stability.trim.contigs.good.good.count_table)
Using 4 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1967	11550	250	0	3	1
2.5%-tile:	1968	11550	252	0	4	166
25%-tile:	1968	11550	252	0	4	1658
Median: 	1968	11550	252	0	4	3315
75%-tile:	1968	11550	253	0	5	4972
97.5%-tile:	1968	11550	253	0	6	6463
Maximum:	1968	11552	255	0	8	6628
Mean:	1967	11550	252	0	4
# of unique seqs:	1523
total # of seqs:	6628 |
Now we know our sequences overlap the same alignment coordinates, we want to make sure they only overlap that region. So we'll filter the sequences to remove the overhangs at both ends. Since we've done paired-end sequencing, this shouldn't be much of an issue, but whatever. In addition, there are many columns in the alignment that only contain gap characters (i.e. "-"). These can be pulled out without losing any information. We'll do all this with filter.seqs:
Code Block |
---|
mothur > filter.seqs(fasta=stability.trim.contigs.good.unique.good.align, vertical=T, trump=.) |
At the end of running the command we get the following information:
Code Block |
---|
Length of filtered alignment: 297
Number of columns removed: 13128
Length of the original alignment: 13425
Number of sequences used to construct filter: 1523 |
This means that our initial alignment was 13425 columns wide and that we were able to remove 13128 columns: terminal gap characters using trump=. and vertical gap characters using vertical=T. The final alignment length is 297 columns. Because we've perhaps created some redundancy across our sequences by trimming the ends, we can re-run unique.seqs:
Code Block |
---|
mothur > unique.seqs(fasta=stability.trim.contigs.good.unique.good.filter.fasta, count=stability.trim.contigs.good.good.count_table) |
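The filtering step above can be sketched in Python: vertical=T drops any alignment column that is a gap in every sequence ('-' or '.'), and trump=. additionally drops a column if any sequence has a '.' there. A minimal sketch of the vertical part:

```python
# Sketch of filter.seqs(vertical=T): drop every alignment column that
# contains only gap characters ('-' or '.') across all sequences.
def filter_vertical(aligned):
    keep = [i for i in range(len(aligned[0]))
            if any(s[i] not in "-." for s in aligned)]
    return ["".join(s[i] for i in keep) for s in aligned]

print(filter_vertical(["A-C-.T",
                       "A-G-.T",
                       "T-C-.A"]))  # ['ACT', 'AGT', 'TCA']
```

No information is lost: a column that is all gaps says nothing about any sequence.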
Code Block |
---|
mothur > pre.cluster(fasta=stability.trim.contigs.good.unique.good.filter.unique.fasta, count=stability.trim.contigs.good.unique.good.filter.count_table, diffs=2) |
We now have 564 unique sequences. At this point we have removed as much sequencing error as we can and it is time to turn our attention to removing chimeras. We'll do this using the VSEARCH algorithm that is called within mothur using the chimera.vsearch command. Again, this command will split the data by sample and check for chimeras. Our preferred way of doing this is to use the abundant sequences as our reference. In addition, if a sequence is flagged as chimeric in one sample, the default (dereplicate=F) is to remove it from all samples. Our experience suggests that this is a bit aggressive since we've seen rare sequences get flagged as chimeric when they're the most abundant sequence in another sample. This is how we do it:
Code Block |
---|
mothur > chimera.vsearch(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)
Using 4 processors.
Checking sequences from stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta ...
vsearch v2.3.4_linux_x86_64, 7.8GB RAM, 4 cores
https://github.com/torognes/vsearch
Reading file stability.trim.contigs.good.unique.good.filter.unique.precluster.temp 100%
142526 nt in 564 seqs, min 250, max 255, avg 253
Masking 100%
Sorting by abundance 100%
Counting unique k-mers 100%
Detecting chimeras 100%
Found 271 (48.0%) chimeras, 280 (49.6%) non-chimeras, and 13 (2.3%) borderline sequences in 564 unique sequences.
Taking abundance information into account, this corresponds to 427 (6.4%) chimeras, 6188 (93.4%) non-chimeras, and 13 (0.2%) borderline sequences in 6628 total sequences.
It took 1 secs to check 564 sequences from group F3D0.
Output File Names:
stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table
stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.chimeras
stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.accnos |
Running chimera.vsearch with the count file will remove the chimeric sequences from the count file. But you still need to remove those sequences from the fasta file. We do this using remove.seqs:
Code Block |
---|
mothur > remove.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, accnos=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.accnos) |
Running summary.seqs we see what we're left with:
Code Block |
---|
mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.count_table)
Using 4 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 297 250 0 3 1
2.5%-tile: 1 297 252 0 4 166
25%-tile: 1 297 252 0 4 1658
Median: 1 297 252 0 4 3315
75%-tile: 1 297 253 0 5 4972
97.5%-tile: 1 297 253 0 6 6463
Maximum: 1 297 255 0 8 6628
Mean: 1 297 252 0 4
# of unique seqs: 564
total # of seqs: 6628
|
Note that we went from 7793 to 6628 sequences; this is a reasonable number of sequences to be flagged as chimeric. As a final quality-control step, we need to see if there are any "undesirables" in our dataset. Sometimes when we pick a primer set, it will amplify other things that make it to this point in the pipeline, such as 18S rRNA gene fragments or 16S rRNA from Archaea, chloroplasts, and mitochondria. There's also just the random stuff that we want to get rid of. Now you may say, "But wait, I want that stuff". Fine. But the primers we use are only supposed to amplify members of the Bacteria, and if they're hitting Eukaryota or Archaea, then it's a mistake. Also, realize that chloroplasts and mitochondria have no functional role in a microbial community. But I digress. Let's go ahead and classify those sequences using the Bayesian classifier with the classify.seqs command:
Code Block |
---|
mothur > classify.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table, reference=trainset9_032012.pds.fasta, taxonomy=trainset9_032012.pds.tax, cutoff=80) |
Now that everything is classified we want to remove our undesirables. We do this with the remove.lineage command:
Code Block |
---|
mothur > remove.lineage(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota) |
Also of note is that "unknown" only pops up as a classification if the classifier cannot classify your sequence to one of the domains. If you run summary.seqs you'll see that we now have 289 unique sequences and 6196 total sequences, so only a small number of sequences fell into these various groups. (Running summary.tax would create an updated tax.summary file that reflects these removals.) Checking with summary.seqs:
Code Block |
---|
mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.count_table)
Using 4 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	297	250	0	3	1
2.5%-tile:	1	297	252	0	4	155
25%-tile:	1	297	252	0	4	1550
Median: 	1	297	252	0	4	3099
75%-tile:	1	297	253	0	5	4648
97.5%-tile:	1	297	253	0	6	6042
Maximum:	1	297	255	0	6	6196
Mean:	1	297	252	0	4
# of unique seqs:	289
total # of seqs:	6196 |
BUG ALERT
There is a bug in mothur (at least in v1.36.1): if you get an error that says you have duplicate sequence names in your count_table file, remove the last line, which has been duplicated.
You can suspend mothur (or any other program) by typing Ctrl-Z. When you want to go back to it, type fg for 'foreground'. If you want to leave it in the background, type bg for background. PS: Don't leave things running in the background that are never going to finish; make sure to close out of them with Ctrl-C.
If you use pico to edit the file, you can type Ctrl-W and then Ctrl-V to go to the last line, then type Ctrl-K to delete it and Ctrl-X to save the file.
This creates a pick.tax.summary file with the undesirables removed. At this point we have curated our data as far as possible and we're ready to see what our error rate is.
Assessing error rates
Measuring the error rate of your sequences is something you can only do if you have co-sequenced a mock community; this is something we include for every 95 samples we sequence. You should too, because it will help you gauge your error rates, see how well your curation is going, and tell whether something is wrong with your sequencing setup. First we want to pull out the sequences from a single sample using the get.groups command (the full SOP selects the "Mock" sample; this subset contains only F3D0):
Code Block |
---|
mothur > get.groups(count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.count_table, fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, groups=F3D0)
[NOTE]: The count file should contain only unique names, so mothur assumes your fasta, list and taxonomy files also contain only uniques.
Selected 289 sequences from your fasta file.
Selected 6196 sequences from your count file. |
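The error-rate calculation that the full SOP performs with seq.error boils down to: compare each mock-community read against the reference sequence it matches best and divide mismatched bases by total bases. A toy sketch assuming reads and references are already aligned to equal length (real data would also need to account for insertions and deletions):

```python
# Toy error rate: mismatched bases / total bases over (read, reference)
# pairs that are already aligned to the same length.
def error_rate(pairs):
    mismatches = total = 0
    for read, ref in pairs:
        mismatches += sum(1 for a, b in zip(read, ref) if a != b)
        total += len(ref)
    return mismatches / total

pairs = [("ACGTACGT", "ACGTACGT"),   # perfect read
         ("ACGTACGA", "ACGTACGT")]   # one substitution
print(error_rate(pairs))  # 0.0625 (1 error in 16 bases)
```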
This tells us that we had 289 unique sequences and 6196 total sequences in our F3D0 sample. We can now cluster the sequences into OTUs to see how many spurious OTUs we have:
Code Block |
---|
mothur > dist.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta, cutoff=0.03)
mothur > cluster(column=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.dist, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.pick.count_table)
mothur > make.shared(list=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.opti_mcc.list, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.pick.count_table, label=0.03)
mothur > rarefaction.single(shared=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.opti_mcc.shared) |
This string of commands will produce a file called stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.opti_mcc.groups.rarefaction. Open it. You'll see that for 4061 sequences, we'd have 64 OTUs from the Mock community. This number of course includes some stealthy chimeras that escaped our detection methods. If we used 3000 sequences, we would have about 31 OTUs. In a perfect world with no chimeras and no sequencing errors, we'd have 20 OTUs. This is not a perfect world. But this is pretty darn good!
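Rarefaction itself is simple to picture: repeatedly draw n reads at random from the sample and count how many OTUs show up, which is why the curve climbs steeply at first and then flattens as the rare OTUs are exhausted. A toy sketch of a single draw (the OTU labels are made up):

```python
import random

# One rarefaction draw: sample n reads without replacement and count the
# distinct OTUs observed among them.
def rarefy(otu_of_read, n, seed=1):
    rng = random.Random(seed)
    return len(set(rng.sample(otu_of_read, n)))

reads = ["otu1"] * 50 + ["otu2"] * 30 + ["otu3"] * 2
print(rarefy(reads, 10))          # OTUs seen in a 10-read subsample
print(rarefy(reads, len(reads)))  # 3: the full sample shows every OTU
```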
Preparing for analysis
You can follow the rest of the tutorial from here - https://www.mothur.org/wiki/MiSeq_SOP