Pre-processing of sequence reads

The 16S rRNA amplicons cover the V3/V4 region of the 16S rRNA gene and were sequenced on an Illumina MiSeq with 2 x 300 bp read chemistry. The 18S rRNA amplicons were generated with the Earth Microbiome Project primers 1391f and EukBr, with Nextera adapters attached to allow the Nextera dual-indexing approach. The PCR also included the mammalian blocking primer from the EMP protocol, to reduce amplification of human DNA. These amplicons were sequenced on an Illumina MiSeq with 2 x 250 bp read chemistry. Note that I DO NOT recommend these 18S rRNA primers for eukaryotic amplification, as they are universal and will also amplify bacteria (the EMP protocol has now been updated to clarify this).

16S rRNA data

Step 1: Import the data into QIIME2

# First check original files intact
md5sum -c checksums.md5 # All OK

# Copy and rename files to match the Casava 1.8 naming format that QIIME2 expects
mkdir qiime_import
cp *.fastq.gz qiime_import
cd qiime_import

for file in *.fastq.gz;
do
newname=$(echo "$file" | sed 's/0_BPDNR//' | sed 's/.fastq/_001.fastq/')
mv "$file" "$newname"
done
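QIIME2's Casava 1.8 importer expects file names of the form SampleID_BarcodeOrRunID_L001_R1_001.fastq.gz. A quick sanity check that every renamed file now fits that pattern (a sketch, assuming all reads came off lane 1):

# Any file printed here does not end in the expected _L001_R[12]_001.fastq.gz suffix
ls *.fastq.gz | grep -Ev '_L001_R[12]_001\.fastq\.gz$'
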
# Analysis directory
mkdir /WORKING/Lappan/VL_16S_analysis
cd /WORKING/Lappan/VL_16S_analysis

# Import raw files into QIIME
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path /RAW/Lappan/VL_faecal_16S/AGRF_CAGRF16622_BPDNR/qiime_import \
--input-format CasavaOneEightSingleLanePerSampleDirFmt \
--output-path VL_16S_PE.qza

This produces the QIIME2 artifact VL_16S_PE.qza.
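As a quick sanity check, qiime tools peek reports the UUID, semantic type and format of an artifact; the type here should be SampleData[PairedEndSequencesWithQuality].

# Confirm the artifact has the expected semantic type
qiime tools peek VL_16S_PE.qza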

Step 2: Remove amplicon primers

# Trim amplicon primers: 341f (forward) and 806r (reverse)
# Run with --verbose to monitor trimming, and save the output to a log
qiime cutadapt trim-paired \
--i-demultiplexed-sequences VL_16S_PE.qza \
--p-cores 16 \
--p-front-f CCTAYGGGRBGCASCAG \
--p-front-r GGACTACNNGGGTATCTAAT \
--o-trimmed-sequences primer-trimmed-VL_16S_PE.qza \
--verbose \
&> primer_trimming.log

This step produces the QIIME2 artifact primer-trimmed-VL_16S_PE.qza.
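Since the cutadapt output is captured in primer_trimming.log, the trimming rates can be checked without reading the whole log. A rough sketch (the exact wording of cutadapt's summary lines may differ between versions):

# How often each primer was found, and how many read pairs were kept
grep -E 'with adapter|Pairs written' primer_trimming.log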

Step 3: Check quality plots and sequence length

# Summarise the reads
qiime demux summarize \
--i-data primer-trimmed-VL_16S_PE.qza \
--o-visualization primer-trimmed-VL_16S_PE.qzv

Based on this summary I choose the positions at which to truncate and trim the reads in the next step.
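The .qzv file can be inspected by dragging it onto https://view.qiime2.org, or locally on a machine with a browser:

# Open the interactive quality plots and length summary
qiime tools view primer-trimmed-VL_16S_PE.qzv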

Step 4: DADA2 length trimming, denoising, chimera and PhiX removal

# Run DADA2 denoising and filtering
qiime dada2 denoise-paired \
--p-n-threads 16 \
--i-demultiplexed-seqs primer-trimmed-VL_16S_PE.qza \
--p-trunc-len-f 283 \
--p-trim-left-r 7 \
--p-trunc-len-r 237 \
--output-dir DADA2_denoising_output \
--verbose \
&> DADA2_denoising.log

This step took a few days to run.
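Because this step runs for days, it is worth protecting it from a dropped SSH session. A minimal sketch, assuming the qiime dada2 command above is saved in a wrapper script (run_dada2.sh is a hypothetical name; screen, tmux or a job scheduler would work just as well):

# run_dada2.sh is a hypothetical wrapper containing the qiime dada2 command above
nohup bash run_dada2.sh > /dev/null 2>&1 &
# Monitor progress through the log that the command itself writes
tail -f DADA2_denoising.log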

This step produces three QIIME2 artifacts in the DADA2_denoising_output directory:

  • denoising_stats.qza, with a summary of the denoising results
  • representative_sequences.qza, the sequences of the exact sequence variants (features); they are joined paired-end reads
  • table.qza, the feature table (OTU table - feature counts by samples)

Step 5: Summarise and visualise DADA2 results

# Denoising stats
qiime metadata tabulate \
--m-input-file denoising_stats.qza \
--o-visualization denoising_stats.qzv

# Representative sequences
qiime feature-table tabulate-seqs \
--i-data representative_sequences.qza \
--o-visualization rep_seqs.qzv

# Feature table
qiime feature-table summarize \
--i-table table.qza \
--o-visualization table.qzv
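
For a quick look at the feature table outside QIIME2, the artifact can also be exported and converted to a plain TSV. A sketch, assuming a QIIME2 version with the --input-path/--output-path export flags (older releases used a positional argument with --output-dir); the exported BIOM file is named feature-table.biom:

# Export the feature table and convert it to tab-separated text
qiime tools export \
--input-path table.qza \
--output-path exported_feature_table
biom convert \
-i exported_feature_table/feature-table.biom \
-o exported_feature_table/feature-table.tsv \
--to-tsv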

Step 6: Assign taxonomy to features

# Get SILVA database
mkdir /REFERENCE/genomes/rlappan/SILVA
cd /REFERENCE/genomes/rlappan/SILVA

wget https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip
unzip Silva_132_release.zip
rm Silva_132_release.zip
# Read database into QIIME2
cd SILVA_132_QIIME_release

# The sequences
qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path rep_set/rep_set_16S_only/99/silva_132_99_16S.fna \
--output-path 99_otus_16S.qza

# The taxonomy strings
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path taxonomy/16S_only/99/majority_taxonomy_7_levels.txt \
--output-path 99_otus_16S_taxonomy.qza

This produces the QIIME2 artifacts 99_otus_16S.qza (the reference sequences) and 99_otus_16S_taxonomy.qza (the taxonomic names).
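A quick sanity check that the reference sequences and the taxonomy file describe the same number of records (plain shell counting, nothing QIIME2-specific); the two counts should match if the download is intact:

# One taxonomy line is expected per reference sequence
grep -c '>' rep_set/rep_set_16S_only/99/silva_132_99_16S.fna
wc -l < taxonomy/16S_only/99/majority_taxonomy_7_levels.txt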

# Extract the V3/V4 region from the reference database using the 341f and 806r primers
qiime feature-classifier extract-reads \
--i-sequences 99_otus_16S.qza \
--p-f-primer CCTAYGGGRBGCASCAG \
--p-r-primer GGACTACNNGGGTATCTAAT \
--p-min-length 300 \
--p-max-length 600 \
--o-reads ref_seqs_16S_V3-V4.qza \
--verbose \
&> 16S_V3-V4_training.log

This produces the QIIME2 artifact ref_seqs_16S_V3-V4.qza, the V3/V4 reference sequences.

# Train the classifier on this region
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads ref_seqs_16S_V3-V4.qza \
--i-reference-taxonomy 99_otus_16S_taxonomy.qza \
--o-classifier classifier_16S_V3-V4.qza \
--verbose \
&> 16S_V3-V4_classifier.log

This step took a couple of hours. This produces the QIIME2 artifact classifier_16S_V3-V4.qza (the trained classifier).

mkdir /WORKING/Lappan/VL_16S_analysis/taxonomy
cd /WORKING/Lappan/VL_16S_analysis/taxonomy

# Classify rep seqs
qiime feature-classifier classify-sklearn \
--i-classifier /REFERENCE/genomes/rlappan/SILVA/SILVA_132_QIIME_release/classifier_16S_V3-V4.qza \
--i-reads ../DADA2_denoising_output/representative_sequences.qza \
--o-classification classified_rep_seqs.qza

# Tabulate the features, their taxonomy and the confidence of taxonomy assignment
qiime metadata tabulate \
--m-input-file classified_rep_seqs.qza \
--o-visualization classified_rep_seqs.qzv

This step took a few hours. This produces the QIIME2 artifact classified_rep_seqs.qza (the taxonomic assignments for each feature) and the visualisation classified_rep_seqs.qzv.
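The assignments can also be pulled out as a plain TSV for use outside QIIME2 (a sketch; QIIME2 names the exported file taxonomy.tsv):

# Export the taxonomy assignments to a tab-separated file
qiime tools export \
--input-path classified_rep_seqs.qza \
--output-path exported_taxonomy
head exported_taxonomy/taxonomy.tsv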

Step 7: Create a phylogenetic tree

cd /WORKING/Lappan/VL_16S_analysis

# The pipeline requires only the rep seqs, not the classified rep seqs
qiime phylogeny align-to-tree-mafft-fasttree \
--i-sequences DADA2_denoising_output/representative_sequences.qza \
--output-dir phylogenetic_tree \
--p-n-threads 16 \
--verbose \
&> phylogenetic_tree_generation.log

This creates the following QIIME2 artifacts:

  • alignment.qza, the aligned sequences
  • masked_alignment.qza, the masked alignment
  • tree.qza, the unrooted tree
  • rooted_tree.qza, the rooted tree (this is the file I will use downstream).
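
To use the rooted tree with tools outside QIIME2 (for example R phylogenetics packages), it can be exported to Newick format (a sketch; the exported file is named tree.nwk):

# Export the rooted tree as a Newick file
qiime tools export \
--input-path phylogenetic_tree/rooted_tree.qza \
--output-path exported_tree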

18S rRNA data

Step 1: Import the data into QIIME2

# First check original files intact
md5sum -c checksums.md5 # All OK

# Copy and rename files to match the Casava 1.8 naming format that QIIME2 expects
mkdir qiime_import
cp *.fastq.gz qiime_import
cd qiime_import

for file in *.fastq.gz;
do
newname=$(echo "$file" | sed 's/_BVMB2//' | sed 's/.fastq/_001.fastq/')
mv "$file" "$newname"
done

# Create analysis directory
mkdir /WORKING/Lappan/VL_18S_analysis
cd /WORKING/Lappan/VL_18S_analysis

# Import raw files into QIIME
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path /RAW/Lappan/VL_faecal_18S/AGRF_CAGRF17509_BVMB2/qiime_import \
--input-format CasavaOneEightSingleLanePerSampleDirFmt \
--output-path VL_18S_PE.qza

This produces the QIIME2 artifact VL_18S_PE.qza.

Step 2: Remove amplicon primers

# Trim amplicon primers
qiime cutadapt trim-paired \
--i-demultiplexed-sequences VL_18S_PE.qza \
--p-cores 16 \
--p-adapter-f GTACACACCGCCCGTC...GTAGGTGAACCTGCAGAAGGATCA \
--p-adapter-r TGATCCTTCTGCAGGTTCACCTAC...GACGGGCGGTGTGTAC \
--o-trimmed-sequences primer-trimmed-VL_18S_PE.qza \
--verbose \
&> primer_trimming.log 

This step produces the QIIME2 artifact primer-trimmed-VL_18S_PE.qza.
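Unlike the 16S primers, the 18S primers are supplied here in cutadapt's linked-adapter form (the forward primer linked to the reverse complement of the opposite primer). The V9 amplicon is much shorter than the 250 bp reads, so each read reads through into the opposite primer and both ends need trimming. As for the 16S data, the log gives a quick view of how often the linked adapters were found (a rough sketch; cutadapt's report wording may vary between versions):

# How often the linked primers were found, and how many read pairs were kept
grep -E 'with adapter|Pairs written' primer_trimming.log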

Step 3: Check quality plots and sequence length

qiime demux summarize \
--i-data primer-trimmed-VL_18S_PE.qza \
--o-visualization primer-trimmed-VL_18S_PE.qzv

I use this to determine the truncation lengths.

Step 4: DADA2 length trimming, denoising, chimera and PhiX removal

Run DADA2 with those parameters to filter and join reads:

# Run DADA2 denoising and filtering
qiime dada2 denoise-paired \
--p-n-threads 16 \
--i-demultiplexed-seqs primer-trimmed-VL_18S_PE.qza \
--p-trunc-len-f 87 \
--p-trunc-len-r 87 \
--output-dir DADA2_denoising_output \
--verbose \
&> DADA2_denoising.log

This ran very quickly (less than half an hour).

This produces three QIIME2 artifacts in the DADA2_denoising_output directory:

  • denoising_stats.qza, with a summary of the denoising results
  • representative_sequences.qza, the sequences of the exact sequence variants (features); they are joined paired-end reads
  • table.qza, the feature table (OTU table - feature counts by samples)

Step 5: Summarise and visualise DADA2 results

I summarise these objects:

# Denoising stats
qiime metadata tabulate \
--m-input-file denoising_stats.qza \
--o-visualization denoising_stats.qzv

# Representative sequences
qiime feature-table tabulate-seqs \
--i-data representative_sequences.qza \
--o-visualization rep_seqs.qzv

# Feature table
qiime feature-table summarize \
--i-table table.qza \
--o-visualization table.qzv

Step 6: Assign taxonomy to features

# Import the 18S reference database into QIIME (the SILVA 132 release was already downloaded for the 16S analysis)
cd /REFERENCE/genomes/rlappan/SILVA/SILVA_132_QIIME_release

# The sequences
qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path rep_set/rep_set_18S_only/99/silva_132_99_18S.fna \
--output-path 99_otus_18S.qza

# The taxonomy strings
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path taxonomy/18S_only/99/majority_taxonomy_7_levels.txt \
--output-path 99_otus_18S_taxonomy.qza

This produces the QIIME2 artifacts 99_otus_18S.qza (the reference sequences) and 99_otus_18S_taxonomy.qza (the taxonomic names).

# Extract the V9 region from the reference database using the 1391f and EukBr primers
qiime feature-classifier extract-reads \
--i-sequences 99_otus_18S.qza \
--p-f-primer GTACACACCGCCCGTC \
--p-r-primer TGATCCTTCTGCAGGTTCACCTAC \
--p-min-length 50 \
--p-max-length 300 \
--o-reads ref_seqs_18S_V9.qza \
--verbose \
&> 18S_V9_training.log

This produces the QIIME2 artifact ref_seqs_18S_V9.qza, the V9 reference sequences.

After extracting the V9 region, the next step is to train the classifier.

qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads ref_seqs_18S_V9.qza \
--i-reference-taxonomy 99_otus_18S_taxonomy.qza \
--o-classifier classifier_18S_V9.qza \
--verbose \
&> 18S_V9_classifier.log

This produces the QIIME2 artifact classifier_18S_V9.qza.

mkdir /WORKING/Lappan/VL_18S_analysis/taxonomy
cd /WORKING/Lappan/VL_18S_analysis/taxonomy

# Classify rep seqs
qiime feature-classifier classify-sklearn \
--i-classifier /REFERENCE/genomes/rlappan/SILVA/SILVA_132_QIIME_release/classifier_18S_V9.qza \
--i-reads ../DADA2_denoising_output/representative_sequences.qza \
--o-classification classified_rep_seqs.qza

# Tabulate the features, their taxonomy and the confidence of taxonomy assignment
qiime metadata tabulate \
--m-input-file classified_rep_seqs.qza \
--o-visualization classified_rep_seqs.qzv

This produces the QIIME2 artifact classified_rep_seqs.qza (the taxonomic assignments for each feature) and the visualisation classified_rep_seqs.qzv.

Step 7: Create a phylogenetic tree

cd /WORKING/Lappan/VL_18S_analysis

# The pipeline requires only the rep seqs, not the classified rep seqs
qiime phylogeny align-to-tree-mafft-fasttree \
--i-sequences DADA2_denoising_output/representative_sequences.qza \
--output-dir phylogenetic_tree \
--p-n-threads 16 \
--verbose \
&> phylogenetic_tree_generation.log

This creates the following QIIME2 artifacts:

  • alignment.qza, the aligned sequences
  • masked_alignment.qza, the masked alignment
  • tree.qza, the unrooted tree
  • rooted_tree.qza, the rooted tree (this is the file I will use downstream).