Exploring taxonomy
Summarising taxa
Now that I have the OTU table, the first thing I want to do is look at which bacteria we have in the samples! I have done this by summarising the taxonomy in the form of bar charts, though this information as a table is more detailed.
Usually, there will be multiple OTUs with the same taxonomy assignment. These may be spurious (perpetuated errors, chimeras) or could be different species where the sequence identity is less than 97% but the genus-level taxonomy is the same. Some OTUs may also contain multiple species if they are more than 97% identical. Summarising the taxonomy in QIIME groups the OTUs by their taxonomy name to give a general overview of what’s present in the samples.
I was more interested in summarising by sample type than by individual samples, so I included this in the taxa summary step.
At this point, I have filtered my input OTU table, named otu_table.biom
here so that it contains only samples (not positive or negative sequencing controls) with at least 1499 reads by using filter_samples_from_otu_table.py
.
summarize_taxa_through_plots.py -i otu_table.biom -o sample_type_plots
-m mapfile_Runs1234.txt -c sample_type
From this output, I didn’t take the plots but the file it produced: sample_type_otu_table_L6.txt
as I wanted to make more customisable plots in R. I altered the taxonomy in this file by removing everything before D_5, which designates genus, so it only contained readable genus names instead of the entire taxonomy string. I then made the plot in R:
data <- read.table("data/sample_type_otu_table_L6.txt", header = T, sep = "\t")
# Collapse anything with a total of less than 1.0% into "Other" before plotting (just to have a concise legend)
data$taxa <- ifelse(rowSums(data[,2:6]/5) < 0.01, "Other", as.character(data$taxa))
other <- data[which(data$taxa == "Other"),]
other <- colSums(other[2:6])
data <- data[-which(data$taxa == "Other"),]
other <- append(other, "Other (49 other taxa)")
names(other) <- c("control", "rinse", "fluid", "NPS", "canal", "taxa")
other <- other[c(6,1:5)]
data <- rbind(data,as.data.frame(t(other)))
# Make all numeric
for (i in 2:length(data)) {
data[,i] <- as.numeric(data[,i])
}
# Change to long format
library(reshape2)
datam <- melt(data)
datam$value <- datam$value * 100
datam$taxa <- factor(datam$taxa, levels = data$taxa)
# To label the axis ticks properly, use mapvalues (plyr):
library(plyr)
datam$sample_type_labels <- mapvalues(datam$variable, c("NPS", "control", "fluid", "rinse", "canal"), c("Case NPS\n(n = 87)", "Control NPS\n(n = 101)", "MEF\n(n = 127)", "MER\n(n = 113)", "Ear canal\n(n = 53)"))
# Reorder the factor so the NPS are next to each other
datam$sample_type_labels <- factor(datam$sample_type_labels, levels = c("Case NPS\n(n = 87)", "Control NPS\n(n = 101)", "MEF\n(n = 127)", "MER\n(n = 113)", "Ear canal\n(n = 53)"))
# Plot as a bar graph; keep the ggplot colours but specify them
library(scales)
library(ggplot2)
set.seed(4564)
n <- length(levels(datam$taxa))
cols <- hue_pal(h = c(0,360) + 15, c = 100, l = 65, h.start = 0, direction = 1)(n)[c(11,7,1,3,5,9,2,6,4,8,10)]
ggplot(datam, aes(x = sample_type_labels, y = value)) +
geom_bar(aes(fill = taxa), stat = "identity") +
labs(x = "Sample type", y = "Relative abundance (%)") +
scale_fill_manual(name = "Genus", values = cols)
This is a nice overview of the major genera in each sample type. But, as it can be difficult to compare by eye, the table is more useful to look at.
taxa | control | rinse | fluid | NPS | canal |
---|---|---|---|---|---|
Corynebacterium 1 | 13.3041203 | 1.3356343 | 1.3548518 | 1.6322519 | 2.7252701 |
Turicella | 0.0319593 | 11.7150761 | 6.7163477 | 0.0273128 | 13.0573800 |
Staphylococcus | 1.4200136 | 22.2282728 | 9.9390506 | 0.8057762 | 27.0067114 |
Alloiococcus | 0.1709547 | 57.0409060 | 49.8448806 | 0.1851017 | 53.6245124 |
Dolosigranulum | 16.3358312 | 0.0216272 | 0.0467854 | 1.8556966 | 0.0278550 |
Streptococcus | 7.0502892 | 1.1965846 | 3.5213523 | 15.2947064 | 1.4377719 |
Neisseria | 0.2656784 | 0.1410047 | 0.1916705 | 10.9479557 | 0.0648392 |
Haemophilus | 12.4260445 | 3.1828279 | 18.5228928 | 18.9603792 | 1.3024851 |
Moraxella | 47.5547992 | 0.2103666 | 2.1732045 | 30.8021377 | 0.1023219 |
Pseudomonas | 0.0165929 | 1.3399312 | 6.3429914 | 0.0089300 | 0.1152835 |
Other (49 other taxa) | 1.4237155 | 1.5877684 | 1.3459577 | 19.4797492 | 0.5355731 |
Correlating taxa in the sequencing controls
The other samples that I sequenced were the positive control (mock community) and negative controls. It’s also important that I have a look at these.
Let’s deal with the positive control first.
Positive sequencing control
For this analysis, I’ve taken the OTU table and filtered it to only contain the MOCK samples, which are the positive controls. As before, I’ve summarised the taxa with QIIME but I did not include the -c
or -m
flags this time: I don’t want to summarise by groups, I want to see individual samples.
I took the output file and used it to plot this in R:
# I have altered this table so the taxonomy is readable at genus level
data <- read.table("data/final_otu_table_mock3_L6.txt", header = T, sep = "\t")
# Collapse anything with a total of less than 1.0% into "Other"
data$taxa <- ifelse(rowSums(data[,2:9]/8) < 0.01, "Other", as.character(data$taxa))
other <- data[which(data$taxa == "Other"),]
other <- colSums(other[2:9])
data <- data[-which(data$taxa == "Other"),]
other <- append(other, "Other (47 other taxa)")
names(other)[9] <- "taxa"
other <- other[c(9,1:8)]
data <- rbind(data,as.data.frame(t(other)))
# Make all numeric
for (i in 2:length(data)) {
data[,i] <- as.numeric(data[,i])
}
# Change to long format
library(reshape2)
datam <- melt(data)
## Using taxa as id variables
datam$value <- datam$value * 100
datam$taxa <- factor(datam$taxa, levels = data$taxa)
# Change sample names to only "MOCK"
datam$variable <- gsub("MOCK3", "MOCK", datam$variable)
# Plot with randomised ggplot colours
library(scales)
library(ggplot2)
set.seed(4564)
n <- length(levels(datam$taxa))
cols <- hue_pal(h = c(0,360) + 15, c = 100, l = 65, h.start = 0, direction = 1)(n)[order(sample(1:n, n))]
ggplot(datam, aes(x = variable, y = value)) +
geom_bar(aes(fill = taxa), stat = "identity") +
labs(x = "MOCK replicate", y = "Relative abundance (%)") +
scale_fill_manual(name = "Genus", values = cols)
Again, the table may be more useful:
taxa | MOCK31 | MOCK32 | MOCK33 | MOCK34 | MOCK35 | MOCK36 | MOCK37 | MOCK38 |
---|---|---|---|---|---|---|---|---|
Corynebacterium 1 | 4.831868 | 5.029025 | 4.0849733 | 3.940035 | 4.804647 | 4.890366 | 4.158366 | 3.613126 |
Propionibacterium | 5.151673 | 4.881288 | 4.0923028 | 3.734198 | 5.193078 | 5.369925 | 4.973969 | 4.809393 |
Staphylococcus | 20.668246 | 19.984779 | 21.9384078 | 22.121677 | 21.377706 | 21.056727 | 22.001475 | 22.804263 |
Globicatella | 5.915727 | 4.809658 | 5.5985146 | 5.522652 | 6.332126 | 5.446315 | 5.219420 | 4.796126 |
Alloiococcus | 7.712791 | 8.001671 | 8.7648575 | 8.734489 | 8.046822 | 7.974254 | 7.832433 | 7.991332 |
Streptococcus | 12.713259 | 13.644029 | 11.1554953 | 10.827815 | 12.393050 | 12.568963 | 12.354022 | 13.017424 |
Veillonella | 6.130491 | 5.949769 | 5.7011275 | 5.555663 | 5.947194 | 5.500071 | 6.092259 | 5.762427 |
Neisseria | 6.657032 | 6.598917 | 7.3014012 | 6.833408 | 6.218396 | 5.695289 | 7.058655 | 6.589422 |
Klebsiella | 7.140085 | 7.301786 | 8.1699466 | 8.217954 | 7.019754 | 7.109916 | 7.025635 | 6.472227 |
Haemophilus | 10.387647 | 10.796735 | 9.7103627 | 11.600676 | 10.267178 | 11.970576 | 10.354088 | 11.834424 |
Moraxella | 6.675766 | 6.985420 | 6.9214889 | 6.579024 | 6.283135 | 6.576602 | 6.824211 | 6.921104 |
Pseudomonas | 6.006717 | 6.003492 | 6.5159233 | 6.309105 | 6.106416 | 5.826850 | 6.088957 | 5.337874 |
Other (47 other taxa) | 0.008708 | 0.013420 | 0.0451954 | 0.023290 | 0.010500 | 0.014140 | 0.016500 | 0.050847 |
It looks like the MOCK replicates across the four sequencing runs (two each) are very consistent, and represent very closely the expected theoretical composition. However, I wanted to test this: so I used a QIIME script to correlate the expected MOCK composition to these sequenced MOCKs.
The mock community contained 16 species at equal concentrations, so each one should make up 6.25% of the sample. I included some species from the same genus, so those ones should add up accordingly.
I made a file that contained all of the taxonomy strings representing the genera I included in the positive control, and the proportion that they should be present in (expected_mock3.txt
). This was my “expected” mock community, which I correlated against my observed mock communities in QIIME:
compare_taxa_summaries.py -i taxa_summaries/mock3_only/final_otu_table_mock3_L6.txt,taxa_summaries/mock3_only/expected_mock3.txt -o mock3_compare -m expected
The output from this script includes a Pearson correlation coefficient with p-value and confidence interval, showing me that they were extremely similar.
Negative sequencing controls
I also needed to look at the taxa that were present in the negative controls. Firstly, it was important to check that the major contaminant taxa there were not abundant in the samples, and secondly I wanted to test whether the contamination was consistent (indicating reagent or workspace contamination). Reagents and workspaces are probably always contaminated no matter what you do, so I wanted to know if there were any consistent contaminants that I needed to look out for in the samples. My expectation was that if there was no strong contaminant signal, the contaminants were less likely to be consistent enough in the samples to interfere with results.
I did not produce a plot for this one, but I looked at the most abundant genera and created a table:
# QIIME taxa summary of negative controls with taxonomy changed to just genus name
negdata <- read.table("data/neg_only_genus_summary.txt", header = TRUE, sep = "\t")
# Collapse anything with a total of less than 1.0% into "Other"
negdata$taxa <- ifelse(rowSums(negdata[,2:17]/16) < 0.01, "Other", as.character(negdata$taxa))
other <- negdata[which(negdata$taxa == "Other"),]
other <- colSums(other[2:17])
negdata <- negdata[-which(negdata$taxa == "Other"),]
other <- append(other, "Other (45 other taxa)")
names(other)[17] <- "taxa"
other <- other[c(17,1:16)]
negdata <- rbind(negdata,as.data.frame(t(other)))
# Make all numeric
for (i in 2:length(negdata)) {
negdata[,i] <- as.numeric(negdata[,i])
}
# Make it a percentage
negdata[2:17] <- negdata[2:17] * 100
kable(negdata, row.names = FALSE)
taxa | Negex13 | Negex1 | NegPCR1 | NegPCR2 | Negex19 | Negex32 | NegPCR3 | NegPCR4 | Negex46 | Negex55 | NegPCR5 | NegPCR6 | Negex11 | Negex64 | NegPCR7 | NegPCR8 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Corynebacterium 1 | 0.0542594 | 0.3853565 | 0.0383436 | 0.0000000 | 7.294833 | 0.2433090 | 0.6835789 | 3.2915361 | 3.9800995 | 1.3592233 | 0.1203369 | 0.2580645 | 0.0000000 | 0.000000 | 0.1703578 | 0.0000000 |
Propionibacterium | 1.5735214 | 0.1926782 | 0.0000000 | 0.0000000 | 2.127660 | 0.0000000 | 0.0151906 | 0.0000000 | 13.9303483 | 0.7766990 | 0.0000000 | 0.1290323 | 0.0000000 | 0.000000 | 0.1703578 | 0.0000000 |
Bacillus | 29.4628323 | 28.9017341 | 0.0383436 | 12.6748971 | 16.818642 | 6.9343066 | 0.0000000 | 0.0000000 | 0.0000000 | 4.0776699 | 0.8423586 | 19.6129032 | 0.0000000 | 12.903226 | 0.0000000 | 0.0000000 |
Geobacillus | 4.0694520 | 13.8728324 | 0.0000000 | 0.0000000 | 2.938197 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 1.9417476 | 0.0000000 | 0.0000000 | 0.0000000 | 1.075269 | 0.0000000 | 0.0000000 |
Lysinibacillus | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 9.017224 | 0.0000000 | 0.0000000 | 12.6175549 | 44.7761194 | 4.0776699 | 12.7557160 | 0.0000000 | 26.5017668 | 0.000000 | 14.1396934 | 47.1571906 |
Staphylococcus | 0.1627781 | 4.4315992 | 0.3834356 | 1.7283951 | 8.611955 | 0.0000000 | 3.2963694 | 1.1755486 | 8.9552239 | 0.5825243 | 1.2033694 | 0.6451613 | 3.1802120 | 15.053763 | 5.7921635 | 3.0100334 |
Alloiococcus | 4.6120456 | 0.0000000 | 0.4984663 | 0.5761317 | 14.792300 | 53.6496350 | 2.3393590 | 60.8150470 | 7.4626866 | 17.2815534 | 1.9253911 | 6.5806452 | 18.0212014 | 20.430107 | 6.3032368 | 0.8361204 |
Rhizobium | 5.5344547 | 5.0096339 | 20.3987730 | 10.2880658 | 2.836879 | 0.0000000 | 0.2886222 | 2.5078370 | 0.4975124 | 16.1165049 | 10.2286402 | 21.0322581 | 1.7667845 | 0.000000 | 5.7921635 | 1.8394649 |
Delftia | 44.8724905 | 27.5529865 | 73.3895706 | 72.1810700 | 2.026343 | 1.2165450 | 0.0000000 | 0.1567398 | 0.0000000 | 12.2330097 | 29.3622142 | 0.0000000 | 27.5618375 | 0.000000 | 18.9097104 | 10.0334448 |
Escherichia-Shigella | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.000000 | 0.0000000 | 87.5740544 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.000000 | 0.0000000 | 0.0000000 |
Haemophilus | 2.0075963 | 9.4412331 | 1.5720859 | 0.9053498 | 8.004053 | 3.7712895 | 0.0759532 | 0.3134796 | 3.9800995 | 0.9708738 | 19.0132371 | 1.1612903 | 0.0000000 | 15.053763 | 1.0221465 | 0.0000000 |
Acinetobacter | 2.1703744 | 0.0000000 | 0.0000000 | 0.0000000 | 11.043566 | 0.8515815 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 6.1371841 | 0.0000000 | 0.0000000 | 0.000000 | 0.0000000 | 0.0000000 |
Moraxella | 3.7438958 | 2.8901734 | 0.8052147 | 0.6584362 | 3.748734 | 17.3965937 | 1.4431110 | 2.1159875 | 2.4875622 | 3.8834951 | 0.9626955 | 3.8709677 | 0.3533569 | 12.903226 | 1.3628620 | 0.0000000 |
Pseudomonas | 0.0000000 | 0.1926782 | 0.1533742 | 0.0000000 | 6.889564 | 2.4330900 | 3.6001823 | 14.0282132 | 11.4427861 | 32.6213592 | 16.6064982 | 41.8064516 | 18.0212014 | 7.526882 | 44.1226576 | 32.9431438 |
Other (45 other taxa) | 1.7362997 | 7.1290944 | 2.7223927 | 0.9876542 | 3.850051 | 13.5036495 | 0.6835788 | 2.9780564 | 2.4875622 | 4.0776701 | 0.8423585 | 4.9032260 | 4.5936396 | 15.053763 | 2.2146510 | 4.1806021 |
The most abundant contaminants are OTU41, OTU58, OTU1, OTU64, which would be most likely to appear in the samples. OTU1 is Alloiococcus is abundant in some of the negative controls, but we found lots of it in the middle ear and ear canal samples, and hardly at all in the nasopharynx. This indicates to me that it is likely to have ended up in the negative controls as a result of cross-contamination from samples rather than the other way around.
I checked the abundance of the other three across the samples themselves:
# OTU table contains only clinical samples above 1499 reads
otu.table <- read.table("data/final_otu_table_samples.txt", sep = "\t", header = T)
# Create abundance table
rownames(otu.table) <- otu.table$taxa
otu.table$taxa <- NULL
otu.total <- colSums(otu.table)
otu.abund <- otu.table
for (i in 1:length(otu.abund)) {
otu.abund[, i] <- otu.abund[, i]/otu.total[i] *
100
}
# Mean abundance across samples
sample.otu.mean <- rowSums(otu.abund)/length(otu.abund)
sample.otu.mean[grep("OTU41", names(sample.otu.mean))]
## OTU41
## 0.09111572
sample.otu.mean[grep("OTU58", names(sample.otu.mean))]
## OTU58
## 0.05113759
sample.otu.mean[grep("OTU64", names(sample.otu.mean))]
## OTU64
## 0.01275775
# Median abundance
sample.otu.medians <- apply(otu.abund, 1, median)
sample.otu.medians[grep("OTU41", names(sample.otu.medians))]
## OTU41
## 0.002617184
sample.otu.medians[grep("OTU58", names(sample.otu.medians))]
## OTU58
## 0.00299904
sample.otu.medians[grep("OTU64", names(sample.otu.medians))]
## OTU64
## 0
These OTUs are very low in the clinical samples.
Having confirmed this, I then wanted to see whether the negative controls were similar to each other.
For this, I correlated the taxa summaries again but this time in “paired” rather than “expected” mode. I made three comparisons: 1. Negex-NegPCR pairs from the same plate 2. Negex-Negex pairs from the same run 3. NegPCR-NegPCR pairs from the same run
To set up the files for compare_taxa_summaries.py
I first subsetted the OTU table to contain only the samples I wanted to compare. This was done twice for each comparison (e.g. for comparison 1, all of the Negex samples were in one OTU table and the NegPCR in another).
# Subset OTU table by sample names designated in text file
filter_samples_from_otu_table.py -i ../final_otu_table_all.biom -o pair1_1.biom
--sample_id_fp pair1_1.txt
I repeated this for each comparison (1_2, 2_1, 2_2 and so on). The purpose being, I can then summarise each OTU table giving me two taxonomy summaries to compare per comparison:
# Genus-level taxonomy summaries - repeated for each table
summarize_taxa.py -i pair1_1.biom --suppress_biom_table_output -L 6
Each comparison now has two taxonomy summary files to compare; lastly, I needed to indicate to the script which samples go together as pairs. I made files like this:
Negex1 Ex1
Negex19 Ex2
Negex13 Ex1
Negex32 Ex2
to show that Negex1 and Negex13 go together, Negex19 and Negex32, etc. This gives each pair of samples a “new name” and is passed with the -s
flag in the script.
Finally, to do the comparisons (specifying paired mode with -m
):
compare_taxa_summaries.py -i pair1_1_L6.txt,pair1_2_L6.txt -o pair1 -m paired -s
pair1_map.txt
compare_taxa_summaries.py -i pair2_1_L6.txt,pair2_2_L6.txt -o pair2 -m paired -s
pair2_map.txt
compare_taxa_summaries.py -i pair3_1_L6.txt,pair3_2_L6.txt -o pair3 -m paired -s
pair3_map.txt
From these correlation coefficients, I saw that there was a correlation between the pairs but it was only moderate (~0.5-0.6).