When I analysed my recent gut flora results from uBiome I noticed something odd.
When I added together the percentages for all the bacteria listed at each taxonomic level, the totals didn’t come to 100. In fact, none of the levels added up to 100, and at the deepest level, genus, more than a quarter of the data was missing!
Here is the assigned total at each level, with the unassigned – or missing – portion in parentheses:
Phylum level – 98.9% (1.1%)
Class level – 98.8% (1.2%)
Order level – 98.9% (1.1%)
Family level – 96.6% (3.4%)
Genus level – 73.68% (26.32%)
I wondered if there had been a sampling error, or perhaps uBiome’s website was failing to display all the data from their database. So I shot their very good customer services team a message.
The answer, as it turned out, was that not all the bacteria in my sample could be identified. When a bacterium’s DNA doesn’t match any of the entries in uBiome’s database, they use algorithms to help assign it, but some bacteria cannot yet be classified by those algorithms, particularly at the deeper, genus level!
So is a quarter of data missing at this level the norm? It appears that my missing 26% may be fairly typical but I’m hoping those of you who have already tested with uBiome will check how much of your data is missing and leave a comment below so we can see how common this problem is.
uBiome tell me that they are working on developing their algorithms further to make them more accurate and achieve a greater depth of classification, and that these new algorithms will be applied to both new and existing customer data. That’s good to hear. But in the meantime, what does this mean?
The obvious problem is at the genus level. Future analysis is likely to mean a large jump in population for one or more genera, and maybe new genera appearing on my report that currently aren’t represented at all. This is likely to have a dramatic effect on how my results are interpreted. At the higher levels, where the missing portion is much smaller, the impact is likely to be less, but in some cases it could still be significant. And if one sample has more missing data than another, that could affect the accuracy of any comparison between them.
uBiome is a relatively new citizen science enterprise. Their data visualization tools are currently in beta and are still being perfected. And I’m willing to give them time to get their algorithms sorted out too. I’m told they released an upgrade to their algorithms a few weeks ago, and it sounds like more updates will come along in the future. I just hope they don’t take too long to reduce the unassigned portion of data. As uBiome are a modern, forward-thinking organization, I’m hoping they’ll feel comfortable posting here and giving us some idea of the timescales we’re looking at, and of what proportion of classified data is a realistic target at each level, given the technology.
If you’ve had a uBiome test yourself then log in to your data on the uBiome beta site and select the “Compare” tool. Select the level you’re interested in looking at (phylum, class, order, family, or genus) and scroll down to see all the bacteria under that level. If you hover your cursor over each one, it will tell you what percentage of your sample it makes up. Add these up and that’s your total for that level. Subtract that total from 100 and the result is how much is currently unclassified.
I found it helpful to record it all in a spreadsheet. Here is my template on Google Docs, which you might find handy as a starting point (though you may need to modify it for your needs). Just click File and then download a copy for yourself. All you have to do is input your results and the spreadsheet will add it all up for you and tell you what’s missing.
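If you’d rather not use a spreadsheet, the same add-up-and-subtract check can be done in a few lines of Python. This is only a rough sketch of the arithmetic described above: the function name and the bacteria names and percentages in the example are placeholders I’ve made up for illustration, not real uBiome output, so swap in the figures you read off the “Compare” tool.

```python
def unclassified_share(readings):
    """Return (assigned_total, unclassified) in percent for one level.

    `readings` maps each bacteria name to the percentage shown when you
    hover over it. The function name and structure are illustrative only.
    """
    assigned = round(sum(readings.values()), 2)
    unclassified = round(100 - assigned, 2)
    return assigned, unclassified


# Placeholder values for one level (NOT real data); replace with your own.
example_genus_readings = {
    "Genus A": 30.5,
    "Genus B": 25.18,
    "Genus C": 18.0,
}

assigned, unclassified = unclassified_share(example_genus_readings)
print(f"Assigned: {assigned}%  Unclassified: {unclassified}%")
```

Run it once per level (phylum, class, order, family, genus), with a separate set of readings each time, to see how the unclassified portion changes as you go deeper.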
If you’re happy to, leave a comment below, telling us how much of your data is currently unclassified.
Oh, and one last thing. If you haven’t checked out Prof. Lipkin’s Microbe Discovery Project yet, then please take a look and consider making a donation to this great cause.