Warning: Cannot modify header information - headers already sent by (output started at /home/pathomap/pathomap.org/wp-content/object-cache.php:2) in /home/pathomap/pathomap.org/wp-includes/feed-rss2.php on line 8
PathoMap » PathoMap http://www.pathomap.org Mapping New York's Microscopic Residents Fri, 22 Jan 2016 16:40:27 +0000 en-US hourly 1 http://wordpress.org/?v=4.1.17 Culture Methods http://www.pathomap.org/2015/04/13/culture-methods/ http://www.pathomap.org/2015/04/13/culture-methods/#comments Tue, 14 Apr 2015 01:44:36 +0000 http://www.pathomap.org/?p=319 One of the current challenges in metagenomic studies and relying on sequence data is to determine whether the organisms you characterize are alive or not. Indeed, order just finding DNA of Staphylococcus aureus on a wooden bench or subway pole doesn’t mean the organism is alive. Thus, sick to supplement our metagenomic profile of the […]

The post Culture Methods appeared first on PathoMap.

]]>
One of the current challenges in metagenomic studies and relying on sequence data is to determine whether the organisms you characterize are alive or not. Indeed, order just finding DNA of Staphylococcus aureus on a wooden bench or subway pole doesn’t mean the organism is alive.

Thus, sick to supplement our metagenomic profile of the subway system we performed two culture experiments. The first culture experiment was designed to directly compare sequence data against culture data. We would collect two swabs from the same surface, sales with one swab sequenced directly, and the other swab was first cultured then sequenced. This allowed us to compare the MetaPhlAn output and see how many organisms were found in both. Our data showed that the sequence dataset had greater number of taxa found than the culture set, with shared hits ranging from 7-13 organisms (See Figure 4 of the Paper). A set of cultures in this experiment also had tetracycline treatment, allowing us to perform a functional genetic analysis to survey for the presence of tetracycline-resistance genes. Indeed, we found high coverage to these tet genes (especially tetk) in the samples with tetracycline treatment. The second culture experiment dove deeper into the question of the presence of antibiotic resistance microorganisms. We collected more samples and plated them on various media including treatments of kanamycin, chloramphenicol, and ampicillin. The findings from this experiment further supported the first experiment that viable microorganisms can be cultured from the subway system. Moreover, the results suggested the presence of viable, antibiotic-resistant microorganisms. For more detailed results check out our full paper.

 The methodology of the second experiment wasn’t reported as in-depth as the first experiment in our paper. Thus, we’ve included more details regarding the second experiment below:

Bacteria were cultured in LB agar and then spread onto LB plates, after lawn growth, they used to select for singles. Single colonies were then plated onto AB plates and growth was assessed. Plates were incubated at 37 C. As a control, air samples were taken and cultured at every location. In all cases these did not yield growth. The non-selective plate done last when replica plating also serves as a control. The antibiotics were used according to the following concentrations:

  • Kanamycin – 50ug/mL
  • Chloramphenicol – 35ug/mL
  • Ampicillin – 100ug/mL 

Finally, there was no quantitative confirmation of bacterial vs. non-bacterial organisms, although there was no observable fungal growth in the samples.

Further experiments are being done to dive deeper into the question of viability of microorganisms on the subway system as well as the presence of antibiotic-resistant bacteria.

 

The post Culture Methods appeared first on PathoMap.

]]>
http://www.pathomap.org/2015/04/13/culture-methods/feed/ 0
Run the PathoMap Human Ancestry Pipeline on Arvados! http://www.pathomap.org/2015/04/08/run-the-pathomap-human-ancestry-pipeline-on-arvados/ http://www.pathomap.org/2015/04/08/run-the-pathomap-human-ancestry-pipeline-on-arvados/#comments Wed, 08 Apr 2015 19:22:21 +0000 http://www.pathomap.org/?p=299 Introduction PathoMap started with a simple idea; can we collect and study DNA from the New York City subway stations? We soon found the answer to be “yes!” And so we began to create a to do list: Collect samples from all 468 stations………………………Complete Extract DNA from all samples………………………………..Complete Prepare libraries and sequence samples……………………Complete Classify […]

The post Run the PathoMap Human Ancestry Pipeline on Arvados! appeared first on PathoMap.

]]>
Introduction

PathoMap started with a simple idea; can we collect and study DNA from the New York City subway stations? We soon found the answer to be “yes!” And so we began to create a to do list:

  • Collect samples from all 468 stations………………………Complete
  • Extract DNA from all samples………………………………..Complete
  • Prepare libraries and sequence samples……………………Complete
  • Classify taxonomy with various tools………………………Complete

Now what? We had done our preliminary analysis with BLAST and MetaPhlAn and were stuck on how to move forward. Looking closer to the BLAST results we found that 0.2% of our reads matched human. And then we wondered, physician could we re-create the NYC Census map with the human data collected in our samples?

In collaboration with Drs. Sean Ennis, Eoghan O’Halloran, and Tiago R. Magalhaes from University College and Our Lady’s Children’s Hospital in Dublin, Ireland we developed a novel pipeline to answer this question:

pipeline

Essentially, we take and process our raw sequence reads we align them to the human genome, call the variants (single nucleotide polymorphisms or SNPs) and utilize two R packages, Admixture and Ancestry Mapper to run ancestry assignment algorithms. Finally, we plot the findings and compare to the Census data. The novel pipeline and analysis method was quite exciting as in many instances we found near-perfect matches between the ancestry analysis and Census data. We also found in some stations great disparity between the two datasets though we hypothesize this could be due to a recently touched surface by a non-resident of the area. For instance, an Asian man could have just used a kiosk before our sample was collected and the sample has a spike in Asian markers, though the station is in a predominantly African-American neighborhood. Some of the results of our analysis are below but check out our manuscript for more details on the methods and the full results.

 

Results

We were able to download the US Census map for New York City and ran the machine learning BIS segmentation algorithm that breaks down the map for further analysis.

 Screen Shot 2015-04-08 at 3.02.20 PM

Census data there is an enrichment of people registered as “white,” which correlates to the ancestry prediction by our analysis, which shows an abundance of European markers.

The sample below was from the Cortelyou Road station in Brooklyn. According to the US Census data there is an enrichment of people registered as “white,” which correlates to the ancestry prediction by our analysis, which shows an abundance of European markers.

Screen Shot 2015-04-08 at 3.05.09 PM

Next, at Fulton Street station in lower Manhattan you have a mixture of Hispanic and Asian residents according to the US Census data, which makes some sense, especially since the station is in close proximity Chinatown. Our pipeline found the same signals in its ancestry prediction with Hispanic and Asian genetic markers (alleles) showing the highest predicted proportions.

Screen Shot 2015-04-08 at 3.05.16 PM

Finally, at the 168th Street station in upper Manhattan, in the neighborhood of Harlem we found a dominance of African as well as Puerto Rican alleles that match the higher numbers of registered Black and Hispanic people in the area, according to U.S. Census data.

Screen Shot 2015-04-08 at 3.05.21 PM

Here is the total data set from our paper:

 Untitled

 

Curoverse Collaboration

We wanted to take our novel pipeline and create a user-friendly interface to allow anyone to perform the analysis on their own samples (.vcf files). So, we collaborated with the team at Curoverse, which is developing the Arvados open source platform for managing and processing sequencing and other data. Working with Curoverse, we implemented the PathoMap Ancestry pipeline in Arvados, organized the data, and published the entire project publicly on Curoverse Cloud (Arvados in a cloud-hosted service).

You can access the project online, download the data and pipeline, or make a copy of the project and run it yourself in your own Arvados instance or in your own account on Curoverse CloudAccounts are free to setup. Now anyone can reproduce the work we did, and use the tools and methods to extend their own research.

There are a number of resources to help you use the PathoMap Ancestry pipeline. Start with the step-by-step tutorial. If you run into any trouble or have questions about Arvados or using Arvados through Curoverse Cloud, you can ask on the public open source project IRC channel or email Curoverse directly at support@curoverse.com. If you have any questions about the methods or code, please contact us.

This novel analysis highlights the forensic applications of city-scale metagenomic surveys and opens the door to several other projects and experiments to detect and study the molecular signatures and echoes left on the built environment around us.

The post Run the PathoMap Human Ancestry Pipeline on Arvados! appeared first on PathoMap.

]]>
http://www.pathomap.org/2015/04/08/run-the-pathomap-human-ancestry-pipeline-on-arvados/feed/ 0
The long road from Data to Wisdom, and from DNA to Pathogen http://www.pathomap.org/2015/03/10/long-road-data-wisdom-dna-pathogen/ http://www.pathomap.org/2015/03/10/long-road-data-wisdom-dna-pathogen/#comments Tue, 10 Mar 2015 17:14:32 +0000 http://www.pathomap.org/?p=292 I. Introduction There is an oft-cited hierarchy for data, prostate wherein ideally it should flow: Data –>Information –>Knowledge –>Wisdom (DIKW).  Just because you have data, this it takes some processing to get quality information, and even good information is not necessarily knowledge, and knowledge often requires context or application to become wisdom. For example, you […]

The post The long road from Data to Wisdom, and from DNA to Pathogen appeared first on PathoMap.

]]>
I. Introduction

There is an oft-cited hierarchy for data, prostate wherein ideally it should flow:

Data –>Information –>Knowledge –>Wisdom (DIKW).  Just because you have data, this it takes some processing to get quality information, and even good information is not necessarily knowledge, and knowledge often requires context or application to become wisdom.

For example, you could have raw, infrared thermographic images (data), which lead you to think you have a heat source inside of a house (information), and then allegations of marijuana being sold from the house suggest that the house is a drug dispensary.  This might enable police to think that they enough knowledge to arrest the occupant (although not in all U.S. states).

But, the police would be lacking the wisdom that needs to supersede all of this data, information, and knowledge.  Specifically, they would have not had the wisdom in 2001 to foresee that thermal imaging of a home, like Danny Kyllo’s home, constituted a “search,” and thus violates the Constitution’s 4th amendment since the police did not have a warrant. In this case, Kyllo had a reasonable expectation of privacy since most people do not have thermographic vision or methods to see in thermographic vision (the Predator notwithstanding). So, even when you have data,information, and knowledge, a broader context is needed to have wisdom. While this may seem like an odd legal introduction for a metagenomics post – stay with me… privacy, wisdom, and genomics will all come together in the end.

II. Metagenomics of Cities

Recently, our paper published in Cell Systems surveyed DNA of the entire NYC subway system, and it has served as a good analogy of this hierarchical nesting.  Specifically, we had fragments of DNA (data) that matched a myriad of species (information), and some of these species were alive through functional data in cultures (knowledge).

These data led us to five exciting metagenomic conclusions:

1) Interestingly, 48% of the DNA did not match any known species, indicating a wealth of discovery left in metagenomics and genome assembly projects.  This was in range with another recent study from urban air metagenomics.

2) A “molecular echo” could be seen from marine-related bacteria in a subway station flooded by Hurricane Sandy, distinct from the Gowanus Canal in Brooklyn, which created a subway version of the “microbiome aura” posited by Jack Gilbert and Ed Yong.

3) As far as we know, this was the first demonstration that human alleles and their predicted ancestry left on surfaces can match the U.S. Census data demographics in a city, which was previously shown across Europe.

4) We measured the rapid pace of metagenomic dynamics over the course of a day on a public surface.

5) We showed that metagenomics profiling can reveal the likely source of antibiotic (tetracycline) resistance.

We were excited about all of these data and their release, and even made a map so you could look up any organism from any kingdom and play with the data, and the Wall Street Journal made a cool map too.  But, what we did not appreciate was the avalanche of fear-based responses that came from the press after reporting that a handful of samples (out of almost 1,500) matched fragments of Yersinia pestis (Yp) and Bacallis anthracis (Ba). We only claimed evidence in our paper for the DNA of these species, we specifically noted there was no evidence that they were alive, and even hedged our discussion on Yp and Ba with statements such as “if truly present.” Yet, the casual mention of pathogens in our paper created a public impression that pathogens were present and pervasive in the subway.

The subsequent discussion of the data has made me appreciate a new, DIKW-like hierarchy, which could be called “DSOP:”

III. DSOP: DNA – Species – Organism – Pathogen

Even if you have DNA fragments, and you correctly match it to the right species, and you have evidence that it is coming from a living organism, this still does not mean you have a pathogen. Pathogenicity, as we state in the paper, “depends on many factors: infective dosage, immune state of the hosts, route of transmission, other competitive species, informatics approaches to species identification, horizontal transfer, bacterial methylome state and unique base modifications, and many others.” Most importantly, to be a pathogen it has to be shown to be infectious, usually along the lines of Koch’s postulates.

In the case of our paper, we had Data that indicated some Species, which we claim in the paper. But to say we have knowledge about their source, or whether they are living Organisms, would take more work. Most importantly, to claim wisdom that we know the best practice for large-scale city management is also beyond the scope of our paper – it was the first survey of a city’s transit system and serves as a snapshot. The really interesting data will come when we examine it over time, and compare it to other cities.

This large interest/excitement/fear/curiosity in our study on city-scale metagenomics and species identification has led to some questions about the data, which I post some clarification below, specifically on how we went from DNA (data) to species (information). But, it is good to say again, we do not claim that any of these organisms are alive, and most especially we have no evidence they are pathogenic. Also, even though bacteria on the subway show antibiotic resistance, this is also not cause for concern, since many bacteria are intrinsically resistant to antibiotics.  I have looked at these issues from the two areas of the DSOP hierarchy that we know, DNA–>Species.

  1. Data (DNA):

Details on the study are published; but briefly, the data used in the study were billions of fragments of DNA that were sequenced as paired-end (PE) reads (125×125 nucleotides, or nt, and some 100x100nt) from an Illumina HiSeq2500 sequencer, which is currently the next-generation sequencer (NGS) with the lowest first-pass error rate. Samples’ DNA were collected across New York City (NYC) using a mobile app (freely available) that recorded the GPS coordinates for the samples, as well as photo documentation, for the Copan nylon swabs, and DNA was extracted with MoBio PowerSoil kits. Since NGS instruments have been established for almost a decade, the accuracy of DNA sequenced from the Illumina instruments is now at 99.9%. As such, we have high confidence in the quality of the DNA sequence data from the study. But, how do we turn that data into information? Let the sequence alignments begin!

  1. Information (Species):

In an ideal world (and one that is getting closer with Moleculo, PacBio, 10X, and Oxford Nanopores), we would sequence entire plasmids and chromosomes and get megabases (Mb) of contiguous DNA in one first pass, and we could unambiguously know which species is present. But, today, we usually get short fragments (100-125nt) and we have to try and find the best match. There are many tools for the identification of short DNA reads against all known genomes, and we chose four: (1) MetaPhlAn, which uses unique genetic markers to find species-specific fragments of DNA, (2) BLAST, which uses a Smith-Waterman algorithm and other heuristics to compare one sequence against all known genomes, (3) SURPI, which uses two aligners (SNAP and RAPSearch) to compare sequences to known pathogens, and the (4) Burrows-Wheeler Aligner (BWA), which used a compressed reference genome to rapidly search for matches. We found hundreds of potential bacterial species by these tools, and it is worth repeating here that most of them were perfectly benign or inert towards human health. But, interestingly, in our data set, we found rare cases where all four methods seem to indicate that Yp or Ba was present. But how do we know this information is correct? How can this data become information? Is it real?

IV. The Yp genome

The simplest (and often the best) way is to look at the presence of DNA from an organism is to look at where, and how, the short fragments of DNA map to the genome.

Globally, we saw reads mapping across the genome:

Yp_genome_cov

When we look closer at the coverage across the reference genome of Yp, using the free integrated genome viewer, we see some areas of high coverage (gi|30407161|emb|AL590842.1|:233,779-234,032), but we also see many single-nucleotide polymorphisms (SNPs, vertical lines), which means there has been a mutation or a mis-match of the DNA from our sample versus the reference genome:

Yp_zoom_cov

But, as pointed out on Twitter (great analysis by and @pathogenomenick) and in blogs (Ed Yong’s post at National Geographic, and also in our manuscript, just because you map to a region of a genome, it could be the same as another part of a similar, related bacterial genome. Thus, certain tools like MetaPhlAn have worked around this problem by searching for unique areas of the genome for each species. When we look at the Yp genome and map the unique areas from the MetaPhlAn database, we can see that almost none of the areas of the Yp genome that are unique to Yp (red/blue bars, on top) show reads that map.

Yp_unique_areas_cov

This figure is notable for two reasons. First, most of the genome of Yp is too similar to other species to uniquely identify it, and thus even deeply re-sequenced areas and mapped reads do not necessarily represent Yp.  This is a general challenge for metagenomics, wherein reads that look to be mapping to one species can be only 1-2nt away from a perfect match to another bacterial species. This is especially complicated in bacteria because they can readily share DNA by horizontal transfer.

Second, only one of these Yp-unique areas shows coverage (left most region), meaning that there is only one covered area of Yp that could uniquely represent Yp. When we examined this coverage in more detail, it appeared in a differentially covered between samples. Some samples showed high levels of local coverage (e.g. P00606, P01027, P01090), whereas other samples where no algorithm predicted Yp (P00400, P01177, P01167) showed no mapped reads to this area.

Yp_six_samples

So, does that mean that it is real? Even then, not necessarily.  The big issue here is that there are still a fair number of SNPs (vertical lines) across this region, and when you see too many SNPs, it either means that this strain is very mutated and diverged, or that this is the wrong species. There is strong convergence and shared sequences within Yp and across the Enterobacteriaceae family (including organisms like Salmonella, Klebsiella, Shigella, and E. coli), and the same issue can occur within the Bacillus family, between Bacillus anthracis, Bacillus cereus and Bacillus thuringiensis.

We discuss this in the paper, noting that while there is some evidence of Yp-related sequences, it is at the cutoffs we considered as the limit of detection for BLAST and MetaPhlAn. We kept it in the paper since were also saw distinct predictions, from the same samples, for related species (cereus, thuringiensis, etc.).  BUT!  Given the rarity of Yp in the East Coast, any good Bayesian would take the prior probability of this being Yp as so low and assume that it cannot be there. By these very reasonable assumptions, we could have just not reported it in our dataset and assumed it was a false positive. So, why did we keep it in?

V. Reporting Yp; change the database, change the answer

We reported all of these results for two reasons.

First, once the data were published, anyone could download these data and see the same algorithms predicting Yp or Ba. Indeed, in about five quick steps, you can do it with online tools like Galaxy and MetaPhlAn.

1) Download the raw, publicly available data: ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads//BySample/sra/SRS/SRS802/SRS802263/SRR1749190/

2) Download the SRA toolkit: http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

3) Use sratoolkit2.4.4 to convert SRA to Fastq, then Fastq to Fasta, like this:

>./fastq-dump.2.4.4 ~/Desktop/SRR1749190.sra

4) Upload data to the free Galaxy instance of MetaPhlAn 1.7 and run on default parameters: http://huttenhower.sph.harvard.edu/galaxy/ (props to the Huttenhower lab for setting this up for all to use).

5) Examine the output.  Once done, you will see Yersinia pestis (highlighted in blue):

metaphlan_output_yp

Secondly, we reported it was because all of the computational tools seemed to indicate some evidence of low-level Yp or Ba. But again, how can we know it is real? This was borne out by many late nights staring at IGV plots, with reads like these (below).  For some of these fragments of DNA, according to BLAST’s knowledge of all genomes, they are best explained as coming from Yp. Specifically, from the .bam files, we can pull reads like this:

>PC140529:373:C36UKACXX:3:2111:20835:13378

AAGACCGCCTGCCGCAGCTTTTAGCGAGATATCCAGCCGCTCTTTGGAAATGTCATTCGCGTCCCAGCGGTGGCCGTTCCATTCGAAATAGACGGTGAAAC

>PC140529:373:C36UKACXX:2:2207:6712:90748

GCATCTGTCCAGAAGAAATTGTCCGGAAGACCGCCTGCCGCAGCTTTTAGCGAGATATCCAGCCGCTCTTTGGAAATGTCATTCGCGTCCCAGCGGTGGCC

>PC140529:373:C36UKACXX:2:2216:3349:23864

ATTGTCCGCATCTGTCCAGAAGAAATTGTCCGGAAGACCGCCTGCCGCAGCTTTTAGCGAGATATCCAGCCGCTCTTTGGAAATGTCATTCGCGTCCCAGC

>PC140529:373:C36UKACXX:2:2214:4391:51898

ATTGTCCGGAAGACCGCCTGCCGCAGCTTTTAGCGAGATATCCAGCCGCACTTTGGAAATGTCATTCGCGTCCCAGCGGTGGCCGTTCCATTCGAAATAGA

Also, most of these reads’ pairs also mapped best to this part of the genome, like these:

>PC140529:373:C36UKACXX:4:2312:9640:70842/1

TAAATGTGATAGTACCTTCCCTGTACCACATATTCTCTCCTTGAAAAGACGGGCATCCATGCCCGCCATAAGTAAGTACATACTTATTTTTACCGATGTAAAGATGGATGTCTACCGTCGAGCAG

>PC140529:373:C36UKACXX:4:2312:9640:70842/2

GGTTTCTTCCATAATATCCTTCTCCATTTTTAAGCCTTAAACACCAGAACGAGATAAGGCTCGTTAACACGGTGATTGTATTTACGCTGCTCGTGCCAGGTATAAACATCGACAGTTGCAGACTG

> PC140529:373:C36UKACXX:2:2208:15618:68333/1

TAAATGTGATAGTACCTTCCCTGTACCACATATTCTCTCCTTGAAAAGACGGGCATCCATGCCCGCCATAAGTAAGTACATACTTATTTTTACCGATGTAAAGATGGATGTCTACCGTCGAGCAG

> PC140529:373:C36UKACXX:2:2208:15618:68333/2

GGTTTCTTCCATAATATCCTTCTCCATTTTTAAGCCTTAAACACCAGAACGAGATAAGGCTCGTTAACACGGTGATTGTATTTACGCTGCTCGTGCCAGGTATAAACATCGACAGTTGCAGACTG

For reads like these, the current best match according to BLAST is to Yp. You can put these reads into the public BLAST database and see the same thing here:

http://blast.ncbi.nlm.nih.gov/Blast.cgi

blastn alignments tophits

We mentioned in the discussion that “it is always possible that improved bacterial annotations and newly completed genomes can move the ‘best-hit’ evidence to a different species in the Yersinia or Bacillus genera, or a different genus altogether.” Or, bacteria can pick up fragments of DNA from each other and appear as one, even though they are different. These are all possibilities, but the challenge is that we are only as good as our references.

To highlight this point, we’ve looked further at our data relative to the European Nucleotide Archive (ENA) Sequence Search, which hosts a slightly different set of sequences than NCBI’s BLAST, and also the DNA Data Bank of Japan (DDBJ). Strikingly, when we use the DDBJ, we still get Yp as the top hit,

DDBJ

but when you use the ENA search system you do indeed get a better match, and to a different genus altogether (Enterobacter hormaechei), but still a potentially pathogenic organism the same family.

ENA

So which one is right?  Given the match statistics, it is probably the ENA.  But why are they different? This could be due to different rates of update between the databases, different data sets, or different organization of the data, but seeing this difference is likely just the tip of the iceberg.  If we see 2/3 sequence databases reporting the less-perfect match, clearly this is an issue.  In our paper, we only used the BLAST database at NCBI, which is clearly a limitation.

VI. The fragmented plasmid

We also looked at a plasmid that is associated with Yp (pMT1), where we found that most areas of this plasmid had higher coverage than the Yp genome itself.  In the paper, we only showed a small area of the plasmid, but when we examine the entire plasmid, we see a mix of areas of very high coverage and some of absent coverage:

pMT1_plasmid

Interestingly, there are hundreds of Yp plasmids can range anywhere from 60-280kb, and so the fact that we see these areas of deletion could mean that this plasmid was mutated, or (again) it could be that the reads belong to another plasmid or genome for another organism. Among the variations of the pMT1 plasmid for Yp according to NCBI, some are annotated and some have only predicted genes, and we do note an error in our Figure 3 (well-pointed out by ); yMT is not covered when using the  annotated plasmids that we have subsequently tested, and this is an erratum. Also, we found that Yp was found with MetaPhlAn 1.7, but not v2.0, yet still found with SURPI, BLAST, and BWA. Clearly, these alignment and annotation details are not trivial, since they can change a result from the presence of a putative pathogen to its complete absence, and this represents an ongoing challenge in genomics for all disciplines: human, bacterial, and meta. Thus, the penultimate and final parts of the DSOP hierarchy, that of the Organism and potential for aPathogen, cannot be addressed, since even our species match is ambiguous is some cases.  We can only say that we saw no evidence of these pathogens during our culturing experiments, and we certainly have no evidence of any pathogenicity.

The two most important aspects relevant to public health are that these DNA fragments, “if truly present,” are present at extremely low levels, and that there are no reported cases of either of these pathogens. That either means they are not there, or not there enough to be relevant, or not infectious, or in a modified state (as with modified nucleic acids), or not living, or some combination of the above. In any of these cases, no one should be concerned or worried about riding the subway, which we stated in the paper, and as often as we could in the press.

VII. Conclusions: the need for standards, “the 4th reviewer,” and metagenome privacy

The importance of computational analysis in metagenomics cannot be overstated, and Nick Loman and others have pointed out quite clearly.  It is a clear problem that one can use different methods and to get dramatically different, unique answers from the same data, or complete data non-reproducibility.  But, this is a recurrent issue for science in many fields.  This includes microarray analysis (MAQC), clinical grade sequencing (SEQC), cross-platform discrepancies (ABRF), synthetic RNAs (ERCC), and even k-mer analysis of gene patents. Ongoing work to make metagenomic standards (MBQC) are pivotal, just as they have been for DNA-seq and RNA-seq, and our positive control with eleven titrated species in our paper is simply not enough. Ideally, titrated positive controls and cross-kingdom controls should be placed into each experiment, along with negative controls, even if there is no detectible DNA in the tube.  Like all experiments, the data are only as interpretable as your controls frame it.

Yet, in spite of these challenges, improvements are coming. Nicola Segata’s beautiful species-specific markers in MetaPhlAn are great, but such maps have not yet been created for all fragments of DNA in NCBI, which would help us contextualize if this is really the best match for any set of reads across all references (genome, plasmid, and assembly). Such a measure of “mappability” helped regions of the human genome avoid mis-mapping of short reads, and you can see them here. Ideally, a “cross-kingdom mappability” index for every read, of any size, should be created, but unfortunately it does not yet exist. Also, as reads get longer, these problems will be solved. But, this will take some time, amid the many groups working on these fronts.

But, perhaps the bigger issue is social.  I confess I grossly underestimated how the press would sensationalize these results, and even the Department of Health (DOH) did not believe it, claiming it simply could not be true. We sent the MTA and the DOH our first draft upon submission in October 2014, the raw and processed data, as well as both of our revised drafts in December 2014 and January 2015, and we did get some feedback, but they also had other concerns at the time (Ebola guy in the subway). This is also different from what they normally do (PCR for specific targets), so we both learned from each other.  Yet, upon publication, it was clear that Twitter and blogs provided some of the same scrutiny as the three reviewers during the two rounds of peer review. But, they went even deeper and dug into the raw data, within hours of the paper coming online, and I would argue that online reviewers have become an invaluable part of scientific publishing. Thus, published work is effectively a living entity before (bioRxiv), during (online), and after publication (WSJ, Twitter, and others), and online voices constitute an critical, ensemble 4threviewer.

Going forward, the transparency of the methods, annotations, algorithms, and techniques has never been more essential. To this end, we have detailed our work in the supplemental methods, but we have also posted complete pipelines in this blog post on how to go from raw data to annotated species on Galaxy. Even better, the precise virtual machines and instantiation of what was run on a server needs to be tracked and guaranteed to be 100% reproducible. For example, for our .vcf characterizations of the human alleles, we have set up our entire pipeline on Arvados/Curoverse, free to use, so that anyone can take a .vcf file and run the exact same ancestry analyses and get the exact same results. Eventually, tools like this can automate and normalize computational aspects of metagenomics work, which is an ever-increasingly important component of genomics.

Lastly, back to privacy, which has only recently developed as a large ELSI question in metagenomics.   It is possible that a forensic metagenome signature may exist which could reveal one’s history of transit through a city, since allele frequencies across all kingdoms of life can be enriched or depleted across different boroughs or areas of a metropolitan area. Thus, one’s expectation of privacy is potentially lost as soon as you slough off cells from your skin, which include the genetic/epigenetic/metagenomic history of where you have been recently. The implications of this geospatial, genome history can even impact you in the home, just like the thermographic vision used by the police in 2001 in the Supreme Court case for Danny Kyllo’s home. But, here, if cells are sloughed from your body, and then escape through the vents of your house, into the outdoors, they are no longer considered private, and could reveal where you have been in the past day or week.

Can the methods really be that accurate? Could this really come to pass? It is not yet known how long, and across how many kingdoms, such a forensic signature might last. As long as we can accurately detect it, we will eventually know.

The post The long road from Data to Wisdom, and from DNA to Pathogen appeared first on PathoMap.

]]>
http://www.pathomap.org/2015/03/10/long-road-data-wisdom-dna-pathogen/feed/ 0
Donate to PathoMap! http://www.pathomap.org/2015/02/06/donate-pathomap/ http://www.pathomap.org/2015/02/06/donate-pathomap/#comments Fri, 06 Feb 2015 22:40:53 +0000 http://www.pathomap.org/?p=275 We need you to help keep the science going! To make a tax deductible donation to Weill Cornell Medical College to support the PathoMap project, please click here Or Donations can be mailed to: Weill Cornell Medical College Office of External Affairs 1300 York Avenue, Box 314 New York, NY  10065 Phone:  646-317-7427 or toll […]

The post Donate to PathoMap! appeared first on PathoMap.

]]>
We need you to help keep the science going!

To make a tax deductible donation to Weill Cornell Medical College to support the PathoMap project, please click here

Or

Donations can be mailed to:

Weill Cornell Medical College

Office of External Affairs

1300 York Avenue, Box 314

New York, NY  10065

Phone:  646-317-7427 or toll free: 866-329-8915

Checks should be payable to “Weill Cornell Medical College” and donors should indicate “PathoMap” on the notation line of the check.

 

Weill Cornell Medical College’s Office of External Affairs will issue tax receipts to all donors.

 

Money donated to the PathoMap campaign will go towards paying for reagents and materials, funding student volunteers working on the project, and expanding the project’s aims.

The post Donate to PathoMap! appeared first on PathoMap.

]]>
http://www.pathomap.org/2015/02/06/donate-pathomap/feed/ 0
A day in the life of the Penn Station microbiome http://www.pathomap.org/2015/02/05/gc-timecourse/ http://www.pathomap.org/2015/02/05/gc-timecourse/#comments Fri, 06 Feb 2015 03:07:48 +0000 http://www.pathomap.org/?p=238 To gauge the persistence of a microbial signature at a station, buy we sampled one train station (Penn Station) in triplicate every hour on the hour during a weekday. We found that certain taxa are prevalent at every time point, discount yet a high degree of fluctuation was observed in some genera over the course […]

The post A day in the life of the Penn Station microbiome appeared first on PathoMap.

]]>

To gauge the persistence of a microbial signature at a station, buy we sampled one train station (Penn Station) in triplicate every hour on the hour during a weekday. We found that certain taxa are prevalent at every time point, discount yet a high degree of fluctuation was observed in some genera over the course of the day. For instance, Pseudomadaceae has its greatest abundance between 11:00 and 13:00, and Moraxellaceae was greatest at the end of the day. However, for the majority of families, the peaks greatly vary by the time of day, with low traces at the rest of the time intervals.

Use your mouse to see how the microbiome at Penn Station changes over time!

Interactive design by Elizabeth Hénaff, made with Processing.js

The post A day in the life of the Penn Station microbiome appeared first on PathoMap.

]]>
http://www.pathomap.org/2015/02/05/gc-timecourse/feed/ 0
The elusive bacteria of the South Ferry station http://www.pathomap.org/2015/02/05/interactive-viz-south-ferry-station/ http://www.pathomap.org/2015/02/05/interactive-viz-south-ferry-station/#comments Thu, 05 Feb 2015 23:30:55 +0000 http://www.pathomap.org/?p=179 The station with the most unique bacteria was the South Ferry Station on the ‘‘1’’ subway line in Manhattan. This was the only station completely flooded during Hurricane Sandy in 2012, healing and it has been closed since that time. Notably, price we observe ten unique species of bacteria that were present in the single […]

The post The elusive bacteria of the South Ferry station appeared first on PathoMap.

]]>

The station with the most unique bacteria was the South Ferry Station on the ‘‘1’’ subway line in Manhattan. This was the only station completely flooded during Hurricane Sandy in 2012, healing and it has been closed since that time. Notably, price we observe ten unique species of bacteria that were present in the single flooded station and were not present in any of the other MTA stations or other samples.

Push the microbes around with your mouse to see their names!

Interactive design by Elizabeth Hénaff, made with Processing.js

The post The elusive bacteria of the South Ferry station appeared first on PathoMap.

]]>
http://www.pathomap.org/2015/02/05/interactive-viz-south-ferry-station/feed/ 0