Run the PathoMap Human Ancestry Pipeline on Arvados!


PathoMap started with a simple idea: can we collect and study DNA from the New York City subway stations? We soon found the answer to be “yes!” And so we began to create a to-do list:

  • Collect samples from all 468 stations………………………Complete
  • Extract DNA from all samples………………………………..Complete
  • Prepare libraries and sequence samples……………………Complete
  • Classify taxonomy with various tools………………………Complete

Now what? We had done our preliminary analysis with BLAST and MetaPhlAn and were stuck on how to move forward. Looking more closely at the BLAST results, we found that 0.2% of our reads matched human. And then we wondered: could we re-create the NYC Census map with the human data collected in our samples?

In collaboration with Drs. Sean Ennis, Eoghan O’Halloran, and Tiago R. Magalhaes from University College and Our Lady’s Children’s Hospital in Dublin, Ireland, we developed a novel pipeline to answer this question:


Essentially, we process our raw sequence reads, align them to the human genome, call the variants (single nucleotide polymorphisms, or SNPs), and use two ancestry tools, Admixture and Ancestry Mapper, to run ancestry assignment algorithms. Finally, we plot the findings and compare them to the Census data. The pipeline and analysis method were quite exciting, as in many instances we found near-perfect matches between the ancestry analysis and the Census data. In some stations we also found great disparity between the two datasets, though we hypothesize this could be due to a surface recently touched by a non-resident of the area. For instance, an Asian man could have used a kiosk just before our sample was collected, producing a spike in Asian markers even though the station is in a predominantly African-American neighborhood. Some of the results of our analysis are below, but check out our manuscript for more details on the methods and the full results.
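To make the final comparison step concrete, here is a minimal sketch of one way to score how closely a station's predicted ancestry proportions track the Census proportions for the same area. It is not the method from the manuscript: the total variation distance, the category labels, the one-to-one mapping between Census groups and ancestry groups, and every number in it are assumptions chosen purely for illustration.

```python
def normalize(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values.values())
    if total <= 0:
        raise ValueError("values must sum to a positive number")
    return {key: value / total for key, value in values.items()}


def total_variation(p, q):
    """Half the L1 distance between two distributions.

    Returns 0 when the distributions are identical and 1 when they
    share no categories at all.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical values for a single station (NOT data from the study).
# Predicted ancestry proportions, e.g. as output by an ancestry tool.
predicted = normalize(
    {"European": 0.70, "Asian": 0.10, "African": 0.10, "Hispanic": 0.10}
)
# Hypothetical Census head counts for the surrounding area, mapped onto
# the same labels (this mapping is itself an assumption).
census = normalize(
    {"European": 600, "Asian": 150, "African": 100, "Hispanic": 150}
)

distance = total_variation(predicted, census)
print(f"Total variation distance: {distance:.3f}")
```

A small distance would correspond to the near-perfect matches described above, while a large one would flag a station where the two datasets disagree.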



We downloaded the US Census map for New York City and ran the machine learning BIS segmentation algorithm, which breaks the map down for further analysis.

The first sample was from the Cortelyou Road station in Brooklyn. According to the US Census data, there is an enrichment of people registered as “white,” which correlates with the abundance of European markers in our ancestry prediction.

Next, at Fulton Street station in lower Manhattan, the US Census data show a mixture of Hispanic and Asian residents, which makes sense given the station’s close proximity to Chinatown. Our pipeline found the same signals in its ancestry prediction, with Hispanic and Asian genetic markers (alleles) showing the highest predicted proportions.

Finally, at the 168th Street station in upper Manhattan, in the neighborhood of Harlem, we found a dominance of African as well as Puerto Rican alleles, which matches the higher numbers of registered Black and Hispanic people in the area according to US Census data.

Here is the total data set from our paper:



Curoverse Collaboration

We wanted to take our novel pipeline and create a user-friendly interface that lets anyone perform the analysis on their own samples (.vcf files). So we collaborated with the team at Curoverse, which is developing the Arvados open source platform for managing and processing sequencing and other data. Working with Curoverse, we implemented the PathoMap Ancestry pipeline in Arvados, organized the data, and published the entire project publicly on Curoverse Cloud (Arvados as a cloud-hosted service).
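Since the pipeline takes .vcf files as input, it can be useful to check how many SNP calls a file contains before submitting it. The sketch below is not part of the PathoMap pipeline; it is a standard-library-only illustration that assumes a plain-text, tab-separated VCF with the usual column order (CHROM, POS, ID, REF, ALT, ...). The function name and the tiny in-memory example file are hypothetical.

```python
import io


def count_snps(lines):
    """Count biallelic single-nucleotide records in an iterable of VCF lines."""
    bases = {"A", "C", "G", "T"}
    count = 0
    for line in lines:
        # Skip blank lines, meta-information lines (##) and the header (#CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # malformed record, ignore it
        ref, alt = fields[3].upper(), fields[4].upper()
        # A biallelic SNP has exactly one reference base and one alternate base.
        if ref in bases and alt in bases:
            count += 1
    return count


# Hypothetical three-record VCF used only to exercise the function.
example_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1000\t.\tA\tG\t50\tPASS\t.\n"    # SNP
    "1\t2000\t.\tAT\tA\t50\tPASS\t.\n"   # deletion, not counted
    "2\t3000\t.\tC\tT,G\t50\tPASS\t.\n"  # multiallelic site, not counted
)

snp_count = count_snps(example_vcf)
print(f"Biallelic SNPs: {snp_count}")
```

For a real file, the same function can be called on an open file handle, for example `with open("sample.vcf") as handle: count_snps(handle)`, where `sample.vcf` stands in for your own file.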

You can access the project online, download the data and pipeline, or make a copy of the project and run it yourself in your own Arvados instance or in your own account on Curoverse Cloud. Accounts are free to set up. Now anyone can reproduce the work we did and use the tools and methods to extend their own research.

There are a number of resources to help you use the PathoMap Ancestry pipeline. Start with the step-by-step tutorial. If you run into any trouble or have questions about Arvados or using Arvados through Curoverse Cloud, you can ask on the public open source project IRC channel or email Curoverse directly. If you have any questions about the methods or code, please contact us.

This novel analysis highlights the forensic applications of city-scale metagenomic surveys and opens the door to several other projects and experiments to detect and study the molecular signatures and echoes left on the built environment around us.