GitHub page launched

Since we have so many open source projects, we decided to launch a GitHub page:

Gut microbiome paper published in PLoS One

“Gut Microbiome Metagenomics Analysis Suggests a Functional Model for the Development of Autoimmunity for Type 1 Diabetes”

Read it in PLoS One.

See our GitHub!

You can now track the development of  Pangea and TaxCollector on our GitHub page.

TaxCollector Published

Our project TaxCollector has been published in MPDI Diversity.


The high level of conservation of 16S ribosomal RNA gene (16S rRNA) in all Prokaryotes makes this gene an ideal tool for the rapid identification and classification of these microorganisms. Databases such as the Ribosomal Database Project II (RDP-II) and the Greengenes Project offer access to sets of ribosomal RNA sequence databases useful in identification of microbes in a culture-independent analysis of microbial communities. However, these databases do not contain all of the taxonomic levels attached to the published names of the bacterial and archaeal sequences. TaxCollector is a set of scripts developed in Python language that attaches taxonomic information to all 16S rRNA sequences in the RDP-II and Greengenes databases. These modified databases are referred to as TaxCollector databases, which when used in conjunction with BLAST allow for rapid classification of sequences from any environmental or clinical source at six different taxonomic levels, from domain to species. The TaxCollector database prepared from the RDP-II database is an important component of a new 16S rRNA pipeline called PANGEA. The usefulness of TaxCollector databases is demonstrated with two very different datasets obtained using samples from a clinical setting and an agricultural soil.

TaxCollector is available at SourceForge and GitHub and is licensed under the Open Source GNU GPL v3.

Pangea Paper available at ISME

The paper on PANGEA has been released and is available in Nature’s Journal of Microbial Ecology.


High-throughput DNA sequencing can identify organisms and describe population structures in many environmental and clinical samples. Current technologies generate millions of reads in a single run, requiring extensive computational strategies to organize, analyze and interpret those sequences. A series of bioinformatics tools for high-throughput sequencing analysis, including pre-processing, clustering, database matching and classification, have been compiled into a pipeline called PANGEA. The PANGEA pipeline was written in Perl and can be run on Mac OSX, Windows or Linux. With PANGEA, sequences obtained directly from the sequencer can be processed quickly to provide the files needed for sequence identification by BLAST and for comparison of microbial communities. Two different sets of bacterial 16S rRNA sequences were used to show the efficiency of this workflow. The first set of 16S rRNA sequences is derived from various soils from Hawaii Volcanoes National Park. The second set is derived from stool samples collected from diabetes-resistant and diabetes-prone rats. The workflow described here allows the investigator to quickly assess libraries of sequences on personal computers with customized databases. PANGEA is provided for users as individual scripts for each step in the process or as a single script where all processes, except the χ2step, are joined into one program called the ‘backbone’.

Pangea available on SourceForge

You can now download Pangea from

Here is the URL: is the world’s largest provider of hosting for Open Source software development projects. provides a variety of services to projects, including a download mirror network, collaborative development tools (like CVS and Subversion), and tools to support discussion and support. These services are provided to projects and their end-users free-of-charge.