Shotgun sequencing and assembly of the genome of Geospiza magnirostris – one of Darwin’s finches
A paper describing this data set is in preparation. Draft text of the paper is below:
“Darwin’s Finches” are a model system for the study of various aspects of evolution and development. In 2008 we commenced a project to sequence the genomes of some of these species, inspired by the (then) upcoming celebration of the 200th anniversary of the birth of Charles Darwin (which was in February 2009). The project started with a brief discussion at the AGBT meeting in 2008 and continued via an email conversation between Jonathan Eisen and Jason Affourtit about the possibility of a collaboration involving the 454 company (which was looking for projects to highlight the power of its then relatively new 454 sequencing machines). After further discussions between Jonathan Eisen, his brother Michael Eisen (who separately had become interested in Darwin’s finches) and people from 454, it was decided that this was a potentially good project for a scientific and marketing collaboration.
In these conversations it was determined that the most likely limiting factor would be access to DNA from the finches. This was largely because the Galapagos Islands (where the finches reside) are a National Park in Ecuador and also a World Heritage site, so collection of samples there for any type of research is highly regulated. Thus, Jonathan Eisen made contact with Peter and Rosemary Grant, the most prominent researchers working on the finches, with whom Eisen had discussed sequencing the finch genomes in the early 2000s. In that previous conversation it was determined that the sequencing would be too expensive to carry out without a major fundraising effort. However, with the advent of “next generation” sequencing methods such as 454, the total costs of such a project would be much lower. The Grants offered to ask around to see if anyone had sufficient amounts of DNA (or access to samples), which would be needed for genome library construction. Subsequently they identified Arkhat Abzhanov from Harvard as someone who likely had samples from many of the finch species, as well as permission to do DNA-based work on them.
Abzhanov offered to provide samples from three key species (large ground finch Geospiza magnirostris, large cactus finch G. conirostris and sharp-billed finch G. difficilis) and DNA was sent to Roche-454 for sequencing in July of 2008. In August, the first “test” sequence data was provided from Geospiza magnirostris. A plan was then made to generate additional data and Roche offered to do the sequencing at their center at a steep discount. Funds were raised by Jonathan Eisen, Greg Wray, Monica Riley, and others to pay for the sequencing, and over the next year or so three sequencing bursts were conducted at Roche-454. Because G. magnirostris was the most deeply sequenced species, an assembly was generated for this species using Newbler. We report here the results of the sequencing and assembly and present this as a “data paper” in order to make the data available to the community.
DNA isolation: Samples were provided by Arkhat Abzhanov. Genomic DNA samples were taken from individual late-stage embryos representing the three species of Darwin’s finches (G. magnirostris, G. conirostris and G. difficilis) collected during a field trip to the island of Genovesa (Galápagos) in 2009. The embryonic trunk tissue was preserved in RNAlater solution (Ambion) and treated as fresh tissue with a commercial genomic DNA preparation kit (QIAGEN Genomic DNA Purification Kit). The quality of the obtained gDNA was checked with a NanoDrop Spectrophotometer (ThermoScientific) and an Agilent 2100 Bioanalyzer.
Library construction and sequencing: DNA library construction and sequencing were done at 454-Corporation under the coordination of Timothy Harkins, Jason Affourtit, Clotilde Teiling and Benjamin Boese. DNA libraries were constructed using standard techniques for Roche-454 sequencing. In summary: 3 µg of purified genomic DNA was fractionated into fragments of the targeted size ranges; short adaptors were ligated to each fragment; single-stranded fragments were created and immobilized onto specifically designed DNA capture beads; the bead-bound library was emulsified with amplification reagents in a water-in-oil mixture, resulting in microreactors each containing (ideally) just one bead with one unique sample-library fragment; the emulsion beads were then subjected to PCR amplification; the emulsion mixture was then broken while the amplified fragments remained bound to their beads; and the DNA-carrying capture beads were loaded onto a PicoTiterPlate device for sequencing. The device was then loaded into the Genome Sequencer system, where individual nucleotides were flowed in a fixed order across the open wells and DNA capture beads; incorporation of a nucleotide complementary to the template strand results in a chemiluminescent signal recorded by the CCD camera of the instrument. Roche-454 software was then used to determine the sequence of ~900,000 reads per instrument run by analyzing a combination of signal intensity and positional information generated across the PicoTiterPlate device.
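The flow-based read-out described above can be illustrated with a toy base caller. This is a minimal sketch and not the Roche-454 software: it assumes a repeating flow order of T, A, C, G and assumes that each signal intensity, rounded to the nearest integer, gives the number of identical bases incorporated in that flow. The function name, the flow order default and the example signal values are all illustrative assumptions.

```python
def call_bases(flow_values, flow_order="TACG"):
    """Turn per-flow signal intensities into a base sequence.

    Assumption: a signal of about 1.0 means one base was incorporated,
    about 2.0 means a homopolymer of two bases, and so on. Production
    base callers use more elaborate signal models than simple rounding.
    """
    bases = []
    for i, signal in enumerate(flow_values):
        # The nucleotide flowed at step i follows the repeating flow order.
        nucleotide = flow_order[i % len(flow_order)]
        # Round half up; signals are assumed to be non-negative.
        count = int(signal + 0.5)
        bases.append(nucleotide * count)
    return "".join(bases)


# Hypothetical signal values for eight consecutive flows (T, A, C, G, T, A, C, G).
example_flows = [1.02, 0.05, 2.10, 0.97, 0.03, 1.10, 0.00, 2.95]
print(call_bases(example_flows))  # prints TCCGAGGG
```

Because homopolymer length is inferred from signal strength rather than from separate incorporation events, small intensity errors in long homopolymer runs translate into insertion or deletion errors in this kind of scheme.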
In total, twenty-eight long-read runs, ninety-six runs on 2.5 kbp mate-pair libraries, and forty runs on 5 kbp mate-pair libraries were generated. Mate-pair libraries in each size range were constructed multiple times, yielding five mate-pair libraries of approximately 5 kbp insert size and an additional five libraries at about 2.5 kbp. More detail on the sequencing data is provided in Appendix 1.
Assembly of G. magnirostris genome: Data from twenty-eight long-read runs, ninety-six runs on 2.5 kbp mate-pair libraries, and forty runs on 5 kbp mate-pair libraries were combined for the assembly. Pyrosequencing reads in SFF format were assembled with the Newbler software version 2.3 using the vendor-recommended protocol. Briefly, contigs were generated using the long-read data, and mate-pair reads were then mapped to the contigs and used to link contigs into scaffolds. In total, 24.4 million reads comprising 7.0 Gbp were used to form contigs and an additional 4.1 million read pairs were used for scaffolding.
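As a rough sanity check on these totals, the mean read length and the nominal sequencing depth can be worked out from figures quoted in this paper. The sketch below is back-of-the-envelope arithmetic only; the variable names are illustrative, and the two denominators are the estimated genome size and the total contig span reported for the assembly.

```python
# Figures quoted in the text for the contig-building step.
total_reads = 24.4e6   # reads used to form contigs
total_bases = 7.0e9    # bases in those reads (7.0 Gbp)

# Sizes reported for the resulting assembly.
estimated_genome_bp = 1254.6e6  # estimated genome size
contig_span_bp = 958.3e6        # total length spanned by contigs

mean_read_length = total_bases / total_reads
depth_vs_genome = total_bases / estimated_genome_bp
depth_vs_contigs = total_bases / contig_span_bp

print(f"mean read length: {mean_read_length:.0f} bp")
print(f"nominal depth vs estimated genome: {depth_vs_genome:.1f}x")
print(f"nominal depth vs contig span: {depth_vs_contigs:.1f}x")
```

Under these assumptions the mean read length is roughly 287 bp, and the nominal depth falls between about 5.6x and 7.3x, which brackets the reported median coverage of 6.5x.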
The resulting assembly contains 12,958 scaffolds in an estimated genome size of 1,254.6 Mbp, with a scaffold N50 of 382 kbp. The scaffolds comprise 394,409 contigs spanning 958.3 Mbp. The coverage distribution has a median of 6.5x with a long tail to the right, suggesting that some repeat regions may not be fully resolved.
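For readers unfamiliar with the N50 statistic, the following minimal sketch shows one common way to compute it: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of the summed length. Whether the reported scaffold N50 was computed against the summed scaffold length or against the estimated genome size is not stated here, so the choice of denominator in this sketch is an assumption, and the toy lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Assumption: the reference total is the sum of the supplied lengths,
    not an external genome-size estimate.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length that pushes the running total to at least
        # half of the overall total is the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy scaffold lengths (hypothetical values, not taken from the assembly).
toy_lengths = [100, 80, 50, 40, 30]
print(n50(toy_lengths))  # prints 80
```

In the toy example the lengths sum to 300, and the two longest sequences (100 + 80 = 180) already cover half of that, so the N50 is 80.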
We are currently carrying out a more detailed analysis of the genome assembly, but we wanted to release the data to the community and have therefore written this “data” paper.
We acknowledge the following people: Peter and Rosemary Grant for helping initiate the project, Greg Wray and Peg Riley for financial support, Russell Neches for assistance with some sequencing analyses, and Brian Desany, Karin Frederickson, Courtney Brady and Take Ogawa for help with communications with Roche.