X. tropicalis Genome Sequencing Steering Committee Meeting
Report of the Xenopus tropicalis genome sequencing steering committee: February 3, 2006
JGI: Paul Richardson, Igor Grigoriev, Erika Lindquist, Astrid Terry, Len Pennacchio
Stanford: Jeremy Schmutz
UC Berkeley: Dan Rokhsar, Richard Harland, Mustafa Khokha, Tim Grammer, Nik Putnam
U Virginia: Rob Grainger
UC Irvine: Bruce Blumberg
Washington University: Wes Warren, Tina Graves
University of Houston: Amy Sater
A list of potential participants is being assembled. Gene structure and content will be under the purview of JGI, while pathways and organ systems, with their genes of interest, will be farmed out to the community at large. Grainger and Harland are sending preliminary inquiries and invitations from the Xenopus community. Paul Richardson will look into non-Xenopus researchers who may be of help based on their expertise in particular areas of interest.
The genome portal will have its manual annotation feature turned on, and Astrid Terry will run tutorials. At least two tutorial sessions are planned to accommodate the expected number of participants, around 50.
What are the expectations? The genes should be correct, missing genes should be noted, and overpredicted genes should be removed.
Comments on the browser:
In general, people find the JGI browser the easiest to look at, but it has lost an important feature: access to the full set of ESTs. Both the JGI clustered ESTs and the Gurdon Institute (Mike Gilchrist) clusters are shown, but both are dead ends. For now, the ID number from Gilchrist can be pasted into his search at http://informatics.gurdon.cam.ac.uk/online/xt-fl-db.html, and he will work with Astrid to make a direct link work. EST tracks, which used to be available (and are still available at the UCSC browser), will be restored unless the link to the Gurdon Institute clusters is made transparent.
Nomenclature is not settled, and a call will be set up for next week to finalize it. It was suggested that the best option (based on past experience with other genomes) may be to use the mouse nomenclature, since the mouse genome appears to have the best annotation. It was agreed that Peter Vize and Enrique Amaya should be brought into this discussion.
Wes Warren and Tina Graves have completed another ~4x clone coverage, with at minimum 1.2-1.7x paired-end coverage. (Tina reported that they had done another 99,000 reads, with 76% successful paired ends.)
The new paired-end distribution showed no evidence for extensive bias in the library.
The assembly could be improved further with more reads, but the yield of new scaffold joins is not impressive and, at the current yield, arguably may not be cost-effective. About 300,000 BAC end reads were analyzed by Jeremy’s group and resulted in only 100 additional scaffold linkages, leaving around 1300 scaffolds unlinked. Both Jeremy and Wes suggested that additional BAC libraries and end sequencing are unlikely to provide the needed improvement to the assembly. One problem noted is the relatively small insert size of the current BACs. Sheared BAC libraries, which had previously been suggested as a way of avoiding the biased genome coverage that some of the earlier restriction-enzyme-based libraries suffered from, would likely also yield relatively small inserts and therefore not solve this problem. It was mentioned that production of a larger-insert BAC library might not be attainable.
It was agreed that a new assembly of the genome that incorporates all of the newly available BAC sequences might provide a little more contiguity.
Probably the best way to increase linkage between clusters would be to use alternative methods of genetic mapping, some of which are under way, or Happy Mapping, which has not yet been tried. Amy Sater and the UH/BCM group will initiate a Happy Mapping pilot project for long-range linkage analysis this spring.
Currently the genetic map has 523 markers; analysis is under way for an additional 700 markers, and a 1000-marker map should be posted in April. Amy will also assess how much of the assembly is represented in the genetic map and identify discrepancies between the genetic map and the assembly. Over 95% of scaffolds represented by multiple markers appear in a single linkage group, as expected. However, in many cases, markers from a single scaffold are interrupted by markers from a different scaffold (loss of contiguity), or the order of markers differs between the genetic map and the assembly (loss of colinearity). The UH/BCM group has modified the marker identification script to select markers from scaffolds that are not currently represented on the map. Tri- and tetranucleotide markers were identified from scaffolds representing over 85% of the genome, but fewer than half of these are suitable for mapping. The group has switched to dinucleotide markers, which are far more abundant.
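The two kinds of discrepancy described above (loss of contiguity and loss of colinearity) can be illustrated with a short check. This is only a sketch and not the UH/BCM group's script: the function name, the input layout (marker name, scaffold ID, position on scaffold, listed in genetic-map order for one linkage group), and the example data are all assumptions made for illustration.

```python
from collections import defaultdict


def check_linkage_group(markers):
    """Flag scaffolds whose markers disagree with the genetic map.

    markers: list of (marker_name, scaffold_id, scaffold_pos) tuples for one
    linkage group, already sorted in genetic-map order (hypothetical layout).
    Returns (interrupted, noncolinear), two sets of scaffold IDs.
    """
    # For each scaffold, record where its markers fall in the map order
    # and their coordinates on the scaffold.
    hits_by_scaffold = defaultdict(list)
    for map_index, (_name, scaffold, pos) in enumerate(markers):
        hits_by_scaffold[scaffold].append((map_index, pos))

    interrupted = set()   # loss of contiguity
    noncolinear = set()   # loss of colinearity
    for scaffold, hits in hits_by_scaffold.items():
        if len(hits) < 2:
            continue  # a single marker cannot show either problem
        map_indices = [i for i, _ in hits]
        # Contiguous means the scaffold's markers occupy consecutive map slots.
        if map_indices[-1] - map_indices[0] + 1 != len(hits):
            interrupted.add(scaffold)
        # Colinear means scaffold positions run monotonically along the map
        # (either orientation is acceptable).
        coords = [p for _, p in hits]
        increasing = all(a <= b for a, b in zip(coords, coords[1:]))
        decreasing = all(a >= b for a, b in zip(coords, coords[1:]))
        if not (increasing or decreasing):
            noncolinear.add(scaffold)
    return interrupted, noncolinear


# Made-up example: s1 is interrupted by s2; s3 has its markers out of order.
example = [
    ("m1", "s1", 100), ("m2", "s2", 50), ("m3", "s1", 300),
    ("m4", "s3", 10), ("m5", "s3", 500), ("m6", "s3", 200),
]
interrupted, noncolinear = check_linkage_group(example)
```

The sketch ignores markers that share the same map position, which a real comparison would need to treat as unordered before calling a scaffold non-colinear.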
How should a user look for likely regions of misassembly?
Astrid will assess whether the fosmid paired ends could be put on another browser track, so that the user can look for low coverage regions.
Blocks of synteny could also be added to provide confidence.
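As an illustration of what a fosmid paired-end track would let a user look for, the sketch below reports stretches of a scaffold spanned by fewer than a chosen number of clones. It is a minimal example under stated assumptions, not a description of the JGI browser: the function name, the half-open (start, end) clone spans, and the default threshold are all hypothetical.

```python
def low_coverage_regions(scaffold_length, clone_spans, min_depth=1):
    """Return half-open (start, end) intervals with clone depth < min_depth.

    clone_spans: list of (start, end) scaffold coordinates, one per fosmid
    whose two end reads both placed on this scaffold (hypothetical input).
    """
    # Difference array: +1 where a clone begins, -1 where it ends.
    delta = [0] * (scaffold_length + 1)
    for start, end in clone_spans:
        start = max(0, start)
        end = min(scaffold_length, end)
        if start < end:
            delta[start] += 1
            delta[end] -= 1

    regions = []
    depth = 0
    region_start = None
    for pos in range(scaffold_length):
        depth += delta[pos]  # running sum gives clone depth at pos
        if depth < min_depth:
            if region_start is None:
                region_start = pos
        elif region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    if region_start is not None:
        regions.append((region_start, scaffold_length))
    return regions


# Made-up example: three clones on a 100-base scaffold.
spans = [(0, 40), (30, 60), (80, 100)]
uncovered = low_coverage_regions(100, spans)        # no spanning clone
thin = low_coverage_regions(100, spans, min_depth=2)  # fewer than two clones
```

Regions with no spanning clone are candidates for misjoins or gaps, although low depth alone does not prove misassembly; it only marks where to look more closely.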
It was mentioned that the goal of the current genome project may not be to assemble all of the scaffolds, but rather to have 99% of ESTs with conserved coding regions found in the assembly. This may require further screening and shotgun sequencing of BACs, and possibly examination of different libraries (Pollet’s library and Shimizu’s library are possibilities).