Unique Full Length Expression Library
Further Full-Length Clone Selection for Xenopus tropicalis
Sept 19, 2005.
This document discusses the current status and future proposals for selection of putative full-length clones for full insert sequencing, and for expansion of the Xenopus tropicalis full-length clone physical reagent set, used for functional analysis.
The primary issue to be addressed is how best to work with the range of cloning vectors used in the original cDNA library construction in order both to sequence the greatest number of different genes at the least cost, and generate the largest possible set of clones as physical reagents for functional analysis.
Xenopus is unique as a model system for testing the in vivo bioactivity of proteins. Xenopus eggs are easily injected in large numbers, and both oocytes and embryos generate proteins faithfully from injected mRNA. As a consequence, the bioactivity of proteins can be quickly tested in the Xenopus system. This system for gene discovery has been used extremely effectively to advance our understanding of many diverse fields of biology including molecular signaling pathways, cell biology, and developmental biology. In order to make this system even more effective, creating a complete, unique, full-length clone collection in a vector suitable for expression analysis is imperative.
The Xenopus community is keen to continue expanding its full-length clone collection for functional analysis. It is generally accepted in the community that clones in the vector pCS107/8 are optimal for this purpose, and that clones in pCMV-SPORT6 are not. The reasons for this are discussed in Technical Note 1, below.
Full insert sequencing is successful on clones in both vectors.
Although it is technically possible to use clones for functional analysis which have not gone through full-insert sequencing, single-pass EST sequencing is not usually enough to confirm accurately the extent and sequence of the coding region, and the absence of cloning artefacts.
There are about 544,000 Xenopus tropicalis cDNA clones in accessible collections of which about 88,000 (16%) are in pCMV-SPORT6 (62,000 retained JGI; 26,000 Pollet). The SPORT6 clones may represent above average gene diversity as (a) they comprise most of the brain libraries, and (b) the very deeply EST-sequenced (>20,000 sequenced), and therefore more redundant, libraries are all in pCS10x. There are ~6,000 clones in pCS22+ but these are all in gastrula and neurula libraries which are abundantly represented in pCS10x. The remainder are in pCS10x.
It is clear that for a significant number of genes the only potential full-length clones are in SPORT6.
Clones in SPORT6 can be utilised for functional analysis, but require some additional work to prepare them. An estimated 12% of these will require transfer to a different vector, which is significantly more work. This is discussed in more detail in Technical Note 2, below. The process of converting the SPORT6 clones into a form that is amenable for functional analysis appears quite feasible, but there is clearly some additional risk of error. The converted clones may also require some level of sequencing for confirmation.
There is evidence that long 5' UTR sequences are not ideal for functional analysis, and this clearly creates a potential divergence between the utility of a clone sequence as measured by its information content (sequence length) and its use in functional analysis.
Full insert sequencing for Xenopus tropicalis has so far been undertaken by Wellcome/Sanger in the UK, and by NIH/MGC/JGI in the US.
Approximately 11,000 clones have been fed into the respective sequencing pipelines of the two groups. Of these about 2,500 are still in the pipelines. About 6,500 of the finished clones have been analysed as actually full-length, and these represent ~4,800 different genes.
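As a rough consistency check, the arithmetic behind these figures can be laid out as below. This is a sketch only. It assumes that "finished" means clones submitted minus clones still in the pipelines, which is not stated explicitly, and all inputs are the approximate values quoted above.

```python
# Approximate figures quoted in the text.
submitted = 11_000     # clones fed into the two sequencing pipelines
in_progress = 2_500    # clones still in the pipelines
full_length = 6_500    # finished clones analysed as actually full-length
genes = 4_800          # different genes represented by those clones

# Assumption (not stated in the text): finished = submitted - in progress.
finished = submitted - in_progress
full_length_rate = full_length / finished
clones_per_gene = full_length / genes

print(f"finished clones:             {finished}")
print(f"full-length rate:            {full_length_rate:.0%}")
print(f"not full-length:             {1 - full_length_rate:.0%}")
print(f"full-length clones per gene: {clones_per_gene:.2f}")
```

On these assumptions about 76% of finished clones are full-length and about 24% are not, which is close to the lower end of the ~25 - 30% failure rate anticipated in the clone selection discussion.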
The EST collection has expanded considerably since the initial rounds of clone selection for full insert sequencing, and analysis indicates that there are up to about 8,000 further clones that could sensibly be put forward for full insert sequencing, on the basis of one clone per as-yet-unsequenced gene. This breaks down to ~6,000 clones which have a good choice in pCS10x, ~2,000 clones where the choice is only pCMV-SPORT6 and 390 clones where there are choices in both, but the pCMV-SPORT6 choice has evidence that the 3’ end is complete.
Sequencing Centres and Promised Capacity
Both Wellcome/Sanger and NIH/MGC/JGI have indicated that they will fund further full insert sequencing, and also Genescope (France), through Nicolas Pollet, are committed to sequence a substantial number of clones.
Wellcome/Sanger and NIH/MGC/JGI are comfortable sequencing clones from their own 'local' collections; Genescope have not yet been asked about this, but their opinion is being canvassed at the end of September. There is some possibility that Genescope will consider sequencing clones from other sources, especially if the clones are already re-racked.
Centre              Number committed to
NIH/MGC/JGI         2,000+ (still under discussion)
Pollet/Genescope    4,000 - 5,000 (actual commitment is to 10 Mb of finished sequence)
At some level the promised capacity may exceed the probable demand.
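The Genescope commitment is expressed in finished sequence rather than in clones, so the clone count depends on insert length. The sketch below shows the mean insert length implied by the quoted range. It assumes 1 Mb = 1,000,000 bp and that the whole 10 Mb is spent on clone inserts; the function name is illustrative only.

```python
def mean_insert_bp(committed_bp, n_clones):
    """Mean finished sequence per clone implied by a total commitment."""
    return committed_bp / n_clones

COMMITTED_BP = 10_000_000  # 10 Mb of finished sequence (assuming 1 Mb = 10^6 bp)

for n_clones in (4_000, 5_000):
    print(n_clones, "clones ->", mean_insert_bp(COMMITTED_BP, n_clones), "bp per clone")
```

The quoted range of 4,000 - 5,000 clones therefore corresponds to a mean insert of 2,000 - 2,500 bp; longer inserts would reduce the number of clones that fit within the commitment.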
The Wellcome/Sanger clones are all in pCS10x, the Pollet/Genescope are all in SPORT6 (variant), and the JGI clones are in a mixture as described above.
Both Wellcome/Sanger and NIH/MGC/JGI currently prefer to maximise the number of new genes being sequenced, and are not in favour of systematic sequencing of known splice variants. On the other hand Pollet/Genescope have expressed an interest in sequencing alternative transcripts. Wellcome/Sanger have also expressed some interest in sequencing alternative clones with long 5' UTRs compared to their original selection which was biased to short 5' UTRs (see above).
Pollet/Genescope will probably pick a preliminary round of up to 500 clones this month, as they are keen to start, but will be taking input from Gilchrist's analysis to avoid unnecessary duplication of existing cDNA, and possibly also of potentially good picks in pCS10x.
Clone Selection Strategy
To optimise the usefulness of the next round of full-insert sequencing for both the Xenopus and sequencing communities, it is clear that clone selection needs to take the cloning vector into account. The best strategy would be to select clones in pCS10x in preference to SPORT6 where there is an equivalent choice. Where there is no choice, clones in SPORT6 will be selected. Numbers left to select are ~6,000 and ~2,000 clones in pCS10x and SPORT6 respectively.
A complication arises when the choice is not quite equivalent. This may occur where there is no 3' EST sequence to confirm that a clone continues beyond the 3' end of the coding sequence. Approximately 5 - 10% (??) of clones suffer from internal priming (where the 3' end of the clone starts upstream of the mRNA poly-A tail), although this is highly sequence/gene dependent. So one may have the choice between a 'probable' full-length clone in pCS10x, vs. an 'almost certainly' full-length clone in SPORT6. Given the anticipated failure rate of ~25 - 30% in sequencing it still seems to be sensible to prioritise pCS10x in this situation. By preliminary analysis the number of clones to fall into this category is 390.
Input for the clone selection process comes from two different algorithms, Gilchrist/Wellcome and Wagner/MGC. They work in slightly different ways, and it is assumed that the best result will come from combining the output of the two algorithms. In a preliminary analysis of 3,700 sequences by Gilchrist comparing the results of the two algorithms, they agreed in 84% of the cases. Where they disagreed, 6% were Partial CDS by Wagner/MGC but Full-Length by Gilchrist/Wellcome, and 5% were Complete CDS by Wagner/MGC and Full-Length Fail by Gilchrist/Wellcome. The remainder fell into one or more ambiguous categories.
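A comparison of this kind can be tallied as in the sketch below. It is not the method Gilchrist used. The label strings are taken from the categories named above, the pairing of "Full-Length" with "Complete CDS" and of "Full-Length Fail" with "Partial CDS" as agreement is an assumption, and the function name and input layout are hypothetical.

```python
from collections import Counter

# Assumed equivalence between the two algorithms' labels:
# (Gilchrist/Wellcome call, Wagner/MGC call).
EQUIVALENT = {
    ("Full-Length", "Complete CDS"),
    ("Full-Length Fail", "Partial CDS"),
}

def compare_calls(gilchrist, wagner):
    """Tally call pairs for clones assessed by both algorithms.

    Each argument maps clone id -> call. Returns the Counter of
    (Gilchrist call, Wagner call) pairs and the fraction in agreement.
    """
    shared = gilchrist.keys() & wagner.keys()
    pairs = Counter((gilchrist[c], wagner[c]) for c in shared)
    agree = sum(n for pair, n in pairs.items() if pair in EQUIVALENT)
    rate = agree / len(shared) if shared else 0.0
    return pairs, rate
```

The off-diagonal entries of the tally are the cases worth inspecting by hand, since those are where the two algorithms can be converged.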
Gilchrist and Wagner are coordinating both full-length post-sequencing analysis and clone selection. Accurate post-sequencing analysis is important as it will determine which genes are considered to be already 'done', and which genes require a new clone to be picked, the original pick having failed. Both the analysis and the selection methods are converging (or will be converged) by analysing differences.
The following clone selection strategy is proposed:
1. list all genes (from EST data) that have a full-length clone (sequenced or putative)
2. remove from gene list those where we already have a good full-length sequence
3. remove from gene list those which have a clone in one of the sequencing pipelines
4. for each gene, select clone in pCS10x in preference to SPORT6 where there is a choice
4a. for genes where the analysis of the existing cDNA sequence is ambiguous for some reason (but not an obvious failure), a secondary pick will be made, but attention will be paid to ensuring that secondary picks do not simply replicate the ambiguity of the original pick, for example in cases where the ATG is possible but not definitive.
This strategy does not take into account the source of the clone and the capacity of the sequencing centres. A preliminary analysis (detail not included here) indicates that this approach would use up the Sanger capacity of 2,000, feed 600-1000 clones into the Genescope pipeline, and leave the balance to be sequenced in the US. The primary reasons for not being able to utilise more of the offered capacity at Genescope are that there are many fewer French clones, they represent fewer genes, and they are all in SPORT6. This would leave Pollet/Genescope with capacity to pursue their interest in alternative transcripts, which would of course be of general interest and benefit.
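Steps 1 to 4 of the proposed strategy can be sketched as below. This is an illustration under stated assumptions, not the selection code in use: the Clone record, the vector labels "pCS10x" and "SPORT6", and the tie-break on clone id are all hypothetical, and step 4a (secondary picks for ambiguous cases) is not covered.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Clone:
    clone_id: str
    gene: str
    vector: str  # "pCS10x" or "SPORT6" (labels assumed for illustration)

def select_clones(candidates, sequenced_genes, pipeline_genes):
    """Return one pick per remaining gene, preferring pCS10x over SPORT6."""
    # Step 1: group putative full-length clones by gene.
    by_gene = {}
    for clone in candidates:
        by_gene.setdefault(clone.gene, []).append(clone)

    picks = {}
    for gene, clones in by_gene.items():
        # Steps 2 and 3: skip genes already sequenced or already in a pipeline.
        if gene in sequenced_genes or gene in pipeline_genes:
            continue
        # Step 4: pCS10x sorts ahead of SPORT6; clone id is an arbitrary,
        # reproducible tie-break (an assumption, not part of the proposal).
        picks[gene] = min(clones, key=lambda c: (c.vector != "pCS10x", c.clone_id))
    return picks
```

A gene with only SPORT6 candidates still receives a pick, which matches the rule that SPORT6 clones are selected where there is no choice.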
5' UTR length
Given 5' EST sequences from two different clones which both contain the predicted start ATG, but have very different lengths of 5' UTR, one is faced with a difficult choice (for full-insert sequencing) between the short 5' UTR, which seems optimal for experimental work, and the long 5' UTR, which will enable accurate confirmation of sequence closer to the start of transcription. One obvious possibility would be to sequence both the longest and the shortest where the difference was > (say) 250 nt, although there are straightforward cost implications. The enthusiasm for doing this may depend on the numbers involved.
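The "sequence both where the difference is large" option can be expressed as a simple rule, sketched below. The 250 nt threshold is the illustrative value suggested above; the function name and the input layout (clone id mapped to 5' UTR length in nt, for clones whose 5' EST contains the predicted start ATG) are hypothetical.

```python
def utr_picks(utr_lengths, threshold_nt=250):
    """Choose clones for one gene based on 5' UTR length.

    utr_lengths maps clone id -> 5' UTR length in nt and must not be empty.
    Always returns the shortest-UTR clone; also returns the longest-UTR
    clone when it exceeds the shortest by more than threshold_nt.
    """
    shortest = min(utr_lengths, key=utr_lengths.get)
    longest = max(utr_lengths, key=utr_lengths.get)
    if utr_lengths[longest] - utr_lengths[shortest] > threshold_nt:
        return [shortest, longest]
    return [shortest]
```

Running this over all candidate genes and counting the two-clone results would give the extra sequencing load implied by a given threshold, which is the number on which enthusiasm for the option is said to depend.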
One can construct an argument to say that, as the EST sequences are already known, very little new information will be derived from sequencing the longer 5' UTR. And there will be little practical impact on derived gene models (so long as ESTs are used as well as cDNA sequences).
Although most participants in these discussions are aware of this issue, it has not so far been widely debated. The choice may have to be negotiated between the various funding bodies, sequencing centres and the Xenopus community.
The Sanger Institute has (historically) expressed an interest in sequencing a limited number of longer 5' UTR clones (the original Wellcome/Sanger full-length set was deliberately biased towards short 5' UTRs).
We have already described the possibility of 'rescuing' clones in SPORT6 for use in functional analysis. In the cases where there is no alternative pCS10x clone, this is clearly the only option, short of making new cDNA libraries. There will be a certain amount of work involved, and who does this and how it is funded has yet to be decided, although the technical issues appear to be largely resolved.
In the cases where there is an apparently good (but not full-insert sequenced) alternative clone in pCS10x, there are then two possible options. Either rescue the SPORT6 clone, or do full-insert sequencing on the pCS10x. A clear picture of the relative cost/effort of these two possibilities, and the relative risk of introducing errors through additional transformations, will help in making this decision. It is of course complicated by the fact that the effort and cost may fall on different organisations in the different cases. A further complication is that the cost of full-insert sequencing is to some extent dependent on the length of the insert. It may turn out to be more cost effective to do full-insert sequencing on short genes, and clone rescue on long genes.
As a special case, and again where there is an option of an alternative pCS10x clone, it may be that sequence analysis of the existing SPORT6 full-length cDNA sequence and the EST data of the proposed pCS10x alternative would show that the EST (combining 5' and 3' where available) sequence is identical to the cDNA sequence throughout the coding region. In these cases the pCS10x clone would (presumably?) make a satisfactory substitute for the SPORT6 clone for the functional analysis set, without further cost, i.e. sequencing. This would obviously only work for short clones (<1.4 kb). We need to do some analysis to see what the numbers look like, and whether this approach would be acceptable to the community, but it would certainly be cost effective.
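The check described here amounts to asking whether the ESTs from the pCS10x clone cover every base of the coding region and agree with the cDNA sequence wherever they overlap it. A minimal sketch follows. It assumes the ESTs have already been placed on the cDNA without gaps by some alignment step (not shown), that coordinates are 0-based with an exclusive end, and that cds_end does not exceed the cDNA length; the function name and input layout are hypothetical.

```python
def est_confirms_cds(cdna, cds_start, cds_end, est_hits):
    """Return True if ESTs cover the whole CDS with no mismatches.

    cdna      : full-insert cDNA sequence of the SPORT6 clone
    cds_start : 0-based start of the coding region on cdna
    cds_end   : exclusive end of the coding region on cdna
    est_hits  : list of (offset, est_seq), each EST placed ungapped at
                offset on cdna (an assumed output of a prior alignment)
    """
    covered = [False] * (cds_end - cds_start)
    for offset, est in est_hits:
        for i, base in enumerate(est):
            pos = offset + i
            if cds_start <= pos < cds_end:
                if base.upper() != cdna[pos].upper():
                    return False  # any mismatch inside the CDS disqualifies
                covered[pos - cds_start] = True
    return all(covered)
```

The coverage requirement is what restricts the approach to short clones: the 5' and 3' EST reads together have to span the entire coding region.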
A small amount of work still needs to be done to co-ordinate the lists of clones still in the various sequencing pipelines.
Efforts to converge the analyses of existing cDNAs and picking algorithms are ongoing.
Agreement still to be reached on best approach to dealing with existing full-insert sequenced SPORT6 clones which have a viable pCS10x alternative.
Agreement tentatively reached on best approach to dealing with differing lengths of 5' UTR in alternative picks for the same gene, i.e. take shortest 5' UTR. Sanger may have an opinion, and Pollet/Genescope may make their own choices here as picking from their clone set is being independently coordinated.
Technical Note 1: Why pCS10x is Better for Expression Studies than pCMV-SPORT6 (MK)
The pCS10x vectors have been constructed so that mRNA can be generated by SP6 polymerase. The cDNA insert is then followed by the SV40 polyadenylation signal. To linearize the vector (in order to prevent unnecessary transcription of vector sequence), the polyadenylation signal is followed by a rare cutter (AscI). pCMV-SPORT6 is more difficult to work with because the only effective restriction site for linearizing the vector is ClaI, which cuts more frequently and is blocked by overlapping dam methylation in Sport6. Most libraries are propagated in bacteria that methylate effectively at dam sites, eliminating the only viable site at which to linearize this vector.
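The overlap can be illustrated with a short sequence scan. ClaI recognises ATCGAT and the dam methylase acts on GATC, so a ClaI site overlaps a dam site when it is immediately preceded by G (GATCGAT) or immediately followed by C (ATCGATC). The sketch below reports each ClaI site in a sequence and whether it has such an overlap; the function name is hypothetical and the code makes no claim about the actual Sport6 sequence.

```python
import re

CLAI_SITE = "ATCGAT"  # ClaI recognition sequence
DAM_SITE = "GATC"     # dam methylase recognition sequence

def clai_sites(seq):
    """Return (position, dam_overlap) for every ClaI site in seq."""
    seq = seq.upper()
    sites = []
    # A lookahead is used so that overlapping matches are all reported.
    for match in re.finditer(f"(?={CLAI_SITE})", seq):
        i = match.start()
        upstream = i > 0 and seq[i - 1:i + 3] == DAM_SITE  # G + ATC
        downstream = seq[i + 3:i + 7] == DAM_SITE          # GAT + C
        sites.append((i, upstream or downstream))
    return sites
```

A site flagged with an overlap would be expected to resist ClaI digestion in plasmid prepared from a dam+ host and to be cut once the plasmid is moved to a dam- host.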
Technical Note 2: Utilisation of Clones in pCMV-SPORT6 (MK)
Clones in SPORT6 may be amenable to functional analysis, but not without modification from the current libraries. First, full-length cDNAs will need to be evaluated for internal ClaI sites. Those without internal ClaI sites will need to be transferred to dam- bacteria. This step can be accomplished in a high-throughput fashion and, fortunately, is likely to be sufficient for ~88% of the clones. For the ~12% that have at least one internal ClaI site within the cDNA insert, the inserts will need to be transferred to another vector for expression analysis, since linearizing the vector would truncate the cDNA insert. Fortunately, Sport6 utilizes Invitrogen’s proprietary Gateway technology, which allows rapid and simple cloning of inserts into appropriate recipient vectors. In collaboration with Curtis Altmann, Maura Lane in the Harland lab has created CS110K, which can be used to directly clone inserts from Sport6 using BP recombination, a simple and highly efficient in vitro reaction. CS110K has all of the advantages of the pCS10x vectors except that it requires kanamycin selection.
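The first evaluation step is a simple sequence screen, sketched below under stated assumptions. The input is taken to be the cDNA insert sequence only, with vector sequence removed, so that the vector's own ClaI site is not counted. The function names and route labels are hypothetical.

```python
from collections import Counter

CLAI_SITE = "ATCGAT"  # ClaI recognition sequence

def triage_sport6_clone(insert_seq):
    """Route a SPORT6 clone by whether its insert contains a ClaI site.

    Any internal site counts, whether or not it overlaps a dam site,
    because it would become cuttable once the plasmid is grown in a
    dam- host.
    """
    if CLAI_SITE in insert_seq.upper():
        return "transfer insert to CS110K"
    return "move clone to dam- host"

def triage_summary(inserts):
    """inserts maps clone id -> insert sequence; returns route counts."""
    return Counter(triage_sport6_clone(seq) for seq in inserts.values())
```

Applied to the full-insert sequences of the SPORT6 clones, the summary counts would show how the real split compares with the ~88% / ~12% estimate.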
Technical Note 3: Activity of mRNA from pCS10x, pCMV-Sport6, and CS110K
Of course, the critical issue for any vector for functional analysis is that it makes mRNA that is active, and therefore that it has no cryptic sequences flanking the mRNA that would reduce its potency. This is not trivial since both pCMV-Sport6 and CS110K contain flanking sequences which will be transcribed as mRNA and therefore could have deleterious effects. pCS10x vectors have been extensively tested and are currently the standard. In testing a single mRNA (Wnt-8), all three vectors can produce mRNA that is active as assayed by the generation of secondary axes in the Xenopus embryo. In addition, the potency of the mRNA can be compared since Wnt-8 requires relatively low doses (10 pg) to induce secondary axes. These experiments suggested that the pCMV-Sport6 vector makes mRNA nearly identical in potency to pCS10x but that the CS110K vector may be slightly less potent. However, these observations were not rigorously tested. Regardless, each vector can make functional mRNA. Since a high priority is to make the complete gene set, it is expedient to make use of CS110K and then in the future consider replacing these clones with pCS10x clones, again highlighting the evolving nature of this gene collection.