Genome Sequencer 20 System: Breakthrough in a New Applications Age of Sequencing

Jan Berka¹, Cynthia Turcotte¹, Bruce Taillon¹, and Marcus Dröge²*
¹454 Corporation, Branford, Connecticut, USA; ²Roche Applied Science, Penzberg, Germany

*Corresponding author

Introduction

The Genome Sequencer 20 System is an ultra-high-throughput automated DNA sequencing system that carries out and monitors sequencing reactions in a massively parallel fashion. Because the system provides a complete solution for ultra-high-throughput DNA sequencing, an individual researcher can, for the first time, prepare samples and sequencing reactions, generate sequence reads, and assemble genome sequence data within days. The whole genome sequencing workflow, from sample input to data output, consists of DNA library preparation, emulsion-based clonal PCR amplification (emPCR), PicoTiterPlate device preparation, the sequencing run, and data analysis. A single run typically yields 20 × 10⁶ nucleotides or more (for the 70 × 75 mm PicoTiterPlate device) at an average read length of 100 high-quality bases, and multiple runs can be pooled for off-line assembly or mapping. The final consensus sequence is output as a FASTA file, with an associated basecall quality score file.

The combination of massive throughput and low cost per clonal read makes it possible to address many applications that were not feasible before the launch of the system. This article provides an overview of what is currently possible, including new applications.


Applications

Whole genome sequencing

The Genome Sequencer 20 System has already revolutionized whole genome sequencing of microorganisms. For instance, sequencing a 3 megabase bacterial genome to a high-quality draft is now possible within days rather than months. Because the system generates high-quality reads at an average read length of 100 bases, both de novo assembly and resequencing (mapping) of genomes are possible.

The mapping application aligns the reads to a reference sequence to generate the consensus DNA sequence, together with a list of high-confidence mutations. The current version of the Genome Sequencer 20 Software has the capacity to analyze genomes up to 50 Mbp in size at 15-25x depth of coverage. The mapping application will typically result in ≥99.99% accuracy over 95% of the non-repeat parts of the genome (Q40+ bases) when the average genome coverage is at least 15-fold. The assembler application will yield N50 contig sizes ≥10 kb with ≥99.99% accuracy over 95% of the non-repeat parts of the genome (Q40+ bases) when the average genome coverage is at least 25-fold. Examples of several bacterial genome assemblies are shown in Table 1.
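The two metrics quoted above can be made concrete with a short sketch. It assumes the standard Phred definition of a quality score (Q = -10 log10 of the error probability) and the usual definition of N50; the function names are illustrative and are not part of the Genome Sequencer 20 Software.

```python
def phred_to_accuracy(q):
    """Per-base accuracy implied by a Phred quality score Q.

    Standard Phred definition: error probability p = 10^(-Q/10),
    so accuracy = 1 - p.
    """
    return 1.0 - 10.0 ** (-q / 10.0)


def n50(contig_lengths):
    """N50 of an assembly.

    Sort contigs from longest to shortest and accumulate their lengths;
    N50 is the length of the contig at which the running sum first
    reaches half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2.0
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")
```

Under this definition, Q40 corresponds to an error probability of 1 in 10,000, which is where the 99.99% accuracy figure comes from.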

Since the Genome Sequencer 20 System uses neither cloning nor electrophoretic separation, the sequence coverage biases normally associated with these techniques are eliminated. We have confirmed the lack of sequence coverage bias by sequencing several bacterial genomes. The remaining gaps in assembled genome sequences are due largely to the presence of sequence repeats longer than ~75 bp. The absence of a cloning step makes the Genome Sequencer 20 System particularly useful for sequencing AT-rich organisms whose DNA resists subcloning in E. coli. One example is the filamentous fungus Neurospora crassa: using the 454 sequencing technique, 2.5% additional sequence information was identified compared with the Sanger sequencing approach. Not surprisingly, the GC content of this additional sequence was quite low (27%; www.genome-sequencing.com).


Recently, 454 Life Sciences has developed a new protocol which makes whole genome sequencing using the Genome Sequencer 20 System even more efficient. Paired-end libraries are generated and sequenced in order to determine the orientation and relative positions of contigs produced by the de novo shotgun sequencing and assembly (Figure 1). Sequence data obtained from the paired-end libraries are combined with standard Genome Sequencer 20 whole genome shotgun sequencing reads in a new version of the assembler. The benefits of combining the Genome Sequencer shotgun sequence reads with the paired-end reads have been tested on several bacterial genomes and a Saccharomyces cerevisiae genome previously sequenced at 454 Life Sciences.

For instance, the 4.6 Mbp genome of the E. coli K12 strain was sequenced in three standard runs to a depth of 22-fold. The assembly performed with the Newbler assembly software resulted in 140 unoriented contigs. One additional sequencing run of a paired-end library yielded approximately 112,000 reads. The paired-end data improved the genome assembly to 20 multi-contig scaffolds covering 98.6% of the genome. The 12.2 Mbp genome of S. cerevisiae S288C (16 haploid chromosomes and one 86 Kbp mitochondrion) was shotgun sequenced in nine sequencing runs, yielding approximately 23-fold oversampling. The assembly performed with the Genome Sequencer De Novo Assembler resulted in 821 unoriented contigs. Two additional sequencing runs of a paired-end library yielded approximately 395,000 reads. The paired-end data reduced the assembly to 153 scaffolds, covering 93.2% of the genome.
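As a rough cross-check of these figures, mean depth of coverage is simply total sequenced bases divided by genome size. The sketch below back-calculates the average yield per run implied by the reported depths. The function names are illustrative, and the results are only approximate because the depths in the text are rounded.

```python
def mean_coverage(total_bases, genome_size):
    """Mean depth of coverage: total sequenced bases / genome size."""
    return total_bases / genome_size


def implied_yield_per_run(coverage, genome_size, runs):
    """Average bases per run implied by a reported depth of coverage."""
    return coverage * genome_size / runs


# E. coli K12: 4.6 Mbp genome, three runs, 22-fold depth
ecoli_yield = implied_yield_per_run(22, 4.6e6, 3)

# S. cerevisiae S288C: 12.2 Mbp genome, nine runs, ~23-fold depth
yeast_yield = implied_yield_per_run(23, 12.2e6, 9)

print(f"E. coli: {ecoli_yield / 1e6:.1f} Mb per run")
print(f"Yeast:   {yeast_yield / 1e6:.1f} Mb per run")
```

Both values come out above 30 Mb per run, which is consistent with the typical output of 20 × 10⁶ nucleotides or more per run given in the Introduction.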

Amplicon analysis

The characteristics of Genome Sequencer 20 sequence reads (on average 100 bases long, but tens-of-thousands-fold deep) open a unique opportunity to employ the system in applications where detecting rare variants of a known sequence in complex mixtures of sequences is crucial. Direct sequencing of mixed, non-clonal PCR products (amplicons) using Sanger dideoxy terminator chemistry is not sensitive enough to identify and quantify many of the sequence variants present in biological specimens. Bacterial cloning of amplicons into a vector before traditional sequencing of individual clones increases the sensitivity, but only with a large increase in time and cost, which makes this approach uneconomical. The 454 technology provides amplification of hundreds of thousands of molecules via the emulsion PCR step, together with highly accurate sequencing, as each fragment is sequenced hundred- or thousand-fold deep.

Although there are many potential uses for amplicon sequencing, the molecular biology and software developments at 454 Life Sciences have initially focused on oncology research applications, more specifically on the detection of rare somatic mutations in complex cancer samples. The ability to sensitively detect somatic mutations in cancer cells will help researchers understand the development of cancer at the genetic level in much greater detail. Additionally, none of the other existing high-throughput technologies offers the possibility of novel variant detection.


To demonstrate the power of the Genome Sequencer 20 System, we chose as a model system the previously described single nucleotide polymorphisms in the stretch from upstream of the HLA-DMA gene to the TAP2 gene, in the class II region of the MHC [1]. We were able to reproduce the published data using our system; allele frequencies down to 3% were easily detected (Figure 2).
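The idea of reading allele frequencies directly from deep amplicon coverage can be sketched as follows. This is a minimal illustration, not the 454 analysis software: the function names are hypothetical, and the 3% default threshold is borrowed from the detection level reported above rather than from any software setting. A real variant caller would also have to model sequencing error.

```python
from collections import Counter


def allele_frequencies(bases):
    """Fraction of reads showing each base at one amplicon position.

    `bases` is any iterable of base calls (e.g. a string such as "AAGA"),
    one call per read covering the position.
    """
    counts = Counter(bases)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: n / total for base, n in counts.items()}


def variants_above(bases, reference_base, min_freq=0.03):
    """Non-reference alleles seen at or above `min_freq` (assumed cutoff)."""
    freqs = allele_frequencies(bases)
    return {
        base: freq
        for base, freq in freqs.items()
        if base != reference_base and freq >= min_freq
    }
```

For example, with 94 reads showing A, 5 showing G, and 1 showing T at a position whose reference base is A, only G (5%) passes the assumed 3% cutoff; the single T read (1%) does not.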

The results of a recent study confirmed that the Genome Sequencer 20 System enables detection of low-abundance oncogene mutations in complex samples with low tumor content, for which conventional Sanger sequencing was not informative [2]. Somatic EGFR mutations were identified that had been missed by the Sanger sequencing method. Hence, applications using the Genome Sequencer 20 System clearly have the potential to accelerate cancer research.

Transcriptome and gene regulation studies

The Genome Sequencer 20 System makes it possible to study transcriptomes at a previously unattainable depth of coverage and sensitivity. This is due to the system’s massively parallel sequencing technology, which generates a high number of sequence reads (a minimum of 200,000 single reads per 5-hour run) and so facilitates the identification of previously unknown transcripts [3]. First results from a short-tag sequencing project also indicate that the Genome Sequencer 20 System is very well suited for transcript quantification (data not shown).

In terms of gene regulation, the 454 technology has so far been shown to be very well suited for the genome-wide identification of small non-coding RNAs (sncRNAs), the identification of transcription factor binding sites, and the elucidation of DNA methylation patterns. Compared with sequencing of sncRNAs using the Sanger approach, in which miRNA fragments are concatemerized to make sequencing more economical, the Genome Sequencer 20 approach is much more straightforward: the often difficult concatemerization step can be skipped. Moreover, costs per clonal read are much lower with the Genome Sequencer 20 System, which provides a real basis for screening for sncRNAs on a genome-wide level. As an example, Girard et al. used the system to characterize a new class of small RNAs, called piwi-interacting RNAs (piRNAs), in mouse testes [4]. More than 87,000 reads were generated, around 53,000 of which could be classified as candidate piRNAs. Other examples of sncRNA characterization include the genome-wide analysis of an Arabidopsis thaliana dicer mutant [5] and the characterization of the piRNA complex from rat testes [6].

The identification of binding sites of DNA-binding proteins, such as those of the transcription factor p53, using the Genome Sequencer 20 System has recently been shown [3]. DNA fragments that include binding-site sequences can be isolated after immunoprecipitation with their protecting transcription factors and characterized using high-throughput sequencing. The study revealed that binding sites can be detected with unprecedented efficiency and sensitivity.


Loss of methylation and hypermethylation of CpG islands within promoter regions are both known to be very important mechanisms regulating many genes. Genome methylation occurs at cytosine residues located 5´ to a guanosine in a CpG dinucleotide. Dense areas of CpG dinucleotides within promoter regions are organized into CpG islands.

Applying a known bisulfite treatment procedure, 454 Life Sciences has recently established a sequencing-based technology to quantitatively characterize the methylation state of each CpG dinucleotide in a given target genomic sequence (Figure 3). To better understand how the chemistry performs on cancer research samples, eight colorectal cancer (CRC) tumor samples and their matched normal adjacent tissue (NAT) were analyzed (Figure 4). The frequency of methylation is determined by the formula Fmeth = 1 − FreqCtoT, where FreqCtoT is the frequency of conversion of the cytosine to a thymine in a CpG island. The results obtained in this experiment are supported by published literature: a significant percentage of CRCs show methylation of the p16 CpG island [7, 8].
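The formula can be applied per CpG position from read counts, as in the sketch below. It relies on the general principle of bisulfite sequencing: unmethylated cytosines are converted and read as thymine, while methylated cytosines are protected and still read as cytosine. The function name and the count-based input are assumptions made for illustration; the text does not describe how the 454 software computes FreqCtoT internally.

```python
def methylation_frequency(c_reads, t_reads):
    """Fmeth = 1 - FreqCtoT at one CpG cytosine.

    c_reads: reads still showing C after bisulfite treatment (methylated)
    t_reads: reads showing T after bisulfite treatment (unmethylated)
    """
    total = c_reads + t_reads
    if total == 0:
        raise ValueError("no reads cover this CpG position")
    freq_c_to_t = t_reads / total
    return 1.0 - freq_c_to_t
```

For a position covered by 100 reads of which 70 read as T, FreqCtoT is 0.7 and the methylation frequency is about 0.3.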


Conclusions

The Genome Sequencer 20 System is the first high-throughput, low-cost alternative to current systems based on Sanger chemistry. Using this sequencing system, it is possible to address a broad variety of applications in the fields of whole genome sequencing, transcriptome and gene regulation studies, and amplicon analysis. Many of the applications mentioned in this article simply cannot be addressed using Sanger technology, for technical or economic reasons, which demonstrates the enabling character of the 454 technology. It will lead to completely new insights in genomic research, as already shown by the identification of novel transcripts and of unknown classes of sncRNAs. Importantly, this technology will provide oncologists with a tool that facilitates cancer research to an extent that was not previously possible.

References

1. http://www.le.ac.uk/gc/ajj/HLA/Polymorphism.html

2. Thomas RK et al. (2006) Nat Med 12: 852–855

3. Ng P et al. (2006) Nucleic Acids Res 34: e84

4. Girard A et al. (2006) Nature 442: 199–202

5. Henderson IR et al. (2006) Nat Genet 38: 721–725

6. Lau NC et al. (2006) Science 313: 363–367

7. Kim BN et al. (2005) Int J Oncol 26: 1217–1226

8. Herman JG et al. (1996) Proc Natl Acad Sci 93: 9821–9826


This article was originally published in Biochemica 4/2006, pages 7-10. ©Springer Medizin Verlag 2006
