Genome and Transcriptome Assemblies

Accurate and complete genome assembly from next-generation sequencing data remains a significant challenge, especially for large, complex eukaryotic genomes. The best assemblies are generated from a mix of paired-end and mate-pair libraries. Ideally, the mate-pair libraries also span a range of insert sizes (e.g. 2, 5 and 10 kb). Once a genome has been assembled, it is often useful to generate gene model predictions to help identify features of interest. These gene models not only describe important gene features such as open reading frames, promoter positions and exon-intron boundaries; they are also often essential for downstream applications such as mutation detection and gene expression analysis.

For transcriptome assemblies, starting with high-quality, pure RNA is key to obtaining a robust snapshot of gene expression under the conditions being investigated. As part of our transcriptome assembly workflow, we can also annotate the resulting transcripts and/or carry out differential expression analysis if you have data from different cell types or treatments.

If you have PacBio data, we can help with that as well!

Typical outputs that can be delivered

  • Fasta file containing your assembly
  • Annotation and gene model prediction
  • Comparison to closely related species
  • Differential expression analysis
  • Assembly statistics (N50, L50, etc.)
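
As an illustration of the assembly statistics listed above, the following is a minimal sketch of how N50 and L50 are commonly defined: sort the contig or scaffold lengths from longest to shortest, then walk down the list until the running total reaches half of the assembly size. N50 is the length of the sequence at which that happens and L50 is how many sequences were needed. The function name `n50_l50` and the example lengths are made up for this sketch and do not describe the exact tooling used in our workflow.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    # Longest sequences first.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first sequence that takes the running total to at least
        # half of the assembly size defines N50 (its length) and
        # L50 (how many sequences were needed to get there).
        if running >= half_total:
            return length, count


# Hypothetical scaffold lengths, for illustration only.
example_lengths = [100, 80, 60, 40, 20]
n50, l50 = n50_l50(example_lengths)
print(f"N50 = {n50}, L50 = {l50}")  # N50 = 80, L50 = 2
```

A higher N50 together with a lower L50 generally indicates a more contiguous assembly, although neither number says anything about correctness.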

Frequently Asked Questions

For genome assembly, aim for at least 50x coverage of your genome (100x is better). It is difficult to make a general recommendation for transcriptome assemblies because of the dynamic range of expression (both the number of genes expressed and the level at which each gene is expressed). We recommend consulting the literature on similar studies to get a good estimate of the required coverage (we are happy to help you with this).
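
To make the genome coverage target concrete, here is a rough back-of-the-envelope sketch. It assumes the usual approximation that average coverage equals the number of reads multiplied by the read length, divided by the genome size, and it counts each mate of a pair as one read. The function names and the example genome size and read length are hypothetical; real projects should also allow for reads lost to quality filtering and contamination.

```python
import math


def expected_coverage(num_reads, read_length, genome_size):
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical example: a 500 Mb genome sequenced with 150 bp reads.
genome_size = 500_000_000
read_length = 150

for target in (50, 100):
    n = reads_needed(target, read_length, genome_size)
    print(f"{target}x coverage needs about {n:,} reads of {read_length} bp")
```

This estimate applies to genome assemblies only; it does not carry over to transcriptomes, where expression levels vary widely between genes.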

Mate-pair libraries are key to obtaining a complete assembly because they help the assembler link contigs and scaffolds that are separated by long, repeat-rich regions. This results in an assembly with fewer scaffolds and thus a larger N50.

Gene models are predicted open reading frames found in your assembled contigs and scaffolds. The quality of these predictions depends on a number of factors, including the availability of gene models from closely related organisms, supporting RNA-seq evidence, and manually annotated training sets.
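
To give a feel for what an open reading frame is, the sketch below scans the forward strand of a sequence in all three reading frames and reports stretches that begin with ATG and end at the first in-frame stop codon. This is a deliberately naive illustration, not how gene models are actually predicted: it ignores the reverse strand, introns and all of the supporting evidence mentioned above. The function name `find_orfs`, the `min_codons` parameter and the example sequence are assumptions made for this sketch.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    Coordinates are 0-based and the end is exclusive; each ORF runs
    from an ATG to the first in-frame stop codon, stop included.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, frame))
                start = None
    return orfs


# Hypothetical toy sequence, for illustration only.
toy = "CCATGAAATTTTAGCC"
for start, end, frame in find_orfs(toy):
    print(frame, start, end, toy[start:end])  # 2 2 14 ATGAAATTTTAG
```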