The diploid genome sequence of a Chinese individual

From PGI

Revision as of 07:18, 24 November 2008 by J (talk | contribs)

Accession number: ERA000005

Eight single-end and two paired-end libraries.

Sequenced on Illumina Genome Analysers (GA).

The read lengths averaged 35 base pairs 

The two paired-end libraries had span sizes of 135 bp and 440 bp.

3.3 billion reads 

117.7 gigabases (Gb) of sequence. 72 Gb from single-end reads and 45.7 Gb from paired-end reads.
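The yield figures above can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the original analysis; the function name `yield_gb` is made up for illustration. It shows that 3.3 billion reads at exactly 35 bp would give slightly less than the reported total, which is consistent with 35 bp being a rounded average.

```python
# Figures taken from the notes above.
TOTAL_READS = 3.3e9          # 3.3 billion reads
MEAN_READ_LENGTH_BP = 35     # rounded average read length
REPORTED_TOTAL_GB = 117.7
SINGLE_END_GB = 72.0
PAIRED_END_GB = 45.7


def yield_gb(n_reads: float, mean_length_bp: float) -> float:
    """Total sequence yield in gigabases (hypothetical helper)."""
    return n_reads * mean_length_bp / 1e9


# Yield if every read were exactly 35 bp long.
estimated_gb = yield_gb(TOTAL_READS, MEAN_READ_LENGTH_BP)

# Mean read length implied by the reported total and read count.
implied_mean_length_bp = REPORTED_TOTAL_GB * 1e9 / TOTAL_READS

print(f"estimated yield at 35 bp: {estimated_gb:.1f} Gb")
print(f"implied mean read length: {implied_mean_length_bp:.2f} bp")
print(f"single-end + paired-end:  {SINGLE_END_GB + PAIRED_END_GB:.1f} Gb")
```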

Program used: Short Oligonucleotide Alignment Program (SOAP)

102.9 Gb of sequence (87.4% of all data) was properly aligned to the NCBI human reference genome (build 36.1; hereafter called NCBI36). 

36-fold average coverage of NCBI36

The effective genome coverage of the single- and paired-end sequencing was 22.5-fold and 13.5-fold, respectively.
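The alignment and coverage figures are related by "average depth = aligned bases / reference length". The sketch below is only a consistency check under one assumption: the effective reference length is not stated in these notes, so it is back-calculated from 102.9 Gb at 36-fold coverage. The helper name `fold_coverage` is hypothetical.

```python
# Figures taken from the notes above (gigabases and fold coverage).
TOTAL_GB = 117.7
ALIGNED_GB = 102.9
REPORTED_DEPTH = 36.0
SINGLE_END_DEPTH = 22.5
PAIRED_END_DEPTH = 13.5

# ASSUMPTION: effective reference length, not given in the notes.
# It is derived here from the reported aligned bases and depth.
implied_reference_gb = ALIGNED_GB / REPORTED_DEPTH


def fold_coverage(aligned_gb: float, reference_gb: float) -> float:
    """Average depth as aligned bases divided by reference length."""
    return aligned_gb / reference_gb


aligned_percent = 100.0 * ALIGNED_GB / TOTAL_GB
combined_depth = SINGLE_END_DEPTH + PAIRED_END_DEPTH

print(f"aligned share of all data: {aligned_percent:.1f}%")
print(f"implied reference length:  {implied_reference_gb:.2f} Gb")
print(f"single + paired depth:     {combined_depth:.1f}-fold")
```

The implied reference length of roughly 2.86 Gb is a derived value, not a number reported in the notes.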

99.97% of NCBI36 (excluding Ns, which are undetermined sequence in the reference genome) was covered by at least one uniquely or repeatedly aligned read. Uniquely aligned reads had only one best hit on NCBI36; repeatedly aligned reads had multiple possible alignments (see the paper's Methods for details).

Average per-nucleotide difference of 1.45% from the NCBI36 sequence

YH genome consensus sequence covered 92% of the NCBI36 sequence (92.6% of the autosomes; 83.1% of the sex chromosomes).

The remaining 8% of the reference sequence was composed of either repetitive sequence (6.6%) that did not have any uniquely mapped reads or sequence that did not pass the authors' filtering steps (1.4%).

SNPs
3.07 million SNPs.

2.26 million (73.5%) of the YH SNPs were present in dbSNP as validated SNPs, and 0.4 million (12.9%) were present as non-validated SNPs.
The remaining 0.42 million SNPs (13.6%) were novel.
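The SNP breakdown can be recomputed from the rounded counts. This is a rough sketch with a hypothetical helper, `percentage_breakdown`. Because the counts in the notes are rounded to 0.01 million, the recomputed shares differ from the quoted percentages by a few tenths of a percent.

```python
# Counts in millions, as quoted in the notes above (rounded values).
TOTAL_SNPS_MILLION = 3.07
snp_counts_million = {
    "dbSNP validated": 2.26,
    "dbSNP non-validated": 0.40,
    "novel": 0.42,
}


def percentage_breakdown(counts: dict, total: float) -> dict:
    """Share of each category as a percentage of the total."""
    return {name: 100.0 * n / total for name, n in counts.items()}


shares = percentage_breakdown(snp_counts_million, TOTAL_SNPS_MILLION)

# Novel share implied by the two quoted dbSNP percentages.
novel_percent_from_quoted = 100.0 - 73.5 - 12.9

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
print(f"novel share from quoted percentages: {novel_percent_from_quoted:.1f}%")
```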

The authors also genotyped the individual using the Illumina 1M BeadChip.

They used polymerase chain reaction (PCR) amplification and traditional Sanger sequencing on a subset of the inconsistent SNPs and small indels to determine whether these conformed to the genotyping results or to the GA sequencing results.

The use of paired-end reads as opposed to single-end reads further reduces SNP calling errors. Of note, SNP calling errors of homozygous and heterozygous SNPs differ significantly.

SNPs unique to each of the three sequenced genomes: YH, 978,370 (31.8%); Venter, 924,333 (30.1%); Watson, 1,096,873 (33.0%).

Indels

For indel identification, the authors required at least 3 pairs of reads to define an indel. They only considered gapped alignments of paired-end reads with insertion or deletion sizes of 3 bp or less, to avoid creating alignment errors. Confining indel size was necessary to obtain the best detection accuracy given the short-read sequencing strategy. From this analysis, they identified a total of 135,262 indels.
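The two thresholds described above (at least 3 supporting read pairs, indel size of 3 bp or less) can be expressed as a simple filter. The sketch below is illustrative only: the `IndelCandidate` record, its field names, and the sign convention for `size_bp` are assumptions, not the data model used in the paper.

```python
from dataclasses import dataclass

MIN_SUPPORTING_PAIRS = 3  # from the notes: at least 3 pairs of reads
MAX_INDEL_SIZE_BP = 3     # from the notes: indels of 3 bp or less


@dataclass(frozen=True)
class IndelCandidate:
    """Hypothetical record for one candidate indel."""
    chrom: str
    position: int
    size_bp: int           # assumed convention: >0 insertion, <0 deletion
    supporting_pairs: int  # paired-end reads with this gapped alignment


def passes_filter(candidate: IndelCandidate) -> bool:
    """Apply the size and read-support thresholds."""
    size = abs(candidate.size_bp)
    return (1 <= size <= MAX_INDEL_SIZE_BP
            and candidate.supporting_pairs >= MIN_SUPPORTING_PAIRS)


def filter_indels(candidates):
    """Return only the candidates that meet both thresholds."""
    return [c for c in candidates if passes_filter(c)]


# Made-up example candidates, for illustration only.
examples = [
    IndelCandidate("chr12", 1000, 2, 5),    # kept
    IndelCandidate("chr12", 2000, -3, 3),   # kept (3 bp deletion)
    IndelCandidate("chr12", 3000, 4, 10),   # rejected: too long
    IndelCandidate("chr12", 4000, -1, 2),   # rejected: too few pairs
]
kept = filter_indels(examples)
print([c.position for c in kept])
```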

Most (59.1%) of the indels were novel 


Depth effect on genome sequencing
To determine what sequencing depth provides the best genome coverage and lowest SNP-calling error rates for a diploid human genome, they randomly extracted subsets of reads with different average depths from all the mapped reads on chromosome 12, which has a relatively moderate number of repeats. 
SNPs were identified using GA sequencing and then compared with the genotyping data. 
They applied the same filtering steps as used in SNP identification.
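Random extraction of reads at a lower average depth can be sketched as follows. The notes do not say how the subsets were drawn, so this example assumes each read is kept independently with probability target depth / current depth. The function name `subsample_reads` and the fixed seed are illustrative choices.

```python
import random


def subsample_reads(reads, current_depth, target_depth, seed=0):
    """Randomly keep reads so the expected depth drops to target_depth.

    ASSUMPTION: each read is kept independently with probability
    target_depth / current_depth; the paper's procedure may differ.
    """
    if not 0 < target_depth <= current_depth:
        raise ValueError("target_depth must be in (0, current_depth]")
    keep_probability = target_depth / current_depth
    rng = random.Random(seed)  # seeded for reproducible subsets
    return [read for read in reads if rng.random() < keep_probability]


# Stand-in read identifiers; real input would be mapped reads on chr12.
all_reads = list(range(100_000))
subset = subsample_reads(all_reads, current_depth=36.0, target_depth=9.0)
print(f"kept {len(subset) / len(all_reads):.1%} of reads")
```

Repeating the call with several target depths, then running SNP calling on each subset, gives the depth series that the comparison against genotyping data relies on.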

At a depth greater than 10-fold, the assembled consensus covered 83.63% of the NCBI reference genome using single-end reads and 95.88% using paired-end reads. Thus, sequencing depth beyond this point provides only a small increase in genome coverage.


Evolution and selection
Alleles that are identical between two random individuals are more likely to be the most common type of allele in the population

Ancestry
The YH individual was estimated to share alleles (and thus ancestry) as follows: 94.12% with the Asian, 4.12% with the European and 1.76% with the African populations.

The effective Chinese population size is about 5,700.

The effective human population size is given as 3,300.

