"For centuries, medicine has been concerned with treating the sick. Today it has set itself the goal of preventing rather than curing."
Prof. Jean Dausset, Nobel Prize in Medicine, 1980
The Fondation Jean Dausset - Centre d'Etude du Polymorphisme Humain takes part in national and international research efforts to better determine the role of genetic polymorphism in humans, particularly in complex diseases, in order to better understand and diagnose them and to contribute to the development of personalized medicine.

HGDP-CEPH Collaborative data

The HGDP-CEPH database was completely refurbished to be compliant with the European General Data Protection Regulation (GDPR).

Data on the Global Screening Array (GSAMD-24v1-0_20011747_A1) chip are included in this new database.

Allelic frequencies are available for 2 million SNPs on the different HGDP-CEPH populations and groups of populations.

Despite careful testing and checking, a few bugs or errors may remain. Please send your comments to the HGDP-CEPH Program manager.

To comply with the General Data Protection Regulation (GDPR), the Fondation Jean Dausset - CEPH no longer provides individual genotypes from the HGDP-CEPH panel.

The Fondation Jean Dausset - CEPH now supplies genetic diversity information in terms of allelic frequencies for 52 populations and 7 population groups.
Allelic frequency computation was performed on the H952 panel individuals.
The Fondation Jean Dausset - CEPH provides information for more than 2 million genomic variations, representing more than 100 million allelic frequencies in 52 populations and 14 million in 7 population groups.
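As a minimal sketch of what an allelic frequency per population means here, the snippet below counts alleles in a list of diploid genotypes for one marker in one population. The function name, the two-letter genotype strings and the sample values are assumptions made for illustration; they do not describe the Foundation's actual pipeline or data format.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Return {allele: frequency} for one marker in one population.

    `genotypes` is a list of two-letter diploid genotypes such as "AG"
    (hypothetical encoding). Calls that are not two A/C/G/T letters
    are treated as missing and skipped.
    """
    counts = Counter()
    for genotype in genotypes:
        if len(genotype) == 2 and set(genotype) <= set("ACGT"):
            counts.update(genotype)  # adds one count per allele
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in sorted(counts.items())}

# Hypothetical genotypes for one SNP in one population ("--" is a missing call)
freqs = allele_frequencies(["AA", "AG", "GG", "AG", "--"])

# One frequency per variation and per population accounts for the totals above:
# 2,000,000 variations x 52 populations = 104,000,000 frequencies
# 2,000,000 variations x 7 groups       =  14,000,000 frequencies
```

With the sample input, four A alleles and four G alleles are counted, so each allele has a frequency of 0.5.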

This information can be accessed by variation identifier, by gene identifier, or by genomic position or interval (GRCh37 or GRCh38).

Data integration

Marker information is processed in order to:

  • Merge submissions while keeping each submitter's identifier and marker frequency
  • Check or set genomic positions for current genome builds GRCh37 and GRCh38 (original submissions referred to GRCh36 or GRCh37)
  • Unify polymorphism description and frequencies across submitters
  • Check marker identification in the latest dbSNP revision

The first step is to check the rs number (if provided by the submitter) in dbSNP:

  • The rs number is present in dbSNP => GRCh37 and GRCh38 genomic positions are checked/set from dbSNP.
  • The rs number is removed from dbSNP => no genomic position from dbSNP => use of UCSC liftOver to calculate GRCh37 and GRCh38 genomic positions, submitter's identifier remains unchanged.
  • The rs number is merged into another rs number => GRCh37 and GRCh38 genomic positions are set from dbSNP, rs number information is updated, submitter's identifier remains unchanged.
  • No rs number provided => use of UCSC liftOver => position and alleles are searched in dbSNP => if the marker is found in dbSNP, the rs number is added.
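The four cases above can be sketched as a single decision function. Everything in this example is an assumption made for illustration: dbSNP is modelled as a small in-memory dictionary, `merged` maps retired rs numbers to their replacements, and `lift_over` is a hypothetical stand-in for a call to UCSC liftOver. Returning no rs number for a removed entry is also a choice of this sketch, since the text does not say what happens to it. The submitter's identifier is never touched by the function, which matches the rule that it remains unchanged.

```python
def resolve_marker(rs, chrom, pos, alleles, dbsnp, merged, lift_over):
    """Return the rs number (or None), both build positions, and their source."""
    if rs is not None:
        current = merged.get(rs, rs)      # follow a merge to the current rs number
        record = dbsnp.get(current)
        if record is not None:            # present (possibly after a merge)
            return {"rs": current, "GRCh37": record["GRCh37"],
                    "GRCh38": record["GRCh38"], "source": "dbSNP"}
        # rs number removed from dbSNP: positions come from liftOver
        return {"rs": None, **lift_over(chrom, pos), "source": "liftOver"}
    # No rs number provided: lift over, then search dbSNP by position and alleles
    positions = lift_over(chrom, pos)
    found = None
    for candidate, record in dbsnp.items():
        if (record["GRCh38"] == positions["GRCh38"]
                and record["alleles"] == frozenset(alleles)):
            found = candidate
            break
    return {"rs": found, **positions, "source": "liftOver"}

# Toy data (hypothetical identifiers and coordinates)
dbsnp = {"rs100": {"GRCh37": ("1", 1000), "GRCh38": ("1", 1100),
                   "alleles": frozenset("AG")}}
merged = {"rs50": "rs100"}

def toy_lift_over(chrom, pos):
    # Pretend the submission was on GRCh37 and GRCh38 is shifted by 100 bp
    return {"GRCh37": (chrom, pos), "GRCh38": (chrom, pos + 100)}

present = resolve_marker("rs100", "1", 1000, "AG", dbsnp, merged, toy_lift_over)
was_merged = resolve_marker("rs50", "1", 1000, "AG", dbsnp, merged, toy_lift_over)
removed = resolve_marker("rs999", "1", 2000, "CT", dbsnp, merged, toy_lift_over)
no_rs = resolve_marker(None, "1", 1000, "GA", dbsnp, merged, toy_lift_over)
```

In the toy data, the marker submitted without an rs number is matched to `rs100` because its lifted position and its allele set agree with the dbSNP record.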

When allele coding for a marker differs from one submitter to another (commonly, an A/G polymorphism can be submitted as G/A, T/C or C/T), alleles are listed in alphabetical order (after complementation if needed) and frequencies are recomputed accordingly. For example:

Database allele coding    First allele frequency
Submitter's coding
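The recoding rule can be illustrated with a short sketch. The function name, the reference coding argument and the frequency values are assumptions made for this example, not values from the database. The sketch takes a submitter's coding and first-allele frequency, complements the alleles when they do not match the reference, then sorts them alphabetically and replaces the frequency by its complement to 1 when the order is swapped. Strand-ambiguous markers (A/T and C/G) cannot be resolved by complementation alone and are outside the scope of this sketch.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def normalize_coding(submitted, first_freq, reference):
    """Return (database coding, first allele frequency).

    `submitted` and `reference` are strings such as "G/A"; `first_freq`
    is the frequency of the first allele in the submitter's coding.
    """
    a1, a2 = submitted.split("/")
    ref = set(reference.split("/"))
    if {a1, a2} != ref:
        # Submitter reported the opposite strand: complement both alleles
        a1, a2 = COMPLEMENT[a1], COMPLEMENT[a2]
        if {a1, a2} != ref:
            raise ValueError(f"{submitted} does not match {reference}")
    if a1 > a2:
        # Put alleles in alphabetical order; the first allele changes,
        # so its frequency becomes 1 - f
        a1, a2 = a2, a1
        first_freq = 1.0 - first_freq
    return f"{a1}/{a2}", first_freq

# Hypothetical submissions for the same A/G marker
same = normalize_coding("A/G", 0.25, "A/G")
swapped = normalize_coding("G/A", 0.25, "A/G")
other_strand = normalize_coding("T/C", 0.25, "A/G")
both = normalize_coding("C/T", 0.25, "A/G")
```

All four submissions end up coded as A/G; the frequency is kept at 0.25 for A/G and T/C and becomes 0.75 for G/A and C/T.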

Database content

The HGDP-CEPH Database contains the datasets below:

Dataset 1 generated by 25 collaborators
Frequencies by population and population group from the former HGDP-CEPH genotypic database V3.

  • 15 billion genomic variations
  • 1 gene deletion and duplication polymorphism (CYP2D6)

Dataset 2 Stanford University
Genotypes for ~660,918 tag SNPs (Illumina HuHap 650k) on the autosomes, chromosomes X and Y, the pseudoautosomal region and mitochondrial DNA, typed across 1043 individuals from all panel populations (Li JZ et al. Science 319: 1100-4, 2008).

Dataset 3 Michigan University (UMich-NIH)
Genotypes (and CNV calls, supplement 5, below) for some 525,910 tag SNPs (Illumina HuHap550k), all of which are included in the HuHap 650k genotyping panel, typed across 485 HGDP-CEPH individuals from 29 populations (Jakobsson M et al. Nature 451: 998-1003, 2008).

Dataset 4 Max Planck Institute, Leipzig: MPI-EVA
Genotypes for 488,755 SNPs (Affymetrix GeneChip Human Mapping 500 K Array Set), typed across 255 individuals from all 52 HGDP-CEPH populations (5 samples per population) (López Herráez D et al. PLoS One. 2009 Nov 18;4(11):e7888). After merging the Affymetrix and Illumina (data supplement 1) non-overlapping datasets for 250 of these same individuals (no filters applied), we find genotypes for 939,383 unique SNPs.

Dataset 9 Max Planck Institute, Leipzig: MPI-EVA-Neandertal/aa-capture
Sequences from 50 different HGDP-CEPH populations, covering ~14,000 protein-coding human lineage positions (Burbano HA et al. Science 328: 723-725, 2010). Paired-end sequences, derived from bar-coded genomic libraries, were captured on a single microarray containing these positions. Experiment ERX004007 contains all the HGDP-CEPH sequences in fastq files from runs ERR011028-ERR011032 and can be downloaded from www.ebi.ac.uk/ena/data/view/ERX004007. For links between the sequence barcodes and the corresponding HGDP-CEPH identifiers, click on "View XML" at the top left of the web page.

Dataset 11 Harvard Genetic Department
Data from 629,443 SNPs that were obtained by genotyping 934 unrelated HGDP-CEPH individuals with the soon-to-be-released Affymetrix Axiom® Human Origins Array Plate, and merging the genotypes with data from Neandertal, Denisova and chimpanzee. The SNP data are divided among 14 partially overlapping datasets, 13 of which are of value for analysis of different population genetics scenarios. For each of datasets 1-12, SNPs were discovered as heterozygotes by whole-genome shotgun sequencing of a different HGDP-CEPH individual of known ancestry, as per Keinan et al. Nature Genetics 39: 1251-1255, 2007. Dataset 13 contains heterozygote SNPs for each of which a random Denisovan allele matches that of chimpanzee, and the random San Bushman allele is derived. Dataset 14, which is valuable for studying population structure using a maximum number of SNPs, does not allow demographic modeling. This dataset combines all SNPs along with an additional 87,044 SNPs chosen to allow haplotype inference at mitochondrial DNA and the Y chromosome, and to provide overlap with previous Affymetrix and Illumina genotyping arrays so that users can merge the data available here with previously published datasets. IMPORTANT: Read the detailed technical document before using these data in order to avoid pitfalls. The array was developed by David Reich and colleagues in collaboration with Affymetrix for the purpose of generating data with clearly documented ascertainment.

Dataset 15 Institute for Translational Genomics and Population Sciences Los Angeles Biomedical Research Institute at Harbor/UCLA Medical Center
Genotyping data (Illumina ImmunoChip) for 143,945 markers on 889 individuals from all 52 HGDP-CEPH panel populations.

Dataset 16 Institute of Clinical Pharmacology, University Medical Center Goettingen, Germany
Genotyping and sequencing data for 21 coding SNPs (amino acid substitutions) in the OCT1 gene, on 962 individuals from all HGDP-CEPH panel populations (Tina Seitz, Robert Stalmann, Nawar Dalila, Jiayin Chen, Sherin Pojar, Joao N. Dos Santos Pereira, Ralph Krätzner, Jürgen Brockmöller and Mladen V. Tzvetkov. Global genetic analyses reveal strong inter-ethnic variability in the loss of activity of the organic cation transporter OCT1. Genome Medicine 2015, 7:56. doi:10.1186/s13073-015-0172-0).

Dataset 20, CEPH/CNRGH Illumina chip GSAMD-24v1-0_20011747_A1
Genotyping of 1013 HGDP-CEPH individuals (11 duplicate pairs); 687,572 SNPs typed.

Datasets not included in the database

Dataset 1b generated by the Department of Genetics and Evolutionary Biology, Instituto de Biociências, Universidade de São Paulo, São Paulo, São Paulo, Brazil.
Inbreeding is observed in almost all populations of the HGDP-CEPH panel, at different levels and frequencies (PMID: 21364699).

Access to the published data.

Dataset 5 Washington University (UWash-NIH)
Calls for 6538 copy number variants (CNVs), size range 225-5,470,050 bp. These calls were ascertained in 883 unrelated HGDP-CEPH individuals (all panel populations) from SNP intensity data (data supplement 1), using rigorous statistical criteria and direct validation with CGH oligonucleotide arrays for CNV discovery on 12 panel individuals with 98 CNVs (Itsara A et al. Amer J Hum Genet 84: 148-161, 2009). This study is available at www.ebi.ac.uk/dgva/ or www.ncbi.nlm.nih.gov/dbvar, study number nstd27.

Dataset 6 Michigan University (UMich-NIH)
3436 CNV calls, size range (2,019-998,213 bp) for 438 individuals in 29 populations. These calls were based on SNP intensity data (data supplement 2) and quality thresholds of the PennCNV algorithm. This study is found at www.ebi.ac.uk or www.ncbi.nlm.nih.gov, study number nstd30.

Dataset 7 New Mexico University (UNM)
Sequences from the D-loop region of mitochondrial DNA for 1064 HGDP-CEPH individuals. The number of base pairs sequenced per individual ranges from 1021 to 1047 (average 1044.4, median 1045). These sequences are found at www.ncbi.nlm.nih.gov.

Dataset 8 Max Planck Institute, Leipzig: MPI-EVA-Neandertal/Denisova
Whole-genome, shotgun sequences at 4-6x coverage (Illumina GAII platform) for five HGDP-CEPH individuals, HGDP00778 (Han), HGDP00542 (Papuan), HGDP00927 (Yoruban), HGDP01029 (San) and HGDP00521 (French), as part of the Neandertal Genome Project (Green RE et al. Science 328: 710-722, 2010).
In addition, this data supplement contains whole-genome sequences at 1.3-1.9x coverage for seven HGDP-CEPH DNAs, from HGDP00456 (Mbuti Pygmy), 00998 (Karitiana), 00665 (Sardinian), 00491 (Melanesian from Bougainville Island), 00711 (Cambodian), 01224 (Mongola), and 00551 (Papuan), as part of the characterization of an archaic hominin from Denisova Cave, Siberia (Reich D et al. Nature 468: 1053-1060, 2010). Sequences from these seven HGDP-CEPH genomes are available from the NCBI SRA, www.ncbi.nlm.nih.gov/sra/?term=hgdp-ceph.

Dataset 10 Erasmus Forensic University, Rotterdam
Genotypes for 76 Y-STRs for HGDP-CEPH males. Genotypes are presented as repeat numbers. Descriptions of the Y-STRs used can be found in Kayser et al 2004 Am J Hum Genet 74: 1183-1197 and Ballantyne et al Am J Hum Genet 2010 87: 341-353. Genotyping procedures are as described in Vermeulen et al Forensic Sci Int Genet 2009 3: 205-213, and Ballantyne et al Forensic Science Int Genet 2011. doi:10.1016/j.fsigen.2011.04.017.

Dataset 12 Max Planck Institute, Leipzig: MPI-EVA-Denisova
High-coverage sequences of 10 HGDP-CEPH genomes: HGDP00456-Mbuti Pygmy (24.3x), HGDP00521-French (26.7x), 00542-Papuan (25.9x), 00665-Sardinian (24.7x), 00778-Han (27.7x), 00927-Yoruba (32.1x), 00998-Karitiana (26.0x), 01029-San (32.7x), 01284-Mandenka (24.5x), and 01307-Dai (28.3x) (rounded averages).
These sequences were determined for comparison with the genome of an archaic Denisovan individual (Meyer, M. Science 338:222-6, 2012).
The raw human sequences and alignments to hg19 are available in BAM format from http://cdna.eva.mpg.de/denisova/BAM/human/. The BAM files may be analyzed with sequence toolkits such as SAMtools and Picard.

Dataset 13 Children's Hospital Oakland Research Institute, Oakland, CA
Genotypic data on presence/absence information for 16 genes at the Killer Immunoglobulin-like Receptor (KIR) locus obtained on 976 HGDP-CEPH individuals (Hollenbach et al 2012 Immunogenetics 64: 719-737).

Dataset 14 Max Planck Institute, Leipzig: MPI-EVA
Sequencing data for 500 kb of chromosome Y in 632 males from the HGDP-CEPH panel. 2228 SNPs were identified.
Lippold S. et al. 2014.

Dataset 17 Genetics Department Harvard Medical School, Boston, Massachusetts 02115, USA.
Next-generation sequencing of 300 individuals (including 132 individuals from the HGDP-CEPH panel), yielding high-quality genomes that include at least 5.8 million base pairs not present in the human reference genome
(The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature. 2016 Sep 21. doi: 10.1038/nature18964).

Dataset access (Harvard Medical School)

Dataset 18 Unit of Forensic Genetics, Centre universitaire romand de médecine légale, Lausanne, Switzerland
Genotyping data on a set of DIP-STR markers. These markers are phased haplotypes comprising one Indel (DIP) and a closely located STR. Moriot, A., Santos, C., Freire-Aradas, A. et al. Inferring biogeographic ancestry with compound markers of slow and fast evolving polymorphisms. Eur J Hum Genet 26, 1697-1707 (2018).

Dataset 19, Wellcome Sanger Institute, Hinxton CB10 1SA, UK.
Sequencing data for 929 individuals from the HGDP-CEPH panel.
EBI data access
