Chip Definition File
Background
In addition to the chip definition file (CDF) provided by Affymetrix, there are custom-made CDFs that redefine the aggregation of probes into probesets [Gautier et al., 2004, Dai et al., 2005, Ferrari et al., 2007, Lu et al., 2007]. Although more accurate, these CDFs are limited by a conservative generation process that ignores up to 30 percent of the probes. Moreover, only the CDFs from Dai et al. [2005] are actively maintained, whereas Gautier et al. [2004] propose a Bioconductor [Gentleman et al., 2004] package to build them.
Ebased CDFs
Here, we introduce our custom CDF, "Ebased", which retains all mappable probes to extend the information retrieved from an Affymetrix GeneChip® [Delhomme et al., submitted, 2012]. Briefly, using a short-read aligner (bowtie [Langmead et al., 2009] by default), the probe sequences are aligned against the cDNA and DNA references for the selected GeneChip® (e.g., the human references for the Affymetrix GeneChip® HG-U133A platform). The cDNA and DNA references are retrieved from the selected Ensembl [Flicek et al., 2011] version. In parallel, using biomaRt [Durinck et al., 2005], the gene and transcript information is retrieved from the same Ensembl version. All alignments allow a maximum of 2 mismatches, as suggested by the study of He et al. [2005], given that the probes are 25 bp long.
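To make the mismatch criterion concrete, the following is a minimal Python sketch of what "at most 2 mismatches over a 25 bp probe" means. It is not how bowtie works internally (bowtie uses an index rather than a linear scan), and it only looks at the forward strand. The function names `count_mismatches` and `find_hits`, and the `max_mismatches` parameter, are hypothetical names chosen for this illustration.

```python
from typing import List, Tuple


def count_mismatches(probe: str, window: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(probe) != len(window):
        raise ValueError("probe and window must have the same length")
    return sum(1 for a, b in zip(probe.upper(), window.upper()) if a != b)


def find_hits(probe: str, reference: str,
              max_mismatches: int = 2) -> List[Tuple[int, int]]:
    """Naively scan the forward strand of `reference` for probe hits.

    Returns (start, mismatches) pairs for every window that differs
    from the probe at no more than `max_mismatches` positions.
    """
    n = len(probe)
    hits = []
    for start in range(len(reference) - n + 1):
        mismatches = count_mismatches(probe, reference[start:start + n])
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits
```

A probe that yields exactly one hit would count as uniquely mapping in the sense used in the next section, while several hits would make it a multiple-mapping probe.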
Probeset generation
All this information is used to create probesets. First, probes that align only to genomic positions are extracted and used to create genomic probesets, in which two consecutive probes have to be separated by less than 1 kb. Probesets are first generated from uniquely mapping probes only, ignoring probes that map to multiple positions. If this is not possible, genomic_multiple probesets are created.
Next, probes aligning to genic regions, either exonic or intronic, are combined. Whenever possible, transcript-specific probesets are generated. If every reported transcript is mapped, gene probesets are created. Finally, if only some of the reported transcripts are mapped, partial probesets are returned. As for the genomic approach, only uniquely mapping probes are used unless there are not enough of them to generate a probeset, in which case a multiple tag is added. If it is impossible to assign all the probes either to a given set of transcripts or to the full gene, the probesets are tagged as dubious, indicating that a combination of transcripts is necessary to reach more than the minimal number of probes in the probeset.
Finally, two more kinds of probesets are generated using the same criteria: untranslated and antisense. Untranslated probesets consist of probes that map to an intron or UTR of a gene, in the same orientation as the gene. Antisense probesets consist of probes that map to the gene locus, but in the opposite orientation. The suffixes mentioned above are not mutually exclusive, except for the transcript and gene ones; e.g., one can get a transcript untranslated antisense multiple probeset.
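The grouping rules above can be sketched in Python as follows. This is only one possible reading of the text, not the customCDF implementation: the distance between consecutive probes is measured from start to start, strand is ignored, and the minimal number of probes (`min_probes`) is a placeholder value because the text does not state it. All function and parameter names are hypothetical.

```python
from typing import List, Set, Tuple

# (probe_id, chromosome, start); a simplified, hypothetical probe record
Probe = Tuple[str, str, int]


def group_genomic_probes(probes: List[Probe], max_gap: int = 1000,
                         min_probes: int = 3) -> List[List[Probe]]:
    """Chain probes on the same chromosome whose consecutive start
    positions are less than `max_gap` bases apart, then keep only the
    chains that hold at least `min_probes` probes."""
    groups: List[List[Probe]] = []
    current: List[Probe] = []
    for probe in sorted(probes, key=lambda p: (p[1], p[2])):
        if current:
            previous = current[-1]
            new_chromosome = probe[1] != previous[1]
            too_far = probe[2] - previous[2] >= max_gap
            if new_chromosome or too_far:
                groups.append(current)
                current = []
        current.append(probe)
    if current:
        groups.append(current)
    return [group for group in groups if len(group) >= min_probes]


def classify_genic_probeset(hit_transcripts: Set[str],
                            gene_transcripts: Set[str]) -> str:
    """Label a genic probeset from the transcripts its probes map to."""
    hit = hit_transcripts & gene_transcripts
    if not hit:
        raise ValueError("probeset maps to none of the gene's transcripts")
    if len(hit) == 1:
        return "transcript"  # transcript-specific whenever possible
    if hit == gene_transcripts:
        return "gene"        # every reported transcript is mapped
    return "partial"         # only some reported transcripts are mapped
```

Under these assumptions, probes starting at positions 100, 600 and 1500 on the same chromosome form a single genomic probeset, whereas a probe starting at 5000 is too far away to join it.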
Generated packages
The generated probesets and their annotation are then combined into three kinds of Bioconductor packages: 'cdf', '.db' and 'probes'. Once installed, these packages can be used for pre-processing and analyzing Affymetrix GeneChip® microarray data.
The customCDF package
To ease the use of these probesets, we developed another Bioconductor package, customCDF, which offers functionality to create the CDFs as well as to normalize data using them. The normalization function selects a given CDF version, checks for its availability and, if need be, downloads and/or installs it from a remote or local repository. The installed CDF is then used to normalize the data with the method of choice; currently expresso, rma, gcrma and vsn are available.
A repository example
The following is an example of a CDF repository that can be used by the customCDF package. It contains the latest CDF versions for the HG-U133A and HG-U133Plus2 Affymetrix GeneChip®. It also holds the former CDF version for the HG-U95Av2 Affymetrix GeneChip® that is used in the submitted manuscript [Delhomme et al., submitted, 2012].
References
Manhong Dai et al. Evolving gene/transcript definitions significantly alter the interpretation of genechip data. Nucleic Acids Research, Jan 2005.
Steffen Durinck et al. Biomart and bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics, Aug 2005.
Nicolas Delhomme et al. Ensembl based custom definition file for affymetrix GeneChip. Submitted, 2012.
Francesco Ferrari et al. Novel definition files for human genechips based on geneannot. BMC Bioinformatics, Jan 2007.
Paul Flicek et al. Ensembl 2011. Nucleic Acids Research, Jan 2011.
Laurent Gautier et al. Alternative mapping of probes to genes for affymetrix chips. BMC Bioinformatics, Aug 2004.
Robert C Gentleman et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology, Jan 2004.
Zhili He et al. Empirical establishment of oligonucleotide probe design criteria. Appl Environ Microbiol, Jul 2005.
Wolfgang Huber et al. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics, Jan 2002.
Ben Langmead et al. Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome Biology, Jan 2009.
Jun Lu et al. Transcript-based redefinition of grouped oligonucleotide probesets using aceview: high-resolution annotation for microarrays. BMC Bioinformatics, Jan 2007.
R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria,