Elsevier

Journal of Theoretical Biology

Volume 243, Issue 2, 21 November 2006, Pages 273-278

PQN and DQN: Algorithms for expression microarrays

https://doi.org/10.1016/j.jtbi.2006.06.017

Abstract

An ideal expression algorithm should distinguish truly different expression levels with few false positives and be robust to assay changes. We propose two algorithms. PQN is the non-central trimmed mean of perfect match intensities with quantile normalization. DQN is the non-central trimmed mean of differences between perfect match and mismatch intensities with quantile normalization. The quantiles for normalization can be either empirical or theoretical. When array types and/or assays change within a study, normalization to common quantiles at the probe set level is essential. We compared DQN, PQN, RMA, GCRMA, DCHIP, PLIER and MAS5 on the Affymetrix Latin square data and on our own data from two sets of experiments that used the same bone marrow but different types of microarrays and different assays. We found that the computation of the AUC of ROC at affycomp.biostat.jhsph.edu can be improved.

Introduction

Bioinformatics plays an important role in biomedical research (Fodor et al., 1991; Chou, 2004). Oligonucleotide microarrays are widely used to detect differential RNA expression and genotypes or mutations in DNA (Lockhart et al., 1996; Hu et al., 2001; Kennedy et al., 2003; Di et al., 2005). The strengths of microarray technology are its high throughput and the small amount of sample material it requires. Its limitations include signal variation and a relatively narrow dynamic range compared with PCR-based sequencing technology. There is therefore constant interest in improving its sensitivity, specificity and reproducibility. Affymetrix Microarray Suite 5 (MAS5) provides signals for a single microarray (Hubbell et al., 2002). Li and Wong (2001a, 2001b) first proposed using multiple microarrays to obtain reliable estimates of expression signals. Irizarry et al. (2003a, 2003b) introduced robust multi-array analysis (RMA), based on median polish of perfect match probe data. Bolstad et al. (2003) suggested quantile normalization, which transforms the intensity distributions of all microarrays in a study to a common distribution for better comparative analysis. Wu et al. (2004) proposed using sequence information for background adjustment (GCRMA). Hubbell (2003) proposed the probe logarithmic intensity error estimation (PLIER) algorithm, based on minimization of a special error function, to reduce the bias at the low intensity end (Affymetrix, 2003).

MAS5 is based on the biweight estimation of the differences between perfect match (PM) intensities and mismatch (MM) intensities. Model-based approaches that use PM intensities, such as RMA, show smaller variance than MAS5 and perform better on some other measures. DCHIP and GCRMA can use either the PM intensities or the differences between PM and MM intensities.

With a simple descriptive statistic on PM intensities and quantile normalization at the probe set level, PQN yields an area under the curve (AUC) of the receiver operating characteristic (ROC) similar to that of RMA but smaller than that of GCRMA for the HG-U95A Latin square data set (Table 1), and it gives the largest AUC of ROC in the mid and high concentration ranges for the HG-U133A_tag Latin square data set (Table 2). Moreover, for data that include different types of microarrays and different assays, most PM–MM based algorithms, such as DQN, show smaller variations than the corresponding PM based algorithms.
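As an illustration of quantile normalization at the probe set level, the following is a minimal Python sketch, not the authors' implementation. It assumes the signals are held in a matrix with one row per probe set and one column per array. The function name `quantile_normalize` and the optional `target_quantiles` argument (standing in for a set of theoretical quantiles) are hypothetical names introduced here. When no target is supplied, the empirical target is taken to be the mean of the sorted columns, as in the quantile normalization commonly attributed to Bolstad et al. (2003).

```python
import numpy as np

def quantile_normalize(signals, target_quantiles=None):
    """Map every column (array) of `signals` onto one common distribution.

    signals: 2-D array, rows = probe sets, columns = microarrays.
    target_quantiles: optional 1-D array with one value per row, used as
        the common distribution (an assumed stand-in for "theoretical"
        quantiles). If None, the empirical target is the mean of the
        sorted columns.
    """
    x = np.asarray(signals, dtype=float)
    # Row index of the smallest, second smallest, ... value in each column.
    order = np.argsort(x, axis=0, kind="stable")
    if target_quantiles is None:
        target = np.sort(x, axis=0).mean(axis=1)
    else:
        target = np.sort(np.asarray(target_quantiles, dtype=float))
        if target.shape[0] != x.shape[0]:
            raise ValueError("need one target quantile per probe set")
    out = np.empty_like(x)
    # The k-th smallest value in each column is replaced by target[k].
    np.put_along_axis(out, order, target[:, None], axis=0)
    return out

# Toy example: three probe sets measured on two arrays.
toy_signals = np.array([[5.0, 4.0],
                        [2.0, 1.0],
                        [3.0, 6.0]])
normalized = quantile_normalize(toy_signals)
```

After normalization every column contains the same set of values, so arrays of different types or processed with different assays share one signal distribution. Ties within a column are broken by position in this sketch; a production implementation might average tied ranks instead.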

We also provide a new algorithm for computation of the AUC of ROC for the Latin square data.
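For orientation only, the sketch below computes the textbook AUC of ROC as the fraction of (positive, negative) pairs that a score ranks correctly. It is not the new algorithm proposed in this article, whose details are not given in this excerpt, and it is not the affycomp computation. The function name `roc_auc` and the toy scores are hypothetical; for Latin square data one might, as an assumption, treat spiked-in probe sets as positives and all other probe sets as negatives.

```python
import numpy as np

def roc_auc(scores_positive, scores_negative):
    """Textbook AUC: probability that a positive outscores a negative.

    Ties between a positive and a negative count as half a correct pair.
    """
    pos = np.asarray(scores_positive, dtype=float)
    neg = np.asarray(scores_negative, dtype=float)
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative score")
    # All pairwise differences between positive and negative scores.
    diff = pos[:, None] - neg[None, :]
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(correct / diff.size)

# Hypothetical scores, e.g. absolute log fold changes between two arrays.
auc = roc_auc([3.0, 2.0], [1.0, 2.0])  # 3 correct pairs + 1 tie out of 4
```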

Section snippets

Methods and results

PQN and DQN use a non-central trimmed mean, i.e. the mean of the values between the 40th percentile and the 90th percentile. This higher-end non-central trimmed mean can eliminate the influence of low intensity probes, for example probes selected improperly at the microarray design stage because of incorrect information in the public genomic databases. Excluding intensities above the 90th percentile helps reduce the influence of outliers due to cross hybridization or white image blemishes. We also tried
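The non-central trimmed mean described in this snippet can be sketched in Python as follows. This is a minimal illustration under stated assumptions: percentiles use NumPy's default linear interpolation, values equal to either cut-off are kept, and the function name `noncentral_trimmed_mean` and the example intensities are hypothetical. The article's exact percentile convention is not given in this excerpt.

```python
import numpy as np

def noncentral_trimmed_mean(values, lower_pct=40.0, upper_pct=90.0):
    """Mean of the values between two percentiles (cut-offs included)."""
    x = np.asarray(values, dtype=float)
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    kept = x[(x >= lo) & (x <= hi)]
    if kept.size == 0:
        # Can happen for very small probe sets with interpolated cut-offs.
        raise ValueError("no values fall between the requested percentiles")
    return float(kept.mean())

# Hypothetical PM intensities for one probe set; 9000 mimics a bright blemish.
probe_pm = [50, 60, 80, 120, 150, 170, 200, 220, 260, 300, 9000]
signal = noncentral_trimmed_mean(probe_pm)  # mean of 150..300, about 216.67
```

In this toy probe set the four dimmest probes and the single saturated outlier are all ignored, which is the behaviour the paragraph above motivates. For DQN, the same function would be applied to PM minus MM differences instead of PM intensities.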

Algorithms

Let PM_ik and MM_ik be the raw intensities of the ith PM cell and MM cell for probe set (gene) k. The raw intensity is a real number in the interval [1, I_max]. The intensity upper bound I_max depends on the scanner. For PQN signals, we define the background, B, as the trimmed mean of the lowest 2% of PM probe intensities. We also define the lower bound, L, as half of the standard deviation of the lowest 2% of PM intensities. They are very similar to MAS5 except that those for MAS5 are location
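The two quantities B and L defined in this snippet can be sketched in Python as follows. Several details are assumptions because the excerpt is cut off: the amount of trimming inside the "trimmed mean" is not stated, so the sketch uses the plain mean of the lowest 2%; the sample standard deviation (ddof=1) is assumed; and at least two values are always kept so that the standard deviation is defined. The function name `background_and_lower_bound` is hypothetical. How B and L enter the final PQN signal is not shown in this excerpt and is not reproduced here.

```python
import numpy as np

def background_and_lower_bound(pm_intensities, fraction=0.02):
    """Return (B, L) estimated from the lowest `fraction` of PM intensities.

    B: mean of the lowest intensities (assumed form of the trimmed mean).
    L: half of their sample standard deviation.
    """
    x = np.sort(np.asarray(pm_intensities, dtype=float).ravel())
    if x.size < 2:
        raise ValueError("need at least two PM intensities")
    # Number of probes in the lowest 2%, never fewer than two.
    n_low = max(2, int(np.ceil(fraction * x.size)))
    lowest = x[:n_low]
    background = float(lowest.mean())
    lower_bound = 0.5 * float(lowest.std(ddof=1))
    return background, lower_bound

# Hypothetical array with 100 PM intensities 1, 2, ..., 100.
B, L = background_and_lower_bound(np.arange(1, 101))
```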

Discussion

From Tables 1 and 2, we can see that for the Latin square data sets, where the assay and microarray type do not change, algorithms based only on PM probes (DCHIP, RMA, GCRMA and PQN) usually have larger AUC of ROC than algorithms based on PM–MM. However, Fig. 1 shows that for identical probes on different microarrays using different assays, the algorithms based on PM–MM can show smaller variations (DQNB vs. PQNB and GCRMAdQN vs. GCRMAQN). The transformed signal DQNBT is only used for comparison

Acknowledgments

We thank Anton Belooussov, Laurent Essioux, Grant Hillman, Walter Koch, Friedmann Krause, Aki Nakao, Sunhee K. Ro and Guido Steiner for helpful discussions. We thank the referees for their critiques and constructive suggestions.

References (19)

  • Affymetrix. Guide to Probe Logarithmic Intensity Error Estimation (PLIER) (2003)
  • B.M. Bolstad et al. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics (2003)
  • K.-C. Chou. Structural bioinformatics and its impact to biomedical science. Current Medicinal Chemistry (2004)
  • L.M. Cope et al. A benchmark for Affymetrix GeneChip expression measures. Bioinformatics (2004)
  • X. Di et al. Dynamic model based algorithms for screening and genotyping over 100 K SNPs on oligonucleotide microarrays. Bioinformatics (2005)
  • S.P. Fodor et al. Light-directed, spatially addressable parallel chemical synthesis. Science (1991)
  • D.M. Green et al. Signal Detection Theory and Psychophysics (1966)
  • G.K. Hu et al. Predicting splice variant from DNA chip expression data. Genome Res. (2001)
  • E. Hubbell. Some M-estimates for expression analysis. Affymetrix GeneChip microarray low-level workshop.... (2003)
