JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 11, Number 1, 2004
© Mary Ann Liebert, Inc.
Pp. 1–14

Determination of Local Statistical Significance of Patterns in Markov Sequences with Application to Promoter Element Identification

HAIYAN HUANG,1,2,3 MING-CHIH J. KAO,1,3 XIANGHONG ZHOU,1 JUN S. LIU,1,2 and WING H. WONG1,2

ABSTRACT

High-level eukaryotic genomes present a particular challenge to the computational identification of transcription factor binding sites (TFBSs) because of their long noncoding regions and large numbers of repeat elements. This is evidenced by the noisy results generated by most current methods. In this paper, we present a p-value-based scoring scheme using probability generating functions to evaluate the statistical significance of potential TFBSs. Furthermore, we introduce the local genomic context into the model so that candidate sites are evaluated based both on their similarities to known binding sites and on their contrasts against their respective local genomic contexts. We demonstrate that our approach is advantageous in the prediction of myogenin and MEF2 binding sites in the human genome. We also apply LMM to large-scale human binding site sequences in situ and find that, compared to current popular methods, LMM analysis can reduce false positive errors by more than 50% without compromising sensitivity. This improvement will be of importance to any subsequent algorithm that aims to detect regulatory modules based on known PSSMs.

Key words: probability generating function, statistical significance, local genomic context, Position Specific Score Matrix (PSSM), transcription factor binding site.

INTRODUCTION

The elucidation of gene function, genetic network, and cellular processes requires the accurate identification of transcription factor binding sites (TFBSs). Experimental approaches, such as DNase footprinting (Galas and Schmitz, 1978) and gel mobility shift assay (Fried and Crothers, 1981; Garner and Revzin, 1981), are in general expensive and time consuming. Given the large number of transcription factors and the vast spans of noncoding genomic regions onto which they may bind, molecular characterization of transcription mechanisms will be facilitated by the prediction of transcription factor binding sites in silico.

1 Department of Biostatistics, Harvard University, 655 Huntington Avenue, Boston, MA 02115.
2 Department of Statistics, Harvard University, 1 Oxford Street, Cambridge, MA 02138.
3 These authors contributed equally to this work.



Efforts on the computational prediction of TFBSs fall into two general approaches. The first seeks novel recurrent patterns in a set of DNA sequences, often the promoters of genes found to be coregulated in gene expression microarray experiments. A number of statistical models have been developed in the past decade for this purpose, based on Bayesian models and Monte Carlo methods (Bailey and Elkan, 1994; Hughes et al., 2000; Lawrence et al., 1993; Lawrence and Reilly, 1990; Liu et al., 2001; Liu et al., 2002; Roth et al., 1998). They have been widely applied and found to be most successful in lower organisms such as bacteria and yeast. However, in higher organisms such as the human, these methods may yield noisy results because of the long noncoding regions and the large numbers of nonfunctional repeat elements (Lander et al., 2001). A recent trend to improve upon these de novo methods is to incorporate the information from cross-species comparisons.

The other major approach to predicting transcription factor binding sites makes use of prior knowledge of the binding sites. These methods evaluate individual candidate site sequences by their similarities to clusters of experimentally determined binding sites (Chen et al., 1995; Hertz et al., 1990; Quandt et al., 1995; Stormo and Hartzell, 1989; Wingender et al., 2000). These binding site sequences are most often summarized using position-specific scoring matrices (PSSMs), which capture the sequence patterns and are compared against candidate DNA segments. This is the approach of interest in this paper.

Various methods exist to score candidate segments for their similarities to known binding sites using PSSMs. We provide an example in Fig. 1 using the transcription factor myogenin. PSSM construction begins by using the alignment of known binding site sequences and tabulating the nucleotide distribution matrix (Fig. 1a). The counts are then transformed using either of two related schemes, log-odds (Fig. 1b) or entropy (Fig. 1c), to generate the PSSM. Candidate sites are scored against the PSSMs by summing over the corresponding scores of the nucleotides across the site sequence; i.e., the score of candidate site $S = S_1 \cdots S_p$ against the PSSM $(w_{ij})_{p \times 4}$ is $\sum_{\text{position } i} w_{i S_i}$. In practice, these scores are then compared to some predetermined cutoff values to generate computational TFBS predictions. Note that the most widely used database of transcription factor binding, TRANSFAC (Wingender et al., 2000), is based on entropy-weighted PSSMs.
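The construction and scoring steps described above can be sketched in Python as follows. This is only an illustration: the function names, the uniform background frequencies, and the pseudocount of 1 are assumptions made for the sketch and are not specified in the text. The three example sites are the myogenin sites A, C, and B listed in Table 2.

```python
import math

BASES = "ACGT"

def log_odds_pssm(sites, background=None, pseudocount=1.0):
    """Build a log-odds PSSM (one dict per position) from aligned sites.

    The uniform background and the pseudocount are assumptions of this sketch.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    pssm = []
    for i in range(len(sites[0])):
        # Tabulate the nucleotide counts at position i (plus pseudocounts).
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        # Log of (observed frequency / background frequency) for each base.
        pssm.append({b: math.log(counts[b] / total / background[b]) for b in BASES})
    return pssm

def score(pssm, site):
    """Sum the position-specific weights w[i][S_i] across the site."""
    return sum(column[base] for column, base in zip(pssm, site))

sites = ["AGCAGGTG", "GACAGGTG", "ACCAGCTG"]
pssm = log_odds_pssm(sites)
print(score(pssm, "AGCAGGTG"), score(pssm, "TTTTTTTT"))
```

A site drawn from the alignment scores higher than an unrelated sequence, which is the property that cutoff-based prediction relies on.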

While probabilities are used in the construction of the PSSMs, the scores themselves cannot be interpreted statistically. This has led to the general difficulty with choosing the score cutoff values for each matrix, a problem that may have contributed to the large numbers of false positive predictions seen in practice. We propose a p-value-based scoring scheme which evaluates the statistical significance of the candidate site segment. This should apply to both the entropy-based and the log-odds-based scoring methods. However, in order to obtain a valid p-value, one needs to model the background sequence properly, which may serve either as the "null model" or as a component in computing the log-odds scoring function.

In this paper, we model the background sequences, or the "null distribution," as a Markov chain. As in previous methods, candidate binding site sequences are scored by PSSMs. Each score is evaluated statistically by computing its p-value, that is, the probability that the background model can achieve a score at least as high as that observed. In order to calculate this p-value, we develop an efficient and exact algorithm based on probability-generating functions that can achieve up to a 1,000-fold speedup compared to Monte Carlo simulations. We note that, in contrast to score-based evaluation, the p-values we generate can serve as a universal measure of statistical significance of all candidate binding sites, regardless of their corresponding binding factor or of their genomic locations.

It has been known that the effectiveness of a binding site in recruiting its corresponding transcription factors can be dramatically affected by the genomic context that it is in. This can be attributed to a number of factors, such as the local DNA bending, the accessibility of the binding site, or the positive or negative effects of neighboring TFBSs. We incorporate the local genomic context into the p-value-based scoring method and develop the Local Markov Method (LMM). The p-value for a candidate site provides a measure of its similarity to known binding sites and its contrast against the local genomic context. We first show that the incorporation of the local genomic context can be advantageous in the prediction of myogenin and MEF2 binding sites in the human genome, an advantage observed independently of the method of PSSM construction. We further compare the abilities of LMM and TRANSFAC to pick up 101 experimentally determined TFBSs from large tracts of human genomic sequences and find that LMM can identify TFBSs with more specificity (50% fewer false positive predictions) without compromising sensitivity. The LMM software is available upon request (www.biostat.harvard.edu/complab/LMM).


FIG. 1. The construction of a position-specific scoring matrix for myogenin binding sites.

MAIN RESULTS

p-value calculation

In order to calculate the probability that a given Markov background model can achieve a score at least as high as the observed score of the candidate site, we extend a previous method designed for a similar purpose but applicable only to the independent and identically distributed (i.i.d.) background sequence model (Staden, 1989). The key part of this method is the reformulation of the distribution of the score as a probability-generating function, which leads to an efficient algorithm for its computation. We formulate the score probability-generating function under Markov models (detailed in the Detailed Methods section) and derive an algorithm with time complexity linear in the length of the PSSM, a dramatic improvement over the naive enumeration method, which has time complexity exponential in the length of the PSSM.


FIG. 2. TRANSFAC vs. local Markov model (LMM) in the identification of transcription factor binding sites (TFBSs) in a given genomic sequence. (a) TRANSFAC scans a genomic sequence, generates similarity scores of each subsequence against a given PSSM, and uses three matrix-specific cutoffs, FN, FP, and SUM, to make putative calls. The three sets of cutoffs attempt to minimize false-negative error, false-positive error, or the sum of these two errors, respectively. (b) LMM begins by selecting the top 0.1% candidate sites based on their PSSM similarity scores, since sites with low similarity scores are unlikely to be true binding sites. For each candidate TFBS, LMM models the DNA sequence segment of length L (e.g., 1,000) centered around the target site as a homogeneous Markov chain of order k = 0, 1, 2, or 3. Under the estimated Markov model, LMM calculates the probability distribution of the similarity score using our algorithm. This distribution then allows us to assign statistical significance to the given candidate TFBS.

We implement the algorithm in C and incorporate the program into the local Markov method (LMM) program to study TFBSs in situ by evaluating each candidate binding site with respect to its local genomic context. A summary of the LMM method is in Fig. 2b. For comparison, we also describe in Fig. 2a the prediction program which accompanies the TRANSFAC database.
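The local background estimation step of LMM, in which a Markov chain is fitted to the segment centered on a candidate site, might look roughly like the following Python sketch for the first-order case. It is not the authors' C implementation. The function names are hypothetical, and the pseudocount is an assumption added here to avoid zero-probability transitions in short windows; the paper itself reports maximum likelihood estimation.

```python
from collections import Counter

BASES = "ACGT"

def local_window(sequence, site_start, site_len, window=1000):
    """Return the segment of about `window` bases centered on the candidate site."""
    center = site_start + site_len // 2
    left = max(0, center - window // 2)
    return sequence[left:left + window]

def first_order_transitions(segment, pseudocount=1.0):
    """Estimate P[a][b] = Pr(next = b | current = a) from dinucleotide counts.

    The pseudocount is an assumption of this sketch, not part of the paper.
    """
    pairs = Counter(zip(segment, segment[1:]))
    matrix = {}
    for a in BASES:
        row_total = sum(pairs[(a, b)] + pseudocount for b in BASES)
        matrix[a] = {b: (pairs[(a, b)] + pseudocount) / row_total for b in BASES}
    return matrix

# Toy usage: transitions estimated from a short alternating sequence.
P = first_order_transitions("ACACACAC")
print(P["A"]["C"])
```

Each row of the estimated matrix sums to one, so it can be used directly as the transition matrix of the local null model.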

To assess the efficiency of our algorithm, we compare it against Monte Carlo simulations. Our experiments showed that our approach can be many times more efficient than Monte Carlo simulation. For example, at p ≤ 0.0001, a sequence of length at least 10^9 base pairs needs to be simulated in order to obtain a sufficiently accurate cutoff value (relative error ≤ 1%), which needs more than 1,000-fold more computing time than our exact algorithm (Table 1).
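For orientation, the Monte Carlo alternative can be sketched as follows: simulate a long sequence from the background chain, score every window against the PSSM, and read the cutoff off the empirical score distribution. This is an illustrative Python sketch under assumed names and a first-order chain, not the simulation code used for Table 1. The need to score on the order of 10^9 windows for small p is what makes this approach slow.

```python
import random

BASES = "ACGT"

def simulate_markov1(n, trans, stationary, rng):
    """Simulate n bases from a first-order Markov chain started at stationarity."""
    seq = rng.choices(BASES, weights=[stationary[b] for b in BASES])
    while len(seq) < n:
        row = trans[seq[-1]]
        seq.append(rng.choices(BASES, weights=[row[b] for b in BASES])[0])
    return "".join(seq)

def empirical_cutoff(pssm, sequence, p):
    """Score every window and return the score at the empirical upper p quantile."""
    width = len(pssm)
    scores = sorted(
        (sum(col[b] for col, b in zip(pssm, sequence[i:i + width]))
         for i in range(len(sequence) - width + 1)),
        reverse=True,
    )
    k = max(1, int(p * len(scores)))  # number of windows allowed at or above the cutoff
    return scores[k - 1]

# Toy usage with a hypothetical 2-column integer PSSM and a uniform chain.
toy = [{"A": 2, "C": 0, "G": 0, "T": 1}, {"A": 0, "C": 3, "G": 1, "T": 0}]
uniform = {a: {b: 0.25 for b in BASES} for a in BASES}
pi = {b: 0.25 for b in BASES}
sample = simulate_markov1(10000, uniform, pi, random.Random(0))
print(empirical_cutoff(toy, sample, 0.01))
```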

Table 1. Running Time Comparison of LMM with Monte Carlo Simulation^a

                            Significance at p ≤ 0.001    Significance at p ≤ 0.0001
                            Relative error (%)           Relative error (%)
Simulation     Time (sec)   Mean      SD                 Mean      SD
N = 10^5       0.06         6.5       4.3                24.0      16.9
N = 10^6       0.6          1.7       1.2                7.7       4.5
N = 10^7       5.7          1.0       0.7                2.6       1.7
N = 10^8       57           1.0       0.5                2.0       0.8
N = 10^9       570          0.7       0.6                0.6       0.4
N = 2^32       1228         0.6       0.6                0.6       0.4
LMM            0.43         0                            0

^a For 10 randomly chosen intergenic regions, we estimate a 2nd-order Markov model for each sequence using maximum likelihood estimation. Under the 10 estimated Markov models, we use our algorithm to derive, and Monte Carlo simulation to estimate, the score cutoffs of the p53 PSSM at two significance levels, p ≤ 0.001 and p ≤ 0.0001, on a 1500 MHz AMD Athlon machine running Linux. To assess the difference between the cutoffs $C_p$ derived by our algorithm and the cutoffs $\hat{C}_p$ estimated by simulations, we consider the p-value $F(\hat{C}_p)$ attained by $\hat{C}_p$ and the true p-value $F(C_p)$ of $C_p$, derived using our algorithm. We assess the relative errors of the simulation estimate by calculating $[F(C_p) - F(\hat{C}_p)] / F(C_p)$.

APPLICATIONS

MYL1 3′ enhancer myogenin binding site prediction

In human, mouse, and rat, there is a well-conserved 200 bp-long skeletal muscle-specific enhancer about 24 kb 3′ of MYL1 (Rosenthal et al., 1990; Wentworth et al., 1991). Three myogenic determination factor binding sites, A, B, and C, are found in this region, which are located 1267 bp, 1323 bp, and 1339 bp, respectively, downstream of the last exon of MYL1 in the human genome. Sites A and B are myogenin/myf4 binding sites (Rosenthal et al., 1990), while site C is a MyoD binding site (Wentworth et al., 1991), also considered to be a myogenin binding site (Fickett, 1996).

We applied the LMM to the 10,000 bp MYL1 downstream region (starting from the end of the last exon) to derive the local p-values for each candidate. The local p-value for each candidate is the statistical significance of observing its score (derived by both log-odds and entropy-related PSSMs), assuming that it is generated under a local random model, where Markov models of different orders (e.g., 0, 1, or 2) are used, with parameters estimated from the local 1,000 bp genomic sequence centered at the candidate. The top 10 score candidate sites derived using log-odds or entropy-weighted PSSMs are listed in Tables 2a and 2b, respectively, with the true sites A, B, and C labeled and shaded in gray, along with their local p-values. We find the PSSM scores to be a less sensitive measure than the local p-value: the true sites A, B, and C stand out under the local p-values, while they are not as distinct from the false predictions under the PSSM scores.

Table 2. Incorporating Local Sequence Information to Transcription Factor Binding Site Prediction Using Two Types of PSSMs for Myogenin in the Human MYL1 3′ Enhancer (a, b) or for MEF2 in the Human Phosphoglycerate Mutase Promoter (c, d)^a

(a) Using log-odds myogenin PSSMs

Position (bp from     Log-odds      p-values of observed score under local background model
last exon of MYL1)    PSSM score    i.i.d.      1st Markov†   2nd Markov    Binding site
1267 (A)              556           0.000008    0.000017      0.000030      AGCAGGTG
1339 (C)              550           0.000015    0.000027      0.000055      GACAGGTG
1323 (B)              548           0.000033    0.000057      0.000112      ACCAGCTG
5434                  556           0.000036    0.000074      0.000095      AGCAGCTG
2463                  550           0.000059    0.000135      0.000179      GCCAGCTG
1235                  531           0.000212    0.000354      0.000442      ACCATGTG
926                   534           0.000181    0.000363      0.000468      TGCAGGTG
2574                  536           0.000225    0.000416      0.000421      GGCAGATG
783                   531           0.000274    0.000453      0.000537      AACATCTG
470                   529           0.000404    0.000624      0.000731      GGAAGCTG

(b) Using entropy-weighted myogenin PSSM

Position (bp from     TRANSFAC      p-values of observed score under local background model
last exon of MYL1)    score         i.i.d.      1st Markov†   2nd Markov    Binding site
1267 (A)              4667          0.000008    0.000017      0.00003       AGCAGGTG
1339 (C)              4628          0.000018    0.000032      0.000059      GACAGGTG
5434                  4667          0.000036    0.000074      0.000095      AGCAGCTG
1323 (B)              4581          0.000045    0.000077      0.000127      ACCAGCTG
2463                  4628          0.000068    0.000152      0.000194      GCCAGCTG
2574                  4596          0.000073    0.000177      0.000191      GGCAGATG
926                   4463          0.000224    0.000414      0.000532      TGCAGGTG
7534                  4377          0.000378    0.000534      0.000788      TACAGCTG
7156                  4377          0.000346    0.00054       0.000686      CCCAGCTG
4895                  4322          0.000829    0.001998      0.002045      CTCAGGTG

(c) Using log-odds MEF2 PSSMs

Position (bp from     Log-odds      p-values of observed score under local background model
last exon of MYL1)    PSSM score    i.i.d.      1st Markov†   2nd Markov    Binding site
−2970                 669           0.000199    0.000160      0.000228      ATTTTAAATA
−3115                 671           0.000209    0.000183      0.000243      GTTATAAATA
−161                  649           0.000355    0.000183      0.000322      ATTTTAAGCA
−2939                 668           0.000233    0.000190      0.000266      TGTTTAAATC
−3151                 663           0.000807    0.000655      0.000747      TGTTTAAGAA
−4767                 656           0.000951    0.001009      0.001712      TTTTTATATA
−3433                 649           0.003940    0.003099      0.003383      AAACTAAAAA
−3566                 644           0.005710    0.004913      0.005231      TTTTTAAAGC
−3214                 643           0.007155    0.005654      0.006459      AGTTTATATC
−3577                 641           0.007363    0.006312      0.006625      GGTTTAACAT

(d) Using entropy-weighted MEF2 PSSM

Position (bp from     TRANSFAC      p-values of observed score under local background model
last exon of MYL1)    score         i.i.d.      1st Markov†   2nd Markov    Binding site
−2970                 591           0.000103    0.000082      0.000133      ATTTTAAATA
−161                  550           0.000191    0.000101      0.000174      ATTTTAAGCA
−3115                 590           0.000206    0.000181      0.000243      GTTATAAATA
−2939                 554           0.001001    0.000807      0.001016      TGTTTAAATC
−3151                 562           0.001091    0.000914      0.001083      TGTTTAAGAA
−4700                 531           0.001451    0.001451      0.001451      TTGTTAAAGA
−3566                 543           0.002961    0.002532      0.002676      TTTTTAAAGC
−3433                 545           0.003271    0.002610      0.002788      AAACTAAAAA
−4444                 532           0.003080    0.003200      0.004320      CATATAATTA
−3687                 535           0.003761    0.003320      0.003671      GAAGTAAAGA

^a Sorted in increasing order by the column marked with †.

PGAM-M MEF2 binding site prediction

A major positive regulatory element is required for the muscle-specific expression of the muscle-specific subunit of the human phosphoglycerate mutase (PGAM-M) gene (Nakatsuji et al., 1992). This element, located 161 bp upstream of the gene, is found to be bound by the transcription factor MEF-2.

We applied the LMM to the 5,000 bp PGAM-M upstream region using the MEF2_Q6 PSSM to derive the local p-values for each candidate. The top 10 score candidate sites derived using log-odds or entropy-weighted PSSMs are listed in Tables 2c and 2d, respectively, with the true site labeled and shaded in gray, along with their local p-values. We find that LMM behaves here much as it does in the MYL1 enhancer.

Overall, from Table 2, we see that by taking into account the local sequence composition, we have reordered the candidate sequences in a way that is favorable to the true binding sites.

LARGE-SCALE VALIDATION

In order to evaluate the performance of LMM and to compare our local p-values to PSSM similarity scores, we apply both LMM and TRANSFAC to 101 known binding sites in the human genome, obtained by mapping binding sites in the TRANSFAC database onto the human genome. We recorded and evaluated the extent to which LMM and TRANSFAC can capture this large collection of known binding sites in the human genome and the amount of noise generated in so doing.

In Figure 3a, the tradeoff between sensitivity and noise is shown in terms of the proportion of the known binding sites detected and the amount of concomitant noise generated. Noise is measured by the noise-to-signal ratio, which is defined as the number of binding site calls not known to be correct divided by the number of known binding sites found. For comparison, we show the tradeoffs achieved by TRANSFAC using its three matrix-specific similarity score cutoffs (FN, SUM, FP), along with that achieved by LMM under Markov models of orders 0, 1, 2, and 3 at various p-value cutoffs, starting at the stringent p = 0.00001. From the inset graph, we see that at all levels of sensitivity, LMM outperformed TRANSFAC by producing significantly less noise. While the performance of LMM comes close to that of TRANSFAC as the p-value cutoff increases, by then, for both methods, the advantage of increased sensitivity has been nullified by the high level of accompanying noise, rendering them impractical. Overall, not only is the sensitivity of LMM comparable to that of TRANSFAC, its noise-to-signal ratio is also vastly superior. It should be noted that, since only a limited number of true binding sites are known, not every unsupported binding site prediction is necessarily a false-positive prediction. Thus, the noise-to-signal ratio overestimates the true noise level, especially when stringent criteria are used to generate putative TFBSs with high sequence similarity to known binding sites. As the criteria relax, the large numbers of predictions over and above the known binding sites imply a high level of true background noise.

FIG. 3. Large-scale validation of TRANSFAC and LMM: tradeoff between sensitivity and noise. (a) We compared the abilities of the two methods to detect the 101 known binding sites in the human genome by looking at their sensitivity and noise-to-signal ratio. The balance of the tradeoff between these two measures achieved at various significance levels by LMM is traced and compared to that attained by TRANSFAC. The inset graph shows the performance of LMM and TRANSFAC across all levels of sensitivity. (b) Detailed results for p = 0.00001 and 0.0002.
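The two validation measures used here, sensitivity and the noise-to-signal ratio, can be computed as in the short Python sketch below. The function name and the representation of sites as hashable identifiers (for example, genomic positions) are assumptions made for illustration.

```python
def sensitivity_and_noise(predicted, known):
    """Return (sensitivity, noise-to-signal ratio) for a set of site calls.

    sensitivity     = known sites found / all known sites
    noise-to-signal = calls not known to be correct / known sites found
    """
    predicted, known = set(predicted), set(known)
    found = predicted & known
    sensitivity = len(found) / len(known)
    unsupported = len(predicted - known)
    noise_to_signal = unsupported / len(found) if found else float("inf")
    return sensitivity, noise_to_signal

# Toy usage: 5 calls, 2 of which hit the 4 known sites.
print(sensitivity_and_noise({1, 2, 3, 4, 5}, {1, 2, 9, 10}))
```

Because unsupported calls are not necessarily false positives, the ratio computed this way is an upper bound on the true noise level, as noted in the text.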

More detailed results for TRANSFAC using the three cutoffs and for LMM using different significance cutoffs, 0.00001 and 0.0002, and under different Markov models are summarized in Fig. 3b. While the FN cutoff missed relatively few known binding sites, it generated more than 45 false-positive predictions for every accurate binding site call. On the other hand, FP made fewer false positives, but it detected only one in nine known binding sites. The SUM cutoff, designed as a balance of these inherent tradeoffs, did strike a reasonable compromise, having generated about nine false positives for every real binding site and detected more than half of the known sites.

At the stringent significance cutoff p = 0.00001, LMM detected about twice as many binding sites as the FP cutoff and on average produced about 60% fewer false-positive predictions for every correct prediction. At the more relaxed p-value cutoff p = 0.0002, the sensitivity of LMM is comparable to that of the SUM cutoff, while only half of the noise is generated. The binding sites that were detected by LMM at p ≤ 0.0002 but missed by TRANSFAC using the SUM cutoff include a MEF2 binding site over the desmin gene, an ATF1 (activating transcription factor 1) binding site over the TGFβ2 gene, a HIF (hypoxia-inducible factor) binding site over the VEGF gene, and an ICSBP (IFN consensus sequence binding protein) binding site over the OAS1 gene. We choose p = 0.0002 as the general significance cutoff for the application of LMM to mammalian genomic sequences, a cutoff with a sufficiently high sensitivity and an acceptable amount of noise. Overall, the LMM provides an advantageous tradeoff between noise-to-signal ratio and sensitivity.

In our validation experiment, we found that Markov models of orders 1, 2, and 3 have better combinations of high sensitivity and low noise than the i.i.d. model, confirming an earlier observation (Liu et al., 2001) that Markov models can better capture the structure of biological sequences. In addition, we compared the performance of the second-order LMM against an analogous global Markov model, with parameters estimated from a large collection of upstream regions, in order to assess the ability of LMM to model the local sequence context information. We found that over the 101 known human TFBSs in situ, LMM generally outperforms the global Markov model, while they behave similarly at high and low sensitivity levels (Fig. 4). At high sensitivity levels, the lax p-value cutoffs produce large numbers of putative TFBS calls, overwhelming the advantage enjoyed by LMM. At low sensitivity levels, the stringent p-value cutoffs yield only putative TFBSs with undeniable sequence similarity to known binding sites. Thus, the noise-to-signal ratio may not reflect the true noise level in this region.

FIG. 4. The use of local sequence context is advantageous. The performance of the second-order LMM is compared against an analogous global Markov model with parameters estimated from a large collection of upstream regions. The performance is assessed in terms of the noise-to-signal ratio and sensitivity. At the recommended p-value cutoff, 0.0002, LMM is more sensitive and less noisy.

DISCUSSION

The work presented in this paper attempts to identify TFBSs by considering simultaneously both their similarity to the query PSSM and their differences from the local genomic context. Through the study of the human TFBSs in TRANSFAC, we show that LMM, which makes putative TFBS calls using local p-values, yields a much better false-positive to true-positive ratio than using the TRANSFAC or log-odds scores alone.

It has been known that neighboring nucleotide compositions can affect the interaction between a transcription factor and its binding site. To the best of our knowledge, however, there is no documented study on whether, and by how much, PSSM-based TFBS detection can be improved using a local background model. The result we present, which is based on more than 100 experimentally determined TFBS sequences in the human genome, shows a clear overall advantage for incorporating the local sequence context into PSSM-based TFBS search. There are various biological mechanisms that can explain this effect, which may lead to more complicated and more specific models. For instance, it may be that the local 1,000 bp genomic region does not contain DNA sequences similar to the true binding site because otherwise the target transcription factor may be competed away from its biologically meaningful binding site.

While this improvement does not in itself render a solution to the much more difficult problem of detecting regulatory modules, by significantly reducing false-positive calls for single sites, the local p-value approach will contribute substantially to any subsequent algorithms aiming to detect combinatorial regulatory modules. The method we developed here is seen as a proof of principle and can be used as a component of a more complex approach. For example, considering that clusters of binding sites also often occur within small regions of about 200 bp to cooperatively recruit the transcription factors, a natural future development of LMM would be to take this distance effect into the background estimation and combine the LMM p-values of a few candidate PSSM sites. Many challenging problems in computational biology, e.g., translation initiation site identification, splice site recognition, and RNA secondary structure prediction, can be modeled in terms of the recognition of motifs. Our work may be adapted and extended to these problems as well. However, it should be noted that when applied to protein sequences, which are composed of a 20-letter alphabet, the performance of our algorithm may become an issue, especially when the order of the Markov chain, k, is large.

DETAILED METHODS

Data extraction for large-scale validation

To evaluate the performance of the LMM, we apply it to known TFBSs in the human genome. Known binding sites are extracted from the SITE table of the TRANSFAC database, version 6.2. About half of the 12,262 binding sites in this table are experimentally derived from various species. The rest are generated from in vitro binding assays on artificial nucleotide sequences. Since LMM studies binding sites with respect to their genomic contexts, these artificial sequences, which do not correspond to any genomic region, cannot be used for our validation study. Of the 6,073 in vivo binding sites, 1,425 sites are based on the human genome. Of these, 149 (10.5%) are annotated with a corresponding PSSM. We use these binding sites for validation.


To locate the known TFBSs in the human genome, we focus on the 5,000 bp upstream sequences of all genes. We made use of the annotations provided by Ensembl (Hubbard et al., 2002) and extracted 22,808 human gene promoters from the human genome assembly NCBI golden path 29 (www.ensembl.org/Homo_sapiens). Since heuristic sequence-mapping algorithms do not perform well on short sequences such as TFBSs, we use an exact-match algorithm based on suffix trees (Gusfield, 1997). We found that many binding site sequences are precisely mapped onto the promoters of the correct target genes. For those binding sites with mappings onto multiple promoters or with no mapping, we attempted to retrieve them by manual review. To find the correct one among multiple mappings, we made correspondences between the Ensembl gene name and the target gene name of the binding site as recorded by TRANSFAC. A review of some missed matches using inexact match algorithms revealed a small number of single-basepair differences between the recorded binding site sequences and the promoter sequences of the target genes, for example, the binding site HS$ALBU_06 over the human albumin promoter. After validating against the primary literature for the positions of these binding sites, we included these mappings as well. In total, we located 101 human TFBSs.

Local p-value calculation

Although the exact score distribution can be obtained by enumerating all possible binding site sequences under any "null" model for the observed nucleotide base pairs, the computational cost for a PSSM of length $p$ is $4^p$. Staden's method (Staden, 1989), which turns this into an order-$p$ computation, is based on the PGF of the score under the simple null model that the base pairs are independent and identically distributed (i.i.d.). Recently, however, there is some evidence suggesting that Markov background models work better than the i.i.d. model for detecting TFBSs (Liu et al., 2002). By extending Staden's PGF method to dependent random variables, we present here the derivation of the PGFs under a first-order Markov model, the basis of the efficient algorithm for computing the exact score distribution.
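To make the $4^p$ cost concrete, the naive enumeration under a first-order Markov null model can be written as the following Python sketch. The function name and the toy integer-valued PSSM are hypothetical. The first base is drawn from the stationary distribution, matching the sequence probability $f_{D_1} f_{D_2|D_1} \cdots$ used later in the proof.

```python
from itertools import product

BASES = "ACGT"

def enumerate_score_distribution(pssm, trans, stationary):
    """Exact score distribution by brute force over all 4**p sequences."""
    dist = {}
    for seq in product(BASES, repeat=len(pssm)):
        # Probability of the sequence under the first-order Markov chain.
        prob = stationary[seq[0]]
        for a, b in zip(seq, seq[1:]):
            prob *= trans[a][b]
        s = sum(col[b] for col, b in zip(pssm, seq))
        dist[s] = dist.get(s, 0.0) + prob
    return dist

# Toy usage with a hypothetical 2-column integer PSSM and a uniform chain.
toy = [{"A": 2, "C": 0, "G": 0, "T": 1}, {"A": 0, "C": 3, "G": 1, "T": 0}]
uniform = {a: {b: 0.25 for b in BASES} for a in BASES}
pi = {b: 0.25 for b in BASES}
dist = enumerate_score_distribution(toy, uniform, pi)
print(sorted(dist.items()))
```

The loop visits every one of the $4^p$ sequences, so it is usable only for very short matrices, which motivates the PGF formulation.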

Probability generating function derivation

In our study, we make use of the PSSMs constructed by TRANSFAC version 6.2. Given a PSSM $m = (w_{ij})_{p \times 4}$, where $i = 1, \ldots, p$ and $j = A, C, G, T$, the match score $S$ and the similarity score $S/S_{\max}$ of a sequence $D_1 D_2 \cdots D_p$ are defined as (Quandt et al., 1995)

\[
S = \sum_{i=1}^{p} w_{i D_i}
\quad \text{and} \quad
S/S_{\max} = \sum_{i=1}^{p} w_{i D_i} \Big/ \sum_{i=1}^{p} \max_j \{ w_{ij} \}.
\]

Let $S$ be a random variable taking integer values; then its probability generating function $G(t)$ is the expected value of $t^S$. $G(t)$ is a polynomial, and the coefficient of the term $t^n$ is the probability of the event $S = n$ (Gut, 1995).

Given a PSSM $m$ of length $p$, under the assumption that the DNA sequence is iid, Staden provided the PGF of the match score in the form of a product of $p$ polynomials (Staden, 1989): $G(t) = \prod_{i=1}^{p} \sum_{j = A,C,G,T} f_j t^{w_{ij}}$, where $f_j$ is the frequency of letter $j$ in the iid DNA sequence. For the first-order Markov case ($k = 1$), let the transition matrix be $P = (f_{\alpha|\beta})_{4 \times 4}$ and the stationary distribution of the Markov chain be $\pi$ (viewed as a four-dimensional row vector). Then the PGF under the first-order Markov model is

$$G(t) = \pi \prod_{i=1}^{p} \big( P M_i(t) \big) \, I, \qquad (*)$$

where $M_i(t) = \mathrm{Diag}(t^{w_{iA}}, t^{w_{iC}}, t^{w_{iG}}, t^{w_{iT}})$ and $I = (1, 1, 1, 1)^T$ (proof provided at the end of this section).
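One way to read equation (*) computationally is to carry, for each "previous base", a polynomial in $t$ stored as a dictionary from score to probability, and to multiply through one position at a time. The sketch below does this in Python. It assumes integer PSSM entries, a row-stochastic `P` whose row index is the preceding base, and a `pi` that is stationary for `P`. It illustrates the idea only and is not the authors' implementation.

```python
from collections import defaultdict


def markov_score_distribution(pssm, P, pi):
    """Exact distribution of the match score under a first-order Markov chain.

    Tracks the coefficients of pi * prod_i (P * M_i(t)) * I as dictionaries
    {score: probability}, one per current base (order A, C, G, T).
    """
    vec = [{0: pi[a]} for a in range(4)]          # the row vector pi
    for row in pssm:                              # multiply by P * M_i(t)
        new = [defaultdict(float) for _ in range(4)]
        for a in range(4):
            for score, prob in vec[a].items():
                for b in range(4):
                    new[b][score + row[b]] += prob * P[a][b]
        vec = new
    dist = defaultdict(float)                     # multiply by I = (1, 1, 1, 1)^T
    for poly in vec:
        for score, prob in poly.items():
            dist[score] += prob
    return dict(dist)


# The iid model is the special case in which every row of P equals pi.
uniform = [0.25] * 4
toy_pssm = [[1, 0, 0, 0], [1, 0, 0, 0]]           # hypothetical: score = number of A's
dist = markov_score_distribution(toy_pssm, [uniform] * 4, uniform)
```

Each position touches at most 4 x 4 x (number of distinct scores) entries, so the work grows linearly with the PSSM length rather than as $4^p$.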

Since a Markov chain of order $k$ on a set $\Omega$ is equivalently a first-order Markov chain on the set $\Omega^k$, with a little modification of $M_i(t)$ we can generalize the above results to $k > 1$. An example of the PGF for $k = 3$ is in the online supplement (www.biostat.harvard.edu/complab/LMM). Using this representation for the PGF, we developed and implemented an algorithm in C++ to calculate the exact score distribution. Generally, for a $k$th-order Markov chain and a PSSM of length $p$, the time complexity of our algorithm is $O(4^k \cdot S_{\max} \cdot p)$, linear in the matrix length but exponential in the order of the Markov chain. The source code is available upon request (www.biostat.hsph.harvard.edu/LMM).
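The reduction from order $k$ to order 1 can be sketched as follows: each state of the lifted chain is a $k$-mer, and a transition is only possible to the $k$-mer obtained by dropping the first base and appending a new one. The function name and the uniform toy probabilities are hypothetical. The sketch shows only the state-space construction, not the modified $M_i(t)$.

```python
from itertools import product

BASES = "ACGT"


def lift_to_first_order(cond_prob, k):
    """Rewrite an order-k chain on {A,C,G,T} as a first-order chain on k-mers.

    cond_prob[(context, base)] is P(base | preceding k bases = context).
    Returns the list of 4**k states and a sparse transition dictionary
    keyed by (from_state, to_state).
    """
    states = ["".join(s) for s in product(BASES, repeat=k)]
    trans = {}
    for state in states:
        for base in BASES:
            # Shift the window: drop the oldest base, append the new one.
            trans[(state, state[1:] + base)] = cond_prob[(state, base)]
    return states, trans


# Toy order-2 model with uniform conditional probabilities (hypothetical).
k = 2
cond = {("".join(c), b): 0.25 for c in product(BASES, repeat=k) for b in BASES}
states, trans = lift_to_first_order(cond, k)
```

The number of states is $4^k$, which is where the exponential dependence on the order in the stated time complexity comes from.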

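Proof of equation (*). For ease of notation and without loss of generality, we let $p$, the length of the PSSM, be 3.

For a DNA sequence $D_1 D_2 D_3$, its match score against PSSM $m$ is $w_{1D_1} + w_{2D_2} + w_{3D_3}$, and the probability of the occurrence of $D_1 D_2 D_3$ is $f_{D_1} f_{D_2|D_1} f_{D_3|D_2}$. By definition, the PGF of the match score against $m$ is

$$\sum_{D_1 D_2 D_3} f_{D_1} f_{D_2|D_1} f_{D_3|D_2} \, t^{w_{1D_1} + w_{2D_2} + w_{3D_3}} = \sum_{D_1 D_2 D_3} f_{D_1} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}.$$

In the following, we derive the PGF in the alternative form of a product of $p$ matrices. First,

$$\begin{aligned}
&\sum_{D_1 D_2 D_3} f_{D_1} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \sum_{D_1 D_2 D_3} \sum_{a} f_a \cdot f_{D_1|a} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \sum_{a} f_a \cdot \sum_{D_1 D_2 D_3} f_{D_1|a} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = (f_A, f_C, f_G, f_T) \cdot \Big( \sum_{D_1 D_2 D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_1 D_2 D_3} f_{D_1|T} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T.
\end{aligned}$$

For the component of the second vector corresponding to base A,

$$\begin{aligned}
&\sum_{D_1 D_2 D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \sum_{D_1} f_{D_1|A} t^{w_{1D_1}} \cdot \sum_{D_2 D_3} f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = (f_{A|A} t^{w_{1A}}, f_{C|A} t^{w_{1C}}, f_{G|A} t^{w_{1G}}, f_{T|A} t^{w_{1T}}) \cdot \Big( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T.
\end{aligned}$$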

We apply similar arguments to the components corresponding to bases C, G, and T, and obtain, for each $a \in \{C, G, T\}$,

$$\begin{aligned}
&\sum_{D_1 D_2 D_3} f_{D_1|a} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = (f_{A|a} t^{w_{1A}}, f_{C|a} t^{w_{1C}}, f_{G|a} t^{w_{1G}}, f_{T|a} t^{w_{1T}}) \cdot \Big( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T.
\end{aligned}$$

Therefore, for the first position, we have

$$\begin{aligned}
&\Big( \sum_{D_1 D_2 D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_1 D_2 D_3} f_{D_1|T} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T \\
&\quad = \begin{pmatrix}
f_{A|A} t^{w_{1A}} & f_{C|A} t^{w_{1C}} & f_{G|A} t^{w_{1G}} & f_{T|A} t^{w_{1T}} \\
f_{A|C} t^{w_{1A}} & f_{C|C} t^{w_{1C}} & f_{G|C} t^{w_{1G}} & f_{T|C} t^{w_{1T}} \\
f_{A|G} t^{w_{1A}} & f_{C|G} t^{w_{1C}} & f_{G|G} t^{w_{1G}} & f_{T|G} t^{w_{1T}} \\
f_{A|T} t^{w_{1A}} & f_{C|T} t^{w_{1C}} & f_{G|T} t^{w_{1G}} & f_{T|T} t^{w_{1T}}
\end{pmatrix}
\cdot
\begin{pmatrix}
\sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|C} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|G} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix} \\
&\quad = P \cdot \mathrm{Diag}(t^{w_{1A}}, t^{w_{1C}}, t^{w_{1G}}, t^{w_{1T}}) \cdot \Big( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T \\
&\quad = P \cdot M_1(t) \cdot \Big( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T.
\end{aligned}$$

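Further applying the above arguments to positions 2 and 3, we have

$$\begin{aligned}
&\Big( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T \\
&\quad = P \cdot \mathrm{Diag}(t^{w_{2A}}, t^{w_{2C}}, t^{w_{2G}}, t^{w_{2T}}) \cdot \Big( \sum_{D_3} f_{D_3|A} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_3} f_{D_3|T} t^{w_{3D_3}} \Big)^T \\
&\quad = P \cdot \mathrm{Diag}(t^{w_{2A}}, t^{w_{2C}}, t^{w_{2G}}, t^{w_{2T}}) \cdot P \cdot \mathrm{Diag}(t^{w_{3A}}, t^{w_{3C}}, t^{w_{3G}}, t^{w_{3T}}) \cdot (1, 1, 1, 1)^T \\
&\quad = P \cdot M_2(t) \cdot P \cdot M_3(t) \cdot I.
\end{aligned}$$

Putting these together, $G(t) = \pi \prod_{i=1}^{p} \big( P M_i(t) \big) I$.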
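The identity can also be checked numerically for $p = 3$. The sketch below evaluates the PGF at a fixed value of $t$ in two ways: by brute-force enumeration of all $4^3$ sequences, and by the matrix product in equation (*). The transition matrix is a hypothetical circulant (hence doubly stochastic) matrix, chosen so that the uniform distribution is stationary. The PSSM weights are also made up.

```python
from itertools import product


def pgf_by_enumeration(pssm, P, pi, t):
    """Sum of probability * t**score over every sequence of length len(pssm)."""
    total = 0.0
    for seq in product(range(4), repeat=len(pssm)):
        prob = pi[seq[0]]
        for prev, cur in zip(seq, seq[1:]):
            prob *= P[prev][cur]
        score = sum(row[base] for row, base in zip(pssm, seq))
        total += prob * t ** score
    return total


def pgf_by_matrix_product(pssm, P, pi, t):
    """Evaluate pi * prod_i (P * M_i(t)) * I at a numeric t."""
    vec = list(pi)
    for row in pssm:
        # vec <- vec * P * Diag(t**w_iA, ..., t**w_iT)
        vec = [sum(vec[a] * P[a][b] for a in range(4)) * t ** row[b]
               for b in range(4)]
    return sum(vec)  # right-multiplication by I = (1, 1, 1, 1)^T


# Hypothetical circulant transition matrix; its stationary distribution is uniform.
P = [[0.4, 0.3, 0.2, 0.1],
     [0.1, 0.4, 0.3, 0.2],
     [0.2, 0.1, 0.4, 0.3],
     [0.3, 0.2, 0.1, 0.4]]
pi = [0.25, 0.25, 0.25, 0.25]
pssm = [[2, 0, 1, 0], [0, 3, 0, 1], [1, 0, 0, 2]]

lhs = pgf_by_enumeration(pssm, P, pi, 2.0)
rhs = pgf_by_matrix_product(pssm, P, pi, 2.0)
```

The two values agree only because `pi` is stationary for `P`, which is the property the first step of the proof relies on when the sum over $a$ is introduced.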

ACKNOWLEDGMENTS

The work of H.H., X.Z., and W.H.W. is supported by NSF grants DBI0196176 and DMS-0090166. The work of H.H. and J.S.L. is supported by NSF grant DMS-0204674 and NIH grant R01 HG02518-01. The work of M.-C.J.K. is supported by the Howard Hughes Medical Institute predoctoral fellowship.

REFERENCES

Bailey, T.L., and Elkan, C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2, 28–36.

Chen, Q.K., Hertz, G.Z., and Stormo, G.D. 1995. MATRIX SEARCH 1.0: A computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput. Appl. Biosci. 11, 563–566.

Durbin, R., Eddy, S.R., Krogh, A., and Mitchison, G. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, Cambridge, UK.

Fickett, J.W. 1996. Coordinate positioning of MEF2 and myogenin binding sites. Gene 172, GC19–32.

Fried, M., and Crothers, D.M. 1981. Equilibria and kinetics of lac repressor-operator interactions by polyacrylamide gel electrophoresis. Nucl. Acids Res. 9, 6505–6525.

Galas, D.J., and Schmitz, A. 1978. DNAse footprinting: A simple method for the detection of protein–DNA binding specificity. Nucl. Acids Res. 5, 3157–3170.

Garner, M.M., and Revzin, A. 1981. A gel electrophoresis method for quantifying the binding of proteins to specific DNA regions: Application to components of the Escherichia coli lactose operon regulatory system. Nucl. Acids Res. 9, 3047–3060.

Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, England.

Gut, A. 1995. An Intermediate Course in Probability, Springer-Verlag, New York.

Hertz, G.Z., Hartzell 3rd, G.W., and Stormo, G.D. 1990. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci. 6, 81–92.

Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T., Durbin, R., Eyras, E., Gilbert, J., Hammond, M., Huminiecki, L., Kasprzyk, A., Lehvaslaiho, H., Lijnzaad, P., Melsopp, C., Mongin, E., Pettett, R., Pocock, M., Potter, S., Rust, A., Schmidt, E., Searle, S., Slater, G., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Stupka, E., Ureta-Vidal, A., Vastrik, I., and Clamp, M. 2002. The Ensembl genome database project. Nucl. Acids Res. 30, 38–41.

Hughes, J.D., Estep, P.W., Tavazoie, S., and Church, G.M. 2000. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol. 296, 1205–1214.

Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. 2001. Initial sequencing and analysis of the human genome. Nature 409, 860–921.

Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., and Wootton, J.C. 1993. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262, 208–214.

Lawrence, C.E., and Reilly, A.A. 1990. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 7, 41–51.

Liu, X., Brutlag, D.L., and Liu, J.S. 2001. BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput., 127–138.

Liu, X.S., Brutlag, D.L., and Liu, J.S. 2002. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol.

Nakatsuji, Y., Hidaka, K., Tsujino, S., Yamamoto, Y., Mukai, T., Yanagihara, T., Kishimoto, T., and Sakoda, S. 1992. A single MEF-2 site is a major positive regulatory element required for transcription of the muscle-specific subunit of the human phosphoglycerate mutase gene in skeletal and cardiac muscle cells. Mol. Cell. Biol. 12, 4384–4390.

Quandt, K., Frech, K., Karas, H., Wingender, E., and Werner, T. 1995. MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucl. Acids Res. 23, 4878–4884.

Rosenthal, N., Berglund, E.B., Wentworth, B.M., Donoghue, M., Winter, B., Bober, E., Braun, T., and Arnold, H.H. 1990. A highly conserved enhancer downstream of the human MLC1/3 locus is a target for multiple myogenic determination factors. Nucl. Acids Res. 18, 6239–6246.

Roth, F.P., Hughes, J.D., Estep, P.W., and Church, G.M. 1998. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol. 16, 939–945.

Staden, R. 1989. Methods for calculating the probabilities of finding patterns in sequences. Comput. Appl. Biosci. 5, 89–96.

Stormo, G.D., and Hartzell 3rd, G.W. 1989. Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl. Acad. Sci. USA 86, 1183–1187.

Wentworth, B.M., Donoghue, M., Engert, J.C., Berglund, E.B., and Rosenthal, N. 1991. Paired MyoD-binding sites regulate myosin light chain gene expression. Proc. Natl. Acad. Sci. USA 88, 1242–1246.

Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Pruss, M., Reuter, I., and Schacherer, F. 2000. TRANSFAC: An integrated system for gene expression regulation. Nucl. Acids Res. 28, 316–319.

Address correspondence to:
Jun S. Liu, Wing H. Wong
Department of Statistics
Science Center, 6th floor
1 Oxford Street
Cambridge, MA 02138

E-mail: {jliu, wwong}@stat.harvard.edu


Efforts on the computational prediction of TFBSs fall into two general approaches. The first seeks novel recurrent patterns in a set of DNA sequences, often the promoters of genes found to be coregulated in gene expression microarray experiments. A number of statistical models have been developed in the past decade for this purpose, based on Bayesian models and Monte Carlo methods (Bailey and Elkan, 1994; Hughes et al., 2000; Lawrence et al., 1993; Lawrence and Reilly, 1990; Liu et al., 2001; Liu et al., 2002; Roth et al., 1998). They have been widely applied and found to be most successful in lower organisms such as bacteria and yeast. However, in higher organisms such as the human, these methods may yield noisy results because of the long noncoding regions and the large numbers of nonfunctional repeat elements (Lander et al., 2001). A recent trend to improve upon these de novo methods is to incorporate the information from cross-species comparisons.

The other major approach to predicting transcription factor binding sites makes use of prior knowledge of the binding sites. These methods evaluate individual candidate site sequences by their similarities to clusters of experimentally determined binding sites (Chen et al., 1995; Hertz et al., 1990; Quandt et al., 1995; Stormo and Hartzell, 1989; Wingender et al., 2000). These binding site sequences are most often summarized as position-specific scoring matrices (PSSMs), which are then compared against candidate DNA segments. This is the approach of interest in this paper.

Various methods exist to score candidate segments for their similarities to known binding sites using PSSMs. We provide an example in Fig. 1 using the transcription factor myogenin. PSSM construction begins by using the alignment of known binding site sequences and tabulating the nucleotide distribution matrix (Fig. 1a). The counts are then transformed using either of two related schemes, log-odds (Fig. 1b) or entropy (Fig. 1c), to generate the PSSM. Candidate sites are scored against the PSSMs by summing over the corresponding scores of the nucleotides across the site sequence, i.e., the score of candidate site $S = S_1 \cdots S_p$ against PSSM $(w_{ij})_{p \times 4}$ is $S = \sum_{\text{position } i} w_{iS_i}$. In practice, these scores are then compared to some predetermined cutoff values to generate computational TFBS predictions. Note that the most widely used database of transcription factor binding, TRANSFAC (Wingender et al., 2000), is based on entropy-weighted PSSMs.
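The log-odds route from aligned sites to a PSSM can be sketched as below. The details are assumptions made for illustration rather than the transformation shown in Fig. 1: a pseudocount of 1 per base, a uniform background of 0.25, and natural logarithms. The three aligned sequences are the myogenin sites A, C, and B used later in the paper, taken here only as a toy alignment.

```python
import math

BASES = "ACGT"


def count_matrix(sites):
    """Tabulate nucleotide counts per position from equal-length aligned sites."""
    counts = [[0, 0, 0, 0] for _ in range(len(sites[0]))]
    for site in sites:
        for i, base in enumerate(site):
            counts[i][BASES.index(base)] += 1
    return counts


def log_odds_pssm(counts, background=(0.25, 0.25, 0.25, 0.25), pseudocount=1.0):
    """Log-odds weights log(f_ij / b_j); the pseudocount is an assumption."""
    pssm = []
    for row in counts:
        total = sum(row) + 4 * pseudocount
        pssm.append([math.log(((c + pseudocount) / total) / bg)
                     for c, bg in zip(row, background)])
    return pssm


sites = ["AGCAGGTG", "GACAGGTG", "ACCAGCTG"]   # toy alignment
counts = count_matrix(sites)
pssm = log_odds_pssm(counts)
```

The entropy-weighted scheme used by TRANSFAC is not reproduced here.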

While probabilities are used in the construction of the PSSMs, the scores themselves cannot be interpreted statistically. This has led to the general difficulty with choosing the score cutoff values for each matrix, a problem that may have contributed to the large numbers of false-positive predictions seen in practice. We propose a p-value-based scoring scheme, which evaluates the statistical significance of the candidate site segment. This should apply to both the entropy-based and the log-odds-based scoring methods. However, in order to obtain a valid p-value, one needs to model the background sequence properly, which may serve either as the "null model" or as a component in computing the log-odds scoring function.

In this paper, we model the background sequences, or the "null distribution," as a Markov chain. As in previous methods, candidate binding site sequences are scored by PSSMs. Each score is evaluated statistically by computing its p-value, that is, the probability that the background model can achieve a score at least as high as that observed. In order to calculate this p-value, we develop an efficient and exact algorithm based on probability-generating functions that can achieve up to 1000-fold speedup compared to Monte Carlo simulations. We note that, in contrast to score-based evaluation, the p-values we generate can serve as a universal measure of statistical significance of all candidate binding sites, regardless of their corresponding binding factor or of their genomic locations.

It has been known that the effectiveness of a binding site in recruiting its corresponding transcription factors can be dramatically affected by the genomic context that it is in. This can be attributed to a number of factors, such as the local DNA bending, the accessibility of the binding site, or the positive or negative effects of neighboring TFBSs. We incorporate the local genomic context into the p-value-based scoring method and develop the Local Markov Method (LMM). The p-value for a candidate site provides a measure of its similarity to known binding sites and its contrast against the local genomic context. We first show that the incorporation of the local genomic context can be advantageous in the prediction of myogenin and MEF2 binding sites in the human genome, an advantage observed independently of the method of PSSM construction. We further compare the abilities of LMM and TRANSFAC to pick up 101 experimentally determined TFBSs from large tracts of human genomic sequences and find that LMM can identify TFBSs with more specificity (50% fewer false-positive predictions) without compromising sensitivity. The LMM software is available upon request (www.biostat.harvard.edu/complab/LMM).


FIG. 1. The construction of a position-specific scoring matrix for myogenin binding sites.

MAIN RESULTS

p-value calculation

In order to calculate the probability that a given Markov background model can achieve a score at least as high as the observed score of the candidate site, we extend a previous method designed for a similar purpose but applicable only to the independent and identically distributed (iid) background sequence model (Staden, 1989). The key part of this method is the reformulation of the distribution of the score as a probability-generating function, which leads to an efficient algorithm for its computation. We formulate the score probability-generating function under Markov models (detailed in the Detailed Methods section) and derive an algorithm with time complexity linear in the length of the PSSM, a dramatic improvement over the naive enumeration method, which has time complexity exponential in the length of the PSSM.
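Once the exact score distribution is available, the p-value of an observed score is simply the upper tail of that distribution, and a score cutoff for a given significance level can be read off in the same way. The sketch below assumes the distribution is given as a dictionary from integer score to probability. The function names and the toy distribution are hypothetical.

```python
def p_value(dist, observed):
    """Probability that the background model scores at least `observed`."""
    return sum(prob for score, prob in dist.items() if score >= observed)


def score_cutoff(dist, alpha):
    """Smallest score whose upper-tail probability does not exceed alpha.

    Returns None when even the highest score is more probable than alpha.
    """
    tail = 0.0
    cutoff = None
    for score in sorted(dist, reverse=True):
        tail += dist[score]
        if tail > alpha:
            break
        cutoff = score
    return cutoff


# Toy score distribution (hypothetical): number of A's in two iid uniform bases.
dist = {0: 0.5625, 1: 0.375, 2: 0.0625}
p = p_value(dist, 1)
cut = score_cutoff(dist, 0.1)
```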


FIG. 2. TRANSFAC vs. local Markov model (LMM) in the identification of transcription factor binding sites (TFBSs) in a given genomic sequence. (a) TRANSFAC scans a genomic sequence, generates similarity scores of each subsequence against a given PSSM, and uses three matrix-specific cutoffs, FN, FP, and SUM, to make putative calls. The three sets of cutoffs attempt to minimize false-negative error, false-positive error, or the sum of these two errors, respectively. (b) LMM begins by selecting the top 0.1% candidate sites based on their PSSM similarity scores, since sites with low similarity scores are unlikely to be true binding sites. For each candidate TFBS, LMM models the DNA sequence segment of length $L$ (e.g., 1000) centered around the target site as a homogeneous Markov chain of order $k = 0$, 1, 2, or 3. Under the estimated Markov model, LMM calculates the probability distribution of the similarity score using our algorithm. This distribution then allows us to assign statistical significance to the given candidate TFBS.

We implement the algorithm in C++ and incorporate the program into the local Markov method (LMM) program to study TFBSs in situ by evaluating each candidate binding site with respect to its local genomic context. A summary of the LMM method is in Fig. 2b. For comparison, we also describe in Fig. 2a the prediction program that accompanies the TRANSFAC database.
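The local background step summarized in Fig. 2b can be sketched for the first-order case as follows: cut a window around the candidate and estimate the transition matrix by maximum likelihood from dinucleotide counts. The function names, the optional pseudocount, and the fallback to a uniform row for an unobserved base are assumptions made for this illustration. Higher orders and the stationary distribution are not covered.

```python
BASES = "ACGT"


def local_window(genome, center, length=1000):
    """Segment of about `length` bases centered on the candidate position."""
    half = length // 2
    return genome[max(0, center - half):center + half]


def estimate_first_order(window, pseudocount=0.0):
    """Transition matrix P[prev][cur] estimated from dinucleotide counts."""
    index = {base: i for i, base in enumerate(BASES)}
    counts = [[pseudocount] * 4 for _ in range(4)]
    for prev, cur in zip(window, window[1:]):
        if prev in index and cur in index:      # skip N and other symbols
            counts[index[prev]][index[cur]] += 1
    P = []
    for row in counts:
        total = sum(row)
        # A base never seen as a predecessor gets a uniform row (assumption).
        P.append([c / total if total else 0.25 for c in row])
    return P


window = local_window("ACGTACGTAC", center=5, length=4)
P = estimate_first_order("ACACAC")
```

The resulting matrix, together with its stationary distribution, is what the exact score-distribution algorithm takes as its background model.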

To assess the efficiency of our algorithm, we compare it against Monte Carlo simulations. Our experiments showed that our approach can be many times more efficient than Monte Carlo simulation. For example, at $p \le 0.0001$, a sequence of length at least $10^9$ base pairs needs to be simulated in order to obtain a sufficiently accurate cutoff value (relative error $\le 1\%$), which needs more than 1000-fold more computing time than our exact algorithm (Table 1).

APPLICATIONS

MYL1 3′ enhancer myogenin binding site prediction

In human, mouse, and rat, there is a well-conserved 200 bp-long skeletal muscle-specific enhancer about 24 kb 3′ of MYL1 (Rosenthal et al., 1990; Wentworth et al., 1991). Three myogenic determination factor binding sites, A, B, and C, are found in this region, which are located 1267 bp, 1323 bp, and 1339 bp, respectively, downstream of the last exon of MYL1 in the human genome. Sites A and B are myogenin/myf4 binding sites (Rosenthal et al., 1990), while site C is a MyoD binding site (Wentworth et al., 1991), also considered to be a myogenin binding site (Fickett, 1996).

Table 1. Running Time Comparison of LMM with Monte Carlo Simulation(a)

                           Significance at p <= 0.001    Significance at p <= 0.0001
                           Relative error (%)            Relative error (%)
Simulation    Time (sec)   Mean      SD                  Mean      SD
N = 10^5      0.06         6.5       4.3                 24.0      16.9
N = 10^6      0.6          1.7       1.2                 7.7       4.5
N = 10^7      5.7          1.0       0.7                 2.6       1.7
N = 10^8      57           1.0       0.5                 2.0       0.8
N = 10^9      570          0.7       0.6                 0.6       0.4
N = 2^32      1228         0.6       0.6                 0.6       0.4
LMM           0.43         0                             0

(a) For 10 randomly chosen intergenic regions, we estimate a 2nd-order Markov model for each sequence using maximum likelihood estimation. Under the 10 estimated Markov models, we use our algorithm to derive, and Monte Carlo simulation to estimate, the score cutoffs of the p53 PSSM at two significance levels, $p \le 0.001$ and $p \le 0.0001$, on a 1500 MHz AMD Athlon machine running Linux. To assess the difference between the cutoffs $C_p$ derived by our algorithm and the cutoffs $\hat{C}_p$ estimated by simulations, we consider the p-value $F(\hat{C}_p)$ attained by $\hat{C}_p$ and the true p-value $F(C_p)$ of $C_p$ derived using our algorithm. We assess the relative errors of the simulation estimate by calculating $[F(C_p) - F(\hat{C}_p)] / F(C_p)$.

We applied the LMM to the 10000 bp MYL1 downstream region (starting from the end of the last exon) to derive the local p-values for each candidate. The local p-value for each candidate is the statistical significance of observing its score (derived by both log-odds and entropy-related PSSMs), assuming that it is generated under a local random model, where Markov models of different orders (e.g., 0, 1, or 2) are used, with parameters estimated from the local 1000 bp genomic sequence centered at the candidate. The top 10 score candidate sites derived using log-odds or entropy-weighted PSSMs are listed in Tables 2a and 2b, respectively, with the true sites A, B, and C labeled and shaded in gray, along with their local p-values. We find the PSSM scores to be a less sensitive measure than the local p-values: the true sites A, B, and C stand out under the local p-values, while they are not as distinct from the false predictions under the PSSM scores.

Table 2. Incorporating Local Sequence Information to Transcription Factor Binding Site Prediction Using Two Types of PSSMs for Myogenin in the Human MYL1 3′ Enhancer (a, b) or for MEF2 in the Human Phosphoglycerate Mutase Promoter (c, d)(a)

(a) Using log-odds myogenin PSSMs

Position (bp from     Log-odds      p-values of observed score under local background model
last exon of MYL1)    PSSM score    iid         1st Markov†   2nd Markov    Binding site
1267 (A)              5.56          0.000008    0.000017      0.000030      AGCAGGTG
1339 (C)              5.50          0.000015    0.000027      0.000055      GACAGGTG
1323 (B)              5.48          0.000033    0.000057      0.000112      ACCAGCTG
5434                  5.56          0.000036    0.000074      0.000095      AGCAGCTG
2463                  5.50          0.000059    0.000135      0.000179      GCCAGCTG
1235                  5.31          0.000212    0.000354      0.000442      ACCATGTG
926                   5.34          0.000181    0.000363      0.000468      TGCAGGTG
2574                  5.36          0.000225    0.000416      0.000421      GGCAGATG
783                   5.31          0.000274    0.000453      0.000537      AACATCTG
470                   5.29          0.000404    0.000624      0.000731      GGAAGCTG

(b) Using entropy-weighted myogenin PSSM

Position (bp from     TRANSFAC      p-values of observed score under local background model
last exon of MYL1)    score         iid         1st Markov†   2nd Markov    Binding site
1267 (A)              4.667         0.000008    0.000017      0.00003       AGCAGGTG
1339 (C)              4.628         0.000018    0.000032      0.000059      GACAGGTG
5434                  4.667         0.000036    0.000074      0.000095      AGCAGCTG
1323 (B)              4.581         0.000045    0.000077      0.000127      ACCAGCTG
2463                  4.628         0.000068    0.000152      0.000194      GCCAGCTG
2574                  4.596         0.000073    0.000177      0.000191      GGCAGATG
926                   4.463         0.000224    0.000414      0.000532      TGCAGGTG
7534                  4.377         0.000378    0.000534      0.000788      TACAGCTG
7156                  4.377         0.000346    0.00054       0.000686      CCCAGCTG
4895                  4.322         0.000829    0.001998      0.002045      CTCAGGTG

(c) Using log-odds MEF2 PSSMs

Position (bp from     Log-odds      p-values of observed score under local background model
last exon of MYL1)    PSSM score    iid         1st Markov†   2nd Markov    Binding site
-2970                 6.69          0.000199    0.000160      0.000228      ATTTTAAATA
-3115                 6.71          0.000209    0.000183      0.000243      GTTATAAATA
-161                  6.49          0.000355    0.000183      0.000322      ATTTTAAGCA
-2939                 6.68          0.000233    0.000190      0.000266      TGTTTAAATC
-3151                 6.63          0.000807    0.000655      0.000747      TGTTTAAGAA
-4767                 6.56          0.000951    0.001009      0.001712      TTTTTATATA
-3433                 6.49          0.003940    0.003099      0.003383      AAACTAAAAA
-3566                 6.44          0.005710    0.004913      0.005231      TTTTTAAAGC
-3214                 6.43          0.007155    0.005654      0.006459      AGTTTATATC
-3577                 6.41          0.007363    0.006312      0.006625      GGTTTAACAT

(d) Using entropy-weighted MEF2 PSSM

Position (bp from     TRANSFAC      p-values of observed score under local background model
last exon of MYL1)    score         iid         1st Markov†   2nd Markov    Binding site
-2970                 5.91          0.000103    0.000082      0.000133      ATTTTAAATA
-161                  5.50          0.000191    0.000101      0.000174      ATTTTAAGCA
-3115                 5.90          0.000206    0.000181      0.000243      GTTATAAATA
-2939                 5.54          0.001001    0.000807      0.001016      TGTTTAAATC
-3151                 5.62          0.001091    0.000914      0.001083      TGTTTAAGAA
-4700                 5.31          0.001451    0.001451      0.001451      TTGTTAAAGA
-3566                 5.43          0.002961    0.002532      0.002676      TTTTTAAAGC
-3433                 5.45          0.003271    0.002610      0.002788      AAACTAAAAA
-4444                 5.32          0.003080    0.003200      0.004320      CATATAATTA
-3687                 5.35          0.003761    0.003320      0.003671      GAAGTAAAGA

(a) Sorted in increasing order by the column marked with †.

PGAM-M MEF2 binding site prediction

A major positive regulatory element is required for the muscle-specific expression of the muscle-specific subunit of the human phosphoglycerate mutase (PGAM-M) gene (Nakatsuji et al., 1992). This element, located 161 bp upstream of the gene, is found to be bound by the transcription factor MEF-2.

We applied the LMM to the 5000 bp PGAM-M upstream region using the MEF2_Q6 PSSM to derive the local p-values for each candidate. The top 10 score candidate sites derived using log-odds or entropy-weighted PSSMs are listed in Tables 2c and 2d, respectively, with the true site labeled and shaded in gray, along with their local p-values. We find that LMM behaves here much as it does for the MYL1 enhancer.

Overall, from Table 2, we see that by taking into account the local sequence composition, we have reordered the candidate sequences in a way that is favorable to the true binding sites.

LARGE-SCALE VALIDATION

In order to evaluate the performance of LMM and to compare our local p-values to PSSM similarity scores, we apply both LMM and TRANSFAC to 101 known binding sites in the human genome, obtained by mapping binding sites in the TRANSFAC database onto the human genome. We recorded and evaluated the extent to which LMM and TRANSFAC can capture this large collection of known binding sites in the human genome and the amount of noise generated in so doing.

In Figure 3a, the tradeoff between sensitivity and noise is shown in terms of the proportion of the known binding sites detected and the amount of concomitant noise generated. Noise is measured by the noise-to-signal ratio, which is defined as the number of binding site calls not known to be correct divided by the number of known binding sites found. For comparison, we show the tradeoffs achieved by TRANSFAC using its three matrix-specific similarity score cutoffs (FN, SUM, FP), along with that achieved by LMM under Markov models of orders 0, 1, 2, and 3 at various p-value cutoffs, starting at the stringent $p = 0.00001$. From the inset graph, we see that at all levels of sensitivity, LMM outperformed TRANSFAC by producing significantly less noise. While the performance of LMM comes close to that of TRANSFAC as the p-value cutoff increases, by then, for both methods, the advantage of increased sensitivity has been nullified by the high level of accompanying noise, rendering them impractical. Overall, not only is the sensitivity of LMM comparable to that of TRANSFAC, its noise-to-signal ratio is also vastly superior. It should be noted that, since only a limited number of true binding sites are known, not every unsupported binding site prediction is necessarily a false-positive prediction. Thus, the noise-to-signal ratio overestimates the true noise level, especially when stringent criteria are used to generate putative TFBSs with high sequence similarity to known binding sites. As the criteria relax, the large numbers of predictions over and above the known binding sites imply a high level of true background noise.

FIG. 3. Large-scale validation of TRANSFAC and LMM. Tradeoff between sensitivity and noise. (a) We compared the abilities of the two methods to detect the 101 known binding sites in the human genome by looking at their sensitivity and noise-to-signal ratio. The balance of the tradeoff between these two measures achieved at various significance levels by LMM is traced and compared to that attained by TRANSFAC. The inset graph shows the performance of LMM and TRANSFAC across all levels of sensitivity. (b) Detailed results for $p = 0.00001$ and 0.0002.
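The two quantities plotted in this validation follow directly from the definitions above: sensitivity is the fraction of known sites recovered, and the noise-to-signal ratio is the number of calls not known to be correct divided by the number of known sites found. A minimal sketch, with hypothetical site identifiers, is:

```python
def sensitivity_and_noise(predicted, known):
    """Return (sensitivity, noise-to-signal ratio) for a set of site calls.

    `predicted` and `known` are collections of hashable site identifiers,
    for example (gene, position) pairs.
    """
    predicted, known = set(predicted), set(known)
    found = predicted & known
    sensitivity = len(found) / len(known)
    unsupported = len(predicted - known)
    # With no known site found, the ratio is undefined; report infinity.
    noise = unsupported / len(found) if found else float("inf")
    return sensitivity, noise


# Toy example (hypothetical identifiers): 2 of 4 known sites found, 4 other calls.
known = {1, 2, 3, 4}
predicted = {1, 2, 10, 11, 12, 13}
sens, noise = sensitivity_and_noise(predicted, known)
```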

More detailed results for TRANSFAC using the three cutoffs and for LMM using different significance cutoffs (0.00001 and 0.0002) and under different Markov models are summarized in Fig. 3b. While the FN cutoff missed relatively few known binding sites, it generated more than 45 false-positive predictions for every accurate binding site call. On the other hand, FP made fewer false positives, but it detected only one in nine known binding sites. The SUM cutoff, designed as a balance of these inherent tradeoffs, did strike a reasonable compromise, having generated about nine false positives for every real binding site and detected more than half of the known sites.

At the stringent significance cutoff $p = 0.00001$, LMM detected about twice as many binding sites as did the FP cutoff and on average produced about 60% fewer false-positive predictions for every correct prediction. At the more relaxed p-value cutoff $p = 0.0002$, the sensitivity of LMM is comparable to that of the SUM cutoff, while only half of the noise is generated. The binding sites that were detected by LMM at $p \le 0.0002$ but missed by TRANSFAC using the SUM cutoff include a MEF2 binding site over the desmin gene, an ATF1 (activating transcription factor 1) binding site over the TGFβ2 gene, a HIF (hypoxia-inducible factor) binding site over the VEGF gene, and an ICSBP (IFN consensus sequence binding protein) binding site over the OAS1 gene. We choose $p = 0.0002$ as the general significance cutoff for the application of LMM to mammalian genomic sequences, a cutoff with a sufficiently high sensitivity and an acceptable amount of noise. Overall, the LMM provides an advantageous tradeoff between noise-to-signal ratio and sensitivity.

In our validation experiment, we found that Markov models of orders 1, 2, and 3 have better combinations of high sensitivity and low noise than the iid model, confirming an earlier observation (Liu et al., 2001) that Markov models can better capture the structure of biological sequences. In addition, we compared the performance of the second-order LMM against an analogous global Markov model with parameters estimated from a large collection of upstream regions in order to assess the ability of LMM to model the local sequence context information. We found that, over the 101 known human TFBSs in situ, LMM generally outperforms the global Markov model, while they behave similarly at high and low sensitivity levels (Fig. 4). At high sensitivity levels, the lax p-value cutoffs produce large numbers of putative TFBS calls, overwhelming the advantage enjoyed by LMM. At low sensitivity levels, the stringent p-value cutoffs yield only putative TFBSs with undeniable sequence similarity to known binding sites. Thus, the noise-to-signal ratio may not reflect the true noise level in this region.

FIG. 4. The use of local sequence context is advantageous. The performance of the second-order LMM is compared against an analogous global Markov model with parameters estimated from a large collection of upstream regions. The performance is assessed in terms of the noise-to-signal ratio and sensitivity. At the recommended p-value cutoff of 0.0002, LMM is more sensitive and less noisy.

DISCUSSION

The work presented in this paper attempts to identify TFBSs by considering simultaneously both their similarity to the query PSSM and their differences from the local genomic context. Through the study of the human TFBSs in TRANSFAC, we show that LMM, which makes putative TFBS calls using local p-values, yields a much better false-positive to true-positive ratio than that obtained using the TRANSFAC or log-odds scores alone.

It has been known that neighboring nucleotide compositions can affect the interaction between a transcription factor and its binding site. To the best of our knowledge, however, there is no documented study on whether, and by how much, PSSM-based TFBS detection can be improved by using a local background model. The result we present, which is based on more than 100 experimentally determined TFBS sequences in the human genome, shows a clear overall advantage for incorporating the local sequence context into PSSM-based TFBS search. There are various biological mechanisms that can explain this effect, which may lead to more complicated and more specific models. For instance, it may be that the local 1000 bp genomic region does not contain DNA sequences similar to the true binding site, because otherwise the target transcription factor may be competed away from its biologically meaningful binding site.

While this improvement does not in itself render a solution to the much more difficult problem of detecting regulatory modules, by significantly reducing false-positive calls for single sites, the local p-value approach will contribute substantially to any subsequent algorithms aiming to detect combinatorial regulatory modules. The method we developed here is seen as a proof of principle and can be used as a component of a more complex approach. For example, considering that clusters of binding sites often occur within small regions of about 200 bp to cooperatively recruit the transcription factors, a natural future development of LMM would be to take this distance effect into the background estimation and combine the LMM p-values of a few candidate PSSM sites. Many challenging problems in computational biology, e.g., translation initiation site identification, splice site recognition, and RNA secondary structure prediction, can be modeled in terms of the recognition of motifs. Our work may be adapted and extended to these problems as well. However, it should be noted that when applied to protein sequences, which are composed of a 20-letter alphabet, the performance of our algorithm may become an issue, especially when the order of the Markov chain, k, is large.

DETAILED METHODS

Data extraction for large-scale validation

To evaluate the performance of the LMM, we apply it to known TFBSs in the human genome. Known binding sites are extracted from the SITE table of the TRANSFAC database, version 6.2. About half of the 12,262 binding sites in this table are experimentally derived from various species. The rest are generated from in vitro binding assays on artificial nucleotide sequences. Since LMM studies binding sites with respect to their genomic contexts, these artificial sequences, which do not correspond to any genomic region, cannot be used for our validation study. Of the 6073 in vivo binding sites, 1425 sites are based on the human genome. Of these, 149 (10.5%) are annotated with a corresponding PSSM. We use these binding sites for validation.


To locate the known TFBSs in the human genome, we focus on the 5000 bp upstream sequences of all genes. We made use of the annotations provided by Ensembl (Hubbard et al., 2002) and extracted 22,808 human gene promoters from the human genome assembly NCBI golden path 29 (www.ensembl.org/Homo_sapiens). Since heuristic sequence-mapping algorithms do not perform well on short sequences such as TFBSs, we use an exact-match algorithm based on suffix trees (Gusfield, 1997). We found that many binding site sequences are precisely mapped onto the promoters of the correct target genes. For those binding sites with mappings onto multiple promoters or with no mapping, we attempted to retrieve them by manual review. To find the correct one among multiple mappings, we made correspondences between the Ensembl gene name and the target gene name of the binding site as recorded by TRANSFAC. A review of some missed matches using inexact match algorithms revealed a small number of single-basepair differences between the recorded binding site sequences and the promoter sequences of the target genes, for example, the binding site HS$ALBU_06 over the human albumin promoter. After validating against the primary literature for the positions of these binding sites, we included these mappings as well. In total, we located 101 human TFBSs.
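The exact-match mapping step can be illustrated with a minimal Python sketch. It is not the suffix-tree implementation described above: plain substring search is swapped in because it is adequate for a handful of short sequences. The promoter names and sequences are invented for illustration, and searching the reverse strand as well is an assumption, since the text does not say whether both strands were scanned.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))


def find_exact_matches(site, promoters):
    """Return (promoter name, 0-based offset, strand) for every exact occurrence.

    `promoters` maps a (hypothetical) gene name to its upstream sequence.
    """
    hits = []
    for name, seq in promoters.items():
        for strand, pattern in (("+", site), ("-", reverse_complement(site))):
            start = seq.find(pattern)
            while start != -1:
                hits.append((name, start, strand))
                start = seq.find(pattern, start + 1)
    return hits


# Toy data, invented for illustration only.
promoters = {"geneX": "TTAGCAGGTGAA", "geneY": "CCCACCTGCTCC"}
hits = find_exact_matches("AGCAGGTG", promoters)
```

A site that maps onto several promoters yields several tuples, which is the situation the manual review described above is meant to resolve.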

Local p-value calculation

Although the exact score distribution can be obtained by enumerating all possible binding site sequences under any "null" model for the observed nucleotide base pairs, the computational cost for a PSSM of length p is 4^p. Staden's method (Staden, 1989), which turns this into an order-p computation, is based on the PGF of the score under the simple null model that the base pairs are independent and identically distributed (iid). Recently, however, there has been some evidence suggesting that Markov background models work better than the iid model for detecting TFBSs (Liu et al., 2002). By extending Staden's PGF method to dependent random variables, we present here the derivation of the PGFs under a first-order Markov model, the basis of the efficient algorithm for computing the exact score distribution.
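The contrast between enumeration and a position-by-position computation can be sketched in Python for the iid case. This is an illustrative sketch in the spirit of Staden's method, not his published code. The 3-column integer PSSM and the base frequencies are made-up values.

```python
from collections import defaultdict
from itertools import product

BASES = "ACGT"


def score_dist_enumeration(pssm, freqs):
    """Exact score distribution by listing all 4**p sequences (iid background)."""
    dist = defaultdict(float)
    for seq in product(BASES, repeat=len(pssm)):
        prob, score = 1.0, 0
        for row, base in zip(pssm, seq):
            prob *= freqs[base]
            score += row[base]
        dist[score] += prob
    return dict(dist)


def score_dist_convolution(pssm, freqs):
    """Same distribution, built one PSSM position at a time."""
    dist = {0: 1.0}
    for row in pssm:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for base in BASES:
                nxt[score + row[base]] += prob * freqs[base]
        dist = dict(nxt)
    return dist


# Hypothetical integer PSSM (one dict per position) and base frequencies.
pssm = [{"A": 2, "C": 0, "G": 1, "T": 0},
        {"A": 0, "C": 3, "G": 0, "T": 1},
        {"A": 1, "C": 0, "G": 2, "T": 0}]
freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
by_enum = score_dist_enumeration(pssm, freqs)
by_conv = score_dist_convolution(pssm, freqs)
```

Both functions return the same mapping from score to probability. The second touches each position once and keeps only one entry per distinct score, which is what makes the cost grow linearly with p.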

Probability generating function derivation

In our study, we make use of the PSSMs constructed by TRANSFAC version 6.2. Given a PSSM $m = (w_{ij})_{p \times 4}$, where $i = 1, \ldots, p$ and $j = A, C, G, T$, the match score $S$ and the similarity score $S/S_{\max}$ of a sequence $D_1 D_2 \cdots D_p$ are defined as (Quandt et al., 1995)

\[
S = \sum_{i=1}^{p} w_{iD_i}
\quad \text{and} \quad
S/S_{\max} = \sum_{i=1}^{p} w_{iD_i} \Big/ \sum_{i=1}^{p} \max_{j} \{ w_{ij} \}.
\]
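As a concrete reading of these two definitions, the following Python sketch computes the match score and the similarity score of a candidate site. The PSSM is a made-up 3-column integer matrix, stored as one dictionary per position. The function names are illustrative and do not come from TRANSFAC or MatInspector.

```python
def match_score(pssm, site):
    """S: sum over positions of the weight of the observed base."""
    if len(site) != len(pssm):
        raise ValueError("site length must equal PSSM length")
    return sum(row[base] for row, base in zip(pssm, site))


def similarity_score(pssm, site):
    """S / S_max, where S_max sums the largest weight in each column."""
    s_max = sum(max(row.values()) for row in pssm)
    return match_score(pssm, site) / s_max


# Hypothetical integer PSSM, one dict of weights per position.
pssm = [{"A": 2, "C": 0, "G": 1, "T": 0},
        {"A": 0, "C": 3, "G": 0, "T": 1},
        {"A": 1, "C": 0, "G": 2, "T": 0}]
best = similarity_score(pssm, "ACG")   # consensus site, scores S_max
weak = similarity_score(pssm, "GTA")   # scores 3 out of a possible 7
```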

Let $S$ be a random variable taking integer values; then its probability generating function $G(t)$ is the expected value of $t^S$. $G(t)$ is a polynomial, and the coefficient of the term $t^n$ is the probability of the event $S = n$ (Gut, 1995).

Given a PSSM $m$ of length $p$, under the assumption that the DNA sequence is iid, Staden provided the PGF of the match score in the form of a product of $p$ polynomials (Staden, 1989): $G(t) = \prod_{i=1}^{p} \sum_{j = A,C,G,T} f_j t^{w_{ij}}$, where $f_j$ is the frequency of letter $j$ in the iid DNA sequence. For the first-order Markov case, $k = 1$, let the transition matrix be $P = (f_{\alpha|\beta})_{4 \times 4}$ and the stationary distribution of the Markov chain be $\pi$ (viewed as a four-dimensional row vector). Then the PGF under the first-order Markov model is

\[
G(t) = \pi \left[ \prod_{i=1}^{p} P M_i(t) \right] I, \qquad (*)
\]

where $M_i(t) = \mathrm{Diag}(t^{w_{iA}}, t^{w_{iC}}, t^{w_{iG}}, t^{w_{iT}})$ and $I = (1, 1, 1, 1)^T$ (proof provided at the end of this section).
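Equation (*) can be turned into a short Python sketch that expands the matrix product with polynomial entries and reads off the score distribution. This is an illustration, not the authors' implementation. Polynomials in t are stored as dictionaries from exponent to coefficient. The PSSM and the transition matrix are made-up values, and the matrix is chosen symmetric so that its stationary distribution is uniform.

```python
from collections import defaultdict

BASES = "ACGT"


def markov_score_pgf(pssm, trans, pi):
    """Coefficients of G(t) = pi [prod_i P M_i(t)] I as {score: probability}.

    trans[b][a] is the probability that base a follows base b;
    pi[b] is the stationary probability of base b.
    """
    # vec[b] is a polynomial in t; start from the all-ones column vector I.
    vec = {b: {0: 1.0} for b in BASES}
    for row in reversed(pssm):               # apply P M_p first, P M_1 last
        new_vec = {}
        for b in BASES:
            poly = defaultdict(float)
            for a in BASES:
                for exp, coef in vec[a].items():
                    poly[exp + row[a]] += trans[b][a] * coef
            new_vec[b] = dict(poly)
        vec = new_vec
    pgf = defaultdict(float)
    for b in BASES:                          # left-multiply by the row vector pi
        for exp, coef in vec[b].items():
            pgf[exp] += pi[b] * coef
    return dict(pgf)


# Hypothetical inputs: integer PSSM, symmetric transition matrix, uniform pi.
pssm = [{"A": 2, "C": 0, "G": 1, "T": 0},
        {"A": 0, "C": 3, "G": 0, "T": 1},
        {"A": 1, "C": 0, "G": 2, "T": 0}]
trans = {b: {a: (0.4 if a == b else 0.2) for a in BASES} for b in BASES}
pi = {b: 0.25 for b in BASES}
pgf = markov_score_pgf(pssm, trans, pi)
p_value_at_6 = sum(prob for score, prob in pgf.items() if score >= 6)
```

Each pass over a PSSM position handles at most one coefficient per attainable score for each of the four states, which is consistent with a cost that is linear in p.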

Since a Markov chain of order $k$ on a set $\Omega$ is equivalently a first-order Markov chain on the set $\Omega^k$, with a little modification of $M_i(t)$ we can generalize the above results to $k > 1$. An example of the PGF for $k = 3$ is in the online supplement (www.biostat.harvard.edu/complab/LMM). Using this representation for the PGF, we developed and implemented an algorithm using C++ to calculate the exact score distribution. Generally, for a $k$th-order Markov chain and a PSSM of length $p$, the time complexity of our algorithm is $O(4^k \cdot S_{\max} \cdot p)$, linear in the matrix length but exponential in the order of the Markov chain. The source code is available upon request (www.biostat.hsph.harvard.edu/LMM).
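The reduction from order k to order 1 can be sketched as follows. The Python function below rewrites a table of order-k conditional probabilities as first-order transitions between k-letter states. The function name and the toy second-order model are assumptions made for illustration, and the corresponding modification of M_i(t) is not shown.

```python
from itertools import product

BASES = "ACGT"


def lift_to_first_order(cond, k):
    """Rewrite an order-k chain as a first-order chain on k-letter states.

    cond[context][base] is the probability of `base` given the preceding
    k-letter `context`. The state x1..xk moves to x2..xk+base with that
    probability; all other transitions have probability zero and are omitted.
    """
    lifted = {}
    for ctx in map("".join, product(BASES, repeat=k)):
        lifted[ctx] = {ctx[1:] + base: cond[ctx][base] for base in BASES}
    return lifted


# Toy second-order model: the next base repeats the previous one with prob 0.4.
k = 2
cond = {"".join(ctx): {b: (0.4 if b == ctx[-1] else 0.2) for b in BASES}
        for ctx in product(BASES, repeat=k)}
lifted = lift_to_first_order(cond, k)
```

The lifted chain has 4^k states, which is where the exponential dependence on k in the stated time complexity comes from.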

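Proof of equation (*). For ease of notation and without loss of generality, we let $p$, the length of the PSSM, be 3.

For a DNA sequence $D_1D_2D_3$, its match score against PSSM $m$ is $w_{1D_1} + w_{2D_2} + w_{3D_3}$, and the probability of the occurrence of $D_1D_2D_3$ is $f_{D_1} f_{D_2|D_1} f_{D_3|D_2}$. By definition, the PGF of the match score against $m$ is

\[
\sum_{D_1D_2D_3} f_{D_1} f_{D_2|D_1} f_{D_3|D_2}\, t^{w_{1D_1} + w_{2D_2} + w_{3D_3}}
= \sum_{D_1D_2D_3} f_{D_1} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}.
\]

In the following, we derive the PGF in the alternative form of a product of $p$ matrices. First, using the stationarity of the chain, $f_{D_1} = \sum_a f_a f_{D_1|a}$,

\begin{align*}
&\sum_{D_1D_2D_3} f_{D_1} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \sum_{D_1D_2D_3} \sum_{a} f_a \cdot f_{D_1|a} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \sum_{a} f_a \cdot \sum_{D_1D_2D_3} f_{D_1|a} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = (f_A, f_C, f_G, f_T) \cdot
\Big( \sum_{D_1D_2D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \;
\sum_{D_1D_2D_3} f_{D_1|T} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T.
\end{align*}

For the component of the second vector corresponding to base A,

\begin{align*}
&\sum_{D_1D_2D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \sum_{D_1} f_{D_1|A} t^{w_{1D_1}} \cdot \sum_{D_2D_3} f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = (f_{A|A} t^{w_{1A}}, f_{C|A} t^{w_{1C}}, f_{G|A} t^{w_{1G}}, f_{T|A} t^{w_{1T}}) \cdot
\Big( \sum_{D_2D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \;
\sum_{D_2D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T.
\end{align*}

We apply similar arguments to the components corresponding to bases C, G, and T and obtain, for each $\beta \in \{C, G, T\}$,

\begin{align*}
&\sum_{D_1D_2D_3} f_{D_1|\beta} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = (f_{A|\beta} t^{w_{1A}}, f_{C|\beta} t^{w_{1C}}, f_{G|\beta} t^{w_{1G}}, f_{T|\beta} t^{w_{1T}}) \cdot
\Big( \sum_{D_2D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \;
\sum_{D_2D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T.
\end{align*}

Therefore, for the first position, we have

\begin{align*}
&\Big( \sum_{D_1D_2D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \;
\sum_{D_1D_2D_3} f_{D_1|T} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T \\
&\quad =
\begin{pmatrix}
f_{A|A} t^{w_{1A}} & f_{C|A} t^{w_{1C}} & f_{G|A} t^{w_{1G}} & f_{T|A} t^{w_{1T}} \\
f_{A|C} t^{w_{1A}} & f_{C|C} t^{w_{1C}} & f_{G|C} t^{w_{1G}} & f_{T|C} t^{w_{1T}} \\
f_{A|G} t^{w_{1A}} & f_{C|G} t^{w_{1C}} & f_{G|G} t^{w_{1G}} & f_{T|G} t^{w_{1T}} \\
f_{A|T} t^{w_{1A}} & f_{C|T} t^{w_{1C}} & f_{G|T} t^{w_{1G}} & f_{T|T} t^{w_{1T}}
\end{pmatrix}
\cdot
\begin{pmatrix}
\sum_{D_2D_3} f_{D_2|A} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2D_3} f_{D_2|C} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2D_3} f_{D_2|G} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2D_3} f_{D_2|T} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix} \\
&\quad = P \cdot \mathrm{Diag}(t^{w_{1A}}, t^{w_{1C}}, t^{w_{1G}}, t^{w_{1T}}) \cdot
\Big( \sum_{D_2D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \;
\sum_{D_2D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T \\
&\quad = P \cdot M_1(t) \cdot
\Big( \sum_{D_2D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \;
\sum_{D_2D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T.
\end{align*}

Further applying the above arguments to positions 2 and 3, we have

\begin{align*}
&\Big( \sum_{D_2D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \;
\sum_{D_2D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T \\
&\quad = P \cdot \mathrm{Diag}(t^{w_{2A}}, t^{w_{2C}}, t^{w_{2G}}, t^{w_{2T}}) \cdot
\Big( \sum_{D_3} f_{D_3|A} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_3} f_{D_3|T} t^{w_{3D_3}} \Big)^T \\
&\quad = P \cdot \mathrm{Diag}(t^{w_{2A}}, t^{w_{2C}}, t^{w_{2G}}, t^{w_{2T}}) \cdot P \cdot \mathrm{Diag}(t^{w_{3A}}, t^{w_{3C}}, t^{w_{3G}}, t^{w_{3T}}) \cdot (1, 1, 1, 1)^T \\
&\quad = P \cdot M_2(t) \cdot P \cdot M_3(t) \cdot I.
\end{align*}

Combining the above, and noting that $(f_A, f_C, f_G, f_T) = \pi$, we obtain $G(t) = \pi \left[ \prod_{i=1}^{p} P M_i(t) \right] I$.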
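The identity established by the proof can also be checked numerically for p = 3. The Python sketch below evaluates the PGF at a fixed value of t in two ways: by summing over all 64 sequences, and by the matrix product in equation (*). The PSSM and the transition matrix are made-up values. The matrix is circulant, hence doubly stochastic, so that the uniform distribution is stationary.

```python
from itertools import product

BASES = "ACGT"


def pgf_by_enumeration(pssm, trans, pi, t):
    """Sum of P(sequence) * t**score over all sequences, first base drawn from pi."""
    total = 0.0
    for seq in product(BASES, repeat=len(pssm)):
        prob = pi[seq[0]]
        for prev, cur in zip(seq, seq[1:]):
            prob *= trans[prev][cur]
        score = sum(row[b] for row, b in zip(pssm, seq))
        total += prob * t ** score
    return total


def pgf_by_matrix_product(pssm, trans, pi, t):
    """pi [prod_i P M_i(t)] I evaluated at a numeric t, applied right to left."""
    vec = {b: 1.0 for b in BASES}
    for row in reversed(pssm):
        vec = {b: sum(trans[b][a] * t ** row[a] * vec[a] for a in BASES)
               for b in BASES}
    return sum(pi[b] * vec[b] for b in BASES)


# Hypothetical inputs; trans[b][a] is the probability that a follows b.
pssm = [{"A": 2, "C": 0, "G": 1, "T": 0},
        {"A": 0, "C": 3, "G": 0, "T": 1},
        {"A": 1, "C": 0, "G": 2, "T": 0}]
probs = [0.4, 0.3, 0.2, 0.1]
trans = {b: {a: probs[(j - i) % 4] for j, a in enumerate(BASES)}
         for i, b in enumerate(BASES)}
pi = {b: 0.25 for b in BASES}
gap = abs(pgf_by_enumeration(pssm, trans, pi, 0.5)
          - pgf_by_matrix_product(pssm, trans, pi, 0.5))
```

The two evaluations agree only when pi is stationary for the transition matrix, which is the property the first step of the proof relies on.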

ACKNOWLEDGMENTS

The work of H.H., X.Z., and W.H.W. is supported by NSF grants DBI0196176 and DMS-0090166. The work of H.H. and J.S.L. is supported by NSF grant DMS-0204674 and NIH grant R01 HG02518-01. The work of M.-C.J.K. is supported by the Howard Hughes Medical Institute predoctoral fellowship.

REFERENCES

Bailey, T.L., and Elkan, C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2, 28–36.

Chen, Q.K., Hertz, G.Z., and Stormo, G.D. 1995. MATRIX SEARCH 1.0: A computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput. Appl. Biosci. 11, 563–566.

Durbin, R., Eddy, S.R., Krogh, A., and Mitchison, G. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.

Fickett, J.W. 1996. Coordinate positioning of MEF2 and myogenin binding sites. Gene 172, GC19–32.

Fried, M., and Crothers, D.M. 1981. Equilibria and kinetics of lac repressor-operator interactions by polyacrylamide gel electrophoresis. Nucl. Acids Res. 9, 6505–6525.

Galas, D.J., and Schmitz, A. 1978. DNAse footprinting: A simple method for the detection of protein–DNA binding specificity. Nucl. Acids Res. 5, 3157–3170.

Garner, M.M., and Revzin, A. 1981. A gel electrophoresis method for quantifying the binding of proteins to specific DNA regions: Application to components of the Escherichia coli lactose operon regulatory system. Nucl. Acids Res. 9, 3047–3060.

Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, England.

Gut, A. 1995. An Intermediate Course in Probability. Springer-Verlag, New York.

Hertz, G.Z., Hartzell, G.W., 3rd, and Stormo, G.D. 1990. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci. 6, 81–92.

Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T., Durbin, R., Eyras, E., Gilbert, J., Hammond, M., Huminiecki, L., Kasprzyk, A., Lehvaslaiho, H., Lijnzaad, P., Melsopp, C., Mongin, E., Pettett, R., Pocock, M., Potter, S., Rust, A., Schmidt, E., Searle, S., Slater, G., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Stupka, E., Ureta-Vidal, A., Vastrik, I., and Clamp, M. 2002. The Ensembl genome database project. Nucl. Acids Res. 30, 38–41.

Hughes, J.D., Estep, P.W., Tavazoie, S., and Church, G.M. 2000. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol. 296, 1205–1214.

Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., Funke, R., Gage, D., Harris, K., Heaford, A., Howland, J., Kann, L., Lehoczky, J., LeVine, R., McEwan, P., McKernan, K., Meldrim, J., Mesirov, J.P., Miranda, C., Morris, W., Naylor, J., Raymond, C., Rosetti, M., Santos, R., Sheridan, A., Sougnez, C., Stange-Thomann, N., Stojanovic, N., Subramanian, A., Wyman, D., Rogers, J., Sulston, J., Ainscough, R., Beck, S., Bentley, D., Burton, J., Clee, C., Carter, N., Coulson, A., Deadman, R., Deloukas, P., Dunham, A., Dunham, I., Durbin, R., French, L., Grafham, D., Gregory, S., Hubbard, T., Humphray, S., Hunt, A., Jones, M., Lloyd, C., McMurray, A., Matthews, L., Mercer, S., Milne, S., Mullikin, J.C., Mungall, A., Plumb, R., Ross, M., Shownkeen, R., Sims, S., Waterston, R.H., Wilson, R.K., Hillier, L.W., McPherson, J.D., Marra, M.A., Mardis, E.R., Fulton, L.A., Chinwalla, A.T., Pepin, K.H., Gish, W.R., Chissoe, S.L., Wendl, M.C., Delehaunty, K.D., Miner, T.L., Delehaunty, A., Kramer, J.B., Cook, L.L., Fulton, R.S., Johnson, D.L., Minx, P.J., Clifton, S.W., Hawkins, T., Branscomb, E., Predki, P., Richardson, P., Wenning, S., Slezak, T., Doggett, N., Cheng, J.F., Olsen, A., Lucas, S., Elkin, C., Uberbacher, E., Frazier, M., Gibbs, R.A., Muzny, D.M., Scherer, S.E., Bouck, J.B., Sodergren, E.J., Worley, K.C., Rives, C.M., Gorrell, J.H., Metzker, M.L., Naylor, S.L., Kucherlapati, R.S., Nelson, D.L., Weinstock, G.M., Sakaki, Y., Fujiyama, A., Hattori, M., Yada, T., Toyoda, A., Itoh, T., Kawagoe, C., Watanabe, H., Totoki, Y., Taylor, T., Weissenbach, J., Heilig, R., Saurin, W., Artiguenave, F., Brottier, P., Bruls, T., Pelletier, E., Robert, C., Wincker, P., Smith, D.R., Doucette-Stamm, L., Rubenfield, M., Weinstock, K., Lee, H.M., Dubois, J., Rosenthal, A., Platzer, M., Nyakatura, G., Taudien, S., Rump, A., Yang, H., Yu, J., Wang, J., Huang, G., Gu, J., Hood, L., Rowen, L., Madan, A., Qin, S., Davis, R.W., Federspiel, N.A., Abola, A.P., Proctor, M.J., Myers, R.M., Schmutz, J., Dickson, M., Grimwood, J., Cox, D.R., Olson, M.V., Kaul, R., Shimizu, N., Kawasaki, K., Minoshima, S., Evans, G.A., Athanasiou, M., Schultz, R., Roe, B.A., Chen, F., Pan, H., Ramser, J., Lehrach, H., Reinhardt, R., McCombie, W.R., de la Bastide, M., Dedhia, N., Blocker, H., Hornischer, K., Nordsiek, G., Agarwala, R., Aravind, L., Bailey, J.A., Bateman, A., Batzoglou, S., Birney, E., Bork, P., Brown, D.G., Burge, C.B., Cerutti, L., Chen, H.C., Church, D., Clamp, M., Copley, R.R., Doerks, T., Eddy, S.R., Eichler, E.E., Furey, T.S., Galagan, J., Gilbert, J.G., Harmon, C., Hayashizaki, Y., Haussler, D., Hermjakob, H., Hokamp, K., Jang, W., Johnson, L.S., Jones, T.A., Kasif, S., Kaspryzk, A., Kennedy, S., Kent, W.J., Kitts, P., Koonin, E.V., Korf, I., Kulp, D., Lancet, D., Lowe, T.M., McLysaght, A., Mikkelsen, T., Moran, J.V., Mulder, N., Pollara, V.J., Ponting, C.P., Schuler, G., Schultz, J., Slater, G., Smit, A.F., Stupka, E., Szustakowski, J., Thierry-Mieg, D., Thierry-Mieg, J., Wagner, L., Wallis, J., Wheeler, R., Williams, A., Wolf, Y.I., Wolfe, K.H., Yang, S.P., Yeh, R.F., Collins, F., Guyer, M.S., Peterson, J., Felsenfeld, A., Wetterstrand, K.A., Patrinos, A., Morgan, M.J., Szustakowki, J., de Jong, P., Catanese, J.J., Osoegawa, K., Shizuya, H., Choi, S., and Chen, Y.J. 2001. Initial sequencing and analysis of the human genome. Nature 409, 860–921.

Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., and Wootton, J.C. 1993. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262, 208–214.

Lawrence, C.E., and Reilly, A.A. 1990. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 7, 41–51.

Liu, X., Brutlag, D.L., and Liu, J.S. 2001. BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput., 127–138.

Liu, X.S., Brutlag, D.L., and Liu, J.S. 2002. An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol.

Nakatsuji, Y., Hidaka, K., Tsujino, S., Yamamoto, Y., Mukai, T., Yanagihara, T., Kishimoto, T., and Sakoda, S. 1992. A single MEF-2 site is a major positive regulatory element required for transcription of the muscle-specific subunit of the human phosphoglycerate mutase gene in skeletal and cardiac muscle cells. Mol. Cell. Biol. 12, 4384–4390.

Quandt, K., Frech, K., Karas, H., Wingender, E., and Werner, T. 1995. MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucl. Acids Res. 23, 4878–4884.

Rosenthal, N., Berglund, E.B., Wentworth, B.M., Donoghue, M., Winter, B., Bober, E., Braun, T., and Arnold, H.H. 1990. A highly conserved enhancer downstream of the human MLC1/3 locus is a target for multiple myogenic determination factors. Nucl. Acids Res. 18, 6239–6246.

Roth, F.P., Hughes, J.D., Estep, P.W., and Church, G.M. 1998. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol. 16, 939–945.

Staden, R. 1989. Methods for calculating the probabilities of finding patterns in sequences. Comput. Appl. Biosci. 5, 89–96.

Stormo, G.D., and Hartzell, G.W., 3rd. 1989. Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl. Acad. Sci. USA 86, 1183–1187.

Wentworth, B.M., Donoghue, M., Engert, J.C., Berglund, E.B., and Rosenthal, N. 1991. Paired MyoD-binding sites regulate myosin light chain gene expression. Proc. Natl. Acad. Sci. USA 88, 1242–1246.

Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Pruss, M., Reuter, I., and Schacherer, F. 2000. TRANSFAC: An integrated system for gene expression regulation. Nucl. Acids Res. 28, 316–319.

Address correspondence to:
Jun S. Liu, Wing H. Wong
Department of Statistics
Science Center 6th floor
1 Oxford Street
Cambridge, MA 02138

E-mail: {jliu, wwong}@stat.harvard.edu


FIG. 1. The construction of a position-specific scoring matrix for myogenin binding sites.

MAIN RESULTS

p-value calculation

In order to calculate the probability that a given Markov background model can achieve a score at least as high as the observed score of the candidate site, we extend a previous method designed for a similar purpose but applicable only to the independent and identically distributed (iid) background sequence model (Staden, 1989). The key part of this method is the reformulation of the distribution of the score as a probability-generating function, which leads to an efficient algorithm for its computation. We formulate the score probability-generating function under Markov models (detailed in the Detailed Methods section) and derive an algorithm with time complexity linear in the length of the PSSM, a dramatic improvement over the naive enumeration method, which has time complexity exponential in the length of the PSSM.
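Once the exact score distribution is available, the p-value of an observed score and the score cutoff for a given significance level are simple tail sums. The Python sketch below shows one possible convention. The distribution is a made-up example with exactly representable probabilities, and the function names are illustrative.

```python
def p_value(dist, observed):
    """P(S >= observed) for a score distribution {score: probability}."""
    return sum(prob for score, prob in dist.items() if score >= observed)


def score_cutoff(dist, alpha):
    """Smallest score c with P(S >= c) <= alpha, or None if no score qualifies."""
    tail = 0.0
    cutoff = None
    for score in sorted(dist, reverse=True):
        tail += dist[score]
        if tail > alpha:
            break
        cutoff = score
    return cutoff


# Hypothetical score distribution under some background model.
dist = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}
p_obs = p_value(dist, 2)              # 0.125 + 0.125
cutoff = score_cutoff(dist, 0.25)     # scores 2 and 3 together have tail 0.25
```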


FIG. 2. TRANSFAC vs. local Markov model (LMM) in the identification of transcription factor binding sites (TFBSs) in a given genomic sequence. (a) TRANSFAC scans a genomic sequence, generates similarity scores of each subsequence against a given PSSM, and uses three matrix-specific cutoffs, FN, FP, and SUM, to make putative calls. The three sets of cutoffs attempt to minimize false-negative error, false-positive error, or the sum of these two errors, respectively. (b) LMM begins by selecting the top 0.1% candidate sites based on their PSSM similarity scores, since sites with low similarity scores are unlikely to be true binding sites. For each candidate TFBS, LMM models the DNA sequence segment of length L (e.g., 1000) centered around the target site as a homogeneous Markov chain of order k = 0, 1, 2, or 3. Under the estimated Markov model, LMM calculates the probability distribution of the similarity score using our algorithm. This distribution then allows us to assign statistical significance to the given candidate TFBS.

We implement the algorithm in C and incorporate the program into the local Markov method (LMM) program to study TFBSs in situ by evaluating each candidate binding site with respect to its local genomic context. A summary of the LMM method is in Fig. 2b. For comparison, we also describe in Fig. 2a the prediction program which accompanies the TRANSFAC database.
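The local background estimation can be sketched in Python as follows: extract the window around a candidate, then count (k+1)-mers to obtain maximum likelihood transition probabilities. This is an illustration only. Truncating the window at the sequence ends and skipping positions that contain non-ACGT characters are assumptions, and contexts that never occur in the window are simply absent from the result.

```python
from collections import defaultdict

BASES = "ACGT"


def local_window(genome, center, length):
    """Segment of about `length` bases centered at `center`, truncated at the ends."""
    start = max(0, center - length // 2)
    return genome[start:start + length]


def estimate_markov(seq, k):
    """Maximum likelihood order-k model: model[context][base] = count ratio."""
    counts = defaultdict(lambda: dict.fromkeys(BASES, 0))
    for i in range(k, len(seq)):
        ctx, base = seq[i - k:i], seq[i]
        if base in BASES and all(c in BASES for c in ctx):
            counts[ctx][base] += 1
    model = {}
    for ctx, row in counts.items():
        total = sum(row.values())
        model[ctx] = {b: row[b] / total for b in BASES}
    return model


# Toy sequence: A is followed by C three times and by G once.
model = estimate_markov("ACACACAG", 1)
window = local_window("ACGTACGTAC", center=5, length=4)
```

With short windows and larger k, many contexts are observed only a few times, so the estimates become noisy.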

To assess the efficiency of our algorithm, we compare it against Monte Carlo simulations. Our experiments showed that our approach can be many times more efficient than Monte Carlo simulation. For example, at p ≤ 0.0001, a sequence of length at least 10^9 basepairs needs to be simulated in order to obtain a sufficiently accurate cutoff value (relative error ≤ 1%), which needs more than 1000-fold more computing time than our exact algorithm (Table 1).
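For comparison, a Monte Carlo estimate of the same tail probability can be sketched in Python: simulate one long sequence from the Markov model, score every window, and count how often the score reaches the observed value. The simulation design here is an assumption, since the text does not describe how the original simulations were set up. The first base is drawn uniformly, which matches the stationary distribution of the made-up symmetric chain used below.

```python
import random

BASES = "ACGT"


def simulate_markov(trans, n, rng):
    """Simulate n bases from a first-order chain; trans[b][a] = P(a follows b)."""
    seq = [rng.choice(BASES)]
    for _ in range(n - 1):
        row = trans[seq[-1]]
        seq.append(rng.choices(BASES, weights=[row[b] for b in BASES])[0])
    return "".join(seq)


def monte_carlo_p_value(pssm, trans, observed, n, seed=0):
    """Fraction of windows in a simulated sequence scoring at least `observed`."""
    rng = random.Random(seed)
    seq = simulate_markov(trans, n, rng)
    windows = n - len(pssm) + 1
    hits = sum(
        1 for i in range(windows)
        if sum(row[seq[i + j]] for j, row in enumerate(pssm)) >= observed
    )
    return hits / windows


# Hypothetical PSSM and symmetric transition matrix (uniform stationary law).
pssm = [{"A": 2, "C": 0, "G": 1, "T": 0},
        {"A": 0, "C": 3, "G": 0, "T": 1},
        {"A": 1, "C": 0, "G": 2, "T": 0}]
trans = {b: {a: (0.4 if a == b else 0.2) for a in BASES} for b in BASES}
estimate = monte_carlo_p_value(pssm, trans, observed=7, n=50000)
```

For this toy model the exact value is 0.25 × 0.2 × 0.2 = 0.01. The simulation error shrinks only with the square root of the sequence length, which is why small p-values require very long simulated sequences.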

APPLICATIONS

MYL1 3′ enhancer myogenin binding site prediction

In human, mouse, and rat, there is a well-conserved 200 bp-long skeletal muscle-specific enhancer about 24 kb 3′ of MYL1 (Rosenthal et al., 1990; Wentworth et al., 1991). Three myogenic determination factor binding sites, A, B, and C, are found in this region, which are located 1267 bp, 1323 bp, and 1339 bp, respectively, downstream of the last exon of MYL1 in the human genome. Sites A and B are myogenin/myf4 binding sites (Rosenthal et al., 1990), while site C is a MyoD binding site (Wentworth et al., 1991), also considered to be a myogenin binding site (Fickett, 1996).

Table 1. Running Time Comparison of LMM with Monte Carlo Simulation^a

                            Significance at p ≤ 0.001    Significance at p ≤ 0.0001
                            Relative error               Relative error
Simulation    Time (sec)    Mean      SD                 Mean      SD
N = 10^5      0.06          6.5       4.3                24.0      16.9
N = 10^6      0.6           1.7       1.2                7.7       4.5
N = 10^7      5.7           1.0       0.7                2.6       1.7
N = 10^8      57            1.0       0.5                2.0       0.8
N = 10^9      570           0.7       0.6                0.6       0.4
N = 2^32      1228          0.6       0.6                0.6       0.4
LMM           0.43          0                            0

^a For 10 randomly chosen intergenic regions, we estimate a 2nd-order Markov model for each sequence using maximum likelihood estimation. Under the 10 estimated Markov models, we use our algorithm to derive, and Monte Carlo simulation to estimate, the score cutoffs of the p53 PSSM at two significance levels, p ≤ 0.001 and p ≤ 0.0001, on a 1500 MHz AMD Athlon machine running Linux. To assess the difference between the cutoffs $C_p$ derived by our algorithm and the cutoffs $\hat{C}_p$ estimated by simulations, we consider the p-value $F(\hat{C}_p)$ attained by $\hat{C}_p$ and the true p-value $F(C_p)$ of $C_p$ derived using our algorithm. We assess the relative errors of the simulation estimate by calculating $(F(C_p) - F(\hat{C}_p)) / F(C_p)$.

We applied the LMM to the 10,000 bp MYL1 downstream region (starting from the end of the last exon) to derive the local p-values for each candidate. The local p-value for each candidate is the statistical significance of observing its score (derived by both log-odds and entropy-related PSSMs), assuming that it is generated under a local random model, where Markov models of different orders (e.g., 0, 1, or 2) are used, with parameters estimated from the local 1000 bp genomic sequence centered at the candidate. The top 10 score candidate sites derived using log-odds or entropy-weighted PSSMs are listed in Tables 2a and 2b, respectively, with the true sites A, B, and C labeled and shaded in gray, along with their local p-values. We find the PSSM scores to be a less sensitive measure than the local p-value: the true sites A, B, and C stood out under the local p-values, while they are not as distinct from the false predictions under the PSSM scores.

Table 2. Incorporating Local Sequence Information to Transcription Factor Binding Site Prediction Using Two Types of PSSMs for Myogenin in the Human MYL1 3′ Enhancer (a, b) or for MEF2 in the Human Phosphoglycerate Mutase Promoter (c, d)^a

(a) Using log-odds myogenin PSSMs

Position (bp from       Log-odds      p-values of observed score under local background model
last exon of MYL1)      PSSM score    iid         1st Markov†   2nd Markov    Binding site
1267 (A)                556           0.000008    0.000017      0.000030      AGCAGGTG
1339 (C)                550           0.000015    0.000027      0.000055      GACAGGTG
1323 (B)                548           0.000033    0.000057      0.000112      ACCAGCTG
5434                    556           0.000036    0.000074      0.000095      AGCAGCTG
2463                    550           0.000059    0.000135      0.000179      GCCAGCTG
1235                    531           0.000212    0.000354      0.000442      ACCATGTG
926                     534           0.000181    0.000363      0.000468      TGCAGGTG
2574                    536           0.000225    0.000416      0.000421      GGCAGATG
783                     531           0.000274    0.000453      0.000537      AACATCTG
470                     529           0.000404    0.000624      0.000731      GGAAGCTG

(b) Using entropy-weighted myogenin PSSM

Position (bp from       TRANSFAC      p-values of observed score under local background model
last exon of MYL1)      score         iid         1st Markov†   2nd Markov    Binding site
1267 (A)                4667          0.000008    0.000017      0.00003       AGCAGGTG
1339 (C)                4628          0.000018    0.000032      0.000059      GACAGGTG
5434                    4667          0.000036    0.000074      0.000095      AGCAGCTG
1323 (B)                4581          0.000045    0.000077      0.000127      ACCAGCTG
2463                    4628          0.000068    0.000152      0.000194      GCCAGCTG
2574                    4596          0.000073    0.000177      0.000191      GGCAGATG
926                     4463          0.000224    0.000414      0.000532      TGCAGGTG
7534                    4377          0.000378    0.000534      0.000788      TACAGCTG
7156                    4377          0.000346    0.00054       0.000686      CCCAGCTG
4895                    4322          0.000829    0.001998      0.002045      CTCAGGTG

(c) Using log-odds MEF2 PSSMs

Position (bp relative   Log-odds      p-values of observed score under local background model
to the PGAM-M gene)     PSSM score    iid         1st Markov†   2nd Markov    Binding site
−2970                   669           0.000199    0.000160      0.000228      ATTTTAAATA
−3115                   671           0.000209    0.000183      0.000243      GTTATAAATA
−161                    649           0.000355    0.000183      0.000322      ATTTTAAGCA
−2939                   668           0.000233    0.000190      0.000266      TGTTTAAATC
−3151                   663           0.000807    0.000655      0.000747      TGTTTAAGAA
−4767                   656           0.000951    0.001009      0.001712      TTTTTATATA
−3433                   649           0.003940    0.003099      0.003383      AAACTAAAAA
−3566                   644           0.005710    0.004913      0.005231      TTTTTAAAGC
−3214                   643           0.007155    0.005654      0.006459      AGTTTATATC
−3577                   641           0.007363    0.006312      0.006625      GGTTTAACAT

(d) Using entropy-weighted MEF2 PSSM

Position (bp relative   TRANSFAC      p-values of observed score under local background model
to the PGAM-M gene)     score         iid         1st Markov†   2nd Markov    Binding site
−2970                   591           0.000103    0.000082      0.000133      ATTTTAAATA
−161                    550           0.000191    0.000101      0.000174      ATTTTAAGCA
−3115                   590           0.000206    0.000181      0.000243      GTTATAAATA
−2939                   554           0.001001    0.000807      0.001016      TGTTTAAATC
−3151                   562           0.001091    0.000914      0.001083      TGTTTAAGAA
−4700                   531           0.001451    0.001451      0.001451      TTGTTAAAGA
−3566                   543           0.002961    0.002532      0.002676      TTTTTAAAGC
−3433                   545           0.003271    0.002610      0.002788      AAACTAAAAA
−4444                   532           0.003080    0.003200      0.004320      CATATAATTA
−3687                   535           0.003761    0.003320      0.003671      GAAGTAAAGA

^a Sorted in increasing order by the column marked with †.

PGAM-M MEF2 binding site prediction

A major positive regulatory element is required for the muscle-specific expression of the muscle-specific subunit of the human phosphoglycerate mutase (PGAM-M) gene (Nakatsuji et al., 1992). This element, located 161 bp upstream of the gene, is found to be bound by the transcription factor MEF-2.

We applied the LMM to the 5000 bp PGAM-M upstream region, using the MEF2_Q6 PSSM, to derive the local p-values for each candidate. The top 10 score candidate sites derived using log-odds or entropy-weighted PSSMs are listed in Tables 2c and 2d, respectively, with the true site labeled and shaded in gray, along with their local p-values. We find that LMM behaves here much as it does on the MYL1 enhancer.

Overall, from Table 2, we see that by taking into account the local sequence composition, we have reordered the candidate sequences in a way that is favorable to the true binding sites.

LARGE-SCALE VALIDATION

In order to evaluate the performance of LMM and to compare our local p-values to PSSM similarity scores, we apply both LMM and TRANSFAC to 101 known binding sites in the human genome, obtained by mapping binding sites in the TRANSFAC database onto the human genome. We recorded and evaluated the extent to which LMM and TRANSFAC can capture this large collection of known binding sites in the human genome and the amount of noise generated in so doing.

In Figure 3a, the trade-off between sensitivity and noise is shown in terms of the proportion of the known binding sites detected and the amount of concomitant noise generated. Noise is measured by the noise-to-signal ratio, which is defined as the number of binding site calls not known to be correct divided by the number of known binding sites found. For comparison, we show the tradeoffs achieved by TRANSFAC using its three matrix-specific similarity score cutoffs (FN, SUM, FP), along with that achieved by LMM under Markov models of orders 0, 1, 2, and 3 at various p-value cutoffs, starting at the stringent p = 0.00001. From the inset graph, we see that at all levels of sensitivity, LMM outperformed TRANSFAC by producing significantly less noise. While the performance of LMM comes close to that of TRANSFAC as the p-value cutoff increases, in fact by then, for both methods, the advantage of increased sensitivity has been nullified by the high level of accompanying noise, rendering them impractical. Overall, not only is the sensitivity of LMM comparable to TRANSFAC, its noise-to-signal ratio is also vastly superior. It should be noted that since only a limited number of true binding sites are known, not every unsupported binding site prediction is necessarily a false-positive prediction. Thus, the noise-to-signal ratio overestimates the true noise level, especially when stringent criteria are used to generate putative TFBSs with high sequence similarity to known binding sites. As the criteria relax, the large numbers of predictions over and above the known binding sites imply a high level of true background noise.

FIG. 3. Large-scale validation of TRANSFAC and LMM. Tradeoff between sensitivity and noise. (a) We compared the abilities of the two methods to detect the 101 known binding sites in the human genome by looking at their sensitivity and noise-to-signal ratio. The balance of the tradeoff between these two measures achieved at various significance levels by LMM is traced and compared to that attained by TRANSFAC. The inset graph shows the performance of LMM and TRANSFAC across all levels of sensitivity. (b) Detailed results for p = 0.00001 and 0.0002.
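The two performance measures used in this validation can be written down in a few lines of Python. Sites are represented here as (gene, position) pairs, which is an illustrative choice, and the example sets are invented. Returning infinity when no known site is found is likewise a convention chosen for the sketch.

```python
def sensitivity_and_noise(predicted, known):
    """Return (sensitivity, noise-to-signal ratio) for a set of site calls.

    Sensitivity is the fraction of known sites that were called. The
    noise-to-signal ratio is the number of calls not known to be correct
    divided by the number of known sites found.
    """
    predicted, known = set(predicted), set(known)
    found = predicted & known
    sensitivity = len(found) / len(known)
    unsupported = len(predicted - known)
    noise_to_signal = unsupported / len(found) if found else float("inf")
    return sensitivity, noise_to_signal


# Invented example: 4 known sites, 6 calls, 2 of which hit known sites.
known = {("g1", 10), ("g2", 20), ("g3", 30), ("g4", 40)}
predicted = {("g1", 10), ("g2", 20), ("g1", 99), ("g5", 5), ("g3", 7), ("g2", 1)}
sens, noise = sensitivity_and_noise(predicted, known)
```

As noted above, an unsupported call is not necessarily a false positive, so this ratio is an upper bound on the true noise level.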

More detailed results for TRANSFAC using the three cutoffs, and for LMM using different significance cutoffs (0.00001 and 0.0002) and under different Markov models, are summarized in Fig. 3b. While the FN cutoff missed relatively few known binding sites, it generated more than 45 false-positive predictions for every accurate binding site call. On the other hand, FP made fewer false positives, but it detected only one in nine known binding sites. The SUM cutoff, designed as a balance of these inherent tradeoffs, did strike a reasonable compromise, having generated about nine false positives for every real binding site and detected more than half of the known sites.

At the stringent signi cance cutoff p D 000001 LMM detected about twice the binding sites thandid the FP cutoff and on average produced about 60 fewer false-positive predictions for every correctprediction At the more relaxed p-value cutoff p D 00002 the sensitivity of LMM is comparable tothat of the SUM cutoff while only half of the noise is generated The binding sites that were detectedby LMM at p middot 00002 but missed by TRANSFAC using the SUM cutoff include a MEF2 binding siteover the desmin gene an ATF1 (activating transcription factor 1) binding site over the TGFmacr2 gene aHIF (hypoxia-inducible factor) binding site over the VEGF gene and an ICSBP (IFN consensus sequencebinding protein) binding site over the OAS1 gene We choose p D 00002 as the general signi cance cutofffor the application of LMM to mammalian genomic sequences a cutoff with a suf ciently high sensitivityand an acceptable amount of noise Overall the LMM provides an advantageous tradeoff between noise-to-signal ratio and sensitivity

In our validation experiment, we found that Markov models of orders 1, 2, and 3 have better combinations of high sensitivity and low noise than the iid model, confirming an earlier observation (Liu et al., 2001) that Markov models can better capture the structure of biological sequences. In addition, we compared the performance of the second-order LMM against an analogous global Markov model, with parameters estimated from a large collection of upstream regions, in order to assess the ability of LMM to model the local sequence context information. We found that, over the 101 known human TFBSs in situ, LMM generally outperforms the global Markov model, while they behave similarly at high and low sensitivity levels (Fig. 4). At high sensitivity levels, the lax p-value cutoffs produce large numbers of putative TFBS calls, overwhelming the advantage enjoyed by LMM. At low sensitivity levels, the stringent p-value cutoffs yield only putative TFBSs with undeniable sequence similarity to known binding sites. Thus, the noise-to-signal ratio may not reflect the true noise level in this region.

FIG. 4. The use of local sequence context is advantageous. The performance of the second-order LMM is compared against an analogous global Markov model with parameters estimated from a large collection of upstream regions. The performance is assessed in terms of the noise-to-signal ratio and sensitivity. At the recommended p-value cutoff (0.0002), LMM is more sensitive and less noisy.

DISCUSSION

The work presented in this paper attempts to identify TFBSs by considering simultaneously both their similarity to the query PSSM and their differences from the local genomic context. Through the study of the human TFBSs in TRANSFAC, we show that LMM, which makes putative TFBS calls using local p-values, yields a much better false-positive to true-positive ratio than calls based on the TRANSFAC or log-odds scores alone.

It has been known that neighboring nucleotide compositions can affect the interaction between a transcription factor and its binding site. To the best of our knowledge, however, there is no documented study on whether, and by how much, PSSM-based TFBS detection can be improved by using a local background model. The result we present, which is based on more than 100 experimentally determined TFBS sequences in the human genome, shows a clear overall advantage for incorporating the local sequence context into PSSM-based TFBS search. There are various biological mechanisms that can explain this effect, which may lead to more complicated and more specific models. For instance, it may be that the local 1000 bp genomic region does not contain DNA sequences similar to the true binding site because, otherwise, the target transcription factor may be competed away from its biologically meaningful binding site.

While this improvement does not in itself render a solution to the much more difficult problem of detecting regulatory modules, by significantly reducing false-positive calls for single sites, the local p-value approach will contribute substantially to any subsequent algorithms aiming to detect combinatorial regulatory modules. The method we developed here is seen as a proof of principle and can be used as a component of a more complex approach. For example, considering that clusters of binding sites also often occur within small regions of about 200 bp to cooperatively recruit the transcription factors, a natural future development of LMM would be to take this distance effect into the background estimation and combine the LMM p-values of a few candidate PSSM sites. Many challenging problems in computational biology, e.g., translation initiation site identification, splice site recognition, and RNA secondary structure prediction, can be modeled in terms of the recognition of motifs. Our work may be adapted and extended to these problems as well. However, it should be noted that when applied to protein sequences, which are composed of a 20-letter alphabet, the performance of our algorithm may become an issue, especially when the order of the Markov chain k is large.

DETAILED METHODS

Data extraction for large-scale validation

To evaluate the performance of the LMM, we apply it to known TFBSs in the human genome. Known binding sites are extracted from the SITE table of the TRANSFAC database, version 6.2. About half of the 12,262 binding sites in this table are experimentally derived from various species. The rest are generated from in vitro binding assays on artificial nucleotide sequences. Since LMM studies binding sites with respect to their genomic contexts, these artificial sequences, which do not correspond to any genomic region, cannot be used for our validation study. Of the 6,073 in vivo binding sites, 1,425 sites are based on the human genome. Of these, 149 (10.5%) are annotated with a corresponding PSSM. We use these binding sites for validation.


To locate the known TFBSs in the human genome, we focus on the 5000 bp upstream sequences of all genes. We made use of the annotations provided by Ensembl (Hubbard et al., 2002) and extracted 22,808 human gene promoters from the human genome assembly NCBI golden path 29 (www.ensembl.org/Homo_sapiens). Since heuristic sequence-mapping algorithms do not perform well on short sequences such as TFBSs, we use an exact-match algorithm based on suffix trees (Gusfield, 1997). We found that many binding site sequences are precisely mapped onto the promoters of the correct target genes. For those binding sites with mappings onto multiple promoters or with no mapping, we attempted to retrieve them by manual review. To find the correct one among multiple mappings, we made correspondences between the Ensembl gene name and the target gene name of the binding site as recorded by TRANSFAC. A review of some missed matches using inexact match algorithms revealed a small number of single-basepair differences between the recorded binding site sequences and the promoter sequences of the target genes, for example, the binding site HS$ALBU_06 over the human albumin promoter. After validating against the primary literature for the positions of these binding sites, we included these mappings as well. In total, we located 101 human TFBSs.
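The exact-match step can be illustrated with a minimal Python sketch. It is not the suffix-tree implementation used in the study: it swaps in Python's built-in substring search, which returns the same exact matches on small inputs but does not scale the same way. The function names and the choice to scan both strands are assumptions made for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def exact_matches(site, promoter):
    """Find every exact occurrence of `site` in `promoter` on either strand.

    Returns a sorted list of (0-based start on the forward strand, strand)
    pairs. Overlapping occurrences are reported as well.
    """
    hits = set()
    for strand, pattern in (("+", site), ("-", reverse_complement(site))):
        start = promoter.find(pattern)
        while start != -1:
            hits.add((start, strand))
            start = promoter.find(pattern, start + 1)
    return sorted(hits)


# Hypothetical toy promoter containing one forward and one reverse-strand copy.
toy_promoter = "TTAGCAGGTGAACACCTGCTAA"
print(exact_matches("AGCAGGTG", toy_promoter))
```

A site with no hit, or with hits on several promoters, would then be passed to manual review, as described above.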

Local p-value calculation

Although the exact score distribution can be obtained by enumerating all possible binding site sequences under any "null" model for the observed nucleotide base pairs, the computational cost for a PSSM of length $p$ is $4^p$. Staden's method (Staden, 1989), which turns this into an order-$p$ computation, is based on the probability generating function (PGF) of the score under the simple null model that the base pairs are independent and identically distributed (iid). Recently, however, there is some evidence suggesting that Markov background models work better than the iid model for detecting TFBSs (Liu et al., 2002). By extending Staden's PGF method to dependent random variables, we present here the derivation of the PGFs under a first-order Markov model, the basis of the efficient algorithm for computing the exact score distribution.

Probability generating function derivation

In our study, we make use of the PSSMs constructed by TRANSFAC, version 6.2. Given a PSSM $m = (w_{ij})_{p \times 4}$, where $i = 1, \ldots, p$ and $j = A, C, G, T$, the match score $S$ and the similarity score $S/S_{\max}$ of a sequence $D_1 D_2 \cdots D_p$ are defined as (Quandt et al., 1995)

$$S = \sum_{i=1}^{p} w_{iD_i} \quad \text{and} \quad S/S_{\max} = \sum_{i=1}^{p} w_{iD_i} \Big/ \sum_{i=1}^{p} \max_{j} \{w_{ij}\}.$$
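As a hedged illustration of the two scores defined above, the following Python sketch computes $S$ and $S/S_{\max}$ for a toy PSSM. The matrix values, the A, C, G, T column order, and the function names are hypothetical and are not taken from TRANSFAC.

```python
BASES = "ACGT"  # assumed column order of each PSSM row


def match_score(pssm, seq):
    """S: sum over positions i of the weight of the observed base D_i."""
    return sum(row[BASES.index(base)] for row, base in zip(pssm, seq))


def similarity_score(pssm, seq):
    """S / S_max, where S_max adds up the largest weight in every row."""
    s_max = sum(max(row) for row in pssm)
    return match_score(pssm, seq) / s_max


# Hypothetical 3-position PSSM with integer weights.
toy_pssm = [
    [3, 0, 1, 0],
    [0, 4, 0, 1],
    [1, 0, 5, 0],
]
print(match_score(toy_pssm, "ACG"), similarity_score(toy_pssm, "ACG"))
```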

Let $S$ be a random variable taking integer values; then its probability generating function $G(t)$ is the expected value of $t^S$. $G(t)$ is a polynomial, and the coefficient of the term $t^n$ is the probability of the event $S = n$ (Gut, 1995).

Given a PSSM $m$ of length $p$, under the assumption that the DNA sequence is iid, Staden provided the PGF of the match score in the form of a product of $p$ polynomials (Staden, 1989):

$$G(t) = \prod_{i=1}^{p} \sum_{j \in \{A, C, G, T\}} f_j \, t^{w_{ij}},$$

where $f_j$ is the frequency of letter $j$ in the iid DNA sequence. For the first-order Markov case ($k = 1$), let the transition matrix be $P = (f_{\alpha|\beta})_{4 \times 4}$ and the stationary distribution of the Markov chain be $\pi$ (viewed as a four-dimensional row vector). Then the PGF under the first-order Markov model is

$$G(t) = \pi \left[ \prod_{i=1}^{p} P \, M_i(t) \right] I, \qquad (*)$$

where $M_i(t) = \mathrm{Diag}(t^{w_{iA}}, t^{w_{iC}}, t^{w_{iG}}, t^{w_{iT}})$ and $I = (1, 1, 1, 1)^T$ (proof provided at the end of this section).
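One way to read equation (*) computationally is as a dynamic program over PSSM positions, in which each of the four components of the row vector is a polynomial in $t$ stored as a score-to-probability map. The Python sketch below follows that reading. It is an illustration under stated assumptions, not the authors' implementation: scores are assumed to be integers, `P[b][a]` is assumed to hold the probability of base `a` following base `b`, the stationary vector is approximated by power iteration, and all names are hypothetical.

```python
from collections import defaultdict


def stationary(P, iters=1000):
    """Approximate the row vector pi with pi P = pi by power iteration."""
    pi = [0.25] * 4
    for _ in range(iters):
        pi = [sum(pi[b] * P[b][a] for b in range(4)) for a in range(4)]
    return pi


def score_distribution(pssm, P, pi):
    """Exact distribution of the match score: the coefficients of G(t)."""
    # state[b] is a polynomial in t (score -> probability) for paths whose
    # most recent base is b; initially it is the constant pi[b].
    state = [{0: pi[b]} for b in range(4)]
    for row in pssm:
        new_state = [defaultdict(float) for _ in range(4)]
        for b in range(4):
            for score, prob in state[b].items():
                for a in range(4):
                    # Multiply by P (move b -> a), then by M_i (shift by w_ia).
                    new_state[a][score + row[a]] += prob * P[b][a]
        state = new_state
    dist = defaultdict(float)
    for poly in state:  # right-multiplication by I = (1, 1, 1, 1)^T
        for score, prob in poly.items():
            dist[score] += prob
    return dict(dist)


def p_value(dist, observed):
    """Probability of a score at least as large as the observed one."""
    return sum(prob for score, prob in dist.items() if score >= observed)
```

Each position touches every (score, base) pair four times, so the work grows with the number of distinct scores and linearly with the PSSM length.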

Since a Markov chain of order $k$ on a set $\Omega$ is equivalently a first-order Markov chain on the set $\Omega^k$, with a little modification on $M_i(t)$ we can generalize the above results to $k > 1$. An example of the PGF for $k = 3$ is in the online supplement (www.biostat.harvard.edu/complab/LMM). Using this representation for the PGF, we developed and implemented an algorithm using C++ to calculate the exact score distribution. Generally, for a $k$th-order Markov chain and a PSSM of length $p$, the time complexity of our algorithm is $O(4^k \cdot S_{\max} \cdot p)$: linear in the matrix length, but exponential in the order of the Markov chain. The source code is available upon request (www.biostat.hsph.harvard.edu/LMM).

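Proof of equation (*). For ease of notation and without loss of generality, we let $p$, the length of the PSSM, be 3.

For a DNA sequence $D_1 D_2 D_3$, its match score against PSSM $m$ is $w_{1D_1} + w_{2D_2} + w_{3D_3}$, and the probability of the occurrence of $D_1 D_2 D_3$ is $f_{D_1} f_{D_2|D_1} f_{D_3|D_2}$. By definition, the PGF of the match score against $m$ is

$$\sum_{D_1 D_2 D_3} f_{D_1} f_{D_2|D_1} f_{D_3|D_2} \, t^{w_{1D_1} + w_{2D_2} + w_{3D_3}} = \sum_{D_1 D_2 D_3} f_{D_1} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}.$$

In the following, we derive the PGF in the alternative form of a product of $p$ matrices. First,

$$\begin{aligned}
\sum_{D_1 D_2 D_3} & f_{D_1} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&= \sum_{D_1 D_2 D_3} \sum_{a} f_a \cdot f_{D_1|a} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&= \sum_{a} f_a \cdot \sum_{D_1 D_2 D_3} f_{D_1|a} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&= (f_A, f_C, f_G, f_T) \cdot \left( \sum_{D_1 D_2 D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_1 D_2 D_3} f_{D_1|T} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \right)^T.
\end{aligned}$$

For the component of the second vector corresponding to base $A$,

$$\begin{aligned}
\sum_{D_1 D_2 D_3} & f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&= \sum_{D_1} f_{D_1|A} t^{w_{1D_1}} \cdot \sum_{D_2 D_3} f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&= \left( f_{A|A} t^{w_{1A}}, \; f_{C|A} t^{w_{1C}}, \; f_{G|A} t^{w_{1G}}, \; f_{T|A} t^{w_{1T}} \right) \cdot \left( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \right)^T.
\end{aligned}$$

We apply similar arguments to the components corresponding to bases $C$, $G$, and $T$ and obtain, for each $b \in \{C, G, T\}$,

$$\begin{aligned}
\sum_{D_1 D_2 D_3} & f_{D_1|b} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&= \left( f_{A|b} t^{w_{1A}}, \; f_{C|b} t^{w_{1C}}, \; f_{G|b} t^{w_{1G}}, \; f_{T|b} t^{w_{1T}} \right) \cdot \left( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \right)^T.
\end{aligned}$$

Therefore, for the first position, we have

$$\begin{aligned}
& \left( \sum_{D_1 D_2 D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_1 D_2 D_3} f_{D_1|T} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \right)^T \\
&\quad = \begin{pmatrix}
f_{A|A} t^{w_{1A}} & f_{C|A} t^{w_{1C}} & f_{G|A} t^{w_{1G}} & f_{T|A} t^{w_{1T}} \\
f_{A|C} t^{w_{1A}} & f_{C|C} t^{w_{1C}} & f_{G|C} t^{w_{1G}} & f_{T|C} t^{w_{1T}} \\
f_{A|G} t^{w_{1A}} & f_{C|G} t^{w_{1C}} & f_{G|G} t^{w_{1G}} & f_{T|G} t^{w_{1T}} \\
f_{A|T} t^{w_{1A}} & f_{C|T} t^{w_{1C}} & f_{G|T} t^{w_{1G}} & f_{T|T} t^{w_{1T}}
\end{pmatrix} \cdot \begin{pmatrix}
\sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|C} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|G} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix} \\
&\quad = P \cdot \mathrm{Diag}(t^{w_{1A}}, t^{w_{1C}}, t^{w_{1G}}, t^{w_{1T}}) \cdot \left( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \right)^T \\
&\quad = P \cdot M_1(t) \cdot \left( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \right)^T.
\end{aligned}$$

Further applying the above arguments to positions 2 and 3, we have

$$\begin{aligned}
& \left( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \right)^T \\
&\quad = P \cdot \mathrm{Diag}(t^{w_{2A}}, t^{w_{2C}}, t^{w_{2G}}, t^{w_{2T}}) \cdot \left( \sum_{D_3} f_{D_3|A} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_3} f_{D_3|T} t^{w_{3D_3}} \right)^T \\
&\quad = P \cdot \mathrm{Diag}(t^{w_{2A}}, t^{w_{2C}}, t^{w_{2G}}, t^{w_{2T}}) \cdot P \cdot \mathrm{Diag}(t^{w_{3A}}, t^{w_{3C}}, t^{w_{3G}}, t^{w_{3T}}) \cdot (1, 1, 1, 1)^T \\
&\quad = P \cdot M_2(t) \cdot P \cdot M_3(t) \cdot I.
\end{aligned}$$

Above all, $G(t) = \pi \left[ \prod_{i=1}^{p} P \, M_i(t) \right] I$.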

ACKNOWLEDGMENTS

The work of H.H., X.Z., and W.H.W. is supported by NSF grants DBI-0196176 and DMS-0090166. The work of H.H. and J.S.L. is supported by NSF grant DMS-0204674 and NIH grant R01 HG02518-01. The work of M.-C.J.K. is supported by the Howard Hughes Medical Institute predoctoral fellowship.

REFERENCES

Bailey TL and Elkan C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2, 28–36.

Chen QK, Hertz GZ, and Stormo GD. 1995. MATRIX SEARCH 1.0: A computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput Appl Biosci 11, 563–566.

Durbin R, Eddy SR, Krogh A, and Mitchison G. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.

Fickett JW. 1996. Coordinate positioning of MEF2 and myogenin binding sites. Gene 172, GC19–32.

Fried M and Crothers DM. 1981. Equilibria and kinetics of lac repressor-operator interactions by polyacrylamide gel electrophoresis. Nucl Acids Res 9, 6505–6525.

Galas DJ and Schmitz A. 1978. DNAse footprinting: A simple method for the detection of protein–DNA binding specificity. Nucl Acids Res 5, 3157–3170.

Garner MM and Revzin A. 1981. A gel electrophoresis method for quantifying the binding of proteins to specific DNA regions: Application to components of the Escherichia coli lactose operon regulatory system. Nucl Acids Res 9, 3047–3060.

Gusfield D. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, England.

Gut A. 1995. An Intermediate Course in Probability. Springer-Verlag, New York.

Hertz GZ, Hartzell GW 3rd, and Stormo GD. 1990. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput Appl Biosci 6, 81–92.

Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Huminiecki L, Kasprzyk A, Lehvaslaiho H, Lijnzaad P, Melsopp C, Mongin E, Pettett R, Pocock M, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I, and Clamp M. 2002. The Ensembl genome database project. Nucl Acids Res 30, 38–41.

Hughes JD, Estep PW, Tavazoie S, and Church GM. 2000. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol 296, 1205–1214.

Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, Szustakowki J, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, and Chen YJ. 2001. Initial sequencing and analysis of the human genome. Nature 409, 860–921.

Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, and Wootton JC. 1993. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262, 208–214.

Lawrence CE and Reilly AA. 1990. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 7, 41–51.

Liu X, Brutlag DL, and Liu JS. 2001. BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput, 127–138.

Liu XS, Brutlag DL, and Liu JS. 2002. An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat Biotechnol.

Nakatsuji Y, Hidaka K, Tsujino S, Yamamoto Y, Mukai T, Yanagihara T, Kishimoto T, and Sakoda S. 1992. A single MEF-2 site is a major positive regulatory element required for transcription of the muscle-specific subunit of the human phosphoglycerate mutase gene in skeletal and cardiac muscle cells. Mol Cell Biol 12, 4384–4390.

Quandt K, Frech K, Karas H, Wingender E, and Werner T. 1995. MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucl Acids Res 23, 4878–4884.

Rosenthal N, Berglund EB, Wentworth BM, Donoghue M, Winter B, Bober E, Braun T, and Arnold HH. 1990. A highly conserved enhancer downstream of the human MLC1/3 locus is a target for multiple myogenic determination factors. Nucl Acids Res 18, 6239–6246.

Roth FP, Hughes JD, Estep PW, and Church GM. 1998. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol 16, 939–945.

Staden R. 1989. Methods for calculating the probabilities of finding patterns in sequences. Comput Appl Biosci 5, 89–96.

Stormo GD and Hartzell GW 3rd. 1989. Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci USA 86, 1183–1187.

Wentworth BM, Donoghue M, Engert JC, Berglund EB, and Rosenthal N. 1991. Paired MyoD-binding sites regulate myosin light chain gene expression. Proc Natl Acad Sci USA 88, 1242–1246.

Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, Meinhardt T, Pruss M, Reuter I, and Schacherer F. 2000. TRANSFAC: An integrated system for gene expression regulation. Nucl Acids Res 28, 316–319.

Address correspondence to:
Jun S. Liu, Wing H. Wong
Department of Statistics
Science Center, 6th floor
1 Oxford Street
Cambridge, MA 02138

E-mail: {jliu, wwong}@stat.harvard.edu


FIG. 2. TRANSFAC vs. local Markov model (LMM) in the identification of transcription factor binding sites (TFBSs) in a given genomic sequence. (a) TRANSFAC scans a genomic sequence, generates similarity scores of each subsequence against a given PSSM, and uses three matrix-specific cutoffs (FN, FP, and SUM) to make putative calls. The three sets of cutoffs attempt to minimize false-negative error, false-positive error, or the sum of these two errors, respectively. (b) LMM begins by selecting the top 0.1% candidate sites based on their PSSM similarity scores, since sites with low similarity scores are unlikely to be true binding sites. For each candidate TFBS, LMM models the DNA sequence segment of length L (e.g., 1000) centered around the target site as a homogeneous Markov chain of order k = 0, 1, 2, or 3. Under the estimated Markov model, LMM calculates the probability distribution of the similarity score using our algorithm. This distribution then allows us to assign statistical significance to the given candidate TFBS.

We implement the algorithm in C and incorporate the program into the local Markov method (LMM) program to study TFBSs in situ by evaluating each candidate binding site with respect to its local genomic context. A summary of the LMM method is in Fig. 2b. For comparison, we also describe in Fig. 2a the prediction program that accompanies the TRANSFAC database.
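The local background step of LMM (a segment of length L centered on the candidate site, modeled as a homogeneous Markov chain) can be sketched in Python as follows. This is an illustrative first-order version only. The function names, the handling of windows that run past the sequence ends, the optional pseudocount, and the uniform fallback for bases that never occur are assumptions rather than details taken from the LMM program.

```python
BASES = "ACGT"


def local_window(sequence, site_start, site_len, L=1000):
    """Return up to L bases centered on the candidate site.

    The window is truncated, not shifted, when it would run past either end.
    """
    center = site_start + site_len // 2
    lo = max(0, center - L // 2)
    hi = min(len(sequence), lo + L)
    return sequence[lo:hi]


def estimate_first_order(window, pseudocount=0.0):
    """Maximum likelihood transition matrix from dinucleotide counts.

    P[b][a] estimates the probability that base a follows base b.
    Characters outside ACGT are skipped.
    """
    idx = {base: i for i, base in enumerate(BASES)}
    counts = [[pseudocount] * 4 for _ in range(4)]
    for prev, cur in zip(window, window[1:]):
        if prev in idx and cur in idx:
            counts[idx[prev]][idx[cur]] += 1
    P = []
    for row in counts:
        total = sum(row)
        # Assumed fallback: a base never seen as a predecessor gets a uniform row.
        P.append([c / total for c in row] if total > 0 else [0.25] * 4)
    return P
```

With a 1000 bp window and a first-order model, each of the 16 transition probabilities is estimated from roughly 60 dinucleotides on average; higher orders spread the same data over more parameters.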

To assess the efficiency of our algorithm, we compare it against Monte Carlo simulations. Our experiments showed that our approach can be many times more efficient than Monte Carlo simulation. For example, at p ≤ 0.0001, a sequence of length at least $10^9$ base pairs needs to be simulated in order to obtain a sufficiently accurate cutoff value (relative error ≤ 1%), which needs more than 1000-fold more computing time than our exact algorithm (Table 1).
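For contrast with the exact computation, a Monte Carlo estimate of a score cutoff might look like the sketch below. It simulates one long sequence from a first-order chain, scores every window, and takes an empirical quantile. The function names, the seeding, and the quantile rule are assumptions made for illustration; this is not the simulation code behind Table 1.

```python
import random


def simulate_chain(P, pi, n, rng):
    """Draw n bases (encoded 0..3) from a first-order Markov chain."""
    seq = rng.choices(range(4), weights=pi, k=1)
    for _ in range(n - 1):
        seq.append(rng.choices(range(4), weights=P[seq[-1]], k=1)[0])
    return seq


def monte_carlo_cutoff(pssm, P, pi, n, p, seed=0):
    """Estimate the score cutoff at significance level p from n simulated bases.

    Returns the k-th largest window score, where k is p times the number
    of windows (at least 1). Ties can push the realized tail above p.
    """
    rng = random.Random(seed)
    seq = simulate_chain(P, pi, n, rng)
    width = len(pssm)
    scores = sorted(
        (sum(pssm[i][seq[j + i]] for i in range(width))
         for j in range(n - width + 1)),
        reverse=True,
    )
    k = max(1, int(p * len(scores)))
    return scores[k - 1]
```

Because only about p times n windows fall in the tail, the estimate rests on few observations unless n is very large, which is the cost the exact algorithm avoids.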

APPLICATIONS

MYL1 3′ enhancer myogenin binding site prediction

In human, mouse, and rat, there is a well-conserved, 200-bp-long, skeletal muscle-specific enhancer about 24 kb 3′ of MYL1 (Rosenthal et al., 1990; Wentworth et al., 1991). Three myogenic determination factor binding sites, A, B, and C, are found in this region, which are located 1267 bp, 1323 bp, and 1339 bp,


Table 1. Running Time Comparison of LMM with Monte Carlo Simulation (a)

                            Significance at p ≤ 0.001    Significance at p ≤ 0.0001
                            Relative error (%)           Relative error (%)
Simulation    Time (sec)    Mean       SD                Mean       SD
N = 10^5      0.06          6.5        4.3               24.0       16.9
N = 10^6      0.6           1.7        1.2               7.7        4.5
N = 10^7      5.7           1.0        0.7               2.6        1.7
N = 10^8      57            1.0        0.5               2.0        0.8
N = 10^9      570           0.7        0.6               0.6        0.4
N = 2^32      1228          0.6        0.6               0.6        0.4
LMM           0.43          0                            0

(a) For 10 randomly chosen intergenic regions, we estimate a 2nd-order Markov model for each sequence using maximum likelihood estimation. Under the 10 estimated Markov models, we use our algorithm to derive, and Monte Carlo simulation to estimate, the score cutoffs of the p53 PSSM at two significance levels, p ≤ 0.001 and p ≤ 0.0001, on a 1500 MHz AMD Athlon machine running Linux. To assess the difference between the cutoffs $C_p$ derived by our algorithm and the cutoffs $\hat{C}_p$ estimated by simulations, we consider the p-value $F(C_p)$ attained by $C_p$ and the true p-value $F(\hat{C}_p)$ of $\hat{C}_p$ derived using our algorithm. We assess the relative errors of the simulation estimate by calculating $|F(C_p) - F(\hat{C}_p)| / F(C_p)$.

respectively, downstream of the last exon of MYL1 in the human genome. Sites A and B are myogenin/myf4 binding sites (Rosenthal et al., 1990), while site C is a MyoD binding site (Wentworth et al., 1991), also considered to be a myogenin binding site (Fickett, 1996).

We applied the LMM to the 10,000 bp MYL1 downstream region (starting from the end of the last exon) to derive the local p-values for each candidate. The local p-value for each candidate is the statistical significance of observing its score (derived by both log-odds and entropy-related PSSMs), assuming that it is generated under a local random model, where Markov models of different orders (e.g., 0, 1, or 2) are used, with parameters estimated from the local 1000 bp genomic sequence centered at the candidate. The top 10 score candidate sites derived using log-odds or entropy-weighted PSSMs are listed in Tables 2a and 2b,

Table 2. Incorporating Local Sequence Information to Transcription Factor Binding Site Prediction Using Two Types of PSSMs for Myogenin in the Human MYL1 3′ Enhancer (a, b) or for MEF2 in the Human Phosphoglycerate Mutase Promoter (c, d) (*)

(a) Using log-odds myogenin PSSMs

Position (bp from      Log-odds      p-values of observed score under local background model
last exon of MYL1)     PSSM score    iid         1st Markov†   2nd Markov    Binding site
1267 (A)               556           0.000008    0.000017      0.000030      AGCAGGTG
1339 (C)               550           0.000015    0.000027      0.000055      GACAGGTG
1323 (B)               548           0.000033    0.000057      0.000112      ACCAGCTG
5434                   556           0.000036    0.000074      0.000095      AGCAGCTG
2463                   550           0.000059    0.000135      0.000179      GCCAGCTG
1235                   531           0.000212    0.000354      0.000442      ACCATGTG
926                    534           0.000181    0.000363      0.000468      TGCAGGTG
2574                   536           0.000225    0.000416      0.000421      GGCAGATG
783                    531           0.000274    0.000453      0.000537      AACATCTG
470                    529           0.000404    0.000624      0.000731      GGAAGCTG

(b) Using entropy-weighted myogenin PSSM

Position (bp from      TRANSFAC      p-values of observed score under local background model
last exon of MYL1)     score         iid         1st Markov†   2nd Markov    Binding site
1267 (A)               4667          0.000008    0.000017      0.00003       AGCAGGTG
1339 (C)               4628          0.000018    0.000032      0.000059      GACAGGTG
5434                   4667          0.000036    0.000074      0.000095      AGCAGCTG
1323 (B)               4581          0.000045    0.000077      0.000127      ACCAGCTG
2463                   4628          0.000068    0.000152      0.000194      GCCAGCTG
2574                   4596          0.000073    0.000177      0.000191      GGCAGATG
926                    4463          0.000224    0.000414      0.000532      TGCAGGTG
7534                   4377          0.000378    0.000534      0.000788      TACAGCTG
7156                   4377          0.000346    0.00054       0.000686      CCCAGCTG
4895                   4322          0.000829    0.001998      0.002045      CTCAGGTG

(c) Using log-odds MEF2 PSSMs

Position (bp from      Log-odds      p-values of observed score under local background model
last exon of MYL1)     PSSM score    iid         1st Markov†   2nd Markov    Binding site
−2970                  669           0.000199    0.000160      0.000228      ATTTTAAATA
−3115                  671           0.000209    0.000183      0.000243      GTTATAAATA
−161                   649           0.000355    0.000183      0.000322      ATTTTAAGCA
−2939                  668           0.000233    0.000190      0.000266      TGTTTAAATC
−3151                  663           0.000807    0.000655      0.000747      TGTTTAAGAA
−4767                  656           0.000951    0.001009      0.001712      TTTTTATATA
−3433                  649           0.003940    0.003099      0.003383      AAACTAAAAA
−3566                  644           0.005710    0.004913      0.005231      TTTTTAAAGC
−3214                  643           0.007155    0.005654      0.006459      AGTTTATATC
−3577                  641           0.007363    0.006312      0.006625      GGTTTAACAT

(d) Using entropy-weighted MEF2 PSSM

Position (bp from      TRANSFAC      p-values of observed score under local background model
last exon of MYL1)     score         iid         1st Markov†   2nd Markov    Binding site
−2970                  591           0.000103    0.000082      0.000133      ATTTTAAATA
−161                   550           0.000191    0.000101      0.000174      ATTTTAAGCA
−3115                  590           0.000206    0.000181      0.000243      GTTATAAATA
−2939                  554           0.001001    0.000807      0.001016      TGTTTAAATC
−3151                  562           0.001091    0.000914      0.001083      TGTTTAAGAA
−4700                  531           0.001451    0.001451      0.001451      TTGTTAAAGA
−3566                  543           0.002961    0.002532      0.002676      TTTTTAAAGC
−3433                  545           0.003271    0.002610      0.002788      AAACTAAAAA
−4444                  532           0.003080    0.003200      0.004320      CATATAATTA
−3687                  535           0.003761    0.003320      0.003671      GAAGTAAAGA

(*) Sorted in increasing order by the column marked with †.


respectively, with the true sites A, B, and C labeled and shaded in gray, along with their local p-values. We find the PSSM scores to be a less sensitive measure than the local p-value: the true sites A, B, and C stood out under the local p-values, while they are not as distinct from the false predictions under the PSSM scores.

PGAM-M MEF2 binding site prediction

A major positive regulatory element is required for the muscle-specific expression of the muscle-specific subunit of the human phosphoglycerate mutase (PGAM-M) gene (Nakatsuji et al., 1992). This element, located 161 bp upstream of the gene, is found to be bound by the transcription factor MEF-2.

We applied the LMM to the 5000 bp PGAM-M upstream region using the MEF2_Q6 PSSM to derive the local p-values for each candidate. The top 10 score candidate sites derived using log-odds or entropy-weighted PSSMs are listed in Tables 2c and 2d, respectively, with the true site labeled and shaded in gray, along with their local p-values. We find that LMM behaves here much as it does in the MYL1 enhancer.

Overall, from Table 2, we see that by taking into account the local sequence composition, we have reordered the candidate sequences in a way that is favorable to the true binding sites.

LARGE-SCALE VALIDATION

In order to evaluate the performance of LMM and to compare our local p-values to PSSM similarity scores, we apply both LMM and TRANSFAC to 101 known binding sites in the human genome, obtained by mapping binding sites in the TRANSFAC database onto the human genome. We recorded and evaluated the extent to which LMM and TRANSFAC can capture this large collection of known binding sites in the human genome and the amount of noise generated in so doing.

In Figure 3a, the tradeoff between sensitivity and noise is shown in terms of the proportion of the known binding sites detected and the amount of concomitant noise generated. Noise is measured by the noise-to-signal ratio, which is defined as the number of binding site calls not known to be correct divided by the number of known binding sites found. For comparison, we show the tradeoffs achieved by TRANSFAC using its three matrix-specific similarity score cutoffs (FN, SUM, FP), along with that achieved by LMM under Markov models of orders 0, 1, 2, and 3 at various p-value cutoffs, starting at the stringent p = 0.00001. From the inset graph, we see that at all levels of sensitivity LMM outperformed TRANSFAC

FIG. 3. Large-scale validation of TRANSFAC and LMM: tradeoff between sensitivity and noise. (a) We compared the abilities of the two methods to detect the 101 known binding sites in the human genome by looking at their sensitivity and noise-to-signal ratio. The balance of the tradeoff between these two measures achieved at various significance levels by LMM is traced and compared to that attained by TRANSFAC. The inset graph shows the performance of LMM and TRANSFAC across all levels of sensitivity. (b) Detailed results for p = 0.00001 and 0.0002.


by producing significantly less noise. While the performance of LMM comes close to that of TRANSFAC as the p-value cutoff increases, in fact, by then, for both methods, the advantage of increased sensitivity has been nullified by the high level of accompanying noise, rendering them impractical. Overall, not only is the sensitivity of LMM comparable to TRANSFAC, its noise-to-signal ratio is also vastly superior. It should be noted that, since only a limited number of true binding sites are known, not every unsupported binding site prediction is necessarily a false-positive prediction. Thus, the noise-to-signal ratio overestimates the true noise level, especially when stringent criteria are used to generate putative TFBSs with high sequence similarity to known binding sites. As the criteria relax, the large numbers of predictions over and above the known binding sites imply a high level of true background noise.
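The two evaluation measures used in this validation can be written down directly. The sketch below is a minimal Python illustration in which predicted and known sites are represented by hypothetical hashable identifiers (for example, promoter and position pairs); the function name and the convention of returning infinity when no known site is found are assumptions.

```python
def evaluate_calls(predicted, known):
    """Return (sensitivity, noise-to-signal ratio) for a set of site calls.

    Sensitivity is the fraction of known sites that were called.
    Noise-to-signal is the number of calls not known to be correct divided
    by the number of known sites found.
    """
    predicted, known = set(predicted), set(known)
    found = predicted & known
    sensitivity = len(found) / len(known)
    unsupported = len(predicted - known)
    noise_to_signal = unsupported / len(found) if found else float("inf")
    return sensitivity, noise_to_signal


# Hypothetical example: 4 known sites, 6 calls, 2 of which are known sites.
print(evaluate_calls({1, 2, 10, 11, 12, 13}, {1, 2, 3, 4}))
```

As noted above, unsupported calls are not necessarily false positives, so this ratio is an upper bound on the true noise level.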

More detailed results for TRANSFAC using the three cutoffs and for LMM using different signi cancecutoffs 000001 and 00002 and under different Markov models are summarized in Fig 3b While theFN cut off missed relatively few known binding sites it generated more than 45 false-positive predictionsfor every accurate binding site call On the other hand FP made fewer false positives but it detected onlyone in nine known binding sites The SUM cutoff designed as a balance of these inherent tradeoffs didstrike a reasonable compromise having generated about nine false positives for every real binding site anddetected more than half of the known sites

At the stringent signi cance cutoff p D 000001 LMM detected about twice the binding sites thandid the FP cutoff and on average produced about 60 fewer false-positive predictions for every correctprediction At the more relaxed p-value cutoff p D 00002 the sensitivity of LMM is comparable tothat of the SUM cutoff while only half of the noise is generated The binding sites that were detectedby LMM at p middot 00002 but missed by TRANSFAC using the SUM cutoff include a MEF2 binding siteover the desmin gene an ATF1 (activating transcription factor 1) binding site over the TGFmacr2 gene aHIF (hypoxia-inducible factor) binding site over the VEGF gene and an ICSBP (IFN consensus sequencebinding protein) binding site over the OAS1 gene We choose p D 00002 as the general signi cance cutofffor the application of LMM to mammalian genomic sequences a cutoff with a suf ciently high sensitivityand an acceptable amount of noise Overall the LMM provides an advantageous tradeoff between noise-to-signal ratio and sensitivity

In our validation experiment, we found that Markov models of orders 1, 2, and 3 have better combinations of high sensitivity and low noise than the i.i.d. model, confirming an earlier observation (Liu et al., 2001) that Markov models can better capture the structure of biological sequences. In addition, we compared the performance of the second-order LMM against an analogous global Markov model, with parameters estimated from a large collection of upstream regions, in order to assess the ability of LMM to model the local sequence context information. We found that, over the 101 known human TFBSs in situ, LMM generally outperforms the global Markov model, while the two behave similarly at high and low sensitivity levels (Fig. 4). At high sensitivity levels, the lax p-value cutoffs produce large numbers of putative TFBS calls, overwhelming the advantage enjoyed by LMM. At low sensitivity levels, the stringent p-value cutoffs yield only putative TFBSs with undeniable sequence similarity to known binding sites. Thus, the noise-to-signal ratio may not reflect the true noise level in this region.

FIG. 4. The use of local sequence context is advantageous. The performance of the second-order LMM is compared against an analogous global Markov model with parameters estimated from a large collection of upstream regions. The performance is assessed in terms of the noise-to-signal ratio and sensitivity. At the recommended p-value cutoff 0.0002, LMM is more sensitive and less noisy.

DISCUSSION

The work presented in this paper attempts to identify TFBSs by considering simultaneously both their similarity to the query PSSM and their differences from the local genomic context. Through the study of the human TFBSs in TRANSFAC, we show that LMM, which makes putative TFBS calls using local p-values, yields a much better false-positive to true-positive ratio than calls based on the TRANSFAC or log-odds scores alone.

It has been known that neighboring nucleotide compositions can affect the interaction between a transcription factor and its binding site. To the best of our knowledge, however, there is no documented study on whether, and by how much, PSSM-based TFBS detection can be improved using a local background model. The result we present, which is based on more than 100 experimentally determined TFBS sequences in the human genome, shows a clear overall advantage for incorporating the local sequence context into PSSM-based TFBS search. There are various biological mechanisms that can explain this effect, which may lead to more complicated and more specific models. For instance, it may be that the local 1,000 bp genomic region does not contain DNA sequences similar to the true binding site, because otherwise the target transcription factor may be competed away from its biologically meaningful binding site.

While this improvement does not in itself render a solution to the much more difficult problem of detecting regulatory modules, by significantly reducing false-positive calls for single sites the local p-value approach will contribute substantially to any subsequent algorithms aiming to detect combinatorial regulatory modules. The method we developed here is seen as a proof of principle and can be used as a component of a more complex approach. For example, considering that clusters of binding sites often occur within small regions of about 200 bp to cooperatively recruit the transcription factors, a natural future development of LMM would be to take this distance effect into the background estimation and combine the LMM p-values of a few candidate PSSM sites. Many challenging problems in computational biology, e.g., translation initiation site identification, splice site recognition, and RNA secondary structure prediction, can be modeled in terms of the recognition of motifs. Our work may be adapted and extended to these problems as well. However, it should be noted that, when applied to protein sequences, which are composed of a 20-letter alphabet, the performance of our algorithm may become an issue, especially when the order of the Markov chain k is large.

DETAILED METHODS

Data extraction for large-scale validation

To evaluate the performance of the LMM, we apply it to known TFBSs in the human genome. Known binding sites are extracted from the SITE table of the TRANSFAC database, version 6.2. About half of the 12,262 binding sites in this table are experimentally derived from various species. The rest are generated from in vitro binding assays on artificial nucleotide sequences. Since LMM studies binding sites with respect to their genomic contexts, these artificial sequences, which do not correspond to any genomic region, cannot be used for our validation study. Of the 6,073 in vivo binding sites, 1,425 sites are based on the human genome. Of these, 149 (10.5%) are annotated with a corresponding PSSM. We use these binding sites for validation.

To locate the known TFBSs in the human genome, we focus on the 5,000 bp upstream sequences of all genes. We made use of the annotations provided by Ensembl (Hubbard et al., 2002) and extracted 22,808 human gene promoters from the human genome assembly NCBI golden path 29 (www.ensembl.org/Homo_sapiens). Since heuristic sequence-mapping algorithms do not perform well on short sequences such as TFBSs, we use an exact-match algorithm based on suffix trees (Gusfield, 1997). We found that many binding site sequences are precisely mapped onto the promoters of the correct target genes. For those binding sites with mappings onto multiple promoters or with no mapping, we attempted to retrieve them by manual review. To find the correct one among multiple mappings, we made correspondences between the Ensembl gene name and the target gene name of the binding site as recorded by TRANSFAC. A review of some missed matches using inexact match algorithms revealed a small number of single-basepair differences between the recorded binding site sequences and the promoter sequences of the target genes, for example, the binding site HS$ALBU_06 over the human albumin promoter. After validating against the primary literature for the positions of these binding sites, we included these mappings as well. In total, we located 101 human TFBSs.
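The exact-match mapping step can be illustrated with a minimal sketch. It swaps the suffix-tree algorithm named above for Python's built-in substring search, which returns the same exact matches on small inputs but does not share the suffix tree's performance characteristics. The function names, the dictionary of promoter sequences, and the decision to search both strands are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Translation table for complementing DNA (uppercase A, C, G, T only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def exact_matches(site: str, promoters: Dict[str, str]) -> List[Tuple[str, str, int]]:
    """Find every exact occurrence of `site` in each promoter.

    Returns (gene, strand, 0-based offset) tuples. Searching the reverse
    strand is an assumption of this sketch. A palindromic site is reported
    once per strand.
    """
    hits = []
    for gene, seq in promoters.items():
        for strand, query in (("+", site), ("-", reverse_complement(site))):
            start = seq.find(query)
            while start != -1:
                hits.append((gene, strand, start))
                start = seq.find(query, start + 1)  # allow overlapping hits
    return hits


# Hypothetical toy promoters, far shorter than the 5,000 bp used in the paper.
toy_promoters = {"geneA": "TTAGCAGGTGAA", "geneB": "CCCACCTGCTGG"}
toy_hits = exact_matches("AGCAGGTG", toy_promoters)
```

A site that maps to several promoters, or to none, would still need the manual review described in the text; the sketch only produces the candidate mappings.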

Local p-value calculation

Although the exact score distribution can be obtained by enumerating all possible binding site sequences under any "null" model for the observed nucleotide base pairs, the computational cost for a PSSM of length p is $4^p$. Staden's method (Staden, 1989), which turns this into an order-p computation, is based on the PGF of the score under the simple null model that the base pairs are independent and identically distributed (i.i.d.). Recently, however, some evidence has suggested that Markov background models work better than the i.i.d. model for detecting TFBSs (Liu et al., 2002). By extending Staden's PGF method to dependent random variables, we present here the derivation of the PGFs under a first-order Markov model, the basis of the efficient algorithm for computing the exact score distribution.

Probability generating function derivation

In our study, we make use of the PSSMs constructed by TRANSFAC version 6.2. Given a PSSM $m = (w_{ij})_{p \times 4}$, where $i = 1, \ldots, p$ and $j = A, C, G, T$, the match score $S$ and the similarity score $S/S_{\max}$ of a sequence $D_1 D_2 \cdots D_p$ are defined as (Quandt et al., 1995)

\[
S = \sum_{i=1}^{p} w_{iD_i}
\quad \text{and} \quad
S/S_{\max} = \sum_{i=1}^{p} w_{iD_i} \Big/ \sum_{i=1}^{p} \max_{j} \{ w_{ij} \}.
\]
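As a small illustration of the two definitions above, the following sketch computes $S$ and $S/S_{\max}$ for a toy matrix. The representation of a PSSM as a list of per-position dictionaries, the function names, and the toy weights are all assumptions made for the example; real TRANSFAC matrices are longer and use different weights.

```python
from typing import Dict, List

# A PSSM is represented here as one {base: integer weight} dict per position.
Pssm = List[Dict[str, int]]


def match_score(pssm: Pssm, site: str) -> int:
    """S = sum over positions i of w[i][D_i]."""
    if len(site) != len(pssm):
        raise ValueError("site length must equal PSSM length")
    return sum(column[base] for column, base in zip(pssm, site))


def similarity_score(pssm: Pssm, site: str) -> float:
    """S / S_max, where S_max sums the largest weight in each column."""
    s_max = sum(max(column.values()) for column in pssm)
    return match_score(pssm, site) / s_max


# Hypothetical 3-position matrix with integer weights.
toy_pssm = [
    {"A": 3, "C": 0, "G": 1, "T": 0},
    {"A": 0, "C": 4, "G": 0, "T": 1},
    {"A": 2, "C": 0, "G": 2, "T": 0},
]
best = match_score(toy_pssm, "ACG")        # 3 + 4 + 2
weak = similarity_score(toy_pssm, "GTA")   # (1 + 1 + 2) / 9
```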

Let $S$ be a random variable taking integer values; then its probability generating function $G(t)$ is the expected value of $t^S$. $G(t)$ is a polynomial, and the coefficient of the term $t^n$ is the probability of the event $S = n$ (Gut, 1995).

Given a PSSM $m$ of length $p$, under the assumption that the DNA sequence is i.i.d., Staden provided the PGF of the match score in the form of a product of $p$ polynomials (Staden, 1989): $G(t) = \prod_{i=1}^{p} \sum_{j \in \{A,C,G,T\}} f_j t^{w_{ij}}$, where $f_j$ is the frequency of letter $j$ in the i.i.d. DNA sequence. For the first-order Markov case ($k = 1$), let the transition matrix be $P = (f_{\alpha|\beta})_{4 \times 4}$, whose entry in row $\beta$ and column $\alpha$ is the probability that base $\alpha$ follows base $\beta$, and let the stationary distribution of the Markov chain be $\pi$ (viewed as a four-dimensional row vector). Then the PGF under the first-order Markov model is

\[
G(t) = \pi \prod_{i=1}^{p} \bigl( P M_i(t) \bigr) I, \tag{$*$}
\]

where $M_i(t) = \mathrm{Diag}(t^{w_{iA}}, t^{w_{iC}}, t^{w_{iG}}, t^{w_{iT}})$ and $I = (1, 1, 1, 1)^T$ (proof provided at the end of this section).
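Equation ($*$) can be evaluated without symbolic algebra by carrying a vector of polynomials through the product, one PSSM position at a time. The sketch below is one possible reading of that computation, not the authors' C++ implementation: each polynomial is stored as a dictionary from integer score to probability, and multiplying by $P M_i(t)$ becomes a shift-and-add over those dictionaries. The function names, the data layout, and the toy inputs are assumptions, and integer PSSM weights are assumed throughout.

```python
from collections import defaultdict

BASES = "ACGT"


def markov_score_distribution(pssm, pi, P):
    """Coefficients of G(t) = pi * prod_i (P M_i(t)) * I as {score: probability}.

    pssm: list of {base: integer weight} dicts, one per position
    pi:   {base: stationary probability}
    P:    {previous base: {next base: transition probability}}
    """
    # vec[a] is the polynomial in entry a of the current row vector.
    vec = {a: {0: pi[a]} for a in BASES}
    for column in pssm:
        nxt = {b: defaultdict(float) for b in BASES}
        for a in BASES:
            for score, prob in vec[a].items():
                for b in BASES:
                    # Entry (a, b) of P M_i(t) is P[a][b] * t^{w_ib}.
                    nxt[b][score + column[b]] += prob * P[a][b]
        vec = nxt
    # Multiplying by I = (1, 1, 1, 1)^T adds the four polynomials.
    dist = defaultdict(float)
    for b in BASES:
        for score, prob in vec[b].items():
            dist[score] += prob
    return dict(dist)


def p_value(dist, observed_score):
    """Probability of a score at least as large as the observed one."""
    return sum(prob for score, prob in dist.items() if score >= observed_score)


# Hypothetical toy inputs: a 3-position PSSM and a uniform background.
toy_pssm = [
    {"A": 3, "C": 0, "G": 1, "T": 0},
    {"A": 0, "C": 4, "G": 0, "T": 1},
    {"A": 2, "C": 0, "G": 2, "T": 0},
]
uniform_pi = {b: 0.25 for b in BASES}
uniform_P = {a: {b: 0.25 for b in BASES} for a in BASES}
toy_dist = markov_score_distribution(toy_pssm, uniform_pi, uniform_P)
```

With a uniform chain the result reduces to Staden's i.i.d. product, which gives a quick sanity check; for a real background the matrix `P` and vector `pi` would come from the local sequence.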

Since a Markov chain of order $k$ on an alphabet is equivalent to a first-order Markov chain on the set of $k$-tuples over that alphabet, with a little modification of $M_i(t)$ we can generalize the above results to $k > 1$. An example of the PGF for $k = 3$ is in the online supplement (www.biostat.harvard.edu/complab/LMM). Using this representation for the PGF, we developed and implemented an algorithm in C++ to calculate the exact score distribution. Generally, for a $k$th-order Markov chain and a PSSM of length $p$, the time complexity of our algorithm is $O(4^k \cdot S_{\max} \cdot p)$: linear in the matrix length but exponential in the order of the Markov chain. The source code is available upon request (www.biostat.hsph.harvard.edu/LMM).
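One way to picture the generalization to order $k$ is to let the state be the last $k$ bases, so that each step appends one base, drops the oldest, and adds the weight of the new base to the score. The sketch below follows that reading; it is an illustration under assumptions, not the supplement's construction. The function name, the dictionary layout of `init` and `trans`, and the toy inputs are hypothetical, and `init` is taken to be the distribution of the $k$ bases immediately preceding the site.

```python
BASES = "ACGT"


def order_k_score_distribution(pssm, k, init, trans):
    """Exact score distribution under an order-k Markov background.

    pssm:  list of {base: integer weight} dicts, one per position
    k:     order of the chain (k >= 1)
    init:  {k-mer: probability} for the k bases preceding the site
    trans: {k-mer: {base: probability of that base following the k-mer}}
    """
    # One polynomial {score: probability} per k-mer state: up to 4^k states.
    vec = {ctx: {0: prob} for ctx, prob in init.items() if prob > 0}
    for column in pssm:
        nxt = {}
        for ctx, poly in vec.items():
            for base in BASES:
                step = trans[ctx][base]
                if step == 0:
                    continue
                new_ctx = (ctx + base)[-k:]  # slide the k-base window
                target = nxt.setdefault(new_ctx, {})
                for score, prob in poly.items():
                    key = score + column[base]
                    target[key] = target.get(key, 0.0) + prob * step
        vec = nxt
    dist = {}
    for poly in vec.values():
        for score, prob in poly.items():
            dist[score] = dist.get(score, 0.0) + prob
    return dist


# Hypothetical check: an order-2 chain that ignores its context is i.i.d.
freq = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
toy_pssm = [
    {"A": 3, "C": 0, "G": 1, "T": 0},
    {"A": 0, "C": 4, "G": 0, "T": 1},
    {"A": 2, "C": 0, "G": 2, "T": 0},
]
init2 = {x + y: freq[x] * freq[y] for x in BASES for y in BASES}
trans2 = {ctx: dict(freq) for ctx in init2}
dist2 = order_k_score_distribution(toy_pssm, 2, init2, trans2)
```

The number of states grows as $4^k$ and each polynomial has at most on the order of $S_{\max}$ terms, which is consistent with the stated complexity being exponential in $k$ and linear in $p$.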

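Proof of equation ($*$). For ease of notation and without loss of generality, we let $p$, the length of the PSSM, be 3.

For a DNA sequence $D_1 D_2 D_3$, its match score against PSSM $m$ is $w_{1D_1} + w_{2D_2} + w_{3D_3}$, and the probability of the occurrence of $D_1 D_2 D_3$ is $f_{D_1} f_{D_2|D_1} f_{D_3|D_2}$. By definition, the PGF of the match score against $m$ is

\[
\sum_{D_1 D_2 D_3} f_{D_1} f_{D_2|D_1} f_{D_3|D_2} \, t^{w_{1D_1} + w_{2D_2} + w_{3D_3}}
= \sum_{D_1 D_2 D_3} f_{D_1} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}.
\]

In the following, we derive the PGF in the alternative form of a product of $p$ matrices. First, using the stationarity of the chain, $f_{D_1} = \sum_a f_a f_{D_1|a}$,

\begin{align*}
& \sum_{D_1 D_2 D_3} f_{D_1} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&= \sum_{D_1 D_2 D_3} \sum_{a} f_a \cdot f_{D_1|a} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&= \sum_{a} f_a \cdot \sum_{D_1 D_2 D_3} f_{D_1|a} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&= (f_A, f_C, f_G, f_T) \cdot
\Bigl( \sum_{D_1 D_2 D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \ \ldots, \ \sum_{D_1 D_2 D_3} f_{D_1|T} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Bigr)^T.
\end{align*}

For the component of the second vector corresponding to base A,

\begin{align*}
& \sum_{D_1 D_2 D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&= \sum_{D_1} f_{D_1|A} t^{w_{1D_1}} \cdot \sum_{D_2 D_3} f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&= \bigl( f_{A|A} t^{w_{1A}}, \ f_{C|A} t^{w_{1C}}, \ f_{G|A} t^{w_{1G}}, \ f_{T|A} t^{w_{1T}} \bigr) \cdot
\Bigl( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \ \ldots, \ \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Bigr)^T.
\end{align*}

We apply similar arguments to the components corresponding to bases C, G, and T and obtain

\begin{align*}
\sum_{D_1 D_2 D_3} f_{D_1|C} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
&= \bigl( f_{A|C} t^{w_{1A}}, \ f_{C|C} t^{w_{1C}}, \ f_{G|C} t^{w_{1G}}, \ f_{T|C} t^{w_{1T}} \bigr) \cdot
\Bigl( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \ \ldots, \ \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Bigr)^T, \\
\sum_{D_1 D_2 D_3} f_{D_1|G} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
&= \bigl( f_{A|G} t^{w_{1A}}, \ f_{C|G} t^{w_{1C}}, \ f_{G|G} t^{w_{1G}}, \ f_{T|G} t^{w_{1T}} \bigr) \cdot
\Bigl( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \ \ldots, \ \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Bigr)^T, \\
\sum_{D_1 D_2 D_3} f_{D_1|T} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
&= \bigl( f_{A|T} t^{w_{1A}}, \ f_{C|T} t^{w_{1C}}, \ f_{G|T} t^{w_{1G}}, \ f_{T|T} t^{w_{1T}} \bigr) \cdot
\Bigl( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \ \ldots, \ \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Bigr)^T.
\end{align*}

Therefore, for the first position, we have

\begin{align*}
& \Bigl( \sum_{D_1 D_2 D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \ \ldots, \ \sum_{D_1 D_2 D_3} f_{D_1|T} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Bigr)^T \\
&= \begin{pmatrix}
f_{A|A} t^{w_{1A}} & f_{C|A} t^{w_{1C}} & f_{G|A} t^{w_{1G}} & f_{T|A} t^{w_{1T}} \\
f_{A|C} t^{w_{1A}} & f_{C|C} t^{w_{1C}} & f_{G|C} t^{w_{1G}} & f_{T|C} t^{w_{1T}} \\
f_{A|G} t^{w_{1A}} & f_{C|G} t^{w_{1C}} & f_{G|G} t^{w_{1G}} & f_{T|G} t^{w_{1T}} \\
f_{A|T} t^{w_{1A}} & f_{C|T} t^{w_{1C}} & f_{G|T} t^{w_{1G}} & f_{T|T} t^{w_{1T}}
\end{pmatrix}
\cdot
\begin{pmatrix}
\sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|C} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|G} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix} \\
&= P \cdot \mathrm{Diag}(t^{w_{1A}}, t^{w_{1C}}, t^{w_{1G}}, t^{w_{1T}}) \cdot
\Bigl( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \ \ldots, \ \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Bigr)^T \\
&= P \cdot M_1(t) \cdot
\Bigl( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \ \ldots, \ \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Bigr)^T.
\end{align*}

Further applying the above arguments to positions 2 and 3, we have

\begin{align*}
& \Bigl( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \ \ldots, \ \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Bigr)^T \\
&= P \cdot \mathrm{Diag}(t^{w_{2A}}, t^{w_{2C}}, t^{w_{2G}}, t^{w_{2T}}) \cdot
\Bigl( \sum_{D_3} f_{D_3|A} t^{w_{3D_3}}, \ \ldots, \ \sum_{D_3} f_{D_3|T} t^{w_{3D_3}} \Bigr)^T \\
&= P \cdot \mathrm{Diag}(t^{w_{2A}}, t^{w_{2C}}, t^{w_{2G}}, t^{w_{2T}}) \cdot P \cdot \mathrm{Diag}(t^{w_{3A}}, t^{w_{3C}}, t^{w_{3G}}, t^{w_{3T}}) \cdot (1, 1, 1, 1)^T \\
&= P \cdot M_2(t) \cdot P \cdot M_3(t) \cdot I.
\end{align*}

Altogether, with $\pi = (f_A, f_C, f_G, f_T)$, we obtain $G(t) = \pi \prod_{i=1}^{p} \bigl( P M_i(t) \bigr) I$.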
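The identity just proved can be checked numerically for $p = 3$ by evaluating both sides at a fixed value of $t$: the left side by enumerating all $4^3$ sequences, the right side by applying $P M_i(t)$ three times to $\pi$. The sketch below does this in plain Python. The transition matrix, the weights, and the function names are arbitrary illustrative choices, and the stationary distribution is approximated by power iteration, which is an assumption of the sketch rather than part of the proof.

```python
from itertools import product

BASES = "ACGT"


def stationary(P, iterations=2000):
    """Approximate the stationary row vector pi = pi P by power iteration."""
    pi = {b: 0.25 for b in BASES}
    for _ in range(iterations):
        pi = {b: sum(pi[a] * P[a][b] for a in BASES) for b in BASES}
    return pi


def pgf_by_enumeration(t, pssm, pi, P):
    """Left side: sum over all sequences of probability * t^score."""
    total = 0.0
    for seq in product(BASES, repeat=len(pssm)):
        prob = pi[seq[0]]
        for prev, cur in zip(seq, seq[1:]):
            prob *= P[prev][cur]
        score = sum(column[base] for column, base in zip(pssm, seq))
        total += prob * t ** score
    return total


def pgf_by_matrix_product(t, pssm, pi, P):
    """Right side: pi * prod_i (P M_i(t)) * I, evaluated at a numeric t."""
    vec = dict(pi)
    for column in pssm:
        vec = {b: sum(vec[a] * P[a][b] for a in BASES) * t ** column[b]
               for b in BASES}
    return sum(vec.values())


# Hypothetical inputs chosen only to exercise the identity.
toy_P = {
    "A": {"A": 0.50, "C": 0.20, "G": 0.20, "T": 0.10},
    "C": {"A": 0.10, "C": 0.60, "G": 0.10, "T": 0.20},
    "G": {"A": 0.30, "C": 0.30, "G": 0.30, "T": 0.10},
    "T": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
toy_pssm = [
    {"A": 3, "C": 0, "G": 1, "T": 0},
    {"A": 0, "C": 4, "G": 0, "T": 1},
    {"A": 2, "C": 0, "G": 2, "T": 0},
]
toy_pi = stationary(toy_P)
lhs = pgf_by_enumeration(1.3, toy_pssm, toy_pi, toy_P)
rhs = pgf_by_matrix_product(1.3, toy_pssm, toy_pi, toy_P)
```

The two values agree only because `toy_pi` is stationary for `toy_P`; with an arbitrary starting vector the first factor of $P$ would change the distribution of $D_1$ and the two sides would differ.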

ACKNOWLEDGMENTS

The work of H.H., X.Z., and W.H.W. is supported by NSF grants DBI0196176 and DMS-0090166. The work of H.H. and J.S.L. is supported by NSF grant DMS-0204674 and NIH grant R01 HG02518-01. The work of M.-C.J.K. is supported by the Howard Hughes Medical Institute predoctoral fellowship.

REFERENCES

Bailey, T.L., and Elkan, C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2, 28–36.

Chen, Q.K., Hertz, G.Z., and Stormo, G.D. 1995. MATRIX SEARCH 1.0: A computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput. Appl. Biosci. 11, 563–566.

Durbin, R., Eddy, S.R., Krogh, A., and Mitchison, G. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, Cambridge, UK.

Fickett, J.W. 1996. Coordinate positioning of MEF2 and myogenin binding sites. Gene 172, GC19–32.

Fried, M., and Crothers, D.M. 1981. Equilibria and kinetics of lac repressor-operator interactions by polyacrylamide gel electrophoresis. Nucl. Acids Res. 9, 6505–6525.

Galas, D.J., and Schmitz, A. 1978. DNAse footprinting: A simple method for the detection of protein–DNA binding specificity. Nucl. Acids Res. 5, 3157–3170.

Garner, M.M., and Revzin, A. 1981. A gel electrophoresis method for quantifying the binding of proteins to specific DNA regions: Application to components of the Escherichia coli lactose operon regulatory system. Nucl. Acids Res. 9, 3047–3060.

Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, England.

Gut, A. 1995. An Intermediate Course in Probability, Springer-Verlag, New York.

Hertz, G.Z., Hartzell, G.W., 3rd, and Stormo, G.D. 1990. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci. 6, 81–92.

Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T., Durbin, R., Eyras, E., Gilbert, J., Hammond, M., Huminiecki, L., Kasprzyk, A., Lehvaslaiho, H., Lijnzaad, P., Melsopp, C., Mongin, E., Pettett, R., Pocock, M., Potter, S., Rust, A., Schmidt, E., Searle, S., Slater, G., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Stupka, E., Ureta-Vidal, A., Vastrik, I., and Clamp, M. 2002. The Ensembl genome database project. Nucl. Acids Res. 30, 38–41.

Hughes, J.D., Estep, P.W., Tavazoie, S., and Church, G.M. 2000. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol. 296, 1205–1214.

Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., Funke, R., Gage, D., Harris, K., Heaford, A., Howland, J., Kann, L., Lehoczky, J., LeVine, R., McEwan, P., McKernan, K., Meldrim, J., Mesirov, J.P., Miranda, C., Morris, W., Naylor, J., Raymond, C., Rosetti, M., Santos, R., Sheridan, A., Sougnez, C., Stange-Thomann, N., Stojanovic, N., Subramanian, A., Wyman, D., Rogers, J., Sulston, J., Ainscough, R., Beck, S., Bentley, D., Burton, J., Clee, C., Carter, N., Coulson, A., Deadman, R., Deloukas, P., Dunham, A., Dunham, I., Durbin, R., French, L., Grafham, D., Gregory, S., Hubbard, T., Humphray, S., Hunt, A., Jones, M., Lloyd, C., McMurray, A., Matthews, L., Mercer, S., Milne, S., Mullikin, J.C., Mungall, A., Plumb, R., Ross, M., Shownkeen, R., Sims, S., Waterston, R.H., Wilson, R.K., Hillier, L.W., McPherson, J.D., Marra, M.A., Mardis, E.R., Fulton, L.A., Chinwalla, A.T., Pepin, K.H., Gish, W.R., Chissoe, S.L., Wendl, M.C., Delehaunty, K.D., Miner, T.L., Delehaunty, A., Kramer, J.B., Cook, L.L., Fulton, R.S., Johnson, D.L., Minx, P.J., Clifton, S.W., Hawkins, T., Branscomb, E., Predki, P., Richardson, P., Wenning, S., Slezak, T., Doggett, N., Cheng, J.F., Olsen, A., Lucas, S., Elkin, C., Uberbacher, E., Frazier, M., Gibbs, R.A., Muzny, D.M., Scherer, S.E., Bouck, J.B., Sodergren, E.J., Worley, K.C., Rives, C.M., Gorrell, J.H., Metzker, M.L., Naylor, S.L., Kucherlapati, R.S., Nelson, D.L., Weinstock, G.M., Sakaki, Y., Fujiyama, A., Hattori, M., Yada, T., Toyoda, A., Itoh, T., Kawagoe, C., Watanabe, H., Totoki, Y., Taylor, T., Weissenbach, J., Heilig, R., Saurin, W., Artiguenave, F., Brottier, P., Bruls, T., Pelletier, E., Robert, C., Wincker, P., Smith, D.R., Doucette-Stamm, L., Rubenfield, M., Weinstock, K., Lee, H.M., Dubois, J., Rosenthal, A., Platzer, M., Nyakatura, G., Taudien, S., Rump, A., Yang, H., Yu, J., Wang, J., Huang, G., Gu, J., Hood, L., Rowen, L., Madan, A., Qin, S., Davis, R.W., Federspiel, N.A., Abola, A.P., Proctor, M.J., Myers, R.M., Schmutz, J., Dickson, M., Grimwood, J., Cox, D.R., Olson, M.V., Kaul, R., Shimizu, N., Kawasaki, K., Minoshima, S., Evans, G.A., Athanasiou, M., Schultz, R., Roe, B.A., Chen, F., Pan, H., Ramser, J., Lehrach, H., Reinhardt, R., McCombie, W.R., de la Bastide, M., Dedhia, N., Blocker, H., Hornischer, K., Nordsiek, G., Agarwala, R., Aravind, L., Bailey, J.A., Bateman, A., Batzoglou, S., Birney, E., Bork, P., Brown, D.G., Burge, C.B., Cerutti, L., Chen, H.C., Church, D., Clamp, M., Copley, R.R., Doerks, T., Eddy, S.R., Eichler, E.E., Furey, T.S., Galagan, J., Gilbert, J.G., Harmon, C., Hayashizaki, Y., Haussler, D., Hermjakob, H., Hokamp, K., Jang, W., Johnson, L.S., Jones, T.A., Kasif, S., Kaspryzk, A., Kennedy, S., Kent, W.J., Kitts, P., Koonin, E.V., Korf, I., Kulp, D., Lancet, D., Lowe, T.M., McLysaght, A., Mikkelsen, T., Moran, J.V., Mulder, N., Pollara, V.J., Ponting, C.P., Schuler, G., Schultz, J., Slater, G., Smit, A.F., Stupka, E., Szustakowski, J., Thierry-Mieg, D., Thierry-Mieg, J., Wagner, L., Wallis, J., Wheeler, R., Williams, A., Wolf, Y.I., Wolfe, K.H., Yang, S.P., Yeh, R.F., Collins, F., Guyer, M.S., Peterson, J., Felsenfeld, A., Wetterstrand, K.A., Patrinos, A., Morgan, M.J., Szustakowki, J., de Jong, P., Catanese, J.J., Osoegawa, K., Shizuya, H., Choi, S., and Chen, Y.J. 2001. Initial sequencing and analysis of the human genome. Nature 409, 860–921.

Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., and Wootton, J.C. 1993. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262, 208–214.

Lawrence, C.E., and Reilly, A.A. 1990. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 7, 41–51.

Liu, X., Brutlag, D.L., and Liu, J.S. 2001. BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput., 127–138.

Liu, X.S., Brutlag, D.L., and Liu, J.S. 2002. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol.

Nakatsuji, Y., Hidaka, K., Tsujino, S., Yamamoto, Y., Mukai, T., Yanagihara, T., Kishimoto, T., and Sakoda, S. 1992. A single MEF-2 site is a major positive regulatory element required for transcription of the muscle-specific subunit of the human phosphoglycerate mutase gene in skeletal and cardiac muscle cells. Mol. Cell. Biol. 12, 4384–4390.

Quandt, K., Frech, K., Karas, H., Wingender, E., and Werner, T. 1995. MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucl. Acids Res. 23, 4878–4884.

Rosenthal, N., Berglund, E.B., Wentworth, B.M., Donoghue, M., Winter, B., Bober, E., Braun, T., and Arnold, H.H. 1990. A highly conserved enhancer downstream of the human MLC1/3 locus is a target for multiple myogenic determination factors. Nucl. Acids Res. 18, 6239–6246.

Roth, F.P., Hughes, J.D., Estep, P.W., and Church, G.M. 1998. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol. 16, 939–945.

Staden, R. 1989. Methods for calculating the probabilities of finding patterns in sequences. Comput. Appl. Biosci. 5, 89–96.

Stormo, G.D., and Hartzell, G.W., 3rd. 1989. Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl. Acad. Sci. USA 86, 1183–1187.

Wentworth, B.M., Donoghue, M., Engert, J.C., Berglund, E.B., and Rosenthal, N. 1991. Paired MyoD-binding sites regulate myosin light chain gene expression. Proc. Natl. Acad. Sci. USA 88, 1242–1246.

Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Pruss, M., Reuter, I., and Schacherer, F. 2000. TRANSFAC: An integrated system for gene expression regulation. Nucl. Acids Res. 28, 316–319.

Address correspondence to:
Jun S. Liu, Wing H. Wong
Department of Statistics
Science Center, 6th floor
1 Oxford Street
Cambridge, MA 02138

E-mail: {jliu, wwong}@stat.harvard.edu

Table 1. Running Time Comparison of LMM with Monte Carlo Simulation(a)

                            Significance at p ≤ 0.001    Significance at p ≤ 0.0001
                            Relative error (%)           Relative error (%)
Simulation    Time (sec)    Mean        SD               Mean        SD
N = 10^5      0.06          6.5         4.3              24.0        16.9
N = 10^6      0.6           1.7         1.2              7.7         4.5
N = 10^7      5.7           1.0         0.7              2.6         1.7
N = 10^8      57            1.0         0.5              2.0         0.8
N = 10^9      570           0.7         0.6              0.6         0.4
N = 2^32      1228          0.6         0.6              0.6         0.4
LMM           0.43          0                            0

(a) For 10 randomly chosen intergenic regions, we estimate a 2nd-order Markov model for each sequence using maximum likelihood estimation. Under the 10 estimated Markov models, we use our algorithm to derive, and Monte Carlo simulation to estimate, the score cutoffs of the p53 PSSM at two significance levels, p ≤ 0.001 and p ≤ 0.0001, on a 1500 MHz AMD Athlon machine running Linux. To assess the difference between the cutoffs $C_p$ derived by our algorithm and the cutoffs $\hat{C}_p$ estimated by simulations, we consider the p-value $F(\hat{C}_p)$ attained by $\hat{C}_p$ and the true p-value $F(C_p)$ of $C_p$ derived using our algorithm. We assess the relative errors of the simulation estimate by calculating $[F(C_p) - F(\hat{C}_p)] / F(C_p)$.

respectively, downstream of the last exon of MYL1 in the human genome. Sites A and B are myogenin/myf4 binding sites (Rosenthal et al., 1990), while site C is a MyoD binding site (Wentworth et al., 1991) also considered to be a myogenin binding site (Fickett, 1996).

We applied the LMM to the 10,000 bp MYL1 downstream region (starting from the end of the last exon) to derive the local p-values for each candidate. The local p-value for each candidate is the statistical significance of observing its score (derived by both log-odds and entropy-related PSSMs), assuming that it is generated under a local random model where Markov models of different orders (e.g., 0, 1, or 2) are used and with parameters estimated from the local 1,000 bp genomic sequence centered at the candidate. The top 10 score candidate sites derived using log-odds or entropy-weighted PSSMs are listed in Tables 2a and 2b, respectively, with the true sites A, B, and C labeled and shaded in gray, along with their local p-values. We find the PSSM scores to be a less sensitive measure than the local p-value: the true sites A, B, and C stood out under the local p-values, while they are not as distinct from the false predictions under the PSSM scores.

Table 2. Incorporating Local Sequence Information to Transcription Factor Binding Site Prediction Using Two Types of PSSMs for Myogenin in the Human MYL1 3′ Enhancer (a, b) or for MEF2 in the Human Phosphoglycerate Mutase Promoter (c, d)(a)

(a) Using log-odds myogenin PSSMs

Position                                  p-values of observed score under local background model
(bp from last      Log-odds
exon of MYL1)      PSSM score      i.i.d.      1st Markov†    2nd Markov     Binding site
1267 (A)           556             0.000008    0.000017       0.000030       AGCAGGTG
1339 (C)           550             0.000015    0.000027       0.000055       GACAGGTG
1323 (B)           548             0.000033    0.000057       0.000112       ACCAGCTG
5434               556             0.000036    0.000074       0.000095       AGCAGCTG
2463               550             0.000059    0.000135       0.000179       GCCAGCTG
1235               531             0.000212    0.000354       0.000442       ACCATGTG
926                534             0.000181    0.000363       0.000468       TGCAGGTG
2574               536             0.000225    0.000416       0.000421       GGCAGATG
783                531             0.000274    0.000453       0.000537       AACATCTG
470                529             0.000404    0.000624       0.000731       GGAAGCTG

(b) Using entropy-weighted myogenin PSSM

Position                                  p-values of observed score under local background model
(bp from last      TRANSFAC
exon of MYL1)      score           i.i.d.      1st Markov†    2nd Markov     Binding site
1267 (A)           4667            0.000008    0.000017       0.00003        AGCAGGTG
1339 (C)           4628            0.000018    0.000032       0.000059       GACAGGTG
5434               4667            0.000036    0.000074       0.000095       AGCAGCTG
1323 (B)           4581            0.000045    0.000077       0.000127       ACCAGCTG
2463               4628            0.000068    0.000152       0.000194       GCCAGCTG
2574               4596            0.000073    0.000177       0.000191       GGCAGATG
926                4463            0.000224    0.000414       0.000532       TGCAGGTG
7534               4377            0.000378    0.000534       0.000788       TACAGCTG
7156               4377            0.000346    0.00054        0.000686       CCCAGCTG
4895               4322            0.000829    0.001998       0.002045       CTCAGGTG

(c) Using log-odds MEF2 PSSMs

Position                                  p-values of observed score under local background model
(bp from last      Log-odds
exon of MYL1)      PSSM score      i.i.d.      1st Markov†    2nd Markov     Binding site
−2970              669             0.000199    0.000160       0.000228       ATTTTAAATA
−3115              671             0.000209    0.000183       0.000243       GTTATAAATA
−161               649             0.000355    0.000183       0.000322       ATTTTAAGCA
−2939              668             0.000233    0.000190       0.000266       TGTTTAAATC
−3151              663             0.000807    0.000655       0.000747       TGTTTAAGAA
−4767              656             0.000951    0.001009       0.001712       TTTTTATATA
−3433              649             0.003940    0.003099       0.003383       AAACTAAAAA
−3566              644             0.005710    0.004913       0.005231       TTTTTAAAGC
−3214              643             0.007155    0.005654       0.006459       AGTTTATATC
−3577              641             0.007363    0.006312       0.006625       GGTTTAACAT

(d) Using entropy-weighted MEF2 PSSM

Position                                  p-values of observed score under local background model
(bp from last      TRANSFAC
exon of MYL1)      score           i.i.d.      1st Markov†    2nd Markov     Binding site
−2970              591             0.000103    0.000082       0.000133       ATTTTAAATA
−161               550             0.000191    0.000101       0.000174       ATTTTAAGCA
−3115              590             0.000206    0.000181       0.000243       GTTATAAATA
−2939              554             0.001001    0.000807       0.001016       TGTTTAAATC
−3151              562             0.001091    0.000914       0.001083       TGTTTAAGAA
−4700              531             0.001451    0.001451       0.001451       TTGTTAAAGA
−3566              543             0.002961    0.002532       0.002676       TTTTTAAAGC
−3433              545             0.003271    0.002610       0.002788       AAACTAAAAA
−4444              532             0.003080    0.003200       0.004320       CATATAATTA
−3687              535             0.003761    0.003320       0.003671       GAAGTAAAGA

(a) Sorted in increasing order by the column marked with †.
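The local background model used for these p-values is fitted to the 1,000 bp window centered at each candidate. A minimal sketch of that estimation step for a first-order chain is given below, using maximum likelihood (transition counts divided by row totals). The function names, the optional pseudocount, the uniform fallback for bases never seen in the window, and the handling of windows that run off the ends of the sequence are assumptions of the sketch and are not described in the paper.

```python
BASES = "ACGT"


def local_window(seq, center, width=1000):
    """Return the stretch of `seq` of about `width` bp centered at `center`.

    Near either end of the sequence the window is simply truncated.
    """
    half = width // 2
    return seq[max(0, center - half):center + half]


def estimate_first_order(window, pseudocount=0.0):
    """Maximum likelihood estimate of P[a][b] = Pr(next base b | base a).

    Characters other than A, C, G, T (for example N) are skipped. A row
    with no observations falls back to the uniform distribution.
    """
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for prev, cur in zip(window, window[1:]):
        if prev in counts and cur in counts:
            counts[prev][cur] += 1
    P = {}
    for a in BASES:
        total = sum(counts[a].values())
        P[a] = {b: (counts[a][b] / total if total else 0.25) for b in BASES}
    return P


# Hypothetical toy window; a real one would be about 1,000 bp long.
toy_P = estimate_first_order("ACACAC")
```

The estimated matrix, together with its stationary distribution, would then serve as the background against which the candidate's score distribution is computed.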

PGAM-M MEF2 binding site prediction

A major positive regulatory element is required for the muscle-specific expression of the muscle-specific subunit of the human phosphoglycerate mutase (PGAM-M) gene (Nakatsuji et al., 1992). This element, located 161 bp upstream of the gene, is found to be bound by the transcription factor MEF-2.

We applied the LMM to the 5,000 bp PGAM-M upstream region, using the MEF2_Q6 PSSM, to derive the local p-values for each candidate. The top 10 score candidate sites derived using log-odds or entropy-weighted PSSMs are listed in Tables 2c and 2d, respectively, with the true site labeled and shaded in gray, along with their local p-values. We find that LMM behaves here much as it does in the MYL1 enhancer.

Overall, from Table 2, we see that by taking into account the local sequence composition, we have reordered the candidate sequences in a way that is favorable to the true binding sites.

LARGE-SCALE VALIDATION

In order to evaluate the performance of LMM and to compare our local p-values to PSSM similarity scores, we apply both LMM and TRANSFAC to 101 known binding sites in the human genome, obtained by mapping binding sites in the TRANSFAC database onto the human genome. We recorded and evaluated the extent to which LMM and TRANSFAC can capture this large collection of known binding sites in the human genome and the amount of noise generated in so doing.

In Figure 3a, the trade-off between sensitivity and noise is shown in terms of the proportion of the known binding sites detected and the amount of concomitant noise generated. Noise is measured by the noise-to-signal ratio, which is defined as the number of binding site calls not known to be correct divided by the number of known binding sites found. For comparison, we show the tradeoffs achieved by TRANSFAC using its three matrix-specific similarity score cutoffs (FN, SUM, FP), along with that achieved by LMM under Markov models of orders 0, 1, 2, and 3 at various p-value cutoffs, starting at the stringent p = 0.00001. From the inset graph, we see that at all levels of sensitivity LMM outperformed TRANSFAC by producing significantly less noise. While the performance of LMM comes close to that of TRANSFAC as the p-value cutoff increases, by then, for both methods, the advantage of increased sensitivity has been nullified by the high level of accompanying noise, rendering them impractical. Overall, not only is the sensitivity of LMM comparable to that of TRANSFAC, its noise-to-signal ratio is also vastly superior. It should be noted that, since only a limited number of true binding sites are known, not every unsupported binding site prediction is necessarily a false-positive prediction. Thus, the noise-to-signal ratio overestimates the true noise level, especially when stringent criteria are used to generate putative TFBSs with high sequence similarity to known binding sites. As the criteria relax, the large numbers of predictions over and above the known binding sites imply a high level of true background noise.

FIG. 3. Large-scale validation of TRANSFAC and LMM: tradeoff between sensitivity and noise. (a) We compared the abilities of the two methods to detect the 101 known binding sites in the human genome by looking at their sensitivity and noise-to-signal ratio. The balance of the tradeoff between these two measures achieved at various significance levels by LMM is traced and compared to that attained by TRANSFAC. The inset graph shows the performance of LMM and TRANSFAC across all levels of sensitivity. (b) Detailed results for p = 0.00001 and 0.0002.
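The two quantities plotted in the validation, sensitivity and the noise-to-signal ratio, follow directly from the definitions in the text and can be sketched as below. The function name, the representation of calls and known sites as sets of identifiers, and the choice to report an infinite ratio when no known site is found are assumptions made for the example.

```python
def sensitivity_and_noise(calls, known_sites):
    """Return (sensitivity, noise-to-signal ratio) for a set of calls.

    sensitivity     = known sites found / all known sites
    noise-to-signal = calls not known to be correct / known sites found
    """
    calls = set(calls)
    known_sites = set(known_sites)
    found = calls & known_sites
    sensitivity = len(found) / len(known_sites)
    unsupported = len(calls) - len(found)
    # Assumption of this sketch: the ratio is infinite when nothing is found.
    noise = unsupported / len(found) if found else float("inf")
    return sensitivity, noise


# Hypothetical identifiers: six calls, of which two hit the four known sites.
toy_sens, toy_noise = sensitivity_and_noise(
    calls={1, 2, 3, 4, 5, 6}, known_sites={1, 2, 7, 8}
)
```

Because unsupported calls may include real but unannotated sites, the ratio computed this way is an upper bound on the true noise, as the text points out.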

More detailed results for TRANSFAC using the three cutoffs and for LMM using different signi cancecutoffs 000001 and 00002 and under different Markov models are summarized in Fig 3b While theFN cut off missed relatively few known binding sites it generated more than 45 false-positive predictionsfor every accurate binding site call On the other hand FP made fewer false positives but it detected onlyone in nine known binding sites The SUM cutoff designed as a balance of these inherent tradeoffs didstrike a reasonable compromise having generated about nine false positives for every real binding site anddetected more than half of the known sites

At the stringent signi cance cutoff p D 000001 LMM detected about twice the binding sites thandid the FP cutoff and on average produced about 60 fewer false-positive predictions for every correctprediction At the more relaxed p-value cutoff p D 00002 the sensitivity of LMM is comparable tothat of the SUM cutoff while only half of the noise is generated The binding sites that were detectedby LMM at p middot 00002 but missed by TRANSFAC using the SUM cutoff include a MEF2 binding siteover the desmin gene an ATF1 (activating transcription factor 1) binding site over the TGFmacr2 gene aHIF (hypoxia-inducible factor) binding site over the VEGF gene and an ICSBP (IFN consensus sequencebinding protein) binding site over the OAS1 gene We choose p D 00002 as the general signi cance cutofffor the application of LMM to mammalian genomic sequences a cutoff with a suf ciently high sensitivityand an acceptable amount of noise Overall the LMM provides an advantageous tradeoff between noise-to-signal ratio and sensitivity

In our validation experiment we found that Markov models of orders 1 2 and 3 have better combinationsof high sensitivity and low noise than the iid model con rming an earlier observation (Liu et al 2001)that Markov models can better capture the structure of biological sequences In addition we compared

FIG 4 The use of local sequence context is advantageous The performance of the second-order LMM is comparedagainst an analogous global Markov model with parameters estimated from a large collection of upstream regionsThe performance is assessed in terms of the noise-to-signal ratio and sensitivity At the recommended p-value cutoff00002 LMM is more sensitive and less noisy

LOCAL STATISTICAL SIGNIFICANCE OF PATTERNS 9

the performance of the second-order LMM against an analogous global Markov model with parametersestimated from a large collection of upstream regions in order to assess the ability of LMM to modelthe local sequence context information We found that over the 101 known human TFBSs in situ LMMgenerally outperforms the global Markov model while they behave similarly at high and low sensitivitylevels (Fig 4) At high sensitivity levels the lax p-value cutoffs produce large numbers of putative TFBScalls overwhelming the advantage enjoyed by LMM At low sensitivity levels the stringent p-valuecutoffs yield only putative TFBSs with undeniable sequence similarity to known binding sites Thus thenoise-to-signal ratio may not re ect the true noise level in this region

DISCUSSION

The work presented in this paper attemps to identify TFBSs by considering simultaneously both theirsimilarity to the query PSSM and their differences from the local genomic context Through the studyof the human TFBSs in TRANSFAC we show that LMM which makes putative TFBS calls using localp-values yields a much improved false-positive to true-positive ratio than that using the TRANSFAC orlog-odds scores alone

It has been known that neighboring nucleotide compositions can affect the interaction between a tran-scription factor and its binding site To our best knowledge however there is no documented study onwhether and how much an improvement can be made on the PSSM-based TFBS detection using a localbackground model The result we present which is based on more than 100 experimentally determinedTFBS sequences in the human genome shows a clear overall advantage for incorporating the local se-quence context into PSSM-based TFBS search There are various biological mechanisms that can explainthis effect which may lead to more complicated and more speci c models For instance it may be thatthe local 1000 bp genomic region does not contain DNA sequences similar to the true binding site be-cause otherwise the target transcription factor may be competed away from its biologically meaningfulbinding site

While this improvement does not in itself render a solution to the much more difficult problem of detecting regulatory modules, by significantly reducing false-positive calls for single sites the local p-value approach will contribute substantially to any subsequent algorithms aiming to detect combinatorial regulatory modules. The method we developed here is seen as a proof of principle and can be used as a component of a more complex approach. For example, considering that clusters of binding sites also often occur within small regions of about 200 bp to cooperatively recruit the transcription factors, a natural future development of LMM would be to take this distance effect into the background estimation and combine the LMM p-values of a few candidate PSSM sites. Many challenging problems in computational biology, e.g., translation initiation site identification, splice site recognition, and RNA secondary structure prediction, can be modeled in terms of the recognition of motifs. Our work may be adapted and extended to these problems as well. However, it should be noted that when applied to protein sequences, which are composed of a 20-letter alphabet, the performance of our algorithm may become an issue, especially when the order of the Markov chain, k, is large.

DETAILED METHODS

Data extraction for large-scale validation

To evaluate the performance of the LMM, we apply it to known TFBSs in the human genome. Known binding sites are extracted from the SITE table of the TRANSFAC database, version 6.2. About half of the 12262 binding sites in this table are experimentally derived from various species. The rest are generated from in vitro binding assays on artificial nucleotide sequences. Since LMM studies binding sites with respect to their genomic contexts, these artificial sequences, which do not correspond to any genomic region, cannot be used for our validation study. Of the 6073 in vivo binding sites, 1425 sites are based on the human genome. Of these, 149 (10.5%) are annotated with a corresponding PSSM. We use these binding sites for validation.


To locate the known TFBSs in the human genome, we focus on the 5000 bp upstream sequences of all genes. We made use of the annotations provided by Ensembl (Hubbard et al., 2002) and extracted 22808 human gene promoters from the human genome assembly NCBI golden path 29 (www.ensembl.org/Homo_sapiens). Since heuristic sequence-mapping algorithms do not perform well on short sequences such as TFBSs, we use an exact-match algorithm based on suffix trees (Gusfield, 1997). We found that many binding site sequences are precisely mapped onto the promoters of the correct target genes. For those binding sites with mappings onto multiple promoters or with no mapping, we attempted to retrieve them by manual review. To find the correct one among multiple mappings, we made correspondences between the Ensembl gene name and the target gene name of the binding site as recorded by TRANSFAC. A review of some missed matches using inexact match algorithms revealed a small number of single-basepair differences between the recorded binding site sequences and the promoter sequences of the target genes, for example, the binding site HS$ALBU_06 over the human albumin promoter. After validating against the primary literature for the positions of these binding sites, we included these mappings as well. In total, we located 101 human TFBSs.
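The mapping step can be sketched as follows. This is a minimal illustration, not the authors' implementation: Python's built-in substring search stands in for the suffix-tree matcher, and the gene names, sequences, and helper names (`find_exact_matches`, `resolve_by_target`) are hypothetical.

```python
def find_exact_matches(site, promoters):
    """Return (gene, offset) for every exact occurrence of `site` in each promoter."""
    hits = []
    for gene, seq in promoters.items():
        start = seq.find(site)
        while start != -1:
            hits.append((gene, start))
            start = seq.find(site, start + 1)  # step by one to allow overlapping hits
    return hits


def resolve_by_target(hits, target_gene):
    """When a site maps to several promoters, keep the hits on the annotated target gene."""
    on_target = [hit for hit in hits if hit[0] == target_gene]
    return on_target if on_target else hits


# Hypothetical promoters (real ones would be 5000 bp upstream sequences).
promoters = {
    "GENE1": "TTAGCAGGTGAA",
    "GENE2": "CCAGCAGGTGTTAGCAGGTG",
}
hits = find_exact_matches("AGCAGGTG", promoters)
print(hits)
print(resolve_by_target(hits, "GENE1"))
```

A suffix tree answers the same queries faster when many sites are mapped against a large promoter collection; the plain scan above only shows the logic of exact matching followed by disambiguation against the recorded target gene.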

Local p-value calculation

Although the exact score distribution can be obtained by enumerating all possible binding site sequences under any "null" model for the observed nucleotide base pairs, the computational cost for a PSSM of length $p$ is $4^p$. Staden's method (Staden, 1989), which turns this into an order-$p$ computation, is based on the PGF of the score under the simple null model that the base pairs are independent and identically distributed (iid). Recently, however, there has been some evidence suggesting that Markov background models work better than the iid model for detecting TFBSs (Liu et al., 2002). By extending Staden's PGF method to dependent random variables, we present here the derivation of the PGFs under a first-order Markov model, the basis of the efficient algorithm for computing the exact score distribution.
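To make the cost difference concrete, the sketch below computes the score distribution under an iid null in two ways: by enumerating all $4^p$ sequences, and by building the distribution one position at a time in the spirit of Staden's polynomial product. The PSSM weights, base frequencies, and function names are illustrative assumptions, not values from the paper.

```python
from itertools import product

BASES = "ACGT"


def score_dist_enumerate(pssm, freq):
    """Exact score distribution by listing all 4**p sequences (iid null)."""
    dist = {}
    for seq in product(BASES, repeat=len(pssm)):
        prob, score = 1.0, 0
        for row, base in zip(pssm, seq):
            prob *= freq[base]
            score += row[base]
        dist[score] = dist.get(score, 0.0) + prob
    return dist


def score_dist_iid(pssm, freq):
    """Same distribution, built position by position (a product of p polynomials)."""
    dist = {0: 1.0}  # PGF of the empty prefix: G(t) = 1
    for row in pssm:
        nxt = {}
        for score, prob in dist.items():
            for base in BASES:
                s = score + row[base]
                nxt[s] = nxt.get(s, 0.0) + prob * freq[base]
        dist = nxt
    return dist


def p_value(dist, observed):
    """P(S >= observed) under the null score distribution."""
    return sum(prob for score, prob in dist.items() if score >= observed)


# Hypothetical integer-weight PSSM of length 3 and hypothetical base frequencies.
toy_pssm = [
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 2, "C": 0, "G": 7, "T": 1},
]
freq = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

dist = score_dist_iid(toy_pssm, freq)
print(p_value(dist, 24))  # only ACG reaches the maximum score of 24
```

The position-by-position version touches at most (number of distinct partial scores) x 4 entries per position, so its cost grows linearly in $p$ rather than as $4^p$.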

Probability generating function derivation

In our study, we make use of the PSSMs constructed by TRANSFAC, version 6.2. Given a PSSM $m = (w_{ij})_{p \times 4}$, where $i = 1, \ldots, p$ and $j = A, C, G, T$, the match score $S$ and the similarity score $S/S_{\max}$ of a sequence $D_1 D_2 \cdots D_p$ are defined as (Quandt et al., 1995)

$$S = \sum_{i=1}^{p} w_{iD_i} \quad \text{and} \quad S/S_{\max} = \sum_{i=1}^{p} w_{iD_i} \Big/ \sum_{i=1}^{p} \max_{j} \{w_{ij}\}.$$
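The two scores can be computed directly from their definitions. The sketch below assumes a PSSM stored as a list of per-position weight dictionaries; the weights and the function names are hypothetical and chosen only for illustration.

```python
BASES = "ACGT"


def match_score(pssm, seq):
    """Match score S: sum of w[i][D_i] over the positions of the site."""
    if len(seq) != len(pssm):
        raise ValueError("sequence length must equal PSSM length")
    return sum(row[base] for row, base in zip(pssm, seq))


def similarity_score(pssm, seq):
    """Similarity score S / S_max, where S_max is the best achievable match score."""
    s_max = sum(max(row[b] for b in BASES) for row in pssm)
    return match_score(pssm, seq) / s_max


# Hypothetical 3-position PSSM with integer weights.
toy_pssm = [
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 2, "C": 0, "G": 7, "T": 1},
]
print(match_score(toy_pssm, "ACG"))       # 8 + 9 + 7 = 24
print(similarity_score(toy_pssm, "ACT"))  # (8 + 9 + 1) / 24 = 0.75
```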

Let $S$ be a random variable taking integer values; then its probability generating function (PGF) $G(t)$ is the expected value of $t^S$. $G(t)$ is a polynomial, and the coefficient of the term $t^n$ is the probability of the event $S = n$ (Gut, 1995).
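As a small worked illustration (the probabilities are hypothetical and not derived from any PSSM), suppose $S$ takes the values 0, 1, and 2 with probabilities 0.5, 0.3, and 0.2. Then

$$G(t) = E\left[t^{S}\right] = \sum_{n} P(S = n)\, t^{n} = 0.5 + 0.3\,t + 0.2\,t^{2},$$

so that $G(1) = 1$, and a tail probability such as $P(S \ge 1) = 0.3 + 0.2 = 0.5$ is obtained by summing the coefficients of the terms $t^n$ with $n \ge 1$. A p-value for an observed score is read off the PGF in the same way.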

Given a PSSM $m$ of length $p$, under the assumption that the DNA sequence is iid, Staden provided the PGF of the match score in the form of a product of $p$ polynomials (Staden, 1989):

$$G(t) = \prod_{i=1}^{p} \sum_{j = A,C,G,T} f_j\, t^{w_{ij}},$$

where $f_j$ is the frequency of letter $j$ in the iid DNA sequence. For the first-order Markov case, $k = 1$, let the transition matrix be $P = (f_{\alpha|\beta})_{4 \times 4}$, with rows indexed by the preceding base $\beta$ and columns by the following base $\alpha$, and let the stationary distribution of the Markov chain be $\pi$ (viewed as a four-dimensional row vector). Then the PGF under the first-order Markov model is

$$G(t) = \pi \left( \prod_{i=1}^{p} P\, M_i(t) \right) I, \qquad (*)$$

where $M_i(t) = \mathrm{Diag}(t^{w_{iA}}, t^{w_{iC}}, t^{w_{iG}}, t^{w_{iT}})$ and $I = (1, 1, 1, 1)^T$ (proof provided at the end of this section).
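Equation (*) can be evaluated literally by storing each polynomial as a map from exponent to coefficient and multiplying from the right. The sketch below is one way to do so under stated assumptions: the PSSM weights and the circulant transition matrix (chosen so that the uniform distribution is stationary) are hypothetical, and `markov_pgf` is an illustrative name rather than the authors' C++ routine.

```python
BASES = "ACGT"


def poly_add_scaled(acc, poly, coef, shift):
    """acc += coef * t**shift * poly; polynomials are {exponent: coefficient} dicts."""
    for exp, c in poly.items():
        acc[exp + shift] = acc.get(exp + shift, 0.0) + coef * c


def markov_pgf(pssm, trans, pi):
    """Coefficients of G(t) = pi * prod_i (P M_i(t)) * I for a first-order chain.

    trans[a][b] is the probability that base b follows base a (row a of P);
    pi[a] is the probability of base a immediately before the site.
    """
    vec = {a: {0: 1.0} for a in BASES}      # I = (1, 1, 1, 1)^T
    for row in reversed(pssm):              # apply P M_p first, then P M_{p-1}, ...
        new_vec = {}
        for a in BASES:
            acc = {}
            for b in BASES:
                poly_add_scaled(acc, vec[b], trans[a][b], row[b])
            new_vec[a] = acc
        vec = new_vec
    pgf = {}
    for a in BASES:
        poly_add_scaled(pgf, vec[a], pi[a], 0)
    return pgf


# Hypothetical PSSM and transition matrix; rows of `trans` are cyclic shifts,
# so the matrix is doubly stochastic and the uniform distribution is stationary.
toy_pssm = [
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 2, "C": 0, "G": 7, "T": 1},
]
probs = [0.4, 0.3, 0.2, 0.1]
trans = {a: {b: probs[(j - i) % 4] for j, b in enumerate(BASES)}
         for i, a in enumerate(BASES)}
pi = {a: 0.25 for a in BASES}

pgf = markov_pgf(toy_pssm, trans, pi)
print(sorted(pgf.items()))
```

The coefficient stored under exponent $n$ is $P(S = n)$, so the coefficients sum to one and a local p-value is a tail sum over them.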

Since a Markov chain of order $k$ on a set $\Omega$ is equivalently a first-order Markov chain on the set $\Omega^k$, with a little modification on $M_i(t)$ we can generalize the above results to $k > 1$. An example of the PGF for $k = 3$ is in the online supplement (www.biostat.harvard.edu/complab/LMM). Using this representation for the PGF, we developed and implemented an algorithm using C++ to calculate the exact score distribution. Generally, for a $k$th-order Markov chain and a PSSM of length $p$, the time complexity of our algorithm is $O(4^k \cdot S_{\max} \cdot p)$, linear in the matrix length but exponential in the order of the Markov chain. The source code is available upon request (www.biostat.hsph.harvard.edu/LMM).
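The order-$k$ generalization can be sketched as a dynamic program whose state is the preceding $k$-mer together with the partial score. This is an illustrative Python sketch, not the authors' C++ code; the conditional probabilities produced by `toy_cond`, the uniform starting distribution (the paper uses the stationary distribution), and the PSSM weights are all assumptions.

```python
from itertools import product

BASES = "ACGT"


def score_dist_order_k(pssm, cond, start):
    """Exact score distribution under an order-k Markov background.

    cond[ctx][b]: probability of base b given the preceding k-mer ctx.
    start[ctx]:   probability of the k-mer immediately before the site.
    Returns {score: probability}.
    """
    # state[ctx][score] = probability of context ctx with that partial score
    state = {ctx: {0: p} for ctx, p in start.items() if p > 0}
    for row in pssm:
        nxt = {}
        for ctx, scores in state.items():
            for b in BASES:
                pb = cond[ctx][b]
                if pb == 0:
                    continue
                new_ctx = (ctx + b)[1:]  # slide the k-mer window by one base
                bucket = nxt.setdefault(new_ctx, {})
                for s, p in scores.items():
                    bucket[s + row[b]] = bucket.get(s + row[b], 0.0) + p * pb
        state = nxt
    dist = {}
    for scores in state.values():
        for s, p in scores.items():
            dist[s] = dist.get(s, 0.0) + p
    return dist


def toy_cond(k):
    """Hypothetical order-k conditional probabilities, for illustration only."""
    cond = {}
    for ctx in map("".join, product(BASES, repeat=k)):
        shift = sum(BASES.index(c) for c in ctx)
        raw = [1 + (BASES.index(b) + shift) % 4 for b in BASES]  # permutation of 1..4
        cond[ctx] = {b: r / 10 for b, r in zip(BASES, raw)}
    return cond


toy_pssm = [
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 2, "C": 0, "G": 7, "T": 1},
]
cond2 = toy_cond(2)
start2 = {ctx: 1 / len(cond2) for ctx in cond2}  # uniform start, an assumption
dist2 = score_dist_order_k(toy_pssm, cond2, start2)
print(sorted(dist2.items()))
```

Each position visits at most $4^k$ contexts, 4 next bases, and one entry per attainable partial score, which is where the $4^k \cdot S_{\max} \cdot p$ growth comes from when the weights are non-negative integers.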

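Proof of equation (*). For ease of notation and without loss of generality, we let $p$, the length of the PSSM, be 3.

For a DNA sequence $D_1 D_2 D_3$, its match score against PSSM $m$ is $w_{1D_1} + w_{2D_2} + w_{3D_3}$, and the probability of the occurrence of $D_1 D_2 D_3$ is $f_{D_1} f_{D_2|D_1} f_{D_3|D_2}$. By definition, the PGF of the match score against $m$ is

$$\sum_{D_1 D_2 D_3} f_{D_1} f_{D_2|D_1} f_{D_3|D_2}\, t^{w_{1D_1} + w_{2D_2} + w_{3D_3}} = \sum_{D_1 D_2 D_3} f_{D_1} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}.$$

In the following, we derive the PGF in the alternative form of a product of $p$ matrices. First,

$$\begin{aligned}
&\sum_{D_1 D_2 D_3} f_{D_1} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \sum_{D_1 D_2 D_3} \sum_{a} f_a \cdot f_{D_1|a} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \sum_{a} f_a \cdot \sum_{D_1 D_2 D_3} f_{D_1|a} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = (f_A, f_C, f_G, f_T) \cdot
\begin{pmatrix}
\sum_{D_1 D_2 D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
\vdots \\
\sum_{D_1 D_2 D_3} f_{D_1|T} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix}.
\end{aligned}$$

For the component of the second vector corresponding to base $A$,

$$\begin{aligned}
&\sum_{D_1 D_2 D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \sum_{D_1} f_{D_1|A} t^{w_{1D_1}} \cdot \sum_{D_2 D_3} f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \left( f_{A|A} t^{w_{1A}},\ f_{C|A} t^{w_{1C}},\ f_{G|A} t^{w_{1G}},\ f_{T|A} t^{w_{1T}} \right) \cdot
\begin{pmatrix}
\sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
\vdots \\
\sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix}.
\end{aligned}$$

We apply similar arguments to the components corresponding to bases $C$, $G$, and $T$ and obtain, for each $b \in \{C, G, T\}$,

$$\begin{aligned}
&\sum_{D_1 D_2 D_3} f_{D_1|b} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \left( f_{A|b} t^{w_{1A}},\ f_{C|b} t^{w_{1C}},\ f_{G|b} t^{w_{1G}},\ f_{T|b} t^{w_{1T}} \right) \cdot
\begin{pmatrix}
\sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
\vdots \\
\sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix}.
\end{aligned}$$

Therefore, for the first position, we have

$$\begin{aligned}
&\begin{pmatrix}
\sum_{D_1 D_2 D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
\vdots \\
\sum_{D_1 D_2 D_3} f_{D_1|T} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix} \\
&\quad =
\begin{pmatrix}
f_{A|A} t^{w_{1A}} & f_{C|A} t^{w_{1C}} & f_{G|A} t^{w_{1G}} & f_{T|A} t^{w_{1T}} \\
f_{A|C} t^{w_{1A}} & f_{C|C} t^{w_{1C}} & f_{G|C} t^{w_{1G}} & f_{T|C} t^{w_{1T}} \\
f_{A|G} t^{w_{1A}} & f_{C|G} t^{w_{1C}} & f_{G|G} t^{w_{1G}} & f_{T|G} t^{w_{1T}} \\
f_{A|T} t^{w_{1A}} & f_{C|T} t^{w_{1C}} & f_{G|T} t^{w_{1G}} & f_{T|T} t^{w_{1T}}
\end{pmatrix} \cdot
\begin{pmatrix}
\sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|C} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|G} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix} \\
&\quad = P \cdot \mathrm{Diag}(t^{w_{1A}}, t^{w_{1C}}, t^{w_{1G}}, t^{w_{1T}}) \cdot
\begin{pmatrix}
\sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
\vdots \\
\sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix} \\
&\quad = P \cdot M_1(t) \cdot
\begin{pmatrix}
\sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
\vdots \\
\sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix}.
\end{aligned}$$

Further applying the above arguments to positions 2 and 3, we have

$$\begin{aligned}
\begin{pmatrix}
\sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
\vdots \\
\sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix}
&= P \cdot \mathrm{Diag}(t^{w_{2A}}, t^{w_{2C}}, t^{w_{2G}}, t^{w_{2T}}) \cdot
\begin{pmatrix}
\sum_{D_3} f_{D_3|A} t^{w_{3D_3}} \\
\vdots \\
\sum_{D_3} f_{D_3|T} t^{w_{3D_3}}
\end{pmatrix} \\
&= P \cdot \mathrm{Diag}(t^{w_{2A}}, t^{w_{2C}}, t^{w_{2G}}, t^{w_{2T}}) \cdot P \cdot \mathrm{Diag}(t^{w_{3A}}, t^{w_{3C}}, t^{w_{3G}}, t^{w_{3T}}) \cdot (1, 1, 1, 1)^T \\
&= P \cdot M_2(t) \cdot P \cdot M_3(t) \cdot I.
\end{aligned}$$

Combining the steps above, with $\pi = (f_A, f_C, f_G, f_T)$, we obtain $G(t) = \pi \left( \prod_{i=1}^{p} P\, M_i(t) \right) I$.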

ACKNOWLEDGMENTS

The work of H.H., X.Z., and W.H.W. is supported by NSF grants DBI-0196176 and DMS-0090166. The work of H.H. and J.S.L. is supported by NSF grant DMS-0204674 and NIH grant R01 HG02518-01. The work of M.-C.J.K. is supported by the Howard Hughes Medical Institute predoctoral fellowship.

REFERENCES

Bailey, T.L., and Elkan, C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2, 28–36.

Chen, Q.K., Hertz, G.Z., and Stormo, G.D. 1995. MATRIX SEARCH 1.0: A computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput. Appl. Biosci. 11, 563–566.

Durbin, R., Eddy, S.R., Krogh, A., and Mitchison, G. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, Cambridge, UK.

Fickett, J.W. 1996. Coordinate positioning of MEF2 and myogenin binding sites. Gene 172, GC19–32.

Fried, M., and Crothers, D.M. 1981. Equilibria and kinetics of lac repressor-operator interactions by polyacrylamide gel electrophoresis. Nucl. Acids Res. 9, 6505–6525.

Galas, D.J., and Schmitz, A. 1978. DNAse footprinting: A simple method for the detection of protein–DNA binding specificity. Nucl. Acids Res. 5, 3157–3170.

Garner, M.M., and Revzin, A. 1981. A gel electrophoresis method for quantifying the binding of proteins to specific DNA regions: Application to components of the Escherichia coli lactose operon regulatory system. Nucl. Acids Res. 9, 3047–3060.

Gusfield, D. 1997. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, England.

Gut, A. 1995. An Intermediate Course in Probability, Springer-Verlag, New York.

Hertz, G.Z., Hartzell 3rd, G.W., and Stormo, G.D. 1990. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci. 6, 81–92.

Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T., Durbin, R., Eyras, E., Gilbert, J., Hammond, M., Huminiecki, L., Kasprzyk, A., Lehvaslaiho, H., Lijnzaad, P., Melsopp, C., Mongin, E., Pettett, R., Pocock, M., Potter, S., Rust, A., Schmidt, E., Searle, S., Slater, G., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Stupka, E., Ureta-Vidal, A., Vastrik, I., and Clamp, M. 2002. The Ensembl genome database project. Nucl. Acids Res. 30, 38–41.

Hughes, J.D., Estep, P.W., Tavazoie, S., and Church, G.M. 2000. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol. 296, 1205–1214.

Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., Funke, R., Gage, D., Harris, K., Heaford, A., Howland, J., Kann, L., Lehoczky, J., LeVine, R., McEwan, P., McKernan, K., Meldrim, J., Mesirov, J.P., Miranda, C., Morris, W., Naylor, J., Raymond, C., Rosetti, M., Santos, R., Sheridan, A., Sougnez, C., Stange-Thomann, N., Stojanovic, N., Subramanian, A., Wyman, D., Rogers, J., Sulston, J., Ainscough, R., Beck, S., Bentley, D., Burton, J., Clee, C., Carter, N., Coulson, A., Deadman, R., Deloukas, P., Dunham, A., Dunham, I., Durbin, R., French, L., Grafham, D., Gregory, S., Hubbard, T., Humphray, S., Hunt, A., Jones, M., Lloyd, C., McMurray, A., Matthews, L., Mercer, S., Milne, S., Mullikin, J.C., Mungall, A., Plumb, R., Ross, M., Shownkeen, R., Sims, S., Waterston, R.H., Wilson, R.K., Hillier, L.W., McPherson, J.D., Marra, M.A., Mardis, E.R., Fulton, L.A., Chinwalla, A.T., Pepin, K.H., Gish, W.R., Chissoe, S.L., Wendl, M.C., Delehaunty, K.D., Miner, T.L., Delehaunty, A., Kramer, J.B., Cook, L.L., Fulton, R.S., Johnson, D.L., Minx, P.J., Clifton, S.W., Hawkins, T., Branscomb, E., Predki, P., Richardson, P., Wenning, S., Slezak, T., Doggett, N., Cheng, J.F., Olsen, A., Lucas, S., Elkin, C., Uberbacher, E., Frazier, M., Gibbs, R.A., Muzny, D.M., Scherer, S.E., Bouck, J.B., Sodergren, E.J., Worley, K.C., Rives, C.M., Gorrell, J.H., Metzker, M.L., Naylor, S.L., Kucherlapati, R.S., Nelson, D.L., Weinstock, G.M., Sakaki, Y., Fujiyama, A., Hattori, M., Yada, T., Toyoda, A., Itoh, T., Kawagoe, C., Watanabe, H., Totoki, Y., Taylor, T., Weissenbach, J., Heilig, R., Saurin, W., Artiguenave, F., Brottier, P., Bruls, T., Pelletier, E., Robert, C., Wincker, P., Smith, D.R., Doucette-Stamm, L., Rubenfield, M., Weinstock, K., Lee, H.M., Dubois, J., Rosenthal, A., Platzer, M., Nyakatura, G., Taudien, S., Rump, A., Yang, H., Yu, J., Wang, J., Huang, G., Gu, J., Hood, L., Rowen, L., Madan, A., Qin, S., Davis, R.W., Federspiel, N.A., Abola, A.P., Proctor, M.J., Myers, R.M., Schmutz, J., Dickson, M., Grimwood, J., Cox, D.R., Olson, M.V., Kaul, R., Shimizu, N., Kawasaki, K., Minoshima, S., Evans, G.A., Athanasiou, M., Schultz, R., Roe, B.A., Chen, F., Pan, H., Ramser, J., Lehrach, H., Reinhardt, R., McCombie, W.R., de la Bastide, M., Dedhia, N., Blocker, H., Hornischer, K., Nordsiek, G., Agarwala, R., Aravind, L., Bailey, J.A., Bateman, A., Batzoglou, S., Birney, E., Bork, P., Brown, D.G., Burge, C.B., Cerutti, L., Chen, H.C., Church, D., Clamp, M., Copley, R.R., Doerks, T., Eddy, S.R., Eichler, E.E., Furey, T.S., Galagan, J., Gilbert, J.G., Harmon, C., Hayashizaki, Y., Haussler, D., Hermjakob, H., Hokamp, K., Jang, W., Johnson, L.S., Jones, T.A., Kasif, S., Kaspryzk, A., Kennedy, S., Kent, W.J., Kitts, P., Koonin, E.V., Korf, I., Kulp, D., Lancet, D., Lowe, T.M., McLysaght, A., Mikkelsen, T., Moran, J.V., Mulder, N., Pollara, V.J., Ponting, C.P., Schuler, G., Schultz, J., Slater, G., Smit, A.F., Stupka, E., Szustakowski, J., Thierry-Mieg, D., Thierry-Mieg, J., Wagner, L., Wallis, J., Wheeler, R., Williams, A., Wolf, Y.I., Wolfe, K.H., Yang, S.P., Yeh, R.F., Collins, F., Guyer, M.S., Peterson, J., Felsenfeld, A., Wetterstrand, K.A., Patrinos, A., Morgan, M.J., Szustakowki, J., de Jong, P., Catanese, J.J., Osoegawa, K., Shizuya, H., Choi, S., and Chen, Y.J. 2001. Initial sequencing and analysis of the human genome. Nature 409, 860–921.

Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., and Wootton, J.C. 1993. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262, 208–214.

Lawrence, C.E., and Reilly, A.A. 1990. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 7, 41–51.

Liu, X., Brutlag, D.L., and Liu, J.S. 2001. BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput. 127–138.

Liu, X.S., Brutlag, D.L., and Liu, J.S. 2002. An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol.

Nakatsuji, Y., Hidaka, K., Tsujino, S., Yamamoto, Y., Mukai, T., Yanagihara, T., Kishimoto, T., and Sakoda, S. 1992. A single MEF-2 site is a major positive regulatory element required for transcription of the muscle-specific subunit of the human phosphoglycerate mutase gene in skeletal and cardiac muscle cells. Mol. Cell. Biol. 12, 4384–4390.

Quandt, K., Frech, K., Karas, H., Wingender, E., and Werner, T. 1995. MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucl. Acids Res. 23, 4878–4884.

Rosenthal, N., Berglund, E.B., Wentworth, B.M., Donoghue, M., Winter, B., Bober, E., Braun, T., and Arnold, H.H. 1990. A highly conserved enhancer downstream of the human MLC1/3 locus is a target for multiple myogenic determination factors. Nucl. Acids Res. 18, 6239–6246.

Roth, F.P., Hughes, J.D., Estep, P.W., and Church, G.M. 1998. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol. 16, 939–945.

Staden, R. 1989. Methods for calculating the probabilities of finding patterns in sequences. Comput. Appl. Biosci. 5, 89–96.

Stormo, G.D., and Hartzell 3rd, G.W. 1989. Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl. Acad. Sci. USA 86, 1183–1187.

Wentworth, B.M., Donoghue, M., Engert, J.C., Berglund, E.B., and Rosenthal, N. 1991. Paired MyoD-binding sites regulate myosin light chain gene expression. Proc. Natl. Acad. Sci. USA 88, 1242–1246.

Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Pruss, M., Reuter, I., and Schacherer, F. 2000. TRANSFAC: An integrated system for gene expression regulation. Nucl. Acids Res. 28, 316–319.

Address correspondence to:
Jun S. Liu, Wing H. Wong
Department of Statistics
Science Center, 6th floor
1 Oxford Street
Cambridge, MA 02138

E-mail: {jliu, wwong}@stat.harvard.edu


Table 2 (Continued)

(b) Using entropy-weighted myogenin PSSM

Position (bp from     TRANSFAC   p-values of observed score under local background model
last exon of MYL1)    score      iid         1st Markov†   2nd Markov    Binding site
1267 (A)              4667       0.000008    0.000017      0.00003       AGCAGGTG
1339 (C)              4628       0.000018    0.000032      0.000059      GACAGGTG
5434                  4667       0.000036    0.000074      0.000095      AGCAGCTG
1323 (B)              4581       0.000045    0.000077      0.000127      ACCAGCTG
2463                  4628       0.000068    0.000152      0.000194      GCCAGCTG
2574                  4596       0.000073    0.000177      0.000191      GGCAGATG
926                   4463       0.000224    0.000414      0.000532      TGCAGGTG
7534                  4377       0.000378    0.000534      0.000788      TACAGCTG
7156                  4377       0.000346    0.00054       0.000686      CCCAGCTG
4895                  4322       0.000829    0.001998      0.002045      CTCAGGTG

(c) Using log-odds MEF2 PSSMs

Position (bp from     Log-odds     p-values of observed score under local background model
last exon of MYL1)    PSSM score   iid         1st Markov†   2nd Markov    Binding site
−2970                 669          0.000199    0.000160      0.000228      ATTTTAAATA
−3115                 671          0.000209    0.000183      0.000243      GTTATAAATA
−161                  649          0.000355    0.000183      0.000322      ATTTTAAGCA
−2939                 668          0.000233    0.000190      0.000266      TGTTTAAATC
−3151                 663          0.000807    0.000655      0.000747      TGTTTAAGAA
−4767                 656          0.000951    0.001009      0.001712      TTTTTATATA
−3433                 649          0.003940    0.003099      0.003383      AAACTAAAAA
−3566                 644          0.005710    0.004913      0.005231      TTTTTAAAGC
−3214                 643          0.007155    0.005654      0.006459      AGTTTATATC
−3577                 641          0.007363    0.006312      0.006625      GGTTTAACAT

(d) Using entropy-weighted MEF2 PSSM

Position (bp from     TRANSFAC   p-values of observed score under local background model
last exon of MYL1)    score      iid         1st Markov†   2nd Markov    Binding site
−2970                 591        0.000103    0.000082      0.000133      ATTTTAAATA
−161                  550        0.000191    0.000101      0.000174      ATTTTAAGCA
−3115                 590        0.000206    0.000181      0.000243      GTTATAAATA
−2939                 554        0.001001    0.000807      0.001016      TGTTTAAATC
−3151                 562        0.001091    0.000914      0.001083      TGTTTAAGAA
−4700                 531        0.001451    0.001451      0.001451      TTGTTAAAGA
−3566                 543        0.002961    0.002532      0.002676      TTTTTAAAGC
−3433                 545        0.003271    0.002610      0.002788      AAACTAAAAA
−4444                 532        0.003080    0.003200      0.004320      CATATAATTA
−3687                 535        0.003761    0.003320      0.003671      GAAGTAAAGA

a Sorted in increasing order by the column marked with †.


respectively, with the true sites A, B, and C labeled and shaded in gray, along with their local p-values. We find the PSSM scores to be a less sensitive measure than the local p-value: the true sites A, B, and C stand out under the local p-values, while they are not as distinct from the false predictions under the PSSM scores.

PGAM-M MEF2 binding site prediction

A major positive regulatory element is required for the muscle-specific expression of the muscle-specific subunit of the human phosphoglycerate mutase (PGAM-M) gene (Nakatsuji et al., 1992). This element, located 161 bp upstream of the gene, is found to be bound by the transcription factor MEF-2.

We applied the LMM to the 5000 bp PGAM-M upstream region, using the MEF2_Q6 PSSM to derive the local p-values for each candidate. The top 10 scoring candidate sites derived using log-odds or entropy-weighted PSSMs are listed in Tables 2c and 2d, respectively, with the true site labeled and shaded in gray, along with their local p-values. We find that LMM behaves similarly as in the MYL1 enhancer.

Overall, from Table 2, we see that by taking into account the local sequence composition, we have reordered the candidate sequences in a way that is favorable to the true binding sites.

LARGE-SCALE VALIDATION

In order to evaluate the performance of LMM and to compare our local p-values to PSSM similarity scores, we apply both LMM and TRANSFAC to 101 known binding sites in the human genome, obtained by mapping binding sites in the TRANSFAC database onto the human genome. We recorded and evaluated the extent to which LMM and TRANSFAC can capture this large collection of known binding sites in the human genome, and the amount of noise generated in so doing.

FIG. 3. Large-scale validation of TRANSFAC and LMM. Tradeoff between sensitivity and noise. (a) We compared the abilities of the two methods to detect the 101 known binding sites in the human genome by looking at their sensitivity and noise-to-signal ratio. The balance of the tradeoff between these two measures achieved at various significance levels by LMM is traced and compared to that attained by TRANSFAC. The inset graph shows the performance of LMM and TRANSFAC across all levels of sensitivity. (b) Detailed results for p = 0.00001 and 0.0002.

In Figure 3a, the tradeoff between sensitivity and noise is shown in terms of the proportion of the known binding sites detected and the amount of concomitant noise generated. Noise is measured by the noise-to-signal ratio, which is defined as the number of binding site calls not known to be correct divided by the number of known binding sites found. For comparison, we show the tradeoffs achieved by TRANSFAC using its three matrix-specific similarity score cutoffs (FN, SUM, FP), along with that achieved by LMM under Markov models of orders 0, 1, 2, and 3 at various p-value cutoffs, starting at the stringent p = 0.00001. From the inset graph, we see that at all levels of sensitivity LMM outperformed TRANSFAC by producing significantly less noise. While the performance of LMM comes close to that of TRANSFAC as the p-value cutoff increases, by then the advantage of increased sensitivity has, for both methods, been nullified by the high level of accompanying noise, rendering them impractical. Overall, not only is the sensitivity of LMM comparable to that of TRANSFAC, its noise-to-signal ratio is also vastly superior. It should be noted that, since only a limited number of true binding sites are known, not every unsupported binding site prediction is necessarily a false-positive prediction. Thus, the noise-to-signal ratio overestimates the true noise level, especially when stringent criteria are used to generate putative TFBSs with high sequence similarity to known binding sites. As the criteria relax, the large numbers of predictions over and above the known binding sites imply a high level of true background noise.
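The two measures used in this comparison can be computed as in the following sketch. The site identifiers, p-values, and the function name are hypothetical; only the definitions of sensitivity and noise-to-signal ratio are taken from the text.

```python
def sensitivity_and_noise(calls, known_sites):
    """Return (sensitivity, noise-to-signal ratio) for a set of binding site calls.

    sensitivity     = known sites found / all known sites
    noise-to-signal = calls not known to be correct / known sites found
    """
    calls, known = set(calls), set(known_sites)
    found = calls & known
    sensitivity = len(found) / len(known)
    unsupported = len(calls - known)
    noise_to_signal = unsupported / len(found) if found else float("inf")
    return sensitivity, noise_to_signal


# Hypothetical local p-values keyed by (gene, position); four sites are "known".
local_p = {
    ("G1", -161): 0.00010, ("G1", -2970): 0.00008, ("G2", -40): 0.00015,
    ("G2", -900): 0.00019, ("G3", -12): 0.00300, ("G3", -77): 0.00012,
    ("G4", -5): 0.00018, ("G4", -310): 0.00005, ("G5", -640): 0.00017,
    ("G5", -88): 0.00002,
}
known = [("G1", -161), ("G2", -40), ("G3", -12), ("G4", -5)]

cutoff = 0.0002
calls = [site for site, p in local_p.items() if p <= cutoff]
print(sensitivity_and_noise(calls, known))  # 3 of 4 known sites found, 6 unsupported calls
```

Sweeping `cutoff` over a range of values and recording the two numbers at each step traces out a tradeoff curve of the kind compared in the validation.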

More detailed results for TRANSFAC using the three cutoffs, and for LMM using different significance cutoffs (0.00001 and 0.0002) and under different Markov models, are summarized in Fig. 3b. While the FN cutoff missed relatively few known binding sites, it generated more than 45 false-positive predictions for every accurate binding site call. On the other hand, FP made fewer false positives, but it detected only one in nine known binding sites. The SUM cutoff, designed as a balance of these inherent tradeoffs, did strike a reasonable compromise, having generated about nine false positives for every real binding site and detected more than half of the known sites.

At the stringent significance cutoff p = 0.00001, LMM detected about twice as many binding sites as did the FP cutoff, and on average produced about 60% fewer false-positive predictions for every correct prediction. At the more relaxed p-value cutoff p = 0.0002, the sensitivity of LMM is comparable to that of the SUM cutoff, while only half of the noise is generated. The binding sites that were detected by LMM at p ≤ 0.0002 but missed by TRANSFAC using the SUM cutoff include a MEF2 binding site over the desmin gene, an ATF1 (activating transcription factor 1) binding site over the TGFβ2 gene, a HIF (hypoxia-inducible factor) binding site over the VEGF gene, and an ICSBP (IFN consensus sequence binding protein) binding site over the OAS1 gene. We choose p = 0.0002 as the general significance cutoff for the application of LMM to mammalian genomic sequences, a cutoff with a sufficiently high sensitivity and an acceptable amount of noise. Overall, the LMM provides an advantageous tradeoff between noise-to-signal ratio and sensitivity.

In our validation experiment, we found that Markov models of orders 1, 2, and 3 have better combinations of high sensitivity and low noise than the iid model, confirming an earlier observation (Liu et al., 2001) that Markov models can better capture the structure of biological sequences. In addition, we compared the performance of the second-order LMM against an analogous global Markov model, with parameters estimated from a large collection of upstream regions, in order to assess the ability of LMM to model the local sequence context information. We found that, over the 101 known human TFBSs in situ, LMM generally outperforms the global Markov model, while the two behave similarly at high and low sensitivity levels (Fig. 4). At high sensitivity levels, the lax p-value cutoffs produce large numbers of putative TFBS calls, overwhelming the advantage enjoyed by LMM. At low sensitivity levels, the stringent p-value cutoffs yield only putative TFBSs with undeniable sequence similarity to known binding sites. Thus, the noise-to-signal ratio may not reflect the true noise level in this region.

FIG. 4. The use of local sequence context is advantageous. The performance of the second-order LMM is compared against an analogous global Markov model with parameters estimated from a large collection of upstream regions. The performance is assessed in terms of the noise-to-signal ratio and sensitivity. At the recommended p-value cutoff, 0.0002, LMM is more sensitive and less noisy.

DISCUSSION

The work presented in this paper attemps to identify TFBSs by considering simultaneously both theirsimilarity to the query PSSM and their differences from the local genomic context Through the studyof the human TFBSs in TRANSFAC we show that LMM which makes putative TFBS calls using localp-values yields a much improved false-positive to true-positive ratio than that using the TRANSFAC orlog-odds scores alone

It has been known that neighboring nucleotide compositions can affect the interaction between a tran-scription factor and its binding site To our best knowledge however there is no documented study onwhether and how much an improvement can be made on the PSSM-based TFBS detection using a localbackground model The result we present which is based on more than 100 experimentally determinedTFBS sequences in the human genome shows a clear overall advantage for incorporating the local se-quence context into PSSM-based TFBS search There are various biological mechanisms that can explainthis effect which may lead to more complicated and more speci c models For instance it may be thatthe local 1000 bp genomic region does not contain DNA sequences similar to the true binding site be-cause otherwise the target transcription factor may be competed away from its biologically meaningfulbinding site

While this improvement does not in itself render a solution to the much more dif cult problem ofdetecting regulatory modules by signi cantly reducing false-positive calls for single sites the local p-value approach will contribute substantially to any subsequent algorithms aiming to detect combinatorialregulatory modules The method we developed here is seen as a proof of principle and can be used asa component of a more complex approach For example considering that clusters of binding sites alsooften occur within small regions of about 200 bp to cooperatively recruit the transcription factors a naturalfuture development of LMM would be to take this distance effect into the background estimation andcombine the LMM p-values of a few candidate PSSM sites Many challenging problems in computationalbiology eg translation initiation site identi cation splice site recognition and RNA secondary structureprediction can be modeled in terms of the recognition of motifs Our work may be adapted and extendedto these problems as well However it should be noted that when applied to protein sequences which arecomposed of a 20-letter alphabet the performance of our algorithm may become an issue especially whenthe order of the Markov chain k is large

DETAILED METHODS

Data extraction for large-scale validation

To evaluate the performance of the LMM we apply it to known TFBSs in the human genome Knownbinding sites are extracted from the SITE table of the TRANSFAC database version 62 About half of the12262 binding sites in this table are experimentally derived from various species The rest are generatedfrom in vitro binding assays on arti cial nucleotide sequences Since LMM studies binding sites withrespect to their genomic contexts these arti cial sequences which do not correspond to any genomicregion cannot be used for our validation study Of the 6073 in vivo binding sites 1425 sites are basedon the human genome Of these 149 (105) are annotated with a corresponding PSSM We use thesebinding sites for validation

10 HUANG ET AL

To locate the known TFBSs in the human genome we focus on the 5000 bp upstream sequences of allgenes We made use of the annotations provided by Ensembl (Hubbard et al 2002) and extracted 22808human gene promoters from the human genome assembly NCBI golden path 29 (wwwensemblorgHomo_sapiens) Since heuristic sequence-mapping algorithms do not perform well on short sequences such asTFBSs we use an exact-match algorithm based on suf x trees (Gus eld 1997) We found that many bindingsite sequences are precisely mapped onto the promoters of the correct target genes For those binding siteswith mappings onto multiple promoters or with no mapping we attempted to retrieve them by manualreview To nd the correct one among multiple mappings we made correspondences between the Ensemblgene name and the target gene name of the binding site as recorded by TRANSFAC A review of somemissed matches using inexact match algorithms revealed a small number of single-basepair differencesbetween the recorded binding site sequences and the promoter sequences of the target genes for examplethe binding site HS$ALBU_06 over the human albumin promoter After validating against the primaryliterature for the positions of these binding sites we included these mappings as well In total we located101 human TFBSs

Local p-value calculation

Although the exact score distribution can be obtained by enumerating all possible binding site sequencesunder any ldquonullrdquo model for the observed nucleotide base pairs the computational cost for a PSSM of lengthp is 4p Stadenrsquos method (Staden 1989) which turns this into an order-p computation is based on thePGF of the score under the simple null model that the base pairs are independent and identically distributed(iid) Recently however there are some evidences suggesting that Markov background models work betterthan the iid model for detecting TFBS (Liu et al 2002) By extending Stadenrsquos PGF method to dependentrandom variables we present here the derivation of the PGFs under a rst-order Markov model the basisof the ef cient algorithm for computing the exact score distribution

Probability generating function derivation

In our study we make use of the PSSMs constructed by TRANSFAC version 62 Given a PSSMm D wij ppound4 where i D 1 p and j D ACGT the match score S and the similarity score S=Smax

of a sequence D1D2 Dp is de ned as (Quandt et al 1995)

S DpX

iD1

wiDi and S=Smax DpX

iD1

wij

iquest pX

iD1

maxj

fwij g

Let S be a random variable taking integer values then its probability generating function Gt is theexpected value of tS Gt is a polynomial and the coef cient of the term tn is the probability of the eventS D n (Gut 1995)

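Given a PSSM $m$ of length $p$, under the assumption that the DNA sequence is iid, Staden provided the PGF of the match score in the form of a product of $p$ polynomials (Staden, 1989): $G(t) = \prod_{i=1}^{p} \sum_{j=A,C,G,T} f_j t^{w_{ij}}$, where $f_j$ is the frequency of letter $j$ in the iid DNA sequence. For the first-order Markov case ($k = 1$), let the transition matrix be $P = (f_{\alpha|\beta})_{4 \times 4}$, where the row index $\beta$ is the preceding base and the column index $\alpha$ is the current base, so that $f_{\alpha|\beta}$ is the probability of $\alpha$ following $\beta$. Let the stationary distribution of the Markov chain be $\pi$ (viewed as a four-dimensional row vector). Then the PGF under the first-order Markov model is

$$G(t) = \pi \prod_{i=1}^{p} \bigl(P M_i(t)\bigr) I, \qquad (*)$$

where $M_i(t) = \mathrm{Diag}(t^{w_{iA}}, t^{w_{iC}}, t^{w_{iG}}, t^{w_{iT}})$ and $I = (1, 1, 1, 1)^T$ (proof provided at the end of this section).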
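Equation (*) can be evaluated without symbolic algebra by carrying, for each current base, a polynomial in $t$ stored as a map from score to coefficient. The sketch below is a minimal illustration under stated assumptions: integer weights, a transition table `trans[a][b]` holding the probability of base `b` following base `a`, and a stationary vector obtained by power iteration. The helper names and the toy numbers are hypothetical, and this is not the authors' C++ code. A brute-force enumeration is included only to check the result.

```python
from collections import defaultdict
from itertools import product

BASES = "ACGT"


def stationary(trans, iterations=200):
    """Approximate the stationary row vector pi (pi P = pi) by power iteration."""
    pi = {b: 0.25 for b in BASES}
    for _ in range(iterations):
        pi = {b: sum(pi[a] * trans[a][b] for a in BASES) for b in BASES}
    return pi


def markov_score_distribution(pssm, trans, pi):
    """Coefficients of G(t) = pi * prod_i (P M_i(t)) * I as {score: probability}."""
    vec = {a: {0: pi[a]} for a in BASES}  # row vector of polynomials, starts at pi
    for column in pssm:
        nxt = {b: defaultdict(float) for b in BASES}
        for a in BASES:
            for score, prob in vec[a].items():
                for b in BASES:
                    # multiply by P (move a -> b), then by M_i(t) (shift by w_ib)
                    nxt[b][score + column[b]] += prob * trans[a][b]
        vec = nxt
    total = defaultdict(float)  # right-multiply by the all-ones vector I
    for b in BASES:
        for score, prob in vec[b].items():
            total[score] += prob
    return dict(total)


def brute_force(pssm, trans, pi):
    """Reference answer: enumerate all 4^p sequences of the stationary chain."""
    dist = defaultdict(float)
    for seq in product(BASES, repeat=len(pssm)):
        prob = pi[seq[0]]
        for prev, cur in zip(seq, seq[1:]):
            prob *= trans[prev][cur]
        dist[sum(col[b] for col, b in zip(pssm, seq))] += prob
    return dict(dist)


# Toy transition table and PSSM (illustrative values only).
trans = {"A": {"A": 0.4, "C": 0.2, "G": 0.3, "T": 0.1},
         "C": {"A": 0.1, "C": 0.5, "G": 0.2, "T": 0.2},
         "G": {"A": 0.3, "C": 0.3, "G": 0.2, "T": 0.2},
         "T": {"A": 0.2, "C": 0.1, "G": 0.3, "T": 0.4}}
pssm = [{"A": 3, "C": 0, "G": 1, "T": 0},
        {"A": 0, "C": 2, "G": 0, "T": 1},
        {"A": 1, "C": 0, "G": 4, "T": 0}]
pi = stationary(trans)
fast = markov_score_distribution(pssm, trans, pi)
brute = brute_force(pssm, trans, pi)
```

Each position costs a fixed number of operations per stored coefficient, which is where the linear dependence on the matrix length comes from.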

Since a Markov chain of order $k$ on a set $\Gamma$ is equivalently a first-order Markov chain on the set $\Gamma^k$, with a small modification of $M_i(t)$ we can generalize the above results to $k > 1$. An example of the PGF for $k = 3$ is in the online supplement (www.biostat.harvard.edu/complab/LMM). Using this representation for the PGF, we developed and implemented an algorithm in C++ to calculate the exact score distribution. Generally, for a $k$th-order Markov chain and a PSSM of length $p$, the time complexity of our algorithm is $O(4^k \cdot S_{\max} \cdot p)$: linear in the matrix length but exponential in the order of the Markov chain. The source code is available upon request (www.biostat.hsph.harvard.edu/LMM).
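The reduction from an order-$k$ chain to a first-order chain on $k$-tuples can be sketched as follows. The text does not spell out the modification of $M_i(t)$; the version here, which weights a $k$-tuple state by its most recent base, is an assumption made for illustration, as are the function names and the toy conditional probabilities.

```python
from itertools import product

BASES = "ACGT"


def lift_to_first_order(cond, k):
    """Rewrite an order-k chain as a first-order chain on k-letter states.

    cond[context][base] is P(base | preceding k bases).
    Returns lifted[state][next_state]; missing entries have probability 0.
    """
    lifted = {}
    for letters in product(BASES, repeat=k):
        state = "".join(letters)
        # The only reachable successors drop the oldest base and append a new one.
        lifted[state] = {state[1:] + b: cond[state][b] for b in BASES}
    return lifted


def position_weight(column, state):
    """Weight of a lifted state: the PSSM weight of its most recent base (assumed)."""
    return column[state[-1]]


k = 2
# Toy order-2 model in which every context has the same next-base probabilities.
cond = {"".join(c): {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3}
        for c in product(BASES, repeat=k)}
lifted = lift_to_first_order(cond, k)
```

The lifted chain has $4^k$ states, which is the source of the exponential factor in the stated complexity.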

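Proof of equation (*). For ease of notation and without loss of generality, we let $p$, the length of the PSSM, be 3.

For a DNA sequence $D_1 D_2 D_3$, its match score against PSSM $m$ is $w_{1D_1} + w_{2D_2} + w_{3D_3}$, and the probability of the occurrence of $D_1 D_2 D_3$ is $f_{D_1} f_{D_2|D_1} f_{D_3|D_2}$. By definition, the PGF of the match score against $m$ is

$$\sum_{D_1 D_2 D_3} f_{D_1} f_{D_2|D_1} f_{D_3|D_2}\, t^{w_{1D_1} + w_{2D_2} + w_{3D_3}} = \sum_{D_1 D_2 D_3} f_{D_1} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}.$$

In the following, we derive the PGF in the alternative form of a product of $p$ matrices. First, since $\pi = (f_A, f_C, f_G, f_T)$ is stationary, $f_{D_1} = \sum_a f_a f_{D_1|a}$, and therefore

$$\begin{aligned}
&\sum_{D_1 D_2 D_3} f_{D_1} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \sum_{D_1 D_2 D_3} \sum_{a} f_a \cdot f_{D_1|a} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \sum_{a} f_a \cdot \sum_{D_1 D_2 D_3} f_{D_1|a} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = (f_A, f_C, f_G, f_T) \cdot \Bigl( \sum_{D_1 D_2 D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}},\ \ldots,\ \sum_{D_1 D_2 D_3} f_{D_1|T} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Bigr)^T.
\end{aligned}$$

For the component of the second vector corresponding to base A,

$$\begin{aligned}
&\sum_{D_1 D_2 D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \sum_{D_1} f_{D_1|A} t^{w_{1D_1}} \cdot \sum_{D_2 D_3} f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = (f_{A|A} t^{w_{1A}},\, f_{C|A} t^{w_{1C}},\, f_{G|A} t^{w_{1G}},\, f_{T|A} t^{w_{1T}}) \cdot \Bigl( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}},\ \ldots,\ \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Bigr)^T.
\end{aligned}$$

We apply similar arguments to the components corresponding to bases C, G, and T and obtain

$$\sum_{D_1 D_2 D_3} f_{D_1|C} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} = (f_{A|C} t^{w_{1A}},\, f_{C|C} t^{w_{1C}},\, f_{G|C} t^{w_{1G}},\, f_{T|C} t^{w_{1T}}) \cdot \Bigl( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}},\ \ldots,\ \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Bigr)^T,$$

$$\sum_{D_1 D_2 D_3} f_{D_1|G} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} = (f_{A|G} t^{w_{1A}},\, f_{C|G} t^{w_{1C}},\, f_{G|G} t^{w_{1G}},\, f_{T|G} t^{w_{1T}}) \cdot \Bigl( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}},\ \ldots,\ \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Bigr)^T,$$

$$\sum_{D_1 D_2 D_3} f_{D_1|T} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} = (f_{A|T} t^{w_{1A}},\, f_{C|T} t^{w_{1C}},\, f_{G|T} t^{w_{1G}},\, f_{T|T} t^{w_{1T}}) \cdot \Bigl( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}},\ \ldots,\ \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Bigr)^T.$$

Therefore, for the first position, we have

$$\begin{aligned}
&\Bigl( \sum_{D_1 D_2 D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}},\ \ldots,\ \sum_{D_1 D_2 D_3} f_{D_1|T} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Bigr)^T \\
&\quad = \begin{pmatrix}
f_{A|A} t^{w_{1A}} & f_{C|A} t^{w_{1C}} & f_{G|A} t^{w_{1G}} & f_{T|A} t^{w_{1T}} \\
f_{A|C} t^{w_{1A}} & f_{C|C} t^{w_{1C}} & f_{G|C} t^{w_{1G}} & f_{T|C} t^{w_{1T}} \\
f_{A|G} t^{w_{1A}} & f_{C|G} t^{w_{1C}} & f_{G|G} t^{w_{1G}} & f_{T|G} t^{w_{1T}} \\
f_{A|T} t^{w_{1A}} & f_{C|T} t^{w_{1C}} & f_{G|T} t^{w_{1G}} & f_{T|T} t^{w_{1T}}
\end{pmatrix} \cdot
\begin{pmatrix}
\sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|C} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|G} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix} \\
&\quad = P \cdot \mathrm{Diag}(t^{w_{1A}}, t^{w_{1C}}, t^{w_{1G}}, t^{w_{1T}}) \cdot \Bigl( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}},\ \ldots,\ \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Bigr)^T \\
&\quad = P \cdot M_1(t) \cdot \Bigl( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}},\ \ldots,\ \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Bigr)^T.
\end{aligned}$$

Further applying the above arguments to positions 2 and 3, we have

$$\begin{aligned}
&\Bigl( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}},\ \ldots,\ \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Bigr)^T \\
&\quad = P \cdot \mathrm{Diag}(t^{w_{2A}}, t^{w_{2C}}, t^{w_{2G}}, t^{w_{2T}}) \cdot \Bigl( \sum_{D_3} f_{D_3|A} t^{w_{3D_3}},\ \ldots,\ \sum_{D_3} f_{D_3|T} t^{w_{3D_3}} \Bigr)^T \\
&\quad = P \cdot \mathrm{Diag}(t^{w_{2A}}, t^{w_{2C}}, t^{w_{2G}}, t^{w_{2T}}) \cdot P \cdot \mathrm{Diag}(t^{w_{3A}}, t^{w_{3C}}, t^{w_{3G}}, t^{w_{3T}}) \cdot (1, 1, 1, 1)^T \\
&\quad = P \cdot M_2(t) \cdot P \cdot M_3(t) \cdot I.
\end{aligned}$$

Putting these together, $G(t) = \pi \prod_{i=1}^{p} \bigl(P M_i(t)\bigr) I$.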
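As a quick sanity check on equation (*), offered here as an illustration and not as part of the original proof, consider the case $p = 1$. Because $\pi$ is stationary, $\pi P = \pi$, and the Markov PGF collapses to the single-position factor of the iid formula with $f_b = \pi_b$:

```latex
G(t) = \pi\, P\, M_1(t)\, I
     = \sum_{b \in \{A,C,G,T\}} (\pi P)_b \, t^{w_{1b}}
     = \sum_{b \in \{A,C,G,T\}} f_b \, t^{w_{1b}} .
```

For general $p$, setting $t = 1$ makes every $M_i(1)$ the identity matrix, and since each row of $P$ sums to one, $P I = I$. Hence $G(1) = \pi P^p I = \pi I = 1$, as required of a probability generating function.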

ACKNOWLEDGMENTS

The work of H.H., X.Z., and W.H.W. is supported by NSF grants DBI0196176 and DMS-0090166. The work of H.H. and J.S.L. is supported by NSF grant DMS-0204674 and NIH grant R01 HG02518-01. The work of M.-C.J.K. is supported by the Howard Hughes Medical Institute predoctoral fellowship.

REFERENCES

Bailey TL, and Elkan C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2, 28–36.

Chen QK, Hertz GZ, and Stormo GD. 1995. MATRIX SEARCH 1.0: A computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput Appl Biosci 11, 563–566.

Durbin R, Eddy SR, Krogh A, and Mitchison G. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.

Fickett JW. 1996. Coordinate positioning of MEF2 and myogenin binding sites. Gene 172, GC19–32.

Fried M, and Crothers DM. 1981. Equilibria and kinetics of lac repressor-operator interactions by polyacrylamide gel electrophoresis. Nucl Acids Res 9, 6505–6525.

Galas DJ, and Schmitz A. 1978. DNAse footprinting: A simple method for the detection of protein–DNA binding specificity. Nucl Acids Res 5, 3157–3170.

Garner MM, and Revzin A. 1981. A gel electrophoresis method for quantifying the binding of proteins to specific DNA regions: Application to components of the Escherichia coli lactose operon regulatory system. Nucl Acids Res 9, 3047–3060.

Gusfield D. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, England.

Gut A. 1995. An Intermediate Course in Probability. Springer-Verlag, New York.

Hertz GZ, Hartzell 3rd GW, and Stormo GD. 1990. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput Appl Biosci 6, 81–92.

Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Huminiecki L, Kasprzyk A, Lehvaslaiho H, Lijnzaad P, Melsopp C, Mongin E, Pettett R, Pocock M, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I, and Clamp M. 2002. The Ensembl genome database project. Nucl Acids Res 30, 38–41.

Hughes JD, Estep PW, Tavazoie S, and Church GM. 2000. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol 296, 1205–1214.

Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, Szustakowki J, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, and Chen YJ. 2001. Initial sequencing and analysis of the human genome. Nature 409, 860–921.

Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, and Wootton JC. 1993. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262, 208–214.

Lawrence CE, and Reilly AA. 1990. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 7, 41–51.

Liu X, Brutlag DL, and Liu JS. 2001. BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput 127–138.

Liu XS, Brutlag DL, and Liu JS. 2002. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat Biotechnol.

Nakatsuji Y, Hidaka K, Tsujino S, Yamamoto Y, Mukai T, Yanagihara T, Kishimoto T, and Sakoda S. 1992. A single MEF-2 site is a major positive regulatory element required for transcription of the muscle-specific subunit of the human phosphoglycerate mutase gene in skeletal and cardiac muscle cells. Mol Cell Biol 12, 4384–4390.

Quandt K, Frech K, Karas H, Wingender E, and Werner T. 1995. MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucl Acids Res 23, 4878–4884.

Rosenthal N, Berglund EB, Wentworth BM, Donoghue M, Winter B, Bober E, Braun T, and Arnold HH. 1990. A highly conserved enhancer downstream of the human MLC1/3 locus is a target for multiple myogenic determination factors. Nucl Acids Res 18, 6239–6246.

Roth FP, Hughes JD, Estep PW, and Church GM. 1998. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol 16, 939–945.

Staden R. 1989. Methods for calculating the probabilities of finding patterns in sequences. Comput Appl Biosci 5, 89–96.

Stormo GD, and Hartzell 3rd GW. 1989. Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci USA 86, 1183–1187.

Wentworth BM, Donoghue M, Engert JC, Berglund EB, and Rosenthal N. 1991. Paired MyoD-binding sites regulate myosin light chain gene expression. Proc Natl Acad Sci USA 88, 1242–1246.

Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, Meinhardt T, Pruss M, Reuter I, and Schacherer F. 2000. TRANSFAC: An integrated system for gene expression regulation. Nucl Acids Res 28, 316–319.

Address correspondence to:
Jun S. Liu, Wing H. Wong
Department of Statistics
Science Center, 6th floor
1 Oxford Street
Cambridge, MA 02138

E-mail: {jliu, wwong}@stat.harvard.edu


respectively, with the true sites A, B, and C labeled and shaded in gray, along with their local p-values. We find the PSSM scores to be a less sensitive measure than the local p-value: the true sites A, B, and C stand out under the local p-values, while they are not as distinct from the false predictions under the PSSM scores.

PGAM-M MEF2 binding site prediction

A major positive regulatory element is required for the muscle-specific expression of the muscle-specific subunit of the human phosphoglycerate mutase (PGAM-M) gene (Nakatsuji et al., 1992). This element, located 161 bp upstream of the gene, is found to be bound by the transcription factor MEF-2.

We applied the LMM to the 5000 bp PGAM-M upstream region, using the MEF2_Q6 PSSM to derive the local p-values for each candidate. The top 10 scoring candidate sites derived using log-odds or entropy-weighted PSSMs are listed in Tables 2c and 2d, respectively, with the true site labeled and shaded in gray, along with their local p-values. We find that LMM behaves here much as it does on the MYL1 enhancer.

Overall, from Table 2, we see that by taking into account the local sequence composition, we have reordered the candidate sequences in a way that is favorable to the true binding sites.

LARGE-SCALE VALIDATION

In order to evaluate the performance of LMM and to compare our local p-values to PSSM similarity scores, we apply both LMM and TRANSFAC to 101 known binding sites in the human genome, obtained by mapping binding sites in the TRANSFAC database onto the human genome. We recorded and evaluated the extent to which LMM and TRANSFAC can capture this large collection of known binding sites in the human genome and the amount of noise generated in so doing.

FIG. 3. Large-scale validation of TRANSFAC and LMM. Tradeoff between sensitivity and noise. (a) We compared the abilities of the two methods to detect the 101 known binding sites in the human genome by looking at their sensitivity and noise-to-signal ratio. The balance of the tradeoff between these two measures achieved at various significance levels by LMM is traced and compared to that attained by TRANSFAC. The inset graph shows the performance of LMM and TRANSFAC across all levels of sensitivity. (b) Detailed results for p = 0.00001 and 0.0002.

In Figure 3a, the tradeoff between sensitivity and noise is shown in terms of the proportion of the known binding sites detected and the amount of concomitant noise generated. Noise is measured by the noise-to-signal ratio, which is defined as the number of binding site calls not known to be correct divided by the number of known binding sites found. For comparison, we show the tradeoffs achieved by TRANSFAC using its three matrix-specific similarity score cutoffs (FN, SUM, FP), along with that achieved by LMM under Markov models of orders 0, 1, 2, and 3 at various p-value cutoffs, starting at the stringent p = 0.00001. From the inset graph, we see that at all levels of sensitivity LMM outperformed TRANSFAC by producing significantly less noise. While the performance of LMM comes close to that of TRANSFAC as the p-value cutoff increases, by then, for both methods, the advantage of increased sensitivity has been nullified by the high level of accompanying noise, rendering them impractical. Overall, not only is the sensitivity of LMM comparable to that of TRANSFAC, its noise-to-signal ratio is also vastly superior. It should be noted that, since only a limited number of true binding sites are known, not every unsupported binding site prediction is necessarily a false-positive prediction. Thus, the noise-to-signal ratio overestimates the true noise level, especially when stringent criteria are used to generate putative TFBSs with high sequence similarity to known binding sites. As the criteria relax, the large numbers of predictions over and above the known binding sites imply a high level of true background noise.
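The two evaluation measures used here are simple set computations. The sketch below follows the definitions in the text; the function name and the way sites are identified (plain integers in the toy example) are assumptions made for illustration.

```python
def sensitivity_and_noise(predicted, known):
    """Return (sensitivity, noise-to-signal ratio) for a set of site calls.

    sensitivity     = known sites found / all known sites
    noise-to-signal = calls not known to be correct / known sites found
    """
    predicted, known = set(predicted), set(known)
    found = predicted & known
    unsupported = predicted - known
    sensitivity = len(found) / len(known)
    # With no known site found, the ratio is undefined; report infinity.
    noise_to_signal = len(unsupported) / len(found) if found else float("inf")
    return sensitivity, noise_to_signal


# Toy example: 4 known sites, 6 calls of which 2 hit known sites.
sens, nts = sensitivity_and_noise(predicted={1, 2, 10, 11, 12, 13}, known={1, 2, 3, 4})
```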

More detailed results for TRANSFAC using the three cutoffs and for LMM using different significance cutoffs (0.00001 and 0.0002) and under different Markov models are summarized in Fig. 3b. While the FN cutoff missed relatively few known binding sites, it generated more than 45 false-positive predictions for every accurate binding site call. On the other hand, FP made fewer false positives, but it detected only one in nine known binding sites. The SUM cutoff, designed as a balance of these inherent tradeoffs, did strike a reasonable compromise, having generated about nine false positives for every real binding site and detected more than half of the known sites.

At the stringent significance cutoff p = 0.00001, LMM detected about twice as many binding sites as did the FP cutoff and, on average, produced about 60% fewer false-positive predictions for every correct prediction. At the more relaxed p-value cutoff p = 0.0002, the sensitivity of LMM is comparable to that of the SUM cutoff, while only half of the noise is generated. The binding sites that were detected by LMM at p ≤ 0.0002 but missed by TRANSFAC using the SUM cutoff include a MEF2 binding site over the desmin gene, an ATF1 (activating transcription factor 1) binding site over the TGFβ2 gene, a HIF (hypoxia-inducible factor) binding site over the VEGF gene, and an ICSBP (IFN consensus sequence binding protein) binding site over the OAS1 gene. We choose p = 0.0002 as the general significance cutoff for the application of LMM to mammalian genomic sequences, a cutoff with a sufficiently high sensitivity and an acceptable amount of noise. Overall, the LMM provides an advantageous tradeoff between noise-to-signal ratio and sensitivity.
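Applying a p-value cutoff to a candidate site can be sketched as below. The code assumes that the p-value of an observed score is the upper-tail probability P(S ≥ s) under the null score distribution, which is a common convention but is not spelled out in this section; the function names and the toy distribution are hypothetical.

```python
def tail_p_value(dist, observed):
    """Upper-tail probability P(S >= observed) for a {score: probability} map."""
    return sum(prob for score, prob in dist.items() if score >= observed)


def is_called(dist, observed, cutoff=0.0002):
    """Call a candidate site when its tail p-value does not exceed the cutoff."""
    return tail_p_value(dist, observed) <= cutoff


# Toy null distribution over four possible scores (illustrative values only).
null_dist = {0: 0.5, 5: 0.3, 10: 0.1999, 20: 0.0001}
strong = is_called(null_dist, 20)   # tail probability 0.0001
weak = is_called(null_dist, 10)     # tail probability 0.2
```

In a local model, each candidate would be compared against the distribution computed from its own surrounding sequence, so `dist` would differ from site to site.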

FIG. 4. The use of local sequence context is advantageous. The performance of the second-order LMM is compared against an analogous global Markov model with parameters estimated from a large collection of upstream regions. The performance is assessed in terms of the noise-to-signal ratio and sensitivity. At the recommended p-value cutoff 0.0002, LMM is more sensitive and less noisy.

In our validation experiment, we found that Markov models of orders 1, 2, and 3 have better combinations of high sensitivity and low noise than the iid model, confirming an earlier observation (Liu et al., 2001) that Markov models can better capture the structure of biological sequences. In addition, we compared the performance of the second-order LMM against an analogous global Markov model, with parameters estimated from a large collection of upstream regions, in order to assess the ability of LMM to model the local sequence context information. We found that, over the 101 known human TFBSs in situ, LMM generally outperforms the global Markov model, while they behave similarly at high and low sensitivity levels (Fig. 4). At high sensitivity levels, the lax p-value cutoffs produce large numbers of putative TFBS calls, overwhelming the advantage enjoyed by LMM. At low sensitivity levels, the stringent p-value cutoffs yield only putative TFBSs with undeniable sequence similarity to known binding sites. Thus, the noise-to-signal ratio may not reflect the true noise level in this region.
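The difference between a local and a global background model lies only in which sequence the Markov parameters are estimated from. As a rough sketch, and not the authors' estimation procedure, first-order transition probabilities can be estimated from any sequence window by counting adjacent base pairs; the pseudocount and the function name are assumptions introduced here.

```python
from collections import Counter

BASES = "ACGT"


def estimate_transitions(window, pseudocount=1.0):
    """Estimate trans[a][b] = P(b follows a) from adjacent pairs in a window.

    Pairs containing characters other than A, C, G, T are ignored.
    """
    counts = Counter(zip(window, window[1:]))
    trans = {}
    for a in BASES:
        row_total = sum(counts[(a, b)] for b in BASES) + 4 * pseudocount
        trans[a] = {b: (counts[(a, b)] + pseudocount) / row_total for b in BASES}
    return trans


# Toy window; a local model would pass the sequence around one candidate site,
# a global model a large pooled collection of upstream regions.
local_trans = estimate_transitions("ACGTACGTAC")
```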

DISCUSSION

The work presented in this paper attempts to identify TFBSs by considering simultaneously both their similarity to the query PSSM and their differences from the local genomic context. Through the study of the human TFBSs in TRANSFAC, we show that LMM, which makes putative TFBS calls using local p-values, yields a much better false-positive to true-positive ratio than that obtained using the TRANSFAC or log-odds scores alone.

It has been known that neighboring nucleotide compositions can affect the interaction between a transcription factor and its binding site. To the best of our knowledge, however, there is no documented study on whether and how much of an improvement can be made on PSSM-based TFBS detection using a local background model. The result we present, which is based on more than 100 experimentally determined TFBS sequences in the human genome, shows a clear overall advantage for incorporating the local sequence context into PSSM-based TFBS search. There are various biological mechanisms that can explain this effect, which may lead to more complicated and more specific models. For instance, it may be that the local 1000 bp genomic region does not contain DNA sequences similar to the true binding site because otherwise the target transcription factor may be competed away from its biologically meaningful binding site.

While this improvement does not in itself render a solution to the much more difficult problem of detecting regulatory modules, by significantly reducing false-positive calls for single sites, the local p-value approach will contribute substantially to any subsequent algorithms aiming to detect combinatorial regulatory modules. The method we developed here is seen as a proof of principle and can be used as a component of a more complex approach. For example, considering that clusters of binding sites also often occur within small regions of about 200 bp to cooperatively recruit the transcription factors, a natural future development of LMM would be to take this distance effect into the background estimation and combine the LMM p-values of a few candidate PSSM sites. Many challenging problems in computational biology, e.g., translation initiation site identification, splice site recognition, and RNA secondary structure prediction, can be modeled in terms of the recognition of motifs. Our work may be adapted and extended to these problems as well. However, it should be noted that when applied to protein sequences, which are composed of a 20-letter alphabet, the performance of our algorithm may become an issue, especially when the order of the Markov chain $k$ is large.

DETAILED METHODS

Data extraction for large-scale validation

To evaluate the performance of the LMM, we apply it to known TFBSs in the human genome. Known binding sites are extracted from the SITE table of the TRANSFAC database, version 6.2. About half of the 12,262 binding sites in this table are experimentally derived from various species. The rest are generated from in vitro binding assays on artificial nucleotide sequences. Since LMM studies binding sites with respect to their genomic contexts, these artificial sequences, which do not correspond to any genomic region, cannot be used for our validation study. Of the 6073 in vivo binding sites, 1425 sites are based on the human genome. Of these, 149 (10.5%) are annotated with a corresponding PSSM. We use these binding sites for validation.

sequencing and analysis of thehuman genome Nature 409 860ndash921

Lawrence CE Altschul SF Boguski MS Liu JS Neuwald AF and Wootton JC 1993 Detecting subtlesequence signals A Gibbs sampling strategy for multiple alignment Science 262 208ndash214

Lawrence CE and Reilly AA 1990 An expectation maximization (EM) algorithm for the identi cation and char-acterization of common sites in unaligned biopolymer sequences Proteins 7 41ndash51

Liu X Brutlag DL and Liu JS 2001 BioProspector Discovering conserved DNA motifs in upstream regulatoryregions of co-expressed genes Pac Symp Biocomput 127ndash138

Liu XS Brutlag DL and Liu JS 2002 An algorithm for nding protein DNA binding sites with applications tochromatin-immunoprecipitation microarray experiments Nat Biotechnol

Nakatsuji Y Hidaka K Tsujino S Yamamoto Y Mukai T Yanagihara T Kishimoto T and Sakoda S 1992A single MEF-2 site is a major positive regulatory element required for transcription of the muscle-specic subunitof the human phosphoglycerate mutase gene in skeletal and cardiac muscle cells Mol Cell Biol 12 4384ndash4390

Quandt K Frech K Karas H Wingender E and Werner T 1995 MatInd and MatInspector New fast and versatiletools for detection of consensus matches in nucleotide sequence data Nucl Acids Res 23 4878ndash4884

Rosenthal N Berglund EB Wentworth BM Donoghue M Winter B Bober E Braun T and Arnold HH1990 A highly conserved enhancer downstream of the human MLC13 locus is a target for multiple myogenicdetermination factors Nucl Acids Res 18 6239ndash6246

Roth FP Hughes JD Estep PW and Church GM 1998 Finding DNA regulatory motifs within unalignednoncoding sequences clustered by whole-genome mRNA quantitation Nat Biotechnol 16 939ndash945

Staden R 1989 Methods for calculating the probabilities of nding patterns in sequences Comput Appl Biosci 589ndash96

Stormo GD and Hartzell 3rd GW 1989 Identifying protein-binding sites from unaligned DNA fragments ProcNatl Acad Sci USA 86 1183ndash1187

Wentworth BM Donoghue M Engert JC Berglund EB and Rosenthal N 1991 Paired MyoD-binding sitesregulate myosin light chain gene expression Proc Natl Acad Sci USA 88 1242ndash1246

Wingender E Chen X Hehl R Karas H Liebich I Matys V Meinhardt T Pruss M Reuter I and SchachererF 2000 TRANSFAC An integrated system for gene expression regulation Nucl Acids Res 28 316ndash319

Address correspondence toJun S Liu Wing H Wong

Department of StatisticsScience Center 6th oor

1 Oxford StreetCambridge MA 02138

E-mail jliu wwongstatharvardedu

Page 8: Determination of Local Statistical Significance of ...mckao/documents/JCB-LMM.pdf · that the incorporation of the local genomic context can be advantageous in the prediction of myogenin

8 HUANG ET AL

by producing significantly less noise. While the performance of LMM comes close to that of TRANSFAC as the p-value cutoff increases, by then, for both methods, the advantage of increased sensitivity has been nullified by the high level of accompanying noise, rendering them impractical. Overall, not only is the sensitivity of LMM comparable to that of TRANSFAC, its noise-to-signal ratio is also vastly superior. It should be noted that, since only a limited number of true binding sites are known, not every unsupported binding site prediction is necessarily a false-positive prediction. Thus, the noise-to-signal ratio overestimates the true noise level, especially when stringent criteria are used to generate putative TFBSs with high sequence similarity to known binding sites. As the criteria relax, the large numbers of predictions over and above the known binding sites imply a high level of true background noise.

More detailed results for TRANSFAC using the three cutoffs, and for LMM using different significance cutoffs (0.00001 and 0.0002) and under different Markov models, are summarized in Fig. 3b. While the FN cutoff missed relatively few known binding sites, it generated more than 45 false-positive predictions for every accurate binding site call. On the other hand, the FP cutoff made fewer false positives, but it detected only one in nine known binding sites. The SUM cutoff, designed as a balance of these inherent tradeoffs, did strike a reasonable compromise, having generated about nine false positives for every real binding site and detected more than half of the known sites.

At the stringent significance cutoff p = 0.00001, LMM detected about twice as many binding sites as did the FP cutoff and, on average, produced about 60% fewer false-positive predictions for every correct prediction. At the more relaxed p-value cutoff p = 0.0002, the sensitivity of LMM is comparable to that of the SUM cutoff, while only half of the noise is generated. The binding sites that were detected by LMM at p ≤ 0.0002 but missed by TRANSFAC using the SUM cutoff include a MEF2 binding site over the desmin gene, an ATF1 (activating transcription factor 1) binding site over the TGFβ2 gene, a HIF (hypoxia-inducible factor) binding site over the VEGF gene, and an ICSBP (IFN consensus sequence binding protein) binding site over the OAS1 gene. We choose p = 0.0002 as the general significance cutoff for the application of LMM to mammalian genomic sequences, a cutoff with a sufficiently high sensitivity and an acceptable amount of noise. Overall, the LMM provides an advantageous tradeoff between noise-to-signal ratio and sensitivity.
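The two quantities compared throughout this section can be sketched in a few lines. The snippet below is only an illustration: it assumes that sensitivity is the fraction of known sites recovered and that the noise-to-signal ratio is the number of unsupported calls per recovered known site (the false-positive to true-positive ratio). The function name, the site identifiers, and the toy numbers are hypothetical.

```python
def evaluate_calls(predicted, known):
    """Return (sensitivity, noise_to_signal) for a set of predicted sites.

    predicted, known: sets of hashable site identifiers, for example
    (gene, position) tuples. Both names are illustrative only.
    """
    true_pos = len(predicted & known)    # known sites that were recovered
    false_pos = len(predicted - known)   # calls with no supporting known site
    sensitivity = true_pos / len(known) if known else float("nan")
    noise_to_signal = false_pos / true_pos if true_pos else float("inf")
    return sensitivity, noise_to_signal


# Toy data: 4 known sites, 2 of them recovered, plus 18 unsupported calls.
known_sites = {("geneA", 120), ("geneB", 45), ("geneC", 300), ("geneD", 77)}
calls = {("geneA", 120), ("geneB", 45)} | {("geneX", i) for i in range(18)}
sensitivity, noise_to_signal = evaluate_calls(calls, known_sites)
```

In this toy case, half of the known sites are recovered at a cost of nine unsupported calls per recovered site, which is the kind of operating point described for the SUM cutoff.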

In our validation experiment, we found that Markov models of orders 1, 2, and 3 have better combinations of high sensitivity and low noise than the i.i.d. model, confirming an earlier observation (Liu et al., 2001) that Markov models can better capture the structure of biological sequences. In addition, we compared the performance of the second-order LMM against an analogous global Markov model, with parameters estimated from a large collection of upstream regions, in order to assess the ability of LMM to model the local sequence context information. We found that, over the 101 known human TFBSs in situ, LMM generally outperforms the global Markov model, while the two behave similarly at high and low sensitivity levels (Fig. 4). At high sensitivity levels, the lax p-value cutoffs produce large numbers of putative TFBS calls, overwhelming the advantage enjoyed by LMM. At low sensitivity levels, the stringent p-value cutoffs yield only putative TFBSs with undeniable sequence similarity to known binding sites. Thus, the noise-to-signal ratio may not reflect the true noise level in this region.

FIG. 4. The use of local sequence context is advantageous. The performance of the second-order LMM is compared against an analogous global Markov model with parameters estimated from a large collection of upstream regions. The performance is assessed in terms of the noise-to-signal ratio and sensitivity. At the recommended p-value cutoff, 0.0002, LMM is more sensitive and less noisy.

DISCUSSION

The work presented in this paper attempts to identify TFBSs by considering simultaneously both their similarity to the query PSSM and their differences from the local genomic context. Through the study of the human TFBSs in TRANSFAC, we show that LMM, which makes putative TFBS calls using local p-values, yields a much better false-positive to true-positive ratio than approaches using the TRANSFAC or log-odds scores alone.

It has been known that neighboring nucleotide compositions can affect the interaction between a transcription factor and its binding site. To the best of our knowledge, however, there is no documented study on whether, and by how much, PSSM-based TFBS detection can be improved by using a local background model. The result we present, which is based on more than 100 experimentally determined TFBS sequences in the human genome, shows a clear overall advantage for incorporating the local sequence context into PSSM-based TFBS search. There are various biological mechanisms that can explain this effect, which may lead to more complicated and more specific models. For instance, it may be that the local 1,000 bp genomic region does not contain DNA sequences similar to the true binding site because, otherwise, the target transcription factor may be competed away from its biologically meaningful binding site.

While this improvement does not in itself render a solution to the much more difficult problem of detecting regulatory modules, by significantly reducing false-positive calls for single sites, the local p-value approach will contribute substantially to any subsequent algorithms aiming to detect combinatorial regulatory modules. The method we developed here is seen as a proof of principle and can be used as a component of a more complex approach. For example, considering that clusters of binding sites also often occur within small regions of about 200 bp to cooperatively recruit the transcription factors, a natural future development of LMM would be to take this distance effect into the background estimation and combine the LMM p-values of a few candidate PSSM sites. Many challenging problems in computational biology, e.g., translation initiation site identification, splice site recognition, and RNA secondary structure prediction, can be modeled in terms of the recognition of motifs. Our work may be adapted and extended to these problems as well. However, it should be noted that when applied to protein sequences, which are composed of a 20-letter alphabet, the performance of our algorithm may become an issue, especially when the order of the Markov chain, k, is large.

DETAILED METHODS

Data extraction for large-scale validation

To evaluate the performance of the LMM, we apply it to known TFBSs in the human genome. Known binding sites are extracted from the SITE table of the TRANSFAC database, version 6.2. About half of the 12,262 binding sites in this table are experimentally derived from various species. The rest are generated from in vitro binding assays on artificial nucleotide sequences. Since LMM studies binding sites with respect to their genomic contexts, these artificial sequences, which do not correspond to any genomic region, cannot be used for our validation study. Of the 6,073 in vivo binding sites, 1,425 sites are based on the human genome. Of these, 149 (10.5%) are annotated with a corresponding PSSM. We use these binding sites for validation.

To locate the known TFBSs in the human genome, we focus on the 5,000 bp upstream sequences of all genes. We made use of the annotations provided by Ensembl (Hubbard et al., 2002) and extracted 22,808 human gene promoters from the human genome assembly NCBI golden path 29 (www.ensembl.org/Homo_sapiens). Since heuristic sequence-mapping algorithms do not perform well on short sequences such as TFBSs, we use an exact-match algorithm based on suffix trees (Gusfield, 1997). We found that many binding site sequences are precisely mapped onto the promoters of the correct target genes. For those binding sites with mappings onto multiple promoters or with no mapping, we attempted to retrieve them by manual review. To find the correct one among multiple mappings, we made correspondences between the Ensembl gene name and the target gene name of the binding site as recorded by TRANSFAC. A review of some missed matches using inexact match algorithms revealed a small number of single-basepair differences between the recorded binding site sequences and the promoter sequences of the target genes, for example, the binding site HS$ALBU_06 over the human albumin promoter. After validating against the primary literature for the positions of these binding sites, we included these mappings as well. In total, we located 101 human TFBSs.
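The mapping step above can be sketched as follows. This is a simplified stand-in, not the suffix-tree implementation used in the study: it relies on Python's built-in substring search, which gives the same exact-match hits on small inputs but without the efficiency of a suffix tree. All names (`find_all`, `map_sites`) and the toy sequences are hypothetical.

```python
def find_all(text, pattern):
    """Return every start index at which pattern occurs in text (overlaps allowed)."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def map_sites(sites, promoters):
    """Classify each site as 'unique', 'multiple', or 'none' by exact matching.

    sites:     dict mapping a site identifier to its sequence
    promoters: dict mapping a gene name to its upstream sequence
    """
    report = {}
    for site_id, seq in sites.items():
        hits = [(gene, pos)
                for gene, upstream in promoters.items()
                for pos in find_all(upstream, seq)]
        genes = {gene for gene, _ in hits}
        if not hits:
            status = "none"        # candidate for inexact matching and manual review
        elif len(genes) == 1:
            status = "unique"
        else:
            status = "multiple"    # resolve by comparing gene names
        report[site_id] = (status, hits)
    return report


# Toy promoters and sites, for illustration only.
toy_promoters = {"GENE1": "ACGTTTGACCA", "GENE2": "TTGACGGG"}
toy_sites = {"s1": "ACGT", "s2": "TTGAC", "s3": "CCCC"}
report = map_sites(toy_sites, toy_promoters)
```

Sites reported as "multiple" or "none" correspond to the cases that the text resolves by gene-name correspondence and manual review.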

Local p-value calculation

Although the exact score distribution can be obtained by enumerating all possible binding site sequences under any "null" model for the observed nucleotide base pairs, the computational cost for a PSSM of length $p$ is $4^p$. Staden's method (Staden, 1989), which turns this into an order-$p$ computation, is based on the PGF of the score under the simple null model that the base pairs are independent and identically distributed (i.i.d.). Recently, however, there has been some evidence suggesting that Markov background models work better than the i.i.d. model for detecting TFBSs (Liu et al., 2002). By extending Staden's PGF method to dependent random variables, we present here the derivation of the PGFs under a first-order Markov model, the basis of the efficient algorithm for computing the exact score distribution.

Probability generating function derivation

In our study, we make use of the PSSMs constructed by TRANSFAC version 6.2. Given a PSSM $m = (w_{ij})_{p \times 4}$, where $i = 1, \ldots, p$ and $j = A, C, G, T$, the match score $S$ and the similarity score $S/S_{\max}$ of a sequence $D_1 D_2 \cdots D_p$ are defined as (Quandt et al., 1995)

$$S = \sum_{i=1}^{p} w_{iD_i} \quad \text{and} \quad S/S_{\max} = \sum_{i=1}^{p} w_{iD_i} \Big/ \sum_{i=1}^{p} \max_{j} \{ w_{ij} \}.$$

Let $S$ be a random variable taking integer values; then its probability generating function $G(t)$ is the expected value of $t^S$. $G(t)$ is a polynomial, and the coefficient of the term $t^n$ is the probability of the event $S = n$ (Gut, 1995).
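As a small worked illustration (the numbers are hypothetical and chosen only for easy arithmetic), suppose a single-position score $X$ equals 0 with probability 0.7 and 2 with probability 0.3. The PGF of $X$, and of the sum $S$ of two independent copies of $X$, can then be read off as follows.

```latex
G_X(t) = 0.7 + 0.3\,t^{2}
% The PGF of a sum of independent scores is the product of their PGFs:
G_S(t) = G_X(t)^{2} = 0.49 + 0.42\,t^{2} + 0.09\,t^{4}
% Reading the coefficients:
P(S = 0) = 0.49, \qquad P(S = 2) = 0.42, \qquad P(S = 4) = 0.09
```

Multiplying polynomials therefore replaces the enumeration of all score combinations, which is the property that the product forms in this section exploit.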

Given a PSSM $m$ of length $p$, under the assumption that the DNA sequence is i.i.d., Staden provided the PGF of the match score in the form of a product of $p$ polynomials (Staden, 1989):

$$G(t) = \prod_{i=1}^{p} \sum_{j = A, C, G, T} f_j \, t^{w_{ij}},$$

where $f_j$ is the frequency of letter $j$ in the i.i.d. DNA sequence. For the first-order Markov case ($k = 1$), let the transition matrix be $P = (f_{\alpha|\beta})_{4 \times 4}$, whose entry in row $\beta$ and column $\alpha$ is the probability that base $\alpha$ follows base $\beta$, and let the stationary distribution of the Markov chain be $\pi$ (viewed as a four-dimensional row vector). Then the PGF under the first-order Markov model is

$$G(t) = \pi \left[ \prod_{i=1}^{p} P \, M_i(t) \right] I, \qquad (*)$$

where $M_i(t) = \mathrm{Diag}(t^{w_{iA}}, t^{w_{iC}}, t^{w_{iG}}, t^{w_{iT}})$ and $I = (1, 1, 1, 1)^T$ (proof provided at the end of this section).
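The two product forms above can be evaluated by carrying polynomials as score-to-probability tables. The sketch below is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: weights are assumed to be integers, polynomials are stored as dictionaries, and the PSSM, the transition matrix, and all function names are hypothetical. A brute-force enumeration is included only to check the matrix form on a short PSSM.

```python
from itertools import product

BASES = "ACGT"


def iid_score_distribution(pssm, freq):
    """Staden-style product of p polynomials; returns {score: probability}."""
    dist = {0: 1.0}
    for row in pssm:  # multiply by sum_j f_j * t**w_ij
        nxt = {}
        for score, prob in dist.items():
            for b in BASES:
                key = score + row[b]
                nxt[key] = nxt.get(key, 0.0) + prob * freq[b]
        dist = nxt
    return dist


def markov1_score_distribution(pssm, trans, pi):
    """Evaluate pi * prod_i (P * M_i(t)) * ones from left to right.

    vec[b] holds a polynomial in t as {exponent: coefficient}; b is the most
    recently generated base. trans[a][b] = P(next base is b | current base a).
    """
    vec = {b: {0: pi[b]} for b in BASES}  # the row vector pi
    for row in pssm:
        nxt = {b: {} for b in BASES}
        for a in BASES:
            for score, prob in vec[a].items():
                for b in BASES:  # apply P, then the diagonal M_i(t)
                    key = score + row[b]
                    nxt[b][key] = nxt[b].get(key, 0.0) + prob * trans[a][b]
        vec = nxt
    dist = {}
    for b in BASES:  # right-multiply by (1, 1, 1, 1)^T
        for score, prob in vec[b].items():
            dist[score] = dist.get(score, 0.0) + prob
    return dist


def brute_force_distribution(pssm, trans, pi):
    """Enumerate all 4**(p+1) sequences; only feasible for short PSSMs."""
    dist = {}
    for first in BASES:  # base preceding the site, drawn from pi
        for seq in product(BASES, repeat=len(pssm)):
            prob, prev, score = pi[first], first, 0
            for row, b in zip(pssm, seq):
                prob *= trans[prev][b]
                score += row[b]
                prev = b
            dist[score] = dist.get(score, 0.0) + prob
    return dist


# Hypothetical integer-weight PSSM of length 3 and a circulant transition
# matrix, chosen so that the uniform distribution is stationary.
toy_pssm = [
    {"A": 2, "C": 0, "G": 1, "T": 0},
    {"A": 0, "C": 3, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 0, "T": 2},
]
toy_trans = {
    "A": {"A": 0.4, "C": 0.3, "G": 0.2, "T": 0.1},
    "C": {"A": 0.1, "C": 0.4, "G": 0.3, "T": 0.2},
    "G": {"A": 0.2, "C": 0.1, "G": 0.4, "T": 0.3},
    "T": {"A": 0.3, "C": 0.2, "G": 0.1, "T": 0.4},
}
uniform = {b: 0.25 for b in BASES}

markov_dist = markov1_score_distribution(toy_pssm, toy_trans, uniform)
```

When every row of the transition matrix equals the base frequencies, the Markov routine reduces to the i.i.d. product, which gives a convenient sanity check.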

Since a Markov chain of order $k$ on a set $\Omega$ is equivalently a first-order Markov chain on the set $\Omega^k$, with a little modification on $M_i(t)$, we can generalize the above results to $k > 1$. An example of the PGF for $k = 3$ is in the online supplement (www.biostat.harvard.edu/complab/LMM). Using this representation for the PGF, we developed and implemented an algorithm using C++ to calculate the exact score distribution. Generally, for a $k$th-order Markov chain and a PSSM of length $p$, the time complexity of our algorithm is $O(4^k \cdot S_{\max} \cdot p)$, linear in the matrix length but exponential in the order of the Markov chain. The source code is available upon request (www.biostat.hsph.harvard.edu/LMM).
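The state-expansion idea can be sketched by tracking the $k$ most recent bases as the state of a first-order chain. The code below is a hedged illustration, not the C++ program mentioned above: it assumes integer weights, takes the distribution over the $k$-letter context preceding the site as an input, and assumes that the p-value of interest is the upper-tail probability of the score. All names and the toy inputs are hypothetical.

```python
from itertools import product

BASES = "ACGT"


def markov_k_score_distribution(pssm, init, trans):
    """Exact score distribution under an order-k Markov background.

    init  : dict mapping each k-letter context (str) to its probability
    trans : dict mapping context -> dict base -> P(base | context)
    The state is the k most recent bases, so the chain is first-order on
    at most 4**k states.
    """
    state = {ctx: {0: prob} for ctx, prob in init.items() if prob > 0}
    for row in pssm:
        nxt = {}
        for ctx, poly in state.items():
            for b in BASES:
                step = trans[ctx][b]
                if step == 0.0:
                    continue
                new_ctx = (ctx + b)[1:]  # slide the k-letter window
                target = nxt.setdefault(new_ctx, {})
                for score, prob in poly.items():
                    key = score + row[b]
                    target[key] = target.get(key, 0.0) + prob * step
        state = nxt
    dist = {}
    for poly in state.values():  # marginalize over the final context
        for score, prob in poly.items():
            dist[score] = dist.get(score, 0.0) + prob
    return dist


def p_value(dist, observed):
    """Upper-tail probability P(S >= observed) under the background model."""
    return sum(prob for score, prob in dist.items() if score >= observed)


# Toy second-order example with a uniform background (illustration only).
toy_pssm = [
    {"A": 2, "C": 0, "G": 1, "T": 0},
    {"A": 0, "C": 3, "G": 0, "T": 1},
]
contexts = ["".join(c) for c in product(BASES, repeat=2)]
init = {ctx: 1.0 / len(contexts) for ctx in contexts}
trans = {ctx: {b: 0.25 for b in BASES} for ctx in contexts}
dist = markov_k_score_distribution(toy_pssm, init, trans)
tail = p_value(dist, 4)
```

Each position visits up to $4^k$ contexts, four successor bases, and one entry per attainable score, which is in line with the stated $O(4^k \cdot S_{\max} \cdot p)$ cost up to a constant factor.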

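Proof of equation (*). For ease of notation and without loss of generality, we let $p$, the length of the PSSM, be 3.

For a DNA sequence $D_1 D_2 D_3$, its match score against PSSM $m$ is $w_{1D_1} + w_{2D_2} + w_{3D_3}$, and the probability of the occurrence of $D_1 D_2 D_3$ is $f_{D_1} f_{D_2|D_1} f_{D_3|D_2}$. By definition, the PGF of the match score against $m$ is

$$\sum_{D_1 D_2 D_3} f_{D_1} f_{D_2|D_1} f_{D_3|D_2} \, t^{w_{1D_1} + w_{2D_2} + w_{3D_3}} = \sum_{D_1 D_2 D_3} f_{D_1} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}.$$

In the following, we derive the PGF in the alternative form of a product of $p$ matrices. First, because $f$ is the stationary distribution, $f_{D_1} = \sum_a f_a f_{D_1|a}$, so that

$$\begin{aligned}
&\sum_{D_1 D_2 D_3} f_{D_1} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \sum_{D_1 D_2 D_3} \sum_{a} f_a \cdot f_{D_1|a} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \sum_{a} f_a \cdot \sum_{D_1 D_2 D_3} f_{D_1|a} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = (f_A, f_C, f_G, f_T) \cdot \Big( \sum_{D_1 D_2 D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_1 D_2 D_3} f_{D_1|T} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T.
\end{aligned}$$

For the component of the second vector corresponding to base A,

$$\begin{aligned}
&\sum_{D_1 D_2 D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \sum_{D_1} f_{D_1|A} t^{w_{1D_1}} \cdot \sum_{D_2 D_3} f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = (f_{A|A} t^{w_{1A}}, f_{C|A} t^{w_{1C}}, f_{G|A} t^{w_{1G}}, f_{T|A} t^{w_{1T}}) \cdot \Big( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T.
\end{aligned}$$

We apply similar arguments to the components corresponding to bases C, G, and T, and obtain

$$\begin{aligned}
&\sum_{D_1 D_2 D_3} f_{D_1|C} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = (f_{A|C} t^{w_{1A}}, f_{C|C} t^{w_{1C}}, f_{G|C} t^{w_{1G}}, f_{T|C} t^{w_{1T}}) \cdot \Big( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T, \\
&\sum_{D_1 D_2 D_3} f_{D_1|G} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = (f_{A|G} t^{w_{1A}}, f_{C|G} t^{w_{1C}}, f_{G|G} t^{w_{1G}}, f_{T|G} t^{w_{1T}}) \cdot \Big( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T, \\
&\sum_{D_1 D_2 D_3} f_{D_1|T} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = (f_{A|T} t^{w_{1A}}, f_{C|T} t^{w_{1C}}, f_{G|T} t^{w_{1G}}, f_{T|T} t^{w_{1T}}) \cdot \Big( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T.
\end{aligned}$$

Therefore, for the first position, we have

$$\begin{aligned}
&\Big( \sum_{D_1 D_2 D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_1 D_2 D_3} f_{D_1|T} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T \\
&\quad = \begin{pmatrix}
f_{A|A} t^{w_{1A}} & f_{C|A} t^{w_{1C}} & f_{G|A} t^{w_{1G}} & f_{T|A} t^{w_{1T}} \\
f_{A|C} t^{w_{1A}} & f_{C|C} t^{w_{1C}} & f_{G|C} t^{w_{1G}} & f_{T|C} t^{w_{1T}} \\
f_{A|G} t^{w_{1A}} & f_{C|G} t^{w_{1C}} & f_{G|G} t^{w_{1G}} & f_{T|G} t^{w_{1T}} \\
f_{A|T} t^{w_{1A}} & f_{C|T} t^{w_{1C}} & f_{G|T} t^{w_{1G}} & f_{T|T} t^{w_{1T}}
\end{pmatrix} \cdot \begin{pmatrix}
\sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|C} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|G} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix} \\
&\quad = P \cdot \mathrm{Diag}(t^{w_{1A}}, t^{w_{1C}}, t^{w_{1G}}, t^{w_{1T}}) \cdot \Big( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T \\
&\quad = P \cdot M_1(t) \cdot \Big( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T.
\end{aligned}$$

Further applying the above arguments to positions 2 and 3, we have

$$\begin{aligned}
&\Big( \sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \Big)^T \\
&\quad = P \cdot \mathrm{Diag}(t^{w_{2A}}, t^{w_{2C}}, t^{w_{2G}}, t^{w_{2T}}) \cdot \Big( \sum_{D_3} f_{D_3|A} t^{w_{3D_3}}, \; \ldots, \; \sum_{D_3} f_{D_3|T} t^{w_{3D_3}} \Big)^T \\
&\quad = P \cdot \mathrm{Diag}(t^{w_{2A}}, t^{w_{2C}}, t^{w_{2G}}, t^{w_{2T}}) \cdot P \cdot \mathrm{Diag}(t^{w_{3A}}, t^{w_{3C}}, t^{w_{3G}}, t^{w_{3T}}) \cdot (1, 1, 1, 1)^T \\
&\quad = P \cdot M_2(t) \cdot P \cdot M_3(t) \cdot I.
\end{aligned}$$

Combining the above, with $\pi = (f_A, f_C, f_G, f_T)$, we obtain $G(t) = \pi \left[ \prod_{i=1}^{p} P \, M_i(t) \right] I$.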
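As a quick consistency check on equation (*), added here for illustration and using only the definitions above, consider the special case $p = 1$ and the value at $t = 1$.

```latex
% p = 1: stationarity gives \pi P = \pi
G(t) = \pi P M_1(t) I = \pi M_1(t) I = \sum_{j \in \{A, C, G, T\}} \pi_j \, t^{w_{1j}}
% t = 1: every M_i(1) is the identity matrix and P I = I (rows of P sum to 1)
G(1) = \pi P^{p} I = \pi I = 1
```

The first line recovers the single-position i.i.d. form with frequencies $\pi_j$, and the second confirms that the coefficients of $G(t)$ sum to one, as a probability distribution must.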

ACKNOWLEDGMENTS

The work of H.H., X.Z., and W.H.W. is supported by NSF grants DBI0196176 and DMS-0090166. The work of H.H. and J.S.L. is supported by NSF grant DMS-0204674 and NIH grant R01 HG02518-01. The work of M.-C.J.K. is supported by the Howard Hughes Medical Institute predoctoral fellowship.

REFERENCES

Bailey TL and Elkan C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2, 28–36.

Chen QK, Hertz GZ, and Stormo GD. 1995. MATRIX SEARCH 1.0: A computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput Appl Biosci 11, 563–566.

Durbin R, Eddy SR, Krogh A, and Mitchison G. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.

Fickett JW. 1996. Coordinate positioning of MEF2 and myogenin binding sites. Gene 172, GC19–32.

Fried M and Crothers DM. 1981. Equilibria and kinetics of lac repressor-operator interactions by polyacrylamide gel electrophoresis. Nucl Acids Res 9, 6505–6525.

Galas DJ and Schmitz A. 1978. DNAse footprinting: A simple method for the detection of protein–DNA binding specificity. Nucl Acids Res 5, 3157–3170.

Garner MM and Revzin A. 1981. A gel electrophoresis method for quantifying the binding of proteins to specific DNA regions: Application to components of the Escherichia coli lactose operon regulatory system. Nucl Acids Res 9, 3047–3060.

Gusfield D. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, England.

Gut A. 1995. An Intermediate Course in Probability. Springer-Verlag, New York.

Hertz GZ, Hartzell 3rd GW, and Stormo GD. 1990. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput Appl Biosci 6, 81–92.

Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Huminiecki L, Kasprzyk A, Lehvaslaiho H, Lijnzaad P, Melsopp C, Mongin E, Pettett R, Pocock M, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I, and Clamp M. 2002. The Ensembl genome database project. Nucl Acids Res 30, 38–41.

Hughes JD, Estep PW, Tavazoie S, and Church GM. 2000. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol 296, 1205–1214.

Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, Szustakowki J, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, and Chen YJ. 2001. Initial sequencing and analysis of the human genome. Nature 409, 860–921.

Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, and Wootton JC. 1993. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262, 208–214.

Lawrence CE and Reilly AA. 1990. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 7, 41–51.

Liu X, Brutlag DL, and Liu JS. 2001. BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput, 127–138.

Liu XS, Brutlag DL, and Liu JS. 2002. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat Biotechnol.

Nakatsuji Y, Hidaka K, Tsujino S, Yamamoto Y, Mukai T, Yanagihara T, Kishimoto T, and Sakoda S. 1992. A single MEF-2 site is a major positive regulatory element required for transcription of the muscle-specific subunit of the human phosphoglycerate mutase gene in skeletal and cardiac muscle cells. Mol Cell Biol 12, 4384–4390.

Quandt K, Frech K, Karas H, Wingender E, and Werner T. 1995. MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucl Acids Res 23, 4878–4884.

Rosenthal N, Berglund EB, Wentworth BM, Donoghue M, Winter B, Bober E, Braun T, and Arnold HH. 1990. A highly conserved enhancer downstream of the human MLC1/3 locus is a target for multiple myogenic determination factors. Nucl Acids Res 18, 6239–6246.

Roth FP, Hughes JD, Estep PW, and Church GM. 1998. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol 16, 939–945.

Staden R. 1989. Methods for calculating the probabilities of finding patterns in sequences. Comput Appl Biosci 5, 89–96.

Stormo GD and Hartzell 3rd GW. 1989. Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci USA 86, 1183–1187.

Wentworth BM, Donoghue M, Engert JC, Berglund EB, and Rosenthal N. 1991. Paired MyoD-binding sites regulate myosin light chain gene expression. Proc Natl Acad Sci USA 88, 1242–1246.

Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, Meinhardt T, Pruss M, Reuter I, and Schacherer F. 2000. TRANSFAC: An integrated system for gene expression regulation. Nucl Acids Res 28, 316–319.

Address correspondence to:
Jun S. Liu, Wing H. Wong
Department of Statistics
Science Center 6th floor
1 Oxford Street
Cambridge, MA 02138

E-mail: jliu@stat.harvard.edu, wwong@stat.harvard.edu

Page 9: Determination of Local Statistical Significance of ...mckao/documents/JCB-LMM.pdf · that the incorporation of the local genomic context can be advantageous in the prediction of myogenin

LOCAL STATISTICAL SIGNIFICANCE OF PATTERNS 9

the performance of the second-order LMM against an analogous global Markov model with parametersestimated from a large collection of upstream regions in order to assess the ability of LMM to modelthe local sequence context information We found that over the 101 known human TFBSs in situ LMMgenerally outperforms the global Markov model while they behave similarly at high and low sensitivitylevels (Fig 4) At high sensitivity levels the lax p-value cutoffs produce large numbers of putative TFBScalls overwhelming the advantage enjoyed by LMM At low sensitivity levels the stringent p-valuecutoffs yield only putative TFBSs with undeniable sequence similarity to known binding sites Thus thenoise-to-signal ratio may not re ect the true noise level in this region

DISCUSSION

The work presented in this paper attemps to identify TFBSs by considering simultaneously both theirsimilarity to the query PSSM and their differences from the local genomic context Through the studyof the human TFBSs in TRANSFAC we show that LMM which makes putative TFBS calls using localp-values yields a much improved false-positive to true-positive ratio than that using the TRANSFAC orlog-odds scores alone

It has been known that neighboring nucleotide compositions can affect the interaction between a tran-scription factor and its binding site To our best knowledge however there is no documented study onwhether and how much an improvement can be made on the PSSM-based TFBS detection using a localbackground model The result we present which is based on more than 100 experimentally determinedTFBS sequences in the human genome shows a clear overall advantage for incorporating the local se-quence context into PSSM-based TFBS search There are various biological mechanisms that can explainthis effect which may lead to more complicated and more speci c models For instance it may be thatthe local 1000 bp genomic region does not contain DNA sequences similar to the true binding site be-cause otherwise the target transcription factor may be competed away from its biologically meaningfulbinding site

While this improvement does not in itself render a solution to the much more dif cult problem ofdetecting regulatory modules by signi cantly reducing false-positive calls for single sites the local p-value approach will contribute substantially to any subsequent algorithms aiming to detect combinatorialregulatory modules The method we developed here is seen as a proof of principle and can be used asa component of a more complex approach For example considering that clusters of binding sites alsooften occur within small regions of about 200 bp to cooperatively recruit the transcription factors a naturalfuture development of LMM would be to take this distance effect into the background estimation andcombine the LMM p-values of a few candidate PSSM sites Many challenging problems in computationalbiology eg translation initiation site identi cation splice site recognition and RNA secondary structureprediction can be modeled in terms of the recognition of motifs Our work may be adapted and extendedto these problems as well However it should be noted that when applied to protein sequences which arecomposed of a 20-letter alphabet the performance of our algorithm may become an issue especially whenthe order of the Markov chain k is large

DETAILED METHODS

Data extraction for large-scale validation

To evaluate the performance of the LMM we apply it to known TFBSs in the human genome Knownbinding sites are extracted from the SITE table of the TRANSFAC database version 62 About half of the12262 binding sites in this table are experimentally derived from various species The rest are generatedfrom in vitro binding assays on arti cial nucleotide sequences Since LMM studies binding sites withrespect to their genomic contexts these arti cial sequences which do not correspond to any genomicregion cannot be used for our validation study Of the 6073 in vivo binding sites 1425 sites are basedon the human genome Of these 149 (105) are annotated with a corresponding PSSM We use thesebinding sites for validation

10 HUANG ET AL

To locate the known TFBSs in the human genome we focus on the 5000 bp upstream sequences of allgenes We made use of the annotations provided by Ensembl (Hubbard et al 2002) and extracted 22808human gene promoters from the human genome assembly NCBI golden path 29 (wwwensemblorgHomo_sapiens) Since heuristic sequence-mapping algorithms do not perform well on short sequences such asTFBSs we use an exact-match algorithm based on suf x trees (Gus eld 1997) We found that many bindingsite sequences are precisely mapped onto the promoters of the correct target genes For those binding siteswith mappings onto multiple promoters or with no mapping we attempted to retrieve them by manualreview To nd the correct one among multiple mappings we made correspondences between the Ensemblgene name and the target gene name of the binding site as recorded by TRANSFAC A review of somemissed matches using inexact match algorithms revealed a small number of single-basepair differencesbetween the recorded binding site sequences and the promoter sequences of the target genes for examplethe binding site HS$ALBU_06 over the human albumin promoter After validating against the primaryliterature for the positions of these binding sites we included these mappings as well In total we located101 human TFBSs

Local p-value calculation

Although the exact score distribution can be obtained by enumerating all possible binding site sequences under any "null" model for the observed nucleotide base pairs, the computational cost for a PSSM of length $p$ is $4^p$. Staden's method (Staden, 1989), which turns this into an order-$p$ computation, is based on the PGF of the score under the simple null model that the base pairs are independent and identically distributed (iid). Recently, however, some evidence has suggested that Markov background models work better than the iid model for detecting TFBSs (Liu et al., 2002). By extending Staden's PGF method to dependent random variables, we present here the derivation of the PGFs under a first-order Markov model, the basis of the efficient algorithm for computing the exact score distribution.
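The brute-force baseline mentioned above can be sketched as follows. The example enumerates all $4^p$ sequences under an iid null model and, as an assumption of this sketch, takes the p-value of an observed score to be the upper tail $P(S \ge s)$. The weights and base frequencies are hypothetical toy values.

```python
from itertools import product

def enumerate_score_distribution(pssm, freqs):
    """Brute force: visit all 4**p sequences and accumulate P(S = s)."""
    dist = {}
    for seq in product(range(4), repeat=len(pssm)):
        prob, score = 1.0, 0
        for row, j in zip(pssm, seq):
            prob *= freqs[j]   # iid null model: multiply base frequencies
            score += row[j]    # add the weight selected at this position
        dist[score] = dist.get(score, 0.0) + prob
    return dist

def upper_tail(dist, observed):
    """P(S >= observed); assumed here to be the p-value of an observed score."""
    return sum(p for s, p in dist.items() if s >= observed)

toy_pssm = [[8, 1, 0, 1], [0, 9, 1, 0], [2, 2, 5, 1]]  # hypothetical weights, columns A, C, G, T
toy_freqs = [0.3, 0.2, 0.2, 0.3]                         # hypothetical frequencies of A, C, G, T
dist = enumerate_score_distribution(toy_pssm, toy_freqs)
p_value = upper_tail(dist, 22)                           # only the sequence ACG reaches 22
```

The loop over `product(range(4), repeat=p)` is what makes this approach exponential in the matrix length.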

Probability generating function derivation

In our study, we make use of the PSSMs constructed by TRANSFAC version 6.2. Given a PSSM $m = (w_{ij})_{p \times 4}$, where $i = 1, \ldots, p$ and $j = A, C, G, T$, the match score $S$ and the similarity score $S/S_{\max}$ of a sequence $D_1 D_2 \cdots D_p$ are defined as (Quandt et al., 1995)

$$S = \sum_{i=1}^{p} w_{iD_i} \quad \text{and} \quad S/S_{\max} = \sum_{i=1}^{p} w_{iD_i} \Big/ \sum_{i=1}^{p} \max_j \{w_{ij}\}.$$
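As a minimal sketch of these two definitions, the following computes the match score and the similarity score for a short sequence. The PSSM values and function names are hypothetical, and the columns are assumed to be ordered A, C, G, T.

```python
BASES = "ACGT"

def match_score(pssm, seq):
    """Sum the PSSM entry selected by each base: S = sum_i w[i][D_i]."""
    if len(seq) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    return sum(row[BASES.index(base)] for row, base in zip(pssm, seq))

def similarity_score(pssm, seq):
    """Match score divided by the largest attainable score S_max."""
    s_max = sum(max(row) for row in pssm)
    return match_score(pssm, seq) / s_max

# Hypothetical 3-position PSSM; columns follow the order A, C, G, T.
toy_pssm = [
    [8, 1, 0, 1],
    [0, 9, 1, 0],
    [2, 2, 5, 1],
]
s = match_score(toy_pssm, "ACG")           # 8 + 9 + 5
ratio = similarity_score(toy_pssm, "ATG")  # (8 + 0 + 5) / (8 + 9 + 5)
```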

Let $S$ be a random variable taking integer values; then its probability generating function $G(t)$ is the expected value of $t^S$. $G(t)$ is a polynomial, and the coefficient of the term $t^n$ is the probability of the event $S = n$ (Gut, 1995).

Given a PSSM $m$ of length $p$, under the assumption that the DNA sequence is iid, Staden provided the PGF of the match score in the form of a product of $p$ polynomials (Staden, 1989): $G(t) = \prod_{i=1}^{p} \sum_{j=A,C,G,T} f_j t^{w_{ij}}$, where $f_j$ is the frequency of letter $j$ in the iid DNA sequence. For the first-order Markov case ($k = 1$), let the transition matrix be $P = (f_{\alpha|\beta})_{4 \times 4}$, with rows indexed by the preceding base $\beta$ and columns by the following base $\alpha$, and let the stationary distribution of the Markov chain be $\pi$ (viewed as a four-dimensional row vector). Then the PGF under the first-order Markov model is

$$G(t) = \pi \prod_{i=1}^{p} \big(P M_i(t)\big)\, I, \qquad (*)$$

where $M_i(t) = \mathrm{Diag}(t^{w_{iA}}, t^{w_{iC}}, t^{w_{iG}}, t^{w_{iT}})$ and $I = (1, 1, 1, 1)^T$ (proof provided at the end of this section).
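Both product forms can be expanded into an explicit score distribution by keeping polynomials as `{score: probability}` dictionaries. The following is a sketch under stated assumptions, not the authors' implementation: the function names, toy PSSM, base frequencies, and transition matrix are all hypothetical, and `trans[b][a]` is taken to be the probability that base `a` follows base `b`.

```python
from collections import defaultdict

def iid_score_distribution(pssm, freqs):
    """Expand prod_i sum_j f_j t^{w_ij}; returns {score: probability}."""
    dist = {0: 1.0}
    for row in pssm:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for j in range(4):
                nxt[score + row[j]] += prob * freqs[j]
        dist = dict(nxt)
    return dist

def markov_score_distribution(pssm, trans, start):
    """Expand start * prod_i (P M_i(t)) * 1 as a polynomial in t."""
    vec = [{0: start[b]} for b in range(4)]  # row vector of constant polynomials
    for row in pssm:
        nxt = [defaultdict(float) for _ in range(4)]
        for b in range(4):
            for score, prob in vec[b].items():
                for a in range(4):
                    # (vec P M_i)[a]: move b -> a, then multiply by t^{w_ia}
                    nxt[a][score + row[a]] += prob * trans[b][a]
        vec = [dict(poly) for poly in nxt]
    total = defaultdict(float)
    for poly in vec:  # right-multiply by the all-ones vector
        for score, prob in poly.items():
            total[score] += prob
    return dict(total)

toy_pssm = [[8, 1, 0, 1], [0, 9, 1, 0], [2, 2, 5, 1]]  # hypothetical weights
toy_freqs = [0.3, 0.2, 0.2, 0.3]                         # hypothetical iid frequencies
# Hypothetical transition matrix; its columns also sum to 1, so the uniform
# distribution is stationary.
toy_trans = [
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.4, 0.3, 0.2],
    [0.2, 0.1, 0.4, 0.3],
    [0.3, 0.2, 0.1, 0.4],
]
toy_pi = [0.25, 0.25, 0.25, 0.25]
iid_dist = iid_score_distribution(toy_pssm, toy_freqs)
markov_dist = markov_score_distribution(toy_pssm, toy_trans, toy_pi)
```

When every row of the transition matrix equals the iid frequencies, the Markov expansion reduces to Staden's product, which gives a convenient sanity check.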

Since a Markov chain of order $k$ on a set of states is equivalently a first-order Markov chain on the set of $k$-tuples of those states, with a little modification of $M_i(t)$ we can generalize the above results to $k > 1$. An example of the PGF for $k = 3$ is in the online supplement (www.biostat.harvard.edu/complab/LMM). Using this representation for the PGF, we developed and implemented an algorithm in C++ to calculate the exact score distribution. Generally, for a $k$th-order Markov chain and a PSSM of length $p$, the time complexity of our algorithm is $O(4^k \cdot S_{\max} \cdot p)$: linear in the matrix length but exponential in the order of the Markov chain. The source code is available upon request (www.biostat.hsph.harvard.edu/LMM).
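One way to read the generalization to order $k$ is to keep one polynomial per $k$-base context and slide the context window at each PSSM position. The sketch below follows that reading; it is an illustration in Python, not the C++ program described above. The function name, the dictionary layout of `cond` and `start`, and the toy values are assumptions.

```python
from collections import defaultdict
from itertools import product

BASES = "ACGT"

def kth_order_score_distribution(pssm, cond, start):
    """Exact score distribution under an order-k Markov background.

    cond[(context, base)] = P(base | the previous k bases are `context`)
    start[context]        = probability of the k bases preceding the site
    """
    # One polynomial {score: prob} per context; there are 4**k contexts.
    state = {ctx: {0: p} for ctx, p in start.items() if p > 0}
    for row in pssm:
        nxt = defaultdict(lambda: defaultdict(float))
        for ctx, poly in state.items():
            for j, base in enumerate(BASES):
                new_ctx = (ctx + base)[1:]  # slide the k-base window
                step = cond[(ctx, base)]
                for score, prob in poly.items():
                    nxt[new_ctx][score + row[j]] += prob * step
        state = nxt
    total = defaultdict(float)
    for poly in state.values():
        for score, prob in poly.items():
            total[score] += prob
    return dict(total)

# Hypothetical order-2 background whose conditionals ignore the context,
# so the result should coincide with the iid case.
k = 2
freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
contexts = ["".join(c) for c in product(BASES, repeat=k)]
cond = {(ctx, b): freqs[b] for ctx in contexts for b in BASES}
start = {ctx: 1.0 / len(contexts) for ctx in contexts}
toy_pssm = [[8, 1, 0, 1], [0, 9, 1, 0], [2, 2, 5, 1]]
dist = kth_order_score_distribution(toy_pssm, cond, start)
```

The outer loops run over $p$ positions, $4^k$ contexts, four bases, and the scores seen so far, which is consistent with the stated dependence on $4^k$, $S_{\max}$, and $p$.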

Proof of equation (*). For ease of notation and without loss of generality, we let $p$, the length of the PSSM, be 3.

For a DNA sequence $D_1 D_2 D_3$, its match score against PSSM $m$ is $w_{1D_1} + w_{2D_2} + w_{3D_3}$, and the probability of the occurrence of $D_1 D_2 D_3$ is $f_{D_1} f_{D_2|D_1} f_{D_3|D_2}$. By definition, the PGF of the match score against $m$ is

$$\sum_{D_1 D_2 D_3} f_{D_1} f_{D_2|D_1} f_{D_3|D_2}\, t^{w_{1D_1} + w_{2D_2} + w_{3D_3}} = \sum_{D_1 D_2 D_3} f_{D_1} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}.$$

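In the following, we derive the PGF in the alternative form of a product of $p$ matrices. First, since the stationary distribution satisfies $f_{D_1} = \sum_a f_a f_{D_1|a}$,

$$\begin{aligned}
&\sum_{D_1 D_2 D_3} f_{D_1} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \sum_{D_1 D_2 D_3} \sum_{a} f_a \cdot f_{D_1|a} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \sum_{a} f_a \cdot \sum_{D_1 D_2 D_3} f_{D_1|a} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = (f_A, f_C, f_G, f_T) \cdot
\begin{pmatrix}
\sum_{D_1 D_2 D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
\vdots \\
\sum_{D_1 D_2 D_3} f_{D_1|T} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix}.
\end{aligned}$$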

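For the component of the second vector corresponding to base A,

$$\begin{aligned}
&\sum_{D_1 D_2 D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \sum_{D_1} f_{D_1|A} t^{w_{1D_1}} \cdot \sum_{D_2 D_3} f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
&\quad = \left(f_{A|A} t^{w_{1A}},\ f_{C|A} t^{w_{1C}},\ f_{G|A} t^{w_{1G}},\ f_{T|A} t^{w_{1T}}\right) \cdot
\begin{pmatrix}
\sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
\vdots \\
\sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix}.
\end{aligned}$$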


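We apply similar arguments to the components corresponding to bases C, G, and T, and obtain

$$\sum_{D_1 D_2 D_3} f_{D_1|C} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
= \left(f_{A|C} t^{w_{1A}},\ f_{C|C} t^{w_{1C}},\ f_{G|C} t^{w_{1G}},\ f_{T|C} t^{w_{1T}}\right) \cdot
\begin{pmatrix}
\sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
\vdots \\
\sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix},$$

$$\sum_{D_1 D_2 D_3} f_{D_1|G} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
= \left(f_{A|G} t^{w_{1A}},\ f_{C|G} t^{w_{1C}},\ f_{G|G} t^{w_{1G}},\ f_{T|G} t^{w_{1T}}\right) \cdot
\begin{pmatrix}
\sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
\vdots \\
\sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix},$$

$$\sum_{D_1 D_2 D_3} f_{D_1|T} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
= \left(f_{A|T} t^{w_{1A}},\ f_{C|T} t^{w_{1C}},\ f_{G|T} t^{w_{1G}},\ f_{T|T} t^{w_{1T}}\right) \cdot
\begin{pmatrix}
\sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
\vdots \\
\sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix}.$$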

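Therefore, for the first position, we have

$$\begin{aligned}
&\begin{pmatrix}
\sum_{D_1 D_2 D_3} f_{D_1|A} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
\vdots \\
\sum_{D_1 D_2 D_3} f_{D_1|T} t^{w_{1D_1}} \cdot f_{D_2|D_1} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix} \\
&\quad =
\begin{pmatrix}
f_{A|A} t^{w_{1A}} & f_{C|A} t^{w_{1C}} & f_{G|A} t^{w_{1G}} & f_{T|A} t^{w_{1T}} \\
f_{A|C} t^{w_{1A}} & f_{C|C} t^{w_{1C}} & f_{G|C} t^{w_{1G}} & f_{T|C} t^{w_{1T}} \\
f_{A|G} t^{w_{1A}} & f_{C|G} t^{w_{1C}} & f_{G|G} t^{w_{1G}} & f_{T|G} t^{w_{1T}} \\
f_{A|T} t^{w_{1A}} & f_{C|T} t^{w_{1C}} & f_{G|T} t^{w_{1G}} & f_{T|T} t^{w_{1T}}
\end{pmatrix} \cdot
\begin{pmatrix}
\sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|C} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|G} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}} \\
\sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix} \\
&\quad = P \cdot \mathrm{Diag}(t^{w_{1A}}, t^{w_{1C}}, t^{w_{1G}}, t^{w_{1T}}) \cdot
\begin{pmatrix}
\sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
\vdots \\
\sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix} \\
&\quad = P \cdot M_1(t) \cdot
\begin{pmatrix}
\sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
\vdots \\
\sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix}.
\end{aligned}$$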


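Further applying the above arguments to positions 2 and 3, we have

$$\begin{aligned}
\begin{pmatrix}
\sum_{D_2 D_3} f_{D_2|A} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}} \\
\vdots \\
\sum_{D_2 D_3} f_{D_2|T} t^{w_{2D_2}} \cdot f_{D_3|D_2} t^{w_{3D_3}}
\end{pmatrix}
&= P \cdot \mathrm{Diag}(t^{w_{2A}}, t^{w_{2C}}, t^{w_{2G}}, t^{w_{2T}}) \cdot
\begin{pmatrix}
\sum_{D_3} f_{D_3|A} t^{w_{3D_3}} \\
\vdots \\
\sum_{D_3} f_{D_3|T} t^{w_{3D_3}}
\end{pmatrix} \\
&= P \cdot \mathrm{Diag}(t^{w_{2A}}, t^{w_{2C}}, t^{w_{2G}}, t^{w_{2T}}) \cdot P \cdot \mathrm{Diag}(t^{w_{3A}}, t^{w_{3C}}, t^{w_{3G}}, t^{w_{3T}}) \cdot (1, 1, 1, 1)^T \\
&= P \cdot M_2(t) \cdot P \cdot M_3(t) \cdot I.
\end{aligned}$$

Combining the steps above, with $\pi = (f_A, f_C, f_G, f_T)$, we obtain $G(t) = \pi \prod_{i=1}^{p} \big(P M_i(t)\big)\, I$.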
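The identity can also be checked numerically for $p = 3$ by evaluating both sides at a fixed value of $t$. The sketch below is only a sanity check under assumed toy inputs: the PSSM and transition matrix are hypothetical, and the matrix is chosen so that its columns also sum to 1, which makes the uniform distribution stationary.

```python
from itertools import product

def pgf_by_enumeration(pssm, trans, pi, t):
    """Sum f_{D1} f_{D2|D1} ... t^{score} over all 4**p sequences."""
    total = 0.0
    for seq in product(range(4), repeat=len(pssm)):
        prob = pi[seq[0]]
        for prev, cur in zip(seq, seq[1:]):
            prob *= trans[prev][cur]
        score = sum(row[j] for row, j in zip(pssm, seq))
        total += prob * t ** score
    return total

def pgf_by_matrix_product(pssm, trans, pi, t):
    """Evaluate pi * prod_i (P M_i(t)) * 1 with plain row-vector updates."""
    vec = list(pi)
    for row in pssm:
        mixed = [sum(vec[b] * trans[b][a] for b in range(4)) for a in range(4)]  # vec P
        vec = [mixed[a] * t ** row[a] for a in range(4)]                           # times M_i(t)
    return sum(vec)                                                                # times 1

toy_pssm = [[8, 1, 0, 1], [0, 9, 1, 0], [2, 2, 5, 1]]  # hypothetical weights
toy_trans = [                                            # hypothetical; trans[b][a] = P(a follows b)
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.4, 0.3, 0.2],
    [0.2, 0.1, 0.4, 0.3],
    [0.3, 0.2, 0.1, 0.4],
]
toy_pi = [0.25, 0.25, 0.25, 0.25]                        # stationary for toy_trans
lhs = pgf_by_enumeration(toy_pssm, toy_trans, toy_pi, 1.1)
rhs = pgf_by_matrix_product(toy_pssm, toy_trans, toy_pi, 1.1)
```

The two values agree only when `pi` is stationary for `trans`, because the matrix form applies $P$ once before the first position.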

ACKNOWLEDGMENTS

The work of H.H., X.Z., and W.H.W. is supported by NSF grants DBI0196176 and DMS-0090166. The work of H.H. and J.S.L. is supported by NSF grant DMS-0204674 and NIH grant R01 HG02518-01. The work of M.-C.J.K. is supported by the Howard Hughes Medical Institute predoctoral fellowship.

REFERENCES

Bailey, T.L., and Elkan, C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2, 28–36.

Chen, Q.K., Hertz, G.Z., and Stormo, G.D. 1995. MATRIX SEARCH 1.0: A computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput. Appl. Biosci. 11, 563–566.

Durbin, R., Eddy, S.R., Krogh, A., and Mitchison, G. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, Cambridge, UK.

Fickett, J.W. 1996. Coordinate positioning of MEF2 and myogenin binding sites. Gene 172, GC19–32.

Fried, M., and Crothers, D.M. 1981. Equilibria and kinetics of lac repressor-operator interactions by polyacrylamide gel electrophoresis. Nucl. Acids Res. 9, 6505–6525.

Galas, D.J., and Schmitz, A. 1978. DNAse footprinting: A simple method for the detection of protein–DNA binding specificity. Nucl. Acids Res. 5, 3157–3170.

Garner, M.M., and Revzin, A. 1981. A gel electrophoresis method for quantifying the binding of proteins to specific DNA regions: Application to components of the Escherichia coli lactose operon regulatory system. Nucl. Acids Res. 9, 3047–3060.

Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, England.

Gut, A. 1995. An Intermediate Course in Probability, Springer-Verlag, New York.

Hertz, G.Z., Hartzell, G.W., 3rd, and Stormo, G.D. 1990. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci. 6, 81–92.

Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T., Durbin, R., Eyras, E., Gilbert, J., Hammond, M., Huminiecki, L., Kasprzyk, A., Lehvaslaiho, H., Lijnzaad, P., Melsopp, C., Mongin, E., Pettett, R., Pocock, M., Potter, S., Rust, A., Schmidt, E., Searle, S., Slater, G., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Stupka, E., Ureta-Vidal, A., Vastrik, I., and Clamp, M. 2002. The Ensembl genome database project. Nucl. Acids Res. 30, 38–41.

Hughes, J.D., Estep, P.W., Tavazoie, S., and Church, G.M. 2000. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol. 296, 1205–1214.

Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., Funke, R., Gage, D., Harris, K., Heaford, A., Howland, J., Kann, L., Lehoczky, J., LeVine, R., McEwan, P., McKernan, K., Meldrim, J., Mesirov, J.P., Miranda, C., Morris, W., Naylor, J., Raymond, C., Rosetti, M., Santos, R., Sheridan, A., Sougnez, C., Stange-Thomann, N., Stojanovic, N., Subramanian, A., Wyman, D., Rogers, J., Sulston, J., Ainscough, R., Beck, S., Bentley, D., Burton, J., Clee, C., Carter, N., Coulson, A., Deadman, R., Deloukas, P., Dunham, A., Dunham, I., Durbin, R., French, L., Grafham, D., Gregory, S., Hubbard, T., Humphray, S., Hunt, A., Jones, M., Lloyd, C., McMurray, A., Matthews, L., Mercer, S., Milne, S., Mullikin, J.C., Mungall, A., Plumb, R., Ross, M., Shownkeen, R., Sims, S., Waterston, R.H., Wilson, R.K., Hillier, L.W., McPherson, J.D., Marra, M.A., Mardis, E.R., Fulton, L.A., Chinwalla, A.T., Pepin, K.H., Gish, W.R., Chissoe, S.L., Wendl, M.C., Delehaunty, K.D., Miner, T.L., Delehaunty, A., Kramer, J.B., Cook, L.L., Fulton, R.S., Johnson, D.L., Minx, P.J., Clifton, S.W., Hawkins, T., Branscomb, E., Predki, P., Richardson, P., Wenning, S., Slezak, T., Doggett, N., Cheng, J.F., Olsen, A., Lucas, S., Elkin, C., Uberbacher, E., Frazier, M., Gibbs, R.A., Muzny, D.M., Scherer, S.E., Bouck, J.B., Sodergren, E.J., Worley, K.C., Rives, C.M., Gorrell, J.H., Metzker, M.L., Naylor, S.L., Kucherlapati, R.S., Nelson, D.L., Weinstock, G.M., Sakaki, Y., Fujiyama, A., Hattori, M., Yada, T., Toyoda, A., Itoh, T., Kawagoe, C., Watanabe, H., Totoki, Y., Taylor, T., Weissenbach, J., Heilig, R., Saurin, W., Artiguenave, F., Brottier, P., Bruls, T., Pelletier, E., Robert, C., Wincker, P., Smith, D.R., Doucette-Stamm, L., Rubenfield, M., Weinstock, K., Lee, H.M., Dubois, J., Rosenthal, A., Platzer, M., Nyakatura, G., Taudien, S., Rump, A., Yang, H., Yu, J., Wang, J., Huang, G., Gu, J., Hood, L., Rowen, L., Madan, A., Qin, S., Davis, R.W., Federspiel, N.A., Abola, A.P., Proctor, M.J., Myers, R.M., Schmutz, J., Dickson, M., Grimwood, J., Cox, D.R., Olson, M.V., Kaul, R., Shimizu, N., Kawasaki, K., Minoshima, S., Evans, G.A., Athanasiou, M., Schultz, R., Roe, B.A., Chen, F., Pan, H., Ramser, J., Lehrach, H., Reinhardt, R., McCombie, W.R., de la Bastide, M., Dedhia, N., Blocker, H., Hornischer, K., Nordsiek, G., Agarwala, R., Aravind, L., Bailey, J.A., Bateman, A., Batzoglou, S., Birney, E., Bork, P., Brown, D.G., Burge, C.B., Cerutti, L., Chen, H.C., Church, D., Clamp, M., Copley, R.R., Doerks, T., Eddy, S.R., Eichler, E.E., Furey, T.S., Galagan, J., Gilbert, J.G., Harmon, C., Hayashizaki, Y., Haussler, D., Hermjakob, H., Hokamp, K., Jang, W., Johnson, L.S., Jones, T.A., Kasif, S., Kaspryzk, A., Kennedy, S., Kent, W.J., Kitts, P., Koonin, E.V., Korf, I., Kulp, D., Lancet, D., Lowe, T.M., McLysaght, A., Mikkelsen, T., Moran, J.V., Mulder, N., Pollara, V.J., Ponting, C.P., Schuler, G., Schultz, J., Slater, G., Smit, A.F., Stupka, E., Szustakowski, J., Thierry-Mieg, D., Thierry-Mieg, J., Wagner, L., Wallis, J., Wheeler, R., Williams, A., Wolf, Y.I., Wolfe, K.H., Yang, S.P., Yeh, R.F., Collins, F., Guyer, M.S., Peterson, J., Felsenfeld, A., Wetterstrand, K.A., Patrinos, A., Morgan, M.J., Szustakowki, J., de Jong, P., Catanese, J.J., Osoegawa, K., Shizuya, H., Choi, S., and Chen, Y.J. 2001. Initial sequencing and analysis of the human genome. Nature 409, 860–921.

Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., and Wootton, J.C. 1993. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262, 208–214.

Lawrence, C.E., and Reilly, A.A. 1990. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 7, 41–51.

Liu, X., Brutlag, D.L., and Liu, J.S. 2001. BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput., 127–138.

Liu, X.S., Brutlag, D.L., and Liu, J.S. 2002. An algorithm for finding protein DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol.

Nakatsuji, Y., Hidaka, K., Tsujino, S., Yamamoto, Y., Mukai, T., Yanagihara, T., Kishimoto, T., and Sakoda, S. 1992. A single MEF-2 site is a major positive regulatory element required for transcription of the muscle-specific subunit of the human phosphoglycerate mutase gene in skeletal and cardiac muscle cells. Mol. Cell. Biol. 12, 4384–4390.

Quandt, K., Frech, K., Karas, H., Wingender, E., and Werner, T. 1995. MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucl. Acids Res. 23, 4878–4884.

Rosenthal, N., Berglund, E.B., Wentworth, B.M., Donoghue, M., Winter, B., Bober, E., Braun, T., and Arnold, H.H. 1990. A highly conserved enhancer downstream of the human MLC1/3 locus is a target for multiple myogenic determination factors. Nucl. Acids Res. 18, 6239–6246.

Roth, F.P., Hughes, J.D., Estep, P.W., and Church, G.M. 1998. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol. 16, 939–945.

Staden, R. 1989. Methods for calculating the probabilities of finding patterns in sequences. Comput. Appl. Biosci. 5, 89–96.

Stormo, G.D., and Hartzell, G.W., 3rd. 1989. Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl. Acad. Sci. USA 86, 1183–1187.

Wentworth, B.M., Donoghue, M., Engert, J.C., Berglund, E.B., and Rosenthal, N. 1991. Paired MyoD-binding sites regulate myosin light chain gene expression. Proc. Natl. Acad. Sci. USA 88, 1242–1246.

Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Pruss, M., Reuter, I., and Schacherer, F. 2000. TRANSFAC: An integrated system for gene expression regulation. Nucl. Acids Res. 28, 316–319.

Address correspondence to:
Jun S. Liu, Wing H. Wong
Department of Statistics
Science Center 6th floor
1 Oxford Street
Cambridge, MA 02138

E-mail: {jliu, wwong}@stat.harvard.edu


Page 11: Determination of Local Statistical Significance of ...mckao/documents/JCB-LMM.pdf · that the incorporation of the local genomic context can be advantageous in the prediction of myogenin

LOCAL STATISTICAL SIGNIFICANCE OF PATTERNS 11

k D 3 is in the online supplement (wwwbiostatharvardeducomplabLMM) Using this representation forthe PGF we developed and implemented an algorithm using CCC to calculate the exact score distributionGenerally for a kth-order Markov chain and a PSSM of length p the time complexity of our algorithm isO4k cent Smax cent p linear in the matrix length but exponential in the order of the Markov chain The sourcecode is available upon request (wwwbiostathsphharvardeduLMM)

Proof of equation (curren) For ease of notation and without loss of generality we let p the length of thePSSM be 3

For a DNA sequence D1D2D3 its match score against PSSM m is w1D1 C w2D2 C w3D3 and theprobability of the occurrence of D1D2D3 is fD1 fD2 jD1 fD3 jD2 By de nition the PGF of match scoreagainst m is

X

D1D2D3

fD1fD2jD1fD3jD2 tw1D1Cw2D2

Cw3D3 DX

D1D2D3

fD1 tw1D1 cent fD2 jD1 tw2D2 cent fD3 jD2 tw3D3

In the following we derive the PGF in the alternative form of a product of p matrices First

X

D1D2D3

fD1 tw1D1 cent fD2 jD1 tw2D2 cent fD3 jD2 tw3D3

DX

D1D2D3

X

a

fa cent fD1 ja tw1D1 cent fD2 jD1 tw2D2 cent fD3 jD2 tw3D3

DX

a

fa centX

D1D2D3

fD1 ja tw1D1 cent fD2 jD1 tw2D2 cent fD3 jD2 tw3D3

D fa fC fG fT cent

0

X

D1D2D3

fD1 jAtw1D1 cent fD2jD1 tw2D2 cent fD3 jD2 tw3D3

X

D1D2D3

fD1 jT tw1D1 cent fD2jD1 tw2D2 cent fD3jD2 tw3D3

1

AT

For the component of the second vector corresponding to base A

X

D1D2D3

fD1 jAtw1D1 cent fD2 jD1 tw2D2 cent fD3jD2 tw3D3

DX

D1

fD1 jAtw1D1 centX

D2D3

fD2 jD1 tw2D2 cent fD3jD2 tw3D3

D fAjAtw1A fCjAtw1C fGjAtw1G fT jAtw1T

cent

0

X

D2D3

fD2 jAtw2D2 cent fD3 jD2 tw3D3 X

D2D3

fD2 jT tw2D2 cent fD3jD2 tw3D3

1

AT

12 HUANG ET AL

We apply similar arguments to the components corresponding to bases C G and T and obtain

X

D1D2D3

fD1jC tw1D1 cent fD2 jD1 tw2D2 cent fD3jD2 tw3D3

D fAjC tw1A fCjC tw1C fGjC tw1G fT jC tw1T

cent

0

X

D2D3

fD2 jAtw2D2 cent fD3 jD2 tw3D3 X

D2D3

fD2 jT tw2D2 cent fD3 jD2 tw3D3

1

AT

X

D1D2D3

fD1jGtw1D1 cent fD2 jD1 tw2D2 cent fD3jD2 tw3D3

D fAjGtw1A fCjGtw1C fGjGtw1G fT jGtw1T

cent

0

X

D2D3

fD2 jAtw2D2 cent fD3 jD2 tw3D3 X

D2D3

fD2 jT tw2D2 cent fD3 jD2 tw3D3

1

AT

X

D1D2D3

fD1jT tw1D1 cent fD2 jD1 tw2D2 cent fD3 jD2 tw3D3

D fAjT tw1A fCjT tw1C fGjT tw1G fT jT tw1T

cent

0

X

D2D3

fD2 jAtw2D2 cent fD3 jD2 tw3D3 X

D2D3

fD2 jT tw2D2 cent fD3 jD2 tw3D3

1

AT

Therefore for the rst position we have

0

X

D1D2D3

fD1jAtw1D1 cent fD2jD1 tw2D2 cent fD3 jD2 tw3D3 X

D1D2D3

fD1 jT tw1D1 cent fD2 jD1 tw2D2 cent fD3jD2 tw3D3

1

AT

D

0

BB

fAjAtw1A fCjAtw1C fGjAtw1G fT jAtw1T

fAjC tw1A fCjC tw1C fGjC tw1G fT jC tw1T

fAjGtw1A fCjGtw1C fGjGtw1G fT jGtw1T

fAjT tw1A fCjT tw1C fGjT tw1G fT jT tw1T

1

CCA cent

0

BB

PD2D3

fD2 jAtw2D2 fD3 jD2 tw3D3PD2D3

fD2 jC tw2D2 fD3jD2 tw3D3PD2D3

fD2 jGtw2D2 fD3jD2 tw3D3PD2D3

fD2 jT tw2D2 fD3 jD2 tw3D3

1

CCA

D P cent Diagtw1A tw1C tw1G tw1T

cent

0

X

D2D3

fD2 jAtw2D2 cent fD2 jD2 tw3D3 X

D2D3

fD2 jT tw2D2 cent fD3 jD2 tw3D3

1

AT

D P cent M1 t cent

0

X

D2D3

fD2 jAtw2D2 cent fD3 jD2 tw3D3 X

D2D3

fD2 jT tw2D2 cent fD3 jD2 tw3D3

1

AT

LOCAL STATISTICAL SIGNIFICANCE OF PATTERNS 13

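Further applying the above arguments to positions 2 and 3, we have

$$
\begin{aligned}
\left( \sum_{D_2,D_3} f_{D_2|A}\, t^{w_{2D_2}} \cdot f_{D_3|D_2}\, t^{w_{3D_3}},\; \ldots,\; \sum_{D_2,D_3} f_{D_2|T}\, t^{w_{2D_2}} \cdot f_{D_3|D_2}\, t^{w_{3D_3}} \right)^{T}
&= P \cdot \mathrm{Diag}\left( t^{w_{2A}}, t^{w_{2C}}, t^{w_{2G}}, t^{w_{2T}} \right) \cdot
\left( \sum_{D_3} f_{D_3|A}\, t^{w_{3D_3}},\; \ldots,\; \sum_{D_3} f_{D_3|T}\, t^{w_{3D_3}} \right)^{T} \\
&= P \cdot \mathrm{Diag}\left( t^{w_{2A}}, t^{w_{2C}}, t^{w_{2G}}, t^{w_{2T}} \right) \cdot P \cdot \mathrm{Diag}\left( t^{w_{3A}}, t^{w_{3C}}, t^{w_{3G}}, t^{w_{3T}} \right) \cdot (1, 1, 1, 1)^{T} \\
&= P \cdot M_2(t) \cdot P \cdot M_3(t) \cdot I.
\end{aligned}
$$

Combining the above, $G(t) = \pi \prod_{i=1}^{p} \bigl( P\, M_i(t) \bigr)\, I$.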
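When the weights are integers, the closed form for $G(t)$ can be expanded into explicit coefficients, where the coefficient of $t^s$ is the probability that a site scores $s$. The following is a minimal sketch under stated assumptions, not the authors' implementation: it reads $\pi$ as the distribution of the base immediately preceding position 1, and the function names (`pgf_coefficients`, `brute_force`) and all numeric inputs are hypothetical. The matrix product is applied right to left, starting from $I$, and the result is compared with direct enumeration of all words.

```python
from collections import defaultdict
from itertools import product

BASES = "ACGT"


def pgf_coefficients(pi, P, W):
    """Expand G(t) = pi * prod_i (P M_i(t)) * I into {score: probability}.

    pi[a]   : probability of base a just before the site (assumed reading)
    P[a][d] : transition probability f_{d|a}
    W[i][d] : integer weight of base d at position i + 1
    """
    # Start from I = (1, 1, 1, 1)^T; each entry is the constant polynomial 1.
    vec = {a: {0: 1.0} for a in BASES}
    for weights in reversed(W):  # P M_p(t) is applied first, P M_1(t) last
        new_vec = {}
        for a in BASES:
            poly = defaultdict(float)
            for d in BASES:
                for score, prob in vec[d].items():
                    # Multiplying by t^{w_d} shifts the exponent by w_d.
                    poly[score + weights[d]] += P[a][d] * prob
            new_vec[a] = dict(poly)
        vec = new_vec
    G = defaultdict(float)
    for a in BASES:
        for score, prob in vec[a].items():
            G[score] += pi[a] * prob
    return dict(G)


def brute_force(pi, P, W):
    """Score distribution by enumerating every preceding base and word."""
    dist = defaultdict(float)
    for prev in BASES:
        for word in product(BASES, repeat=len(W)):
            prob, last, score = pi[prev], prev, 0
            for weights, d in zip(W, word):
                prob *= P[last][d]
                score += weights[d]
                last = d
            dist[score] += prob
    return dict(dist)


# Hypothetical inputs for illustration only.
pi = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
P = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.3, "T": 0.1},
    "C": {"A": 0.1, "C": 0.5, "G": 0.2, "T": 0.2},
    "G": {"A": 0.3, "C": 0.3, "G": 0.3, "T": 0.1},
    "T": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
W = [
    {"A": 2, "C": -1, "G": 0, "T": 1},
    {"A": 0, "C": 3, "G": -2, "T": 1},
    {"A": 1, "C": 1, "G": 2, "T": -1},
]

G = pgf_coefficients(pi, P, W)
total = sum(G.values())  # G(1), which should be 1 up to rounding
p_value = sum(prob for score, prob in G.items() if score >= 4)
```

The matrix recursion touches each position once, so its cost grows linearly in $p$ times the number of distinct scores, whereas the enumeration in `brute_force` grows as $4^p$. The tail sum `p_value` shows one way such coefficients could feed a p-value: the probability of a score at least as large as an observed one, here the arbitrary threshold 4.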

ACKNOWLEDGMENTS

The work of H.H., X.Z., and W.H.W. is supported by NSF grants DBI0196176 and DMS-0090166. The work of H.H. and J.S.L. is supported by NSF grant DMS-0204674 and NIH grant R01 HG02518-01. The work of M.-C.J.K. is supported by the Howard Hughes Medical Institute predoctoral fellowship.


Address correspondence to:
Jun S. Liu, Wing H. Wong
Department of Statistics
Science Center 6th floor
1 Oxford Street
Cambridge, MA 02138

E-mail: {jliu, wwong}@stat.harvard.edu
