
Proteogenomic analysis and global discovery of posttranslational modifications in prokaryotes

Ming-kun Yang a,1, Yao-hua Yang a,1, Zhuo Chen a, Jia Zhang a, Yan Lin a, Yan Wang a, Qian Xiong a, Tao Li a,2, Feng Ge a,2, Donald A. Bryant b, and Jin-dong Zhao a

a Key Laboratory of Algal Biology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China; and b Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802

Edited* by Jiayang Li, Chinese Academy of Sciences, Beijing, China, and approved November 13, 2014 (received for review July 6, 2014)

We describe an integrated workflow for proteogenomic analysis and global profiling of posttranslational modifications (PTMs) in prokaryotes and use the model cyanobacterium Synechococcus sp. PCC 7002 (hereafter Synechococcus 7002) as a test case. We found more than 20 different kinds of PTMs, and a holistic view of PTM events in this organism grown under different conditions was obtained without specific enrichment strategies. Among 3,186 predicted protein-coding genes, 2,938 gene products (>92%) were identified. We also identified 118 previously unidentified proteins and corrected 38 predicted gene-coding regions in the Synechococcus 7002 genome. This systematic analysis not only provides comprehensive information on protein profiles and the diversity of PTMs in Synechococcus 7002 but also provides some insights into photosynthetic pathways in cyanobacteria. The entire proteogenomics pipeline is applicable to any sequenced prokaryotic organism, and we suggest that it should become a standard part of genome annotation projects.

proteogenomics | post-translational modifications | cyanobacteria | photosynthesis | Synechococcus

Proteogenomics refers to the use of mass spectrometry-derived proteomic data to refine genome annotation (1) and has been applied to the identification of previously unidentified genes and the correction and validation of predicted genes in various organisms (2–4). It is an important tool for integrating protein-level information into the genome annotation process and can greatly improve genome annotation quality. The same experimental proteomic datasets are also useful in identifying posttranslational modifications (PTMs) on a proteome-wide level (5, 6). Many cellular proteins undergo appreciable amounts of PTM in response to certain stimuli, and this dynamic process occurs in various cell compartments to dictate the fate and activity of the modified proteins (7). Identification and mapping of PTMs in proteins have been improved dramatically, mainly due to increases in the sensitivity, speed, accuracy, and resolution of mass spectrometry (MS). However, system-wide identification of multiple PTMs remains a highly challenging task, especially in situations where some reversible PTMs are induced by a particular stimulus and are present for only a short period (8). To the best of our knowledge, very few proteogenomic datasets have so far been used to analyze PTM events comprehensively across a genome (9, 10).

In this study, we developed a proteogenomic approach to carry out genome annotation and whole-proteome analysis of PTMs in prokaryotes by using high-resolution, high-accuracy MS data and the cyanobacterium Synechococcus sp. PCC 7002 (hereafter Synechococcus 7002) as a test case. Cyanobacteria are a morphologically diverse group of Gram-negative bacteria and are the only prokaryotes capable of oxygenic photosynthesis (11). It is estimated that more than half of the photosynthetic activity on Earth is contributed by cyanobacteria (12). Cyanobacteria make substantial contributions to global CO2 assimilation, O2 production, and N2 fixation and are the progenitors of chloroplasts in higher plants (13). Cyanobacterial habitats are highly diverse, and cyanobacterial cells adjust their cellular activities in response to a wide range of environmental cues and stimuli. Recently, cyanobacteria have attracted great interest due to their crucial roles in global carbon and nitrogen cycles and their ability to produce clean and renewable biofuels such as hydrogen (14–16). Synechococcus 7002 is a unicellular, marine cyanobacterium and a model organism for studying photosynthetic carbon fixation and the development of biofuels (17, 18). However, although the genome of Synechococcus 7002 is fully sequenced, it is annotated only by in silico methods (www.ncbi.nlm.nih.gov/), with a large portion (1,210 out of 3,186) of protein-coding genes annotated as hypothetical proteins (17). Therefore, a comprehensive analysis is needed to provide experimental support for the genome annotation so as to facilitate systems-level analysis. Using our method, we validated the predicted protein-coding genes, identified previously unidentified genes, and corrected gene initiation and stop-codon positions in Synechococcus 7002; directional RNA-Seq was used to confirm the expression of a number of the previously unidentified genes identified in this study.

Significance

Proteogenomics is the application of mass spectrometry-derived proteomic data for testing and refining predicted genetic models. Cyanobacteria, the only prokaryotes capable of oxygenic photosynthesis, are the ancestors of chloroplasts in plants and play crucial roles in global carbon and nitrogen cycles. An integrated proteogenomic workflow was developed, and we tested this system on a model cyanobacterium, Synechococcus 7002, grown under various conditions. We obtained a nearly complete genome translational profile of this model organism. In addition, a holistic view of posttranslational modification (PTM) events is provided using the same dataset, and the results provide insights into photosynthesis. The entire proteogenomics pipeline is applicable to any sequenced prokaryote and could be applied as a standard part of genome annotation projects.

Author contributions: T.L., F.G., and J.-d.Z. designed research; M.-k.Y., Y.-h.Y., Z.C., Y.L., Y.W., T.L., and F.G. performed research; D.A.B. contributed new reagents/analytic tools; M.-k.Y., Y.-h.Y., Z.C., J.Z., Y.L., Q.X., T.L., F.G., and J.-d.Z. analyzed data; and M.-k.Y., Z.C., T.L., F.G., D.A.B., and J.-d.Z. wrote the paper.

The authors declare no conflict of interest.

*This Direct Submission article had a prearranged editor.

Freely available online through the PNAS open access option.

Data deposition: MS raw data files, processed data files, annotated peptide spectral matches for all accepted single peptide hits, annotated peptide spectrum matches of identified peptides with stop codon read-through, annotated spectra of all identified peptides with specific PTMs, snapshots of genome browser for all the previously unidentified genes along with the corrections to the existing gene models identified in this study, Tables S1–S9, RNA-Seq data, and in-house integrated proteogenomic analysis software have been deposited in the PeptideAtlas repository (www.peptideatlas.org) (accession no. PASS00285).

1 M.-k.Y. and Y.-h.Y. contributed equally to this work.

2 To whom correspondence may be addressed. Email: [email protected] or [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1412722111/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1412722111 PNAS | Published online December 15, 2014 | E5633–E5642

APPLIED BIOLOGICAL SCIENCES | PNAS PLUS


More importantly, we characterized PTM features on a proteome-wide level using the same experimental proteomic datasets and identified many different PTM types that may play important roles in cellular functions. Our proteogenomic data provided significant information for revising the genome annotation of Synechococcus 7002 and offered insights into the physiology of this model organism. The method and approach can also be used to study genome annotation and cellular protein PTMs in other organisms.

Results

Proteogenomic Strategy for the Analysis of Synechococcus 7002. The aim of this study was to provide an experimental catalog of the genome-wide gene expression and PTMs in Synechococcus 7002 and to use this information to refine genome annotations. To enhance coverage of the expressed genome, cultures from eight different growth conditions were combined, and total protein extracts were isolated. This treatment mimicked native conditions experienced by Synechococcus 7002 and allowed greater PTM representation in the isolated proteins. A total of 52 samples were generated and subjected to nanoscale liquid chromatography coupled to tandem mass spectrometry (nano-LC-MS/MS) analysis on a high-resolution LTQ-Orbitrap Elite mass spectrometer. Using five different algorithms, MS-derived data were searched against (i) a protein database comprising the protein sequences of Synechococcus 7002 from CyanoBase and (ii) a database of a six-frame translated genome of Synechococcus 7002. The complete workflow is summarized in Fig. 1. Peptide spectrum matches (PSMs) were filtered for first-rank assignments that passed a 1% false discovery rate (FDR) threshold. The complete list of peptides identified in our study, along with PSM scores, charge, m/z value, and mass error, is provided in Table S1A [Tables S1–S9 are available at www.peptideatlas.org (accession no. PASS00285)]. We achieved a mean absolute mass deviation of 0.005006 Da (Fig. 2A). The average absolute peptide mass accuracy was 1.96 parts per million (ppm) (SD, 1.78 ppm), and more than 94% of the identified PSMs had less than 5 ppm mass error, confirming the high accuracy of peptide data obtained from the mass spectrometer (Fig. S1A). Fig. 2B shows the distribution of peptides and proteins identified from the two fractionation methods applied for analyzing the cell-lysate proteins. Sequest, Mascot, MaxQuant, pFind, and X!Tandem searches identified 238,918; 239,779; 252,696; 241,601; and 268,763 peptides, respectively, resulting in the identification of 55,862 unique peptides.
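The mass-accuracy figures quoted above (absolute error in Da, relative error in ppm, and the share of PSMs under 5 ppm) follow from a simple relation between observed and theoretical peptide mass. The following is a minimal sketch of that arithmetic only; the function names and the example masses are hypothetical and are not taken from the authors' software or data.

```python
from typing import Iterable


def ppm_error(observed_mass: float, theoretical_mass: float) -> float:
    """Signed mass error in parts per million relative to the theoretical mass."""
    if theoretical_mass <= 0:
        raise ValueError("theoretical_mass must be positive")
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def fraction_below(errors_ppm: Iterable[float], tolerance_ppm: float = 5.0) -> float:
    """Fraction of PSMs whose absolute ppm error is strictly below the tolerance."""
    errors = list(errors_ppm)
    if not errors:
        raise ValueError("no mass errors supplied")
    return sum(abs(e) < tolerance_ppm for e in errors) / len(errors)


# Hypothetical (observed, theoretical) peptide masses in Da, for illustration only.
example_psms = [(1500.7512, 1500.7500), (2210.1033, 2210.1100), (987.5301, 987.5290)]
example_errors = [ppm_error(obs, theo) for obs, theo in example_psms]
share_accurate = fraction_below(example_errors)
```

In practice the per-PSM errors would be read from the search-engine output (the mass-error column of Table S1A in this study) rather than typed in by hand.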

Fig. 1. Experimental and bioinformatic workflow of the proteogenomic analysis. Protein extracts were prepared from Synechococcus 7002 cultures grown to exponential phase (flask 1) and stationary phase (flask 2) and exposed to stresses, including iron deficiency (flask 3), phosphate deficiency (flask 4), nitrogen deficiency (flask 5), A5 deficiency (flask 6), high light (flask 7), and high salt concentration (2.5 M; flask 8). After extraction, proteins were subjected to Glu-C and trypsin digestion, producing a peptide mixture. The mixture was analyzed by nano-LC-MS/MS with an LTQ-Orbitrap Elite mass spectrometer. MS/MS peptide spectra were searched against the organism-specific genome sequences to validate and correct genomic annotations and to identify previously unidentified protein-coding genes and diverse PTMs.

Protein Expression Analysis of Synechococcus. The CyanoBase database (genome.microbedb.jp/cyanobase/) lists 3,186 protein-coding genes in the Synechococcus 7002 genome (3.41 Mb, released 2012). Unique peptides were mapped onto the genome-translated database, and proteins identified by at least two unique peptides or by a single peptide with manual validation were reported. In total, we identified 2,938 Synechococcus 7002 proteins at a 1% FDR (2,699 proteins with at least two unique peptides and 239 proteins with single-peptide identification), which are listed in Table S1A. Proteins identified based on shared peptide evidence were listed separately (Table S1B). This comprehensive dataset enabled us to address the general features of our proteomic experiments, especially with respect to coverage of the genome sequence by the detected peptides. We first defined the protein-coding part of the genome by mapping the 3,186 annotated proteins onto the chromosome and plasmids (3.41 Mb) (Fig. 2C); this result revealed that at least 87.1% (2.97 Mb) of the genome is annotated in the protein database and is therefore protein coding. We next used the sequences of all 2,938 proteins identified in our dataset to estimate the size of the expressed genome, which corresponded to 98.3% (2.92 Mb) of the annotated protein-coding regions. Finally, mapping the detected peptide sequences onto the chromosome and plasmids captured 2.30 Mb of the raw genome sequence or 77.4% of the protein-coding genome (Fig. 2C). The average sequence coverage per identified protein was 44.4%, and 1,211 proteins had peptide evidence for more than 50% of their sequences (Fig. S1B). Each protein was represented by 1–357 distinct peptides, with an average of 19 peptides per identified protein (Fig. S1C). As an illustration of the high coverage of many of the proteins, Fig. S1D depicts the identification of peptides mapping to 100% of the phycocyanin alpha subunit (SYNPCC7002_A2210) gene product, where we identified 194 unique peptides based on 27,145 PSMs. The data show that using more than one method for each of protein isolation, protein/peptide fractionation, and database searching increased the number of peptide and protein identifications from the same sample. The high quality of our protein identifications was demonstrated by the following: the FDR at the peptide level was lower than 1.0%; the average absolute peptide-mass accuracy was 1.96 ppm; 92% of 2,938 identified proteins were mapped by at least two unique peptides; and all modified peptides and peptides singly assigned to proteins were manually verified.

Among all of the identified proteins, a large number of hypothetical proteins were identified in Synechococcus 7002 (Table S1C). The Synechococcus 7002 genome annotation contains 1,210 (38%) hypothetical proteins, which are functionally uncharacterized due to lack of sequence similarity to any known genes from model prokaryotes: i.e., proteins predicted on the basis of the nucleic acid sequences only and protein sequences with unknown function. In this study, we identified 918 proteins by MS that are annotated as hypothetical proteins, among which 311 were assigned by at least 100 PSMs, and only 59 were single-peptide identifications (Fig. S2A). BLAST searches were performed against the National Center for Biotechnology Information (NCBI) database to obtain functional descriptions of the hypothetical proteins, and 99 MS-detected hypothetical proteins had assigned Gene Ontology (GO) terms (Fig. S2B). The distribution of identified proteins among different biological processes, molecular functions, and cellular localizations is illustrated in Fig. S2C. Thus, our results provide important clues for future functional studies of these hypothetical proteins.
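The genome-coverage percentages reported in this section can be reproduced from the stated megabase values, and the underlying quantity (bases covered by mapped proteins or peptides) amounts to the length of a union of genomic intervals. The sketch below illustrates both steps under stated assumptions: intervals are 1-based and inclusive, and the function names are hypothetical rather than part of the authors' pipeline.

```python
from typing import Iterable, Tuple


def covered_length(intervals: Iterable[Tuple[int, int]]) -> int:
    """Number of bases covered by the union of 1-based inclusive intervals."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            # Close the previous merged block before opening a new one.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


def percent(part: float, whole: float) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Values in Mb as reported in the text.
GENOME_MB, CODING_MB, EXPRESSED_MB, DETECTED_MB = 3.41, 2.97, 2.92, 2.30

coding_share = percent(CODING_MB, GENOME_MB)        # 87.1
expressed_share = percent(EXPRESSED_MB, CODING_MB)  # 98.3
detected_share = percent(DETECTED_MB, CODING_MB)    # 77.4
```

Merging intervals before summing avoids double-counting bases where overlapping peptides map to the same region.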

Fig. 2. Overview of the proteogenomic results. (A) Scatter plot showing the distribution of the mass errors of all of the identified peptides. (B) Venn diagram illustrating the relative contribution of the different fractionation methods to the total number of peptides and proteins identified. (C) Venn diagram illustrating the coverage of several levels (whole, protein-coding, expressed, detected) of the Synechococcus 7002 genome. (D) Proteome landscape of Synechococcus 7002. An overview of peptide data against the genome was generated using the Circos software (45). The concentric circles from the periphery to the center represent (i) the Synechococcus 7002 chromosome and six plasmids, (ii) proteins encoded by the genome, (iii) GC content of the Synechococcus 7002 genome, (iv) peptides identified in this study, genome search-specific peptides (GSSPs), (v) previously unidentified gene models (1, intergenic genes; 2, different frame with annotated genes; and 3, opposite strand to existing genes discovered in this study), and (vi) revised gene models (N-terminal changes in annotated genes).

Integration and Visualization of Proteomics and RNA-Seq Transcriptomic Data. Increasing numbers of investigators are now incorporating RNA-Seq information with proteomic data to gain a more complete understanding of cellular systems and improve genome annotation (19–21). The comparative analysis of RNA and protein levels can be used as a validation tool to generate a protein atlas with higher reliability. In this study, we performed directional RNA sequencing, and the results are presented in Table S1D. We further integrated the proteomic data described above with the transcriptome determined by RNA-Seq to facilitate the validation of the proteomic results. In-depth transcriptome analyses of RNA-Seq data from Synechococcus cultures under eight different conditions have shown that >97% of the annotated genome is expressed in these cellular states, and our proteome dataset covers >93% of these RNA-Seq data (Fig. 3A). We have thus generated a comprehensive protein database that covers nearly the entire expressed Synechococcus proteome. To compare peptide versus RNA abundance, we plotted mRNA expression [reads per kilobase per million mapped reads (RPKM); 2,160,264 reads] for each known gene (x axis) against protein expression (PSMs; 1,212,428 spectra) falling within the same region (y axis). The Spearman rank correlation coefficient (r = 0.318) indicated only a weak correlation between protein expression (PSMs) and mRNA expression (RPKM) (Fig. 3B). This result is in line with previous studies, which have shown that protein expression is influenced by an array of posttranscriptional regulatory mechanisms and that correlation between protein and mRNA levels is generally modest (22, 23); this result was highly similar to those of the previous analyses on Medicago truncatula (24), human, and mouse (25). These data were combined with the genome data on Synechococcus 7002 and made publicly accessible online (lag.ihb.ac.cn:8080/), where they can be browsed by gene, protein, peptide, and PTM. Fig. 3C shows a screenshot of ABrowse displaying genomic, RNA-Seq, and proteomic data from Synechococcus 7002. This platform helps visualize all of the potential reading frames in the Synechococcus 7002 genome and makes it possible to browse and query the data to identify regions of interest quickly with respect to structural annotation (e.g., previously unidentified genes or identified peptides and their PTMs). The annotation entries shown in the main browsing canvas of ABrowse are all clickable, and their corresponding detailed information can be displayed in the “Entry Detail” tab of the detailed-information/user-space panel. In addition, all of the modified peptides listed in the main browsing canvas are highlighted in light blue, and the previously unidentified genes identified in this work are highlighted in light yellow in the protein-coding gene models region. The interactive visual analytics tool provides a user-friendly web interface to browse, search, retrieve, and update information on the Synechococcus 7002 genome, and it is a step toward integration of proteomic (peptide-centric) and transcriptomic (RNA-Seq) data with current genome-location coordinates, allowing for in-depth studies of individual genes and their protein counterparts as well as more global studies using systems-biology approaches.
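The mRNA-versus-protein comparison above rests on two standard calculations: RPKM normalization of read counts and Spearman's rank correlation between per-gene RPKM and PSM counts. The following is a minimal, dependency-free sketch of both; the function names and the toy per-gene values are hypothetical and do not come from the study's data.

```python
from math import sqrt
from typing import List, Sequence


def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy per-gene values (hypothetical): RPKM on the x axis, PSM counts on the y axis.
gene_rpkm = [rpkm(120, 900, 2_000_000), rpkm(15, 1500, 2_000_000), rpkm(400, 600, 2_000_000)]
gene_psms = [35, 50, 210]
rho = spearman(gene_rpkm, gene_psms)
```

Because Spearman's coefficient depends only on ranks, it is insensitive to the very different scales of RPKM and PSM counts, which is one reason it suits this kind of comparison.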

Fig. 3. Comparison and integration of proteomics and transcriptomics data. (A) Comparison showing the overlap of protein identifications with the transcriptome determined using RNA-Seq in Synechococcus 7002. (B) Correlation between mRNA expression (RPKM) and protein expression (PSMs) in genes detected at both the mRNA and protein levels. (C) Screenshot of the main ABrowse genome browser interface. The genome browser interface consists of the navigation bar, the browsing canvas, and the detailed-information/user-space panel, which was used to covisualize experimental peptides and RNA-Seq data for Synechococcus 7002. The sequences of the protein and the peptides can be seen by zooming into the protein sequence track. This type of covisualization can be done on a large scale to comprehensively integrate the proteome with the genome and the transcriptome.

Identification of Previously Unidentified Genes. We also compared peptide sequences searched against a six-frame translated genome database with those present in the protein database. We identified a set of unique orphan peptides that were not represented among the predicted proteins of Synechococcus. These peptides, designated as genome search-specific peptides (GSSPs), mapped to unique locations on the Synechococcus 7002 genome. Out of the 2,778 GSSPs identified in this study, 486 peptides either mapped to regions of the genome where no gene was annotated or did not match the gene model they were mapping to. Two gene-prediction programs, FgenesB and GeneMark, were used to identify ORFs in the region to which these GSSPs were mapped. Our in-house program was also used to identify ORFs after incorporating a wide range of information, including the peptide score, an initiation codon, the number of peptide hits, and the length of the previously unidentified protein. Using these programs, we identified 118 previously unidentified protein-coding regions and modified the annotation of 38 existing gene models, all of which were assigned by at least two unique GSSPs. A graphical representation of all previously unidentified and corrected proteins identified in this study is shown in Fig. 2D. Further, we used comparative genomic strategies to investigate conservation of previously unidentified gene models across related species. The presence of orthologous genes in other species provides further support for the previously unidentified gene structures; conversely, absence of conserved homologous protein-coding regions in other genomes may indicate that these genes or gene regions may be unique to Synechococcus 7002. Twenty-two of the previously unidentified ORFs had been previously annotated in other species, 80 previously unidentified genes were supported by FgenesB/GeneMark ORF predictions, and the remaining genes were predicted using our in-house program. Table S2 A–C lists the previously unidentified genes found in this study along with their genomic coordinates and the supporting peptide evidence. Among the previously unidentified genes identified, 13 were intergenic, 41 were frame-shifted, and 64 were on the strand opposite to an existing gene annotation. Fig. 4 depicts an example of a previously unidentified gene that is located in the intergenic region between two predicted genes, SynPCC7002_A0518c and SynPCC7002_A0519. Another previously unidentified protein of 100 amino acids (SynPCC7002_DZ003) was discovered in the region spanning 11,658–11,960 on plasmid pAQ4 (Fig. S3). Nine GSSPs revealed a previously unidentified gene (SynPCC7002_Z0076) encoded by the opposite strand of the SynPCC7002_A0539 gene and were identified as part of the gene model encoding a 91-amino acid protein predicted by FgenesB and GeneMark and supported by directional RNA-Seq data (Fig. S4).
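The GSSP concept described in this section (peptides found in the six-frame translated genome but absent from the annotated proteome) can be illustrated with a short sketch. This is not the authors' in-house program; the function names are hypothetical, matching is by exact substring, and the standard codon assignments are assumed (bacterial translation table 11 uses the same amino acid assignments and differs in permitted start codons).

```python
from itertools import product
from typing import Dict, Iterable, List

BASES = "TCAG"
# Amino acids in NCBI codon order (first base varies slowest: TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate codon by codon; stop codons become '*', unknown codons 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> Dict[str, str]:
    """Translations of the three forward (+1..+3) and three reverse (-1..-3) frames."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for strand, seq in (("+", forward), ("-", reverse)):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def genome_search_specific(peptides: Iterable[str],
                           annotated_proteins: Iterable[str],
                           frames: Dict[str, str]) -> List[str]:
    """Peptides present in some translated frame but in no annotated protein."""
    proteins = list(annotated_proteins)
    return [
        pep for pep in peptides
        if any(pep in frame for frame in frames.values())
        and not any(pep in protein for protein in proteins)
    ]


# Toy sequence (hypothetical): the reverse strand encodes a peptide missing from the annotation.
toy_frames = six_frame_translations("ATGAAACCCGGGTTTTAA")
toy_gssps = genome_search_specific(["KPGF", "LKPGF"], ["MKPGF"], toy_frames)
```

A real pipeline would additionally record the genomic coordinates and frame of each match, since the classification into intergenic, frame-shifted, and opposite-strand genes depends on where the GSSPs land relative to existing gene models.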

Fig. 4. Identification of a previously unidentified gene, Z0010, based on peptides mapping to an intergenic region and RNA sequencing evidence. (A) Sixteen peptides mapped to an intergenic region between genes SynPCC7002_A0518c and SynPCC7002_A0519. Gene prediction algorithms FgenesB and GeneMark supported the presence of this additional gene. In addition, a protein corresponding to this previously unidentified gene has been annotated in the Leptolyngbya sp. PCC 7376 genome. (B) Protein sequence of a previously unidentified gene. Identified region is indicated by red text. (C) RNA sequencing evidence also supports the expression of the previously unidentified gene Z0010.

Yang et al. PNAS | Published online December 15, 2014 | E5637


N Terminus Validation and Correction. Apart from identification of previously unidentified ORFs, we corrected some gene coordinates using GSSP data. Using this strategy, we corrected 38 gene models by N-terminal extension, supported by 105 unique GSSPs. Table S2D lists the genes for which our analysis suggested a structural modification, together with the previously assigned coordinates, our modifications, and the corresponding peptide evidence. Fig. S5A depicts an example of correction of a gene model by extension of the N terminus. The current coordinates of the gene SynPCC7002_A0031, which codes for the Dps family DNA-binding stress response protein, are 28,318–28,854. We found seven peptides mapping upstream of the gene, and comparative genomic analysis detected a homologous protein in Synechococcus elongatus PCC 7942. In addition, both FgenesB and GeneMark gene-prediction algorithms predicted gene models that extended the gene in the 5′ direction, in agreement with the newly annotated start codon. Thus, a combination of GSSPs, homology, and RNA-Seq–based analysis allowed us to extend the gene model 48 nucleotides upstream from the previous start codon. We further analyzed protein translation start sites (TSSs) by probing N-terminal–specific modifications. Formylation occurs on the initiator methionine, and N-terminal acetylation occurs on the second amino acid after the initiator methionine is cleaved. These modifications directly mark the TSS for a protein-coding gene. Based on semiproteolytic peptides identified at <1% FDR and their upstream codons in the genome, we confirmed the annotated TSS for 52 genes (Table S2E). Fig. S5B depicts two N-terminally formylated unique peptides that facilitate confirmation of the TSS for SynPCC7002_A1803c, a protein involved in the carbon dioxide-concentrating mechanism. We also identified N-terminal peptides that confirmed the TSS of three revised gene models (SynPCC7002_A0315, SynPCC7002_A1929, and SynPCC7002_A0646) (Table S2E).

Identification of Proteins with Noncanonical Translation Initiators. In prokaryotes, translation initiation is typically mediated by the start codon ATG. In addition, GTG and TTG are used as alternative start codons in agreement with the preference: ATG > GTG > TTG (26, 27). Studies on Escherichia coli have estimated the frequency of start codon use as ATG 83%, GTG 14%, and TTG 3% (28). In Synechococcus 7002, the frequencies of initiation codons were ATG 83.11%, GTG 9.51%, TTG 7.31%, and CTG 0.06%. Additionally, it has been reported that ATT, ATA, and ATC are rare noncanonical translation initiators (28, 29). A set of 19 putative proteins with noncanonical start codons (ATT or ATA), supported by 43 unique GSSPs, was identified and is reported in Table S2F. Fig. S6A illustrates an example of the use of peptides to predict a nontraditional start codon. We identified two peptides that map to the opposite strand of SynPCC7002_A0826c; however, none of the common ATG, GTG, or TTG start codons were found upstream of the region spanned by these two peptides. A survey of available literature on translation initiation revealed that ATA is known to function as a rare translation initiator, suggesting that this previously unidentified gene Z0045 may use ATA as the start codon. A representative MS/MS spectrum of the peptide INDAALRKE is shown in Fig. S6A, and the nearly complete b and y ion series provide additional confidence in the annotation. The previously unidentified protein is supported by two gene-prediction algorithms, FgenesB and GeneMark, and we validated the expression of this gene at the level of the transcript with seven mRNA reads and a 2.55 RPKM value. The use of this noncanonical translation initiation codon may imply a specific regulatory mechanism, and further experiments are needed to characterize the fundamental regulatory mechanisms for cyanobacterial cells.
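The search for a start codon upstream of peptide evidence can be sketched as an in-frame walk that stops at the first upstream stop codon. This is an illustrative sketch only, not the authors' procedure: the function name `upstream_starts` and the toy sequences are hypothetical, the input is assumed to be the coding-strand sequence, and the codon covered by the peptide itself is included in the scan.

```python
CANONICAL = ("ATG", "GTG", "TTG")
NONCANONICAL = ("ATT", "ATA", "ATC")  # rare initiators named in the text
STOPS = ("TAA", "TAG", "TGA")


def upstream_starts(seq: str, peptide_start: int):
    """Walk upstream, codon by codon and in frame, from `peptide_start`
    (0-based index of the first codon covered by peptide evidence) until an
    in-frame stop codon or the beginning of the sequence is reached.

    Returns two lists of (index, codon): canonical and noncanonical
    candidate start codons, ordered from nearest to farthest upstream.
    """
    canonical, noncanonical = [], []
    pos = peptide_start
    while pos >= 0:
        codon = seq[pos:pos + 3].upper()
        if codon in STOPS:
            break  # an in-frame stop bounds the open reading frame
        if codon in CANONICAL:
            canonical.append((pos, codon))
        elif codon in NONCANONICAL:
            noncanonical.append((pos, codon))
        pos -= 3
    return canonical, noncanonical


# Toy sequence: TAA ATA GCT AAC GAT, with peptide evidence starting at AAC.
canonical, noncanonical = upstream_starts("TAAATAGCTAACGAT", 9)
print(canonical)     # []
print(noncanonical)  # [(3, 'ATA')]
```

When the canonical list is empty and a noncanonical candidate is present, as in this toy case, the situation resembles the one described for Z0045; any such call would still need the spectral, gene-prediction, and transcript support discussed above.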

Identification of Peptides with Stop Codon Read-Through. Aside from noncanonical translation initiation, protein translation can sometimes continue through a stop codon, a mechanism known as “stop codon read-through” (30–32). Two examples of this mechanism are the TGA and TAG stop codons that sometimes code for the rare amino acids selenocysteine and pyrrolysine, respectively (33, 34). In this study, we translated all of the TGA and TAG codons into selenocysteine (U) and pyrrolysine (O), respectively, while keeping the third stop codon, TAA, as the true stop codon. We found 23 read-through candidates with 43 unique GSSPs, and Table S2G provides a complete list of the peptides identified with their genomic coordinates and RPKM values. Annotated peptide spectrum matches of the identified peptides with stop codon read-through have been deposited to the PeptideAtlas database and can be accessed using the accession number PASS00285. Fig. S6B depicts a previously unidentified gene model in which two genes are merged into one based on RNA-Seq and MS experimental evidence of translation of the TAG stop codon as pyrrolysine (O). Read-through has been observed across the animal kingdom, and its widespread use suggests an additional level of regulatory complexity (31). Therefore, the identified read-through in cyanobacteria offers a rich opportunity to enhance our understanding of the underlying molecular mechanisms and regulation of the read-through process more generally.
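The translation rule described above (TGA read as U, TAG read as O, TAA kept as the only stop) can be written as a small modification of the standard codon table. The sketch below is a minimal illustration; the function name `translate_readthrough` and the example sequence are hypothetical, and unknown codons are rendered as "X" by assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate_readthrough(dna: str) -> str:
    """Translate frame 0 of `dna`, reading TGA as selenocysteine (U) and
    TAG as pyrrolysine (O); TAA remains the only terminating codon."""
    table = dict(CODON_TABLE)
    table["TGA"] = "U"
    table["TAG"] = "O"
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = table.get(dna[i:i + 3].upper(), "X")  # "X" for ambiguous codons
        if aa == "*":  # only TAA still maps to "*"
            break
        protein.append(aa)
    return "".join(protein)


# ATG GCT TAG AAA TGA CCC TAA GGG
print(translate_readthrough("ATGGCTTAGAAATGACCCTAAGGG"))  # MAOKUP
```

Searching spectra against a database built this way lets peptides that span a TGA or TAG codon be matched, which is how read-through candidates become detectable at all.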

Identification of Signal Peptides. A signal peptide is a short N-terminal sequence that targets a protein for export or transport to a desired cellular location (35). Signal peptides are essential for proper cellular function in both eukaryotes and prokaryotes (35). The average length of a signal peptide in Gram-negative bacteria is estimated to be 25 amino acids, with most signal peptides between 20 and 30 amino acids (36). Although knowledge of signal peptides is important for understanding protein function, they are difficult to confirm experimentally, and computational predictions are used to fill the gap. SignalP (37) and PrediSi (38) are two popular signal peptide prediction tools. Proteomic evidence can greatly increase the number of experimentally confirmed signal peptides and the confidence of signal peptide predictions. We analyzed our peptide annotations to confirm or refute signal peptide predictions. The lists of signal peptides predicted by SignalP, PrediSi, and MS/MS analysis are provided in Table S3. A clear sequence motif, which closely matches motifs used by SignalP and PrediSi, emerges when we examine the sequence immediately upstream of the 107 putative signal peptides predicted by MS/MS analysis (Fig. S7A). SignalP and PrediSi predict 130 and 332 proteins with signal peptides, respectively. However, there is a substantial discrepancy between these tools: only 81 signals are predicted by both. LC-MS/MS evidence provides a strategy to resolve the discrepancies between SignalP and PrediSi and identify signal peptides missed by both tools. Fig. S7B compares our predicted signal peptides with the predictions made by SignalP and PrediSi. Our results confirm 12 signal peptide predictions. For 119 confirmed proteins, the MS/MS results include peptides upstream of the cleavage site predicted by SignalP/PrediSi and thus represent evidence against SignalP/PrediSi predictions (Table S3D). Therefore, we refute 25 sites predicted by SignalP and 110 sites predicted by PrediSi (with 8 refuted sites predicted by both tools) (Fig. S7C).
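The logic of using peptide evidence to confirm or refute a predicted cleavage site can be sketched as follows. This is a simplified illustration, not the authors' criteria: the function name `assess_signal_peptide` and the example coordinates are hypothetical, a peptide beginning exactly at the first residue after the predicted cleavage site is taken as confirmation, and any peptide beginning within the predicted signal peptide is taken as evidence against the prediction.

```python
def assess_signal_peptide(cleavage_after, peptide_spans):
    """Compare a predicted cleavage site with identified peptides.

    cleavage_after: last residue (1-based) of the predicted signal peptide.
    peptide_spans:  iterable of (start, end) residue positions, 1-based and
                    inclusive, for peptides identified on this protein.
    """
    spans = list(peptide_spans)
    # A peptide that begins inside the predicted signal peptide means that
    # region was observed in the sample, which argues against cleavage there.
    if any(start <= cleavage_after for start, _ in spans):
        return "refuted"
    # A peptide whose N terminus is the first mature residue supports it.
    if any(start == cleavage_after + 1 for start, _ in spans):
        return "confirmed"
    return "no evidence"


print(assess_signal_peptide(25, [(26, 40), (60, 72)]))  # confirmed
print(assess_signal_peptide(25, [(10, 22)]))            # refuted
print(assess_signal_peptide(25, [(50, 60)]))            # no evidence
```

A stricter version would also require the confirming peptide to have a non-proteolytic N terminus, so that a cut made by the digestion enzyme is not mistaken for signal peptidase cleavage.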

Global Posttranslational Modification Discovery in Synechococcus 7002. As mentioned above, proteogenomic data may be used to identify PTMs on the proteome-wide level. Much less is known about the type, frequency, and function of PTMs in prokaryotes, even for intensively studied model organisms such as E. coli, than in eukaryotes. PTM information gained from MS/MS data of cells grown under different conditions will contribute to an understanding of PTMs in prokaryotic organisms. Recently, we described a phosphoproteomic analysis of Synechococcus 7002 by

E5638 | www.pnas.org/cgi/doi/10.1073/pnas.1412722111 Yang et al.


using protein/peptide prefractionation, TiO2 enrichment, and LC-MS/MS analysis (39). In this study, we first analyzed the mass spectra using the MODa (MODification via alignment) algorithm (40), which allows for unrestrictive PTM searches with no limitation on the number of modifications per peptide and incorporates several unique features to improve the sensitivity and accuracy of peptide identification and to limit increases in false positives and false negatives. In all, 278,859 spectra of 70,042 unique peptides and 2,469 proteins were identified by MODa. Among these results, 76,103 spectra of 28,905 unique peptides and 1,666 proteins contained PTMs with sizes accepted by the MODa search, regardless of the modification classification in Unimod. Within these PTMs, 7,026 spectra of 3,059 unique peptides and 690 proteins contained in vivo PTMs. These findings, along with the parameters for search and output column descriptions, are summarized in Table S4A. Table S4B presents 25 common modification types and the frequency of each. Because the false positive rate is low, it is extremely unlikely that any of these modification types represents a computational artifact. Moreover, all of the selected modification types in Table S4B are supported by studies on other species, further reinforcing the conclusion that they are not artifacts.

MODa was used to (i) discover and identify unexpected modifications and (ii) assign posttranslationally modified peptide sequences to MS/MS spectra, but MODa cannot be used to assign the localization of modifications to specific amino acids. To address this problem, the MaxQuant computational proteomics platform was used to search for the specific PTM and determine the localization probability of modifications in peptides (41). The identified peptides with different PTMs were further validated by manual inspection of MS/MS data (see criteria in SI Materials and Methods). MaxQuant takes advantage of high-resolution data such as those obtained by the LTQ-Orbitrap instruments and employs algorithms that determine the mass precision and

Fig. 5. Summary of the identified proteins with PTMs in Synechococcus 7002. (A) Distribution of the number of proteins and sites with different PTMs in Synechococcus 7002. (B) Distribution of the predicted, identified, and modified proteins of Synechococcus 7002 according to their molecular functions. Different PTMs are represented with different colors. GO category classification of Synechococcus 7002 proteins as predicted from their genome annotations. *, Denotes overrepresentation.


accuracy of individual peptides (41). This method leads to greatly enhanced peptide mass accuracy that can be used as a filter in database searching. Our approach identified 23 different PTMs on 6,704 unique peptides from 2,230 Synechococcus 7002 proteins with high confidence (FDR <1%). Fig. 5A shows the numbers and types of PTMs identified by MaxQuant in Synechococcus 7002. We further confirmed the localization accuracy of these modification sites using probability-based PTM scores. These scores were used to rank peptides in MaxQuant searches from the beginning, and they determined the localization probability of modifications in the peptides. For the modified peptides detected, 11,839 modification sites were determined with a localization probability higher than 0.75 (Table S5A). In the other 1,977 peptides, the modification sites could not be unambiguously determined from the mass spectra but are limited to the short amino acid sequences in the peptides. Details of the identified peptides with each PTM, including their protein IDs, sequences, search algorithm scores, PTM scores, and localization P values, are provided in Table S5A. To evaluate the identification performance of PTM events by different algorithms, we compared the modified peptides and proteins identified by both PTM search algorithms, and Fig. S7D presents the overlap between MODa and MaxQuant search results. It shows that 649 unique modified peptides and 1,518 modified proteins were identified by both PTM search algorithms. Table S4B presents the common modification sites identified by two PTM search algorithms. We also compiled a list of all proteins identified in this study along with the number of unique peptides, coverage, UniProt accessions, NCBI RefSeq accessions, protein product description, domains, Gene Ontology information, signal peptide and transmembrane domain information in Table S5B.
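The localization cutoff described above amounts to partitioning candidate sites at a probability of 0.75. The following sketch shows that step on hypothetical records; the dictionary keys `site` and `loc_prob` and the function name are assumptions made for illustration and do not correspond to any particular MaxQuant output column.

```python
def split_by_localization(sites, threshold=0.75):
    """Partition modification sites into confidently localized ones
    (probability strictly above `threshold`) and ambiguous ones."""
    localized = [s for s in sites if s["loc_prob"] > threshold]
    ambiguous = [s for s in sites if s["loc_prob"] <= threshold]
    return localized, ambiguous


# Hypothetical site records.
sites = [
    {"site": "K12", "loc_prob": 0.99},
    {"site": "S40", "loc_prob": 0.75},
    {"site": "Y7", "loc_prob": 0.51},
]
localized, ambiguous = split_by_localization(sites)
print([s["site"] for s in localized])  # ['K12']
print([s["site"] for s in ambiguous])  # ['S40', 'Y7']
```

Sites in the ambiguous group are still modified peptides; only the exact residue carrying the modification is uncertain.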

Biological Relevance of Modified Proteins. Previous conservation analysis indicated that posttranslationally modified proteins are more conserved than unmodified proteins from prokaryotes to eukaryotes (5). Owing to the high coverage and large number of modified proteins identified, we compared the conservation levels between MS-detected modified and unmodified proteins. In this study, we searched for orthologs of Synechococcus proteins against 652 bacterial species across the phylogenetic tree, as well as against 18 Archaea and 61 eukaryotes, by performing 2D BLASTP. As shown in Fig. S7E, the MS-detected proteins were, on average, more conserved than the unidentified proteins in the Synechococcus 7002 database throughout the three superkingdoms (P < 0.01). In addition, our analysis also indicated that most MS-detected modified proteins are more highly conserved than the MS-detected total proteins (P < 0.01), suggesting that the modified proteins could be involved in the conserved function class and translation machinery. Table S6 compares some important PTMs between Synechococcus 7002 and other species. This set of modified proteins supports the emerging view that PTM events are general and fundamental regulatory processes that occur in both prokaryotes and eukaryotes and opens the way for their functional and evolutionary analysis in cyanobacteria.

The entire set of predicted proteins, MS-detected proteins, and modified proteins was searched against the NCBI COG (Clusters of Orthologous Genes) database and classified into a wide range of functional classes. The distribution of protein proportions in different functional classes is illustrated in Fig. 5B and Tables S7 and S8. In particular, certain sets of modified proteins were statistically overrepresented (P < 0.05) in certain functional classes (Fig. 5B). For example, the identified proteins with methylation, persulfide modification, farnesylation, and hydroxymethylation PTMs were overrepresented in photosynthesis and respiration processes, suggesting that these PTM events are extensively involved in photosynthesis in Synechococcus 7002.
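Overrepresentation of a modification in a functional class is commonly assessed with a one-sided hypergeometric test, which is the test named in Materials and Methods. The sketch below shows the calculation with the standard library only. The function names are hypothetical, and the Benjamini-Hochberg adjustment is an assumption: the text states that a multiple-testing correction was applied but does not say which one.

```python
from math import comb


def hypergeom_sf(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) for X ~ Hypergeometric.

    N: proteins in the background set
    K: background proteins in the functional class
    n: proteins carrying the modification
    k: modified proteins that fall in the class
    """
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy check: one draw from 10 proteins, 5 of which are in the class.
print(hypergeom_sf(1, 10, 5, 1))  # 0.5
```

A class would be called overrepresented when its adjusted p-value falls below the chosen threshold, for example the P < 0.05 cutoff quoted above.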

PTM Events Involved in Photosynthesis. It has been reported that protein PTM events are involved in the regulation of photosynthesis in cyanobacteria (39). According to the proteogenomic data set contributed by this study, many proteins in the photosynthetic apparatus carry diverse PTMs (Table S9), as illustrated in Fig. 6A. Mapping PTM events to the major components of the photosynthetic apparatus will facilitate the integration of proteogenomic data with biological function and may thereby provide insight into the potential functional relevance of the identified PTMs. To confirm the existence of identified PTMs in Synechococcus 7002 further, we carried out immunoblotting analyses of Synechococcus 7002 total cell lysates using pan anti-acetylation (lysine, abbreviated as K), trimethylation (K), dimethylation (K), butyrylation (K), crotonylation (K), phosphorylation (tyrosine, abbreviated as Y), and succinylation (K) antibodies. The specificity of these antibodies was confirmed as in previous reports (42, 43). Strong signals were detected for all seven antibodies tested (Fig. 6B). Interestingly, the succinylation (K) and phosphorylation (Y) signals from Synechococcus 7002 cells showed changes after the cells were treated with high light for 2 h, suggesting that these two PTM events may play a role in the regulation of photosynthesis in Synechococcus 7002. Apart from the high light treatment, we also performed an immunoblot analysis of global protein and PTM levels upon different stress inductions, and the results are shown in Fig. S7F. Although we cannot validate all of the predicted modifications at this time due to lack of available experimental data about PTMs for Synechococcus 7002, future research may confirm many of these identified modifications and begin to uncover their biological functions.

Development of a Software Tool for Automated Proteogenomic Analysis. We also developed a software tool for automated spectral searches, PTM discovery, and downstream analysis that is easily configurable and accessible. Inputs from different search engines can be used, which allows this software to be easily integrated with existing frameworks. The outputs can be used to validate, refine, and discover protein-coding genes using high-throughput MS data from prokaryotes. Notably, the existing PTM algorithms (MODa and MaxQuant) were integrated into this tool for proteome-wide PTM discovery. Based on the reported findings, this tool is, to our knowledge, the first integrated software for both large-scale PTM discovery and genome annotation, providing comprehensive information on high-coverage protein expression and PTM events in Synechococcus 7002. We anticipate that the software tool used herein can be applied to any sequenced prokaryotic organism to yield similar sets of novel discoveries.

Discussion
In this study, we performed a comprehensive proteogenomic analysis of gene expression in prokaryotes with the goals of (i) genome refinement, (ii) in-depth analysis of the detectable proteome, and (iii) global PTM discovery. To achieve optimal gene-expression coverage, we performed both transcriptome and proteome analyses of Synechococcus 7002 cultures from different culture conditions. Both technological platforms used in this study, Illumina GAIIx directional RNA sequencing and LTQ-Orbitrap Elite MS, represent the state of the art in the fields of transcriptomics and proteomics, respectively. Our results demonstrate that the use of high-resolution MS-based proteomics allows an almost complete genome annotation to be reached by incorporating dozens of conditions and different separation methods for a specific organism. The identified proteins represent 92.2% of the predicted proteome, which exceeds the results obtained in most published cases. Multiple factors may prevent 100% coverage, including extensive posttranscriptional regulation that makes transcripts rather poor guides for protein expression. Here, we have used eight different growth conditions, including typical laboratory conditions


and several near-native aquatic environmental conditions. However, not all possible growth conditions for Synechococcus 7002 can be efficiently sampled, and it is likely that certain ORFs require specific conditions, such as high or low temperature, for expression. In addition, various technical aspects associated with our experiments, such as incompatibility with buffers for soluble protein extraction, may have precluded detection. However, it is possible that application of additional protein and peptide separation methods could enable improved coverage of low abundance proteins.

A major impact of our methods is highlighted in the unique capability to experimentally annotate PTM events, which are increasingly recognized as playing important roles in prokaryotic biology. Our results provide a level of coverage similar to that of individual PTM analyses. Comprehensive analysis of PTM events using the same dataset revealed a large and complex repertoire of PTMs, including those known to influence various cell processes, including photosynthesis. The identified PTM datasets from this study provide a foundation for studying the biological functions of these PTM events in the photosynthesis process. In principle, the same method may be used for obtaining a holistic view of signal transduction processes across time and perturbation conditions and will be generically applicable to any biological model system and PTMs.

We consider the method used herein to be a valuable test bed for technology that will be widely applicable to any sequenced prokaryotic organism. For eukaryotes, the database construction will need to be modified due to the high proportion of noncoding sequences. A direct translation of eukaryotic DNA sequences would lead to a dramatic increase of protein database sizes, and translation of mRNA transcripts into protein sequences may be used to solve this problem. Given the number of discoveries and modifications for the relatively small genome of Synechococcus 7002 (3.41 Mb), we anticipate that similar efforts in organisms with larger, less well-studied genomes will yield similar or greater sets of novel discoveries. The methods are reasonable in terms of time, scale, and cost and are therefore accessible to most researchers. The entire annotation pipeline is generalizable to any sequenced prokaryotic organism and may, with modification, be extended to eukaryotes in the near future. We expect that the methods used in this study will become an integral part of ongoing genome sequencing and annotation efforts.

Materials and Methods
The WT strain of Synechococcus 7002 was grown under different treatment conditions. Protein extracts were prepared from the cell mixtures, and proteins were subsequently reduced, alkylated, and digested using Glu-C

Fig. 6. Overview of the PTM events involved in the photosynthesis process. (A) Working scheme to delineate the PTM events in photosynthesis pathways in Synechococcus 7002. Modified proteins identified in Synechococcus 7002 are shown using a black arrow. Different PTMs are shown as squares with different colors. (B) Relative proteome-wide PTM levels in Synechococcus 7002 after a 2-h exposure to high light compared with standard conditions. Immunoblots were probed using antibodies for acetyllysine, succinyllysine, butyryllysine, crotonyllysine, dimethyllysine, trimethyllysine, and phosphotyrosine. Coomassie blue staining shows equal loading amounts (Left). HL, high light treatment; C, untreated control.


followed by trypsin. Proteins and peptides were separated by SDS/PAGE, strong cation exchange chromatography, and self-packed reversed-phase C18 columns. An Easy-nano liquid chromatography system (Thermo Fisher Scientific) equipped with an Easy-Spray 50-cm column (C18, 2 μm, 75 μm × 50 cm; Thermo Fisher Scientific) was coupled to an LTQ Orbitrap Elite mass spectrometer (Thermo Fisher Scientific). Database search and spectral analysis are described in SI Materials and Methods. Genome search specific peptide (GSSP) sequences that resulted from mass spectrometric data obtained at high resolution were used for genomic reannotation. Previously unidentified genes and gene model revisions were obtained using the gene-prediction programs FgenesB and GeneMark, as well as our in-house software (Python script). Gene Ontology (GO) classification analysis was performed using an in-house Perl script, and enrichment analysis for protein function classes was performed using a hypergeometric test, with correction for multiple hypothesis testing. Conservation analysis was performed using the two-directional BLASTP. For directional RNA-Seq, total RNA was isolated from cell mixtures grown under different growth conditions using the RNAprep pure Plant Kit (Tiangen), according to the manufacturer’s instructions, and sequencing was performed on an Illumina GAIIx. A detailed overview of the methods used herein is presented in SI Materials and Methods. MS raw data files as well as processed data files are publicly available through PeptideAtlas (www.peptideatlas.org) (44) (data set ID PASS00285). Annotated peptide spectral matches for all of the accepted single peptide hits are deposited in PeptideAtlas and supplied as Supplementary Data 1. Annotated peptide spectrum matches of identified peptides with stop codon read-through are deposited in PeptideAtlas and provided as Supplementary Data 2. Annotated spectra of all of the identified peptides with specific PTMs are deposited in PeptideAtlas and presented as Supplementary Data 3. The snapshots of the genome browser for all of the previously unidentified genes along with the corrections to the existing gene models identified in this study are deposited in PeptideAtlas and provided as Supplementary Data 4. All supplemental tables (Tables S1–S9) are deposited in PeptideAtlas and provided as Supplementary Tables. RNA-Seq data are provided as Transcriptome. In-house integrated proteogenomic analysis software is supplied as Supplementary Software, and detailed information is provided in the User’s Quick Start Guide file of this software package.
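The two-directional BLASTP conservation analysis mentioned in this section is not detailed here. One common reading of the term is a reciprocal-best-hit criterion, sketched below on precomputed best hits. This interpretation, the function name `reciprocal_best_hits`, and the identifiers in the example are all assumptions made for illustration; the authors' actual criteria are given in SI Materials and Methods.

```python
def reciprocal_best_hits(forward, reverse):
    """Keep query/subject pairs that are each other's best hit.

    forward: dict mapping each query protein to its best hit in the
             target proteome (query -> subject).
    reverse: dict mapping each target protein to its best hit back in the
             query proteome (subject -> query).
    """
    return {q: s for q, s in forward.items() if reverse.get(s) == q}


# Hypothetical best-hit tables parsed from two BLASTP runs.
forward = {"A0031": "x1", "A0518": "x2", "A0519": "x3"}
reverse = {"x1": "A0031", "x2": "A9999"}
print(reciprocal_best_hits(forward, reverse))  # {'A0031': 'x1'}
```

Counting, for each protein, the number of species in which such a pair exists gives one simple conservation score that can then be compared between modified and unmodified proteins.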

ACKNOWLEDGMENTS. This work was supported by the National Basic Research Program of China (973 Program, 2012CB518700), National Natural Science Foundation of China Grant 31370746, and the Hundred Talents Program of the Chinese Academy of Sciences. Research in the laboratory of D.A.B. was supported by National Science Foundation Grants MCB-0519743 and MCB-1021725.

1. Renuse S, Chaerkady R, Pandey A (2011) Proteogenomics. Proteomics 11(4):620–630.2. KimMS, et al. (2014) A draft map of the human proteome. Nature 509(7502):575–581.3. Chaerkady R, et al. (2011) A proteogenomic analysis of Anopheles gambiae using

high-resolution Fourier transform mass spectrometry. Genome Res 21(11):1872–1881.4. Wilhelm M, et al. (2014) Mass-spectrometry-based draft of the human proteome.

Nature 509(7502):582–587.5. Cao XJ, et al. (2010) High-coverage proteome analysis reveals the first insight of

protein modification systems in the pathogenic spirochete Leptospira interrogans.Cell Res 20(2):197–210.

6. Gupta N, et al. (2007) Whole proteome analysis of post-translational modifications:applications of mass-spectrometry for proteogenomic annotation. Genome Res 17(9):1362–1377.

7. Cain JA, Solis N, Cordwell SJ (2014) Beyond gene expression: The impact of proteinpost-translational modifications in bacteria. J Proteomics 97:265–286.

8. Olsen JV, Mann M (2013) Status of large-scale analysis of post-translational mod-ifications by mass spectrometry. Mol Cell Proteomics 12(12):3444–3452.

9. Ahrné E, Müller M, Lisacek F (2010) Unrestricted identification of modified proteinsusing MS/MS. Proteomics 10(4):671–686.

10. Armengaud J (2010) Proteogenomics and systems biology: Quest for the ultimatemissing parts. Expert Rev Proteomics 7(1):65–77.

11. Zhang S, Bryant DA (2011) The tricarboxylic acid cycle in cyanobacteria. Science334(6062):1551–1553.

12. Johnson ZI, et al. (2006) Niche partitioning among Prochlorococcus ecotypes alongocean-scale environmental gradients. Science 311(5768):1737–1740.

13. Chellamuthu VR, Alva V, Forchhammer K (2013) From cyanobacteria to plants: Con-servation of PII functions during plastid evolution. Planta 237(2):451–462.

14. Rosgaard L, de Porcellinis AJ, Jacobsen JH, Frigaard NU, Sakuragi Y (2012) Bio-engineering of carbon fixation, biofuels, and biochemicals in cyanobacteria andplants. J Biotechnol 162(1):134–147.

15. Lubner CE, et al. (2011) Solar hydrogen-producing bionanodevice outperforms nat-ural photosynthesis. Proc Natl Acad Sci USA 108(52):20988–20991.

16. Parmar A, Singh NK, Pandey A, Gnansounou E, Madamwar D (2011) Cyanobacteriaand microalgae: A positive prospect for biofuels. Bioresour Technol 102(22):10163–10172.

17. Ludwig M, Bryant DA (2011) Transcription profiling of the model cyanobacteriumSynechococcus sp. strain PCC 7002 by Next-Gen (SOLiD™) sequencing of cDNA. FrontMicrobiol 2:41.

18. Ludwig M, Bryant DA (2012) Acclimation of the global transcriptome of the cyano-bacterium Synechococcus sp. strain PCC 7002 to nutrient limitations and differentnitrogen sources. Front Microbiol 3:145.

19. Nagaraj N, et al. (2011) Deep proteome and transcriptome mapping of a humancancer cell line. Mol Syst Biol 7:548.

20. Chang C, et al. (2014) Systematic analyses of the transcriptome, translatome, andproteome provide a global view and potential strategy for the C-HPP. J Proteome Res13(1):38–49.

21. Fanayan S, et al. (2013) Proteogenomic analysis of human colon carcinoma cell linesLIM1215, LIM1899, and LIM2405. J Proteome Res 12(4):1732–1742.

22. Schwanhäusser B, et al. (2011) Global quantification of mammalian gene expressioncontrol. Nature 473(7347):337–342.

23. Wu L, et al. (2013) Variation and genetic control of protein abundance in humans.Nature 499(7456):79–82.

24. Volkening JD, et al. (2012) A proteogenomic survey of the Medicago truncatula ge-nome. Mol Cell Proteomics 11(10):933–944.

25. Branca RM, et al. (2014) HiRIEF LC-MS enables deep proteome coverage and unbiasedproteogenomics. Nat Methods 11(1):59–62.

26. Ishino Y, Okada H, Ikeuchi M, Taniguchi H (2007) Mass spectrometry-based pro-karyote gene annotation. Proteomics 7(22):4053–4065.

27. Suzek BE, Ermolaeva MD, Schreiber M, Salzberg SL (2001) A probabilistic method foridentifying start codons in bacterial genomes. Bioinformatics 17(12):1123–1130.

28. Ansong C, et al. (2011) Experimental annotation of post-translational features andtranslated coding regions in the pathogen Salmonella Typhimurium. BMC Genomics12:433.

29. Binns N, Masters M (2002) Expression of the Escherichia coli pcnB gene is transla-tionally limited using an inefficient start codon: A second chromosomal example oftranslation initiated at AUU. Mol Microbiol 44(5):1287–1298.

30. Williams I, Richardson J, Starkey A, Stansfield I (2004) Genome-wide prediction of stopcodon readthrough during translation in the yeast Saccharomyces cerevisiae. NucleicAcids Res 32(22):6605–6616.

31. Jungreis I, et al. (2011) Evidence of abundant stop codon readthrough in Drosophilaand other metazoa. Genome Res 21(12):2096–2113.

32. Chan CS, Jungreis I, Kellis M (2013) Heterologous stop codon readthrough of metazoan readthrough candidates in yeast. PLoS ONE 8(3):e59450.

33. Hao B, et al. (2002) A new UAG-encoded residue in the structure of a methanogen methyltransferase. Science 296(5572):1462–1466.

34. Srinivasan G, James CM, Krzycki JA (2002) Pyrrolysine encoded by UAG in Archaea: Charging of a UAG-decoding specialized tRNA. Science 296(5572):1459–1462.

35. Ivankov DN, et al. (2013) How many signal peptides are there in bacteria? Environ Microbiol 15(4):983–990.

36. Dirix G, et al. (2004) Peptide signal molecules and bacteriocins in Gram-negative bacteria: A genome-wide in silico screening for peptides containing a double-glycine leader sequence and their cognate transporters. Peptides 25(9):1425–1440.

37. Petersen TN, Brunak S, von Heijne G, Nielsen H (2011) SignalP 4.0: Discriminating signal peptides from transmembrane regions. Nat Methods 8(10):785–786.

38. Hiller K, Grote A, Scheer M, Munch R, Jahn D (2004) PrediSi: Prediction of signal peptides and their cleavage positions. Nucleic Acids Res 32(Web Server issue):W375–W379.

39. Yang MK, et al. (2013) Global phosphoproteomic analysis reveals diverse functions of serine/threonine/tyrosine phosphorylation in the model cyanobacterium Synechococcus sp. strain PCC 7002. J Proteome Res 12(4):1909–1923.

40. Na S, Bandeira N, Paek E (2012) Fast multi-blind modification search through tandem mass spectrometry. Mol Cell Proteomics 11(4):M111.010199.

41. Cox J, Mann M (2008) MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol 26(12):1367–1372.

42. Tan M, et al. (2011) Identification of 67 histone marks and histone lysine crotonylation as a new type of histone modification. Cell 146(6):1016–1028.

43. Zhang Z, et al. (2011) Identification of lysine succinylation as a new post-translational modification. Nat Chem Biol 7(1):58–63.

44. Desiere F, et al. (2006) The PeptideAtlas project. Nucleic Acids Res 34(Database issue):D655–D658.

45. Krzywinski M, et al. (2009) Circos: An information aesthetic for comparative genomics. Genome Res 19(9):1639–1645.

E5642 | www.pnas.org/cgi/doi/10.1073/pnas.1412722111 Yang et al.
