A Recommendation for Naming Transcription Factor Proteins in the Grasses
Department of Biological Sciences, University of Toledo, Toledo, Ohio 43606 (J.G.)
Department of Cell and Molecular Biology, John Innes Center, Norwich Research Park, Norwich NR4 7UH, United Kingdom (M.B.)
Boyce Thompson Institute, Cornell University, Ithaca, New York 14853 (T.B.)
Department of Plant Biology, Michigan State University, East Lansing, Michigan 48824 (C.R.B.)
Division of Biological Sciences (K.C.) and Department of Biology (E.K.), University of Missouri, Columbia, Missouri 65211
Plant Gene Expression Center, Albany, California 94710 (S.H.)
Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724 (D.J., M.T.)
Corn Insects and Crop Genetics Research Unit, Ames, Iowa 50201 (C.L.)
Cornell University, Ithaca, New York 14853 (S.M.)
Department of Botany and Plant Pathology and Center for Genome Research and Biocomputing, Oregon State University, Corvallis, Oregon 97331 (T.M.)
Department of Crop Sciences and Energy Biosciences Institute, University of Illinois, Urbana-Champaign, Illinois 61801 (S.M.)
Plant Genome Mapping Laboratory, University of Georgia, Athens, Georgia 30602 (A.P.)
Department of Genetics, Development and Cell Biology, Iowa State University, Ames, Iowa 50011–3260 (T.P.)
Department of Energy Joint Genome Institute, Walnut Creek, California 94598 (D.R.)
Instituto de Química, Departamento de Bioquímica, Universidade de São Paulo, São Paulo, Brazil (G.M.S.)
Center for Plant and Microbial Genomics, Department of Plant Biology, University of Minnesota, Saint Paul, Minnesota 55108 (N. Springer)
Leibniz-Institute of Plant Genetics and Crop Plant Research, Genebank Department, 06466 Gatersleben, Germany (N. Stein)
and Department of Plant Pathology (G.-L.W.) and Department of Plant Cellular and Molecular Biology and Plant Biotechnology Center (E.G.), The Ohio State University, Columbus, Ohio 43210
* Corresponding author
Transcription factors are central to the exquisite temporal and spatial expression patterns of many genes. These proteins are characterized by their ability to be tethered to particular regulatory sequences in the genes that they control. While many other proteins participate in the regulation of gene expression, we limit our definition of transcription factors here to proteins that often contain a characteristic structural motif, the DNA-binding domain, which is involved in recognizing a short (usually 4–8 bp) DNA sequence. Based on the structure of the DNA-binding domain, transcription factors are classified into 50 to 60 different families, and in plants, 5% to 7% of all the protein-encoding genes encode transcription factors, making them, collectively, perhaps the largest functional class of proteins.
The availability of protein sequence information from several plant genomes that have been fully or partially sequenced over the past few years is creating an urgent need to develop a set of criteria for naming and identifying members of large protein families. A common nomenclature would facilitate better communication among scientists working not just on a particular plant system but also across different plant species. This need is becoming particularly acute in the grasses, where some members have a rich genetic history (e.g. maize [Zea mays], rice [Oryza sativa], barley [Hordeum vulgare]) and genes have been named for more than 100 years according to the phenotypes of the corresponding mutations, often leading to multiple names for the same gene. In other grasses that do not have such a genetic heritage, such as sugarcane (Saccharum spp.), genes are characterized primarily by EST accession numbers, often with multiple nonoverlapping sequences corresponding to one gene. Gene nomenclature is governed by species-specific committees in many cases (e.g. maize, rice, sorghum [Sorghum bicolor]), but such committees do not exist for all species with significant sequence data. In general, names of proteins follow the names of their corresponding genes.
In Arabidopsis (Arabidopsis thaliana), shortly after the completion of the genome sequence, criteria were developed to provide unique names to all transcription factors, often in a family-by-family strategy (e.g. Stracke et al., 2001; Jakoby et al., 2002; Bailey et al., 2003). These identifiers are of the form AtXXXyyy, where At provides the species identifier (At = Arabidopsis), important when describing transcription factors from multiple plants; XXX corresponds to a two to five or more letter code for the particular transcription factor family; and yyy corresponds to an arbitrary number between one and the total number of members in that particular family. These nomenclature conventions were rapidly embraced by the community, facilitating communication in publications and allowing the development of databases that compile information on Arabidopsis regulatory proteins (Davuluri et al., 2003; Guo et al., 2005).
Here, we propose adoption of similar synonyms for proteins corresponding to transcription factors across the grasses, with a goal of having a uniform naming system as already developed for Arabidopsis. Such names are not meant to replace the often-familiar names for many proteins, but rather to provide synonyms that can link information about all members of the gene family. Briefly, each transcription factor will be identified by a two-letter code corresponding to the species (e.g. Zm for Z. mays, Sb for S. bicolor, and Os for O. sativa), followed by the transcription factor family name and by a number that represents its position within the family. The two-letter species code should suffice for now to unequivocally describe the grasses for which transcription factors have been identified, but clearly in the future it will need to be expanded to three or four letters, as the number of species being included in studies increases. Alternatively, the two-letter code reflecting the species name could be used for organisms whose genomes are currently under study, and if needed, a new two-letter abbreviation, which is not necessarily consistent with the genus and specific epithet, will be made up for new ones. Following this criterion, sugarcane will be identified by the Sc letter code, as agreed by the respective community.
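The composition rule described above (species code, then family name, then number) can be illustrated with a minimal Python sketch. The function name `make_tf_name` and the use of common names as dictionary keys are hypothetical choices made for illustration only; the four species codes are the ones given in the text.

```python
# Species codes stated in the text. Using common names as keys is an
# illustrative assumption, not part of the recommendation.
SPECIES_CODES = {"maize": "Zm", "sorghum": "Sb", "rice": "Os", "sugarcane": "Sc"}


def make_tf_name(species, family, number):
    """Compose <species code><family><number>, e.g. ZmMYB38 (hypothetical helper)."""
    if species not in SPECIES_CODES:
        raise KeyError(f"no species code registered for {species!r}")
    if not isinstance(number, int) or number < 1:
        raise ValueError("family member number must be a positive integer")
    # The family code is used exactly as given, so mixed-case codes are preserved.
    return f"{SPECIES_CODES[species]}{family}{number}"


print(make_tf_name("maize", "MYB", 38))
print(make_tf_name("maize", "HD", 1))
```

Keeping the species codes in a single table means that a later move to three- or four-letter codes, as anticipated above, would only require changing the table.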
Numbers will be assigned arbitrarily, and whenever possible, the numbers should provide a historic perspective of the order in which transcription factors have been first identified. For example, since maize KN1 and C1 correspond to the founding members of their respective families (HD and MYB, respectively), they are assigned the number 1. When a transcription factor has already been numbered, every possible effort should be made to consider that number as part of the new name, e.g. maize Zm38 (Franken et al., 1994) should become ZmMYB38. Since it is realized that many transcription factors are known by their genetic names, this nomenclature will permit the use of various different synonyms. For example, KNOTTED1, which would be ZmHD1, where HD corresponds to the homeodomain family, could also be identified as ZmHD1(KN1) when there is a need to highlight the genetic locus, KN1, for which the protein is also often known. Similarly, C1, ZmMYB1, could also be identified as ZmMYB1(C1). Where the same name has already been used for different family members, an alternative name(s) will be assigned to avoid further confusion in the literature. For example, ZmMYB8 (Fornale et al., 2006), which is supported by a complete cDNA, will remain ZmMYB8, whereas the partial ESTs MYB8 (Jiang et al., 2004) and ZmMYB-IP20 (Rabinowicz et al., 1999), which both correspond to the same protein, will be assigned a new name (e.g. ZmMYB67).
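As a rough illustration of how such synonyms could be kept linked, the following Python sketch builds a lookup from any known name to its systematic name and formats the combined form with the genetic locus in parentheses. The names `build_index` and `display_name` are hypothetical, and the table contains only the examples mentioned in the text; it is not a description of how any database stores these data.

```python
# Systematic name -> familiar names, taken from the examples in the text.
SYNONYMS = {
    "ZmHD1": {"KN1", "KNOTTED1"},
    "ZmMYB1": {"C1"},
    "ZmMYB38": {"Zm38"},
}


def build_index(synonyms):
    """Map every name (systematic or familiar) to its systematic name."""
    index = {}
    for systematic, aliases in synonyms.items():
        for alias in {systematic, *aliases}:
            # Refuse a name that is already claimed by another protein.
            if alias in index and index[alias] != systematic:
                raise ValueError(f"{alias!r} is used for more than one protein")
            index[alias] = systematic
    return index


def display_name(systematic, locus=None):
    """Return e.g. ZmHD1(KN1) when the genetic locus should be highlighted."""
    return f"{systematic}({locus})" if locus else systematic


index = build_index(SYNONYMS)
print(index["KNOTTED1"])
print(display_name("ZmMYB1", "C1"))
```

Raising an error on a duplicate name is one possible way to flag cases, such as the MYB8 example, in which an alternative name has to be assigned.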
While it would be attractive for this new naming system to provide information about orthologous pairs across species, as for example ZmMYB32 being the closest relative to OsMYB32, this was considered impractical and perhaps misleading for several reasons. First, the genomes are not yet completely sequenced or fully annotated, hence new transcription factors are likely to be identified in the future, significantly affecting the reconstruction of protein phylogenies. Second, different tree-building methods are likely to yield slightly different results, which would create significant confusion. Third, while it has been tempting to assume that high similarity is likely to correspond to the control of similar cellular processes, this has often not been the case, particularly when considering regulators of metabolic pathways (Grotewold, 2008).
We propose that GRASSIUS (www.grassius.org) will serve as an initial centralized clearinghouse for transcription factor synonyms for the grasses, starting with maize, rice, sorghum, and sugarcane, following the criteria outlined above. GRASSIUS will provide a source of cross-reference between the new names, synonyms, National Center for Biotechnology Information accession codes for ESTs and cDNAs, and unique gene identifiers, as they become available. This will be achieved dynamically, ensuring that immediately after a new synonym has been given to a transcription factor, it will be reflected in GRASSIUS. The community is invited to comment on these assignments for a defined period of time (e.g. until the end of 2009), at which time the names will become official. The community will be presented with these guidelines and will be provided with ample opportunity to become aware of and discuss these recommendations at conferences and meetings for the corresponding organisms. In addition, we will work with the various model organism and clade-oriented databases (e.g. MaizeGDB, BrachyBase, and Gramene) to ensure that the proper nomenclature is represented in these community resources as well. These databases will also serve as optimal clearinghouses for the respective organisms if GRASSIUS ceases to serve this purpose. Conflicts and issues that may arise with respect to how to name a particular transcription factor (or family of transcription factors) will be opened for discussion to experts in the field. For example, if there is a disagreement on how to name a new family of transcription factors, scientists working with those specific transcription factors will be invited to comment.
SOME PARTICULAR CASES
Multiple Proteins from One Gene
Multiple transcripts derived from one gene are frequently present in plants, with more than 21% of the rice and Arabidopsis genes being alternatively spliced (Wang and Brendel, 2006). Several examples of transcription factors displaying alternative splice variants have also been described in maize (e.g. Grotewold et al., 1991; Burr et al., 1996). We recommend that transcription factor proteins derived from alternatively spliced mRNAs be named with the .1, .2, .3 suffixes after the number of the protein. For example, the new synonyms corresponding to the two proteins derived from the alternatively spliced variants of maize PERICARP COLOR1 (P1) would be ZmMYB3.1 and ZmMYB3.2. In those instances, as well as in cases when multiple gene models exist for a particular transcription factor gene, the suffixes in the protein will match those in the gene models. For instance, if the rice LOC_Os02g36880 gene shows four different gene models, from .1 to .4, then OsNAC1.1 should match LOC_Os02g36880.1 and OsNAC1.4 should match LOC_Os02g36880.4.
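The suffix-matching rule can be sketched in a few lines of Python. The helper `isoform_name` is hypothetical; it assumes only that the gene model identifier ends in a dot followed by an integer, as in the LOC_Os02g36880 example above, and it copies that suffix onto the protein name.

```python
def isoform_name(protein_name, gene_model_id):
    """Append the gene model's numeric suffix to the protein name (hypothetical helper)."""
    locus, dot, suffix = gene_model_id.rpartition(".")
    # Require the form <locus>.<integer>; anything else is rejected.
    if not dot or not locus or not suffix.isdigit():
        raise ValueError(f"{gene_model_id!r} has no numeric gene model suffix")
    return f"{protein_name}.{suffix}"


for model in ("LOC_Os02g36880.1", "LOC_Os02g36880.4"):
    print(model, "->", isoform_name("OsNAC1", model))
```

Deriving the protein suffix from the gene model, rather than numbering isoforms independently, keeps the two identifiers in step if further gene models are added later.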
The sequencing of multiple inbred lines/subspecies displaying significant natural variation makes it necessary to incorporate an option to represent from which allele a particular transcription factor protein sequence is derived. We propose that whenever necessary, a superscript is added. This superscript could represent the source of the allele, when known. For example, the P1 protein obtained from the W22 maize inbred could be represented as ZmMYB3.1 W22, and that from B73 as ZmMYB3.1 B73. When formatting issues prevent the use of superscripts, it would also be acceptable to use ^B73 to represent the allele. In that case, ZmMYB3.1 B73 and ZmMYB3.1^B73 would be equivalent. Such a criterion could also be used to indicate, whenever known, whether a transcription factor protein sequence in rice is derived from the sequenced japonica genome (cv Nipponbare) or the sequenced indica genome (cv 9311). Thus, the OsNAC6 factor, involved in biotic and abiotic stress response in rice (Ohnishi et al., 2005), could be OsNAC6 Nipp (or OsNAC6 9311) when intending to capture aspects of the protein that relate to variation. If the origin is not precisely known or a name/accession_ID is too cumbersome to be represented as a superscript, then numbers could be used to distinguish alleles (e.g. OsNAC6 1, OsNAC6 2, etc.), with cross-references to inbred/accession names maintained within species databases. Of course, within a species, a single transcription factor name (e.g. OsNAC6) will correspond to the products of corresponding gene models.
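For text-only settings, the caret form lends itself to simple parsing. The Python sketch below splits a name such as ZmMYB3.1^B73 into its parts. The regular expression is an assumption made for illustration: it expects a two-letter species code, a family code made of letters only, a number, an optional .N splice suffix, and an optional ^allele tag. Family codes that themselves contain digits would need a different pattern.

```python
import re

# Illustrative pattern, not part of the recommendation itself.
TF_NAME = re.compile(
    r"^(?P<species>[A-Z][a-z])"          # two-letter species code, e.g. Zm
    r"(?P<family>[A-Za-z]+)"             # family code, letters only (assumption)
    r"(?P<number>\d+)"                   # position within the family
    r"(?:\.(?P<isoform>\d+))?"           # optional splice variant suffix
    r"(?:\^(?P<allele>[A-Za-z0-9]+))?$"  # optional allele tag in caret form
)


def parse_tf_name(name):
    """Split a text-form transcription factor name into its components."""
    match = TF_NAME.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow the expected pattern")
    parts = match.groupdict()
    parts["number"] = int(parts["number"])
    if parts["isoform"] is not None:
        parts["isoform"] = int(parts["isoform"])
    return parts


print(parse_tf_name("ZmMYB3.1^B73"))
print(parse_tf_name("OsNAC6^9311"))
```

Absent components are returned as None, so a plain name such as OsNAC6 parses with both the isoform and the allele unset.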
Products from Tandem Gene Arrays
In some instances, for example the various alleles of the maize p1 gene (Chopra et al., 1998), very similar (but not necessarily identical) proteins are encoded by individual members of a multigene array. In those instances, we recommend using letters (a, b, c) to indicate the proteins that come from each copy. For example, if three different copies of the p1 gene from B73 were shown to encode slightly different proteins, then those products would be identified as ZmMYB3 B73 a, ZmMYB3 B73 b, and ZmMYB3 B73 c.
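A short Python sketch of the lettering scheme follows. The function `tandem_copy_names` is hypothetical. The text does not specify a plain-text rendering for copy letters, so writing the allele in caret form with the copy letter appended directly after it is an assumption made here purely for illustration.

```python
from string import ascii_lowercase


def tandem_copy_names(base_name, allele, n_copies):
    """Name the products of each copy in a tandem array (hypothetical helper).

    Assumption: the allele is written in caret form and the copy letter
    is appended directly after it.
    """
    if not 1 <= n_copies <= len(ascii_lowercase):
        raise ValueError("n_copies must be between 1 and 26")
    return [f"{base_name}^{allele}{letter}" for letter in ascii_lowercase[:n_copies]]


print(tandem_copy_names("ZmMYB3", "B73", 3))
```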
GENE AND PROTEIN NOMENCLATURE
The guidelines described here for naming transcription factors are expected to apply solely to proteins (or predicted open reading frames) and not necessarily to genes. Indeed, as the sequencing of various genomes progresses, nomenclature committees have been established to address the issue of how to name genes and gene products. It is therefore of paramount importance that the guidelines described here are in line with those being developed by the corresponding committees. Toward this objective, the criteria described here have already been discussed and accepted for the maize transcription factors by the Maize Genetics Nomenclature Committee (http://www.maizegdb.org/maize_nomenclature.php), by the Sugarcane Nomenclature Committee, and by the International Brachypodium Initiative (http://www.brachypodium.org/). The corresponding nomenclature committees will make these guidelines available to the respective communities.
LITERATURE CITED
Bailey PC, Martin C, Toledo-Ortiz G, Quail PH, Huq E, Heim MA, Jakoby M, Werber M, Weisshaar B (2003) Update on the basic helix-loop-helix transcription factor gene family in Arabidopsis thaliana. Plant Cell 15: 2497–2502
Burr FA, Burr B, Scheffler BE, Blewitt M, Wienand U, Matz EC (1996) The maize repressor-like gene intensifier1 shares homology with the r1/b1 multigene family of transcription factors and exhibits missplicing. Plant Cell 8: 1249–1259
Chopra S, Athma P, Li XG, Peterson T (1998) A maize Myb homolog is encoded by a multicopy gene complex. Mol Gen Genet 260: 372–380
Davuluri RV, Sun H, Palaniswamy SK, Matthews N, Molina C, Kurtz M, Grotewold E (2003) AGRIS: Arabidopsis gene regulatory information server, an information resource of Arabidopsis cis-regulatory elements and transcription factors. BMC Bioinformatics 4: 25
Fornale S, Sonbol FM, Maes T, Capellades M, Puigdomenech P, Rigau J, Caparros-Ruiz D (2006) Down-regulation of the maize and Arabidopsis thaliana caffeic acid O-methyl-transferase genes by two new maize R2R3-MYB transcription factors. Plant Mol Biol 62: 809–823
Franken P, Schrell S, Peterson PA, Saedler H, Wienand U (1994) Molecular analysis of protein domain function encoded by the myb-homologous maize genes C1, Zm 1 and Zm 38. Plant J 6: 21–30
Grotewold E (2008) Transcription factors for predictive plant metabolic engineering: Are we there yet? Curr Opin Biotechnol 19: 138–144
Grotewold E, Athma P, Peterson T (1991) Alternatively spliced products of the maize P gene encode proteins with homology to the DNA-binding domain of myb-like transcription factors. Proc Natl Acad Sci USA 88: 4587–4591
Guo A, He K, Liu D, Bai S, Gu X, Wei L, Luo J (2005) DATF: a database of Arabidopsis transcription factors. Bioinformatics 21: 2568–2569
Jakoby M, Weisshaar B, Droge-Laser W, Vicente-Carbajosa J, Tiedemann J, Kroj T, Parcy F (2002) bZIP transcription factors in Arabidopsis. Trends Plant Sci 7: 106–111
Jiang C, Gu J, Chopra S, Gu X, Peterson T (2004) Ordered origin of the typical two- and three-repeat Myb genes. Gene 326: 13–22
Ohnishi T, Sugahara S, Yamada T, Kikuchi K, Yoshiba Y, Hirano HY, Tsutsumi N (2005) OsNAC6, a member of the NAC gene family, is induced by various stresses in rice. Genes Genet Syst 80: 135–139
Rabinowicz PD, Braun EL, Wolfe AD, Bowen B, Grotewold E (1999) Maize R2R3 Myb genes: Sequence analysis reveals amplification in higher plants. Genetics 153: 427–444
Stracke R, Werber M, Weisshaar B (2001) The R2R3 MYB gene family in Arabidopsis thaliana. Curr Opin Plant Biol 4: 447–456
Wang BB, Brendel V (2006) Genomewide comparative analysis of alternative splicing in plants. Proc Natl Acad Sci USA 103: 7175–7180