SGD Help: Gene Nomenclature Conventions
This page provides information on genetic and systematic nomenclature for S. cerevisiae genes and chromosomal features.
Contents
- Gene Name Assignment
- Gene Name Format
- Systematic Name Assignment
- Systematic Nomenclature Conventions and Formats
- Correlation between Gene Names and Systematic Names
Gene Name Assignment
Gene names, also referred to as genetic names (for example, COX2 or CDC28), are conferred upon genes by researchers on the basis of genetic, biochemical, or molecular characterization. Most genes with gene names are ORFs, but tRNAs and other non-protein-coding RNAs have also received gene names. In addition, there are named genes in SGD that have not yet been mapped to a physical location on the chromosome. Gene names are optional, and chromosomal features that are completely uncharacterized generally do not have gene names, only systematic names (see below).
The official name of an S. cerevisiae gene is referred to as the Standard Name on an SGD locus page, and generally becomes the standard name based on its publication in a peer-reviewed paper describing characterization of that gene. A gene name may also be reserved for a locus when publication of the name is upcoming, and is called a Reserved Name. A Reserved Name, if it remains unique and is the first published name, becomes a Standard Name upon its publication. In cases where it is not clear what name should be the standard name, the Standard Name is determined by an amalgam of 1) consensus of the research community, 2) literature usage, 3) clarity relative to function, and 4) priority in the literature. Any alternative Gene Name is referred to as an Alias.
When naming a gene, the full text of the Gene Naming Guidelines for Saccharomyces cerevisiae should be consulted. An explanation of the conventions for Saccharomyces cerevisiae nomenclature was published in the Trends in Genetics gene nomenclature guide, and the conventions are also detailed below.
Gene Name Format
The accepted format for gene names in S. cerevisiae consists of three uppercase letters followed by a number. Generally, the letters signify a phrase (referred to as the "Name Description" in SGD) that provides information about a function, mutant phenotype, or process related to that gene, for example "ADE" for "ADEnine biosynthesis" or "CDC" for "Cell Division Cycle". Gene names follow this basic format regardless of the type of feature named, whether an ORF, a tRNA, another type of non-coding RNA, an ARS, or a genetic locus.
Some S. cerevisiae gene names that pre-date the current nomenclature standards do not conform to this format: for example, RPL1A and RPL1B, or OM45. Although non-standard historical names such as these are maintained in SGD, any new names for yeast genes must conform to the standard format.
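The format rule above can be checked mechanically. The following is a minimal sketch, not an SGD tool: the function name is_standard_gene_name is hypothetical, and the pattern assumes that the number has no leading zero and that nothing follows it.

```python
import re

# Standard format from the text: three uppercase letters followed by a number.
# Assumption (not stated in the text): no leading zero, no trailing characters.
STANDARD_GENE_NAME = re.compile(r"[A-Z]{3}[1-9][0-9]*")


def is_standard_gene_name(name: str) -> bool:
    """Return True if `name` is three uppercase letters plus a number."""
    return STANDARD_GENE_NAME.fullmatch(name) is not None


# COX2 and CDC28 follow the standard format; the historical names
# RPL1A and OM45 do not.
for candidate in ["COX2", "CDC28", "RPL1A", "OM45"]:
    print(candidate, is_standard_gene_name(candidate))
```

A check like this only tests the format. It says nothing about whether a name is unique or already reserved.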
Systematic Name Assignment
The Systematic Name is the name generated by the systematic sequencing project, or conferred later according to the appropriate guidelines for systematic nomenclature for that type of feature or gene. Every gene or feature annotated on the genomic sequence receives a unique systematic name, whether or not it has a genetic name.
There are guidelines for designating a Systematic Name for a new feature, i.e. one not originally named by the systematic sequencing project. The specifics (detailed below) depend on the type of feature, e.g. ORF or tRNA. If you have a newly discovered feature, please contact SGD in order to have the proper systematic name assigned.
Systematic Nomenclature Conventions and Formats
Open Reading Frames
Systematic names for nuclear-encoded ORFs begin with the letter 'Y' (for 'Yeast'); the second letter denotes the chromosome number ('A' is chr I, 'B' is chr II, etc.); the third letter is either 'L' or 'R' for left or right chromosome arm; next is a three digit number indicating the order of the ORFs on that arm of a chromosome starting from the centromere, irrespective of strand; finally, there is an additional letter indicating the strand, either 'W' for Watson (the strand with 5' end at the left telomere) or 'C' for Crick (the complement strand, 5' end is at the right telomere).
Examples:
YAL001C | first ORF to the left of the centromere on chromosome I (A is the 1st letter of the English alphabet), on the complement or Crick strand |
YGR116W | 116th ORF right of the centromere on chromosome VII (G is the 7th letter of the English alphabet), on the Watson strand |
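The structure of these names can be decoded with a short regular expression. This is a sketch under the rules stated above: the names NuclearOrf and parse_nuclear_orf are hypothetical, and the pattern covers only names from the original annotation, without the -[letter] suffix used for later additions.

```python
import re
from typing import NamedTuple


class NuclearOrf(NamedTuple):
    chromosome: int  # 1-16, decoded from the letters A-P
    arm: str         # "L" or "R"
    order: int       # position counted outward from the centromere
    strand: str      # "W" (Watson) or "C" (Crick)


# 'Y', chromosome letter, arm letter, three-digit number, strand letter.
ORF_PATTERN = re.compile(r"Y([A-P])([LR])(\d{3})([WC])")


def parse_nuclear_orf(name: str) -> NuclearOrf:
    """Split a nuclear ORF systematic name into its components."""
    match = ORF_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a nuclear ORF systematic name: {name!r}")
    chrom_letter, arm, order, strand = match.groups()
    chromosome = ord(chrom_letter) - ord("A") + 1  # 'A' -> 1, 'B' -> 2, ...
    return NuclearOrf(chromosome, arm, int(order), strand)


print(parse_nuclear_orf("YAL001C"))
print(parse_nuclear_orf("YGR116W"))
```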
On an ongoing basis, any nuclear ORFs that are newly annotated receive a systematic name based on that of the centromere proximal ORF plus an additional letter to indicate the order between previously assigned ORFs. When multiple new open reading frames are identified between previously assigned ORFs, the letter designation assigned to each is based on the order in which they were discovered, and is independent of strand. The following steps are used to determine the correct systematic name.
- Researchers contact SGD with the coordinates of a new ORF.
- The base name of the new ORF is the same as the closest centromere proximal ORF. The correct base names for the example new ORFs are indicated in green below. Note that the closest centromere proximal ORF does not have to be on the same strand, although it can be. The new ORF may overlap an existing ORF. When this occurs, if any portion of the existing overlapping ORF is closer to the centromere than the new ORF, then the existing overlapping ORF is "centromere proximal" relative to the new ORF.
- The W/C suffix indicates the strandedness of the new ORF. The W/C suffix of the new ORF is independent of the strandedness of the centromere proximal ORF. The correct suffixes for the example new ORFs are indicated in green below.
- An additional suffix, -[letter], is appended to the name of the new ORF. This distinguishes the new ORF from ORFs named in the original annotation. The letters are assigned in alphabetical order, per base name, in order of discovery (see additional examples below). The correct suffixes for the example new ORFs are indicated in green. If several neighboring new ORFs are added simultaneously, then the -[letter] suffix is assigned in alphabetical order, from the centromere to the telomere. However, since neighboring new ORFs are not necessarily discovered simultaneously, the -[letter] suffix does not always indicate relative position.
Examples:
YAL034W-A | a new ORF on the Watson strand of the left arm of chromosome I, farther from the centromere than YAL034C |
YHR214C-E | a new ORF on the Crick strand of the right arm of chromosome VIII, farther from the centromere than YHR214W |
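The naming steps above can be sketched in code. This is an illustration only, since SGD assigns the actual names. The function name_new_orf is hypothetical, and it assumes that "per base name" means the letters are shared across both strands of one base name (for example YHR214), which is how the YHR214C-E example reads. The lists of existing names passed in below are illustrative inputs, not a statement of what is annotated.

```python
import re

# Base name (e.g. YHR214), strand letter, optional -[letter] suffix.
ORF_NAME = re.compile(r"(Y[A-P][LR]\d{3})([WC])(?:-([A-Z]))?")


def name_new_orf(proximal_orf: str, strand: str, existing_names: list) -> str:
    """Build a name for a new ORF from its centromere-proximal neighbor.

    proximal_orf   -- closest centromere-proximal ORF (any strand)
    strand         -- strand of the NEW ORF, "W" or "C"
    existing_names -- systematic names already in use
    """
    match = ORF_NAME.fullmatch(proximal_orf)
    if match is None:
        raise ValueError(f"not a nuclear ORF systematic name: {proximal_orf!r}")
    if strand not in ("W", "C"):
        raise ValueError("strand must be 'W' or 'C'")
    base = match.group(1)

    # Collect the -[letter] suffixes already used for this base name.
    used = []
    for name in existing_names:
        other = ORF_NAME.fullmatch(name)
        if other and other.group(1) == base and other.group(3):
            used.append(other.group(3))

    # Letters follow order of discovery: take the one after the highest used.
    next_letter = chr(ord(max(used)) + 1) if used else "A"
    if next_letter > "Z":
        raise ValueError(f"no letters left for base name {base}")
    return f"{base}{strand}-{next_letter}"


print(name_new_orf("YAL034C", "W", ["YAL034C"]))
print(name_new_orf(
    "YHR214W", "C",
    ["YHR214W", "YHR214W-A", "YHR214C-B", "YHR214C-C", "YHR214C-D"],
))
```

Finding the centromere-proximal ORF from coordinates, including the overlap rule, is deliberately left out of this sketch.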
In the rare event that a new ORF is discovered at the extreme end of a chromosome, the new ORF is given the next number in the sequence and does not require a -[letter] suffix. This is only applicable in cases where there are no existing ORFs between the new ORF and the end of the chromosome.
Systematic names for mitochondrially encoded ORFs start with the letter 'Q' to designate the mitochondrial chromosome; the rest consists of a four digit number. Examples are Q0010 and Q0032.
Systematic names for ORFs encoded in the 2-micron plasmid start with the letter 'R' to designate the 2-micron plasmid; the rest consists of a four digit number followed by the letter 'W' or 'C' for Watson and Crick. Examples are R0010W and R0020C.
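Mitochondrial and 2-micron ORF names are simple enough to tell apart by pattern. A minimal sketch, assuming the formats exactly as described above (the function name classify_non_nuclear_orf is hypothetical):

```python
import re
from typing import Optional

MITO_ORF = re.compile(r"Q\d{4}")         # 'Q' plus a four-digit number
PLASMID_ORF = re.compile(r"R\d{4}[WC]")  # 'R', four digits, strand letter


def classify_non_nuclear_orf(name: str) -> Optional[str]:
    """Return "mitochondrial", "2-micron", or None if neither format fits."""
    if MITO_ORF.fullmatch(name):
        return "mitochondrial"
    if PLASMID_ORF.fullmatch(name):
        return "2-micron"
    return None


for candidate in ["Q0010", "Q0032", "R0010W", "R0020C", "YAL001C"]:
    print(candidate, classify_non_nuclear_orf(candidate))
```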
RNA-Coding Genes
Systematic names of nuclear-encoded tRNA genes begin with a lowercase 't'; the second letter corresponds to the single letter code for the appropriate amino acid, e.g., A = alanine, C = cysteine, etc.; next the sequence of the anticodon of the tRNA is given in the 5' -> 3' direction within parentheses, e.g., (AGC) or (GUC); finally, there is an indication of which chromosome the tRNA gene resides on using the letters 'A' through 'P' to designate nuclear chromosomes (in the same way as for nuclear-encoded ORFs). If a given nuclear chromosome contains more than one copy of a tRNA gene, individual copies of the same tRNA family (those of identical sequence, including the anticodon sequence) are distinguished from each other by the addition of a single number, starting with '1', after the letter designating the chromosome.
Examples:
tC(GCA)B | a tRNA for cysteine, with the anticodon sequence 'GCA', located on chromosome II |
tS(AGA)D1 | a tRNA for serine, with the anticodon sequence 'AGA', one of two or more tRNAs from this family (containing the AGA anticodon) located on chromosome IV |
Mitochondrially-encoded tRNAs are named the same way as nuclear-encoded tRNAs, using the letter 'Q' to designate the mitochondrial chromosome, except that the presence of a number indicates that two or more tRNAs encode the same amino acid, though they do not necessarily contain the same anticodon sequence.
Examples:
tR(UCU)Q1 | a tRNA for arginine, with the anticodon sequence (UCU), one of two or more tRNAs for arginine on the mitochondrial chromosome |
tR(ACG)Q2 | a tRNA for arginine, with the anticodon sequence (ACG), one of two or more tRNAs for arginine on the mitochondrial chromosome |
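The tRNA naming scheme can be decoded in the same way for nuclear and mitochondrial genes. The sketch below is hypothetical (TrnaName and parse_trna_name are not SGD identifiers). It accepts any uppercase letter as the amino acid code and treats the trailing number as optional. Note that the number means different things in the two cases, as described above.

```python
import re
from typing import NamedTuple, Optional


class TrnaName(NamedTuple):
    amino_acid: str      # single-letter amino acid code
    anticodon: str       # written 5' -> 3'
    location: str        # 'A'-'P' nuclear chromosome letter, 'Q' mitochondrial
    number: Optional[int]


# 't', amino acid letter, (anticodon), location letter, optional number.
TRNA_PATTERN = re.compile(r"t([A-Z])\(([ACGU]{3})\)([A-Q])(\d+)?")


def parse_trna_name(name: str) -> TrnaName:
    """Split a tRNA systematic name into its components."""
    match = TRNA_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a tRNA systematic name: {name!r}")
    amino_acid, anticodon, location, number = match.groups()
    return TrnaName(amino_acid, anticodon, location,
                    int(number) if number is not None else None)


for example in ["tC(GCA)B", "tS(AGA)D1", "tR(UCU)Q1", "tR(ACG)Q2"]:
    print(parse_trna_name(example))
```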
The systematic name of a small nuclear RNA (snRNA) or small nucleolar RNA (snoRNA) starts with the lowercase letters 'sn'; next is a capital 'R'; this is followed by a number. The number is unique, but does not convey any positional information. Frequently, the Gene Name of an snRNA or snoRNA is the same as the Systematic Name, but in all caps, e.g. 'SNR'. Different copies of duplicated genes may be indicated by adding a letter, e.g. 'A' or 'B', to the end of the name.
Examples:
snR6 | a snRNA, produces the U6 spliceosomal RNA |
snR17a | a snoRNA, one of two copies of snoRNA U3 |
snR17b | a snoRNA, one of two copies of snoRNA U3 |
Note: SNR7 is an exception in that its transcript is alternatively processed, yielding two products: SNR7-S (short form) and SNR7-L (long form).
The systematic names and gene names of loci representing the nuclear encoded rRNA genes are identical to each other. The "loci" representing the rDNA repeats, the rRNA transcripts, and the mature rRNAs are named with the three letter acronym 'RDN' for Ribosomal DNA. While S. cerevisiae contains multiple repeats of the ribosomal DNA (rDNA), only two rDNA repeats were sequenced as part of the systematic sequencing project.
Examples:
RDN1 | the entire 1-2 Mb rDNA region on Chromosome XII, consisting of 100-200 tandem copies of a 9.1 kb repeat which contains the genes for 5S, 5.8S, 25S and 18S rRNAs |
RDN18-1 | represents a specific copy of a region which encodes an 18S ribosomal RNA |
RDN37-2 | represents a specific copy of a region which encodes a primary rRNA transcript which is processed into the 25S, 18S and 5.8S rRNAs |
A more complete explanation of the representation and naming of the rDNA repeats and the rRNAs within them is available on the RDN1 locus page, which represents the entire rDNA region on Chromosome XII.
Other Features
Autonomously Replicating Sequences (ARS) are named with the three letters ARS followed by a number. ARS features added after October 2000 are named systematically using the three letters ARS followed by one or two digits to represent the chromosome, e.g. chromosome I = 1, chromosome II = 2, chromosome X = 10. This is followed by an additional whole number to designate the particular ARS on that chromosome in the order named, starting with the digits '01'. Note that the number merely indicates the order in which the ARS elements were reported and named, and does not necessarily denote any location information relative to other ARS features. Note also that decimal points are NOT used. Some "historical" ARS features were given Gene Names prior to the establishment of this systematic naming system, e.g. ARS1, ARS2, ARS120. In these cases, an ARS-based Gene Name does not indicate the chromosomal location. Examples: ARS1; ARS301.
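As an illustration of the systematic ARS format, the sketch below builds a name from a chromosome number and the order in which the ARS was named on that chromosome. The function systematic_ars_name is hypothetical, and padding the order to at least two digits is an assumption based on the numbering starting with '01'.

```python
def systematic_ars_name(chromosome: int, order: int) -> str:
    """Build a systematic ARS name, e.g. chromosome 3, first ARS -> ARS301."""
    if not 1 <= chromosome <= 16:
        raise ValueError("chromosome must be between 1 and 16")
    if order < 1:
        raise ValueError("order must be 1 or greater")
    # Chromosome as one or two digits, then the order, with no decimal point.
    return f"ARS{chromosome}{order:02d}"


print(systematic_ars_name(3, 1))
print(systematic_ars_name(10, 1))
```

Only the building direction is shown. The reverse direction is not reliable from the name alone, because a historical Gene Name such as ARS120 has the same shape as a systematic name but does not encode a chromosome.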
Centromeres are named with the three letters 'CEN' followed by one or two digits to represent the chromosome. Examples: CEN1; CEN2.
The systematic name of a full length Ty element starts with a 'Y'; the second letter corresponds to the chromosome number (given in Roman numerals, e.g., chr I is 'A', chr VIII is 'H'); the third letter is either 'L' or 'R' for left or right of the centromere; the fourth letter is either 'W' for Watson (the strand with 5' end at the left telomere) or 'C' for Crick (the complement strand, 5' end is at the right telomere); next are the letters 'Ty' followed by a number, 1-5, to indicate the type of Ty element. The first Ty element of a given type is indicated with -1; additional full length Ty elements of the same type on the same chromosome are given a number incremented by one from the previous one.
Examples:
YARCTy1-1 | a Ty element of type 1 on the right arm of Chromosome I, on the Crick strand |
YCLWTy5-1 | a Ty element of type 5 on the left arm of Chromosome III, on the Watson strand |
YDRCTy1-1 | a Ty element of type 1 on the right arm of Chromosome IV, on the Crick strand |
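A full length Ty element name can be taken apart in the same spirit. This is a sketch of the format described above, with hypothetical names (TyElement, parse_ty_name):

```python
import re
from typing import NamedTuple


class TyElement(NamedTuple):
    chromosome: int  # 1-16, decoded from the letters A-P
    arm: str         # "L" or "R"
    strand: str      # "W" or "C"
    ty_type: int     # 1-5
    copy: int        # 1 for the first element of this type on the chromosome


# 'Y', chromosome letter, arm, strand, 'Ty' + type, '-' + copy number.
TY_PATTERN = re.compile(r"Y([A-P])([LR])([WC])Ty([1-5])-(\d+)")


def parse_ty_name(name: str) -> TyElement:
    """Split a full length Ty element systematic name into its components."""
    match = TY_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a Ty element systematic name: {name!r}")
    chrom_letter, arm, strand, ty_type, copy = match.groups()
    chromosome = ord(chrom_letter) - ord("A") + 1
    return TyElement(chromosome, arm, strand, int(ty_type), int(copy))


for example in ["YARCTy1-1", "YCLWTy5-1", "YDRCTy1-1"]:
    print(parse_ty_name(example))
```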
The systematic name of a Ty Long Terminal Repeat (LTR) element starts with a 'Y'; the second letter corresponds to the chromosome number (given in Roman numerals, e.g., chr I is 'A', chr VIII is 'H'); the third letter is either 'L' or 'R' for left or right of the centromere; the fourth letter is either 'W' for Watson (the strand with 5' end at the left telomere) or 'C' for Crick (the complement strand, 5' end is at the right telomere); next is a word for a Greek letter indicating the type of LTR element, e.g. 'delta', 'sigma', 'tau', 'omega'. The first Ty LTR element of a given type is given the number '1'; additional Ty LTR elements of the same type on the same chromosome are given a number incremented by one from the previous one.
Examples:
YARCdelta8 | a Ty LTR of the delta type on Chromosome I |
YARWsigma1 | a Ty LTR of the sigma type on Chromosome I |
YBLCtau1 | a Ty LTR of the tau type on Chromosome II |
YCLWomega1 | a Ty LTR of the omega type on Chromosome III |
Note: there are four systematic names (YCLWdelta2a, YCLWdelta2b, YDRCdelta6a, and YDRCdelta6b) that do not conform to the nomenclature rules. Please contact SGD if you need to use this nomenclature.
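The LTR rules, and the way the four exceptions fall outside them, can be shown with one pattern. This is a sketch only: the function parse_ltr_name is hypothetical, and the list of LTR types is limited to the four given as examples in the text.

```python
import re
from typing import Optional, Tuple

# 'Y', chromosome letter, arm, strand, LTR type word, number.
LTR_PATTERN = re.compile(r"Y([A-P])([LR])([WC])(delta|sigma|tau|omega)(\d+)")


def parse_ltr_name(name: str) -> Optional[Tuple[int, str, str, str, int]]:
    """Return (chromosome, arm, strand, ltr_type, number), or None if the
    name does not follow the LTR nomenclature rules."""
    match = LTR_PATTERN.fullmatch(name)
    if match is None:
        return None
    chrom_letter, arm, strand, ltr_type, number = match.groups()
    chromosome = ord(chrom_letter) - ord("A") + 1
    return (chromosome, arm, strand, ltr_type, int(number))


print(parse_ltr_name("YARCdelta8"))
print(parse_ltr_name("YCLWomega1"))
# The trailing 'a' breaks the pattern, so this exception returns None.
print(parse_ltr_name("YCLWdelta2a"))
```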
SGD currently annotates several different types of features at the ends of chromosomes, listed below. When there are multiple examples of a type of telomeric element at a single chromosome end (e.g. more than one Telomeric Repeat), the elements are numbered after the suffix, with number 1 being the closest to the end of the chromosome.
Telomeric Region: "TEL" followed by a two digit number indicating the chromosome number, then "L" or "R" to indicate the left or right arm of the chromosome. Example: TEL08L
X element Combinatorial Repeats: The same base name used for the Telomeric Region feature, appended with a suffix of "-XR". Example: TEL08R-XR
X element Core sequence: The same base name used for the Telomeric Region feature, appended with a suffix of "-XC". Example: TEL08L-XC
Y' element: The same base name used for the Telomeric Region feature, appended with a suffix of "-YP". Example: TEL12L-YP1
Telomeric Repeat: The same base name used for the Telomeric Region feature, appended with a suffix of "-TR". Example: TEL08R-TR1
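The telomeric names listed above share one base name plus an optional suffix, so a single pattern covers all of them. A minimal sketch with hypothetical names (TelomericFeature, parse_telomeric_name), assuming chromosome numbers 01 to 16:

```python
import re
from typing import NamedTuple, Optional


class TelomericFeature(NamedTuple):
    chromosome: int          # 1-16, from the two-digit number
    arm: str                 # "L" or "R"
    element: Optional[str]   # "XR", "XC", "YP", "TR", or None for the region
    number: Optional[int]    # 1 is the closest to the end of the chromosome


# 'TEL', two-digit chromosome, arm, optional '-' + element code + number.
TEL_PATTERN = re.compile(r"TEL(\d{2})([LR])(?:-(XR|XC|YP|TR)(\d+)?)?")


def parse_telomeric_name(name: str) -> TelomericFeature:
    """Split a telomeric feature name into its components."""
    match = TEL_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a telomeric feature name: {name!r}")
    chromosome = int(match.group(1))
    if not 1 <= chromosome <= 16:
        raise ValueError(f"chromosome out of range in {name!r}")
    number = match.group(4)
    return TelomericFeature(chromosome, match.group(2), match.group(3),
                            int(number) if number is not None else None)


for example in ["TEL08L", "TEL08R-XR", "TEL08L-XC", "TEL12L-YP1", "TEL08R-TR1"]:
    print(parse_telomeric_name(example))
```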
Correlation between Gene Names and Systematic Names
While all ORFs identified in the genome sequence have a Systematic Name, e.g. YAL001C, YGR116W, YAL034W-A, or Q0010, many ORFs have not been given a Gene Name, e.g. a name such as COX2 or CDC28. In addition, Gene Names have been conferred on non-ORF features such as tRNAs, other non-coding RNAs, and on genetic loci which have not yet been mapped to a specific position on a chromosome. In this last case, because the chromosomal location is not known, there will not be a systematic name associated with the Gene Name.
An ORF, or other chromosomal feature, with a systematic name may have been associated with more than one common usage name, or Gene Name. Only one of these will be designated as the Standard Name; any other associated name is referred to as an Alias.