CC-CEDICT V2 Syntax
TODO: work in progress!
Version 2 (v2) introduces a new syntax for the pinyin of an entry, allowing for the specification of pinyin that follows standard pinyin orthography. In particular, it enables the combination of syllables to form words. For example, in v2, 二次方程 (quadratic equation) can now be written as two words, “er4ci4 fang1cheng2” (i.e., èrcì fāngchéng), rather than as four separate syllables, “er4 ci4 fang1 cheng2”, as was required in v1.
Below are guidelines on what CC-CEDICT entries should look like. CC-CEDICT still has many old entries that do not comply with these rules yet.
In particular:
- Prior to April 2022, glosses and senses were separated using a /. As of April 2022, senses are to be separated with a / while glosses are to be separated with a ;. (This was a change in v1 format of definitions, but its progressive introduction largely coincides with the conversion of pinyin to v2 format.)
- In December 2023, CC-CEDICT began adopting the v2 pinyin format.
An entry is considered to be in v2 format if it uses double square brackets for the pinyin.
[[pin1yin1]] rather than [pin1 yin1]
However, when updating the pinyin of an entry, the rest of the entry should also be reviewed. If this is done, it means that v2 pinyin format signifies not only that the pinyin format has been updated, but also that the definition has been checked for correctness and proper format: it's a way of keeping track of which entries have old definitions that need to be reviewed.
As of August 2024, roughly 10% of entries have been converted to v2 by editors.
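The double-bracket rule above can be checked mechanically. The following is a minimal sketch, not part of any official tooling; the function name `entry_format` and the regular expressions are assumptions made for illustration, and they assume each entry occupies one line in the basic "Traditional Simplified [pinyin] /definition/" layout.

```python
import re

# Assumed line shapes (illustrative, not an official grammar):
#   Traditional Simplified [[pinyin]] /definition/   (v2)
#   Traditional Simplified [pinyin] /definition/     (v1)
V2_LINE = re.compile(r"^\S+ \S+ \[\[[^\[\]]+\]\] /")
V1_LINE = re.compile(r"^\S+ \S+ \[[^\[\]]+\] /")


def entry_format(line: str) -> str:
    """Return "v2", "v1", or "unknown" for one dictionary line."""
    if V2_LINE.match(line):
        return "v2"
    if V1_LINE.match(line):
        return "v1"
    return "unknown"
```

A tool that reads the Mixed edition could use a check like this to decide, entry by entry, which pinyin conventions apply.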
Three editions of CC-CEDICT are published regularly:
- Version 1, in which any v2 entries are converted back to v1 by a script.
- Mixed, in which entries are v2 if they have been converted by an editor, or v1 otherwise.
- Version 2, in which any v1 entries are converted to v2 by a script.
These three editions are available for download at https://cc-cedict.org/editor/editor.php?handler=Download
When conversion to v2 is complete, the Mixed edition will be the same as Version 2 and therefore redundant, so only Version 1 and Version 2 will be published after that time.
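To give a rough idea of what converting v2 pinyin back to v1 involves, here is a naive sketch. It is not the project's actual conversion script; the function name `v2_pinyin_to_v1` is hypothetical, and the sketch assumes that every syllable ends in a tone digit and that a hyphen between two syllables simply becomes a space in v1. Unparsed elements (such as the “e” in “e-ren2”) are left untouched.

```python
import re


def v2_pinyin_to_v1(pinyin: str) -> str:
    """Split joined syllables, e.g. "er4ci4 fang1cheng2" -> "er4 ci4 fang1 cheng2".

    Assumptions: every syllable ends in a tone digit 1-5, and a hyphen
    that directly follows a tone digit is replaced by a space.
    """
    # Insert a space after a tone digit that is directly followed by a letter.
    spaced = re.sub(r"([1-5])(?=[A-Za-z])", r"\1 ", pinyin)
    # Replace a hyphen that directly follows a tone digit with a space.
    return re.sub(r"(?<=[1-5])-", " ", spaced)
```

Going in the other direction (v1 to v2) is harder, because a script has to decide where the word boundaries are.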
Basic format
The basic format of a CC-CEDICT entry is:
Traditional Simplified [[pin1yin1]] /gloss; gloss; .../gloss; gloss; .../
For example:
皮實 皮实 [[pi2shi5]] /(of things) durable/(of people) sturdy; tough/
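For readers who process the dictionary file, the basic format can be read as sketched below. This is only an illustration under the assumption that the entry is a single v2 line; the names `Entry` and `parse_entry` are made up for this example.

```python
import re
from dataclasses import dataclass

# Traditional, simplified, [[pinyin]], then the slash-delimited definition.
ENTRY = re.compile(r"^(\S+) (\S+) \[\[(.+?)\]\] /(.+)/$")


@dataclass
class Entry:
    traditional: str
    simplified: str
    pinyin: str
    senses: list  # each sense is a list of gloss strings


def parse_entry(line: str) -> Entry:
    """Parse one v2 line into its fields, senses, and glosses."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a v2 entry: {line!r}")
    trad, simp, pinyin, definition = match.groups()
    # Senses are separated by "/", glosses within a sense by ";".
    senses = [
        [gloss.strip() for gloss in sense.split(";")]
        for sense in definition.split("/")
    ]
    return Entry(trad, simp, pinyin, senses)
```

Applied to the 皮實 example, this yields two senses: the first with the single gloss “(of things) durable”, the second with the glosses “(of people) sturdy” and “tough”.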
Important: It is not allowed for multiple entries in CC-CEDICT to have the same combination of traditional, simplified and pinyin. In fact, the CC-CEDICT Editor website will not allow an editor to create an entry if there already exists an entry with the same trad-simp-pinyin combination. Attempting to do so produces an error message. Note that the pinyin comparison is case-sensitive, so [Wang2] and [wang2] are considered to be different. Therefore, we can have two entries such as the following.
王 王 [[Wang2]] /surname Wang/
王 王 [[wang2]] /king/
If an editor wants to add additional senses for an existing trad-simp-pinyin combination, they should edit its definition rather than create a new entry.
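The uniqueness rule can be illustrated with a small check over (traditional, simplified, pinyin) keys. This is a sketch only and not how the Editor website is implemented; `find_duplicate_keys` is a hypothetical helper.

```python
def find_duplicate_keys(entries):
    """Return the (traditional, simplified, pinyin) keys that occur more than once.

    `entries` is an iterable of (traditional, simplified, pinyin) tuples.
    The comparison is case-sensitive, so "Wang2" and "wang2" are distinct.
    """
    seen = set()
    duplicates = []
    for key in entries:
        if key in seen and key not in duplicates:
            duplicates.append(key)
        seen.add(key)
    return duplicates
```

With this check, the two 王 entries above do not collide, whereas two entries that both use “wang2” would be reported.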
Traditional and simplified characters
The Chinese word should consist of one or more Chinese characters, without any spaces in it. Both traditional and simplified forms should be provided, and the two must have the same length.
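The headword rules above (no spaces, both forms present, equal length) are easy to check in code. The helper below is a hypothetical sketch; it deliberately does not test whether each character is Chinese, since a few entries legitimately contain other symbols.

```python
def headword_problems(traditional: str, simplified: str) -> list:
    """Return a list of rule violations (empty if the headword looks fine)."""
    problems = []
    for label, word in (("traditional", traditional), ("simplified", simplified)):
        if not word:
            problems.append(f"{label} form is empty")
        if any(ch.isspace() for ch in word):
            problems.append(f"{label} form contains whitespace")
    # len() counts Unicode code points, so rare characters outside the
    # Basic Multilingual Plane (e.g. 𩽾) still count as one character each.
    if len(traditional) != len(simplified):
        problems.append("traditional and simplified forms differ in length")
    return problems
```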
There are a very small number of entries that use symbols, numbers, or other non-Chinese characters in the word, for example
% % [pa1] /percent (Tw)/
3C 3C [san1 C] /computers, communications, and consumer electronics/China Compulsory Certificate (CCC)/
421 421 [si4 er4 yi1] /four grandparents, two parents and an only child/
K人 K人 [K ren2] /(slang) to hit sb; to beat sb/
Below are some notes on how these entries are handled in v2.
Let's take “e人” (extroverted person) as an example.
There are several ways one might like to render “e人” in pinyin, such as
- e-rén
- erén
- yìrén
The Editor website attempts to match the parts of the headword with the parts of the pinyin, and will, if necessary, treat some parts as “unparsed”.
For example, in the following entry, “e” is an unparsed element in both the headword and the pinyin, while 人 is matched with “ren2”.
e人 e人 [[e-ren2]] /(slang) extroverted person/
If the Editor website https://cc-cedict.org/editor/ cannot unambiguously match up the elements of the headword and the pinyin, the entry will not be processed. That is what happens in the following case, where the proposed pinyin is “eren2” rather than “e-ren2”.
e人 e人 [[eren2]] /(slang) extroverted person/ (Invalid format!)
To specify “erén” (as opposed to, say, “e-rén”), it is necessary to use braces to guide the Editor website in parsing. The following would work:
e人 e人 [[{e}ren2]] /(slang) extroverted person/
… as would several other forms, including
{e}人 {e}人 [[{e}ren2]] /(slang) extroverted person/
Here is a link to a webpage where a proposed entry can be tested to see if it can be parsed correctly.
“Parse entry” webpage: https://cc-cedict.org/editor/editor.php?handler=ParseEntry
To specify “yìrén” as the pinyin for e人, no braces are necessary. The following entry can be parsed, as one can verify at the “Parse entry” webpage. “e” will be matched with “yi4”, and 人 will be matched with “ren2”.
e人 e人 [[yi4ren2]] /(slang) extroverted person/
Generally, it is regarded as preferable not to indicate the pronunciation of non-Chinese parts of a headword (such as “e” in “e人”). Instead, they can appear as unparsed elements of the pinyin. For example, “e-ren2” is preferred over “yi4ren2”.
Pinyin
The pinyin should be in accordance with standard pinyin orthography, except that numerals are used to indicate tones instead of diacritics, and apostrophes are not indicated.
Proper nouns should be capitalized, and spaces or hyphens should be inserted when appropriate. For example,
蘋果手機 苹果手机 [[Ping2guo3 shou3ji1]] /iPhone/
師生 师生 [[shi1-sheng1]] /teachers and students/
Other rules about our pinyin format:
- The neutral tone uses the numeral “5”, which should not be omitted.
- ü (u with an umlaut) is written as “u:”. For example, 女 → nu:3
- 儿 as the “retroflex final” is written as r5
- Raw tones should be used:
  - Tone sandhi is not indicated (e.g., ni3 hao3 is not changed to ni2 hao3)
  - Although “yi” and “bu” have various modifications in tone, depending on what follows them, these are not indicated in writing (e.g., “one horse” is pronounced “yi4 pi3 ma3” but written “yi1 pi3 ma3”, and “not enough” is pronounced “bu2 gou4” but written “bu4 gou4”)
  - Word-related changes to neutral tone, however, are indicated. These are especially common with reduplicated forms (e.g., use ma1 ma5, not ma1 ma1; ba4 ba5, not ba4 ba4; kan4 kan5, not kan4 kan4; xiang3 xiang5 (“take under consideration”), not xiang3 xiang3). This isn't limited to reduplicated forms, e.g., ming2 bai5, not ming2 bai2; cong1 ming5, not cong1 ming2.
It's best to keep in mind that Pinyin is about Mandarin words, not Chinese characters.
- Non-Chinese characters: Letters should be written as they are, while numbers should be written out using pinyin, for example, 3C becomes “san1 C”.
- xx5: There are a few entries where the pinyin is xx5, which represents unknown pinyin or characters where pinyin does not apply. Some Korean and Japanese symbols go here. It’s unlikely we will add more of these entries.
々 々 [xx5] /iteration mark (used to represent a duplicated character)/
㍽ ㍽ [xx5] /大正[Da4 zheng4] written as a single character/
朩 朩 [xx5] /one of the characters used in kwukyel (phonetic "pin"), an ancient Korean writing system/
込 込 [xx5] /(Japanese kokuji) to be crowded; to go into/
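Because the numbered pinyin follows standard orthography apart from tone numerals and apostrophes, it can be turned into pinyin with diacritics fairly directly. The sketch below is one possible approach, not an official converter; the names `mark_syllable` and `to_diacritics` are hypothetical. It assumes every syllable ends in a tone digit, uses the usual placement rule for tone marks (“a” or “e” takes the mark, otherwise the “o” of “ou”, otherwise the last vowel), and restores the apostrophe before a syllable starting with a, o or e that directly follows another syllable.

```python
import re

TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}
SYLLABLE = re.compile(r"([A-Za-z:]+)([1-5])")


def mark_syllable(letters: str, tone: int) -> str:
    """Turn e.g. ("nu:", 3) into "nǚ"; tone 5 (neutral) gets no mark."""
    letters = letters.replace("u:", "ü").replace("U:", "Ü")
    if tone == 5:
        return letters
    lower = letters.lower()
    if "a" in lower:
        pos = lower.index("a")
    elif "e" in lower:
        pos = lower.index("e")
    elif "ou" in lower:
        pos = lower.index("o")
    else:
        vowels = [i for i, ch in enumerate(lower) if ch in TONE_MARKS]
        if not vowels:
            return letters  # e.g. a bare "r": nothing to mark
        pos = vowels[-1]
    marked = TONE_MARKS[lower[pos]][tone - 1]
    if letters[pos].isupper():
        marked = marked.upper()
    return letters[:pos] + marked + letters[pos + 1:]


def to_diacritics(pinyin: str) -> str:
    """Convert a numbered pinyin string to pinyin with tone marks."""
    pieces = []
    last_end = 0
    for match in SYLLABLE.finditer(pinyin):
        # Keep spaces, hyphens, commas, and unparsed elements unchanged.
        pieces.append(pinyin[last_end:match.start()])
        letters, tone = match.group(1), int(match.group(2))
        follows_syllable = (
            match.start() > 0 and pinyin[match.start() - 1] in "12345"
        )
        if follows_syllable and letters[0].lower() in "aoe":
            pieces.append("'")
        pieces.append(mark_syllable(letters, tone))
        last_end = match.end()
    pieces.append(pinyin[last_end:])
    return "".join(pieces)
```

For instance, `to_diacritics("er4ci4 fang1cheng2")` gives “èrcì fāngchéng”, matching the example at the top of this page.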
Definition
Definitions should be written in American English.
Senses should be separated using a slash “/”, while glosses within a sense should be separated with a semicolon “;”. This means that you cannot use / or ; within a gloss. For example, “w/o” as an abbreviation of “without” would incorrectly split the definition into two senses.
Do not add definite or indefinite articles (e.g. “a”, “an”, “the”) to English nouns unless they are necessary to distinguish the word from another usage type or homonym.
Don't use parts of speech. Instead try to give an indication of grammatical usage within the English definition. CC-CEDICT is a human readable descriptive dictionary, not a resource intended for machine processing.
Abbreviations such as “etc”, “cf”, “e.g.” and “i.e.” do not need any further punctuation.
Extended meanings can be indicated with a lit. .. fig. combination when appropriate. When a common expression refers back to a classical incident or chengyu, one can refer to it with cf (e.g. an incident in Records of the Historian).
Special syntax
Taiwanese pronunciation
CC-CEDICT follows “standard Mandarin” as used in P.R. China. Mandarin as used in Taiwan sometimes has slight variations in pronunciation. These can be listed as follows:
叔叔 叔叔 [shu1 shu5] /(informal) father's younger brother/uncle/Taiwan pr. shu2 shu5/
Taiwanese GuoYu sometimes prefers not to use the neutral tone, so we do not list Taiwan pronunciations when the only difference is that the neutral tone is not used. When a character has a “Taiwan pr.” notice, its compounds need not repeat it.
Ambiguity due to homonyms
Sometimes words used in the English definitions can have multiple meanings. If the Chinese word does not have these additional meanings, additional information should be provided to prevent ambiguity:
首都 首都 [shou3 du1] /capital (city)/
The text between the parentheses is “meta-information”; it is not a direct part of the translation, merely to prevent ambiguity.
References
The English definitions can contain references to other Chinese words. These should be noted as follows:
漢字|汉字[Han4 zi4]
For example:
股指 股指 [gu3 zhi3] /stock market index/share price index/abbr. for 股票指數|股票指数[gu3 piao4 zhi3 shu4]/
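References in this syntax can be picked out of a definition with a regular expression. The following sketch is illustrative only; `find_references` is a hypothetical name, and the pattern assumes that the referenced word consists of non-ASCII characters (so it would miss rare headwords such as “3C”). When the traditional and simplified forms are identical, only one form is written, and the sketch returns it for both.

```python
import re

REFERENCE = re.compile(
    r"([^\x00-\x7F]+)"          # traditional form (assumed to be non-ASCII)
    r"(?:\|([^\x00-\x7F]+))?"   # optional "|simplified"
    r"\[([^\[\]]+)\]"           # pinyin in square brackets
)


def find_references(definition: str):
    """Return (traditional, simplified, pinyin) for each reference found."""
    refs = []
    for trad, simp, pinyin in REFERENCE.findall(definition):
        refs.append((trad, simp or trad, pinyin))
    return refs
```

A bare pronunciation note such as “Taiwan pr. [wan3]” is not matched, because no Chinese word directly precedes the bracket.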
Classifiers
Classifiers (also called “Measure words”) can be listed using the following syntax:
避風港 避风港 [bi4 feng1 gang3] /haven/refuge/harbor/CL:座[zuo4],個|个[ge4]/
Classifiers follow the 'reference' syntax, are prefixed by 'CL:' and separated by a comma (no additional spacing).
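A classifier sense in this syntax can be unpacked as in the sketch below. `parse_classifiers` is a hypothetical helper, and it assumes the sense has already been separated from the rest of the definition.

```python
def parse_classifiers(sense: str):
    """Parse a sense such as "CL:座[zuo4],個|个[ge4]".

    Returns a list of (traditional, simplified, pinyin) tuples, or an
    empty list if the sense is not a classifier sense.
    """
    if not sense.startswith("CL:"):
        return []
    classifiers = []
    for item in sense[len("CL:"):].split(","):
        word, _, pinyin = item.partition("[")
        trad, _, simp = word.partition("|")
        classifiers.append((trad, simp or trad, pinyin.rstrip("]")))
    return classifiers
```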
Classifier words themselves can be defined using:
/classifier for small round things (peas, bullets, peanuts, pills, grains etc)/
Bound forms
A bound form is a morpheme that only appears as part of a larger expression. In English, bound forms tend to be prefixes or suffixes such as “-ly”, “-est”, “pre-”, “post-” etc and generally are not words by themselves. In Chinese however, characters can either be bound or free, and it can be difficult to determine which. Some characters can have multiple bound and multiple free senses.
There are two types of bound forms in Chinese, those with meanings and those without.
Meaningful bound forms
These are bound forms where a meaning can be assigned to the character. Oftentimes they appear in multiple words with the same meaning, but never by themselves. We indicate these by prefixing the sense with “(bound form)”.
For instance:
隘 隘 [[ai4]] /(bound form) narrow/(bound form) a defile; a narrow pass/
is a bound form as you would not see 隘 alone when reading Chinese. It would always be accompanied by other characters such as 隘口, 隘路, 关隘, 狭隘 etc.
Meaningless bound forms
These are bound forms where a meaning cannot be assigned to the character, usually because the character appears in a small number of words (usually just 1). Oftentimes these are the names of plants or animals, or terms used in literature. For these characters, the entry is simply “used in …”.
For example:
鮟 𩽾 [an1] /used in 鮟鱇|𩽾𩾌[an1 kang1]/Taiwan pr. [an4]/
鱇 𩾌 [[kang1]] /used in 鮟鱇|𩽾𩾌[an1kang1]/
鮟鱇 𩽾𩾌 [an1 kang1] /anglerfish/
𩽾 and 𩾌 by themselves have no meaning, as they are always used with each other. 𩽾𩾌 is the anglerfish.
A small number of meaningless bound forms are used in multiple words. In this case, all of the words should be listed. When the words have the same or similar meaning, they should be combined into one sense; when the words have different meanings, they should be separated into different senses.
Different senses
螞 蚂 [[ma3]] /used in 螞蟥|蚂蟥[ma3huang2]/used in 螞蟻|蚂蚁[ma3yi3]/
蝲 蝲 [la4] /used in 蝲蛄[la4 gu3]/used in 蝲蝲蛄[la4 la4 gu3]/
蛞 蛞 [[kuo4]] /used in 蛞螻|蛞蝼[kuo4lou2]/used in 蛞蝓[kuo4yu2]/
猻 狲 [[sun1]] /used in 猢猻|猢狲[hu2sun1]/used in 兔猻|兔狲[tu4sun1]/
Same sense
箢 箢 [yuan1] /used in 箢箕[yuan1 ji1] and 箢篼[yuan1 dou1]/Taiwan pr. [wan3]/
癔 癔 [[yi4]] /used in 癔病[yi4bing4] and 癔症[yi4zheng4]/
咐 咐 [[fu4]] /used in 吩咐[fen1fu5] and 囑咐|嘱咐[zhu3fu5]/
An example of both
螂 螂 [[lang2]] /used in 螞螂|蚂螂[ma1lang2]/used in 蜣螂[qiang1lang2] and 虼螂[ge4lang2]/used in 螳螂[tang2lang2]/used in 蟑螂[zhang1lang2]/
Punctuation
Middle dot
Middle dots are often used for separating western names:
珍・奧斯汀 珍・奥斯汀 [Zhen1 · Ao4 si1 ting1] /Jane Austen (1775-1817), English novelist/
A double width middle dot is used in the Chinese, a single width middle dot padded with spaces on both sides is used in the pinyin.
Comma
Commas are sometimes used in Chinese proverbs:
人為財死,鳥為食亡 人为财死,鸟为食亡 [[ren2 wei4 cai2 si3, niao3 wei4 shi2 wang2]] /Human beings die in pursuit of wealth, and birds die in pursuit of food/.../
A double width comma is used in the Chinese. In the pinyin, a single width comma followed by a space is used.
Retroflex finals
There are 3 kinds of R-ised words that use the 兒/儿 character:
- 兒/儿 is not optional because it is its own syllable (usually meaning “son”, so “daughter” is literally “girl son”) - 女兒/女儿 nǚ'ér
- 兒/儿 is not optional because it changes the definition of the word and is tacked on to the preceding syllable - 頭兒/头儿 tóur (leader) as opposed to 頭/头 tóu (head)
- 兒/儿 is an optional northern pronunciation (er2hua4) and is tacked on to the preceding syllable - 花兒/花儿 huār (flower) as opposed to 花 huā (flower)
These 3 cases should be formatted as follows:
- 女兒 女儿 [nu:3 er2] /daughter/
- 頭兒 头儿 [tou2 r5] /leader/
- 花兒 花儿 [hua1 r5] /erhua variant of 花/flower/
Please note: words ending with 'r5' (such as 'hua1 r5') are presented as a -r joined with the previous syllable (e.g. 'huar1') in some dictionaries using CC-CEDICT, such as the MDBG Chinese-English dictionary.
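The joined presentation mentioned in the note can be produced with a single substitution. This is only a sketch of the idea and not code taken from MDBG or any other dictionary; `join_erhua` is a hypothetical name.

```python
import re


def join_erhua(pinyin: str) -> str:
    """Rewrite "hua1 r5" as "huar1": attach -r before the tone digit."""
    return re.sub(r"([A-Za-z:]+)([1-5]) ?r5\b", r"\1r\2", pinyin)
```

A full syllable such as “er2” in “nu:3 er2” is left alone, since only the literal “r5” is treated as a retroflex final.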
Choice of entries and translations
The current CC-CEDICT database contains a considerable number of infelicities, inaccuracies, omissions, and actual errors. As an ideal, new entries should be checked against 2 or 3 different sources (e.g. the online and paper dictionaries). Care is needed, since the dictionaries copy from one another – an entirely bogus entry in CC-CEDICT is copied uncritically onto thousands of websites within a few months.
If a Google query with the following syntax returns many thousands of hits for a Chinese word, the word should probably be added to CC-CEDICT, with translations corresponding to its main usages.
+"combination of characters"
(the +“” combination forces Google to match both a whole word and to ignore variants)
General principles of translation
The English should be meaningful, not horribly ugly, and bear a close relation to the Chinese meaning. It should correspond to something that could be used naturally by an English speaker (I think Arthur Waley has some advice saying that just because a text is about magnetohydrodynamics, it doesn't follow that it has to be horribly ugly).
On the other hand, a translation always loses something, and the translator can compensate by substituting an English equivalent (e.g. a biblical or Shakespearian allusion in place of a Confucian idiom).
Entries for names of persons should give dates if possible (birth, death, years in which the person was active in a certain role, etc), what the person is known for (writer, general, pop star, etc), and brief biographical details (e.g. took part in a revolution, was murdered, wrote famous book, etc). For example:
胡錦濤 胡锦涛 [Hu2 Jin3 tao1] /Hu Jintao (1942-), president of PRC from 2003/
Names of plants, animals, musical instruments should give common name and scientific name when appropriate; there is a particular problem of how specific the word is – a plant may mean a minor variety within a species, or may refer to an entire taxonomic family. Different writers will use it to mean the common family, or the particular item of salad on their plate at present.
Most words have more than one meaning, and more than one grammatical function. Care is needed not to concentrate only on a specific occurrence to the exclusion of others. For example, the occurrence at hand may be a past participle (say “overthrown”), whereas the word may also mean “destruction”, “to topple”, etc.
There are 20,000 Chinese characters in the more advanced dictionaries, of which many are obscure, never used, and will not have correct definitions in online or paper dictionaries. This is the boundary of knowledge. (Exactly the same applies to big English dictionaries.) These obscure characters appear on modern websites, and one sometimes needs to give a definition. It is reasonable to admit (precise meaning unknown), and give an indication of what one can deduce.
Variants
Many characters have variants, sometimes more than one, sometimes with identical meaning or quite different meanings. Some choice of variants found in texts on websites will arise because of the different input methods, and the user may have had no intention of using the variant.
You can get rough usage frequency information by searching the alternative word forms in Google. Please use this syntax to make sure that Google doesn't perform any automatic variant translations:
+"word"
Additionally you can use Google's advanced search to specify the language to either 'Chinese (Traditional)' or 'Chinese (Simplified)' to prevent Japanese web pages from influencing the results. For example:
789 Chinese (Traditional) pages for +“撐竿跳高”
17,700 Chinese (Simplified) pages for +“撑竿跳高”
1,750 Chinese (Traditional) pages for +“撐杆跳高”
66,900 Chinese (Simplified) pages for +“撑杆跳高”
It often happens that Google tells you that +“Xx” occurs 200 times more frequently than +“XX”, in which case Xx should be in CC-CEDICT as a regular entry, and XX only as “XX XX [pin1 yin1] /variant of Xx/definition/”.
When there are alternative forms of the same expression, and the less common form is at most 5 times less common, the less common entry should have /also written ../ referring to the more common form, e.g. 撐竿跳高 撑竿跳高 [cheng1 gan1 tiao4 gao1] /pole-vaulting/also written 撐杆跳高|撑杆跳高/.
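The frequency rule can be written down as a small decision function. This is a sketch under an assumption: the text above only gives “200 times” and “at most 5 times” as reference points, so the function below treats a factor of 5 as the single cut-off. The name `variant_note` is hypothetical.

```python
def variant_note(common_hits: int, rare_hits: int) -> str:
    """Pick the cross-reference style for the less common form.

    Assumed threshold: at most 5 times less common -> "also written";
    anything rarer -> "variant of".
    """
    if rare_hits > common_hits:
        raise ValueError("rare_hits must not exceed common_hits")
    if rare_hits * 5 >= common_hits:
        return "also written"
    return "variant of"
```

With the simplified-page counts quoted above (66,900 for 撑杆跳高 and 17,700 for 撑竿跳高), the less common form is about 3.8 times less common, so the function returns “also written”, in line with the 撐竿跳高 example.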
PROPOSED CHANGES
(Summary: (1) Get rid of “also written”, using “variant of” instead; and (2) Format “variant of” entries in line with points 2a and 2b below.)
(THE VARIANT RULES ABOVE CAN BE DELETED IF AND WHEN THESE CHANGES ARE ACCEPTED.)
(Also, the following notes can be tidied up and edited to remove references to “I” and “me”.)
Regarding “also written…”
According to our wiki, there are two kinds of variants. https://cc-cedict.org/wiki/format:syntax#variants
1) Where the less common form is relatively common (> 20% of the frequency of the more common form).
2) Where the less common form is much less common (< 20% of the frequency of the more common form)
For the first type, the def of the less common form should look like this (according to the wiki):
/definition/also written .../
And for the second type, the def of the less common form should be
/variant of .../definition/
In practice, what has been happening in recent years is this:
1. We have been ignoring the “also written …” syntax, except maybe when we edit existing entries
2. With variants,
a) if it's a full variant (i.e. exactly the same definition), we use /variant of …/ without adding the definition.
b) if it's a partial variant (i.e. only some of the senses of one form apply to the other form) we use
/definition (variant of ...)/
Part of the rationale for these changes is this: It's a hassle to check whether entries satisfy the “20% criterion”, and the percentage probably changes over time and depends on which corpus you use to get it.
Using the Editor website's search function, I got about 3600 results for “variant of” and only about 360 results for “also written”.
One idea that I've had in mind for a while is to clean up all these by
a) rewriting “also written” definitions by using “variant of”
b) regularizing the format of the “variant of” entries in line with points 2a and 2b above.
Romanization of foreign languages
When transcribing foreign words in definitions, please use the following romanization methods:
- Japanese: Modified Hepburn
- Korean: Revised Romanization of Korean
If an alternative romanization method is more popular for a certain word, that version can be added as an additional translation.