
CC-CEDICT V2 Syntax

CC-CEDICT began adopting the v2 format in December 2023. For v1 syntax, see: CC-CEDICT V1 Syntax.

Below are guidelines on what CC-CEDICT entries should look like. CC-CEDICT still has many old entries that do not comply with these rules yet.

An entry is considered to be in v2 format if it uses double square brackets for the pinyin. v1 entries use single square brackets.
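The bracket rule above can be checked mechanically. The following is a minimal sketch, not part of any CC-CEDICT tooling; the function name is hypothetical, and it assumes the headword pinyin is the only bracketed group before the first slash of the definition.

```python
import re

# Assumption: the headword pinyin sits before the first "/" of the definition.
V2_PINYIN = re.compile(r"\[\[[^\[\]]*\]\]")

def is_v2_entry(line: str) -> bool:
    """Return True if the headword pinyin uses double square brackets."""
    head = line.split("/", 1)[0]  # ignore bracketed references inside the definition
    return V2_PINYIN.search(head) is not None
```

Only the part of the line before the definition is inspected, so a v1 entry whose definition happens to contain a double-bracketed reference is not misclassified.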

The primary difference between v1 and v2 is that v2 entries follow standard pinyin orthography. In v1, all pinyin were written with spaces between each syllable. In v2, syllables can be combined to form words. For example,

v1: 二次方程 二次方程 [er4 ci4 fang1 cheng2] /(math.) quadratic equation/
v2: 二次方程 二次方程 [[er4ci4 fang1cheng2]] /(math.) quadratic equation/

However, converting an entry involves more than correcting its pinyin: the rest of the entry must also be reviewed. As a result, v2 pinyin signifies not only that the pinyin format has been updated, but also that the definition has been checked for correctness and proper format. It is a way of keeping track of which entries still have old definitions that need to be reviewed.

In particular, prior to April 2022, glosses and senses were separated using a /. As of April 2022, senses are to be separated with a / while glosses are to be separated with a ;. (This was a change in v1 format of definitions, but its progressive introduction largely coincides with the conversion of pinyin to v2 format.)

A number of other (mostly minor) syntax and format changes have also been established over the years, and are outlined on this wiki. v1 entries (some of which date back to 1998) may not follow our latest conventions, but v2 entries should. Part of the v2 conversion process is making sure these rules are followed.

Three editions of CC-CEDICT are published regularly:

  • Version 1, in which any v2 entries are converted back to v1 by a script.
  • Mixed, in which entries are v2 if they have been converted by an editor, or v1 otherwise.
  • Version 2, in which any v1 entries are converted to v2 by a script.

These three editions are available for download at https://cc-cedict.org/editor/editor.php?handler=Download

When conversion to v2 is complete, the Mixed edition will be the same as Version 2 and therefore redundant, so only Version 1 and Version 2 will be published after that time.

Basic format

The basic format of a CC-CEDICT entry is:

Traditional Simplified [[pin1yin1]] /gloss; gloss; .../gloss; gloss; .../
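As an illustration of the basic format, here is a minimal sketch that splits a v2 entry line into its four fields. The names `Entry`, `ENTRY_RE` and `parse_entry` are hypothetical, and the pattern assumes single spaces between fields, as in the examples on this page. It is not the parser used by the CC-CEDICT website.

```python
import re
from typing import NamedTuple

class Entry(NamedTuple):
    traditional: str
    simplified: str
    pinyin: str       # contents of the double square brackets
    definition: str   # senses and glosses, without the outer slashes

# Assumption: fields are separated by single spaces.
ENTRY_RE = re.compile(r"^(\S+) (\S+) \[\[([^\]]*)\]\] /(.*)/\s*$")

def parse_entry(line: str) -> Entry:
    """Split one v2 entry line into traditional, simplified, pinyin and definition."""
    match = ENTRY_RE.match(line)
    if match is None:
        raise ValueError(f"not a v2 entry: {line!r}")
    return Entry(*match.groups())
```

For example, `parse_entry("王 王 [[wang2]] /king/")` yields the fields `王`, `王`, `wang2` and `king`. A v1 line with single brackets raises `ValueError`.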

Important: It is not allowed for multiple entries in CC-CEDICT to have the same combination of traditional, simplified and pinyin. In fact, the CC-CEDICT Editor website will not allow an editor to create an entry if there already exists an entry with the same trad-simp-pinyin combination. Attempting to do so produces an error message. Note that the pinyin comparison is case-sensitive, so [Wang2] and [wang2] are considered to be different. Therefore, we can have two entries such as the following.

王 王 [[Wang2]] /surname Wang/
王 王 [[wang2]] /king/

If an editor wants to add additional senses for an existing trad-simp-pinyin combination, they should edit its definition rather than create a new entry.
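The uniqueness rule can be sketched as a check over (traditional, simplified, pinyin) keys. This is only an illustration under the assumption that entries have already been split into those three fields; `find_duplicate_keys` is a hypothetical helper, not the CC-CEDICT Editor's code.

```python
from collections import Counter

def find_duplicate_keys(entries):
    """Return every (traditional, simplified, pinyin) key that occurs more than once.

    The comparison is case-sensitive, so "Wang2" and "wang2" are distinct keys.
    """
    counts = Counter(tuple(entry) for entry in entries)
    return [key for key, count in counts.items() if count > 1]
```

With the two 王 entries above the result is an empty list, because the pinyin differs in case. Adding a second `("王", "王", "wang2")` would be reported as a duplicate.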

Traditional and simplified characters

The Chinese word should consist of one or more Chinese characters, without any spaces in it. Both traditional and simplified forms should be provided, and the two must have the same length.

Pinyin

The pinyin should be in accordance with standard pinyin orthography. For a comprehensive reference, we recommend Chinese Romanization: Pronunciation and Orthography by Yin Binyong.

For the majority of entries, an entry can be converted from v1 to v2 by simply removing the spaces within words (see 二次方程 above). In addition, a hyphen can now be included in the pinyin when appropriate.

師生 师生 [[shi1-sheng1]] /teachers and students/
柴米油鹽醬醋茶 柴米油盐酱醋茶 [[chai2-mi3-you2-yan2-jiang4-cu4-cha2]] /lit. firewood, rice, oil, salt, soy sauce, vinegar and tea/fig. life's daily necessities/

Rules about our pinyin format:

  1. Tones are indicated with numerals instead of diacritics. The neutral tone (轻声) uses the numeral “5”, which should not be omitted.
  2. Because we use numerals, the boundary between syllables is clearly established, and apostrophes before vowels are not needed.
  3. ü (u with an umlaut) is written as “u:”. For example, 女 → nu:3.
  4. 儿 as the “retroflex final” is written as r5.
  5. Raw tones should be used:
    1. Tone sandhi is not indicated (e.g., 你好[ni3hao3] is not written as [ni2hao3])
    2. 一 and 不 have various modifications in tone depending on what follows them, but these are not indicated in the pinyin (e.g., 一半[yi1ban4] is not written as [yi2ban4], 不是[bu4shi4] is not written as [bu2shi4])
    3. Word-related changes to neutral tone, however, are indicated. These are especially common with reduplicated forms (e.g., use ma1 ma5, not ma1 ma1; ba4 ba5, not ba4 ba4; kan4 kan5, not kan4 kan4; xiang3 xiang5 (“take under consideration”), not xiang3 xiang3). This isn't limited to reduplicated forms, e.g., ming2 bai5, not ming2 bai2; cong1 ming5, not cong1 ming2.
      It's best to keep in mind that Pinyin is about Mandarin words, not Chinese characters.
  6. xx5: Represents an entry where pinyin does not apply. There are very few entries with this pinyin and we do not expect to add more.
々 々 [xx5] /iteration mark indicating repetition of the preceding character in horizontal writing (rare in modern Chinese)/
〻 〻 [xx5] /iteration mark indicating repetition of the preceding character in vertical writing (rare in modern Chinese)/
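Because every syllable ends in a tone numeral, numbered pinyin can be split into syllables without a syllable table. The sketch below illustrates rules 1 to 4. It is a simplification that does not validate syllables against the real pinyin inventory, and both function names are hypothetical.

```python
import re

# A syllable is a run of letters (with ":" standing in for the umlaut)
# followed by a tone numeral from 1 to 5.
SYLLABLE = re.compile(r"[A-Za-z:]+[1-5]")

def split_syllables(pinyin: str) -> list:
    """Split numbered pinyin into syllables, ignoring spaces and hyphens."""
    return SYLLABLE.findall(pinyin)

def show_umlaut(pinyin: str) -> str:
    """Replace the "u:" notation with a real ü for display purposes."""
    return pinyin.replace("u:", "ü").replace("U:", "Ü")
```

For instance, `split_syllables("nu:3er2")` gives `["nu:3", "er2"]`, and `split_syllables("hua1r5")` gives `["hua1", "r5"]`.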

Definition

A definition is made up of senses, and a sense is made up of glosses. Senses should be separated using a slash “/”, while glosses should be separated with a semicolon “;”. This means that you cannot use / or ; within a definition. For example, “w/o” as an abbreviation of “without” would incorrectly split the definition into two senses.

Generally, glosses within a sense are synonyms and can be included to remove ambiguity, while senses represent wholly different meanings or uses of a word. Here's an example of an entry with multiple senses and glosses.

算 算 [[suan4]] /to calculate; to figure out/to include; to count in/to count; to be valid; to carry weight/to regard as; to consider (to be ...)/
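The sense and gloss structure of such a definition can be recovered with two splits. This is a minimal sketch; `split_definition` is a hypothetical helper, and it assumes the definition follows the post-April-2022 convention (“/” between senses, “;” between glosses).

```python
def split_definition(definition: str) -> list:
    """Split a definition into senses, each a list of glosses."""
    body = definition.strip().strip("/")  # drop the leading and trailing slash
    return [
        [gloss.strip() for gloss in sense.split(";")]
        for sense in body.split("/")
    ]
```

Applied to the 算 entry, this returns four senses, the first being `["to calculate", "to figure out"]`. It also shows why “w/o” is a problem: `split_definition("/w/o sugar/")` returns two senses, `["w"]` and `["o sugar"]`.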

Rules to follow when writing a definition:

  1. Use American English.
  2. Do not add definite or indefinite articles (e.g. “a”, “an”, “the” etc) to English nouns unless they are necessary to distinguish the word from another usage type or homonym
  3. Don't use parts of speech. Instead try to give an indication of grammatical usage within the English definition. CC-CEDICT is a human readable descriptive dictionary, not a resource intended for machine processing.
  4. The singular form is preferred over the plural form, unless the word is typically used in its plural form.
  5. Entries for people should include dates if possible (birth, death, years in which the person was active in a certain role etc) and why this person is of interest (was famous writer, took part in a revolution, was murdered etc). If a person isn't particularly famous and isn't related to China or Chinese culture, please don't include them.
  6. Names of plants, animals and musical instruments should give the common name and the scientific name when appropriate. A particular problem is how specific the word is: a plant name may mean a minor variety within a species, or may refer to an entire taxonomic family. Different writers will use it to mean the common family, or the particular item of salad on their plate at present.

Ambiguity due to homonyms

Many words in the English language have multiple meanings. If such a word is used to write a definition, additional information should be provided to prevent ambiguity.

首都 首都 [[shou3du1]] /capital (city)/

The text between the parentheses is “meta-information”; it is not part of the translation itself and serves only to prevent ambiguity.

General principles of translation

The English should be meaningful, not horribly ugly, and bear a close relation to the Chinese meaning. It should correspond to something that could be used naturally by an English speaker (I think Arthur Waley has some advice saying that just because a text is about magnetohydrodynamics, it doesn't follow that it has to be horribly ugly).

On the other hand, a translation always loses something, and the translator can compensate by substituting an English equivalent (e.g. a biblical or Shakespearian allusion in place of a Confucian idiom).

Most words have more than one meaning, and more than one grammatical function. Care is needed not to concentrate only on a specific occurrence to the exclusion of others. For example, the actual occurrence may be a verb in the past participle (say “overthrown”) whereas the word may also mean “destruction”, “to topple” etc.

There are tens of thousands, if not a hundred thousand, Chinese characters that have ever been created in Chinese history. Many of them are archaic, obscure, and have not been used in centuries, perhaps millennia, and it may not be possible to provide a definition. If you can't find a character in common dictionaries, or examples of the character in modern use, it's a sign that it's not worth including.

Special syntax

Taiwanese pronunciation

CC-CEDICT follows “standard Mandarin” as used in P.R. China. Mandarin as used in Taiwan sometimes has slight variations in pronunciation, which can be listed as follows:
叔叔 叔叔 [shu1 shu5] /(informal) father's younger brother/uncle/Taiwan pr. shu2 shu5/

Taiwanese GuoYu sometimes prefers not to use the neutral tone, so we do not list Taiwan pronunciations when they consist only of saying “don't use the neutral tone”. When a character has a “Taiwan pr.” notice, its compounds need not mention it.

Labels

See Labels

References

Classifiers

Classifiers, or “measure words”, can be listed using the following syntax:

麵包 面包 [[mian4bao1]] /bread/CL:片[pian4],塊|块[kuai4]/
麵包店 面包店 [[mian4bao1dian4]] /bakery/CL:家[jia1]/

They follow the reference syntax of traditional|simplified[pinyin], are prefixed by “CL:” and separated by commas.
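A classifier sense written this way can be read back into (traditional, simplified, pinyin) triples. The following is a rough sketch under the assumption that each item is either `char[pinyin]` or `trad|simp[pinyin]`, as in the examples above; `parse_classifiers` is a hypothetical name.

```python
import re

# One item: optional "traditional|" part, then the simplified form, then [pinyin].
CL_ITEM = re.compile(r"^(?:([^|\[\]]+)\|)?([^|\[\]]+)\[([^\]]+)\]$")

def parse_classifiers(sense: str) -> list:
    """Parse a "CL:..." sense into (traditional, simplified, pinyin) tuples."""
    if not sense.startswith("CL:"):
        return []
    result = []
    for item in sense[3:].split(","):
        match = CL_ITEM.match(item.strip())
        if match is None:
            raise ValueError(f"malformed classifier: {item!r}")
        trad, simp, pinyin = match.groups()
        # When only one form is given, traditional and simplified are identical.
        result.append((trad if trad is not None else simp, simp, pinyin))
    return result
```

For the 麵包 entry, `parse_classifiers("CL:片[pian4],塊|块[kuai4]")` returns `[("片", "片", "pian4"), ("塊", "块", "kuai4")]`.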

We typically omit general classifiers like 個|个[ge4] which can be applied to almost every single noun.

A classifier itself can be described like so:

/classifier for small round things (peas, bullets, peanuts, pills, grains etc)/ 

Bound forms

Punctuation

Dashes and hyphens

We do not use the em dash (—). Ranges of numbers, dates, times etc should be separated by the en dash (–). In other cases, either the en dash or hyphen (-) should be used following standard English grammar.

Middle dot

Middle dots are often used for separating western names:

大衛·艾登堡 大卫·艾登堡 [[Da4wei4 Ai4deng1bao3]] /David Attenborough (1926–), British naturalist and broadcaster/

Note: A middle dot was present within the pinyin in v1, but is no longer used in v2. The v2 pinyin format allows us to clearly group the characters of the first name and last name, so the middle dot is no longer necessary.

Comma

Commas are sometimes used in proverbs or longer expressions:

分久必合,合久必分 分久必合,合久必分 [[fen1jiu3-bi4he2, he2jiu3-bi4fen1]] /lit. that which is long divided must unify, and that which is long unified must divide (proverb, from Romance of the Three Kingdoms 三國演義|三国演义[San1guo2 Yan3yi4])/fig. things are constantly changing/

The comma within the Chinese characters should be the “fullwidth comma”: “,”. The comma within the pinyin should be the regular comma followed by a space.

Note: In v1, a space was also inserted before the comma in the pinyin (so the pinyin would contain “ , ”). The space before the comma has been phased out in v2.
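The two comma conventions lend themselves to simple checks. The sketch below is illustrative only, and both helper names are hypothetical: one rewrites v1-style comma spacing in pinyin to the v2 form, and the other flags a regular (halfwidth) comma in a Hanzi field.

```python
import re

def normalize_pinyin_commas(pinyin: str) -> str:
    """Drop any space before a comma and keep exactly one space after it."""
    return re.sub(r"\s*,\s*", ", ", pinyin)

def has_halfwidth_comma(hanzi: str) -> bool:
    """Return True if a Hanzi field contains a regular comma instead of a fullwidth one."""
    return "," in hanzi
```

Pinyin that is already in v2 form passes through `normalize_pinyin_commas` unchanged.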

Enumeration comma

The enumeration comma “、”, known as 顿号, is used to separate items in a list. It's used rarely in CC-CEDICT, but appears in a handful of entries. Syntax-wise, it's treated the same way as the fullwidth comma (no space in the Chinese characters, and corresponds to a regular comma followed by a space in the pinyin).

八字方針 八字方针 [[ba1zi4 fang1zhen1]] /a policy expressed as an eight-character slogan/(esp.) the eight-character slogan for the economic policy proposed by Li Fuchun 李富春[Li3 Fu4chun1] in 1961: 調整、鞏固、充實、提高|调整、巩固、充实、提高[tiao2zheng3, gong3gu4, chong1shi2, ti2gao1] "adjust, consolidate, enrich and improve"/

兒|儿, erhua and rhotacization

The 兒|儿 character can be used in three different ways:

1. 兒|儿[er2] is non-optional because it's its own syllable (meaning “child” or “son”)

女兒 女儿 [[nu:3er2]] /daughter/

2. 兒|儿[r5] is a non-optional suffix because it changes both the pronunciation and meaning of the word

頭兒 头儿 [[tou2r5]] /leader/

3. 兒|儿[r5] is an optional suffix, changing the pronunciation of the word but not the meaning

花兒 花儿 [[hua1r5]] /erhua form of 花[hua1]/

Please note: words ending with 'r5' (such as 'hua1r5') are presented as a -r joined with the previous syllable (e.g. 'huar1') in some dictionaries using CC-CEDICT, such as the MDBG Chinese-English dictionary.
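The joined presentation can be derived from the CC-CEDICT form with a single substitution. This is a sketch of the transformation only, not MDBG's actual implementation; `merge_erhua` is a hypothetical name, and it assumes v2 pinyin in which r5 directly follows the syllable it modifies.

```python
import re

# A syllable (letters plus tone numeral) immediately followed by the r5 suffix.
ERHUA = re.compile(r"([A-Za-z:]+)([1-5])r5")

def merge_erhua(pinyin: str) -> str:
    """Join a trailing r5 onto the previous syllable, e.g. hua1r5 -> huar1."""
    return ERHUA.sub(r"\1r\2", pinyin)
```

A full syllable er2, as in nu:3er2, is left untouched because it is not written as r5.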

Choice of entries and translations

The current CC-CEDICT database contains a considerable number of infelicities, inaccuracies, omissions, and actual errors. As an ideal, new entries should be checked against 2 or 3 different sources (e.g. the online and paper dictionaries). Care is needed, since the dictionaries copy from one another – an entirely bogus entry in CC-CEDICT is copied uncritically onto thousands of websites within a few months.

A Chinese word for which a Google query with the following syntax results in many thousands of hits should probably be added to CC-CEDICT, with translations corresponding to the main usages.

+"combination of characters"

(The +"" combination forces Google to match the whole word and to ignore variants.)

Romanization of foreign languages

When transcribing foreign words in definitions, please use the following romanization methods:

If an alternative romanization method is more popular for a certain word, that version can be added as an additional translation.

Non-Chinese characters

On occasion the Chinese language uses English letters or numerals to write a word. For example, we have

# English letters
ky ky [[ky]] /(slang) socially tone-deaf; unable to read the room (from Japanese KY, acronym of 空気が読めない "kuuki ga yomenai")/
coser coser [[coser]] /cosplayer/

# Mix of English and Chinese
e人 e人 [[e-ren2]] /(slang) extroverted person/
勿cue 勿cue [[wu4-cue]] /(Internet slang) don't call on me; don't drag me in/

# Numbers
3D打印 3D打印 [[san1-D da3yin4]] /to 3D print; 3D printing/
95後 95后 [[jiu3wu3hou4]] /people born between 1995-01-01 and 1999-12-31/Gen Z (abbr. for 95後|95后[jiu3wu3hou4] + 00後|00后[ling2ling2hou4])/
996 996 [[jiu3jiu3liu4]] /9am–9pm, six days a week (work schedule)/

As a general rule of thumb:

  1. When writing the Hanzi fields, non-Chinese characters should stay the same.
  2. When writing the pinyin, for English letters use the same letters in the pinyin (ky → ky), but for numbers write out the pinyin for the corresponding Chinese character (9 → jiu3)

Technical details, and the use of {}

When parsing the traditional and simplified fields, Hanzi and numbers are treated as individual sections, while consecutive English letters are grouped together into a single section. For example, a hypothetical headword “甲abc123乙丙” would be parsed into 7 sections: 甲, abc, 1, 2, 3, 乙, 丙.

The pinyin is first split by spaces and punctuation, and then parsed based on valid pinyin syllables. One way of writing pinyin for the hypothetical headword above could be “jia3 abc yi1-er4-san1 yi3bing3”. Note there are many valid ways that the pinyin could be segmented, for example “jia3 abc yi1er4 san1yi3bing3” or “jia3-abc-yi1-er4-san1-yi3-bing3” (these may not make sense from an orthographic point of view, but will be parsed correctly by the CC-CEDICT website). The only requirement is that the “abc” is separated from “yi1”. Any of the above examples will be parsed into 7 pinyin sections.
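The sectioning described above can be approximated in a few lines. This is a simplified sketch, not the parser used by the CC-CEDICT website. It treats any run of letters ending in a tone numeral as a syllable instead of checking a table of valid syllables, and it assumes that the middle dot and the fullwidth and enumeration commas do not count as Hanzi sections. All names are hypothetical.

```python
import re

# Assumption: these punctuation marks are not sections of their own.
HANZI_PUNCTUATION = {"·", "\uff0c", "、"}

HANZI_SECTION = re.compile(r"[A-Za-z]+|.")
PINYIN_SECTION = re.compile(r"[A-Za-z:]+[1-5]|[A-Za-z]+")

def hanzi_sections(headword: str) -> list:
    """Group consecutive English letters; every other character is its own section."""
    return [s for s in HANZI_SECTION.findall(headword) if s not in HANZI_PUNCTUATION]

def pinyin_sections(pinyin: str) -> list:
    """Split on spaces, commas and hyphens, then cut each chunk into syllables."""
    sections = []
    for chunk in re.split(r"[\s,\-]+", pinyin):
        sections.extend(PINYIN_SECTION.findall(chunk))
    return sections
```

All three pinyin segmentations given above produce the same 7 sections, matching the 7 sections of “甲abc123乙丙”. Writing “abcyi1” without a separator would make this sketch read it as one section, which is why the separation matters here.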

It is a requirement that the number of parsed sections in the Hanzi matches the number of parsed sections in the pinyin. For the vast majority of entries, this does not pose a problem. Almost all Chinese characters are one syllable in length, and due to how the parsing logic works, numbers and English letters will be parsed correctly as long as the pinyin is segmented correctly. Problems arise in rare situations such as

兡 兡 [[bai3ke4]] /.../

where a single character corresponds to two syllables. In these cases, {}'s may be used to manually group a section, so we can write

兡 兡 [[{bai3ke4}]] /.../

which indicates “bai3ke4” is a single pinyin section, matching the single Hanzi section of 兡.

Another problem arises when an entry has a number with multiple digits, such as

21三體綜合症 21三体综合症 [[er4shi2yi1 san1ti3 zong1he2zheng4]] /trisomy; Down's syndrome/

as there are seven Hanzi sections but eight pinyin sections. In this case, {}'s are necessary in both the pinyin and Hanzi to delineate sections:

{21}三體綜合症 {21}三体综合症 [[{er4shi2yi1} san1ti3 zong1he2zheng4]] /trisomy; Down's syndrome/

Note that this differs from the 996 example above, which is read as three separate digits (“nine nine six”) and so parses without {}'s. Read as the number “nine hundred ninety-six”, it would need {}'s.
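The effect of {} on the section counts can be illustrated with a small brace-aware check. As before, this is a sketch under simplifying assumptions, not the website's parser: a syllable is any run of letters ending in a tone numeral, Hanzi-field punctuation is skipped, and the function names are hypothetical.

```python
import re

# A {...} group always counts as exactly one section.
HANZI_SECTION = re.compile(r"\{[^{}]*\}|[A-Za-z]+|.")
PINYIN_SECTION = re.compile(r"\{[^{}]*\}|[A-Za-z:]+[1-5]|[A-Za-z]+")
HANZI_PUNCTUATION = {"·", "\uff0c", "、"}  # assumption: not counted as sections

def count_hanzi_sections(headword: str) -> int:
    sections = HANZI_SECTION.findall(headword)
    return sum(1 for s in sections if s not in HANZI_PUNCTUATION)

def count_pinyin_sections(pinyin: str) -> int:
    return len(PINYIN_SECTION.findall(pinyin))

def sections_match(headword: str, pinyin: str) -> bool:
    """Return True if the Hanzi and pinyin fields have the same number of sections."""
    return count_hanzi_sections(headword) == count_pinyin_sections(pinyin)
```

With this sketch, `sections_match("兡", "bai3ke4")` is false while `sections_match("兡", "{bai3ke4}")` is true, and the 996 entry matches without any braces.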

To check whether an entry will be parsed correctly, you can use this tool: https://cc-cedict.org/editor/editor.php?handler=ParseEntry

syntax_v2.txt · Last modified: by kbaiko
