syntax_v2
Differences
This shows you the differences between two versions of the page.
| syntax_v2 [2026/04/21 02:49] – [Comma] kbaiko | syntax_v2 [2026/05/16 09:39] (current) – Handling numbers with multiple digits kbaiko | ||
|---|---|---|---|
| Line 4: | Line 4: | ||
| // Below are guidelines on what CC-CEDICT entries **should** look like. CC-CEDICT still has many old entries that do not comply with these rules yet. // | // Below are guidelines on what CC-CEDICT entries **should** look like. CC-CEDICT still has many old entries that do not comply with these rules yet. // | ||
| - | |||
| - | Version 2 (v2) introduces a new syntax for the pinyin of an entry, allowing for the specification of pinyin that follows standard pinyin orthography. In particular, it enables the combination of syllables to form words. For example, in v2, 二次方程 (quadratic equation) can now be written as two words, " | ||
| An entry is considered to be in v2 format if it uses double square brackets for the pinyin. v1 entries use single square brackets. | An entry is considered to be in v2 format if it uses double square brackets for the pinyin. v1 entries use single square brackets. | ||
| + | |||
| + | The primary difference between v1 and v2 is that v2 entries follow standard pinyin orthography. In v1, pinyin was written with a space between every syllable. In v2, syllables can be combined to form words. For example, | ||
| + | |||
| < | < | ||
| - | v2: [[pin1yin1]] | + | v1: 二次方程 二次方程 |
| - | v1: [pin1 yin1] | + | v2: 二次方程 二次方程 |
| </ | </ | ||
| - | However, | + | However, |
| In particular, prior to April 2022, glosses and senses were separated using a /. As of April 2022, senses are to be separated with a / while glosses are to be separated with a ;. (This was a change in v1 format of definitions, | In particular, prior to April 2022, glosses and senses were separated using a /. As of April 2022, senses are to be separated with a / while glosses are to be separated with a ;. (This was a change in v1 format of definitions, | ||
| + | |||
| + | A number of other (mostly minor) syntax and format changes have also been established over the years and are outlined on this wiki. v1 entries (some of which date back to 1998) may not follow our latest conventions, but v2 entries should. Part of the v2 conversion process is making sure these rules are followed. | ||
| Three editions of CC-CEDICT are published regularly: | Three editions of CC-CEDICT are published regularly: | ||
| Line 32: | Line 35: | ||
| < | < | ||
| Traditional Simplified [[pin1yin1]] /gloss; gloss; .../gloss; gloss; .../ | Traditional Simplified [[pin1yin1]] /gloss; gloss; .../gloss; gloss; .../ | ||
| - | </ | ||
| - | |||
| - | For example: | ||
| - | < | ||
| - | 皮實 皮实 [[pi2shi5]] /(of things) durable/(of people) sturdy; tough/ | ||
| </ | </ | ||
| Line 47: | Line 45: | ||
| If an editor wants to add additional senses for an existing trad-simp-pinyin combination, | If an editor wants to add additional senses for an existing trad-simp-pinyin combination, | ||
| + | |||
| ==== Traditional and simplified characters ==== | ==== Traditional and simplified characters ==== | ||
| The Chinese word should consist of one or more Chinese characters, without any spaces in it. Both traditional and simplified forms should be provided, and the two must have the same length. | The Chinese word should consist of one or more Chinese characters, without any spaces in it. Both traditional and simplified forms should be provided, and the two must have the same length. | ||
| - | There are a very small number | + | ==== Pinyin ==== |
| + | |||
| + | The pinyin should be in accordance with standard pinyin orthography. For a comprehensive reference, we recommend //Chinese Romanization: | ||
| + | |||
| + | For the majority | ||
| < | < | ||
| - | % % [pa1] /percent (Tw)/ | + | 師生 师生 |
| - | 3C 3C [san1 C] /computers, communications, | + | 柴米油鹽醬醋茶 柴米油盐酱醋茶 |
| - | 421 421 [si4 er4 yi1] /four grandparents, two parents | + | |
| - | K人 K人 [K ren2] /(slang) to hit sb; to beat sb/ | + | |
| </ | </ | ||
| - | **Below are some notes on how these entries | + | Rules about our pinyin format: |
| + | - Tones are indicated with numerals instead of diacritics. The neutral tone (轻声) uses the numeral " | ||
| + | - Because we use numerals, the boundary between characters is clearly established and apostrophes before vowels are not needed | ||
| + | - ü (u with an umlaut) is written as “u:”. For example, 女 -> nu:3 | ||
| + | - 儿 as the “retroflex final” is written as r5 | ||
| + | - Raw tones should be used: | ||
| + | - Tone sandhi is **not** indicated (e.g., 你好[ni3hao3] is not written as [ni2hao3]) | ||
| + | - 一 and 不 have various modifications in tone depending | ||
| + | - Word-related changes to neutral tone, however, | ||
| + | - xx5: Represents an entry where pinyin does not apply. There are very few entries with this pinyin and we do not expect to add more. | ||
| - | Let's take " | + | < |
| + | 々 々 [xx5] /iteration mark indicating repetition of the preceding character in horizontal writing | ||
| + | 〻 〻 [xx5] /iteration mark indicating repetition of the preceding character in vertical writing (rare in modern Chinese)/ | ||
| + | </ | ||
| - | There are several ways one might like to render " | + | ==== Definition ==== |
| - | - e-rén | + | |
| - | - erén | + | |
| - | - yìrén | + | |
| - | The Editor website attempts to match the parts of the headword with the parts of the pinyin, and will, if necessary, treat some parts as "unparsed". | + | A definition is made up of senses, and a sense is made up of glosses. Senses should be separated using a slash "/" |
| - | For example, in the following entry, " | + | Generally, glosses within a sense are synonyms |
| - | < | + | |
| - | If the Editor website https://cc-cedict.org/editor/ cannot unambiguously match up the elements of the headword and the pinyin, the entry will not be processed. That is what happens in the following case, where the proposed pinyin is " | + | < |
| + | 算 算 [[suan4]] | ||
| + | </ | ||
| + | Rules to follow when writing a definition: | ||
| + | - Use American English. | ||
| + | - Do not add definite or indefinite articles (e.g. " | ||
| + | - Don't use part-of-speech labels. Instead, try to give an indication of grammatical usage within the English definition. CC-CEDICT is a human-readable descriptive dictionary, not a resource intended for machine processing. | ||
| + | - The singular form is preferred over the plural form, unless the word is typically used in its plural form. | ||
| + | - Entries for people should include dates if possible (birth, death, years in which the person was active in a certain role etc) and why this person is of interest (was a famous writer, took part in a revolution, was murdered etc). If a person isn't particularly famous and isn't related to China or Chinese culture, please don't include them. | ||
| + | - Names of plants, animals, musical instruments should give common name and scientific name when appropriate; | ||
| - | < | ||
| - | To specify " | + | === Ambiguity due to homonyms === |
| - | < | + | |
| - | ... as would several other forms, including | + | Many words in the English language have multiple meanings. If such a word is used to write a definition, additional information should be provided to prevent ambiguity. |
| - | < | + | |
| - | Here is a link to a webpage where a proposed entry can be tested to see if it can be parsed correctly. | ||
| - | |||
| - | "Parse entry" webpage: | ||
| - | https:// | ||
| - | |||
| - | To specify " | ||
| - | |||
| - | < | ||
| - | |||
| - | Generally, it is regarded as preferable not to indicate the pronunciation of non-Chinese parts of a headword (such as " | ||
| - | |||
| - | |||
| - | ==== Pinyin ==== | ||
| - | |||
| - | The pinyin should be in accordance with standard pinyin orthography, | ||
| - | |||
| - | Proper nouns should be capitalized, | ||
| < | < | ||
| - | 蘋果手機 苹果手机 | + | 首都 首都 |
| - | 師生 师生 [[shi1-sheng1]] /teachers and students/ | + | |
| </ | </ | ||
| - | Other rules about our pinyin format: | + | The text between |
| - | - The neutral tone uses the numeral " | + | |
| - | - ü, also known as the umlaut, | + | |
| - | - 儿 as the “retroflex final” is written as r5 | + | |
| - | - Raw tones should be used: | + | |
| - | - Tone sandhi is **not** indicated (e.g., ni3 hao3 is not changed to ni2 hao3) | + | |
| - | - Although " | + | |
| - | | + | |
| - | - Non-Chinese characters: Letters should be written as they are, while numbers should be written out using pinyin, for example, 3C becomes “san1 C”. | + | |
| - | - xx5: There are very few entries where the pinyin is xx5, which represents unknown pinyin or characters where pinyin does not apply. | + | |
| - | < | ||
| - | 々 々 [xx5] /iteration mark (used to represent a duplicated character)/ | ||
| - | </ | ||
| - | ==== Definition ==== | + | === General principles of translation |
| - | Definitions | + | The English |
| - | Senses should be separated using a slash "/" | + | On the other hand, a translation always loses something, and the translator can compensate by substituting |
| - | Do not add definite or indefinite articles (e.g. " | + | Most words have more than one meaning, and more than one grammatical |
| - | + | ||
| - | Don't use parts of speech. Instead try to give an indication of grammatical | + | |
| - | + | ||
| - | Abbreviations etc cf e.g. i.e. do not need any further punctuation. | + | |
| - | + | ||
| - | Extended meanings indicated by lit. .. fig. combination when appropriate or when a common expression refers back to a classical incident or chengyu, one can refer to it with cf (incident | + | |
| + | Tens of thousands of Chinese characters, if not a hundred thousand, have been created over the course of Chinese history. Many of them are archaic or obscure and have not been used in centuries, perhaps millennia, so it may not be possible to provide a definition. If you can't find a character in common dictionaries, | ||
| Line 147: | Line 127: | ||
| Taiwanese GuoYu sometimes prefers not to use the neutral tone, so we do not list Taiwan pronunciations when they consist only of saying " | Taiwanese GuoYu sometimes prefers not to use the neutral tone, so we do not list Taiwan pronunciations when they consist only of saying " | ||
| + | ==== Labels ==== | ||
| - | ==== Ambiguity due to homonyms ==== | + | See [[labels]] |
| - | + | ||
| - | Sometimes words used in the English definitions can have multiple meanings. If the Chinese word does not have these additional meanings, additional information should be provided to prevent ambiguity: | + | |
| - | 首都 首都 | + | |
| - | + | ||
| - | The text between the parentheses is " | + | |
| ==== References ==== | ==== References ==== | ||
| Line 161: | Line 137: | ||
| ==== Classifiers ==== | ==== Classifiers ==== | ||
| - | Classifiers | + | Classifiers, or "measure |
| - | 避風港 避风港 [bi4 feng1 gang3] / | + | |
| - | Classifiers follow the ' | + | < |
| + | 麵包 面包 [[mian4bao1]] / | ||
| + | 麵包店 面包店 [[mian4bao1dian4]] /bakery/CL:家[jia1]/ | ||
| + | </ | ||
| - | The classifier words itself can be described using:\\ | + | They follow the reference syntax of traditional|simplified[pinyin], |
| - | /classifier for small round things (peas, bullets, peanuts, pills, grains etc)/ | + | |
| + | We typically omit general classifiers like 個|个[ge4], which can be applied to almost any noun. | ||
| + | |||
| + | A classifier itself can be described like so: | ||
| + | |||
| + | < | ||
| + | /classifier for small round things (peas, bullets, peanuts, pills, grains etc)/ | ||
| + | </ | ||
| ==== Bound forms ==== | ==== Bound forms ==== | ||
| Line 175: | Line 159: | ||
| ===== Punctuation ===== | ===== Punctuation ===== | ||
| + | ==== Dashes and hyphens ==== | ||
| + | |||
| + | We do not use the em dash (—). Ranges of numbers, dates, times etc should be separated by the en dash (–). In other cases, either the en dash or hyphen (-) should be used following standard English grammar. | ||
| ==== Middle dot ==== | ==== Middle dot ==== | ||
| Line 206: | Line 193: | ||
| </ | </ | ||
| - | ===== Retroflex finals | + | ===== 兒|儿, erhua and rhotacization |
| - | There are 3 kinds of R-ised words that use the 兒/儿 character: | + | The 兒|儿 character |
| - | - 兒/儿 is not-optional because it's its own syllable (usually meaning " | + | |
| - | - 兒/儿 is not-optional because it changes the definition of the word and is tacked on to the preceding syllable - 头兒/ | + | |
| - | - 兒/儿 is an optional northern pronunciation (er2hua4) and is tacked on to the preceding syllable - 花兒/ | + | |
| - | These 3 cases should be formatted as follows: | + | 1. 兒|儿[er2] |
| - | - 女兒 女儿 [nu:3 er2] / | + | |
| - | | + | |
| - | - 花兒 花儿 [hua1 r5] /erhua variant of 花/flower/ | + | |
| - | //Please note: words ending with ' | + | < |
| + | 女兒 女儿 | ||
| + | </code> | ||
| + | 2. 兒|儿[r5] is a non-optional suffix because it changes both the pronunciation and meaning of the word | ||
| + | |||
| + | < | ||
| + | 頭兒 头儿 [[tou2r5]] /leader/ | ||
| + | </ | ||
| + | |||
| + | 3. 兒|儿[r5] is an optional suffix, changing the pronunciation of the word but not the meaning | ||
| + | |||
| + | < | ||
| + | 花兒 花儿 [[hua1r5]] /erhua form of 花[hua1]/ | ||
| + | </ | ||
| + | //Please note: words ending with ' | ||
| ===== Choice of entries and translations ===== | ===== Choice of entries and translations ===== | ||
| Line 232: | Line 227: | ||
| - | ===== General principles | + | ===== Romanization |
| - | The English should be meaningful, not horribly ugly, and bear a close relation to the Chinese meaning. It should correspond to something that could be used naturally by an English speaker (I think Arthur Waley has some advice saying that just because a text is about magnetohydrodynamics, | + | When transcribing foreign words in definitions, please use the following romanization methods: |
| + | * Japanese: [[http://en.wikipedia.org/ | ||
| + | * Korean: [[http:// | ||
| - | On the other hand, a translation always loses something, and the translator | + | If an alternative romanization method is more popular for a certain word, that version |
| - | Names of persons should say dates if possible (birth, death, years in which the person was active in a certain role, etc), what interest the person has (writer, general, pop star, etc), brief indications of CV (e.g. took part in a revolution, was murdered, wrote famous book, etc). For example:\\ 胡錦濤 胡锦涛 [Hu2 Jin3 tao1] /Hu Jintao (1942-), president of PRC from 2003/ | + | ===== Non-Chinese characters ===== |
| - | Names of plants, animals, musical instruments should give common name and scientific name when appropriate; | + | On occasion |
| - | Most words have more than one meaning, and more than one grammatical function. Care is needed not to concentrate only on a specific occurrence | + | < |
| + | # English letters | ||
| + | ky ky [[ky]] /(slang) socially tone-deaf; unable | ||
| + | coser coser [[coser]] /cosplayer/ | ||
| - | There are 20,000 Chinese characters in the more advanced dictionaries, | + | # Mix of English |
| + | e人 e人 [[e-ren2]] /(slang) extroverted person/ | ||
| + | 勿cue 勿cue [[wu4-cue]] /(Internet slang) don't call on me; don't drag me in/ | ||
| - | ===== Variants ===== | + | # Numbers |
| + | 3D打印 3D打印 [[san1-D da3yin4]] /to 3D print; 3D printing/ | ||
| + | 95後 95后 [[jiu3wu3hou4]] /people born between 1995-01-01 and 1999-12-31/ | ||
| + | 996 996 [[jiu3jiu3liu4]] /9am–9pm, six days a week (work schedule)/ | ||
| + | </ | ||
| - | Many characters have variants, sometimes more than one, sometimes with identical meaning or quite different meanings. Some choice of variants found in texts on websites will arise because of the different input methods, and the user may have had no intention of using the variant. | + | As a general rule of thumb: |
| + | - When writing the Hanzi fields, non-Chinese characters should stay the same. | ||
| + | - When writing the pinyin, for English letters use the same letters | ||
| - | You can get rough usage frequency information by searching the alternative word forms in Google. Please use this syntax to make sure that Google doesn' | ||
| - | < | ||
| - | Additionally you can use Google' | + | ==== Technical details, and the use of {} ==== |
| - | 789 Chinese (Traditional) pages for +" | + | |
| - | 17,700 Chinese (Simplified) pages for +" | + | |
| - | 1,750 Chinese (Traditional) pages for +" | + | |
| - | 66,900 Chinese (Simplified) pages for +" | + | |
| - | It often happens that Google tells you that +" | + | When parsing the traditional and simplified fields, Hanzi and numbers are treated |
| - | When there are alternative forms of the same expression, and the less common form is at most 5 times less common, the less common entry should have /also written | + | The pinyin is first split by spaces and punctuation, |
| + | The number of parsed sections in the Hanzi must match the number of parsed sections in the pinyin. For the vast majority of entries, this does not pose a problem. Almost all Chinese characters are one syllable in length, and due to how the parsing logic works, numbers and English letters will be parsed correctly as long as the pinyin is segmented correctly. Problems arise in rare situations such as | ||
| - | **PROPOSED CHANGES** | + | < |
| + | 兡 兡 [[bai3ke4]] /.../ | ||
| + | </ | ||
| - | (Summary: (1) Get rid of "also written", | + | where a single character corresponds to two syllables. In these cases, {}'s may be used to manually group a section, so we can write |
| - | + | ||
| - | (THE VARIANT RULES ABOVE CAN BE DELETED IF AND WHEN THESE CHANGES ARE ACCEPTED.) | + | |
| - | (Also, the following notes can be tidied up and edited to remove references to " | + | < |
| + | 兡 兡 [[{bai3ke4}]] /.../ | ||
| + | </ | ||
| - | Regarding | + | which indicates |
| - | According to our wiki, there are two kinds of variants. | + | Another problem arises for entries with a number with multiple digits. Consider a hypothetical entry such as |
| - | https:// | + | |
| - | 1) Where the less common form is relatively common (> 20% of the frequency of the more common form). | + | < |
| + | 11 11 [[shi2yi1]] /eleven/ | ||
| + | </code> | ||
| - | 2) Where the less common form is much less common (< 20% of the frequency of the more common form) | + | which implies that the first 1 is pronounced " |
| - | For the first type, the def of the less common form should look like this (according to the wiki): | + | < |
| - | < | + | 21 21 [[er4shi2yi1]] |
| + | </ | ||
| - | And for the second type, the def of the less common form should be | + | which poses a different problem - we have two Hanzi sections but three pinyin sections due to the extra " |
| - | < | + | |
| - | In practice, what has been happening in recent years is this: | + | < |
| + | {21}三體綜合症 {21}三体综合症 [[{21} san1ti3 zong1he2zheng4]] /trisomy; Down's syndrome/ | ||
| + | </ | ||
| - | 1. We have been ignoring | + | Note this is different from the 996 example above, which is treated as 3 digits |
| - | 2. With variants, | + | To check whether |
| - | + | https://cc-cedict.org/editor/editor.php? | |
| - | a) if it's a full variant (i.e. exactly the same definition), | + | |
| - | + | ||
| - | b) if it's a partial variant (i.e. only some of the senses of one form apply to the other form) we use | + | |
| - | < | + | |
| - | + | ||
| - | Part of the rationale for these changes is this: It's a hassle to check whether | + | |
| - | + | ||
| - | Using the Editor website' | + | |
| - | + | ||
| - | One idea that I've had in mind for a while is to clean up all these by | + | |
| - | + | ||
| - | a) rewriting "also written" | + | |
| - | + | ||
| - | b) regularizing the format of the " | + | |
| - | + | ||
| - | + | ||
| - | + | ||
| - | ===== Romanization of foreign languages ===== | + | |
| - | + | ||
| - | When transcribing foreign words in definitions, | + | |
| - | * Japanese: [[http://en.wikipedia.org/wiki/Hepburn_romanization|Modified Hepburn]] | + | |
| - | * Korean: [[http:// | + | |
| - | + | ||
| - | If an alternative romanization method is more popular for a certain word, that version can be added as an additional translation. | + | |
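
The overall entry layout described on this page (Traditional Simplified [[pin1yin1]] /gloss; gloss; .../gloss; gloss; .../) can be illustrated with a small Python sketch. It is not the parser used by the Editor website: the names ''ENTRY_RE'', ''Entry'' and ''parse_entry'' are hypothetical, and the regular expression is only an assumption about what a well-formed line looks like. It applies three rules stated above: double square brackets mark a v2 entry, the traditional and simplified forms must have the same length, and senses are separated by / while glosses within a sense are separated by ;.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern: headwords without spaces, pinyin in [[...]] (v2)
# or [...] (v1), then a definition wrapped in slashes.
ENTRY_RE = re.compile(
    r"^(?P<trad>\S+) (?P<simp>\S+) "
    r"(?P<pinyin>\[\[.*?\]\]|\[.*?\]) "
    r"/(?P<defn>.+)/$"
)


@dataclass
class Entry:
    traditional: str
    simplified: str
    pinyin: str   # pinyin with the surrounding brackets removed
    is_v2: bool   # True when the pinyin used double square brackets
    senses: list  # list of senses, each a list of gloss strings


def parse_entry(line: str) -> Entry:
    """Split one entry line into its fields (illustrative only)."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a recognizable entry: {line!r}")

    trad, simp = m["trad"], m["simp"]
    # Rule from this page: both forms must have the same length.
    if len(trad) != len(simp):
        raise ValueError("traditional and simplified forms differ in length")

    raw = m["pinyin"]
    is_v2 = raw.startswith("[[")
    pinyin = raw[2:-2] if is_v2 else raw[1:-1]

    # Senses are separated by "/", glosses within a sense by ";".
    # Naive split: a ";" inside parentheses would also be split here.
    senses = [
        [gloss.strip() for gloss in sense.split(";")]
        for sense in m["defn"].split("/")
    ]
    return Entry(trad, simp, pinyin, is_v2, senses)


if __name__ == "__main__":
    entry = parse_entry(
        "皮實 皮实 [[pi2shi5]] /(of things) durable/(of people) sturdy; tough/"
    )
    print(entry.is_v2, entry.pinyin, entry.senses)
```

For the 皮實 example this yields two senses, the second holding two glosses. The sketch does not attempt the section matching between Hanzi and pinyin or the {} grouping, which remain the responsibility of the Editor website.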
syntax_v2.1776739750.txt.gz · Last modified: by kbaiko
