format:syntax_v2
Differences
This shows you the differences between two versions of the page.
format:syntax_v2 [2024/07/28 08:09] – created mdbg | format:syntax_v2 [2024/09/12 23:02] (current) – [CC-CEDICT V2 Syntax] richwarm | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Syntax ====== | + | ====== |
//**TODO:** work in progress!// | //**TODO:** work in progress!// | ||
- | // | + | Version 2 (v2) introduces a new syntax for the pinyin of an entry, allowing for the specification of pinyin that follows standard pinyin orthography. In particular, it enables the combination of syllables to form words. For example, in v2, 二次方程 (quadratic equation) can now be written as two words, " |
+ | |||
+ | Below are guidelines on what CC-CEDICT entries **should** look like. CC-CEDICT still has many old entries | ||
+ | |||
+ | In particular: | ||
+ | - Prior to April 2022, glosses and senses were separated using a /. As of April 2022, senses are to be separated with a / while glosses are to be separated with a ;. (This was a change in v1 format of definitions, | ||
+ | - In December 2023, CC-CEDICT began adopting the v2 pinyin format. | ||
+ | |||
+ | An entry is considered to be in v2 format if it uses double square brackets for the pinyin. | ||
+ | < | ||
+ | [[pin1yin1]] rather than [pin1 yin1] | ||
+ | </ | ||
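The double-bracket rule above can be checked mechanically. The following is a minimal sketch, assuming one entry per line with the headword forms separated by single spaces; the function name `is_v2_entry` is made up for this illustration and is not part of any CC-CEDICT tooling.

```python
import re

# Assumed line layout: "Traditional Simplified [[pinyin]] /definition/".
# An entry counts as v2 when the pinyin field opens with "[[" and closes with "]]".
V2_PINYIN_RE = re.compile(r"^\S+ \S+ \[\[[^\]]+\]\] ")


def is_v2_entry(line: str) -> bool:
    """Return True if the entry line uses the v2 double-bracket pinyin field."""
    return V2_PINYIN_RE.match(line) is not None


print(is_v2_entry("王 王 [[wang2]] /king/"))
print(is_v2_entry("首都 首都 [shou3 du1] /capital (city)/"))
```

Note that this only detects the pinyin format. It says nothing about whether the rest of the entry has been reviewed.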
+ | |||
+ | However, when updating the pinyin of an entry, the rest of the entry should also be reviewed. If this is done, it means that v2 pinyin format signifies not only that the pinyin format has been updated, but also that the definition has been checked for correctness and proper format: it's a way of keeping track of which entries have old definitions that need to be reviewed. | ||
+ | |||
+ | As of August 2024, roughly 10% of entries have been converted to v2 by editors. | ||
+ | |||
+ | Three editions of CC-CEDICT are published regularly: | ||
+ | * **Version 1**, in which any v2 entries are converted back to v1 by a script. | ||
+ | * **Mixed**, in which entries are v2 if they have been converted by an editor, or v1 otherwise. | ||
+ | * **Version 2**, in which any v1 entries are converted to v2 by a script. | ||
+ | |||
+ | These three editions are available for download at | ||
+ | https:// | ||
+ | |||
+ | When conversion to v2 is complete, the Mixed edition will be the same as Version 2 and therefore redundant, so only Version 1 and Version 2 will be published after that time. | ||
===== Basic format ===== | ===== Basic format ===== | ||
Line 9: | Line 34: | ||
The basic format of a CC-CEDICT entry is: | The basic format of a CC-CEDICT entry is: | ||
< | < | ||
- | Traditional Simplified [pin1 yin1] /gloss; gloss; .../gloss; gloss; .../ | + | Traditional Simplified [[pin1yin1]] /gloss; gloss; .../gloss; gloss; .../ |
</ | </ | ||
For example: | For example: | ||
< | < | ||
- | 皮實 皮实 [pi2 shi5] /(of things) durable/(of people) sturdy; tough/ | + | 皮實 皮实 [[pi2shi5]] /(of things) durable/(of people) sturdy; tough/ |
</ | </ | ||
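To make the basic format concrete, here is a rough sketch of how an entry line could be split into its parts. It assumes the post-April-2022 convention (senses separated by "/", glosses by ";") and accepts either bracket style for the pinyin. The names `Entry` and `parse_entry` are hypothetical, and the naive split on ";" would mis-handle a semicolon inside parentheses.

```python
import re
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    traditional: str
    simplified: str
    pinyin: str
    is_v2: bool
    senses: List[List[str]]  # one list of glosses per sense


# headword forms, then [pinyin] or [[pinyin]], then /sense/sense/.../
ENTRY_RE = re.compile(r"^(\S+) (\S+) (\[\[?)([^\]]+)\]\]? /(.+)/\s*$")


def parse_entry(line: str) -> Optional[Entry]:
    """Split one entry line into headwords, pinyin, and senses/glosses."""
    m = ENTRY_RE.match(line)
    if m is None:
        return None
    trad, simp, opener, pinyin, definition = m.groups()
    # Senses are separated by "/", glosses within a sense by ";".
    senses = [[g.strip() for g in sense.split(";")]
              for sense in definition.split("/")]
    return Entry(trad, simp, pinyin, opener == "[[", senses)


entry = parse_entry("皮實 皮实 [[pi2shi5]] /(of things) durable/(of people) sturdy; tough/")
print(entry.senses)
```

For the example entry, this yields two senses, the second of which has two glosses.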
- | ==== Semicolons ==== | + | **Important: |
- | Note that senses are separated by a slash, while glosses for the same sense are separated by a semicolon. | + | < |
+ | 王 王 [[Wang2]] /surname Wang/ | ||
+ | 王 王 [[wang2]] /king/ | ||
+ | </ | ||
- | The semicolon was used for this purpose | + | If an editor wants to add additional senses |
+ | ==== Traditional and simplified characters ==== | ||
+ | |||
+ | The Chinese word should consist of one or more Chinese characters, without any spaces | ||
+ | |||
+ | There are a very small number of entries | ||
+ | |||
+ | < | ||
+ | % % [pa1] /percent (Tw)/ | ||
+ | 3C 3C [san1 C] /computers, communications, | ||
+ | 421 421 [si4 er4 yi1] /four grandparents, | ||
+ | K人 K人 [K ren2] / | ||
+ | </ | ||
+ | |||
+ | **Below are some notes on how these entries are handled in v2.** | ||
+ | |||
+ | Let's take " | ||
+ | |||
+ | There are several ways one might like to render " | ||
+ | - e-rén | ||
+ | - erén | ||
+ | - yìrén | ||
+ | |||
+ | The Editor website attempts to match the parts of the headword with the parts of the pinyin, and will, if necessary, treat some parts as " | ||
+ | |||
+ | For example, in the following entry, " | ||
+ | < | ||
+ | |||
+ | If the Editor website https:// | ||
+ | |||
+ | |||
+ | < | ||
+ | |||
+ | To specify " | ||
+ | < | ||
+ | |||
+ | ... as would several other forms, including | ||
+ | < | ||
+ | |||
+ | Here is a link to a webpage where a proposed entry can be tested to see if it can be parsed correctly. | ||
+ | |||
+ | "Parse entry" webpage: | ||
+ | https:// | ||
+ | |||
+ | To specify " | ||
+ | |||
+ | < | ||
+ | |||
+ | Generally, it is regarded as preferable not to indicate the pronunciation of non-Chinese parts of a headword (such as " | ||
+ | |||
+ | |||
+ | ==== Pinyin ==== | ||
+ | |||
+ | The pinyin should be in accordance with standard pinyin orthography, | ||
+ | |||
+ | Proper nouns should be capitalized, | ||
+ | < | ||
+ | 蘋果手機 苹果手机 [[Ping2guo3 shou3ji1]] /iPhone/ | ||
+ | 師生 师生 [[shi1-sheng1]] /teachers and students/ | ||
+ | </ | ||
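Because every syllable ends in a tone number, word-joined v2 pinyin can be split back into individual syllables. The sketch below illustrates the idea only; it is not the script used to produce the published editions, and the name `split_syllables` is an assumption. It drops anything that is not a letter run followed by a tone digit (hyphens, spaces, and bare letters such as the "C" in "san1 C").

```python
import re
from typing import List

# A syllable is taken to be a run of letters (plus ":" for "u:") ending in a tone digit 1-5.
SYLLABLE_RE = re.compile(r"[A-Za-z:]+[1-5]")


def split_syllables(v2_pinyin: str) -> List[str]:
    """Return the tone-numbered syllables found in a v2 pinyin string."""
    return SYLLABLE_RE.findall(v2_pinyin)


print(" ".join(split_syllables("Ping2guo3 shou3ji1")))
print(" ".join(split_syllables("shi1-sheng1")))
```

Joining the result with spaces gives a v1-style rendering for simple cases such as the two examples above.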
+ | |||
+ | Other rules about our pinyin format: | ||
+ | - The neutral tone uses the numeral " | ||
+ | - ü (u with an umlaut) is written as “u:”. For example, 女 -> nu:3 | ||
+ | - 儿 as the “retroflex final” is written as r5 | ||
+ | - Raw tones should be used: | ||
+ | - Tone sandhi is **not** indicated (e.g., ni3 hao3 is not changed to ni2 hao3) | ||
+ | - Although " | ||
+ | - Word-related changes to neutral tone, however, **are** indicated. These are especially common with reduplicated forms (e.g., use ma1 ma5, not ma1 ma1; ba4 ba5, not ba4 ba4; kan4 kan5, not kan4 kan4; xiang3 xiang5 ("take under consideration" | ||
+ | - Non-Chinese characters: Letters should be written as they are, while numbers should be written out using pinyin, for example, 3C becomes “san1 C”. | ||
+ | - xx5: There are a few entries | ||
+ | |||
+ | < | ||
+ | 々 々 [xx5] /iteration mark (used to represent a duplicated character)/ | ||
+ | ㍽ ㍽ [xx5] /大正[Da4 zheng4] written as a single character/ | ||
+ | 朩 朩 [xx5] /one of the characters used in kwukyel (phonetic " | ||
+ | 込 込 [xx5] /(Japanese kokuji) to be crowded; to go into/ | ||
+ | </ | ||
+ | |||
+ | ==== Definition ==== | ||
+ | |||
+ | Definitions should be written in American English. | ||
+ | |||
+ | Senses should be separated using a slash "/", | ||
+ | |||
+ | Do not add definite or indefinite articles (e.g. " | ||
+ | |||
+ | Don't use parts of speech. Instead try to give an indication of grammatical usage within the English definition. CC-CEDICT | ||
+ | |||
+ | Abbreviations such as etc, cf, e.g. and i.e. do not need any further punctuation. | ||
+ | |||
+ | Extended meanings are indicated by a lit. .. fig. combination when appropriate. When a common expression refers back to a classical incident or chengyu, one can refer to it with cf (incident in Records of the Historian). | ||
+ | |||
+ | |||
+ | |||
+ | ===== Special syntax ===== | ||
+ | |||
+ | ==== Taiwanese pronunciation ==== | ||
+ | |||
+ | CC-CEDICT follows " | ||
+ | 叔叔 叔叔 [shu1 shu5] /(informal) father' | ||
+ | |||
+ | Taiwanese GuoYu sometimes prefers not to use the neutral tone, so we do not list Taiwan pronunciations when they consist only of saying " | ||
+ | |||
+ | |||
+ | ==== Ambiguity due to homonyms ==== | ||
+ | |||
+ | Sometimes words used in the English | ||
+ | 首都 首都 [shou3 du1] /capital (city)/ | ||
+ | |||
+ | The text between the parentheses is " | ||
+ | |||
+ | ==== References ==== | ||
+ | |||
+ | The English | ||
+ | 漢字|汉字[Han4 zi4] | ||
+ | |||
+ | For example:\\ | ||
+ | 股指 股指 [gu3 zhi3] /stock market index/share price index/abbr. for 股票指數|股票指数[gu3 piao4 zhi3 shu4]/ | ||
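References in this form are regular enough to be picked out of a definition with a pattern. The following sketch assumes the shape shown above: a traditional form, an optional "|" plus simplified form, then the pinyin in single square brackets. The name `find_references` is hypothetical, and treating a single written form as both traditional and simplified is an assumption of this example.

```python
import re
from typing import List, Tuple

# word, optional "|word", then "[pinyin]"; words may not contain
# whitespace, slashes, pipes, brackets or parentheses.
REFERENCE_RE = re.compile(
    r"(?P<trad>[^\s/|\[\]()]+)"
    r"(?:\|(?P<simp>[^\s/|\[\]()]+))?"
    r"\[(?P<pinyin>[^\]]+)\]"
)


def find_references(definition: str) -> List[Tuple[str, str, str]]:
    """Return (traditional, simplified, pinyin) for each reference found."""
    refs = []
    for m in REFERENCE_RE.finditer(definition):
        trad = m.group("trad")
        simp = m.group("simp") or trad  # assumption: one form stands for both
        refs.append((trad, simp, m.group("pinyin")))
    return refs


print(find_references(
    "/stock market index/share price index/"
    "abbr. for 股票指數|股票指数[gu3 piao4 zhi3 shu4]/"
))
```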
+ | |||
+ | ==== Classifiers ==== | ||
+ | |||
+ | Classifiers (also called " | ||
+ | 避風港 避风港 [bi4 feng1 gang3] / | ||
+ | |||
+ | Classifiers follow the ' | ||
+ | |||
+ | The classifier word itself can be described | ||
+ | /classifier for small round things (peas, bullets, peanuts, pills, grains etc)/ | ||
+ | |||
+ | ==== Bound forms ==== | ||
+ | |||
+ | A bound form is a morpheme that only appears as part of a larger expression. In English, bound forms tend to be prefixes or suffixes such as “-ly”, “-est”, “pre-”, “post-” etc and generally are not words by themselves. In Chinese, however, characters can be either bound or free, and it can be difficult to determine which. Some characters can have multiple bound and multiple free senses. | ||
+ | |||
+ | There are two types of bound forms in Chinese, those with meanings and those without. | ||
+ | |||
+ | === Meaningful bound forms === | ||
+ | |||
+ | These are bound forms where a meaning can be assigned to the character. Oftentimes they appear in multiple words with the same meaning, but never by themselves. We indicate these by prefixing the sense with “(bound form)”. | ||
+ | |||
+ | For instance: | ||
+ | |||
+ | < | ||
+ | 隘 隘 [[ai4]] /(bound form) narrow/ | ||
+ | </ | ||
+ | |||
+ | Here 隘 is a bound form, as you would not see it alone when reading Chinese. It would always be accompanied by other characters, as in 隘口, 隘路, 关隘, 狭隘 etc. | ||
+ | |||
+ | === Meaningless bound forms === | ||
+ | |||
+ | These are bound forms where a meaning cannot be assigned to the character, usually because the character appears in only a small number of words (usually just one). Oftentimes these are the names of plants or animals, or terms used in literature. For these characters, the entry is simply “used in …”. | ||
+ | |||
+ | For example: | ||
+ | |||
+ | < | ||
+ | 鮟 𩽾 [an1] /used in 鮟鱇|𩽾𩾌[an1 kang1]/ | ||
+ | 鱇 𩾌 [[kang1]] /used in 鮟鱇|𩽾𩾌[an1kang1]/ | ||
+ | 鮟鱇 𩽾𩾌 [an1 kang1] / | ||
+ | </ | ||
+ | |||
+ | 𩽾 and 𩾌 by themselves have no meaning, as they are always used with each other. 𩽾𩾌 is the anglerfish. | ||
+ | |||
+ | A small number of meaningless bound forms are used in multiple words. In this case, all of the words should be listed. When the words have the same or similar meaning, they should be combined into one sense; when the words have different meanings, they should be separated into different senses. | ||
+ | |||
+ | Different senses | ||
+ | |||
+ | < | ||
+ | 螞 蚂 [[ma3]] /used in 螞蟥|蚂蟥[ma3huang2]/ | ||
+ | 蝲 蝲 [la4] /used in 蝲蛄[la4 gu3]/used in 蝲蝲蛄[la4 la4 gu3]/ | ||
+ | 蛞 蛞 [[kuo4]] /used in 蛞螻|蛞蝼[kuo4lou2]/ | ||
+ | 猻 狲 [[sun1]] /used in 猢猻|猢狲[hu2sun1]/ | ||
+ | </ | ||
+ | |||
+ | Same sense | ||
+ | < | ||
+ | 箢 箢 [yuan1] /used in 箢箕[yuan1 ji1] and 箢篼[yuan1 dou1]/ | ||
+ | 癔 癔 [[yi4]] /used in 癔病[yi4bing4] and 癔症[yi4zheng4]/ | ||
+ | 咐 咐 [[fu4]] /used in 吩咐[fen1fu5] and 囑咐|嘱咐[zhu3fu5]/ | ||
+ | </ | ||
+ | |||
+ | An example of both | ||
+ | |||
+ | < | ||
+ | 螂 螂 [[lang2]] /used in 螞螂|蚂螂[ma1lang2]/ | ||
+ | </ | ||
- | ==== In addition: ==== | ||
- | * The Chinese word should consist of one or more Chinese characters, without any spaces in it | ||
- | * The Mandarin pinyin should follow in the format below: | ||
- | * It should have a space between each pinyin syllable | ||
- | * Each pinyin syllable should have a tone number. Use 5 for the light tone (e.g. ni3 hao3 ma5) | ||
- | * Raw tones should be used: | ||
- | * Tone sandhi is **not** indicated (e.g., ni3 hao3 is not changed to ni2 hao3) | ||
- | * Although " | ||
- | * Word-related changes to neutral tone, however, **are** indicated. These are especially common with reduplicated forms (e.g., use ma1 ma5, not ma1 ma1; ba4 ba5, not ba4 ba4; kan4 kan5, not kan4 kan4; xiang3 xiang5 ("take under consideration" | ||
- | * For pinyin that uses the ü, represent it with a u followed by a colon (e.g. nu:3 ren2) | ||
- | * Capitalize pinyin for proper nouns (e.g. **B**ei3 jing1) | ||
- | * The English definitions should be separated with the '/' | ||
- | * American English should be used for the English definitions | ||
- | * Do not add definite or indefinite articles (e.g. " | ||
===== Punctuation ===== | ===== Punctuation ===== | ||
Line 52: | Line 250: | ||
==== Comma ==== | ==== Comma ==== | ||
- | Commas are sometimes used in Chinese proverbs:\\ | + | Commas are sometimes used in Chinese proverbs: |
- | 人為財死,鳥為食亡 人为财死,鸟为食亡 [ren2 wei4 cai2 si3 , niao3 wei4 shi2 wang2] /Human beings die in pursuit of wealth, and birds die in pursuit of food/.../ | + | < |
+ | 人為財死,鳥為食亡 人为财死,鸟为食亡 | ||
+ | </ | ||
- | A double width comma is used in the Chinese, a single width comma padded with spaces on both sides is used in the pinyin. | + | A **double width comma** is used in the Chinese. In the pinyin, **a single width comma followed by a space** |
Line 72: | Line 272: | ||
//Please note: words ending with ' | //Please note: words ending with ' | ||
- | ===== Taiwanese pronunciation ===== | ||
- | |||
- | CC-CEDICT follows " | ||
- | 叔叔 叔叔 [shu1 shu5] /(informal) father' | ||
- | |||
- | Taiwanese GuoYu sometimes prefers not to use the neutral tone, so we do not list Taiwan pronunciations when they consist only of saying " | ||
- | |||
- | |||
- | ===== General principles ===== | ||
- | |||
- | Various trivial style things: | ||
- | * Don't use parts of speech. Instead try to give an indication of grammatical usage within the English definition. CC-CEDICT is a human readable descriptive dictionary, not a resource intended for machine processing. | ||
- | * Abbreviations etc cf e.g. i.e. do not need any further punctuation. | ||
- | * Extended meanings indicated by lit. .. fig. combination when appropriate or when a common expression refers back to a classical incident or chengyu, one can refer to it with cf (incident in Records of the Historian). | ||
===== Choice of entries and translations ===== | ===== Choice of entries and translations ===== | ||
Line 111: | Line 297: | ||
There are 20,000 Chinese characters in the more advanced dictionaries, | There are 20,000 Chinese characters in the more advanced dictionaries, | ||
- | |||
- | ===== Ambiguity due to homonyms ===== | ||
- | |||
- | Sometimes words used in the English definitions can have multiple meanings. If the Chinese word does not have these additional meanings, additional information should be provided to prevent ambiguity: | ||
- | 首都 首都 [shou3 du1] /capital (city)/ | ||
- | |||
- | The text between the parentheses is " | ||
- | |||
- | ===== References ===== | ||
- | |||
- | The English definitions can contain references to other Chinese words. These should be noted as follows: | ||
- | 漢字|汉字[Han4 zi4] | ||
- | |||
- | For example: | ||
- | 股指 股指 [gu3 zhi3] /stock market index/share price index/abbr. for 股票指數|股票指数[gu3 piao4 zhi3 shu4]/ | ||
- | |||
- | ===== Classifiers ===== | ||
- | |||
- | Classifiers (also called " | ||
- | 避風港 避风港 [bi4 feng1 gang3] / | ||
- | |||
- | Classifiers follow the ' | ||
- | |||
- | The classifier words itself can be described using: | ||
- | /classifier for small round things (peas, bullets, peanuts, pills, grains etc)/ | ||
===== Variants ===== | ===== Variants ===== | ||
Line 153: | Line 314: | ||
When there are alternative forms of the same expression, and the less common form is at most 5 times less common, the less common entry should have /also written ../ referring to the more common form, e.g. 撐竿跳高 撑竿跳高 [cheng1 gan1 tiao4 gao1] / | When there are alternative forms of the same expression, and the less common form is at most 5 times less common, the less common entry should have /also written ../ referring to the more common form, e.g. 撐竿跳高 撑竿跳高 [cheng1 gan1 tiao4 gao1] / | ||
+ | |||
+ | |||
+ | **PROPOSED CHANGES** | ||
+ | |||
+ | (Summary: (1) Get rid of "also written", | ||
+ | |||
+ | (THE VARIANT RULES ABOVE CAN BE DELETED IF AND WHEN THESE CHANGES ARE ACCEPTED.) | ||
+ | |||
+ | (Also, the following notes can be tidied up and edited to remove references to " | ||
+ | |||
+ | Regarding "also written..." | ||
+ | |||
+ | According to our wiki, there are two kinds of variants. | ||
+ | https:// | ||
+ | |||
+ | 1) Where the less common form is relatively common (> 20% of the frequency of the more common form). | ||
+ | |||
+ | 2) Where the less common form is much less common (< 20% of the frequency of the more common form). | ||
+ | |||
+ | For the first type, the def of the less common form should look like this (according to the wiki): | ||
+ | < | ||
+ | |||
+ | And for the second type, the def of the less common form should be | ||
+ | < | ||
+ | |||
+ | In practice, what has been happening in recent years is this: | ||
+ | |||
+ | 1. We have been ignoring the "also written ..." syntax, except maybe when we edit existing entries | ||
+ | |||
+ | 2. With variants, | ||
+ | |||
+ | a) if it's a full variant (i.e. exactly the same definition), | ||
+ | |||
+ | b) if it's a partial variant (i.e. only some of the senses of one form apply to the other form) we use | ||
+ | < | ||
+ | |||
+ | Part of the rationale for these changes is this: It's a hassle to check whether entries satisfy the "20% criteria", | ||
+ | |||
+ | Using the Editor website' | ||
+ | |||
+ | One idea that I've had in mind for a while is to clean up all these by | ||
+ | |||
+ | a) rewriting "also written" | ||
+ | |||
+ | b) regularizing the format of the " | ||
+ | |||
+ | |||
===== Romanization of foreign languages ===== | ===== Romanization of foreign languages ===== |
format/syntax_v2.1722154164.txt.gz · Last modified: 2024/07/28 08:09 by mdbg