syntax_v2

Differences

This shows you the differences between two versions of the page.

syntax_v2 [2025/10/17 06:32] mdbg
syntax_v2 [2026/05/16 09:39] (current) – Handling numbers with multiple digits kbaiko
Line 4: Line 4:
  
 // Below are guidelines on what CC-CEDICT entries **should** look like. CC-CEDICT still has many old entries that do not comply with these rules yet. //
- 
-Version 2 (v2) introduces a new syntax for the pinyin of an entry, allowing for the specification of pinyin that follows standard pinyin orthography. In particular, it enables the combination of syllables to form words. For example, in v2, 二次方程 (quadratic equation) can now be written as two words, "er4ci4 fang1cheng2" (i.e., èrcì fāngchéng), rather than as four separate syllables, "er4 ci4 fang1 cheng2", as was required in v1. 
  
 An entry is considered to be in v2 format if it uses double square brackets for the pinyin. v1 entries use a single square bracket.
 +
 +The primary difference between v1 and v2 is that v2 entries follow standard pinyin orthography. In v1, all pinyin was written with a space between syllables. In v2, syllables can be combined to form words. For example:
 +
 <code>
-v2: [[pin1yin1]]
-v1: [pin1 yin1]
+v1: 二次方程 二次方程 [er4 ci4 fang1 cheng2] /(math.) quadratic equation/
+v2: 二次方程 二次方程 [[er4ci4 fang1cheng2]] /(math.) quadratic equation/
 </code>
  
-However, when updating the pinyin of an entry, the rest of the entry should also be reviewed. If this is done, it means that v2 pinyin format signifies not only that the pinyin format has been updated, but also that the definition has been checked for correctness and proper format: it's a way of keeping track of which entries have old definitions that need to be reviewed.
+However, converting an entry involves more than correcting its pinyin: the rest of the entry must also be reviewed. When this is done, v2 pinyin format signifies not only that the pinyin has been updated, but also that the definition has been checked for correctness and proper format. It is a way of keeping track of which entries have old definitions that still need to be reviewed.
  
 In particular, prior to April 2022, glosses and senses were separated using a /. As of April 2022, senses are to be separated with a / while glosses are to be separated with a ;. (This was a change in v1 format of definitions, but its progressive introduction largely coincides with the conversion of pinyin to v2 format.)
 +
 +A number of other (mostly minor) syntax and format changes have also been established over the years, and are outlined on this wiki. v1 entries (some of which date back to 1998) may not follow our latest conventions, but v2 entries should. Part of the v2 conversion process is making sure these rules are followed.
 
 Three editions of CC-CEDICT are published regularly:
Line 32: Line 35:
 <code>
 Traditional Simplified [[pin1yin1]] /gloss; gloss; .../gloss; gloss; .../
-</code>
-
-For example:
-<code>
-皮實 皮实 [[pi2shi5]] /(of things) durable/(of people) sturdy; tough/
 </code>
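
As a rough illustration of the entry layout above, the following Python sketch splits one entry line into its fields. The names `ENTRY_RE` and `parse_entry` and the regular expression itself are assumptions made for this example; this is not the parser used by the CC-CEDICT Editor website.

```python
import re

# Hypothetical pattern for one entry line: two space-free headword fields,
# pinyin in [[...]] (v2) or [...] (v1), then the /.../ definition.
ENTRY_RE = re.compile(r"^(\S+) (\S+) (\[\[.+?\]\]|\[.+?\]) /(.+)/$")


def parse_entry(line):
    """Split one entry line into its fields (illustrative only)."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognizable entry: {line!r}")
    traditional, simplified, pinyin, definition = match.groups()
    return {
        "traditional": traditional,
        "simplified": simplified,
        # Double square brackets mark a v2 entry, single brackets a v1 entry.
        "format": "v2" if pinyin.startswith("[[") else "v1",
        "pinyin": pinyin.strip("[]"),
        "definition": definition,
    }


entry = parse_entry("皮實 皮实 [[pi2shi5]] /(of things) durable/(of people) sturdy; tough/")
print(entry["format"], entry["pinyin"])
```

The definition is returned as one string here; splitting it into senses and glosses is a separate step.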
  
Line 47: Line 45:
  
 If an editor wants to add additional senses for an existing trad-simp-pinyin combination, they should edit its definition rather than create a new entry.
 +
 ==== Traditional and simplified characters ====
 
 The Chinese word should consist of one or more Chinese characters, without any spaces in it. Both traditional and simplified forms should be provided, and the two must have the same length.
  
-There are very small number of entries that use symbols, numbers, or other non-Chinese characters in the word, for example
+==== Pinyin ====
 +
 +The pinyin should be in accordance with standard pinyin orthography. For a comprehensive reference, we recommend //Chinese Romanization: Pronunciation and Orthography// by Yin Binyong.
 +
 +For the majority of entries, an entry can be converted from v1 to v2 by simply removing the spaces within words (see 二次方程 above). In addition, a hyphen can now be included in the pinyin when appropriate.
  
 <code>
-% % [pa1] /percent (Tw)/
-3C 3C [san1 C] /computers, communications, and consumer electronics/China Compulsory Certificate (CCC)/
-421 421 [si4 er4 yi1] /four grandparents, two parents and an only child/
-K人 K人 [K ren2] /(slang) to hit sb; to beat sb/
+師生 师生 [[shi1-sheng1]] /teachers and students/
+柴米油鹽醬醋茶 柴米油盐酱醋茶 [[chai2-mi3-you2-yan2-jiang4-cu4-cha2]] /lit. firewood, rice, oil, salt, soy sauce, vinegar and tea/fig. life's daily necessities/
 </code>
  
-**Below are some notes on how these entries are handled in v2.**
+Rules about our pinyin format:
 +  - Tones are indicated with numerals instead of diacritics. The neutral tone (轻声) uses the numeral "5", which should not be omitted.
 +  - Because we use numerals, the boundary between characters is clearly established and apostrophes before vowels are not needed.
 +  - ü (u with an umlaut) is written as “u:”. For example, 女 -> nu:3
 +  - 儿 as the “retroflex final” is written as r5 
 +  - Raw tones should be used: 
 +      - Tone sandhi is **not** indicated (e.g., 你好[ni3hao3] is not written as [ni2hao3]) 
 +      - 一 and 不 have various modifications in tone depending on what follows them, but these are **not** indicated in the pinyin (e.g., 一半[yi1ban4] is not written as [yi2ban4], 不是[bu4shi4] is not written as [bu2shi4]) 
 +      - Word-related changes to neutral tone, however, **are** indicated. These are especially common with reduplicated forms (e.g., use ma1 ma5, not ma1 ma1; ba4 ba5, not ba4 ba4; kan4 kan5, not kan4 kan4; xiang3 xiang5 ("take under consideration"), not xiang3 xiang3). This isn't limited to reduplicated forms, e.g., ming2 bai5, not ming2 bai2; cong1 ming5, not cong1 ming2.\\ It's best to keep in mind that Pinyin is about Mandarin words, not Chinese characters. 
 +  - xx5: Represents an entry where pinyin does not apply. There are very few entries with this pinyin and we do not expect to add more.
  
-Let's take "e人" (extroverted person) as an example.
+<code>
 +々 々 [xx5] /iteration mark indicating repetition of the preceding character in horizontal writing (rare in modern Chinese)/
 +〻 〻 [xx5] /iteration mark indicating repetition of the preceding character in vertical writing (rare in modern Chinese)/ 
 +</code>
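
To see how the numbered format relates to ordinary diacritic pinyin, here is a minimal Python sketch that converts the former to the latter. The function names are hypothetical, and the tone-mark placement follows the usual convention (a or e takes the mark, the o in "ou" takes it, otherwise the last vowel does). It assumes every syllable ends in a tone numeral, as required above; anything without one is left untouched.

```python
import re

# Vowel -> its four toned forms (tones 1-4); tone 5 is left unmarked.
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}
SYLLABLE_RE = re.compile(r"[A-Za-z:]+[1-5]")


def mark_syllable(syllable):
    """Convert one numbered syllable, e.g. 'nu:3', to diacritic form."""
    letters = syllable[:-1].replace("u:", "ü").replace("U:", "Ü")
    tone = int(syllable[-1])
    if tone == 5:
        return letters  # neutral tone carries no mark
    lower = letters.lower()
    if "a" in lower:
        idx = lower.index("a")
    elif "e" in lower:
        idx = lower.index("e")
    elif "ou" in lower:
        idx = lower.index("ou")
    else:
        vowels = [i for i, ch in enumerate(lower) if ch in "iouü"]
        if not vowels:
            return letters  # nothing to mark
        idx = vowels[-1]
    marked = TONE_MARKS[lower[idx]][tone - 1]
    if letters[idx].isupper():
        marked = marked.upper()
    return letters[:idx] + marked + letters[idx + 1:]


def to_diacritics(pinyin):
    """Convert a numbered pinyin string, restoring apostrophes where needed."""
    def convert(match):
        result = mark_syllable(match.group())
        # A syllable is "joined" if it directly follows another tone numeral.
        joined = match.start() > 0 and pinyin[match.start() - 1].isdigit()
        # Standard orthography puts an apostrophe before a joined a, e or o.
        if joined and match.group()[0] in "aeoAEO":
            result = "'" + result
        return result
    return SYLLABLE_RE.sub(convert, pinyin)


print(to_diacritics("er4ci4 fang1cheng2"))
print(to_diacritics("nu:3er2"))
```

Because the numerals already mark syllable boundaries, the apostrophe only has to be reintroduced at display time, which is why it can be omitted from the stored pinyin.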
  
-There are several ways one might like to render "e人" in pinyin, such as
+==== Definition ====
-  - e-rén +
-  - erén +
-  - yìrén+
  
-The Editor website attempts to match the parts of the headword with the parts of the pinyin, and will, if necessary, treat some parts as "unparsed".
+A definition is made up of senses, and a sense is made up of glosses. Senses should be separated using a slash "/", while glosses should be separated with a semicolon ";". This means that you cannot use / or ; as ordinary characters within a definition. For example, "w/o" as an abbreviation of "without" would incorrectly split the definition into two senses.
 
-For example, in the following entry, "e" is an unparsed element in both the headword and the pinyin, while 人 is matched with "ren2"
+Generally, glosses within a sense are synonyms and can be included to remove ambiguity, while senses represent wholly different meanings or uses of a word. Here's an example of an entry with multiple senses and glosses.
-<code> e人 e人 [[e-ren2]] /(slang) extroverted person/ </code>
 
-If the Editor website https://cc-cedict.org/editor cannot unambiguously match up the elements of the headword and the pinyin, the entry will not be processed. That is what happens in the following case, where the proposed pinyin is "eren2" rather than "e-ren2".
+<code>
 +算 算 [[suan4]] /to calculate; to figure out/to include; to count in/to count; to be valid; to carry weight/to regard as; to consider (to be ...)/ 
 +</code>
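
The sense and gloss structure can be sketched in a few lines of Python. The function name `split_definition` is hypothetical, and the "w/o sugar" definition is a made-up input used only to show the pitfall described above.

```python
def split_definition(definition):
    """Return a list of senses, each of which is a list of glosses."""
    senses = definition.strip().strip("/").split("/")
    return [[gloss.strip() for gloss in sense.split(";")] for sense in senses]


senses = split_definition(
    "/to calculate; to figure out/to include; to count in/"
    "to count; to be valid; to carry weight/to regard as; to consider (to be ...)/"
)
for number, glosses in enumerate(senses, start=1):
    print(number, glosses)

# A stray slash inside a gloss is indistinguishable from a sense separator.
broken = split_definition("/w/o sugar/")
print(broken)
```

The second call yields two senses where one was intended, which is why "/" and ";" have to be avoided inside glosses.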
  
 +Rules to follow when writing a definition:
 +  - Use American English.
 +  - Do not add definite or indefinite articles (e.g. "a", "an", "the") to English nouns unless they are necessary to distinguish the word from another usage type or homonym.
 +  - Don't use parts of speech. Instead, try to give an indication of grammatical usage within the English definition. CC-CEDICT is a human-readable descriptive dictionary, not a resource intended for machine processing.
 +  - The singular form is preferred over the plural form, unless the word is typically used in its plural form.
 +  - Entries for people should include dates if possible (birth, death, years in which the person was active in a certain role etc) and why this person is of interest (was a famous writer, took part in a revolution, was murdered etc). If a person isn't particularly famous and isn't related to China or Chinese culture, please don't include them.
 +  - Names of plants, animals, musical instruments should give common name and scientific name when appropriate; there is a particular problem of how specific the word is -- a plant may mean a minor variety within a species, or may refer to an entire taxonomic family. Different writers will use it to mean the common family, or the particular item of salad on their plate at present.
  
-<code> e人 e人 [[eren2]] /(slang) extroverted person/ (Invalid format!)</code> 
  
-To specify "erén" (as opposed to, say, "e-rén"), it is necessary to use braces to guide the Editor website in parsing. The following would work:
+=== Ambiguity due to homonyms ===
-<code>e人 e人 [[{e}ren2]] /(slang) extroverted person/</code>
 
-... as would several other forms, including
+Many words in the English language have multiple meanings. If such a word is used to write a definition, additional information should be provided to prevent ambiguity.
-<code>{e}人 {e}人 [[{e}ren2]] /(slang) extroverted person/</code>+
  
-Here is a link to a webpage where a proposed entry can be tested to see if it can be parsed correctly. 
- 
-"Parse entry" webpage: 
-https://cc-cedict.org/editor/editor.php?handler=ParseEntry 
- 
-To specify "yìrén" as the pinyin for e人, no braces are necessary. The following entry can be parsed, as one can verify at the "Parse entry" webpage. "e" will be matched with "yi4", and 人 will be matched with "ren2". 
- 
-<code>e人 e人 [[yi4ren2]] /(slang) extroverted person/</code> 
- 
-Generally, it is regarded as preferable not to indicate the pronunciation of non-Chinese parts of a headword (such as "e" in "e人"). Instead, they can appear as unparsed elements of the pinyin. For example, "e-ren2" is preferred over "yi4ren2" 
- 
- 
-==== Pinyin ==== 
- 
-The pinyin should be in accordance with standard pinyin orthography, except that numerals are used to indicate tones instead of diacritics, and apostrophes are not indicated. 
- 
-Proper nouns should be capitalized, and spaces or hyphens should be inserted when appropriate. For example, 
 <code>
-蘋果手機 苹果手机 [[Ping2guo3 shou3ji1]] /iPhone/
-師生 师生 [[shi1-sheng1]] /teachers and students/
+首都 首都 [[shou3du1]] /capital (city)/
 </code>
 
-Other rules about our pinyin format:
+The text between the parentheses is "meta-information"; it is not a direct part of the translation, and is there merely to prevent ambiguity.
-  - The neutral tone uses the numeral "5", which should not be omitted. +
-  - ü, also known as the umlaut, is written as “u:”. For example, 女 -> nu:3 +
-  - 儿 as the “retroflex final” is written as r5 +
-  - Raw tones should be used: +
-      - Tone sandhi is **not** indicated (e.g., ni3 hao3 is not changed to ni2 hao3) +
-      - Although "yi" and "bu" have various modifications in tone, depending on what follows them, these are **not** indicated in writing (e.g., "one horse" is pronounced "yi4 pi3 ma3" but written "yi1 pi3 ma3", and "not enough" is pronounced "bu2 gou4" but written "bu4 gou4")
-      - Word-related changes to neutral tone, however, **are** indicated. These are especially common with reduplicated forms (e.g., use ma1 ma5, not ma1 ma1; ba4 ba5, not ba4 ba4; kan4 kan5, not kan4 kan4; xiang3 xiang5 ("take under consideration"), not xiang3 xiang3). This isn't limited to reduplicated forms, e.g., ming2 bai5, not ming2 bai2; cong1 ming5, not cong1 ming2.\\ It's best to keep in mind that Pinyin is about Mandarin words, not Chinese characters.
-  - Non-Chinese characters: Letters should be written as they are, while numbers should be written out using pinyin, for example, 3C becomes “san1 C”.
-  - xx5: There are few entries where the pinyin is xx5, which represents unknown pinyin or characters where pinyin does not apply. Some Korean and Japanese symbols go here. It’s unlikely we will add more of these entries.
  
-<code> 
-々 々 [xx5] /iteration mark (used to represent a duplicated character)/ 
-㍽ ㍽ [xx5] /大正[Da4 zheng4] written as a single character/ 
-朩 朩 [xx5] /one of the characters used in kwukyel (phonetic "pin"), an ancient Korean writing system/ 
-込 込 [xx5] /(Japanese kokuji) to be crowded; to go into/ 
-</code> 
  
-==== Definition ====
+=== General principles of translation ===
 
-Definitions should be written in American English.
+The English should be meaningful, not horribly ugly, and bear a close relation to the Chinese meaning. It should correspond to something that could be used naturally by an English speaker (I think Arthur Waley has some advice saying that just because a text is about magnetohydrodynamics, it doesn't follow that it has to be horribly ugly).
 
-Senses should be separated using a slash "/", while glosses within sense should be separated with a semicolon ";". This means that you can not using / or ; within a definition - for example, "w/o" as an abbreviation of "without" would incorrectly split the definition into two senses.
+On the other hand, a translation always loses something, and the translator can compensate by substituting an English equivalent (e.g. a biblical or Shakespearian allusion in place of a Confucian idiom).
 
-Do not add definite or indefinite articles (e.g. "a", "an", "the", etc) to English nouns unless they are necessary to distinguish the word from another usage type or homonym
+Most words have more than one meaning, and more than one grammatical function. Care is needed not to concentrate only on a specific occurrence to the exclusion of others. For example, the actual occurrence may be a verb in the past participle (say "overthrown"), whereas the word may also mean "destruction", "to topple" etc.
- 
-Don't use parts of speech. Instead try to give an indication of grammatical usage within the English definition. CC-CEDICT is a human readable descriptive dictionary, not a resource intended for machine processing.
- +
-Abbreviations etc cf e.g. i.e. do not need any further punctuation. +
- +
-Extended meanings indicated by lit. .. fig. combination when appropriate or when common expression refers back to a classical incident or chengyu, one can refer to it with cf (incident in Records of the Historian).+
  
 +Tens of thousands of Chinese characters, if not a hundred thousand, have been created over the course of Chinese history. Many of them are archaic or obscure and have not been used in centuries, perhaps millennia, and it may not be possible to provide a definition. If you can't find a character in common dictionaries, or examples of the character in modern use, it's a sign that it's not worth including.
  
  
Line 150: Line 127:
 Taiwanese GuoYu sometimes prefers not to use the neutral tone, so we do not list Taiwan pronunciations when they consist only of saying "don't use the neutral tone". When a character has a "Taiwan pr." notice, its compounds need not mention it.
  
 +==== Labels ====
  
-==== Ambiguity due to homonyms ====
+See [[labels]]
- 
-Sometimes words used in the English definitions can have multiple meanings. If the Chinese word does not have these additional meanings, additional information should be provided to prevent ambiguity:\\ 
-首都 首都 [shou3 du1] /capital (city)/
- +
-The text between the parentheses is "meta-information"; it is not a direct part of the translation, merely to prevent ambiguity. +
  
 ==== References ==== ==== References ====
Line 164: Line 137:
 ==== Classifiers ==== ==== Classifiers ====
  
-Classifiers (also called "Measure words") can be listed using the following syntax:\\ 
-避風港 避风港 [bi4 feng1 gang3] /haven/refuge/harbor/CL:座[zuo4],個|个[ge4]/
+Classifiers, or "measure words", can be listed using the following syntax:
 
-Classifiers follow the 'reference' syntax, are prefixed by 'CL:' and separated by a comma (no additional spacing).
+<code>
 +麵包 面包 [[mian4bao1]] /bread/CL:片[pian4],塊|块[kuai4]/
 +麵包店 面包店 [[mian4bao1dian4]] /bakery/CL:家[jia1]/
 +</code>
 
-The classifier words itself can be described using:\\ 
+They follow the reference syntax of traditional|simplified[pinyin], are prefixed by "CL:" and separated by commas.
-/classifier for small round things (peas, bullets, peanuts, pills, grains etc)/+
  
 +We typically omit general classifiers like 個|个[ge4] which can be applied to almost every single noun.
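
A classifier sense in this syntax is regular enough to be read mechanically. The following Python sketch is one possible reading of it; `CLASSIFIER_RE` and `parse_classifiers` are hypothetical names, and the sketch assumes that the pinyin inside the brackets never contains a comma.

```python
import re

# One classifier reference: traditional, optional |simplified, then [pinyin].
CLASSIFIER_RE = re.compile(r"^([^|\[\]]+)(?:\|([^|\[\]]+))?\[([^\[\]]+)\]$")


def parse_classifiers(sense):
    """Parse a 'CL:...' sense into (traditional, simplified, pinyin) tuples."""
    if not sense.startswith("CL:"):
        raise ValueError(f"not a classifier sense: {sense!r}")
    classifiers = []
    for item in sense[len("CL:"):].split(","):
        match = CLASSIFIER_RE.match(item)
        if match is None:
            raise ValueError(f"malformed classifier: {item!r}")
        traditional, simplified, pinyin = match.groups()
        # When no |simplified part is given, both forms are identical.
        classifiers.append((traditional, simplified or traditional, pinyin))
    return classifiers


print(parse_classifiers("CL:片[pian4],塊|块[kuai4]"))
```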
 +
 +A classifier itself can be described like so:
 +
 +<code>
 +/classifier for small round things (peas, bullets, peanuts, pills, grains etc)/ 
 +</code>
 ==== Bound forms ==== ==== Bound forms ====
  
Line 178: Line 159:
 ===== Punctuation ===== ===== Punctuation =====
  
 +==== Dashes and hyphens ====
 +
 +We do not use the em dash (—). Ranges of numbers, dates, times etc should be separated by the en dash (–). In other cases, either the en dash or hyphen (-) should be used following standard English grammar.
  
 ==== Middle dot ==== ==== Middle dot ====
Line 191: Line 175:
 ==== Comma ==== ==== Comma ====
  
-Commas are sometimes used in Chinese proverbs:
+Commas are sometimes used in proverbs or longer expressions:
  
 <code> <code>
Line 199: Line 183:
 The comma within the Chinese characters should be the "fullwidth comma": ,. The comma within the pinyin should be the regular comma followed by a space.
  
 +Note: In v1, a space was also inserted before the comma in the pinyin (so the pinyin would contain " , "). The space before the comma has been phased out in v2.
  
-===== Retroflex finals =====
+==== Enumeration comma ====
 
-There are 3 kinds of R-ised words that use the 兒/儿 character:
-  - 兒/儿 is not-optional because it's its own syllable (usually meaning "son," so daughter is actually "girl son") - 女兒/女儿 nǚ'ér
-  - 兒/儿 is not-optional because it changes the definition of the word and is tacked on to the preceding syllable - 头兒/头儿 tóur (leader) as opposed to 头 tóu (head)
-  - 兒/儿 is an optional northern pronunciation (er2hua4) and is tacked on to the preceding syllable - 花兒/花儿 huār (flower) as opposed to 花 huā (flower)
+The enumeration comma "、", known as 顿号, is used to separate items in a list. It is rarely used in CC-CEDICT, appearing in only a handful of entries. Syntax-wise, it's treated the same way as the fullwidth comma (no space in the Chinese characters, and corresponds to a regular comma followed by a space in the pinyin).
 
-These 3 cases should be formatted as follows:
-  - 女兒 女儿 [nu:3 er2] /daughter/
-  - 頭兒 头儿 [tou2 r5] /leader/
-  - 花兒 花儿 [hua1 r5] /erhua variant of 花/flower/
+<code>
+八字方針 八字方针 [[ba1zi4 fang1zhen1]] /a policy expressed as an eight-character slogan/(esp.) the eight-character slogan for the economic policy proposed by Li Fuchun 李富春[Li3 Fu4chun1] in 1961: 調整、鞏固、充實、提高|调整、巩固、充实、提高[tiao2zheng3, gong3gu4, chong1shi2, ti2gao1] "adjust, consolidate, enrich and improve"/
+</code>
  
-//Please note: words ending with 'r5' (such as 'hua1 r5') are presented as a -r joined with the previous syllable (eg. 'huar1') in some dictionaries using CC-CEDICT, such as the [[http://www.mdbg.net/chindict/chindict.php|MDBG Chinese-English dictionary]].//
+===== 兒|儿, erhua and rhotacization =====
 
 +The 兒|儿 character can be used in three different ways:
 
-===== Choice of entries and translations =====
+1. 兒|儿[er2] is non-optional because it is its own syllable (meaning "child" or "son")
  
-The current CC-CEDICT database contains a considerable number of infelicities, inaccuracies, omissions, and actual errors. As an ideal, new entries should be checked against 2 or 3 different sources (e.g. the online and paper dictionaries). Care is needed, since the dictionaries copy from one another -- an entirely bogus entry in CC-CEDICT is copied uncritically onto thousands of websites within a few months. 
- 
-A Chinese word for which a Google query with the following syntax results in many thousand of hits should probably be added to CC-CEDICT, with translations corresponding to the main usages. 
 <code>
-+"combination of characters"
+女兒 女儿 [[nu:3er2]] /daughter/
 </code>
-//(the +"" combination forces Google to match both a whole word and to ignore variants)// 
  
 +2. 兒|儿[r5] is a non-optional suffix because it changes both the pronunciation and meaning of the word
  
-===== General principles of translation =====
+<code>
 +頭兒 头儿 [[tou2r5]] /leader/
 +</code>
 
-The English should be meaningful, not horribly ugly, and bear a close relation to the Chinese meaning. It should correspond to something that could be used naturally by an English speaker (I think Arthur Waley has some advice saying that just because a text is about magnetohydrodynamics, it doesn't follow that it has to be horribly ugly).
+3. 兒|儿[r5] is an optional suffix, changing the pronunciation of the word but not the meaning
 
-On the other hand, a translation always loses something, and the translator can compensate by substituting an English equivalent (e.g. a biblical or Shakespearian allusion in place of a Confucian idiom).
+<code>
 +花兒 花儿 [[hua1r5]] /erhua form of 花[hua1]/ 
 +</code>
  
-Names of persons should say dates if possible (birth, death, years in which the person was active in certain role, etc), what interest the person has (writer, general, pop star, etc), brief indications of CV (e.g. took part in a revolution, was murdered, wrote famous book, etc). For example:\\ 胡錦濤 胡锦涛 [Hu2 Jin3 tao1] /Hu Jintao (1942-), president of PRC from 2003/
+//Please note: words ending with 'r5' (such as 'hua1r5') are presented as -r joined with the previous syllable (e.g. 'huar1') in some dictionaries using CC-CEDICT, such as the [[http://www.mdbg.net/chindict/chindict.php|MDBG Chinese-English dictionary]].//
 +===== Choice of entries and translations =====
 
-Names of plants, animals, musical instruments should give common name and scientific name when appropriate; there is a particular problem of how specific the word is -- a plant may mean a minor variety within a species, or may refer to an entire taxonomic family. Different writers will use it to mean the common family, or the particular item of salad on their plate at present.
+The current CC-CEDICT database contains a considerable number of infelicities, inaccuracies, omissions, and actual errors. As an ideal, new entries should be checked against 2 or 3 different sources (e.g. online and paper dictionaries). Care is needed, since the dictionaries copy from one another -- an entirely bogus entry in CC-CEDICT is copied uncritically onto thousands of websites within a few months.
 
-Most words have more than one meaning, and more than one grammatical function. Care is needed not to concentrate only on specific occurrence to the exclusion of others. e.g . the actual occurrence may be a verb in the past participle (say "overthrown") whereas the word may also mean "destruction", "to topple" etc.
+A Chinese word for which a Google query with the following syntax results in many thousands of hits should probably be added to CC-CEDICT, with translations corresponding to the main usages.
 +<code> 
 ++"combination of characters" 
 +</code> 
 +//(the +"" combination forces Google to match both a whole word and to ignore variants)//
  
-There are 20,000 Chinese characters in the more advanced dictionaries, of which many are obscure, never used, and will not have correct definitions in online or paper dictionaries. This is the boundary of knowledge. (Exactly the same applies to big English dictionaries.) These obscure characters appear on modern websites, and one sometimes needs to give a definition. It is reasonable to admit (precise meaning unknown), and give an indication of what one can deduce. 
  
-===== Variants =====
+===== Romanization of foreign languages =====
 
-Many characters have variants, sometimes more than one, sometimes with identical meaning or quite different meanings. Some choice of variants found in texts on websites will arise because of the different input methods, and the user may have had no intention of using the variant.
+When transcribing foreign words in definitions, please use the following romanization methods:
 +  * Japanese: [[http://en.wikipedia.org/wiki/Hepburn_romanization|Modified Hepburn]]
 +  * Korean: [[http://en.wikipedia.org/wiki/Revised_Romanization_of_Korean|Revised Romanization of Korean]]
 
-You can get rough usage frequency information by searching the alternative word forms in Google. Please use this syntax to make sure that Google doesn't perform any automatic variant translations:
+If an alternative romanization method is more popular for a certain word, that version can be added as an additional translation.
-<code>+"word"</code>+
  
-Additionally you can use Google's advanced search to specify the language to either 'Chinese (Traditional)' or 'Chinese (Simplified)' to prevent Japanese web pages from influencing the results. For example:\\ 
-789 Chinese (Traditional) pages for +"撐竿跳高"\\ 
-17,700 Chinese (Simplified) pages for +"撑竿跳高"\\ 
-1,750 Chinese (Traditional) pages for +"撐杆跳高"\\ 
-66,900 Chinese (Simplified) pages for +"撑杆跳高"
+===== Non-Chinese characters =====
 
-It often happens that Google tells you that +"Xx" occurs 200 times more frequently than +"XX", in which case Xx should be in CC-CEDICT as regular entry, and XX only as "XX XX [pin1 yin1] /variant of Xx/definition/".
+On occasion, Chinese uses English letters or numerals to write a word. For example, we have:
 
-When there are alternative forms of the same expression, and the less common form is at most 5 times less common, the less common entry should have /also written .../ referring to the more common form, e.g. 撐竿跳高 撑竿跳高 [cheng1 gan1 tiao4 gao1] /pole-vaulting/also written 撐杆跳高|撑杆跳高/.
+<code>
 +# English letters
 +ky ky [[ky]] /(slang) socially tone-deaf; unable to read the room (from Japanese KY, acronym of 空気が読めない "kuuki ga yomenai")/
 +coser coser [[coser]] /cosplayer/
  
 +# Mix of English and Chinese
 +e人 e人 [[e-ren2]] /(slang) extroverted person/
 +勿cue 勿cue [[wu4-cue]] /(Internet slang) don't call on me; don't drag me in/
  
-**PROPOSED CHANGES**
+# Numbers
 +3D打印 3D打印 [[san1-D da3yin4]] /to 3D print; 3D printing/ 
 +95後 95后 [[jiu3wu3hou4]] /people born between 1995-01-01 and 1999-12-31/Gen Z (abbr. for 95後|95后[jiu3wu3hou4] + 00後|00后[ling2ling2hou4])/ 
 +996 996 [[jiu3jiu3liu4]] /9am–9pm, six days a week (work schedule)/ 
 +</code>
  
-(Summary: (1) Get rid of "also written", using "variant of" instead; and (2) Format "variant of" entries in line with points 2a and 2b below.)
+As a general rule of thumb:
- 
+  - When writing the Hanzi fields, non-Chinese characters should stay the same.
-(THE VARIANT RULES ABOVE CAN BE DELETED IF AND WHEN THESE CHANGES ARE ACCEPTED.)
+  - When writing the pinyin, for English letters use the same letters in the pinyin (ky -> ky), but for numbers write out the pinyin for the corresponding Chinese character (9 -> jiu3).
  
-(Also, the following notes can be tidied up and edited to remove references to "I" and "me".) 
  
-Regarding "also written..."
+==== Technical details, and the use of {} ====
 
-According to our wiki, there are two kinds of variants.
+When parsing the traditional and simplified fields, each Hanzi and each digit is treated as an individual section, while consecutive English letters are grouped together into a single section. For example, a hypothetical headword "甲abc123乙丙" would be parsed into 7 sections: 甲, abc, 1, 2, 3, 乙, 丙.
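
The headword sectioning rule can be approximated with a single regular expression. This is only a sketch of the rule as described here, with a hypothetical function name; it does not handle {} grouping or punctuation, and it is not the code the Editor website runs.

```python
import re


def split_headword(headword):
    """Group runs of English letters; every other character stands alone."""
    return re.findall(r"[A-Za-z]+|[^A-Za-z]", headword)


sections = split_headword("甲abc123乙丙")
print(len(sections), sections)  # 7 ['甲', 'abc', '1', '2', '3', '乙', '丙']
```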
-https://cc-cedict.org/wiki/format:syntax#variants+
  
-1) Where the less common form is relatively common (> 20% of the frequency of the more common form).
+The pinyin is first split by spaces and punctuation, and then parsed based on valid pinyin syllables. One way of writing pinyin for the hypothetical headword above could be "jia3 abc yi1-er4-san1 yi3bing3". Note that there are many valid ways the pinyin could be segmented, for example "jia3 abc yi1er4 san1yi3bing3" or "jia3-abc-yi1-er4-san1-yi3-bing3" (these may not make sense from an orthographic point of view, but will be parsed correctly by the CC-CEDICT website). The only requirement is that "abc" is separated from "yi1". Any of the above examples will be parsed into 7 pinyin sections.
 
-2) Where the less common form is much less common (< 20% of the frequency of the more common form)
+It is a requirement that the number of parsed sections in the Hanzi matches the number of parsed sections in the pinyin. For the vast majority of entries, this does not pose a problem. Almost all Chinese characters are one syllable in length, and due to how the parsing logic works, numbers and English letters will be parsed correctly as long as the pinyin is segmented correctly. Problems arise in rare situations such as
  
-For the first type, the def of the less common form should look like this (according to the wiki): +<code> 
-<code>/definition/also written .../</code>+兡 兡 [[bai3ke4]] /.../ 
 +</code>
  
-And for the second type, the def of the less common form should be
+where a single character corresponds to two syllables. In these cases, {}'s may be used to manually group a section, so we can write
-<code>/variant of .../definition/</code>+
  
-In practice, what has been happening in recent years is this:+<code> 
 +兡 兡 [[{bai3ke4}]] /.../ 
 +</code>
  
-1. We have been ignoring the "also written ..." syntax, except maybe when we edit existing entries
+which indicates that "bai3ke4" is a single pinyin section, matching the single Hanzi section of 兡.
 
-2. With variants,
+Another problem arises for entries containing a number with multiple digits. Consider a hypothetical entry such as
 
-a) if it's a full variant (i.e. exactly the same definition), we use /variant of .../ without adding the definition.
+<code>
 +11 11 [[shi2yi1]] /eleven/
 +</code>
 
-b) if it's a partial variant (i.e. only some of the senses of one form apply to the other form) we use
+which implies that the first 1 is pronounced "shi2" and the second 1 is pronounced "yi1". Or the entry
-<code>/definition (variant of ...)/</code>+
  
-Part of the rationale for these changes is this: It's a hassle to check whether entries satisfy the "20% criteria", and the percentage probably changes over time, and depending on which corpus you use to get the percentage.
+<code>
 +21 21 [[er4shi2yi1]] /twenty-one/
 +</code>
 
-Using the Editor website's search function, I got about 3600 results for "variant of" and only about 360 results for "also written"
+which poses a different problem: we have two Hanzi sections but three pinyin sections, due to the extra "shi2" that is not present in the Hanzi. To solve these issues, we have decided to group the number in {}'s and write the number as-is in the pinyin field:
 
-One idea that I've had in mind for a while is to clean up all these by
+<code>
 +{21}三體綜合症 {21}三体综合症 [[{21} san1ti3 zong1he2zheng4]] /trisomy; Down's syndrome/ 
 +</code>
  
-a) rewriting "also written" definitions by using "variant of"
+Note that this is different from the 996 example above, which is treated as the three digits "nine nine six" and parses without {}'s. The number "nine hundred ninety-six" would need {}'s.
 
-b) regularizing the format of the "variant of" entries in line with points 2a and 2b above.
+To check whether an entry will be parsed correctly, you can use this tool:
- 
+https://cc-cedict.org/editor/editor.php?handler=ParseEntry
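
For a quick offline approximation of the section-count requirement, the rules above can be sketched in Python as follows. All names are hypothetical, and the pinyin side is deliberately simplified: it treats a tone numeral as the end of a syllable instead of consulting a table of valid pinyin syllables as the real parser is described as doing, so the "Parse entry" tool above remains the authority.

```python
import re

# Headword: a {...} group, a run of English letters, or any single character.
HEADWORD_RE = re.compile(r"\{[^{}]*\}|[A-Za-z]+|[^A-Za-z{}]")
# Pinyin: a {...} group, a tone-numbered syllable, or a run of plain letters.
# Spaces, hyphens and commas match nothing and so act as separators.
PINYIN_RE = re.compile(r"\{[^{}]*\}|[A-Za-z:]+[1-5]|[A-Za-z]+")


def headword_sections(headword):
    return HEADWORD_RE.findall(headword)


def pinyin_sections(pinyin):
    return PINYIN_RE.findall(pinyin)


def sections_match(headword, pinyin):
    """True when both fields split into the same number of sections."""
    return len(headword_sections(headword)) == len(pinyin_sections(pinyin))


for headword, pinyin in [
    ("兡", "bai3ke4"),      # one character, two syllables
    ("兡", "{bai3ke4}"),    # braces group the syllables into one section
    ("21", "er4shi2yi1"),   # two digits, three syllables
    ("{21}三體綜合症", "{21} san1ti3 zong1he2zheng4"),
    ("996", "jiu3jiu3liu4"),
]:
    print(headword, pinyin, sections_match(headword, pinyin))
```

A matching count is necessary but does not show that the pairing is sensible: "11" with "shi2yi1" also matches section for section, which is exactly the misleading reading described above.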
- +
- +
-===== Romanization of foreign languages ===== +
- +
-When transcribing foreign words in definitions, please use the following romanization methods:
-  * Japanese: [[http://en.wikipedia.org/wiki/Hepburn_romanization|Modified Hepburn]] +
-  * Korean: [[http://en.wikipedia.org/wiki/Revised_Romanization_of_Korean|Revised Romanization of Korean]] +
- +
-If an alternative romanization method is more popular for a certain word, that version can be added as an additional translation.+
syntax_v2.1760682728.txt.gz · Last modified: by mdbg
