syntax_v2

Differences

This shows you the differences between two versions of the page.
syntax_v2 [2026/04/25 13:11] – Move "ambiguity due to homonyms" section, add labels link kbaiko
syntax_v2 [2026/05/16 09:39] (current) – Handling numbers with multiple digits kbaiko
Line 137: Line 137:
 ==== Classifiers ====
  
-Classifiers (also called "Measure words") can be listed using the following syntax:\\
-避風港 避风港 [bi4 feng1 gang3] /haven/refuge/harbor/CL:座[zuo4],個|个[ge4]/
+Classifiers, or "measure words", can be listed using the following syntax:
+
 
-Classifiers follow the 'reference' syntax, are prefixed by 'CL:' and separated by a comma (no additional spacing).
+<code>
+麵包 面包 [[mian4bao1]] /bread/CL:片[pian4],塊|块[kuai4]/
+麵包店 面包店 [[mian4bao1dian4]] /bakery/CL:家[jia1]/
+</code>
  
-The classifier words itself can be described using:\\
-/classifier for small round things (peas, bullets, peanuts, pills, grains etc)/
+They follow the reference syntax of traditional|simplified[pinyin], are prefixed by "CL:" and separated by commas.
+
 
+We typically omit general classifiers like 個|个[ge4], which can be applied to almost any noun.
+
+A classifier itself can be described like so:
+
+<code>
+/classifier for small round things (peas, bullets, peanuts, pills, grains etc)/
+</code>
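
As an illustration of the "CL:" syntax described above, here is a minimal Python sketch of how a consumer of the dictionary might read a classifier field. The function name `parse_classifiers` and the tuple layout it returns are assumptions made for this example; they are not part of CC-CEDICT or its Editor website.

```python
import re

# One classifier reference: traditional, optional "|simplified", then "[pinyin]".
# Hypothetical pattern written for this sketch only.
_CL_ITEM = re.compile(
    r"^(?P<trad>[^|\[\]]+)(?:\|(?P<simp>[^|\[\]]+))?\[(?P<pinyin>[^\[\]]+)\]$"
)


def parse_classifiers(definition: str) -> list[tuple[str, str, str]]:
    """Return (traditional, simplified, pinyin) tuples for a "CL:" definition.

    A definition that does not start with "CL:" yields an empty list.
    """
    if not definition.startswith("CL:"):
        return []
    classifiers = []
    # Classifiers are separated by commas with no additional spacing.
    for item in definition[len("CL:"):].split(","):
        match = _CL_ITEM.match(item)
        if match is None:
            raise ValueError(f"not a valid classifier reference: {item!r}")
        trad = match["trad"]
        # When no "|simplified" part is given, both forms are identical.
        simp = match["simp"] or trad
        classifiers.append((trad, simp, match["pinyin"]))
    return classifiers


print(parse_classifiers("CL:片[pian4],塊|块[kuai4]"))
```

Applied to the 麵包 entry, the sketch yields one tuple for 片 (same traditional and simplified form) and one for 塊|块.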
 ==== Bound forms ====
  
Line 185: Line 193:
 </code>
  
-===== Retroflex finals =====
+===== 兒|儿, erhua and rhotacization =====
  
-There are 3 kinds of R-ised words that use the 兒/儿 character
-  - 兒/儿 is not-optional because it's its own syllable (usually meaning "son," so daughter is actually "girl son") - 女兒/女儿 nǚ'ér
-  - 兒/儿 is not-optional because it changes the definition of the word and is tacked on to the preceding syllable - 头兒/头儿 tóur (leader) as opposed to 头 tóu (head)
-  - 兒/儿 is an optional northern pronunciation (er2hua4) and is tacked on to the preceding syllable - 花兒/花儿 huār (flower) as opposed to 花 huā (flower)
+The 兒|儿 character can be used in three different ways
+
+
  
-These 3 cases should be formatted as follows:
-  - 女兒 女儿 [nu:3 er2] /daughter/
-  - 頭兒 头儿 [tou2 r5] /leader/
-  - 花兒 花儿 [hua1 r5] /erhua variant of 花/flower/
+1. 兒|儿[er2] is not-optional because it's its own syllable (meaning "child" or "son")
+
+
+
  
-//Please note: words ending with 'r5' (such as 'hua1 r5') are presented as a -r joined with the previous syllable (eg. 'huar1') in some dictionaries using CC-CEDICT, such as the [[http://www.mdbg.net/chindict/chindict.php|MDBG Chinese-English dictionary]].//
+<code>
+女兒 女儿 [[nu:3er2]] /daughter/
+</code>
  
+2. 兒|儿[r5] is a non-optional suffix because it changes both the pronunciation and meaning of the word
+
+<code>
+頭兒 头儿 [[tou2r5]] /leader/
+</code>
+
+3. 兒|儿[r5] is an optional suffix, changing the pronunciation of the word but not the meaning
+
+<code>
+花兒 花儿 [[hua1r5]] /erhua form of 花[hua1]/
+</code>
 
+//Please note: words ending with 'r5' (such as 'hua1r5') are presented as a -r joined with the previous syllable (eg. 'huar1') in some dictionaries using CC-CEDICT, such as the [[http://www.mdbg.net/chindict/chindict.php|MDBG Chinese-English dictionary]].//
 ===== Choice of entries and translations =====
  
Line 209: Line 225:
 </code>
 //(the +"" combination forces Google to match both a whole word and to ignore variants)//
- 
-===== Variants ===== 
- 
-Many characters have variants, sometimes more than one, sometimes with identical meaning or quite different meanings. Some choice of variants found in texts on websites will arise because of the different input methods, and the user may have had no intention of using the variant. 
- 
-You can get rough usage frequency information by searching the alternative word forms in Google. Please use this syntax to make sure that Google doesn't perform any automatic variant translations: 
-<code>+"word"</code> 
- 
-Additionally you can use Google's advanced search to specify the language to either 'Chinese (Traditional)' or 'Chinese (Simplified)' to prevent Japanese web pages from influencing the results. For example:\\  
-789 Chinese (Traditional) pages for +"撐竿跳高"\\  
-17,700 Chinese (Simplified) pages for +"撑竿跳高"\\  
-1,750 Chinese (Traditional) pages for +"撐杆跳高"\\  
-66,900 Chinese (Simplified) pages for +"撑杆跳高" 
- 
-It often happens that Google tells you that +"Xx" occurs 200 times more frequently than +"XX", in which case Xx should be in CC-CEDICT as a regular entry, and XX only as "XX XX [pin1 yin1] /variant of Xx/definition/". 
- 
-When there are alternative forms of the same expression, and the less common form is at most 5 times less common, the less common entry should have /also written ../ referring to the more common form, e.g. 撐竿跳高 撑竿跳高 [cheng1 gan1 tiao4 gao1] /pole-vaulting/also written 撐杆跳高|撑杆跳高/. 
- 
- 
-**PROPOSED CHANGES** 
- 
-(Summary: (1) Get rid of "also written", using "variant of" instead; and (2) Format "variant of" entries in line with points 2a and 2b below.) 
-  
-(THE VARIANT RULES ABOVE CAN BE DELETED IF AND WHEN THESE CHANGES ARE ACCEPTED.) 
- 
-(Also, the following notes can be tidied up and edited to remove references to "I" and "me".) 
- 
-Regarding "also written..." 
- 
-According to our wiki, there are two kinds of variants. 
-https://cc-cedict.org/wiki/format:syntax#variants 
- 
-1) Where the less common form is relatively common (> 20% of the frequency of the more common form). 
- 
-2) Where the less common form is much less common (< 20% of the frequency of the more common form) 
- 
-For the first type, the def of the less common form should look like this (according to the wiki): 
-<code>/definition/also written .../</code> 
- 
-And for the second type, the def of the less common form should be 
-<code>/variant of .../definition/</code> 
- 
-In practice, what has been happening in recent years is this: 
- 
-1. We have been ignoring the "also written ..." syntax, except maybe when we edit existing entries 
- 
-2. With variants, 
- 
-a) if it's a full variant (i.e. exactly the same definition), we use /variant of .../ without adding the definition. 
- 
-b) if it's a partial variant (i.e. only some of the senses of one form apply to the other form) we use 
-<code>/definition (variant of ...)/</code> 
- 
-Part of the rationale for these changes is this: It's a hassle to check whether entries satisfy the "20% criteria", and the percentage probably changes over time, and depending on which corpus you use to get the percentage. 
- 
-Using the Editor website's search function, I got about 3600 results for "variant of" and only about 360 results for "also written".   
- 
-One idea that I've had in mind for a while is to clean up all these by 
- 
-a) rewriting "also written" definitions by using "variant of" 
- 
-b) regularizing the format of the "variant of" entries in line with points 2a and 2b above. 
- 
  
  
Line 282: Line 235:
 If an alternative romanization method is more popular for a certain word, that version can be added as an additional translation.
  
-===== Non-Chinese entries =====
+===== Non-Chinese characters =====
  
-There are very small number of entries that use symbols, numbers, or other non-Chinese characters in the word, for example
+On occasion the Chinese language uses English letters or numerals to write a word. For example, we have
  
 <code>
-% % [pa1] /percent (Tw)/
-3C 3C [san1 C] /computers, communications, and consumer electronics/China Compulsory Certificate (CCC)/
-421 421 [si4 er4 yi1] /four grandparents, two parents and an only child/
-K人 K人 [K ren2] /(slang) to hit sb; to beat sb/
+# English letters
+ky ky [[ky]] /(slang) socially tone-deaf; unable to read the room (from Japanese KY, acronym of 空気が読めない "kuuki ga yomenai")/
+coser coser [[coser]] /cosplayer/
+
+# Mix of English and Chinese
+e人 e人 [[e-ren2]] /(slang) extroverted person/
+勿cue 勿cue [[wu4-cue]] /(Internet slang) don't call on me; don't drag me in/
+
+# Numbers
+3D打印 3D打印 [[san1-D da3yin4]] /to 3D print; 3D printing/
+95後 95后 [[jiu3wu3hou4]] /people born between 1995-01-01 and 1999-12-31/Gen Z (abbr. for 95後|95后[jiu3wu3hou4] + 00後|00后[ling2ling2hou4])/
+996 996 [[jiu3jiu3liu4]] /9am–9pm, six days a week (work schedule)/
 </code>
  
-**Below are some notes on how these entries are handled in v2.**
+As a general rule of thumb:
 +  - When writing the Hanzi fields, non-Chinese characters should stay the same. 
 +  - When writing the pinyin, for English letters use the same letters in the pinyin (ky -> ky), but for numbers write out the pinyin for the corresponding Chinese character (9 -> jiu3)
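
The digit rule above can be sketched in a few lines of Python. The helper name `digits_to_pinyin` is hypothetical, and the sketch assumes the usual digit-by-digit readings (for example, it reads 1 as yi1 and never as yao1). It does not apply to numbers read as a whole, such as 21.

```python
# Standard pinyin readings of the ten digits, with CC-CEDICT tone numbers.
DIGIT_PINYIN = {
    "0": "ling2", "1": "yi1", "2": "er4", "3": "san1", "4": "si4",
    "5": "wu3", "6": "liu4", "7": "qi1", "8": "ba1", "9": "jiu3",
}


def digits_to_pinyin(digits: str) -> str:
    """Spell out a digit string one digit at a time (hypothetical helper)."""
    return "".join(DIGIT_PINYIN[d] for d in digits)


print(digits_to_pinyin("996"))  # pinyin for the headword 996
print(digits_to_pinyin("95"))   # the numeric part of 95後|95后
```

The two calls reproduce the numeric parts of the 996 and 95後|95后 entries listed above.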
  
-Let's take "e人" (extroverted person) as an example. 
  
-There are several ways one might like to render "e人" in pinyin, such as
-  - e-rén
-  - erén
-  - yìrén
+==== Technical details, and the use of {} ====
+
+
  
-The Editor website attempts to match the parts of the headword with the parts of the pinyin, and will, if necessary, treat some parts as "unparsed".
+When parsing the traditional and simplified fields, Hanzi and numbers are treated as individual sections, while consecutive English letters are grouped together into a single section. For example, a hypothetical headword "甲abc123乙丙" would be parsed into 7 sections: 甲, abc, 1, 2, 3, 乙, 丙.
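
The headword rule can be approximated with a single regular expression. This is only a sketch: the function name `headword_sections` is hypothetical, the real parser on the CC-CEDICT website may differ, and the handling of {} groups as one section is an assumption based on the {21} example further down this page.

```python
import re

# Alternatives are tried in order:
#   1. a {...} group counts as one section (assumption, see lead-in)
#   2. a run of consecutive English letters counts as one section
#   3. any other non-space character (Hanzi, digit, symbol) is its own section
_HEADWORD_SECTION = re.compile(r"\{[^{}]*\}|[A-Za-z]+|\S")


def headword_sections(headword: str) -> list[str]:
    """Split a traditional or simplified headword into sections (sketch)."""
    return _HEADWORD_SECTION.findall(headword)


print(headword_sections("甲abc123乙丙"))
print(headword_sections("3D打印"))
```

For the hypothetical headword 甲abc123乙丙 this gives the seven sections 甲, abc, 1, 2, 3, 乙, 丙.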
  
-For example, in the following entry, "e" is an unparsed element in both the headword and the pinyin, while 人 is matched with "ren2".
-<code> e人 e人 [[e-ren2]] /(slang) extroverted person/ </code>
+The pinyin is first split by spaces and punctuation, and then parsed based on valid pinyin syllables. One way of writing pinyin for the hypothetical headword above could be "jia3 abc yi1-er4-san1 yi3bing3". Note there are many valid ways that the pinyin could be segmented, for example "jia3 abc yi1er4 san1yi3bing3" or "jia3-abc-yi1-er4-san1-yi3-bing3" (these may not make sense from an orthographic point of view, but will be parsed correctly by the CC-CEDICT website). The only requirement is that the "abc" is separated from "yi1". Any of the above examples will be parsed into 7 pinyin sections.
 
-If the Editor website https://cc-cedict.org/editor/ cannot unambiguously match up the elements of the headword and the pinyin, the entry will not be processed. That is what happens in the following case, where the proposed pinyin is "eren2" rather than "e-ren2".
+It is a requirement that the number of parsed sections in the Hanzi matches the number of parsed sections in the pinyin. For the vast majority of entries, this does not pose a problem. Almost all Chinese characters are one syllable in length, and due to how the parsing logic works, numbers and English letters will be parsed correctly as long as the pinyin is segmented correctly. Problems arise in rare situations such as
  
+<code>
+兡 兡 [[bai3ke4]] /.../
+</code>
 
-<code> e人 e人 [[eren2]] /(slang) extroverted person/ (Invalid format!)</code>
+where a single character corresponds to two syllables. In these cases, {}'s may be used to manually group a section, so we can write
 
-To specify "erén" (as opposed to, say, "e-rén"), it is necessary to use braces to guide the Editor website in parsing. The following would work:
-<code>e人 e人 [[{e}ren2]] /(slang) extroverted person/</code>
+<code>
+兡 兡 [[{bai3ke4}]] /.../
+</code>
 
-... as would several other forms, including
-<code>{e}人 {e}人 [[{e}ren2]] /(slang) extroverted person/</code>
+which indicates "bai3ke4" is a single pinyin section, matching the single Hanzi section of 兡.
+
  
-Here is a link to a webpage where a proposed entry can be tested to see if it can be parsed correctly.
+Another problem arises for entries containing numbers with multiple digits. Consider a hypothetical entry such as
 
-"Parse entry" webpage:
-https://cc-cedict.org/editor/editor.php?handler=ParseEntry
+<code>
+11 11 [[shi2yi1]] /eleven/
+</code>
 
-To specify "yìrén" as the pinyin for e人, no braces are necessary. The following entry can be parsed, as one can verify at the "Parse entry" webpage. "e" will be matched with "yi4", and 人 will be matched with "ren2".
+which implies that the first 1 is pronounced "shi2" and the second 1 is pronounced "yi1". Or the entry
 
-<code>e人 e人 [[yi4ren2]] /(slang) extroverted person/</code>
+<code>
+21 21 [[er4shi2yi1]] /twenty one/
+</code>
 
-Generally, it is regarded as preferable not to indicate the pronunciation of non-Chinese parts of a headword (such as "e" in "e人"). Instead, they can appear as unparsed elements of the pinyin. For example, "e-ren2" is preferred over "yi4ren2".
+which poses a different problem - we have two Hanzi sections but three pinyin sections due to the extra "shi2" that is not present in the Hanzi. To solve these issues, we have decided to group the number in {}'s and write the number as digits in the pinyin field:
  
+<code>
+{21}三體綜合症 {21}三体综合症 [[{21} san1ti3 zong1he2zheng4]] /trisomy; Down's syndrome/
+</code>
+
+Note this is different from the 996 example above, which is treated as 3 digits "nine nine six" and parses without {}'s, not the number "nine hundred ninety six", which would need {}'s.
+
+To check whether an entry will be parsed correctly, you can use this tool:
+https://cc-cedict.org/editor/editor.php?handler=ParseEntry
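
For a rough offline check of the pinyin side, the following Python sketch counts pinyin sections. It is a simplification, not the website's parser: instead of the list of valid pinyin syllables it assumes that every syllable ends in a tone digit 1 to 5, and the function name `pinyin_sections` is hypothetical. Comparing its count with the number of headword sections mirrors the matching requirement described above.

```python
import re

# Alternatives are tried in order:
#   1. a {...} group is a single section
#   2. letters (or "u:") followed by a tone digit form one syllable
#   3. a run of letters without a tone digit (e.g. "abc", "D") is one section
# Spaces, hyphens and other punctuation act only as separators.
_PINYIN_SECTION = re.compile(r"\{[^{}]*\}|[A-Za-z:]+[1-5]|[A-Za-z]+")


def pinyin_sections(pinyin: str) -> list[str]:
    """Split a CC-CEDICT pinyin field into sections (simplified sketch)."""
    return _PINYIN_SECTION.findall(pinyin)


print(pinyin_sections("jia3 abc yi1-er4-san1 yi3bing3"))
print(len(pinyin_sections("er4shi2yi1")))  # one more than the two digits of "21"
print(len(pinyin_sections("{21} san1ti3 zong1he2zheng4")))
```

Under these assumptions "er4shi2yi1" yields three sections against the two digits of 21, while the braced form used for {21}三體綜合症 yields six sections for six headword sections. The ParseEntry tool linked above remains the authoritative check.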
syntax_v2.1777122665.txt.gz · Last modified: by kbaiko