syntax_v2

Differences

This shows you the differences between two versions of the page.

syntax_v2 [2026/04/25 13:25] – [Classifiers] kbaiko
syntax_v2 [2026/05/16 09:39] (current) – Handling numbers with multiple digits kbaiko
Line 193: Line 193:
 </code>
  
-===== Retroflex finals =====+===== 兒|儿, erhua and rhotacization =====
  
-There are 3 kinds of R-ised words that use the 兒/儿 character+The 兒|儿 character can be used in three different ways:
-  - 兒/儿 is not-optional because it's its own syllable (usually meaning "son," so daughter is actually "girl son") - 女兒/女儿 nǚ'ér +
-  - 兒/儿 is not-optional because it changes the definition of the word and is tacked on to the preceding syllable - 头兒/头儿 tóur (leader) as opposed to 头 tóu (head) +
-  - 兒/儿 is an optional northern pronunciation (er2hua4) and is tacked on to the preceding syllable - 花兒/花儿 huār (flower) as opposed to 花 huā (flower)+
  
-These 3 cases should be formatted as follows: +1. 兒|儿[er2] is non-optional because it is its own syllable (meaning "child" or "son")
-  - 女兒 女儿 [nu:3 er2] /daughter/ +
-  - 頭兒 头儿 [tou2 r5] /leader/ +
-  - 花兒 花儿 [hua1 r5] /erhua variant of 花/flower/+
  
-//Please note: words ending with 'r5' (such as 'hua1 r5') are presented as a -r joined with the previous syllable (eg. 'huar1') in some dictionaries using CC-CEDICT, such as the [[http://www.mdbg.net/chindict/chindict.php|MDBG Chinese-English dictionary]].//+<code> 
 +女兒 女儿 [[nu:3er2]] /daughter/ 
 +</code>
  
 +2. 兒|儿[r5] is a non-optional suffix because it changes both the pronunciation and meaning of the word
  
 +<code>
 +頭兒 头儿 [[tou2r5]] /leader/
 +</code>
 +
 +3. 兒|儿[r5] is an optional suffix, changing the pronunciation of the word but not the meaning
 +
 +<code>
 +花兒 花儿 [[hua1r5]] /erhua form of 花[hua1]/
 +</code>
 +
 +//Please note: words ending with 'r5' (such as 'hua1r5') are presented as a -r joined with the previous syllable (eg. 'huar1') in some dictionaries using CC-CEDICT, such as the [[http://www.mdbg.net/chindict/chindict.php|MDBG Chinese-English dictionary]].//
 ===== Choice of entries and translations =====
  
Line 218: Line 226:
 //(the +"" combination forces Google to match both a whole word and to ignore variants)//
  
-===== Variants ===== 
  
-Many characters have variants, sometimes more than one, sometimes with identical meaning or quite different meanings. Some choice of variants found in texts on websites will arise because of the different input methods, and the user may have had no intention of using the variant.+===== Romanization of foreign languages =====
  
-You can get rough usage frequency information by searching the alternative word forms in Google. Please use this syntax to make sure that Google doesn't perform any automatic variant translations+When transcribing foreign words in definitions, please use the following romanization methods:
-<code>+"word"</code>+  * Japanese: [[http://en.wikipedia.org/wiki/Hepburn_romanization|Modified Hepburn]] 
 +  * Korean: [[http://en.wikipedia.org/wiki/Revised_Romanization_of_Korean|Revised Romanization of Korean]]
  
-Additionally you can use Google's advanced search to specify the language to either 'Chinese (Traditional)' or 'Chinese (Simplified)' to prevent Japanese web pages from influencing the results. For example:\\  +If an alternative romanization method is more popular for a certain word, that version can be added as an additional translation.
-789 Chinese (Traditional) pages for +"撐竿跳高"\\  +
-17,700 Chinese (Simplified) pages for +"撑竿跳高"\\  +
-1,750 Chinese (Traditional) pages for +"撐杆跳高"\\  +
-66,900 Chinese (Simplified) pages for +"撑杆跳高"+
  
-It often happens that Google tells you that +"Xx" occurs 200 times more frequently than +"XX", in which case Xx should be in CC-CEDICT as a regular entry, and XX only as "XX XX [pin1 yin1] /variant of Xx/definition/".+===== Non-Chinese characters =====
  
-When there are alternative forms of the same expression, and the less common form is at most 5 times less common, the less common entry should have /also written ../ referring to the more common form, e.g. 撐竿跳高 撑竿跳高 [cheng1 gan1 tiao4 gao1] /pole-vaulting/also written 撐杆跳高|撑杆跳高/.+On occasion the Chinese language uses English letters or numerals to write a word. For example, we have:
  
 +<code>
 +# English letters
 +ky ky [[ky]] /(slang) socially tone-deaf; unable to read the room (from Japanese KY, acronym of 空気が読めない "kuuki ga yomenai")/
 +coser coser [[coser]] /cosplayer/
  
-**PROPOSED CHANGES**+# Mix of English and Chinese 
 +e人 e人 [[e-ren2]] /(slang) extroverted person/ 
 +勿cue 勿cue [[wu4-cue]] /(Internet slang) don't call on me; don't drag me in/
  
-(Summary: (1) Get rid of "also written", using "variant of" instead; and (2) Format "variant of" entries in line with points 2a and 2b below.) +# Numbers 
-  +3D打印 3D打印 [[san1-D da3yin4]] /to 3D print/3D printing/ 
-(THE VARIANT RULES ABOVE CAN BE DELETED IF AND WHEN THESE CHANGES ARE ACCEPTED.)+95後 95后 [[jiu3wu3hou4]] /people born between 1995-01-01 and 1999-12-31/Gen Z (abbr. for 95後|95后[jiu3wu3hou4] + 00後|00后[ling2ling2hou4])/ 
 +996 996 [[jiu3jiu3liu4]] /9am–9pm, six days a week (work schedule)/
 +</code>
  
-(Also, the following notes can be tidied up and edited to remove references to "I" and "me".)+As a general rule of thumb: 
 +  - When writing the Hanzi fields, non-Chinese characters should stay the same. 
 +  - When writing the pinyin, for English letters use the same letters in the pinyin (ky -> ky), but for numbers write out the pinyin for the corresponding Chinese character (9 -> jiu3)
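The rule for numbers that are read digit by digit (as in 996 or 95後) can be illustrated with a small Python sketch. This is not part of the CC-CEDICT tooling: the names `DIGIT_PINYIN` and `digits_to_pinyin` are hypothetical, and the table assumes the standard reading of each digit (1 as "yi1").

```python
# Hypothetical helper: standard pinyin reading for each digit.
DIGIT_PINYIN = {
    "0": "ling2", "1": "yi1", "2": "er4", "3": "san1", "4": "si4",
    "5": "wu3", "6": "liu4", "7": "qi1", "8": "ba1", "9": "jiu3",
}

def digits_to_pinyin(digits):
    """Spell out a digit string one digit at a time (e.g. 9 -> jiu3)."""
    return "".join(DIGIT_PINYIN[d] for d in digits)

print(digits_to_pinyin("996"))  # jiu3jiu3liu4
print(digits_to_pinyin("95"))   # jiu3wu3
```

This only covers digit-by-digit readings; a number read as a quantity (such as "twenty-one") does not map one digit to one syllable.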
  
-Regarding "also written..." 
  
-According to our wiki, there are two kinds of variants. +==== Technical details, and the use of {} ====
-https://cc-cedict.org/wiki/format:syntax#variants+
  
-1) Where the less common form is relatively common (> 20% of the frequency of the more common form).+When parsing the traditional and simplified fields, Hanzi and numbers are treated as individual sections, while consecutive English letters are grouped together into a single section. For example, a hypothetical headword "甲abc123乙丙" would be parsed into 7 sections: 甲, abc, 1, 2, 3, 乙, 丙.
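The headword segmentation described above can be approximated in Python as follows. This is only an illustrative sketch, not the Editor website's actual implementation: the function name `split_headword` is hypothetical, and treating "English letters" as ASCII letters is an assumption.

```python
import re
from typing import List

# Assumed rule: a run of ASCII letters forms one section; every other
# non-space character (Hanzi, digit, symbol) is a section of its own.
_SECTION = re.compile(r"[A-Za-z]+|\S")

def split_headword(headword: str) -> List[str]:
    """Split a traditional or simplified field into sections."""
    return _SECTION.findall(headword)

print(split_headword("甲abc123乙丙"))
# ['甲', 'abc', '1', '2', '3', '乙', '丙']
```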
  
-2) Where the less common form is much less common (< 20% of the frequency of the more common form)+The pinyin is first split by spaces and punctuation, and then parsed based on valid pinyin syllables. One way of writing pinyin for the hypothetical headword above could be "jia3 abc yi1-er4-san1 yi3bing3". Note there are many valid ways that the pinyin could be segmented, for example "jia3 abc yi1er4 san1yi3bing3" or "jia3-abc-yi1-er4-san1-yi3-bing3" (these may not make sense from an orthographic point of view, but will be parsed correctly by the CC-CEDICT website). The only requirement is that the "abc" is separated from "yi1". Any of the above examples will be parsed into 7 pinyin sections.
  
-For the first type, the def of the less common form should look like this (according to the wiki): +It is a requirement that the number of parsed sections in the Hanzi matches the number of parsed sections in the pinyin. For the vast majority of entries, this does not pose a problem. Almost all Chinese characters are one syllable in length, and due to how the parsing logic works, numbers and English letters will be parsed correctly as long as the pinyin is segmented correctly. Problems arise in rare situations such as:
-<code>/definition/also written .../</code>+
  
-And for the second type, the def of the less common form should be +<code> 
-<code>/variant of .../definition/</code>+兡 兡 [[bai3ke4]] /.../ 
 +</code>
  
-In practice, what has been happening in recent years is this:+where a single character corresponds to two syllables. In these cases, {}'s may be used to manually group a section, so we can write:
  
-1. We have been ignoring the "also written ..." syntax, except maybe when we edit existing entries+<code> 
 +兡 兡 [[{bai3ke4}]] /.../
 +</code>
  
-2. With variants,+which indicates "bai3ke4" is a single pinyin section, matching the single Hanzi section of 兡.
  
-a) if it's a full variant (i.e. exactly the same definition), we use /variant of .../ without adding the definition.+Another problem arises for entries containing numbers with multiple digits. Consider a hypothetical entry such as:
  
-b) if it's a partial variant (i.e. only some of the senses of one form apply to the other form) we use +<code> 
-<code>/definition (variant of ...)/</code>+11 11 [[shi2yi1]] /eleven/ 
 +</code>
  
-Part of the rationale for these changes is this: It's a hassle to check whether entries satisfy the "20% criteria", and the percentage probably changes over time, and depending on which corpus you use to get the percentage.+which implies that the first 1 is pronounced "shi2" and the second 1 is pronounced "yi1". Or the entry:
  
-Using the Editor website's search function, I got about 3600 results for "variant of" and only about 360 results for "also written".  +<code> 
 +21 21 [[er4shi2yi1]] /twenty one/ 
 +</code>
  
-One idea that I've had in mind for a while is to clean up all these by +which poses a different problem: we have two Hanzi sections but three pinyin sections due to the extra "shi2" that is not present in the Hanzi. To solve these issues, we have decided to group the number in {}'s and to write the number as digits (also in {}'s) in the pinyin field:
- +
-a) rewriting "also written" definitions by using "variant of" +
- +
-b) regularizing the format of the "variant of" entries in line with points 2a and 2b above. +
- +
- +
- +
-===== Romanization of foreign languages ===== +
- +
-When transcribing foreign words in definitions, please use the following romanization methods: +
-  * Japanese: [[http://en.wikipedia.org/wiki/Hepburn_romanization|Modified Hepburn]] +
-  * Korean: [[http://en.wikipedia.org/wiki/Revised_Romanization_of_Korean|Revised Romanization of Korean]] +
- +
-If an alternative romanization method is more popular for a certain word, that version can be added as an additional translation. +
- +
-===== Non-Chinese entries ===== +
- +
-There are a very small number of entries that use symbols, numbers, or other non-Chinese characters in the word, for example+
  
 <code>
-% % [pa1] /percent (Tw)/ +{21}三體綜合症 {21}三体综合症 [[{21} san1ti3 zong1he2zheng4]] /trisomy/Down's syndrome/
-3C 3C [san1 C] /computers, communications, and consumer electronics/China Compulsory Certificate (CCC)/ +
-421 421 [si4 er4 yi1] /four grandparents, two parents and an only child/ +
-K人 K人 [K ren2] /(slang) to hit sb/to beat sb/+
 </code>
  
-**Below are some notes on how these entries are handled in v2.**+Note this is different from the 996 example above, which is treated as 3 digits "nine nine six" and parses without {}'s, not the number "nine hundred ninety six", which would need {}'s.
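The section-count requirement and the effect of {} can be sketched in Python as below. This is a rough approximation under stated assumptions, not the Editor website's parser: all names are hypothetical, and the pinyin splitter simply cuts after each tone digit (1-5) instead of validating real pinyin syllables, so it will accept some strings the website would reject. The "Parse entry" tool remains the authoritative check.

```python
import re

# Headword: a {...} group, a run of ASCII letters, or any other single
# non-space character (Hanzi, digit, symbol).
HEADWORD_SECTION = re.compile(r"\{[^{}]*\}|[A-Za-z]+|\S")

# Pinyin: a {...} group, a tone-numbered syllable such as "san1" or "nu:3",
# or a bare run of letters such as "abc". Spaces and hyphens only separate.
# ASSUMPTION: splitting on tone digits stands in for real syllable validation.
PINYIN_SECTION = re.compile(r"\{[^{}]*\}|[A-Za-z:]+[1-5]|[A-Za-z]+")

def headword_sections(headword):
    return HEADWORD_SECTION.findall(headword)

def pinyin_sections(pinyin):
    return PINYIN_SECTION.findall(pinyin)

def section_counts_match(headword, pinyin):
    """True when both fields split into the same number of sections."""
    return len(headword_sections(headword)) == len(pinyin_sections(pinyin))

print(section_counts_match("996", "jiu3jiu3liu4"))       # True: 3 digits, 3 syllables
print(section_counts_match("21", "er4shi2yi1"))          # False: 2 digits, 3 syllables
print(section_counts_match("兡", "bai3ke4"))              # False: 1 Hanzi, 2 syllables
print(section_counts_match("兡", "{bai3ke4}"))            # True: braces group the syllables
print(section_counts_match("{21}三體綜合症", "{21} san1ti3 zong1he2zheng4"))  # True
```

Note that a count match is necessary but not sufficient: "11" with "shi2yi1" also passes this check, even though it pairs each digit with the wrong syllable.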
  
-Let's take "e人" (extroverted person) as an example. +To check whether an entry will be parsed correctly, you can use this tool:
- +
-There are several ways one might like to render "e人" in pinyin, such as +
-  - e-rén +
-  - erén +
-  - yìrén +
- +
-The Editor website attempts to match the parts of the headword with the parts of the pinyin, and will, if necessary, treat some parts as "unparsed"+
- +
-For example, in the following entry, "e" is an unparsed element in both the headword and the pinyin, while 人 is matched with "ren2".  +
-<code> e人 e人 [[e-ren2]] /(slang) extroverted person/ </code> +
- +
-If the Editor website https://cc-cedict.org/editor/ cannot unambiguously match up the elements of the headword and the pinyin, the entry will not be processed. That is what happens in the following case, where the proposed pinyin is "eren2" rather than "e-ren2"+
- +
- +
-<code> e人 e人 [[eren2]] /(slang) extroverted person/ (Invalid format!)</code> +
- +
-To specify "erén" (as opposed to, say, "e-rén"), it is necessary to use braces to guide the Editor website in parsing. The following would work: +
-<code>e人 e人 [[{e}ren2]] /(slang) extroverted person/</code> +
- +
-... as would several other forms, including +
-<code>{e}人 {e}人 [[{e}ren2]] /(slang) extroverted person/</code> +
- +
-Here is a link to a webpage where a proposed entry can be tested to see if it can be parsed correctly. +
- +
-"Parse entry" webpage:+
 https://cc-cedict.org/editor/editor.php?handler=ParseEntry
- 
-To specify "yìrén" as the pinyin for e人, no braces are necessary. The following entry can be parsed, as one can verify at the "Parse entry" webpage. "e" will be matched with "yi4", and 人 will be matched with "ren2". 
- 
-<code>e人 e人 [[yi4ren2]] /(slang) extroverted person/</code> 
- 
-Generally, it is regarded as preferable not to indicate the pronunciation of non-Chinese parts of a headword (such as "e" in "e人"). Instead, they can appear as unparsed elements of the pinyin. For example, "e-ren2" is preferred over "yi4ren2" 
- 
syntax_v2.1777123535.txt.gz · Last modified: by kbaiko
