Differences

This shows you the differences between two versions of the page.

--- syntax_v2 [2026/04/25 13:11] – Move "ambiguity due to homonyms" section, add labels link kbaiko
+++ syntax_v2 [2026/06/20 13:30] (current) – [Traditional and simplified characters] kbaiko
@@ Line 50: / Line 50: @@
 The Chinese word should consist of one or more Chinese characters, without any spaces in it. Both traditional and simplified forms should be provided, and the two must have the same length.
+For an official mapping between simplified and traditional characters, one can refer to the 通用规范汉字表. However, the 通用规范汉字表 is only a guideline - it does not cover every Chinese character, and we prioritize real-life usage over what the 通用规范汉字表 says.
+In particular, the traditional characters usually represent Taiwanese Mandarin usage, while the simplified characters represent mainland usage. The 通用规范汉字表 does not cover Taiwanese Mandarin usage, which can lead to discrepancies between the 通用规范汉字表 and CC-CEDICT. For instance, the 通用规范汉字表 indicates the simplified character 艳 was derived from the traditional character 艷, with 豓 and 豔 as traditional variants. However, in actual Taiwanese Mandarin usage, 豔 is the most commonly used form, which is why our entries are:
+<code>
+豔 艳 [[yan4]] /.../
+艷 艳 [[yan4]] /variant of 豔|艳[yan4]/
+豓 艳 [[yan4]] /variant of 豔|艳[yan4]/
+</code>
+For some more examples of how to parse the 通用规范汉字表 into CC-CEDICT entries, see https://cc-cedict.org/wiki/references#variant_of
+In rare cases, characters are simplified based on "word-level simplification", meaning that the entire word is taken into account when converting between simplified and traditional characters.
+    * 彷彿|仿佛[fang3fu2] - 彷 does not officially simplify to 仿
+    * 份子|分子[fen4zi3] - 份 does not officially simplify to 分
+    * 座標|坐标[zuo4biao1] - 座 does not officially simplify to 坐
+These entries introduce "invalid" traditional/simplified character pairs, but the traditional/simplified forms of the word represent how the word is usually written in Taiwan and the mainland.
 ==== Pinyin ====
@@ Line 137: / Line 156: @@
 ==== Classifiers ====
-Classifiers (also called "Measure words") can be listed using the following syntax:\\
+Classifiers, or "measure words", can be listed using the following syntax:
-避風港 避风港 [bi4 feng1 gang3] /haven/refuge/harbor/CL:座[zuo4],個|个[ge4]/
-Classifiers follow the 'reference' syntax, are prefixed by 'CL:' and separated by a comma (no additional spacing).
+<code>
+麵包 面包 [[mian4bao1]] /bread/CL:片[pian4],塊|块[kuai4]/
+麵包店 面包店 [[mian4bao1dian4]] /bakery/CL:家[jia1]/
+</code>
-The classifier words itself can be described using:\\
+They follow the reference syntax of traditional|simplified[pinyin], are prefixed by "CL:" and separated by commas.
-/classifier for small round things (peas, bullets, peanuts, pills, grains etc)/
+We typically omit general classifiers like 個|个[ge4], which can be applied to almost any noun.
+A classifier itself can be described like so:
+<code>
+/classifier for small round things (peas, bullets, peanuts, pills, grains etc)/
+</code>
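The "CL:" syntax above is regular enough to be read mechanically. The following is only a minimal sketch in Python, assuming a classifier definition has already been isolated from the entry; `parse_classifiers` and `CL_ITEM` are hypothetical names and are not part of any CC-CEDICT tooling.

```python
import re

# Hypothetical helper for illustration only; not part of any CC-CEDICT tool.
# One reference: optional "traditional|" prefix, then simplified, then [pinyin].
CL_ITEM = re.compile(
    r"^(?:(?P<trad>[^|\[\]]+)\|)?(?P<simp>[^|\[\]]+)\[(?P<pinyin>[^\]]+)\]$"
)

def parse_classifiers(sense):
    """Parse a 'CL:...' definition into (traditional, simplified, pinyin) tuples."""
    if not sense.startswith("CL:"):
        raise ValueError("not a classifier definition")
    result = []
    # Classifiers are separated by commas, with no additional spacing.
    for item in sense[3:].split(","):
        m = CL_ITEM.match(item)
        if m is None:
            raise ValueError(f"malformed classifier reference: {item!r}")
        simp = m.group("simp")
        # When only one form is given, traditional and simplified are identical.
        trad = m.group("trad") or simp
        result.append((trad, simp, m.group("pinyin")))
    return result

print(parse_classifiers("CL:片[pian4],塊|块[kuai4]"))
```

For the 麵包 entry this yields one tuple per classifier, with 片 repeated as both the traditional and simplified form.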
 ==== Bound forms ====
@@ Line 185: / Line 212: @@
 </code>
-===== Retroflex finals =====
+===== 兒|儿, erhua and rhotacization =====
-There are 3 kinds of R-ised words that use the 兒/儿 character:
+The 兒|儿 character can be used in three different ways:
-  - 兒/儿 is not-optional because it's its own syllable (usually meaning "son," so daughter is actually "girl son") - 女兒/女儿 nǚ'ér
-  - 兒/儿 is not-optional because it changes the definition of the word and is tacked on to the preceding syllable - 头兒/头儿 tóur (leader) as opposed to 头 tóu (head)
-  - 兒/儿 is an optional northern pronunciation (er2hua4) and is tacked on to the preceding syllable - 花兒/花儿 huār (flower) as opposed to 花 huā (flower)
-These 3 cases should be formatted as follows:
+1. 兒|儿[er2] is not optional because it is its own syllable (meaning "child" or "son")
-  - 女兒 女儿 [nu:3 er2] /daughter/
-  - 頭兒 头儿 [tou2 r5] /leader/
-  - 花兒 花儿 [hua1 r5] /erhua variant of 花/flower/
-//Please note: words ending with 'r5' (such as 'hua1 r5') are presented as a -r joined with the previous syllable (eg. 'huar1') in some dictionaries using CC-CEDICT, such as the [[http://www.mdbg.net/chindict/chindict.php|MDBG Chinese-English dictionary]].//
+<code>
+女兒 女儿 [[nu:3er2]] /daughter/
+</code>
+2. 兒|儿[r5] is a non-optional suffix because it changes both the pronunciation and meaning of the word
+<code>
+頭兒 头儿 [[tou2r5]] /leader/
+</code>
+3. 兒|儿[r5] is an optional suffix, changing the pronunciation of the word but not the meaning
+<code>
+花兒 花儿 [[hua1r5]] /erhua form of 花[hua1]/
+</code>
+//Please note: words ending with 'r5' (such as 'hua1r5') are presented as a -r joined with the previous syllable (eg. 'huar1') in some dictionaries using CC-CEDICT, such as the [[http://www.mdbg.net/chindict/chindict.php|MDBG Chinese-English dictionary]].//
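The 'hua1r5' to 'huar1' presentation mentioned in the note can be approximated with a single substitution. This is a rough sketch only: `join_erhua` is a hypothetical name, and how any particular dictionary actually performs the conversion is an assumption, not something documented here.

```python
import re

def join_erhua(pinyin):
    """Fold an 'r5' syllable into the preceding syllable: 'hua1r5' -> 'huar1'.

    Hypothetical display helper. It accepts both the v2 form ('hua1r5')
    and the older spaced form ('hua1 r5').
    """
    # Move the 'r' in front of the preceding tone digit and drop the '5'.
    return re.sub(r"([1-5]) ?r5", r"r\1", pinyin)

print(join_erhua("hua1r5"))
print(join_erhua("tou2 r5"))
print(join_erhua("nu:3er2"))
```

A full syllable such as "er2" is left alone, because the pattern only matches a bare "r5" directly after a tone digit.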
 ===== Choice of entries and translations =====
@@ Line 209: / Line 244: @@
 </code>
 //(the +"" combination forces Google to match both a whole word and to ignore variants)//
-===== Variants =====
-Many characters have variants, sometimes more than one, sometimes with identical meaning or quite different meanings. Some choice of variants found in texts on websites will arise because of the different input methods, and the user may have had no intention of using the variant.
-You can get rough usage frequency information by searching the alternative word forms in Google. Please use this syntax to make sure that Google doesn't perform any automatic variant translations:
-<code>+"word"</code>
-Additionally you can use Google's advanced search to specify the language to either 'Chinese (Traditional)' or 'Chinese (Simplified)' to prevent Japanese web pages from influencing the results. For example:\\
-Chinese (Traditional) pages for +"撐竿跳高"\\
-,700 Chinese (Simplified) pages for +"撑竿跳高"\\
-,750 Chinese (Traditional) pages for +"撐杆跳高"\\
-,900 Chinese (Simplified) pages for +"撑杆跳高"
-It often happens that Google tells you that +"Xx" occurs 200 times more frequently than +"XX", in which case Xx should be in CC-CEDICT as a regular entry, and XX only as "XX XX [pin1 yin1] /variant of Xx/definition/".
-When there are alternative forms of the same expression, and the less common form is at most 5 times less common, the less common entry should have /also written ../ referring to the more common form, e.g. 撐竿跳高 撑竿跳高 [cheng1 gan1 tiao4 gao1] /pole-vaulting/also written 撐杆跳高|撑杆跳高/.
-**PROPOSED CHANGES**
-(Summary: (1) Get rid of "also written", using "variant of" instead; and (2) Format "variant of" entries in line with points 2a and 2b below.)
-(THE VARIANT RULES ABOVE CAN BE DELETED IF AND WHEN THESE CHANGES ARE ACCEPTED.)
-(Also, the following notes can be tidied up and edited to remove references to "I" and "me".)
-Regarding "also written..."
-According to our wiki, there are two kinds of variants.
-https://cc-cedict.org/wiki/format:syntax#variants
-1) Where the less common form is relatively common (> 20% of the frequency of the more common form).
-2) Where the less common form is much less common (< 20% of the frequency of the more common form)
-For the first type, the def of the less common form should look like this (according to the wiki):
-<code>/definition/also written .../</code>
-And for the second type, the def of the less common form should be
-<code>/variant of .../definition/</code>
-In practice, what has been happening in recent years is this:
-1. We have been ignoring the "also written ..." syntax, except maybe when we edit existing entries
-2. With variants,
-a) if it's a full variant (i.e. exactly the same definition), we use /variant of .../ without adding the definition.
-b) if it's a partial variant (i.e. only some of the senses of one form apply to the other form) we use
-<code>/definition (variant of ...)/</code>
-Part of the rationale for these changes is this: It's a hassle to check whether entries satisfy the "20% criteria", and the percentage probably changes over time, and depending on which corpus you use to get the percentage.
-Using the Editor website's search function, I got about 3600 results for "variant of" and only about 360 results for "also written".
-One idea that I've had in mind for a while is to clean up all these by
-a) rewriting "also written" definitions by using "variant of"
-b) regularizing the format of the "variant of" entries in line with points 2a and 2b above.
@@ Line 282: / Line 254: @@
 If an alternative romanization method is more popular for a certain word, that version can be added as an additional translation.
-===== Non-Chinese entries =====
+===== Non-Chinese characters =====
-There are a very small number of entries that use symbols, numbers, or other non-Chinese characters in the word, for example
+On occasion the Chinese language uses English letters or numerals to write a word. For example, we have
 <code>
-% % [pa1] /percent (Tw)/
+# English letters
-3C 3C [san1 C] /computers, communications, and consumer electronics/China Compulsory Certificate (CCC)/
+ky ky [[ky]] /(slang) socially tone-deaf; unable to read the room (from Japanese KY, acronym of 空気が読めない "kuuki ga yomenai")/
-421 421 [si4 er4 yi1] /four grandparents, two parents and an only child/
+coser coser [[coser]] /cosplayer/
-K人 K人 [K ren2] /(slang) to hit sb; to beat sb/
+# Mix of English and Chinese
+e人 e人 [[e-ren2]] /(slang) extroverted person/
+勿cue 勿cue [[wu4-cue]] /(Internet slang) don't call on me; don't drag me in/
+# Numbers
+3D打印 3D打印 [[san1-D da3yin4]] /to 3D print; 3D printing/
+95後 95后 [[jiu3wu3hou4]] /people born between 1995-01-01 and 1999-12-31/Gen Z (abbr. for 95後|95后[jiu3wu3hou4] + 00後|00后[ling2ling2hou4])/
+996 996 [[jiu3jiu3liu4]] /9am–9pm, six days a week (work schedule)/
 </code>
-**Below are some notes on how these entries are handled in v2.**
+As a general rule of thumb:
+  - When writing the Hanzi fields, non-Chinese characters should stay the same.
+  - When writing the pinyin, use the same letters for English letters (ky -> ky), but for numbers write out the pinyin of the corresponding Chinese character (9 -> jiu3).
-Let's take "e人" (extroverted person) as an example.
-There are several ways one might like to render "e人" in pinyin, such as
+==== Technical details, and the use of {} ====
-  - e-rén
-  - erén
-  - yìrén
-The Editor website attempts to match the parts of the headword with the parts of the pinyin, and will, if necessary, treat some parts as "unparsed".
+When parsing the traditional and simplified fields, Hanzi and numbers are treated as individual sections, while consecutive English letters are grouped together into a single section. For example a hypothetical headword "甲abc123乙丙" would be parsed into 7 sections: 甲, abc, 1, 2, 3, 乙, 丙.
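The headword rule above can be sketched with one regular expression. This is a minimal illustration in Python, not the Editor website's actual parser: `split_headword` is a hypothetical name, and treating every non-letter, non-space character as its own section is a simplifying assumption.

```python
import re

# Hypothetical sketch of the headword rule, not the real parser.
# A run of consecutive ASCII letters forms one section; any other
# non-space character (Hanzi, digit) is a section by itself.
SECTION = re.compile(r"[A-Za-z]+|\S")

def split_headword(word):
    """Split a headword into the sections described above."""
    return SECTION.findall(word)

print(split_headword("甲abc123乙丙"))
```

For the hypothetical headword this gives the 7 sections 甲, abc, 1, 2, 3, 乙, 丙.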
-For example, in the following entry, "e" is an unparsed element in both the headword and the pinyin, while 人 is matched with "ren2".
+The pinyin is first split by spaces and punctuation, and then parsed based on valid pinyin syllables. One way of writing pinyin for the hypothetical headword above could be "jia3 abc yi1-er4-san1 yi3bing3". Note that there are many valid ways to segment the pinyin, for example "jia3 abc yi1er4 san1yi3bing3" or "jia3-abc-yi1-er4-san1-yi3-bing3" (these may not make sense from an orthographic point of view, but will be parsed correctly by the CC-CEDICT website). The only requirement is that "abc" is separated from "yi1". Any of the above examples will be parsed into 7 pinyin sections.
-<code> e人 e人 [[e-ren2]] /(slang) extroverted person/ </code>
-If the Editor website https://cc-cedict.org/editor/ cannot unambiguously match up the elements of the headword and the pinyin, the entry will not be processed. That is what happens in the following case, where the proposed pinyin is "eren2" rather than "e-ren2".
+It is a requirement that the number of parsed sections in the Hanzi matches the number of parsed sections in the pinyin. For the vast majority of entries, this does not pose a problem. Almost all Chinese characters are one syllable in length, and due to how the parsing logic works, numbers and English letters will be parsed correctly as long as the pinyin is segmented correctly. Problems arise in rare situations such as
+<code>
+兡 兡 [[bai3ke4]] /.../
+</code>
-<code> e人 e人 [[eren2]] /(slang) extroverted person/ (Invalid format!)</code>
+where a single character corresponds to two syllables. In these cases, {}'s may be used to manually group a section, so we can write
-To specify "erén" (as opposed to, say, "e-rén"), it is necessary to use braces to guide the Editor website in parsing. The following would work:
+<code>
-<code>e人 e人 [[{e}ren2]] /(slang) extroverted person/</code>
+兡 兡 [[{bai3ke4}]] /.../
+</code>
-... as would several other forms, including
+which indicates "bai3ke4" is a single pinyin section, matching the single Hanzi section of 兡.
-<code>{e}人 {e}人 [[{e}ren2]] /(slang) extroverted person/</code>
-Here is a link to a webpage where a proposed entry can be tested to see if it can be parsed correctly.
+Another problem arises for entries with a number with multiple digits. Consider a hypothetical entry such as
-"Parse entry" webpage:
+<code>
-https://cc-cedict.org/editor/editor.php?handler=ParseEntry
+11 11 [[shi2yi1]] /eleven/
+</code>
-To specify "yìrén" as the pinyin for e人, no braces are necessary. The following entry can be parsed, as one can verify at the "Parse entry" webpage. "e" will be matched with "yi4", and 人 will be matched with "ren2".
+which implies that the first 1 is pronounced "shi2" and the second 1 is pronounced "yi1". Or the entry
-<code>e人 e人 [[yi4ren2]] /(slang) extroverted person/</code>
+<code>
+21 21 [[er4shi2yi1]] /twenty one/
+</code>
-Generally, it is regarded as preferable not to indicate the pronunciation of non-Chinese parts of a headword (such as "e" in "e人"). Instead, they can appear as unparsed elements of the pinyin. For example, "e-ren2" is preferred over "yi4ren2".
+which poses a different problem: we have two Hanzi sections but three pinyin sections, due to the extra "shi2" that is not present in the Hanzi. To solve these issues, we have decided to group the number in {}'s and write the number out in the pinyin field:
+<code>
+{21}三體綜合症 {21}三体综合症 [[{21} san1ti3 zong1he2zheng4]] /trisomy; Down's syndrome/
+</code>
+Note that this differs from the 996 example above: 996 is read as the three digits "nine nine six" and so parses without {}'s, whereas the number "nine hundred ninety-six" would need {}'s.
+To check whether an entry will be parsed correctly, you can use this tool:
+https://cc-cedict.org/editor/editor.php?handler=ParseEntry
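The tool above is the authoritative check. For a quick offline approximation of the section-count requirement, something like the following Python sketch may help. All names here are hypothetical, and two simplifications are assumed: the pinyin is split only on spaces and hyphens, and a "syllable" is any run of letters (or ':') ending in a tone digit, without validation against the real pinyin inventory.

```python
import re

# Hypothetical approximation of the section-count rule; not the Editor's parser.
# Headword section: a {...} group, a run of ASCII letters, or any other single character.
HEADWORD_SECTION = re.compile(r"\{[^{}]*\}|[A-Za-z]+|[^\s{}]")
# Pinyin chunk: a {...} group, or a run without spaces, hyphens or braces.
PINYIN_CHUNK = re.compile(r"\{[^{}]*\}|[^\s\-{}]+")
# Rough stand-in for a valid pinyin syllable: letters (or ':') plus a tone digit.
SYLLABLE = re.compile(r"[A-Za-z:]+[1-5]")

def headword_sections(word):
    return HEADWORD_SECTION.findall(word)

def pinyin_sections(pinyin):
    sections = []
    for chunk in PINYIN_CHUNK.findall(pinyin):
        if chunk.startswith("{"):
            # A braced group always counts as exactly one section.
            sections.append(chunk)
            continue
        syllables = SYLLABLE.findall(chunk)
        if syllables and "".join(syllables) == chunk:
            # The chunk is made up entirely of tone-marked syllables.
            sections.extend(syllables)
        else:
            # Anything else (e.g. English letters) is kept as one section.
            sections.append(chunk)
    return sections

def section_counts_match(word, pinyin):
    return len(headword_sections(word)) == len(pinyin_sections(pinyin))

print(section_counts_match("兡", "bai3ke4"))
print(section_counts_match("兡", "{bai3ke4}"))
print(section_counts_match("{21}三體綜合症", "{21} san1ti3 zong1he2zheng4"))
```

Because the syllable pattern is so loose, a matching count from this sketch does not guarantee that the real parser will accept the entry.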