Differences

This shows you the differences between two versions of the page.

--- syntax_v2 [2026/05/03 12:12] – Archive "Variants" and "Non-Chinese entries" kbaiko
+++ syntax_v2 [2026/06/20 13:30] (current) – [Traditional and simplified characters] kbaiko
@@ Line 50: / Line 50: @@
 The Chinese word should consist of one or more Chinese characters, without any spaces in it. Both traditional and simplified forms should be provided, and the two must have the same length.
+For an official mapping between simplified and traditional characters, one can refer to the 通用规范汉字表. However, the 通用规范汉字表 is only a guideline - it does not cover every Chinese character, and we prioritize real-life usage over what the 通用规范汉字表 says.
+In particular, the traditional characters usually represent Taiwanese Mandarin usage, while the simplified characters represent mainland usage. The 通用规范汉字表 does not cover Taiwanese Mandarin usage, which can lead to discrepancies between the 通用规范汉字表 and CC-CEDICT. For instance, the 通用规范汉字表 indicates the simplified character 艳 was derived from the traditional character 艷, with 豓 and 豔 as traditional variants. However, in actual Taiwanese Mandarin usage, 豔 is most commonly used - which is why our entries are
+<code>
+豔 艳 [[yan4]] /.../
+艷 艳 [[yan4]] /variant of 豔|艳[yan4]/
+豓 艳 [[yan4]] /variant of 豔|艳[yan4]/
+</code>
+For some more examples of how to parse the 通用规范汉字表 into CC-CEDICT entries, see https://cc-cedict.org/wiki/references#variant_of
+In rare cases, characters are simplified based on "word-level simplification", meaning that the entire word is taken into account when converting between simplified and traditional characters.
+    * 彷彿|仿佛[fang3fu2] - 彷 does not officially simplify to 仿
+    * 份子|分子[fen4zi3] - 份 does not officially simplify to 分
+    * 座標|坐标[zuo4biao1] - 座 does not officially simplify to 坐
+These entries introduce "invalid" traditional/simplified character pairs, but the traditional/simplified forms of the word represent how the word is usually written in Taiwan and the mainland.
 ==== Pinyin ====
@@ Line 235: / Line 254: @@
 If an alternative romanization method is more popular for a certain word, that version can be added as an additional translation.
+===== Non-Chinese characters =====
+On occasion the Chinese language uses English letters or numerals to write a word. For example, we have
+<code>
+# English letters
+ky ky [[ky]] /(slang) socially tone-deaf; unable to read the room (from Japanese KY, acronym of 空気が読めない "kuuki ga yomenai")/
+coser coser [[coser]] /cosplayer/
+# Mix of English and Chinese
+e人 e人 [[e-ren2]] /(slang) extroverted person/
+勿cue 勿cue [[wu4-cue]] /(Internet slang) don't call on me; don't drag me in/
+# Numbers
+3D打印 3D打印 [[san1-D da3yin4]] /to 3D print; 3D printing/
+95後 95后 [[jiu3wu3hou4]] /people born between 1995-01-01 and 1999-12-31/Gen Z (abbr. for 95後|95后[jiu3wu3hou4] + 00後|00后[ling2ling2hou4])/
+996 996 [[jiu3jiu3liu4]] /9am–9pm, six days a week (work schedule)/
+</code>
+As a general rule of thumb:
+  - When writing the Hanzi fields, non-Chinese characters should stay the same.
+  - When writing the pinyin, keep English letters as they are (ky -> ky), but write each digit as the pinyin of the corresponding Chinese character (9 -> jiu3).
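The rule of thumb for non-Chinese characters can be sketched in a few lines of Python. This is only an illustration: `DIGIT_PINYIN` and `pinyin_for_non_chinese` are hypothetical names, not part of any CC-CEDICT tooling, and the helper assumes its input contains only English letters and digits.

```python
# Hypothetical helper illustrating the rule of thumb above.
# Assumption: the input contains only English letters and digits (no Hanzi).
DIGIT_PINYIN = {
    "0": "ling2", "1": "yi1", "2": "er4", "3": "san1", "4": "si4",
    "5": "wu3", "6": "liu4", "7": "qi1", "8": "ba1", "9": "jiu3",
}

def pinyin_for_non_chinese(token):
    """Letters are copied unchanged; each digit is read out individually."""
    return "".join(DIGIT_PINYIN.get(ch, ch) for ch in token)

print(pinyin_for_non_chinese("ky"))   # ky
print(pinyin_for_non_chinese("996"))  # jiu3jiu3liu4
```

Digits are read one at a time here, which matches entries such as 996 but not numbers that are read as a whole.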
+==== Technical details, and the use of {} ====
+When parsing the traditional and simplified fields, each Hanzi and each digit is treated as its own section, while consecutive English letters are grouped together into a single section. For example, a hypothetical headword "甲abc123乙丙" would be parsed into 7 sections: 甲, abc, 1, 2, 3, 乙, 丙.
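A minimal Python sketch of this sectioning rule, for illustration only. The function name `split_headword` is hypothetical, and the sketch assumes that any character that is not an ASCII letter is either a digit or a Hanzi and forms a section by itself; the website's actual parser may differ.

```python
import re

# A run of ASCII letters is one section; any other single character
# (assumed to be a digit or a Hanzi) is a section by itself.
_SECTION = re.compile(r"[A-Za-z]+|.")

def split_headword(headword):
    """Split a traditional or simplified field into parsed sections."""
    return _SECTION.findall(headword)

print(split_headword("甲abc123乙丙"))
# ['甲', 'abc', '1', '2', '3', '乙', '丙']
```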
+The pinyin is first split by spaces and punctuation, and then parsed based on valid pinyin syllables. One way of writing pinyin for the hypothetical headword above could be "jia3 abc yi1-er4-san1 yi3bing3". Note that there are many valid ways to segment the pinyin, for example "jia3 abc yi1er4 san1yi3bing3" or "jia3-abc-yi1-er4-san1-yi3-bing3" (these may not make sense from an orthographic point of view, but will be parsed correctly by the CC-CEDICT website). The only requirement is that "abc" is separated from "yi1". Any of the above examples will be parsed into 7 pinyin sections.
+It is a requirement that the number of parsed sections in the Hanzi matches the number of parsed sections in the pinyin. For the vast majority of entries, this does not pose a problem. Almost all Chinese characters are one syllable in length, and due to how the parsing logic works, numbers and English letters will be parsed correctly as long as the pinyin is segmented correctly. Problems arise in rare situations such as
+<code>
+兡 兡 [[bai3ke4]] /.../
+</code>
+where a single character corresponds to two syllables. In these cases, {}'s may be used to manually group a section, so we can write
+<code>
+兡 兡 [[{bai3ke4}]] /.../
+</code>
+which indicates "bai3ke4" is a single pinyin section, matching the single Hanzi section of 兡.
+Another problem arises for entries with a number with multiple digits. Consider a hypothetical entry such as
+<code>
+11 11 [[shi2yi1]] /eleven/
+</code>
+which implies that the first 1 is pronounced "shi2" and the second 1 is pronounced "yi1". Or the entry
+<code>
+21 21 [[er4shi2yi1]] /twenty-one/
+</code>
+which poses a different problem - we have two Hanzi sections but three pinyin sections, due to the extra "shi2" that is not present in the Hanzi. To solve these issues, we have decided to group the number in {}'s and write the number out in the pinyin field:
+<code>
+{21}三體綜合症 {21}三体综合症 [[{21} san1ti3 zong1he2zheng4]] /trisomy; Down's syndrome/
+</code>
+Note that this differs from the 996 example above: 996 is read as three digits, "nine nine six", and parses without {}'s. Read as the number "nine hundred ninety-six", it would need {}'s.
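The section-count requirement and the effect of {} can be approximated offline with a short Python sketch. This is a rough stand-in under stated assumptions, not the website's parser: all names are hypothetical, and instead of a table of valid pinyin syllables it assumes every Chinese syllable ends in a tone digit 1-5, with letters that carry no tone digit forming one section.

```python
import re

# Hypothetical approximation; NOT the CC-CEDICT website's parsing logic.
# A {...} group is always one section, in both the Hanzi and the pinyin.
HANZI_SECTION = re.compile(r"\{[^{}]*\}|[A-Za-z]+|.")
# Assumption: a syllable is a run of letters (or "u:") ending in a tone digit;
# a run of letters without a tone digit (e.g. "abc") is one section.
PINYIN_SECTION = re.compile(r"\{[^{}]*\}|[A-Za-z:]+[1-5]|[A-Za-z]+")

def hanzi_sections(headword):
    return HANZI_SECTION.findall(headword)

def pinyin_sections(pinyin):
    # Spaces and hyphens match no alternative, so findall skips them.
    return PINYIN_SECTION.findall(pinyin)

def sections_match(headword, pinyin):
    """True if the Hanzi and pinyin fields have the same number of sections."""
    return len(hanzi_sections(headword)) == len(pinyin_sections(pinyin))

print(sections_match("兡", "bai3ke4"))      # False
print(sections_match("兡", "{bai3ke4}"))    # True
print(sections_match("21", "er4shi2yi1"))   # False
print(sections_match("{21}三體綜合症", "{21} san1ti3 zong1he2zheng4"))  # True
```

A matching count does not prove that each section lines up with the right pronunciation (the 11 example above has matching counts), so the sketch is only a first sanity check.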
+To check whether an entry will be parsed correctly, you can use this tool:
+https://cc-cedict.org/editor/editor.php?handler=ParseEntry