syntax_v2

Differences

This shows you the differences between two versions of the page.

syntax_v2 [2026/05/03 12:12] – Archive "Variants" and "Non-Chinese entries" kbaiko
syntax_v2 [2026/05/16 09:39] (current) – Handling numbers with multiple digits kbaiko
Line 235:
 If an alternative romanization method is more popular for a certain word, that version can be added as an additional translation.
  
 +===== Non-Chinese characters =====
 +
 +On occasion the Chinese language uses English letters or numerals to write a word. For example, we have
 +
 +<code>
 +# English letters
 +ky ky [[ky]] /(slang) socially tone-deaf; unable to read the room (from Japanese KY, acronym of 空気が読めない "kuuki ga yomenai")/
 +coser coser [[coser]] /cosplayer/
 +
 +# Mix of English and Chinese
 +e人 e人 [[e-ren2]] /(slang) extroverted person/
 +勿cue 勿cue [[wu4-cue]] /(Internet slang) don't call on me; don't drag me in/
 +
 +# Numbers
 +3D打印 3D打印 [[san1-D da3yin4]] /to 3D print; 3D printing/
 +95後 95后 [[jiu3wu3hou4]] /people born between 1995-01-01 and 1999-12-31/Gen Z (abbr. for 95後|95后[jiu3wu3hou4] + 00後|00后[ling2ling2hou4])/
 +996 996 [[jiu3jiu3liu4]] /9am–9pm, six days a week (work schedule)/
 +</code>
 +
 +As a general rule of thumb:
 +  - When writing the Hanzi fields, non-Chinese characters should stay the same.
 +  - When writing the pinyin, keep English letters unchanged (ky -> ky), but replace each digit with the pinyin of the corresponding Chinese character (9 -> jiu3).
 +
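The rule of thumb for the pinyin field can be sketched in a few lines of Python. This is only an illustration: the function name `romanize_non_chinese` is hypothetical, and it handles only the non-Chinese characters of a headword. Hanzi pass through unchanged and still need to be romanized by hand. It reads numbers digit by digit, as in the 996 example above.

```python
# Pinyin for the Chinese numerals 〇 一 二 三 四 五 六 七 八 九.
DIGIT_PINYIN = {
    "0": "ling2", "1": "yi1", "2": "er4", "3": "san1", "4": "si4",
    "5": "wu3", "6": "liu4", "7": "qi1", "8": "ba1", "9": "jiu3",
}

def romanize_non_chinese(text):
    """Keep English letters as-is and spell out each digit in pinyin.

    Hypothetical helper: any other character (including Hanzi) is
    returned unchanged, so this is not a full romanizer.
    """
    return "".join(DIGIT_PINYIN.get(ch, ch) for ch in text)

print(romanize_non_chinese("ky"))   # ky
print(romanize_non_chinese("996"))  # jiu3jiu3liu4
```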
 +
 +==== Technical details, and the use of {} ====
 +
 +When parsing the traditional and simplified fields, each Hanzi and each digit is treated as its own section, while consecutive English letters are grouped together into a single section. For example, a hypothetical headword "甲abc123乙丙" would be parsed into 7 sections: 甲, abc, 1, 2, 3, 乙, 丙.
 +
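The sectioning rule for the Hanzi fields can be approximated with a single regular expression. This is a minimal sketch, not the parser used by the CC-CEDICT website: the name `split_headword` is hypothetical, and it assumes "English letters" means ASCII A-Z and a-z.

```python
import re

# A run of ASCII letters is one section; any other non-space
# character (a Hanzi or a digit) is a section of its own.
SECTION = re.compile(r"[A-Za-z]+|\S")

def split_headword(headword):
    """Split a traditional or simplified field into sections."""
    return SECTION.findall(headword)

print(split_headword("甲abc123乙丙"))
# ['甲', 'abc', '1', '2', '3', '乙', '丙']
```

Applied to entries from the examples above, `split_headword("3D打印")` gives four sections (3, D, 打, 印) and `split_headword("勿cue")` gives two (勿, cue).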
 +The pinyin is first split by spaces and punctuation, and then parsed based on valid pinyin syllables. One way of writing the pinyin for the hypothetical headword above is "jia3 abc yi1-er4-san1 yi3bing3". There are many valid ways to segment the pinyin, for example "jia3 abc yi1er4 san1yi3bing3" or "jia3-abc-yi1-er4-san1-yi3-bing3" (these may not make sense from an orthographic point of view, but will be parsed correctly by the CC-CEDICT website). The only requirement is that "abc" is separated from "yi1". Any of the above examples will be parsed into 7 pinyin sections.
 +
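The two-step pinyin parse can be illustrated with a simplified Python sketch. It rests on assumptions that differ from the real parser: only spaces and hyphens are treated as separators, and a syllable is recognized by its trailing tone digit (1-5) rather than by lookup in a table of valid pinyin syllables. The name `split_pinyin` is hypothetical.

```python
import re

SEPARATORS = re.compile(r"[\s\-]+")
# A toned syllable (letters plus a tone digit), or a bare run of letters.
TOKEN = re.compile(r"[A-Za-z:]+[1-5]|[A-Za-z]+")

def split_pinyin(pinyin):
    """Split on separators first, then cut each chunk into sections."""
    sections = []
    for chunk in SEPARATORS.split(pinyin):
        sections.extend(TOKEN.findall(chunk))
    return sections

for text in ("jia3 abc yi1-er4-san1 yi3bing3",
             "jia3 abc yi1er4 san1yi3bing3",
             "jia3-abc-yi1-er4-san1-yi3-bing3"):
    print(len(split_pinyin(text)))  # 7 each time
```

In this sketch, writing "abcyi1" without a separator yields a single section, because nothing marks where "abc" ends. That is one way to see why the separation is required, although the real parser may fail differently.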
 +It is a requirement that the number of parsed sections in the Hanzi matches the number of parsed sections in the pinyin. For the vast majority of entries, this does not pose a problem. Almost all Chinese characters are one syllable in length, and due to how the parsing logic works, numbers and English letters will be parsed correctly as long as the pinyin is segmented correctly. Problems arise in rare situations such as 
 +
 +<code>
 +兡 兡 [[bai3ke4]] /.../
 +</code>
 +
 +where a single character corresponds to two syllables. In these cases, {}'s may be used to manually group a section, so we can write
 +
 +<code>
 +兡 兡 [[{bai3ke4}]] /.../
 +</code>
 +
 +which indicates "bai3ke4" is a single pinyin section, matching the single Hanzi section of 兡.
 +
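The effect of {} on the section count can be sketched as follows. The regular expressions are simplified assumptions (ASCII letters, tone digits 1-5 marking syllable ends), and `section_counts` is a hypothetical helper, not part of the CC-CEDICT tooling.

```python
import re

# A {...} group is always one section. Otherwise: letter runs group
# together in the Hanzi field, and toned syllables split in the pinyin.
HANZI_SECTION = re.compile(r"\{[^{}]*\}|[A-Za-z]+|\S")
PINYIN_SECTION = re.compile(r"\{[^{}]*\}|[A-Za-z:]+[1-5]|[A-Za-z]+")

def section_counts(hanzi, pinyin):
    """Return (number of Hanzi sections, number of pinyin sections)."""
    return (len(HANZI_SECTION.findall(hanzi)),
            len(PINYIN_SECTION.findall(pinyin)))

print(section_counts("兡", "bai3ke4"))    # (1, 2): mismatch
print(section_counts("兡", "{bai3ke4}"))  # (1, 1): match
```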
 +Another problem arises for entries that contain a multi-digit number. Consider a hypothetical entry such as
 +
 +<code>
 +11 11 [[shi2yi1]] /eleven/
 +</code>
 +
 +which misleadingly implies that the first 1 is pronounced "shi2" and the second 1 is pronounced "yi1". Or the entry
 +
 +<code>
 +21 21 [[er4shi2yi1]] /twenty one/
 +</code>
 +
 +which poses a different problem: we have two Hanzi sections but three pinyin sections, due to the extra "shi2" that is not present in the Hanzi. To solve these issues, we have decided to group the number in {}'s and to write the number itself, as digits, in the pinyin field:
 +
 +<code>
 +{21}三體綜合症 {21}三体综合症 [[{21} san1ti3 zong1he2zheng4]] /trisomy; Down's syndrome/
 +</code>
 +
 +Note that this differs from the 996 example above: 996 is read digit by digit as "nine nine six", so it parses without {}'s. Read as the number "nine hundred ninety-six", it would need {}'s.
 +
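Pairing each Hanzi section with its pinyin section makes both multi-digit problems visible. The sketch below uses simplified matching rules (ASCII letters, tone digits 1-5 marking syllable ends, {...} as one section) and a hypothetical function `align`. It is an illustration under those assumptions, not the website's parser.

```python
import re

HANZI_SECTION = re.compile(r"\{[^{}]*\}|[A-Za-z]+|\S")
PINYIN_SECTION = re.compile(r"\{[^{}]*\}|[A-Za-z:]+[1-5]|[A-Za-z]+")

def align(hanzi, pinyin):
    """Pair Hanzi sections with pinyin sections, or return None
    when the two fields do not have the same number of sections."""
    h = HANZI_SECTION.findall(hanzi)
    p = PINYIN_SECTION.findall(pinyin)
    if len(h) != len(p):
        return None
    return list(zip(h, p))

print(align("996", "jiu3jiu3liu4"))  # each digit gets its own reading
print(align("11", "shi2yi1"))        # counts match, but 1 is paired with shi2
print(align("21", "er4shi2yi1"))     # None: 2 sections vs 3
print(align("{21}三體綜合症", "{21} san1ti3 zong1he2zheng4"))
```

With the {}'s in place, the last call pairs the whole group `{21}` with `{21}` and the remaining five Hanzi with one syllable each.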
 +To check whether an entry will be parsed correctly, you can use this tool:
 +https://cc-cedict.org/editor/editor.php?handler=ParseEntry
syntax_v2.1777810336.txt.gz · Last modified: by kbaiko
