format:syntax_v2

Differences

This shows you the differences between two versions of the page.

format:syntax_v2 [2024/08/17 12:25] kbaikoformat:syntax_v2 [2024/09/12 23:02] (current) – [CC-CEDICT V2 Syntax] richwarm
Line 3: Line 3:
 //**TODO:** work in progress!//
  
-Version 2 (v2) introduces a new syntax for the pinyin of an entry, allowing for the specification of pinyin that follows standard pinyin orthography. In particular, it enables the combination of syllables to form words. For example, in v2, 二次方程 (quadratic equation) can now be written as two words, "er4ci4 fang1cheng2" (i.e., èrcì fāngchéng), rather than as four distinct syllables, "er4 ci4 fang1 cheng2", as was required in v1.
+Version 2 (v2) introduces a new syntax for the pinyin of an entry, allowing for the specification of pinyin that follows standard pinyin orthography. In particular, it enables the combination of syllables to form words. For example, in v2, 二次方程 (quadratic equation) can now be written as two words, "er4ci4 fang1cheng2" (i.e., èrcì fāngchéng), rather than as four separate syllables, "er4 ci4 fang1 cheng2", as was required in v1.
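
The relationship between the numbered form ("er4ci4 fang1cheng2") and the tone-marked form (èrcì fāngchéng) can be illustrated with a small Python sketch. This is not the Editor website's code; the function names are hypothetical, and it assumes that every syllable ends in a tone digit 1-5 and that "u:" stands for ü. It applies the usual tone-mark placement rule (a or e first, then the o of "ou", otherwise the last vowel) and does not insert apostrophes.

```python
import re

# Tone-marked forms for tones 1-4, followed by the unmarked vowel for tone 5.
TONE_MARKS = {
    "a": "āáǎàa", "e": "ēéěèe", "i": "īíǐìi",
    "o": "ōóǒòo", "u": "ūúǔùu", "ü": "ǖǘǚǜü",
}

# A syllable is assumed to be a run of letters followed by one tone digit.
SYLLABLE = re.compile(r"([A-Za-zü]+)([1-5])")


def mark_syllable(letters: str, tone: int) -> str:
    """Place the tone mark on one syllable using the standard placement rule."""
    lower = letters.lower()
    if "a" in lower:
        idx = lower.index("a")
    elif "e" in lower:
        idx = lower.index("e")
    elif "ou" in lower:
        idx = lower.index("o")
    else:
        vowels = [i for i, ch in enumerate(lower) if ch in "iouü"]
        if not vowels:
            return letters  # nothing to mark (e.g. a stray consonant)
        idx = vowels[-1]
    marked = TONE_MARKS[lower[idx]][tone - 1]
    if letters[idx].isupper():
        marked = marked.upper()
    return letters[:idx] + marked + letters[idx + 1:]


def numbered_to_marked(pinyin: str) -> str:
    """Convert e.g. 'er4ci4 fang1cheng2' to 'èrcì fāngchéng'."""
    text = pinyin.replace("u:", "ü")
    return SYLLABLE.sub(lambda m: mark_syllable(m.group(1), int(m.group(2))), text)


print(numbered_to_marked("er4ci4 fang1cheng2"))
```

Spaces between words pass through unchanged, so the word division written in the v2 pinyin carries over directly to the tone-marked form.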
  
 Below are guidelines on what CC-CEDICT entries **should** look like. CC-CEDICT still has many old entries that do not comply with these rules yet.
Line 42: Line 42:
 </code>
  
-**Important:** It is not allowed for multiple entries in CC-CEDICT to have the same combination of traditional, simplified, and pinyin. In that case, they must be combined in one entry with their senses joined.
+**Important:** Multiple entries in CC-CEDICT must not share the same combination of traditional, simplified and pinyin. In fact, the CC-CEDICT Editor website will not allow an editor to create an entry if an entry with the same trad-simp-pinyin combination already exists; attempting to do so produces an error message. Note that the pinyin comparison is case-sensitive, so [Wang2] and [wang2] are considered different. Therefore, we can have two entries such as the following.
  
 +<code>
 +王 王 [[Wang2]] /surname Wang/
 +王 王 [[wang2]] /king/
 +</code> 
 +
 +If an editor wants to add additional senses for an existing trad-simp-pinyin combination, they should edit its definition rather than create a new entry.
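
The uniqueness rule above can be sketched in Python as follows. This is only an illustration, not the Editor website's implementation: the regular expression and the function names are assumptions, and the pattern accepts both single and double square brackets around the pinyin. The key point is that the pinyin is compared exactly as written, so "Wang2" and "wang2" give different keys.

```python
import re
from collections import defaultdict

# Hypothetical pattern: traditional, simplified, bracketed pinyin, /senses/.
ENTRY = re.compile(r"^(\S+) (\S+) \[+([^\]]+)\]+ /(.+)/$")


def entry_key(line: str) -> tuple:
    """Return the (traditional, simplified, pinyin) key of one entry line."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognisable entry: {line!r}")
    trad, simp, pinyin, _senses = match.groups()
    return (trad, simp, pinyin)  # no case folding: comparison is case-sensitive


def find_duplicates(lines) -> dict:
    """Group entry lines by key and keep only keys that occur more than once."""
    seen = defaultdict(list)
    for line in lines:
        seen[entry_key(line)].append(line)
    return {key: group for key, group in seen.items() if len(group) > 1}


entries = [
    "王 王 [[Wang2]] /surname Wang/",
    "王 王 [[wang2]] /king/",
]
print(find_duplicates(entries))  # the two keys differ in case, so nothing is reported
```

If a third line with the key ("王", "王", "wang2") were added, `find_duplicates` would report it together with the existing /king/ entry, which is the situation in which the senses should be merged into one entry.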
 ==== Traditional and simplified characters ====
  
Line 56: Line 62:
 K人 K人 [K ren2] /(slang) to hit sb; to beat sb/
 </code>
 +
 +**Below are some notes on how these entries are handled in v2.**
 +
 +Let's take "e人" (extroverted person) as an example.
 +
 +There are several ways one might like to render "e人" in pinyin, such as:
 +  - e-rén
 +  - erén
 +  - yìrén
 +
 +The Editor website attempts to match the parts of the headword with the parts of the pinyin, and will, if necessary, treat some parts as "unparsed".
 +
 +For example, in the following entry, "e" is an unparsed element in both the headword and the pinyin, while 人 is matched with "ren2".
 +<code> e人 e人 [[e-ren2]] /(slang) extroverted person/ </code>
 +
 +If the Editor website (https://cc-cedict.org/editor/) cannot unambiguously match up the elements of the headword and the pinyin, the entry will not be processed. That is what happens in the following case, where the proposed pinyin is "eren2" rather than "e-ren2".
 +
 +
 +<code> e人 e人 [[eren2]] /(slang) extroverted person/ (Invalid format!)</code>
 +
 +To specify "erén" (as opposed to, say, "e-rén"), it is necessary to use braces to guide the Editor website in parsing. The following would work:
 +<code>e人 e人 [[{e}ren2]] /(slang) extroverted person/</code>
 +
 +... as would several other forms, including
 +<code>{e}人 {e}人 [[{e}ren2]] /(slang) extroverted person/</code>
 +
 +Here is a link to a webpage where a proposed entry can be tested to see if it can be parsed correctly.
 +
 +"Parse entry" webpage:
 +https://cc-cedict.org/editor/editor.php?handler=ParseEntry
 +
 +To specify "yìrén" as the pinyin for e人, no braces are necessary. The following entry can be parsed, as one can verify at the "Parse entry" webpage. "e" will be matched with "yi4", and 人 will be matched with "ren2".
 +
 +<code>e人 e人 [[yi4ren2]] /(slang) extroverted person/</code>
 +
 +Generally, it is preferable not to indicate the pronunciation of non-Chinese parts of a headword (such as "e" in "e人"). Instead, they can appear as unparsed elements of the pinyin. For example, "e-ren2" is preferred over "yi4ren2".
 +
  
 ==== Pinyin ====
Line 271: Line 314:
  
 When there are alternative forms of the same expression, and the less common form is at most 5 times less common, the less common entry should have /also written ../ referring to the more common form, e.g. 撐竿跳高 撑竿跳高 [cheng1 gan1 tiao4 gao1] /pole-vaulting/also written 撐杆跳高|撑杆跳高/.
 +
 +
 +**PROPOSED CHANGES**
 +
 +(Summary: (1) Get rid of "also written", using "variant of" instead; and (2) Format "variant of" entries in line with points 2a and 2b below.)
 + 
 +(THE VARIANT RULES ABOVE CAN BE DELETED IF AND WHEN THESE CHANGES ARE ACCEPTED.)
 +
 +(Also, the following notes can be tidied up and edited to remove references to "I" and "me".)
 +
 +Regarding "also written..."
 +
 +According to our wiki, there are two kinds of variants.
 +https://cc-cedict.org/wiki/format:syntax#variants
 +
 +1) Where the less common form is relatively common (> 20% of the frequency of the more common form).
 +
 +2) Where the less common form is much less common (< 20% of the frequency of the more common form).
 +
 +For the first type, the def of the less common form should look like this (according to the wiki):
 +<code>/definition/also written .../</code>
 +
 +And for the second type, the def of the less common form should be
 +<code>/variant of .../definition/</code>
 +
 +In practice, what has been happening in recent years is this:
 +
 +1. We have been ignoring the "also written ..." syntax, except maybe when we edit existing entries.
 +
 +2. With variants,
 +
 +a) if it's a full variant (i.e. exactly the same definition), we use /variant of .../ without adding the definition.
 +
 +b) if it's a partial variant (i.e. only some of the senses of one form apply to the other form) we use
 +<code>/definition (variant of ...)/</code>
 +
 +Part of the rationale for these changes is this: It's a hassle to check whether entries satisfy the "20% criterion", and the percentage probably changes over time and depends on which corpus you use to compute it.
 +
 +Using the Editor website's search function, I got about 3600 results for "variant of" and only about 360 results for "also written".  
 +
 +One idea that I've had in mind for a while is to clean up all these by
 +
 +a) rewriting "also written" definitions by using "variant of"
 +
 +b) regularizing the format of the "variant of" entries in line with points 2a and 2b above.
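
As a starting point for such a cleanup, the definition formats described above could be told apart with a rough Python sketch like the one below. The category labels and function names are made up for illustration, and the patterns only cover the formats quoted in these notes (/definition/also written .../, /variant of .../, /variant of .../definition/ and /definition (variant of ...)/). Deciding whether an "also written" entry is a full or a partial variant would still need an editor's judgment.

```python
import re

ALSO_WRITTEN = re.compile(r"^also written ")
VARIANT_SENSE = re.compile(r"^variant of ")
PARTIAL_VARIANT = re.compile(r"\(variant of [^)]*\)$")


def split_senses(definition: str) -> list:
    """Split '/sense 1/sense 2/' into a list of sense strings."""
    return [sense for sense in definition.strip("/").split("/") if sense]


def classify(definition: str) -> str:
    """Assign one of a few hypothetical labels to a definition string."""
    senses = split_senses(definition)
    if any(ALSO_WRITTEN.match(s) for s in senses):
        return "also written"                # candidate for rewriting (point a)
    if len(senses) == 1 and VARIANT_SENSE.match(senses[0]):
        return "full variant"                # already in the 2a format
    if any(PARTIAL_VARIANT.search(s) for s in senses):
        return "partial variant"             # already in the 2b format
    if any(VARIANT_SENSE.match(s) for s in senses):
        return "variant of plus definition"  # older /variant of .../definition/ format
    return "other"


print(classify("/pole-vaulting/also written 撐杆跳高|撑杆跳高/"))
```

Running a classifier of this kind over the dictionary would give a list of the entries that still need manual attention for each of the two cleanup steps.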
 +
 +
  
 ===== Romanization of foreign languages =====
format/syntax_v2.1723897545.txt.gz · Last modified: 2024/08/17 12:25 by kbaiko
