

CHDICT project

What is it?

As of December 2006, the CHDICT project is a work in progress. Its objectives are:

  1. Generate a raw Chinese ⇒ Hungarian dictionary by combining the existing resources of CEDICT, HanDeDict, and free English ⇒ Hungarian and German ⇒ Hungarian dictionaries.
  2. Create a website where the Hungarian dictionary, and possibly also the CEDICT and HanDeDict dictionaries, can be queried, and where regular updates of the Hungarian material are published.
  3. Provide an online editing interface where users can change existing entries and contribute new ones.
  4. Build a community of contributors and editors to improve and extend the dictionary.
  5. Systematically extend the dictionary.
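
One plausible reading of the first objective is a pivot through English: take each CEDICT entry, look up its English glosses in an English ⇒ Hungarian word list, and keep the hits as raw Hungarian candidates for later review. The sketch below illustrates only that idea and is not the project's actual generator. The function names and the tiny EN_HU word list are hypothetical; only the CEDICT line layout (traditional, simplified, pinyin in brackets, slash-separated glosses) is taken from CEDICT itself.

```python
import re

# CEDICT line layout: TRADITIONAL SIMPLIFIED [pin1 yin1] /gloss 1/gloss 2/
CEDICT_LINE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")


def parse_cedict_line(line):
    """Split one CEDICT line into its fields; return None for comments or junk."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = CEDICT_LINE.match(line)
    if match is None:
        return None
    trad, simp, pinyin, glosses = match.groups()
    return {"trad": trad, "simp": simp, "pinyin": pinyin,
            "glosses": glosses.split("/")}


def pivot_to_hungarian(entry, en_hu):
    """Collect Hungarian candidates for every English gloss found in en_hu."""
    candidates = []
    for gloss in entry["glosses"]:
        for hu in en_hu.get(gloss.lower(), []):
            if hu not in candidates:  # keep first-seen order, drop duplicates
                candidates.append(hu)
    return candidates


# Hypothetical miniature English => Hungarian word list (assumption).
EN_HU = {"book": ["könyv"], "letter": ["levél", "betű"]}

entry = parse_cedict_line("書 书 [shu1] /book/letter/")
raw_candidates = pivot_to_hungarian(entry, EN_HU)
```

Exact-match lookup like this misses multi-word glosses and over-generates for ambiguous English words (here "letter" yields both the mail sense and the character sense), which is why the result can only be a raw dictionary that needs manual revision.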

CHDICT features

XML. With no legacy data to maintain, CHDICT will be stored in XML from the very beginning. Besides preventing codepage issues, this is also expected to:

  • Enforce editorial rigor and improve the quality of the data.
  • Facilitate machine processing of the data for future purposes.

Annotation and structure. A few key features of CHDICT's representation of entries:

  • Senses are formally separated from one another, and so are the glosses and explanation sections within each sense.
  • The Chinese part of speech must be indicated for each sense in every manually revised entry.
  • Senses can optionally contain additional information including measure words; field, region, style; synonyms and antonyms; and example sentences.
  • Many items that are headwords in CEDICT will be listed as expressions under a sense, including V+Obj (打电话), V+Compl (记住) etc.
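
To make the structure above concrete, here is a small sketch of what such an entry could look like and how a program might read it. The CHDICT data format has not been published yet, so every element and attribute name below (entry, head, sense, pos, gloss, explanation, expression) is an assumption made for illustration, not the real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry; element and attribute names are illustrative only.
SAMPLE = """
<entry>
  <head simp="打" trad="打" pinyin="da3"/>
  <sense pos="V">
    <gloss>üt</gloss>
    <gloss>ver</gloss>
    <explanation>kézzel vagy eszközzel</explanation>
    <expression simp="打电话" pinyin="da3 dian4 hua4" type="V+Obj">
      <gloss>telefonál</gloss>
    </expression>
  </sense>
</entry>
"""


def summarize(entry_xml):
    """Return (part of speech, glosses, expressions) for each sense."""
    entry = ET.fromstring(entry_xml)
    summary = []
    for sense in entry.findall("sense"):
        # findall() only sees direct children, so glosses that belong to a
        # nested expression are not mixed into the sense's own glosses.
        glosses = [g.text for g in sense.findall("gloss")]
        expressions = [e.get("simp") for e in sense.findall("expression")]
        summary.append((sense.get("pos"), glosses, expressions))
    return summary


senses = summarize(SAMPLE)
```

Keeping 打电话 as an expression under the sense, rather than as a headword of its own, is the point of the last bullet: the V+Obj item stays attached to the verb sense it illustrates.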

Editing. The only way to edit entries will be through a web-based form that mirrors CHDICT's entry structure. Besides validating the data and enforcing its well-formedness, this form will also offer convenience functions such as hints for measure words and guessing the traditional form, simplified form, or pinyin from partially specified data.
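
As an illustration of the kind of check such a form could run before accepting a headword, the sketch below compares the simplified form, traditional form, and pinyin for consistency. The rules are assumptions, not the project's actual validation: it presumes CEDICT-style numbered pinyin (tones 1 to 5, "u:" for ü) and one syllable per character, which erhua forms and some other entries would violate. The function name validate_headword is hypothetical.

```python
import re

# One numbered-pinyin syllable: letters (or "u:" / "ü") followed by tone 1-5.
PINYIN_SYLLABLE = re.compile(r"^[a-zü:]+[1-5]$")


def validate_headword(simp, trad, pinyin):
    """Return a list of problems; an empty list means the headword passes."""
    errors = []
    if len(simp) != len(trad):
        errors.append("simplified and traditional forms differ in length")
    syllables = pinyin.split()
    if len(syllables) != len(simp):
        errors.append("pinyin syllable count does not match character count")
    for syllable in syllables:
        if not PINYIN_SYLLABLE.match(syllable.lower()):
            errors.append("malformed pinyin syllable: " + syllable)
    return errors


ok = validate_headword("电话", "電話", "dian4 hua4")
too_short = validate_headword("电话", "電話", "dian4")
no_tone = validate_headword("书", "書", "shu")
```

Returning a list of messages rather than raising on the first problem lets a form show all issues to the contributor at once.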

Version control. Two user roles, contributor and editor, will be distinguished, and every entry will remain marked as unapproved until an editor approves it. The complete version history of all entries will be stored in a database, and as approved entries accumulate, the master resource will be published periodically on the website as a single XML file.
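
The workflow described above can be sketched with a small in-memory model. Everything here is an assumption for illustration: the class and function names, the roles dictionary, and the chdict root element are hypothetical, and a real implementation would keep the history in a database rather than in Python lists.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Revision:
    author: str
    xml: str
    approved_by: Optional[str] = None  # None means "still marked as unapproved"


@dataclass
class EntryHistory:
    revisions: List[Revision] = field(default_factory=list)

    def submit(self, author: str, xml: str) -> None:
        """Any user may add a revision; it starts out unapproved."""
        self.revisions.append(Revision(author, xml))

    def approve_latest(self, user: str, roles: Dict[str, str]) -> None:
        """Only users with the editor role may approve."""
        if roles.get(user) != "editor":
            raise PermissionError(user + " is not an editor")
        self.revisions[-1].approved_by = user

    def latest_approved(self) -> Optional[Revision]:
        for revision in reversed(self.revisions):
            if revision.approved_by is not None:
                return revision
        return None


def publish(histories: List[EntryHistory]) -> str:
    """Build the single master XML file from approved revisions only."""
    approved = [h.latest_approved() for h in histories]
    body = "".join(r.xml for r in approved if r is not None)
    return "<chdict>" + body + "</chdict>"


roles = {"anna": "contributor", "bela": "editor"}
history = EntryHistory()
history.submit("anna", "<entry>1</entry>")
history.approve_latest("bela", roles)
history.submit("anna", "<entry>2</entry>")  # pending, not yet published
master = publish([history])
```

Because the full history is kept, a pending change never hides the last approved version: the published file keeps serving the approved revision until an editor signs off on the newer one.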

Status

December 29, 2006 – 5800 entries have been generated, excluding proper nouns. Work on the website, version control and dictionary engine is in progress.

I expect the website to go live in the first half of 2007.

Discussion

I will soon make the current working draft of CHDICT's data format available online. I hope it will also contribute to the discussion about CEDICT's future format: all comments are welcome.

Many of my decisions have been based on HanDeDict (e.g., fields of application). I believe a common convention for parts of speech, fields, styles and regions could benefit all of our projects.

I would also like to suggest creating a shared resource of Chinese example sentences and their translations in English, German, French and Hungarian.

Maintainer

The CHDICT project is maintained by Gábor Ugray.



articles/chdict.1167344127.txt.gz · Last modified: 2008/06/10 18:00 (external edit)
