Re: Corpora: non-alphabetic language databases

From: Simon G. J. Smith (smithsgj@eee.bham.ac.uk)
Date: Thu Nov 30 2000 - 12:34:26 MET

Next message: Thomas Schmidt: "AW: Corpora: non-alphabetic language databases"

Previous message: MIT2USA@aol.com: "Corpora: Thesis: Machine Translation and Controlled Language"
Maybe in reply to: Avryl2@aol.com: "Corpora: non-alphabetic language databases"
Next in thread: Mcenery, Tony: "RE: Corpora: non-alphabetic language databases"

Paula

Have a look at www.chinesecomputing.com

Are you a student of one of these languages? Take a look at a website from one of the countries without character-reading software running, and you will see that each character is represented by two bytes, which show up as a pair of unrelated characters: usually obscure things like ^ or `, and others that are not on the QWERTY keyboard at all.

My understanding is this: order of database entry is not computed from any phonetic system, nor from any arrangement of radicals or character components, but follows a standard (for Chinese, usually Big-5 or GB (Guo-Biao)) which maps each character onto a fixed pair of bytes. With the advent of the Unicode standard, a one-to-one mapping between characters and codes is also now possible, but implementations are rare.
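
To make that concrete, here is a rough Python sketch of what I mean. It is only an illustration, not how any particular database does it: the helper names (encoded_bytes, as_seen_without_cjk_support, sort_by_code) are made up for this example, and "gb2312" and "big5" are simply the codec names Python ships with. It shows the two bytes behind one character, what those bytes look like to software that does not know the encoding, and a sort that compares nothing but the byte codes.

```python
# Illustrative sketch only; the helper names are hypothetical.
# Uses only Python's built-in codecs ("gb2312", "big5", "latin-1").

def encoded_bytes(char, encoding):
    """Return the raw bytes the given standard assigns to the character."""
    return char.encode(encoding)

def as_seen_without_cjk_support(char, encoding):
    """Show the bytes as a viewer unaware of the encoding would.

    Latin-1 maps every byte to exactly one character, so it stands in
    for software that reads the page one byte at a time.
    """
    return char.encode(encoding).decode("latin-1")

def sort_by_code(words, encoding):
    """Sort words purely by their byte codes in the given standard."""
    return sorted(words, key=lambda w: w.encode(encoding))

sample = "中"
for enc in ("gb2312", "big5"):
    raw = encoded_bytes(sample, enc)
    print(enc, raw.hex(), repr(as_seen_without_cjk_support(sample, enc)))

# The resulting order depends on which standard supplied the codes.
words = ["中", "文", "字"]
print(sort_by_code(words, "gb2312"))
print(sort_by_code(words, "big5"))
```

The point of sort_by_code is that the comparison never looks at pronunciation or radicals; whatever order comes out is the order the standard's code table happens to give.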

I'm not an expert: perhaps there's one around who would care to add their comments?

This archive was generated by hypermail 2b29 : Thu Nov 30 2000 - 12:31:46 MET