Transcription: How Chinese Wikipedia fell into disarray

The evolution of the Chinese-language Wikipedia has followed a tortuous path. I suppose I’ve been around since the beginning, but really only to watch from the sidelines. In the beginning it was mostly mainland users who dominated in numbers, but over the past year or two, with the on-again-off-again filtering of mainland Chinese users, the site has shifted towards more users from Taiwan, Hong Kong, and elsewhere.

In recent months, some changes were made to the site with interesting implications. These changes are fairly unique to the Chinese language site but there is something to be learned from them.

First, there is currently just one stored version of the contents on the Chinese Wikipedia site. It wasn’t always so. The very first problem faced by implementors in the early days was the script issue. Users from different areas use two scripts, Simplified and Traditional. All characters, regardless of script, are assigned code points in Unicode. But changing between scripts is not as easy as changing fonts, because there is no one-to-one character mapping between them. The mapping is, however, very nearly many-to-one in the Traditional-to-Simplified direction. And because the characters carry semantic meaning, the one-to-many conversion in the other direction is also possible, given a context of surrounding characters as small as a word (usually fewer than 5 characters). Chinese Wikipedia therefore began with two stored versions of each article, one for each script. Since the articles diverged as different people edited different versions, a project was begun to copy edits from one to the other, at first manually and later automatically. Thus was born the automated transcription between the two scripts, and eventually everything was merged so that just one version was stored, with a user-selectable automatic script switcher. This worked pretty well for a while.
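To make the asymmetry concrete, here is a minimal Python sketch. The five-entry table is hand-picked for illustration and is not the table Wikipedia uses; the names `T2S`, `S2T`, and `to_simplified` are my own.

```python
# Toy Traditional -> Simplified table (illustrative only, not real site data).
T2S = {
    "發": "发",  # as in 發展 (to develop)
    "髮": "发",  # as in 頭髮 (hair)
    "頭": "头",
    "體": "体",
    "國": "国",
}


def to_simplified(text: str) -> str:
    """Convert character by character; unmapped characters pass through."""
    return "".join(T2S.get(ch, ch) for ch in text)


# Inverting the table shows why the reverse direction needs context:
# one Simplified character can have several Traditional candidates.
S2T = {}
for trad, simp in T2S.items():
    S2T.setdefault(simp, []).append(trad)

print(to_simplified("頭髮"))  # both words collapse onto the same 发
print(to_simplified("發展"))
print(S2T["发"])              # more than one candidate
```

Going from Traditional to Simplified, a per-character lookup is enough. Going back, the character 发 alone does not say which Traditional form is meant; only the surrounding word does.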

Because of the one-to-many mapping, the implementors used a word table to convert certain words as units. Because of this tactic, they very easily and effortlessly slid into the role of making a dictionary, without even realizing it. What they did was to incorporate entries for lexical differences between mainland China and Taiwan, notably of foreign loan words in the last fifty years. This, as I will show, is a huge mistake.
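A rough sketch of how a word table resolves the ambiguity, and how the same mechanism quietly turns into a dictionary. This is Python with tiny made-up tables and a greedy longest-match loop of my own; I am not claiming this is how the site's converter is actually written.

```python
# Hypothetical tables for Simplified -> Traditional conversion.
CHAR_S2T = {"头": "頭", "发": "發", "国": "國", "软": "軟"}  # per-character default
SCRIPT_WORDS = {"头发": "頭髮"}   # script fix: 发 is 髮 in the word for "hair"
LEXICAL_WORDS = {"软件": "軟體"}  # word choice: mainland vs. Taiwan term for "software"


def convert(text: str, word_table: dict, char_table: dict) -> str:
    """Greedy longest-match on the word table, falling back to single characters."""
    max_len = max(map(len, word_table), default=1)
    out = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in word_table:
                out.append(word_table[chunk])
                i += n
                break
        else:
            # No word entry starts here: convert one character and move on.
            out.append(char_table.get(text[i], text[i]))
            i += 1
    return "".join(out)


# Pure transcription: same word, different script.
print(convert("软件", SCRIPT_WORDS, CHAR_S2T))
# With a lexical entry added, the word itself changes.
print(convert("软件", {**SCRIPT_WORDS, **LEXICAL_WORDS}, CHAR_S2T))
```

The entry in `SCRIPT_WORDS` only picks the right Traditional character for the same word. The entry in `LEXICAL_WORDS` goes through exactly the same code path but replaces one word with another, which is translation rather than transcription.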

But first, a story. I remember when I was still not very used to the Traditional script, I asked on an internal Microsoft Chinese employees list whether anybody had some kind of codepage file (basically a dictionary) to automatically convert Traditional characters to Simplified characters. This is a many-to-one task, so a simple codepage would do. The response I got was that, no, Microsoft only had codepages to convert between "locales," which meant Taiwan, Hong Kong, etc., and Unicode. So I asked, well why can't you stack two of these together back-to-back with Unicode in the middle, in order to convert between the "Taiwan" locale, say, and the "mainland China" locale. Then I got a response from some self-righteous Taiwanese Microsoftie that Microsoft only does things "the right way" and to make that conversion "correctly," lexical translation is required. I said, no, I don't care about lexical differences, stop telling me what my question should be, and answer my actual question: can I make a codepage to go from the Traditional to the Simplified script. I never got an answer to that beyond "Microsoft never makes things that are hacked together like that." Two revisions of MS Office later, there came a Word command called "Chinese Translation," which converts between Traditional and Simplified scripts and optionally "translates" some words, too, so at least that problem is solved. But evidently, because of the way things are in the world right now, Simplified and Traditional scripts will always be associated with specific locales.

In any case, as the conversion tables grew in size, the implementors of Chinese Wikipedia asked the users to help maintain them. But in one of the worst blunders, they allowed the users to edit the tables without specifically agreed-upon rules about what the tables can contain and how they may be used. As more articles are added to Wikipedia, lexical differences sometimes crop up in how to name an article. This is fairly similar to what happens on English Wikipedia, and it is something the existing rules can deal with. However, some discussion pages got fairly heated over the proper naming of things, not only in the title but as they appear in the rest of the site. And because the conversion tables existed, people began simply adding entries to them, so that what used to be a simple script conversion became a partial dynamic translation.

Why is this all bad? Because it is a loophole allowing privilege escalation. There is a standard process for creating and modifying content, and for resolving content conflicts. As soon as the conversion table enters the picture, the process can be bypassed easily by injecting conflicting content into the table, which is not subject to the same process. (In fact it is difficult even to bring up discussion of the table, because it seems so objective: it seems to be just a “transcription table” even when it no longer is.) Now there are two versions of content. Once locale-specific lexical edits to the table became a precedent, other requests came to distinguish between “Taiwan Traditional script” and “Hong Kong Traditional script,” even though that isn’t a script issue but a content issue that ought to be dealt with in the discussion pages of specific articles. Instead, because the issue was misidentified as a script issue, more conversion tables were created and more divergent content not subject to review was added to them, so that now there are five versions (!) of Chinese Wikipedia, and none that does not force the reader to read either a regional variant or a mixture of scripts.

What’s the lesson? It is critical to check the technical process for privilege-escalation devices like these conversion tables, and to make sure that any that exist are either carefully and narrowly defined or put under the same agreed-upon editorial process as articles.


Edit: This post stemmed from a discussion I started on Chinese Wikipedia (user “TTTT”) to advocate for separating transcription and translation. Those changes were implemented about a year later with the effort of many others, and remain to this day. Theoretically, if you replace the “/wiki/” part of any article’s URL on zh.wikipedia.org by “/zh-hans/” (resp. “/zh-hant/”), you should see the pure transcribed Simplified script (resp. Traditional script) version of the stored article, without locale-specific phrasal translations. Unfortunately, the situation has regressed once again with the introduction of article- and topic-specific exception tables where the two concepts are frequently conflated.
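For anyone who wants to script the URL substitution described above, a small Python sketch follows. The function name `variant_url` and the article title `Example` are placeholders of mine, and whether the resulting page is a pure transcription depends on the site still behaving as described.

```python
from urllib.parse import urlsplit, urlunsplit


def variant_url(url: str, variant: str) -> str:
    """Swap the leading /wiki/ path segment for a variant such as zh-hans or zh-hant."""
    parts = urlsplit(url)
    prefix = "/wiki/"
    if not parts.path.startswith(prefix):
        raise ValueError("expected an article URL whose path starts with /wiki/")
    new_path = "/" + variant + "/" + parts.path[len(prefix):]
    return urlunsplit(parts._replace(path=new_path))


print(variant_url("https://zh.wikipedia.org/wiki/Example", "zh-hans"))
print(variant_url("https://zh.wikipedia.org/wiki/Example", "zh-hant"))
```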
