Some stuff » wikipedia

product integrals

admin — Thu, 12 Nov 2009 01:00:40 +0000

Nice. Used one today.

http://en.wikipedia.org/wiki/Product_integral

death of Encarta

admin — Tue, 31 Mar 2009 16:44:34 +0000

Here’s an interesting article about the shutdown of Encarta, the encyclopedia published by Microsoft, and its implications for the media, information, and publishing landscape at large.

At first, I thought it was the CD version that was being shut down, but no, it’s the online version; apparently the former, along with many Microsoft Home products (some were classics), had long been discontinued. Incidentally, I’ve used the CD product, but never the online product — I’ve been aware of it because it comes up in searches, but since it’s just the CD version put online, I’m not surprised it is meeting the same fate. It just goes to show that whatever process is driving traditional publishing into the ground is rather far along.

I, for one, still remember paper encyclopedias. For that matter, I still remember when libraries used card catalogues (you pulled the cards out of a small drawer to find the Dewey decimal number), but these became extinct at about the same time as the 5.25″ floppy disk. As for encyclopedias, they sat as multivolume collections in the reference section. Maybe they still do? Haven’t been to a public library in a long time…

The first CD encyclopedia I remember was Grolier’s. Its selling point was some animations in articles. For a time these encyclopedias were useful for school projects, but by high school they seemed pretty useless: the articles just had too low a signal-to-noise ratio. Maybe they did not provide enough depth, or the short list of references was not adequate, or there was so much fluff that simple queries could not be answered directly. Often the writers were totally full of themselves, too (reminds me of about.com). The end result was that these references could not be used directly (plagiarism aside), nor were the raw data in them easily extractable. I think that’s one reason why I stopped using them, whatever medium the encyclopedias came in. The other reason was that such generalist information was not difficult to find on the internet, even without a Wikipedia.

So while the comparison to Wikipedia is appealing as a foil, these products really failed on their own merits: they were generally inadequate and inferior products, and they were not even free for being so.* The economic realities of that are only catching up now. And if newspapers follow them there, it will be because newspapers have long since become wire service repeaters, not because of the existence of Google News. Incidentally, I haven’t wanted to subscribe to these newspapers for a long time, either.

* Inferior compared to what, you say. Isn’t it the existence of a “better” alternative that lies at the crux of the matter? Actually, no. The inferiority is measured by the nagging feeling of not having learned much. The reality is that, without an alternative, one would just know less and, unless effort were expended, be resigned to that… (Economically, of course the existence of an alternative matters, but that’s a separate issue.)

Is this true?

admin — Sat, 07 Mar 2009 21:41:39 +0000

So this thing on Wikipedia

http://en.wikipedia.org/wiki/Noisy-channel_coding_theorem

could have left it at the classical statement of the theorem with bullet #1. Then it goes on to say:

2. If a probability of bit error \(p_b\) is acceptable, rates up to \(R(p_b)\) are achievable, where

\(R(p_b) = \frac{C}{1-H_2(p_b)}\).

3. For any \(p_b\), rates greater than \(R(p_b)\) are not achievable.

I have never seen this before. At first glance, this seems questionable, as Fano’s converse gives \(P_e^{(n)} \ge 1 - \frac{1}{nR} - \frac{C}{R}\), which seems to converge to \(1 - \frac{C}{R} = H_2(p_b)\) at \(R = R(p_b)\), and \(H_2(p_b) \ge p_b\) for \(p_b \in [0, 0.5]\). So it must mean whatever is used to code this is not going to be a long block code.

One example where this is true is the binary symmetric channel with uncoded transmission. But I’m not so sure what the achievability scheme is in general, although I have some ideas: it may involve quantizing the excess codewords to the nearest zero-error codewords. As for the converse, I have no idea.
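The uncoded binary symmetric channel case can be checked numerically. The sketch below assumes a crossover probability p, so that C = 1 - H_2(p), and treats uncoded transmission as rate 1 bit per channel use with bit error p. The function names (`binary_entropy`, `bsc_capacity`, `max_rate`) are mine, chosen for illustration; only the formula for R(p_b) comes from the quoted statement.

```python
import math

def binary_entropy(p: float) -> float:
    """H_2(p) in bits, with H_2(0) = H_2(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(crossover: float) -> float:
    """Capacity of a binary symmetric channel: C = 1 - H_2(crossover)."""
    return 1.0 - binary_entropy(crossover)

def max_rate(capacity: float, p_b: float) -> float:
    """R(p_b) = C / (1 - H_2(p_b)) from the quoted statement (needs p_b != 0.5)."""
    return capacity / (1.0 - binary_entropy(p_b))

# Uncoded transmission over the BSC: rate 1, bit error equal to the crossover p.
for p in (0.01, 0.05, 0.11, 0.25):
    c = bsc_capacity(p)
    print(f"p={p:.2f}  C={c:.4f}  R(p_b=p)={max_rate(c, p):.4f}")
```

For every p the last column comes out as 1, so uncoded transmission sits exactly on the claimed boundary for this channel. That says nothing about the general achievability scheme or the converse.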

As for the statement itself, it is really unclear what is meant by “bit error”. In the classical statement, a message from a large alphabet is coded into some \(X^n \in \mathcal{X}^n\) where \(\mathcal{X}\) is the channel input alphabet. After decoding, \(X^n\) is either found correctly, or it is in error. There is no “bit” in here. Even if \(X\) is binary, is the bit error the received (uncooked) bit error? Or is it the decoded (cooked) bit error? Why should the decoded bit error matter? Isn’t that a codebook artifact? Or is it the bit error in the original message, if the original message is to be represented by a bit-stream? But that is also entirely arbitrary.

Anyway, I’d like a clarification from someone, or a reference.

Transcription: How Chinese Wikipedia fell into disarray

admin — Fri, 02 Feb 2007 09:55:19 +0000

The evolution of the Chinese language Wikipedia has followed a tortuous path. I suppose I’ve been around since the beginning, but really only to watch from the sidelines. In the beginning mainland users dominated in numbers, but over the last year or two, with the on-again-off-again filtering of mainland Chinese users, the site has shifted towards more users from Taiwan, Hong Kong, and elsewhere.

In recent months, some changes were made to the site with interesting implications. These changes are largely specific to the Chinese language site, but there is something to be learned from them.

First, there is currently just one stored version of the contents on the Chinese Wikipedia site. It wasn’t always so. The very first problem the implementors faced in the early days was the script issue. Users from different areas use two scripts, Simplified and Traditional. All characters, regardless of script, are assigned code points in Unicode. But changing between scripts is not as easy as changing fonts, because there is no one-to-one character mapping between them. The mapping is very nearly many-to-one in the Traditional-to-Simplified direction, however. And because the characters carry semantic meaning, the one-to-many conversion in the other direction is also possible, given a character-cluster context as small as a word (usually fewer than 5 characters). Chinese Wikipedia therefore began with two stored versions of each article, one for each script. Since the versions diverged as different people edited them, a project was begun to copy edits from one to the other, at first manually and later automatically. Thus was born the automated transcription between the two scripts, so that eventually everything was merged and just one version was stored, with a user-selectable automatic script switcher. This worked pretty well for a while.
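To make the two directions concrete, here is a minimal sketch with a handful of entries. The tables and function names are illustrative only and are not Wikipedia's actual conversion tables. Traditional to Simplified needs only a per-character lookup, since 發 and 髮 both become 发. The reverse direction consults a small word table first (longest match wins) and falls back to a per-character guess.

```python
# Traditional -> Simplified, per character. Many-to-one: 發 and 髮 both map to 发.
T2S_CHARS = {"發": "发", "髮": "发", "頭": "头", "後": "后", "來": "来"}

# Simplified -> Traditional, per word, to settle the one-to-many cases by context.
S2T_WORDS = {"头发": "頭髮", "发展": "發展", "皇后": "皇后", "后来": "後來"}
# Per-character fallback for text outside any known word (a guess, not a rule).
S2T_CHARS = {"发": "發", "头": "頭", "后": "後", "来": "來"}
MAX_WORD_LEN = max(len(word) for word in S2T_WORDS)

def to_simplified(text: str) -> str:
    """Character-by-character lookup; unknown characters pass through."""
    return "".join(T2S_CHARS.get(ch, ch) for ch in text)

def to_traditional(text: str) -> str:
    """Longest-match word lookup, then per-character fallback."""
    out = []
    i = 0
    while i < len(text):
        # Try multi-character words first, longest candidate first.
        for n in range(min(MAX_WORD_LEN, len(text) - i), 1, -1):
            word = text[i:i + n]
            if word in S2T_WORDS:
                out.append(S2T_WORDS[word])
                i += n
                break
        else:
            # No word matched at this position: convert a single character.
            out.append(S2T_CHARS.get(text[i], text[i]))
            i += 1
    return "".join(out)

print(to_simplified("頭髮"), to_simplified("發展"))    # both contain 发 afterwards
print(to_traditional("头发"), to_traditional("发展"))  # 发 resolves differently per word
```

The same Simplified character 发 comes back as 髮 in one word and 發 in the other, which is the word-sized context described above.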

Because of the one-to-many mapping, the implementors used a word table to convert certain words as units. With this tactic, they slid very easily into the role of making a dictionary, without even realizing it. What they did was to incorporate entries for lexical differences between mainland China and Taiwan, notably for foreign loan words from the last fifty years. This, as I will show, was a huge mistake.

But first, a story. I remember when I was still not very used to the Traditional script, I asked on an internal Microsoft Chinese employees list whether anybody had some kind of codepage file (basically a dictionary) to automatically convert Traditional characters to Simplified characters. This is a many-to-one task, so a simple codepage would do. The response I got was that, no, Microsoft only had codepages to convert between "locales," which meant Taiwan, Hong Kong, etc., and Unicode. So I asked, well why can't you stack two of these together back-to-back with Unicode in the middle, in order to convert between the "Taiwan" locale, say, and the "mainland China" locale. Then I got a response from some self-righteous Taiwanese Microsoftie that Microsoft only does things "the right way" and to make that conversion "correctly," lexical translation is required. I said, no, I don't care about lexical differences, stop telling me what my question should be, and answer my actual question: can I make a codepage to go from the Traditional to the Simplified script. I never got an answer to that beyond "Microsoft never makes things that are hacked together like that." Two revisions of MS Office later, there came a Word command called "Chinese Translation," which converts between Traditional and Simplified scripts and optionally "translates" some words, too, so at least that problem is solved. But evidently, because of the way things are in the world right now, Simplified and Traditional scripts will always be associated with specific locales.

In any case, as the conversion tables grew in size, the implementors of Chinese Wikipedia asked the users to help maintain the tables, but in one of the worst blunders, they allowed the users to edit the tables without specifically agreed upon rules for what the tables could contain and how they could be used. As more articles were added to Wikipedia, lexical differences sometimes cropped up in how to name an article. This is fairly similar to what happens on English Wikipedia and is something that its rules can deal with. However, some discussion pages got fairly heated over the proper naming of things, not only in the title but as they appear in the rest of the site. And because the conversion tables existed, people began to simply add entries to them, so that what used to be a simple script conversion became a partial dynamic translation.
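A small sketch of how a single added entry changes the nature of the table. The tables here are hypothetical and tiny, and `convert` is my own helper rather than the site's converter. The example word is 软件, the mainland term for "software", for which Taiwan usage is 軟體.

```python
import re

def convert(text: str, table: dict[str, str]) -> str:
    """Single-pass replacement that prefers the longest matching table key."""
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: table[match.group(0)], text)

# Script-only entry (Simplified -> Traditional): same word, different glyph.
script_table = {"软": "軟"}

# The same table after one lexical entry is added: 软件 is now replaced by
# 軟體, which is a different word and not merely a different script.
mixed_table = {**script_table, "软件": "軟體"}

print(convert("软件", script_table))  # transcription only
print(convert("软件", mixed_table))   # transcription plus a word substitution
```

With `script_table` the stored word is only re-rendered in the other script. With `mixed_table` the reader is shown a word the article's editors never wrote, and nothing in the article's history records the change.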

Why is this all bad? Because it is a loophole allowing privilege escalation. There is a standard process for creating and modifying content, and for resolving content conflicts. As soon as the conversion table enters the picture, the process can be bypassed easily by injecting conflicting content into the table, which is not subject to the same process. (In fact it is difficult to even bring up discussion on the table, because it seems so objective: it seems to be just a “transcription table” even when it no longer is.) Now there are two versions of content. Once locale-specific lexical edits to the table became a precedent, other requests came to distinguish between “Taiwan Traditional script” and “Hong Kong Traditional script” even when it isn’t a script issue but a content issue that ought to be dealt with in the discussion pages of specific articles. Instead, because the issue was misidentified as a script issue, more conversion tables were created and more divergent content not subject to review was added to these tables, so that now there are five versions (!) of Chinese Wikipedia, every one of which forces the reader to read either a regional variant or a mixture of scripts.

What’s the lesson? It is critical to check the technical process for privilege escalation devices like conversion tables, and to make sure any that exist are either carefully and narrowly defined or put under the same agreed upon editorial process as articles.

Edit: This post stemmed from a discussion I started on Chinese Wikipedia (user “TTTT”) to advocate for separating transcription and translation. Those changes were implemented about a year later with the effort of many others, and remain to this day. Theoretically, if you replace the “/wiki/” part of any article’s URL on zh.wikipedia.org by “/zh-hans/” (resp. “/zh-hant/”), you should see the pure transcribed Simplified script (resp. Traditional script) version of the stored article, without locale-specific phrasal translations. Unfortunately, the situation has regressed once again with the introduction of article- and topic-specific exception tables where the two concepts are frequently conflated.
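The URL substitution described above can be written as a small helper. This is only a sketch: `variant_url` is a name I made up, and it assumes the article URL has a path beginning with "/wiki/".

```python
from urllib.parse import urlsplit, urlunsplit

def variant_url(article_url: str, variant: str) -> str:
    """Swap the leading /wiki/ path segment for /<variant>/ (zh-hans or zh-hant)."""
    parts = urlsplit(article_url)
    prefix = "/wiki/"
    if not parts.path.startswith(prefix):
        raise ValueError("expected an article URL whose path starts with /wiki/")
    new_path = f"/{variant}/" + parts.path[len(prefix):]
    return urlunsplit(parts._replace(path=new_path))

print(variant_url("https://zh.wikipedia.org/wiki/Wikipedia", "zh-hans"))
print(variant_url("https://zh.wikipedia.org/wiki/Wikipedia", "zh-hant"))
```

Query strings and fragments are carried over unchanged, and anything that is not a "/wiki/" URL raises an error instead of being silently rewritten.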