entropy of English

admin — Sun, 15 Sep 2024 20:26:47 +0000

This video on model scaling highlights an interesting point, which is to use the best models (nowadays LLMs) to estimate the entropy rate of a source, in this case the English language. This isn’t a new idea at all; empirical entropy measurements have been done in the past.

What’s interesting is that past estimates of bits per word of English have been way, way higher. Shannon’s original estimates are 11.82 bits per word, for example, or 2.62 bits per letter.
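As a rough illustration of why frequency-based estimates come out so high, here is a minimal sketch of an empirical entropy measurement that ignores context entirely (a unigram model). The function name `unigram_entropy_bits` and the sample sentence are my own placeholders, and this is not Shannon’s actual procedure; it only shows the kind of calculation involved.

```python
import math
from collections import Counter

def unigram_entropy_bits(symbols):
    """Empirical entropy, in bits per symbol, of a sequence under a
    context-free (unigram) model: -sum p log2 p over observed symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical toy sample; a real estimate needs a large corpus.
sample = "the quick brown fox jumps over the lazy dog and the cat"
bits_per_letter = unigram_entropy_bits(sample.replace(" ", ""))
bits_per_word = unigram_entropy_bits(sample.split())
print(f"{bits_per_letter:.2f} bits/letter, {bits_per_word:.2f} bits/word")
```

Because a model like this sees no context, it can only overestimate the entropy rate; better models condition on more of the preceding text and drive the number down.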

Some more recent estimates are understandably lower, such as the one referenced in this Stack Overflow answer, which reports 5.7 bits per word. In this video we have the notion that the entropy of English is either undetectably low (which is impossible and suggests model overfitting), or quite low, like 1.69 bits per token in this DeepMind paper.

Now, this video plays fast and loose with the units of what’s reported, so we need to be careful. What’s a token, you say? This Stack Overflow answer says that for several models it is “approximately 4 characters or 3/4 of a word”. This paper uses a research model called Chinchilla and doesn’t say what a token is, but let’s take it to be the conventional 3/4 of a word. That makes the DeepMind result really \(1.69 / 0.75 \approx 2.25\) bits per word.
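The unit conversion is simple enough to write down. This is a sketch that assumes the rule of thumb quoted above (3/4 of a word, or about 4 characters, per token) applies to Chinchilla’s tokenizer, which the paper does not confirm; the function names are mine.

```python
def bits_per_word(bits_per_token, words_per_token=0.75):
    """Convert bits/token to bits/word, assuming a fixed words-per-token ratio."""
    return bits_per_token / words_per_token

def bits_per_char(bits_per_token, chars_per_token=4.0):
    """Convert bits/token to bits/character, assuming ~4 characters per token."""
    return bits_per_token / chars_per_token

chinchilla_bits_per_token = 1.69  # figure quoted from the DeepMind paper
print(f"{bits_per_word(chinchilla_bits_per_token):.2f} bits/word")
print(f"{bits_per_char(chinchilla_bits_per_token):.2f} bits/char")
```

If the true ratio for a given tokenizer differs from 3/4, the per-word figure scales accordingly, so the result is only as good as that assumption.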

Then the next plot, showing OpenAI’s GPT-4 performance from their Technical Report, is even more extreme: so far as I can tell from the graph, it shows between 1.2 and 1.3 bits per word. Let’s say 1.25 bits per word then. At that entropy rate, each word disambiguates only about \(2^{1.25} \approx 2.4\) possibilities on average!
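To put the different estimates on the same footing, one can convert each entropy rate \(H\) (in bits per word) into \(2^H\), the number of equally likely choices per word it corresponds to. A small sketch, using only the figures quoted in this post (the helper name `effective_choices` is mine):

```python
def effective_choices(bits_per_word):
    """Number of equally likely alternatives per word implied by an entropy rate."""
    return 2.0 ** bits_per_word

estimates = [
    ("Shannon", 11.82),
    ("Stack Overflow answer", 5.7),
    ("GPT-4 (read off the graph)", 1.25),
]
for label, bits in estimates:
    print(f"{label}: {bits} bits/word -> {effective_choices(bits):.1f} choices/word")
```

The drop from thousands of effective choices per word to between two and three is what makes the newer numbers so striking.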

That seems very, very low but … plausible. Perhaps a natural-language semantic unit (not necessarily a word) tends to distinguish between 2 opposing possibilities, for easy mental processing: you know, things like black vs. white, high vs. low, large vs. small. The extra 0.4 possibilities may be grammatical information that rides along for free on English words. Any lower than 2 possibilities per word seems improbably low and inefficient as a communication mechanism. So if these models are for real, we must be getting very close to the true lower bound here and, consequently, to optimal model performance on English language modeling. It also nicely confirms Shannon’s thesis (in my attribution) that any source is statistical, and that merely by using larger and larger contexts, its generation can be made arbitrarily realistic.

data structure problem

admin — Fri, 09 Mar 2012 04:50:41 +0000

Another problem by fakalin.

A data structure has the entropy bound if all queries have amortized time \(O(\sum_k p_k \log 1/p_k)\), where \(p_k\) is the fraction of the time that key \(k\) is queried. It has the working-set property if the time to search for an element \(x_i\) is \(O(\log t_i)\), where \(t_i\) is the number of elements queried since the last access to \(x_i\). Prove that the working-set property implies the entropy bound.

This isn’t really a data structure problem, per se.

The general intuition here is that, if the waiting time between two queries to a key \(k\) is \(t(k)\), then key \(k\) ends up taking up about a \(p_k = 1/t(k)\) fraction of the queries, and therefore the average query time is about \(\sum_{k\in K} (1/t(k)) \log t(k) = \sum_{k\in K} p_k \log (1/p_k)\).

While this is exactly true for evenly spaced-out queries, the general case only requires a slight modification using any of the rudimentary convex inequalities such as:

Jensen’s inequality: If \(f\) is a convex function and \(X\) is a random variable, then \(\mathbb{E}f(X)\ge f(\mathbb{E}X)\).

(The proof just uses the definition of what a convex function is. Furthermore, if we recognize that \(-\log x\) is a convex function, then we get \(\mathbb{E}\log(X)\le \log(\mathbb{E}X)\), which restates the well-known fact that the geometric mean is less than or equal to the arithmetic mean of a collection of real numbers.)
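A quick numerical sanity check of the concave form, \(\mathbb{E}\log(X)\le \log(\mathbb{E}X)\), on randomly drawn positive numbers. This is only an illustration, not a proof; the sample values stand in for hypothetical waiting times.

```python
import math
import random

def mean(xs):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(xs) / len(xs)

random.seed(0)
# Hypothetical waiting times between repeated queries to one key.
gaps = [random.randint(1, 1000) for _ in range(1000)]

mean_of_logs = mean([math.log2(t) for t in gaps])  # E[log X]
log_of_mean = math.log2(mean(gaps))                # log E[X]

# Jensen for the concave log: the mean of the logs never exceeds the log of the mean.
print(mean_of_logs <= log_of_mean)
```

Exponentiating both sides recovers the geometric-mean versus arithmetic-mean statement.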

Now, let \(N\) be the total number of queries. Let \(n(k)\) be the number of queries on key \(k\). Let \(t_i(k)\) be the time between the \(i\)th query on key \(k\) and the previous query on the same key. Let \(\bar{a}(k)\) be the average query time for looking up key \(k\); the working-set property guarantees \(\bar{a}(k)< C \sum_{i=1}^{n(k)} \log t_i(k) / n(k)\) for some \(C>0\) (*). Let \(\bar{t}(k) = \sum_{i=1}^{n(k)} t_i(k) / n(k)\) be the average time between queries for key \(k\).

Note that we must have \(\bar{t}(k) \le N/n(k)\). Furthermore, \(p_k = n(k)/N\) by definition, hence \(\bar{t}(k) \le 1/p_k\) (**). The average query time over all keys is therefore:

\(\sum_k \bar{a}(k) n(k) / N\)
\(< \sum_k [C \sum_{i=1}^{n(k)} \log t_i(k) / n(k)] [n(k) / N]\), by (*)
\(\le C \sum_k [\log \sum_{i=1}^{n(k)} t_i(k) / n(k)] [n(k) / N]\), by Jensen’s inequality
\(= C \sum_k [\log \bar{t}(k)] p_k\)
\(\le C \sum_k p_k \log 1/p_k\), by (**)

∎
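The chain of inequalities can be checked numerically on a simulated query sequence. The sketch below makes a few assumptions of its own: it charges exactly \(\log_2 t_i(k)\) per query (so \(C = 1\)), it measures \(t_i(k)\) as the gap in query positions as in the proof, and for the first query to a key it takes the gap from the start of the sequence so that the gaps for each key sum to at most \(N\). The function name and the skewed key distribution are hypothetical.

```python
import math
import random
from collections import Counter

def working_set_vs_entropy(queries):
    """Return (average log2 gap per query, empirical entropy in bits)."""
    n_total = len(queries)
    last_seen = {}        # key -> 1-based position of its most recent query
    total_log_gap = 0.0
    for pos, key in enumerate(queries, start=1):
        gap = pos - last_seen.get(key, 0)   # first query: gap from the start
        total_log_gap += math.log2(gap)
        last_seen[key] = pos
    avg_cost = total_log_gap / n_total
    counts = Counter(queries)
    # sum_k p_k log2(1/p_k) with p_k = n(k)/N
    entropy = sum((c / n_total) * math.log2(n_total / c) for c in counts.values())
    return avg_cost, entropy

random.seed(1)
keys = list(range(20))
weights = [1 / (k + 1) for k in keys]       # skewed, Zipf-like popularity
queries = random.choices(keys, weights=weights, k=5000)

avg_cost, entropy = working_set_vs_entropy(queries)
print(f"average working-set cost: {avg_cost:.3f}, entropy: {entropy:.3f}")
```

For evenly spaced queries such as `[0, 1, 2, 3] * 100`, the two quantities nearly coincide, matching the intuition that the bound is tight in that case.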
