### data structure problem

Another problem by fakalin.

A data structure has the entropy bound if all queries have amortized time \(O(\sum_k p_k \log 1/p_k)\), where \(p_k\) is the fraction of the time that key \(k\) is queried. It has the working-set property if the time to search for an element \(x_i\) is \(O(\log t_i)\), where \(t_i\) is the number of elements queried since the last access to \(x_i\). Prove that the working-set property implies the entropy bound.

This isn’t really a data structure problem, per se.

The general intuition here is that, if the waiting time between two queries to a key \(k\) is \(t(k)\), then key \(k\) ends up taking up about a \(p_k = 1/t(k)\) fraction of the queries, and therefore the average query time is about \(\sum_{k\in K} 1/t(k) (\log t(k))\).

While this is exactly true for evenly spaced-out queries, the general case only requires a slight modification using any of the rudimentary convex inequalities such as:

**Jensen’s inequality:** If \(f\) is a convex function and \(X\) is a random variable, then \(\mathbb{E}f(X)\ge f(\mathbb{E}X)\).

(The proof just uses the definition of what a convex function is. Furthermore, if we recognize that \(-\log x\) is a convex function, then we get \(\mathbb{E}\log(X)\le \log(\mathbb{E}X)\), which restates the well-known fact that the geometric mean is less than or equal to the arithmetic mean of a collection of real numbers.)

Now, let \(N\) be the total number of queries. Let \(n(k)\) be the number of queries on key \(k\). Let \(t_i(k)\) be the time between the \(i\)th query on key \(k\) and the previous query on the same key. Let \(\bar{a}(k)< C \sum_{i=1}^{n(k)} \log t_i(k) / n(k)\) (for some \(C>0\)) be the *average* query time looking up key \(k\), as guaranteed by the working-set property (*). Let \(\bar{t}(k) = \sum_{i=1}^{n(k)} t_i(k) / n(k)\) be the *average* time between queries for key \(k\).

Note that we must have \(\bar{t}(k) \le N/n(k)\). Furthermore, \(p_k = n(k)/N\) by definition, hence \(\bar{t}(k) \le 1/p_k\) (**). The average query time over all keys is therefore:

\(\sum_k \bar{a}(k) n(k) / N\)

\(< \sum_k [C \sum_{i=1}^{n(k)} \log t_i(k) / n(k)] [n(k) / N]\), by (*)

\(\le C \sum_k [\log \sum_{i=1}^{n(k)} t_i(k) / n(k)] [n(k) / N]\), by Jensen’s inequality

\(= C \sum_k [\log \bar{t}(k)] p_k\)

\(\le C \sum_k p_k \log 1/p_k\), by (**)

∎