### data structure problem

Another problem by fakalin.

A data structure has the entropy bound if all queries have amortized time $$O(\sum_k p_k \log 1/p_k)$$, where $$p_k$$ is the fraction of the time that key $$k$$ is queried. It has the working-set property if the time to search for an element $$x_i$$ is $$O(\log t_i)$$, where $$t_i$$ is the number of elements queried since the last access to $$x_i$$. Prove that the working-set property implies the entropy bound.

This isn’t really a data structure problem, per se.

The general intuition here is that, if the waiting time between two queries to a key $$k$$ is $$t(k)$$, then key $$k$$ ends up taking up about a $$p_k = 1/t(k)$$ fraction of the queries, and therefore the average query time is about $$\sum_{k\in K} 1/t(k) (\log t(k))$$.

While this is exactly true for evenly spaced-out queries, the general case only requires a slight modification using any of the rudimentary convex inequalities such as:

Jensen’s inequality: If $$f$$ is a convex function and $$X$$ is a random variable, then $$\mathbb{E}f(X)\ge f(\mathbb{E}X)$$.

(The proof just uses the definition of what a convex function is. Furthermore, if we recognize that $$-\log x$$ is a convex function, then we get $$\mathbb{E}\log(X)\le \log(\mathbb{E}X)$$, which restates the well-known fact that the geometric mean is less than or equal to the arithmetic mean of a collection of real numbers.)

Now, let $$N$$ be the total number of queries. Let $$n(k)$$ be the number of queries on key $$k$$. Let $$t_i(k)$$ be the time between the $$i$$th query on key $$k$$ and the previous query on the same key. Let $$\bar{a}(k)< C \sum_{i=1}^{n(k)} \log t_i(k) / n(k)$$ (for some $$C>0$$) be the average query time looking up key $$k$$, as guaranteed by the working-set property (*). Let $$\bar{t}(k) = \sum_{i=1}^{n(k)} t_i(k) / n(k)$$ be the average time between queries for key $$k$$.

Note that we must have $$\bar{t}(k) \le N/n(k)$$. Furthermore, $$p_k = n(k)/N$$ by definition, hence $$\bar{t}(k) \le 1/p_k$$ (**). The average query time over all keys is therefore:

$$\sum_k \bar{a}(k) n(k) / N$$
$$< \sum_k [C \sum_{i=1}^{n(k)} \log t_i(k) / n(k)] [n(k) / N]$$, by (*)
$$\le C \sum_k [\log \sum_{i=1}^{n(k)} t_i(k) / n(k)] [n(k) / N]$$, by Jensen’s inequality
$$= C \sum_k [\log \bar{t}(k)] p_k$$
$$\le C \sum_k p_k \log 1/p_k$$, by (**)