data structure problem

admin — Fri, 09 Mar 2012 04:50:41 +0000

Another problem by fakalin.

A data structure has the entropy bound if all queries have amortized time \(O(\sum_k p_k \log 1/p_k)\), where \(p_k\) is the fraction of the time that key \(k\) is queried. It has the working-set property if the time to search for an element \(x_i\) is \(O(\log t_i)\), where \(t_i\) is the number of elements queried since the last access to \(x_i\). Prove that the working-set property implies the entropy bound.

This isn’t really a data structure problem, per se.

The general intuition here is that, if the waiting time between two queries to a key \(k\) is \(t(k)\), then key \(k\) ends up taking up about a \(p_k = 1/t(k)\) fraction of the queries, and therefore the average query time is about \(\sum_{k\in K} 1/t(k) (\log t(k))\).

While this is exactly true for evenly spaced-out queries, the general case only requires a slight modification using any of the elementary convexity inequalities, such as:

Jensen’s inequality: If \(f\) is a convex function and \(X\) is a random variable, then \(\mathbb{E}f(X)\ge f(\mathbb{E}X)\).

(The proof just uses the definition of what a convex function is. Furthermore, if we recognize that \(-\log x\) is a convex function, then we get \(\mathbb{E}\log(X)\le \log(\mathbb{E}X)\), which restates the well-known fact that the geometric mean of a collection of positive real numbers is less than or equal to their arithmetic mean.)

Now, let \(N\) be the total number of queries. Let \(n(k)\) be the number of queries on key \(k\). Let \(t_i(k)\) be the time between the \(i\)th query on key \(k\) and the previous query on the same key (or the start of the sequence, for \(i=1\)). The number of distinct elements queried in that interval is at most \(t_i(k)\), so the working-set property applies with \(t_i(k)\) as the bound. Let \(\bar{a}(k)< C \sum_{i=1}^{n(k)} \log t_i(k) / n(k)\) (for some \(C>0\)) be the average query time looking up key \(k\), as guaranteed by the working-set property (*). Let \(\bar{t}(k) = \sum_{i=1}^{n(k)} t_i(k) / n(k)\) be the average time between queries for key \(k\).

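Note that we must have \(\bar{t}(k) \le N/n(k)\), because the intervals between successive queries on key \(k\) do not overlap, so \(\sum_{i=1}^{n(k)} t_i(k) \le N\). Furthermore, \(p_k = n(k)/N\) by definition, hence \(\bar{t}(k) \le 1/p_k\) (**). The average query time over all keys is therefore: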

\(\sum_k \bar{a}(k) n(k) / N\)
\(< \sum_k [C \sum_{i=1}^{n(k)} \log t_i(k) / n(k)] [n(k) / N]\), by (*)
\(\le C \sum_k [\log \sum_{i=1}^{n(k)} t_i(k) / n(k)] [n(k) / N]\), by Jensen’s inequality
\(= C \sum_k [\log \bar{t}(k)] p_k\)
\(\le C \sum_k p_k \log 1/p_k\), by (**)

∎
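As a sanity check on the chain of inequalities, here is a minimal Python sketch that compares the average of \(\log_2 t_i(k)\) over a query sequence against the entropy \(\sum_k p_k \log_2 1/p_k\). The function names are made up for illustration, the constant \(C\) is dropped, and the gap for the first query on a key is measured from the start of the sequence, which is an assumption made so that the gaps for each key sum to at most \(N\).

```python
import math
import random
from collections import Counter


def average_log_gap(queries):
    """Average of log2(t_i(k)) over all queries.

    t_i(k) is the number of queries since the previous query on the same
    key; for the first query on a key it is counted from the start of the
    sequence (an assumption of this sketch).
    """
    last_seen = {}
    total = 0.0
    for position, key in enumerate(queries, start=1):
        gap = position - last_seen.get(key, 0)
        total += math.log2(gap)
        last_seen[key] = position
    return total / len(queries)


def entropy_bits(queries):
    """Empirical entropy sum_k p_k * log2(1/p_k), with p_k = n(k)/N."""
    n = len(queries)
    return sum((c / n) * math.log2(n / c) for c in Counter(queries).values())


if __name__ == "__main__":
    rng = random.Random(0)
    # Skewed access pattern over eight keys (weights are arbitrary).
    qs = rng.choices("abcdefgh", weights=[50, 20, 10, 8, 6, 3, 2, 1], k=5000)
    print(average_log_gap(qs), entropy_bits(qs))
```

For any sequence the first number should not exceed the second, which is exactly the Jensen step followed by (**).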

google wave lacks structure

admin — Tue, 01 Dec 2009 22:03:43 +0000

Got an invitation to Google Wave today. The problem I find immediately is the lack of structure. Say what you will about the restrictions of email or IM, but those same restrictions, namely time-flow or thread-flow, are also well-enforced structures that keep things sane. Wave takes these away and substitutes “playback.” Unfortunately, playback is not natural. (The other way is to fall back on social convention to keep order, but that rarely works with more than 2 peers.)

I think there are two options here. Either structure needs to be explicitly enforced or presentation needs to be refined.

In the former, perhaps it is better to only allow replies in certain places. Perhaps it is better to only allow edits in certain places. Perhaps it is better to separate the two and keep the distinction between edit mode, thread mode, and conversation mode, and only allow mixing in very restricted settings (or require some extra steps to discourage its use). After all, in preparing a shared endeavor, the purpose should be defined and known ahead of time.

In the latter, perhaps a lot of hiding and collapsing should be used. Perhaps hyperlinks should be used for in-place edits that often hijack a topic. And now that subthreads can sprout like a tree, it makes little sense to retain the linear structure of conversations. Perhaps a topic based graph, or a conversation stack would be the more appropriate presentation metaphor.

Wave is a good idea, but not well thought out. In its attempt to differentiate, it has forsaken usability for chaotic flexibility, which would have had redeeming value, were it matched by equally ambitious presentation/visualization.

Wired on the Gaussian copula

admin — Wed, 25 Feb 2009 04:37:47 +0000

Because this article is spamming the internet today, I decided to read Li’s paper and learn what the heck this Gaussian copula is.

For five years, Li’s formula, known as a Gaussian copula function, looked like an unambiguously positive breakthrough, a piece of financial technology that allowed hugely complex risks to be modeled with more ease and accuracy than ever before. With his brilliant spark of mathematical legerdemain, Li made it possible for traders to sell vast quantities of new securities, expanding financial markets to unimaginable levels.

And anyway, here is the paper referenced in the article.

Firstly, so much for the sensationalism: so far as I can tell, the paper doesn’t say anything worthy of a Nobel Prize — but still it is mildly interesting. In fact, the whole point of the paper appears to be to introduce to the finance community an already known method for solving the inverse problem of distribution marginalization, that is, (non-uniquely) go from marginal distributions back to the joint distribution, by specifying a mediating copula that captures marginal-invariant joint structure. The technology is very straightforward, and Li didn’t invent it.

That aside, I did wonder: why the heck go through the motions of constructing a Gaussian copula (as in the article) if you assume your marginals and joint are all Gaussian to begin with and all you want to capture is the covariance matrix? You can just specify the joint Gaussian explicitly. It seems like a totally pointless exercise. After reading the paper, though, I see that wasn’t really Li’s entire suggestion at all. He’s being descriptive rather than prescriptive, casting what his firm already did in the language of copulas, an interpretive generalization that allows for potentially more accurate modeling (of non-Gaussian marginals and complicated joint structure if so desired).
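To make the "marginals plus a mediating copula" idea concrete, here is a rough Python sketch, not taken from Li's paper: it draws two correlated standard normals, maps each through the normal survival function to a uniform, and then through the inverse of an exponential survival curve. The exponential marginals, the hazard rates, the correlation value and the function name are all assumptions chosen for illustration.

```python
import math
import random


def sample_survival_pair(rho, hazard_a, hazard_b, rng):
    """One draw of two survival times joined by a Gaussian copula.

    Marginals are exponential with the given hazard rates (an assumption
    of this sketch); rho is the correlation of the underlying normals.
    """
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    times = []
    for z, hazard in ((z1, hazard_a), (z2, hazard_b)):
        # P(Z > z) for a standard normal; this is 1 - u, with u uniform.
        upper_tail = 0.5 * math.erfc(z / math.sqrt(2.0))
        # Invert the exponential CDF: t = -ln(1 - u) / hazard.
        times.append(-math.log(upper_tail) / hazard)
    return times[0], times[1]


if __name__ == "__main__":
    rng = random.Random(1)
    for _ in range(3):
        print(sample_survival_pair(0.8, 0.1, 0.2, rng))
```

Each marginal stays exponential no matter what `rho` is; only the joint behavior changes, which is the marginal-invariant structure the copula is meant to carry.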

Now on to the accusations. The article says that Li tried to “model default correlation” using credit default swaps rather than ratings agency data. It turns out that wasn’t even a problem being solved in this paper. He suggested using CDS market data to get implied marginal distributions, an established practice. As for how correlation is obtained from limited data, you’d have to blame one Greg Gupton:

Having chosen a copula function, we need to compute the pairwise correlation of survival times. Using the CreditMetrics (Gupton et al. [1997]) asset correlation approach, we can obtain the default correlation of two discrete events over one year period.

However, it is true that there is something funny going on with the concept of using market pricing to price other market instruments, when the only novel input for all of them must be what little information is collected from actual due diligence. A classic case of Garbage In Garbage Out in statistical modeling.

As somebody elsewhere wrote, this sort of thing would not pass muster in “real” engineering design. We’ve seen that dichotomy before between the absolutely error-free stricture of “hardware” design (chips and bridges) vs. the more lax attitude toward “software” design (operating systems and capital market systems). Maybe this dichotomy needs to go away.

On Penmanship in Chinese

admin — Thu, 29 Jan 2009 21:49:06 +0000

I suppose good penmanship is the basis of good calligraphy, since calligraphy is mainly the addition of (variable) brush width to the structure of the characters. This bulk structure is really the key and it is particularly difficult to get right without muscle memory. That’s why they tell you to trace character books over and over.

However, there is a way to figure this matter of structure from first principles (and perhaps generate a more distinctive style as a result), albeit with the tradeoff that you cannot be quick and must be careful.

The first principle for aesthetics is that the character must stand … this is something my old man told me, actually, so I didn’t figure this out myself, but it is very true. If you hold up the piece of paper and look at the strokes as struts of a building, it must look like the character is architecturally sound, i.e. reasonably symmetric if need be, balanced in weight so that it will not tip over, not poorly supported with too small a bottom and too big a top, etc. This isn’t too difficult if the character is mechanically drawn, but the trick is to do it even with asymmetric calligraphic strokes and multi-part characters with asymmetric radicals and caps.

The second principle for aesthetics is about spacing, and this is much like optimal typography and typesetting. The strokes should be spread out evenly so that where they appear parallel, the gaps between them appear nearly identical to other such gaps. Otherwise there will be ugly bunching and voids. This is very difficult because the strokes are written in order so there is a pre-commitment issue. Once you commit to a particular stroke, it also commits the spacing requirements for the rest of the character. So one slightly off stroke and you are screwed. This is more a problem for large writing, since bigger mistakes are possible.

Then there is the issue of multiple character layout. This wouldn’t be so much of an issue if all characters were the same shape and complexity, but they are not. Some are extremely sparse, and some are very dense. Some are tall and some are fat. They all have to be laid out on paper to look like they take up the same space and are evenly spaced from each other. There is also the compromise of making inter-stroke space appear similar in multiple characters. So one needs to deal with some visual artifacts and vision tricks. As a result, the characters will not all be the same size and will not be spaced evenly, so this is a very tricky thing to get right. You can have perfectly written individual characters but still a terrible collection.

And finally here is a side point: people say Simplified characters are uglier than Traditional characters for calligraphy. In fact this cannot be true. What happens is that Simplified characters are sparser, and sparser characters writ large are the most difficult to get right (not to mention there are no classic master’s character books to trace in Simplified). They are ugly only because (or to the extent that) they are not written well. The bastion of poor practitioners (like me) is in small dense characters that distract from scrutiny and generally look pretty good no matter how you write them.

Windows 7, again

admin — Thu, 29 Jan 2009 06:56:00 +0000

Got it installed and seems like a clean update on Vista. Somebody must have cracked the whip on simplicity, since nearly everything involving user interaction got simpler. Since it is mostly feature extensions on Vista, it is quite stable.

Some less noticed changes:
* IE8 now runs all tabs and windows in separate processes, so there is no longer a distinction between tabs and windows. There is also (finally) a Mozilla-style jump-highlight in-page search. There is a convenient “InPrivate” mode that leaves nothing behind, but it is kind of stupid: rather than sandboxing cookies and deleting them afterwards, it doesn’t appear to store them at all, which breaks some sites… or maybe it’s just a bug. There are also these “accelerators” for web services (like smart tags on crack), not that useful in my opinion.
* English ink input in continuous mode now displays recognitions in-place, rather than in typeface underneath.
* Services for Unix (the POSIX subsystem) is much much improved and is actually usable for compilation.
* Monad (or PowerShell), which got dropped from Vista, is in. Very nice.
* Desktop backgrounds now come in sets of images, rather than one image.
* Yet another new directory structure for the user home directory. The “data” folders in the home directory like Pictures, Music, Movies, Documents are now symbolically separated into a “Libraries” indexing structure (kind of like in WMP), and apparently you can create multiple libraries. Not sure if this is implemented cleanly enough, but interesting.

That’s about it.