RE: [Fis] Objects: Irreversibility of the Foundations

From: Aleks Jakulin <[email protected]>
Date: Mon 18 Oct 2004 - 13:24:40 CEST

I'm responding to Malcolm's kind post. Although I am going into specifics, I'm trying to show that certain problems inherent to modelling and information appear throughout science. Of course, we are addressing very simple systems that many other readers of the list (biology, ecology, economics) may find almost uselessly limited. These simple models, however, involve variables and quantities, and so they reappear wherever we have the notions of variable, data, model, probability... Perhaps someone disagrees at this point?

> This means that if all the pairwise correlations
> are zero, so are all the higher order correlations.
> My hunch is that one
> should be able to prove that the interaction information is
> zero if all the pairwise mutual information is zero.

If the *total correlation* C(X1,...,Xn) = H(X1) + ... + H(Xn) - H(X1,...,Xn) is zero, the joint distribution factorizes into the product of its marginals, so all the interaction informations are indeed zero too. I'd have to check whether someone (Han, Yeung?) has already proved this in a nice way, but it should be very easy. The trouble is that total correlation is not a pairwise quantity.

There are higher-order interactions that are totally unexpected from the pairwise perspective. A common example is the XOR or parity problem you also mentioned: C = A XOR B, with A and B independent fair coin flips. Here I(A;B) = I(B;C) = I(A;C) = 0, yet I(A;B;C) > 0. This is a slightly pessimistic result, indicating both that there are complex patterns that cannot be captured by normality, and that the normality assumption may be dangerously myopic, as you point out.
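
For concreteness, a minimal sketch (mine, not part of the original argument) that checks these numbers, using the sign convention in which synergy comes out positive, I(A;B;C) = I(A;B|C) - I(A;B):

from itertools import product
from math import log2

# Joint distribution P(a, b, c) for C = A XOR B, A and B independent fair bits.
p = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}

def H(axes):
    """Entropy (in bits) of the marginal over the given coordinate indices."""
    marg = {}
    for outcome, pr in p.items():
        key = tuple(outcome[i] for i in axes)
        marg[key] = marg.get(key, 0.0) + pr
    return -sum(q * log2(q) for q in marg.values() if q > 0)

A, B, C = 0, 1, 2
I_AB = H([A]) + H([B]) - H([A, B])                             # 0 bits
I_AC = H([A]) + H([C]) - H([A, C])                             # 0 bits
I_BC = H([B]) + H([C]) - H([B, C])                             # 0 bits
I_AB_given_C = H([A, C]) + H([B, C]) - H([C]) - H([A, B, C])   # 1 bit
print(I_AB, I_AC, I_BC, I_AB_given_C - I_AB)                   # -> 0.0 0.0 0.0 1.0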

The same problem also pops up in Simpson's paradox. This is an old result from statistics, and it is why statisticians tend to categorically oppose the use of statistics on non-controlled data for making causal statements (the "correlation is not causation" mantra).

Simpson's paradox occurs when the marginal association runs in the opposite direction to the conditional association. Consider the following tuberculosis example, with attributes P (location) and R (race): D_P = {New York, Richmond}, D_R = {white, non-white}.

* marginal table

  location    lived    died   p(death)
  New York  4758005    8878      0.19%
  Richmond   127396     286      0.22%

By considering location alone, it would seem that New York health care is better.

* conditional tables (by skin color)

  white:
  location    lived    died   p(death)
  New York  4666809    8365      0.18%
  Richmond    80764     131      0.16%

  non-white:
  location    lived    died   p(death)
  New York    91196     513      0.56%
  Richmond    46578     155      0.33%

But if we also control for the influence of skin color, Richmond health care is better, both for whites and for non-whites. The message is that a hidden attribute may remove or even reverse any discovered correlation. A tool economists often use to cope with this is "ceteris paribus": the correlation is valid only when everything else is kept constant and the assumptions are fulfilled.
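
To make the reversal easy to verify, here is a small script (my own) that recomputes the rates from the counts in the tables above, with p(death) = died / (lived + died):

# Recompute the death rates from the counts in the conditional tables above.
# The marginal ordering reverses once we condition on skin color.
counts = {
    # (location, group): (lived, died)
    ("New York", "white"):     (4666809, 8365),
    ("New York", "non-white"): (91196,    513),
    ("Richmond", "white"):     (80764,    131),
    ("Richmond", "non-white"): (46578,    155),
}

def rate(lived, died):
    return died / (lived + died)

# Marginal (ignoring skin color): New York looks better ...
for loc in ("New York", "Richmond"):
    lived = sum(l for (c, g), (l, d) in counts.items() if c == loc)
    died  = sum(d for (c, g), (l, d) in counts.items() if c == loc)
    print(loc, f"{rate(lived, died):.2%}")       # NY ~0.19%, Richmond ~0.22%

# ... but within each group, Richmond is better.
for grp in ("white", "non-white"):
    for loc in ("New York", "Richmond"):
        l, d = counts[(loc, grp)]
        print(grp, loc, f"{rate(l, d):.2%}")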

> Perhaps
> it is even true that the interaction information amongst any
> three or more variables is always zero for multivariate
> normal random variables, which would be interesting. If so,
> that would be a nice way of characterizing why it is bad to
> use the multinormal distribution when not all interactions
> are reducible to pairwise dependencies.

3-way interaction information is rarely zero in a multivariate normal distribution. This is fine if you adopt the synergy/redundancy interpretation of interaction information: if you know the correlation between A and B, and the correlation between B and C, you can predict the correlation between A and C quite well. I have seen a paper that studies this. It could perhaps be proven that it cannot be positive.
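
To illustrate the first claim, a small numerical sketch (my own, with an arbitrary covariance matrix). For a multivariate normal, H(S) = 1/2 log( (2 pi e)^|S| det Sigma_S ); the (2 pi e) factors cancel in the differences, so the 3-way interaction information (synergy-positive convention) reduces to log-determinants of submatrices:

# 3-way interaction information for a trivariate normal,
# convention I(A;B;C) = I(A;B|C) - I(A;B).
import numpy as np

Sigma = np.array([[1.0, 0.5, 0.3],
                  [0.5, 1.0, 0.4],
                  [0.3, 0.4, 1.0]])   # an arbitrary valid covariance matrix

def logdet(idx):
    """log det of the covariance submatrix over the given variable indices."""
    return np.log(np.linalg.det(Sigma[np.ix_(idx, idx)]))

A, B, C = 0, 1, 2
I_AB        = 0.5 * (logdet([A]) + logdet([B]) - logdet([A, B]))
I_AB_givenC = 0.5 * (logdet([A, C]) + logdet([B, C])
                     - logdet([C]) - logdet([A, B, C]))
print("I(A;B)   =", I_AB)
print("I(A;B|C) =", I_AB_givenC)
print("I(A;B;C) =", I_AB_givenC - I_AB)   # generally non-zero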

However, I sometimes interpret interaction information as a model comparison rather than as a decomposition of entropy, and in that context the exact equality you suggest could be proven, using sufficient statistics. Namely, interaction information is the KL-divergence between the joint distribution and the part-to-whole approximation created from all the model's parts: imagine reconstructing P(A,B,C) from P(A,B), P(B,C) and P(A,C).

The specific approximation method that interaction information presupposes was described by Kirkwood and Boggs in the Journal of Chemical Physics in 1942, and it is known as the Kirkwood superposition approximation. It is also a special case of the Kikuchi approximation to the free energy for lattice models, which has been making a big comeback in information theory and machine learning over the past few years.
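
In the same spirit, a sketch (my own) of that part-to-whole reading for three discrete variables. The Kirkwood superposition approximation is P^(a,b,c) = P(a,b) P(a,c) P(b,c) / ( P(a) P(b) P(c) ), and the divergence sum_{a,b,c} P log( P / P^ ) equals the 3-way interaction information in the synergy-positive convention (strictly speaking, P^ need not sum to one, so calling it a KL-divergence is a slight abuse):

from itertools import product
from math import log2

# Same XOR joint as in the earlier sketch: C = A XOR B, A and B independent fair bits.
p = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}

def marginal(axes):
    marg = {}
    for outcome, pr in p.items():
        key = tuple(outcome[i] for i in axes)
        marg[key] = marg.get(key, 0.0) + pr
    return marg

p_ab, p_ac, p_bc = marginal([0, 1]), marginal([0, 2]), marginal([1, 2])
p_a, p_b, p_c = marginal([0]), marginal([1]), marginal([2])

ii = 0.0
for (a, b, c), pr in p.items():
    kirkwood = (p_ab[(a, b)] * p_ac[(a, c)] * p_bc[(b, c)]
                / (p_a[(a,)] * p_b[(b,)] * p_c[(c,)]))
    ii += pr * log2(pr / kirkwood)

print(ii)   # 1.0 bit for the XOR example, matching I(A;B|C) - I(A;B)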

> Another example in which variables are pairwise independent,
> but the interaction information is non-zero is the following
> generalization of the XOR example. [snip]
>
> This happens to be a real example in quantum physics--what is
> called the GHZ version of the Bell experiment for an
> entangled triplet of spin 1/2 particles. See:

Thanks for pointing me to Mermin's work! This is perhaps also why Cerf & Adami proposed that something quite similar to Bell's theorem can be phrased in terms of the non-positivity of 3-way interaction information (Entropic Bell inequalities, Physical Review A, 55(5):3371-3374, 1997, arXiv:quant-ph/9608047 v2). In all, undecomposability is a unifying principle behind many phenomena: Bell's theorem; phase transitions (the "amount" of undecomposability reaches its maximum at the very point of the phase transition; see Ionas Erb and Nihat Ay, "Multi-Information in the Thermodynamic Limit", Santa Fe Institute TR-03-11-064); frustration in physics; causality in the Bayesian-network sense; and so on.

> Akaike, H. (1973): "Information Theory and an Extension of
> the Maximum Likelihood Principle." B. N. Petrov and F. Csaki
> (eds.), 2nd International Symposium on Information Theory:
> 267-81. Budapest: Akademiai Kiado.

Actually, at this very point, it seems that the Bayesians and the information-theorists are converging. Information theory has more of an "estimation" perspective, and as such it is closer to frequentist statistics; Bayesians tend not to estimate as much. Rissanen's MDL (http://www.mdl-research.org/reading.html) tends to be used far more often today, but it was Hirotugu Akaike who started it all.

Consider the structure of the log-posterior through the Bayes rule:

log P(H|D) = ( log P(D|H) + log P(H) ) - log Z

Here Z refers to the evidence, or the probability of the data. Since Z does not depend on H, it is usually ignored: it only acts as the normalization coefficient.

So, model selection with AIC can be seen as a special case of Bayesian maximum a posteriori (MAP) inference, with a particular choice of the prior:

AIC = log P(D|H) - k

k ~= -log P(H) -- indeed, the more complex the hypothesis space, the larger k becomes.
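
A toy illustration of this reading (my construction, with made-up data): two nested Gaussian models scored by log P(D|H) - k give exactly the ranking of MAP inference with the improper prior log P(H) = -k.

# AIC-style model selection rewritten as MAP with a complexity prior
# log P(H) = -k, where k is the number of free parameters.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.3, scale=1.0, size=50)   # made-up data
n = len(data)

def gauss_loglik(x, mu, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# H0: mean fixed at 0, variance free -> k = 1
# H1: mean free, variance free       -> k = 2
models = {
    "H0": (gauss_loglik(data, 0.0, np.mean(data ** 2)), 1),
    "H1": (gauss_loglik(data, data.mean(), data.var()), 2),
}

for name, (loglik, k) in models.items():
    aic_score = loglik - k      # equals -AIC/2 with the standard AIC = 2k - 2 log L
    map_score = loglik + (-k)   # log P(D|H) + log P(H), with log P(H) = -k
    print(name, round(aic_score, 2), round(map_score, 2))
# The two scores coincide, so both criteria pick the same model.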

The beauty of Bayesianism is that all you hold fixed is the Bayes rule (i.e. the meaning of probability), and all assumptions are expressed explicitly through the choice of the hypothesis space. MDL is even closer to the Bayesian posterior. A true Bayesian, however, doesn't use maximum a posteriori hypotheses; he maintains that the data isn't sufficient to decide upon a single hypothesis, and carries all these hypotheses along. I find this quite Epicurean (Letter to Pythocles) in spirit: "If multiple theories explain the data, keep them all." It is ultimately up to decision theory to select among them, for reasons of loss, utility and cost (multiple theories are more expensive than a single one), not for reasons of truth. Occam's razor is an essentially utilitarian principle.

Paul M. B. Vitányi and Ming Li: "Minimum Description Length Induction, Bayesianism, and Kolmogorov Complexity", IEEE Transactions on Information Theory, 46(2), March 2000.
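
Coming back to "keep them all", a tiny sketch (my own, with a hypothetical coin example) of how the posterior-weighted prediction generally differs from committing to the single MAP hypothesis:

# Carrying all hypotheses along versus deciding on the MAP one.
import numpy as np

biases = np.array([0.3, 0.5, 0.7])   # three candidate hypotheses about a coin
prior = np.ones(3) / 3               # uniform prior over hypotheses
heads, tails = 6, 4                  # observed data

likelihood = biases ** heads * (1 - biases) ** tails
posterior = prior * likelihood
posterior /= posterior.sum()         # divide by the evidence Z

map_h = biases[np.argmax(posterior)]   # decide on a single hypothesis ...
bma = np.sum(posterior * biases)       # ... or average over all of them
print("posterior:", np.round(posterior, 3))
print("P(next head), MAP:", map_h, " averaged:", round(bma, 3))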

> It is true that quantization and continuity cannot be easily
> reconciled. If a quantum mechanical observable (such as
> position) can take on a continuum of values, it means that
> the dimension of the associated Hilbert space must be
> infinite (because the position operator has an infinite
> number of eigenvalues). Infinite dimensional spaces are
> mathematically difficult to comprehend.

I was very intrigued by Penrose's materials at last week's http://online.kitp.ucsb.edu/online/kitp25/zee/ What Penrose seems to be addressing is exactly this problem of quantization vs continuity. If I understand correctly, he is proposing that delta functions should not be used after all. One gets rid of singularities, zero-probabilities, and many other counter-intuitive consequences, but the reverberations of this conclusion, if it is accepted, are yet to be felt. Briefly, if we give up the infinite precision of an event, we have to decide upon a resolution. This resolution is usually taken for granted: the medical thermometer always reads 37.1, not 37.12374114321324... So it is fine with us in practice, and all measurements are considered to be slightly vague. Furthermore, Heisenberg's uncertainty principle implies that everything we measure is vague: we *cannot* be precise. Even the model cannot be precise, because computing with an infinite number of digits is impossible. A probability itself is not precise.

All my problems with continuity in terms of statistics vanish once I get rid of delta functions. This doesn't mean that everything is then treated as quantized. For example, the temperature of 37.1 will mean an imagined *cloud* of temperatures with the mean at ~37.1, and 37.11 will mean a partially overlapping but more precise cloud or ensemble of temperatures with the mean at ~37.11. We give up models with infinite precision, we understand that all our measurements are inherently "clouds", and we adopt unpretentious models with inbuilt unknowables. If I tell you that the mean temperature is 37.1 and the variance 0.5, this mean is not 37.1000000000000000 but itself a cloud, just as the variance is not 0.500000000 but a cloud too.
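
As a rough illustration of the cloud metaphor (mine, with a hypothetical thermometer vagueness), consider:

# A reading stands for a cloud of plausible temperatures, not a scalar.
import numpy as np

resolution = 0.05                     # assumed std dev of the thermometer's vagueness
rng = np.random.default_rng(1)

def cloud(reading, size=100000):
    return rng.normal(reading, resolution, size)

c1, c2 = cloud(37.10), cloud(37.11)
# The two clouds are hard to tell apart: this comes out ~0.44 rather than 0.
print("P(cloud around 37.10 exceeds cloud around 37.11):", np.mean(c1 > c2))

# The reported mean of n noisy readings is itself a cloud of width sigma/sqrt(n).
readings = cloud(37.1, size=20)
print("mean:", round(readings.mean(), 3),
      "+/-", round(readings.std(ddof=1) / np.sqrt(len(readings)), 3))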

As with Epicurus earlier, it is expensive to work with clouds, so in practice we plug in precise floating-point numbers that are more or less representative of the cloud. There are situations, however, when these representations are inappropriate, namely when two clouds overlap. In those cases the metaphor of an infinitely precise scalar is economical but untrue, and we have to return to the more expensive thinking in terms of clouds. Could we call what derives from this "the logic of vagueness"? The infinite dimensionality of spaces is a reminder of the infinite cost of infinite precision. All practical models that actually *exist* in the universe must be constrained in space and time, while struggling to maximize the benefits for their masters.

Best regards, with apologies for a slightly vague ramble at the end,

Aleks

--
mag. Aleks Jakulin
http://www.ailab.si/aleks/
Artificial Intelligence Laboratory, 
Faculty of Computer and Information Science, University of Ljubljana. 