Loet wrote:
> Thus, Shannon's choice to equate H with the notion
> of entropy was very fortunate. It made the rich domain
> of equations and algorithms that had been studied in
> the century before available to the study of the
> dividedness/order/organization of systems other than
> physics. All these forms of organization (and self-
> organization) can be studied in terms of their
> probabilistic entropy.
A great contribution of Boltzmann's, Gibbs' and Shannon's work is that it
forced everyone to think probabilistically. But it is the step towards
probability that is important, not so much the entropy itself. Again, it
is very important to distinguish the statistical H from the thermodynamic S.
Some authors use S for Shannon's entropy, which is very unfortunate, as it
is a major source of confusion.
=============
I've had some thoughts recently about a particular direction of
integrating/generalizing information theory with decision and game theory,
which is hinted at in the subject line. In this scheme, the relationship
between information and entropy is clear and consistent, but the two notions
are so similar that one wonders whether having two words for one thing is
justified.
* 1. Decision Theory: Tackling the Truth/Goodness Dilemma
----

Decision theory studies how to make the best decision. It is a theory, hence, that deals not with truth or falsehood, but with good or bad. The is/ought dilemma is reflected in decision theory through two types of quantity: probability (which refers to truth) and utility (which refers to goodness). Most of decision theory is built around the simple maximum-expected-utility principle, which picks the decision that yields the maximum expected utility. The opposite of utility is loss. There are many synonyms for utility and loss:

  loss : cost, error, deviation, distortion, bad
  gain : benefit, utility, fidelity, good, profit, fitness

* 2. Loss Functions
----

Assume data D (a set of instances, each described with a set of attributes) and a model M. How well does model M fit the data D? One simple way of approaching this is to define a loss function, which computes the loss induced by M approximating D. A loss function should have a few obvious properties: a clearly defined minimum in a sensible place (the maximum is usually not interesting), and it should be gently sloping and as noiseless as possible, to facilitate optimization.

In statistics, we usually want a likely model. This is obtained by seeking a model given which the data becomes likely (Bayesians will disagree here, correctly, but I will skip that discussion as it is not central). In a sense, a good model makes the data obvious; it thereby "explains" the data and makes it "nothing special". In other words, the pursuit of likely models connects truth with utility: what is likelier is better, in the same scientific spirit in which the true is the good. In that sense, probability-based utility remains in the domain of pure inquiry.

Therefore, we are maximizing P(D|M), and the loss can be defined as L = -P(D|M). There are many instances in the data: D = [d1,d2,...,dN]. A frequently made assumption is that the instances were sampled randomly and independently of one another. If so, we can treat the d's as independent events, and we are minimizing

  L = -P(d1|M)P(d2|M)...P(dN|M)

This multiplication can be simplified by taking a logarithm and dividing by the number of instances, N:

  L = -(1/N) * [Log P(d1|M) + Log P(d2|M) + ... + Log P(dN|M)]

This kind of loss is referred to as expected negative log-likelihood, something to minimize. It is "expected" because we average it by dividing by N: it is the average log-loss per instance. Since probabilities lie in [0..1], the logarithms are negative, so one can just as well switch the sign and work with the gain: the expected log-likelihood, something to maximize. Does it begin to look familiar?

Let us now assume that our probabilistic model P captures the data so well that we no longer need the data. Let us pretend that we are using the model itself to generate the data. The result is as follows:

  H(M) = - Integrate [ P(x|M) Log P(x|M) ] dx

We have derived Shannon's entropy through the following assumptions:

 - a logarithmic gain/loss utility function
 - independence of instances
 - the use of averaging
 - the assumption that Model = Reality

Shannon's entropy is the inherent expected loss of the *model*, regardless of the data. Expected negative log-likelihood is the *average* loss on a particular set of data. I would like to refer to this average loss as approximate entropy, or perhaps empirical entropy or sample entropy. Any recommendations?
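To make this concrete, here is a minimal sketch in Python (my own illustration, not part of the argument above; the three-symbol distribution and the sample size are made up, and the discrete case replaces the integral). When the data are sampled from the model itself, i.e. Model = Reality, the average log-loss on the sample approaches the model's entropy H(M):

# A minimal sketch (discrete case, made-up numbers): when the data are
# generated by the model itself, the average negative log-likelihood
# approaches the model's entropy H(M).
import math
import random

random.seed(0)

# A toy model M: a distribution P(x|M) over three symbols (hypothetical values).
p = {'a': 0.5, 'b': 0.3, 'c': 0.2}

# The model's own entropy: H(M) = -sum_x P(x|M) Log P(x|M)
H = -sum(q * math.log(q) for q in p.values())

# Pretend Model = Reality: sample N instances from the model, then measure
# the average log-loss per instance, L = -(1/N) * sum_i Log P(d_i|M).
N = 100000
data = random.choices(list(p), weights=list(p.values()), k=N)
L = -sum(math.log(p[d]) for d in data) / N

print("H(M)             = %.4f nats" % H)
print("average log-loss = %.4f nats" % L)   # close to H(M) for large N

With these made-up numbers both figures should come out near 1.03 nats, and the gap between them shrinks as N grows.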
* 3. Information and Entropy
----

Now assume that we have two models, M1 and M2, resulting in P(x|M1) and P(x|M2). What is the relationship between them? We have two choices: to keep them apart by assuming independence, P(x|M1)P(x|M2), or to unite them, P(x|M1,M2). Mutual information between M1 and M2 corresponds to the reduction in entropy achieved by uniting the two models. Because of the properties of log-gain, it is irrelevant which of M1 or M2 we had first. The same change in log-gain is achieved by:

 - uniting M1 and M2
 - adding M1 to M2
 - adding M2 to M1

These properties are not retained, however, if a different loss function is used. Mutual information is therefore the gain achieved through holism.

* 4. Constraints and Information
----

The higher the entropy of a model, the fewer its constraints. It is the constraints in the model that reduce its entropy, and it is the constraints in the real world that enable us to put constraints into our models. But we cannot equate constraints and information. Constraints are aspects of models (is). Information is an expression of quality (ought).

Bibliography:

[1] Topsoe: Information theory at the service of science.
    http://www.math.ku.dk/~topsoe/aspects.pdf
[2] Grunwald & Dawid: Game theory, maximum entropy, minimum discrepancy, and
    robust Bayesian decision theory. Annals of Statistics, forthcoming.
    http://www.cwi.nl/~pdg/ftp/grunwalddawid.pdf
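P.S. In the same spirit, a small sketch (again my own, with made-up numbers) of the point in section 3: mutual information as the entropy reduction obtained by uniting two models rather than keeping them apart, I(M1;M2) = H(M1) + H(M2) - H(M1,M2):

# P.S. sketch (made-up numbers): mutual information as the entropy reduction
# achieved by uniting two models, I(M1;M2) = H(M1) + H(M2) - H(M1,M2).
import math

def entropy(dist):
    # Shannon entropy, -sum p Log p, in nats.
    return -sum(pr * math.log(pr) for pr in dist.values() if pr > 0)

# A hypothetical united model P(x,y|M1,M2) over two binary attributes.
joint = {('0', '0'): 0.4, ('0', '1'): 0.1,
         ('1', '0'): 0.1, ('1', '1'): 0.4}

# The separate models P(x|M1) and P(y|M2) are its marginals.
p1, p2 = {}, {}
for (x, y), pr in joint.items():
    p1[x] = p1.get(x, 0.0) + pr
    p2[y] = p2.get(y, 0.0) + pr

# Keeping the models apart (independence) costs H(M1) + H(M2); uniting them
# costs H(M1,M2).  The difference is the gain achieved "through holism",
# and it is the same whichever model we start from.
I = entropy(p1) + entropy(p2) - entropy(joint)
print("H(M1) + H(M2) = %.4f nats" % (entropy(p1) + entropy(p2)))
print("H(M1,M2)      = %.4f nats" % entropy(joint))
print("I(M1;M2)      = %.4f nats" % I)

With these numbers I(M1;M2) is about 0.19 nats; it vanishes exactly when the united model factorizes as P(x|M1)P(y|M2), i.e. when nothing is gained by holism.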