Loet wrote:
> Thus, Shannon's choice to equate H with the notion
> of entropy was very fortunate. It made the rich domain
> of equations and algorithms that had been studied in
> the century before available to the study of the
> dividedness/order/organization of systems other than
> physics. All these forms of organization (and self-
> organization) can be studied in terms of their
> probabilistic entropy.
A great contribution of Boltzmann's, Gibbs' and Shannon's work is that it
forced everyone to think probabilistically. But it is the step towards
probability that is important, not so much the entropy itself. Again, it
is very important to distinguish the statistical H from the thermodynamic S.
Some authors use S for Shannon's entropy, which is very unfortunate, as it
is a major source of confusion.
=============
I've had some thoughts recently about a particular direction of
integrating/generalizing information theory with decision and game theory,
which is hinted at in the subject line. In this scheme, the relationship
between information and entropy is clear and consistent, but the two notions
are so similar that one wonders whether having two words for one thing is
justified.
* 1. Decision Theory: Tackling the Truth/Goodness Dilemma
----

Decision theory studies how to make the best decision. It is a theory, hence, that deals not with truth or falsehood, but with good or bad. The is/ought dilemma is reflected in decision theory through two types of quantity: probability (which refers to truth) and utility (which refers to goodness). Most of decision theory is built around the simple maximum-expected-utility principle, which picks the decision that yields the maximum expected utility. The opposite of utility is loss. There are many synonyms for utility and loss:

  loss : cost, error, deviation, distortion, bad
  gain : benefit, utility, fidelity, good, profit, fitness

* 2. Loss Functions
----

Assume data D (a set of instances, each described with a set of attributes) and a model M. How well does model M fit the data D? One simple way of approaching this is to define a loss function, which computes the loss induced by M approximating D. A loss function should have a few obvious properties: a clearly defined minimum in a sensible place (the maximum is usually not interesting), and it should be gently sloping and as noiseless as possible, to facilitate optimization.

In statistics, we usually want a likely model. This is obtained by seeking a model given which the data becomes likely (Bayesians will disagree here, correctly, but I will skip that discussion as it is not central). In a sense, a good model makes the data obvious; it thereby "explains" the data and makes it "nothing special". In other words, the pursuit of likely models connects truth with utility: what is likelier is better, in the same scientific spirit in which the true is the good. In that sense, probability-based utility remains in the domain of pure inquiry.

Therefore, we are maximizing P(D|M), and the loss can be defined as L = -P(D|M). There are many instances in the data: D = [d1,d2,...,dN]. A frequently made assumption is that the instances were sampled randomly and independently of one another. If so, we can treat the d's as independent events, and we are minimizing

  L = -P(d1|M)P(d2|M)...P(dN|M)

This multiplication can be simplified by taking a logarithm and dividing by the number of instances, N:

  L = -(1/N) * [Log P(d1|M) + Log P(d2|M) + ... + Log P(dN|M)]

This kind of loss is referred to as expected negative log-likelihood, something to minimize. It is "expected" because we average it by dividing by N: it is the average log-loss per instance. Since probabilities lie in [0..1], the logarithms are negative, so one can just as well switch the sign and work with the gain: the expected log-likelihood, something to maximize. Does it begin to look familiar?

Let us now assume that our probabilistic model P captures the data so well that we no longer need the data. Let us pretend that we are using the model itself to generate the data. The result is as follows:

  H(M) = - Integrate [ P(x|M) Log P(x|M) ] dx

We have derived Shannon's entropy through the following assumptions:

 - a logarithmic gain/loss utility function
 - independence of instances
 - the use of averaging
 - the assumption that Model = Reality

Shannon's entropy is the inherent expected loss of the *model*, regardless of the data. Expected negative log-likelihood is the *average* loss on a particular set of data. I would like to refer to this average loss as approximate entropy, or perhaps empirical entropy or sample entropy. Any recommendations?
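To make this concrete, here is a minimal sketch in Python (my own illustration, not part of the argument above; the three-symbol distribution and the sample size are made up, and the discrete case replaces the integral). When the data are sampled from the model itself, i.e. Model = Reality, the average log-loss on the sample approaches the model's entropy H(M):

# A minimal sketch (discrete case, made-up numbers): when the data are
# generated by the model itself, the average negative log-likelihood
# approaches the model's entropy H(M).
import math
import random

random.seed(0)

# A toy model M: a distribution P(x|M) over three symbols (hypothetical values).
p = {'a': 0.5, 'b': 0.3, 'c': 0.2}

# The model's own entropy: H(M) = -sum_x P(x|M) Log P(x|M)
H = -sum(q * math.log(q) for q in p.values())

# Pretend Model = Reality: sample N instances from the model, then measure
# the average log-loss per instance, L = -(1/N) * sum_i Log P(d_i|M).
N = 100000
data = random.choices(list(p), weights=list(p.values()), k=N)
L = -sum(math.log(p[d]) for d in data) / N

print("H(M)             = %.4f nats" % H)
print("average log-loss = %.4f nats" % L)   # close to H(M) for large N

With these made-up numbers both figures should come out near 1.03 nats, and the gap between them shrinks as N grows.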
* 3. Information and Entropy
----

Now assume that we have two models, M1 and M2, resulting in P(x|M1) and P(x|M2). What is the relationship between them? We have two choices: to keep them apart by assuming independence, P(x|M1)P(x|M2), or to unite them, P(x|M1,M2). Mutual information between M1 and M2 corresponds to the reduction in entropy achieved by uniting the two models. Because of the properties of log-gain, it is irrelevant which of M1 or M2 we had first. The same change in log-gain is achieved by:

 - uniting M1 and M2
 - adding M1 to M2
 - adding M2 to M1

These properties are not retained, however, if a different loss function is used. Mutual information is therefore the gain achieved through holism.

* 4. Constraints and Information
----

The higher the entropy of a model, the fewer its constraints. It is the constraints in the model that reduce its entropy, and it is the constraints in the real world that enable us to put constraints into our models. But we cannot equate constraints and information. Constraints are aspects of models (is). Information is an expression of quality (ought).

Bibliography:

[1] Topsoe: Information theory at the service of science.
    http://www.math.ku.dk/~topsoe/aspects.pdf
[2] Grunwald & Dawid: Game theory, maximum entropy, minimum discrepancy, and
    robust Bayesian decision theory. Annals of Statistics, forthcoming.
    http://www.cwi.nl/~pdg/ftp/grunwalddawid.pdf
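P.S. In the same spirit, a small sketch (again my own, with made-up numbers) of the point in section 3: mutual information as the entropy reduction obtained by uniting two models rather than keeping them apart, I(M1;M2) = H(M1) + H(M2) - H(M1,M2):

# P.S. sketch (made-up numbers): mutual information as the entropy reduction
# achieved by uniting two models, I(M1;M2) = H(M1) + H(M2) - H(M1,M2).
import math

def entropy(dist):
    # Shannon entropy, -sum p Log p, in nats.
    return -sum(pr * math.log(pr) for pr in dist.values() if pr > 0)

# A hypothetical united model P(x,y|M1,M2) over two binary attributes.
joint = {('0', '0'): 0.4, ('0', '1'): 0.1,
         ('1', '0'): 0.1, ('1', '1'): 0.4}

# The separate models P(x|M1) and P(y|M2) are its marginals.
p1, p2 = {}, {}
for (x, y), pr in joint.items():
    p1[x] = p1.get(x, 0.0) + pr
    p2[y] = p2.get(y, 0.0) + pr

# Keeping the models apart (independence) costs H(M1) + H(M2); uniting them
# costs H(M1,M2).  The difference is the gain achieved "through holism",
# and it is the same whichever model we start from.
I = entropy(p1) + entropy(p2) - entropy(joint)
print("H(M1) + H(M2) = %.4f nats" % (entropy(p1) + entropy(p2)))
print("H(M1,M2)      = %.4f nats" % entropy(joint))
print("I(M1;M2)      = %.4f nats" % I)

With these numbers I(M1;M2) is about 0.19 nats; it vanishes exactly when the united model factorizes as P(x|M1)P(y|M2), i.e. when nothing is gained by holism.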