Optimisation vs. Bayes' Theorem not coinciding
up vote
1
down vote
favorite
Suppose I have the following Bayesian Network:

It's given by the following relations:
$$begin{aligned}X_1&sim mathcal N(mu, 1/sigma^2)\
forall k, 2leq kleq n: X_k|X_{k-1}&sim mathcal N(x_{k-1}, 1/lambda^2)\
forall i, 1leq ileq n: Z_i|X_i&simmathcal N(x_{i}, 1/delta_i^2)
end{aligned}$$
I'm using the precision rather than the variance for a reason that will become clear soon.
So, the above is a very simplified Kalman filter, in which there is no control-input model and both the state-transition model and the observation model are the identity, and furthermore the precision of the process noise is the same at every step.
Under that definition, $boldsymbol X$ is the hidden variable I want to reason about, and $boldsymbol Z$ is the set of (noisy) observations I've collected. I want to find the value $hat{x}_n$ that maximises the distribution $p(X_n = x_n|boldsymbol Z)$ implied by the above relations (i.e. the MAP estimate of the last value).
There are two ways I can go about doing this.
The first is with a straightforward application of Bayes' Theorem. Let $boldsymbol X_{{a, b, c, ...}}$ stand for $boldsymbol Xbackslash{X_a, X_b, X_c, ...}$ and let $boldsymbol Z_{{a, b, c, ...}}$ be defined similarly. Then:
$$begin{aligned}
&p(boldsymbol X|boldsymbol Z) propto p(boldsymbol Z, boldsymbol X) \
&= p(boldsymbol Z|boldsymbol X) * p(boldsymbol X) \
&=p(Z_n|X_n, boldsymbol Z_{{n}}, boldsymbol X_{{n}}) * p(boldsymbol Z_{{n}} | boldsymbol X) * p(X_n | X_{n-1}, boldsymbol X_{{n, n-1}}) * p(boldsymbol X_{{n}}) \
&=p(Z_n|X_n) * p(X_n | X_{n-1}) * p(boldsymbol Z_{{n}} | boldsymbol X_{{n}}) * p(boldsymbol X_{{n}}) \
&= left[prod_{i=2}^np(Z_i|X_i)*p(X_i|X_{i-1})right]*p(Z_1|X_1)*p(X_1) &(1)\
&= left[prod_{i=2}^nmathcal N(Z_i|X_i, 1/delta_i^2)*mathcal N(X_i|X_{i-1}, 1/lambda^2)right]*mathcal N(Z_1|X_1, 1/delta_1^2)*mathcal N(X_1|mu, 1/sigma^2) &(2)
end{aligned}$$
After some manipulation of $(2)$, if we define $X_0=mu$, it's not hard to see that:
$$begin{aligned}
p(boldsymbol X|boldsymbol Z) propto prod_{i=1}^nmathcal N(X_i | alpha_i * Z_i + (1 - alpha_i) * X_{i - 1}, beta_i) &&(3)end{aligned}$$
Where we've defined:
$$begin{aligned}
alpha_1 &= frac{sigma^2}{sigma^2 + delta_1^2} \
forall k, 2leq kleq n:alpha_k&=frac{lambda^2}{lambda^2+delta_k^2} \
beta_1 &= frac{1}{sigma^2} + frac{1}{delta_1^2} \
forall k, 2leq kleq n:beta_k&=frac{1}{lambda^2} + frac{1}{delta_k^2}
end{aligned}$$
Although the precisions $beta_k$ are irrelevant for our purposes (I think?).
So, the MAP estimates for the values of $boldsymbol X$ can be given recursively by:
$$begin{aligned}
hat x_1 &= alpha_1 * z_1 + (1 - alpha_1) * mu \
forall k, 2leq kleq n:hat x_k &= alpha_k * z_k + (1 - alpha_k) * hat x_{k-1}
end{aligned}$$
The second way to solve this problem is to take $(2)$ above and turn this into an optimisation problem. In particular, since what we want is to maximise that quantity, we can instead call the negative logarithm an "error function" and minimise that.
$$begin{aligned}
&log(p(boldsymbol Z, boldsymbol X)) \
&=log
left[left[prod_{k=2}^nmathcal N(Z_k|X_k, 1/delta_k^2)*mathcal N(X_k|X_{k-1}, 1/lambda^2)right]*mathcal N(Z_1|X_1, 1/delta_1^2)*mathcal N(X_1|mu, 1/sigma^2)right] \
&propto frac 1 {sigma^2} (x_1-mu)^2 + frac 1 {lambda^2} sum_{k=2}^n(x_k-x_{k-1})^2 +
sum_{i=1}^nfrac 1 {delta_i^2}(x_i-z_i)^2 &(4)
end{aligned}$$
If the above two approaches can be used -- and it might turn out that I've made some assumption or mistake that means they can't -- then they should yield the same result.
Now, let's analyse the case I'm interested in. Suppose that there is some number $delta$ and a sequence of numbers $boldsymbol I = {I_1, I_2, ..., I_n}$ such that:
$$forall i: frac 1 {delta_i^2} = frac{I_i}{delta^2}$$
That is, the precision of the measurement noise is proportional to these $I_i$ and the constant of proportionality is always the same.
Now, I can rewrite my objective function in $(4)$ as:
$$frac {delta^2} {sigma^2} (x_1-mu)^2 + frac {delta^2} {lambda^2} sum_{k=2}^n(x_k-x_{k-1})^2 +
sum_{i=1}^nI_i(x_i-z_i)^2$$
The reason this is not strictly the same as $(4)$ is that the $I_i$ can in fact be any nonnegative number, $forall i:I_igeq 0$, including zero. So these $I_i$ are effectively a "relative weight" of each measurement I make and some measurements can have weight zero (if I, for example, did not actually make that measurement).
That way, I can in fact rewrite the alphas:
$$begin{aligned}
alpha_1 &= frac{I_1}{I_1 + delta^2 / sigma^2} \
forall k, 2leq kleq n:alpha_k&=frac{I_k}{I_k+delta^2/lambda^2}
end{aligned}$$
Now let's take the following implementation of the BN:

The MAP estimates should be
$$begin{aligned}
{&alpha_1 * z_1 + (1 - alpha_1) * mu,\
&alpha_1 * z_1 + (1 - alpha_1) * mu,\
&alpha_3 * z_3 + (1 - alpha_3) * left(alpha_1 * z_1 + (1 - alpha_1 * mu)right)}
end{aligned}$$
However, if I try to optimise the following objective function:
$$frac {delta^2} {sigma^2} (x_1-mu)^2 + frac {delta^2} {lambda^2} left[(x_2-x_1)^2 + (x_3-x_2)^2right] +
I_1(x_1-z_1)^2 + I_3(x_3-z_3)^2$$
I don't get those MAP estimates. Take, for example, the case where:
$$begin{aligned}
mu &= 1 \
z_1 &= 0.8 \
z_3 &= 1.2 \
I_1 &= 400 \
I_3 &= 500 \
frac {delta^2}{sigma^2} &= 300 \
frac {delta^2}{lambda^2} &= 4500
end{aligned}$$
This would imply:
$$begin{aligned}
alpha_1 &= frac 4 7 \
alpha_3 &= frac 1 {10} \
hat x_1 &= frac 4 7 * 0.8 + frac 3 7 * 1
= 0.89 \
hat x_3 &= frac 1 {10} * 1.2 + frac 9 {10} * 0.89
= 0.92 \
end{aligned}
$$
But that's not actually the solution. What am I missing? Where in this derivation did I get it wrong?
probability-theory statistics bayes-theorem bayesian-network kalman-filter
add a comment |
up vote
1
down vote
favorite
Suppose I have the following Bayesian Network:

It's given by the following relations:
$$begin{aligned}X_1&sim mathcal N(mu, 1/sigma^2)\
forall k, 2leq kleq n: X_k|X_{k-1}&sim mathcal N(x_{k-1}, 1/lambda^2)\
forall i, 1leq ileq n: Z_i|X_i&simmathcal N(x_{i}, 1/delta_i^2)
end{aligned}$$
I'm using the precision rather than the variance for a reason that will become clear soon.
So, the above is a very simplified Kalman filter, in which there is no control-input model and both the state-transition model and the observation model are the identity, and furthermore the precision of the process noise is the same at every step.
Under that definition, $boldsymbol X$ is the hidden variable I want to reason about, and $boldsymbol Z$ is the set of (noisy) observations I've collected. I want to find the value $hat{x}_n$ that maximises the distribution $p(X_n = x_n|boldsymbol Z)$ implied by the above relations (i.e. the MAP estimate of the last value).
There are two ways I can go about doing this.
The first is with a straightforward application of Bayes' Theorem. Let $boldsymbol X_{{a, b, c, ...}}$ stand for $boldsymbol Xbackslash{X_a, X_b, X_c, ...}$ and let $boldsymbol Z_{{a, b, c, ...}}$ be defined similarly. Then:
$$begin{aligned}
&p(boldsymbol X|boldsymbol Z) propto p(boldsymbol Z, boldsymbol X) \
&= p(boldsymbol Z|boldsymbol X) * p(boldsymbol X) \
&=p(Z_n|X_n, boldsymbol Z_{{n}}, boldsymbol X_{{n}}) * p(boldsymbol Z_{{n}} | boldsymbol X) * p(X_n | X_{n-1}, boldsymbol X_{{n, n-1}}) * p(boldsymbol X_{{n}}) \
&=p(Z_n|X_n) * p(X_n | X_{n-1}) * p(boldsymbol Z_{{n}} | boldsymbol X_{{n}}) * p(boldsymbol X_{{n}}) \
&= left[prod_{i=2}^np(Z_i|X_i)*p(X_i|X_{i-1})right]*p(Z_1|X_1)*p(X_1) &(1)\
&= left[prod_{i=2}^nmathcal N(Z_i|X_i, 1/delta_i^2)*mathcal N(X_i|X_{i-1}, 1/lambda^2)right]*mathcal N(Z_1|X_1, 1/delta_1^2)*mathcal N(X_1|mu, 1/sigma^2) &(2)
end{aligned}$$
After some manipulation of $(2)$, if we define $X_0=mu$, it's not hard to see that:
$$begin{aligned}
p(boldsymbol X|boldsymbol Z) propto prod_{i=1}^nmathcal N(X_i | alpha_i * Z_i + (1 - alpha_i) * X_{i - 1}, beta_i) &&(3)end{aligned}$$
Where we've defined:
$$begin{aligned}
alpha_1 &= frac{sigma^2}{sigma^2 + delta_1^2} \
forall k, 2leq kleq n:alpha_k&=frac{lambda^2}{lambda^2+delta_k^2} \
beta_1 &= frac{1}{sigma^2} + frac{1}{delta_1^2} \
forall k, 2leq kleq n:beta_k&=frac{1}{lambda^2} + frac{1}{delta_k^2}
end{aligned}$$
Although the precisions $beta_k$ are irrelevant for our purposes (I think?).
So, the MAP estimates for the values of $boldsymbol X$ can be given recursively by:
$$begin{aligned}
hat x_1 &= alpha_1 * z_1 + (1 - alpha_1) * mu \
forall k, 2leq kleq n:hat x_k &= alpha_k * z_k + (1 - alpha_k) * hat x_{k-1}
end{aligned}$$
The second way to solve this problem is to take $(2)$ above and turn this into an optimisation problem. In particular, since what we want is to maximise that quantity, we can instead call the negative logarithm an "error function" and minimise that.
$$begin{aligned}
&log(p(boldsymbol Z, boldsymbol X)) \
&=log
left[left[prod_{k=2}^nmathcal N(Z_k|X_k, 1/delta_k^2)*mathcal N(X_k|X_{k-1}, 1/lambda^2)right]*mathcal N(Z_1|X_1, 1/delta_1^2)*mathcal N(X_1|mu, 1/sigma^2)right] \
&propto frac 1 {sigma^2} (x_1-mu)^2 + frac 1 {lambda^2} sum_{k=2}^n(x_k-x_{k-1})^2 +
sum_{i=1}^nfrac 1 {delta_i^2}(x_i-z_i)^2 &(4)
end{aligned}$$
If the above two approaches can be used -- and it might turn out that I've made some assumption or mistake that means they can't -- then they should yield the same result.
Now, let's analyse the case I'm interested in. Suppose that there is some number $delta$ and a sequence of numbers $boldsymbol I = {I_1, I_2, ..., I_n}$ such that:
$$forall i: frac 1 {delta_i^2} = frac{I_i}{delta^2}$$
That is, the precision of the measurement noise is proportional to these $I_i$ and the constant of proportionality is always the same.
Now, I can rewrite my objective function in $(4)$ as:
$$frac {delta^2} {sigma^2} (x_1-mu)^2 + frac {delta^2} {lambda^2} sum_{k=2}^n(x_k-x_{k-1})^2 +
sum_{i=1}^nI_i(x_i-z_i)^2$$
The reason this is not strictly the same as $(4)$ is that the $I_i$ can in fact be any nonnegative number, $forall i:I_igeq 0$, including zero. So these $I_i$ are effectively a "relative weight" of each measurement I make and some measurements can have weight zero (if I, for example, did not actually make that measurement).
That way, I can in fact rewrite the alphas:
$$begin{aligned}
alpha_1 &= frac{I_1}{I_1 + delta^2 / sigma^2} \
forall k, 2leq kleq n:alpha_k&=frac{I_k}{I_k+delta^2/lambda^2}
end{aligned}$$
Now let's take the following implementation of the BN:

The MAP estimates should be
$$begin{aligned}
{&alpha_1 * z_1 + (1 - alpha_1) * mu,\
&alpha_1 * z_1 + (1 - alpha_1) * mu,\
&alpha_3 * z_3 + (1 - alpha_3) * left(alpha_1 * z_1 + (1 - alpha_1 * mu)right)}
end{aligned}$$
However, if I try to optimise the following objective function:
$$frac {delta^2} {sigma^2} (x_1-mu)^2 + frac {delta^2} {lambda^2} left[(x_2-x_1)^2 + (x_3-x_2)^2right] +
I_1(x_1-z_1)^2 + I_3(x_3-z_3)^2$$
I don't get those MAP estimates. Take, for example, the case where:
$$begin{aligned}
mu &= 1 \
z_1 &= 0.8 \
z_3 &= 1.2 \
I_1 &= 400 \
I_3 &= 500 \
frac {delta^2}{sigma^2} &= 300 \
frac {delta^2}{lambda^2} &= 4500
end{aligned}$$
This would imply:
$$begin{aligned}
alpha_1 &= frac 4 7 \
alpha_3 &= frac 1 {10} \
hat x_1 &= frac 4 7 * 0.8 + frac 3 7 * 1
= 0.89 \
hat x_3 &= frac 1 {10} * 1.2 + frac 9 {10} * 0.89
= 0.92 \
end{aligned}
$$
But that's not actually the solution. What am I missing? Where in this derivation did I get it wrong?
probability-theory statistics bayes-theorem bayesian-network kalman-filter
add a comment |
up vote
1
down vote
favorite
up vote
1
down vote
favorite
Suppose I have the following Bayesian Network:

It's given by the following relations:
$$begin{aligned}X_1&sim mathcal N(mu, 1/sigma^2)\
forall k, 2leq kleq n: X_k|X_{k-1}&sim mathcal N(x_{k-1}, 1/lambda^2)\
forall i, 1leq ileq n: Z_i|X_i&simmathcal N(x_{i}, 1/delta_i^2)
end{aligned}$$
I'm using the precision rather than the variance for a reason that will become clear soon.
So, the above is a very simplified Kalman filter, in which there is no control-input model and both the state-transition model and the observation model are the identity, and furthermore the precision of the process noise is the same at every step.
Under that definition, $boldsymbol X$ is the hidden variable I want to reason about, and $boldsymbol Z$ is the set of (noisy) observations I've collected. I want to find the value $hat{x}_n$ that maximises the distribution $p(X_n = x_n|boldsymbol Z)$ implied by the above relations (i.e. the MAP estimate of the last value).
There are two ways I can go about doing this.
The first is with a straightforward application of Bayes' Theorem. Let $boldsymbol X_{{a, b, c, ...}}$ stand for $boldsymbol Xbackslash{X_a, X_b, X_c, ...}$ and let $boldsymbol Z_{{a, b, c, ...}}$ be defined similarly. Then:
$$begin{aligned}
&p(boldsymbol X|boldsymbol Z) propto p(boldsymbol Z, boldsymbol X) \
&= p(boldsymbol Z|boldsymbol X) * p(boldsymbol X) \
&=p(Z_n|X_n, boldsymbol Z_{{n}}, boldsymbol X_{{n}}) * p(boldsymbol Z_{{n}} | boldsymbol X) * p(X_n | X_{n-1}, boldsymbol X_{{n, n-1}}) * p(boldsymbol X_{{n}}) \
&=p(Z_n|X_n) * p(X_n | X_{n-1}) * p(boldsymbol Z_{{n}} | boldsymbol X_{{n}}) * p(boldsymbol X_{{n}}) \
&= left[prod_{i=2}^np(Z_i|X_i)*p(X_i|X_{i-1})right]*p(Z_1|X_1)*p(X_1) &(1)\
&= left[prod_{i=2}^nmathcal N(Z_i|X_i, 1/delta_i^2)*mathcal N(X_i|X_{i-1}, 1/lambda^2)right]*mathcal N(Z_1|X_1, 1/delta_1^2)*mathcal N(X_1|mu, 1/sigma^2) &(2)
end{aligned}$$
After some manipulation of $(2)$, if we define $X_0=mu$, it's not hard to see that:
$$begin{aligned}
p(boldsymbol X|boldsymbol Z) propto prod_{i=1}^nmathcal N(X_i | alpha_i * Z_i + (1 - alpha_i) * X_{i - 1}, beta_i) &&(3)end{aligned}$$
Where we've defined:
$$begin{aligned}
alpha_1 &= frac{sigma^2}{sigma^2 + delta_1^2} \
forall k, 2leq kleq n:alpha_k&=frac{lambda^2}{lambda^2+delta_k^2} \
beta_1 &= frac{1}{sigma^2} + frac{1}{delta_1^2} \
forall k, 2leq kleq n:beta_k&=frac{1}{lambda^2} + frac{1}{delta_k^2}
end{aligned}$$
Although the precisions $beta_k$ are irrelevant for our purposes (I think?).
So, the MAP estimates for the values of $boldsymbol X$ can be given recursively by:
$$begin{aligned}
hat x_1 &= alpha_1 * z_1 + (1 - alpha_1) * mu \
forall k, 2leq kleq n:hat x_k &= alpha_k * z_k + (1 - alpha_k) * hat x_{k-1}
end{aligned}$$
The second way to solve this problem is to take $(2)$ above and turn this into an optimisation problem. In particular, since what we want is to maximise that quantity, we can instead call the negative logarithm an "error function" and minimise that.
$$begin{aligned}
&log(p(boldsymbol Z, boldsymbol X)) \
&=log
left[left[prod_{k=2}^nmathcal N(Z_k|X_k, 1/delta_k^2)*mathcal N(X_k|X_{k-1}, 1/lambda^2)right]*mathcal N(Z_1|X_1, 1/delta_1^2)*mathcal N(X_1|mu, 1/sigma^2)right] \
&propto frac 1 {sigma^2} (x_1-mu)^2 + frac 1 {lambda^2} sum_{k=2}^n(x_k-x_{k-1})^2 +
sum_{i=1}^nfrac 1 {delta_i^2}(x_i-z_i)^2 &(4)
end{aligned}$$
If the above two approaches can be used -- and it might turn out that I've made some assumption or mistake that means they can't -- then they should yield the same result.
Now, let's analyse the case I'm interested in. Suppose that there is some number $delta$ and a sequence of numbers $boldsymbol I = {I_1, I_2, ..., I_n}$ such that:
$$forall i: frac 1 {delta_i^2} = frac{I_i}{delta^2}$$
That is, the precision of the measurement noise is proportional to these $I_i$ and the constant of proportionality is always the same.
Now, I can rewrite my objective function in $(4)$ as:
$$frac {delta^2} {sigma^2} (x_1-mu)^2 + frac {delta^2} {lambda^2} sum_{k=2}^n(x_k-x_{k-1})^2 +
sum_{i=1}^nI_i(x_i-z_i)^2$$
The reason this is not strictly the same as $(4)$ is that the $I_i$ can in fact be any nonnegative number, $forall i:I_igeq 0$, including zero. So these $I_i$ are effectively a "relative weight" of each measurement I make and some measurements can have weight zero (if I, for example, did not actually make that measurement).
That way, I can in fact rewrite the alphas:
$$begin{aligned}
alpha_1 &= frac{I_1}{I_1 + delta^2 / sigma^2} \
forall k, 2leq kleq n:alpha_k&=frac{I_k}{I_k+delta^2/lambda^2}
end{aligned}$$
Now let's take the following implementation of the BN:

The MAP estimates should be
$$begin{aligned}
{&alpha_1 * z_1 + (1 - alpha_1) * mu,\
&alpha_1 * z_1 + (1 - alpha_1) * mu,\
&alpha_3 * z_3 + (1 - alpha_3) * left(alpha_1 * z_1 + (1 - alpha_1 * mu)right)}
end{aligned}$$
However, if I try to optimise the following objective function:
$$frac {delta^2} {sigma^2} (x_1-mu)^2 + frac {delta^2} {lambda^2} left[(x_2-x_1)^2 + (x_3-x_2)^2right] +
I_1(x_1-z_1)^2 + I_3(x_3-z_3)^2$$
I don't get those MAP estimates. Take, for example, the case where:
$$begin{aligned}
mu &= 1 \
z_1 &= 0.8 \
z_3 &= 1.2 \
I_1 &= 400 \
I_3 &= 500 \
frac {delta^2}{sigma^2} &= 300 \
frac {delta^2}{lambda^2} &= 4500
end{aligned}$$
This would imply:
$$begin{aligned}
alpha_1 &= frac 4 7 \
alpha_3 &= frac 1 {10} \
hat x_1 &= frac 4 7 * 0.8 + frac 3 7 * 1
= 0.89 \
hat x_3 &= frac 1 {10} * 1.2 + frac 9 {10} * 0.89
= 0.92 \
end{aligned}
$$
But that's not actually the solution. What am I missing? Where in this derivation did I get it wrong?
probability-theory statistics bayes-theorem bayesian-network kalman-filter
Suppose I have the following Bayesian Network:

It's given by the following relations:
$$begin{aligned}X_1&sim mathcal N(mu, 1/sigma^2)\
forall k, 2leq kleq n: X_k|X_{k-1}&sim mathcal N(x_{k-1}, 1/lambda^2)\
forall i, 1leq ileq n: Z_i|X_i&simmathcal N(x_{i}, 1/delta_i^2)
end{aligned}$$
I'm using the precision rather than the variance for a reason that will become clear soon.
So, the above is a very simplified Kalman filter, in which there is no control-input model and both the state-transition model and the observation model are the identity, and furthermore the precision of the process noise is the same at every step.
Under that definition, $boldsymbol X$ is the hidden variable I want to reason about, and $boldsymbol Z$ is the set of (noisy) observations I've collected. I want to find the value $hat{x}_n$ that maximises the distribution $p(X_n = x_n|boldsymbol Z)$ implied by the above relations (i.e. the MAP estimate of the last value).
There are two ways I can go about doing this.
The first is with a straightforward application of Bayes' Theorem. Let $boldsymbol X_{{a, b, c, ...}}$ stand for $boldsymbol Xbackslash{X_a, X_b, X_c, ...}$ and let $boldsymbol Z_{{a, b, c, ...}}$ be defined similarly. Then:
$$begin{aligned}
&p(boldsymbol X|boldsymbol Z) propto p(boldsymbol Z, boldsymbol X) \
&= p(boldsymbol Z|boldsymbol X) * p(boldsymbol X) \
&=p(Z_n|X_n, boldsymbol Z_{{n}}, boldsymbol X_{{n}}) * p(boldsymbol Z_{{n}} | boldsymbol X) * p(X_n | X_{n-1}, boldsymbol X_{{n, n-1}}) * p(boldsymbol X_{{n}}) \
&=p(Z_n|X_n) * p(X_n | X_{n-1}) * p(boldsymbol Z_{{n}} | boldsymbol X_{{n}}) * p(boldsymbol X_{{n}}) \
&= left[prod_{i=2}^np(Z_i|X_i)*p(X_i|X_{i-1})right]*p(Z_1|X_1)*p(X_1) &(1)\
&= left[prod_{i=2}^nmathcal N(Z_i|X_i, 1/delta_i^2)*mathcal N(X_i|X_{i-1}, 1/lambda^2)right]*mathcal N(Z_1|X_1, 1/delta_1^2)*mathcal N(X_1|mu, 1/sigma^2) &(2)
end{aligned}$$
After some manipulation of $(2)$, if we define $X_0=mu$, it's not hard to see that:
$$begin{aligned}
p(boldsymbol X|boldsymbol Z) propto prod_{i=1}^nmathcal N(X_i | alpha_i * Z_i + (1 - alpha_i) * X_{i - 1}, beta_i) &&(3)end{aligned}$$
Where we've defined:
$$begin{aligned}
alpha_1 &= frac{sigma^2}{sigma^2 + delta_1^2} \
forall k, 2leq kleq n:alpha_k&=frac{lambda^2}{lambda^2+delta_k^2} \
beta_1 &= frac{1}{sigma^2} + frac{1}{delta_1^2} \
forall k, 2leq kleq n:beta_k&=frac{1}{lambda^2} + frac{1}{delta_k^2}
end{aligned}$$
Although the precisions $beta_k$ are irrelevant for our purposes (I think?).
So, the MAP estimates for the values of $boldsymbol X$ can be given recursively by:
$$begin{aligned}
hat x_1 &= alpha_1 * z_1 + (1 - alpha_1) * mu \
forall k, 2leq kleq n:hat x_k &= alpha_k * z_k + (1 - alpha_k) * hat x_{k-1}
end{aligned}$$
The second way to solve this problem is to take $(2)$ above and turn this into an optimisation problem. In particular, since what we want is to maximise that quantity, we can instead call the negative logarithm an "error function" and minimise that.
$$begin{aligned}
&log(p(boldsymbol Z, boldsymbol X)) \
&=log
left[left[prod_{k=2}^nmathcal N(Z_k|X_k, 1/delta_k^2)*mathcal N(X_k|X_{k-1}, 1/lambda^2)right]*mathcal N(Z_1|X_1, 1/delta_1^2)*mathcal N(X_1|mu, 1/sigma^2)right] \
&propto frac 1 {sigma^2} (x_1-mu)^2 + frac 1 {lambda^2} sum_{k=2}^n(x_k-x_{k-1})^2 +
sum_{i=1}^nfrac 1 {delta_i^2}(x_i-z_i)^2 &(4)
end{aligned}$$
If the above two approaches can be used -- and it might turn out that I've made some assumption or mistake that means they can't -- then they should yield the same result.
Now, let's analyse the case I'm interested in. Suppose that there is some number $delta$ and a sequence of numbers $boldsymbol I = {I_1, I_2, ..., I_n}$ such that:
$$forall i: frac 1 {delta_i^2} = frac{I_i}{delta^2}$$
That is, the precision of the measurement noise is proportional to these $I_i$ and the constant of proportionality is always the same.
Now, I can rewrite my objective function in $(4)$ as:
$$frac {delta^2} {sigma^2} (x_1-mu)^2 + frac {delta^2} {lambda^2} sum_{k=2}^n(x_k-x_{k-1})^2 +
sum_{i=1}^nI_i(x_i-z_i)^2$$
The reason this is not strictly the same as $(4)$ is that the $I_i$ can in fact be any nonnegative number, $forall i:I_igeq 0$, including zero. So these $I_i$ are effectively a "relative weight" of each measurement I make and some measurements can have weight zero (if I, for example, did not actually make that measurement).
That way, I can in fact rewrite the alphas:
$$begin{aligned}
alpha_1 &= frac{I_1}{I_1 + delta^2 / sigma^2} \
forall k, 2leq kleq n:alpha_k&=frac{I_k}{I_k+delta^2/lambda^2}
end{aligned}$$
Now let's take the following implementation of the BN:

The MAP estimates should be
$$begin{aligned}
{&alpha_1 * z_1 + (1 - alpha_1) * mu,\
&alpha_1 * z_1 + (1 - alpha_1) * mu,\
&alpha_3 * z_3 + (1 - alpha_3) * left(alpha_1 * z_1 + (1 - alpha_1 * mu)right)}
end{aligned}$$
However, if I try to optimise the following objective function:
$$frac {delta^2} {sigma^2} (x_1-mu)^2 + frac {delta^2} {lambda^2} left[(x_2-x_1)^2 + (x_3-x_2)^2right] +
I_1(x_1-z_1)^2 + I_3(x_3-z_3)^2$$
I don't get those MAP estimates. Take, for example, the case where:
$$begin{aligned}
mu &= 1 \
z_1 &= 0.8 \
z_3 &= 1.2 \
I_1 &= 400 \
I_3 &= 500 \
frac {delta^2}{sigma^2} &= 300 \
frac {delta^2}{lambda^2} &= 4500
end{aligned}$$
This would imply:
$$begin{aligned}
alpha_1 &= frac 4 7 \
alpha_3 &= frac 1 {10} \
hat x_1 &= frac 4 7 * 0.8 + frac 3 7 * 1
= 0.89 \
hat x_3 &= frac 1 {10} * 1.2 + frac 9 {10} * 0.89
= 0.92 \
end{aligned}
$$
But that's not actually the solution. What am I missing? Where in this derivation did I get it wrong?
probability-theory statistics bayes-theorem bayesian-network kalman-filter
probability-theory statistics bayes-theorem bayesian-network kalman-filter
edited Dec 5 at 18:19
asked Nov 27 at 23:53
Pedro Carvalho
399115
399115
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
up vote
0
down vote
accepted
After much thinking and writing things down I figured it out.
The recursive relations implied by $(3)$ are correct for the conditional MAP estimate. That is to say, the values of $hat x_i$ I wrote out in the "first approach" section of my original post are the values that maximise their respective $p(X_i|Z_i, X_{i-1})$, conditioning on the present observation and the previous value.
The optimisation approach, however, is minimising the negative logarithm of $p(boldsymbol X|boldsymbol Z)$ -- the joint posterior over all variables $X_i$, conditional on the set of measurements $boldsymbol Z$ -- and there's no guarantee that the $i^{th}$ coordinate of the MAP vector $hat{boldsymbol X}$ will equal $hat x_i$ (and, as shown above, it probably almost always doesn't).
This is because, when I'm optimising the joint posterior, I'm using future values to update the past -- in my example, the fact that $z_3=1.2$ was as relevant to $X_1$ as the fact that $z_1=0.8$ was to $X_3$, and so all values in $boldsymbol X$ are jointly optimised to maximise the posterior.
ETA:
And the actual thing that I want, the MAP/expected value (since they're the same here) of $p(X_i|boldsymbol Z_{jleq i})$, is:
$$begin{aligned}
bar x_0&equiv mu\
forall i> 0:bar x_i &equiv mathbb E(X_i|boldsymbol Z_{jleq i}) \
&= alpha_i * z_i + (1 - alpha_i) * bar x_{i- 1} \
forall i > 0:alpha_i &=frac {I_i}{I_i + delta^2/ (sigma_{i-1}^2 + lambda_i^2)} \
sigma_0 &=sigma \
lambda_1 &= 0 \
forall i > 0 : sigma_i^2 &= frac {delta^2}{I_i + delta^2/ (sigma_{i-1}^2 + lambda_i^2)} \
forall i > 1 : lambda_i &= lambda
end{aligned}$$
And with the above definitions in mind:
$$begin{aligned}
X_i|boldsymbol Z_{jleq i} &sim mathcal N(bar x_i, 1 / sigma_i^2) \
frac{1}{sigma_i^2} &= frac{1} {delta_i^2} + frac{1}{sigma_{i-1}^2+lambda_i^2}
end{aligned}$$
That is, the precision of the distribution of $X_i|boldsymbol Z_{jleq i}$ is the precision of $Z_i|X_i$ plus the inverse of the sum of the variances of $X_i|X_{i-1}$ and $X_{i-1}|boldsymbol Z_{j < i}$.
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "69"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
noCode: true, onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f3016500%2foptimisation-vs-bayes-theorem-not-coinciding%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
0
down vote
accepted
After much thinking and writing things down I figured it out.
The recursive relations implied by $(3)$ are correct for the conditional MAP estimate. That is to say, the values of $hat x_i$ I wrote out in the "first approach" section of my original post are the values that maximise their respective $p(X_i|Z_i, X_{i-1})$, conditioning on the present observation and the previous value.
The optimisation approach, however, is minimising the negative logarithm of $p(boldsymbol X|boldsymbol Z)$ -- the joint posterior over all variables $X_i$, conditional on the set of measurements $boldsymbol Z$ -- and there's no guarantee that the $i^{th}$ coordinate of the MAP vector $hat{boldsymbol X}$ will equal $hat x_i$ (and, as shown above, it probably almost always doesn't).
This is because, when I'm optimising the joint posterior, I'm using future values to update the past -- in my example, the fact that $z_3=1.2$ was as relevant to $X_1$ as the fact that $z_1=0.8$ was to $X_3$, and so all values in $boldsymbol X$ are jointly optimised to maximise the posterior.
ETA:
And the actual thing that I want, the MAP/expected value (since they're the same here) of $p(X_i|boldsymbol Z_{jleq i})$, is:
$$begin{aligned}
bar x_0&equiv mu\
forall i> 0:bar x_i &equiv mathbb E(X_i|boldsymbol Z_{jleq i}) \
&= alpha_i * z_i + (1 - alpha_i) * bar x_{i- 1} \
forall i > 0:alpha_i &=frac {I_i}{I_i + delta^2/ (sigma_{i-1}^2 + lambda_i^2)} \
sigma_0 &=sigma \
lambda_1 &= 0 \
forall i > 0 : sigma_i^2 &= frac {delta^2}{I_i + delta^2/ (sigma_{i-1}^2 + lambda_i^2)} \
forall i > 1 : lambda_i &= lambda
end{aligned}$$
And with the above definitions in mind:
$$begin{aligned}
X_i|boldsymbol Z_{jleq i} &sim mathcal N(bar x_i, 1 / sigma_i^2) \
frac{1}{sigma_i^2} &= frac{1} {delta_i^2} + frac{1}{sigma_{i-1}^2+lambda_i^2}
end{aligned}$$
That is, the precision of the distribution of $X_i|boldsymbol Z_{jleq i}$ is the precision of $Z_i|X_i$ plus the inverse of the sum of the variances of $X_i|X_{i-1}$ and $X_{i-1}|boldsymbol Z_{j < i}$.
add a comment |
up vote
0
down vote
accepted
After much thinking and writing things down I figured it out.
The recursive relations implied by $(3)$ are correct for the conditional MAP estimate. That is to say, the values of $hat x_i$ I wrote out in the "first approach" section of my original post are the values that maximise their respective $p(X_i|Z_i, X_{i-1})$, conditioning on the present observation and the previous value.
The optimisation approach, however, is minimising the negative logarithm of $p(boldsymbol X|boldsymbol Z)$ -- the joint posterior over all variables $X_i$, conditional on the set of measurements $boldsymbol Z$ -- and there's no guarantee that the $i^{th}$ coordinate of the MAP vector $hat{boldsymbol X}$ will equal $hat x_i$ (and, as shown above, it probably almost always doesn't).
This is because, when I'm optimising the joint posterior, I'm using future values to update the past -- in my example, the fact that $z_3=1.2$ was as relevant to $X_1$ as the fact that $z_1=0.8$ was to $X_3$, and so all values in $boldsymbol X$ are jointly optimised to maximise the posterior.
ETA:
And the actual thing that I want, the MAP/expected value (since they're the same here) of $p(X_i|boldsymbol Z_{jleq i})$, is:
$$begin{aligned}
bar x_0&equiv mu\
forall i> 0:bar x_i &equiv mathbb E(X_i|boldsymbol Z_{jleq i}) \
&= alpha_i * z_i + (1 - alpha_i) * bar x_{i- 1} \
forall i > 0:alpha_i &=frac {I_i}{I_i + delta^2/ (sigma_{i-1}^2 + lambda_i^2)} \
sigma_0 &=sigma \
lambda_1 &= 0 \
forall i > 0 : sigma_i^2 &= frac {delta^2}{I_i + delta^2/ (sigma_{i-1}^2 + lambda_i^2)} \
forall i > 1 : lambda_i &= lambda
end{aligned}$$
And with the above definitions in mind:
$$begin{aligned}
X_i|boldsymbol Z_{jleq i} &sim mathcal N(bar x_i, 1 / sigma_i^2) \
frac{1}{sigma_i^2} &= frac{1} {delta_i^2} + frac{1}{sigma_{i-1}^2+lambda_i^2}
end{aligned}$$
That is, the precision of the distribution of $X_i|boldsymbol Z_{jleq i}$ is the precision of $Z_i|X_i$ plus the inverse of the sum of the variances of $X_i|X_{i-1}$ and $X_{i-1}|boldsymbol Z_{j < i}$.
add a comment |
up vote
0
down vote
accepted
up vote
0
down vote
accepted
After much thinking and writing things down I figured it out.
The recursive relations implied by $(3)$ are correct for the conditional MAP estimate. That is to say, the values of $hat x_i$ I wrote out in the "first approach" section of my original post are the values that maximise their respective $p(X_i|Z_i, X_{i-1})$, conditioning on the present observation and the previous value.
The optimisation approach, however, is minimising the negative logarithm of $p(boldsymbol X|boldsymbol Z)$ -- the joint posterior over all variables $X_i$, conditional on the set of measurements $boldsymbol Z$ -- and there's no guarantee that the $i^{th}$ coordinate of the MAP vector $hat{boldsymbol X}$ will equal $hat x_i$ (and, as shown above, it probably almost always doesn't).
This is because, when I'm optimising the joint posterior, I'm using future values to update the past -- in my example, the fact that $z_3=1.2$ was as relevant to $X_1$ as the fact that $z_1=0.8$ was to $X_3$, and so all values in $boldsymbol X$ are jointly optimised to maximise the posterior.
ETA:
And the actual thing that I want, the MAP/expected value (since they're the same here) of $p(X_i|boldsymbol Z_{jleq i})$, is:
$$begin{aligned}
bar x_0&equiv mu\
forall i> 0:bar x_i &equiv mathbb E(X_i|boldsymbol Z_{jleq i}) \
&= alpha_i * z_i + (1 - alpha_i) * bar x_{i- 1} \
forall i > 0:alpha_i &=frac {I_i}{I_i + delta^2/ (sigma_{i-1}^2 + lambda_i^2)} \
sigma_0 &=sigma \
lambda_1 &= 0 \
forall i > 0 : sigma_i^2 &= frac {delta^2}{I_i + delta^2/ (sigma_{i-1}^2 + lambda_i^2)} \
forall i > 1 : lambda_i &= lambda
end{aligned}$$
And with the above definitions in mind:
$$begin{aligned}
X_i|boldsymbol Z_{jleq i} &sim mathcal N(bar x_i, 1 / sigma_i^2) \
frac{1}{sigma_i^2} &= frac{1} {delta_i^2} + frac{1}{sigma_{i-1}^2+lambda_i^2}
end{aligned}$$
That is, the precision of the distribution of $X_i|boldsymbol Z_{jleq i}$ is the precision of $Z_i|X_i$ plus the inverse of the sum of the variances of $X_i|X_{i-1}$ and $X_{i-1}|boldsymbol Z_{j < i}$.
After much thinking and writing things down I figured it out.
The recursive relations implied by $(3)$ are correct for the conditional MAP estimate. That is to say, the values of $hat x_i$ I wrote out in the "first approach" section of my original post are the values that maximise their respective $p(X_i|Z_i, X_{i-1})$, conditioning on the present observation and the previous value.
The optimisation approach, however, is minimising the negative logarithm of $p(boldsymbol X|boldsymbol Z)$ -- the joint posterior over all variables $X_i$, conditional on the set of measurements $boldsymbol Z$ -- and there's no guarantee that the $i^{th}$ coordinate of the MAP vector $hat{boldsymbol X}$ will equal $hat x_i$ (and, as shown above, it probably almost always doesn't).
This is because, when I'm optimising the joint posterior, I'm using future values to update the past -- in my example, the fact that $z_3=1.2$ was as relevant to $X_1$ as the fact that $z_1=0.8$ was to $X_3$, and so all values in $boldsymbol X$ are jointly optimised to maximise the posterior.
ETA:
And the actual thing that I want, the MAP/expected value (since they're the same here) of $p(X_i|boldsymbol Z_{jleq i})$, is:
$$begin{aligned}
bar x_0&equiv mu\
forall i> 0:bar x_i &equiv mathbb E(X_i|boldsymbol Z_{jleq i}) \
&= alpha_i * z_i + (1 - alpha_i) * bar x_{i- 1} \
forall i > 0:alpha_i &=frac {I_i}{I_i + delta^2/ (sigma_{i-1}^2 + lambda_i^2)} \
sigma_0 &=sigma \
lambda_1 &= 0 \
forall i > 0 : sigma_i^2 &= frac {delta^2}{I_i + delta^2/ (sigma_{i-1}^2 + lambda_i^2)} \
forall i > 1 : lambda_i &= lambda
end{aligned}$$
And with the above definitions in mind:
$$begin{aligned}
X_i|boldsymbol Z_{jleq i} &sim mathcal N(bar x_i, 1 / sigma_i^2) \
frac{1}{sigma_i^2} &= frac{1} {delta_i^2} + frac{1}{sigma_{i-1}^2+lambda_i^2}
end{aligned}$$
That is, the precision of the distribution of $X_i|boldsymbol Z_{jleq i}$ is the precision of $Z_i|X_i$ plus the inverse of the sum of the variances of $X_i|X_{i-1}$ and $X_{i-1}|boldsymbol Z_{j < i}$.
edited Dec 11 at 20:56
answered Nov 28 at 18:52
Pedro Carvalho
399115
399115
add a comment |
add a comment |
Thanks for contributing an answer to Mathematics Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f3016500%2foptimisation-vs-bayes-theorem-not-coinciding%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown