Optimisation vs. Bayes' Theorem not coinciding











up vote
1
down vote

favorite












Suppose I have the following Bayesian Network:



enter image description here



It's given by the following relations:
$$begin{aligned}X_1&sim mathcal N(mu, 1/sigma^2)\
forall k, 2leq kleq n: X_k|X_{k-1}&sim mathcal N(x_{k-1}, 1/lambda^2)\
forall i, 1leq ileq n: Z_i|X_i&simmathcal N(x_{i}, 1/delta_i^2)
end{aligned}$$

I'm using the precision rather than the variance for a reason that will become clear soon.



So, the above is a very simplified Kalman filter, in which there is no control-input model and both the state-transition model and the observation model are the identity, and furthermore the precision of the process noise is the same at every step.



Under that definition, $boldsymbol X$ is the hidden variable I want to reason about, and $boldsymbol Z$ is the set of (noisy) observations I've collected. I want to find the value $hat{x}_n$ that maximises the distribution $p(X_n = x_n|boldsymbol Z)$ implied by the above relations (i.e. the MAP estimate of the last value).



There are two ways I can go about doing this.





The first is with a straightforward application of Bayes' Theorem. Let $boldsymbol X_{{a, b, c, ...}}$ stand for $boldsymbol Xbackslash{X_a, X_b, X_c, ...}$ and let $boldsymbol Z_{{a, b, c, ...}}$ be defined similarly. Then:



$$begin{aligned}
&p(boldsymbol X|boldsymbol Z) propto p(boldsymbol Z, boldsymbol X) \
&= p(boldsymbol Z|boldsymbol X) * p(boldsymbol X) \
&=p(Z_n|X_n, boldsymbol Z_{{n}}, boldsymbol X_{{n}}) * p(boldsymbol Z_{{n}} | boldsymbol X) * p(X_n | X_{n-1}, boldsymbol X_{{n, n-1}}) * p(boldsymbol X_{{n}}) \
&=p(Z_n|X_n) * p(X_n | X_{n-1}) * p(boldsymbol Z_{{n}} | boldsymbol X_{{n}}) * p(boldsymbol X_{{n}}) \
&= left[prod_{i=2}^np(Z_i|X_i)*p(X_i|X_{i-1})right]*p(Z_1|X_1)*p(X_1) &(1)\
&= left[prod_{i=2}^nmathcal N(Z_i|X_i, 1/delta_i^2)*mathcal N(X_i|X_{i-1}, 1/lambda^2)right]*mathcal N(Z_1|X_1, 1/delta_1^2)*mathcal N(X_1|mu, 1/sigma^2) &(2)
end{aligned}$$



After some manipulation of $(2)$, if we define $X_0=mu$, it's not hard to see that:



$$begin{aligned}
p(boldsymbol X|boldsymbol Z) propto prod_{i=1}^nmathcal N(X_i | alpha_i * Z_i + (1 - alpha_i) * X_{i - 1}, beta_i) &&(3)end{aligned}$$



Where we've defined:



$$begin{aligned}
alpha_1 &= frac{sigma^2}{sigma^2 + delta_1^2} \
forall k, 2leq kleq n:alpha_k&=frac{lambda^2}{lambda^2+delta_k^2} \
beta_1 &= frac{1}{sigma^2} + frac{1}{delta_1^2} \
forall k, 2leq kleq n:beta_k&=frac{1}{lambda^2} + frac{1}{delta_k^2}
end{aligned}$$



Although the precisions $beta_k$ are irrelevant for our purposes (I think?).



So, the MAP estimates for the values of $boldsymbol X$ can be given recursively by:



$$begin{aligned}
hat x_1 &= alpha_1 * z_1 + (1 - alpha_1) * mu \
forall k, 2leq kleq n:hat x_k &= alpha_k * z_k + (1 - alpha_k) * hat x_{k-1}
end{aligned}$$





The second way to solve this problem is to take $(2)$ above and turn this into an optimisation problem. In particular, since what we want is to maximise that quantity, we can instead call the negative logarithm an "error function" and minimise that.



$$begin{aligned}
&log(p(boldsymbol Z, boldsymbol X)) \
&=log
left[left[prod_{k=2}^nmathcal N(Z_k|X_k, 1/delta_k^2)*mathcal N(X_k|X_{k-1}, 1/lambda^2)right]*mathcal N(Z_1|X_1, 1/delta_1^2)*mathcal N(X_1|mu, 1/sigma^2)right] \
&propto frac 1 {sigma^2} (x_1-mu)^2 + frac 1 {lambda^2} sum_{k=2}^n(x_k-x_{k-1})^2 +
sum_{i=1}^nfrac 1 {delta_i^2}(x_i-z_i)^2 &(4)
end{aligned}$$





If the above two approaches can be used -- and it might turn out that I've made some assumption or mistake that means they can't -- then they should yield the same result.



Now, let's analyse the case I'm interested in. Suppose that there is some number $delta$ and a sequence of numbers $boldsymbol I = {I_1, I_2, ..., I_n}$ such that:



$$forall i: frac 1 {delta_i^2} = frac{I_i}{delta^2}$$



That is, the precision of the measurement noise is proportional to these $I_i$ and the constant of proportionality is always the same.



Now, I can rewrite my objective function in $(4)$ as:



$$frac {delta^2} {sigma^2} (x_1-mu)^2 + frac {delta^2} {lambda^2} sum_{k=2}^n(x_k-x_{k-1})^2 +
sum_{i=1}^nI_i(x_i-z_i)^2$$



The reason this is not strictly the same as $(4)$ is that the $I_i$ can in fact be any nonnegative number, $forall i:I_igeq 0$, including zero. So these $I_i$ are effectively a "relative weight" of each measurement I make and some measurements can have weight zero (if I, for example, did not actually make that measurement).



That way, I can in fact rewrite the alphas:



$$begin{aligned}
alpha_1 &= frac{I_1}{I_1 + delta^2 / sigma^2} \
forall k, 2leq kleq n:alpha_k&=frac{I_k}{I_k+delta^2/lambda^2}
end{aligned}$$



Now let's take the following implementation of the BN:



enter image description here



The MAP estimates should be



$$begin{aligned}
{&alpha_1 * z_1 + (1 - alpha_1) * mu,\
&alpha_1 * z_1 + (1 - alpha_1) * mu,\
&alpha_3 * z_3 + (1 - alpha_3) * left(alpha_1 * z_1 + (1 - alpha_1 * mu)right)}
end{aligned}$$



However, if I try to optimise the following objective function:



$$frac {delta^2} {sigma^2} (x_1-mu)^2 + frac {delta^2} {lambda^2} left[(x_2-x_1)^2 + (x_3-x_2)^2right] +
I_1(x_1-z_1)^2 + I_3(x_3-z_3)^2$$



I don't get those MAP estimates. Take, for example, the case where:



$$begin{aligned}
mu &= 1 \
z_1 &= 0.8 \
z_3 &= 1.2 \
I_1 &= 400 \
I_3 &= 500 \
frac {delta^2}{sigma^2} &= 300 \
frac {delta^2}{lambda^2} &= 4500
end{aligned}$$



This would imply:



$$begin{aligned}
alpha_1 &= frac 4 7 \
alpha_3 &= frac 1 {10} \
hat x_1 &= frac 4 7 * 0.8 + frac 3 7 * 1
= 0.89 \
hat x_3 &= frac 1 {10} * 1.2 + frac 9 {10} * 0.89
= 0.92 \
end{aligned}
$$



But that's not actually the solution. What am I missing? Where in this derivation did I get it wrong?










share|cite|improve this question




























    up vote
    1
    down vote

    favorite












    Suppose I have the following Bayesian Network:



    enter image description here



    It's given by the following relations:
    $$begin{aligned}X_1&sim mathcal N(mu, 1/sigma^2)\
    forall k, 2leq kleq n: X_k|X_{k-1}&sim mathcal N(x_{k-1}, 1/lambda^2)\
    forall i, 1leq ileq n: Z_i|X_i&simmathcal N(x_{i}, 1/delta_i^2)
    end{aligned}$$

    I'm using the precision rather than the variance for a reason that will become clear soon.



    So, the above is a very simplified Kalman filter, in which there is no control-input model and both the state-transition model and the observation model are the identity, and furthermore the precision of the process noise is the same at every step.



    Under that definition, $boldsymbol X$ is the hidden variable I want to reason about, and $boldsymbol Z$ is the set of (noisy) observations I've collected. I want to find the value $hat{x}_n$ that maximises the distribution $p(X_n = x_n|boldsymbol Z)$ implied by the above relations (i.e. the MAP estimate of the last value).



    There are two ways I can go about doing this.





    The first is with a straightforward application of Bayes' Theorem. Let $boldsymbol X_{{a, b, c, ...}}$ stand for $boldsymbol Xbackslash{X_a, X_b, X_c, ...}$ and let $boldsymbol Z_{{a, b, c, ...}}$ be defined similarly. Then:



    $$begin{aligned}
    &p(boldsymbol X|boldsymbol Z) propto p(boldsymbol Z, boldsymbol X) \
    &= p(boldsymbol Z|boldsymbol X) * p(boldsymbol X) \
    &=p(Z_n|X_n, boldsymbol Z_{{n}}, boldsymbol X_{{n}}) * p(boldsymbol Z_{{n}} | boldsymbol X) * p(X_n | X_{n-1}, boldsymbol X_{{n, n-1}}) * p(boldsymbol X_{{n}}) \
    &=p(Z_n|X_n) * p(X_n | X_{n-1}) * p(boldsymbol Z_{{n}} | boldsymbol X_{{n}}) * p(boldsymbol X_{{n}}) \
    &= left[prod_{i=2}^np(Z_i|X_i)*p(X_i|X_{i-1})right]*p(Z_1|X_1)*p(X_1) &(1)\
    &= left[prod_{i=2}^nmathcal N(Z_i|X_i, 1/delta_i^2)*mathcal N(X_i|X_{i-1}, 1/lambda^2)right]*mathcal N(Z_1|X_1, 1/delta_1^2)*mathcal N(X_1|mu, 1/sigma^2) &(2)
    end{aligned}$$



    After some manipulation of $(2)$, if we define $X_0=mu$, it's not hard to see that:



    $$begin{aligned}
    p(boldsymbol X|boldsymbol Z) propto prod_{i=1}^nmathcal N(X_i | alpha_i * Z_i + (1 - alpha_i) * X_{i - 1}, beta_i) &&(3)end{aligned}$$



    Where we've defined:



    $$begin{aligned}
    alpha_1 &= frac{sigma^2}{sigma^2 + delta_1^2} \
    forall k, 2leq kleq n:alpha_k&=frac{lambda^2}{lambda^2+delta_k^2} \
    beta_1 &= frac{1}{sigma^2} + frac{1}{delta_1^2} \
    forall k, 2leq kleq n:beta_k&=frac{1}{lambda^2} + frac{1}{delta_k^2}
    end{aligned}$$



    Although the precisions $beta_k$ are irrelevant for our purposes (I think?).



    So, the MAP estimates for the values of $boldsymbol X$ can be given recursively by:



    $$begin{aligned}
    hat x_1 &= alpha_1 * z_1 + (1 - alpha_1) * mu \
    forall k, 2leq kleq n:hat x_k &= alpha_k * z_k + (1 - alpha_k) * hat x_{k-1}
    end{aligned}$$





    The second way to solve this problem is to take $(2)$ above and turn this into an optimisation problem. In particular, since what we want is to maximise that quantity, we can instead call the negative logarithm an "error function" and minimise that.



    $$begin{aligned}
    &log(p(boldsymbol Z, boldsymbol X)) \
    &=log
    left[left[prod_{k=2}^nmathcal N(Z_k|X_k, 1/delta_k^2)*mathcal N(X_k|X_{k-1}, 1/lambda^2)right]*mathcal N(Z_1|X_1, 1/delta_1^2)*mathcal N(X_1|mu, 1/sigma^2)right] \
    &propto frac 1 {sigma^2} (x_1-mu)^2 + frac 1 {lambda^2} sum_{k=2}^n(x_k-x_{k-1})^2 +
    sum_{i=1}^nfrac 1 {delta_i^2}(x_i-z_i)^2 &(4)
    end{aligned}$$





    If the above two approaches can be used -- and it might turn out that I've made some assumption or mistake that means they can't -- then they should yield the same result.



    Now, let's analyse the case I'm interested in. Suppose that there is some number $delta$ and a sequence of numbers $boldsymbol I = {I_1, I_2, ..., I_n}$ such that:



    $$forall i: frac 1 {delta_i^2} = frac{I_i}{delta^2}$$



    That is, the precision of the measurement noise is proportional to these $I_i$ and the constant of proportionality is always the same.



    Now, I can rewrite my objective function in $(4)$ as:



    $$frac {delta^2} {sigma^2} (x_1-mu)^2 + frac {delta^2} {lambda^2} sum_{k=2}^n(x_k-x_{k-1})^2 +
    sum_{i=1}^nI_i(x_i-z_i)^2$$



    The reason this is not strictly the same as $(4)$ is that the $I_i$ can in fact be any nonnegative number, $forall i:I_igeq 0$, including zero. So these $I_i$ are effectively a "relative weight" of each measurement I make and some measurements can have weight zero (if I, for example, did not actually make that measurement).



    That way, I can in fact rewrite the alphas:



    $$begin{aligned}
    alpha_1 &= frac{I_1}{I_1 + delta^2 / sigma^2} \
    forall k, 2leq kleq n:alpha_k&=frac{I_k}{I_k+delta^2/lambda^2}
    end{aligned}$$



    Now let's take the following implementation of the BN:



    enter image description here



    The MAP estimates should be



    $$begin{aligned}
    {&alpha_1 * z_1 + (1 - alpha_1) * mu,\
    &alpha_1 * z_1 + (1 - alpha_1) * mu,\
    &alpha_3 * z_3 + (1 - alpha_3) * left(alpha_1 * z_1 + (1 - alpha_1 * mu)right)}
    end{aligned}$$



    However, if I try to optimise the following objective function:



    $$frac {delta^2} {sigma^2} (x_1-mu)^2 + frac {delta^2} {lambda^2} left[(x_2-x_1)^2 + (x_3-x_2)^2right] +
    I_1(x_1-z_1)^2 + I_3(x_3-z_3)^2$$



    I don't get those MAP estimates. Take, for example, the case where:



    $$begin{aligned}
    mu &= 1 \
    z_1 &= 0.8 \
    z_3 &= 1.2 \
    I_1 &= 400 \
    I_3 &= 500 \
    frac {delta^2}{sigma^2} &= 300 \
    frac {delta^2}{lambda^2} &= 4500
    end{aligned}$$



    This would imply:



    $$begin{aligned}
    alpha_1 &= frac 4 7 \
    alpha_3 &= frac 1 {10} \
    hat x_1 &= frac 4 7 * 0.8 + frac 3 7 * 1
    = 0.89 \
    hat x_3 &= frac 1 {10} * 1.2 + frac 9 {10} * 0.89
    = 0.92 \
    end{aligned}
    $$



    But that's not actually the solution. What am I missing? Where in this derivation did I get it wrong?










    share|cite|improve this question


























      up vote
      1
      down vote

      favorite









      up vote
      1
      down vote

      favorite











      Suppose I have the following Bayesian Network:



      enter image description here



      It's given by the following relations:
      $$begin{aligned}X_1&sim mathcal N(mu, 1/sigma^2)\
      forall k, 2leq kleq n: X_k|X_{k-1}&sim mathcal N(x_{k-1}, 1/lambda^2)\
      forall i, 1leq ileq n: Z_i|X_i&simmathcal N(x_{i}, 1/delta_i^2)
      end{aligned}$$

      I'm using the precision rather than the variance for a reason that will become clear soon.



      So, the above is a very simplified Kalman filter, in which there is no control-input model and both the state-transition model and the observation model are the identity, and furthermore the precision of the process noise is the same at every step.



      Under that definition, $boldsymbol X$ is the hidden variable I want to reason about, and $boldsymbol Z$ is the set of (noisy) observations I've collected. I want to find the value $hat{x}_n$ that maximises the distribution $p(X_n = x_n|boldsymbol Z)$ implied by the above relations (i.e. the MAP estimate of the last value).



      There are two ways I can go about doing this.





      The first is with a straightforward application of Bayes' Theorem. Let $boldsymbol X_{{a, b, c, ...}}$ stand for $boldsymbol Xbackslash{X_a, X_b, X_c, ...}$ and let $boldsymbol Z_{{a, b, c, ...}}$ be defined similarly. Then:



      $$begin{aligned}
      &p(boldsymbol X|boldsymbol Z) propto p(boldsymbol Z, boldsymbol X) \
      &= p(boldsymbol Z|boldsymbol X) * p(boldsymbol X) \
      &=p(Z_n|X_n, boldsymbol Z_{{n}}, boldsymbol X_{{n}}) * p(boldsymbol Z_{{n}} | boldsymbol X) * p(X_n | X_{n-1}, boldsymbol X_{{n, n-1}}) * p(boldsymbol X_{{n}}) \
      &=p(Z_n|X_n) * p(X_n | X_{n-1}) * p(boldsymbol Z_{{n}} | boldsymbol X_{{n}}) * p(boldsymbol X_{{n}}) \
      &= left[prod_{i=2}^np(Z_i|X_i)*p(X_i|X_{i-1})right]*p(Z_1|X_1)*p(X_1) &(1)\
      &= left[prod_{i=2}^nmathcal N(Z_i|X_i, 1/delta_i^2)*mathcal N(X_i|X_{i-1}, 1/lambda^2)right]*mathcal N(Z_1|X_1, 1/delta_1^2)*mathcal N(X_1|mu, 1/sigma^2) &(2)
      end{aligned}$$



      After some manipulation of $(2)$, if we define $X_0=mu$, it's not hard to see that:



      $$begin{aligned}
      p(boldsymbol X|boldsymbol Z) propto prod_{i=1}^nmathcal N(X_i | alpha_i * Z_i + (1 - alpha_i) * X_{i - 1}, beta_i) &&(3)end{aligned}$$



      Where we've defined:



      $$begin{aligned}
      alpha_1 &= frac{sigma^2}{sigma^2 + delta_1^2} \
      forall k, 2leq kleq n:alpha_k&=frac{lambda^2}{lambda^2+delta_k^2} \
      beta_1 &= frac{1}{sigma^2} + frac{1}{delta_1^2} \
      forall k, 2leq kleq n:beta_k&=frac{1}{lambda^2} + frac{1}{delta_k^2}
      end{aligned}$$



      Although the precisions $beta_k$ are irrelevant for our purposes (I think?).



      So, the MAP estimates for the values of $boldsymbol X$ can be given recursively by:



      $$begin{aligned}
      hat x_1 &= alpha_1 * z_1 + (1 - alpha_1) * mu \
      forall k, 2leq kleq n:hat x_k &= alpha_k * z_k + (1 - alpha_k) * hat x_{k-1}
      end{aligned}$$





      The second way to solve this problem is to take $(2)$ above and turn this into an optimisation problem. In particular, since what we want is to maximise that quantity, we can instead call the negative logarithm an "error function" and minimise that.



      $$begin{aligned}
      &log(p(boldsymbol Z, boldsymbol X)) \
      &=log
      left[left[prod_{k=2}^nmathcal N(Z_k|X_k, 1/delta_k^2)*mathcal N(X_k|X_{k-1}, 1/lambda^2)right]*mathcal N(Z_1|X_1, 1/delta_1^2)*mathcal N(X_1|mu, 1/sigma^2)right] \
      &propto frac 1 {sigma^2} (x_1-mu)^2 + frac 1 {lambda^2} sum_{k=2}^n(x_k-x_{k-1})^2 +
      sum_{i=1}^nfrac 1 {delta_i^2}(x_i-z_i)^2 &(4)
      end{aligned}$$





      If the above two approaches can be used -- and it might turn out that I've made some assumption or mistake that means they can't -- then they should yield the same result.



      Now, let's analyse the case I'm interested in. Suppose that there is some number $delta$ and a sequence of numbers $boldsymbol I = {I_1, I_2, ..., I_n}$ such that:



      $$forall i: frac 1 {delta_i^2} = frac{I_i}{delta^2}$$



      That is, the precision of the measurement noise is proportional to these $I_i$ and the constant of proportionality is always the same.



      Now, I can rewrite my objective function in $(4)$ as:



      $$frac {delta^2} {sigma^2} (x_1-mu)^2 + frac {delta^2} {lambda^2} sum_{k=2}^n(x_k-x_{k-1})^2 +
      sum_{i=1}^nI_i(x_i-z_i)^2$$



      The reason this is not strictly the same as $(4)$ is that the $I_i$ can in fact be any nonnegative number, $forall i:I_igeq 0$, including zero. So these $I_i$ are effectively a "relative weight" of each measurement I make and some measurements can have weight zero (if I, for example, did not actually make that measurement).



      That way, I can in fact rewrite the alphas:



      $$begin{aligned}
      alpha_1 &= frac{I_1}{I_1 + delta^2 / sigma^2} \
      forall k, 2leq kleq n:alpha_k&=frac{I_k}{I_k+delta^2/lambda^2}
      end{aligned}$$



      Now let's take the following implementation of the BN:



      enter image description here



      The MAP estimates should be



      $$begin{aligned}
      {&alpha_1 * z_1 + (1 - alpha_1) * mu,\
      &alpha_1 * z_1 + (1 - alpha_1) * mu,\
      &alpha_3 * z_3 + (1 - alpha_3) * left(alpha_1 * z_1 + (1 - alpha_1 * mu)right)}
      end{aligned}$$



      However, if I try to optimise the following objective function:



      $$frac {delta^2} {sigma^2} (x_1-mu)^2 + frac {delta^2} {lambda^2} left[(x_2-x_1)^2 + (x_3-x_2)^2right] +
      I_1(x_1-z_1)^2 + I_3(x_3-z_3)^2$$



      I don't get those MAP estimates. Take, for example, the case where:



      $$begin{aligned}
      mu &= 1 \
      z_1 &= 0.8 \
      z_3 &= 1.2 \
      I_1 &= 400 \
      I_3 &= 500 \
      frac {delta^2}{sigma^2} &= 300 \
      frac {delta^2}{lambda^2} &= 4500
      end{aligned}$$



      This would imply:



      $$begin{aligned}
      alpha_1 &= frac 4 7 \
      alpha_3 &= frac 1 {10} \
      hat x_1 &= frac 4 7 * 0.8 + frac 3 7 * 1
      = 0.89 \
      hat x_3 &= frac 1 {10} * 1.2 + frac 9 {10} * 0.89
      = 0.92 \
      end{aligned}
      $$



      But that's not actually the solution. What am I missing? Where in this derivation did I get it wrong?










      share|cite|improve this question















      Suppose I have the following Bayesian Network:



      enter image description here



      It's given by the following relations:
      $$begin{aligned}X_1&sim mathcal N(mu, 1/sigma^2)\
      forall k, 2leq kleq n: X_k|X_{k-1}&sim mathcal N(x_{k-1}, 1/lambda^2)\
      forall i, 1leq ileq n: Z_i|X_i&simmathcal N(x_{i}, 1/delta_i^2)
      end{aligned}$$

      I'm using the precision rather than the variance for a reason that will become clear soon.



      So, the above is a very simplified Kalman filter, in which there is no control-input model and both the state-transition model and the observation model are the identity, and furthermore the precision of the process noise is the same at every step.



      Under that definition, $boldsymbol X$ is the hidden variable I want to reason about, and $boldsymbol Z$ is the set of (noisy) observations I've collected. I want to find the value $hat{x}_n$ that maximises the distribution $p(X_n = x_n|boldsymbol Z)$ implied by the above relations (i.e. the MAP estimate of the last value).



      There are two ways I can go about doing this.





      The first is with a straightforward application of Bayes' Theorem. Let $boldsymbol X_{{a, b, c, ...}}$ stand for $boldsymbol Xbackslash{X_a, X_b, X_c, ...}$ and let $boldsymbol Z_{{a, b, c, ...}}$ be defined similarly. Then:



      $$begin{aligned}
      &p(boldsymbol X|boldsymbol Z) propto p(boldsymbol Z, boldsymbol X) \
      &= p(boldsymbol Z|boldsymbol X) * p(boldsymbol X) \
      &=p(Z_n|X_n, boldsymbol Z_{{n}}, boldsymbol X_{{n}}) * p(boldsymbol Z_{{n}} | boldsymbol X) * p(X_n | X_{n-1}, boldsymbol X_{{n, n-1}}) * p(boldsymbol X_{{n}}) \
      &=p(Z_n|X_n) * p(X_n | X_{n-1}) * p(boldsymbol Z_{{n}} | boldsymbol X_{{n}}) * p(boldsymbol X_{{n}}) \
      &= left[prod_{i=2}^np(Z_i|X_i)*p(X_i|X_{i-1})right]*p(Z_1|X_1)*p(X_1) &(1)\
      &= left[prod_{i=2}^nmathcal N(Z_i|X_i, 1/delta_i^2)*mathcal N(X_i|X_{i-1}, 1/lambda^2)right]*mathcal N(Z_1|X_1, 1/delta_1^2)*mathcal N(X_1|mu, 1/sigma^2) &(2)
      end{aligned}$$



      After some manipulation of $(2)$, if we define $X_0=mu$, it's not hard to see that:



      $$begin{aligned}
      p(boldsymbol X|boldsymbol Z) propto prod_{i=1}^nmathcal N(X_i | alpha_i * Z_i + (1 - alpha_i) * X_{i - 1}, beta_i) &&(3)end{aligned}$$



      Where we've defined:



      $$begin{aligned}
      alpha_1 &= frac{sigma^2}{sigma^2 + delta_1^2} \
      forall k, 2leq kleq n:alpha_k&=frac{lambda^2}{lambda^2+delta_k^2} \
      beta_1 &= frac{1}{sigma^2} + frac{1}{delta_1^2} \
      forall k, 2leq kleq n:beta_k&=frac{1}{lambda^2} + frac{1}{delta_k^2}
      end{aligned}$$



      Although the precisions $beta_k$ are irrelevant for our purposes (I think?).



      So, the MAP estimates for the values of $boldsymbol X$ can be given recursively by:



      $$begin{aligned}
      hat x_1 &= alpha_1 * z_1 + (1 - alpha_1) * mu \
      forall k, 2leq kleq n:hat x_k &= alpha_k * z_k + (1 - alpha_k) * hat x_{k-1}
      end{aligned}$$





      The second way to solve this problem is to take $(2)$ above and turn this into an optimisation problem. In particular, since what we want is to maximise that quantity, we can instead call the negative logarithm an "error function" and minimise that.



      $$begin{aligned}
      &log(p(boldsymbol Z, boldsymbol X)) \
      &=log
      left[left[prod_{k=2}^nmathcal N(Z_k|X_k, 1/delta_k^2)*mathcal N(X_k|X_{k-1}, 1/lambda^2)right]*mathcal N(Z_1|X_1, 1/delta_1^2)*mathcal N(X_1|mu, 1/sigma^2)right] \
      &propto frac 1 {sigma^2} (x_1-mu)^2 + frac 1 {lambda^2} sum_{k=2}^n(x_k-x_{k-1})^2 +
      sum_{i=1}^nfrac 1 {delta_i^2}(x_i-z_i)^2 &(4)
      end{aligned}$$





      If the above two approaches can be used -- and it might turn out that I've made some assumption or mistake that means they can't -- then they should yield the same result.



      Now, let's analyse the case I'm interested in. Suppose that there is some number $delta$ and a sequence of numbers $boldsymbol I = {I_1, I_2, ..., I_n}$ such that:



      $$forall i: frac 1 {delta_i^2} = frac{I_i}{delta^2}$$



      That is, the precision of the measurement noise is proportional to these $I_i$ and the constant of proportionality is always the same.



      Now, I can rewrite my objective function in $(4)$ as:



      $$frac {delta^2} {sigma^2} (x_1-mu)^2 + frac {delta^2} {lambda^2} sum_{k=2}^n(x_k-x_{k-1})^2 +
      sum_{i=1}^nI_i(x_i-z_i)^2$$



      The reason this is not strictly the same as $(4)$ is that the $I_i$ can in fact be any nonnegative number, $forall i:I_igeq 0$, including zero. So these $I_i$ are effectively a "relative weight" of each measurement I make and some measurements can have weight zero (if I, for example, did not actually make that measurement).



      That way, I can in fact rewrite the alphas:



      $$begin{aligned}
      alpha_1 &= frac{I_1}{I_1 + delta^2 / sigma^2} \
      forall k, 2leq kleq n:alpha_k&=frac{I_k}{I_k+delta^2/lambda^2}
      end{aligned}$$



      Now let's take the following implementation of the BN:



      enter image description here



      The MAP estimates should be



      $$begin{aligned}
      {&alpha_1 * z_1 + (1 - alpha_1) * mu,\
      &alpha_1 * z_1 + (1 - alpha_1) * mu,\
      &alpha_3 * z_3 + (1 - alpha_3) * left(alpha_1 * z_1 + (1 - alpha_1 * mu)right)}
      end{aligned}$$



      However, if I try to optimise the following objective function:



      $$frac {delta^2} {sigma^2} (x_1-mu)^2 + frac {delta^2} {lambda^2} left[(x_2-x_1)^2 + (x_3-x_2)^2right] +
      I_1(x_1-z_1)^2 + I_3(x_3-z_3)^2$$



      I don't get those MAP estimates. Take, for example, the case where:



      $$begin{aligned}
      mu &= 1 \
      z_1 &= 0.8 \
      z_3 &= 1.2 \
      I_1 &= 400 \
      I_3 &= 500 \
      frac {delta^2}{sigma^2} &= 300 \
      frac {delta^2}{lambda^2} &= 4500
      end{aligned}$$



      This would imply:



      $$begin{aligned}
      alpha_1 &= frac 4 7 \
      alpha_3 &= frac 1 {10} \
      hat x_1 &= frac 4 7 * 0.8 + frac 3 7 * 1
      = 0.89 \
      hat x_3 &= frac 1 {10} * 1.2 + frac 9 {10} * 0.89
      = 0.92 \
      end{aligned}
      $$



      But that's not actually the solution. What am I missing? Where in this derivation did I get it wrong?







      probability-theory statistics bayes-theorem bayesian-network kalman-filter






      share|cite|improve this question















      share|cite|improve this question













      share|cite|improve this question




      share|cite|improve this question








      edited Dec 5 at 18:19

























      asked Nov 27 at 23:53









      Pedro Carvalho

      399115




      399115






















          1 Answer
          1






          active

          oldest

          votes

















          up vote
          0
          down vote



          accepted










          After much thinking and writing things down I figured it out.



          The recursive relations implied by $(3)$ are correct for the conditional MAP estimate. That is to say, the values of $hat x_i$ I wrote out in the "first approach" section of my original post are the values that maximise their respective $p(X_i|Z_i, X_{i-1})$, conditioning on the present observation and the previous value.



          The optimisation approach, however, is minimising the negative logarithm of $p(boldsymbol X|boldsymbol Z)$ -- the joint posterior over all variables $X_i$, conditional on the set of measurements $boldsymbol Z$ -- and there's no guarantee that the $i^{th}$ coordinate of the MAP vector $hat{boldsymbol X}$ will equal $hat x_i$ (and, as shown above, it probably almost always doesn't).



          This is because, when I'm optimising the joint posterior, I'm using future values to update the past -- in my example, the fact that $z_3=1.2$ was as relevant to $X_1$ as the fact that $z_1=0.8$ was to $X_3$, and so all values in $boldsymbol X$ are jointly optimised to maximise the posterior.





          ETA:



          And the actual thing that I want, the MAP/expected value (since they're the same here) of $p(X_i|boldsymbol Z_{jleq i})$, is:



          $$begin{aligned}
          bar x_0&equiv mu\
          forall i> 0:bar x_i &equiv mathbb E(X_i|boldsymbol Z_{jleq i}) \
          &= alpha_i * z_i + (1 - alpha_i) * bar x_{i- 1} \
          forall i > 0:alpha_i &=frac {I_i}{I_i + delta^2/ (sigma_{i-1}^2 + lambda_i^2)} \
          sigma_0 &=sigma \
          lambda_1 &= 0 \
          forall i > 0 : sigma_i^2 &= frac {delta^2}{I_i + delta^2/ (sigma_{i-1}^2 + lambda_i^2)} \
          forall i > 1 : lambda_i &= lambda
          end{aligned}$$



          And with the above definitions in mind:



          $$begin{aligned}
          X_i|boldsymbol Z_{jleq i} &sim mathcal N(bar x_i, 1 / sigma_i^2) \
          frac{1}{sigma_i^2} &= frac{1} {delta_i^2} + frac{1}{sigma_{i-1}^2+lambda_i^2}
          end{aligned}$$



          That is, the precision of the distribution of $X_i|boldsymbol Z_{jleq i}$ is the precision of $Z_i|X_i$ plus the inverse of the sum of the variances of $X_i|X_{i-1}$ and $X_{i-1}|boldsymbol Z_{j < i}$.






          share|cite|improve this answer























            Your Answer





            StackExchange.ifUsing("editor", function () {
            return StackExchange.using("mathjaxEditing", function () {
            StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
            StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
            });
            });
            }, "mathjax-editing");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "69"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            noCode: true, onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f3016500%2foptimisation-vs-bayes-theorem-not-coinciding%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes








            up vote
            0
            down vote



            accepted










            After much thinking and writing things down I figured it out.



            The recursive relations implied by $(3)$ are correct for the conditional MAP estimate. That is to say, the values of $hat x_i$ I wrote out in the "first approach" section of my original post are the values that maximise their respective $p(X_i|Z_i, X_{i-1})$, conditioning on the present observation and the previous value.



            The optimisation approach, however, is minimising the negative logarithm of $p(boldsymbol X|boldsymbol Z)$ -- the joint posterior over all variables $X_i$, conditional on the set of measurements $boldsymbol Z$ -- and there's no guarantee that the $i^{th}$ coordinate of the MAP vector $hat{boldsymbol X}$ will equal $hat x_i$ (and, as shown above, it probably almost always doesn't).



            This is because, when I'm optimising the joint posterior, I'm using future values to update the past -- in my example, the fact that $z_3=1.2$ was as relevant to $X_1$ as the fact that $z_1=0.8$ was to $X_3$, and so all values in $boldsymbol X$ are jointly optimised to maximise the posterior.





            ETA:



            And the actual thing that I want, the MAP/expected value (since they're the same here) of $p(X_i|boldsymbol Z_{jleq i})$, is:



            $$begin{aligned}
            bar x_0&equiv mu\
            forall i> 0:bar x_i &equiv mathbb E(X_i|boldsymbol Z_{jleq i}) \
            &= alpha_i * z_i + (1 - alpha_i) * bar x_{i- 1} \
            forall i > 0:alpha_i &=frac {I_i}{I_i + delta^2/ (sigma_{i-1}^2 + lambda_i^2)} \
            sigma_0 &=sigma \
            lambda_1 &= 0 \
            forall i > 0 : sigma_i^2 &= frac {delta^2}{I_i + delta^2/ (sigma_{i-1}^2 + lambda_i^2)} \
            forall i > 1 : lambda_i &= lambda
            end{aligned}$$



            And with the above definitions in mind:



            $$begin{aligned}
            X_i|boldsymbol Z_{jleq i} &sim mathcal N(bar x_i, 1 / sigma_i^2) \
            frac{1}{sigma_i^2} &= frac{1} {delta_i^2} + frac{1}{sigma_{i-1}^2+lambda_i^2}
            end{aligned}$$



            That is, the precision of the distribution of $X_i|boldsymbol Z_{jleq i}$ is the precision of $Z_i|X_i$ plus the inverse of the sum of the variances of $X_i|X_{i-1}$ and $X_{i-1}|boldsymbol Z_{j < i}$.






            share|cite|improve this answer



























              up vote
              0
              down vote



              accepted










              After much thinking and writing things down I figured it out.



              The recursive relations implied by $(3)$ are correct for the conditional MAP estimate. That is to say, the values of $hat x_i$ I wrote out in the "first approach" section of my original post are the values that maximise their respective $p(X_i|Z_i, X_{i-1})$, conditioning on the present observation and the previous value.



              The optimisation approach, however, is minimising the negative logarithm of $p(boldsymbol X|boldsymbol Z)$ -- the joint posterior over all variables $X_i$, conditional on the set of measurements $boldsymbol Z$ -- and there's no guarantee that the $i^{th}$ coordinate of the MAP vector $hat{boldsymbol X}$ will equal $hat x_i$ (and, as shown above, it probably almost always doesn't).



              This is because, when I'm optimising the joint posterior, I'm using future values to update the past -- in my example, the fact that $z_3=1.2$ was as relevant to $X_1$ as the fact that $z_1=0.8$ was to $X_3$, and so all values in $boldsymbol X$ are jointly optimised to maximise the posterior.





              ETA:



              And the actual thing that I want, the MAP/expected value (since they're the same here) of $p(X_i|boldsymbol Z_{jleq i})$, is:



              $$begin{aligned}
              bar x_0&equiv mu\
              forall i> 0:bar x_i &equiv mathbb E(X_i|boldsymbol Z_{jleq i}) \
              &= alpha_i * z_i + (1 - alpha_i) * bar x_{i- 1} \
              forall i > 0:alpha_i &=frac {I_i}{I_i + delta^2/ (sigma_{i-1}^2 + lambda_i^2)} \
              sigma_0 &=sigma \
              lambda_1 &= 0 \
              forall i > 0 : sigma_i^2 &= frac {delta^2}{I_i + delta^2/ (sigma_{i-1}^2 + lambda_i^2)} \
              forall i > 1 : lambda_i &= lambda
              end{aligned}$$



              And with the above definitions in mind:



              $$begin{aligned}
              X_i|boldsymbol Z_{jleq i} &sim mathcal N(bar x_i, 1 / sigma_i^2) \
              frac{1}{sigma_i^2} &= frac{1} {delta_i^2} + frac{1}{sigma_{i-1}^2+lambda_i^2}
              end{aligned}$$



              That is, the precision of the distribution of $X_i|boldsymbol Z_{jleq i}$ is the precision of $Z_i|X_i$ plus the inverse of the sum of the variances of $X_i|X_{i-1}$ and $X_{i-1}|boldsymbol Z_{j < i}$.






              share|cite|improve this answer

























                up vote
                0
                down vote



                accepted







                up vote
                0
                down vote



                accepted






                After much thinking and writing things down I figured it out.



                The recursive relations implied by $(3)$ are correct for the conditional MAP estimate. That is to say, the values of $hat x_i$ I wrote out in the "first approach" section of my original post are the values that maximise their respective $p(X_i|Z_i, X_{i-1})$, conditioning on the present observation and the previous value.



                The optimisation approach, however, is minimising the negative logarithm of $p(boldsymbol X|boldsymbol Z)$ -- the joint posterior over all variables $X_i$, conditional on the set of measurements $boldsymbol Z$ -- and there's no guarantee that the $i^{th}$ coordinate of the MAP vector $hat{boldsymbol X}$ will equal $hat x_i$ (and, as shown above, it probably almost always doesn't).



                This is because, when I'm optimising the joint posterior, I'm using future values to update the past -- in my example, the fact that $z_3=1.2$ was as relevant to $X_1$ as the fact that $z_1=0.8$ was to $X_3$, and so all values in $boldsymbol X$ are jointly optimised to maximise the posterior.





                ETA:



                And the actual thing that I want, the MAP/expected value (since they're the same here) of $p(X_i|boldsymbol Z_{jleq i})$, is:



                $$begin{aligned}
                bar x_0&equiv mu\
                forall i> 0:bar x_i &equiv mathbb E(X_i|boldsymbol Z_{jleq i}) \
                &= alpha_i * z_i + (1 - alpha_i) * bar x_{i- 1} \
                forall i > 0:alpha_i &=frac {I_i}{I_i + delta^2/ (sigma_{i-1}^2 + lambda_i^2)} \
                sigma_0 &=sigma \
                lambda_1 &= 0 \
                forall i > 0 : sigma_i^2 &= frac {delta^2}{I_i + delta^2/ (sigma_{i-1}^2 + lambda_i^2)} \
                forall i > 1 : lambda_i &= lambda
                end{aligned}$$



                And with the above definitions in mind:



                $$begin{aligned}
                X_i|boldsymbol Z_{jleq i} &sim mathcal N(bar x_i, 1 / sigma_i^2) \
                frac{1}{sigma_i^2} &= frac{1} {delta_i^2} + frac{1}{sigma_{i-1}^2+lambda_i^2}
                end{aligned}$$



                That is, the precision of the distribution of $X_i|boldsymbol Z_{jleq i}$ is the precision of $Z_i|X_i$ plus the inverse of the sum of the variances of $X_i|X_{i-1}$ and $X_{i-1}|boldsymbol Z_{j < i}$.






                share|cite|improve this answer














                After much thinking and writing things down I figured it out.



                The recursive relations implied by $(3)$ are correct for the conditional MAP estimate. That is to say, the values of $hat x_i$ I wrote out in the "first approach" section of my original post are the values that maximise their respective $p(X_i|Z_i, X_{i-1})$, conditioning on the present observation and the previous value.



                The optimisation approach, however, is minimising the negative logarithm of $p(boldsymbol X|boldsymbol Z)$ -- the joint posterior over all variables $X_i$, conditional on the set of measurements $boldsymbol Z$ -- and there's no guarantee that the $i^{th}$ coordinate of the MAP vector $hat{boldsymbol X}$ will equal $hat x_i$ (and, as shown above, it probably almost always doesn't).



                This is because, when I'm optimising the joint posterior, I'm using future values to update the past -- in my example, the fact that $z_3=1.2$ was as relevant to $X_1$ as the fact that $z_1=0.8$ was to $X_3$, and so all values in $boldsymbol X$ are jointly optimised to maximise the posterior.





                ETA:



                And the actual thing that I want, the MAP/expected value (since they're the same here) of $p(X_i|boldsymbol Z_{jleq i})$, is:



                $$begin{aligned}
                bar x_0&equiv mu\
                forall i> 0:bar x_i &equiv mathbb E(X_i|boldsymbol Z_{jleq i}) \
                &= alpha_i * z_i + (1 - alpha_i) * bar x_{i- 1} \
                forall i > 0:alpha_i &=frac {I_i}{I_i + delta^2/ (sigma_{i-1}^2 + lambda_i^2)} \
                sigma_0 &=sigma \
                lambda_1 &= 0 \
                forall i > 0 : sigma_i^2 &= frac {delta^2}{I_i + delta^2/ (sigma_{i-1}^2 + lambda_i^2)} \
                forall i > 1 : lambda_i &= lambda
                end{aligned}$$



                And with the above definitions in mind:



                $$begin{aligned}
                X_i|boldsymbol Z_{jleq i} &sim mathcal N(bar x_i, 1 / sigma_i^2) \
                frac{1}{sigma_i^2} &= frac{1} {delta_i^2} + frac{1}{sigma_{i-1}^2+lambda_i^2}
                end{aligned}$$



                That is, the precision of the distribution of $X_i|boldsymbol Z_{jleq i}$ is the precision of $Z_i|X_i$ plus the inverse of the sum of the variances of $X_i|X_{i-1}$ and $X_{i-1}|boldsymbol Z_{j < i}$.







                share|cite|improve this answer














                share|cite|improve this answer



                share|cite|improve this answer








                edited Dec 11 at 20:56

























                answered Nov 28 at 18:52









                Pedro Carvalho

                399115




                399115






























                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Mathematics Stack Exchange!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    Use MathJax to format equations. MathJax reference.


                    To learn more, see our tips on writing great answers.





                    Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


                    Please pay close attention to the following guidance:


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f3016500%2foptimisation-vs-bayes-theorem-not-coinciding%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Sphinx de Gizeh

                    Dijon

                    Determine an Integral..