Confusing Sampling from observed data
Suppose we are given a small set of data on bundles of electrical wires through which increasing voltages are run, and we note how many of the individual wires fail.
For example, suppose we have 6 observations; for each observation $i$
there are $w_{i}$ wires, a voltage $v_{i}$, and $f_{i}$ of the wires fail.
Suppose we are given the following information (note that each successive sample is at a higher voltage, and we see an increased proportion of failed wires):
$w_{1}=14$ and $f_{1}=4$
$w_{2}=13$ and $f_{2}=4$
$w_{3}=7$ and $f_{3}=3$
$w_{4}=10$ and $f_{4}=5$
$w_{5}=12$ and $f_{5}=7$
$w_{6}=20$ and $f_{6}=13$
That is, we have a parameter space (where $t_{i}$ is the proportion that fail at voltage $v_i$) $\{t_{i}: t_{1} < t_{2} < t_{3} < \dots < t_{6} \le 1\}$, and we assume a flat prior over it.
My goal is to model this as a conditional distribution and sample from it, so that I can make statements about each $t_{i}$, such as its mean and deviation (assuming a flat prior), e.g. from histograms of the samples.
I know about sampling, but I am wondering how, from just this simple data, I can accurately form the conditional distribution, using rejection or transformation methods for example, and then Gibbs sampling to draw conclusions about the individual failure proportions.
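For concreteness, this is roughly the kind of Gibbs sampler I have in mind (my own sketch, assuming $f_i \sim \text{Binomial}(w_i, t_i)$ with a flat prior restricted to the ordered region, so that each full conditional is a Beta distribution truncated by its neighbours; the function name is just illustrative):

```python
# Sketch of a Gibbs sampler for the ordered proportions t_1 < ... < t_6 under a flat prior.
# Assumption: f_i ~ Binomial(w_i, t_i), so each full conditional is a
# Beta(f_i + 1, w_i - f_i + 1) truncated to (t_{i-1}, t_{i+1}); drawn here by inverse CDF.
import numpy as np
from scipy.stats import beta

w = np.array([14, 13, 7, 10, 12, 20])
f = np.array([4, 4, 3, 5, 7, 13])

def gibbs_ordered_proportions(w, f, n_iter=20000, seed=0):
    rng = np.random.default_rng(seed)
    k = len(w)
    t = np.sort(rng.uniform(size=k))            # any ordered starting point works
    samples = np.empty((n_iter, k))
    for s in range(n_iter):
        for i in range(k):
            lo = t[i - 1] if i > 0 else 0.0     # ordering constraints from the neighbours
            hi = t[i + 1] if i < k - 1 else 1.0
            a, b = f[i] + 1, w[i] - f[i] + 1
            u = rng.uniform(beta.cdf(lo, a, b), beta.cdf(hi, a, b))
            t[i] = beta.ppf(u, a, b)            # inverse-CDF draw from the truncated Beta
        samples[s] = t
    return samples

draws = gibbs_ordered_proportions(w, f)[5000:]  # drop burn-in
print(draws.mean(axis=0), draws.std(axis=0))    # posterior means and deviations of t_1..t_6
```

Histograms of each column of `draws` would then give the per-$t_i$ summaries I am after.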
My thoughts:
Well, it seems that the number of wires that fail is a function of the voltage. As voltage increases, so too does the proportion of failed wires.
Possibly I could use the rejection method to sample from the distribution that is generating this?
So I would want to find some function $g(x)$ such that $g(x) \ge f(x)$ for all $x$, then simulate uniform random variables and check the conditions.
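As a toy version of that rejection step (my own sketch, using the Beta-like kernel from the third observation as a stand-in target on $[0,1]$, with a flat envelope):

```python
# Toy rejection sampler: accept x ~ U(0,1) with probability h(x) / bound, where bound >= max h.
import numpy as np

def h(x):
    # unnormalised stand-in target, e.g. the kernel x^3 (1-x)^4 suggested by w=7, f=3
    return x**3 * (1 - x)**4

def rejection_sample(h, n, bound, seed=0):
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n:
        x = rng.uniform()
        if rng.uniform() * bound <= h(x):       # accept/reject condition
            out.append(x)
    return np.array(out)

bound = 1.05 * max(h(x) for x in np.linspace(0, 1, 1001))   # crude numerical bound on h
samples = rejection_sample(h, 10000, bound)
print(samples.mean())   # should be close to 4/9, the mean of a Beta(4, 5)
```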
However, as of now I don't have a distribution. I guess I could form a hand-drawn one using the points, with the x-axis as $1,2,3,4,5,6$ and the y-axis as the corresponding proportion of failures.
I know for a distribution, we need the probabilities to sum/integrate to 1.
The probabilities here, I assume, would be the probabilities that a certain proportion fail. So for $n$ wires, we would have the probability that a proportion $p_{1}=\frac{1}{n}$ fail, a probability that a proportion $p_{2}=\frac{2}{n}$ fail, and so on, all the way up to the probability that all wires fail.
So it looks more like the form of a CDF as voltage increases, i.e. if we write it in the form of a function, $F(v_{1})=\frac{4}{14}$, $F(v_{2})=\frac{4}{13}$, and so forth; if we had an unlimited sample, then as $n \to \infty$, $F(v_{n}) \to 1$,
and I suppose the density would then be $f = F'$ (the derivative of $F$), but I am still not sure how to do this in the finite case.
Issues: we are not told anything about the underlying distribution, its parameters or form; only the data are given. So do we take the given data to be the initialising values?
I was thinking I could possibly just assume that the failures follow a binomial distribution, with the binomial parameter following some other distribution such as a beta. How does that sound? Would we then also need to put some distribution on the $w_{i}$? I would be okay trying it without that distribution, but I want to understand how I can have the failure probability increase with voltage.
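For instance, a minimal sketch of that beta-binomial idea (treating the observations as independent for the moment, with a flat Beta(1,1) prior on each failure probability, and ignoring the ordering):

```python
# Under a flat Beta(1, 1) prior and f_i ~ Binomial(w_i, t_i), each posterior is
# t_i | data ~ Beta(f_i + 1, w_i - f_i + 1).
from scipy.stats import beta

w = [14, 13, 7, 10, 12, 20]
f = [4, 4, 3, 5, 7, 13]

for i, (wi, fi) in enumerate(zip(w, f), start=1):
    post = beta(fi + 1, wi - fi + 1)            # conjugate posterior under a flat prior
    print(f"t_{i}: mean={post.mean():.3f}  sd={post.std():.3f}")
```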
Any advice, ideas and answers are much appreciated.
probability statistics bayesian conditional-probability sampling
Comments:
This is probably better asked on cross-validated since it's pretty technical statistics. stats.stackexchange.com – Ethan Bolker, Nov 30 at 0:28
I guess some sort of logistic regression (or other generalized linear model) could help as you are modeling a probability as a function of other independent variables. After you estimate the values, simulation is easy as you said they are conditional binomial. – BGM, Nov 30 at 3:33
Is there a problem with the simple binomial model? – Mike Hawk, Dec 4 at 15:02
Can you show the voltages? – Yuri Negometyanov, Dec 4 at 19:23
voltages are unknown – Learning, Dec 4 at 19:55
2 Answers
$\textbf{Edit of 06.12.2018}$
Let us consider the third observation ($w_3=7,\ f_3=3$).
The binomial distribution can be presented as a table of values
$$P(w,f,p)=\binom wf p^f(1-p)^{w-f},\quad f=0,1,\dots,w,\tag1$$
$$\begin{vmatrix}
f & P(w_3,f,p) & P_i\left(w_3,f,\dfrac{f_3}{w_3}\right) & P_F(w_3,f)\\
0 & (1-p)^7 & 0.0198945 & 0.0512821\\
1 & 7p(1-p)^6 & 0.104446 & 0.130536\\
2 & 21p^2(1-p)^5 & 0.235004 & 0.195804\\
3 & 35p^3(1-p)^4 & 0.293755 & 0.217560\\
4 & 35p^4(1-p)^3 & 0.220316 & 0.190365\\
5 & 21p^5(1-p)^2 & 0.0991424 & 0.130536\\
6 & 7p^6(1-p) & 0.0247856 & 0.0652681\\
7 & p^7 & 0.0026556 & 0.018648
\end{vmatrix}\tag2$$
where $p$ is the unknown probability of failure in a single test.
There are two main ways to obtain $p(w_3,f_3).$
The first way, MLM (the maximum likelihood method), is to estimate $p$ as the observed frequency
$$p(w_3,f_3) = \dfrac{f_3}{w_3},\tag4$$
(see also the Wolfram Alpha plot of the distribution).
The second way is the fiducial (Fisher) approach, in which $p$ is treated as a random variable whose density is
$$f_F(w_i,f_i,p) = C\,P(w_i,f_i,p) = C\binom {w_i}{f_i} p^{f_i}(1-p)^{w_i-f_i},\tag5$$
where the constant $C$ should be found from the condition
$$\int\limits_0^1 f_F(w_i,f_i,p)\,\mathrm dp = 1.$$
For $i=3$,
$$f_F(w_3,f_3,p) = C_3P(w_3,f_3,p)= C_3\cdot35p^3(1-p)^4,\tag6$$
$$C_3=\dfrac1{\int\limits_0^1 P(w_3,f_3,p)\,\mathrm dp} = \dfrac1{\int\limits_0^1 35p^3(1-p)^4\,\mathrm dp}=8\tag7$$
(see also Wolfram Alpha).
Therefore,
$$f_F(w_3,f_3,p) = 8\binom 73p^3(1-p)^4 = 280p^3(1-p)^4,\tag8$$
and the distribution $(1)$ changes to
$$P_F(w_3,f)= \int\limits_0^1 \binom{w_3}f p^f(1-p)^{w_3-f} f_F(w_3,f_3,p)\,\mathrm dp,\quad f=0,1,\dots,7\tag9$$
(see also Wolfram Alpha plot of distribution)
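A short numerical check of the two columns in table $(2)$ (an illustrative sketch added here, not part of the original computation): the MLM column is the plug-in binomial with $p=f_3/w_3$, while the fiducial predictive $(9)$ coincides with a beta-binomial distribution with parameters $(f_3+1,\,w_3-f_3+1)$.

```python
# Reproduce the P_i and P_F columns of table (2) for w_3 = 7, f_3 = 3.
from scipy.stats import binom, betabinom

w3, f3 = 7, 3
plug_in = binom(w3, f3 / w3)                    # MLM column: Binomial(7, 3/7)
fiducial = betabinom(w3, f3 + 1, w3 - f3 + 1)   # (1) integrated against 280 p^3 (1-p)^4

for k in range(w3 + 1):
    print(k, round(plug_in.pmf(k), 7), round(fiducial.pmf(k), 7))
```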
This approach looks more rigorous, because it takes the parameter $w_i$ into account.
The expectation $E(f)$ can be calculated as
$$E(f) = \sum_{f=0}^w fP(f),$$
and the variance $V(f)$ as
$$V(f) = \sum_{f=0}^w (f-E(f))^2 P(f).$$
The obtained information about the parameter $p$ allows us to obtain the distribution law for any $w.$ For $w=20$, plots of the calculated distributions were given for both approaches (images omitted).
This allows comparing the probability distributions across observations with inhomogeneous statistics.
$$\begin{vmatrix}
i & w_i & f_i & F_i & f_{Fi} & E\left(20,\frac{f_i}{w_i}\right) & V\left(20,\frac {f_i}{w_i}\right) & E_F(20,p) & V_F(20,p) \\
1 & 14 & 4 & \dfrac27 & 15015p^4(1-p)^{10} & \dfrac{40}7 & \dfrac{200}{49} & \dfrac{25}4 & \dfrac{2475}{272}\\
2 & 13 & 4 & \dfrac4{13} & 10010p^4(1-p)^{9} & \dfrac{80}{13} & \dfrac{720}{169} & \dfrac{20}3 & \dfrac{175}{18}\\
3 & 7 & 3 & \dfrac37 & 280p^3(1-p)^{4} & \dfrac{60}7 & \dfrac{240}{49} & \dfrac{80}9 & \dfrac{1160}{81}\\
4 & 10 & 5 & \dfrac12 & 2772p^5(1-p)^{5} & 10 & 5 & 10 & \dfrac{160}{13}\\
5 & 12 & 7 & \dfrac7{12} & 10296p^7(1-p)^{5} & \dfrac{35}3 & \dfrac{175}{36} & \dfrac{80}7 & \dfrac{544}{49}\\
6 & 20 & 13 & \dfrac{13}{20} & 1627920p^{13}(1-p)^{7} & 13 & \dfrac{91}{20} & \dfrac{140}{11} & \dfrac{23520}{2783}\\
\end{vmatrix}\tag{10}$$
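For completeness, a small added sketch that reproduces the $w=20$ moments in table $(10)$ with scipy (the fiducial predictive being a beta-binomial with parameters $(f_i+1,\,w_i-f_i+1)$):

```python
# Plug-in binomial versus beta-binomial (fiducial) moments for n = 20, per table (10).
from scipy.stats import binom, betabinom

data = [(14, 4), (13, 4), (7, 3), (10, 5), (12, 7), (20, 13)]
n = 20
for i, (wi, fi) in enumerate(data, start=1):
    mlm = binom(n, fi / wi)
    fid = betabinom(n, fi + 1, wi - fi + 1)
    print(f"i={i}: E={mlm.mean():.3f} V={mlm.var():.3f} "
          f"E_F={fid.mean():.3f} V_F={fid.var():.3f}")
```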
For each of the six observations, MLM and fiducial plots of the resulting distribution for $w=20$ were shown (images omitted).
Analysis of the graphs shows that as the volume of statistics increases, the results of the two methods converge.
– Yuri Negometyanov (answered Dec 5 at 20:42, edited Dec 8 at 22:21)
A sketch will help to summarize the terms of the problem (image omitted).
We have a production of wires in which the insulation
resistance is spread over a range of voltages with a certain PDF and corresponding CDF.
We set a voltage $V_k$ in the range, take a relatively small sample of wires
of size $w_k$ (variable for each test), and record the number of wires that fail, $f_k$.
The $w_k$ wires will have a distribution of breaking voltages which ideally follows
the population CDF; that is, dividing the vertical range of probability into
$w_k$ equal intervals, we would expect to find one wire in each (placed at its center).
That is to say, the elements projected onto the vertical scale will follow a uniform
probability density on the $[0,1]$ interval.
Then we assign to $V_k$ a value $P'_k$ of the CDF, corresponding to the interval limit
between failed and not-failed, as indicated in the sketch ($0.4$ in the example shown).
Now, with respect to the underlying population distribution, corresponding to a huge sample,
a small sample will introduce two kinds of error:
- a "discretization" error, because of the gap interval between failed / survived;
- a "sampling" error, because the sample will deviate from an exact uniform distribution.
We can combine the two by asking ourselves:
given $w_k$ elements from a uniform distribution on $[0,1]$, with $f_k$ that failed the test, what is the probability that
one of the failed elements lies at the threshold $0 \le P'_k \le 1$, the remaining $f_k-1$ lie below it, and $w_k-f_k$ lie above it?
That is clearly expressible as
$$
p(P'_k)\,dP'_k = w_k \,dP'_k \binom{w_k - 1}{f_k - 1} {P'_k}^{\,f_k - 1} \left( 1 - P'_k \right)^{w_k - f_k}.
$$
It is easy to check, through the expression of the Beta function,
that the integral of the above correctly gives $1$.
In fact $p(P'_k)$ is a Beta distribution PDF,
$$
p(P'_k) = \mathrm{Beta}\left( f_k ,\, w_k - f_k + 1 \right),
$$
because
$$
w\binom{w-1}{f-1}
= w\,\frac{\Gamma(w)}{\Gamma(f)\,\Gamma(w-f+1)}
= \frac{\Gamma(w+1)}{\Gamma(f)\,\Gamma(w+1-f)} = \frac{1}{\mathrm B\left( f,\,w-f+1 \right)}.
$$
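A quick numerical check of the above (an added sketch, using the third observation $w_k=7$, $f_k=3$):

```python
# Check that p(P'_k) integrates to 1 and equals the Beta(f_k, w_k - f_k + 1) pdf.
import numpy as np
from scipy.integrate import quad
from scipy.special import comb
from scipy.stats import beta

wk, fk = 7, 3
dens = lambda x: wk * comb(wk - 1, fk - 1) * x**(fk - 1) * (1 - x)**(wk - fk)

print(quad(dens, 0, 1)[0])                                   # ~ 1.0
x = np.linspace(0.01, 0.99, 5)
print(np.allclose(dens(x), beta.pdf(x, fk, wk - fk + 1)))    # True
```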
Note that the cited reference states that
"The beta distribution is a suitable model for the random behavior of percentages and proportions."
In the construction above, we have set the threshold $P'_k$ to coincide with the failed element of highest resistance.
Actually there is a gap between this and the first good item (the survivor with the lowest resistance), so the threshold could be
moved up to that point. That is equivalent to choosing a $\mathrm{Beta}\left( f_k+1 ,\, w_k - f_k \right)$.
So, if there is no need for more sophistication, we can take the threshold to be at half of the gap, thus taking
$$
p(P'_k) = \mathrm{Beta}\left( f_k + 1/2 ,\, w_k - f_k + 1/2 \right),
$$
which gives a mean and variance of
$$
E\left( P'_k \right) = \frac{f_k + 1/2}{w_k + 1}, \quad \mathrm{var}\left( P'_k \right)
= \frac{\left( f_k + 1/2 \right)\left( w_k - f_k + 1/2 \right)}{\left( w_k + 1 \right)^2 \left( w_k + 2 \right)}.
$$
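Applied to the six observations of the question, a short sketch (added illustration) of these means and variances:

```python
# E[P'_k] and var(P'_k) from the Beta(f_k + 1/2, w_k - f_k + 1/2) above, for each observation.
from scipy.stats import beta

data = [(14, 4), (13, 4), (7, 3), (10, 5), (12, 7), (20, 13)]
for k, (wk, fk) in enumerate(data, start=1):
    d = beta(fk + 0.5, wk - fk + 0.5)
    print(f"k={k}: E[P'_k]={d.mean():.3f}  var={d.var():.5f}")
```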
This mean is the value to assign to $P'_k$, with an "error" around it that follows the Beta distribution above.
After that you can perform a regression on the resulting plot of $(V_k, P'_k)$, or a distribution fit,
to estimate the underlying population CDF.
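A rough sketch of that last regression step (an added illustration, not part of the original answer): since the actual voltages are not given in the question, the observation index $k$ is used below as a stand-in covariate, much as the asker suggested; with real voltages $V_k$ one would simply substitute them for `x`.

```python
# Fit a straight line to logit(E[P'_k]) against a stand-in covariate (the observation index),
# i.e. a crude logistic-shaped estimate of the population CDF along that axis.
import numpy as np
from scipy.special import logit, expit

data = [(14, 4), (13, 4), (7, 3), (10, 5), (12, 7), (20, 13)]
p = np.array([(fk + 0.5) / (wk + 1) for wk, fk in data])   # E[P'_k] from the Beta above
x = np.arange(1, len(p) + 1)                               # stand-in for the unknown V_k

b, a = np.polyfit(x, logit(p), 1)                          # logit(P') ~ a + b * x
print(expit(a + b * x))                                    # fitted CDF values at the six points
```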
– G Cab (answered Dec 6 at 22:22, edited Dec 7 at 17:42)